Generate PCA Biplot for NULISAseq Data with PCAtools

Performs principal component analysis (PCA) on NULISAseq target expression data and generates a biplot for visualizing relationships among samples. PCAtools performs the analysis and ggplot2 renders the plot. Colors are generated automatically from RColorBrewer palettes unless sample_colors is supplied.

generate_pca(
  data,
  sampleInfo,
  sampleName_var,
  annotate_sample_by = NULL,
  label_points = FALSE,
  sample_subset = NULL,
  target_subset = NULL,
  shape_by = NULL,
  ellipse = TRUE,
  ellipseType = "t",
  ellipseAlpha = 0.15,
  ellipseFill = TRUE,
  ellipseLineSize = 0,
  sample_colors = NULL,
  components = c(1, 2),
  output_dir = NULL,
  plot_name = NULL,
  plot_title = NULL,
  plot_width = 10,
  plot_height = 8,
  ...
)

Arguments

data: A matrix with targets in rows and samples in columns. Row names should be the target names and column names should be the sample names. The values are assumed to be NPQ, i.e. already transformed as log2(x + 1), where x is the NULISAseq normalized count.
sampleInfo: A data frame with sample metadata. Rows are samples, columns are sample metadata variables.
sampleName_var: Character string specifying the name of the column in sampleInfo that matches the column names of data.
annotate_sample_by: Character string specifying the column name from sampleInfo to use for coloring points. Only one variable is allowed; defaults to NULL.
label_points: Logical indicating whether to add sample labels to the plot; defaults to FALSE.
sample_subset: Vector of sample names to include in the PCA. Names should match the column names of data; defaults to NULL (all samples).
target_subset: Vector of target names to include in the PCA. Names should match the row names of data; defaults to NULL (all targets).
shape_by: Character string specifying the column name from sampleInfo to use for point shapes; defaults to NULL.
ellipse: Logical indicating whether to draw ellipses around groups; defaults to TRUE.
ellipseType: Character string specifying the type of ellipse. Options include "t" for t-distribution and "norm" for normal distribution; defaults to "t".
ellipseAlpha: Numeric value between 0 and 1 for ellipse transparency; defaults to 0.15.
ellipseFill: Logical indicating whether to fill the ellipses; defaults to TRUE.
ellipseLineSize: Numeric value for the ellipse border line width; defaults to 0 (no border).
sample_colors: Named vector of custom colors for sample groups. Names should match the levels in annotate_sample_by; defaults to NULL.
components: Integer vector of length 2 specifying which principal components to plot. For example, c(1, 2) plots PC1 vs PC2, c(2, 3) plots PC2 vs PC3; defaults to c(1, 2).
output_dir: Character string specifying the directory in which to save the plot; defaults to NULL, in which case the plot is not saved. If provided without plot_name, a default filename with a timestamp is generated.
plot_name: Character string specifying the filename for the saved plot, including the file extension (.pdf, .png, .jpg, or .svg); defaults to NULL. If NULL and output_dir is provided, a default filename with a timestamp is used.
plot_title: Character string for the title of the PCA plot; defaults to NULL.
plot_width: Numeric value for the width of the saved plot in inches; defaults to 10.
plot_height: Numeric value for the height of the saved plot in inches; defaults to 8.
...: Additional arguments passed to PCAtools::biplot().
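
Before calling generate_pca, it can help to confirm that the inputs line up the way the arguments above describe. The following is a minimal sketch in base R only. The toy objects counts, Data_NPQ, and sample_metadata, the target and sample names, and the checks themselves are illustrative assumptions and are not part of the package.

```r
# Toy normalized counts: 4 targets (rows) x 6 samples (columns); values are made up.
set.seed(1)
counts <- matrix(
  rpois(24, lambda = 50),
  nrow = 4,
  dimnames = list(paste0("Target", 1:4), paste0("S", 1:6))
)

# Apply the log2(x + 1) transform that the data argument assumes
Data_NPQ <- log2(counts + 1)

# Toy sample metadata: one row per sample, one column per variable
sample_metadata <- data.frame(
  SampleName = paste0("S", 1:6),
  Group = rep(c("Control", "Treatment"), each = 3),
  stringsAsFactors = FALSE
)

sampleName_var <- "SampleName"

# Samples present in data but absent from sampleInfo, and the reverse
missing_in_info <- setdiff(colnames(Data_NPQ), sample_metadata[[sampleName_var]])
missing_in_data <- setdiff(sample_metadata[[sampleName_var]], colnames(Data_NPQ))

# TRUE when every sample name appears on both sides
inputs_ok <- length(missing_in_info) == 0 && length(missing_in_data) == 0
```

If inputs_ok is FALSE, the two setdiff results show which sample names need to be reconciled before running the PCA.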

Value

A list containing:

targets_used: Character vector of target names used in the PCA after filtering.
pca_results: The PCAtools PCA object containing all PCA results.
rotated: Data frame containing the PC scores (PC1, PC2, etc.) for each sample.
plot: The ggplot2 object of the PCA biplot.
output_path: Character string of the full path to the saved file, or NULL if not saved.
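
As an illustration of how output_path relates to output_dir and plot_name, here is a rough sketch of the path logic. The helper build_output_path and the default filename pattern (prefix, timestamp format, and .pdf extension) are assumptions made for this example; the actual default filename used by generate_pca may differ.

```r
# Hypothetical helper, not part of the package.
build_output_path <- function(output_dir = NULL, plot_name = NULL) {
  if (is.null(output_dir)) {
    # No directory given: nothing is saved, so there is no path to report
    return(NULL)
  }
  if (is.null(plot_name)) {
    # Assumed fallback pattern: fixed prefix + timestamp + .pdf
    stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
    plot_name <- paste0("PCA_plot_", stamp, ".pdf")
  }
  file.path(output_dir, plot_name)
}

no_save    <- build_output_path()
named_path <- build_output_path("output/figures", "pca_analysis.pdf")
auto_path  <- build_output_path("output/figures")
```

Here no_save is NULL, named_path combines the directory with the supplied filename, and auto_path ends in a timestamped filename.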

Details

The function performs the following steps:

1. Filters data to the specified samples and targets
2. Removes targets with all zero values
3. Scales data by row (Z-score transformation)
4. Removes rows with NA, NaN, or Inf values after scaling
5. Performs PCA using PCAtools
6. Generates the biplot with the specified aesthetics
7. Optionally saves the plot to file
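
The filtering and scaling steps above can be sketched in base R as follows. This is an approximation under stated assumptions, not the function's actual source: the toy matrix mat, the subset values, and the metadata are made up, and the PCAtools::pca call is guarded so the sketch still runs when PCAtools is not installed.

```r
# Toy data: 5 targets x 8 samples; values are made up.
set.seed(42)
mat <- matrix(
  rnorm(40, mean = 10, sd = 2),
  nrow = 5,
  dimnames = list(paste0("Target", 1:5), paste0("S", 1:8))
)
mat["Target2", ] <- 0  # all-zero target, removed by the zero filter
mat["Target4", ] <- 7  # constant target: zero variance yields NaN after scaling

sample_subset <- paste0("S", 1:6)
target_subset <- NULL  # NULL keeps all targets

# Filter to the requested samples and targets
if (!is.null(sample_subset)) {
  mat <- mat[, colnames(mat) %in% sample_subset, drop = FALSE]
}
if (!is.null(target_subset)) {
  mat <- mat[rownames(mat) %in% target_subset, , drop = FALSE]
}

# Remove targets that are zero in every sample
mat <- mat[rowSums(mat != 0) > 0, , drop = FALSE]

# Row-wise Z-score: scale() operates on columns, so transpose around it
scaled <- t(scale(t(mat)))

# Drop rows containing NA, NaN, or Inf after scaling
keep <- apply(scaled, 1, function(x) all(is.finite(x)))
scaled <- scaled[keep, , drop = FALSE]

targets_used <- rownames(scaled)

# PCA with PCAtools, only if the package is available
if (requireNamespace("PCAtools", quietly = TRUE)) {
  # PCAtools expects metadata row names to match the matrix column names
  metadata <- data.frame(
    Group = rep(c("Control", "Treatment"), each = 3),
    row.names = colnames(scaled)
  )
  pca_results <- PCAtools::pca(scaled, metadata = metadata)
}
```

In this toy case, Target2 is dropped by the zero filter and Target4 by the non-finite filter, which is why targets_used can be shorter than the set of targets requested.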

Custom Colors

To specify custom colors for sample groups, pass a named vector to the sample_colors parameter:

my_colors <- c("Control" = "#FF0000", "Treatment" = "#0000FF")
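
Because the names of sample_colors should match the levels of the annotate_sample_by column, a quick check before plotting can catch mismatches. The sketch below uses base R only; the toy sample_metadata and its Group column are assumptions for illustration, and the check is not something the package is documented to perform.

```r
# Toy metadata; the Group column plays the role of annotate_sample_by
sample_metadata <- data.frame(
  SampleName = paste0("S", 1:6),
  Group = rep(c("Control", "Treatment"), each = 3),
  stringsAsFactors = FALSE
)

my_colors <- c("Control" = "#FF0000", "Treatment" = "#0000FF")

groups <- unique(as.character(sample_metadata$Group))

# Groups with no color assigned, and colors that match no group
uncolored_groups <- setdiff(groups, names(my_colors))
unused_colors <- setdiff(names(my_colors), groups)

if (length(uncolored_groups) > 0) {
  warning("No color supplied for: ", paste(uncolored_groups, collapse = ", "))
}
```

Both setdiff results are empty here, so my_colors covers every group and contains no stray names.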

Examples

if (FALSE) { # \dontrun{
# Basic PCA plot
result <- generate_pca(
  data = Data_NPQ,
  sampleInfo = sample_metadata,
  sampleName_var = "SampleName",
  annotate_sample_by = "Group"
)

# PCA with sample labels and custom shapes
result <- generate_pca(
  data = Data_NPQ,
  sampleInfo = sample_metadata,
  sampleName_var = "SampleName",
  annotate_sample_by = "Group",
  shape_by = "Batch",
  label_points = TRUE
)

# PCA with custom colors
custom_colors <- c("Control" = "blue", "Treatment" = "red")
result <- generate_pca(
  data = Data_NPQ,
  sampleInfo = sample_metadata,
  sampleName_var = "SampleName",
  annotate_sample_by = "Group",
  sample_colors = custom_colors
)

# Save PCA plot to file
result <- generate_pca(
  data = Data_NPQ,
  sampleInfo = sample_metadata,
  sampleName_var = "SampleName",
  annotate_sample_by = "Group",
  output_dir = "output/figures",
  plot_name = "pca_analysis.pdf",
  plot_title = "PCA Analysis of Expression Data",
  plot_width = 12,
  plot_height = 10
)
} # }