Generalized Linear Model for NULISAseq Data - targets as predictors

Fits a generalized linear model for each target in the NULISAseq data set, with that target's expression as a predictor of the response variable (one target per model). Various response distributions are supported through the family parameter. Outputs coefficients, odds ratios, z-statistics, and unadjusted and adjusted p-values.

glmNULISAseq_predict(
  data,
  sampleInfo,
  sampleName_var,
  response_var,
  modelFormula,
  reduced_modelFormula = NULL,
  exclude_targets = NULL,
  exclude_samples = NULL,
  target_subset = NULL,
  sample_subset = NULL,
  family = binomial(),
  return_model_fits = FALSE
)

Arguments

data

A matrix of normalized NULISAseq data with targets in rows, samples in columns. Row names should be the target names, and column names are the sample names. It is assumed that data has already been transformed using log2(x + 1) for each NULISAseq normalized count value x.
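As a minimal sketch of the expected input layout, the following builds a small matrix with targets in rows and samples in columns and applies the log2(x + 1) transformation. The target names, sample names, and counts are simulated placeholders, not real NULISAseq output.

```r
# Hypothetical normalized counts: 3 targets (rows) x 4 samples (columns).
set.seed(1)
norm_counts <- matrix(
  rpois(12, lambda = 500),
  nrow = 3,
  dimnames = list(c("IL6", "TNF", "CRP"), paste0("S", 1:4))
)

# Transform each normalized count x to log2(x + 1).
data <- log2(norm_counts + 1)

dim(data)       # 3 targets by 4 samples
rownames(data)  # target names
colnames(data)  # sample names
```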

sampleInfo

A data frame with sample metadata including the response variable and covariates. Rows are samples, columns are sample metadata variables. Generalized linear models are fit only to the samples in sampleInfo, or to a subset of those samples as specified using the arguments exclude_samples or sample_subset. sampleInfo should have a column for each variable included in the generalized linear models. String variables are automatically treated as factors, and numeric variables are treated as numeric.

sampleName_var

The name of the column of sampleInfo that matches the column names of data. This variable will be used to merge the target expression data with the sample metadata.
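The merge described above can be pictured roughly as follows for a single target. This is an illustrative sketch in base R, not the package's internal code; the column names sampleName, disease, and age and all values are hypothetical.

```r
# Hypothetical expression matrix: targets in rows, samples in columns.
data <- matrix(
  c(10.1, 11.3, 9.8, 12.0,
    7.2, 6.9, 8.1, 7.7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("IL6", "TNF"), c("S1", "S2", "S3", "S4"))
)

# Hypothetical sample metadata; sampleName matches colnames(data).
sampleInfo <- data.frame(
  sampleName = c("S1", "S2", "S3", "S4"),
  disease = c(0, 1, 0, 1),
  age = c(34, 51, 47, 62)
)
sampleName_var <- "sampleName"

# Expression of one target, keyed by sample name.
target_df <- data.frame(
  sample = colnames(data),
  target = data["IL6", ]
)

# Join metadata and expression on the sample name column.
merged <- merge(sampleInfo, target_df, by.x = sampleName_var, by.y = "sample")
merged
```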

response_var

The name of the column of sampleInfo specifying the response variable.

modelFormula

A string that represents the right-hand side of the model formula (everything after the ~) used for the generalized linear model. The main effect of target expression is automatically added as a predictor. Any interactions must be specified in the model formula as "covariate * target". For example, with response_var = "disease" and family = binomial(), modelFormula = "age + sex + plate" tests for differences in the log-odds of the binary outcome defined by the response variable, adjusted for age, sex, and plate. modelFormula = "sex * target + age + plate" includes both main and interaction effects for sex and target expression. See ?glm.
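To make the role of the formula string concrete, here is a sketch of an equivalent single-target fit in base R. How the function assembles the formula internally is an assumption here (the target main effect is simply pasted in front of modelFormula), and the data are simulated.

```r
# Simulated data for one target; all variable names are illustrative.
set.seed(42)
n <- 100
df <- data.frame(
  age = rnorm(n, 50, 10),
  sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  target = rnorm(n, 10, 1)
)
df$disease <- rbinom(n, 1, plogis(-5 + 0.5 * df$target))

response_var <- "disease"
modelFormula <- "age + sex"

# Assumed assembly: target main effect added to the user-supplied right-hand side.
full_formula <- as.formula(paste(response_var, "~ target +", modelFormula))
fit <- glm(full_formula, data = df, family = binomial())
summary(fit)$coefficients["target", ]

# With an interaction written as "covariate * target".
int_formula <- as.formula(paste(response_var, "~", "sex * target + age"))
fit_int <- glm(int_formula, data = df, family = binomial())
names(coef(fit_int))
```

In the interaction model, the coefficient table contains a separate row for the sex-by-target term in addition to the two main effects.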

reduced_modelFormula

Optional reduced model formula that contains only a subset of the terms in modelFormula. The reduced model serves as the null model for a likelihood ratio test (LRT, which is a Chi-square test) using anova(). This is useful, for example, for testing the overall significance of factor variables with more than 2 levels.
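The likelihood ratio test for a multi-level factor can be sketched with base R as below. This illustrates the general technique of comparing nested glm() fits with anova(); it is not the package's internal code, and the plate factor and data are simulated.

```r
# Simulated data with a three-level factor (plate).
set.seed(7)
n <- 150
df <- data.frame(
  target = rnorm(n, 10, 1),
  plate = factor(sample(c("P1", "P2", "P3"), n, replace = TRUE))
)
df$disease <- rbinom(n, 1, plogis(-5 + 0.5 * df$target))

# Full model includes plate; reduced (null) model omits it.
full <- glm(disease ~ target + plate, data = df, family = binomial())
reduced <- glm(disease ~ target, data = df, family = binomial())

# Chi-square LRT: tests all plate coefficients jointly.
lrt <- anova(reduced, full, test = "Chisq")
lrt
```

Because plate has three levels, the test has 2 degrees of freedom, one for each non-reference level.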

exclude_targets

A vector of target names for targets that will be excluded from the generalized linear models as predictors. Internal control targets, for example, should probably always be excluded.

exclude_samples

A vector of sample names for samples that will be excluded from the generalized linear models. External control wells (IPCs, NCs, SCs) should usually be excluded.

target_subset

Overrides exclude_targets. A vector of target names for targets that will be included in the generalized linear models as predictors.

sample_subset

Overrides exclude_samples. A vector of sample names for samples that will be included in the generalized linear models.
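The precedence between the exclude and subset arguments can be sketched with a small helper. select_names is a hypothetical function written for illustration only; it is not part of the package and assumes that a non-NULL subset simply replaces the exclusion list.

```r
# Hypothetical helper: a non-NULL subset overrides exclude.
select_names <- function(all_names, exclude = NULL, subset = NULL) {
  if (!is.null(subset)) {
    # Keep only requested names that are actually present.
    return(intersect(all_names, subset))
  }
  # Otherwise drop the excluded names (none dropped when exclude is NULL).
  setdiff(all_names, exclude)
}

targets <- c("IL6", "TNF", "CRP", "mCherry")

select_names(targets, exclude = "mCherry")
select_names(targets, exclude = "mCherry", subset = c("IL6", "CRP"))
```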

family

A family object for the glm() function specifying the distribution and link function.

binomial(link = "logit"): For binary response data (default). Reports odds ratios (OR).
gaussian(link = "identity"): For continuous normally-distributed data. Reports coefficients.
poisson(link = "log"): For count data. Reports rate ratios (RR).
Gamma(link = "inverse"): For positive continuous data with constant coefficient of variation. Reports inverse ratios (IR).
inverse.gaussian(link = "1/mu^2"): For positive continuous data with variance increasing with mean^3. Reports inverse ratios (IR).
quasibinomial(link = "logit"): For overdispersed binary data. Reports odds ratios (OR).
quasipoisson(link = "log"): For overdispersed count data. Reports rate ratios (RR).
quasi(link = "identity", variance = "constant"): For custom quasi-likelihood models.

The function automatically selects the appropriate test statistic (z- or t-value) and effect size measure based on the family. See ?family.
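The relationship between family, effect size, and test statistic can be seen directly in base R glm() output. The sketch below uses simulated data and shows only standard glm() behavior: exponentiated coefficients give odds ratios under the logit link and rate ratios under the log link, and quasi families report t-values where binomial and poisson report z-values.

```r
# Simulated binary and count responses driven by one target.
set.seed(3)
n <- 200
target <- rnorm(n, 10, 1)
y_bin <- rbinom(n, 1, plogis(-5 + 0.5 * target))
y_cnt <- rpois(n, exp(-1 + 0.2 * target))

fit_bin <- glm(y_bin ~ target, family = binomial())
fit_cnt <- glm(y_cnt ~ target, family = poisson())
fit_qp <- glm(y_cnt ~ target, family = quasipoisson())

# Effect sizes on the ratio scale.
exp(coef(fit_bin)["target"])  # odds ratio per unit of target expression
exp(coef(fit_cnt)["target"])  # rate ratio per unit of target expression

# Test statistic column depends on the family.
colnames(summary(fit_bin)$coefficients)  # includes "z value"
colnames(summary(fit_qp)$coefficients)   # includes "t value"
```

The poisson and quasipoisson fits share the same coefficient estimates; only the standard errors and the reference distribution for the test differ.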

return_model_fits

Logical TRUE or FALSE (default). Should a list of the model fits be returned? This can be useful for more detailed analyses and plotting, but it requires more memory.

Value

A list including the following:

modelStats: A data frame with rows corresponding to targets and columns corresponding to estimated model coefficients, effect sizes, standard errors, test statistics, unadjusted p-values, Bonferroni adjusted p-values, and Benjamini-Hochberg false discovery rate adjusted p-values (see ?p.adjust()).
LRTstats: A data frame with rows corresponding to targets and columns with likelihood ratio test statistics including Chi-square statistic, degrees of freedom, unadjusted p-values, Bonferroni adjusted p-values, and Benjamini-Hochberg false discovery rate adjusted p-values. (Only included when reduced_modelFormula is specified.)
modelFits: A list of length equal to number of targets containing the model fit output from glm(). Only returned when return_model_fits=TRUE.
reducedFits: A list of length equal to number of targets containing the reduced model fit output from anova(). Only returned when return_model_fits=TRUE and reduced_modelFormula is specified.
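The two multiple-testing adjustments named above correspond to standard p.adjust() methods. The following sketch applies them to a small vector of made-up p-values; the values and target names are illustrative only.

```r
# Hypothetical unadjusted p-values, one per target.
pvals <- c(IL6 = 0.001, TNF = 0.02, CRP = 0.04, IL10 = 0.3)

# Bonferroni: multiply by the number of tests, capped at 1.
p_bonf <- p.adjust(pvals, method = "bonferroni")

# Benjamini-Hochberg false discovery rate adjustment.
p_bh <- p.adjust(pvals, method = "BH")

data.frame(pval_unadj = pvals, pval_bonf = p_bonf, pval_FDR = p_bh)
```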