An optimized implementation of scPAS for identifying phenotype-associated cell subpopulations from single-cell RNA-seq data by integrating bulk transcriptomic data. This version includes performance optimizations, memory-efficient matrix operations, and enhanced statistical testing.

Usage

scPAS.optimized(
  bulk_dataset,
  sc_dataset,
  phenotype,
  assay = "RNA",
  tag = NULL,
  nfeature = NULL,
  imputation = TRUE,
  imputation_method = c("KNN", "ALRA"),
  alpha = NULL,
  cutoff = 0.2,
  network_class = c("SC", "bulk"),
  independent = TRUE,
  family = c("gaussian", "binomial", "cox"),
  permutation_times = 2000,
  FDR.threshold = 0.05,
  verbose = TRUE,
  ...
)

Arguments

bulk_dataset

A matrix or data frame containing bulk expression data. Each row represents a gene and each column represents a sample. Expression values should be continuous.

sc_dataset

A Seurat object or matrix containing single-cell RNA-seq expression data. If a matrix is provided, it will be automatically processed using Seurat's default pipeline.

phenotype

Phenotype annotation for bulk samples. The format depends on the regression family:

  • For family = "gaussian": A continuous numeric vector

  • For family = "binomial": A binary group indicator vector (0/1 encoded) or factor with two levels

  • For family = "cox": A two-column matrix with columns named 'time' and 'status' (1 = event, 0 = censored)
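As a minimal sketch of the three formats listed above, the snippet below builds toy phenotypes for six hypothetical bulk samples. The sample names and values are invented for illustration; only the shapes and encodings follow the descriptions above.

```r
# Toy phenotypes for six hypothetical bulk samples (names and values are invented).
samples <- paste0("sample", 1:6)

# family = "gaussian": continuous numeric vector, one value per bulk sample
pheno_gaussian <- c(2.1, 3.4, 1.8, 4.0, 2.9, 3.3)
names(pheno_gaussian) <- samples

# family = "binomial": 0/1 indicator vector (a two-level factor is also accepted)
pheno_binomial <- c(0, 0, 0, 1, 1, 1)
names(pheno_binomial) <- samples

# family = "cox": two-column matrix with columns 'time' and 'status'
# (1 = event, 0 = censored)
pheno_cox <- cbind(
  time   = c(12, 30, 7, 45, 22, 60),
  status = c(1, 0, 1, 0, 1, 0)
)
rownames(pheno_cox) <- samples
```

This sketch assumes the phenotype entries follow the column order of bulk_dataset; the documentation does not state whether samples are matched by name or by position.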

assay

Character string specifying the assay name in the Seurat object to use for analysis. Default: 'RNA'.

tag

Optional character vector of length 2 specifying names for each phenotypic group. Used only for logistic regression (family = "binomial").

nfeature

Numeric value or character vector specifying the number of variable features to select, or a custom set of feature names. If NULL, all common genes between bulk and single-cell data are used.
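To illustrate what "common genes" means here, the following base-R sketch intersects the gene names of two toy matrices and then applies a custom gene set. The matrices, gene lists, and the choice to restrict a custom set to the common genes are assumptions of the sketch, not a description of the function's internals.

```r
# Toy matrices with genes in rows (gene and sample names are invented).
bulk_expr <- matrix(rnorm(5 * 3), nrow = 5,
                    dimnames = list(c("TP53", "MYC", "EGFR", "GAPDH", "ACTB"),
                                    paste0("bulk", 1:3)))
sc_expr <- matrix(rpois(4 * 6, lambda = 2), nrow = 4,
                  dimnames = list(c("MYC", "EGFR", "ACTB", "CD3E"),
                                  paste0("cell", 1:6)))

# nfeature = NULL: use every gene present in both datasets
common_genes <- intersect(rownames(bulk_expr), rownames(sc_expr))

# nfeature as a character vector: a custom gene set, kept only where it
# overlaps the common genes (this restriction is an assumption of the sketch)
custom_genes <- c("MYC", "CD3E", "ACTB")
selected_genes <- intersect(custom_genes, common_genes)

# Subset both matrices to the same genes in the same order
bulk_sub <- bulk_expr[selected_genes, , drop = FALSE]
sc_sub <- sc_expr[selected_genes, , drop = FALSE]
```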

imputation

Logical indicating whether to perform imputation on single-cell data. Default: TRUE.

imputation_method

Character string specifying the imputation method. One of: 'KNN', 'ALRA'. Default: 'KNN'.

alpha

Numeric value or vector specifying the regularization parameter balancing L1 and network-based penalties. If NULL, a default sequence from 0.001 to 0.9 is used.

cutoff

Threshold used to select the optimal alpha value when alpha = NULL. Default: 0.2.

network_class

Character string specifying the source for constructing the gene-gene similarity network. One of: 'SC' (single-cell data), 'bulk' (bulk data). Default: 'SC'.

independent

Logical indicating whether to construct background distributions independently for each cell. Default: TRUE.

family

Character string specifying the regression family. One of: "gaussian" (linear regression), "binomial" (logistic regression), "cox" (Cox regression). Default: "gaussian".

permutation_times

Numeric value specifying the number of permutations for statistical testing. Default: 2000.

FDR.threshold

Numeric value specifying the false discovery rate threshold for identifying phenotype-associated cells. Default: 0.05.

verbose

Logical indicating whether to print progress messages. Default: TRUE.

...

Additional arguments. Currently none are supported.

Value

Returns the input Seurat object with the following additions:

  • Metadata columns:

    • scPAS_RS - Raw risk scores for each cell

    • scPAS_NRS - Normalized risk scores (Z-statistics)

    • scPAS_Pvalue - P-values from permutation testing

    • scPAS_FDR - False discovery rate adjusted p-values

    • scPAS - Cell classification labels: "Positive", "Negative", or "Neutral"

  • Miscellaneous slot (sc_dataset@misc$scPAS_para):

    • alpha - Alpha values used in model optimization

    • lambda - Lambda values used in model optimization

    • family - Regression family used

    • Coefs - Final model coefficients for each gene

    • bulk - Processed bulk expression matrix

    • phenotype - Processed phenotype vector

    • Network - Gene-gene similarity network used
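The relationship between the metadata columns can be sketched with a mock data frame. The rule used below (cells under the FDR threshold take the sign of their normalized risk score, all others are "Neutral") is an assumption for illustration and may differ from the function's actual classification logic; all values are invented.

```r
# Mock per-cell metadata using the column names listed above (values are invented).
meta <- data.frame(
  scPAS_NRS = c(3.2, -2.9, 0.4, 2.5, -0.1),
  scPAS_FDR = c(0.001, 0.010, 0.700, 0.200, 0.950),
  row.names = paste0("cell", 1:5)
)

# Assumed rule: significant cells take the sign of their normalized risk score.
fdr_threshold <- 0.05
label <- rep("Neutral", nrow(meta))
significant <- meta$scPAS_FDR < fdr_threshold
label[significant & meta$scPAS_NRS > 0] <- "Positive"
label[significant & meta$scPAS_NRS < 0] <- "Negative"
meta$scPAS <- label

# Count cells per class
table(meta$scPAS)
```

With a real result, the same columns would be read from the metadata of the returned Seurat object, and the fitted parameters from its misc slot under scPAS_para.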

Details

This optimized implementation of scPAS integrates bulk and single-cell transcriptomic data to identify phenotype-associated cell subpopulations through a comprehensive analytical workflow:

Workflow Overview:

  1. Data Preprocessing:

    • Identifies common genes between bulk and single-cell datasets

    • Filters ribosomal and mitochondrial genes

    • Performs quantile normalization on bulk data

    • Optionally imputes single-cell data using specified methods

  2. Network Construction:

    • Builds gene-gene similarity networks from either single-cell or bulk data

    • Uses correlation-based similarity measures

    • Applies shared nearest neighbor (SNN) network construction

  3. Regularized Regression:

    • Implements network-regularized sparse regression (APML0)

    • Optimizes alpha and lambda parameters through cross-validation

    • Supports multiple regression families (gaussian, binomial, cox)

  4. Risk Score Calculation:

    • Computes phenotype-associated risk scores for each cell

    • Uses matrix optimizations for efficient computation

  5. Statistical Validation:

    • Performs permutation testing to assess significance

    • Calculates Z-statistics and false discovery rates

    • Classifies cells based on statistical thresholds
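Steps 4 and 5 can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the raw score is taken to be a coefficient-weighted sum of expression, the background is built by shuffling coefficients across genes, and p-values come from a normal approximation of the Z-statistic. All inputs are simulated.

```r
set.seed(1)

# Toy inputs (all invented): scaled expression for 50 genes x 20 cells and
# a sparse coefficient vector standing in for the fitted regression model.
n_genes <- 50
n_cells <- 20
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("cell", 1:n_cells)))
coefs <- numeric(n_genes)
coefs[1:5] <- c(1.5, -1.0, 0.8, 1.2, -0.6)

# Step 4: raw risk score per cell as a coefficient-weighted sum of expression.
raw_score <- as.vector(crossprod(expr, coefs))

# Step 5: background scores from shuffled coefficients, one column per permutation.
permutation_times <- 200
shuffled <- replicate(permutation_times, sample(coefs))  # genes x permutations
background <- crossprod(expr, shuffled)                   # cells x permutations

# Z-statistic for each cell against its own background distribution.
bg_mean <- rowMeans(background)
bg_sd <- apply(background, 1, sd)
z_score <- (raw_score - bg_mean) / bg_sd

# Two-sided p-values from the normal approximation, then Benjamini-Hochberg FDR.
p_value <- 2 * pnorm(-abs(z_score))
fdr <- p.adjust(p_value, method = "BH")
```

Computing every permutation with a single crossprod() call, rather than looping over permutations, is one way to get the kind of matrix-level efficiency mentioned in step 4.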

Note

The function requires both bulk and single-cell data from related biological conditions. For survival analysis (family = "cox"), the phenotype must be a properly formatted survival object or matrix with 'time' and 'status' columns.

References

Xie A, Wang H, Zhao J, Wang Z, Xu J, Xu Y. scPAS: single-cell phenotype-associated subpopulation identifier. Briefings in Bioinformatics. 2024 Nov 22;26(1):bbae655.

See also

Other scPAS: DoscPAS()

Examples

if (FALSE) { # \dontrun{
# Example with continuous phenotype (linear regression)
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = continuous_phenotype,
  family = "gaussian"
)

# Example with binary phenotype (logistic regression)
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = binary_groups,
  family = "binomial",
  tag = c("Control", "Disease")
)

# Example with survival phenotype (Cox regression) and custom parameters
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = survival_data,
  family = "cox",
  nfeature = 2000,
  permutation_times = 5000,
  FDR.threshold = 0.01
)
} # }