scPAS: Single-Cell Phenotype-Associated Subpopulations (Optimized)

An optimized implementation of scPAS for identifying phenotype-associated cell subpopulations from single-cell RNA-seq data by integrating bulk transcriptomic data. This version includes performance optimizations, memory-efficient matrix operations, and enhanced statistical testing.

Usage

scPAS.optimized(
  bulk_dataset,
  sc_dataset,
  phenotype,
  assay = "RNA",
  tag = NULL,
  nfeature = NULL,
  imputation = TRUE,
  imputation_method = c("KNN", "ALRA"),
  alpha = NULL,
  cutoff = 0.2,
  network_class = c("SC", "bulk"),
  independent = TRUE,
  family = c("gaussian", "binomial", "cox"),
  permutation_times = 2000,
  FDR.threshold = 0.05,
  verbose = TRUE,
  ...
)

Arguments

bulk_dataset

A matrix or data frame containing bulk expression data. Each row represents a gene and each column represents a sample. Expression values should be continuous.

sc_dataset

A Seurat object or matrix containing single-cell RNA-seq expression data. If a matrix is provided, it will be automatically processed using Seurat's default pipeline.

phenotype

Phenotype annotation for bulk samples. The format depends on the regression family:

For family = "gaussian": A continuous numeric vector
For family = "binomial": A binary group indicator vector (0/1 encoded) or factor with two levels
For family = "cox": A two-column matrix with columns named 'time' and 'status' (1 = event, 0 = censored)
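The three phenotype formats can be sketched in base R as follows. This is a minimal illustration with made-up values for a hypothetical cohort of six bulk samples. The object names (sample_ids, pheno_gaussian, pheno_binomial, pheno_cox) are not part of the package, and labelling entries by sample ID is an assumption. The entries are presumably expected in the same order as the columns of bulk_dataset.

```r
# Hypothetical bulk cohort of six samples; all values are invented for illustration.
sample_ids <- paste0("S", 1:6)

# family = "gaussian": continuous numeric vector, one value per bulk sample
pheno_gaussian <- c(2.1, 3.4, 1.8, 4.0, 2.9, 3.3)
names(pheno_gaussian) <- sample_ids

# family = "binomial": 0/1 encoded group indicator (a two-level factor also works)
pheno_binomial <- c(0, 0, 0, 1, 1, 1)
names(pheno_binomial) <- sample_ids

# family = "cox": two-column matrix with columns 'time' and 'status'
# (status: 1 = event, 0 = censored)
pheno_cox <- cbind(
  time   = c(12, 30, 45, 8, 60, 22),
  status = c(1, 0, 1, 1, 0, 1)
)
rownames(pheno_cox) <- sample_ids
```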

assay

Character string specifying the assay name in the Seurat object to use for analysis. Default: 'RNA'.

tag

Optional character vector of length 2 specifying names for each phenotypic group. Used only for logistic regression (family = "binomial").

nfeature

Numeric value or character vector specifying the number of variable features to select, or a custom set of feature names. If NULL, all common genes between bulk and single-cell data are used.

imputation

Logical indicating whether to perform imputation on single-cell data. Default: TRUE.

imputation_method

Character string specifying the imputation method. One of: 'KNN', 'ALRA'. Default: 'KNN'.

alpha

Numeric value or vector specifying the regularization parameter balancing L1 and network-based penalties. If NULL, a default sequence from 0.001 to 0.9 is used.

cutoff

Threshold used to select the optimal alpha value when alpha = NULL. Default: 0.2.

network_class

Character string specifying the source for constructing the gene-gene similarity network. One of: 'SC' (single-cell data), 'bulk' (bulk data). Default: 'SC'.

independent

Logical indicating whether to construct background distributions independently for each cell. Default: TRUE.

family

Character string specifying the regression family. One of: "gaussian" (linear regression), "binomial" (logistic regression), "cox" (Cox regression). Default: "gaussian".

permutation_times

Numeric value specifying the number of permutations for statistical testing. Default: 2000.

FDR.threshold

Numeric value specifying the false discovery rate threshold for identifying phenotype-associated cells. Default: 0.05.

verbose

Logical indicating whether to print progress messages. Default: TRUE.

...

Additional arguments. Currently none are supported.

Value

Returns the input Seurat object with the following additions:

Metadata columns:
- scPAS_RS - Raw risk scores for each cell
- scPAS_NRS - Normalized risk scores (Z-statistics)
- scPAS_Pvalue - P-values from permutation testing
- scPAS_FDR - False discovery rate adjusted p-values
- scPAS - Cell classification labels: "Positive", "Negative", or "Neutral"
Miscellaneous slot (sc_dataset@misc$scPAS_para):
- alpha - Alpha values used in model optimization
- lambda - Lambda values used in model optimization
- family - Regression family used
- Coefs - Final model coefficients for each gene
- bulk - Processed bulk expression matrix
- phenotype - Processed phenotype vector
- Network - Gene-gene similarity network used
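A short sketch of how the returned columns might be inspected. To keep it runnable without Seurat, a small mock data frame (meta, with invented values) stands in for the metadata of the returned object. With a real run, the same columns would be read from the Seurat object returned by scPAS.optimized().

```r
# Mock stand-in for the metadata of the returned Seurat object (values invented).
meta <- data.frame(
  scPAS_NRS = c(3.2, -2.9, 0.4, 2.7, -0.1),
  scPAS_FDR = c(0.001, 0.004, 0.62, 0.03, 0.91),
  scPAS     = c("Positive", "Negative", "Neutral", "Positive", "Neutral"),
  row.names = paste0("cell", 1:5)
)

# Number of cells per classification label
label_counts <- table(meta$scPAS)

# Names of the cells labelled "Positive"
positive_cells <- rownames(meta)[meta$scPAS == "Positive"]

# With a real result object, the fitted gene coefficients are stored in the
# miscellaneous slot described above:
# coefs <- result@misc$scPAS_para$Coefs
```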

Details

This optimized implementation of scPAS integrates bulk and single-cell transcriptomic data to identify phenotype-associated cell subpopulations through a comprehensive analytical workflow:

Workflow Overview:

Data Preprocessing:
- Identifies common genes between bulk and single-cell datasets
- Filters ribosomal and mitochondrial genes
- Performs quantile normalization on bulk data
- Optionally imputes single-cell data using specified methods
Network Construction:
- Builds gene-gene similarity networks from either single-cell or bulk data
- Uses correlation-based similarity measures
- Applies shared nearest neighbor (SNN) network construction
Regularized Regression:
- Implements network-regularized sparse regression (APML0)
- Optimizes alpha and lambda parameters through cross-validation
- Supports multiple regression families (gaussian, binomial, cox)
Risk Score Calculation:
- Computes phenotype-associated risk scores for each cell
- Uses matrix optimizations for efficient computation
Statistical Validation:
- Performs permutation testing to assess significance
- Calculates Z-statistics and false discovery rates
- Classifies cells based on statistical thresholds
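The risk score and statistical validation steps can be illustrated with a base R sketch on toy data. This is not the package's internal code. It assumes that the background is built by shuffling the coefficients across genes, that p-values are two-sided and taken from a normal approximation of the Z-statistic, that FDR uses the Benjamini-Hochberg adjustment, and that the sign of the normalized score decides between "Positive" and "Negative". All object names are hypothetical.

```r
set.seed(1)

n_genes <- 50
n_cells <- 20
n_perm <- 200           # stands in for permutation_times
fdr_threshold <- 0.05   # stands in for FDR.threshold

# Toy inputs: a genes x cells expression matrix and a sparse coefficient vector
expr <- matrix(
  rnorm(n_genes * n_cells),
  nrow = n_genes,
  dimnames = list(paste0("g", 1:n_genes), paste0("c", 1:n_cells))
)
coefs <- numeric(n_genes)
coefs[1:5] <- c(1.5, -1.0, 0.8, 0.6, -0.4)

# Raw risk score per cell: coefficient-weighted sum of gene expression
rs <- as.vector(crossprod(expr, coefs))

# Background scores: shuffle coefficients across genes and rescore every cell
# (result is an n_cells x n_perm matrix)
null_rs <- replicate(n_perm, as.vector(crossprod(expr, sample(coefs))))

# Normalize each cell against its own background (Z-statistic)
nrs <- (rs - rowMeans(null_rs)) / apply(null_rs, 1, sd)

# Two-sided p-values from the normal approximation, then BH adjustment
pval <- 2 * pnorm(-abs(nrs))
fdr <- p.adjust(pval, method = "BH")

# Classify cells by FDR threshold and score direction
label <- ifelse(fdr < fdr_threshold,
                ifelse(nrs > 0, "Positive", "Negative"),
                "Neutral")
names(label) <- colnames(expr)
```

Here every cell is compared with its own row of background scores, which is one plausible reading of independent = TRUE. Pooling the background across cells would be the alternative.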

Note

The function requires both bulk and single-cell data from related biological conditions. For survival analysis (family = "cox"), the phenotype must be a properly formatted survival object or matrix with 'time' and 'status' columns.

References

Xie A, Wang H, Zhao J, Wang Z, Xu J, Xu Y. scPAS: single-cell phenotype-associated subpopulation identifier. Briefings in Bioinformatics. 2024 Nov 22;26(1):bbae655.

Examples

if (FALSE) { # \dontrun{
# Example with continuous phenotype (linear regression)
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = continuous_phenotype,
  family = "gaussian"
)

# Example with binary phenotype (logistic regression)
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = binary_groups,
  family = "binomial",
  tag = c("Control", "Disease")
)

# Example with survival phenotype (Cox regression) and custom parameters
result <- scPAS.optimized(
  bulk_dataset = bulk_expr_matrix,
  sc_dataset = seurat_obj,
  phenotype = survival_data,
  family = "cox",
  nfeature = 2000,
  permutation_times = 5000,
  FDR.threshold = 0.01
)
} # }