Skip to contents

Full Tutorial for SigBridgeR

0. Preface

0.1 Introduction to SigBridgeR

SigBridgeR (short for Significant cell-to-phenotype Bridge in R) is an R package for screening cells highly associated with phenotype data using single-cell RNA-seq, bulk expression and sample_related phenotype data at a pan-cancer level. It integrates functionality from these R packages: Github-sunduanchen/Scissor, Github-Qinran-Zhang/scAB, Github-WangX-Lab/ScPP and Github-aiminXie/scPAS.


1. Installation

Install SigBridgeR using one of these methods:

1.1 Release from GitHub

Usually we recommend installing the latest release from GitHub because of the latest features and bug fixes.

if(!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("WangLabCSU/SigBridgeR")

1.2 Release from r-universe

install.packages("SigBridgeR", repos = "https://wanglab.r-universe.dev")

1.8 Check Dependencies

You can use this function to quickly verify installed dependencies and their versions:

CheckPkgs <- function(packages) {
  CheckSinglePkg <- function(pkg_spec) {
    installed <- requireNamespace(pkg_spec$pkg, quietly = TRUE)
    current_version <- NA_character_
    
    if (installed) {
      current_version <- as.character(utils::packageVersion(pkg_spec$pkg))
      
      if (!is.null(pkg_spec$version)) {
        installed <- current_version >= pkg_spec$version
      }
    }
    
    data.frame(
      Package = pkg_spec$pkg,
      Required_Version = if (is.null(pkg_spec$version)) "Any" else pkg_spec$version,
      Installed = installed,
      Current_Version = if (installed) current_version else NA,
      stringsAsFactors = FALSE
    )
  }
  
  result <- do.call(rbind, lapply(packages, CheckSinglePkg))
  rownames(result) <- NULL  
    
  return(result)
}

CheckPkgs(list(
  list(pkg = "Seurat", version = "5.3.0"),
  list(pkg = "dplyr", version = "1.1.4"),
  list(pkg = "cli", version = "3.6.5"),
  list(pkg = "AUCell", version = "1.26.0"),
  list(pkg = "future", version = "1.67.0"),
  list(pkg = "IDConverter", version = "0.3.5"),
  list(pkg = "Matrix", version = "1.7.3"),
  list(pkg = "scAB", version = "1.0.0"),
  list(pkg = "Scissor", version = "2.0.0"),
  list(pkg = "scPAS", version = "0.2.0"),
  list(pkg = "ScPP", version = "0.0.0.9000"),
  list(pkg = "tibble", version = "3.3.0"),
  list(pkg = "tidyr", version = "1.3.1"),
  list(pkg = "data.table", version = "1.17.8"),
  list(pkg = "ggforce", version = "0.5.0"),
  list(pkg = "ggplot2", version = "4.0.0"),
  list(pkg = "ggupset", version = "0.4.1"),
  list(pkg = "purrr", version = "1.1.0"),
  list(pkg = "edgeR", version = "4.2.2"),
  list(pkg = "scales", version = "1.4.0"),
  list(pkg = "preprocessCore", version = "1.66.0"),
  # suggest
  list(pkg = "randomcoloR", version = "1.1.0.1"),
  list(pkg = "reticulate", version = "1.43.0"),
  list(pkg = "patchwork", version = "1.3.2"),
  list(pkg = "org.Hs.eg.db", verion = "3.19.1"),
  list(pkg = "zeallot", version = "0.1.0"),
  list(pkg = "ggVennDiagram", version = "1.5.4")
  )
)

If you encounter compatibility issues, you can install the version of the package indicated here.


2. Loading and preprocessing data

First load the packaged:

2.1 Single-cell RNA-seq Data

You can use function SCPreProcess to preprocess your single-cell RNA-seq data. Here are some options:

2.1.1 (Option A) Start from Raw Matrix

When starting from a raw count matrix (data.frame, matrix or dgCMatrix), SCPreProcess will automatically perform Seurat preprocess (including NormalizeData, FindVariableFeatures, ScaleData, RunPCA, FindNeighbors, FindClusters, RunTSNE, RunUMAP). SCPreProcess returns a fully preprocessed Seurat object for downstream use.

your_seurat <- SCPreProcess(
  your_matrix,
  meta_data = NULL,
  project = "Screen_Single_Cell", # Parameters used in Seurat preprocessing pipeline
  min_cells = 400,
  min_features = 0,
  normalization_method = "LogNormalize",
  scale_factor = 10000,
  scale_features = NULL,
  selection_method = "vst",
  resolution = 0.6,
  dims = 1:10,
  verbose = TRUE,
  future_global_maxsize = 6 * 1024^3,
  quality_control = TRUE,
  quality_control.pattern = c("^MT-"), # human for '^MT-', mouse for '^mt-', or specify your own pattern
  data_filter = TRUE,
  data_filter.nFeature_RNA_thresh = c(200, 6000), # [min, max]
  data_filter.percent.mt = 20,
  column2only_tumor = NULL,
  ...
)
  1. Build the initial Seurat object

    • Imports the count matrix.
    • Applies the first, coarse filters: genes must be detected in >= min_cells cells; cells must contain >= min_features genes.
    • Attaches optional sample metadata.
  2. Mitochondrial QC (default: ON)

    • A single regular expression (^MT- or ^mt-) is used to compute the percentage of mitochondrial reads per cell.
    • The metric is stored in object$percent.mt for later filtering.
  3. Cell filtering (default: ON)
    Cells are retained only if they satisfy all of the following:

    • nFeature_RNA lies between the lower and upper bounds defined by data_filter.nFeature_RNA_thresh.
    • percent.mt is below the user-defined cut-off (data_filter.percent.mt).
  4. Normalisation, variable-feature selection and scaling
    Encapsulated in ProcessSeuratObject():

    • Counts are normalised with the chosen method (default: LogNormalize).
    • Highly variable features are identified using the requested algorithm (default: vst).
    • Expression values of those features are scaled to unit variance.
  5. Append extra observations
    If the input object contains an obs slot, it is merged into object@meta.data.

  6. Dimensionality reduction and clustering
    ClusterAndReduce() runs:

    • PCA -> nearest-neighbour graph -> Louvain clustering (resolution parameterised).
    • UMAP for two-dimensional visualization.
    • Optionally removes very small or low-quality clusters.
  7. Tumour-cell flagging
    FilterTumorCell() creates a binary label (1 = Tumour, 0 = Normal) from the column specified by column2only_tumor.
    The new label can be used downstream to restrict analyses to malignant cells.

2.1.2 (Option B) Start from AnnDataR6 Object

SCPreProcess also supports AnnData objects. You may reference and use the following code:

reticulate::use_pythonenv("The_path_to_your_python") 
# reticulate::use_condaenv("conda_env_name")

anndata_obj <- anndata::read_h5ad("path_to_your_file.h5ad") # Or other formats, make sure the matrix is in obj$X.

your_seurat <- SCPreProcess(
  anndata_obj,
  meta_data = NULL,
  project = "Screen_Single_Cell", # Parameters used in Seurat preprocessing pipeline
  min_cells = 400,
  min_features = 0,
  normalization_method = "LogNormalize",
  scale_factor = 10000,
  scale_features = NULL,
  selection_method = "vst",
  resolution = 0.6,
  dims = 1:10,
  verbose = TRUE,
  future_global_maxsize = 6 * 1024^3,
  quality_control = TRUE, 
  quality_control.pattern = c("^MT-"), # human for '^MT-', mouse for '^mt-', or specify your own pattern
  data_filter = TRUE,
  data_filter.nFeature_RNA_thresh = c(200, 6000), # [min, max]
  data_filter.percent.mt = 20,
  column2only_tumor = NULL,
  ...
)

The description of data (meta.data) in anndata_obj$obs will be add to your_seurat@meta.data.

helpful documentation:

2.1.8 (Optional) Filter Out Tumor Cells

If you aim to filter out phenotype-associated cells in all tumor cells. With a preprocessed Seurat object (containing results from ), SCPreProcess will filter out tumor cells using the specified metadata column:

your_seurat <- SCPreProcess(your_seurat, column2only_tumor = "Tissue")

Note: I don’t recommend using columns like column2only_tumor = "Celltype" as tumor cell identities vary across tissues. instead:

  • Create a Dedicated Column: Add a new metadata column (e.g., is_tumor) to explicitly label cells:“Tumo(u)r”/“Normal”

  • Code Example:

# For glioblastoma (GBM)
seurat_obj@meta.data$is_tumor <- ifelse(
 grepl("GBM|glioblastoma|astrocytoma_grade_IV", seurat_obj@meta.data$Celltype, ignore.case = TRUE),
 "Tumor",  # or "Tumour" 
 "Normal"  # or "Non-Tumor" 
)

2.2 Bulk expression data

Here are some methods for processing bulk RNA-seq gene expression data matrices.

2.2.1 Evaluate the quality of your bulk RNA-seq data

BulkPreProcess performs comprehensive quality control on bulk RNA-seq data.

Key parameters for BulkPreProcess:

  • data: Expression matrix with genes as rows and samples as columns, or a list containing count_matrix and sample_info.
  • sample_info: Sample information data frame (optional), ignored if data is a list.
  • gene_symbol_conversion: Whether to convert Ensembl version IDs and TCGA version IDs to genes with SymbolConvert in Section 2.2.2, default TRUE.
  • check: Whether to perform detailed quality checks, default TRUE.
  • min_count_threshold: Minimum count threshold for gene filtering, default 10.
  • min_gene_expressed: Minimum number of samples a gene must be expressed in, default 3.
  • min_total_reads: Minimum total reads per sample, default 1e6.
  • min_genes_detected: Minimum number of genes detected per sample, default 10000.
  • min_correlation: Minimum correlation threshold between samples, default 0.8.
  • n_top_genes: Number of top variable genes for PCA analysis, default 500.
  • show_plot_results: Whether to generate PCA visualization plots, default TRUE.
  • verbose: Whether to output detailed information, default TRUE.

Suppose you have bulk RNA-seq count data and sample information, and you want to perform comprehensive preprocessing and quality control. You can refer to and use the following code:

# Example usage of BulkPreProcess
filtered_counts <- BulkPreProcess(
   data = your_count_matrix,
   sample_info = your_sample_info,
   gene_symbol_conversion = TRUE,
   check = TRUE,
   min_count_threshold = 10,
   min_samples_expressed = 3,
   min_total_reads = 1e6,
   min_genes_detected = 10000,
   min_correlation = 0.8,
   n_top_genes = 500,
   show_plot_results = TRUE,
   verbose = TRUE
)

The function returns a filtered count matrix after applying quality control steps. The order of filtering is: first genes, then samples.

Quality Control Metrics Reported
  • Data Integrity: Number of missing values detected

  • Gene Count: Total number of genes after filtering

  • Sample Read Depth: Total reads per sample

  • Gene Detection Rate: Number of genes detected per sample

  • Sample Correlation: Pearson correlation between samples

  • PCA Variance: Variance explained by first two principal components

  • Batch Effects: Proportion of genes significantly affected by batch

Visualization Output

When show_plot_results = TRUE, the function generates:

PCA plot colored by experimental condition

  • For low-depth sequencing data: reduce min_total_reads and min_genes_detected

  • For noisy datasets: increase min_correlation threshold

  • For large datasets: set check = FALSE for faster processing

2.2.2 Gene Symbol Conversion

SymbolConvert performs a straightforward task: converting common gene identifiers (e.g., Ensemble IDs, Entrez) to standardized gene symbols by using the IDConverter package.

# genes * samples
your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)

your_bulk_data <- SymbolConvert(your_bulk_data)

You can also use the org.Hs.eg.db package for gene symbol matching if you prefer not to use SymbolConvert’s built-in IDConverter.

library(org.Hs.eg.db)

your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)

your_bulk_data <- BulkPreProcess(your_bulk_data, gene_symbol_conversion = FALSE)

ensembl_ids <- sub("\\..*", "", rownames(your_bulk_data))
gene_symbols <- mapIds(org.Hs.eg.db, 
                      keys = ensembl_ids,
                      column = "SYMBOL",
                      keytype = "ENSEMBL",
                      multiVals = "first")

rownames(your_bulk_data) <- gene_symbols

2.3 Phenotype Data

Bascially you can just use your phenotype data directly. If you are confused about the structure Screen() requires, please refer to Section 5.

You can use this function to check NA values:

# `max_print`: output how many NA location messages at one time if NA exists
CheckNA <- function(data, max_print = 5) {
    # Load required packages
    rlang::check_installed(c("dplyr", "purrr", "cli", "data.table"))

    # Convert to data.table if it's a 2D structure but not already a data.table
    is_2d <- !is.null(dim(data))
    dt <- NULL

    if (is_2d && !data.table::is.data.table(data)) {
        dt <- data.table::as.data.table(data)
    } else if (is_2d) {
        dt <- data
    }

    # Initialize output
    na_info <- list()

    cli::cli_h1("NA Value Check (data.table optimized)")
    cli::cli_alert_info("Checking object with dimensions: {.val {dim(data)}}")

    # Handle 1D data (vectors)
    if (!is_2d) {
        na_count <- sum(is.na(data))
        na_positions <- which(is.na(data))
        na_info$positions <- na_positions
        na_info$count <- na_count

        if (na_count == 0) {
            cli::cli_alert_success("No NA values found in the vector!")
        }

        cli::cli_alert_warning("Found {.val {na_count}} NA value{?s} in vector")

        if (length(na_positions) > 0 && max_print > 0) {
            positions_to_show <- na_positions[
                1:min(max_print, length(na_positions))
            ]
            cli::cli_text("Positions: {.val {positions_to_show}}")

            if (length(na_positions) > max_print) {
                cli::cli_text(
                    "{.val {length(na_positions) - max_print}} additional positions not shown"
                )
            }
        }

        # For named vectors
        if (!is.null(names(data))) {
            na_names <- names(data)[na_positions]
            na_info$names <- na_names

            if (length(na_names) > 0 && max_print > 0) {
                names_to_show <- na_names[1:min(max_print, length(na_names))]
                cli::cli_text("Names: {.val {names_to_show}}")
            }
        }
    } else {
        # Handle 2D data (dataframes, matrices, data.tables)
        # Use data.table for efficient NA counting
        na_count <- dt[, sum(is.na(.SD))]
        na_info$count <- na_count

        if (na_count == 0) {
            cli::cli_alert_success("No NA values found in the 2D data!")
        }

        cli::cli_alert_warning(
            "Found {.val {na_count}} NA value{?s} in 2D data"
        )

        # Get NA positions efficiently without melt warning
        if (max_print > 0) {
            # Create a matrix of logical values indicating NA positions
            na_matrix <- is.na(as.matrix(dt))
            na_positions <- which(na_matrix, arr.ind = TRUE)

            # Convert to data.table and add column names
            na_positions_dt <- data.table::as.data.table(na_positions)
            colnames(na_positions_dt) <- c("row", "col")
            na_positions_dt$col_name <- colnames(dt)[na_positions_dt$col]

            na_info$positions <- na_positions_dt

            if (nrow(na_positions_dt) > 0) {
                positions_to_show <- na_positions_dt[
                    1:min(max_print, nrow(na_positions_dt)),
                ]

                cli::cli_text(
                    "First {.val {nrow(positions_to_show)}} position{?s}:"
                )

                for (i in 1:nrow(positions_to_show)) {
                    cli::cli_text(
                        "  Row {.val {positions_to_show$row[i]}}, Col {.val {positions_to_show$col_name[i]}} (index: {.val {positions_to_show$col[i]}})"
                    )
                }

                if (na_count > max_print) {
                    cli::cli_text(
                        "{.val {na_count - max_print}} additional positions not shown"
                    )
                }
            }
        }

        # Column-wise summary using data.table
        col_na <- dt[, lapply(.SD, function(x) sum(is.na(x)))]
        col_na <- unlist(col_na)
        col_na <- col_na[col_na > 0]

        if (length(col_na) > 0) {
            na_info$column_na <- col_na

            cli::cli_text("Column-wise NA counts:")
            for (col in names(col_na)) {
                cli::cli_text("  {.field {col}}: {.val {col_na[col]}}")
            }
        }
    }

    return(invisible(na_info))
}

3. Screening Cells Associated with Phenotype

The function Screen provide 4 different options for screening cells associated with phenotype, These 4 algorithms come from the repositories mentioned in Section 0.1, and you can choose one of them to screen your cells.

Key parameters for Screen:

  • matched_bulk: A data frame of bulk expression data after intersecting samples.
  • sc_data: A Seurat object after preprocessing, you can use the output of Preprocess function or your own preprocessed Seurat object.
  • phenotype: A data frame of phenotype data after intersecting samples.
  • label_type: A character value specifying the filtering labels are stored in the Seurat_object@misc , default: NULL.
  • phenotype_class: A character value specifying the phenotype data type, i.e. "binary", "survival" or "continuous".
  • screen_method: A character value specifying the screening method, i.e. “Scissor”, “scPAS”, “scAB” or “scPP”
  • ...: Other parameters for the screening methods.

3.1 (Option A) Scissor Screening

Parameters pass to ... when using Scissor method:

  • path2save_scissor_inputs: A character value specifying the path to save intermediate data, default: Scissor_inputs.RData
  • path2load_scissor_cahce: A character value specifying the path to load intermediate data
  • alpha: Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in [0,1]. A larger alpha lays more emphasis on the l1 norm.
  • cutoff: Cutoff for the percentage of the Scissor selected cells in total cells. This parameter is used to restrict the number of the Scissor selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data.
  • reliability_test_n: Permutation times (default: 10)
  • nfold: The fold number in cross-validation (default: 10)

Usage:

scissor_result = Screen(
  matched_bulk = matched_bulk,
  sc_data = sc_dataset, # A Seurat object after preprocessing
  phenotype = matched_phenotype_data,
  label_type = "TP53", # The filtering labels are stored in the `@misc`, you can change it to your own label
  phenotype_class = "binary",  
  screen_method = c("Scissor"),
  path2save_scissor_inputs = "Tmp/Scissor_inputs.RData" # Intermediate data
)

You can use the intermediate data for repeated runs. This is an inherent feature of the scissor.

scissor_result = Screen(
  sc_data = sc_dataset, 
  label_type = "TP53", 
  phenotype_class = "binary", 
  screen_method = c("Scissor"),
  path2load_scissor_cahce = "Tmp/Scissor_inputs.RData" # Intermediate data
)

If only the parameters alpha and cutoff are adjusted, this method can also be applied.

# When `alpha = NULL`, an alpha iteration will continue until phenotype-associated cells are screened out or no cells are screened out even after exceeding the `cutoff`.
scissor_result = Screen(
  sc_data = sc_dataset, 
  label_type = "TP53", 
  phenotype_class = "binary", 
  screen_method = c("Scissor"),
  path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data
  alpha = NULL, 
  cutoff = 0.05 
)

Returning structure: A list containing:

  • scRNA_data: A Seurat object after screening
  • scissor_result: The result of Scissor screening
  • reliability_test: Reliability test results

Cell level Evaluation:

You can use Sissor::evaluate.cell() to obtain some supporting information for each Scissor selected cell. First, prepare a benchmark dataset yourself.

evaluate_summary <- Scissor::evaluate.cell(
    'your_benchmark_data.RData', # file path
    scissor_result$scRNA_data, # The Seurat object after screening
    FDR = 0.05,
    bootstrap_n = 100
)

helpful documentation:

Scissor-Cell Level Evaluations

3.2 (Option B) scPAS Screening

Parameters pass to ... when using scPAS method (basically adapted from the scPAS’s documentation):

  • Parameters passed to scPAS::scPAS()

    These parameters directly interface with the core scPAS() function from the original package:

    • assay: Name of Assay to get.
    • imputation: Logical. imputation or not.
    • nfeature: Numeric. The Number of features to select as top variable features in sc_data. Top variable features will be used to intersect with the features of matched_bulk. Default is NULL and all features will be used.
    • alpha: Numeric. Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in [0,1]. A larger alpha lays more emphasis on the l1 norm.
    • network_class: The source of feature-feature similarity network. By default this is set to sc and the other one is bulk.

usage:

scpas_result = Screen(
  matched_bulk = matched_bulk,
  sc_data = A_Seurat_object,
  phenotype = phenotype,
  label_type = "TP53", # The filtering labels are stored in the `@misc` 
  screen_method = "scpas",
  phenotype_class = "binary"
)

returning structure: A list containing:

  • scRNA_data: A Seurat object after screening

3.3 (Option C) scAB Screening

Parameters pass to ... when using scAB method (basically adapted from the scAB’s documentation):

  • alpha: Coefficient of phenotype regularization, default is 0.005
  • alpha_2: Coefficient of cell-cell similarity regularization, default is 5e-05
  • maxiter: Maximum number of iterations, default is 2000
  • tred: Threshold for early stopping, default is 2

usage:

scab_result = Screen(
  matched_bulk = your_matched_bulk,
  sc_data = A_Seurat_object,
  phenotype = your_matched_phenotype,
  label_type = "TP53", # The filtering labels are stored in the `@misc` 
  screen_method = "scAB",
  phenotype_class = "binary",
)

returning structure: A list containing:

  • scRNA_data: A Seurat object after screening
  • scAB_result: A list with the submatrix and loss value

3.4 (Option D) scPP Screening

Parameters pass to ... when using scPP method :

  • ref_group: The reference group for the binary analysis, default is 1
  • Log2FC_cutoff: The cutoff for the log2 fold change of the binary analysis, default is 0.585
  • estimate_cutoff: Effect size threshold for continuous traits, default is 0.2
  • probs: Quantile cutoff for cell classification, default is 0.2

usage:

# This will take several hours
scpp_result = Screen(
  matched_bulk = your_matched_bulk,
  sc_data = A_Seurat_object,
  phenotype = your_matched_phenotype,
  label_type = "TP53", # The filtering labels are stored in the `@misc` 
  screen_method = "scpp",
  phenotype_class = "binary",
)

returning structure: A list containing:

  • scRNA_data: A Seurat object after screening

3.8 (Optional) Merge screening results

If you have performed multiple screening methods one the same single-cell data, you can use the MergeResult to merge the screening results of these methods. The Seurat object or a results list from Screen is accepted.

merged_seurat=MergeResult(
    your_scissor_result, 
    your_scPAS_result, 
    your_scAB_result, 
    your_scPP_result
)

# * mixed input form is alse supported 

merged_seurat=MergeResult(
    your_scissor_result$scRNA_data, 
    your_scPAS_result$scRNA_data, 
    your_scAB_result, 
    your_scPP_result,
)

This function simply merges the meta.data and misc slots of the input Seurat objects. Please note that the intermediate data (e.g., scissor_result$reliability.test or scab_result$scAB_result) will not be preserved in this process.


4. Visualization

Here we provide some visualization methods for the screening results. Considering that many people have different needs for data visualization, SigBridgR hardly provides visualization (except for fraction plot and upset plot, because we provide some statistic results for them). We only provide the source code for reference.

4.1 UMAP for screening results

example:

Suppose you have performed all algorithm screening on your Seurat object and wish to examine the distribution across different celltypes and patient, you may reference and use the following code:

library(patchwork)
library(zeallot)
# library(Seurat)
# library(purrr)

c(
    celltype_umap,
    patient_umap,
    scissor_umap,
    scab_umap,
    scpas_umap,
    scpp_umap
) %<-%
    purrr::map(
        c("celltype", "patient", "scissor", "scAB", "scPAS", "scPP"), # make sure these column names exist
        ~ Seurat::DimPlot(
            your_seurat_obj,
            group.by = .x,
            pt.size = 0.05,
            reduction = "umap"
        ) +
            ggplot2::ggtitle(.x)
    )

# * Show
umaps = celltype_umap +
    patient_umap +
    scissor_umap +
    scab_umap +
    scpas_umap +
    scpp_umap +
    plot_layout(ncol = 2)

umaps

This will generate six UMAP plots separately.

Or suppose you have performed scPAS screening on your Seurat object and want to visualize the distribution of prediction confidence scores, you may reference and use the following code:

library(patchwork)
library(zeallot)
# library(Seurat)
# library(purrr)

c(scPAS_Pvalue_umap, scPAS_NRS_umap) %<-%
    purrr::map(
        c("scPAS_Pvalue", "scPAS_NRS"),
        ~ Seurat::FeaturePlot(
            object = your_seurat_obj,
            features = .x,
        ) +
            ggplot2::ggtitle(.x) +
            theme(legend.position = "right")
    )

# * Show
scPAS_Pvalue_umap | scPAS_NRS_umap

This will generate two plots, one for each feature specified in feature.

helpful documentation:

Seurat::DimPlot - https://satijalab.org/seurat/reference/DimPlot.html

Seurat::FeaturePlot - https://satijalab.org/seurat/reference/FeaturePlot.html

4.2 Stack bar plot for screening results

Key parameters for ScreenFractionPlot:

  • seurat_obj: A Seurat object after screening.
  • group_by: Used to specify the column of the meta.data in seurat_obj. The plot results will be grouped by this parameter.
  • screen_type: Screening algorithm used before. (case-sensitive, e.g., “scissor” for Scissor results)
  • show_null: Logical, whether to show groups with zero cells (default: FALSE).
  • plot_color Custom color palette (named vector format):
    • Required names: “Positive”, “Negative”, “Neutral”, “Other”
    • Default: c(“Neutral”=“#CECECE”, “Other”=“#CECECE”, “Positive”=“#ff3333”, “Negative”=“#386c9b”)

Suppose you have already performed the scPAS algorithm screening on your Seurat object, and you want to check the proportion of positive cells across different patients. You can refer to and use the following code:

# based on `ggplot2`
plot <- ScreenFractionPlot(
   screened_seurat = scpas_result$scRNA_data, 
   group_by = "patient", # grouping basis for the x-axis
   screen_type = "scPAS",
   plot_title = "scPAS Screening Results"
)

If you have performed multiple screening methods and already merged the results, you can use the following code:

plot <- ScreenFractionPlot(
   screened_seurat = merged_seurat, 
   group_by = "patient", # grouping basis for the x-axis
   screen_type = c("scPAS", "scAB", "scPP"), # multiple screening results
   plot_title = "Screening Results" #  A screen_type prefix will be added to the current plot title
)

The order of the groups is determined by the proportion of Positive cells within each group.

returning structure:

If a single screen_type is specified

  • stats: A data frame containing the proportion of positive cells for each group.
  • plot: A ggplot object.

If multiple screen_types are specified

  • stats: A list containing data frames containing the proportion of positive cells for each group.
  • plot: A list containing each ggplot objects.
  • combined_plot: A ggplot object containing all the plots (2*2 grid).

4.3 Venn diagram for screening results

ggVennDiagram is used to generate a Venn diagram for the screening results. Suppose you have performed some of the screening algorithms on your Seurat object, and you want to check the overlap of the cells selected by each algorithm. You can refer to and use the following code:

example:

library(ggVennDiagram)

# * get the cell vectors
scissor_pos <- colnames(scissor_result$scRNA_data)[
    which(scissor_result$scRNA_data$scissor == "Positive")
]
scab_pos <- colnames(scab_result$scRNA_data)[
    which(scab_result$scRNA_data$scAB == "Positive")
]
scpas_pos <- colnames(scpas_result$scRNA_data)[
    which(scpas_result$scRNA_data$scPAS == "Positive")
]
scpp_pos <- colnames(scissor_result$scRNA_data)[
    which(scpp_result$scRNA_data$scPP == "Positive")
]
# # *If you have merged the results, you can use the following code instead:
# c(scissor_pos, scab_pos, scpas_pos, scpp_pos) %<-%
#     purrr::map(
#         c("scissor", "scAB", "scPAS", "scPP"),
#         ~ colnames(merged_seurat)[
#             which(merged_seurat[[.x]] == "Positive")
#         ]
#     )

all_cells <- colnames(your_seurat_obj) # this obj can be changed to other

# * create a list of cell vectors
pos_venn = list(
    scissor = scissor_pos,
    scpas = scpas_pos,
    scab = scab_pos,
    scpp = scpp_pos,
    all_cells = all_cells
    # * you can add more groups here
)

set.seed(123)

venn_plot = ggVennDiagram::ggVennDiagram(
    x = pos_venn,
    # * the labels of each group to be shown on the diagram
    category.names = c(
        "Scissor",
        "scPAS",
        "scAB",
        "scPP",
        "All cells"
    ),
    # * the colors of each group
    set_color = c(
        "#a33333ff",
        "#37ae00ff", 
        "#2a2a94ff",
        "#9c8200ff",
        "#008383ff"
    )
) +
    ggplot2::scale_fill_gradient(low = "white", high = "#ffb6b6ff") +
    ggplot2::ggtitle("Screening Venn Diagram")

helpful link:

ggVennDiagram - https://gaospecial.github.io/ggVennDiagram/

4.4 Upset plot for screening results

If too many screening meyhods are selected, the number of intersections among cells screened by different methods will also increase. In this case, using an upset plot is more intuitive and neat than a Venn diagram.

ggupset is used for visualizing upset plot.

Key parameters for ScreenUpset:

  • screened_seurat: A Seurat object after screening.
  • screen_type: Screening algorithm used before. (case-sensitive, e.g., “scissor” for Scissor results)
  • n_intersections : Number of intersections to display in the plot. Default: 20.
  • x_lab : Label for the x-axis. Default: “Screen Set Intersections”.
  • y_lab : Label for the y-axis. Default: “Number of Cells”.
  • title : Plot title. Default: “Cell Counts Across Screen Set Intersections”.
  • bar_color : Color for the bars in the plot. Default: “#4E79A7”.
  • combmatrix_point_color : Color for points in the combination matrix. Default: “black”.
  • ... : Additional arguments passed to ggplot2::theme() for customizing the plot appearance.
upset <- ScreenUpset(screened_seurat = merged_seurat)

returning structure:

  • plot: A ggplot object.
  • stats: A data frame containing the numbers of positive cells for each group.

helpful link:

ggupset - https://github.com/const-ae/ggupset


5. Example

5.1 Survival-associated cell screening

Here we use the example data LUAD to demonstrate how to use the functions in SigBridgeR to screen cells associated with phenotype.

if (requireNamespace("here", quietly = TRUE)) {
  setwd(here::here())         
  knitr::opts_knit$set(root.dir = here::here())  
}
library(SigBridgeR)
library(zeallot)
library(Seurat)

# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "survival")

dim(mat_exam)
 #[1] 33694  1093
dim(bulk) 
# [1] 4071  506
bulk[1:6,1:6]
#         TCGA-69-7978 TCGA-62-8399 TCGA-78-7539 TCGA-73-4658 TCGA-44-6775 TCGA-44-2655
# HIF3A         4.2598      11.6239       9.1362       5.0288       4.0573       5.5335
# RTN4RL2       8.2023       5.5819       3.5365       7.4156       7.7107       5.3257
# HMGCLL1       2.7476       5.8513       3.8334       3.6447       2.9188       4.8820
# LRRTM1        0.0000       0.4628       4.7506       6.8005       7.7819       2.2882
# GRIN1         6.6074       5.4257       4.9563       7.3510       3.5361       3.3311
# LRRTM3        1.7458       2.0092       0.0000       1.4468       0.0000       0.0000
nrow(pheno)
# [1] 506
head(pheno)
#               time status
# TCGA-69-7978  4.40      0
# TCGA-62-8399 88.57      0
# TCGA-78-7539 25.99      0
# TCGA-73-4658 52.56      1
# TCGA-44-6775 23.16      0
# TCGA-44-2655 43.50      0

This single-cell RNA data is from humans. We set many parameters to FALSE or 0 in order to maximize the flexibility of downstream analyses and capture a broader range of biological signals, so as to avoid insignificant results caused by too small a dataset.

Now we use SCPreProcess() to pre-process the data.

seurat = SCPreProcess(
  sc = mat_exam,
  quality_control = FALSE,
  data_filter = FALSE,
  min_cells = 0,
  min_features = 0,
  scale_features = rownames(mat_exam),
  dims = 1:20,
  resolution = 0.1
)

BulkPreProcess() is to pre-process the bulk expression data and the related clinical data.frame (containing bulk sample information) to perform PCA. Here, we want to retain the data so that the screening can reflect the most accurate situation. You can choose whether to filter or retain based on your own needs. See also 2.2 Bulk expression data for more details.

Thus far, we have completed all the data preprocessing. We are now ready to formally employ various single-cell phenotypic screening algorithms.

First, we use scissor to screen cells associated with survival.

scissor_result = Screen(
    matched_bulk = bulk,
    sc_data = seurat,
    phenotype = pheno,
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "Scissor",
    alpha = 0.05
)

# ℹ [2025/09/08 17:03:20] Scissor start...
# ℹ [2025/09/08 17:03:20] Start from raw data...
# ℹ Using "RNA_snn" graph for network.
# ℹ [2025/09/08 17:03:20] Normalizing quantiles of data...
# ℹ [2025/09/08 17:03:20] Subsetting data...
# ℹ [2025/09/08 17:03:21] Calculating correlation...
# ----------------------------------------------------------------------------------------------------
# Five-number summary of correlations:

#         0%        25%        50%        75%       100% 
# -0.2342323  0.0594385  0.1118708  0.1642065  0.5250605 
# ----------------------------------------------------------------------------------------------------
# ℹ [2025/09/08 17:03:21] Perform cox regression on the given clinical outcomes:
# ✔ Statistics data saved to `Scissor_inputs.RData`.
# ℹ [2025/09/08 17:03:22] Screening...
# [1] "alpha = 0.05"
# [1] "Scissor identified 246 Scissor+ cells and 244 Scissor- cells."
# [1] "The percentage of selected cell is: 44.831%"
# ----------------------------------------------------------------------------------------------------

table(scissor_result$scRNA_data$scissor)
# Negative  Neutral Positive 
#       73      755      265 

You will see an additional “Scissor_inputs.RData” in the working directory. This is the intermediate data generated by the Scissor method, which we can use to save running time. Meanwhile, we set reliability_test=TRUE, which will run an additional reliability test.

scissor_result = Screen(
    matched_bulk = bulk, # doesn't need to be provided Since the intermediate data is provided
    sc_data = seurat,
    phenotype = pheno_ok, # doesn't need to be provided Since the intermediate data is provided
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "Scissor",
    alpha = 0.05,
    path2load_scissor_cache = "Scissor_inputs.RData",
    reliability_test = TRUE
)
# ℹ [2025/09/08 16:07:48] Scissor start...
# ℹ [2025/09/08 16:07:48] Loading data from `Scissor_inputs.RData`...
# ℹ [2025/09/08 16:07:48] Screening...
# [1] "alpha = 0.05"
# [1] "Scissor identified 249 Scissor+ cells and 245 Scissor- cells."
# [1] "The percentage of selected cell is: 45.197%"

# --------------------------------------------------------------------------------
# ℹ [2025/09/08 16:07:54] Start reliability test

# Attaching package: ‘survival’

# The following object is masked from ‘package:future’:

#     cluster

# [1] "|**************************************************|"
# [1] "Perform cross-validation on X with true label"
# Finished!
# [1] "|**************************************************|"
# [1] "Perform cross-validation on X with permutated label"
# Finished!
# [1] "Test statistic = 0.590"
# [1] "Reliability significance test p = 0.000"
# ✔ [2025/09/08 16:10:54] reliability test: Done

scissor_result$reliability_result$statistic
# [1] 0.5899587

scissor_result$reliability_result$p
# [1] 0

scissor_result$reliability_result$c_index_test_real
#  [1] 0.6038544 0.5022222 0.6050725 0.6279391 0.5064935 0.6033520 0.6769231 0.6453089 0.4968421 0.6315789

scissor_result$reliability_result$c_index_test_back %>% unlist() %>% matrix(nrow = 10)
#            [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
#  [1,] 0.5594714 0.4943820 0.5379747 0.5583658 0.5527728 0.6146435 0.5130597 0.5701275 0.4691943 0.5177305
#  [2,] 0.5735608 0.4726891 0.4146341 0.5020661 0.5354331 0.6840149 0.5540275 0.4990758 0.5324484 0.5497382
#  [3,] 0.5960145 0.5157116 0.6191446 0.5091912 0.5659656 0.5870370 0.5090580 0.4247788 0.5734266 0.5539715
#  [4,] 0.6210191 0.6563147 0.5581948 0.4675615 0.4688222 0.5308219 0.5100402 0.6709091 0.6155779 0.5607143
#  [5,] 0.4623656 0.5716695 0.4920441 0.5403727 0.5555556 0.5911215 0.5738499 0.6189258 0.5871560 0.6084337
#  [6,] 0.5868794 0.5127660 0.7416880 0.5366876 0.5417440 0.5633803 0.5161943 0.4830372 0.4732965 0.5740132
#  [7,] 0.4975610 0.5751503 0.4801902 0.5588972 0.4940898 0.5723370 0.5788382 0.6171875 0.5884413 0.5445545
#  [8,] 0.6331361 0.5970516 0.5473888 0.6274131 0.5159705 0.5000000 0.5299760 0.4905660 0.5277778 0.5027322
#  [9,] 0.6390805 0.6548913 0.5049310 0.6299639 0.6385135 0.5964392 0.5381605 0.7462687 0.5185185 0.6585859
# [10,] 0.5207373 0.5000000 0.6051780 0.4932127 0.6892430 0.4786517 0.5619266 0.6614583 0.6502242 0.5333333

Next, we use scPAS, scAB and scPP to do the same screening. Generally you only need to change the screen_method, as long as you have not specified any particular parameters.

scpas_result = Screen(
    matched_bulk = bulk,
    sc_data = seurat,
    phenotype = pheno,
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "scPAS",
    alpha = 0.05
)
# ℹ [2025/09/08 17:04:15] Start scPAS screening.
# [1] "Step 1:Quantile normalization of bulk data."
# [1] "Step 2: Extracting single-cell expression profiles...."
# [1] "Step 3: Constructing a gene-gene similarity by single cell data...."
# Building SNN based on a provided distance matrix
# Computing SNN
# [1] "Step 4: Optimizing the network-regularized sparse regression model...."
# [1] "Perform cox regression on the given clinical outcomes:"
# [1] "alpha = 0.05"
# [1] "lambda = 0.499586979737535"
# [1] "scPAS identified 46 rick+ features and 16 rick- features."
# [1] "The percentage of selected feature is: 35.838%"

# [1] "|**************************************************|"
# [1] "Step 5: calculating quantified risk scores...."
# [1] "Step 6: qualitative identification by permutation test program with 2000 times random perturbations"
# [1] "Finished."
# ✔ [2025/09/08 17:04:24] scPAS screening done.

table(scpas_result$scRNA_data$scPAS)

# Neutral Positive 
#     1092        1 

As you can see, due to differences in data and algorithms, not every screening algorithm is able to screen out cells. You can adjust the corresponding parameters, e.g. change the alpha to NULL, this will make scPAS iterate alpha until the result is significant (or judged as having no significant cell subpopulations). See also 3.2 (Option B) scPAS Screening

scpas_result = Screen(
    matched_bulk = bulk,
    sc_data = seurat,
    phenotype = pheno,
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "scPAS",
    alpha = NULL
)
# ℹ [2025/09/22 18:46:25] Start scPAS screening.
# [1] "Step 1:Quantile normalization of bulk data."
# [1] "Step 2: Extracting single-cell expression profiles...."
# [1] "Step 3: Constructing a gene-gene similarity by single cell data...."
# Building SNN based on a provided distance matrix
# Computing SNN
# [1] "Step 4: Optimizing the network-regularized sparse regression model...."
# [1] "Perform cox regression on the given clinical outcomes:"
# [1] "alpha = 0.001"
# [1] "lambda = 7.88965313498356"
# [1] "scPAS identified 449 rick+ features and 236 rick- features."
# [1] "The percentage of selected feature is: 78.375%"

# [1] "alpha = 0.005"
# [1] "lambda = 3.32139271651783"
# [1] "scPAS identified 279 rick+ features and 159 rick- features."
# [1] "The percentage of selected feature is: 50.114%"

# [1] "alpha = 0.01"
# [1] "lambda = 2.5241108003373"
# [1] "scPAS identified 207 rick+ features and 102 rick- features."
# [1] "The percentage of selected feature is: 35.355%"

# [1] "alpha = 0.05"
# [1] "lambda = 0.8038196391727"
# [1] "scPAS identified 82 rick+ features and 24 rick- features."
# [1] "The percentage of selected feature is: 12.128%"

# [1] "alpha = 0.1"
# [1] "lambda = 0.441095530835556"
# [1] "scPAS identified 57 rick+ features and 12 rick- features."
# [1] "The percentage of selected feature is: 7.895%"

# [1] "alpha = 0.2"
# [1] "lambda = 0.220547765417778"
# [1] "scPAS identified 42 rick+ features and 12 rick- features."
# [1] "The percentage of selected feature is: 6.178%"

# [1] "alpha = 0.3"
# [1] "lambda = 0.154032875529483"
# [1] "scPAS identified 37 rick+ features and 9 rick- features."
# [1] "The percentage of selected feature is: 5.263%"

# [1] "alpha = 0.4"
# [1] "lambda = 0.115524656647112"
# [1] "scPAS identified 33 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.691%"

# [1] "alpha = 0.5"
# [1] "lambda = 0.0924197253176895"
# [1] "scPAS identified 30 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.348%"

# [1] "alpha = 0.6"
# [1] "lambda = 0.0770164377647413"
# [1] "scPAS identified 29 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.233%"

# [1] "alpha = 0.7"
# [1] "lambda = 0.0660140895126354"
# [1] "scPAS identified 28 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.119%"

# [1] "alpha = 0.8"
# [1] "lambda = 0.0577623283235559"
# [1] "scPAS identified 27 rick+ features and 7 rick- features."
# [1] "The percentage of selected feature is: 3.890%"

# [1] "alpha = 0.9"
# [1] "lambda = 0.0537890889507252"
# [1] "scPAS identified 24 rick+ features and 7 rick- features."
# [1] "The percentage of selected feature is: 3.547%"

# [1] "|**************************************************|"
# [1] "Step 5: calculating quantified risk scores...."
# [1] "Step 6: qualitative identification by permutation test program with 2000 times random perturbations"
# [1] "Finished."
# ✔ [2025/09/22 18:51:16] scPAS screening done.

Now there may be some significant cells appearing.

table(scpas_result$scRNA_data$scPAS)
# Negative  Neutral Positive 
#        2     1086        5 

Now we use scAB and scPP to screen cells.

scab_result = Screen(
    matched_bulk = bulk,
    sc_data = seurat,
    phenotype = pheno,
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "scAB"
)
# ℹ [2025/09/08 17:04:51] Start scAB screening.
# ℹ  Using "RNA_snn" graph for network.
# ℹ [2025/09/08 17:04:52] Selecting K...
# ℹ [2025/09/08 17:06:15] Run NMF with phenotype and cell-cell similarity regularization at K = 3.
# ℹ [2025/09/08 17:06:19] Screening cells...
# ℹ [2025/09/08 17:06:19] scAB screening done.

table(scab_result$scRNA_data$scAB)
#    Other Positive 
#     1018       75 
scpp_result = Screen(
    matched_bulk = bulk,
    sc_data = seurat,
    phenotype = pheno,
    label_type = "survival",
    phenotype_class = "survival",
    screen_method = "scPP"
)
# ℹ [2025/09/08 17:00:28] Start scPP screening.
# ℹ [2025/09/08 17:00:28] Finding markers...
# Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights,  :
#   Loglik converged before variable  1 ; coefficient may be infinite. 
# ℹ [2025/09/08 17:00:52] Screening...
# Genes in the gene sets NOT available in the dataset: 
#   gene_pos:   13 (6% of 230)
#   gene_neg:   54 (12% of 446)
# There are no genes significantly upregulated in Phenotype- compared to Phenotype+.
# ✔ [2025/09/08 17:00:54] scPP screening done.

table(scpp_result$scRNA_data$scPP)
# Negative  Neutral Positive 
#       52      993       48 

After these algorithms have been run, the four sets of data can be merged since screening methods performed on the same data.

screen_result = MergeResult(
    scissor_result,
    scpas_result,
    scab_result,
    scpp_result
)
# ✔ Successfully merged 4 objects.

class(screen_result)
# [1] "Seurat"
# attr(,"package")
# [1] "SeuratObject"

colnames(screen_result@meta.data)
#  [1] "orig.ident"      "nCount_RNA"      "nFeature_RNA"    "percent.mt"      "RNA_snn_res.0.1" "seurat_clusters" "scissor"         "scPAS_RS"        "scPAS_NRS"       "scPAS_Pvalue"   
# [11] "scPAS_FDR"       "scPAS"           "scAB"            "scAB_Subset1"    "Subset1_loading" "scAB_Subset2"    "Subset2_loading" "scAB_Subset3"    "Subset3_loading" "scPP"           

Finally, we can visualize the screening results. Let’s start with a Venn diagram to see the situation.

library(ggVennDiagram)
library(zeallot)

# color palette
set.seed(123)
my_colors = randomcoloR::distinctColorPalette(length(unique(screen_result$seurat_clusters)),runTsne = TRUE)

c(scissor_pos, scab_pos, scpas_pos, scpp_pos) %<-%
    purrr::map(
        c("scissor", "scAB", "scPAS", "scPP"),
        ~ colnames(screen_result)[
            which(screen_result[[.x]] == "Positive")
        ]
    )

all_cells <- colnames(screen_result) 

# * create a list of cell vectors
pos_venn = list(
    scissor = scissor_pos,
    scpas = scpas_pos,
    scab = scab_pos,
    scpp = scpp_pos,
    all_cells = all_cells
)

set.seed(123)

venn = ggVennDiagram::ggVennDiagram(
    x = pos_venn,
    # * the labels of each group to be shown on the diagram
    category.names = c(
        "Scissor",
        "scPAS",
        "scAB",
        "scPP",
        "All cells"
    ),
    # * the colors of each group
    set_color = c(
        "#a33333ff",
        "#37ae00ff", 
        "#2a2a94ff",
        "#9c8200ff",
        "#008383ff"
    )
) +
    ggplot2::scale_fill_gradient(low = "white", high = "#ffb6b6ff") +
    ggplot2::ggtitle("Screening Venn Diagram")

venn

ggplot2::ggsave("vignettes/example_figures/venn.png",plot=venn, width = 10, height = 10)
knitr::include_graphics("vignettes/example_figures/venn.png")

venn

When there are too many sets, the visualization effect of the Venn diagram is not very good. We can use a set plot instead. Only positive cells are shown in the set plot.

upset <- ScreenUpset(screened_seurat = screen_result,screen_type = c("scissor", "scPAS", "scAB", "scPP"))

# * show the cell numbers of each set
head(upset$stats)
# # A tibble: 6 × 3
#   intersection    sets         count
#   <chr>           <named list> <dbl>
# 1 scissor         <chr [1]>      246
# 2 scPAS           <chr [1]>        0
# 3 scAB            <chr [1]>       87
# 4 scPP            <chr [1]>       49
# 5 scissor & scPAS <chr [2]>        0
# 6 scissor & scAB  <chr [2]>       11

ggplot2::ggsave("vignettes/example_figures/upset.png",plot=upset$plot, width = 10, height = 10)
knitr::include_graphics("vignettes/example_figures/upset.png")

upset

A bar chart showing proportions can also be used to examine the screening results. Since the example data does not have sample metadata, we have created a fictional Sample column.

set.seed(123)
# *fictional sample column
screen_result$Sample <- sample(paste0("Sample", 1:10), ncol(screen_result), replace = TRUE)

table(screen_result$Sample)
#  Sample1 Sample10  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8  Sample9 
#       98      117       96      119       95      100      101      131      115      121 

fraction_list = ScreenFractionPlot(
    screened_seurat = screen_result,
    group_by = "Sample",
    screen_type = c("scissor", "scPP", "scAB","scPAS"), # scPAS is not included due to no positive cells
    show_null = FALSE,
    plot_color = NULL,
    show_plot = TRUE
)
# Creating plots for 3 screen types...
knitr::include_graphics("vignettes/example_figures/fraction.png")

fraction

The fraction_list contains the statistical data and charts for each screening algorithm.

fraction_list
    ├── stats
    │   ├── scissor
    │   ├── scAB
    │   ├── scPAS
    │   └── scPP
    ├── plots
    │   ├── scissor
    │   ├── scAB
    │   ├── scPAS
    │   └── scPP   
    └── combined_plot # show 3 plots in one plot

UMAP is the most commonly used type of plot in academic literature.

library(patchwork)
library(zeallot)

sample_umap = Seurat::DimPlot(
  screen_result,
  group.by = "Sample",
  pt.size = 0.2,
  reduction = "umap",
  cols = my_colors
) +
  ggplot2::ggtitle("Sample")


c(
  scissor_umap,
  scab_umap,
  scpas_umap,
  scpp_umap
) %<-%
  purrr::map(
    c("scissor", "scAB", "scPAS", "scPP"), # make sure these column names exist
    ~ Seurat::DimPlot(
      screen_result,
      group.by = .x,
      pt.size = 0.2,
      reduction = "umap",
      cols = c(
        "Neutral" = "#CECECE",
        "Other" = "#CECECE",
        "Positive" = "#ff3333",
        "Negative" = "#386c9b"
      )
    ) +
      ggplot2::ggtitle(.x)
  )

# * Show
umaps = sample_umap +
  scissor_umap +
  scab_umap +
  scpas_umap +
  scpp_umap +
  plot_layout(ncol = 2)

umaps
knitr::include_graphics("vignettes/example_figures/umaps.png")

umaps

5.2 Continuous phenotype associated cell screening

Generally speaking, the process is the same as described in 5.1 Survival-associated cell screening. Here, only the preprocessing of continuous phenotypic data is introduced.

library(SigBridgeR)
library(zeallot)
library(Seurat)
setwd(here::here())

# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "continuous")

dim(mat_exam)
 #[1] 33694  1093
dim(bulk) 
# [1] 4106  289
bulk[1:6,1:6]
#        TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01 TCGA-AD-6963-01
# HIF3A           2.3437          2.0858          6.0759          1.9506          5.4777          4.4634
# CAMK4           4.9331          2.3709          4.1387          1.1557          4.1746          3.2363
# RNF112          2.4817          2.4947          3.5941          2.3486          4.9185          1.4621
# SPN             5.6704          6.8577          8.0598          5.0049          7.6076          7.3960
# LRRTM1          1.6031          0.9465          1.9142          0.0000          3.2523          0.0000
# GRIN1           6.4944          4.3225          2.8073          7.3460          4.5000          3.1816

# * A named vector
head(pheno)
# TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01 TCGA-AD-6963-01 
#             178              65              91             206              63              67 

If your phenotype is a data.frame, try this to convert it to a named vector:

pheno = setNames(your_data.frame$continuous_numeric, your_data.frame$sample)

5.3 Binarized phenotype associated cell screening

This process is also the same as described in 5.1 Survival-associated cell screening. Here, only the preprocessing of binary phenotypic data is introduced.

library(SigBridgeR)
library(zeallot)
library(Seurat)
setwd(here::here())

# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "binary")

dim(mat_exam)
 #[1] 33694  1093
dim(bulk) 
# [1] 4106  434
bulk[1:6,1:6]
#        TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01
# HIF3A           3.7172          2.3437          2.0858          6.0759          1.9506          5.4777
# CAMK4           3.0698          4.9331          2.3709          4.1387          1.1557          4.1746
# RNF112          1.3702          2.4817          2.4947          3.5941          2.3486          4.9185
# SPN             5.5207          5.6704          6.8577          8.0598          5.0049          7.6076
# LRRTM1          3.2408          1.6031          0.9465          1.9142          0.0000          3.2523
# GRIN1           3.0698          6.4944          4.3225          2.8073          7.3460          4.5000

# * A named vector
head(pheno)
# TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01 
#               1               1               1               1               1               1 

If your phenotype is a data.frame, your binary variable is stored in the “data” column, categorized as ‘Tumor’ and ‘Normal’, we will assign ‘Tumor’ a value of 1 and ‘Normal’ a value of 0. In this way, cells screened as Positive will be associated with ‘Tumor’. Try this to convert it to a named vector:

pheno = mutate(
    pheno,
    data = case_when(
        data == "Tumor" ~ 1,
        data == "Normal" ~ 0
    )
)
pheno = setNames(pheno$data, pheno$Sample)

6. Other function details

6.1 Add miscellaneous information to the Seurat object

SigBridgeR uses AddMisc() to record what data features or evidence the various screening algorithms are based on during execution.

  • AddMisc() : Add miscellaneous information to the Seurat object. Support for adding multiple attributes to the SeuratObject@misc slot simultaneously.
# basic usage
seurat_obj <- AddMisc(seurat_obj, "QC_stats" = qc_df)

# Auto-incrementing example when `cover` set to FALSE
seurat_obj <- AddMisc(seurat_obj, markers = markers1)
seurat_obj <- AddMisc(seurat_obj, markers = markers2, cover=FALSE)

# Add multiple attributes to the `SeuratObject@misc` slot simultaneously
seurat_obj <- AddMisc(seurat_obj, markers1 = markers1, markers2 = markers2)

6.2 Calculate the variance of each row in a matrix

  • rowVars() : Calculate the variance of each row in a matrix.
 # Basic usage with a matrix
 mat <- matrix(1:12, nrow = 3)
 rowVars(mat)

 # With missing values
 mat[1, 2] <- NA
 mat[2, 3] <- NA
 rowVars(mat, na.rm = TRUE)   # Excludes NAs
 rowVars(mat, na.rm = FALSE)  # Includes NAs (returns NA for affected rows)

 # With a data frame
 df <- data.frame(
   a = c(1, 4, 7),
   b = c(2, 5, 8),
   c = c(3, 6, 9)
 )
 rowVars(df)

 # Edge case: single column (variance is 0)
 single_col <- matrix(1:3, ncol = 1)
 rowVars(single_col)  # Returns NaN due to division by 0

6.3 Detect Rows with Zero Variance in Matrices

  • Check0VarRows: Detect Rows with Zero Variance in Matrices

The Check0VarRows function identifies rows with zero variance (constant values) or exclusively NA values in numeric matrices, including sparse matrices of class dgCMatrix. It throws an informative error if such rows are detected, listing their names. This is particularly useful for preprocessing steps in statistical analyses (e.g., PCA, regression) where constant rows may cause computational issues or misinterpretations.

Parameters
  • mat (Required)
    A numeric matrix or sparse matrix of class dgCMatrix (from the Matrix package). Rows represent features (e.g., genes), and columns represent observations. Must have row names to identify problematic rows in error messages.

  • call (Optional, Advanced)
    The environment from which the function was called. Used internally for error reporting. Defaults to rlang::caller_env(). Most users can ignore this parameter.

Details

Algorithm:

  • For dense matrices, uses rowVars (Section 6.2) implementation to compute row variances with efficient NA handling.

  • For sparse matrices (dgCMatrix), leverages a mathematical identity to compute variance without dense intermediate matrices:
    Var(x)=x2(x)2nn1 \text{Var}(x) = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}

    Rows with <=1 non-zero observation are treated as zero-variance to avoid numerical instability.

# Example 1: Dense matrix with no zero-variance rows
set.seed(123)
mat_dense <- matrix(rnorm(100), nrow = 10)
rownames(mat_dense) <- paste0("Gene", 1:10)
Check0VarRows(mat_dense)  # No error

# Example 2: Dense matrix with zero-variance row
mat_dense[1, ] <- rep(5, 10)  # First row is constant
Check0VarRows(mat_dense)      # Error: "Detected 1 gene(s) with zero variance: Gene1"

# Example 3: Sparse matrix (dgCMatrix)
library(Matrix)
mat_sparse <- as(matrix(rpois(100, 0.5), nrow = 10), "dgCMatrix")
rownames(mat_sparse) <- paste0("Gene", 1:10)
Check0VarRows(mat_sparse)  # Error if any row has zero variance
Notes
  • Row Names Requirement: Ensure the input matrix has row names. Without them, the error message will show indices instead of names.
  • NA Handling: Rows with all NA values are treated as zero-variance and will trigger an error.

7. References

  1. Sun D, Guan X, Moran AE, Wu LY, Qian DZ, Schedin P, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022 Apr;40(4):527–38.

  2. Xie A, Wang H, Zhao J, Wang Z, Xu J, Xu Y. scPAS: single-cell phenotype-associated subpopulation identifier. Briefings in Bioinformatics. 2024 Nov 22;26(1):bbae655.

  3. Zhang Q, Jin S, Zou X. scAB detects multiresolution cell states with clinical significance by integrating single-cell genomics and bulk sequencing data. Nucleic Acids Research. 2022 Nov 28;50(21):12112–30.

  4. WangX-Lab/ScPP Internet. cited 2025 Aug 31. Available from: https://github.com/WangX-Lab/ScPP