Full Tutorial for SigBridgeR
0. Preface
0.1 Contents
0.1 Introduction to SigBridgeR
SigBridgeR (short for Significant cell-to-phenotype Bridge in R) is an R package for screening cells highly associated with phenotype data using single-cell RNA-seq, bulk expression and sample_related phenotype data at a pan-cancer level. It integrates functionality from these R packages: Github-sunduanchen/Scissor, Github-Qinran-Zhang/scAB, Github-WangX-Lab/ScPP and Github-aiminXie/scPAS.
1. Installation
Install SigBridgeR using one of these methods:
1.1 Release from GitHub
Usually we recommend installing the latest release from GitHub because of the latest features and bug fixes.
if(!requireNamespace("remotes")) {
install.packages("remotes")
}
remotes::install_github("WangLabCSU/SigBridgeR")
1.2 Release from r-universe
install.packages("SigBridgeR", repos = "https://wanglab.r-universe.dev")
1.8 Check Dependencies
You can use this function to quickly verify installed dependencies and their versions:
CheckPkgs <- function(packages) {
CheckSinglePkg <- function(pkg_spec) {
installed <- requireNamespace(pkg_spec$pkg, quietly = TRUE)
current_version <- NA_character_
if (installed) {
current_version <- as.character(utils::packageVersion(pkg_spec$pkg))
if (!is.null(pkg_spec$version)) {
installed <- current_version >= pkg_spec$version
}
}
data.frame(
Package = pkg_spec$pkg,
Required_Version = if (is.null(pkg_spec$version)) "Any" else pkg_spec$version,
Installed = installed,
Current_Version = if (installed) current_version else NA,
stringsAsFactors = FALSE
)
}
result <- do.call(rbind, lapply(packages, CheckSinglePkg))
rownames(result) <- NULL
return(result)
}
CheckPkgs(list(
list(pkg = "Seurat", version = "5.3.0"),
list(pkg = "dplyr", version = "1.1.4"),
list(pkg = "cli", version = "3.6.5"),
list(pkg = "AUCell", version = "1.26.0"),
list(pkg = "future", version = "1.67.0"),
list(pkg = "IDConverter", version = "0.3.5"),
list(pkg = "Matrix", version = "1.7.3"),
list(pkg = "scAB", version = "1.0.0"),
list(pkg = "Scissor", version = "2.0.0"),
list(pkg = "scPAS", version = "0.2.0"),
list(pkg = "ScPP", version = "0.0.0.9000"),
list(pkg = "tibble", version = "3.3.0"),
list(pkg = "tidyr", version = "1.3.1"),
list(pkg = "data.table", version = "1.17.8"),
list(pkg = "ggforce", version = "0.5.0"),
list(pkg = "ggplot2", version = "4.0.0"),
list(pkg = "ggupset", version = "0.4.1"),
list(pkg = "purrr", version = "1.1.0"),
list(pkg = "edgeR", version = "4.2.2"),
list(pkg = "scales", version = "1.4.0"),
list(pkg = "preprocessCore", version = "1.66.0"),
# suggest
list(pkg = "randomcoloR", version = "1.1.0.1"),
list(pkg = "reticulate", version = "1.43.0"),
list(pkg = "patchwork", version = "1.3.2"),
list(pkg = "org.Hs.eg.db", verion = "3.19.1"),
list(pkg = "zeallot", version = "0.1.0"),
list(pkg = "ggVennDiagram", version = "1.5.4")
)
)
If you encounter compatibility issues, you can install the version of the package indicated here.
2. Loading and preprocessing data
First load the packaged:
2.1 Single-cell RNA-seq Data
You can use function SCPreProcess
to preprocess your
single-cell RNA-seq data. Here are some options:
2.1.1 (Option A) Start from Raw Matrix
When starting from a raw count matrix (data.frame, matrix or
dgCMatrix), SCPreProcess
will automatically perform Seurat
preprocess (including
NormalizeData, FindVariableFeatures, ScaleData, RunPCA, FindNeighbors, FindClusters, RunTSNE, RunUMAP
).
SCPreProcess
returns a fully preprocessed Seurat object for
downstream use.
your_seurat <- SCPreProcess(
your_matrix,
meta_data = NULL,
project = "Screen_Single_Cell", # Parameters used in Seurat preprocessing pipeline
min_cells = 400,
min_features = 0,
normalization_method = "LogNormalize",
scale_factor = 10000,
scale_features = NULL,
selection_method = "vst",
resolution = 0.6,
dims = 1:10,
verbose = TRUE,
future_global_maxsize = 6 * 1024^3,
quality_control = TRUE,
quality_control.pattern = c("^MT-"), # human for '^MT-', mouse for '^mt-', or specify your own pattern
data_filter = TRUE,
data_filter.nFeature_RNA_thresh = c(200, 6000), # [min, max]
data_filter.percent.mt = 20,
column2only_tumor = NULL,
...
)
-
Build the initial Seurat object
- Imports the count matrix.
- Applies the first, coarse filters: genes must be detected in >=
min_cells
cells; cells must contain >=min_features
genes.
- Attaches optional sample metadata.
- Imports the count matrix.
-
Mitochondrial QC (default: ON)
- A single regular expression (
^MT-
or^mt-
) is used to compute the percentage of mitochondrial reads per cell.
- The metric is stored in
object$percent.mt
for later filtering.
- A single regular expression (
-
Cell filtering (default: ON)
Cells are retained only if they satisfy all of the following:-
nFeature_RNA
lies between the lower and upper bounds defined bydata_filter.nFeature_RNA_thresh
.
-
percent.mt
is below the user-defined cut-off (data_filter.percent.mt
).
-
-
Normalisation, variable-feature selection and scaling
Encapsulated inProcessSeuratObject()
:- Counts are normalised with the chosen method (default:
LogNormalize).
- Highly variable features are identified using the requested
algorithm (default: vst).
- Expression values of those features are scaled to unit variance.
- Counts are normalised with the chosen method (default:
LogNormalize).
Append extra observations
If the input object contains anobs
slot, it is merged intoobject@meta.data
.-
Dimensionality reduction and clustering
ClusterAndReduce()
runs:- PCA -> nearest-neighbour graph -> Louvain clustering
(resolution parameterised).
- UMAP for two-dimensional visualization.
- Optionally removes very small or low-quality clusters.
- PCA -> nearest-neighbour graph -> Louvain clustering
(resolution parameterised).
Tumour-cell flagging
FilterTumorCell()
creates a binary label (1 = Tumour, 0 = Normal) from the column specified bycolumn2only_tumor
.
The new label can be used downstream to restrict analyses to malignant cells.
2.1.2 (Option B) Start from AnnDataR6 Object
SCPreProcess
also supports AnnData objects. You may
reference and use the following code:
reticulate::use_pythonenv("The_path_to_your_python")
# reticulate::use_condaenv("conda_env_name")
anndata_obj <- anndata::read_h5ad("path_to_your_file.h5ad") # Or other formats, make sure the matrix is in obj$X.
your_seurat <- SCPreProcess(
anndata_obj,
meta_data = NULL,
project = "Screen_Single_Cell", # Parameters used in Seurat preprocessing pipeline
min_cells = 400,
min_features = 0,
normalization_method = "LogNormalize",
scale_factor = 10000,
scale_features = NULL,
selection_method = "vst",
resolution = 0.6,
dims = 1:10,
verbose = TRUE,
future_global_maxsize = 6 * 1024^3,
quality_control = TRUE,
quality_control.pattern = c("^MT-"), # human for '^MT-', mouse for '^mt-', or specify your own pattern
data_filter = TRUE,
data_filter.nFeature_RNA_thresh = c(200, 6000), # [min, max]
data_filter.percent.mt = 20,
column2only_tumor = NULL,
...
)
The description of data (meta.data) in anndata_obj$obs
will be add to your_seurat@meta.data
.
helpful documentation:
2.1.8 (Optional) Filter Out Tumor Cells
If you aim to filter out phenotype-associated cells in all tumor
cells. With a preprocessed Seurat object (containing results from ),
SCPreProcess
will filter out tumor cells using the
specified metadata column:
your_seurat <- SCPreProcess(your_seurat, column2only_tumor = "Tissue")
Note: I don’t recommend using columns like
column2only_tumor = "Celltype"
as tumor cell identities vary across tissues. instead:
Create a Dedicated Column: Add a new metadata column (e.g., is_tumor) to explicitly label cells:“Tumo(u)r”/“Normal”
Code Example:
2.2 Bulk expression data
Here are some methods for processing bulk RNA-seq gene expression data matrices.
2.2.1 Evaluate the quality of your bulk RNA-seq data
BulkPreProcess
performs comprehensive quality control on
bulk RNA-seq data.
Key parameters for BulkPreProcess
:
-
data
: Expression matrix with genes as rows and samples as columns, or a list containing count_matrix and sample_info. -
sample_info
: Sample information data frame (optional), ignored if data is a list. -
gene_symbol_conversion
: Whether to convert Ensembl version IDs and TCGA version IDs to genes with SymbolConvert in Section 2.2.2, default TRUE. -
check
: Whether to perform detailed quality checks, default TRUE. -
min_count_threshold
: Minimum count threshold for gene filtering, default 10. -
min_gene_expressed
: Minimum number of samples a gene must be expressed in, default 3. -
min_total_reads
: Minimum total reads per sample, default 1e6. -
min_genes_detected
: Minimum number of genes detected per sample, default 10000. -
min_correlation
: Minimum correlation threshold between samples, default 0.8. -
n_top_genes
: Number of top variable genes for PCA analysis, default 500. -
show_plot_results
: Whether to generate PCA visualization plots, default TRUE. -
verbose
: Whether to output detailed information, default TRUE.
Suppose you have bulk RNA-seq count data and sample information, and you want to perform comprehensive preprocessing and quality control. You can refer to and use the following code:
# Example usage of BulkPreProcess
filtered_counts <- BulkPreProcess(
data = your_count_matrix,
sample_info = your_sample_info,
gene_symbol_conversion = TRUE,
check = TRUE,
min_count_threshold = 10,
min_samples_expressed = 3,
min_total_reads = 1e6,
min_genes_detected = 10000,
min_correlation = 0.8,
n_top_genes = 500,
show_plot_results = TRUE,
verbose = TRUE
)
The function returns a filtered count matrix after applying quality control steps. The order of filtering is: first genes, then samples.
Quality Control Metrics Reported
Data Integrity: Number of missing values detected
Gene Count: Total number of genes after filtering
Sample Read Depth: Total reads per sample
Gene Detection Rate: Number of genes detected per sample
Sample Correlation: Pearson correlation between samples
PCA Variance: Variance explained by first two principal components
Batch Effects: Proportion of genes significantly affected by batch
2.2.2 Gene Symbol Conversion
SymbolConvert
performs a straightforward task:
converting common gene identifiers (e.g., Ensemble IDs, Entrez) to
standardized gene symbols by using the IDConverter
package.
# genes * samples
your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)
your_bulk_data <- SymbolConvert(your_bulk_data)
You can also use the org.Hs.eg.db
package for gene
symbol matching if you prefer not to use SymbolConvert’s built-in
IDConverter
.
library(org.Hs.eg.db)
your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)
your_bulk_data <- BulkPreProcess(your_bulk_data, gene_symbol_conversion = FALSE)
ensembl_ids <- sub("\\..*", "", rownames(your_bulk_data))
gene_symbols <- mapIds(org.Hs.eg.db,
keys = ensembl_ids,
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
rownames(your_bulk_data) <- gene_symbols
2.3 Phenotype Data
Bascially you can just use your phenotype data directly. If you are
confused about the structure Screen()
requires, please
refer to Section 5.
You can use this function to check NA values:
# `max_print`: output how many NA location messages at one time if NA exists
CheckNA <- function(data, max_print = 5) {
# Load required packages
rlang::check_installed(c("dplyr", "purrr", "cli", "data.table"))
# Convert to data.table if it's a 2D structure but not already a data.table
is_2d <- !is.null(dim(data))
dt <- NULL
if (is_2d && !data.table::is.data.table(data)) {
dt <- data.table::as.data.table(data)
} else if (is_2d) {
dt <- data
}
# Initialize output
na_info <- list()
cli::cli_h1("NA Value Check (data.table optimized)")
cli::cli_alert_info("Checking object with dimensions: {.val {dim(data)}}")
# Handle 1D data (vectors)
if (!is_2d) {
na_count <- sum(is.na(data))
na_positions <- which(is.na(data))
na_info$positions <- na_positions
na_info$count <- na_count
if (na_count == 0) {
cli::cli_alert_success("No NA values found in the vector!")
}
cli::cli_alert_warning("Found {.val {na_count}} NA value{?s} in vector")
if (length(na_positions) > 0 && max_print > 0) {
positions_to_show <- na_positions[
1:min(max_print, length(na_positions))
]
cli::cli_text("Positions: {.val {positions_to_show}}")
if (length(na_positions) > max_print) {
cli::cli_text(
"{.val {length(na_positions) - max_print}} additional positions not shown"
)
}
}
# For named vectors
if (!is.null(names(data))) {
na_names <- names(data)[na_positions]
na_info$names <- na_names
if (length(na_names) > 0 && max_print > 0) {
names_to_show <- na_names[1:min(max_print, length(na_names))]
cli::cli_text("Names: {.val {names_to_show}}")
}
}
} else {
# Handle 2D data (dataframes, matrices, data.tables)
# Use data.table for efficient NA counting
na_count <- dt[, sum(is.na(.SD))]
na_info$count <- na_count
if (na_count == 0) {
cli::cli_alert_success("No NA values found in the 2D data!")
}
cli::cli_alert_warning(
"Found {.val {na_count}} NA value{?s} in 2D data"
)
# Get NA positions efficiently without melt warning
if (max_print > 0) {
# Create a matrix of logical values indicating NA positions
na_matrix <- is.na(as.matrix(dt))
na_positions <- which(na_matrix, arr.ind = TRUE)
# Convert to data.table and add column names
na_positions_dt <- data.table::as.data.table(na_positions)
colnames(na_positions_dt) <- c("row", "col")
na_positions_dt$col_name <- colnames(dt)[na_positions_dt$col]
na_info$positions <- na_positions_dt
if (nrow(na_positions_dt) > 0) {
positions_to_show <- na_positions_dt[
1:min(max_print, nrow(na_positions_dt)),
]
cli::cli_text(
"First {.val {nrow(positions_to_show)}} position{?s}:"
)
for (i in 1:nrow(positions_to_show)) {
cli::cli_text(
" Row {.val {positions_to_show$row[i]}}, Col {.val {positions_to_show$col_name[i]}} (index: {.val {positions_to_show$col[i]}})"
)
}
if (na_count > max_print) {
cli::cli_text(
"{.val {na_count - max_print}} additional positions not shown"
)
}
}
}
# Column-wise summary using data.table
col_na <- dt[, lapply(.SD, function(x) sum(is.na(x)))]
col_na <- unlist(col_na)
col_na <- col_na[col_na > 0]
if (length(col_na) > 0) {
na_info$column_na <- col_na
cli::cli_text("Column-wise NA counts:")
for (col in names(col_na)) {
cli::cli_text(" {.field {col}}: {.val {col_na[col]}}")
}
}
}
return(invisible(na_info))
}
3. Screening Cells Associated with Phenotype
The function Screen
provide 4 different
options for screening cells associated with phenotype, These 4
algorithms come from the repositories mentioned in Section 0.1, and you can
choose one of them to screen your cells.
Key parameters for Screen
:
-
matched_bulk
: A data frame of bulk expression data after intersecting samples. -
sc_data
: A Seurat object after preprocessing, you can use the output ofPreprocess
function or your own preprocessed Seurat object. -
phenotype
: A data frame of phenotype data after intersecting samples. -
label_type
: A character value specifying the filtering labels are stored in theSeurat_object@misc
, default:NULL
. -
phenotype_class
: A character value specifying the phenotype data type, i.e."binary"
,"survival"
or"continuous"
. -
screen_method
: A character value specifying the screening method, i.e. “Scissor”, “scPAS”, “scAB” or “scPP” -
...
: Other parameters for the screening methods.
3.1 (Option A) Scissor Screening
Parameters pass to ...
when using Scissor
method:
-
path2save_scissor_inputs
: A character value specifying the path to save intermediate data, default:Scissor_inputs.RData
-
path2load_scissor_cahce
: A character value specifying the path to load intermediate data -
alpha
: Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in[0,1]
. A larger alpha lays more emphasis on the l1 norm. -
cutoff
: Cutoff for the percentage of the Scissor selected cells in total cells. This parameter is used to restrict the number of the Scissor selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data. -
reliability_test_n
: Permutation times (default: 10) -
nfold
: The fold number in cross-validation (default: 10)
Usage:
scissor_result = Screen(
matched_bulk = matched_bulk,
sc_data = sc_dataset, # A Seurat object after preprocessing
phenotype = matched_phenotype_data,
label_type = "TP53", # The filtering labels are stored in the `@misc`, you can change it to your own label
phenotype_class = "binary",
screen_method = c("Scissor"),
path2save_scissor_inputs = "Tmp/Scissor_inputs.RData" # Intermediate data
)
You can use the intermediate data for repeated runs. This is an
inherent feature of the scissor
.
scissor_result = Screen(
sc_data = sc_dataset,
label_type = "TP53",
phenotype_class = "binary",
screen_method = c("Scissor"),
path2load_scissor_cahce = "Tmp/Scissor_inputs.RData" # Intermediate data
)
If only the parameters alpha
and cutoff
are
adjusted, this method can also be applied.
# When `alpha = NULL`, an alpha iteration will continue until phenotype-associated cells are screened out or no cells are screened out even after exceeding the `cutoff`.
scissor_result = Screen(
sc_data = sc_dataset,
label_type = "TP53",
phenotype_class = "binary",
screen_method = c("Scissor"),
path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data
alpha = NULL,
cutoff = 0.05
)
Returning structure: A list containing:
-
scRNA_data
: A Seurat object after screening -
scissor_result
: The result of Scissor screening -
reliability_test
: Reliability test results
Cell level Evaluation:
You can use Sissor::evaluate.cell()
to obtain some
supporting information for each Scissor selected cell. First, prepare a
benchmark dataset yourself.
evaluate_summary <- Scissor::evaluate.cell(
'your_benchmark_data.RData', # file path
scissor_result$scRNA_data, # The Seurat object after screening
FDR = 0.05,
bootstrap_n = 100
)
helpful documentation:
3.2 (Option B) scPAS Screening
Parameters pass to ...
when using scPAS
method (basically adapted from the scPAS
’s
documentation):
-
Parameters passed to
scPAS::scPAS()
These parameters directly interface with the core
scPAS
() function from the original package:-
assay
: Name of Assay to get. -
imputation
: Logical. imputation or not. -
nfeature
: Numeric. The Number of features to select as top variable features insc_data
. Top variable features will be used to intersect with the features ofmatched_bulk
. Default is NULL and all features will be used. -
alpha
: Numeric. Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. Ifalpha = NULL
, a default searching vector is used. The range of alpha is in[0,1]
. A larger alpha lays more emphasis on the l1 norm. -
network_class
: The source of feature-feature similarity network. By default this is set to sc and the other one is bulk.
-
usage:
scpas_result = Screen(
matched_bulk = matched_bulk,
sc_data = A_Seurat_object,
phenotype = phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scpas",
phenotype_class = "binary"
)
returning structure: A list containing:
-
scRNA_data
: A Seurat object after screening
3.3 (Option C) scAB Screening
Parameters pass to ...
when using scAB
method (basically adapted from the scAB
’s
documentation):
-
alpha
: Coefficient of phenotype regularization, default is0.005
-
alpha_2
: Coefficient of cell-cell similarity regularization, default is5e-05
-
maxiter
: Maximum number of iterations, default is2000
-
tred
: Threshold for early stopping, default is2
usage:
scab_result = Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scAB",
phenotype_class = "binary",
)
returning structure: A list containing:
-
scRNA_data
: A Seurat object after screening -
scAB_result
: A list with the submatrix and loss value
3.4 (Option D) scPP Screening
Parameters pass to ...
when using scPP
method :
-
ref_group
: The reference group for the binary analysis, default is1
-
Log2FC_cutoff
: The cutoff for the log2 fold change of the binary analysis, default is0.585
-
estimate_cutoff
: Effect size threshold for continuous traits, default is0.2
-
probs
: Quantile cutoff for cell classification, default is0.2
usage:
# This will take several hours
scpp_result = Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scpp",
phenotype_class = "binary",
)
returning structure: A list containing:
-
scRNA_data
: A Seurat object after screening
3.8 (Optional) Merge screening results
If you have performed multiple screening methods one the same
single-cell data, you can use the MergeResult
to merge the
screening results of these methods. The Seurat object or a results list
from Screen
is accepted.
merged_seurat=MergeResult(
your_scissor_result,
your_scPAS_result,
your_scAB_result,
your_scPP_result
)
# * mixed input form is alse supported
merged_seurat=MergeResult(
your_scissor_result$scRNA_data,
your_scPAS_result$scRNA_data,
your_scAB_result,
your_scPP_result,
)
This function simply merges the meta.data
and
misc
slots of the input Seurat objects. Please note that
the intermediate data (e.g.,
scissor_result$reliability.test
or
scab_result$scAB_result
) will not be preserved in this
process.
4. Visualization
Here we provide some visualization methods for the screening results.
Considering that many people have different needs for data
visualization, SigBridgR
hardly provides visualization
(except for fraction plot and upset plot, because we provide some
statistic results for them). We only provide the source code for
reference.
4.1 UMAP for screening results
example:
Suppose you have performed all algorithm screening on your Seurat object and wish to examine the distribution across different celltypes and patient, you may reference and use the following code:
library(patchwork)
library(zeallot)
# library(Seurat)
# library(purrr)
c(
celltype_umap,
patient_umap,
scissor_umap,
scab_umap,
scpas_umap,
scpp_umap
) %<-%
purrr::map(
c("celltype", "patient", "scissor", "scAB", "scPAS", "scPP"), # make sure these column names exist
~ Seurat::DimPlot(
your_seurat_obj,
group.by = .x,
pt.size = 0.05,
reduction = "umap"
) +
ggplot2::ggtitle(.x)
)
# * Show
umaps = celltype_umap +
patient_umap +
scissor_umap +
scab_umap +
scpas_umap +
scpp_umap +
plot_layout(ncol = 2)
umaps
This will generate six UMAP plots separately.
Or suppose you have performed scPAS
screening on your
Seurat object and want to visualize the distribution of prediction
confidence scores, you may reference and use the following code:
library(patchwork)
library(zeallot)
# library(Seurat)
# library(purrr)
c(scPAS_Pvalue_umap, scPAS_NRS_umap) %<-%
purrr::map(
c("scPAS_Pvalue", "scPAS_NRS"),
~ Seurat::FeaturePlot(
object = your_seurat_obj,
features = .x,
) +
ggplot2::ggtitle(.x) +
theme(legend.position = "right")
)
# * Show
scPAS_Pvalue_umap | scPAS_NRS_umap
This will generate two plots, one for each feature specified in
feature
.
helpful documentation:
Seurat::DimPlot - https://satijalab.org/seurat/reference/DimPlot.html
Seurat::FeaturePlot - https://satijalab.org/seurat/reference/FeaturePlot.html
4.2 Stack bar plot for screening results
Key parameters for ScreenFractionPlot
:
-
seurat_obj
: A Seurat object after screening. -
group_by
: Used to specify the column of the meta.data inseurat_obj
. The plot results will be grouped by this parameter. -
screen_type
: Screening algorithm used before. (case-sensitive, e.g., “scissor” for Scissor results) -
show_null
: Logical, whether to show groups with zero cells (default: FALSE). -
plot_color
Custom color palette (named vector format):- Required names: “Positive”, “Negative”, “Neutral”, “Other”
- Default: c(“Neutral”=“#CECECE”, “Other”=“#CECECE”, “Positive”=“#ff3333”, “Negative”=“#386c9b”)
Suppose you have already performed the scPAS
algorithm
screening on your Seurat object, and you want to check the proportion of
positive cells across different patients. You can refer to and use the
following code:
# based on `ggplot2`
plot <- ScreenFractionPlot(
screened_seurat = scpas_result$scRNA_data,
group_by = "patient", # grouping basis for the x-axis
screen_type = "scPAS",
plot_title = "scPAS Screening Results"
)
If you have performed multiple screening methods and already merged the results, you can use the following code:
plot <- ScreenFractionPlot(
screened_seurat = merged_seurat,
group_by = "patient", # grouping basis for the x-axis
screen_type = c("scPAS", "scAB", "scPP"), # multiple screening results
plot_title = "Screening Results" # A screen_type prefix will be added to the current plot title
)
The order of the groups is determined by the proportion of Positive cells within each group.
returning structure:
If a single screen_type is specified
-
stats
: A data frame containing the proportion of positive cells for each group. -
plot
: A ggplot object.
If multiple screen_types are specified
-
stats
: A list containing data frames containing the proportion of positive cells for each group. -
plot
: A list containing each ggplot objects. -
combined_plot
: A ggplot object containing all the plots (2*2 grid).
4.3 Venn diagram for screening results
ggVennDiagram
is used to generate a Venn diagram for the
screening results. Suppose you have performed some of the screening
algorithms on your Seurat object, and you want to check the overlap of
the cells selected by each algorithm. You can refer to and use the
following code:
example:
library(ggVennDiagram)
# * get the cell vectors
scissor_pos <- colnames(scissor_result$scRNA_data)[
which(scissor_result$scRNA_data$scissor == "Positive")
]
scab_pos <- colnames(scab_result$scRNA_data)[
which(scab_result$scRNA_data$scAB == "Positive")
]
scpas_pos <- colnames(scpas_result$scRNA_data)[
which(scpas_result$scRNA_data$scPAS == "Positive")
]
scpp_pos <- colnames(scissor_result$scRNA_data)[
which(scpp_result$scRNA_data$scPP == "Positive")
]
# # *If you have merged the results, you can use the following code instead:
# c(scissor_pos, scab_pos, scpas_pos, scpp_pos) %<-%
# purrr::map(
# c("scissor", "scAB", "scPAS", "scPP"),
# ~ colnames(merged_seurat)[
# which(merged_seurat[[.x]] == "Positive")
# ]
# )
all_cells <- colnames(your_seurat_obj) # this obj can be changed to other
# * create a list of cell vectors
pos_venn = list(
scissor = scissor_pos,
scpas = scpas_pos,
scab = scab_pos,
scpp = scpp_pos,
all_cells = all_cells
# * you can add more groups here
)
set.seed(123)
venn_plot = ggVennDiagram::ggVennDiagram(
x = pos_venn,
# * the labels of each group to be shown on the diagram
category.names = c(
"Scissor",
"scPAS",
"scAB",
"scPP",
"All cells"
),
# * the colors of each group
set_color = c(
"#a33333ff",
"#37ae00ff",
"#2a2a94ff",
"#9c8200ff",
"#008383ff"
)
) +
ggplot2::scale_fill_gradient(low = "white", high = "#ffb6b6ff") +
ggplot2::ggtitle("Screening Venn Diagram")
helpful link:
4.4 Upset plot for screening results
If too many screening meyhods are selected, the number of intersections among cells screened by different methods will also increase. In this case, using an upset plot is more intuitive and neat than a Venn diagram.
ggupset
is used for visualizing upset plot.
Key parameters for ScreenUpset
:
-
screened_seurat
: A Seurat object after screening. -
screen_type
: Screening algorithm used before. (case-sensitive, e.g., “scissor” for Scissor results) -
n_intersections
: Number of intersections to display in the plot. Default: 20. -
x_lab
: Label for the x-axis. Default: “Screen Set Intersections”. -
y_lab
: Label for the y-axis. Default: “Number of Cells”. -
title
: Plot title. Default: “Cell Counts Across Screen Set Intersections”. -
bar_color
: Color for the bars in the plot. Default: “#4E79A7”. -
combmatrix_point_color
: Color for points in the combination matrix. Default: “black”. -
...
: Additional arguments passed toggplot2::theme()
for customizing the plot appearance.
upset <- ScreenUpset(screened_seurat = merged_seurat)
returning structure:
-
plot
: A ggplot object. -
stats
: A data frame containing the numbers of positive cells for each group.
helpful link:
ggupset - https://github.com/const-ae/ggupset
5. Example
5.1 Survival-associated cell screening
Here we use the example data LUAD to demonstrate how to use the
functions in SigBridgeR
to screen cells associated with
phenotype.
if (requireNamespace("here", quietly = TRUE)) {
setwd(here::here())
knitr::opts_knit$set(root.dir = here::here())
}
library(SigBridgeR)
library(zeallot)
library(Seurat)
# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "survival")
dim(mat_exam)
#[1] 33694 1093
dim(bulk)
# [1] 4071 506
bulk[1:6,1:6]
# TCGA-69-7978 TCGA-62-8399 TCGA-78-7539 TCGA-73-4658 TCGA-44-6775 TCGA-44-2655
# HIF3A 4.2598 11.6239 9.1362 5.0288 4.0573 5.5335
# RTN4RL2 8.2023 5.5819 3.5365 7.4156 7.7107 5.3257
# HMGCLL1 2.7476 5.8513 3.8334 3.6447 2.9188 4.8820
# LRRTM1 0.0000 0.4628 4.7506 6.8005 7.7819 2.2882
# GRIN1 6.6074 5.4257 4.9563 7.3510 3.5361 3.3311
# LRRTM3 1.7458 2.0092 0.0000 1.4468 0.0000 0.0000
nrow(pheno)
# [1] 506
head(pheno)
# time status
# TCGA-69-7978 4.40 0
# TCGA-62-8399 88.57 0
# TCGA-78-7539 25.99 0
# TCGA-73-4658 52.56 1
# TCGA-44-6775 23.16 0
# TCGA-44-2655 43.50 0
This single-cell RNA data is from humans. We set many parameters to
FALSE
or 0
in order to maximize the
flexibility of downstream analyses and capture a broader range of
biological signals, so as to avoid insignificant results caused by too
small a dataset.
Now we use SCPreProcess()
to pre-process the data.
seurat = SCPreProcess(
sc = mat_exam,
quality_control = FALSE,
data_filter = FALSE,
min_cells = 0,
min_features = 0,
scale_features = rownames(mat_exam),
dims = 1:20,
resolution = 0.1
)
BulkPreProcess()
is to pre-process the bulk expression
data and the related clinical data.frame (containing bulk sample
information) to perform PCA. Here, we want to retain the data so that
the screening can reflect the most accurate situation. You can choose
whether to filter or retain based on your own needs. See also 2.2 Bulk expression data for more
details.
Thus far, we have completed all the data preprocessing. We are now ready to formally employ various single-cell phenotypic screening algorithms.
First, we use scissor
to screen cells associated with
survival.
scissor_result = Screen(
matched_bulk = bulk,
sc_data = seurat,
phenotype = pheno,
label_type = "survival",
phenotype_class = "survival",
screen_method = "Scissor",
alpha = 0.05
)
# ℹ [2025/09/08 17:03:20] Scissor start...
# ℹ [2025/09/08 17:03:20] Start from raw data...
# ℹ Using "RNA_snn" graph for network.
# ℹ [2025/09/08 17:03:20] Normalizing quantiles of data...
# ℹ [2025/09/08 17:03:20] Subsetting data...
# ℹ [2025/09/08 17:03:21] Calculating correlation...
# ----------------------------------------------------------------------------------------------------
# Five-number summary of correlations:
# 0% 25% 50% 75% 100%
# -0.2342323 0.0594385 0.1118708 0.1642065 0.5250605
# ----------------------------------------------------------------------------------------------------
# ℹ [2025/09/08 17:03:21] Perform cox regression on the given clinical outcomes:
# ✔ Statistics data saved to `Scissor_inputs.RData`.
# ℹ [2025/09/08 17:03:22] Screening...
# [1] "alpha = 0.05"
# [1] "Scissor identified 246 Scissor+ cells and 244 Scissor- cells."
# [1] "The percentage of selected cell is: 44.831%"
# ----------------------------------------------------------------------------------------------------
table(scissor_result$scRNA_data$scissor)
# Negative Neutral Positive
# 73 755 265
You will see an additional “Scissor_inputs.RData” in the working
directory. This is the intermediate data generated by the Scissor
method, which we can use to save running time. Meanwhile, we set
reliability_test=TRUE
, which will run an additional
reliability test.
scissor_result = Screen(
matched_bulk = bulk, # doesn't need to be provided Since the intermediate data is provided
sc_data = seurat,
phenotype = pheno_ok, # doesn't need to be provided Since the intermediate data is provided
label_type = "survival",
phenotype_class = "survival",
screen_method = "Scissor",
alpha = 0.05,
path2load_scissor_cache = "Scissor_inputs.RData",
reliability_test = TRUE
)
# ℹ [2025/09/08 16:07:48] Scissor start...
# ℹ [2025/09/08 16:07:48] Loading data from `Scissor_inputs.RData`...
# ℹ [2025/09/08 16:07:48] Screening...
# [1] "alpha = 0.05"
# [1] "Scissor identified 249 Scissor+ cells and 245 Scissor- cells."
# [1] "The percentage of selected cell is: 45.197%"
# --------------------------------------------------------------------------------
# ℹ [2025/09/08 16:07:54] Start reliability test
# Attaching package: ‘survival’
# The following object is masked from ‘package:future’:
# cluster
# [1] "|**************************************************|"
# [1] "Perform cross-validation on X with true label"
# Finished!
# [1] "|**************************************************|"
# [1] "Perform cross-validation on X with permutated label"
# Finished!
# [1] "Test statistic = 0.590"
# [1] "Reliability significance test p = 0.000"
# ✔ [2025/09/08 16:10:54] reliability test: Done
scissor_result$reliability_result$statistic
# [1] 0.5899587
scissor_result$reliability_result$p
# [1] 0
scissor_result$reliability_result$c_index_test_real
# [1] 0.6038544 0.5022222 0.6050725 0.6279391 0.5064935 0.6033520 0.6769231 0.6453089 0.4968421 0.6315789
scissor_result$reliability_result$c_index_test_back %>% unlist() %>% matrix(nrow = 10)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0.5594714 0.4943820 0.5379747 0.5583658 0.5527728 0.6146435 0.5130597 0.5701275 0.4691943 0.5177305
# [2,] 0.5735608 0.4726891 0.4146341 0.5020661 0.5354331 0.6840149 0.5540275 0.4990758 0.5324484 0.5497382
# [3,] 0.5960145 0.5157116 0.6191446 0.5091912 0.5659656 0.5870370 0.5090580 0.4247788 0.5734266 0.5539715
# [4,] 0.6210191 0.6563147 0.5581948 0.4675615 0.4688222 0.5308219 0.5100402 0.6709091 0.6155779 0.5607143
# [5,] 0.4623656 0.5716695 0.4920441 0.5403727 0.5555556 0.5911215 0.5738499 0.6189258 0.5871560 0.6084337
# [6,] 0.5868794 0.5127660 0.7416880 0.5366876 0.5417440 0.5633803 0.5161943 0.4830372 0.4732965 0.5740132
# [7,] 0.4975610 0.5751503 0.4801902 0.5588972 0.4940898 0.5723370 0.5788382 0.6171875 0.5884413 0.5445545
# [8,] 0.6331361 0.5970516 0.5473888 0.6274131 0.5159705 0.5000000 0.5299760 0.4905660 0.5277778 0.5027322
# [9,] 0.6390805 0.6548913 0.5049310 0.6299639 0.6385135 0.5964392 0.5381605 0.7462687 0.5185185 0.6585859
# [10,] 0.5207373 0.5000000 0.6051780 0.4932127 0.6892430 0.4786517 0.5619266 0.6614583 0.6502242 0.5333333
Next, we use scPAS, scAB and scPP to do the same screening. Generally
you only need to change the screen_method
, as long as you
have not specified any particular parameters.
scpas_result = Screen(
matched_bulk = bulk,
sc_data = seurat,
phenotype = pheno,
label_type = "survival",
phenotype_class = "survival",
screen_method = "scPAS",
alpha = 0.05
)
# ℹ [2025/09/08 17:04:15] Start scPAS screening.
# [1] "Step 1:Quantile normalization of bulk data."
# [1] "Step 2: Extracting single-cell expression profiles...."
# [1] "Step 3: Constructing a gene-gene similarity by single cell data...."
# Building SNN based on a provided distance matrix
# Computing SNN
# [1] "Step 4: Optimizing the network-regularized sparse regression model...."
# [1] "Perform cox regression on the given clinical outcomes:"
# [1] "alpha = 0.05"
# [1] "lambda = 0.499586979737535"
# [1] "scPAS identified 46 rick+ features and 16 rick- features."
# [1] "The percentage of selected feature is: 35.838%"
# [1] "|**************************************************|"
# [1] "Step 5: calculating quantified risk scores...."
# [1] "Step 6: qualitative identification by permutation test program with 2000 times random perturbations"
# [1] "Finished."
# ✔ [2025/09/08 17:04:24] scPAS screening done.
table(scpas_result$scRNA_data$scPAS)
# Neutral Positive
# 1092 1
As you can see, due to differences in data and algorithms, not every
screening algorithm is able to screen out cells. You can adjust the
corresponding parameters, e.g. change the alpha
to
NULL
, this will make scPAS iterate alpha until the result
is significant (or judged as having no significant cell subpopulations).
See also 3.2 (Option B) scPAS
Screening
scpas_result = Screen(
matched_bulk = bulk,
sc_data = seurat,
phenotype = pheno,
label_type = "survival",
phenotype_class = "survival",
screen_method = "scPAS",
alpha = NULL
)
# ℹ [2025/09/22 18:46:25] Start scPAS screening.
# [1] "Step 1:Quantile normalization of bulk data."
# [1] "Step 2: Extracting single-cell expression profiles...."
# [1] "Step 3: Constructing a gene-gene similarity by single cell data...."
# Building SNN based on a provided distance matrix
# Computing SNN
# [1] "Step 4: Optimizing the network-regularized sparse regression model...."
# [1] "Perform cox regression on the given clinical outcomes:"
# [1] "alpha = 0.001"
# [1] "lambda = 7.88965313498356"
# [1] "scPAS identified 449 rick+ features and 236 rick- features."
# [1] "The percentage of selected feature is: 78.375%"
# [1] "alpha = 0.005"
# [1] "lambda = 3.32139271651783"
# [1] "scPAS identified 279 rick+ features and 159 rick- features."
# [1] "The percentage of selected feature is: 50.114%"
# [1] "alpha = 0.01"
# [1] "lambda = 2.5241108003373"
# [1] "scPAS identified 207 rick+ features and 102 rick- features."
# [1] "The percentage of selected feature is: 35.355%"
# [1] "alpha = 0.05"
# [1] "lambda = 0.8038196391727"
# [1] "scPAS identified 82 rick+ features and 24 rick- features."
# [1] "The percentage of selected feature is: 12.128%"
# [1] "alpha = 0.1"
# [1] "lambda = 0.441095530835556"
# [1] "scPAS identified 57 rick+ features and 12 rick- features."
# [1] "The percentage of selected feature is: 7.895%"
# [1] "alpha = 0.2"
# [1] "lambda = 0.220547765417778"
# [1] "scPAS identified 42 rick+ features and 12 rick- features."
# [1] "The percentage of selected feature is: 6.178%"
# [1] "alpha = 0.3"
# [1] "lambda = 0.154032875529483"
# [1] "scPAS identified 37 rick+ features and 9 rick- features."
# [1] "The percentage of selected feature is: 5.263%"
# [1] "alpha = 0.4"
# [1] "lambda = 0.115524656647112"
# [1] "scPAS identified 33 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.691%"
# [1] "alpha = 0.5"
# [1] "lambda = 0.0924197253176895"
# [1] "scPAS identified 30 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.348%"
# [1] "alpha = 0.6"
# [1] "lambda = 0.0770164377647413"
# [1] "scPAS identified 29 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.233%"
# [1] "alpha = 0.7"
# [1] "lambda = 0.0660140895126354"
# [1] "scPAS identified 28 rick+ features and 8 rick- features."
# [1] "The percentage of selected feature is: 4.119%"
# [1] "alpha = 0.8"
# [1] "lambda = 0.0577623283235559"
# [1] "scPAS identified 27 rick+ features and 7 rick- features."
# [1] "The percentage of selected feature is: 3.890%"
# [1] "alpha = 0.9"
# [1] "lambda = 0.0537890889507252"
# [1] "scPAS identified 24 rick+ features and 7 rick- features."
# [1] "The percentage of selected feature is: 3.547%"
# [1] "|**************************************************|"
# [1] "Step 5: calculating quantified risk scores...."
# [1] "Step 6: qualitative identification by permutation test program with 2000 times random perturbations"
# [1] "Finished."
# ✔ [2025/09/22 18:51:16] scPAS screening done.
Now there may be some significant cells appearing.
table(scpas_result$scRNA_data$scPAS)
# Negative Neutral Positive
# 2 1086 5
Now we use scAB and scPP to screen cells.
scab_result = Screen(
matched_bulk = bulk,
sc_data = seurat,
phenotype = pheno,
label_type = "survival",
phenotype_class = "survival",
screen_method = "scAB"
)
# ℹ [2025/09/08 17:04:51] Start scAB screening.
# ℹ Using "RNA_snn" graph for network.
# ℹ [2025/09/08 17:04:52] Selecting K...
# ℹ [2025/09/08 17:06:15] Run NMF with phenotype and cell-cell similarity regularization at K = 3.
# ℹ [2025/09/08 17:06:19] Screening cells...
# ℹ [2025/09/08 17:06:19] scAB screening done.
table(scab_result$scRNA_data$scAB)
# Other Positive
# 1018 75
scpp_result = Screen(
matched_bulk = bulk,
sc_data = seurat,
phenotype = pheno,
label_type = "survival",
phenotype_class = "survival",
screen_method = "scPP"
)
# ℹ [2025/09/08 17:00:28] Start scPP screening.
# ℹ [2025/09/08 17:00:28] Finding markers...
# Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
# Loglik converged before variable 1 ; coefficient may be infinite.
# ℹ [2025/09/08 17:00:52] Screening...
# Genes in the gene sets NOT available in the dataset:
# gene_pos: 13 (6% of 230)
# gene_neg: 54 (12% of 446)
# There are no genes significantly upregulated in Phenotype- compared to Phenotype+.
# ✔ [2025/09/08 17:00:54] scPP screening done.
table(scpp_result$scRNA_data$scPP)
# Negative Neutral Positive
# 52 993 48
After these algorithms have been run, the four sets of data can be merged since screening methods performed on the same data.
screen_result = MergeResult(
scissor_result,
scpas_result,
scab_result,
scpp_result
)
# ✔ Successfully merged 4 objects.
class(screen_result)
# [1] "Seurat"
# attr(,"package")
# [1] "SeuratObject"
colnames(screen_result@meta.data)
# [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt" "RNA_snn_res.0.1" "seurat_clusters" "scissor" "scPAS_RS" "scPAS_NRS" "scPAS_Pvalue"
# [11] "scPAS_FDR" "scPAS" "scAB" "scAB_Subset1" "Subset1_loading" "scAB_Subset2" "Subset2_loading" "scAB_Subset3" "Subset3_loading" "scPP"
Finally, we can visualize the screening results. Let’s start with a Venn diagram to see the situation.
library(ggVennDiagram)
library(zeallot)
# color palette
set.seed(123)
my_colors = randomcoloR::distinctColorPalette(length(unique(screen_result$seurat_clusters)),runTsne = TRUE)
c(scissor_pos, scab_pos, scpas_pos, scpp_pos) %<-%
purrr::map(
c("scissor", "scAB", "scPAS", "scPP"),
~ colnames(screen_result)[
which(screen_result[[.x]] == "Positive")
]
)
all_cells <- colnames(screen_result)
# * create a list of cell vectors
pos_venn = list(
scissor = scissor_pos,
scpas = scpas_pos,
scab = scab_pos,
scpp = scpp_pos,
all_cells = all_cells
)
set.seed(123)
venn = ggVennDiagram::ggVennDiagram(
x = pos_venn,
# * the labels of each group to be shown on the diagram
category.names = c(
"Scissor",
"scPAS",
"scAB",
"scPP",
"All cells"
),
# * the colors of each group
set_color = c(
"#a33333ff",
"#37ae00ff",
"#2a2a94ff",
"#9c8200ff",
"#008383ff"
)
) +
ggplot2::scale_fill_gradient(low = "white", high = "#ffb6b6ff") +
ggplot2::ggtitle("Screening Venn Diagram")
venn
ggplot2::ggsave("vignettes/example_figures/venn.png",plot=venn, width = 10, height = 10)
knitr::include_graphics("vignettes/example_figures/venn.png")
When there are too many sets, the visualization effect of the Venn diagram is not very good. We can use a set plot instead. Only positive cells are shown in the set plot.
upset <- ScreenUpset(screened_seurat = screen_result,screen_type = c("scissor", "scPAS", "scAB", "scPP"))
# * show the cell numbers of each set
head(upset$stats)
# # A tibble: 6 × 3
# intersection sets count
# <chr> <named list> <dbl>
# 1 scissor <chr [1]> 246
# 2 scPAS <chr [1]> 0
# 3 scAB <chr [1]> 87
# 4 scPP <chr [1]> 49
# 5 scissor & scPAS <chr [2]> 0
# 6 scissor & scAB <chr [2]> 11
ggplot2::ggsave("vignettes/example_figures/upset.png",plot=upset$plot, width = 10, height = 10)
knitr::include_graphics("vignettes/example_figures/upset.png")
A bar chart showing proportions can also be used to examine the
screening results. Since the example data does not have sample metadata,
we have created a fictional Sample
column.
set.seed(123)
# *fictional sample column
screen_result$Sample <- sample(paste0("Sample", 1:10), ncol(screen_result), replace = TRUE)
table(screen_result$Sample)
# Sample1 Sample10 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
# 98 117 96 119 95 100 101 131 115 121
fraction_list = ScreenFractionPlot(
screened_seurat = screen_result,
group_by = "Sample",
screen_type = c("scissor", "scPP", "scAB","scPAS"), # scPAS is not included due to no positive cells
show_null = FALSE,
plot_color = NULL,
show_plot = TRUE
)
# Creating plots for 3 screen types...
knitr::include_graphics("vignettes/example_figures/fraction.png")
The fraction_list
contains the statistical data and
charts for each screening algorithm.
fraction_list
├── stats
│ ├── scissor
│ ├── scAB
│ ├── scPAS
│ └── scPP
├── plots
│ ├── scissor
│ ├── scAB
│ ├── scPAS
│ └── scPP
└── combined_plot # show 3 plots in one plot
UMAP is the most commonly used type of plot in academic literature.
library(patchwork)
library(zeallot)
sample_umap = Seurat::DimPlot(
screen_result,
group.by = "Sample",
pt.size = 0.2,
reduction = "umap",
cols = my_colors
) +
ggplot2::ggtitle("Sample")
c(
scissor_umap,
scab_umap,
scpas_umap,
scpp_umap
) %<-%
purrr::map(
c("scissor", "scAB", "scPAS", "scPP"), # make sure these column names exist
~ Seurat::DimPlot(
screen_result,
group.by = .x,
pt.size = 0.2,
reduction = "umap",
cols = c(
"Neutral" = "#CECECE",
"Other" = "#CECECE",
"Positive" = "#ff3333",
"Negative" = "#386c9b"
)
) +
ggplot2::ggtitle(.x)
)
# * Show
umaps = sample_umap +
scissor_umap +
scab_umap +
scpas_umap +
scpp_umap +
plot_layout(ncol = 2)
umaps
knitr::include_graphics("vignettes/example_figures/umaps.png")
5.2 Continuous phenotype associated cell screening
Generally speaking, the process is the same as described in 5.1 Survival-associated cell screening. Here, only the preprocessing of continuous phenotypic data is introduced.
library(SigBridgeR)
library(zeallot)
library(Seurat)
setwd(here::here())
# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "continuous")
dim(mat_exam)
#[1] 33694 1093
dim(bulk)
# [1] 4106 289
bulk[1:6,1:6]
# TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01 TCGA-AD-6963-01
# HIF3A 2.3437 2.0858 6.0759 1.9506 5.4777 4.4634
# CAMK4 4.9331 2.3709 4.1387 1.1557 4.1746 3.2363
# RNF112 2.4817 2.4947 3.5941 2.3486 4.9185 1.4621
# SPN 5.6704 6.8577 8.0598 5.0049 7.6076 7.3960
# LRRTM1 1.6031 0.9465 1.9142 0.0000 3.2523 0.0000
# GRIN1 6.4944 4.3225 2.8073 7.3460 4.5000 3.1816
# * A named vector
head(pheno)
# TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01 TCGA-AD-6963-01
# 178 65 91 206 63 67
If your phenotype is a data.frame
, try this to convert
it to a named vector
:
pheno = setNames(your_data.frame$continuous_numeric, your_data.frame$sample)
5.3 Binarized phenotype associated cell screening
This process is also the same as described in 5.1 Survival-associated cell screening. Here, only the preprocessing of binary phenotypic data is introduced.
library(SigBridgeR)
library(zeallot)
library(Seurat)
setwd(here::here())
# * load the example data
c(mat_exam, bulk, pheno) %<-% LoadRefData(data_type = "binary")
dim(mat_exam)
#[1] 33694 1093
dim(bulk)
# [1] 4106 434
bulk[1:6,1:6]
# TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01
# HIF3A 3.7172 2.3437 2.0858 6.0759 1.9506 5.4777
# CAMK4 3.0698 4.9331 2.3709 4.1387 1.1557 4.1746
# RNF112 1.3702 2.4817 2.4947 3.5941 2.3486 4.9185
# SPN 5.5207 5.6704 6.8577 8.0598 5.0049 7.6076
# LRRTM1 3.2408 1.6031 0.9465 1.9142 0.0000 3.2523
# GRIN1 3.0698 6.4944 4.3225 2.8073 7.3460 4.5000
# * A named vector
head(pheno)
# TCGA-CA-5256-01 TCGA-AZ-6599-01 TCGA-AA-3655-01 TCGA-A6-6137-01 TCGA-CK-4952-01 TCGA-A6-5657-01
# 1 1 1 1 1 1
If your phenotype is a data.frame
, your binary variable
is stored in the “data” column, categorized as ‘Tumor’ and ‘Normal’, we
will assign ‘Tumor’ a value of 1 and ‘Normal’ a value of 0. In this way,
cells screened as Positive will be associated with
‘Tumor’. Try this to convert it to a named vector
:
pheno = mutate(
pheno,
data = case_when(
data == "Tumor" ~ 1,
data == "Normal" ~ 0
)
)
pheno = setNames(pheno$data, pheno$Sample)
6. Other function details
6.1 Add miscellaneous information to the Seurat object
SigBridgeR uses AddMisc()
to record what data features
or evidence the various screening algorithms are based on during
execution.
-
AddMisc()
: Add miscellaneous information to the Seurat object. Support for adding multiple attributes to theSeuratObject@misc
slot simultaneously.
# basic usage
seurat_obj <- AddMisc(seurat_obj, "QC_stats" = qc_df)
# Auto-incrementing example when `cover` set to FALSE
seurat_obj <- AddMisc(seurat_obj, markers = markers1)
seurat_obj <- AddMisc(seurat_obj, markers = markers2, cover=FALSE)
# Add multiple attributes to the `SeuratObject@misc` slot simultaneously
seurat_obj <- AddMisc(seurat_obj, markers1 = markers1, markers2 = markers2)
6.2 Calculate the variance of each row in a matrix
-
rowVars()
: Calculate the variance of each row in a matrix.
# Basic usage with a matrix
mat <- matrix(1:12, nrow = 3)
rowVars(mat)
# With missing values
mat[1, 2] <- NA
mat[2, 3] <- NA
rowVars(mat, na.rm = TRUE) # Excludes NAs
rowVars(mat, na.rm = FALSE) # Includes NAs (returns NA for affected rows)
# With a data frame
df <- data.frame(
a = c(1, 4, 7),
b = c(2, 5, 8),
c = c(3, 6, 9)
)
rowVars(df)
# Edge case: single column (variance is 0)
single_col <- matrix(1:3, ncol = 1)
rowVars(single_col) # Returns NaN due to division by 0
6.3 Detect Rows with Zero Variance in Matrices
-
Check0VarRows
: Detect Rows with Zero Variance in Matrices
The Check0VarRows
function identifies rows with zero
variance (constant values) or exclusively NA
values in
numeric matrices, including sparse matrices of class
dgCMatrix
. It throws an informative error if such rows are
detected, listing their names. This is particularly useful for
preprocessing steps in statistical analyses (e.g., PCA, regression)
where constant rows may cause computational issues or
misinterpretations.
Parameters
mat
(Required)
A numeric matrix or sparse matrix of classdgCMatrix
(from theMatrix
package). Rows represent features (e.g., genes), and columns represent observations. Must have row names to identify problematic rows in error messages.call
(Optional, Advanced)
The environment from which the function was called. Used internally for error reporting. Defaults torlang::caller_env()
. Most users can ignore this parameter.
Details
Algorithm:
For dense matrices, uses
rowVars
(Section 6.2) implementation to compute row variances with efficient NA handling.-
For sparse matrices (
dgCMatrix
), leverages a mathematical identity to compute variance without dense intermediate matrices:Rows with <=1 non-zero observation are treated as zero-variance to avoid numerical instability.
# Example 1: Dense matrix with no zero-variance rows
set.seed(123)
mat_dense <- matrix(rnorm(100), nrow = 10)
rownames(mat_dense) <- paste0("Gene", 1:10)
Check0VarRows(mat_dense) # No error
# Example 2: Dense matrix with zero-variance row
mat_dense[1, ] <- rep(5, 10) # First row is constant
Check0VarRows(mat_dense) # Error: "Detected 1 gene(s) with zero variance: Gene1"
# Example 3: Sparse matrix (dgCMatrix)
library(Matrix)
mat_sparse <- as(matrix(rpois(100, 0.5), nrow = 10), "dgCMatrix")
rownames(mat_sparse) <- paste0("Gene", 1:10)
Check0VarRows(mat_sparse) # Error if any row has zero variance
7. References
Sun D, Guan X, Moran AE, Wu LY, Qian DZ, Schedin P, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022 Apr;40(4):527–38.
Xie A, Wang H, Zhao J, Wang Z, Xu J, Xu Y. scPAS: single-cell phenotype-associated subpopulation identifier. Briefings in Bioinformatics. 2024 Nov 22;26(1):bbae655.
Zhang Q, Jin S, Zou X. scAB detects multiresolution cell states with clinical significance by integrating single-cell genomics and bulk sequencing data. Nucleic Acids Research. 2022 Nov 28;50(21):12112–30.
WangX-Lab/ScPP Internet. cited 2025 Aug 31. Available from: https://github.com/WangX-Lab/ScPP