Bulk RNA-seq Data Preprocessing and Quality Control Function
Source: R/11-BulkPreProcess.R, BulkPreProcess.Rd
This function performs comprehensive preprocessing and quality control analysis for bulk RNA-seq data, including data validation, filtering, batch effect detection, principal component analysis, and visualization.
Usage
BulkPreProcess(
data,
sample_info = NULL,
gene_symbol_conversion = TRUE,
check = TRUE,
min_count_threshold = 10,
min_gene_expressed = 3,
min_total_reads = 1e+06,
min_genes_detected = 10000,
min_correlation = 0.8,
n_top_genes = 500,
show_plot_results = TRUE,
verbose = TRUE
)
Arguments
- data
Expression matrix with genes as rows and samples as columns, or a list containing count_matrix and sample_info
- sample_info
Sample information data frame (optional), ignored if data is a list
- gene_symbol_conversion
Whether to convert Ensembl version IDs and TCGA version IDs to gene symbols with IDConverter, default TRUE
- check
Whether to perform detailed quality checks, default TRUE
- min_count_threshold
Minimum count threshold for gene filtering, default 10
- min_gene_expressed
Minimum number of samples a gene must be expressed in, default 3
- min_total_reads
Minimum total reads per sample, default 1e6
- min_genes_detected
Minimum number of genes detected per sample, default 10000
- min_correlation
Minimum correlation threshold between samples, default 0.8
- n_top_genes
Number of top variable genes for PCA analysis, default 500
- show_plot_results
Whether to generate visualization plots, default TRUE
- verbose
Whether to output detailed information, default TRUE
Details
The function performs the following operations:
1. Data validation and format conversion
2. Basic statistics calculation (missing values, read depth, gene detection)
3. Optional detailed quality checks, including:
   - Sample correlation analysis
   - Principal Component Analysis (PCA)
   - Outlier detection using Mahalanobis distance
   - Batch effect detection using ANOVA
4. Data filtering based on count thresholds and quality metrics
5. Optional gene symbol conversion
6. Visualization generation (PCA plots)
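The detailed quality checks listed above can be illustrated with a minimal base-R sketch. It is not the function's actual implementation: the simulated data, the log2 CPM transform with a pseudocount, the chi-squared cutoff for the Mahalanobis distance, and the BH-adjusted 0.05 threshold for the ANOVA are all assumptions made for illustration.

```r
set.seed(1)

# Simulated counts: 2000 genes x 12 samples (hypothetical data)
counts <- matrix(
  rnbinom(2000 * 12, mu = 100, size = 5),
  nrow = 2000,
  dimnames = list(paste0("gene", 1:2000), paste0("S", 1:12))
)
batch <- factor(rep(c("A", "B"), each = 6))

# log2 counts per million with a pseudocount (base-R stand-in for cpm)
log_cpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

# PCA on the most variable genes (base-R stand-in for rowVars)
n_top_genes <- 500
gene_var <- apply(log_cpm, 1, var)
top <- order(gene_var, decreasing = TRUE)[seq_len(n_top_genes)]
pca <- prcomp(t(log_cpm[top, ]), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Mahalanobis distance of each sample in PC1/PC2 space;
# the 97.5% chi-squared cutoff is an assumed threshold
pcs <- pca$x[, 1:2]
md <- mahalanobis(pcs, colMeans(pcs), cov(pcs))
outliers <- names(md)[md > qchisq(0.975, df = 2)]

# Per-gene one-way ANOVA of expression against batch
batch_p <- apply(log_cpm[top, ], 1, function(x) {
  summary(aov(x ~ batch))[[1]][["Pr(>F)"]][1]
})
prop_batch_genes <- mean(p.adjust(batch_p, method = "BH") < 0.05)
```

Here `var_explained[1:2]` corresponds to the PCA variance metric and `prop_batch_genes` to the proportion of genes significantly affected by batch.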
Quality Metrics
The function calculates and reports several quality metrics:
- Data Integrity
Number of missing values
- Gene Count
Total number of genes after filtering
- Sample Read Depth
Total reads per sample
- Gene Detection Rate
Number of genes detected per sample
- Sample Correlation
Pearson correlation between samples
- PCA Variance
Variance explained by the first two principal components
- Batch Effects
Proportion of genes significantly affected by batch
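As a rough illustration of how the simpler metrics above could be computed, the following base-R sketch works on a simulated count matrix. It assumes that "detected" means a count greater than zero and that correlations are taken on log2-transformed counts; the function itself may define these differently.

```r
set.seed(42)

# Simulated counts: 1000 genes x 6 samples (hypothetical data)
counts <- matrix(
  rpois(1000 * 6, lambda = 20),
  nrow = 1000,
  dimnames = list(paste0("gene", 1:1000), paste0("S", 1:6))
)

n_missing      <- sum(is.na(counts))     # Data Integrity
total_reads    <- colSums(counts)        # Sample Read Depth
genes_detected <- colSums(counts > 0)    # Gene Detection Rate (assumed: count > 0)

# Sample Correlation: Pearson correlation between samples
sample_cor <- cor(log2(counts + 1), method = "pearson")

# Mean correlation of each sample with all other samples
diag(sample_cor) <- NA
mean_cor <- rowMeans(sample_cor, na.rm = TRUE)
```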
Sample Information Format
The sample_info data frame should contain:
- sample
Character vector of unique sample identifiers
- condition
Character vector of experimental conditions
- batch
Optional character vector of batch identifiers
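A small sketch of inputs in this format may help. The sample names, conditions, batch labels, and counts below are made up, and the thresholds in the call are lowered only because the toy matrix has 500 genes. The call is guarded so that the snippet also runs when the package providing BulkPreProcess is not loaded; the structure of the returned object is not described on this page, so it is not inspected here.

```r
set.seed(7)

sample_ids <- paste0("Sample", 1:6)

# Toy count matrix: genes as rows, samples as columns (hypothetical data)
count_matrix <- matrix(
  rnbinom(500 * 6, mu = 50, size = 2),
  nrow = 500,
  dimnames = list(paste0("gene", 1:500), sample_ids)
)

# Sample information with the documented columns
sample_info <- data.frame(
  sample    = sample_ids,
  condition = rep(c("control", "treated"), each = 3),
  batch     = rep(c("b1", "b2"), times = 3),
  stringsAsFactors = FALSE
)

# Sanity checks: unique identifiers that match the matrix columns
stopifnot(
  !anyDuplicated(sample_info$sample),
  setequal(sample_info$sample, colnames(count_matrix))
)

# The list form of `data`; sample_info is then taken from the list
bulk_input <- list(count_matrix = count_matrix, sample_info = sample_info)

# Guarded call; thresholds are relaxed for the toy data
if (exists("BulkPreProcess")) {
  result <- BulkPreProcess(
    data = bulk_input,
    gene_symbol_conversion = FALSE,
    min_total_reads = 1e4,
    min_genes_detected = 100,
    show_plot_results = FALSE
  )
}
```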
Filtering Criteria
Genes are retained if:
- They have counts >= min_count_threshold in >= min_gene_expressed samples
Samples are retained if:
- Total reads >= min_total_reads
- Detected genes >= min_genes_detected
- Mean correlation >= min_correlation (if check = TRUE)
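The gene and sample criteria can be sketched in base R as follows. This is an illustration under assumptions, not the function's code: both masks are computed on the unfiltered matrix (the real order of operations is not stated here), "detected" is taken to mean a count greater than zero, the correlation criterion is omitted, and the sample thresholds are scaled down for the simulated data.

```r
set.seed(123)

# Simulated counts: 300 genes x 8 samples (hypothetical data)
counts <- matrix(
  rnbinom(300 * 8, mu = 30, size = 1),
  nrow = 300,
  dimnames = list(paste0("gene", 1:300), paste0("S", 1:8))
)

min_count_threshold <- 10
min_gene_expressed  <- 3
min_total_reads     <- 5000   # scaled down from the 1e6 default for toy data
min_genes_detected  <- 100    # scaled down from the 10000 default for toy data

# Genes: counts >= min_count_threshold in >= min_gene_expressed samples
keep_genes <- rowSums(counts >= min_count_threshold) >= min_gene_expressed

# Samples: enough total reads and enough detected genes
keep_samples <- colSums(counts) >= min_total_reads &
  colSums(counts > 0) >= min_genes_detected

filtered <- counts[keep_genes, keep_samples, drop = FALSE]
```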
See also
cpm for counts per million calculation, prcomp for PCA analysis, cor for correlation analysis, rowVars for variance calculation of each row, SymbolConvert for gene symbol conversion