Aligns and validates phenotype data against a bulk RNA-seq expression matrix.
Supports three phenotype types (binary, continuous, survival) with automatic
sample matching, type validation, and optional conditional mapping via
PhenoMap.
Usage
PhenoPreProcess(
  bulk,
  phenotype,
  phenotype_class = c("binary", "continuous", "survival"),
  ...,
  select = NULL,
  verbose = getFuncOption("verbose")
)
Arguments
- bulk
A two-dimensional matrix or data frame of bulk RNA-seq expression data with genes as rows and samples as columns. Must have at least 2 samples.
- phenotype
Phenotype data, either:
Named numeric vector (for binary/continuous phenotypes)
Data frame/matrix with row names matching colnames(bulk) (for survival or multi-column phenotypes)
- phenotype_class
Character. Type of phenotype:
"binary": Two-class categorical outcome (e.g., case/control). Must have exactly 2 unique values.
"continuous": Continuous measurement (e.g., age, expression). Must have >2 unique values.
"survival": Survival data with time and status columns. Automatically guesses columns named "time" and "status"/"censor" if select is NULL.
Partial matching is supported.
- ...
Conditional mapping rules passed to PhenoMap for transforming phenotype values before validation. Format: condition ~ value. Example: status == "death" ~ 1, status == "alive" ~ 0.
- select
Character. Column name(s) to select from phenotype when it is two-dimensional. Required for survival data with multiple columns, or to specify which column to use for binary/continuous phenotypes.
- verbose
Logical. Whether to print diagnostic messages. Default: inherits from getOption("SigBridgeRUtils.verbose").
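The condition ~ value rules passed through ... can be pictured as a first-match-wins lookup over the phenotype columns. The following is a minimal base R sketch of that idea, not PhenoMap's actual implementation: map_by_rules() is a hypothetical helper, and the precedence of earlier rules and the NA result for unmatched rows are assumptions made for illustration.

```r
# Hypothetical helper: apply `condition ~ value` formulas to a data frame.
# Assumption: the first matching rule wins; rows matching no rule stay NA.
map_by_rules <- function(data, ...) {
  rules <- list(...)
  out <- rep(NA_real_, nrow(data))
  for (rule in rules) {
    # A two-sided formula keeps its left side at [[2]] and right side at [[3]].
    hit <- eval(rule[[2]], data, environment(rule))
    value <- eval(rule[[3]], data, environment(rule))
    # Only fill rows that no earlier rule has already assigned.
    fill <- is.na(out) & !is.na(hit) & hit
    out[fill] <- value
  }
  out
}

pheno <- data.frame(
  status = c("death", "alive", "death"),
  row.names = paste0("Sample", 1:3)
)
mapped <- map_by_rules(pheno, status == "death" ~ 1, status == "alive" ~ 0)
names(mapped) <- rownames(pheno)
mapped
```

Because the mapping runs before validation, the mapped values (here 0 and 1) are what the phenotype_class checks see.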
Value
Preprocessed phenotype data:
For binary/continuous: Named numeric vector with sample names as names
For survival: Data frame with two columns (time, status) and sample names as row names
Only samples present in both bulk and phenotype are retained.
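As a rough illustration of the retention rule, the sketch below aligns a named phenotype vector to a toy expression matrix using base R only. It is an assumption-laden stand-in rather than the function's internals; in particular, keeping the column order of bulk is simply what intersect() yields when colnames(bulk) is its first argument.

```r
set.seed(1)
# Toy expression matrix: 10 genes x 4 samples.
bulk <- matrix(
  rpois(40, 10),
  nrow = 10, ncol = 4,
  dimnames = list(paste0("Gene", 1:10), paste0("Sample", 1:4))
)
# Phenotype covers two matrix samples plus one sample absent from bulk.
pheno <- c(Sample2 = 1, Sample4 = 0, Sample9 = 1)

# Keep only the shared samples, ordered as they appear in colnames(bulk).
common <- intersect(colnames(bulk), names(pheno))
pheno_aligned <- pheno[common]
pheno_aligned
```

Sample9 is dropped because it has no matching column in the expression matrix.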
Validation Rules
Binary: Must have exactly 2 unique values (e.g., 0/1, TRUE/FALSE)
Continuous: Must have >2 unique values (to avoid confusion with binary)
Survival: Must have exactly 2 unique values in status column (typically 0 = censored, 1 = event)
Sample matching: Common samples are identified by intersecting colnames(bulk) with names(phenotype) (for vectors) or rownames(phenotype) (for data frames/matrices)
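The unique-value rules above can be expressed in a few lines of base R. This is a sketch under stated assumptions, not the package's validation code: check_phenotype() is a hypothetical name, ignoring NA values when counting is my assumption, and match.arg() is used here only to show how partial matching of the class name could work.

```r
# Hypothetical checker mirroring the three documented unique-value rules.
check_phenotype <- function(x, type = c("binary", "continuous", "survival")) {
  type <- match.arg(type)  # allows partial matching, e.g. "cont"
  # Assumption: missing values are ignored when counting unique values.
  n_unique <- function(v) length(unique(v[!is.na(v)]))
  switch(type,
    binary = n_unique(x) == 2,
    continuous = n_unique(x) > 2,
    survival = n_unique(x$status) == 2
  )
}

ok_binary <- check_phenotype(c(S1 = 0, S2 = 1, S3 = 1), "binary")
ok_continuous <- check_phenotype(c(S1 = 25, S2 = 35, S3 = 45), "cont")
ok_survival <- check_phenotype(
  data.frame(time = c(5, 8, 3), status = c(1, 0, 1)),
  "survival"
)
# A two-valued vector does not qualify as continuous.
bad_continuous <- check_phenotype(c(S1 = 0, S2 = 1, S3 = 1), "continuous")
```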
Automatic Features
Survival column guessing: When phenotype_class = "survival" and select = NULL, automatically detects columns containing "time" and "status"/"censor" in their names (case-insensitive).
Type conversion: Logical values are converted to numeric (TRUE->1, FALSE->0)
Sample alignment: Returns only samples present in both bulk and phenotype
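Column guessing and logical conversion can be approximated with grepl() and as.numeric(). The sketch below is illustrative only: guess_survival_cols() is a hypothetical helper, and requiring exactly one candidate per role is an assumption, since the behavior for ambiguous column names is not described here.

```r
# Hypothetical helper: guess time/status columns by case-insensitive name match.
guess_survival_cols <- function(df) {
  nms <- colnames(df)
  time_col <- nms[grepl("time", nms, ignore.case = TRUE)]
  status_col <- nms[grepl("status|censor", nms, ignore.case = TRUE)]
  # Assumption: fail on zero or multiple candidates instead of picking silently.
  stopifnot(length(time_col) == 1, length(status_col) == 1)
  c(time = time_col, status = status_col)
}

pheno <- data.frame(
  OS_Time = c(12, 24, 18),
  Censor = c(TRUE, FALSE, TRUE),
  row.names = paste0("Sample", 1:3)
)
cols <- guess_survival_cols(pheno)

# Build the two-column result; as.numeric() maps TRUE -> 1 and FALSE -> 0.
surv <- data.frame(
  time = pheno[[cols[["time"]]]],
  status = as.numeric(pheno[[cols[["status"]]]]),
  row.names = rownames(pheno)
)
surv
```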
See also
Other input_preprocess:
BulkPreProcess(),
PhenoMap(),
SCPreProcess()
Examples
if (FALSE) { # \dontrun{
# Example 1: Binary phenotype
bulk_data <- matrix(rpois(100, 10), nrow = 20, ncol = 5)
colnames(bulk_data) <- paste0("Sample", 1:5)
pheno_binary <- c(Sample1 = 1, Sample2 = 0, Sample3 = 1, Sample4 = 0, Sample5 = 1)
result <- PhenoPreProcess(
  bulk = bulk_data,
  phenotype = pheno_binary,
  phenotype_class = "binary"
)
# Example 2: Continuous measurement discretized into a binary phenotype
# (mapping runs before validation, so the two mapped values are checked as binary)
pheno_age <- c(Sample1 = 25, Sample2 = 35, Sample3 = 45, Sample4 = 55, Sample5 = 65)
result <- PhenoPreProcess(
  bulk = bulk_data,
  phenotype = pheno_age,
  phenotype_class = "binary",
  .x < 40 ~ 0, .x >= 40 ~ 1
)
# Example 3: Survival data with automatic column detection
pheno_surv <- data.frame(
  time = c(12, 24, 18, 36, 30),
  status = c(1, 0, 1, 1, 0),
  row.names = paste0("Sample", 1:5)
)
result <- PhenoPreProcess(
  bulk = bulk_data,
  phenotype = pheno_surv,
  phenotype_class = "survival"
)
# Example 4: Survival data with explicit column names
pheno_surv_custom <- data.frame(
  follow_up = c(12, 24, 18, 36, 30),
  event = c(1, 0, 1, 1, 0),
  row.names = paste0("Sample", 1:5)
)
result <- PhenoPreProcess(
  bulk = bulk_data,
  phenotype = pheno_surv_custom,
  phenotype_class = "survival",
  select = c("follow_up", "event")
)
} # }