Label Survival-Associated Phenotype Cells Based on Hazard Scores

Classifies cells into survival-associated phenotype groups ("Positive" vs "Other") based on their hazard scores. The function first tests the hazard score distribution for normality, then applies a thresholding strategy suited to the result to flag cells with elevated hazard scores that may be associated with survival outcomes.

Usage

LabelSurvivalCells(
  pred_dt,
  select_fraction,
  test_method,
  min_threshold = 0.7,
  verbose = TRUE
)

Arguments

pred_dt: A data.table containing hazard scores for cells. Must contain a column named 'Hazard' with numeric hazard scores.
select_fraction: Numeric value between 0 and 1 specifying the target fraction of cells to classify as "Positive". The actual fraction may be adjusted based on distribution characteristics and minimum threshold constraints.
test_method: Character string specifying the statistical test to use for normality assessment of hazard scores. One of: "jarque-bera", "d'agostino", "kolmogorov-smirnov".
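min_threshold: Numeric value setting the minimum threshold constraint applied when the classification cutoff is adjusted (see Details). Default: 0.7.
verbose: Logical, whether to print messages. Default: TRUE.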

Value

The input pred_dt with an additional column:

label - Character vector with cell classifications: "Positive" (high hazard cells) or "Other"

Details

Classification Strategies:

Non-normal distributions (normality test p-value < 0.05): Uses quantile-based selection, where the top select_fraction of cells by hazard score are classified as "Positive", subject to the minimum threshold constraint
Normal distributions (normality test p-value ≥ 0.05): Uses quantiles of a normal distribution to determine the classification threshold, adjusted to satisfy the minimum threshold constraint
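
The two strategies above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper name sketch_threshold, the alpha argument, fitting the normal by sample mean and standard deviation, and the strict `>` comparison are all hypothetical, and the min_threshold adjustment is deliberately left out because its exact rule is not described here.

```r
# Hypothetical helper illustrating the two thresholding strategies.
# The min_threshold adjustment is omitted on purpose.
sketch_threshold <- function(hazard, select_fraction, p_value, alpha = 0.05) {
  if (p_value < alpha) {
    # Non-normal: empirical quantile, so the top select_fraction is selected
    cutoff <- unname(stats::quantile(hazard, probs = 1 - select_fraction))
  } else {
    # Normal: quantile of a normal fitted by sample mean and sd (assumption)
    cutoff <- stats::qnorm(1 - select_fraction,
                           mean = mean(hazard), sd = stats::sd(hazard))
  }
  cutoff
}

set.seed(1)
hazard <- rexp(1000, rate = 2)  # simulated, right-skewed hazard scores

cutoff_q <- sketch_threshold(hazard, 0.1, p_value = 0.001)  # quantile branch
cutoff_n <- sketch_threshold(hazard, 0.1, p_value = 0.5)    # normal branch

# Label cells above the empirical cutoff
label <- ifelse(hazard > cutoff_q, "Positive", "Other")
table(label)
```

For skewed scores the two cutoffs can differ noticeably, which is one plausible reason to test for normality before choosing between them.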

Supported Normality Tests:

Jarque-Bera: Tests for skewness and kurtosis deviations from normality
D'Agostino: Extended normality test focusing on skewness
Kolmogorov-Smirnov: Non-parametric test comparing empirical distribution to normal distribution
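
To make the p-value that drives the strategy choice concrete, here is a minimal base R sketch of two of the tests listed above. The helper names jarque_bera_p and ks_normal_p are hypothetical, and the function may compute these tests differently internally. The Jarque-Bera statistic is computed by hand from sample skewness and kurtosis; the Kolmogorov-Smirnov test uses stats::ks.test against a normal whose parameters are estimated from the data, which makes its p-value approximate. D'Agostino is omitted because base R has no built-in for it.

```r
# Jarque-Bera p-value computed by hand (base R only).
jarque_bera_p <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)
  skew <- mean(d^3) / m2^1.5   # sample skewness
  kurt <- mean(d^4) / m2^2     # sample kurtosis (3 for a normal)
  jb <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  # Under normality, JB is asymptotically chi-squared with 2 df
  stats::pchisq(jb, df = 2, lower.tail = FALSE)
}

# Kolmogorov-Smirnov p-value against a normal with estimated parameters.
ks_normal_p <- function(x) {
  stats::ks.test(x, "pnorm", mean = mean(x), sd = stats::sd(x))$p.value
}

set.seed(1)
skewed <- rexp(1000, rate = 2)  # simulated, clearly non-normal scores

p_jb <- jarque_bera_p(skewed)
p_ks <- ks_normal_p(skewed)
c(jarque_bera = p_jb, kolmogorov_smirnov = p_ks)
```

Both p-values fall far below 0.05 for exponentially distributed scores like these, so such data would be routed to the quantile-based strategy.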

Note

The function assumes the input data.table contains a column named 'Hazard' with numeric values representing hazard scores from upstream analysis. The minimum threshold, set by min_threshold (default 0.7), is intended to ensure biological relevance of the identified cell populations.

Examples

if (FALSE) { # \dontrun{
library(data.table)

# Create example hazard score data
hazard_data <- data.table(
  cell_id = paste0("cell_", 1:1000),
  Hazard = rexp(1000, rate = 2)  # Simulated hazard scores
)

# Identify survival-associated cells
result <- LabelSurvivalCells(
  pred_dt = hazard_data,
  select_fraction = 0.1,
  test_method = "jarque-bera"
)

# Check classification results
table(result$label)

# Analyze the hazard scores of positive cells
summary(result[label == "Positive", Hazard])
} # }