实验2:差异表达分析与可视化

使用DESeq2识别差异基因并进行功能分析

实验目标

完成本实验后,你将能够:

  1. 使用DESeq2进行差异表达分析
  2. 理解DESeq2结果表的各列含义
  3. 筛选和解释差异表达基因
  4. 绘制火山图、热图等可视化图表
  5. 进行基础的功能富集分析(GO/KEGG)

关键知识点

重要核心概念速记
概念 要点
DESeq() 执行完整的差异分析流程(标准化→离散度估计→统计检验)
results() 提取差异分析结果,返回DataFrame
padj BH校正后的p值,用于差异基因筛选
log2FoldChange log2倍数变化,表示差异大小和方向
clusterProfiler 功能富集分析的R包,支持GO/KEGG/GSEA

⚠️ 常见错误

错误 正确做法
使用pvalue而非padj筛选 必须使用padj(BH校正后)
仅看p值不看log2FC 需同时满足padj<0.05和|log2FC|>1
GO分析使用全部基因 背景应为所有检测到的基因

实验准备

加载包和真实数据

代码
# 安装和加载所需包
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

packages <- c("DESeq2", "ggplot2", "pheatmap", "clusterProfiler",
              "org.Hs.eg.db", "enrichplot")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    BiocManager::install(pkg)
    library(pkg, character.only = TRUE)
  }
}

# 中文字体
showtext::showtext_auto()

# 加载真实RNA-seq数据(PTC vs ATC甲状腺癌数据)
# 如果实验1已完成,可以直接使用dds对象
# 否则运行以下代码重新加载数据

# 数据下载链接(从课程网站下载):
# - geneCountMatrix.txt: ../data/geneCountMatrix.txt
# - samplesinfo.txt: ../data/samplesinfo.txt
# 代码中使用GitHub链接确保可执行

# 下载数据文件(如果不存在)
data_url_counts <- "http://biotree.top:38123/courses/rna-seq-analysis/data/geneCountMatrix.txt"
data_url_info <- "http://biotree.top:38123/courses/rna-seq-analysis/data/samplesinfo.txt"

#data_url_counts <- "https://github.com/WangLabCSU/courses/raw/main/rna-seq-analysis/data/geneCountMatrix.txt"
#data_url_info <- "https://github.com/WangLabCSU/courses/raw/main/rna-seq-analysis/data/samplesinfo.txt"

if (!file.exists("../data/geneCountMatrix.txt")) {
  download.file(data_url_counts, "geneCountMatrix.txt", mode = "wb")
  cat("已下载 geneCountMatrix.txt\n")
  count_file <- "geneCountMatrix.txt"
} else {
  count_file <- "../data/geneCountMatrix.txt"
}
if (!file.exists("../data/samplesinfo.txt")) {
  download.file(data_url_info, "samplesinfo.txt", mode = "wb")
  cat("已下载 samplesinfo.txt\n")
  info_file <- "samplesinfo.txt"
} else {
  info_file <- "../data/samplesinfo.txt"
}

# 1. 读取表达矩阵
counts_matrix <- read.table(count_file, header = TRUE, sep = "\t", check.names = FALSE)
counts <- counts_matrix[, -1]  # 移除基因symbol列
rownames(counts) <- counts_matrix$gene_symbol

# 2. 读取样本信息
sample_info <- read.table(info_file, header = TRUE, sep = "\t")
conditions <- factor(sample_info$type, levels = c("PTC", "ATC"))

# 3. 创建DESeq2对象
colData <- data.frame(
  condition = conditions,
  batch = sample_info$batch,
  row.names = sample_info$sample
)

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = colData,
  design = ~ condition
)

# 4. 过滤低表达基因
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

cat("数据准备完成!\n")
数据准备完成!
代码
cat("基因数:", nrow(dds), "\n")
基因数: 14154 
代码
cat("样本数:", ncol(dds), "\n")
样本数: 6 
代码
cat("样本名称:", colnames(dds), "\n")
样本名称: ATC1 ATC2 ATC3 PTC1 PTC2 PTC3 

第一部分:差异表达分析

练习1.1:运行DESeq2差异分析

代码
# TODO: 完成DESeq2差异分析流程

# 1. 运行DESeq2(包含标准化、离散度估计、统计检验)
cat("运行DESeq2分析...\n")
运行DESeq2分析...
代码
dds <- DESeq(dds)

# 2. 提取结果
# contrast参数:c("分组列名", "处理组", "对照组")
# 注意:这里的"处理"是ATC vs PTC
res <- results(dds, contrast = c("condition", "ATC", "PTC"))

# 3. (可选) 收缩log2FC估计值,使低表达基因的fold change更稳健
# res_lfc <- lfcShrink(dds, coef = "condition_ATC_vs_PTC", type = "apeglm")
# 推荐使用收缩后的log2FC进行下游可视化

# 3. 查看结果摘要
cat("\nDESeq2结果摘要:\n")

DESeq2结果摘要:
代码
summary(res)

out of 14154 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 399, 2.8%
LFC < 0 (down)     : 547, 3.9%
outliers [1]       : 293, 2.1%
low counts [2]     : 0, 0%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
代码
# 4. 查看结果表的前几行
cat("\n结果表(前5行):\n")

结果表(前5行):
代码
print(head(res))
log2 fold change (MLE): condition ATC vs PTC 
Wald test p-value: condition ATC vs PTC 
DataFrame with 6 rows and 6 columns
        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
A1BG     206.821       0.197023  0.742124  0.265485 0.7906358  0.945918
A2LD1    146.073      -0.555663  0.805723 -0.689645 0.4904172  0.831816
A4GALT   593.454      -1.867799  0.967007 -1.931525 0.0534181  0.328350
AAAS    1519.960      -0.232417  0.382539 -0.607563 0.5434774  0.859675
AACS    1281.655      -0.135044  0.501322 -0.269376 0.7876404  0.945089
AADAT    324.801       0.375832  0.623692  0.602592 0.5467804  0.860654
代码
# 5. 统计各p值范围的基因数
cat("\np值分布:\n")

p值分布:
代码
cat("padj < 0.001:", sum(res$padj < 0.001, na.rm = TRUE), "\n")
padj < 0.001: 253 
代码
cat("padj < 0.01:", sum(res$padj < 0.01, na.rm = TRUE), "\n")
padj < 0.01: 418 
代码
cat("padj < 0.05:", sum(res$padj < 0.05, na.rm = TRUE), "\n")
padj < 0.05: 695 
点击查看输出解读

输出解读

  • baseMean:平均表达量(size factor标准化后)
  • log2FoldChange:log2倍数变化(ATC vs PTC)
  • lfcSE:log2FC的标准误
  • stat:Wald统计量
  • pvalue:原始p值
  • padj:BH校正后的p值

练习1.2:理解结果表并筛选差异基因

代码
# TODO: 整理和解读差异分析结果

# 1. 将结果转换为数据框并移除NA
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df <- na.omit(res_df)  # 移除padj为NA的行(通常是低表达基因被独立过滤)

# 2. 添加差异状态分类
res_df$diff_status <- ifelse(
  res_df$padj < 0.05 & res_df$log2FoldChange > 1, "Up",
  ifelse(res_df$padj < 0.05 & res_df$log2FoldChange < -1, "Down", "Not Sig")
)

# 3. 统计各类基因数量
status_table <- table(res_df$diff_status)
cat("差异基因统计:\n")
差异基因统计:
代码
print(status_table)

   Down Not Sig      Up 
    412   13182     267 
代码
# 4. 找出Top 10上调基因(按padj排序)
top_up <- res_df[res_df$diff_status == "Up", ]
top_up <- top_up[order(top_up$padj), ]
cat("\nTop 10 上调基因 (ATC中高表达):\n")

Top 10 上调基因 (ATC中高表达):
代码
print(head(top_up[, c("gene", "baseMean", "log2FoldChange", "padj")], 10))
         gene  baseMean log2FoldChange         padj
NTSR1   NTSR1 1071.2166      10.108541 4.756772e-18
HOXB8   HOXB8  496.0749       9.946011 1.160483e-16
TRHDE   TRHDE 1080.0636       6.219738 1.031072e-13
ABCC9   ABCC9  453.9784       8.038105 1.628228e-12
BTBD11 BTBD11 1406.9263       3.179016 4.406671e-11
RAGE     RAGE 2330.2706       2.481524 4.813586e-11
IL1A     IL1A 1449.5932       7.444527 8.190292e-10
POU2F2 POU2F2 1089.5840       5.354168 8.190292e-10
AGTR1   AGTR1  516.3107       6.588342 1.264246e-09
FOXG1   FOXG1  401.2260      11.102169 3.113666e-09
代码
# 5. 找出Top 10下调基因
top_down <- res_df[res_df$diff_status == "Down", ]
top_down <- top_down[order(top_down$padj), ]
cat("\nTop 10 下调基因 (ATC中低表达):\n")

Top 10 下调基因 (ATC中低表达):
代码
print(head(top_down[, c("gene", "baseMean", "log2FoldChange", "padj")], 10))
           gene  baseMean log2FoldChange         padj
KCNJ16   KCNJ16 4315.7968     -12.723676 2.890509e-32
VCAM1     VCAM1 1942.8727     -10.212810 7.080371e-31
PDE10A   PDE10A 1336.9797      -9.452668 8.483145e-31
RASAL1   RASAL1  298.4580      -7.164806 3.685437e-24
OLFML2B OLFML2B 1555.4485      -3.803537 4.083903e-23
WFDC2     WFDC2 1455.5608      -9.444831 8.541053e-23
DAAM2     DAAM2  338.6293      -7.206446 2.324768e-17
DBC1       DBC1  468.5402      -7.585819 9.010055e-17
INHBB     INHBB 1120.9102      -8.794307 1.958769e-16
SYTL1     SYTL1  627.6601      -7.716563 8.168632e-15
点击查看核心代码
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df <- na.omit(res_df)

res_df$diff_status <- ifelse(
  res_df$padj < 0.05 & res_df$log2FoldChange > 1, "Up",
  ifelse(res_df$padj < 0.05 & res_df$log2FoldChange < -1, "Down", "Not Sig")
)

table(res_df$diff_status)
提示筛选阈值说明
  • padj < 0.05:控制假发现率在5%以下
  • |log2FC| > 1:倍数变化 > 2倍
  • 可根据研究需要调整:严格(|log2FC|>2, padj<0.01)或宽松(|log2FC|>0.5, padj<0.1)

第二部分:结果可视化

练习2.1:MA图

MA图展示表达量(mean)与差异倍数(log2FC)的关系。

代码
# TODO: 创建MA图

# 方法1:使用DESeq2内置函数(快速)
plotMA(res, ylim = c(-5, 5), main = "MA Plot: ATC vs PTC")

代码
# 方法2:使用ggplot2自定义(更美观)
ma_plot <- ggplot(res_df, aes(x = baseMean, y = log2FoldChange, color = diff_status)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_x_log10() +
  scale_color_manual(values = c("Down" = "blue", "Not Sig" = "grey", "Up" = "red")) +
  geom_hline(yintercept = c(-1, 1), linetype = "dashed", color = "grey50") +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", alpha = 0.3) +
  theme_minimal() +
  labs(title = "MA Plot: ATC vs PTC",
       subtitle = paste0("Red: Up (n=", sum(res_df$diff_status=="Up"),
                        "), Blue: Down (n=", sum(res_df$diff_status=="Down"), ")"),
       x = "Mean Expression (log10 scale)",
       y = "log2 Fold Change",
       color = "Status")

print(ma_plot)

点击查看MA图解读

MA图解读

  • X轴:平均表达量(log10)
  • Y轴:log2 Fold Change
  • 红点:上调基因(ATC中高表达)
  • 蓝点:下调基因(ATC中低表达)
  • 灰点:不显著

关键观察

  • 低表达基因(左侧)的log2FC波动更大
  • 高表达基因(右侧)趋于稳定
  • 这是RNA-seq数据的典型特征

练习2.2:火山图

火山图同时展示差异倍数和统计显著性。

代码
# TODO: 创建火山图

# 1. 准备数据
res_plot <- res_df
res_plot$negLog10padj <- -log10(res_plot$padj)

# 2. 绘制火山图
volcano_plot <- ggplot(res_plot, aes(x = log2FoldChange, y = negLog10padj,
                                      color = diff_status)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Down" = "blue", "Not Sig" = "grey", "Up" = "red")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "grey50") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey50") +
  theme_minimal() +
  labs(title = "Volcano Plot: ATC vs PTC",
       x = "log2 Fold Change",
       y = "-log10(adjusted p-value)",
       color = "Status") +
  theme(legend.position = "top")

print(volcano_plot)

代码
# 3. 标注Top基因(可选)
# 获取最显著的上下调基因各5个
top_genes <- rbind(
  head(res_plot[res_plot$diff_status == "Up", ], 5),
  head(res_plot[res_plot$diff_status == "Down", ], 5)
)

volcano_labeled <- volcano_plot +
  geom_text(data = top_genes,
            aes(label = gene),
            size = 3, vjust = -0.5, hjust = 0.5, color = "black")

print(volcano_labeled)

点击查看核心代码
res_plot <- res_df
res_plot$negLog10padj <- -log10(res_plot$padj)

volcano_plot <- ggplot(res_plot, aes(x = log2FoldChange, y = negLog10padj,
                                      color = diff_status)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Down" = "blue", "Not Sig" = "grey", "Up" = "red")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "grey50") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey50") +
  theme_minimal() +
  labs(title = "Volcano Plot: ATC vs PTC",
       x = "log2 Fold Change",
       y = "-log10(adjusted p-value)")

print(volcano_plot)
注记火山图解读
  • 左上/右上区域:既显著又差异大的基因
  • 顶部水平线:padj = 0.05阈值
  • 垂直虚线:|log2FC| = 1阈值
  • 注意:|log2FC|很大但p值不显著的点(左右两端灰色点)通常是低表达基因

练习2.3:差异基因热图

代码
# TODO: 创建差异基因表达热图

# 1. 选择Top差异基因进行可视化(padj最小的前50个)
top_genes_heatmap <- head(res_df[order(res_df$padj), "gene"], 50)

# 2. 获取VST转换后的数据
dds_vst <- vst(dds, blind = FALSE)
vst_matrix <- assay(dds_vst)

# 提取Top基因的表达量
heatmap_data <- vst_matrix[top_genes_heatmap, ]

# 3. 按行(基因)标准化,突出表达模式
heatmap_scaled <- t(scale(t(heatmap_data)))

# 4. 准备注释数据
annotation_col <- data.frame(
  Condition = colData(dds)$condition,
  row.names = colnames(dds)
)

# 5. 绘制热图
pheatmap(heatmap_scaled,
         annotation_col = annotation_col,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = FALSE,
         main = "Top 50 Differential Genes",
         color = colorRampPalette(c("blue", "white", "red"))(100),
         fontsize = 8)

点击查看核心代码
top_genes_heatmap <- head(res_df[order(res_df$padj), "gene"], 50)
dds_vst <- vst(dds, blind = FALSE)
vst_matrix <- assay(dds_vst)
heatmap_data <- vst_matrix[top_genes_heatmap, ]
heatmap_scaled <- t(scale(t(heatmap_data)))

annotation_col <- data.frame(
  Condition = colData(dds)$condition,
  row.names = colnames(dds)
)

pheatmap(heatmap_scaled,
         annotation_col = annotation_col,
         cluster_rows = TRUE,
         cluster_cols = TRUE,
         show_rownames = FALSE,
         main = "Top 50 Differential Genes",
         color = colorRampPalette(c("blue", "white", "red"))(100))
提示热图解读
  • 红色:高表达
  • 蓝色:低表达
  • 列聚类:样本是否按条件分组
  • 行标准化:消除基因间基线差异,突出相对变化

第三部分:功能富集分析

练习3.1:GO富集分析

代码
# TODO: 进行GO功能富集分析

# 1. 准备差异基因列表
gene_list_up <- res_df$gene[res_df$diff_status == "Up"]
gene_list_down <- res_df$gene[res_df$diff_status == "Down"]

cat("上调基因数:", length(gene_list_up), "\n")
上调基因数: 267 
代码
cat("下调基因数:", length(gene_list_down), "\n")
下调基因数: 412 
代码
# 2. 查看支持的基因ID类型
cat("\norg.Hs.eg.db支持的ID类型(部分):\n")

org.Hs.eg.db支持的ID类型(部分):
代码
print(head(keytypes(org.Hs.eg.db), 10))
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
代码
# 3. GO富集分析(Biological Process)
# 使用SYMBOL直接分析(clusterProfiler支持)
ego_up <- enrichGO(
  gene = gene_list_up,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",        # 输入基因ID类型
  ont = "BP",                # BP: Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  qvalueCutoff = 0.2
)

# 4. 查看结果
cat("\nGO富集结果(前10条):\n")

GO富集结果(前10条):
代码
print(head(ego_up, 10))
                   ID                                         Description
GO:0009615 GO:0009615                                   response to virus
GO:0051607 GO:0051607                           defense response to virus
GO:0034340 GO:0034340                       response to type I interferon
GO:0071357 GO:0071357              cellular response to type I interferon
GO:0048525 GO:0048525                negative regulation of viral process
GO:0045071 GO:0045071     negative regulation of viral genome replication
GO:0140374 GO:0140374                    antiviral innate immune response
GO:0001819 GO:0001819          positive regulation of cytokine production
GO:0140888 GO:0140888               interferon-mediated signaling pathway
GO:0032481 GO:0032481 positive regulation of type I interferon production
           GeneRatio   BgRatio RichFactor FoldEnrichment    zScore       pvalue
GO:0009615    28/227 400/18842 0.07000000       5.810308 10.738220 6.350209e-14
GO:0051607    23/227 305/18842 0.07540984       6.259349 10.225759 2.741616e-12
GO:0034340    12/227 122/18842 0.09836066       8.164368  8.766742 2.827805e-08
GO:0071357    11/227 116/18842 0.09482759       7.871107  8.197220 1.579279e-07
GO:0048525     9/227  74/18842 0.12162162      10.095130  8.656636 2.556154e-07
GO:0045071     7/227  48/18842 0.14583333      12.104809  8.506590 1.624742e-06
GO:0140374     8/227  73/18842 0.10958904       9.096373  7.653582 2.689518e-06
GO:0001819    20/227 499/18842 0.04008016       3.326830  5.817188 2.846207e-06
GO:0140888     9/227 105/18842 0.08571429       7.114663  6.938260 5.031970e-06
GO:0032481     8/227  80/18842 0.10000000       8.300441  7.225832 5.395477e-06
               p.adjust       qvalue
GO:0009615 3.733923e-10 3.733923e-10
GO:0051607 8.060352e-09 8.060352e-09
GO:0034340 5.542498e-05 5.542498e-05
GO:0071357 2.321540e-04 2.321540e-04
GO:0048525 3.006037e-04 3.006037e-04
GO:0045071 1.592247e-03 1.592247e-03
GO:0140374 2.091962e-03 2.091962e-03
GO:0001819 2.091962e-03 2.091962e-03
GO:0140888 3.172541e-03 3.172541e-03
GO:0032481 3.172541e-03 3.172541e-03
                                                                                                                                                                    geneID
GO:0009615 GATA3/IFI27/POU2F2/UAP1/STAT1/MX1/IFIT3/IFI6/MLKL/PARP9/IFIT1/ABCC9/IRF7/NLRP3/ISG15/RSAD2/OAS2/IFIT2/DMBT1/USP18/PLSCR1/IFIH1/RTP4/DTX3L/POLR3G/IFI44/CCT5/MX2
GO:0051607                               IFI27/UAP1/STAT1/MX1/IFIT3/IFI6/MLKL/PARP9/IFIT1/ABCC9/IRF7/ISG15/RSAD2/OAS2/IFIT2/DMBT1/USP18/PLSCR1/IFIH1/RTP4/DTX3L/POLR3G/MX2
GO:0034340                                                                                             IFI27/PSMA5/PSME2/PSMA3/STAT1/MX1/IFIT1/IRF7/ISG15/OAS2/USP18/SP100
GO:0071357                                                                                                 IFI27/PSMA5/PSME2/PSMA3/STAT1/IFIT1/IRF7/ISG15/OAS2/USP18/SP100
GO:0048525                                                                                                             STAT1/MX1/IFIT1/ISG15/RSAD2/OAS2/PLSCR1/IFIH1/SP100
GO:0045071                                                                                                                         MX1/IFIT1/ISG15/RSAD2/OAS2/PLSCR1/IFIH1
GO:0140374                                                                                                                     UAP1/MX1/IFIT3/IFIT1/OAS2/IFIT2/USP18/IFIH1
GO:0001819                                              GATA3/PSMA5/PTGS2/POU2F2/IL1A/UAP1/PSMA3/STAT1/SLC7A5/EREG/IRF7/NLRP3/ISG15/RIPK2/RSAD2/OAS2/IFIH1/CCL3/POLR3G/TXK
GO:0140888                                                                                                               IFI27/STAT1/PARP9/IRF7/ISG15/OAS2/USP18/SP100/TXK
GO:0032481                                                                                                                   UAP1/STAT1/IRF7/ISG15/RIPK2/OAS2/IFIH1/POLR3G
           Count
GO:0009615    28
GO:0051607    23
GO:0034340    12
GO:0071357    11
GO:0048525     9
GO:0045071     7
GO:0140374     8
GO:0001819    20
GO:0140888     9
GO:0032481     8
代码
# 5. 可视化
# 条形图
barplot(ego_up, showCategory = 15, title = "GO BP Enrichment - Upregulated Genes")

代码
# 点图(更推荐使用)
dotplot(ego_up, showCategory = 15, orderBy = "p.adjust",
        title = "GO BP Enrichment - Upregulated Genes")

代码
# 6. 保存结果
write.csv(as.data.frame(ego_up), "GO_BP_upregulated.csv", row.names = FALSE)
cat("\n结果已保存到: GO_BP_upregulated.csv\n")

结果已保存到: GO_BP_upregulated.csv
点击查看结果解读

富集结果解读

  • ID:GO条目ID
  • Description:功能描述
  • GeneRatio:差异基因中属于该通路的基因比例
  • BgRatio:背景基因中属于该通路的基因比例
  • pvalue/p.adjust:富集显著性
  • qvalue:FDR控制
  • Count:属于该通路的差异基因数
警告富集分析注意事项
  1. 背景基因集:应为所有检测到的基因,而非全基因组
  2. 基因ID转换:clusterProfiler需要特定ID格式(ENTREZID/SYMBOL等)
  3. 阈值选择:p.adjust < 0.05为常用阈值
  4. 生物学解释:富集≠因果,需结合文献解读
  5. GO冗余:GO terms常有高度重叠,可使用 simplify(ego) 减少冗余

练习3.2:KEGG通路富集分析

代码
# TODO: 进行KEGG通路富集分析

# 1. 基因ID转换(SYMBOL → ENTREZID)
gene_df <- bitr(gene_list_up,
                fromType = "SYMBOL",
                toType = "ENTREZID",
                OrgDb = org.Hs.eg.db)

cat("成功转换的基因数:", nrow(gene_df), "\n")
成功转换的基因数: 238 
代码
# 2. KEGG富集分析
kk <- enrichKEGG(
  gene = gene_df$ENTREZID,
  organism = "hsa",    # hsa = human
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

# 3. 查看结果
cat("\nKEGG富集结果:\n")

KEGG富集结果:
代码
print(head(kk, 10))
 [1] category       subcategory    ID             Description    GeneRatio     
 [6] BgRatio        RichFactor     FoldEnrichment zScore         pvalue        
[11] p.adjust       qvalue         geneID         Count         
<0 rows> (or 0-length row.names)
代码
# 4. 可视化
dotplot(kk, showCategory = 15, title = "KEGG Pathway Enrichment")

点击查看核心代码
gene_df <- bitr(gene_list_up, fromType = "SYMBOL",
                toType = "ENTREZID", OrgDb = org.Hs.eg.db)

kk <- enrichKEGG(gene = gene_df$ENTREZID, organism = "hsa",
                 pAdjustMethod = "BH", pvalueCutoff = 0.05)

dotplot(kk, showCategory = 15)
注记常用物种代码
物种 代码
人类 hsa
小鼠 mmu
大鼠 rno
斑马鱼 dre

练习3.3:网络图可视化(cnetplot)

代码
# TODO: 绘制富集基因与通路的网络图

# 1. 准备基因表达变化数据
gene_foldchange <- res_df$log2FoldChange
names(gene_foldchange) <- res_df$gene

# 2. 绘制cnetplot
# 显示前5个显著富集的条目及其关联基因
cnetplot(ego_up,
         foldChange = gene_foldchange,
         showCategory = 5              # 显示前5个条目
)

点击查看cnetplot解读

cnetplot解读

  • 大节点:富集条目(GO term)
  • 小节点:基因
  • 用途:查看哪些基因参与多个通路,以及基因与通路的关联关系

练习3.4:GSEA分析(可选,进阶)

代码
# TODO: 进行GSEA分析

# 1. 准备排序基因列表(所有基因,按log2FC排序)
geneList <- res_df$log2FoldChange
names(geneList) <- res_df$gene
geneList <- sort(geneList, decreasing = TRUE)

# 2. 查看排序后的基因列表
cat("排序后的基因列表(Top 10上调和下调):\n")
排序后的基因列表(Top 10上调和下调):
代码
cat("Top 10 上调基因:\n")
Top 10 上调基因:
代码
print(head(geneList, 10))
     ESM1     FOXG1     NTSR1     HOXB8      MMP1     CXCL6     LPAR3      CCL3 
13.992158 11.102169 10.108541  9.946011  9.166710  8.613285  8.476962  8.433621 
 C9orf110     ABCC9 
 8.164724  8.038105 
代码
cat("\nTop 10 下调基因:\n")

Top 10 下调基因:
代码
print(tail(geneList, 10))
    PDE10A       NTF3     SPOCK2    SPATA18      GFRA2      VCAM1      KRT19 
 -9.452668  -9.676192  -9.720337  -9.764239  -9.782525 -10.212810 -10.286909 
      AGR2       RORC     KCNJ16 
-10.440192 -10.805467 -12.723676 
代码
# 3. GSEA分析(使用SYMBOL)
gsea <- gseGO(
  geneList = geneList,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  minGSSize = 10,      # 基因集最小基因数
  maxGSSize = 500      # 基因集最大基因数
)

# 4. 查看GSEA结果
cat("\nGSEA结果(前5条):\n")

GSEA结果(前5条):
代码
print(head(gsea, 5))
                   ID                        Description setSize
GO:0042254 GO:0042254                ribosome biogenesis     273
GO:0006364 GO:0006364                    rRNA processing     201
GO:0016072 GO:0016072             rRNA metabolic process     240
GO:0042273 GO:0042273 ribosomal large subunit biogenesis      63
GO:0140374 GO:0140374   antiviral innate immune response      57
           enrichmentScore      NES       pvalue     p.adjust       qvalue rank
GO:0042254       0.5700881 2.445870 1.000000e-10 1.175200e-07 1.004103e-07 3904
GO:0006364       0.5821664 2.390488 1.000000e-10 1.175200e-07 1.004103e-07 3872
GO:0016072       0.5418291 2.241265 1.000000e-10 1.175200e-07 1.004103e-07 4662
GO:0042273       0.6158696 2.166541 5.296477e-06 1.556105e-03 1.329552e-03 3667
GO:0140374       0.6139350 2.143528 9.005146e-06 2.204760e-03 1.883770e-03 1064
                             leading_edge
GO:0042254 tags=59%, list=28%, signal=43%
GO:0006364 tags=62%, list=28%, signal=45%
GO:0016072 tags=67%, list=34%, signal=45%
GO:0042273 tags=60%, list=26%, signal=45%
GO:0140374  tags=25%, list=8%, signal=23%
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   core_enrichment
GO:0042254 CDKN2A/FGF12/ISG20/RPP25/EBNA1BP2/MPHOSPH6/TRMT2B/NOP2/RPP40/RRS1/NIP7/EIF5/BRIX1/DDX21/NOP58/TEX10/RPF1/MPV17L2/DKC1/EXOSC5/HEATR3/BOP1/RPUSD1/ZNF593/GTPBP4/PAK1IP1/PDCD2L/LYAR/NOL6/GNL3L/DDX10/EMG1/WDR3/NOP16/MRTO4/PPAN/EXOSC3/WDR77/NOP56/NOC2L/MAK16/PRMT5/LSG1/GNL2/RRP9/RIOK1/TFB2M/PNO1/NOL11/WDR12/NGDN/KRI1/PDCD2/PIN4/URB1/EXOSC2/RCL1/LSM6/DNTTIP2/LTV1/DCAF13/SRFBP1/RRP15/EIF1AX/GAR1/WDR43/WDR36/ZNHIT6/RRP7A/EXOSC4/RPP30/ERI1/POP4/UTP14A/NSUN6/DDX28/RIOK2/WDR75/NUDT16/GRWD1/HEATR1/NHP2/PES1/URB2/LAS1L/BYSL/DHX29/UTP20/EXOSC6/RPS24/NOL10/DHX37/CUL4B/C1QBP/USP36/NOLC1/UTP15/EXOSC9/SIRT7/PA2G4/NMD3/POP7/WDR74/RPS16/ZNF622/TRMT61B/FBL/WBP11/MYBBP1A/ZCCHC4/FASTKD2/POP5/DDX49/REXO1/RPS19BP1/RPF2/NSUN4/WDR18/FCF1/EIF4A3/RBM34/UTP23/EXOSC8/SART1/SUV39H1/NOB1/DDX55/NAF1/UTP3/NUP88/NOC4L/NOP10/RAN/PDCD11/PWP2/DDX47/NOL8/RPS4X/RPL35A/NVL/NDUFAB1/GTPBP10/UTP6/ABT1/NOL9/FTSJ3/USP16/GEMIN4/AATF/NOP14/SDAD1/FDXACB1/RPS13/DDX31/ESF1/BMS1/NSUN3/RPS5/NSA2/DDX56/RPS23
GO:0006364                                                                                                                                                                                                                                        CDKN2A/ISG20/RPP25/EBNA1BP2/MPHOSPH6/TRMT2B/NOP2/RPP40/RRS1/BRIX1/DDX21/NOP58/TEX10/RPF1/DKC1/EXOSC5/BOP1/RPUSD1/PAK1IP1/LYAR/NOL6/DDX10/EMG1/WDR3/MRTO4/PPAN/EXOSC3/WDR77/NOP56/MAK16/PRMT5/RRP9/RIOK1/TFB2M/PNO1/NOL11/WDR12/NGDN/KRI1/PIN4/URB1/EXOSC2/RCL1/LSM6/DNTTIP2/DCAF13/SRFBP1/RRP15/GAR1/WDR43/WDR36/ZNHIT6/RRP7A/EXOSC4/RPP30/ERI1/POP4/UTP14A/NSUN6/RIOK2/WDR75/NUDT16/HEATR1/NHP2/PES1/LAS1L/BYSL/UTP20/EXOSC6/RPS24/NOL10/DHX37/USP36/NOLC1/UTP15/EXOSC9/SIRT7/PA2G4/POP7/WDR74/RPS16/TRMT61B/FBL/WBP11/ZCCHC4/POP5/DDX49/REXO1/RPF2/NSUN4/WDR18/FCF1/EIF4A3/RBM34/UTP23/EXOSC8/SART1/SUV39H1/NOB1/DDX55/NAF1/UTP3/NOC4L/NOP10/PDCD11/PWP2/DDX47/NOL8/RPL35A/NVL/UTP6/ABT1/NOL9/FTSJ3/GEMIN4/AATF/NOP14/FDXACB1/RPS13/ESF1/BMS1/NSUN3/NSA2/DDX56
GO:0016072                  CDKN2A/ISG20/RPP25/EBNA1BP2/MPHOSPH6/TRMT2B/NOP2/RPP40/POLR1E/RRS1/BRIX1/DDX21/NOP58/TEX10/RPF1/DKC1/EXOSC5/BOP1/RPUSD1/MYC/PAK1IP1/LYAR/NOL6/DDX10/EMG1/WDR3/MRTO4/PPAN/EXOSC3/WDR77/NOP56/MAK16/PRMT5/RRP9/RIOK1/TFB2M/PNO1/NOL11/WDR12/SLFN5/NGDN/KRI1/PIN4/URB1/EXOSC2/RCL1/LSM6/DNTTIP2/IPPK/DCAF13/SRFBP1/RRP15/GAR1/WDR43/WDR36/ZNHIT6/RRP7A/EXOSC4/POLR2F/RPP30/POLR2H/ERI1/POP4/UTP14A/NSUN6/RIOK2/WDR75/NUDT16/HEATR1/NHP2/PES1/LAS1L/BYSL/NCL/UTP20/EXOSC6/RPS24/NOL10/DHX37/USP36/POLR1C/NOLC1/UTP15/EXOSC9/SIRT7/PA2G4/POP7/WDR74/RPS16/TRMT61B/FBL/POLR2K/WBP11/ZCCHC4/POP5/POLR1B/DDX49/REXO1/RPF2/NSUN4/WDR18/FCF1/EIF4A3/RBM34/UTP23/EXOSC8/SART1/SUV39H1/NOB1/DDX55/NAF1/UTP3/GTF3C6/NOC4L/NOP10/PDCD11/PWP2/DDX47/NOL8/RPL35A/NVL/UTP6/ABT1/NOL9/FTSJ3/GEMIN4/AATF/DDX11/NOP14/FDXACB1/RPS13/ESF1/BMS1/NSUN3/NSA2/DDX56/MTOR/RRP1/IMP3/PIH1D1/SENP3/POLR1A/PWP1/DDX27/TCOF1/UTP18/IMP4/RPS21/C1D/NOL7/YTHDF2/GTF3C4/EXOSC1/TSR1/RPP38/SLFN12/RPS6/RPS15/WDR46/PELP1
GO:0042273                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 EBNA1BP2/NOP2/RRS1/NIP7/BRIX1/TEX10/RPF1/HEATR3/BOP1/GTPBP4/PAK1IP1/NOP16/MRTO4/PPAN/NOC2L/MAK16/WDR12/URB1/RRP15/ZNHIT6/DDX28/PES1/LAS1L/WDR74/ZNF622/FASTKD2/RPF2/NSUN4/WDR18/RBM34/DDX55/RPL35A/NVL/NDUFAB1/GTPBP10/NOL9/FTSJ3/SDAD1
GO:0140374                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         IFIT1/OAS2/USP18/MX1/IFIT3/IFIT2/OAS1/IFIH1/IFIT5/OAS3/UAP1/OASL/EIF2AK2/TRIM25
             log2err
GO:0042254       NaN
GO:0006364       NaN
GO:0016072       NaN
GO:0042273 0.6105269
GO:0140374 0.5933255
代码
# 5. GSEA可视化
# 绘制最显著富集的条目
gseaplot2(gsea, geneSetID = 1, title = gsea$Description[1])

代码
# 同时绘制前3个
gseaplot2(gsea, geneSetID = 1:3)

点击查看GSEA解读

GSEA结果解读

  • NES:标准化富集分数
    • NES > 0:通路在排序列表顶部富集(上调)
    • NES < 0:通路在排序列表底部富集(下调)
  • p.adjust:校正后的显著性
  • leading_edge:核心富集基因

GSEA mountain plot解读

  • 绿线:富集分数曲线,峰值位置表示最大富集
  • 黑色竖线:基因集中基因在排序列表中的位置
  • 热图:基因表达值分布

第四部分:综合分析报告

练习4.1:生成差异分析报告

# TODO: 生成差异分析汇总报告

cat("==============================================\n")
cat("         RNA-seq 差异分析报告\n")
cat("         PTC vs ATC\n")
cat("==============================================\n\n")

cat("1. 样本信息\n")
cat("   - 对照组(PTC)样本数:", sum(colData$condition == "PTC"), "\n")
cat("   - 处理组(ATC)样本数:", sum(colData$condition == "ATC"), "\n")
cat("   - 总基因数:", nrow(res_df), "\n\n")

cat("2. 差异基因统计\n")
cat("   - 总差异基因数:", sum(res_df$diff_status != "Not Sig"), "\n")
cat("   - 上调基因数(ATC高):", sum(res_df$diff_status == "Up"), "\n")
cat("   - 下调基因数(ATC低):", sum(res_df$diff_status == "Down"), "\n\n")

cat("3. Top 5 上调基因\n")
top5_up <- head(res_df[res_df$diff_status == "Up", ], 5)
for (i in 1:5) {
  cat(sprintf("   %s: log2FC = %.2f, padj = %.2e\n",
              top5_up$gene[i], top5_up$log2FoldChange[i], top5_up$padj[i]))
}

cat("\n4. Top 5 下调基因\n")
top5_down <- head(res_df[res_df$diff_status == "Down", ], 5)
for (i in 1:5) {
  cat(sprintf("   %s: log2FC = %.2f, padj = %.2e\n",
              top5_down$gene[i], top5_down$log2FoldChange[i], top5_down$padj[i]))
}

cat("\n5. 功能富集分析结果\n")
if (exists("ego_up") && nrow(as.data.frame(ego_up)) > 0) {
  cat("   - GO BP富集条目数:", nrow(as.data.frame(ego_up)), "\n")
  cat("   - Top富集通路:", as.data.frame(ego_up)$Description[1], "\n")
} else {
  cat("   - GO BP富集条目数: 0(可能需要调整阈值)\n")
}
if (exists("kk") && !is.null(kk) && nrow(as.data.frame(kk)) > 0) {
  cat("   - KEGG富集条目数:", nrow(as.data.frame(kk)), "\n")
} else {
  cat("   - KEGG富集条目数: 0或分析未完成\n")
}

cat("\n6. 生物学解释\n")
cat("   - 上调基因可能参与ATC的恶性表型\n")
cat("   - 下调基因可能在ATC中受到抑制\n")
cat("   - 富集到的通路提示了ATC与PTC的生物学差异\n")
cat("   - 需结合文献进一步验证关键通路\n")

cat("\n==============================================\n")

思考题

  1. 为什么火山图中会出现|log2FC|很大但p值不显著的点?

    点击查看答案

    这些基因通常表达量很低,counts的随机波动大,统计检验效能不足

  2. MA图中为什么高表达基因的log2FC趋于0?

    点击查看答案

    高表达基因通常受严格调控,表达量相对稳定;低表达基因更容易出现大倍数变化

  3. 如何选择差异基因筛选阈值?

    点击查看答案

    标准:|log2FC| > 1, padj < 0.05。可根据研究需要调整,严格阈值用于下游验证,宽松阈值用于探索

  4. 热图中为什么要对数据进行行标准化?

    点击查看答案

    消除基因间表达量基线差异,突出展示基因在不同样本间的表达模式(相对变化)

  5. 富集分析中为什么要进行多重检验校正?

    点击查看答案

    检测大量GO/通路时,假阳性会累积,需用BH等方法控制FDR


关键代码速查表

操作 代码
运行DESeq2 dds <- DESeq(dds)
提取结果 res <- results(dds, contrast = c("condition", "A", "B"))
筛选差异基因 subset(res, padj < 0.05 & abs(log2FC) > 1)
MA图 plotMA(res)
火山图 ggplot(res_df, aes(x=log2FC, y=-log10(padj), color=status)) + geom_point()
查看keytypes keytypes(org.Hs.eg.db)
ID转换 bitr(genes, fromType="SYMBOL", toType="ENTREZID", OrgDb=org.Hs.eg.db)
GO富集 enrichGO(gene, OrgDb=org.Hs.eg.db, ont="BP", keyType="SYMBOL")
KEGG富集 enrichKEGG(gene, organism="hsa")
点图 dotplot(ego, showCategory=15, orderBy="p.adjust")
网络图 cnetplot(ego, foldChange=geneList, showCategory=5)
GSEA gseGO(geneList, OrgDb=org.Hs.eg.db, ont="BP")
GSEA图 gseaplot2(gsea, geneSetID=1)

延伸阅读


实验结束,祝学习愉快! 🎉