Chapter 6 RNAseq Analysis - DESeq2
In this Chapter, we will be analyzing RNAseq data previously published in Brunton et al. HNF4A and GATA6 Loss Reveals Therapeutically Actionable Subtypes in Pancreatic Cancer. Cell Reports 2020. The RNAseq experiment that we will re-analyse involves the Knock-Down (KD) of HNF4A using a specific siRNA in 2 different Patient Derived Cell Lines (PDCLs) Mayo5289 and TKCC22. Both Mayo5289 and TKCC22 are Classical lines which expresses HNF4A. We have 12 samples in total, representing:
- 3x Mayo5289 Scrambled Controls (Biological Replicates)
- 3x Mayo5289 siHNF4A (Biological Replicates)
- 3x TKCC22 Scrambled Controls (Biological Replicates)
- 3x TKCC22 siHNF4A (Biological Replicates)
RNAseq data has been mapped and a merged counts file is available for us to analyse. Let’s take a look at the sample sheet for the experiment.
<- read.delim("RNASeqData/SampleSheetBrunton.tsv", sep = "\t",
sampleSheet stringsAsFactors = FALSE)
sampleSheet#> id Treatment_group Batch
#> 1 HB.A1 TKCC-22_Cont 1
#> 2 HB.A3 TKCC-22_HNF4A 1
#> 3 HB.B1 TKCC-22_Cont 2
#> ...<6 more rows>...
#> 10 HB.E3 Mayo_5289_HNF4A 2
#> 11 HB.F1 Mayo_5289_Cont 3
#> 12 HB.F3 Mayo_5289_HNF4A 3
The merged counts file is in the following format:
<- read.delim("RNASeqData/CountsFileBrunton.tsv",
countsFile stringsAsFactors = FALSE, sep = "\t",row.names = 1)
head(countsFile)
#> HB.A1 HB.A3 HB.B1 HB.B3 HB.C1 HB.C3 HB.D1 HB.D3 HB.E1 HB.E3
#> ENSG00000223972 0 0 0 0 0 0 1 0 0 0
#> ENSG00000227232 206 147 149 161 160 184 154 200 222 201
#> ENSG00000243485 0 0 0 0 0 0 0 0 0 0
#> ENSG00000237613 0 0 0 0 0 0 0 0 0 0
#> ENSG00000268020 0 0 0 0 0 0 0 0 0 0
#> ENSG00000240361 0 0 0 0 0 0 0 0 0 0
#> HB.F1 HB.F3
#> ENSG00000223972 0 0
#> ENSG00000227232 167 188
#> ENSG00000243485 0 0
#> ENSG00000237613 0 0
#> ENSG00000268020 0 0
#> ENSG00000240361 0 0
6.1 DESEq2 package
The DESeq2
package expects count data in the form of a matrix of integer values as input. The values in the matrix should be non-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq). The DESeq2
model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.
The DESeq2
package stores the read counts and the results of any analysis in an object of class DESeqDataSet
. The DESeqDataSet
class extends the RangedSummarizedExperiment
class of the SummarizedExperiment
package, which we used previously in Section 4.2.
6.2 The DESeqDataSet
A DESeqDataSet
object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already).
6.3 Count matrix input
The function DESeqDataSetFromMatrix
can be used to generate a RangedSummarizedExperiment
object comprising our data. The Treatment_group
sample sheet column is used for our design formula!
library(DESeq2)
## Produce a DESeq dataset
<- DESeqDataSetFromMatrix(countData = countsFile,
dds colData = sampleSheet,
design = ~ Treatment_group)
#> Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
#> design formula are characters, converting to factors
#> Note: levels of factors in the design contain characters other than
#> letters, numbers, '_' and '.'. It is recommended (but not required) to use
#> only letters, numbers, and delimiters '_' or '.', as these are safe characters
#> for column names in R. [This is a message, not an warning or error]
Let’s take a look at the colData
for this object. The colData
function will allow us to access the Treatment_group
and experiment ids
later on!
head(colData(dds))
#> DataFrame with 6 rows and 3 columns
#> id Treatment_group Batch
#> <character> <factor> <integer>
#> HB.A1 HB.A1 TKCC-22_Cont 1
#> HB.A3 HB.A3 TKCC-22_HNF4A 1
#> HB.B1 HB.B1 TKCC-22_Cont 2
#> HB.B3 HB.B3 TKCC-22_HNF4A 2
#> HB.C1 HB.C1 TKCC-22_Cont 3
#> HB.C3 HB.C3 TKCC-22_HNF4A 3
6.4 Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2
functions, there are two reasons which make pre-filtering useful:
- by removing rows in which there are very few reads, we reduce the memory size of the dds data object
- we increase the speed of the transformation and testing functions within DESeq2
.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total. Note that more strict filtering to increase power is automatically applied via independent filtering on the mean of normalized counts within the results function.
<- rowSums(counts(dds)) >= 10
keep <- dds[keep,] dds
6.5 Differential expression analysis
The standard differential expression analysis steps are wrapped into a single function named DESeq
.
Results tables are generated using the function results
, which extracts a results table with log2 fold changes, p values and adjusted p values. Here we want to compare differential gene expression in the TKCC22 and Mayo5289 PDCLs, separately. To do this we can provide a contrast
argument, where the first vector value is our design name and the other vector values represent the contrast of interest!
<- DESeq(dds)
dds <- results(dds, contrast=c("Treatment_group", "TKCC-22_HNF4A", "TKCC-22_Cont"))
resTKCC22 head(resTKCC22)
#> log2 fold change (MLE): Treatment_group TKCC-22_HNF4A vs TKCC-22_Cont
#> Wald test p-value: Treatment_group TKCC-22_HNF4A vs TKCC-22_Cont
#> DataFrame with 6 rows and 6 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000227232 176.215965236868 0.0681688692461893 0.155683348871782
#> ENSG00000238009 4.95585118508252 -1.27198412595543 0.897892011941174
#> ENSG00000237683 23.0539253255181 -0.26031208423795 0.530153592513952
#> ENSG00000241860 26.5900311689865 0.368471855396428 0.402842462674039
#> ENSG00000228463 25.8307510095804 -0.144518497310896 0.356207690630831
#> ENSG00000237094 1.06267800811557 1.70544281563047 2.03235686968038
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000227232 0.43786872353531 0.66148144925531 0.862931209585896
#> ENSG00000238009 -1.41663374775492 0.156590036356189 NA
#> ENSG00000237683 -0.491012581851172 0.623417548715701 NA
#> ENSG00000241860 0.914679780653059 0.360359762037706 NA
#> ENSG00000228463 -0.405714141249896 0.684952650597028 NA
#> ENSG00000237094 0.839145349457587 0.401387750491202 NA
<- results(dds, contrast=c("Treatment_group", "Mayo_5289_HNF4A", "Mayo_5289_Cont"))
resMayo5289 head(resMayo5289)
#> log2 fold change (MLE): Treatment_group Mayo_5289_HNF4A vs Mayo_5289_Cont
#> Wald test p-value: Treatment group Mayo 5289 HNF4A vs Mayo 5289 Cont
#> DataFrame with 6 rows and 6 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000227232 176.215965236868 -0.0119495009244944 0.15255273154876
#> ENSG00000238009 4.95585118508252 -0.626000432390719 0.809896379944906
#> ENSG00000237683 23.0539253255181 -0.206194554568681 0.419326773029208
#> ENSG00000241860 26.5900311689865 0.0661684857011655 0.355511952662026
#> ENSG00000228463 25.8307510095804 -0.234755143878033 0.323620257076077
#> ENSG00000237094 1.06267800811557 -3.15503624670329 2.1465733495444
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000227232 -0.0783302980102654 0.937565317169935 0.970065078334923
#> ENSG00000238009 -0.772938918968009 0.439558530357351 NA
#> ENSG00000237683 -0.49172761633877 0.622911912676688 0.794297567450887
#> ENSG00000241860 0.186121690721521 0.852349332369706 0.931857984783944
#> ENSG00000228463 -0.725403119072509 0.468204760148512 0.679428386866806
#> ENSG00000237094 -1.46980127530836 0.141615583670199 NA
The results are not ordered by p value so let’s order the results so that the most significant results are shown at the top of the table.
<- resTKCC22[order(resTKCC22$pvalue),]
resTKCC22.Ordered head(resTKCC22.Ordered)
#> log2 fold change (MLE): Treatment_group TKCC-22_HNF4A vs TKCC-22_Cont
#> Wald test p-value: Treatment_group TKCC-22_HNF4A vs TKCC-22_Cont
#> DataFrame with 6 rows and 6 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000110092 47574.3163285699 -0.934451413369446 0.0663560063758533
#> ENSG00000101076 2441.44886387565 -1.2534535054247 0.0897912061465399
#> ENSG00000167460 13340.8304535922 0.848040750049125 0.0654138250456114
#> ENSG00000197008 692.953231862017 1.15818733365479 0.0893877618088685
#> ENSG00000112419 4645.27612423264 0.814829053165872 0.0655868735363201
#> ENSG00000054983 3591.9079640758 -0.772496907696008 0.0657930175504959
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000110092 -14.0823938088819 4.87307090466001e-45 6.78623854182953e-41
#> ENSG00000101076 -13.9596465981207 2.74793035252734e-44 1.91338390446479e-40
#> ENSG00000167460 12.9642434066164 1.95146374859122e-38 7.47764079018454e-35
#> ENSG00000197008 12.9568892901834 2.14782156834254e-38 7.47764079018454e-35
#> ENSG00000112419 12.4236605471771 1.94456107359187e-35 5.41599150216807e-32
#> ENSG00000054983 -11.7413205299957 7.82549268200408e-32 1.81629685149315e-28
<- resMayo5289[order(resMayo5289$pvalue),]
resMayo5289.Ordered head(resMayo5289.Ordered)
#> log2 fold change (MLE): Treatment_group Mayo_5289_HNF4A vs Mayo_5289_Cont
#> Wald test p-value: Treatment group Mayo 5289 HNF4A vs Mayo 5289 Cont
#> DataFrame with 6 rows and 6 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 2441.44886387565 -2.78867240452894 0.0928799664884088
#> ENSG00000110092 47574.3163285699 -1.5069874076035 0.0661753371004424
#> ENSG00000127831 5402.36212558166 -1.47536222408073 0.0721308861825415
#> ENSG00000155465 1117.788063774 -1.6920701897444 0.0850231608539067
#> ENSG00000101342 496.544981122766 -2.36158629500247 0.123882852836385
#> ENSG00000101311 12900.2341316972 -1.03950771162206 0.0548227604532701
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 -30.0244768593554 4.7036205234925e-198 7.69653426259078e-194
#> ENSG00000110092 -22.7726442151124 8.5621641757204e-115 7.00513462036565e-111
#> ENSG00000127831 -20.453959491736 5.53914538617638e-93 3.02123453180014e-89
#> ENSG00000155465 -19.901285399773 3.96625810089935e-88 1.6224970326254e-84
#> ENSG00000101342 -19.0630603100614 5.11911428173576e-81 1.67528133984085e-77
#> ENSG00000101311 -18.9612435241768 3.56612180336771e-80 9.72540851141765e-77
6.6 Gene Annotation
To annotate gene names, we will use the AnnotationDbi
and org.Hs.eg.db
packages which provide functions to map ensembl Ids to HGNC symbol names. Other annotations are also provided which you can explore.
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
#> [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
#> [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
#> [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
#> [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
#> [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
#> [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
#> [25] "UNIGENE" "UNIPROT"
The following code maps the EnsemblIDs e.g. ENSG00000223972
to gene symbols and EntrezIds. First we obtain the vector of EnsemblIDs found in our resMayo5289.Ordered
data.frame
- EnsemblIDs represent the row names of the data.frame
and can be obtained using the rownames()
function. We then use the mapIds
function to match each EnsemblID to a corresponding gene symbol and assign the results to a new column named symbol
.
#Extract the EnsemblIds
<- rownames(resMayo5289.Ordered)
ensIds #Map the ensIds
$symbol <- mapIds(org.Hs.eg.db,
resMayo5289.Orderedkeys=ensIds,
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
#> 'select()' returned 1:many mapping between keys and columns
$entrez <- mapIds(org.Hs.eg.db,
resMayo5289.Orderedkeys=ensIds,
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
#> 'select()' returned 1:many mapping between keys and columns
head(resMayo5289.Ordered)
#> log2 fold change (MLE): Treatment_group Mayo_5289_HNF4A vs Mayo_5289_Cont
#> Wald test p-value: Treatment group Mayo 5289 HNF4A vs Mayo 5289 Cont
#> DataFrame with 6 rows and 8 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 2441.44886387565 -2.78867240452894 0.0928799664884088
#> ENSG00000110092 47574.3163285699 -1.5069874076035 0.0661753371004424
#> ENSG00000127831 5402.36212558166 -1.47536222408073 0.0721308861825415
#> ENSG00000155465 1117.788063774 -1.6920701897444 0.0850231608539067
#> ENSG00000101342 496.544981122766 -2.36158629500247 0.123882852836385
#> ENSG00000101311 12900.2341316972 -1.03950771162206 0.0548227604532701
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 -30.0244768593554 4.7036205234925e-198 7.69653426259078e-194
#> ENSG00000110092 -22.7726442151124 8.5621641757204e-115 7.00513462036565e-111
#> ENSG00000127831 -20.453959491736 5.53914538617638e-93 3.02123453180014e-89
#> ENSG00000155465 -19.901285399773 3.96625810089935e-88 1.6224970326254e-84
#> ENSG00000101342 -19.0630603100614 5.11911428173576e-81 1.67528133984085e-77
#> ENSG00000101311 -18.9612435241768 3.56612180336771e-80 9.72540851141765e-77
#> symbol entrez
#> <character> <character>
#> ENSG00000101076 HNF4A 3172
#> ENSG00000110092 CCND1 595
#> ENSG00000127831 VIL1 7429
#> ENSG00000155465 SLC7A7 9056
#> ENSG00000101342 TLDC2 140711
#> ENSG00000101311 FERMT1 55612
$symbol <- mapIds(org.Hs.eg.db,
resTKCC22.Orderedkeys=ensIds,
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
#> 'select()' returned 1:many mapping between keys and columns
$entrez <- mapIds(org.Hs.eg.db,
resTKCC22.Orderedkeys=ensIds,
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
#> 'select()' returned 1:many mapping between keys and columns
Exercise 6.1
What do you think the
multiVals
argument in themapIds
function is doing? Add additional annotations to theresMayo5289.Ordered
data.frame.
6.7 Exploring the results
At this stage we often want to get a first glimpse of the results. Let’s first determine how many genes are differentially expressed in the Mayo5289 cells after KD with HNF4A!
#How many genes change significantly between conditions?
sum(resMayo5289.Ordered$pvalue <= 0.05, na.rm=TRUE)
#> [1] 5594
sum(resMayo5289.Ordered$padj <= 0.05, na.rm=TRUE)
#> [1] 3852
#How many are also associated with abs(log2FC) >= 1
sum(resMayo5289.Ordered$padj <=0.05 & abs(resMayo5289.Ordered$log2FoldChange) >= 1, na.rm=TRUE)
#> [1] 368
We might also like to get an idea about the top 10 genes ranked by padj
and LogFC
<- resMayo5289.Ordered[!is.na(resMayo5289.Ordered$padj), ]
resMayo5289.Ordered <- resMayo5289.Ordered[resMayo5289.Ordered$padj <= 0.05 & abs(resMayo5289.Ordered$log2FoldChange) >= 1, ]
sigGenesMayo5289 head(sigGenesMayo5289,3)
#> log2 fold change (MLE): Treatment_group Mayo_5289_HNF4A vs Mayo_5289_Cont
#> Wald test p-value: Treatment group Mayo 5289 HNF4A vs Mayo 5289 Cont
#> DataFrame with 3 rows and 8 columns
#> baseMean log2FoldChange lfcSE
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 2441.44886387565 -2.78867240452894 0.0928799664884088
#> ENSG00000110092 47574.3163285699 -1.5069874076035 0.0661753371004424
#> ENSG00000127831 5402.36212558166 -1.47536222408073 0.0721308861825415
#> stat pvalue padj
#> <numeric> <numeric> <numeric>
#> ENSG00000101076 -30.0244768593554 4.7036205234925e-198 7.69653426259078e-194
#> ENSG00000110092 -22.7726442151124 8.5621641757204e-115 7.00513462036565e-111
#> ENSG00000127831 -20.453959491736 5.53914538617638e-93 3.02123453180014e-89
#> symbol entrez
#> <character> <character>
#> ENSG00000101076 HNF4A 3172
#> ENSG00000110092 CCND1 595
#> ENSG00000127831 VIL1 7429
$symbol[1:10]
sigGenesMayo5289#> ENSG00000101076 ENSG00000110092 ENSG00000127831 ENSG00000155465
#> "HNF4A" "CCND1" "VIL1" "SLC7A7"
#> ENSG00000101342 ENSG00000101311 ENSG00000115641 ENSG00000058085
#> "TLDC2" "FERMT1" "FHL2" "LAMC2"
#> ENSG00000186832 ENSG00000180745
#> "KRT16" "CLRN3"
Exercise 6.2
What do you observe? Do the genes at the top of our list make sense?
We also might want to compare our HNF4A KD results in both the TKCC22 and Mayo5289 PDCLs? Upset plots are a nice way of comparing the relationship between data sets. The UpSetR
package provides a simple set of functions to define data set overlap.
library(UpSetR)
library(ggplot2)
#Set up the Mayo5289 gene list
<- resMayo5289.Ordered[resMayo5289.Ordered$padj <=0.05 & abs(resMayo5289.Ordered$log2FoldChange) >= 1, "symbol"]
sigGenesMayo5289 #Set up the TKCC22 gene list
#Need to remove NAs as above
<- resTKCC22.Ordered[!is.na(resTKCC22.Ordered$padj), ]
resTKCC22.Ordered <- resTKCC22.Ordered[resTKCC22.Ordered$padj <=0.05 & abs(resTKCC22.Ordered$log2FoldChange) >= 1, "symbol"]
sigGenesTKCC22 #Create a list of genes
<- list(TKCC22 = sigGenesTKCC22 , Mayo5289 = sigGenesMayo5289)
listInput #Create the plot
upset(fromList(listInput), order.by = "freq")

Figure 6.1: UpSet plot showing the relationship between the TKCC22 and Mayo-5289 experiments
Exercise 6.3
What are the genes in common? Hint: try the
interesect
function!
6.8 Exporting results
Exporting results is straight forward! Results can also be stored in an .RData
file for later use!
write.csv(as.data.frame(resMayo5289.Ordered), file = "resMayo5289.Ordered_results.csv")
save(dds, resMayo5289.Ordered, resTKCC22.Ordered, file = "PDCL_HNF4A_siRNA_Experiment.RData")
6.9 Data transformation and visualisation
For downstream analyses – e.g. for visualization or clustering – we need to work with transformed versions of the count data. There are a number of methods to transform count data provided in the DESeq2
package and include variance stabilizing transformations (vst) and regularized logarithm (rlog).
Both the vst
and rlog
transformations remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. The vst
and rlog
transformations produce transformed data on the log2 scale which has been normalized with respect to library size and/or other normalization factors. The assay
function returns the results of the transformation as a matrix of transformed values.
<- vst(dds, blind=FALSE)
vsd <- rlog(dds, blind=FALSE)
rld head(assay(vsd), 3)
#> HB.A1 HB.A3 HB.B1 HB.B3 HB.C1 HB.C3
#> ENSG00000227232 8.566844 8.611630 8.585389 8.633558 8.530630 8.541801
#> ENSG00000238009 7.239080 7.321780 7.438567 7.012016 7.263053 7.178992
#> ENSG00000237683 7.239080 7.351215 7.627718 7.414596 7.343795 7.402633
#> HB.D1 HB.D3 HB.E1 HB.E3 HB.F1 HB.F3
#> ENSG00000227232 8.688812 8.668386 8.761663 8.676995 8.490305 8.584954
#> ENSG00000238009 7.330801 7.288113 7.341380 7.289009 7.349144 7.252707
#> ENSG00000237683 7.833839 7.755957 7.742276 7.768123 7.794813 7.687399
6.9.1 PCA using transformed data
A well-established technique for visualization and extraction of relevant information is the popular Principal Component Analysis (PCA). PCA allows us to determine the relationship between samples and can provide valuable insights into the veracity of the results.
library(ggplot2)
<- plotPCA(vsd, intgroup=c("Treatment_group", "Batch"), returnData=TRUE)
pcaData <- round(100 * attr(pcaData, "percentVar"))
percentVar ggplot(pcaData, aes(PC1, PC2, color=Treatment_group, shape=as.factor(Batch))) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))

Figure 6.2: Principal Component Analysis (PCA) plot showing the relationship between the TKCC22 and Mayo-5289 experiments
6.9.2 Heatmaps using transformed data
Using vsd
transformed data let’s generate a heatmap showing a selection of genes that are differentially expressed after HNF4A KD in the Mayo5289 cell line.
#Looking at the vsd matrix it comprises EnsemblIds and not gene symbols as row names!
head(assay(vsd), 3)
#> HB.A1 HB.A3 HB.B1 HB.B3 HB.C1 HB.C3
#> ENSG00000227232 8.566844 8.611630 8.585389 8.633558 8.530630 8.541801
#> ENSG00000238009 7.239080 7.321780 7.438567 7.012016 7.263053 7.178992
#> ENSG00000237683 7.239080 7.351215 7.627718 7.414596 7.343795 7.402633
#> HB.D1 HB.D3 HB.E1 HB.E3 HB.F1 HB.F3
#> ENSG00000227232 8.688812 8.668386 8.761663 8.676995 8.490305 8.584954
#> ENSG00000238009 7.330801 7.288113 7.341380 7.289009 7.349144 7.252707
#> ENSG00000237683 7.833839 7.755957 7.742276 7.768123 7.794813 7.687399
<- assay(vsd)
expMatrix #To make the Heatmap we will select genes based on our significant cutoffs
<- resMayo5289.Ordered[!is.na(resMayo5289.Ordered$symbol),]
resMayo5289.Ordered
<- subset(resMayo5289.Ordered,
resMayo5289.select <= 1e-20 &
padj abs(resMayo5289.Ordered$log2FoldChange) >= 1)
#Map the significant results to the transformed expression values
<- intersect(rownames(expMatrix), rownames(resMayo5289.select))
inter head(inter)
#> [1] "ENSG00000162545" "ENSG00000137959" "ENSG00000198758" "ENSG00000134193"
#> [5] "ENSG00000198483" "ENSG00000163209"
#Subset the matrix to include only the significant genes of interest
<- expMatrix[inter, ]
expMatrix.select dim(expMatrix.select)
#> [1] 82 12
head(expMatrix.select)
#> HB.A1 HB.A3 HB.B1 HB.B3 HB.C1 HB.C3
#> ENSG00000162545 10.895563 10.600985 10.910440 10.657951 10.842195 10.633991
#> ENSG00000137959 7.387884 7.580602 7.246242 7.513228 7.292596 7.467657
#> ENSG00000198758 10.109697 9.496844 10.195344 9.511486 10.207478 9.498285
#> ENSG00000134193 7.906800 7.876837 7.934171 7.830890 7.870031 7.882784
#> ENSG00000198483 7.172658 7.289183 7.203331 7.280885 7.012016 7.216462
#> ENSG00000163209 7.332804 7.378253 7.416824 7.280885 7.292596 7.130119
#> HB.D1 HB.D3 HB.E1 HB.E3 HB.F1 HB.F3
#> ENSG00000162545 11.092906 10.173745 11.041170 10.154572 11.05369 10.048354
#> ENSG00000137959 10.489250 11.855465 10.855140 12.085880 10.74541 11.910030
#> ENSG00000198758 11.218128 10.317309 11.248313 10.373998 11.20353 10.299625
#> ENSG00000134193 11.622704 10.881198 11.747548 10.706791 11.75830 10.660777
#> ENSG00000198483 8.229145 9.195484 8.268459 9.225827 8.18042 9.111117
#> ENSG00000163209 10.769595 9.569257 10.669877 9.599853 10.55096 9.571898
Let’s now finish off the matrix by swapping the Ensembl IDs for the gene symbols - this will make our heatmap biologist friendly! We also want to limit the matrix to the Mayo5289 HNF4A KD experiment!
#Add human readable gene names
rownames(expMatrix.select) <- mapIds(org.Hs.eg.db,
keys=rownames(expMatrix.select),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
#> 'select()' returned 1:1 mapping between keys and columns
head(rownames(expMatrix.select))
#> ENSG00000162545 ENSG00000137959 ENSG00000198758 ENSG00000134193
#> "CAMK2N1" "IFI44L" "EPS8L3" "REG4"
#> ENSG00000198483 ENSG00000163209
#> "ANKRD35" "SPRR3"
#Select the Mayo5289 columns
<- expMatrix.select[, grepl("Mayo_5289_HNF4A|Mayo_5289_Cont",colData(dds)$Treatment_group)]
expMatrix.select #Give the matrix some useful column names
colnames(expMatrix.select) <- colData(dds)[colnames(expMatrix.select),"Treatment_group"]
head(expMatrix.select)
#> Mayo_5289_Cont Mayo_5289_HNF4A Mayo_5289_Cont Mayo_5289_HNF4A
#> CAMK2N1 11.092906 10.173745 11.041170 10.154572
#> IFI44L 10.489250 11.855465 10.855140 12.085880
#> EPS8L3 11.218128 10.317309 11.248313 10.373998
#> REG4 11.622704 10.881198 11.747548 10.706791
#> ANKRD35 8.229145 9.195484 8.268459 9.225827
#> SPRR3 10.769595 9.569257 10.669877 9.599853
#> Mayo_5289_Cont Mayo_5289_HNF4A
#> CAMK2N1 11.05369 10.048354
#> IFI44L 10.74541 11.910030
#> EPS8L3 11.20353 10.299625
#> REG4 11.75830 10.660777
#> ANKRD35 8.18042 9.111117
#> SPRR3 10.55096 9.571898
We now make the heatmap using the ComplexHeatmap
package.
library(ComplexHeatmap)
library(circlize)
#Define some colours for the annotation
<- RColorBrewer::brewer.pal(8, "Set1")
colors #Make the sample annotation track
<- HeatmapAnnotation(Treatment = colnames(expMatrix.select),
ha col = list(Treatment = structure(names = c(
"Mayo_5289_Cont", "Mayo_5289_HNF4A"
3:4])))
), colors[#Scale the matrix to generate z-scores
<- t(scale(t(expMatrix.select)))
expMatrix.select.sc #Generate the heatmap
Heatmap(
expMatrix.select.sc,col = colorRamp2(c(-2, 0, 2), c("#377EB8", "white", "#E41A1C")),
name = "Z score",
cluster_columns = TRUE,
cluster_rows = TRUE,
top_annotation = ha,
show_row_names = TRUE,
show_column_names = FALSE, #Turned off column names
row_names_gp = gpar(fontsize = 5),
column_names_gp = gpar(fontsize = 5)
)

Figure 6.3: Heatmap showing differentially expressed genes in Mayo-5289 after HNF4A KD
6.10 Gene Set Enrichment Analysis
Over-representation and/or Gene Set Enrichment Analysis (GSEA) are essential tasks in any RNAseq study. These analyses help to define biological functions and/or processes that are enriched in different biological samples. In this section we will use the clusterProfiler
package to perform over-representation analysis to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs). A comprehensive overview of the theory and methods used in this section is provided at the following location clusterProfiler-book or by typing browseVignettes("clusterProfiler")
at the R prompt. Please read through the clusterProfiler
documentation.
First load the clusterProfiler
library into the current environment.
library(clusterProfiler)
For over-representation analysis, we require a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with the DESeq2
package). The functions that we will be using require Entrez IDs as input. We define a vector of significantly and differentially expressed Entrez IDs (genes) defining our experiment as follows:
<- subset(resMayo5289.Ordered,
resMayo5289.enriched <= 0.05) padj
6.10.1 KEGG
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. Our analysis will determine whether our experimentally derived gene list is enriched in any one of the curated gene sets defined by KEGG. To do this we will use the enrichKEGG
function.
#Perform the over-representation analysis using keggEnrich
<- enrichKEGG(gene = resMayo5289.enriched$entrez,
keggEnrich organism = 'hsa',
pvalueCutoff = 0.05)
head(keggEnrich)
#> ID Description GeneRatio BgRatio
#> hsa04390 hsa04390 Hippo signaling pathway 52/1613 157/8094
#> hsa05418 hsa05418 Fluid shear stress and atherosclerosis 47/1613 139/8094
#> hsa05222 hsa05222 Small cell lung cancer 34/1613 92/8094
#> hsa04068 hsa04068 FoxO signaling pathway 44/1613 131/8094
#> hsa04115 hsa04115 p53 signaling pathway 28/1613 73/8094
#> hsa00600 hsa00600 Sphingolipid metabolism 21/1613 49/8094
#> pvalue p.adjust qvalue
#> hsa04390 5.990875e-05 0.01125709 0.008723608
#> hsa05418 7.560810e-05 0.01125709 0.008723608
#> hsa05222 1.043490e-04 0.01125709 0.008723608
#> hsa04068 1.507906e-04 0.01125709 0.008723608
#> hsa04115 2.012321e-04 0.01125709 0.008723608
#> hsa00600 2.071857e-04 0.01125709 0.008723608
#> geneID
#> hsa04390 595/166824/7474/84962/122786/330/652/7042/6657/6591/80326/71/3398/84612/1495/27113/1741/7532/25937/4092/4088/999/1490/5516/10413/1499/7476/7533/332/81029/5522/9231/8312/374/60/7534/23513/7004/329/1856/3996/2736/51176/7046/8324/60485/1454/5520/650/2535/7855/7473
#> hsa05418 7850/4259/652/5154/858/71/2948/207/3162/7295/445/91860/801/8878/6300/3685/5747/3320/5879/1499/5599/7157/2938/9446/5602/60/6382/5562/5155/4780/1843/90/5563/10365/6613/1728/3326/2946/8503/6416/1535/9181/7422/805/3725/5600/4217
#> hsa05222 595/3918/3909/1026/3673/3675/330/4616/1647/1284/207/3914/7187/6257/3685/598/1021/51426/3688/5747/54205/3913/836/7185/7157/1869/3655/329/1287/331/5728/8503/6502/1027
#> hsa04068 595/1026/80854/901/7042/4616/9455/5106/1647/207/5565/604/6789/6300/4088/1263/6774/5599/10769/3276/6446/8743/4193/3643/3667/10110/5602/1956/5562/664/7046/5563/10365/6794/92579/1454/5728/8503/85417/51422/2308/6502/1027/5600
#> hsa04115 595/1026/901/2810/64065/3732/4616/8795/5268/1647/27113/9540/598/1021/637/54205/64393/836/7157/8797/4193/50484/84883/3486/55240/5728/7249/51246
#> hsa00600 2581/29956/8612/123099/57704/8877/9517/64781/81537/55627/2717/410/6609/8611/9514/8879/8560/2629/79603/55331/56848
#> Count
#> hsa04390 52
#> hsa05418 47
#> hsa05222 34
#> hsa04068 44
#> hsa04115 28
#> hsa00600 21
#Make the output biologist friendly
<-
keggEnrichRead setReadable(keggEnrich, OrgDb = org.Hs.eg.db, keyType = "ENTREZID")
head(keggEnrichRead, 55)
#> ID Description GeneRatio BgRatio
#> hsa04390 hsa04390 Hippo signaling pathway 52/1613 157/8094
#> hsa05418 hsa05418 Fluid shear stress and atherosclerosis 47/1613 139/8094
#> hsa05222 hsa05222 Small cell lung cancer 34/1613 92/8094
#> hsa04710 hsa04710 Circadian rhythm 13/1613 31/8094
#> hsa05225 hsa05225 Hepatocellular carcinoma 48/1613 168/8094
#> hsa01230 hsa01230 Biosynthesis of amino acids 25/1613 75/8094
#> pvalue p.adjust qvalue
#> hsa04390 5.990875e-05 0.01125709 0.008723608
#> hsa05418 7.560810e-05 0.01125709 0.008723608
#> hsa05222 1.043490e-04 0.01125709 0.008723608
#> hsa04710 4.176224e-03 0.04644364 0.035991197
#> hsa05225 4.194968e-03 0.04644364 0.035991197
#> hsa01230 4.273955e-03 0.04644364 0.035991197
#> geneID
#> hsa04390 CCND1/RASSF6/WNT5A/AJUBA/FRMD6/BIRC3/BMP4/TGFB2/SOX2/SNAI2/WNT10A/ACTG1/ID2/PARD6B/CTNNA1/BBC3/DLG3/YWHAG/WWTR1/SMAD7/SMAD3/CDH1/CCN2/PPP2CB/YAP1/CTNNB1/WNT7A/YWHAH/BIRC5/WNT5B/PPP2R2C/DLG5/AXIN1/AREG/ACTB/YWHAZ/SCRIB/TEAD4/BIRC2/DVL2/LLGL1/GLI2/LEF1/TGFBR1/FZD7/SAV1/CSNK1E/PPP2R2A/BMP2/FZD2/FZD5/WNT3
#> hsa05418 IL1R2/MGST3/BMP4/PDGFA/CAV2/ACTG1/GSTM4/AKT1/HMOX1/TXN/ASS1/CALML4/CALM1/SQSTM1/MAPK12/ITGAV/PTK2/HSP90AA1/RAC1/CTNNB1/MAPK8/TP53/GSTA1/GSTO1/MAPK10/ACTB/SDC1/PRKAA1/PDGFB/NFE2L2/DUSP1/ACVR1/PRKAA2/KLF2/SUMO2/NQO1/HSP90AB1/GSTM2/PIK3R3/MAP2K4/CYBA/ARHGEF2/VEGFA/CALM2/JUN/MAPK11/MAP3K5
#> hsa05222 CCND1/LAMC2/LAMA3/CDKN1A/ITGA2/ITGA3/BIRC3/GADD45B/GADD45A/COL4A2/AKT1/LAMB3/TRAF3/RXRB/ITGAV/BCL2L1/CDK6/POLK/ITGB1/PTK2/CYCS/LAMB2/CASP3/TRAF1/TP53/E2F1/ITGA6/BIRC2/COL4A5/XIAP/PTEN/PIK3R3/SKP2/CDKN1B
#> hsa04710 BHLHE41/PRKAB2/PER3/CREB1/PER2/PER1/NR1D1/PRKAA1/CRY1/PRKAA2/RORC/CSNK1E/PRKAG2
#> hsa05225 CCND1/CDKN1A/MGST3/WNT5A/FRAT2/TGFB2/GADD45B/WNT10A/TGFA/ACTG1/GADD45A/GSTM4/AKT1/HMOX1/MET/TXNRD1/BCL2L1/CDK6/POLK/SMAD3/SMARCE1/CTNNB1/WNT7A/WNT5B/TP53/AXIN1/E2F1/SMARCC1/GSTA1/GSTO1/BRD7/ACTB/EGFR/GAB1/DVL2/NFE2L2/LEF1/TGFBR1/NQO1/FZD7/PTEN/GSTM2/PIK3R3/FZD2/SMARCD2/FZD5/SMARCD3/WNT3
#> hsa01230 PSAT1/GLUL/PFKL/GPT2/ACO2/ACO1/CBS/ASS1/SHMT1/PHGDH/RPIA/ASNS/ALDOB/TKT/IDH3A/IDH1/IDH2/PSPH/IDH3G/CS/ENO2/ENO1/PFKP/PC/PYCR3
#> Count
#> hsa04390 52
#> hsa05418 47
#> hsa05222 34
#> ...<24 more rows>...
#> hsa04710 13
#> hsa05225 48
#> hsa01230 25
6.10.2 REACTOME
Like KEGG, the REACTOME database contains curated gene set annotations covering diverse molecular and cellular biology pathways. Similarly, our analysis will determine whether our experimentally derived gene list is enriched in any one of the curated gene sets defined by REACTOME. To do this we use the enrichPathway
function provided by the ReactomePA
package.
library(ReactomePA)
<- enrichPathway(
reactomeEnrich gene = resMayo5289.enriched$entrez,
organism = "human",
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
qvalueCutoff = 0.2,
minGSSize = 10,
maxGSSize = 500,
readable = TRUE
)head(reactomeEnrich, 5)
#> ID Description
#> R-HSA-1059683 R-HSA-1059683 Interleukin-6 signaling
#> R-HSA-112409 R-HSA-112409 RAF-independent MAPK1/3 activation
#> R-HSA-5357801 R-HSA-5357801 Programmed Cell Death
#> R-HSA-109581 R-HSA-109581 Apoptosis
#> R-HSA-5633008 R-HSA-5633008 TP53 Regulates Transcription of Cell Death Genes
#> GeneRatio BgRatio pvalue p.adjust qvalue
#> R-HSA-1059683 9/2199 11/10619 2.521779e-05 0.01521919 0.01419629
#> R-HSA-112409 14/2199 23/10619 3.112841e-05 0.01521919 0.01419629
#> R-HSA-5357801 60/2199 179/10619 3.891784e-05 0.01521919 0.01419629
#> R-HSA-109581 59/2199 176/10619 4.485219e-05 0.01521919 0.01419629
#> R-HSA-5633008 21/2199 44/10619 5.518197e-05 0.01521919 0.01419629
#> geneID
#> R-HSA-1059683 SOCS3/IL6ST/STAT1/TYK2/IL6R/STAT3/JAK2/PTPN11/JAK1
#> R-HSA-112409 IL6ST/DUSP4/TYK2/IL6R/JAK2/DUSP6/PTPN11/DUSP10/DUSP7/DUSP1/JAK1/PEA15/DUSP2/DUSP5
#> R-HSA-5357801 CASP7/SFN/DSG3/BIRC3/TJP2/TNFRSF10B/PLEC/PSMB8/UACA/AKT1/MAGED1/NMT1/BBC3/BCL2L1/YWHAG/BID/CDH1/PTK2/TLR4/PSMC3/CTNNB1/RIPK1/CYCS/STAT3/MAPK8/PSME2/YWHAH/GSN/CASP3/DSP/PSMC4/H1F0/TP53/UNC5A/TNFRSF10A/TNFSF10/PSME1/E2F1/APIP/PPP3CC/PSMB9/YWHAZ/BMF/BIRC2/SPTAN1/ARHGAP10/APPL1/XIAP/UBB/HIST1H1C/DBNL/TJP1/FADD/PSMD7/TICAM1/PRKCQ/TP63/HIST1H1E/LMNB1/PSMD1
#> R-HSA-109581 CASP7/SFN/DSG3/TJP2/TNFRSF10B/PLEC/PSMB8/UACA/AKT1/MAGED1/NMT1/BBC3/BCL2L1/YWHAG/BID/CDH1/PTK2/TLR4/PSMC3/CTNNB1/RIPK1/CYCS/STAT3/MAPK8/PSME2/YWHAH/GSN/CASP3/DSP/PSMC4/H1F0/TP53/UNC5A/TNFRSF10A/TNFSF10/PSME1/E2F1/APIP/PPP3CC/PSMB9/YWHAZ/BMF/BIRC2/SPTAN1/ARHGAP10/APPL1/XIAP/UBB/HIST1H1C/DBNL/TJP1/FADD/PSMD7/TICAM1/PRKCQ/TP63/HIST1H1E/LMNB1/PSMD1
#> R-HSA-5633008 PERP/TNFRSF10B/TP53INP1/BCL6/BBC3/TP53I3/BID/PRELID1/CASP10/CASP1/BIRC5/TP53/BNIP3L/TNFRSF10A/NDRG1/AIFM2/IGFBP3/STEAP3/RABGGTB/TP63/TNFRSF10D
#> Count
#> R-HSA-1059683 9
#> R-HSA-112409 14
#> R-HSA-5357801 60
#> R-HSA-109581 59
#> R-HSA-5633008 21
6.10.3 Plotting functions
Once we have our results, we would like to generate publication quality figures. The clusterProfiler
and enrichPlot
packages provide a number of plotting functions that make this easy.
Dotplot
library(enrichplot)
<- dotplot(keggEnrich, showCategory = 10) + ggtitle("KEGG")
plot1 #> wrong orderBy parameter; set to default `orderBy = "x"`
<-
plot2 dotplot(reactomeEnrich, showCategory = 10) + ggtitle("REACTOME")
#> wrong orderBy parameter; set to default `orderBy = "x"`
plot_grid(plot1, plot2, ncol = 2)

Figure 6.4: Dotplot showing KEGG and REACTOME pathways that are significantly enriched.
Upset plot
An upsetplot
shows the overlap of genes comprising different gene sets and provides a complex overview of genes and gene sets significantly enriched in the experiment.
upsetplot(reactomeEnrich)

Figure 6.5: UpSet plot showing the overlap of genes comprising significantly enriched processes and pathways.
Enrichment map plot
An enrichment map plot emapplot
generates a network of significantly enriched gene sets with edges connecting overlapping gene sets.
emapplot(reactomeEnrich)

Figure 6.6: Enrichment map of Reactome pathways significantly enriched.
6.10.4 Extracting enrichment results to generate heatmaps
The above analysis showed that our experimentally derived gene list is significantly enriched in the “Transcriptional Regulation by TP53” REACTOME gene set. Because we have selected genes that are both up and down regulated following HNF4A KD, we would like to understand how specific genes, within this pathway, are modulated by HNF4A. A heatmap is an informative way of determining the relative expression of sets of genes, so let’s generate a heatmap using the set of genes comprising this pathway. First, we need to get the data into the correct shape as follows:
slotNames(reactomeEnrich)
#> [1] "result" "pvalueCutoff" "pAdjustMethod" "qvalueCutoff"
#> [5] "organism" "ontology" "gene" "keytype"
#> [9] "universe" "gene2Symbol" "geneSets" "readable"
<- reactomeEnrich@result
reactResults <- subset(reactResults,
tp53genes == "Transcriptional Regulation by TP53")$geneID
Description <- strsplit(tp53genes, "/")[[1]]
tp53genes.vector <- assay(vsd)
expMatrix rownames(expMatrix) <- mapIds(
org.Hs.eg.db,keys = rownames(expMatrix),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first"
)#> 'select()' returned 1:many mapping between keys and columns
#Map the significant results to the transformed expression values
<- intersect(rownames(expMatrix), tp53genes.vector)
inter <- expMatrix[inter,]
expMatrix.tp53 #Add human readable gene names
<- expMatrix.tp53[, grepl("Mayo_5289_HNF4A|Mayo_5289_Cont",
expMatrix.tp53 colData(dds)$Treatment_group)]
#Give the matrix some useful column names
colnames(expMatrix.tp53) <-
colData(dds)[colnames(expMatrix.tp53), "Treatment_group"]
Notice that we have used the
strsplit
function to split the character stringtp53genes
into a vector of gene symbols!
Now produce the heatmap using the ComplexHeatmap
package:
#Setup the annotation showing treatment type
<- HeatmapAnnotation(Treatment = colnames(expMatrix.tp53),
ha col = list(Treatment = structure(
names = c("Mayo_5289_Cont", "Mayo_5289_HNF4A"), colors[3:4]
)))#Scale the matrix to generate z-scores
<- t(scale(t(expMatrix.tp53)))
expMatrix.tp53.sc #Generate the heatmap
Heatmap(
expMatrix.tp53.sc,col = colorRamp2(c(-2, 0, 2), c("#377EB8", "white", "#E41A1C")),
name = "Z score",
cluster_columns = TRUE,
cluster_rows = TRUE,
top_annotation = ha,
show_row_names = TRUE,
show_column_names = FALSE,
#Turned off column names
row_names_gp = gpar(fontsize = 5),
column_names_gp = gpar(fontsize = 5)
)

Figure 6.7: Heatmap showing differentially expressed genes in the Transcriptional Regulation by TP53
Reactome pathway