There are two main ways to process RNA-Seq reads: 1. Genome alignment. 2. Transcriptome mapping. In most cases, transcriptome mapping (e.g., kallisto or Salmon) is faster; however, the RNA-Seq genome aligner Rsubread - when paired with featureCounts for assigning reads to genomic features - can approach the computing time of transcriptome mappers. In another tutorial, I will guide you through transcriptome mapping - but here, we are going to use R for end-to-end analysis: from raw reads to differential gene expression, ontology and pathway enrichment, and transcription factor activity characterization.
First, let’s load the packages we will use.
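Here is a representative set covering the functions used below (install any missing packages from CRAN or Bioconductor first):
library(Rsubread)      # alignment, index building, featureCounts
library(edgeR)         # DGEList, filtering, normalization
library(limma)         # voom, linear modeling, camera, treat
library(Glimma)        # interactive MDS/MD plots
library(fastqcr)       # run and parse FastQC from R
library(rseqR)         # Trim Galore! and MultiQC wrappers
library(RColorBrewer)  # color palettes for plots
library(data.table)    # fast table manipulation
library(dplyr)         # data manipulation
library(tibble)        # as_tibble(), deframe()
library(msigdbr)       # MSigDB gene sets
library(fgsea)         # fast GSEA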
Then, let’s define the directory where our fastq files live:
base_dir <- "~/Path/To/Files"
Next, let’s import our sample metadata: a simple table with columns such as sample and condition.
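For illustration, Sample_info.txt might look like this (the sample names here are hypothetical; the condition labels match those used below):
sample      condition
sampleA_1   condition1
sampleA_2   condition1
sampleB_1   condition2
sampleB_2   condition2
sampleC_1   condition3
sampleC_2   condition3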
samples <- read.table(file.path(base_dir, "Sample_info.txt"), header = TRUE, stringsAsFactors=FALSE)
samples
The first step will be to pre-process our reads. To do this, we can use an R wrapper for Trim Galore! (rseqR::trim_fastq). Here, we’re working with paired-end reads whose files end in “1.fq.gz” and “2.fq.gz”. Let’s import them and make sure that each sample has a pair of files.
fastq1 <- list.files(path = base_dir, pattern = "1\\.fq\\.gz$", full.names = TRUE)
fastq2 <- list.files(path = base_dir, pattern = "2\\.fq\\.gz$", full.names = TRUE)
all.equal(length(fastq1),length(fastq2))
Now let’s run FastQC on our unprocessed reads, and save the results in a new directory within our base directory.
fastqc_dir <- "~/Path/To/Files"
fastqc(fq.dir = base_dir, qc.dir = fastqc_dir, threads = 4)
Next, we can aggregate all of the FastQC results and view them:
qc <- qc_aggregate(fastqc_dir)
qc
Let’s look at the statistics available for one file:
qc <- qc_read(file.path(fastqc_dir, "file_fastqc.zip"))
names(qc)
Let’s plot some of the statistics:
par(mfrow=c(2,2))
qc_plot(qc, "Per base sequence quality")
qc_plot(qc, "Per sequence quality scores")
qc_plot(qc, "Per base sequence content")
qc_plot(qc, "sequence length distribution")
We’re ready to trim our reads.
rseqR::trim_fastq(fastq1, fastq2,
illumina = TRUE,
minlength = 90,
minqual = 30,
trimN = TRUE,
retainUnpaired = FALSE,
dest.dir = "./TRIMMED_FASTQC",
trimgalore = "trim_galore")
Next, we can aggregate the FastQC reports with MultiQC:
run_multiqc(fastqc_dir, "./MULTIQC", multiqc = "multiqc")
After viewing the MultiQC output, we can move on to importing the trimmed fastq files.
trimmed_dir <- "./TRIMMED_FASTQC"
reads1 <- list.files(path = trimmed_dir, pattern = "1\\.fq\\.gz$", full.names = TRUE)
reads2 <- list.files(path = trimmed_dir, pattern = "2\\.fq\\.gz$", full.names = TRUE)
all.equal(length(reads1),length(reads2))
Next, we need to download the reference genome. hg19 and hg38 are the most commonly used. Which is best? It depends on what tools you want to use downstream! You can build your own Rsubread genome index, or obtain a precompiled one. I have generated hg19 from the 1000 Genomes Project (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz) and hg38 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). Let’s assume you want to build an hg19 index. You can set the memory to any amount your computing resources can supply.
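If you’d like to fetch the hg19 FASTA from within R, here is one option (a sketch, assuming the R.utils package is available for decompression):
# Allow a long download - the reference FASTA is large (hundreds of MB compressed)
options(timeout = 3600)
download.file("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz",
              destfile = "human_g1k_v37.fasta.gz")
# Decompress so buildindex() can read the FASTA
R.utils::gunzip("human_g1k_v37.fasta.gz")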
Rsubread::buildindex(basename = "hg19_g1k",
                     reference = "human_g1k_v37.fasta",
                     memory = 3600)
Now that we have an index built, we’re ready to align our reads with Rsubread.
Rsubread::align(index = "hg19_g1k",
readfile1 = reads1,
readfile2 = reads2,
type = "rna",
input_format = "gzFASTQ",
output_format = "BAM",
PE_orientation = "rf",
nthreads = 6)
Let’s view our bam files.
trimmed_dir <-"~/Path/To/Files"
bam.files <- list.files(trimmed_dir, pattern = "BAM$")
bam.files
And determine the proportion of mapped reads.
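Rsubread provides propmapped() for this:
# Proportion of mapped fragments per BAM file
props <- Rsubread::propmapped(bam.files)
props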
Now that we’ve aligned our reads, we can use featureCounts to assign reads to genes from the genomic alignment, using Rsubread’s inbuilt annotation.
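A minimal call (assuming paired-end libraries; adjust isPairedEnd and nthreads to your data and hardware):
fc <- Rsubread::featureCounts(bam.files,
                              annot.inbuilt = "hg19",
                              isPairedEnd = TRUE,
                              nthreads = 6)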
Let’s take a look at the statistics of our counts
fc$stat
samples$samplename <- colnames(fc$counts)
samples
Now we are ready for differential expression. First, we will create a DGElist object from our FeatureCounts output.
y <- DGEList(counts = fc$counts,
             lib.size = colSums(fc$counts),
             norm.factors = calcNormFactors(fc$counts),
             samples = data.frame(samplename = samples$samplename),
             group = samples$condition)
Since we used the inbuilt hg19 annotation, the rownames of our count matrix are Entrez IDs. Let’s get the gene symbol for each Entrez ID (where available) and attach it to our DGEList object “y”.
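One way to do this, assuming the org.Hs.eg.db annotation package is installed:
library(org.Hs.eg.db)
# Map Entrez IDs (the rownames of the count matrix) to gene symbols
y$genes <- data.frame(
  ENTREZID = rownames(y),
  SYMBOL = AnnotationDbi::mapIds(org.Hs.eg.db,
                                 keys = rownames(y),
                                 keytype = "ENTREZID",
                                 column = "SYMBOL"))
These ENTREZID and SYMBOL columns carry through voom and lmFit, which we rely on later (e.g., v$genes$ENTREZID and efit$genes$SYMBOL).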
Next, we will use edgeR to filter lowly expressed genes, and then take a look at the count density for each sample before and after filtering.
# View the library size for each sample
y$samples
# Load a color palette of 16 colors to be used for plots
myPalette <- c(brewer.pal(8, "Set1"), brewer.pal(8, "Set2"))
# convert counts to cpm and log
unfilteredExpr <- cpm(y, log=T)
# Filter lowly expressed genes via edgeR
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
# Calculate normalization factors
y <- calcNormFactors(y)
# Compute log-CPM of the filtered counts
filteredExpr <- cpm(y, log=T)
# Plot the density of unfiltered and filtered gene expression for all samples within groups
par(mfrow=c(2,2))
plotDensities(unfilteredExpr, group=samples$condition, col=myPalette[1:8])
plotDensities(filteredExpr, group=samples$condition, col=myPalette[1:8])
boxplot(unfilteredExpr, las=2, main="Unfiltered")
boxplot(filteredExpr, las=2, main="Filtered")
How do the samples within conditions correlate with each other in expression space, and how do the conditions differ? A multidimensional scaling (MDS) plot, which is akin to a PCA plot, can show us.
group <- factor(samples$condition,levels=c("condition1","condition2","condition3"))
# View an MDS plot
plotMDS(filteredExpr, labels=group, col=as.numeric(group))
You can also create an interactive MDS plot with Glimma
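For example, Glimma’s glMDSPlot() writes an HTML page you can explore in a browser:
Glimma::glMDSPlot(filteredExpr,
                  labels = samples$samplename,
                  groups = samples$condition)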
Now we’re ready to create an experimental design and contrast matrix for differential expression analysis. We are interested in comparing the two Presenilin1 mutations - condition2 and condition3 - relative to condition1, so we will first create the different conditions within the overall group, and then create a contrast matrix specifying these two comparisons (condition2 relative to condition1 and condition3 relative to condition1).
# Create the sample table
sampleTable <- data.frame(condition = factor(samples$condition))
# Create the group and design - rename your conditions!
group <- factor(samples$condition, levels = c("condition1", "condition2", "condition3"))
# Create the design model matrix. Here it's just by group, but if you have different batches, you may want to define batches and use ~0+group+batch
design <- model.matrix(~0 + group, data = sampleTable)
# Rename the colnames
colnames(design) <- levels(group)
# Create the contrast matrix for DE
contr.matrix <- makeContrasts(condition2vcondition1 = condition2 - condition1,
                              condition3vcondition1 = condition3 - condition1,
                              levels = colnames(design))
# View your contrast matrix
contr.matrix
Now, we will apply the voom method (Law et al. 2014), variance modeling at the observational level, to our count data and fit with limma’s linear model approach. We can run either voom() or voomWithQualityWeights(); the latter is particularly useful when the library sizes of the samples differ substantially within an experiment.
par(mfrow=c(2,1))
v <- voomWithQualityWeights(y, design=design, plot=TRUE)
# fit the linear model
fit <- lmFit(v, design)
# apply your contrasts
cfit <- contrasts.fit(fit, contrasts=contr.matrix)
# eBayes method
efit <- eBayes(cfit)
plotSA(efit, main="Final model: Mean-variance trend")
How many genes are differentially expressed at a FDR=0.05?
summary(decideTests(efit))
We can also use the treat method to determine differentially expressed genes; it recalculates the p-values and t-statistics of each gene relative to a logFC threshold. If we want to require a fold change of at least 1.5 (about 0.58 in log2 space), we can do so as follows.
tfit <- treat(cfit, lfc=log2(1.5))
# See how many DEGs in each condition
dt <- decideTests(tfit)
summary(dt)
How many DEGs overlap between condition2 and condition3 relative to condition1?
de.common <- which(dt[,1]!=0 & dt[,2]!=0)
length(de.common)
# Make a vennDiagram of the overlap of DEGs between conditions
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon", "blue"))
Let’s first make a volcano plot of the condition2 vs condition1 contrast:
volcanoplot(efit, coef=1,style = "p-value", highlight = 100, names = efit$genes$SYMBOL, hl.col="blue", xlab = "Log2 Fold Change")
Next, we can create an interactive MD (mean-difference) plot with Glimma:
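A sketch using glMDPlot() on the treat fit (it picks up gene annotation from tfit$genes; dt is the decideTests matrix from above):
Glimma::glMDPlot(tfit, coef = 1, status = dt,
                 counts = cpm(y, log = TRUE),
                 groups = group,
                 side.main = "SYMBOL",
                 main = colnames(tfit)[1])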
There are two main types of enrichment analysis: 1. Over-representation/self-contained tests (e.g., DAVID, gProfiler, Metascape, fry), which ask: are the genes in my list overrepresented in a given gene set (e.g., Reactome: Cell Cycle, or the TF targets of E2F1)? 2. Competitive tests (e.g., GSEA, camera, PantherDB statistical enrichment), which ask: are the genes in my list that overlap a gene set more differentially expressed than the genes outside that set, normalized across all gene sets within a database?
Type 2 competitive tests have the advantage of using all differential gene expression data (even genes that don’t cross an arbitrary logFC or p-value threshold), but both can be informative - think about which would be best for your data. Let’s do some of both! First, we will need some reference databases for enrichment. Popular ones are Gene Ontology, Reactome, and the MSigDB Hallmark collection. Let’s use msigdbr to import a few:
library("msigdbr")
Hs.GOBP <- msigdbr(species = "Homo sapiens", category = "C5", subcategory = "BP")
Hs.Reactome <- msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:REACTOME")
Hs.Hallmark <- msigdbr(species = "Homo sapiens", category = "H")
msigdbr stores the gene sets in both Entrez ID and gene symbol form. We’ll use both from here on out, so let’s split each collection into Entrez and symbol versions for downstream use.
Hs.GOBP.Entrez <- Hs.GOBP %>% split(x = .$entrez_gene, f = .$gs_name)
Hs.Hallmark.Entrez <- Hs.Hallmark %>% split(x = .$entrez_gene, f = .$gs_name)
Hs.GOBP.Symbol <- Hs.GOBP %>% split(x = .$gene_symbol, f = .$gs_name)
Hs.Hallmark.Symbol <- Hs.Hallmark %>% split(x = .$gene_symbol, f = .$gs_name)
Let’s start out with a fast and simple competitive enrichment test (type 2) called camera - it’s built into the limma package. First, we will create an index with the corresponding gene set database we generated.
idx <- ids2indices(Hs.GOBP.Entrez,id=v$genes$ENTREZID)
Next, we will perform camera enrichment - specifying the condition2 vs condition1 contrast, which is contrast 1 in our matrix
cam.HsGO <- camera(v,idx,design,contrast=contr.matrix[,1])
# Top 10 results
head(cam.HsGO,10)
It looks like Synaptic Signaling is significantly downregulated in this comparison. We can create a barcode plot of the enrichment of this gene set, plotting the t-statistic of each gene in the condition2 vs condition1 contrast within the set:
barcodeplot(efit$t[,1],
index=idx$GO_SYNAPTIC_SIGNALING,
main="condition2vscondition1 - GO:Synaptic Signaling")
Another popular and powerful competitive enrichment test is GSEA (Subramanian et al. 2005). It’s widely used, and there is an R package - fgsea - which implements the algorithm for fast enrichment (substantially faster than running the GSEA Java app). Let’s look at the second contrast - condition3 vs condition1 - and rank the genes by t-statistic.
reslimma <- topTable(efit, coef=2, adjust.method="BH", sort.by = "t", n = Inf)
#Convert your limma topTable results to a data table format
reslimma_dt <- data.table(reslimma)
# Remove rows with an empty or duplicated Symbol
reslimma_dt <- subset(reslimma_dt, !is.na(SYMBOL) & SYMBOL != "")
reslimma_dt <- subset(reslimma_dt, ! duplicated(SYMBOL))
# Tidy up and order by t metric
ranks_limma <- as_tibble(reslimma_dt[order(t), list(SYMBOL, t)])
# Deframe the data table
ranks <- deframe(ranks_limma)
# See the first 10 genes in the ranked list (most negative t)
head(ranks, 10)
We’ll use the Hs.Hallmark.Symbol list we generated earlier, and view the top 10 upregulated gene sets in Hallmark.
fgseaRes.Hallmark <- fgsea::fgseaMultilevel(pathways=Hs.Hallmark.Symbol, stats=ranks, eps = 0)
# Make the Results tidy
fgseaResTidy <- fgseaRes.Hallmark %>%
as_tibble() %>%
dplyr::select(-leadingEdge, -ES, -log2err) %>%
arrange(desc(NES))
head(fgseaResTidy,10)
What if you want to determine the activity of a given TF? You can use DoRothEA (Garcia-Alonso et al. 2019, Genome Research 29(8):1363-1375). The authors have pre-generated TF regulon networks, which we can use to analyze our expression data. The regulons are provided in the format used by VIPER, an R package that infers protein activity from expression data; DoRothEA uses the viper algorithm to calculate directional TF activity changes. There are currently 5 main networks - A to E - with descending stringency of validation requirements. The C regulon network requires that a TF-gene edge be supported by either 1) a ChIP-Seq peak in the gene plus a motif occurrence, or 2) a literature-curated edge plus a motif occurrence. We’ll use the C network here.
library("viper")
# C set is a good place to start - it requires either 1) Curated data w/ TFBS or 2) ChIP-Seq data w/ TFBS
# C_viperRegulon.rdata
load('~/Path/To/Files')
# Clean TF names & explore object
names(viper_regulon) <- sapply(strsplit(names(viper_regulon), split = ' - '), head, 1)
# Explore the regulons object
names(viper_regulon)[1:10]
viper_regulon[[1]]
Now, we need to take our expression results and z-score them. Let’s focus on the condition2 vs condition1 contrast.
DEsignature <- topTable(efit, coef=1, adjust.method="BH", sort.by = "t", n = Inf)
# Exclude probes with unknown or duplicated gene symbol
DEsignature <- subset(DEsignature, !is.na(SYMBOL) & SYMBOL != "")
DEsignature <- subset(DEsignature, ! duplicated(SYMBOL))
# Estimate z-score values for the GES. Check VIPER manual for details
myStatistics <- matrix(DEsignature$logFC, dimnames = list(DEsignature$SYMBOL, 'logFC') )
myPvalue <- matrix(DEsignature$P.Value, dimnames = list(DEsignature$SYMBOL, 'P.Value') )
mySignature <- (qnorm(myPvalue/2, lower.tail = FALSE) * sign(myStatistics))[, 1]
# Reorder
mySignature <- mySignature[order(mySignature, decreasing = T)]
head(mySignature)
Now we’re ready to run the DoRothEA TF analysis with a minimum regulon size of 4. Afterwards, we can view the TFs with the most significant activity changes.
library("viper")
# C_viperRegulon.rdata
load('/~/Path/To/Files')
names(viper_regulon) <- sapply(strsplit(names(viper_regulon), split = ' - '), head, 1)
mrs <- msviper(ges = mySignature, regulon = viper_regulon, minsize = 4, ges.filter = F, verbose = TRUE)
# Save the results as a dataframe
TF_activities <- data.frame(Regulon = names(mrs$es$nes),
Size = mrs$es$size[ names(mrs$es$nes) ],
NES = mrs$es$nes,
p.value = mrs$es$p.value,
FDR = p.adjust(mrs$es$p.value, method = 'fdr'))
Which TFs have the most significant change in activity?
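Sort the results table by significance to see:
# Top 10 TFs by p-value
head(TF_activities[order(TF_activities$p.value), ], 10)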
Which genes contributed most to these TF activity changes?
mrs <- ledge(mrs)
# See the results
summary(mrs)
sessionInfo()