Introduction

There are two ways you can do RNA-Seq processing: 1. Read alignment. 2. Transcriptome mapping. In most cases, transcriptome mapping (i.e. kallisto or Salmon) is faster, however the RNA-Seq genome aligner Rsubread - when paired with FeatureCounts for counting reads from genomic features - can approach the computing time required by transcriptome mappers. In another tutorial, I will guide you through transcriptome mapping - but here, we are going to use R for end to end analysis - from raw reads to differential gene expression, ontological and pathway enrichment, and transcription factor activity characterization.
First, let’s load the packages we will use.

Then, let’s change into the directory where our fastq files are:

base_dir <- "~/Path/To/Files"

Next, let’s import our sample metadata. We can create a simple matrix, with columns such as sample and condition.

samples <- read.table(file.path(base_dir, "Sample_info.txt"), header = TRUE, stringsAsFactors=FALSE)
samples

Data pre-processing

Using FastQC to analyze the raw sequencing data

The first step will be to pre-process our reads. To do this, we can use a R-base wrapper for TrimGalore! called trim_galore. Here, we’re working with paired-end reads that end in “R1.fastq.gz” and “R2.fastq.gz”. Let’s import them and make sure that each sample has two reads.

fastq1 <- list.files(path = file.path(base_dir), pattern = "*1.fq.gz$", full.names = TRUE)
fastq2 <- list.files(path = file.path(base_dir), pattern = "*2.fq.gz$", full.names = TRUE)
all.equal(length(fastq1),length(fastq2))

Now let’s run fastqc on our unprocessed reads, and save the results in a new directory within our base directory.

fastqc_dir <- "~/Path/To/Files"
fastqc(fq.dir = base_dir, qc.dir = fastqc_dir, threads = 4)

Next, we can aggregate all the fastqc results and view them

qc <- qc_aggregate(fastqc_dir)
qc

Let’s look at what different statistics are available for one file

qc <- qc_read(file.path(fastqc_dir, "file_fastqc.zip"))
names(qc)

Let’s plot some of the satistics

par(mfrow=c(2,2))
qc_plot(qc, "Per base sequence quality")
qc_plot(qc, "Per sequence quality scores")
qc_plot(qc, "Per base sequence content")
qc_plot(qc, "sequence length distribution")

Using TrimGalore! and MultiQC to preprocess the raw sequencing data

We’re ready to trim our reads.

rseqR::trim_fastq(fastq1, fastq2, 
           illumina = TRUE,
           minlength = 90, 
           minqual = 30, 
           trimN = TRUE,
           retainUnpaired = FALSE,
           dest.dir = "./TRIMMED_FASTQC",
           trimgalore = "trim_galore")

Next, we can run MultiQC on the trimmed fastqc files

run_multiqc(fastqc_dir, "./MULTIQC", multiqc = "multiqc")

After viewing the MultiQC output, we can move to importing in the trimmed fastq files

trimmed_dir <- "./TRIMMED_FASTQC"
reads1 <- list.files(path = file.path(trimmed_dir), pattern = "*1.fq.gz$", full.names = TRUE)
reads2 <- list.files(path = file.path(trimmed_dir), pattern = "*2.fq.gz$", full.names = TRUE)
all.equal(length(reads1),length(reads2))

Genome alignment with Rsubread

Next, we need to download the reference genome. Hg19 and hg38 are the most commonly use. Which is the best to use? It depends on what tools you may want to use downstream! You can build your own Rsubread genome index, or get a precompiled one from RNASEQ$. I have generated hg19 from the 1000genomes project (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz) and hg38 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). Let’s assume you want to build a hg19 index. You can set the memory to any amount your computing resources can supply.

Rsuburead::buildindex(basename="hg19_g1k",
           reference="human_g1k_v37.fasta",
           memory=3600)

Now that we have an index built, we’re ready to align our reads with Rsubread.

Rsubread::align(index = "hg19_g1k",
      readfile1 = reads1,
      readfile2 = reads2,
      type = "rna",
      input_format = "gzFASTQ",
      output_format = "BAM",
      PE_orientation = "rf",
      nthreads = 6 )

Let’s view our bam files.

trimmed_dir <-"~/Path/To/Files"
bam.files <- list.files(trimmed_dir, pattern = "BAM$")
bam.files

And determine the proportion of mapped reads.

Read to gene counting with FeatureCounts

Now that we’ve aligned our reads, we can use FeatureCounts to count the reads to genes from the genomic alignment. We can use the inbuilt annotation.

Let’s take a look at the statistics of our counts

fc$stat
samples$samplename <- colnames(fc$counts)
samples

Differential Expression with limma-voom

Now we are ready for differential expression. First, we will create a DGElist object from our FeatureCounts output.

y <- DGEList(fc$counts,lib.size = colSums(fc$counts),
             norm.factors = calcNormFactors(fc$counts),
             samples = samples$samplename,
             group = samples$condition)

Creating a Homo Sapiens annotation

Since we used the inbuilt hg19 annotation, our columns from our fc object are in ENTREZID form. Let’s get the Gene Symbol for each ENTREZID (where available) and apply it to our DGEList object “y”.

Filtering Expression Data

Next, we will use EdgeR to filter lowly expressed genes, and then take a look at the count denisty for each sample before and after filtering.

# View the library size for each sample
y$samples
# Load a nice color palette of 50 colors to be used for plots
myPalette <- c(brewer.pal(8, "Set1"), brewer.pal(8, "Set2"))
# convert counts to cpm and log
unfilteredExpr <- cpm(y, log=T)
# Filter lowly expressed genes via edgeR
keep = filterByExpr(y)
y <- y[keep,]
# Calculate normalization factors
y <- calcNormFactors(y)
# Plot the density of filtered gene expression for all samples within groups
filteredExpr <- cpm(y, log=T)
# Plot the density of unfiltered and filtered gene expression for all samples within groups
par(mfrow=c(2,2))
plotDensities(unfilteredExpr, group=samples$condition, col=myPalette[1:8])
plotDensities(filteredExpr, group=samples$condition, col=myPalette[1:8])
boxplot(unfilteredExpr, las=2, main="")
boxplot(filteredExpr, las=2, main="")

Multi Dimentional Scaling (MDS plot) to assess

How do the samples within conditions correlate with each other in expression space, and how do the conditions differ? This is akin to a PCA plot

group <- factor(samples$condition,levels=c("condition1","condition2","condition3"))
# View an MDS plot
plotMDS(filteredExpr, labels=group, col=as.numeric(group))

You can also create an interactive MDS plot with Glimma

Creating a design and contrast matrix

Now we’re ready to create an experimental design and contrast matrix for differential expression analysis. We are interested in comparing the two Presenilin1 mutations - condition2 and condition3 - relative to condition1, so we will first create the different conditions within the overall group, and the create a contrast matrix specifying these two comparisions (condition2 relative to condition1 and condition3 relative to condition1) devtools::install_github("drjingma/netgsa", build_vignettes=T) {r contrast matrix} # Create the sample table sampleTable <- data.frame(condition=factor(rep(c(samples$condition)))) # Create the group and design - rename your conditions! group <- factor(samples$condition,levels=c("condition1","condition2","condition3")) # Create the design model matrix. Here it's just by group, but if you have different batches, you want want to define batches and use ~0+group+batch design <- model.matrix(~0+group, data = sampleTable) # Rename the colnames colnames(design) <- levels(group) # Create the contrast matrix for DE contr.matrix <- makeContrasts(condition2vcondition1 = condition2 - condition1, condition3vcondition1 = condition3 - condition1, levels = colnames(design)) # View your contrast matrix contr.matrix

Differential expression with voom

Now, we will apply the voom method (Law et al. 2014), variance modeling at the observational level, to our count data and fit with limma’s linear model approach. we can run either voom() or voomWithQualityWeights(), which is particularly useful when the library size of the samples differ substantially within an experiment.

par(mfrow=c(2,1))
v <- voomWithQualityWeights(y, design=design, plot=TRUE)
# fit the linear model
fit <- lmFit(v, design)
# apply your contrasts
cfit <- contrasts.fit(fit, contrasts=contr.matrix)
# eBayes method
efit <- eBayes(cfit)
plotSA(efit, main="Final model: Mean-variance trend")

eBayes method for DEG determination

How many genes are differentially expressed at a FDR=0.05?

summary(decideTests(efit))

TREAT method for DEG determination

We can also use the treat method for determination of differentially expressed genes, which will recalculate the p-values and t-values of each gene in the study at a particular logFC threshold. If we want to set the threshold at logFC of 1.5 (which is about ~0.58 in log2 fold space), we can do so as follows.

tfit <- treat(efit,lfc=log2(1.5))
# See how many DEGs in each condition
dt <- decideTests(tfit)
summary(dt)

DEG overlap

How many DEGs overlap between condition2 and condition3 relative to condition1?

de.common <- which(dt[,1]!=0 & dt[,2]!=0)
length(de.common)
# Make a vennDiagram of the overlap of DEGs between conditions
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon", "blue"))

Plotting

Let’s first make a volcano plot of the condition2 to condition1 contrast

volcanoplot(efit, coef=1,style = "p-value", highlight = 100, names = efit$genes$SYMBOL, hl.col="blue", xlab = "Log2 Fold Change")

Next, we can create an interactive MD plot with limma

Gene Set Enrichment Analysis

There are two main types of enrichment analysis: 1. Self-contained tests (e.g., DAVID, gProfiler, Metascape, fry) which ask: are the genes in my gene lists overrepresented for a given gene set (i.e. Reactome:Cell Cycle or TF-targets of E2F1) 2. Competitive test (e.g. GSEA, camera, PantherDB statistical enrichment): are the genes in my list which overlap with a gene set more differentially expressed in that gene set relative to the rest of the DEGs not in the geneset, normalized across all gene sets within a database?
Type 2 competive tests have the advantage of using all differential gene expression data (even if they don’t cross an arbitrary logFC or p-value threshold), but both can be informative - think about which would be the best to use for your data. Let’s do some of both! First, we will need some reference databases for enrichment. Popular ones are Gene Ontology, Reactome, and MSigDB - Hallmark. Let’s use msigdbr to import a few

library("msigdbr")
Hs.GOBP <- msigdbr(species = "Homo sapiens", category = "C5", subcategory = "BP")
Hs.Reactome <- msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:REACTOME")
Hs.Hallmark <- msigdbr(species = "Homo sapiens", category = "H")

msigdbr stores the genesets in both ENTREZID and SYMBOL form. We’ll use both from here on out, so lets split out each one into Entrez and Symbol for downstream use.

Hs.GOBP.Entrez <- Hs.GOBP %>% split(x = .$entrez_gene, f = .$gs_name)
Hs.Hallmark.Entrez <- Hs.Hallmark %>% split(x = .$entrez_gene, f = .$gs_name)
Hs.GOBP.Symbol <- Hs.GOBP %>% split(x = .$gene_symbol, f = .$gs_name)
Hs.Hallmark.Symbol <- Hs.Hallmark %>% split(x = .$gene_symbol, f = .$gs_name)

Camera competitive enrichment test

Let’s start out with a fast and simple competitive enrichment test (type 2) called camera - it’s built into the limma package. First, we will create an index with the corresponding gene set database we generated.

idx <- ids2indices(Hs.GOBP.Entrez,id=v$genes$ENTREZID)

Next, we will perform camera enrichment - specifying the condition2 vs condition1 contrast, which is contrast 1 in our matrix

cam.HsGO <- camera(v,idx,design,contrast=contr.matrix[,1])
#Top 40 results
head(cam.HsGO,10)

It looks like Synaptic Signaling is significantly downregulated in this comparison. We can create a barcodeplot of the enrichment of the geneset, plotting the t-value of each gene in the condition2vscondition1 contrast within this geneset

barcodeplot(efit$t[,1], 
            index=idx$GO_SYNAPTIC_SIGNALING, 
            main="condition2vscondition1 - GO:Synaptic Signaling")

GSEA competitive enrichment test with fgsea

Another popular and powerful competitive enrichment test is GSEA (Subramanian et al. 2005). It’s widely used and there is a R package - fgsea - which has implemented the algorithm in a manner to facilitate fast enrichment (substatianlly faster than running the GSEA java app). Let’s look at the second contrast - condition3vscondition1 - and rank the genes by t value.

reslimma <- topTable(efit, coef=2, adjust.method="BH", sort.by = "t", n = Inf)
#Convert your limma topTable results to a data table format
reslimma_dt <- data.table(reslimma)
# Remove rows with an empty or duplicated Symbol
reslimma_dt <- subset(reslimma_dt, SYMBOL != "" )
reslimma_dt <- subset(reslimma_dt, ! duplicated(SYMBOL))
# Tidy up and order by t metric
ranks_limma <- as_tibble(reslimma_dt[order(t), list(SYMBOL, t)])
# Deframe the data table
ranks <- deframe(ranks_limma)
#see the top 20 genes by t
head(ranks, 10)

We’ll use the Hs.Hallmark.Symbol we generated earlier, and view the top 10 upregulated genesets in Hallmark.

fgseaRes.Hallmark <- fgsea::fgseaMultilevel(pathways=Hs.Hallmark.Symbol, stats=ranks, eps = 0)
# Make the Results tidy
fgseaResTidy <- fgseaRes.Hallmark %>%
  as_tibble() %>%
  dplyr::select(-leadingEdge, -ES, -nMoreExtreme) %>% 
  arrange(desc(NES))
head(fgseaResTidy,10)

Transcription Factor Analysis

DoRothEA TF Activity Analysis

What if you want to determine the activity of a given TF? You can use DoRothEA (Garcia-Alonso et al 2019 Genome Research, 29(8), 1363-1375). They have pre-generated TF regulon networks, which we can use to analyze our expression data. They are in the VIPER format, an R package to determine protein activity from expression data. DoRothEA uses the viper algorithm to calculate directional TF activity changes in expression data. There are currently 5 main networks - A to E - with descending stringency on validation requirements. The C regulon network requires a TF-gene edge be validated by either 1.) ChIP-Seq peak in gene and motif occurance or 2.) Literature edge and motif occurance. We’ll use the network here.

library("viper")
# C set is a good place to start - it requires either 1) Curated data w/ TFBS or 2) ChIP-Seq data w/ TFBS
# C_viperRegulon.rdata
load('~/Path/To/Files')
# Clean TF names & explore object
names(viper_regulon) <- sapply(strsplit(names(viper_regulon), split = ' - '), head, 1)
# Explore the regulons object
names(viper_regulon)[1:10]
viper_regulon[[1]]

Now, we need to take our expression results and z-score it. Let’s focus on contrast condition2vscondition1.

DEsignature <- topTable(efit, coef=1, adjust.method="BH", sort.by = "t", n = Inf)
# Exclude probes with unknown or duplicated gene symbol
DEsignature <- subset(DEsignature, SYMBOL != "" )
DEsignature <- subset(DEsignature, ! duplicated(SYMBOL))
# Estimate z-score values for the GES. Check VIPER manual for details
myStatistics <- matrix(DEsignature$logFC, dimnames = list(DEsignature$SYMBOL, 'logFC') )
myPvalue <- matrix(DEsignature$P.Value, dimnames = list(DEsignature$SYMBOL, 'P.Value') )
mySignature <- (qnorm(myPvalue/2, lower.tail = FALSE) * sign(myStatistics))[, 1]
# Reorder
mySignature <- mySignature[order(mySignature, decreasing = T)]
head(mySignature)

Now we’re ready to run DoRothEA TF analysis with a minimum TF-gene edge size of 4. Let’s view the top 20 TFs with the most significant activity change.

library("viper")

# C_viperRegulon.rdata
load('/~/Path/To/Files')
names(viper_regulon) <- sapply(strsplit(names(viper_regulon), split = ' - '), head, 1)
mrs <- msviper(ges = mySignature, regulon = viper_regulon, minsize = 4, ges.filter = F, verbose = TRUE)
# Save the results as a dataframe
TF_activities <- data.frame(Regulon = names(mrs$es$nes),
                            Size = mrs$es$size[ names(mrs$es$nes) ], 
                            NES = mrs$es$nes, 
                            p.value = mrs$es$p.value, 
                            FDR = p.adjust(mrs$es$p.value, method = 'fdr'))

Which TFs have the most significant change in activity?

Which genes contributed most to these TF activity changes?

mrs <- ledge(mrs)
# See the results
summary(mrs)

Software and code used

sessionInfo()