RNA-seq analysis using kallisto, tximport, and DESeq2

This walkthrough combines information from several vignettes to streamline the workflow starting from kallisto output files.
For the beginning of the workflow, from raw sequencing reads (fastq files) to a finished kallisto output directory, see Raw Sequencing Reads -> filtering (FastQC, TrimGalore) -> kallisto.
Here, we start with .h5 files already organized into a directory, plus a .txt file we have already created that describes the experimental paradigm, including samples and conditions.
If you are running this for the very first time, see the file Install Packages for how to download everything you will need; packages must be installed before they can be loaded with library().
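
As a quick start, here is a minimal installation sketch of one way to do it (the package names are taken from the library() calls below; the Install Packages file has the full details):

# minimal installation sketch (run once); BiocManager handles the Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("tximport", "tximportData", "rhdf5",
                       "GenomicFeatures", "DESeq2"))
# readr and pacman are on CRAN
install.packages(c("readr", "pacman"))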


Load your libraries:

library("tximport")
library("readr")
library("tximportData")
library("rhdf5")
library("GenomicFeatures")
library("DESeq2")
library("pacman")


Set up your working directory

Set the working directory to the folder that contains: 1) a directory of folders, one per sample, each holding an abundance.h5 file from the kallisto run, and 2) the .txt file describing samples and conditions.

setwd("C:/Users/AMDprograms/Desktop/Untrimmed/")

List the folder contents into a character vector called dir, to confirm you are in the right place:

dir <- dir(file.path("./"))
head(dir)
## [1] "0h.csv"              "0h_STLC_vs_DMSO.csv" "0h_txi_kallisto.csv"
## [4] "5190-S1.fastq.gz"    "5190-S2.fastq.gz"    "5190-S3.fastq.gz"


Make a dataframe that describes your experimental conditions

Read the table of sample names and conditions into a dataframe:

samples <- read.table(file.path("./samples.txt"), header=TRUE)
samples
##     sample condition
## 1 S1output      DMSO
## 2 S2output      STLC
## 3 S3output      DMSO
## 4 S4output      STLC
## 5 S5output      DMSO
## 6 S6output      STLC

Set the dataframe's row names to the sample names:

rownames(samples) <- samples$sample
samples
##            sample condition
## S1output S1output      DMSO
## S2output S2output      STLC
## S3output S3output      DMSO
## S4output S4output      STLC
## S5output S5output      DMSO
## S6output S6output      STLC

Build a character vector of the full paths to all the kallisto .h5 files:

files <- file.path("./kallisto", samples$sample, "abundance.h5")
files
## [1] "./kallisto/S1output/abundance.h5" "./kallisto/S2output/abundance.h5"
## [3] "./kallisto/S3output/abundance.h5" "./kallisto/S4output/abundance.h5"
## [5] "./kallisto/S5output/abundance.h5" "./kallisto/S6output/abundance.h5"

Name each element of files as sample1, sample2, and so on. These names become the column headers for each sample's reads in the tximport output:

names(files) <- paste0("sample", 1:6)
files
##                            sample1                            sample2 
## "./kallisto/S1output/abundance.h5" "./kallisto/S2output/abundance.h5" 
##                            sample3                            sample4 
## "./kallisto/S3output/abundance.h5" "./kallisto/S4output/abundance.h5" 
##                            sample5                            sample6 
## "./kallisto/S5output/abundance.h5" "./kallisto/S6output/abundance.h5"


Making the TxDb database to relate transcript name to gene name

Getting the right mapping to turn kallisto's transcript list into something usable by tximport is important, and somewhat tricky; code suggested in a few other vignettes did not work for me. Luckily, the authors of kallisto maintain a collection of tarball downloads that contain not only high-quality pre-built kallisto indexes (.idx files) for common organisms, but also a .gtf file, which is exactly what is needed here to build the TxDb object. In short, the organism's .gtf file from the kallisto tarball is passed to makeTxDbFromGFF(), which builds a TxDb object from it. Simply point to the directory and file where you extracted the tarball (on Unix/Linux, on Windows via WSL/Ubuntu, or on a Mac) and you're good to go.

txdb <- makeTxDbFromGFF("C:/Users/AMDprograms/Downloads/mus_musculus/mus_musculus/Mus_musculus.GRCm38.96.gtf")
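
Building the TxDb from the GTF can take a while. Optionally, it can be cached to disk with AnnotationDbi (a dependency of GenomicFeatures) so it doesn't have to be rebuilt every session; the .sqlite file name below is just an example:

# optional: cache the TxDb so it doesn't need rebuilding from the GTF each session
AnnotationDbi::saveDb(txdb, file = "Mus_musculus.GRCm38.96.txdb.sqlite")
# in later sessions, reload it with:
# txdb <- AnnotationDbi::loadDb("Mus_musculus.GRCm38.96.txdb.sqlite")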

You can check what this data is like by calling these:

columns(txdb)
##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"  
##  [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"    
## [11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"    
## [16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"   
## [21] "TXSTRAND"   "TXTYPE"
keytypes(txdb)
## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

Make a character vector of keys, one for each transcript name:

k <- keys(txdb, keytype = "TXNAME")

Use those keys to build a dataframe that maps transcript names to gene IDs. This mapping is what lets tximport summarize transcript-level counts to the gene level.

tx_map <- select(txdb, keys = k, columns="GENEID", keytype="TXNAME")
## 'select()' returned 1:1 mapping between keys and columns
head(tx_map)
##               TXNAME             GENEID
## 1 ENSMUST00000193812 ENSMUSG00000102693
## 2 ENSMUST00000082908 ENSMUSG00000064842
## 3 ENSMUST00000192857 ENSMUSG00000102851
## 4 ENSMUST00000161581 ENSMUSG00000089699
## 5 ENSMUST00000192183 ENSMUSG00000103147
## 6 ENSMUST00000193244 ENSMUSG00000102348
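
A couple of optional sanity checks on the mapping before handing it to tximport:

# optional checks: how many transcript-to-gene pairs, and any missing gene IDs?
nrow(tx_map)
sum(is.na(tx_map$GENEID))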


Run tximport

Run tximport; the output is stored in a list called txi.kallisto:

# rename the mapping dataframe to the argument name tximport expects
tx2gene <- tx_map
# then run tximport on the kallisto abundance files
txi.kallisto <- tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreTxVersion = TRUE)
## 1 2 3 4 5 6 
## transcripts missing from tx2gene: 1673
## summarizing abundance
## summarizing counts
## summarizing length
## summarizing inferential replicates
head(txi.kallisto$counts)
##                    sample1   sample2 sample3 sample4   sample5  sample6
## ENSMUSG00000000001    8299 14400.000    9423    9805 12225.000 9895.000
## ENSMUSG00000000003       0     0.000       0       0     0.000    0.000
## ENSMUSG00000000028     792  2803.000    1021    1244  1559.567 1514.813
## ENSMUSG00000000037     194   296.000     283     217   383.000  270.000
## ENSMUSG00000000049       0     0.000       0       0     0.000    0.000
## ENSMUSG00000000056    2401  3793.896    2382    2474  3247.000 2581.000
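
tximport returns a list rather than a single table; a couple of optional calls to get a feel for it:

# txi.kallisto is a list; counts, abundance (TPM), and length are separate matrices
names(txi.kallisto)
# total gene-level counts per sample
colSums(txi.kallisto$counts)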


Make a DESeq2 object

Turn the tximport output into a DESeq2 object, called a DESeqDataSet:

ddsTxi <- DESeqDataSetFromTximport(txi.kallisto,
                                   colData = samples,
                                   design = ~ condition)
ddsTxi
## class: DESeqDataSet 
## dim: 36047 6 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(36047): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000118389 ENSMUSG00000118393
## rowData names(0):
## colnames(6): S1output S2output ... S5output S6output
## colData names(2): sample condition


Prefilter the reads to remove low read counts

This prefiltering step is suggested in the DESeq2 vignette. It removes any rows with fewer than 10 reads in total across all samples:

keep <- rowSums(counts(ddsTxi)) >= 10
dds <- ddsTxi[keep,]
dds
## class: DESeqDataSet 
## dim: 20185 6 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(20185): ENSMUSG00000000001 ENSMUSG00000000028 ...
##   ENSMUSG00000118382 ENSMUSG00000118389
## rowData names(0):
## colnames(6): S1output S2output ... S5output S6output
## colData names(2): sample condition
# the prefiltering step removed a large number of genes (36047 rows before, 20185 after)
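
If you want that number computed explicitly rather than by eye:

# number of genes removed by the low-count filter
nrow(ddsTxi) - nrow(dds)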


Make sure DESeq2 compares the correct experimental conditions.

This step makes sure comparisons are made against the correct control. Set ref to whatever your control condition is named (here it's DMSO):

dds$condition <- relevel(dds$condition, ref = "DMSO")
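
A quick optional check that the releveling worked; the reference level should now be listed first:

# DMSO should appear first
levels(dds$condition)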


Run DESeq2

Run DESeq2, then extract a results table res with log2 fold changes, p values, and adjusted p values.

dds <- DESeq(dds)
res <- results(dds)
head(res)
## log2 fold change (MLE): condition STLC vs DMSO 
## Wald test p-value: condition STLC vs DMSO 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE       stat    pvalue
##                    <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSMUSG00000000001 10528.932     -0.0921937  0.136806 -0.6739009  0.500374
## ENSMUSG00000000028  1402.240      0.3068577  0.290703  1.0555719  0.291164
## ENSMUSG00000000037   293.507      0.0588162  0.630455  0.0932918  0.925672
## ENSMUSG00000000056  2862.085      0.3036562  0.289039  1.0505727  0.293455
## ENSMUSG00000000058   195.899      0.7890995  0.757948  1.0410992  0.297830
## ENSMUSG00000000078  1130.841      0.3721557  0.277546  1.3408782  0.179960
##                         padj
##                    <numeric>
## ENSMUSG00000000001  0.999644
## ENSMUSG00000000028  0.942239
## ENSMUSG00000000037  0.999644
## ENSMUSG00000000056  0.942759
## ENSMUSG00000000058  0.944772
## ENSMUSG00000000078  0.846797

This reports the comparison that was made (experimental vs. control):

resultsNames(dds)
## [1] "Intercept"              "condition_STLC_vs_DMSO"


(Optional) Shrink effect size for making graphs

This shrinks the effect sizes using the apeglm estimator (see the comment in the block for why this is done):

resLFC <- lfcShrink(dds, coef="condition_STLC_vs_DMSO", type="apeglm")
head(resLFC)
# Shrinkage is not needed for the table-based analysis below, so the shrunken
# estimates are written to a separate variable. They are mainly useful for graphing.
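
For example, the shrunken estimates give a cleaner MA plot (the y-axis limits below are just an example range):

# MA plot of the shrunken log2 fold changes; ylim is arbitrary
plotMA(resLFC, ylim = c(-2, 2))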


Examine the output of the differential expression analysis

Order your results table however you'd like; here it is ordered by p value:

resOrdered <- res[order(res$pvalue),]
summary(res)
## 
## out of 20185 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 108, 0.54%
## LFC < 0 (down)     : 135, 0.67%
## outliers [1]       : 387, 1.9%
## low counts [2]     : 3884, 19%
## (mean count < 11)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

You can count results meeting any threshold you'd like; here it's padj < 0.1:

sum(res$padj < 0.1, na.rm=TRUE)
## [1] 243
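
If you'd rather test at a stricter FDR threshold, results() can also be rerun with a different alpha (the 0.05 below is just an example):

# example: redo the results table with an FDR cutoff of 0.05
res05 <- results(dds, alpha = 0.05)
summary(res05)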

Write your results to a file:

write.csv(as.data.frame(resOrdered), 
          file="0h_STLC_vs_DMSO.csv")
# this file is usable on its own, but it is not as detailed as the file written
# at the end of this walkthrough, which also includes gene descriptions.

Check that the results contain no duplicated row or column names:

# check that no gene IDs or column names are duplicated
any(duplicated(rownames(res)))
## [1] FALSE
any(duplicated(colnames(res)))
## [1] FALSE


Add common gene names and short descriptions to your DESeq2 output

Load these libraries to prepare for adding gene annotations to the dataset:

library("AnnotationDbi")
library("org.Mm.eg.db")

Append gene names (symbols) to the results file:

ens.str2 <- rownames(resOrdered)
head(ens.str2)
## [1] "ENSMUSG00000058492" "ENSMUSG00000078154" "ENSMUSG00000056394"
## [4] "ENSMUSG00000027342" "ENSMUSG00000094777" "ENSMUSG00000094463"
resOrdered$symbol <- mapIds(org.Mm.eg.db,
                     keys=ens.str2,
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
head(resOrdered)
## log2 fold change (MLE): condition STLC vs DMSO 
## Wald test p-value: condition STLC vs DMSO 
## DataFrame with 6 rows and 7 columns
##                     baseMean log2FoldChange     lfcSE      stat      pvalue
##                    <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000058492  192.2996       10.89298  1.361804   7.99893 1.25501e-15
## ENSMUSG00000078154  304.9936       11.55989  1.540744   7.50280 6.24704e-14
## ENSMUSG00000056394 5710.2571        1.34065  0.189609   7.07059 1.54279e-12
## ENSMUSG00000027342 8893.8889        1.14944  0.164325   6.99490 2.65442e-12
## ENSMUSG00000094777 2254.9554        1.94985  0.298221   6.53827 6.22332e-11
## ENSMUSG00000094463   97.9846        9.92236  1.596438   6.21531 5.12223e-10
##                           padj      symbol
##                      <numeric> <character>
## ENSMUSG00000058492 1.99722e-11          NA
## ENSMUSG00000078154 4.97077e-10          NA
## ENSMUSG00000056394 8.18400e-09        Lig1
## ENSMUSG00000027342 1.05606e-08        Pcna
## ENSMUSG00000094777 1.98076e-07   Hist1h2ap
## ENSMUSG00000094463 1.35859e-06          NA
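
The same mapIds() pattern works for other identifier types in org.Mm.eg.db; for example, Entrez IDs could optionally be appended as well:

# optional: add Entrez gene IDs with the same pattern
resOrdered$entrez <- mapIds(org.Mm.eg.db,
                            keys=ens.str2,
                            column="ENTREZID",
                            keytype="ENSEMBL",
                            multiVals="first")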

This turns the resOrdered object into a dataframe in preparation for writing it to a .csv file:

resOrdered_df <- as.data.frame(resOrdered)


Write your analysis to a .csv file

This block does two things: 1) builds a table of gene symbols and short descriptions from org.Mm.eg.db, and 2) merges it with the DESeq2 results dataframe:

# make a dataframe of gene symbols joined to their short descriptions
gene_desc <- merge(org.Mm.egSYMBOL, org.Mm.egGENENAME)
# then merge the gene descriptions into the annotated DESeq2 output (joined on the symbol column)
res_desc <- merge(resOrdered_df, gene_desc)
head(res_desc)
##          symbol   baseMean log2FoldChange     lfcSE       stat     pvalue
## 1 0610009B22Rik  560.50309     0.86889258 0.4851068  1.7911368 0.07327135
## 2 0610010F05Rik 1084.12469    -0.18228416 0.4195712 -0.4344535 0.66395915
## 3 0610010K14Rik 3366.96582     0.17743986 0.2121151  0.8365262 0.40285892
## 4 0610012G03Rik 1737.34507    -0.05600407 0.2516677 -0.2225318 0.82389990
## 5 0610030E20Rik  673.87868    -0.19946140 0.3437019 -0.5803325 0.56169040
## 6 0610040J01Rik   36.38102    -0.57157967 2.1793335 -0.2622727 0.79311121
##        padj gene_id                  gene_name
## 1 0.6167701   66050 RIKEN cDNA 0610009B22 gene
## 2 0.9996441   71675 RIKEN cDNA 0610010F05 gene
## 3 0.9903657  104457 RIKEN cDNA 0610010K14 gene
## 4 0.9996441  106264 RIKEN cDNA 0610012G03 gene
## 5 0.9996441   68364 RIKEN cDNA 0610030E20 gene
## 6 0.9996441   76261 RIKEN cDNA 0610040J01 gene
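
Note that merge() returns the rows sorted by the shared symbol column rather than by significance; if you want the most significant genes at the top of the .csv, re-sort first:

# restore the p-value ordering that merge() discarded
res_desc <- res_desc[order(res_desc$pvalue), ]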

This writes a file containing gene names, differential expression data, statistics, and gene descriptions all together to the current working directory:

write.csv(res_desc, 
          file="gene_desc_0h_STLC_vs_DMSO.csv")

Your data is now readable in R, on the command line, in Python, Excel, and more!