This walkthrough was built by combining the information from several vignettes in order to streamline workflow from kallisto output files.
To see the beginning of the workflow from raw sequencing reads (fastq files) to finished kallisto output directory, see Raw Sequencing Reads -> filtering(FastQC, TrimGalore) -> kallisto.
Here, we start with .h5
files already ordered into a directory, and we have already created a .txt
files that describes the experimental paradigm, including samples and conditions.
If you are executing this for the very first time, see this file Install Packages for information on how to download all the packages you will need, which you’ll need to do first before you can implement the function library()
.
library("tximport")
library("readr")
library("tximportData")
library("rhdf5")
library("GenomicFeatures")
library("DESeq2")
library("pacman")
Set the file path to the folder that contains: 1) A directory of folders wtih each of your abundance.h5 files from the kallisto operation, and contains 2) The .txt file describing samples and conditions
setwd("C:/Users/AMDprograms/Desktop/Untrimmed/")
Turn the folder contents into a dataframe called dir:
dir <- dir(file.path("./"))
head(dir)
## [1] "0h.csv" "0h_STLC_vs_DMSO.csv" "0h_txi_kallisto.csv"
## [4] "5190-S1.fastq.gz" "5190-S2.fastq.gz" "5190-S3.fastq.gz"
Turn the list of sample names into a dataframe:
samples <- read.table(file.path("./samples.txt"), header=TRUE)
samples
## sample condition
## 1 S1output DMSO
## 2 S2output STLC
## 3 S3output DMSO
## 4 S4output STLC
## 5 S5output DMSO
## 6 S6output STLC
Make a dataframe with rows listed by sample name:
rownames(samples) <- samples$sample
samples
## sample condition
## S1output S1output DMSO
## S2output S2output STLC
## S3output S3output DMSO
## S4output S4output STLC
## S5output S5output DMSO
## S6output S6output STLC
Make a dataframe that lists the full path to all the kallisto .h5 files:
files <- file.path("./kallisto", samples$sample, "abundance.h5")
files
## [1] "./kallisto/S1output/abundance.h5" "./kallisto/S2output/abundance.h5"
## [3] "./kallisto/S3output/abundance.h5" "./kallisto/S4output/abundance.h5"
## [5] "./kallisto/S5output/abundance.h5" "./kallisto/S6output/abundance.h5"
Add column headers to the files dataframe ordered sample[+n]
. All this does is provide columns for each sample’s reads:
names(files) <- paste0("sample", 1:6)
files
## sample1 sample2
## "./kallisto/S1output/abundance.h5" "./kallisto/S2output/abundance.h5"
## sample3 sample4
## "./kallisto/S3output/abundance.h5" "./kallisto/S4output/abundance.h5"
## sample5 sample6
## "./kallisto/S5output/abundance.h5" "./kallisto/S6output/abundance.h5"
Getting the right mapping for turning your kallisto list of transcripts into a list useable for tximport is important, and somewhat tricky. A few different vignettes suggested codes that did not work for me. Luckily, the authors of kallisto maintain a database of tarball downloads that not only contain high-quality pre-made kallisto indexes (.idx files) for common organisms, but also contain a .gtf
file which is exactly what you can use here to make the TxDb object. So, in short, I used the kallisto tarball which includes the organism’s gtf file as input into TxDb, making it into a TxDb dataframe. Simply navigate to the directory and file where you extracted the tarball (via Unix/Linux, on Windows Linux using WSL/Ubuntu, or on Mac) and you’re good to go.
txdb <- makeTxDbFromGFF("C:/Users/AMDprograms/Downloads/mus_musculus/mus_musculus/Mus_musculus.GRCm38.96.gtf")
You can check what this data is like by calling these:
columns(txdb)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
## [6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
## [11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
## [16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
## [21] "TXSTRAND" "TXTYPE"
keytypes(txdb)
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
Make a key of the transcript names, make a dataframe of this key.
k <- keys(txdb, keytype = "TXNAME")
Make a dataframe using that key to map transcript names to gene names. This will allow tximport to run.
tx_map <- select(txdb, keys = k, columns="GENEID", keytype="TXNAME")
## 'select()' returned 1:1 mapping between keys and columns
head(tx_map)
## TXNAME GENEID
## 1 ENSMUST00000193812 ENSMUSG00000102693
## 2 ENSMUST00000082908 ENSMUSG00000064842
## 3 ENSMUST00000192857 ENSMUSG00000102851
## 4 ENSMUST00000161581 ENSMUSG00000089699
## 5 ENSMUST00000192183 ENSMUSG00000103147
## 6 ENSMUST00000193244 ENSMUSG00000102348
Run tximport, and outputs to a dataframe txi.kallisto:
#convert dataframe to new name
tx2gene <- tx_map
# then
txi.kallisto <- tximport(files, type = "kallisto", tx2gene=tx2gene, ignoreTxVersion = TRUE)
## 1 2 3 4 5 6
## transcripts missing from tx2gene: 1673
## summarizing abundance
## summarizing counts
## summarizing length
## summarizing inferential replicates
head(txi.kallisto$counts)
## sample1 sample2 sample3 sample4 sample5 sample6
## ENSMUSG00000000001 8299 14400.000 9423 9805 12225.000 9895.000
## ENSMUSG00000000003 0 0.000 0 0 0.000 0.000
## ENSMUSG00000000028 792 2803.000 1021 1244 1559.567 1514.813
## ENSMUSG00000000037 194 296.000 283 217 383.000 270.000
## ENSMUSG00000000049 0 0.000 0 0 0.000 0.000
## ENSMUSG00000000056 2401 3793.896 2382 2474 3247.000 2581.000
Turn the tximport dataset into a DeSeq2 object, called a DeSeqDataset:
ddsTxi <- DESeqDataSetFromTximport(txi.kallisto,
colData = samples,
design = ~ condition)
ddsTxi
## class: DESeqDataSet
## dim: 36047 6
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(36047): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000118389 ENSMUSG00000118393
## rowData names(0):
## colnames(6): S1output S2output ... S5output S6output
## colData names(2): sample condition
This prefiltering step is suggested in the DESeq2 vignette. It removes any rows that have less than a total of 10 reads across all samples:
keep <- rowSums(counts(ddsTxi)) >= 10
dds <- ddsTxi[keep,]
dds
## class: DESeqDataSet
## dim: 20185 6
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(20185): ENSMUSG00000000001 ENSMUSG00000000028 ...
## ENSMUSG00000118382 ENSMUSG00000118389
## rowData names(0):
## colnames(6): S1output S2output ... S5output S6output
## colData names(2): sample condition
# the above prefiltering function got rid of many transcripts (36047-x=20185)
This next step will just make sure comparisons are made to the correct control. name the ref to whatever your control condition is named (here it’s DMSO):
dds$condition <- relevel(dds$condition, ref = "DMSO")
Run DeSeq2, which returns a results file res
with log2 fold changes, p values and adjusted p values.
dds <- DESeq(dds)
res <- results(dds)
head(res)
## log2 fold change (MLE): condition STLC vs DMSO
## Wald test p-value: condition STLC vs DMSO
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000000001 10528.932 -0.0921937 0.136806 -0.6739009 0.500374
## ENSMUSG00000000028 1402.240 0.3068577 0.290703 1.0555719 0.291164
## ENSMUSG00000000037 293.507 0.0588162 0.630455 0.0932918 0.925672
## ENSMUSG00000000056 2862.085 0.3036562 0.289039 1.0505727 0.293455
## ENSMUSG00000000058 195.899 0.7890995 0.757948 1.0410992 0.297830
## ENSMUSG00000000078 1130.841 0.3721557 0.277546 1.3408782 0.179960
## padj
## <numeric>
## ENSMUSG00000000001 0.999644
## ENSMUSG00000000028 0.942239
## ENSMUSG00000000037 0.999644
## ENSMUSG00000000056 0.942759
## ENSMUSG00000000058 0.944772
## ENSMUSG00000000078 0.846797
this checks and tells you the comparison that was made (e.g. experimental vs control):
resultsNames(dds)
## [1] "Intercept" "condition_STLC_vs_DMSO"
This is going to shrink effect size, using apelgm estimator (see comment in the block for why this is used):
resFC <- lfcShrink(dds, coef="condition_STLC_vs_DMSO", type="apeglm")
head(resLFC)
# Shrinkage is not needed for text-based analysis, So the estimator
# output is written to a different variable. This would be useful for graphing.
Order your data list in any way you’d like, here it’s ordered by pvalue:
resOrdered <- res[order(res$pvalue),]
summary(res)
##
## out of 20185 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 108, 0.54%
## LFC < 0 (down) : 135, 0.67%
## outliers [1] : 387, 1.9%
## low counts [2] : 3884, 19%
## (mean count < 11)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
You can filter your summary by any parameters you’d like, here its padj <0.1:
sum(res$padj < 0.1, na.rm=TRUE)
## [1] 243
Write your results to a file:
write.csv(as.data.frame(resOrdered),
file="0h_STLC_vs_DMSO.csv")
#you can write this file to output but it is not as detailed as the file
#that is output below, which contains gene descriptions.
Check if all mapping of the data is good:
#check if all mapping of the data is good
any(duplicated(rownames(res)))
## [1] FALSE
any(duplicated(colnames(res)))
## [1] FALSE
load these libraries to prepare to add gene annotations to the dataset:
library("AnnotationDbi")
library("org.Mm.eg.db")
Append gene names (symbols) to the results file:
ens.str2 <- rownames(resOrdered)
head(ens.str2)
## [1] "ENSMUSG00000058492" "ENSMUSG00000078154" "ENSMUSG00000056394"
## [4] "ENSMUSG00000027342" "ENSMUSG00000094777" "ENSMUSG00000094463"
resOrdered$symbol <- mapIds(org.Mm.eg.db,
keys=ens.str2,
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
## 'select()' returned 1:many mapping between keys and columns
head(resOrdered)
## log2 fold change (MLE): condition STLC vs DMSO
## Wald test p-value: condition STLC vs DMSO
## DataFrame with 6 rows and 7 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000058492 192.2996 10.89298 1.361804 7.99893 1.25501e-15
## ENSMUSG00000078154 304.9936 11.55989 1.540744 7.50280 6.24704e-14
## ENSMUSG00000056394 5710.2571 1.34065 0.189609 7.07059 1.54279e-12
## ENSMUSG00000027342 8893.8889 1.14944 0.164325 6.99490 2.65442e-12
## ENSMUSG00000094777 2254.9554 1.94985 0.298221 6.53827 6.22332e-11
## ENSMUSG00000094463 97.9846 9.92236 1.596438 6.21531 5.12223e-10
## padj symbol
## <numeric> <character>
## ENSMUSG00000058492 1.99722e-11 NA
## ENSMUSG00000078154 4.97077e-10 NA
## ENSMUSG00000056394 8.18400e-09 Lig1
## ENSMUSG00000027342 1.05606e-08 Pcna
## ENSMUSG00000094777 1.98076e-07 Hist1h2ap
## ENSMUSG00000094463 1.35859e-06 NA
this turns the resOrdered object into a dataframe in prep for writing to csv file.
resOrdered_df <- as.data.frame(resOrdered)
This now executes a block which does: 1) converts our DESeq2 results to a dataframe 2) makes a merged table that sums together gene names and descriptions
# this will make a dataframe out of merged gene names and shrot descriptions.
gene_desc <- merge(org.Mm.egSYMBOL, org.Mm.egGENENAME)
# we then create a new dataframe merging the gene descriptions to the modified DeSeq2 output.
res_desc <- merge(resOrdered_df, gene_desc)
head(res_desc)
## symbol baseMean log2FoldChange lfcSE stat pvalue
## 1 0610009B22Rik 560.50309 0.86889258 0.4851068 1.7911368 0.07327135
## 2 0610010F05Rik 1084.12469 -0.18228416 0.4195712 -0.4344535 0.66395915
## 3 0610010K14Rik 3366.96582 0.17743986 0.2121151 0.8365262 0.40285892
## 4 0610012G03Rik 1737.34507 -0.05600407 0.2516677 -0.2225318 0.82389990
## 5 0610030E20Rik 673.87868 -0.19946140 0.3437019 -0.5803325 0.56169040
## 6 0610040J01Rik 36.38102 -0.57157967 2.1793335 -0.2622727 0.79311121
## padj gene_id gene_name
## 1 0.6167701 66050 RIKEN cDNA 0610009B22 gene
## 2 0.9996441 71675 RIKEN cDNA 0610010F05 gene
## 3 0.9903657 104457 RIKEN cDNA 0610010K14 gene
## 4 0.9996441 106264 RIKEN cDNA 0610012G03 gene
## 5 0.9996441 68364 RIKEN cDNA 0610030E20 gene
## 6 0.9996441 76261 RIKEN cDNA 0610040J01 gene
this writes a file that contains gene names, DiffEx data, stats, and gene descriptions all together to the current working directory.
write.csv(res_desc,
file="gene_desc_0h_STLC_vs_DMSO.csv")
Your data is now readable in R, command line, python, excel, etc.!