The folks at EBI have produced a summary of all the high throughput sequence aligners commonly used (link), and there is a companion paper in the current issue of Bioinformatics. The timeline is a fascinating info-graphic.
There is also a table describing the features of each. One thing I noticed is that there are relatively few aligners for RNA data, despite RNA-seq being arguably the most important application in NGS right now. (I will cover the use of RNA specific aligners in future posts.) The paper discusses some of the generalities of selecting an appropriate alignment tool for the job but doesn't go so far as to suggest which one is "best".
In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.
We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than some methods. A 2013 paper compared a bunch of gene set analyses software with microarrays and is worth a look.
Generate a rank file
The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).
HISAT is a brand new RNA-seq aligner which promises great speed with a low memory footprint. The Nature Methods paper is worth a look.
So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.
I generated synthetic 100bp reads based on Arabidopsis cDNAs and then incorporated mutations with msbar.
I then aligned these reads to the Arabidopsis genome with default settings and counted (featureCounts) the number of correctly and incorrectly assigned reads with a mapping quality of 20.
Here is the result for unmutated (perfect) 100 bp reads.
HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.
Now here are the results of the mutation experiment.
A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorr…
In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.
It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:
Read the count matrix and DESeq table into R and merge into one tableSort based on p-value with most significant genes on topSelect the columns containing gene name and raw countsScale the data per rowSelect the top 100 genes by significanceGenerate the heatmap with mostly default values
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use buil…