Monday, 19 January 2015

SRA toolkit tips and workarounds

The Short Read Archive (SRA) is the main repository for next generation sequencing (NGS) raw data. Considering the sheer rate at which NGS is generated (and accelerating), the team at NCBI should be congratulated for providing this service to the scientific community. Take a look at the growth of SRA:
Growth of SRA data (

SRA however doesn't provide directly the fastq files that we commonly work with, they prefer the .sra archive that require specialised software (sra-toolkit) to extract. Sra-toolkit has been described as buggy and painful; and I've had my frustrations with it. In this post, I'll share some of my best tips sra-toolkit tips that I've found.

Get the right version of the software and configure it

When downloading, make sure you download the newest version from the NCBI website (link). Don't download it from GitHub or from Ubuntu software centre (or apt-get), as it will probably be an older version. In the binary directory (looks like /path/to/sratoolkit.2.4.3-ubuntu64/bin) there will be a file called sratoolkit.jar. In linux use "java -jar sratoolkit" to open the graphical interface. in the preferences menu, enable the local repository and select a path for it. By doing this, you can then use sra-toolkit to "stream" fastq data (see below).

EDIT: if you are seeing an error like this one:

/data/app/sratoolkit.2.4.3-ubuntu64/bin/fastq-dump --split-files -A ERR366438

2015-02-15T21:47:01 fastq-dump.2.4.3 err: binary large object corrupt while reading binary large object within virtual database module - failed ERR366438

An error occurred during processing.
A report was generated into the file '/data/home/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to '' for assistance.

Then grab the new sra-toolkit version 2.4.4 which seems to fix problems with SRA archives using reference based compression (when submitters provide data in aligned bam format).

Try streaming the data

You can convert sra to fastq on-the-fly by doing either of the following:

fastq-dump -A SRR1722641 -O SRR1722641.fastq

fastq-dump -A SRR900186 -Z > SRR900186.fastq

Streaming paired-end data could be problematic. Use the following to save forward and reverse reads as separate files. Thanks to the folks at Biostars for this idea.

SRR=SRR1041311 ; fastq-dump -X 10 --split-files -I -Z $SRR \
| tee >(grep '@.*\.1\s' -A3 --no-group-separator \
> ${SRR}_1.fastq) >(grep '@.*\.2\s' -A3 --no-group-separator \
> ${SRR}_2.fastq) >/dev/null

Use download accelerator

The SRA team actually recommend using Aspera connect to speed up the download of SRA files. If the stream isn't working for you, give Aspera a try using this script. If you struggle to get Aspera configured, you can try a download accelerator such as axel or aria2c. Here's an example with axel.

axel -n5

After downloading the SRA archive, dump the fastq:

fastq-dump -A SRR900186.sra -Z --split-files

Via the browser

Here are two useful approaches suggested by SeqAnswers. You can download each fastq.gz file individually from your web-browser (not command line interface) replacing the digits after SSR in this link:

or batch download with a link like:,SRR352727,SRR364895&format=fastq

Alternatively find your study accession number (ie. SRP013698) and go to the SRA run selector:

Search with your SRP number, then click on the "Run" link. Click on the "Reads" tab, then click "Filtered Download", change the format to "FASTQ" and hit "Download".

SRA mirrors

Most of the data on SRA is mirrorred at ENA or DNAnexus.

You can download the compressed fastq files from ENA for forward and reverse reads

You can download the SRA archive from DNAnexus too.

Stream directly into your analysis pipeline

You can send the data straight through your QC and alignment pipeline without saving intermediate files. Here is an example using SRA toolkit for Olego alignment:

fastq-dump -A SRR764858 -Z \
| fastq_quality_trimmer -l 25 -t 20 -Q33 \
| olego -t 8  Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_sra.sort

And another using curl from EBI:

curl \
| pigz -d | fastq_quality_trimmer -t 20 -l 25 -Q33 \
| olego -t 8 Athaliana.TAIR10.23.dna_sm.genome.fa - \
| samtools view -uSh - \
| samtools sort - SRR764858_ebi.sort

Dump color-space sequence

Occasionally you'll come across data in color-space format. After downloading the SRA archive do the following.

abi-dump -A SRR1657115.sra

That will dump the sequence in fasta format (SRR1657115.sra.csfasta) and the quality string (SRR1657115.sra.qual) in separate files. Then I use to do quality trimming. Here's an example: -c SRR1657115.sra.csfast -q SRR1657115.sra.qual --moving-average 7:12 --min-read-length 25 > SRR1657115.fasta
Happy data mining!

Thursday, 15 January 2015

Generate an RNA-seq count matrix with featureCounts

Featurecounts is the fastest read summarization tool currently out there and has some great features which make it superior to HTSeq or Bedtools multicov.

FeatureCounts takes GTF files as an annotation. This can be downloaded from the Ensembl FTP site. Make sure that the GTF version matches the genome that you aligned to. FeatureCounts it also smart enough to recognise and correctly process SAM and BAM alignment files.

Here is a script to generate a gene-wise matrix from all BAM files in a directory.
#Generate RNA-seq matrix
#Set parameters

#Make the gene-wise matrix
featureCounts -Q $MAPQ -T $CPUS -a $GTF -o /dev/stdout *bam \
| cut -f1,7- | sed 1d > $GENEMX

The data are now ready to analyse with your favourite statistical package (DESeq, EdgeR, Voom/Limma, etc).

Consider attaching the gene name to give the data more relevance. To do that, first make a table of Ensembl IDs and gene names from the GTF file.
grep -w gene Mus_musculus.GRCm38.78.gtf | head -1000 \
| cut -d '"' -f2,6 | tr '"' '\t' | sort -k 1b,1 > ENS2genename.txt

Next, use unix join (or other method) to attach the gene name.
head -1 > header.txt
sed 1d | sort -k 1b,1 \
| join -1 1 -2 1 ENS2genename.txt - \
| tr ' ' '\t' | sed 's/\t/_/' \
| cat header.txt - >

Data matrix should now show the gene name.
Geneid c1_RNA c2_RNA c3_RNA
ENSMUSG00000000001_Gnai3 674 517 115
ENSMUSG00000000003_Pbsn 0 0 0
ENSMUSG00000000028_Cdc45 52 49 58
ENSMUSG00000000031_H19 175 514 772
ENSMUSG00000000037_Scml2 44 19 7
ENSMUSG00000000049_Apoh 0 8 1
ENSMUSG00000000056_Narf 1570 1396 1027
ENSMUSG00000000058_Cav2 2013 1751 254
ENSMUSG00000000078_Klf6 3822 2470 883

If you're interested in exon specific expression and alternative splicing, include the "-f" switch that will output counts for each exon feature.
#Make the exon-wise matrix
featureCounts -Q $MAPQ -T $CPUS -f -a $GTF -o /dev/stdout/ *bam \
| cut -f-4,7- | sed 1d > $EXONMX

Tuesday, 6 January 2015

Comparing expression profiles

One of the most common tasks in gene expression analysis is to compare different profiling experiments. There are three main strategies:
  1. Compare all data points - using a correlation analysis
  2. Compare sets of up and down-regulated genes - using a binomial or Fisher exact test
  3. Compare sets of genes within a profile - such as GSEA test
In this post, I'll describe how correlation analysis is used between expression data sets of all detected genes.

Merging data sets

No matter what type of correlation used, the profiling data sets need to be merged. This means selecting a field that can the datasets can be merged on. This could be a array probe ID, gene accession number or gene symbol as in this case. I will compare gene expression profiles from two experiments (azacitidine in human and mouse cells). The human gene profile was generated by RNA-seq and the mouse data set by microarray.

The human data is currently in CSV format from Degust  and looks like this:


The mouse gene symbols were converted to human as per a previous post and look like this:

MusGeneSymbol MouseAccession HumanAccession HumanGeneSymbol FoldChange P-value adjP-value
Krt8 ENSMUSG00000049382 ENSG00000170421 KRT8 6.9755573 2.24e-08 0.000346
Rpl39l ENSMUSG00000039209 ENSG00000163923 RPL39L 3.8096523 2.32e-08 0.000346
Sprr1a ENSMUSG00000050359 ENSG00000169469 SPRR1B -3.441516 6.46e-08 0.000378
Sprr1a ENSMUSG00000050359 ENSG00000169474 SPRR1A -3.441516 6.46e-08 0.000378

So here's a way to perform the join in the UNIX shell environment:

#first sort the mouse spreadsheet on the field to be 
#joined (human gene symbol)
sort -k 4b,4 mouse.xls > mouse_s.xls

#manipulate the human expression data
#remove header, convert to tabular file then sort
#on gene symbol and join with the mouse file
sed 1d human.csv | tr ',' '\t' | sort -k 1b,1 \
| join -1 1 -2 4 - mouse_s.xls > human+mouse.xls

Now have a look at the number of lines in the files, there will be fewer genes that were detected in both experiments, in this case, just 10914 genes for comparison.

  15586 human.csv
  16421 mouse.xls
  10914 human+mouse.xls

Pearson correlation based on fold change

This is perhaps the most simple analysis of two datasets could be done based on similarities of the expression fold changes.

#Open an R session
#load in the data
#plot the mouse fold change (col 14) against 
#the human fold change (col3)
plot(x$V14,x$V3, xlab="mouse", ylab="human")

Comparison of fold changes in mouse and human cells exposed to azacitidine.
If we want to plot the trend line and the Pearson correlation then we can follow the tutorial here.

Comparison of fold changes in mouse and human cells exposed to azacitidine now including trend line equation and correlation coefficient.

So it seems that there is a slight negative correlation with this method. We also see that a relatively few extreme values seem to be influencing the correlation.

Spearman correlation based on fold change

The Spearman correlation could be better because it is less susceptible to outliers. We rank the genes from highest fold change to lowest with the R rank command. Once the genes are ranked, use the Pearson procedure above to plot the graph.

#make a new dataframe containing the ranks
y<-data.frame(x$V1, rank(x$V14, na.last = NA , ties.method = c("average")) , rank(x$V3, na.last = NA , ties.method = c("average")))
plot(y$mouse,y$human,cex=.5,xlab="mouse rank", ylab="human rank")
spearmans_r = cor(y$mouse,y$human)
r_squared = spearmans_r^2
fit = lm(y$human~y$mouse)
y_intercept = as.numeric(fit$coeff[1])
slope = as.numeric(fit$coeff[2])
function_string = paste("f(x) = ", slope, "x + ", y_intercept,  sep="")
r_sq_string = paste("R^2 =", r_squared)
display_string = paste(function_string, r_sq_string, sep="\n")
mtext(display_string, side=3, adj=1)

Spearman correlation of fold changes between two expression profiling experiments.
From the graph, we see that apart from a few genes clustered in the corners, overall there is only a very slight negative correlation.

Pearson correlation based on p-value derived differential expression metric 

The fold change might not be the best metric to perform these comparisons because many of the genes with variable fold changes could have high p-values, and hence not be that reliable. The differential expression metric (DEM) I use is the signed log10-transformed p-value with the sign denoting the direction of change, positive for increasing and negative for decreasing (reference). I use the raw (not FDR) p-value because it has a smoother dynamic range. I presented the bash script for generating the DEM in a recent post. I used that script to generate DEM scores for each experiment then joined them as above.

Pearson correlation of differential expression metric (DEM) scored derived from p-values.
The result again shows only a very slight negative correlation.

Spearman correlation based on p-value derived differential expression metric

I ranked the genes by DEM score and plotted it.

Spearman correlation of differential expression metric (DEM) scored derived from p-values.
This result shows a vanishingly small negative correlation.


I used four methods to analyse similarity between the expression profiles. Unexpectedly, the four analyses all showed that azacitidine treatment produced quite different results in these two contexts. That could be due to factors such as the difference in cell type used (human AML3 vs mouse C2C12), the profiling platform (human by RNA-seq and mouse by array) or other factor.

Sunday, 4 January 2015

How to generate a rank file from gene expression data

Turning a gene expression profile into a ranked list is useful for comparing with other profiling data sets as well as an input for preranked GSEA analysis (example here). In this post, I describe a simple bash script called that can take gene expression data from a range of sources, such as edgeR, DESeq, GEO2R, etc., and generate a ranked list of genes from most up-expressed to most down-expressed based on the p-value.

#!/bin/bash converts a differential gene expression spreadsheet (XLS) into a rank file (RNK)

#Specify the input file
#Specify the gene ID column
#Specify the fold change value column
#Specify the raw p-value column

sed 1d $XLS | tr -d '"' \
| awk -v I=$ID -v F=$FC -v P=$P '{FS="\t"} $I!="" {print $I, $F, $P}' \
| awk '$2!="NA" && $3!="NA"' \
| awk '{s=1} $2<0{s=-1} {print $1"\t"s*-1*log($3)/log(10)}' \

| awk '{print $1,$2,$2*$2}' | sort -k3gr \
| awk '{OFS="\t"} !arr[$1]++ {print $1,$2}' \
| sort -k2gr > ${XLS}.rnk

After you save the script to a file, allow it to be executed

chmod +x

The usage of the script is as follows:
./ /path/to/spreadsheet.xls GeneID_column FoldChange_column P-value_column

So for a spreadsheet which looks like this one generated by Degust previously:
gene c aza FDR C1 C2 C3 A1 A2 A3
LYZ 0 -1.65809730597384 4.70051409587021e-16 17393 14675 16300 4951 5064 4841
IFI27 0 -3.40698193025161 1.01987358298544e-15 5695 5552 4630 475 532 442
SLC12A3 0 -3.46002052417193 6.78905193176251e-14 1129 1051 991 79 109 92
HLA-B 0 -1.41354726616356 6.78905193176251e-14 6678 5597 5567 2167 2281 2042
IFI6 0 -2.76026110717513 7.38087480096726e-14 5884 5044 4171 764 764 624
OAS3 0 -2.64899272132832 1.17759867975735e-13 5196 5812 4621 830 824 753
HLA-C 0 -1.39473306959746 1.18784239161695e-13 3715 3324 3382 1305 1350 1193
IGFBP5 0 -5.1043039727239 1.61487772234777e-13 1706 1282 1643 39 39 52
MX1 0 -2.84129882209861 1.61487772234777e-13 2161 2245 1909 277 339 242

So in this case we run it as follows:
./ degust.xls 1 3 4 

And have a look at the top and bottom 5 lines of output:
CAMK1D 9.80546
AC011899.9 9.76077
CGNL1 9.56363
GNG4 9.42442
MKX 8.93011

IFI6 -13.1319
HLA-B -13.1682
SLC12A3 -13.1682
IFI27 -14.9915
LYZ -15.3279

We see that the result is just what we're after.

EDIT: Mac users take note that you should try the R solution.