Pages

Tuesday, 28 October 2014

Geneclouds: unconventional genetics data visualisation

You have probably seen word clouds before - but have you tried with gene expression data?

I used the following bash script to process a DESeq spreadsheet from a a previous RNA-seq post. The script extracts the gene name and p-value of the genes with differential expression. I used awk to separate the up and down regulated genes into different files. The score used to inform the font size is the exponent of the p-value. So this works best when there are a lot of statistically significant genes p-values. The data looks like this:

AC011899.9:60
CAMK1D:54
GNG4:44
CGNL1:41
APCDD1:37
HSD11B1:33

Now go to the "advanced" tab of the Wordle page and paste in the data. Experiment, with the colours, layouts and be sure to increase the "maximum words" to get a real appreciation for the number of changes in your experiment. Here is an example I made using both the up and down regulated gene sets showing the effect of azacytidine on AML3 cells. The result is pretty amazing.
Differential gene expression wordcloud of azacytidine treated AML3 cells treated with azacytidine. The left panel shows the up-regulated genes and the right panel shows the down-regulated genes. Font size is proportional to the exponent of the p-value.

Script used to process data

awk '$6>0 && $8<0.05 {print $1,$8}' DESeq.xls \
| awk '{printf "%4.3e\t%s\n", $3 , $2}' \
| sed 's/e-/@/' | cut -d '@' -f2- | awk '{print $2":"$1}' > ups.txt

awk '$6<0 && $8<0.05 {print $1,$8}' DESeq.xls \
| awk '{printf "%4.3e\t%s\n", $3 , $2}' \
| sed 's/e-/@/' | cut -d '@' -f2- | awk '{print $2":"$1}' > dns.txt



Saturday, 25 October 2014

RNA-seq aligner accuracy tested with simulated reads

In the previous post we looked at how choice of RNA-seq aligner influenced results. In this post, we'll use simulated reads to empirically determine the accuracy of OLego, STAR, SubJunc and SubRead.

I made a pretty basic bash script (see below) to generate fasta format reads at uniform intervals along the length of transcripts without errors. The user can readily change the read length and the interval between read initiation. The script created 11.3 million 100 bp reads in 9 minutes so it is OK in terms of speed. It is designed to take Ensembl cDNA files as an input and has been tested on human and Arabidopsis. Here is what the generated reads look like

>AT2G01210.1:gene:AT2G01210_197863
CGTCAGCTTTCGTTCTGGGGAAGAGCGGAATCGGAATTGTCTACAAAGTG
>AT2G01210.1:gene:AT2G01210_197864
GTTCTAGAGAACGGGCTCACACTGGCCGTACGGAGATTGGGTGAAGGAGG
>AT2G01210.1:gene:AT2G01210_197865
GTCTCAGAGATTCAAGGAGTTTCAGACAGAAGTTGAAGCCATAGGGAAAC
>AT2G01210.1:gene:AT2G01210_197866
TAAAACATCCTAACATTGCTAGTCTTCGAGCTTATTATTGGTCTGTCGAT

As you can see in the example above, the name of the transcript (AT2G01210.1) and gene (AT2G01210) is provided in the fasta heading to allow the identification of correctly and incorrectly mapped reads. The first step was to generate read sets with a reads length of 50, 100 and 200 bp in length, align and count the mapped reads with featureCounts using a mapQ threshold of 10.


The results show that STAR produced the highest correct mapping rates at 50 and 100 bp; while at 200bp, SubJunc showed slightly better correct mapping rates. SubRead showed the higher incorrect mapping rate potentially as it is not optimised for spliced read alignment, but SubJunc was not much better. STAR showed the lowest incorrect mapping rates across all read lengths.

The next step was to incorporate mutations into the sequences, which I did with msbar utility, which is part of the EMBOSS suite.

Fig1. 50 bp reads with single nucleotide changes (upper panels), deletions (centre panels) and insertions (lower panels). Left hand side shows correctly mapped reads and Right hand side panels show the rate of incorrect read mapping at mapQ=10.
Fig2. 100 bp reads with single nucleotide changes (upper panels), deletions (centre panels) and insertions (lower panels). Left hand side shows correctly mapped reads and Right hand side panels show the rate of incorrect read mapping at mapQ=10.
These data show that when single nucleotide changes are incorporated, STAR aligner clearly performs better in identifying the most correct alignments while limiting incorrect read alignments to low levels. STAR, SubRead and SubJunc show similar proportions of correctly mapped reads with increasing insertions, however SubRead and SubJunc showed the highest proportion of incorrectly mapped reads.

Bottom line

STAR showed the best correct mapping rate and lowest error rate with exact reads and reads containing single base mismatches. STAR also performed robustly when dealing with indel-containing reads. Thus I would recommend STAR for all 50-200 bp single end RNA-seq analyses. SubJunc and SubRead showed relatively high incorrect assignment rates, and I would suggest using more strict mapQ thresholds for these aligners if you need to use them.

RNA-seq Read simulator script

#!/bin/bash
THISSCRIPT=Simrd4.sh
CDNA=Arabidopsis_thaliana.TAIR10.23.cdna.all.fa
RDLEN=200
STEP=5
OUTFILE=simreads.fasta
EXE=exe.sh
EXPL=explode

rm -rf $EXPL 2>/dev/null
mkdir $EXPL
cd $EXPL
grep -A20 '^##' ../$THISSCRIPT | sed 1d | sed "s/HOLDPLACE/${RDLEN}/" > $EXE
chmod +x $EXE

perl ../unwrap_fasta.pl ../$CDNA - | cut -d ' ' -f1,4 | tr ' ' ':' \
| tr -d '\()/"' | paste - - | tr '\t' '!' > tmp1
split --additional-suffix=.split -l 1000 tmp1 tmp.
ls tmp*.split > tmp2
parallel -j 8 ./$EXE < tmp2 > ../$OUTFILE
cd ..
#rm -rf $EXPL
exit
############
#!/bin/bash
RDLEN=HOLDPLACE
STEP=5
for CDNA in `cat $1`
do
NAME=`echo $CDNA | cut -d '!' -f1`
for OFFSET in `seq 1 $STEP $RDLEN`
do
#echo $CDNA $OFFSET
echo $CDNA | cut -d '!' -f2 \
| cut -c${OFFSET}- \
| sed "s/\(.\{$RDLEN\}\)/\1\n${NAME}\t/g"
done
done | awk -v L=$RDLEN 'length($2)==L {print $0}' | nl -n ln \
| awk '{print $2"_"$1"\t"$3}'| tr '\t' '\n'

Mutate sequences

#!/bin/bash
for FQ in simreads_50bp simreads_100bp simreads_200bp
do
for COUNT in 1 2 4 8 16
do
pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}snp.fasta.gz &
done; wait

pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}ins.fasta.gz &

pigz -dc ${FQ}.fasta.gz \
| msbar -sequence /dev/stdin -count $COUNT \
-point 3 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
| pigz > ${FQ}_${COUNT}del.fasta.gz &
done ; wait
done
wait

Align reads to genome

See previous post

Determine corrrectly mapped reads

#!/bin/bash
GTF=/path/to/arabidopsis.gtf
featureCounts -R -Q 20 -s 0 -T 8 -a $GTF -o simreads.fcount.txt *.bam

for i in *featureCounts
do
CNT=`sed 's/./\t/' $i | grep -w Assigned | awk '$1==$4 | wc -l`
echo $i $CNT
done

Wednesday, 22 October 2014

Forty-Four and Two

If you like science, you will love Forty-Four and Two. A new blog that delves into genetics and molecular biology for the wider audience. His latest post is an impressive poster describing some data we published together earlier this year in CMLS. Keep up the good work!

Thursday, 9 October 2014

RNA-seq aligners: Subread, STAR, HPG aligner and Olego | PART 2

In the previous post, I compared the speed of several RNA aligners. In this post, we'll take a closer look at the results generated by these different aligners; Subread/Subjunc, STAR, HPG aligner and Olego. In addition, we will use BWA MEM as an example of what happens when you use a DNA aligner to analyse RNA-seq. For the test, we've been using 101 bp Arabidopsis mRNA-seq data from GEO accession GSE42968. For simplicity, I use only the forward read in the paired-end dataset. I used fastx-toolkit to remove low quality bases from the 3' end (base qual threshold of 20).

One of the simplest ways to assess the quality of an alignment is to determine the proportion of reads that are mapped to the genome and the proportion that map to exons. After the aligners did their work, I used featureCounts to quantify the number of aligned reads with a mapping quality >10. Here is the data for the first sample in the series, SRR634969 which contained 14.5 million reads.

Read mapping statistics for SRR634969 with different aligners.
The first thing to notice from the table is that the BAM files do not contain the same number of lines. HPG, OLego, SubJunc and SubRead contain all the QC passed reads which is the normal procedure. STAR and BWA MEM on the other report more entries in the bam file than starting sequences. It would appear that BWA MEM outputs multiple alignment records if two parts of the query sequence align at different parts of the genome. This is potentially bad because it effectively double counts 15% of the reads which could distort the results. STAR on the other hand seems to output only mapped reads. STAR also outputs reads that align to >1 location (4.5% of reads) but assigns a sensible low map quality score to those reads.

HPG aligner showed the highest proportion of reads with map quality scores ≥10 (98.7%), but SubJunc had the greatest proportion of assigned reads (95.3%). Thus SubJunc gave the highest amount of usable data, albeit only slightly better than STAR and HPG.

Next, I loaded the data into R, scaled and log transformed it and generated an MDS plot.
MDS plot for SRR634969 when using different aligners. Data is scaled and log transformed.

Also, I generated a pairs plot to visualise the data correlation.
Pairwise comparison of transcript representation computed by different alignment software.

Both the MDS and pairs plot shows highly similar results for HPG, OLego, STAR, SubJunc and SubRead.

Next, I want to identify how many genes are significantly differently represented by the different aligners in one sample (SRR634969). To do this, I ran DESeq in no replicate mode. Genes with differential representation FDR≤0.05 are shown in the graph below.

Differentially represented genes in one data set (SRR634969) depending on alignment software (FDR≤0.05). Genes in the upper diagonal are higher in the software listed at the top, while genes in the lower diagonal are higher in the software listed on the LHS. Differential analysis was done with DESeq.
The table shows that results from OLego and STAR are quite similar. SubJunc and SubRead are also similar to each other. There were some differences between SubJunc/SubRead and OLego/STAR. Here are the top 20 lines of the comparison of STAR and SubJunc.

Differentially represented genes in SRR634969 dataset aligned with STAR or SubJunc. Positive values for fold change suggests higher representation in SubJunc data. Differential analysis was done with DESeq.
I was quite surprised to see such large changes between the two aligners. SubJunc seems to be picking up expression of genes that are not detected at all by STAR. The gene ATCG01180, that encodes a chloroplast 23S rRNA, seems to be picked up best by SubJunc/SubRead followed by HPG and to a lesser extent by OLego/STAR/BWA MEM. This suggests that SubJunc is better able to distinguish between two highly similar potential origins of RNA-seq reads.

So we find that there is a difference in the gene expression values depending on the choice of aligners (for a few genes at least). But does the choice of aligner impact the differentially expressed genes (DEGs) considering all samples in the experiment? I did just that, using default DESeq settings (manual section 2 and 3.1). Here is the number of DEGs for the different aligners:

BWA MEM 381
HPG 362
OLego 351
STAR 358
SubJunc 363
SubRead 365

As you can see, BWA MEM has the most, but clearly the double-counting issue likely means that there are a number of false positives there. Generally, the number of DEGs tracks the number of assigned reads pretty well; with HPG, STAR, SubJunc and SubRead showing more DEGs than OLego. I analysed the overlap between these differentially expressed gene sets using Venn SuperSelector at Bio-Analytic Resource.
Overlap of DGE gene sets derived from alignments by different software. Overlap analysis performed by Bio-Analytic Resource Venn SuperSelector.
The result shows that all the aligners suited to RNA-seq data agree, showing an overlap ratio of >95%. BWA MEM, despite not designed for RNA-seq, was still able to identify 94.8% of DGEs. So while SubJunc/SubRead gives slightly better mapping, this has only a minor effect on the number of DEGs identified.

Alignment script

$ cat RNA-seq-aligners.sh
#!/bin/bash
for FQZ in `ls *_1.fastq.gz `
do
FQ=`echo $FQZ | sed 's/fastq.gz/trim.fastq/'`
pigz -dc $FQZ | fastq_quality_trimmer -t 20 -l 25 -Q33 > $FQ &
done
wait

for FQ in *trim.fastq
do
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir star/ \
--runThreadN 8 \
--sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf
wait
mv Aligned.out.sam ${FQ}.STAR.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
hpg-aligner rna -f $FQ \
-i hpgaligner-arabidopsis/ \
--report-n-hits=1  --report-n-best=1 \

-o ./${FQ}_hpg/ --cpu-threads=8 --log-level=10
done


for BAM in *_hpgalignments.bam
do
samtools sort $BAM ${BAM}.sort &
done
wait
rm *_hpgalignments.bam

for FQ in *trim.fastq
do
subread-align -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subread.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
subjunc -T 8 \
-i Arabidopsis_thaliana.TAIR10.23.dna_sm.genome \
-r $FQ -o ${FQ}.subjunc.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
olego -t 8 \
Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.olego.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam

for FQ in *trim.fastq
do
bwa mem -t 8 Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
$FQ > ${FQ}.bwamem.sam
done

for SAM in *sam
do
samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort &
done
wait
rm *sam
rm *fastq

for BAM in *bam ; do samtools index $BAM ; done
for BAM in *bam ; do samtools flagstat $BAM > ${BAM}.stats ; done

featureCounts -Q 10 -s 0 -T 8 -a Arabidopsis_thaliana.TAIR10.23.gtf \

-o ALL_cnt.xls *bam

Pairs plot

#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
x<-read.table("SRR634969_cnt", header=T, row.names=1)
png(filename="pairs_SRR634969.png")
chart.Correlation(x,histogram = TRUE, method = c("pearson"))
dev.off()


MDS plot

x<-log2(scale(read.table("CountMatrix.xls", row.names=1, header=TRUE)))
pdf("MDSplot.pdf") 
plot(cmdscale(dist(t(x))), xlab="Coordinate 1", ylab="Coordinate 2", type = "n") ; text(cmdscale(dist(t(x))), labels=colnames(x), ) 
dev.off()

DESeq analysis of aligners

Analysis of SRR634969 in no replicate mode:
# Run comparisons of aligners
#source("http://bioconductor.org/biocLite.R")
#biocLite("DESeq")
library( "DESeq" )
cnt<-read.table( "SRR634969_cnts_fmt_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("BM","HP","OL","ST","SJ","SR"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <-estimateDispersions(cds,method="blind",sharingMode="fit-only",fitType="local")

res <- nbinomTest( cds, "BM", "HP")
write.table(res, "BMvHP.xls" , quote = FALSE , sep = "\t", row.names = FALSE)
#repeat to compare the other aligners

Analysis of all 6 samples in the experiment using default DESeq.

cnt<-read.table( "bwamem_cnt.xls_filt.xls",header=TRUE, row.names=1)
design <- data.frame(row.names = colnames(cnt),condition = c("c","c","c","t","t","t"))
countTable <- cnt
conds <- design$condition
cds <- newCountDataSet( countTable, conds )
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
res<-nbinomTest(cds,"c","t")
write.table(res, "bwamem_deseq.xls" , quote = FALSE , sep = "\t", row.names = FALSE)

#repeat to compare the other aligners

Wednesday, 1 October 2014

RNA-seq aligners: Subread, STAR, HPG aligner and Olego

In this post I compare the speed of Subread/Subjunc, STAR, HPG aligner and Olego. I will use real RNA-seq data from GEO accession GSE42968 and align to the Arabidopsis thaliana genome. All code used is found at the bottom of this post. The benchmarking was performed on a standard 8 core workstation with 8 GB RAM.

Index the Arabidopsis thaliana genome

The first step in any RNA-seq analysis is to prepare the genome for alignment. Subread was the quickest.
Time required to index the Arabidopsis genome (119.67 Mbp).

Align RNA-seq reads

I down sampled the sequence (SRA accesion:SRR634969) file to generate files containing 1M, 2M, 5M, 10M and 14M reads selected randomly. Fastq quality trimmer was used to trim low quality bases from the 3' end and filter low quality reads. Then I used Olego, Subread, Subjunc, STAR and HPG aligner to align the reads. (I wanted to test HPG aligner in both the BWT and SA modes but didn't have the 17 GB RAM to test SA algorithm sadly.) In all the below cases the sequence files were uncompressed fastq (101 bp reads) and output was uncompressed/unsorted SAM format. I monitored the execution time, peak memory and CPU utilisation using a benchmarking script. The below figures are averages of 3 replicates.

As anticipated,  STAR was fast but also very RAM hungry. Olego was the slowest, but required the least RAM. Memory usage of HPG aligner increased with the number of reads, suggesting that it loads in large chunks of reads into RAM before alignment.

Execution time for RNA-seq alignment for fastq files of varying sizes to the Arabidopsis genome. Read length is 101 bp. 

Peak memory usage for RNA-seq alignment to the Arabidopsis genome with fastq files of varying sizes.

CPU usage for RNA-seq alignment to the Arabidopsis genome with fastq files of varying sizes.

Performance summary

From the above execution time figures, I could calculate the time taken to load the index into memory as well as the time taken to process 1M reads. These data show that STAR takes the longest to read the index into memory, but STAR has a much faster performance once the index is loaded.


Bottom line

STAR was the stand-out performer in terms of speed, at least 3 times faster than its nearest competitor. Given that the user can specify that the index remain in memory after an alignment job means that the modest time taken to load the index into memory can be further reduced. If limited RAM prevent the use of STAR, then Subread and HPG aligners (bwt mode) offer marginal gains in terms of RNA-seq alignment speed compared to Olego. The difference between Subread and Subjunc speed was relatively small. We will assess HPG aligner in the SA mode in another post.

Code used

Download the RNA-seq data

$ cat url.txt
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634969/SRR634969.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634970/SRR634970.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634971/SRR634971.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634972/SRR634972.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634973/SRR634973.sra
ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR634/SRR634974/SRR634974.sra



#!/bin/bash
URLFILE=url.txt
NUM=8
for URL in `cat $URLFILE`
do axel --num-connections=$NUM $URL
done


Unpack SRA archives

#!/bin/bash
for SRA in SRR634969.sra
do
fastq-dump --gzip --split-3 $SRA
done

Check quality scores 

pigz -dc SRR634969_1.fastq.gz \
| fastx_quality_stats > SRR634969_1_quals.txt

Perform quality trimming and subsample

pigz -dc ../SRR634969_1.fastq.gz \
| fastq_quality_trimmer -t 20 -l 25 -Q33 \
| paste - - - - | shuf \
| pigz > SRR634969_1.tab.gz

for AMT in 1 2 5 10 14
do
NAME=`echo SRR634969_1.tab.gz | sed "s/tab.gz/${AMT}.fastq.gz/"`
pigz -dc SRR634969_1.tab.gz \
| head -${AMT}000000 \
| tr '\t' '\n' > $NAME
done

Download the Arabidopsis thaliana genome and annotation:

wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa.gz

wget ftp://ftp.ensemblgenomes.org/pub/release-23/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.23.gtf.gz

Index the Arabidopsis thaliana genome:

Olego:
olegoindex -a bwtsw Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa

Subread:
subread-buildindex -o Arabidopsis_thaliana.TAIR10.23.dna_sm.genome Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa

STAR:
STAR --runMode genomeGenerate \
--genomeDir star/ \
--genomeFastaFiles star/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa \
--runThreadN 8

HPG aligner*:
hpg-aligner build-bwt-index \
-g Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa \
-i hpgaligner-arabidopsis/ -r 8

Run alignments

Olego:
olego -t 8 /olegoindex/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa SRR634969_1.1.fastq > SRR634969_1.1.fastq.olego.sam

Subread:
subread-align -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam

Subjunc:
subjunc -T 8 -i subread/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome -r SRR634969_1.1.fastq -o SRR634969_1.1.fastq.subread.sam

STAR:
STAR --genomeDir star/ --readFilesIn SRR634969_1.1.fastq --runThreadN 8 --sjdbGTFfile Arabidopsis_thaliana.TAIR10.23.gtf

HPG-aligner:
hpg-aligner rna -f SRR634969_1.1.fastq -i hpgaligner-arabidopsis/ 
--report-n-best 1 -o ./SRR634969_1.1.hpg/ --cpu-threads=8

*Notes

HPG aligner needs dev versions of curl and gsl
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev
sudo apt-get install libgsl0-dev
HPG does not appear to tolerate wobble bases, this took a long time to figure out. use linux "tr" or other tool to replace wobble bases (ie: DKMRSWY) with N's.

Versions

subread-align and subjunc version 1.4.5-p1
OLego: version 1.1.5
STAR_VERSION "STAR_2.4.0d"
HPG-aligner v2.0 (2014)