Pages

Wednesday, 19 November 2014

microRNA aligners compared

Alignment of microRNA to the genome poses a particular challenge because the reads are short, and some microRNAs are nearly identical. Moreover, microRNAs themselves are subject to RNA editing (adenine-to-inosine conversion, non-templated base addition) and normal sequencing error rates. In this post, I'm going to test the performance of several aligners in aligning microRNA reads to the Arabidopsis genome. 

I downloaded the Arabidopsis genome from Ensembl plant and the latest miRbase release version 21. I used bowtie2 to align the 325 full-length hairpin transcripts to the Arabidopsis genome. I generated pseudo microRNA reads that uniformly cover the hairpin transcript at a range of lengths from 16 nt to 25 nt. I then aligned the reads to the Arabidopsis genome using these different aligners with the default settings. I then used bedtools and awk to count the correctly and incorrectly mapped reads at a mapQ threshold of 10. 

Table 1. Performance of several aligners in pseudo microRNA read alignment to the Arabidopsis genome.
Firstly note that there were no alignment with either Soap2 or BWA mem and so these aligners are not suitable for microRNA with the default settings. SubRead only started generating alignments at 24 nt in length so again, not suitable for microRNA with the default parameters. I graphed the performance of Bowtie1, Bowtie2, BWA aln, OLego and STAR.

Figure 1. Performance of several aligners in pseudo microRNA read alignment to the Arabidopsis genome.
Figure 1 shows that Bowtie1 has the highest correct mapping rate, but this comes at a cost of increased incorrect mapping. Bowtie2, STAR, BWA aln and OLego showed fairly similar results with 21mers aligned correctly ~81% and incorrectly ~2.2% of the time.

To simulate the challenges with real-world situations where the reads may have sequencing errors, RNA editing and bona fide SNVs, I mutated the reads using EMBOSS msbar. I incorportated up to 3 SNPs, and up to 2 insertions or deletions.

Figure 2.  Mutated pseudo microRNA read alignment to the Arabidopsis genome.
Next, I incorporated up to 2 random nucleotides to the 3 prime or 5 prime ends and performed the same test.
Figure 3.  End-modified pseudo microRNA read alignment to the Arabidopsis genome.
Then I summarised the results by calculating the average % correct/incorrect mapping rates for the different tests in the range of read lengths (16 to 25 nt).
Table 2. Aligner performance for mutated pseudo microRNA reads to the Arabidopsis genome. Read length analysed was 16 to 25 nt.
These test show that Bowtie1 is quite inaccurate, even for exact-matching reads and so I would caution against using it for microRNA. Of the remaining four aligners in this test, Bowtie2 showed the lowest incorrect mapping rates at the cost of lower correct mapping of mutated reads. BWA aln, OLego and STAR had very similar performance.

My follow up to this work was recently published in RNA Journal!

Tuesday, 4 November 2014

DNA aligner accuracy: BWA, Bowtie, Soap and SubRead tested with simulated reads

In the past few posts, we've looked at RNA-seq aligner performance in terms of accuracy and speed. In this post, I'll take a look at the accuracy of DNA aligners using simulated reads.

The first step is to download the genome of interest. I'm using Arabidopsis as its pretty small and good for quick benchmarking. I downloaded the genome from Ensembl plant ftp site (link).

Next step was to generate simulated reads that uniformly cover the genome at user-selected length and intervals. I couldn't find any previously made simulators to do exactly that, so I chained together some awk and bedtools commands (code at the bottom of the post) to generate the reads. I generated pseudo reads of 50, 100 & 200 bp in length at 10 bp intervals and output them in fasta format. Here is an example.

>1-0-50
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAA
>1-10-60
TAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT
>1-20-70
ACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCC

The sequence name is chromosome-start-end. Next I mutated the reads with msbar, a neat little EMBOSS tool which incorporates mutations (SNP/insertions/deletions) at a set rate into sequences. Then I aligned these reads back to the genome. Here are the aligners and versions I used. I used all default values.

BWA aln 0.7.5a-r405
BWA mem 0.7.5a-r405
Bowtie2 2.1.0
Soap 2.21
Subread 1.4.5-p1
STAR 2.4.0d

I included STAR here out of curiosity because it was the most accurate RNA-seq aligner in our last comparison. To determine the number of correctly and incorrectly assigned reads, I used samtools and awk to check the sequence header matched the mapping location. I set a mapQ threshold of 10 which is one that is commonly used. First I looked at how these aligners fared with unmutated, perfect reads.
Table 1: Aligning perfect reads to the Arabidopsis genome.
This result shows that all of the aligners had very low incorrect assignment rates, except for SubRead. BWA mem had the lowest correct assignment rate at all read lengths. Soap and Bowtie2 had the highest correct assignment rate at 50 bp; BWA aln and Soap had the highest correct assignment rate at 100 bp; and SubRead had the highest correct assignment rate at 200 bp. Even though STAR is a splice aware RNA-seq aligner, it performed pretty well on this dataset.

Then I analysed the mutated reads. I wasn't able to run all SubRead analyses as it was just taking too long.

Figure 1: Alignment of mutated 50 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 50 bp length, STAR was most tolerant to SNPs, followed by Soap, BWA aln and BWA mem. Bowtie2 and SubRead were least SNP-tolerant. BWA mem was most tolerant to indels, followed by STAR, BWA aln and Bowtie2; while Soap was the least indel-tolerant. BWA mem showed the lowest incorrect mapping rates as a proportion of correctly mapped reads.

Figure 2: Alignment of mutated 100 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 100 bp, STAR and BWA mem were the most SNP tolerant, followed by SubRead, Bowtie2, BWA aln and lastly Soap. BWA mem also showed the best tolerance to indels, followed by STAR, Bowtie2, SubRead, BWA aln and lastly Soap.

Figure 3: Alignment of mutated 200 bp reads. Left-side panels show rate of correctly mapped reads, and right-side shows incorrectly mapped reads. Upper panels show single nucleotide polymorphisms(base changes), middle panels show 1-base insertions and lower panels show 1-base deletions.
At 200 bp just like 100 bp, BWA mem and STAR were the most SNP tolerant, followed by SubRead, Bowtie2, BWA aln and lastly Soap. BWA mem also showed the best tolerance to indels, followed by STAR, Bowtie2, BWA aln and lastly Soap. STAR performed well in terms of correctly assigned reads, but showed a comparatively high false assignment rate.

Limitations of this analysis

I used Arabidopsis, so this might not translate 100% over to mammalian-sized genomes. Also I used indels of only 1 bp so this doesn't really resolve what happens with larger indel sizes. Real sequence data may contain masked bases, as well as mixtures of indels and SNPs which weren't considered here.

Bottom line

For aligning 50 bp reads to the genome BWA aln, Bowtie2 and BWA mem would be recommended. At 100 and 200 bp BWA mem was clearly the best. Soap is not recommended due to indel intolerance. SubRead showed a comparatively high false assignment rate, even with perfect reads making it a poor choice.

Here is the full table:
Table 2: Correct and incorrect read assignment of mutated reads with different aligners at 50, 100 and 200 bp lengths to the Arabidopsis genome. * Denotes abandoned test due to slow running times.

Simulate DNA sequence reads

#!/bin/bash
CHR=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.g
GENOMEFA=Arabidopsis_thaliana.TAIR10.23.dna.genome_fmt.fa

READLENGTHRANGE='50 100 200'
STEP=10
PFX=simDNA_

for RDLEN in $READLENGTHRANGE
do
echo $RDLEN
bedtools makewindows -w $RDLEN -s $STEP -g $CHR \
| awk -v L=$RDLEN '($3-$2)==L' \
| bedtools getfasta -fi $GENOMEFA -bed /dev/stdin -fo /dev/stdout \
| tr ':' '-' | pigz > ${PFX}${RDLEN}bp.fasta.gz  &
done
wait

Mutate reads

#!/bin/bash
for FQZ in simDNA_50bp.fasta.gz simDNA_100bp.fasta.gz simDNA_200bp.fasta.gz
do
 for COUNT in 1 2 4 8 16
 do

 BASE=`echo $FQZ | sed 's/.fasta.gz//'`

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -count $COUNT \
 -point 4 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}snp.fasta.gz &

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -count $COUNT \
 -point 2 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}ins.fasta.gz &

 pigz -dc ${FQZ} \
 | msbar -sequence /dev/stdin -point 3 -count $COUNT \
 -block 0 -codon 0 -outseq /dev/stdout 2>/dev/null \
 | sed -n '/^>/!{H;$!b};s/$/ /;x;1b;s/\n//g;p' \
 | sed 's/ /\n/' | pigz > ${BASE}_${COUNT}del.fasta.gz &

 wait
 done
done


Alignment

==> soap.sh <==
#!/bin/bash
REF=/path/to/soap/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
PWD=`pwd`
HEADER=header.txt
for FQZ in *fasta.gz
do
pigz -dc $FQZ \
| perl /usr/local/bin/fasta_to_fastq.pl - - \
| soap -p 8 -a /dev/stdin -D $REF -o /dev/stdout \
| soap2sam.pl -p - \
| cat $HEADER - \
| samtools view -uSh - \
| samtools sort - ${FQZ}_soap.sort
done
for BAM in *soap.sort.bam ; do samtools index $BAM & done
for BAM in *soap.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bwamem.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa mem -t 8 $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
done
for BAM in *_bwamem.sort.bam ; do samtools index $BAM & done
for BAM in *_bwamem.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bwaaln.sh <==
#!/bin/bash
REF=/path/to/bwa/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ | tee $FQ \
| bwa aln -t 8 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwaaln.sort
done
for BAM in *_bwaaln.sort.bam ; do samtools index $BAM & done
for BAM in *_bwaaln.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> bowtie2.sh <==
#!/bin/bash
REF=/path/to/bowtie2/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
pigz -dc $FQZ \
| bowtie2 -p8 -f -x $REF -U /dev/stdin -S ${FQ}.sam
( samtools view -uSh ${FQ}.sam \
| samtools sort - ${FQ}_bowtie2.sort
rm ${FQ}.sam ) &
done
wait
for BAM in *_bowtie2.sort.bam ; do samtools index $BAM & done
for BAM in *_bowtie2.sort.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> subread.sh <==
#!/bin/bash
REF=/path/to/subread/index/Arabidopsis_thaliana.TAIR10.23.dna_sm.genome.fa
for FQZ in *fasta.gz
do
FQ=`basename $FQZ .gz`
subread-align -T 8 --gzFASTQinput --BAMoutput -i $SUBREADREF \
-r $FQZ -o ${FQ}.subread_unsorted.bam
samtools sort ${FQ}.subread_unsorted.bam ${FQ}.subread &
done
wait
rm *unsorted.bam
for BAM in *subread.bam ; do samtools index $BAM & done
for BAM in *subread.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

==> star.sh <==
STARREF=/path/to/star/index/
for FQZ in *.fasta.gz
do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -fdk $FQZ > $FQ
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $STARREF --runThreadN 8 \
wait
rm $FQ
mv Aligned.out.sam ${FQ}.STAR_unsorted.sam
samtools view -uSh ${FQ}.STAR_unsorted.sam \
| samtools sort - ${FQ}.STAR
rm ${FQ}.STAR_unsorted.sam
done
for BAM in *STAR.bam ; do samtools index $BAM & done
for BAM in *STAR.bam ; do samtools flagstat $BAM > ${BAM}.stats & done
wait

Determine correct and incorrect reads

#!/bin/bash
for TAG in subread STAR bowtie bwaaln bwamem soap
do
        rm ${TAG}_alignment.txt
        for BAM in `ls *${TAG}*bam | grep -v '\.00'`
        do
        CORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
        | awk '$1==$5 && $6<($3+100) && $6>($2-100)' | wc -l`

        INCORRECT=`samtools view -q10 $BAM | tr ':-' '\t' \
        | awk '$1!=$5 || $6>($3+100) || $6<($2-100)' | wc -l`

        echo $BAM $CORRECT $INCORRECT >> ${TAG}_alignment.txt
        done &
done
wait