Quick alignment of microRNA-Seq data to a reference

August 22, 2013

In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc).

MicroRNA datasets are rather a bit different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the below script that several steps are done without writing a single intermediate file:

Adapter clipping
Sequence collapsing
BWA alignment
Bam conversion
Sorting bam file

for FQ in *fq
do
fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
| fastx_collapser \
| bwa aln -t 30 $REF - \
| bwa samse $REF - ${FQ} \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
done
wait
for bam in *bam ; do samtools index $bam & done ; wait
for bam in *bam ; do samtools flagstat $bam > ${bam}.stats & done ; wait

Using this method, the fastq is collapsed, meaning the sequences are now in a format which looks like this:

>1-14625

TGTAAACATCCTCGACTGGAAGCT

>2-11786

TGAGGTAGTAGGTTGTATAGTT

>3-8897

GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCTTGG

>4-8495

TTTGGCAATGGTAGAACTCACACT

>5-7084

CAACGGAATCCCAAAAGCAGCTG

Where the sequence name now consists of a single number (unique identifier) as well as the number of times that exact sequence string was identified. Using this technique, I was able to align 12 data sets totaling 32 million reads in about 3 minutes. This is how the bam file looks:

36725-1 0 chr2 59549323 0 33M * 0 0 CCATACCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:33

64660-1 0 chr2 59549328 0 28M * 0 0 CCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:28

6064-8 0 chr2 59549329 0 24M * 0 0 CACCCTGAACGCGCCCGATCTCGT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:24

11925-3 0 chr2 59549331 0 28M * 0 0 CCCTGAACGCGCCCGATCTCGTCTGATC * XT:A:R NM:i:0 X0:i:31 XM:i:0 XO:i:0 XG:i:0 MD:Z:28

23045-1 16 chr2 59549331 0 24M * 0 0 CCCTGAACGCGCCCGATCTCGTCT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:24

I will cover the differential analysis aspect in another post. If you have tips for microRNA-Seq analysis I'd be glad to hear.

Search This Blog

Genome Spot

Quick alignment of microRNA-Seq data to a reference

Popular posts from this blog

Mass download from google drive using R

Data analysis step 8: Pathway analysis with GSEA

Extract properly paired reads from a BAM file using SamTools