My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. All that's OK, but in some organisms such as grasses (rice, barley, wheat, etc) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads which is definitely a bad idea in any NGS analysis pipeline.
To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) that consists of 3 datasets (SRR976171, SRR976172, SRR976173), clipped the adaptors off and mapped them to the genome with BWA aln then counted reads mapped to exonic regions uniquely (mapQ≥10).
There is an alternative to help with multimapped smRNA reads. The ShortStack pipeline uses uniquely placed reads in the vicinity of multimapped reads to estimate the number of tags that are originate from different multimapped regions.
|Table 2. ShortStack pipeline resolves multimapping smRNA reads with local unique alignment information.|
As you can see in the lower table, the local mapping information has helped to place 43.8% of reads that were previously ignored due to low mapQ. ShortStack also has an in-built miRNA gene identification tool that uses RNA folding info and read mapping positions to infer genes (Figure 1).
|Figure 1. Example of a miRNA gene identified from rice smRNA-seq data with ShortStack.|
From the 309,946 rice contigs identified, 94 were identified as small RNAs. Of the 94M mapped reads, MiRNA genes accounted for just 339,4147 reads (3.61%). If the majority of reads aren't miRNA, then what are they? I took a look at the top 10 contigs by assigned reads (Figure 2).
|Figure 2. Highest expressed rice contigs in smRNA-seq data.|
Just these top 10 contigs accounted for 36.6M reads (39% of mapped reads) and as you can see uniquely mapped reads represent just 39% of this figure. The counts determined by ShortStack could be used for differential expression analysis with GFold or DESeq
What I found really interesting was despite the very high expression of these contigs, their function remains more or less unexplored. There are few genes in these genomic regions that have any functional data ascribed, and there is no hint from the databases that these encode any type of non-coding RNA.
These data show that naive "uniquely" mapped read analysis is inappropriate for plants like rice that have a large proportion of multimapped smRNA reads. ShortStack was able to account for a much greater proportion of reads and was able to determine genomic regions containing highly expressed genes that likely have important but unexplored roles.