Wednesday, 21 November 2012

Bowtie alignment to a genome - single end

Recently I posted about BWA alignment of NGS data, so today I will demonstrate the use of Bowtie2 for aligning single end sequence data, for more information on advanced options, take a thorough look at the manpage. One of major strengths of Bowtie2, is that it's more tolerant for gaps than Bowtie1 and  BWA. Again you will need to have the reference genome and the raw fastq files in the otherwise empty directory.

The first thing we do is stipulate the name of the experiment and the name of the reference sequence.
Now index the genome with bowtie2-build:

bowtie2-build Hg19.fa Hg19.fa
Now to actually start the alignment for all the files with prefix "FC" and suffix "fq":

for FQ in `ls FC*fq | sed 's/.fq//'`
bowtie2 -x $REF -U ${FQ}.fq -S ${FQ}.sam -p 4 --met-file ${FQ}.met 2> ${FQ}.stats &
The script starts aligning all fq files to the human genome in parallel and each job uses 4 threads as specified by the "-p" option. Alignment output format is sam (not sai like BWA). It also outputs a file containing some alignment metrics as specified by the "--met-file" parameter. Once the sam files are generated, you can follow the SamTools procedure here to convert the data to bam format. You'll also notice that I've redirected the standard error (using "2>") to the stats file, which is generated automatically by bowtie once the alignment is complete. For more info on i/o redirection on Linux, there heaps of good tutorials out there (here, here, here).  Here is an example of what the stats file looks like coming from bowtie2 for an mRNA-seq data set:
20213469 reads; of these:
  20213469 (100.00%) were unpaired; of these:
    1714839 (8.48%) aligned 0 times
    12652875 (62.60%) aligned exactly 1 time
    5845755 (28.92%) aligned >1 times