Tuesday, 27 November 2012

Evaluate length of sequence strings

If you work with sequence data often, you will sometimes need to look at the distribution of read lengths in a data set after quality and adapter trimming. Starting from a bam file, this can be done with samtools view, cutting out the sequence string and counting its length with awk. The resulting list of integers can then be piped to numaverage, a program from the num-utils package that reports the average, median or mode of a list of numbers.

To get the average length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage

To get the median length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -M

To get the mode length
samtools view sequence.bam | cut -f10 | awk '{ print length}' | numaverage -m

To get the shortest length
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort -n | head -1

To get the longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort -n | tail -1

To get the shortest and longest
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort -n | sed -n '1p;$p'

To get a full distribution
samtools view FC095_1.bam | cut -f10 | awk '{ print length}' | sort -n | uniq -c | sed 's/^[ \t]*//' | tr ' ' '\t'
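If numutils isn't installed on your system, a single awk pass can produce a quick summary of the same numbers. This is just a minimal sketch using the same bam file as above:
samtools view FC095_1.bam | cut -f10 | awk '{ len = length($0); sum += len; if (NR == 1 || len < min) min = len; if (len > max) max = len } END { print "min="min, "max="max, "mean="sum/NR }'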


Saturday, 24 November 2012

Paper of the week - Guthrie card methylomics

Nearly every baby born in Australia since 1970 has had a few drops of blood taken and stored on a so-called Guthrie card, and this practice is widely adopted in the developed world. As DNA analysis technologies become ever more sensitive and economical, these cards will become ever more important in the diagnosis of genetic disease and also in identifying genetic and epigenetic variations which contribute to complex disease. The paper I showcase today, from Beyan et al., describes the development of genome-wide assays for DNA methylation using methylation microarrays and methylcytosine immunoprecipitation followed by Illumina sequencing (MeDIP-Seq). The authors find differentially methylated regions which are stable from birth to 3 years of age.

The methodology is fairly novel, but the conclusions are a bit vague, and it would have been best to apply Guthrie card analysis to a specific disease. It would be really neat if they had analysed material from twins discordant for a complex disease, e.g. juvenile diabetes. As the cost of sequencing decreases and analysis pipelines become more streamlined, methods like arrays and MeDIP-Seq will become less attractive and will be overtaken by whole genome (oxo-)bisulfite sequencing for identification of methylcytosine and hydroxymethylcytosine sites.
Genome Res. 2012 Nov;22(11):2138-45. doi: 10.1101/gr.134304.111. Epub 2012 Aug 23.
Guthrie card methylomics identifies temporally stable epialleles that are present at birth in humans.
Beyan H, Down TA, Ramagopalan SV, Uvebrant K, Nilsson A, Holland ML, Gemma C, Giovannoni G, Boehm BO, Ebers GC, Lernmark A, Cilio CM, Leslie RD, Rakyan VK.
Source
The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, United Kingdom.
Abstract
A major concern in common disease epigenomics is distinguishing causal from consequential epigenetic variation. One means of addressing this issue is to identify the temporal origins of epigenetic variants via longitudinal analyses. However, prospective birth-cohort studies are expensive and time consuming. Here, we report DNA methylomics of archived Guthrie cards for the retrospective longitudinal analyses of in-utero-derived DNA methylation variation. We first validate two methodologies for generating comprehensive DNA methylomes from Guthrie cards. Then, using an integrated epigenomic/genomic analysis of Guthrie cards and follow-up samplings, we identify interindividual DNA methylation variation that is present both at birth and 3 yr later. These findings suggest that disease-relevant epigenetic variation could be detected at birth, i.e., before overt clinical disease. Guthrie card methylomics offers a potentially powerful and cost-effective strategy for studying the dynamics of interindividual epigenomic variation in a range of common human diseases.
PMID: 22919074 [PubMed - in process] PMCID: PMC3483543 [Available on 2013/5/1]

Wednesday, 21 November 2012

Bowtie alignment to a genome - single end

Recently I posted about BWA alignment of NGS data, so today I will demonstrate the use of Bowtie2 for aligning single end sequence data; for more information on advanced options, take a thorough look at the manpage. One of the major strengths of Bowtie2 is that it is more tolerant of gaps than Bowtie1 and BWA. Again, you will need the reference genome and the raw fastq files in an otherwise empty directory.


The first thing we do is stipulate the name of the experiment and the name of the reference sequence.
EXID=Experimentname
REF=Hg19.fa
Now index the genome with bowtie2-build:

bowtie2-build Hg19.fa Hg19.fa
Now to actually start the alignment for all the files with prefix "FC" and suffix "fq":

for FQ in `ls FC*fq | sed 's/.fq//'`
do
bowtie2 -x $REF -U ${FQ}.fq -S ${FQ}.sam -p 4 --met-file ${FQ}.met 2> ${FQ}.stats &
done
wait
The script starts aligning all fq files to the human genome in parallel, and each job uses 4 threads as specified by the "-p" option. The alignment output format is sam (not sai like BWA). It also outputs a file of alignment metrics as specified by the "--met-file" parameter. Once the sam files are generated, you can follow the SamTools procedure here to convert the data to bam format. You'll also notice that I've redirected the standard error (using "2>") to the stats file, which is generated automatically by bowtie once the alignment is complete. For more info on i/o redirection on Linux, there are heaps of good tutorials out there. Here is an example of what the stats file looks like coming from bowtie2 for an mRNA-seq data set:
20213469 reads; of these:
  20213469 (100.00%) were unpaired; of these:
    1714839 (8.48%) aligned 0 times
    12652875 (62.60%) aligned exactly 1 time
    5845755 (28.92%) aligned >1 times
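If you'd rather push straight on to sorted bam files here, a minimal sketch using the same classic samtools syntax used elsewhere on this blog (file names follow the loop above) would be:
for FQ in `ls FC*fq | sed 's/.fq//'`
do
samtools view -bhS ${FQ}.sam | samtools sort - ${FQ} &
done
wait
The "-" tells samtools sort to read from standard input, and the final argument is the output prefix, so you end up with a sorted ${FQ}.bam for each data set.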


Monday, 19 November 2012

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of colossal data sets, here's something you'll find useful: a parallel bzip2 compressor called PBZIP2. OK, so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2  -p8 FC001_sequence.fq
To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:
pbzip2  -dk -p10 FC001_sequence.fq.bz2
To test the integrity of a compressed file:
pbzip2  -t FC001_sequence.fq.bz2

How to compress an entire directory:

tar cv directory | pbzip2 > directory.tar.bz2


Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [http://compression.ca]
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filename> <filename2> <filenameN>
 -1 .. -9        set BWT block size to 100k .. 900k (default 900k)
 -b#             Block size in 100k steps (default 9 = 900k)
 -c,--stdout     Output to standard out (stdout)
 -d,--decompress Decompress file
 -f,--force      Overwrite existing output file
 -h,--help       Print this help message
 -k,--keep       Keep input file, don't delete
 -l,--loadavg    Load average determines max number processors to use
 -m#             Maximum memory usage in 1MB steps (default 100 = 100MB)
 -p#             Number of processors to use (default: autodetect [32])
 -q,--quiet      Quiet mode (default)
 -r,--read       Read entire input file into RAM and split between processors
 -t,--test       Test compressed file integrity
 -v,--verbose    Verbose mode
 -V,--version    Display version info for pbzip2 then exit
 -z,--compress   Compress file (default)
 --ignore-trailing-garbage=# Ignore trailing garbage flag (1 - ignored; 0 - forbidden)
Example: pbzip2 -b15vk myfile.tar
Example: pbzip2 -p4 -r -5 myfile.tar second*.txt
Example: tar cf myfile.tar.bz2 --use-compress-prog=pbzip2 dir_to_compress/
Example: pbzip2 -d -m500 myfile.tar.bz2
Example: pbzip2 -dc myfile.tar.bz2 | tar x

BWA alignment to a genome - single end

Over the last few posts, we've discussed various ways to analyse the quality of a sequencing run, and curate the data sets in terms of demultiplexing, quality trimming and adapter clipping. I guess the next step is alignment. There are an increasing number of aligners out there for short read sequencing (listed here), but currently the most popular choices are BWA and Bowtie2. BWA is a quick and accurate tool, but might not be the best for gapped alignments (common in mRNA-Seq). Today I will demonstrate how to align short reads (single end) with BWA and convert the alignment to bam format with Samtools. The things you will need in the otherwise empty directory are:

  1. Reference genome in fasta format
  2. Sequence data in fastq format (.fq)
The first thing to do is to give the experiment a name and index the genome with BWA. The indexing step could take about 3 hours for a human-sized genome.

EXID=Experimentname
REF=Hg19.fa
bwa index -a bwtsw $REF
Next, use BWA to align. All of our fastq data sets have file names that begin with the prefix "FC" and end in the suffix "fq", so I use that to specify the input sequences. Also note that I've specified 4 threads; if you are working on a big cluster, you can use more cores by increasing the number of threads per alignment or by sending jobs to the background with the ampersand "&" and the "wait" command.
for i in `ls FC*.fq | sed 's/.fq//' `
do
bwa aln -t 4 $REF ${i}.fq > ${i}.sai
done
Next, we need to get the alignment into sam format using the samse command. This step can't be multi-threaded, but as you can see below, I have used the ampersand and "wait" command to send the samse jobs to the background and then wait until they're finished before proceeding to the next step.
for i in `ls  FC*.fq | sed 's/.fq//' `
do
bwa samse $REF ${i}.sai ${i}.fq > ${i}.sam &
done
wait
Next, we will convert the format to binary sam (bam) format with Samtools.
for l in `ls *.sam | sed 's/.sam//'`
do
samtools view -bhS ${l}.sam > ${l}.unsorted.bam &
done
wait
Now, put the bam files into sorted order and discard the unsorted versions.

for m in `ls *.unsorted.bam | sed 's/.unsorted.bam//'`
do
samtools sort ${m}.unsorted.bam ${m} &
done
wait
rm *unsorted*
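It's also worth indexing the sorted bam files at this point, since many downstream tools (genome browsers, region queries with samtools view) expect an index file alongside each bam:

for n in `ls *.bam | sed 's/.bam//'`
do
samtools index ${n}.bam &
done
wait

This writes a .bam.bai index next to each sorted bam file.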

Now that you have alignments in bam format, check the number of reads which have successfully mapped to the reference.

for o in `ls *.bam | sed 's/.bam//'`
do
samtools flagstat ${o}.bam | grep -v ^0 > ${o}.stats &
done
wait

Once you have confirmed that the alignment has worked, clean up some of the intermediate files.

rm *.sai *.sam

That wasn't too hard was it?

Have a look at some other approaches here.

Tuesday, 13 November 2012

Paper of the week - Next gen sequencing of ancient DNA

This paper was featured in the 12 October issue of Science, so it really isn't from this week; nevertheless, I thought it would be relevant given the previous post on library preparation.

Sequencing ancient DNA is a hugely challenging task. Not only is it very difficult to get any sort of yield of DNA from bones tens of thousands of years old, but the DNA itself is normally degraded to such an extent that conventional library preparation is highly inefficient. On top of this, there is the challenge of eliminating environmental contamination. To get around much of this, the team, led by Matthias Meyer at the Max Planck Institute for Evolutionary Anthropology in Leipzig, came up with a simple but efficient method to generate sequencing libraries from single stranded DNA. The basic steps in the library prep are:

  • Dephosphorylate
  • Heat denature
  • Ligate single stranded biotinylated adaptors
  • Immobilize on streptavidin beads
  • Generate the second strand with DNA polymerase
  • Ligate a second adaptor by blunt end ligation

Figure 1 from Meyer et al. illustrates the new library preparation strategy.

The main benefit of this prep over standard techniques is the hugely increased yield (over six-fold more library), as well as the ability to accurately identify the ends of the molecules and analyse strand breakage. The authors suggest that strand breaks occur at a higher frequency near guanines. On the downside, I foresee some bias introduced by the ligation steps: single stranded ligation of microRNA to adaptors in small RNA library preps with T4 RNA ligases is known to be a biased process (paper, paper). Nevertheless, at high coverage levels and with long reads (150 nt), the bias shouldn't necessarily lead to false conclusions in genome sequencing.

I see this library preparation method having a big impact in a few ways outside of the molecular archaeology field. I think it will make it easier to analyse degraded DNA from diverse sample types such as formalin-fixed paraffin-embedded tissue, dried blood spots, forensic samples and environmental DNA (e.g., from soil).

The other reason I think this is big news is that it could be useful in the field of epigenetics. Generating libraries for shotgun bisulfite sequencing requires micrograms of starting material to get just a tiny amount of library, because the bisulfite treatment often leaves the DNA in a highly degraded state consisting mostly of single strands. Moreover, methods like ChIP-Seq often need to generate quality libraries from just a few nanograms of DNA, and if methods like this one from Meyer can improve yield, then that is a big win.

In addition to this new library prep method, the paper is well worth a read for its insights into human evolution over the last million years. Enjoy!

[Edit: the very detailed supplementary material is well worth a read!]

Science. 2012 Oct 12;338(6104):222-6. doi: 10.1126/science.1224344. Epub 2012 Aug 30.
A high-coverage genome sequence from an archaic Denisovan individual.
Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prüfer K, de Filippo C, Sudmant PH, Alkan C, Fu Q, Do R, Rohland N, Tandon A, Siebauer M, Green RE, Bryc K, Briggs AW, Stenzel U, Dabney J, Shendure J, Kitzman J, Hammer MF, Shunkov MV, Derevianko AP, Patterson N, Andrés AM, Eichler EE, Slatkin M, Reich D, Kelso J, Pääbo S.
Source
Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, D-04103 Leipzig, Germany. mmeyer@eva.mpg.de
Abstract
We present a DNA library preparation method that has allowed us to reconstruct a high-coverage (30×) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of "missing evolution" in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.
PMID: 22936568 [PubMed - indexed for MEDLINE]

Friday, 9 November 2012

Data set curation - quality trimming and adapter clipping

After demultiplexing (covered in the last post), you'll need to perform a few other steps before aligning Illumina sequence data to the reference genome; primarily, these are quality filtering and adapter clipping. These may not be very important for short read data, but they are pretty important when working with long reads, where the quality starts to drop off and the read might contain some adapter sequence.

Quality filtering can be done in a few ways: by filtering out entire reads which have poor base quality scores, by converting poor quality base calls to "N", or by hard trimming reads to a certain length before the Q scores start to rapidly decline. I'd much rather use a quality-based trimming tool which starts at the 3' end of the read and removes bases below a certain Q threshold. This can be done using fastq_quality_trimmer on the command line or in Galaxy. You set the quality threshold you want to use, in this case Q30, as well as the minimum length of sequence to keep, which we have set at 37 nt.


fastq_quality_trimmer -t 30 -l 37 -i dataset.txt -o dataset_Q30.txt 

Now to remove the adapter sequence. There are a few software options (Trimmomatic, cutadapt, among others), but we again chose the fastx toolkit for this example. Keep in mind that the adapter below is from the Illumina TruSeq genomic DNA kit and will be different for other library preparations. The "-l 37" parameter is the length of the shortest read to keep, so any read shorter than this is discarded; you can tailor this to your specific needs.

fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATC -l 37 -i dataset_Q30.txt -o dataset_Q30_clip.txt


One thing I need to mention is that the above will work really well for fastq-illumina encoded data, but might not work for fastq-sanger, which uses different quality score characters (an ASCII offset of 33 rather than 64). This incompatibility has tripped up a lot of people in the forums and can be solved by adding -Q33 as an option, as shown below.
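For example, with sanger-encoded fastq the trimming command from above becomes:
fastq_quality_trimmer -Q33 -t 30 -l 37 -i dataset.txt -o dataset_Q30.txt
The same -Q33 option works for fastx_clipper and the other fastx toolkit programs.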

Now that the sequence data is trimmed for bad bases and adapter contamination, we can start the analysis!

Thursday, 8 November 2012

Demultiplexing Illumina Sequence data

Demultiplexing is a key step in many sequencing-based applications, but it isn't always necessary, as the newer Illumina pipeline software provides demultiplexed data as standard. But if you need to do this yourself, here is an example using the fastx_toolkit, designed for sequence data with a 6 nt barcode (Illumina barcode sequences 1-12). After a run, the Genome Analyzer software provides sequence files like this for read 1 (the insert sequence):
FC123_1_1_sequence.txt
And for the barcode/index read:
FC123_1_2_sequence.txt
So here goes:
#Enter dataset parameters 
FC='FC123 FC124'
LANES='1 2 3 4 5 6 7 8'
#Create the bcfile
echo 'BC01_ATCACG     ATCACG
BC02_CGATGT     CGATGT
BC03_TTAGGC     TTAGGC
BC04_TGACCA     TGACCA
BC05_ACAGTG     ACAGTG
BC06_GCCAAT     GCCAAT
BC07_CAGATC     CAGATC
BC08_ACTTGA     ACTTGA
BC09_GATCAG     GATCAG
BC10_TAGCTT     TAGCTT
BC11_GGCTAC     GGCTAC
BC12_CTTGTA     CTTGTA' > bcfile.txt
for flowcell in $FC
do
for lane in $LANES
do
paste ${flowcell}_${lane}_1_sequence.txt ${flowcell}_${lane}_2_sequence.txt \
| tr -d '\t' | fastx_barcode_splitter.pl --bcfile bcfile.txt --prefix ${flowcell}_${lane}_ --suffix .txt --eol &
done
done
wait
So you can see that we start by pasting read 1 and the index read side-by-side and pipe that straight into the fastx_barcode_splitter script, which will demultiplex the datasets by the 12 barcodes specified in the bcfile. If there are any lines missing from either read 1 or the index read, the paste step will pair up the wrong reads, so it's worth checking that both files have the same number of lines first (see the sanity check at the end of this post). It will run each lane in parallel, so be sure to use a computer with plenty of processors. For example, in the above script, I've specified all 8 lanes on 2 flow cells, so I will be using 16 processors. OK, so we've demultiplexed, and now we need to trim off the 6 nt barcode.
for dataset in `ls FC*BC*.txt | sed 's/.txt//'`
do
fastx_trimmer -t 6 -i ${dataset}.txt -o ${dataset}_trim.txt &
done
wait
The fastx_trimmer will remove the 6 nt barcode from the end of each sequence and output files with the suffix "_trim.txt". It will process all files which start with FC, contain BC and end with .txt, which covers all the unambiguously demultiplexed datasets. Use caution with the "&", as it sends many jobs into the background; if you're not working on a big server, you might crash the computer.
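As mentioned above, the paste trick assumes read 1 and the index read are perfectly in sync, so it's worth a quick sanity check that the line counts match before demultiplexing. A minimal sketch using the variables defined above:
for flowcell in $FC
do
for lane in $LANES
do
wc -l ${flowcell}_${lane}_1_sequence.txt ${flowcell}_${lane}_2_sequence.txt
done
done
If the two counts for a lane differ, fix the input files before running the barcode splitter.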


Friday, 2 November 2012

We're not alone: the genomics bioinformatics blogosphere

Go forth and explore!

http://seqanswers.com/

http://plindenbaum.blogspot.com.au/

http://gettinggeneticsdone.blogspot.com.au/

http://kevin-gattaca.blogspot.com.au/

http://core-genomics.blogspot.com.au/

http://massgenomics.org/

http://www.michaeleisen.org/blog/

http://nextgenseq.blogspot.com.au/

http://thepersonalgenome.com/

http://omicsomics.blogspot.com.au/

http://bioinformaticsblog.org/

http://bioinformatics.whatheblog.com/

http://blog.openhelix.eu/

http://www.rna-seqblog.com/

http://thegenomefactory.blogspot.com.au/

http://bcbio.wordpress.com/

http://flxlexblog.wordpress.com/

http://jermdemo.blogspot.com.au/

http://nsaunders.wordpress.com/

http://manuelcorpas.com/


Quality control of sequence datasets

Before investing a lot of time in the analysis of a sequencing run, it is ALWAYS a good idea to ensure that your sequence data quality is up to scratch. If you have a high proportion of low quality bases in your dataset, you're likely to get many spurious alignments, which can cause problems for virtually all NGS applications, from mRNA-seq through to SNP detection and de novo assembly.

There are two main types of QC analysis for sequencing runs: the first describes the quality of the raw fastq file, and the second describes the quality of the alignment (sam/bam format). Let's begin with the simple fastq file analysis and we'll cover the alignment QC at a later stage.

The fastq file format has become the standard raw data format for high throughput sequence datasets. Each base is given a Phred quality score represented as an ASCII character; the higher the Q score, the more confidence you can have in the identity of the base. But there are other things to watch out for: self-ligated adaptor contamination, sequence duplication levels (low diversity), GC content bias, over/under-representation of certain motifs (k-mers) and the presence of other artifacts.

For this type of analysis, there are a growing number of software options available and I'll go through just a few.

Fastx_quality_stats

Part of the popular Fastx toolkit, the fastx_quality_stats script gives an overview of the quality scores on a cycle-by-cycle basis. This is a really quick way to look at the general quality scores for a dataset before proceeding with alignment. It provides these useful stats:
mean = Mean quality score value for this cycle
Q1 = 1st quartile quality score
med = Median quality score
Q3 = 3rd quartile quality score
IQR = Inter-Quartile range (Q3-Q1)
It also gives a count of each type of base called at each cycle. Version 0.0.13 also provides separate statistics for each of the four bases. One drawback of this package is that you don't get to see quality scores by location; for instance, you won't be able to detect whether certain tiles on the flow cell are problematic. The Fastx toolkit is designed to be run on the command line and is also commonly run from the Galaxy GUI (visit galaxyproject.org for further info). The command line scripts require installation of a few dependencies, which can be tricky but worthwhile. Here is an example of running fastx_quality_stats for an example file called "dataset.txt". The stats output will be saved to a file "dataset_Qstats.txt".
fastx_quality_stats -i dataset.txt -o dataset_Qstats.txt
The fastx toolkit has a bunch of other scripts for performing quality trimming and adapter clipping which are extremely useful.
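If you'd rather look at a picture than a table of numbers, the toolkit also includes a boxplot script that plots the stats file generated above (it relies on gnuplot being installed):
fastq_quality_boxplot_graph.sh -i dataset_Qstats.txt -o dataset_Qboxplot.png -t "dataset"
This draws a per-cycle quality boxplot as a PNG image.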



SolexaQA

Based in perl, SolexaQA is another option for quality analysis and read trimming. Its functionality is similar to the Fastx toolkit, but it provides quality-per-tile statistics on top of the quality-per-cycle function. Installation is as easy as unzipping the perl scripts from the download page and following the installation instructions. One thing I really like is that it can sub-sample the data sets to give a very quick impression of the quality across tiles and cycles. If you want the graphical outputs, there might be some hassle getting the dependencies installed, but the finished product is worth it in my view.

Example output from SolexaQA
Check out the paper in BMC Bioinformatics.

Example running the SolexaQA script:
perl SolexaQA.pl dataset.txt

NGS QC Toolkit

NGS QC Toolkit is a suite of perl scripts which are run on the command line using Linux or Windows. Like the Fastx toolkit, it requires installation of some dependencies. When you do get it running, you get the benefit of parallelisation - running one job over as many processors as you have - a major advantage if you're working with huge data sets on a large computing cluster. The IlluQC.pl script produces quality measures on a per-cycle basis like the Fastx toolkit, but also provides a GC content distribution profile and a quality score distribution, and gives an indication of how many adapter reads are present. Documentation is fairly good and the authors provide plenty of example commands, which I found helpful.


Per-cycle quality for two Illumina datasets.

Check out the PLoS One paper for all the features. Here is an example of analysing single-end Illumina sequence datasets on one processor.

perl IlluQC.pl -se dataset.txt 1 A

"-se" stipulates single-end data set
"1" stipulates the primer/adaptor library used (1 = Genomic DNA/Chip-seq Library)
"A" stipulates automatic detection of the fastq type (either fastq illumina or fastq sanger)

Running in parallel, one can try the following:
perl IlluQC_PRLL.pl -c 20 -se dataset.txt 1 A

"-c" denotes the number of parallel processors to use


FastQC

Running in Java, FastQC can be used in interactive mode for a few datasets or non-interactively when you want to process a whole directory of files. It provides many of the metrics found in the NGS QC Toolkit, such as per-cycle quality stats and %GC distribution, but also detects highly duplicated sequences and motifs/k-mers which could originate from library contamination or self-ligated adapter carry-over. FastQC has been adopted by many users on Galaxy and can be found in the Galaxy Toolshed.
Example of a per-cycle quality profile using FastQC
As for usability, it runs in Java so it should work on all common operating systems once you have the Java Runtime Environment (JRE) installed. The installation documentation is very good and the authors even have a video tutorial. You can run it non-interactively by entering the following on the Unix command line when you are in a directory of sequence files. Here is an example for a single fastq file:
perl fastqc dataset.txt
or for a range of files:
perl fastqc dataset[1-10].txt

SeqQC

Running on Windows and Linux, SeqQC is a command line tool which does the basics (per-cycle quality, nucleotide distribution) and goes the extra mile, detecting adapter sequences, over-represented motifs/k-mers, and low complexity repeats. You need to register to get the download link and enter a password to unzip the file, which is completely unnecessary in my view. On Linux, the executable opens an interactive command line, which is probably nice for newbies, but I found it a bit clunky as I'm used to running non-interactive pipelines.

Example SeqQC outputs.

RSeQC

This is a suite of python scripts, making them fairly easy to install (on Linux) once you have python, R, gcc and numpy installed. Following the install documentation is fairly easy and there is a comprehensive online manual. RSeQC provides a few handy tools for analysis of quality scores, nucleotide frequency, duplication frequency and GC distribution. Most of the scripts, such as read_quality.py, require an input sam/bam file, so a conversion is necessary before analysis. Have a look at the paper on RSeQC published in Bioinformatics.
Example of quality analysis result of RNA-seq by RSeQC.
Here is an example of running RSeQC for a bam file:
read_quality.py -i example.bam -o output

TileQC

Relatively few tools are able to analyse quality in terms of the location of a read on an Illumina flowcell. TileQC can identify specific issues such as bubbles, flares, flowcell peeling, leaky manifolds, oil spills and bleached squares, which assists in troubleshooting. It uses SQL and R to rapidly process data and produce attractive graphs like the one below. While R and SQL are really powerful tools, they are also unnecessarily complicated to get running compared to the tools above. Luckily there's thorough documentation to guide you up the learning curve. Here is the link to the paper in BMC Bioinformatics.

Examples of bad tiles using TileQC


NGSQC

This suite of python-based scripts is fairly unique in that it can perform quality-by-cycle and quality-by-tile/location analysis for not just Illumina sequence data, but also SOLiD data. Surprisingly, dependencies include bowtie. The paper was featured in BMC Bioinformatics.
Example overview of a full SOLiD slide run

Picard

Picard is a suite of Java-based command line tools with similar capabilities to SamTools, in that it can view, manipulate and analyse sam and bam files. It has some useful commands for quality score analysis, but these require a sam/bam file, so fastq data needs to undergo conversion with the FastqToSam tool followed by MeanQualityByCycle and QualityScoreDistribution.
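As a rough sketch of what that chain looks like with the standalone Picard jars (the jar locations and sample name here are assumptions, so adjust them to your install):
java -jar FastqToSam.jar FASTQ=dataset.txt OUTPUT=dataset.bam SAMPLE_NAME=dataset QUALITY_FORMAT=Illumina
java -jar MeanQualityByCycle.jar INPUT=dataset.bam OUTPUT=dataset_mqbc.txt CHART_OUTPUT=dataset_mqbc.pdf
java -jar QualityScoreDistribution.jar INPUT=dataset.bam OUTPUT=dataset_qsd.txt CHART_OUTPUT=dataset_qsd.pdf
The first command wraps the fastq reads in an unaligned bam, and the next two produce the per-cycle mean quality and the overall quality score distribution, each with a pdf chart.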

Summary

The QC approach you need depends very much on the type of experiment you're doing. If you're doing de novo assembly of long reads, you may need a higher stringency of quality checking than if you were doing short-read mRNA or ChIP-seq. At a very minimum, you will need to look at the quality-per-cycle metrics and the presence of adapter contamination. FastQC is probably the most widespread QC suite in use today, because it's not overly complicated to install or use locally or on Galaxy, is compatible with most platforms and produces graphical, intuitive outputs. I was pleasantly surprised by the comprehensive output of the NGS QC Toolkit, and having it run in parallel mode is a big win for those like me who don't want to spend all day on such mundane tasks. If you're a facility operator, it's likely that you'll need to master a combination of these, including ones which can look at quality-by-tile such as SolexaQA, TileQC or NGSQC. On the other hand, if you are a casual NGS user with a Windows computer, you might find SeqQC useful.


Have you used another sequence QC program? Let us know your thoughts.