Sunday, 23 December 2012

Genome biology highlights for 2012

We are coming to the end of a big year in the field of genome science, one which was dominated by the swarm of papers from the ENCODE consortium. (The number of ENCODE articles, mostly in Nature, has reached 92 and shows no sign of slowing down.) If you haven't had time to read them all, you can do so here.

Apart from ENCODE, there has been a quite astonishing pace in genomic studies. Here are just a few highlights from me:
  • The 1000 Genomes Project published its main findings in Nature (link).
  • Non-invasive prenatal genome sequencing was described in PNAS (link), triggering a debate about the ethics of genomic studies of the unborn.
  • Single cell sequencing techniques are improving with better methods of amplification, even allowing the sequencing of 99 individual sperm (link), giving new insights into patterns of recombination and opening new avenues for IVF testing.
  • Metagenomics analysis takes off. Whether it's environmental samples from the sea, hot springs or soil (JGI website) or even microbiomes of humans, insects, plants - there are new microbes being discovered by the day.
  • Big chunks of the wheat genome are finished thanks to advances in chromosomal flow sorting and sequencing techniques (link).
  • The barley genome sequence and a comprehensive transcriptome analysis were described in Nature (link).
  • Ribo-Seq goes mainstream with novel applications of ribosome profiling including identification of novel viral encoded proteins (link).
On the instrument side, the upgrade to the Illumina HiSeq platforms allowing rapid runs of 2 x 150 bp in 27 hrs is something that has our lab fairly excited. Life Technologies started shipping the Proton sequencer, superseding the Ion Torrent. On the whole, 3rd gen systems have failed to overtake the mature players, while Helicos is more or less broke.

What have been your top 5 papers of the year?

Merry Christmas!

Friday, 14 December 2012

Paper of the week - Cooperative epigenetic effect of TETs and OGT

There have been a number of high profile articles in recent times discussing the function of TET proteins, mostly in the conversion of methylated cytosine (5mC) into hydroxymethylated cytosine (5hmC), the 5th base. Hydroxymethylcytosine is much rarer than methylcytosine and is thought to be an intermediate towards demethylation of cytosine, a mechanism which remains incompletely resolved. A paper last year showed that TET proteins also convert 5hmC to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC), termed the 6th and 7th bases.

OGT on the other hand is a fairly unique protein because it is the only known O-GlcNAc transferase in mammals. What is GlcNAc, you say? It stands for N-acetylglucosamine, a hexosamine. There has been a series of papers (here, here, here) discussing OGT as a nutrient sensor, transferring GlcNAc during periods of surplus nutrient supply. GlcNAc can be transferred to the same amino acids as phosphorylation, so there is a suggested crosstalk between the two mechanisms (link), especially working in concert with AMPK, another sensor of cell energy supply (link). The reason I'm going into this is that elevated GlcNAc has been linked to insulin resistance, a state of pre-diabetes - as has AMPK.

New data presented by Chen et al in Nature radically alter the assumed function of OGT, showing that it associates with TET2 and TET3 on the CpG island-containing promoters of genes. It appears that without TET proteins, OGT can't localise to the chromatin and modify histones (H2B S112). Histone GlcNAcylation has previously been characterised as an activating histone mark, but this may partially be due to the cooperative effect of TET proteins performing modifications to methylcytosine. The data in this study were generated from mouse embryonic stem cells, but we think that the role of OGT on the chromatin could be important in the control of gene expression in the onset of pre-diabetes.

TET2 promotes histone O-GlcNAcylation during gene transcription.
by Chen Q, Chen Y, Bian C, Fujiki R, Yu X
Nature. 2012 Dec 9;

Ten eleven translocation (TET) enzymes, including TET1, TET2 and TET3, convert 5-methylcytosine to 5-hydroxymethylcytosine and regulate gene transcription. However, the molecular mechanism by which TET family enzymes regulate gene transcription remains elusive. Using protein affinity purification, here we search for functional partners of TET proteins, and find that TET2 and TET3 associate with O-linked β-N-acetylglucosamine (O-GlcNAc) transferase (OGT), an enzyme that by itself catalyses the addition of O-GlcNAc onto serine and threonine residues (O-GlcNAcylation) in vivo. TET2 directly interacts with OGT, which is important for the chromatin association of OGT in vivo. Although this specific interaction does not regulate the enzymatic activity of TET2, it facilitates OGT-dependent histone O-GlcNAcylation. Moreover, OGT associates with TET2 at transcription start sites. Downregulation of TET2 reduces the amount of histone 2B Ser 112 GlcNAc marks in vivo, which are associated with gene transcription regulation. Taken together, these results reveal a TET2-dependent O-GlcNAcylation of chromatin. The double epigenetic modifications on both DNA and histones by TET2 and OGT coordinate together for the regulation of gene transcription.
PMID: 23222540 [PubMed - as supplied by publisher]

Sunday, 9 December 2012

List of NGS aligners

The folks at EBI have produced a summary of all the high throughput sequence aligners in common use (link), and there is a companion paper in the current issue of Bioinformatics. The timeline is a fascinating infographic.
There is also a table describing the features of each. One thing I noticed is that there are relatively few aligners for RNA data, despite RNA-seq being arguably the most important application in NGS right now. (I will cover the use of RNA specific aligners in future posts.) The paper discusses some of the generalities of selecting an appropriate alignment tool for the job but doesn't go so far as to suggest which one is "best".

Illumina iGenomes

In an effort to make NGS data analysis more accessible, Illumina have provided packages of pre-indexed genomes and annotation files in one bundle. For instance you can get the Bowtie/TopHat indexes here for a range of organisms.
The iGenomes are a collection of sequence and annotation files for commonly analyzed genomes. Each iGenome contains data for one species, downloaded from one source (UCSC, NCBI, or Ensembl), for one genomic build.
Read here for more info.

Align RNA-seq reads to genome with TopHat

RNA-seq data analysis poses some interesting bioinformatic challenges. One of the most important considerations is the type of reference sequence to be used. If there is a whole genome sequence (WGS), then the WGS is the most appropriate reference, even if it is not entirely annotated. If there is no WGS, then you may need to use a curated cDNA library such as RefSeq, UniGene, or an organism-specific collection. There are a few reasons why aligning to the genome is the most correct method:
  1. MapQ scores mean something - cDNA collections like RefSeq often have splice variants entered as different entries. This means that some variants will have common exons. Reads aligning to those exons will be assigned a mapQ of zero!
  2. Discovery - many of the cDNA sequence collections are incomplete, including RefSeq. Compare the number of entries from GENCODE to RefSeq and you will see that GENCODE is far richer in terms of number of known transcripts. Moreover, if you are aligning to a genome, you can perform something of an ab initio analysis and identify previously uncharacterised regions of high coverage.
  3. Accurate splicing analysis - alignment to genome allows for splicing analysis by quantifying expression on an exon-by-exon approach or by quantifying exon-exon junctions. 
The problem with RNA-seq is that we are trying to align short reads to a really large reference sequence, and that splicing introduces the possibility of huge gaps (introns) in the alignment. TopHat was developed to perform ab initio splice junction analysis on mammalian-sized genomes. It is optimised for reads of about 75 nt or longer. Check out the paper, which is well worth a read, here.

TopHat uses the bowtie2 aligner, so you will need to have that installed too. For my Ubuntu installation, I also needed to install
  • "build-essential" which contains g++ libraries; can be done with "apt-get install build-essential"
  • "boost" which had to be set up manually (link)
After that, it was fairly straight-forward to download TopHat binaries to somewhere like /usr/local/bin, unpack it and export the contents to path (export PATH=$PATH:/usr/local/bin/tophat-2.0.6.Linux_x86_64/).

First step is to build the reference index with bowtie, in this case the human build Hg19.
bowtie2-build -f hg19.fa hg19
If you can't wait for the build, then you could try downloading the indexes supplied by Illumina here. Before we go ahead and start the alignment, check again that your sequence is high quality and free of adapter contamination. See some of my past posts for help (QC, Quality filtering, adapter clipping). TopHat's parameters are many, including setting the maximum number of mismatches or gaps for successful alignments, setting the expected intron size range, setting rules for handling single and paired end data, and the ability to use strand-specific alignment. Like Bowtie2, there are also some "pre-set" parameter profiles ranging from "very fast" to "very sensitive". You can also provide TopHat a GTF file (genome annotation file, try GENCODE). In that scenario, TopHat will preferentially align to exon regions before searching for exon junctions and then to other genomic regions. TopHat also has extensive fusion transcript identification tools which I'll discuss in depth in a future post. There is a nearly infinite combination of parameters to try, but here is an example using the GTF information for single end Illumina data:
tophat -p8 -o out_sample1 -G known_genes.gtf \
--transcriptome-index=transcriptome_data/known \
hg19 sample1_1.fq
One thing I really like about TopHat is that it automagically does many of the mundane tasks that are usually done by samtools. This means that we can go from fastq to bam in one command, as opposed to the 5-step pipeline I outlined in a previous post using BWA. The aligned and unaligned reads are stored in separate bam files, which differs from some other pipelines such as BWA.

In summary, TopHat is probably the most sophisticated RNA-Seq aligner in use, although there are others on the market including GSNAP, MapSplice and SpliceMap. Next, we will try Cufflinks for downstream analysis of RNA-seq.

TopHat further reading:
TopHat paper
TopHat on BioWulf
RNA aligners reviewed

Thursday, 6 December 2012

UNIX utility: split

Even the simplest types of analysis (e.g. counting lines) can take a really long time with the colossal data sets coming from sequencing experiments these days. Parallelisation of jobs over multiple processors/cores/threads can make things faster, but the standard unix tools we commonly use don't generally have the in-built capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your jobs.

One of the most basic parallelisation tricks is to split a file into smaller fragments, process those smaller files in parallel and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines; another option is the split command, which will divide the file by a specified number of lines.

Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:
split -l 1000000 sequence.fq sequence.fq.frag.
The output files will be named sequence.fq.frag.aa, sequence.fq.frag.ab, sequence.fq.frag.ac and so on.

Be careful: fastq sequence files have an unusual structure of 4 lines per sequence, rather than the one-line-per-entry layout of formats like sam, so you want to make sure the number of lines per file is a multiple of 4. Now you're free to run batches of smaller jobs over many processors.
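To make the split-then-merge idea concrete, here is a minimal sketch (the toy fastq and file names are made up for illustration) that splits a fastq into read-aligned chunks, counts the reads in each chunk as a background job, then sums the counts:

```shell
# Toy fastq of 6 reads (24 lines); real data would be far larger.
for i in $(seq 1 6); do
  printf "@read%s\nACGT\n+\nIIII\n" "$i"
done > sequence.fq

# 8 lines per chunk = 2 reads; always use a multiple of 4 for fastq.
split -l 8 sequence.fq sequence.fq.frag.

# Process each chunk in parallel (here, just counting reads per chunk).
for f in sequence.fq.frag.??; do
  ( echo $(( $(wc -l < "$f") / 4 )) > "$f.count" ) &
done
wait

# Merge: sum the per-chunk results.
awk '{s+=$1} END{print s" reads"}' sequence.fq.frag.*.count   # prints "6 reads"
```

The same pattern works for any per-read job: replace the counting step with your real command and concatenate the chunk outputs at the end.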

Venn diagrams for genes

One of the easiest ways to compare datasets is to identify common elements between them and show the overlaps in a Venn diagram.
So when you're working with many data points such as thousands of genes, what is the best/fastest way to do it?

Venny - Intersect up to 4 lists and count the overlaps. See and download the overlaps between gene lists. Easy to use. Basic graphical output, with overlaps not to scale and basic colours.
Gene Venn - Intersect 2 or 3 lists. Overlap output in one text file. Overlaps not to scale, but graphics are somewhat more attractive than Venny.

BioVenn - Intersect 2 or 3 lists. High quality graphics in a range of formats (SVG/PNG), with extensive customisation which makes it perfect for publications. Overlaps are to scale. Optionally can map gene IDs to actual RefSeq/Ensembl/Affy accession numbers.

Venn SuperSelector - Intersect as many lists as you like. The output will be a matrix and the number of entries in the overlap. The first matrix is the pair-wise overlaps and then the 3-some overlaps and so on.

If you were to do this regularly in an automated fashion, then consider programming it in R (tutorial). If you just need the numbers of overlaps, you can try the UNIX uniq command to find values appearing in both lists (assuming each list is itself free of duplicates).
cat list1 list2 | sort | uniq -d
This works, but the UNIX comm command is a bit more sophisticated, outputting the common and distinct entries in a three-column output. Note that comm requires both inputs to be sorted:
comm <(sort list1) <(sort list2)
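As a quick sketch (the gene names are invented for illustration), comm's column-suppressing flags give all three segments of a two-list Venn directly:

```shell
# comm needs lexically sorted input; -1/-2/-3 suppress columns.
printf "TP53\nBRCA1\nMYC\n" | sort > list1.sorted
printf "MYC\nKRAS\nTP53\n"  | sort > list2.sorted

only1=$(comm -23 list1.sorted list2.sorted | wc -l)  # unique to list 1
only2=$(comm -13 list1.sorted list2.sorted | wc -l)  # unique to list 2
both=$(comm -12 list1.sorted list2.sorted | wc -l)   # in both lists
echo "$only1 unique to list1, $only2 unique to list2, $both shared"
```

Here BRCA1 is unique to list 1, KRAS to list 2, and MYC and TP53 are shared, so the three counts are 1, 1 and 2.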

Fastq to tabular - filter reads on length

Fastq sequence format is a little bit strange as it has a structure of 4 lines per data point (sequence). This format is not ideal for analysis for one reason - it is quite difficult to manipulate with unix tools, because they are optimised for tab-delimited data with one entry per line. One way to overcome this is to "flatten" the file so that each sequence read occupies just one line, with the read name, sequence and quality information separated by tabs. To do this, there is a trick you can try with the unix paste command:

paste - - - - < sequence.fq | head
If we then wanted to do something like filtering reads based on sequence length, we can add a perl (or awk, for that matter) one-liner.
 paste - - - -  < sequence.fq | perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' 
Which will output the length of the sequence string in the first field.

If we wanted to filter reads based on the length of the sequence read and extract reads with size equal to or greater than 18 nt, we can use an awk command to do that and then remove the length field with cut.
 paste - - - -  < sequence.fq| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' | awk '$1 >= 18' | cut -f2- 
Another example, originally by Torsten Seemann, shows that this trick can be used to sort fastq files by sequence read length (link); the data is then "unflattened" back into fastq format.
paste - - - -  < sequence.fq| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;' | awk '$1 >= 18' | sort -n | cut -f2- | tr '\t' '\n'
Thanks to the Genome Factory!
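If you'd rather avoid the paste/perl combination altogether, the same length filter can be written in awk alone. This is a sketch (with a made-up two-read test file) that buffers each 4-line record and only prints records whose sequence is 18 nt or longer:

```shell
# Toy fastq: one 20 nt read (kept) and one 4 nt read (dropped).
printf "@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGT\n+\nIIII\n" > sequence.fq

# NR%4 tells awk which of the 4 record lines it is looking at.
awk 'NR%4==1{n=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{ if (length(s)>=18) printf "%s\n%s\n%s\n%s\n", n, s, p, $0 }' \
  sequence.fq > filtered.fq
```

The output stays in fastq format throughout, so no unflattening step is needed.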

Tuesday, 4 December 2012

Paper of the week - Role of Selfish DNA in Evolution

It's no secret that repeat DNA makes up the majority of the genome in a majority of "higher" eukaryotes. For a long time this repeat DNA has been considered "selfish" or "parasitic" DNA, whose only feature was its prolific self-propagation as it hitchhiked a ride through evolutionary history at the energetic expense of the host.

Nina Fedoroff argues in a recent review in Science that these transposable elements are not "junk" at all, and in fact play "a profoundly generative role in genome evolution", with transposons providing novel mechanisms to generate genetic diversity. Fedoroff explores the relationship between genome size and epigenetic complexity in a comparison of eukaryotes and prokaryotes, postulating that the innovation of elaborate epigenetic control mechanisms in bacteria paved the way for the increases in genome size and complexity we see in eukaryotes. So rather than epigenetic mechanisms evolving to keep "parasitic DNA" in check, it was precisely these epigenetic mechanisms which silenced transposons just enough to make them tolerable in the genome, where on occasion they provide spontaneous recombination activity, giving rise to bursts of increasing genome size and occasionally bursts of fitness. Transposase activity is proposed to give species an adaptive advantage in the long term, assisting the speed of genome evolution. It makes sense that these "copy-and-paste" tools have a beneficial effect on the genome on evolutionary scales, otherwise we probably wouldn't be carrying such a humongous variety and amount of them!

Science. 2012 Nov 9;338(6108):758-67.
Presidential address. Transposable elements, epigenetics, and genome evolution.
Fedoroff NV.
King Abdullah University of Science and Technology, Saudi Arabia.
PMID: 23145453 [PubMed - in process]

Sunday, 2 December 2012

Library preparation for RNA-seq

It is often said that RNA-seq is overtaking microarrays as the gold standard for transcriptome-wide gene expression analysis, but what most people don't appreciate is that this "gold standard" is far from standard. There is an ever-increasing number of methods available to apply high throughput sequencing to gene expression analysis. Thus "RNA-Seq" is actually a very generic term describing a range of techniques which aim to use sequencing to profile transcripts. Let's go through some of the more common applications and work our way to the more niche applications.


mRNA-Seq

This method aims to profile the pool of transcripts which encode proteins. The idea is to enrich the RNA mixture for mRNAs by depleting the concentration of rRNAs and other abundant ncRNAs. Usually this is achieved by polyA enrichment, using hybridisation to magnetic beads decorated with oligo dT strands. After the enrichment, the resulting RNA is fragmented by heat exposure in the presence of divalent cations and then subjected to reverse transcription using a random hexamer primer, followed by synthesis of the second strand cDNA. This process is then followed by the standard protocol for DNA library prep that I mentioned in a previous post (end repair, A-tailing, adapter ligation, size selection and amplification). The benefit of this method is that it reduces rRNA by about 90%, but it also depletes ncRNAs which don't have polyA tails. It is also a fairly long procedure at 2 days, and it requires about 1ug of input RNA.

Total RNA-Seq

As the name suggests, the aim is to profile the entire pool of RNAs in a sample (excluding small RNAs), which is done simply by omitting the polyA enrichment step of the mRNA-Seq method; all other steps from the reverse transcription onwards are the same. It is about half a day quicker than the mRNA-Seq method and only requires about 100ng of input RNA. This is the method of choice for profiling prokaryote transcriptomes. Note that the random priming method generally results in the depletion of small RNAs, so the name "Total RNA-Seq" is a bit of a misnomer.

rRNA depletion methods

These methods start with a total RNA library prep and then incorporate a step which allows the specific depletion of highly abundant sequences such as ribosomal RNAs. Kits using this type of approach include Ribo-Minus and Ribo-Zero. Duplex-specific nuclease (DSN) normalisation has also been used to perform this depletion, but seems to have fallen out of favour in recent times. rRNA depletion methods are ideal for situations when you have less than the 1ug required for the polyA enrichment method. They also allow the deep analysis of longer ncRNAs which might not have polyA tails.

Directional/Strand Specific RNA-Seq

Capturing strand direction is important in differentiating the expression of nearby and overlapping transcripts, which matters at many loci. There have been a few strand specific methods developed, including those which perform 3' and 5' ligation of RNA oligos followed by reverse transcription and PCR to generate the dscDNA library, in a way similar to the small RNA method below (link). Newer methods include one in which the ligated RNA product is reverse transcribed on the flowcell (link), and a method called di-tagging, which uses a 5' tagged random hexamer sequence for first strand synthesis followed by the integration of a specific 3' sequence using a so-called terminal tagging oligo, allowing subsequent PCR amplification from very low starting material amounts (link). The Illumina TruSeq method utilises a 5' tagged random hexamer for reverse transcription followed by standard second strand synthesis and library prep; strand specificity is provided by the 5' tag incorporated in the first strand synthesis. These methods are slowly becoming mainstream as kits have been released by Illumina and NEB.


CAGE-Seq

Cap analysis gene expression (CAGE) is a tool which allows the capture of the 5' ends of transcripts for library prep and sequencing (link, link). Capturing the 5' cap allows for high resolution profiling of transcription start sites and significantly streamlines the bioinformatic analysis of differential gene expression. The RNA is reverse transcribed and the single strand cDNA is ligated to a specially designed biotinylated double stranded adapter which allows subsequent synthesis of the second strand. The 5' ends are cleaved from the rest of the cDNA by a class II restriction enzyme, commonly MmeI. The resulting DNA fragments can be prepared for sequencing using standard adapter ligation and amplification, with extra care taken in the size selection step to isolate the molecules containing the 32nt tag. CAGE-Seq has its uses, especially in resolving the question of alternative transcriptional start sites, but otherwise has not been widely adopted.

Single molecule methods/Ultra low input

These methods generally begin with minute amounts of RNA, for instance from a FACS/MACS run, a laser-captured microdissection, or even single cells. Some of these methods include the Ovation RNA Amplification System V2 from NuGEN and the Clontech SMARTer Ultra Low RNA Kit. The di-tagging method mentioned above is also suitable in the 100-500pg range. A note of warning here: any method which involves amplification of RNA with many rounds of PCR or rolling-circle amplification could allow the misincorporation of bases at higher levels. (This happened in our lab with a kit that will remain nameless; it resulted in an error rate of 2% and subsequent mis-alignment of our 36nt reads all over the genome. We thought initially that DNA had somehow contaminated our starting material, but it was all due to the increased error rates. When we finally found the problem, we eliminated all reads with any mismatches to the genome (over 85% unfortunately) and, like magic, the remaining reads began to make perfect sense.) Thus, I would suggest that when you begin with ultra-low input analysis, you increase read lengths and thoroughly compare error rates to more standard mRNA-seq procedures.


RNA immunoprecipitation methods

A niche family of methods which begin with the immunoprecipitation of RNA-binding proteins, followed by the method described for total RNA-seq or directional RNA-seq. We have adapted this technique in our lab for the analysis of RNA which is bound to the chromatin, allowing us to look at certain types of chromatin, for instance regions which carry "repressive" or "active" histone marks. Applied specifically to AGO proteins, this allows the high resolution mapping of mRNAs being targeted by AGO at any point in time, in a method called CLIP-Seq or HITS-CLIP (link). When applied specifically to ribosomes, in a method called Ribo-Seq, it reveals an exquisite level of resolution of the rate of translation of each transcript (link). IP of adenosine bases methylated at the N6 position followed by sequencing allowed the construction of a transcriptome-wide map of this novel modification (link).

Small RNA-Seq

Starting with total RNA (optionally microRNA-enriched with miRvana), the pool of RNAs is subjected to 3' RNA adapter ligation followed by 5' adapter ligation, reverse transcription, PCR amplification and size selection for the fragment size range of interest. The adapters are modified to enhance the incorporation of microRNAs over other types of RNA. The barcode can be incorporated in the adapter sequence or later in the PCR primer. Early-release kits had big issues with barcode-specific bias, which was tracked down to the sequence specificity of RNA ligases. These biases are largely nullified in newer kits by placing the barcode at the distal end of the universal adapter sequence. Nevertheless, I would highly recommend that you perform a quality control test by analysing one biological sample with all available barcodes. We have been using the Illumina TruSeq small RNA kits (QC was fine) and will soon be testing the NEB small RNA kit. These kits capture microRNAs, but also other sequences such as piRNAs, Y-RNAs, tRNAs and, to a lesser extent, mRNAs which could originate from nascent transcripts or degradation products.

Bisulfite RNA-Seq

Some cytosines in RNA are methylated, commonly in rRNAs, tRNAs and miRNAs. The bisulfite sequencing method used to analyse methylated cytosine in DNA has been adapted to RNA. This has been used to interrogate specific transcripts (i.e., tRNAs and rRNAs; link) and is applicable on a transcriptome-wide basis.

In summary, these methods each have their strengths and weaknesses. There is no method which will capture the complete complexity of RNAs found in the cell, but I hope this post will help you on the way to selecting the right tools for the job at hand.

Further reading:
Perspective from Illumina