Showing posts from 2013

A Biomarker Gene Set From ENCODE Expression Data

MSigDB contains thousands of gene sets which have been mined from a range of genome wide studies and these are a valuable resource for gene ontology and pathway analysis. You probably know that many gene sets are curated by KEGG, REACTOME and BIOCARTA - as well as dry lab scientists who specialise in analysing these data sets and curating gene sets. What you may not know is that if you follow a few basic guidelines, you can start generating your own custom gene sets and these can become a valuable resource for running gene ontology and Gene Set Enrichment Analysis (as per the graphic)

For instance within our lab, we have extensively used ENCODE ChIP-Seq data to help us to analyse our mRNA-seq data and this has provided a huge leg-up in generating hypothesis and designing follow-up experiments. For this example, I want to show an overview of how I made biomarker gene sets for a bunch of cell types analysed by ENCODE. Biomarker gene sets can useful in array or mRNA-Seq analysis that you…

A selection of useful bash one-liners

Here's  a selection of one liners that you might find useful in data analysis. I'll update this post regularly and hope you can share some of your own gems.

Compress an archive:
tar cf backup.tar.bz2 --use-compress-prog=pbzip2  directory Extract a tar archive:
tar xvf backup.tar   Head a compressed file:
bzcat file.bz2 | head  pbzip2 -dc file.bz2 | head zcat file.gz | head  Uniq on one column (field 2):
awk '!arr[$2]++' file.txt Explode a file on the first field:
awk '{print > $1}' file1.txt Sum a column:
awk '{ sum+=$1} END {print sum}' file.txt Put spaces between every three characters
sed 's/\(.\{3\}\)/\1 /g' file.txt Join 2 files based on field 1. Both files need to be properly sorted (use sort -k 1b,1)
 join -1 1 -2 1 file1.txt file2.txt Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sorted:
awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,a…

Quick alignment of microRNA-Seq data to a reference

In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc).

MicroRNA datasets are rather a bit different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up this analysis considerably if we collapse the sequences and remove the redundancy before genome alignment. You can see in the below script that several steps are done without writing a single intermediate file:

Adapter clippingSequence collapsingBWA alignmentBam conversionSorting bam file
for FQ in *fq
fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
| fastx_collapser \
| bwa aln -t 30 $REF - \
| bwa samse $REF - ${FQ} \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
for bam in *bam ; do samtools index $bam & done ; wait
for bam in *bam ; do samtools flagstat $bam > ${bam}.stats & done ; wait
Using thi…

Align Next Gen Sequencing Data Like A Boss

We know that sequencing platforms are generating more and more output, and more of this data is finding itself in online databanks. The volume of data is a blessing because it is so information rich, but it is also increasingly becoming a problem - we need more machine time to process it.

I was recently aligning dozens of mouse encode RNA-seq data sets while we had a server fault. I was left with a mess of half finished jobs and intermediate files and basically had to start the whole process again. This prompted me to make an alignment pipeline that would be more resistant to unexpected interruptions. To do this, I took inspiration from a post at SEQanswers which piped the output from the BWA aligner and generated bam files without any intermediate files.

But I wanted to take this a bit further, by going from compressed fastq to bam file without writing any intermediate files, doing quality trimming on the fly and saving a lot of storage in the process. So here are two examples, one f…

Generating a custom gmt file for gene set analysis

Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG, Reactome, GOMSigDB and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt. I'm going to take you through two examples of generating custom gene sets:

Generate gene sets from published data sets using GEO2R Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med. Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma but this is embedded in the webpage and has a GUI which makes it very accessible for biologists.

Click this link to go directly t…

Generate an RNA-seq count matrix: Part 2 - count over exons

In the previous post, I mentioned that in order to do counting, we need some regions (exons) to count over. We generated a BED file based on GENCODE exons which looks like this:

chr93325586233255925ENSG00000107262.11_BAG1 Now we can actually count how many reads from each of our samples maps to these exons. To do that there are a few different programs: htseq-countsamtools viewGATK countreadsBams2mxBedTools multicov For this demo, I will outline use of BedTools for RNA-Seq. I think BedTools nulticov is a good option for 3 main reasons: It can handle multiple bam files at the same time, producing a count table/matrix which is pretty much ready for statistical analysis.It is aware of mapQ scores of alignments, meaning that if the program comes across non-unique alignments, you can set it to ignore or include those reads.The BedTools package is well documented, widely used and the code under constant main…

Generate an RNA-seq count matrix: Part 1 - GTF2BED

Previously we took a look at how to align mRNA-Seq reads to a genome with BWA and Tophat. The next step in the pipeline is to count reads which align on exons and then to get them into a count matrix that we can submit for statistical analysis with EdgeR or another program.

At this point in the process, we need to decide which annotation set to use. For human, we commonly use RefSeq and we are beginning to use GENCODE annotation because of its very comprehensive nature. Most of the programs which count reads require a BED file to indicate which genomic regions should be counted.

For this blog post, I will show you how we generate an exon BED file for GENCODE genes. The script below will download the GTF file and generate a temp file list of all exons present in the GTF in the form of a BED file (chromosome, start, end) with the gene accession number and gene name pasted into one field.


Recent papers

Some work that I've been involved with over the past few months. 

TitleAnalysis of the barley leaf transcriptome under salinity stress using mRNA-Seq Authors Mark Ziemann, Atul Kamboj, Runyararo M Hove, Shanon Loveridge, Assam El-Osta, Mrinal Bhave Journal name Acta Physiologiae Plantarum Pages 1-10 Publisher Springer-Verlag Description Abstract: Salinity is a threat to crops in many parts of the world, and together with drought, it is predicted to be a serious constraint to food security. However, understanding the impact of this stressor on plants is a major challenge due to the involvement of numerous genes and regulatory pathways. While transcriptomic analyses of barley (Hordeum vulgare L.) under salt stress have been reported with microarrays, there are no reports as yet of the use of mRNA-Seq. We demonstrate the utility of mRNA-Seq by analysing cDNA libraries derived from acutely salt-stressed and unstressed leaf material of H. vulgare cv. Hindmarsh. The data yielded >50 m…

Regulation of gene expression by long non-coding RNAs

Gene regulation is a really complicated thing. We have covalent marks to DNA, histones and transcription factors. Chromatin remodeling and long range enhancer interactions. Enhancer elements located in introns of genes hundreds of kilobases away from the gene they're controlling. Transcriptional control from microRNA networks and now there is an emerging model for the function of some of the thousands of long non-coding RNAs which are just now being uncovered with high resolution (directional) transcriptome analysis.

Many of you which studied molecular biology at Uni would (should) remember the model for how X chromosome inactivation is achieved. The mechanism centers around XIST, one of the first non-coding RNA genes identified. Expression of XIST from the inactive X chromosome essentially wraps it up at the same time that repressive epigenetic marks are established through its interaction with the Polycomb Repressive Complex 2 (PRC2). Sounds simple enough, but the model also inv…