Wednesday, 31 December 2014

2014 Wrap-Up

The year has gone so fast! Lets go through some of the major points of 2014 and predict what 2015 has in store.

Sequencing hardware

It began with announcements from Illumina of the HiSeqX10 as well as the release of the NextSeq500. Later, X-10 technology was included in V4 HiSeq2500 kits resulting in a substantial increase in sequence output from that instrument. There was also the announcement of the NeoPrep automated library preparation system that will be officially released in the 1st half of 2015. NeoPrep looks like an attempt to use microfluidics to perform generate libraries; if this is successful, it would be a major advance in reducing bottlenecks in sample preparation. Microfluidics are also able to reduce the volumes of reagents required, making the process cheaper. In addition, labour costs per library will be drastically reduced given that automation will be commonplace for routine protocols. Regarding 3rd gen platforms, we saw some mixed reviews from Oxford Nanopore MiniIon users while PacBio Sequencing long read sequencing is becoming more mainstream (including a Nature article), and we hope that a larger user base will spur on further improvements in sequencing chemistry. Could 2015 be the year that protein sequencing goes "next generation"? Time will tell, on the other hand, enhanced bioinformatic algorithms (like SWATH) in proteome mass spectrometry have some pundits already announcing that next generation proteomics is here.

Genome research

I just wanted to touch on some interesting points of research that have the potential of major advances in genomics. The first is the human knockout project, that is the search for individuals that carry normally deleterious mutations but have a normal phenotype. Those individuals and the other alleles they carry could be a goldmine for drug targets. This has promoted researchers, both academic and private-sector in gold-rush like enthusiasm to search for these individuals and characterise their precious genomes. Another technology that has gained real traction was CRISPR/Cas9 genome editing, not just for in vitro applications, but also for generation of custom rodent gene edited lines and more recently showing potential in gene therapy. Some conditions that show promise include Hepatitis B virus, cardiovascular disease and muscular dystrophy. Recently, other researchers have shown promise in restoring normal genotype to mutant spermatozoa with broad applicability for families to eliminate deleterious alleles from their lineage. This holds great promise for molecular medicine and also major ethical challenges.

Reflections on Genome Spot

This year I put an emphasis on quality blog posts including in-depth series on RNA-seq analysis, benchmarking data processing and NGS aligner accuracy. The number of clicks to the blog increased significantly this year and total pageviews topped 20,000 today. I also learned that my audience is mostly USA-based, mostly uses Windows computers, and mainstream internet browsers. My post on Geneclouds was my most popular post this year with 1185 hits, most of those coming from Reddit and Twitter. I also opened a Twitter account so follow me there @mdziemann. In 2015, I will be writing on ChIP-seq, DNA methylation analysis, variant calling, science papers and user friendly tools for any and all NGS applications.

My 2014 in academia

I had a good year for publications :)
Identification of salt-tolerant barley varieties by a consolidated physiological and molecular approach A Kamboj, M Ziemann, M BhaveActa Physiologiae Plantarum 37 (1), 1-12 2015
Dynamic changes in the cardiac methylome during postnatal development CB Sim, M Ziemann, A Kaspi, KN Harikrishnan, J Ooi, I Khurana, L Chang, ...The FASEB Journal, fj. 14-264093 2014
Development of Novel Activin-Targeted Therapeutics JL Chen, KL Walton, SL Al-Musawi, EK Kelly, H Qian, M La, L Lu, ...Molecular Therapy 2014
Non-referenced genome assembly from epigenomic short-read data A Kaspi, M Ziemann, ST Keating, I Khurana, T Connor, B Spolding, ...Epigenetics, 00-00 2014
Deep sequencing reveals novel Set7 networks ST Keating, M Ziemann, J Okabe, AW Khan, A Balcerczyk, A El-Osta Cellular and Molecular Life Sciences, 1-16 2014
Endothelial Transcriptome in Response to Pharmacological Methyltransferase Inhibition J Okabe, AZ Fernandez, M Ziemann, ST Keating, A Balcerczyk, A El‐Osta ChemMedChem 2014
Vascular histone deacetylation by pharmacological HDAC inhibition H Rafehi, A Balcerczyk, S Lunke, A Kaspi, M Ziemann, KN Harikrishnan, ...Genome research, gr. 168781.113 2014
Rapid Development of Non-Alcoholic Steatohepatitis in Psammomys obesus (Israeli Sand Rat) B Spolding, T Connor, C Wittmer, LLF Abreu, A Kaspi, M Ziemann, G Kaur, ...PloS one 9 (3), e92656 2014
Maternal overnutrition programs changes in the expression of skeletal muscle genes that are associated with insulin resistance and defects of oxidative phosphorylation in adult male rat offspring C Latouche, SE Heywood, SL Henry, M Ziemann, R Lazarus, A El-Osta, ...The Journal of nutrition 144 (3), 237-244 2014
Deep sequencing reveals increased DNA methylation in chronic rat epilepsy K Kobow, A Kaspi, KN Harikrishnan, K Kiese, M Ziemann, I Khurana, ...Acta neuropathologica 126 (5), 741-756 2013

To all my readers, thanks for your support in 2014. Have a safe, happy and successful 2015!

Monday, 29 December 2014

Interspecies gene name conversion

In this post, I'll provide a step-by-step guide to perform interspecies gene name conversion of gene expression data. This is a necessary step in the comparison of profiling data from two different experiments with different species (human and mouse), and allows us to use extensive human-centric gene set libraries in MSigDB when analysing non-human mammalian profiling data (such as mouse).

I performed GEO2R analysis of mouse expression data (GSE30192) to analyse the effect of azacitidine on mouse C2C12 myoblasts. The data looks like this:

"ID" "adj.P.Val" "P.Value" "t" "B" "logFC" "Gene.symbol" "Gene.title"
"1420647_a_at" "0.000346" "2.24e-08" "56.073665" "8.699524" "6.9755573" "Krt8" "keratin 8"
"1423327_at" "0.000346" "2.32e-08" "55.685912" "8.686447" "3.8096523" "Rpl39l""ribosomal protein L39-like"
"1433438_x_at" "0.000378" "4.92e-08" "48.124512" "8.381385" "3.8895855" "Mela" "melanoma antigen"
"1460256_at" "0.000378" "6.75e-08" "45.243998" "8.234349" "3.5680721" "Car3" "carbonic anhydrase 3"


Now to convert the gene names to human we will use the databases at Ensembl biomart to download mouse and human gene information separately, join these files together and then attach them to the original profiling data as depicted in the schematic below. (Some other homology resources are found at HomologeneMGI and Phytozome)

Use Ensembl biomart to download a list of mouse gene accession numbers and Associated Gene Names, as in the screenshot below.
The file should look a bit like this:
Ensembl Gene ID Associated Gene Name
ENSMUSG00000064372 mt-Tp
ENSMUSG00000064371 mt-Tt
ENSMUSG00000064370 mt-Cytb
ENSMUSG00000064369 mt-Te

Now do the same for human gene names. Click the "Homologs" button to also select mouse gene accession numbers. It should look like this:
Ensembl Gene ID Mouse Ensembl Gene ID Associated Gene Name
ENSG00000180383 ENSMUSG00000074678 DEFB124
ENSG00000162444 ENSMUSG00000028996 RBP7
ENSG00000165583 ENSMUSG00000079704 SSX5
ENSG00000165583 ENSMUSG00000079701 SSX5

Joining files

Now we need to join these 2 files on the mouse gene accession number. The data need to be "cleaned" and sorted on the field to be joined. The same needs to be done with the profiling data.

#Mouse file: remove the top line and sort on accession number
sed 1d MusENSG2symbol.txt | sort -u \
| sort -k 1b,1 > MusENSG2symbol_s.txt

#Human file: remove top line, select only genes that have mouse 
#homologs and sort on mouse accession number
sed 1d Hum2Mus_ID.txt | awk 'NF==3' | sort -u \
| sort -k 2b,2 > Hum2Mus_ID_s.txt

#Use the UNIX join command to join the two files on the mouse 
#accession number and then sort the output on the mouse gene symbol 
#(column 2)
join -1 1 -2 2 MusENSG2symbol_s.txt Hum2Mus_ID_s.txt \
| sort -k 2b,2 > Hum2Mus_ID_genenames.txt

#Clean and sort the profiling data
#Extract only the mouse gene name, fold change, p-value and 
#adj p-value from the GEO2R result
sed 1d GSE30192_GEO2R.xls | tr -d '"' \
| awk '{OFS="\t"} {print $7, $6, $3, $2}' \
| awk 'NF==4' | grep -wv NA \
| sort -k 1b,1 > GSE30192_GEO2R_s.xls

#Join the gene name/ID key file to the profiling data, sort on 
#p-values then remove redundant human entries
join -1 2 -2 1 Hum2Mus_ID_gnennames.txt GSE30192_GEO2R_s.xls \
| sort -k6g | awk '!arr[$4]++'  > GSE30192_GEO2R_converted.xls

Now the data looks like its ready for comparison with data from human experiments:
MusGeneSymbol MouseAccession HumanAccession HumanGeneSymbol FoldChange P-value adjP-value
Krt8 ENSMUSG00000049382 ENSG00000170421 KRT8 6.9755573 2.24e-08 0.000346
Rpl39l ENSMUSG00000039209 ENSG00000163923 RPL39L 3.8096523 2.32e-08 0.000346
Sprr1a ENSMUSG00000050359 ENSG00000169469 SPRR1B -3.441516 6.46e-08 0.000378
Sprr1a ENSMUSG00000050359 ENSG00000169474 SPRR1A -3.441516 6.46e-08 0.000378

Some thoughts on interspecies gene name conversions

Keep in mind though that these type of conversions have limitations. While human and mouse are fairly closely related, there are many genes that are not conserved. On the other hand, there could be gene copies (paralogs) in one of the species meaning the conversion is one-to-many. In that case, the genes could have the same function, or on the other hand, the paralogs could have different functions. Due to these limitations, many genes are lost along the way. From the 45101 probes on the original dataset, 39315 were assigned to a mouse gene name. Those probes were assigned to 21477 unique gene names, and after assigning these to human homologs we are left with just 16420 genes.

Tuesday, 16 December 2014

User friendly RNA-seq differential expression analysis with Degust

There is a need to make bioinformatics tools more user friendly and accessible to a wider audience. We have seen that Galaxy, GEO2RGenevestigator and GenePattern have each developed a huge following in the molecular biology community, and this trend will continue with introduction of new RNA-seq analysis tools. Previously, I posted about differential gene expression analysis of RNA-seq performed by the DEB online tool. In this post, I introduce Degust, an online app to analyse gene expression count data and determine which genes are differentially expressed. Degust was written by David R. Powell (@d_r_powell) and was Supported by Victorian Bioinformatics Consortium, Monash University and VLSCI's Life Sciences Computation Centre.

In this test, I'll be using the azacitidine mRNA-seq data set that I have previously analysed. To make the count matrix, I used featureCounts.

First step in the process is to your RNA-seq count data. It can be done in tab or comma separated formats. Once uploaded, you're given a configuration screen to specify the format of the data and the sample groups. Make sure you specify which column contains the gene names/accession numbers. Hit the "view" button and you'll get the smear plot. You can use the mouse to highlight genes. At the bottom there is a table of most statistically significant genes and the search function allows you to quickly find your favourite genes.
Degust interface with smear plot.
You also get an MDS plot to visualise the similarity between samples. What I really like is the graph in the lower right corner showing the relative magnitude of the first 6 MDS dimensions.
MDS plot.
And a parallel coordinates plot (which is probably more interesting with 3 or more sample groups).
Parallel coordinates plot.
So now I'm going to compare the results of Degust with those from DEB using  both DESeq and edgeR as well as the recommended edgeR script obtained from the user manual. I used an awk command to filter the spreadsheets to identify differentially expressed genes (DEGs). I then used the Venny online tool to calculate the overlaps between the DEG lists.
Venn diagram of differentially expressed genes. * Denotes the use of DEB online tool.
You can see that edgeR from the DEB server identified the most DEGs, followed closely by my edgeR script (included below), then Degust and lastly DESeq. My edgeR script differed from the DEB edgeR version likely because I'm using edgeR_3.4.2 & limma_3.18.13 while DEB likely uses an older version. Degust was more similar to the edgeR analyses compared to DESeq, which appears conservative in comparison. Looking at the Degust R code, you can see that it uses Voom to normalise count data to make it compatible for use with Limma statistical analysis as described in a recent Genome Biology paper.

In summary, Degust is a valid tool for RNA-seq analysis for simple comparisons (ie unpaired, 2-sample groups) that is faster and more user friendly than DEB. Its attractive, intuitive and responsive interface suggests that it will be a popular tool for expression analysis.

It would be great if it could also deal with more complicated experimental setups such as sample pairing, ANOVAs and GLMs; and I see downstream pathway analysis such as GSEA as a natural extension to Degust.

R code

x<-read.delim("", row.names="Symbol")
write.table(topTags(et,n=20000), file="aza_mRNA_edger.xls", quote=FALSE, sep= "\t")