Sunday, 19 February 2017

How NGS is transforming medicine

Last month, I gave a talk at our departmental meeting, describing in general terms how high-throughput sequencing technology is having real impacts in medicine and human health, as well as some emerging trends to watch in coming years.

Here's the link

Departmental presentation

Friday, 17 February 2017

Introducing the ENCODE Gene Set Hub

TL;DR We curated a bunch of ENCODE data into gene sets that are super useful in pathway analysis (e.g. GSEA).
Link to gene sets and data:
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway-level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets, which is really cool: you can start to generate gene sets based on your own profiling work and that of others.

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated from ENCODE transcription factor binding site (TFBS) data to understand which TFs were driving responses to various stimuli [1-5]. My coauthors and I realised that our ENCODE gene set resource needed to be public. Our goal is to release these gene sets (and code) as a public resource and write a paper demonstrating the utility of this resource in enhancing interpretation of experimental data.

In this first piece of work, we selected ENCODE data that are present online in "peak" format, which includes TFBS, histone mods and DNase-seq peaks. We have curated lists of genes that contain such peaks at transcription start sites as well as enhancers.

The end goal, to publish really high quality gene sets, pushed us to think about ways to optimise gene set curation, and even how to measure gene set quality. There was not a lot of literature to base our work on. We reasoned that real gene expression profiles are strongly non-random, as we know clusters of co-regulated genes exist. The average enrichment score (i.e. NES for GSEA) of all genes in a set could be useful in detecting non-random sets of genes. The problem is that as you change parameters, the number of members in the gene sets changes; as you add more members to a set, NESs tend to increase and raw enrichment scores (ESs) tend to decrease. So we reasoned that we should somehow normalise for the number of genes in the set. We devised a metric that was useful in estimating association between gene sets and profiling data:

AM=(NES^2) / Size

Where AM is the association metric, NES is the normalised enrichment score for each gene set and "Size" is the number of genes in the gene set. In our work, the average AM of all sets in the GSEA run is reported. The higher the AM, the less random the gene sets are. In the process of turning peak data into gene sets we ran into a few key questions:
  1. Does it matter how many genes there are in a set?
  2. Does it matter how far away a peak is from a TSS or enhancer (regulation landmark)?
  3. Which gene annotation set should we use?
To address these, we again needed a good way to evaluate how AM changes as we increment parameters like gene set size and distance to regulatory landmark. We wrote a script to perform GMT subtraction, which generates new gene sets for each increment in a parameter sweep.
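To make the metric concrete, here is a minimal sketch of computing the mean AM from a GSEA-style report. The column layout (NAME, SIZE, NES, tab-separated) is an assumption, so adjust the field numbers to match your actual report:

```shell
# Toy GSEA-style report (invented values): NAME, SIZE, NES, tab-separated
printf 'NAME\tSIZE\tNES\nsetA\t100\t2.0\nsetB\t400\t2.0\n' > report.tsv
# AM = NES^2 / Size for each set; report the mean AM over all sets
awk 'NR>1 {am=($3*$3)/$2; sum+=am; n++} END {printf "%.4f\n", sum/n}' report.tsv
```

Here setA scores 0.04 while the larger setB scores only 0.01 despite the same NES, which is exactly the size normalisation described above.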

Using this approach as shown in our poster presented at the 2017 Lorne Genome Conference we were able to define AM while varying parameters like the size of the gene set and distance of peaks to the regulatory landmarks.

Our poster for Lorne Genome Conference 2017 (DOI:10.13140/RG.2.2.34302.59208)
We found that gene sets of 2000 or more should be avoided, and the distance between peak and regulatory landmark can be 1 kbp to 10 kbp. Results were consistent for TFBS, histone mod and DNase-seq peaks at both TSSs and distal regulatory elements.

While there are already curated TF binding site data from ENCODE at Harmonizome, these are not readily accessible in bulk. Our work is unique for the following reasons:

  • Our analysis looks at TFBS at enhancers as well as promoters
  • Our work summarises mouse ENCODE data
  • Gene set size and peak-landmark distance are optimised
  • Includes chromatin accessibility and histone mod data in addition to TFBS
During the course of this work, we also developed MultiRes, a visualisation tool for multiple layers of regulation built on ComplexHeatmap in R. For instance, in the poster, we used Reactome pathways as a primary analysis (red and green) and layered ENCODE transcription factor binding site data over it as a secondary analysis (blue, white and red).

MultiRes plot showing Reactome and ENCODE TFBS analyses.
MultiRes can be used for other layers of regulation, such as Starbase microRNA target gene sets.

In the next iteration of the Encode Gene Set Hub, we will be adding gene sets from FAIRE-seq, RNA-seq and other ENCODE data.

If you use these gene sets in your work, I'd love to hear some feedback. Leave your comments below.

  1. Kobow et al, 2013
  2. Rafehi et al, 2014
  3. Keating et al, 2014
  4. Okabe et al, 2014
  5. Tuano et al, 2016

Saturday, 31 December 2016

2016 wrap-up

What a rollercoaster.

On the good side, we've seen advances in sequencing methods, including improvements in Nanopore sequencing and single-cell methods becoming more common. We also saw 10x Genomics come to the party with an emulsion PCR approach that can be applied to produce synthetic long reads or single-cell barcoding. There were major results from large cohort studies exemplified by ExAC, which is revealing more about human genetic variation, while Blueprint and GTEx are revealing more about determinants of gene regulation. The #openaccess movement is growing rapidly in the bioinformatics community, along with the popularity of preprints, which I hope are adopted more widely in the biomedical sciences.

On the not so good side, Illumina has made no new instrument announcements nor any substantial updates to existing sequencing systems. There were a few papers (example) describing the methylation EPIC array announced in 2015. Prices for Illumina reagents continue to increase, especially in Australian dollars, which is starting to erode some of the goodwill among researchers.

Closer to home, I had quite a mixed year in science.

You'll recall that we announced Digital Expression Explorer in October 2015. It was the first real attempt at mining the entire contents of transcriptome sequencing in SRA and making it a useful resource for non-bioinformaticians. While the site received a lot of positive feedback from the community, our submission to Bioinformatics didn't go well. There were a few sticking points with the reviewers, the most important being the process of dataset QC and filtering, as well as the use of single reads even when the second read was present. We made major updates to the site to please them, but the reviewers didn't really comprehend that when dealing with >10^6 datasets there will not be a pipeline that works perfectly for 100% of them. The rejection from Bioinformatics was bitter, but I'm not a quitter - there will be a new version landing in mid 2017 with more advanced features, including transcript-level quantification and customisable gene sets to choose from (Ensembl and RefSeq). While there have been similar efforts like ReCount, Rail-RNA, BioXpress and EBI Expression Atlas, no one has yet released as many curated RNA-seq datasets from as many species as we have.

This year saw my first pre-print, on which I have received quite a bit of feedback, some good and some bad. There might be a revision to the paper in late 2017 and eventually submission to a journal.

There was a small win for our group's paper on small RNA mappers, which was a huge effort. In each round of review, the reviewers requested more analyses and so the published paper is about twice as long as the initial submission. The original submission was much better ;)

We had a smashing success with the publication of the Gene Name Errors paper in Genome Biology. It was a real wake-up call to the field that we should be more careful with the data we publish and should be adopting better data analysis practices. It also shone a light on many of the issues that are emerging in supplementary content. As a whole, scientists rely strongly on supplementary materials, but these are given little, if any, scrutiny.

Throughout the year there were middle authorships in papers related to our group's work on epilepsy, metabolic syndrome, stem cell differentiation and heart failure. These co-authorships reflect my bread-and-butter work: contributing to the generation and analysis of sequencing data in multidisciplinary collaborations.

This blog received its 200,000th visit! Although this achievement is somewhat diminished by huge amounts of bot traffic.
Periodic spikes in web visits are indicative of bot activity

2017: A new beginning for all of us*

Many of us lost someone this year. It could be a favourite musician, artist or someone close to us. Through loss we realise the fragility of life and the importance of not wasting our time here. 2017 is likely to throw as much adversity at us as 2016 did, so take this chance to take stock and live this year as if it were your last. If 2016 taught me anything, it's that we should be working on stuff that interests people and dump obscure projects that aren't going anywhere (you'll have to be honest/blunt with collaborators). Reflect on the good things around you this year. I welcomed my newborn daughter in October and saw my 4yo son develop in so many ways. We even moved into a place of our own.

A few weeks ago, our group (El-Osta) moved from Baker IDI to Monash Uni situated a few doors down at the Alfred Hospital campus. Initially I was unenthused about this, but the change does allow the group to reinvigorate and take on new opportunities that would not be possible at Baker IDI. The move killed my only PI project, but we'll be seeking new funds to replace the lost Baker IDI internal grant.

I can give a few clues as to what I'll be working on in 2017: A new version of DEE; a bunch of new gene set resources; new work on the pitfalls of inbred mouse models; MinION data from our lab; single cell RNA-seq data from our lab; and perhaps a paper called "Supplementary data horror stories".

I'm looking forward to making 2017 count; it's a new start, a chance to do better.

*Apologies to Left of Field Record for paraphrasing this record title

Thursday, 29 December 2016

MSigDB gene sets for mouse

I recently needed to convert MSigDB gene sets to mouse so I thought I would share.


Below is the code used to do the conversion. It requires an input GMT file of human gene symbols as well as a human-mouse orthology file. You can download the ortholog file here. As the name suggests, it is based on data downloaded from Ensembl Biomart version 87.

Running the program converts all human gmt files. It requires GNU parallel, which can be easily installed on Ubuntu with "sudo apt-get install parallel".


conv(){
  # GNU parallel passes each GMT line to conv as arguments
  line="$@"
  NAME_DESC=`echo $line | cut -d ' ' -f-2`

  GENES=`echo $line | cut -d ' ' -f3- \
  | tr ' ' '\n' | sort -uk 1b,1 \
  | join -1 1 -2 1 - \
  <(cut -f3,5 mouse2hum_biomart_ens87.txt \
  | sed 1d | awk '$1!="" && $2!=""' \
  | sort -uk 1b,1) | cut -d ' ' -f2 \
  | sort -u | tr '\n' '\t' \
  | sed 's/\t$/\n/'`

  echo $NAME_DESC $GENES | tr ' ' '\t'
}
export -f conv

for GMT in `ls *gmt | grep -v mouse` ; do
  NAME=`echo $GMT | sed 's/.gmt$/_mouse.gmt/'`
  parallel -k conv < $GMT > $NAME
done

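To see what the join step in the script does, here is a toy run with invented symbols (the real ortholog table comes from columns 3 and 5 of the Biomart file):

```shell
# Toy ortholog table: human symbol, mouse symbol (tab-separated), sorted for join
printf 'MYC\tMyc\nTP53\tTrp53\n' | sort -k 1b,1 > orth.txt
# Human symbols from one gene set line, one per line, sorted, joined to mouse symbols
printf 'MYC\nTP53\n' | sort -k 1b,1 | join -1 1 -2 1 - orth.txt | cut -d ' ' -f2
# prints Myc then Trp53
```

Symbols absent from the ortholog table simply drop out of the join, which is why empty fields are filtered out of the Biomart file first.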

Tuesday, 20 September 2016

Gene name error scanner webservice

Over the past few weeks, we've had a lot of feedback about our paper describing the sorry state of Excel auto-correct errors in supplementary spreadsheet files.

In our group, we've discussed a number of ways these errors could be minimised in future. One suggestion was to publish a webservice that permits reviewers and editors to upload and scan spreadsheets for the presence of gene name errors. So that's what I did. I took some basic file upload code in PHP and customised it so that it runs the shell script described in the paper. You can access the webservice here. We've been testing it for a few days and it seems to work fine, except for the auto-generated email, which I presume is being blocked by our IT group.
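This isn't the actual script from the paper, but a rough sketch of the idea behind the scan - looking for date-like values where gene symbols should be - on a toy tab-separated export:

```shell
# Toy gene list with Excel-mangled entries (2-Sep was once SEPT2, 1-Mar was MARCH1)
printf 'gene\texpr\nTP53\t5.2\n2-Sep\t1.1\n1-Mar\t0.4\n' > genes.tsv
# Count rows where the first column looks like an Excel date conversion
grep -cE '^[0-9]{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)' genes.tsv
# prints 2
```

The real checker also has to handle other sheet layouts and the case where dates land in columns other than the first.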

The code for the webservice is up at GitHub, so you can modify it and host another instance if you want. The code should run on Ubuntu machines that can run Apache2, PHP and other dependencies. The GitHub repo contains an install script that should be able to install dependencies and configure the webpage in more or less one command. Modification of php.ini is required to accommodate upload file sizes larger than the default, as in the following example:

upload_max_filesize = 50M
post_max_size = 50M

I would like to thank other contributors to this tool: Yotam Eren, Antony Kaspi & Assam El-Osta

Saturday, 27 August 2016

My personal thoughts on gene name errors

Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

  1. Most tech-savvy people hate Excel
  2. People over-rely on Excel, when there are better alternatives for analytics
  3. Everyone has experienced an auto-correct fail and can relate
  4. People love "bloopers"
  5. People are interested when scientists get it wrong (especially other scientists)

 In this post I want to share a few things:

  1. Some responses to journalist questions
  2. List of media coverage and whether they are reporting things accurately
  3. A look into the scripts used themselves
  4. Future directions
I'll also be answering your questions, so pop them in the comments section. 

Some responses to journalist questions

Why did you do this?

We saw that the problem was first described in 2004, but these errors were present in files from papers in high-ranking journals. We made some bash scripts to look at one journal and were surprised how common the problem was.

How much of a problem is it really? Does it matter that these datasets are corrupted?

As most of the supplementary files we looked at contained unambiguous accession numbers in addition to gene symbols, we are able to infer the real gene name most of the time. Also, this problem mostly affects the 20 or so genes with names that look like dates. So in reality it's a minor but embarrassingly common problem. There are still 166 supplementary files that need to be fixed.

Why is it still going on, given the problem has been known about for a decade or more?

There is perhaps an over-reliance on tools like Excel, which are not suited to very large data sets. There is a resistance to learning the "correct" tools like Matlab, R & Python, with which researchers can trace exactly how the data are processed in a reproducible way.

Anything else?

Perhaps the supplementary data files accompanying these papers aren't given the same level of scrutiny that the main article receives. Also, the reviewer's field of expertise may not extend to the subtleties of data analytics methods.

Tell me, do you know how they created those Excel files? Why didn't they use the text import function?

Most, but not all, of these data files are imported as text or CSV formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. Many of these files appear to be heavily modified from their original format and contain colour coding, modified column headers and additional columns. Some files are comparisons of 2 or more datasets in the same worksheet. A smaller number of files were simply filtered lists of gene names that could be a group of candidate genes for future analysis. Many users don't know about the specific data import settings and simply use the defaults.

What's really interesting is whether these errors matter - are any of the research conclusions wrong? That would be a powerful point to make.

As most error-containing files I screened also had accession numbers or other identifying information, the risk of gene name errors altering the conclusions of a study is minimal - the errors are embarrassingly common rather than damaging.

Were the Excel sheets merely a way to package data and results for readers, although the original analysis was done correctly in SPSS or R ?

Yes, statistical analyses of large datasets are done in R, Matlab, Python, etc., and it's common to save the data in an XLS file so that other researchers in the study can open and inspect the data.

Specifically for the misinterpreted genes, MARCH1, SEPT1,  did any papers focus on these? If so, it should be easier to check whether the results in the paper reflect the original clean data or the mangled Excel data

I didn't look at this specifically. But it would be interesting to see whether this is true.

Would Spreadsheet Auditing help reduce the incidence of these mistakes?

I've been discussing the drawbacks of spreadsheet software with colleagues and one solution could be to develop a spreadsheet software that has a complete changelog so that the files can be fully transparently audited and the modifications of these files can be done in a reproducible way.

Media coverage

Here I'll list the media reports and give my 2c on whether they are accurate. Please let me know if I've missed any.

HowStuffWorks <<==This is my favourite, the video is a must see!

Quartz Contacted me, Assam and others about the issue. It is a great read - balanced and well researched.

BBC Mostly accurate, but the flashy title lays the blame on Microsoft, whereas I reckon it lies primarily with the researchers, reviewers, editors & database curators. Microsoft's primary market is business and home applications, not genomics.

WinBeta Overblown. The headline is inaccurate. It's important to note that it's not 20% of all genomics papers, but 20% of those that share lists of gene names in supplementary files. It is a very important distinction and is why we chose our words so carefully in the paper. The authors' suggestion to leave a note at MS Excel UserVoice instead of sending the paper to Genome Biology is laughable. Primarily because, had the paper never been submitted, the issue would not have gotten so much attention, and fellow scientists and journal editors wouldn't have received the message that they need to tighten up the way supplementary data are reviewed. Secondly, I'm an academic and writing papers is what we do for a living :)

SoftPedia News Like WinBeta, the headline is inaccurate and overblown, but otherwise accurate.

Slate Magazine Totally overblown; they've copied inaccuracies from the SoftPedia article without checking the original source.
ITWire Good & accurate

The Register Overblown headline. But the comment at the bottom of the piece is apt. "The Register could suggest that gene boffins just stop using spreadsheets for jobs to which they are not entirely-well-suited, but we suspect the idea would be as futile in that field as it is in every other. ®"

Inquirer Overblown headline. I can sense a trend here. The report also contains a blunder, saying the paper was published in BioMed Central, when it was actually in Genome Biology. The link between gene name errors and Jurassic Park is a bit far-fetched & random.

Digital Journal Overblown headline. Otherwise good.

GenomeWeb Paywalled content, try searching for it with Google News to access it. Is mostly accurate.

Science World Report Article claims we "blame" MS Excel, which is not the case. We lay most of the blame with researchers, reviewers and editors. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The article does a good job of highlighting good comments from other discussions on the net.

I4U News This looks like an exact copy of the WinBeta piece.

Techaeris Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". Otherwise its OK.

The Times Need to register to The Times to read. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The paper features some great comments from Ewan Birney, EBI Director.

Washington Post Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The piece also highlights Excel's ability to turn dates into 5 digit numbers - we actually saw this a lot too. The article finishes by advocating the use of R over Excel, which we agree with.

Popular Mechanics Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 

Neowin Good & accurate

AfterDawn Good & accurate

WinBuzzer They spelled my boss's name incorrectly. That's one of his pet peeves.

The Tech News Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 

Slashdot Reposted from WinBeta, Including the horrible headline. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". Also 344 comments is a lot, even if most of them seem random.

Süddeutsche Zeitung (German) This is a great article and completely correct. 

Futurezone (German) Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 
Globo (Portuguese) Good (translated by Google)
N1 (Serbian) Good (translated by Google)


SiliconANGLE Good & accurate

News in Proteomics Research Mostly great! Also describes how to turn off autocorrect features in Excel - which, as I understand it, only fixes autocorrection when text is typed, not when data is pasted or imported.

Language Log Great! Describes a researcher's long battle with Excel genes in the literature.

MentalFloss Good & accurate

What You're Doing Is Rather Desperate Great post by Neil Saunders, a legend in the genomics community, about his unending dislike for bioinformatics done in Excel. Great post from an IT point of view, mostly advising people to use Excel correctly.

Walking Randomly Discusses how Excel use leads to more errors than just date conversions. As scientists we should be using scripts to make our work reproducible.

Sysmod by Patrick O'Beirne. Great post highlighting how easy the problem is to avoid by using the Excel import features. Also goes into detail about further checks which can be done to avoid errors.

The Allium "Scientific community capitulates to Microsoft, officially changes all gene names to dates" What a fantastic satirical piece!

Thanks to everyone that's reported, shared, tweeted, commented, reddited, posted and emailed us about the story. I think we've partly achieved our goal of raising awareness of the issue. Let's see whether the 166 files are corrected and whether any new gene name errors crop up in future :)

I'll finish this post with a word-cloud of the most common Excel gene name errors. In my next post I'll discuss the scripts used to identify the errors.

Saturday, 9 July 2016

Analyzing repeat rich plant smRNA-seq data with ShortStack

Small RNA expression is difficult to analyse. These are small molecules (anywhere from 18-25 nt for miRNAs), they occur as identical or near-identical family members, and they are subject to RNA editing as well as errors from the sequencer.

My recent paper is an analysis of alignment tools for microRNA analysis with a strong focus on uniquely mapped reads. That's all OK, but in some organisms such as grasses (rice, barley, wheat, etc.) you'll find that multimapped reads far outnumber uniquely placed ones. If you omit multimapped reads from the analysis, then you'll be excluding the majority of reads, which is definitely a bad idea in any NGS analysis pipeline.

To demonstrate this, I downloaded smRNA-seq data from SRA (SRP029886) consisting of 3 datasets (SRR976171, SRR976172, SRR976173), clipped off the adaptors, mapped them to the genome with BWA aln, then counted reads uniquely mapped to exonic regions (mapQ10).
Table 1. Mapping of rice small RNA reads to the reference genome with BWA aln.
So as you can see, the reads that are assigned and form the count matrix comprise less than 20% of the original reads. Clearly this approach will only capture about a fifth of what's going on.
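For the record, the "uniquely mapped" filter here is just a MAPQ cutoff. A minimal sketch of the idea on a toy SAM body (records invented; with real data you would filter the BAM with samtools instead):

```shell
# Toy SAM records (no header): MAPQ is column 5; low values indicate multimapping
printf 'r1\t0\tchr1\t100\t37\nr2\t0\tchr1\t200\t0\nr3\t0\tchr1\t300\t12\n' > toy.sam
# Count reads passing the mapQ10 filter, i.e. confidently placed reads
awk '$5 >= 10 {c++} END {print c}' toy.sam
# prints 2
```

It is exactly this cutoff that throws away the multimapped majority in repeat-rich genomes.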

There is an alternative to help with multimapped smRNA reads. The ShortStack pipeline uses uniquely placed reads in the vicinity of multimapped reads to estimate the number of tags originating from each multimapped region.
Table 2. ShortStack pipeline resolves multimapping smRNA reads with local unique alignment information.
As you can see in the lower table, the local mapping information has helped to place 43.8% of reads that were previously ignored due to low mapQ. ShortStack also has an in-built miRNA gene identification tool that uses RNA folding info and read mapping positions to infer genes (Figure 1).
Figure 1. Example of a miRNA gene identified from rice smRNA-seq data with ShortStack.
Of the 309,946 rice contigs identified, 94 were classified as miRNA genes. Of the 94M mapped reads, miRNA genes accounted for just 3,394,147 reads (3.61%). If the majority of reads aren't miRNA, then what are they? I took a look at the top 10 contigs by assigned reads (Figure 2).
Figure 2. Highest expressed rice contigs in smRNA-seq data.
Just these top 10 contigs accounted for 36.6M reads (39% of mapped reads), and as you can see, uniquely mapped reads represent just 39% of this figure. The counts determined by ShortStack could be used for differential expression analysis with GFold or DESeq.

What I found really interesting was despite the very high expression of these contigs, their function remains more or less unexplored. There are few genes in these genomic regions that have any functional data ascribed, and there is no hint from the databases that these encode any type of non-coding RNA.

These data show that naive "uniquely" mapped read analysis is inappropriate for plants like rice that have a large proportion of multimapped smRNA reads. ShortStack was able to account for a much greater proportion of reads and was able to determine genomic regions containing highly expressed genes that likely have important but unexplored roles.