Saturday, 27 August 2016

My personal thoughts on gene name errors

Well, our paper "Gene name errors are widespread in the scientific literature" in Genome Biology has stirred up some interest. There are a lot of reasons why this article has taken off beyond what I initially envisioned:

  1. Most tech-savvy people hate Excel
  2. People over-rely on Excel, when there are better alternatives for analytics
  3. Everyone has experienced an auto-correct fail and can relate
  4. People love "bloopers"
  5. People are interested when scientists get it wrong (especially other scientists)

 In this post I want to share a few things:

  1. Some responses to journalist questions
  2. List of media coverage and whether they are reporting things accurately
  3. A look at the scripts themselves
  4. Future directions
I'll also be answering your questions, so pop them in the comments section. 

Some responses to journalist questions

Why did you do this?

We saw that the problem was first described in 2004, yet these errors were still present in files from papers in high-ranking journals. We wrote some bash scripts to look at one journal and were surprised by how common the problem was.

How much of a problem is it really? Does it matter that these datasets are corrupted?

As most of the supplementary files we looked at contained unambiguous accession numbers in addition to gene symbols, we are able to infer the real gene name most of the time. Also, this problem mostly affects the 20 or so genes that have names that look like a date. So in reality it's a minor but embarrassingly common problem. There are still 166 supplementary files that need to be fixed.
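To give a feel for what a converted gene name looks like in a file, here is a minimal Python sketch. It is not the bash scripts from the paper. The patterns and the function name `find_suspect_symbols` are my own assumptions for illustration: it flags entries shaped like "1-Sep" or "Sep-1", plus bare 5-digit numbers, which is how Excel can store a date internally.

```python
import re

# Month abbreviations that appear when a symbol is reformatted as a date.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Assumed patterns (illustrative, not exhaustive): "1-Sep" / "Sep-1" style strings.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{_MONTHS})|(?:{_MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)
# A bare 5-digit number in a gene symbol column may be a date serial.
SERIAL_LIKE = re.compile(r"^\d{5}$")


def find_suspect_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    suspects = []
    for raw in symbols:
        value = raw.strip()
        if DATE_LIKE.match(value) or SERIAL_LIKE.match(value):
            suspects.append(value)
    return suspects


column = ["TP53", "1-Sep", "2-Mar", "MARCH1", "42248", "SEPT1"]
print(find_suspect_symbols(column))
```

Intact symbols such as MARCH1 and SEPT1 pass through untouched because they contain no hyphen; only the converted forms are flagged.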

Why is it still going on, given the problem has been known about for a decade or more?

There is perhaps an over-reliance on tools like Excel which are not suited to very large data sets. There is a resistance to learning the "correct" tools like Matlab, R & Python, in which researchers can trace exactly how the data is processed in a reproducible way.

Anything else?

Perhaps the supplementary data files accompanying these papers aren't given the same level of scrutiny that the main article receives. Also, the reviewer's field of expertise may not extend to the subtleties of data analytics methods.

Tell me, do you know how they created those Excel files? Why didn't they use the text import function?

Most, but not all, of these data files are imported as text or csv formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. Many of these files appear to be heavily modified from their original format and contain colour coding, modified column headers and additional columns. Some files are comparisons of 2 or more datasets in the same worksheet. A smaller number of files were simply a filtered list of gene names that could be a group of candidate genes for future analysis. Many users don't know about the specific data import settings and simply use the default settings.

What's really interesting is whether these errors matter - are any of the research conclusions wrong? That would be a powerful point to make.

As most error-containing files I screened also had accession numbers or other identifying information, the risk that gene name errors alter the conclusions of a study is minimal, even though the errors themselves are embarrassingly common.
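Here is a rough Python sketch of why the accession numbers matter: if a row still carries an unambiguous identifier, the converted symbol can be looked up again. The accession IDs, the lookup table and the function name `repair_rows` are hypothetical placeholders for illustration, not real database entries and not part of our scripts.

```python
# Hypothetical lookup table: placeholder accession IDs mapped to gene symbols.
ACCESSION_TO_SYMBOL = {
    "ACC0001": "SEPT1",
    "ACC0002": "MARCH1",
}


def repair_rows(rows, lookup):
    """Replace the symbol in each (accession, symbol) row when the accession is known.

    Rows whose accession is not in the lookup are returned unchanged.
    """
    repaired = []
    for accession, symbol in rows:
        repaired.append((accession, lookup.get(accession, symbol)))
    return repaired


rows = [
    ("ACC0001", "1-Sep"),   # symbol converted to a date
    ("ACC0002", "1-Mar"),   # symbol converted to a date
    ("ACC0003", "TP53"),    # unaffected symbol, accession not in the lookup
]
print(repair_rows(rows, ACCESSION_TO_SYMBOL))
```

A file that only contains the gene symbol column has no such fallback, which is why the filtered gene lists are the harder ones to fix.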

Were the Excel sheets merely a way to package data and results for readers, although the original analysis was done correctly in SPSS or R ?

Yes, statistical analysis of large datasets is done in R, Matlab, Python, etc., and it's common to save the data in an XLS file so that other researchers in the study can open and inspect the data.

Specifically for the misinterpreted genes, MARCH1, SEPT1, did any papers focus on these? If so, it should be easier to check whether the results in the paper reflect the original clean data or the mangled Excel data.

I didn't look at this specifically. But it would be interesting to see whether this is true.

Would Spreadsheet Auditing help reduce the incidence of these mistakes?

I've been discussing the drawbacks of spreadsheet software with colleagues, and one solution could be to develop spreadsheet software that keeps a complete changelog, so that files can be audited transparently and modifications can be made in a reproducible way.
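To make the changelog idea a bit more concrete, here is a minimal Python sketch. It does not describe any existing spreadsheet product. The class and function names (`Change`, `AuditedSheet`, `replay`) are my own assumptions: every edit to a cell is appended to a log, and the sheet's state can be rebuilt from that log alone.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One recorded edit: which cell, old and new values, who, and when."""
    cell: str
    old: Optional[str]
    new: str
    author: str
    when: datetime


class AuditedSheet:
    """Toy sheet that records every cell modification in an append-only log."""

    def __init__(self):
        self._cells = {}
        self._log = []

    def set(self, cell, value, author):
        old = self._cells.get(cell)
        self._log.append(Change(cell, old, value, author, datetime.now(timezone.utc)))
        self._cells[cell] = value

    def get(self, cell):
        return self._cells.get(cell)

    @property
    def log(self):
        # Return a tuple so callers cannot alter the history.
        return tuple(self._log)


def replay(log):
    """Rebuild cell contents by applying logged changes in order."""
    cells = {}
    for change in log:
        cells[change.cell] = change.new
    return cells


sheet = AuditedSheet()
sheet.set("A1", "SEPT1", author="researcher")
sheet.set("A1", "1-Sep", author="autoformat")
for change in sheet.log:
    print(change.cell, change.old, "->", change.new, "by", change.author)
```

With a log like this, an auditor could replay the history up to any point and see exactly when, and by what, a gene symbol was changed.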

Media coverage

Here I'll list the media reports and give my 2c on whether they are accurate. Please let me know if I've missed any.

HowStuffWorks <<==This is my favourite, the video is a must see!

Quartz Contacted me, Assam and others about the issue. It is a great read - balanced and well researched.

BBC  Mostly accurate, but the flashy title lays the blame on Microsoft, whereas I reckon it lies primarily with the researchers, reviewers, editors & database curators. Microsoft's primary market is business and home applications, not genomics.

WinBeta Overblown. The headline is inaccurate. It's important to note that it's not 20% of all genomics papers, but 20% of those that share lists of gene names in supplementary files. It is a very important distinction and is why we chose our words so carefully in the paper. The suggestion from the authors to leave a suggestion note at MS Excel UserVoice instead of sending the paper to Genome Biology is laughable. Firstly, this would not have gotten so much attention if the paper had never been submitted, and fellow scientists and journal editors wouldn't have received the message that they need to tighten up the way that supplementary data are reviewed. Secondly, I'm an academic and writing papers is what we do for a living :)

SoftPedia News Like WinBeta, the headline is inaccurate and overblown, but otherwise accurate.

Slate Magazine Totally overblown, they've copied inaccuracies from the SoftPedia article without checking the original source. Good & accurate
ITWire Good & accurate

The Register Overblown headline. But the comment at the bottom of the piece is apt. "The Register could suggest that gene boffins just stop using spreadsheets for jobs to which they are not entirely-well-suited, but we suspect the idea would be as futile in that field as it is in every other. ®"

Inquirer Overblown headline. I can sense a trend here. Also the report contains a blunder, saying the paper was published in BioMed Central, whereas it was actually in Genome Biology. The link between gene name errors and Jurassic Park is a bit far-fetched & random.

Digital Journal Overblown headline. Otherwise good.

GenomeWeb Paywalled content; try searching for it with Google News to access it. Mostly accurate.

Science World Report Article claims we "blame" MS Excel, which is not the case. We lay most of the blame with researchers, reviewers and editors. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The article does a good job of highlighting good comments from other discussions on the net.

I4U News This looks like an exact copy of the WinBeta piece.

Techaeris Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". Otherwise it's OK.

The Times You need to register with The Times to read it. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The paper features some great comments from Ewan Birney, EBI Director.

Washington Post Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". The piece also highlights Excel's ability to turn dates into 5-digit numbers - we actually saw this a lot too. The article finishes by advocating the use of R over Excel, which we agree with.

Popular Mechanics Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 

Neowin Good & accurate

AfterDawn Good & accurate

WinBuzzer They misspelled my boss's name. That's one of his pet peeves.

The Tech News Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 

Slashdot Reposted from WinBeta, including the horrible headline. Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". Also 344 comments is a lot, even if most of them seem random.

Süddeutsche Zeitung (German) This is a great article and completely correct. 

Futurezone (German) Repeat after me: "it's not 20% of all papers, just those with Excel gene lists". 
Globo (Portuguese) Good (translated by Google)
N1 (Serbian) Good (translated by Google)


SiliconANGLE Good & accurate

News in Proteomics Research Mostly great! Also describes how to turn off autocorrect features in Excel - which, as I understand it, only prevents autocorrect when data is typed, not when it is pasted or imported.

Language Log Great! Describes a researcher's long battle with Excel genes in the literature.

MentalFloss Good & accurate

What You're Doing Is Rather Desperate Great post by Neil Saunders, a legend in the genomics community, about his unending dislike for bioinformatics done in Excel. Great post from an IT point of view. Mostly advising people to use Excel correctly.

Walking Randomly Discusses how Excel use leads to more errors than just date conversions. As scientists we should be using scripts to make our work reproducible.

Sysmod by Patrick O'Beirne. Great post highlighting how easy the problem is to avoid by using the Excel import features. Also goes into detail about further checks which can be done to avoid errors.

The Allium "Scientific community capitulates to Microsoft, officially changes all gene names to dates" What a fantastic satirical piece!

Thanks to everyone that's reported, shared, tweeted, commented, reddited, posted and emailed us about the story. I think we've partly achieved our goal of raising awareness of the issue. Let's see whether the 166 files are corrected and whether any new gene name errors crop up in future :)

I'll finish this post with a word-cloud of the most common Excel gene name errors. In my next post I'll discuss the scripts used to identify the errors.