Thursday, 7 January 2016

Is paired end RNA-seq better than single-end for gene-wise gene expression analysis?

Something I've wondered about is whether for RNA-seq it's worth forking out the extra cost of sequencing both ends as opposed to single end.

To test this, I went back to a paired end data set present in GEO (GSE55123, 2x 36bp), cleaned the data with Skewer, then mapped the reads with STAR in either paired-end mode or single-end mode (using just read 1).

I then used featureCounts to quantify number of tags aligned to each gene. I excluded genes with fewer than 10 reads per sample on average. Then I ran edgeR at Degust to identify differentially expressed genes (DEGs@FDR<0.05). I used a shell script to quantify the overlap in DEGs. Then I ranked them based on the p-value from most up-regulated to most down-regulated and compared their positions in the rank.

Here's the result of the overlap analysis. You can see that PE fastq detected more genes but identified fewer DGEs than SE.

Detected in PE:15919
Detected in SE:15275
Detected in both:14750
Detected in either:16444
Detected in PE only:1169
Detected in SE only:525

DGE in PE:4651
DGE in SE:5755
DGE in both:4356
DGE in either:6050
DGE in PE only:295
DGE in SE only:1399


Then I looked at the breakdown of where the 4651 PE DGEs ended up in the SE analysis:
DGE in PE and SE: 4356
PE DGEs not detected in SE: 138
PE DGEs detected but not significant in SE: 157

I then plotted the PE rank versus the SE rank for genes that were detected above threshold in both PE and SE. Rank is based on the sign of the fold change and the p-value. The result is near perfect.

But I was curious to see whether the plot looked different if I included genes that were below the detection threshold in either SE or PE. The result is more consistent with the overlap analysis.
In conclusion, these results show that going for the cheaper single end option loses about 10% of detected genes.  It was quite surprising to see that single end data gave 23% more DGEs. I'm not sure whether these SE specific DGEs are actually real - we can't really know. Nevertheless, 93.6% of DGEs identified in PE were also identified in SE, so there aren't many DGEs that are missed in SE data. As the data set used here is relatively short length (36bp), as compared to current HiSeq standard (50bp), the choice to go single end will likely lose you only 5-10% of genes. This proportion of genes lost would be lower still at 100bp length. Unless you're interested in performing splicing analysis, my recommendation would be for 100bp SE RNA-seq with 40M reads per sample. I'd be interested in your own experiences with SE and PE RNA-seq.

Friday, 1 January 2016

2015 Wrap-Up

Its that time of year again where we can reflect on the year that was, hit the reset button and focus on the trend that will dominate 2016.

Sequencing hardware

This time last year, we welcomed the NextSeq500, NeoPrep and updates to HiSeq2500. Throughout the year there were further announcements from Illumina on the HiSeq4000 and the technology appears to be improving incrementally from here on. Indeed while cluster numbers have improved, there have only been modest improvements in read length and pricing in Australian dollars has not decreased as substantially as we may have predicted. I'm really excited about the developments in 3rd gen technology coming from Oxford Nanopore and Pacific Biosciences in that the read lengths and accuracy are starting to improve. The growing user base is definitely spurring improved basecalling and error correction algorithms that will further increase the user base. Metagenomics has really been the main beneficiary of 3rd gen long read seq and that tight community should take credit for the improvements that have taken place. My prediction for 2016 is that in one way or another long read sequencing will start to take a big chunk of market share from Illumina. Long read sequencing will enable de novo assembly of individual whole human genomes with high accuracy. I predict a challenging year for Illumina unless they come to the party with better long reads.

Genome research

The two major trends of 2015 were genome editing with CRISPR/Cas9 and single cell genomics. CRISPR/Cas9 has matured to become a relatively easy and cheap way to generate accurately targeted genomic knock-outs, knock-ins and even control chromatin modifications. My view is that every serious molecular biology lab should have this in their tool kit. In the epigenetics and chromatin biology space that I work, I expect that CRISPR/Cas9 modification of chromatin marks will become a standard feature in high impact papers in order to demonstrate a functional requirement of these marks in development and disease. Single cell genomic profiling approaches have included RNA, DNAse hypersensitivity, DNA methylation and chromatin mark profiling. Especially in the transcriptomics space, I see Drop-Seq being implemented in studies that unravel the mechanisms of disease in defined cell populations. Labs that resist this trend will start getting left behind in the ever competitive arena of grant funding. Perhaps the 3rd growing trend of 2015 was personalised medicine. Growing size of exome and genome projects around the world are producing huge amounts of data that we are just starting to understand. The understanding of these sequences with respect to health and complex disease outcomes will only be apparent in decades when the health outcomes are fully known. My view is that we should invest more money to sequence genomes of centenarians and people that have died of a defined cause that is a major health priority (heart disease, stroke, cancer, dementia, etc). Maybe RNA-seq will become a high-end molecular check-up?

Bioinformatics and academia

I've noticed some strong trends in this space that's worth discussing. The first is "Good Bioinformatics Practise".  Some best practises that you should adopt in your 2016 resolution are:
  • Learn from the best - software carpentry, booksRosalind assignments. Advance your techniques, use the right tools for the job and develop your weaknesses.
  • Version control ALL code. Turn your scripts into tools. Develop throw away code into programs that are really good at doing just one thing. Publish your code in Github and promote your work to attract a larger user base that will help eliminate bugs.
  • Keep a notebook. Just as a lab scientist keeps a daily book of experiments and results, you should keep notes on the progress of code and data analyses. Try iPython notebook or R Markdown. Try an electronic lab book and share it with your supervisor.
  • Think end-to-end. Resist the urge to bring your data into Excel for plotting. Try it in R or Python. End-to-End facilitates reproducible analysis that is easy to incorporate later improvements. Use functions throughout the code to simplify each task making each part easier to debug. 
  • No more genomics papers without code supplied. The phrase "custom scripts were used" should be replaced by "all scripts to reproduce the figures and tables here is deposited on GitHub".
Some other trends included use of lightweight RNA-seq mappers (Sailfish, Kallisto, etc) that are faster and require less RAM than their conventional aligners. Cloud analysis - less reliance on own server infrastructure in favour of cloud-based solutions.

In academia, there is a split emerging between late career scientists (professors) and early career scientists (ECS). Obviously this doesn't characterise 100% of academics out there, just my personal point of view. Let me demonstrate some examples:
  • Selecting a journal: Prof wants a high impact factor journal, despite the fact it could take years to get an acceptance. Same paper submitted to several journals with each review costing 2-6 months. The delays are detrimental to the first author ECS, with increasing chance of being "scooped". ECS want to publish it where it will be open access irrespective of JIF, and let the community judge its quality and allow it to accumulate citations immediately (think arXiv). Sadly JIF is still used as a proxy of quality by many competitive grant reviewers.
  • Social media: Prof couldn't care less about SM distractions. ECS see SM as a way to disseminate own work and opinions to distinguish their portfolio from their supervisors. ECS use SM to keep up-to-date with the latest developments in their field. ECS use SM to develop new collabs and develop original projects.
  • Funding and career progression: Average age of grant winners increasing. Overall research proposals become more conservative. More grants going to fewer "super lab heads". These super lab heads don't have the time to actually contribute to any of the funded projects under way, rather focus purely on gaming the system in the next grant app. Chance of ECS ever winning a competitive grant diminishes as track-record (JIF collection) become ever-more important as opposed to the originality and quality of the concept in the grant app. Rather Profs poach project ideas conceived by ECS and omit ECS from the list of co-investigators (co-investigator list consists purely of other "super lab heads"). ECS even outright ghost-write grant apps for Profs. ECS have their apps rejected in any year to find them plagiarised (accepted) in the next funding round.
  • Writing a manuscript: ECS write the entire paper, while Prof provides minimal edits amounting to word salad. ECS teams work on initial drafts MS with cloud docs efficiently. Prof insists on email attachment ping-pong and gets constantly mixed up as to which MS is the current version. Prof insists on data "hokey-pokey", including new work and then taking it out, then putting it in again and shaking it about, then taking it out finally.

Reflections on Genome Spot

2015 was a highly successful year for the blog. I focused again on methods for RNA-seq, transcriptomics and practical tips to get things done. Here are some of my favourite posts:

All time clicks grew from 20k to 62k and we saw greater participation of Twitter users and many more comments this year than all previous years combined. In 2016 you can expect posts along the same themes. I'll be looking at circular RNAs, state of the art tools in pathway analysis, harnessing datasets such as GTEx and IHEC to make new gene set databases and developments related to Digital Expression Explorer.

My 2015 in academia

Overall my papers in 2015 were pretty modest, I worked with my PhD students and external collaborators to publish a few papers related to plant biology. Behind the scenes I've been working on developing my own portfolio including Digital Expression Explorer and methods in MicroRNA alignments - these should be published early in 2016.

To all my readers, commenters, collaborators, colleagues, thanks for your support in 2015. Have a safe, happy and prosperous 2016!