Share and back up data sets with Dat

If you work in genomics, you'll know that sharing large data sets is hard. For instance, our group has shared data with our collaborators in a number of ways:

DVDs, hard drives and flash drives
FTP
Hightail
Google Drive links
Amazon links
SCP/PSCP
rsync

But none of these is ideal: data sets change over time, and none of the above methods is suited to updating a file tree with changes. If changes occur, the result quickly becomes a mess of files that are either redundant or missing entirely. Copied files can also become corrupted. What we need is a type of version control for data sets. That's the goal of dat.
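
To make the problem concrete, here is a minimal Python sketch of detecting which files in a shared tree were added, removed or altered, by comparing SHA-256 checksums. It only illustrates the bookkeeping that ad hoc copying leaves to you; it is not how dat works internally. The function names `sha256_of`, `build_manifest` and `compare_manifests` and the demo file names are made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(old, new):
    """Report files added, removed or changed between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }


if __name__ == "__main__":
    # Demo on a throwaway directory with hypothetical file names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "list").mkdir()
        (root / "table.tsv").write_text("gene\tcount\n")
        (root / "list" / "up.txt").write_text("geneA\n")
        before = build_manifest(root)

        # Simulate the data set changing over time.
        (root / "table.tsv").write_text("gene\tcount\ngeneA\t10\n")
        (root / "list" / "dn.txt").write_text("geneB\n")
        (root / "list" / "up.txt").unlink()
        after = build_manifest(root)

        print(compare_manifests(before, after))
```

Keeping a manifest like this by hand for every copy sent to every collaborator is exactly the chore that a version control tool for data is meant to remove.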

So now I'll take you through a simple example of sharing a data set using dat.

#Install instructions for Ubuntu 16.04
$ sudo npm cache clean -f
$ sudo npm install -g n
$ sudo n stable
$ sudo npm install -g dat

# Files I'm sharing on PC 1: DGE table and 3 genelists (3.4 MB)
$ tree
├── Aza_DESeq_wCounts.tsv
└── list
    ├── Aza_DESeq_wCounts_bg.txt
    ├── Aza_DESeq_wCounts_dn.…

Update on DEE2 project for Jan 2018

Today I'd like to share some updates on the DEE2 project, which I wrote about in an earlier post. The project code can be viewed on GitHub here.

Pipeline and images

As the pipeline was recently finalised, I was able to roll out the working Docker image. To accommodate users without root access, this image was ported to Singularity. This took a lot of effort and some expertise from our local HPC team to get things working (many thanks to the Massive/M3 team). The Singularity image is available from the webserver (link) and instructions for running it are available on GitHub here. I have started testing a heavyweight Singularity image that includes the genome indexes, which will be more efficient for running jobs with large genomes, and I will make it available once testing is complete.

Queue management

It may sound simple to write a script to determine which datasets have been completed and add new datasets to the queue, but when talking about tens of thousands of datasets it's …

Update on the DEE project Dec 2017

Back in 2015, our group described DEE, a user-friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where this project is going in future.

Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used the repository and gave suggestions, comments, bug reports, etc! However, our attempt to have the repository published wasn't so successful, due to reviewer niggles over what I consider minor points that are nonetheless hard to implement quickly. The main points raised by reviewers were:

Is it reasonable to treat all data sets as if they were single end? For this one, the reviewers were split: one said it was OK and the other was adamant that it was unacceptable despite my …

Diagnosing PCR duplicates from cluster duplicates

NovaSeq, HiSeqX and HiSeq4000 Illumina sequencers have patterned flowcells, which use a different clustering chemistry from randomly clustered flowcell systems (HiSeq2500 & MiSeq). This chemistry is known to cause duplicates during the clustering process. For some background on the issue, see these previous blog posts:

QC Fail blog by Steve Wingett
Enseqlopedia blog by James Hadfield

In my recent whole genome bisulfite sequencing experiment using TruSeq methylation library prep kits and NovaSeq, I noticed a high proportion of duplicate reads and wanted to investigate whether these were "cluster" duplicates, i.e. generated during the clustering process due to ExAmp chemistry, or were duplicates generated during the PCR step. Generally, cluster duplicates occur in the immediate proximity of each other on the flowcell surface, whereas PCR duplicates are expected to occur uniformly throughout the flowcell surface.
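
The proximity idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the method used by any particular tool: it assumes read names follow the usual Illumina layout `instrument:run:flowcell:lane:tile:x:y`, it only compares reads on the same lane and tile (ignoring neighbouring tiles), and the `max_distance` threshold of 2500 is an arbitrary placeholder rather than a recommended value. The function names and read names below are made up.

```python
import math
from itertools import combinations


def parse_read_name(name):
    """Extract (lane, tile, x, y) from instrument:run:flowcell:lane:tile:x:y."""
    # Drop a leading '@' and any comment after whitespace (e.g. '1:N:0:ATCACG').
    fields = name.lstrip("@").split()[0].split(":")
    lane, tile, x, y = (int(value) for value in fields[3:7])
    return lane, tile, x, y


def classify_pair(name_a, name_b, max_distance=2500):
    """Label a duplicate pair by flowcell proximity.

    'likely_cluster' if both reads sit on the same lane and tile within
    max_distance of each other (placeholder threshold), else 'likely_pcr'.
    """
    lane_a, tile_a, xa, ya = parse_read_name(name_a)
    lane_b, tile_b, xb, yb = parse_read_name(name_b)
    if (lane_a, tile_a) != (lane_b, tile_b):
        return "likely_pcr"
    distance = math.hypot(xa - xb, ya - yb)
    return "likely_cluster" if distance <= max_distance else "likely_pcr"


if __name__ == "__main__":
    # Hypothetical names of reads that share strand and mapping coordinates.
    duplicates = [
        "@A00001:1:HFAKE:1:1101:1000:2000 1:N:0:ATCACG",
        "@A00001:1:HFAKE:1:1101:1500:2300 1:N:0:ATCACG",
        "@A00001:1:HFAKE:2:2204:30000:41000 1:N:0:ATCACG",
    ]
    for read_a, read_b in combinations(duplicates, 2):
        print(read_a.split()[0], read_b.split()[0], classify_pair(read_a, read_b))
```

A real analysis would look at the whole distribution of distances between duplicates rather than a single cutoff.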
To diagnose this, I used the diagnose-dups tool by Dave Larson, which can be found on GitHub here. I wr…

Considerations in performing whole human genome bisulfite sequencing on the Illumina NovaSeq system

Today at the NGS workshop at WEHI, Melbourne, I presented some findings related to a pilot study of 12 methylomes studied with whole genome bisulfite sequencing. Two of those libraries were also sequenced on the HiSeq4000 platform to similar depth, so there were some subtle but interesting differences between the systems. What we found was that the actual sequence coverage obtained was substantially less than that projected, due to 2 problems. Firstly, the insert size was too small - which looks like it could be due to the inner workings of the Illumina TruSeq methylation kit. Secondly, there was a high proportion of duplicate reads observed - that is, reads with the same strand and coordinates, which are likely not independent observations. I will need to look in further detail at whether these are PCR duplicates or "cluster" duplicates. Perhaps the library prep or clustering protocols need some tweaking for bisulfite sequencing.

So as promised, here is the link to the slides.

UpSet plots as a replacement for Venn diagrams

I previously posted about different ways to obtain Venn diagrams, but what if you have more than 4 lists to intersect? Venn diagrams then become messy and hard to read. One alternative which has become popular is the UpSet plot. There is an excellent summary of the philosophy behind this approach in this article and academic paper here. An example plot is below:

In this post, I'll describe how to take lists of genes in text files and present their intersections as an UpSet plot using R. As with most R packages, you'll find that loading in the data is the hardest part, and that data import is the least documented aspect.
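
Before getting to R, it may help to see what the plot needs from the data. The sketch below is a language-neutral illustration written in Python, not the R workflow of this post: it reads gene lists from text files (one gene per line), builds a 0/1 membership table, and counts the genes in each exclusive combination of lists, which are the bar heights of an UpSet plot. The function names and the three example lists are made up; the gene identifiers come from the listing further down.

```python
from collections import Counter


def read_gene_list(path):
    """Read one gene identifier per line, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def membership_table(gene_sets):
    """Return (set names, {gene: tuple of 0/1 flags in name order})."""
    names = sorted(gene_sets)
    all_genes = sorted(set().union(*gene_sets.values()))
    table = {
        gene: tuple(int(gene in gene_sets[name]) for name in names)
        for gene in all_genes
    }
    return names, table


def intersection_sizes(gene_sets):
    """Count genes belonging to exactly each combination of sets."""
    names, table = membership_table(gene_sets)
    counts = Counter(table.values())
    return {
        tuple(name for name, flag in zip(names, flags) if flag): size
        for flags, size in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical lists; in practice each would come from read_gene_list().
    gene_sets = {
        "list1": {"ENSG00000000003_TSPAN6", "ENSG00000000005_TNMD", "ENSG00000000419_DPM1"},
        "list2": {"ENSG00000000005_TNMD", "ENSG00000000419_DPM1", "ENSG00000000457_SCYL3"},
        "list3": {"ENSG00000000419_DPM1", "ENSG00000000460_C1orf112"},
    }
    for combination, size in sorted(intersection_sizes(gene_sets).items()):
        print(" & ".join(combination), size)
```

The membership table (one row per gene, one 0/1 column per list) is the general shape of input that UpSet-style plotting functions tend to expect, which is why getting the lists into that form is most of the work.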

First I'll generate some random gene lists using a quick and dirty shell script. My complete list contains 58302 genes and looks like this:
$ head -5 Homo_sapiens.GRCh38.90.gnames.txt
ENSG00000000003_TSPAN6
ENSG00000000005_TNMD
ENSG00000000419_DPM1
ENSG00000000457_SCYL3
ENSG00000000460_C1orf112
This is the script which generates random subsets of genes with the suffix &quo…

Minitalk: Understanding gene regulation in complex disease with deep sequencing

Today I gave a presentation on experiment design and the use of ChIP-seq and MBD-seq to understand gene regulation. The target audience consisted of biomedical scientists who had little background in genomics but were curious about incorporating deep sequencing into their studies.

Link to the slides HERE.

As always I love getting feedback - so leave your questions and comments below!