Monday, 30 March 2015

Quantifying repeat sequences from short read next gen sequencing data

Repetitive sequences pose significant challenges for the analysis of short read sequence data. Any read shorter than its repeat unit cannot be uniquely placed in the genome. At a given read length, the genome can therefore be divided into mappable and unmappable compartments. For analyses such as gene expression and chromatin profiling, we generally ignore those unmappable regions and discard reads that map to multiple positions. But what if we were actually interested in the abundance of repeat sequences in the dataset? How could it be quantified with short read sequencing? And why is that even relevant?

The "Why"

About 50% of the human genome consists of repetitive DNA (Ref). While the function of these elements in many cases remains unknown, some have been well defined. Alu elements have been shown to be a birthplace of enhancers (Ref). Endogenous retroviruses have been domesticated to boost the antiviral response in B cells after exposure to T cell–independent type 2 (TI-2) antigens (Ref). Repeat elements can be found embedded in fusion transcripts (Ref) or expressed alone in somatic tissues (Ref). These elements are normally repressed by DNA methylation and other chromatin marks (Ref,Ref). Quantification of these elements in MeDIP-seq, ChIP-seq and mRNA-seq could lead to discoveries in how epigenetic processes regulate repeat expression. Quantification of these elements in genome resequencing projects could reveal differences in the repeat contents of individuals.

The "How"

The traditional method for identifying repeat DNA is RepeatMasker (RM), which uses a BLAST-based search engine (RM-BLAST) by default. This is fine for a relatively small number of relatively long sequences, but it is not well suited to the millions of short reads generated by today's deep sequencing (NGS) platforms. So in this post, I'll compare the accuracy of RM and NGS aligners for quantifying repeat sequences from short reads.

Test data set

I generated simulated reads from the entire RM repeat library using a method I have used previously. In addition, I generated a set of simulated reads from the E. coli genome as a background set. In total I had 16,558,094 reads, but this would take too long to analyse with RM, so I used "shuf -n" to select one million reads for analysis. Here is what the reads look like; note that the header can be used to match reads to contig names in the alignment results.
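The subsampling step can be sketched like this (a minimal sketch with a toy file and two-line FASTA records; the real run sampled one million of 16,558,094 reads):

```shell
# Toy stand-in for the simulated read set (real run had 16,558,094 reads).
printf '>r1\nAAAA\n>r2\nCCCC\n>r3\nGGGG\n>r4\nTTTT\n' > simreads.fasta

# Flatten two-line FASTA records, sample N at random, restore the layout.
# In the post N was 1000000; here N=2 for illustration.
paste - - < simreads.fasta | shuf -n 2 | tr '\t' '\n' > subsample.fasta
```

The paste/tr trick keeps each header with its sequence, so shuf samples whole records rather than orphan lines.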


Simulated reads results

Then I used some bash tools to check whether reads were mapped to the correct class of repeats.
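Those checks can be sketched with awk on the SAM output; the header scheme here (true repeat name before an underscore) is an assumption for illustration:

```shell
# Toy SAM body: read names encode the true repeat (this header scheme is an
# assumption about the simulated reads); column 3 is the aligner's choice.
printf 'AluY_1\t0\tAluY\t1\t60\t36M\t*\t0\t0\tACGT\tIIII\n'  > aln.sam
printf 'AluY_2\t0\tL1PA2\t1\t60\t36M\t*\t0\t0\tACGT\tIIII\n' >> aln.sam
printf 'L1HS_1\t0\tL1HS\t1\t60\t36M\t*\t0\t0\tACGT\tIIII\n'  >> aln.sam

# Tally reads whose assigned reference matches the repeat named in the header.
awk -F'\t' '{split($1,a,"_"); if (a[1]==$3) ok++; else bad++}
  END {print ok" correct, "bad" incorrect"}' aln.sam
# prints "2 correct, 1 incorrect"
```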

So it looks like BWA and Bowtie2 yield a greater proportion of correctly mapped reads than RM with default settings. The number of incorrectly mapped reads remained relatively low for BWA and Bowtie2, given how challenging repeat DNA is to map. Then I looked in more detail at the main repeat elements in human (LINE-1, Alu and LTR).

These data show that identification of repeat sequences was consistent across the three approaches.

Effect of DNMT inhibitor azacitidine on repeat abundance in RNA-seq data

So now I want to see how highly these repeat DNA classes are detected in RNA-seq data. The data set I chose is one I've used before: an mRNA-seq experiment examining the effect of the DNA methylation inhibitor 5-azacitidine (Ctrl vs Aza). I used a fastq quality trimmer to remove unreliable reads, aligned them to the human-specific repeat library with BWA, aggregated the reads by repeat class, performed reads-per-million normalisation and then some rudimentary differential analysis.
Detection of repeat sequences in RNA-seq data using BWA alignment to human-specific repeat library.
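The per-class reads-per-million step can be sketched with awk on a toy count table (made-up numbers, not the real data):

```shell
# Toy class-level count table (class, ctrl reads, aza reads; made-up numbers).
printf 'Alu\t500\t400\nL1\t300\t500\nSVA\t200\t100\n' > counts.tsv

# Reads-per-million: scale each column by its library total, times 1e6.
# First pass sums the columns, second pass prints the scaled values.
awk -F'\t' 'NR==FNR {t1+=$2; t2+=$3; next}
  {printf "%s\t%.1f\t%.1f\n", $1, $2/t1*1e6, $3/t2*1e6}' counts.tsv counts.tsv
```

Reading the file twice with NR==FNR is a common awk idiom for a two-pass calculation without a temporary file.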

These data show a trend of decreasing SVA retrotransposon and increasing MIR and TcMar-Mariner abundance with azacitidine treatment. To check this more thoroughly, I will need to append this repeat-centric count matrix to the gene-wise count matrix and analyse it with edgeR or DESeq. I was really expecting a more drastic change in repeat content from azacitidine, because my previous analysis found >6000 differentially expressed genes with edgeR. Specifically, I was looking for an increase in the abundance of LTR and LINE-1 elements, as that would be consistent with a loss of methylation due to azacitidine-mediated DNMT inhibition.


Despite the lack of a drastic change in repeat abundance, these results show that RepeatMasker and BWA are both valid methods for detecting and quantifying repeats from reads as short as 50 bp. Given that BWA and other NGS aligners are an order of magnitude faster than RepeatMasker, they should be useful for routine fast screening of repeat sequences in RNA-seq, ChIP-seq and other data.

Friday, 20 March 2015

Paired-end fastq quality control with Skewer

Quality trimming and adapter clipping paired-end reads is a tricky business. For paired-end alignments to work, the order of the sequences in both files (forward and reverse) needs to be retained. While there are plenty of open-source read trimmers available (think FASTX-Toolkit, SeqTK, Cutadapt, etc.), I haven't found many that:

  • Retain pairing info
  • Run in parallel and are fast
  • Are easy to set up and run
  • Have good docs

Then I fell in love with Skewer. It smashed through 14.5 million gzipped read pairs, doing adapter clipping and quality trimming in two minutes on my 8-core workstation. It auto-detects quality encoding so you can safely analyse any Illumina data. Awesome!

$ skewer -t 8 -q 20 SRR634969.sra_1.fastq.gz SRR634969.sra_2.fastq.gz
Parameters used:
-- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
-- maximum error ratio allowed (-r): 0.100
-- maximum indel error ratio allowed (-d): 0.030
-- end quality threshold (-q): 20
-- minimum read length allowed after trimming (-l): 18
-- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
-- number of concurrent threads (-t): 8
Thu Mar 19 22:19:26 2015 >> started
|=================================================>| (100.00%)
Thu Mar 19 22:21:31 2015 >> done (124.590s)
14555046 read pairs processed; of these:
   35892 ( 0.25%) short read pairs filtered out after trimming by size control
   21534 ( 0.15%) empty read pairs filtered out after trimming by size control
14497620 (99.61%) read pairs available; of these:
 5122437 (35.33%) trimmed read pairs available after processing
 9375183 (64.67%) untrimmed read pairs available after processing
log has been saved to "SRR634969.sra_1.fastq-trimmed.log".

Thursday, 19 March 2015

HISAT vs STAR vs TopHat2 vs Olego vs SubJunc

HISAT is a brand-new RNA-seq aligner that promises great speed with a low memory footprint. The Nature Methods paper is worth a look.

So I thought I'd give it a test run with some simulated data to check its accuracy compared to other aligners.

I generated synthetic 100 bp reads based on Arabidopsis cDNAs and then introduced mutations with msbar (from the EMBOSS suite).

I then aligned these reads to the Arabidopsis genome with default settings and used featureCounts to tally correctly and incorrectly assigned reads at a mapping quality threshold of 20.

Here is the result for unmutated (perfect) 100 bp reads.

Table 1. Alignment of simulated 100 bp cDNA sequences to the Arabidopsis genome.
HISAT shows similar accuracy to TopHat2 but was not as accurate as STAR.

Now here are the results of the mutation experiment.

Figure 1. Accuracy of mapping mutated 100bp reads. Left hand side graphs show the correct mapping rates and right hand side shows incorrect mapping rates. Top panels show effect of single nucleotide changes, middle panels show effect of single base insertions and lower panels show single base deletions.
A table of full results is shown at the bottom of the post. These results show that STAR consistently scores the greatest number of correctly assigned reads in these tests while keeping incorrectly assigned reads down below 0.1%. Olego and TopHat2 produce the fewest incorrectly assigned reads, but are quite slow. HISAT will still be very useful when speed and memory footprint are a concern.

Table 2. Results of the mapping mutated 100bp reads to the Arabidopsis genome (data shown in Fig1)

# HISAT
for FQZ in *fasta.gz ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dk $FQZ
hisat -p 8 -f -x $X -U $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_hisat.sort
rm $FQ
done

# Olego
for FQZ in *.fasta.gz ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dk $FQZ
olego -t 8 $IDX $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_olego.sort
rm $FQ
done

# STAR
for FQZ in *.fasta.gz ; do
pigz -fdk $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
STAR --readFilesIn $FQ --genomeLoad LoadAndKeep \
--genomeDir $IDX --runThreadN 8
SAM=${FQ}.STAR.sam
mv Aligned.out.sam $SAM
( samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort
rm $SAM )&
rm $FQ
done

# Subjunc
for FQZ in *.fasta.gz ; do
pigz -fdk $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
SAM=${FQ}.subjunc.sam
subjunc -T 8 -i $IDX \
-r $FQ -o $SAM
( samtools view -uSh $SAM \
| samtools sort - ${SAM}.sort
rm $FQ $SAM )&
done

# TopHat2
for FQZ in *fasta.gz ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dk $FQZ
tophat2 -p 8 -o ${FQ}.tophat $IDX $FQ
rm $FQ
done

Monday, 16 March 2015

Count ChIP-seq reads across a promoter with featureCounts

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation, especially for histone marks that are normally associated with gene activation, such as H3K9 acetylation or H3K4 trimethylation.

featureCounts has emerged as a competitor to HTSeq and BEDTools multicov for counting reads across features (i.e., exons, genes, promoters). featureCounts is great for RNA-seq because it can natively read GTF annotation files, but it can't read the BED format that we use a lot in ChIP-seq analysis.
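If your regions are already in BED format, one workaround is a quick awk reshape into SAF; a sketch (note that SAF is 1-based, and the GeneID and strand columns here are placeholders):

```shell
# Toy BED file (0-based, half-open intervals).
printf 'chr1\t10413\t11207\nchr1\t28687\t29634\n' > regions.bed

# SAF columns: GeneID, Chr, Start (1-based), End, Strand.
# GeneID is synthesised from the coordinates; "+" is a placeholder strand.
awk '{OFS="\t"} {print $1":"$2"-"$3, $1, $2+1, $3, "+"}' regions.bed > regions.saf
```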

In order to make featureCounts work, we need to extract the TSS coordinates and convert them to a BED-like format that it can read (SAF format). In the script below, I extract the positions of the TSSs using a grep search, strip only the necessary information from the GTF, then use awk and BEDTools slop to expand the region around each TSS by 3 kbp.

#Generate an SAF file for TSS intervals from a GTF

#Specify some parameters (example values; substitute your own files)
GTF=gencode.gtf          # gene annotation in GTF format
CHR=genome.chrom.sizes   # chromosome sizes file for bedtools slop
TSS=3000                 # bp to extend either side of each TSS
FWD=tss_fwd.saf ; REV=tss_rev.saf
SAF=tss.saf ; SAFCOMPL=tss_compl.saf

#Extract TSS coordinates for fwd strand genes
grep -w transcript $GTF | awk '$7=="+"' \
| cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$2,$2+1,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"+"}' > $FWD

#Extract TSS coordinates for reverse strand genes
grep -w transcript $GTF | awk '$7!="+"' \
|  cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$3-1,$3,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"-"}' > $REV

cat $FWD $REV > $SAF
rm $FWD $REV

#Account for the non-promoter regions
cut -f2-4 $SAF | sortBed | mergeBed -i - \
| complementBed -i - -g $CHR \
| awk '{OFS="\t"} {print "NA_"$1":"$2"-"$3,$0,"+"}' \
| cat $SAF - > $SAFCOMPL

The last part of the script generates SAF coordinates for complement regions, ie., the parts of the genome that are not located near TSSs. This is useful because sometimes the promoters only contain a small fraction of the original data. For my H3K9ac ChIP-seq, only 10-20% of reads are placed near promoters. When including complement regions, I'm able to include over 80% of reads.
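That fraction can be computed from a featureCounts summary with awk; a sketch using just two status rows from the promoter-only run (a real summary has more rows, which should all go into the denominator):

```shell
# Two status rows from the promoter-only summary (status, count).
printf 'Assigned\t6610765\nUnassigned_NoFeatures\t40595847\n' > summary.tsv

# Percent of these reads that landed in promoter intervals.
awk -F'\t' '{t+=$2; if ($1=="Assigned") a=$2}
  END {printf "%.1f%%\n", 100*a/t}' summary.tsv
# prints "14.0%"
```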

Here's how I use featureCounts to count reads across meta-features (if a gene has many TSSs, the counts are aggregated). You'll notice that I run featureCounts twice: once without the complement regions and once with them.

featureCounts -Q 20 -T 20 -F SAF -a $SAF -o $OUT *bam

featureCounts -Q 20 -T 20 -F SAF -a $SAFCOMPL -o $OUTC *bam

Here's the result of the promoter only counts:
Status Sequence1.bam
Assigned 6610765
Unassigned_Ambiguity 1162622
Unassigned_MultiMapping 0
Unassigned_NoFeatures 40595847
Unassigned_Unmapped 1783674
Unassigned_MappingQuality 9157142
Unassigned_FragementLength 0
Unassigned_Chimera 0

Now the result of the promoter & complement counts:
Status Sequence1.bam
Assigned 47112044
Unassigned_Ambiguity 1257190
Unassigned_MultiMapping 0
Unassigned_NoFeatures 0
Unassigned_Unmapped 1783674
Unassigned_MappingQuality 9157142
Unassigned_FragementLength 0
Unassigned_Chimera 0

FeatureCounts can also be run at a feature-level to output counts for each TSS of each gene by adding the "-f" switch.

Now the tabulated reads are ready for differential enrichment analysis with edgeR or similar.

Using Named Pipes and Process Substitution

Little-known UNIX features to avoid writing temporary files in your data pipelines, explained by Vince Buffalo in his digital notebook: introducing named pipes and process substitution.
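In short, process substitution lets a command read a pipeline's output as if it were a file, while a named pipe connects two independent processes without a temporary file; a minimal sketch (bash required for the `<( )` syntax):

```shell
# Process substitution: a pipeline's output stands in for a file, so no
# temporary files are needed when comparing two sorted streams.
printf 'b\na\n' > f1.txt
printf 'a\nb\n' > f2.txt
diff <(sort f1.txt) <(sort f2.txt) && echo "same contents"

# Named pipe: writer and reader run concurrently; nothing hits the disk.
mkfifo mypipe
gzip -c f1.txt > mypipe &
gzip -dc mypipe > roundtrip.txt
wait
```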

Thursday, 12 March 2015

Are we ready to move beyond MSigDB and start a community-based gene set resource?

Gene sets are distilled information about molecular profiling experiments, and can also be generated from other features shared by groups of genes, such as chromosomal position, sequence, co-regulation and functional information.

These are a valuable resource because they suggest similarities between different molecular profiling experiments or phenomena, and lead researchers towards understanding the factors that drive the trends in profiling experiments such as gene expression assays by microarray or RNA-seq.

To truly grasp the importance of quality gene sets, consider that the original paper describing the GSEA algorithm has accumulated 3144 citations since 2003, while the paper describing the software and wider applicability of GSEA has 7166 citations. The latter paper has also attracted positive comments from experts in the field on PubMed, here is one that I couldn't agree with more. In the words of Rafael Irizarry, "The idea of analyzing differential expression for groups of genes, as opposed to individual genes, was an important step forward in the analysis of gene expression data."

How it currently works

MSigDB (The Molecular Signatures Database) is the gene set resource that powers GSEA and many other pathway analyses. MSigDB is a collection of annotated gene sets that currently consists of 10,295 human gene sets from a range of sources. See Table 1.
Table 1. Human gene sets from MSigDB in version 4.0.
As you can see, the folks at MSigDB have done a huge amount of work collecting gene sets from a variety of sources, including several that are empirically defined from microarray or other profiling experiments, as well as gene ontology sources and bioinformatically generated sets. There is some literature recommending how gene sets should be made, but this is far from being a community-wide consensus.

In addition to MSigDB, you can find gene sets in other places too. Here is a non-exhaustive list.
  • ChEA, an online resource of ChIP-derived transcription factor targets
  • GeneSetDB, a collection of sets from a variety of sources
  • CMAP, a collection of gene profiles of human cells treated with bioactive small molecules
  • WikiPathways, a community-powered resource of signalling and biochemical pathways
  • Database of Microarray Marker Genes, a microarray analysis portal
  • CleanEX, a gene symbol-based microarray data analysis portal
  • PAGED, a collection of human disease-centric gene sets
  • Enrichment Map Gene Sets, human and mouse gene sets from a variety of sources
  • Tables and supplementary information in journal articles

You can also generate them yourself in a number of ways:
  • Through GEO2R for microarray expression studies (example)
  • Mining processed data such as ENCODE project or Epigenome Roadmap Project (example)
  • By performing the entire analysis again for other types of studies including proteomic, genomic, transcriptomic
  • Based on features of protein or gene sequence or structure
  • Meta-analysis of any of the above (example)

What we should be doing better

  • Establish solid guidelines for how researchers should go about generating gene sets of their own.
  • Include non-coding RNAs, especially microRNAs, which have been sadly neglected
  • Make a one-stop-shop for all species, not just human.
  • Demonstrate provenance: How was the data processed? How were the statistics done? Who generated the gene sets? The gene set is an important artefact that is used to guide researchers, so why don't gene sets have their own methods section attached?
  • Promote reproducibility: could they be generated again today? What's stopping us from requiring, as a condition of distribution, that the computer code used to generate a gene set be posted so the set can be reproduced?
  • Boost the update cycle: new data is deposited on GEO and other databases daily, but new gene sets are only being added sporadically. For example MSigDB v4.0 was last updated May 31, 2013. In this fast-paced world of genomics, we need to stay informed of recent studies without needing to undergo primary analysis for all of them. This also extends to when genome annotations are updated and modified, so that when new genes are discovered, the gene sets themselves can be updated too.
  • Foster community support: The generation of gene sets is dominated by MSigDB curators. While they have undoubtedly done a splendid job with MSigDB, this means that the rest of the genomics community takes a back seat and doesn't actively participate in submitting new gene sets. Furthermore, the processes used to evaluate the validity of gene sets aren't transparent, and perhaps the process requires further rigour and standardisation.
  • Give due credit: Insufficient credit is given to the people that curate these gene sets. Post-analysis of data and generation of gene sets is, for many journals, considered to be a trivial exercise, so there is little incentive for researchers to generate and share them, despite their community value.
In many ways, the above points are a symptom of how things were done in 2006 when MSigDB was first published.

So what is the solution?

There could be a few different solutions; it could be as simple as a wiki. Wikis work, and they have been used for similar cases such as WikiPathways and WikiGenes, but I feel that the amount of work required to curate and maintain these gene sets is large enough that most researchers wouldn't be sufficiently motivated to curate and submit them. Moreover, wikis are known to be susceptible to vandalism (both well-meaning and malicious) and spam. As these gene sets are to be used by other researchers, quality control needs to be the primary consideration.

The solution is a specialist peer-reviewed journal! 

A peer-reviewed journal can set the standard for valid methods of generating gene sets. Submissions could follow a strict template to minimise the work of editors. The source data and methods would be completely and clearly described, and the code housed in a GitHub repository. Gene set authors' efforts would be rewarded by (rapid) open access publication of their gene sets, and they (along with the authors of the original study) would be cited when those gene sets appear in subsequent studies.

Rather than being controlled by a small set of human genetics researchers in the USA, the journal would service a global community studying all domains of life. The scope need not be limited to gene sets and could evolve according to the dominant analytical methods of the time; such as gene ranks. Moreover, the data may not be limited to genes, and could include proteins, carbohydrates, lipids and other biomolecules. 

For this concept to get off the ground, we will need an experienced and dedicated team to perform a range of tasks, so contact me if you are interested in volunteering as:

  • Editorial board member
  • Peer reviewer
  • Copy editor
  • Proofreader
  • Website developer/maintainers
  • Database developer/curators
  • Author!

Most of all, we need to spread the word about how important quality molecular signatures are to understanding biology. We need to promote the concept: tell your professor, share on social media, tell anyone!

We also need a journal title - so I'll put it out there for you to suggest!

Tuesday, 3 March 2015

Extracting specific sequences from a big FASTA file

Say you have a huge FASTA file, such as a genome build or cDNA library; how do you quickly extract just one or a few desired sequences?

Use samtools faidx to extract a single FASTA entry 

First index, then you can extract almost instantaneously.
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
real 0m37.422s
$ time samtools faidx Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa MT
real 0m0.003s

Use bedtools getfasta to extract portions of a FASTA entry

Requires the regions of interest to be in BED format.
$ head Homo_sapiens.GRCh38_CpG_Islands.bed
1 10413 11207
1 28687 29634
1 51546 51882
1 121270 121549
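Note that BED intervals are 0-based and half-open, so an interval's length is simply end minus start; for example, for the first two islands above:

```shell
# BED intervals are 0-based and half-open, so length = end - start.
printf '1\t10413\t11207\n1\t28687\t29634\n' > cpg.bed
awk -F'\t' '{print $3-$2}' cpg.bed
# prints 794 then 947
```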

The sequences of interest are extracted like this:
$ bedtools getfasta -fi Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -bed Homo_sapiens.GRCh38_CpG_Islands.bed -fo Homo_sapiens.GRCh38_CpG_Islands.fasta

Make a BLAST database to access bunches of sequences quickly

Note: you will need to download and install the BLAST+ package from NCBI or via Ubuntu software centre.
It is compatible with protein and DNA sequences, but you'll need to specify that accordingly.
$ makeblastdb -in Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -input_type fasta -parse_seqids
Adding sequences from FASTA; added 194 sequences in 72.3594 seconds.

You can extract sequences singly:
$ blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -entry MT
>lcl|MT dna_sm:chromosome chromosome:GRCh38:MT:1:16569:1 REF

Or in a batch using a file like this.
$ cat Contigs.txt 

Then call the file like this.
$ time blastdbcmd -db Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -dbtype nucl -entry_batch Contigs.txt

real 0m0.149s


Monday, 2 March 2015

Sambamba vs Samtools

Samtools has been the main tool for reading and writing aligned NGS data since the SAM alignment format was first proposed. With the amount of NGS data being generated increasing at a staggering rate, informatic bottlenecks are beginning to appear, and there is a rush to make bioinformatics as parallel and scalable as possible. Samtools went parallel only recently, and there is a new competitor called Sambamba, so I'll do a few quick tests to see how it stacks up against Samtools on our 32-core server with 128 GB RAM (standard HDD).

We have a 1.8 GB BAM file to work with.
$ du -sh *
1.8G  Sequence.bam

Now convert to SAM format, shuffling the read order so the file is unsorted.
$ samtools view -H Sequence.bam > header.sam
$ samtools view Sequence.bam | shuf \
| cat header.sam - > Sequence_shuf.sam

The SAM file is 9.9 GB. Let's try single-threaded SAM-to-BAM conversion and sorting with Samtools.
$ time samtools view -Shb Sequence_shuf.sam \
| samtools sort - Sequence_samtools.test

real 18m52.374s
user 18m30.619s
sys 1m5.952s

Now multithreaded with Samtools
$ time samtools view -Shb -@16 Sequence_shuf.sam \
| samtools sort -@16 - Sequence_samtools_prl.test

real 6m24.779s
user 20m36.528s
sys 2m31.423s

Now Sambamba which is natively parallel
$ time sambamba view -S -f bam Sequence_shuf.sam \
| sambamba sort -o Sequence_sambamba.bam /dev/stdin

real 7m9.870s
user 17m42.158s
sys 2m9.758s

Now to index the BAM files
$ time samtools index Sequence_samtools.bam

real 0m37.755s
user 0m36.321s
sys 0m1.351s

$ time sambamba index Sequence_samtools.bam

real 0m15.180s
user 0m48.981s
sys 0m6.354s

When converting SAM to sorted BAM, Sambamba was quicker than single-threaded Samtools, but not quicker than multithreaded Samtools in this test. What I did notice was that multithreaded Samtools used a lot of RAM, so Sambamba may be useful where RAM is more limiting. Disk I/O is potentially a limiting resource, so it would be interesting to test again with an SSD. Sambamba was much quicker at indexing BAM files, as advertised. Sambamba also offers a few interesting new features compared with Samtools and is also CRAM-file compatible.

Have you used Sambamba? Did it speed-up your work?