Pages

Thursday, 16 July 2015

Genome methylation analysis with Bismark


Bismark is currently the de facto standard for primary analysis of high throughput bisulfite sequencing data. Bismark can align the reads to the genome and perform methylation calling. In this post, I'll go through Illumina whole genome bisulfite sequence (WGBS) alignment and methylation calling using Bismark. First I want to mention that this post is just a summary, not meant to be a user manual or thorough troubleshooting guide. Fortunately, Bismark has some of the best documentation for any bioinformatics suite and is mandatory reading. The Bismark crew are very proactive with responding to user queries on various forums as well.

First step in getting Bismark to work is to index the genome, in this case with Bowtie2:

bismark_genome_preparation --bowtie2 /pathto/refgenome/

Conventionally, multiplexed libraries will be sequenced over a number of lanes. Resist concatenating or merging the smaller fastq files for each patient/sample until after the alignment, as the concatenated files will be huge for deeply sequenced samples (30x). Using smaller chunks of less than 100 million read pairs has the benefit of requiring smaller disk space requirements for intermediate files and server failure is likely to have a less detrimental impact.

After quality trimming of paired end reads with Skewer, the files have names such as this (yours may have a different naming convention and the script will need to be modified accordingly):

Sample1_lane01_1.fq-trimmed-pair1.fastq.gz
Sample1_lane01_1.fq-trimmed-pair2.fastq.gz

The following script cycles through the files with the "pair1.fastq.gz" suffix and finds the corresponding read2, uncompresses the gzipped fastq, and uses bowtie2 alignment after C>T/A>G conversion. The script uses a non-directional setting but this will depend on the exact library preparation method. If unsure, just use a million or so read pairs to test directionality of the seq data before running these long jobs. The script will generate gzipped sam files for each input read pair.

It is really important to note that while I set parallelism (number of threads) to 4 for bowtie2 alignment, Bismark actually runs 4 instances of Bowtie2 in parallel (4x4=16), so its important not to overload your system with too many jobs in parallel. The script below will run a maximum of 2 bismark jobs on any server (total of 32 cores). This will need to be adjusted for your circumstance

#!/bin/bash
#How to prepare a genome for bismark alignment
#bismark_genome_preparation --bowtie2 $REF

wgbsaln(){
##Specify a reference genome directory
REF=/pathto/refgenome/

###################################################
# Start pipeline
###################################################
FQZ1=$1
#find read2 - this needs customisation for filenames
FQZ2=`echo $FQZ1 | sed 's/pair1.fastq.gz/pair2.fastq.gz/'`
#unzip read 1 and 2
pigz -dfkp 6 $FQZ1 & pigz -dfkp 6 $FQZ2
FQ1=`echo $FQZ1 | sed 's/.gz$//'`
FQ2=`echo $FQZ2 | sed 's/.gz$//'`
wait

###################################################
# Run Bismark with bowtie2 alignment
###################################################
bismark --non_directional --bowtie2 -gzip \
-p 4 $REF -1 $FQ1 -2 $FQ2
rm $FQ1 $FQ2
rm ${FQ1}_C_to_T.fastq ${FQ2}_G_to_A.fastq
pigz ${FQ1}_bismark_bt2_pe.sam
}
export -f wgbsaln
parallel -u -j2 wgbsaln ::: *pair1.fastq.gz

Jobs can also be run over several servers with the following setup (further reading here). 

parallel -u -j2 -S server1,server2,server3 wgbsaln ::: *pair1.fastq.gz

The alignment will take quite a bit longer than a standard gDNA alignment. When it's finished, the compressed sam files can be concatenated (without uncompressing):

cat Sample1_*.sam.gz > Sample1.sam.gz

These compressed sam files can be used directly in methylation calling, but before we start, take a moment to look at the M-bias graphs, as suggested in the manual to determine whether the first and last few bases of read 1 and 2 show any strong biases. If so, you'll need to ignore those bases in the methylation calling. You might also want to move the concatenated sam.gz alignments to a new directory to perform methylation calling with the following script.

#!/bin/bash
#################################################
# Extract methylation scores
#################################################
methextr(){
REF=/pathto/refgenome/
SAMGZ=$1
SAM=`echo $SAMGZ | sed 's/.gz//'`
bismark_methylation_extractor -p --bedgraph --counts \
 --buffer_size 20G --cytosine_report --genome_folder \
$REF --multicore 10 --gzip --ignore 3 --ignore_r2 3 $SAMGZ
}
export -f methextr

parallel -u -j3 methextr ::: Sample1.sam.gz Sample2.sam.gz SampleN.sam.gz

The script is based upon an example in the manual that generates a genome wide CpG report in addition to the Bedgraph output that look like this.

==> Sample1.sam.bismark.cov <==
20 60179 60179 34.4827586206897 10 19
20 60180 60180 46.6666666666667 7 8
20 60358 60358 0 0 1
20 60426 60426 92.3076923076923 12 1
20 60427 60427 92 23 2
20 60432 60432 84.6153846153846 11 2
20 60433 60433 72 18 7
20 60551 60551 38.8888888888889 7 11
20 60552 60552 46.6666666666667 14 16
20 60578 60578 47.3684210526316 9 10

==> Sample1.sam.CpG_report.txt <==
20 60179 + 10 19 CG CGT
20 60180 - 7 8 CG CGT
20 60426 + 12 1 CG CGA
20 60427 - 23 2 CG CGG
20 60432 + 11 2 CG CGA
20 60433 - 18 7 CG CGG
20 60551 + 7 11 CG CGA
20 60552 - 14 16 CG CGT
20 60578 + 9 10 CG CGT
20 60579 - 25 15 CG CGC



In future posts, I'll go through the process of calling differentially methylated cytosines and regions.

Friday, 3 July 2015

Weighing the benefits of RNA-seq error correction

Sequencing data contains two major types of errors, ones that are incorporated during library preparation and ones incorporated during sequence reading. While errors of the former are difficult to correct as they occur without clues from the base quality scores, the latter type do correlate with lower base quality scores and so it is possible to identify these ambiguous base calls and compare them to a library of high confidence k-mers from the same sequencing run. Error correction of this type has been show to improve de novo assemblies and SNP detection.

A recent report posted in bioRxiv shows error correction gives a drastic improvement in transcriptome assembly, with the author benchmarking the performance of five different correction software packages (Bless, Lighter, SGA, SEECER & BFC). The author reports that these software options vary in the aggressiveness of error correction, with SGA being the most aggressive, Lighter the least aggressive and BFC and SEECER performing well over a range of parameters and read depths. Marcel Shulz (developer of SEECER) has put together a great slideshow to provide an overview of error correction for RNA-seq.

So the question remains whether error correction is worthwhile for standard RNA-seq expression profiling? Ie, cases where there is a good reference genome and the end-point is up- or down-regulation of genes and pathways.

The potential of error correction in RNA-seq is a difficult topic to examine, firstly because the normal procedure of using simulated reads in these type of analyses doesn't produce the same error profile of a real sequencing run. One way to approach the problem is to use the proportion of concordant and discordant paired-end read mapping as a measure of "correctness". Theoretically, if a read contains many errors, it may not map to its true gene of origin because it maps to another location or doesn't map at all.

In this evaluation, I'm using BFC (Pubmed). The selection is arbitrary, but the author (Heng Li) has a good reputation for producing accurate, robust and efficient software. In order to test the effectiveness of BFC error correction versus the standard procedure (quality trimming prior to mapping), I took a paired-end Illumina HiSeq2500 mRNA-seq dataset and performed different types of processing. The dataset I selected (GSE63887) that was recently published in Cell Reports and is relevant to my current field of work.

These different processing procedures reflect the different ways BFC can be run (from the manpage):

  • Simplest case, correct each dataset in isolation (bfc-i)
bfc reads_1.fq > corrected_reads_1.fq
bfc reads_2.fq > corrected_reads_2.fq
  • Next case, correct each pairs of reads (bfc-p)
cat reads_1.fq reads_2.fq > reads_1+2.fq
bfc reads_1+2.fq reads_1.fq > corrected_reads_1.fq
bfc reads_1+2.fq reads_2.fq > corrected_reads_2.fq
  • Last case, correct reads based on the k-mer profile of the entire experiment (bfc-e) 
cat *reads*.fq > pooled-reads.fq
bfc pooled-reads.fq reads_1.fq > corrected_reads_1.fq
bfc pooled-reads.fq reads_2.fq > corrected_reads_2.fq

I also included a control data-set that was left without error correction (no-bfc) and in addition, I performed standard base quality trimming and filtering using Skewer that I described previously prior (default settings) to any error correction procedure. Which brings the tally up to 8 different pipelines. Then I performed alignment of the reads to the human genome with STAR (single-end mode), followed by FeatureCounts read summarisation (also single end mode), with the nifty "-R" option to obtain the read summarisation result for each read. I used a custom shell script to count the read pairs that were concordant or discordant for each sample.

The quality profiles show a typical drop in base quality towards  the end of the reads, with the reverse read being lower quality compared to the forward read. There was also some differences between the samples with regard to quality profiles, with the best and worst datasets shown below (FastQC).
SRR1693847_1 best quality forward readset
SRR1693847_2 best quality reverse readset


SRR1693849_1 worst quality forward readset
SRR1693849_1 worst quality reverse readset
And here are the base quality averages for a subsample (2%) of reads using Fastx-quality-stats (Fastx-Toolkit) as well as total number of reads.
GSE63887 dataset reads and mean base quality (over full read length)
After mapping these reads with STAR in single end mode followed by featureCounts, I was able to count the concordant and discordant read pairs with a shell script. I could then compare the different data processing pipelines.
Quantification of concordant mRNA-seq paired end reads (both reads mapQ≥20) 


Quantification of discordant mRNA-seq paired end reads (both reads mapQ≥20) 

So you can see that unprocessed data yields 65.6% concordant pairs which increases marginally to 66.1% in the BFC corrected data. In contrast, using Skewer trimmed data yielded 67.7% concordance. Using BFC correction after Skewer trimming gave 67.8% concordant mapping. The proportion of discordant pairs was highest in the unprocessed data 1.15% and was lowest in the data that had undergone Skewer trimming prior to BFC correction.

Then I looked at the mismatch rate, to see whether error correction or quality trimming was most beneficial. I looked at the logs produced by the STAR aligner, which calculates a mismatch rate for all uniquely mapped reads found.

Mismatch rate per base in uniquely mapped reads (STAR output)

As you can see, the highest error rates are present in the unprocessed data, and the lowest error rates are found in the datasets that have undergone both quality trimming and error correction. Error correction alone had a greater impact on mismatch rate than quality trimming alone.

I checked the error profile with RSeqQC, which gives detail about mismatch types over the read length for (dataset SRR1693849_1) 

As you can see, Skewer quality trimming reduces the rate at which mismatches occur toward the end of the read. BFC lowers the overall mismatch rate, but seems to incorporate higher than expected numbers of G2A, T2C, A2G and C2T mismatches.

So to see whether the hit rate of calling variant in RNA-seq is improved by quality trimming or error correction, I extracted all the sites that have at least three non-consensus mismatch base-calls (using samtools mpileup) and checked these positions against dbSNP (common_all_20150603.vcf.gz). I used the SRR1693849_1 dataset and only looked at chromosome 1. 

We see that both bfc correction and skewer quality trimming have a positive impact on identification of common variants present in dbSNP.

Bottom line

The increase of alignment performance after BFC error correction was only minimal compared to standard quality trimming procedures. Thus the benefit of error correction for differential expression analysis would thus be minimal, considering how computationally intensive it is. On the other hand, error correction does appear to improve mismatch error rates in uniquely aligned reads, which could be really handy for de novo transcriptome assembly and calling variants from RNA-seq data and allele-specific gene expression analysis.