Pages

Monday, 24 August 2015

Using paired end read concordance to determine accuracy of RNA-seq pipelines STAR/featureCounts and Kallisto

This post is a collection of code used to compare accuracy of Kallisto and STAR/featureCounts in my previous post.

Download from SRA and convert to fastq format

__________________________________________________________________

Run skewer to quality trim the data.


$ cat run_skewer.sh
#!/bin/bash
for FQZ1 in *_1.fastq.gz ; do
FQZ2=`echo $FQZ1 | sed 's/_1.fastq.gz/_2.fastq.gz/'`
skewer -t 8 -q 20 $FQZ1 $FQZ2
done
__________________________________________________________________

Run bfc to correct errors.


$ cat run_bfc.sh
#!/bin/bash
for FQ in *.fastq.gz ; do
OUT=`echo $FQ | sed 's/.fastq.gz$/_bfc.fastq.gz/'`
bfc -t 8 $FQ | pigz > $OUT &
done
wait
__________________________________________________________________

Run STAR aligner to align reads to human genome in single end mode.



$ cat run_star.sh
#!/bin/bash
DIR=/refgenome_hsapiens/
GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
for FQZ in `ls *gz` ; do
FQ=`echo $FQZ | sed 's/.gz//'`
pigz -dkf $FQZ
STAR genomeLoad LoadAndKeep --genomeDir $DIR --readFilesIn $FQ --runThreadN 8 \
--sjdbGTFfile $GTF --outSAMattributes NH HI NM MD
rm $FQ
mv Aligned.out.sam ${FQ}.STAR.sam
mv Log.final.out ${FQ}_starlog.txt
( samtools view -uSh ${FQ}.STAR.sam | samtools sort - ${FQ}.STAR
rm ${FQ}.STAR.sam
samtools index ${FQ}.STAR.bam
samtools flagstat ${FQ}.STAR.bam > ${FQ}.STAR.bam.stats ) &
done
STAR genomeLoad Remove --genomeDir $DIR
wait
__________________________________________________________________

Run featureCounts to generate read-gene mapping info. Note mapQ threshold set at 20.



$ cat run_fcount.sh

#!/bin/bash

#DIR=refgenome_hsapiens/

GTF=refgenome_hsapiens/Homo_sapiens.GRCh38.78.gtf
EXPT=skewered
MX=${EXPT}.mx
MXF=${EXPT}_f.mx
featureCounts -Q 20 -R -T 8 -a $GTF -o $MX *bam
sed 1d $MX | cut -f1,7- > $MXF

__________________________________________________________________

Analyze featureCounts read-gene mapping data, generating true positive and false positive read pairs.



$ cat correctpairs_fcounts.sh

#!/bin/bash

for i in *featureCounts ; do

grep Assigned $i | sort -T . -k 1b,1 | cut -f1,3 > ${i}.s &
done
wait

for FC1 in *pair1.fastq.STAR.bam.featureCounts.s ; do
( FC2=`echo $FC1 | sed 's/pair1.fastq/pair2.fastq/'`
join -1 1 -2 1 $FC1 $FC2 > ${FC1}.jn
rm $FC1 $FC2 ) &
done
wait


for JN in *jn ; do
(
LN=`wc -l < $JN`
COR=`awk '$2==$3' $JN | wc -l`
INC=`awk '$2!=$3' $JN | wc -l`
echo $JN $LN $COR $INC ) &
done | tee result.txt
wait

__________________________________________________________________

Prepare custom transcriptome reference sequence for use with Kallisto.



$ cat run_kal_ref_prep.sh

#!/bin/bash

wget http://ftp.ensembl.org/pub/release-78/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

zcat Homo_sapie.GRCh38.cdna.all.fa.gz | cut -d ' ' -f1,4 \
| sed 's/ gene:/_/' > Homo_sapiens.GRC38.78.format.cdna.all.fa
__________________________________________________________________


Run Kallisto in single end mode, generating a bam file output.



$ cat run_kal.sh

#!/bin/bash

IDX=Homo_sapiens.GRC38.78.format.cdna.all.fa.idx

for FQZ in *fastq.gz ; do
NAME=`echo $FQZ | sed 's/.fastq.gz//'`
kallisto quant -t 7 --pseudobam --single -i $IDX -o ${NAME}.output -b 100 $FQZ \
| samtools view -Sb - > ${NAME}.bam
done

__________________________________________________________________

Analyze Kallisto read pair information. If multimapping has occurred, select one hit only.



$ cat correctpairs_kal.sh

#!/bin/bash

for BAM in *bam ; do

samtools view $BAM | cut -f1,3 | grep ENSG | tr '_' '\t' | cut -f1,3 \
| awk '!arr[$1]++' | sort -k 1b,1 > $BAM.counts &
done
wait

for C1 in *pair1.bam.counts ; do
C2=`echo $C1 | sed 's/pair1.bam.counts/pair2.bam.counts/'`
ls $C1 $C2
join -1 1 -2 1 $C1 $C2 > $C1.jn &
rm $C1 $C2
done
wait


for JN in *jn ; do
(
LN=`wc -l < $JN`
COR=`awk '$2==$3' $JN | wc -l`
INC=`awk '$2!=$3' $JN | wc -l`
echo $JN $LN $COR $INC ) &
done | tee result.txt
wait


Thursday, 13 August 2015

How accurate is Kallisto?

One of the most interesting developments in RNA-seq informatics in the past year or so is the evolution of so-called "lightweight" analysis. Instead of trying to map a whole read to the reference exons, it may be quicker and just as accurate to simply compare the k-mer content of the read and reference. If the read and reference share enough unique kmers, then the aligner can stop analysing that read and can move on the the next one. It turns out that the speed-up with this approach is huge, when compared with previous methods. Kallisto especially appears to have some speed advantages over competitors such as Sailfish, Salmon, eXpress, etc. For those interested in learning more, please refer to the Kallisto pre-print in arXiv, as well as the following blog posts.

For this post, I wanted to compare the accuracy of Kallisto with STAR/featureCounts (mapQ20 threshold), the pipeline that I'm currently using in most of my papers. I also want to know whether quality trimming with Skewer or error correction with BFC is also beneficial to accuracy (both default settings). The whole speed issue isn't a big deal for me as our department has more than enough computational resources available. I will use the number of concordant read pairs as a measure of true positives and discordant read pairs as a measure of false positives.

Kallisto installation was straight-forward, apart from the HDF5 dependency, which I had to guess the missing library. Ubuntu users should:

apt-get install libhdf5-serial-dev

Here are the versions of the software, genome and data I used:

kallisto 0.42.2.1
STAR_2.4.0g1
featureCounts v1.4.2
bfc r181
skewer v0.1.122r
Ensembl Homo_sapiens.GRCh38.78 (cDNA, primary assembly, GTF)


Kallisto uses a reference transcriptome sequence, which is different to STAR (genomic), so I made the transcriptome file contain both the transcript and gene ID in the header, so I can aggregate to get a read-gene relationship for concordance analysis. If Kallisto multi-mapping reads, then one was selected at random.

The data I used is from NCBI GEO (GSE57862) SRA (SRR1293901 & SRR1293902) and is useful because SRR1293901 is a 2x262 cycle run from Illumina MiSeq and SRR1293901 is a 2x76 cycle run from Illumina HiSeq 2000. The PLoS One paper that describes this dataset is also worth a look. The benefit of using this concordant read-pair approach is that it has all the quirks of real seq data that aren't accounted for by read simulators.


Firstly, you can see that in 262 nt data without pre-processing, Kallisto shows a very high concordance rate and low discordance compares to STAR/FeatureCounts. At 76 nt read lengths, Kallisto still shows a higher concordance rate, but discordance is also higher than STAR/FeatureCounts. Longer read lengths improve the concordance rates for Kallisto, while they actually decrease for STAR/FeatureCounts.

Skewer quality trimming was beneficial in all cases, but had a bigger impact on STAR/FeatureCounts than Kallisto pipeline. BFC error correction was also beneficial in all cases, but improved Kallisto pipeline to a greater degree.

In summary, Kallisto is not only a quick, but also accurate with longer reads and doesn't need huge RAM footprint like STAR does. The GEO dataset I used (GSE57862), will be a useful resource if you want to check concordance rates of other read lengths (ie:100 nt, 50 nt).

Thursday, 6 August 2015

Quantification and equimolar pooling of NGS libraries

Library preparation for next generation sequencing is becoming easier with the quality of kits and protocols improving substantially in the past few years. With the price of NGS decreasing, we are finding that our throughput is increasing, both in terms of the number of experiments as well as the average size of these experiments. With this in mind, the ability to accurately pool barcoded libraries in equimolar ratios (also called "balancing") is even more critical. Accurate quantification is thus vital. There are several ways to quantify library DNA:
  • Qubit fluorometer. Gives very accurate concentrations in nanogram per microlitre, but agnostic of fragment size distribution.
  • Bioanalyzer. Gives accurate readings of fragment size, but is only fairly accurate in terms of concentrations of each fragment
  • Quantitative PCR. Most accurate, but expensive and most time consuming.
  • NanoDrop UV-Spec. Easiest method but least accurate. Not recommended.
So I'll lead you through my favourite method for equilmolar library pooling that only uses a bioanalyzer, specifically the Shimadzu MultiNA.

When you elute the DNA off the column or magnetic beads, the first thing you need to do is look at the bioanalyzer profile. Here is an example of a bioanalyzer result for 23 samples (left) and a detailed look at the first sample (right). The MultiNA software also kindly calculates the molarity of the peaks detected, showing that sample 1 is 338.53 nM.

As you can see in the left panel above, is that the library concentrations do vary quite a lot. To demonstrate this more clearly, I present the concentration value of the main library peak for the whole 30 samples in the same experiment in the below graph. You can see that the libraries vary in concentration from 450nM down to 50 nM, meaning the pooling step will be prone to a high errors if the MultiNA readings are just a bit inaccurate.

As the accuracy of MultiNA is not that good over this wide range of concentrations values, its better to dilute the libraries down to a common concentration first and then quantify again on MultiNA. As you can see below, the variability is much improved as compared to the above graph, but not yet perfect.


This second measurement reading is used to calculate the volumes required to generate the library pools. The data yield after quality trimming shown below demonstrates that the balancing has improved again, and the three pools, each containing 10 barcoded samples are well balanced.

Prior to clustering the libraries, I'd recommend running the pooled library on the MultiNA, followed by dilution to 10 nM and then final quantification with either MultiNA or Qubit. The approach I use is fairly cheap as MultiNA runs cost as little as 10c per sample and is not very labour intensive compared to qPCR methods.

Screen for rRNA contamination in RNA-seq data


Ribosomal RNA (rRNA) is very abundant in cells (~80% of total RNA), so it is useful to deplete rRNA when doing genomewide assays to have sufficient coverage of other genes including protein coding and non-protein coding genes.

There are two major strategies for achieving rRNA removal, being (1) poly A enrichment and (2) rRNA depletion. Poly A enrichment uses an oligo dT coupled magnetic bead to "pull-out" RNA molecules with a polyA tag, a common feature of protein coding transcripts. rRNA depletion can be achieved using kits such as Ribo-Zero (Illumina), Ribo-Minus (LifeTech) and NEBNext® rRNA Depletion Kit (NEB). These kits contain oligonucleotide probes that either hybridize and immobilise the rRNA or hybridize and degrade the unwanted rRNA.

Once you have some sequence data, you will need to check whether the rRNA depletion has worked. This is somewhat different to a genome-wide analysis I've mentioned in earlier posts because rRNA genes exist in multiple copies and reads mapped there are normally discarded in the map quality filtering step. So we need to quantify more directly. You can use Broad institutes RNA-seq-QC package, but it requires so many obscure input file formats, I'm sceptical that anyone outside of the Broad has actually gotten it to work. An easy and quick QC for rRNA contamination can be done with BWA and samtools. These can both be installed with apt-get in Ubuntu Linux.

First you need the sequences of the rRNAs, so I collected them from NCBI and post them here for your convenience (human & mouse). For this example we will use human. Start with indexing the fasta reference sequence with BWA:

bwa index human_rRNA.fa

Then run a bwa alignment on the rRNA reference. The following script only looks at the first 1 million reads in the file and uses fastq_quality_trimmer to remove low quality bases from the 3' ends.

#!/bin/bash
REF=/data/projects/mziemann/rna-seq-qc/human_rRNA.fa
for FQZ in *fastq.gz ; do
echo $FQZ
FQ=`echo $FQZ | sed 's/.gz//'`
zcat $FQZ | head -4000000 | fastq_quality_trimmer -t 30 -l 20 -Q33 \
| tee $FQ | bwa aln -t 8 $REF - | bwa samse $REF - $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort
done
wait
for i in *bam ; do samtools index $i & done ; wait
for i in *bam ; do samtools flagstat $i > ${i}.stats & done ; wait

The samtools flagstats command outputs a simple report of total and mapped reads

$ head -3 *.fastq.sort.bam.stats
==> WT1-1681_C6K35ANXX_ATCACG_L005_R1.fastq.sort.bam.stats <==
998121 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
156636 + 0 mapped (15.69%:-nan%)

==> WT2-1684_C6K35ANXX_CGATGT_L005_R1.fastq.sort.bam.stats <==
997640 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
153092 + 0 mapped (15.35%:-nan%)

==> WT3-1691_C6K35ANXX_TTAGGC_L005_R1.fastq.sort.bam.stats <==
998021 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
238082 + 0 mapped (23.86%:-nan%)

So you can see that from the 1 million reads we started with, we lost a few with quality trimming and then had 15-24% of reads mapping to the rRNAs (irrespective of mapQ). You can also dig a bit deeper and see which rRNA accounts for the most reads. Column 1 is the contig name, column 2 is the contiog length and column 3 is the number of mapped reads, column 4 is the number of unmapped reads.

$ samtools idxstats WT2-1684_C6K35ANXX_CGATGT_L005_R1.fastq.sort.bam
gi|120444900|ref|NR_003279.1| 4730 144029 0
gi|374088232|ref|NR_003278.3| 1870 8997 0
gi|374093199|ref|NR_003280.2| 157 57 0
gi|374088139|ref|NR_046144.1| 121 9 0
* 0 0 844548

It looks like the larger 28S fragment accounts for the most reads, followed by 18S.

In my most recent RNA-seq experiment, we used NEB polyA enrichment for human primary cell culture from 1ug of starting material as well as NEB rRNA depletion of RNA from mouse tissue for which we could only obtain 50ng of RNA. I used the NEB Ultra Directional RNA-seq library prep kit which worked for 45/45 samples. The polyA data contained 9.1% rRNA and the rRNA depleted data contained 21.4% rRNA. Overall I'm happy with that, because the mouse study would not have been possible with the polyA method due to low starting amounts.


EDIT: If using longer reads (>70bp) or another sequencing platform such as Ion Torrent, BWA MEM will actually work better. Below is the code for running it.

#!/bin/bash
REF=/data/projects/mziemann/rna-seq-qc/human_rRNA.fa

for FQZ in *fastq.gz ; do
  echo $FQZ
  FQ=`echo $FQZ | sed 's/.gz//'`
  zcat $FQZ | head -4000000 | fastq_quality_trimmer -t 30 -l 20 -Q33 \
  | bwa mem -t 8 $REF - \
  | samtools view -uSh - \
  | samtools sort - ${FQ}.sort &
done
wait
for i in *bam ; do samtools index $i & done ; wait
for i in *bam ; do samtools flagstat $i > ${i}.stats & done ; wait