
Wednesday, 24 September 2014

Read counting with featureCounts, BedTools and HTSeq

Counting the number of reads that align to certain genomic features is a key element of many next gen sequencing analysis pipelines. For RNA-seq, this is commonly used to count reads aligning to exons, while for ChIP-seq this is used to count reads over a promoter or other region of interest. There are several available tools for performing this task and in this post I will compare three of the most commonly used: featureCounts, BedTools multicov and HTSeq-count.
I took one of the bam files from the recent RNA-seq series of posts and subsampled it using samtools and shuf to files of 1M, 2M, 5M and 10M reads, and also kept the full bam file containing 25.4M reads.
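For reference, the subsampling can be done by keeping the bam header intact and shuffling the read records; here is a minimal sketch, not necessarily the exact commands used (file names and read count are placeholders):

samtools view -H full.bam > sub.sam
samtools view full.bam | shuf | head -1000000 >> sub.sam
samtools view -bS sub.sam > sub_1M.bam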

I then used the benchmarking script described in the previous post to record the execution time, CPU usage and peak memory of read counting to generate a gene-wise matrix. I ran featureCounts in single-thread mode as well as parallel mode (maximum of 8 cores).
Execution time for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
Peak memory usage for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
CPU utilisation for read counting by featureCounts, BedTools Multicov and HTSeq-count for bam files of varying sizes. 
The results show that featureCounts is about 10 times faster than BedTools Multicov and about 18 times faster than HTSeq-count when using a single thread, and when allowing parallel processing, this became 20 times and 37 times respectively.

Then I broke down the overall times into that invested in (1) reading in the annotation file and (2) processing the reads.
Time taken for featureCounts, BedTools Multicov and HTSeq to read in the annotation file (upper panel) and process 1 million reads (lower panel).

This result shows that featureCounts is considerably faster at both tasks than its competitors.


Exact commands used:
htseq-count -q -f bam --stranded=no --minaqual=10 --type=gene --mode=union test.bam Homo_sapiens.GRCh38.76.gtf > test.bam.htseq.count
bedtools multicov -D -q 10 -bams test.bam -bed Homo_sapiens.GRCh38_exon.bed > test.bam.bedtools.cnt
featureCounts -Q 10 -M -s 0 -T 1 -a Homo_sapiens.GRCh38.76.gtf -o test.bam.featureCount.cnt test.bam
featureCounts -Q 10 -M -s 0 -T 8 -a Homo_sapiens.GRCh38.76.gtf -o test.bam.featureCount.cnt test.bam

Sunday, 21 September 2014

Benchmark scripts and programs

Bioinformaticians strive for accurate results, but when time or computational resources are limited, speed can be a factor too. This is especially true when dealing with the huge data sets coming off sequencers these days.

When putting together an analysis pipeline, try taking a small fraction of the data and performing some benchmarking of the available tools.

Benchmarking could be as simple as using time:

time ./script1.sh
time ./script2.sh


But if you need a little more detail, this benchmarking approach captures peak memory usage and average CPU utilisation too.

1. Set up a list of commands/scripts in a file called "codes.txt"

Here is a list of commands that I used in a previous post:

$ cat codes.txt
cat test.fastq > /dev/null
zcat test.fastq.gz > /dev/null
bzcat test.fastq.bz2 > /dev/null
pigz -dc test.fastq.gz > /dev/null
pbzip2 -dc test.fastq.bz2 > /dev/null
plzip -dc test.fastq.lz > /dev/null

2. Set up the benchmarking script

The following script executes each line of "codes.txt" and uses /usr/bin/time to measure the time, memory and CPU. /usr/bin/time is more flexible than the shell built-in "time" because its output format can be customised. Importantly, the script requires sudo permissions to clear the disk cache before executing the next task. You will find sudo su more convenient here to prevent recurrent sudo password requests, but be CAREFUL not to delete system files or data. The script performs replicates of each test to ensure reproducibility.

$ cat benchmark_time.sh

#!/bin/bash
CODES=codes.txt      # file with one command per line
LEN=`wc -l < $CODES` # number of commands to test
EX=exe.sh            # mini script holding the current command
TMP=tmp.txt          # time/memory/CPU output of the current run
RES=results.txt      # aggregated results
REPS=3               # number of replicates

for REP in `seq $REPS`
do
for LINE in `seq $LEN`
do
# place the current command into an executable mini script
CODE=`sed -n ${LINE}p $CODES`
echo $CODE > $EX
chmod +x $EX
# clear the disk cache so each run starts cold (needs sudo)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# %e elapsed seconds, %M peak resident memory (kB), %P average CPU
/usr/bin/time --format "%e\t%M\t%P" --output=$TMP ./$EX
# print the command and its metrics as one tab-separated line
echo $CODE | cat - $TMP | tr '\n' '\t' | sed 's/$/\n/'
done
done | tee $RES

3. Execute the script

chmod +x benchmark_time.sh

sudo su

./benchmark_time.sh

The output should look something like this:

cat test.fastq > /dev/null 30.06 728 6%
zcat test.fastq.gz > /dev/null 23.56 1448 93%
bzcat test.fastq.bz2 > /dev/null 95.39 4148 98%
pigz -dc test.fastq.gz > /dev/null 11.99 1020 142%
pbzip2 -dc test.fastq.bz2 > /dev/null 32.96 43124 765%
plzip -dc test.fastq.lz > /dev/null 13.45 225424 751%

Here, the first column after the command is the elapsed time in seconds, the second is the peak memory usage (maximum resident set size, in kilobytes) and the third is the average CPU utilisation.

If you need more detailed reports, consider dstat.
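For example, a rolling report of CPU, memory, disk and network activity at one-second intervals looks something like this (flags may vary between dstat versions):

dstat --time --cpu --mem --disk --net 1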

Wednesday, 17 September 2014

Data analysis step 9: Leverage ENCODE data for enhanced pathway analysis

So far in this series of posts we've analysed the effect of azacitidine on AML3 cell gene expression using publicly available RNA-seq data. Our pathway analysis showed many interesting trends, but unravelling all the different mechanisms from this point can be really hard. In order to identify the major players at the chromatin level, it can be useful to integrate transcription factor binding data and see whether targets of a particular transcription factor are differentially regulated in a pathway analysis. The problem with this analysis in the past was that ChIP-seq datasets were in varying formats on GEO, and processing these into a standardised format would be too time consuming. With the advent of the ENCODE Project, there is now a large body of transcription factor binding data in a uniform format (link) that is being mined in many creative ways. In our group, we have used this approach extensively (here, here and here).

In this post, we will mine ENCODE transcription factor binding site (TFBS) data to determine TF-target gene interactions and process these into gene sets for pathway analysis in GSEA.

Our general approach will be the following:

  • Download a bed file of all promoters (or prepare one from a gtf file; see the sketch after this list)
  • Download ENCODE TFBS data in bed format
  • Intersect TFBS peaks with promoters using bedtools
  • Pick the top 1000 target genes for each TFBS data set
  • Perform pathway analysis using the new gene sets
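If you prefer to prepare the promoter bed file yourself, something along these lines should work for an Ensembl gtf. This is only a sketch: it assumes gene feature lines are present in the gtf, takes 1 kb upstream of the TSS, and formats the name column as geneid_symbol so that the gmt script further down can parse it.

#!/bin/bash
zcat Homo_sapiens.GRCh37.75.gtf.gz \
| awk '$3=="gene"' | tr '"' '\t' \
| awk -F'\t' 'BEGIN{OFS="\t"} {
# TSS is the start coordinate on the + strand and the end on the - strand
if ($7=="+") {s=$4-1000; e=$4} else {s=$5; e=$5+1000}
if (s<0) s=0
print $1, s, e, $10"_"$12
}' | sortBed > Homo_sapiens.GRCh37.75_1kbTSS.bed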

Download TFBS data

ENCODE TFBS data is organised into 5 subdirectories. Each of these contains a files.txt outlining the contents. Create a text file called "URL.txt" containing the following:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUchicagoTfbs/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromChip/
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/



Execute the following script to download the contents files and then download the TFBS peak files. These are based upon the hg19 human genome build. The peak files will appear in a folder called "peakfiles". At the time of writing there are 1340 peak files. You may need to install curl.


#!/bin/bash

URLS=URL.txt
TFBS_URLS=files_TFBS_URLs.txt

for URL in `cat $URLS`
do
# encode the URL with '+' in place of '/' so it survives the pipeline
URL2=`echo $URL | tr '/' '+'`
# derive a short label for the lab from the directory name
LAB=`echo $URL | tr '/' '_' | sed 's/wgEncode/@/' | cut -d '@' -f2-`
echo $LAB
wget --quiet $URL/files.txt
mv files.txt files_${LAB}TFBS.txt
# list the peak files and prepend the encoded base URL
curl $URL/files.txt | grep Peak.gz | cut -f1 \
| sed "s/^/${URL2}/"
done | tee $TFBS_URLS

rm -rf peakfiles
mkdir peakfiles
cd peakfiles

# decode '+' back to '/' and download each peak file
for URL in `cat ../$TFBS_URLS | tr '+' '/'`
do
wget $URL
done


Generate the gene sets file in gmt format

Now we will intersect the TFBS peaks with promoters. You can download a promoter bed file using the UCSC table browser. Just ensure that you're dealing with the same genome build as the one describing the TFBS data (hg19). Execute the following script, specifying the promoter/TSS bed file. It will generate a gmt containing a gene set for each TFBS peak file.

#!/bin/bash

PKDIR=peakfiles
TSSBED=Homo_sapiens.GRCh37.75_1kbTSS.bed
OUTGMT=ENCODE_TFBS.gmt

for PK in ${PKDIR}/*gz
do
BASE=`basename $PK`
# rank peaks by score, strip the "chr" prefix to match Ensembl naming,
# intersect with promoters, extract the gene symbol, de-duplicate,
# keep the top 1000 genes and emit one tab-separated gmt line per file
zcat $PK | sort -k7gr | sed 's/^chr//' \
| intersectBed -wb -a - -b $TSSBED \
| awk '{print $NF}' | cut -d '@' -f2 \
| awk '!arr[$1]++' | cut -d '_' -f2- \
| awk '!arr[$1]++' | head -1000 \
| tr '\n' '\t' | sed 's/$/\n/' \
| sed "s/^/${BASE}\tNA\t/"
done | tee $OUTGMT

Run GSEA with the new TFBS gene sets gmt 

Follow instructions in the previous post to run GSEA using the java GUI. We began with 1340 gene sets and 7 were filtered out due to there being too few detected members, leaving 1333 sets for analysis. From these, 544 were up-regulated (FDR p-value≤0.05) and 261 were down-regulated.
ENCODE TFBS GSEA result of azacitidine exposed AML3 cells by RNA-seq. Top 20 gene sets in either direction are shown.

Significant up and down-regulated ENCODE TFBS gene sets in azacitidine treated AML3 cells. Left panel shows up-regulation of c-Myc targets and right panel shows down-regulation of CEBPB targets.
The results show that targets of c-Myc, Max and E2F1 are up-regulated, while TEAD4, CEBPB and CTCF targets are down-regulated. This type of analysis allows researchers to perform functional assays to target the above transcription factors to elucidate and validate molecular mechanisms with clear endpoints - the differential expression of target genes in response to azacitidine.

This approach is not limited to TFBS; we could also look at chromatin marks such as histone modifications, or DNA methylation datasets from ENCODE or other sources.

Data analysis step 8: Pathway analysis with GSEA

In our RNA-seq series so far we've performed differential analysis and generated some pretty graphs, showing thousands of differentially expressed genes after azacitidine treatment. In order to understand the biology underlying the differential gene expression profile, we need to perform pathway analysis.

We use Gene Set Enrichment Analysis (GSEA) because it can detect pathway changes more sensitively and robustly than many other methods. A 2013 paper compared a range of gene set analysis tools using microarray data and is worth a look.

Generate a rank file

The rank file is a list of detected genes and a rank metric score. At the top of the list are genes with the "strongest" up-regulation, at the bottom of the list are the genes with the "strongest" down-regulation and the genes not changing are in the middle. The metric score I like to use is the sign of the fold change multiplied by the inverse of the p-value, although there may be better methods out there (link).

Here is what the first 10 lines in the DESeq output look like:
Top 10 lines of differential expression table generated by DESeq.



#!/bin/bash
DGE=$1
# name the rank file after the input spreadsheet
RNK=`echo $DGE | sed 's/\.xls/.rnk/'`
# remove the header, sort by p-value (column 7), strip the accession
# prefix from the gene name, de-duplicate on gene symbol, then emit
# the symbol with a signed inverse p-value as the rank metric
sed 1d $DGE \
| sort -k7g \
| cut -d '_' -f2- \
| awk '!arr[$1]++' \
| awk '{OFS="\t"}
{ if ($6>0) printf "%s\t%4.3e\n", $1, 1/$7 ;
else printf "%s\t%4.3e\n", $1, -1/$7 }' \
| sort -k2gr > $RNK

And here's the top and bottom 10 lines of the rank file:

$ cat DEB_DESeq.rnk | (head;tail)
AC011899.9 3.352e+64
CAMK1D 1.471e+51
HSD11B1 1.919e+48
GNG4 2.279e+43
CPLX1 3.873e+40
CGNL1 1.200e+39
APCDD1 1.237e+35
MANEAL 9.303e+32
MKX 4.441e+28
HOXA13 1.007e+26
HLA-B -2.908e+132
PRAME -3.063e+165
LYZ -9.701e+185
CPXM1 -1.169e+186
HSH2D -1.658e+186
TGFBI -8.856e+210
APOC2 -9.633e+243
APOC4-APOC2 -1.984e+245
COL1A2 -1.298e+274
SLC12A3 -5.422e+304



The output looks consistent with the table I made in a previous post.

Run GSEA using the Java GUI

Download the javaGSEA Desktop Application from the Broad website (you will need to register). Open a terminal, cd into the directory containing the jar file and run the following (substituting XX for the appropriate version):

java -jar gsea2-XX.jar

Hit the "Load data" tab and browse for the rnk file we just generated.

Next, select "GSEA Preranked" from the "Tools" pull-down menu.

Here is the list of parameters I used:
  • Gene sets database: Initially, I like to use REACTOME to check that GSEA is working and to quickly get a broad idea of the biology going on. Later on we will run with everything in the MSigDB database.
  • Number of permutations: 1000 is normally OK.
  • Ranked List: Ensure that GSEA is looking at the newly generated rnk file.
  • Collapse dataset to gene symbols: False
  • Chip platform: We're using gene symbols, so you can leave this blank and GSEA will download the gene symbol chip file from the Broad website.
  • Enrichment statistic: Classic; this behaves better when we have extreme p-values, as in this case.
  • Max gene set size: 5000
  • Min gene set size: 15
  • Collapsing mode for probe sets>1 gene: Max_probe
  • Normalisation mode: meandiv
  • Omit features with no symbol: True
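
As a side note, the same preranked analysis can be scripted without the GUI. Below is a rough sketch using the GSEA2-era command line interface; the class name and flags can differ between versions, and the gmt file name here is a placeholder:

java -Xmx4g -cp gsea2-XX.jar xtools.gsea.GseaPreranked \
-rnk DEB_DESeq.rnk -gmx REACTOME.gmt \
-nperm 1000 -scoring_scheme classic \
-set_max 5000 -set_min 15 -norm meandiv \
-collapse false -include_only_symbols true \
-rpt_label aza_reactome -out gsea_out -gui false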

REACTOME result

From the initial REACTOME database of 674 sets, 242 were removed because they had either fewer than 15 detected members or more than 5000. This left us with 432 sets to analyse. There were 220 sets with FDR p-value (called q-value here) ≤0.05. From these, 125 were up-regulated and 95 were down-regulated.
REACTOME GSEA result of azacitidine exposed AML3 cells by RNA-seq. Top 20 gene sets in either direction are shown.
Most down and up-regulated REACTOME pathways in azacitidine treated AML3 cells.
This result shows up-regulation of cell cycle and DNA repair pathways and down-regulation of innate immune response.

MSigDB Result

In the case of MSigDB, we began with 10295 gene sets and 2148 were filtered out (too few or too many detected members), leaving 8147 sets. From these, 1613 were up-regulated (FDR p-value≤0.05) and 3692 were down-regulated.
MSigDB GSEA result of azacitidine exposed AML3 cells by RNA-seq. Top 20 gene sets in either direction are shown.
Two significant up and down-regulated MSigDB gene sets in azacitidine treated AML3 cells.
These results show the up-regulation of EWS/FLI fusion protein target genes that are linked to oncogenesis (reference). The down-regulation of genes linked to metabolic syndrome (reference) is somewhat unexpected, although many members are part of the interferon signalling pathway shown above.

Sunday, 14 September 2014

Test-driving parallel compression software

Next generation sequencing is driving genomics into the realm of big data, and this will only continue as NGS moves into the hands of more and more researchers and clinicians. Genome sequence datasets are huge and place an immense demand on informatics infrastructure, including disk space, memory and CPU usage. Bioinformaticians need to be aware of the performance of the different compression software available to make the most out of their hardware.

In this post, we'll test the performance of some compression algorithms in terms of their disk-saving ability as well as speed of decompression/compression.

The file that we're working on is an RNA-seq fastq file that is 5.0 GB in size and contains 25.8 million sequence reads on 103.0 million lines. Here are the first 10 lines.

@SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
ACTTGCTGCTAATTAAAACCAACAATAGAACAGTGA
+SRR1171523.1 Y40-ILLUMINA:15:FC:5:1:8493:1211 length=36
B55??4024/13425>;>>CDD@DBB<<<BB<4<<9
@SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
GAACAATCCAACGCTTGGTGAATTCTGCTTCACAAT
+SRR1171523.2 Y40-ILLUMINA:15:FC:5:1:3017:1215 length=36
DBBE?BDFDEGGGEG@GGBGDGGGGHHHHHGFHFBG
@SRR1171523.3 Y40-ILLUMINA:15:FC:5:1:3346:1218 length=36
TTCCACTTCCCAAGTAAACAGTGCAGTGAATCCTGT
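Since each fastq record spans four lines, figures like these are easy to verify with wc:

wc -l test.fastq
echo $(( $(wc -l < test.fastq) / 4 ))   # number of reads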

The system performing the benchmark is a pretty standard desktop PC with 8 threads.
The system used to benchmark compression software
I performed compression using the default settings and determined the compression ratio. bzip2 was better than gzip, but not as good as plzip, pxz or lrzip.
Data compression characteristics. Compressed fastq file size compared to the uncompressed file (top). Compression ratio of various compression software (bottom).
Next I determined the time to read a compressed file and output to /dev/null, then the time required to compress and write to disk. These tests show that parallel gzip (pigz) is definitely the fastest way to read/compress a file on this system. plzip was quick for decompression but pretty slow for compression. pxz and lrzip were slow for both decompression and compression, although their CPU and memory profiles may make them worth considering if other jobs will be running concurrently.
Comparison of different compression algorithms with respect to time (left side), peak memory (centre) and CPU utilisation (right) of decompression (top), compression to /dev/null (middle) and compression with write to disk (bottom).
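For reference, the three tests were run along these lines for each tool (pigz shown; a sketch that slots into the benchmarking setup described in the next post):

/usr/bin/time pigz -dc test.fastq.gz > /dev/null      # decompression
/usr/bin/time pigz -c test.fastq > /dev/null          # compression, no disk write
/usr/bin/time pigz -c test.fastq > test2.fastq.gz     # compression with write to disk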

The pigz program was certainly the stand-out in terms of performance and would be recommended for files that might be read regularly. Reading and writing compressed data with pigz proved to be considerably faster than using cat on uncompressed data, probably because after compression there is less data to read from and write to disk. Where disk space is limiting, pbzip2 and plzip might be more applicable for data archiving.

See my post on how the benchmarking was done.

Sources:
http://unix.stackexchange.com/questions/52313/how-to-get-execution-time-of-a-script-effectively
http://www.ubuntugeek.com/how-to-clear-cached-memory-on-ubuntu.html
http://www.thegeekstuff.com/2012/01/time-command-examples/
http://ss64.com/bash/time.html

Friday, 12 September 2014

Data analysis step 7: Fast MDS plot

We will continue our series in the analysis of our azacitidine treated AML3 cell RNA-seq gene expression data set by generating a multidimensional scaling plot. This is a potentially useful way of showing variability in datasets, especially when the number of samples is large. Trawling the blogs, I found a really quick and easy way to do this in R (thanks Michael Dondrup@BioStars) that can be used to analyse the count matrix.

#read in the count matrix and scale
x<-scale(read.table("CountMatrix.xls", row.names=1, header=TRUE))
#classical MDS on the distances between samples
mds<-cmdscale(dist(t(x)))
pdf("MDSplot.pdf")
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2", type="n")
text(mds, labels=colnames(x))
dev.off()
Multidimensional scaling (MDS) plot for public gene expression data (GSE55123). 
The closer the labels are together, the more similar the samples are. So it is good to see that the untreated samples are clearly separated from the azacitidine treated samples. UNTR1 is a bit far away from the other two replicates as suggested by the heatmap in the previous post.

Wednesday, 10 September 2014

Data analysis step 6: Draw a heatmap from RNA-seq data using R

In the last post of this series, I left you with a gene expression profile of the effect of azacitidine on AML3 cells. I decided to use the DESeq output for downstream analysis. If we want to draw a heatmap at this stage, we might struggle because the output provided by the DEB applet does not send back the normalised count data for each sample.

It is not really useful to plot all 5704 genes with FDR adjusted p-values <0.05 on the heatmap, so I will simply show the top 100 by p-value. Here are the general steps I will use in my R script below:

  1. Read the count matrix and DESeq table into R and merge into one table
  2. Sort based on p-value with most significant genes on top
  3. Select the columns containing gene name and raw counts
  4. Scale the data per row
  5. Select the top 100 genes by significance
  6. Generate the heatmap with mostly default values
Google searches show that R has some quite elaborate heatmap options, especially with features from ggplot2 and RColorBrewer. In this example, I will use built-in R features.


#read in the count matrix
mx<-read.table("Aza_AML3_countmatrix.xls", row.names = 1 , header = TRUE)

#read in the DESeq DGE spreadsheet
dge<-read.table("DEB_DESeq.xls", row.names = 1 , header = TRUE)

#merge the counts onto the DGE spreadsheet
mg<-merge(dge,mx,by="row.names")

#sort the merged table by p-value
smg<-mg[order(mg$pval), ]

#select only the columns containing the gene names and count data
x<-subset(smg, select = c("Row.names", "UNTR1", "UNTR2", "UNTR3", "AZA1", "AZA2", "AZA3"))

#make the table a data frame with gene names then remove duplicate gene name column
y<-(as.data.frame(x, row.names = x$Row.names))
x<-subset(y, ,-c(Row.names))

#scale rows
xt<-t(x)
xts<-scale(xt)
xtst<-t(xts)

#only grab top 100 by p-value
h<-head(xtst, n = 100L)

#set layout options - adjust if labels get cut off
pdf("heatmap.pdf",width=7, height=8)

#draw heatmap allowing larger margins and adjusting row label font size
heatmap(h, margins = c(4,10), cexRow=.4)

#output plot to file
dev.off()

Heatmap representing gene expression of AML3 cells treated with azacitidine. Only top 100 most significant genes are shown. Statistical analysis was performed using DESeq. Scale: Yellow indicates high expression and red is low expression.

As you can see, the heatmap shows stark expression changes for these top 100 most significantly differential genes. Also note that the majority of genes in the top 100 are down-regulated.

Tuesday, 9 September 2014

Data analysis step 5: Differential analysis of RNA-seq

So far in this RNA-seq analysis series of posts, we've done a whole bunch of primary analysis on GSE55123, and now we are at the stage where we can perform a statistical analysis of the count matrix we generated in the last post and look at the gene expression differences caused by azacitidine.

For this type of analysis we could load our data into R and perform the analysis ourselves, but for a simple experiment design with 2 sample groups in triplicate, without batch effects or sample pairing, I want to share an easy solution. DEB is an online service provided by the Interdisciplinary Center for Biotechnology Research (ICBR) at the University of Florida that will analyse the count matrix for you with either DESeq, edgeR or baySeq. Their Bioinformation paper is also worth a look.

As with all aspects of bioinformatics, format is critical. You need to follow the specified format exactly. Here is what the head of my count matrix looks like:

gene UNTR1 UNTR2 UNTR3 AZA1 AZA2 AZA3
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60
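The SRR accessions in the matrix header from the previous post were renamed to the sample group names before uploading; a one-liner along these lines would do the job (file names are illustrative, and \t in the replacement requires GNU sed):

sed '1s/.*/gene\tUNTR1\tUNTR2\tUNTR3\tAZA1\tAZA2\tAZA3/' Aza_AML3_tophat_filt.xls > Aza_AML3_countmatrix.xls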



I then uploaded the count matrix to DEB to perform differential gene expression analysis using all three statistical algorithms with a significance threshold of 5% after false discovery rate (FDR) correction.
Number of differentially expressed genes (FDR≤0.05) using three popular algorithms for RNA-seq analysis.
Then I wanted to know what the overlap was between the three algorithms, so I drew Venn diagrams with Venny.
Venn analysis of differentially expressed genes (FDR≤0.05) using three popular algorithms for RNA-seq analysis. (A) Up-regulated by Azacitidine. (B) Down-regulated by Azacitidine.


This shows that overall, the three algorithms were very similar, although edgeR was the least conservative and baySeq was the most conservative.

Here is the smear plot of the effect of azacitidine using DESeq. The version generated by edgeR was virtually identical.
Smear plot from DESeq for the comparison of azacitidine treated cells versus the control cells (GEO link). The x-axis shows the average expression level and the y-axis shows the fold change. Both axes are on the log scale. DEB server was used for statistical analysis.
As you can see, the range of fold changes is quite striking, with several genes with log2 fold changes less than -6, and only 2 genes with a log2 fold change greater than +2.

Below are the top 20 up- and down-regulated genes affected by azacitidine.
Top 20 up- and down-expressed genes by azacitidine determined by DESeq.

In the next post, we will draw a heatmap.

Data analysis step 4: Count aligned reads and create count matrix

In this RNA-seq analysis series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded, done QC and aligned the reads. In this post, we will count the reads aligning to exons and generate a count matrix for statistical analysis.

My overall strategy is to use bedtools to count reads over exons. To do that, I'll need a bed file containing all exons. I could use UCSC genes, but I'd rather use the latest Ensembl version.

Create exon bed file 

To begin, we need a list of known exons. I will use the latest Ensembl human gene annotation gtf file from their ftp server, extract the exon entries with grep, and rearrange the gene accession and symbol information into a temporary bed file. The temporary bed file is exploded on gene name and then overlapping exons are merged with bedtools.

#!/bin/bash
wget ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
zcat Homo_sapiens.GRCh38.76.gtf.gz \
| grep -w exon | tr '"' '\t' \
| cut -f1,4,5,10,16 | sed 's/\t/_/4' \
| sort -k4 > tmp.bed

mkdir Homo_sapiens.GRCh38_explode
cd Homo_sapiens.GRCh38_explode
awk '{print > $4}' ../tmp.bed

for i in `ls ENS*`
do cat $i | sortBed | mergeBed -nms \
| cut -d ';' -f1
done | sortBed > ../Homo_sapiens.GRCh38_exon.bed

Count reads with bedtools

In this part, we ask bedtools to count the reads aligned to the exons with the multicov tool. We use the options "-D" to allow duplicate reads and "-q 10" to only count reads that were uniquely mapped. The script then pipes the data to a perl script called "agg.pl" that aggregates the exon data to generate a final gene-wise count. The temporary count files from all the samples are then merged using awk to make a data matrix.

#!/bin/bash
BED=/path/to/Homo_sapiens.GRCh38_exon.bed
EXPT=Aza_AML3_tophat
MX=`echo $EXPT | sed 's/$/.xls/'`

for BAM in *bam
do
bedtools multicov -D -q 10 -bams $BAM -bed $BED \
| cut -f4- | tr '\t' ',' \
| perl agg.pl > ${BAM}.cnt &
done
wait

ls *bam | cat | tr '\n' '\t' \
| sed 's/^/\t/;s/$/\n/;s/.fq.sort.bam//g' > header.tmp

awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *cnt \
| tr -s ' ' | tr ' ' '\t' | cat header.tmp - > $MX

rm header.tmp

Here is the agg.pl script that you will need:

#!/usr/bin/perl
use warnings;
use strict;
use integer;

my %a;
while (<>) {
    my ($animal, $n) = /^\s*(\S+)\s*,\s*(\S+)/;
    $a{$animal} += $n if defined $n;
}
print "$_\t${a{$_}}\n" for sort keys %a;

Here is what the matrix file will look like:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000260835_RP11-468I15.1 0 0 0 0 0 0
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000185339_TCN2 1 2 1 0 1 1
ENSG00000255506_RP11-203F8.1 0 0 0 0 0 0
ENSG00000229899_AC084290.2 0 0 0 0 0 0
ENSG00000269933_RP3-333A15.2 0 0 0 0 0 0
ENSG00000216069_MIR875 0 0 0 0 0 0
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417

You will notice that there are a lot of genes with few or no reads. I generally like to remove genes with fewer than 10 reads per sample on average, using a small awk script called filt.sh.

#!/bin/bash
FILE=$1
# set the header row aside
sed -n 1p $FILE > header.txt
# prepend the mean count across samples (fields 2..NF hold the counts),
# keep rows averaging more than 10 reads, then restore the header
awk 'BEGIN {FS=OFS="\t"}
{
sum=0; n=0
for(i=2;i<=NF;i++)
     {sum+=$i; ++n}
     print sum/n,$0
}' $FILE \
| awk '$1>10' | cut -f2- | cat header.txt - > ${FILE}_filt.xls

Here is how you use the filt.sh script:

$ ./filt.sh CountMatrix.xls

Then have a look at the filtered matrix file:

SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000185650_ZFP36L1 479 467 481 89 77 96
ENSG00000116032_GRIN3B 62 45 62 43 54 52
ENSG00000236345_RP11-59D5__B.2 608 512 600 433 406 417
ENSG00000138069_RAB1A 4141 3568 3571 3250 3259 3189
ENSG00000127666_TICAM1 719 650 677 615 648 566
ENSG00000155393_HEATR3 1672 1291 1570 1511 1634 1641
ENSG00000146731_CCT6A 9400 7432 7743 8923 9322 9471
ENSG00000153012_LGI2 273 275 214 299 264 272
ENSG00000253716_RP13-582O9.5 78 53 62 75 77 60

At this stage, we can compare the number of genes that were discarded from the analysis with "wc":

$ wc -l Aza_AML3_tophat*
  19919 Aza_AML3_tophat_filt.xls
  62758 Aza_AML3_tophat.xls

So this shows that 42,839 Ensembl genes did not pass our detection threshold. By omitting these silent and lowly expressed genes, we will remove some noise and also get the added bonus of making the false discovery rate correction of p-values less severe when we get to the statistical analysis.

A peek at the mapping characteristics

In the figure below I look at the map quality for each sample, keeping in mind that mapQ<10 is not considered reliable (I used samtools view). I also look at the location of reads in terms of whether they originate from exons or elsewhere (analysing the .cnt files). It shows that 50-60% of the reads are mapped with a mapQ of 10 or greater, and that about 93% of those reads align to known exons.
Basic analysis of mapped reads. (A) Proportion of mapped reads with map quality of 10 or greater determined with Samtools. (B) Proportion of mapQ10+ reads mapped to exons compared to other regions.
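The mapQ proportions can be reproduced with samtools view, where -c counts reads and -q sets the minimum map quality (bam path as per the alignment post):

samtools view -c -q 10 SRR1171523_tophat/accepted_hits.bam   # reads with mapQ >= 10
samtools view -c SRR1171523_tophat/accepted_hits.bam         # all mapped reads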

In the next post I'll go through some differential expression analysis.

Monday, 8 September 2014

Data analysis step 3: Align paired end RNA-seq with Tophat

In this series of posts, we're going through an RNA-seq analysis workflow. So far, we've downloaded and inspected sequence quality. In this post, we will align the paired end data to the human genome with Tophat.

Part 1 is to get a suitable reference genome sequence. I chose to download the most recent human genome sequence from the Ensembl ftp site (Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa).

axel --num-connections=6 ftp://ftp.ensembl.org/pub/release-76/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz


Once downloaded, the genome needs to be uncompressed and indexed with bowtie2. This will take a few hours.

REF=Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
gunzip ${REF}.gz
bowtie2-build $REF $REF

The next step is to use Tophat to align the reads in paired-end mode. Tophat is smart enough to recognise the bz2 compression. The script also indexes the bam files and generates some familiar samtools statistics.

#!/bin/bash

REF=/path/to/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

for FQZ1 in SR*_1.fastq.bz2
do
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam \
> ${BASE}_tophat/accepted_hits.bam.stats
done

A scan through the align_summary.txt files shows that the alignment stats look reasonable.
Tophat alignment statistics for paired end RNA-seq data (GSE55123)
If you are working on a big multicore server with sufficient memory, you can speed up the alignment by running multiple alignment jobs in the background.

#!/bin/bash

REF=/path/to/genomes/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

for FQZ1 in SR*_1.fastq.bz2
do
(
FQZ2=`echo $FQZ1 | sed 's/_1/_2/'`
BASE=`echo $FQZ1 | cut -d '_' -f1`
tophat -p 4 -o ${BASE}_tophat $REF $FQZ1 $FQZ2
samtools index ${BASE}_tophat/accepted_hits.bam
samtools flagstat ${BASE}_tophat/accepted_hits.bam \
> ${BASE}_tophat/accepted_hits.bam.stats ) &
done ; wait

Data analysis step 2: quality control of RNA-seq data

In a previous post, we downloaded RNA-seq data from GEO (GSE55123). Let's continue with the processing of this data by performing QC analysis. In an earlier post, I went into a bit more detail, but here we will simply use fastx_quality_stats from the fastx toolkit to have a look at quality scores among the data sets.

The general strategy was to unzip the data on the fly, convert it to tabular format, select a random 1 million sequences and submit these to fastx_quality_stats. A look through the output file shows median quality scores >36, suggesting this dataset is of very high quality. Below are the code used and a graph of median quality scores throughout the run.

#!/bin/bash
for FQZ in *bz2
do echo $FQZ
pbzip2 -dc $FQZ | paste - - - - | shuf | head -1000000 \
| tr '\t' '\n' | fastx_quality_stats | cut -f1,6-9
done | tee quality_analysis.txt

Mean cycle base quality for GSE55123 RNA-seq data shows very high quality sequence reads.

Monday, 1 September 2014

Tabular sequence to fasta format

Here is a 1 line command to turn a list of sequences into a fasta file.

$ cat seq.txt 
CAACACCAGTCGATGGGCTGT
CAACACCAGTCGATGGGCTGTC
CAACACCAGTCGATGGGCCGT
TAGCTTATCAGACTGATGTTGA
TAGCTTATCAGACTGATGTTGAC

We use nl to number the lines, then sed to remove the leading whitespace and introduce the '>' character, then tr to convert the tab into a line break.

$ nl seq.txt | sed 's/^[ \t]*/>/' | tr '\t' '\n'
>1
CAACACCAGTCGATGGGCTGT
>2
CAACACCAGTCGATGGGCTGTC
>3
CAACACCAGTCGATGGGCCGT
>4
TAGCTTATCAGACTGATGTTGA
>5
TAGCTTATCAGACTGATGTTGAC