Friday, 24 April 2015

Functions and GNU parallel for effective cluster load management

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and then submits more jobs as others are completed. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. In bash, here is a simple example, which declares a function consisting of a chain of piped commands, and then executes 4 jobs in parallel, until all of *files.txt have been processed.

my_func2() {
cmd1 $INPUT $VAR1 | cmd2 | cmd3 > ${1}.out
export -f my_func
parallel -j4 my_func ::: *files.txt

Nice, but now for something relevant to bioinformatics. Here is a bwa mem wrapper that pulls in all fastq files in the current working directory (including gzip and bzip2 compressed) and processes them in parallel (four at any time). Because each bwa job uses 4 cores at any time, the maximum CPU usage will be 16.

bwamem() {
GZ=`echo $FQZ | grep -c '.gz$'`
BZ2=`echo $FQZ | grep -c '.bz2$'`
if [ "$GZ" -eq "1" ] ; then
FQ=`echo $FQZ | sed 's/.gz$//'`
pigz -dc $FQZ \
| $BWA mem -t $CPU $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
elif [ "$BZ2" -eq "1" ] ; then
FQ=`echo $FQZ | sed 's/.bz2$//'`
pbzip2 -dc $FQZ \
| $BWA mem -t $CPU $REF - \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
echo input file not compressed
FA=`head -1 $FQZ | grep -c '^>'`
FQ=`head -1 $FQZ | grep -c '^@'`
if [ "$FQ" -eq "1" -o "$FA" -eq "1" ] ; then
$BWA mem -t $CPU $REF $FQ \
| samtools view -uSh - \
| samtools sort - ${FQ}_bwamem.sort
echo Error. Unknown file format. Exiting.
samtools index ${FQ}_bwamem.sort.bam
export -f bwamem
parallel -j4 bwamem ::: *fastq.gz *fastq.bz2 *.fq *fastq

If you need to something more sophisticated, you can transfer environment variables and functions too. If you have a cluster of servers where you have ssh login without password, GNU parallel can direct jobs to those connected computers if directed. To do it, just add the name of the server to the parallel command.

parallel -j4 -S server1,server2,server3 bwamem ::: *fastq.gz

Further watching/reading:

Friday, 10 April 2015

What is a CpG shore and how to I get them all?

CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development (see reference).
Image courtesy of
To identify CpG shores, you first need a list of coordinates of CpG islands. If you have a linux computer with Emboss tools installed, you can determine CpG island positions with the cpgplot command. (if you don't have Emboss installed you can get CpG island coordinates from the UCSC table browser just be wary of chromosome naming ie, "chr1" vs "1")

cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout genome.fa | head

Will produce a file like this one:

##gff-version 3
##sequence-region 1 1 59999934
#!Date 2015-04-10
#!Type DNA
#!Source-version EMBOSS
1 cpgreport sequence_feature 10469 11240 1299 + . ID=1.1
1 cpgreport sequence_feature 11284 11301 37 + . ID=1.2
1 cpgreport sequence_feature 11343 11347 32 + . ID=1.3
1 cpgreport sequence_feature 11404 11405 17 + . ID=1.4
1 cpgreport sequence_feature 11434 11435 17 + . ID=1.5

Now we need to make it bed file by just cutting the chromosome, start and end position.

cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout test.fa | grep -v '#' | cut -f1,4,5 > genome_cpg_islands.bed

Next step is to use Bedtools slop to extend the coordinates of the CpG islands by 2000 in each direction.

The catch is that Bedtools requires a file which contains the lengths of each chromosome so that it doesn't create coordinates outside of the possible range. To do that is relatively easy:

samtools faidx genome.fa

Will generate a file containing this data with a ".fai" suffix. The first 2 columns of the .fai file contain the chromosome name and length. Send that text to a new file.

cut -f-2 genome.fa.fai > genome.fa.g

Now we can generate the shores:

slopBed -i cpg_islands.bed -g genome.fa.g -b 2000 | mergeBed \
| subtractBed -a - -b cpg_islands.bed > cpg_shores.bed

This should work if you have a high memory system like a server but you might struggle on your workstation, so here's the same commands but this time processing chromosomes one-by-one.


#bash script to generate BED file of CpG islands and shores across whole genome
#depends on samtools and bedtools.

CGI=`echo $FA | sed 's/.fa/_cgi.bed/'`
CGS=`echo $FA | sed 's/.fa/_cgs.bed/'`

samtools faidx $FA
echo indexed
for CHR in `grep '>' $FA | tr '\t' ' ' | cut -d ' ' -f1 | tr -d '>'`

samtools faidx $FA $CHR \
| cpgreport -score 17 -outfile $RPT -outfeat /dev/stdout /dev/stdin \
| grep -v '#' | cut -f1,4,5
done > $CGI

cut -f-2 ${FA}.fai > ${FA}.g

slopBed -i $CGI -g ${FA}.g -b 2000 | mergeBed \
| subtractBed -a - -b $CGI > $CGS

Its east to run as well.

./ Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa

Simultaneous CpG island and shore detection for the human genome took 5 mins.

real 4m59.186s
user 4m29.325s
sys 0m13.124s