Showing posts from April, 2015

Functions and GNU parallel for effective cluster load management

I've been a fan of GNU parallel for a long time. Initially I was sceptical about using it, preferring to write huge for loops but over time I've grown to love it. The beauty of GNU parallel is that it spawns a specified number of jobs in parallel and then submits more jobs as others are completed. This means that you get maximum usage out of the CPUs without overloading the system. There are many excuses for not using it, but perhaps the only valid one is that you have Sun Grid Engine or another job scheduler or manager in place.

GNU parallel is particularly useful when used with functions. Functions are subroutines that may be repeated many times to complete a piece of work. In bash, here is a simple example, which declares a function consisting of a chain of piped commands, and then executes 4 jobs in parallel, until all of *files.txt have been processed.

my_func2() {
cmd1 $INPUT $VAR1 | cmd2 | cmd3 > ${1}.out
export -f my_func
parallel -j4…

What is a CpG shore and how to I get them all?

CpG shores are the regions immediately flanking and up to 2 kbp away from CpG islands. These regions are interesting because methylation they are variably methylated in cancer and development (see reference).
To identify CpG shores, you first need a list of coordinates of CpG islands. If you have a linux computer with Emboss tools installed, you can determine CpG island positions with the cpgplot command. (if you don't have Emboss installed you can get CpG island coordinates from the UCSC table browser just be wary of chromosome naming ie, "chr1" vs "1")

cpgreport -score 17 -outfile genome.fa.cpg -outfeat /dev/stdout genome.fa | head

Will produce a file like this one:

##gff-version 3
##sequence-region 1 1 59999934
#!Date 2015-04-10
#!Type DNA
#!Source-version EMBOSS