UNIX utility: split

Even the simplest types of analysis (e.g. counting lines) can take a really long time with the colossal data sets coming from sequencing experiments these days. Parallelising jobs over multiple processors/cores/threads can make things faster, but the standard UNIX tools we commonly use generally don't have the built-in capacity to run jobs over many processors. Every so often, I'll present some tools which can help in making NGS analysis jobs parallel, and hopefully give you some ideas for speeding up your own jobs.

One of the most basic parallelisation tricks is to split a file into smaller fragments, process those fragments in parallel, and then concatenate/merge the final product. To split a file, you can use a tool such as sed to selectively output lines; another option is the split command, which divides the file into pieces of a specified number of lines.
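As a small illustration of the sed approach, `sed -n 'M,Np'` prints only lines M through N. The file names here are made up for the demo; on a real fastq you would pick ranges that are multiples of 4.

```shell
# Demo input (a stand-in for a real fastq file): 12 numbered lines
seq 1 12 > demo.txt

# sed -n 'M,Np' suppresses default output (-n) and prints only
# lines M through N; here, the first 8 lines become one fragment
sed -n '1,8p' demo.txt > demo.part1

# The fragment now holds exactly 8 lines
wc -l < demo.part1
```

This works, but you'd need one sed invocation per fragment with hand-computed line ranges, which is why split is usually more convenient.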

Here's an example of splitting a fastq file (called sequence.fq) into 1 million line fragments:
split -l 1000000 sequence.fq sequence.fq.frag.
The output files will look like this:
sequence.fq.frag.aa
sequence.fq.frag.ab
sequence.fq.frag.ac
sequence.fq.frag.ad
sequence.fq.frag.ae
sequence.fq.frag.af
sequence.fq.frag.ag
sequence.fq.frag.ah
sequence.fq.frag.ai
sequence.fq.frag.aj
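Because the suffixes (aa, ab, ac, ...) sort lexicographically in the same order they were written, concatenating the fragments with a shell glob reproduces the original file exactly. A quick round-trip check on a small stand-in file (names here are illustrative):

```shell
# Stand-in for sequence.fq: 100 numbered lines
seq 1 100 > input.txt

# Split into 40-line fragments: input.txt.frag.aa, .ab, .ac (40, 40, 20 lines)
split -l 40 input.txt input.txt.frag.

# Glob expansion sorts the suffixes, so cat rebuilds the original order
cat input.txt.frag.* > rejoined.txt
cmp -s input.txt rejoined.txt && echo "identical"
```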

Be careful: fastq sequence files have an unusual structure of four lines per sequence, rather than the one-line-per-sequence layout of formats such as sam/bam, so you want to make sure the number of lines per fragment is a multiple of 4. Now you're free to run batches of smaller jobs over many processors.
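A minimal sketch of the whole split / process-in-parallel / merge pattern, using the shell's own job control. The file names are illustrative, and the per-fragment "analysis" here is just wc -l; swap in your real tool.

```shell
# Stand-in input: 12 lines, i.e. three 4-line fastq-style records
seq 1 12 > reads.txt

# Fragment size is a multiple of 4, so records are never split mid-read
split -l 4 reads.txt reads.frag.

# One background job per fragment (the ?? glob matches only the
# two-letter split suffixes, not the .count files created below)
for f in reads.frag.??; do
  wc -l < "$f" > "$f.count" &
done
wait   # block until every background job has finished

# Merge step: sum the per-fragment line counts
awk '{s += $1} END {print s}' reads.frag.??.count   # prints 12
```

On a real cluster you'd typically submit each fragment as a separate scheduler job rather than a shell background job, but the split/merge bookkeeping is the same.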
