Monday, 19 November 2012

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of collossal data sets, here's something which you'll find useful, a parallel zip archiver called PBZIP2. OK so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2  -p8 FC001_sequence.fq
To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:
pbzip2  -dk -p10 FC001_sequence.fq.bz2
To test the integrity of a compressed file:
pbzip2  -t FC001_sequence.fq.bz2

How to compress an entire directory:

tar cv directory | pbzip2 > directory.tar.bz2

Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist []
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filename> <filename2> <filenameN>
 -1 .. -9        set BWT block size to 100k .. 900k (default 900k)
 -b#             Block size in 100k steps (default 9 = 900k)
 -c,--stdout     Output to standard out (stdout)
 -d,--decompress Decompress file
 -f,--force      Overwrite existing output file
 -h,--help       Print this help message
 -k,--keep       Keep input file, don't delete
 -l,--loadavg    Load average determines max number processors to use
 -m#             Maximum memory usage in 1MB steps (default 100 = 100MB)
 -p#             Number of processors to use (default: autodetect [32])
 -q,--quiet      Quiet mode (default)
 -r,--read       Read entire input file into RAM and split between processors
 -t,--test       Test compressed file integrity
 -v,--verbose    Verbose mode
 -V,--version    Display version info for pbzip2 then exit
 -z,--compress   Compress file (default)
 --ignore-trailing-garbage=# Ignore trailing garbage flag (1 - ignored; 0 - forbidden)
Example: pbzip2 -b15vk myfile.tar
Example: pbzip2 -p4 -r -5 myfile.tar second*.txt
Example: tar cf myfile.tar.bz2 --use-compress-prog=pbzip2 dir_to_compress/
Example: pbzip2 -d -m500 myfile.tar.bz2
Example: pbzip2 -dc myfile.tar.bz2 | tar x