Pages

Monday, 19 November 2012

What? You're not using parallel compression yet?

Just in case you guys are struggling with (de)compression of collossal data sets, here's something which you'll find useful, a parallel zip archiver called PBZIP2. OK so it's not going to improve the quality of your sequencing data, but it could save you a bit of time.

Say I have a fastq file (FC001_sequence.fq) which needs compression on 8 threads:
pbzip2  -p8 FC001_sequence.fq
To decompress (-d) a file (FC001_sequence.fq.bz2) and keep (-k) the archived version on 10 threads:
pbzip2  -dk -p10 FC001_sequence.fq.bz2
To test the integrity of a compressed file:
pbzip2  -t FC001_sequence.fq.bz2

How to compress an entire directory:

tar cv directory | pbzip2 > directory.tar.bz2


Here is the help page with more examples:
Parallel BZIP2 v1.1.5 - by: Jeff Gilchrist [http://compression.ca]
[Jul. 16, 2011]               (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>
Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tVz] <filename> <filename2> <filenameN>
 -1 .. -9        set BWT block size to 100k .. 900k (default 900k)
 -b#             Block size in 100k steps (default 9 = 900k)
 -c,--stdout     Output to standard out (stdout)
 -d,--decompress Decompress file
 -f,--force      Overwrite existing output file
 -h,--help       Print this help message
 -k,--keep       Keep input file, don't delete
 -l,--loadavg    Load average determines max number processors to use
 -m#             Maximum memory usage in 1MB steps (default 100 = 100MB)
 -p#             Number of processors to use (default: autodetect [32])
 -q,--quiet      Quiet mode (default)
 -r,--read       Read entire input file into RAM and split between processors
 -t,--test       Test compressed file integrity
 -v,--verbose    Verbose mode
 -V,--version    Display version info for pbzip2 then exit
 -z,--compress   Compress file (default)
 --ignore-trailing-garbage=# Ignore trailing garbage flag (1 - ignored; 0 - forbidden)
Example: pbzip2 -b15vk myfile.tar
Example: pbzip2 -p4 -r -5 myfile.tar second*.txt
Example: tar cf myfile.tar.bz2 --use-compress-prog=pbzip2 dir_to_compress/
Example: pbzip2 -d -m500 myfile.tar.bz2
Example: pbzip2 -dc myfile.tar.bz2 | tar x