Friday, 15 May 2015

Run Linux from USB

Linux is definitely the favorite OS for bioinformatics, but if you ask most university or research institute IT departments they will likely be MS Windows-centric. Even 53% of visitors to this blog run Windows. Many IT departments that I've interacted with lock down their PCs so no software can be installed, leaving employees and students unable to run software to get their work done.

One option is to run virtualisation software such as VirtualBox or VMware to run Linux inside Windows, but that comes with reduced performance. Another, better option is to run Linux from a USB flash drive. Just as virtually all Linux distros can be booted off CD/DVD, they can also be booted off USB. The benefits are that you can run a "pure" Linux OS without modifying the existing host Windows OS. You'll also be able to take it and all the installed software wherever you go, and run it off any machine. Some Linux distros are specifically designed for running off USB (or SD) flash drives, by having a small footprint. Here are some good options (not exhaustive):
  • Porteus - This is my fave right now. Its very customisable. Has a small footprint (<300MB) and once loaded into RAM is very very fast. Unfortunately the package manager does not have a wide range (installing R from source is a real pain..)
  • Puppy Linux One of the most popular OSs for rejuvenating an old laptop or PC.
  • Crunchbang Had a cult following and is known for being fast, functional and beautiful at the same time. With Debian package manager, its a good choice. Under new management.
  • Tails - More for network and data security rather than bioinformatics use, by default it uses Tor network for anonymity and encryption of data on the flash drive. Again built on Debian, so will have the benefits of the apt-get package installer.
  • Other good choices include Arch, Ubuntu, Lite, among others.
Here is the general installation guide (may differ for each distro):
  • Download the iso from the distro homepage or via BitTorrent
  • Burn the disk image iso to a blank CD/DVD
  • Restart the PC so it will boot from the CD/DVD
  • Format the USB stick to ext4 if possible (FAT is also OK)
  • Go through the process of installing the OS selecting the USB drive as the destination. 
  • Enable USB boot on the PC by going into the BIOS menu by pressing F1/F11/F12 when you boot up and move USB drive to the top of the boot order.
  • Remove the CD/DVD and restart the computer. It will hopefully boot the OS from the USB.
So now there's really no excuse to give Linux a try :)

Download SRA data with Aspera command line utility

Aspera connect is NCBI's recommended data transfer client for large datasets >1GB. It uses the FASP protocol, here's a description from the NCBI guide.
"The FASP protocol from Aspera (<<UrlBlockedError.aspx>>) uses UDP, eliminating the latency issues seen with TCP, and provides bandwidth up to 1 gigabit per second (Gbps) to transfer data. It has a restart capability if data transfer is interrupted midstream and is well behaved, so if there is other data traffic on your network connections, it will back off in order to avoid starving other protocols. We have seen effective throughput up to 800 megabits per second (Mbps) to a single site.
The fasp protocol uses UDP port 33001-33009 for data transfer and you may need to contact your IT security staff if this port is not open to NCBI through your institutional firewalls.
NCBI is implementing Aspera for two use cases, occasional users who download files for direct use (Aspera Connect), and bulk users who will be downloading large amounts of data (ascp)."
NCBI recommends getting the Aspera transfer plugin for your browser from the website, but this is little help to power users who have large command line/programmatic job queues. You can download the stand-alone Unix utility using the following command (h/t to Matt Shirley @ BioStars).

sh <(curl -s
Or if that fails

chmod +x

Here is the usage message:

Usage: ascp [OPTION] SRC... DEST
          SRC to DEST, or multiple SRC to DEST dir
          SRC, DEST format: [[user@]host:]PATH

Then you can download SRA files with ease using a command such as this

ascp -l 1000m -O 33001 -T -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh /download/SRR901180.sra

-l MAX-RATE                     Max transfer rate
-O FASP-PORT                    UDP port used for FASP transport
-T                              Disable encryption
-i PRIVATE-KEY-FILE             Private-key file name (id_rsa)

Keep in mind that the Private-key file name may be different on your system.

On my fibre connection, I reached 370 MB/s, which is 20 fold better than my previous implementation of ftp using axel.

Thursday, 7 May 2015

Pathway analysis with ZGST

Pathway analysis is a common procedure for determining the regulation of groups of functionally linked genes. There are a lot of pathway analyses strategies available and I can break them down into these groups:

  1. Bioconductor/R-based: It makes sense to run pathway analysis in the same environment that runs the major differential expression software Limma, edgeR, DESeq, etc. These include CAMERA, MRGST, WilcoxGST, Roast, etc
  2. Commercial, GUI based. Such as Ingenuity IPA or MetaCore GeneGO.
  3. Java based such as GSEA and GSAA
  4. Web-based tools such as DAVIDWebGestalt, GO Enrichment analysis
Now these each have their advantages and disadvantages. I wanted to see whether I could make a tool pathway analysis tool that could be run with just one simple command and didn't require expertise in R. It would run quickly for >10k gene sets and have a lower memory footprint than GSEA.

I wrote a pathway analysis in Bash. I know. Its crazy. But it works. It works in Ubuntu, Fedora, Debian and Mint. Its called ZGST, for Ziemann's Gene Set Test. It is available on Sourceforge now.

Here's how it works:
  1. User supplies a gene set matrix in GMT format along with a differential expression spreadsheet specifying p-value and fold change.
  2. ZGST ranks scores each gene based in the signed log p-value. Most up-regulated genes by p-value at the top and most down-regulated at the bottom.
  3. Program then calculates the sum of the ranks for each gene set if the gene set is up-regulated, the value is positive, otherwise the value is negative.
  4. Next the same number of genes are selected at random from the list and the rank sum is compared to that of the gene set score. This process is repeated 1000+ times and that is used to estimate a p-value that the gene set is deregulated.
  5. False discovery rate p-value correction.
  6. Generate tables of deregulated genes, enrichment plot and volcano plot for a select number of gene sets.
Here are some features:
  1. Performs rank based (non-parametric) by default but can perform weighted analysis as an option
  2. Can perform a bootstrap sensitivity analysis by randomly discarding 10% of genes during each permutation.
  3. Runs in parallel; the config file specifies the number of threads to use.
  4. Very low memory footprint.
There are some limitations:
  1. The differential gene list needs to contain gene identifiers that match the GMT file.
  2. Currently limited to Linux, but early testing suggests it could be easily ported to MacOSX. Windows fans will need VirtualBox to run one of the above Linux distros.
  3. Dependencies: GNU parallel, gnuplot, R is only required for the weighted analysis.

Test run

I selected GSE20602, an experiment comparing 4 control kidney samples with 14 kidney samples from patients with nephrosclerosis. Then I downloaded MSigDB v5. Then ZGST can be run quite simply:

 ./ <-c configfile.config>
 ./ <-d DGE.xls,GeneIDcol,FCcol,Pvalcol> <-g gmtfile.gmt>
 ./ -h

So here's an example of my GEO2R generated analysis.

./ -d GSE20602.txt,7,6,3 -g msigdb.v5.0.symbols.gmt

Alternatively, a config file can be used to specify more options:

./ -c config1.cfg

Where an example config file looks like:



 Total gene sets: 10153
 Gene sets analysed: 9535
 Up-regulated gene sets nom 95 conf: 1467
 Down-regulated gene sets nom 95 conf: 2142
 Up-regulated gene sets FDR 95 conf: 1100
 Down-regulated gene sets FDR 95 conf: 1684

So here's the top 10 most deregulated pathways in the nephrosclerosis dataset:

Here's what the enrichment plots look like.
Example enrichment plot produced by ZGST
And it also outputs volcano plot too, with the members of this gene set highlighted in blue.

Example volcano plot produced by ZGST.


On my 8-core workstation, ZGST ran in 3m10s seconds, compared to 16m49s for GSEA making ZGST about 5x faster. GSEA also used 1.8GB RAM compared to negligible RAM for ZGST.

Then I compared the gene set results by plotting the GSEA (classic) enrichment score against the ZGST rank sum score. Expectedly, the results were very similar, with some interesting differences about the middle of the plot. The shape of the plot could be due to how GSEA calculates the ES and how the ES is never very small (<0.01).

Lastly, I looked at the overlaps of the statistically significant (FDR<0.05) gene sets, which confirms the large overlap in the gene set identification. I did notice 33 gene sets that were discordant between the two methods.
I'll discuss these differences in more detail in another post.

In conclusion, it is possible to do valid pathway analysis using a shell script program 5x faster than the current standard (GSEA). The approach used here could easily be applied in another language (ie. Perl/Python) to improve portability and speed even further.