Ingenuity, AmiGO and GeneGO. The selection of significance values is largely arbitrary, but it is common to set the threshold at FDR adjusted p<0.05. Modifying the threshold will alter the enrichment results considerably. The statistical tests employed include hypergeometric, Fisher's exact and Chi-squared.
In contrast to the above methods that require lists of genes that pass arbitrary thresholds, GSEA is a tool that uses every datapoint in its statistical algorithm. In the "classical" method genes are ranked by from most up-regulated to most down-regulated. The rank metric itself varies, but two valid methods are to use signed p-value, or lower 90% confidence interval of the fold change. The basis of the test is to assess whether members of a gene set appear enriched at one end of the profile. To test the enrichment, GSEA performs permutations of the profile, calculating the enrichment of the gene set a thousand or more times to estimate p-values empirically. Other tools that use the entire profile perform analytical tests like the Wilcoxon and Kolmogorov-Smirnov test.
So now that we understand the main differences in the algorithms, we can start to compare the results from these two techniques. I revisited a dataset I've discussed before, azacitidine treatment in AML3 cells profiled by RNA-seq. I will use the differential RNA-seq expression data generated by DESeq as a start point to compare DAVID and GSEA.
For DAVID, I made a background file consisting of all 19,030 genes detected in the experiment above the detection threshold. The significant (FDR<0.05) gene lists in the up (2502 genes) and down (2907 genes) direction were separately submitted to DAVID 6.7 for analysis, focusing on KEGG pathways only.
For GSEA, I used the same rank metric procedure described previously. I used the classical GSEA algorithm with the gsea2-2.2.2.jar executable, using KEGG gene sets from MSigDB v5.1.
In the graph below, I determined the overlap of statistically significant gene sets (FDR<0.05) separately for up and down regulated pathways. I also included an intersection of DAVID results when no background is provided (for curiosity only, not a recommended procedure).
|DAVID vs GSEA using the same gene expression profile. Numbers represent gene sets that are significantly enriched (FDR<0.05).|
In addition to the more sensitive algorithm, DAVID's gene sets and gene annotations are really out of date. The last update of DAVID's many data sources being in mid to late 2009. As such, many gene accessions from Ensembl may not be supported. Indeed, from my background file that contained 19,030 only 13,488 were recognized by DAVID. The lack of updated gene sets is also a concern, as is the ever growing number of citations that DAVID still receives.
|DAVID still receives thousands of citations per year despite underlying data being woefully out of date.|
JA Timmons, KJ Szkop, IJ Gallagher. Multiple sources of bias confound functional enrichment analysis of global-omics data. Genome biology, 2015. 16:186
Lina Wadi, Mona Meyer, Joel Weiser, Lincoln D Stein, Juri Reimand. Impact of knowledge accumulation on pathway enrichment analysis. bioRxiv. doi: http://dx.doi.org/10.1101/049288
EDIT: In May 2016, DAVID was updated!