Wednesday, 19 June 2013

Generating a custom gmt file for gene set analysis

Pathway and gene set analysis is a common procedure for interpretation of RNA-seq or other genome-wide expression assays. Most of the time, we use GSEA to tell us whether our gene sets of interest are up- or down-regulated. We can use gene sets from KEGG, Reactome, GOMSigDB and other sources, but you can also generate your own gene sets. The format used for GSEA is gmt. I'm going to take you through two examples of generating custom gene sets:

Generate gene sets from published data sets using GEO2R

Let's say you're interested in the transcription factor STAT1. I found a dataset in GEO called "Knockdown of STAT1 in SCC61 tumor xenografts leads to alterations in the expression of energy metabolic pathways", which has a paper in BMC Med. Most uploaded array data sets can be reanalysed with GEO2R, which runs the array analysis tool Limma but this is embedded in the webpage and has a GUI which makes it very accessible for biologists.

Click this link to go directly to the GEO2R analysis page for this data set. You will see that there are four sample groups STAT1 knock-down irradiated, STAT1 knock-down untreated, Control vector irradiated and Control vector untreated. For this example, select the Control vector untreated samples, click define groups, enter in "Control" and then select the STAT1 knock-down untreated, click define groups, enter in "STAT1KD" as I have done in the below image.

Then click on the "Top250" button and this will kick off the comparison of the samples and show the top 250 differential array probes. You can then click the "save all results" button and save to a file.  You might notice that some probes have multiple associated genes, in which case we can include both/all the gene names. To get to a gmt format, we need the gene list to be organised horizontally with the two left-most columns set aside for metadata. Normally the first column is the gene set name and the second is a link to a paper or GEO accession. In our case to make it streamlined, I've included some metadata (ie, the GEO accession) in the gene set name, while leaving only "NA" in the second column.

In this example, because the adjusted p-values aren't that great, I've selected a non-adjusted p-value threshold of 0.001 which gives us a few hundred genes in either direction if you run the script below. You may also want to filter on fold change if that is part of your standard pipeline. If you don't like using command line scripts, then it is still possible to make a gmt file manually using spreadsheet filtering and transposing.

for DGE in STAT1-Knockdown_GSE15845.xls
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6>0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_UP\tNA\t/" \
| sed 's/$/\n/' > ${DGE}.gmt
tr -d '"' < $DGE | awk '$3<=0.001' | awk '$6<0'| cut -f7 \
| tr -s '/' | tr '/' '\n' | sort -u | tr '\n' '\t' \
| sed "s/^/${DGE}_DN\tNA\t/" \
| sed 's/$/\n/' >> ${DGE}.gmt

Generate gene sets of human microRNA-mRNA targets from StarBase

StarBase is a repository for CLIP-Seq and degradome seq and provides a database for potential microRNA-mRNA target information and is a useful resource for predicting the function of microRNAs. If you go to the StarBase website, go to the miRNA-target relationship page. Leave all the pull-down menus blank and hit the "Search" button and it will bring up the complete StarBase microRNA-mRNA target set. Hit the "export as excel" button and it will be downloaded.

You can run the below script and it will generate a gmt containing sets of genes which are targeted by specific microRNAs. You will need to check the name of the starbase csv file.

for MIR in `sed -n '1!p' $STARBASE | cut -d ',' -f1 | sort -u `
echo $MIRgrep -w $MIR $STARBASE | cut -d ',' -f2
done | tr '\n' '\t' | sed 's/hsa-miR/>hsa-miR/g' \
| tr '>' '\n' > Starbase_Human.gmt

The result from the first few lines and columns looks good and is ready for GSEA.

hsa-let-7a      SLC45A4 CNOT4   SMCR7L  APPBP2  CFL2    KLF9
hsa-miR-1       SMAD5   VGLL4   C9orf82 NCOA3   PPP2R5C RNF40   PHF20L1 PINX1   MINK1
hsa-miR-100     RAP1B   EPC2    ICMT    MBNL1   SMARCA5 C17orf85        FZD8    MBNL1   SLC35E1
hsa-miR-101     ZNF238  ARID5B  APPBP2  AGFG1   STMN1   G3BP2   QKI     G3BP2   SUB1
hsa-miR-103     FLCN    SMAD5   USP42   KPNA1   NDEL1   AGFG1   SNTB2   NPTN    IPO5
hsa-miR-105     APPBP2  TMEM30A CREB5   MAP4K3  CDKN1A  CLIP1   ATXN1   ZCCHC3  MORF4L2
hsa-miR-106a    ZBTB9   E2F1    CNOT4   SOX4    SMAD5   RAPGEFL1        OCRL    CFL2    CHD9
hsa-miR-106b    ZBTB9   E2F1    ZNF238  CNOT7   CNOT4   TBX3    SOX4    SMAD5   RAPGEFL1
hsa-miR-107     FLCN    SPRYD3  SMAD5   USP42   KPNA1   NDEL1   AGFG1   SNTB2   NPTN
hsa-miR-10a     ACVR2A  SOX4    APPBP2  LANCL1  HSPC159 PAPD5   RAP2A   LRP12   SLC38A2
hsa-miR-10b     ACVR2A  SOX4    APPBP2  LANCL1  AIM1L   HSPC159 PAPD5   RAP2A   LRP12
hsa-miR-1178    E2F1    G3BP2   CD164   PANK3   PPP3CB  SPTLC1  KCNG3   FAM98A  PGK1
hsa-miR-1179    PPTC7   CTNNB1  TMEM30A PAN3    PLEKHA1 TSPYL1  ZNF644  ARID2   CNIH
hsa-miR-1180    MKNK2   NPM1    CCDC6   MKNK2   VEGFB   HOXA10  SESN2   UTP15   KIAA1267
hsa-miR-1184    SMAD5   AGFG1   PTP4A2  SIX4    RAB1A   MAML1   PPP4R1  NR2C2   FMNL2
hsa-miR-1200    ACTL6A  FAM178A SORT1   ZCCHC3  PAPD5   C16orf63        MAP2K6  SNRPB   RASSF

These are just two obvious examples of data mining for pathway analysis, but there are many more, such as directly searching through journal articles. If you have come up with a cool way of mining gene sets, I would love to hear about it.

Friday, 14 June 2013

Generate an RNA-seq count matrix: Part 2 - count over exons

In the previous post, I mentioned that in order to do counting, we need some regions (exons) to count over. We generated a BED file based on GENCODE exons which looks like this:

chr9 33248310 33248419 ENSG00000107262.11_BAG1
chr9 33252477 33255306 ENSG00000107262.11_BAG1
chr9 33255862 33255925 ENSG00000107262.11_BAG1
Now we can actually count how many reads from each of our samples maps to these exons. To do that there are a few different programs:
For this demo, I will outline use of BedTools for RNA-Seq. I think BedTools nulticov is a good option for 3 main reasons:
  1. It can handle multiple bam files at the same time, producing a count table/matrix which is pretty much ready for statistical analysis.
  2. It is aware of mapQ scores of alignments, meaning that if the program comes across non-unique alignments, you can set it to ignore or include those reads.
  3. The BedTools package is well documented, widely used and the code under constant maintenance.
The first stage in the count is to count over each exon. While multicov allows for reading of multiple files at the same time, this is not done in parallel, so I generally prefer to make a count file for each bam individually and join them at the end. 

for bam in `ls *.bam | sed 's/.bam//' `
bedtools multicov -D -q 10 -bams ${bam}.bam -bed $BED \
| cut -f4,5 > ${bam}.tmpcnt &
So now we have a temporary exon-wise count file which looks like this:

ENSG00000001167.10_NFYA 19
ENSG00000001167.10_NFYA 48
ENSG00000001167.10_NFYA 19
ENSG00000001167.10_NFYA 31
So you can see that the exon coordinates have been removed and we're now looking at counts for each exon of the gene. The next step is to aggregate the data, for which we can use a little perl script:

use warnings;
use strict;
use integer;
my %a;
while (<>) {
   my ($gene, $n) = /^\s*(\S+)\s*\t\s*(\S+)/;
      $a{$gene} += $n if defined $n;
print "$_\t${a{$_}}\n" for sort keys %a;

Put the above script into a text file, make it executable with chmod and then it can be used as follows to generate a gene-wise count file

for bam in `ls *.bam | sed 's/.bam//'`dosort ${bam}.tmpcnt | perl > ${bam}_count.txt &donewait

Ok so now we are ready to join these files together to make the matrix. This can also be done a few ways, but I've chosen to use an awk one liner.

awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' *count.txt \| tr ' '  '\t' | cut -f1,3- > matrix.tbl
You will then need to attach the sample names to the appropriate columns and now you have a full count matrix which is basically ready for statistical analysis:
Gene_Name                 Ctrl1 Ctrl2 Trt1 Trt2
ENSG00000266033.1_AL606519.1 0 0 0 0
ENSG00000165195.9_PIGA         39 12 43 33
ENSG00000250813.1_SERF1AP1 0 0 0 0
ENSG00000066813.10_ACSM2B 0 0 0 1
ENSG00000249719.1_RP3-486L4.3 12 48 16 8
ENSG00000266336.1_AL356019.1 0 0 0 0
ENSG00000228142.2_RP11-180I4.2 0 0 0 0
ENSG00000146966.8_DENND2A 105 97 77 99
ENSG00000254295.1_CTC-308K20.2 5 0 5 0

Although you may want to remove genes which do not have any reads, or remove genes which have fewer than a threshold number. We commonly remove genes which have fewer than 10 reads in each sample on average. This is straightforward for a simple 2 way comparison, but gets more complicated when dealing with more sample groups - for instance 2 treatment types in 2 cell lines.

Generate an RNA-seq count matrix: Part 1 - GTF2BED

Previously we took a look at how to align mRNA-Seq reads to a genome with BWA and Tophat. The next step in the pipeline is to count reads which align on exons and then to get them into a count matrix that we can submit for statistical analysis with EdgeR or another program.

At this point in the process, we need to decide which annotation set to use. For human, we commonly use RefSeq and we are beginning to use GENCODE annotation because of its very comprehensive nature. Most of the programs which count reads require a BED file to indicate which genomic regions should be counted.

For this blog post, I will show you how we generate an exon BED file for GENCODE genes. The script below will download the GTF file and generate a temp file list of all exons present in the GTF in the form of a BED file (chromosome, start, end) with the gene accession number and gene name pasted into one field.

chr12 38710380 38710866 ENSG00000175548.4_ALG10B
chr12 38710564 38710866 ENSG00000175548.4_ALG10B
chr12 38710569 38712260 ENSG00000175548.4_ALG10B
chr12 38710641 38710840 ENSG00000175548.4_ALG10B
chr12 38712063 38712260 ENSG00000175548.4_ALG10B
chr12 38713963 38714252 ENSG00000175548.4_ALG10B
chr12 38713963 38715129 ENSG00000175548.4_ALG10B
chr12 38713963 38716388 ENSG00000175548.4_ALG10B
So you might think that this output is fine to use, but there is one complication. The fact that in the above example and many others throughout the genome that there are overlapping exons. If we don't collapse them, then reads assigned to those overlapping regions could be counted multiple times and provide an inaccurate profile of expression.

The solution is to merge overlapping exons right? Well yes, but there is another major caveat in that exons from different genes can sometimes have overlapping exons and we can't collapse exons from different genes, otherwise we wouldn't be able to distinguish them. The solution is to merge exons which only belong to the same gene. If there is a read which aligns to a region with more than one annotation, then it is counted twice because we can't differentiate between the two genes at that position.

As you can see in the script below, the program cycles through all unique genes (over 55k) and then grabs exons matching to each and then does a merge of those BED regions with bedtools. It then adds the merged exons to a final BED output file. As there are over 1 million exons, this process is quite slow, but we only need to perform this task each time a new annotation set is made, so being inefficient is not a big deal. The alternative and faster option to this method is to explode the list of exons based on gene name and then perform a bedtools merge on each of the 55k files and collate the results. Either way at the end of the day day you will get the same result.

#input parameters
#Prepare annotation file
wget $GTFURL
gunzip -f ${GTF}.gz
grep exon $GTF |tr ';' '\t' |tr ' ' '\t' |cut -f1,4,5,10,22 \
| tr -d '"' |sed 's/\t/_/4' |sort -u  > $TMP
for GENE in `cut -f4 $TMP | sort -u`; do
grep $GENE $TMP |tr '\t' '%' |sort -u |tr '%' '\t' |sed 's/\t/_/4' \
| bedtools sort |bedtools merge -nms |cut -d ';' -f1 >> $BED

You will notice that the merged exons are annotated only with the gene name and not with the exon ID. The reason behind this is that our group is mostly interested in overall up- and down-regulation of genes and not necessarily in differential exon splicing, and the annotation we have used allows for easy aggregation of in the later stage of the pipeline, which I will describe shortly.