Count ChIP-seq reads across a promoter with featureCounts

March 16, 2015

When analyzing ChIP-seq data, we sometimes like to focus on promoters as a proxy of gene activation especially for histone marks that are normally associated with gene activation such as H3K9 acetylation or H3K4 tri-methylation.

FeatureCounts has emerged as a competitor to HTSeq and BedTools MultiCov for counting reads across features (ie, exons, genes, promoters). FeatureCounts is great for RNA-seq because it can natively read GTF annotation files, but can't read BED format (that we use a lot in ChIP-seq analysis).

In order to make featureCounts work, we need to extract the TSS coordinates and convert to a BED-like format that it can read (SAF format). In the below script, I extract the positions of the TSSs using a grep search followed by stripping only the neccessary information from the GTF, then using awk and BedTools to expand the region around the TSS by 3kbp.

#!/bin/bash
#Generate an SAF file for TSS intervals from a GTF

#Specify some parameters
GTF=Homo_sapiens.GRCh38.78.gtf
FA=Homo_sapiens.GRCh38.dna.primary_assembly.fa
CHR=Homo_sapiens.GRCh38.dna.primary_assembly.fa.g
TSS=3000
FWD=tmp+
REV=tmp-
SAF=Homo_sapiens.GRCh38.78_3kbTSS.saf
COMPL=compl
SAFCOMPL=Homo_sapiens.GRCh38.78_3kbTSS+compl.saf

#Extract TSS coordinates for fwd strand genes
grep -w transcript $GTF | awk '$7=="+"' \
| cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$2,$2+1,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"+"}' > $FWD

#Extract TSS coordinates for reverse strand genes
grep -w transcript $GTF | awk '$7!="+"' \
| cut -d '"' -f-2,10 | cut -f1,4,5,9 | sed 's/gene_id "//' \
| tr '"' '_' | awk '{OFS="\t"} {print $1,$3-1,$3,$4}' \
| bedtools slop -i - -g $CHR -b $TSS \
| awk '{OFS="\t"} {print $4,$1,$2,$3,"-"}' > $REV

cat $FWD $REV > $SAF
rm $FWD $REV

#Account for the non-promoter regions
cut -f2-4 $SAF | sortBed | mergeBed -i - \
| complementBed -i - -g $CHR \
| awk '{OFS="\t"} {print "NA_"$1":"$2"-"$3,$0,"+"}' \
| cat $SAF - > $SAFCOMPL

The last part of the script generates SAF coordinates for complement regions, ie., the parts of the genome that are not located near TSSs. This is useful because sometimes the promoters only contain a small fraction of the original data. For my H3K9ac ChIP-seq, only 10-20% of reads are placed near promoters. When including complement regions, I'm able to include over 80% of reads.

Here's how I use featureCounts to count reads in meta-features (if a gene has many TSSs, then the counts are aggregated). You'll notice that I run featureCounts twice once without the complement regions and once with the complement regions.

#!/bin/bash

SAF=Homo_sapiens.GRCh38.78_3kbTSS.saf

OUT=Exercise_chip_3kbpTSS.txt

featureCounts -Q 20 -T 20 -F SAF -a $SAF -o $OUT *bam

SAFC=Homo_sapiens.GRCh38.78_3kbTSS+compl.saf

OUTC=Exercise_chip_3kbpTSS+compl.txt

featureCounts -Q 20 -T 20 -F SAF -a $SAFC -o $OUTC *bam

Here's the result of the promoter only counts:

Status Sequence1.bam

Assigned 6610765

Unassigned_Ambiguity 1162622

Unassigned_MultiMapping 0

Unassigned_NoFeatures 40595847

Unassigned_Unmapped 1783674

Unassigned_MappingQuality 9157142

Unassigned_FragementLength 0

Unassigned_Chimera 0

Now the result of the promoter & complement counts:

Status Sequence1.bam

Assigned 47112044

Unassigned_Ambiguity 1257190

Unassigned_MultiMapping 0

Unassigned_NoFeatures 0

Unassigned_Unmapped 1783674

Unassigned_MappingQuality 9157142

Unassigned_FragementLength 0

Unassigned_Chimera 0

FeatureCounts can also be run at a feature-level to output counts for each TSS of each gene by adding the "-f" switch.

Now the tabulated reads are ready for differential enrichment analysis with edgeR or similar.

Search This Blog

Genome Spot

Count ChIP-seq reads across a promoter with featureCounts

Popular posts from this blog

Mass download from google drive using R

Data analysis step 8: Pathway analysis with GSEA

Generating a custom gmt file for gene set analysis