Sunday, 25 August 2013

A selection of useful bash one-liners

Here's a selection of one-liners that you might find useful for data analysis. I'll update this post regularly and hope you can share some of your own gems.

Create a compressed tar archive (using parallel bzip2):
tar cf backup.tar.bz2 --use-compress-prog=pbzip2 directory
Extract a tar archive:
tar xvf backup.tar 
Head a compressed file:
bzcat file.bz2 | head 
pbzip2 -dc file.bz2 | head
zcat file.gz | head 
Uniq on one column (field 2):
awk '!arr[$2]++' file.txt
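For example, with a throw-away two-column input (values are made up), only the first line seen for each field-2 value survives:

```shell
# keep the first occurrence of each value in column 2
printf 'a x\nb y\nc x\nd z\n' | awk '!arr[$2]++'
# prints:
# a x
# b y
# d z
```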
Explode a file on the first field:
awk '{print > $1}' file1.txt
Sum a column:
awk '{ sum+=$1} END {print sum}' file.txt
Put spaces between every three characters:
sed 's/\(.\{3\}\)/\1 /g' file.txt
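For example (the input string is arbitrary); characters left over after the last full group of three are passed through unchanged:

```shell
# insert a space after every group of three characters
printf 'ABCDEFG\n' | sed 's/\(.\{3\}\)/\1 /g'
# prints: ABC DEF G
```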
Join 2 files based on field 1. Both files need to be sorted on the join field first (use sort -k 1b,1):
join -1 1 -2 1 file1.txt file2.txt
Join a bunch of files by field 1. Individual files don't need to be sorted but the final output might need to be sorted:
awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' file1.txt file2.txt ... fileN.txt
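A quick illustration with two throw-away files (file names and values are made up). Field 2 of each file is appended to the entry for its field-1 key; the iteration order of `for (i in arr)` is arbitrary, hence the final sort:

```shell
printf 'id1 5\nid2 7\n' > f1.txt
printf 'id1 3\nid2 9\n' > f2.txt
awk '$0 !~ /#/{arr[$1]=arr[$1] " " $2}END{for(i in arr)print i,arr[i]}' f1.txt f2.txt | sort
rm f1.txt f2.txt
```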
Find the lines shared by 2 files (note that lines duplicated within a single file will also be reported):
sort file1 file2 | uniq -d
Alternative method to find the common lines (files need to be pre-sorted):
comm -12 file1 file2
Add a header to a file:
sed -e '1i\HeaderGoesHere' originalFile
Extract every 4th line starting at the second line (extract the sequence from fastq):
sed -n '2~4p' file.txt
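The `first~step` address is a GNU sed extension. With a minimal two-record FASTQ (read names and sequences invented for illustration), only the sequence lines come back:

```shell
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | sed -n '2~4p'
# prints:
# ACGT
# TTAA
```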
Find the most common strings in column 2:
cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
Randomise lines in a file:
shuf file.txt
Find lines of file2 that contain any of the fixed strings listed in file1:
grep -Ff file1 file2
Print lines which contain string1, string2, ... or stringN:
egrep '(string1|string2|stringN)' file.txt
Count the number of "X" characters per line:
n=0; while read line; do n=$((n+1)); printf '%s ' "$n"; printf '%s' "$line" | tr -cd "X" | wc -c; done < file.txt
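A terser awk alternative, since gsub() returns the number of substitutions it made; the output (line number, then count) is equivalent:

```shell
# print the line number and the count of "X" characters on that line
printf 'XXaX\nbb\nX\n' | awk '{print NR, gsub(/X/,"")}'
# prints:
# 1 3
# 2 0
# 3 1
```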
Count the length of strings in field 2:
awk '{print length($2)}' file1
Print all possible 3-mer DNA sequence combinations:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
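Brace expansion (a bash feature) multiplies out the alternatives, so this yields 4^3 = 64 strings; a quick sanity check:

```shell
# count the generated 3-mers
echo {A,C,T,G}{A,C,T,G}{A,C,T,G} | wc -w
# prints: 64
```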
Filter reads with SAMtools:
samtools view -f 4 file.bam > unmapped.sam
samtools view -F 4 file.bam > mapped.sam
samtools view -f 2 file.bam > mappedPairs.sam
Compress a bunch of folders full of data:
for DIR in */ ; do DIR=${DIR%/} ; zip -r "${DIR}.zip" "$DIR"/ ; done
Get year-month-day hour:minute:second from Unix "date":
DATE=`date +%Y-%m-%d:%H:%M:%S`
GNU Parallel has many interesting uses.

Confirm md5sums of fastq files:
ls *fastq.gz | parallel md5sum {} > checksums.txt
Index bam files:
parallel samtools index {} ::: *bam
Fix the Bedtools error "Error: malformed BED entry at line 1. Start was greater than end. Exiting." by reordering BED coordinates:
awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}' file.bed > file_fixed.bed 
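For example, a record with swapped coordinates (values invented) comes back start-before-end, tab-separated:

```shell
printf 'chr1\t100\t50\n' | awk '{OFS="\t"} {if ($3<$2) print $1,$3,$2 ; else print $0}'
# prints: chr1<TAB>50<TAB>100
```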
Fix a MACS peak BED file that contains negative coordinates:
awk '{OFS="\t"} {if ($2<1) print $1,1,$3 ; else print $0 }' macs_peaks.bed > macs_peaks_fixed.bed
Recursively replace spaces with underscores in file and directory names (rename here is the Perl rename; -depth renames the deepest entries first, and -execdir operates on the basename only):
find . -depth -name "* *" -execdir rename 's/ /_/g' "{}" \;
Generate md5 checksums for a directory of files in parallel:
parallel md5sum ::: * > checksums.md5
Aggregate: sum column 2 values based on the column 1 string:
awk '{array[$1]+=$2} END { for (i in array) {print i, array[i]}}' file.tsv

Validate files by comparing checksums in parallel:
cat checksums.md5 | parallel --pipe -N1 md5sum -c

Thursday, 22 August 2013

Quick alignment of microRNA-Seq data to a reference

In my previous post, I discussed an efficient approach to align deep sequencing reads to a genome, which could be used for most applications (mRNA-Seq, ChIP-Seq, exome-Seq, etc).

MicroRNA datasets are rather different because they often contain millions of reads from only a few thousand RNA species, meaning there is a lot of redundancy. We can speed up the analysis considerably if we collapse the sequences to remove this redundancy before genome alignment. You can see in the script below that several steps are chained without writing a single intermediate file:

  • Adapter clipping
  • Sequence collapsing
  • BWA alignment
  • Bam conversion
  • Sorting bam file

for FQ in *fq ; do
fastx_clipper -a TGGAATTCTCGGGTGCCAAGG -l 18 -i $FQ \
| fastx_collapser \
| bwa aln -t 30 $REF - \
| bwa samse $REF - ${FQ} \
| samtools view -uSh - \
| samtools sort - ${FQ}.sort &
done ; wait
for bam in *bam ; do samtools index $bam & done ; wait
for bam in *bam ; do samtools flagstat $bam > ${bam}.stats & done ; wait

Using this method, the fastq is collapsed, meaning the sequences are now in a format which looks like this:
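An illustrative sketch of fastx_collapser output (the sequences and counts here are invented); each header is of the form identifier-count:

```
>1-153
TGAGGTAGTAGGTTGTATAGTT
>2-87
TGGAATGTAAAGAAGTATGTAT
>3-21
AGCTTATCAGACTGATGTTGA
```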


The sequence name now consists of a unique numeric identifier followed by the number of times that exact sequence string was observed. Using this technique, I was able to align 12 data sets totaling 32 million reads in about 3 minutes. This is how the bam file looks:

36725-1 0 chr2 59549323 0 33M * 0 0 CCATACCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:33
64660-1 0 chr2 59549328 0 28M * 0 0 CCACCCTGAACGCGCCCGATCTCGTCTG * XT:A:R NM:i:0 X0:i:34 XM:i:0 XO:i:0 XG:i:0 MD:Z:28
6064-8 0 chr2 59549329 0 24M * 0 0 CACCCTGAACGCGCCCGATCTCGT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:24
11925-3 0 chr2 59549331 0 28M * 0 0 CCCTGAACGCGCCCGATCTCGTCTGATC * XT:A:R NM:i:0 X0:i:31 XM:i:0 XO:i:0 XG:i:0 MD:Z:28
23045-1 16 chr2 59549331 0 24M * 0 0 CCCTGAACGCGCCCGATCTCGTCT * XT:A:R NM:i:0 X0:i:36 XM:i:0 XO:i:0 XG:i:0 MD:Z:24

I will cover the differential analysis aspect in another post. If you have tips for microRNA-Seq analysis, I'd be glad to hear them.