Saturday, March 5, 2011

The need for speed: Bowtie

Following a suggestion in Comments (here), I decided to take a look at Bowtie for doing alignments of short reads to a genome.

I downloaded the source and it was the easiest build ever. Just cd into the project directory and do make


> file ~/Software/bowtie-0.12.7/bowtie
/Users/telliott/Software/bowtie-0.12.7/bowtie: Mach-O 64-bit executable x86_64


Remember to make sym links. Then, before using it I need to build an index for the reference genome:


mkdir Hinf-ref
bowtie-build Hinf/seq/Hinf.fasta ~/Desktop/Hinf-ref/Hinf


Take it out for a spin:


bowtie Hinf-ref/Hinf -f test.txt

4023 - Hinf 114788 AGGAAAAAGAATGATTATTTAGCGAATTTCAGGTATGATTCTGCCCACTCTTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 16:T>G,29:C>G
8411 - Hinf 443632 CCAGAAGCAAGGGTATGGAAAGTTTCTAATTGCAGACGCCATTCATAAGATTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0
..
# reads processed: 10000
# reads with at least one reported alignment: 7892 (78.92%)
# reads that failed to align: 2108 (21.08%)
Reported 7892 alignments to 1 output stream(s)


Virtually instantaneous! The I's are made-up quality scores (since this is a FASTA file using option -f, and it doesn't have them).

Some other options:

--refout prints to a file
--quiet what it says
-m 1 if a read has more than 1 match, do not report it


--suppress (doesn't seem to work with --refout), but we can just redirect the output:


bowtie Hinf-ref/Hinf -f test.txt -m 1 --quiet --suppress 3,6,7 > x.txt

4023 - 114788 AGGAAAAAGAATGATTATTTAGCGAATTTCAGGTATGATTCTGCCCACTCTTA 16:T>G,29:C>G
8411 - 443632 CCAGAAGCAAGGGTATGGAAAGTTTCTAATTGCAGACGCCATTCATAAGATTA
10519 - 1563910 TAGTAATCACTTCTGAAAATTCGTGAGAAAGTGTTGCCACGGAGTAGAGCTTA 26:A>G
10532 - 705902 TGCTTGTTGTAACCACTAGCACATCGAACTTTTTATAACGTCCTGCAAGAGTA
13283 - 89256 AATATTGGTTAGTGCTTCTTCGGTTAAAAACGCTTGGTGACCTGTTAGCAATA
..

blat Hinf/seq/Hinf.fasta test.txt blat.results.txt -out=blast8

4023 Hinf 96.23 53 2 0 1 53 114841 114789 2.6e-19 93.0
8411 Hinf 100.00 53 0 0 1 53 443685 443633 3.5e-22 103.0
10519 Hinf 98.11 53 1 0 1 53 1563963 1563911 2.3e-21 100.0
10532 Hinf 100.00 53 0 0 1 53 705955 705903 3.5e-22 103.0
13283 Hinf 100.00 53 0 0 1 53 89309 89257 4.0e-22 103.0
..



Let's time it on a bigger input file:


> python -c "import time;  print time.time()"
1299330135.3
> bowtie Hinf-ref/Hinf -f Hinf/orig/SD1.txt --quiet --refout
> python -c "import time; print time.time()"
1299330160.53


Compare with BLAT:


> python -c "import time;  print time.time()"
1299330054.12
> blat Hinf/seq/Hinf.fasta Hinf/orig/SD1.txt blat.results.txt -out=blast8
Loaded 1830138 letters in 1 sequences
Searched 37561637 bases in 708731 sequences
> python -c "import time; print time.time()"
1299330107.54


13 seconds v. 25 seconds (roughly). However, Bowtie is doing a lot more I/O.


> bowtie Hinf-ref/Hinf -f Hinf/orig/SD1.txt -m 1 --refout
Warning: Skipping read (6532258) because it is less than 4 characters long
# reads processed: 708731
# reads with at least one reported alignment: 549761 (77.57%)
# reads that failed to align: 144851 (20.44%)
# reads with alignments suppressed due to -m: 14119 (1.99%)
Reported 549761 alignments to 1 output stream(s)

> ls -al blat.results.txt ref00000.map
-rw-r--r-- 1 telliott_admin staff 556006 Mar 5 08:13 blat.results.txt
-rw-r--r--@ 1 telliott_admin staff 74392716 Mar 5 08:44 ref00000.map


BLAT has more hits.


> python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from utils import load_data
>>> data = load_data('blat.results.txt')
>>> data.count('\n')
676301
>>> data = load_data('ref00000.map')
>>> data.count('\n')
549760

That would certainly bear investigation. But Bowtie definitely looks like a winner.