Fast Local Alignment Methods

Fast Local Alignment Methods Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao

FASTA • Find runs of identities, and identify regions with the highest density of identities. • Re-score using PAM matrix, and keep top scoring segments. • Eliminate segments that are unlikely to be part of the alignment. • Optimize the alignment in a band.

FASTA Step 1: Find runes of identities, and identify regions with the highest density of identities. Sequence B Sequence A

FASTA Step 2: Re-score using PAM matrix, andkeep top scoring segments.

FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

FASTA Step 4: Optimize the alignment in a band.

BLAST • Basic Local Alignment Search Tool(by Altschul, Gish, Miller, Myers and Lipman) • The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure • A maximal segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences.(for DNA: Identities: +5; Mismatches: -4) • The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming. • BLAST heuristically attempts to calculate the MSP score. the highest scoring pair

BLAST • Build the hash table for Sequence A. • Scan Sequence B for hits. • Extend hits.

BLAST Step 1: Build the hash table for Sequence A. (3-tuple example) For protein sequences: Seq. A = ELVISAdd xyz to the hash table if Score(xyz, ELV) ≧ T;Add xyz to the hash table if Score(xyz, LVI) ≧ T;Add xyz to the hash table if Score(xyz, VIS) ≧ T; For DNA sequences: Seq. A = AGATCGAT 12345678 AAAAAC..AGA 1..ATC 3..CGA 5..GAT 2 6..TCG 4..TTT

BLAST Step2: Scan sequence B for hits.

BLAST Step2: Scan sequence B for hits. Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments. hit Terminate if the score of the sxtension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

Remarks • Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. • The idea of filtration was used in both FASTA and BLAST.

PatternHunter The following slides are based on Bin Ma’s illustrations.

A homology between mouse and human genomes GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Smith-Waterman is the most accurate method. Time complexity：O(mn).

BLAST finds a “hit” and then extends GCNTACACGTCACCATCTGTGCCACCACNCATGTCTCTAGTGATCCCTCATAAGTTCCAACAAAGTTTGC || ||||| | ||| |||| || |||||||||||||||||| | |||||||| | | ||||| GCCTACACACCGCCAGTTGTG-TTCCTGCTATGTCTCTAGTGATCCCTGAAAAGTTCCAGCGTATTTTGC GAGTACTCAACACCAACATTGATGGGCAATGGAAAATAGCCTTCGCCATCACACCATTAAGGGTGA---- || ||||||||| |||||| | ||||| |||||||| ||| |||||||| | | | || GAATACTCAACAGCAACATCAACGGGCAGCAGAAAATAGGCTTTGCCATCACTGCCATTAAGGATGTGGG ------------------TGTTGAGGAAAGCAGACATTGACCTCACCGAGAGGGCAGGCGAGCTCAGGTA ||||||||||||| ||| ||||||||||| || ||||||| || |||| | TTGACAGTACACTCATAGTGTTGAGGAAAGCTGACGTTGACCTCACCAAGTGGGCAGGAGAACTCACTGA GGATGAGGTGGAGCATATGATCACCATCATACAGAACTCAC-------CAAGATTCCAGACTGGTTCTTG ||||||| |||| | | |||| ||||| || ||||| || |||||| ||||||||||||||| GGATGAGATGGAACGTGTGATGACCATTATGCAGAATCCATGCCAGTACAAGATCCCAGACTGGTTCTTG Seed match = hit

Example of missing a target • Fail: (Word size 11) GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT || ||||||||| |||||| | |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT • Dilemma • Sensitivity – needs shorter seeds • the success rate of finding a homology • Speed – needs longer seeds • Mega-BLAST uses seeds of length 28.

PatternHunter uses “spaced seeds” • 111010010100110111 (called a model) • Eleven required matches (weight=11) • Seven “don’t care” positions GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT… || ||||||||| ||||| ||||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT… 111010010100110111 • Hit = all the required matches are satisfied. • BLAST seed model = 11111111111

Simulated sensitivity curves

Weight of a seed • Lemma: The expected number of hits of a weight W length M seed model within a length L region with similarity p is (L-M+1)pW • Proof: There are (L-M+1) positions a hit can occur. At each position, pW hit is expected. Q.E.D. • Seed models with the same weight generate approximately the same amount of hits. • Speed is approximately the same. • Sensitivity is not necessarily the same. • num of hits v.s. num of regions that contain hits. GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT || ||||||||| ||||| ||||||| |||||| GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT 111010010100110111

Why spaced seeds are better? TTGACCTCACC? |||||||||||? TTGACCTCACC? 11111111111 11111111111 CAA?A??A?C??TA?TGG? |||?|??|?|??||?|||? CAA?A??A?C??TA?TGG? 111010010100110111 111010010100110111 • BLAST’s seed usually uses more than one hits to detect one homology (redundant) • Spaced seeds uses fewer hits to detect one homology (efficient)

PH’s seed does not overlap heavily • PH’s seed do not overlap heavily when shifts: 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 111010010100110111 ...... • The hits at different positions are independent. • The probability of having the second hit is 5*p6 + … • compare to BLAST’s model p + p2 + p3 + p4 + …

Non-overlapping For optimized spaced seed, 111010010100110111 Non overlap Prob 111010010100110111 6 p6 111010010100110111 6 p6 111010010100110111 6 p6 111010010100110111 7 p7 ….. • For spaced seed: the divisor is 1+p6+p6+p6+p7+ … • For BLAST seed: the divisor is bigger: 1+ p + p2 + p3 + … Ming Li’s CPM’05 illustration.

In a region of length 64 with p=0.7 • Hit more using less number of hits: • Pr(BLAST seed hits)=0.3 E(# of hits by BLAST seed)=1.07 • Pr(optimal spaced seed hits)=0.466, 50% more E(# of hits by spaced seed)=0.93, 14% less • the average number of hits in that region is • 2.0 for PH’s weight-11 seed • 3.6 for contiguous weight-11 seed.

PatternHunter I performance Blastn MB28 PH • E.coli (4.7M) v.s. H.inf (1.8M) • 716s /158M 5s/561M 34s/78M • Arabidopsis chr2 (19.6M) v.s. chr4 (17.5M) • -- 21720s/1087M 5020s/279M • Human chr21 (26.2M) v.s. chr22 (35M) • -- -- 14512s/419M • All used a 700MHZ PentiumIII PC with 1G byte memory. • Human (3G) v.s. Mouse (3G)* • Using 2-hit, weight 12 seed, PH used 6 days with a 1GHZ PentiumIII PC with 2G byte memory. • With Blast, it would otherwise take months with parallel computers to finish.

Homology search vs Google search • Internet search • Size limit: 5 billion people x homepage size • Supercomputing power used: ½ million CPU-hours/day • Query frequency: Google --- 112 million/day • Query type: exact keyword search --- easy to do • Homology search • Size limit: 5 billion people x 3 billion basepairs + millions of species x billion bases • 10% (?) of world’s supercomputing power • Query frequency: NCBI BLAST -- 150,000/day, 15% increase/month • Query type: approximate search Ming Li’s CPM’05 illustration.

Fast Local Alignment Methods