This project involves designing spaced seeds for sequence matching efficiency. Topics include computation, probability, algorithm optimization, and automaton construction for pattern recognition. Explore methods to enhance sensitivity, specificity, and speed in sequence analysis.
Designing Spaced Seeds Vineet Bafna
Project/Exam deadlines • May 2 • Send email to me with a title of your project • May 9 • Each student/group gives a 10 min. presentation on their proposed project. • Show preliminary computations. What is the test plan? What is the data like, and how much is there. • Last week of classes: • A 20 min. presentation from each group • A written report on the project • A take home exam, due electronically on the date of the final exam
Accuracy • Consider a 64bp sequence that is 70% similar to the query. • Pr(an 11-mer matches) = 0.3 • Pr(a spaced seed 11101001.. matches) = 0.466 • This non-intuitive result leads to the selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity • Implemented in PATTERNHUNTER
How to compute a spaced seed • No good algorithm is known. • Iterate over all (M choose W) seeds. • Use a computation to decide Pr(match) • Choose the seed that maximizes probability.
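The exhaustive search described on this slide can be sketched as follows. This is a toy illustration, not the lecture's implementation: the scoring step here simply enumerates every match string of length L, which is only feasible for very small L, and the function names (`seed_hits`, `hit_probability`, `best_seed`) are hypothetical.

```python
from itertools import combinations, product

def seed_hits(seed, s):
    """True if the seed (a 0/1 string) matches the 0/1 tuple s at some offset."""
    ones = [j for j, c in enumerate(seed) if c == "1"]
    return any(all(s[k + j] == 1 for j in ones)
               for k in range(len(s) - len(seed) + 1))

def hit_probability(seed, L, p):
    """Add up the probability of every length-L match string that the seed hits."""
    total = 0.0
    for s in product((0, 1), repeat=L):
        if seed_hits(seed, s):
            k = sum(s)  # number of matching positions in s
            total += p ** k * (1 - p) ** (L - k)
    return total

def best_seed(M, W, L, p):
    """Try every placement of W ones in M positions and keep the most sensitive."""
    candidates = []
    for pos in combinations(range(M), W):
        seed = "".join("1" if j in pos else "0" for j in range(M))
        candidates.append((hit_probability(seed, L, p), seed))
    return max(candidates)

# Small assumed parameters so that brute force finishes quickly.
prob, seed = best_seed(M=5, W=3, L=10, p=0.7)
print(seed, round(prob, 4))
```

Swapping `hit_probability` for a faster exact computation leaves the outer loop unchanged; the number of candidate seeds still grows as (M choose W).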
Prob. Computation for Spaced Seeds • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L. • We can assume that there is a probability p of match. • The match/mismatch string is a binary string of length L in which each position is 1 with probability p, e.g. 1110111011101111110111100
Prob. Computation for Spaced Seeds • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L. • Q is a binary string of length M, with W 1s • We try to match the binary ‘match string’ S, which is a random binary string of length L with probability p of success at each position. • P_Q = Prob. (Q matches random S at some location) • How can we compute P_Q?
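One straightforward way to get P_Q exactly is a dynamic program over the last M-1 bits of S. This is a naive baseline with up to 2^(M-1) states, not the f(i,b) recurrence developed on the following slides, and the function name is an assumption made for illustration.

```python
def hit_probability(seed, L, p):
    """Exact probability that `seed` (a 0/1 string) hits a random match string
    of length L whose bits are independently 1 with probability p."""
    M = len(seed)
    ones = [j for j, c in enumerate(seed) if c == "1"]
    # no_hit maps the most recent (at most M-1) bits of S to the probability
    # of having read them without any seed hit so far.
    no_hit = {(): 1.0}
    for _ in range(L):
        nxt = {}
        for tail, pr in no_hit.items():
            for bit, pb in ((1, p), (0, 1 - p)):
                window = tail + (bit,)
                if len(window) == M and all(window[j] for j in ones):
                    continue  # the seed hits here, so this mass leaves "no hit"
                key = window[-(M - 1):] if M > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + pr * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())

# Contiguous 11-mer on a 64bp region at 70% similarity.
print(round(hit_probability("1" * 11, 64, 0.7), 3))  # approximately 0.30
```

For the contiguous 11-mer this reproduces the 0.3 figure quoted on the Accuracy slide. Larger spans make the state table grow quickly, which is the motivation for restricting attention to fewer states.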
Computing f(i,b) • For a specific string b, define • f(i,b) = Prob. (Q matches a random string S of length i, s.t. S ends in b) • Why is it sufficient to compute f(i,b) for all i, b? • P_Q = f(L, ε), where ε is the empty string
Computing f(i,b) • Define B1 as the set of all strings b that match a suffix of Q • We have two possibilities: • b ∈ B1: b is consistent with a suffix of Q. • b ∈ B0 = B - B1
Computing f(i,b) • Case b ∈ B0 • f(i,b) = f(i-1, b>>1) • Case b ∈ B1 and |b| = M • f(i,b) = 1
Computing f(i,b) • Case b ∈ B1, |b| < M • f(i,b) = p f(i-1,1b) + (1-p) f(i-1,0b) • Note that if b ∈ B1, then 1b ∈ B1 • However, it is possible that 0b ∉ B1 • We want to iterate only over b ∈ B1 • Find smallest j s.t. 0b>>j ∈ B1 • f(i,b) = p f(i-1,1b) + (1-p) f(i-j, 0b>>j)
Efficiency • |B1| = M · 2^(M-W) • The iteration proceeds for all i and all b ∈ B1, and each comparison needs O(M) steps • O(M · 2^(M-W) · L · M) = O(M^2 · 2^(M-W) · L)
More efficient algorithm for spaced seed design • Due to Buhler, Keich, and Sun • Consider seed π (weight w, span s). • Let Q be the set of all possible 2^(s-w) strings matching π. • Example: π = 11000101 gives Q = {11000101, 11000111, 11001101, 11001111, ...}
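The set Q can be listed by filling each don't-care (0) position of the seed with both bit values. A short sketch, with `seed_language` as a hypothetical helper name:

```python
from itertools import product

def seed_language(seed):
    """All binary strings that agree with the seed on its 1 positions."""
    free = [j for j, c in enumerate(seed) if c == "0"]  # don't-care positions
    words = []
    for fill in product("01", repeat=len(free)):
        w = list(seed)
        for j, b in zip(free, fill):
            w[j] = b
        words.append("".join(w))
    return words

print(seed_language("1001"))           # ['1001', '1011', '1101', '1111']
print(len(seed_language("11000101")))  # 16, i.e. 2^(8-4)
```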
Trie construction • Ex: π = 1001 • Our goal is to make an automaton that accepts all strings which contain a string from Q. • Make a trie T from Q. • T is a DFA that accepts precisely Q • Can we convert T to a DFA that accepts all strings that have a string from Q as a suffix?
Failure links • Ex: π = 1001 • Use of failure links allows us to traverse any string till Q is reached. • Note: the DFA has special structure. Does it help? No failure links when outgoing edge is 0. Therefore, we fail only when we see a 1.
Substring automaton • Ex: π = 1001 • We started with a trie that only accepts Q • Next, we use failure links to accept any string with a suffix from Q. • Finally, make every accepting state an absorbing one (self-loop on 0,1), to accept all strings containing a string from Q as a substring.
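The three steps (trie, failure links, absorbing accept states) can be sketched with a standard Aho-Corasick style construction. This is an illustrative sketch rather than Mandala's code; `build_automaton` and `accepts` are hypothetical names, and state 0 is assumed to be the start node.

```python
from collections import deque
from itertools import product

def build_automaton(seed):
    """Return (delta, accepting) for a DFA over {0,1} that accepts every
    string containing a match of the seed."""
    free = [j for j, c in enumerate(seed) if c == "0"]
    words = []
    for fill in product("01", repeat=len(free)):
        w = list(seed)
        for j, b in zip(free, fill):
            w[j] = b
        words.append([int(c) for c in w])

    # Step 1: trie over Q. goto[q][bit] is a child state or None.
    goto, accepting = [[None, None]], set()
    for w in words:
        q = 0
        for bit in w:
            if goto[q][bit] is None:
                goto.append([None, None])
                goto[q][bit] = len(goto) - 1
            q = goto[q][bit]
        accepting.add(q)

    # Step 2: failure links, resolved breadth-first into a full table.
    fail = [0] * len(goto)
    delta = [[0, 0] for _ in goto]
    queue = deque()
    for bit in (0, 1):
        if goto[0][bit] is not None:
            delta[0][bit] = goto[0][bit]
            queue.append(goto[0][bit])
    while queue:
        q = queue.popleft()
        for bit in (0, 1):
            child = goto[q][bit]
            if child is None:
                delta[q][bit] = delta[fail[q]][bit]  # follow the failure link
            else:
                fail[child] = delta[fail[q]][bit]
                delta[q][bit] = child
                queue.append(child)

    # Step 3: accepting states become absorbing.
    for q in accepting:
        delta[q] = [q, q]
    return delta, accepting

def accepts(delta, accepting, s):
    q = 0
    for c in s:
        q = delta[q][int(c)]
    return q in accepting

delta, accepting = build_automaton("1001")
print(len(delta), accepts(delta, accepting, "0010110"))  # 12 True
```

Because every string in Q has the same length, only the trie leaves are accepting, so no extra output-link bookkeeping is needed in this special case.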
Computing sensitivity of π • Compute the probability that a ‘random’ string of length l will match π. • Equivalently: what is the probability that a random string of length l, read from the start node, ends in an accepting state of the automaton A? • Case 1: Each bit of S is 1 with probability p • P(q,t) = Probability that we reach q after reading the first t bits.
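For the independent-bit case, P(q,t) can be filled in forward: each state passes a fraction p of its mass along its 1-edge and 1-p along its 0-edge. The sketch below assumes the automaton is given as a transition table `delta[q][bit]` with start state 0, and uses a tiny hand-built automaton for the contiguous seed 11 so that the result is easy to check by hand.

```python
def sensitivity(delta, accepting, L, p):
    """Forward DP on P(q, t): probability of being in an accepting state
    after reading L independent bits, each 1 with probability p."""
    prob = [0.0] * len(delta)
    prob[0] = 1.0                                  # P(start, 0) = 1
    for _ in range(L):
        nxt = [0.0] * len(delta)
        for q, pq in enumerate(prob):
            if pq:
                nxt[delta[q][1]] += pq * p         # next bit is a match
                nxt[delta[q][0]] += pq * (1 - p)   # next bit is a mismatch
        prob = nxt
    return sum(prob[q] for q in accepting)

# Hand-built automaton for seed 11: state 2 is accepting and absorbing.
delta = [[0, 1], [0, 2], [2, 2]]
print(sensitivity(delta, {2}, L=3, p=0.5))  # 0.375
```

Of the eight strings of length 3, exactly 011, 110 and 111 contain 11, which gives 3/8. Each step touches every state twice, so the cost is proportional to the automaton size times L.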
Complexity • Size of the automaton: W · 2^(M-W) • What is the in-degree? • Claimed complexity = O(W · 2^(M-W) · L) • O(M^2/W) faster than the previous algorithm
Generalizing the match string • The match string may have a different distribution • Errors do not fall independently at random • Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string. • The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala
Experimental Results with Mandala • 428 human/mouse genomic aligned regions. • Repeat mask the alignments and separate into coding/non-coding regions. • A total of 1136000 similarities (alignments) were pulled. These are used to check for sensitivity (accuracy) of filters.
Effect of Span • Solid line: 0-th order model • Dashed line: 5-th order model. • W=11 throughout: larger span implies more gaps, span=11 implies ungapped (BLASTN) seed
Accuracy of different seeds • Non-coding • Coding
Model order • Non-Coding: solid line • coding: dashed line
What about multiple keywords? • All of the analysis is for ungapped alignments. • With indels, multiple words might be more sensitive. • Mandala works for multiple keywords also. • Can we make the algorithm more efficient? • In particular, there is an explosion of states when making a deterministic automaton. Can we match with a non-deterministic automaton?
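Regular Expressions • Concise representation of a set of strings over alphabet Σ. • Described by a string over Σ ∪ {ε, +, ·, *, (, )} • R is a r.e. if and only if R = {ε}, or R = {σ} for some σ ∈ Σ, or R = R1 + R2, R = R1 · R2, or R = R1* for regular expressions R1, R2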
Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • *(A+C)? • AC*..E? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
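The membership questions can be checked with Python's standard `re` module. The one notational difference to keep in mind is that the slides write alternation as `+`, while `re` writes it as `|`; `fullmatch` is used so that the whole string has to be in R.

```python
import re

# (A+C)*EEC* in the slides' notation; "+" there means "or".
R = re.compile(r"(A|C)*EEC*")

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, bool(R.fullmatch(s)))
# CEEC True
# AEC False
# ACEE True
```

AEC fails because R requires two consecutive E symbols.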
Regular Expression & Automata • Every R.E can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata • (A+C)*EEC* • [Figure: automaton from start to end with edges labeled A, C, E, E, C]
Constructing automata from R.E • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1*
Regular Expression Matching • Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?
Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA
Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA • There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END
D.P. to match regular expression • Define: • A[u,σ] = Automaton node reached from u after reading σ • Eps(u): set of all nodes reachable from node u using epsilon transitions. • N[c] = subset of nodes reachable from START node after reading D[1..c] • Q: when is v ∈ N[c]?
D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If for some u ∈ N[c-1], w = A[u,D[c]], • v ∈ {w} ∪ Eps(w)
Algorithm
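A minimal sketch of the N[c] recurrence from the preceding slides. The automaton representation is an assumption: `step` maps (node, symbol) to a set of nodes, `eps` maps a node to its epsilon successors, and the example automaton for (A+C)*EEC* is hand-built for illustration rather than copied from the lecture's figure.

```python
def eps_closure(eps, nodes):
    """Nodes reachable from `nodes` using only epsilon edges (Eps in the slides)."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reachable_sets(step, eps, start, D):
    """Return [N[0], ..., N[len(D)]]: N[c] holds the nodes reachable from
    `start` after reading the first c symbols of D."""
    N = [eps_closure(eps, {start})]
    for sym in D:
        moved = set()
        for u in N[-1]:                      # some u in N[c-1]
            moved |= step.get((u, sym), set())
        N.append(eps_closure(eps, moved))    # {w} together with Eps(w)
    return N

# Hypothetical automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
START, END = 0, 3
step = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
        (1, "E"): {2}, (2, "C"): {2}}
eps = {2: {3}}

def accepted(D):
    return END in reachable_sets(step, eps, START, D)[-1]

print(accepted("CEEC"), accepted("AEC"), accepted("ACEE"))  # True False True
```

Each step only needs N[c-1], so the whole scan costs on the order of |D| times the number of automaton edges, without ever building a deterministic automaton.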
The final step • We have answered the question: • Is D[1..c] accepted by R? • Yes, if END ∈ N[c] • We need to answer • Is D[l..c] (for some l, and some c) accepted by R?
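One common way to handle an arbitrary starting position l, offered here as a sketch and not necessarily the approach presented in the lecture, is to add START (with its epsilon closure) to the active set before every symbol, so that a match may begin anywhere. The automaton encoding (`step`, `eps`) and the function name are illustrative assumptions, and only non-empty substrings are considered.

```python
def first_match_end(step, eps, start, end, D):
    """Smallest c (1-based) such that some substring D[l..c] is accepted,
    or None if there is no such substring."""
    def closure(nodes):
        stack, seen = list(nodes), set(nodes)
        while stack:
            u = stack.pop()
            for v in eps.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    active = set()
    for c, sym in enumerate(D, start=1):
        active |= closure({start})       # a match may begin at position c
        moved = set()
        for u in active:
            moved |= step.get((u, sym), set())
        active = closure(moved)
        if end in active:
            return c
    return None

# Hypothetical automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
step = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
        (1, "E"): {2}, (2, "C"): {2}}
eps = {2: {3}}

print(first_match_end(step, eps, 0, 3, "GGAEEG"))  # 5
print(first_match_end(step, eps, 0, 3, "GAEG"))    # None
```

In the first example the leading G symbols would empty the active set if START were not re-added, and the match AEE ends at position 5.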