Designing Spaced Seeds Vineet Bafna
Project/Exam deadlines
• May 2
  • Send email to me with a title of your project
• May 9
  • Each student/group gives a 10 min. presentation on their proposed project.
  • Show preliminary computations. What is the test plan? What is the data like, and how much is there?
• Last week of classes:
  • A 20 min. presentation from each group
  • A written report on the project
• A take-home exam, due electronically on the date of the final exam
Accuracy
• Consider a 64bp sequence that is 70% similar to the query.
• Pr(a contiguous 11-mer seed matches) = 0.3
• Pr(a spaced seed 11101001.. matches) = 0.466
• This non-intuitive result leads to the selection of spaced seeds that are an order of magnitude faster at identical specificity and sensitivity.
• Implemented in PATTERNHUNTER
How to compute a spaced seed
• No good algorithm is known.
• Iterate over all (M choose W) seeds of span M and weight W.
• For each seed, use a computation to decide Pr(match).
• Choose the seed that maximizes this probability.
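The brute-force search described above can be sketched as follows. This is a minimal illustration and not PatternHunter's implementation. The names `hit_probability` and `best_seed` are hypothetical. The hit probability is computed with a plain dynamic program over the last M bits of the match string, which is an assumption made for brevity and is not the f(i,b) recursion developed on the later slides. The enumeration also assumes W ≥ 2 and that a seed of span M starts and ends with a 1, so it visits fewer than (M choose W) candidates.

```python
from itertools import combinations


def hit_probability(seed: str, L: int, p: float) -> float:
    """Probability that `seed` hits a random match string of length L,
    where each bit is 1 independently with probability p."""
    M = len(seed)
    mask = int(seed, 2)          # positions where the seed requires a match
    full = (1 << M) - 1          # keeps only the last M bits of the window
    dist = {0: 1.0}              # window of recent bits -> Pr(no hit so far)
    hit = 0.0
    for t in range(1, L + 1):
        nxt = {}
        for window, mass in dist.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                w = ((window << 1) | bit) & full
                if t >= M and (w & mask) == mask:
                    hit += mass * pb          # first hit ends at position t
                else:
                    nxt[w] = nxt.get(w, 0.0) + mass * pb
        dist = nxt
    return hit


def best_seed(M: int, W: int, L: int, p: float):
    """Try every seed of span M and weight W; return (seed, probability)."""
    best = None
    for inner in combinations(range(1, M - 1), W - 2):
        ones = {0, M - 1, *inner}
        seed = "".join("1" if i in ones else "0" for i in range(M))
        prob = hit_probability(seed, L, p)
        if best is None or prob > best[1]:
            best = (seed, prob)
    return best
```

The number of windows grows as 2^M, so this sketch is only practical for short seeds; it is meant to make the search loop concrete, not to be efficient.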
Prob. Computation for Spaced Seeds
• Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
• We can assume that each position matches with probability p.
• The match/mismatch string is a binary string of length L in which each bit is 1 with probability p.
• [Figure: an example match string, positions 1 to L: 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0]
Prob. Computation for Spaced Seeds
• Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
• Q is a binary string of length M, with W 1s.
• We try to match Q against the binary 'match string' S, which is a random binary string in which each bit is 1 with probability p.
• [Figure: the seed Q of length M (110…0.1…1..0) placed against the match string S, positions 1 to L]
• PQ = Prob(Q matches random S at some location)
• How can we compute PQ?
Computing f(i,b)
• For a specific string b, define
  • f(i,b) = Prob(Q matches a random string S of length i, given that S ends in b)
• Why is it sufficient to compute f(i,b) for all i, b?
  • PQ = f(L,ε), where ε is the empty string
Computing f(i,b)
• Define B1 as the set of all strings b that match a suffix of Q.
• We have two possibilities:
  • b ∈ B1: b is consistent with a suffix of Q.
  • b ∈ B0 = B - B1
• [Figure: a string b aligned against the end of the seed Q = 110001]
Computing f(i,b)
• Case b ∈ B0 (Q cannot match at the end of S, so the last bit of b can be dropped)
  • f(i,b) = f(i-1, b>>1)
• Case b ∈ B1 and |b| = M (Q matches at the end of S)
  • f(i,b) = 1
Computing f(i,b)
• Case b ∈ B1, |b| < M (condition on the bit that precedes b; the length i is unchanged)
  • f(i,b) = p f(i,1b) + (1-p) f(i,0b)
• Note that if b ∈ B1, then 1b ∈ B1
• However, it is possible that 0b ∉ B1
• We want to iterate only over b ∈ B1
  • Find the smallest j s.t. 0b>>j ∈ B1
  • f(i,b) = p f(i,1b) + (1-p) f(i-j, 0b>>j)
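The three cases for f(i,b) can be put together in a short memoized recursion. The sketch below is illustrative only, and `hit_probability` and `in_B1` are hypothetical names. It assumes that b is held as a Python string, that b>>1 means dropping the last bit of b, and that prepending a bit to b leaves the length i unchanged, so the last case recurses on f(i,1b) and f(i,0b). It does not search for the smallest j explicitly: when 0b is not in B1, the B0 case shortens it one bit at a time, which reaches the same state f(i-j, 0b>>j).

```python
from functools import lru_cache


def hit_probability(seed: str, L: int, p: float) -> float:
    """P_Q = f(L, empty string) for a seed given as a 0/1 string."""
    M = len(seed)

    def in_B1(b: str) -> bool:
        # b is consistent with the suffix of the seed of the same length:
        # wherever the seed has a 1, b must also have a 1.
        suffix = seed[M - len(b):]
        return all(s == "0" or c == "1" for s, c in zip(suffix, b))

    @lru_cache(maxsize=None)
    def f(i: int, b: str) -> float:
        if i < M:
            return 0.0                      # the seed does not fit at all
        if not in_B1(b):
            return f(i - 1, b[:-1])         # case b in B0: drop the last bit
        if len(b) == M:
            return 1.0                      # case b in B1, |b| = M: a hit
        # case b in B1, |b| < M: condition on the bit preceding b
        return p * f(i, "1" + b) + (1.0 - p) * f(i, "0" + b)

    return f(L, "")
```

For example, `hit_probability("11", 3, 0.5)` sums the probabilities of 110, 111 and 011. The recursion depth grows with L and M, so this form suits short sequences; an iterative table over i and b in B1 would be the practical variant.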
Efficiency
• |B1| = O(M·2^(M-W))
• The iteration proceeds for all i and all b ∈ B1, and each comparison needs O(M) steps
• O(M·2^(M-W)·L·M) = O(M²·2^(M-W)·L)
More efficient algorithm for spaced seed design
• Due to Buhler, Keich, and Sun
• Consider a seed π (weight w, span s).
• Let Q be the set of all 2^(s-w) strings matching π.
• Example: for π = 11000101, Q contains 11000101, 11000111, 11001101, 11001111, ...
Trie construction
• [Figure: trie for the example seed π = 1001]
• Our goal is to make an automaton that accepts all strings which contain a string from Q.
• Make a trie T from Q.
• T is a DFA that accepts precisely Q.
• Can we convert T to a DFA that accepts all strings that have a string from Q as a suffix?
Failure links
• [Figure: trie for the example seed π = 1001 with failure links]
• Use of failure links allows us to traverse any string until a string from Q is reached.
• Note: the DFA has special structure. Does it help?
  • A node with an outgoing 0 edge also has an outgoing 1 edge, so it needs no failure link.
  • Therefore, we fail only at positions where the seed has a 1 (and the input bit is 0).
Substring automaton
• We started with a trie that accepts only Q.
• Next, we use failure links to accept any string with a suffix from Q.
• Finally, make every accepting state an absorbing one (a self-loop on 0,1), to accept all strings containing a string from Q as a substring.
• [Figure: automaton for the example seed π = 1001 with an absorbing accepting state]
Computing sensitivity of π
• What is the probability that a 'random' string of length l will match π?
• Equivalently: what is the probability that a random string of length l, read from the start node, ends in an accepting state of the automaton A?
• Case 1: Each bit of S is 1 with probability p
• P(q,t) = probability that we reach state q after reading the first t bits.
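A minimal sketch of this computation, under stated assumptions: the automaton is built with a standard Aho-Corasick style construction (trie plus failure links) over the explicitly enumerated strings compatible with the seed, and the bits of S are independent. The function names `compatible_strings`, `build_automaton` and `sensitivity` are hypothetical and are not taken from the Buhler, Keich, and Sun software.

```python
from collections import deque
from itertools import product


def compatible_strings(seed):
    """All binary strings matching the seed (a seed 0 allows either bit)."""
    options = ["1" if c == "1" else "01" for c in seed]
    return ["".join(bits) for bits in product(*options)]


def build_automaton(seed):
    """Return (delta, accepting): a complete DFA over {'0','1'}."""
    children, accepting = [{}], [False]          # trie; node 0 is the start
    for word in compatible_strings(seed):
        node = 0
        for c in word:
            if c not in children[node]:
                children.append({})
                accepting.append(False)
                children[node][c] = len(children) - 1
            node = children[node][c]
        accepting[node] = True
    fail = [0] * len(children)
    delta = [{} for _ in children]
    queue = deque()
    for c in "01":                               # transitions out of the start
        if c in children[0]:
            delta[0][c] = children[0][c]
            queue.append(children[0][c])
        else:
            delta[0][c] = 0
    while queue:                                 # breadth-first over the trie
        node = queue.popleft()
        for c in "01":
            if c in children[node]:
                child = children[node][c]
                fail[child] = delta[fail[node]][c]
                delta[node][c] = child
                queue.append(child)
            else:                                # follow the failure link
                delta[node][c] = delta[fail[node]][c]
    return delta, accepting


def sensitivity(seed, length, p):
    """Probability that a random string of the given length is accepted."""
    delta, accepting = build_automaton(seed)
    P = [0.0] * len(delta)                       # P[q] = P(q, t), no hit yet
    P[0] = 1.0
    hit = 0.0                                    # mass absorbed in accept states
    for _ in range(length):
        nxt = [0.0] * len(delta)
        for q, mass in enumerate(P):
            if mass == 0.0:
                continue
            for c, pc in (("1", p), ("0", 1.0 - p)):
                r = delta[q][c]
                if accepting[r]:
                    hit += mass * pc
                else:
                    nxt[r] += mass * pc
        P = nxt
    return hit
```

Accepting states are treated as absorbing by moving their probability mass into `hit` instead of propagating it further, which matches the substring automaton described above.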
Complexity
• Size of the automaton: W·2^(M-W)
• What is the in-degree?
• Claimed complexity = O(W·2^(M-W)·L)
• O(M²/W) times faster than the previous algorithm
Generalizing the match string
• The match string may have a different distribution
• Errors do not fall independently at random
• Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string.
• The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala
Experimental Results with Mandala
• 428 human/mouse genomic aligned regions.
• Repeat-mask the alignments and separate them into coding/non-coding regions.
• A total of 1,136,000 similarities (alignments) were pulled. These are used to check the sensitivity (accuracy) of filters.
Effect of Span
• Solid line: 0-th order model
• Dashed line: 5-th order model
• W=11 throughout: a larger span implies more gaps; span=11 implies an ungapped (BLASTN) seed
Accuracy of different seeds
• Non-coding
• Coding
Model order
• Non-coding: solid line
• Coding: dashed line
What about multiple keywords?
• All of the analysis is for ungapped alignments.
• With indels, multiple words might be more sensitive.
• Mandala works for multiple keywords also.
• Can we make the algorithm more efficient?
• In particular, there is an explosion of states in making a deterministic automaton. Can we match a non-deterministic automaton instead?
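Regular Expressions
• Concise representation of a set of strings over an alphabet Σ.
• Described by a string over Σ together with operator symbols (such as +, · and *)
• R is a r.e. if and only if it is {ε}, or {σ} for some σ ∈ Σ, or is built from regular expressions R1 and R2 as R1 + R2, R1 · R2, or R1*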
Regular Expression
• Q: Let Σ = {A,C,E}
  • Is (A+C)*EEC* a regular expression?
  • *(A+C)?
  • AC*..E?
• Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?
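Membership questions like the ones above can be checked mechanically. The sketch below uses Python's standard `re` module, assuming that the slide's union operator "+" corresponds to "|" in Python's syntax; `in_language` is a hypothetical helper name. `fullmatch` is used because the question is whether the whole string s belongs to R, not whether s contains a match.

```python
import re

# Assumption: the slide's R = (A+C)*EEC* is written (A|C)*EEC* in Python.
R = re.compile(r"(A|C)*EEC*")


def in_language(s: str) -> bool:
    """True if the entire string s is in the set described by R."""
    return R.fullmatch(s) is not None


for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
```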
Regular Expression & Automata
• Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
  • The automaton has a start and an end node
  • Each edge is labeled with a symbol from Σ, or ε
• Suppose R is described by automaton A
  • s ∈ R if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata
• (A+C)*EEC*
• [Figure: automaton from start to end with edges labeled A, C, E, E, C]
Constructing automata from R.E.
• R = {ε}
• R = {σ}, σ ∈ Σ
• R = R1 + R2
• R = R1 · R2
• R = R1*
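One common way to realize these five cases is a Thompson-style construction, in which every sub-expression becomes a fragment with one start and one end node, and fragments are glued together with ε edges. The sketch below is an illustration under that assumption, not necessarily the construction drawn on the slide; the class `NFA` and its method names are hypothetical.

```python
from itertools import count


class NFA:
    """Edges are (u, label, v); label None stands for an epsilon edge."""

    def __init__(self):
        self.edges = []
        self._ids = count()

    def _node(self):
        return next(self._ids)

    def symbol(self, a=None):            # R = {a}; R = {eps} when a is None
        s, e = self._node(), self._node()
        self.edges.append((s, a, e))
        return s, e

    def union(self, r1, r2):             # R = R1 + R2
        s, e = self._node(), self._node()
        for start, end in (r1, r2):
            self.edges.append((s, None, start))
            self.edges.append((end, None, e))
        return s, e

    def concat(self, r1, r2):            # R = R1 . R2
        self.edges.append((r1[1], None, r2[0]))
        return r1[0], r2[1]

    def star(self, r):                   # R = R1*
        s, e = self._node(), self._node()
        self.edges += [(s, None, r[0]), (r[1], None, e),
                       (s, None, e), (r[1], None, r[0])]
        return s, e

    def _closure(self, nodes):
        """Nodes reachable from `nodes` using epsilon edges only."""
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for x, label, v in self.edges:
                if x == u and label is None and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def accepts(self, frag, text):
        """True if some start-to-end path of the fragment is labeled text."""
        current = self._closure({frag[0]})
        for ch in text:
            moved = {v for u, label, v in self.edges
                     if u in current and label == ch}
            current = self._closure(moved)
        return frag[1] in current


# Build the automaton for (A+C)*EEC* from the five cases.
nfa = NFA()
r = nfa.star(nfa.union(nfa.symbol("A"), nfa.symbol("C")))
r = nfa.concat(r, nfa.symbol("E"))
r = nfa.concat(r, nfa.symbol("E"))
r = nfa.concat(r, nfa.star(nfa.symbol("C")))
```

The resulting automaton has more nodes and ε edges than a hand-drawn one, but each case adds only a constant number of them, so its size stays linear in the length of the expression.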
Regular Expression Matching
• Given a database D and a regular expression R, is a substring of D in R?
• Is there a string D[l..c] that is accepted by the automaton of R?
• Simpler Q: Is D[1..c] accepted by the automaton of R?
Alg. for matching R.E.
• If D[1..c] is accepted by the automaton RA
  • There is a path labeled D[1]…D[c] that goes from START to END in RA
Alg. for matching R.E.
• If D[1..c] is accepted by the automaton RA
  • There is a path labeled D[1]…D[c] that goes from START to END in RA
  • There is a path labeled D[1]..D[c-1] from START to a node u, and a path labeled D[c] from u to END
D.P. to match regular expression
• Define:
  • A[u,σ] = automaton node reached from u after reading σ
  • Eps(u): set of all nodes reachable from node u using epsilon transitions.
  • N[c] = subset of nodes reachable from the START node after reading D[1..c]
• Q: when is v ∈ N[c]?
D.P. to match regular expression
• Q: when is v ∈ N[c]?
• A: If for some u ∈ N[c-1], w = A[u,D[c]], and
  • v ∈ {w} ∪ Eps(w)
Algorithm
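The recurrence for N[c] defined on the preceding slides can be sketched as below. This is an illustrative reconstruction, not the slide's own pseudocode. It assumes the automaton is given as two dictionaries, `edges[(u, symbol)]` and `eps[u]`, each mapping to a set of nodes (a generalization of A[u,σ], which names a single node), and the hand-built automaton for (A+C)*EEC* with node numbers 0 to 3 is an assumption made for the example.

```python
def eps_closure(nodes, eps):
    """Eps: all nodes reachable from `nodes` using only epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def accepted_prefixes(D, edges, eps, start, end):
    """Return every c (1-based) such that D[1..c] is accepted."""
    N = eps_closure({start}, eps)                # N[0]
    accepted = []
    for c, symbol in enumerate(D, start=1):
        step = set()
        for u in N:                              # u in N[c-1]
            step |= edges.get((u, symbol), set())
        N = eps_closure(step, eps)               # N[c]
        if end in N:
            accepted.append(c)
    return accepted


# Hypothetical automaton for (A+C)*EEC*: 0 is START, 3 is END.
EDGES = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
         (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}
```

Each position of D touches every node of N[c-1] once, so the running time is proportional to the length of D times the size of the automaton, without ever building a deterministic automaton.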
The final step
• We have answered the question:
  • Is D[1..c] accepted by R?
  • Yes, if END ∈ N[c]
• We need to answer
  • Is D[l..c] (for some l, and some c) accepted by R?
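One common way to handle an arbitrary left end l, offered here as a sketch and not as the lecture's stated answer, is to let a match begin at every position: after each symbol, the START node and its epsilon closure are added back into the current node set. The dictionary representation of the automaton, the name `find_match_ends`, and the example automaton for (A+C)*EEC* are assumptions made for illustration.

```python
def eps_closure(nodes, eps):
    """All nodes reachable from `nodes` using only epsilon edges."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def find_match_ends(D, edges, eps, start, end):
    """Return every c (1-based) such that some non-empty D[l..c] is accepted."""
    start_set = eps_closure({start}, eps)
    N = set(start_set)
    ends = []
    for c, symbol in enumerate(D, start=1):
        step = set()
        for u in N:
            step |= edges.get((u, symbol), set())
        reached = eps_closure(step, eps)
        if end in reached:                # some D[l..c] ends in END
            ends.append(c)
        N = reached | start_set           # a new match may start at c + 1
    return ends


# Hypothetical automaton for (A+C)*EEC*: 0 is START, 3 is END.
EDGES = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
         (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}
```

The END test is made before START is added back, so an expression that accepts the empty string does not report a match at every position.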