Designing Spaced Seeds

Presentation Transcript


  1. Designing Spaced Seeds Vineet Bafna

  2. Project/Exam deadlines • May 2 • Send email to me with a title of your project • May 9 • Each student/group gives a 10 min. presentation on their proposed project. • Show preliminary computations. What is the test plan? What is the data like, and how much is there? • Last week of classes: • A 20 min. presentation from each group • A written report on the project • A take-home exam, due electronically on the date of the final exam

  3. Accuracy • Consider a 64 bp sequence that is 70% similar to the query. • Pr(an 11-mer matches) = 0.3 • Pr(a spaced seed 11101001.. matches) = 0.466 • This non-intuitive result leads to the selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity • Implemented in PATTERNHUNTER

  4. How to compute a spaced seed • No good algorithm is known. • Iterate over all (M choose W) seeds of span M and weight W. • For each seed, compute Pr(match). • Choose the seed that maximizes this probability.
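
The exhaustive search on this slide can be sketched as follows. This is a minimal illustration, not the slides' own algorithm: `hit_probability` is a plain dynamic program over the last M-1 bits of the match string, and `best_seed` assumes (an assumption not stated on the slide) that a seed of span M starts and ends with a 1, so it enumerates only the interior positions. The function names are hypothetical.

```python
from itertools import combinations


def hit_probability(seed, length, p):
    """Pr(seed hits an i.i.d. match string of the given length), window DP."""
    m = len(seed)
    if length < m:
        return 0.0
    care = int(seed, 2)            # bitmask of the positions the seed requires
    mask = (1 << (m - 1)) - 1      # keeps the most recent m-1 bits
    dist = {0: 1.0}                # dist[w] = Pr(window is w and no hit so far)
    for t in range(1, length + 1):
        nxt = {}
        for w, pr in dist.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                full = (w << 1) | bit          # last m bits, newest in low bit
                if t >= m and full & care == care:
                    continue                   # a hit: drop this mass
                key = full & mask
                nxt[key] = nxt.get(key, 0.0) + pr * pb
        dist = nxt
    return 1.0 - sum(dist.values())


def best_seed(span, weight, length, p):
    """Try every seed of this span and weight whose first and last bit are 1."""
    best = None
    for inner in combinations(range(1, span - 1), weight - 2):
        bits = ["0"] * span
        bits[0] = bits[-1] = "1"
        for k in inner:
            bits[k] = "1"
        seed = "".join(bits)
        score = hit_probability(seed, length, p)
        if best is None or score > best[1]:
            best = (seed, score)
    return best
```

The window DP keeps up to 2^(M-1) states, so this is only practical for short seeds; the later slides describe cheaper ways to evaluate a single seed.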

  5. Prob. Computation for Spaced Seeds • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L. • We can assume that each position matches with probability p. • The match/mismatch string is a binary string in which each bit is 1 with probability p. [Figure: an example match string over positions 1..L: 1110111011101111110111100]

  6. Prob. Computation for Spaced Seeds • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L. • Q is a binary string of length M, with W 1s. • We try to match Q against the binary 'match string' S, which is a random binary string in which each bit is 1 with probability p. [Figure: Q = 110…0.1…1..0, of length M, aligned against S over positions 1..L] • P_Q = Prob(Q matches random S at some location) • How can we compute P_Q?
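
As a reference point for the question on this slide, the definition of the hit probability can be evaluated directly by enumerating all 2^L match strings. This is a brute-force sketch for small L only, with hypothetical function names; it is useful mainly for checking faster methods.

```python
from itertools import product


def seed_hits(seed, s):
    """True if the seed matches the 0/1 match string s at some offset."""
    m = len(seed)
    return any(
        all(s[k + j] == "1" for j in range(m) if seed[j] == "1")
        for k in range(len(s) - m + 1)
    )


def brute_force_pq(seed, length, p):
    """Sum the probabilities of all match strings that the seed hits."""
    total = 0.0
    for bits in product("01", repeat=length):
        s = "".join(bits)
        if seed_hits(seed, s):
            ones = s.count("1")
            total += p ** ones * (1 - p) ** (length - ones)
    return total
```

For example, the seed 101 on strings of length 4 with p = 0.5 hits at offset 0 or offset 1, giving 1/4 + 1/4 - 1/16 = 7/16.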

  7. Computing f(i,b) • For a specific string b, define • f(i,b) = Prob(Q matches a random string S of length i, given that S ends in b) [Figure: S over positions 1..i, with suffix b] • Why is it sufficient to compute f(i,b) for all i, b? • P_Q = f(L,ε), where ε is the empty string

  8. Computing f(i,b) • Define B1 as the set of all strings b that match a suffix of Q. • We have two possibilities: • b ∈ B1: b is consistent with a suffix of Q. • b ∈ B0 = B − B1 [Figure: b aligned with the right end of Q; example string 110001]

  9. Computing f(i,b) • Case b ∈ B0 • f(i,b) = f(i-1, b>>1), where b>>1 drops the last bit of b (no hit can end at position i) • Case b ∈ B1 and |b| = M • f(i,b) = 1

  10. Computing f(i,b) • Case b ∈ B1, |b| < M • f(i,b) = p·f(i,1b) + (1-p)·f(i,0b) • Note that if b ∈ B1, then 1b ∈ B1 • However, it is possible that 0b ∉ B1 • We want to iterate only over b ∈ B1 • Find the smallest j s.t. 0b>>j ∈ B1 • f(i,b) = p·f(i,1b) + (1-p)·f(i-j, 0b>>j)
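
The recurrence for f(i,b) can be sketched with memoization as below. This is an illustrative reading, not the lecturer's code: b is held as a Python string whose last character is the last bit of S, so b>>1 is `b[:-1]`, and extending b on the left to 1b or 0b is taken to leave the length i unchanged, as the definition of f(i,b) suggests. The base case f(i,b) = 0 for i < M and the name `pq_by_recurrence` are assumptions added here.

```python
from functools import lru_cache


def pq_by_recurrence(seed, length, p):
    """P_Q = f(L, empty string), iterating only over suffixes b in B1."""
    m = len(seed)

    def in_b1(b):
        # b is consistent with a suffix of the seed: wherever the aligned
        # seed position is 1, b must be 1 as well.
        tail = seed[m - len(b):]
        return all(x == "1" for x, q in zip(b, tail) if q == "1")

    @lru_cache(maxsize=None)
    def f(i, b):
        if i < m:
            return 0.0                 # too short to contain a hit
        if len(b) == m:
            return 1.0                 # b in B1 with |b| = M is a hit
        zero = "0" + b
        j = 0
        # smallest j such that (0b >> j) is back in B1
        while not in_b1(zero[: len(zero) - j]):
            j += 1
        return p * f(i, "1" + b) + (1 - p) * f(i - j, zero[: len(zero) - j])

    return f(length, "")
```

The loop always terminates because the empty string is in B1. The recursion depth grows with L and M, so very long sequences would need an iterative table instead.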

  11. Efficiency • |B1| = M·2^(M-W) • The iteration proceeds for all i and all b ∈ B1, and each comparison needs O(M) steps • O(M·2^(M-W)·L·M) = O(M^2·2^(M-W)·L)

  12. More efficient algorithm for spaced seed design • Due to Buhler, Keich, and Sun • Consider a seed π (weight w, span s). • Let Q be the set of all 2^(s-w) possible strings matching π. • Example: π = 11000101 gives Q = {11000101, 11000111, 11001101, 11001111, …}
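
The set Q on this slide can be generated by filling every 0 (don't-care) position of the seed with both bit values. A small sketch, with a hypothetical function name:

```python
from itertools import product


def matching_strings(seed):
    """All binary strings that have a 1 wherever the seed has a 1."""
    free = [k for k, c in enumerate(seed) if c == "0"]   # don't-care positions
    out = []
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for k, c in zip(free, fill):
            s[k] = c
        out.append("".join(s))
    return out
```

A seed of span s and weight w has s-w free positions, so the list has 2^(s-w) entries.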

  13. Trie construction • Ex: π = 1001, so Q = {1001, 1011, 1101, 1111} [Figure: trie for Q] • Our goal is to make an automaton that accepts all strings which contain a string from Q. • Make a trie T from Q. • T is a DFA that accepts precisely Q. • Can we convert T to a DFA that accepts all strings that have a string from Q as a suffix?

  14. Failure links • Ex: π = 1001 [Figure: trie for Q with failure links] • The use of failure links allows us to keep traversing any string until a string from Q is reached. • Note: the DFA has special structure. Does it help? No failure link is needed where π has a 0, since both outgoing edges exist. Therefore, we fail only at positions where π has a 1.

  15. Substring automaton • We started with a trie that accepts only Q. • Next, we used failure links to accept any string with a suffix from Q. • Finally, make every accepting state an absorbing one (self-loop on 0,1), to accept all strings containing a string from Q as a substring. • Ex: π = 1001 [Figure: the resulting automaton]
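
The three steps on this slide (trie, failure links, absorbing accept states) can be sketched with a standard Aho-Corasick style construction. This is a minimal illustration under that assumption; the slides do not give code, and the names `build_automaton` and `contains_hit` are hypothetical.

```python
from collections import deque
from itertools import product


def build_automaton(seed):
    """Return (goto, accept): a complete DFA over '0'/'1' for the seed."""
    free = [k for k, c in enumerate(seed) if c == "0"]
    goto, accept = [{}], [False]
    # Step 1: trie over Q, the set of strings matching the seed.
    for fill in product("01", repeat=len(free)):
        word = list(seed)
        for k, c in zip(free, fill):
            word[k] = c
        node = 0
        for c in word:
            if c not in goto[node]:
                goto.append({})
                accept.append(False)
                goto[node][c] = len(goto) - 1
            node = goto[node][c]
        accept[node] = True
    # Step 2: failure links, folded into missing edges (breadth-first).
    fail = [0] * len(goto)
    queue = deque()
    for c in "01":
        if c in goto[0]:
            queue.append(goto[0][c])
        else:
            goto[0][c] = 0
    while queue:
        u = queue.popleft()
        for c in "01":
            if c in goto[u]:
                v = goto[u][c]
                fail[v] = goto[fail[u]][c]
                accept[v] = accept[v] or accept[fail[v]]
                queue.append(v)
            else:
                goto[u][c] = goto[fail[u]][c]
    # Step 3: accepting states become absorbing.
    for u in range(len(goto)):
        if accept[u]:
            goto[u] = {"0": u, "1": u}
    return goto, accept


def contains_hit(seed, s):
    """True if the 0/1 string s contains a string from Q as a substring."""
    goto, accept = build_automaton(seed)
    state = 0
    for c in s:
        state = goto[state][c]
    return accept[state]
```

For π = 1001 the trie has 12 nodes: the root, then 1, 2, 4 and 4 nodes at depths 1 through 4.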

  16. Computing sensitivity of π • Compute the probability that a 'random' string of length l will match π. • Equivalently: what is the probability that a random string of length l, read starting at the begin node, ends in an accepting state of the automaton A? • Case 1: Each bit of S is 1 with probability p • P(q,t) = probability that we reach state q after reading the first t bits.
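
The quantity P(q,t) can be propagated one bit at a time: each state passes its probability mass to its two successors, weighted by p and 1-p, and the sensitivity is the mass sitting in accepting states after l bits. The sketch below assumes the i.i.d. model of Case 1. To stay short it names each state by the string it represents and computes transitions directly from the definition (longest suffix that is still a prefix of a word in Q) rather than from precomputed failure links; the function name is hypothetical.

```python
from itertools import product


def sensitivity(seed, length, p):
    """Pr(a random 0/1 string of this length contains a string matching seed)."""
    m = len(seed)
    free = [k for k, c in enumerate(seed) if c == "0"]
    words = set()
    for fill in product("01", repeat=len(free)):
        s = list(seed)
        for k, c in zip(free, fill):
            s[k] = c
        words.add("".join(s))
    prefixes = {w[:k] for w in words for k in range(m + 1)}

    def step(state, bit):
        if state in words:               # accepting states are absorbing
            return state
        text = state + bit
        while text not in prefixes:      # fall back to the longest valid suffix
            text = text[1:]
        return text

    prob = {"": 1.0}                     # P(q, 0): all mass at the begin node
    for _ in range(length):
        nxt = {}
        for q, pr in prob.items():
            for bit, pb in (("1", p), ("0", 1.0 - p)):
                r = step(q, bit)
                nxt[r] = nxt.get(r, 0.0) + pr * pb
        prob = nxt                       # now holds P(q, t)
    return sum(pr for q, pr in prob.items() if q in words)
```

Each of the l rounds touches every state twice, which is where a running time proportional to (number of states) × l comes from.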

  17. Complexity • Size of the automaton: W·2^(M-W) • What is the in-degree? • Claimed complexity = O(W·2^(M-W)·L) • O(M^2/W) times faster than the previous algorithm

  18. Generalizing the match string • The match string may have a different distribution • Errors do not fall independently at random • Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string. • The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala

  19. Experimental Results with Mandala • 428 human/mouse genomic aligned regions. • Repeat-mask the alignments and separate them into coding/non-coding regions. • A total of 1136000 similarities (alignments) were pulled. These are used to check the sensitivity (accuracy) of filters.

  20. Effect of Span • Solid line: 0-th order model • Dashed line: 5-th order model. • W=11 throughout: a larger span implies more gaps; span=11 implies an ungapped (BLASTN) seed

  21. Accuracy of different seeds • Non-coding • Coding

  22. Model order • Non-coding: solid line • Coding: dashed line

  23. What about multiple keywords? • All of the analysis is for ungapped alignments. • With indels, multiple words might be more sensitive. • Mandala works for multiple keywords also. • Can we make the algorithm more efficient? • In particular, there is an explosion of states in making a deterministic automaton. Can we match a non-deterministic automaton?

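  24. Regular Expressions • Concise representation of a set of strings over alphabet Σ. • Described by a string over Σ and the symbols ε, +, ·, *, (, ) • R is a r.e. if and only if it is {ε}, {σ} for some σ ∈ Σ, R1 + R2, R1 · R2, or R1*, where R1 and R2 are r.e.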

  25. Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • *(A+C)? • AC*..E? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?
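
The three membership questions for R = (A+C)*EEC* can be checked with Python's standard `re` module. Note the notational difference: the slide writes union as `+`, while `re` writes it as `|` (and uses `+` for "one or more"). The helper name `in_language` is hypothetical.

```python
import re

# (A+C)*EEC* in the slide's notation; union is '|' in Python's re syntax.
PATTERN = re.compile(r"(A|C)*EEC*")


def in_language(s):
    """True if the whole string s is described by the regular expression."""
    return PATTERN.fullmatch(s) is not None


for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
```

CEEC and ACEE are in R; AEC is not, because R requires two consecutive E's.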

  26. Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and an end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.

  27. Examples: Regular Expression & Automata • (A+C)*EEC* [Figure: automaton from start to end, with edges labeled A and C for (A+C)*, then E, E, then C for C*]

  28. Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1* [Figures: one automaton construction per case, using ε edges]

  29. Regular Expression Matching • Given a database D and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?

  30. Alg. for matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA [Figure: path from START to END with edges D[1], D[2], …, D[c]]

  31. Alg. for matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA • There is a path labeled D[1]..D[c-1] from START to some node u, and a path labeled D[c] from u to END [Figure: path labeled D[1] .. D[c-1] into u, then D[c] to END]

  32. D.P. to match regular expression • Define: • A[u,σ] = automaton node reached from u after reading σ • Eps(u) = set of all nodes reachable from node u using ε transitions. • N[c] = subset of nodes reachable from the START node after reading D[1..c] • Q: when is v ∈ N[c]? [Figure: edge u → v labeled σ; ε edges from u into Eps(u)]

  33. D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If, for some u ∈ N[c-1] and w = A[u,D[c]], • v ∈ {w} ∪ Eps(w) [Figure: path D[1] .. D[c-1] into u, edge D[c] to w, then ε edges to v]

  34. Algorithm

  35. The final step • We have answered the question: • Is D[1..c] accepted by R? • Yes, if END ∈ N[c] • We need to answer: • Is D[l..c] (for some l, and some c) accepted by R?
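
The N[c] computation, and one way to handle the final step, can be sketched as follows. The automaton for (A+C)*EEC* is written out by hand as an assumption (node numbers, the two ε edges, and all names are illustrative). For the substring question, the sketch re-adds START and its ε-closure to the current node set before every symbol, so a match may begin at any position l; this is one common approach and is not necessarily the one the lecture used.

```python
def eps_closure(nodes, eps):
    """All nodes reachable from `nodes` by zero or more epsilon edges."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def match_ends(text, delta, eps, start, end, anywhere=False):
    """Positions c (1-based) with END in N[c].

    anywhere=False asks about prefixes D[1..c];
    anywhere=True asks about substrings D[l..c] for some l.
    """
    start_set = eps_closure({start}, eps)
    current = set(start_set)                      # N[0]
    hits = []
    for c, symbol in enumerate(text, start=1):
        if anywhere:
            current |= start_set                  # a match may start here
        moved = {delta[(u, symbol)] for u in current if (u, symbol) in delta}
        current = eps_closure(moved, eps)         # N[c]
        if end in current:
            hits.append(c)
    return hits


# Hand-built automaton for (A+C)*EEC*: START = 0, END = 4.
DELTA = {(1, "A"): 1, (1, "C"): 1, (1, "E"): 2, (2, "E"): 3, (3, "C"): 3}
EPS = {0: [1], 3: [4]}

print(match_ends("ACEEC", DELTA, EPS, 0, 4))
print(match_ends("EAEEC", DELTA, EPS, 0, 4, anywhere=True))
```

Each step looks only at the current node set, so the work per symbol is bounded by the size of the automaton and no deterministic automaton has to be built.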
