Designing Spaced Seeds

Designing Spaced Seeds Vineet Bafna

Project/Exam deadlines
• May 2: Send me an email with the title of your project.
• May 9: Each student/group gives a 10 min. presentation on their proposed project.
  • Show preliminary computations.
  • What is the test plan? What is the data like, and how much is there?
• Last week of classes:
  • A 20 min. presentation from each group
  • A written report on the project
• A take-home exam, due electronically on the date of the final exam

Accuracy
• Consider a 64 bp sequence that is 70% similar to the query.
• Pr(a contiguous 11-mer matches) = 0.3
• Pr(a spaced seed 11101001.. matches) = 0.466
• This non-intuitive result leads to the selection of spaced seeds that make the search an order of magnitude faster at identical specificity and sensitivity.
• Implemented in PATTERNHUNTER

How to compute a spaced seed
• No good algorithm is known.
• Iterate over all (M choose W) seeds of span M and weight W.
• Use a computation to decide Pr(match) for each seed.
• Choose the seed that maximizes the probability.
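The enumeration on this slide can be sketched as follows. This is a minimal illustration, not PatternHunter's method: the names `all_seeds`, `hits`, `hit_prob_bruteforce` and `best_seed` are hypothetical, and Pr(match) is computed by brute force over all 2^L match strings, so it is only usable for very small L.

```python
from itertools import combinations, product

def all_seeds(M, W):
    """Yield every binary seed of span M with exactly W ones, as a string."""
    for ones in combinations(range(M), W):
        yield "".join("1" if k in ones else "0" for k in range(M))

def hits(seed, s):
    """True if the seed matches the 0/1 sequence s at some offset.

    A seed matches at an offset if s has a 1 wherever the seed has a 1.
    """
    M = len(seed)
    return any(
        all(q == "0" or s[off + k] == "1" for k, q in enumerate(seed))
        for off in range(len(s) - M + 1)
    )

def hit_prob_bruteforce(seed, L, p):
    """Exact Pr(hit) by summing over all 2^L match strings (tiny L only)."""
    total = 0.0
    for bits in product("01", repeat=L):
        if hits(seed, bits):
            ones = bits.count("1")
            total += p ** ones * (1 - p) ** (L - ones)
    return total

def best_seed(M, W, L, p):
    """Return (seed, Pr(hit)) for the seed with the largest hit probability."""
    scored = ((s, hit_prob_bruteforce(s, L, p)) for s in all_seeds(M, W))
    return max(scored, key=lambda pair: pair[1])
```

A call such as `best_seed(5, 3, 10, 0.7)` scans all 10 candidate seeds; for realistic values of M and L the brute-force scorer would have to be replaced by a faster probability computation.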

Prob. Computation for Spaced Seeds
• Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
• We can assume that each position matches independently with probability p.
• The match/mismatch string is then a binary string of length L in which each bit is 1 with probability p, e.g. 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 0

Prob. Computation for Spaced Seeds
• Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
• Q is a binary string of length M, with W 1s.
• We try to match Q against the binary 'match string' S, a random binary string of length L in which each bit is 1 with probability p. Q matches at a location if S has a 1 wherever Q has a 1.
• (Figure: Q = 110…0.1…1..0, of length M, placed along S, positions 1 to L.)
• P_Q = Prob.(Q matches random S at some location)
• How can we compute P_Q?

Computing f(i,b)
• For a specific string b, define
• f(i,b) = Prob.(Q matches a random string S of length i, given that S ends in b)
• Why is it sufficient to compute f(i,b) for all i, b?
• P_Q = f(L,ε), where ε is the empty string

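Computing f(i,b)
• Define B1 as the set of all strings b that match a suffix of Q, i.e. b has a 1 wherever the last |b| positions of Q have a 1.
• We have two possibilities:
  • b ∈ B1: b is consistent with a suffix of Q.
  • b ∈ B0 = B − B1: b is not consistent with the corresponding suffix of Q.
• (Figure: b aligned against the end of Q = 110001.)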

Computing f(i,b)
• Base case: f(i,b) = 0 for i < M
• Case b ∈ B0
  • f(i,b) = f(i-1, b>>1) (Q cannot match at the last position, so the last bit of b is dropped)
• Case b ∈ B1 and |b| = M
  • f(i,b) = 1

Computing f(i,b)
• Case b ∈ B1, |b| < M
  • f(i,b) = p f(i,1b) + (1-p) f(i,0b)
• Note that if b ∈ B1, then 1b ∈ B1.
• However, it is possible that 0b ∉ B1.
• We want to iterate only over b ∈ B1.
• Find the smallest j s.t. 0b>>j ∈ B1.
• f(i,b) = p f(i,1b) + (1-p) f(i-j, 0b>>j)
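The three cases for f(i,b) can be sketched as a memoized recursion. This is a minimal sketch under stated assumptions: the names `hit_probability` and `compatible` are hypothetical, strings are Python `str` objects with the most recent bit last, and instead of searching for the smallest j explicitly the code lets the B0 rule drop one bit at a time, which gives the same value with somewhat more work.

```python
from functools import lru_cache

def hit_probability(seed: str, L: int, p: float) -> float:
    """Pr(seed matches somewhere in a random 0/1 string of length L).

    Each bit of the random string is 1 independently with probability p.
    """
    M = len(seed)

    def compatible(b: str) -> bool:
        # b is in B1 if it has a 1 wherever the last |b| seed positions do.
        tail = seed[M - len(b):]
        return all(q == "0" or c == "1" for q, c in zip(tail, b))

    @lru_cache(maxsize=None)
    def f(i: int, b: str) -> float:
        if i < M:                       # string too short to contain the seed
            return 0.0
        if not compatible(b):           # case b in B0: drop the last bit
            return f(i - 1, b[:-1])
        if len(b) == M:                 # case b in B1, |b| = M: seed matches
            return 1.0
        # case b in B1, |b| < M: condition on the bit preceding b
        return p * f(i, "1" + b) + (1 - p) * f(i, "0" + b)

    return f(L, "")                     # P_Q = f(L, empty string)
```

For instance, `hit_probability("11", 3, 0.5)` sums the three strings 011, 110 and 111 out of eight. The recursion depth grows with L, so for long sequences an iterative table over i would be the safer choice.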

Efficiency
• |B1| ≤ M 2^(M-W)
• The iteration proceeds over all i and all b ∈ B1, and each comparison needs O(M) steps.
• O(M 2^(M-W) · L · M) = O(M^2 2^(M-W) L)
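The bound on |B1| can be checked directly for a given seed: a string b of length k is in B1 exactly when it is free only at the 0 positions among the last k positions of Q. The helper names `count_B1` and `bound` below are hypothetical, and the count covers non-empty b only.

```python
def count_B1(seed: str) -> int:
    """Number of non-empty strings b (|b| <= M) consistent with a suffix of the seed."""
    M = len(seed)
    total = 0
    for k in range(1, M + 1):
        zeros = seed[M - k:].count("0")   # free positions in the length-k suffix
        total += 2 ** zeros
    return total

def bound(seed: str) -> int:
    """The bound M * 2^(M-W) for a seed of span M and weight W."""
    M, W = len(seed), seed.count("1")
    return M * 2 ** (M - W)
```

For the seed 1001 the suffixes 1, 01, 001 and 1001 contribute 1 + 2 + 4 + 4 strings, which stays below the bound of 4 · 2^2.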

More efficient algorithm for spaced seed design
• Due to Buhler, Keich, and Sun
• Consider a seed π (weight w, span s).
• Let Q be the set of all 2^(s-w) strings of length s that match π.
• Example: π = 11000101 gives Q = {11000101, 11000111, 11001101, 11001111, …}

Trie construction
• Our goal is to make an automaton that accepts all strings that contain a string from Q.
• Make a trie T from Q.
• T is a DFA that accepts precisely Q.
• Can we convert T to a DFA that accepts all strings that have a string from Q as a suffix?
• (Figure: trie for the example π = 1001.)

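Failure links
• The use of failure links allows us to traverse any string until a string from Q is reached.
• Note: the DFA has special structure. Does it help?
• A node with an outgoing 0-edge also has an outgoing 1-edge (a don't-care position of π), so it needs no failure link. We fail only at nodes whose single outgoing edge is labeled 1, i.e. when the input has a 0 where π requires a 1.
• (Figure: trie with failure links for the example π = 1001.)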

Substring automaton
• We started with a trie that accepts only Q.
• Next, we use failure links to accept any string with a suffix from Q.
• Finally, make every accepting state an absorbing one (self-loop on 0,1), to accept all strings containing a string from Q as a substring.
• (Figure: the resulting automaton for the example π = 1001.)
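The three steps (trie, failure links, absorbing accepting states) can be sketched with a standard Aho-Corasick construction. This is an illustrative sketch and not the authors' code: `seed_strings`, `build_automaton` and `accepts` are hypothetical names, state 0 is the start node, and the failure links are folded directly into a complete transition table `delta[q][bit]`.

```python
from collections import deque
from itertools import product

def seed_strings(seed):
    """All binary strings matching the seed (a 0 in the seed is a don't-care)."""
    choices = ["1" if q == "1" else "01" for q in seed]
    return ["".join(t) for t in product(*choices)]

def build_automaton(seed):
    """Return (delta, accepting), where delta[q][bit] is the next state."""
    # Step 1: trie of Q; children[q] maps "0"/"1" to a child node.
    children, accepting = [{}], set()
    for word in seed_strings(seed):
        q = 0
        for ch in word:
            if ch not in children[q]:
                children.append({})
                children[q][ch] = len(children) - 1
            q = children[q][ch]
        accepting.add(q)
    # Step 2: failure links in BFS order, used to fill in missing edges.
    n = len(children)
    fail = [0] * n
    delta = [[0, 0] for _ in range(n)]
    queue = deque()
    for bit in "01":
        if bit in children[0]:
            delta[0][int(bit)] = children[0][bit]
            queue.append(children[0][bit])
    while queue:
        q = queue.popleft()
        for bit in "01":
            if bit in children[q]:
                child = children[q][bit]
                fail[child] = delta[fail[q]][int(bit)]
                delta[q][int(bit)] = child
                queue.append(child)
            else:
                delta[q][int(bit)] = delta[fail[q]][int(bit)]
    # Step 3: make every accepting state absorbing.
    for q in accepting:
        delta[q] = [q, q]
    return delta, accepting

def accepts(delta, accepting, bits):
    """Run the automaton on a 0/1 string and report whether it accepts."""
    q = 0
    for ch in bits:
        q = delta[q][int(ch)]
    return q in accepting
```

Because all strings in Q have the same length, no trie node other than a leaf can end a match, so marking only the leaves as accepting is sufficient here.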

Computing sensitivity of π
• Compute the probability that a 'random' string of length l will match π.
• Equivalently: what is the probability that a random string of length l, read from the start node, ends in an accepting state of the automaton A?
• Case 1: Each bit of S is 1 with probability p.
• P(q,t) = probability that we reach state q after reading the first t bits.
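The table P(q,t) can be filled one bit at a time by pushing probability mass along the automaton's edges. The sketch below assumes the automaton is given as a transition table `delta[q][bit]` with start state 0; the function name `sensitivity` and the tiny hand-built automaton for the contiguous seed 11 are illustrative assumptions, not part of the original slides.

```python
def sensitivity(delta, accepting, length, p):
    """Pr(a random 0/1 string of the given length ends in an accepting state).

    prob[q] holds P(q, t) for the current t; each bit is 1 with probability p.
    """
    prob = [0.0] * len(delta)
    prob[0] = 1.0                        # P(start, 0) = 1
    for _ in range(length):
        nxt = [0.0] * len(delta)
        for q, mass in enumerate(prob):
            if mass:
                nxt[delta[q][0]] += mass * (1 - p)   # read a 0
                nxt[delta[q][1]] += mass * p         # read a 1
        prob = nxt
    return sum(prob[q] for q in accepting)

# Hand-built substring automaton for the contiguous seed 11:
# state 0 = nothing matched, 1 = one 1 seen, 2 = seed found (absorbing).
DELTA_11 = [[0, 1], [0, 2], [2, 2]]
ACCEPTING_11 = {2}
```

Each of the `length` rounds touches every state and its two outgoing edges once, so the running time is proportional to the number of states times the string length.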

Complexity
• Size of the automaton: W 2^(M-W)
• What is the in-degree?
• Claimed complexity = O(W 2^(M-W) L)
• O(M^2/W) times faster than the previous algorithm

Generalizing the match string
• The match string may have a different distribution.
• Errors do not fall independently at random.
• Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string.
• The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala.
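To illustrate why the automaton approach extends naturally, here is a sketch for the simplest non-independent case, a first-order Markov match string. It is not Mandala's implementation: the function name and the parameters `p_first` (probability that the first bit is 1) and `p_after` (probability of a 1 given the previous bit) are assumptions, and a higher-order model would track a longer context in the same way.

```python
def sensitivity_markov1(delta, accepting, length, p_first, p_after):
    """Pr(hit) when each bit depends on the previous one.

    delta[q][bit] is the automaton's transition table with start state 0.
    p_after[x] = Pr(next bit is 1 | previous bit is x), for x in {0, 1}.
    The DP state is the pair (automaton state, previous bit).
    """
    if length == 0:
        return 1.0 if 0 in accepting else 0.0
    prob = {}
    for bit, pr in ((0, 1 - p_first), (1, p_first)):
        key = (delta[0][bit], bit)
        prob[key] = prob.get(key, 0.0) + pr
    for _ in range(length - 1):
        nxt = {}
        for (q, prev), mass in prob.items():
            p1 = p_after[prev]
            for bit, pr in ((0, 1 - p1), (1, p1)):
                key = (delta[q][bit], bit)
                nxt[key] = nxt.get(key, 0.0) + mass * pr
        prob = nxt
    return sum(mass for (q, _), mass in prob.items() if q in accepting)

# Hand-built substring automaton for the contiguous seed 11 (state 2 absorbing).
DELTA_11 = [[0, 1], [0, 2], [2, 2]]
ACCEPTING_11 = {2}
```

Setting `p_first = p` and `p_after = [p, p]` recovers the independent model, which gives a quick sanity check.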

Experimental Results with Mandala
• 428 human/mouse aligned genomic regions.
• Repeat-mask the alignments and separate them into coding/non-coding regions.
• A total of 1136000 similarities (alignments) were pulled. These are used to check the sensitivity (accuracy) of filters.

Effect of Span
• Solid line: 0-th order model
• Dashed line: 5-th order model
• W=11 throughout: a larger span implies more gaps; span=11 implies an ungapped (BLASTN) seed.

Accuracy of different seeds
• (Figure panels: Non-coding, Coding)

Model order
• Non-coding: solid line
• Coding: dashed line

What about multiple keywords
• All of the analysis is for ungapped alignments.
• With indels, multiple words might be more sensitive.
• Mandala works for multiple keywords also.
• Can we make the algorithm more efficient?
• In particular, there is an explosion of states in making a deterministic automaton. Can we match with a non-deterministic automaton instead?

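Regular Expressions
• Concise representation of a set of strings over an alphabet Σ.
• Described by a string over Σ and the operator symbols +, ·, *, (, ), ε.
• R is a r.e. if and only if it is ε, a single symbol σ ∈ Σ, or of the form R1 + R2, R1 · R2, or R1*, where R1 and R2 are r.e.s.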

Regular Expression
• Q: Let Σ = {A,C,E}
  • Is (A+C)*EEC* a regular expression?
  • *(A+C)?
  • AC*..E?
• Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?
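The membership questions for R = (A+C)*EEC* can be checked with Python's standard `re` module. One notational difference is assumed here: the slide's `+` means union, which Python writes as `|`. The helper name `in_language` is hypothetical.

```python
import re

# The slide's (A+C)*EEC* in Python syntax: union '+' becomes '|'.
R = re.compile(r"(A|C)*EEC*")

def in_language(s: str) -> bool:
    """True if the whole string s is in the language of R."""
    return R.fullmatch(s) is not None

for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
# prints:
# CEEC True
# AEC False
# ACEE True
```

AEC is rejected because every string in R must contain two consecutive E symbols.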

Regular Expression & Automata
• Every r.e. can be expressed by an automaton (a directed graph) with the following properties:
  • The automaton has a start and an end node.
  • Each edge is labeled with a symbol from Σ, or ε.
• Suppose R is described by automaton A.
• s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
• (A+C)*EEC*
• (Figure: automaton from start to end with edges labeled A, C, E, E, C.)

Constructing automata from R.E.
• R = {ε}
• R = {σ}, σ ∈ Σ
• R = R1 + R2
• R = R1 · R2
• R = R1*
• (Figures: the automaton for each case, built from ε-edges and the automata for R1 and R2.)
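The five cases can be sketched as a recursive construction in the style of Thompson's algorithm. Since the slide's figures are not available, the exact wiring of ε-edges below is an assumption, as are the tuple-based expression format (`("sym", "A")`, `("union", r1, r2)`, and so on) and the class name `NFA`. A small `accepts` method is included only so the result can be exercised.

```python
from itertools import count

EPS = None  # label used for epsilon edges

class NFA:
    def __init__(self):
        self.edges = []          # list of (source, label, target)
        self._ids = count()

    def build(self, r):
        """Return (start, end) nodes of an automaton for the expression r."""
        start, end = next(self._ids), next(self._ids)
        kind = r[0]
        if kind == "eps":                        # R = {eps}
            self.edges.append((start, EPS, end))
        elif kind == "sym":                      # R = {sigma}
            self.edges.append((start, r[1], end))
        elif kind == "union":                    # R = R1 + R2
            for sub in r[1:]:
                s, e = self.build(sub)
                self.edges += [(start, EPS, s), (e, EPS, end)]
        elif kind == "concat":                   # R = R1 . R2
            s1, e1 = self.build(r[1])
            s2, e2 = self.build(r[2])
            self.edges += [(start, EPS, s1), (e1, EPS, s2), (e2, EPS, end)]
        elif kind == "star":                     # R = R1*
            s, e = self.build(r[1])
            self.edges += [(start, EPS, end), (start, EPS, s),
                           (e, EPS, s), (e, EPS, end)]
        else:
            raise ValueError(f"unknown expression kind {kind!r}")
        return start, end

    def accepts(self, start, end, text):
        """True if some path from start to end is labeled with text."""
        def closure(nodes):
            seen, stack = set(nodes), list(nodes)
            while stack:
                u = stack.pop()
                for a, label, b in self.edges:
                    if a == u and label is EPS and b not in seen:
                        seen.add(b)
                        stack.append(b)
            return seen
        current = closure({start})
        for ch in text:
            current = closure({b for a, label, b in self.edges
                               if a in current and label == ch})
        return end in current

# (A+C)*EEC* written as nested tuples.
A_OR_C = ("union", ("sym", "A"), ("sym", "C"))
EXPR = ("concat",
        ("concat", ("concat", ("star", A_OR_C), ("sym", "E")), ("sym", "E")),
        ("star", ("sym", "C")))
```

Each expression node adds two automaton nodes and at most four edges, so the automaton stays linear in the size of the expression.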

Regular Expression Matching
• Given a database D, and a regular expression R, is a substring of D in R?
• Is there a string D[l..c] that is accepted by the automaton of R?
• Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E.
• If D[1..c] is accepted by the automaton RA:
  • There is a path labeled D[1]…D[c] that goes from START to END in RA.
  • There is a path labeled D[1]..D[c-1] from START to a node u, and a path labeled D[c] from u to END.

D.P. to match regular expression
• Define:
  • A[u,σ] = automaton node reached from u after reading σ
  • Eps(u) = set of all nodes reachable from node u using ε-transitions
  • N[c] = subset of nodes reachable from the START node after reading D[1..c]
• Q: when is v ∈ N[c]?

D.P. to match regular expression
• Q: when is v ∈ N[c]?
• A: If for some u ∈ N[c-1], with w = A[u,D[c]],
  • v ∈ {w} ∪ Eps(w)
• Base case: N[0] = {START} ∪ Eps(START)

Algorithm
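The content of this slide was not recovered, so the following is a sketch of the dynamic program as defined on the preceding slides rather than the original algorithm. The automaton for (A+C)*EEC* is hand-built for illustration; its node numbering, the dictionary `A` for A[u,σ], and the names `eps`, `reachable_sets` and `prefix_accepted` are all assumptions.

```python
# Hand-built automaton for (A+C)*EEC*:
#   0 --eps--> 1,  1 --A--> 1,  1 --C--> 1,  1 --E--> 2,
#   2 --E--> 3,    3 --C--> 3,  3 --eps--> 4
START, END = 0, 4
A = {(1, "A"): 1, (1, "C"): 1, (1, "E"): 2, (2, "E"): 3, (3, "C"): 3}
EPS_EDGES = {0: {1}, 3: {4}}

def eps(u):
    """Eps(u): all nodes reachable from u using epsilon transitions."""
    seen, stack = set(), [u]
    while stack:
        x = stack.pop()
        for y in EPS_EDGES.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def reachable_sets(D):
    """Return the list [N[0], N[1], ..., N[len(D)]]."""
    N = [{START} | eps(START)]            # base case N[0]
    for ch in D:
        current = set()
        for u in N[-1]:
            w = A.get((u, ch))            # w = A[u, D[c]], if that edge exists
            if w is not None:
                current |= {w} | eps(w)
        N.append(current)
    return N

def prefix_accepted(D):
    """Is D[1..c] accepted for c = len(D)?  True if END is in N[c]."""
    return END in reachable_sets(D)[-1]
```

Each step examines every node of N[c-1] once, so the total work is proportional to the length of D times the size of the automaton, without ever building a deterministic automaton.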

The final step
• We have answered the question:
  • Is D[1..c] accepted by R?
  • Yes, if END ∈ N[c]
• We need to answer:
  • Is D[l..c] (for some l, and some c) accepted by R?
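One common way to handle an arbitrary start position l is to re-insert the START node (with its ε-closure) into the reachable set before every character, so that a match may begin anywhere. The sketch below takes that approach; it is an assumption that this is the intended final step, and the function names and the hand-built automaton for (A+C)*EEC* are illustrative. Only non-empty substrings are reported.

```python
def eps_closure(nodes, eps_edges):
    """All nodes reachable from the given nodes using epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        x = stack.pop()
        for y in eps_edges.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def match_ends(D, A, eps_edges, start, end):
    """Return every c (1-based) such that some D[l..c], l <= c, is accepted."""
    restart = eps_closure({start}, eps_edges)
    current, ends = set(restart), []
    for c, ch in enumerate(D, start=1):
        step = {A[(u, ch)] for u in current if (u, ch) in A}
        current = eps_closure(step, eps_edges)
        if end in current:
            ends.append(c)
        current |= restart        # a match may also start at position c + 1
    return ends

# Hand-built automaton for (A+C)*EEC* (node numbering is illustrative).
A_TABLE = {(1, "A"): 1, (1, "C"): 1, (1, "E"): 2, (2, "E"): 3, (3, "C"): 3}
EPS_EDGES = {0: {1}, 3: {4}}
```

For example, `match_ends("AAEECA", A_TABLE, EPS_EDGES, 0, 4)` reports the end positions 4 and 5, corresponding to the substrings AAEE and AAEEC.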
