
Database Filtering


Presentation Transcript


  1. Database Filtering Vineet Bafna

  2. Project/Exam deadlines
     • May 2
       • Send email to me with a title of your project.
     • May 9
       • Each student/group gives a 10 min. presentation on their proposed project.
       • Show preliminary computations. What is the test plan? What is the data like, and how much is there?
     • Last week of classes:
       • A 20 min. presentation from each group
       • A written report on the project
       • A take home exam, due electronically on the date of the final exam

  3. Building better filters
     • Building better filters for ncRNA is an open and relatively unresearched problem.
     • In contrast, filters for sequence searches have been extensively researched.
     • Some non-intuitive ideas.
     • We will digress into sequence-based filters to see if some of the principles can be exported to other domains.

  4. Large Database Search
     • Given a query of length m
     • Identify all sub-sequences in a database that align to the query with a high score.
     • Imagine the database to be a single long string of length n.
     • The straightforward algorithm would employ a scan of the database. How much time would it take?
     [Figure: query aligned against the sequence database]

  5. D.P. computation
     • The entire computation is one large local alignment.
     • S[i,j]: score of the best local alignment of prefix 1..i of the database against prefix 1..j of the query.
     [Figure: dynamic programming matrix with database index i and query index j]
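The full scan can be sketched as a standard local-alignment recurrence. This is a minimal illustration, not the course's implementation: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for readability.

```python
def local_alignment_score(db, query, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of db and any substring of query.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(query) + 1)
    best = 0
    for i in range(1, len(db) + 1):
        cur = [0] * (len(query) + 1)
        for j in range(1, len(query) + 1):
            s = match if db[i - 1] == query[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(local_alignment_score("TTTACGTACGTTT", "ACGTACG"))
```

The two nested loops make the O(nm) cost of an unfiltered search explicit.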

  6. Large database search
     • Database size n = 10M, query size m = 300.
     • O(nm) = $3 \times 10^9$ computations
     [Figure: dynamic programming matrix, Database (n) by Query (m)]

  7. Filtering
     • The goal of filtering is to reduce the search space to o(nm) using a fast filter.
     • How can we filter?

  8. Observations
     • Much of the database is random from the query’s perspective.
     • Consider a random DNA string of length n.
     • Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
     • Assume for the moment that the query is all A’s (length k).
     • What is the probability that an exact match to the query can be found?

  9. Basic probability
     • Probability that there is a match starting at a fixed position i = $0.25^k$
     • What is the probability that some position i has a match?
     • Dependencies confound probability estimates.
     • Related question: What is the expected number of hits?

  10. Basic Probability: Expectation
     • Total money you expect to earn
     • Q: Toss a coin; each time it comes up heads, you get a dollar.
     • What is the money you expect to get after n tosses?
     • Let $X_i$ be the amount earned in the i-th toss.
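As a worked step, assuming a fair coin (the slide does not state the bias), linearity of expectation gives the answer without any reasoning about the joint distribution of the tosses:

```latex
E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \tfrac{1}{2} = \frac{n}{2}
```

The same step holds when the $X_i$ are dependent, which is what makes it useful for counting matches at overlapping positions.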

  11. Expected number of matches
     • Let $X_i = 1$ if there is a match starting at position i, $X_i = 0$ otherwise.
     • Expected number of matches = $E\left[\sum_i X_i\right] = \sum_i E[X_i] = \sum_i \Pr[X_i = 1] \approx n \cdot 0.25^k$

  12. Expected number of exact matches is small!
     • Expected number of matches = $n \cdot 0.25^k$
     • If $n = 10^7$, k = 10, then expected number of matches = 9.537
     • If $n = 10^7$, k = 11, expected number of hits = 2.38
     • If $n = 10^7$, k = 12, expected number of hits ≈ 0.6 < 1
     • Bottom line: An exact match to a substring of the query is unlikely just by chance.
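The arithmetic on this slide is easy to reproduce. The sketch below assumes uniform base frequencies (p = 0.25) and approximates the number of start positions by n, as the slide does; the function name is hypothetical.

```python
def expected_random_matches(n, k, p=0.25):
    """Expected number of positions where a fixed k-mer matches a random string of length n."""
    return n * p ** k

for k in (10, 11, 12):
    # Each extra base divides the expected count by 4.
    print(k, round(expected_random_matches(10**7, k), 3))
```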

  13. Blast filter
     • Take all m-k words of length k from the query.
     • Filter: Consider only those sequences that match at least one of these words.
     • Expected number of matches in a random database? = $(m-k)(n-k)(1/4)^k$
     • Efficiency = $(1/4)^k$
     • A small increase in k decreases this fraction considerably: each additional base cuts the expected number of random hits by a factor of 4.
     • What can we say about accuracy?

  14. Observation 2: Pigeonhole principle
     • Suppose we are looking for a database string with greater than 90% identity to the query (length 100).
     • Partition the query into 10 substrings of size 10. With fewer than 10 mismatches spread over 10 substrings, at least one substring must match the database string exactly.
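The pigeonhole argument can be checked empirically. The sketch below assumes an ungapped alignment (substitutions only, no insertions or deletions), which the slide does not state explicitly; `has_exact_block` and `mutate` are hypothetical helper names.

```python
import random

def has_exact_block(query, subject, block=10):
    """True if some aligned, non-overlapping block is identical in both strings."""
    assert len(query) == len(subject)
    return any(query[i:i + block] == subject[i:i + block]
               for i in range(0, len(query) - block + 1, block))

def mutate(seq, positions):
    """Return seq with a guaranteed substitution at each given position."""
    out = list(seq)
    for pos in positions:
        out[pos] = next(c for c in "ACGT" if c != seq[pos])
    return "".join(out)

rng = random.Random(0)  # fixed seed so the run is repeatable
query = "".join(rng.choice("ACGT") for _ in range(100))

# 9 mismatches cannot touch all 10 blocks, so the filter never misses.
subjects = [mutate(query, rng.sample(range(100), 9)) for _ in range(1000)]
print(all(has_exact_block(query, s) for s in subjects))

# One mismatch per block (10 mismatches, exactly 90% identity) defeats the filter.
print(has_exact_block(query, mutate(query, range(0, 100, 10))))
```

The second call shows why the guarantee needs strictly more than 90% identity.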

  15. Why is this important?
     • Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
     • Assume that the mismatches are randomly distributed.
     • What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
     • Rough calculations show that it is very low. The probability of an exact match of a short query substring to a truly similar subject is very high.
     • The rough calculation does not take dependencies into account.
     • Reality is better because the matches are not randomly distributed.
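The probability in question can be computed exactly under the slide's i.i.d. assumption with a small dynamic program over the current run length. The slide's own equation did not survive in this transcript, so the independence-based form used for `rough_no_run` is an assumption about what a rough estimate might look like; both function names are hypothetical.

```python
def prob_no_exact_run(length, w, p):
    """Probability that `length` i.i.d. match/mismatch positions contain no run of w matches."""
    # state[r] = probability that no run of w has occurred yet and the current run has length r
    state = [1.0] + [0.0] * (w - 1)
    for _ in range(length):
        nxt = [0.0] * w
        nxt[0] = (1 - p) * sum(state)      # a mismatch resets the run
        for r in range(w - 1):
            nxt[r + 1] = p * state[r]      # a match extends the run (mass reaching w is dropped)
        state = nxt
    return sum(state)

def rough_no_run(length, w, p):
    """Independence-based estimate: treat each of the length-w+1 windows as independent."""
    return (1 - p ** w) ** (length - w + 1)

exact = prob_no_exact_run(100, 10, 0.8)
rough = rough_no_run(100, 10, 0.8)
print(exact, rough)
```

Because overlapping windows are positively correlated, the independence-based estimate is a lower bound on the exact probability of having no exact 10 bp stretch, so it is worth running both before relying on the rough figure.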

  16. Combining the Facts
     • Consider the set of all substrings of the query string of fixed length W.
     • Prob. of exact match to a random database string is very low.
     • Prob. of exact match to a true homolog is very high.
     • This filter is efficient and accurate. What about speed?
     • Keyword Search (exact matches) is MUCH faster than sequence alignment.

  17. BLAST
     • Consider all (m-W) query words of size W (Default = 11).
     • Scan the database for exact match to all such words.
     • For all regions that hit, extend using a dynamic programming alignment.
     • Can be many orders of magnitude faster than SW over the entire string.
     [Figure: query words matched against Database (n)]

  18. Why is BLAST fast?
     • Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
     • Query m = 1000, random Db $n = 10^7$, no true positives (TP)
     • SW = O(nm) = $1000 \times 10^7 = 10^{10}$ computations
     • BLAST, W = 11
     • E(#11-mer hits) = $1000 \times (1/4)^{11} \times 10^7 \approx 2384$
     • Number of computations = $2384 \times 100 \times 50 = 1.192 \times 10^7$
     • Ratio = $10^{10} / (1.192 \times 10^7) \approx 839$
     • Further speed improvements are possible.
     [Figure: extension region around a hit, labeled 50 and 50]
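The estimate on this slide can be recomputed directly. The per-hit extension cost of 100 × 50 is taken from the slide as given; the variable names are illustrative.

```python
m, n, W = 1000, 10**7, 11

sw_cost = n * m                          # full Smith-Waterman scan
expected_hits = m * n * 0.25 ** W        # random 11-mer hits, no true positives
extension_cost = 100 * 50                # per-hit alignment cost assumed on the slide
blast_cost = expected_hits * extension_cost
speedup = sw_cost / blast_cost

print(round(expected_hits), blast_cost, round(speedup, 1))
```

Note that the ratio simplifies to $4^W / 5000$, so it depends only on the word size and the per-hit extension cost, not on m or n.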

  19. Filter Speed: Keyword Matching
     • How fast can we match keywords?
     • Hash table/Db index? What is the size of the hash table for words of length 11?
     • Suffix trees? What is the size of the suffix trees?
     • Trie based search. We will do this in class.
     [Figure: example index entry, AATCA → 567]
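A hash-based index is the simplest of these options to sketch. The version below is an illustration only, with hypothetical function names; it stores only the k-mers that actually occur in the database, whereas a direct-address table over DNA would need $4^{11} = 4{,}194{,}304$ slots for k = 11.

```python
from collections import defaultdict

def build_kmer_index(db, k):
    """Map every k-mer in db to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(db) - k + 1):
        index[db[pos:pos + k]].append(pos)
    return index

def seed_hits(query, index, k):
    """All (query_pos, db_pos) pairs where a query k-mer occurs exactly in the database."""
    return [(q, d)
            for q in range(len(query) - k + 1)
            for d in index.get(query[q:q + k], [])]

index = build_kmer_index("ACGTACGT", 4)
print(index["ACGT"], seed_hits("TACG", index, 4))
```

This design indexes the database once and reuses it across queries; the trie approach on the following slides instead preprocesses the query words and streams the database.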

  20. Dictionary Matching
     • Q: Given k words ($s_i$ has length $l_i$), and a database of size n, find all matches to these words in the database string.
     • How fast can this be done?
     Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
     Database: POTASTPOTATO

  21. Dict. Matching & string matching
     • How fast can you do it, if you only had one word of length m?
       • Trivial algorithm: O(nm) time
       • Pre-processing O(m), search O(n) time.
     • Dictionary matching
       • Trivial algorithm: $O((l_1 + l_2 + l_3 + \dots) n)$
       • Using a keyword tree: $O(l_p n)$ ($l_p$ is the length of the longest pattern)
       • Aho-Corasick: O(n) after preprocessing $O(l_1 + l_2 + \dots)$
     • We will consider the most general case.

  22. Direct Algorithm
     [Figure: the pattern POTATO slid along the database POTASTPOTATO one position at a time]
     Observations:
     • When we mismatch, we (should) know something about where the next match will be.
     • When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

  23. The Trie Automaton
     • Construct an automaton A from the dictionary.
     • A[v,x] describes the transition from node v to a node w upon reading x.
     • A[u,’T’] = v, and A[u,’S’] = w
     • Special root node r
     • Some nodes are terminal, and labeled with the index of the dictionary word.
     Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
     [Figure: keyword tree for the dictionary, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]

  24. An $O(l_p n)$ algorithm for keyword matching
     • Start with the first position in the db, and the root node.
     • If successful transition:
       • Increment ‘current’ pointer
       • Move to a new node
       • If terminal node, report “success”
     • Else:
       • Retract ‘current’ pointer
       • Increment ‘start’ pointer
       • Move to root & repeat
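The steps on this slide can be sketched with a trie stored as a list of dictionaries. This is a minimal illustration under that representation choice; `build_trie` and `naive_trie_scan` are hypothetical names, and positions are 0-based.

```python
def build_trie(words):
    """Trie as a list of dict nodes; terminal[v] holds the 1-based index of the word ending at v."""
    children, terminal = [{}], [None]
    for idx, word in enumerate(words, start=1):
        v = 0
        for ch in word:
            if ch not in children[v]:
                children.append({})
                terminal.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        terminal[v] = idx
    return children, terminal

def naive_trie_scan(db, words):
    """Report (start, word_index) for every dictionary word occurring in db."""
    children, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):              # 'start' pointer advances by one each round
        v, cur = 0, start                     # back at the root, 'current' retracted to 'start'
        while cur < len(db) and db[cur] in children[v]:
            v = children[v][db[cur]]          # successful transition
            cur += 1
            if terminal[v] is not None:
                hits.append((start, terminal[v]))
    return hits

print(naive_trie_scan("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```

Each start position walks at most $l_p$ edges before failing, which is where the $O(l_p n)$ bound comes from.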

  25. Illustration
     [Figure: keyword tree scan of the database POTASTPOTATO, with pointers l and c and current node v]

  26. Idea for improving the time
     • Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match, then prefix(pattern j) = suffix(first c-l characters of pattern i).
     Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
     [Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM fails after matching POTAS, and pattern j = TASTE continues from the suffix TAS]

  27. Improving speed of dictionary matching
     • Every node v corresponds to a string $s_v$ that is a prefix of some pattern.
     • Define F[v] to be the node u such that $s_u$ is the longest proper suffix of $s_v$ (among all nodes u of the keyword tree).
     • If we fail to match at v, we should jump to F[v], and commence matching from there.
     • Let lp[v] = $|s_u|$
     [Figure: keyword tree with nodes numbered 1 to 11]

  28. An O(n) algorithm for keyword matching
     • Start with the first position in the db, and the root node.
     • If successful transition:
       • Increment ‘current’ pointer
       • Move to a new node
       • If terminal node, report “success”
     • Else (if at root):
       • Increment ‘current’ pointer
       • Move ‘start’ pointer
       • Move to root
     • Else:
       • Move ‘start’ pointer forward
       • Move to failure node
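A compact sketch of this failure-link scan follows. It is one common way to write Aho-Corasick, not necessarily the version developed in class: the `out` lists (which also report dictionary words that end inside a longer partial match) are an addition the slides do not spell out, and all names are hypothetical. Positions are 0-based.

```python
from collections import deque

def build_automaton(words):
    """Keyword tree plus failure links; out[v] lists (word_index, length) of words ending at v."""
    children, fail, out = [{}], [0], [[]]
    for idx, word in enumerate(words, start=1):
        v = 0
        for ch in word:
            if ch not in children[v]:
                children.append({}); fail.append(0); out.append([])
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        out[v].append((idx, len(word)))
    queue = deque(children[0].values())       # depth-1 nodes fail to the root
    while queue:                              # breadth-first, so shallower links are ready
        v = queue.popleft()
        for ch, w in children[v].items():
            f = fail[v]
            while f and ch not in children[f]:
                f = fail[f]
            fail[w] = children[f].get(ch, 0)
            out[w] = out[w] + out[fail[w]]    # inherit words that end at the failure node
            queue.append(w)
    return children, fail, out

def find_all(db, words):
    """Report (start, word_index) for every occurrence; 'current' never moves backward."""
    children, fail, out = build_automaton(words)
    hits, v = [], 0
    for c, ch in enumerate(db):
        while v and ch not in children[v]:
            v = fail[v]                       # jump to failure node; 'start' moves forward
        v = children[v].get(ch, 0)
        for idx, length in out[v]:
            hits.append((c - length + 1, idx))
    return hits

print(find_all("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```

On the slide's example, the scan fails at POTAS, jumps to the node for TAS, and continues without re-reading any database character.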

  29. Illustration
     [Figure: scan of the database POTASTPOTATO with pointers l and c, current node v, and terminal node 1]

  30. Time analysis
     • In each step, either c is incremented, or l is incremented.
     • Neither pointer is ever decremented (lp[v] < c-l).
     • l and c do not exceed n.
     • Total time ≤ 2n
     [Figure: database POTASTPOTATO with pointers l and c]

  31. Blast: Putting it all together
     • Input: Query of length m, database of size n
     • Select word-size, scoring matrix, gap penalties, E-value cutoff

  32. Blast Steps
     • Generate an automaton of all query keywords.
     • Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
     • Extend each hit using a variant of “local alignment” algorithm. Use the scoring matrix and gap penalties.
     • For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
     • Output results.

  33. Can we improve the filter?
     • For a query word of size M,
     • Consider a binary string Q of length M with W ≤ M ones.
     • Q ‘matches’ a substring as long as the ‘ones’ match.
     • Example: Q = 11010011010 over the word ACCGTCACGTA, with M = 11 and W = 6 (W = weight of spaced seed).
     [Figure: the seed applied to the sequence ACCATAAACAGAUACTTAATTTGGGA]
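The matching rule can be written in a few lines. The sketch below uses the slide's example seed; the function names are hypothetical, and the all-pairs scan in `spaced_hits` is for illustration only (a real tool would index the seed positions rather than compare every pair).

```python
def seed_matches(seed, a, b):
    """True if a and b agree at every position where the seed has a '1'."""
    assert len(seed) == len(a) == len(b)
    return all(x == y for s, x, y in zip(seed, a, b) if s == "1")

def spaced_hits(seed, query, db):
    """All (query_pos, db_pos) pairs where the seed matches (quadratic, illustration only)."""
    M = len(seed)
    return [(i, j)
            for i in range(len(query) - M + 1)
            for j in range(len(db) - M + 1)
            if seed_matches(seed, query[i:i + M], db[j:j + M])]

seed = "11010011010"                                      # M = 11, W = 6
print(seed_matches(seed, "ACCGTCACGTA", "ACTGTCACGTA"))   # differs only at a '0' position
print(seed_matches(seed, "ACCGTCACGTA", "TCCGTCACGTA"))   # differs at a '1' position
```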

  34. Can spaced seeds help?
     • The ‘spaced seed’ for BLAST has W consecutive 1s.
     • Efficiency?
       • BLAST: Expected(hits) = $n p^W$
       • For any (M,W), expected hits ≈ $n p^W$
     • Accuracy?

  35. Accuracy
     • Consider a 64bp sequence that is 70% similar to the query.
     • Pr(an 11-mer matches) = 0.3
     • Pr(a spaced seed 11101001.. matches) = 0.466
     • This non-intuitive result leads to selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity.
     • Implemented in PATTERNHUNTER

  36. How to compute a spaced seed
     • No good algorithm is known.
     • Iterate over all (M choose W) seeds.
     • Use a computation to decide Pr(match).
     • Choose the seed that maximizes probability.

  37. Prob. Computation for Spaced Seeds
     • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
     • We can assume that there is a probability p of match.
     • The match/mismatch string is a random binary string in which each position is 1 with probability p.
     [Figure: example match string 1110111011101111110111100 over positions 1..L]

  38. Prob. Computation for Spaced Seeds
     • Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
     • Q is a binary string of length M, with W 1s.
     • We try to match the binary ‘match string’ S, which is a random binary string with probability p of success.
     • $P_Q$ = Prob. (Q matches random S at some location)
     • How can we compute $P_Q$?
     [Figure: seed Q of length M placed along the match string S, positions 1..L]

  39. Computing f(i,b)
     • For a specific string b, define
     • f(i,b) = Prob. (Q matches a random string S of length i, s.t. S ends in b)
     [Figure: string S of length i ending in the suffix b]

  40. Why is it sufficient to compute f(i,b)?
     • $P_Q = f(L, \epsilon)$, where $\epsilon$ is the empty string
     • We have two possibilities:
       • $b \in B_1$: b is consistent with a suffix of Q.
       • $b \in B_0 = B - B_1$
     [Figure: the string 110001 aligned against the seed Q]

  41. Computing f(i,b)
     • Case $b \in B_0$
       • f(i,b) = f(i-1, b>>1)
     • Case $b \in B_1$ and |b| = M
       • f(i,b) = 1

  42. Computing f(i,b)
     • Case $b \in B_1$ and |b| < M
       • f(i,b) = p f(i,1b) + (1-p) f(i,0b)
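The hit probability $P_Q$ can also be computed with a forward dynamic program. The sketch below is not the f(i,b) recursion from the slides: as a swapped-in alternative, it tracks which partial placements of the seed are still alive after each position of the match string, which should give the same probability under the same i.i.d. model. The function name is hypothetical.

```python
def hit_probability(seed, L, p):
    """P(seed hits somewhere in a random 0/1 match string of length L, each bit 1 w.p. p)."""
    M = len(seed)
    need = [c == "1" for c in seed]
    # A state is the set of offsets t such that some seed placement has consumed its
    # first t positions without a required mismatch. dist maps state -> probability.
    dist = {frozenset(): 1.0}
    hit = 0.0
    for _ in range(L):
        nxt = {}
        for state, prob in dist.items():
            for bit, w in ((1, p), (0, 1 - p)):
                new_state, found = set(), False
                for t in list(state) + [0]:   # live placements plus one starting here
                    if need[t] and not bit:
                        continue              # placement dies on a required mismatch
                    if t + 1 == M:
                        found = True          # the whole seed has matched
                    else:
                        new_state.add(t + 1)
                if found:
                    hit += prob * w           # absorb: at least one hit has occurred
                else:
                    key = frozenset(new_state)
                    nxt[key] = nxt.get(key, 0.0) + prob * w
        dist = nxt
    return hit

contiguous = hit_probability("1" * 11, 64, 0.7)
print(round(contiguous, 3))
```

A spaced seed of the same weight can be passed in the same way to compare sensitivities, and looping this function over candidate seeds gives the brute-force search described on slide 36.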
