
Pattern Discovery in Sequences under a Markov assumption Darya Chudova, Padhraic Smyth



Presentation Transcript


  1. Pattern Discovery in Sequences under a Markov assumption Darya Chudova, Padhraic Smyth Information and Computer Science University of California, Irvine Funding Acknowledgements National Science Foundation, Microsoft Research, IBM Research

  2. Outline  • Pattern discovery problem • Problem statement • Research questions • Bayes error rate framework • Experimental results • Conclusions

  3. ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB

  4. ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADABCCCBBCDBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBAABBCCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCABBBCBBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBABBCCAADCBCDACBCABABCCBACBBBCADDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB

  5. Applications in Biology • Motif discovery problem • Task: Identification of potential binding sites of proteins • Input: Upstream regions of co-regulated genes • Patterns are non-deterministic • fixed length motifs • may have substitution errors • substitutions are independent

  6. Generative Model • Hidden Markov Model [Figure: background state BG emits each symbol of the alphabet A with probability 0.25; pattern states P1–P4 each emit their consensus symbol with probability 0.9; transitions BG→BG 0.99, BG→P1 0.01, P1→P2→P3→P4→BG each 1.0; the pattern has length L and frequency F] • Extensions: • Variable length patterns • Multiple patterns, multi-part patterns • Richer background
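As a concrete illustration, the generative model above can be sketched in a few lines of Python. The pattern "ABBD", the entry probability, and the substitution rate below are illustrative assumptions, not the talk's exact settings.

```python
import random

ALPHABET = "ABCD"

def sample_sequence(n, pattern="ABBD", p_enter=0.01, p_sub=0.1, seed=0):
    """Sample a length-n sequence and its 0/1 state labels (1 = pattern).

    The background state emits uniformly over the alphabet; with
    probability p_enter the chain enters the pattern states, which emit
    the consensus symbol and suffer independent substitutions w.p. p_sub.
    """
    rng = random.Random(seed)
    seq, labels = [], []
    i = 0
    while i < n:
        if rng.random() < p_enter and i + len(pattern) <= n:
            for c in pattern:  # pattern states fire deterministically, left to right
                if rng.random() < p_sub:  # substitution error
                    c = rng.choice([a for a in ALPHABET if a != c])
                seq.append(c)
                labels.append(1)
                i += 1
        else:
            seq.append(rng.choice(ALPHABET))  # uniform background emission
            labels.append(0)
            i += 1
    return "".join(seq), labels

seq, labels = sample_sequence(200)
```

The returned label vector is the ground truth that the discovery algorithms discussed later try to recover.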

  7. Pattern Detection and Discovery • Unsupervised discovery of embedded patterns EM Unlabeled Data Pattern Model • Supervised detection of known patterns Viterbi Unlabeled Data Pattern Model Pattern Locations
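Supervised detection with a known pattern model reduces to standard HMM decoding. Below is a minimal Viterbi sketch for the background-plus-pattern HMM; the pattern "ABBD" and the entry and match probabilities are assumed values for illustration.

```python
import math

ALPHABET = "ABCD"

def viterbi_pattern(seq, pattern="ABBD", p_enter=0.01, p_match=0.9):
    """Return the most probable 0/1 labels (1 = pattern) for each symbol.

    States: 0 = background, 1..L = pattern positions.
    """
    L = len(pattern)
    n_states = L + 1

    def log_emit(s, c):
        if s == 0:
            return math.log(1.0 / len(ALPHABET))  # uniform background
        p = p_match if c == pattern[s - 1] else (1 - p_match) / (len(ALPHABET) - 1)
        return math.log(p)

    def log_trans(s, t):
        if s == 0:
            p = p_enter if t == 1 else (1 - p_enter if t == 0 else 0.0)
        elif s < L:
            p = 1.0 if t == s + 1 else 0.0  # deterministic pattern chain
        else:
            p = 1.0 if t == 0 else 0.0      # last pattern state returns to BG
        return math.log(p) if p > 0 else -math.inf

    V = [[-math.inf] * n_states for _ in seq]
    back = [[0] * n_states for _ in seq]
    V[0][0] = log_emit(0, seq[0])  # assume we start in the background
    for i in range(1, len(seq)):
        for t in range(n_states):
            best_s = max(range(n_states), key=lambda s: V[i - 1][s] + log_trans(s, t))
            V[i][t] = V[i - 1][best_s] + log_trans(best_s, t) + log_emit(t, seq[i])
            back[i][t] = best_s
    # Trace back the most probable state path
    state = max(range(n_states), key=lambda t: V[-1][t])
    path = [state]
    for i in range(len(seq) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    path.reverse()
    return [1 if s > 0 else 0 for s in path]

# The exact occurrence of ABBD at positions 4-7 is recovered
labels = viterbi_pattern("CDCDABBDCDCD")
```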

  8. Research Questions • Current state • Successful algorithms for motif discovery • Cannot solve some seemingly simple problems • Why is this happening? • Are the algorithms suboptimal? • Is the data set too small? • Is the problem inherently difficult and ambiguous? • Can such questions be answered in a principled way?

  9. Outline  • Pattern discovery problem • Bayes error rate framework • Definitions • Examples • Application to pattern discovery • Experimental results • Conclusions 

  10. Difficulty of Pattern Discovery • Assume true model is given • Measure of difficulty • classification performance of pattern detection • Multiple factors influence the difficulty • Single characteristic that quantifies it? • Bayes error rate

  11. Bayes Error Rate: Definition • Bayes error rate: • average error rate of the optimal decision rule • a lower bound on classification performance of any algorithm on a given problem • Optimal rule: • pick the class with the highest posterior probability
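A toy two-class example (hypothetical numbers) makes the definition concrete: the Bayes error is the probability mass on which even the max-posterior rule is wrong.

```python
# Hypothetical model: a rare "pattern" class vs a uniform background,
# over a 4-symbol alphabet. These numbers are illustrative only.
priors = {"pattern": 0.1, "background": 0.9}
emit = {
    "pattern":    {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1},
    "background": {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25},
}

bayes_error = 0.0
for x in "ABCD":
    joint = {c: priors[c] * emit[c][x] for c in priors}  # P(c, x)
    # The optimal rule picks argmax_c P(c | x); its error mass at x
    # is the joint probability of every other class.
    bayes_error += sum(joint.values()) - max(joint.values())

print(round(bayes_error, 4))  # → 0.1
```

Here the background posterior dominates for every symbol, so the optimal rule never predicts "pattern" and the Bayes error equals the pattern prior of 0.1: exactly the degenerate "all patterns recognized as background" regime discussed on the example slides below.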

  12. Classification Example • Simple problem • Harder problem

  13. Pattern Discovery Example • Simple problem • BDCBBDCABCDADCADAADABDAABBCBCACADAADCDDABDCADAADCBBAADBDBDBDAABACABBABBCAADDBCBADCBDDABCDABBBDBBCCBDAABCAABACDDADCADBCABADBCABAAABBCABBCDAABDCDAABACDDACCCDBCDDDBAADCBDAADDBBADAAADAADAADBAABDBDACADBDCCBACBACBADABDCACBCBDDCBACBAAADDCABBADDDCABCDCCCCBDDCADBBCDDACCDBBBACAACADBDACDAADCDACBADAADCBABACADAADBAABAAAADAADDDADBDDCBCDDCCBDDCDCBBBDAADDBDBBCDACBCCCBCBCDAD • Hard problem • DDDBACABBCDDCDCBBCACCBDADDBBACDACCBDADCCCDBDBDADABABDCBCDABDBABABCCADBBDCDBBBBDACBBAABBBBBADCCAACACDACCCBBCADDADDBACCCBDABCCCBDADDADABDBABAAACCDBCBCDCBCABABCDBCDDAAACBADACCBCDABAACDCDCDDBCCACBDDADAACABDADDBBDDBCAADBAADBACBDADDBDBDACACDBBBBCADACCBDDBDBCCAACAADABDCBDDCCDDACBDDDCCBCCBCDCACCBDACDCDADCDCDDDADCCCBDACBCBDCACCDDBBACCBBCCDBBABAADABABDCDDBBDCDDAADDABBCBAB

  14. Bayes Error in Markov Context • Closed-form expressions are hard to obtain • Special cases considered in the 1960s and 70s • Raviv (1967) • Used context to improve text classification • Chu (1970), Lee (1974) • Lower and upper bounds for a 2-state HMM • The context is limited to 1 or 2 symbols

  15. Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) • IID: P(hidden state | next L symbols)
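Under the IID approximation, the posterior that a window of the next L symbols was generated by the pattern has a simple closed form. A sketch, assuming a uniform background and a single per-position substitution probability (all parameter values below are illustrative):

```python
def window_posterior(window, pattern, p_pat, p_match=0.8, alphabet="ABCD"):
    """IID approximation: P(pattern | next L symbols) via Bayes' rule.

    p_pat is the prior probability that a window is a pattern occurrence;
    the background emits uniformly; each pattern position matches its
    consensus symbol with probability p_match.
    """
    p_bg = (1.0 / len(alphabet)) ** len(window)          # background likelihood
    p_sub = (1 - p_match) / (len(alphabet) - 1)          # per-symbol substitution
    lik = 1.0
    for c, p in zip(window, pattern):                    # pattern likelihood
        lik *= p_match if c == p else p_sub
    num = p_pat * lik
    return num / (num + (1 - p_pat) * p_bg)

p_hit = window_posterior("ABBD", "ABBD", p_pat=0.005)
p_miss = window_posterior("CCCC", "ABBD", p_pat=0.005)
```

Note that even an exact match can yield a posterior well below 1 when the pattern is rare, which is what drives the high Bayes error rates reported later.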

  16. How Accurate is the Analysis?

  17. Insights from Bayes Error Rate • Example: • Alphabet size |A| = 4 • Pattern length L = 5 • Pattern frequency F = 0.005 • Substitution probability ε = 0.2 • How hard is this problem? • How sensitive is “problem hardness” to these parameters?

  18. Example • Input: L, F, ε | Problem: normalized Bayes error rate? | Solution: Pe* = 0.87 • Input: L, F | Problem: ε such that all patterns are recognized as background | Solution: ε ≈ 0.28 • Input: L, F, ε | Problem: false negative / false positive rate? | Solution: FN = 77%

  19. Extensions • Loss functions other than 0/1 loss • Multiple distinct patterns • Variable length patterns (insertions and deletions) • Insertions and deletions increase the Bayes error

  20. The Autocorrelation Effect • Which pattern is easier to learn: DACBDDBADB or AAAAAAAAAA?

  21. Outline  • Pattern discovery problem • Bayes error rate framework • Experimental results • Comparison of algorithms • Application to real-world problem • Conclusions  

  22. Probabilistic Algorithms • HMM-EM • IID-Gibbs • Motif Sampler: Liu, Neuwald, Lawrence (1993, 1995, …) • IID-EM • MEME: Bailey & Elkan (1995, 1998, …) • Comparison: IID-Gibbs uses local context with stochastic learning; IID-EM uses local context with deterministic learning; HMM-EM uses global context with deterministic learning

  23. Using the Bayes Error Framework • The estimation problem can be decomposed into • Estimation of pattern locations • Estimation of emissions given pattern locations • What is the effect of each of these factors? • How far are these algorithms from their theoretical optimum?

  24. Test Accuracy vs. Training Size [Figure: test accuracy as a function of training data size, comparing the algorithms, the known-locations estimate, and the Bayes error]

  25. Can Algorithms be Improved? • Three gaps to be bridged: • From “current algorithms” to “known locations”: • “location noise”: reduce with better algorithms? • From “known locations” to Bayes error: • “estimation noise”: need more data or prior knowledge • From Bayes error to zero error: • need additional features/measurements

  26. Test Accuracy vs. Bayes Error [Figure: test accuracy vs. Bayes error at 2K training size]

  27. Test Accuracy vs. Bayes Error [Figure: test accuracy vs. Bayes error at 2K and 4K training sizes]

  28. Application to Real Problems • DPInteract database, Robison et al. (1999) • experimentally verified binding sites • 55 protein families in E. coli • Supervised learning [Figure: aligned binding sites, e.g. ACAGAATAAAAATACACT, TTCGAATAATCATGCAAA, ..., AGTGAGTGAATATTCTCT, fed into a pattern model] • Unsupervised learning • Some problems are not solvable due to high Bayes error

  29. Bayes Error of Experimental Problems

  30. Summary of Contributions • Analyzed sequential pattern discovery using Bayes error rate • Bayes error = lower bound on error rate of any discovery algorithm • Explicit analytical form for error rate dependence on pattern parameters • Provides insight into what makes a learning problem hard • Example: autocorrelated patterns are harder to learn • Experimental results • Test error = Bayes error + location error + estimation error • Current algorithms • tend to perform similarly, can be quite far away from Bayes error • future improvements? • Real world motif discovery problems can have very high Bayes error

  31. Pattern Discovery Problem • Input • Set of strings over finite alphabet (e.g. {A,B,C,D}) • Task • Unsupervised identification of recurrent patterns embedded in a background process

  32. Future work • Further analysis of Bayes error rate • Quantifying the effect of insertions / deletions • Application and insights into biological problems • Development of suitable learning algorithms • Flat likelihood surface

  33. More Complex Models • Variable length patterns [Figure: the HMM of slide 6 extended with an insertion state I1 (uniform emissions, 0.25 each) reachable with probability 0.1 between pattern states; pattern states P1–P4 emit consensus symbols with probability 0.9; BG→P1 0.01, BG→BG 0.99] • Multiple patterns • Multi-part patterns • Richer background

  34. Extensions of Generative Model • Variable length patterns • special insertion / deletion states • Multiple patterns • multiple pattern chains connected to the background • Multi-part patterns • insertion states at the gaps • Richer background • multiple background states

  35. Learnability of patterns • Multiple factors influence learnability • alphabet size • pattern length • pattern frequency • variability of the pattern • similarity of pattern and background • Single characteristic that quantifies the difficulty? • In multivariate statistics, the Bayes error rate plays this role • The Bayes error rate applies to classification problems

  36. Bayes Error Rate: Definition • Optimal decision rule: pick the class with the highest posterior probability • The probability of error for each example is one minus the maximum posterior • The Bayes error rate is obtained by averaging this error over examples
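The formulas on this slide did not survive extraction; in standard notation, consistent with the rule stated on slide 11, the three bullets read:

```latex
% Optimal (Bayes) decision rule: pick the class with the highest posterior
\delta^*(x) = \arg\max_{c} \; P(c \mid x)

% Probability of error for each example x under that rule
P(\mathrm{error} \mid x) = 1 - \max_{c} \; P(c \mid x)

% Bayes error rate: average over the distribution of examples
P_e^* = \mathbb{E}_{x}\!\left[\, 1 - \max_{c} \; P(c \mid x) \,\right]
```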

  37. Example: Pattern Discovery • Pattern model is known • Classify each symbol as pattern or background • ABDABBDABDDACABBDDBCBDBDBCADBD • 000111100000011110000000000000 • Optimal classification • relies on posterior probabilities of states • evaluated by forward-backward algorithm • makes mistakes • How to evaluate Bayes error?
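The posterior state probabilities used by this optimal symbol-level classifier are exactly what the forward-backward algorithm computes. A small unscaled sketch for the background-plus-pattern HMM (pattern and probabilities are assumed values; no rescaling, so suitable only for short sequences):

```python
def forward_backward(seq, pattern="ABBD", p_enter=0.01, p_match=0.9, alphabet="ABCD"):
    """Return P(symbol i is in a pattern state | whole sequence) for each i."""
    L = len(pattern)
    S = L + 1  # state 0 = background, 1..L = pattern positions

    def emit(s, c):
        if s == 0:
            return 1.0 / len(alphabet)
        return p_match if c == pattern[s - 1] else (1 - p_match) / (len(alphabet) - 1)

    def trans(s, t):
        if s == 0:
            return p_enter if t == 1 else (1 - p_enter if t == 0 else 0.0)
        if s < L:
            return 1.0 if t == s + 1 else 0.0
        return 1.0 if t == 0 else 0.0

    n = len(seq)
    fwd = [[0.0] * S for _ in range(n)]
    fwd[0][0] = emit(0, seq[0])  # assume we start in the background
    for i in range(1, n):
        for t in range(S):
            fwd[i][t] = emit(t, seq[i]) * sum(fwd[i - 1][s] * trans(s, t) for s in range(S))
    bwd = [[1.0] * S for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in range(S):
            bwd[i][s] = sum(trans(s, t) * emit(t, seq[i + 1]) * bwd[i + 1][t] for t in range(S))
    post = []
    for i in range(n):
        g = [fwd[i][s] * bwd[i][s] for s in range(S)]
        post.append(sum(g[1:]) / sum(g))  # posterior mass on pattern states
    return post

post = forward_backward("CDCDABBDCDCD")
```

Thresholding these posteriors at 0.5 gives the optimal 0/1 labeling; the "makes mistakes" bullet is precisely the residual error mass of this rule.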

  39. IID and IID/PURE assumptions • BayesError < BayesError_IID • tight for non-autocorrelated patterns • leads to complex closed-form expression • BayesError_IID ≈ BayesError_PURE • BBBB or PPPP only • BBPP not allowed

  40. Approximation of Bayes Error • Bayes error for patterns is approximated by • Quality of approximation • Bayes error can be estimated empirically
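Empirical estimation is straightforward: sample labeled data from the true model, apply the optimal max-posterior rule, and count its mistakes. A Monte Carlo sketch on a hypothetical symbol-level model (the numbers are illustrative, not the paper's):

```python
import random

def estimate_bayes_error(priors, emit, n=100_000, seed=0):
    """Monte Carlo estimate of the Bayes error rate.

    Sample (class, symbol) pairs from the true model, classify each
    symbol with the optimal rule argmax_c P(c) * P(x | c), and report
    the empirical error frequency.
    """
    rng = random.Random(seed)
    classes = list(priors)
    errors = 0
    for _ in range(n):
        c = rng.choices(classes, weights=[priors[k] for k in classes])[0]
        symbols = list(emit[c])
        x = rng.choices(symbols, weights=[emit[c][s] for s in symbols])[0]
        guess = max(classes, key=lambda k: priors[k] * emit[k][x])
        errors += guess != c
    return errors / n

# Hypothetical model: rare "pattern" class vs uniform background.
pe_hat = estimate_bayes_error(
    {"pattern": 0.1, "background": 0.9},
    {"pattern": {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1},
     "background": {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}},
)
```

For this model the analytic Bayes error is 0.10 (the optimal rule always answers "background"), and the estimate converges to it as n grows.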

  41. Bayes Error for Pattern Discovery • Analytical approximation • fixed length patterns • uniform background / substitution probabilities • Limited context (IID assumption): • HMM: P(hidden state | observed sequence) ABDABBDABDDACABBDDBCBDBDBCADBD • IID: P(hidden state | next L symbols) ABDABBDABDDACABBDDBCBDBDBCADBD

  42. IID and PURE Assumptions • BayesError < BayesError_IID • tight for non-autocorrelated patterns • leads to complex closed-form expression • BayesError_IID ≈ BayesError_PURE • BBBB or PPPP only • BBPP not allowed

  43. Varying the substitution probability

  44. The Autocorrelation Effect • Autocorrelated (e.g., ABABABAB) patterns have higher Bayes error rate • harder to detect when true model is known • harder to learn when true model is not known • Boundaries of periodic patterns are fuzzy when substitutions are allowed • Illustrated by the posterior pattern probability

  45. Example: |A| = 4, L = 5, F = 0.005, ε = 0.2; Pe* = 0.87 • Input: L, F | Problem: Pe* as ε → 0? | Solution: Pe* = 0.2 • Input: L, F | Problem: ε such that all patterns are recognized as background | Solution: ε ≈ 0.28 • Input: L, ε | Problem: F such that all patterns are recognized as background | Solution: F ≈ 0.003 • Input: L, F, ε | Problem: k*, the maximum allowed errors in the pattern? | Solution: k* = 0 • Input: L, F, ε | Problem: false negative / false positive rate? | Solution: FP = 77%

  46. Algorithms for motif discovery • Combinatorial search algorithm • Pevzner & Sze, 2000 • finds the largest cliques in the graph induced by the edit-distance between the L-mers • Hill climbing • Hu, Kibler & Sandmeyer (1999) • objective function maximizes the difference between background and pattern distributions

  47. Algorithms for motif discovery • Detection of over-represented exact k-mers • van Helden, André & Collado-Vides (1998) • comparing the number of occurrences in the upstream regions of co-regulated genes and in the whole genome • Detection of over-represented non-exact k-mers • Buhler & Tompa (2000) • Method of random projections • May be used to initialize probabilistic models

  48. Quality of solution • Pattern structure affects the quality of fitted models • Higher Bayes error means lower quality of solutions • Parameters: Length = 10, P_pat = 0.01, E[#Errors] = 2

  49. Pattern Structure • Bayes error depends on pattern structure through the autocorrelation vector (Guibas & Odlyzko, 1981) • translation-invariant patterns are harder to label / learn [Figure: example patterns ranked by increasing difficulty]
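The autocorrelation vector named here is easy to compute: entry k records whether the pattern matches itself when shifted by k positions (a sketch of the Guibas-Odlyzko definition):

```python
def autocorrelation(pattern):
    """Autocorrelation vector of a pattern: entry k is 1 iff the
    pattern overlaps itself at shift k, i.e. its suffix starting at
    position k equals its prefix of the same length."""
    L = len(pattern)
    return [1 if pattern[k:] == pattern[:L - k] else 0 for k in range(L)]
```

A fully periodic pattern like AAAAAAAAAA has an all-ones vector, while DACBDDBADB overlaps itself only at shift 0, matching the intuition on the autocorrelation-effect slide.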
