Probabilistic Suffix Trees



  1. Probabilistic Suffix Trees CMPUT 606 Maria Cutumisu October 13, 2004

  2. Goal • Provide efficient prediction for protein families • Probabilistic Suffix Trees (PSTs) are variable length Markov models (VMMs)

  3. Conceptual Map

  4. Background • PSTs were introduced by Ron, Singer, and Tishby • Bejerano and Yona made further improvements (bPST) • Poulin: efficient PSTs (ePSTs) • PSTs are also known as prediction suffix trees

  5. Higher Order Markov Models • A k-order Markov chain conditions each symbol on a history of length k • Storage requirements grow exponentially with k (see the count below) • As the order of the chain increases, the amount of training data needed for accurate parameter estimation also increases
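
For a sense of that exponential growth, here is a back-of-the-envelope count (added for this transcript, not from the slides): a fixed k-order chain over the 20-letter amino-acid alphabet keeps one distribution per length-k context.

```python
# Free parameters of a fixed k-order Markov chain over a 20-letter
# amino-acid alphabet: 20^k contexts, each with 19 free probabilities.
for k in range(1, 6):
    print(f"k={k}: {20**k * 19:,} parameters")
```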

  6. Variable Length Markov Models (VMMs) • Efficient in space and in parameter estimation • A variable-length history sequence is used for prediction • Only the needed parameters are stored • Can be built from less training data [Figure: training sequences and a test sequence (>T1, AHGSGYMNAB); is T1 in the training set?]

  7. VMMs • P(sequence) is the product of the probabilities of each amino acid given those that precede it • The conditional probability is based on the context of each amino acid • A context function k(·) selects the history length to use from the context x1 … xi−1 of the symbol xi (see the sketch below) • VMMs were first introduced as PSTs
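
A minimal sketch of that product, assuming a learned table gamma(history, a) of conditional probabilities and a context function k that returns how much history to use (both names are hypothetical stand-ins for the trained model):

```python
def vmm_probability(seq, gamma, k):
    """P(x1..xm) = product over i of gamma_history(xi), where the context
    function k picks the history length to use at each position."""
    p = 1.0
    for i in range(len(seq)):
        length = min(k(seq[:i]), i)   # never use more history than is available
        history = seq[i - length:i]   # variable-length context ending at i-1
        p *= gamma(history, seq[i])
    return p
```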

  8. PSTs • VMMs for efficient prediction • Pruned during training to contain only required parameters • bPST: represents histories • ePST: represents sequences

  9. bPST • Used to represent the histories for prediction instead of the training sequences • The possible histories are the reversed strings of all the substrings of the training sequences

  10. Prediction with bPSTs • The conditional probabilities P(xi | xi−1, …, x1) are obtained for each position by tracing a path from the root that matches the preceding residues (read in reverse, since the bPST stores reversed histories)

  11. Construction bPST • Histories from the training data are added to the tree • Nodes hold parameters that estimate the conditional probabilities: γhistory(a) = P(a | history) • Prediction falls back to the longest history present in the tree (sketched below): PbPST(xi | xi−1, …, x1) = γx1…xi−1(xi) if x1…xi−1 is in the bPST; else γx2…xi−1(xi) if x2…xi−1 is in the bPST; and so on; else γ(xi) at the root
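
A minimal sketch of that fallback rule, assuming the pruned bPST is flattened into a dict mapping each retained context string to its γ distribution (a hypothetical representation; the root is the empty string, so the loop always terminates):

```python
def p_bpst(tree, history, a):
    """Return gamma_ctx(a) for the longest suffix of `history` kept in the tree."""
    for start in range(len(history) + 1):
        ctx = history[start:]   # x1..xi-1, then x2..xi-1, ..., then ""
        if ctx in tree:
            return tree[ctx][a]

# Toy tree with only the root and the context "0" retained:
tree = {"": {"0": 13/27, "1": 14/27}, "0": {"0": 5/13, "1": 8/13}}
print(p_bpst(tree, "010", "0"))  # context 010 absent -> falls back to "0": 5/13
```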

  12. bPST created and pruned using 010010010011110101100010111 (Brett Poulin) P(01001) = P(0)·P(1|0)·P(0|01)·P(0|010)·P(1|0100) = γ(0)·γ0(1)·γ01(0)·γ0(0)·γ00(1) = (13/27)(8/13)(5/8)(5/13)(4/5) = 10400/182520 ≈ 0.057 • The full contexts 010 and 0100 are not in the pruned tree, so their longest retained suffixes (0 and 00) supply the last two factors (checked in the script below)
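
The arithmetic can be checked directly against the training string; a small script (written for this transcript, not taken from the thesis) recomputes each γ from substring counts:

```python
from fractions import Fraction

s = "010010010011110101100010111"  # training string from the slide

def count(sub):
    """Overlapping occurrences of sub in s."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

def gamma(ctx, a):
    """Empirical P(a | ctx): occurrences of ctx followed by a, over
    occurrences of ctx followed by anything."""
    return Fraction(count(ctx + a), count(ctx + "0") + count(ctx + "1"))

p = (Fraction(count("0"), len(s))  # gamma(0)    = 13/27
     * gamma("0", "1")             # gamma_0(1)  = 8/13
     * gamma("01", "0")            # gamma_01(0) = 5/8
     * gamma("0", "0")             # gamma_0(0)  = 5/13
     * gamma("00", "1"))           # gamma_00(1) = 4/5
print(p, float(p))                 # 20/351, i.e. about 0.057
```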

  13. Complexity bPST • Building the bPST takes O(Ln²) time • L is the length limit (maximum depth) of the tree • n is the total length of the training set • Building requires all the training sequences at once (to obtain all the reversed substrings), so it cannot be done online, i.e., the bPST cannot be built as the training data is encountered • Prediction takes O(mL), where m is the length of the query sequence

  14. Improved bPST • Idea: build the tree from the training sequences themselves • n: total length of the training sequences • m: length of the query sequence • Theoretical result: linear-time building, O(n), and linear-time prediction, O(m)

  15. Efficient PST (ePST) • Used for predicting protein function • ePST represents sequences • Linear construction and prediction

  16. Example ePST [figure by Brett Poulin]

  17. Prediction with ePSTs • The probability of a substring is obtained position by position by tracing the path that spells the sequence from the root • If the entire sequence is not found in the tree, suffix links are followed (sketched below)
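
A rough sketch of that traversal, assuming an uncompressed suffix trie in which every node keeps a suffix link to the node for its string minus the first symbol (the details of the real ePST construction are in Poulin's thesis):

```python
class Node:
    def __init__(self):
        self.children = {}       # symbol -> child Node
        self.suffix_link = None  # node for this node's string minus its first symbol
        self.count = 0           # occurrences of this substring in the training data

def walk(root, seq):
    """Match seq symbol by symbol; on a mismatch, follow suffix links to
    shorten the matched context until the next symbol can be consumed."""
    node, matched = root, 0      # matched = length of the current context
    for a in seq:
        while node is not root and a not in node.children:
            node = node.suffix_link
            matched -= 1
        if a in node.children:
            node = node.children[a]
            matched += 1
        # else: symbol unseen in training; context is empty at the root
    return node, matched
```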

  18. Construction ePST • ePSTs gain efficiency by representing the training sequences themselves in the PST • Nodes store counts of the subsequence occurrences in the training data (with respect to the complete tree) • Conditional probabilities derived from these counts are stored as well (illustrated below)
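
To illustrate what those stored values mean, here is a naive flat-dict stand-in for the tree (illustration only, not the linear-time construction): each node's count is the number of occurrences of its substring, and a conditional probability is a child/parent count ratio.

```python
from collections import defaultdict

def substring_counts(training_seqs, max_len):
    """Occurrence counts of every substring up to max_len.
    This brute-force enumeration is O(n * L^2); the real ePST build is linear."""
    counts = defaultdict(int)
    for s in training_seqs:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                counts[s[i:j]] += 1
    return counts

counts = substring_counts(["AYYYA"], max_len=3)
print(counts["AYY"] / counts["AY"])  # P(next='Y' | context 'AY') = 1/1 = 1.0
```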

  19. Example ePST for AYYYA [figure by Brett Poulin]

  20. Complexity ePST • Linear time and space with respect to the combined length n of the training sequences: O(n) • Linear prediction time: O(m), where m is the query length

  21. Advantages and Disadvantages • PSTs avoid the exponential space requirements and parameter-estimation problems of higher-order Markov chains • They are pruned during training to contain only the required parameters • bPSTs make local predictions, which are more accurate than global prediction • There is some loss in classification performance on benchmarks such as Pfam and SCOP

  22. Conclusions • PSTs require less training and prediction time than HMMs • Despite some loss in classification performance, PSTs compete with HMMs because of their reduced resource demands • PSTs exploit the higher-order correlations that VMMs capture

  23. References • Brett Poulin. Sequence-based Protein Function Prediction. Master's thesis, University of Alberta, 2004. • G. Bejerano and G. Yona. Modeling protein families using probabilistic suffix trees. RECOMB '99. • G. Bejerano. Algorithms for variable length Markov chain modeling. Bioinformatics, 20(5):788–789, 2004.

  24. PSTs and HMMs • “HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions.” [1] • PSTs are variable length Markov models for efficient prediction. The prediction uses the longest available context matching the history of the current amino acid. • For protein prediction in general, “the main advantage of PSTs over HMMs is that the training and prediction time requirements of PSTs are much less than for the equivalent HMMs.” [1]

  25. Suffix Trees (ST) [figure by Brett Poulin]

  26. bPST • Histories are added to the tree only if they occur more frequently than a threshold Pmin • Substrings are added in order of length, from shortest to longest

  27. bPST vs ST • The string s is added to the tree only if the conditional probability at the node to be created exceeds the minimum prediction threshold (1 + α)·γmin, and that probability differs (by at least some ratio r) from the probability assigned by the next-shortest context suf(s), which is already in the tree • After all the substrings are added, the probabilities are smoothed with the parameter γmin • The smoothing, γs'(a) = (1 − |Σ|·γmin)·γs(a) + γmin, prevents any probability from being less than γmin (a sketch of both steps follows)
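
A sketch of both steps, following the Ron-Singer-Tishby / Bejerano-Yona conditions as stated above (parameter names are hypothetical):

```python
def keep_context(p_s, gamma_s_a, gamma_suf_a, p_min, gamma_min, alpha, r):
    """Retain context s if it is frequent enough, predicts symbol a with
    non-negligible probability, and meaningfully differs from suf(s)."""
    ratio = gamma_s_a / gamma_suf_a
    return (p_s >= p_min
            and gamma_s_a >= (1 + alpha) * gamma_min
            and (ratio >= r or ratio <= 1 / r))

def smooth(gamma, alphabet_size, gamma_min):
    """Shrink each distribution toward uniform so no probability falls
    below gamma_min (requires gamma_min < 1/alphabet_size); the result
    still sums to 1."""
    return {a: (1 - alphabet_size * gamma_min) * p + gamma_min
            for a, p in gamma.items()}
```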
