
Introduction to Probabilistic Sequence Models: Theory and Applications


Presentation Transcript


  1. David H. Ardell, Forskarassistent (Research Assistant). Introduction to Probabilistic Sequence Models: Theory and Applications

  2. Lecture Outline: Intro. to Probabilistic Sequence Models • Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions • Probabilistic Sequence Models: profiles, HMMs, SCFGs (stochastic context-free grammars)

  3. Consensus sequences revisited • Consensus sequences make poor summaries [figure: alignment columns over A, T, C, G collapsed to a single consensus letter per column]

  4. A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981) • The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins • [GA]x(4)GK[ST] • A variety of databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools to search proteins for matches to blocks.

  5. Introduction to Regular Expressions (Regexes) • Regular Expressions specify sets of sequences that match a pattern. • Ex: a[bc]a matches "aba" and "aca" • In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M repetitions): • Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba", etc. • As well as grouping constructions like character classes [xy], compound literals like (this)+, and logical relations, like | which means "or" in (this|that) • Anchors match the beginning ^ and end $ of strings
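These constructs can be tried directly in Python's `re` module; a minimal sketch, using the example patterns and strings from the slide:

```python
import re

# Character class: a[bc]a matches exactly "aba" and "aca"
assert re.fullmatch(r"a[bc]a", "aba") is not None
assert re.fullmatch(r"a[bc]a", "aca") is not None
assert re.fullmatch(r"a[bc]a", "ada") is None

# Quantifier *: a[bc]*a allows zero or more b/c between the a's
for s in ("aa", "aba", "acca", "acbcba"):
    assert re.fullmatch(r"a[bc]*a", s) is not None

# Alternation with grouping, plus anchors
assert re.search(r"^(this|that)$", "that") is not None
```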

  6. IUPAC DNA ambiguity codes as reg-ex classes • Pyrimidines Y = [CT] • PuRines R = [AG] • Strong S = [CG] • Weak W = [AT] • Keto K = [GT] • aMino M = [AC] • B = [CGT] (not A; B is the letter after A) • D = [AGT] (not C) • H = [ACT] (not G) • V = [ACG] (not T) • Any base N = [ACGT]
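The table above translates mechanically into regex character classes; a sketch in Python (the helper `iupac_to_regex` and the motif `TATAWAW` are illustrative, not from the lecture):

```python
import re

# IUPAC ambiguity codes rendered as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "R": "[AG]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate DNA motif into an equivalent regex."""
    return "".join(IUPAC[base] for base in motif)

pattern = iupac_to_regex("TATAWAW")
assert pattern == "TATA[AT]A[AT]"
assert re.search(pattern, "GGTATATATGG") is not None
```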

  7.-18. Regular Expressions are like machines that eat sequences one letter at a time. Ex: a[bc]+a matching "ghstuacbah". [Figure, repeated with one step per slide: a state machine Begin →a→ →[bc]→ →[bc]→ →a→ End, with a self-loop labeled [^a] on the start state and [^bc] failure transitions back to the start.] The machine consumes one letter per step: g, h, s, t, u are skipped by the [^a] self-loop; the first a advances the machine; c and b satisfy [bc]+; the next a reaches End. With only "h" left unread, the machine reports: MATCH!
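The walk on slides 7-18 is exactly what a regex engine performs; a one-line check with Python's `re` (a sketch):

```python
import re

# a[bc]+a scanned against "ghstuacbah": the engine skips g, h, s, t, u,
# anchors at the first 'a', consumes 'c' and 'b' via [bc]+, and finishes
# on the second 'a', leaving the trailing 'h' unread.
m = re.search(r"a[bc]+a", "ghstuacbah")
assert m is not None
assert m.group(0) == "acba"
assert m.span() == (5, 9)   # zero-based start and end offsets of the match
```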

  19. Motifs almost always lack either sensitivity or specificity • The first-described and most prominent example is the P-loop that binds phosphate in ATP/GTP-binding proteins • [GA]x(4)GK[ST] • Prob. of this motif ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025 • Expected number of matches in a database with 3.2 × 10^8 residues: about 8000! • About half of the proteins that match this motif are not NTPases of the P-loop class (lack of specificity).
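The arithmetic on this slide can be checked directly; a sketch assuming uniform residue frequencies of 1/20, as the slide does:

```python
# Match probability of [GA]-x(4)-G-K-[ST] at a random position:
# [GA] and [ST] each match 2 of 20 residues, G and K each match 1 of 20,
# and the four x positions match anything (probability 1).
p_motif = (2/20) * (1/20) * (1/20) * (2/20)
assert abs(p_motif - 0.000025) < 1e-12

# Expected number of matches in a database of 3.2e8 residues:
expected = 3.2e8 * p_motif
assert round(expected) == 8000
```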

  20. Motifs almost always lack either sensitivity or specificity • [GA]x(4)GK[ST] • Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity) • Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity

  21. A better way to model motifs • REGULAR EXPRESSIONS • “(TTR[ATC]WT) N{15,22} (TRWWAT)” • Can find alternative members of a class • Treat alternative character states as equally likely. • Treat all spacer lengths as equally likely. • PROFILES (Position-Specific Score Matrices)

  22. Profiles turn alignments into probabilistic models

  23. A graphical view of the same profile, built from the aligned sequences CCGTL…, CGHSV…, GCGSL…, CGGTL…, CCGSS… [figure: stacked per-column residue frequencies derived from these sequences]

  24. You can also allow for unobserved residues or bases in a profile by giving them small probabilities. [figure: an A/T/G/C profile in which characters not seen in the alignment receive small nonzero probabilities]
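One common way to assign those small probabilities is a pseudocount added to every observed count; a sketch in Python (the toy DNA alignment and the `build_profile` helper are illustrative, not from the lecture):

```python
# Build a profile (per-column probability distributions) from an alignment,
# adding a pseudocount so unobserved bases get a small nonzero probability.
ALPHABET = "ACGT"

def build_profile(alignment, pseudocount=1.0):
    length = len(alignment[0])
    profile = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({b: (column.count(b) + pseudocount) / total
                        for b in ALPHABET})
    return profile

profile = build_profile(["AATT", "AGTT", "AATC", "AATT"])
# Every base now has nonzero probability in every column:
assert all(p > 0 for col in profile for p in col.values())
# Each column's probabilities still sum to 1:
assert all(abs(sum(col.values()) - 1.0) < 1e-9 for col in profile)
```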

  25. The probability that a sequence matches a profile P is the product of its parts. [figure: five-column profile P; the entries used below are p1(A) = 0.8, p2(A) = 0.7, p3(G) = 0.8, p4(C) = 0.7, p5(T) = 0.6] Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.19
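The product on this slide, computed directly (a sketch using only the profile entries quoted above):

```python
from functools import reduce

# Per-position profile probabilities for the sequence AAGCT:
# p1(A)=0.8, p2(A)=0.7, p3(G)=0.8, p4(C)=0.7, p5(T)=0.6
probs = [0.8, 0.7, 0.8, 0.7, 0.6]
p_seq = reduce(lambda x, y: x * y, probs)
assert abs(p_seq - 0.18816) < 1e-9   # ≈ 0.19
```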

  26. In practice, we compare this probability to that of matching a null model. [figure: the profile from slide 25 beside a null model that uses the same A/G/T/C distribution at every position]

  27. The null model is usually based on a composition: A 0.25, G 0.25, T 0.25, C 0.25 at every position. No positional information need be taken into account.

  28. Example: probabilities of AAGCT under the two models • profile: p ≈ 0.19 • null: p = 0.25^5 ≈ 0.00098

  29. Example: odds ratio of AAGCT under the two models • profile: p ≈ 0.19 • null: p = 0.25^5 ≈ 0.00098 • The odds ratio is 0.18816 / 0.00098 ≈ 193. It is about 190 times more likely that AAGCT matches the profile than the null model!

  30. As with substitution scoring matrices, we prefer the log-odds as a profile score: score = log( p(x | P) / p(x | null) ). A positive log-odds (score) indicates a match.
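Putting slides 25, 27, and 29 together: the odds ratio and its log-odds score, sketched in Python. The base-2 logarithm is an assumption, consistent with the bit scores mentioned on the next slide; the lecture does not fix the base here.

```python
import math

p_profile = 0.8 * 0.7 * 0.8 * 0.7 * 0.6   # ≈ 0.19, from the profile
p_null = 0.25 ** 5                         # ≈ 0.00098, uniform null model

odds = p_profile / p_null                  # ≈ 193
score = math.log2(odds)                    # log-odds in bits

assert 190 < odds < 195
assert score > 0                           # positive score => match
```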

  31. Digression: interpreting BLAST results The bit score is a scaled log-odds of homology versus chance

  32. Digression: interpreting BLAST results. The E-value is the expected number of chance hits with score at least S.

  33. A better way to model motifs • REGULAR EXPRESSIONS • “(TTR[ATC]WT) N{15,22} (TRWWAT)” • Can find alternative members of a class • Treat alternative character states as equally likely. • Treat all spacer lengths as equally likely. • PROFILES (Position-Specific Score Matrices) • Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution. • Explicit accounting of observed character states • Cannot handle gaps (separate models must be made for different spacer lengths -- O’Neill and Chiafari 1989) • Can't be used to make alignments

  34. Hidden Markov Models • A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model • The same symbols can put the machine in different states (A, C, T, G can be in a promoter, a codon, a terminator, etc.), so we say the states are “hidden” • Example: The Dice Factory. [figure: two-state HMM] FAIR die: P(1) = … = P(6) = 1/6. BIASED die: P(1) = 3/6, P(2) = … = P(6) = 1/10. Transitions: stay FAIR 0.99, FAIR→BIASED 0.01; stay BIASED 0.70, BIASED→FAIR 0.30. GENERATED: ...11452161621233453261432152211121611112211... PREDICTED: the hidden (FAIR/BIASED) state at each roll
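The dice factory can be sketched as a generator in Python. The emission probabilities follow the slide; the assignment of the transition numbers (FAIR keeping 0.99, BIASED keeping 0.70) is one reading of the figure and is an assumption.

```python
import random

# Two-state HMM that emits die rolls (the "dice factory").
EMIT = {
    "FAIR":   {face: 1/6 for face in range(1, 7)},
    "BIASED": {1: 1/2, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/10},
}
STAY = {"FAIR": 0.99, "BIASED": 0.70}   # probability of keeping the same state

def generate(n, state="FAIR", seed=0):
    """Emit n rolls plus the hidden state sequence that produced them."""
    rng = random.Random(seed)
    rolls, states = [], []
    for _ in range(n):
        faces, weights = zip(*EMIT[state].items())
        rolls.append(rng.choices(faces, weights)[0])
        states.append(state)
        if rng.random() > STAY[state]:          # switch hidden state
            state = "BIASED" if state == "FAIR" else "FAIR"
    return rolls, states

rolls, states = generate(1000)
assert len(rolls) == 1000
assert set(rolls) <= set(range(1, 7))
```

An observer sees only `rolls`; recovering `states` from them is the decoding problem HMM algorithms such as Viterbi solve.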

  35.-38. A Profile HMM is a profile with gaps. [figures, built up over slides 35-38: the A/T/G/C profile from earlier, first annotated with insert states (insertions between profile positions), then with delete states (profile positions that may be skipped), then with both]

  39. The HMMer Null Model: A 0.25, G 0.25, T 0.25, C 0.25 (the composition of insertions may be set by the user, e.g. to match genome composition)

  40. The Plan 7 architecture in HMMer • Permits local matches to the sequence • Permits local matches to the model • Permits repeated matches to the sequence

  41. HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)

  42. The HMMer2 design separates models from algorithms • With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do: • Multihit Global alignments of model to sequence • Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed) • Single (best) hit variants of both of the above.

  43. This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer): • hmmalign: Align sequences to an existing model. • hmmbuild: Build a model from a multiple sequence alignment. • hmmcalibrate: Takes an HMM and empirically determines parameters that are used to make searches more sensitive, by calculating more accurate expectation value scores (E-values). • hmmconvert: Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles. • hmmemit: Emit sequences probabilistically from a profile HMM. • hmmfetch: Get a single model from an HMM database. • hmmindex: Index an HMM database. • hmmpfam: Search an HMM database for matches to a query sequence. • hmmsearch: Search a sequence database for matches to an HMM.

  44. HMMer2 format can be automatically converted for use with SAM
