## Uncovering Sequences Mysteries With Hidden Markov Model


**Uncovering Sequences Mysteries With Hidden Markov Model**

Cédric Notredame

**Our Scope**

- Look once under the hood
- Understand the principle of HMMs
- Understand HOW HMMs are used in biology

**Outline**

- Reminder of Bayesian probabilities
- HMMs and Markov chains
- Application to gene prediction
- Application to Tm predictions
- Application to domain/protein family prediction
- Future applications

**Conditional Probabilities And Bayes Theorem**

"I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."

Richard Price, introducing the essay of Bayes

**What is a Probabilistic Model ?**

Dice = Probabilistic Model

- Each possible outcome has a probability (1/6)
- Biological questions:
  - What kind of dice would generate coding DNA?
  - Non-coding?

**Which Parameters ?**

Dice = Probabilistic Model

Parameters: probability of each outcome

- A priori estimation: 1/6 for each number

OR

- Through observation: measure frequencies on a large number of events

**Which Parameters ?**

Parameters: probability of each outcome

Model: Intra/Extra Protein

1. Make a set of Inside Proteins using annotation
2. Make a set of Outside Proteins using annotation
3. COUNT frequencies on the two sets

Model accuracy depends on the training set.

**Maximum Likelihood Models**

Model: Intra/Extra Proteins

1. Make training set
2. Count frequencies

Model accuracy depends on the training set.

Maximum Likelihood Model: model probability MAXIMISES data probability.

**Maximum Likelihood Models**

Model: Intra/Extra-Cell Proteins

Maximum Likelihood Model: model probability MAXIMISES data probability AND data probability MAXIMISES
model probability. $P(\text{Model} \mid \text{Data})$ is maximised. The bar (¦, typeset here as $\mid$) means GIVEN!

**Maximum Likelihood Models**

Model: Intra/Extra-Cell Proteins

Maximum Likelihood Model:

- $P(\text{Model} \mid \text{Data})$ is maximised
- $P(\text{Data} \mid \text{Model})$ is maximised

Model probability MAXIMISES data probability AND data probability MAXIMISES model probability.

**Maximum Likelihood Models**

Model: Intra/Extra-Cell Proteins

Maximum Likelihood Model: $P(\text{Coin} \mid \text{Data}) < P(\text{Dice} \mid \text{Data})$

Data: 11121112221212122121112221112121112211111

**Conditional Probabilities**

$P(\text{Win Lottery} \mid \text{Participation})$

The probability that something happens IF something else ALSO happens.

**Conditional Probability**

The probability that something happens IF something else ALSO happens.

- Dice 1: $P(6 \mid \text{Dice 1}) = 1/6$
- Dice 2: $P(6 \mid \text{Dice 2}) = 1/2$ (loaded!)

**Joint Probability**

The probability that something happens AND something else ALSO happens. In the notation, the comma stands for AND.

$P(6 \mid D_1) = 1/6$, $P(6 \mid D_2) = 1/2$

$$P(6, D_2) = P(6 \mid D_2) \cdot P(D_2) = \frac{1}{2} \cdot \frac{1}{100}$$

**Joint Probability**

Question: What is the probability of making a 6, given that the loaded dice ($D_L$) is used 1% of the time and the fair dice ($D_F$) the rest of the time?

$$P(6 \mid D_F \text{ and } D_L) = P(6, D_F) + P(6, D_L) = P(6 \mid D_F) \cdot P(D_F) + P(6 \mid D_L) \cdot P(D_L) = \frac{1}{6} \cdot 0.99 + \frac{1}{2} \cdot 0.01 \approx 0.170$$

(about 0.167 for an unloaded dice)

**Joint Probability**

Unsuspected heterogeneity in the training set leads to inaccurate parameter estimation.

$$P(6 \mid D_F \text{ and } D_L) = P(6, D_F) + P(6, D_L) = P(6 \mid D_F) \cdot P(D_F) + P(6 \mid D_L) \cdot P(D_L) = \frac{1}{6} \cdot 0.99 + \frac{1}{2} \cdot 0.01 \approx 0.170$$

(about 0.167 for an unloaded dice)

**Bayes Theorem**

$$P(X_i \mid Y) = \frac{P(Y \mid X_i) \cdot P(X_i)}{\sum_j P(Y \mid X_j) \cdot P(X_j)}$$

- X: Model or Data or any Event
- Y: Model or Data or any Event

**Bayes Theorem**

$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{P(Y \mid X) \cdot P(X) + P(Y \mid \bar{X}) \cdot P(\bar{X})}$$

The denominator is $P(Y, X) + P(Y, \bar{X}) = P(Y)$, with $X_T = X + \bar{X}$ (X or not X).

- X: Model or Data or any Event
- Y: Model or Data or any Event

**Bayes Theorem**

$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{P(Y)}$$

- Numerator: probability of observing Y AND X simultaneously
- Left-hand side: probability of observing X IF Y is fulfilled
- 'Remove' $P(Y)$ to get $P(X \mid Y)$
- X: Model or Data or any Event
- Y: Model or Data or any Event

**Bayes Theorem**

- X: Model or Data or any Event
- Y: Model or Data or any Event

Probability of observing Y and X
simultaneously: $P(X, Y)$. Probability of observing X IF Y is fulfilled: $P(X \mid Y)$. 'Remove' $P(Y)$ to get $P(X \mid Y)$:

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

**Using Bayes Theorem**

Question: The dice gave three 6s in a row. IS IT LOADED?

We will use Bayes Theorem to test our belief: if the dice was loaded (model), what would be the probability of this model given the data (three 6s in a row)?

**Using Bayes Theorem**

Question: The dice gave three 6s in a row. IS IT LOADED?

Occasionally dishonest casino:

- $P(D_1) = 0.99$, $P(D_2) = 0.01$
- $P(6 \mid D_1) = 1/6$, $P(6 \mid D_2) = 1/2$

**Using Bayes Theorem**

Question: The dice gave three 6s in a row. IS IT LOADED?

$P(D_1) = 0.99$, $P(D_2) = 0.01$, $P(6 \mid D_1) = 1/6$, $P(6 \mid D_2) = 1/2$

$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{P(Y)}$$

With Y: $6^3$ (three 6s in a row) and X: $D_2$:

$$P(D_2 \mid 6^3) = \frac{P(6^3 \mid D_2) \cdot P(D_2)}{P(6^3 \mid D_1) \cdot P(D_1) + P(6^3 \mid D_2) \cdot P(D_2)}$$

In the denominator, the first term is $6^3$ with $D_1$ and the second is $6^3$ with $D_2$.

**Using Bayes Theorem**

Question: The dice gave three 6s in a row. IS IT LOADED?

$P(D_1) = 0.99$, $P(D_2) = 0.01$, $P(6 \mid D_1) = 1/6$, $P(6 \mid D_2) = 1/2$

$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

$$P(D_2 \mid 6^3) = \frac{P(6^3 \mid D_2) \cdot P(D_2)}{P(6^3 \mid D_1) \cdot P(D_1) + P(6^3 \mid D_2) \cdot P(D_2)} = 0.21$$

Probably NOT.

**Posterior Probability**

Question: The dice gave three 6s in a row. IS IT LOADED?

$$P(D_2 \mid 6^3) = \frac{P(6^3 \mid D_2) \cdot P(D_2)}{P(6^3 \mid D_1) \cdot P(D_1) + P(6^3 \mid D_2) \cdot P(D_2)} = 0.21$$

- 0.21 is a posterior probability: it was estimated AFTER the data was obtained
- $P(6^3 \mid D_2)$ is the likelihood of the hypothesis

**Debunking Headlines**

50% of the crimes are committed by Migrants.

Question: Are 50% of the Migrants Criminals?

$P(\text{Migrant}) = 0.1$, $P(\text{Criminal}) = 0.0001$, $P(M \mid C) = 0.5$

$$P(C \mid M) = \frac{P(M \mid C) \cdot P(C)}{P(M)} = \frac{0.5 \times 0.0001}{0.1} = 0.0005$$

NO: only 0.05% of Migrants are Criminals (NOT 50%!)

**Debunking Headlines**

$P(T) = 0.1$, $P(P) = 0.0001$, $P(T \mid P) = 0.5$, where T is a TATA box and P is a promoter.

$$P(P \mid T) = \frac{P(T \mid P) \cdot P(P)}{P(T)} = \frac{0.5 \times 0.0001}{0.1} = 0.0005$$

50% of Gene Promoters contain TATA.
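The posterior calculations for the dice and for TATA above can be checked with a few lines of Python. This is a minimal sketch using the numbers quoted on the slides; the function name `posterior` and the dictionary layout are illustrative choices, not something the presentation defines.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Return P(hypothesis | data) for every hypothesis (Bayes theorem).

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(data | hypothesis)
    """
    # P(data) is the sum of the joint probabilities P(data, hypothesis).
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}


# Occasionally dishonest casino: the data are three 6s in a row.
prior = {"fair": Fraction(99, 100), "loaded": Fraction(1, 100)}
likelihood = {"fair": Fraction(1, 6) ** 3, "loaded": Fraction(1, 2) ** 3}
dice = posterior(prior, likelihood)
print(float(dice["loaded"]))  # about 0.21

# TATA example: P(T) is given directly, so no sum over hypotheses is needed.
p_t = Fraction(1, 10)              # P(TATA)
p_p = Fraction(1, 10000)           # P(promoter)
p_t_given_p = Fraction(1, 2)       # P(TATA | promoter)
p_p_given_t = p_t_given_p * p_p / p_t
print(float(p_p_given_t))  # 0.0005
```

Exact fractions are used so that the results can be compared with the slide values without rounding noise; plain floats would work just as well for this illustration.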
Question: Is TATA a good gene predictor? NO.

**Bayes Theorem**

TATA = high sensitivity / low specificity

Bayes Theorem reveals the trade-off between:

- Sensitivity: finding ALL the genes
- Specificity: finding ONLY genes

**What is a Markov Chain ?**

Simple chain: one dice

- Each roll is the same
- A roll does not depend on the previous one

Markov chain: two dice

- You only use ONE dice at a time: the fair OR the loaded
- The dice you roll only depends on the previous roll

**What is a Markov Chain ?**

Biological sequences tend to behave like Markov chains.

Question/Example: Is it possible to tell whether my sequence is a CpG island?

**What is a Markov Chain ?**

Question: Identify CpG island sequences

Old-fashioned solution:

- Slide a window of size: Captain's Height/π
- Measure the % of CpG
- Plot it against the sequence
- Decide

**Sliding Window Methods**

[Figure panels: Sliding Window; Average Sliding Window]

**What is a Markov Chain ?**

Question: Identify CpG island sequences

Bayesian solution:

- Make a CpG Markov chain
- Run the sequence through the chain
- What is the likelihood for the chain to produce the sequence?

**Transition**

[Diagram: states A, C, G, T linked by transition probabilities]

Probability of transition from G to C:

$$A_{GC} = P(X_i = C \mid X_{i-1} = G)$$

**$P(\text{sequence}) = P(X_L, X_{L-1}, X_{L-2}, \ldots, X_1)$**

Remember: $P(X, Y) = P(X \mid Y) \cdot P(Y)$

In the Markov chain, $X_L$ only depends on $X_{L-1}$:

$$P(\text{sequence}) = P(X_L \mid X_{L-1}) \cdot P(X_{L-1} \mid X_{L-2}) \cdots P(X_1)$$

**$A_{GC} = P(X_i = C \mid X_{i-1} = G)$**

$$P(\text{sequence}) = P(X_L \mid X_{L-1}) \cdot P(X_{L-1} \mid X_{L-2}) \cdots P(X_1)$$

$$P(\text{sequence}) = P(x_1) \prod_{i=2}^{L} A_{x_{i-1} x_i}$$

[Diagram: chain over states A, C, G, T with a Begin state B]

Arbitrary Beginning and End states can be added to the chain.
By convention, only the Beginning state is added.

[Diagram: chain over states A, C, G, T with a Begin state B and an End state E]

Adding an End state with a transition probability $\tau$ defines length probabilities:

$$P(\text{all the sequences of length } L) = \tau (1 - \tau)^{L-1}$$

[Diagram: chain over states A, C, G, T with a Begin state B and an End state E]

The transitions are probabilities. The sum of the probabilities of all the possible sequences of all possible lengths is 1.

**Using Markov Chains To Predict**

**What is a Prediction**

Given a sequence, we want to know the probability that this sequence is a CpG island.

1. We need a training set:
   - CpG+ sequences
   - CpG- sequences
2. We will measure the transition frequencies, and treat them like probabilities

**What is a Prediction**

Is my sequence a CpG island?

2. We will measure the transition frequencies, and treat them like probabilities

Transition GC: G followed by a C, as in GCCGCTGCGCGA.

The probability is the ratio between the number of GC transitions and the number of all transitions G->X:

$$A^{+}_{GC} = \frac{N^{+}_{GC}}{\sum_{X} N^{+}_{GX}}$$

**What is a Prediction**

Is my sequence a CpG island?

2. We will measure the transition frequencies, and treat them like probabilities

| + | A | C | G | T |
|---|---|---|---|---|
| A | 0.18 | 0.17 | 0.16 | 0.08 |
| C | 0.27 | 0.36 | 0.33 | 0.35 |
| G | 0.42 | 0.27 | 0.37 | 0.38 |
| T | 0.12 | 0.18 | 0.12 | 0.18 |

| - | A | C | G | T |
|---|---|---|---|---|
| A | 0.30 | 0.32 | 0.25 | 0.17 |
| C | 0.21 | 0.30 | 0.25 | 0.24 |
| G | 0.28 | 0.08 | 0.30 | 0.29 |
| T | 0.21 | 0.30 | 0.20 | 0.29 |

**What is a Prediction**

Is my sequence a CpG island?

3. Evaluate the probability for each of these models (the + and - transition tables above) to generate our sequence, with $x_0$ the Begin state:

$$P(\text{seq} \mid M^{+}) = \prod_{i=1}^{L} A^{+}_{x_{i-1} x_i} \qquad P(\text{seq} \mid M^{-}) = \prod_{i=1}^{L} A^{-}_{x_{i-1} x_i}$$
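Step 3 can be sketched in Python with the two transition tables from the slides. This is a minimal illustration under stated assumptions, not the author's implementation. Because each column of the printed tables sums to roughly 1, the code assumes that columns index the previous nucleotide and rows the next one. The slides give no Begin-state probabilities, so the first nucleotide contributes nothing to the score, which amounts to assuming it is equally likely under both models. The names `parse_table`, `log_likelihood` and `log_odds` are hypothetical, and comparing the models through a log-odds difference is a convenience chosen here rather than something the slides prescribe.

```python
import math

NUCLEOTIDES = "ACGT"


def parse_table(rows):
    """Build table[prev][next] from rows printed as on the slides.

    Assumption: each printed row is the NEXT nucleotide and each
    column is the PREVIOUS nucleotide (the columns sum to about 1).
    """
    table = {prev: {} for prev in NUCLEOTIDES}
    for nxt, values in zip(NUCLEOTIDES, rows):
        for prev, prob in zip(NUCLEOTIDES, values):
            table[prev][nxt] = prob
    return table


# CpG+ model (values copied from the "+" table).
PLUS = parse_table([
    [0.18, 0.17, 0.16, 0.08],
    [0.27, 0.36, 0.33, 0.35],
    [0.42, 0.27, 0.37, 0.38],
    [0.12, 0.18, 0.12, 0.18],
])

# CpG- model (values copied from the "-" table).
MINUS = parse_table([
    [0.30, 0.32, 0.25, 0.17],
    [0.21, 0.30, 0.25, 0.24],
    [0.28, 0.08, 0.30, 0.29],
    [0.21, 0.30, 0.20, 0.29],
])


def log_likelihood(seq, table):
    """Sum of log transition probabilities over consecutive pairs.

    The first nucleotide is not scored (no Begin probabilities given).
    """
    return sum(math.log(table[a][b]) for a, b in zip(seq, seq[1:]))


def log_odds(seq):
    """Positive when the CpG+ chain explains seq better than CpG-."""
    return log_likelihood(seq, PLUS) - log_likelihood(seq, MINUS)


print(log_odds("GCCGCTGCGCGA"))  # example sequence from the slides
print(log_odds("ATTTAAATATTA"))  # made-up AT-rich sequence for contrast
```

Working with sums of logarithms instead of the raw product avoids numerical underflow on long sequences while leaving the comparison between the two models unchanged.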