
Probabilistic Models



Presentation Transcript


  1. Probabilistic Models. Consider our FMR triplet repeat finite state automaton. [Figure: the automaton, with start state S, states 1 through 8, an end state e, and transitions labeled a, g, c, g, c, g, g, c, t, g.] While this finite state automaton is non-deterministic, it is NOT probabilistic. Sequences under test are either elements of the set of “accepted” sequences, or they are not (“rejected”). What makes a model of DNA or protein probabilistic?
  2. Probabilistic Models. Consider our FMR triplet repeat finite state automaton. [Figure: the same automaton, with states S, 1 through 8, and e.] The key property of a probabilistic model is that the model must define a probability distribution over the set of possible sequences. Um, OK. What the heck does that actually mean? Let’s review some more basic concepts of probability.
  3. Probability Distributions. First, a formal description… A probability distribution Pr{} on a sample space S is a mapping from “events” of S to the real numbers such that the following axioms are satisfied: (1) Pr{A} ≥ 0 for any event A (i.e. the probability of event A); (2) Pr{S} = 1 (S here is the “certain event”); (3) Pr{A ∪ B} = Pr{A} + Pr{B} for mutually exclusive events A and B. An “event” is just some subset of the sample space.
  4. Probability Distributions. Sample spaces and events can conveniently be conceptualized as a Venn diagram. [Figure: sample space S containing two non-overlapping events A and B.] Mutually exclusive events A and B clearly don’t overlap (their intersection is empty, A ∩ B = ∅), and the probability of their union is the sum of their individual probabilities: Pr{A ∪ B} = Pr{A} + Pr{B}.
  5. Probability Distributions. Now an example. Consider the following sample space S: S = {AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT}. Each element s in S (s ∈ S) here is a dinucleotide; the sample space is the set of all possible dinucleotides. Pr{∪i Ai} = Σi Pr{Ai}. This formula is just a generalization of our axiom Pr{A ∪ B} = Pr{A} + Pr{B} for mutually exclusive events.
  6. Probability Distributions. [Figure: the 16 dinucleotides AA … TT drawn as a 4×4 grid of equal-sized boxes filling the sample space.] S is represented by the entire large box; we make the “normalizing assumption” that its area is 1 in order to satisfy axiom 2. In this distribution there are 16 possible mutually exclusive elemental events, which together “paint” the entire sample space S without overlap. This distribution has the additional property of being uniform.
  7. Probability Distributions. [Figure: the same 4×4 grid, with event A = the dinucleotides that contain a “C” and event B = those with no “C”.] Events need not always be elemental. Here mutually exclusive events A and B are shown, and the axioms (as they must) are still satisfied.
  8. Discrete Probability Distributions. Many of the probability distributions we will be considering have the additional property of being discrete*. Consider, for example, that for DNA strings of some definite finite length n, there are only 4^n possibilities to consider. Pr{A} = Σ_{s ∈ A} Pr{s}. A probability distribution is discrete if it is defined over a finite or “countably infinite” sample space. (* discrete, as in separate, distinct, individual. NOT discreet, as in “I won’t tell your wife what you did last Friday”.)
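As a concrete sketch of that summation, here is a small Python check over the dinucleotide sample space from the earlier slides (the uniform 1/16 probabilities and the “contains C” event are the examples already used above; the variable names are illustrative):

    from itertools import product

    # Sample space S: all 16 dinucleotides, each given the uniform probability 1/16
    S = ["".join(p) for p in product("ACGT", repeat=2)]
    pr = {s: 1 / 16 for s in S}

    # Event A: dinucleotides containing at least one "C" (as in the Venn-diagram slide)
    A = {s for s in S if "C" in s}

    # Axiom 2: Pr{S} = 1; and Pr{A} = sum of the elemental event probabilities in A
    print(sum(pr.values()))          # 1.0
    print(sum(pr[s] for s in A))     # 0.4375 (7 of the 16 dinucleotides contain a C)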
  9. Unconditional Probability. Consider statements of the form: “The chance of snow is 60%” or “The test is 99% accurate”. These are unconditional statements, so we can say: Pr{H} ≡ total probability of event H (over the set S of all possible events). Using intuitive notions of probability, we can express this as the count of events where H occurred, over the total number of possible events in S.
  10. Unconditional Probability. Statements of this kind can conveniently be conceptualized as a Venn diagram. [Figure: sample space S with event H as a region inside it.] Using intuitive notions of probability, we can express this as the count of events where H occurred, over the total number of possible events in S: Pr{H} = |H| / |S|.
  11. Conditional Probability. More often than not, we wish to express probabilities conditionally, i.e. we specify the assumptions or conditions under which the probability was measured. Conditional statements of probability take the form: Pr{H|O} ≡ the probability that event H occurred in the subset of cases where event O also occurred.
  12. Conditional Probability. Again, this can be represented as a Venn diagram. [Figure: sample space S containing overlapping events H and O.] The intersection H ∩ O represents events where both H and O occurred.
  13. Conditional Probability. Note that in the “universe of possibilities”, O has effectively replaced S. Our probability for event H is now conditional on the assumption that event O has already taken place. We can express the idea shown in the Venn diagram as: Pr{H|O} = Pr{H ∩ O} / Pr{O}.
  14. Joint Probability. Let’s discuss the idea of joint probability, which expresses the idea that both events O and H occur. A common alternative notation: P(H,O) = Pr{H ∩ O}. In joint probability, the intersection of events is expressed relative to the entire sample space S: Pr{H ∩ O} = |H ∩ O| / |S|.
  15. Joint Probability. Again, joint probability means that both events O and H occur. But wait… these new terms look awfully familiar! There’s no reason we can’t rewrite this as: P(H,O) = P(H|O) · P(O).
  16. Conditional & Joint Probability. In other words, we can express joint probability in terms of the separate conditional and unconditional probabilities! This key result turns out to be exceedingly useful!
  17. Conditional Probability. It gets even better! Dig it! The intersection operator makes no assertion regarding order (H ∩ O = O ∩ H), so we can express everything purely in terms of reciprocal conditional and unconditional probabilities: P(H|O) · P(O) = P(O|H) · P(H). We’ll have a lot of fun* with this later! (* So. Much. Fun.)
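Since H ∩ O = O ∩ H, the two factorizations of the joint probability must agree. A toy numeric check in Python, with counts invented purely for illustration (they do not come from the slides):

    # Hypothetical counts over a sample space of 100 equally likely outcomes
    n_S, n_H, n_O, n_HO = 100, 30, 20, 6      # n_HO = outcomes where both H and O occur

    p_H, p_O, p_HO = n_H / n_S, n_O / n_S, n_HO / n_S
    p_H_given_O = n_HO / n_O                  # restrict the "universe" to O
    p_O_given_H = n_HO / n_H                  # restrict the "universe" to H

    # Both factorizations recover the same joint probability
    print(p_H_given_O * p_O)    # 0.06
    print(p_O_given_H * p_H)    # 0.06
    print(p_HO)                 # 0.06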
  18. Sequences Considered Probabilistically. What is the probability of some sequence of events x? If it was generated from a probabilistic model, it can always be written: P(x) = P(xL, xL-1, …, x1). In words: the probability of observing the sequence of states x is the probability that the Lth state (xL) was whatever it was, AND the (L-1)th state was whatever else, AND so on until we reach the first position. This is familiar as a statement of joint probability. But how do we proceed?
  19. Sequences Considered Probabilistically. What is the probability of a chain of events x? We could try repeatedly applying our rules of conditional probability: P(x) = P(xL, xL-1, …, x1) = P(xL | xL-1, …, x1) P(xL-1 | xL-2, …, x1) … P(x1). Remember our result P(X,Y) = P(X|Y) · P(Y): the probability of events X AND Y happening is equal to the probability of X happening given that Y has already happened, times the probability of event Y.
  20. Sequences Considered Probabilistically. What is the probability of a chain of events x? We could try repeatedly applying our rules of conditional probability: P(x) = P(xL, xL-1, …, x1) = P(xL | xL-1, …, x1) P(xL-1 | xL-2, …, x1) … P(x1). This is still incredibly yucky, but at least we now have a separate probability term for each position in our sequence. Is this the best we can do? Let’s digress for a moment…
  21. Markov Chains. A traffic light considered as a sequence of states. [Figure: three states (red, green, yellow) in a cycle, with Prg = 1, Pgy = 1, Pyr = 1.] This is a trivial Markov chain: the transition probability between the states is always 1.
  22. Markov Chains. A traffic light considered as a sequence of states. If we watch our traffic light, it will emit a string of states. In the case of a simple Markov model, the state labels (e.g. green, red, yellow) are the observable outputs of the process.
  23. Markov Chains. An occasionally malfunctioning traffic light!! [Figure: the same three states, now with Pgy = 1, Prg = 0.85, Pry = 0.15, Pyr = 0.9, Pyg = 0.10.] The Markov property is that the probability of observing a given future state next depends only on the current state! For this reason we say that a Markov chain is a memoryless process.
  24. Markov Chains. The Markov Property: ast = P(xi = t | xi-1 = s). English translation: the transition probability ast from state s to state t is equal to the probability that the ith state was t, given that the immediately preceding state (xi-1) was s. You should recognize this as a statement of conditional probability!
  25. Markov Chains. The Markov Property: ast = P(xi = t | xi-1 = s). Another way to look at this is to say that the conditional probability distribution for the system’s next step depends only on the current state, not on any prior state or states. There is no xi-2 in this equation!
  26. Markov Chains. An occasionally malfunctioning traffic light!! If we know the transition probabilities, we may already feel intuitively that some outcomes are more likely to have been produced by our model than others… But can we calculate the probability of an observed sequence?
  27. Markov Chains. Can they help simplify our statement of probability for a sequence? P(x) = P(xL | xL-1, …, x1) P(xL-1 | xL-2, …, x1) … P(x1). Remember, the key property of a Markov chain is that the probability of symbol xi depends ONLY on the value of the preceding symbol xi-1!! Therefore: P(x) = P(xL | xL-1) P(xL-1 | xL-2) … P(x2 | x1) P(x1) = P(x1) Π_{i=2..L} a_{xi-1,xi}.
  28. Markov Chains. Scoring an arbitrary sequence: if we know the transition probabilities, our formula tells us exactly how to calculate the probability of a sequence of unknown origin: P(x) = P(x1) Π_{i=2..L} a_{xi-1,xi}.
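A minimal Python sketch of that scoring formula, reusing the non-uniform transition dict shown later on the “Alternative python transition dicts” slide. The uniform P(x1) = 0.25 for the first symbol and the function name sequence_probability are assumptions made for illustration:

    transitions = {
        "A": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
        "C": {"A": 0.40, "C": 0.20, "G": 0.25, "T": 0.15},
        "G": {"A": 0.20, "C": 0.20, "G": 0.30, "T": 0.30},
        "T": {"A": 0.40, "C": 0.15, "G": 0.25, "T": 0.20},
    }

    def sequence_probability(x, transitions, p_first=0.25):
        # P(x) = P(x1) * product over i = 2..L of a[x_{i-1}][x_i]
        p = p_first
        for prev, curr in zip(x, x[1:]):
            p *= transitions[prev][curr]
        return p

    print(sequence_probability("ACGCT", transitions))   # 0.000375 with these numbers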
  29. Markov Chains. How about nucleic acid sequences? [Figure: four fully connected states A, C, G, T.] There is no reason why nucleic acid sequences found in an organism cannot be modeled using Markov chains.
  30. Markov Models. What do we need to probabilistically model DNA sequences? [Figure: the four states A, C, G, T with transition probabilities labeled on the edges.] The states are the same for all organisms, so the transition probabilities are the model parameters (θ) we need to estimate.
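One standard way to estimate those parameters (not necessarily the approach we will use in class) is maximum-likelihood counting: tally how often each symbol follows each other symbol in training sequences, then normalize each row. A sketch, with the training sequences and the function name estimate_transitions invented for illustration:

    from collections import defaultdict

    def estimate_transitions(sequences, alphabet="ACGT"):
        # Count observed steps s -> t, then normalize each row into probabilities
        counts = {s: defaultdict(float) for s in alphabet}
        for seq in sequences:
            for prev, curr in zip(seq, seq[1:]):
                counts[prev][curr] += 1
        transitions = {}
        for s in alphabet:
            total = sum(counts[s].values())
            transitions[s] = {t: (counts[s][t] / total if total else 1 / len(alphabet))
                              for t in alphabet}
        return transitions

    # Toy training data standing in for sequences observed in "organism A"
    print(estimate_transitions(["ACGTACGTTTGCA", "CGCGCGATATAT"]))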
  31. Markov Models. What do we need to probabilistically model DNA sequences? [Figure: the same four states plus a start state S and an end state e.] As with transformational grammars, we can also model special start and end states.
  32. Markov Models. Which model best explains a newly observed sequence? [Figure: two Markov models over A, C, G, T, one for Organism A and one for Organism B.] Each organism will have different transition probability parameters, so you can ask “was the sequence more likely to be generated by model A or model B?”
  33. Markov Models. Using Markov chains for discrimination. A commonly used metric for discrimination using Markov chains is the log-odds ratio: S(x) = log [ P(x|θA) / P(x|θB) ] = Σ_{i=1..L} log ( aA_{xi-1,xi} / aB_{xi-1,xi} ). Odds ratios allow us to calibrate our thinking about a probability we observe under one set of assumptions relative to a probability given another set of assumptions. We must be careful in interpreting log-odds ratios: an impressive-seeming ratio will not necessarily be significant.
  34. Markov Models. Using Markov chains for discrimination. Note that we can rewrite this in several ways, and often one form will be more convenient or efficient in a given context: S(x) = log [ P(x|θA) / P(x|θB) ] = log P(x|θA) - log P(x|θB) = Σ_{i=1..L} log ( aA_{xi-1,xi} / aB_{xi-1,xi} ) = Σ_{i=1..L} log aA_{xi-1,xi} - Σ_{i=1..L} log aB_{xi-1,xi}. More on this when we discuss numerical stability.
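A sketch of the per-position log-odds score in Python, using the two transition dicts from the “Alternative python transition dicts” slide as models A and B (the test sequence and the log_odds function name are illustrative assumptions). Working in log space also hints at the numerical-stability point above:

    import math

    trans_A = {
        "A": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
        "C": {"A": 0.40, "C": 0.20, "G": 0.25, "T": 0.15},
        "G": {"A": 0.20, "C": 0.20, "G": 0.30, "T": 0.30},
        "T": {"A": 0.40, "C": 0.15, "G": 0.25, "T": 0.20},
    }
    trans_B = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}   # the uniform model

    def log_odds(x, a, b):
        # S(x) = sum over i of log( a_{x_{i-1} x_i} / b_{x_{i-1} x_i} )
        return sum(math.log(a[p][c]) - math.log(b[p][c]) for p, c in zip(x, x[1:]))

    print(log_odds("TATATACGCG", trans_A, trans_B))   # positive, so model A is favored here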
  35. Some unfinished business. Does a Markov chain really define a probability distribution over the entire space of sequences? In other words, is the sum of the probabilities of all possible sequences equal to one? This turns out to be just slightly slippery in the case where we model the ends, but it’s not too bad in the case of a definite length L. We’ll address this in two ways: primarily by simulation in Python, but also have a go at convincing yourself that Σ_x P(x1) Π_{i=2..L} a_{xi-1,xi} = 1, where the sum runs over all 4^L sequences x of length L.
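Here is one way to have that go in Python: enumerate every sequence of a fixed length L and check that the probabilities assigned by the chain sum to one. This is brute-force enumeration rather than the simulation we will do in class; the transition dict is the illustrative one from the later slide, and p_first = 0.25 is an assumed uniform first-symbol distribution:

    from itertools import product

    transitions = {
        "A": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
        "C": {"A": 0.40, "C": 0.20, "G": 0.25, "T": 0.15},
        "G": {"A": 0.20, "C": 0.20, "G": 0.30, "T": 0.30},
        "T": {"A": 0.40, "C": 0.15, "G": 0.25, "T": 0.20},
    }

    def p_seq(x, p_first=0.25):
        # P(x) = P(x1) * prod over i = 2..L of a_{x_{i-1} x_i}
        p = p_first
        for prev, curr in zip(x, x[1:]):
            p *= transitions[prev][curr]
        return p

    L = 5
    print(sum(p_seq(x) for x in product("ACGT", repeat=L)))   # 1.0, up to rounding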
  36. The “Occasionally Dishonest Casino”. Consider a casino that uses two dice, one fair, one loaded. Once in a while, they sneakily switch the dice… You watch for a while and observe the following sequence of rolls: 1 2 2 6 4 3 2 1 5 6 6 6 5 1 6 2 6. Would you know which die the casino was using? Would you know if they switched at some point? Your first problem: the fair and the cheat die have the SAME observables!!
  37. DNA and amino acids are like our dice. We see the same symbols regardless of any functional association! QATNRNTDGS TDYGILQINS RWWCNDGRTP GSRNLCNIPC SALLSSDITA SVNCAKKIVS DGNGMNAWVA WRNRCKGTDQ. Can we tell by casual examination which segment of this amino acid sequence corresponds to a trans-membrane domain and which does not? The amino acids are not labelled.
  38. Hidden Markov Models. What does “Hidden” refer to? [Figure: the simple four-state A, C, G, T Markov model.] In our simple Markov model, we cannot say anything about functional associations based on observed symbols; the transition probabilities don’t reflect the underlying state of the system.
  39. Hidden Markov Models. Can we alter our model to accommodate alternative states for each observable? [Figure after Durbin et al.: eight states A+, C+, G+, T+ and A-, C-, G-, T-; the complete set of “within set” transitions is omitted here for clarity.] It’s still yucky!
  40. Hidden Markov Models. The clearest representation is to separate transitions from emissions of symbols. [Figure: two states. State “+” emits A: 0.30, C: 0.25, G: 0.15, T: 0.30; state “-” emits A: 0.20, C: 0.35, G: 0.25, T: 0.20. Transitions: “+”→“+” 0.9, “+”→“-” 0.1, “-”→“-” 0.99, “-”→“+” 0.01.] Each state now has its own table of emission probabilities, and transitions now occur strictly between states.
  41. Hidden Markov Models. We now must incorporate the concept of a state path. [Figure: the casino rolls 1 2 2 6 4 3 2 1 5 6 6 6 5 1 6 2 6 annotated with the state path F F F F F F F F L L L L L L L F F, and the DNA sequence A A T A C A C G G C T A G C T A A annotated with the state path - - - - - - - + + + + + + + + - -.] Often, the true state path (often denoted π) associated with sequences of interest will be hidden from us.
  42. Three Classic Problems of HMMs. We will study algorithms that address each of these problems. The learning problem: given a model and some observed sequences, how do we rationally choose or adjust (i.e. estimate) the model parameters? The evaluation problem: given some HMM and some observed sequences, what is the probability that our model could have produced those sequences, i.e. what is Pr{observation X | θ}? The decoding problem: given some HMM and some observed sequences, what state path through the model was most likely to have produced those observations?
  43. The Evaluation Problem. With HMMs, if we forget about the symbols that are emitted, the state path still forms a Markov chain and the Markov property applies: akl = P(πi = l | πi-1 = k). In words: the transition probability akl from state k to state l is equal to the probability that the ith state was l, given that the immediately preceding state (πi-1) was k. This statement of conditional probability is in exactly the same form as was shown in our first Markov chain example.
  44. The Evaluation Problem. What about the emission probabilities? ek(b) = P(xi = b | πi = k). In words: the probability of emitting symbol b when in state k is equal to the probability that the ith symbol is b, given that the current state (πi) is k. The probability that a particular symbol will be emitted now depends on what state we are currently in; in other words, emission probabilities are conditional on state.
  45. Hidden Markov Models. Imagine we run this model generatively. [Figure: the two-state model again, with a start state S (S→“-” 0.9, S→“+” 0.1), the “+” and “-” emission tables, and transitions “+”→“+” 0.9, “+”→“-” 0.1, “-”→“-” 0.99, “-”→“+” 0.01.] What general sequence of events would occur?
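A sketch of what such a generative run could look like in code: start in S, repeatedly choose the next state from its transition distribution, then emit a symbol from that state's emission table. The parameter values follow the figure as reconstructed above (consistent with the worked example two slides later); the fixed length-20 stopping rule is an assumption, since this model has no end state:

    import random

    transitions = {
        "S": {"+": 0.10, "-": 0.90},
        "+": {"+": 0.90, "-": 0.10},
        "-": {"-": 0.99, "+": 0.01},
    }
    emissions = {
        "+": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
        "-": {"A": 0.20, "C": 0.35, "G": 0.25, "T": 0.20},
    }

    def generate(length=20):
        state, path, symbols = "S", [], []
        for _ in range(length):
            # Transition first, then emit from the new state's emission table
            nxt = random.choices(list(transitions[state]),
                                 weights=list(transitions[state].values()))[0]
            sym = random.choices(list(emissions[nxt]),
                                 weights=list(emissions[nxt].values()))[0]
            path.append(nxt)
            symbols.append(sym)
            state = nxt
        return "".join(symbols), "".join(path)

    print(generate())   # typically long runs of "-" states, with occasional "+" stretches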
  46. The Evaluation Problem. Putting together the state transitions and the symbol emissions: P(x, π) = aS→π1 · Π_{i=1..L} [ eπi(xi) · aπi→πi+1 ]. In words: the joint probability of some observed sequence x and some state sequence π is the probability of transitioning from Start to the first state, times a running product, across all positions in the sequence, of the state-specific emission probability times the probability of transitioning to the next state. What we are really interested in knowing is the joint probability of the sequence and the state path. But what if the state path is hidden?
  47. The Evaluation Problem. Scoring with a known state sequence. [Figure: the same two-state model with start state S.] Sequence under test: A C G C T, with state path S - - - + +. P(x, π) = 0.9 × (0.2 × 0.99) × (0.35 × 0.99) × (0.25 × 0.01) × (0.25 × 0.9) × 0.3 = 0.9 × 0.198 × 0.3465 × 0.0025 × 0.225 × 0.3 = 1.042 × 10^-5, already a small number! This is the procedure when we have a known state sequence, but how could we approach this if the state path was unknown?? We’ll revisit this last question another day…
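The same hand calculation as a short Python sketch (parameter values as read off the figure; joint_probability is an illustrative helper name, and the formula is the one from the previous slide):

    transitions = {
        "S": {"+": 0.10, "-": 0.90},
        "+": {"+": 0.90, "-": 0.10},
        "-": {"-": 0.99, "+": 0.01},
    }
    emissions = {
        "+": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
        "-": {"A": 0.20, "C": 0.35, "G": 0.25, "T": 0.20},
    }

    def joint_probability(x, path):
        # P(x, pi) = a_{S -> pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_i -> pi_i+1} ]
        p = transitions["S"][path[0]]
        for i, (sym, state) in enumerate(zip(x, path)):
            p *= emissions[state][sym]
            if i + 1 < len(path):
                p *= transitions[state][path[i + 1]]
        return p

    print(joint_probability("ACGCT", "---++"))   # ~1.042e-05, matching the slide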
  48. Alternative python transition dicts. Which reflects a uniform distribution?

    self.transitions = {
        "A": {"A": 0.3,  "C": 0.2,  "G": 0.2,  "T": 0.3},
        "C": {"A": 0.4,  "C": 0.2,  "G": 0.25, "T": 0.15},
        "G": {"A": 0.2,  "C": 0.2,  "G": 0.3,  "T": 0.3},
        "T": {"A": 0.4,  "C": 0.15, "G": 0.25, "T": 0.2},
    }

    self.transitions = {
        "A": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    }

  I’ll send you some test sequences, and your program will tell me which of these models was more likely to have produced the sequence!
  49. Our python HMM data structures. Here are distributions for a dishonest casino…

    self.emissions = {
        "S":   # "S" is start
            {"": 1},   # always emit the null string!
        "F":   # 'F' indicates a fair die
            {"1": 1/6, "2": 1/6, "3": 1/6, "4": 1/6, "5": 1/6, "6": 1/6},
        "L":   # 'L' indicates a loaded die
            {"1": 1/10, "2": 1/10, "3": 1/10, "4": 1/10, "5": 1/10, "6": 1/2},
    }

    self.transitions = {
        "S": {"F": 0.5,  "L": 0.5},
        "F": {"F": 0.95, "L": 0.05},
        "L": {"L": 0.90, "F": 0.10},
    }

  Let’s adapt your simple Markov program to first work with our dice example…
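As one way to start that adaptation (a sketch, not necessarily how your program will be structured), the same joint-probability product applies directly to these casino distributions. This compares the observed rolls under an “always fair” path versus the fair-then-loaded path drawn on the earlier state-path slide; joint_probability is an assumed helper name:

    emissions = {
        "F": {str(r): 1 / 6 for r in range(1, 7)},                           # fair die
        "L": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5},    # loaded die
    }
    transitions = {
        "S": {"F": 0.5, "L": 0.5},
        "F": {"F": 0.95, "L": 0.05},
        "L": {"L": 0.90, "F": 0.10},
    }

    rolls = "12264321566651626"                 # the observed rolls from the casino slide
    path_fair = "F" * len(rolls)                # the casino never switched
    path_switch = "F" * 8 + "L" * 7 + "F" * 2   # the switch pattern from the state-path slide

    def joint_probability(x, path):
        # P(x, pi) = a_{S -> pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_i -> pi_i+1} ]
        p = transitions["S"][path[0]]
        for i, (sym, state) in enumerate(zip(x, path)):
            p *= emissions[state][sym]
            if i + 1 < len(path):
                p *= transitions[state][path[i + 1]]
        return p

    # Compare which explanation of the rolls is more probable under this model
    print(joint_probability(rolls, path_fair))
    print(joint_probability(rolls, path_switch))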