CS 6243 Machine Learning

CS 6243 Machine Learning Markov Chain and Hidden Markov Models

Outline • Background on probability • Hidden Markov models • Algorithms • Applications

Probability Basics • Definition (informal) • Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed • A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment • The sample space S of a random experiment is the set of all possible outcomes

Probabilistic Calculus • All probabilities between 0 and 1 • If A, B are mutually exclusive: • P(A  B) = P(A) + P(B) • Thus: P(not(A)) = P(Ac) = 1 – P(A) S B A

Conditional probability • The joint probability of two events A and B P(AB), or simply P(A, B) is the probability that event A and B occur at the same time. • The conditional probability of P(A|B) is the probability that A occurs given B occurred. P(A | B) = P(A  B) / P(B) <=> P(A  B) = P(A | B) P(B) <=> P(A  B) = P(B|A) P(A)

Example • Roll a die • If I tell you the number is less than 4 • What is the probability of an even number? • P(d = even | d < 4) = P(d = even  d < 4) / P(d < 4) • P(d = 2) / P(d = 1, 2, or 3) = (1/6) / (3/6) = 1/3

Independence • A and B are independent iff: • Therefore, if A and B are independent: These two constraints are logically equivalent

Examples • Are P(d = even) and P(d < 4) independent? • P(d = even and d < 4) = 1/6 • P(d = even) = ½ • P(d < 4) = ½ • ½ * ½ > 1/6 • If your die actually has 8 faces, will P(d = even) and P(d < 5) be independent? • Are P(even in first roll) and P(even in second roll) independent? • Playing card, are the suit and rank independent?

Theorem of total probability • Let B1, B2, …, BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S. • An event A can be represented as: • Since B1, B2, …, BN are mutually exclusive, then • P(A) = P(A  B1) + P(A B2) + … + P(ABN) • And therefore • P(A) = P(A|B1)*P(B1) + P(A|B2)*P(B2) + … + P(A|BN)*P(BN) • = i P(A | Bi) * P(Bi) Marginalization Exhaustive conditionalization

Example • A loaded die: • P(6) = 0.5 • P(1) = … = P(5) = 0.1 • Prob of even number? P(even) = P(even | d < 6) * P (d<6) + P(even | d = 6) * P (d=6) = 2/5 * 0.5 + 1 * 0.5 = 0.7

Another example • A box of dice: • 99% fair • 1% loaded • P(6) = 0.5. • P(1) = … = P(5) = 0.1 • Randomly pick a die and roll, P(6)? • P(6) = P(6 | F) * P(F) + P(6 | L) * P(L) • 1/6 * 0.99 + 0.5 * 0.01 = 0.17

Bayes theorem • P(A  B) = P(B) * P(A | B) = P(A) * P(B | A) Conditional probability (likelihood) P ( A | B ) P ( B ) Prior of B => = P ( B | A ) P ( A ) Posterior probability Prior of A (Normalizing constant) This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

Bayes theorem (cont’d) • Given B1, B2, …, BN, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj? • P(Bj | A) = P(A | Bj) * P(Bj) / P(A) = P(A | Bj) * P(Bj) / jP(A | Bj)*P(Bj) Posterior probability Prior of Bj Likelihood Normalizing constant (theorem of total probabilities) Bj: different models / hypotheses In the observation of A, should you choose a model that maximizes P(Bj | A) or P(A | Bj)? Depending on how much you know about Bj !

Example • A test for a rare disease claims that it will report positive for 99.5% of people with disease, and negative 99.9% of time for those without. • The disease is present in the population at 1 in 100,000 • What is P(disease | positive test)? • P(D|+) = P(+|D)P(D)/P(+) = 0.01 • What is P(disease | negative test)? • P(D|-) = P(-|D)P(D)/P(-) = 5e-8

Another example • We’ve talked about the boxes of casinos: 99% fair, 1% loaded (50% at six) • We said if we randomly pick a die and roll, we have 17% of chance to get a six • If we get 3 six in a row, what’s the chance that the die is loaded? • How about 5 six in a row?

P(loaded | 666) = P(666 | loaded) * P(loaded) / P(666) = 0.53 * 0.01 / (0.53 * 0.01 + (1/6)3 * 0.99) = 0.21 • P(loaded | 66666) = P(66666 | loaded) * P(loaded) / P(66666) = 0.55 * 0.01 / (0.55 * 0.01 + (1/6)5 * 0.99) = 0.71

Simple probabilistic models for DNA sequences • Assume nature generates a type of DNA sequence as follows: • A box of dice, each with four faces: {A,C,G,T} • Select a die suitable for the type of DNA • Roll it, append the symbol to a string. • Repeat 3, until all symbols have been generated. • Given a string say X=“GATTCCAA…” and two dice • M1 has the distribution of pA=pC=pG=pT=0.25. • M2 has the distribution: pA=pT=0.20, pC=pG=0.30 • What is the probability of the sequence being generated by M1 or M2?

Model selection by maximum a posterior probability criterion • Take the prior probabilities of M1 and M2 into consideration if known Log (P(M1|X) / P(M2|X)) = LLR + log(P(M1)) – log(P(M2)) = nASA + nCSC + nGSG + nTST + log(P(M1)) – log(P(M2)) • If P(M1) ~ P(M2), results will be similar to LLR test

Markov models for DNA sequences We have assumed independence of nucleotides in different positions - unrealistic in biology

Example: CpG islands • CpG - 2 adjacent nucleotides, same strand (not base-pair; “p” stands for the phosphodiester bond of the DNA backbone) • In mammal promoter regions, CpG is more frequent than other regions of genome • often mark gene-rich regions

CpG islands • CpG Islands • More CpG than elsewhere • More C & G than elsewhere, too • Typical length: a few 100s to few 1000s bp • Questions • Is a short sequence (say, 200 bp) a CpG island or not? • Given a long sequence (say, 10-100kb), find CpG islands?

Markov models • A sequence of random variables is a k-th order Markov chain if, for all i, ithvalue is independent of all but the previous k values: • First order (k=1): • Second order: • 0th order: (independence)

First order Markov model

A 1st order Markov model for CpG islands • Essentially a finite state automaton (FSA) • Transitions are probabilistic (instead of deterministic) • 4 states: A, C, G, T • 16 transitions: ast = P(xi = t | xi-1 = s) • Begin/End states

Probability of emitting sequence x

Probability of a sequence • What’s the probability of ACGGCTA in this model? P(A) * P(C|A) * P(G|C) … P(A|T) = aBA aAC aCG …aTA • Equivalent: follow the path in the automaton, and multiply the transition probabilities on the path

Training • Estimate the parameters of the model • CpG+ model: Count the transition frequencies from known CpG islands • CpG- model: Also count the transition frequencies from sequences without CpG islands • ast = #(s→t) / #(s →) a+st a-st

Discrimination / Classification • Given a sequence, is it CpG island or not? • Log likelihood ratio (LLR) βCG = log2(a+CG/a -CG) = log2(0.274/0.078) = 1.812 βBA = log2(a+ A/a - A) = log2(0.591/1.047) = -0.825

Example • X = ACGGCGACGTCG • S(X) = βBA + βAC +βCG +βGG +βGC +βCG +βGA + βAC +βCG +βGT +βTC +βCG = βBA + 2βAC +4βCG +βGG +βGC +βGA +βGT +βTC = -0.825+ 2*.419 + 4*1.812+.313 +.461 - .624 - .730 + .573 = 7.25

CpG island scores Figure 3.2 (Durbin book) The histogram of length-normalized scores for all the sequences. CpG islands are shown with dark grey and non-CpG with light grey.

Questions • Q1: given a short sequence, is it more likely from CpG+ model or CpG- model? • Q2: Given a long sequence, where are the CpG islands (if any)? • Approach 1: score (e.g.) 100 bp windows • Pro: simple • Con: arbitrary, fixed length, inflexible • Approach 2: combine +/- models.

Combined model • Given a long sequence, predict which state each position is in. (states are hidden: Hidden Markov model)

Hidden Markov Model (HMM) • Introduced in the 70’s for speech recognition • Have been shown to be good models for biosequences • Alignment • Gene prediction • Protein domain analysis • … • An observed sequence data that can be modeled by a Markov chain • State path unknown • Model parameter known or unknown • Observed data: emission sequences X = (x1x2…xn) • Hidden data: state sequences Π = (π1π2…πn)

Hidden Markov model (HMM) Definition: A hidden Markov model (HMM) is a five-tuple • Alphabet = { b1, b2, …, bM } • Set of statesQ = { 1, ..., K } • Transition probabilities between any two states aij = transition prob from state i to state j ai1 + … + aiK = 1, for all states i = 1…K • Start probabilitiesa0i a01 + … + a0K = 1 • Emission probabilities within each state ek(b) = P( xi = b | i = k) ek(b1) + … + ek(bM) = 1, for all states k = 1…K 1 2 K …

HMM for the Dishonest Casino A casino has two dice: • Fair die P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 P(6) = 1/2 Casino player switches back and forth between fair and loaded die once in a while

The dishonest casino model aFF = 0.95 aFL = 0.05 aLL = 0.95 Fair LOADED eF(1) = 1/6 eF(2) = 1/6 eF(3) = 1/6 eF(4) = 1/6 eF(5) = 1/6 eF(6) = 1/6 eL(1) = 1/10 eL(2) = 1/10 eL(3) = 1/10 eL(4) = 1/10 eL(5) = 1/10 eL(6) = 1/2 aLF = 0.05 Transition probability Emission probability

Simple scenario • You don’t know the probabilities • The casino player lets you observe which die he/she uses every time • The “state” of each roll is known • Training (parameter estimation) • How often the casino player switches dice? • How “loaded” is the loaded die? • Simply count the frequency that each face appeared and the frequency of die switching • May add pseudo-counts if number of observations is small

More complex scenarios • The “state” of each roll is unknown: • You are given the results of a series of rolls • You don’t know which number is generated by which die • You may or may not know the parameters • How “loaded” is the loaded die • How frequently the casino player switches dice

The three main questions on HMMs • Decoding GIVEN a HMM M, and a sequence x, FIND the sequence  of states that maximizes P (x,  | M ) • Evaluation GIVEN a HMM M, and a sequence x, FIND P ( x | M ) [or P(x)for simplicity] • Learning GIVEN a HMM M with unspecified transition/emission probs., and a sequence x, FIND parameters  = (ei(.), aij) that maximize P (x | ) Sometimes written as P (x, ) for simplicity.

Question # 1 – Decoding GIVEN A HMM with parameters. And a sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

1 1 1 1 … 2 2 2 2 … … … … … K K K K … A parse of a sequence Given a sequence x = x1……xN, and a HMM with k states, A parse of x is a sequence of states  = 1, ……, N 1 2 2 K x1 x2 x3 xK

1 1 1 1 … 2 2 2 2 … … … … … K K K K … Probability of a parse 1 Given a sequence x = x1……xN and a parse  = 1, ……, N To find how likely is the parse: (given our HMM) P(x, ) = P(x1, …, xN, 1, ……, N) = P(xN, N | N-1) P(xN-1, N-1 | N-2)……P(x2, 2 | 1) P(x1, 1) = P(xN | N)P(N | N-1) ……P(x2 | 2)P(2 | 1)P(x1 | 1)P(1) =a01 a12……aN-1Ne1(x1)……eN(xN) 2 2 K x1 x2 x3 xK

Example 0.05 • What’s the probability of  = Fair, Fair, Fair, Fair, Load, Load, Load, Load, Fair, Fair X = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4? 0.95 0.95 Fair LOADED P(1|F) = 1/6 P(2|F) = 1/6 P(3|F) = 1/6 P(4|F) = 1/6 P(5|F) = 1/6 P(6|F) = 1/6 P(1|L) = 1/10 P(2|L) = 1/10 P(3|L) = 1/10 P(4|L) = 1/10 P(5|L) = 1/10 P(6|L) = 1/2 0.05

Decoding • Parse (path) is unknown. What to do? • Alternative algorithms: • Most probable single path (Viterbi algorithm) • Sequence of most probable states (Forward-backward algorithm)

The Viterbi algorithm • Goal: to find • Is equivalent to find

The Viterbi algorithm L L L L L L L L L L • Find a path with the following objective: • Maximize the product of transition and emission probabilities Maximize the sum of log probabilities B F F F F F F F F F F P(s|L) = 1/10, for s in [1..5]P(6|L) = 1/2 P(s|F) = 1/6, for s in [1..6] Edge weight (symbol independent) Node weight (depend on symbols in seq)

wLF wFF wFF = log (aFF) rF(xi+1) = log (eF(xi+1)) The Viterbi algorithm x1 x2 x3 … xi xi+1 … xn-1 xn VF (i+1) = rF (xi+1) + max VF (i) + wFF VL (i) + wLF … … L L L L L L L B E … … F F F F F F F Weight for the best parse of (x1…xi+1), with xi+1 emitted by state F VL (i+1) = rL (xi+1) + max VF (i) + wFL VL (i) + wLL Weight for the best parse of (x1…xi+1), with xi+1 emitted by state L

Recursion from FSA directly wLL=-0.05 wFF=-0.05 WFL=-3.00 aLL=0.95 aFF=0.95 aFL=0.05 Fair LOADED Fair LOADED WLF=-3.00 aLF=0.05 rL(6) = -0.7 rL(s) = -2.3 (s = 1…5) P(s|F) = 1/6 s = 1…6 P(6|L) = ½ P(s|L) = 1/10 (s = 1...5) rF(s) = -1.8 s = 1...6 VF (i+1) = rF (xi+1) + max {VL (i) + WLF VF (i) + WFF} VL (i+1) = rL (xi+1) + max {VL (i) + WLL VF (i) + WFL} PF (i+1) = eF (xi+1) max {PL (i) aLF PF (i) aFF} PL (i+1) = eL (xi+1) max {PL (i) aLL PF (i) aFL}

CS 6243 Machine Learning