Em algorithm and applications lecture 9
1 / 38

EM algorithm and applications Lecture #9 - PowerPoint PPT Presentation

  • Updated On :

EM algorithm and applications Lecture #9. Background Readings : Chapters 11.2, 11.6 in the text book, Biological Sequence Analysis , Durbin et al., 2001. Reminder: Relative Entropy.

Related searches for EM algorithm and applications Lecture #9

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'EM algorithm and applications Lecture #9' - hanley

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Em algorithm and applications lecture 9 l.jpg

EM algorithm and applicationsLecture #9

Background Readings: Chapters 11.2, 11.6 in the text book, Biological Sequence Analysis, Durbin et al., 2001.


Reminder relative entropy l.jpg
Reminder: Relative Entropy

Let p,q be two probability distributions on the same sample space. The relative entropy between p and q is defined by

H(p||q) = D(p||q) = ∑x p(x)log[p(x)/q(x)]

= ∑x p(x)log(1/(q(x)) -

-∑ x p(x)log(1/(p(x)).


“The inefficiency of assuming distribution q when the correct distribution is p”.

Non negativity of relative entropy l.jpg
Non negativity of relative entropy

Claim (proved last week)

D(p||q)= ∑x p(x)log[p(x)/q(x)]

= ∑x p(x)log(1/(q(x)) -∑ x p(x)log(1/(p(x)) ≥ 0.

Equality if and only if q=p.

This claim is used in the correctness proof of the EM algorithm, which we present next.

Em algorithm approximating mle from incomplete data l.jpg
EM algorithm:approximating MLE from Incomplete Data

  • Finding MLE parameters: nonlinear optimization problem

log P(x| λ)

E [log P(x,y|λ)]




General idea of EM:

Use “current point” θ to construct alternative function Qθ (which is “nice”)

Guarantee: if Qθ(λ)>Qθ(θ), than λhas higher likelihood than θ.

The em algorithm l.jpg
The EM algorithm

Consider a model where, for observed data x and model parameters θ:


(y are “hidden data”).

The EM algorithm receives x and parameters θ, and returns new parameters * s.t. p(x|*) > p(x|θ).

Note: In Durbin et. al. book, the initial parameters are denoted by θ0, and the new parameters by θ.

The em algorithm6 l.jpg
The EM algorithm

Finding * which maximizes


is equivalent to finding * which maximizes the logarithm

log p(x|*) = log (∑yp(x,y| *))

Which is what the EM algorithm attempts to do.

In the following we:

  • Present the EM algorithm.

  • Give few examples of implementations

  • Prove its correctness.

The em algorithm7 l.jpg
The EM algorithm

In each iteration the EM algorithm does the following.

  • (E step): Calculate

    Qθ() = ∑y p(y|x,θ)log p(x,y|)

  • (M step): Find* which maximizes Qθ()

    (Next iteration sets  * and repeats).


When θ is clear, we shall write Q() instead of Qθ()

At the M-step we only need that Qθ(*)>Qθ(θ). This change yields the so called Generalized EM algorithm. It is important when it is hard to find the optimal *.

Example baum welsh em for hmm l.jpg
Example: Baum Welsh = EM for HMM

The Baum-Welsh algorithm is the EM algorithm for HMM :

  • E step for HMM: Qθ() = ∑S p(s|x,θ)log p(s,x|), where λ are the new parameters {akl,ek(b)}.

(The are the counts of state transitions and symbol emissions in (s,x)).

Baum welsh em for hmm l.jpg
Baum Welsh = EM for HMM

  • M step For HMM: Find* which maximizes Qθ(). As we proved, λ* is given by the relative frequencies of the Akl’s and the Ek(b)’s

A simplest example em for 2 coin tosses l.jpg
A simplest example: EM for 2 coin tosses

Consider the following experiment:

Given a coin with two possible outcomes: H (head) and T (tail), with probabilities qH, qT= 1- qH.

The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T,*).

We wish to apply the EM algorithm to get parameters that increase the likelihood of the data.

Let the initial parameters be θ = (qH, qT) = ( ¼, ¾ ).

Em for 2 coin tosses cont l.jpg
EM for 2 coin tosses (cont)

The hidden data which can produce x are the sequences y1= (T,H); y2=(T,T);

Hence the likelihood of x with parameters (qH, qT), is p(x| θ) = P(x,y1|) + P(x,y2|) = qHqT+qT2

For the initial parameters θ = ( ¼, ¾ ), we have:

p(x| θ) = ¾ * ¼ + ¾ * ¾ = ¾

Note that in this case P(x,yi|) = P(yi|), for i = 1,2.

we can always define y so that (x,y) = y (otherwise we set y’(x,y) and replace the “ y ”s by “ y’ ”s).

Em for 2 coin tosses e step l.jpg
EM for 2 coin tosses - E step

Calculate Qθ() = Qθ(qH,qT). Note: qH,qT are variables

Qθ() = p(y1|x,θ)log p(x,y1|) +

p(y2|x,θ)log p(x,y2|)

p(y1|x,θ) = p(y1,x|θ)/p(x|θ) = (¾∙ ¼)/ (¾) = ¼

p(y2|x,θ) = p(y2,x|θ)/p(x|θ) = (¾∙ ¾)/ (¾) = ¾

Thus we have

Qθ() =¼log p(x,y1|) +¾log p(x,y2|)

Em for 2 coin tosses e step13 l.jpg
EM for 2 coin tosses - E step

For a sequence y of coin tosses, let NH(y) be the number of H’s in y, and NT(y) be the number of T’s in y. Then

log p(y|) = NH(y) log qH+ NT(y) log qT

In our example:

y1= (T,H); y2=(T,T), hence:

NH(y1) = NT(y1)=1, NH(y2) =0, NT(y2)=2

Example 2 coin tosses e step l.jpg
Example: 2 coin tosses - E step


¼ log p(x,y1|) = ¼ (NH(y1) log qH+ NT(y1) log qT) = ¼ (log qH+ log qT)

¾ log p(x,y2|) = ¾ ( NH(y2) log qH+ NT(y2) log qT) = ¾ (2 log qT)

Substituting in the equation for Qθ() :

Qθ() =¼log p(x,y1|)+¾log p(x,y2|)

= ( ¼ NH(y1)+ ¾ NH(y2))log qH+ ( ¼ NT(y1)+ ¾ NT(y2))log qT

NT= 7/4

NH= ¼

Qθ() = NHlog qH + NTlog qT

Em for 2 coin tosses m step l.jpg
EM for 2 coin tosses - M step

Find * which maximizes Qθ()

Qθ() =NHlog qH + NTlog qT = ¼log qH+ 7/4 log qT

We saw earlier that this is maximized when:

[The optimal parameters (0,1), will never be reached by the EM algorithm!]

Em for single random variable dice l.jpg
EM for single random variable (dice)

Now, the probability of each y(≡(x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities q1,..,qm. Let Nl(y) = #(outcome l occurs in y). Then

Let Nl be the expected value of Nl(y), given x and θ:

Nl=E(Nl|x,θ) = ∑y p(y|x,θ) Nl(y),

Then we have:

Q for one dice l.jpg
Q(λ) for one dice


Em algorithm for n independent observations x 1 x n l.jpg
EM algorithm for n independent observations x1,…, xn :

Expectation step

It can be shown that, if the xjare independent, then:

Example the abo locus l.jpg

The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O.

We wish to estimate the proportion in a population of the 6 genotypes.

Example: The ABO locus

A locus is a particular place on the chromosome. Each locus’ state (called genotype) consists of two alleles – one parental and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.

Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then, the MLE is given by:

The abo locus cont l.jpg
The ABO locus (Cont.)

However, testing individuals for their genotype is a very expensive. Can we estimate the proportions of genotype using the common cheap blood test with outcome being one of the four blood types (A, B, AB, O) ?

The problem is that among individuals measured to have blood type A, we don’t know how many have genotype a/a and how many have genotype a/o. So what can we do ?

The abo locus cont21 l.jpg
The ABO locus (Cont.)

The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles qa,qb,qoin the population determine the frequencies of the genotypes as follows:

qa/b= 2qa qb, qa/o= 2qa qo, qb/o= 2qb qo,

qa/a= [qa]2, qb/b= [qb]2, qo/o= [qo]2.

In fact, Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden parameters y:

The abo locus cont22 l.jpg
The ABO locus (Cont.)

The dice’ outcome are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O.

Each blood type isdetermined by two successive random sampling of alleles, which is an “ordered genotypes pair” – this is the hidden data.

For instance blood type A corresponds to the ordered genotypes pairs (a,a), (a,o) and (o,a).

So we have three parameters of one dice – qa,qb,qo - that we need to estimate.

Em setting for the abo locus l.jpg
EM setting for the ABO locus

The observed data x =(x1,..,xn) is a sequence of letters (blood types) from the alphabet {A,B,AB,O}. eg: (B,A,B,B,O,A,B,A,O,B, AB) are observations (x1,…x11).

The hidden data (ie the y’s)for each letter xjis the set of ordered pairs of alleles that generates it. For instance, for A it is the set {aa, ao, oa}.

The parameters= {qa ,qb, qo} are the probabilities of the alleles.

We need is to find the parameters  = {qa ,qb, qo} that maximize the likelihood of the given data.

We do this by the EM algorithm:

Em for abo loci l.jpg
EM for ABO loci

  • For each observed blood type xj{A,B,AB,O} and for each allele z in{a,b,o} we compute Nz(xj) , the expected number of times that z appear in xj.

Where the sum is taken over the ordered “genotype pairs” yj, and Nz(yj) is the number of times allele z occurs in the pair yj. eg, Na(o,b)=0; Nb(o,b) = No(o,b) = 1.

Em for abo loci25 l.jpg
EM for ABO loci

The computation for blood type B:

P(B|) = P((b,b)|) + p((b,o)|) +p((o,b)|)) = qb2 + 2qbqo.

Since Nb((b,b))=2, and Nb((b,o))=Nb((o,b)) =No((o,b))=No((b,o))=1, No(B) and Nb(B) , the expected number of occurrences of o and b in B, are given by:

Observe that Nb(B)+ No(B)= 2

Em for abo loci26 l.jpg
EM for ABO loci

Similarly, P(A|) = qa2 + 2qaqo.

P(AB|) = p((b,a)|) + p((a,b)|)) = 2qaqb ;

Na(AB) = Nb(AB)= 1

P(O|) = p((o,o)|) = qo2

No(O)= 2

[ Nb(O) = Na(O) = No(AB)= Nb(A)= Na(B)= 0 ]

E step compute n a n b and n o l.jpg
E step: compute Na, Nband No

Let #(A)=3, #(B)=5, #(AB)=1, #(O)=2 be the number of observations of A, B, AB, and O respectively.

M step: set λ*=( qa*, qb*, qo*)

Em for a general discrete stochastic processes l.jpg
EM for a general discrete stochastic processes

Now we wish to maximize likelihood of observation x with hidden data as before, ie maximize


But this time experiment (x,y) is generated by a general stochastic process. The only assumption we make is that the outcome of each experimentconsists of a (finite) sequence of samplings of r discrete random variables (dices) Z1,...,Zr , each of the Zi ‘s can be sampled few times. This can be realized by a probabilistic acyclic state machine, where at each state some Zi is sampled, and the next state is determined by the outcome – until a final state is reached.

Em for processes with many dices l.jpg











EM for processes with many dices

Example: In HMM, the random variables are the transmissions and emission probabilities: akl , ek(b).

x is the visible information

y is the sequence s of states

(x,y) is the complete HMM

As before, we can redefine y so that (x,y) = y.

Em for processes with many dices30 l.jpg
EM for processes with many dices

Each random variable Zk (k =1,...,r)has mkvalues zk,1,...zk,mkwith probabilities {qkl,|l=1,...,mk}.

Each y definesa sequence of outcomes (zk1,l1,...,zkn,ln) of the random variables used in y.

In the HMM, these are the specific transitions and emissions, defined by the states and outputs of the sequence yj .

Let Nkl(y) = #(zkl appearsin y).

Em for processes with many dices31 l.jpg

Define Nkl as the expected value of Nkl(y), given x and θ:

Nkl=E(Nkl|x,θ) = ∑y p(y|x,θ) Nkl(y),

Then we have:

EM for processes with many dices

Similarly to the single dice case, we have:

Q for processes with many dices l.jpg
Q(λ) for processes with many dices


Em algorithm for processes with many dices l.jpg
EM algorithm for processes with many dices

Similarly to the one dice case we get:

Expectation step

Set Nklto E(Nkl(y)|x,θ), ie:

Nkl= ∑y p(y|x,θ) Nkl(y)

Maximization step

Set qkl=Nkl / (∑l’ Nkl’)

Em algorithm for n independent observations x 1 x n34 l.jpg
EM algorithm for n independent observations x1,…, xn :

Expectation step

It can be shown that, if the xjare independent, then:

Correctness proof of em l.jpg
Correctness proof of EM


Let x = {y:yY} be a collection of events, as in the setting of the EM algorithm, and let:

Q(λ) = ∑y p(y|x,θ)log p(y| λ)

Then the following holds:

if Q(λ*)> Q(θ), then P(x| λ*) P(x| θ) .

Proof cont l.jpg
Proof (cont.)

By the definition of conditional probability, for each y we have,

p(x|) p(y|x,) = p(y,x|), and hence:

log p(x|) = log p(y,x|) – log p(y|x,)


log p(x| λ) = ∑yp(y|x,θ) [log p(y|λ) – log p(y|x,λ)]


log p(x|λ)


Proof end l.jpg
Proof (end)


Hint to relative entropy…

log p(x|λ) = ∑yp(y|x, θ) log p(y|λ) - ∑yp(y|x,θ) log [p(y|x,λ)]

Substituting λ=λ* and λ=θ, and then subtracting, we get

log p(x|λ*) - log p(x|θ) =

Q(λ*) – Q(θ) + D(p(y|x,θ) || p(y|x,λ*))

Relative entropy 0 ≤

≥ Q(λ*) – Q(θ) ≥ 0. QED

Em in practice l.jpg
EM in Practice

Initial parameters:

  • Random parameters setting

  • “Best” guess from other source

    Stopping criteria:

  • Small change in likelihood of data

  • Small change in parameter values

    Avoiding bad local maxima:

  • Multiple restarts

  • Early “pruning” of unpromising ones