# The EM Method

### The EM Method

Arthur Pece

aecp@diku.dk

• Basic concepts
• EM clustering algorithm
• EM method and relationship to ML estimation

### What is EM?

• Expectation-Maximization
• A fairly general optimization method
• Useful when the model includes 3 kinds of variables:
  • visible variables $x$
  • intermediate variables $h$ (*)
  • parameters/state variables $s$

and we want to optimize only w.r.t. the parameters.

(*) Here we assume that the intermediate variables are discrete.

### EM Method

• A method to obtain maximum-likelihood (ML) parameter estimates → maximize the log-likelihood w.r.t. the parameters.

Assuming that the $x_i$ are statistically independent, the log-likelihood of the data set is the sum of the log-likelihoods of the data points:

$$L = \sum_i \log p(x_i \mid s) = \sum_i \log \sum_k p(x_i \mid h_k, s)\, p(h_k \mid s)$$

(replace the second sum with an integral if the intermediate variables are continuous rather than discrete)
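As a concrete illustration, here is a minimal numpy/scipy sketch (the name `mixture_log_likelihood` is ours, not from the slides) that evaluates this log-likelihood for the Gaussian-mixture case treated later in the deck, doing the inner sum over $k$ in log space for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(x, w, c, A):
    """L = sum_i log sum_k p(x_i | h_k, s) p(h_k | s), Gaussian case.

    x: (m, d) data points; w: (n,) mixing fractions w_j;
    c: (n, d) centroids c_j; A: (n, d, d) covariance matrices A_j.
    """
    m, n = x.shape[0], len(w)
    log_joint = np.empty((m, n))
    for j in range(n):
        # log p(h_j | s) + log p(x_i | h_j, s)
        log_joint[:, j] = np.log(w[j]) + multivariate_normal.logpdf(x, c[j], A[j])
    # log-sum-exp over components replaces the inner sum, avoiding underflow
    return float(logsumexp(log_joint, axis=1).sum())
```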

### EM functional

Given a pdf $q(h)$ for the intermediate variables (in practice, one such distribution per data point $x_i$), we define the EM functional:

$$Q(q, s) = \sum_i \sum_k q(h_k) \log \big[ p(x_i \mid h_k, s)\, p(h_k \mid s) \big]$$

This is usually much simpler to maximize than the log-likelihood

$$L = \sum_i \log \sum_k p(x_i \mid h_k, s)\, p(h_k \mid s)$$

because there is no logarithm of a sum in $Q(q, s)$.

### EM iteration

Two steps: E and M

• E step: $q(h)$ is set equal to the pdf of $h$ conditional on $x_i$ and the previous estimate $s^{(t-1)}$ of $s$:

$$q^{(t)}(h_k) = p(h_k \mid x_i, s^{(t-1)})$$

• M step: the EM functional $Q(q^{(t)}, s)$ is maximized w.r.t. $s$ to obtain $s^{(t)}$.
### Example: EM clustering

• $m$ data points $x_i$ are generated by $n$ generative processes, each process $j$ generating a fraction $w_j$ of the data points with pdf $f_j(x_i)$, parameterized by the parameter set $s_j$ (which includes $w_j$)
• We want to estimate the parameters $s_j$ for all processes
### Example: EM clustering

• Visible variables: $m$ data points $x_i$
• Intermediate variables: $m \times n$ binary labels $h_{ij}$, with $\sum_j h_{ij} = 1$
• State variables: $n$ parameter sets $s_j$
### EM clustering for Gaussian pdf's

• The parameters of cluster $j$ are its weight $w_j$, centroid $c_j$, and covariance $A_j$
• If we knew which data point belongs to which cluster, we could compute the fraction, mean, and covariance for each cluster:

$$w_j = \frac{1}{m}\sum_i h_{ij}, \qquad c_j = \frac{\sum_i h_{ij}\, x_i}{m\, w_j}, \qquad A_j = \frac{\sum_i h_{ij}\,(x_i - c_j)(x_i - c_j)^{\mathsf T}}{m\, w_j}$$
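A minimal numpy sketch of these statistics (the helper name `cluster_stats` is ours); it is written for a general weight matrix, so the same code serves the soft version introduced on the next slide:

```python
import numpy as np

def cluster_stats(x, h):
    """Fraction w_j, centroid c_j, covariance A_j for each cluster.

    x: (m, d) data; h: (m, n) assignment weights, each row summing to 1
    (one-hot rows give the hard-label case described above).
    """
    m, d = x.shape
    counts = h.sum(axis=0)            # m * w_j = sum_i h_ij
    w = counts / m                    # cluster fractions
    c = (h.T @ x) / counts[:, None]   # weighted means
    A = np.empty((h.shape[1], d, d))
    for j in range(h.shape[1]):
        r = x - c[j]                  # residuals from centroid j
        A[j] = (h[:, j, None] * r).T @ r / counts[j]
    return w, c, A
```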

### EM clustering (continued)

• Since we do not know which cluster a data point belongs to, we assign each point to all clusters, with different probabilities $q_{ij}$, $\sum_j q_{ij} = 1$:

$$w_j = \frac{1}{m}\sum_i q_{ij}, \qquad c_j = \frac{\sum_i q_{ij}\, x_i}{m\, w_j}, \qquad A_j = \frac{\sum_i q_{ij}\,(x_i - c_j)(x_i - c_j)^{\mathsf T}}{m\, w_j}$$
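Because the soft updates are formally identical to the hard ones, the `cluster_stats` sketch above already implements this step; only the weight matrix changes:

```python
# M step (sketch): same formulas, with soft responsibilities q
# (shape (m, n), rows summing to 1) in place of the binary labels h
w, c, A = cluster_stats(x, q)
```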

### EM clustering (continued)

• The probabilities $q_{ij}$ can be computed from the cluster parameters
• Chicken-and-egg problem: the cluster parameters are needed to compute the probabilities, and the probabilities are needed to compute the cluster parameters
### EM clustering (continued)

The solution: iterate to convergence (a runnable sketch follows the next slide):

• E step: for each data point and each cluster, compute the probability $q_{ij}$ that the point belongs to the cluster (from the current cluster parameters)
• M step: re-compute the cluster parameters for all clusters by weighted averages over all points (using the equations given two slides ago)
### How to compute the probability that a given data point originates from a given process?

• Use Bayes' theorem:

$$q_{ij} = \frac{w_j\, f_j(x_i)}{\sum_k w_k\, f_k(x_i)}$$

This is how the cluster parameters are used to compute the $q_{ij}$.
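Putting the two steps together, here is a hedged sketch of the whole procedure (`e_step` and `em_cluster` are our names; `cluster_stats` and `mixture_log_likelihood` are the sketches from earlier slides). The E step applies Bayes' theorem in log space, and the driver iterates until the log-likelihood stops improving, as prescribed on the previous slide:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(x, w, c, A):
    """q_ij = w_j f_j(x_i) / sum_k w_k f_k(x_i), computed in log space."""
    m, n = x.shape[0], len(w)
    log_q = np.empty((m, n))
    for j in range(n):
        log_q[:, j] = np.log(w[j]) + multivariate_normal.logpdf(x, c[j], A[j])
    log_q -= logsumexp(log_q, axis=1, keepdims=True)  # normalize over j
    return np.exp(log_q)

def em_cluster(x, w, c, A, max_iter=100, tol=1e-6):
    """Alternate E and M steps until the log-likelihood stops improving."""
    prev = -np.inf
    for _ in range(max_iter):
        q = e_step(x, w, c, A)                    # E step
        w, c, A = cluster_stats(x, q)             # M step
        ll = mixture_log_likelihood(x, w, c, A)   # never decreases (see below)
        if ll - prev < tol:
            break
        prev = ll
    return w, c, A, q
```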

### Non-decreasing log-likelihood in the EM method

Let's return to the general EM method: we want to prove that the log-likelihood does not decrease from one iteration to the next. To do so, we introduce two more functionals.

### Entropy and Kullback-Leibler divergence

Define the entropy

$$S(q) = -\sum_i \sum_k q(h_k) \log q(h_k)$$

and the Kullback-Leibler divergence

$$D_{\mathrm{KL}}[q \,;\, p(h \mid x, s)] = \sum_i \sum_k q(h_k) \log \frac{q(h_k)}{p(h_k \mid x_i, s)}$$
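For discrete distributions stored as arrays, both quantities are short computations; a small sketch (our helper names), masking zero probabilities since $0 \log 0 = 0$:

```python
import numpy as np

def entropy(q):
    """S(q) = -sum q log q over all entries (0 log 0 taken as 0)."""
    q = np.asarray(q)
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def kl_divergence(q, p):
    """D_KL[q; p] = sum q log(q / p); always >= 0, zero iff q == p."""
    q, p = np.asarray(q), np.asarray(p)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))
```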

### Non-decreasing log-likelihood I

It can be proven that

$$L = Q(q, s) + S(q) + D_{\mathrm{KL}}[q \,;\, p(h \mid x, s)]$$
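The identity can be verified directly: by Bayes' theorem, $p(x_i \mid h_k, s)\, p(h_k \mid s) = p(h_k \mid x_i, s)\, p(x_i \mid s)$, and $\sum_k q(h_k) = 1$, so

$$Q(q, s) + S(q) + D_{\mathrm{KL}} = \sum_i \sum_k q(h_k) \log \frac{p(x_i \mid h_k, s)\, p(h_k \mid s)}{p(h_k \mid x_i, s)} = \sum_i \sum_k q(h_k) \log p(x_i \mid s) = L$$

(the $-\log q(h_k)$ term from $S$ cancels the $+\log q(h_k)$ term from $D_{\mathrm{KL}}$).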

After the E step, $q^{(t)}(h) = p(h \mid x, s^{(t-1)})$, and therefore $D_{\mathrm{KL}}$ is zero:

$$L(s^{(t-1)}) = Q(q^{(t)}, s^{(t-1)}) + S(q^{(t)})$$

### Non-decreasing log-likelihood II

After the M step, $Q(q^{(t)}, s)$ is maximized w.r.t. $s$ in standard EM [in GEM (generalized EM), $Q$ is increased but not maximized, with the same result], and therefore:

$$Q(q^{(t)}, s^{(t)}) \ge Q(q^{(t)}, s^{(t-1)})$$

In addition we have that:

$$L(s^{(t)}) \ge Q(q^{(t)}, s^{(t)}) + S(q^{(t)})$$

[This is because, for any two pdf's $q$ and $p$: $D_{\mathrm{KL}}[q \,;\, p] \ge 0$.]

### Non-decreasing log-likelihood III

Putting the above results together:

$$L(s^{(t)}) \ge Q(q^{(t)}, s^{(t)}) + S(q^{(t)}) \ge Q(q^{(t)}, s^{(t-1)}) + S(q^{(t)}) = L(s^{(t-1)})$$

which proves that $L$ is non-decreasing.
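As a sanity check, the sketches from the clustering slides can be run on synthetic data to confirm the claim numerically (the seed, cluster layout, and initialization below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic data: two Gaussian clusters in 2-D
x = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(150, 2)),
               rng.normal([4.0, 4.0], 0.5, size=(100, 2))])

# crude initialization for n = 2 clusters
w = np.array([0.5, 0.5])
c = x[rng.choice(len(x), size=2, replace=False)]
A = np.stack([np.eye(2), np.eye(2)])

prev = -np.inf
for t in range(20):
    q = e_step(x, w, c, A)                    # E step
    w, c, A = cluster_stats(x, q)             # M step
    ll = mixture_log_likelihood(x, w, c, A)
    assert ll >= prev - 1e-9                  # L never decreases
    prev = ll
```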