Three examples of the EM algorithm

Three examples of the EM algorithm Week 12, Lecture 1 Statistics 246, Spring 2002

The estimation of linkage from the offspring of selfed heterozygotesR A Fisher and Bhai Balmukand, Journal of Genetics20 79-92 (1928)See also: Collected Papers of R A Fisher, Volume II, Paper 71, pp 285-298.

The problem In modern terminology, we have two linked bi-allelic loci, A and B say, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b. A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab. Since the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say 1-r and r, respectively, in males, and 1-r’ and r’ respectively in females. Here we suppose that the parental origin of these heterozygotes is from the mating AABB  aabb, so that r and r’ are the male and female recombination rates between the two loci. The problem is to estimate r and r’, if possible, from the offspring of selfed double heterozygotes.

Offspring genotypic and phenotypic frequencies Since gametes AB, Ab, aB and ab are produced in proportions (1-r)/2, r/2, r/2 and (1-r)/2 respectively by the male parent, and (1-r’)/2, r’/2, r’/2 and (1-r’)/2 respectively by the female parent, zygotes with genotypes AABB, AaBB, …etc, are produced with frequencies (1-r)(1-r’)/4, (1-r)r’/4, etc. Exercise: Complete the Punnett square of offspring genotypes and their associated frequencies. The problem here is this: although there are 16 distinct offspring genotypes, taking parental origin into account, the dominance relations imply that we only observe 4 distinct phenotypes, which we denote by A*B*, A*b*, a*B* and a*b*. Here A* (res. B*) denotes the dominant, while a* (resp. b*) denotes the recessive phenotype determined by the alleles at A (resp. B).

Offspring genotypic and phenotypic probabilities, cont. Thus individuals with genotypes AABB, AaBB, AABb or AaBb, which account for 9/16 gametic combinations (check!), all exhibit the phenotype A*B*, i.e. the dominant alternative in both characters, while those with genotypes AAbb or Aabb (3/16) exhibit the phenotype A*b*, those with genotypes aaBB and aaBb (3/16) exhibit the phenotype a*B*, and finally the double recessives aabb (1/16) exhibit the phenotype a*b*. It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the parameter = (1-r)(1-r’), as follows: a*b* has probability /4 (easy to see), a*B* and A*b* both have probabilities (1-)/4, while A*B* has probability 1 minus the sum of the preceding, which is (2+)/4. Exercise: Calculate these phenotypic probabilities.

Estimation of  Now suppose we have a random sample of n offspring from the selfing of our double heterozygote. Thus the 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial Mult [n; (2+ )/4, (1-)/4, (1-)/4, /4]. Note that here neither r nor r’ will be separately estimable from these data, but only the product (1-r)(1-r’). Note that since we know that r≤1/2 and r’≤1/2, it follows that  ≥1/4. How do we estimate ? Fisher and Balmukand discuss a variety of methods that were in the literature at the time they wrote, and compare them with maximum likelihood, which is the method of choice in problems like this. We describe a variant on their approach to illustrate the EM algorithm.

The incomplete data formulation • Let us denote (cf p. 26 of Week 11b) the counts of the 4 phenotypic classes by y1, y2, y3 and y4, these having probabilities (2+ )/4, (1- )/4, (1-)/4 and /4, respectively. Now the probability of the observing genotype AABB is /4, just as it is with aabb, and although this genotype is phenotypically indistinguishable from the 8 others with phenotype A*B*, it is convenient to imagine that we can distinguish them.So let us denote their count by x1, and let x2 denote count of the remainder of that class, so that x1+x2 = y1. Note that x2 has marginal probability 1/2. In the jargon of the EM algorithm, x1 and x2 are missing data, as we only observe their sum y1. Next, as in p.26 of Week 11b, we let y2=x3, y3=x4 and y4=x5. We now illustrate the approach of the EM algorithm, referring to material in Week 9b and Week 11b for generalities.

The EM algorithm for this problem • The complete data log likelihood at  is (Ex: Check): • (x2+x5)log + (x3+x4)log(1- ). • The expected value of the complete data log likelihood given observed data taken at ’ (E-step: think of ’ as -initial) is: • (E’(x2|y1)+y4)log + (y2+y3)log(1-). • Now E’(x2|y1) is just ky1, where k= ’/(2+’). (Ex: Check.) • The maximum over  of the expected value of the complete data log likelihood given observed data taken at ’ (M-step) • occurs at ’’ = (ky1+y4)/(ky1+y2+y3+y4). (Ex: Check) • Here we think of ’’ as -next. • It should now be clear how the E-step (calculation of k) and the M-step (calculation of ’’) can be iterated.

Comments on this example • This completes our discussion of this example. It appeared in the famous EM paper (Dempster, Laird and Rubin, JRSSB 1977) without any explanation of its genetic origins. Of course it is an illustration of the EM only, for the actual likelihood equation generated by the observed data only is a quadratic, and so easy to solve (see Fisher & Balmukand). Thus it is not necessary to use the EM in this case (some would say in any case, but that is for another time). We have omitted any of the fascinating detail provided in Fisher and Balmukand, and similarly in Dempster et al. Read these papers: both are classics, with much of interest to you. Rather than talk about details concerning the EM (most importantly, starting and stopping it, the issue of global max, and SEs for parameter estimates), I turn to another important EM example: mixtures.

Fitting a mixture model by EM to discover motifs in biopolymersT L Bailey and C Elkan, UCSD Technical Report CS94-351; ISMB94. • Here we outline some more EM theory, this being relevant to motif discovery. We follow the above report, as we will be discussing the program MEME written by these authors in a later lecture. This part is called MM. • A finite mixture model supposes that data X = (X1,…,Xn) arises from two or more groups with known distributions but different, unknown parameters  = (1, …, g), where g is the number of groups, and mixing parameters  = (1,…, g), where the s are non-negative and sum to 1. • It is convenient to introduce indicator vectors Z = (Z1,…,Zn), where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and = 0 otherwise. Thus Zi gives group membership for the ith sample. It follows that pr(Zij= 1 | , ) = j . For any given i, all Zij are 0 apart from one.

Complete data log likelihood • Under the assumption that the pairs (Zi,Xi) are mutually independent, their joint density may be written (Exercise: Carry out this calculation in detail.) • pr(Z, X | , ) = ∏ij [j pr(Xi | j) ] Zij • The complete-data log likelihood is thus • log L(,  | Z, X) = ∑∑ Zij log [ j pr(Xi | j) ]. • The EM algorithm iteratively computes the expectation of this quantity given the observed data X, and initial estimates ’ and ’ of  and  (the E-step), and then maximizes the result in the free variables  and  leading to new estimates ’’ and ’’ (the M-step). Our interest here is in the particular calculations necessary to carry out these two steps.

Mixture models: the E-step • Since the log likelihood is the sum of over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such, given Xi. Using initial parameter values ’ and ’, and the fact that the Zij are binary, we get • E( Zij | X, ’, ’) = ’j pr(Xi|’j)/ ∑k ’kpr(Xi|’k) = Z’ij, say. • Exercise: Obtain this result.

Mixture models: the M-step • Here our task is to maximize the result of an E-step: • ∑∑ Z’ij j +∑∑ Z’ij log pr(Xi | j). • The maximization over  is clearly independent of the rest and is readily seen to be achieved (Ex: check this) with • j’’ = ∑iZ’ij / n. • Maximizing over  requires that we specify the model in more detail. The case of interest to us is where g = 2, and the distributions for class 1 (the motif) and class 2 (the background) are given by position specific and a general multinomial distribution, respectively.

Mixture models: the M-step, cont. • Our initial observations are supposed to consist of N sequences from an L-letter alphabet. Unlike what we did in the last lecture, these sequences are now broken up into all n overlapping subsequences of length w, and these are our Xi. We need w+1 sets of probabilities, namely fjk and f0k, where j=1,…,w (the length of the motif) and k runs over the symbol alphabet. • With these parameters, we can write • pr(Xi | 1) = ∏j∏k fjkI(k,Xij) , and pr(Xi | 2) = ∏j∏k f0kI(k,Xij) • where Xij is the letter in the jth position of sample i, and I(k,a) = 1 if a=ak, and =0 otherwise. With this notation, for k=1,…L, write • c0k = ∑∑ Z’i2I(k,Xij), and cjk = ∑∑ Z’i1I(k,Xij). • Here c0k is the expected number of times letter ak appears in the background, and cjk the expected number of times ak appears in occurrences of the motif in the data.

Mixture models: the M-step completed. • With these preliminaries, it is straightforward to maximize the expected complete-data log likelihood, given the observations X, evaluated at initial parameters ’ and ’. Exercise: Fill in the details missing below. We obtain • f’’jk = cjk / ∑k=1L cjk , j = 0,1,…,w; k = 1,…,L. • In practice, care must be taken to avoid zero frequencies, so that either one uses explicit Dirichlet prior distributions, or one adds small constants j ,∑j = , giving • f’’jk = (cjk+ j )/ (∑k=1L cjk + ) , j = 0,1,…,w; k = 1,…,L.

Comment on the EM algorithm • A common misconception concerning the EM algorithm is that we are estimating or predicting the missing data, plugging that estimate or prediction into the complete data log likelihood, “completing the data” you might say, and then maximizing this in the free parameters, as though we had complete data. • This is emphatically NOT what is going on. • A more accurate description might be this. We are using the inferred missing data to weight different parts of the complete data log likelihood differently in such a way that the pieces combine into the maximum likelihood estimates.

An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer SequencesC E Lawrence and A A ReillyPROTEINS: Structure, Function and Genetics7:41-51 (1990) • This paper was the precursor of the Gibbs-Metropolis algorithm we described in Week11b. We now briefly describe the algorithm along the lines of the discussion just given, but in the notation used in Week11b. On p. .. of that lecture, the full data log likelihood was given, without the term for the marginal distribution of A being visible. We suppose that the ak are independent, and uniformly distributed over {1,..,nk-W+1}. Because this term does not depend on either 0 or , it disappears in the likelihood proportionality constant. Thus the complete data log likelihood is • log L(0, |R,A) = h(R{A}c)log0 + ∑jh(RA+j-1)logj , • where we use the notation log0 = ∑iilog0,i cf p.12, Week 11b.

The E-step in this case. • The expected value of the complete data log likelihood given the observed data and initial parameters ’0 and ’ is just∑A pr(A | R, ’0,’) log pr(R, A | 0 , ) (*) • where the sum is over all A = {a1,…,aK}, and so our task is to calculate the first term. Now we are treating all the rows (sequences) as mutually independent, so pr(A | R, ’0,’) factorizes in k, and we need only deal with a typical row, the kth, say. Letting ak denote the random variable corresponding to the start of the motif in the kth row, then • pr(ak = i) = 1/(nk-W+1), i=1, …, nk-W+1. • We can readily calculate pr(ak = i | ’0,’, R) by Bayes’ theorem, and these get multiplied and inserted in (*) above. Exercise: Carry out this calculation.

The M-step in this case. • Once we have calculated the expected value of the complete data log likelihood given the observed data and initial parameters ’0 and ’, we then maximize it in the free variables 0 and , leading to new parameters ’’0 and ’’. • How is this done? Without giving the gory details, just notice that (*) is a weighted combination of multinomial log likelihoods just like the one we met in our previous example, the mixture model. There the weights were Z’s, and here they are pr(A | R, ’0 ,’)s. It follows (Exercise: Fill in the details) that the maximizing values of 0 and , which we denote by ’’0 and ’’, are ratios of expected counts similar to the c0 and cj in the mixture discussion. As there, we will want to deal with small or zero counts by invoking Dirichlet priors.

Three examples of the EM algorithm

Three examples of the EM algorithm

Presentation Transcript

The EM algorithm

EM algorithm

The EM algorithm (Part 1)

EM algorithm and applications

Three examples …

The EM algorithm

Expectation-Maximization (EM) Algorithm

Examples of the Unification Algorithm

The EM algorithm

Three examples of division of labor

Expectation-Maximization (EM) Algorithm

EM Algorithm

Three examples of application of Sussix

The EM algorithm Lecture #11

Expectation–maximization (EM) algorithm

Enhanced EM (EEM) Algorithm

em examples

The Expectation Maximization (EM) Algorithm

EM Algorithm

Three examples …

Algorithm Time Analysis Examples

Speech recognition and the EM algorithm