
EM Algorithm – Motif Finding Tutorial #09


Presentation Transcript


  1. EM Algorithm – Motif Finding Tutorial #09 This class has been edited from Bud Mishra's lecture, which is available at www.cs.nyu.edu/mishra/COURSES/02.COBIO. Changes made by Ydo Wexler.

  2. Expectation Maximization (EM) • A general-purpose method for learning from incomplete data. Intuition: • If we had access to the counts, we could estimate the parameters • However, missing values do not allow us to perform the counts • So we "complete" the counts using the current parameter assignment

  3. Expected Counts
  Data (five instances over variables X, Y, Z; "?" marks a missing value):
    X: H T H H T
    Y: ? ? H T T
    Z: T ? ? T H
  Current model: P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, θ) = 0.4
  Expected counts N(X,Y) = "real" counts + "missing" (expected) counts:
    X  Y  N(X,Y)
    H  H  1 + 0.3 = 1.3
    H  T  1 + 0.7 = 1.7
    T  H  0 + 0.4 = 0.4
    T  T  1 + 0.6 = 1.6
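
As a concrete illustration, here is a minimal Python sketch of how such expected counts can be completed for this toy example. The function names are hypothetical (not from the lecture); the data rows and the two conditional probabilities are the ones reconstructed from the slide above.

```python
# A minimal sketch of the "expected counts" idea on this slide.
# expected_counts() and p_y_is_h() are hypothetical helper names.
from collections import defaultdict

def p_y_is_h(x, z):
    """Current model: P(Y=H | X, Z, theta) for the two cases given on the slide."""
    if x == "H" and z == "T":
        return 0.3          # P(Y=H | X=H, Z=T, theta) = 0.3
    if x == "T":
        return 0.4          # P(Y=H | X=T, theta) = 0.4
    raise ValueError("case not specified on the slide")

def expected_counts(data):
    """Complete the counts N(X, Y): real counts plus expected 'missing' counts."""
    n = defaultdict(float)
    for x, y, z in data:
        if y is not None:            # observed Y: a real count
            n[(x, y)] += 1.0
        else:                        # missing Y: split the count by P(Y=H | ...)
            p = p_y_is_h(x, z)
            n[(x, "H")] += p
            n[(x, "T")] += 1.0 - p
    return dict(n)

# The five instances from the slide (None stands for a missing value "?")
data = [("H", None, "T"), ("T", None, None), ("H", "H", None),
        ("H", "T", "T"), ("T", "T", "H")]
print(expected_counts(data))
# -> {('H','H'): 1.3, ('H','T'): 1.7, ('T','H'): 0.4, ('T','T'): 1.6}
```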

  4. EM (cont.) [Diagram: starting from an initial network (G, θ0) and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); the M-step reparameterizes these counts into an updated network (G, θ1); then reiterate.]

  5. MLE from Incomplete Data • Finding MLE parameters: a nonlinear optimization problem • The slide contrasts the incomplete-data log-likelihood log P(x | θ) with the surrogate Eθ'[log P(x, y | θ)]

  6. MLE from Incomplete Data (cont.) • Expectation Maximization (EM): use the "current point" θ' to construct an alternative function (which is "nice") • Guarantee: the maximum of the new function has a higher likelihood than the current point
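
A brief sketch of why this guarantee holds, in the notation above (not taken from the slides, just the standard EM lower-bound argument, writing θ' for the current point):

```latex
% Q(theta | theta') is the expected complete-data log-likelihood at the current point theta'
\[
  Q(\theta \mid \theta') = \mathrm{E}_{\theta'}\bigl[\log P(x, y \mid \theta) \,\bigm|\, x\bigr]
\]
% A standard consequence of Jensen's inequality: for every theta,
\[
  \log P(x \mid \theta) - \log P(x \mid \theta')
  \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')
\]
% Hence any theta that improves Q (in particular its maximizer, the M-step output)
% cannot decrease the incomplete-data likelihood -- the guarantee stated on the slide.
```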

  7. Sequence Motifs • A sequence pattern of biological significance. • Examples: • DNA: protein-binding sites (e.g., promoters, regulatory sequences) • Protein: sequences corresponding to conserved pieces of structure (e.g., local features, at various scales: blocks, domains & families)

  8. Sequence Motifs - EM Algorithm • Use the EM (Expectation Maximization) algorithm to find multiple motifs in a set of sequences. • Description of a motif: • W = (fixed) width of the motif • P = [plc] = matrix of probabilities that letter l occurs at position c; an |S| × W matrix, where S is the alphabet

  9. Example • DNA motif of width W = 3 • S = {A, T, C, G} • r = a motif • Pr = a 4 × 3 stochastic matrix (shown on the slide)
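
For concreteness, a minimal sketch of one way to hold such a matrix in code. Only the entries marked as coming from the worked examples later in the tutorial (slides 14 and 17) appear in the lecture; the remaining values are made-up fillers chosen only so that each column sums to 1.

```python
# A 4 x 3 column-stochastic motif matrix P, keyed by letter (a sketch, not the lecture's matrix).
W = 3
P = {            # position:   1     2     3
    "A": [0.1, 0.4, 0.2],   # 0.1 from slide 17; 0.4, 0.2 are fillers
    "C": [0.4, 0.2, 0.1],   # 0.2, 0.1 from slide 17; 0.4 is a filler
    "G": [0.3, 0.1, 0.6],   # 0.3, 0.6 from slide 17; 0.1 from slide 14
    "T": [0.2, 0.3, 0.1],   # 0.2, 0.1 from slide 14; 0.3 is a filler
}
# every column (motif position) must be a probability distribution
assert all(abs(sum(P[l][c] for l in P) - 1.0) < 1e-9 for c in range(W))
```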

  10. Computational Problem • Given: • A set of sequences, G • A width parameter W • Find: • Motifs of width W common to the sequences in G, and their probabilistic descriptions. • Assume: • There is one motif occurrence in each sequence • The probability that the motif appears at a given location is uniform over all locations • Note that the motif start sites in each sequence are unknown (hidden).

  11. Basic EM Approach • Total number of sequences = m • Minimum length of a sequence = l • Z = matrix of probabilities, with one row per sequence and one column per candidate start position (1, 2, 3, 4, ...) • zij = probability that the motif starts at position j in sequence i

  12. EM Algorithm • Set initial values for P • do • Re-estimate Z from P • Re-estimate P from Z • until change in P < ε • return P
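
A minimal Python sketch of this loop (hypothetical function and parameter names; the E-step and M-step helpers it expects, estimate_Z and estimate_P, are sketched after slides 15 and 16 below):

```python
# Sketch of the outer EM loop from this slide: alternate the two re-estimation
# steps until the motif matrix P stops changing by more than eps.
def em(sequences, W, P0, e_step, m_step, eps=1e-4, max_iters=1000):
    """Run EM with the given E-step and M-step callables; return the final P."""
    P = P0
    for _ in range(max_iters):
        Z = e_step(P, sequences, W)        # re-estimate Z from P
        P_new = m_step(Z, sequences, W)    # re-estimate P from Z
        # convergence test: largest change in any entry of P
        delta = max(abs(P_new[l][k] - P[l][k]) for l in P for k in range(W))
        P = P_new
        if delta < eps:
            break
    return P
```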

  13. EM Algorithm • Some definitions: • Si – the ith sequence • Iij = 1 if the motif starts at position j in sequence Si, and 0 otherwise • lkj – the character at position (j-1)+k in sequence Si • Pr(Si | Iij = 1, ρ) = Π(k=1..W) p(lkj, k), which measures how well Si fits the motif at position j
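
A minimal sketch of this quantity in code (hypothetical helper name; P is a letter-keyed dict of per-position probabilities as in the sketch after slide 9; j is 1-based as on the slide):

```python
# Pr(S_i | I_ij = 1, rho): the product of motif probabilities of the W characters
# in the window of seq starting at 1-based position j.
def window_likelihood(seq, j, W, P):
    """Probability of the width-W window of seq starting at 1-based position j."""
    prob = 1.0
    for k in range(1, W + 1):
        letter = seq[(j - 1) + k - 1]   # the character at position (j-1)+k, 0-indexed
        prob *= P[letter][k - 1]        # p_{letter, k}
    return prob

# e.g. with the matrix sketched after slide 9:
#   window_likelihood("AGGCTGTAGACAC", 5, 3, P) == 0.2 * 0.1 * 0.1  (slide 14)
```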

  14. Example Si = AGGCTGTAGACAC • The width-3 window starting at position 5 is TGT, so Pr(TGT | Ii5 = 1, ρ) = rT,1 × rG,2 × rT,3 = 0.2 × 0.1 × 0.1 = 2 × 10^-3

  15. Estimating Z • zij = Pr(Iij = 1 | r, Si) estimates the motif starting position in each Si:
  zij = Pr(Iij = 1 | r, Si)
      = Pr(Si, Iij = 1 | r) / Pr(Si | r)
      = Pr(Si | Iij = 1, r) Pr(Iij = 1) / Σk Pr(Si | Iik = 1, r) Pr(Iik = 1)
      = Pr(Si | Iij = 1, r) / Σk Pr(Si | Iik = 1, r)
  • This follows from an application of Bayes' rule and the assumption that "it is equally likely that the motif will start in any position", i.e. Pr(Iij = 1) = Pr(Iik = 1).
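
A minimal Python sketch of this E-step (hypothetical helper name; P is the letter-keyed matrix sketched after slide 9):

```python
# E-step: z_ij proportional to Pr(S_i | I_ij = 1, r), normalized over all start
# positions, using the uniform-prior assumption Pr(I_ij = 1) = Pr(I_ik = 1).
def estimate_Z(P, sequences, W):
    """Return Z as a list (per sequence) of start-position probabilities."""
    Z = []
    for seq in sequences:
        likes = []
        for j in range(len(seq) - W + 1):     # candidate starts, 0-based here
            prob = 1.0
            for k in range(W):
                prob *= P[seq[j + k]][k]      # p_{character, k+1}
            likes.append(prob)
        total = sum(likes)                    # > 0 when P has no zero entries
        Z.append([v / total for v in likes])
    return Z
```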

  16. Estimating Pr • Given Z, estimate the probability pck that character c occurs at the kth position of the motif: • pck = (nck + 1) / Σd (ndk + 1) • where nck is the expected number of occurrences of character c at the kth position of the motif (it would be an exact count if the motif start positions were known; here it is accumulated from Z) • The 1's are pseudocounts added to avoid division by 0
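
A minimal Python sketch of this M-step (hypothetical helper name), accumulating the expected counts nck from Z and applying the +1 pseudocounts from the slide:

```python
# M-step: n_{c,k} = expected number of times character c appears at motif
# position k, weighted by the start-position probabilities in Z.
def estimate_P(Z, sequences, W, alphabet="ACGT"):
    """Return the re-estimated motif matrix P from the expected counts."""
    n = {c: [0.0] * W for c in alphabet}
    for seq, z_row in zip(sequences, Z):
        for j, z in enumerate(z_row):          # j = candidate start, 0-based
            for k in range(W):
                n[seq[j + k]][k] += z          # expected count at position k+1
    P = {c: [0.0] * W for c in alphabet}
    for k in range(W):
        denom = sum(n[d][k] + 1 for d in alphabet)   # sum_d (n_{d,k} + 1)
        for c in alphabet:
            P[c][k] = (n[c][k] + 1) / denom          # p_{c,k} = (n_{c,k} + 1) / denom
    return P
```

With estimate_Z and estimate_P, the loop sketched after slide 12 can be run end to end.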

  17. Example Si = AGGCTGTAGACAC
  zi1 ∝ Pr(AGG | Ii1 = 1, ρ) = 0.1 × 0.1 × 0.6 = 6 × 10^-3
  zi2 ∝ Pr(GGC | Ii2 = 1, ρ) = 0.3 × 0.1 × 0.1 = 3 × 10^-3
  zi3 ∝ Pr(GCT | Ii3 = 1, ρ) = 0.3 × 0.2 × 0.1 = 6 × 10^-3
  ... each zij is then divided by the sum over all start positions.

  18. Example
  s1 : A C A G C A
  s2 : A G G C A G
  s3 : T C A G T C
  With W = 3, the windows whose first character is A start at z1,1, z1,3, z2,1 and z3,3, so
  pA,1 = (z1,1 + z1,3 + z2,1 + z3,3 + 1) / (z1,1 + z1,2 + … + z3,3 + z3,4 + 4)
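
A quick numeric check of this formula in Python. The three sequences are the ones from the slide; the Z values here are made up (simply uniform over the four possible start positions) purely for illustration:

```python
# Numeric check of the p_A,1 formula on this slide with a hypothetical uniform Z.
seqs = ["ACAGCA", "AGGCAG", "TCAGTC"]
W = 3
Z = [[0.25] * (len(s) - W + 1) for s in seqs]   # 3 x 4 matrix, made-up values
# numerator: z's of the windows whose first character is 'A', plus pseudocount 1
num = sum(Z[i][j] for i, s in enumerate(seqs)
          for j in range(len(s) - W + 1) if s[j] == "A") + 1
# denominator: all z's plus one pseudocount per letter (|S| = 4)
den = sum(sum(row) for row in Z) + 4
print(num / den)   # (0.25*4 + 1) / (3 + 4) = 2/7 ≈ 0.2857
```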
