
Pair HMMs and edit distance

This lecture covers Pair HMMs and edit distance, following the Ristad & Yianilos paper: HMM notation, inference, and learning; Pair HMM notation, inference, and learning; distances based on Pair HMMs; and advanced Pair HMMs.


Presentation Transcript


  1. Pair HMMs and edit distance Ristad & Yianilos

  2. Special meeting Wed 4/14
  • What: Evolving and Self-Managing Data Integration Systems
  • Who: AnHai Doan, Univ. of Illinois at Urbana-Champaign
  • When: Wednesday, April 14, 2004 @ 11am (food at 10:30am)
  • Where: Sennott Square Building, room 5317

  3. Special meeting 4/28 (last class)
  • First International Joint Conference on Information Extraction, Information Integration, and Sequential Learning
  • 10:30-11:50 am, Wean Hall 4601
  • All project proposals have been accepted as paper abstracts, and you're all invited to present for 10 minutes (including questions)

  4. Pair HMMs – Ristad & Yianilos
  • HMM review: notation; inference (forward algorithm); learning (forward-backward & EM) [last week]
  • Pair HMMs: notation; generating edit strings; distance metrics (stochastic, Viterbi); inference (forward); learning (forward-backward & EM) [today]
  • Results from R&Y paper: K-NN with trained distance, hidden prototypes; problem: phoneme strings => words
  • Advanced Pair HMMs: adding state (e.g. for affine gap models); Smith-Waterman?; CRF training?

  5. HMM Notation

  6. HMM Example Sample output: xT=heehahaha, sT=122121212 [diagram: two-state HMM with transition probabilities Pr(1->1), Pr(1->2), Pr(2->1), Pr(2->2) and emission probabilities Pr(1->x), Pr(2->x)]
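In code, a two-state HMM like this generates output by alternately emitting and transitioning. A minimal sketch; the structure matches the diagram, but all probabilities here are illustrative, not from the slide:

import random

# Two-state HMM: transition probabilities Pr(l -> l') and emission
# probabilities Pr(l -> x). Numbers are made up for illustration.
trans = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.4, 2: 0.6}}
emit  = {1: {'h': 0.5, 'e': 0.5}, 2: {'e': 0.5, 'a': 0.5}}

def sample(T, state=1):
    """Sample an output string x and state sequence s of length T."""
    x, s = [], []
    for _ in range(T):
        s.append(state)
        chars, weights = zip(*emit[state].items())
        x.append(random.choices(chars, weights)[0])
        nxt, weights = zip(*trans[state].items())
        state = random.choices(nxt, weights)[0]
    return ''.join(x), ''.join(map(str, s))

print(sample(9))   # one possible draw: ('heehahaha', '122121212')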

  7. HMM Inference Key point: Pr(si=l) depends only on Pr(l'->l) and si-1 [trellis diagram over x1 x2 x3 ... xT]

  8. HMM Inference Key point: Pr(si=l) depends only on Pr(l'->l) and si-1, so you can propagate probabilities forward... [trellis diagram over x1 x2 x3 ... xT]

  9. HMM Inference – Forward Algorithm [trellis diagram over x1 x2 x3 ... xT]
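A hedged sketch of the forward pass in Python (parameter names are mine; the recurrence is the one the slides describe: each alpha(l,t) depends only on the alpha(l',t-1), Pr(l'->l), and Pr(l->x)):

def forward(x, states, start, trans, emit):
    """alpha[t][l] = Pr(x[0..t], s_t = l), propagated left to right."""
    alpha = [{l: start[l] * emit[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(alpha[t-1][lp] * trans[lp][l] for lp in states)
                      for l in states})
    return alpha   # Pr(x) = sum(alpha[-1][l] for l in states)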

  10. HMM Learning - EM Expectation maximization:
  • Find expectations, i.e. Pr(si=l) for i=1,...,T: forward algorithm + epsilon; hidden variables are states s at times t=1,...,t=T
  • Maximize probability of parameters given expectations: replace #(l'->l)/#(l') with weighted version of counts; replace #(l'->x)/#(l') with weighted version

  11. HMM Inference Forward algorithm: computes probabilities α(l,t) based on information in the first t letters of the string, ignores "downstream" information [trellis diagram over x1 x2 x3 ... xT]

  12. HMM Inference [trellis diagram over x1 x2 x3 ... xT]

  13. HMM Learning - EM Expectation maximization:
  • Find expectations, i.e. Pr(si=l) for i=1,...,T: forward-backward algorithm; hidden variables are states s at times t=1,...,t=T
  • Maximize probability of parameters given expectations: replace #(l'->l)/#(l') with weighted version of counts; replace #(l'->x)/#(l') with weighted version
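A sketch of one EM iteration along these lines, reusing forward() from the earlier sketch; backward() mirrors it, and the M-step is exactly the "weighted version of counts" from the slide. This is standard Baum-Welch; variable names are mine:

def backward(x, states, trans, emit):
    """beta[t][l] = Pr(x[t+1..] | s_t = l), propagated right to left."""
    beta = [{l: 1.0 for l in states} for _ in x]
    for t in range(len(x) - 2, -1, -1):
        for l in states:
            beta[t][l] = sum(trans[l][lp] * emit[lp].get(x[t+1], 0.0) * beta[t+1][lp]
                             for lp in states)
    return beta

def em_step(x, states, start, trans, emit):
    alpha = forward(x, states, start, trans, emit)
    beta = backward(x, states, trans, emit)
    Z = sum(alpha[-1][l] for l in states)                # Pr(x)
    n = {l: {lp: 0.0 for lp in states} for l in states}  # expected #(l -> l')
    m = {l: {} for l in states}                          # expected #(l -> x)
    for t in range(len(x)):
        for l in states:
            gamma = alpha[t][l] * beta[t][l] / Z         # Pr(s_t = l | x)
            m[l][x[t]] = m[l].get(x[t], 0.0) + gamma
            if t + 1 < len(x):
                for lp in states:                        # Pr(s_t=l, s_{t+1}=l' | x)
                    n[l][lp] += (alpha[t][l] * trans[l][lp] *
                                 emit[lp].get(x[t+1], 0.0) * beta[t+1][lp]) / Z
    # M-step: weighted counts replace #(l'->l)/#(l') and #(l'->x)/#(l')
    trans = {l: {lp: n[l][lp] / sum(n[l].values()) for lp in states} for l in states}
    emit = {l: {c: v / sum(m[l].values()) for c, v in m[l].items()} for l in states}
    return trans, emit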

  14. Pair HMM Notation

  15. Pair HMM Example 1

  16. Pair HMM Example 1
  Sample run: zT = <h,t>,<e,e>,<e,e>,<h,h>,<e,->,<e,e>
  Strings x,y produced by zT: x=heehee, y=teehe
  Notice that x,y is also produced by z4 + <e,e>,<e,-> and many other edit strings
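A small sketch of the mapping from an edit string z to the pair (x, y): each operation <a,b> contributes a to x and b to y, with '-' standing for the empty symbol:

def strings_of(z):
    """Recover (x, y) from an edit string z of <a,b> operations."""
    x = ''.join(a for a, b in z if a != '-')
    y = ''.join(b for a, b in z if b != '-')
    return x, y

z = [('h','t'), ('e','e'), ('e','e'), ('h','h'), ('e','-'), ('e','e')]
print(strings_of(z))   # ('heehee', 'teehe')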

  17. Distances based on pair HMMs

  18. Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
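A hedged sketch of that DP for a memoryless (one-state) pair HMM in the spirit of Ristad & Yianilos: alpha[i][j] sums the probability of all edit strings generating the prefixes x[:i] and y[:j]. The parameter tables (delta for substitutions, gap_x for deletions <a,->, gap_y for insertions <-,b>, plus a stop probability) are my names, not the paper's:

import math

def stochastic_edit_distance(x, y, delta, gap_x, gap_y, stop):
    """-log Pr(x, y), summing over all edit strings that produce (x, y)."""
    n, m = len(x), len(y)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):                  # fill top-down,
        for j in range(m + 1):              # left-to-right
            if i > 0:                       # deletion <x[i-1], ->
                alpha[i][j] += alpha[i-1][j] * gap_x.get(x[i-1], 0.0)
            if j > 0:                       # insertion <-, y[j-1]>
                alpha[i][j] += alpha[i][j-1] * gap_y.get(y[j-1], 0.0)
            if i > 0 and j > 0:             # substitution <x[i-1], y[j-1]>
                alpha[i][j] += alpha[i-1][j-1] * delta.get((x[i-1], y[j-1]), 0.0)
    return -math.log(alpha[n][m] * stop)

Replacing each sum with a max gives the Viterbi distance from the outline: the cost of the single best edit string rather than of all of them.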

  19. Pair HMM Inference [diagram over x1 x2 x3 ... xT]

  20. Pair HMM Inference

  21. Pair HMM Inference One difference: after i emissions of the pair HMM, we do not know the column position [diagram: cells labeled i=1, i=2, i=3 showing the positions reachable after i emissions]

  22. Pair HMM Inference: Forward-Backward

  23. Multiple states [diagram: three states 1, 2, 3]

  24. An extension: multiple states. Conceptually, add a "state" dimension to the model; EM methods generalize easily to this setting. [diagram: trellis indexed by t=1,...,T and v=1,...,K for states l=1, l=2]
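An illustrative fragment of what the extra state dimension buys (state names and probabilities are mine, not from the slides): with a match state and two self-looping gap states, gap cost becomes affine, i.e. opening a gap is priced differently from extending it.

# Three-state pair HMM in the affine-gap style: M emits substitutions <a,b>,
# D emits deletions <a,->, I emits insertions <-,b>. Numbers are illustrative.
trans = {
    'M': {'M': 0.90, 'D': 0.05, 'I': 0.05},
    'D': {'M': 0.60, 'D': 0.35, 'I': 0.05},   # self-loop = cheap gap extension
    'I': {'M': 0.60, 'D': 0.05, 'I': 0.35},
}
# The forward DP gains a state index: alpha[v][i][j] sums over predecessor
# states v', weighting each path by trans[v'][v] as well as by the emission.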

  25. Back to R&Y paper...
  • They consider "coarse" and "detailed" models, as well as mixtures of both.
  • Coarse model is like a back-off model – merge edit operations into equivalence classes (e.g. based on equivalence classes for chars).
  • Test by learning distance for K-NN with an additional latent variable
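One way to picture the coarse model; the equivalence classes below are hypothetical, chosen only to illustrate the back-off idea:

VOWELS = set('aeiou')

def char_class(c):
    """Map a character to a coarse equivalence class (hypothetical classes)."""
    return '-' if c == '-' else ('V' if c in VOWELS else 'C')

def coarse_op(a, b):
    return (char_class(a), char_class(b))   # e.g. ('e','a') -> ('V','V')

# A mixture interpolates the two models, so rare detailed operations back
# off to their class: w * detailed[(a,b)] + (1-w) * coarse[coarse_op(a,b)]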

  26. K-NN with latent prototypes [diagram: test example y (a string of phonemes) is matched by a learned phonetic distance to possible prototypes x1, x2, x3, ..., xm (known word pronunciations) for dictionary words w1, w2, ..., wK]

  27. K-NN with latent prototypes Method needs (x,y) pairs to train a distance – to handle this, an additional level of E/M is used to pick the "latent prototype" to pair with each y [same diagram as above]
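A rough sketch of that outer E/M loop; prototypes_for and retrain_distance are hypothetical helpers standing in for the dictionary lookup and the pair-HMM training step:

import math

def latent_prototype_em(train, dist, prototypes_for, retrain_distance, iters=10):
    """train: (y, word) pairs; dist: current distance d(x, y); returns new dist."""
    for _ in range(iters):
        weighted_pairs = []
        for y, word in train:
            protos = prototypes_for(word)
            # E-step: responsibility of each candidate prototype x for y
            scores = [math.exp(-dist(x, y)) for x in protos]
            Z = sum(scores)
            weighted_pairs += [(x, y, s / Z) for x, s in zip(protos, scores)]
        # M-step: refit the edit distance on responsibility-weighted pairs
        dist = retrain_distance(weighted_pairs)
    return dist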

  28. Hidden prototype K-NN

  29. Experiments
  • E1: on-line pronunciation dictionary
  • E2: subset of E1 with corpus words
  • E3: dictionary from training corpus
  • E4: dictionary from training + test corpus (!)
  • E5: E1 + E3

  30. Experiments

  31. Experiments

  32. Special meeting Wed 4/14
  • What: Evolving and Self-Managing Data Integration Systems
  • Who: AnHai Doan, Univ. of Illinois at Urbana-Champaign
  • When: Wednesday, April 14, 2004 @ 11am (food at 10:30am)
  • Where: Sennott Square Building, room 5317

  33. Special meeting 4/28 (last class)
  • First International Joint Conference on Information Extraction, Information Integration, and Sequential Learning
  • 10:30-11:50 am, Wean Hall 4601
  • All project proposals have been accepted as paper abstracts, and you're all invited to present for 10 minutes (including questions)
