
Kernels for Relation Extraction & Semi-Markov Models



Presentation Transcript


  1. Kernels for Relation Extraction & Semi-Markov Models William Cohen 3-27-2007

  2. …and announcements • Projects and such: • Last class is (officially) Thurs May 3. • Projects due Fri May 11. • But I was soft about allowing small projects…so • We’ll meet May 8 and May 10 also • Send me mail if you can’t attend • Project presentations start April 17 – two per session, 30 min each, plus questions. • Preliminary reports are ok in your presentation

  3. Announcements • Lectures thru mid-April: • SRL overview this Thurs • Bootstrapping next week • Critique for the Etzioni paper due Tues. • Please keep talks to 20 min of talking

  4. Kernels vs Structured Output Spaces • Two kinds of structured learning: • HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. • E.g., for a linear-chain CRF, the output is a sequence of labels, i.e. a string in Y^n • Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. • EMNLP: structure derived from a dependency graph. New!

  5. x → x’ • x = x1 × x2 × x3 × x4 × x5 gives 4·1·3·1·4 = 48 features • K(x1 × … × xn, y1 × … × yn) = | (x1 × … × xn) ∩ (y1 × … × yn) | • [Figure: the positions x1 … x5 along the dependency path.]
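Because Cartesian products intersect position by position, this count factors into a product of per-position overlaps: K(x, y) = Πi |xi ∩ yi| when the two sequences have the same length, and 0 otherwise. A minimal sketch of that counting argument in Python (the function name and toy data are illustrative, not from the slides):

```python
# Minimal sketch: kernel between two equal-length sequences of feature sets,
# computed as the size of the intersection of their Cartesian products, which
# factors into a product of per-position overlaps |x_i & y_i|.
from math import prod

def dep_path_kernel(x, y):
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))

# Toy data (made up): each position is a set of alternative features.
x = [{"a", "A"}, {"b"}, {"c", "C", "c2"}]
y = [{"a"}, {"b"}, {"c", "C"}]
print(dep_path_kernel(x, x))  # 2 * 1 * 3 = 6 "features" shared by x with itself
print(dep_path_kernel(x, y))  # 1 * 1 * 2 = 2 common features
```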

  6. and the NIPS paper… • Similar representation for relation instances: x1 × … × xn where each xi is a set…. • …but instead of informative dependency path elements, the x’s just represent adjacent tokens. • To compensate: use a richer kernel

  7. Subsequence kernels [Lodhi et al, JMLR 2002] • Example strings: • “Elvis Presley was born on Jan 8” → s1) PERSON was born on DATE. • “William Cohen was born in New York City on April 6” → s2) PERSON was born in LOCATION on DATE. • Plausible pattern: • PERSON was born … on DATE. • What we’ll actually learn: • u = PERSON … was … born … on … DATE. • u matches s if there exist indices i = i1,…,in such that s[i] = s[i1]…s[in] = u • For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7 • i = i1,…,in are increasing indices in s

  8. Subsequence kernels s1) PERSON was born on DATE. s2) PERSON was born in LOCATION on DATE. • Pattern: • u = PERSON … was … born … on … DATE. • u matches s if there exist indices i = i1,…,in such that s[i] = s[i1]…s[in] = u • For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7 • How do we say that s1 matches better than s2? • Weight a match of s to u by λ^length(i), where length(i) = in − i1 + 1 • Now let’s define K(s,t) = the sum, over all u that match both s and t, of matchWeight(u,s) * matchWeight(u,t)
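A brute-force sketch of these two definitions, with λ written as lam (the helper names are illustrative):

```python
# Brute-force sketch of the definitions above: matchWeight(u, s) sums
# lam**length(i) over all increasing index tuples i with s[i] = u, and
# K(s, t) sums matchWeight(u, s) * matchWeight(u, t) over shared patterns u.
from itertools import combinations
from collections import defaultdict

def match_weights(s, n, lam=0.75):
    w = defaultdict(float)
    for idx in combinations(range(len(s)), n):   # increasing indices i1 < ... < in
        u = tuple(s[i] for i in idx)
        w[u] += lam ** (idx[-1] - idx[0] + 1)    # length(i) = in - i1 + 1
    return w

def K(s, t, n, lam=0.75):
    ws, wt = match_weights(s, n, lam), match_weights(t, n, lam)
    return sum(v * wt[u] for u, v in ws.items() if u in wt)

s1 = "PERSON was born on DATE .".split()
s2 = "PERSON was born in LOCATION on DATE .".split()
# u = (PERSON, was, born, on, DATE) contributes lam**5 * lam**7 to K(s1, s2, n=5)
print(K(s1, s2, n=5))
```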

  9. K’i(s,t) = # patterns u that match s and t, where the last index is at the very end of s and t. These recursions (shown on the slide) allow dynamic programming

  10. Subsequence kernel with features • set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to sparsity • Relaxation of old kernel: we don’t have to match everywhere, just at selected locations • For every position we decide to match at, we get a penalty of λ • To pick a “feature” inside (x1 × … × xn)’: • Pick a subset of locations i = i1,…,ik and then • Pick a feature value in each location • In the preprocessed vector x’, weight every feature for i by λ^length(i) = λ^(ik − i1 + 1)

  11. Subsequence kernel • [Recursion equations shown on slide] • where c(x,y) = number of ways x and y match (i.e., number of common features)

  12. [Recursion step shown on slide: terms of the form c(x, t[j]), summed over all j, where c(x, t[j]) = number of ways x and t[j] match (i.e., number of common features)]

  13. [Further recursion steps shown on slide, again with c(x, t[j]) factors summed over all j]
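Putting slides 9-13 together, here is a sketch of the dynamic-programming computation, in the style of the standard Lodhi et al. recursion but generalized so that each position is a set of features and c(x, y) counts common features. It is a reconstruction under those assumptions, not the authors' exact recursion:

```python
# DP sketch of the subsequence kernel with features: tokens are sets of features
# and c(x, y) = |x & y| is the number of ways two tokens match.
def c(x, y):
    return len(x & y)

def subseq_kernel(s, t, n, lam=0.75):
    """K_n(s, t): sum over sparse length-n patterns u of
    matchWeight(u, s) * matchWeight(u, t), each weighted by lam**length(i)."""
    S, T = len(s), len(t)
    # Kp[i][p][q] ~ K'_i on prefixes s[:p], t[:q], with gaps counted to the prefix ends
    Kp = [[[1.0 if i == 0 else 0.0 for _ in range(T + 1)] for _ in range(S + 1)]
          for i in range(n)]
    for i in range(1, n):
        for p in range(i, S + 1):
            Kpp = 0.0                                  # K''_i(p, q), swept over q
            for q in range(i, T + 1):
                Kpp = lam * Kpp + lam ** 2 * c(s[p - 1], t[q - 1]) * Kp[i - 1][p - 1][q - 1]
                Kp[i][p][q] = lam * Kp[i][p - 1][q] + Kpp
    # Final kernel: the last pattern element matches s[p-1] with t[q-1]
    return sum(lam ** 2 * c(s[p - 1], t[q - 1]) * Kp[n - 1][p - 1][q - 1]
               for p in range(n, S + 1) for q in range(1, T + 1))
```

With singleton feature sets (one token per position, so c(x, y) is 1 or 0), this returns the same values as the brute-force sketch above, but in O(n·|s|·|t|) time.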

  14. Additional details • Special domain-specific tricks for combining the subsequence kernels over what matches in the fore, aft, and between sections of a relation-instance pair. • Subsequences are of length less than 4. • Is DP needed for this now? • Count fore-between, between-aft, and between subsequences separately.

  15. Results: protein-protein interaction [results chart shown on slide]

  16. Semi-Markov Models

  17. State-of-the-art NER: sequential word classification • [Figure: tokens x = “I met Prof. F. Douglas at the zoo” with a tag y at each position t] • Question: how can we guide this using a dictionary D? • Simple answer: make membership in D a feature fd • …but what about a dictionary entry like “Fred Douglis”?
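A tiny illustration of the “simple answer” and of why it breaks down for multi-token entries (the dictionary below is made up):

```python
# Token-level dictionary feature f_D: fires if a token occurs in some entry of D.
# The catch: an entry like "Fred Douglis" is only meaningful as a two-token unit,
# and a per-token feature cannot express "these adjacent tokens form an entry",
# let alone match it approximately.
D = {"Fred Douglis", "William Cohen"}                 # hypothetical dictionary
D_TOKENS = {tok for entry in D for tok in entry.split()}

def f_D(token):
    return 1.0 if token in D_TOKENS else 0.0

print([f_D(t) for t in "I met Prof. F. Douglas at the zoo".split()])
# all zeros: "F." and "Douglas" are not tokens of any entry, even though
# "F. Douglas" is close to the entry "Fred Douglis"
```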

  18. Semi-Markov models • [Figure: a token-tagged sequence (x, y, positions t) vs. a segmentation, where each segment has a start l, an end u, and a label y]

  19. Prediction: pick the highest-scoring output. Viterbi finds this efficiently. Many models fit this framework

  20. Proposed semi-Markov model • [Figure: segment features may depend on the previous label, the j-th label, and the start and end of segment Sj]

  21. Modified Viterbi • V(i, y) = best segmentation ending at position i and assigned label y • Maximum segment length L = 4 • [Figure: lattice of V(i, y) entries for positions 1–5 and labels L, P, O]
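A minimal sketch of this modified Viterbi, assuming a user-supplied segment scorer score(x, l, u, y, y_prev) (the names and label set are illustrative, not the lecture's code); it runs in O(n · L · |labels|²):

```python
# Semi-Markov Viterbi sketch: V[i][y] = best score of a segmentation of x[:i]
# whose last segment is labeled y, with segments at most L tokens long.
def semimarkov_viterbi(x, labels, score, L=4):
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    V[0] = {y: 0.0 for y in labels}                     # empty prefix
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):           # last segment covers x[l:i]
                l = i - d
                for y_prev in labels:
                    s = V[l][y_prev] + score(x, l, i, y, y_prev)
                    if s > V[i][y]:
                        V[i][y] = s
                        back[i][y] = (l, y_prev)
    # trace back the best segmentation as (start, end, label) triples
    y = max(V[n], key=V[n].get)
    segs, i = [], n
    while i > 0:
        l, y_prev = back[i][y]
        segs.append((l, i, y))
        i, y = l, y_prev
    return list(reversed(segs))
```

The returned list of (start, end, label) triples is exactly the segment representation (l, u, y) from slide 18.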

  22. [Sarawagi & Cohen, NIPS 2004] Internal dictionary: formed from training examples External dictionary: from external source

  23. Learning Rates for Semi-Markov CRFs
