
Minimum-Risk Training of Approximate CRF-Based NLP Systems


Presentation Transcript


  1. Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner

  2. Overview • We will show significant improvements on three data sets. • How do we do it? • A new training algorithm! • Don’t be afraid of discriminative models with approximate inference! • Use our software instead!

  3. Minimum-Risk Training of Approximate CRF-Based NLP Systems • NLP Systems: (slide shows a passage of lorem-ipsum filler text flowing into a box labeled "NLP System")

  4. Minimum-Risk Training of Approximate CRF-Based NLP Systems • Conditional random fields (CRFs) [Lafferty et al., 2001] • Discriminative models of the conditional probability p(Y|X). • Used successfully for many NLP problems.

  5. Minimum-Risk Training of Approximate CRF-Based NLP Systems • Linear-chain CRF: • Exact inference is tractable. • Training via maximum likelihood estimation is tractable and convex. (figure: a chain of label variables Y1–Y4, each attached to its observation x1–x4)
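
For reference, the linear-chain CRF on this slide is the standard model of Lafferty et al. (2001); the equation itself did not survive the transcript, so here is the textbook form:

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\big( \theta^\top f(y_{t-1}, y_t, x, t) \big), \qquad Z(x) = \sum_{y'} \prod_{t=1}^{T} \exp\big( \theta^\top f(y'_{t-1}, y'_t, x, t) \big)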

  6. Minimum-Risk Training of Approximate CRF-Based NLP Systems • CRFs (like BNs and MRFs) model probability distributions, not predictions. • In NLP we are interested in making predictions. • Build prediction systems around CRFs.

  7. Minimum-Risk Training of Approximate CRF-Based NLP Systems • Inference: compute quantities about the distribution, e.g., per-word marginals over POS tags for "The cat sat on the mat .":
     The   DT .9   NN .05  …
     cat   NN .8   JJ .1   …
     sat   VBD .7  VB .1   …
     on    IN .9   NN .01  …
     the   DT .9   NN .05  …
     mat   NN .4   JJ .3   …
     .     . .99   , .001  …

  8. Minimum-Risk Training of Approximate CRF-Based NLP Systems • Decoding: coming up with predictions based on the probabilities: The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.

  9. Minimum-Risk Training of Approximate CRF-Based NLP Systems • General CRFs: Unrestricted model structure. • Inference is intractable. • Learning? (figure: a loopy factor graph over label variables Y1–Y4 and observations X1–X3)

  10. General CRFs • Why sacrifice tractable inference and convex learning? • Because a loopy model can represent the data better! • Now you can train your loopy CRF using ERMA (Empirical Risk Minimization under Approximations)!

  11. Minimum-Risk Training of Approximate CRF-Based NLP Systems • In linear-chain CRFs, we can use Maximum Likelihood Estimation (MLE): • Compute gradients of the log-likelihood by running exact inference. • The negative log-likelihood is convex, so learning finds a global optimum.
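
Concretely, the gradient that exact inference supplies is the familiar difference between observed and expected feature counts (a standard CRF identity, not shown on the slide):

\nabla_\theta \log p(y \mid x; \theta) = f(x, y) - \mathbb{E}_{y' \sim p(\cdot \mid x; \theta)}\big[ f(x, y') \big]

where f(x, y) = \sum_t f(y_{t-1}, y_t, x, t); the expectation is exactly what forward-backward computes.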

  12. Minimum-Risk Training of Approximate CRF-Based NLP Systems • We use CRFs with several approximations: • Approximate inference. • Approximate decoding. • Mis-specified model structure. • MAP training (vs. Bayesian). • And we are still maximizing data likelihood? (Some of these, e.g. model mis-specification and MAP training, could be present in linear-chain CRFs as well.)

  13. Minimum-Risk Training of Approximate CRF-Based NLP Systems • End-to-End Learning [Stoyanov, Ropson & Eisner, AISTATS 2011]: • We should learn parameters that work well in the presence of approximations. • Match the training and test conditions. • Find the parameters that minimize training loss.

  14. Minimum-Risk Training of Approximate CRF-Based NLP Systems • Select θ that minimizes training loss. • i.e., perform Empirical Risk Minimization under Approximations (ERMA). (figure: a black-box decision function parameterized by θ: x → (approx.) inference → p(y|x) → (approx.) decoding → ŷ, scored by L(y*, ŷ))
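
In symbols (a reconstruction; \hat{y}_\theta here denotes the slide's black-box pipeline of approximate inference plus decoding):

\theta^{*} = \arg\min_\theta \sum_{(x,\, y^{*}) \in \text{training data}} L\big( y^{*}, \hat{y}_\theta(x) \big)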

  15.–18. Optimization Criteria • MLE (build slides; the optimization-criteria equations did not survive the transcript)

  19. Minimum-Risk Training of Approximate CRF-Based NLP Systems through Back-Propagation • Use back-propagation to compute gradients of the output loss with respect to the parameters. • Use a local optimizer to find the parameters that (locally) minimize training loss.
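
The ERMA implementation back-propagates through loopy BP on general CRFs; as a minimal illustration of the same idea, here is a sketch in PyTorch that back-propagates an output loss through exact forward-backward inference on a toy linear chain (all names, sizes, and the MSE loss are illustrative, not the authors' code):

    import torch

    def chain_marginals(U, A):
        # Posterior marginals p(y_t = k | x) of a linear-chain CRF via
        # forward-backward in log space; every operation is differentiable.
        T, K = U.shape
        alpha = [U[0]]
        for t in range(1, T):
            alpha.append(U[t] + torch.logsumexp(alpha[-1].unsqueeze(1) + A, dim=0))
        beta = [None] * T
        beta[-1] = torch.zeros(K)
        for t in range(T - 2, -1, -1):
            beta[t] = torch.logsumexp(A + (U[t + 1] + beta[t + 1]).unsqueeze(0), dim=1)
        return torch.stack([torch.softmax(alpha[t] + beta[t], dim=0) for t in range(T)])

    torch.manual_seed(0)
    T, K, F = 6, 3, 10
    x = torch.randn(T, F)                      # toy token features
    gold = torch.tensor([0, 1, 1, 2, 2, 0])    # toy gold tag sequence
    onehot = torch.eye(K)[gold]
    W = torch.zeros(F, K, requires_grad=True)  # emission weights
    A = torch.zeros(K, K, requires_grad=True)  # transition scores
    opt = torch.optim.Adam([W, A], lr=0.1)
    for step in range(100):
        opt.zero_grad()
        m = chain_marginals(x @ W, A)          # inference inside the training loop
        loss = ((m - onehot) ** 2).mean()      # MSE risk on the beliefs
        loss.backward()                        # back-propagate through inference
        opt.step()

Replacing chain_marginals with a fixed number of loopy-BP sweeps gives the approximate-inference version: the loss gradient then flows through whatever inference actually ran, which is the point of the algorithm.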

  20. Our Contributions • Apply ERMA [Stoyanov, Ropson & Eisner, AISTATS 2011] to three NLP problems. • We show that: • General CRFs work better when they match dependencies in the data. • Minimum-risk training results in more accurate models. • ERMA software package available at www.clsp.jhu.edu/~ves/software

  21. The Rest of this Talk • Experimental results • A brief explanation of the ERMA algorithm

  22. Experimental Evaluation

  23. Implementation • The ERMA software package (www.clsp.jhu.edu/~ves/software) • Includes syntax for describing general CRFs. • Can optimize several commonly used loss functions: MSE, Accuracy, F-score. • The package is generic: • Little effort to model new problems. • About 1-3 days to express each problem in our formalism.
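
To make the three loss functions concrete on a binary task, plain-Python illustrations (not the package's API):

    def mse(gold, probs):
        # gold: 0/1 labels; probs: predicted probabilities of the positive class
        return sum((g - p) ** 2 for g, p in zip(gold, probs)) / len(gold)

    def accuracy_loss(gold, pred):
        # fraction of incorrect hard predictions
        return sum(g != p for g, p in zip(gold, pred)) / len(gold)

    def f1_loss(gold, pred):
        # 1 - F1 on the positive class
        tp = sum(1 for g, p in zip(gold, pred) if g == p == 1)
        prec = tp / max(sum(pred), 1)
        rec = tp / max(sum(gold), 1)
        return 1.0 - (2 * prec * rec / max(prec + rec, 1e-9))

For gradient-based training, discrete losses like accuracy and F1 have to be smoothed, e.g. computed from beliefs rather than hard decisions.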

  24. Specifics • CRFs used with loopy BP for inference. • sum-product BP • i.e., loopy forward-backward • max-product BP (annealed) • i.e., loopy Viterbi • Two loss functions: Accuracy and F1.
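
The message update being iterated is standard sum-product BP on a pairwise model (a reconstruction; the update is not shown on the slide):

m_{i \to j}(y_j) \;\propto\; \sum_{y_i} \psi_i(y_i)\, \psi_{ij}(y_i, y_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(y_i), \qquad b_i(y_i) \;\propto\; \psi_i(y_i) \prod_{k \in N(i)} m_{k \to i}(y_i)

Max-product replaces the sum with a max; annealing interpolates between the two regimes via a temperature.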

  25. Modeling Congressional Votes The ConVote corpus [Thomas et al., 2006] First , I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary , not just for the underlying bill…

  26. Modeling Congressional Votes The ConVote corpus [Thomas et al., 2006] First , I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary , not just for the underlying bill… Yea

  27.–28. Modeling Congressional Votes The ConVote corpus [Thomas et al., 2006] • Mr. Sensenbrenner: "First , I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary , not just for the underlying bill…" → Yea • "Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve…" → Yea

  29. Modeling Congressional Votes An example from the ConVote corpus [Thomas et al., 2006] • Predict representative votes based on debates.

  30. Modeling Congressional Votes An example from the ConVote corpus [Thomas et al., 2006] • Predict representative votes based on debates. Y/N

  31. Modeling Congressional Votes An example from the ConVote corpus [Thomas et al., 2006] • Predict representative votes based on debates. Y/N Text First , I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary , not just for the underlying bill…

  32. Modeling Congressional Votes An example from the ConVote corpus [Thomas et al., 2006] • Predict representative votes based on debates. (figure: two Y/N vote variables connected by a Context link, each attached to its own speech Text) First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary , not just for the underlying bill…

  33.–36. Modeling Congressional Votes (build slides; figures not preserved in the transcript)

  37. Modeling Congressional Votes (results table not preserved in the transcript) *Boldfaced results are significantly better than all others (p < 0.05).

  38. Information Extraction from Semi-Structured Text What: Special Seminar Who: Prof. Klaus Sutner Computer Science Department, Stevens Institute of Technology Topic: "Teaching Automata Theory by Computer" Date: 12-Nov-93 Time: 12:00 pm Place: WeH 4623 Host: Dana Scott (Asst: Rebecca Clark x8-6737) ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… … After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package CMU Seminar Announcement Corpus [Freitag, 2000]

  39. Information Extraction from Semi-Structured Text What: Special Seminar Who: Prof. Klaus Sutner Computer Science Department, Stevens Institute of Technology Topic: "Teaching Automata Theory by Computer" Date: 12-Nov-93 Time: 12:00 pm Place: WeH 4623 Host: Dana Scott (Asst: Rebecca Clark x8-6737) ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… … After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package (fields highlighted on the slide: speaker, start time, location, speaker) CMU Seminar Announcement Corpus [Freitag, 2000]

  40. Skip-Chain CRF for Info Extraction • Extract speaker, location, stime, and etime from seminar announcement emails. (figure: tokens "Who: Prof. Klaus Sutner will … Prof. Sutner" tagged S/O by a linear chain, with a skip edge linking the two occurrences of "Sutner") CMU Seminar Announcement Corpus [Freitag, 2000] Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]
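
The skip edges of Sutton and McCallum (2005) connect repeated capitalized tokens, so evidence about one mention (e.g., "Sutner" right after "Who:") can flow to the other. A hypothetical helper that enumerates such edges (illustrative only; the ERMA package has its own model-description syntax, not shown here):

    from collections import defaultdict

    def skip_edges(tokens):
        # Pairs of positions holding identical capitalized words; a skip-chain
        # CRF adds a pairwise factor between the tags at each such pair.
        seen = defaultdict(list)
        edges = []
        for i, w in enumerate(tokens):
            if w[:1].isupper():
                edges += [(j, i) for j in seen[w]]
                seen[w].append(i)
        return edges

    # skip_edges("Who : Prof. Klaus Sutner will ... Prof. Sutner".split())
    # -> [(2, 7), (4, 8)]  (the two "Prof." tokens and the two "Sutner" tokens)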

  41.–43. Semi-Structured Information Extraction (build slides; figures not preserved in the transcript)

  44. Semi-Structured Information Extraction (results table not preserved in the transcript) *Boldfaced results are significantly better than all others (p < 0.05).

  45. Collective Multi-Label Classification (candidate labels: Oil, Libya, Sports) The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. … Reuters Corpus Version 2 [Lewis et al., 2004]

  46.–48. Collective Multi-Label Classification (builds of the same Reuters example) [Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]
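
The collective model couples the binary label variables pairwise, roughly (the standard formulation from Ghamrawi and McCallum, 2005, reconstructed; not the slide's own notation):

p(\mathbf{y} \mid x) \;\propto\; \exp\Big( \sum_j \theta_j^\top f(y_j, x) + \sum_{j < k} \theta_{jk}^\top g(y_j, y_k) \Big), \qquad y_j \in \{0, 1\}

so that correlated labels such as Oil and Libya can reinforce each other.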

  49.–50. Multi-Label Classification (results slides; tables not preserved in the transcript)
