
Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors






Presentation Transcript


  1. Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors David A. Smith Jason Eisner Johns Hopkins University

  2. Only Connect… [Diagram: a trained (dependency) parser, learned from training trees, raw text, out-of-domain text, and parallel & comparable corpora, feeds downstream tasks: LM, Textual Entailment, IE (Weischedel 2004), MT (Quirk et al. 2005), and Lexical Semantics (Pantel & Lin 2002).]

  3. Outline: Bootstrapping Parsers • What kind of parser should we train? • How should we train it semi-supervised? • Does it work? (initial experiments) • How can we incorporate other knowledge?

  4. Re-estimation: EM or Viterbi EM [Diagram: a Trained Parser parses raw text and is retrained on its own output parses.]

  5. Re-estimation: EM or Viterbi EM (iterate process) [Diagram: Trained Parser] Oops! Not much supervised training. So most of these parses were bad. Retraining on all of them overwhelms the good supervised data.

  6. Simple Bootstrapping: Self-Training So only retrain on “good” parses ... ? [Diagram: Trained Parser]

  7. Simple Bootstrapping: Self-Training So only retrain on “good” parses ... at least, those the parser itself thinks are good. (Can we trust it? We’ll see ...) [Diagram: Trained Parser]

  8. Why Might This Work? • Sure, now we avoid harming the parser with bad training. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] After training, training parses have • Many features with positive weights • Few features with negative weights. But unsupervised parses have • Few positive or negative features • Mostly unknown features • Words or situations not seen in training data. Still, sometimes enough positive features to be sure it’s the right parse.

  9. Why Might This Work? • Sure, we avoid bad guesses that harm the parser. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] Now, retraining the weights θ makes the gray (and red) features greener. Still, sometimes enough positive features to be sure it’s the right parse.

  10. Why Might This Work? • Sure, we avoid bad guesses that harm the parser. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] Now, retraining the weights θ makes the gray (and red) features greener ... and makes features redder for the “losing” parses of this sentence (not shown). Still, sometimes enough positive features to be sure it’s the right parse. Learning!

  11. This Story Requires Many Redundant Features! More features ⇒ more chances to identify the correct parse even when we’re undertrained. • Bootstrapping for WSD (Yarowsky 1995) • Lots of contextual features ⇒ success • Co-training for parsing (Steedman et al. 2003) • Feature-poor parsers ⇒ disappointment • Self-training for parsing (McClosky et al. 2006) • Feature-poor parsers ⇒ disappointment • Reranker with more features ⇒ success

  12. This Story Requires Many Redundant Features! More features ⇒ more chances to identify the correct parse even when we’re undertrained. • So, let’s bootstrap a feature-rich parser! • In our experiments so far, we follow McDonald et al. (2005) • Our model has 450 million features (on Czech) • Prune down to 90 million frequent features • About 200 are considered per possible edge. Note: Even more features proposed at end of talk.
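As a rough sketch of the pruning step (my illustration, not the authors' code; `extract_edge_features` is a hypothetical callback), one can count how often each feature fires over all candidate edges and keep only the frequent ones:

```python
from collections import Counter

def prune_features(tagged_sentences, extract_edge_features, min_count=2):
    """Keep features that fire at least `min_count` times across all candidate edges."""
    counts = Counter()
    for sent in tagged_sentences:
        n = len(sent)
        for head in range(n):            # every word as a potential head
            for child in range(n):       # paired with every other word as its child
                if head != child:
                    counts.update(extract_edge_features(sent, head, child))
    # A frequency cut of this kind is what takes the full feature set down to the
    # "frequent" subset mentioned above (e.g. ~450M distinct features to ~90M on Czech).
    return {feat for feat, c in counts.items() if c >= min_count}
```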

  13. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • No global features of a parse • Each feature is attached to some edge • Simple; allows fast O(n²) or O(n³) parsing. Byl jasný studený dubnový den a hodiny odbíjely třináctou

  14. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? yes, lots of green ... Byl jasný studený dubnový den a hodiny odbíjely třináctou

  15. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”) Byl jasný studený dubnový den a hodiny odbíjely třináctou

  16. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  17. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  18. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N; A → N preceding conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  19. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? not as good, lots of red ... Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  20. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ... Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  21. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  22. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only); A(plural) → N(singular) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  23. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only); A(plural) → N(singular); A → N where N follows a conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  24. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • “bright day” or “bright clocks”? Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  25. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • Score of an edge e = θ · features(e), where θ is our current weight vector • Standard algos ⇒ valid parse with max total score Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [lemmas: být jasný studený dubnový den a hodiny odbit třináct]

  26. Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • Score of an edge e = θ · features(e), where θ is our current weight vector • Standard algos ⇒ valid parse with max total score: can’t have both (one parent per word), can’t have both (no crossing links), can’t have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
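As a concrete, purely illustrative sketch of the scoring scheme just described (not the authors' implementation): each candidate edge fires several redundant features, its score is the dot product of those features with the current weights θ, and a standard decoder then returns the valid tree with the greatest total score. The feature templates below are simplified stand-ins for the real ones.

```python
def edge_features(words, stems, tags, head, child):
    """Redundant features for one candidate edge head -> child (illustrative templates)."""
    feats = [
        f"word:{words[head]}->{words[child]}",  # e.g. jasny -> den ("bright day")
        f"stem:{stems[head]}->{stems[child]}",  # e.g. jasn- -> hodi-
        f"tag:{tags[head]}->{tags[child]}",     # e.g. A -> N
    ]
    if child > 0 and tags[child - 1] == "J":    # context: the child follows a conjunction
        feats.append(f"tag+conj:{tags[head]}->{tags[child]}")
    return feats

def edge_score(theta, words, stems, tags, head, child):
    """score(e) = theta . features(e); features with no learned weight contribute 0."""
    return sum(theta.get(f, 0.0) for f in edge_features(words, stems, tags, head, child))

# A decoder fills a score matrix with edge_score for every (head, child) pair and
# returns the highest-scoring valid tree (one parent per word, no cycles), e.g. via
# Chu-Liu-Edmonds maximum spanning tree for non-projective parsing or Eisner's
# O(n^3) dynamic program for projective parsing.
```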

  27. Only Connect… [The diagram from slide 2 again, with the trained parser in place: training trees, raw text, out-of-domain text, and parallel & comparable corpora feed learning of the trained parser, which feeds LM, Textual Entailment, IE, MT, and Lexical Semantics.]

  28. Can we recast this declaratively? Only retrain on “good” parses ... at least, those the parser itself thinks are good. [Diagram: Trained Parser]

  29. Bootstrapping as Optimization: Maximize a function on supervised and unsupervised data: try to predict the supervised parses, and try to be confident on the unsupervised parses. Entropy regularization (Brand 1999; Grandvalet & Bengio; Jiao et al.). Yesterday’s talk: how to compute these for non-projective models; see Hwa ’01 for projective tree entropy.
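In symbols, an objective of this entropy-regularization family (my reconstruction of the general form, with a trade-off weight λ that I am assuming; the paper's exact formulation may differ) is

```latex
\max_{\theta}\;
\sum_{(x_i,\,y_i)\in\mathcal{L}} \log p_\theta(y_i \mid x_i)
\;-\;
\lambda \sum_{x_j\in\mathcal{U}} H\bigl(p_\theta(\cdot \mid x_j)\bigr),
```

where the first sum (over labeled sentence/tree pairs L) rewards predicting the supervised parses and the second sum (over unlabeled sentences U) rewards low entropy, i.e. confidence, on the unsupervised parses.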

  30. Claim: Gradient descent on this objective function works like bootstrapping. When we’re pretty sure the true parse is A or B, we reduce entropy H by becoming even surer (⇒ retraining θ on the example). When we’re not sure, the example doesn’t affect θ (⇒ not retraining on the example). [Plot: entropy H against the probability p of one parse, with the slope ∂H/∂p marked; H ≈ 0 when we’re sure of parse A or of parse B, H ≈ 1 when we’re not sure.]
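To see why the gradient behaves this way, consider the two-parse special case (my illustration, not from the slides), with p = the probability the model assigns to parse A:

```latex
H(p) = -p\log p - (1-p)\log(1-p),
\qquad
\frac{\partial H}{\partial p} = \log\frac{1-p}{p}.
```

Gradient descent on H moves p in the direction of log(p/(1-p)): upward when p > 1/2 and downward when p < 1/2, i.e. toward whichever parse the model already prefers; at p = 1/2 the gradient is zero, so an example the parser is unsure about barely changes θ.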

  31. Claim: Gradient descent on this objective function works like bootstrapping. In the paper, we generalize: replace Shannon entropy H(p) with Rényi entropy Hα(p). • This gives us a tunable parameter α: • Connect to Abney’s view of bootstrapping (α = 0) • Obtain Viterbi variant (limit as α → ∞) • Obtain Gini variant (α = 2) • Still get Shannon entropy (limit as α → 1) • Also easier to compute in some circumstances
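For reference, the Rényi entropy of a parse distribution p with parameter α (the Greek letter itself was lost in this transcript, so the symbol name is my reconstruction) is

```latex
H_\alpha(p) = \frac{1}{1-\alpha}\,\log \sum_{y} p(y)^{\alpha},
```

which yields exactly the variants listed above: the limit α → 1 recovers Shannon entropy; α = 2 gives the Gini variant -log Σ_y p(y)²; the limit α → ∞ gives the Viterbi variant -log max_y p(y); and α = 0 gives log(number of parses with nonzero probability), the quantity behind Abney’s view of the Yarowsky algorithm.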

  32. Experimental Questions • Are confident parses (or edges) actually good for retraining? • Does bootstrapping help accuracy? • What is being learned?

  33. Experimental Design • Czech, German, and Spanish (some Bulgarian) • CoNLL-X dependency trees • Non-projective (MST) parsing • Hundreds of millions of features • Supervised training sets of 100 & 1000 trees (ridiculously small; pilot experiments, sorry) • Unparsed but tagged sets of 2k to 70k sentences • Stochastic gradient descent • First optimize just likelihood on seed set • Then optimize likelihood + confidence criterion on all data • Stop when accuracy peaks on development data
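A schematic of that two-stage training regime (an assumed sketch, not the authors' code; `grad_log_likelihood` and `grad_entropy` stand in for the gradient routines of the companion non-projective-parsing talk):

```python
import random

def train(theta, seed_trees, unparsed_sents,
          grad_log_likelihood, grad_entropy, lam=1.0, lr=0.1, epochs=10):
    """Two-stage stochastic gradient ascent with hypothetical gradient callbacks."""

    def step(grads):
        # Sparse update: only features with nonzero gradient are touched.
        for f, g in grads.items():
            theta[f] = theta.get(f, 0.0) + lr * g

    # Stage 1: optimize just the conditional likelihood on the supervised seed set.
    for _ in range(epochs):
        for x, y in seed_trees:
            step(grad_log_likelihood(theta, x, y))

    # Stage 2: likelihood on seed trees + confidence (minus entropy) on unparsed text.
    examples = [("sup", ex) for ex in seed_trees] + [("unsup", x) for x in unparsed_sents]
    for _ in range(epochs):          # in practice: stop when dev-set accuracy peaks
        random.shuffle(examples)
        for kind, ex in examples:
            if kind == "sup":
                step(grad_log_likelihood(theta, ex[0], ex[1]))
            else:
                step({f: -lam * g for f, g in grad_entropy(theta, ex).items()})
    return theta
```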

  34. Are confident parses accurate? Correlation of entropy with accuracy: • Shannon entropy: -.32 • “Viterbi” self-training: -.26 • Gini = -log(expected 0/1 gain): -.27 • log(# of parses) (favors short sentences; Abney’s Yarowsky alg.): -.25
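The four measures above are different summaries of the parser's distribution over parses for a sentence; a toy computation (my paraphrase of the definitions, with hypothetical inputs) shows how they behave:

```python
import math

def confidence_measures(probs):
    """probs: probabilities of all parses of one sentence (summing to 1)."""
    probs = [p for p in probs if p > 0]
    return {
        "shannon_entropy": -sum(p * math.log(p) for p in probs),
        "viterbi": -math.log(max(probs)),                # Renyi limit alpha -> infinity
        "gini": -math.log(sum(p * p for p in probs)),    # -log(expected 0/1 gain)
        "log_num_parses": math.log(len(probs)),          # Renyi with alpha = 0
    }

print(confidence_measures([0.85, 0.10, 0.05]))            # confident sentence: low values
print(confidence_measures([0.2, 0.2, 0.2, 0.2, 0.2]))     # unsure sentence: higher values
```

All the reported correlations are negative: lower entropy (more confidence) tends to go with higher accuracy, which is what licenses retraining on the confident parses.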

  35. How Accurate Is Bootstrapping? [Chart: accuracy of the 100-tree supervised baseline vs. bootstrapping with +2K, +37K, and +71K unparsed sentences added.] Significant on paired permutation test.

  36. How Does Bootstrapping Learn? [Chart: Precision and Recall.] 90%: maybe enough precision so retraining doesn’t hurt. Maybe enough recall so retraining will learn new things.

  37. Bootstrapping vs. EM: two ways to add unsupervised data. Compare on a feature-poor model that EM can handle (DMV). [Chart: accuracy on Bulgarian, German, and Spanish for EM (joint), MLE (joint), MLE (cond.), and Boot. (cond.); the MLE systems are the supervised baselines. 100 training trees, 100 dev trees for model selection.]

  38. There’s No Data Like More Data [The diagram from slide 2 again: training trees, raw text, out-of-domain text, and parallel & comparable corpora feed learning of the trained parser, which feeds LM, Textual Entailment, IE, MT, and Lexical Semantics.]

  39. “Token” Projection: What if some sentences have parallel text? • Project 1-best English dependencies (Hwa et al. ’04)??? • Imperfect or free translation • Imperfect parse • Imperfect alignment. No. Just use them to get further noisy features. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen”

  40. “Token” Projection: What if some sentences have parallel text? Probably aligns to some English link A → N. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen”

  41. “Token” Projection: What if some sentences have parallel text? Probably aligns to some English path N → in → N. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen” Cf. “quasi-synchronous grammars” (Smith & Eisner, 2006)
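A hedged sketch of how such token-projection features might be computed for one candidate foreign edge, given a word alignment and the 1-best English parse (the helper names and the exact feature inventory are my assumptions):

```python
def token_projection_features(f_head, f_child, alignment, en_parse):
    """Noisy features for a foreign candidate edge, from a parallel English sentence.

    alignment: dict mapping a foreign token index to the set of aligned English indices
    en_parse:  dict mapping an English child index to its English head index (1-best parse)
    """
    feats = []
    for eh in alignment.get(f_head, ()):
        for ec in alignment.get(f_child, ()):
            if en_parse.get(ec) == eh:
                feats.append("aligned-to-english-link")        # direct link, e.g. A -> N
            elif en_parse.get(en_parse.get(ec)) == eh:
                feats.append("aligned-to-english-path-len2")   # two-step path, e.g. N -> in -> N
    return feats
```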

  42. “Type” Projection: Can we use world knowledge, e.g., from comparable corpora? Probably translate as English words that usually link as N → V when cosentential. Parsed Gigaword corpus, clock / strike: “…will no longer be royal when the clock strikes midnight.” “But when the clock strikes 11 a.m. and the race cars rocket…” “…vehicles and pedestrians after the clock struck eight.” “…when the clock of a no-passenger Airbus A-320 struck…” “…born right after the clock struck 12:00 p.m. of December…” “…as the clock in Madrid’s Plaza del Sol strikes 12 times.” Byl jasný studený dubnový den a hodiny odbíjely třináctou

  43. “Type” Projection: Can we use world knowledge, e.g., from comparable corpora? Probably translate as English words that usually link as N → V when cosentential. [Candidate translations per Czech word: Byl → be, exist, subsist; jasný → bright, broad, cheerful, pellucid, straight, …; studený → cold, fresh, hyperborean, stone-cold; dubnový → April; den → day, daytime; a → and, plus; hodiny → clock, meter, metre; odbíjely → strike; třináctou → thirteen.] Byl jasný studený dubnový den a hodiny odbíjely třináctou
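A similarly hedged sketch of a type-projection feature: look up candidate translations of the two Czech words in a bilingual dictionary, count how often those English words are linked in a large parsed corpus such as the parsed Gigaword above, and fire a coarse feature when the count is high. Every helper name here is illustrative, not the authors' actual feature set.

```python
from itertools import product

def type_projection_feature(f_head_word, f_child_word, translations, en_link_counts,
                            threshold=10):
    """Fire a coarse feature if translations of the two words often link in English.

    translations:   dict mapping a foreign word to a set of English translations
    en_link_counts: dict mapping an (English head, English child) pair to the number
                    of dependency links observed in a large parsed English corpus
    """
    count = sum(en_link_counts.get((eh, ec), 0)
                for eh, ec in product(translations.get(f_head_word, ()),
                                      translations.get(f_child_word, ())))
    return ["translations-often-link-in-english"] if count >= threshold else []

# e.g. translations of "odbíjely" ({"strike"}) paired with translations of "hodiny"
# ({"clock", "meter", "metre"}) link often in parsed Gigaword ("the clock strikes ...").
```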

  44. Conclusions • Declarative view of bootstrapping as entropy minimization • Improvements in parser accuracy with feature-rich models • Easily added features from alternative data sources, e.g. comparable text • In future: consider also the WSD decision list learner: is it important for learning robust feature weights?

  45. Thanks Noah Smith Keith Hall The Anonymous Reviewers Ryan McDonald for making his code available

  46. Extra slides …

  47. Dependency Treebanks

  48. A Supervised CoNLL-X System What system was this?

  49. How Does Bootstrapping Learn? [Chart legend: Supervised iter. 1, Supervised iter. 10, Bootstrapping w/ R₂, Bootstrapping w/ R∞]

  50. How Does Bootstrapping Learn?
