
Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors






Presentation Transcript


  1. Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors David A. Smith Jason Eisner Johns Hopkins University

  2. Only Connect… [Diagram: a trained (dependency) parser, learned from training trees, raw text, out-of-domain text, and parallel & comparable corpora, feeds downstream tasks: LM, Textual Entailment, IE (Weischedel 2004), MT (Quirk et al. 2005), and Lexical Semantics (Pantel & Lin 2002).]

  3. Outline: Bootstrapping Parsers • What kind of parser should we train? • How should we train it semi-supervised? • Does it work? (initial experiments) • How can we incorporate other knowledge?

  4. Re-estimation: EM or Viterbi EM [Diagram: a Trained Parser parses raw text and is retrained on its own output parses.]

  5. Re-estimation: EM or Viterbi EM (iterate process) [Diagram: Trained Parser] Oops! Not much supervised training. So most of these parses were bad. Retraining on all of them overwhelms the good supervised data.

  6. Simple Bootstrapping: Self-Training So only retrain on “good” parses ... ? [Diagram: Trained Parser]

  7. Simple Bootstrapping: Self-Training So only retrain on “good” parses ... at least, those the parser itself thinks are good. (Can we trust it? We’ll see ...) [Diagram: Trained Parser]

  8. Why Might This Work? • Sure, now we avoid harming the parser with bad training. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] After training, training parses have • Many features with positive weights • Few features with negative weights. But unsupervised parses have • Few positive or negative features • Mostly unknown features • Words or situations not seen in training data. Still, sometimes enough positive features to be sure it’s the right parse.

  9. Why Might This Work? • Sure, we avoid bad guesses that harm the parser. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] Now, retraining the weights θ makes the gray (and red) features greener. Still, sometimes enough positive features to be sure it’s the right parse.

  10. Why Might This Work? • Sure, we avoid bad guesses that harm the parser. • But why do we learn anything new from the unsup. data? [Diagram: Trained Parser] Now, retraining the weights θ makes the gray (and red) features greener ... and makes features redder for the “losing” parses of this sentence (not shown). Still, sometimes enough positive features to be sure it’s the right parse. Learning!

  11. This Story Requires Many Redundant Features! More features ⇒ more chances to identify the correct parse even when we’re undertrained. • Bootstrapping for WSD (Yarowsky 1995) • Lots of contextual features ⇒ success • Co-training for parsing (Steedman et al. 2003) • Feature-poor parsers ⇒ disappointment • Self-training for parsing (McClosky et al. 2006) • Feature-poor parsers ⇒ disappointment • Reranker with more features ⇒ success

  12. This Story Requires Many Redundant Features! More features ⇒ more chances to identify the correct parse even when we’re undertrained. • So, let’s bootstrap a feature-rich parser! • In our experiments so far, we follow McDonald et al. (2005) • Our model has 450 million features (on Czech) • Prune down to 90 million frequent features • About 200 are considered per possible edge. Note: Even more features proposed at end of talk.
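As a rough sketch of the pruning step (my illustration, not the authors' code; `extract_edge_features` is a hypothetical callback), one can count how often each feature fires over all candidate edges and keep only the frequent ones:

```python
from collections import Counter

def prune_features(tagged_sentences, extract_edge_features, min_count=2):
    """Keep features that fire at least `min_count` times across all candidate edges."""
    counts = Counter()
    for sent in tagged_sentences:
        n = len(sent)
        for head in range(n):            # every word as a potential head
            for child in range(n):       # paired with every other word as its child
                if head != child:
                    counts.update(extract_edge_features(sent, head, child))
    # A frequency cut of this kind is what takes the full feature set down to the
    # "frequent" subset mentioned above (e.g. ~450M distinct features to ~90M on Czech).
    return {feat for feat, c in counts.items() if c >= min_count}
```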

  13. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • No global features of a parse • Each feature is attached to some edge • Simple; allows fast O(n²) or O(n³) parsing. Byl jasný studený dubnový den a hodiny odbíjely třináctou

  14. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? yes, lots of green ... Byl jasný studený dubnový den a hodiny odbíjely třináctou

  15. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”) Byl jasný studený dubnový den a hodiny odbíjely třináctou

  16. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  17. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  18. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Is this a good edge? jasný → den (“bright day”); jasný → N (“bright NOUN”); A → N; A → N preceding conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  19. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? not as good, lots of red ... Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  20. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ... Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C]

  21. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  22. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only); A(plural) → N(singular) Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  23. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • How about this competing edge? jasný → hodiny (“bright clocks”) ... undertrained ...; jasn- → hodi- (“bright clock”, stems only); A(plural) → N(singular); A → N where N follows a conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  24. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • “bright day” or “bright clocks”? Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-]

  25. “It was a bright cold day in April and the clocks were striking thirteen” Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • Score of an edge e = θ · features(e), where θ is our current weight vector • Standard algos ⇒ valid parse with max total score Byl jasný studený dubnový den a hodiny odbíjely třináctou [tags: V A A A N J N V C] [lemmas: být jasný studený dubnový den a hodiny odbit třináct]

  26. Edge-Factored Parsers (McDonald et al. 2005) • Which edge is better? • Score of an edge e = θ · features(e), where θ is our current weight vector • Standard algos ⇒ valid parse with max total score: can’t have both (one parent per word), can’t have both (no crossing links), can’t have all three (no cycles). Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
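As a concrete, purely illustrative sketch of the scoring scheme just described (not the authors' implementation): each candidate edge fires several redundant features, its score is the dot product of those features with the current weights θ, and a standard decoder then returns the valid tree with the greatest total score. The feature templates below are simplified stand-ins for the real ones.

```python
def edge_features(words, stems, tags, head, child):
    """Redundant features for one candidate edge head -> child (illustrative templates)."""
    feats = [
        f"word:{words[head]}->{words[child]}",  # e.g. jasny -> den ("bright day")
        f"stem:{stems[head]}->{stems[child]}",  # e.g. jasn- -> hodi-
        f"tag:{tags[head]}->{tags[child]}",     # e.g. A -> N
    ]
    if child > 0 and tags[child - 1] == "J":    # context: the child follows a conjunction
        feats.append(f"tag+conj:{tags[head]}->{tags[child]}")
    return feats

def edge_score(theta, words, stems, tags, head, child):
    """score(e) = theta . features(e); features with no learned weight contribute 0."""
    return sum(theta.get(f, 0.0) for f in edge_features(words, stems, tags, head, child))

# A decoder fills a score matrix with edge_score for every (head, child) pair and
# returns the highest-scoring valid tree (one parent per word, no cycles), e.g. via
# Chu-Liu-Edmonds maximum spanning tree for non-projective parsing or Eisner's
# O(n^3) dynamic program for projective parsing.
```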

  27. Only Connect… [The diagram from slide 2 again, with the trained parser in place: training trees, raw text, out-of-domain text, and parallel & comparable corpora feed learning of the trained parser, which feeds LM, Textual Entailment, IE, MT, and Lexical Semantics.]

  28. Can we recast this declaratively? Only retrain on “good” parses ... at least, those the parser itself thinks are good. [Diagram: Trained Parser]

  29. Bootstrapping as Optimization: Maximize a function on supervised and unsupervised data: try to predict the supervised parses, and try to be confident on the unsupervised parses. Entropy regularization (Brand 1999; Grandvalet & Bengio; Jiao et al.). Yesterday’s talk: how to compute these for non-projective models; see Hwa ’01 for projective tree entropy.
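In symbols, an objective of this entropy-regularization family (my reconstruction of the general form, with a trade-off weight λ that I am assuming; the paper's exact formulation may differ) is

```latex
\max_{\theta}\;
\sum_{(x_i,\,y_i)\in\mathcal{L}} \log p_\theta(y_i \mid x_i)
\;-\;
\lambda \sum_{x_j\in\mathcal{U}} H\bigl(p_\theta(\cdot \mid x_j)\bigr),
```

where the first sum (over labeled sentence/tree pairs L) rewards predicting the supervised parses and the second sum (over unlabeled sentences U) rewards low entropy, i.e. confidence, on the unsupervised parses.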

  30. Claim: Gradient descent on this objective function works like bootstrapping. When we’re pretty sure the true parse is A or B, we reduce entropy H by becoming even surer (⇒ retraining θ on the example). When we’re not sure, the example doesn’t affect θ (⇒ not retraining on the example). [Plot: entropy H against the probability p of one parse, with the slope ∂H/∂p marked; H ≈ 0 when we’re sure of parse A or of parse B, H ≈ 1 when we’re not sure.]
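To see why the gradient behaves this way, consider the two-parse special case (my illustration, not from the slides), with p = the probability the model assigns to parse A:

```latex
H(p) = -p\log p - (1-p)\log(1-p),
\qquad
\frac{\partial H}{\partial p} = \log\frac{1-p}{p}.
```

Gradient descent on H moves p in the direction of log(p/(1-p)): upward when p > 1/2 and downward when p < 1/2, i.e. toward whichever parse the model already prefers; at p = 1/2 the gradient is zero, so an example the parser is unsure about barely changes θ.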

  31. Claim: Gradient descent on this objective function works like bootstrapping. In the paper, we generalize: replace Shannon entropy H(p) with Rényi entropy Hα(p). • This gives us a tunable parameter α: • Connect to Abney’s view of bootstrapping (α = 0) • Obtain Viterbi variant (limit as α → ∞) • Obtain Gini variant (α = 2) • Still get Shannon entropy (limit as α → 1) • Also easier to compute in some circumstances
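For reference, the Rényi entropy of a parse distribution p with parameter α (the Greek letter itself was lost in this transcript, so the symbol name is my reconstruction) is

```latex
H_\alpha(p) = \frac{1}{1-\alpha}\,\log \sum_{y} p(y)^{\alpha},
```

which yields exactly the variants listed above: the limit α → 1 recovers Shannon entropy; α = 2 gives the Gini variant -log Σ_y p(y)²; the limit α → ∞ gives the Viterbi variant -log max_y p(y); and α = 0 gives log(number of parses with nonzero probability), the quantity behind Abney’s view of the Yarowsky algorithm.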

  32. Experimental Questions • Are confident parses (or edges) actually good for retraining? • Does bootstrapping help accuracy? • What is being learned?

  33. Experimental Design • Czech, German, and Spanish (some Bulgarian) • CoNLL-X dependency trees • Non-projective (MST) parsing • Hundreds of millions of features • Supervised training sets of 100 & 1000 trees (ridiculously small; pilot experiments, sorry) • Unparsed but tagged sets of 2k to 70k sentences • Stochastic gradient descent • First optimize just likelihood on seed set • Then optimize likelihood + confidence criterion on all data • Stop when accuracy peaks on development data
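A schematic of that two-stage training regime (an assumed sketch, not the authors' code; `grad_log_likelihood` and `grad_entropy` stand in for the gradient routines of the companion non-projective-parsing talk):

```python
import random

def train(theta, seed_trees, unparsed_sents,
          grad_log_likelihood, grad_entropy, lam=1.0, lr=0.1, epochs=10):
    """Two-stage stochastic gradient ascent with hypothetical gradient callbacks."""

    def step(grads):
        # Sparse update: only features with nonzero gradient are touched.
        for f, g in grads.items():
            theta[f] = theta.get(f, 0.0) + lr * g

    # Stage 1: optimize just the conditional likelihood on the supervised seed set.
    for _ in range(epochs):
        for x, y in seed_trees:
            step(grad_log_likelihood(theta, x, y))

    # Stage 2: likelihood on seed trees + confidence (minus entropy) on unparsed text.
    examples = [("sup", ex) for ex in seed_trees] + [("unsup", x) for x in unparsed_sents]
    for _ in range(epochs):          # in practice: stop when dev-set accuracy peaks
        random.shuffle(examples)
        for kind, ex in examples:
            if kind == "sup":
                step(grad_log_likelihood(theta, ex[0], ex[1]))
            else:
                step({f: -lam * g for f, g in grad_entropy(theta, ex).items()})
    return theta
```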

  34. Are confident parses accurate? Correlation of entropy with accuracy: • Shannon entropy: -.32 • “Viterbi” self-training: -.26 • Gini = -log(expected 0/1 gain): -.27 • log(# of parses) (favors short sentences; Abney’s Yarowsky alg.): -.25
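The four measures above are different summaries of the parser's distribution over parses for a sentence; a toy computation (my paraphrase of the definitions, with hypothetical inputs) shows how they behave:

```python
import math

def confidence_measures(probs):
    """probs: probabilities of all parses of one sentence (summing to 1)."""
    probs = [p for p in probs if p > 0]
    return {
        "shannon_entropy": -sum(p * math.log(p) for p in probs),
        "viterbi": -math.log(max(probs)),                # Renyi limit alpha -> infinity
        "gini": -math.log(sum(p * p for p in probs)),    # -log(expected 0/1 gain)
        "log_num_parses": math.log(len(probs)),          # Renyi with alpha = 0
    }

print(confidence_measures([0.85, 0.10, 0.05]))            # confident sentence: low values
print(confidence_measures([0.2, 0.2, 0.2, 0.2, 0.2]))     # unsure sentence: higher values
```

All the reported correlations are negative: lower entropy (more confidence) tends to go with higher accuracy, which is what licenses retraining on the confident parses.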

  35. How Accurate Is Bootstrapping? [Chart: accuracy of the 100-tree supervised baseline vs. bootstrapping with +2K, +37K, and +71K unparsed sentences added.] Significant on paired permutation test.

  36. How Does Bootstrapping Learn? [Chart: Precision and Recall.] 90%: maybe enough precision so retraining doesn’t hurt. Maybe enough recall so retraining will learn new things.

  37. Bootstrapping vs. EM: two ways to add unsupervised data. Compare on a feature-poor model that EM can handle (DMV). [Chart: accuracy on Bulgarian, German, and Spanish for EM (joint), MLE (joint), MLE (cond.), and Boot. (cond.); the MLE systems are the supervised baselines. 100 training trees, 100 dev trees for model selection.]

  38. There’s No Data Like More Data [The diagram from slide 2 again: training trees, raw text, out-of-domain text, and parallel & comparable corpora feed learning of the trained parser, which feeds LM, Textual Entailment, IE, MT, and Lexical Semantics.]

  39. “Token” Projection: What if some sentences have parallel text? • Project 1-best English dependencies (Hwa et al. ’04)??? • Imperfect or free translation • Imperfect parse • Imperfect alignment. No. Just use them to get further noisy features. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen”

  40. “Token” Projection: What if some sentences have parallel text? Probably aligns to some English link A → N. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen”

  41. “Token” Projection: What if some sentences have parallel text? Probably aligns to some English path N → in → N. Byl jasný studený dubnový den a hodiny odbíjely třináctou / “It was a bright cold day in April and the clocks were striking thirteen” Cf. “quasi-synchronous grammars” (Smith & Eisner, 2006)
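A hedged sketch of how such token-projection features might be computed for one candidate foreign edge, given a word alignment and the 1-best English parse (the helper names and the exact feature inventory are my assumptions):

```python
def token_projection_features(f_head, f_child, alignment, en_parse):
    """Noisy features for a foreign candidate edge, from a parallel English sentence.

    alignment: dict mapping a foreign token index to the set of aligned English indices
    en_parse:  dict mapping an English child index to its English head index (1-best parse)
    """
    feats = []
    for eh in alignment.get(f_head, ()):
        for ec in alignment.get(f_child, ()):
            if en_parse.get(ec) == eh:
                feats.append("aligned-to-english-link")        # direct link, e.g. A -> N
            elif en_parse.get(en_parse.get(ec)) == eh:
                feats.append("aligned-to-english-path-len2")   # two-step path, e.g. N -> in -> N
    return feats
```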

  42. “Type” Projection: Can we use world knowledge, e.g., from comparable corpora? Probably translate as English words that usually link as N → V when cosentential. Parsed Gigaword corpus, clock / strike: “…will no longer be royal when the clock strikes midnight.” “But when the clock strikes 11 a.m. and the race cars rocket…” “…vehicles and pedestrians after the clock struck eight.” “…when the clock of a no-passenger Airbus A-320 struck…” “…born right after the clock struck 12:00 p.m. of December…” “…as the clock in Madrid’s Plaza del Sol strikes 12 times.” Byl jasný studený dubnový den a hodiny odbíjely třináctou

  43. “Type” Projection: Can we use world knowledge, e.g., from comparable corpora? Probably translate as English words that usually link as N → V when cosentential. [Candidate translations per Czech word: Byl → be, exist, subsist; jasný → bright, broad, cheerful, pellucid, straight, …; studený → cold, fresh, hyperborean, stone-cold; dubnový → April; den → day, daytime; a → and, plus; hodiny → clock, meter, metre; odbíjely → strike; třináctou → thirteen.] Byl jasný studený dubnový den a hodiny odbíjely třináctou
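A similarly hedged sketch of a type-projection feature: look up candidate translations of the two Czech words in a bilingual dictionary, count how often those English words are linked in a large parsed corpus such as the parsed Gigaword above, and fire a coarse feature when the count is high. Every helper name here is illustrative, not the authors' actual feature set.

```python
from itertools import product

def type_projection_feature(f_head_word, f_child_word, translations, en_link_counts,
                            threshold=10):
    """Fire a coarse feature if translations of the two words often link in English.

    translations:   dict mapping a foreign word to a set of English translations
    en_link_counts: dict mapping an (English head, English child) pair to the number
                    of dependency links observed in a large parsed English corpus
    """
    count = sum(en_link_counts.get((eh, ec), 0)
                for eh, ec in product(translations.get(f_head_word, ()),
                                      translations.get(f_child_word, ())))
    return ["translations-often-link-in-english"] if count >= threshold else []

# e.g. translations of "odbíjely" ({"strike"}) paired with translations of "hodiny"
# ({"clock", "meter", "metre"}) link often in parsed Gigaword ("the clock strikes ...").
```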

  44. Conclusions • Declarative view of bootstrapping as entropy minimization • Improvements in parser accuracy with feature-rich models • Easily added features from alternative data sources, e.g. comparable text • In future: consider also the WSD decision list learner: is it important for learning robust feature weights?

  45. Thanks Noah Smith Keith Hall The Anonymous Reviewers Ryan McDonald for making his code available

  46. Extra slides …

  47. Dependency Treebanks

  48. A Supervised CoNLL-X System What system was this?

  49. How Does Bootstrapping Learn? [Chart legend: Supervised iter. 1, Supervised iter. 10, Bootstrapping w/ R₂, Bootstrapping w/ R∞]

  50. How Does Bootstrapping Learn?
