
Boosting-based parse re-ranking with subtree features

Taku Kudo, Jun Suzuki, Hideki Isozaki. NTT Communication Science Labs.

Presentation Transcript


  1. Boosting-based parse re-ranking with subtree features Taku Kudo Jun Suzuki Hideki Isozaki NTT Communication Science Labs.

  2. Discriminative methods for parsing • have shown remarkable performance compared to traditional generative models, e.g., PCFG • two approaches • re-ranking [Collins 00, Collins 02] • discriminative machine learning algorithms are used to rerank the n-best outputs of generative/conditional parsers • dynamic programming • max-margin parsing [Taskar 04]

  3. Reranking • Let x be an input sentence, and y be a parse tree for x • Let G(x) be a function that returns a set of n-best results for x • A re-ranker gives a score to each result and selects the result that has the highest score [Figure: the sentence x = "I buy cars with money" is passed to G(x), which returns the n-best results y1, y2, y3, …; example scores 0.2, 0.5, 0.1]

  4. Scoring with linear model • Φ(y) is a feature function that maps an output y into a feature space • w is a parameter vector (weights) estimated from training data
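As a rough illustration of slides 3 and 4, the sketch below scores each candidate in an n-best list with a sparse dot product and returns the highest-scoring one. The feature names, weights, and candidate list are invented for the example; the slides do not give concrete values.

```python
from typing import Dict, List


def score(weights: Dict[str, float], features: List[str]) -> float:
    """Sparse dot product: each listed feature contributes its weight."""
    return sum(weights.get(f, 0.0) for f in features)


def rerank(weights: Dict[str, float], candidates: List[dict]) -> dict:
    """Return the candidate from the n-best list with the highest score."""
    return max(candidates, key=lambda c: score(weights, c["features"]))


# Hypothetical weights and n-best list for "I buy cars with money".
weights = {"PP-attaches-to-VP": 1.5, "PP-attaches-to-NP": -0.5}
candidates = [
    {"name": "y1", "features": ["PP-attaches-to-NP"]},
    {"name": "y2", "features": ["PP-attaches-to-VP"]},
    {"name": "y3", "features": []},
]
best = rerank(weights, candidates)
print(best["name"])
```

Features missing from the weight dictionary contribute zero, which is what makes a sparse weight vector cheap to apply at parsing time.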

  5. Two issues in linear model [1/2] • How to estimate the weights? • try to minimize a loss for given training data • the definition of the loss differs by method: ME, SVMs, Boosting

  6. Two issues in linear model [2/2] • How to define the feature set? • use all subtrees • Pros: - natural extension of CFG rules - can capture long contextual information • Cons: naive enumeration of all subtrees is computationally prohibitive

  7. A question for all subtrees • Do we always need all subtrees? • only a small set of subtrees is informative • most subtrees are redundant • Goal: automatic feature selection from all subtrees • can perform fast parsing • can give good interpretation to selected subtrees • Boosting meets our demand!

  8. Why Boosting? • Different regularization strategies for the weights • L1 (Boosting) • better when most given features are irrelevant • can remove redundant features • L2 (SVMs) • better when most given features are relevant • uses as many features as it can • Boosting meets our demand, because most subtrees are irrelevant and redundant

  9. RankBoost [Freund03] • Update feature k of the current weights with an increment δ to obtain the next weights • Select the optimal pair <k,δ> that minimizes the loss
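A minimal sketch of one such update follows. It assumes binary features and an exponential loss over (correct parse, competitor) pairs, for which the best increment of a single feature has a closed form. These choices, the smoothing constant `eps`, and all names are assumptions made for illustration; the slide itself does not spell out the loss.

```python
import math
from typing import Dict, List, Optional, Set, Tuple

# (features of the correct parse, features of a competing parse)
Pair = Tuple[Set[str], Set[str]]


def margin(w: Dict[str, float], pair: Pair) -> float:
    """Score of the correct parse minus score of the competitor."""
    good, bad = pair
    return sum(w.get(f, 0.0) for f in good) - sum(w.get(f, 0.0) for f in bad)


def exp_loss(w: Dict[str, float], pairs: List[Pair]) -> float:
    """Exponential ranking loss summed over all pairs."""
    return sum(math.exp(-margin(w, p)) for p in pairs)


def best_update(w: Dict[str, float], pairs: List[Pair],
                eps: float = 1e-3) -> Optional[Tuple[str, float]]:
    """Pick the feature k and increment delta with the largest loss reduction."""
    w_pos: Dict[str, float] = {}  # pair weight where k fires only in the correct parse
    w_neg: Dict[str, float] = {}  # pair weight where k fires only in the competitor
    for good, bad in pairs:
        d = math.exp(-margin(w, (good, bad)))
        for f in good - bad:
            w_pos[f] = w_pos.get(f, 0.0) + d
        for f in bad - good:
            w_neg[f] = w_neg.get(f, 0.0) + d
    best = None
    for f in set(w_pos) | set(w_neg):
        p, n = w_pos.get(f, 0.0), w_neg.get(f, 0.0)
        gain = (math.sqrt(p) - math.sqrt(n)) ** 2  # loss reduction at the exact optimum
        delta = 0.5 * math.log((p + eps) / (n + eps))  # smoothed to avoid log(0)
        if best is None or gain > best[0]:
            best = (gain, f, delta)
    return None if best is None else (best[1], best[2])


# Toy training pairs with made-up feature names.
pairs = [({"a", "b"}, {"b", "c"}), ({"a"}, {"c", "d"}), ({"d"}, {"a"})]
w: Dict[str, float] = {}
before = exp_loss(w, pairs)
k, delta = best_update(w, pairs)
w[k] = w.get(k, 0.0) + delta
after = exp_loss(w, pairs)
print(k, round(delta, 3), before, round(after, 3))
```

In this toy run the feature "c" appears only in competitors, so it receives a negative weight and the loss drops. Scanning every feature as done here is exactly what becomes infeasible when the feature set is all subtrees.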

  10. How to find the optimal subtree? • Set of all subtrees is huge • Need to find the optimal subtree efficiently • A variant of Branch-and-Bound • Define a search space in which the whole set of subtrees is given • Find the optimal subtree by traversing this search space • Prune the search space by proposing a criterion

  11. Ad-hoc techniques • Size constraints • Use subtrees whose size is less than s (s = 6~8) • Frequency constraints • Use subtrees that occur no less than f times in training data (f = 2~5) • Pseudo iterations • After every 5 or 10 iterations of boosting, we perform 100 or 300 pseudo iterations, in which the optimal subtree is selected from a cache that holds the features explored in the previous iterations.
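The size and frequency constraints can be sketched as a simple filter. The encoding of a subtree as a tuple of preorder (depth, label) pairs and the choice to count frequency as the number of training trees containing a subtree are assumptions for this example, not details given on the slide.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical encoding: a subtree is a tuple of preorder (depth, label) pairs.
Subtree = Tuple[Tuple[int, str], ...]


def select_candidates(per_tree_subtrees: Iterable[Set[Subtree]],
                      s: int, f: int) -> Set[Subtree]:
    """Keep subtrees with fewer than s nodes that occur in at least f trees."""
    freq: Counter = Counter()
    for subtrees in per_tree_subtrees:
        freq.update(subtrees)  # a set, so each tree counts a subtree once
    return {t for t, n in freq.items() if len(t) < s and n >= f}


np_ = ((0, "NP"),)
np_dt = ((0, "NP"), (1, "DT"))
np_dt_nn = ((0, "NP"), (1, "DT"), (1, "NN"))
training = [{np_, np_dt, np_dt_nn}, {np_, np_dt}, {np_}]
kept = select_candidates(training, s=3, f=2)
print(sorted(kept))
```

Here the three-node subtree is dropped by the size constraint (and would also fail the frequency constraint), while the two smaller ones survive.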

  12. Relation to previous work Boosting vs Kernel methods [Collins 00] Boosting vs Data Oriented Parsing [Bod 98]

  13. Kernels [Collins 00] • Kernel methods reduce the problem to a dual form that depends only on dot products of two instances (parsed trees) • Pros • No need to provide an explicit feature vector • Dynamic programming is used to calculate dot products between trees, which is very efficient! • Cons • Require a large number of kernel evaluations in testing • Parsing is slow • Difficult to see which features are relevant

  14. DOP [Bod 98] • DOP is not based on re-ranking • DOP handles the all-subtrees representation explicitly, like our method • Pros • high accuracy • Cons • exact computation is NP-complete • cannot always provide a sparse feature representation • very slow, since the number of subtrees DOP uses is huge

  15. Kernels vs DOP vs Boosting

  16. Experiments WSJ parsing Shallow parsing

  17. Experiments • WSJ parsing • Standard data: training: sections 2-21, test: section 23 of PTB • Model 2 of [Collins 99] was used to obtain n-best results • exactly the same setting as [Collins 00 (Kernels)] • Shallow parsing • CoNLL 2000 shared task • training: sections 15-18, test: section 20 of PTB • CRF-based parser [Sha 03] was used to obtain n-best results

  18. Tree representations • WSJ parsing • lexicalized tree • each non-terminal has a special node labeled with a head word • Shallow parsing • right-branching tree where adjacent phrases are in a child/parent relation • special nodes for right/left boundaries

  19. Results: WSJ parsing • LR/LP = labeled recall/precision • CBs is the average number of crossing brackets per sentence • 0 CBs and 2 CBs are the percentages of sentences with 0 or at most 2 crossing brackets, respectively • Comparable to other methods • Better than the kernel method, which uses the same all-subtrees representation with different parameter estimation

  20. Results: Shallow parsing • Fβ=1 is the harmonic mean of precision and recall • Comparable to other methods • Our method is also comparable to Zhang’s method even without extra linguistic features
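For reference, the F-measure used on slide 20 can be computed as below. The function name and the zero-division convention are choices made for this sketch.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta = 1 weighs them equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_beta(0.5, 1.0))
```

With β = 1 this reduces to 2PR / (P + R), so a system with precision 0.5 and recall 1.0 scores about 0.667 rather than the arithmetic mean 0.75.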

  21. Advantages • Compact feature set • WSJ parsing: ~ 8,000 • Shallow parsing: ~ 3,000 • Kernels implicitly use a huge number of features • Parsing is very fast • WSJ parsing: 0.055 sec./sentence • Shallow parsing: 0.042 sec./sentence (n-best parsing time is NOT included)

  22. Advantages, cont’d • Sparse feature representations allow us to analyze which kinds of subtrees are relevant [Figure: examples of positive and negative subtrees for WSJ parsing and for shallow parsing]

  23. Conclusions • All subtrees are potentially used as features • Boosting • L1 norm regularization performs automatic feature selection • Branch and bound • enables us to find the optimal subtrees efficiently • Advantages: • comparable accuracy to other parsing methods • fast parsing • good interpretability

  24. Efficient computation

  25. Right most extension [Asai02, Zaki02] • Extend a given tree of size (n-1) by adding a new node to obtain trees of size n • a node is added to the rightmost path • a node is added as the rightmost sibling [Figure: a tree t with labels a, b, c and nodes numbered in preorder, its rightmost path, and the trees obtained by adding node 7 at each position on that path]
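The extension step can be sketched compactly if an ordered labeled tree is stored as a preorder list of (depth, label) pairs, an encoding assumed here for convenience. Appending a node whose depth is between 1 and one more than the depth of the last node attaches it as the rightmost child of some node on the rightmost path.

```python
from typing import List, Tuple

Tree = List[Tuple[int, str]]  # preorder (depth, label) pairs; the root has depth 0


def rightmost_extensions(tree: Tree, labels: List[str]) -> List[Tree]:
    """All trees with one more node obtained by rightmost extension."""
    if not tree:
        return [[(0, lab)] for lab in labels]
    last_depth = tree[-1][0]
    # depth d attaches the new node under the rightmost-path node at depth d - 1
    return [tree + [(d, lab)]
            for d in range(1, last_depth + 2)
            for lab in labels]


def enumerate_trees(labels: List[str], max_size: int) -> List[Tree]:
    """Recursively apply rightmost extension up to max_size nodes."""
    result: List[Tree] = []
    frontier = rightmost_extensions([], labels)
    for _ in range(max_size):
        result.extend(frontier)
        frontier = [e for t in frontier for e in rightmost_extensions(t, labels)]
    return result


t = [(0, "a"), (1, "b"), (2, "c")]
ext = rightmost_extensions(t, ["a", "b"])
size4 = [x for x in enumerate_trees(["a"], 4) if len(x) == 4]
print(len(ext), len(size4))
```

Because every tree has exactly one parent in this scheme (itself minus its last preorder node), repeated extension enumerates each ordered tree once, without duplicates.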

  26. Right most extension, cont. • Recursive applications of right most extensions create a search space

  27. Pruning strategy • For all supertrees t' of t, propose an upper bound μ(t) such that gain(t') ≤ μ(t) • Can prune the node t if μ(t) ≤ τ, where τ is a suboptimal gain • Example: μ(t) = 0.4 implies the gain of any supertree of t is no greater than 0.4
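The pruning rule can be illustrated with a generic branch-and-bound loop. To keep the example small, the search space below is a toy one: patterns are string prefixes instead of subtrees, and the gain and bound functions are stand-ins invented for the sketch, not the bound from the slides.

```python
from typing import Callable, List, Tuple


def branch_and_bound(root: str,
                     children: Callable[[str], List[str]],
                     gain: Callable[[str], float],
                     bound: Callable[[str], float]) -> Tuple[str, float, int]:
    """Depth-first search that skips nodes whose bound cannot beat the best gain."""
    best_node, best_gain, visited = root, float("-inf"), 0
    stack = [root]
    while stack:
        t = stack.pop()
        visited += 1
        g = gain(t)
        if g > best_gain:
            best_node, best_gain = t, g
        if bound(t) <= best_gain:
            continue  # no extension of t can improve on the current best
        stack.extend(children(t))
    return best_node, best_gain, visited


# Toy data: (string, signed weight). A pattern matches strings it is a prefix of.
data = [("ab", 4), ("abc", 3), ("ba", -5), ("bb", 2)]


def matches(t: str) -> List[int]:
    return [w for s, w in data if s.startswith(t)]


def gain(t: str) -> float:
    return abs(sum(matches(t)))


def bound(t: str) -> float:
    # Extensions match a subset of these strings, so the gain of any extension
    # is at most the larger of the positive mass and the negative mass.
    ws = matches(t)
    return max(sum(w for w in ws if w > 0), -sum(w for w in ws if w < 0))


def children(t: str) -> List[str]:
    return [t + c for c in "abc"] if len(t) < 3 else []


best, best_gain, visited = branch_and_bound("", children, gain, bound)
print(best, best_gain, visited)
```

The full toy space has 40 patterns, but the bound lets the search stop after visiting 7 of them, which is the effect the pruning criterion is meant to achieve on the much larger subtree space.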

  28. Upper bound of the gain
