Probabilistic Models of Nonprojective Dependency Trees


Presentation Transcript


  1. Probabilistic Models of Nonprojective Dependency Trees
  Noah A. Smith, Language Technologies Institute and Machine Learning Dept., School of Computer Science, Carnegie Mellon University
  David A. Smith, Center for Language and Speech Processing, Computer Science Dept., Johns Hopkins University
  EMNLP-CoNLL 2007

  2. See Also
  • On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.
  • Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras, and M. Collins. EMNLP-CoNLL 2007. (Coming up next!)

  3. Nonprojective Syntax
  • English: ROOT I 'll give a talk tomorrow on bootstrapping (the edge from "talk" to "on bootstrapping" crosses the edge from "give" to "tomorrow")
  • Latin: ROOT ista meam norit gloria canitiem
    Gloss: that-NOM my-ACC may-know glory-NOM going-gray-ACC
    Translation: "That glory shall last till I go gray"
  • How would we parse this?

  4. Edge-Factored Models (McDonald et al., 2005)
  • Score edges in isolation: a non-negative score s(i, j) for each parent i and child j (unlabeled for now)
  • Decoding: find the maximum spanning tree, i.e. the legal tree with the best total edge score, with Chu-Liu-Edmonds
  • NP-hard to add sibling or degree constraints, or hidden node variables
  • What about training?
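For concreteness, here is a minimal Python/NumPy sketch of an edge-factored score matrix, assuming a log-linear parameterization s(i, j) = exp(w · f(i, j)); the `features` function and its output are hypothetical placeholders, not the feature set of McDonald et al. 2005.

```python
# Minimal sketch of edge-factored scoring; `features` is a hypothetical
# feature function, not the actual McDonald et al. 2005 feature set.
import numpy as np

def edge_scores(sentence, weights, features):
    """Build an (n+1) x (n+1) matrix of non-negative edge scores.

    Row i = parent (0 is ROOT), column j = child; s[i, j] = exp(w . f(i, j)).
    """
    n = len(sentence)
    s = np.zeros((n + 1, n + 1))
    for i in range(n + 1):          # candidate parent (0 = ROOT)
        for j in range(1, n + 1):   # candidate child (ROOT is never a child)
            if i != j:
                s[i, j] = np.exp(weights @ features(sentence, i, j))
    return s
```

Any non-negative scoring function would do here; the exponentiated linear form simply matches the log-linear training discussed later in the talk.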

  5. If Only It Were Projective…
  ROOT I 'll give a talk on bootstrapping tomorrow (the reordered, projective version of the sentence)
  An Inside-Outside algorithm gives us:
  • The normalizing constant for globally normalized models
  • Posterior probabilities of edges
  • Sums over hidden variables
  But we can't use Inside-Outside for nonprojective parsing!

  6. Graph Theory to the Rescue!
  Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) matrix of a directed graph G, with row and column r removed, equals the sum of scores of all directed spanning trees of G rooted at node r.
  Exactly the Z we need, in O(n³) time!

  7. Building the Kirchhoff (Laplacian) Matrix
  • Negate the edge scores (off-diagonal entries)
  • Put each column's sum (the total score into each child) on the diagonal
  • Strike the root row and column
  • Take the determinant
  N.B.: This allows multiple children of the root, but see Koo et al. 2007.
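A minimal NumPy sketch of this construction, assuming the score-matrix convention from the earlier sketch (row = parent, column = child, ROOT at index 0, zero diagonal); this is an illustration, not the authors' implementation.

```python
import numpy as np

def partition_function(s):
    """Return (Z, K_hat): Z is the sum, over all spanning trees rooted at
    ROOT, of the product of their edge scores, by the Matrix-Tree Theorem."""
    n = s.shape[0] - 1
    # Negate edge scores off the diagonal ...
    K = -s.copy()
    # ... and put each child's column sum (total incoming score) on the diagonal.
    K[np.arange(n + 1), np.arange(n + 1)] = s.sum(axis=0)
    # Strike the ROOT row and column, then take the determinant.
    K_hat = K[1:, 1:]
    return np.linalg.det(K_hat), K_hat
```

For real sentences one would typically work with the log-determinant (e.g., numpy.linalg.slogdet) to keep Z from overflowing.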

  8. Why Should This Work?
  • Clear for the 1x1 matrix; use induction
  • Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract them and recur
  • Undirected case is classical; special root cases for the directed version

  9. When You Have a Hammer… the Matrix-Tree Theorem enables:
  • Sequence-normalized log-linear models (Lafferty et al. '01)
  • Minimum Bayes-risk parsing (cf. Goodman '96)
  • Hidden-variable models
  • O(n) inference with length constraints (cf. N. Smith & Eisner '05)
  • Minimum-risk training (D. Smith & Eisner '06)
  • Tree (Rényi) entropy (Hwa '01; S & E '07)

  10. Analogy to Other Models

  11. More Machinery: The Gradient
  • Invert the Kirchhoff matrix K in O(n³) time via LU factorization
  • Since ∂ log Z / ∂ s(i, j) can be read off the entries of K⁻¹, the edge gradient also gives the edge posterior probability: p(edge i → j) = s(i, j) · ∂ log Z / ∂ s(i, j)
  • Use the chain rule to backpropagate into s(i, j), whatever its internal structure may be
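A sketch of reading the edge posteriors off the inverse of the root-struck Kirchhoff matrix, following one standard statement of these identities (e.g., Koo et al. 2007); the exact index conventions here are an assumption, and s is the score matrix from the earlier sketches.

```python
# Edge posteriors from the inverse Kirchhoff matrix; indexing follows one
# common presentation (e.g., Koo et al. 2007) and is an assumption.
import numpy as np

def edge_posteriors(s):
    """post[i, j] = posterior probability that edge i -> j is in the tree."""
    n = s.shape[0] - 1
    K = -s.copy()
    K[np.arange(n + 1), np.arange(n + 1)] = s.sum(axis=0)
    K_inv = np.linalg.inv(K[1:, 1:])       # O(n^3), via LU factorization
    post = np.zeros_like(s)
    for j in range(1, n + 1):              # child j sits at row/col j-1 of the minor
        post[0, j] = s[0, j] * K_inv[j - 1, j - 1]                       # ROOT parent
        for i in range(1, n + 1):                                        # word parent
            if i != j:
                post[i, j] = s[i, j] * (K_inv[j - 1, j - 1] - K_inv[j - 1, i - 1])
    return post
```

A handy sanity check: for every child j, post[:, j].sum() should be 1, since each word has exactly one parent in any tree.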

  12. Nonprojective Conditional Log-Linear Training
  • Data: CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
  • Features from McDonald et al. 2005
  • Compared with MSTParser's MIRA max-margin training
  • Trained the log-linear weights with stochastic gradient descent
  • Same number of iterations and stopping criteria as MIRA
  • Significance assessed with a paired permutation test
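A minimal sketch of one stochastic gradient ascent step on the conditional log-likelihood of a gold tree, reusing the hypothetical edge_scores, edge_posteriors, and features helpers from the earlier sketches; this is not the actual MSTParser training code.

```python
import numpy as np

def sgd_step(weights, sentence, gold_edges, features, learning_rate=0.1):
    """One SGD update; gold_edges is a set of (parent, child) index pairs."""
    s = edge_scores(sentence, weights, features)
    post = edge_posteriors(s)
    grad = np.zeros_like(weights)
    n = len(sentence)
    for i in range(n + 1):
        for j in range(1, n + 1):
            if i == j:
                continue
            f = features(sentence, i, j)
            if (i, j) in gold_edges:
                grad += f                  # observed feature counts
            grad -= post[i, j] * f         # expected feature counts under the model
    return weights + learning_rate * grad  # ascend the log-likelihood
```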

  13. Minimum Bayes-Risk Parsing
  • Select not the tree with the highest probability, but the tree with the most expected correct edges
  • Plug the edge posteriors into MST decoding
  • MIRA doesn't estimate probabilities
  • N.B.: one could do MBR inside MIRA
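As a sketch, MBR decoding under an edge-factored loss amounts to running a maximum spanning arborescence (Chu-Liu-Edmonds) over posterior edge probabilities instead of raw scores; this version leans on networkx's implementation and the edge_posteriors sketch above, and is an illustration rather than the authors' decoder.

```python
import networkx as nx

def mbr_parse(s):
    post = edge_posteriors(s)              # expected "correctness" of each edge
    n = s.shape[0] - 1
    G = nx.DiGraph()
    G.add_nodes_from(range(n + 1))         # node 0 is ROOT
    for i in range(n + 1):
        for j in range(1, n + 1):          # never add edges into ROOT
            if i != j:
                G.add_edge(i, j, weight=post[i, j])
    # Chu-Liu-Edmonds over posteriors; the result is rooted at ROOT because
    # ROOT has no incoming edges.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())            # (parent, child) pairs of the MBR tree
```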

  14. Edge Clustering
  • Supervised labeled dependency parsing: e.g., "Franz loves Milena" with its edges labeled SUBJ and OBJ
  • Or: sum out all possible edge labelings (clusters such as A/B/C or X/Y/Z) if we don't care about the labels per se
  • Simple idea: conjoin each model feature with a cluster

  15. Edge Clustering
  No significant gains or losses from clustering

  16. What's Wrong with Edge Clustering?
  • No interaction: the edge labels/clusters (e.g., A and B on the edges of "Franz loves Milena") are scored independently of one another
  • Unlike clusters on PCFG nonterminals (e.g., Matsuzaki et al. '05), which interact within a rewrite rule such as NP-A → NP-B NP-A
  • Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers

  17. Constraints on Link Length
  • Maximum left/right child distances L and R (cf. Eisner & N. Smith '05)
  • The Kirchhoff matrix becomes band-diagonal once the root row and column are removed
  • Inversion in O(min(L³R², L²R³) n) time
  • (Slide shows an example with L = 1, R = 2)
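A small sketch, under the same score-matrix convention as before, of why the distance constraints make the root-struck Kirchhoff matrix banded; constrained_scores and check_bandwidth are hypothetical illustration helpers, and the banded LU routine that actually delivers the quoted complexity is not shown.

```python
# Zeroing out over-long edges makes K_hat band-diagonal.
# Reuses partition_function from the earlier sketch.
import numpy as np

def constrained_scores(s, L, R):
    """Zero out edges whose parent-child distance violates the constraints.
    ROOT (index 0) edges are left alone, since its row is struck anyway."""
    s = s.copy()
    n = s.shape[0] - 1
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if (j < i and i - j > L) or (j > i and j - i > R):
                s[i, j] = 0.0
    return s

def check_bandwidth(s, L, R):
    """Every nonzero off-diagonal entry of K_hat lies within L below / R above
    the diagonal, so a banded solver applies."""
    _, K_hat = partition_function(constrained_scores(s, L, R))
    n = K_hat.shape[0]
    for a in range(n):          # row a = parent word a+1
        for b in range(n):      # column b = child word b+1
            if a != b and K_hat[a, b] != 0.0:
                assert -R <= a - b <= L, "entry outside the (L, R) band"
```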

  18. Conclusions
  • O(n³) inference for edge-factored nonprojective dependency models
  • Performance closely comparable to MIRA
  • Learned edge clustering doesn't seem to help unlabeled parsing
  • Many other applications to explore

  19. Thanks
  • Jason Eisner, Keith Hall, Sanjeev Khudanpur
  • The anonymous reviewers
  • Ryan McDonald, Michael Collins, and colleagues, for sharing drafts
