Tree Kernels for Parsing: (Collins & Duffy, 2001)


Presentation Transcript


  1. Tree Kernels for Parsing: (Collins & Duffy, 2001) Advanced Statistical Methods in NLP Ling 572 February 28, 2012

  2. Roadmap • Collins & Duffy, 2001 • Tree Kernels for Parsing: • Motivation • Parsing as reranking • Tree kernels for similarity • Case study: Penn Treebank parsing

  3. Motivation: Parsing • Parsing task: • Given a natural language sentence, extract its syntactic structure • Specifically, generate a corresponding parse tree

  4. Motivation: Parsing • Parsing task: • Given a natural language sentence, extract its syntactic structure • Specifically, generate a corresponding parse tree • Approaches:

  5. Motivation: Parsing • Parsing task: • Given a natural language sentence, extract its syntactic structure • Specifically, generate a corresponding parse tree • Approaches: • “Classical” approach: • Hand-write CFG productions; use standard alg, e.g. CKY

  6. Motivation: Parsing • Parsing task: • Given a natural language sentence, extract its syntactic structure • Specifically, generate a corresponding parse tree • Approaches: • “Classical” approach: • Hand-write CFG productions; use standard alg, e.g. CKY • Probabilistic approach: • Build large treebank of parsed sentences • Learn production probabilities • Use probabilistic versions of standard alg • Pick highest probability parse
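
As a minimal sketch of the probabilistic recipe above (the toy treebank, the tuple tree encoding, and the helper names below are illustrative assumptions, not course material): rule probabilities are relative frequencies estimated from the treebank, and a parse is scored by the product of its rule probabilities.

```python
from collections import Counter

# Toy "treebank": each tree is (label, children...); leaves are plain strings.
treebank = [
    ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark"))),
    ("S", ("NP", ("DT", "the"), ("N", "dog")), ("VP", ("V", "barks"))),
]

def rules(tree):
    """Yield the (lhs, rhs) CFG productions used in a tree."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

# Maximum likelihood estimates: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
rule_counts = Counter(r for t in treebank for r in rules(t))
lhs_counts = Counter()
for (lhs, _), c in rule_counts.items():
    lhs_counts[lhs] += c

def tree_prob(tree):
    """PCFG score of a parse: product of its rule probabilities."""
    p = 1.0
    for lhs, rhs in rules(tree):
        p *= rule_counts[(lhs, rhs)] / lhs_counts[lhs]
    return p

# The reranking view below keeps the n best such parses and rescores them.
print(tree_prob(treebank[0]))  # 0.125 for this toy grammar
```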

  7. Parsing Issues • Main issues:

  8. Parsing Issues • Main issues: • Robustness: get reasonable parse for any input • Ambiguity: select best parse given alternatives • “Classic” approach:

  9. Parsing Issues • Main issues: • Robustness: get reasonable parse for any input • Ambiguity: select best parse given alternatives • “Classic” approach: • Hand-coded grammars often fragile • No obvious mechanism to select among alternatives • Probabilistic approach:

  10. Parsing Issues • Main issues: • Robustness: get reasonable parse for any input • Ambiguity: select best parse given alternatives • “Classic” approach: • Hand-coded grammars often fragile • No obvious mechanism to select among alternatives • Probabilistic approach: • Fairly good robustness: assigns some (possibly small) probability to any input • Select by probability, but decisions are local • Hard to capture more global structure

  11. Approach: Parsing by Reranking • Intuition: • Identify collection of candidate parses for each sentence • e.g. output of PCFG parser • For training, identify gold standard parse, sentence pair

  12. Approach: Parsing by Reranking • Intuition: • Identify collection of candidate parses for each sentence • e.g. output of PCFG parser • For training, identify gold standard parse, sentence pair • Create a parse tree vector representation • Identify (more global) parse tree features

  13. Approach: Parsing by Reranking • Intuition: • Identify collection of candidate parses for each sentence • e.g. output of PCFG parser • For training, identify gold standard parse, sentence pair • Create a parse tree vector representation • Identify (more global) parse tree features • Train a reranker to rank gold standard highest

  14. Approach: Parsing by Reranking • Intuition: • Identify collection of candidate parses for each sentence • e.g. output of PCFG parser • For training, identify gold standard parse, sentence pair • Create a parse tree vector representation • Identify (more global) parse tree features • Train a reranker to rank gold standard highest • Apply to rerank candidate parses for new sentence

  15. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree

  16. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree

  17. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree • C(si) = {xij}: • Candidate parses for si • wlog, xi1 is the correct parse for si

  18. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree • C(si) = {xij}: • Candidate parses for si • wlog, xi1 is the correct parse for si • h(xij): feature vector representation of xij

  19. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree • C(si) = {xij}: • Candidate parses for si • wlog, xi1 is the correct parse for si • h(xij): feature vector representation of xij • Training: Learn

  20. Parsing as Reranking, Formally • Training data pairs: {(si,ti)} • where si is a sentence, ti is a parse tree • C(si) = {xij}: • Candidate parses for si • wlog, xi1 is the correct parse for si • h(xij): feature vector representation of xij • Training: Learn • Decoding: Compute
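
The equations behind “Training: Learn” and “Decoding: Compute” were images on the slides and are missing from this transcript. Reconstructed from the definitions above (a reconstruction, not the original slide content), the linear model is:

```latex
% f scores a candidate parse; decoding picks the best-scoring candidate.
\[
  f(x) \;=\; \mathbf{w} \cdot \mathbf{h}(x),
  \qquad
  \hat{x} \;=\; \arg\max_{x \in C(s)} \; \mathbf{w} \cdot \mathbf{h}(x)
\]
```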

  21. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints

  22. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints • What constraints?

  23. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints • What constraints? • Here, ranking constraints: • Specifically, correct parse outranks all other candidates

  24. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints • What constraints? • Here, ranking constraints: • Specifically, correct parse outranks all other candidates • Formally,

  25. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints • What constraints? • Here, ranking constraints: • Specifically, correct parse outranks all other candidates • Formally,

  26. Parsing as Reranking: Training • Consider the hard-margin SVM model: • Minimize ||w||² subject to constraints • What constraints? • Here, ranking constraints: • Specifically, correct parse outranks all other candidates • Formally,
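
The trailing “Formally,” on slides 24–26 pointed at constraints that the transcript did not capture. A reconstruction from the surrounding definitions (the correct parse must outscore every other candidate by a fixed margin):

```latex
\[
  \min_{\mathbf{w}} \;\|\mathbf{w}\|^{2}
  \quad\text{subject to}\quad
  \mathbf{w}\cdot\mathbf{h}(x_{i1}) - \mathbf{w}\cdot\mathbf{h}(x_{ij}) \;\ge\; 1
  \qquad \forall i,\ \forall j \ge 2
\]
```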

  27. Reformulating with α • Training learns αij, such that

  28. Reformulating with α • Training learns αij, such that • Note: just like SVM equation, w/different constraint

  29. Reformulating with α • Training learns αij, such that • Note: just like SVM equation, w/different constraint • Parse scoring:

  30. Reformulating with α • Training learns αij, such that • Note: just like SVM equation, w/different constraint • Parse scoring: • After substitution, we have

  31. Reformulating with α • Training learns αij, such that • Note: just like SVM equation, w/different constraint • Parse scoring: • After substitution, we have • After the kernel trick, we have

  32. Reformulating with α • Training learns αij, such that • Note: just like SVM equation, w/different constraint • Parse scoring: • After substitution, we have • After the kernel trick, we have • Note: With a suitable kernel K, we never need to compute the h(x) vectors explicitly
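
The equations on slides 27–32 are likewise missing from the transcript. Under the usual dual representation (a reconstruction consistent with the bullets, not copied from the slides), w is a weighted sum of difference vectors, and after substitution and the kernel trick the parse score needs only kernel evaluations:

```latex
\[
  \mathbf{w} \;=\; \sum_{i}\sum_{j\ge 2} \alpha_{ij}\,\bigl(\mathbf{h}(x_{i1})-\mathbf{h}(x_{ij})\bigr)
\]
\[
  f(x) \;=\; \mathbf{w}\cdot\mathbf{h}(x)
       \;=\; \sum_{i}\sum_{j\ge 2} \alpha_{ij}\,\bigl(K(x_{i1},x)-K(x_{ij},x)\bigr)
\]
```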

  33. Parsing as reranking: Perceptron algorithm • Similar to SVM, learns separating hyperplane

  34. Parsing as reranking: Perceptron algorithm • Similar to SVM, learns separating hyperplane • Modeled with weight vector w • Using simple iterative procedure • Based on correcting errors in current model

  35. Parsing as reranking: Perceptron algorithm • Similar to SVM, learns separating hyperplane • Modeled with weight vector w • Using simple iterative procedure • Based on correcting errors in current model • Initialize αij=0

  36. Parsing as reranking: Perceptron algorithm • Similar to SVM, learns separating hyperplane • Modeled with weight vector w • Using simple iterative procedure • Based on correcting errors in current model • Initialize αij=0 • For i=1,…,n; for j=2,…,ni (over the non-gold candidates of si) • If f(xi1) > f(xij): continue • Else: αij += 1
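
A runnable sketch of the update above, with the correct parse stored at index 0 of each candidate list and a plain dot product standing in for the tree kernel; the data layout and names are illustrative assumptions, not the authors' code.

```python
def train_perceptron(candidates, kernel, epochs=1):
    """candidates[i] lists the candidate parses for sentence i, with
    candidates[i][0] the correct parse x_i1.  Returns (alpha, score)."""
    alpha = [[0.0] * len(c) for c in candidates]

    def score(x):
        # f(x) = sum_{i, j>=2} alpha_ij * (K(x_i1, x) - K(x_ij, x))
        s = 0.0
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                if alpha[i][j]:
                    s += alpha[i][j] * (kernel(cands[0], x) - kernel(cands[j], x))
        return s

    for _ in range(epochs):
        for i, cands in enumerate(candidates):
            for j in range(1, len(cands)):
                # Correct parse already outranks this candidate: no update.
                if score(cands[0]) > score(cands[j]):
                    continue
                alpha[i][j] += 1.0
    return alpha, score

# Toy usage: "parses" are feature dicts, the kernel is a dot product.
dot = lambda a, b: sum(a.get(k, 0) * b.get(k, 0) for k in a)
data = [[{"good": 2, "bad": 0}, {"good": 0, "bad": 2}],
        [{"good": 1, "bad": 1}, {"good": 0, "bad": 3}]]
alpha, score = train_perceptron(data, dot, epochs=3)
print(alpha)  # gold parses now outscore the alternatives
```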

  37. Defining the Kernel • So, we have a model: • Framework for training • Framework for decoding

  38. Defining the Kernel • So, we have a model: • Framework for training • Framework for decoding • But need to define a kernel K

  39. Defining the Kernel • So, we have a model: • Framework for training • Framework for decoding • But need to define a kernel K • We need: K: X × X → R

  40. Defining the Kernel • So, we have a model: • Framework for training • Framework for decoding • But need to define a kernel K • We need: K: X × X → R • Recall that each x ∈ X is a parse tree, and K is a similarity function over trees

  41. What’s in a Kernel? • What are good attributes of a kernel?

  42. What’s in a Kernel? • What are good attributes of a kernel? • Capture similarity between instances • Here, between parse trees

  43. What’s in a Kernel? • What are good attributes of a kernel? • Capture similarity between instances • Here, between parse trees • Capture more global parse information than PCFG

  44. What’s in a Kernel? • What are good attributes of a kernel? • Capture similarity between instances • Here, between parse trees • Capture more global parse information than PCFG • Computable tractably, even over complex, large trees

  45. Tree Kernel Proposal • Idea: • PCFG models learn MLE probabilities on rewrite rules • NP → N vs NP → DT N vs NP → PN vs NP → DT JJ N • Local to parent:children levels

  46. Tree Kernel Proposal • Idea: • PCFG models learn MLE probabilities on rewrite rules • NP → N vs NP → DT N vs NP → PN vs NP → DT JJ N • Local to parent:children levels • New measure incorporates all tree fragments in parse • Captures higher-order, longer-distance dependencies • Track counts of individual rules + much more
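
Stated as a formula (a paraphrase of this proposal, anticipating the representation h(T) defined two slides below): the kernel is the inner product of fragment-count vectors, which is equivalent to summing, over all pairs of nodes, the number of common fragments rooted at those nodes.

```latex
% N_1, N_2: node sets of T_1, T_2; C(n_1, n_2): common fragments rooted there.
\[
  K(T_1, T_2) \;=\; \mathbf{h}(T_1)\cdot\mathbf{h}(T_2)
             \;=\; \sum_{i} h_i(T_1)\,h_i(T_2)
             \;=\; \sum_{n_1 \in N_1}\sum_{n_2 \in N_2} C(n_1, n_2)
\]
```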

  47. Tree Fragment Example • Fragments of NP over ‘apple’ • Not exhaustive

  48. Tree Representation • Tree fragments: • Any subgraph with more than one node • Restriction: Must include full (not partial) rule productions • Parse tree representation: • h(T) = (h1(T), h2(T), …, hn(T)) • n: number of distinct tree fragments in training data • hi(T): # of occurrences of the i-th tree fragment in current tree

  49. Tree Representation • Tree fragments: • Any subgraph with more than one node • Restriction: Must include full (not partial) rule productions • Parse tree representation: • h(T) = (h1(T), h2(T), …, hn(T)) • n: number of distinct tree fragments in training data • hi(T): # of occurrences of the i-th tree fragment in current tree
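
Because h(T) has one dimension per distinct fragment, it is far too large to build explicitly; Collins & Duffy instead compute K(T1, T2) with a recursion over node pairs that counts common fragments rooted at each pair (covered later in the deck). Below is a sketch of that recursion under an assumed (label, children...) tuple encoding of trees; the encoding and helper names are illustrative.

```python
def nodes(tree):
    """Yield every internal node of a (label, children...) tuple tree;
    leaf words are plain strings."""
    _, *children = tree
    yield tree
    for c in children:
        if not isinstance(c, str):
            yield from nodes(c)

def rule(node):
    """The CFG production at a node: (label, (child labels...))."""
    label, *children = node
    return label, tuple(c if isinstance(c, str) else c[0] for c in children)

def common_fragments(n1, n2):
    """C(n1, n2): number of common tree fragments rooted at n1 and n2."""
    if rule(n1) != rule(n2):
        return 0
    prod = 1
    # Each non-terminal child can be cut off (the +1) or expanded with any of
    # the common fragments rooted at the corresponding child pair; matching
    # pre-terminals (all children are words) contribute exactly 1.
    for k1, k2 in zip(n1[1:], n2[1:]):
        if not isinstance(k1, str):
            prod *= 1 + common_fragments(k1, k2)
    return prod

def tree_kernel(t1, t2):
    """K(T1, T2): sum of C(n1, n2) over all node pairs."""
    return sum(common_fragments(n1, n2)
               for n1 in nodes(t1) for n2 in nodes(t2))

# Toy usage with two identical small NPs.
t1 = ("NP", ("DT", "the"), ("N", "apple"))
t2 = ("NP", ("DT", "the"), ("N", "apple"))
print(tree_kernel(t1, t2))  # 6: four fragments rooted at NP plus the two pre-terminals
```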

  50. Tree Representation • Pros:
