
Corpora and Statistical Methods Lecture 11


Presentation Transcript


  1. Corpora and Statistical Methods Lecture 11 Albert Gatt

  2. Part 2 Statistical parsing

  3. Preliminary issues How parsers are evaluated

  4. Evaluation • The issue: • what objective criterion are we trying to maximise? • i.e. under what objective function can I say that my parser does “well” (and how well?) • need a gold standard • Possibilities: • strict match of candidate parse against gold standard • match of components of candidate parse against gold standard components

  5. Evaluation • A classic evaluation metric is the PARSEVAL one • initiative to compare parsers on the same data • not initially concerned with stochastic parsers • evaluate parser output piece by piece • Main components: • compares gold standard tree to parser tree • typically, gold standard is the tree in a treebank • computes: • precision • recall • crossing brackets

  6. PARSEVAL: labeled recall • A correct node is a node in the candidate parse which: • has the same node label (labels were originally omitted from PARSEVAL to avoid theoretical conflict) • spans the same words as the corresponding node in the gold standard tree

  7. PARSEVAL: labeled precision • The proportion of correctly labelled and correctly spanning nodes in the candidate parse.
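The formulas themselves do not survive in the transcript; in standard PARSEVAL terms they are:

\text{labeled recall} = \frac{\text{number of correct nodes in the candidate parse}}{\text{number of nodes in the gold standard parse}}

\text{labeled precision} = \frac{\text{number of correct nodes in the candidate parse}}{\text{number of nodes in the candidate parse}}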

  8. Combining Precision and Recall • As usual, Precision and recall can be combined into a single F-measure:
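The formula is not reproduced in the transcript; it is the usual weighted harmonic mean of precision P and recall R, which with equal weighting (β = 1) gives F1:

F_{\beta} = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}, \qquad F_1 = \frac{2 P R}{P + R}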

  9. PARSEVAL: crossed brackets • number of brackets in the candidate parse which cross brackets in the treebank parse • e.g. treebank has ((X Y) Z) and candidate has (X (Y Z)) • Unlike precision/recall, this is an objective function to minimise

  10. Current performance • Current parsers achieve: • ca. 90% precision • >90% recall • 1% cross-bracketed constituents

  11. Some issues with PARSEVAL • These measures evaluate parses at the level of individual decisions (nodes). • ignore the difficulty of getting a globally correct solution by carrying out a correct sequence of decisions • Success on crossing brackets depends on the kind of parse trees used • Penn Treebank has very flat trees (not much embedding), therefore likelihood of crossed brackets decreases. • In PARSEVAL, if a constituent is attached lower in a tree than the gold standard, all its daughters are counted wrong.

  12. Probabilistic parsing with PCFGs The basic algorithm

  13. The basic PCFG parsing algorithm • Many statistical parsers use a version of the CYK algorithm. • Assumptions: • CFG productions are in Chomsky Normal Form: • A → B C • A → a • Use indices between words: • Book the flight through Houston • (0) Book (1) the (2) flight (3) through (4) Houston (5) • Procedure (bottom-up): • Traverse input sentence left-to-right • Use a chart to store constituents and their span + their probability.

  14. Probabilistic CYK: example PCFG • S → NP VP [.80] • NP → Det N [.30] • VP → V NP [.20] • V → includes [.05] • Det → the [.4] • Det → a [.4] • N → meal [.01] • N → flight [.02]

  15. Probabilistic CYK: initialisation
//Lexical lookup:
for j = 1 to length(string) do:
    chart[j-1, j] := {X : X → word_j in G}
    //Syntactic lookup:
    for i = j-2 downto 0 do:
        chart[i, j] := {}
        for k = i+1 to j-1 do:
            for each A → B C in G do:
                if B ∈ chart[i, k] and C ∈ chart[k, j]:
                    chart[i, j] := chart[i, j] ∪ {A}
• The flight includes a meal.

  16.–24. Probabilistic CYK: lexical and syntactic steps (worked example) • These slides step through the chart for "The flight includes a meal." using the pseudocode above, alternating lexical steps (filling chart[j-1, j] with the pre-terminal for each word, e.g. Det for the, N for flight) and syntactic steps (combining adjacent spans with the binary rules: NP from Det + N, VP from V + NP, and finally S from NP + VP over the whole sentence). The chart diagrams shown on the slides are not reproduced in the transcript.

  25. Probabilistic CYK: summary • Cells in chart hold probabilities • Bottom-up procedure computes probability of a parse incrementally. • To obtain parse trees, cells need to be augmented with backpointers.
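As an illustration of slides 13–25, here is a minimal probabilistic CYK sketch in Python using the example grammar from slide 14, with Viterbi-style bookkeeping and backpointers. The grammar encoding, the function name pcyk, and the chart representation are our own choices, not taken from the lecture.

from collections import defaultdict

# Example PCFG from slide 14, assumed to be in Chomsky Normal Form.
# binary_rules maps (B, C) to the parents A that rewrite as B C, with rule probabilities;
# lexical_rules maps a word to its pre-terminals with probabilities.
binary_rules = {
    ("NP", "VP"): [("S", 0.80)],
    ("Det", "N"): [("NP", 0.30)],
    ("V", "NP"):  [("VP", 0.20)],
}
lexical_rules = {
    "includes": [("V", 0.05)],
    "the":      [("Det", 0.4)],
    "a":        [("Det", 0.4)],
    "meal":     [("N", 0.01)],
    "flight":   [("N", 0.02)],
}

def pcyk(words):
    n = len(words)
    chart = defaultdict(dict)   # chart[(i, j)][A] = (best probability, backpointer)
    for j in range(1, n + 1):
        # Lexical step: chart[j-1, j] holds the pre-terminals for word j.
        for A, p in lexical_rules.get(words[j - 1], []):
            chart[(j - 1, j)][A] = (p, words[j - 1])
        # Syntactic step: combine adjacent spans [i, k] and [k, j] with binary rules.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for B, (p_b, _) in list(chart[(i, k)].items()):
                    for C, (p_c, _) in list(chart[(k, j)].items()):
                        for A, p in binary_rules.get((B, C), []):
                            prob = p * p_b * p_c
                            # Keep only the most probable analysis of A over [i, j].
                            if A not in chart[(i, j)] or prob > chart[(i, j)][A][0]:
                                chart[(i, j)][A] = (prob, (B, C, k))
    return chart

chart = pcyk("the flight includes a meal".split())
print(chart[(0, 5)]["S"])   # probability of S spanning the whole sentence, plus backpointer

Following the backpointers from S over span (0, 5) recovers the most probable parse tree, which is how the "augmented with backpointers" point on slide 25 is usually realised.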

  26. Probabilistic parsing with lexicalised PCFGs Main approaches (focus on Collins (1997,1999)) see also: Charniak (1997)

  27. Unlexicalised PCFG Estimation • Charniak (1996) used Penn Treebank POS and phrasal categories to induce a maximum likelihood PCFG • only used relative frequency of local trees as the estimates for rule probabilities • did not apply smoothing or any other techniques • Works surprisingly well: • 80.4% recall; 78.8% precision (crossed brackets not estimated) • Suggests that most parsing decisions are mundane and can be handled well by unlexicalized PCFG
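For reference, the relative-frequency (maximum likelihood) estimate referred to here is simply the count of a local tree (rule) divided by the count of its left-hand side:

P(A \rightarrow \beta \mid A) = \frac{C(A \rightarrow \beta)}{\sum_{\gamma} C(A \rightarrow \gamma)} = \frac{C(A \rightarrow \beta)}{C(A)}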

  28. Probabilistic lexicalised PCFGs • Standard format of lexicalised rules: • associate head word with non-terminal, e.g. for dumped sacks into: VP(dumped) → VBD(dumped) NP(sacks) PP(into) • associate head tag with non-terminal: VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN) • Types of rules: • lexical rules expand pre-terminals to words: • e.g. NNS(sacks,NNS) → sacks • probability is always 1 • internal rules expand non-terminals • e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
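A rough sketch of how such lexicalised categories and the two rule types might be represented; the class name LexCat and this encoding are illustrative only, not Collins' implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class LexCat:
    label: str   # phrasal or POS category, e.g. "VP"
    word: str    # head word, e.g. "dumped"
    tag: str     # head POS tag, e.g. "VBD"

# Internal rule: VP(dumped,VBD) -> VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
lhs = LexCat("VP", "dumped", "VBD")
rhs = [LexCat("VBD", "dumped", "VBD"),
       LexCat("NP", "sacks", "NNS"),
       LexCat("PP", "into", "IN")]

# Lexical rule: NNS(sacks,NNS) -> sacks, with probability 1
lexical_rule = (LexCat("NNS", "sacks", "NNS"), "sacks", 1.0)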

  29. Estimating probabilities • Non-generative model: • take an MLE estimate of the probability of an entire rule • non-generative models suffer from serious data sparseness problems • Generative model: • estimate the probability of a rule by breaking it up into sub-rules.

  30. Collins Model 1 • Main idea: • represent CFG rules as expansions into Head + left modifiers + right modifiers • Li/Ri is of the form L/R(word,tag); e.g. NP(sacks,NNS) • STOP: special symbol indicating left/right boundary. • Parsing: • Given the LHS, generate the head of the rule, then the left modifiers (until STOP) and right modifiers (until STOP) inside-out. • Each step has a probability.
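The decomposition can be written (roughly following Collins 1997) as follows: for an internal rule P(h) → L_n(l_n) … L_1(l_1) H(h) R_1(r_1) … R_m(r_m), where h = (hw, ht) and L_{n+1} = R_{m+1} = STOP,

P(\text{RHS} \mid P, h) = P_H(H \mid P, h) \times \prod_{i=1}^{n+1} P_L\big(L_i(l_i) \mid P, H, h\big) \times \prod_{i=1}^{m+1} P_R\big(R_i(r_i) \mid P, H, h\big)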

  31.–34. Collins Model 1: example VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN) • (1) Head H(hw,ht) • (2) Left modifiers (generated until STOP) • (3) Right modifiers (generated until STOP) • Total probability: multiplication of (1)–(3). The individual probability expressions shown on these slides are not reproduced in the transcript; see the reconstruction below.
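For this rule, steps (1)–(3) presumably multiply out along the following lines (a reconstruction in the standard textbook notation; the slides' actual expressions are not in the transcript):

P(\text{rule}) = P_H(\text{VBD} \mid \text{VP}, \text{dumped}, \text{VBD}) \times P_L(\text{STOP} \mid \text{VP}, \text{VBD}, \text{dumped}) \times P_R(\text{NP(sacks,NNS)} \mid \text{VP}, \text{VBD}, \text{dumped}) \times P_R(\text{PP(into,IN)} \mid \text{VP}, \text{VBD}, \text{dumped}) \times P_R(\text{STOP} \mid \text{VP}, \text{VBD}, \text{dumped})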

  35. Variations on Model 1: distance • Collins proposed to extend the rules by conditioning on the distance of modifiers from the head: • a function of the yield of the modifiers generated so far • e.g. the distance used for the R2 probability is the string of words under R1

  36. Using a distance function • Simplest kind of distance function is a tuple of binary features: • Is the string of length 0? • Does the string contain a verb? • … • Example uses: • if the string has length 0, P_R should be higher: • English is right-branching & most right modifiers are adjacent to the head verb • if the string contains a verb, P_R should be higher: • accounts for the preference to attach dependencies to the main verb
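A minimal sketch of such a binary feature tuple; the function name distance_features and the VERB_TAGS set are our own illustrative choices, not Collins' code.

# Hypothetical helper: binary distance features over the intervening yield.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def distance_features(intervening):
    """Binary distance features for the string of (word, tag) pairs already
    generated between the head and the modifier currently being generated."""
    is_adjacent = len(intervening) == 0                             # string of length 0?
    contains_verb = any(tag in VERB_TAGS for _, tag in intervening) # does it contain a verb?
    return (is_adjacent, contains_verb)

print(distance_features([]))                     # (True, False): modifier adjacent to the head
print(distance_features([("sacks", "NNS")]))     # (False, False)
print(distance_features([("dumped", "VBD")]))    # (False, True): a verb intervenes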

  37. Further additions • Collins Model 2: • subcategorisation preferences • distinction between complements and adjuncts. • Model 3 augmented to deal with long-distance (WH) dependencies.

  38. Smoothing and backoff • Rules may condition on words that never occur in the training data. • Collins used a 3-level backoff model, combined using linear interpolation: • level 1: condition on the head word • level 2: condition on the head tag • level 3: condition on the parent category only
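One way to write the 3-level interpolation (in the style of Collins' thesis; the λ weights are set from training counts):

e = \lambda_1 e_1 + (1 - \lambda_1)\big(\lambda_2 e_2 + (1 - \lambda_2)\, e_3\big)

where e_1 is the estimate conditioned on the head word, e_2 the estimate conditioned on the head tag, and e_3 the estimate conditioned on the parent category only.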

  39. Other parsing approaches

  40. Data-oriented parsing • Alternative to “grammar-based” models • does not attempt to derive a grammar from a treebank • treebank data is stored as fragments of trees • parser uses whichever tree fragments seem to be useful

  41. Data-oriented parsing • Suppose we want to parse Sue heard Jim. • Corpus contains the following potentially useful fragments: Parser can combine these to give a parse

  42. Data-oriented Parsing • Multiple fundamentally distinct derivations of a single tree. • Parse using Monte Carlo simulation methods: • randomly produce a large sample of derivations • use these to find the most probable parse • disadvantage: needs very large samples to make parses accurate, therefore potentially slow

  43. Data-oriented parsing vs. PCFGs • Possible advantages: • using partial trees directly accounts for lexical dependencies • also accounts for multi-word expressions and idioms (e.g. take advantage of) • while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth • Similarities to PCFG: • tree fragments of depth 1 are equivalent to PCFG rules • probabilities estimated for grammar rules are exactly the same as for the corresponding tree fragments

  44. History Based Grammars (HBG) • General idea: any derivational step can be influenced by any earlier derivational step • (Black et al. 1993) • the probability of expansion of the current node conditioned on all previous nodes along the path from the root

  45. History Based Grammars (HBG) • Black et al. lexicalise their grammar: • every phrasal node inherits 2 words: • its lexical head H1 • a secondary head H2, deemed to be useful • e.g. the PP in the bank might have H1 = in and H2 = bank • Every non-terminal is also assigned: • a syntactic category (Syn), e.g. PP • a semantic category (Sem), e.g. with-Data • An index I indicates which child of the parent node is being expanded

  46. HBG Example (Black et al 1993)

  47. History Based Grammars (HBG) • Estimation of the probability of a rule R: • probability of: • the current rule R to be applied • its Syn and Sem category • its heads H1 and H2 • conditioned on: • Syn and Sem of parent node • the rule that gave rise to the parent • the index of this child relative to the parent • the heads H1 and H2 of the parent
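Schematically, in the notation usually used for Black et al. (1993), the probability estimated for each expansion is:

P\big(\text{Syn}, \text{Sem}, R, H_1, H_2 \mid \text{Syn}_{p}, \text{Sem}_{p}, R_{p}, I_{pc}, H_{1p}, H_{2p}\big)

where the subscript p marks the parent node and I_{pc} is the index of the current child relative to the parent.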

  48. Summary • This concludes our overview of statistical parsing • We’ve looked at three important models • Also considered basic search techniques and algorithms
