

  1. Corpora and Statistical Methods Lecture 11 Albert Gatt

  2. Part 1 Probabilistic Context-Free Grammars and beyond

  3. Context-free grammars: reminder • Many NLP parsing applications rely on the CFG formalism • Definition: • CFG is a 4-tuple: (N, Σ, P, S): • N = a set of non-terminal symbols (e.g. NP, VP) • Σ = a set of terminals (e.g. words) • N and Σ are disjoint • P = a set of productions of the form A → β • A ∈ N • β ∈ (N ∪ Σ)* (any string of terminals and non-terminals) • S = a designated start symbol (usually, “sentence”)

  4. CFG Example • S → NP VP • S → Aux NP VP • NP → Det Nom • NP → Proper-Noun • Det → that | the | a • …

  5. Probabilistic CFGs • A CFG where each production has an associated probability • PCFG is a 5-tuple: (N, Σ, P, S, D): • D: P → [0,1], a function assigning each rule in P a probability • usually, probabilities are obtained from a corpus • most widely used corpus is the Penn Treebank
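A minimal sketch of how such a 5-tuple could be written down as plain Python data; the rules and probabilities below are a toy fragment invented for illustration, not estimates from the Penn Treebank.

# Toy PCFG: P and D are folded together, mapping each LHS to its
# possible expansions and their probabilities.
pcfg = {
    "S":   [(("NP", "VP"), 0.8), (("Aux", "NP", "VP"), 0.2)],
    "NP":  [(("Det", "Nom"), 0.6), (("Proper-Noun",), 0.4)],
    "Det": [(("that",), 0.3), (("the",), 0.5), (("a",), 0.2)],
}
start_symbol = "S"                    # S
nonterminals = set(pcfg)              # N (fragment)
terminals = {"that", "the", "a"}      # Σ (fragment)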

  6. The Penn Treebank • English sentences annotated with syntax trees • built at the University of Pennsylvania • 40,000 sentences, about a million words • text from the Wall Street Journal • Other treebanks exist for other languages (e.g. NEGRA for German)

  7. Example tree

  8. Building a tree: rules • [Parse tree for “Mr Vinken is chairman of Elsevier”, built by applying the rules below] • S → NP VP • NP → NNP NNP • NNP → Mr • NNP → Vinken • …

  9. Characteristics of PCFGs • In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β. • e.g. the likelihood that S → NP VP • (as opposed to S → VP, or S → NP VP PP, or… ) • can be interpreted as a conditional probability: • probability of the expansion, given the LHS non-terminal • P(A → β) = P(A → β | A) • Therefore, for any non-terminal A, probabilities of every rule of the form A → β must sum to 1 • If this is the case, we say the PCFG is consistent
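The consistency condition is easy to test mechanically. A small sketch over the toy dictionary representation introduced above (function name and representation are my own, not from the lecture):

def is_consistent(pcfg, tol=1e-9):
    """For every non-terminal, the probabilities of its expansions must sum to 1."""
    return all(
        abs(sum(prob for _, prob in expansions) - 1.0) <= tol
        for expansions in pcfg.values()
    )

# True for the toy grammar sketched on slide 5:
assert is_consistent({"S": [(("NP", "VP"), 0.8), (("Aux", "NP", "VP"), 0.2)]})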

  10. Uses of probabilities in parsing • Disambiguation: given n legal parses of a string, which is the most likely? • e.g. PP-attachment ambiguity can be resolved this way • Speed: parsing is a search problem • search through space of possible applicable derivations • search space can be pruned by focusing on the most likely sub-parses of a parse • Parser can be used as a model to determine the probability of a sentence, given a parse • typical use in speech recognition, where input utterance can be “heard” as several possible sentences

  11. Using PCFG probabilities • PCFG assigns a probability to every parse-tree t of a string W • e.g. every possible parse (derivation) of a sentence recognised by the grammar • Notation: • G = a PCFG • s = a sentence • t = a particular tree under our grammar • t consists of several nodes n • each node is generated by applying some rule r

  12. Probability of a tree vs. a sentence • P(t) is simply the multiplication of the probability of every rule (node) that gives rise to t (i.e. the derivation of t) • this is both the joint probability of t and s, and the probability of t alone • why?

  13. P(t,s) = P(s|t)P(t) = P(t) • since P(s|t) must be 1: the tree t is a parse of all the words of s, so s is fully determined by t
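Written out in LaTeX (my reconstruction; the formulas on these two slides appear as images in the original deck):

P(t) = \prod_{n \in t} P\big(r(n)\big), \qquad
P(t, s) = P(s \mid t)\, P(t) = P(t) \quad \text{since } P(s \mid t) = 1 .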

  14. Picking the best parse in a PCFG • A sentence will usually have several parses • we usually want them ranked, or only want the n-best parses • we need to focus on P(t|s,G) • probability of a parse, given our sentence and our grammar • definition of the best parse for s:
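The definition the slide refers to is an image in the original; a standard reconstruction, using the fact (from slide 13) that P(t,s) = P(t):

\hat{t}(s) = \operatorname*{arg\,max}_{t:\, \mathrm{yield}(t) = s} P(t \mid s, G)
           = \operatorname*{arg\,max}_{t:\, \mathrm{yield}(t) = s} \frac{P(t, s \mid G)}{P(s \mid G)}
           = \operatorname*{arg\,max}_{t:\, \mathrm{yield}(t) = s} P(t \mid G)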

  15. Picking the best parse in a PCFG • Problem: t can have multiple derivations • e.g. expand left-corner nodes first, expand right-corner nodes first etc • so P(t|s,G) should be estimated by summing over all possible derivations • Fortunately, derivation order makes no difference to the final probabilities. • can assume a “canonical derivation” d of t • P(t) =def P(d)

  16. Probability of a sentence • Simply the sum of probabilities of all parses of that sentence, i.e. of all those trees which “yield” s • since s is only a sentence if it’s recognised by G, i.e. if there is some t for s under G
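The corresponding formula (again a reconstruction of an image on the slide):

P(s) = \sum_{t:\, \mathrm{yield}(t) = s} P(t)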

  17. Flaws I: Structural independence • Probability of a rule r expanding node n depends only on n. • Independent of other non-terminals • Example: • P(NP → Pro) is independent of where the NP is in the sentence • but we know that NP → Pro is much more likely in subject position • Francis et al (1999) using the Switchboard corpus: • 91% of subjects are pronouns; • only 34% of objects are pronouns

  18. Flaws II: lexical independence • vanilla PCFGs ignore lexical material • e.g. P(VP → V NP PP) is independent of the head of NP or PP or the lexical head V • Examples: • prepositional phrase attachment preferences depend on lexical items; cf: • dump [sacks into a bin] • dump [sacks] [into a bin] (preferred parse) • coordination ambiguity: • [dogs in houses] and [cats] • [dogs] [in houses and cats]

  19. Weakening the independence assumptions in PCFGs

  20. Lexicalised PCFGs • Attempt to weaken the lexical independence assumption. • Most common technique: • mark each phrasal head (N,V, etc) with the lexical material • this is based on the idea that the most crucial lexical dependencies are between head and dependent • E.g.: Charniak 1997, Collins 1999

  21. Lexicalised PCFGs: Matt walks • Makes probabilities partly dependent on lexical content. • P(VP → VBD | VP) becomes: P(VP → VBD | VP, h(VP)=walk) • NB: normally, we can’t assume that all heads of a phrase of category C are equally probable. • [Lexicalised tree for “Matt walks”: S(walks) dominates NP(Matt) and VP(walk); NNP(Matt) and VBD(walk) dominate the words Matt and walks]
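One rough sketch of what head-conditioning could look like in code: relative-frequency estimates of P(LHS → RHS | LHS, head), read off head-annotated training trees. The counts and helper below are invented for illustration; real lexicalised parsers (e.g. Collins 1999) use richer decompositions and smoothing.

from collections import Counter

# counts[(lhs, head)][rhs] = how often (lhs, head) expanded as rhs in training trees
counts = {
    ("VP", "walk"): Counter({("VBD",): 7, ("VBD", "NP"): 3}),
    ("VP", "dump"): Counter({("VBD", "NP", "PP"): 8, ("VBD", "NP"): 2}),
}

def p_rule_given_head(lhs, head, rhs):
    """Relative-frequency estimate of P(lhs -> rhs | lhs, h(lhs) = head)."""
    c = counts.get((lhs, head), Counter())
    total = sum(c.values())
    return c[rhs] / total if total else 0.0

print(p_rule_given_head("VP", "walk", ("VBD",)))   # 0.7 on these toy counts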

  22. Practical problems for lexicalised PCFGs • data sparseness: we don’t necessarily see all heads of all phrasal categories often enough in the training data • flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement • I got the easier problem of the two to solve • “of the two” and “to solve” become more likely because of the pre-head modifier “easier”

  23. Structural context • The simple way: calculate P(t|s,G) based on rules in the canonical derivation d of t • assumes that P(t) is independent of the derivation • could condition on more structural context • but then we could lose the notion of a canonical derivation, i.e. P(t) could really depend on the derivation!

  24. Structural context: probability of a derivation history • How to calculate P(t) based on a derivation d? • Observation: P(d) is the probability that a sequence of m rewrite rules in a derivation yields s • can use the chain rule for multiplication (see below)
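The chain-rule decomposition the slide alludes to (a reconstruction; the formula is an image in the original), for a derivation d consisting of rewrite rules r_1 … r_m:

P(d) = P(r_1, r_2, \ldots, r_m) = \prod_{i=1}^{m} P\big(r_i \mid r_1, \ldots, r_{i-1}\big)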

  25. Approach 2: parent annotation • Annotate each node with its parent in the parse tree. • E.g. if NP has parent S, then rename NP to NP^S • Can partly account for dependencies such as subject-of • (NP^S is a subject, NP^VP is an object) • [Parent-annotated tree for “Matt walks”: S dominates NP^S and VP^S; NNP^NP and VBD^VP dominate the words Matt and walks]
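A compact sketch of the transformation, on a tree written as nested (label, children) structures; the representation and function name are my own, not from the slides.

def parent_annotate(node, parent=None):
    """Rename every non-root non-terminal X with parent Y to X^Y."""
    label, children = node
    new_label = label if parent is None else f"{label}^{parent}"
    new_children = [
        child if isinstance(child, str)        # leaves (words) stay as they are
        else parent_annotate(child, label)     # children see the original label
        for child in children
    ]
    return (new_label, new_children)

tree = ("S", [("NP", [("NNP", ["Matt"])]), ("VP", [("VBD", ["walks"])])])
print(parent_annotate(tree))
# ('S', [('NP^S', [('NNP^NP', ['Matt'])]), ('VP^S', [('VBD^VP', ['walks'])])])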

  26. The main point • Many different parsing approaches differ on what they condition their probabilities on

  27. Other grammar formalisms

  28. Phrase structure vs. Dependency grammar • PCFGs are in the tradition of phrase-structure grammars • Dependency grammar describes syntax in terms of dependencies between words • no non-terminals or phrasal nodes • only lexical nodes with links between them • links are labelled, labels from a finite list

  29. Dependency Grammar • [Dependency graph for “I gave him my address”: main(<ROOT>, GAVE), subj(GAVE, I), dat(GAVE, him), obj(GAVE, address), attr(address, MY)]
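The same graph, written as labelled (head, relation, dependent) triples; this flat-list representation is my own way of spelling out the arcs, not something prescribed by the formalism.

dependencies = [
    ("<ROOT>",  "main", "gave"),
    ("gave",    "subj", "I"),
    ("gave",    "dat",  "him"),
    ("gave",    "obj",  "address"),
    ("address", "attr", "my"),
]
# Disambiguation can then condition directly on word pairs, e.g. how often
# "gave" governs "address" under the label "obj" in a training corpus.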

  30. Dependency grammar • Often used now in probabilistic parsing • Advantages: • directly encode lexical dependencies • therefore, disambiguation decisions take lexical material into account directly • dependencies are a way of decomposing PSRs and their probability estimates • estimating probability of dependencies between 2 words is less likely to lead to data sparseness problems

  31. Summary • We’ve taken a tour of PCFGs • crucial notion: what the probability of a rule is conditioned on • flaws in PCFGs: independence assumptions • several proposals to go beyond these flaws • dependency grammars are an alternative formalism
