Corpora and Statistical Methods, Lecture 5. Albert Gatt. Application 3: Verb selectional restrictions
Observation • Some verbs place strong restrictions on the semantic category of the NPs they take as arguments. • Assumption: we’re focusing attention on Direct Objects (DOs) only • e.g. eat selects for FOOD DOs: • eat cake • eat some fresh vegetables • grow selects for LEGUME DOs: • grow potatoes
Not all verbs are equally constraining • Some verbs seem to place fewer restrictions than others: • see doesn’t seem too restrictive: • see John • see the potato • see the fresh vegetables • …
Problem definition • For a given verb and a potential set of arguments (nouns), we want to learn to what extent the verb selects for those arguments • rather than individual nouns, we’re better off using noun classes (FOOD etc), since these allow us to generalise more • can obtain these using a standard resource, e.g. WordNet
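As a concrete illustration, here is a minimal sketch of looking up a noun’s candidate classes with NLTK’s WordNet interface; the function name candidate_classes is our own, and the snippet assumes nltk and its wordnet data are available:

```python
# Sketch: candidate noun classes via WordNet hypernyms (using NLTK;
# assumes nltk is installed and nltk.download("wordnet") has been run).
from nltk.corpus import wordnet as wn

def candidate_classes(noun):
    """Collect every hypernym synset on every sense's hypernym paths."""
    hypernyms = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            hypernyms.update(s.name() for s in path)
    return hypernyms

# For "potato" the output includes food- and vegetable-related synsets,
# which could play the role of the FOOD / LEGUME classes.
print(sorted(candidate_classes("potato")))
```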
Kullback-Leibler divergence • We are often in a position where we estimate a probability distribution from (incomplete) data • This problem is inherent in sampling. • We end up with a distribution P, which is intended as a model of distribution Q. • How good is P as a model? • Kullback-Leibler divergence tells us how well our model matches the actual distribution.
Motivating example • Suppose I’m interested in the semantic type or class to which a noun belongs, e.g.: • cake, meat, cauliflower are types of FOOD (among other things) • potato, carrot are types of LEGUME (among other things) • How do I infer this? • It helps if I know that certain predicates, like grow select for some types of DO, not others • *grow meat, *grow cake • grow potatoes, grow carrots
Motivating example (cont’d) • Ingredients • C: the class of interest (e.g. LEGUME) • v: the verb of interest (e.g. grow) • P(C) = probability of class C • prior probability of finding some element of C as DO of any verb • P(C|v) = probability of C given that we know the noun is the DO of v (e.g. grow) • this is my posterior probability • More precise way of asking the question: • Does the probability distribution of C change given the info about v?
Ingredients for KL Divergence • a prior distribution (our model, q) • a posterior distribution (what we actually observe, p) • Intuition: KL-Divergence measures how much information we gain by moving from the prior to the posterior • if it’s 0, the two distributions are identical and we gain no info • Given two probability distributions P and Q, with probability mass functions p(x) and q(x), KL-Divergence is denoted D(p||q)
Calculating KL-Divergence • $D(p \,\|\, q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$ • here p is the posterior and q the prior probability distribution
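A minimal sketch of this computation in Python; the helper name and the toy class distributions are our own, invented purely for illustration:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log2(p(x) / q(x)).

    Terms with p(x) = 0 contribute nothing; assumes q(x) > 0
    wherever p(x) > 0.
    """
    return sum(p_x * math.log2(p_x / q[x])
               for x, p_x in p.items() if p_x > 0)

# Toy class distributions (invented values, not corpus estimates)
prior = {"FOOD": 0.2, "LEGUME": 0.1, "PERSON": 0.7}      # P(c)
posterior = {"FOOD": 0.3, "LEGUME": 0.6, "PERSON": 0.1}  # P(c|v)

print(kl_divergence(posterior, prior))  # > 0: the distributions diverge
```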
More on the interpretation of KL-Divergence • If probability distribution P is interpreted as “the truth” and distribution Q is my approximation, then: • D(p||q) tells me how much extra info I need to add to Q to get to the actual truth
Back to our problem: Applying KL-divergence to selectional restrictions
Resnik’s model (Resnik 1996) • 2 main ingredients: • Selectional Preference Strength (S): how strongly a verb constrains its direct object (a global estimate) • Selectional Association (A): how much a verb v is associated with a given noun class (a specific estimate for a given class)
Notation • v = a verb of interest • S(v) = the selectional preference strength of v • c = a noun class • C = the set of all the noun classes • A(v,c) = the selectional association between v and class c
Selectional Preference Strength • S(v) is the KL-Divergence between: • the overall prior distribution of all noun classes • the posterior distribution of noun classes in the direct object position of v • how much info we gain from knowing the probability that members of a class occur as DO of v • works as a global estimate of how much v constrains its arguments semantically • the more it constrains them, the more info we stand to gain from knowing that an argument occurs as DO of v
S(grow): prior vs. posterior distributions over noun classes (figure not reproduced here). Source: Resnik 1996, p. 135
Calculating S(v) • $S(v) = D\big(P(C \mid v) \,\|\, P(C)\big) = \sum_{c \in C} P(c \mid v) \log_2 \frac{P(c \mid v)}{P(c)}$ • This quantifies the extent to which our prior and posterior probability estimates diverge • How much info do we gain about C by knowing that a noun is the object of v?
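Continuing the sketch above, S(v) is just the KL-divergence of the posterior from the prior (reusing kl_divergence and the invented toy distributions):

```python
def preference_strength(prior, posterior_given_v):
    """S(v) = D( P(C|v) || P(C) ): divergence of posterior from prior."""
    return kl_divergence(posterior_given_v, prior)

# Treating the toy dicts above as P(c) and P(c|grow):
print(preference_strength(prior, posterior))  # larger = more constraining verb
```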
Some more examples • How much info do we gain if we know which verb a noun is the DO of? • quite a lot if it’s an argument of eat • not much if it’s an argument of find • none if it’s an argument of see • Source: Manning and Schütze 1999, p. 290
Selectional association • This is estimated based on selectional preference strength • tells us how much a verb is associated with a specific class, given the extent to which it constrains its arguments • given a class c, A(v,c) tells us how much of S(v) is contributed by c
Calculating A(v,c) • $A(v,c) = \frac{P(c \mid v) \log_2 \frac{P(c \mid v)}{P(c)}}{S(v)}$ • the numerator is one term of our summation for S(v) • dividing by S(v) gives the proportion of S(v) which is contributed by class c
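A sketch of the same quantity, reusing the helpers and toy distributions above:

```python
def selectional_association(v_prior, v_posterior, c):
    """A(v,c): the proportion of S(v) contributed by class c."""
    if v_posterior.get(c, 0.0) == 0.0:
        return 0.0  # a class never seen as DO of v contributes no term
    term = v_posterior[c] * math.log2(v_posterior[c] / v_prior[c])
    return term / preference_strength(v_prior, v_posterior)

print(selectional_association(prior, posterior, "LEGUME"))
```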
From A(v,c) to A(v,n) • We know how to estimate the association strength of a class with v • Problem: • some nouns belong to more than one class • Let classes(n) be the set of classes to which noun n belongs, and take the class that maximises the association: $A(v,n) = \max_{c \in \mathit{classes}(n)} A(v,c)$
Example • Susan interrupted the chair. • chair is in class FURNITURE • chair is in class PEOPLE • A(interrupt,PEOPLE) > A(interrupt,FURNITURE) • A(interrupt,chair) = A(interrupt,PEOPLE) • Note that this is a kind of word-sense disambiguation!
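A sketch mirroring this example with invented probabilities; the max over classes is what performs the disambiguation:

```python
def noun_association(v_prior, v_posterior, noun_classes):
    """A(v,n) = max over c in classes(n) of A(v,c)."""
    return max(selectional_association(v_prior, v_posterior, c)
               for c in noun_classes)

# Toy version of "Susan interrupted the chair" (invented probabilities):
p_c = {"FURNITURE": 0.3, "PEOPLE": 0.7}              # P(c)
p_c_interrupt = {"FURNITURE": 0.05, "PEOPLE": 0.95}  # P(c|interrupt)
print(noun_association(p_c, p_c_interrupt, ["FURNITURE", "PEOPLE"]))
# equals A(interrupt, PEOPLE): the PEOPLE term dominates
```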
Some results from Resnik 1996 • There are some fairly atypical examples: • these are due to the disambiguation method • e.g. tragedy can be in the COMM class, and so is assigned A(answer,COMM) as its A(v,n)
Overall evaluation • Resnik’s results were shown to correlate very well with results from a psycholinguistic study • The method is promising: • seems to mirror human intuitions • may have some psychological validity • Possibly an alternative, data-driven account of the semantic bootstrapping hypothesis of Pinker 1989?