
Corpora and Statistical Methods


Presentation Transcript


  1. Corpora and Statistical Methods Albert Gatt

  2. Word Sense Disambiguation

  3. What are word senses? • Cognitive definition: • mental representation of meaning • used in psychological experiments • relies on introspection (notoriously deceptive) • Dictionary-based definition: • adopt sense definitions in a dictionary • most frequently used resource is WordNet

  4. WordNet • Taxonomic representation of words (“concepts”) • Each word belongs to a synset, which contains near-synonyms • Each synset has a gloss • Words with multiple senses (polysemy) belong to multiple synsets • Synsets organised by hyponymy (IS-A) relations • Also, other lexical relations, depending on category

  5. How many senses? • Example: interest • pay 3% interest on a loan • showed interest in something • purchased an interest in a company. • the national interest… • have X’s best interest at heart • have an interest in word senses • The economy is run by business interests

  6. WordNet entry for interest (noun) • a sense of concern with and curiosity about someone or something … (Synonym: involvement) • the power of attracting or holding one’s interest… (Synonym: interestingness) • a reason for wanting something done (Synonym: sake) • a fixed charge for borrowing money… • a diversion that occupies one’s time and thoughts … (Synonym: pastime) • a right or legal share of something; a financial involvement with something (Synonym: stake) • (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)
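These senses can be inspected programmatically. A minimal sketch using NLTK’s WordNet interface (note that sense numbering and glosses vary across WordNet versions, so the output need not match the seven senses above exactly):

  from nltk.corpus import wordnet as wn

  # List the noun senses of "interest": each synset carries its
  # member lemmas (near-synonyms) and a gloss.
  for i, synset in enumerate(wn.synsets('interest', pos=wn.NOUN), start=1):
      lemmas = ', '.join(lemma.name() for lemma in synset.lemmas())
      print('%d. [%s] %s' % (i, lemmas, synset.definition()))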

  7. Some issues • Are all these really distinct senses? Is WordNet too fine-grained? • Would native speakers distinguish all these as different? • Cf. the distinction between sense ambiguity and underspecification (vagueness): • one could argue that there are fewer senses, but these are underspecified out of context

  8. Translation equivalence • Many WSD applications rely on translation equivalence • Given: parallel corpus (e.g. English-German) • if word w in English has n translations in German, then each translation represents a sense • e.g. German translations of interest: • Zins: financial charge (WN sense 4) • Anteil: stake in a company (WN sense 6) • Interesse: all other senses

  9. Some terminology • WSD Task: given an ambiguous word, find the intended sense in context • Sense tagging: task of labelling words as belonging to one sense or another. • needs some a priori characterisation of senses of each relevant word • Discrimination: • distinguishes between occurrences of words based on senses • not necessarily explicit labelling

  10. Some more terminology • Two types of WSD task: • Lexical sample task: focuses on disambiguating a small set of target words, using an inventory of the senses of those words. • All-words task: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated • Serious data sparseness problems!

  11. Approaches to WSD • All methods rely on training data. Basic idea: • Given word w in context c • learn how to predict sense s of w based on various features of w • Supervised learning: training data is labelled with correct senses • can do sense tagging • Unsupervised learning: training data is unlabelled • but many other knowledge sources used • cannot do sense tagging, since this requires a priori senses

  12. Supervised learning • Words in training data labelled with their senses • She pays 3% interest/INTEREST-MONEY on the loan. • He showed a lot of interest/INTEREST-CURIOSITY in the painting. • Similar to POS tagging • given a corpus tagged with senses • define features that indicate one sense over another • learn a model that predicts the correct sense given the features

  13. Features (e.g. plant) • Neighbouring words: • plant life • manufacturing plant • assembly plant • plant closure • plant species • Content words in a larger window • animal • equipment • employee • automatic

  14. Other features • Syntactically related words • e.g. object, subject…. • Topic of the text • is it about SPORT? POLITICS? • Part-of-speech tag, surrounding part-of-speech tags

  15. Some principles proposed (Yarowsky 1995) • One sense per discourse: • typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. same document) • One sense per collocation: • nearby words provide clues as to the sense, depending on the distance and syntactic relationship • e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant

  16. Training data • SENSEVAL • Shared Task competition • datasets available for WSD, among other things • annotated corpora in many languages • (NB: SENSEVAL now merged with the broader SEMEVAL tasks) • Pseudo-words • create training corpus by artificially conflating words • e.g. all occurrences of man and hammer with man-hammer • easy way to create training data • Multi-lingual parallel corpora • translated texts aligned at the sentence level • translation indicates sense
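A minimal sketch of pseudo-word construction, conflating man and hammer into the artificial ambiguous token man-hammer; the original word is kept as the gold-standard sense label (function name and tokenised input are illustrative):

  # Build labelled training instances for the pseudo-word "man-hammer".
  def make_pseudoword_data(sentences, word1='man', word2='hammer'):
      pseudo = word1 + '-' + word2
      data = []
      for tokens in sentences:            # each sentence is a list of tokens
          for i, tok in enumerate(tokens):
              if tok in (word1, word2):
                  context = tokens[:i] + [pseudo] + tokens[i + 1:]
                  data.append((context, i, tok))  # (sentence, position, true sense)
      return data

  examples = make_pseudoword_data([
      ['the', 'man', 'walked', 'home'],
      ['she', 'hit', 'the', 'nail', 'with', 'a', 'hammer'],
  ])
  # two instances of "man-hammer", labelled 'man' and 'hammer' respectively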

  17. SemCor corpus • Corpus consisting of files from the Brown Corpus (1m words) • SemCor contains 352 files, 360k words total: • 186 files contain sense tags for all content words • The remainder only have sense tags for verbs • Originally created using the WordNet 1.6 sense inventory; since updated to the more recent versions of WordNet

  18. SemCor Example (= Brown file a01)
  <s snum=1> [...]
  <wf cmd=ignore pos=DT>an</wf>
  <wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
  <wf cmd=ignore pos=IN>of</wf>
  <wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf>
  <wf cmd=ignore pos=POS>'s</wf>
  <wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf>
  <wf cmd=done pos=NN lemma=primary_election wnsn=1 lexsn=1:04:00::>primary_election</wf>
  [...] </s>

  19. SemCor example <wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf> • Senses in WordNet: • S: (n) probe, investigation (an inquiry into unfamiliar or questionable activities) "there was a congressional probe into the scandal" • S: (n) investigation, investigating (the work of inquiring into something thoroughly and systematically) • This occurrence involves the first sense
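The NLTK distribution includes a SemCor reader (mapped to a recent WordNet inventory); a minimal sketch of reading the sense-tagged chunks:

  from nltk.corpus import semcor
  from nltk.tree import Tree

  # Semantically tagged chunks are Trees whose label is a WordNet Lemma;
  # untagged chunks (function words etc.) come back without a Lemma label.
  for chunk in semcor.tagged_chunks(tag='sem')[:40]:
      if isinstance(chunk, Tree) and not isinstance(chunk.label(), str):
          lemma = chunk.label()          # e.g. investigation.n.01.investigation
          print(' '.join(chunk.leaves()), '->', lemma.synset().name())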

  20. Data representation • Example sentence: An electric guitar and bass player stand off to one side... • Target word: bass • Possible senses: fish, musical instrument... • Relevant features are represented as vectors, e.g.:
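The feature table from the original slide is not preserved in this transcript. Below is a plausible sketch of such a vector for the bass example, using the neighbouring words, their POS tags, and the content words in a small window (the feature choice and POS tags are illustrative):

  # Illustrative feature extraction for the target word "bass".
  def extract_features(tokens, pos_tags, target, window=3):
      i = tokens.index(target)
      feats = {
          'prev_word': tokens[i - 1] if i > 0 else '<s>',
          'next_word': tokens[i + 1] if i + 1 < len(tokens) else '</s>',
          'prev_pos': pos_tags[i - 1] if i > 0 else '<s>',
          'next_pos': pos_tags[i + 1] if i + 1 < len(tokens) else '</s>',
      }
      # binary features for words in a +/- window around the target
      for tok in tokens[max(0, i - window):i + window + 1]:
          if tok != target:
              feats['in_window(%s)' % tok] = True
      return feats

  tokens = 'An electric guitar and bass player stand off to one side'.split()
  tags = ['DT', 'JJ', 'NN', 'CC', 'NN', 'NN', 'VBP', 'RP', 'TO', 'CD', 'NN']
  print(extract_features(tokens, tags, 'bass'))
  # {'prev_word': 'and', 'next_word': 'player', 'prev_pos': 'CC', ...}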

  21. Part 1 Supervised methods I: Naive Bayes

  22. Naïve Bayes Classifier • Identify the features (F) • e.g. surrounding words • other cues apart from surrounding context • Combine evidence from all features • Decision rule: decide on sense s’ iff P(s’|F) > P(sk|F) for all senses sk ≠ s’ • Example: drug. F = words in context • medication sense: price, prescription, pharmaceutical • illegal substance sense: alcohol, illicit, paraphernalia

  23. Naive Bayes Classifier • Problem: We usually don’t know the probability of a sense given the features: P(sk|F) • In our corpus, we have access to P(F|sk) • We can compute P(sk|F) from P(F|sk) • For this we use Bayes’ theorem

  24. Deriving Bayes’ rule from the multiplication rule • Recall that: P(A ∩ B) = P(A|B) P(B)

  25. Deriving Bayes’ rule from the multiplication rule • Recall that: P(A ∩ B) = P(A|B) P(B) • Given symmetry of intersection, the multiplication rule can be written in two ways: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)

  26. Deriving Bayes’ rule from the multiplication rule • Recall that: P(A ∩ B) = P(A|B) P(B) • Given symmetry of intersection, the multiplication rule can be written in two ways: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A) • Bayes’ rule involves the substitution of one equation into the other, to replace P(A ∩ B): P(B|A) = P(A|B) P(B) / P(A)

  27. Deriving P(A) • Often, it’s not clear where P(A) should come from • we start out from conditional probabilities! • Given that we have two sets of outcomes of interest, A and B, P(A) can be derived from the following observation: A = (A ∩ B) ∪ (A ∩ ¬B) • i.e. the events in A are made up of those which are only in A (but not in B) and those which are in both A and B.

  28. Finding P(A) -- I [Venn diagram of overlapping events A and B] An outcome in A must be in either A ∩ B or A ∩ ¬B, since A is composed of these two sets.

  29. Finding P(A) -- II Step 1: Applying the addition rule: P(A) = P(A ∩ B) + P(A ∩ ¬B) = P(A|B) P(B) + P(A|¬B) P(¬B) Step 2: Substituting into Bayes’ equation to replace P(A): P(B|A) = P(A|B) P(B) / [P(A|B) P(B) + P(A|¬B) P(¬B)]

  30. Using Bayes’ rule for WSD • We usually don’t know P(sk|F), but we can compute it from training data: P(sk) (the prior) and P(F|sk) • Decision rule: choose s’ = argmaxk P(F|sk) P(sk) • P(F) can be eliminated because it is constant for all senses in the corpus

  31. The independence assumption • It’s called “naïve” because it assumes: P(F|sk) = ∏j P(fj|sk) • i.e. all features are assumed to be independent given the sense • Obviously, this is often not true. • e.g. finding illicit in the context of drug may not be independent of finding pusher. • cf. our discussion of collocations! • Also, topics often constrain word choice.

  32. Training the naive Bayes classifier • We need to compute: • P(s) for all senses s of w, e.g. by maximum likelihood: P(s) = C(s,w)/C(w) • P(fj|s) for all features fj, e.g. P(fj|s) = C(fj,s)/C(s)
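A minimal end-to-end sketch of training and the decision rule; the add-one smoothing is my addition, to avoid zero probabilities for unseen features:

  import math
  from collections import Counter, defaultdict

  class NaiveBayesWSD:
      # Naive Bayes sense classifier over bag-of-words context features.
      def train(self, labelled_contexts):
          # labelled_contexts: iterable of (context_words, sense) pairs
          self.sense_counts = Counter()
          self.feature_counts = defaultdict(Counter)
          self.vocab = set()
          for words, sense in labelled_contexts:
              self.sense_counts[sense] += 1
              for w in words:
                  self.feature_counts[sense][w] += 1
                  self.vocab.add(w)

      def classify(self, context_words):
          # Decision rule: argmax_s [ log P(s) + sum_j log P(f_j|s) ]
          total = sum(self.sense_counts.values())
          best_sense, best_score = None, float('-inf')
          for sense, count in self.sense_counts.items():
              score = math.log(count / total)
              denom = sum(self.feature_counts[sense].values()) + len(self.vocab)
              for w in context_words:
                  score += math.log((self.feature_counts[sense][w] + 1) / denom)
              if score > best_score:
                  best_sense, best_score = sense, score
          return best_sense

  clf = NaiveBayesWSD()
  clf.train([(['price', 'prescription', 'pharmaceutical'], 'MEDICATION'),
             (['alcohol', 'illicit', 'paraphernalia'], 'ILLEGAL_SUBSTANCE')])
  print(clf.classify(['illicit', 'pusher']))   # -> ILLEGAL_SUBSTANCE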

  33. Part 2 Supervised methods II: Information-theoretic and collocation-based approaches

  34. Information-theoretic measures • Find the single, most informative feature to predict a sense. • E.g. using a parallel corpus: • prendre (FR) can translate as take or make • prendre une décision: make a decision • prendre une mesure: take a measure [to…] • Informative feature in this case: direct object • mesure indicates take • décision indicates make • Problem: need to identify the correct value of the feature that indicates a specific sense.

  35. Brown et al’s algorithm • Given: translations T of word w • Given: values X of a useful feature (e.g. mesure, décision as values of DO) • Step 1: random partition P of T • While improving, do: • create partition Q of X that maximises I(P;Q) • find a partition P of T that maximises I(P;Q) • comment: relies on mutual information to find clusters of translations mapping to clusters of feature values
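A sketch of the mutual-information computation that drives this flip-flop search; here both partitions are binary splits, and the efficient search over splits that Brown et al. use is omitted (the function name and toy data are illustrative):

  import math
  from collections import Counter

  def partition_mi(pairs, t_group, x_group):
      # Estimate I(P;Q) from (translation, feature-value) pairs, where
      # P = {t_group, rest} partitions the translations T and
      # Q = {x_group, rest} partitions the feature values X.
      n = len(pairs)
      joint = Counter((t in t_group, x in x_group) for t, x in pairs)
      p_t = Counter(t in t_group for t, _ in pairs)
      p_x = Counter(x in x_group for _, x in pairs)
      return sum((c / n) * math.log((c / n) / ((p_t[i] / n) * (p_x[j] / n)))
                 for (i, j), c in joint.items())

  # Toy data: direct objects observed with each translation of prendre.
  pairs = [('make', 'décision'), ('make', 'décision'), ('take', 'mesure')]
  print(partition_mi(pairs, {'make'}, {'décision'}))   # > 0: partitions align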

  36. Using dictionaries and thesauri • Lesk (1986): one of the first to exploit dictionary definitions • the definition corresponding to a sense can contain words which are good indicators for that sense • Method: • Given: ambiguous word w with senses s1…sn with glosses g1…gn. • Given: the word w in context c • compute overlap between c & each gloss • select the maximally matching sense
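A minimal sketch of (simplified) Lesk over NLTK’s WordNet glosses; stopword removal and gloss expansion, which help in practice, are omitted:

  from nltk.corpus import wordnet as wn

  def simplified_lesk(word, context_words):
      # Pick the sense whose gloss shares the most words with the context.
      context = set(w.lower() for w in context_words)
      best_sense, best_overlap = None, -1
      for synset in wn.synsets(word):
          gloss = set(synset.definition().lower().split())
          overlap = len(gloss & context)
          if overlap > best_overlap:
              best_sense, best_overlap = synset, overlap
      return best_sense

  sent = 'the bank can guarantee deposits will eventually cover future costs'.split()
  print(simplified_lesk('bank', sent))   # typically the financial-institution sense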

  37. Expanding a dictionary • Problem with Lesk: • often dictionary definitions don’t contain sufficient information • not all words in dictionary definitions are good informants • Solution: use a thesaurus with subject/topic categories • e.g. Roget’s thesaurus • http://www.gutenberg.org/cache/epub/22/pg22.html.utf8

  38. Using topic categories • Suppose every sense sk of word w has subject/topic tk • w can be disambiguated by identifying the words related to tk in the thesaurus • Problems: • general-purpose thesauri don’t list domain-specific topics • several potentially useful words can be left out • e.g. … Navratilova plays great tennis … • proper name here useful as indicator of topic SPORT

  39. Expanding a thesaurus: Yarowsky 1992 • Given: context c and topic t • For all contexts and topics, compute p(c|t) using Naïve Bayes • by comparing words pertaining to t in the thesaurus with words in c • if p(c|t) > α, then assign topic t to context c • For all words in the vocabulary, update the list of contexts in which the word occurs. • Assign topic t to each word in c • Finally, compute p(w|t) for all w in the vocabulary • this gives the “strength of association” of w with t
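A sketch of the context-scoring step: under the Naive Bayes view, p(c|t) reduces to a sum of log word probabilities (the word lists, probabilities and the out-of-vocabulary floor are illustrative):

  import math

  def log_p_context_given_topic(context_words, topic_word_probs, oov=1e-6):
      # Naive Bayes: log p(c|t) = sum over words of log p(w|t);
      # words not associated with the topic get a small floor probability.
      return sum(math.log(topic_word_probs.get(w, oov)) for w in context_words)

  universe = {'space': 0.05, 'galaxy': 0.04, 'telescope': 0.03}
  entertainer = {'film': 0.05, 'hollywood': 0.04, 'movie': 0.03}
  context = ['the', 'star', 'was', 'seen', 'through', 'the', 'telescope']
  scores = {'UNIVERSE': log_p_context_given_topic(context, universe),
            'ENTERTAINER': log_p_context_given_topic(context, entertainer)}
  print(max(scores, key=scores.get))   # -> UNIVERSE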

  40. Yarowsky 1992: some results

  Word      Sense          Roget topic     Accuracy
  star      space object   UNIVERSE        96%
  star      celebrity      ENTERTAINER     95%
  star      shape          INSIGNIA        82%
  sentence  punishment     LEGAL_ACTION    99%
  sentence  set of words   GRAMMAR         98%

  41. Bootstrapping • Yarowsky (1995) suggested the one sense per discourse/collocation constraints. • Yarowsky’s method: • select the strongest collocational feature in a specific context • disambiguate based only on this feature • (similar to the information-theoretic method discussed earlier)

  42. One sense per collocation • For each sense s of w, initialise F, the collocations found in the dictionary definition for s • One sense per collocation: • identify the set of contexts containing collocates of s • for each sense s of w, update F to contain those collocates f such that P(s|f) / P(s’|f) > α for all s’ ≠ s (where α is a threshold)

  43. One sense per discourse • For each document: • find the majority sense of w out of those found in previous step • assign all occurrences of w the majority sense • This is implemented as a post-processing step. Reduces error rate by ca. 27%.
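A minimal sketch of this post-processing step (the function name is illustrative):

  from collections import Counter

  def one_sense_per_discourse(senses_in_document):
      # Relabel every occurrence of the target word in a document
      # with that document's majority sense.
      if not senses_in_document:
          return []
      majority = Counter(senses_in_document).most_common(1)[0][0]
      return [majority] * len(senses_in_document)

  print(one_sense_per_discourse(['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-LIFE']))
  # -> ['PLANT-FACTORY', 'PLANT-FACTORY', 'PLANT-FACTORY']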

  44. Part 3 Unsupervised disambiguation

  45. Preliminaries • Recall: unsupervised learning can do sense discrimination, not tagging • akin to clustering occurrences with the same sense • e.g. Brown et al 1991: cluster translations of a word • this is akin to clustering senses

  46. Brown et al’s method • Preliminary categorisation: • (1) Set P(w|s) randomly for all words w and senses s of w. • (2) Compute, for each context c of w, the probability P(c|s) that the context was generated by sense s. • Use (1) and (2) as a preliminary estimate. Re-estimate iteratively to find the best fit to the corpus.
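A toy EM sketch in this spirit, with bag-of-words contexts and two senses; this is a simplified stand-in, not Brown et al.’s exact model:

  import math, random
  from collections import defaultdict

  def em_sense_discrimination(contexts, n_senses=2, iterations=20, seed=0):
      rng = random.Random(seed)
      vocab = sorted({w for c in contexts for w in c})
      # (1) random initialisation of P(w|s), normalised per sense
      p_w = [{w: rng.random() for w in vocab} for _ in range(n_senses)]
      for dist in p_w:
          z = sum(dist.values())
          for w in dist:
              dist[w] /= z
      p_s = [1.0 / n_senses] * n_senses
      for _ in range(iterations):
          # (2) E-step: P(s|c) proportional to P(s) * prod_w P(w|s)
          resp = []
          for c in contexts:
              logs = [math.log(p_s[s]) + sum(math.log(p_w[s][w]) for w in c)
                      for s in range(n_senses)]
              m = max(logs)
              weights = [math.exp(l - m) for l in logs]
              z = sum(weights)
              resp.append([x / z for x in weights])
          # M-step: re-estimate P(s) and P(w|s) (add-one smoothed)
          p_s = [sum(r[s] for r in resp) / len(contexts) for s in range(n_senses)]
          for s in range(n_senses):
              counts = defaultdict(float)
              for c, r in zip(contexts, resp):
                  for w in c:
                      counts[w] += r[s]
              z = sum(counts.values()) + len(vocab)
              p_w[s] = {w: (counts[w] + 1.0) / z for w in vocab}
      return resp   # soft sense assignments P(s|c), one list per context

  contexts = [['guitar', 'play'], ['fish', 'river'],
              ['play', 'music'], ['river', 'catch']]
  print(em_sense_discrimination(contexts))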

  47. Characteristics of unsupervised disambiguation • Can adapt easily to new domains, not covered by a dictionary or pre-labelled corpus • Very useful for information retrieval • If there are many senses (e.g. 20 senses for word w), the algorithm will split contexts into fine-grained sets • NB: can go awry with infrequent senses

  48. Part 4 Some issues with WSD

  49. The task definition • The WSD task traditionally assumes that a word has one and only one sense in a context. • Is this true? • Is it possible to see evidence of co-activation (one word displaying more than one sense)? • this would bring competition to the licensed trade • competition = “act of competing” • competition = “people/organisations who are competing”

  50. Systematic polysemy • Not all senses are so easy to distinguish. E.g. competition in the “agent competing” vs “act of competing” sense. • The polysemy here is systematic • Compare bank/bank where the senses are utterly distinct (and most linguists wouldn’t consider this a case of polysemy, but homonymy) • Can translation equivalence help here? • depends if polysemy is systematic in all languages
