
Three Approaches to Unsupervised WSD



Presentation Transcript


  1. Three Approaches to Unsupervised WSD Dmitriy Dligach

  2. Unsupervised WSD • No training corpora needed • No predefined tag set needed • Three approaches • Context-group Discrimination (Schutze, 1998) • Graph-based Algorithms (Agirre et al., 2006): HyperLex (Veronis, 2004), ranked with PageRank (Brin and Page, 1998) • Predominant Sense (McCarthy, 2006): thesaurus generation by the method in (Lin, 1998), with an earlier version in (Hindle, 1990)

  3. Context-group Discrimination Algorithm • Sense Representations • Generate word vectors • Generate context vectors (from co-occurrence matrix) • Generate sense vectors (by clustering context vectors) • Disambiguate by computing proximity

  4. Word Vectors • wi – the co-occurrence vector representing word i • Two strategies to select dimensions • Local: select words from the contexts of the ambiguous word within a 50-word window • Either the 1,000 most frequent words, or • Use the χ² measure of dependence to pick 1,000 words • Global: select from the entire corpus regardless of the target word • Select the 20,000 most frequent words as features • 2,000 as dimensions • 20,000-by-2,000 co-occurrence matrix
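A minimal sketch of the local strategy in Python. The symmetric 25-word half-window, plain frequency in place of the χ² feature selection, and all names are illustrative assumptions, not Schutze's actual procedure:

```python
from collections import Counter, defaultdict

def word_vectors(tokens, target, half_window=25, n_features=1000):
    """Local strategy: pick feature words from the contexts of the
    ambiguous target, then count co-occurrences against them.
    Plain frequency stands in for the chi-square selection."""
    near = Counter()
    for i, t in enumerate(tokens):
        if t == target:
            near.update(tokens[max(0, i - half_window):i + half_window + 1])
    features = [w for w, _ in near.most_common(n_features) if w != target]
    index = {w: j for j, w in enumerate(features)}
    # vectors[w][j] = how often word w co-occurs with feature word j
    vectors = defaultdict(lambda: [0] * len(features))
    for i, t in enumerate(tokens):
        for c in tokens[max(0, i - half_window):i + half_window + 1]:
            if c in index and c != t:
                vectors[t][index[c]] += 1
    return vectors, features
```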

  5. Context Vectors • Word vectors alone conflate the senses of ambiguous words • Represent a context as the centroid of the word vectors of the words it contains • Weight the word vectors by IDF
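A sketch of the centroid computation, assuming the word vectors and document frequencies from the previous step; numpy and the variable names are illustrative:

```python
import math
import numpy as np

def context_vector(context_words, vectors, doc_freq, n_docs):
    """Represent a context as the IDF-weighted centroid of the vectors
    of the words it contains (a second-order representation)."""
    weighted = []
    for w in context_words:
        if w in vectors and doc_freq.get(w, 0) > 0:
            idf = math.log(n_docs / doc_freq[w])   # rare words count more
            weighted.append(idf * np.asarray(vectors[w], dtype=float))
    return np.mean(weighted, axis=0) if weighted else None
```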

  6. Sense Vectors • Cluster the approx. 2,000 context vectors • Use a combination of group-average agglomerative clustering (GAAC) and EM • Choose a random sample of 50 of the ~2,000 context vectors and cluster it with GAAC, which is O(n²) • The centroids of the resulting clusters become the input to EM • The overall procedure remains linear • Perform an SVD on the context vectors • Re-represent context vectors by their values on the 100 principal dimensions
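A Buckshot-style sketch of the clustering step, with scikit-learn's k-means standing in for EM; the sample size of 50 follows the slide, while the library choices and parameter names are assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import TruncatedSVD

def sense_vectors(contexts, n_senses=2, sample_size=50, n_dims=100, seed=0):
    """Reduce the context vectors with SVD, run GAAC (quadratic) on a
    small random sample, and use the sample-cluster centroids to seed an
    iterative pass over all contexts, keeping the whole run linear."""
    rng = np.random.default_rng(seed)
    svd = TruncatedSVD(n_components=min(n_dims, contexts.shape[1] - 1),
                       random_state=seed)
    X = svd.fit_transform(contexts)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # Group-average agglomerative clustering on the sample only
    labels = AgglomerativeClustering(n_clusters=n_senses,
                                     linkage='average').fit_predict(sample)
    seeds = np.vstack([sample[labels == k].mean(axis=0)
                       for k in range(n_senses)])
    # Seeded refinement over all contexts (k-means in place of EM)
    km = KMeans(n_clusters=n_senses, init=seeds, n_init=1,
                random_state=seed).fit(X)
    return km.cluster_centers_, svd   # sense vectors + the projection

# Disambiguation (slide 3): project a new context with svd.transform and
# assign it to the nearest sense vector, e.g. by cosine similarity.
```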

  7. Evaluation • Hand-labeled corpus of 10 naturally ambiguous and 10 artificial words • Throw out low-frequency senses and leave only 2 most frequent • Number of clusters • 2 clusters: use gold standard to evaluate • 10 clusters: no gold standard; use purity • Sense-based IR

  8. Results (highlights) • Overall performance for pseudo-words is higher than for naturally ambiguous words • Some pseudo-words (e.g. wide range/consulting firm) and natural words (e.g. space in its area and volume senses) perform poorly because they are topically amorphous • IR evaluation • Vector-space model with senses as dimensions • 7.4% improvement on the TREC-1 collection

  9. Graph-based Algorithms • Build a co-occurrence matrix • View it as a graph • Small world properties • Most nodes have few connections • Few are highly connected • Look for densely populated regions • Known as High-Density Components • Map ambiguous instances to one of these regions
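A sketch of the graph construction, assuming contexts are already tokenized; networkx and the min_count cutoff are illustrative choices rather than the paper's exact settings:

```python
import networkx as nx
from collections import Counter
from itertools import combinations

def cooccurrence_graph(contexts, min_count=2):
    """View the co-occurrence matrix as a graph: nodes are words and an
    edge records how often two words appeared in the same context."""
    counts = Counter()
    for ctx in contexts:                       # each context: list of tokens
        counts.update(combinations(sorted(set(ctx)), 2))
    G = nx.Graph()
    G.add_edges_from((a, b, {'count': c})
                     for (a, b), c in counts.items() if c >= min_count)
    return G

# Small-world check: a few hub nodes with very high degree, most nodes sparse
# degrees = sorted((d for _, d in G.degree()), reverse=True)
```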

  10. A Sample Co-Occurrence Graph • barrage – dam, play-off, barrier, roadblock, police cordon, barricade

  11. Algorithm Details • Nodes correspond to words • Edges reflect the degree of semantic association between words • Model the association with conditional probabilities: w(A,B) = 1 − max[p(A|B), p(B|A)] • Detect high-density components • Sort nodes by their degree • Take the top one (the root hub) and remove it along with all its neighbors (hoping to eliminate the entire component) • Iterate until all the high-density components are found
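A sketch of the weighting and the greedy hub detection, reusing the count-annotated graph from the previous sketch; freq is a word-frequency dict, and min_degree is an illustrative cutoff (Veronis uses several such thresholds):

```python
def hyperlex_weights(G, freq):
    """w(A,B) = 1 - max[p(A|B), p(B|A)]: the more strongly two words are
    associated, the smaller (closer) the edge weight."""
    for a, b, data in G.edges(data=True):
        c = data.get('count', 1)
        data['weight'] = 1.0 - max(c / freq[b],   # p(A|B)
                                   c / freq[a])   # p(B|A)

def root_hubs(G, min_degree=6):
    """Greedily take the highest-degree node as a root hub, then delete it
    together with its neighborhood, hoping to strip out one whole
    high-density component; repeat until only sparse nodes remain."""
    H, hubs = G.copy(), []
    while len(H) > 0:
        node, degree = max(H.degree(), key=lambda nd: nd[1])
        if degree < min_degree:
            break
        hubs.append(node)
        H.remove_nodes_from([node] + list(H.neighbors(node)))
    return hubs
```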

  12. Example (illustrative figure omitted)

  13. Disambiguation • Delineate the high-density components • Need to attach them back to their root hubs • Attach the target word to all root hubs • Compute the minimum spanning tree (MST) • Map the ambiguous instance to one of the components • Examine each word in the instance's context • Compute the distance from each of these words to its root hub (in the MST, each word hangs under exactly one hub) • Compute the total score for each hub
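A sketch of the whole disambiguation step; the target is assumed not to be a node of the graph itself, and the 1/(1+d) per-word contribution is a natural reading of the slide, not a verified constant of the paper:

```python
import networkx as nx

def disambiguate(G, hubs, target, context_words):
    """Attach the target to every root hub with zero-weight edges, take
    the minimum spanning tree, and score each hub by how closely the
    instance's context words hang under it."""
    T = G.copy()
    for h in hubs:
        T.add_edge(target, h, weight=0.0)
    mst = nx.minimum_spanning_tree(T, weight='weight')
    scores = dict.fromkeys(hubs, 0.0)
    for w in context_words:
        if w == target or w not in mst or not nx.has_path(mst, target, w):
            continue
        path = nx.shortest_path(mst, target, w, weight='weight')
        hub = path[1]                   # each word is under exactly one hub
        d = nx.path_weight(mst, path, weight='weight')
        scores[hub] += 1.0 / (1.0 + d)  # closer context words count more
    return max(scores, key=scores.get)  # the winning sense component
```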

  14. PageRank • Based on PageRank (Brin and Page, 1998), adapted for weighted graphs • An alternative way to rank nodes • Algorithm • Initialize nodes to random values • Compute PageRank • Iterate a fixed number of times
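A hand-rolled sketch matching the slide's outline (random initialization, a fixed number of iterations); it assumes edge weights encode association strength, so larger means stronger, the opposite of the HyperLex distance weights above:

```python
import random

def weighted_pagerank(G, damping=0.85, iters=30, seed=0):
    """PageRank adapted to weighted graphs: each node spreads its rank to
    its neighbors in proportion to the connecting edge weights."""
    random.seed(seed)
    rank = {n: random.random() for n in G}        # random initialization
    strength = {n: sum(d.get('weight', 1.0) for d in G[n].values())
                for n in G}
    N = len(G)
    for _ in range(iters):                        # fixed number of iterations
        rank = {n: (1 - damping) / N + damping *
                   sum(rank[m] * G[n][m].get('weight', 1.0) / strength[m]
                       for m in G[n] if strength[m] > 0)
                for n in G}
    return rank                                   # hubs = top-ranked nodes
```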

  15. Evaluation • First need to optimize 10 parameters, e.g. • P1. Minimum frequency of edges (occurrences) • P2. Minimum frequency of vertices (words) • P3. Edges with weights above this value are removed • Train on Senseval2 using unsupervised metrics • Entropy, purity, and F-score • Evaluate on Senseval3 • Lexical sample data • 10-point gain over the MFS baseline • Beat a supervised system with lexical features by 1 point • All-words task • Little training data • Supervised systems barely beat the MFS baseline • This system is less than 1 point below the best system • The difference in performance is not statistically significant

  16. Finding the Predominant Sense • Predominant senses in WordNet are derived from SemCor (a relatively small, sense-tagged subset of the Brown corpus) • Idiosyncrasies • tiger (first sense is an audacious person, not the animal) • star (depending on context, celebrity or celestial body)

  17. Distributional Similarity • Nouns that occur in object positions of the same verbs are similar (e.g. beer and vodka as objects of to drink) • Can automatically generate a thesaurus-like neighbor list for the target word (Hindle, 1990), (Lin, 1998) • w0:s0, w1:s1, …, wn:sn • The neighbor list conflates the target's different senses • The quantity and quality of the neighbors must relate to the predominant sense • Need to compute the proximity of each neighbor to each of the senses of the target word (e.g. with the Lesk or JCN WordNet similarity measures)
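A toy sketch of the neighbor-list construction from verb-object pairs; cosine over verb profiles stands in for the information-theoretic similarity of (Lin, 1998):

```python
import math
from collections import defaultdict

def neighbor_list(verb_object_pairs, target, k=10):
    """Nouns are similar when they occur as objects of the same verbs:
    profile each noun by its verb counts and rank candidate neighbors
    by cosine similarity to the target's profile."""
    profile = defaultdict(lambda: defaultdict(int))
    for verb, obj in verb_object_pairs:      # e.g. ('drink', 'beer')
        profile[obj][verb] += 1

    def cosine(a, b):
        dot = sum(c * b.get(v, 0) for v, c in a.items())
        na = math.sqrt(sum(c * c for c in a.values()))
        nb = math.sqrt(sum(c * c for c in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    scores = [(n, cosine(profile[target], profile[n]))
              for n in profile if n != target]
    return sorted(scores, key=lambda ns: ns[1], reverse=True)[:k]
```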

  18. Algorithm • w – the target word • Nw = {n1, n2, …, nk} – the ordered set of the top k most similar neighbors of the target word • {dss(w, n1), dss(w, n2), …, dss(w, nk)} – the distributional similarity score of each of the k neighbors • wsi ∈ senses(w) – the senses of the target word • wnss(wsi, nj) – the WordNet similarity score between sense wsi of the target word and the sense of neighbor nj that maximizes this score • PrevalenceScore(wsi) – the ranking of sense wsi of the target word as the predominant sense
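These definitions combine into McCarthy et al.'s ranking: PrevalenceScore(wsi) = Σj dss(w, nj) × wnss(wsi, nj) / Σi′ wnss(wsi′, nj), i.e. each neighbor votes with its distributional similarity, split across the target's senses in proportion to WordNet similarity. A minimal sketch, assuming dss and wnss are supplied as callables:

```python
def prevalence_scores(senses, neighbors, dss, wnss):
    """Rank each sense of the target word: every neighbor contributes its
    distributional similarity, weighted by the share of total WordNet
    similarity that this particular sense accounts for."""
    scores = {}
    for s in senses:
        total = 0.0
        for n in neighbors:
            denom = sum(wnss(s2, n) for s2 in senses)
            if denom > 0:
                total += dss(n) * wnss(s, n) / denom
        scores[s] = total
    return scores   # the top-scoring sense is the predominant one
```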

  19. Experiment 1 • Derive a thesaurus from the BNC • SemCor experiments • Metric: accuracy of finding the MFS • Metric: WSD accuracy • Baseline: random accuracy • The upper bound for the WSD task is 67% • Both experiments beat the random baseline (54% and 48% respectively) • Hand examination • Some errors are due to genre and time-period variation between the corpora

  20. Experiment 2 • Use the Senseval2 all-words task • Label each word with the first sense computed • Automatically • According to SemCor • From the Senseval2 data itself (upper bound) • Automatic precision/recall are only a few points lower than SemCor's

  21. Experiment 3 • Investigate how the MFS changes across domains • SPORTS and FINANCE domains of the Reuters corpus • No hand-annotated data, so results are hand-examined • Most words displayed the expected change in MFS • tie changes from draw to affiliation

  22. Discussion: Algorithms • Context • Bag-of-words: Schutze and Agirre et al. • Syntactic: McCarthy et al. • Is bag-of-words sufficient? • E.g. for topically amorphous words • Co-occurrence • Co-occurrence matrix: Schutze and Agirre et al. • Used to look for similar nouns: McCarthy et al. • Order of co-occurrence • First order: all three papers • Second order: Schutze and McCarthy et al. • Higher-order: Agirre et al. • PageRank computes global rankings • The MST links all nodes to the root • An advantage of the graph-based methods

  23. Discussion: Evaluation • Testbeds: little ground for cross-comparison • Schutze: his own corpus • Agirre et al.: train parameters on Senseval2 and test on Senseval3 data • McCarthy et al.: test on SemCor, Senseval2, and Reuters • Methodology • Map clusters to the gold standard (Schutze and Agirre et al.) • Unsupervised evaluation (Schutze and Agirre et al.) • Compare to various baselines (MFS, Lesk, random) • Use an application (Schutze)
