
Unsupervised Methods



  1. Unsupervised Methods

  2. Association Measures • Association between items: assoc(x,y) • term-term, term-document, term-category, … • Simple measure: freq(x,y), log(freq(x,y))+1 • Based on contingency table
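
As an illustration of these measures, here is a minimal sketch (in Python, with illustrative function and variable names) of building a 2x2 contingency table for a pair of items from a list of co-occurrence pairs, and of the simple log-frequency association:

```python
import math

def contingency(pairs, x, y):
    """2x2 contingency table for items x and y from a list of (x, y) co-occurrence pairs."""
    n = len(pairs)
    n_xy = sum(1 for a, b in pairs if a == x and b == y)
    n_x = sum(1 for a, b in pairs if a == x)
    n_y = sum(1 for a, b in pairs if b == y)
    # cells: (x,y), (x,~y), (~x,y), (~x,~y)
    return n_xy, n_x - n_xy, n_y - n_xy, n - n_x - n_y + n_xy

def simple_assoc(freq_xy):
    """Simple association measure: log(freq(x,y)) + 1, with 0 for unseen pairs."""
    return math.log(freq_xy) + 1 if freq_xy > 0 else 0.0
```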

  3. Mutual Information • The term corresponding to the pair (x,y) in the mutual information of X and Y, i.e. the pointwise mutual information: log( p(x,y) / (p(x)·p(y)) ) • Disadvantage: the MI value is inflated for low freq(x,y) • Examples: results for two NLP articles
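
A hedged sketch of pointwise mutual information computed from such counts; the numeric comparison in the comment illustrates the inflation for low-frequency pairs:

```python
import math

def pmi(n_xy, n_x, n_y, n):
    """Pointwise mutual information: log p(x,y) / (p(x) p(y)), estimated from counts."""
    if n_xy == 0:
        return float("-inf")
    return math.log((n_xy * n) / (n_x * n_y))

# A pair seen once for two words that each occur once gets a very high PMI,
# illustrating the inflation for low freq(x,y):
#   pmi(1, 1, 1, 1_000_000)        ≈ 13.8
#   pmi(500, 1000, 1000, 1_000_000) ≈ 6.2
```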

  4. Log-Likelihood Ratio Test • Compares the likelihood of the data under two competing hypotheses (Dunning, 1993) • Does not depend heavily on normality assumptions; can be applied to small samples • Used to test whether p(x|y) = p(x|~y) = p(x), by comparing this hypothesis to the general case (inequality) • A high log-likelihood score indicates that the data is much less likely under the equality assumption

  5. Log-Likelihood (cont.) • Likelihood function: L(p; k, n) = p^k (1-p)^(n-k) • The likelihood ratio: λ = max L under the equality hypothesis / max L under the general model • -2 log λ is asymptotically χ²-distributed • High -2 log λ: the data is less likely given the equality hypothesis

  6. Log-Likelihood for Bigrams

  7. Log-Likelihood for Binomial • Maximum obtained for the maximum-likelihood estimates: p1 = k1/n1 and p2 = k2/n2 in the general case, and p = (k1+k2)/(n1+n2) under the equality hypothesis (a sketch of the resulting statistic follows)
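
A minimal sketch of the resulting statistic, assuming the standard Dunning-style parameterization (k1, n1 for the conditioned context, k2, n2 for its complement); names are illustrative:

```python
import math

def _ll(k, n, p):
    """Log-likelihood of k successes in n trials under a binomial with parameter p."""
    # clamp p to avoid log(0) in the degenerate cases k == 0 or k == n
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(n_xy, n_x, n_y, n):
    """Dunning-style -2 log lambda for the pair (x, y) from a 2x2 contingency table."""
    k1, n1 = n_xy, n_y              # occurrences of x in the context of y
    k2, n2 = n_x - n_xy, n - n_y    # occurrences of x outside the context of y
    p1, p2 = k1 / n1, k2 / n2       # ML estimates under the general model
    p = (k1 + k2) / (n1 + n2)       # pooled estimate under H0: p(x|y) = p(x|~y)
    return 2 * (_ll(k1, n1, p1) + _ll(k2, n2, p2)
                - _ll(k1, n1, p) - _ll(k2, n2, p))
```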

  8. Measuring Term Topicality • For query relevance ranking: Inverse Document Frequency • For term extraction: • Frequency • Frequency ratio for specialized vs. general corpus • Entropy of term co-occurrence distribution • Burstiness: • Entropy of distribution (frequency) in documents • Proportion of topical documents for term (freq>1) within all documents containing term (Katz, 1996)
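
A short sketch of these topicality measures, assuming a term's per-document frequency list is available; the function names are illustrative:

```python
import math

def idf(doc_freq, n_docs):
    """Inverse document frequency, for query relevance ranking."""
    return math.log(n_docs / doc_freq)

def doc_entropy(freqs_per_doc):
    """Entropy of a term's frequency distribution over the documents containing it."""
    total = sum(freqs_per_doc)
    probs = [f / total for f in freqs_per_doc if f > 0]
    return -sum(p * math.log(p) for p in probs)

def burstiness(freqs_per_doc):
    """Katz-style burstiness: share of 'topical' documents (freq > 1) among documents containing the term."""
    containing = [f for f in freqs_per_doc if f > 0]
    topical = [f for f in containing if f > 1]
    return len(topical) / len(containing) if containing else 0.0
```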

  9. Similarity Measures • Cosine: Σ_att assoc(u,att)·assoc(v,att) / (‖assoc(u)‖·‖assoc(v)‖) • Min/Max (weighted Jaccard): Σ_att min(assoc(u,att), assoc(v,att)) / Σ_att max(assoc(u,att), assoc(v,att)) • KL to Average (Jensen-Shannon): ½·KL(p‖m) + ½·KL(q‖m), with m = (p+q)/2 (sketched below)
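
A sketch of the three measures over sparse association vectors (dicts mapping attributes to weights); the KL-to-average variant is written in its Jensen-Shannon form, as used later in the talk:

```python
import math

def cosine(u, v):
    """Cosine over sparse association vectors (dicts attribute -> assoc weight)."""
    dot = sum(w * v.get(a, 0.0) for a, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def min_max(u, v):
    """Min/Max (weighted Jaccard): sum of min weights over sum of max weights."""
    atts = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

def kl_to_average(p, q):
    """Jensen-Shannon: average KL divergence of distributions p and q to their mean."""
    atts = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in atts}
    def kl(a):
        return sum(a[x] * math.log(a[x] / m[x]) for x in a if a[x] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)
```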

  10. A Unifying Schema of Similarity (with Erez Lotan) • A general schema encoding most measures • Identifies explicitly the important factors that determine (word) similarity • Provides the basis for: • a general and efficient similarity computation procedure • evaluating and comparing alternative measures and components

  11. Mapping to Unified Similarity Scheme • [Diagram: co-occurrence counts count(u,att) and count(v,att) for words u and v are mapped to association values assoc(u,att) and assoc(v,att), which are then combined per attribute by joint(assoc(u,att), assoc(v,att)).]

  12. Association and Joint Association • assoc(u,att): quantify association strength • mutual information, weighted log frequency, conditional probability (orthogonal to scheme) • joint(assoc(u,att),assoc(v,att)): quantify the “similarity” of the two associations • ratio, difference, min, product

  13. Normalization • Global weight of a word vector (for cosine: the vector norm ‖assoc(u)‖) • Normalization factor combining the two global weights (for cosine: ‖assoc(u)‖·‖assoc(v)‖)

  14. The General Similarity Scheme
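
A hedged sketch of the scheme's structure, with pluggable assoc, joint and normalization components; the function signatures are illustrative, not the exact formulation of Dagan and Lotan:

```python
def unified_similarity(u_assoc, v_assoc, joint, norm):
    """General scheme: normalized sum of joint association over common attributes.

    u_assoc, v_assoc: dicts attribute -> assoc(word, att), for any assoc measure.
    joint: function combining the two association values (min, product, ratio, ...).
    norm: function computing a normalization factor from the two vectors.
    """
    common = set(u_assoc) & set(v_assoc)  # only shared attributes contribute
    total = sum(joint(u_assoc[a], v_assoc[a]) for a in common)
    return total / norm(u_assoc, v_assoc)

# e.g. Min/Max: joint = min, norm = sum of max weights over the union of attributes
```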

  15. Min/Max Measures • May be viewed as:

  16. Associations Used with Min/Max • Log-frequency and Global Entropy Weight (Grefenstette, 1994): • Mutual information (Dagan et al., 1993/5):

  17. Cosine Measure • Used for word similarity (Ruge, 1992) with: assoc(u,att)=ln(freq(u,att)) • Popular for document ranking (vector space)

  18. Methodological Benefits • Joint work with Erez Lotan (Dagan 2000 and in preparation) • Uniform understanding of similarity measure structure • Modular evaluation/comparison of measure components • Modular implementation architecture, easy experimentation by “plugging” alternative measure combinations

  19. Empirical Evaluation • Thesaurus for query expansion (e.g. “insurance laws”): similar words for law
     Word          Similarity   Judgment
     regulation    0.050242     +
     rule          0.048414     +
     legislation   0.038251     +
     guideline     0.035041     +
     commission    0.034499     -
     bill          0.033414     +
     budget        0.031043     -
     regulator     0.031006     +
     code          0.030998     +
     circumstance  0.030534     -
  • Precision and comparative recall at each point in the list

  20. Comparing Measure Combinations • [Precision–recall plot omitted] • Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points), and were stable over association measures

  21. Effect of Co-occurrence Type on Semantic Similarity

  22. Computational Benefits • Efficient implementation through sparse matrix indexing: compute over common attributes only (attributes with non-zero weight for both words) • [Diagram: a words × attributes sparse matrix, with an index from each attribute att_i to the words having a non-zero entry for it, producing the similarity results.] • Complexity reduced by the “sparseness” factor – #non-zero cells / total #cells • Two orders of magnitude on corpus data
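
A sketch of the sparse computation, assuming association vectors are stored as dicts: an inverted index from attributes to words lets the similarity accumulation touch only words that share at least one non-zero attribute with u:

```python
from collections import defaultdict

def build_attribute_index(assoc_vectors):
    """Inverted index: attribute -> list of (word, assoc weight), non-zero cells only."""
    index = defaultdict(list)
    for word, vec in assoc_vectors.items():
        for att, weight in vec.items():
            index[att].append((word, weight))
    return index

def similar_words(u, assoc_vectors, index, joint=min):
    """Accumulate joint association only over words sharing an attribute with u."""
    scores = defaultdict(float)
    for att, w_u in assoc_vectors[u].items():
        for v, w_v in index[att]:
            if v != u:
                scores[v] += joint(w_u, w_v)
    return sorted(scores.items(), key=lambda x: -x[1])  # unnormalized scores
```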

  23. General Scheme - Conclusions • A general mathematical scheme • Identifies the important factors for measuring similarity • Efficient general procedure based on scheme • Empirical comparison of different measure components (measure structure and assoc) • Successful application in an Internet crawler for thesaurus construction (small corpora)

  24. Clustering Methods • Input: A set of objects (words, documents) • Output: A set of clusters (sets of elements) • Based on a criterion for the quality of a class, which guides cluster split/merge/modification • a distance function between objects/classes • a global quality function

  25. Clustering Types • Soft / Hard • Hierarchical / Flat • Top-down / bottom-up • Predefined number of clusters or not • Input: either all point-to-point distances, or the original vector representation of the points, computing the needed distances during clustering

  26. Applications of Clustering • Word clustering • Constructing a hierarchical thesaurus • Compactness and generalization in word cooccurrence modeling (will be discussed later) • Document clustering • Browsing of document collections and search query output • Assistance in defining a set of supervised categories

  27. Hierarchical Agglomerative Clustering Methods (HACM) 1. Initialize every point as a cluster 2. Compute a merge score for all cluster pairs 3. Perform the best scoring merge 4. Compute the merge score between the new cluster and all other clusters 5. If more than one cluster remains, return to 3
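
A naive sketch of steps 1–5 with a single-link merge score (see the next slide for alternative scores); recomputing all pair scores each pass keeps the code short at the cost of efficiency:

```python
def hacm(points, distance, target_clusters=1):
    """Naive single-link hierarchical agglomerative clustering (O(n^3) sketch)."""
    clusters = [[p] for p in points]                  # 1. every point starts as its own cluster
    while len(clusters) > target_clusters:            # 5. repeat while clusters remain to merge
        best = None
        for i in range(len(clusters)):                # 2./4. (re)compute merge scores for all pairs
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                                # 3. perform the best-scoring merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```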

  28. Types of Merge Score • Minimal distance between the two candidates for the merge. Alternatives for cluster distance: • Single link: distance between the two nearest points • Complete link: distance between the two furthest points • Group average: average pairwise distance for all points • Centroid: distance between the two cluster centroids • Based on the “quality” of the merged class: • Ward’s method: minimal increase in total within-group sum of squares (average squared distance to centroid) • Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)

  29. Unsupervised Statistics and Generalizations for Classification • Many supervised methods use cooccurrence statistics as features or probability estimates • eat a {peach,beach} • fire a missile vs. fire the prime minister • Sparse data problem: if alternative cooccurrences never occurred, how to estimate their probabilities, or their relative “strength” as features?

  30. Application: Semantic Disambiguation • Anaphora resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995): “The terrorist pulled the grenade from his pocket and threw it at the policeman” – what does it refer to? • Traditional AI-style approach: manually encoded semantic preferences/constraints, e.g. an <object – verb> preference linking the Weapon class (Bombs: grenade) to the Actions class (Cause_movement: throw, drop)

  31. Statistical Approach • “Language modeling” for disambiguation: corpus (text collection) counts serve as the “semantic” judgment – <verb–object: throw-grenade> 20 times, <verb–object: throw-pocket> 1 time • Semantic confidence is combined with syntactic preferences → it = grenade

  32. What about sense disambiguation? (for translation) • “I bought soap bars” vs. “I bought window bars” – in each sentence, is bar sense1 (‘chafisa’) or sense2 (‘sorag’)? • “Hidden” senses – is supervised labeling required? • Corpus (text collection) counts per sense: Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times; Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times

  33. Solution: Mapping to Another Language • English(-English)-Hebrew dictionary: bar1 → ‘chafisa’, bar2 → ‘sorag’, soap → ‘sabon’, window → ‘chalon’ • Map ambiguous constructs to the second language (all possibilities) and count in a Hebrew corpus: <noun-noun: soap-bar> → 1: <noun-noun: ‘chafisat-sabon’> 20 times, 2: <noun-noun: ‘sorag-sabon’> 0 times; <noun-noun: window-bar> → 1: <noun-noun: ‘chafisat-chalon’> 0 times, 2: <noun-noun: ‘sorag-chalon’> 15 times • Exploiting differences in ambiguity between the languages • Principle – intersecting redundancies (Dagan and Itai 1994)

  34. Selection Model Highlights • Multinomial model, under certain linguistic assumptions • Selection “confidence” – a lower bound for the odds ratio • Overlapping ambiguous constructs are resolved through constraint propagation, in decreasing confidence order • Results (Hebrew→English): coverage ~70%, precision within coverage ~90% • ~20% improvement over choosing the most frequent translation (the common baseline)
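
A hedged sketch of such a confidence score: a lower bound on the log odds of preferring the more frequent translation, using a simple normal-approximation bound with small-count smoothing. This is an illustrative simplification, not the exact bound of Dagan and Itai (1994):

```python
import math

def selection_confidence(n1, n2, z=1.645, smoothing=0.5):
    """Lower confidence bound on the log odds for preferring translation 1 over translation 2.

    Point estimate ln(n1/n2) minus z standard errors, with small-count smoothing.
    The z value and smoothing constant are illustrative parameters.
    """
    n1, n2 = n1 + smoothing, n2 + smoothing
    log_odds = math.log(n1 / n2)
    stderr = math.sqrt(1.0 / n1 + 1.0 / n2)
    return log_odds - z * stderr

def select(counts, threshold=0.5):
    """Pick the most frequent alternative only if its confidence clears a threshold.

    counts: dict translation -> corpus count (assumes at least two alternatives);
    returning None means abstain, which trades coverage for precision.
    """
    (t1, n1), (t2, n2) = sorted(counts.items(), key=lambda x: -x[1])[:2]
    return t1 if selection_confidence(n1, n2) > threshold else None
```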

  35. Data Sparseness and Similarity • <verb–object: ‘hidpis-tikiya’> – which translation? <verb–object: print-folder> 0 times, <verb–object: print-file_cabinet> 0 times • Standard approach: “back-off” to single term frequency • Similarity-based inference: folder is similar to file, directory, record, …; file_cabinet is similar to cupboard, closet, suitcase, …; compare the <verb-object> associations of print with each set of similar words

  36. Computing Distributional Similarity • folder and file are similar: both co-occur with attributes (context words/features) such as print, erase, open, retrieve, browse, save, … • Association between word u (“folder”) and its “attributes” is based on mutual information • Similarity between u and v: weighted Jaccard, in [0,1] (sketched below)
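
A sketch of the two components, assuming plain co-occurrence counts: association via (positive) pointwise mutual information, and similarity via the weighted Jaccard (Min/Max) of the association vectors:

```python
import math

def mi_assoc(count_uv, count_u, count_v, total):
    """Association between a word and an attribute via (positive) mutual information."""
    if count_uv == 0:
        return 0.0
    return max(0.0, math.log((count_uv * total) / (count_u * count_v)))

def weighted_jaccard(u_assoc, v_assoc):
    """Similarity in [0, 1]: sum of min over sum of max of the association weights."""
    atts = set(u_assoc) | set(v_assoc)
    num = sum(min(u_assoc.get(a, 0.0), v_assoc.get(a, 0.0)) for a in atts)
    den = sum(max(u_assoc.get(a, 0.0), v_assoc.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0
```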

  37. Disambiguation Algorithm • Selection of the preferred alternative (print-folder vs. print-file_cabinet), each judged through its set of similar words (folder: file, directory, record, …; file_cabinet: cupboard, closet, suitcase, …) • Hypothesized similarity-based frequency, derived from the average association of the verb with the similar words (incorporating single term frequency) • The hypothesized frequencies are compared, as sketched below
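
A hedged sketch of the similarity-based score for an unseen verb–object pair; the weighting is illustrative and not the exact formula of Dagan, Marcus and Markovitch (1995):

```python
def similarity_based_score(target, verb, similar_words, assoc, freq):
    """Hypothesized score for an unseen (verb, target) pair.

    Similarity-weighted average of the verb's association with the target's nearest
    neighbours, scaled by the target's own frequency. All arguments are assumed
    helpers: similar_words(target) -> list of (word, similarity), assoc(verb, word),
    freq(target).
    """
    neighbours = similar_words(target)
    total_sim = sum(s for _, s in neighbours)
    if total_sim == 0:
        return 0.0
    avg_assoc = sum(s * assoc(verb, w) for w, s in neighbours) / total_sim
    return freq(target) * avg_assoc

# Disambiguation: choose the alternative whose hypothesized score for the observed verb is higher.
```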

  38. Computation and Evaluation • Heuristic search used to speed up computation of the k most similar words • Results (Hebrew→English): 15% coverage increase, while decreasing precision by 2% • Accuracy 15% better than backing off to single word frequency (Dagan, Marcus and Markovitch 1995)

  39. Probabilistic Framework – Smoothing • Counts are obtained from a sample of the probability space • Maximum Likelihood Estimate: proportional to sample counts; the MLE assigns 0 probability to unobserved events • Smoothing discounts observed events, leaving probability “mass” to unobserved events: a discounted estimate for observed events, a positive estimate for unobserved events
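
A minimal sketch contrasting the MLE with a discounted estimate; simple absolute discounting stands in here for the schemes named on the next slide:

```python
def mle(counts, total):
    """Maximum-likelihood estimate: proportional to sample counts, 0 for unseen events."""
    return {e: c / total for e, c in counts.items()}

def absolute_discount(counts, total, vocab, d=0.5):
    """Discount each observed count by d and spread the freed mass uniformly over unseen events.

    Assumes integer counts >= 1; d is an illustrative discount constant.
    """
    seen = set(counts)
    unseen = [e for e in vocab if e not in seen]
    freed = d * len(seen) / total
    probs = {e: (c - d) / total for e, c in counts.items()}
    for e in unseen:
        probs[e] = freed / len(unseen)   # positive estimate for unobserved events
    return probs
```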

  40. Smoothing Conditional Attribute Probability • Good-Turing smoothing scheme – discount & redistribute • Katz’s seminal back-off scheme (speech language modeling) • Similarity-based smoothing (Dagan, Lee, Pereira 1999)

  41. Similarity/Distance Functions for Probability Distributions • Jensen-Shannon divergence (KL-distance to the average): the information loss from approximating u and v by their average • L1 norm • β controls the relative influence of close vs. remote neighbors
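
A sketch of the two functions over probability distributions represented as dicts, plus an illustrative exponential weighting in which β controls how quickly remote neighbours are discounted (the exact weighting in Dagan, Lee and Pereira 1999 may differ):

```python
import math

def l1_distance(p, q):
    """L1 norm between two probability distributions (dicts event -> prob)."""
    events = set(p) | set(q)
    return sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in events)

def jensen_shannon(p, q):
    """KL distance to the average: information lost by replacing p and q with their mean."""
    events = set(p) | set(q)
    m = {e: 0.5 * (p.get(e, 0.0) + q.get(e, 0.0)) for e in events}
    def kl(a):
        return sum(a[e] * math.log(a[e] / m[e]) for e in a if a[e] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def neighbour_weight(distance, beta=5.0):
    """Turn a distance into a neighbour weight; beta (illustrative value) sets close vs. remote influence."""
    return math.exp(-beta * distance)
```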

  42. Sample Results • Most similar words to “guy”, with their typical common verb contexts: see, get, give, tell, take, … • PC: an earlier attempt at similarity-based smoothing • Several smoothing experiments (method A performed best): • Language modeling for speech (e.g. hunt bears vs. pears) • Perplexity (predicting test corpus likelihood) • Data recovery task (similar to sense disambiguation) • Insensitive to the exact value of β

  43. Class-Based Generalization • Obtain a cooccurrence-based clustering of words and model a word cooccurrence by word-class or class-class cooccurrence • Brown et al., CL 1992: Mutual information clustering; class-based model interpolated to n-gram model • Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling • Similarity/class-based methods: general effectiveness yet to be shown

  44. Conclusions • (Relatively) simple models cover a wide range of applications • Usefulness in (hybrid) systems: automatic processing and knowledge acquisition

  45. Discussion
