
Language Independent Methods of Clustering Similar Contexts (with applications)

Presentation Transcript


  1. Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth http://www.d.umn.edu/~tpederse tpederse@d.umn.edu EuroLAN-2005 Summer School

  2. The Problem • A context is a short unit of text • often a phrase to a paragraph in length, although it can be longer • Input: N contexts • Output: K clusters • where the contexts within each cluster are more similar to each other than to the contexts found in other clusters EuroLAN-2005 Summer School

  3. Language Independent Methods • Do not utilize syntactic information • No parsers, part of speech taggers, etc. required • Do not utilize dictionaries or other manually created lexical resources • Based on lexical features selected from corpora • No manually annotated data of any kind, methods are completely unsupervised in the strictest sense • Assumption: word segmentation can be done by looking for white spaces between strings EuroLAN-2005 Summer School

  4. Outline (Tutorial) • Background and motivations • Identifying lexical features • Measures of association & tests of significance • Context representations • First & second order • Dimensionality reduction • Singular Value Decomposition • Clustering methods • Agglomerative & partitional techniques • Cluster labeling • Evaluation techniques • Gold standard comparisons EuroLAN-2005 Summer School

  5. Outline (Practical Session) • Headed contexts • Name Discrimination • Word Sense Discrimination • Abbreviations • Headless contexts • Email/Newsgroup Organization • Newspaper text • Identifying Sets of Related Words EuroLAN-2005 Summer School

  6. SenseClusters • A package designed to cluster contexts • Integrates with various other tools • Ngram Statistics Package • Cluto • SVDPACKC • http://senseclusters.sourceforge.net EuroLAN-2005 Summer School

  7. Many thanks… • Satanjeev (“Bano”) Banerjee (M.S., 2002) • Founding developer of the Ngram Statistics Package (2000-2001) • Now PhD student in the Language Technology Institute at Carnegie Mellon University http://www-2.cs.cmu.edu/~banerjee/ • Amruta Purandare (M.S., 2004) • Founding developer of SenseClusters (2002-2004) • Now PhD student in Intelligent Systems at the University of Pittsburgh http://www.cs.pitt.edu/~amruta/ • Anagha Kulkarni (M.S., 2006, expected) • Enhancing SenseClusters since Fall 2004! • http://www.d.umn.edu/~kulka020/ • National Science Foundation (USA) for supporting Bano, Amruta, Anagha and me (!) via CAREER award #0092784 EuroLAN-2005 Summer School

  8. Practical Session • Experiment with SenseClusters • http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi • Has both a command line and web interface (above) • Can be installed on a Linux/Unix machine without too much work • http://senseclusters.sourceforge.net • Has some dependencies that must be installed, so having superuser access and/or sysadmin experience helps • Complete system (SenseClusters plus dependencies) is available on CD EuroLAN-2005 Summer School

  9. Background and Motivations EuroLAN-2005 Summer School

  10. Headed and Headless Contexts • A headed context includes a target word • Our goal is to collect multiple contexts that mention a particular target word in order to try to identify the different senses of that word • A headless context has no target word • Our goal is to identify the contexts that are similar to each other EuroLAN-2005 Summer School

  11. Headed Contexts (input) • I can hear the ocean in that shell. • My operating system shell is bash. • The shells on the shore are lovely. • The shell command line is flexible. • The oyster shell is very hard and black. EuroLAN-2005 Summer School

  12. Headed Contexts (output) • Cluster 1: • My operating system shell is bash. • The shell command line is flexible. • Cluster 2: • The shells on the shore are lovely. • The oyster shell is very hard and black. • I can hear the ocean in that shell. EuroLAN-2005 Summer School

  13. Headless Contexts (input) • The new version of Linux is more stable and has better support for cameras. • My Chevy Malibu has had some front end troubles. • Osborne made one of the first personal computers. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often! EuroLAN-2005 Summer School

  14. Headless Contexts (output) • Cluster 1: • The new version of Linux is more stable and has better support for cameras. • Osborne made one of the first personal computers. • Cluster 2: • My Chevy Malibu has had some front end troubles. • The brakes went out, and the car flew into the house. • With the price of gasoline, I think I’ll be taking the bus more often! EuroLAN-2005 Summer School

  15. Applications • Web search results are headed contexts • The term you search for is included in the snippet • Web search results are often disorganized – two people sharing the same name, two organizations sharing the same abbreviation, etc., often have their pages “mixed up” • Organizing web search results is an important problem. • If you click on search results or follow links in the pages found, you will encounter headless contexts too… EuroLAN-2005 Summer School

  16. – 20. [Slides 16–20 contain only figures, presumably screenshots illustrating the web search example; no transcript text] EuroLAN-2005 Summer School

  21. Applications • Email (public or private) is made up of headless contexts • Short, usually focused… • Cluster similar email messages together • Automatic email foldering • Take all messages from a sent-mail file or inbox and organize them into categories EuroLAN-2005 Summer School

  22. – 23. [Slides 22–23 contain only figures, presumably screenshots illustrating the email example; no transcript text] EuroLAN-2005 Summer School

  24. Applications • News articles are another example of headless contexts • Entire article or first paragraph • Short, usually focused • Cluster similar articles together EuroLAN-2005 Summer School

  25. – 27. [Slides 25–27 contain only figures, presumably screenshots illustrating the news article example; no transcript text] EuroLAN-2005 Summer School

  28. Underlying Premise… • You shall know a word by the company it keeps • Firth, 1957 (Studies in Linguistic Analysis) • Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis) • Harris, 1968 (Mathematical Structures of Language) • Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis) • Miller and Charles, 1991 (Language and Cognitive Processes) • Various extensions… • Similar contexts will have similar meanings, etc. • Names that occur in similar contexts will refer to the same underlying person, etc. EuroLAN-2005 Summer School

  29. Identifying Lexical Features Measures of Association and Tests of Significance EuroLAN-2005 Summer School

  30. What are features? • Features represent the (hopefully) salient characteristics of the contexts to be clustered • Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features • Vectors/contexts that include many of the same features will be similar to each other EuroLAN-2005 Summer School
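
  A minimal sketch of the idea (the contexts, features, and the choice of cosine similarity below are illustrative assumptions, not SenseClusters code): each context becomes a vector over a shared feature set, and contexts that share many features receive a high similarity score.

      import math

      # Hypothetical features (e.g., unigrams selected from the corpus)
      features = ["shell", "bash", "command", "ocean", "shore"]

      def to_vector(context, features):
          """Binary first-order vector: 1 if the feature occurs in the context."""
          tokens = context.lower().split()
          return [1 if f in tokens else 0 for f in features]

      def cosine(u, v):
          dot = sum(a * b for a, b in zip(u, v))
          norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
          return dot / norm if norm else 0.0

      c1 = to_vector("my operating system shell is bash", features)
      c2 = to_vector("the shell command line is flexible", features)
      c3 = to_vector("the shells on the shore are lovely", features)

      print(cosine(c1, c2))  # 0.5 -- both contexts contain the feature "shell"
      print(cosine(c1, c3))  # 0.0 -- no features in common ("shells" != "shell")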

  31. Where do features come from? • In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered • This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection • It may also be necessary, since we may need to cluster all of the available data rather than hold some out for a separate feature identification step • e.g., email or news articles EuroLAN-2005 Summer School

  32. Feature Selection • “Test” data – the contexts to be clustered • Assume that the feature selection data is the same as the test data, unless otherwise indicated • “Training” data – a separate corpus of held out feature selection data (that will not be clustered) • may need to be used if you have a small number of contexts to cluster (e.g., web search results) • This sense of “training” is due to Schütze (1998) EuroLAN-2005 Summer School

  33. Lexical Features • Unigram – a single word that occurs more than a given number of times • Bigram – an ordered pair of words that occur together more often than expected by chance • Consecutive or may have intervening words • Co-occurrence – an unordered bigram • Target Co-occurrence – a co-occurrence where one of the words is the target word EuroLAN-2005 Summer School

  34. Bigrams • fine wine (window size of 2) • baseball bat • house of representatives (window size of 3) • president of the republic (window size of 4) • apple orchard • Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?) EuroLAN-2005 Summer School

  35. Co-occurrences • tropics water • boat fish • law president • train travel • Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations EuroLAN-2005 Summer School
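
  A rough sketch of how such features might be extracted (illustrative only; the window conventions here are assumptions, not the NSP defaults):

      from collections import Counter

      def bigrams(tokens, window=2):
          """Ordered pairs (w1, w2) where w2 follows w1 within the window."""
          pairs = Counter()
          for i, w1 in enumerate(tokens):
              for w2 in tokens[i + 1 : i + window]:
                  pairs[(w1, w2)] += 1
          return pairs

      def cooccurrences(tokens, window=8):
          """Unordered pairs of words that appear within the window of each other."""
          pairs = Counter()
          for i, w1 in enumerate(tokens):
              for w2 in tokens[i + 1 : i + window]:
                  pairs[tuple(sorted((w1, w2)))] += 1
          return pairs

      tokens = "the president of the republic visited the house of representatives".split()
      pairs = bigrams(tokens, window=4)
      print(pairs[("president", "republic")])     # 1 -- an ordered pair within a window of 4
      print(pairs[("house", "representatives")])  # 1 -- an ordered pair within a window of 3
      cooc = cooccurrences(tokens, window=8)
      print(cooc[("president", "visited")])       # 1 -- an unordered pair, larger window

  In practice, only the pairs that pass a frequency cutoff or a test of association (next slides) would be kept as features.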

  36. Bigrams and Co-occurrences • Pairs of words tend to be much less ambiguous than unigrams • “bank” versus “river bank” and “bank card” • “dot” versus “dot com” and “dot product” • Trigrams and beyond occur much less frequently (Ngram frequencies are very Zipfian) • Unigrams are noisy, but bountiful EuroLAN-2005 Summer School

  37. “occur together more often than expected by chance…” • Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix • Throw out bigrams that include one or two stop words • Expected values are calculated, based on the model of independence and observed values • How often would you expect these words to occur together, if they only occurred together by chance? • If two words occur “significantly” more often than the expected value, then the words do not occur together by chance. EuroLAN-2005 Summer School
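
  As a small worked example (with made-up counts): in a corpus of 1,000,000 bigrams, suppose “fine” occurs as the first word of 200 bigrams and “wine” occurs as the second word of 300 bigrams. Under the model of independence, the expected frequency of the bigram “fine wine” is 200 * 300 / 1,000,000 = 0.06, so observing the pair even a handful of times is far more often than chance alone would predict.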

  38. – 40. 2x2 Contingency Table [the tables shown on these slides are not captured in this transcript] EuroLAN-2005 Summer School
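
  The table itself is not reproduced in this transcript. The standard layout for a word pair (w1, w2) is roughly as follows (the n_ij notation is the common convention; the slides may differ in detail):

                               w2 is 2nd word    w2 is not 2nd word
      w1 is 1st word                n11                n12           | n1+
      w1 is not 1st word            n21                n22           | n2+
                                    ---                ---
                                    n+1                n+2           | n++  (total bigrams)

  Here n11 is the observed frequency of the bigram (w1, w2), n1+/n2+ and n+1/n+2 are the marginal row and column totals, and the expected values under independence are m_ij = (n_i+ * n_+j) / n++.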

  41. – 42. Measures of Association [the formulas shown on these slides are not captured in this transcript] EuroLAN-2005 Summer School
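
  The formulas themselves are not in the transcript; in terms of the observed counts n_ij and the expected counts m_ij from the 2x2 table, the two scores discussed on the next slide are conventionally written as

      log-likelihood ratio:    G^2 = 2 * sum over i,j of [ n_ij * ln(n_ij / m_ij) ]
      Pearson's chi-squared:   X^2 = sum over i,j of [ (n_ij - m_ij)^2 / m_ij ]

  with the convention that a cell with n_ij = 0 contributes 0 to G^2.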

  43. Interpreting the Scores… • G^2 and X^2 are asymptotically approximated by the chi-squared distribution… • This means…if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get (approximately) the chi-squared distribution (see the sketch below) EuroLAN-2005 Summer School
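
  A small simulation of exactly this procedure (the marginal totals below are made up; once the marginals are fixed, the internal cell n11 follows a hypergeometric distribution under the null hypothesis of independence):

      import numpy as np
      from scipy.stats import chi2

      N = 100000    # total number of bigrams (made-up)
      n1p = 1000    # bigrams whose first word is w1
      np1 = 2000    # bigrams whose second word is w2

      def g_squared(n11):
          """Log-likelihood ratio G^2 for the 2x2 table fixed by n11 and the marginals."""
          obs = np.array([[n11, n1p - n11],
                          [np1 - n11, N - n1p - np1 + n11]], dtype=float)
          exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / N
          with np.errstate(divide="ignore", invalid="ignore"):
              terms = np.where(obs > 0, obs * np.log(obs / exp), 0.0)
          return 2.0 * terms.sum()

      # Fix the marginals, randomly generate the internal cell under independence,
      # and score each of the resulting tables.
      rng = np.random.default_rng(0)
      scores = np.array([g_squared(n11)
                         for n11 in rng.hypergeometric(n1p, N - n1p, np1, size=10000)])

      # If the chi-squared approximation (1 degree of freedom) holds, roughly 5% of
      # the simulated scores should exceed the 95% critical value of 3.841.
      print((scores > chi2.ppf(0.95, df=1)).mean())

  A histogram of these scores should look (approximately) like the chi-squared distribution with one degree of freedom, which is presumably what the plot on the next slide shows.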

  44. [Figure slide, presumably the plot of the score distribution described on the previous slide; not captured in this transcript] EuroLAN-2005 Summer School

  45. Interpreting the Scores… • Values above a certain level of significance can be considered grounds for rejecting the null hypothesis • H0: the words in the bigram are independent • 3.841 (the critical value for 1 degree of freedom) is associated with 95% confidence that the null hypothesis should be rejected EuroLAN-2005 Summer School

  46. Measures of Association • There are numerous measures of association that can be used to identify bigram and co-occurrence features • Many of these are supported in the Ngram Statistics Package (NSP) • http://www.d.umn.edu/~tpederse/nsp.html EuroLAN-2005 Summer School

  47. Measures Supported in NSP • Log-likelihood Ratio (ll) • True Mutual Information (tmi) • Pearson’s Chi-squared Test (x2) • Pointwise Mutual Information (pmi) • Phi coefficient (phi) • T-test (tscore) • Fisher’s Exact Test (leftFisher, rightFisher) • Dice Coefficient (dice) • Odds Ratio (odds) EuroLAN-2005 Summer School
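
  NSP itself is written in Perl; purely as an illustration, a few of these measures can be computed directly from the 2x2 cell counts as sketched below (standard textbook definitions, not the package's own code):

      import math

      def measures(n11, n12, n21, n22):
          """A few association scores for a 2x2 table of observed bigram counts."""
          n1p, n2p = n11 + n12, n21 + n22    # row totals
          np1, np2 = n11 + n21, n12 + n22    # column totals
          npp = n1p + n2p                    # total number of bigrams
          m11 = n1p * np1 / npp              # expected count of (w1, w2) under independence
          return {
              "pmi":  math.log2(n11 / m11),                    # pointwise mutual information
              "dice": 2 * n11 / (n1p + np1),                   # Dice coefficient
              "odds": (n11 * n22) / (n12 * n21),               # odds ratio
              "phi":  (n11 * n22 - n12 * n21) / math.sqrt(n1p * n2p * np1 * np2),
          }

      # e.g. "fine wine" observed 30 times; "fine" is the first word of 200 bigrams,
      # "wine" the second word of 300 bigrams, out of 1,000,000 bigrams in all
      print(measures(n11=30, n12=170, n21=270, n22=999530))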

  48. NSP • Will explore NSP during practical session • Integrated into SenseClusters, may also be used in stand-alone mode • Can be installed easily on a Linux/Unix system from CD or download from • http://www.d.umn.edu/~tpederse/nsp.html • I’m told it can also be installed on Windows (via cygwin or ActivePerl), but I have no personal experience of this… EuroLAN-2005 Summer School

  49. Summary • Identify lexical features based on frequency counts or measures of association – either in the data to be clustered or in a separate set of feature selection data • Language independent • Unigrams usually only selected by frequency • Remember, no labeled data from which to learn, so somewhat less effective as features than in supervised case • Bigrams and co-occurrences can also be selected by frequency, or better yet measures of association • Bigrams and co-occurrences need not be consecutive • Stop words should be eliminated • Frequency thresholds are helpful (e.g., unigram/bigram that occurs once may be too rare to be useful) EuroLAN-2005 Summer School

  50. Related Work • Moore, 2004 (EMNLP) follow-up to Dunning and Pedersen on log-likelihood and exact tests http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf • Pedersen, 1996 (SCSUG) explanation of exact tests, and comparison to log-likelihood http://arxiv.org/abs/cmp-lg/9608010 (also see Pedersen, Kayaalp, and Bruce, AAAI-1996) • Dunning, 1993 (Computational Linguistics) introduces log-likelihood ratio for collocation identification http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf EuroLAN-2005 Summer School

More Related