  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 26 • 4/22/2013

  2. Recommended reading • http://en.wikipedia.org/wiki/Cluster_analysis • Martin Redington, Nick Chater, and Steven Finch. 1998. Distributional information: a powerful cue for acquiring syntactic categories. Cognitive Science, 22(4), 425-469. • Toben H. Mintz, Elissa L. Newport, and Thomas Bever. 2002. The distributional structure of grammatical categories in speech to young children. Cognitive Science, 26, 393-425. • Marie Labelle. 2005. The acquisition of grammatical categories: a state of the art. In Henri Cohen & Claire Lefebvre (eds.), Handbook of Categorization in Cognitive Science, Elsevier, 433-457. • E. Chan dissertation, Chapter 6

  3. Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories

  4. POS classes and language acquisition • POS categories such as Noun, Verb, Adjective, etc. • Two theories about their source: • Rationalist: categories are hard-wired in the brain • Counterevidence: categories are not universal across languages • Empiricist: categories are learned • Can be learned by distributional clustering algorithms • Accounts for variability in POS categories across languages

  5. The linking problem • Assume there is a “language of thought” • We have a predisposition to view the world in terms of actions, objects, properties, etc. • The linking problem • If POS (grammatical) categories are learned, they must be mapped onto the semantic representation in order to be used • If POS categories are innate, words in the external language must be mapped onto them, through experience • Combines rationalist and empiricist points of view

  6. Distributional POS induction • Without any knowledge of categories, the (initial) process for learning categories must be distributional, even under a nativist view • Intuition: neighboring words • the ___ of • to ____ • This is the “distributional learning” hypothesis

  7. Matrix of frequencies of words x contextual features
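
A minimal sketch of how such a word-by-context frequency matrix might be assembled; the toy corpus and the choice of only w-1 / w+1 context features are assumptions for illustration, not the lecture's actual setup.

```python
from collections import Counter

# Toy corpus; the real experiments use millions of words of child-directed speech.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Contextual features: the word immediately to the left (w-1) and right (w+1).
counts = Counter()
for i, w in enumerate(corpus):
    if i > 0:
        counts[(w, ("w-1", corpus[i - 1]))] += 1
    if i < len(corpus) - 1:
        counts[(w, ("w+1", corpus[i + 1]))] += 1

# Arrange the counts as a words-by-features frequency matrix.
words = sorted({w for w, _ in counts})
features = sorted({f for _, f in counts})
matrix = [[counts[(w, f)] for f in features] for w in words]

for w, row in zip(words, matrix):
    print(w, row)
```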

  8. What result are we trying to obtain? • POS induction: word classes or lexical categories? • “Word classes” • Discover categories according to distributional context • If the number of classes is high, they indicate fine-grained syntactic and/or semantic classes • Lexical categories • “Nouns”, “Verbs”, “Adjectives” • Set of categories used by linguists (for English and English-like languages)

  9. Can we use K-means clustering for POS induction? • Yes, if you believe that POS categories are innate: • Nouns, Verbs, Adjectives == 3 separate clusters (won’t work that well… see later sections) • No, if you believe that POS categories are derived from experience • K-means is not the proper algorithm, since the number of clusters is hard-coded • # of open-class POS categories varies across languages • Want an algorithm that produces a variable number of clusters • Agglomerative / hierarchical clustering
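
For contrast, a hedged sketch of the K-means alternative, where the number of clusters k must be fixed in advance; the scikit-learn call, the toy feature counts, and k = 3 are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are words, columns are contextual-feature counts (made-up numbers).
X = np.array([
    [5, 0, 2, 0],   # "dog"
    [4, 1, 3, 0],   # "cat"
    [0, 6, 0, 2],   # "run"
    [0, 5, 1, 3],   # "eat"
    [2, 0, 0, 4],   # "big"
    [1, 0, 1, 5],   # "red"
], dtype=float)

# k is hard-coded: the learner must already "know" there are 3 open-class categories.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```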

  10. Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories

  11. Agglomerative clustering(also called hierarchical clustering) • Pre-compute similarity matrix between every pair of points being clustered. • Algorithm: • Each data point begins in its own cluster • Successively merge least distant / most similar clusters together • Using some definition of distance between clusters • Produces a dendrogram describing clustering history • Similarity between merged clusters decreases through iterations of clustering • Obtain a discrete set of clusters by choosing a cutoff level for similarity • Number of clusters is not fixed in advance (unlike k-means)
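
A rough from-scratch sketch of the procedure just described, assuming Euclidean distance between points, average-link distance between clusters, and a distance cutoff as the stopping criterion (all illustrative choices):

```python
import itertools
import math

def euclidean(a, b):
    return math.dist(a, b)

def average_link(c1, c2):
    # Cluster distance = mean of all pairwise point distances between the two clusters.
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerative(points, stop_distance):
    # Each data point begins in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that are least distant / most similar.
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]))
        d = average_link(clusters[i], clusters[j])
        # Stop merging once the best remaining pair is too dissimilar
        # (the distance analogue of the similarity cutoff on the slide).
        if d > stop_distance:
            break
        merges.append((clusters[i], clusters[j], d))   # clustering history (dendrogram)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
final_clusters, history = agglomerative(points, stop_distance=3.0)
print(final_clusters)
```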

  12. Agglomerative clustering • Initially, each item in its own cluster • At each iteration, merge 2 most similar (least distant) clusters

  13. Agglomerative clustering

  14. Agglomerative clustering

  15. Agglomerative clustering

  16. Agglomerative clustering

  17. Agglomerative clustering

  18. Agglomerative clustering: dendrogram

  19. Another example of a dendrogram
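
A small sketch of how a dendrogram like these can be drawn; SciPy's linkage/dendrogram functions and the toy word vectors are assumptions for illustration, not what the cited work used.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy word vectors (rows) over contextual-feature counts (columns).
X = np.array([[5, 0, 2], [4, 1, 3], [0, 6, 0], [0, 5, 1], [2, 0, 4]], dtype=float)
labels = ["dog", "cat", "run", "eat", "big"]

# method="average" = average-link clustering; Z records the merge history.
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z, labels=labels)
plt.show()
```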

  20. Quantifying the distance between 2 clusters • Single-link clustering: • The distance between 2 clusters is the shortest distance between any 2 points in the two clusters • Complete-link clustering: • The distance between 2 clusters is the longest distance between any 2 points in the two clusters • Average-link clustering: • The distance between 2 clusters is the average distance between all pairs of points in the two clusters • Produces different results

  21. Single-link clustering • Distance between 2 clusters is the shortest distance between any 2 points in each cluster http://www.solver.com/xlminer/help/HClst/HClst_intro.html

  22. Complete-link clustering • Distance between 2 clusters is the longest distance between any 2 points in each cluster

  23. Average-link clustering • Distance between 2 clusters is the average of the distances between all pairs of points in each cluster
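
A short sketch comparing the three cluster-distance definitions on two made-up clusters; the points are purely illustrative.

```python
import math
from itertools import product

def pairwise(c1, c2):
    # All point-to-point distances between the two clusters.
    return [math.dist(a, b) for a, b in product(c1, c2)]

cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]

dists = pairwise(cluster_a, cluster_b)
print("single-link:  ", min(dists))               # shortest pairwise distance
print("complete-link:", max(dists))               # longest pairwise distance
print("average-link: ", sum(dists) / len(dists))  # mean pairwise distance
```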

  24. Produce a discrete set of clusters • A dendrogram shows the clustering process, where the end result is a single cluster containing all the data points • To produce a discrete set of clusters, we need to pick a cutoff value for the similarity • 2 ways to use a threshold value for similarity: • We could grow the entire dendrogram and then “prune” it to produce a discrete set of clusters • Or we could stop merging clusters once they reach a certain level of similarity

  25. # of clusters for different similarity thresholds • Sim = 5%: 1 cluster • Sim = 20%: 4 clusters • Sim = 50%: 6 clusters • Sim = 80%: 8 clusters
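
One way to see this effect in code, cutting the same clustering tree at several thresholds; note that SciPy's fcluster cuts on a distance threshold rather than a similarity percentage, and the random toy data is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(20, 4)   # 20 toy "words", 4 context features
Z = linkage(X, method="average")

# Looser vs. tighter distance thresholds change the number of discrete clusters.
for t in (1.2, 0.8, 0.5, 0.3):
    labels = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}: {labels.max()} clusters")
```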

  26. Model selection in k-means and agglomerative clustering • K-means: number of clusters is determined by choice of the constant k • Agglomerative: number of clusters not explicitly stated • However, the # of clusters is indirectly determined, through the choice of threshold for similarity (learning bias) • No “magic formula” for the best value for similarity threshold

  27. Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories

  28. NLP research in psycholinguistics • Psycholinguistics • Typically involves experiments on human subjects • But there is also some research on algorithmic models that are tested on corpora • Use the CHILDES corpus • Child Language Data Exchange System • http://childes.psy.cmu.edu/ • Transcripts of adult-child conversations, for many languages

  29. Examples of parent-child conversation

  30. Examples of parent-child conversation

  31. Redington, Chater, & Finch (1998) • Applied to English CHILDES corpus: • Child-directed speech, 2.5 million words • Words that are being clustered: • 1,000 most-freq words in corpus • Contextual features: • w-2, w-1, w+1, w+2 for the 150 most-freq words in corpus • Similarity function: • Rank correlation, rescaled from [-1, 1] to [0, 1] • Algorithm: • Average-link agglomerative clustering
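
A hedged sketch of the similarity function described on this slide: rank correlation between two words' context-count vectors, rescaled from [-1, 1] to [0, 1]. SciPy's spearmanr and the toy vectors are my substitutions for illustration, not necessarily what Redington, Chater, & Finch implemented.

```python
import numpy as np
from scipy.stats import spearmanr

# Context-count vectors for two words: counts of the 150 most frequent words
# appearing at positions w-2, w-1, w+1, w+2 (toy vectors here, much shorter).
vec_dog = np.array([12, 0, 3, 7, 0, 1])
vec_cat = np.array([10, 1, 4, 5, 0, 2])

rho, _ = spearmanr(vec_dog, vec_cat)   # rank correlation in [-1, 1]
similarity = (rho + 1) / 2             # rescale to [0, 1], as on the slide
print(similarity)
```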

  32. Redington, Chater, & Finch (1998)

  33. Redington, Chater, & Finch (1998)

  34. Redington, Chater, & Finch (1998)

  35. Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories

  36. Baker 2005

  37. Lexical category induction • “Lexical” = has meaningful content, does not perform a grammatical function, is open-class • Nouns, Verbs, Adjectives • Assumed in traditional grammars and many linguistic theories • (Ignore typological problems…) • What kind of learning procedure could acquire these categories from data? • How could a child acquire these categories?

  38. Standard distributional clustering doesn’t exactly find lexical categories • Next slides: pick similarity threshold for producing discrete set of clusters • Based on Redington et al. (1998) • No threshold produces clusters in exact one-to-one correspondence with Nouns, Adjectives, and Verbs

  39. One category

  40. Two categories

  41. All Verbs in one cluster; Nouns and Adjectives conflated

  42. Three Verb clusters; Nouns and Adjectives separate

  43. Mintz, Newport, & Bever (2002) • Similar to Redington et al. • Next slide: horizontal lines show low similarity thresholds and resulting clusters • Same problems as Redington et al. (1998)

  44. At a high similarity threshold, Nouns are grouped with Adjectives • At a lower similarity threshold, we have 4 clusters, but Nouns are still grouped with Adjectives • If we want Nouns and Adjectives to be separate, there would be 6 clusters

  45. Distributional theory of POS categories doesn’t work • Derived from experience • Form classes: • Define word class by context • Examples: • class 1: the ___ of • class 2: to ___ • Firth 1957: “You shall know a word by the company it keeps” • Doesn’t work for finding the lexical categories: • Nouns not separated from Adjectives, unless there are too many clusters • No one-to-one correspondence between cluster and open-class category

  46. Interpret as a procedure that a child is using • If a child is using distributional context to learn POS categories, • Then, based on experimental results on corpora, • The theory does not predict an exact correspondence between induced categories and (psycho)linguists’ standard lexical categories of “Noun”, “Verb”, and “Adjective” • Still an open problem

  47. Some limitations of distributional context • Contextual feature does not always determine the class of a word • the ___ (noun, adjective, adverb) • Contextual feature does not predict an entire class of words • a ___ (noun beginning with consonant) • an ___ (noun beginning with vowel) • Masculine / Feminine words, vs. nouns in general

  48. Some limitations of distributional context • Example: not able to represent the generalization “Adjectives appear to the left of Nouns” • Define “adjective” as the words in the “adjective” cluster • Clusters are defined by the presence of specific words to the left/right, rather than the presence of a particular category • Example: “Adjectives appear to the left of “cat”” • First-order distributional context is linguistically inadequate • These features limit what any clustering algorithm can do

  49. Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories
