
Distributional Clustering of English Words


Presentation Transcript


  1. Distributional Clustering of English Words. Fernando Pereira - AT&T Bell Laboratories; Naftali Tishby - Dept. of Computer Science, Hebrew University; Lillian Lee - Dept. of Computer Science, Cornell University. Presenter - Juan Ramos, Dept. of Computer Science, Rutgers University, juramos@cs.rutgers.edu

  2. Overview • Purpose: evaluate a method for clustering words according to their distributions in particular syntactic contexts. • Methodology: find the lowest-distortion set of word clusters, which determines a model of word co-occurrence.

  3. Applications • Scientific POV: lexical acquisition of words • Practical POV: classification addresses data sparseness in grammar models • Clusters are derived from a large corpus of documents

  4. Definitions • Context: the function of a given word in its sentence. • E.g.: a noun serving as a direct object • Sense class: hidden model describing word association tendencies • A mixture of clusters and cluster probabilities given a word • Cluster: probabilistic formulation of a sense class

  5. Problem Setting • Restrict the problem to verbs (V) and nouns (N) in the main verb-direct object relationship • f(v, n) = frequency of occurrence of the verb-noun pair (v, n) • Text must be pre-formatted to fit these specifications • For a given noun n, the conditional verb distribution is p_n(v) = f(v, n) / sum(v', f(v', n))
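As an illustration of the last bullet, a minimal Python sketch (not part of the original slides) that builds the conditional distributions p_n(v) from raw verb-noun pair counts; the toy corpus is invented:

    from collections import Counter, defaultdict

    def noun_conditional_distributions(pairs):
        """For (verb, noun) pairs, return p_n(v) = f(v, n) / sum(v', f(v', n))
        for each noun n, as a dict: noun -> {verb: probability}."""
        counts = defaultdict(Counter)  # noun -> verb frequency table f(v, n)
        for v, n in pairs:
            counts[n][v] += 1
        return {n: {v: f / sum(vc.values()) for v, f in vc.items()}
                for n, vc in counts.items()}

    # Invented toy corpus of verb-direct-object pairs
    pairs = [("fire", "gun"), ("fire", "missile"), ("fire", "worker"),
             ("hire", "worker")]
    print(noun_conditional_distributions(pairs)["worker"])
    # {'fire': 0.5, 'hire': 0.5}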

  6. Problem Setting cont. • Goal: create a set C of clusters and membership probabilities p(c|n). • Each c in C is associated with a cluster centroid p_c, a distribution over V • p_c(v) is the membership-weighted average of p_n(v) over the nouns n

  7. Distributional Similarity • Given two distributions p, q, the KL divergence is D(p || q) = sum(x, p(x) log(p(x)/q(x))) • D(p || q) = 0 if and only if p = q • Small D(p_n || p_c) implies the two distributions are likely instances of the centroid p_c • D(p_n || p_c) measures the information lost by using p_c in place of p_n
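A direct transcription of this definition into code, assuming distributions are represented as dicts (a sketch, not from the slides):

    import math

    def kl_divergence(p, q):
        """D(p || q) = sum(x, p(x) * log(p(x) / q(x))).
        Infinite when p puts mass where q does not."""
        total = 0.0
        for x, px in p.items():
            if px > 0.0:
                qx = q.get(x, 0.0)
                if qx == 0.0:
                    return math.inf  # absolute continuity violated
                total += px * math.log(px / qx)
        return total

    p = {"fire": 0.8, "load": 0.2}
    q = {"fire": 0.5, "load": 0.5}
    print(kl_divergence(p, p))  # 0.0: D(p || q) = 0 iff p = q
    print(kl_divergence(p, q))  # positive for distinct distributions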

  8. Theoretical Foundation • Given unstructured sets V and N, and training data X of independent verb-noun pairs • Problem: learn the joint distribution of the pairs given X • Not quite unsupervised, not quite supervised • There is no internal structure in the pairs themselves • The underlying distribution must be learned

  9. Distributional Clustering • Approximately decompose p(v|n) into p'(v|n) = sum(c in C, p(c|n) * p(v|c)) • p(c|n) = membership probability of n in c • p(v|c) = probability of v given the centroid for c • Assuming p(n) and p'(n) coincide, p'(n, v) = sum(c in C, p(c) * p(n|c) * p(v|c))
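A sketch of how the decomposition could be evaluated, with memberships and centroids stored as nested dicts (the representation and names are illustrative choices, not from the paper):

    def model_prob(v, n, membership, centroids):
        """p'(v|n) = sum(c in C, p(c|n) * p(v|c))."""
        return sum(p_cn * centroids[c].get(v, 0.0)
                   for c, p_cn in membership[n].items())

    # Two-cluster toy example with invented numbers
    membership = {"gun": {0: 0.9, 1: 0.1}}
    centroids = {0: {"fire": 0.7, "load": 0.3}, 1: {"hire": 1.0}}
    print(model_prob("fire", "gun", membership, centroids))  # 0.63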

  10. Maximum Likelihood Cluster Centroids • Maximize the goodness of fit between the data and p'(n, v) • For a sequence of pairs S, the model log-probability of S is l(S) = sum((n, v) in S, log p'(n, v)) • Maximize with respect to p(n|c) and p(v|c) • The variation of l(S) with respect to these distributions vanishes at the maximum [equation shown on slide]
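A self-contained sketch of this computation, under the same dict representation as the earlier sketches (an illustrative choice, not from the paper):

    import math

    def log_likelihood(S, membership, centroids):
        """l(S) = sum over pairs (v, n) in S of log p'(v|n), where
        p'(v|n) = sum(c in C, p(c|n) * p(v|c)); this differs from the
        joint form log p'(n, v) only by the constant log p(n) terms."""
        total = 0.0
        for v, n in S:
            p_hat = sum(p_cn * centroids[c].get(v, 0.0)
                        for c, p_cn in membership[n].items())
            total += math.log(p_hat)  # requires p_hat > 0 for observed pairs
        return total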

  11. Maximum Entropy Cluster Membership • Assume independence between the variations of p(n|c) and p(v|c) • The Bayes inverses p(n|c) can be found given p(v|c) and p(v|n) • The p(v|c) that maximize l(S) also minimize the average distortion between the cluster model and the data

  12. Entropy Cluster Membership cont. • Average cluster distortion: <D> = sum(n, sum(c, p(c|n) * D(p_n || p_c))) • Entropy: H = -sum(n, sum(c, p(c|n) * log p(c|n)))

  13. Entropy Cluster Membership cont. • Class and membership distributions take the Boltzmann form: p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n), and analogously with Z(c) for the class distribution • Z(c) and Z(n) are normalization sums • Substituting these forms reduces the log-likelihood to a function of the normalization sums [equation shown on slide] • At the maximum, the variation vanishes
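A sketch of the Boltzmann-form membership update; the kl argument can be the kl_divergence function from the earlier sketch (this factoring is an illustrative choice):

    import math

    def update_memberships(noun_dists, centroids, beta, kl):
        """p(c|n) = exp(-beta * D(p_n || p_c)) / Z(n), with Z(n) the
        normalization sum over clusters."""
        out = {}
        for n, p_n in noun_dists.items():
            w = {c: math.exp(-beta * kl(p_n, p_c))
                 for c, p_c in centroids.items()}
            Z_n = sum(w.values())
            out[n] = {c: wc / Z_n for c, wc in w.items()}
        return out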

  14. KL Distortion • Minimize the KL distortion by setting the variation of the KL distances to zero • The result: each centroid is the membership-weighted average of the noun distributions, p_c(v) = sum(n, p(n|c) * p_n(v))
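A sketch of the resulting centroid update; obtaining p(n|c) from p(c|n) by Bayes' rule with a uniform p(n) is an assumption of this sketch:

    def update_centroids(noun_dists, membership):
        """p_c(v) = sum(n, p(n|c) * p_n(v)): each centroid is the
        membership-weighted average of the noun distributions.
        p(n|c) is obtained from p(c|n) by Bayes' rule with uniform p(n)."""
        clusters = next(iter(membership.values())).keys()
        centroids = {}
        for c in clusters:
            weights = {n: membership[n][c] for n in noun_dists}  # proportional to p(n|c)
            Z_c = sum(weights.values())
            p_c = {}
            for n, w in weights.items():
                for v, pv in noun_dists[n].items():
                    p_c[v] = p_c.get(v, 0.0) + (w / Z_c) * pv
            centroids[c] = p_c
        return centroids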

  15. Free Energy Function • Combined minimum distortion and maximum entropy is equivalent to minimizing the free energy F = <D> - H/beta • F determines <D> and H through its partial derivatives: <D> = d(beta * F)/d(beta), H = beta^2 * dF/d(beta) • The minimum of F balances the disordering effect of entropy maximization against the ordering effect of distortion minimization
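A sketch that evaluates F = <D> - H/beta directly from the Boltzmann memberships; weighting nouns uniformly when forming the joint p(n, c) is an assumption of this sketch:

    import math

    def free_energy(noun_dists, centroids, beta, kl):
        """F = <D> - H/beta, where <D> is the average distortion and H the
        entropy of the joint membership distribution p(n, c)."""
        N = len(noun_dists)
        avg_D, H = 0.0, 0.0
        for p_n in noun_dists.values():
            d = {c: kl(p_n, p_c) for c, p_c in centroids.items()}
            Z_n = sum(math.exp(-beta * dc) for dc in d.values())
            for c, dc in d.items():
                p_joint = math.exp(-beta * dc) / Z_n / N  # p(n, c), uniform p(n)
                if p_joint > 0.0:
                    avg_D += p_joint * dc
                    H -= p_joint * math.log(p_joint)
        return avg_D - H / beta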

  16. Hierarchical Clustering • The number of clusters is determined through a sequence of increases in beta • Higher beta implies more local influence of each noun on the definition of centroids • Start with a low beta and a single c in C • Search for the lowest beta that splits c into two or more leaf clusters • Repeat until |C| reaches the desired size
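A compact, self-contained sketch of this annealing search. The schedule, perturbation size, number of update iterations, and split test are all illustrative choices, and the search for the lowest splitting beta is replaced here by a fixed geometric schedule:

    import math, random

    def kl(p, q):
        return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

    def memberships(noun_dists, centroids, beta):
        # Boltzmann memberships p(c|n); no underflow guard, so a real
        # implementation would shift by the minimum distortion first.
        out = {}
        for n, p_n in noun_dists.items():
            w = {c: math.exp(-beta * kl(p_n, p_c)) for c, p_c in centroids.items()}
            Z = sum(w.values())
            out[n] = {c: wc / Z for c, wc in w.items()}
        return out

    def centroid(noun_dists, weights):
        """Membership-weighted average of the noun distributions."""
        Z = sum(weights.values())
        p_c = {}
        for n, w in weights.items():
            for v, pv in noun_dists[n].items():
                p_c[v] = p_c.get(v, 0.0) + (w / Z) * pv
        return p_c

    def perturb(p, eps=0.01):
        # Slightly jitter a distribution and renormalize.
        q = {v: pv * (1 + random.uniform(-eps, eps)) for v, pv in p.items()}
        Z = sum(q.values())
        return {v: pv / Z for v, pv in q.items()}

    def hierarchical_clustering(noun_dists, target, beta=1.0, rate=1.5, iters=25):
        """Raise beta geometrically; at each stage, twin every centroid with
        a perturbed copy, re-run the membership/centroid updates, and keep
        twins only if they drift apart (i.e., the cluster split)."""
        centroids = {0: centroid(noun_dists, {n: 1.0 for n in noun_dists})}
        while len(centroids) < target:
            beta *= rate
            twins = [perturb(p) for p in centroids.values() for _ in range(2)]
            centroids = dict(enumerate(twins))
            for _ in range(iters):
                m = memberships(noun_dists, centroids, beta)
                centroids = {c: centroid(noun_dists, {n: m[n][c] for n in m})
                             for c in centroids}
            # Drop twins that collapsed back onto an existing centroid.
            kept = []
            for p_c in centroids.values():
                if all(kl(p_c, q) > 1e-4 for q in kept):
                    kept.append(p_c)
            centroids = dict(enumerate(kept))
        return centroids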

  17. Experimental Results • Classify the 64 nouns appearing as direct objects of the verb 'fire' in Associated Press documents from 1988, where |V| = 2147 • For the first splits, the four words most similar to each cluster centroid and their KL distances are shown • Split 1: a cluster of 'fire' as discharging weapons vs. a cluster of 'fire' as releasing employees • Split 2: weapons as projectiles vs. weapons as guns

  18. Clustering on Verb ‘fire’

  19. Evaluation

  20. Evaluation cont.

  21. Conclusions • Clustering is efficient, informative, and yields good predictions • Future work: • Make the clustering method more rigorous • Introduce human judgment, i.e., a more supervised approach • Extend the model to other word relationships

  22. References

  23. References cont.

  24. More References
