I256: Applied Natural Language Processing

Presentation Transcript

  1. I256: Applied Natural Language Processing Marti Hearst Nov 6, 2006

  2. Today • Text Clustering • Latent Semantic Analysis (LSA)

  3. Text Clustering • Finds overall similarities among groups of documents • Finds overall similarities among groups of tokens • Picks out some themes, ignores others

  4. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [scatter plot of documents against two axes, Term 1 and Term 2]

  5. Text Clustering Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [the same scatter plot, with groups of documents picked out]

  6. Clustering Applications • Find semantically related words by combining similarity evidence from multiple indicators • Try to find overall trends or patterns in text collections Slide by Vasileios Hatzivassiloglou

  7. “Training” in Clustering • Clustering is an unsupervised learning method • For each data set, a totally fresh solution is constructed • Therefore, there is no training • However, we often use some data for which we have additional information on how it should be partitioned to evaluate the performance of the clustering method Slide by Vasileios Hatzivassiloglou

  8. Pair-wise Document Similarity [term-document count matrix: documents A–D in rows; terms nova, galaxy, heat, h’wood, film, role, diet, fur in columns; cell values 1 3 1 5 2 2 1 5 4 1, alignment lost in transcription] How to compute document similarity?

  9. Pair-wise Document Similarity (no normalization, for simplicity) [the same term-document matrix for documents A–D]

  10. Pair-wise Document Similarity (cosine normalization)

  11. Document/Document Matrix
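
The dot-product and cosine computations behind slides 8–11 can be sketched in plain Python. The term counts below are illustrative placeholders (the original matrix cells did not survive transcription); only the computation itself follows the slides.

```python
import math

# Hypothetical term-document count vectors over the terms
# nova, galaxy, heat, h'wood, film, role, diet, fur.
docs = {
    "A": [1, 3, 1, 0, 0, 0, 0, 0],
    "B": [5, 2, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 1, 5, 0, 0],
    "D": [0, 0, 0, 0, 0, 0, 4, 1],
}

def dot(u, v):
    # Unnormalized similarity: raw inner product of term counts.
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine normalization: divide by the two vector lengths, so
    # long documents are not automatically "more similar".
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

for x, y in [("A", "B"), ("A", "C"), ("C", "D")]:
    print(x, y, round(cosine(docs[x], docs[y]), 3))
```

Filling a document/document matrix (slide 11) is then just this loop over every pair.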

  12. Hierarchical clustering methods • Agglomerative or bottom-up: • Start with each sample in its own cluster • Merge the two closest clusters • Repeat until one cluster is left • Divisive or top-down: • Start with all elements in one cluster • Partition one of the current clusters in two • Repeat until all samples are in singleton clusters Slide by Vasileios Hatzivassiloglou

  13. Agglomerative Clustering [dendrogram forming over samples A B C D E F G H I]

  14. Agglomerative Clustering [the same dendrogram, next merge step]

  15. Agglomerative Clustering [the same dendrogram, further merges]

  16. Merging Nodes • Each node is a combination of the documents combined below it • We represent the merged nodes as a vector of term weights • This vector is referred to as the cluster centroid Slide by Vasileios Hatzivassiloglou

  17. Merging criteria • We need to extend the distance measure from samples to sets of samples • The complete linkage method • The single linkage method • The average linkage method Slide by Vasileios Hatzivassiloglou

  18. Single-Link Merging Criteria • Initially, each word type is a single-point cluster • Merge the closest pair of clusters • Single-link: clusters are close if any of their points are close: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B

  19. Bottom-Up Clustering – Single-Link • Fast, but tends to produce long, stringy, meandering clusters

  20. Bottom-Up Clustering – Complete-Link • Again, merge the closest pair of clusters • Complete-link: clusters are close only if all of their points are close: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B

  21. Bottom-Up Clustering – Complete-Link • Slow to find the closest pair – requires quadratically many distance computations
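
The bottom-up procedure of slides 12 and 18–21 can be sketched in a few lines of Python. The 1-D points and the target cluster count below are hypothetical, chosen only so the merges are easy to follow by hand.

```python
def agglomerative(points, k, linkage="single"):
    """Bottom-up clustering of 1-D points until k clusters remain.
    linkage='single' uses the min pairwise distance between clusters
    (any points close); 'complete' uses the max (all points close)."""
    clusters = [[p] for p in points]            # each sample starts alone
    agg = min if linkage == "single" else max
    while len(clusters) > k:
        best = None
        # Find the closest pair of clusters under the chosen linkage --
        # this pairwise scan is the quadratic cost slide 21 mentions.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 3, 10, 11, 12, 30], k=3))
# → [[1, 2, 3], [10, 11, 12], [30]]
```

With real documents the points would be term vectors and `abs(a - b)` a cosine or Euclidean distance.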

  22. K-Means Clustering • 1. Decide on a pair-wise similarity measure • 2. Find K centers using agglomerative clustering: take a small sample, group bottom-up until K groups are found • 3. Assign each document to the nearest center, forming new clusters • 4. Repeat step 3 as necessary
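
Steps 3–4 of the slide can be sketched as an assign/recompute loop. For brevity the points are hypothetical 1-D values and the initial centers are given by hand, standing in for the agglomerative seeding of step 2.

```python
def kmeans(points, centers, iters=10):
    """Step 3: assign each point to its nearest center, forming clusters.
    Step 4: recompute each center as its cluster mean, and repeat."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster empties out.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
print(centers)   # → [2.0, 11.0]
```

Note how the point 3 starts out assigned to the second center (5.0) and migrates to the first cluster once the centers move.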

  23. k-Medians • Similar to k-means, but instead of calculating the mean across features, it selects as ci the sample in cluster Ci that minimizes the total distance to the other samples in Ci (the median) • Advantages: • Does not require feature vectors • Distance between samples is always available • Statistics with medians are more robust than statistics with means Slide by Vasileios Hatzivassiloglou

  24. Choosing k • In both hierarchical and k-means/medians, we need to be told where to stop, i.e., how many clusters to form • This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram) • It would be nice if we could find an optimal k from the data • We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters. Slide by Vasileios Hatzivassiloglou
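
Trying different values of k, as the slide suggests, needs a score for how good a clustering is. A minimal sketch is the average within-cluster distance to the cluster mean (a hypothetical stand-in for more principled criteria such as silhouette scores; note that this raw score always improves as k grows, so in practice it is penalized or combined with between-cluster separation).

```python
def within_cluster_spread(clusters):
    """Average absolute distance of 1-D samples to their cluster mean;
    lower means tighter clusters."""
    total, n = 0.0, 0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum(abs(x - m) for x in c)
        n += len(c)
    return total / n

# Compare two candidate clusterings of the same (made-up) data:
k2 = [[1, 2, 3], [10, 11, 12, 30]]       # k = 2
k3 = [[1, 2, 3], [10, 11, 12], [30]]     # k = 3
print(within_cluster_spread(k2), within_cluster_spread(k3))
```

Here k = 3 scores much better because the outlier 30 no longer drags a cluster mean away from its members.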

  25. Scatter/Gather: Clustering a Large Text Collection Cutting, Pedersen, Tukey & Karger 92, 93 Hearst & Pedersen 95 • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”

  26. S/G Example: query on “star” over encyclopedia text • 14 sports • 8 symbols • 47 film, tv • 68 film, tv (p) • 7 music • 97 astrophysics • 67 astronomy (p) • 12 stellar phenomena • 10 flora/fauna • 49 galaxies, stars • 29 constellations • 7 miscellaneous • Clustering and re-clustering is entirely automated

  27. Clustering Retrieval Results • Tends to place similar docs together • So can be used as a step in relevance ranking • But not great for showing to users • Exception: good for showing what to throw out!

  28. Another use of clustering • Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. • “Project” these onto a 2D graphical representation • Looks neat, but doesn’t work well as an information retrieval interface.

  29. Clustering Multi-Dimensional Document Space(image from Wise et al 95)

  30. How to evaluate clusters? • In practice, it’s hard to do • Different algorithms’ results look good and bad in different ways • It’s difficult to distinguish their outcomes • In theory, define an evaluation function • Typically choose something easy to measure (e.g., the sum of the average distance in each class)

  31. Two Types of Document Clustering • Grouping together of “similar” objects • Hard Clustering -- Each object belongs to a single cluster • Soft Clustering -- Each object is probabilistically assigned to clusters Slide by Inderjit S. Dhillon

  32. Soft clustering • A variation of many clustering methods • Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters • So, a sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6 Slide by Vasileios Hatzivassiloglou
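
The hard-versus-soft distinction of slides 31–32 can be sketched with a soft assignment rule. The softmax-over-negative-distances rule below is a hypothetical simplification (a stand-in for, e.g., mixture-model posterior probabilities), but it shows the key property: every sample gets a probability for every cluster.

```python
import math

def soft_assign(x, centers, beta=1.0):
    """Return membership probabilities of 1-D sample x for each cluster
    center: closer centers get higher probability, and the
    probabilities sum to 1. beta controls how 'hard' the split is."""
    weights = [math.exp(-beta * abs(x - c)) for c in centers]
    z = sum(weights)
    return [w / z for w in weights]

probs = soft_assign(4.0, centers=[3.0, 6.0])
print(probs)  # higher probability for the nearer cluster at 3.0
```

A hard clustering would instead report only `argmax(probs)`, discarding the 0.4/0.6-style gradations the slide describes.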

  33. Application: Clustering of adjectives • Cluster adjectives based on the nouns they modify • Multiple syntactic clues for modification • The similarity measure is Kendall’s τ, a robust measure of similarity • Clustering is done via a hill-climbing method that minimizes the combined average dissimilarity Predicting the semantic orientation of adjectives, V Hatzivassiloglou, KR McKeown, EACL 1997 Slide by Vasileios Hatzivassiloglou

  34. Clustering of nouns • Work by Pereira, Tishby, and Lee • Dissimilarity is KL divergence • Asymmetric relationship: nouns are clustered, verbs which have the nouns as objects serve as indicators • Soft, hierarchical clustering Slide by Vasileios Hatzivassiloglou
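
The KL divergence used as the dissimilarity in Pereira, Tishby, and Lee can be computed directly from the conditional verb distributions of two nouns. The distributions below are hypothetical illustrations, not data from the paper.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete distributions.
    Asymmetric -- which matches the asymmetric setup on the slide,
    where nouns are clustered and verbs serve as indicators."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over three indicator verbs for two nouns:
wine = [0.7, 0.2, 0.1]
beer = [0.6, 0.3, 0.1]
print(kl(wine, beer), kl(beer, wine))  # the two directions differ
```

In the actual model the divergence is taken between a noun's distribution and a cluster centroid's distribution, and membership is soft.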

  35. Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

  36. Distributional Clustering of English Words - Pereira, Tishby and Lee, ACL 93

  37. Latent Semantic Analysis • Mathematical/statistical technique for extracting and representing the similarity of meaning of words • Represents word and passage meaning as high-dimensional vectors in the semantic space • Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage meaning • Its success depends on: • Sufficient scale and sampling of the data it is given • Choosing the right number of dimensions to extract Slide by Kostas Kleisouris

  38. LSA Characteristics • Why is reducing dimensionality beneficial? • Some words with similar occurrence patterns are projected onto the same dimension • Closely mimics human judgments of meaning similarity Slide by Schone, Jurafsky, and Stenchikova

  39. Sample Applications of LSA • Essay Grading • LSA is trained on a large sample of text from the same domain as the topic of the essay • Each essay is compared to a large set of essays scored by experts and a subset of the most similar identified by LSA • The target essay is assigned a score consisting of a weighted combination of the scores for the comparison essays Slide by Kostas Kleisouris

  40. Sample Applications of LSA • Prediction of differences in comprehensibility of texts • By using conceptual similarity measures between successive sentences • LSA has predicted comprehension test results with students • Evaluate and give advice to students as they write and revise summaries of texts they have read • Assess psychiatric status • By representing the semantic content of answers to psychiatric interview questions Slide by Kostas Kleisouris

  41. Sample Applications of LSA • Improving Information Retrieval • Use LSA to match users’ queries with documents that have the desired conceptual meaning • Not used in practice – doesn’t help much when you have large corpora to match against, but maybe helpful for a few difficult queries and for term expansion Slide by Kostas Kleisouris

  42. LSA intuitions • Implements the idea that the meaning of a passage is the sum of the meanings of its words: meaning of word1 + meaning of word2 + … + meaning of wordn = meaning of passage • This “bag of words” view treats a passage as an unordered set of word tokens whose meanings are additive • By creating an equation of this kind for every passage of language that a learner observes, we get a large system of linear equations. Slide by Kostas Kleisouris

  43. LSA Intuitions • However • Too few equations to specify the values of the variables • Different values for the same variable (natural since meanings are vague or multiple) • Instead of finding absolute values for the meanings, they are represented in a richer form (vectors) • Use of SVD (reduces the linear system into multidimensional vectors) Slide by Kostas Kleisouris

  44. Latent Semantic Analysis • A trick from Information Retrieval • Each document in the corpus is a length-k vector, one dimension per vocabulary word (aardvark, abacus, abandoned, abbot, abduct, above, …, zygote, zymurgy) • Or each paragraph, or whatever • (0, 3, 3, 1, 0, 7, …, 1, 0) is a single document Slide by Jason Eisner

  45. Latent Semantic Analysis • A trick from Information Retrieval • Each document in the corpus is a length-k vector • Plot all documents in the corpus [true plot in k dimensions alongside a reduced-dimensionality plot] Slide by Jason Eisner

  46. Latent Semantic Analysis • The reduced plot is a perspective drawing of the true plot in k dimensions • It projects the true plot onto a few axes • There is a best choice of axes – the one that shows the most variation in the data • Found by linear algebra: “Singular Value Decomposition” (SVD) Slide by Jason Eisner

  47. Latent Semantic Analysis • The SVD plot allows the best possible reconstruction of the true plot (i.e., it can recover the 3-D coordinates with minimal distortion) • It ignores variation along the axes it didn’t pick out • The hope is that that variation is just noise, which we want to ignore [true plot in k dimensions, with words 1–3 grouped into themes A and B, alongside the reduced-dimensionality plot] Slide by Jason Eisner
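
The "best choice of axes" that SVD finds (slides 46–47) can be sketched without a linear-algebra library: power iteration on a tiny term-document matrix recovers the top singular direction, i.e., the single axis showing the most variation. The matrix below is a hypothetical toy with two "space" documents and two "film" documents.

```python
import math

def rank1_approx(A, iters=100):
    """Top singular triple (sigma, u, v) of matrix A via power iteration
    on A^T A: v converges to the direction of greatest variation."""
    m, n = len(A), len(A[0])
    v = [1.0] * n                                   # initial guess
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = A^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]                   # keep v unit-length
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))        # top singular value
    u = [x / sigma for x in u]
    return sigma, u, v

# Toy term-document matrix (rows: nova, galaxy, film, role;
# columns: two astronomy documents, then two movie documents).
A = [[2, 2, 0, 0],
     [3, 3, 0, 0],
     [0, 0, 4, 4],
     [0, 0, 1, 1]]
sigma, u, v = rank1_approx(A)
print(round(sigma, 3), [round(x, 3) for x in v])
```

The recovered axis `v` loads only on the two movie documents (the block with more variation); keeping the top few such axes and dropping the rest is exactly the dimensionality reduction LSA performs.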