**I256: Applied Natural Language Processing**
Marti Hearst, Nov 6, 2006

**Today**
- Text Clustering
- Latent Semantic Analysis (LSA)

**Text Clustering**
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others

**Text Clustering**
Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw
*(Figure: data points plotted against Term 1 and Term 2, falling into visible groups)*

**Clustering Applications**
- Find semantically related words by combining similarity evidence from multiple indicators
- Try to find overall trends or patterns in text collections

Slide by Vasileios Hatzivassiloglou

**"Training" in Clustering**
- Clustering is an unsupervised learning method
- For each data set, a totally fresh solution is constructed; therefore, there is no training
- However, we often use data for which we have additional information on how it should be partitioned, in order to evaluate the performance of the clustering method

Slide by Vasileios Hatzivassiloglou

**Pair-wise Document Similarity**

|   | nova | galaxy | heat | h’wood | film | role | diet | fur |
|---|------|--------|------|--------|------|------|------|-----|
| A | 1    | 3      | 1    |        |      |      |      |     |
| B | 5    | 2      |      |        |      |      |      |     |
| C |      |        |      | 2      | 1    | 5    |      |     |
| D |      |        |      |        | 4    | 1    |      |     |

How to compute document similarity?

**Pair-wise Document Similarity (no normalization, for simplicity)**
With raw term counts, similarity is the inner product sim(Dᵢ, Dⱼ) = Σₜ w(Dᵢ,t) · w(Dⱼ,t). For the table above: sim(A,B) = 1·5 + 3·2 = 11; sim(C,D) = 1·4 + 5·1 = 9; all other pairs share no terms, so their similarity is 0.

**Hierarchical clustering methods**
- Agglomerative (bottom-up):
  - Start with each sample in its own cluster
  - Merge the two closest clusters
  - Repeat until one cluster is left
- Divisive (top-down):
  - Start with all elements in one cluster
  - Partition one of the current clusters in two
  - Repeat until all samples are in singleton clusters

Slide by Vasileios Hatzivassiloglou

**Agglomerative Clustering**
*(Figures: three slides showing items A through I merged step by step into a dendrogram)*

**Merging Nodes**
- Each node is a combination of the documents combined below it
- We represent the merged nodes as a
  vector of term weights; this vector is referred to as the cluster centroid

Slide by Vasileios Hatzivassiloglou

**Merging criteria**
- We need to extend the distance measure from samples to sets of samples
- The complete-linkage method
- The single-linkage method
- The average-linkage method

Slide by Vasileios Hatzivassiloglou

**Bottom-Up Clustering – Single-Link**
- Start with each word type as a single-point cluster
- Merge the closest pair of clusters
- Single-link: clusters are close if *any* of their points are: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B
- Fast, but tends to produce long, stringy, meandering clusters

**Bottom-Up Clustering – Complete-Link**
- Again, merge the closest pair of clusters
- Complete-link: clusters are close only if *all* of their points are: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B
- Slow to find the closest pair – quadratically many distances must be computed

**K-Means Clustering**
1. Decide on a pair-wise similarity measure
2. Find K centers using agglomerative clustering
   - take a small sample
   - group bottom-up until K groups are found
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary

**k-Medians**
- Similar to k-means, but instead of computing the mean across features, it selects as cᵢ the sample in cluster Cᵢ that minimizes the total distance Σ d(x, cᵢ) over the samples x in Cᵢ (the median)
- Advantages:
  - Does not require feature vectors
  - The distance between samples is always available
  - Statistics with medians are more robust than statistics with means

Slide by Vasileios Hatzivassiloglou

**Choosing k**
- In both hierarchical and k-means/medians clustering, we need to be told where to stop, i.e., how many clusters to form
- This is partially alleviated by visual inspection of the hierarchical tree (the dendrogram)
- It would be nice if we could find an optimal k from the data
- We can do this by trying different values of k and seeing which produces the best separation among the resulting clusters
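The k-means loop above (assign each point to the nearest center, recompute the centers, repeat) can be sketched in a few lines of NumPy. This is a minimal sketch: the data is invented, and for brevity the initialization just takes the first k samples rather than seeding with agglomerative clustering as the slide suggests.

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Minimal k-means on the rows of X (one row per document)."""
    # Naive init: first k samples (the slide instead seeds with
    # agglomerative clustering on a small sample).
    centers = X[:k].copy()
    for _ in range(n_iter):
        # Step 3: assign each document to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as its cluster's centroid, repeat
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy term-weight vectors: two documents per theme (invented data)
X = np.array([[1.0, 3.0], [1.5, 2.8], [5.0, 0.5], [5.2, 0.4]])
labels, centers = kmeans(X, k=2)
```

Even from a poor initialization (both starting centers inside one group), the re-estimation loop separates the two themes within a couple of iterations on this data.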
Slide by Vasileios Hatzivassiloglou

**Scatter/Gather: Clustering a Large Text Collection**
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
- Cluster sets of documents into general "themes," like a table of contents
- Display the contents of the clusters by showing topical terms and typical titles
- The user chooses subsets of the clusters and re-clusters the documents within
- The resulting new groups have different "themes"

**S/G Example: query on "star"**
Encyclopedia text:
- 14 sports
- 8 symbols
- 47 film, tv
- 68 film, tv (p)
- 7 music
- 97 astrophysics
- 67 astronomy (p)
- 12 stellar phenomena
- 10 flora/fauna
- 49 galaxies, stars
- 29 constellations
- 7 miscellaneous

Clustering and re-clustering is entirely automated

**Clustering Retrieval Results**
- Tends to place similar docs together
- So it can be used as a step in relevance ranking
- But not great for showing to users
- Exception: good for showing what to throw out!

**Another use of clustering**
- Use clustering to map the entire huge multidimensional document space into a large number of small clusters
- "Project" these onto a 2D graphical representation
- Looks neat, but doesn't work well as an information retrieval interface

**Clustering Multi-Dimensional Document Space**
*(Image from Wise et al. 95)*

**How to evaluate clusters?**
- In practice, it's hard to do
  - Different algorithms' results look good and bad in different ways
  - It's difficult to distinguish their outcomes
- In theory, define an evaluation function
  - Typically choose something easy to measure (e.g., the sum of the average distance within each class)

**Two Types of Document Clustering**
- Grouping together of "similar" objects
- Hard clustering: each object belongs to a single cluster
- Soft clustering: each object is probabilistically assigned to clusters

Slide by Inderjit S.
Dhillon

**Soft clustering**
- A variation of many clustering methods
- Instead of assigning each data sample to one and only one cluster, it calculates probabilities of membership for all clusters
- So a sample might belong to cluster A with probability 0.4 and to cluster B with probability 0.6

Slide by Vasileios Hatzivassiloglou

**Application: Clustering of adjectives**
- Cluster adjectives based on the nouns they modify
- Multiple syntactic clues for modification
- The similarity measure is Kendall's τ, a robust measure of association
- Clustering is done via a hill-climbing method that minimizes the combined average dissimilarity

Predicting the semantic orientation of adjectives, V. Hatzivassiloglou and K. R. McKeown, EACL 1997
Slide by Vasileios Hatzivassiloglou

**Clustering of nouns**
- Work by Pereira, Tishby, and Lee
- Dissimilarity is KL divergence
- Asymmetric relationship: nouns are clustered; the verbs that take those nouns as objects serve as indicators
- Soft, hierarchical clustering

Slide by Vasileios Hatzivassiloglou

**Distributional Clustering of English Words** – Pereira, Tishby, and Lee, ACL 93
*(Figures: two slides of results from the paper)*

**Latent Semantic Analysis**
- A mathematical/statistical technique for extracting and representing the similarity of meaning of words
- Represents word and passage meaning as high-dimensional vectors in the semantic space
- Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage meaning
- Its success depends on:
  - Sufficient scale and sampling of the data it is given
  - Choosing the right number of dimensions to extract

Slide by Kostas Kleisouris

**LSA Characteristics**
- Why is reducing dimensionality beneficial?
  - Some words with similar occurrence patterns are projected onto the same dimension
- Closely mimics human judgments of meaning similarity

Slide by Schone, Jurafsky, and Stenchikova

**Sample Applications of LSA**
- Essay grading
  - LSA is trained on a large sample of text from the same domain as the topic of the essay
  - Each essay is compared to a large set of essays scored by experts, and a subset of the most similar is identified by LSA
  - The target essay is assigned a score consisting of a weighted combination of the scores of the comparison essays

Slide by Kostas Kleisouris

**Sample Applications of LSA**
- Predicting differences in the comprehensibility of texts
  - Using conceptual similarity measures between successive sentences, LSA has predicted comprehension test results with students
- Evaluating and giving advice to students as they write and revise summaries of texts they have read
- Assessing psychiatric status
  - By representing the semantic content of answers to psychiatric interview questions

Slide by Kostas Kleisouris

**Sample Applications of LSA**
- Improving information retrieval
  - Use LSA to match users' queries with documents that have the desired conceptual meaning
  - Not used in practice – it doesn't help much when you have large corpora to match against, but it may be helpful for a few difficult queries and for term expansion

Slide by Kostas Kleisouris

**LSA intuitions**
- Implements the idea that the meaning of a passage is the sum of the meanings of its words: meaning(word₁) + meaning(word₂) + … + meaning(wordₙ) = meaning(passage)
- This "bag of words" function treats a passage as an unordered set of word tokens whose meanings are additive
- By creating an equation of this kind for every passage of language that a learner observes, we get a large system of linear equations.
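The bag-of-words setup behind that system of equations can be made concrete: each passage contributes one "equation," recorded as a row of word counts over the vocabulary, with word order ignored. A minimal sketch (the passages here are invented for illustration):

```python
from collections import Counter

# Toy passages (invented); each will become one row / one "equation"
passages = [
    "stars shine in the galaxy",
    "the galaxy contains many stars",
    "film stars walk in hollywood",
]

# Alphabetized vocabulary over all passages
vocab = sorted({w for p in passages for w in p.split()})

# One row per passage: counts of each vocabulary word.
# This encodes meaning(w1) + meaning(w2) + ... = meaning(passage),
# since the row is the sum of one-hot vectors for the passage's tokens.
matrix = [[Counter(p.split())[w] for w in vocab] for p in passages]
```

It is this passage-by-word count matrix that LSA then factorizes with SVD, rather than solving the underdetermined system for absolute word meanings.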
Slide by Kostas Kleisouris

**LSA Intuitions**
- However:
  - There are too few equations to specify the values of the variables
  - The same variable gets different values (natural, since meanings are vague or multiple)
- Instead of finding absolute values for the meanings, they are represented in a richer form (vectors)
- SVD is used (it reduces the linear system to multidimensional vectors)

Slide by Kostas Kleisouris

**Latent Semantic Analysis**
- A trick from Information Retrieval
- Each document in the corpus is a length-k vector
  - Or each paragraph, or whatever
*(Figure: a single document shown as a sparse count vector (0, 3, 3, 1, 0, 7, …, 1, 0), with one entry per vocabulary word from aardvark through zymurgy)*

Slide by Jason Eisner

**Latent Semantic Analysis**
- A trick from Information Retrieval
- Each document in the corpus is a length-k vector
- Plot all documents in the corpus
*(Figure: the true plot in k dimensions beside a reduced-dimensionality plot)*

Slide by Jason Eisner

**Latent Semantic Analysis**
- The reduced plot is a perspective drawing of the true plot
- It projects the true plot onto a few axes
  - A best choice of axes – it shows the most variation in the data
  - Found by linear algebra: Singular Value Decomposition (SVD)

Slide by Jason Eisner

**Latent Semantic Analysis**
- The SVD plot allows the best possible reconstruction of the true plot (i.e., the true coordinates can be recovered with minimal distortion)
- It ignores variation along the axes it didn't pick out
- The hope is that that variation is just noise, which we want to ignore
*(Figure: true plot in k dimensions, with words 1–3 labeled by themes A and B, beside the reduced-dimensionality plot)*

Slide by Jason Eisner
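The projection these slides describe can be demonstrated directly with NumPy's SVD on a toy term-document matrix. This is a sketch: the matrix and its two "themes" are invented, and the dimensionality k = 2 is chosen by hand rather than tuned as the earlier slide recommends.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (invented).
# Docs 0-1 are "space" themed, docs 2-3 are "film" themed.
A = np.array([
    [1, 2, 0, 0],   # nova
    [2, 1, 0, 0],   # galaxy
    [0, 0, 2, 1],   # film
    [0, 0, 1, 2],   # role
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions to keep
# Each row of docs_k: one document's coordinates in the reduced space
docs_k = (np.diag(s[:k]) @ Vt[:k]).T

# Rank-k reconstruction: the best possible approximation of the true
# plot using only the chosen axes (variation on other axes is dropped)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
```

In the reduced space the two space-themed documents land on (essentially) the same point, while the film-themed documents sit far away: the dropped axes carried only the within-theme variation, which is exactly the "noise" the slide hopes to ignore.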