
Clustering Documents




  1. Clustering Documents

  2. Overview • Clustering is the process of partitioning a data set into meaningful subclasses, where every item in a subclass shares a common trait. • It helps a user understand the natural grouping or structure in a data set. • Unsupervised learning: there is no training data from which a classifier could learn how to group. • Related terms: cluster, category, group, class. • Documents that share the same properties are placed in the same cluster. • Key parameters: cluster size, number of clusters, similarity measure. • A rule of thumb: about the square root of n clusters, if n is the number of documents. • LSI (Latent Semantic Indexing)

  3. How many clusters? The notion of a cluster can be ambiguous: the same set of points can be seen as two clusters, four clusters, or six clusters.

  4. What is a natural grouping among these objects?

  5. What is a natural grouping among these objects? Clustering is subjective: the Simpson's family, school employees, females, and males are all valid groupings.

  6. What is clustering? • A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized • Distance-based clustering: objects belong to the same cluster according to a distance measure • Conceptual clustering: objects belong to the same cluster if they describe a common concept

  7. Outliers • Outliers are objects that do not belong to any cluster or form clusters of very small cardinality • In some applications we are interested in discovering outliers, not clusters (outlier analysis)

  8. Why do we cluster? • Clustering: given a collection of data objects, group them so that • objects are similar to one another within the same cluster • objects are dissimilar to the objects in other clusters • Clustering results are used: • As a stand-alone tool to get insight into the data distribution • Visualization of clusters may unveil important information • As a preprocessing step for other algorithms • Efficient indexing or compression often relies on clustering

  9. Applications of clustering? • Image processing: cluster images based on their visual content • Web: cluster groups of users based on their access patterns on webpages / cluster webpages based on their content • Bioinformatics: cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.) • Marketing: find groups of customers with similar behavior given a large database of customer data containing their properties and past buying records • Biology: classification of plants and animals given their features • Libraries: book ordering • Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds • City planning: identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones • …many more

  10. The clustering task • Cluster observations into groups so that observations belonging to the same group are similar, whereas observations in different groups are different • Basic questions: • What does “similar” mean? • What is a good partition of the objects? i.e., how is the quality of a solution measured? • How do we find a good partition of the observations?

  11. Observations to cluster • Real-valued attributes/variables • e.g., salary, height • Binary attributes • e.g., gender (M/F), has_cancer (T/F) • Nominal (categorical) attributes • e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.) • Ordinal/ranked attributes • e.g., military rank (soldier, sergeant, lieutenant, captain, etc.) • Variables of mixed types • multiple attributes with various types

  12. Aim of Clustering • Partition unlabeled examples into subsets of clusters, such that: • Examples within a cluster are very similar • Examples in different clusters are very different

  13. Cluster Organization • For a “small” number of documents, simple/flat clustering is acceptable • Search a smaller set of clusters for relevancy • If a cluster is relevant, the documents in the cluster are also relevant • Problem: looking for broader or more specific documents • Hierarchical clustering has a tree-like structure

  14. Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree

  15. A Partitional Clustering (figure: the original points and a partitional clustering of them)

  16. Partitional/Simple/Flat Clustering Example (figure: a set of points grouped into flat clusters)

  17. Hierarchical Clustering (figures: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram)

  18. Dendrogram • A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

  19. Visualization of a Dendrogram

  20. Example • D1: Human machine interface for computer applications • D2: A survey of user opinion of computer system response time • D3: The EPS user interface management system • D4: System and human system engineering testing of the EPS system • D5: The generation of the random binary and ordered trees • D6: The intersection graphs of paths in a tree • D7: Graph minors: A survey

  21. Broad vs. specific (figure: documents D1–D7 arranged in a hierarchy from broad to specific)

  22. Other Distinctions Between Sets of Clusters • Exclusive versus non-exclusive • In non-exclusive clusterings, points may belong to multiple clusters • Can represent multiple classes or ‘border’ points • Fuzzy versus non-fuzzy • In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1 • Weights must sum to 1 • Probabilistic clustering has similar characteristics • Partial versus complete • In some cases, we only want to cluster some of the data • Heterogeneous versus homogeneous • Clusters of widely different sizes, shapes, and densities

  23. Cluster Parameters • A minimum and maximum size of clusters • A large cluster size means one cluster attracting many documents, with multi-topic themes • A matching threshold value for including documents in a cluster • The minimum degree of similarity; affects the number of clusters • A high threshold means fewer documents can join a cluster, hence a larger number of clusters • The degree of overlap between clusters • Some documents deal with more than one topic • A low degree of overlap gives greater separation of clusters • A maximum number of clusters

  24. Cluster-Based Search • Inverted file organization: query keywords must exactly match word occurrences • Clustered file organization matches a keyword against a set of cluster representatives • Each cluster representative consists of popular words related to a common topic • In flat clustering, the query is compared against the centroids of the clusters • Centroid: the average representative of a group of documents, built from the composite text of all member documents
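The centroid described above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are represented as sparse term-weight dicts; the averaging scheme and all names are mine, not the slide's:

```python
from collections import Counter

def centroid(docs):
    """Average term weights over a group of documents.

    Each document is a dict mapping term -> weight; the centroid
    holds the mean weight of every term across the group.
    """
    total = Counter()
    for doc in docs:
        total.update(doc)          # Counter.update adds weights per term
    n = len(docs)
    return {term: weight / n for term, weight in total.items()}

docs = [{"wild": 2, "boys": 1}, {"wild": 1, "flowers": 1}]
c = centroid(docs)
print(c)  # each term's average weight over the two documents
```

A query can then be matched against such centroids instead of against every member document.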

  25. Automatic Document Classification • Searching vs. browsing • Disadvantages of using inverted index files • Information pertaining to a document is scattered among many different inverted-term lists • Information relating to different documents with similar term assignments is not in close proximity in the file system • Approaches • Inverted-index files (for searching) + clustered document collection (for browsing) • Clustered file organization (for searching and browsing)

  26. Typical Clustered File Organization (figure: highest-level centroid, supercentroids, centroids, and documents, with a typical search path down the hierarchy)

  27. Cluster Generation vs. Cluster Search • Cluster generation • Cluster structure is generated only once. • Cluster maintenance can be carried out at relatively infrequent intervals. • Cluster generation process may be slower and more expensive. • Cluster search • Cluster search operations may have to be performed continually. • Cluster search operations must be carried out efficiently.

  28. Hierarchical Cluster Generation • Two strategies • pairwise item similarities • heuristic methods • Models • Divisive clustering (top down) • The complete collection is assumed to represent one complete cluster. • The collection is then subsequently broken down into smaller pieces. • Hierarchical agglomerative clustering (bottom up) • Individual item similarities are used as a starting point. • A gluing operation collects similar items, or groups, into larger groups.
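A minimal sketch of the bottom-up (agglomerative) strategy, assuming single-link distances between clusters and a user-supplied pairwise distance function (the data and names are made up for illustration):

```python
def agglomerative(points, k, dist):
    """Bottom-up clustering: start with singleton clusters and
    repeatedly merge the closest pair until only k clusters remain.
    Single-link: cluster distance = min pairwise point distance.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # glue the closest pair together
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
result = agglomerative(pts, 2, lambda a, b: abs(a - b))
print(result)  # two groups: the points near 0 and the points near 5
```

The O(n³) scan here is the naive formulation; real implementations cache the pairwise similarity matrix, which is exactly the "pairwise item similarities" strategy the slide mentions.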

  29. Searching with a taxonomy • Two ways to search a document collection organized in a taxonomy • Top-down search • Start at the root • Progressively compare the query with cluster representatives • A single error at a higher level => wrong path => incorrect cluster • Bottom-up search • Compare the query with the most specific clusters at the lowest level • A high number of low-level clusters increases computation time • Use an inverted index for low-level representatives

  30. Aim of Clustering again? • Partitioning data into classes with high intra-class similarity and low inter-class similarity • Is it well-defined?

  31. What is Similarity? • Clearly, a subjective and problem-dependent measure

  32. How Similar are Clusters? • Ex1: Two clusters or one cluster?

  33. How Similar are Clusters? • Ex2: A cluster or outliers?

  34. Similarity Measures • Most cluster methods use a matrix of similarity computations • Compute similarities between documents • Homework: What similarity measures are used in text mining? Discuss their advantages and disadvantages. Where appropriate, comment on the application areas for each similarity measure. List your references and use your own words.
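As a starting point for the homework above, one widely used measure for documents is cosine similarity. A minimal sketch over sparse term-weight dicts (the example vectors are made up):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two sparse term-weight vectors:
    the dot product divided by the product of vector lengths."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = {"wild": 1, "boys": 1}
b = {"wild": 2, "boys": 1, "forever": 1}
s = cosine(a, b)
print(round(s, 3))  # 0.866
```

Because it normalizes by vector length, cosine similarity is insensitive to document length, which addresses one common criticism of raw term-overlap scores.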

  35. Linking Methods (figure: star, clique, and string linkages)

  36. Clustering Methods • Many methods to compute clusters • An NP-complete problem • Each solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible • Each trial may produce a different cluster organization

  37. Stable Clustering • Results should be independent of the initial order of documents • Clusters should not be substantially different when new documents are added to the collection • Results from consecutive runs should not differ significantly

  38. K-Means • A heuristic with complexity O(n log n); matrix-based algorithms are O(n²) • Begins with an initial set of clusters • To pick the initial cluster centers, use one of these methods: • Pick the cluster centroids randomly • Use matrix-based similarity on a small subset • Use a density test to pick cluster centers from sample data: di is a cluster center if at least n other documents have similarity to it greater than a threshold • A set of documents that are sufficiently dissimilar must exist in the collection

  39. K-Means Algorithm • Select k documents from the collection to form k initial singleton clusters • Repeat until termination conditions are satisfied: • For every document d, find the cluster i whose centroid is most similar and assign d to cluster i • For every cluster i, recompute the centroid based on the current member documents • Check for termination: minimal or no changes in the assignment of documents to clusters • Return the list of clusters
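The loop above can be sketched as follows. This is a minimal illustration, assuming dense vectors, a user-supplied similarity function, and random initial seeds; the names and the toy data are mine, not the slide's:

```python
import random

def kmeans(docs, k, similarity, max_iter=100, seed=0):
    """K-means sketch: seed k clusters with random documents, then
    alternate (re)assignment and centroid recomputation until the
    assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)
    assignment = None
    for _ in range(max_iter):
        new = [max(range(k), key=lambda i: similarity(d, centroids[i]))
               for d in docs]
        if new == assignment:          # termination: no changes
            break
        assignment = new
        for i in range(k):             # recompute each centroid
            members = [d for d, c in zip(docs, assignment) if c == i]
            if members:
                centroids[i] = [sum(x) / len(members) for x in zip(*members)]
    return assignment, centroids

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
sim = lambda a, b: -sum((x - y) ** 2 for x, y in zip(a, b))  # negated distance
labels, _ = kmeans(docs, 2, sim)
print(labels)  # the two points near the origin share one label,
               # the two points near (5, 5) share the other
```

Negating squared Euclidean distance turns it into a similarity, so "most similar centroid" and "nearest centroid" coincide here.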

  40. Simulated Annealing • Avoids local optima by searching randomly • Downhill move: a new solution with a higher (better) value than the previous solution • Uphill move: a worse solution is accepted to avoid local minima • The frequency of uphill moves decreases during the “life cycle” • An analogy to crystal formation

  41. Simulated Annealing Algorithm • Get an initial set of clusters and set the temperature to T • Repeat until the temperature is reduced to the minimum: • Run a loop x times: • Find a new set of clusters by altering the membership of some documents • Compare the values of the new and old sets of clusters; if there is an improvement, accept the new set of clusters, otherwise accept it with probability p • Reduce the temperature according to the cooling schedule • Return the final set of clusters
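A skeleton of this loop, assuming a geometric cooling schedule and acceptance probability p = exp(Δ/T) for a maximisation objective; both are common choices that the slide does not pin down, and the toy objective is made up:

```python
import math
import random

def anneal(initial, neighbor, value, t0=1.0, t_min=0.01, alpha=0.9,
           x=20, seed=0):
    """Simulated-annealing skeleton: at each temperature run x
    alteration steps, always accept improvements, and accept worse
    solutions with probability exp(delta / T)."""
    rng = random.Random(seed)
    current, t = initial, t0
    while t > t_min:
        for _ in range(x):
            candidate = neighbor(current, rng)
            delta = value(candidate) - value(current)
            if delta > 0 or rng.random() < math.exp(delta / t):
                current = candidate    # downhill, or accepted uphill move
        t *= alpha                     # cooling schedule: geometric decay
    return current

# Toy objective: maximise -(s - 3)^2 over the integers via random +-1 moves.
best = anneal(10,
              neighbor=lambda s, r: s + r.choice([-1, 1]),
              value=lambda s: -(s - 3) ** 2)
print(best)  # settles at or very near the optimum s = 3
```

For clustering, `neighbor` would move some documents between clusters and `value` would score the clustering, exactly as steps 2.1.1 and 2.1.2 of the slide describe.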

  42. Simulated Annealing • Simple to implement • Solutions are reasonably good and avoid local minima • Successful in other optimization tasks • The initial set is very important • Adjusting the size of clusters is difficult

  43. Genetic Algorithm • Uses a population of solutions • 1. Arrange the set of documents in a circle such that documents that are similar to one another are located close to each other • 2. Find key documents in the circle and build clusters from the neighborhoods of these documents • Each arrangement of documents is a solution: a chromosome • A fitness function scores each solution

  44. Genetic Algorithm • Pick two parent solutions x and y from the set of all solutions, with preference for solutions with higher fitness scores • Use a crossover operation to combine x and y into a new solution z • Periodically mutate a solution by randomly exchanging two documents in it
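A sketch of these steps for document arrangements (permutations), assuming order crossover, swap mutation, and tournament selection; these are common choices that the slides do not specify, and the toy fitness function is made up:

```python
import random

def crossover(x, y, rng):
    """Order crossover: copy a slice of parent x, then fill the
    remaining positions with y's elements in y's order, so the
    child is still a valid permutation."""
    i, j = sorted(rng.sample(range(len(x)), 2))
    middle = x[i:j]
    rest = [g for g in y if g not in middle]
    return rest[:i] + middle + rest[i:]

def mutate(sol, rng):
    """Randomly exchange two documents in the arrangement."""
    a, b = rng.sample(range(len(sol)), 2)
    sol = sol[:]
    sol[a], sol[b] = sol[b], sol[a]
    return sol

def evolve(population, fitness, generations=50, p_mut=0.2, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        def pick():                    # tournament of size 2: prefer fitter
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b
        child = crossover(pick(), pick(), rng)
        if rng.random() < p_mut:
            child = mutate(child, rng)
        worst = min(range(len(population)),
                    key=lambda i: fitness(population[i]))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child  # replace the weakest member
    return max(population, key=fitness)

topics = ["news", "news", "sport", "sport"]
def fitness(order):   # reward placing same-topic documents side by side
    return sum(topics[a] == topics[b] for a, b in zip(order, order[1:]))

pop = [[0, 2, 1, 3], [3, 1, 2, 0], [2, 0, 3, 1]]
best = evolve(pop, fitness)
print(best)
```

Order crossover is used here because plain one-point crossover on permutations would duplicate or drop documents; any permutation-preserving operator would do.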

  45. Learn the scatter/gather algorithm

  46. Scoring Documents

  47. Scoring Documents • Given a document d and a query q • Calculate score(q,d) • Rank documents in decreasing order of score(q,d) • Generic model: a document is a bag of [unordered] words (in set theory a bag is a multiset) • A document is composed of terms • A query is composed of terms • score(q,d) will depend on terms

  48. Method 1: Assign weights to terms • Assign to each term a weight tft,d - the term frequency (how often term t occurs in document d) • query = ‘who wrote wild boys’ • doc1 = ‘Duran Duran sang Wild Boys in 1984.’ • doc2 = ‘Wild boys don’t remain forever wild.’ • doc3 = ‘Who brought wild flowers?’ • doc4 = ‘It was John Krakauer who wrote In to the wild.’ • query = {boys: 1, who: 1, wild: 1, wrote: 1} • doc1 = {1984: 1, boys: 1, duran: 2, in: 1, sang: 1, wild: 1} • doc2 = {boys: 1, don’t: 1, forever: 1, remain: 1, wild: 2} • … • score(q, doc1) = 1 + 1 = 2 • score(q, doc2) = 1 + 2 = 3 • score(q, doc3) = 1 + 1 = 2 • score(q, doc4) = 1 + 1 + 1 = 3
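The Method 1 scores above can be reproduced with a few lines of Python; the tokenization here is deliberately crude (lowercase, strip '.' and '?') and is my simplification:

```python
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase and strip sentence punctuation."""
    return text.lower().replace(".", "").replace("?", "").split()

def score(query, doc):
    """Method 1: sum, over the query's terms, each term's
    frequency tf(t, d) in the document."""
    q = set(tokenize(query))
    tf = Counter(tokenize(doc))
    return sum(tf[t] for t in q)

query = "who wrote wild boys"
docs = ["Duran Duran sang Wild Boys in 1984.",
        "Wild boys don't remain forever wild.",
        "Who brought wild flowers?",
        "It was John Krakauer who wrote In to the wild."]
scores = [score(query, d) for d in docs]
print(scores)  # [2, 3, 2, 3] -- matching the slide's worked example
```

Note that doc2 and doc4 tie at 3 even though doc4 answers the query, which motivates the next slide's critique.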

  49. Why is Method 1 not good? • All terms have equal importance • Bigger documents have more terms, so their scores are larger • It ignores term order • Postulate: if a word appears in every document, it is probably not that important (it has no discriminatory power)

  50. Method 2: New weights • dft - document frequency for term t (the number of documents in which t occurs) • idft - inverse document frequency for term t: idft = log(N / dft), where N is the total number of documents • tf-idft,d - a combined weight for term t in document d: tf-idft,d = tft,d × idft • Increases with the number of occurrences within a doc • Increases with the rarity of the term across the whole corpus
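A minimal sketch of this weighting, assuming the common base-10 logarithm for idf and a raw count for tf (the toy corpus is made up):

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf sketch: the weight grows with the term's occurrences
    in the document (tf) and with its rarity across the corpus (idf)."""
    tf = doc.count(term)
    df = sum(term in d for d in corpus)        # document frequency
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log10(len(corpus) / df)         # idf_t = log(N / df_t)
    return tf * idf

corpus = [["wild", "boys", "wild"],
          ["wild", "flowers"],
          ["graph", "minors"]]
w = tf_idf("wild", corpus[0], corpus)
print(round(w, 3))  # 0.352: tf = 2 times idf = log10(3 / 2)
```

A term occurring in every document gets idf = log(N/N) = 0, so it contributes nothing, which is exactly the postulate of the previous slide.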
