
Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage



Presentation Transcript


  1. Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage Raymond J. Mooney Sugato Basu Mikhail Bilenko Arindam Banerjee

  2. Supervised Classification Example

  3. Supervised Classification Example

  4. Supervised Classification Example

  5. Unsupervised Clustering Example

  6. Unsupervised Clustering Example

  7. Semi-Supervised Learning • Combines labeled and unlabeled data during training to improve performance: • Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. • Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

  8. Semi-Supervised Classification Example

  9. Semi-Supervised Classification Example

  10. Semi-Supervised Classification • Algorithms: • Semi-supervised EM [Ghahramani:NIPS94,Nigam:ML00]. • Co-training [Blum:COLT98]. • Transductive SVMs [Vapnik:98,Joachims:ICML99]. • Assumptions: • Known, fixed set of categories given in the labeled data. • Goal is to improve classification of examples into these known categories.

  11. Semi-Supervised Clustering Example

  12. Semi-Supervised Clustering Example

  13. Second Semi-Supervised Clustering Example

  14. Second Semi-Supervised Clustering Example

  15. Semi-Supervised Clustering • Can group data using the categories in the initial labeled data. • Can also extend and modify the existing set of categories as needed to reflect other regularities in the data. • Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

  16. Search-Based Semi-Supervised Clustering • Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].

  17. Unsupervised KMeans Clustering • KMeans is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x to the cluster Xl such that the L2 distance of x from μl (the center of Xl) is minimum. • Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster.
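The two alternating steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the optional `init` argument (my addition) lets the same loop be reused with seeded centers later:

```python
import numpy as np

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Plain KMeans by iterative relocation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize: pick K distinct data points as starting centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to its nearest center (L2).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center = mean of its assigned points
        # (keep the old center if a cluster happens to end up empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```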

  18. KMeans Objective Function • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_{l=1..K} Σ_{x∈Xl} ‖x − μl‖² • Initialization of K cluster centers: • Totally random • Random perturbation from global mean • Heuristic to ensure well-separated centers, etc.
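Evaluating this objective for a given partition is a one-liner (illustrative; the function name is mine):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared L2 distances from each point to its assigned center."""
    diffs = X - centers[labels]  # fancy-index each point's own center
    return float((diffs ** 2).sum())
```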

  19. K Means Example

  20. K Means Example: Randomly Initialize Means

  21. K Means Example: Assign Points to Clusters

  22. K Means Example: Re-estimate Means

  23. K Means Example: Re-assign Points to Clusters

  24. K Means Example: Re-estimate Means

  25. K Means Example: Re-assign Points to Clusters

  26. K Means Example: Re-estimate Means and Converge

  27. Semi-Supervised KMeans • Seeded KMeans: • Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. • Seed points are only used for initialization, and not in subsequent steps. • Constrained KMeans: • Labeled data provided by user are used to initialize KMeans algorithm. • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
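The two seeding schemes above can be sketched as two small helpers (my own helper names, not the paper's code): seeded initialization computes each initial center from the seed points of that class, and the constrained variant additionally pins seed labels during assignment.

```python
import numpy as np

def seeded_init(X_seed, y_seed, k):
    """Seeded KMeans: initial center for cluster i = mean of seeds labeled i."""
    return np.array([X_seed[y_seed == i].mean(axis=0) for i in range(k)])

def constrained_assign(X, centers, seed_idx, seed_labels):
    """Constrained KMeans assignment: non-seed points go to their nearest
    center, while seed points keep their user-provided labels unchanged."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    labels[seed_idx] = seed_labels  # seed labels are never re-estimated
    return labels
```

Seeded KMeans is just `seeded_init` followed by the ordinary KMeans loop; Constrained KMeans replaces the assignment step with `constrained_assign`.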

  28. Semi-Supervised K Means Example

  29. Semi-Supervised K Means Example: Initialize Means Using Labeled Data

  30. Semi-Supervised K Means Example: Assign Points to Clusters

  31. Semi-Supervised K Means Example: Re-estimate Means and Converge

  32. Similarity-Based Semi-Supervised Clustering • Train an adaptive similarity function to fit the labeled data. • Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data. • Adaptive similarity functions: • Altered Euclidean distance [Klein:ICML02] • Trained Mahalanobis distance [Xing:NIPS02] • EM-trained edit distance [Bilenko:KDD03] • Clustering algorithms: • Single-link agglomerative [Bilenko:KDD03] • Complete-link agglomerative [Klein:ICML02] • K-means [Xing:NIPS02]
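To illustrate the idea of a trained distance, here is a deliberately simplified stand-in for the metrics cited above (not the actual method of [Xing:NIPS02]): a diagonal-weighted distance whose feature weights are fit so that dimensions along which must-link pairs differ a lot are down-weighted, pulling those pairs closer together.

```python
import numpy as np

def fit_diagonal_metric(X, must_link_pairs, eps=1e-8):
    """Weight each feature inversely to its variance across must-link pairs
    (a toy diagonal Mahalanobis metric; illustrative only)."""
    diffs = np.array([X[i] - X[j] for i, j in must_link_pairs])
    w = 1.0 / (np.mean(diffs ** 2, axis=0) + eps)
    return w / w.sum()  # normalize the weights

def weighted_dist(a, b, w):
    """Distance under the learned diagonal metric."""
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))
```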

  33. Semi-Supervised Clustering Example: Similarity-Based

  34. Semi-Supervised Clustering Example: Distances Transformed by Learned Metric

  35. Semi-Supervised Clustering Example: Clustering Result with Trained Metric

  36. Experiments • Evaluation measures: • Objective function value for KMeans. • Mutual Information (MI) between distributions of computed cluster labels and human-provided class labels. • Experiments: • Change of objective function and MI with increasing fraction of seeding (for complete labeling and no noise). • Change of objective function and MI with increasing noise in seeds (for complete labeling and fixed seeding).
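The MI measure can be computed directly from the two labelings with the standard plug-in estimate (illustrative; not necessarily the exact estimator used in the experiments):

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same
    points, e.g. computed cluster labels vs. human-provided class labels."""
    mi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            p_ab = np.mean((labels_a == a) & (labels_b == b))  # joint
            p_a = np.mean(labels_a == a)                        # marginals
            p_b = np.mean(labels_b == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```

When the clustering exactly reproduces the class labels, MI equals the entropy of the class distribution; for independent labelings it is zero.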

  37. Experimental Methodology • Clustering algorithm is always run on the entire dataset. • Learning curves with 10-fold cross-validation: • 10% of the data is set aside as a test set whose labels are always hidden. • Learning curve generated by training on different “seed fractions” of the remaining 90% of the data, whose labels are provided. • Objective function is calculated over the entire dataset. • MI measure calculated only on the independent test set.

  38. Experimental Methodology (contd.) • For each fold in the Seeding experiments: • Seeds selected from training dataset by varying seed fraction from 0.0 to 1.0, in steps of 0.1 • For each fold in the Noise experiments: • Noise simulated by changing the labels of a fraction of the seed values to a random incorrect value.
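The noise scheme above can be sketched as follows (a hypothetical helper; the exact sampling details in the experiments may differ): a chosen fraction of the seeds have their label replaced by a uniformly random *incorrect* class.

```python
import numpy as np

def add_seed_noise(y_seed, noise_fraction, n_classes, seed=0):
    """Flip a fraction of seed labels to a random incorrect class."""
    rng = np.random.default_rng(seed)
    y = np.array(y_seed)
    n_noisy = int(round(noise_fraction * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    for i in idx:
        # Pick uniformly among the classes other than the true one.
        wrong = [c for c in range(n_classes) if c != y[i]]
        y[i] = rng.choice(wrong)
    return y
```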

  39. COP-KMeans • COP-KMeans [Wagstaff et al.: ICML01] is KMeans with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. • Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster). • Algorithm: During the cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
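The constrained assignment step can be sketched as a feasibility check over clusters in order of distance (an illustrative sketch, not Wagstaff et al.'s code; helper names are mine):

```python
import numpy as np

def cop_assign(x_idx, dists, labels, must, cannot):
    """Assign point x_idx to its nearest feasible cluster.

    dists:  distances from point x_idx to each cluster center.
    labels: current assignments of all points (-1 = unassigned).
    must / cannot: lists of (i, j) index pairs of constrained points.
    Returns the chosen cluster, or None if every cluster violates some
    constraint (COP-KMeans aborts in that case).
    """
    for c in np.argsort(dists):  # try clusters nearest-first
        ok = True
        for i, j in must:
            other = j if i == x_idx else (i if j == x_idx else None)
            # Must-link partner already assigned elsewhere: infeasible.
            if other is not None and labels[other] not in (-1, c):
                ok = False
        for i, j in cannot:
            other = j if i == x_idx else (i if j == x_idx else None)
            # Cannot-link partner already in this cluster: infeasible.
            if other is not None and labels[other] == c:
                ok = False
        if ok:
            return int(c)
    return None  # no feasible cluster: abort
```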

  40. Datasets • Data sets: • UCI Iris (3 classes; 150 instances) • CMU 20 Newsgroups (20 classes; 20,000 instances) • Yahoo! News (20 classes; 2,340 instances) • Data subsets created for experiments: • Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of dataset size on the algorithms. • Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms. • Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

  41. Text Data • Vector space model with TF-IDF weighting for text data. • Non-content bearing words removed: • Stopwords • High and low frequency words • Words of length < 3 • Text-handling software: • Spherical KMeans was used as the underlying clustering algorithm: it uses cosine-similarity instead of Euclidean distance between word vectors. • Code base built on top of MC and SPKMeans packages developed at UT Austin.
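A minimal sketch of the two ingredients named above, TF-IDF weighting and the cosine similarity that spherical KMeans uses in place of Euclidean distance (illustrative only, not the MC/SPKMeans code; smoothing choices are mine):

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting for a document-term count matrix (rows = docs)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                   # document frequency
    idf = np.log(len(counts) / np.maximum(df, 1))   # rarer word -> higher idf
    return tf * idf

def cosine_sim(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Note how a word appearing in every document gets idf = 0, so it carries no weight, which is one reason high-frequency words contribute little even before explicit removal.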

  42. Results: MI and Seeding Zero noise in seeds [Small-20 NewsGroup] • Semi-Supervised KMeans substantially better than unsupervised KMeans

  43. Results: Objective function and Seeding User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] • Obj. function of data partition increases exponentially with seed fraction

  44. Results: MI and Seeding Zero noise in seeds [Yahoo! News] • Semi-Supervised KMeans still better than unsupervised

  45. Results: Objective Function and Seeding User-labeling inconsistent with KMeans assumptions [Yahoo! News] • Objective function of constrained algorithms decreases with seeding

  46. Results: Dataset Separability Difficult datasets: lots of overlap between the clusters [Same-3 NewsGroup] • Semi-supervision gives substantial improvement

  47. Results: Dataset Separability Easy datasets: not much overlap between the clusters [Different-3 NewsGroup] • Semi-supervision does not give substantial improvement

  48. Results: Noise Resistance Seed fraction: 0.1 [20 NewsGroup] • Seeded-KMeans most robust against noisy seeding

  49. Record Linkage • Identify and merge duplicate field values and duplicate records in a database. • Applications • Duplicates in mailing lists • Information integration of multiple databases of stores, restaurants, etc. • Matching bibliographic references in research papers (Cora/ResearchIndex) • Different published editions in a database of books.

  50. Experimental Datasets • 1,200 artificially corrupted mailing list addresses. • 1,295 Cora research paper citations. • 864 restaurant listings from Fodor’s and Zagat’s guidebooks. • 1,675 Citeseer research paper citations.
