
Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage



Presentation Transcript


  1. Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage Raymond J. Mooney Sugato Basu Mikhail Bilenko Arindam Banerjee

  2. Supervised Classification Example

  3. Supervised Classification Example

  4. Supervised Classification Example

  5. Unsupervised Clustering Example

  6. Unsupervised Clustering Example

  7. Semi-Supervised Learning • Combines labeled and unlabeled data during training to improve performance: • Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. • Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

  8. Semi-Supervised Classification Example

  9. Semi-Supervised Classification Example

  10. Semi-Supervised Classification • Algorithms: • Semi-supervised EM [Ghahramani:NIPS94,Nigam:ML00]. • Co-training [Blum:COLT98]. • Transductive SVMs [Vapnik:98,Joachims:ICML99]. • Assumptions: • Known, fixed set of categories given in the labeled data. • Goal is to improve classification of examples into these known categories.

  11. Semi-Supervised Clustering Example

  12. Semi-Supervised Clustering Example

  13. Second Semi-Supervised Clustering Example

  14. Second Semi-Supervised Clustering Example

  15. Semi-Supervised Clustering • Can group data using the categories in the initial labeled data. • Can also extend and modify the existing set of categories as needed to reflect other regularities in the data. • Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

  16. Search-Based Semi-Supervised Clustering • Alter the clustering algorithm that searches for a good partitioning by: • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. • Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].

  17. Unsupervised KMeans Clustering • KMeans is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x to the cluster Xl such that the L2 distance of x from μl (the center of Xl) is minimum. • Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster.
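The two alternating steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the optional `init` argument (my addition) lets the same loop be reused with seeded centers later:

```python
import numpy as np

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Plain KMeans by iterative relocation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Initialize: pick K distinct data points as starting centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to its nearest center (L2).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center = mean of its assigned points
        # (keep the old center if a cluster happens to end up empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```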

  18. KMeans Objective Function • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_{l=1..K} Σ_{x∈Xl} ‖x − μl‖² • Initialization of K cluster centers: • Totally random • Random perturbation from global mean • Heuristic to ensure well-separated centers, etc.
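Evaluating this objective for a given partition is a one-liner (illustrative; the function name is mine):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared L2 distances from each point to its assigned center."""
    diffs = X - centers[labels]  # fancy-index each point's own center
    return float((diffs ** 2).sum())
```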

  19. K Means Example

  20. K Means Example: Randomly Initialize Means

  21. K Means Example: Assign Points to Clusters

  22. K Means Example: Re-estimate Means

  23. K Means Example: Re-assign Points to Clusters

  24. K Means Example: Re-estimate Means

  25. K Means Example: Re-assign Points to Clusters

  26. K Means Example: Re-estimate Means and Converge

  27. Semi-Supervised KMeans • Seeded KMeans: • Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. • Seed points are only used for initialization, and not in subsequent steps. • Constrained KMeans: • Labeled data provided by user are used to initialize KMeans algorithm. • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
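The two seeding schemes above can be sketched as two small helpers (my own helper names, not the paper's code): seeded initialization computes each initial center from the seed points of that class, and the constrained variant additionally pins seed labels during assignment.

```python
import numpy as np

def seeded_init(X_seed, y_seed, k):
    """Seeded KMeans: initial center for cluster i = mean of seeds labeled i."""
    return np.array([X_seed[y_seed == i].mean(axis=0) for i in range(k)])

def constrained_assign(X, centers, seed_idx, seed_labels):
    """Constrained KMeans assignment: non-seed points go to their nearest
    center, while seed points keep their user-provided labels unchanged."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    labels[seed_idx] = seed_labels  # seed labels are never re-estimated
    return labels
```

Seeded KMeans is just `seeded_init` followed by the ordinary KMeans loop; Constrained KMeans replaces the assignment step with `constrained_assign`.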

  28. Semi-Supervised K Means Example

  29. Semi-Supervised K Means Example: Initialize Means Using Labeled Data

  30. Semi-Supervised K Means Example: Assign Points to Clusters

  31. Semi-Supervised K Means Example: Re-estimate Means and Converge

  32. Similarity-Based Semi-Supervised Clustering • Train an adaptive similarity function to fit the labeled data. • Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data. • Adaptive similarity functions: • Altered Euclidean distance [Klein:ICML02] • Trained Mahalanobis distance [Xing:NIPS02] • EM-trained edit distance [Bilenko:KDD03] • Clustering algorithms: • Single-link agglomerative [Bilenko:KDD03] • Complete-link agglomerative [Klein:ICML02] • K-means [Xing:NIPS02]
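To illustrate the idea of a trained distance, here is a deliberately simplified stand-in for the metrics cited above (not the actual method of [Xing:NIPS02]): a diagonal-weighted distance whose feature weights are fit so that dimensions along which must-link pairs differ a lot are down-weighted, pulling those pairs closer together.

```python
import numpy as np

def fit_diagonal_metric(X, must_link_pairs, eps=1e-8):
    """Weight each feature inversely to its variance across must-link pairs
    (a toy diagonal Mahalanobis metric; illustrative only)."""
    diffs = np.array([X[i] - X[j] for i, j in must_link_pairs])
    w = 1.0 / (np.mean(diffs ** 2, axis=0) + eps)
    return w / w.sum()  # normalize the weights

def weighted_dist(a, b, w):
    """Distance under the learned diagonal metric."""
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))
```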

  33. Semi-Supervised Clustering Example: Similarity-Based

  34. Semi-Supervised Clustering Example: Distances Transformed by Learned Metric

  35. Semi-Supervised Clustering Example: Clustering Result with Trained Metric

  36. Experiments • Evaluation measures: • Objective function value for KMeans. • Mutual Information (MI) between distributions of computed cluster labels and human-provided class labels. • Experiments: • Change of objective function and MI with increasing fraction of seeding (for complete labeling and no noise). • Change of objective function and MI with increasing noise in seeds (for complete labeling and fixed seeding).
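The MI measure can be computed directly from the two labelings with the standard plug-in estimate (illustrative; not necessarily the exact estimator used in the experiments):

```python
import numpy as np

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same
    points, e.g. computed cluster labels vs. human-provided class labels."""
    mi = 0.0
    for a in np.unique(labels_a):
        for b in np.unique(labels_b):
            p_ab = np.mean((labels_a == a) & (labels_b == b))  # joint
            p_a = np.mean(labels_a == a)                        # marginals
            p_b = np.mean(labels_b == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```

When the clustering exactly reproduces the class labels, MI equals the entropy of the class distribution; for independent labelings it is zero.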

  37. Experimental Methodology • Clustering algorithm is always run on the entire dataset. • Learning curves with 10-fold cross-validation: • 10% of the data is set aside as a test set whose labels are always hidden. • Learning curve generated by training on different “seed fractions” of the remaining 90% of the data, whose labels are provided. • Objective function is calculated over the entire dataset. • MI measure calculated only on the independent test set.

  38. Experimental Methodology (contd.) • For each fold in the Seeding experiments: • Seeds selected from training dataset by varying seed fraction from 0.0 to 1.0, in steps of 0.1 • For each fold in the Noise experiments: • Noise simulated by changing the labels of a fraction of the seed values to a random incorrect value.
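The noise scheme above can be sketched as follows (a hypothetical helper; the exact sampling details in the experiments may differ): a chosen fraction of the seeds have their label replaced by a uniformly random *incorrect* class.

```python
import numpy as np

def add_seed_noise(y_seed, noise_fraction, n_classes, seed=0):
    """Flip a fraction of seed labels to a random incorrect class."""
    rng = np.random.default_rng(seed)
    y = np.array(y_seed)
    n_noisy = int(round(noise_fraction * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    for i in idx:
        # Pick uniformly among the classes other than the true one.
        wrong = [c for c in range(n_classes) if c != y[i]]
        y[i] = rng.choice(wrong)
    return y
```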

  39. COP-KMeans • COP-KMeans [Wagstaff et al.: ICML01] is KMeans with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. • Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster). • Algorithm: During the cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
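The constrained assignment step can be sketched as a feasibility check over clusters in order of distance (an illustrative sketch, not Wagstaff et al.'s code; helper names are mine):

```python
import numpy as np

def cop_assign(x_idx, dists, labels, must, cannot):
    """Assign point x_idx to its nearest feasible cluster.

    dists:  distances from point x_idx to each cluster center.
    labels: current assignments of all points (-1 = unassigned).
    must / cannot: lists of (i, j) index pairs of constrained points.
    Returns the chosen cluster, or None if every cluster violates some
    constraint (COP-KMeans aborts in that case).
    """
    for c in np.argsort(dists):  # try clusters nearest-first
        ok = True
        for i, j in must:
            other = j if i == x_idx else (i if j == x_idx else None)
            # Must-link partner already assigned elsewhere: infeasible.
            if other is not None and labels[other] not in (-1, c):
                ok = False
        for i, j in cannot:
            other = j if i == x_idx else (i if j == x_idx else None)
            # Cannot-link partner already in this cluster: infeasible.
            if other is not None and labels[other] == c:
                ok = False
        if ok:
            return int(c)
    return None  # no feasible cluster: abort
```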

  40. Datasets • Data sets: • UCI Iris (3 classes; 150 instances) • CMU 20 Newsgroups (20 classes; 20,000 instances) • Yahoo! News (20 classes; 2,340 instances) • Data subsets created for experiments: • Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of dataset size on the algorithms. • Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms. • Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

  41. Text Data • Vector space model with TF-IDF weighting for text data. • Non-content bearing words removed: • Stopwords • High and low frequency words • Words of length < 3 • Text-handling software: • Spherical KMeans was used as the underlying clustering algorithm: it uses cosine-similarity instead of Euclidean distance between word vectors. • Code base built on top of MC and SPKMeans packages developed at UT Austin.
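A minimal sketch of the two ingredients named above, TF-IDF weighting and the cosine similarity that spherical KMeans uses in place of Euclidean distance (illustrative only, not the MC/SPKMeans code; smoothing choices are mine):

```python
import numpy as np

def tfidf(counts):
    """TF-IDF weighting for a document-term count matrix (rows = docs)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                   # document frequency
    idf = np.log(len(counts) / np.maximum(df, 1))   # rarer word -> higher idf
    return tf * idf

def cosine_sim(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Note how a word appearing in every document gets idf = 0, so it carries no weight, which is one reason high-frequency words contribute little even before explicit removal.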

  42. Results: MI and Seeding Zero noise in seeds [Small-20 NewsGroup] • Semi-Supervised KMeans substantially better than unsupervised KMeans

  43. Results: Objective function and Seeding User-labeling consistent with KMeans assumptions [Small-20 NewsGroup] • Obj. function of data partition increases exponentially with seed fraction

  44. Results: MI and Seeding Zero noise in seeds [Yahoo! News] • Semi-Supervised KMeans still better than unsupervised

  45. Results: Objective Function and Seeding User-labeling inconsistent with KMeans assumptions [Yahoo! News] • Objective function of constrained algorithms decreases with seeding

  46. Results: Dataset Separability Difficult datasets: lots of overlap between the clusters [Same-3 NewsGroup] • Semi-supervision gives substantial improvement

  47. Results: Dataset Separability Easy datasets: not much overlap between the clusters [Different-3 NewsGroup] • Semi-supervision does not give substantial improvement

  48. Results: Noise Resistance Seed fraction: 0.1 [20 NewsGroup] • Seeded-KMeans most robust against noisy seeding

  49. Record Linkage • Identify and merge duplicate field values and duplicate records in a database. • Applications • Duplicates in mailing lists • Information integration of multiple databases of stores, restaurants, etc. • Matching bibliographic references in research papers (Cora/ResearchIndex) • Different published editions in a database of books.

  50. Experimental Datasets • 1,200 artificially corrupted mailing list addresses. • 1,295 Cora research paper citations. • 864 restaurant listings from Fodor’s and Zagat’s guidebooks. • 1,675 Citeseer research paper citations.
