
Using Clustering to Learn Distance Functions for Supervised Similarity Assessment



Presentation Transcript


  1. Using Clustering to Learn Distance Functions for Supervised Similarity Assessment. Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Department of Computer Science, University of Houston. Organization of the Talk: • Similarity Assessment • A Framework for Distance Function Learning • Inside/Outside Weight Updating • Distance Function Learning Research at UH-DMML • Experimental Evaluation • Other Distance Function Learning Research • Summary

  2. 1. Similarity Assessment • Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other. • Goal of similarity assessment: construct a distance function! • Applications of similarity assessment: case-based reasoning, classification techniques that rely on distance functions, clustering, … • Complications: Usually, there is no universal "good" distance function for a set of objects; the usefulness of a distance function depends on the task it is used for ("no free lunch in similarity assessment either"). Defining the distance between objects is more an art than a science.

  3. Motivating Example: How to Find Similar Patients? The following relation is given (with 10000 tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, …) • Attribute domains: • ssn: 9 digits • weight: between 30 and 650 (mean μ=158, σ=24.20) • height: between 0.30 and 2.20 meters (mean μ=1.52, σ=19.2) • cancer-sev: 4=serious, 3=quite_serious, 2=medium, 1=minor • eye-color: {brown, blue, green, grey} • age: between 3 and 100 (mean μ=45, σ=13.2). Task: Define patient similarity; a sketch of what such a distance could look like follows.
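To make the task concrete, the following is a minimal Python sketch (not from the slides) of what such a patient distance could look like: numeric attributes are normalized by the standard deviations quoted above, ssn is ignored, eye-color uses a 0/1 mismatch distance, cancer-sev is rescaled to [0,1], and the per-attribute weights are placeholders; learning good values for these weights is what the rest of the talk is about. All names and design choices here are illustrative assumptions.

    # Illustrative weighted distance for the Patient relation (not the authors' code).
    STATS = {"weight": 24.20, "height": 19.2, "age": 13.2}       # sigma per numeric attribute
    WEIGHTS = {"weight": 1.0, "height": 1.0, "cancer_sev": 1.0,  # attribute weights, to be learned
               "eye_color": 1.0, "age": 1.0}

    def attribute_distance(attr, a, b):
        if attr == "eye_color":                  # nominal attribute: 0 if equal, 1 otherwise
            return 0.0 if a == b else 1.0
        if attr == "cancer_sev":                 # ordinal 1..4, rescaled to [0, 1]
            return abs(a - b) / 3.0
        return abs(a - b) / STATS[attr]          # numeric attribute: sigma-normalized difference

    def patient_distance(p, q):
        """Weighted sum of per-attribute distances; ssn is ignored."""
        return sum(w * attribute_distance(attr, p[attr], q[attr])
                   for attr, w in WEIGHTS.items())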

  4. CAL-FULL/UH Database Clustering & Similarity Assessment Environments (for more details see [RE05]). [Architecture diagram with the following components: DBMS, Data Extraction Tool, User Interface (type and weight information; default choices and domain information), Similarity Measure Tool with a library of similarity measures, Clustering Tool with a library of clustering algorithms, Learning Tool (today's topic), object views, similarity measures, training data, and the resulting set of clusters.]

  5. 2. A Framework for Distance Function Learning • Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes. • Objective: Learn a "good" distance function q for classification tasks. • Our approach: Apply a clustering algorithm with the object distance function q to be evaluated; it returns k clusters. • Our goal is to learn the weights of an object distance function q such that pure clusters are obtained (or clusters that are as pure as possible); a pure cluster contains examples belonging to a single class. The assumed form of q is written out below.
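Written out, the assumed object distance is the weighted sum below, where p is the number of attributes, d_i the per-attribute distance for the i-th attribute, and w_i its weight (the notation is mine, not the slides'):

    q(o1, o2) = Σ_{i=1..p} w_i · d_i(o1.att_i, o2.att_i),   with w_i ≥ 0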

  6. Idea: Coevolving Clusters and Distance Functions. [Diagram: a weight updating scheme / search strategy produces a distance function Q; a clustering X is obtained with q(X); a clustering evaluation step measures the goodness of the distance function Q and feeds back into the weight updating scheme. A "bad" distance function Q1 yields clusters that mix o and x examples; a "good" distance function Q2 yields clusters in which the o examples and the x examples are separated.]

  7. 3. Inside/Outside Weight Updating. o := examples belonging to the majority class, x := non-majority-class examples. Idea: Move examples of the majority class closer to each other. Cluster1, distances with respect to Att1: x o  o o  o x. Action: Increase the weight of Att1. Cluster1, distances with respect to Att2: o o  x x  o o. Action: Decrease the weight of Att2.

  8. Inside/Outside Weight Updating Algorithm: 1. Cluster the dataset with k-means using the given weight vector w=(w1,…,wp). 2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating. 3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w. A sketch of this loop in code follows.
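A minimal Python sketch of this loop, assuming NumPy arrays X (examples) and y (class labels), scikit-learn's KMeans, and an update_weights routine implementing the per-cluster heuristic (an illustrative version is sketched after slide 9 below). Applying the weights by rescaling the columns is only an approximation of a weighted distance (each w_i then enters the Euclidean distance quadratically); all of this is a sketch, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def learn_weights(X, y, k, iterations=200):
        """Inside/outside weight updating, sketched: cluster with the current
        weights, adjust the weights per cluster and attribute, repeat."""
        w = np.ones(X.shape[1])
        for _ in range(iterations):
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X * w)   # weighted k-means
            for c in range(k):
                w = update_weights(w, X[labels == c], y[labels == c])     # per-cluster update
            w *= len(w) / w.sum()      # keep the average weight at 1 (illustrative normalization)
        return w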

  9. Inside/Outside Weight Updating Heuristic. The weight of the i-th attribute, wi, is updated for a given cluster according to formula (W). Example 1: x o  o o  o x. Example 2: o o  x x  o o.
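Since formula (W) itself is not reproduced in this transcript, the sketch below only implements the verbal idea of slides 7 and 10 and is an illustrative reconstruction, not the formula used in [ERBV04]: for each attribute, compare how tightly the majority-class examples of the cluster are packed with how tightly the cluster as a whole is packed with respect to that attribute, and nudge the weight up or down accordingly with rate alpha.

    import numpy as np
    from collections import Counter

    def update_weights(w, Xc, yc, alpha=0.3):
        """Illustrative inside/outside update for one cluster: raise w[i] when the
        majority-class examples are closer together than the cluster as a whole
        with respect to attribute i, lower it otherwise."""
        if len(yc) < 2:
            return w
        majority = Counter(yc).most_common(1)[0][0]
        inside = Xc[yc == majority]
        w = w.copy()
        for i in range(Xc.shape[1]):
            d_all = np.abs(np.subtract.outer(Xc[:, i], Xc[:, i])).mean()          # avg pairwise distance, all examples
            d_in = np.abs(np.subtract.outer(inside[:, i], inside[:, i])).mean()   # same, majority class only
            if d_all > 0:
                w[i] = max(w[i] * (1.0 + alpha * (d_all - d_in) / d_all), 1e-6)
        return w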

  10. Idea: Inside/Outside Weight Updating (example). [Figure: cluster k with six objects (1–6), shown with respect to Attribute1, Attribute2, and Attribute3.] Initial weights: w1=w2=w3=1; updated weights: w1=1.14, w2=1.32, w3=0.84.

  11. Illustration: Net Effect of Weight Adjustments. [Figure: cluster k with objects 1–6, comparing the old object distances with the new object distances after the weight adjustment.]

  12. A Slightly Enhanced Weight Update Formula

  13. Sample Run of IOWU for the Diabetes Dataset

  14. 4. Distance Function Learning Research at UH-DMML. [Overview table of current research, with one axis for the distance function evaluation method and one for the weight-updating scheme / search strategy. Entries include: K-Means with Inside/Outside Weight Updating [ERBV04]; Supervised Clustering [EZZ04]; work by Karypis; Randomized Hill Climbing; NN-classifier-based evaluation; Adaptive Clustering [BECV05]; other research.]

  15. 5. Experimental Evaluation • Used a benchmark consisting of 7/15 UCI datasets • Inside/outside weight updating was run for 200 iterations • α was set to 0.3 • Evaluation: 10-fold cross-validation, repeated 10 times, was used to determine accuracy • Used a 1-NN classifier as the baseline classifier • Used the learned distance function for a 1-NN classifier • Used the learned distance function for an NCC classifier (new!)

  16. NCC Classifier. Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated this way is then used to classify the examples in the test set. [Figure: (a) dataset clustered by k-means; (b) dataset edited using cluster centroids that carry the class label of the cluster's majority class.] A sketch of this classifier follows.
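A compact sketch of this classifier, using scikit-learn's KMeans (names and details are illustrative, not the authors' implementation), assuming X and y are NumPy arrays and that X has already been rescaled with the learned attribute weights:

    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans

    class NCC:
        """Classifier over k (centroid, majority class) pairs produced by k-means."""
        def fit(self, X, y, k=10):
            km = KMeans(n_clusters=k, n_init=10).fit(X)
            self.centroids_ = km.cluster_centers_
            self.classes_ = np.array([Counter(y[km.labels_ == c]).most_common(1)[0][0]
                                      for c in range(k)])           # majority class per cluster
            return self

        def predict(self, X):
            # assign each test example the class of its nearest centroid
            d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
            return self.classes_[d.argmin(axis=1)]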

  17. Experimental Evaluation Remark: Statistically significant improvements are in red.

  18. DF-Learning with Randomized Hill Climbing. Notation: Random = a random number; a = rate of change, drawn for example from [-0.3, 0.3]. • Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w. A sketch of this search follows.
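A minimal sketch of this search, under the assumption that each of the R neighbors is obtained by perturbing every weight by a random factor drawn from [-a, a] with a = 0.3, and that evaluate(w) returns a fitness where higher is better (e.g. cluster purity under the weights w); all names are illustrative.

    import numpy as np

    def hill_climb(w, evaluate, R=20, a=0.3, max_iterations=50, seed=0):
        """Randomized hill climbing over weight vectors: sample R neighbors of w,
        move to the best one as long as it improves the current fitness."""
        rng = np.random.default_rng(seed)
        best, best_fit = w, evaluate(w)
        for _ in range(max_iterations):
            neighbors = [best * (1.0 + rng.uniform(-a, a, size=len(best))) for _ in range(R)]
            cand = max(neighbors, key=evaluate)
            cand_fit = evaluate(cand)
            if cand_fit <= best_fit:     # no improvement: stop (or widen the neighborhood / restart)
                break
            best, best_fit = cand, cand_fit
        return best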

  19. Accuracy of IOWU and Randomized Hill Climbing

  20. Distance Function Learning with Adaptive Clustering • Uses reinforcement learning to adapt distance functions for k-means clustering. • Employs a search strategy that explores multiple paths in parallel. The algorithm maintains an open list with maximum size |L|; bad performers are dropped from the open list. Currently, beam search is used, which creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2p*|L| successors, and keeps the best |L| of them; a sketch of one such step is given after this list. • Discretizes the search space, in which states are (<weights>, <centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to "interesting states" by employing prioritized sweeping. • Weights are updated by increasing / decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%]. • Employs entropy H(X) as the fitness function (low entropy = pure clusters).
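One step of the beam search described above can be sketched as follows (illustrative only; evaluate(w) stands in for the entropy-based fitness, where lower is better, and the change factor is drawn from the [25%, 50%] interval mentioned in the slide):

    import numpy as np

    def beam_search_step(open_list, evaluate, L=5, rng=np.random.default_rng(0)):
        """For every weight vector in the open list, create 2p successors (each
        attribute's weight increased and decreased once by a random 25-50% change),
        then keep the best L of all successors (lowest entropy)."""
        successors = []
        for w in open_list:
            for i in range(len(w)):
                for sign in (+1, -1):
                    s = w.copy()
                    s[i] *= 1.0 + sign * rng.uniform(0.25, 0.50)
                    successors.append(s)
        successors.sort(key=evaluate)    # low entropy = pure clusters = better
        return successors[:L]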

  21. 6. Related Distance Function Learning Research • Interactive approaches that use user feedback and reinforcement learning to derive a good distance function. • Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering. • Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.

  22. 7. Summary • Described an approach that employs clustering for distance function evaluation. • Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance. • The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets that were tested. • The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04]. • The general idea of replacing a dataset by cluster representatives to enhance NN classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting. • Distance function learning is quite time consuming; one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value; other techniques we are currently investigating are significantly slower; therefore, we are moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.

  23. Links to 4 Papers • [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf • [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/~ceick/kdd/RE05.doc • [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf • [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf

  24. Questions?

  25. Randomized Hill Climbing • Fast start: the algorithm starts from a small neighborhood size until it cannot find any better solutions; then it increases its neighborhood size threefold, hoping that a better solution can be found by trying more points. • Shoulder condition: when the algorithm has moved onto a shoulder or flat hill, it keeps getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever.

  26. Randomized Hill Climbing. [Figure: objective function plotted over the state space, illustrating a flat hill and a shoulder.]

  27. Purity in clusters obtained (internal)

  28. Purity in clusters obtained (internal)

  29. Different Forms of Clustering Objectives. Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).

  30. A Fitness Function for Supervised Clustering. q(X) := Impurity(X) + β*Penalty(k), where k is the number of clusters used, n is the number of examples in the dataset, c is the number of classes in the dataset, and β is the weight for Penalty(k), with 0 < β ≤ 2.0. Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large.
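The slide's penalty formula is not reproduced in this transcript; a sub-linear choice consistent with the description above (and with my reading of the supervised clustering work [EZZ04], so it should be checked against that paper) is:

    q(X) = Impurity(X) + β · Penalty(k)
    Penalty(k) = sqrt((k − c) / n)  if k > c,   and Penalty(k) = 0  if k ≤ c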
