
Using Representative-Based Clustering For Nearest Neighbour Dataset Editing


Presentation Transcript


1. Using Representative-Based Clustering For Nearest Neighbour Dataset Editing
Christoph F. Eick, Nidal Zeidat, Ricardo Vilalta
Department of Computer Science, University of Houston, Texas, USA

Organization of the Talk:
• Dataset Editing and Condensing
• Representative-based Supervised Clustering
• Experimental Results
• Applications of Supervised Clustering
• Summary and Conclusion

2. 1. Introduction: Nearest Neighbour Editing
• Consider a two-class problem where each sample consists of two measurements (x, y).
• k = 1: for a given query point q, assign the class of the nearest neighbour.
• k = 3: compute the k nearest neighbours and assign the class by majority vote.
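To make the rule above concrete, here is a minimal, self-contained sketch of a k-NN classifier in Python. It is an illustration only, not the authors' code; the helper name knn_classify and the toy points are made up for this example and are reused in the later sketches.

# Minimal k-NN sketch: Euclidean distance, majority vote among the k nearest.
from collections import Counter
import math

def knn_classify(query, points, labels, k=1):
    """Return the majority class among the k training points closest to query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# k = 1: class of the single nearest neighbour; k = 3: majority vote of three.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.9, 0.8), points, labels, k=1))  # -> B
print(knn_classify((0.9, 0.8), points, labels, k=3))  # -> B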

3. Dataset Reduction: Editing
• Training data may contain noise and overlapping classes.
• Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
• Main goal of editing: enhance the accuracy of the classifier (% of "unseen" examples classified correctly).
• Secondary goal of editing: enhance the speed of a k-NN classifier.

4. Wilson Editing (figure provided by David Claus)
• Remove points that do not agree with the majority of their k nearest neighbours (sketched below).
• Therefore, only points that are classified incorrectly are removed.
[Figure: the earlier example and an overlapping-classes example, each showing the original data and the result of Wilson editing with k = 7.]
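A hedged sketch of the Wilson editing rule described on this slide, reusing the knn_classify() helper from the earlier sketch; the function name wilson_edit and the default k = 7 (matching the figure) are choices made for this illustration.

def wilson_edit(points, labels, k=7):
    """Keep only points whose class agrees with the majority of their k nearest neighbours."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        other_points = [points[j] for j in range(len(points)) if j != i]
        other_labels = [labels[j] for j in range(len(points)) if j != i]
        if knn_classify(p, other_points, other_labels, k=k) == y:
            kept_points.append(p)      # correctly classified -> retained
            kept_labels.append(y)      # misclassified points are dropped
    return kept_points, kept_labels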

5. Dataset Reduction: Condensing (figure provided by David Claus)
• Aim is to reduce the number of training samples, giving more speed.
• Retain only the samples that are needed to define the decision boundary.
• Tends to remove examples that are classified correctly by a k-NN classifier.
• Decision Boundary Consistent: a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set.
• Minimum Consistent Set: the smallest subset of the training data that correctly classifies all of the original training data.
[Figure: the original data and a minimum consistent set.]
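The slide defines consistency but does not name a condensing algorithm; one classic choice is Hart's condensed nearest-neighbour rule, sketched below under that assumption. It produces a consistent subset (every original point is classified correctly by 1-NN over the subset), though not necessarily the minimum consistent set.

def condense(points, labels):
    """Hart-style condensing: grow a subset until 1-NN over it classifies all points correctly."""
    subset_pts, subset_lbls = [points[0]], [labels[0]]
    changed = True
    while changed:
        changed = False
        for p, y in zip(points, labels):
            if knn_classify(p, subset_pts, subset_lbls, k=1) != y:
                subset_pts.append(p)     # points near the decision boundary get absorbed
                subset_lbls.append(y)
                changed = True
    return subset_pts, subset_lbls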

6. Objectives of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).

7. 2. Representative-Based Supervised Clustering (RSC)
• Aims at finding a set of objects in the data set (called representatives) that best represent the objects in the data set. Each representative corresponds to a cluster.
• The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of the closest representative (sketched below).
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
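The clustering step described on this slide, assigning every object to its closest representative, can be sketched as follows; representatives are given as indices into the list of points, a convention chosen for this illustration.

import math

def assign_to_representatives(points, reps):
    """Cluster objects around representatives: each point joins its closest representative."""
    clusters = {r: [] for r in reps}
    for i, p in enumerate(points):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        clusters[nearest].append(i)
    return clusters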

8. Representative-Based Supervised Clustering … (Continued)
[Figure: example dataset plotted over Attribute1 and Attribute2, with four representatives labelled 1–4.]

9. Representative-Based Supervised Clustering … (Continued)
[Figure: the same dataset over Attribute1 and Attribute2 with representatives 1–4.]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

10. RSC → Dataset Editing
[Figure: (a) dataset clustered using supervised clustering, with representatives labelled A–F over Attribute1 and Attribute2; (b) the same dataset edited using the cluster representatives.]
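Supervised Clustering Editing, as pictured in panel (b), keeps only the chosen representatives as the new training set and classifies queries by 1-NN over that reduced set. A small sketch under that reading, keeping each representative with its own class label:

def edit_with_representatives(points, labels, reps):
    """Edited training set = the representatives themselves, with their class labels."""
    edited_points = [points[r] for r in reps]
    edited_labels = [labels[r] for r in reps]
    return edited_points, edited_labels

# A query is then classified against the edited set, e.g.:
# knn_classify(query, edited_points, edited_labels, k=1)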

11. A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
where:
• k: number of clusters used
• n: number of examples in the dataset
• c: number of classes in the dataset
• β: weight for Penalty(k), 0 < β ≤ 2.0
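The slide gives q(X) and its parameters but not the exact forms of Impurity(X) and Penalty(k); the sketch below assumes the forms used in the accompanying paper (fraction of minority examples, and sqrt((k - c) / n) once k reaches the number of classes), so treat those two formulas as assumptions of this illustration.

import math
from collections import Counter

def impurity(clusters, labels, n):
    # Fraction of examples not in the majority class of their cluster (assumed form).
    # Clusters produced by assign_to_representatives() are never empty: each
    # representative is assigned to its own cluster.
    minority = 0
    for members in clusters.values():
        counts = Counter(labels[i] for i in members)
        minority += len(members) - counts.most_common(1)[0][1]
    return minority / n

def q(clusters, labels, n, c, beta):
    # q(X) := Impurity(X) + beta * Penalty(k); the Penalty form below is an assumption.
    k = len(clusters)
    penalty = math.sqrt((k - c) / n) if k >= c else 0.0
    return impurity(clusters, labels, n) + beta * penalty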

12. SC Algorithms Currently Investigated
• Supervised Partitioning Around Medoids (SPAM)
• Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR); a simplified sketch follows below
• Top Down Splitting Algorithm (TDS)
• Supervised Clustering using Evolutionary Computing (SCEC)
• Agglomerative Hierarchical Supervised Clustering (AHSC)
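Of these, SRIDHCR is the algorithm used in the experiments that follow. Below is a simplified, hedged sketch of that search, built on the assign_to_representatives() and q() helpers sketched earlier: start from a random representative set, evaluate every single insertion or deletion, greedily take the best improving move, and restart a few times. The initial set size and the number of restarts are assumptions, not the authors' exact settings.

import random

def sridhcr(points, labels, c, beta, restarts=5):
    """Steepest-descent hill climbing over representative sets, with random restarts."""
    n = len(points)

    def evaluate(reps):
        clusters = assign_to_representatives(points, sorted(reps))
        return q(clusters, labels, n, c, beta)

    best_reps, best_q = None, float("inf")
    for _ in range(restarts):
        reps = set(random.sample(range(n), c + 1))   # assumed initial set size
        current_q = evaluate(reps)
        improved = True
        while improved:
            improved = False
            # All solutions reachable by inserting or deleting a single representative.
            neighbours = [reps | {i} for i in range(n) if i not in reps]
            neighbours += [reps - {r} for r in reps if len(reps) > 1]
            cand = min(neighbours, key=evaluate)     # steepest descent: best neighbour
            cand_q = evaluate(cand)
            if cand_q < current_q:
                reps, current_q, improved = cand, cand_q, True
        if current_q < best_q:
            best_reps, best_q = set(reps), current_q
    return sorted(best_reps)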

13. 3. Experimental Evaluation
• We compared a traditional 1-NN classifier, a 1-NN classifier using Wilson editing, Supervised Clustering Editing (SCE), and C4.5 (run with its default parameter settings).
• A benchmark of 8 UCI datasets was used for this purpose.
• Accuracies were computed using 10-fold cross validation.
• SRIDHCR was used for supervised clustering.
• SCE was tested at different compression rates by associating different penalties with the number of clusters found (setting parameter β to 0.1, 0.4, and 1.0).
• Compression rates of SCE and Wilson editing were computed as 1 - (k/n), where n is the size of the original dataset and k is the size of the edited dataset.
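One way to reproduce the bookkeeping described above is sketched here, using scikit-learn only as a convenience for the folds and the 1-NN classifier. The name edit_fn stands in for Wilson editing or SCE, X and y are assumed to be NumPy arrays, and applying the editing inside each training fold is an assumption about the protocol rather than a detail stated on the slide.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_editing(X, y, edit_fn, n_splits=10):
    """Mean 10-fold accuracy and compression rate of 1-NN on an edited training set."""
    accs, rates = [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_ed, y_ed = edit_fn(X_tr, y_tr)                 # edited training set
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_ed, y_ed)
        accs.append(clf.score(X[test_idx], y[test_idx]))
        rates.append(1.0 - len(X_ed) / len(X_tr))        # compression = 1 - (k / n)
    return float(np.mean(accs)), float(np.mean(rates))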

  14. Table 2: Prediction Accuracy for the four classifiers.

  15. Table 3: Dataset Compression Rates for SCE and Wilson Editing.

16. 4. Applications of Supervised Clustering
• Enhance classification algorithms.
• Use SC for Dataset Editing to enhance NN classifiers.
• Improve simple classifiers.
• Learning sub-classes.
• Distance function learning.
• Dataset compression/reduction.
• Redistricting.
• Meta learning / creating signatures for datasets.

17. 5. Summary
• Wilson editing enhances the accuracy of a traditional 1-NN classifier for six of the eight datasets tested. It achieved compression rates of approximately 25%, but much lower compression rates for "easy" datasets.
• SCE achieved very high compression rates without loss in accuracy for 6 of the 8 datasets tested.
• SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
• Surprisingly, many UCI datasets can be compressed by using just a single representative per class without a significant loss in accuracy.
• SCE tends to pick representatives that lie in the center of a region dominated by a single class; it removes both correctly and incorrectly classified examples from the dataset, which explains its much higher compression rates.

18. Current Direction of this Research
[Diagram: a preprocessing step p maps the original Data Set to Data Set'; an inductive learning algorithm (IDLA) applied to Data Set yields classifier C and, applied to Data Set', yields classifier C'.]
Goal: Find p such that C' is more accurate than C, or such that C and C' have approximately the same accuracy but C' can be learnt more quickly and/or classifies new examples more quickly.
Currently investigated: different editing techniques and techniques that originate from high-performance clustering algorithms (e.g. CURE).

19. Links to 4 Related Papers
• [VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
• [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
• [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in revision, to be submitted to MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf
• [EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf

20. Multi-edit (figure provided by David Claus)
Multi-edit [Devijver & Kittler '79]: repeatedly apply Wilson editing to random partitions and classify with the 1-NN rule; the procedure approximates the error rate of the Bayes decision rule. A sketch follows below.
1. Diffusion: divide the data into N ≥ 3 random subsets.
2. Classification: classify S_i using 1-NN with S_((i+1) mod N) as the training set (i = 1..N).
3. Editing: discard all samples incorrectly classified in step 2.
4. Confusion: pool all remaining samples into a new set.
5. Termination: if the last I iterations produced no editing, stop; otherwise go to step 1.
[Figure: multi-edit after 8 iterations; the last 3 iterations produced no change.]
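The five steps above translate almost directly into code; the sketch below reuses the knn_classify() helper from the first sketch, with n_subsets playing the role of N and patience the role of I (both parameter names are choices made for this illustration).

import random

def multi_edit(points, labels, n_subsets=3, patience=3):
    """Repeated editing over random partitions, stopping after `patience` quiet iterations."""
    data = list(zip(points, labels))
    quiet = 0                                             # consecutive iterations with no editing
    while quiet < patience and len(data) > n_subsets:
        random.shuffle(data)                              # 1. diffusion
        subsets = [data[i::n_subsets] for i in range(n_subsets)]
        kept = []
        for i, subset in enumerate(subsets):
            train = subsets[(i + 1) % n_subsets]          # 2. classify with S_((i+1) mod N)
            tr_pts = [p for p, _ in train]
            tr_lbls = [y for _, y in train]
            kept += [(p, y) for p, y in subset
                     if knn_classify(p, tr_pts, tr_lbls, k=1) == y]   # 3. editing
        quiet = quiet + 1 if len(kept) == len(data) else 0
        data = kept                                       # 4. confusion: pool survivors
    return [p for p, _ in data], [y for _, y in data]     # 5. termination via loop guard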
