Supervised Clustering --- Algorithms and Applications
Christoph F. Eick
Department of Computer Science
University of Houston
Organization of the Talk
Supervised Clustering
Representative-based Supervised Clustering Algorithms
Applications: Using Supervised Clustering for
Dataset Editing
Class Decomposition
Distance Function Learning
Region Discovery in Spatial Datasets
Other Activities I am Involved With
Ch. Eick
Objective of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (both expressed by a fitness function q(X)).
[Figure: example dataset over Attribute1/Attribute2 with Ford (trucks, vans, SUV) and GMC (trucks, van, SUV) examples grouped into class-pure clusters.]
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
[Figure: two clusterings of the same dataset over Attribute1/Attribute2, each defined by four representatives (labeled 1-4).]
Objective of RSC: find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Remark: For a more detailed discussion of SCEC and SRIDHCR see [EZZ04]
q(X) := Impurity(X) + β*Penalty(k)
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
Penalty(k) increases sub-linearly in k, because increasing the number of clusters from k to k+1 has a greater effect on the result when k is small than when it is large.
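The fitness function can be sketched directly from these definitions. A minimal Python sketch, assuming impurity is the fraction of examples outside their cluster's majority class and using sqrt((k-c)/n) as one sub-linear choice of Penalty(k) (the exact form of the penalty is given in [EZZ04]):

```python
import math
from collections import Counter

def impurity(labels, cluster_ids):
    """Fraction of examples that do not belong to the majority class of their cluster."""
    n = len(labels)
    minority = 0
    for c in set(cluster_ids):
        members = [labels[i] for i in range(n) if cluster_ids[i] == c]
        minority += len(members) - Counter(members).most_common(1)[0][1]
    return minority / n

def q(labels, cluster_ids, n_classes, beta=1.0):
    """q(X) = Impurity(X) + beta * Penalty(k); sub-linear penalty, 0 when k <= c."""
    n = len(labels)
    k = len(set(cluster_ids))
    penalty = math.sqrt((k - n_classes) / n) if k > n_classes else 0.0
    return impurity(labels, cluster_ids) + beta * penalty
```

With two pure clusters and k = c the value is 0; extra clusters or impure clusters both raise q(X).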
Algorithm SRIDHCR (Greedy Hill Climbing)
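The slide names the algorithm but not its steps; the sketch below shows the greedy hill-climbing idea suggested by the name (single-representative insertion/deletion), with the details mine rather than taken from [EZZ04]:

```python
def sridhcr(objects, evaluate, initial_reps, max_iter=100):
    """Greedy hill climbing over representative sets: repeatedly try
    adding or removing a single representative, and move to the best
    neighbouring set as long as q improves (lower is better)."""
    reps = set(initial_reps)
    best_q = evaluate(reps)
    for _ in range(max_iter):
        best_move, best_move_q = None, best_q
        for o in objects:  # neighbourhood: one insertion or one deletion
            cand = reps | {o} if o not in reps else reps - {o}
            if not cand:
                continue
            cand_q = evaluate(cand)
            if cand_q < best_move_q:
                best_move, best_move_q = cand, cand_q
        if best_move is None:  # local optimum reached
            break
        reps, best_q = best_move, best_move_q
    return reps, best_q
```

The full algorithm in [EZZ04] additionally restarts from several random initial representative sets.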
Supervised Clustering using Evolutionary Computing: SCEC
[Flow chart: SCEC maintains a population of PS solutions over N generations. Each generation: compose the population S; for each S[i], compute a clustering and evaluate it; record the best solution and its quality Q seen so far; then create the next generation, filling each new S'[i] by K-tournament selection followed by mutation, crossover, or copy. On exit, the result is the best solution, its quality Q, and a summary.]
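The loop in the flow chart can be sketched as follows; the genetic operators and parameter defaults are placeholders, not the exact ones used in SCEC:

```python
import random

def scec(population_size, n_generations, random_solution, evaluate,
         mutate, crossover, k=2, p_mut=0.4, p_cross=0.3):
    """Evolutionary search over candidate solutions (lower q is better):
    evaluate each solution, record the best ever seen, and build the next
    generation by k-tournament selection followed by mutation/crossover/copy."""
    population = [random_solution() for _ in range(population_size)]
    best, best_q = None, float('inf')
    for _ in range(n_generations):
        scores = [evaluate(s) for s in population]
        for s, sc in zip(population, scores):
            if sc < best_q:
                best, best_q = s, sc          # record best solution, Q
        def tournament():
            idx = random.sample(range(population_size), k)
            return population[min(idx, key=lambda i: scores[i])]
        nxt = []
        while len(nxt) < population_size:     # create next generation
            r = random.random()
            if r < p_mut:
                nxt.append(mutate(tournament()))
            elif r < p_mut + p_cross:
                nxt.append(crossover(tournament(), tournament()))
            else:
                nxt.append(tournament())      # copy
        population = nxt
    return best, best_q
```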
Using Supervised Clustering for Dataset Editing
Consider a two-class problem where each sample consists of two measurements (x, y). For a given query point q:
k = 1: assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a "good" distance function.
[Figure panels: the earlier example; overlapping classes.]
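A minimal sketch of the nearest-neighbour rule just described, assuming Euclidean distance:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest neighbours.
    train: list of ((x, y), label) pairs; Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 the vote degenerates to the single nearest neighbour's class.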
[Figure: the original data, and the same data after Wilson editing with k=7, which removes every example misclassified by its k nearest neighbours.]
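Wilson editing removes every example that disagrees with the k-NN vote of the remaining examples; a minimal sketch, assuming Euclidean distance on 2D points:

```python
import math
from collections import Counter

def wilson_edit(data, k=7):
    """Wilson editing: remove every example that is misclassified by a
    k-NN vote among the other examples; returns the edited dataset.
    data: list of ((x, y), label) pairs."""
    kept = []
    for i, (pt, label) in enumerate(data):
        others = [data[j] for j in range(len(data)) if j != i]
        others.sort(key=lambda p: math.dist(p[0], pt))
        vote = Counter(l for _, l in others[:k]).most_common(1)[0][0]
        if vote == label:
            kept.append((pt, label))
    return kept
```

Points that lie deep inside the opposite class (like the lone x below) are discarded; points consistent with their neighbourhood survive.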
[Figure: panels (a) and (b) over Attribute1/Attribute2; cluster representatives are labeled A-F.]
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.
Remark: For a more detailed evaluation of SCE, Wilson Editing, and other editing techniques see [EZV04] and [ZWE05].
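Editing with cluster representatives amounts to keeping only the representatives (with their labels) and classifying new points by 1-NN against that much smaller set; a minimal sketch, assuming the representatives have already been chosen by supervised clustering:

```python
import math

def edit_with_representatives(data, representative_indices):
    """Keep only the chosen cluster representatives (with their class labels).
    data: list of ((x, y), label) pairs."""
    return [data[i] for i in representative_indices]

def nn_classify(edited, query):
    """1-NN classification against the edited dataset."""
    return min(edited, key=lambda p: math.dist(p[0], query))[1]
```

The classifier built on the edited set is both smaller and faster to query, which is the point of editing.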
[Figure: editing step p maps Data Set to Data Set'; an inductive learning algorithm (IDLA) learns classifier C from Data Set and classifier C' from Data Set'.]
Goal: find p such that C' is more accurate than C, or C and C' have approximately the same accuracy but C' can be learnt more quickly and/or classifies new examples more quickly.
Approaches to discover subclasses of a given class:
Figure 4. Supervised clustering editing vs. clustering each class (x and o) separately.
Remark: A traditional clustering algorithm, such as k-medoids, is "blind" to how the examples of other classes are distributed, and may therefore pick a representative that attracts points of class x, which leads to misclassifications; supervised clustering takes the class distribution into account and avoids such representatives, which is what matters for editing.
3c. Using Clustering in Distance Function Learning
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age,…)
Task: Define Patient Similarity
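One common way to define such a similarity is a weighted per-attribute distance; the attribute weights and value ranges below are purely hypothetical illustrations, not values from the talk:

```python
def patient_distance(a, b, weights, ranges):
    """Weighted distance over Patient attributes: numeric attributes are
    scaled by their value range; categorical ones contribute 0/1.
    a, b: dicts of attribute values."""
    d = 0.0
    for attr, w in weights.items():
        if attr in ranges:                           # numeric attribute
            d += w * abs(a[attr] - b[attr]) / ranges[attr]
        else:                                        # categorical attribute
            d += w * (0.0 if a[attr] == b[attr] else 1.0)
    return d

# Hypothetical weights and ranges, for illustration only:
weights = {'weight': 0.2, 'height': 0.2, 'cancer-sev': 0.3,
           'age': 0.2, 'eye-color': 0.1}
ranges = {'weight': 150.0, 'height': 0.8, 'cancer-sev': 3.0, 'age': 90.0}
```

Learning good values for such weights from clusters is exactly what the following slides are about.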
CAL-FULL/UH Database Clustering & Similarity Assessment Environments
[Figure: architecture --- a Data Extraction Tool pulls object views from the DBMS; a Clustering Tool, drawing on a library of clustering algorithms, produces a set of clusters from training data; a Similarity Measure Tool, drawing on a library of similarity measures, and a Learning Tool derive the similarity measure, guided by type and weight information, default choices, and domain information supplied through the User Interface.]
For more details see [RE05].
Weight Updating Scheme / Search Strategy
[Figure: the learning loop --- a distance function Q induces a clustering X; the clustering evaluation q(X) measures the goodness of the distance function Q, which in turn drives the weight-updating scheme / search strategy that adjusts Q. A "bad" distance function Q1 yields clusters that mix o and x examples; a "good" distance function Q2 yields nearly pure clusters.]
o := examples belonging to the majority class; x := non-majority-class examples.
Idea: move examples of the majority class closer to each other.
Cluster 1, distances with respect to Att1: xo oo ox --- the majority-class examples lie close together. Action: increase the weight of Att1.
Cluster 1, distances with respect to Att2: o o xx o o --- the majority-class examples are spread out. Action: decrease the weight of Att2.
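The idea above can be sketched as a per-cluster weight update: attributes along which the majority-class examples are relatively tight get a larger weight, the others a smaller one. A sketch under my own assumptions (average pairwise gap as the tightness measure, multiplicative updates), not the exact inside/outside scheme of [ERBV04]:

```python
from collections import Counter
from itertools import combinations

def update_weights(cluster, weights, alpha=0.1):
    """One weight-updating step for a single cluster.
    cluster: list of (features_dict, label). For each attribute, compare
    the average gap among majority-class examples with the average gap
    over all examples; reward attributes where the majority class is
    relatively tight, penalize the others."""
    majority = Counter(l for _, l in cluster).most_common(1)[0][0]
    maj_pts = [f for f, l in cluster if l == majority]
    all_pts = [f for f, _ in cluster]
    new_w = dict(weights)
    for attr in weights:
        def avg_gap(pts):
            pairs = list(combinations(pts, 2))
            return sum(abs(p[attr] - q[attr]) for p, q in pairs) / len(pairs)
        if avg_gap(maj_pts) < avg_gap(all_pts):
            new_w[attr] = weights[attr] * (1 + alpha)   # majority class tight
        else:
            new_w[attr] = weights[attr] * (1 - alpha)   # majority class spread out
    return new_w
```

On data shaped like the slide's example (o's tight along Att1, spread along Att2), the Att1 weight grows and the Att2 weight shrinks.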
Graph produced by Abraham Bagherjeiran
Distance Function Evaluation      Weight-Updating Scheme / Search Strategy
K-Means                           Inside/Outside Weight Updating [ERBV04]
Supervised Clustering             Randomized Hill Climbing
Work by Karypis                   …
NN-Classifier                     Adaptive Clustering [BECV05]
Other Research                    …
Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function; examples of region discovery include:
Remark: We use (supervised) clustering to discover such regions; regions are implicitly defined by the set of points that belong to a cluster.
Example: 2 clusters in red and blue are given; regions are defined by using a Voronoi diagram based on an NN classifier with k=7; regions are in grey and white.
Let
prior(C) = |C|/n
p(c,C) = percentage of examples in c that belong to class C
Reward(c) is computed from p(c,C) and prior(C) using the parameters g1, g2, R+, R- (g1 ≤ 1 ≤ g2; R+, R- ≥ 0) and the interpolation function t below (e.g. g1=0.8, g2=1.2, R+=1, R-=1):
q_C(X) = Σ_{c∈X} t(p(c,C), prior(C), g1, g2, R+, R-) * |c|^b / n
with b > 1 (typically 1.0001 < b < 2); the idea is that increases in cluster size are rewarded non-linearly, favoring clusters with more points as long as |c|*t(…) increases.
[Figure: the interpolation function t, plotting Reward(c) against p(c,C): the reward approaches R- as p(c,C) falls toward 0, is 0 for p(c,C) between prior(C)*g1 and prior(C)*g2, and approaches R+ as p(c,C) rises toward 1.]
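A sketch of this reward scheme, assuming a piecewise-linear interpolation function t (no reward between prior(C)*g1 and prior(C)*g2, rising to R+ at p=1 and to R- at p=0; the exact shape used in the talk may differ):

```python
def t(p, prior, g1, g2, r_plus, r_minus):
    """Interpolated reward for a cluster with class-C purity p."""
    lo, hi = prior * g1, prior * g2
    if p >= hi:                       # purer than expected: hotspot reward
        return r_plus * (p - hi) / (1 - hi) if hi < 1 else r_plus
    if p <= lo:                       # far below expected: coldspot reward
        return r_minus * (lo - p) / lo if lo > 0 else r_minus
    return 0.0                        # near the prior: no reward

def q_C(clusters, n, prior, g1=0.8, g2=1.2, r_plus=1.0, r_minus=1.0, b=1.01):
    """q_C(X) = sum over clusters c of t(p(c,C), ...) * |c|**b / n.
    clusters: list of (size, purity) pairs."""
    return sum(t(purity, prior, g1, g2, r_plus, r_minus) * size**b / n
               for size, purity in clusters)
```

Because b > 1, one large pure cluster scores higher than several small ones of the same total size and purity.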
Other Activities I am Involved With
[Figure: adaptive clustering architecture --- a Clustering Algorithm produces a Clustering and a Summary; an Evaluation System, using predefined fitness functions (q(X), …) and quality judgments from a Domain Expert, sends feedback to an Adaptation System, which uses Past Experience to change the Inputs of the Clustering Algorithm.]
Idea: development of a generic Clustering/Feedback/Adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or externally given reward function (for some initial ideas see [BECV05]).
Inputs that can be changed:
Data Set Examples
Data Set Feature Representation
Distance Function
Clustering Algorithm Parameters
Fitness Function Parameters
Background Knowledge
[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.
http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
[RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).
http://www.cs.uh.edu/~ceick/kdd/RE05.doc
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005.
http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf
[ZWE05] N. Zeidat, S. Wang, C. Eick, Editing Techniques: a Comparative Study, submitted for publication.
http://www.cs.uh.edu/~ceick/kdd/ZWE05.pdf
[BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication.
http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf