Supervised clustering algorithms and applications

Supervised Clustering --- Algorithms and Applications. Christoph F. Eick Department of Computer Science University of Houston Organization of the Talk Supervised Clustering Representative-based Supervised Clustering Algorithms Applications: Using Supervised Clustering for Dataset Editing


Supervised Clustering --- Algorithms and Applications

Christoph F. Eick

Department of Computer Science

University of Houston

Organization of the Talk

Supervised Clustering

Representative-based Supervised Clustering Algorithms

Applications: Using Supervised Clustering for

Dataset Editing

Class Decomposition

Distance Function Learning

Region Discovery in Spatial Datasets

Other Activities I am Involved With


List of Persons that Contributed to the Work Presented in Today’s Talk

  • Tae-Wan Ryu (former PhD student; now faculty member Cal State Fullerton)

  • Ricardo Vilalta (colleague at UH since 2002; Co-Director of the UH’s Data Mining and Knowledge Discovery Group)

  • Murali Achari (former Master student)

  • Alain Rouhana (former Master student)

  • Abraham Bagherjeiran (current PhD student)

  • Chunshen Chen (current Master student)

  • Nidal Zeidat (current PhD student)

  • Sujing Wang (current PhD student)

  • Kim Wee (current MS student)

  • Zhenghong Zhao (former Master student)


Traditional Clustering

  • Partition a set of objects into groups of similar objects. Each group is called a cluster.

  • Clustering is used to “detect classes” in a data set (“unsupervised learning”).

  • Clustering is based on a fitness function that relies on a distance measure and usually tries to create “tight” clusters.


Different Forms of Clustering

Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).


Motivation: Finding Subclasses using SC

[Figure: scatter plot over Attribute1 and Attribute2; the examples of the classes Ford and GMC form the subclasses Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, and GMC SUV.]


Related Work: Supervised Clustering

  • Sinkkonen’s [SKN02] discriminative clustering and Tishby’s information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering algorithms.

  • There has been a lot of work in the area of semi-supervised clustering that centers on clustering with background information. Although the focus of this work is traditional clustering, there is still a lot of similarity between techniques and algorithms they investigate and the techniques and algorithms we investigate.


2. Representative-Based Supervised Clustering

  • Aims at finding a set of objects among all objects (called representatives) in the data set that best represent the objects in the data set. Each representative corresponds to a cluster.

  • The remaining objects in the data set are then clustered around these representatives by assigning objects to the cluster of the closest representative.

    Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
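The assignment step described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and 2D points; the function name `assign_to_representatives` and the sample data are hypothetical and not taken from the talk.

```python
import math

def assign_to_representatives(points, rep_indices):
    """Return, for each point, the index (into points) of its closest representative."""
    assignment = []
    for p in points:
        # Each object joins the cluster of the closest representative.
        closest = min(rep_indices, key=lambda r: math.dist(p, points[r]))
        assignment.append(closest)
    return assignment

# Hypothetical data: objects 0 and 2 act as representatives.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (0.1, 0.3)]
print(assign_to_representatives(points, rep_indices=[0, 2]))
```

Each representative is itself an object of the data set, so a clustering is fully described by the chosen set of representative indices.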


Representative-Based Supervised Clustering … (Continued)

[Figure: dataset plotted over Attribute1 and Attribute2, with four representatives labeled 1 to 4.]


Representative-Based Supervised Clustering … (continued)

[Figure: dataset plotted over Attribute1 and Attribute2, with four representatives labeled 1 to 4.]

Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).


SC Algorithms Currently Investigated

  • Supervised Partitioning Around Medoids (SPAM).

  • Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).

  • Top Down Splitting Algorithm (TDS).

  • Supervised Clustering using Evolutionary Computing (SCEC)

  • Agglomerative Hierarchical Supervised Clustering (AHSC)

  • Grid-Based Supervised Clustering (GRIDSC)

Remark: For a more detailed discussion of SCEC and SRIDHCR see [EZZ04]


A Fitness Function for Supervised Clustering

q(X) := Impurity(X) + β*Penalty(k)

k: number of clusters used

n: number of examples in the dataset

c: number of classes in a dataset.

β: Weight for Penalty(k), 0< β ≤2.0

Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large.
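The fitness function can be sketched in a few lines. The transcript does not spell out Impurity(X) or Penalty(k), so the definitions below are assumptions: Impurity(X) is taken as the fraction of examples that are not in the majority class of their cluster, and Penalty(k) as sqrt((k-c)/n) for k ≥ c and 0 otherwise, which is one sub-linear choice in the spirit of [EZZ04]. The function names and the sample labels are hypothetical.

```python
import math
from collections import Counter

def impurity(cluster_labels, class_labels):
    """Assumed definition: fraction of examples outside their cluster's majority class."""
    members = {}
    for cluster, label in zip(cluster_labels, class_labels):
        members.setdefault(cluster, []).append(label)
    minority = sum(len(labels) - Counter(labels).most_common(1)[0][1]
                   for labels in members.values())
    return minority / len(class_labels)

def penalty(k, c, n):
    """Assumed sub-linear penalty: sqrt((k - c) / n) when k >= c, else 0."""
    return math.sqrt((k - c) / n) if k >= c else 0.0

def q(cluster_labels, class_labels, beta=0.4):
    """q(X) := Impurity(X) + beta * Penalty(k)."""
    n = len(class_labels)
    k = len(set(cluster_labels))   # number of clusters used
    c = len(set(class_labels))     # number of classes in the dataset
    return impurity(cluster_labels, class_labels) + beta * penalty(k, c, n)

# Hypothetical clustering: 8 examples, 3 clusters, 2 classes, 1 minority example.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["a", "a", "b", "b", "b", "b", "a", "a"]
print(q(clusters, classes, beta=0.4))
```

With these assumptions a larger β pushes the search towards fewer clusters, while a smaller β tolerates more clusters in exchange for lower impurity.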


Algorithm SRIDHCR (Greedy Hill Climbing)

  • REPEAT r TIMES

    • curr := a randomly created set of representatives (with size between c+1 and 2*c)

    • WHILE NOT DONE DO

      • Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr

      • Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)

      • IF q(s)<q(curr) THEN curr:=s

        • ELSE IF q(s)=q(curr) AND |s|>|curr| THEN curr:=s

        • ELSE terminate and return curr as the solution for this run.

  • Report the best out of the r solutions found.

  • Highlights:

    • k is not an input parameter; SRIDHCR searches for the best k within the range that is induced by β.

    • Reports the best clustering found in r runs
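The pseudocode translates fairly directly into Python. The sketch below is illustrative only: it assumes Euclidean distance, an impurity defined as the fraction of minority examples, and a penalty of sqrt((k-c)/n) for k ≥ c (neither definition is given in the transcript); the names `q` and `sridhcr`, the seed handling, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def q(points, labels, reps, beta):
    """q(X) = Impurity(X) + beta * Penalty(k) for the clustering induced by reps."""
    n, c, k = len(points), len(set(labels)), len(reps)
    members = {r: [] for r in reps}
    for p, label in zip(points, labels):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        members[nearest].append(label)
    minority = sum(len(m) - Counter(m).most_common(1)[0][1]
                   for m in members.values() if m)
    penalty = math.sqrt((k - c) / n) if k >= c else 0.0  # assumed penalty form
    return minority / n + beta * penalty

def sridhcr(points, labels, beta=0.4, r=5, seed=0):
    rng = random.Random(seed)
    n, c = len(points), len(set(labels))
    best = None
    for _ in range(r):  # REPEAT r TIMES
        size = rng.randint(min(c + 1, n), min(2 * c, n))
        curr = frozenset(rng.sample(range(n), size))
        while True:
            # Neighbours: insert one non-representative or delete one representative.
            neighbours = [curr | {i} for i in range(n) if i not in curr]
            if len(curr) > 1:
                neighbours += [curr - {i} for i in curr]
            scored = [(q(points, labels, s, beta), s) for s in neighbours]
            q_min = min(score for score, _ in scored)
            s = rng.choice([s for score, s in scored if score == q_min])
            q_curr = q(points, labels, curr, beta)
            if q_min < q_curr or (q_min == q_curr and len(s) > len(curr)):
                curr = s
            else:
                break  # no improving neighbour: end of this run
        if best is None or q(points, labels, curr, beta) < q(points, labels, best, beta):
            best = curr
    return sorted(best)

# Hypothetical toy data: two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(sridhcr(points, labels, beta=0.4, r=3, seed=1))
```

Note that k never appears as a parameter: the number of representatives grows or shrinks one at a time, and β alone determines where that search settles.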


Supervised Clustering using Evolutionary Computing: SCEC

[Figure: an initial generation of solutions is transformed into the next generation by mutation, crossover, and copy; the best solution of the final generation is the result.]


[Figure: The complete flow chart of SCEC. Boxes: Initialize Solutions; Compose Population S; Evaluate a Population (Loop PS times: Clustering on S[i], Evaluation on S[i], Intermediate Result); Record Best Solution, Q; Create next Generation (Loop PS times: K-tournament, then Mutation, Crossover, or Copy to obtain New S’[i]); Loop N times; Best Solution, Q, Summary; Exit.]


Complex1 Dataset



Supervised Clustering --- Algorithms and Applications

Organization of the Talk

Supervised Clustering

Representative-based Supervised Clustering Algorithms

Applications: Using Supervised Clustering

for Dataset Editing

for Class Decomposition

for Distance Function Learning

for Region Discovery in Spatial Datasets

Other Activities I am Involved With


Nearest Neighbour Rule

Consider a two class problem where each sample consists of two measurements (x,y).

For a given query point q, assign the class of the nearest neighbour.

k = 1

Compute the k nearest neighbours and assign the class by majority vote.

k = 3

Problem: requires “good” distance function
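The nearest neighbour rule for k = 1 and k = 3 can be illustrated with a short sketch. It assumes Euclidean distance over the two measurements (x, y); the function name `knn_classify` and the sample points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k=1):
    """Assign the majority class among the k training samples nearest to query."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Hypothetical two-class problem; each sample consists of two measurements (x, y).
samples = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.0), (3.2, 2.9), (2.0, 2.1)]
labels = ["red", "red", "blue", "blue", "red"]

print(knn_classify((2.4, 2.4), samples, labels, k=1))  # nearest single neighbour decides
print(knn_classify((2.4, 2.4), samples, labels, k=3))  # majority vote of three neighbours
```

In this example the two calls disagree, because the single nearest neighbour belongs to a different class than the majority of the three nearest ones; both outcomes depend entirely on the chosen distance function.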


3a. Dataset Reduction: Editing

  • Training data may contain noise, overlapping classes

  • Editing seeks to remove noisy points and produce smooth decision boundaries – often by retaining points far from the decision boundaries

  • Main Goal of Editing: enhance the accuracy of classifier (% of “unseen” examples classified correctly)

  • Secondary Goal of Editing: enhance the speed of a k-NN classifier


Wilson Editing

  • Wilson 1972

  • Remove points that do not agree with the majority of their k nearest neighbours
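A minimal sketch of this rule follows, assuming Euclidean distance and that a point is not counted among its own neighbours; the function name `wilson_edit` and the data are hypothetical.

```python
import math
from collections import Counter

def wilson_edit(samples, labels, k=3):
    """Return indices of points that agree with the majority of their k nearest neighbours."""
    kept = []
    for i, (p, label) in enumerate(zip(samples, labels)):
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, samples[j]))[:k]
        majority = Counter(labels[j] for j in nearest).most_common(1)[0][0]
        if majority == label:
            kept.append(i)
    return kept

# Hypothetical data: two groups plus one noisy "b" point inside the "a" group.
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (5, 5), (5, 6), (6, 5), (6, 6),
           (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
print(wilson_edit(samples, labels, k=3))
```

Only the noisy point is dropped here; points deep inside a class region survive, which is why Wilson editing tends to yield modest compression.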

[Figure: two examples, the earlier example and overlapping classes, each showing the original data next to the result of Wilson editing with k=7.]


RSC Dataset Editing

[Figure: two plots over Attribute1 and Attribute2, with clusters labeled A to F. (a) Dataset clustered using supervised clustering. (b) Dataset edited using cluster representatives.]


Experimental Evaluation

  • We compared a traditional 1-NN, 1-NN using Wilson Editing, Supervised Clustering Editing (SCE), and C4.5 (that was run using its default parameter setting).

  • A benchmark consisting of 8 UCI datasets was used for this purpose.

  • Accuracies were computed using 10-fold cross validation.

  • SRIDHCR was used for supervised clustering.

  • SCE was tested using different compression rates by associating different penalties with the number of clusters found (by setting parameter β to: 0.1, 0.4 and 1.0).

  • Compression rates of SCE and Wilson Editing were computed using: 1-(k/n) with n being the size of the original dataset and k being the size of the edited dataset.
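As a small worked example of the compression-rate formula, the sketch below uses made-up sizes (they are not results from the evaluation); the function name `compression_rate` is hypothetical.

```python
def compression_rate(original_size, edited_size):
    """Compression rate 1 - (k / n): n is the original size, k the edited size."""
    return 1 - edited_size / original_size

# Hypothetical sizes: a 1000-example dataset edited down to 50 representatives.
print(compression_rate(original_size=1000, edited_size=50))
```

A rate close to 1 means that almost all examples were removed; a rate of 0 means the edited dataset is as large as the original.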




Summary: SCE and Wilson Editing

  • Wilson editing enhances the accuracy of a traditional 1-NN classifier for six of the eight datasets tested. It achieved compression rates of approx. 25%, but much lower compression rates for “easy” datasets.

  • SCE achieved very high compression rates without loss in accuracy for 6 of the 8 datasets tested.

  • SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.

  • Surprisingly, many UCI datasets can be compressed by just using a single representative per class without a significant loss in accuracy.

  • SCE tends to pick representatives that are in the center of a region that is dominated by a single class; it removes examples that are classified correctly as well as examples that are classified incorrectly from the dataset. This explains its much higher compression rates.

    Remark: For a more detailed evaluation of SCE, Wilson Editing, and other editing techniques see [EZV04] and [ZWE05].


Future Direction of this Research

[Figure: a transformation p maps Data Set to Data Set’; IDLA produces Classifier C from Data Set and Classifier C’ from Data Set’.]

Goal: Find p, such that C’ is more accurate than C, or C and C’ have approximately the same accuracy but C’ can be learnt more quickly and/or C’ classifies new examples more quickly.


Supervised Clustering vs. Clustering the Examples of Each Class Separately

O OOx x x

OOOx x x

O OOx x x

Approaches to discover subclasses of a given class:

  • Cluster the examples of each class separately

  • Use supervised clustering

Figure 4. Supervised clustering editing vs. clustering each class (x and o) separately.

Remark: The figure marks two different o examples. A traditional clustering algorithm, such as k-medoids, would pick the first one as the cluster representative, because it is “blind” to how the examples of other classes are distributed, whereas supervised clustering would pick the second one. Obviously, the first o is not a good choice for editing, because it attracts points of the class x, which leads to misclassifications.


[Figure: three plots over Attribute 1 and Attribute 2.]

  • Simple classifiers:

    • Encompass a small class of approximating functions.

    • Limited flexibility in their decision boundaries


Naïve Bayes vs. Naïve Bayes with Class Decomposition


3c. Using Clustering in Distance Function Learning

Example: How to Find Similar Patients?

The following relation is given (with 10000 tuples):

Patient(ssn, weight, height, cancer-sev, eye-color, age,…)

  • Attribute Domains

    • ssn: 9 digits

    • weight between 30 and 650; μ_weight=158, σ_weight=24.20

    • height between 0.30 and 2.20 in meters; μ_height=1.52, σ_height=19.2

    • cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor

    • eye-color: {brown, blue, green, grey}

    • age: between 3 and 100; μ_age=45, σ_age=13.2

      Task: Define Patient Similarity


CAL-FULL/UH Database Clustering & Similarity Assessment Environments

[Figure: architecture diagram. Components: User Interface, Data Extraction Tool, DBMS, Object View, Similarity Measure Tool (with a library of similarity measures, type and weight information, and default choices and domain information), Learning Tool (with training data), and Clustering Tool (with a library of clustering algorithms, a similarity measure, and a set of clusters as output). A label marks today’s topic.]

For more details: see [RE05]


Similarity Assessment Framework and Objectives

  • Objective: Learn a good distance function θ for classification tasks.

  • Our approach: Apply a clustering algorithm with the distance function θ to be evaluated that returns a number of clusters k. The purer the obtained clusters are, the better the quality of θ.

  • Our goal is to learn the weights of an object distance function θ such that all the clusters are pure (or as pure as possible); for more details see [ERBV04] and [BECV05].


Idea: Coevolving Clusters and Distance Functions

[Figure: feedback loop. A distance function Θ is used to cluster the data; the resulting clustering X is scored by the clustering evaluation q(X), which measures the goodness of the distance function Θ and feeds the weight updating scheme / search strategy. Two example clusters of o and x examples contrast a “bad” distance function Θ1 with a “good” distance function Θ2.]


Idea: Inside/Outside Weight Updating

o := examples belonging to majority class

x := non-majority-class examples

Idea: Move examples of the majority class closer to each other

Cluster1: distances with respect to Att1

xo oo ox

Action: Increase weight of Att1

Cluster1: distances with respect to Att2

o o xx o o

Action: Decrease weight for Att2
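One way to turn this idea into code is sketched below. The update rule is an assumption made for illustration, not necessarily the scheme of [ERBV04]: for each attribute, the average pairwise distance over all examples of the cluster (sigma) is compared with the average pairwise distance among its majority-class examples (mu), and the weight is scaled by 1 + alpha * (sigma - mu) / sigma. The names `mean_pairwise_distance`, `update_weights`, and `alpha`, and the coordinates chosen to mimic the two attribute layouts above, are hypothetical.

```python
from itertools import combinations

def mean_pairwise_distance(values):
    """Average absolute difference over all pairs of 1D values (needs >= 2 values)."""
    pairs = list(combinations(values, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def update_weights(weights, cluster, labels, alpha=0.3):
    """cluster: attribute vectors of one cluster; labels: class of each example."""
    majority = max(set(labels), key=labels.count)
    new_weights = []
    for i, w in enumerate(weights):
        all_values = [x[i] for x in cluster]
        majority_values = [x[i] for x, lab in zip(cluster, labels) if lab == majority]
        sigma = mean_pairwise_distance(all_values)      # whole cluster (assumed non-zero)
        mu = mean_pairwise_distance(majority_values)    # majority-class examples only
        # Majority examples closer together than average: increase; otherwise decrease.
        new_weights.append(w * (1 + alpha * (sigma - mu) / sigma))
    return new_weights

# Att1 layout "xo oo ox", Att2 layout "o o xx o o" (coordinates are illustrative).
labels = ["x", "o", "o", "o", "o", "x"]
cluster = [(0, 4), (1, 0), (3, 2), (4, 7), (6, 9), (7, 5)]
print(update_weights([1.0, 1.0], cluster, labels))
```

With this data the weight of Att1 goes up and the weight of Att2 goes down, matching the two actions listed above.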


Sample Run of IOWU for Diabetes Dataset

Graph produced by Abraham Bagherjeiran


Research Framework Distance Function Learning

[Figure: research framework matrix. Distance function evaluation: K-Means, Supervised Clustering, NN-Classifier. Weight-updating scheme / search strategy: Inside/Outside Weight Updating, Randomized Hill Climbing, Adaptive Clustering. Cells are annotated with [ERBV04], [BECV05], “Work by Karypis”, and “Other Research”.]


3d. Discovery of Interesting Regions for Spatial Data Mining

Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function; examples of region discovery include:

  • Discover regions that have significant deviations from the prior probability of a class; e.g. regions in the state of Wyoming where people are very poor or not poor at all

  • Discover regions that have significant variation in the income (fitness is defined based on the variance with respect to income in a region)

  • Discover regions for congressional redistricting

  • Discover congested regions for traffic control

    Remark: We use (supervised) clustering to discover such regions; regions are implicitly defined by the set of points that belong to a cluster.


Wyoming Map



Clusters → Regions

Example: 2 clusters in red and blue are given; regions are defined by using a Voronoi diagram based on a NN classifier with k=7; regions are in grey and white.


An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C

Let

prior(C) = |C|/n

p(c,C) = percentage of examples in cluster c that belong to class C

Reward(c) is computed based on p(c,C), prior(C), and the parameters γ1, γ2, R+, R- (γ1 ≤ 1 ≤ γ2; R+, R- ≥ 0), relying on the interpolation function τ (e.g. γ1=0.8, γ2=1.2, R+=1, R-=1):

$$q_C(X) = \sum_{c \in X} \frac{\left(\tau(p(c,C),\, prior(C),\, \gamma_1,\, \gamma_2,\, R^+,\, R^-) \cdot |c|\right)^{\beta}}{n}$$

with β>1 (typically, 1.0001<β<2); the idea is that increases in cluster size are rewarded nonlinearly, favoring clusters with more points as long as |c|*τ(…) increases.

[Figure: plot of Reward(c) = τ(p(c,C), prior(C), γ1, γ2, R+, R-) against p(c,C); the horizontal axis marks prior(C)*γ1, prior(C), prior(C)*γ2, and 1, and the vertical axis marks R+ and R-.]
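The evaluation scheme can be sketched as follows. The exact shape of the interpolation function is not given in the transcript, so the code assumes a piecewise-linear form: zero between prior(C)*γ1 and prior(C)*γ2, rising linearly to R- at p(c,C)=0 and to R+ at p(c,C)=1. It also assumes that the exponent applies to the product of the reward and the cluster size, that the clusters partition the dataset, and that prior(C)*γ2 < 1. The names `tau` and `q_c` and the sample clusters are hypothetical.

```python
def tau(p, prior, gamma1, gamma2, r_plus, r_minus):
    """Assumed piecewise-linear reward for a cluster with class fraction p."""
    low, high = prior * gamma1, prior * gamma2
    if p < low:                      # class C is under-represented
        return (low - p) / low * r_minus
    if p > high:                     # class C is over-represented
        return (p - high) / (1 - high) * r_plus
    return 0.0                       # close to the prior: not interesting

def q_c(clusters, target, gamma1=0.8, gamma2=1.2, r_plus=1.0, r_minus=1.0, beta=1.1):
    """clusters: list of lists of class labels that together cover the dataset."""
    n = sum(len(c) for c in clusters)
    prior = sum(c.count(target) for c in clusters) / n
    total = 0.0
    for c in clusters:
        p = c.count(target) / len(c)
        total += (tau(p, prior, gamma1, gamma2, r_plus, r_minus) * len(c)) ** beta
    return total / n

# Hypothetical clusterings of 8 examples with prior(C) = 0.5.
pure = [["C"] * 4, ["D"] * 4]
mixed = [["C", "D", "C", "D"], ["C", "D", "C", "D"]]
print(q_c(pure, "C"), q_c(mixed, "C"))
```

Clusters whose class fraction stays near the prior contribute nothing, so the mixed clustering scores zero, while the pure clustering is rewarded for both the over-represented and the under-represented region.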


Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets


Supervised Clustering --- Algorithms and Applications

Organization of the Talk

Supervised Clustering

Representative-based Supervised Clustering Algorithms

Applications: Using Supervised Clustering

for Dataset Editing

for Class Decomposition

for Distance Function Learning

for Region Discovery in Spatial Datasets

Other Activities I am Involved With


An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications

[Figure: architecture diagram. Components: Clustering Algorithm with its Inputs, Clustering, Summary, Evaluation System (quality, using predefined Fitness Functions q(X)), Domain Expert (feedback), Adaptation System (changes to the inputs), and Past Experience.]

Idea: Development of a Generic Clustering/Feedback/Adaptation Architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or an externally given reward function (for some initial ideas see [BECV05])


Clustering Algorithm Inputs

Data Set Examples

Data Set Feature Representation

Distance Function

Clustering Algorithm Parameters

Fitness Function Parameters

Background Knowledge


Research Topics 2005/2006

  • Inductive Learning/Data Mining

    • Decision trees, nearest neighbor classifiers

    • Using clustering to enhance classification algorithms

    • Making sense of data

  • Supervised Clustering

    • Learning subclasses

    • Supervised clustering algorithms that learn clusters with arbitrary shape

    • Using supervised clustering for region discovery

    • Adaptive clustering

  • Tools for Similarity Assessment and Distance Function Learning

  • Data Set Compression and Creating Meta Knowledge for Local Learning Techniques

    • Comparative studies

    • Creating maps and other data set signatures for datasets based on editing, SC, and other techniques

  • Traditional Clustering

  • Data Mining and Information Retrieval for Structured Data

  • Other: Evolutionary Computing, File Prediction, Ontologies, Heuristic Search, Reinforcement Learning, Data Models.

Remark: Topics that were “covered” in this talk are in blue


Links to 7 Papers

[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.

http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf

[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.

http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf

[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004.

http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf

[RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).

http://www.cs.uh.edu/~ceick/kdd/RE05.doc

[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005.

http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf

[ZWE05] N. Zeidat, S. Wang, C. Eick, Editing Techniques: a Comparative Study, submitted for publication.

http://www.cs.uh.edu/~ceick/kdd/ZWE05.pdf

[BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication.

http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf