
Semi-Supervised Clustering

Jieping Ye

Department of Computer Science and Engineering

Arizona State University

http://www.public.asu.edu/~jye02

Outline
  • Overview of clustering and classification
  • What is semi-supervised learning?
    • Semi-supervised clustering
    • Semi-supervised classification
  • Semi-supervised clustering
    • What is semi-supervised clustering?
    • Why semi-supervised clustering?
    • Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering
  • Unsupervised clustering: group similar objects together to find clusters
    • Minimize intra-cluster distance
    • Maximize inter-cluster distance
  • Supervised classification: the class label for each training sample is given
    • Build a model from the training data
    • Predict class labels for unseen future data points
What is clustering?
  • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: example clusters; intra-cluster distances are minimized, inter-cluster distances are maximized.]
Clustering algorithms
  • K-Means
  • Hierarchical clustering
  • Graph based clustering (Spectral clustering)
  • Bi-clustering
Classification algorithms
  • K-Nearest-Neighbor classifiers
  • Naïve Bayes classifier
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM)
  • Logistic Regression
  • Neural Networks
Semi-Supervised Learning

[Figure: semi-supervised learning sits between supervised classification and unsupervised clustering.]
  • Combines labeled and unlabeled data during training to improve performance:
    • Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
    • Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification
  • Algorithms:
    • Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
    • Co-training [Blum:COLT98]
    • Transductive SVMs [Vapnik:98, Joachims:ICML99]
    • Graph based algorithms
  • Assumptions:
    • Known, fixed set of categories given in the labeled data.
    • Goal is to improve classification of examples into these known categories.
Semi-supervised clustering: problem definition
  • Input:
    • A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
    • A small amount of domain knowledge
  • Output:
    • A partitioning of the objects into k clusters (possibly with some discarded as outliers)
  • Objective:
    • Maximum intra-cluster similarity
    • Minimum inter-cluster similarity
    • High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering?
  • Why not clustering?
    • The clusters produced may not be the ones required.
    • Sometimes there are multiple possible groupings.
  • Why not classification?
    • Sometimes there are insufficient labeled data.
  • Potential applications
    • Bioinformatics (gene and protein clustering)
    • Document hierarchy construction
    • News/email categorization
    • Image categorization
Semi-Supervised Clustering
  • Domain knowledge
    • Partial label information is given
    • Constraints (must-links and cannot-links) are given
  • Approaches
    • Search-based Semi-Supervised Clustering
      • Alter the clustering algorithm using the constraints
    • Similarity-based Semi-Supervised Clustering
      • Alter the similarity measure based on the constraints
    • Combination of both
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for a good partitioning by:
    • Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]
    • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
    • Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means) [Basu:ICML02]
Overview of K-Means Clustering
  • K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.

Algorithm:

Initialize K cluster centers randomly, then repeat until convergence:

    • Cluster Assignment Step: Assign each data point x to the cluster X_l whose center μ_l has the minimum L2 distance to x
    • Center Re-estimation Step: Re-estimate each cluster center μ_l as the mean of the points currently assigned to that cluster
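
For concreteness, here is a minimal NumPy sketch of these two alternating steps; the function and variable names are ours, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: alternate cluster assignment and center re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by sampling K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to its nearest center (L2).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                     # converged
        centers = new_centers
    return labels, centers
```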
K-Means Objective Function
  • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: $J = \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - \mu_l \rVert^2$, where $\mu_l$ is the center of cluster $X_l$
  • Initialization of K cluster centers:
    • Totally random
    • Random perturbation from global mean
    • Heuristic to ensure well-separated centers
Semi-Supervised K-Means
  • Partial label information is given
    • Seeded K-Means
    • Constrained K-Means
  • Constraints (Must-link, Cannot-link)
    • COP K-Means
Semi-Supervised K-Means for partially labeled data
  • Seeded K-Means:
    • Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i.
    • Seed points are only used for initialization, and not in subsequent steps.
  • Constrained K-Means:
    • Labeled data provided by user are used to initialize K-Means algorithm.
    • Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
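
The two variants differ only in whether seed labels are clamped during the assignment step. A minimal NumPy sketch following the kmeans conventions above (seed_X, seed_y, and clamp_seeds are our own names, not from the slides; empty clusters are not handled, for brevity):

```python
import numpy as np

def seeded_kmeans(X, seed_X, seed_y, k, clamp_seeds=False, n_iter=100):
    """Seeded K-Means (clamp_seeds=False) or Constrained K-Means (clamp_seeds=True)."""
    # Initialization: center of cluster i = mean of the seed points labeled i.
    centers = np.array([seed_X[seed_y == i].mean(axis=0) for i in range(k)])
    data = np.vstack([seed_X, X])
    n_seed = len(seed_X)
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if clamp_seeds:
            labels[:n_seed] = seed_y  # Constrained: seed labels never change
        # Seeded K-Means: seeds were only used for initialization, so their
        # labels are re-estimated like any other point's.
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```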
Seeded K-Means

Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points may change.

Exercise

Given the one-dimensional points 0, 1, 3, 6, 9, 15, 20, 21, 22, compute the clustering using seeded K-Means.

Constrained K-Means

Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points will not change.

Exercise

Given the same points 0, 1, 3, 6, 9, 15, 20, 21, 22, compute the clustering using constrained K-Means.

COP K-Means
  • COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.
  • Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that they cannot later be chosen as the center of another cluster).
  • Algorithm: During the cluster assignment step in COP-K-Means, each point is assigned to its nearest cluster center that violates none of its constraints. If no such assignment exists, the algorithm aborts.
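
A minimal sketch of this constrained assignment step (the helper names are ours; must_link and cannot_link are lists of index pairs, and labels holds the current assignments, with None for points not yet assigned):

```python
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """Would assigning point i to cluster c break any constraint?"""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] is not None and labels[other] != c:
                return True   # must-link partner already sits in another cluster
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:
                return True   # cannot-link partner already sits in cluster c
    return False

def cop_assign(i, dists_to_centers, labels, must_link, cannot_link):
    """Assign point i to the nearest center that violates no constraint."""
    for c in np.argsort(dists_to_centers):
        if not violates(i, int(c), labels, must_link, cannot_link):
            return int(c)
    raise RuntimeError("No constraint-respecting assignment exists; abort.")
```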
Illustration

[Figure: an unlabeled point x with a must-link to a point in the red class; to determine its label, assign it to the red class.]

Illustration

[Figure: an unlabeled point x with a cannot-link to a point in the other class; to determine its label, assign it to the red class.]

Illustration

[Figure: an unlabeled point x with both a must-link and a cannot-link that cannot be satisfied simultaneously; no consistent assignment exists, and the clustering algorithm fails.]

Similarity-based semi-supervised clustering
  • Alter the similarity measure based on the constraints
  • Paper: From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering. D. Klein et al.

Two types of constraints: Must-links and Cannot-links

Clustering algorithm: Hierarchical clustering

Overview of Hierarchical Clustering Algorithm
  • Agglomerative versus Divisive
  • Basic algorithm of agglomerative clustering:
    • Compute the distance matrix
    • Let each data point be a cluster
    • Repeat:
      • Merge the two closest clusters
      • Update the distance matrix
    • Until only a single cluster remains
  • The key operation is the update of the distance between two clusters (see the sketch below)
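
For reference, agglomerative clustering with a chosen inter-cluster distance is available off the shelf in SciPy; a minimal sketch on made-up toy data:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [1.0], [3.0], [6.0], [9.0]])  # toy 1-D data
d = pdist(X)                        # condensed pairwise distance matrix
Z = linkage(d, method='complete')   # complete link = MAX inter-cluster distance
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the tree into 2 clusters
```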
How to Define Inter-Cluster Distance

[Figure: pairwise distance matrix over points p1 … p5; which entries define the distance between two clusters?]

  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
Must-link constraints
  • Set the distance between each must-link pair to zero.
  • Derive a new metric by running an all-pairs shortest-paths algorithm (a sketch follows below).
    • The result is still a metric
    • Faithful to the original metric
    • Computational complexity: O(N²C)
      • C: number of points involved in must-link constraints
      • N: total number of points
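
A minimal sketch of this repair, assuming D is a dense symmetric NumPy distance matrix and must_links a list of index pairs. Since only edges incident to constrained points change, the Floyd-Warshall pass can be restricted to those points as intermediates, which gives the O(N²C) bound:

```python
import numpy as np

def must_link_metric(D, must_links):
    """Zero out must-link distances, then restore the triangle inequality."""
    D = D.copy()
    constrained = set()
    for a, b in must_links:
        D[a, b] = D[b, a] = 0.0
        constrained.update((a, b))
    # All-pairs shortest paths, with intermediate nodes restricted to the
    # constrained points (only their incident edges changed): O(N^2 * C).
    for m in constrained:
        D = np.minimum(D, D[:, [m]] + D[[m], :])
    return D
```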
New distance matrix based on must-link constraints

[Figure: the updated pairwise distance matrix over points p1 … p5 after the must-link shortest-paths repair.]

Hierarchical clustering can be carried out based on the new distance matrix.

What is missing? The cannot-link constraints have not yet been used.

Cannot-link constraints
  • Run hierarchical clustering with complete link (MAX)
    • The distance between two clusters is determined by their largest pairwise distance.
  • Set the distance between each cannot-link pair to a very large value (larger than any other pairwise distance), so complete link can never merge the pair.
  • The new distance matrix does not define a metric.
    • It nevertheless works very well in practice.
Constrained complete-link clustering algorithm

Derive a new distance matrix based on both types of constraints, then run complete-link hierarchical clustering on it (a sketch follows below).
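
Putting the pieces together, a minimal sketch of the pipeline. It reuses must_link_metric from the sketch above; setting cannot-link entries to one more than the maximum distance is one common choice, not necessarily the paper's exact constant:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def constrained_complete_link(D, must_links, cannot_links, k):
    """Complete-link clustering on a matrix altered by both constraint types."""
    D = must_link_metric(D, must_links)   # must-links: shortest-path repair
    big = D.max() + 1.0                   # larger than any real distance
    for a, b in cannot_links:
        D[a, b] = D[b, a] = big           # cannot-links: never merged
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='complete')
    return fcluster(Z, t=k, criterion='maxclust')
```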

Illustration

[Figure: five points, numbered 1–5, and their initial distance matrix.]

New distance matrix

Must-links: 1–2, 3–4. Cannot-links: 2–3.

[Figure: the distance matrix updated with these constraints.]

Hierarchical clustering

Points 1 and 2 form a cluster, and points 3 and 4 form another cluster.

[Figure: dendrogram over points 1–5, with clusters {1,2}, {3,4}, and {5}.]

Summary
  • Seeded and Constrained K-Means: use partially labeled data
  • COP K-Means: uses constraints (must-link and cannot-link)
  • Constrained K-Means and COP K-Means require all the constraints to be satisfied.
    • They may not be effective if the seeds contain noise.
  • Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
    • It is less sensitive to noise in the seeds.
  • Semi-supervised hierarchical clustering: alter the distance matrix using the constraints