
Support Vector Clustering – Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Data Mining – II

Yiyu Chen

Siddharth Pal

Ritendra Datta

Ashish Parulekar

Shiva Kasiviswanathan



Overview

  • Clustering and Classification

  • Support Vector Machines – a brief idea

  • Support Vector Clustering

  • Results

  • Conclusion



Clustering and Classification

  • Classification – The task of assigning instances to pre-defined classes.

    • E.g. Deciding whether a particular patient record can be associated with a specific disease.

  • Clustering – The task of grouping related data points together without labeling them.

    • E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.



Supervised vs. Unsupervised Learning

  • Unsupervised Learning – Learning from unlabeled data; no labeled training set is involved.

    • E.g. Clustering.

  • Supervised Learning – Training is done using labeled data.

    • E.g. Classification.



Support Vector Machines

  • Classify or cluster data by exploiting complex patterns in the data.

  • Instance of a class of algorithms called Kernel Machines.

  • They exploit information about the inner products between data points in some (possibly very complex) feature space.



Kernel-based learning

  • Composed of two parts:

    • General-purpose learning machine.

    • Application-specific kernel function.

  • Simple case – Classification:

    • Decision function = Hyper-plane in input space.

    • Rosenblatt’s Perceptron Algorithm.

    • SVM draws inspiration from this linear case (a minimal sketch of the perceptron follows).
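As a quick illustration of that linear case (my own sketch, not from the slides), Rosenblatt's perceptron learns a separating hyperplane w·x + b = 0 directly in input space:

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Rosenblatt's perceptron: learn a hyperplane w.x + b = 0 from labels y in {-1, +1}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified: move the hyperplane toward the point
                w, b = w + yi * xi, b + yi
                errors += 1
        if errors == 0:                        # converged: training data are linearly separated
            break
    return w, b

def decision(X, w, b):
    """The decision function is simply which side of the hyperplane a point falls on."""
    return np.sign(np.asarray(X, dtype=float) @ w + b)
```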



SVM Features

  • Maps data into a feature space where they are more easily (possibly linearly) separable.

  • Duality – The main optimization problem can be rewritten in dual form, where the data appear only through inner products.

  • Kernel – K is a function that returns the inner product of the images of two data points.

    • K(X1, X2) = <Φ(X1),Φ(X2)>

  • Only K is needed; the actual mapping Φ(X) may not even be known explicitly (see the sketch below).
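As a small, hedged illustration (not from the slides), a Gaussian kernel computes this inner product directly, without ever constructing Φ; the width parameter q here is the same scale parameter used later in the talk:

```python
import numpy as np

def gaussian_kernel(x1, x2, q=1.0):
    """K(x1, x2) = <Phi(x1), Phi(x2)> = exp(-q * ||x1 - x2||^2); Phi stays implicit."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-q * np.dot(diff, diff))

# The kernel value shrinks as points move apart in input space.
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))  # 1.0
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0]))  # exp(-2) ~ 0.135
```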



SVM Features

  • Dimensionality – High dimensionality in the feature space can cause over-fitting of the data.

    • Regularities in the training set may not appear in the test sets.

  • Convexity – The optimization problem is a convex quadratic programming problem

    • No local minima

    • Solvable in polynomial time



Support Vector Clustering

  • Vapnik (1995): Support Vector Machines

  • Tax & Duin (1999): SVs to characterize high dimensional distribution

  • Schölkopf et al. (2001): SVs to compute a set of contours enclosing data points

  • Ben-Hur et al. (2001): SVs to systematically search for clustering solutions



Highlights of this Approach

  • Data points are mapped to a high-dimensional feature space using a Gaussian kernel

  • In feature space, look for smallest sphere enclosing image of data

  • Map sphere back to data space to form set of contours enclosing data points

  • Points enclosed by each contour associated with same cluster



Highlights of this Approach (cont.)

  • Decreasing the width of Gaussian kernel increases number of contours

  • Searching for contours = searching for valleys in the probability distribution

  • Use a soft margin to deal with outliers

  • Handle overlapping clusters with large values of the soft margin



Major Steps

  • Sphere Analysis

  • Cluster Analysis



Sphere Analysis
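The derivation on these slides appears as images in the original deck. As a sketch following the formulation of Ben-Hur et al. (2001): find the smallest sphere of radius R, centered at a, that encloses the images Φ(x_j), with slack variables ξ_j and soft-margin constant C:

```latex
\min_{R,\,a,\,\xi}\; R^{2} + C\sum_{j}\xi_{j}
\qquad \text{s.t.}\qquad \|\Phi(x_j) - a\|^{2} \le R^{2} + \xi_j,\quad \xi_j \ge 0 .
```

Introducing Lagrange multipliers β_j and eliminating R, a, and ξ gives the dual problem, in which the data appear only through the kernel K:

```latex
\max_{\beta}\; W = \sum_{j}\beta_j K(x_j, x_j) - \sum_{i,j}\beta_i\beta_j K(x_i, x_j)
\qquad \text{s.t.}\qquad 0 \le \beta_j \le C,\quad \sum_{j}\beta_j = 1,
\qquad a = \sum_{j}\beta_j\,\Phi(x_j).
```

Points with 0 < β_j < C are support vectors (they lie on the sphere's surface); points with β_j = C are bounded support vectors (BSVs, lying outside the sphere); all other points lie inside.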





The Sphere



Data Space
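The contour plots on these slides are images in the original deck. As a sketch of what they show: the distance of a point's image from the sphere center can be written with the kernel alone, and the contours in data space are its level set at the sphere radius R (the value of R(x) at any support vector):

```latex
R^{2}(x) = \|\Phi(x) - a\|^{2}
         = K(x,x) - 2\sum_{j}\beta_j K(x_j, x) + \sum_{i,j}\beta_i\beta_j K(x_i, x_j),
\qquad \{\,x : R(x) = R\,\} \ \text{encloses the clusters in data space.}
```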



Major Steps

  • Sphere Analysis

  • Cluster Analysis



Cluster analysis: adjacency matrix



Cluster analysis: connected components

  • An unweighted graph is built from the adjacency matrix

  • Connected components = clusters

  • BSVs remain unclassified (a sketch of both steps follows this list)
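The adjacency rule and labeling step appear as figures in the original deck. Below is a rough, self-contained sketch (my own code, not the authors'): x_i and x_j are adjacent when every sampled point on the segment between them stays inside the sphere, and connected components of that graph are the clusters. The function names, the line-sampling count, and the use of SciPy's generic SLSQP solver in place of a dedicated QP solver are all my own choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.sparse.csgraph import connected_components

def svc(X, q=1.0, C=1.0, n_line_samples=10):
    """A compact sketch of SVC: (1) sphere analysis in feature space, (2) cluster labeling."""
    N = len(X)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-q * sq_dists)                           # Gaussian kernel matrix (K_jj = 1)

    # Step 1: sphere analysis -- maximize W(beta) s.t. 0 <= beta_j <= C, sum_j beta_j = 1
    def neg_dual(beta):
        return -(beta @ np.diag(K) - beta @ K @ beta)

    res = minimize(neg_dual, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    beta = res.x

    def radius(y):
        """Feature-space distance of Phi(y) from the sphere center, via the kernel only."""
        k_y = np.exp(-q * np.sum((X - y) ** 2, axis=1))
        return np.sqrt(1.0 - 2.0 * beta @ k_y + beta @ K @ beta)

    eps = 1e-6 * C
    sv = (beta > eps) & (beta < C - eps)                # support vectors: on the sphere surface
    bsv = beta >= C - eps                               # bounded support vectors: outliers
    sv_idx = np.where(sv)[0]
    R = (np.mean([radius(X[i]) for i in sv_idx]) if len(sv_idx)
         else max(radius(x) for x in X[~bsv]))          # sphere radius in feature space

    # Step 2: cluster labeling -- adjacency matrix + connected components
    inside = np.where(~bsv)[0]
    A = np.zeros((N, N), dtype=bool)
    ts = np.linspace(0.0, 1.0, n_line_samples)
    for pos, i in enumerate(inside):
        for j in inside[pos:]:
            ok = all(radius(X[i] + t * (X[j] - X[i])) <= R for t in ts)
            A[i, j] = A[j, i] = ok

    _, labels = connected_components(A, directed=False)
    labels[bsv] = -1                                    # BSVs are left unclassified
    return labels
```

For example, `svc(np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5]), q=1.0)` should separate the two well-spaced blobs; points labeled -1 would be BSVs (none here, since C = 1).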



Parameters

  • The shape of the enclosing contours in data space is governed by two parameters

  • The scale parameter of the Gaussian kernel, q

  • The soft margin constant “C”



Examples without BSV



  • As the scale parameter of the Gaussian kernel, q, is increased, the shape of the boundary in data space varies:

    • As q increases, the boundary fits the data more tightly

    • At several q values the enclosing contour splits, forming an increasing number of clusters

    • As q increases, the number of support vectors n_sv also increases



Examples with BSV

In real data, clusters are usually not as well separated as in the examples above. Thus, in order to observe splitting of contours, we must allow for BSVs. The number of outliers is controlled by the parameter C:

n_bsv < 1/C

where n_bsv is the number of BSVs and C is the soft-margin constant.

p = 1/(NC)

where p is an upper bound on the fraction of BSVs.

When distinct clusters are present, but some outliers (due to noise) prevent contour separation, it is useful to use BSVs.
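As a hedged back-of-the-envelope example: for the Iris runs described later, with N = 150 points and p = 0.6, the relation p = 1/(NC) gives C = 1/(Np) = 1/(150 × 0.6) ≈ 0.011, so up to roughly 60% of the points may become BSVs.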



Support vectors

The difference between data that are contour-separable without BSVs and data that require use of BSVs is illustrated in Figure 4. A small overlap between the two probability distributions that generate the data is enough to prevent separation if there are no BSVs.



Example



  • SVC is used iteratively: start with a low value of q, where there is a single cluster, and increase it to observe the formation of an increasing number of clusters (as the Gaussian kernel describes the data with increasing precision).

  • If, however, the number of SVs is excessive, p is increased to allow these points to turn into outliers, thus facilitating contour separation.



Example (cont.)

  • SVC is used iteratively

  • As p is increased not only does the number of BSVs increase, but their influence on the shape of the cluster contour decreases.

  • The number of support vectors depends on both q and p. For fixed q, as p is increased, the number of SVs decreases, since some of them turn into BSVs and the contours become smoother.



Strong overlapping clusters

What if the overlap is very strong?

Use the high-BSV regime: reinterpret the sphere in feature space as representing cluster cores, rather than the envelope of the data.



Strong overlapping clusters

The sphere in data space can be expressed as the contour on which the kernel sum equals a threshold ρ (see the sketch below), where ρ is determined by the value of the sum on the support vectors.

The set of points enclosed by the contour is the set on which this sum is at least ρ.

In the extreme case when almost all points are BSVs (p → 1), the sum in this expression approaches the Parzen window estimator P_w.

Cluster centers are then defined as maxima of the Parzen window estimator P_w(x).
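The expressions on this slide are images in the original deck. As a sketch following the paper's formulation, with the Gaussian kernel and the contour threshold written here as ρ:

```latex
\Big\{\, x \;:\; \sum_{j}\beta_j\, e^{-q\|x - x_j\|^{2}} = \rho \,\Big\},
\qquad \rho = \sum_{j}\beta_j\, e^{-q\|x_{\mathrm{sv}} - x_j\|^{2}}
\ \ \text{for any support vector } x_{\mathrm{sv}},
```

and the enclosed set is where the sum is at least ρ. When almost all points are BSVs (p → 1), every β_j is close to C ≈ 1/N, so

```latex
\sum_{j}\beta_j\, e^{-q\|x - x_j\|^{2}} \;\approx\; \frac{1}{N}\sum_{j} e^{-q\|x - x_j\|^{2}} \;\propto\; P_w(x),
```

where P_w(x) is the Parzen window estimate of the density.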



Strong overlapping clusters (cont.)

  • Advantages

  • Instead of solving a problem with many local maxima, the core boundaries are identified by an SV method with a globally optimal solution.

  • The conceptual advantage of the method is that it defines a region, rather than just a peak, as the core of the cluster.



Iris Data

The data set consists of 150 instances, each composed of four measurements of an iris flower.

There are three types of flowers, represented by 50 instances each.



Iris Data

One of the clusters is linearly separable from the other two by a clear gap in the probability distribution. The remaining two clusters have significant overlap and were separated at q = 6 and p = 0.6. When these two clusters are considered together, the result is 2 misclassifications.



Iris Data Analysis

  • Adding the third principal component, three clusters were obtained at q = 7, p = 0.7, with four misclassifications.

  • With the fourth principal component the number of misclassifications increased to 14 (using q = 9, p = 0.75).

  • In addition, the number of support vectors increased with increasing dimensionality (18 in 2 dimensions, 23 in 3 dimensions and 34 in 4 dimensions).

  • The improved performance in 2 or 3 dimensions can be attributed to the noise reduction effect of PCA.

  • Compared to other methods, the SVC algorithm performed better:

    • Approach of Tishby and Slonim (2001) leads to 5 misclassifications

    • SPC algorithm of Blatt et al. (1997), when applied to the dataset in the original data-space, has 15 misclassifications.



Varying ‘p’ and ‘q’

Start with an initial value of q at which all pairs of points produce a sizeable kernel value, resulting in a single cluster. At this value no outliers are needed, so we choose C = 1 (a sketch of this starting point follows).
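As a small sketch of this starting point (my own code; setting the initial scale from the largest pairwise distance is an assumption consistent with the slide's description):

```python
import numpy as np

def initial_scale(X):
    """Pick q so that even the farthest pair of points still has kernel value exp(-1):
    q_init = 1 / max_ij ||x_i - x_j||^2, i.e. every pair produces a sizeable kernel value."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return 1.0 / sq_dists.max()

X = np.random.randn(100, 2)
q = initial_scale(X)                 # start here, with C = 1 (no BSVs allowed yet)
q_grid = q * 2.0 ** np.arange(6)     # then increase q step by step during the scan
```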

As q is increased, if clusters of single or a few points break off, or cluster boundaries become very rough, p should be increased in order to investigate what happens when BSVs are allowed.



Varying ‘p’ and ‘q’

A low number of SVs guarantees smooth boundaries.

As q increases this number increases. If the number of SVs becomes excessive, p should be increased, whereby many SVs are turned into BSVs and smooth cluster (or core) boundaries emerge, as in Figure 3b.

Systematically increase q and p along a direction that guarantees a minimal number of SVs.



Our Results for Iris



Dimension 1 & 4





Dimension 2 & 3





Dimension 2 & 4





Dimension 3 & 4




Results on our synthesized data

(Figures for q = 1, 2, 5, 10, and 20, one per slide.)



Conclusion

  • As the graphs show, data points that are located close to one another (at least in some of the dimensions) tend to be allocated to the same cluster.

  • As expected, the number of clusters increases as q grows.

  • The optimal q value (the one that yields the best result) differs greatly from dataset to dataset and is thus very hard to fine-tune.

  • It seems that when no clear separation exists, the algorithm tends to fail. We suspect that incorporating some method for noise reduction, together with the use of bounded support vectors (using C values smaller than 1), may have yielded better classification results.



Conclusion

  • As the number of dimensions increases, the running time of the algorithm (and especially the processing of the cluster separation) grows dramatically. For a large number of attributes, using this algorithm is practically infeasible.



Advantages

  • Handle clusters with irregular geometric shapes

  • Implicit mapping to high dimensional feature space

  • Have control over the level of clustering



Problems

  • Computational complexity: O(n²m)

  • The optimal q value is different for each dataset and is very hard to fine tune.

  • Some problems with cluster analysis:

    • Computation of the adjacency matrix

    • Assigning labels to BSVs

  • Dimensionality:

    • Time

    • Error rate

