Support Vector Clustering - Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik



Data Mining – II

Yiyu Chen

Siddharth Pal

Ritendra Datta

Ashish Parulekar

Shiva Kasiviswanathan

- Clustering and Classification
- Support Vector Machines – a brief idea
- Support Vector Clustering
- Results
- Conclusion

- Classification – The task of assigning instances to pre-defined classes.
- E.g. Deciding whether a particular patient record can be associated with a specific disease.

- Clustering – The task of grouping related data points together without labeling them.
- E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.

- Unsupervised Learning – The model learns structure without labeled training data.
- E.g. Clustering.

- Supervised Learning – Training is done using labeled data.
- E.g. Classification.

- Classification/Clustering of data by exploiting complex patterns in data.
- Instance of a class of algorithms called Kernel Machines.
- Exploiting information about the inner product between data points, in some (possibly very complex) feature space.

- Composed of two parts:
- General-purpose learning machine.
- Application-specific kernel function.

- Simple case – Classification:
- Decision function = Hyper-plane in input space.
- Rosenblatt’s Perceptron Algorithm.
- SVM draws inspiration from this linear case.

- Maps data into a feature space where it is more easily (possibly linearly) separable.
- Duality – The optimization problem can be re-written in a dual form where the data appear only through inner products.
- A kernel K is a function that returns the inner product of the images of two data points:
- K(X1, X2) = <Φ(X1), Φ(X2)>

- Only K is needed; the actual mapping Φ(X) may not even be known explicitly.
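To make the kernel trick concrete, here is a minimal sketch (all names are illustrative) showing that a kernel evaluates an inner product in feature space without ever forming Φ. A simple homogeneous polynomial kernel, whose feature map is explicit, is used for the check, alongside the Gaussian kernel used later in SVC:

```python
import numpy as np

def gaussian_kernel(x1, x2, q=1.0):
    """Gaussian kernel used in SVC: K(x1, x2) = exp(-q * ||x1 - x2||^2)."""
    return np.exp(-q * np.sum((x1 - x2) ** 2))

def poly_kernel(x, y):
    """Polynomial kernel K(x, y) = (x . y)^2 for 2-D inputs."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map for poly_kernel: Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
# The kernel returns <Phi(x), Phi(y)> without ever computing Phi:
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

For the Gaussian kernel the implicit feature space is infinite-dimensional, which is why the kernel trick is essential there.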

- Dimensionality – High dimensionality in the feature space can cause an over-fitting of the data.
- Regularities in the training set may not appear in the test sets.

- Convexity – The optimization problem is a convex quadratic programming problem
- No local minima
- Solvable in polynomial time

- Vapnik (1995): Support Vector Machines
- Tax & Duin (1999): SVs to characterize high dimensional distribution
- Scholkopf et al. (2001): SVs to compute set of contours enclosing data points
- Ben-Hur et al. (2001): SVs to systematically search for clustering solutions

- Data points are mapped to a high-dimensional feature space using a Gaussian kernel
- In feature space, look for smallest sphere enclosing image of data
- Map sphere back to data space to form set of contours enclosing data points
- Points enclosed by each contour associated with same cluster
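The sphere-finding step above can be sketched as the dual of the minimal-enclosing-sphere (SVDD) problem: with a Gaussian kernel K(x, x) = 1, it reduces to minimizing βᵀKβ subject to Σβj = 1 and 0 ≤ βj ≤ C. The paper solves this quadratic program; using SciPy's SLSQP solver for small N is an assumption of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

def svc_sphere(X, q=1.0, C=1.0):
    """Smallest enclosing sphere in Gaussian-kernel feature space (dual form):
        minimise  beta^T K beta
        s.t.      sum(beta) = 1,  0 <= beta_j <= C
    """
    N = len(X)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-q * sq)  # Gaussian kernel matrix
    res = minimize(
        lambda b: b @ K @ b,
        np.full(N, 1.0 / N),                 # start at the feasible uniform point
        jac=lambda b: 2.0 * K @ b,
        bounds=[(0.0, C)] * N,
        constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0},
        method="SLSQP",
    )
    return res.x, K

# Two tight blobs: most beta_j vanish; points with 0 < beta_j < C are the
# support vectors that lie on the sphere's surface.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
beta, K = svc_sphere(X, q=1.0)
```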

- Decreasing the width of Gaussian kernel increases number of contours
- Searching for contours = searching for valleys in the probability distribution
- Use a soft margin to deal with outliers
- Handle overlapping clusters with large values of the soft margin

- Sphere Analysis
- Clusters Analysis


- An unweighted graph is built over the data points
- Connected components = clusters
- BSVs remain unclassified
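The labelling step above can be sketched as follows: two points are connected by an edge if every sampled point on the line segment between them lies inside the sphere, and clusters are the connected components of the resulting graph. The membership test `inside` here is a hypothetical stand-in for the actual check R(y) ≤ R against the trained sphere:

```python
import numpy as np
from itertools import combinations

def label_clusters(X, inside, n_samples=10):
    """SVC cluster labelling: connect i, j if every sampled point on the
    segment between them passes the membership test; clusters = connected
    components of the unweighted adjacency graph."""
    N = len(X)
    adj = np.zeros((N, N), dtype=bool)
    for i, j in combinations(range(N), 2):
        ts = np.linspace(0.0, 1.0, n_samples)
        segment = X[i] + ts[:, None] * (X[j] - X[i])
        adj[i, j] = adj[j, i] = all(inside(y) for y in segment)
    # connected components via depth-first search
    labels = -np.ones(N, dtype=int)
    current = 0
    for start in range(N):
        if labels[start] >= 0:
            continue
        stack, labels[start] = [start], current
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] < 0:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

# Stand-in membership test (hypothetical): inside either of two balls.
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
inside = lambda y: min(np.linalg.norm(y - c) for c in centers) < 1.0
X = np.array([[0, 0], [0.3, 0.2], [5, 5], [5.2, 4.9]], dtype=float)
labels = label_clusters(X, inside)  # the two nearby pairs form two clusters
```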

- The shape of the enclosing contours in data space is governed by two parameters:
- The scale parameter of the Gaussian kernel, q
- The soft margin constant, C

- As the scale parameter q of the Gaussian kernel is increased, the shape of the boundary in data space varies.
- As q increases, the boundary fits the data more tightly.
- At several q values the enclosing contour splits, forming an increasing number of clusters.
- As q increases, the number of support vectors nsv also increases.

In real data, clusters are usually not so well separated. Thus, in order to observe splitting of contours, we must allow for BSVs. The number of outliers is controlled by the parameter C.

nbsv < 1/C

where nbsv is the number of BSVs and C is the soft margin constant.

p = 1/(NC)

where p = 1/(NC) is an upper bound on the fraction of BSVs.

When distinct clusters are present, but some outliers (due to noise) prevent contour separation, it is useful to use BSVs.

The difference between data that are contour-separable without BSVs and data that require use of BSVs is illustrated in Figure 4. A small overlap between the two probability distributions that generate the data is enough to prevent separation if there are no BSVs.
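The bookkeeping above can be checked with illustrative numbers (N and C below are made up for the example): p = 1/(NC) bounds the fraction of bounded support vectors, so nbsv < 1/C.

```python
# Illustrative numbers: with N = 200 points and soft margin C = 0.125,
# at most p = 1/(N*C) = 4% of the points can become BSVs,
# i.e. fewer than 1/C = 8 points in total.
N, C = 200, 0.125
p = 1.0 / (N * C)    # upper bound on the BSV fraction
max_bsv = 1.0 / C    # upper bound on the BSV count
```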

- SVC is used iteratively
- Starting with a low value of q where there is a single cluster, and increasing it, to observe the formation of an increasing number of clusters (as the Gaussian kernel describes the data with increasing precision).
- If, however, the number of SVs is excessive, p is increased to allow these points to turn into outliers, thus facilitating contour separation.

- SVC is used iteratively
- As p is increased not only does the number of BSVs increase, but their influence on the shape of the cluster contour decreases.
- The number of support vectors depends on both q and p. For fixed q , as p is increased, the number of SVs decreases since some of them turn into BSVs and the contours become smoother

What if the overlap is very strong ?

Use the high-BSV regime: reinterpret the sphere in feature space as representing cluster cores, rather than the envelope of the data.

The sphere, mapped back to data space, can be expressed as the contour

{ x : Σj βj exp(-q ||x - xj||²) = ρ }

where ρ is determined by the value of the sum on the support vectors.

The set of points enclosed by the contour is

{ x : Σj βj exp(-q ||x - xj||²) > ρ }

In the extreme case when almost all points are BSVs (p --> 1), each βj approaches 1/N, and the sum in this expression approaches

(1/N) Σj exp(-q ||x - xj||²) = Pw(x)

where Pw is the Parzen window estimator of the density.

Cluster centers are defined as maxima of the Parzen window estimator Pw(x).
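The Parzen window estimator in the high-BSV limit can be sketched directly (the data and q value below are illustrative):

```python
import numpy as np

def parzen_window(X, q=1.0):
    """Parzen window density estimate with the SVC Gaussian kernel:
    Pw(x) = (1/N) * sum_j exp(-q * ||x - x_j||^2)."""
    def pw(x):
        return np.mean(np.exp(-q * np.sum((X - x) ** 2, axis=1)))
    return pw

# Two 1-D blobs generated around -2 and +2: in the high-BSV regime the
# cluster cores are regions where Pw(x) exceeds a threshold, and the
# cluster centres are the maxima of Pw.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, (50, 1)), rng.normal(2, 0.3, (50, 1))])
pw = parzen_window(X, q=2.0)
# the estimate is much larger near the generating centres than between them
```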

- Advantages
- Instead of solving a problem with many local maxima, the method identifies core boundaries by an SV approach with a globally optimal solution.
- The conceptual advantage is that it defines a region, rather than just a peak, as the core of a cluster.

The data set consists of 150 instances, each composed of four measurements of an iris flower.

There are three types of flowers, with 50 instances of each.

One of the clusters is linearly separable from the other two by a clear gap in the probability distribution. The remaining two clusters have significant overlap and were separated at q = 6 and p = 0.6. When these two clusters are considered together, the result is 2 misclassifications.

- Adding the third principal component three clusters were obtained at q = 7 p = 0.7, with four misclassifications.
- With the fourth principal component the number of misclassifications increased to 14 (using q = 9 p = 0.75).
- In addition, the number of support vectors increased with increasing dimensionality (18 in 2 dimensions, 23 in 3 dimensions and 34 in 4 dimensions).
- The improved performance in 2 or 3 dimensions can be attributed to the noise reduction effect of PCA.
- Compared to other methods, the SVC algorithm performed better:
- Approach of Tishby and Slonim (2001) leads to 5 misclassifications
- SPC algorithm of Blatt et al. (1997), when applied to the dataset in the original data-space, has 15 misclassifications.

Start with an initial value of q at which all pairs of points produce a sizeable kernel value, resulting in a single cluster. At this value no outliers are needed, so we choose C = 1.

As q is increased, if clusters of single or few points break off, or cluster boundaries become very rough, p should be increased in order to investigate what happens when BSVs are allowed.

A low number of SVs guarantees smooth boundaries.

As q increases, this number increases. If the number of SVs becomes excessive, p should be increased, whereby many SVs turn into BSVs and smooth cluster (or core) boundaries emerge, as in Figure 3b.

Systematically increase q and p along a direction that guarantees a minimal number of SVs.
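One way to sketch this exploration schedule: the starting value q0 = 1/max‖xi − xj‖² (the scale at which all pairwise kernel values are sizeable, giving a single cluster) follows the paper, while the geometric growth factor is an illustrative choice:

```python
import numpy as np

def q_schedule(X, steps=5, factor=2.0):
    """Generate a sequence of kernel widths for iterative SVC, starting at
    the scale where all kernel values are sizeable (a single cluster) and
    increasing q geometrically so the data are described more precisely."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    q0 = 1.0 / sq.max()  # inverse of the maximum pairwise squared distance
    return [q0 * factor ** k for k in range(steps)]

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
qs = q_schedule(X)  # starts at 1/5 = 0.2 and doubles each step
```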

[Plots of the resulting cluster contours for q = 1, 2, 5, 10 and 20]

- As the graphs show, data points that are located close to one another (at least in some of the dimensions) tend to be allocated to the same cluster.
- As expected, the number of clusters increases as q grows.
- The optimal q value (the one that yields the best result) differs greatly between datasets and is thus very hard to fine-tune.
- It seems that when no clear separation exists, the algorithm tends to fail. We suspect that incorporating some method for noise reduction, together with the use of bounded support vectors (C values smaller than 1), might have yielded better classification results.

- As the number of dimensions increases, the running time of the algorithm (especially the cluster-separation step) grows dramatically. For a large number of attributes, it is practically infeasible to use this algorithm.

- Handle clusters with irregular geometric shapes
- Implicit mapping to high dimensional feature space
- Have control over the level of clustering

- Computational complexity: O(N²m), where m is the number of points sampled along each line segment when building the adjacency matrix
- The optimal q value is different for each dataset and is very hard to fine tune.
- Some problems with cluster analysis:
- Computation of the adjacency matrix

- Assigning labels to BSVs
- Dimensions
- Time
- Error rate