On statistical models of cluster stability. Z. Volkovich a,b, Z. Barzily a, L. Morozensky a

Presentation Transcript


  1. On statistical models of cluster stability. Z. Volkovich a,b, Z. Barzily a, L. Morozensky a. a. Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel. b. Department of Mathematics and Statistics, The University of Maryland, Baltimore County, UMBC, Baltimore, USA

  2. What is Clustering? Clustering deals with partitioning a data set into groups of elements which are similar to each other. Group membership is determined by means of a distance-like function that measures the resemblance between two data points.
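As an illustration of distance-based group membership, here is a minimal sketch (our own, not from the paper): each point is assigned to the nearest of two given centroids under the Euclidean distance.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Assign each point to the centroid minimizing the Euclidean distance."""
    # distances[i, j] = ||points[i] - centroids[j]||
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

points = np.array([[0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [4.9, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_clusters(points, centroids))  # -> [0 0 1 1]
```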

  3. Goal of the paper In the current paper we present a method for assessing cluster stability. This method, combined with a clustering algorithm, yields an estimate of the data partition, namely, the number of clusters and the attributes of each cluster.

  4. Concept of the paper The basic idea of our method is that if one "properly" clusters two independent samples then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as two samples drawn from the same population.

  5. The Model Conclusion: the problem we are dealing with belongs to the subject of hypothesis testing. Since no prior knowledge of the distribution of the population is available, a distribution-free two-sample test can be applied.

  6. Two-sample test Which two-sample tests can be used for our purpose? There are several possibilities. We consider the two-sample test built on the negative definite kernels approach proposed by A.A. Zinger, A.V. Kakosyan and L.B. Klebanov, 1989 and L. Klebanov, 2003. This approach is very similar to the one proposed by G. Zech and B. Aslan, 2005. Applications of these distances to the characterization of distributions were also discussed by L. Klebanov, T. Kozubowskii, S. Rachev and V. Volkovich, 2001.

  7. Negative Definite Kernels A real symmetric function N is negative definite if, for any n ≥ 1, any x1, .., xn Є X and any real numbers c1, .., cn such that c1 + .. + cn = 0, the quadratic form satisfies Σi Σj ci cj N(xi, xj) ≤ 0. The kernel is called strongly negative definite if the equality in this relationship is reached only if ci = 0, i = 1, .., n.
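The defining inequality can be checked numerically. A small sketch (ours) verifies that N(x, y) = ||x − y|| keeps the quadratic form nonpositive for random coefficient vectors summing to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                                # points x1..x8 in R^3
N = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)  # N(xi, xj) = ||xi - xj||

worst = -np.inf
for _ in range(1000):
    c = rng.normal(size=8)
    c -= c.mean()                  # enforce c1 + .. + c8 = 0
    worst = max(worst, c @ N @ c)  # quadratic form sum_i sum_j ci cj N(xi, xj)

print(worst <= 1e-9)  # -> True: the form is never positive
```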

  8. Example Functions of the type φ(x) = ||x||^r, 0 < r ≤ 2, produce negative definite kernels, which are strongly negative definite if 0 < r < 2. It is important to note that a negative definite kernel N2 can be obtained from a negative definite kernel N1 by the transformations N2 = N1^α, 0 < α < 1, and N2 = ln(1 + N1).

  9. Negative Definite Kernel test We restrict ourselves to the hard clustering situation, where the partition is defined by a set of associations c = c(x) of the data points with the clusters. In this case, the underlying distribution of X is the mixture F(x) = p1 F1(x) + … + pk Fk(x) (*), where p1, …, pk are the cluster probabilities and F1, …, Fk are the inner cluster distributions.

  10. Negative Definite Kernel test (2) We consider kernels N(x1, x2, c1, c2) = Nx(x1, x2) χ(c1 = c2), where Nx(x1, x2) is a negative definite kernel and χ(c1 = c2) is the indicator function of the event {c1 = c2}. Formally speaking, this kernel is not a negative definite kernel. However, a distance can be constructed as L(μ, ν) = ∫∫ N(x1, x2, c1, c2) dμ(x1, c1) dν(x2, c2) and Dis(μ, ν) = L(μ, μ) + L(ν, ν) − 2L(μ, ν).
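An empirical sketch of these quantities (our own helper names; Nx taken as the Euclidean distance) estimates L by a sample average over all pairs and Dis from the three L terms. By construction, Dis of a clustered sample with itself is exactly zero:

```python
import numpy as np

def L_hat(xa, ca, xb, cb):
    """Empirical L(mu, nu): average of Nx(xi, yj) * chi(ci = cj) over all pairs."""
    nx = np.linalg.norm(xa[:, None, :] - xb[None, :, :], axis=2)
    same = ca[:, None] == cb[None, :]
    return (nx * same).mean()

def dis_hat(x1, c1, x2, c2):
    """Empirical Dis(mu, nu) = L(mu, mu) + L(nu, nu) - 2 L(mu, nu)."""
    return L_hat(x1, c1, x1, c1) + L_hat(x2, c2, x2, c2) - 2.0 * L_hat(x1, c1, x2, c2)

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2))
c = (x[:, 0] > 0).astype(int)   # toy hard-cluster labels
print(dis_hat(x, c, x, c))      # -> 0.0 for identical clustered samples
```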

  11. Negative Definite Kernel test (3) Theorem. Let N(x1, x2, c1, c2) be a negative definite kernel described above and let μ and ν be two measures satisfying (*) such that Pμ(c|x) = Pν(c|x). Then: • Dis(μ, ν) ≥ 0; • if Nx is a strongly negative definite function then Dis(μ, ν) = 0 if and only if μ = ν.

  12. Negative Definite Kernel test (4) Let S1: x1, x2, …, xn and S2: y1, y2, …, yn be two samples of independent random vectors having probability laws F and G respectively. We wish to test the hypothesis H0: F = G against the alternative H1: F ≠ G when the distributions F and G are unknown.
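Such a distribution-free test can be carried out by comparing the observed statistic with its permutation distribution under random relabeling of the pooled sample. A sketch (ours), using the kernel N(x, y) = ||x − y||, i.e. an energy-type statistic:

```python
import numpy as np

def energy_stat(x, y):
    """2 E||X - Y|| - E||X - X'|| - E||Y - Y'||: nonnegative, zero iff F = G."""
    dxy = np.linalg.norm(x[:, None] - y[None, :], axis=2).mean()
    dxx = np.linalg.norm(x[:, None] - x[None, :], axis=2).mean()
    dyy = np.linalg.norm(y[:, None] - y[None, :], axis=2).mean()
    return 2.0 * dxy - dxx - dyy

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """P-value of H0: F = G from the permutation distribution of the statistic."""
    rng = np.random.default_rng(seed)
    observed = energy_stat(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_stat(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=(60, 1))
p_same = permutation_pvalue(x, rng.normal(size=(60, 1)))           # H0 true
p_diff = permutation_pvalue(x, rng.normal(loc=3.0, size=(60, 1)))  # H0 false
print(p_same, p_diff)  # p_diff is small; p_same typically is not
```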

  13. Algorithm description Let us suppose that a hard clustering algorithm Cl, based on the probability model, is available. Input parameters: a sample S to be clustered and a predefined number of clusters k. Output parameters: a clustered sample S(k) = (S, Ck) consisting of a vector Ck of the cluster labels of S. For two given disjoint samples S1 and S2 we consider a clustered sample (S1 ∪ S2, Ck) and denote by c the mapping from this clustered sample to Ck.

  14. Algorithm description (2) Let us introduce the empirical counterpart of the distance Dis, normalized by the cluster sizes, where |Cj| is the size of cluster j:

  15. Algorithm description (3) The algorithm consists of the following steps:

  16. Algorithm description (4) Remarks about the algorithm: • Need for standardization (Step 6): • The clustering algorithm may not determine the correct cluster for an outlier. This adds noise to the result. • The noise level decreases in k, since fewer data elements are assigned to distant centroids. • Standardization decreases the noise level. • Choice of the optimal k as the most concentrated (Step 8): • If k is less than the "true" number of clusters, then at least one cluster is formed by uniting two separate clusters and the statistic is thus less concentrated. • If k is larger than the "true" number of clusters, then at least one cluster is formed in a location where there is only a random concentration of data elements in the sample. This, again, decreases the concentration of the statistic, because two clusters are not likely to have the same random concentration.
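Since the step list on slide 15 did not survive the transcript, the overall procedure can be sketched as follows, based on our reading of slides 13 to 17: draw disjoint sample pairs, cluster each union, compute the kernel statistic, standardize, and score each candidate k by the concentration of the resulting values. The helper names and the plain Lloyd's K-means stand-in for Cl are ours, not the paper's:

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm, standing in for the clustering algorithm Cl."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = np.linalg.norm(x[:, None] - centroids[None, :], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels

def dis_stat(s1, s2, c1, c2):
    """Empirical Dis between two clustered samples, kernel ||x - y|| * chi(c1 = c2)."""
    def L(xa, ca, xb, cb):
        d = np.linalg.norm(xa[:, None] - xb[None, :], axis=2)
        return (d * (ca[:, None] == cb[None, :])).mean()
    return L(s1, c1, s1, c1) + L(s2, c2, s2, c2) - 2.0 * L(s1, c1, s2, c2)

def stability_scores(data, k, n_pairs=20, m=100, seed=0):
    """Draw n_pairs disjoint sample pairs, cluster each union into k clusters,
    and return the standardized values of the statistic."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        idx = rng.choice(len(data), size=2 * m, replace=False)
        s1, s2 = data[idx[:m]], data[idx[m:]]
        labels = kmeans(np.vstack([s1, s2]), k)
        scores.append(dis_stat(s1, s2, labels[:m], labels[m:]))
    scores = np.array(scores)
    return (scores - scores.mean()) / scores.std()   # standardization

# Three well-separated Gaussian clusters; score candidate values of k by
# the concentration (here: kurtosis) of the standardized statistic.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(300, 2)) for c in (0.0, 4.0, 8.0)])
for k in (2, 3, 4):
    z = stability_scores(data, k)
    kurt = ((z - z.mean()) ** 4).mean() / (z.var() ** 2)
    print(k, float(kurt))
```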

  17. Numerical experiments In order to evaluate the performance of the described methodology we present several numerical experiments on synthetic and real datasets. The selected samples (steps 3 and 4 of the algorithm) are clustered by applying the K-Means algorithm. The results obtained are used as inputs for steps 4 and 5 of the algorithm. The quality of the candidate partitions is evaluated (step 7 of the algorithm) by three concentration statistics: Friedman's index, the KL-distance and the kurtosis.
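Of the three indices, the kurtosis is the simplest to state. A sketch (ours) of the sample kurtosis used as a concentration measure, with larger values indicating a more peaked distribution of the statistic:

```python
import numpy as np

def sample_kurtosis(z):
    """Fourth standardized moment E[(Z - EZ)^4] / (Var Z)^2."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    return (d ** 4).mean() / ((d ** 2).mean() ** 2)

rng = np.random.default_rng(0)
print(sample_kurtosis(rng.normal(size=100000)))  # close to 3 for a Gaussian
```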

  18. Numerical experiments (2) We demonstrate the performance of our algorithm by comparing our clustering results to the "true" structure of a real dataset. This dataset is chosen from the text collections available at http://www.dcs.gla.ac.uk/idom/ir resources/test collections/. The set consists of the following three collections: • DC0–Medlars Collection (1033 medical abstracts). • DC1–CISI Collection (1460 information science abstracts). • DC2–Cranfield Collection (1400 aerodynamics abstracts).

  19. Numerical experiments (3) Following the well-known "bag of words" approach, the 300 and 600 "best" terms were selected, and the thirty leading principal components were found. In the case when the number of samples and the sample size both equal 1000, for K(x, y) = ||x − y||2, we obtained the results summarized below.
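The preprocessing pipeline can be sketched as follows (ours; a toy random term-document matrix stands in for the real collections):

```python
import numpy as np

# Toy bag-of-words matrix: rows = documents, columns = counts of the
# selected "best" terms (300 here, as in the smaller experiment).
rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(50, 300)).astype(float)

# PCA via SVD: center the columns, decompose, keep the 30 leading components.
centered = counts - counts.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ Vt[:30].T   # documents represented in 30-dimensional PC space

print(reduced.shape)  # -> (50, 30)
```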

  20. Numerical experiments (4) We can see that two of the indexes indicate three clusters in the data. Thank you.