CLUSTERING: PROXIMITY MEASURES By Çağrı Sarıgöz Submitted to Assoc. Prof. Turgay İbrikçi EE 639
Classification • Classifying has been one of the crucial thought activities of humankind. • It makes it easier to perceive the outside world and act accordingly. Aristotle’s classification of living things is one of the most famous classification works, dating back to ancient times.
Cluster Analysis • Cluster analysis brings mathematical methodology to the solution of classification problems. • It deals with the classification or grouping of data into a set of categories or clusters. • Data objects in the same cluster should be similar, and objects in different clusters should be dissimilar, in some context. • Determining this context is generally a subjective matter.
Approaching the Data Objects • Feature Types • Continuous • Discrete • Binary • Measurement Levels • Qualitative • Nominal • Ordinal • Quantitative • Interval • Ratio
Feature Types • A continuous feature can take a value from an uncountably infinite range. • Exact weight of a person. • A discrete feature, in contrast, has a range of values that is finite or countably infinite. • Number of heartbeats of a person, in bpm. • A binary feature is a special case of a discrete feature in which the feature can take only two values. • Presence or absence of tattoos on a person’s skin.
Measurement Levels: Qualitative • Features at the nominal level have no mathematical meaning; they are generally labels, states or names. • Color of a car, condition of weather, etc. • Features at the ordinal level are still just names, but with a certain order. The differences between the values are still meaningless in a mathematical sense. • Degrees of headache: none, slight, moderate, severe, unbearable, etc.
Measurement Levels: Quantitative • At the interval level, the difference between feature values has a meaning, but there is no true zero in the range of the level, so the ratio between two values has no meaning. • IQ score. A person with an IQ score of 140 isn’t necessarily twice as intelligent as a person with an IQ score of 70. • Features at the ratio level have all the properties of the other levels, plus a true zero, so that the ratio between two values has a mathematical meaning. • Number of cars in a parking lot.
Definition of Proximity Measures: Dissimilarity (Distance) • A dissimilarity or distance function $D$ on a data set $X$ is defined to satisfy these conditions: • Symmetry: $D(x_i, x_j) = D(x_j, x_i)$ • Positivity: $D(x_i, x_j) \ge 0$ for all $x_i$ and $x_j$. • It is called a dissimilarity metric if these conditions also hold: • Triangle inequality: $D(x_i, x_j) \le D(x_i, x_k) + D(x_k, x_j)$ for all $x_i$, $x_j$ and $x_k$ • Reflexivity: $D(x_i, x_j) = 0$ iff $x_i = x_j$ • It is called a semimetric if the triangle inequality does not hold. • If the following condition also holds, it is called an ultrametric: • $D(x_i, x_j) \le \max(D(x_i, x_k), D(x_j, x_k))$ for all $x_i$, $x_j$ and $x_k$.
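The conditions on this slide can be checked mechanically for a small distance matrix. The following is a minimal sketch, not a definitive implementation: the function name `check_distance_properties`, the tolerance `tol` and the three example points are illustrative assumptions, and reflexivity is tested on the matrix only (zero on the diagonal, nonzero elsewhere), which assumes the objects are pairwise distinct.

```python
from itertools import product


def check_distance_properties(D, tol=1e-12):
    """Report which conditions a square distance matrix D (list of lists) satisfies."""
    n = len(D)
    pairs = list(product(range(n), repeat=2))
    triples = list(product(range(n), repeat=3))
    return {
        "symmetry": all(abs(D[i][j] - D[j][i]) <= tol for i, j in pairs),
        "positivity": all(D[i][j] >= 0 for i, j in pairs),
        # Zero exactly on the diagonal (assumes pairwise distinct objects).
        "reflexivity": all((D[i][j] <= tol) == (i == j) for i, j in pairs),
        "triangle": all(D[i][j] <= D[i][k] + D[k][j] + tol for i, j, k in triples),
        "ultrametric": all(D[i][j] <= max(D[i][k], D[j][k]) + tol for i, j, k in triples),
    }


# Hypothetical example: three points on a line.
points = [0.0, 1.0, 3.0]
absolute = [[abs(a - b) for b in points] for a in points]   # |a - b|
squared = [[(a - b) ** 2 for b in points] for a in points]  # (a - b)^2

print(check_distance_properties(absolute))
print(check_distance_properties(squared))
```

On these points the absolute difference satisfies the triangle inequality but not the ultrametric condition, while the squared difference fails the triangle inequality and is therefore only a semimetric.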
Definition of Proximity Measures: Similarity • A similarity function $S$ is defined to satisfy the following conditions: • Symmetry: $S(x_i, x_j) = S(x_j, x_i)$ • Positivity: $0 \le S(x_i, x_j) \le 1$ for all $x_i$ and $x_j$. • It is called a similarity metric if the following additional conditions also hold: • For all $x_i$, $x_j$ and $x_k$: $S(x_i, x_j)\,S(x_j, x_k) \le [S(x_i, x_j) + S(x_j, x_k)]\,S(x_i, x_k)$ • $S(x_i, x_j) = 1$ iff $x_i = x_j$
Proximity Measures for Continuous Variables • Euclidean distance (also known as the L2 norm): $D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^2 \right)^{1/2}$, where $x_i$ and $x_j$ are d-dimensional data objects. • Euclidean distance is a metric and tends to form hyperspherical clusters. Clusters formed with Euclidean distance are also invariant to translations and rotations in the feature space. • Without normalizing the data, features with large values and variances will tend to dominate over other features. A commonly used method is data standardization, in which each feature is rescaled to zero mean and unit variance: $x_{il} = (x_{il}^* - m_l)/s_l$, where $x_{il}^*$ represents the raw data, and the sample mean $m_l$ and sample standard deviation $s_l$ over the $N$ objects are defined as $m_l = \frac{1}{N} \sum_{i=1}^{N} x_{il}^*$ and $s_l = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_{il}^* - m_l)^2}$ respectively.
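As an illustration of why standardization matters before computing Euclidean distance, here is a minimal sketch in plain Python. The function names `euclidean` and `standardize` and the sample data are assumptions made for this example; the standard deviation uses the N − 1 (sample) denominator.

```python
import math


def euclidean(x, y):
    """Euclidean (L2) distance between two d-dimensional objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize(data):
    """Rescale every feature (column) to zero mean and unit sample variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[l] for row in data) / n for l in range(d)]
    stds = [
        math.sqrt(sum((row[l] - means[l]) ** 2 for row in data) / (n - 1))
        for l in range(d)
    ]
    return [[(row[l] - means[l]) / stds[l] for l in range(d)] for row in data]


# Hypothetical data: the second feature has much larger values than the first.
raw = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
z = standardize(raw)

print(euclidean(raw[0], raw[1]))  # dominated by the second feature
print(euclidean(z[0], z[1]))      # both features now contribute comparably
```

The sketch assumes no feature is constant; a constant feature would give a zero standard deviation and needs separate handling.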
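Proximity Measures for Continuous Variables • Another normalization approach rescales each feature by its extrema, so that all features lie in the range [0,1]: $x_{il} = \frac{x_{il}^* - \min_i x_{il}^*}{\max_i x_{il}^* - \min_i x_{il}^*}$ • The Euclidean distance is a special case of a family of metrics called the Minkowski distance or Lp norm, defined as: $D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^p \right)^{1/p}$ • When p = 2, the distance becomes the Euclidean distance. • p = 1: the city-block (Manhattan) distance or L1 norm, $D(x_i, x_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|$ • p → ∞: the sup distance or L∞ norm, $D(x_i, x_j) = \max_{1 \le l \le d} |x_{il} - x_{jl}|$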
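The three special cases of the Minkowski distance can be compared on one pair of objects. This is a small sketch under the assumption that objects are plain Python sequences; the function name `minkowski` is illustrative, and `math.inf` is used to request the sup distance.

```python
import math


def minkowski(x, y, p):
    """Minkowski (Lp) distance; p = math.inf selects the sup (L-infinity) distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(diff ** p for diff in diffs) ** (1.0 / p)


x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))         # city-block distance
print(minkowski(x, y, 2))         # Euclidean distance
print(minkowski(x, y, math.inf))  # sup distance
```

For a fixed pair of objects the value does not increase as p grows, which is why the city-block distance is the largest of the three and the sup distance the smallest.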
Proximity Measures for Continuous Variables • The Mahalanobis distance is also a metric; its squared form is defined as: $D(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)$ • Here $S$ is the within-class covariance matrix, defined as $S = E[(x - \mu)(x - \mu)^T]$, where $\mu$ is the mean vector and $E[\cdot]$ calculates the expected value of a random variable. • Mahalanobis distance tends to form hyperellipsoidal clusters, which are invariant to any nonsingular linear transformation. • The calculation of the inverse of $S$ may cause some computational burden for large-scale data. • When the features are uncorrelated and have unit variance, $S$ equals the identity matrix, making the squared Mahalanobis distance equal to the squared Euclidean distance.
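To make the role of the covariance matrix concrete, the sketch below computes the squared Mahalanobis distance for the two-dimensional case only, where the inverse of S can be written out by hand. The function name `squared_mahalanobis_2d` and the example matrices are assumptions for illustration; a real application would estimate S from data and use a linear algebra library.

```python
def squared_mahalanobis_2d(x, y, S):
    """Squared Mahalanobis distance (x - y)^T S^{-1} (x - y) for 2-D objects.

    S is a 2x2 covariance matrix given as [[s11, s12], [s21, s22]].
    """
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    u = [x[0] - y[0], x[1] - y[1]]
    return sum(u[r] * inv[r][c] * u[c] for r in range(2) for c in range(2))


x, y = [0.0, 0.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[4.0, 0.0], [0.0, 1.0]]  # first feature has variance 4

print(squared_mahalanobis_2d(x, y, identity))   # same as squared Euclidean distance
print(squared_mahalanobis_2d(x, y, stretched))  # first feature is down-weighted
```

With S set to the identity matrix the result reduces to the squared Euclidean distance; a larger variance in one feature shrinks that feature's contribution.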
Proximity Measures for Continuous Variables • The point symmetry distance is based on the assumption that the cluster’s structure is symmetric: $D(x_i, x_r) = \min_{j = 1, \ldots, N,\ j \ne i} \frac{\| (x_i - x_r) + (x_j - x_r) \|}{\| x_i - x_r \| + \| x_j - x_r \|}$ • Here $x_r$ is a reference point (e.g. the centroid of the cluster) and $\| \cdot \|$ represents the Euclidean norm. • It calculates the distance between an object $x_i$ and the reference point $x_r$, given the other $N - 1$ objects, and is minimized when a symmetric pattern exists.
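The point symmetry distance can be sketched as follows, assuming the usual form in which the minimum is taken over all other objects of the norm of (x_i − x_r) + (x_j − x_r), divided by the sum of the two individual norms. The function names and the three example points are illustrative assumptions.

```python
import math


def _norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))


def point_symmetry_distance(i, points, ref):
    """Point symmetry distance of points[i] with respect to the reference point ref."""
    xi = points[i]
    best = math.inf
    for j, xj in enumerate(points):
        if j == i:
            continue
        numerator = _norm([(a - r) + (b - r) for a, b, r in zip(xi, xj, ref)])
        denominator = _norm([a - r for a, r in zip(xi, ref)]) + _norm(
            [b - r for b, r in zip(xj, ref)]
        )
        if denominator == 0:
            continue  # both objects coincide with the reference point
        best = min(best, numerator / denominator)
    return best


# Hypothetical cluster: the first two points mirror each other about the origin.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0)]
ref = (0.0, 0.0)

print(point_symmetry_distance(0, points, ref))  # has a mirror image
print(point_symmetry_distance(2, points, ref))  # has no mirror image
```

The first point has an exact mirror image about the reference point, so its distance is zero; the third point has none, so its distance is strictly positive.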
Proximity Measures for Continuous Variables • The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient, defined as $r_{ij} = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2 \sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}$, where $\bar{x}_i = \frac{1}{d} \sum_{l=1}^{d} x_{il}$. • The correlation coefficient is in the range [-1,1], with -1 and 1 indicating the strongest negative and positive correlation respectively. So we can define the distance measure as $D(x_i, x_j) = (1 - r_{ij})/2$, which is in the range [0,1]. • Features should be measured on the same scale; otherwise the mean and variance used in calculating the Pearson correlation coefficient would have no meaning.
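A correlation-based distance can be sketched in a few lines. The example assumes the distance is taken as D = (1 − r)/2, which maps the correlation range [-1,1] onto [0,1]; the function names and sample vectors are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation between two objects, computed across their d features."""
    d = len(x)
    mean_x, mean_y = sum(x) / d, sum(y) / d
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


def pearson_distance(x, y):
    """Distance in [0, 1] derived from the correlation (assumed form: (1 - r) / 2)."""
    return (1.0 - pearson(x, y)) / 2.0


x = [1.0, 2.0, 3.0]
print(pearson_distance(x, [2.0, 4.0, 6.0]))  # perfectly correlated objects
print(pearson_distance(x, [3.0, 2.0, 1.0]))  # perfectly anti-correlated objects
```

Note that the first pair has distance zero even though the second object is twice as large in every feature: the measure captures the shape of the profile, not its magnitude. The sketch assumes neither object is constant across its features, since that would make the denominator zero.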
Proximity Measures for Continuous Variables • Cosine similarity is an example of a similarity measure that can be used to compare a pair of data objects with continuous variables, given as $S(x_i, x_j) = \cos \alpha = \frac{x_i^T x_j}{\| x_i \| \, \| x_j \|}$, • which can be converted into a distance measure by simply using $D(x_i, x_j) = 1 - S(x_i, x_j)$. • Like the Pearson correlation coefficient, the cosine similarity is unable to provide information on the magnitude of differences.
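The insensitivity of cosine similarity to magnitude is easy to see numerically. The sketch below uses the standard definition (inner product divided by the product of the Euclidean norms); the function names and sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Distance obtained as 1 - S."""
    return 1.0 - cosine_similarity(x, y)


print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The first pair points in the same direction, so its distance is (numerically) zero even though one vector is twice as long as the other.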
Examples and Applications of the Proximity Measures for Continuous Variables
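Proximity Measures for Discrete Variables: Binary Variables • Invariant similarity measures for symmetric binary variables: $S(x_i, x_j) = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + w(n_{10} + n_{01})}$, where $n_{11}$ and $n_{00}$ count the features on which both objects are 1 or both are 0, and $n_{10}$ and $n_{01}$ count the features on which they differ; $w = 1$ gives the simple matching coefficient, $w = 2$ the Rogers and Tanimoto measure, and $w = 1/2$ the Gower and Legendre measure. • 1-1 matches and 0-0 matches of the variables are regarded as equally important. Unmatched pairs are weighted based on their contribution to the similarity. • For the simple matching coefficient, the corresponding dissimilarity measure from $D(x_i, x_j) = 1 - S(x_i, x_j)$ is known as the Hamming distance (here normalized by the number of features d).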
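A small sketch of the invariant measures for symmetric binary variables follows. It assumes the weighted form (n11 + n00) / (n11 + n00 + w(n10 + n01)), where w = 1 gives the simple matching coefficient; the function names and the two example vectors are hypothetical.

```python
def match_counts(x, y):
    """Return (n11, n00, n10, n01) for two equal-length 0/1 vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return n11, n00, n10, n01


def invariant_similarity(x, y, w=1.0):
    """Similarity counting 1-1 and 0-0 matches equally; mismatches weighted by w."""
    n11, n00, n10, n01 = match_counts(x, y)
    return (n11 + n00) / (n11 + n00 + w * (n10 + n01))


x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

smc = invariant_similarity(x, y, w=1.0)  # simple matching coefficient
print(smc)
print(1.0 - smc)  # fraction of differing positions (normalized Hamming distance)
print(invariant_similarity(x, y, w=2.0))  # mismatches penalized more heavily
```

With w = 1 the dissimilarity 1 − S is the fraction of positions in which the two vectors differ, here two out of five.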
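Proximity Measures for Discrete Variables: Binary Variables • Non-invariant similarity measures for asymmetric binary variables: $S(x_i, x_j) = \frac{n_{11}}{n_{11} + w(n_{10} + n_{01})}$, where $n_{11}$ counts the features on which both objects are 1, and $n_{10}$ and $n_{01}$ count the features on which they differ; $w = 1$ gives the Jaccard coefficient, $w = 2$ the Sokal and Sneath measure, and $w = 1/2$ is equivalent to the Dice coefficient. • These measures focus on 1-1 matches while ignoring the effect of 0-0 matches, which are considered uninformative. • Again, the unmatched pairs are weighted depending on their importance.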
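For asymmetric binary variables, a sketch that ignores 0-0 matches might look like this. It assumes the weighted form n11 / (n11 + w(n10 + n01)), where w = 1 is the Jaccard coefficient; the function name and example vectors are hypothetical, and returning 0.0 when neither object has any 1 is a convention chosen for this example.

```python
def noninvariant_similarity(x, y, w=1.0):
    """Similarity based on 1-1 matches only; 0-0 matches are ignored."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)  # n10 + n01
    denominator = n11 + w * mismatches
    if denominator == 0:
        return 0.0  # convention for two all-zero vectors (assumption)
    return n11 / denominator


x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(noninvariant_similarity(x, y, w=1.0))  # Jaccard coefficient
print(noninvariant_similarity(x, y, w=0.5))  # mismatches weighted less
print(noninvariant_similarity(x, y, w=2.0))  # mismatches weighted more

# Appending shared zeros does not change the result.
print(noninvariant_similarity(x + [0, 0, 0], y + [0, 0, 0], w=1.0))
```

Appending features on which both objects are 0 leaves the value unchanged, which is the practical meaning of treating 0-0 matches as uninformative.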
Proximity Measures for Discrete Variables with More than Two Values • One simple and direct approach is to map the variables into new binary features. • It is simple, but it may introduce too many binary variables. • A more effective and commonly used method is based on the matching criterion. For a pair of d-dimensional objects $x_i$ and $x_j$, the similarity using the simple matching criterion is given as: $S(x_i, x_j) = \frac{1}{d} \sum_{l=1}^{d} S_{ijl}$, where $S_{ijl} = 1$ if $x_i$ and $x_j$ match in the lth feature and $S_{ijl} = 0$ if they do not.
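The simple matching criterion for categorical features amounts to counting the features on which two objects agree and dividing by d. A minimal sketch, with hypothetical feature values:

```python
def simple_matching(x, y):
    """Fraction of the d categorical features on which two objects take the same value."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of features")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# Hypothetical objects described by (color, body type, fuel).
car_a = ["red", "suv", "diesel"]
car_b = ["red", "sedan", "diesel"]

print(simple_matching(car_a, car_b))
```

No binary recoding is needed here: the comparison works directly on the category labels, whatever the number of levels per feature.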
Proximity Measures for Discrete Variables with More than Two Values • Categorical features may display a certain order; these are known as ordinal features. • In this case, the codes from 1 to $M_l$, where $M_l$ is the highest level, are no longer meaningless in similarity measures. In fact, the closer two levels are, the more similar the two objects are in that feature. • Objects with this type of feature can be compared using the continuous dissimilarity measures. Since the number of possible levels varies between features, the original ranks $r_{il}^*$ for the ith object in the lth feature are usually converted into new ranks $r_{il}$ in the range [0,1], using the following method: $r_{il} = \frac{r_{il}^* - 1}{M_l - 1}$ • Then the city-block or Euclidean distance can be used.
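The rank conversion for ordinal features can be sketched as below. The mapping r = (r* − 1)/(M − 1) is assumed here as the conversion into [0,1]; the function names and the two example objects are hypothetical, and every feature is assumed to have at least two levels.

```python
def rescale_ranks(ranks, levels):
    """Map original ranks 1..M_l of each ordinal feature onto the range [0, 1]."""
    return [(r - 1) / (m - 1) for r, m in zip(ranks, levels)]


def city_block(x, y):
    """City-block (L1) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical features: headache degree (5 levels) and a 3-level rating.
levels = [5, 3]
patient_a = rescale_ranks([2, 3], levels)
patient_b = rescale_ranks([4, 1], levels)

print(patient_a, patient_b)
print(city_block(patient_a, patient_b))
```

After rescaling, a move from the lowest to the highest level counts as a difference of 1 in every feature, regardless of how many levels that feature has.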
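Proximity Measures for Mixed Variables • The similarity measure for a pair of d-dimensional mixed data objects $x_i$ and $x_j$ can be defined as: $S(x_i, x_j) = \frac{\sum_{l=1}^{d} \delta_{ijl} S_{ijl}}{\sum_{l=1}^{d} \delta_{ijl}}$, where $S_{ijl}$ indicates the similarity for the lth feature between the two objects, and $\delta_{ijl}$ is a 0-1 coefficient that is 0 when the measurement of the lth feature is missing for either object and 1 otherwise. Correspondingly, the dissimilarity measure can be obtained by simply using $D(x_i, x_j) = 1 - S(x_i, x_j)$. • The component similarity for discrete variables: $S_{ijl} = 1$ if $x_{il} = x_{jl}$ and $S_{ijl} = 0$ otherwise. • For continuous variables: $S_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}$, where $R_l$ is the range of the lth variable over all objects, written as $R_l = \max_m x_{ml} - \min_m x_{ml}$.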
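The mixed-variable similarity can be sketched as an average of per-feature similarities over the features that are present in both objects. The code below is an illustration under stated assumptions: missing values are encoded as `None`, each feature is tagged as `"discrete"` or `"continuous"`, the ranges R_l are supplied by the caller, and all names are hypothetical.

```python
def mixed_similarity(x, y, kinds, ranges):
    """Average per-feature similarity over features observed in both objects.

    kinds[l] is "discrete" or "continuous"; ranges[l] is R_l for continuous
    features (ignored for discrete ones). None marks a missing value.
    """
    total, observed = 0.0, 0
    for a, b, kind, r in zip(x, y, kinds, ranges):
        if a is None or b is None:
            continue  # delta = 0: feature does not contribute
        if kind == "discrete":
            s = 1.0 if a == b else 0.0
        else:
            s = 1.0 - abs(a - b) / r
        total += s
        observed += 1
    if observed == 0:
        raise ValueError("no feature is observed in both objects")
    return total / observed


kinds = ["discrete", "continuous", "continuous"]
ranges = [None, 20.0, 5.0]
obj_a = ["red", 10.0, None]  # third feature missing
obj_b = ["blue", 15.0, 3.0]

s = mixed_similarity(obj_a, obj_b, kinds, ranges)
print(s)        # similarity
print(1.0 - s)  # corresponding dissimilarity
```

In this example the third feature is skipped because it is missing in one object, so the similarity is the mean of the discrete mismatch (0) and the continuous component 1 − 5/20.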