## CLUSTERING PROXIMITY MEASURES


By Çağrı Sarıgöz. Submitted to Assoc. Prof. Turgay İbrikçi, EE 639.

**Classification**
• Classifying has been one of the crucial thought activities of humankind.
• It makes it easier to perceive the outside world and act accordingly.
• Aristotle’s Classification of Living Things is one of the most famous classification works dating back to ancient times.

**Cluster Analysis**
• Cluster analysis brings mathematical methodology to the solution of classification problems.
• It deals with the classification or grouping of data into a set of categories or clusters.
• Data objects in the same cluster should be similar, and objects in different clusters should be dissimilar, in some context.
• Determining this context is generally a subjective matter.

**Approaching the Data Objects**
• Feature Types
  • Continuous
  • Discrete
  • Binary
• Measurement Levels
  • Qualitative
    • Nominal
    • Ordinal
  • Quantitative
    • Interval
    • Ratio

**Feature Types**
• A continuous feature can take a value from an uncountably infinite range.
  • Example: the exact weight of a person.
• A discrete feature has a range of values that is finite or countably infinite.
  • Example: the number of heartbeats of a person per minute.
• A binary feature is a special case of a discrete feature that can take only two values.
  • Example: the presence or absence of tattoos on a person’s skin.

**Measurement Levels: Qualitative**
• Features at the nominal level have no mathematical meaning; they are generally labels, states, or names.
  • Example: the color of a car, the condition of the weather.
• Features at the ordinal level are still just names, but with a certain order. The differences between the values are still meaningless in a mathematical sense.
  • Example: degrees of headache (none, slight, moderate, severe, unbearable).

**Measurement Levels: Quantitative**
• At the interval level, the difference between feature values has a meaning, but there is no true zero in the range, so the ratio between two values has no meaning.
  • Example: IQ score. A person with an IQ score of 140 is not necessarily twice as intelligent as a person with an IQ score of 70.
• Features at the ratio level have all the properties of the other levels, plus a true zero, so the ratio between two values has a mathematical meaning.
  • Example: the number of cars in a parking lot.

**Definition of Proximity Measures: Dissimilarity (Distance)**
• A dissimilarity or distance function $D$ on a data set $X$ is defined to satisfy these conditions:
  • Symmetry: $D(x_i, x_j) = D(x_j, x_i)$
  • Positivity: $D(x_i, x_j) \ge 0$ for all $x_i$ and $x_j$
• It is called a dissimilarity metric if these conditions also hold:
  • Triangle inequality: $D(x_i, x_j) \le D(x_i, x_k) + D(x_k, x_j)$ for all $x_i$, $x_j$ and $x_k$
  • Reflexivity: $D(x_i, x_j) = 0$ iff $x_i = x_j$
• It is called a semimetric if the triangle inequality does not hold.
• If the following condition also holds, it is called an ultrametric:
  • $D(x_i, x_j) \le \max(D(x_i, x_k), D(x_j, x_k))$ for all $x_i$, $x_j$ and $x_k$

**Definition of Proximity Measures: Similarity**
• A similarity function $S$ is defined to satisfy the following conditions:
  • Symmetry: $S(x_i, x_j) = S(x_j, x_i)$
  • Positivity: $0 \le S(x_i, x_j) \le 1$ for all $x_i$ and $x_j$
• It is called a similarity metric if the following additional conditions also hold:
  • $S(x_i, x_j)\,S(x_j, x_k) \le [S(x_i, x_j) + S(x_j, x_k)]\,S(x_i, x_k)$ for all $x_i$, $x_j$ and $x_k$
  • $S(x_i, x_j) = 1$ iff $x_i = x_j$

**Proximity Measures for Continuous Variables**
• Euclidean distance (also known as the $L_2$ norm):
$$D(x_i, x_j) = \left( \sum_{l=1}^{d} (x_{il} - x_{jl})^2 \right)^{1/2}$$
where $x_i$ and $x_j$ are $d$-dimensional data objects.
• Euclidean distance is a metric and tends to form hyperspherical clusters. Clusters formed with Euclidean distance are also invariant to translations and rotations in the feature space.
• Without normalizing the data, features with large values and variances will tend to dominate over other features.
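• A commonly used normalization method is data standardization, in which each feature is rescaled to have zero mean and unit variance:
$$x_{il} = \frac{x_{il}^* - m_l}{s_l}$$
where $x_{il}^*$ represents the raw data, and the sample mean $m_l$ and sample standard deviation $s_l$ over the $N$ data objects are defined as
$$m_l = \frac{1}{N} \sum_{i=1}^{N} x_{il}^* \quad \text{and} \quad s_l = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_{il}^* - m_l)^2}$$
respectively.

**Proximity Measures for Continuous Variables**
• Another normalization approach scales each feature to the range $[0, 1]$ using its minimum and maximum:
$$x_{il} = \frac{x_{il}^* - \min_i x_{il}^*}{\max_i x_{il}^* - \min_i x_{il}^*}$$
• The Euclidean distance is a special case of a family of metrics called the Minkowski distance or $L_p$ norm, defined as:
$$D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^p \right)^{1/p}$$
• $p = 2$: the distance becomes the Euclidean distance.
• $p = 1$: the city-block (Manhattan) distance or $L_1$ norm, $D(x_i, x_j) = \sum_{l=1}^{d} |x_{il} - x_{jl}|$.
• $p \to \infty$: the sup distance or $L_\infty$ norm, $D(x_i, x_j) = \max_{1 \le l \le d} |x_{il} - x_{jl}|$.

**Proximity Measures for Continuous Variables**
• The Mahalanobis distance is also a metric. Its squared form is:
$$D(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)$$
where $S$ is the within-class covariance matrix, defined as $S = E[(x - \mu)(x - \mu)^T]$, $\mu$ is the mean vector, and $E[\cdot]$ denotes the expected value of a random variable.
• Mahalanobis distance tends to form hyperellipsoidal clusters, which are invariant to any nonsingular linear transformation.
• Calculating the inverse of $S$ may cause a computational burden for large-scale data.
• When the features are uncorrelated and have unit variance, $S$ is the identity matrix, and the squared Mahalanobis distance equals the squared Euclidean distance.

**Proximity Measures for Continuous Variables**
• The point symmetry distance is based on the assumption that the cluster’s structure is symmetric:
$$D(x_i, x_r) = \min_{j = 1, \dots, N,\; j \ne i} \frac{\|(x_i - x_r) + (x_j - x_r)\|}{\|x_i - x_r\| + \|x_j - x_r\|}$$
where $x_r$ is a reference point (e.g. the centroid of the cluster) and $\|\cdot\|$ represents the Euclidean norm.
• It calculates the distance between an object $x_i$ and the reference point $x_r$, given the other $N - 1$ objects, and is minimized when a symmetric pattern exists.

**Proximity Measures for Continuous Variables**
• The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient, defined as:
$$r_{ij} = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2 \sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}$$
where $\bar{x}_i = \frac{1}{d} \sum_{l=1}^{d} x_{il}$.
• The correlation coefficient is in the range $[-1, 1]$, with $-1$ and $1$ indicating the strongest negative and positive correlation respectively. So we can define the distance measure as $D(x_i, x_j) = (1 - r_{ij})/2$, which is in the range $[0, 1]$.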
• Features should be measured on the same scale; otherwise the mean and variance used in calculating the Pearson correlation coefficient would have no meaning.

**Proximity Measures for Continuous Variables**
• Cosine similarity is an example of a similarity measure that can be used to compare a pair of data objects with continuous variables, given as:
$$S(x_i, x_j) = \cos\alpha = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$$
which can be turned into a distance measure by simply using $D(x_i, x_j) = 1 - S(x_i, x_j)$.
• Like the Pearson correlation coefficient, cosine similarity is unable to provide information on the magnitude of differences.

**Examples and Applications of the Proximity Measures for Continuous Variables**

**Proximity Measures for Discrete Variables: Binary Variables**
• Invariant similarity measures for symmetric binary variables:
$$S(x_i, x_j) = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + w(n_{10} + n_{01})}$$
where $n_{11}$ and $n_{00}$ are the numbers of features in which both objects are 1 or both are 0, $n_{10}$ and $n_{01}$ are the numbers of features in which the two objects differ, and $w$ is a weight. $w = 1$ gives the simple matching coefficient.
• The 1-1 and 0-0 matches of the variables are regarded as equally important. Unmatched pairs are weighted based on their contribution to the similarity.
• For the simple matching coefficient, the corresponding dissimilarity measure from $D(x_i, x_j) = 1 - S(x_i, x_j)$ is known as the Hamming distance.

**Proximity Measures for Discrete Variables: Binary Variables**
• Non-invariant similarity measures for asymmetric binary variables:
$$S(x_i, x_j) = \frac{n_{11}}{n_{11} + w(n_{10} + n_{01})}$$
where $w = 1$ gives the Jaccard coefficient.
• These measures focus on 1-1 matches while ignoring the effect of 0-0 matches, which are considered uninformative.
• Again, the unmatched pairs are weighted depending on their importance.

**Proximity Measures for Discrete Variables with More than Two Values**
• One simple and direct approach is to map the variables into new binary features.
• It is simple, but it may introduce too many binary variables.
• A more effective and commonly used method is based on a matching criterion. For a pair of $d$-dimensional objects $x_i$ and $x_j$, the similarity using the simple matching criterion is given as:
$$S(x_i, x_j) = \frac{1}{d} \sum_{l=1}^{d} S_{ijl}$$
where $S_{ijl} = 1$ if $x_i$ and $x_j$ match in the $l$th feature and $S_{ijl} = 0$ otherwise.

**Proximity Measures for Discrete Variables with More than Two Values**
• Categorical features may display a certain order; these are known as ordinal features.
• In this case, the codes from 1 to $M_l$, where $M_l$ is the highest level, are no longer meaningless in similarity measures. In fact, the closer two levels are, the more similar the two objects are in that feature.
• Objects with this type of feature can be compared using the continuous dissimilarity measures. Since the number of possible levels varies between features, the original ranks $r_{il}^*$ of the $i$th object in the $l$th feature are usually converted into new ranks $r_{il}$ in the range $[0, 1]$, using the following method:
$$r_{il} = \frac{r_{il}^* - 1}{M_l - 1}$$
• Then the city-block or Euclidean distance can be used.

**Proximity Measures for Mixed Variables**
• The similarity measure for a pair of $d$-dimensional mixed data objects $x_i$ and $x_j$ can be defined as:
$$S(x_i, x_j) = \frac{\sum_{l=1}^{d} \delta_{ijl} S_{ijl}}{\sum_{l=1}^{d} \delta_{ijl}}$$
where $S_{ijl}$ indicates the similarity for the $l$th feature between the two objects, and $\delta_{ijl}$ is a 0-1 coefficient that is 0 when the measurement of the $l$th feature is missing for either object and 1 otherwise. Correspondingly, the dissimilarity measure can be obtained by simply using $D(x_i, x_j) = 1 - S(x_i, x_j)$.
• The component similarity for discrete variables: $S_{ijl} = 1$ if $x_{il} = x_{jl}$, and $S_{ijl} = 0$ otherwise.
• For continuous variables:
$$S_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}$$
where $R_l$ is the range of the $l$th variable over all objects, written as $R_l = \max_m x_{ml} - \min_m x_{ml}$.
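
The mixed-variable similarity can be illustrated with a short Python sketch. It is not taken from the slides: the function names `feature_range` and `mixed_similarity`, the use of `None` for a missing value, and the toy data are all assumptions made for illustration. It also assumes the usual component definitions: exact match (1 or 0) for discrete features, and one minus the absolute difference divided by the feature range (maximum minus minimum over all objects) for continuous features.

```python
def feature_range(objects, l):
    """R_l: max minus min of feature l over all objects, skipping missing values."""
    values = [obj[l] for obj in objects if obj[l] is not None]
    return max(values) - min(values)


def mixed_similarity(xi, xj, kinds, ranges):
    """Average the per-feature similarities over the features present in both objects.

    kinds[l] is "discrete" or "continuous" (hypothetical labels for this sketch);
    ranges[l] is R_l for continuous features and is ignored for discrete ones.
    """
    total = 0.0      # sum of delta_ijl * S_ijl
    available = 0    # sum of delta_ijl
    for a, b, kind, r in zip(xi, xj, kinds, ranges):
        if a is None or b is None:
            continue  # delta_ijl = 0: the feature is missing for one of the objects
        available += 1
        if kind == "discrete":
            total += 1.0 if a == b else 0.0
        elif r > 0:
            total += 1.0 - abs(a - b) / r
        else:
            total += 1.0  # constant feature: all objects agree on it
    if available == 0:
        raise ValueError("no feature is available for both objects")
    return total / available


# Toy data: (a continuous measurement, a nominal color, a binary answer).
objects = [
    (2.0, "red", "yes"),
    (4.0, "red", None),   # third feature missing
    (10.0, "blue", "no"),
]
kinds = ("continuous", "discrete", "discrete")
ranges = (feature_range(objects, 0), None, None)

s01 = mixed_similarity(objects[0], objects[1], kinds, ranges)
s02 = mixed_similarity(objects[0], objects[2], kinds, ranges)
print(s01, 1.0 - s01)  # similarity and the corresponding dissimilarity
print(s02, 1.0 - s02)
```

For the first pair, the continuous feature contributes 1 − 2/8 = 0.75, the color matches (1), and the third feature is skipped because it is missing, so the similarity is 1.75/2 = 0.875. The first and third objects differ maximally in every feature, giving a similarity of 0.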