Clustering Methods: Part 6

Dimensionality

Ilja Sidoroff

Pasi Fränti

Speech and Image Processing Unit, Department of Computer Science

University of Joensuu, FINLAND

Dimensionality of data
• Dimensionality of a data set = the minimum number of free variables needed to represent the data without information loss
• A d-attribute data set has an intrinsic dimensionality (ID) of M if its elements lie entirely within an M-dimensional subspace of R^d (M < d)
Dimensionality of data
• The use of more dimensions than necessary leads to problems:
• greater storage requirements
• algorithms run more slowly
• finding clusters and building good classifiers becomes more difficult (curse of dimensionality)
Curse of dimensionality
• When the dimensionality of the space increases, distance measures become less useful
• all points become more or less equidistant from each other
• most of the volume of a sphere is concentrated in a thin layer near its surface (see below)

V(r) – volume of sphere with radius r

D – dimension of the sphere
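
Presumably this refers to the standard volume relation: since V(r) grows as r^D, the fraction of the volume lying within a thin layer of thickness ε at the surface tends to one as D grows:

```latex
\frac{V(r) - V(r - \varepsilon)}{V(r)}
  = 1 - \left(1 - \frac{\varepsilon}{r}\right)^{D}
  \;\to\; 1 \quad \text{as } D \to \infty
```

For example, with ε/r = 0.05 the outer layer contains about 5% of the volume when D = 1, but more than 99% when D = 100.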

Two approaches
• Estimation of dimensionality
• knowing the ID of a data set could help in tuning classification or clustering performance
• Dimensionality reduction
• projecting the data onto some subspace
• e.g. 2D/3D visualisation of a multi-dimensional data set
• may result in information loss if the subspace dimension is smaller than the ID
Goodness of the projection

Can be estimated by two measures:

• Trustworthiness: data points that are not neighbours in the input space are not mapped as neighbours in the output space.
• Continuity: data points that are close in the input space are not mapped far apart in the output space [11].
Trustworthiness
• N – number of feature vectors
• r(i,j) – the rank of data sample j in the ordering according to the distance from i in the original data space
• Uk(i) – set of feature vectors that are in the k-neighbourhood of sample i in the projection space but not in the original space
• A(k) – scales the measure between 0 and 1
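
Following the definition used by Venna and Kaski [11], the trustworthiness measure built from these quantities is (for k < N/2):

```latex
M_{1}(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in U_{k}(i)} \bigl( r(i,j) - k \bigr),
\qquad
A(k) = \frac{2}{N k (2N - 3k - 1)}
```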
Continuity
• r'(i,j) – the rank of data sample j in the ordering according to the distance from i in the projection space
• Vk(i) – set of feature vectors that are in the k-neighbourhood of sample i in the original space but not in the projection space
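
Correspondingly, the continuity measure is presumably:

```latex
M_{2}(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in V_{k}(i)} \bigl( r'(i,j) - k \bigr)
```

with the same normalizing factor A(k) as in trustworthiness.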
Example data sets
• Swiss roll: 20000 3D points
• 2D manifold in 3D space
• http://isomap.stanford.edu
Example data sets
• 16 × 16 pixel images of hands in different positions
• Each image can be considered as a 4096-dimensional data element
• Could also be interpreted in terms of finger extension – wrist rotation (2D)
Example data sets

http://isomap.stanford.edu

Synthetic data sets [11]
• Sphere
• S-shaped manifold
• Six clusters

Principal component analysis (PCA)
• Idea: find the directions of maximal variance and align the coordinate axes with them.
• If variance is zero, that dimension is not needed.
• Drawback: works well only with linear data [1]
PCA method (1/2)
• Center the data so that its mean is zero
• Calculate the covariance matrix of the data
• Calculate the eigenvalues and eigenvectors of the covariance matrix
• Sort the eigenvectors in decreasing order of their eigenvalues
• For dimensionality reduction, choose the desired number of eigenvectors (2 or 3 for visualization)
PCA Method
• Intrinsic dimensionality = number of non-zero eigenvalues
• Dimensionality reduction by projection: yi = Axi
• Here xi is the input vector, yi the output vector, and A is the matrix containing eigenvectors corresponding to the largest eigenvalues.
• For visualization, typically 2 or 3 eigenvectors are preserved.
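
A minimal numpy sketch of these steps; the function name and the toy data are illustrative, not part of the original material:

```python
import numpy as np

def pca_project(X, n_components=2):
    """PCA by eigendecomposition of the covariance matrix.

    X: (N, d) data matrix, one feature vector per row.
    Returns the projected data (N, n_components) and all eigenvalues
    sorted in decreasing order.
    """
    Xc = X - X.mean(axis=0)                    # 1. center the data
    C = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # 3. eigenvalues / eigenvectors
    order = np.argsort(eigvals)[::-1]          # 4. sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    A = eigvecs[:, :n_components].T            # 5. keep the leading eigenvectors
    return Xc @ A.T, eigvals                   #    project: y_i = A x_i

# Toy example: 3-D points lying on a 2-D plane -> one eigenvalue is ~0,
# so the estimated intrinsic dimensionality is 2.
rng = np.random.default_rng(0)
plane = rng.normal(size=(1000, 2))
X = plane @ np.array([[1.0, 0.0, 2.0],
                      [0.0, 1.0, -1.0]])
Y, eigvals = pca_project(X)
print(eigvals)
```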
Example of PCA
• The distances between points are different in projections.
• Test set c:
• two clusters are projected into one cluster
• s-shaped cluster is projected nicely
Another example of PCA [10]
• Data set: points lying on the circle x² + y² = 1 (a 1-dimensional manifold embedded in 2D)
• PCA yields two non-null eigenvalues, i.e. it overestimates the ID
• u, v – principal components
Limitations of PCA
• Since the projection directions (eigenvectors) are orthogonal, PCA works well only with linear data
• Tends to overestimate ID
• Kernel PCA uses the so-called kernel trick to apply PCA also to nonlinear data
• make a nonlinear projection into a higher-dimensional space, then perform PCA in that space (see the sketch below)
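
As an illustration only, one way to try kernel PCA in practice is scikit-learn's KernelPCA; the kernel choice and parameter values below are arbitrary:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy nonlinear data: points on a circle (a 1-D manifold in 2-D)
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(t), np.sin(t)]

# The RBF kernel corresponds to an implicit nonlinear mapping into a
# higher-dimensional space; PCA is then performed in that space.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
Y = kpca.fit_transform(X)
```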
Multidimensional scaling method (MDS)
• Project data into a new space while trying to preserve distances between data points
• Define stress E (difference of pairwise distances in original and projection spaces)
• E is minimized using some optimization algorithm
• With certain stress functions (e.g. Kruskal's), E = 0 means that a perfect projection exists
• The ID of the data is the smallest projection dimension for which a perfect projection exists
Metric MDS

The simplest stress function [2], raw stress:
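
Presumably the formula is the usual raw stress:

```latex
E = \sum_{i < j} \bigl( d(x_i, x_j) - d(y_i, y_j) \bigr)^{2}
```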

d(xi, xj) – distance in the original space

d(yi, yj) – distance in the projection space

yi, yj – representations of xi, xj in the output space

Sammon's Mapping
• Sammon's mapping gives small distances a larger weight [5]:
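
The stress function is presumably the standard Sammon stress, which divides each squared error by the original distance:

```latex
E = \frac{1}{\sum_{i<j} d(x_i, x_j)}
    \sum_{i<j} \frac{\bigl( d(x_i, x_j) - d(y_i, y_j) \bigr)^{2}}{d(x_i, x_j)}
```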
Kruskal's stress
• Ranking the point distances accounts for decreasing distances in lower dimensional projections:
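
Kruskal's stress-1 is usually written as:

```latex
S = \sqrt{ \frac{\sum_{i<j} \bigl( d(y_i, y_j) - \hat{d}_{ij} \bigr)^{2}}
                {\sum_{i<j} d(y_i, y_j)^{2}} }
```

where the values \hat{d}_{ij} are obtained from the original distances by monotone (rank-preserving) regression.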
MDS example
• Separates clusters better than PCA
• Local structures are not always preserved (leftmost test set)
Other MDS approaches
• ISOMAP [12]
• Curvilinear component analysis CCA [13]
Local methods
• Previous methods are global in the sense that all input data is considered at once.
• Local methods consider only some neighbourhood of the data points => may be computationally less demanding
• Try to estimate topological dimension of the data manifold
Fukunaga-Olsen algorithm [6]
• Assume that the data can be divided into small regions, i.e. clustered
• Each cluster (Voronoi set) of the data lies on an approximately linear surface => the PCA method can be applied to each cluster
• Eigenvalues are normalized by dividing by the largest eigenvalue
Fukunaga-Olsen algorithm
• ID is defined as the number of normalized eigenvalues that are larger than a threshold T
• Defining a good threshold is itself a problem (a rough sketch of the procedure is given below)
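
A rough sketch of the idea, assuming k-means for the clustering step; the number of clusters, the threshold value, and the averaging of the local estimates are illustrative choices rather than part of [6]:

```python
import numpy as np
from sklearn.cluster import KMeans

def fukunaga_olsen_id(X, n_clusters=10, threshold=0.05):
    """Local PCA per cluster: the local ID is the number of normalized
    eigenvalues above the threshold; local estimates are averaged."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    local_ids = []
    for c in range(n_clusters):
        Xc = X[labels == c]
        if len(Xc) <= X.shape[1]:
            continue                          # too few points for a covariance estimate
        eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))
        normalized = eigvals / eigvals.max()  # normalize by the largest eigenvalue
        local_ids.append(int(np.sum(normalized > threshold)))
    return float(np.mean(local_ids))
```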
Near neighbour algorithm
• Trunk's method [7]:
• An initial value for an integer parameter k is chosen (usually k=1).
• The k nearest neighbours of each data vector are identified.
• For each data vector i, the subspace spanned by the vectors from i to each of its k neighbours is constructed.


Near neighbour algorithm
• The angle between the vector to the (k+1)th nearest neighbour and its projection onto the subspace is calculated for each data vector
• If the average of these angles is below a threshold, ID is k, otherwise increase k and repeat the process
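
A rough numpy sketch of the angle computation for a single value of k; the helper name is illustrative, and the outer loop that increases k until the average angle drops below the threshold is omitted:

```python
import numpy as np

def average_kth_angle(X, k):
    """For each vector: span a subspace with the vectors to its k nearest
    neighbours, project the vector to the (k+1)th neighbour onto it, and
    return the mean angle (radians) between that vector and its projection."""
    angles = []
    for x in X:
        dist = np.linalg.norm(X - x, axis=1)
        order = np.argsort(dist)[1:]           # skip the point itself
        nn = X[order[:k]] - x                  # spanning vectors of the subspace
        extra = X[order[k]] - x                # vector to the (k+1)th neighbour
        Q, _ = np.linalg.qr(nn.T)              # orthonormal basis of the subspace
        proj = Q @ (Q.T @ extra)               # projection onto the subspace
        cos = proj @ extra / (np.linalg.norm(proj) * np.linalg.norm(extra) + 1e-12)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles))
```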


Near neighbour algorithm
• It is not clear how to select a suitable value for the threshold
• Improvements to Trunk's method
• Pettis et al. [8]
• Verveer-Duin [9]
Fractal methods
• Global methods, but with a different definition of dimensionality
• Basic idea:
• count the number of observations inside a ball of radius r, denoted f(r)
• analyse the growth rate of f(r)
• if f grows as r^k, the dimensionality of the data can be considered to be k
Fractal methods
• Dimensionality can be fractional, e.g. 1.5
• Therefore fractal methods do not provide projections into a lower-dimensional space (what would R^1.5 be, anyway?)
• Fractal dimensionality estimate can be used in time-series analysis etc. [10]
Fractal methods
• Different definitions for fractal dimensions [10]
• Hausdorff dimension
• Box-counting dimension
• Correlation dimension
• In order to get an accurate estimate of the dimension D, the data set cardinality must be at least 10^(D/2)
Hausdorff dimension
• the data set is covered by cells si with variable diameter ri, all ri < r
• in other words, we look for a collection of covering sets si with diameter less than or equal to r which minimizes the sum
• d-dimensional Hausdorff measure:
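
Presumably the measure is the usual one: for each r take the infimum over all admissible coverings, then let r tend to zero:

```latex
\Gamma_{H}^{d}(r) = \inf \Bigl\{ \sum_{i} r_{i}^{\,d} \;:\;
  \text{the cells } s_i \text{ cover the data set and } r_i \le r \Bigr\},
\qquad
\Gamma_{H}^{d} = \lim_{r \to 0} \Gamma_{H}^{d}(r)
```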
Hausdorff dimension
• For every data set ΓdH is infinite if d is less than some critical value DH, and 0 if d is greater than DH
• The critical value DH is the Hausdorff dimension of the data set
Box-Counting dimension
• Hausdorff dimension is not easy to calculate
• The Box-Counting dimension DB is an upper bound of the Hausdorff dimension, and usually does not differ from it:

v(r) – the number of boxes of size r needed to cover the data set
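
The definition is presumably the standard limit:

```latex
D_{B} = \lim_{r \to 0} \frac{\ln v(r)}{\ln (1/r)}
```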

Box-Counting dimension
• Although Box-Counting dimension is easier to calculate than Hausdorff dimension, the algorithmic complexity grows exponentially with the set dimensionality => can be used only for low-dimensional data sets
• Correlation dimension is computationally more feasible fractal dimension measure
• Correlation dimension is a lower bound of the Box-Counting dimension
Correlation dimension
• Let x1, x2, x3, ..., xN be data points
• The correlation integral can be defined as:

I(x) is the indicator function:

I(x) = 1, if x is true,

I(x) = 0, otherwise.
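
Presumably the intended formula is the standard Grassberger-Procaccia form:

```latex
C(r) = \lim_{N \to \infty} \frac{2}{N(N-1)}
       \sum_{i=1}^{N} \sum_{j=i+1}^{N} I\bigl( \lVert x_i - x_j \rVert \le r \bigr)
```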

Correlation dimension

• The correlation dimension DC is the limit of ln C(r) / ln r as r → 0; in practice it is estimated as the slope of ln C(r) versus ln r over a range of small radii
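
A small numpy sketch of this estimate; the toy data (points on a circle, so the expected estimate is close to 1) and the range of radii are arbitrary choices:

```python
import numpy as np

def correlation_dimension(X, radii):
    """Estimate the correlation dimension as the slope of log C(r) vs log r."""
    N = len(X)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(N, k=1)]
    C = np.array([(dists <= r).mean() for r in radii])   # correlation integral C(r)
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)   # slope of the log-log curve
    return slope

# Example: 1000 points on a unit circle embedded in 3-D
rng = np.random.default_rng(1)
t = rng.uniform(0.0, 2.0 * np.pi, 1000)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
print(correlation_dimension(X, radii=np.logspace(-2, -0.5, 10)))
```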

Literature
[1] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, John Wiley and Sons, 2001.
[2] J. B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29 (1964) 1–27.
[3] R. N. Shepard, The analysis of proximities: Multidimensional scaling with an unknown distance function, Psychometrika 27 (1962) 125–140.
[4] R. S. Bennett, The intrinsic dimensionality of signal collections, IEEE Transactions on Information Theory 15 (1969) 517–525.
[5] J. W. Sammon Jr., A nonlinear mapping for data structure analysis, IEEE Transactions on Computers C-18 (1969) 401–409.
[6] K. Fukunaga, D. R. Olsen, An algorithm for finding intrinsic dimensionality of data, IEEE Transactions on Computers C-20 (2) (1971) 176–183.
[7] G. V. Trunk, Statistical estimation of the intrinsic dimensionality of a noisy signal collection, IEEE Transactions on Computers C-25 (1976) 165–171.

Literature

[8] K. Pettis, T. Bailey, A. K. Jain, R. Dubes, An intrinsic dimensionality estimator from near-neighbor information, IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1) (1979) 25–37.
[9] P. J. Verveer, R. P. W. Duin, An evaluation of intrinsic dimensionality estimators, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1) (1995) 81–86.
[10] F. Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition 36 (2003) 2945–2954.
[11] J. Venna, Dimensionality reduction for visual exploration of similarity structures, PhD thesis manuscript (submitted), 2007.
[12] J. B. Tenenbaum, V. de Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[13] P. Demartines, J. Herault, Curvilinear component analysis: A self-organizing neural network for nonlinear mapping in cluster analysis, IEEE Transactions on Neural Networks 8 (1) (1997) 148–154.