
Elastic Maps for Data Analysis


Presentation Transcript


  1. Elastic Maps for Data Analysis Alexander Gorban, Leicester, with Andrei Zinovyev, Paris

  2. Plan of the talk INTRODUCTION • Two paradigms for data analysis: statistics and modelling • Clustering and K-means • Self Organizing Maps • PCA and local PCA

  3. Plan of the talk 1. Principal manifolds and elastic maps • The notion of principal manifold (PM) • Constructing PMs: elastic maps • Adaptation and grammars 2. Application technique • Projection and regression • Maps and visualization of functions 3. Implementation and examples

  4. Two basic paradigms for data analysis Data set Statistical Analysis Data Modelling

  5. Statistical Analysis • Existence of a Probability Distribution; • Statistical Hypothesis about Data Generation; • Verification/Falsification of Hypotheses about Hidden Properties of the Data Distribution

  6. Data Modelling Universe of models • We should find the Best Model for Data description; • We know the Universe of Models; • We know the Fitting Criteria; • Analysis of Learning Errors and Generalization Errors for Model Verification

  7. Example: Simplest Clustering

  8. K-means algorithm. Centers y(i), data points x(j), classes K(i); the objective is U = Σi Σx(j)∈K(i) ||x(j) − y(i)||². • Minimize U for given {K(i)} (find centers); • Minimize U for given {y(i)} (find classes); • If the {K(i)} change, then go to step 1.
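The two alternating minimization steps can be sketched in a few lines of NumPy (a minimal illustration, not the speakers' code; the function and variable names are my own):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate the two minimization steps of U until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # find classes: assign each x(j) to the nearest center y(i)
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # find centers: mean of each class (keep empty classes where they are)
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                        for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

Each step can only decrease U, so the loop terminates at a (local) minimum.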

  9. “Centers” can be lines, manifolds, … with the same algorithm: 1st principal components + mean points for the classes instead of the simple means

  10. SOM - Self Organizing Maps • The set of nodes is a finite metric space with distance d(N,M); • 0) Map the set of nodes into data space, N→f0(N); • 1) Select a data point X (at random); • 2) Find the nearest fi(N) (N=NX); • 3) fi+1(N) = fi(N) + wi(d(N, NX))(X − fi(N)), where wi(d) (0 < wi(d) < 1) is a decreasing cutting function. The closest node to X is moved the most in the direction of X, while other nodes are moved by smaller amounts depending on their distance from the closest node in the initial geometry.
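A single SOM update, following the rule above, might look like this (a sketch assuming the node graph's metric is supplied as a precomputed distance matrix; the names are illustrative):

```python
import numpy as np

def som_step(nodes, grid_dist, x, w):
    """One SOM update for data point x.

    nodes: node images in data space (n, dim)
    grid_dist: (n, n) distances d(N, M) in the initial node geometry
    w: decreasing cutting function, 0 < w(d) <= 1
    """
    bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))   # nearest node in data space
    weights = w(grid_dist[bmu])                       # neighborhood weights
    return nodes + weights[:, None] * (x - nodes)     # every node moves toward x
```

The relative step size of each node is exactly w(d(N, NX)), largest for the best matching unit.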

  11. PCA and Local PCA. The covariance matrix Cij = (1/m) Σq (Xq − X̄)i(Xq − X̄)j is positive semidefinite (the Xq are data points). Principal components: eigenvectors of the covariance matrix. The local covariance matrix at a point y is the analogous sum weighted by w(||Xq − y||), where w is a positive cutting function. The field of principal components: eigenvectors ei(y) of the local covariance matrix. Trajectories of these vector fields present the geometry of the local data structure.
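Both the global and the local versions reduce to an eigendecomposition of a (weighted) covariance matrix; a minimal sketch with my own naming, using a Gaussian as the cutting function in the test:

```python
import numpy as np

def pca(X):
    """Principal components: eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(X)            # covariance matrix (positive semidefinite)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]    # sort by decreasing eigenvalue
    return vals[order], vecs[:, order]

def local_pca(X, y, w):
    """Local principal components at y, with a positive cutting function w(distance)."""
    weights = w(np.linalg.norm(X - y, axis=1))
    mean = weights @ X / weights.sum()
    Xc = X - mean
    C = (weights[:, None] * Xc).T @ Xc / weights.sum()
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

Evaluating `local_pca` on a grid of points y gives the vector field ei(y) mentioned above.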

  12. A top secret: the difference between the two basic paradigms is not crucial. (Almost) back to statistics: • Quasi-statistics: 1) delete one point from the dataset, 2) fit, 3) analyse the error on the deleted point; • The overfitting problem and smoothed data points (this is very close to non-parametric statistics)
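The quasi-statistical "delete one point" loop is ordinary leave-one-out validation; sketched generically below (fit and error stand for whatever model-fitting and error functions are in use; the mean model in the test is just an illustration):

```python
import numpy as np

def loo_error(X, fit, error):
    """Quasi-statistics: delete one point, fit on the rest, score the deleted point."""
    errs = []
    for i in range(len(X)):
        model = fit(np.delete(X, i, axis=0))   # 1) delete one point, 2) fit
        errs.append(error(model, X[i]))        # 3) error on the deleted point
    return float(np.mean(errs))
```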

  13. Principal manifolds: the elastic maps framework among non-linear data-mining methods — PCA, factor analysis, multidimensional scaling, LLE, ISOMAP, SOM, clustering and K-means, supervised classification and SVM, regression and approximation, visualization.

  14. Finite set of objects in R^N: Xi, i = 1..m

  15. K-means clustering Mean point

  16. Principal “Object”

  17. Principal Component Analysis: 1st principal axis (maximal dispersion), 2nd principal axis

  18. Principal manifold

  19. Principal Manifold: statistical self-consistency. For the projection π onto the manifold, x = E(y | π(y) = x): every point x of the manifold is the conditional mean of the data in its fiber π⁻¹(x).

  20. What do we want? • A non-linear surface (1D, 2D, 3D, …) • Smooth and not twisted • The data model is unknown • Speed (time linear in Nm) • Uniqueness • A fast way to project data points

  21. Metaphor of elasticity: data points contribute the approximation energy U(Y); graph nodes and their connections contribute the elastic energies U(E) and U(R)

  22. Constructing elastic nets: nodes y; an edge has endpoints E(0), E(1); a rib has center R(0) and endpoints R(1), R(2)

  23. Definition of elastic energy. For data points Xj, nodes y, edges E(i) and ribs R(i): U = U(Y) + U(E) + U(R), where U(Y) = (1/m) Σj ||Xj − y(Xj)||² (approximation energy, y(Xj) the node closest to Xj), U(E) = λ Σi ||E(i)(0) − E(i)(1)||² (stretching energy), U(R) = μ Σi ||R(i)(1) + R(i)(2) − 2R(i)(0)||² (bending energy).
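With the three terms above, the elastic energy is a direct sum over data points, edges, and ribs; a NumPy sketch (the index conventions and names are my own, and λ, μ are taken per-net rather than per-element):

```python
import numpy as np

def elastic_energy(X, Y, assign, edges, ribs, lam, mu):
    """Elastic energy U = U(Y) + U(E) + U(R) of a net.

    X: data points (m, dim); Y: node positions (n, dim)
    assign[j]: index of the node closest to X[j]
    edges: (s, 2) index pairs (E(0), E(1))
    ribs: (r, 3) index triples (R(1), R(0), R(2)), center in the middle
    """
    U_Y = ((X - Y[assign]) ** 2).sum() / len(X)                 # approximation energy
    U_E = lam * ((Y[edges[:, 0]] - Y[edges[:, 1]]) ** 2).sum()  # stretching energy
    U_R = mu * ((Y[ribs[:, 0]] + Y[ribs[:, 2]] - 2 * Y[ribs[:, 1]]) ** 2).sum()  # bending
    return U_Y + U_E + U_R
```

For fixed assignments, U is quadratic in the node positions Y, which is what makes the fitting step of the algorithm a linear solve.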

  24. Elastic manifold

  25. Global minimum and softening: the map is fitted with gradually decreasing elasticity moduli (λ, μ) ≈ 10³, 10², 10¹, 10⁻¹

  26. Adaptive algorithms: • refining net • growing net • idea of scaling: adaptive net

  27. Scaling rules. For a uniform d-dimensional net, the condition of constant energy density gives the scaling of the elasticity coefficients with net resolution; here s is the number of edges and r is the number of ribs in a given volume.

  28. Grammars of construction: substitution rules. Examples: for net refining, substitutions of columns and rows; for growing nets, substitutions of elementary cells.

  29. Substitutions in factors: graph factorization; a substitution rule acts as a transformation of a factor

  30. Substitutions in factors: the induced graph transformation

  31. Transformation selection. A grammar is a list of elementary graph transformations. Energetic criterion: we select and apply the elementary applicable transformation that provides the maximal energy decrease (after a fitting step). The number of operations for this selection should be of order O(N) or less, where N is the number of vertices.

  32. Projection onto the manifold: the closest node of the net, or the closest point of the manifold

  33. Mapping distortions. Two basic types of distortion: 1) projecting distant points onto close ones (bad resolution); 2) projecting close points onto distant ones (bad topology compliance)

  34. Instability of projection. The Best Matching Unit (BMU) for a data point is the closest node of the graph; BMU2 is the second-closest node. If BMU and BMU2 are not adjacent on the graph, then the data point is unstable. Gray polygons are the areas of instability. The numbers denote the degree of instability: how many nodes separate BMU from BMU2.
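Checking a data point for instability only needs its two nearest nodes and the graph adjacency (a sketch with illustrative names; the adjacency matrix encodes the node graph):

```python
import numpy as np

def instability(x, nodes, adj):
    """Return (stable?, (BMU, BMU2)) for data point x.

    nodes: node positions in data space (n, dim)
    adj: boolean adjacency matrix of the node graph
    The projection is stable iff BMU and BMU2 are adjacent on the graph.
    """
    d2 = ((nodes - x) ** 2).sum(axis=1)
    bmu, bmu2 = np.argsort(d2)[:2]
    return bool(adj[bmu, bmu2]), (int(bmu), int(bmu2))
```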

  35. Colorings: visualize any function on the map; here, the value of a coordinate

  36. Density visualization

  37. Example: different topologies, R^N → R^2

  38. VIDAExpert tool and elmap C++ package

  39. Regression and principal manifolds: principal component regression, F(x) vs x

  40. Projection and regression. A data point may have a gap; data points with gaps are modelled as affine manifolds, and the nearest point on the principal manifold provides the optimal filling of the gaps.
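Filling gaps by projection amounts to a least-squares fit over the observed coordinates only; a sketch assuming the manifold is given as a mean vector plus a basis of directions (the names are mine):

```python
import numpy as np

def fill_gaps(x, mean, V):
    """Fill NaN entries of x by the nearest point of the affine manifold mean + V t.

    The internal coordinate t is fitted by least squares over the observed entries,
    and the corresponding manifold point supplies the missing ones.
    """
    obs = ~np.isnan(x)
    t, *_ = np.linalg.lstsq(V[obs], x[obs] - mean[obs], rcond=None)
    filled = x.copy()
    filled[~obs] = (mean + V @ t)[~obs]
    return filled
```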

  41. Iterative error mapping. For a given elastic manifold and a data point x(i), the error vector is Δx(i) = x(i) − P(x(i)), where P(x) is the projection of the data point onto the manifold. The errors form a new dataset, and we can construct another map, getting a regular model of the errors. So we have the first map that models the data itself, the second map that models the errors of the first model, and so on. Every point x in the initial data space is modelled by the vector P1(x) + P2(x − P1(x)) + …
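The iterated model x ≈ P1(x) + P2(x − P1(x)) can be sketched with plain PCA maps standing in for elastic maps (an illustration of the scheme, not the authors' implementation):

```python
import numpy as np

def fit_pca_map(X, k):
    """Fit a k-dimensional PCA hyperplane to X and return its projector P."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:k].T
    return lambda Z: mean + (Z - mean) @ V @ V.T

def iterative_model(X, k1, k2):
    """First map P1 models the data; second map P2 models the errors x - P1(x)."""
    P1 = fit_pca_map(X, k1)
    P2 = fit_pca_map(X - P1(X), k2)          # the errors form a new dataset
    return lambda Z: P1(Z) + P2(Z - P1(Z))   # two-term model of every point
```

Each added map can only reduce the residual on the training data, since it models exactly what the previous maps left unexplained.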

  42. Image skeletonization or clustering around curves

  43. Image skeletonization or clustering around curves

  44. Approximation of molecular surfaces

  45. Application: economical data. Colorings: density, gross output, profit, growth rate

  46. Medical table: 1700 patients with myocardial infarction. Patients map: density; lethal cases

  47. Medical table: 1700 patients with myocardial infarction, 128 indicators. Colorings: stenocardia functional class, number of infarctions in anamnesis, age

  48. Codon usage in all genes of one genome (Escherichia coli, Bacillus subtilis): majority of genes; “foreign” genes; “hydrophobic” genes; highly expressed genes

  49. Golub’s leukemia dataset: 3051 genes, 38 samples (ALL/B-cell, ALL/T-cell, AML). Map of genes: vote for ALL, vote for AML; genes used by T. Golub, used by W. Lie; ALL sample, AML sample

  50. Golub’s leukemia dataset, map of samples: AML, ALL/B-cell, ALL/T-cell. Colorings: density; Retinoblastoma binding protein P48; Cystatin C; CA2 Carbonic anhydrase II; X-linked Helicase II
