
AMCS/CS 340 : Data Mining



Presentation Transcript


  1. Dimensionality Reduction AMCS/CS 340 : Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 2 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  3. Why Dimensionality Reduction? Most machine learning and data mining techniques may not be effective for high-dimensional data: the curse of dimensionality means that query accuracy and efficiency degrade rapidly as the dimension increases. At the same time, the intrinsic dimension may be small; for example, the number of genes responsible for a certain type of disease may be small. 3 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  4. High-dimensional data example: multiple copies of an image are created by changing the location and orientation. Each image  a data point in 100*100 = 10,000-dimensional space. A set of N images  a data set X represented by an N*10000 matrix. Classifying or clustering the data set X  working on N data points, each of which is a 10,000-dimensional vector. 4 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
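As a concrete illustration, a minimal Matlab sketch of how such an N*10000 matrix could be assembled from image files; the file names and the 100*100 grayscale format are assumptions for the example, not part of the slides.

% Stack N grayscale 100x100 images into an N x 10000 data matrix X.
% The file names below are hypothetical placeholders.
files = {'img001.png', 'img002.png', 'img003.png'};
N = numel(files);
X = zeros(N, 100*100);
for n = 1:N
    img = double(imread(files{n}));   % 100 x 100 grayscale image
    X(n, :) = img(:)';                % flatten into a 1 x 10000 row vector
end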

  5. Low-dimensional space: the copies differ only in location and orientation, so there are 3 latent variables: rotation, horizontal translation, and vertical translation. Each image  a data point in 3-dimensional space. A set of N images  a data set X represented by an N*3 matrix. Classifying or clustering the data set X  working on N data points, each of which is a 3-dimensional vector. 5 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  6. Visualization: projection of high-dimensional data onto 2D or 3D. Data compression: efficient storage and retrieval. Noise removal: positive effect on prediction and analysis. Why Dimensionality Reduction? 6 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  7. How to reduce the dimensionality? Feature Ranking; Feature Subset Selection (Filters, Wrappers, Embedded Methods); Feature Extraction/Construction (Clustering, Linear/Non-linear Dimensionality Reduction). 7 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  8. How to reduce the dimensionality? Feature Ranking; Feature Subset Selection (Filters, Wrappers, Embedded Methods); Feature Extraction/Construction (Clustering, Linear/Non-linear Dimensionality Reduction). Target: -- represent high-dimensional data in a lower-dimensional space; -- improve the predictor (as feature selection does)? Not the principal goal. 8 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  9. Dimensionality reduction techniques (unsupervised methods don't consider class labels, just the data points):
                Unsupervised                Supervised
  Linear        PCA, ICA, SVD, LSA (LSI)    LDA, CCA, PLS
  Non-Linear    Isomap, LLE                 MDR, learning with non-linear kernels
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  10. DR techniques -- the unsupervised ones:
  Principal component analysis (PCA): a vector space transform used to reduce multidimensional data sets to lower dimensions for analysis.
  Independent component analysis (ICA): find lower dimensions that are most statistically independent.
  Singular Value Decomposition (SVD): matrix factorization.
  Latent semantic analysis / indexing (LSA, LSI): analyzing relationships between a set of documents and terms by producing a set of concepts related to them.
  Isomap: computing a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points.
  Locally Linear Embedding (LLE): preserving the local geometry of the data by mapping nearby points on the manifold to nearby points in the low-dimensional space.
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  11. DR techniques -- the supervised ones:
  Multifactor dimensionality reduction (MDR): detecting and characterizing combinations of attributes that interact to influence a dependent or class variable.
  Linear Discriminant Analysis (LDA).
  Canonical Correlation Analysis (CCA).
  Partial least squares (PLS regression): finding a linear model describing some predicted variables in terms of other observable variables.
  Learning with non-linear kernels: e.g., PCA with non-linear kernels.
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  12. Dimensionality reduction techniques (recap of the taxonomy; unsupervised methods don't consider class labels, just the data points):
                Unsupervised                Supervised
  Linear        PCA, ICA, SVD, LSA (LSI)    LDA, CCA, PLS
  Non-Linear    Isomap, LLE                 MDR, learning with non-linear kernels
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  13. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 13 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  14. Demo of vector space transform: a two-dimensional scatter of points that shows a high degree of correlation (the slide shows the scatter with its linear regression fit and the means bar-x, bar-y marked on the axes). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  15. Demo of vector space transform • After the vector space transform, we have a more "efficient" description • The 1st dimension captures the max variance • The 2nd dimension captures the max amount of residual variance, at right angles (orthogonal) to the first • The 1st dimension may capture so much of the information content in the original data set that we can ignore the remaining axis. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  16. X Create N*d data matrix, with one row vector xnper data point  compute the feature mean 3. Σcovariance matrix of X Σ : d*d matrix, 4. Find eigenvectors and eigenvalues of Σ PC’s the M eigenvectors with the M largest eigenvalue PCA Algorithm Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  17. PCA Algorithm in Matlab:

% generate data
Data = mvnrnd([5, 5], [1 1.5; 1.5 3], 100);
figure(1);
plot(Data(:,1), Data(:,2), '+');

% center the data
m = mean(Data);
for i = 1:size(Data,1)
    Data(i, :) = Data(i, :) - m;
end

DataCov = Data'*Data/(size(Data,1)-1);   % covariance matrix
[PC, variances, explained] = pcacov(DataCov);
% PC: eigenvectors; variances: eigenvalues
% explained: percentage of the total variance explained by each
% principal component = variances/sum(variances)

% plot principal components
figure;
plot(Data(:,1), Data(:,2), '+b');
hold on
t = ceil(max(abs(Data(:))));
plot(PC(1,1)*[-t t], PC(2,1)*[-t t], '-r')
plot(PC(1,2)*[-t t]/2, PC(2,2)*[-t t]/2, '-b');
hold off
axis equal
axis([-t t -t t])

% project down to 1 dimension
PcaPos = Data * PC(:, 1);

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
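The same computation can be done with the built-in pca function of the Statistics and Machine Learning Toolbox, which centers the data internally; a minimal sketch, assuming that toolbox is available:

% Equivalent using the built-in pca (the data are centered internally)
Data = mvnrnd([5, 5], [1 1.5; 1.5 3], 100);
[coeff, score, latent, ~, explained] = pca(Data);
% coeff: principal component directions (same role as PC above)
% score: the data projected onto the principal components
% explained: percentage of total variance explained by each component
PcaPos = score(:, 1);            % projection onto the first component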

  18. 2d Data Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  19. Principal Components: PCA gives the best axis to project onto, with minimum distortion (sum of squared errors) of the projection; the principal vectors are orthogonal. In this example, the 1st principal vector explains 95% of the variance and the 2nd explains 5%. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  20. How many components? Check the distribution of eigenvalues and take enough eigenvectors to cover 80-90% of the variance, i.e., choose the smallest k with sum(explained(1:k)) > 90%. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
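Continuing the Matlab example above, a minimal sketch of this rule of thumb (the 90% threshold is the one quoted on the slide):

% explained comes from pcacov (or pca) and is expressed in percent
k = find(cumsum(explained) > 90, 1);   % smallest k covering more than 90% of the variance
ReducedData = Data * PC(:, 1:k);       % keep only the first k principal components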

  21. Example of Sensor networks 54 Sensors in Intel Berkeley Lab

  22. Pairwise link quality vs. distance: the slide plots the link quality against the distance between a pair of sensors. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  23. Given a 54x54 matrix of pairwise link qualities Do PCA Project down to 2 principal dimensions PCA discovered the map of the lab PCA in action http://eprints.pascal-network.org/archive/00001205/01/LeskovecSensorsSiKDD2005.pdf Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  24. Problems and limitations: what if the data are very high-dimensional, e.g., images (d >= 10^4)? Problem: the covariance matrix Σ has size d^2; with d = 10^4, Σ has 10^8 entries. Solution: Singular Value Decomposition (SVD)! It works directly on the data X, and efficient algorithms are available (Matlab). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
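As an illustration of working directly on X, a minimal Matlab sketch using the economy-size SVD (or svds for just a few components), which never forms the d x d covariance matrix; X is assumed to be the centered N x d data matrix and M the number of components to keep:

% X: centered N x d data matrix (d possibly very large), M: #components (assumed)
M = 2;
[U, S, V] = svd(X, 'econ');      % economy-size SVD, no d x d matrix is built
PCs = V(:, 1:M);                 % right singular vectors = principal directions
Projected = X * PCs;             % data projected onto the first M components

% Alternatively, compute only the M largest singular triplets:
[U, S, V] = svds(X, M);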

  25. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 25 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  26. Problem: #1: Reduce dimensionality #2: Find concepts in text Singular Value Decomposition Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  27. SVD - Definition: A[n x m] = U[n x r] S[r x r] (V[m x r])^T, where
  A: n x m matrix (e.g., n documents, m terms)
  U: n x r matrix (n documents, r concepts)
  S: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix)
  V: m x r matrix (m terms, r concepts)
  27 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
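A minimal Matlab sketch of this factorization on a small, made-up term-document matrix (the numbers are illustrative and only mimic the CS/MD example shown on the next slides):

% A: n documents x m terms (toy values; rows 1-3 are 'CS'-like, rows 4-5 'MD'-like)
A = [1 1 1 0 0;
     2 2 2 0 0;
     1 1 1 0 0;
     0 0 0 1 1;
     0 0 0 2 2];
[U, S, V] = svd(A, 'econ');
r = rank(A);                               % number of non-trivial concepts (here 2)
U = U(:, 1:r); S = S(1:r, 1:r); V = V(:, 1:r);
norm(A - U*S*V', 'fro')                    % ~0: the factorization reconstructs A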

  28. SVD - Example: A = USV^T for a term-document matrix whose terms are data, inf. (information), retrieval, brain, and lung, and whose documents belong to two groups, CS and MD (the numerical matrices A, U, S, V are shown on the slide). 28 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  29. SVD - Finding concepts: in A = USV^T, U acts as the doc-to-concept similarity matrix; in the example it exposes a CS-concept and an MD-concept. 29 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  30. SVD - Finding concepts: the diagonal entries of S give the 'strength' of each concept (e.g., the 'strength' of the CS-concept). 30 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  31. SVD - Finding concepts: V acts as the term-to-concept similarity matrix (how strongly data, inf., retrieval, brain, and lung relate to the CS-concept and the MD-concept). 31 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  32. SVD - Properties. THEOREM [Press+92]: it is always possible to decompose a matrix A into A = USV^T, where
  U, S, V: unique;
  U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix);
  S: diagonal, with entries (the singular values) positive and sorted in decreasing order.
  32 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
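These properties are easy to verify numerically; a short Matlab check, reusing the toy matrix A from the sketch above:

[U, S, V] = svd(A, 'econ');
norm(U'*U - eye(size(U, 2)))     % ~0: columns of U are orthonormal
norm(V'*V - eye(size(V, 2)))     % ~0: columns of V are orthonormal
all(diff(diag(S)) <= 0)          % true: singular values sorted in decreasing order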

  33. SVD vs PCA. PCA: A -> covariance matrix Σ -> eigenvalues λ, eigenvectors e; the principal component e_1 maximizes the projected variance: e_1^T Σ e_1 = λ_1. SVD: A = USV^T; the first right singular vector v_1 is an eigenvector of A^T A, (A^T A) v_1 = s_1^2 v_1, and it maximizes the total projected squares: ||A v_1||^2 = s_1^2. 33 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
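A quick Matlab check of these two relations for the first right singular vector, again with the toy matrix A (any real matrix would do):

[U, S, V] = svd(A, 'econ');
v1 = V(:, 1);  s1 = S(1, 1);
norm((A'*A)*v1 - s1^2*v1)        % ~0: v1 is an eigenvector of A'A with eigenvalue s1^2
norm(A*v1)^2 - s1^2              % ~0: the projected sum of squares equals s1^2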

  34. SVD vs PCA (2). If A has already been centered, the singular values satisfy s_i^2 = N λ_i and the first right singular vector is v_1 = e_1. PCA: maps the data to a lower-dimensional space. SVD: ranks the patterns/concepts according to their importance -- discarding the trivial concepts decreases the dimensionality -- the remaining concepts span the new space. 34 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  35. SVD vs PCA (3). In the example, ||A v_1||^2 = 19287 while ||A e_1||^2 = 3429; this is expected, since v_1 maximizes ||A v||^2 over all unit vectors v, so ||A v_1||^2 >= ||A e_1||^2. 35 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  36. How to calculate SVD. Calculate U, S, V such that A = USV^T:
  S: square roots of the eigenvalues of A A^T (or A^T A);
  U: eigenvectors of A A^T, and then v_i = A^T u_i / s_i;
  V: eigenvectors of A^T A, and then u_i = A v_i / s_i.
  Which route to choose? It depends on the size of A: work with the smaller of A A^T and A^T A. 36 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
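A minimal Matlab sketch of the second route (eigen-decomposition of A^T A, assumed here to be the smaller matrix); in practice svd(A) is the numerically preferred call, so this is for illustration only:

% SVD of A via the eigen-decomposition of A'A
[Vfull, D] = eig(A'*A);
[lambda, order] = sort(diag(D), 'descend');   % eigenvalues, largest first
Vfull = Vfull(:, order);
s = sqrt(max(lambda, 0));                     % singular values
r = sum(s > 1e-10);                           % keep only the non-trivial ones
V = Vfull(:, 1:r);  s = s(1:r);  S = diag(s);
U = (A*V) ./ s';                              % u_i = A v_i / s_i (implicit expansion, R2016b+)
norm(A - U*S*V', 'fro')                       % ~0: recovers A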

  37. SVD – Dimensionality reduction. Q: how exactly is dimensionality reduction done? 37 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  38. SVD – Dimensionality reduction. Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero. 38 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  39. SVD – Dimensionality reduction (continued): set the smallest singular values to zero. 39 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  40. SVD – Dimensionality reduction (continued): set the smallest singular values to zero. 40 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  41. SVD – Dimensionality reduction: the product of the truncated factors is now only approximately equal to A. 41 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  42. SVD – Reconstruction. A = USV^T (full decomposition); A_k = U_k S_k V_k^T ≈ A (rank-k reconstruction), where S_k holds the first k singular values, U_k the first k columns of U, and V_k the first k columns of V. 42 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
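A minimal Matlab sketch of the rank-k reconstruction (the value k = 2, i.e., the number of concepts kept, is an assumption for the example):

k = 2;
[U, S, V] = svd(A);
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % rank-k approximation of A
norm(A - Ak, 'fro')                          % reconstruction error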

  43. A covariance matrix  eigenvalues, eigenvectors subtracts off the means loses the sparseness of the A matrix, which can make it infeasible for large lexicons. If using PCA -0.54 -0.21 0.69 0.44 0 -0.54 -0.21 0.04 -0.82 0 -0.54 -0.21 -0.73 0.37 0 0.26 -0.66 0 0 -0.71 0.26 -0.66 0 0 0.71 CS  MD 43 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  44. The new concept space typically can be used to: Compare the documents in the concept space (data clustering, document classification). Find relations between terms (synonymy and polysemy). Given a query of terms, translate it into the concept space, and find matching documents (information retrieval). e.g. find documents with ‘data’ LSI – Latent Semantic Indexing 44 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
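A hedged Matlab sketch of the last use case (matching documents to the query 'data'), using the same row-documents/column-terms convention as above; the folding-in formula q_k = q V_k S_k^{-1} is the standard LSI recipe rather than something spelled out on the slide, and the term index for 'data' is assumed to be column 1:

% Concept-space representations, keeping k concepts
[U, S, V] = svd(A, 'econ');
k = 2;                                   % assumed number of concepts
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);
docs_k = Uk;                             % documents in concept space (one row each)

% Query containing the single term 'data' (assumed to be term/column 1 of A)
q = zeros(1, size(A, 2));  q(1) = 1;
q_k = q * Vk / Sk;                       % fold the query into concept space

% Rank documents by cosine similarity to the query
scores = (docs_k * q_k') ./ (sqrt(sum(docs_k.^2, 2)) * norm(q_k));
[~, ranking] = sort(scores, 'descend');  % ranking(1) is the best-matching document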

  45. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 45 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  46. Linear and Non-Linear methods. PCA finds linear subspace projections of the input data; it fails when the data lie on a nonlinear manifold embedded in the high-dimensional space. (Manifold: a topological space which is locally Euclidean.) 46

  47. Problem definition of Non-Linear DR: discover low-dimensional representations (smooth manifolds) for data in high dimension that may not be well summarized by a linear combination of features (as in PCA, for example); find the mapping that captures the important features. 47 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  48. Solution: overlapping local neighborhoods, collectively analyzed, can provide information on the global geometry. Two ways to select neighboring objects: k nearest neighbors (k-NN) – can make the neighbor distance non-uniform across the dataset; ε-ball – prior knowledge of the data is needed to make reasonable neighborhoods, and the size of the neighborhood can vary. Isomap and LLE (Science, 2000) build on this idea. 48 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  49. Isomap algorithm (Josh Tenenbaum, Vin de Silva, John Langford, Science 2000): 1. Find the neighborhood of each object by computing distances between all pairs of points and selecting the closest. 2. Build a graph with a node for each object and an edge between neighboring points; the Euclidean distance between two objects is used as the edge weight. 3. Use a shortest-path graph algorithm to fill in the distances between all non-neighboring points. 4. Apply classical MDS to this geodesic distance matrix. 49 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
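A compact Matlab sketch of these four steps, using a k-NN neighborhood graph, Floyd-Warshall shortest paths, and classical MDS via cmdscale; it assumes the Statistics Toolbox (pdist, squareform, cmdscale), implicit expansion (R2016b or later), and a connected neighborhood graph:

% Isomap sketch: X is the N x d data matrix, k the neighborhood size, m the target dimension
function Y = isomap_sketch(X, k, m)
    N = size(X, 1);
    D = squareform(pdist(X));            % all pairwise Euclidean distances
    [~, idx] = sort(D, 2);
    W = inf(N);                          % neighborhood graph (inf = no edge)
    for i = 1:N
        W(i, idx(i, 2:k+1)) = D(i, idx(i, 2:k+1));   % keep the k nearest neighbors
    end
    W = min(W, W');                      % symmetrize the graph
    G = W;  G(1:N+1:end) = 0;            % zero self-distances
    for j = 1:N                          % Floyd-Warshall: geodesic distances
        G = min(G, G(:, j) + G(j, :));
    end
    Y = cmdscale(G);                     % classical MDS on the geodesic distances
    Y = Y(:, 1:m);                       % keep the first m embedding coordinates
end

For the Swiss-roll example on the next slide, a call such as Y = isomap_sketch(X, 7, 2) would give a 2-D embedding; the choice k = 7 is an assumption for illustration, not a value from the slides.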

  50. Altogether there are 20,000 points in the “Swiss roll” data set. We sample 1000 out of 20,000. Isomap example: Sample points with Swiss Roll 50 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
