
AMCS/CS 340 : Data Mining



Presentation Transcript


  1. Dimensionality Reduction AMCS/CS 340 : Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

  2. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 2 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  3. Why Dimensionality Reduction? Most machine learning and data mining techniques may not be effective for high-dimensional data: the curse of dimensionality means that query accuracy and efficiency degrade rapidly as the dimension increases. At the same time, the intrinsic dimension may be small; for example, the number of genes responsible for a certain type of disease may be small. 3 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  4. High-dimensional data example: multiple copies of an image are created by changing the location and orientation. Each image  a data point in 100*100 = 10,000-dimensional space. A set of N images  a data set X represented by an N*10000 matrix. Classifying or clustering the data set X  working on N data points, each of which is a 10,000-dimensional vector. 4 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
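As a concrete illustration, a minimal Matlab sketch of how such an N*10000 matrix could be assembled from image files; the file names and the 100*100 grayscale format are assumptions for the example, not part of the slides.

% Stack N grayscale 100x100 images into an N x 10000 data matrix X.
% The file names below are hypothetical placeholders.
files = {'img001.png', 'img002.png', 'img003.png'};
N = numel(files);
X = zeros(N, 100*100);
for n = 1:N
    img = double(imread(files{n}));   % 100 x 100 grayscale image
    X(n, :) = img(:)';                % flatten into a 1 x 10000 row vector
end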

  5. Low-dimensional space: the copies differ only in location and orientation, so there are 3 latent variables: rotation, horizontal translation, and vertical translation. Each image  a data point in 3-dimensional space. A set of N images  a data set X represented by an N*3 matrix. Classifying or clustering the data set X  working on N data points, each of which is a 3-dimensional vector. 5 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  6. Visualization: projection of high-dimensional data onto 2D or 3D. Data compression: efficient storage and retrieval. Noise removal: positive effect on prediction and analysis. Why Dimensionality Reduction? 6 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  7. How to reduce the dimensionality? Feature Ranking; Feature Subset Selection (Filters, Wrappers, Embedded Methods); Feature Extraction/Construction (Clustering, Linear/Non-linear Dimensionality Reduction). 7 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  8. How to reduce the dimensionality? Feature Ranking; Feature Subset Selection (Filters, Wrappers, Embedded Methods); Feature Extraction/Construction (Clustering, Linear/Non-linear Dimensionality Reduction). Target: -- represent high-dimensional data in a lower-dimensional space; -- improve the predictor (as feature selection does)? Not the principal goal. 8 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  9. Dimensionality reduction techniques (unsupervised methods don't consider class labels, just the data points):
                Unsupervised                Supervised
  Linear        PCA, ICA, SVD, LSA (LSI)    LDA, CCA, PLS
  Non-Linear    Isomap, LLE                 MDR, learning with non-linear kernels
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  10. DR techniques -- the unsupervised ones:
  Principal component analysis (PCA): a vector space transform used to reduce multidimensional data sets to lower dimensions for analysis.
  Independent component analysis (ICA): find lower dimensions that are most statistically independent.
  Singular Value Decomposition (SVD): matrix factorization.
  Latent semantic analysis / indexing (LSA, LSI): analyzing relationships between a set of documents and terms by producing a set of concepts related to them.
  Isomap: computing a quasi-isometric, low-dimensional embedding of a set of high-dimensional data points.
  Locally Linear Embedding (LLE): preserving the local geometry of the data by mapping nearby points on the manifold to nearby points in the low-dimensional space.
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  11. DR techniques -- the supervised ones:
  Multifactor dimensionality reduction (MDR): detecting and characterizing combinations of attributes that interact to influence a dependent or class variable.
  Linear Discriminant Analysis (LDA).
  Canonical Correlation Analysis (CCA).
  Partial least squares (PLS regression): finding a linear model describing some predicted variables in terms of other observable variables.
  Learning with non-linear kernels: e.g., PCA with non-linear kernels.
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  12. Dimensionality reduction techniques (recap of the taxonomy; unsupervised methods don't consider class labels, just the data points):
                Unsupervised                Supervised
  Linear        PCA, ICA, SVD, LSA (LSI)    LDA, CCA, PLS
  Non-Linear    Isomap, LLE                 MDR, learning with non-linear kernels
  Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  13. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 13 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  14. Demo of vector space transform: a two-dimensional scatter of points that shows a high degree of correlation (the slide shows the scatter with its linear regression fit and the means bar-x, bar-y marked on the axes). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  15. Demo of vector space transform • After the vector space transform, we have a more "efficient" description • The 1st dimension captures the max variance • The 2nd dimension captures the max amount of residual variance, at right angles (orthogonal) to the first • The 1st dimension may capture so much of the information content in the original data set that we can ignore the remaining axis. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  16. X Create N*d data matrix, with one row vector xnper data point  compute the feature mean 3. Σcovariance matrix of X Σ : d*d matrix, 4. Find eigenvectors and eigenvalues of Σ PC’s the M eigenvectors with the M largest eigenvalue PCA Algorithm Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  17. PCA Algorithm in Matlab:

% generate data
Data = mvnrnd([5, 5], [1 1.5; 1.5 3], 100);
figure(1);
plot(Data(:,1), Data(:,2), '+');

% center the data
m = mean(Data);
for i = 1:size(Data,1)
    Data(i, :) = Data(i, :) - m;
end

DataCov = Data'*Data/(size(Data,1)-1);   % covariance matrix
[PC, variances, explained] = pcacov(DataCov);
% PC: eigenvectors; variances: eigenvalues
% explained: percentage of the total variance explained by each
% principal component = variances/sum(variances)

% plot principal components
figure;
plot(Data(:,1), Data(:,2), '+b');
hold on
t = ceil(max(abs(Data(:))));
plot(PC(1,1)*[-t t], PC(2,1)*[-t t], '-r')
plot(PC(1,2)*[-t t]/2, PC(2,2)*[-t t]/2, '-b');
hold off
axis equal
axis([-t t -t t])

% project down to 1 dimension
PcaPos = Data * PC(:, 1);

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
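The same computation can be done with the built-in pca function of the Statistics and Machine Learning Toolbox, which centers the data internally; a minimal sketch, assuming that toolbox is available:

% Equivalent using the built-in pca (the data are centered internally)
Data = mvnrnd([5, 5], [1 1.5; 1.5 3], 100);
[coeff, score, latent, ~, explained] = pca(Data);
% coeff: principal component directions (same role as PC above)
% score: the data projected onto the principal components
% explained: percentage of total variance explained by each component
PcaPos = score(:, 1);            % projection onto the first component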

  18. 2d Data Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  19. Principal Components: PCA gives the best axis to project onto, with minimum distortion (sum of squared errors) of the projection; the principal vectors are orthogonal. In this example, the 1st principal vector explains 95% of the variance and the 2nd explains 5%. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  20. How many components? Check the distribution of eigenvalues and take enough eigenvectors to cover 80-90% of the variance, i.e., choose the smallest k with sum(explained(1:k)) > 90%. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
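Continuing the Matlab example above, a minimal sketch of this rule of thumb (the 90% threshold is the one quoted on the slide):

% explained comes from pcacov (or pca) and is expressed in percent
k = find(cumsum(explained) > 90, 1);   % smallest k covering more than 90% of the variance
ReducedData = Data * PC(:, 1:k);       % keep only the first k principal components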

  21. Example of Sensor networks 54 Sensors in Intel Berkeley Lab

  22. Pairwise link quality vs. distance: the slide plots the link quality against the distance between a pair of sensors. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  23. Given a 54x54 matrix of pairwise link qualities Do PCA Project down to 2 principal dimensions PCA discovered the map of the lab PCA in action http://eprints.pascal-network.org/archive/00001205/01/LeskovecSensorsSiKDD2005.pdf Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  24. Problems and limitations: what if the data are very high-dimensional, e.g., images (d >= 10^4)? Problem: the covariance matrix Σ has size d^2; with d = 10^4, Σ has 10^8 entries. Solution: Singular Value Decomposition (SVD)! It works directly on the data X, and efficient algorithms are available (Matlab). Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
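As an illustration of working directly on X, a minimal Matlab sketch using the economy-size SVD (or svds for just a few components), which never forms the d x d covariance matrix; X is assumed to be the centered N x d data matrix and M the number of components to keep:

% X: centered N x d data matrix (d possibly very large), M: #components (assumed)
M = 2;
[U, S, V] = svd(X, 'econ');      % economy-size SVD, no d x d matrix is built
PCs = V(:, 1:M);                 % right singular vectors = principal directions
Projected = X * PCs;             % data projected onto the first M components

% Alternatively, compute only the M largest singular triplets:
[U, S, V] = svds(X, M);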

  25. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 25 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  26. Problem: #1: Reduce dimensionality #2: Find concepts in text Singular Value Decomposition Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  27. SVD - Definition: A[n x m] = U[n x r] S[r x r] (V[m x r])^T, where
  A: n x m matrix (e.g., n documents, m terms)
  U: n x r matrix (n documents, r concepts)
  S: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix)
  V: m x r matrix (m terms, r concepts)
  27 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
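A minimal Matlab sketch of this factorization on a small, made-up term-document matrix (the numbers are illustrative and only mimic the CS/MD example shown on the next slides):

% A: n documents x m terms (toy values; rows 1-3 are 'CS'-like, rows 4-5 'MD'-like)
A = [1 1 1 0 0;
     2 2 2 0 0;
     1 1 1 0 0;
     0 0 0 1 1;
     0 0 0 2 2];
[U, S, V] = svd(A, 'econ');
r = rank(A);                               % number of non-trivial concepts (here 2)
U = U(:, 1:r); S = S(1:r, 1:r); V = V(:, 1:r);
norm(A - U*S*V', 'fro')                    % ~0: the factorization reconstructs A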

  28. SVD - Example: A = USV^T for a term-document matrix whose terms are data, inf. (information), retrieval, brain, and lung, and whose documents belong to two groups, CS and MD (the numerical matrices A, U, S, V are shown on the slide). 28 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  29. SVD - Finding concepts: in A = USV^T, U acts as the doc-to-concept similarity matrix; in the example it exposes a CS-concept and an MD-concept. 29 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  30. SVD - Finding concepts: the diagonal entries of S give the 'strength' of each concept (e.g., the 'strength' of the CS-concept). 30 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  31. SVD - Finding concepts: V acts as the term-to-concept similarity matrix (how strongly data, inf., retrieval, brain, and lung relate to the CS-concept and the MD-concept). 31 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  32. SVD - Properties. THEOREM [Press+92]: it is always possible to decompose a matrix A into A = USV^T, where
  U, S, V: unique;
  U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix);
  S: diagonal, with entries (the singular values) positive and sorted in decreasing order.
  32 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
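These properties are easy to verify numerically; a short Matlab check, reusing the toy matrix A from the sketch above:

[U, S, V] = svd(A, 'econ');
norm(U'*U - eye(size(U, 2)))     % ~0: columns of U are orthonormal
norm(V'*V - eye(size(V, 2)))     % ~0: columns of V are orthonormal
all(diff(diag(S)) <= 0)          % true: singular values sorted in decreasing order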

  33. SVD vs PCA. PCA: A -> covariance matrix Σ -> eigenvalues λ, eigenvectors e; the principal component e_1 maximizes the projected variance: e_1^T Σ e_1 = λ_1. SVD: A = USV^T; the first right singular vector v_1 is an eigenvector of A^T A, (A^T A) v_1 = s_1^2 v_1, and it maximizes the total projected squares: ||A v_1||^2 = s_1^2. 33 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
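A quick Matlab check of these two relations for the first right singular vector, again with the toy matrix A (any real matrix would do):

[U, S, V] = svd(A, 'econ');
v1 = V(:, 1);  s1 = S(1, 1);
norm((A'*A)*v1 - s1^2*v1)        % ~0: v1 is an eigenvector of A'A with eigenvalue s1^2
norm(A*v1)^2 - s1^2              % ~0: the projected sum of squares equals s1^2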

  34. SVD vs PCA (2). If A has already been centered, the singular values satisfy s_i^2 = N λ_i and the first right singular vector is v_1 = e_1. PCA: maps the data to a lower-dimensional space. SVD: ranks the patterns/concepts according to their importance -- discarding the trivial concepts decreases the dimensionality -- the remaining concepts span the new space. 34 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  35. SVD vs PCA (3). In the example, ||A v_1||^2 = 19287 while ||A e_1||^2 = 3429; this is expected, since v_1 maximizes ||A v||^2 over all unit vectors v, so ||A v_1||^2 >= ||A e_1||^2. 35 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  36. How to calculate SVD. Calculate U, S, V such that A = USV^T:
  S: square roots of the eigenvalues of A A^T (or A^T A);
  U: eigenvectors of A A^T, and then v_i = A^T u_i / s_i;
  V: eigenvectors of A^T A, and then u_i = A v_i / s_i.
  Which route to choose? It depends on the size of A: work with the smaller of A A^T and A^T A. 36 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
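A minimal Matlab sketch of the second route (eigen-decomposition of A^T A, assumed here to be the smaller matrix); in practice svd(A) is the numerically preferred call, so this is for illustration only:

% SVD of A via the eigen-decomposition of A'A
[Vfull, D] = eig(A'*A);
[lambda, order] = sort(diag(D), 'descend');   % eigenvalues, largest first
Vfull = Vfull(:, order);
s = sqrt(max(lambda, 0));                     % singular values
r = sum(s > 1e-10);                           % keep only the non-trivial ones
V = Vfull(:, 1:r);  s = s(1:r);  S = diag(s);
U = (A*V) ./ s';                              % u_i = A v_i / s_i (implicit expansion, R2016b+)
norm(A - U*S*V', 'fro')                       % ~0: recovers A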

  37. SVD – Dimensionality reduction. Q: how exactly is dimensionality reduction done? 37 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  38. SVD – Dimensionality reduction. Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero. 38 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  39. SVD – Dimensionality reduction (continued): set the smallest singular values to zero. 39 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  40. SVD – Dimensionality reduction (continued): set the smallest singular values to zero. 40 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  41. SVD – Dimensionality reduction: the product of the truncated factors is now only approximately equal to A. 41 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  42. SVD – Reconstruction. A = USV^T (full decomposition); A_k = U_k S_k V_k^T ≈ A (rank-k reconstruction), where S_k holds the first k singular values, U_k the first k columns of U, and V_k the first k columns of V. 42 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
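A minimal Matlab sketch of the rank-k reconstruction (the value k = 2, i.e., the number of concepts kept, is an assumption for the example):

k = 2;
[U, S, V] = svd(A);
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % rank-k approximation of A
norm(A - Ak, 'fro')                          % reconstruction error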

  43. A covariance matrix  eigenvalues, eigenvectors subtracts off the means loses the sparseness of the A matrix, which can make it infeasible for large lexicons. If using PCA -0.54 -0.21 0.69 0.44 0 -0.54 -0.21 0.04 -0.82 0 -0.54 -0.21 -0.73 0.37 0 0.26 -0.66 0 0 -0.71 0.26 -0.66 0 0 0.71 CS  MD 43 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  44. The new concept space typically can be used to: Compare the documents in the concept space (data clustering, document classification). Find relations between terms (synonymy and polysemy). Given a query of terms, translate it into the concept space, and find matching documents (information retrieval). e.g. find documents with ‘data’ LSI – Latent Semantic Indexing 44 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
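A hedged Matlab sketch of the last use case (matching documents to the query 'data'), using the same row-documents/column-terms convention as above; the folding-in formula q_k = q V_k S_k^{-1} is the standard LSI recipe rather than something spelled out on the slide, and the term index for 'data' is assumed to be column 1:

% Concept-space representations, keeping k concepts
[U, S, V] = svd(A, 'econ');
k = 2;                                   % assumed number of concepts
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);
docs_k = Uk;                             % documents in concept space (one row each)

% Query containing the single term 'data' (assumed to be term/column 1 of A)
q = zeros(1, size(A, 2));  q(1) = 1;
q_k = q * Vk / Sk;                       % fold the query into concept space

% Rank documents by cosine similarity to the query
scores = (docs_k * q_k') ./ (sqrt(sum(docs_k.^2, 2)) * norm(q_k));
[~, ranking] = sort(scores, 'descend');  % ranking(1) is the best-matching document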

  45. Outline • Introduction of DR • Linear DR • PCA • SVD • Nonlinear DR • Isomap • LLE 45 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  46. Linear and Non-Linear methods. PCA finds linear subspace projections of the input data; it fails when the data lie on a nonlinear manifold embedded in the high-dimensional space. (Manifold: a topological space which is locally Euclidean.) 46

  47. Problem definition of Non-Linear DR: discover low-dimensional representations (smooth manifolds) for data in high dimension that may not be well summarized by a linear combination of features (as in PCA, for example); find the mapping that captures the important features. 47 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  48. Solution: overlapping local neighborhoods, collectively analyzed, can provide information on the global geometry. Two ways to select neighboring objects: k nearest neighbors (k-NN) – can make the neighbor distance non-uniform across the dataset; ε-ball – prior knowledge of the data is needed to make reasonable neighborhoods, and the size of the neighborhood can vary. Isomap and LLE (Science, 2000) build on this idea. 48 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining

  49. Isomap algorithm (Josh Tenenbaum, Vin de Silva, John Langford, Science 2000): 1. Find the neighborhood of each object by computing distances between all pairs of points and selecting the closest. 2. Build a graph with a node for each object and an edge between neighboring points; the Euclidean distance between two objects is used as the edge weight. 3. Use a shortest-path graph algorithm to fill in the distances between all non-neighboring points. 4. Apply classical MDS to this geodesic distance matrix. 49 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
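A compact Matlab sketch of these four steps, using a k-NN neighborhood graph, Floyd-Warshall shortest paths, and classical MDS via cmdscale; it assumes the Statistics Toolbox (pdist, squareform, cmdscale), implicit expansion (R2016b or later), and a connected neighborhood graph:

% Isomap sketch: X is the N x d data matrix, k the neighborhood size, m the target dimension
function Y = isomap_sketch(X, k, m)
    N = size(X, 1);
    D = squareform(pdist(X));            % all pairwise Euclidean distances
    [~, idx] = sort(D, 2);
    W = inf(N);                          % neighborhood graph (inf = no edge)
    for i = 1:N
        W(i, idx(i, 2:k+1)) = D(i, idx(i, 2:k+1));   % keep the k nearest neighbors
    end
    W = min(W, W');                      % symmetrize the graph
    G = W;  G(1:N+1:end) = 0;            % zero self-distances
    for j = 1:N                          % Floyd-Warshall: geodesic distances
        G = min(G, G(:, j) + G(j, :));
    end
    Y = cmdscale(G);                     % classical MDS on the geodesic distances
    Y = Y(:, 1:m);                       % keep the first m embedding coordinates
end

For the Swiss-roll example on the next slide, a call such as Y = isomap_sketch(X, 7, 2) would give a 2-D embedding; the choice k = 7 is an assumption for illustration, not a value from the slides.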

  50. Altogether there are 20,000 points in the “Swiss roll” data set. We sample 1000 out of 20,000. Isomap example: Sample points with Swiss Roll 50 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining
