Dimensionality Reduction

Presentation Transcript


1. Dimensionality Reduction • Given N vectors in n dimensions, find the k most important axes onto which to project them • k is user-defined (k < n) • Applications: information retrieval & indexing • identify the k most important features, or • reduce indexing dimensions for faster retrieval (low-dimensional indices are faster)

2. Techniques • Eigenvalue analysis techniques [NR’92] • Karhunen-Loève (K-L) transform • Singular Value Decomposition (SVD) • both need O(N²) time • FastMap [Faloutsos & Lin 95] • dimensionality reduction and • mapping of objects to vectors • O(N) time

3. Mathematical Preliminaries • For an n×n square matrix S, a unit vector x, and a scalar value λ with Sx = λx: • x: eigenvector of S • λ: eigenvalue of S • The eigenvectors of a symmetric matrix (S = Sᵀ) are mutually orthogonal and its eigenvalues are real • r: rank of a matrix, the maximum number of independent columns or rows

4. Example 1 • Intuition: S defines a linear transform y = Sx that involves scaling and rotation • eigenvectors: unit vectors along the new directions • eigenvalues denote the scaling along each direction (figure: the eigenvector of the major axis)

5. Example 2 • If S is real and symmetric (S = Sᵀ) then it can be written as S = UΛUᵀ • the columns of U are eigenvectors of S • U: column-orthogonal (UUᵀ = I) • Λ: diagonal, holding the eigenvalues of S
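
A minimal numpy sketch of this factorization (the 2×2 matrix S here is an illustrative assumption, not taken from the slides):

```python
import numpy as np

# Illustrative real symmetric matrix (assumption for this sketch).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is numpy's routine for symmetric/Hermitian matrices;
# it returns real eigenvalues and orthonormal eigenvectors.
eigenvalues, U = np.linalg.eigh(S)

Lambda = np.diag(eigenvalues)
assert np.allclose(U @ U.T, np.eye(2))      # U is column-orthogonal
assert np.allclose(U @ Lambda @ U.T, S)     # S = U Λ Uᵀ
```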

6. Karhunen-Loève (K-L) • Project onto a k-dimensional space (k < n) minimizing the error of the projections (sum of squared differences) • K-L gives a linear combination of axes • sorted by importance • keep the first k dims (figure: 2-dim points and the 2 K-L directions; for k = 1, keep x′)

7. Computation of K-L • Put the N vectors in the rows of A = [aij] • Compute B = [aij − āj], where āj is the average of column j • Covariance matrix: C = BᵀB • Compute the eigenvectors of C • Sort them in decreasing eigenvalue order • Approximate each object by its projections on the directions of the first k eigenvectors

8. Intuition • B shifts the origin to the center of gravity of the vectors (subtracting the column averages āj), so it has zero column mean • C represents attribute-to-attribute similarity • C is square, real, and symmetric • Eigenvectors and eigenvalues are computed on C, not on A • the eigenvectors of C define the transform that minimizes the projection error • Approximate each vector by its projections along the first k eigenvectors

9. Example • Input vectors: [1 2], [1 1], [0 0] • The column averages are 2/3 and 1
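
A numpy sketch that carries this example through the steps of slide 7 (with k = 1, as in slide 6):

```python
import numpy as np

# The three input vectors from the example, one per row of A.
A = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])

col_avg = A.mean(axis=0)     # [2/3, 1], the column averages above
B = A - col_avg              # shift origin to the center of gravity
C = B.T @ B                  # covariance matrix

# Eigenvectors of C, sorted in decreasing eigenvalue order.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
V = eigenvectors[:, order]

k = 1
projections = B @ V[:, :k]   # each object kept as its first k coordinates
print(projections)
```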

10. SVD • Works for general rectangular matrices • N×n matrix (N vectors, n dimensions) • groups similar entities (documents) together • groups similar terms together; each group of terms corresponds to a concept • Given an N×n matrix A, write it as A = UΛVᵀ • U: N×r, column-orthogonal (r: rank of A) • Λ: r×r diagonal matrix (non-negative values, in descending order) • V: n×r, column-orthogonal (so Vᵀ is r×n)

11. SVD (cont'd) • A = λ1u1v1ᵀ + λ2u2v2ᵀ + … + λrurvrᵀ • the ui, vi are the column vectors of U, V • SVD identifies rectangular blobs of related values in A • The rank r of A: the number of blobs
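
A numpy sketch of the decomposition and the rank-1 expansion; the small document-term matrix is an assumption, loosely modeled on the CS/Medical example of the next slide (rows: documents; columns: data, information, retrieval, brain, lung):

```python
import numpy as np

# Assumed toy document-term matrix: 3 CS documents, 2 Medical ones.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))   # numerical rank: 2 blobs, i.e. 2 concepts
print("rank:", r)

# Spectral expansion: A = λ1·u1·v1ᵀ + ... + λr·ur·vrᵀ
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A_rebuilt, A)
```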

12. Example • Two types of documents: CS and Medical • Two concepts (groups of terms) • CS: data, information, retrieval • Medical: brain, lung

13. Example (cont'd) (figure: the matrices U, Λ, Vᵀ for this example; r = 2) • U: document-to-concept similarity matrix • V: term-to-concept similarity matrix • v12 = 0: the term “data” has zero similarity with the 2nd (Medical) concept

14. SVD and LSI • SVD leads to “Latent Semantic Indexing” (http://lsi.research.telcordia.com/lsi/LSIpapers.html) • Terms that occur together are grouped into concepts • When a user searches for a term, the system determines the relevant concepts to search • LSI maps documents and queries to vectors in the concept space instead of the n-dimensional term space • The concept space has lower dimensionality

15. Examples of Queries • Find documents containing the term “data” • Translate the query vector q to concept space • The query is related to the CS concept and unrelated to the Medical concept • LSI also returns documents containing the terms “retrieval” and “information”, which are not specified in the query
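
A sketch of how such a query could be folded into the concept space, reusing the assumed document-term matrix from the SVD sketch above; note that SVD determines singular vectors only up to sign, so the concept coordinate may come out negative:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],      # assumed matrix from the SVD sketch
              [2, 2, 2, 0, 0],      # columns: data, information,
              [1, 1, 1, 0, 0],      # retrieval, brain, lung
              [0, 0, 0, 1, 1],
              [0, 0, 0, 2, 2]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                        # term-to-concept matrix, 2 concepts

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # query: the single term "data"
q_concept = q @ V
print(q_concept)   # large magnitude on the CS concept, ~0 on Medical
```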

16. FastMap • Works with distances and has two roles: • Maps objects to vectors so that their distances are preserved (then apply SAMs, Spatial Access Methods, for indexing) • Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible

17. Main idea • Pretend that the objects are points in some unknown n-dimensional space • project these points on k mutually orthogonal axes • compute the projections using distances only • The heart of FastMap is the method that projects objects on a line: • take 2 objects that are far apart (the pivots) • project on the line that connects the pivots

18. Project Objects on a Line • Oa, Ob: pivots; Oi: any object • dij: shorthand for D(Oi, Oj) • xi: the first coordinate in the k-dimensional space • Apply the cosine law: dbi² = dai² + dab² − 2·xi·dab, which gives xi = (dai² + dab² − dbi²) / (2·dab) • If Oi is close to Oa, xi is small
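
The projection formula written as a small Python helper (the function name and the toy numbers are illustrative):

```python
def project_on_line(d_ai, d_bi, d_ab):
    """First coordinate of Oi via the cosine law:
    xi = (dai² + dab² − dbi²) / (2·dab)."""
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# If Oi coincides with the pivot Oa, its coordinate is 0:
print(project_on_line(d_ai=0.0, d_bi=5.0, d_ab=5.0))   # -> 0.0
```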

19. Choose Pivots • Heuristic from [Faloutsos & Lin 95]: 1. pick an arbitrary object Ob; 2. let Oa be the object farthest from Ob; 3. let Ob be the object farthest from Oa • Complexity: O(N) • The optimal algorithm (trying all pairs) would require O(N²) time • steps 2-3 can be repeated 4-5 times to improve the accuracy of the selection
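
A sketch of the heuristic as just described (arbitrary seed, then repeated farthest-object jumps); `dist` stands for whatever distance function is available:

```python
def choose_pivots(objects, dist, passes=5):
    """Pick two far-apart objects in O(N) per pass."""
    o_a = objects[0]                 # step 1: arbitrary seed
    for _ in range(passes):          # steps 2-3, repeated 4-5 times
        o_b = max(objects, key=lambda o: dist(o_a, o))
        o_a, o_b = o_b, o_a          # jump to the farthest object found
    return o_a, o_b
```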

20. Extension for Many Dimensions • Consider the (n−1)-dimensional hyperplane H that is perpendicular to the line OaOb • Project the objects on H and apply the previous step: • choose two new pivots • the new xi is the next coordinate of each object • repeat until k-dimensional vectors are obtained • The distance on H is not D • D′: the distance between the projected objects

21. Distance on the Hyperplane H • D′ on H can be computed from the Pythagorean theorem: D′(Oi, Oj)² = D(Oi, Oj)² − (xi − xj)² • The ability to compute D′ allows for choosing a second line on H, etc.

22. Algorithm (figure: the FastMap pseudocode; see the sketch below)
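
Since the slide's pseudocode is an image, here is a compact Python reconstruction of the whole algorithm as described on slides 17-21 (a sketch under the stated assumptions, not the authors' original code; negative squared distances are clamped to 0, anticipating slide 24):

```python
import numpy as np

def fastmap(objects, dist, k, passes=5):
    """Map each object to a k-dim vector using only the distance function."""
    n = len(objects)
    X = np.zeros((n, k))            # output coordinates
    pivots = []                     # pivots recorded per dimension (queries)

    def d2(i, j, col):
        # Squared distance on the current hyperplane (slide 21):
        # D'² = D² − Σ (coordinate differences)²; clamp negatives to 0.
        base = dist(objects[i], objects[j]) ** 2
        return max(base - float(np.sum((X[i, :col] - X[j, :col]) ** 2)), 0.0)

    for col in range(k):
        a = 0                       # pivot heuristic (slide 19)
        for _ in range(passes):
            b = max(range(n), key=lambda j: d2(a, j, col))
            a, b = b, a
        d_ab2 = d2(a, b, col)
        if d_ab2 == 0.0:            # all remaining points coincide: stop
            break
        pivots.append((a, b))
        d_ab = d_ab2 ** 0.5
        for i in range(n):          # cosine-law projection (slide 18)
            X[i, col] = (d2(a, i, col) + d_ab2 - d2(b, i, col)) / (2 * d_ab)
    return X, pivots
```

For points that really live in a Euclidean space, dist can simply be, e.g., lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q))).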

23. Observations • Complexity: O(kN) distance calculations • k: the desired dimensionality • k recursive calls, each taking O(N) • The algorithm records the pivots of each call (dimension) to facilitate queries: • a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension • O(1) computation per step: no need to recompute the pivots

24. Observations (cont'd) • The projected vectors can be indexed • mapping on 2-3 dimensions allows for visualization of the data space • Assumes a Euclidean space (the triangle/cosine laws hold) • not always true (at least after the second step) • Pivot selection is approximate, so some (squared) projected distances can become negative • turn negative distances into 0

25. Application: Document Vectors

26. FastMap on 10 documents for 2 & 3 dimensions (figure: (a) k = 2 and (b) k = 3)

27. References • C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996 • W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988 • LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html • C. Faloutsos and K.-I. Lin, FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, Proc. of SIGMOD, 1995
