Jeff Hansen, Senior Data Engineer. April 2013. Demystifying Dimensionality Reduction: A Tribute to Johnson and Lindenstrauss. Who is this? What is this? How about this? Hint: It’s for kids… Some Perspectives are Better than Others.

Great, but… • What does this have to do with Machine Learning? • How can this help me visualize my data? • How do I use this to recommend new products to new customers? • Can this help me detect fraud?

Distance and Similarity If we • Treat each feature like a dimension • Treat each item like a point Then • Similar items are closer together • Dissimilar items are further apart

Measures of Distance Various measures of distance with scary math names: • Euclidean Distance • Maximum Distance • Manhattan Distance • L(n) Norm
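As a minimal sketch of the measures listed above, assuming NumPy and two made-up 3-dimensional points `a` and `b` (neither is from the slides), each distance is a one-liner. The helper `minkowski` is a hypothetical name for the general L(n) norm, with the slide's n written as `p` here.

```python
import numpy as np

def minkowski(a, b, p):
    """General L(p) distance: the p-th root of the summed p-th powers of the differences."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two illustrative points (assumed data, one coordinate per feature).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = float(np.linalg.norm(a - b))   # straight-line distance, the L(2) norm
manhattan = float(np.sum(np.abs(a - b)))   # sum of per-feature differences, the L(1) norm
maximum = float(np.max(np.abs(a - b)))     # largest single per-feature difference

print(euclidean, manhattan, maximum)
```

Euclidean and Manhattan distance are the special cases `minkowski(a, b, 2)` and `minkowski(a, b, 1)`, and the maximum distance is the limit as `p` grows.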

Curse of Dimensionality • You think more than 3 dimensions are hard? Try a couple million… • Calculating similarity becomes increasingly difficult as a feature set grows.

Reduce the Number of Dimensions Johnson-Lindenstrauss Theorem • The number of dimensions doesn’t matter, the sample size does – for a fixed error tolerance, approximate item similarity can be maintained with a number of dimensions on the order of log(n), where n is the number of points. English? • Every time you double the number of points you only need to add a constant number of additional dimensions.
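One way to see the theorem in action is a Gaussian random projection, sketched below with NumPy. The sizes (50 points, 5,000 original dimensions, 400 projected dimensions), the random data, and the helper `pairwise_distances` are all assumptions chosen for illustration. They are not from the slides, and a random projection is only one of several constructions that satisfy the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the slides).
n_points, d_high, d_low = 50, 5000, 400

# Random points in the high-dimensional space, one row per point.
X = rng.normal(size=(n_points, d_high))

# Gaussian random projection, scaled so squared lengths are preserved on average.
R = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)
Y = X @ R

def pairwise_distances(M):
    """Euclidean distance between every pair of rows of M."""
    sq = np.sum(M ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)
    return np.sqrt(np.maximum(d2, 0.0))

# Compare each pair of points once, skipping self-pairs.
iu = np.triu_indices(n_points, k=1)
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
worst_distortion = float(np.max(np.abs(ratios - 1.0)))

print("worst relative change in any pairwise distance:", worst_distortion)
```

The projected data has less than a tenth of the original columns, yet every pairwise distance should stay close to its original value. Shrinking `d_low` trades accuracy for size.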

This is worth Repeating The number of dimensions doesn’t matter. If all you care about is item similarity, you can project an INFINITE number of dimensions onto a lower number of dimensions based on the number of points you want to compare.

Feature Extraction • What if there were unrecorded variables that explain the variables we can see? • Dimensionality Reduction techniques extract these hidden variables or features. • For example, Topics explain the appearance of words in documents, Genres explain the movies that people watch.

Vectors and Projections *Image courtesy of Wikipedia: http://en.wikipedia.org/wiki/File:3D_Vector.svg

Vector “dot” Products • A . B = (a1 * b1) + (a2 * b2) + (a3 * b3) • A . B = || A || * || B || * cos θ If B is a unit vector (it has a length of 1) then the result is simply the length of A projected onto the line (or dimension) formed by B. Remember that a “good” projection is one where the angle is close to zero, so that cos θ is close to 1 and the dot product of A and B is approximately the length of A. This is like projecting the face of a coin onto a surface that’s parallel to the face of the coin – that would be a good projection.
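The two forms of the dot product can be checked numerically. This is a small sketch assuming NumPy and made-up vectors: `A` is an arbitrary example, `B` is a unit vector along the first axis, and `B_aligned` is a unit vector pointing the same way as `A`.

```python
import numpy as np

A = np.array([3.0, 4.0, 0.0])          # example vector with length 5
B = np.array([1.0, 0.0, 0.0])          # unit vector along the first axis

# Component form: (a1 * b1) + (a2 * b2) + (a3 * b3)
dot = float(np.dot(A, B))

# Angle form: || A || * || B || * cos(theta), solved for cos(theta)
cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
projected_length = float(np.linalg.norm(A) * cos_theta)

# A "good" projection: B_aligned points the same way as A, so cos(theta) is 1.
B_aligned = A / np.linalg.norm(A)
good_dot = float(np.dot(A, B_aligned))

print(dot, projected_length, good_dot)
```

Because `B` has length 1, `dot` and `projected_length` agree, and projecting onto `B_aligned` recovers the full length of `A`.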

Matrix Multiplication Cell 1,1 = Row 1 times Column 1 = (a1,1 x b1,1) + (a1,2 x b2,1) + (a1,3 x b3,1) Cell 1,2 = Row 1 times Column 2 = … …
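The "row times column" rule can be written out cell by cell. The sketch below assumes NumPy and two small made-up matrices. The function `matmul_by_cells` is a hypothetical name used only for illustration, and in practice NumPy's `@` operator does the same job.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # 2 rows, 3 columns
B = np.array([[7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])           # 3 rows, 2 columns

def matmul_by_cells(A, B):
    """Fill each cell (i, j) with the dot product of row i of A and column j of B."""
    rows, inner = A.shape
    inner_b, cols = B.shape
    if inner != inner_b:
        raise ValueError("columns of A must match rows of B")
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

C = matmul_by_cells(A, B)
print(C)
```

Cell 1,1 here is (1 x 7) + (2 x 9) + (3 x 11) = 58, matching the pattern on the slide.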

Matrix Division? What if you could factor a matrix? You Can! Matrix Decompositions: • LU Decomposition • QR Decomposition • Eigen Decomposition • Singular Value Decomposition
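To make "factoring a matrix" concrete, here is a brief sketch using the QR decomposition from the list above, assuming NumPy's `np.linalg.qr` and an arbitrary made-up 3 x 3 matrix `M`. QR is picked only because it is short to demonstrate. Any of the listed decompositions would illustrate the same idea.

```python
import numpy as np

# Arbitrary example matrix (assumed data).
M = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 0.0, 0.0]])

# Factor M into Q (orthonormal columns) and R (upper triangular).
Q, R = np.linalg.qr(M)

# Multiplying the factors back together recovers the original matrix.
recovered = Q @ R
print(np.allclose(recovered, M))
```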

Why would you Want to? A 1,000,000 x 1,000,000 matrix has 1,000,000,000,000 cells. Two factors of 1,000,000 x 100 and 100 x 1,000,000 have 100 x 1,000,000 + 100 x 1,000,000 = 200,000,000 cells. That’s a MUCH smaller representation!

Factors as Basis for new Space Suppose C is a matrix of people who have watched movies. Every row represents a person and every column represents a movie. If we can find matrices A and B where A x B approximates C: • Each row of A models a person • The distance between two rows of A models relative similarity • Each column of B models a movie • The distance between two columns of B models relative similarity
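A toy version of this idea can be sketched with NumPy. The 6 x 6 watch matrix `C`, the choice of `k = 2` hidden features, and the use of a truncated SVD to obtain `A` and `B` are all assumptions made for illustration. The slides do not say how the factors are found at this point.

```python
import numpy as np

# Hypothetical data: rows are people, columns are movies, 1 = watched.
# People 0-2 watch the first three movies, people 3-5 watch the last three.
C = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

k = 2  # number of hidden features to keep (an assumption for this toy data)

# One way to get factors: keep the k strongest components of an SVD.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
A = U[:, :k] * s[:k]     # 6 x k, one row per person
B = Vt[:k, :]            # k x 6, one column per movie

approx_error = float(np.linalg.norm(C - A @ B))

# Distances between rows of A model how similar two people are.
same_taste = float(np.linalg.norm(A[0] - A[2]))       # identical viewing history
different_taste = float(np.linalg.norm(A[0] - A[3]))  # no movies in common

print(approx_error, same_taste, different_taste)
```

People 0 and 2 land on the same point in the 2-dimensional space, while people 0 and 3 end up far apart, even though each person is now described by 2 numbers instead of 6.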

A = U Σ V* • U and V are square orthonormal matrices – rows and columns are all unit vectors and are mutually perpendicular. • Σ is a rectangular diagonal matrix whose diagonal values never increase from top-left to bottom-right. • U and V can be viewed as projection matrices, Σ as a scaling matrix. • Earlier columns of U and earlier rows of V* capture most of the “action” of A. • If Σ “decays” quickly enough, most of U and V* is insignificant and can be thrown away without significantly affecting the model.
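The properties above can be checked with NumPy's `np.linalg.svd`. In this sketch the 8 x 6 matrix `A` is made-up data (a rank-3 signal plus a little noise, so that Σ decays quickly), and `k = 3` is an assumed cut-off. Note that `A` here is the matrix being decomposed, as in the formula, and NumPy returns V* directly as `Vt`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matrix: rank-3 signal plus small noise, so singular values decay quickly.
A = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6)) + 0.01 * rng.normal(size=(8, 6))

# Full SVD: U is 8 x 8, s holds the 6 singular values, Vt is V* (6 x 6).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the rectangular diagonal matrix Sigma (8 x 6) from the singular values.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

reconstructed = U @ Sigma @ Vt

# Keep only the first k columns of U and the first k rows of V*.
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
relative_error = float(np.linalg.norm(A - A_k) / np.linalg.norm(A))

print("singular values:", s)
print("relative error after truncation:", relative_error)
```

The truncated factors store 8 x 3 + 3 + 3 x 6 numbers instead of 8 x 6, and the reconstruction error should stay small because the discarded singular values are tiny.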