
SDM06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets






Presentation Transcript


  1. SDM06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets. Michael W. Mahoney, Yahoo! Research. Petros Drineas, CS - RPI. Tutorial given at the SIAM Data Mining Meeting, April 22, 2006. Most recent copy available at: http://www.cs.yale.edu/homes/mmahoney and http://www.cs.rpi.edu/~drinep

  2. Randomized Linear Algebra Algorithms • Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later not so large!) matrices and tensors. • Matrix Multiplication • Computation of the Singular Value Decomposition • Computation of the CUR Decomposition • Testing Feasibility of Linear Programs • Least Squares Approximation • Tensor computations: SVD generalizations • Tensor computations: CUR generalization • Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n x n matrices.

  3. Example: the CUR decomposition. (Figure: A ≈ C·U·R, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a small, carefully chosen matrix.) Algorithmic Motivation: To speed up computations in applications where extremely large data sets are modeled by matrices and, e.g., O(n^2) space and O(n^3) time is not an option. Structural Motivation: To reveal novel structural properties of the datasets, given sufficient computational time, that are useful in applications. Goal: make ||A-CUR|| small. Why? (Algorithmic) After making two passes over A, we can compute provably good C, U, and R and store them (a "sketch") instead of A: O(m+n) vs. O(n^2) space. Why? (Structural) Given sufficient time, we can find C, U, and R such that A - CUR is "very" small. This might lead to a better understanding of the data. Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and "reconstruct" A as CUR. If the sampling probabilities are not "too bad", we get provably good accuracy.

  4. Applications of such algorithms • Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A. • Covariance Matrices • Latent Semantic Indexing • DNA Microarray Data • Eigenfaces and Image Recognition • Similarity Queries • Matrix Reconstruction • LOTS of other data applications!! • More generally, • Linear and Nonlinear Programming Applications • Design of Approximation Algorithms • Statistical Learning Theory Applications

  5. Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method

  6. Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems

  7. The Pass-Efficient Model • Motivation: The amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly. • We can store large amounts of data, but • we cannot process these data with traditional algorithms. • In the Pass-Efficient Model: • Data are assumed to be stored on disk/tape. • The algorithm has access to the data via passes over the data. • The algorithm is allowed additional RAM space and additional computation time. • An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution. • Note: If the data are an m x n matrix A, then algorithms that require additional time and space that is O(m+n) or O(1) are pass-efficient.

  8. Random Sampling. Randomized Algorithms for Linear Algebra problems: A "sketch" consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations. • Random Sampling and Randomized Algorithms: • Better complexity properties (randomization as a resource). • Simpler algorithms and/or analysis (maybe de-randomize later). • Uniform Sampling: • Typically things work in expectation, but poor variance properties. • Non-uniform Sampling: • With "good" probabilities, can make the variance small.

  9. Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method

  10. Approximating Matrix Multiplication (D. & Kannan FOCS '01, and D., Kannan, & M. TR '04, SICOMP '06). Problem Statement: Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B, or, equivalently, approximate the sum of n rank-one matrices: A·B = Σi A(i) B(i), where A(i) denotes the i-th column of A and B(i) denotes the i-th row of B. Each term in the summation is a rank-one matrix.

  11. ...by random sampling • Algorithm • Fix a set of probabilities pi, i = 1,…,n, summing up to 1. • For t = 1 up to s, set jt = i, where Pr(jt = i) = pi. • (Pick s terms of the sum, with replacement, with respect to the pi.) • Approximate AB by the sum of the s sampled rank-one terms, after scaling: AB ≈ Σt A(jt) B(jt) / (s·pjt), where A(jt) is the sampled column of A and B(jt) is the corresponding row of B.

  12. Random sampling (cont'd): keep only the sampled terms j1, j2, …, js (a few columns of A and the corresponding rows of B).

  13. The algorithm (matrix notation). Create C and R by performing s i.i.d. trials, with replacement. • For t = 1 up to s, pick a column A(jt) and a row B(jt) with probability Pr(jt = i) = |A(i)| |B(i)| / Σj |A(j)| |B(j)| (the norm of the i-th column of A times the norm of the i-th row of B, normalized to sum to 1). • Include A(jt)/(s·pjt)^(1/2) as a column of C, and B(jt)/(s·pjt)^(1/2) as a row of R.
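
The two sampling steps above map almost directly onto a few lines of NumPy. The following is a rough illustrative sketch (the function name approx_matmul and the test sizes are ours, not the authors' code); it uses the probabilities pi proportional to |A(i)| |B(i)| from this slide and the 1/√(s·pi) rescaling, so that C·R is an unbiased estimate of A·B.

```python
import numpy as np

def approx_matmul(A, B, s, seed=None):
    """Illustrative sketch of sampling-based matrix multiplication.

    Draws s column/row index pairs i.i.d. with p_i proportional to
    ||A^(i)|| * ||B_(i)||, rescales by 1/sqrt(s * p_i), and returns C, R
    such that E[C @ R] = A @ B.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    scale = 1.0 / np.sqrt(s * p[idx])     # rescaling keeps the estimator unbiased
    C = A[:, idx] * scale                 # m x s: sampled, rescaled columns of A
    R = B[idx, :] * scale[:, None]        # s x p: sampled, rescaled rows of B
    return C, R

# Usage: C @ R approximates A @ B in the Frobenius norm.
A = np.random.randn(200, 300)
B = np.random.randn(300, 150)
C, R = approx_matmul(A, B, s=60)
rel_err = np.linalg.norm(A @ B - C @ R) / np.linalg.norm(A @ B)
```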

  14. Simple Lemmas • The input matrices are given in "sparse unordered representation"; e.g., their non-zero entries are presented as triples (i, j, Aij) in any order. • The expectation of CR (element-wise) is AB. • Our nonuniform sampling minimizes the variance of the estimator. • It is easy to implement the sampling in two passes. • If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time. • It requires O(sm+sp) RAM space. • It does not tamper with the sparsity of the matrices.

  15. Error Bounds • For the above algorithm, E[ ||AB-CR||F ] ≤ (1/√s) ||A||F ||B||F. • For the above algorithm, with probability at least 1-δ, ||AB-CR||F ≤ O( √(log(1/δ)/s) ) ||A||F ||B||F. • This is a relative error bound if ||AB||F = Ω(||A||F ||B||F), i.e., if there is "not much cancellation" in the multiplication. • We removed the expectation (by applying a martingale argument) and so have an extra log(1/δ) factor. • Markov's inequality would also remove the expectation, introducing an extra 1/δ factor.

  16. Special case: B = AT. If B = AT, then the sampling probabilities are pi = |A(i)|^2 / ||A||F^2. Also, R = CT, and the error bounds become E[ ||AAT-CCT||F ] ≤ (1/√s) ||A||F^2 (with an extra log(1/δ) factor for the bound that holds with probability at least 1-δ).

  17. Special case: B = AT (cont'd) (Rudelson & Vershynin '04, Vershynin '04). Improvement of the spectral norm bound for the special case B = AT. • Uses a result of M. Rudelson for random vectors in isotropic position. • Tight concentration results can be proven using Talagrand's theory. • The sampling procedure is slightly different; s columns/rows are kept in expectation, i.e., column i is picked with probability pi = min{1, s |A(i)|^2 / ||A||F^2}.

  18. Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method

  19. Singular Value Decomposition (SVD). A = U S VT, where U (V) is an orthogonal matrix containing the left (right) singular vectors of A, and S is a diagonal matrix containing the singular values of A. • Exact computation of the SVD takes O(min{mn^2, m^2n}) time. • The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).

  20. Rank k approximations (Ak). Ak = Uk Sk VkT, where Uk (Vk) is an orthogonal matrix containing the top k left (right) singular vectors of A, and Sk is a diagonal matrix containing the top k singular values of A. Also, Ak = Uk UkT A. Ak is a matrix of rank k such that ||A-Ak||2,F is minimized over all rank-k matrices! This property is very useful in the context of Principal Component Analysis.
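
For reference, the exact rank-k approximation on this slide is a two-liner in NumPy (a minimal sketch; the function name is ours):

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation A_k of A via the exact (truncated) SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equivalently, A_k = U_k U_k^T A:  Uk = U[:, :k];  Ak = Uk @ (Uk.T @ A)
```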

  21. Approximating SVD in O(n) time • (D., Frieze, Kannan, Vempala & Vinay SODA '99, JML '04; D., Kannan, & M. TR '04, SICOMP '06) • Given: m x n matrix A. • Sample c columns from A and rescale to form the m x c matrix C. • Compute the m x k matrix Hk of the top k left singular vectors of C. • Structural Theorem: For any probabilities and number of columns: ||A-HkHkTA||2,F^2 ≤ ||A-Ak||2,F^2 + 2√k ||AAT-CCT||F. • Algorithmic Theorem: If pi = |A(i)|^2/||A||F^2 and c ≥ 4k/ε^2, then: ||A-HkHkTA||2,F^2 ≤ ||A-Ak||2,F^2 + ε||A||F^2. • Proof: via the matrix multiplication theorem and matrix perturbation theory.
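
A minimal sketch of the column-sampling step in this theorem, assuming the probabilities pi = |A(i)|^2/||A||F^2 above (function and variable names are ours, not the LinearTimeSVD reference code):

```python
import numpy as np

def approx_top_left_singvecs(A, c, k, seed=None):
    """Sample c columns of A with p_i = |A^(i)|^2 / ||A||_F^2, rescale,
    and return H_k, the top k left singular vectors of the sketch C.
    Then H_k @ H_k.T @ A is a provably good rank-k approximation of A."""
    rng = np.random.default_rng(seed)
    col_sq = np.sum(A * A, axis=0)
    p = col_sq / col_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])   # m x c rescaled sketch of A
    Hk = np.linalg.svd(C, full_matrices=False)[0][:, :k]
    return Hk
```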

  22. Example of randomized SVD. (Figure: the original image matrix A, and the matrix C obtained after sampling a few of its columns.) Compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix Hk.

  23. Example of randomized SVD (cont'd). (Figure: the original matrix A and the approximation HkHkTA.) A and HkHkTA are close.

  24. Element-wise sampling (Achlioptas & McSherry, STOC '01, JACM '05) • The Algorithm in 2 lines: • To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements. • Compute a rank k approximation to this sparse matrix (using Lanczos methods). • More details: Let pij ∈ [0,1] for all i,j. Create the matrix S from A such that Sij = Aij/pij with probability pij, and Sij = 0 otherwise. • If ||A-S||2 is bounded, then (i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.

  25. Element-wise sampling (cont'd) • Approximating singular values fast: • Zero out (a large number of) elements of A, and scale the remaining ones appropriately. • Compute the singular values of the resulting sparse matrix using iterative techniques. • (Good choice for pij: pij = s·Aij^2 / Σi,j Aij^2, where s denotes the expected number of elements that we seek to keep in S.) • Note: Each element is kept or discarded independently of the others. • Similar ideas have been used to: • explain the success of Latent Semantic Indexing (LSI): (Papadimitriou, Raghavan, Tamaki, Vempala, PODS '98 & Azar, Fiat, Karlin, McSherry, Saia STOC '01) • design recommendation systems: (Azar, Fiat, Karlin, McSherry, Saia STOC '01) • speed up kernel computations: (Achlioptas, McSherry, and Schölkopf, NIPS '02)
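
A hedged sketch of the element-wise sampling just described (the min(1, ·) clamp and the function name sparsify are our additions; this mirrors the idea rather than the exact published algorithm):

```python
import numpy as np

def sparsify(A, s, seed=None):
    """Keep entry A_ij independently with probability
    p_ij = min(1, s * A_ij^2 / sum(A^2)) and rescale kept entries by 1/p_ij,
    so that E[S] = A and roughly s entries survive."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    p = np.minimum(1.0, s * A**2 / np.sum(A**2))
    keep = rng.random(A.shape) < p        # independent coin flip per entry
    S = np.zeros_like(A)
    S[keep] = A[keep] / p[keep]
    return S

# The singular values of the sparse S can then be computed with iterative
# (Lanczos-type) methods, e.g. scipy.sparse.linalg.svds on a sparse copy of S.
```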

  26. Element-wise vs. row/column sampling • Quite different techniques! • Row/column sampling preserves subspace/structural properties of the matrices. • Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors. • The two techniques should be complementary! • Some similarities and differences: • Similar error bounds. • Element-wise sampling is doable in one pass! • Running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods. • Element-wise methods do not seem amenable to many of the extensions we will present.

  27. Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method

  28. A novel CUR matrix decomposition (D. & Kannan, SODA '03, D., Kannan, & M. TR '04, SICOMP '06). Create an approximation to the original matrix of the following form: A ≈ C·U·R, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is a small, carefully chosen matrix. An application: Given a query vector x, instead of computing A·x, compute CUR·x to identify its nearest neighbors.

  29. The CUR decomposition. Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that: • C consists of c = O(k/ε^2) columns of A. • R consists of r = O(k/ε^2) rows of A. • C (R) is created using importance sampling, e.g., columns (rows) are picked in i.i.d. trials with respect to probabilities pj = |A(j)|^2/||A||F^2 (squared column/row norms over the squared Frobenius norm). • C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time. • The product CUR satisfies (with high probability): ||A-CUR||F ≤ ||A-Ak||F + ε||A||F.

  30. Computing U • Intuition (which can be formalized - see later): • The CUR algorithm expresses every row of the matrix A as a linear combination of a small subset of the rows of A. • This small subset consists of the rows in R. • Given a row of A - say A(i) - the algorithm computes a good fit for the row A(i) using the rows in R as the basis, by approximately solving minu ||A(i) - uR||. • But only c = O(1) elements of the i-th row are given as input. • So, we only approximate the optimal vector u instead of computing it exactly. • Actually, the pass-efficient CUR decomposition approximates the approximation.
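
Putting slides 29-30 together, a rough end-to-end sketch in NumPy (illustrative only: the least-squares U below touches all of A via pseudoinverses, so it is not the two-pass, pass-efficient construction described above; all names are ours):

```python
import numpy as np

def cur_decomposition(A, c, r, seed=None):
    """Sample c columns and r rows with probabilities proportional to their
    squared norms, rescale them, and choose U so that C @ U @ R fits A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    pc = np.sum(A * A, axis=0); pc = pc / pc.sum()   # column probabilities
    pr = np.sum(A * A, axis=1); pr = pr / pr.sum()   # row probabilities
    jc = rng.choice(n, size=c, replace=True, p=pc)
    jr = rng.choice(m, size=r, replace=True, p=pr)
    C = A[:, jc] / np.sqrt(c * pc[jc])
    R = A[jr, :] / np.sqrt(r * pr[jr])[:, None]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # best-fit U, for illustration
    return C, U, R
```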

  31. If we pick O(1/2) rows and O(1/2) columns, Error bounds for CUR Assume Ak is the “best” rank k approximation to A (through SVD). Then, if we pick O(k/2) rows and O(k/2) columns,

  32. Previous CUR-type decompositions (For details see Drineas & Mahoney, “A Randomized Algorithm for a Tensor-Based Generalization of the SVD”, ‘05.)

  33. Lower Bounds. Question: How many queries does a sampling algorithm need to approximate a given function accurately with high probability? • (Ziv Bar-Yossef '03, '04) Lower bounds for the low rank matrix approximation problem and the matrix reconstruction problem. • Any sampling algorithm that w.h.p. finds a good low rank approximation requires Ω(m+n) queries. • Even if the algorithm is given the exact weight distribution over the columns of a matrix, it will still require Ω(k/ε^4) queries. • Finding a matrix D such that ||A-D||F ≤ ε||A||F requires Ω(mn) queries, and finding a D such that ||A-D||2 ≤ ε||A||2 requires Ω(m+n) queries. • Applied to our results: • The LinearTimeSVD algorithm is optimal w.r.t. ||·||F bounds. • The ConstantTimeSVD algorithm is optimal w.r.t. ||·||2 bounds up to poly factors. • The CUR algorithm is optimal for constant ε.

  34. Overview (1/2) • Data Streaming Models and Random Sampling • Matrix Multiplication • Singular Value Decomposition • CUR Matrix Decomposition • Applications of Matrix CUR • Data mining • DNA microarray (and DNA SNP) data • Recommendation Systems • Kernel-CUR and the Nystrom Method

  35. CUR application: Data Mining • Database: An m-by-n matrix A, e.g., m (> 10^6) objects and n (> 10^5) features. • Queries: Given a new object x, find similar objects (nearest neighbors) in A. • Closeness: Two normalized objects x and d are "close" if xT·d = cos(x,d) is high. • Given a query vector x, the matrix product A·x computes all the angles/distances. • Key observation: The exact value xT·d might not be necessary. • The feature values in the vectors are set by coarse heuristics. • It is in general enough to see if xT·d > Threshold. • Algorithm: Given a query vector x, compute CUR·x to identify nearest neighbors (see the sketch below). • Theorem: We have a worst-case bound (over x) on the error of using CUR instead of A.
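
A hypothetical query flow for this application, reusing the same pinv-based CUR sketch as before so the snippet stands alone (the database size, the normalization, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 200))              # stand-in database: 1000 objects x 200 features
A /= np.linalg.norm(A, axis=1, keepdims=True)     # normalize objects so A @ x gives cosines

c = r = 40
pc = np.sum(A * A, axis=0); pc /= pc.sum()
pr = np.sum(A * A, axis=1); pr /= pr.sum()
C = A[:, rng.choice(200, size=c, p=pc)]
R = A[rng.choice(1000, size=r, p=pr), :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)     # illustrative U, not the pass-efficient one

x = rng.standard_normal(200); x /= np.linalg.norm(x)
scores = C @ (U @ (R @ x))                        # approximates A @ x, computed right-to-left
nearest = np.argsort(-scores)[:10]                # candidate nearest neighbors of x
```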

  36. CUR application: Genetic Microarray Data. Exploit structural properties of CUR in biological applications. (Figure: a genes-by-experimental-conditions data matrix.) Find a "good" set of genes and arrays to include in C and R? Provable and/or heuristic strategies are acceptable. • Common in Biological/Chemical/Medical applications of PCA: • Explain the singular vectors, by mapping them to meaningful biological processes. • This is a "challenging" task (think: reification)! • CUR is a low-rank decomposition in terms of the data that practitioners understand. • Use it to explain the data and do dimensionality reduction, classification, clustering. • Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).

  37. CUR application: Recommendation Systems (D., Raghavan, & Kerenidis, STOC '02). (Figure: the m x n customers-by-products utility matrix, with sampled rows from a customer sample of "guinea pigs" and sampled columns from a customer sample of purchases and small surveys.) • The problem: m customers and n products; Aij is the (unknown) utility of product j for customer i. • The goal: recreate A from a few samples to recommend high-utility products. • (KRRT98): Assuming strong clustering of the products, competitive algorithms even with only 2 samples/customer. • (AFKMS01): Assuming sampling of Ω(mn) entries of A and a gap requirement, accurately recreate A. • Question: Can we get competitive performance by sampling o(mn) elements? • Answer: Apply the CUR decomposition.

  38. Kernel-CUR Motivation • Kernel-based learning methods to extract non-linear structure: • Choose features to define a (dot product) space F. • Map the data, X, to F by Φ: X → F. • Do classification, regression, and clustering in F with linear methods (SVMs, GPs, SVD). • If the Gram matrix G, where Gij = kij = (Φ(X(i)), Φ(X(j))), is dense but has low numerical rank, then calculations of interest need O(n^2) space and O(n^3) time: • matrix inversion in GP prediction, • quadratic programming problems in SVMs, • computation of the eigendecomposition of G. • Relevant recent work using low-rank methods: • (Williams and Seeger, NIPS '01, etc.): Nystrom method for out-of-sample extensions. • (Achlioptas, McSherry, and Schölkopf, NIPS '02): randomized kernels.

  39. Kernel-CUR Decomposition (D. & M., COLT '05, TR '05, JMLR '05) • Input: n x n SPSD matrix G, probabilities {pi, i = 1,…,n}, c <= n, and k <= c. • Algorithm: • Let C be the n x c matrix containing c randomly sampled (and rescaled) columns of G. • Let W be the c x c matrix containing the intersection of C and the corresponding c rows of G. • Theorem: Let pi = Gii^2 / Σi Gii^2. If c = O(k log(1/δ)/ε^4), then with probability at least 1-δ, ||G - CWk+CT||F ≤ ||G - Gk||F + ε Σi Gii^2. If c = O(log(1/δ)/ε^4), then with probability at least 1-δ, ||G - CWk+CT||2 ≤ ||G - Gk||2 + ε Σi Gii^2.
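
A sketch of a Nystrom-style approximation in the spirit of this slide, with probabilities pi = Gii^2/Σ Gii^2 (the rank-k pseudoinverse handling and all names are ours; treat it as an illustration rather than the exact algorithm from the paper):

```python
import numpy as np

def nystrom_kernel_cur(G, c, k, seed=None):
    """Sample c columns of the SPSD matrix G with p_i = G_ii^2 / sum_j G_jj^2,
    form the rescaled C (sampled columns) and W (c x c intersection block),
    and return C @ pinv(W_k) @ C.T, with W_k the best rank-k approximation of W."""
    rng = np.random.default_rng(seed)
    d = np.diag(G) ** 2
    p = d / d.sum()
    idx = rng.choice(G.shape[0], size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale
    W = (G[np.ix_(idx, idx)] * scale) * scale[:, None]
    Uw, sw, Vwt = np.linalg.svd(W)                 # W stays symmetric under this rescaling
    sw_inv = np.zeros_like(sw)
    nz = sw[:k] > 1e-12
    sw_inv[:k][nz] = 1.0 / sw[:k][nz]
    Wk_pinv = (Vwt.T * sw_inv) @ Uw.T              # rank-k pseudoinverse of W
    return C @ Wk_pinv @ C.T
```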

  40. Overview (2/2) • Tensor-based data sets • Tensor-CUR • Hyperspectral data • Recommendation systems • From Very-Large to Medium-Sized Data • Relative-error CX and CUR Matrix Decompositions • L2 Regression Problems • Application to DNA SNP Data • Conclusions and Open Problems

  41. Datasets modeled as tensors • Tensors (naively, a dataset subscripted by multiple indices) appear both in Math and CS. • Represent high-dimensional functions. • Connections to complexity theory (e.g., matrix multiplication complexity). • Statistical applications (e.g., ICA, higher-order statistics, etc.). • Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging). • Problem: There does not exist a definition of tensor rank (and associated tensor SVD) with the nice properties found in the matrix case. • (Lek-Heng Lim '05: strong impossibility results!) • Common heuristic: "unfold" the tensor along a mode and apply Linear Algebra. • We will do this, but note that this kills the essential tensor structure.

  42. Datasets modeled as tensors (cont'd). (Figure: an m x n x p tensor A with modes 1, 2, and 3.) Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices) using a small number of samples. • Tensor rank (the minimum number of rank-one tensors) is NP-hard to compute. • Tensor α-rank ("unfold" along the α-th mode and then apply the matrix SVD) is a commonly-used heuristic. • Randomized-Tensor-CUR: unfold along a "distinguished" mode and reconstruct. • Randomized-Tensor-SVD: unfold along every mode and choose columns. (Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD," TR05.)

  43. The TensorCUR algorithm (3-modes) • Choose the preferred mode α (e.g., time). • Pick a few representative "slabs" (let R denote the tensor of the sampled slabs). • Use only information in a small number of representative "fibers" (let C denote the tensor of sampled fibers and U a low-dimensional encoding tensor). • Express the remaining slabs as linear combinations of the basis of sampled slabs.
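
A very rough sketch of these four steps for a 3-mode array, with mode 3 as the distinguished mode (uniform sampling and all names are our simplifications; the slides sample slabs and fibers non-uniformly):

```python
import numpy as np

def tensor_cur_3mode(A, num_slabs, num_fibers, seed=None):
    """A (m x n x p): sample a few slabs A[:, :, j] and a few fiber positions in
    the m x n grid, then express every slab as a least-squares combination of the
    sampled slabs, with the fit computed only on the sampled fiber positions."""
    rng = np.random.default_rng(seed)
    m, n, p = A.shape
    slab_idx = rng.choice(p, size=num_slabs, replace=False)        # sampled slabs (the tensor R)
    fiber_idx = rng.choice(m * n, size=num_fibers, replace=False)  # sampled fibers (the tensor C)
    R = A[:, :, slab_idx].reshape(m * n, num_slabs)                # unfolded sampled slabs
    A_unf = A.reshape(m * n, p)                                    # unfold along mode 3
    # Fit each slab on the sampled fibers only (num_fibers >= num_slabs), then
    # expand the fit with the fully observed sampled slabs.
    coeffs, *_ = np.linalg.lstsq(R[fiber_idx, :], A_unf[fiber_idx, :], rcond=None)
    A_hat = (R @ coeffs).reshape(m, n, p)
    return A_hat, slab_idx, fiber_idx
```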

  44. Tensor-CUR application: Hyperspectral Image Analysis (with M. Maggioni and R. Coifman at Yale). (Figure: a hyperspectral data cube of ca. 500 x ca. 500 pixels at 128 frequencies, viewed either as an m x n x p tensor A or as an mn x p matrix A.) Goal: Extract structure from temporally-resolved images or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels). Note: A temporally or spectrally resolved image may be viewed as a tensor (naively, a dataset subscripted by multiple indices) or as a matrix (whose columns have internal structure that is not modeled). Note: The chosen images are a dictionary from the data to express every image. Note: The chosen pixels are a dictionary from the data to express every pixel.

  45. Sampling hyperspectral data • Sample slabs depending on total absorption. (Figure: absorption spectra at two pixel types.) • Sample fibers uniformly (since intensity depends on stain).

  46. Eigen-analysis of slabs and fibers

  47. Look at the exact (65-th) slab.

  48. The (65-th) slab approximately reconstructed This slab was reconstructed by approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.

  49. Tissue Classification - Exact Data
