
Numerical Linear Algebra in the Streaming Model


Presentation Transcript


  1. Numerical Linear Algebra in the Streaming Model Ken Clarkson - IBM David Woodruff - IBM

  2. The Problems Given n x d matrix A and n x d’ matrix B, we want estimators for • The matrix product AT B • The matrix X* minimizing ||AX-B|| • A slightly generalized version of linear regression • Given integer k, the matrix Ak of rank k minimizing ||A-Ak|| • We consider the Frobenius matrix norm: square root of sum of squares

  3. General Properties of Our Algorithms • 1 pass over matrix entries, given in any order (allow multiple updates) • Maintain compressed versions or “sketches” of matrices • Do small work per entry to maintain the sketches • Output result using the sketches • Randomized approximation algorithms • Since we minimize space complexity, we restrict matrix entries to be O(log nc) bits, or O(log nd) bits

  4. Matrix Compression Methods • In a line of similar efforts… • Element-wise sampling [AM01], [AHK06] • Row / column sampling: pick small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08] • Sketching / Random Projection: maintain a small number of random linear combinations of rows or columns [S06] • Usually more than 1 pass • Here: sketching

  5. Outline • Matrix Product • Linear Regression • Low-rank approximation

  6. An Optimal Matrix Product Algorithm • A and B have n rows, and a total of c columns, and we want to estimate ATB, so that ||Est-ATB|| ≤ ε||A||·||B|| • Let S be an n x m sign (Rademacher) matrix • Each entry is +1 or -1 with probability ½ • m small, set to O(ε-2 log(1/δ)) • Entries are O(log 1/δ)-wise independent • Observation: • E[ATSSTB/m] = AT E[SST] B/m = ATB • Wouldn’t it be nice if all the algorithm did was maintain STA and STB, and output ATSSTB/m?
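A minimal numpy sketch of this estimator, under simplifying assumptions (fully independent ±1 entries instead of the O(log 1/δ)-wise independent construction, and an arbitrary constant inside m):

import numpy as np

rng = np.random.default_rng(0)

n, d, d2 = 10_000, 20, 30
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

eps = 0.1
m = int(np.ceil(4 / eps**2))            # m = O(eps^-2 log(1/delta)); constant chosen for illustration

# n x m sign (Rademacher) matrix (fully independent entries, for simplicity)
S = rng.choice([-1.0, 1.0], size=(n, m))

# The sketches the streaming algorithm would store: S^T A and S^T B (m x d and m x d2)
StA = S.T @ A
StB = S.T @ B

est = StA.T @ StB / m                   # estimator A^T S S^T B / m

err = np.linalg.norm(est - A.T @ B)     # Frobenius norm of the error
bound = eps * np.linalg.norm(A) * np.linalg.norm(B)
print(err, bound, err <= bound)         # err <= bound with good probability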

  7. An Optimal Matrix Product Algorithm • This does work, and we are able to improve the previous dependence on m: • New Tail Estimate: for δ, ε > 0, there is m = O(ε-2 log(1/δ)) so that Pr[||ATSSTB/m-ATB|| > ε ||A|| ||B||] ≤ δ (again ||C|| = [Σi, j Ci, j2]1/2) • Follows from bounding O(log 1/δ)-th moment of ||ATSSTB/m-ATB||

  8. Efficiency • Easy to maintain sketches given updates • O(m) time/update, O(mc log(nc)) bits of space for STA and STB • Improves Sarlos’ algorithm by a log c factor. • Sarlos’ algorithm based on JL Lemma • JL preserves all entries of ATB up to an additive error, whereas we only preserve overall error • Can compute [ATS][STB]/m via fast rectangular matrix multiplication
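To make the per-update work concrete, here is a hypothetical ProductSketch class (my illustration, not code from the talk): it regenerates row i of S from a seeded generator instead of storing S or using the limited-independence construction, and applies one (i, j, delta) update in O(m) time.

import numpy as np

def sign_row(i, m, seed=0):
    # Pseudo-random +/-1 row S[i, :], regenerated on demand so S itself is never stored
    # (the talk uses O(log 1/delta)-wise independent entries; a seeded generator stands in here).
    return np.random.default_rng([seed, i]).choice([-1.0, 1.0], size=m)

class ProductSketch:
    """Maintains S^T A for an n x d matrix A given entry-wise updates (i, j, delta)."""
    def __init__(self, d, m, seed=0):
        self.StA = np.zeros((m, d))
        self.m, self.seed = m, seed

    def update(self, i, j, delta):
        # Adding delta to A[i, j] adds delta * S[i, :] to column j of S^T A: O(m) work per update.
        self.StA[:, j] += delta * sign_row(i, self.m, self.seed)

# Usage: feed updates to A and B through sketches sharing the same S (same seed),
# then estimate A^T B from the two small sketches.
m, d, dB = 400, 5, 3
skA, skB = ProductSketch(d, m), ProductSketch(dB, m)
skA.update(0, 2, 1.5)       # example update: A[0, 2] += 1.5
skB.update(0, 1, -2.0)      # example update: B[0, 1] -= 2.0
est = skA.StA.T @ skB.StA / m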

  9. Matrix Product Lower Bound • Our algorithm is space-optimal for constant δ • a new lower bound • Reduction from a communication game • Augmented Indexing, players Alice and Bob • Alice has random x ∈ {0,1}s • Bob has random i ∈ {1, 2, …, s} • also xi+1, …, xs • Alice sends Bob one message • Bob should output xi with probability at least 2/3 • Theorem [MNSW]: Message must be Ω(s) bits on average

  10. Lower Bound Proof • Set s := Θ(cε-2 log(cn)) • Alice makes matrix U • Uses x1…xs • Bob makes matrix U’ and B • Uses i and xi+1, …, xs • Alg input will be A:=U+U’ and B • A and B are n x c/2 • Alice: • Runs streaming matrix product Alg on U • Sends Alg state to Bob • Bob continues Alg with A := U + U’ and B • ATB determines xi with probability at least 2/3 • By choice of U, U’, B • Solving Augmented Indexing • So space of Alg must be Ω(s) = Ω(cε-2 log(cn)) bits

  11. Lower Bound Details • U = U(1); U(2); …, U(log(cn)); 0s • Each U(k) is a Θ(ε-2) x c/2 submatrix with entries in {-10k, 10k} • U(k)i, j = 10k if matched entry of x is 0, else U(k)i, j = -10k • Bob’s index i corresponds to U(k*)i*, j* • U’ is such that A = U+U’ = U(1); U(2); …, U(k*); 0s • U’ is determined from xi+1, …, xs • ATB is the i*-th row of U(k*) • ||A|| ≈ ||U(k*)|| since the entries of A are geometrically increasing • ε2||A||2 · ||B||2, the squared error, is small, so most entries of the approximation to ATB have the correct sign

  12. Outline • Matrix Product • Linear Regression • Low-rank approximation

  13. Linear Regression • The problem: minX ||AX-B|| • The minimizing X* is X* = A-B, where A- is the pseudo-inverse of A • Every matrix has a singular value decomposition (SVD) A = UΣVT • If A is n x d of rank k, then • U is n x k with orthonormal columns • Σ is a k x k diagonal matrix with positive diagonal • V is d x k with orthonormal columns • A- = VΣ-1UT • Normal Equations: ATA X = ATB for the optimal X
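A small numpy check of these facts (matrices and sizes are arbitrary): the pseudo-inverse built from the thin SVD solves the regression and satisfies the normal equations.

import numpy as np

rng = np.random.default_rng(1)
n, d, d2 = 500, 8, 3
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d2))

# Thin SVD: A = U diag(s) V^T with U n x k, V d x k (here k = d since A has full column rank)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T      # A^- = V Sigma^-1 U^T
X_star = A_pinv @ B                          # X* minimizing ||AX - B||

print(np.allclose(A.T @ A @ X_star, A.T @ B))                      # normal equations hold
print(np.allclose(X_star, np.linalg.lstsq(A, B, rcond=None)[0]))    # agrees with lstsq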

  14. Linear Regression • Let S be an n x m sign matrix, m = O(dε-1 log(1/δ)) • The algorithm is • Maintain STA and STB • Return X’ solving minX ||ST(AX-B)|| • Space is O(d2ε-1 log(1/δ)) words • Improves Sarlos’ space by a log c factor • Space is optimal via new lower bound • Main claim: With probability at least 1- δ, ||AX’-B|| ≤ (1+ε)||AX*-B|| • That is, the relative error for X’ is small
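A minimal numpy sketch of this algorithm, under the same simplifications as before (fully independent sign entries, illustrative constant in m):

import numpy as np

rng = np.random.default_rng(2)
n, d = 5_000, 10
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, 1))

eps = 0.1
m = int(np.ceil(10 * d / eps))                        # m = O(d eps^-1 log(1/delta)); constant illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))

StA, StB = S.T @ A, S.T @ B                           # the only state the streaming algorithm keeps
X_sketch = np.linalg.lstsq(StA, StB, rcond=None)[0]   # X' minimizing ||S^T(AX - B)||
X_exact  = np.linalg.lstsq(A,   B,   rcond=None)[0]   # X* minimizing ||AX - B||

cost = lambda X: np.linalg.norm(A @ X - B)
print(cost(X_sketch) <= (1 + eps) * cost(X_exact))    # holds with high probability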

  15. Regression Analysis • Why should X’ solving minX ||ST(AX-B)|| be good? • ST approximately preserves AX-B for fixed X • If this worked for all X, we’re done • ST must preserve norms even for X’, chosen using S • First reduce to showing that ||A(X*-X’)|| is small • Use normal equation ATAX* = ATB • Implies ||AX’-B||2 = ||AX*-B||2 + ||A(X’-X*)||2 • Bounding ||A(X’-X*)||2 equivalent to bounding ||UTA(X’-X*)||2, where A = UΣVT, from SVD, and U is an orthonormal basis of the columnspace of A

  16. Regression Analysis Continued • Bounding ||β||2 := ||UTA(X’-X*)||2 • ||β|| ≤ ||UTSSTUβ/m|| + ||UTSSTUβ/m-β|| • Normal equations in sketch space imply (STA)T(STA)X’ = (STA)T(STB) • UTSSTUβ = UTSSTA(X’-X*) = UTSSTA(X’-X*) + UTSST(B-AX’) = UTSST(B-AX*) • ||UTSSTUβ/m|| = ||UTSST(B-AX*)/m|| ≤ (ε/k)1/2 ||U||·||B-AX*|| (new tail estimate) = ε1/2 ||B-AX*||

  17. Regression Analysis Continued • Hence, with ||β||2 := ||UTA(X’-X*)||2 • ||β|| ≤ ||UTSSTUβ/m|| + ||UTSSTUβ/m-β|| ≤ ε1/2 ||B-AX*|| + ||UTSSTUβ/m-β|| • Recall the spectral norm: ||A||2 = supx ||Ax||/||x|| • Implies ||CD|| ≤ ||C||2 ||D|| • ||UTSSTUβ/m-β|| ≤ ||UTSSTU/m-I||2 ||β|| • Subspace JL: for m = Ω(k log(1/δ)), ST approximately preserves lengths of all vectors in a k-space • ||UTSSTUβ/m-β|| ≤ ||β||/2 • ||β|| ≤ 2ε1/2||AX*-B|| • ||AX’-B||2 = ||AX*-B||2 + ||β||2 ≤ ||AX*-B||2 + 4ε||AX*-B||2

  18. Regression Lower Bound • Tight Ω(d2 log(nd) ε-1) space lower bound • Again a reduction from augmented indexing • This time more complicated • Embed log(nd) ε-1 independent regression sub-problems into hard instance • Uses deletions and geometrically growing property, as in matrix product lower bound • Choose the entries of A and b so that the algorithm’s output x encodes some entries of A

  19. Regression Lower Bound • Lower bound of Ω(d2) already tricky because of bit complexity • Natural approach: • Alice has a random d x d sign matrix A-1 • b is a standard basis vector ei • Alice computes A = (A-1)-1 and puts it into the stream. The solution x to minx ||Ax-b|| is the i-th column of A-1 • Bob can isolate entries of A-1, solving indexing • Wrong: A has entries that can be exponentially small!

  20. Regression Lower Bound • We design A and b together (Aug. Index). For d = 5, the system Ax = b is

      [ 1  A1,2  A1,3  A1,4  A1,5 ]   [ x1 ]   [  0   ]
      [ 0   1    A2,3  A2,4  A2,5 ]   [ x2 ]   [ A2,4 ]
      [ 0   0     1    A3,4  A3,5 ] * [ x3 ] = [ A3,4 ]
      [ 0   0     0     1    A4,5 ]   [ x4 ]   [  1   ]
      [ 0   0     0     0     1   ]   [ x5 ]   [  0   ]

  Back-substitution gives x5 = 0, x4 = 1, x3 = 0, x2 = 0, x1 = -A1,4, so the output encodes the entry A1,4.
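A quick numeric check of this 5 x 5 example, with arbitrary ±1 placeholders standing in for the Ai,j (my simplification, not the construction from the talk):

import numpy as np

rng = np.random.default_rng(3)
Aij = {(i, j): float(rng.choice([-1.0, 1.0])) for i in range(5) for j in range(i + 1, 5)}

# Upper-triangular A with unit diagonal; entries above the diagonal encode Alice's bits
A = np.eye(5)
for (i, j), v in Aij.items():
    A[i, j] = v

# b is chosen (using Bob's known suffix of x) so the solution exposes A[0, 3], i.e. A_{1,4}
b = np.array([0.0, Aij[(1, 3)], Aij[(2, 3)], 1.0, 0.0])

x = np.linalg.solve(A, b)
print(x)                                   # [-A_{1,4}, 0, 0, 1, 0]
print(np.isclose(x[0], -Aij[(0, 3)]))      # True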

  21. Outline • Matrix Product • Regression • Low-rank approximation

  22. Best Low-Rank Approximation • For any matrix A and integer k, there is a matrix Ak of rank k that is closest to A among all matrices of rank k. • Since the rank of Ak is k, it is the product CDT of two k-column matrices C and D • Ak can be found from the SVD (singular value decomposition) A = UΣVT: take C = Uk and D = VkΣk, built from the top k singular values and singular vectors • This is a good compression of A • LSI, PCA, recommendation systems, clustering
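For reference, a few lines of numpy computing Ak from the SVD (the sizes are arbitrary):

import numpy as np

def best_rank_k(A, k):
    # A_k = U_k Sigma_k V_k^T: the closest rank-k matrix to A in Frobenius norm
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 50))
Ak = best_rank_k(A, k=5)
print(np.linalg.matrix_rank(Ak), np.linalg.norm(A - Ak))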

  23. Best Low-Rank Approximation • Previously, nothing was known for 1-pass low-rank approximation and relative error • Even for k = 1, best upper bound O(nd log (nd)) bits • Problem 28 of [Mut]: can one get sublinear space? • We get 1-pass and O(kε-2(n+dε-2)log(nd)) space • Update time is O(kε-4), so total work is O(Nkε-4), where N is the number of non-zero entries of A • New space lower bound shows optimal up to 1/ε

  24. Best Low-Rank Approximation and STA • The sketch STA holds information about A • In particular, there is a rank k matrix Ak’ in the rowspace of STA nearly as close to A as the closest rank k matrix Ak • The rowspace of STA is the set of linear combinations of its rows • That is, ||A-Ak’|| ≤ (1+ε)||A-Ak|| • Why is there such an Ak’?

  25. Low-Rank Approximation via Regression • Apply the regression results with A → Ak, B → A • The X’ minimizing ||ST(AkX-A)|| has ||AkX’-A|| ≤ (1+ε)||AkX*-A|| • But here X* = I, and X’ = (STAk)-STA • So the matrix AkX’ = Ak(STAk)-STA: • Has rank k • Is in the rowspace of STA • Is within (1+ε) of the smallest distance of any rank-k matrix to A

  26. Low-Rank Approximation in 2 Passes • Can’t use Ak(STAk)- STA without finding Ak • Instead: maintain STA • Can show that if GT has orthonormal rows, then the best rank-k approximation to A in the rowspace of GT is A’kGT, where A’k is the best rank-k approximation to AG • After 1st pass, compute orthonormal basis GT for rowspace of STA • In 2nd pass, maintain AG • Afterwards, compute A’k and A’kGT
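A compact numpy rendering of this 2-pass scheme (sketch sizes are illustrative; in the streaming setting STA and AG would be maintained incrementally rather than by the dense products used here):

import numpy as np

def two_pass_low_rank(A, k, m, rng):
    n, d = A.shape
    S = rng.choice([-1.0, 1.0], size=(n, m))
    StA = S.T @ A                                   # pass 1: sketch S^T A
    G, _ = np.linalg.qr(StA.T)                      # d x m; G^T has orthonormal rows spanning rowspace(S^T A)
    AG = A @ G                                      # pass 2: project A onto that rowspace
    U, s, Vt = np.linalg.svd(AG, full_matrices=False)
    A_k_prime = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # A'_k: best rank-k approximation to AG
    return A_k_prime @ G.T                          # output A'_k G^T

rng = np.random.default_rng(5)
A = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 500))   # a (near) rank-40 matrix
A += 0.01 * rng.standard_normal(A.shape)
approx = two_pass_low_rank(A, k=10, m=100, rng=rng)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
best = U[:, :10] @ np.diag(s[:10]) @ Vt[:10, :]                        # true A_k from the SVD
print(np.linalg.norm(A - approx) / np.linalg.norm(A - best))           # close to 1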

  27. Low-Rank Approximation in 2 Passes • A’k is the best rank-k approximation to AG • For any rank-k matrix Z, • ||AGGT – A’kGT|| = ||AG-A’k|| ≤ ||AG-Z|| ≤ ||AGGT – ZGT|| • For all Y: (AGGT-YGT) · (A-AGGT)T = 0, so we can apply the Pythagorean Theorem twice: • ||A – A’kGT||2 = ||A-AGGT||2 + ||AGGT – A’kGT||2 ≤ ||A-AGGT||2 + ||AGGT – ZGT||2 = ||A – ZGT||2

  28. 1 Pass Algorithm • With high probability, (*) ||AX’-B|| ≤ (1+ε)||AX*-B||, where • X* minimizes ||AX-B|| • X’ minimizes ||STAX-STB|| • Apply (*) with A → AR and B → A and X’ minimizing ||ST(ARX-A)|| • So X’ = (STAR)-STA has ||ARX’-A|| ≤ (1+ε) minX ||ARX-A|| • Columnspace of AR contains a (1+ε)-approximation to Ak • So, ||ARX’-A|| ≤ (1+ε) minX ||ARX-A|| ≤ (1+ε)2 ||A-Ak|| • Key idea: ARX’ = AR(STAR)-STA is • easy to compute in 1 pass with small space and fast update time • behaves like A (similar to SVD) • use it instead of A in our 2-pass algorithm!

  29. 1 Pass Algorithm • Algorithm: • Maintain AR and STA • Compute AR(STAR)-STA • Let GT be an orthonormal basis for the rowspace of STA, as before • Output the best rank-k approximation to AR(STAR)-STA in the rowspace of STA • Same as 2-pass algorithm except we don’t need a second pass to project A onto the rowspace of STA • Analysis is similar to that for regression
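The same steps in numpy, as a sketch (dense matrices and illustrative sketch sizes; a real streaming implementation would update AR and STA entry by entry):

import numpy as np

def one_pass_low_rank(A, k, m_r, m_s, rng):
    n, d = A.shape
    R = rng.choice([-1.0, 1.0], size=(d, m_r))      # column sketch
    S = rng.choice([-1.0, 1.0], size=(n, m_s))      # row sketch
    AR, StA = A @ R, S.T @ A                        # the two sketches maintained in one pass
    proxy = AR @ np.linalg.pinv(StA @ R) @ StA      # AR (S^T A R)^- S^T A, standing in for A
    G, _ = np.linalg.qr(StA.T)                      # orthonormal basis G^T of rowspace(S^T A)
    PG = proxy @ G                                  # proxy expressed in that rowspace
    U, s, Vt = np.linalg.svd(PG, full_matrices=False)
    return (U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]) @ G.T   # rank-k output in rowspace(S^T A)

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 500))
A += 0.01 * rng.standard_normal(A.shape)
approx = one_pass_low_rank(A, k=10, m_r=60, m_s=120, rng=rng)
print(np.linalg.norm(A - approx))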

  30. A Lower Bound • Reduction from Augmented Indexing again: Alice has a binary string x; Bob has an index i and xi+1, xi+2, … • Alice encodes x into a matrix A made of blocks of k rows and k ε-1 columns each, with the remaining n-k rows all 0s • Entries in successive blocks grow geometrically: ±10, ±100, ±1000, ±10000, … • Error is now dominated by the block of interest • Bob also inserts a k x k identity submatrix into the block of interest

  31. Lower Bound Details • Block of interest: k rows and k ε-1 columns, containing P·Ik, with 0s elsewhere and 0s in the remaining n-k rows • Bob inserts the k x k identity submatrix, scaled by a large value P • Show any rank-k approximation must err on all of the shaded region • So a good rank-k approximation likely has the correct sign on Bob’s entry

  32. Concluding Remarks • Space bounds are tight for product, regression • Sharpen prior upper bounds • Prove optimal lower bounds • Space bounds are off by a factor of ε-1 for low-rank approximation • First sub-linear (and near-optimal) 1-pass algorithm • We have better upper bounds for restricted cases • Improve the dependence on ε in the update time • Lower bounds for multi-pass algorithms?
