
Iterative Row Sampling



  1. Iterative Row Sampling Richard Peng CMU → MIT Joint work with Mu Li (CMU) and Gary Miller (CMU)

  2. Outline • Matrix Sketches • Existence • Samples → better samples • Iterative algorithms

  3. Data • n-by-d matrix A, m entries • Columns: data • Rows: attributes • Goal: classification/clustering, identify patterns, interpret new data

  4. Linear Model • Can add/scale data points • x: coefficients; combination: Ax = x1A:,1 + x2A:,2 + x3A:,3

  5. Problem • Interpret a new data point b as a combination of known ones: Ax ≈ b?

  6. Regression • Express as combination of current examples • Regression: minx ║Ax – b║p • p = 2: least squares • p = 1: compressive sensing • ║x║2: Euclidean norm of x • ║x║1: sum of absolute values
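
A minimal numpy sketch of the p = 2 case (least squares); the matrix sizes, noise level, and random seed are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
# Columns are known data points (d = 20), rows are attributes (n = 1000).
A = rng.standard_normal((1000, 20))
x_true = rng.standard_normal(20)
b = A @ x_true + 0.01 * rng.standard_normal(1000)  # new point: noisy combination of columns

# min_x ||Ax - b||_2: np.linalg.lstsq solves the least-squares problem directly.
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_hat - x_true))  # small: the coefficients are recovered
```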

  7. Variants of Compressive Sensing • minx ║Ax – b║1 + ║x║1 • minx ║Ax – b║2 + ║x║1 • minx ║x║1 s.t. Ax = b • minx ║Ax║1 s.t. Bx = y • minx ║Ax – b║1 + ║Bx – y║1 • All similar to minx ║Ax – b║1

  8. Simplified • minx ║Ax – b║p = minx ║[A, b][x; -1]║p • Regression equivalent to min ║Ax║p with one entry of x fixed

  9. ‘Big’ Data Points • Each data point has many attributes • #rows (n) >> #columns (d) • Examples: genetic data, time series (videos) • Reverse (d >> n) also common: images + SIFT

  10. Faster? • Replace A by a smaller, equivalent A': a matrix sketch

  11. Row Sampling • Pick some rows of A to be A' • How to pick? Random?

  12. Shorter Equivalent • Find a shorter A' that preserves the answer • ║Ax║p ≈1+ε ║A'x║p for all x • Run algorithm on A'; the answer is also good for A • Simplified error notation ≈: a ≈k b if there exist k1, k2 s.t. k2/k1 ≤ k and k1a ≤ b ≤ k2a

  13. Outline • Matrix Sketches • Existence (how?) • Samples → better samples • Iterative algorithms

  14. Sketches Exist • ║Ax║p ≈ ║A'x║p for all x • Linear sketches: A' = SA • [Drineas et al. `12]: row sampling, one non-zero in each row of S • [Clarkson-Woodruff `12]: S = CountSketch, one non-zero per column
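
A rough numpy sketch of the CountSketch construction mentioned above: S has one random ±1 non-zero per column, so each row of A is hashed into one of n_out buckets with a random sign. The function name countsketch, the bucket count, and the matrix sizes are illustrative assumptions:

```python
import numpy as np

def countsketch(A, n_out, rng):
    """Return A' = SA where S has exactly one random +/-1 entry per column:
    row i of A is added, with a random sign, into bucket h(i) of A'."""
    n, d = A.shape
    buckets = rng.integers(0, n_out, size=n)       # hash h: row index -> bucket
    signs = rng.choice([-1.0, 1.0], size=n)        # random sign s(i)
    A_sk = np.zeros((n_out, d))
    np.add.at(A_sk, buckets, signs[:, None] * A)   # A'[h(i)] += s(i) * A[i]
    return A_sk

rng = np.random.default_rng(0)
A = rng.standard_normal((100000, 10))
A_sk = countsketch(A, 2000, rng)
# The covariance matrices agree up to the sketching error:
print(np.linalg.norm(A_sk.T @ A_sk - A.T @ A) / np.linalg.norm(A.T @ A))
```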

  15. Sketches Exist • Hidden: runtime costs, ε-2 dependency

  16. WHY is ≈ d possible? • ║Ax║p ≈ ║A'x║p for all x • ║Ax║22 = xTATAx • ATA: d-by-d matrix • Any factorization (e.g. QR) of ATA suffices as A'
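
A small numpy illustration of this point: the R factor from a QR of A satisfies RTR = ATA, so it is such a factorization and preserves ║Ax║2 exactly (though computing it is as expensive as computing ATA itself); the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 8))

# R from a thin QR of A is d-by-d with R^T R = A^T A,
# so ||Ax||_2 = ||Rx||_2 for every x: an exact (but expensive) sketch with d rows.
R = np.linalg.qr(A, mode='r')
x = rng.standard_normal(8)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # equal up to roundoff
```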

  17. ATA • Covariance matrix • Dot product of all pairs of columns (data) • Covariance: cov(j1, j2) = Σi Ai,j1 Ai,j2

  18. Use of Covariance Matrix • C = ATA • Clustering: l2 distances of all pairs given by C • Kernel methods: all-pair dot products suffice for many models
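
For example, all pairwise squared distances between columns (data points) can be read off C alone, since ║A:,i – A:,j║22 = Ci,i + Cj,j – 2Ci,j. A small numpy check with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 50))   # 50 data points (columns), 10000 attributes (rows)
C = A.T @ A                            # covariance / Gram matrix of the columns

# Pairwise squared l2 distances from C alone:
diag = np.diag(C)
dist2 = diag[:, None] + diag[None, :] - 2 * C
print(np.allclose(dist2[3, 7],
                  np.linalg.norm(A[:, 3] - A[:, 7]) ** 2))  # True
```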

  19. Other Use of Covariance • Covariance of attributes used to tune parameters • Images + SIFT: many data points, few attributes • http://www.image-net.org/: 14,197,122 images, 1000 SIFT features

  20. How Expensive is this? • d2 dot products of length-n vectors • Total: O(nd2) • Faster: O(ndω-1) • Expensive: nd2 > nd > m

  21. Equivalent View of Sketches • Approximate covariance matrix: C' = (A')TA' • ║Ax║2 ≈ ║A'x║2 is the same as C ≈ C'

  22. Application of Sketches • A': n' rows • d2 dot products of length-n' vectors • Total cost: O(n'dω-1)

  23. Sketches in Input-Sparsity Time • Need: cost of computing C' < cost of computing C = ATA • Two goals: n' small, A' found efficiently

  24. Cost and Quality of A’

  25. Outline • Matrix Sketches • Existence (how?) • Samples → better samples • Iterative algorithms

  26. Previous Approaches • Go to poly(d) rows directly (“a miracle happens”) • Projection to obtain key info, or the sketch itself

  27. Our Main Approach • Utilize the robustness of sketches, covariance matrices, and sampling • Iteratively reduce errors and sizes

  28. Better Algorithm for p=2

  29. Composing Sketches • A (n rows) → sketch with n' = d1+α rows, in O(m) time → sketch with d log d rows, in O(n' d log d + dω) time • Total cost: O(m + n' d log d + dω) = O(m + dω)

  30. Accumulation of Errors • ║Ax║2 ≈k ║A''x║2 and ║A''x║2 ≈k' ║A'x║2 → ║Ax║2 ≈kk' ║A'x║2

  31. Accumulation of Errors • ║Ax║2 ≈kk' ║A'x║2 • Final error: product of both errors • Dependency of error in cost: usually ε-2 or more for 1 ± ε error • [Avron & Toledo `11]: only the final step needs to be accurate • Idea: compute sketches indirectly

  32. Row Sampling • Pick some rows of A to be A' • How to pick? Random?

  33. Are All Rows Equal? • No: consider a column with only one non-zero entry, or a single non-zero row • ║A[1; 0; …; 0]║p ≠ 0 comes entirely from that one row, so any sketch must keep it

  34. Row Sampling • τ': weights on rows → distribution • Pick a number of rows independently from this distribution, rescale to form A'

  35. Matrix Chernoff Bounds • Sufficient property of τ' • τ: statistical leverage scores • If τ' ≥ τ, ║τ'║1 log d (scaled) rows suffice for A' ≈ A
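
A minimal numpy sketch of this sampling step: rows are drawn i.i.d. from the distribution induced by τ' and rescaled. The helper name sample_rows, the sample count, and the use of exact leverage scores (computed naively here; see the next slide) are illustrative assumptions:

```python
import numpy as np

def sample_rows(A, tau, n_samples, rng):
    """Sample rows i.i.d. with probability p_i proportional to tau_i and
    rescale by 1/sqrt(n_samples * p_i) so that E[A'^T A'] = A^T A."""
    p = tau / tau.sum()
    idx = rng.choice(len(tau), size=n_samples, p=p)
    scale = 1.0 / np.sqrt(n_samples * p[idx])
    return scale[:, None] * A[idx]

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 10))
# Exact leverage scores tau_i = A_i (A^T A)^{-1} A_i^T, for demonstration only.
tau = np.sum(A * np.linalg.solve(A.T @ A, A.T).T, axis=1)
A_samp = sample_rows(A, tau, 2000, rng)
print(np.linalg.norm(A_samp.T @ A_samp - A.T @ A) / np.linalg.norm(A.T @ A))
```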

  36. Statistical Leverage Scores • Studied in statistics since the ’70s • Importance of rows • Leverage score of row i, Ai: τi = Ai(ATA)-1AiT • Key fact: ║τ║1 = rank ≤ d → ║τ'║1 log d = d log d rows

  37. Computing Leverage Scores • τi = Ai(ATA)-1AiT = AiC-1AiT • ATA: covariance matrix C • Given C-1, can compute each τi in O(d2) time • Total cost: O(nd2 + dω)
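
A direct numpy version of this computation (sizes are illustrative); it also checks the key fact ║τ║1 = rank ≤ d from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 12))

# tau_i = A_i C^{-1} A_i^T with C = A^T A; one O(d^2) dot product per row.
C = A.T @ A
tau = np.einsum('ij,ij->i', A @ np.linalg.inv(C), A)

print(tau.sum(), np.linalg.matrix_rank(A))  # ||tau||_1 = rank(A) <= d (here 12)
```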

  38. Computing Leverage Scores • τi = AiC-1AiT = ║AiC-1/2║22 • 2-norm of the vector AiC-1/2 • C-1/2 puts rows in isotropic position, decorrelates columns

  39. Aside: What is Leverage? • Geometric view: rows define ‘energy’ directions • Normalize so total energy is uniform • τi: norm of row i after normalizing

  40. Aside: What is Leverage? • How to interpret statistical leverage scores? • Statistics ([Hoaglin-Welsh `78], [Chatterjee-Hadi `86]): influence on data set, likelihood of outlier, uniqueness of row

  41. Aside: What is Leverage? • High leverage score: • Key attribute? • Outlier (measurement error)?

  42. Aside: What is Leverage? • My current view (motivated by graph sparsification): • Sampling probabilities • Use them to find sketches

  43. Computing Leverage Scores • τi = ║AiC-1/2║22 • Only need τ' ≥ τ • Can use approximations after scaling them up • Error leads to larger ║τ'║1

  44. Dimensionality Reduction • Johnson-Lindenstrauss transform: x → xG with G a d-by-O(1/α) Gaussian • ║x║22 ≈jl ║xG║22 • Error jl = dα

  45. Estimating Leverage Scores • τi = ║AiC-1/2║22 ≈jl ║AiC-1/2G║22 • G: d-by-O(1/α) Gaussian • C-1/2G: d-by-O(1/α) • Cost: O(α ∙ nnz(Ai)) per row; total: O(α ∙ m + α ∙ d2 log d)
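
A rough numpy sketch of this estimate, forming C-1/2 via an eigendecomposition and comparing the sketched scores against exact ones; the sketch width k (standing in for O(1/α)) and matrix sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50000, 20))
n, d = A.shape
k = 8                                   # JL sketch width, roughly O(1/alpha)

C = A.T @ A
w, V = np.linalg.eigh(C)
C_inv_half = V @ np.diag(w ** -0.5) @ V.T          # C^{-1/2}

G = rng.standard_normal((d, k)) / np.sqrt(k)       # scaled d-by-k Gaussian
sketch = A @ (C_inv_half @ G)                      # n-by-k; O(k * nnz(A)) once C^{-1/2}G is formed
tau_est = np.sum(sketch ** 2, axis=1)              # tau_i ~ ||A_i C^{-1/2} G||_2^2

tau_exact = np.einsum('ij,ij->i', A @ np.linalg.inv(C), A)
print(np.median(tau_est / tau_exact))              # concentrates around 1; scale up to get tau' >= tau
```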

  46. Estimating Leverage Scores • τi = ║AiC-1/2║22 ≈ ║AiC'-1/2║22 • C ≈k C' gives ║C-1/2x║2 ≈k ║C'-1/2x║2 • Using C' as a preconditioner for C • Can also combine with JL

  47. Estimating Leverage Scores • τi' = ║AiC'-1/2G║22 ≈jl ║AiC'-1/2║22 ≈k τi → τi' ≈jl∙k τi • (jl ∙ k) ∙ τ' ≥ τ • Total number of rows: ║jl ∙ k ∙ τ'║1 ≤ jl ∙ k ∙ ║τ'║1 ≤ k d1+α

  48. Estimating Leverage Scores • (jl ∙ k) ∙ τ' ≥ τ • ║jl ∙ k ∙ τ'║1 ≤ jl ∙ k ∙ d1+α • Quality of A' does not depend on quality of τ' • C ≈k C' gives A' ≈2 A with O(kd1+α) rows in O(m + dω) time • Some fixable issues when n >>> d

  49. Size Reduction • A'' → C'' → τ' → A' • A'' ≈O(1) A • C'' ≈O(1) C • τ' ≈O(1) τ • A' ≈O(1) A, O(d1+α log d) rows

  50. High Error Setting • A'' → C'' → τ' → A' • A'' ≈k A • C'' ≈k C • τ' ≈k τ • A' ≈O(1) A, O(kd1+α log d) rows
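
To tie slides 49-50 together, here is a simplified, illustrative numpy sketch of one refinement round (not the paper's exact algorithm): a crude sketch A'' yields C'', which yields approximate leverage scores τ' for A, which yield a better sample A'. The crude sketch below is a uniform row sample, adequate for this synthetic Gaussian A but not in general; the sizes and the helper name refine are assumptions:

```python
import numpy as np

def refine(A, A_crude, n_samples, rng):
    """One illustrative size/error-reduction round: estimate leverage scores of A
    from a crude sketch A'' (via C'' = A''^T A''), then sample and rescale rows of A."""
    C_crude = A_crude.T @ A_crude                                 # C'' ~ C
    tau = np.einsum('ij,ij->i', A @ np.linalg.inv(C_crude), A)    # tau' ~ tau
    p = tau / tau.sum()
    idx = rng.choice(A.shape[0], size=n_samples, p=p)
    return A[idx] / np.sqrt(n_samples * p[idx])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((100000, 15))
# Crude sketch: uniform sample of rows, rescaled (valid here because this A is incoherent).
A_crude = A[rng.choice(100000, size=5000)] * np.sqrt(100000 / 5000)
A_better = refine(A, A_crude, 1500, rng)
print(np.linalg.norm(A_better.T @ A_better - A.T @ A) / np.linalg.norm(A.T @ A))
```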
