
Sketching for M-Estimators: A Unified Approach to Robust Regression


Presentation Transcript


  1. Sketching for M-Estimators: A Unified Approach to Robust Regression. Kenneth Clarkson and David Woodruff, IBM Almaden

  2. Regression: Linear Regression • A statistical method to study linear dependencies between variables in the presence of noise • Example: Ohm's law, V = R ∙ I • Find the linear function that best fits the data

  3. Regression: Standard Setting • One measured variable b • A set of predictor variables a_1, ..., a_d • Assumption: b = x_0 + a_1 x_1 + ... + a_d x_d + e, where e is assumed to be noise and the x_i are model parameters we want to learn • Can assume x_0 = 0 • Now consider n observations of b

  4. Regression: Matrix Form • Input: an n x d matrix A and a vector b = (b_1, ..., b_n), where n is the number of observations and d is the number of predictor variables • Output: x* so that Ax* and b are close • Consider the over-constrained case, when n ≫ d

  5. Fitness Measures • Least Squares Method: Find x* that minimizes |Ax-b|_2^2 • Ax* is the projection of b onto the column span of A • Certain desirable statistical properties • Closed form solution: x* = (A^T A)^{-1} A^T b • Method of least absolute deviation (l_1-regression): Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - <A_{i*}, x>| • Cost is less sensitive to outliers than least squares • Can solve via linear programming • What about the many other fitness measures used in practice?
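To make slide 5 concrete, here is a minimal Python sketch (not from the talk): the closed-form least-squares solution and l_1 regression as a linear program. The synthetic data and variable names are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Least squares: x* = (A^T A)^{-1} A^T b (np.linalg.lstsq is the
# numerically preferred route in practice).
x_ls = np.linalg.solve(A.T @ A, A.T @ b)

# l_1 regression as an LP: minimize sum(t) subject to -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
G = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
h = np.concatenate([b, -b])
res = linprog(c, A_ub=G, b_ub=h, bounds=[(None, None)] * (d + n))
x_l1 = res.x[:d]
print(x_ls, x_l1)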

  6. M-Estimators • Measure function G: R -> R_{≥0} • G(x) = G(-x), G(0) = 0 • G is non-decreasing in |x| • |y|_M = Σ_{i=1}^n G(y_i) • Solve min_x |Ax-b|_M • Least squares and l_1-regression are special cases
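A direct transcription of the M-norm definition into Python; the helper name m_norm is just for illustration.

import numpy as np

# |y|_M = sum_i G(y_i) for a pluggable measure function G.
def m_norm(y, G):
    return float(np.sum(G(np.asarray(y))))

y = np.array([1.0, -2.0, 0.5])
print(m_norm(y, lambda t: t ** 2))  # squared l_2: 5.25
print(m_norm(y, np.abs))            # l_1: 3.5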

  7. Huber Loss Function • G(x) = x^2/(2c) for |x| ≤ c • G(x) = |x| - c/2 for |x| > c • Enjoys the smoothness properties of l_2^2 and the robustness properties of l_1

  8. Other Examples • L1-L2: G(x) = 2((1 + x^2/2)^{1/2} - 1) • Fair estimator: G(x) = c^2 [ |x|/c - log(1 + |x|/c) ] • Tukey estimator: G(x) = (c^2/6)(1 - [1 - (x/c)^2]^3) if |x| ≤ c; = c^2/6 if |x| > c
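The measure functions from slides 7 and 8, transcribed into Python; vectorizing with np.where and defaulting to c = 1.0 are assumptions of this sketch.

import numpy as np

def huber(x, c=1.0):
    # x^2/(2c) near zero, |x| - c/2 in the tails.
    return np.where(np.abs(x) <= c, x ** 2 / (2 * c), np.abs(x) - c / 2)

def l1_l2(x):
    return 2 * (np.sqrt(1 + x ** 2 / 2) - 1)

def fair(x, c=1.0):
    return c ** 2 * (np.abs(x) / c - np.log1p(np.abs(x) / c))

def tukey(x, c=1.0):
    # Constant c^2/6 beyond the clipping point c.
    return np.where(np.abs(x) <= c,
                    c ** 2 / 6 * (1 - (1 - (x / c) ** 2) ** 3),
                    c ** 2 / 6)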

  9. Nice M-Estimators • An M-estimator is nice if it has at least linear growth and at most quadratic growth • There is C_G > 0 so that for all a, a' with |a| ≥ |a'| > 0: |a/a'|^2 ≥ G(a)/G(a') ≥ C_G |a/a'| (a numeric spot check follows below) • Any convex G satisfies the linear lower bound • Any sketchable G satisfies the quadratic upper bound • sketchable => there is a distribution on t x n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3, where t is independent of n
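An empirical spot check (not a proof) of the growth sandwich for the Huber function: that C_G = 1 suffices when c = 1 is an assumption verified only on the grid below.

import itertools
import numpy as np

def huber(x, c=1.0):
    return np.where(np.abs(x) <= c, x ** 2 / (2 * c), np.abs(x) - c / 2)

# Check |a/a'|^2 >= G(a)/G(a') >= C_G |a/a'| on a log-spaced grid.
grid = np.geomspace(1e-3, 1e3, 100)
for a, ap in itertools.product(grid, grid):
    if a >= ap:
        r = float(huber(a) / huber(ap))
        assert (a / ap) ** 2 * (1 + 1e-9) >= r >= (a / ap) * (1 - 1e-9)
print("sandwich holds on the grid with C_G = 1")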

  10. Our Results • Let nnz(A) denote the number of non-zero entries of an n x d matrix A • [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x' so that w.h.p. |Ax'-b|_H ≤ (1+ε) min_x |Ax-b|_H • [Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to output x' so that for any constant C > 1, w.h.p. |Ax'-b|_M ≤ C · min_x |Ax-b|_M • Remarks: For convex nice M-estimators one can solve with convex programming, but slowly: poly(nd) time. Our algorithm for nice M-estimators is universal.

  11. Talk Outline • Huber result • Nice M-Estimators result

  12. Naive Sampling Algorithm • Replace min_x |Ax-b|_M with x' = argmin_x |S·Ax - S·b|_M, where S uniformly samples poly(d/ε) rows • This is a terrible algorithm (see the illustration below)
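A tiny illustration (not the paper's construction) of why uniform sampling fails: when one coordinate carries all the mass, a small uniform sample almost surely misses it and reports zero.

import numpy as np

rng = np.random.default_rng(1)
n, t = 10_000, 100                 # uniformly sample t of n coordinates
y = np.zeros(n)
y[0] = 1.0                         # all l_1 mass on one coordinate
idx = rng.choice(n, size=t, replace=False)
Sy = (n / t) * y[idx]              # rescaled uniform sample of y
print(np.abs(y).sum(), np.abs(Sy).sum())  # 1.0 vs (almost surely) 0.0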

  13. Leverage Score Sampling [Dasgupta et al.] • For l_p-norms, there are probabilities q_1, ..., q_n with Σ_i q_i = poly(d/ε) so that sampling works: x' = argmin_x |S·Ax - S·b|_M approximates min_x |Ax-b|_M • For l_2, the q_i are the squared row norms in an orthonormal basis of A • For l_p, the q_i are p-th powers of the p-norms of rows in a "well-conditioned basis" • All q_i can be found in O(nnz(A) log n) + poly(d) time • S is diagonal, with S_{i,i} = 1/q_i if row i is sampled and 0 otherwise
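A minimal sketch of l_2 leverage-score sampling, computing exact scores from a thin QR factorization; the slide's O(nnz(A) log n) route uses fast approximate scores instead, and the oversampling factor 50 here is an arbitrary assumption.

import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 5
A = rng.normal(size=(n, d))

Q, _ = np.linalg.qr(A)            # orthonormal basis of A's column span
scores = (Q ** 2).sum(axis=1)     # l_2 leverage scores; they sum to d
q = np.minimum(1.0, 50 * scores)  # oversampled keep-probabilities
keep = rng.random(n) < q
SA = A[keep] / q[keep][:, None]   # S_{i,i} = 1/q_i on the sampled rows
print(keep.sum(), "of", n, "rows kept")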

  14. Huber Regression Algorithm • [Huber inequality]: For z ∈ R^n, Θ(n^{-1/2}) min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1 • Proof by case analysis • Sample from a mixture of l_1-leverage scores and l_2-leverage scores: p_i = n^{1/2} · (q_i^{(1)} + q_i^{(2)}) • Our O(nnz(A) log n) + poly(d/ε) algorithm: after one step, the number of rows is < n^{1/2} poly(d/ε) • Recursively solve a weighted Huber problem; the weights do not grow quickly • Once the size is < n^{0.01} poly(d/ε), solve by convex programming
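A sketch of the mixture probabilities p_i = n^{1/2}(q_i^{(1)} + q_i^{(2)}) above, using exact l_2 scores from QR and, as a crude stand-in assumption, row l_1-norms of the same orthonormal basis in place of scores from a true well-conditioned basis.

import numpy as np

rng = np.random.default_rng(3)
n, d = 40_000, 4
A = rng.normal(size=(n, d))

Q, _ = np.linalg.qr(A)
q2 = (Q ** 2).sum(axis=1)      # l_2 leverage scores, summing to d
q1 = np.abs(Q).sum(axis=1)
q1 *= d / q1.sum()             # proxy l_1 scores, normalized to sum d
p = np.minimum(1.0, np.sqrt(n) * (q1 + q2))
keep = rng.random(n) < p
print(keep.sum(), "rows sampled, on the order of n^(1/2) poly(d)")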

  15. Talk Outline • Huber result • Nice M-Estimators result

  16. CountSketch • For l_2 regression, CountSketch with poly(d) rows works [Clarkson, W]: • Compute S·A in nnz(A) time • Compute x' = argmin_x |SAx - Sb|_2 in poly(d) time • Example: each column of S has a single random ±1 entry:

S = [  0  0  1  0  0  1  0  0 ]
    [  1  0  0  0  0  0  0  0 ]
    [  0  0  0 -1  1  0 -1  0 ]
    [  0 -1  0  0  0  0  0  1 ]
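A minimal CountSketch sketch for l_2 regression: S is never materialized; hash and sign arrays apply it in O(nnz(A)) time. The sketch size t = 400 and the synthetic data are assumptions.

import numpy as np

rng = np.random.default_rng(4)
n, d, t = 5000, 6, 400
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

h = rng.integers(0, t, size=n)         # bucket of each input row
s = rng.choice([-1.0, 1.0], size=n)    # sign of each input row
SA = np.zeros((t, d))
Sb = np.zeros(t)
np.add.at(SA, h, s[:, None] * A)       # accumulate signed rows per bucket
np.add.at(Sb, h, s * b)

x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sketch - x_exact))  # small with good probability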

  17. M-Sketch • The same M-Sketch works for all nice M-estimators! • x' = argmin_x |TAx - Tb|_{M,w}, where T stacks the blocks S_1·R_0, S_2·R_1, S_3·R_2, ..., S_{log n}·R_{log n} • Sketch used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang] • The S_i are independent CountSketch matrices with poly(d) rows • R_i is an n x n diagonal matrix that uniformly samples a 1/b^i fraction of [n] • Note: many uses of this data structure do not work here, since they involve a median operation
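A minimal sketch of the M-Sketch layout: level i applies an independent CountSketch to a uniform 1/b^i subsample of the rows. The helper name, b = 2, and the choice of b^i (the reciprocal sampling rate) as the level weight are assumptions of this sketch.

import numpy as np

def m_sketch_apply(A, t, b=2.0, rng=None):
    # Returns (TA, w): the stacked level sketches and per-row weights.
    rng = rng or np.random.default_rng()
    n, d = A.shape
    levels = int(np.ceil(np.log(n) / np.log(b))) + 1
    blocks, weights = [], []
    for i in range(levels):
        keep = rng.random(n) < b ** (-i)     # R_i: keep a 1/b^i fraction
        h = rng.integers(0, t, size=n)       # S_{i+1}: CountSketch hashes
        s = rng.choice([-1.0, 1.0], size=n)  # ...and random signs
        block = np.zeros((t, d))
        np.add.at(block, h[keep], s[keep][:, None] * A[keep])
        blocks.append(block)
        weights.append(b ** i)
    return np.vstack(blocks), np.repeat(weights, t)

TA, w = m_sketch_apply(np.random.default_rng(5).normal(size=(5000, 3)), t=50)
print(TA.shape, w.shape)  # (levels*t, 3) and (levels*t,)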

  18. M-Sketch Intuition • Consider a fixed y = Ax - b • For the M-Sketch T, output |Ty|_{w,M} = Σ_i w_i G((Ty)_i) • [Contraction] |Ty|_{w,M} ≥ ½ |y|_M w.pr. 1 - exp(-d log d) • [Dilation] |Ty|_{w,M} ≤ 2 |y|_M w.pr. 9/10 • Contraction allows for a net argument (no scale-invariance!) • Dilation implies the optimal y* does not dilate much
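The estimate on slide 18, transcribed directly; Ty and w stand for a stacked sketch output and its level weights (e.g., as produced by the m_sketch_apply sketch above), and no claim is made about the constants.

import numpy as np

def weighted_m_norm(Ty, w, G):
    # |Ty|_{w,M} = sum_i w_i G((Ty)_i).
    return float(np.sum(w * G(Ty)))

# Example with the Huber measure, c = 1:
G = lambda x: np.where(np.abs(x) <= 1, x ** 2 / 2, np.abs(x) - 0.5)
Ty = np.array([0.5, -2.0])
w = np.array([1.0, 4.0])
print(weighted_m_norm(Ty, w, G))  # 1.0*0.125 + 4.0*1.5 = 6.125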

  19. M-Sketch Analysis • Partition the coordinates into weight classes: S_i = {j | G(y_j) ∈ (|y|_M/b^i, |y|_M/b^{i-1}]} • If |S_i| > d log d, there is a "sampling level" containing about d log d elements of S_i (this gives the exp(-d log d) failure probability) • Elements from S_j for j ≤ i do not collide • Elements from S_j for j > i cancel within a bucket (they concentrate to the 2-norm) • If |S_i| is small, all its elements are found in the top level, or else S_i is not important (relate M to l_2) • If G is close to quadratic growth, need to "clip" the top buckets: Ky-Fan norm

  20. Conclusions • Summary: [Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm; [Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm • Questions: 1. Is there a sketch-based estimator for (1+ε)-approximation? 2. (Meta-question) Apply streaming techniques to linear algebra: CountSketch -> l_2 regression; p-stable random variables -> l_p regression for p ∈ [1,2]; CountSketch + heavy hitters -> nice M-estimators; Pagh's TensorSketch -> polynomial kernel regression; ...
