
Accelerated, Parallel and PROXimal coordinate descent


Presentation Transcript


  1. Accelerated, Parallel and PROXimal coordinate descent • Peter Richtárik • A P PROX • Moscow, February 2014 • (Joint work with Olivier Fercoq, arXiv:1312.5799)

  2. Optimization Problem

  3. Problem • minimize F(x) = f(x) + ψ(x) • Loss f: convex (smooth or nonsmooth) • Regularizer ψ: convex (smooth or nonsmooth), separable, allowed to take the value +∞
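In the notation of arXiv:1312.5799, the problem on this slide is the composite minimization

    \[ \min_{x \in \mathbb{R}^N} \; F(x) = f(x) + \psi(x), \qquad \psi(x) = \sum_{i=1}^{n} \psi_i\bigl(x^{(i)}\bigr), \]

where f is the loss, the regularizer \psi is separable across the n coordinate blocks x^{(1)}, ..., x^{(n)}, and each \psi_i is convex, possibly nonsmooth, and allowed to take the value +\infty (so that constraints can be encoded).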

  4. Regularizer: examples • No regularizer • Weighted L1 norm (e.g., LASSO) • Box constraints (e.g., SVM dual) • Weighted L2 norm
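Written out (the weights and interval bounds below are generic placeholders, not taken from the slides), these regularizers are

    \[ \psi(x) = 0, \qquad \psi(x) = \sum_i \lambda_i |x_i|, \qquad \psi(x) = \sum_i I_{[l_i, u_i]}(x_i), \qquad \psi(x) = \tfrac{1}{2} \sum_i \lambda_i x_i^2, \]

i.e. no regularizer, a weighted L1 norm (as in LASSO), box constraints written as an indicator function (as in the SVM dual), and a weighted squared L2 norm.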

  5. Loss: examples • Quadratic loss [BKBG’11, RT’11b, TBRS’13, RT’13a] • Logistic loss • Square hinge loss • L-infinity • L1 regression [FR’13] • Exponential loss
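For concreteness, with a data matrix A (rows a_j) and labels b_j or y_j (notation assumed, not from the slides), the first three losses have the standard forms

    \[ f(x) = \tfrac{1}{2}\|Ax - b\|_2^2, \qquad f(x) = \sum_j \log\bigl(1 + e^{-y_j a_j^\top x}\bigr), \qquad f(x) = \sum_j \max\bigl(0,\, 1 - y_j a_j^\top x\bigr)^2 . \]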

  6. RANDOMIZED COORDINATE DESCENT IN 2D

  7. 2D Optimization • Goal: find the minimizer of a function of two variables [figure: contours of the function]

  8. Randomized Coordinate Descent in 2D [figure: contour plot of the objective, with the four coordinate directions marked N, E, S, W and a starting point]

  9.–14. Randomized Coordinate Descent in 2D [figure sequence: iterates 1–6 appear one per slide, each obtained by minimizing along a single randomly chosen coordinate direction]

  15. Randomized Coordinate Descent in 2D [figure: iterate 7 lands on the minimizer] SOLVED!
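The sequence above is plain randomized coordinate descent: pick one of the two coordinates at random and minimize along it. A minimal sketch on an illustrative 2D quadratic (the matrix, starting point and number of steps are assumptions for the demo, not taken from the talk):

    import numpy as np

    # Illustrative 2D quadratic: f(x) = 0.5 * x^T A x - b^T x, with A positive definite
    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, 1.0])

    x = np.zeros(2)                  # starting point ("0" in the figures)
    rng = np.random.default_rng(0)
    for _ in range(50):
        i = rng.integers(2)          # choose a coordinate uniformly at random
        grad_i = A[i] @ x - b[i]     # partial derivative of f at x
        x[i] -= grad_i / A[i, i]     # exact minimization along coordinate i

    print(x, np.linalg.solve(A, b))  # iterate vs. the true minimizer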

  16. CONTRIBUTIONS

  17. Variants of Randomized Coordinate Descent Methods • Block • can operate on “blocks” of coordinates • as opposed to just on individual coordinates • General • applies to “general” (=smooth convex) functions • as opposed to special ones such as quadratics • Proximal • admits a “nonsmooth regularizer” that is kept intact in solving subproblems • regularizer not smoothed, nor approximated • Parallel • operates on multiple blocks / coordinates in parallel • as opposed to just 1 block / coordinate at a time • Accelerated • achieves O(1/k^2) convergence rate for convex functions • as opposed to O(1/k) • Efficient • complexity of 1 iteration is O(1) per processor on sparse problems • as opposed to O(# coordinates): avoids adding two full vectors

  18. Brief History of Randomized Coordinate Descent Methods + new long stepsizes

  19. APPROX

  20. A P PROX • A = “ACCELERATED” • P = “PARALLEL” • PROX = “PROXIMAL”

  21. PCDM (R. & Takáč, 2012) = APPROX if we force

  22. APPROX: Smooth Case • partial derivative of f • update for coordinate i • want this to be as large as possible
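As a rough guide to the formula behind these labels: up to the acceleration constants (which involve the sequence θ_k and τ; see arXiv:1312.5799 for the exact expression), the smooth-case step for a sampled coordinate i is a partial-gradient step

    \[ z^{(i)} \;\leftarrow\; z^{(i)} \;-\; \frac{c_k}{v_i}\, \nabla_i f(y_k), \]

where c_k is a placeholder for the omitted constants and v_i is the stepsize parameter certified by the ESO; “want this to be as large as possible” refers to the effective stepsize 1/v_i.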

  23. CONVERGENCE RATE

  24. Convergence Rate • Key assumption: an ESO inequality (introduced later in the talk) • Theorem [FR’13b]: with n = # coordinates, k = # iterations and τ = average # coordinates updated per iteration, a lower bound on k of the stated form implies ε-accuracy in expectation
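Up to constants, the theorem has the accelerated O(1/k^2) shape (the exact constant C, which depends on the initial point, is given in [FR’13b]): with n coordinates and τ the average number updated per iteration,

    \[ k \;\gtrsim\; \frac{n}{\tau}\, \sqrt{\frac{C}{\epsilon}} \qquad \Longrightarrow \qquad \mathbb{E}\bigl[F(x_k) - F^{*}\bigr] \;\le\; \epsilon . \]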

  25. Special Case: Fully Parallel Variant • all coordinates are updated in each iteration • the bound then involves the # iterations and normalized weights (summing to n), and again implies ε-accuracy

  26. Special Case: Effect of New Stepsizes • with the new stepsizes (discussed later), the bound involves the average degree of separability and an “average” of the Lipschitz constants

  27. “EFFICIENCY” OF APPROX

  28. Cost of 1 Iteration of APPROX • scalar functions: each derivative costs O(1) • assume N = n (all blocks are of size 1) and that the data matrix A is sparse • then the average cost of 1 iteration of APPROX is proportional to the average # nonzeros in a column of A (arithmetic ops)

  29. Bottleneck: Computation of Partial Derivatives • a matrix-vector product involving the current iterate is maintained across iterations, so each partial derivative can be computed from the nonzeros of a single column of A
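To make the claimed per-iteration cost concrete, here is a minimal sketch for the quadratic loss f(x) = 0.5 ||Ax - b||^2: the residual r = Ax - b is maintained between iterations, so computing the partial derivative (the inner product of column i of A with r) and updating the residual after changing x_i both touch only the nonzeros of column i. (The data, sampling and step rule are illustrative assumptions; this is not the paper's implementation.)

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    A = sp.random(1000, 500, density=0.01, format="csc", random_state=0)  # sparse data
    b = rng.standard_normal(1000)

    x = np.zeros(A.shape[1])
    r = A @ x - b                              # residual r = A x - b, maintained throughout

    def coordinate_step(i):
        """One exact coordinate-minimization step; cost = O(nnz of column i)."""
        start, end = A.indptr[i], A.indptr[i + 1]
        rows, vals = A.indices[start:end], A.data[start:end]   # column i of A
        grad_i = vals @ r[rows]                # partial derivative of 0.5*||Ax - b||^2
        hess_ii = vals @ vals                  # coordinate Lipschitz constant (A^T A)_{ii}
        if hess_ii == 0.0:
            return                             # empty column: nothing to do
        delta = -grad_i / hess_ii
        x[i] += delta
        r[rows] += delta * vals                # maintain the residual; no full A @ x needed

    for _ in range(2000):
        coordinate_step(rng.integers(A.shape[1]))

    print(0.5 * np.linalg.norm(A @ x - b) ** 2)   # objective after 2000 cheap steps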

  30. PRELIMINARY EXPERIMENTS

  31. L1 Regularized L1 Regression • Dorothea dataset • [plot comparing: Gradient Method, Nesterov’s Accelerated Gradient Method, SPCDM, APPROX]

  32. L1 Regularized L1 Regression

  33. L1 Regularized Least Squares (LASSO) • KDDB dataset • [plot comparing: PCDM, APPROX]

  34. Training Linear SVMs • Malicious URL dataset

  35. Choice of Stepsizes: How (not) to Parallelize Coordinate Descent

  36. Convergence of Randomized Coordinate Descent • Focus on n (big data = big n) • Strongly convex F (simple method) • ‘Difficult’ nonsmooth F (simple method) • ‘Difficult’ nonsmooth F (accelerated method) or smooth F (simple method) • Smooth or ‘simple’ nonsmooth F (accelerated method)

  37. Parallelization Dream • Serial vs. Parallel • What do we actually get, versus what we WANT? • Depends on the extent to which we can add up individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration

  38. “Naive” parallelization • Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates

  39.–43. Failure of naive parallelization [figure sequence: starting from point 0, the two individual coordinate updates 1a and 1b are each fine on their own, but their sum 1 overshoots; repeating from 1 gives 2a and 2b, whose sum 2 overshoots again. OOPS!]

  44. Idea: averaging updates may help [figure: averaging the updates 1a and 1b from point 0 lands on the minimizer. SOLVED!]

  45.–46. Averaging can be too conservative [figure sequence: on a different problem, the averaged iterates 1, 2, ... (built from 1a, 1b, 2a, 2b, and so on) creep along and stay far from the point we WANT. BAD!!!]

  47. What to do? • h_i = update to coordinate i, e_i = i-th unit coordinate vector, S = set of coordinates chosen in this iteration • Averaging: x ← x + (1/|S|) Σ_{i∈S} h_i e_i • Summation: x ← x + Σ_{i∈S} h_i e_i • Figure out when one can safely use summation
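A tiny sketch of the two aggregation rules on an illustrative quadratic: each sampled coordinate first computes the update h_i it would take on its own, and the iterate is then moved either by the sum or by the average of those updates (the strongly coupled matrix below is an assumption chosen so that summation visibly overshoots, as in the figures):

    import numpy as np

    # f(x) = 0.5 * x^T A x - b^T x with strongly coupled coordinates
    A = np.array([[1.0, 0.9, 0.9],
                  [0.9, 1.0, 0.9],
                  [0.9, 0.9, 1.0]])
    b = np.ones(3)

    def parallel_round(x, S, mode):
        """Each i in S computes its own serial update h_i; then aggregate."""
        h = np.zeros_like(x)
        for i in S:
            h[i] = -(A[i] @ x - b[i]) / A[i, i]      # the step coordinate i would take alone
        return x + h if mode == "summation" else x + h / len(S)

    x_star = np.linalg.solve(A, b)
    for mode in ("summation", "averaging"):
        x = np.zeros(3)
        for _ in range(30):
            x = parallel_round(x, [0, 1, 2], mode)   # fully parallel: update all coordinates
        print(mode, np.linalg.norm(x - x_star))      # summation diverges, averaging converges (slowly)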

  48. ESO: Expected Separable Overapproximation
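For reference, in the notation of the PCDM/APPROX papers the ESO assumption reads roughly as follows (Ŝ is the random set of coordinates updated in one iteration, τ = E|Ŝ|, and v = (v_1, ..., v_n) are the stepsize parameters): f admits an ESO with respect to the sampling Ŝ with parameters v if

    \[ \mathbb{E}\Bigl[ f\bigl(x + h_{[\hat S]}\bigr) \Bigr] \;\le\; f(x) + \frac{\tau}{n} \Bigl( \langle \nabla f(x), h \rangle + \tfrac{1}{2} \sum_{i=1}^{n} v_i \, \bigl\|h^{(i)}\bigr\|^2 \Bigr) \qquad \text{for all } x, h, \]

where h_{[Ŝ]} keeps the coordinates of h that lie in Ŝ and zeroes out the rest. Smaller admissible v_i mean longer steps, which is why the five models on the next two slides matter.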

  49. 5 Models for f Admitting Small ESO Parameters • 1. Smooth partially separable f [RT’11b] • 2. Nonsmooth max-type f [FR’13] • 3. f with ‘bounded Hessian’ [BKBG’11, RT’13a]

  50. 5 Models for f Admitting Small ESO Parameters (continued) • 4. Partially separable f with smooth components [NC’13] • 5. Partially separable f with block smooth components [FR’13b]
