
Frank-Wolfe optimization insights in machine learning

Frank-Wolfe optimization insights in machine learning. Simon Lacoste-Julien, INRIA / École Normale Supérieure, SIERRA Project Team. SMILE – November 4th 2013. Outline: Frank-Wolfe optimization; Frank-Wolfe for structured prediction; links with previous algorithms.


Presentation Transcript


  1. Frank-Wolfe optimization insights in machine learning Simon Lacoste-Julien INRIA / École Normale Supérieure SIERRA Project Team SMILE – November 4th 2013

  2. Outline • Frank-Wolfe optimization • Frank-Wolfe for structured prediction • links with previous algorithms • block-coordinate extension • results for sequence prediction • Herding as Frank-Wolfe optimization • extension: weighted Herding • simulations for quadrature

  3. Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient) • algorithm for constrained optimization: min_{x ∈ M} f(x), with f convex & continuously differentiable and M convex & compact • FW algorithm – repeat: 1) find a good feasible direction by minimizing the linearization of f: s_t = argmin_{s ∈ M} ⟨s, ∇f(x_t)⟩ 2) take a convex step in that direction: x_{t+1} = (1 − γ_t) x_t + γ_t s_t • Properties: O(1/T) rate • sparse iterates • duality gap for free • affine invariant • rate holds even if the linear subproblem is solved only approximately
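To make the two-step loop concrete, here is a minimal Python sketch of generic Frank-Wolfe over the probability simplex; the simplex domain, the quadratic objective and all names below are illustrative assumptions rather than anything from the talk. Over the simplex, the linear subproblem is solved by picking the vertex whose gradient coordinate is smallest.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iter=1000, tol=1e-6):
    """Frank-Wolfe over the probability simplex (sketch of slide 3).
    grad: callable returning the gradient of the convex objective at x."""
    x = x0.copy()
    for t in range(n_iter):
        g = grad(x)
        j = int(np.argmin(g))                # 1) linear subproblem: s = e_j minimizes <s, grad>
        s = np.zeros_like(x)
        s[j] = 1.0
        gap = g.dot(x - s)                   # duality gap certificate, >= f(x) - f(x*)
        if gap < tol:
            break
        gamma = 2.0 / (t + 2.0)              # standard step size; line search also works
        x = (1.0 - gamma) * x + gamma * s    # 2) convex step: iterate stays feasible and sparse
    return x

# Illustrative use: minimize 0.5 * ||A x - b||^2 over the simplex (A, b made up here).
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])
x_opt = frank_wolfe_simplex(lambda x: A.T @ (A @ x - b), np.array([0.5, 0.5]))
```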

  4. Frank-Wolfe: properties • convex steps => the iterate is a sparse convex combination of the corners visited so far • get a duality gap certificate for free (a special case of the Fenchel duality gap), and the gap also converges as O(1/T)! • only need to solve the linear subproblem *approximately* (additive/multiplicative bound) • affine invariant! [see Jaggi ICML 2013]
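The gap referred to above has a simple closed form at every iterate; written out (a standard identity, reconstructed rather than copied from the slide):

```latex
% Frank-Wolfe (Fenchel) duality gap at iterate x_t, where s_t solves the linear subproblem:
g(x_t) \;=\; \max_{s \in \mathcal{M}} \langle x_t - s,\, \nabla f(x_t) \rangle
       \;=\; \langle x_t - s_t,\, \nabla f(x_t) \rangle
       \;\ge\; f(x_t) - f(x^\star).
% The certificate therefore comes for free from the direction-finding step.
```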

  5. Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013] Martin Jaggi Simon Lacoste-Julien Patrick Pletscher Mark Schmidt

  6. Structured SVM optimization • structured prediction: learn a classifier h_w(x) = argmax_{y ∈ Y} ⟨w, φ(x, y)⟩ (decoding) • structured hinge loss (vs. the binary hinge loss): maximum of the loss-augmented score over outputs -> loss-augmented decoding • structured SVM primal: regularized average of the structured hinge losses over the n examples • structured SVM dual: one variable per (example, output) pair -> exponential number of variables! • primal-dual pair: the primal weight vector is recovered as a linear function of the dual variables
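For reference, the structured SVM primal the slide alludes to, written in the usual notation of the ICML 2013 paper (reconstructed here, since the slide's formulas did not survive extraction):

```latex
% Structured SVM primal over n examples, with feature map \phi and task loss L_i:
\min_{w}\;\; \frac{\lambda}{2}\,\|w\|^2
  \;+\; \frac{1}{n}\sum_{i=1}^{n}\;
  \underbrace{\max_{y \in \mathcal{Y}_i}\big[\, L_i(y) - \langle w,\, \psi_i(y) \rangle \,\big]}_{\text{structured hinge loss } H_i(w)},
\qquad \psi_i(y) := \phi(x_i, y_i) - \phi(x_i, y).
% The inner maximization is the loss-augmented decoding problem; the dual has one variable
% per (example, output) pair, hence exponentially many variables.
```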

  7. Structured SVM optimization (2) • popular approaches: • stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10] • pros: online! • cons: sensitive to the step-size; don't know when to stop • cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09] • pros: automatic step-size; duality gap • cons: batch! -> slow for large n • our approach: block-coordinate Frank-Wolfe on the dual -> combines the best of both worlds: • online! • automatic step-size via analytic line search • duality gap • guaranteed error rate after K passes through the data • rates also hold for approximate oracles

  8. Frank-Wolfe algorithm [Frank, Wolfe 1956] (aka conditional gradient) • algorithm for constrained optimization: min_{x ∈ M} f(x), with f convex & continuously differentiable and M convex & compact • FW algorithm – repeat: 1) find a good feasible direction by minimizing the linearization of f: s_t = argmin_{s ∈ M} ⟨s, ∇f(x_t)⟩ 2) take a convex step in that direction: x_{t+1} = (1 − γ_t) x_t + γ_t s_t • Properties: O(1/T) rate • sparse iterates • duality gap for free • affine invariant • rate holds even if the linear subproblem is solved only approximately

  9. Frank-Wolfe for structured SVM • run the FW algorithm on the structured SVM dual, using the primal-dual link between the dual variables and the primal weight vector w • key insight: 1) finding a good feasible direction by minimizing the linearization of the dual objective = loss-augmented decoding on each example i 2) the convex step in that direction becomes a batch subgradient step in the primal, with the step size γ chosen by analytic line search on the quadratic dual • link between FW and the subgradient method: see [Bach 12]
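A sketch of what slide 9's recipe could look like in code, assuming three problem-specific helpers that are not in the talk: max_oracle(w, x_i, y_i) for loss-augmented decoding, psi(x_i, y_i, y) for the feature difference φ(x_i, y_i) − φ(x_i, y), and loss(y_i, y) for the task loss.

```python
import numpy as np

def fw_struct_svm(data, lam, d, max_oracle, psi, loss, n_iter=200, gap_tol=1e-6):
    """Batch Frank-Wolfe on the structured SVM dual, tracked in the primal (sketch of slide 9)."""
    n = len(data)
    w = np.zeros(d)     # primal weights corresponding to the current dual iterate
    ell = 0.0           # matching linear term of the dual objective
    for _ in range(n_iter):
        # 1) linear subproblem = loss-augmented decoding on every example
        w_s, ell_s = np.zeros(d), 0.0
        for (x_i, y_i) in data:
            y_star = max_oracle(w, x_i, y_i)
            w_s += psi(x_i, y_i, y_star) / (lam * n)
            ell_s += loss(y_i, y_star) / n
        # duality gap and analytic line search on the quadratic dual (clipped to [0, 1])
        gap = lam * w.dot(w - w_s) - ell + ell_s
        if gap < gap_tol:
            break
        gamma = min(max(gap / max(lam * np.dot(w - w_s, w - w_s), 1e-12), 0.0), 1.0)
        # 2) convex step = a batch subgradient-like step with the line-search step size
        w = (1.0 - gamma) * w + gamma * w_s
        ell = (1.0 - gamma) * ell + gamma * ell_s
    return w
```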

  10. FW for structured SVM: properties • running FW on the dual ⇔ batch subgradient on the primal • but with an adaptive step-size from the analytic line-search • and a duality gap stopping criterion • ‘fully corrective’ FW on the dual ⇔ cutting plane algorithm (SVMstruct) • still an O(1/T) rate, but this provides a simpler proof of SVMstruct convergence + guarantees for approximate oracles • not faster than simple FW in our experiments • BUT: still batch => slow for large n...

  11. Block-Coordinate Frank-Wolfe (new!) • for constrained optimization over a compact product domain M = M^(1) × … × M^(n): • pick i at random; update only block i with a FW step • Properties: O(1/T) rate • sparse iterates • duality gap guarantees • affine invariant • rate holds even if the linear subproblem is solved only approximately • we proved the same O(1/T) rate as batch FW -> but each step is n times cheaper -> and the constant can be the same (e.g. for the SVM)

  12. Block-Coordinate Frank-Wolfe (new!) • for constrained optimization over a compact product domain M = M^(1) × … × M^(n): • pick i at random; update only block i with a FW step • for the structured SVM: the block-i linear subproblem = loss-augmented decoding on example i • we proved the same O(1/T) rate as batch FW -> but each step is n times cheaper -> and the constant can be the same (e.g. for the SVM)
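A minimal sketch of the block-coordinate variant specialized to the structured SVM, reusing the assumed helpers (max_oracle, psi, loss) from the batch sketch above; only the block of the sampled example is touched, and the per-block line search mirrors the batch formula.

```python
import numpy as np

def bcfw_struct_svm(data, lam, d, max_oracle, psi, loss, n_passes=20, seed=0):
    """Block-coordinate Frank-Wolfe for the structured SVM (sketch of slide 12)."""
    rng = np.random.RandomState(seed)
    n = len(data)
    w, ell = np.zeros(d), 0.0              # global iterate: w = sum_i w_blocks[i]
    w_blocks = np.zeros((n, d))            # one parameter vector per example (slide 13 caveat)
    ell_blocks = np.zeros(n)
    for _ in range(n_passes * n):
        i = rng.randint(n)                 # pick one block (one training example) at random
        x_i, y_i = data[i]
        y_star = max_oracle(w, x_i, y_i)   # a single loss-augmented decoding call
        w_s = psi(x_i, y_i, y_star) / (lam * n)
        ell_s = loss(y_i, y_star) / n
        # analytic line search restricted to block i, clipped to [0, 1]
        num = lam * w.dot(w_blocks[i] - w_s) - ell_blocks[i] + ell_s
        den = max(lam * np.dot(w_blocks[i] - w_s, w_blocks[i] - w_s), 1e-12)
        gamma = min(max(num / den, 0.0), 1.0)
        w_i_new = (1.0 - gamma) * w_blocks[i] + gamma * w_s
        ell_i_new = (1.0 - gamma) * ell_blocks[i] + gamma * ell_s
        w += w_i_new - w_blocks[i]         # cheap update of the global primal iterate
        ell += ell_i_new - ell_blocks[i]
        w_blocks[i], ell_blocks[i] = w_i_new, ell_i_new
    return w
```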

  13. BCFW for structured SVM: properties • each update requires a single oracle call (vs. n calls for SVMstruct), so we get a comparable error guarantee after K passes through the data • advantages over stochastic subgradient: • step-sizes by line-search -> more robust • duality gap certificate -> know when to stop • guarantees hold for approximate oracles • implementation: https://github.com/ppletscher/BCFWstruct • almost as simple as the stochastic subgradient method • caveat: need to store one parameter vector per example (or store the dual variables) • for the binary SVM -> reduces to the DCA method [Hsieh et al. 08] • interesting link with prox SDCA [Shalev-Shwartz et al. 12]

  14. More info about the constants... • batch FW rate: governed by the “curvature” of the objective • BCFW rate: governed by the “product curvature” (plus a term that line-search removes) • comparing the constants: • for the structured SVM – the constants are the same • for an identity Hessian + cube constraint: (no speed-up)
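The two constants referenced on the slide are the standard ones; their definitions, in the notation of Jaggi (ICML 2013) and the BCFW paper (reconstructed here, not read off the slide), are:

```latex
% "Curvature" of f over M (batch FW) and "product curvature" over the product domain (BCFW):
C_f \;:=\; \sup_{\substack{x, s \in \mathcal{M},\; \gamma \in [0,1] \\ y = x + \gamma (s - x)}}
    \frac{2}{\gamma^2}\Big( f(y) - f(x) - \langle y - x,\, \nabla f(x) \rangle \Big),
\qquad
C_f^{\otimes} \;:=\; \sum_{i=1}^{n} C_f^{(i)},
% where C_f^{(i)} is the curvature of f restricted to block M^{(i)}.
```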

  15. Sidenote: weighted averaging • it is standard to average the iterates of the stochastic subgradient method: uniform averaging vs. t-weighted averaging [L.-J. et al. 12], [Shamir & Zhang 13] • weighted averaging improves the duality gap for BCFW • it also makes a big difference in test error!
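As a concrete reading of the two schemes, here are their standard online-update forms (a small sketch; the weight-proportional-to-t scheme is the one analyzed in the references cited on the slide):

```python
def update_uniform_avg(avg, w_t, t):
    """Uniform averaging: avg_t = (1/t) * sum_{k<=t} w_k (t starts at 1, avg starts at 0)."""
    return avg + (w_t - avg) / t

def update_weighted_avg(avg, w_t, t):
    """t-weighted averaging: avg_t = 2/(t(t+1)) * sum_{k<=t} k * w_k (t starts at 1, avg starts at 0)."""
    rho = 2.0 / (t + 1.0)
    return (1.0 - rho) * avg + rho * w_t
```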

  16. Experiments • results on the OCR dataset and the CoNLL dataset (plots not reproduced in this transcript)

  17. Surprising test error though! • on the CoNLL dataset, the ordering of the methods by test error is flipped relative to their ordering by optimization error (plots not reproduced in this transcript)

  18. Conclusions for 1st part • applying FW on dual of structured SVM • unified previous algorithms • provided line-search version of batch subgradient • new block-coordinate variant of Frank-Wolfe algorithm • same convergence rate but with cheaper iteration cost • yields a robust & fast algorithm for structured SVM • future work: • caching tricks • non-uniform sampling • regularization path • explain weighted avg. test error mystery

  19. On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012] Guillaume Obozinski Simon Lacoste-Julien Francis Bach

  20. A motivation: quadrature • approximating integrals: ∫ f dp ≈ (1/T) Σ_t f(x_t) • random sampling yields O(1/√T) error • herding [Welling 2009] yields O(1/T) error! [Chen et al. 2010] (like quasi-MC) • this part -> links herding with an optimization algorithm (conditional gradient / Frank-Wolfe) • suggests extensions – e.g. a weighted version with a faster rate • BUT the extensions are worse for learning??? • -> yields interesting insights on the properties of herding...

  21. Outline • Background: • Herding • [Conditional gradient algorithm] • Equivalence between herding & cond. gradient • Extensions • New rates & theorems • Simulations • Approximation of integrals with cond. gradient variants • Learned distribution vs. max entropy

  22. Review of herding [Welling ICML 2009] • setting: learning in an MRF with feature map φ • usual pipeline: data -> (approximate) ML / maximum-entropy parameter learning -> (approximate) inference by sampling -> samples • herding: moment matching that goes from the data directly to (pseudo-)samples, skipping the parameter-learning step

  23. Herding updates • zero-temperature limit of the log-likelihood: the ‘Tipi’ function (thanks to Max Welling for the picture) • herding updates = subgradient ascent updates: x_{t+1} ∈ argmax_x ⟨w_t, φ(x)⟩, then w_{t+1} = w_t + μ − φ(x_{t+1}) • properties: 1) weakly chaotic -> entropy? 2) moment matching: (1/T) Σ_t φ(x_t) -> μ -> our focus
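A minimal sketch of these updates over a finite candidate set (the finite set, the array layout and the names are illustration-only assumptions; herding is defined for general state spaces):

```python
import numpy as np

def herding(features, mu, n_samples):
    """Herding updates of slide 23.
    features: (m, d) array of feature vectors phi(x) for m candidate states.
    mu: target moment vector E_p[phi(x)]."""
    w = mu.copy()                              # common initialization w_0 = mu
    picked = []
    running_mean = np.zeros_like(mu)
    for t in range(1, n_samples + 1):
        idx = int(np.argmax(features @ w))     # x_{t+1} in argmax_x <w_t, phi(x)>
        w = w + mu - features[idx]             # w_{t+1} = w_t + mu - phi(x_{t+1})
        picked.append(idx)
        running_mean += (features[idx] - running_mean) / t   # (1/T) sum_t phi(x_t)
    return picked, running_mean                # moment matching: running_mean tracks mu
```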

  24. Approx. integrals in RKHS • reproducing property: f(x) = ⟨f, Φ(x)⟩_H • define the mean map μ := E_p[Φ(x)] • want to approximate integrals of the form ∫ f dp • use a weighted sum of evaluation points to get an approximated mean μ̂ = Σ_t w_t Φ(x_t) • the approximation error is then bounded by ‖f‖_H ‖μ − μ̂‖_H • so controlling the moment discrepancy is enough to control the error of integrals in the RKHS H
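Written out, the bound on the slide follows in one line from the reproducing property and Cauchy-Schwarz:

```latex
% Quadrature error in the RKHS H, with mean map \mu := E_p[\Phi(x)] and \hat\mu := \sum_t w_t \Phi(x_t):
\Big|\, \int f \,\mathrm{d}p \;-\; \sum_{t} w_t\, f(x_t) \,\Big|
  \;=\; \big|\, \langle f,\; \mu - \hat\mu \rangle_{\mathcal{H}} \,\big|
  \;\le\; \|f\|_{\mathcal{H}}\; \|\mu - \hat\mu\|_{\mathcal{H}},
% so a small moment discrepancy controls the error uniformly over the unit ball of H.
```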

  25. Conditional gradient algorithm (aka Frank-Wolfe) • algorithm to optimize min_{g ∈ M} J(g), with J convex & (twice) continuously differentiable and M convex & compact • repeat: 1) find a good feasible direction by minimizing the linearization of J: ḡ_{t+1} = argmin_{g ∈ M} ⟨g, ∇J(g_t)⟩ 2) take a convex step in that direction: g_{t+1} = (1 − ρ_t) g_t + ρ_t ḡ_{t+1} • -> converges in O(1/T) in general

  26. Herding & cond. grad. are equivalent • trick: look at conditional gradient on the dummy objective J(g) = ½‖g − μ‖² over M = conv{Φ(x)}, with step-size ρ_t = 1/(t+1), and do the change of variable w_t = t(μ − g_t) • the conditional gradient updates then become exactly the herding updates • more generally: subgradient ascent and conditional gradient are Fenchel duals of each other! (see also [Bach 2012])
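The bookkeeping behind the equivalence, filled in here because the slide's formulas did not survive extraction (consistent with the herding updates on slide 23):

```latex
% Conditional gradient on J(g) = (1/2)||g - \mu||^2 over M = conv{\Phi(x)}, step \rho_t = 1/(t+1):
\bar g_{t+1} = \Phi(x_{t+1}), \quad x_{t+1} \in \arg\max_x \langle \mu - g_t,\, \Phi(x)\rangle,
\qquad g_{t+1} = \tfrac{t}{t+1}\, g_t + \tfrac{1}{t+1}\, \Phi(x_{t+1}).
% The change of variable w_t := t(\mu - g_t) turns these into exactly the herding updates:
w_{t+1} = (t+1)(\mu - g_{t+1}) = t(\mu - g_t) + \mu - \Phi(x_{t+1}) = w_t + \mu - \Phi(x_{t+1}),
\qquad \arg\max_x \langle \mu - g_t,\, \Phi(x)\rangle \;=\; \arg\max_x \langle w_t,\, \Phi(x)\rangle .
```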

  27. Extensions of herding • more general step-sizes ρ_t -> give a weighted sum μ̂ = Σ_t w_t Φ(x_t) • two extensions: 1) line search for ρ_t 2) min-norm point algorithm (minimize J(g) over the convex hull of the previously visited points)
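For the quadratic J above, the line search in extension 1) has a closed form (a standard computation, reconstructed rather than copied from the slide):

```latex
% Exact line search along the FW direction for J(g) = (1/2)||g - \mu||^2, clipped to [0, 1]:
\rho_t \;=\; \operatorname*{arg\,min}_{\rho \in [0,1]} J\big(g_t + \rho\,(\bar g_{t+1} - g_t)\big)
       \;=\; \min\!\Big( 1,\; \max\!\Big( 0,\;
           \frac{\langle g_t - \mu,\; g_t - \bar g_{t+1} \rangle}{\|g_t - \bar g_{t+1}\|^2} \Big)\Big),
% which assigns non-uniform weights to the selected points, i.e. a weighted herding sum.
```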

  28. Rates of convergence & theorems • with no assumption: conditional gradient yields* an O(1/√T) bound on the moment discrepancy • if we assume μ lies in the relative interior of M with some radius r > 0: • [Chen et al. 2010] yields an O(1/T) rate for herding • whereas the line-search version yields a faster rate [Guélat & Marcotte 1986, Beck & Teboulle 2004] • propositions: 1) and 2) show that this assumption fails in the infinite-dimensional setting (i.e. the fast rate of [Chen et al. 2010] doesn't hold!)

  29. Simulation 1: approximating integrals • kernel herding, using an RKHS with a Bernoulli polynomial kernel (infinite-dimensional; the mean element is available in closed form) • (plots not reproduced in this transcript)

  30. Simulation 2: max entropy? • learning independent bits: plots of the error on the moments vs. the error on the distribution (not reproduced in this transcript)

  31. Conclusions for 2nd part • equivalence of herding and conditional gradient: -> yields better algorithms for quadrature based on moments -> but highlights a max-entropy / moment-matching tradeoff! • other interesting points: • setting up fake optimization problems -> harvest the properties of known algorithms • the conditional gradient algorithm is useful to know... • the duality of subgradient & conditional gradient is more general • recent related work: • link with Bayesian quadrature [Huszár & Duvenaud UAI 2012] • herded Gibbs sampling [Bornn et al. ICLR 2013]

  32. Thank you!
