
Semi-Stochastic Gradient Descent Methods

Jakub Konečný (joint work with Peter Richtárik)

University of Edinburgh



Large scale problem setting

  • Problems are often structured

  • Frequently arising in machine learning

  • $n$ is BIG

  • Structure – sum of functions:

    $$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$


Examples

  • Linear regression (least squares):

    $$f_i(x) = \left(a_i^\top x - b_i\right)^2$$

  • Logistic regression (classification):

    $$f_i(x) = \log\left(1 + \exp\left(-b_i \, a_i^\top x\right)\right)$$
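
A minimal NumPy sketch of these component functions and the structured objective (the data $A$, $b$ and all names below are illustrative, not from the talk):

```python
import numpy as np

def least_squares_fi(x, a_i, b_i):
    # f_i(x) = (a_i^T x - b_i)^2
    r = a_i @ x - b_i
    return r * r

def logistic_fi(x, a_i, b_i):
    # f_i(x) = log(1 + exp(-b_i * a_i^T x)), with labels b_i in {-1, +1}
    return np.log1p(np.exp(-b_i * (a_i @ x)))

def f(x, A, b, fi=least_squares_fi):
    # The structured objective: f(x) = (1/n) * sum_i f_i(x)
    return np.mean([fi(x, A[i], b[i]) for i in range(A.shape[0])])
```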


Assumptions

  • Lipschitz continuity of the derivative of each $f_i$:

    $$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \, \|x - y\| \quad \text{for all } x, y$$

  • Strong convexity of $f$:

    $$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2 \quad \text{for all } x, y$$


Gradient Descent (GD)

  • Update rule:

    $$x_{k+1} = x_k - h \, \nabla f(x_k)$$

  • Fast convergence rate:

    $$f(x_k) - f(x_*) \le \left(1 - \tfrac{\mu}{L}\right)^{k} \left(f(x_0) - f(x_*)\right)$$

    • Alternatively, for accuracy $\varepsilon$ we need $O\left(\tfrac{L}{\mu} \log(1/\varepsilon)\right)$ iterations

  • Complexity of a single iteration – $n$

    (measured in gradient evaluations)
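
A minimal sketch of this update in Python (illustrative, not the talk's code; `grad_full` is an assumed helper returning $\nabla f$, so each call costs $n$ component-gradient evaluations):

```python
import numpy as np

def gd(grad_full, x0, h, iters):
    # Gradient descent: x_{k+1} = x_k - h * grad f(x_k).
    # Each iteration evaluates the full gradient: n component gradients.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x -= h * grad_full(x)
    return x
```

For least squares, for instance, `grad_full` could be `lambda x: 2 * A.T @ (A @ x - b) / len(b)`.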


Stochastic Gradient Descent (SGD)

  • Update rule, with $i$ picked uniformly at random and $h_k$ a step-size parameter:

    $$x_{k+1} = x_k - h_k \, \nabla f_i(x_k)$$

  • Why it works: the stochastic gradient is unbiased,

    $$\mathbb{E}\left[\nabla f_i(x)\right] = \nabla f(x)$$

  • Slow convergence: $O(1/\varepsilon)$ iterations for accuracy $\varepsilon$

  • Complexity of a single iteration – $1$

    (measured in gradient evaluations)
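
A matching SGD sketch (again illustrative; `grad_i` is an assumed helper returning $\nabla f_i$, and the decaying step size is one common choice rather than the talk's prescription):

```python
import numpy as np

def sgd(grad_i, n, x0, h0, iters, seed=0):
    # SGD: x_{k+1} = x_k - h_k * grad f_i(x_k), with i uniform on {0,...,n-1}.
    # One component gradient per iteration; the decaying step size
    # h_k = h0 / (k + 1) tames the variance of the stochastic updates.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(iters):
        i = rng.integers(n)
        x -= (h0 / (k + 1)) * grad_i(x, i)
    return x
```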


Goal

  • GD – fast convergence, but $n$ gradient evaluations in each iteration

  • SGD – complexity of each iteration independent of $n$, but slow convergence

  • Combine the strengths of both in a single algorithm



Intuition

  • The gradient does not change drastically between nearby iterates

  • We could reuse the information from an “old” gradient


Modifying the “old” gradient

  • Imagine someone gives us a “good” point $x$ and the gradient $\nabla f(x)$

  • The gradient at a point $y$, near $x$, can be expressed as

    $$\nabla f(y) = \underbrace{\nabla f(x)}_{\text{already computed gradient}} + \underbrace{\left(\nabla f(y) - \nabla f(x)\right)}_{\text{gradient change, which we can try to estimate}}$$

  • Approximation of the gradient, with $i$ chosen uniformly at random:

    $$\nabla f(y) \approx \nabla f(x) + \nabla f_i(y) - \nabla f_i(x)$$
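
Taking expectations over the random index $i$ shows the approximation is unbiased: $\mathbb{E}[\nabla f(x) + \nabla f_i(y) - \nabla f_i(x)] = \nabla f(y)$. A small numerical check on illustrative least-squares data (everything below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(z, i):
    # Gradient of the component f_i(z) = (a_i^T z - b_i)^2
    return 2.0 * (A[i] @ z - b[i]) * A[i]

def grad_f(z):
    # Full gradient: (1/n) * sum_i grad f_i(z)
    return np.mean([grad_i(z, i) for i in range(n)], axis=0)

x, y = rng.standard_normal(d), rng.standard_normal(d)
# Averaging the estimator over all i realizes the expectation exactly:
est = np.mean([grad_f(x) + grad_i(y, i) - grad_i(x, i) for i in range(n)],
              axis=0)
print(np.allclose(est, grad_f(y)))  # True: the estimator is unbiased
```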


The S2GD Algorithm

  In each epoch $j$: compute the full gradient $g_j = \nabla f(x_j)$; then, starting from $y = x_j$, take a random number $t_j \in \{1, \dots, m\}$ of cheap inner steps

    $$y \leftarrow y - h \left(g_j + \nabla f_i(y) - \nabla f_i(x_j)\right), \qquad i \text{ picked uniformly at random},$$

  and set $x_{j+1} = y$.

  Simplification: the size of the inner loop is random, following a geometric rule.
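
A minimal Python sketch of this loop structure ($h$, $m$, $\nu$ follow the paper's notation; `grad_i` and `grad_full` are assumed helpers, and this is an illustration rather than the authors' implementation):

```python
import numpy as np

def s2gd(grad_i, grad_full, x0, n, epochs, m, h, nu=0.0, seed=0):
    # Semi-Stochastic Gradient Descent (S2GD), sketched.
    # Outer loop: one full-gradient computation g = grad f(x).
    # Inner loop: t steps of y <- y - h * (g + grad f_i(y) - grad f_i(x)),
    # where t in {1,...,m} is drawn with P(t) ~ (1 - nu*h)^(m - t)
    # (the geometric rule; uniform when nu = 0).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    ts = np.arange(1, m + 1)
    probs = (1.0 - nu * h) ** (m - ts)
    probs /= probs.sum()
    for _ in range(epochs):
        g = grad_full(x)            # n component-gradient evaluations
        y = x.copy()
        for _ in range(rng.choice(ts, p=probs)):
            i = rng.integers(n)     # 2 component gradients per inner step
            y -= h * (g + grad_i(y, i) - grad_i(x, i))
        x = y
    return x
```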



Convergence rate

  • How to set the parameters $h$, $m$, and the number of epochs $j$?

  • The analysis yields a per-epoch decay of the form

    $$\mathbb{E}\left[f(x_j) - f(x_*)\right] \le c^{\,j} \left(f(x_0) - f(x_*)\right), \qquad c = \frac{1}{\mu h (1 - 2Lh) m} + \frac{2Lh}{1 - 2Lh}$$

  • For any fixed $h$, the first term of $c$ can be made arbitrarily small by increasing $m$

  • The second term can be made arbitrarily small by decreasing $h$


Setting the parameters

  • Fix target accuracy $\varepsilon$. The accuracy is achieved by setting

    • # of epochs: $j = \lceil \log(1/\varepsilon) \rceil$

    • stepsize: $h = O(1/L)$

    • # of iterations: $m = O(\kappa)$, where $\kappa = L/\mu$ is the condition number

  • Total complexity (in gradient evaluations):

    $$j \, (n + 2m) = O\left((n + \kappa) \log(1/\varepsilon)\right)$$

    ($j$ epochs, each costing one full gradient evaluation, $n$, plus $2m$ for the cheap iterations)
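
A hedged helper that turns this recipe into numbers, reusing the rate constant $c$ from the previous slide and targeting $c \le e^{-1}$ per epoch (the default $h = 0.1/L$ is an illustrative choice, not the talk's prescription):

```python
import math

def s2gd_parameters(L, mu, n, eps, h_frac=0.1):
    # j = ceil(log(1/eps)) epochs; stepsize h = h_frac / L; m chosen so that
    # c = 1/(mu*h*(1-2Lh)*m) + 2Lh/(1-2Lh) <= exp(-1).
    j = math.ceil(math.log(1.0 / eps))
    h = h_frac / L
    second = 2.0 * L * h / (1.0 - 2.0 * L * h)
    assert second < math.exp(-1), "h too large: decrease h_frac"
    m = math.ceil(1.0 / (mu * h * (1.0 - 2.0 * L * h) * (math.exp(-1) - second)))
    total = j * (n + 2 * m)   # total gradient evaluations
    return j, h, m, total
```

For example, `s2gd_parameters(L=1.0, mu=1e-3, n=10**6, eps=1e-6)` returns $j = 14$ epochs and $m \approx 1.1 \times 10^5 \approx 100\kappa$ inner iterations per epoch.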


Complexity

  • S2GD complexity: $O\left((n + \kappa) \log(1/\varepsilon)\right)$

  • GD complexity:

    • $O\left(\kappa \log(1/\varepsilon)\right)$ iterations

    • complexity of a single iteration: $n$

    • Total: $O\left(n \kappa \log(1/\varepsilon)\right)$
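
To make the gap concrete, a back-of-the-envelope comparison with invented numbers: for $n = 10^9$, $\kappa = 10^3$, $\varepsilon = 10^{-6}$ (so $\log(1/\varepsilon) \approx 13.8$),

$$\text{GD: } n \kappa \log(1/\varepsilon) \approx 1.4 \times 10^{13}, \qquad \text{S2GD: } (n + \kappa) \log(1/\varepsilon) \approx 1.4 \times 10^{10},$$

a speed-up on the order of the condition number $\kappa$.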


Related Methods

  • SAG – Stochastic Average Gradient

    (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)

    • Refresh single stochastic gradient in each iteration

    • Need to store $n$ gradients.

    • Similar convergence rate

    • Cumbersome analysis

  • MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)

    • Similar to SAG, slightly worse performance

    • Elegant analysis


Related Methods

  • SVRG – Stochastic Variance Reduced Gradient

    (Rie Johnson, Tong Zhang, 2013)

    • Arises as a special case in S2GD

  • Prox-SVRG

    (Lin Xiao, Tong Zhang, 2014)

    • Extended to proximal setting

  • EMGD – Epoch Mixed Gradient Descent

    (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)

    • Handles simple constraints

    • Worse convergence rate – $O\left((n + \kappa^2) \log(1/\varepsilon)\right)$


Experiment

  • Example problem, with …

