Semi-Stochastic Gradient Descent Methods


Jakub Konečný (joint work with Peter Richtárik)

University of Edinburgh

Large scale problem setting
  • Problems are often structured: $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$
  • Frequently arising in machine learning
  • $n$ is BIG
  • Structure: $f$ is a sum of functions $f_i$

Examples
  • Linear regression (least squares)
  • Logistic regression (classification)

Assumptions
  • Lipschitz continuity of the derivative of each $f_i$ (constant $L$)
  • Strong convexity of $f$ (constant $\mu$)
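To make the finite-sum structure concrete, here is a minimal sketch of the two example problems written as averages of per-example losses. The toy data, the function names, and the exact form of each loss (a factor of 1/2 in least squares, labels in {-1, +1} for logistic regression) are assumptions for illustration, not taken from the slides.

```python
import math

# Toy data (hypothetical): n = 4 examples with d = 2 features each.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b_reg = [1.0, -0.5, 0.2, 2.5]   # real-valued targets for least squares
b_cls = [1, -1, -1, 1]          # labels in {-1, +1} for logistic regression


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def ls_loss(x, i):
    """Least squares component: f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return 0.5 * (dot(A[i], x) - b_reg[i]) ** 2


def ls_grad(x, i):
    r = dot(A[i], x) - b_reg[i]
    return [r * a for a in A[i]]


def logistic_loss(x, i):
    """Logistic component: f_i(x) = log(1 + exp(-b_i * a_i^T x))."""
    return math.log1p(math.exp(-b_cls[i] * dot(A[i], x)))


def logistic_grad(x, i):
    z = b_cls[i] * dot(A[i], x)
    s = -b_cls[i] / (1.0 + math.exp(z))
    return [s * a for a in A[i]]


def average(component, x, n):
    """f(x) = (1/n) * sum_i f_i(x)."""
    return sum(component(x, i) for i in range(n)) / n


def average_grad(component_grad, x, n):
    """grad f(x) = (1/n) * sum_i grad f_i(x); costs n gradient evaluations."""
    grads = [component_grad(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(len(x))]
```

Evaluating `average_grad` touches every example once, which is why the cost of a full gradient grows linearly with $n$.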
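Gradient Descent (GD)
  • Update rule: $x_{k+1} = x_k - h \nabla f(x_k)$
  • Fast (linear) convergence rate
    • Alternatively, for accuracy $\epsilon$ we need $O(\kappa \log(1/\epsilon))$ iterations, where $\kappa = L/\mu$ is the condition number
  • Complexity of single iteration – $n$ (measured in gradient evaluations)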
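A minimal sketch of gradient descent on a toy least-squares problem, counting gradient evaluations to show the per-iteration cost of $n$. The data, the step size $1/L$, and the iteration count are assumptions chosen for illustration.

```python
# Toy least-squares data (hypothetical): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b = [1.0, -0.5, 0.2, 2.5]
n, d = len(A), len(A[0])


def grad_i(x, i):
    """Gradient of the i-th component function."""
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]


def full_grad(x):
    """Full gradient: the average of all n component gradients."""
    grads = [grad_i(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(d)]


def gradient_descent(x0, h, iters):
    x = list(x0)
    evals = 0
    for _ in range(iters):
        g = full_grad(x)
        evals += n  # one full gradient costs n component gradients
        x = [xk - h * gk for xk, gk in zip(x, g)]
    return x, evals


# For this loss, each f_i has a gradient with Lipschitz constant ||a_i||^2.
L = max(sum(a * a for a in row) for row in A)
x_gd, evals_gd = gradient_descent([0.0, 0.0], 1.0 / L, 200)
```

With 200 iterations on 4 examples, the run spends 800 component-gradient evaluations, however close to the optimum the early iterates already are.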

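Stochastic Gradient Descent (SGD)
  • Update rule: $x_{k+1} = x_k - h_k \nabla f_i(x_k)$, where $i$ is chosen uniformly at random and $h_k$ is a step-size parameter
  • Why it works: $\mathbb{E}[\nabla f_i(x)] = \nabla f(x)$
  • Slow convergence: $O(1/\epsilon)$ iterations for accuracy $\epsilon$
  • Complexity of single iteration – $1$ (measured in gradient evaluations)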
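A minimal SGD sketch on the same kind of toy least-squares problem. The decreasing step-size schedule `h0 / (1 + decay * k)`, its constants, and the fixed seed are assumptions for illustration; the slides do not specify a schedule.

```python
import random

# Toy least-squares data (hypothetical): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b = [1.0, -0.5, 0.2, 2.5]
n, d = len(A), len(A[0])


def grad_i(x, i):
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]


def full_grad(x):
    grads = [grad_i(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(d)]


def sgd(x0, h0, decay, iters, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        i = rng.randrange(n)              # uniform random index
        h_k = h0 / (1.0 + decay * k)      # decreasing step size (assumed schedule)
        g = grad_i(x, i)                  # one gradient evaluation per iteration
        x = [xk - h_k * gk for xk, gk in zip(x, g)]
    return x


x_sgd = sgd([0.0, 0.0], h0=0.1, decay=0.05, iters=20000)
```

Each iteration costs a single gradient evaluation, but the step size has to shrink to damp the noise of the stochastic gradient, which is what slows convergence.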




GD
  • Fast convergence
  • $n$ gradient evaluations in each iteration

SGD
  • Slow convergence
  • Complexity of iteration independent of $n$

Combine both in a single algorithm

  • The gradient does not change drastically
  • We could reuse the information from the “old” gradient
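Modifying “old” gradient
  • Imagine someone gives us a “good” point $y$ and the gradient $\nabla f(y)$
  • Gradient at point $x$, near $y$, can be expressed as $\nabla f(x) = \nabla f(y) + (\nabla f(x) - \nabla f(y))$
    • $\nabla f(y)$: already computed gradient
    • $\nabla f(x) - \nabla f(y)$: gradient change, which we can try to estimate
  • Approximation of the gradient: $\nabla f(x) \approx \nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$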
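The idea of correcting an old full gradient with a cheap stochastic estimate of the gradient change can be sketched as follows. The point names `x` and `y`, the helper `vr_estimate`, and the toy data are assumptions for illustration. The sketch compares the spread of this estimator with the spread of a plain stochastic gradient at a point near `y`.

```python
# Toy least-squares data (hypothetical): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b = [1.0, -0.5, 0.2, 2.5]
n, d = len(A), len(A[0])


def grad_i(x, i):
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]


def full_grad(x):
    grads = [grad_i(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(d)]


def vr_estimate(x, y, g_y, i):
    """Old full gradient g_y = grad f(y), plus a one-sample estimate of the change."""
    gx, gy = grad_i(x, i), grad_i(y, i)
    return [g + a - c for g, a, c in zip(g_y, gx, gy)]


def spread(estimates, target):
    """Mean squared distance of the estimates from the target vector."""
    return sum(
        sum((e - t) ** 2 for e, t in zip(est, target)) for est in estimates
    ) / len(estimates)


y = [0.5, 0.6]        # the "good" point where the full gradient is known
x = [0.52, 0.58]      # a nearby point
g_y = full_grad(y)
g_x = full_grad(x)    # computed here only to measure the estimators' error

var_plain = spread([grad_i(x, i) for i in range(n)], g_x)
var_vr = spread([vr_estimate(x, y, g_y, i) for i in range(n)], g_x)
```

Averaged over all indices `i`, both estimators equal the full gradient at `x`, but the corrected one deviates far less when `x` is close to `y`, and it matches the old gradient exactly when `x == y`.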

The S2GD Algorithm

The inner loop is shown in simplified form; its size is actually random, following a geometric rule.
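The algorithm itself did not survive in this transcript. What follows is a minimal sketch of a semi-stochastic scheme of this kind, assuming the usual structure: one full gradient per epoch, then a random number of cheap corrected stochastic steps. The function name `s2gd`, the parameter `nu` (an assumed lower bound on the strong convexity constant, with `nu = 0` giving a uniform inner-loop length), the weights `(1 - nu*h)^(m - t)`, and all numeric values are assumptions for illustration, not the authors' exact pseudocode.

```python
import random

# Toy least-squares data (hypothetical): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b = [1.0, -0.5, 0.2, 2.5]
n, d = len(A), len(A[0])


def grad_i(x, i):
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]


def full_grad(x):
    grads = [grad_i(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(d)]


def s2gd(x0, h, m, epochs, nu=0.0, seed=0):
    rng = random.Random(seed)
    lengths = list(range(1, m + 1))
    # Geometric-type law for the inner loop size: P(t) ~ (1 - nu*h)^(m - t).
    weights = [(1.0 - nu * h) ** (m - t) for t in lengths]
    x = list(x0)
    evals = 0
    for _ in range(epochs):
        g = full_grad(x)                   # expensive step: n evaluations
        evals += n
        t_j = rng.choices(lengths, weights=weights)[0]
        y = list(x)
        for _ in range(t_j):               # cheap steps: 2 evaluations each
            i = rng.randrange(n)
            gi_y, gi_x = grad_i(y, i), grad_i(x, i)
            y = [yk - h * (gk + a - c)
                 for yk, gk, a, c in zip(y, g, gi_y, gi_x)]
            evals += 2
        x = y
    return x, evals


x_s2gd, evals_s2gd = s2gd([0.0, 0.0], h=0.02, m=400, epochs=60)
```

Unlike plain SGD, the step size `h` stays constant: the correction term shrinks the noise as the iterates approach the optimum, so no decreasing schedule is needed in this sketch.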

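Convergence rate
  • How to set the parameters?
  • The rate is a sum of two terms, controlled by the step size $h$ and the inner loop size $m$:
    • For any fixed $h$, one term can be made arbitrarily small by increasing $m$
    • The other term can be made arbitrarily small by decreasing $h$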

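Setting the parameters
  • Fix target accuracy $\epsilon$
  • The accuracy $\epsilon$ is achieved by setting
    • # of epochs: $j = O(\log(1/\epsilon))$
    • # of iterations (inner loop): $m = O(\kappa)$, where $\kappa = L/\mu$ is the condition number
  • Total complexity (in gradient evaluations): # of epochs × (full gradient evaluation + cheap iterations) $= O(\log(1/\epsilon)) \cdot (n + O(\kappa))$
  • S2GD complexity: $O((n + \kappa) \log(1/\epsilon))$
  • GD complexity
    • $O(\kappa \log(1/\epsilon))$ iterations
    • $n$: complexity of a single iteration
    • Total: $O(n \kappa \log(1/\epsilon))$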
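A short arithmetic sketch of the two totals, with the constants hidden by the $O(\cdot)$ notation set to 1. The values of `n`, `kappa`, and `eps` are hypothetical, and the function names are made up for this example.

```python
import math


def s2gd_evals(n, kappa, eps):
    """(n + kappa) * log(1/eps), constants omitted."""
    return (n + kappa) * math.log(1.0 / eps)


def gd_evals(n, kappa, eps):
    """n * kappa * log(1/eps), constants omitted."""
    return n * kappa * math.log(1.0 / eps)


# Hypothetical problem size: a billion examples, condition number 1000.
n, kappa, eps = 10**9, 10**3, 1e-6
ratio = gd_evals(n, kappa, eps) / s2gd_evals(n, kappa, eps)
```

The logarithmic factor cancels, so the ratio is `n * kappa / (n + kappa)`. When `n` is much larger than `kappa`, this is close to `kappa`, so the saving grows with the condition number.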
Related Methods
  • SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
    • Refresh single stochastic gradient in each iteration
    • Need to store $n$ gradients
    • Similar convergence rate
    • Cumbersome analysis
  • MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
    • Similar to SAG, slightly worse performance
    • Elegant analysis
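To illustrate why SAG needs to store one gradient per example, here is a minimal sketch of a SAG-style update on a toy least-squares problem. The table initialised to zeros, the step size `1 / (16 * L)`, and the fixed seed are assumptions for illustration.

```python
import random

# Toy least-squares data (hypothetical): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
b = [1.0, -0.5, 0.2, 2.5]
n, d = len(A), len(A[0])


def grad_i(x, i):
    r = sum(a * xk for a, xk in zip(A[i], x)) - b[i]
    return [r * a for a in A[i]]


def full_grad(x):
    grads = [grad_i(x, i) for i in range(n)]
    return [sum(g[k] for g in grads) / n for k in range(d)]


def sag(x0, h, iters, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    table = [[0.0] * d for _ in range(n)]   # one stored gradient per f_i
    total = [0.0] * d                        # running sum of the table rows
    for _ in range(iters):
        i = rng.randrange(n)
        g_new = grad_i(x, i)                 # refresh a single gradient
        total = [s - old + new for s, old, new in zip(total, table[i], g_new)]
        table[i] = g_new
        # Step along the average of the stored (partly stale) gradients.
        x = [xk - (h / n) * s for xk, s in zip(x, total)]
    return x, table


L = max(sum(a * a for a in row) for row in A)
x_sag, table = sag([0.0, 0.0], h=1.0 / (16 * L), iters=5000)
```

The table holds `n * d` numbers, which is the memory cost that the stored-gradient approach pays and that a scheme recomputing one full gradient per epoch avoids.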
Related Methods
  • SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
    • Arises as a special case in S2GD
  • Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
    • Extended to proximal setting
  • EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
    • Handles simple constraints
    • Worse convergence rate
  • Example problem, with