# Semi-Stochastic Gradient Descent Methods






#### Presentation Transcript

Jakub Konečný (joint work with Peter Richtárik)

University of Edinburgh

### Large scale problem setting

• Problems are often structured

• Frequently arising in machine learning

Structure: the objective is a sum of functions,

$$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),$$

where the number of functions $n$ is BIG.

### Examples

• Linear regression (least squares): $f_i(x) = \tfrac{1}{2}\left(a_i^\top x - b_i\right)^2$

• Logistic regression (classification): $f_i(x) = \log\left(1 + \exp\left(-b_i\, a_i^\top x\right)\right)$
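To make the sum structure concrete, here is a minimal Python sketch of the least-squares instance; the data sizes and the arrays `A`, `b` are made up for illustration:

```python
import numpy as np

# Illustrative data: n examples, rows a_i of A with targets b_i (made up).
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def f_i(x, i):
    """One summand: f_i(x) = (a_i^T x - b_i)^2 / 2."""
    r = A[i] @ x - b[i]
    return 0.5 * r * r

def f(x):
    """The structured objective: f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([f_i(x, i) for i in range(n)])
```

Evaluating `f` (or its gradient) once already touches all `n` summands, which is exactly what becomes expensive when `n` is big.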

### Assumptions

• Lipschitz continuity of the derivative of each $f_i$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$

• Strong convexity of $f$: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\,\|y - x\|^2$

### Gradient Descent (GD)

• Update rule: $x_{k+1} = x_k - h\,\nabla f(x_k)$

• Fast (linear) convergence rate: $f(x_k) - f(x^*) \le \left(1 - \tfrac{\mu}{L}\right)^k \left(f(x_0) - f(x^*)\right)$

• Alternatively, for accuracy $\varepsilon$ we need $O\!\left(\tfrac{L}{\mu}\log\tfrac{1}{\varepsilon}\right)$ iterations

• Complexity of a single iteration: $O(n)$ gradient evaluations
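A minimal sketch of these GD iterations on a least-squares instance; the data and the step-size choice $h = 1/L$ are illustrative, not taken from the slides:

```python
import numpy as np

# Gradient descent sketch on an illustrative least-squares problem.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_f(x):
    # Full gradient of f(x) = (1/2n) ||Ax - b||^2 -- touches all n examples,
    # so a single iteration costs O(n).
    return A.T @ (A @ x - b) / n

L = np.linalg.eigvalsh(A.T @ A / n).max()   # Lipschitz constant of grad f
h = 1.0 / L                                 # fixed step size

x = np.zeros(d)
for _ in range(500):
    x = x - h * grad_f(x)   # x_{k+1} = x_k - h * grad f(x_k)
```

Each iteration makes steady progress, but costs a full pass over the data.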

### Stochastic Gradient Descent (SGD)

• Update rule: $x_{k+1} = x_k - h_k\,\nabla f_i(x_k)$, with $i$ picked uniformly at random and $h_k$ a step-size parameter

• Why it works: the stochastic gradient is unbiased, $\mathbb{E}_i\left[\nabla f_i(x)\right] = \nabla f(x)$

• Slow (sublinear) convergence

• Complexity of a single iteration: $O(1)$, independent of $n$
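A matching SGD sketch; the planted solution and the decreasing step-size schedule are illustrative choices, not prescribed by the slides:

```python
import numpy as np

# SGD sketch on an illustrative least-squares problem with a planted solution.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ np.ones(d) + 0.1 * rng.standard_normal(n)   # planted solution: x = ones

def grad_f_i(x, i):
    # Stochastic gradient of one summand: unbiased, E_i[grad f_i(x)] = grad f(x).
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
for k in range(20000):
    i = rng.integers(n)              # O(1) work per step, independent of n
    h_k = 0.05 / (1.0 + 0.001 * k)   # decreasing step size h_k
    x = x - h_k * grad_f_i(x, i)
```

Each step is cheap, but the stochastic noise forces the step size down over time, which is what makes the convergence slow.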

### Goal

|                         | GD                  | SGD                |
|-------------------------|---------------------|--------------------|
| Convergence             | fast                | slow               |
| Complexity of iteration | proportional to $n$ | independent of $n$ |

Combine the strengths of both in a single algorithm.

### Intuition

• The gradient does not change drastically between nearby points

• We could reuse the information from an "old" gradient

• Imagine someone gives us a "good" point $y$ and the full gradient $\nabla f(y)$

• The gradient at a point $x$, near $y$, can be expressed as

$$\nabla f(x) = \nabla f(y) + \left(\nabla f(x) - \nabla f(y)\right)$$

We can try to estimate the change $\nabla f(x) - \nabla f(y)$ cheaply, using a single stochastic term $\nabla f_i(x) - \nabla f_i(y)$.
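This intuition can be checked numerically: the estimate $\nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$ is unbiased for $\nabla f(x)$. A sketch with illustrative least-squares data:

```python
import numpy as np

# Variance-reduced gradient estimate: g = grad f(y) + grad f_i(x) - grad f_i(y).
# Data and points are illustrative; y plays the role of the "good" old point.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_f(x):
    return A.T @ (A @ x - b) / n

def grad_f_i(x, i):
    return A[i] * (A[i] @ x - b[i])

y = rng.standard_normal(d)              # old point with known full gradient
g_y = grad_f(y)
x = y + 0.01 * rng.standard_normal(d)   # current point, near y

i = rng.integers(n)
g = g_y + grad_f_i(x, i) - grad_f_i(y, i)   # estimate of grad f(x)
```

Averaging the estimate over all $i$ recovers $\nabla f(x)$ exactly, and because $x$ is close to $y$ the correction term is small, so the estimate has low variance.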

### The S2GD Algorithm

The method runs in epochs. At the start of each epoch, the full gradient $g = \nabla f(y)$ is computed at the current outer point $y$; the inner loop then takes cheap stochastic steps $x \leftarrow x - h\left(g + \nabla f_i(x) - \nabla f_i(y)\right)$ with $i$ sampled uniformly at random. Simplification: the size of the inner loop is random, following a geometric rule.
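A minimal S2GD sketch on illustrative least-squares data. The step size `h`, the lower bound `nu` on the strong-convexity parameter, the loop sizes, and the data are made-up choices; the geometric rule for the random inner-loop length follows the description in the talk.

```python
import numpy as np

# S2GD sketch; h, nu, m, and the number of epochs are illustrative choices.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_f(x):
    return A.T @ (A @ x - b) / n

def grad_f_i(x, i):
    return A[i] * (A[i] @ x - b[i])

def s2gd(y, epochs=20, m=1000, h=0.02, nu=0.1):
    for _ in range(epochs):
        g = grad_f(y)                    # one expensive full gradient per epoch
        # Random inner-loop length t in {1, ..., m}: P(t) ~ (1 - nu*h)^(m - t)
        w = (1.0 - nu * h) ** np.arange(m - 1, -1, -1)
        t = rng.choice(np.arange(1, m + 1), p=w / w.sum())
        x = y.copy()
        for _ in range(t):
            i = rng.integers(n)          # cheap variance-reduced inner steps
            x = x - h * (g + grad_f_i(x, i) - grad_f_i(y, i))
        y = x                            # next outer point
    return y

y = s2gd(np.zeros(d))
```

Note that each epoch costs one full pass over the data plus many cheap stochastic steps, yet the variance reduction lets the method keep a constant step size.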

### Convergence rate

• How to set the parameters: the inner-loop size $m$ and the stepsize $h$?

• One epoch contracts the expected suboptimality by a factor of the form (as in the analysis of the closely related SVRG method)

$$c = \frac{1}{\mu h \left(1 - 2Lh\right) m} + \frac{2Lh}{1 - 2Lh}$$

For any fixed $h$, the first term can be made arbitrarily small by increasing $m$. The second term can be made arbitrarily small by decreasing $h$.

### Setting the parameters

• Fix a target accuracy $\varepsilon$. The accuracy is achieved by setting the number of epochs $j = O\!\left(\log \tfrac{1}{\varepsilon}\right)$, together with a sufficiently small stepsize $h$ and a sufficiently large number of inner iterations $m$ per epoch

• Total complexity (in gradient evaluations): each epoch costs $n$ evaluations for the full gradient plus at most $2m$ evaluations in the cheap inner iterations, for a total of $j \times (n + 2m)$

### Complexity

• S2GD complexity: $O\left((n + \kappa)\log \tfrac{1}{\varepsilon}\right)$ gradient evaluations, where $\kappa = L/\mu$ is the condition number

• GD complexity:

• $O\left(\kappa \log \tfrac{1}{\varepsilon}\right)$ iterations

• $O(n)$ complexity of a single iteration

• Total: $O\left(n \kappa \log \tfrac{1}{\varepsilon}\right)$
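The gap between the two bounds is easy to appreciate numerically; the values of `n`, `kappa`, and `eps` below are hypothetical:

```python
from math import ceil, log

# Hypothetical problem sizes: n examples, condition number kappa = L/mu,
# target accuracy eps.
n, kappa, eps = 10**6, 10**3, 1e-6

gd_evals = ceil(kappa * log(1 / eps)) * n       # O(n * kappa * log(1/eps))
s2gd_evals = ceil((n + kappa) * log(1 / eps))   # O((n + kappa) * log(1/eps))

print(gd_evals, s2gd_evals)
```

For these sizes S2GD needs roughly three orders of magnitude fewer gradient evaluations, because the $n$ and $\kappa$ factors add instead of multiply.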

### Related Methods

• SAG – Stochastic Average Gradient

(Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)

• Refresh single stochastic gradient in each iteration

• Similar convergence rate

• Cumbersome analysis

• MISO - Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)

• Similar to SAG, slightly worse performance

• Elegant analysis

### Related Methods

• SVRG – Stochastic Variance Reduced Gradient

(Rie Johnson, Tong Zhang, 2013)

• Arises as a special case in S2GD

• Prox-SVRG

(Lin Xiao, Tong Zhang, 2014)

• Extended to proximal setting

• EMGD – Epoch Mixed Gradient Descent

(Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)

• Handles simple constraints

• Worse convergence rate

### Experiment

• Example problem, with