
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty



### Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester

### Log-linear models in NLP

- Maximum entropy models
- Text classification (Nigam et al., 1999)
- History-based approaches (Ratnaparkhi, 1998)

- Conditional random fields
- Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.

- Structured prediction
- Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.

### Log-linear models

- Log-linear (a.k.a. maximum entropy) model:

$$p(y \mid x; w) = \frac{\exp\left(\sum_i w_i f_i(x, y)\right)}{Z(x)}$$

where $w_i$ is a weight, $f_i$ is a feature function, and $Z(x) = \sum_{y'} \exp\left(\sum_i w_i f_i(x, y')\right)$ is the partition function.

- Training
  - Maximize the conditional likelihood of the training data

### Regularization

- To avoid overfitting to the training data
- Penalize the weights of the features

- L1 regularization
- Most of the weights become zero
- Produces sparse (compact) models
- Saves memory and storage
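Written out, the L1-regularized objective to maximize is (a standard formulation; $C$ controls the regularization strength and $N$ is the number of training samples):

$$\mathcal{L}(w) = \sum_{j=1}^{N} \log p(y_j \mid x_j; w) \;-\; C \sum_i |w_i|$$

The second term is what drives many weights $w_i$ to exactly zero and makes the trained model sparse.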

### Training log-linear models

- Numerical optimization methods
- Gradient descent (steepest descent or hill-climbing)
- Quasi-Newton methods (e.g. BFGS, OWL-QN)
- Stochastic Gradient Descent (SGD)
- etc.

- Training can take several hours (or even days), depending on the complexity of the model, the size of training data, etc.

### Gradient Descent (Hill Climbing)

[Figure: objective value improving over full-batch gradient steps]

### Stochastic Gradient Descent (SGD)

Compute an approximate gradient using one training sample.

[Figure: objective value over noisy single-sample updates]

### Stochastic Gradient Descent (SGD)

- Weight update procedure
  - very simple (similar to the Perceptron algorithm)

$$w^{k+1} = w^k + \eta_k \, \frac{\partial}{\partial w}\left( \log p(y_j \mid x_j; w^k) - \frac{C}{N} \sum_i |w_i| \right)$$

where $\eta_k$ is the learning rate. The L1 term is not differentiable at $w_i = 0$.

### Using subgradients

- Weight update procedure:

$$w_i^{k+1} = w_i^k + \eta_k \left( \frac{\partial \log p(y_j \mid x_j; w^k)}{\partial w_i} - \frac{C}{N} \operatorname{sign}(w_i^k) \right)$$

taking $\operatorname{sign}(0) = 0$ as the subgradient of $|w_i|$ at zero.
### Using subgradients

- Problems
- L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
- Few weights become zero as a result of training.

### Clipping-at-zero approach

- Carpenter (2008)
- Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy update

[Figure: the weight w is clipped to zero when the L1 penalty would push it across zero]
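A minimal sketch of the clipping-at-zero update for one weight (hypothetical function name; the lazy-update bookkeeping is omitted):

```python
def clipped_step(w, grad, eta, c):
    """One SGD step with the L1 penalty clipped at zero.

    The gradient step is taken first; the L1 penalty (eta * c) is then
    applied, but is never allowed to push the weight past zero.
    """
    w = w + eta * grad  # gradient step on the log-likelihood term
    if w > 0:
        return max(0.0, w - eta * c)
    elif w < 0:
        return min(0.0, w + eta * c)
    return 0.0
```

A weight whose penalty would carry it across zero is truncated to exactly 0.0, which is what produces some sparsity.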

### Clipping-at-zero approach

[Figures: results on text chunking, named entity recognition, and part-of-speech tagging]

### Why it does not produce sparse models

- In SGD, weights are not updated smoothly

[Figure: a weight oscillating around zero fails to become exactly zero, and the L1 penalty is wasted away]

### Cumulative L1 penalty

- $u_k$: the absolute value of the total L1 penalty that each weight *could* have received up to step $k$
- $q_i$: the total L1 penalty that has *actually* been applied to weight $w_i$

### Applying L1 with cumulative penalty

- Penalize each weight according to the difference between the total penalty it could have received ($u_k$) and the penalty it has actually received ($q_i$):

$$w_i \leftarrow \begin{cases} \max\left(0,\; w_i - (u_k + q_i)\right) & \text{if } w_i > 0 \\ \min\left(0,\; w_i + (u_k - q_i)\right) & \text{if } w_i < 0 \end{cases}$$

### Implementation

10 lines of code!
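A Python sketch of the update (variable names `u` and `q` follow the definitions above; the gradient computation and feature extraction are stubbed out as comments):

```python
def apply_penalty(w, q, i, u):
    """Apply the cumulative L1 penalty to weight i.

    u    : total L1 penalty every weight could have received so far
    q[i] : total L1 penalty actually applied to weight i so far
    """
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z  # record how much penalty was really applied

# Training loop (sketch): grad(j, i) would return the stochastic
# gradient of the log-likelihood of sample j w.r.t. feature i.
#
# u = 0.0
# for j, sample in enumerate(samples):
#     u += eta * C / N
#     for i in features_of(sample):
#         w[i] += eta * grad(j, i)
#         apply_penalty(w, q, i, u)
```

Because `q[i]` tracks exactly how much penalty weight `i` has missed, each weight ends up receiving the total penalty it would have received under smooth updates, and weights are only touched when their features fire (lazy update).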

### Experiments

- Model: Conditional Random Fields (CRFs)
- Baseline: OWL-QN (Andrew and Gao, 2007)
- Tasks
  - Text chunking (shallow parsing)
    - CoNLL 2000 shared task data
    - Recognize base syntactic phrases (e.g. NP, VP, PP)
  - Named entity recognition
    - NLPBA 2004 shared task data
    - Recognize names of genes, proteins, etc.
  - Part-of-speech (POS) tagging
    - WSJ corpus (sections 0-18 for training)

### CoNLL 2000 chunking task: objective

[Figure: objective value during training]

### CoNLL 2000 chunking: non-zero features

[Figure: number of non-zero features during training]

### CoNLL 2000 chunking

- Performance of the produced model
  - Training is 4 times faster than OWL-QN
  - The model is 4 times smaller than the clipping-at-zero approach
  - The objective is also slightly better

### NLPBA 2004 named entity recognition

### Part-of-speech tagging on WSJ

### Discussions

- Convergence
  - Demonstrated empirically
  - Penalties applied are not i.i.d.
- Learning rate
  - The need for tuning can be annoying
  - Rule of thumb: exponential decay (passes = 30, alpha = 0.85)

### Conclusions

- Stochastic gradient descent training for L1-regularized log-linear models
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available

- 3 to 4 times faster than OWL-QN
- Extremely easy to implement
