Loading in 2 Seconds...

Stochastic Gradient Descent Training for L1-regularizaed Log-linear Models with Cumulative Penalty

Loading in 2 Seconds...

- 102 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Stochastic Gradient Descent Training for L1-regularizaed Log-linear Models with Cumulative Penalty' - marlin

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Stochastic Gradient Descent Training for L1-regularizaed Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester

Log-linear models in NLP

- Maximum entropy models
- Text classification (Nigam et al., 1999)
- History-based approaches (Ratnaparkhi, 1998)
- Conditional random fields
- Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction
- Parsing (Clark and Curan, 2004), Semantic Role Labeling (Toutanova et al, 2005), etc.

Log-linear models

- Log-linear (a.k.a. maximum entropy) model
- Training
- Maximize the conditional likelihood of the training data

Weight

Feature function

Partition function:

Regularization

- To avoid overfitting to the training data
- Penalize the weights of the features
- L1 regularization
- Most of the weights become zero
- Produces sparse (compact) models
- Saves memory and storage

Training log-linear models

- Numerical optimization methods
- Gradient descent (steepest descent or hill-climbing)
- Quasi-Newton methods (e.g. BFGS, OWL-QN)
- Stochastic Gradient Descent (SGD)
- etc.
- Training can take several hours (or even days), depending on the complexity of the model, the size of training data, etc.

Gradient Descent (Hill Climbing)

objective

Stochastic Gradient Descent (SGD)

- Weight update procedure
- very simple (similar to the Perceptron algorithm)

Not differentiable

: learning rate

Using subgradients

- Weight update procedure

Using subgradients

- Problems
- L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
- Few weights become zero as a result of training.

Clipping-at-zero approach

- Carpenter (2008)
- Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy update

w

Text chunking

- Named entity recognition
- Part-of-speech tagging

Why it does not produce sparse models

- In SGD, weights are not updated smoothly

Fails to become

zero!

L1 penalty is wasted away

Cumulative L1 penalty

- The absolute value of the total L1 penalty which should have been applied to each weight
- The total L1 penalty which has actually been applied to each weight

Applying L1 with cumulative penalty

- Penalize each weight according to the difference between and

Implementation

10 lines of code!

Experiments

- Model: Conditional Random Fields (CRFs)
- Baseline: OWL-QN (Andrew and Gao, 2007)
- Tasks
- Text chunking (shallow parsing)
- CoNLL 2000 shared task data
- Recognize base syntactic phrases (e.g. NP, VP, PP)
- Named entity recognition
- NLPBA 2004 shared task data
- Recognize names of genes, proteins, etc.
- Part-of-speech (POS) tagging
- WSJ corpus (sections 0-18 for training)

CoNLL 2000 chunking

- Performance of the produced model

- Training is 4 times faster than OWL-QN
- The model is 4 times smaller than the clipping-at-zero approach
- The objective is also slightly better

NLPBA 2004 named entity recognition

Part-of-speech tagging on WSJ

Discussions

- Convergence
- Demonstrated empirically
- Penalties applied are not i.i.d.
- Learning rate
- The need for tuning can be annoying
- Rule of thumb:
- Exponential decay (passes = 30, alpha = 0.85)

Conclusions

- Stochastic gradient descent training for L1-regularized log-linear models
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
- 3 to 4 times faster than OWL-QN
- Extremely easy to implement

Download Presentation

Connecting to Server..