
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester



Log-linear models in NLP

  • Maximum entropy models

    • Text classification (Nigam et al., 1999)

    • History-based approaches (Ratnaparkhi, 1998)

  • Conditional random fields

    • Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.

  • Structured prediction

    • Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.



Log-linear models

  • Log-linear (a.k.a. maximum entropy) model

  • Training

    • Maximize the conditional likelihood of the training data

The model has the form

    p(y \mid x) = \frac{\exp\left( \sum_i w_i f_i(x, y) \right)}{Z(x)}

where w_i is a weight, f_i(x, y) is a feature function, and the partition function is

    Z(x) = \sum_{y'} \exp\left( \sum_i w_i f_i(x, y') \right)
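Training then maximizes the conditional log-likelihood of the N training examples; in standard notation (a reconstruction, not text from the slides):

```latex
L(w) = \sum_{t=1}^{N} \log p(y_t \mid x_t; w)
```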



Regularization

  • To avoid overfitting to the training data

    • Penalize the weights of the features

  • L1 regularization

    • Most of the weights become zero

    • Produces sparse (compact) models

    • Saves memory and storage
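With L1 regularization, the objective above gains a penalty term; a standard way to write it (the regularization constant C is my notation) is:

```latex
L_{L1}(w) = \sum_{t=1}^{N} \log p(y_t \mid x_t; w) \;-\; C \sum_i |w_i|
```

The second term is what drives most weights to exactly zero and makes the resulting model sparse.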



Training log-linear models

  • Numerical optimization methods

    • Gradient descent (steepest descent or hill-climbing)

    • Quasi-Newton methods (e.g. BFGS, OWL-QN)

    • Stochastic Gradient Descent (SGD)

    • etc.

  • Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.



Gradient Descent (Hill Climbing)

[Figure: the objective function, maximized by following its exact gradient]



Stochastic Gradient Descent (SGD)

[Figure: the objective function. At each step, an approximate gradient is computed using one training sample.]
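In standard notation, the basic (unregularized) SGD step for a randomly chosen training example (x_j, y_j) is:

```latex
w^{k+1} = w^{k} + \eta_k \, \nabla_w \log p(y_j \mid x_j; w^{k})
```

where \eta_k is the learning rate at step k.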



Stochastic Gradient Descent (SGD)

  • Weight update procedure

    • very simple (similar to the Perceptron algorithm)

For one training example (x_j, y_j), the L1-regularized update is

    w^{k+1} = w^{k} + \eta_k \frac{\partial}{\partial w}\left( \log p(y_j \mid x_j; w^{k}) - \frac{C}{N} \sum_i |w_i| \right)

where \eta_k is the learning rate. The L1 term \sum_i |w_i| is not differentiable at zero.



Using subgradients

  • Weight update procedure



Using subgradients

  • Problems

    • L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).

    • Few weights become zero as a result of training.



Clipping-at-zero approach

  • Carpenter (2008)

  • Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)

  • Enables lazy update

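Concretely, the clipping-at-zero rule first takes the unregularized gradient step and then moves the weight toward zero by the L1 penalty for that step, clipping at zero instead of letting the penalty flip its sign (a sketch in my notation of the rule attributed to Carpenter (2008)):

```latex
w_i^{k+\frac{1}{2}} = w_i^{k} + \eta_k \frac{\partial \log p(y_j \mid x_j; w^{k})}{\partial w_i}, \qquad
w_i^{k+1} =
\begin{cases}
\max\bigl(0,\; w_i^{k+\frac{1}{2}} - \eta_k \tfrac{C}{N}\bigr) & \text{if } w_i^{k+\frac{1}{2}} > 0 \\
\min\bigl(0,\; w_i^{k+\frac{1}{2}} + \eta_k \tfrac{C}{N}\bigr) & \text{if } w_i^{k+\frac{1}{2}} < 0
\end{cases}
```

Because a weight that would cross zero is simply set to zero, the penalty can be applied lazily, only when a feature is actually used.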



Clipping-at-zero approach



  • Text chunking

  • Named entity recognition

  • Part-of-speech tagging



Why it does not produce sparse models

  • In SGD, weights are not updated smoothly

[Figure: trajectory of a single weight during SGD training. Because the updates are noisy, the weight fails to become exactly zero, and the L1 penalty is wasted away.]



Cumulative L1 penalty

  • The absolute value of the total L1 penalty which should have been applied to each weight

  • The total L1 penalty which has actually been applied to each weight
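In symbols (a reconstruction, so the exact notation may differ from the paper), the two quantities at update step k are

```latex
u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t , \qquad
q_i^{k} = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)
```

where u_k is the cumulative penalty every weight should have received by step k, and q_i^k is the (signed) total penalty that has actually been applied to weight i.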



Applying L1 with cumulative penalty

  • Penalize each weight according to the difference between the total penalty it should have received and the total penalty it has actually received, as written out below
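Written out (my reconstruction, with w_i^{k+1/2} denoting the weight right after the unregularized gradient step):

```latex
w_i^{k+1} =
\begin{cases}
\max\bigl(0,\; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1})\bigr) & \text{if } w_i^{k+\frac{1}{2}} > 0 \\
\min\bigl(0,\; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1})\bigr) & \text{if } w_i^{k+\frac{1}{2}} < 0
\end{cases}
```

so each weight is moved toward zero by exactly the portion of the cumulative penalty it has not yet received, and is clipped at zero if that portion is larger than the weight itself.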



Implementation

10 lines of code!
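The code itself appeared as an image on the slide. Below is a minimal Python sketch of SGD training with the cumulative L1 penalty as described in the paper; all names (apply_penalty, sgd_l1_train, eta0, alpha, and the gradient callback) are mine, and the learning-rate schedule follows the rule of thumb from the Discussions slide, so treat it as an illustration rather than the authors' code.

```python
def apply_penalty(i, w, q, u):
    """Move w[i] toward zero by at most the cumulative L1 penalty it has not yet received."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z  # record the (signed) penalty actually applied to this weight


def sgd_l1_train(samples, gradient, n_features, C, eta0=0.1, alpha=0.85, passes=30):
    """samples: training examples; gradient(x, w) -> {feature index: d(log-likelihood)/d(w_i)}."""
    w = [0.0] * n_features   # weights
    q = [0.0] * n_features   # total L1 penalty actually applied to each weight so far
    u = 0.0                  # total L1 penalty every weight should have received so far
    n = len(samples)
    k = 0
    for _ in range(passes):
        for x in samples:
            eta = eta0 * alpha ** (k / n)   # exponentially decaying learning rate
            u += eta * C / n                # accumulate the penalty owed per weight
            for i, g in gradient(x, w).items():
                w[i] += eta * g             # plain SGD step on the features active in this sample
                apply_penalty(i, w, q, u)   # then apply the unreceived cumulative L1 penalty
            k += 1
    return w
```

Only the features that occur in the current sample are touched, so rarely used features are penalized lazily, through u, the next time they are updated.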



Experiments

  • Model: Conditional Random Fields (CRFs)

  • Baseline: OWL-QN (Andrew and Gao, 2007)

  • Tasks

    • Text chunking (shallow parsing)

      • CoNLL 2000 shared task data

      • Recognize base syntactic phrases (e.g. NP, VP, PP)

    • Named entity recognition

      • NLPBA 2004 shared task data

      • Recognize names of genes, proteins, etc.

    • Part-of-speech (POS) tagging

      • WSJ corpus (sections 0-18 for training)



CoNLL 2000 chunking task: objective



CoNLL 2000 chunking: non-zero features



CoNLL 2000 chunking

  • Performance of the produced model

  • Training is 4 times faster than OWL-QN

  • The model is 4 times smaller than the clipping-at-zero approach

  • The objective is also slightly better



NLPBA 2004 named entity recognition

Part-of-speech tagging on WSJ



Discussions

  • Convergence

    • Demonstrated empirically

    • Penalties applied are not i.i.d.

  • Learning rate

    • The need for tuning can be annoying

    • Rule of thumb:

      • Exponential decay (passes = 30, alpha = 0.85)
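One common way to implement such exponential decay (this parameterization is an assumption, not necessarily the paper's exact schedule) is

```latex
\eta_k = \eta_0 \, \alpha^{\,k/N}
```

so the learning rate shrinks by a factor of \alpha after each pass over the N training examples; with \alpha = 0.85 and 30 passes it ends at roughly 0.85^{30} \approx 0.008 of its initial value.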



Conclusions

  • Stochastic gradient descent training for L1-regularized log-linear models

    • Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available

  • 3 to 4 times faster than OWL-QN

  • Extremely easy to implement

