
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou

University of Manchester


Log-linear models in NLP

  • Maximum entropy models

    • Text classification (Nigam et al., 1999)

    • History-based approaches (Ratnaparkhi, 1998)

  • Conditional random fields

    • Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.

  • Structured prediction

    • Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.


Log-linear models

  • Log-linear (a.k.a. maximum entropy) model

  • Training

    • Maximize the conditional likelihood of the training data

$$
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\left( \sum_i w_i\, f_i(x, y) \right),
\qquad
Z(x) \;=\; \sum_{y'} \exp\left( \sum_i w_i\, f_i(x, y') \right)
$$

where $w_i$ is a weight, $f_i$ is a feature function, and $Z(x)$ is the partition function.
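As a concrete illustration of the model above, here is a minimal Python sketch that computes p(y | x); the `weights`, `feature_fn`, and `labels` data structures are assumptions for the example, not part of the slides:

```python
import math

def loglinear_prob(weights, feature_fn, x, y, labels):
    """p(y | x) under a log-linear (maximum entropy) model.

    weights    : dict feature_id -> weight w_i
    feature_fn : function (x, label) -> dict feature_id -> value f_i(x, label)
    labels     : all candidate outputs y'
    (Names and data structures are illustrative assumptions.)
    """
    def score(label):
        return sum(weights.get(i, 0.0) * v
                   for i, v in feature_fn(x, label).items())

    z = sum(math.exp(score(label)) for label in labels)  # partition function Z(x)
    return math.exp(score(y)) / z
```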


Regularization

  • To avoid overfitting to the training data

    • Penalize the weights of the features

  • L1 regularization

    • Most of the weights become zero

    • Produces sparse (compact) models

    • Saves memory and storage
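Written out (notation reconstructed here, not shown in the transcript), the L1-regularized training objective to maximize is

$$
\sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) \;-\; C \sum_{i} |w_i|
$$

where $C$ controls the strength of the L1 penalty and $N$ is the number of training samples.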


Training log-linear models

  • Numerical optimization methods

    • Gradient descent (steepest descent or hill-climbing)

    • Quasi-Newton methods (e.g. BFGS, OWL-QN)

    • Stochastic Gradient Descent (SGD)

    • etc.

  • Training can take several hours (or even days), depending on the complexity of the model, the size of training data, etc.


Gradient Descent (Hill Climbing)

[Figure: objective value over gradient descent updates]


Stochastic Gradient Descent (SGD)

  • Compute an approximate gradient using one training sample

[Figure: objective value over SGD updates]


Stochastic Gradient Descent (SGD)

  • Weight update procedure

    • Very simple (similar to the Perceptron algorithm)

$$
w_i^{(t+1)} \;=\; w_i^{(t)} + \eta_t\, \frac{\partial}{\partial w_i}\left( \log p(y_j \mid x_j; \mathbf{w}^{(t)}) \;-\; \frac{C}{N} \sum_{i'} |w_{i'}| \right)
$$

where $\eta_t$ is the learning rate; the L1 term $|w_i|$ is not differentiable at zero.

Using subgradients

  • Weight update procedure

$$
w_i^{(t+1)} \;=\; w_i^{(t)} + \eta_t\, \frac{\partial}{\partial w_i} \log p(y_j \mid x_j; \mathbf{w}^{(t)}) \;-\; \eta_t\, \frac{C}{N}\, \mathrm{sign}\!\left(w_i^{(t)}\right)
$$

Using subgradients

  • Problems

    • L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).

    • Few weights become zero as a result of training.


Clipping-at-zero approach

  • Carpenter (2008)

  • Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)

  • Enables lazy update

[Figure: clipping the weight w at zero]
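A minimal sketch of the clipping-at-zero update for a single weight, assuming a learning rate `eta` and a per-sample regularization strength `c_over_n` (= C/N); the function and variable names are illustrative:

```python
def clipped_update(w_i, grad_i, eta, c_over_n):
    """One SGD step followed by an L1 penalty that is clipped at zero."""
    w_i += eta * grad_i                        # ordinary stochastic gradient step
    if w_i > 0:
        w_i = max(0.0, w_i - eta * c_over_n)   # penalty cannot push w_i below zero
    elif w_i < 0:
        w_i = min(0.0, w_i + eta * c_over_n)   # penalty cannot push w_i above zero
    return w_i
```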


Clipping-at-zero approach

  • Text chunking

  • Named entity recognition

  • Part-of-speech tagging


Why it does not produce sparse models

  • In SGD, weights are not updated smoothly

[Figure: a weight oscillating around zero fails to become exactly zero, so the L1 penalty is wasted]


Cumulative L1 penalty

  • $u_t = \frac{C}{N} \sum_{k=1}^{t} \eta_k$ : the absolute value of the total L1 penalty which should have been applied to each weight

  • $q_i^{(t)} = \sum_{k=1}^{t} \left( w_i^{(k+1)} - w_i^{(k+\frac{1}{2})} \right)$ : the total L1 penalty which has actually been applied to each weight


Applying L1 with cumulative penalty

  • Penalize each weight according to the difference between $u_t$ and $q_i^{(t-1)}$ (written out below)
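Following the definitions on the previous slide, with $w_i^{(t+\frac{1}{2})}$ denoting the weight right after the gradient step at time $t$, the penalty step can be written as

$$
w_i^{(t+1)} =
\begin{cases}
\max\!\left(0,\; w_i^{(t+\frac{1}{2})} - \left(u_t + q_i^{(t-1)}\right)\right) & \text{if } w_i^{(t+\frac{1}{2})} > 0 \\[4pt]
\min\!\left(0,\; w_i^{(t+\frac{1}{2})} + \left(u_t - q_i^{(t-1)}\right)\right) & \text{if } w_i^{(t+\frac{1}{2})} < 0
\end{cases}
$$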


Implementation

10 lines of code!
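A minimal Python sketch of the penalty routine described on the previous slides (roughly the "10 lines"); the surrounding SGD loop, gradient computation, and the bookkeeping that grows u after each sample are omitted, and the variable names are assumptions:

```python
def apply_penalty(w, q, i, u):
    """Apply the cumulative L1 penalty to weight i.

    w : list of weights
    q : total L1 penalty actually applied to each weight so far
    u : absolute value of the total L1 penalty each weight should
        have received so far, i.e. (C/N) * (sum of learning rates)
    """
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))   # push toward zero, never across it
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z                         # record the penalty actually applied
```

The penalty only needs to be applied to the features that are active in the current sample, which is what makes the lazy update possible.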


Experiments

  • Model: Conditional Random Fields (CRFs)

  • Baseline: OWL-QN (Andrew and Gao, 2007)

  • Tasks

    • Text chunking (shallow parsing)

      • CoNLL 2000 shared task data

      • Recognize base syntactic phrases (e.g. NP, VP, PP)

    • Named entity recognition

      • NLPBA 2004 shared task data

      • Recognize names of genes, proteins, etc.

    • Part-of-speech (POS) tagging

      • WSJ corpus (sections 0-18 for training)


CoNLL 2000 chunking task: objective


CoNLL 2000 chunking: non-zero features


CoNLL 2000 chunking

  • Performance of the produced model

  • Training is 4 times faster than OWL-QN

  • The model is 4 times smaller than the one produced by the clipping-at-zero approach

  • The objective is also slightly better


NLPBA 2004 named entity recognition

Part-of-speech tagging on WSJ


Discussions

  • Convergence

    • Demonstrated empirically

    • The penalties applied at each step are not i.i.d., so standard convergence arguments do not directly apply

  • Learning rate

    • The need for tuning can be annoying

    • Rule of thumb:

      • Exponential decay (passes = 30, alpha = 0.85)
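A small sketch of an exponentially decaying schedule of this kind; `eta0` (the initial rate) and the exact functional form are assumptions, since the slide only fixes the number of passes (30) and alpha (0.85):

```python
def learning_rate(k, n_samples, eta0=1.0, alpha=0.85):
    """Learning rate after k single-sample updates.

    The rate decays by a factor of alpha once per pass over the
    n_samples training examples; eta0 is an assumed initial value.
    """
    return eta0 * alpha ** (k / n_samples)
```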


Conclusions

  • Stochastic gradient descent training for L1-regularized log-linear models

    • Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available

  • 3 to 4 times faster than OWL-QN

  • Extremely easy to implement

