
Scalable training of L1-regularized log-linear models



  1. Scalable training of L1-regularized log-linear models Galen Andrew (Joint work with Jianfeng Gao) ICML, 2007

  2. Minimizing regularized loss • Many parametric ML models are trained by minimizing a regularized loss of the form f(w) = loss(w) + r(w) • loss(w) is a loss function quantifying "fit to the data" • Negative log-likelihood of training data • Distance from decision boundary of incorrect examples • If zero is a reasonable "default" parameter value, we can use r(w) = C ||w||, where ||·|| is a norm penalizing large vectors and C is a constant (a sketch of this objective follows below)
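As a concrete instance of this template, here is a minimal sketch, not the paper's parse-reranking model; the binary logistic loss and the NumPy-based names are illustrative assumptions:

```python
import numpy as np

def l1_regularized_objective(w, X, y, C):
    """f(w) = loss(w) + C * ||w||_1 with a binary logistic loss.

    X is an (examples x features) matrix, y holds labels in {-1, +1},
    and C controls the strength of the L1 penalty.
    """
    margins = y * (X @ w)
    loss = np.sum(np.logaddexp(0.0, -margins))   # sum_i log(1 + exp(-y_i * x_i . w))
    return loss + C * np.sum(np.abs(w))
```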

  3. Types of norms • A norm precisely defines the "size" of a vector (figures: contours of the L2-norm and of the L1-norm in 2D)

  4. A nice property of L1 • Gradients of the L2- and L1-norms (figures): the negative gradient of the L2-norm always points directly toward 0, while the "negative gradient" of the L1-norm (direction of steepest descent) points toward the coordinate axes (a sketch follows below)
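A minimal sketch of the two norms and their (sub)gradients, assuming nothing beyond NumPy; the function names are illustrative:

```python
import numpy as np

def l2_norm(w):
    return np.sqrt(np.sum(w ** 2))

def l1_norm(w):
    return np.sum(np.abs(w))

def l2_norm_gradient(w):
    # Points directly away from the origin (undefined at w = 0), so the
    # negative gradient points straight back toward 0.
    return w / l2_norm(w)

def l1_norm_subgradient(w):
    # np.sign(w) picks one valid subgradient (0 at coordinates that are
    # exactly zero); its negative pushes every nonzero coordinate toward
    # zero, i.e. toward the coordinate axes.
    return np.sign(w)
```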

  5. A nice property of L1 • 1-D slice of an L1-regularized objective (figure): the sharp bend at zero causes the optimal value to occur exactly at x = 0
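As an illustrative worked example (not from the slides): for the 1-D objective (x - 0.3)^2 + |x|, the one-sided derivatives at x = 0 are 2(0 - 0.3) + 1 = 0.4 on the right and 2(0 - 0.3) - 1 = -1.6 on the left, so the objective rises in both directions away from zero and the minimum sits exactly at x = 0.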

  6. A nice property of L1 • At the global optimum, many parameters have a value of exactly zero • L2 would give small, nonzero values • Thus L1 does continuous feature selection • More interpretable, computationally manageable models • The C parameter tunes the sparsity/accuracy tradeoff • In our experiments, only 1.5% of features remain

  7. A nasty property of L1 • The sharp bend at zero is also a problem: the objective is non-differentiable there (the gradient is undefined at the bend) • Cannot solve with standard gradient-based methods

  8. Digression: Newton’s method • To optimize a function f: • Form the 2nd-order Taylor expansion around x0: f(x) ≈ f(x0) + ∇f(x0)·(x − x0) + ½ (x − x0)·H·(x − x0) • Jump to its minimum: xnew = x0 − H⁻¹∇f(x0) (actually, line search in the direction of xnew) • Repeat • Sort of an ideal • In practice, H is too large (n × n for n variables); see the sketch below
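A minimal sketch of one idealized Newton step under these assumptions (grad and hess are hypothetical callables returning the gradient and the full Hessian; forming and solving with H is exactly what becomes impractical at scale):

```python
import numpy as np

def newton_step(grad, hess, x0):
    """Minimize the 2nd-order Taylor expansion of f around x0 by solving
    H d = -g, giving x_new = x0 - H^{-1} grad f(x0). In practice a line
    search toward x_new would follow."""
    g = grad(x0)
    H = hess(x0)
    d = np.linalg.solve(H, -g)   # Newton direction; O(n^3) with a dense H
    return x0 + d
```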

  9. Limited-Memory Quasi-Newton • Approximate H⁻¹ with a low-rank matrix built using information from recent iterations • Approximate H⁻¹ and not H, so no need to invert the matrix or solve a linear system! • Most popular L-M Q-N method: L-BFGS (sketched below) • Storage and computation are O(# vars) • Very good theoretical convergence properties • Empirically, the best method for training large-scale log-linear models with L2 (Malouf ‘02, Minka ‘03)
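The standard L-BFGS two-loop recursion, shown here as a generic sketch (not the paper's implementation): it multiplies the current gradient by an implicit approximation of H⁻¹ built from the last m displacement/gradient-difference pairs, using only O(m · # vars) storage and time.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: return a descent direction approximating
    -H^{-1} g from stored pairs s_k = x_{k+1} - x_k and
    y_k = grad_{k+1} - grad_k (most recent last)."""
    q = g.copy()
    rho = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * np.dot(s_list[i], q)
        q -= alpha[i] * y_list[i]
    # Scale by a standard initial guess of the inverse Hessian.
    gamma = (np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
             if s_list else 1.0)
    r = gamma * q
    for i in range(len(s_list)):
        beta = rho[i] * np.dot(y_list[i], r)
        r += (alpha[i] - beta) * s_list[i]
    return -r
```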

  10. Orthant-Wise Limited-memory Quasi-Newton algorithm • Our algorithm (OWL-QN) uses the fact that L1 is differentiable on any given orthant • In fact, it is linear there, so it doesn’t affect the Hessian

  11. OWL-QN (cont.) • For a given orthant defined by a sign vector ξ (one entry in {−1, +1} per coordinate), the objective can be written f(w) = loss(w) + C ξ·w • The penalty term is a linear function of w (its Hessian is 0), so the Hessian of f is determined by the loss alone • Can use gradients of the loss at previous iterations to estimate the Hessian of the objective on any orthant • Constrain steps to not cross orthant boundaries

  12. OWL-QN (cont.) • Choose an orthant • Find a quasi-Newton quadratic approximation to the objective on that orthant • Jump to the minimum of the quadratic (actually, line search in the direction of the minimum) • Project back onto the sectant (see the projection sketch below) • Repeat steps 1-4 until convergence
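The projection step can be sketched as follows (a minimal illustration in the spirit of the paper; the function and parameter names are assumptions): any coordinate that the step pushes across the chosen region's boundary is clipped back to exactly zero, which is how features leave the model.

```python
import numpy as np

def project_onto_orthant(w, xi):
    """Zero out every coordinate whose sign disagrees with the chosen
    region's sign vector xi (entries in {-1, 0, +1}); coordinates with
    xi == 0 are held at zero."""
    return np.where(np.sign(w) == xi, w, 0.0)
```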

  13. Choosing a sectant to explore • We use the sectant… • in which the current point sits • into which the direction of steepest descent points • (Computing the direction of steepest descent given the gradient of the loss is easy; see the paper for details, and the sketch below.)
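A sketch of that computation under the usual formulation of the L1 pseudo-gradient (my reading of the paper; grad_loss and the other names are assumptions): at a zero coordinate the one-sided derivatives of C|w_i| are ±C, so the steepest-descent component is nonzero only if moving off zero in some direction decreases the objective.

```python
import numpy as np

def steepest_descent_direction(w, grad_loss, C):
    """Negative pseudo-gradient of loss(w) + C * ||w||_1.

    For w_i != 0 the penalty is differentiable and contributes
    C * sign(w_i); for w_i == 0 we take the one-sided derivative that
    allows descent, or 0 if neither side does."""
    pseudo = np.zeros_like(w, dtype=float)
    pos = w > 0
    neg = w < 0
    zero = w == 0
    pseudo[pos] = grad_loss[pos] + C
    pseudo[neg] = grad_loss[neg] - C
    right = grad_loss[zero] + C   # derivative if w_i moves to > 0
    left = grad_loss[zero] - C    # derivative if w_i moves to < 0
    pseudo[zero] = np.where(right < 0, right, np.where(left > 0, left, 0.0))
    return -pseudo
```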

  14. Toy example • One iteration of L-BFGS-L1: • Find the vector of steepest descent • Choose a sectant • Find the L-BFGS quadratic approximation • Jump to its minimum • Project back onto the sectant • Update the Hessian approximation using the gradient of the loss alone

  15. Notes • Variables are added to or removed from the model as orthant boundaries are hit • A variable can change sign over two iterations • Glossing over some details: • Line search with projection at each iteration • Convenient for implementation to expand the notion of “orthant” to constrain some variables at zero • See the paper for complete details • In the paper we prove convergence to the optimum

  16. Experiments • We ran experiments with the parse re-ranking model of Charniak & Johnson (2005) • Start with a set of candidate parses for each sentence (produced by a baseline parser) • Train a log-linear model to select the correct one • The model uses ~1.2M features of a parse • Train on Sections 2-19 of the PTB (36K sentences with 50 parses each) • Fit C to maximize F-measure on Sections 20-21 (4K sentences)

  17. Training methods compared • Compared OWL-QN with three other methods • Kazama & Tsujii’s (2003) paired-variable formulation for L1, implemented with AlgLib’s L-BFGS-B • L2 with our own implementation of L-BFGS (on which OWL-QN is based) • L2 with AlgLib’s implementation of L-BFGS • K&T turns L1 training into a constrained differentiable problem by doubling the variables • Similar to Goodman’s 2004 method, but with L-BFGS-B instead of GIS

  18. Comparison Methodology • For each problem (L1 and L2) • Run both algorithms until the value is nearly constant • Report time to reach within 1% of the best value • We also report the number of function evaluations • Implementation-independent comparison • Function evaluation dominates runtime • Results reported with the chosen value of C • L-BFGS memory parameter = 5 for all runs

  19. Results

  20. Notes: • Our L-BFGS and AlgLib’s are comparable, so comparing OWL-QN against K&T with AlgLib is fair • In terms of both function evaluations and raw time, OWL-QN is orders of magnitude faster than K&T • The most expensive step of OWL-QN is computing the L-BFGS direction (not the projections, computing the steepest descent vector, etc.) • Optimizing the L1 objective with OWL-QN is twice as fast as optimizing L2 with L-BFGS

  21. Objective value during training (figure; curves for L1 with OWL-QN, L2 with our L-BFGS, L1 with Kazama & Tsujii, and L2 with AlgLib’s L-BFGS)

  22. Sparsity during training (figure; curves for OWL-QN and Kazama & Tsujii) • Both algorithms start with ~5% of the features, then gradually prune them away • At the second iteration, OWL-QN removes many features, then replaces them with the opposite sign

  23. Extensions • For the ACL paper, we ran on three very different log-linear NLP models with up to 8M features • CMM sequence model for POS tagging • Reranking log-linear model for LM adaptation • Semi-CRF for Chinese word segmentation • Can use any smooth convex loss • We’ve also tried least squares (LASSO regression); see the sketch below • A small change allows a non-convex loss • Only a local minimum is then guaranteed
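For instance, a least-squares loss plugged into the same f(w) = loss(w) + C ||w||_1 template gives the LASSO objective. A minimal sketch; the names and the value-plus-gradient convention are assumptions, not the released code's interface:

```python
import numpy as np

def lasso_objective(w, X, y, C):
    """Least-squares loss plus L1 penalty. Returns the objective value and
    the gradient of the smooth part, which is all an orthant-wise
    quasi-Newton solver needs from the loss."""
    residual = X @ w - y
    value = 0.5 * np.dot(residual, residual) + C * np.sum(np.abs(w))
    grad_of_loss = X.T @ residual
    return value, grad_of_loss
```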

  24. Software download • We’ve released the C++ OWL-QN source • Users can specify an arbitrary smooth convex loss • Also included are standalone trainers for L1-regularized logistic regression and least squares (LASSO) • Please visit my webpage for the download • (Find it with the search engine of your choice)

  25. THANKS.
