
Efficient and Numerically Stable Sparse Learning


Presentation Transcript


  1. Efficient and Numerically Stable Sparse Learning. Sihong Xie (1), Wei Fan (2), Olivier Verscheure (2), and Jiangtao Ren (3). (1) University of Illinois at Chicago, USA; (2) IBM T.J. Watson Research Center, New York, USA; (3) Sun Yat-Sen University, Guangzhou, China

  2. Sparse Linear Model • Input: training examples (x_i, y_i), with feature vectors x_i ∈ R^d and labels y_i • Output: a sparse linear model w • Learning formulation: minimize the training loss plus a sparse regularization term (see the reconstruction below) • Datasets from the Large Scale Contest: http://largescale.first.fraunhofer.de/instructions/
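The learning formulation on this slide was an image in the original deck; below is a standard reconstruction, assuming a per-example loss L, n training examples in d dimensions, and either an L-1 penalty with weight λ or an explicit sparsity budget s (these symbols are assumptions, not the slide's notation):

```latex
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L\!\left(w^{\top} x_i,\, y_i\right) + \lambda \|w\|_1
\qquad \text{or} \qquad
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L\!\left(w^{\top} x_i,\, y_i\right)
\;\; \text{s.t.} \;\; \|w\|_0 \le s
```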

  3. Objectives • Sparsity • Accuracy • Numerical Stability • limited precision friendly • Scalability • Large scale training data (rows and columns)

  4. Outline • "Numerical instability" of two popular approaches • Proposed sparse linear model • online • numerically stable • parallelizable • good sparsity: don't take features unless necessary • Experimental results

  5. Stability in Sparse learning • Numerical problems of direct iterative methods • Numerical problems of mirror descent

  6. Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Solve the optimization problem min_w ||y − Xw||_2^2 subject to ||w||_0 ≤ s, where X is the data matrix, y the label vector, w the linear model, and s the sparsity degree; the squared error is the quantity to minimize and the L-0 constraint enforces sparsity

  7. Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Combines gradient descent with hard thresholding • At each iteration: (1) take a step along the negative of the gradient; (2) apply hard thresholding, keeping the s most significant elements (see the sketch below)
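A minimal NumPy sketch of the two-step IHT iteration for the problem on the previous slide; the step size, iteration count, and variable names are illustrative choices, not taken from the slides:

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-s:]
    out[top] = w[top]
    return out

def iht(X, y, s, step=1.0, n_iters=100):
    """Iterative Hard Thresholding for min_w ||y - Xw||^2 s.t. ||w||_0 <= s."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                 # gradient of the squared error
        w = hard_threshold(w - step * grad, s)   # (1) gradient step, (2) keep top-s elements
    return w
```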

  8. Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Advantages: Simple and scalable • Convergence of IHT

  9. Stability in Sparse learning • Iterative Hard Thresholding (IHT) • For the IHT algorithm to converge, the iteration matrix should have spectral radius less than 1 • Spectral radius: the largest magnitude among the eigenvalues of a matrix
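A sketch of where the spectral-radius condition comes from: with step size η, the gradient step inside IHT is an affine iteration, and hard thresholding only selects coordinates of it, so the behavior is governed by the iteration matrix I − ηXᵀX (this is a reconstruction of the argument, not the slide's exact statement):

```latex
w^{(t+1)} = H_s\!\big( w^{(t)} + \eta X^{\top}(y - X w^{(t)}) \big)
          = H_s\!\big( (I - \eta X^{\top} X)\, w^{(t)} + \eta X^{\top} y \big),
\qquad
\rho(A) = \max_i |\lambda_i(A)| .
```

If ρ(I − ηXᵀX) ≥ 1, the iterates can grow geometrically, which is the divergence shown in the next two slides.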

  10. Experiments Divergence of IHT

  11. Stability in Sparse learning Example of error growth of IHT

  12. Stability in Sparse learning • Numerical problems of direct iterative methods • Numerical problems of mirror descent

  13. Stability in Sparse learning • Mirror Descent Algorithm (MDA) • Solve the L-1 regularized formulation • Maintain two vectors: primal and dual

  14. Stability in Sparse learning • MDA keeps a dual vector and a primal vector: (1) the dual vector is updated and soft-thresholded, producing a sparse dual vector; (2) the dual vector is mapped to the primal space through a link function • p is a parameter of MDA • Illustration adapted from Peter Bartlett's lecture slides: http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/
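A sketch of the dual-to-primal mapping in p-norm MDA/SMIDAS-style algorithms; the exact link-function formula below is the standard p-norm mirror map and should be read as an assumption rather than a quote from the slide:

```python
import numpy as np

def soft_threshold(theta, lam):
    """Shrink every entry of theta toward zero by lam (the L1 / soft-thresholding step)."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def link(theta, p):
    """Map the dual vector theta to the primal vector w:
    w_j = sign(theta_j) * |theta_j|**(p-1) / ||theta||_p**(p-2)."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)
```

Because the link raises every |θ_j| to the power p - 1, small differences in the dual become enormous differences in the primal when p is large; that is the effect examined on the next slides.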

  15. Stability in Sparse learning • Floating-point number system: a value is stored as a fixed number of significant digits times a base raised to an exponent • The MDA link function interacts badly with this representation • Example: on a computer with only 4 significant digits, 0.1 + 0.00001 = 0.1
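The same "absorbed addition" can be reproduced on real hardware with float32 instead of a 4-significant-digit machine (np.float32 and the value 1e-9 are chosen for the demo):

```python
import numpy as np

a = np.float32(0.1)
b = np.float32(1e-9)
# b is below float32's relative precision (about 1.2e-7 of a), so the sum rounds back to a
print(a + b == a)   # True
```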

  16. Stability in Sparse learning • The link function amplifies the differences between elements: entries that are comparable in the dual vector can differ by many orders of magnitude in the primal vector
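A small numeric illustration of the amplification; the dual values 0.9 and 0.5 are made up for the demo, and p = 33 matches the setting on the next slide:

```python
import numpy as np

p = 33                               # p = 2 ln(d), as used in the experiments
theta = np.array([0.9, 0.5])         # two comparable entries of the dual vector
primal = np.abs(theta) ** (p - 1)    # the link function raises magnitudes to the power p - 1
print(primal)   # ~[3.4e-02, 2.3e-10]: a 1.8x gap in the dual becomes ~8 orders of magnitude
```

With many features the gaps only grow, so a large fraction of primal entries become too small to contribute under limited precision; the next experiment measures exactly this.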

  17. Experiments Numerical problem of MDA • Experimental settings • Train models with 40% density. • Parameter p is set to 2ln(d) (p=33) and 0.5 ln(d) respectively [ST2009] [ST2009] Shai S. Shwartz and Ambuj Tewari. Stochastic methods for ℓ1 regularized loss minimization. In ICML, pages 929–936. ACM, 2009.

  18. Experiments Numerical problem of MDA • Performance Criteria • Percentage of elements that are truncated during prediction • Dynamic range

  19. Objectives of a Simple Approach • Numerically Stable • Computationally efficient • Online, parallelizable • Accurate models with higher sparsity • Costly to obtain too many features (e.g. medical diagnostics) For an excellent theoretical treatment of trading off between accuracy and sparsity see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.

  20. The proposed method • Algorithm (see the sketch below): online updates gated by an SVM-like margin
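The algorithm box on this slide is a figure in the deck; the sketch below only assembles the ingredients named elsewhere in the slides (an online perceptron-style pass, an SVM-like margin test that skips updates when the prediction is already confident, and soft-thresholding for L1 sparsity). The names margin, eta, and lam are placeholders, not the paper's notation:

```python
import numpy as np

def soft_threshold(w, lam):
    """L1 shrinkage: move every coordinate of w toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def train_sparse_online(examples, d, margin=1.0, eta=0.1, lam=0.01):
    """Perceptron-style online pass with a margin test and soft-thresholding.

    examples: iterable of (x, y) pairs with x a length-d array and y in {-1, +1}.
    """
    w = np.zeros(d)
    for x, y in examples:
        if y * (w @ x) <= margin:                     # SVM-like margin: update only when not confident
            w = soft_threshold(w + eta * y * x, lam)  # perceptron step followed by L1 shrinkage
    return w
```

Skipping the update when the margin test passes is what keeps new features out of the model unless they are needed; the soft-thresholding step removes features whose weights stay small.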

  21. Numerical Stability and Scalability Considerations • Numerical stability • Fewer conditions on the data matrix (no spectral-radius requirement, no rescaling of the data) • Less demanding of precision (works under limited precision, Theorem 1) • Under mild conditions, the proposed method converges even for a large number of iterations (the bound depends on the machine precision)

  22. Numerical Stability and Scalability Considerations • Online fashion: one example at a time • Parallelization for intensive data access • Data can be distributed across machines, each computing its part of the inner product (see the sketch below) • Small network communication (only the partial inner products and the signal to update the model)
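A sketch of the feature-partitioned inner product described above; the two-block split below stands in for data blocks held on different machines:

```python
import numpy as np

def partial_dots(x_blocks, w_blocks):
    """Each machine computes the inner product over its own block of features."""
    return [xb @ wb for xb, wb in zip(x_blocks, w_blocks)]

# Example: four features split into two blocks on two machines.
x = np.array([1.0, 0.0, 2.0, -1.0])
w = np.array([0.5, 0.0, 0.0, 1.0])
parts = partial_dots([x[:2], x[2:]], [w[:2], w[2:]])
score = sum(parts)              # only the scalar partial sums cross the network
assert np.isclose(score, x @ w)
```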

  23. The proposed method: properties • Soft-thresholding (defined below) • L1 regularization for a sparse model • Perceptron-style updates: no update when the current features already predict well (sparsity) • Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1): numerical stability • Generalization error bound (Theorem 3) • Don't complicate the model when unnecessary
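For reference, the soft-thresholding operator behind the L1 regularization, applied coordinate-wise with threshold λ (standard definition):

```latex
S_{\lambda}(w_j) = \operatorname{sign}(w_j)\,\max\!\big(|w_j| - \lambda,\; 0\big)
```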

  24. The proposed method • A toy example: after the 1st, 2nd, and 3rd updates, the proposed method keeps a sparse model (a satisfied margin indicates the model is already good enough, so no more features are taken), while TG (truncated gradient) ends up with a relatively dense model

  25. Experiments Overall comparison • The proposed algorithm + 3 baseline sparse learning algorithms (all with logistic loss function) • SMIDAS (MDA based [ST2009]): p = 0.5log(d) (cannot run with bigger p due to numerical problem) • TG (Truncated Gradient [LLZ2009]) • SCD (Stochastic Coordinate Descent [ST2009]) [ST2009] Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for l1 regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929-936, 2009. [LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

  26. Experiments Overall comparison • Accuracy under the same model density • First 7 datasets: use at most 40% of the features • Webspam: select at most 0.1% of the features • Stop training once the maximum percentage of features has been selected

  27. Experiments Overall comparison • Accuracy vs. sparsity • The proposed algorithm consistently outperforms the other baselines • On 5 out of 8 tasks, it stopped updating the model before reaching the maximum density (40% of the features), i.e., it converged to a sparse model • On task 1, it outperforms the others using only 10% of the features • On task 3, it ties with the best baseline using 20% of the features

  28. Conclusion • Numerical stability of sparse learning • Gradient descent via matrix iteration may diverge when the spectral-radius assumption does not hold • When the dimensionality is high, MDA produces many infinitesimal elements • Trading off sparsity and accuracy • Other methods (TG, SCD) are unable to train accurate models at high sparsity • The proposed approach is numerically stable, online, parallelizable, and convergent • Sparsity is controlled by the margin test, L-1 regularization, and soft-thresholding • Experimental code is available at www.weifan.info
