This work develops an efficient and numerically stable approach to sparse learning for large-scale data analysis. By addressing sparsity, accuracy, and numerical stability together, the proposed method is scalable and works under limited floating-point precision. It introduces a sparse linear model that is online and parallelizable, and that achieves good sparsity by adding features only when necessary. The talk analyzes the numerical problems of two popular approaches, Iterative Hard Thresholding (IHT) and the Mirror Descent Algorithm (MDA), and shows experimentally that the proposed approach avoids them. Its simplicity, scalability, and convergence guarantees make it suitable for practical applications with large training datasets.
Efficient and Numerically Stable Sparse Learning
Sihong Xie (1), Wei Fan (2), Olivier Verscheure (2), and Jiangtao Ren (3)
(1) University of Illinois at Chicago, USA; (2) IBM T.J. Watson Research Center, New York, USA; (3) Sun Yat-Sen University, Guangzhou, China
Sparse Linear Model • Input: training examples (feature vectors with labels) • Output: a sparse linear model • Learning formulation: minimize a loss over the training data plus a sparse regularization term • Large Scale Contest: http://largescale.first.fraunhofer.de/instructions/
Objectives • Sparsity • Accuracy • Numerical Stability • limited precision friendly • Scalability • Large scale training data (rows and columns)
Outline • “Numerical instability” of two popular approaches • Proposed sparse linear model • online • numerically stable • parallelizable • good sparsity – doesn’t take features unless necessary • Experimental results
Stability in Sparse learning • Numerical Problems of Direct Iterative Methods • Numerical Problems of Mirror Descent
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Solves the following optimization problem: minimize the error ||y - Ax||^2 subject to the L-0 regularization ||x||_0 <= s, where A is the data matrix, y is the label vector, x is the linear model, and s is the sparsity degree
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Incorporates gradient descent with hard thresholding • At each iteration: (1) take a step along the negative of the gradient, x + A^T (y - A x); (2) apply hard thresholding, keeping only the s most significant (largest-magnitude) elements
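To make the two steps concrete, here is a minimal Python sketch of the IHT iteration as described above; the step size mu and the iteration count are illustrative choices, not values from the paper.

```python
# A minimal IHT sketch (not the authors' exact code): gradient step on
# ||y - A x||^2 followed by keeping the s largest-magnitude entries.
import numpy as np

def hard_threshold(x, s):
    """Zero out all but the s largest-magnitude entries of x."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]          # indices of the s largest |x_i|
    out[idx] = x[idx]
    return out

def iht(A, y, s, mu=1.0, n_iter=100):
    """Iterative Hard Thresholding: x <- H_s(x + mu * A^T (y - A x))."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + mu * A.T @ (y - A @ x), s)
    return x
```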
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Advantages: Simple and scalable • Convergence of IHT
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • For the IHT algorithm to converge, the iteration matrix should have spectral radius less than 1 • Spectral radius: the largest absolute value among the eigenvalues of the iteration matrix
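As a rough illustration of this condition, the sketch below (my own, with an assumed unit step size) forms the iteration matrix I - A^T A from a random data matrix and checks its spectral radius; for unnormalized data the radius easily exceeds 1, which is the divergence scenario shown next.

```python
# Sketch of the convergence check: the gradient step x <- x + mu * A^T (y - A x)
# multiplies the error by I - mu * A^T A each iteration, so (ignoring the
# thresholding step) it contracts only if this matrix has spectral radius < 1.
import numpy as np

def spectral_radius(M):
    """Largest absolute eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))            # wide, unnormalized data matrix
mu = 1.0
M = np.eye(A.shape[1]) - mu * A.T @ A
rho = spectral_radius(M)
print(f"spectral radius = {rho:.2f} -> {'converges' if rho < 1 else 'may diverge'}")
```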
Stability in Sparse learning • Example of error growth of IHT (figure)
Stability in Sparse learning • Numerical Problems of Direct Iterative Methods • Numerical Problems of Mirror Descent
Stability in Sparse learning • Mirror Descent Algorithm (MDA) • Solves the L-1 regularized formulation: minimize the training loss plus an L-1 penalty on the model • Maintains two vectors: a primal vector and a dual vector
Stability in Sparse learning • MDA iterates between a dual space and a primal space: (1) gradient and soft-thresholding steps update the dual vector, keeping it sparse; (2) a link function maps the sparse dual vector to the primal space • p is a parameter of MDA • Illustration adapted from Peter Bartlett’s lecture slides: http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/
Stability in Sparse learning • Floating-point number system: a value is stored as significant digits × base^exponent, with only a limited number of significant digits, and the MDA link function is evaluated in this system • Example: on a computer with only 4 significant digits, 0.1 + 0.00001 = 0.1
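The 4-significant-digit example can be reproduced with Python's decimal module; this is just an illustration of limited-precision arithmetic, not code from the paper.

```python
# Limited-precision demo: with only 4 significant digits the small addend is lost.
from decimal import Decimal, getcontext

getcontext().prec = 4                          # keep only 4 significant digits
print(Decimal("0.1") + Decimal("0.00001"))     # -> 0.1000, the 0.00001 is lost
```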
Stability in Sparse learning • The link function amplifies the differences between elements: elements that are comparable in the dual vector can differ by many orders of magnitude in the corresponding primal vector
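To see why, the sketch below assumes a SMIDAS-style link function that raises dual magnitudes to the power p - 1 (the exact normalization in [ST2009] is omitted since only the ratios between entries matter here); with p around 2 ln(d), modest gaps in the dual vector turn into gaps of many orders of magnitude in the primal vector, which then vanish under limited precision.

```python
# Hedged illustration of the amplification effect, assuming primal magnitudes
# proportional to |theta_j|^(p-1) as in a SMIDAS-style link function.
import numpy as np

d = 100_000
p = 2 * np.log(d)                              # ~23 for d = 1e5; the slides use p = 33
theta = np.array([1.0, 0.5, 0.1])              # dual entries within one order of magnitude
w = np.abs(theta) ** (p - 1)                   # unnormalized primal magnitudes
print(w / w.max())                             # smaller entries shrink by many orders of magnitude
```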
Experiments Numerical problem of MDA • Experimental settings • Train models with 40% density. • Parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively [ST2009] [ST2009] Shai S. Shwartz and Ambuj Tewari. Stochastic methods for ℓ1 regularized loss minimization. In ICML, pages 929–936. ACM, 2009.
Experiments Numerical problem of MDA • Performance criteria • Percentage of elements that are truncated during prediction • Dynamic range
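A minimal sketch of how these two criteria might be computed for a learned weight vector w; the single-precision truncation threshold is my assumption, not the paper's exact definition.

```python
# Assumed reading of the two criteria: fraction of nonzero weights too small to
# survive single-precision storage, and the ratio of largest to smallest
# nonzero magnitude (dynamic range).
import numpy as np

def truncated_fraction(w, eps=np.finfo(np.float32).tiny):
    nz = w[w != 0]
    return float(np.mean(np.abs(nz) < eps)) if nz.size else 0.0

def dynamic_range(w):
    mags = np.abs(w[w != 0])
    return float(mags.max() / mags.min()) if mags.size else 1.0
```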
Objectives of a Simple Approach • Numerically stable • Computationally efficient • Online, parallelizable • Accurate models with higher sparsity • Obtaining too many features is costly (e.g. medical diagnostics) For an excellent theoretical treatment of the trade-off between accuracy and sparsity, see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.
The proposed method • Algorithm (figure); the update rule uses an SVM-like margin
Numerical Stability and Scalability Considerations • Numerical stability • Fewer conditions on the data matrix (no spectral-radius requirement, no rescaling) • Less precision demanding (works under limited precision, Theorem 1) • Under mild conditions, the proposed method converges even for a large number of iterations, with a bound that depends on the machine precision
Numerical Stability and Scalability Considerations • Online fashion: one example at a time • Parallelization for data-intensive access • Data can be distributed across machines, each computing part of the inner product • Small network communication (only partial inner products and signals to update the model); see the sketch below
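A small sketch of this parallelization pattern: feature blocks stay on their machines and only scalar partial inner products are communicated. The block splitting and function names are illustrative, not from the paper.

```python
# Split the features into blocks; each worker holds its block of w and x and
# returns a scalar partial inner product, so the data itself never moves.
import numpy as np

def partial_dot(w_block, x_block):
    """Computed locally on the machine holding this feature block."""
    return float(w_block @ x_block)

def distributed_score(w_blocks, x_blocks):
    """Only scalar partial products cross the network."""
    return sum(partial_dot(wb, xb) for wb, xb in zip(w_blocks, x_blocks))

# Example: split a 1000-dimensional example into 4 feature blocks.
rng = np.random.default_rng(0)
w, x = rng.standard_normal(1000), rng.standard_normal(1000)
w_blocks, x_blocks = np.array_split(w, 4), np.array_split(x, 4)
assert np.isclose(distributed_score(w_blocks, x_blocks), w @ x)
```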
The proposed method properties • Soft-thresholding: L-1 regularization for a sparse model • Perceptron-style updates: skips updates when the current features already predict well (with enough margin) – sparsity • Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1) – numerical stability • Generalization error bound (Theorem 3) • Don’t complicate the model when unnecessary; a sketch of this update pattern follows below
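The sketch below illustrates the margin-gated, soft-thresholded update pattern described above; the margin, learning rate, and truncation amount are placeholder parameters, and this is not the authors' exact algorithm.

```python
# Hedged sketch of a perceptron-style update with an SVM-like margin gate and
# soft-thresholding; parameter values are placeholders, not from the paper.
import numpy as np

def soft_threshold(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def update(w, x, y, margin=1.0, eta=0.1, lam=0.01):
    if y * (w @ x) >= margin:                  # already correct with enough margin:
        return w                               # skip the update, keeping the model sparse
    w = w + eta * y * x                        # perceptron-style correction
    return soft_threshold(w, lam)              # L1-style truncation for sparsity
```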
The proposed method A toy example (figure) • The proposed method: a sparse model is enough to predict well (the margin indicates a good-enough model, so no further features are taken) • After the 1st, 2nd, and 3rd updates, TG (truncated gradient) produces a relatively dense model by comparison
Experiments Overall comparison • The proposed algorithm + 3 baseline sparse learning algorithms (all with the logistic loss function) • SMIDAS (MDA based [ST2009]): p = 0.5 log(d) (cannot run with a bigger p due to numerical problems) • TG (Truncated Gradient [LLZ2009]) • SCD (Stochastic Coordinate Descent [ST2009]) [ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009. [LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Experiments Overall comparison • Accuracy under the same model density • First 7 datasets: select at most 40% of the features • Webspam: select at most 0.1% of the features • Stop running the program when the maximum percentage of features has been selected
Experiments Overall comparison • Accuracy vs. sparsity • The proposed algorithm consistently works better than the other baselines • On 5 out of 8 tasks, it stopped updating the model (converged) before reaching the maximum density (40% of features) • On task 1, it outperforms the others with 10% of the features • On task 3, it ties with the best baseline using 20% of the features
Conclusion • Numerical stability of sparse learning • Gradient descent using matrix iteration may diverge without the spectral-radius assumption • When dimensionality is high, MDA produces many infinitesimal elements • Trading off sparsity and accuracy • Other methods (TG, SCD) are unable to train accurate models with high sparsity • The proposed approach is numerically stable, online, parallelizable, and convergent • Controlled by a margin • L-1 regularization and soft-thresholding • Experimental code is available at www.weifan.info