This work develops an efficient and numerically stable approach to sparse learning for large-scale data analysis. By addressing sparsity, accuracy, and numerical stability together, the proposed method is scalable and works under limited floating-point precision. It introduces a sparse linear model that is online and parallelizable, and that achieves good sparsity by adding features only when necessary. The talk analyzes the numerical problems of two popular approaches, Iterative Hard Thresholding (IHT) and the Mirror Descent Algorithm (MDA), and shows experimentally that the proposed approach avoids them. Its simplicity, scalability, and convergence guarantees make it suitable for practical applications with large training datasets.
Efficient and Numerically Stable Sparse Learning
Sihong Xie (1), Wei Fan (2), Olivier Verscheure (2), and Jiangtao Ren (3)
(1) University of Illinois at Chicago, USA; (2) IBM T.J. Watson Research Center, New York, USA; (3) Sun Yat-Sen University, Guangzhou, China
Sparse Linear Model • Input: training examples (feature vectors with labels) • Output: a sparse linear model • Learning formulation: minimize a loss over the training data plus a sparse regularization term • Large Scale Contest: http://largescale.first.fraunhofer.de/instructions/
Objectives • Sparsity • Accuracy • Numerical Stability • limited precision friendly • Scalability • Large scale training data (rows and columns)
Outline • “Numerical instability” of two popular approaches • Proposed sparse linear model • online • numerically stable • parallelizable • good sparsity – doesn’t take features unless necessary • Experimental results
Stability in Sparse learning • Numerical Problems of Direct Iterative Methods • Numerical Problems of Mirror Descent
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Solves the following optimization problem: minimize the error ||y - Ax||^2 subject to the L-0 regularization ||x||_0 <= s, where A is the data matrix, y is the label vector, x is the linear model, and s is the sparsity degree
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Incorporates gradient descent with hard thresholding • At each iteration: (1) take a step along the negative of the gradient, x + A^T (y - A x); (2) apply hard thresholding, keeping only the s most significant (largest-magnitude) elements
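To make the two steps concrete, here is a minimal Python sketch of the IHT iteration as described above; the step size mu and the iteration count are illustrative choices, not values from the paper.

```python
# A minimal IHT sketch (not the authors' exact code): gradient step on
# ||y - A x||^2 followed by keeping the s largest-magnitude entries.
import numpy as np

def hard_threshold(x, s):
    """Zero out all but the s largest-magnitude entries of x."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]          # indices of the s largest |x_i|
    out[idx] = x[idx]
    return out

def iht(A, y, s, mu=1.0, n_iter=100):
    """Iterative Hard Thresholding: x <- H_s(x + mu * A^T (y - A x))."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x + mu * A.T @ (y - A @ x), s)
    return x
```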
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • Advantages: Simple and scalable • Convergence of IHT
Stability in Sparse learning • Iterative Hard Thresholding (IHT) • For the IHT algorithm to converge, the iteration matrix should have spectral radius less than 1 • Spectral radius: the largest absolute value among the eigenvalues of the iteration matrix
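As a rough illustration of this condition, the sketch below (my own, with an assumed unit step size) forms the iteration matrix I - A^T A from a random data matrix and checks its spectral radius; for unnormalized data the radius easily exceeds 1, which is the divergence scenario shown next.

```python
# Sketch of the convergence check: the gradient step x <- x + mu * A^T (y - A x)
# multiplies the error by I - mu * A^T A each iteration, so (ignoring the
# thresholding step) it contracts only if this matrix has spectral radius < 1.
import numpy as np

def spectral_radius(M):
    """Largest absolute eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))            # wide, unnormalized data matrix
mu = 1.0
M = np.eye(A.shape[1]) - mu * A.T @ A
rho = spectral_radius(M)
print(f"spectral radius = {rho:.2f} -> {'converges' if rho < 1 else 'may diverge'}")
```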
Stability in Sparse learning • Example of error growth of IHT (figure)
Stability in Sparse learning • Numerical Problems of Direct Iterative Methods • Numerical Problems of Mirror Descent
Stability in Sparse learning • Mirror Descent Algorithm (MDA) • Solves the L-1 regularized formulation: minimize the training loss plus an L-1 penalty on the model • Maintains two vectors: a primal vector and a dual vector
Stability in Sparse learning • MDA iterates between a dual space and a primal space: (1) gradient and soft-thresholding steps update the dual vector, keeping it sparse; (2) a link function maps the sparse dual vector to the primal space • p is a parameter of MDA • Illustration adapted from Peter Bartlett’s lecture slides: http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/
Stability in Sparse learning • Floating-point number system: a value is stored as significant digits × base^exponent, with only a limited number of significant digits, and the MDA link function is evaluated in this system • Example: on a computer with only 4 significant digits, 0.1 + 0.00001 = 0.1
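The 4-significant-digit example can be reproduced with Python's decimal module; this is just an illustration of limited-precision arithmetic, not code from the paper.

```python
# Limited-precision demo: with only 4 significant digits the small addend is lost.
from decimal import Decimal, getcontext

getcontext().prec = 4                          # keep only 4 significant digits
print(Decimal("0.1") + Decimal("0.00001"))     # -> 0.1000, the 0.00001 is lost
```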
Stability in Sparse learning • The link function amplifies the differences between elements: elements that are comparable in the dual vector can differ by many orders of magnitude in the corresponding primal vector
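To see why, the sketch below assumes a SMIDAS-style link function that raises dual magnitudes to the power p - 1 (the exact normalization in [ST2009] is omitted since only the ratios between entries matter here); with p around 2 ln(d), modest gaps in the dual vector turn into gaps of many orders of magnitude in the primal vector, which then vanish under limited precision.

```python
# Hedged illustration of the amplification effect, assuming primal magnitudes
# proportional to |theta_j|^(p-1) as in a SMIDAS-style link function.
import numpy as np

d = 100_000
p = 2 * np.log(d)                              # ~23 for d = 1e5; the slides use p = 33
theta = np.array([1.0, 0.5, 0.1])              # dual entries within one order of magnitude
w = np.abs(theta) ** (p - 1)                   # unnormalized primal magnitudes
print(w / w.max())                             # smaller entries shrink by many orders of magnitude
```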
Experiments Numerical problem of MDA • Experimental settings • Train models with 40% density. • Parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively [ST2009] [ST2009] Shai S. Shwartz and Ambuj Tewari. Stochastic methods for ℓ1 regularized loss minimization. In ICML, pages 929–936. ACM, 2009.
Experiments Numerical problem of MDA • Performance criteria • Percentage of elements that are truncated during prediction • Dynamic range
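A minimal sketch of how these two criteria might be computed for a learned weight vector w; the single-precision truncation threshold is my assumption, not the paper's exact definition.

```python
# Assumed reading of the two criteria: fraction of nonzero weights too small to
# survive single-precision storage, and the ratio of largest to smallest
# nonzero magnitude (dynamic range).
import numpy as np

def truncated_fraction(w, eps=np.finfo(np.float32).tiny):
    nz = w[w != 0]
    return float(np.mean(np.abs(nz) < eps)) if nz.size else 0.0

def dynamic_range(w):
    mags = np.abs(w[w != 0])
    return float(mags.max() / mags.min()) if mags.size else 1.0
```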
Objectives of a Simple Approach • Numerically stable • Computationally efficient • Online, parallelizable • Accurate models with higher sparsity • Obtaining too many features is costly (e.g. medical diagnostics) For an excellent theoretical treatment of the trade-off between accuracy and sparsity, see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.
The proposed method • Algorithm (figure); the update rule uses an SVM-like margin
Numerical Stability and Scalability Considerations • Numerical stability • Fewer conditions on the data matrix (no spectral-radius requirement, no rescaling) • Less precision demanding (works under limited precision, Theorem 1) • Under mild conditions, the proposed method converges even for a large number of iterations, with a bound that depends on the machine precision
Numerical Stability and Scalability Considerations • Online fashion: one example at a time • Parallelization for data-intensive access • Data can be distributed across machines, each computing part of the inner product • Small network communication (only partial inner products and signals to update the model); see the sketch below
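A small sketch of this parallelization pattern: feature blocks stay on their machines and only scalar partial inner products are communicated. The block splitting and function names are illustrative, not from the paper.

```python
# Split the features into blocks; each worker holds its block of w and x and
# returns a scalar partial inner product, so the data itself never moves.
import numpy as np

def partial_dot(w_block, x_block):
    """Computed locally on the machine holding this feature block."""
    return float(w_block @ x_block)

def distributed_score(w_blocks, x_blocks):
    """Only scalar partial products cross the network."""
    return sum(partial_dot(wb, xb) for wb, xb in zip(w_blocks, x_blocks))

# Example: split a 1000-dimensional example into 4 feature blocks.
rng = np.random.default_rng(0)
w, x = rng.standard_normal(1000), rng.standard_normal(1000)
w_blocks, x_blocks = np.array_split(w, 4), np.array_split(x, 4)
assert np.isclose(distributed_score(w_blocks, x_blocks), w @ x)
```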
The proposed method properties • Soft-thresholding: L-1 regularization for a sparse model • Perceptron-style updates: skips updates when the current features already predict well (with enough margin) – sparsity • Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1) – numerical stability • Generalization error bound (Theorem 3) • Don’t complicate the model when unnecessary; a sketch of this update pattern follows below
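The sketch below illustrates the margin-gated, soft-thresholded update pattern described above; the margin, learning rate, and truncation amount are placeholder parameters, and this is not the authors' exact algorithm.

```python
# Hedged sketch of a perceptron-style update with an SVM-like margin gate and
# soft-thresholding; parameter values are placeholders, not from the paper.
import numpy as np

def soft_threshold(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def update(w, x, y, margin=1.0, eta=0.1, lam=0.01):
    if y * (w @ x) >= margin:                  # already correct with enough margin:
        return w                               # skip the update, keeping the model sparse
    w = w + eta * y * x                        # perceptron-style correction
    return soft_threshold(w, lam)              # L1-style truncation for sparsity
```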
The proposed method A toy example (figure) • The proposed method: a sparse model is enough to predict well (the margin indicates a good-enough model, so no further features are taken) • After the 1st, 2nd, and 3rd updates, TG (truncated gradient) produces a relatively dense model by comparison
Experiments Overall comparison • The proposed algorithm + 3 baseline sparse learning algorithms (all with the logistic loss function) • SMIDAS (MDA based [ST2009]): p = 0.5 log(d) (cannot run with a bigger p due to numerical problems) • TG (Truncated Gradient [LLZ2009]) • SCD (Stochastic Coordinate Descent [ST2009]) [ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009. [LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Experiments Overall comparison • Accuracy under the same model density • First 7 datasets: select at most 40% of the features • Webspam: select at most 0.1% of the features • Stop running the program when the maximum percentage of features has been selected
Experiments Overall comparison • Accuracy vs. sparsity • The proposed algorithm consistently works better than the other baselines • On 5 out of 8 tasks, it stopped updating the model (converged) before reaching the maximum density (40% of features) • On task 1, it outperforms the others with 10% of the features • On task 3, it ties with the best baseline using 20% of the features
Conclusion • Numerical stability of sparse learning • Gradient descent using matrix iteration may diverge without the spectral-radius assumption • When dimensionality is high, MDA produces many infinitesimal elements • Trading off sparsity and accuracy • Other methods (TG, SCD) are unable to train accurate models with high sparsity • The proposed approach is numerically stable, online, parallelizable, and convergent • Controlled by a margin • L-1 regularization and soft-thresholding • Experimental code is available at www.weifan.info