
Support Vector Regression

Support Vector Regression. David R. Musicant and O.L. Mangasarian. International Symposium on Mathematical Programming, Thursday, August 10, 2000. http://www.cs.wisc.edu/~musicant. Outline: robust regression; Huber M-estimator loss function; new quadratic programming formulation



Presentation Transcript


  1. Support Vector Regression David R. Musicant and O.L. Mangasarian International Symposium on Mathematical Programming Thursday, August 10, 2000 http://www.cs.wisc.edu/~musicant

  2. Outline • Robust Regression • Huber M-Estimator loss function • New quadratic programming formulation • Numerical comparisons • Nonlinear kernels • Tolerant Regression • New formulation of Support Vector Regression (SVR) • Numerical comparisons • Massive regression: row-column chunking • Conclusions & Future Work

  3. Focus 1: Robust Regression, a.k.a. Huber Regression [figure: Huber loss curve, quadratic between -g and g, linear outside]

  4. “Standard” Linear Regression • m points in Rn, represented by an m x n matrix A • y in Rm is the vector to be approximated • Find w, b such that Aw + b approximates y

  5. Optimization problem • Find w, b such that the error between Aw + b and y is bounded by s • Minimize that error • Traditional approach: minimize the squared error
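The traditional squared-error fit on this slide can be sketched with NumPy. The data below is synthetic (the slides' A and y come from real datasets), and fitting b by appending a column of ones to A is a standard device, not something the slides spell out:

```python
import numpy as np

# Synthetic stand-in for the slide's setup: m points in R^n are the rows
# of an m x n matrix A, and y in R^m is the vector to be approximated.
rng = np.random.default_rng(0)
m, n = 200, 3
A = rng.normal(size=(m, n))
true_w, true_b = np.array([1.0, -2.0, 0.5]), 3.0
y = A @ true_w + true_b + 0.1 * rng.normal(size=m)

# Traditional approach: minimize the squared error ||Aw + b - y||^2.
# Appending a column of ones lets lstsq fit w and b jointly.
A1 = np.hstack([A, np.ones((m, 1))])
sol, *_ = np.linalg.lstsq(A1, y, rcond=None)
w, b = sol[:-1], sol[-1]
```

With light noise and no outliers, the recovered w and b land close to the values used to generate the data; the next slides show why this breaks down when outliers are present.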

  6. Examining the loss function • Standard regression uses a squared error loss function. • Points which are far from the predicted line (outliers) are overemphasized.

  7. Alternative loss function • Instead of squared error, try absolute value of the error: This is the 1-norm loss function.

  8. 1-Norm Problems And Solution • The 1-norm overemphasizes error on points close to the predicted line • Solution: the Huber loss function, a hybrid approach (quadratic near zero, linear in the tails) • Many practitioners prefer the Huber loss function.

  9. Mathematical Formulation • g indicates the switchover from quadratic to linear • Larger g means “more quadratic” [figure: Huber loss curve with switchover points at -g and g]
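A minimal sketch of the Huber loss as described: quadratic up to the switchover g, linear beyond it. The function name and vectorized form are mine, and the loss is scaled in the usual way (factor 1/2 on the quadratic piece); the slides' exact scaling is not preserved in the transcript:

```python
import numpy as np

def huber_loss(r, g):
    """Huber loss: quadratic for |r| <= g, linear beyond.
    g is the switchover point; larger g means 'more quadratic'."""
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r ** 2                  # inside [-g, g]
    lin = g * (np.abs(r) - 0.5 * g)      # outside, continued linearly
    return np.where(np.abs(r) <= g, quad, lin)
```

The two pieces meet smoothly at |r| = g, which is what lets the loss treat small errors like squared error while only charging outliers linearly.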

  10. Regression Approach Summary • Quadratic Loss Function • Standard method in statistics • Over-emphasizes outliers • Linear Loss Function (1-norm) • Formulates well as a linear program • Over-emphasizes small errors • Huber Loss Function (hybrid approach) • Appropriate emphasis on large and small errors

  11. Previous attempts complicated • Earlier efforts to solve Huber regression: • Huber: Gauss-Seidel method • Madsen/Nielsen: Newton Method • Li: Conjugate Gradient Method • Smola: Dual Quadratic Program • Our new approach: convex quadratic program Our new approach is simpler and faster.
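The slides solve Huber regression as a convex quadratic program; that formulation is not reproduced here. As an illustrative stand-in only (not the paper's method), iteratively reweighted least squares minimizes the same Huber objective:

```python
import numpy as np

def huber_fit(A, y, g, iters=100):
    """Robust fit of y ~ A w + b under the Huber loss via iteratively
    reweighted least squares. A simple stand-in sketch, NOT the paper's
    convex quadratic programming formulation."""
    m = A.shape[0]
    A1 = np.hstack([A, np.ones((m, 1))])
    coef = np.linalg.lstsq(A1, y, rcond=None)[0]   # least-squares start
    for _ in range(iters):
        r = A1 @ coef - y
        # Huber weights: 1 inside the quadratic zone, g/|r| outside,
        # which down-weights outliers instead of squaring them.
        wts = np.where(np.abs(r) <= g, 1.0,
                       g / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(wts)
        coef = np.linalg.lstsq(A1 * sw[:, None], y * sw, rcond=None)[0]
    return coef[:-1], coef[-1]
```

On data with gross outliers this recovers coefficients close to the truth where plain least squares would be pulled far off, which is the robustness the talk is after.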

  12. Experimental Results: Census20k • 20,000 points, 11 features [chart: time (CPU sec) vs. g; our method is faster]

  13. Experimental Results: CPUSmall • 8,192 points, 12 features [chart: time (CPU sec) vs. g; our method is faster]

  14. Introduce nonlinear kernel • Begin with the previous formulation • Substitute w = A’a and solve for a instead • Substitute K(A,A’) for AA’
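The kernel substitution can be made concrete. The Gaussian kernel below is one common choice (the slides leave K generic until slide 24), and the bandwidth parameter mu is mine:

```python
import numpy as np

def gaussian_kernel(A, B, mu=1.0):
    """K(A, B')_{ij} = exp(-mu * ||A_i - B_j||^2), one common choice of
    nonlinear kernel; mu is a bandwidth parameter."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq, 0.0))   # clamp tiny negatives

# With the substitution w = A' a, the linear model x -> x w + b becomes
# the kernel model x -> K(x, A') a + b, so a is fitted in place of w.
```

Replacing AA' with K(A,A') is what turns the linear regression surface into a nonlinear one while keeping the optimization problem's shape unchanged.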

  15. Nonlinear results Nonlinear kernels improve accuracy.

  16. Focus 2: Support Vector Tolerant Regression

  17. Regression Approach Summary • Quadratic Loss Function • Standard method in statistics • Over-emphasizes outliers • Linear Loss Function (1-norm) • Formulates well as a linear program • Over-emphasizes small errors • Huber Loss Function (hybrid approach) • Appropriate emphasis on large and small errors

  18. Optimization problem • Find w, b such that the error between Aw + b and y is bounded by s • Minimize the magnitude of the error

  19. The overfitting issue • Noisy training data can be fitted “too well” • leads to poor generalization on future data • Prefer simpler regressions, i.e. where • some w coefficients are zero • line is “flatter”

  20. Reducing overfitting • To achieve both goals, also minimize the magnitude of the w vector • C is a parameter that balances the two goals, chosen by experimentation • Reduces overfitting due to points far from the surface
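The combined objective on this slide was an image and is not preserved. A generic way to write the trade-off it describes (minimize the error plus the magnitude of w, balanced by C) is the standard SVR-style objective below; whether the slides use the 1-norm or 2-norm of w is not recoverable from the transcript:

```latex
\min_{w,\,b} \;\; \|w\| \;+\; C \sum_{i=1}^{m} L\!\left(A_i w + b - y_i\right)
```

where L is the chosen loss on each residual and larger C weights data fit more heavily relative to flatness of the regression surface.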

  21. Overfitting again: “close” points • “Close points” may be wrong due to noise only • Line should be influenced by “real” data, not noise • Ignore errors from those points which are close!

  22. Tolerant regression • Allow an interval of size e with uniform error • How large should e be? • Large as possible, while preserving accuracy
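The tolerant loss the slide describes (errors inside an interval of size e cost nothing) is the epsilon-insensitive loss; the one-liner and naming below are mine, with eps standing for the slide's e:

```python
import numpy as np

def eps_insensitive(r, eps):
    """Tolerant loss: errors inside [-eps, +eps] are ignored;
    outside the interval the penalty grows linearly with |r| - eps."""
    return np.maximum(np.abs(r) - eps, 0.0)
```

Points inside the tube contribute nothing to the objective, so only points outside it ("real" errors, not noise) influence the fitted surface.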

  23. How about a nonlinear surface?

  24. Introduce nonlinear kernel • Begin with previous formulation: • Substitute w = A’a and minimize a instead: • Substitute K(A,A’) for AA’: K(A,A’) = nonlinear kernel function

  25. Our formulation • Equivalent to the Smola, Schölkopf, Rätsch (SSR) formulation • Tolerance as a constraint • Single error bound

  26. Smola, Schölkopf, Rätsch formulation • Multiple error bounds

  27. Reduction in: • Variables: 4m+2 → 3m+2 • Solution time

  28. Our formulation vs. Smola, Schölkopf, Rätsch • The formulations are equivalent • Ours expresses tolerance as a constraint and uses a single error bound; SSR uses multiple error bounds • Result: variables reduced from 4m+2 to 3m+2, and reduced solution time

  29. Natural interpretation for μ • For μ = 0, our linear program is equivalent to the classical stabilized least 1-norm approximation problem • Perturbation theory shows there exists a fixed μ̄ > 0 such that for all μ in (0, μ̄]: • we solve the above stabilized least 1-norm problem • additionally we maximize e, the least error component • As μ goes from 0 to 1, the least error component e is a monotonically nondecreasing function of μ.

  30. Numerical Testing • Two sets of tests • Compare computational times of our method (MM) and the SSR method • Row-column chunking for massive datasets • Datasets: • US Census Bureau Adult Dataset: 300,000 points in R11 • Delve Comp-Activ Dataset: 8192 points in R13 • UCI Boston Housing Dataset: 506 points in R13 • Gaussian noise was added to each of these datasets. • Hardware: Locop2: Dell PowerEdge 6300 server with: • Four gigabytes of memory, 36 gigabytes of disk space • Windows NT Server 4.0 • CPLEX 6.5 solver

  31. Experimental Process • μ is a parameter which needs to be determined experimentally • Use a hold-out tuning set to determine the optimal value for μ • Algorithm:
      μ = 0
      while (tuning set accuracy continues to improve) {
          solve LP
          μ = μ + 0.1
      }
  • Run for both our method and the SSR method and compare times
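The tuning loop above can be sketched as code. Here solve_lp and tuning_error are hypothetical callbacks standing in for the LP solve and the hold-out tuning-set evaluation; neither name comes from the slides:

```python
def tune_mu(solve_lp, tuning_error, step=0.1):
    """Sketch of the slide's tuning loop: increase mu in steps of 0.1
    while held-out tuning error keeps improving.
    solve_lp(mu) returns a fitted model; tuning_error(model) scores it."""
    mu, best_mu, best_model = 0.0, None, None
    best_err = float("inf")
    while mu <= 1.0:
        model = solve_lp(mu)
        err = tuning_error(model)
        if err >= best_err:          # accuracy stopped improving
            break
        best_err, best_mu, best_model = err, mu, model
        mu += step
    return best_mu, best_model
```

The same loop is run for both formulations so the timing comparison covers identical sequences of LP solves.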

  32. Comparison Results

  33. Linear Programming Row Chunking • Basic approach (PSB/OLM) for classification problems • Classification problem is solved for a subset, or chunk, of constraints (data points) • Those constraints with positive multipliers (support vectors) are preserved and integrated into the next chunk • Objective function is monotonically nondecreasing • Dataset is repeatedly scanned until the objective function stops increasing
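The row-chunking scan described above can be skeletonized as follows; solve(rows) is a hypothetical stand-in for the LP solver, assumed to return the objective value and the rows with positive multipliers (the support vectors):

```python
def row_chunking(solve, chunks):
    """Skeleton of the row-chunking loop: solve on a chunk plus the
    current support vectors, keep the rows with positive multipliers,
    and re-scan the dataset until the objective stops increasing."""
    support, prev_obj = [], float("-inf")
    while True:
        improved = False
        for chunk in chunks:
            obj, support = solve(support + list(chunk))
            if obj > prev_obj + 1e-12:
                prev_obj, improved = obj, True
        if not improved:             # full scan with no increase: done
            return support
```

Because support vectors are carried between chunks and the objective is monotonically nondecreasing, the loop terminates once a full pass over the data yields no improvement.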

  34. Innovation: Simultaneous Row-Column Chunking • Row Chunking • Cannot handle problems with large numbers of variables • Therefore: Linear kernel only • Row-Column Chunking • New data increase the dimensionality of K(A,A’) by adding both rows and columns (variables) to the problem. • We handle this with row-column chunking. • General nonlinear kernel

  35. Row-Column Chunking Algorithm
  while (problem termination criteria not satisfied) {
      choose set of rows as row chunk
      while (row chunk termination criteria not satisfied) {
          from row chunk, select set of columns
          solve LP allowing only these columns to vary
          add columns with nonzero values to next column chunk
      }
      add rows with nonzero multipliers to next row chunk
  }

  36.-42. Row-Column Chunking Diagram (repeated animation frames; images not preserved)

  43. Chunking Experimental Results

  44. Objective Value & Tuning Set Error for Billion-Element Matrix

  45. Conclusions and Future Work • Conclusions • Robust regression can be modeled simply and efficiently as a quadratic program • Tolerant Regression can be handled more efficiently using improvements on previous formulations • Row-column chunking is a new approach which can handle massive regression problems • Future work • Chunking via parallel and distributed approaches • Scaling Huber regression to larger problems

  46. Questions?

  47. LP Perturbation Regime #1 • Our LP is given by: • When μ = 0, the solution is the stabilized least 1-norm solution. • Therefore, by LP perturbation theory, there exists a μ̄ > 0 such that the solution to the LP with μ in (0, μ̄] is a solution to the least 1-norm problem that also maximizes e.

  48. LP Perturbation Regime #2 • Our LP can be rewritten as: • Similarly, by LP perturbation theory, there exists a threshold such that the solution to the LP, for μ past this threshold, is the solution that minimizes the least error (e) among all minimizers of the average tolerated error.

  49. Motivation for dual variable substitution • Primal: • Dual:
