
Parameter Tuning with Response Surface Models

This presentation gives an update on work in progress on parameter tuning based on response surface models. The focus is on optimizing runtime and using predictive models to improve algorithm configurations. The presenter covers motivation, the problem setting, learning predictive models, desired properties of the model, and future work.





Presentation Transcript


  1. Parameter tuning based on response surface models: an update on work in progress. EARG, Feb 27th, 2008. Presenter: Frank Hutter

  2. Motivation • Parameter tuning is important • Recent approaches (ParamILS, racing, CALIBRA) “only” return the best parameter configuration • Extra information would be nice, e.g. • The most important parameter is X • The effect of parameters X and Y is largely independent • For parameter X options 1 and 2 are bad, 3 is best, 4 is decent • ANOVA is one tool for that, but has limitations (e.g. discretization of parameters, linear model)

  3. More motivation • Support the actual design process by providing feedback about parameters • E.g. parameter X should always be i (code gets simpler!!) • Predictive models of runtime are widely applicable • Prediction can be updated based on new information (such as “the algorithm has been unsuccessfully running for X seconds”) • (True) portfolios of algorithms • Once we can learn a function f: Θ → runtime, learning a function g: Θ × X → runtime should be a simple extension (X = inst. charac., Lin learns h: X → runtime)

  4. The problem setting • For now: static algorithm configuration, i.e. find the best fixed parameter setting across instances • But as mentioned above this approach extends to PIAC (per-instance algorithm configuration) • Randomized algorithms: variance for a single instance (runtime distributions) • High inter-instance variance in hardness • We focus on minimizing runtime • But the approach also applies to other objectives • (Special treatment of censoring and of the cost for gathering a data point is then simply not necessary) • We focus on optimizing averages across instances • Generalization to other objectives may not be straightforward

  5. Learning a predictive model • Supervised learning problem, regression • Given training data (x1, o1), …, (xn, on), learn function f such that f(xi) ≈ oi • What is a data point xi? • 1) Predictive model of average cost • Average of how many instances/runs? • Not too many data points, but each one very costly • Doesn’t have to be average cost, could be anything • 2) Predictive model of single costs, get average cost by aggregation • Have to deal with tens of thousands of data points • If predictions are Gaussian, the aggregates are Gaussian (means and variances add)
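The aggregation claim on this slide (means and variances add) can be sketched as follows; a minimal illustration, assuming the n single-cost predictions are independent Gaussians (the helper name is hypothetical, not from the talk):

```python
def aggregate_gaussians(means, variances):
    """Predictive distribution of the average cost over n instances,
    assuming the n single-cost predictions are independent Gaussians.
    The mean of the average is the average of the means; the variance
    of the average is the sum of the variances divided by n^2."""
    n = len(means)
    agg_mean = sum(means) / n
    agg_var = sum(variances) / n ** 2
    return agg_mean, agg_var

# e.g. three per-instance predictions aggregated into one Gaussian
m, v = aggregate_gaussians([2.0, 4.0, 6.0], [1.0, 1.0, 2.0])
```

This is why option 2 can still deliver the average-cost predictions that option 1 models directly, at the price of handling far more data points.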

  6. Desired properties of model • 1) Discrete and continuous inputs • Parameters are discrete/continuous • Instance features are (so far) all continuous • 2) Censoring • When a run times out we only have a lower bound on its true runtime • 3) Scalability: tens of thousands of points • 4) Explicit predictive uncertainties • 5) Accuracy of predictions • Considered models: • Linear regression (basis functions? especially for discrete inputs) • Regression trees (no uncertainty estimates) • Gaussian processes (4 & 5 ok, 1 done, 2 almost done, hopefully 3)

  7. Coming up • 1) Implemented: model average runtimes, optimize based on that model • Censoring “almost” integrated • 2) Further TODOs: • Active learning criterion under noise • Scaling: Bayesian committee machine

  8. Active learning for function optimization • EGO [Jones, Schonlau & Welch, 1998] • Assumes deterministic functions • Here: averages over 100 instances • Start with a Latin hypercube design • Run the algorithm, get (xi, oi) pairs • While not terminated • Fit the model (kernel parameter optimization, all continuous) • Find best point to sample (optimization in the space of parameter configurations) • Run the algorithm at that point, add new (x, y) pair
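The loop above can be sketched as a toy skeleton; `run_algorithm` is a stand-in quadratic rather than a real solver, and the model-fitting and expected-improvement steps are left as placeholders (all names here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def latin_hypercube(n_points, n_dims):
    """Simple Latin hypercube design on [0,1]^d: one sample per
    stratum in each dimension, with stratum order shuffled per dim."""
    grid = (np.arange(n_points) + rng.random(n_points)) / n_points
    return np.column_stack([rng.permutation(grid) for _ in range(n_dims)])

def run_algorithm(x):
    """Toy stand-in for running the tuned algorithm at configuration x."""
    return float(np.sum((x - 0.3) ** 2))

def ego_loop(n_init=5, n_iters=3, n_dims=2):
    X = latin_hypercube(n_init, n_dims)          # initial design
    y = np.array([run_algorithm(x) for x in X])  # observed costs
    for _ in range(n_iters):
        # fit_model(X, y)   -- GP fit incl. kernel parameter optimization
        # x_next = argmax of expected improvement over configurations
        x_next = rng.random(n_dims)              # placeholder for EI optimizer
        X = np.vstack([X, x_next])
        y = np.append(y, run_algorithm(x_next))
    return X, y
```

The Latin hypercube guarantees that each dimension's range is covered evenly by the initial design, which is the usual EGO starting point before the model-guided sampling takes over.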

  9. Active learning criterion • EGO uses maximum expected improvement • EI(x) = ∫ p(y | μx, σ²x) max(0, f_min − y) dy • Easy to evaluate (can be solved in closed form) • Problem in EGO: sometimes not the actual runtime y is modeled, but a transformation, e.g. log(y) • Expected improvement then needs to be adapted: • EI(x) = ∫ p(y | μx, σ²x) max(0, f_min − exp(y)) dy • Easy to evaluate (can still be solved in closed form) • Take into account cost of sample: • EI(x) = ∫ p(y | μx, σ²x) 1/exp(y) max(0, f_min − exp(y)) dy • Easy to evaluate (can still be solved in closed form) • Not implemented yet (the others are implemented)
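For the first (untransformed) criterion, the closed form is the standard EGO expression EI = (f_min − μ)Φ(z) + σφ(z) with z = (f_min − μ)/σ; a minimal sketch, assuming a Gaussian prediction N(μx, σ²x) and minimization:

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for a Gaussian prediction N(mu, sigma^2) when
    minimizing: EI = (f_min - mu) * Phi(z) + sigma * phi(z),
    with z = (f_min - mu) / sigma."""
    if sigma <= 0.0:
        return max(0.0, f_min - mu)
    z = (f_min - mu) / sigma
    return (f_min - mu) * normal_cdf(z) + sigma * normal_pdf(z)
```

The log-transformed and cost-weighted variants on the slide change the integrand (exp(y) in place of y, and a 1/exp(y) factor) but, as stated, still admit closed forms.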

  10. How to optimize exp. improvement? • Currently only 3 algorithms to be tuned: • SAPS (4 continuous params) • SPEAR (26 parameters, about half of them discrete) • For now continuous ones are discretized • CPLEX (60 params, 50 of them discrete) • For now continuous ones are discretized • Purely continuous / purely discrete optimization • DIRECT / multiple-restart local search

  11. GPs: which kernel to use? • Kernel: distance measure between two data points • Low distance → high correlation • Squared exponential, Matérn, etc.: • SE: k(x, x′) = σs exp(−∑ λi (xi − xi′)²) • For discrete parameters: new Hamming distance kernel • σs exp(−∑ λi [xi ≠ xi′]) • Positive definite by reduction to string kernels • “Automatic relevance determination” • One length scale parameter λi per dimension • Many kernel parameters lead to • Problems with overfitting • Very long runtimes for kernel parameter optimization • For CPLEX: 60 extra parameters, about 15h for a single kernel parameter optimization using DIRECT, without any improvement • Thus: no length scale parameters. Only two parameters: noise σn, and overall variability of the signal, σs
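The two kernels, with the slide's simplification of a single shared length-scale parameter instead of one per dimension, can be sketched as (function and parameter names are illustrative):

```python
import math

def se_kernel(x, xp, sigma_s=1.0, lam=1.0):
    """Squared-exponential kernel for continuous inputs with one
    shared length-scale lam (no per-dimension ARD parameters)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return sigma_s * math.exp(-lam * d2)

def hamming_kernel(x, xp, sigma_s=1.0, lam=1.0):
    """Hamming-distance kernel for discrete parameters:
    k(x, x') = sigma_s * exp(-lam * #{i : x_i != x'_i})."""
    d = sum(1 for a, b in zip(x, xp) if a != b)
    return sigma_s * math.exp(-lam * d)
```

With the Hamming kernel, two configurations that agree on every discrete parameter are maximally correlated, and each disagreement discounts the correlation by a factor exp(−lam).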

  12. Continuing from last week … where were we? • Start with a Latin hypercube design • Run the algorithm, get (xi, oi) pairs • While not terminated • Fit the model (kernel parameter optimization, all continuous) • Haven’t covered yet, coming up • Censoring will come in here • Find best point to sample (optimization in the space of parameter configurations) • Covered last week • Run the algorithm at that point, add new (x, y) pair

  13. How to optimize kernel parameters? • Objective • Standard: maximizing marginal likelihood p(o) = ∫ p(o | f) p(f) df • Doesn’t work under censoring • Alternative: maximizing likelihood of unseen data using cross-validation p(otest | μtest, Σtest) • Efficient when not too many folds k are used: • Marginal likelihood requires inversion of an N by N matrix • Cross-validation with k = 2 requires inversions of two N/2 by N/2 matrices. In practice faster for large N • Algorithm • Using DIRECT (DIviding RECTangles), a global sampling-based method (does not scale to high dim)
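The efficiency argument can be made concrete with a crude cost model (one inversion of an n × n matrix costs ~n³ flops; this simplification ignores constants and the fold count of repeated optimizations):

```python
def inversion_flops(n):
    """Crude cost model: inverting an n x n matrix costs ~n^3 flops."""
    return n ** 3

N = 10_000
marginal = inversion_flops(N)            # marginal likelihood: one N x N inverse
cv_2fold = 2 * inversion_flops(N // 2)   # k=2 CV: two (N/2) x (N/2) inverses
speedup = marginal / cv_2fold            # 2 * (N/2)^3 = N^3 / 4, so ~4x cheaper
```

Under this model, 2-fold cross-validation needs a quarter of the work of the full marginal likelihood, which matches the slide's observation that it is faster in practice for large N.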

  14. Censoring complicates predictions • p(f1:N | o1:N) ∝ p(f1:N) × p(o1:N | f1:N), both Gaussian • For a censored data point oi, p(oi | fi) = Φ((oi − μi)/σi), not Gaussian at all • But the product p(f1:N | o1:N) ∝ p(f1:N) × p(o1:N | f1:N) is closer to Gaussian • Laplace approximation: find the mode of p(f1:N | o1:N), use the Hessian at that point as a second-order approximation of the precision matrix • Finding the mode: gradient- & Hessian-based numerical optimization in N dimensions, where N = number of data points • Without censoring closed form, but still O(N³) • How to score a kernel parameter configuration? • Cross-validated likelihood of unseen test data under the predictive distribution • I.e. for each fold, learn a model under censoring, and predict the unseen validation data

  15. Don’t use censored data, 4s

  16. Treat as “completed at threshold”, 4s

  17. Laplace approximation to posterior, 10s

  18. Schmee & Hahn, 21 iterations, 36s

  19. Anecdotal: Lin’s original implementation of Schmee & Hahn, on my machine – beware of normpdf

  20. TODO: Active learning under noise • [Williams, Santner, and Notz, 2000] • Very heavy on notation • But there is good stuff in there • 1) Actively choose a parameter setting • Best setting so far is not known → fmin is now a random variable • Take joint samples f1:N(i) of performance from the predictive distribution for all settings tried so far (sample from our Gaussian approximation to p(f1:N | o1:N)) • Take the min of those samples, compute expected improvement as if that min were the deterministic fmin • Average the exp. improvements computed for 100 independent samples • Efficiency: the most costly part in evaluating expected improvement at a parameter configuration is the probabilistic prediction with the GP; even with many samples we only need to predict once • 2) Actively choose an instance to run for that parameter setting: minimize posterior variance
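Step 1 above can be sketched as a Monte Carlo wrapper around the deterministic closed-form EI; a minimal illustration, assuming a Gaussian posterior over the settings tried so far (all names here are hypothetical):

```python
import math

import numpy as np

rng = np.random.default_rng(1)

def closed_form_ei(mu, sigma, f_min):
    """Deterministic-f_min expected improvement (minimization)."""
    if sigma <= 0.0:
        return max(0.0, f_min - mu)
    z = (f_min - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_min - mu) * Phi + sigma * phi

def mc_expected_improvement(mu_x, sigma_x, post_mean, post_cov, n_samples=100):
    """f_min is a random variable: draw joint samples of performance
    at all settings tried so far, take each sample's minimum as f_min,
    evaluate the deterministic EI for that f_min, and average."""
    samples = rng.multivariate_normal(post_mean, post_cov, size=n_samples)
    f_mins = samples.min(axis=1)  # one f_min per joint sample
    return float(np.mean([closed_form_ei(mu_x, sigma_x, fm) for fm in f_mins]))
```

Note the efficiency point from the slide: the GP prediction (mu_x, sigma_x) is computed once per candidate configuration, and only the cheap closed-form EI is re-evaluated per sample.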

  21. TODO: Integrating expected cost into AL criterion • EI criterion discussed last time that takes into account the cost of a sample: • EI(x) = ∫ p(y | μx, σ²x) 1/exp(y) max(0, f_min − exp(y)) dy • Easy to evaluate (can still be solved in closed form) • The above approach for noisy functions re-uses EI for deterministic functions, so it could use this • Open question: should the cost be taken into account when selecting an instance for that parameter setting? • Another open question: how to select the censoring threshold? • Something simple might suffice, such as picking a cutoff equal to the predicted runtime or to the best runtime so far • Integration bounds in expected improvement would change, but nothing else

  22. TODO: scaling • Bayesian committee machine • More or less a mixture of GPs, each of them on a small subset of the data (cluster the data ahead of time) • Fairly straightforward wrapper around GP code (actually around any code that provides Gaussian predictions) • Maximizing cross-validated performance is easy • In principle could update by just updating one component at a time • But in practice once we re-optimize kernel parameters we’re changing every component anyway • Likewise we can do rank-1 updates for the basic GPs, but a single matrix inversion is really not the expensive part (rather the 1000s of matrix inversions for kernel parameter optimization)
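The combination rule for such a committee can be sketched as below, following the Bayesian committee machine of Tresp (2000); this assumes a zero-mean prior and M modules trained on disjoint data subsets, and is a simplified scalar version of the full matrix formula:

```python
def bcm_combine(means, variances, prior_var):
    """Bayesian committee machine prediction at a single test point:
    combine M GP modules trained on disjoint data subsets. Module
    precisions add, and (M - 1) copies of the prior precision are
    subtracted because each module's posterior already contains the
    prior once. Assumes a zero-mean prior."""
    M = len(means)
    precision = sum(1.0 / v for v in variances) - (M - 1) / prior_var
    var = 1.0 / precision
    mean = var * sum(m / v for m, v in zip(means, variances))
    return mean, var
```

With M = 1 this reduces to the single module's prediction, which is why it works as a thin wrapper around any code that provides Gaussian predictions.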

  23. Preliminary results and demo • Experiments with noise-free kernel • Great cross-validation results for SPEAR & CPLEX • Poor cross-validation results for SAPS • Explanation • Even when averaging over 100 instances, the response is NOT noise-free • SAPS is continuous: • can pick configurations arbitrarily close to each other • if results differ substantially, the SE kernel must have huge variance → very poor results • Matérn kernel works better for SAPS

  24. Future work (figures from EGO paper) • We can get main effects and interaction effects, much like in ANOVA • The integrals seem to be solvable in closed form • We can get plots of predicted mean and variance as one parameter is varied, marginalized over all others • Similarly as two or three are varied • This allows for plots of interactions
