
GAPS IN OUR KNOWLEDGE?

Explore the challenges posed by gaps in data modelling and how to work around them. Topics covered include sampling, density and distribution, representation, regularization, and cross-validation. Learn how to infer behaviour between samples when data are scarce, why validation and generalization matter, how dimensionality drives the difficulty, and how basis functions, overfitting, and underfitting enter the picture. Cross-validation is introduced as a method for realistic performance estimation.



Presentation Transcript


  1. GAPS IN OUR KNOWLEDGE? • a central problem in data modelling and how to get round it • Rob Harrison, AC&SE

  2. Agenda • sampling • density & distribution • representation • accuracy vs generality • regularization • trading off accuracy and generality • cross validation • what happens when we can’t see

  3. The Data Modelling Problem • y = f(x), z = y + e • multivariate & non-linear • measurement errors • {x_i, z_i}, i = 1..N, with z_i = f(x_i) + e_i • infer behaviour everywhere from a few examples • little or no prior information on f(x)
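A minimal NumPy sketch of this setup; the sine-wave target, sample size, and Gaussian noise level are assumptions borrowed from the running example on the following slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # the unknown target: here one cycle of a sine wave,
    # matching the running example in the later slides
    return np.sin(2 * np.pi * x)

N = 6                                  # a few examples
x = np.linspace(0.0, 1.0, N)           # sample locations x_i
e = 0.1 * rng.standard_normal(N)       # measurement errors e_i (assumed Gaussian)
z = f(x) + e                           # observations z_i = f(x_i) + e_i
```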

  4. A Simple Example • one cycle of a sine wave • “well-spaced” samples • enough data (N = 6) • noise-free

  5. “observational” data

  6. Sparsely Sampled • two well-spaced samples

  7. Densely Sampled • 200 well-spaced samples

  8. Large N • 200 poorly-spaced samples
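One way to see the difference between large N and good coverage (a sketch, assuming samples on [0, 1]): random placement can leave large gaps even when N is large.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200

x_well = np.linspace(0.0, 1.0, N)             # well-spaced: even coverage
x_poor = np.sort(rng.uniform(0.0, 1.0, N))    # poorly-spaced: random draws leave clumps and gaps

# the largest gap between neighbouring samples is a crude measure of coverage
print("largest gap, well-spaced  :", np.diff(x_well).max())
print("largest gap, poorly-spaced:", np.diff(x_poor).max())
```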

  9. What’s So Hard? • the gaps • get more (well-spaced) data • lack of prior knowledge • can’t see • dimension!!

  10. Two Dimensions • same sampling density (N = 6² = 36) • well-spaced?

  11. Dimensionality • lose ability to see the shape of f(x) • try it in 13-D • number of samples exponential in d • if N is OK in 1-D, ~N^d are needed in d-D • how do we know if “well-spaced”? • how can we sample where the action is? • observational vs experimental data!
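A short illustration of that exponential growth, keeping the per-axis density of the toy example's N = 6 points (the dimensions listed are arbitrary):

```python
# samples needed to keep the per-axis density of N points in 1-D
N = 6
for d in (1, 2, 3, 5, 10, 13):
    print(f"{d}-D: {N**d:,} samples")
```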

  12. Generic Structure • use e.g. a power series • Stone-Weierstrass • other bases e.g. Fourier series … • y = a5x^5 + a4x^4 + a3x^3 + … + a1x + a0 • d = 5 & six samples – no error!
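A sketch of that exact fit using NumPy's polynomial tools (the sine target and the six well-spaced samples are the slides' running example); the final check on a fine grid anticipates the question of inter-sample behaviour raised a few slides later:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 6)            # six well-spaced samples
y = np.sin(2 * np.pi * x)               # noise-free targets (the sine example)

coeffs = np.polyfit(x, y, deg=5)        # degree-5 power series: six coefficients, six equations
print("max error at the samples:",
      np.abs(np.polyval(coeffs, x) - y).max())   # ~ machine precision: "no error!"

# but zero error at the samples says nothing about the gaps in between
x_fine = np.linspace(0.0, 1.0, 200)
print("max error between samples:",
      np.abs(np.polyval(coeffs, x_fine) - np.sin(2 * np.pi * x_fine)).max())
```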

  13. PROBLEM SOLVED!

  14. Generic Structure • use e.g. a power series • Stone-Weierstrass • other bases e.g. Fourier series … • y = a5x^5 + a4x^4 + a3x^3 + … + a1x + a0 • d = 5 & six samples – no error! • d > 5 still no error but …

  15. poor inter-sample behaviour • how can we know without looking?

  16. Generic Structure • use e.g. a power series • Stone-Weierstrass • other bases e.g. Fourier series … • y = a5x^5 + a4x^4 + a3x^3 + … + a1x + a0 • d = 5 & six samples – no error! • d > 5 still no error but … • measurement error

  17. Generic Structure • use e.g. a power series • Stone-Weierstrass • other bases e.g. Fourier series … • y = a5x^5 + a4x^4 + a3x^3 + … + a1x + a0 • d = 5 & six samples – no error! • d > 5 still no error but … • measurement error • model is as complex as data

  18. Curse Of Dimension • we can still use the idea but … • in 2-D a degree-5 polynomial has 21 terms • direct and cross products • in d-D a degree-p polynomial has (d+p)!/(d!·p!) terms • e.g. transform a 16×16 bitmap (256 inputs) by a degree-3 polynomial and get ~3 million terms • sample size / distribution • practical for “small” problems
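The term count is easy to check with a binomial coefficient (a sketch; `math.comb` needs Python 3.8+):

```python
from math import comb

def n_terms(d, p):
    # number of monomials of total degree <= p in d variables
    return comb(d + p, p)

print(n_terms(2, 5))      # 21 terms: degree-5 polynomial in 2-D
print(n_terms(256, 3))    # 2,862,209 terms: degree-3 polynomial on a 16x16 bitmap (256 inputs)
```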

  19. Other Basis Functions • Gaussian radial basis functions • additional design choices • how many?, where?, how wide? • adaptive sigmoidal basis functions • how many?
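A sketch of what those design choices look like for Gaussian RBFs in 1-D; the centre count and width below are arbitrary placeholders, not the slides' values:

```python
import numpy as np

def grbf_design(x, centres, width):
    # one Gaussian radial basis function per centre:
    #   phi_j(x) = exp(-(x - c_j)^2 / (2 * width^2))
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

x = np.linspace(0.0, 1.0, 6)
z = np.sin(2 * np.pi * x)

centres = np.linspace(0.0, 1.0, 4)            # how many? where?
Phi = grbf_design(x, centres, width=0.1)      # how wide?
w, *_ = np.linalg.lstsq(Phi, z, rcond=None)   # least-squares weights
```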

  20. Overfitting • zero error on the sample data? • rough components

  21. Underfitting • over-smooth components (same sample data)

  22. Goldilocks • very small error • “just right” components
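The same Goldilocks picture can be reproduced numerically with polynomial degree standing in for model flexibility (a sketch; the degrees and noise level are illustrative, not the slides' GRBF models):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
z = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(6)   # noisy samples
x_fine = np.linspace(0.0, 1.0, 200)
truth = np.sin(2 * np.pi * x_fine)

for deg in (1, 3, 5):   # too stiff, roughly right, interpolates the noise
    c = np.polyfit(x, z, deg)
    sample_rmse = np.sqrt(np.mean((np.polyval(c, x) - z) ** 2))
    true_rmse = np.sqrt(np.mean((np.polyval(c, x_fine) - truth) ** 2))
    print(f"degree {deg}: sample RMSE {sample_rmse:.3f}, true RMSE {true_rmse:.3f}")
```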

  23. Restricting “Flexibility” • can we use data to tell the estimator how to behave? • regularization/penalization • penalize roughness • e.g. minimise SSE + r·Q, with penalty weight r and roughness measure Q • use a potentially complex structure • the data constrain it where they can • Q constrains it elsewhere
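The slides keep Q generic; a minimal sketch of penalized least squares, assuming the simplest concrete choice Q = ||w||² (the ridge / Tikhonov penalty) and reusing the GRBF design matrix Phi and targets z from the earlier sketch:

```python
import numpy as np

def penalized_fit(Phi, z, r):
    # minimise SSE + r * Q with Q = ||w||^2 (ridge / Tikhonov penalty);
    # closed form: w = (Phi'Phi + r*I)^-1 Phi'z
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + r * np.eye(n), Phi.T @ z)

# w_rough  = penalized_fit(Phi, z, r=0.0)   # unpenalized: fits the samples, can be rough
# w_smooth = penalized_fit(Phi, z, r=0.1)   # penalized: smoother, small bias at the samples
```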

  24. four narrow GRBFs, RMSE = 0.80

  25. four narrow GRBFs + penalty, RMSE = 0.24

  26. How Well Are We Doing? • so far we know the answer • for d > 2 we are “blind” • what’s happening between samples? • test on new samples in between: VALIDATION • compute a performance index: GENERALIZATION

  27. Hold-out Method • keep back P% for testing • wasteful, sample dependent • Training RMSE 0.23, Testing RMSE 0.38
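A sketch of the split itself (the 25% test fraction here is an arbitrary choice of P):

```python
import numpy as np

def hold_out_split(x, z, test_fraction=0.25, seed=0):
    # keep back a fraction of the sample for testing; train on the rest
    idx = np.random.default_rng(seed).permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test, train = idx[:n_test], idx[n_test:]
    return x[train], z[train], x[test], z[test]
```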

  28. Cross Validation • leave-one-out CV • train on all but one • test that one • repeat N times • compute performance • m-fold CV • divide sample into m non-overlapping sets • proceed as above • all data used for training and testing • more work but realistic performance estimates • used to choose “hyper-parameters” • e.g. r, number, width …
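A sketch of m-fold cross-validation (leave-one-out is the special case k = N); `fit` and `predict` are placeholder callables, and the commented hyper-parameter search with `candidate_rs` and `make_fit` is hypothetical:

```python
import numpy as np

def k_fold_rmse(x, z, fit, predict, k=5, seed=0):
    # split the sample into k non-overlapping folds, train on k-1 folds,
    # test on the held-out fold, and average the squared error
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    sq_errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], z[train])
        sq_errs.append(np.mean((predict(model, x[test]) - z[test]) ** 2))
    return np.sqrt(np.mean(sq_errs))

# choosing a hyper-parameter, e.g. the penalty weight r:
#   best_r = min(candidate_rs, key=lambda r: k_fold_rmse(x, z, make_fit(r), predict))
```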

  29. Conclusion • gaps in our data cause most of our problems • noise • more data is only part of the answer • distribution / parsimony • restricting flexibility helps • if no evidence to the contrary • cross-validation is a window into d-D • estimates inter-sample behaviour
