Tutorial on Gaussian Processes DAGS ’07 Jonathan Laserson and Ben Packer



  1. 9/10/07 Tutorial on Gaussian Processes DAGS ’07 Jonathan Laserson and Ben Packer

  2. Outline • Linear Regression • Bayesian Inference Solution • Gaussian Processes • Gaussian Process Solution • Kernels • Implications

  3. Linear Regression • Task: Predict y given x

  4. Linear Regression • Predicting Y given X

  5. L2 Regularized Lin Reg • Predicting Y given X
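The equations on these slides are not in the transcript; as a hedged reconstruction, the standard least-squares and L2-regularized (ridge) objectives this setup refers to are:

    % Assumed standard formulation; the original slide images are not in the transcript.
    \[
      \hat{w}_{\text{LS}} = \arg\min_w \, \lVert y - Xw \rVert^2 = (X^\top X)^{-1} X^\top y
    \]
    \[
      \hat{w}_{\text{ridge}} = \arg\min_w \, \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2 = (X^\top X + \lambda I)^{-1} X^\top y
    \]

Here X stacks the inputs as rows, y stacks the targets, and λ is the regularization weight; the ridge solution is also wMAP under a Gaussian prior on w, which is the Bayesian view taken next.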

  6. Bayesian Instead of MAP • Instead of using wMAP = argmax P(y,w|X) to predict y*, why don’t we use the entire distribution P(y,w|X) to estimate P(y*|X,y,x*)? • We have P(y|w,X) and P(w) • Combine these to get P(y,w|X) • Marginalize to get P(y|X) • Same as P(y,y*|X,x*) • Joint Gaussian -> Conditional to get P(y*|y,X,x*)

  7. Bayesian Inference • We have P(y|w,X) and P(w) • Combine these to get P(y,w|X) • Marginalize to get P(y|X) • Same as P(y,y*|X,x*) • Joint Gaussian -> Conditional Gaussian • Error bars!
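The conditioning step itself is not spelled out in the transcript; as a sketch using the standard Gaussian identities, if the observed targets y and the query target y* are jointly Gaussian,

    \[
      \begin{bmatrix} y \\ y_* \end{bmatrix}
      \sim \mathcal{N}\!\left( 0, \begin{bmatrix} A & b \\ b^\top & c \end{bmatrix} \right)
      \quad\Longrightarrow\quad
      y_* \mid y \sim \mathcal{N}\!\left( b^\top A^{-1} y,\; c - b^\top A^{-1} b \right)
    \]

so the conditional mean gives the prediction and the conditional variance gives the error bars.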

  8. Gaussian Process • We saw a distribution over Y directly • Why not start from here? • Instead of choosing a prior over w and defining fw(x), put your prior over f directly • Since y = f(x) + noise, this induces a prior over y • Next… How to put a prior on f(x)

  9. What is a random process? • It’s a prior over functions • A stochastic process is a collection of random variables, f(x), indexed by x • It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk) • In a consistent way!

  10. What is a Gaussian process? • It’s a prior over functions • A stochastic process is a collection of random variables, f(x), indexed by x • It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk) • In a consistent way! • The joint probability of f(x1), f(x2), …, f(xk) is a multivariate Gaussian

  11. What is a Gaussian Process? • It is specified by giving the joint probability of every finite subset of variables f(x1), f(x2), …, f(xk) • In a consistent way! • The joint probability of f(x1), f(x2), …, f(xk) is a multivariate Gaussian • Enough to specify mean and covariance functions • μ(x) = E[f(x)] • C(x,x’) = E[ (f(x) - μ(x)) (f(x’) - μ(x’)) ] • f(x1), …, f(xk) ~ N( [μ(x1), …, μ(xk)], K ), where Ki,j = C(xi, xj) • For simplicity, we’ll assume μ(x) = 0.
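Not from the slides: a minimal Python sketch of this definition, drawing sample functions from a zero-mean GP by evaluating the covariance function on a finite grid (the squared-exponential covariance is just an illustrative choice).

    import numpy as np

    def sample_gp_prior(xs, cov_fn, n_samples=3, jitter=1e-9):
        """Draw sample functions from a zero-mean GP at the finite set of points xs."""
        K = np.array([[cov_fn(a, b) for b in xs] for a in xs])  # K[i, j] = C(x_i, x_j)
        K += jitter * np.eye(len(xs))                           # numerical stabilization
        return np.random.multivariate_normal(np.zeros(len(xs)), K, size=n_samples)

    # Illustrative covariance: C(x, x') = exp(-0.5 (x - x')^2)
    xs = np.linspace(-5, 5, 100)
    samples = sample_gp_prior(xs, lambda a, b: np.exp(-0.5 * (a - b) ** 2))
    print(samples.shape)  # (3, 100): three sample functions evaluated on the grid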

  12. Back to Linear Regression • Recall: Want to put a prior directly on f • Can use a Gaussian Process to do this • How do we choose μ and C? • Use knowledge of prior over w • w ~ N(0, σ2I) • μ(x) = E[f(x)] = E[wTx] = E[wT]x = 0 • C(x,x’) = E[ (f(x) - μ(x)) (f(x’) - μ(x’)) ] = E[f(x)f(x’)] = xTE[wwT]x’ = xT(σ2I)x’ = σ2xTx’ • Can also have f(x) = wTΦ(x)
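A quick numerical sanity check of this derivation (not from the slides, just a sketch): drawing w ~ N(0, σ2I) and estimating E[f(x)f(x’)] by Monte Carlo should recover σ2xTx’.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, d, n_draws = 2.0, 3, 200_000
    x, x_prime = rng.normal(size=d), rng.normal(size=d)

    W = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, d))  # rows are draws of w ~ N(0, sigma^2 I)
    f_x, f_xp = W @ x, W @ x_prime                            # f(x) = w^T x for each draw

    print(np.mean(f_x * f_xp))     # empirical E[f(x) f(x')]
    print(sigma2 * (x @ x_prime))  # sigma^2 x^T x' -- the two should be close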

  13. Back to Linear Regression • μ(x) = 0 • C(x,x’) = σ2xTx’ • f ~ GP(μ,C) • It follows that • f(x1),f(x2),…,f(xk) ~ N(0, K) • y1,y2,…,yk ~ N(0, ν2I + K) • K = σ2XXT • Same as the L2-regularized least squares solution! • If we use a different C, we’ll have a different K
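The predictive equations themselves are not in the transcript; as a hedged reconstruction in the notation above (noise variance ν2, kernel matrix K on the training inputs, and k* the vector of covariances C(xi, x*)), the standard GP regression prediction is:

    \[
      y_* \mid X, y, x_* \sim \mathcal{N}\!\left(
        k_*^\top (K + \nu^2 I)^{-1} y,\;\;
        C(x_*, x_*) + \nu^2 - k_*^\top (K + \nu^2 I)^{-1} k_*
      \right)
    \]

With the linear covariance, K = σ2XXT and k* = σ2Xx*, the predictive mean reduces to x*T(XTX + (ν2/σ2)I)-1XTy, i.e. the L2-regularized least squares prediction with λ = ν2/σ2.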

  14. Kernels • If we use a different C, we’ll have a different K • What do these look like? • Linear: C(x,x’) = σ2xTx’ • Poly • Gaussian

  15. Kernels • If we use a different C, we’ll have a different K • What do these look like? • Linear • Poly: C(x,x’) = (1+xTx’)2 • Gaussian

  16. Kernels • If we use a different C, we’ll have a different K • What do these look like? • Linear • Poly • Gaussian: C(x,x’) = exp{-0.5*(x-x’)2}
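Not part of the slides: a small Python sketch of these three covariance functions and the Gram matrices they induce (hyperparameters as written on the slides).

    import numpy as np

    def linear_kernel(x, x_prime, sigma2=1.0):
        return sigma2 * np.dot(x, x_prime)           # C(x, x') = sigma^2 x^T x'

    def poly_kernel(x, x_prime):
        return (1.0 + np.dot(x, x_prime)) ** 2       # C(x, x') = (1 + x^T x')^2

    def gaussian_kernel(x, x_prime):
        return np.exp(-0.5 * np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2))

    def gram_matrix(xs, kernel):
        """K[i, j] = C(x_i, x_j) for a list of inputs."""
        return np.array([[kernel(a, b) for b in xs] for a in xs])

    xs = np.linspace(-3, 3, 5).reshape(-1, 1)        # five 1-D inputs
    for k in (linear_kernel, poly_kernel, gaussian_kernel):
        print(k.__name__)
        print(gram_matrix(xs, k).round(2))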

  17. End

  18. Learning a kernel • Parameterize a family of kernel functions using θ • Learn K using gradient of likelihood
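The objective is not spelled out on the slide; as a standard sketch, the quantity maximized is the log marginal likelihood of the targets, differentiated with respect to the kernel parameters θ (writing Ky = Kθ + ν2I):

    \[
      \log p(y \mid X, \theta)
        = -\tfrac{1}{2}\, y^\top K_y^{-1} y
          - \tfrac{1}{2} \log \lvert K_y \rvert
          - \tfrac{n}{2} \log 2\pi
    \]
    \[
      \frac{\partial}{\partial \theta_j} \log p(y \mid X, \theta)
        = \tfrac{1}{2}\, y^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} y
          - \tfrac{1}{2}\, \operatorname{tr}\!\left( K_y^{-1} \frac{\partial K_y}{\partial \theta_j} \right)
    \]

These gradients can then be handed to any gradient-based optimizer to fit θ.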

  19. GP Graphical Model

  20. Starting point • For details, see • Rasmussen’s NIPS 2006 Tutorial • http://www.kyb.mpg.de/bs/people/carl/gpnt06.pdf • Williams’s Gaussian Processes paper • http://www.dai.ed.ac.uk/homes/ckiw/postscript/hbtnn.ps.gz • GPs for classification (approximation) • Sparse methods • Connection to SVMs

  21. Your thoughts…
