Ch 3. Linear Models for Regression (1/2)

Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Previously summarized by Yung-Kyun Noh

Modified and presented by Rhee, Je-Keun

Biointelligence Laboratory, Seoul National University

http://bi.snu.ac.kr/

- 3.1 Linear Basis Function Models
- 3.1.1 Maximum likelihood and least squares
- 3.1.2 Geometry of least squares
- 3.1.3 Sequential learning
- 3.1.4 Regularized least squares
- 3.1.5 Multiple outputs

- 3.2 The Bias-Variance Decomposition
- 3.3 Bayesian Linear Regression
- 3.3.1 Parameter distribution
- 3.3.2 Predictive distribution
- 3.3.3 Equivalent kernel

(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

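- Linear regression: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D$
- Linear model: $y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$
- Linearity in the parameters
- Using basis functions allows y to be a nonlinear function of the input vector x.
- Simplifies the analysis of this class of models
- Has some significant limitations
- M: total number of parameters
- $\phi_j(\mathbf{x})$: basis functions ($\phi_0(\mathbf{x}) = 1$: dummy basis function)
- $\mathbf{w} = (w_0, \dots, w_{M-1})^T$, $\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^T$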

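- Polynomial functions: $\phi_j(x) = x^j$
- Global functions of the input variable; spline functions are a piecewise alternative
- Gaussian basis functions: $\phi_j(x) = \exp\{-(x - \mu_j)^2 / (2 s^2)\}$
- Sigmoidal basis functions: $\phi_j(x) = \sigma((x - \mu_j)/s)$
- Logistic sigmoid function: $\sigma(a) = 1 / (1 + \exp(-a))$

- Fourier basis, wavelets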
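
The basis functions listed above can be evaluated on a set of inputs to give a matrix whose rows are the feature vectors of individual points. The following is a minimal NumPy sketch, not the presenters' code; the function names, the centers, and the width `s = 0.5` are assumptions chosen for illustration.

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns x**0, x**1, ..., x**(M-1); column 0 is the dummy basis function."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, centers, s):
    """Dummy column of ones followed by one Gaussian bump per center."""
    bumps = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), bumps])

def sigmoidal_basis(x, centers, s):
    """Dummy column of ones followed by one logistic sigmoid per center."""
    a = (x[:, None] - centers[None, :]) / s
    return np.hstack([np.ones((x.size, 1)), 1.0 / (1.0 + np.exp(-a))])

x = np.linspace(-1.0, 1.0, 5)          # illustrative inputs
centers = np.linspace(-1.0, 1.0, 3)    # illustrative centers mu_j
Phi_poly = polynomial_basis(x, 4)
Phi_gauss = gaussian_basis(x, centers, 0.5)
Phi_sig = sigmoidal_basis(x, centers, 0.5)
```

Each row of the returned arrays holds the basis function values for one input, with the dummy basis function in column 0.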

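- Assumption: Gaussian noise model, $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$
- $\epsilon$: zero-mean Gaussian random variable with precision (inverse variance) $\beta$.

- Result: $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
- Conditional mean: $\mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt = y(\mathbf{x}, \mathbf{w})$ (unimodal)

- For a data set $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ with targets $\mathbf{t} = (t_1, \dots, t_N)^T$
- Likelihood (dropping the explicit x): $p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$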

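- Log-likelihood: $\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$, where $E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2$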
- Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function.

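- The gradient of the log likelihood function: $\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\} \boldsymbol{\phi}(\mathbf{x}_n)^T$
- Setting the gradient of the log likelihood to zero gives $\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$,
where $\boldsymbol{\Phi}$ is the N×M design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$.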
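
As a rough illustration of the maximum likelihood solution, the sketch below solves the normal equations on synthetic data and checks that the gradient vanishes there. The data, the polynomial basis, and all variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, N)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)   # synthetic targets

# N x M design matrix with Phi[n, j] = x_n ** j (polynomial basis)
Phi = np.vander(x, M, increasing=True)

# Normal equations: (Phi^T Phi) w = Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# lstsq solves the same least-squares problem without forming Phi^T Phi
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Gradient of the log likelihood (up to the factor beta) at w_ML
grad = Phi.T @ (t - Phi @ w_normal)
```

In practice `np.linalg.lstsq` is usually preferable to explicitly inverting the product of the design matrix with its transpose, since it tends to be better behaved when the basis functions are nearly collinear.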

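- Other solutions obtained by setting derivatives to zero:
- Bias maximizing the log likelihood: $w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j$, with $\bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n$ and $\bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(\mathbf{x}_n)$
- The bias compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.

- Noise precision parameter maximizing the log likelihood: $\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{t_n - \mathbf{w}_{ML}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2$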
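
The bias and noise precision results can be checked numerically. This is a small sketch on synthetic data; the true line, the noise level `true_sigma`, and the quadratic basis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(0.0, 1.0, N)
true_sigma = 0.2                                   # assumed noise level
t = 1.0 + 2.0 * x + rng.normal(0.0, true_sigma, N)

# Basis functions: dummy (ones), x, x**2
Phi = np.column_stack([np.ones(N), x, x ** 2])
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1 / beta_ML is the mean squared residual around the fitted function
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

# Bias: w0 = mean(t) - sum_j w_j * mean(phi_j) over the non-dummy basis functions
w0_from_formula = t.mean() - w_ml[1:] @ Phi[:, 1:].mean(axis=0)
```

Here `1 / np.sqrt(beta_ml)` should land near the noise level used to generate the data, and `w0_from_formula` should agree with the first component of `w_ml`.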

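- If the number M of basis functions is smaller than the number N of data points, then the M vectors $\boldsymbol{\varphi}_j$ will span a linear subspace S of dimensionality M (assuming they are linearly independent).
- $\boldsymbol{\varphi}_j$: jth column of $\boldsymbol{\Phi}$

- y: linear combination of the vectors $\boldsymbol{\varphi}_j$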
- The least-squares solution for w corresponds to that choice of y that lies in subspace S and that is closest to t.
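
The geometric picture can be verified numerically: the fitted vector is the orthogonal projection of t onto the column space of the design matrix. The sketch below uses a random matrix in place of a real design matrix, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
Phi = rng.normal(size=(N, M))      # columns are the vectors spanning S
t = rng.normal(size=N)

w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w                        # orthogonal projection of t onto S

# The residual t - y is orthogonal to every column of Phi
orth = Phi.T @ (t - y)

# Any other point in S is at least as far from t as y is
y_other = Phi @ (w + rng.normal(size=M))
dist_best = np.linalg.norm(t - y)
dist_other = np.linalg.norm(t - y_other)
```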

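- On-line learning
- Technique of stochastic gradient descent (or sequential gradient descent): $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n$
- For the sum-of-squares error function (the least-mean-squares or LMS algorithm): $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \, (t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}_n) \, \boldsymbol{\phi}_n$, where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$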
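
A minimal sketch of the LMS update is given below. The learning rate `eta`, the number of epochs, and the noise-free synthetic targets are assumptions picked so that the iteration visibly converges; they are not values from the slides.

```python
import numpy as np

def lms(Phi, t, eta=0.05, epochs=200, seed=0):
    """Sequential least-mean-squares updates, one data point at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):          # visit points in random order
            w += eta * (t[n] - w @ Phi[n]) * Phi[n]
    return w

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 100)
Phi = np.column_stack([np.ones_like(x), x])        # dummy basis plus x
t = 0.5 - 1.5 * x                                  # noise-free targets
w_lms = lms(Phi, t)
```

With noisy targets a constant `eta` leaves the weights fluctuating around the least-squares solution, so a decaying step size is a common design choice.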

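- Regularized least squares
- Control over-fitting
- Total error function: $E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$
- Closed-form solution (setting the gradient to zero): $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$
- This represents a simple extension of the least-squares solution.

- A more general regularizer: $\frac{1}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$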
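
For the quadratic regularizer (q = 2), the closed-form solution is a one-line change to the least-squares code. The sketch below is illustrative only: the helper name `ridge_fit`, the degree-9 polynomial basis, and the two values of the regularization coefficient are assumptions.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t; all weights, including w0, are penalized."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 15)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)
Phi = np.vander(x, 10, increasing=True)            # degree-9 polynomial basis

w_small = ridge_fit(Phi, t, 1e-6)                  # almost unregularized
w_large = ridge_fit(Phi, t, 10.0)                  # strongly regularized

# Gradient of the total error function at the strongly regularized solution
grad_large = Phi.T @ (Phi @ w_large - t) + 10.0 * w_large
```

Increasing the regularization coefficient shrinks the weight vector toward zero, which is how the penalty limits effective model complexity.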

- The case q=1 in the general regularizer
- Known as the 'lasso' in the statistical literature
- If λ is sufficiently large, some of the coefficients $w_j$ are driven to zero.
- Sparse model: the corresponding basis functions play no role.
- Equivalent to minimizing the unregularized sum-of-squares error subject to the constraint $\sum_{j=1}^{M} |w_j|^q \le \eta$ for an appropriate value of η

Contours of the regularization term

The lasso gives the sparse solution
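
The sparsity effect can be seen in a small experiment. The slides do not say how the lasso is solved; the sketch below uses cyclic coordinate descent with soft thresholding, which is one common choice and an assumption here, as are the synthetic data and the value of the regularization coefficient.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning exactly zero when |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(Phi, t, lam, sweeps=500):
    """Minimize 0.5 * ||t - Phi w||^2 + (lam / 2) * sum_j |w_j| by coordinate descent."""
    w = np.zeros(Phi.shape[1])
    col_sq = np.sum(Phi ** 2, axis=0)
    for _ in range(sweeps):
        for j in range(w.size):
            # Residual with the contribution of coordinate j removed
            r_j = t - Phi @ w + Phi[:, j] * w[j]
            w[j] = soft_threshold(Phi[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(5)
Phi = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])      # only two relevant features
t = Phi @ w_true + rng.normal(0.0, 0.1, 100)

w_lasso = lasso_cd(Phi, t, lam=20.0)
w_ridge = np.linalg.solve(20.0 * np.eye(5) + Phi.T @ Phi, Phi.T @ t)
```

Comparing `w_lasso` with `w_ridge` shows the difference between the two penalties: the quadratic regularizer shrinks every coefficient but leaves them nonzero, while the q = 1 penalty sets the irrelevant ones exactly to zero.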

- Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity.
- However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ.

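- For K>1 target variables
- 1. Introduce a different set of basis functions for each component of t.
- 2. Use the same set of basis functions to model all of the components of the target vector: $\mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x})$ (W: M×K matrix of parameters)
- Maximum likelihood solution: $\mathbf{W}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{T}$

- For each variable $t_k$: $\mathbf{w}_k = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathbf{t}_k$
- $\boldsymbol{\Phi}^{\dagger}$: pseudo-inverse of $\boldsymbol{\Phi}$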
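
Because the pseudo-inverse depends only on the design matrix, it can be computed once and shared across all targets. The following sketch, with random data standing in for a real problem, illustrates that the multi-output solution decouples into independent single-output fits.

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, K = 40, 3, 2
Phi = rng.normal(size=(N, M))        # stand-in design matrix
T = rng.normal(size=(N, K))          # each column is one target variable t_k

Phi_pinv = np.linalg.pinv(Phi)       # Moore-Penrose pseudo-inverse, shape (M, N)
W_ml = Phi_pinv @ T                  # shape (M, K), all targets at once

# Solving each target separately gives the same columns
w_cols = [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)]
```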

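- Frequentist viewpoint of the model complexity issue: bias-variance trade-off.
- Expected squared loss: $\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x}) \, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t) \, d\mathbf{x} \, dt$, where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$
- Bayesian: the uncertainty in our model is expressed through a posterior distribution over w.
- Frequentist: make a point estimate of w based on the data set D.

Second term: arises from the intrinsic noise on the data.

First term: dependent on the particular data set D, through $y(\mathbf{x}) = y(\mathbf{x}; D)$.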

- Bias
- The extent to which the average prediction over all data sets differs from the desired regression function.

- Variance
- The extent to which the solutions for individual data sets vary around their average.
- The extent to which the function y(x;D) is sensitive to the particular choice of data set.

- Expected loss = (bias)^2 + variance + noise

- bias-variance trade-off
- Averaging many solutions for the complex model (M=25) is a beneficial procedure.
- A weighted averaging of multiple solutions (although with respect to the posterior distribution of parameters, not with respect to multiple data sets) lies at the heart of a Bayesian approach.

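- The average prediction: $\bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x)$
- Bias and variance: $(\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \{\bar{y}(x_n) - h(x_n)\}^2$, $\text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \{y^{(l)}(x_n) - \bar{y}(x_n)\}^2$
- The bias-variance decomposition is based on averages with respect to ensembles of data sets (frequentist perspective). In practice, if we had several data sets, we would be better off combining them into a single large training set.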
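
The trade-off can be reproduced with a small simulation over an ensemble of data sets. The sketch below is loosely modeled on the sinusoidal example in the book, but the details are assumptions for illustration: a fixed input grid, nine Gaussian basis functions of width 0.1, noise level 0.3, and two hand-picked values of the regularization coefficient.

```python
import numpy as np

def design(x, centers, s=0.1):
    """Dummy column of ones plus Gaussian basis functions."""
    bumps = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), bumps])

def bias_variance(lam, L=100, N=25, noise=0.3, seed=0):
    """Estimate (bias)^2 and variance of regularized fits over L data sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, N)
    h = np.sin(2 * np.pi * x)                      # true regression function h(x)
    Phi = design(x, np.linspace(0.0, 1.0, 9))
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    preds = np.empty((L, N))
    for l in range(L):
        t = h + rng.normal(0.0, noise, N)          # one noisy data set
        preds[l] = Phi @ np.linalg.solve(A, Phi.T @ t)
    y_bar = preds.mean(axis=0)                     # average prediction
    bias2 = np.mean((y_bar - h) ** 2)
    variance = np.mean((preds - y_bar) ** 2)       # averaged over points and data sets
    return bias2, variance

b_small, v_small = bias_variance(lam=1e-3)         # flexible model
b_large, v_large = bias_variance(lam=10.0)         # heavily regularized model
```

The weakly regularized fit should show low bias and high variance, and the heavily regularized fit the reverse.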

- The appropriate model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.
- Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data.
- A Bayesian treatment of linear regression avoids the over-fitting problem of maximum likelihood, and also leads to automatic methods of determining model complexity using the training data alone.

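- Conjugate prior of the likelihood: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
- Posterior: $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
- The maximum posterior weight vector: $\mathbf{w}_{MAP} = \mathbf{m}_N$
- If $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with α → 0, the mean $\mathbf{m}_N$ reduces to $\mathbf{w}_{ML}$ given by (3.15)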
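
The posterior update and its broad-prior limit can be sketched in a few lines. The function name `posterior`, the straight-line data, and the numerical values of the precision parameters are assumptions used only for this example.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Gaussian posterior over w for a Gaussian prior N(m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x])
beta = 25.0                                         # assumed known noise precision
t = -0.3 + 0.5 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), x.size)

# A very broad prior (alpha close to 0) should recover the maximum likelihood weights
alpha = 1e-10
mN, SN = posterior(Phi, t, np.zeros(2), np.eye(2) / alpha, beta)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```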

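- Consider the prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$
- Corresponding posterior: $\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t}$, $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$
- Log of the posterior: $\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$
- Maximization of this posterior distribution with respect to w is equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term with λ=α/β.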
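
The equivalence with regularized least squares is easy to confirm numerically. In the sketch below, the data, the polynomial basis, and the values of α and β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-1.0, 1.0, 20)
Phi = np.vander(x, 6, increasing=True)
alpha, beta = 2.0, 25.0                             # assumed prior and noise precisions
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Posterior mean for the zero-mean isotropic Gaussian prior
SN = np.linalg.inv(alpha * np.eye(6) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(6) + Phi.T @ Phi, Phi.T @ t)
```

The two weight vectors agree up to floating-point error, since dividing the posterior precision equation through by β gives exactly the regularized normal equations.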

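- Other forms of prior over parameters: $p(\mathbf{w} \mid \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)$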

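- Our real interest: the predictive distribution $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta) \, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \, d\mathbf{w} = \mathcal{N}(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
- Predictive variance: $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x})$
- First term: noise on the data.
- Second term: uncertainty associated with the parameters w; it goes to 0 as N → ∞.

Mean of the Gaussian predictive distribution (red line), and predictive uncertainty (shaded region) as the number of data points increases.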
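
The behaviour of the predictive variance as data accumulate can be sketched as follows. The helper names `fit` and `predictive`, the straight-line model, and the parameter values are assumptions for this example.

```python
import numpy as np

def fit(x, t, alpha, beta):
    """Posterior mean and covariance for a straight-line model with an isotropic prior."""
    Phi = np.column_stack([np.ones_like(x), x])
    SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    return beta * SN @ Phi.T @ t, SN

def predictive(phi_x, mN, SN, beta):
    """Predictive mean and variance for each row of phi_x."""
    mean = phi_x @ mN
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", phi_x, SN, phi_x)
    return mean, var

rng = np.random.default_rng(9)
alpha, beta = 2.0, 25.0
x_all = rng.uniform(-1.0, 1.0, 1000)
t_all = -0.3 + 0.5 * x_all + rng.normal(0.0, 1.0 / np.sqrt(beta), 1000)
phi_new = np.array([[1.0, 0.5]])                    # features of a new input x = 0.5

mN_few, SN_few = fit(x_all[:5], t_all[:5], alpha, beta)
mN_many, SN_many = fit(x_all, t_all, alpha, beta)
_, var_few = predictive(phi_new, mN_few, SN_few, beta)
_, var_many = predictive(phi_new, mN_many, SN_many, beta)
```

With many observations the parameter term becomes negligible and the predictive variance approaches the noise floor 1/β.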

Draw samples from the posterior distribution over w.
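
Drawing weight vectors from the Gaussian posterior and evaluating the corresponding functions might look like the sketch below. The straight-line model, the data, and the sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 1.0 / np.sqrt(beta), x.size)
Phi = np.column_stack([np.ones_like(x), x])

# Posterior over w for the zero-mean isotropic prior
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Each sampled weight vector defines one plausible regression line
w_samples = rng.multivariate_normal(mN, SN, size=2000)          # shape (2000, 2)
x_grid = np.linspace(-1.0, 1.0, 50)
lines = w_samples @ np.vstack([np.ones_like(x_grid), x_grid])   # shape (2000, 50)
```

Plotting a handful of rows of `lines` against `x_grid` gives the familiar picture of sampled functions that agree near the data and spread out elsewhere.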

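- If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form $y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}) = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t} = \sum_{n=1}^{N} \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n) \, t_n$
- Mean of the predictive distribution at a point x: $y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) \, t_n$

Smoother matrix or equivalent kernel: $k(\mathbf{x}, \mathbf{x}') = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}')$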

Polynomial and sigmoidal basis function

- Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can define a localized kernel directly and use it to make predictions for new input vectors x, given the observed training set.
- This leads to a practical framework for regression (and classification) called Gaussian processes.

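- The equivalent kernel satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions.
- Inner product of nonlinear functions: $k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^T \boldsymbol{\psi}(\mathbf{z})$, where $\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2} \mathbf{S}_N^{1/2} \boldsymbol{\phi}(\mathbf{x})$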
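
Both views of the equivalent kernel, as weights on the training targets and as an inner product, can be checked numerically. In this sketch the Gaussian features, their width, the data, and the use of an eigendecomposition to form the matrix square root are all assumptions made for illustration.

```python
import numpy as np

def features(x, centers, s=0.2):
    """Gaussian basis functions evaluated at each input."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(11)
alpha, beta = 2.0, 25.0
centers = np.linspace(-1.0, 1.0, 9)
x_train = np.sort(rng.uniform(-1.0, 1.0, 50))
t = np.sin(np.pi * x_train) + rng.normal(0.0, 0.2, 50)

Phi = features(x_train, centers)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_new = np.array([0.0, 0.3])
phi_new = features(x_new, centers)

# Equivalent kernel k(x, x_n) = beta * phi(x)^T S_N phi(x_n)
K = beta * phi_new @ SN @ Phi.T                     # shape (2, 50)
mean_via_kernel = K @ t                             # weighted sum of training targets
mean_via_weights = phi_new @ mN                     # same prediction via m_N

# Inner-product form with psi(x) = beta^{1/2} S_N^{1/2} phi(x)
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
psi_new = np.sqrt(beta) * phi_new @ SN_half
psi_train = np.sqrt(beta) * Phi @ SN_half
K_inner = psi_new @ psi_train.T
```

Each row of `K` shows how strongly every training target contributes to the prediction at one new input.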
