Neural Networks and Kernel Methods
How are we doing on the pass sequence?

  • We can now track both men, provided with

    • Hand-labeled coordinates of both men in 30 frames

    • Hand-extracted features (stripe detector, white blob detector)

    • Hand-labeled classes for the white-shirt tracker

  • We have a framework for how to optimally make decisions and track the men

Generally, this will take a lot longer than 24 hours… We need to avoid doing this by hand!


Recall: Multi-input linear regression

y(x,w) = w_0 + w_1 f_1(x) + w_2 f_2(x) + … + w_M f_M(x)

  • x can be an entire scan line or image!

  • We could try to uniformly distribute basis functions in the input space:

  • This is futile, because of the curse of dimensionality

x = entire scan line
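To make the curse of dimensionality concrete, here is a small illustrative sketch (not from the slides; the function name and the grid resolution are arbitrary) of how the number of uniformly spaced basis functions explodes with the input dimension D:

```python
# Covering the input space with a uniform grid of basis functions requires a
# number of basis functions that grows exponentially with the input dimension D.
def num_uniform_basis_functions(points_per_dim: int, D: int) -> int:
    """Number of basis functions on a grid with `points_per_dim` centres per axis."""
    return points_per_dim ** D

for D in (1, 2, 5, 10, 320):   # 320 = length of an entire scan line
    print(D, num_uniform_basis_functions(10, D))
# D = 320 would need 10**320 basis functions -- hopeless, hence the "curse".
```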


Neural networks and kernel methods

Two main approaches to avoiding the curse of dimensionality:

  • “Neural networks”

    • Parameterize the basis functions and learn their locations

    • Can be nested to create a hierarchy

    • Regularize the parameters or use Bayesian learning

  • “Kernel methods”

    • The basis functions are associated with data points, limiting complexity

    • A subset of data points may be selected to further limit complexity



Two-layer neural networks

  • Before, we used y(x,w) = Σ_j w_j f_j(x)

  • Replace each f_j with a variable z_j,

    where z_j = h( Σ_i w_ji x_i )

    and h() is a fixed activation function

  • The outputs are obtained from y_k = s( Σ_j w_kj z_j ),

    where s() is another fixed function

  • In all, we have (simplifying biases): y_k(x,w) = s( Σ_j w_kj h( Σ_i w_ji x_i ) )
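A minimal NumPy sketch of this forward pass, assuming tanh hidden units and an identity output function s; the sizes (320 inputs, 3 hidden units, 1 output) and the weight scales are illustrative, not taken from the slides:

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, s=lambda a: a):
    """Two-layer network (biases omitted, as on the slide):
       z_j = h(sum_i W1[j,i] x_i),  y_k = s(sum_j W2[k,j] z_j)."""
    a_hidden = W1 @ x          # pre-activations of hidden units
    z = h(a_hidden)            # hidden-unit values z_j
    a_out = W2 @ z             # pre-activations of output units
    return s(a_out)

rng = np.random.default_rng(0)
x = rng.normal(size=320)                   # e.g. an entire scan line
W1 = 0.1 * rng.normal(size=(3, 320))       # 3 hidden units
W2 = 0.1 * rng.normal(size=(1, 3))         # 1 linear output unit
print(forward(x, W1, W2))
```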


Typical activation functions

  • Logistic sigmoid (the inverse of the logit):

    h(a) = σ(a) = 1/(1 + e^(−a))

  • Hyperbolic tangent:

    h(a) = tanh(a) = (e^a − e^(−a))/(e^a + e^(−a))

  • Cumulative Gaussian (error function):

    h(a) = 2 ∫_{−∞}^{a} N(x|0,1) dx − 1

    • This one has a lighter tail

[Figure: the three activation functions h(a) plotted against a, normalized to have the same range and slope at a = 0; a second panel shows h on a log scale]
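For reference, a small sketch of these three activation functions in Python; scipy.special.erf is used for the cumulative Gaussian via the identity 2∫_{−∞}^{a} N(x|0,1)dx − 1 = erf(a/√2). The normalization used in the slide's figure is not reproduced here.

```python
import numpy as np
from scipy.special import erf

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))            # sigma(a)

def cumulative_gaussian(a):
    # 2 * Phi(a) - 1, where Phi is the standard normal CDF; equals erf(a / sqrt(2))
    return erf(a / np.sqrt(2.0))

a = np.linspace(-4, 4, 9)
print(np.round(logistic(a), 3))
print(np.round(np.tanh(a), 3))
print(np.round(cumulative_gaussian(a), 3))     # note the lighter tails
```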


Examples of functions learned by a neural network (3 tanh hidden units, one linear output unit)


Multi-layer neural networks

  • Only weights corresponding to the feed-forward topology are instantiated

  • The sum is over those values of j with instantiated weights w_kj

From now on, we’ll denote all activation functions by h


Learning neural networks

  • As for regression, we consider a squared error cost function:

    E(w) = ½ Σ_n Σ_k ( t_nk − y_k(x_n,w) )²

    which corresponds to a Gaussian density p(t|x)

  • We can substitute the network function y_k(x_n,w) = s( Σ_j w_kj h( Σ_i w_ji x_i ) )

    and use a general-purpose optimizer to estimate w, but it is illustrative and useful to study the derivatives of E…


Learning neural networks

E(w) = ½ Σ_n Σ_k ( t_nk − y_k(x_n,w) )²

  • Recall that for linear regression:

    ∂E(w)/∂w_m = −Σ_n ( t_n − y_n ) x_nm

  • We’ll use the chain rule of differentiation to derive a similar-looking expression, where

    • Local input signals are forward-propagated from the input

    • Local error signals are back-propagated from the output

[Diagram: the gradient is a product of an error signal ( t_n − y_n ), an input signal x_nm, and the weight w_m in between]



Local signals needed for learning

  • For clarity, consider the error for one training case: E_n = ½ Σ_k ( t_nk − y_k(x_n,w) )²

  • To compute ∂E_n/∂w_ji, note that w_ji appears in only one term of the overall expression, namely the input to unit j, a_j = Σ_i w_ji z_i

  • Using the chain rule of differentiation, we have

    ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) = δ_j z_i

    where δ_j ≡ ∂E_n/∂a_j is the local error signal and z_i is the local input signal

if w_ji is in the 1st layer, z_i is actually the input x_i


Forward-propagating local input signals

  • Forward propagation gives all the a’s and z’s


Back-propagating local error signals

  • Back-propagation gives all the δ’s, starting at the output units, where the predictions are compared with the targets t_1, t_2, …


Back-propagating error signals

  • To compute En/aj (dj), note that aj appears in all those expressions ak = Siwkih(ai) that depend on aj

  • Using the chain rule, we have

  • The sum is over k s.t. unit j is connected to unit k and for each such term, ak/aj = wkjh’(aj)

  • Noting that En/ak=dk, we get the back-propagation rule:

  • For output units: -


Putting the propagations together

  • For each training case n, apply forward propagation and back-propagation to compute

    ∂E_n/∂w_ji = δ_j z_i

    for each weight w_ji

  • Sum these over training cases to compute ∂E/∂w_ji = Σ_n ∂E_n/∂w_ji

  • Use these derivatives for steepest descent learning or as input to a conjugate gradients optimizer, etc.

  • On-line learning: After each pattern presentation, use the above gradient to update the weights
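A compact sketch of the whole procedure, assuming tanh hidden units, linear outputs, no biases, and the squared error above: it forward-propagates the local input signals, back-propagates the local error signals, and sums ∂E_n/∂w_ji = δ_j z_i over training cases. Array names and the toy data are illustrative.

```python
import numpy as np

def backprop(X, T, W1, W2):
    """Batch gradients of E(w) = 1/2 * sum_n sum_k (t_nk - y_nk)^2
       for a two-layer net: z = tanh(W1 x), y = W2 z (linear outputs, no biases)."""
    dE_dW1 = np.zeros_like(W1)
    dE_dW2 = np.zeros_like(W2)
    for x, t in zip(X, T):
        # forward propagation: local input signals
        a = W1 @ x
        z = np.tanh(a)
        y = W2 @ z
        # back-propagation: local error signals
        delta_out = y - t                                 # output units: delta_k = y_k - t_k
        delta_hid = (1.0 - z**2) * (W2.T @ delta_out)     # delta_j = h'(a_j) sum_k w_kj delta_k
        # dE_n/dw_ji = delta_j * z_i  (z_i is the input x_i for first-layer weights)
        dE_dW2 += np.outer(delta_out, z)
        dE_dW1 += np.outer(delta_hid, x)
    return dE_dW1, dE_dW2

# tiny usage example with random data
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4)); T = rng.normal(size=(15, 2))
W1 = 0.1 * rng.normal(size=(3, 4)); W2 = 0.1 * rng.normal(size=(2, 3))
g1, g2 = backprop(X, T, W1, W2)
print(g1.shape, g2.shape)
```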



The effect of local minima

  • Because of random weight initialization, each training run will find a different solution

[Plot: validation error vs. number of hidden units M, with several training runs per value of M]


Regularizing neural networks

Demonstration of over-fitting (M = # hidden units)


Regularizing neural networks

Ways to avoid over-fitting:

  • Use cross-validation to select the network architecture (number of layers, number of units per layer)

  • Add to E a term (λ/2) Σ_ji w_ji² that penalizes large weights, so the cost becomes E(w) + (λ/2) Σ_ji w_ji²

    Use cross-validation to select λ

  • Use early-stopping and cross-validation (next slide)

  • Take a Bayesian approach: Put a prior on the w’s and integrate over them to make predictions
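A minimal sketch of the weight-decay penalty from the second bullet: it adds (λ/2) Σ w² to an existing error and λw to its gradient. The function and variable names are illustrative, and λ would be chosen by cross-validation as the slide says.

```python
import numpy as np

def penalized_error_and_grad(error, grad_W, W, lam):
    """Add the weight-decay term (lam/2) * sum_ji W_ji^2 to an existing
       error and its gradient (both computed, e.g., by backprop above)."""
    error_reg = error + 0.5 * lam * np.sum(W ** 2)
    grad_reg = grad_W + lam * W
    return error_reg, grad_reg

# lam would be chosen by cross-validation, e.g. from a grid:
candidate_lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0]
```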


Early stopping

  • The weights start at small values and grow

  • Perhaps the number of learning iterations is a surrogate for model complexity?

  • This works for some learning tasks

[Plot: training error and validation error vs. number of learning iterations]
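A schematic early-stopping loop that keeps the weights from the iteration with the lowest validation error; train_one_epoch and validation_error are hypothetical placeholders for whatever training and held-out evaluation code is in use.

```python
# Schematic early stopping (train_one_epoch and validation_error are placeholders).
def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              patience=10, max_iters=1000):
    best_error, best_weights, iters_since_best = float("inf"), weights, 0
    for it in range(max_iters):
        weights = train_one_epoch(weights)        # one pass of gradient updates
        err = validation_error(weights)           # held-out error
        if err < best_error:
            best_error, best_weights, iters_since_best = err, weights, 0
        else:
            iters_since_best += 1
            if iters_since_best >= patience:      # validation error stopped improving
                break
    return best_weights                           # weights from the best iteration
```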


Can we use a standard neural network to automatically learn the features needed for tracking?

  • x is 320-dimensional, so the number of parameters would be at least 320

  • We have only 15 data points (setting aside 15 for cross-validation), so over-fitting will be an issue

  • We could try weight decay, Bayesian learning, etc., but a little thinking reveals that our approach is wrong…

  • In fact, we want the weights connecting different positions in the scan line to use the same feature (e.g., stripes)

x = entire scan line


Convolutional neural networks

  • Recall that a short portion of the scan line was sufficient for tracking the striped shirt

  • We can use this idea to build a convolutional network

With constrained weights, the number of free parameters is now only ~ one dozen, so…

We can use Bayesian/regularized learning to automatically learn the features

Same set of weights used for all hidden units
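A sketch of such a constrained hidden layer in 1-D, assuming one hidden unit per window position and a single shared weight vector; the window length and the weights shown are illustrative.

```python
import numpy as np

def conv1d_hidden_layer(x, w, h=np.tanh):
    """Hidden units that all share the same small weight vector w,
       each looking at a short window of the scan line x."""
    W = len(w)
    # one hidden unit per window position; identical weights at every position
    a = np.array([w @ x[i:i + W] for i in range(len(x) - W + 1)])
    return h(a)

x = np.random.default_rng(2).normal(size=320)   # an entire scan line
w = np.array([1.0, -1.0, 1.0, -1.0, 1.0])       # a handful of shared free parameters
z = conv1d_hidden_layer(x, w)
print(z.shape)   # (316,) hidden units, all sharing the same 5 weights
```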


Convolutional neural networks in 2-D (from LeCun et al., 1989)


Neural networks and kernel methods

Two main approaches to avoiding the curse of dimensionality:

  • “Neural networks”

    • Parameterize the basis functions and learn their locations

    • Can be nested to create a hierarchy

    • Regularize the parameters or use Bayesian learning

  • “Kernel methods”

    • The basis functions are associated with data points, limiting complexity

    • A subset of data points may be selected to further limit complexity


Kernel methods

  • Basis functions offer a way to enrich the feature space, making simple methods (such as linear regression and linear classifiers) much more powerful

  • Example: Input x; Features x, x², x³, sin(x), …

  • There are two problems with this approach

    • Computational efficiency: Generally, the appropriate features are not known, so there is a huge (possibly infinite) number of them to search over

    • Regularization: Even if we could search over the huge number of features, how can we select appropriate features so as to prevent overfitting?

  • The kernel framework enables efficient approaches to both problems


Kernel methods

[Figure: data mapped from input space (x1, x2) to feature space (f1, f2)]


Definition of a kernel

  • Suppose f(x) is a mapping from the D-dimensional input vector x to a high (possibly infinite) dimensional feature space

  • Many simple methods rely on inner products of feature vectors, f(x1)ᵀf(x2)

  • For certain feature spaces, the “kernel trick” can be used to compute f(x1)ᵀf(x2) using the input vectors directly:

    f(x1)ᵀf(x2) = k(x1,x2)

  • k(x1,x2) is referred to as a kernel

  • If a function satisfies “Mercer’s conditions” (see textbook), it can be used as a kernel
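As an illustration of the kernel trick (the specific kernel is an example, not taken from the slide): for 2-D inputs, k(x1,x2) = (x1ᵀx2)² corresponds to the explicit feature map f(x) = (x₁², x₂², √2·x₁x₂), so the kernel can be evaluated directly from the inputs without ever forming f.

```python
import numpy as np

def quad_features(x):
    """Explicit feature map f(x) for the kernel k(x1, x2) = (x1^T x2)^2 in 2-D:
       f(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x1, x2):
    return (x1 @ x2) ** 2          # computed directly from the inputs

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quad_features(x1) @ quad_features(x2))   # f(x1)^T f(x2)
print(k(x1, x2))                               # same value, without forming f
```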


Examples of kernels

  • k(x1,x2) = x1ᵀ x2

  • k(x1,x2) = x1ᵀ Σ⁻¹ x2

    (Σ⁻¹ is symmetric positive definite)

  • k(x1,x2) = exp( −||x1 − x2||² / 2σ² )

  • k(x1,x2) = exp( −½ (x1 − x2)ᵀ Σ⁻¹ (x1 − x2) )

    (Σ⁻¹ is symmetric positive definite)

  • k(x1,x2) = p(x1) p(x2)
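As a quick sanity check related to Mercer's conditions (not on the slide): any valid kernel must produce a positive semi-definite Gram matrix for every set of inputs. A small test for the Gaussian kernel above, using arbitrary random inputs:

```python
import numpy as np

def k_rbf(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))                        # a few random input vectors
K = np.array([[k_rbf(a, b) for b in X] for a in X])
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)     # True: the Gram matrix is PSD
```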


Gaussian processes

  • Recall that for linear regression: y(x,w) = Σ_j w_j f_j(x) = wᵀf(x)

  • Using a design matrix Φ (with Φ_nj = f_j(x_n)), our prediction vector is y = Φw

  • Let’s use a simple prior on w: p(w) = N(w | 0, α⁻¹I)

  • Then p(y) = N(y | 0, K), with K = α⁻¹ΦΦᵀ

  • K is called the Gram matrix, where K_nm = (1/α) f(x_n)ᵀf(x_m) = k(x_n, x_m)

  • Result: The covariance between two predictions equals the kernel evaluated for the corresponding inputs
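A short sketch of what this result buys us: given any kernel (here a Gaussian kernel, chosen purely for illustration), we can sample entire prediction vectors y directly from N(0, K) without ever constructing basis functions or weights. The input grid and kernel width are arbitrary.

```python
import numpy as np

def k_rbf(x1, x2, sigma=0.5):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

x = np.linspace(0, 1, 50)                                   # input locations
K = np.array([[k_rbf(a, b) for b in x] for a in x])         # Gram matrix K_nm = k(x_n, x_m)
K += 1e-8 * np.eye(len(x))                                  # jitter for numerical stability
rng = np.random.default_rng(4)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # three draws from p(y) = N(0, K)
print(samples.shape)                                        # (3, 50): three sampled functions
```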


Gaussian processes: “Learning” and prediction

  • As before, we assume Gaussian observation noise: p(t_n | y_n) = N(t_n | y_n, β⁻¹)

  • The target vector likelihood is p(t | y) = N(t | y, β⁻¹I)

  • Using p(y) = N(y | 0, K), we can obtain the marginal predictive distribution over targets:

    p(t) = N(t | 0, C), where C = K + β⁻¹I

  • Predictions for a new input x_{N+1} are based on the joint Gaussian over (t, t_{N+1}), whose covariance extends C with the vector k, where k_n = k(x_n, x_{N+1}), and the scalar c = k(x_{N+1}, x_{N+1}) + β⁻¹

  • The conditional p(t_{N+1} | t) is Gaussian with mean kᵀC⁻¹t and variance c − kᵀC⁻¹k (a sketch follows below)

Example: Samples from the prior p(y)


Example: Learning and prediction


Sparse kernel methods and SVMs

  • Idea: Identify a small number of training cases, called support vectors, which are used to make predictions

  • See textbook for details

[Figure: example with a support vector highlighted]


Questions?



How are we doing on the pass sequence?

  • We can now automatically learn the features needed to track both people



How are we doing on the pass sequence?

Pretty good! We can now automatically learn the features needed to track both people

But, it sucks that we need to hand-label the coordinates of both men in 30 frames and hand-label the 2 classes for the white-shirt tracker


Lecture 5 Appendix


Constructing kernels

  • Provided with a kernel or a set of kernels, we can construct new kernels using any of the rules:
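The rules themselves appeared as an image on the original slide; the standard closure properties from the textbook's kernel chapter include positive scaling, sums, products, multiplication by f(x)·f(x′) for any function f, and exponentiation of valid kernels. A sketch of a few of them:

```python
import numpy as np

# Given valid kernels k1 and k2, each construction below is again a valid kernel
# (standard closure rules; see the textbook's chapter on kernel methods):
def k_sum(k1, k2):      return lambda x1, x2: k1(x1, x2) + k2(x1, x2)
def k_product(k1, k2):  return lambda x1, x2: k1(x1, x2) * k2(x1, x2)
def k_scale(c, k1):     return lambda x1, x2: c * k1(x1, x2)            # c > 0
def k_warp(f, k1):      return lambda x1, x2: f(x1) * k1(x1, x2) * f(x2)
def k_exp(k1):          return lambda x1, x2: np.exp(k1(x1, x2))

# Example: build a new kernel from the linear kernel
k_lin = lambda x1, x2: float(np.dot(x1, x2))
k_new = k_sum(k_exp(k_lin), k_scale(2.0, k_product(k_lin, k_lin)))
print(k_new(np.array([1.0, 2.0]), np.array([0.5, -1.0])))
```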

