1. Stat 231. A.L. Yuille. Fall 2004.

- Perceptron Rule and Convergence Proof
- Capacity of Perceptrons.
- Multi-layer Perceptrons.
- Read Sections 5.4, 5.5, and 9.6.8 of Duda, Hart, Stork.

Lecture notes for Stat 231: Pattern Recognition and Machine Learning

- N samples $\{(x_\mu, \omega_\mu) : \mu = 1, \dots, N\}$, where the class labels $\omega_\mu \in \{\pm 1\}$.

- Can we find a hyperplane in feature space, through the origin, that separates the two types of samples?

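- For the two-class case, simplify by replacing every sample $x_\mu$ of the second class (label $-1$) with $-x_\mu$. Then find a plane, with weight (normal) vector $a$, such that $a \cdot x_\mu > 0$ for all $\mu$.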
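
The sign-flipping simplification can be sketched as follows. This is a minimal illustration, assuming labels of +1 and -1 and a hyperplane through the origin; the helper names (`normalize_samples`, `dot`), the toy data, and the hand-picked weight are all made up for the example.

```python
def normalize_samples(samples, labels):
    """Negate every sample whose label is -1, so that a separating
    weight vector a must satisfy a . x > 0 for every returned x."""
    return [[label * x_i for x_i in x] for x, label in zip(samples, labels)]


def dot(a, x):
    return sum(a_i * x_i for a_i, x_i in zip(a, x))


# Toy data (made up): class +1 lies to the right, class -1 to the left.
samples = [[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5], [-2.0, -1.0]]
labels = [+1, +1, -1, -1]

flipped = normalize_samples(samples, labels)
a = [1.0, 0.0]  # a hand-picked separating weight for this toy data
all_positive = all(dot(a, x) > 0 for x in flipped)
```

After the flip, one inequality covers both classes, which is what lets the Perceptron rule treat every sample the same way.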

- The weight vector is almost never unique.
- Determine the weight vector that has the biggest margin $m > 0$, where $a \cdot x_\mu \ge m$ for all $\mu$ with $\|a\| = 1$ (next lecture).
- Discriminative: no attempt to model probability distributions. Recall that the decision boundary is a hyperplane if the distributions are Gaussian with identical covariance.

- Assume there is a hyperplane separating the two classes. How can we find it?
- Single Sample Perceptron Rule.
- Order the samples $x_1, \dots, x_N$.
- Set $a = 0$, then loop over $j$:

if $x_j$ is misclassified ($a \cdot x_j \le 0$), set $a \to a + x_j$;

repeat until all samples are classified correctly.
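
A minimal sketch of the single-sample rule, assuming the samples have already been sign-normalized so that "misclassified" means a . x <= 0 and the update adds the sample to the weight. The function name and the `max_passes` safety cap are assumptions of this sketch, not part of the rule.

```python
def dot(a, x):
    return sum(a_i * x_i for a_i, x_i in zip(a, x))


def single_sample_perceptron(samples, max_passes=1000):
    """Single-sample Perceptron rule on sign-normalized samples.

    Returns (a, number_of_updates). max_passes is a safety cap for
    data that is not linearly separable.
    """
    a = [0.0] * len(samples[0])  # initialize the weight at 0
    updates = 0
    for _ in range(max_passes):
        changed = False
        for x in samples:
            if dot(a, x) <= 0:  # x is misclassified
                a = [a_i + x_i for a_i, x_i in zip(a, x)]
                updates += 1
                changed = True
        if not changed:  # a full pass with no errors: done
            return a, updates
    raise RuntimeError("no separating weight found within max_passes")


# Toy sign-normalized samples (made up for illustration).
samples = [[2.0, 1.0], [1.0, -1.0], [1.0, -0.5], [2.0, 1.0]]
a, updates = single_sample_perceptron(samples)
```

On this toy data the first sample triggers the only update, after which every sample satisfies a . x > 0.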

- Novikov’s Theorem: the single sample Perceptron rule will converge to a solution weight, if one exists.
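- Proof. Suppose $\hat a$ is a separating weight.
- Then $\|a - \alpha \hat a\|^2$ decreases by at least $\beta^2$ for each misclassified sample, where $\beta^2 = \max_j \|x_j\|^2$, $\gamma = \min_j \hat a \cdot x_j > 0$, and $\alpha = \beta^2 / \gamma$.
- Initialize the weight at 0. Then the number of weight changes is at most $\alpha^2 \|\hat a\|^2 / \beta^2 = \beta^2 \|\hat a\|^2 / \gamma^2$.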

- Proof of claim.
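- If $x_j$ is misclassified, then $a \cdot x_j \le 0$, and the update $a \to a + x_j$ gives $\|a + x_j - \alpha \hat a\|^2 = \|a - \alpha \hat a\|^2 + 2 (a - \alpha \hat a) \cdot x_j + \|x_j\|^2 \le \|a - \alpha \hat a\|^2 - 2 \alpha \gamma + \beta^2$, where $\gamma = \min_j \hat a \cdot x_j$ and $\beta^2 = \max_j \|x_j\|^2$.
- Using $\alpha = \beta^2 / \gamma$, the right-hand side equals $\|a - \alpha \hat a\|^2 - \beta^2$.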
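
The convergence argument can be checked numerically on a toy problem. The sketch below assumes the standard form of the bound in Duda, Hart, Stork (Section 5.5): starting from zero weight, the number of updates is at most beta^2 |a_hat|^2 / gamma^2, with beta^2 the largest squared sample norm and gamma the smallest value of a_hat . x. The data, the separating weight `a_hat`, and the function names are made up for illustration.

```python
def dot(a, x):
    return sum(a_i * x_i for a_i, x_i in zip(a, x))


def count_updates(samples, max_passes=1000):
    """Run the single-sample rule from a = 0 and count weight changes."""
    a = [0.0] * len(samples[0])
    updates = 0
    for _ in range(max_passes):  # safety cap, an assumption of this sketch
        changed = False
        for x in samples:
            if dot(a, x) <= 0:  # misclassified
                a = [a_i + x_i for a_i, x_i in zip(a, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return a, updates


def update_bound(samples, a_hat):
    """beta^2 * |a_hat|^2 / gamma^2 for a known separating weight a_hat."""
    beta_sq = max(dot(x, x) for x in samples)
    gamma = min(dot(a_hat, x) for x in samples)
    if gamma <= 0:
        raise ValueError("a_hat does not separate the samples")
    return beta_sq * dot(a_hat, a_hat) / gamma ** 2


# Sign-normalized toy samples and a separating weight chosen by hand.
samples = [[1.0, 3.0], [1.0, -3.0], [1.0, 2.0], [1.0, -2.0]]
a_hat = [1.0, 0.0]
a, updates = count_updates(samples)
bound = update_bound(samples, a_hat)
```

Here the rule makes 2 updates against a bound of 10. The bound is a worst case, so the actual count is often much smaller.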

- The Perceptron was very influential, and unrealistic claims were made about its abilities (1950’s, early 1960’s).
- The model is an idealized model of neurons.
- An entire book was published in the late 1960’s describing the limited capacity of Perceptrons (Minsky and Papert). Some classifications, such as exclusive or (XOR), cannot be performed by linear separation.
- But, from Learning Theory, limited capacity is good.

- The Perceptron is useful precisely because it has finite capacity and so cannot represent all classifications.
- The amount of training data required to ensure Generalization will need to be larger than the capacity. Infinite capacity requires infinite data.
- Full definition of Perceptron capacity must wait till we introduce the Vapnik-Chervonenkis (VC) dimension.
- But the following result (Cover) gives the basic idea.

- Suppose we have n sample points in a d dimensional feature space. Assume that these points are in general position: no subset of (d+1) points lies in a (d-1) dimensional subspace.

- Let f(n,d) be the fraction of the 2^n dichotomies of the n points which can be expressed by linear separation.
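- It can be shown (D.H.S.) that $f(n,d) = 1$ for $n \le d+1$,
- otherwise $f(n,d) = \frac{2}{2^n} \sum_{j=0}^{d} \binom{n-1}{j}$.
- There is a critical value $n = 2(d+1)$, at which $f(n,d) = 1/2$. $f(n,d) \approx 1$ for $n \ll 2(d+1)$,
- $f(n,d) \approx 0$ for $n \gg 2(d+1)$, and the transition is rapid for large d.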
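
The fraction of linearly separable dichotomies can be tabulated directly. This sketch assumes the counting formula given in Duda, Hart, Stork: f(n,d) = 1 for n <= d+1, and otherwise (2 / 2^n) times the sum of the binomial coefficients C(n-1, j) for j = 0..d. The function name is made up for the example.

```python
from math import comb


def separable_fraction(n, d):
    """Fraction of the 2^n dichotomies of n points in general position
    in d dimensions that a hyperplane can realize."""
    if n <= d + 1:
        return 1.0
    return 2.0 * sum(comb(n - 1, j) for j in range(d + 1)) / 2 ** n


d = 2
# n = 3 is below capacity, n = 6 is the critical value 2(d+1), n = 40 is far above it.
table = {n: separable_fraction(n, d) for n in (3, 4, 6, 40)}
```

For d = 2 this gives 1 at n = 3, 14/16 at n = 4 (the two exclusive-or labelings of four points are the ones missing), exactly 1/2 at n = 6, and a vanishing fraction at n = 40.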

- Perceptron capacity is d+1. The probability of finding a separating hyperplane by chance alignment of the samples decreases rapidly for n > 2(d+1).

- Multilayer Perceptrons were introduced in the 1980’s to increase capacity. Motivated by biological arguments (dubious).
- Key Idea: replace the binary decision rule by a Sigmoid function: $\sigma(z) = 1 / (1 + e^{-z/T})$ (a step function as T tends to 0).
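
The step-function limit is easy to see numerically. The sketch below assumes the logistic form 1 / (1 + exp(-z / T)) with a temperature parameter T > 0. The slide's exact formula is not recoverable from the transcript, so this form is an assumption.

```python
import math


def sigmoid(z, T=1.0):
    """Logistic sigmoid with temperature T > 0: 1 / (1 + exp(-z / T))."""
    u = z / T
    # Branch on the sign so math.exp never receives a large positive argument.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


# Evaluate at z = -1, 0, 1 for decreasing temperatures.
values = {
    T: [round(sigmoid(z, T), 3) for z in (-1.0, 0.0, 1.0)]
    for T in (1.0, 0.1, 0.01)
}
```

As T shrinks, the outputs at z = -1 and z = 1 move toward 0 and 1 while the value at z = 0 stays at 0.5, which is the step-function behavior the slide describes.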

- Input units activity
- Hidden units
- Output units
- Weights connecting the input units to the hidden units, and the hidden units to the output units.

- Multilayer perceptrons can represent any function provided there are a sufficient number of hidden units. But the number of hidden units may be enormous.
- Also the ability to represent any function may be bad, because of generalization/memorization.
- Difficult to analyze multilayer perceptrons. They are like “black boxes”. When they are successful, there is often a simpler, more transparent alternative.
- The neuronal plausibility of multilayer perceptrons is unclear.

- Train the multilayer perceptron using training data
- Define error function for each sample
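- Minimize the error function for each sample by steepest descent: each weight $w$ is updated as $w \to w - \eta \, \partial E / \partial w$, where $E$ is the error for that sample and $\eta > 0$ is the step size.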
- Backpropagation algorithm (propagation of errors).
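
A minimal sketch of one backpropagation step for a network with one hidden layer and a single output unit. Several choices here are assumptions, since the slide's formulas are not recoverable: the per-sample error is taken to be the squared error 0.5 * (y - target)^2, all units use the logistic sigmoid with T = 1, and the names `W1`, `W2`, `eta` are made up for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, W2, x):
    """Return the hidden activities h and the scalar output y."""
    h = [sigmoid(sum(w * x_i for w, x_i in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * h_j for w, h_j in zip(W2, h)))
    return h, y


def sample_error(W1, W2, x, target):
    _, y = forward(W1, W2, x)
    return 0.5 * (y - target) ** 2


def backprop(W1, W2, x, target):
    """Gradients of the per-sample error with respect to W1 and W2."""
    h, y = forward(W1, W2, x)
    delta_out = (y - target) * y * (1.0 - y)  # error signal at the output unit
    grad_W2 = [delta_out * h_j for h_j in h]
    # Propagate the output error back through W2 to each hidden unit.
    delta_hid = [delta_out * w * h_j * (1.0 - h_j) for w, h_j in zip(W2, h)]
    grad_W1 = [[d * x_i for x_i in x] for d in delta_hid]
    return grad_W1, grad_W2


def descent_step(W1, W2, x, target, eta=0.1):
    """One steepest-descent update on the error of a single sample."""
    g1, g2 = backprop(W1, W2, x, target)
    W1 = [[w - eta * g for w, g in zip(row, g_row)] for row, g_row in zip(W1, g1)]
    W2 = [w - eta * g for w, g in zip(W2, g2)]
    return W1, W2


random.seed(0)
x, target = [1.0, 0.0, 1.0], 1.0  # last input component acts as a constant bias
W1 = [[random.uniform(-1, 1) for _ in x] for _ in range(2)]  # 2 hidden units
W2 = [random.uniform(-1, 1) for _ in range(2)]

before = sample_error(W1, W2, x, target)
W1_new, W2_new = descent_step(W1, W2, x, target)
after = sample_error(W1_new, W2_new, x, target)
```

Training would repeat `descent_step` over the samples. Comparing the output of `backprop` against finite differences of `sample_error` is a common way to confirm that the propagated errors are correct.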

- Perceptron and Linear Separability.
- Perceptron rule and convergence proof.
- Capacity of Perceptrons.
- Multi-layer Perceptrons.
- Next Lecture – Support Vector Machines for Linear Separation.
