
Announcements


Presentation Transcript


  1. Announcements
     1. Textbook will be on reserve at the library.
     2. Topic schedule change; modified reading assignment. This week: Linear discrimination, evaluating classifiers. Extra reading: T. Fawcett, "An introduction to ROC analysis," Sections 1-4 (linked from the class web page).
     3. No class Monday (MLK day).
     4. Guest lecture Wednesday: Josh Hugues on multi-layer perceptrons.

  2. Perceptrons as simple neural networks
     [Diagram: inputs x1, x2, ..., xn plus a constant +1 input, with weights w1, w2, ..., wn and bias weight w0, feeding a single output unit o.]
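The diagram computes a weighted sum of the inputs (plus the bias weight w0 on the constant +1 input) and thresholds it. A minimal sketch in Python, assuming the usual sign-threshold convention with outputs in {-1, +1} (the slide itself does not fix the convention):

```python
# Perceptron forward pass: o = +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
# The {-1, +1} output convention is an assumption, not stated on the slide.
def perceptron_output(w, x):
    """w = (w0, w1, ..., wn); x = (x1, ..., xn)."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1
```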

  3. Geometry of the perceptron
     [Figure: in 2d, the discriminant is a separating line in the Feature 1 / Feature 2 plane; in higher dimensions it is a hyperplane.]

  4. In-class exercise
     Work with one neighbor on this:
     (a) Find weights (w0, w1, w2) for a perceptron that separates "true" and "false" for x1 ∧ x2. Find the slope and intercept, and sketch the separation line defined by this discriminant, showing that it separates the points correctly.
     (b) Do the same, but for x1 ∨ x2.
     (c) What (if anything) might make one separation line better than another?
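The separating line in part (a) can be read off the weights: the discriminant w0 + w1*x1 + w2*x2 = 0 rearranges to x2 = -(w1/w2)*x1 - w0/w2. Here is a quick way to check a candidate answer, reusing perceptron_output from above (the weights below are an illustrative guess for the AND case, not the official solution):

```python
# Candidate weights (w0, w1, w2) for x1 AND x2 -- illustrative, not the official answer.
w = (-1.5, 1.0, 1.0)

and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
assert all(perceptron_output(w, x) == t for x, t in and_examples)

# Slope and intercept of the separating line w0 + w1*x1 + w2*x2 = 0:
slope, intercept = -w[1] / w[2], -w[0] / w[2]
print(f"x2 = {slope} * x1 + {intercept}")   # x2 = -1.0 * x1 + 1.5
```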

  5. Training a perceptron
     1. Start with random weights, w = (w1, w2, ..., wn).
     2. Select a training example (xk, tk).
     3. Run the perceptron with input xk and weights w to obtain the output o.
     4. Let η be the learning rate (a user-set parameter). Now update each weight: wi ← wi + η (tk − o) xk,i.
     5. Go to 2.
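A sketch of this loop in Python, reusing perceptron_output from above. The slide leaves the stopping criterion open ("go to 2"), so the version below simply runs a fixed number of epochs; that choice, and the {-1, +1} target convention, are assumptions:

```python
import random

def train_perceptron(examples, eta=0.2, epochs=1, w=None):
    """examples: list of (x, t) pairs with targets t in {-1, +1}."""
    n = len(examples[0][0])
    if w is None:
        w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # (w0, w1, ..., wn)
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            # Perceptron learning rule: w_i <- w_i + eta * (t - o) * x_i
            w[0] += eta * (t - o)                  # the bias input is the constant +1
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w
```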

  6. Perceptron learning rule: In-class exercise
     • S = {((0,0), −1), ((0,1), 1), ((1,1), 1)}
     • Let w = (w0, w1, w2) = (0.1, 0.1, −0.3)
     1. Calculate the new perceptron weights after each training example is processed. Let η = 0.2.
     2. What is the accuracy on the training data after one epoch of training? Did the accuracy improve?
     [Diagram: perceptron with constant input +1 (weight w0 = 0.1), input x1 (weight w1 = 0.1), and input x2 (weight w2 = −0.3), producing output o.]
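The exercise is meant to be worked by hand, but the sketch above can be used to check the result (same assumed conventions):

```python
# The exercise's training set and starting weights (w0, w1, w2) = (0.1, 0.1, -0.3):
S = [((0, 0), -1), ((0, 1), 1), ((1, 1), 1)]
w = train_perceptron(S, eta=0.2, epochs=1, w=[0.1, 0.1, -0.3])

accuracy = sum(perceptron_output(w, x) == t for x, t in S) / len(S)
print(w, accuracy)   # weights and training-set accuracy after one epoch
```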

  7. Homework 1 summary
     1. Train a perceptron on the 8 vs. 0 training data (each example has inputs x1, ..., x64 and a label of 8 or 0; the perceptron has a constant +1 input and weights w0, w1, ..., w64, producing output o). Calculate accuracy on the training data.
     2. Evaluate the trained perceptron on the 8 vs. 0 test data: calculate accuracy on the test data and give the confusion matrix (actual vs. predicted) for the test data.

  8. Homework 1 summary
     1. Train a perceptron on the 8 vs. 1 training data (same setup: inputs x1, ..., x64, a constant +1 input, weights w0, w1, ..., w64, output o). Calculate accuracy on the training data.
     2. Evaluate the trained perceptron on the 8 vs. 1 test data: calculate accuracy on the test data and give the confusion matrix (actual vs. predicted) for the test data.
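The evaluation step is the same in both versions of the assignment: run the trained perceptron over a labeled dataset, compute accuracy, and tally a 2×2 confusion matrix. A sketch, assuming each example is a (features, digit) pair with 64 features and treating the first-named digit (8) as the positive class; the data format here is an assumption, not a specification of the assignment:

```python
def evaluate(w, data, positive_digit=8):
    """data: list of (features, digit) pairs. Returns accuracy and a 2x2 confusion matrix."""
    # confusion[actual][predicted], with index 1 = positive class, 0 = negative class
    confusion = [[0, 0], [0, 0]]
    for x, digit in data:
        actual = 1 if digit == positive_digit else 0
        predicted = 1 if perceptron_output(w, x) == 1 else 0
        confusion[actual][predicted] += 1
    accuracy = (confusion[0][0] + confusion[1][1]) / len(data)
    return accuracy, confusion
```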

  9. Questions on HW • What should the “threshold value” be? • What should the target and output values look like? • The assignment says we will train 10 separate perceptrons; shouldn’t this be 9?

  10. • 1960s: Rosenblatt proved that the perceptron learning rule converges to correct weights in a finite number of steps, provided the training examples are linearly separable. • 1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions. • However, they proved that any transformation can be carried out by adding a fully connected hidden layer.

  11. XOR function
     [Figure: the four points of the XOR function plotted in the (x1, x2) plane; the positive and negative examples cannot be separated by a single line.]
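The previous slide's point that a hidden layer removes this limitation can be made concrete with a hand-built network. The weights below are one standard construction chosen for illustration (not from the slides): one hidden unit computes OR, the other computes NAND, and the output unit ANDs them, which is exactly XOR.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two-layer network with hand-picked weights that computes XOR."""
    h1 = step(x1 + x2 - 0.5)       # hidden unit 1: OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)      # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)     # output unit: AND(h1, h2) = XOR(x1, x2)

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```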

  12. Multi-layer perceptron example
     [Figure: decision regions of a multilayer feedforward network.] The network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g., "had", "hid"). The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds. (From T. M. Mitchell, Machine Learning.)

  13. • Good news: Adding a hidden layer allows more target functions to be represented. • Bad news: No algorithm for learning in multi-layered networks, and no convergence theorem! • Quote from Minsky and Papert's book, Perceptrons (1969): “[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.”

  14. Two major problems they saw were: • How can the learning algorithm apportion credit (or blame) for incorrect classifications to individual weights, when the classification depends on a (sometimes) large number of weights? • How can such a network learn useful higher-order features? • Good news: Successful credit-apportionment learning algorithms were developed soon afterwards (e.g., back-propagation). These remain successful, in spite of the lack of a convergence theorem.
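To make the credit-apportionment idea concrete: back-propagation uses the chain rule to give every individual weight its own share of the gradient of the output error. The sketch below is an illustration only (sigmoid units, squared error, learning rate, and network size are all assumptions, not part of the course material); it trains a tiny 2-2-1 network on XOR.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 2))   # input -> hidden weights
b1 = np.zeros(2)
W2 = rng.normal(scale=0.5, size=(2,))     # hidden -> output weights
b2 = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)   # XOR targets
eta = 0.5

for epoch in range(10000):
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x + b1)               # forward pass: hidden activations
        o = sigmoid(W2 @ h + b2)               # forward pass: network output
        delta_o = (o - t) * o * (1 - o)        # error signal at the output unit
        delta_h = delta_o * W2 * h * (1 - h)   # error apportioned to the hidden units
        W2 -= eta * delta_o * h                # each weight gets its own gradient step
        b2 -= eta * delta_o
        W1 -= eta * np.outer(delta_h, x)
        b1 -= eta * delta_h

outputs = [float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)) for x in X]
print([round(o, 2) for o in outputs])   # ideally close to [0, 1, 1, 0]; plain gradient
                                        # descent can occasionally stall on this task
```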
