Machine Learning Neural Networks
Input-Output Transformation [diagram: input spikes arriving at a neuron produce an EPSP (Excitatory Post-Synaptic Potential) and, eventually, an output spike; a spike is a brief pulse]
Human Learning • Number of neurons: ~ 10^10 • Connections per neuron: ~ 10^4 to 10^5 • Neuron switching time: ~ 0.001 second • Scene recognition time: ~ 0.1 second → 100 inference steps doesn't seem like much
Artificial Neural Networks • Typically, machine learning ANNs are very artificial, ignoring: • Time • Space • Biological learning processes • More realistic neural models exist • Hodgkin & Huxley (1952) won a Nobel prize for theirs (in 1963) • Nonetheless, very artificial ANNs have been useful in many ML applications
Perceptrons • The “first wave” in neural networks • Big in the 1960’s • McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962)
A single perceptron [diagram: inputs x1 … x5 with weights w1 … w5, plus a bias input x0 = 1 (always) with weight w0, all feeding a single threshold unit]
Logical Operators [diagrams: AND — bias weight w0 = -0.8, input weights 0.5 and 0.5; OR — bias weight w0 = -0.3, input weights 0.5 and 0.5; NOT — bias weight w0 = 0.0, input weight -1.0]
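A minimal Python sketch of a threshold unit using the weights above; it assumes the convention output = 1 when w · x ≥ 0 (which is what makes the NOT weights shown here work), and the function and variable names are illustrative:

```python
def perceptron(weights, inputs):
    """weights[0] is the bias weight w0 (for x0 = 1); inputs are x1..xn."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total >= 0 else 0

AND_W = [-0.8, 0.5, 0.5]   # fires only when both inputs are 1
OR_W  = [-0.3, 0.5, 0.5]   # fires when at least one input is 1
NOT_W = [0.0, -1.0]        # inverts its single input

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2,
          "AND:", perceptron(AND_W, [x1, x2]),
          "OR:",  perceptron(OR_W,  [x1, x2]),
          "NOT x1:", perceptron(NOT_W, [x1]))
```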
Perceptron Hypothesis Space • What’s the hypothesis space for a perceptron on n inputs?
Learning Weights [diagram: a perceptron with unknown weights on x0, x1, x2] • Perceptron Training Rule • Gradient Descent • (other approaches: Genetic Algorithms)
Perceptron Training Rule • Weights modified for each example • Update Rule: w_i ← w_i + Δw_i, with Δw_i = η (t − o) x_i, where η is the learning rate, t the target value, o the perceptron output, and x_i the input value
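A hedged sketch of this update rule in Python, assuming binary 0/1 targets and the threshold output above; train_perceptron and its parameters are illustrative names:

```python
def train_perceptron(examples, n_inputs, eta=0.1, epochs=20):
    """examples: list of (inputs, target) pairs with 0/1 targets."""
    w = [0.0] * (n_inputs + 1)              # w[0] is the bias weight (input x0 = 1)
    for _ in range(epochs):
        for x, t in examples:               # weights modified for each example
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) >= 0 else 0
            delta = eta * (t - o)           # zero when the example is already correct
            w[0] += delta * 1               # bias input x0 = 1
            for i, xi in enumerate(x):
                w[i + 1] += delta * xi      # w_i <- w_i + eta * (t - o) * x_i
    return w

# e.g. learning AND from its truth table:
weights = train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)], n_inputs=2)
```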
What weights make XOR? [diagram: a single perceptron with inputs x0, x1, x2 and unknown weights] • No combination of weights works • Perceptrons can only represent linearly separable functions
Linear Separability [plot: OR in the (x1, x2) plane — (0,1), (1,0), (1,1) are positive, (0,0) is negative; one line separates them]
Linear Separability [plot: AND — only (1,1) is positive; again one line separates the classes]
Linear Separability [plot: XOR — (0,1) and (1,0) are positive, (0,0) and (1,1) are negative; no single line can separate them]
Perceptron Training Rule • Converges to the correct classification IF • Cases are linearly separable • Learning rate is slow enough • Proved by Minsky and Papert in 1969 (their analysis killed widespread interest in perceptrons till the 80’s)
XOR [diagram: a two-layer network of threshold units computing XOR from x1 and x2, with bias inputs x0; the recoverable connection weights are ±0.6]
What’s wrong with perceptrons? • You can always plug multiple perceptrons together to calculate any function. • BUT…who decides what the weights are? • Assignment of error to parental inputs becomes a problem…. • This is because of the threshold…. • Who contributed the error?
Problem: Assignment of error [diagram: a perceptron with unknown weights on x0, x1, x2; its output passes through a threshold (step function)] • Hard to tell from the output who contributed what • Stymies multi-layer weight learning
Solution: Differentiable Function [diagram: the same unit, but with a simple linear function in place of the threshold] • Varying any input a little creates a perceptible change in the output • This lets us propagate error to prior nodes
Measuring error for linear units • Output Function: o = w · x = Σ_i w_i x_i • Error Measure: E(w) = ½ Σ_{d∈D} (t_d − o_d)², where t_d is the target value, o_d the linear unit output, and D the data
Gradient Descent • Training rule: Δw = −η ∇E(w) • Gradient: ∇E(w) = [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n ]
Gradient Descent Rule • Update Rule: Δw_i = η Σ_{d∈D} (t_d − o_d) x_{i,d} (obtained by differentiating E(w) with respect to w_i)
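A sketch of batch gradient descent for a single linear unit implementing this update rule; the function name and defaults (eta, epochs) are illustrative:

```python
def gradient_descent_linear(data, n_inputs, eta=0.01, epochs=100):
    """data: list of (inputs, target) pairs; returns learned weights (w[0] is the bias)."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_inputs + 1)
        for x, t in data:                                        # sum over all examples d in D
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))  # linear unit output o_d
            grad[0] += (t - o)                                   # bias input x0 = 1
            for i, xi in enumerate(x):
                grad[i + 1] += (t - o) * xi
        for i in range(len(w)):
            w[i] += eta * grad[i]          # Delta w_i = eta * sum_d (t_d - o_d) * x_{i,d}
    return w
```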
Back to XOR [diagram: the two-layer XOR network again, with inputs x1, x2 and bias inputs x0]
Gradient Descent for Multiple Layers [diagram: the two-layer XOR network with inputs x1, x2 and bias inputs x0] • We can compute: ∂E/∂w_ij for every weight in the network, even across layers
Gradient Descent vs. Perceptrons • Perceptron Rule & Threshold Units • Learner converges on an answer ONLY IF data is linearly separable • Can’t assign proper error to parent nodes • Gradient Descent • Minimizes error even if examples are not linearly separable • Works for multi-layer networks • But…linear units only make linear decision surfaces (can’t learn XOR even with many layers) • And the step function isn’t differentiable…
A compromise function • Perceptron: o = 1 if Σ_i w_i x_i ≥ 0, else o = 0 • Linear: o = Σ_i w_i x_i • Sigmoid (Logistic): o = 1 / (1 + e^(−Σ_i w_i x_i))
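For concreteness, a small Python sketch of the three unit functions applied to the same weighted sum net = w · x, plus the sigmoid’s derivative (the identity σ′ = σ(1 − σ) is what makes the later gradient computations cheap); names are illustrative:

```python
import math

def threshold(net):           # perceptron unit: a step, not differentiable at 0
    return 1 if net >= 0 else 0

def linear(net):              # linear unit: differentiable, but only linear surfaces
    return net

def sigmoid(net):             # logistic unit: differentiable AND non-linear
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_derivative(net):  # sigma'(net) = sigma(net) * (1 - sigma(net))
    s = sigmoid(net)
    return s * (1.0 - s)
```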
The sigmoid (logistic) unit [diagram: a unit with inputs x1, x2 and weights to be learned] • Has differentiable function • Allows gradient descent • Can be used to learn non-linear functions
Logistic function [diagram: inputs Age = 34, Gender = 1, Stage = 4 (the independent variables) feed a single sigmoid unit with coefficients .5, .4, .8; output 0.6 = “Probability of being Alive” (the prediction)]
Neural Network Model [diagram: the inputs Age = 34, Gender = 2, Stage = 4 (independent variables) feed a hidden layer of sigmoid units through one set of weights; the hidden layer feeds the output unit through a second set of weights; output 0.6 = “Probability of being Alive” (dependent variable / prediction)]
Getting an answer from a NN [diagram sequence over three slides: each hidden unit’s output is computed in turn from the inputs (Age = 34, Gender, Stage = 4) and the first layer of weights, then the hidden outputs are combined through the second layer of weights to give the prediction 0.6, the “Probability of being Alive”]
Minimizing the Error [plot: the error surface over a weight w; starting from the initial w (initial error), each step follows the negative derivative, so a positive slope gives a negative change in w, until a local minimum is reached at the trained w (final error)]
Differentiability is key! • Sigmoid is easy to differentiate • For gradient descent on multiple layers, a little dynamic programming can help: • Compute errors at each output node • Use these to compute errors at each hidden node • Use these to compute errors at each input node
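A hedged sketch of that dynamic-programming pass for a network with one sigmoid hidden layer and a single sigmoid output; the shapes and names (W_hidden, W_out, eta) are illustrative, not from the slides:

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One gradient step for a 1-hidden-layer sigmoid network with a single output.
    W_hidden[j] holds the weights into hidden unit j (bias first); W_out likewise."""
    # Forward pass
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in W_hidden]
    o = sigmoid(W_out[0] + sum(wi * hi for wi, hi in zip(W_out[1:], h)))
    # Error term at the output node
    delta_o = o * (1 - o) * (t - o)
    # Error terms at the hidden nodes, reusing the output error (dynamic programming)
    delta_h = [h[j] * (1 - h[j]) * W_out[j + 1] * delta_o for j in range(len(h))]
    # Weight updates
    W_out[0] += eta * delta_o
    for j, hj in enumerate(h):
        W_out[j + 1] += eta * delta_o * hj
    for j, w in enumerate(W_hidden):
        w[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * delta_h[j] * xi
    return o
```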
Expressive Power of ANNs • Universal Function Approximator: • Given enough hidden units, can approximate any continuous function f • Need 2+ hidden units to learn XOR • Why not use millions of hidden units? • Efficiency (neural network training is slow) • Overfitting
Overfitting [plot: a model, the real distribution, and an overfitted model]
Combating Overfitting in Neural Nets • Many techniques • Two popular ones: • Early Stopping • Use “a lot” of hidden units • Just don’t over-train • Cross-validation • Choose the “right” number of hidden units
Early Stopping [plot: error vs. training epochs for a model that overfits; curve a = validation set error, curve b = training set error; training error keeps falling while validation error reaches a minimum (min error on D) and then rises; the stopping criterion is the epoch of minimum validation error]
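A sketch of the stopping criterion in that plot; model, train_one_epoch, validation_error, and copy_weights are hypothetical placeholders for whatever training loop is actually used:

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Keep the weights from the epoch with the lowest validation-set error."""
    best_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                      # one pass over the training set (curve b)
        err = validation_error(model)               # curve a in the plot above
        if err < best_err:
            best_err, best_weights, bad_epochs = err, model.copy_weights(), 0
        else:
            bad_epochs += 1                         # validation error is rising: overfitting
            if bad_epochs >= patience:
                break                               # stopping criterion
    return best_weights
```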
Cross-validation • Cross-validation: general-purpose technique for model selection • E.g., “how many hidden units should I use?” • More extensive version of validation-set approach.
Cross-validation • Break training set into k sets • For each model M • For i = 1…k • Train M on all but set i • Test on set i • Output M with highest average test score, trained on full training set
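A sketch of this procedure in Python; models, train_fn, and score_fn are illustrative placeholders for the candidate models (e.g. different numbers of hidden units), the training routine, and the test-set score:

```python
def cross_validate(models, data, k, train_fn, score_fn):
    folds = [data[i::k] for i in range(k)]                 # break training set into k sets
    best_model, best_score = None, float("-inf")
    for m in models:                                       # for each model M
        scores = []
        for i in range(k):
            held_out = folds[i]
            rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            trained = train_fn(m, rest)                    # train M on all but set i
            scores.append(score_fn(trained, held_out))     # test on set i
        avg = sum(scores) / k
        if avg > best_score:
            best_model, best_score = m, avg
    return train_fn(best_model, data)                      # retrain winner on full training set
```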
Summary of Neural Networks Non-linear regression technique that is trained with gradient descent. Question: How important is the biological metaphor?
Summary of Neural Networks When are Neural Networks useful? • Instances represented by attribute-value pairs • Particularly when attributes are real valued • The target function is • Discrete-valued • Real-valued • Vector-valued • Training examples may contain errors • Fast evaluation times are necessary When not? • Fast training times are necessary • Understandability of the function is required
Advanced Topics in Neural Nets • Batch Mode vs. Incremental • Hidden Layer Representations • Hopfield Nets • Neural Networks on Silicon • Neural Network language models
Incremental vs. Batch Mode • In Batch Mode we minimize: E_D(w) = ½ Σ_{d∈D} (t_d − o_d)² • Same as computing: ∇E_D(w) = Σ_{d∈D} ∇E_d(w) • Then setting: Δw = −η ∇E_D(w)
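A sketch contrasting the two modes for a linear unit, assuming each input vector already includes the bias component x_0 = 1; names (eta, data) are illustrative:

```python
def batch_epoch(w, data, eta):
    """Accumulate the gradient over all of D, then make one weight change."""
    grad = [0.0] * len(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            grad[i] += (t - o) * xi          # sum over the whole training set D
    return [wi + eta * gi for wi, gi in zip(w, grad)]

def incremental_epoch(w, data, eta):
    """Nudge the weights after every single example d."""
    w = list(w)
    for x, t in data:
        o = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            w[i] += eta * (t - o) * xi       # update immediately after each example
    return w
```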