
Machine Learning


Presentation Transcript


  1. Machine Learning Neural Networks

  2. Human Brain

  3. Neurons

  4. Input-Output Transformation [diagram: input spikes arriving at a neuron are transformed into an output spike; a spike is a brief pulse, and each input contributes an excitatory post-synaptic potential (EPSP)]

  5. Human Learning • Number of neurons: ~10^10 • Connections per neuron: ~10^4 to 10^5 • Neuron switching time: ~0.001 second • Scene recognition time: ~0.1 second • That allows only about 100 sequential firing steps (0.1 s / 0.001 s); 100 inference steps doesn't seem like much

  6. Machine Learning Abstraction

  7. Artificial Neural Networks • Typically, machine learning ANNs are very artificial, ignoring: • Time • Space • Biological learning processes • More realistic neural models exist • Hodgkin & Huxley (1952) won a Nobel Prize for theirs (in 1963) • Nonetheless, very artificial ANNs have been useful in many ML applications

  8. Perceptrons • The “first wave” in neural networks • Big in the 1960s • McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962)

  9. A single perceptron [diagram: inputs x1 … x5 with weights w1 … w5, plus a bias input x0 = 1 with weight w0, all feeding a single perceptron unit]

  10. Logical Operators [diagrams: perceptron weight settings, with x0 the bias input] • AND: w0 = -0.8, w1 = 0.5, w2 = 0.5 • OR: w0 = -0.3, w1 = 0.5, w2 = 0.5 • NOT: w0 = 0.0, w1 = -1.0
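
A minimal sketch of these weight settings in code, assuming a threshold unit that fires when the weighted sum (bias included) is at least zero; with that convention the NOT weights above work as well:

```python
# Threshold (perceptron) unit: output 1 if the weighted sum is >= 0, else 0.
# The first weight is the bias weight w0, applied to a constant input x0 = 1.

def perceptron(weights, inputs):
    x = [1.0] + list(inputs)                      # prepend the bias input x0 = 1
    s = sum(w * xi for w, xi in zip(weights, x))  # weighted sum w0*x0 + w1*x1 + ...
    return 1 if s >= 0 else 0                     # step activation

AND_W = [-0.8, 0.5, 0.5]   # weights from the slide: w0, w1, w2
OR_W  = [-0.3, 0.5, 0.5]
NOT_W = [ 0.0, -1.0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", perceptron(AND_W, [a, b]), "OR:", perceptron(OR_W, [a, b]))
print("NOT 0:", perceptron(NOT_W, [0]), "NOT 1:", perceptron(NOT_W, [1]))
```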

  11. Perceptron Hypothesis Space • What’s the hypothesis space for a perceptron on n inputs?

  12. Learning Weights [diagram: a perceptron over x0, x1, x2 with unknown weights] • Perceptron Training Rule • Gradient Descent • (other approaches: Genetic Algorithms)

  13. Perceptron Training Rule • Weights modified for each example • Update Rule: $w_i \leftarrow w_i + \Delta w_i$ with $\Delta w_i = \eta\,(t - o)\,x_i$, where $\eta$ is the learning rate, $t$ the target value, $o$ the perceptron output, and $x_i$ the input value
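
A sketch of the rule in code; the threshold unit and the OR training set below are illustrative, and the learning rate and epoch count are arbitrary choices:

```python
def predict(weights, x):
    # x already includes the bias input x0 = 1
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s >= 0 else 0

def perceptron_train(examples, eta=0.1, epochs=20):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # weights, including the bias weight w0
    for _ in range(epochs):
        for inputs, target in examples:
            x = [1.0] + list(inputs)          # prepend the bias input
            o = predict(w, x)
            # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
            w = [wi + eta * (target - o) * xi for wi, xi in zip(w, x)]
    return w

# OR is linearly separable, so the rule settles on a separating weight vector.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_train(data))
```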

  14. What weights make XOR? [diagram: a single perceptron over x0, x1, x2 with unknown weights] • No combination of weights works • Perceptrons can only represent linearly separable functions

  15. Linear Separability [plot: OR in the (x1, x2) plane; only (0, 0) is negative, and a single line separates it from the three positive points]

  16. Linear Separability [plot: AND in the (x1, x2) plane; only (1, 1) is positive, and a single line separates it from the three negative points]

  17. Linear Separability [plot: XOR in the (x1, x2) plane; the positive points (0, 1) and (1, 0) sit diagonally opposite each other, so no single line separates positives from negatives]

  18. Perceptron Training Rule • Converges to the correct classification IF • Cases are linearly separable • Learning rate is slow enough • Proved by Minsky and Papert in 1969 • Their analysis killed widespread interest in perceptrons till the 80s

  19. XOR [diagram: a two-layer network of threshold units over x1 and x2, each unit with its own bias input x0, that computes XOR using weights of ±0.6]

  20. What’s wrong with perceptrons? • You can always plug multiple perceptrons together to calculate any Boolean function • BUT… who decides what the weights are? • Assignment of error to parental inputs becomes a problem • This is because of the threshold: who contributed the error?

  21. Problem: Assignment of error [diagram: a perceptron with a threshold (step) function over x0, x1, x2 and unknown weights] • Hard to tell from the output who contributed what • Stymies multi-layer weight learning

  22. Solution: Differentiable Function [diagram: the same unit with a simple linear function in place of the threshold] • Varying any input a little creates a perceptible change in the output • This lets us propagate error to prior nodes

  23. Measuring error for linear units • Output Function: $o(\vec{x}) = \vec{w} \cdot \vec{x}$ • Error Measure: $E[\vec{w}] = \tfrac{1}{2}\sum_{d \in D}(t_d - o_d)^2$, where $t_d$ is the target value, $o_d$ the linear unit output, and $D$ the data

  24. Gradient Descent • Training rule: $\Delta \vec{w} = -\eta\,\nabla E[\vec{w}]$ • Gradient: $\nabla E[\vec{w}] = \left[\tfrac{\partial E}{\partial w_0}, \tfrac{\partial E}{\partial w_1}, \ldots, \tfrac{\partial E}{\partial w_n}\right]$

  25. Gradient Descent Rule • Update Rule: $\Delta w_i = \eta \sum_{d \in D}(t_d - o_d)\,x_{i,d}$
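
A sketch of batch gradient descent for a single linear unit using this update rule; the toy data, learning rate, and epoch count are made up for illustration:

```python
def gradient_descent(data, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x (x includes the bias x0 = 1)."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                       # accumulated weight change for this epoch
        for inputs, t in data:
            x = [1.0] + list(inputs)
            o = sum(wi * xi for wi, xi in zip(w, x))  # linear unit output
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi        # delta_w_i = eta * sum_d (t_d - o_d) x_id
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Toy data generated (noiselessly) from t = 1 + 2*x1 - x2;
# the learned weights approach the exact solution [1, 2, -1].
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2), ((2, 1), 4)]
print(gradient_descent(data))
```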

  26. Back to XOR [diagram: a two-layer network for XOR: inputs x1 and x2 feed two hidden units, which feed an output unit; each unit has its own bias input x0]

  27. Gradient Descent for Multiple Layers [diagram: the two-layer XOR network again] • With differentiable units we can compute $\tfrac{\partial E}{\partial w_{ij}}$ for every weight in the network

  28. Gradient Descent vs. Perceptrons • Perceptron Rule & Threshold Units • Learner converges on an answer ONLY IF data is linearly separable • Can’t assign proper error to parent nodes • Gradient Descent • Minimizes error even if examples are not linearly separable • Works for multi-layer networks • But…linear units only make linear decision surfaces (can’t learn XOR even with many layers) • And the step function isn’t differentiable…

  29. A compromise function • Perceptron: output 1 if $\vec{w} \cdot \vec{x} > 0$, else 0 • Linear: $o = \vec{w} \cdot \vec{x}$ • Sigmoid (Logistic): $o = \dfrac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}$

  30. The sigmoid (logistic) unit [diagram: a unit over x0, x1, x2 with unknown weights and a sigmoid activation] • Has a differentiable activation function • Allows gradient descent • Can be used to learn non-linear functions
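
A sketch of the sigmoid unit itself; the handy identity σ'(y) = σ(y)(1 - σ(y)) is what the gradient computations that follow rely on:

```python
import math

def sigmoid(y):
    """Logistic function: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(weights, inputs):
    # weights[0] is the bias weight on the constant input x0 = 1
    x = [1.0] + list(inputs)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def sigmoid_derivative(output):
    # Derivative expressed in terms of the unit's own output: o * (1 - o)
    return output * (1.0 - output)
```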

  31. Logistic function [diagram: a single sigmoid (Σ) unit; the independent variables Age = 34, Gender = 1, Stage = 4 are weighted by coefficients (.5, .4, .8) and squashed to give the prediction 0.6, the “probability of being alive”]

  32. Neural Network Model [diagram: the same independent variables (Age = 34, Gender = 2, Stage = 4) feed two hidden sigmoid units through one set of weights; the hidden outputs feed a sigmoid output unit through a second set of weights, giving the prediction 0.6, the “probability of being alive” (the dependent variable)]

  33. Getting an answer from a NN [diagram: one step of the forward pass: a hidden unit’s weighted sum of the inputs is squashed by the sigmoid]

  34. Getting an answer from a NN [diagram: the same forward-pass step for the other hidden unit]

  35. Getting an answer from a NN [diagram: the hidden outputs are combined by the output unit’s weights and squashed to give the prediction 0.6, the “probability of being alive”]
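
A sketch of the forward pass for a network of this shape (three inputs, two hidden sigmoid units, one sigmoid output). The weight values below are placeholders rather than the ones in the figure, and raw inputs like Age would normally be rescaled first:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer(weight_rows, inputs):
    """One layer of sigmoid units; each row is [bias, w1, w2, ...] for one unit."""
    x = [1.0] + list(inputs)
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weight_rows]

# Illustrative weights for a 3-2-1 network (bias weight first in each row).
hidden_weights = [[-1.0, 0.05, 0.3, 0.2],    # hidden unit 1
                  [ 0.5, -0.02, 0.1, -0.4]]  # hidden unit 2
output_weights = [[-0.3, 0.8, 0.7]]          # output unit over the two hidden outputs

age, gender, stage = 34, 1, 4                # the example case from the slides
hidden = layer(hidden_weights, [age, gender, stage])
output = layer(output_weights, hidden)[0]
print("predicted probability of being alive:", round(output, 3))
```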

  36. Minimizing the Error [plot: the error surface over a weight w; starting from the initial weight and initial error, repeated steps along the negative derivative move w until the error reaches a local minimum, the final error at the trained weight]

  37. Differentiability is key! • Sigmoid is easy to differentiate • For gradient descent on multiple layers, a little dynamic programming can help: • Compute errors at each output node • Use these to compute errors at each hidden node • Use those to compute errors at the layers below, working back toward the inputs

  38. The Backpropagation Algorithm
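
A minimal backpropagation sketch for the two-layer sigmoid network from the XOR slides: compute the output-node error term, propagate it back to the hidden nodes, then update every weight. The learning rate, epoch count, and random initialization are illustrative, and training can occasionally stall in a local minimum depending on the seed:

```python
import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, W_hidden, W_out):
    """Forward pass; every weight row is [bias, w1, w2, ...]."""
    xb = [1.0] + x
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W_hidden]
    hb = [1.0] + h
    o = sigmoid(sum(w * hi for w, hi in zip(W_out, hb)))
    return h, o

def train_xor(eta=0.5, epochs=10000, seed=0):
    rng = random.Random(seed)
    W_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 hidden units
    W_out = [rng.uniform(-1, 1) for _ in range(3)]                         # 1 output unit
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]            # XOR
    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x, W_hidden, W_out)
            # Output-node error term: delta_o = o (1 - o) (t - o)
            delta_o = o * (1 - o) * (t - o)
            # Hidden-node error terms: delta_h_k = h_k (1 - h_k) * w_out_k * delta_o
            delta_h = [h[k] * (1 - h[k]) * W_out[k + 1] * delta_o for k in range(2)]
            # Weight updates: w <- w + eta * delta * (input feeding that weight)
            hb = [1.0] + h
            W_out = [w + eta * delta_o * hi for w, hi in zip(W_out, hb)]
            xb = [1.0] + x
            for k in range(2):
                W_hidden[k] = [w + eta * delta_h[k] * xi
                               for w, xi in zip(W_hidden[k], xb)]
    return W_hidden, W_out

# Outputs should approach the XOR targets; a different seed or more epochs
# may be needed if this initialization lands in a local minimum.
W_hidden, W_out = train_xor()
for x, t in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
    print(x, t, round(forward(x, W_hidden, W_out)[1], 2))
```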

  39. Getting an answer from a NN [diagram: the forward-pass figure from slide 35, repeated]

  40. Expressive Power of ANNs • Universal Function Approximator: • Given enough hidden units, can approximate any continuous function f • Need 2+ hidden units to learn XOR • Why not use millions of hidden units? • Efficiency (neural network training is slow) • Overfitting

  41. Overfitting [plot: an overfitted model tracks the training sample exactly while drifting away from the real distribution]

  42. Combating Overfitting in Neural Nets • Many techniques • Two popular ones: • Early Stopping • Use “a lot” of hidden units • Just don’t over-train • Cross-validation • Choose the “right” number of hidden units

  43. Early Stopping [plot: model error vs. training epochs; the training-set error (b) keeps decreasing, while the validation-set error (a) reaches its minimum and then rises as the model overfits; the stopping criterion is the epoch of minimum validation error]
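
A sketch of the early-stopping loop; train_one_epoch and validation_error are hypothetical caller-supplied functions, not a real library API:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Keep training while validation error improves; return the best snapshot.

    `train_one_epoch(model)` and `validation_error(model)` are placeholders
    assumed to be supplied by the caller.
    """
    best_model = copy.deepcopy(model)
    best_error = validation_error(model)
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training set
        err = validation_error(model)          # error on the held-out validation set
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)  # remember the best weights so far
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                          # validation error has stopped improving
    return best_model, best_error
```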

  44. Cross-validation • Cross-validation: general-purpose technique for model selection • E.g., “how many hidden units should I use?” • More extensive version of validation-set approach.

  45. Cross-validation • Break training set into k sets • For each model M • For i = 1…k • Train M on all but set i • Test on set i • Output M with highest average test score, trained on the full training set
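
A sketch of that procedure; train and test_score are hypothetical caller-supplied functions (for example, training a network with a given number of hidden units and scoring it on held-out examples):

```python
def cross_validate(model_specs, examples, train, test_score, k=5):
    """Pick the model spec (e.g. number of hidden units) with the best average
    held-out score, then retrain it on the full training set.

    `train(spec, examples)` and `test_score(model, examples)` are placeholders
    assumed to be supplied by the caller.
    """
    folds = [examples[i::k] for i in range(k)]           # break the data into k sets

    def avg_score(spec):
        scores = []
        for i in range(k):
            held_out = folds[i]
            rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train(spec, rest)                    # train on all but set i
            scores.append(test_score(model, held_out))   # test on set i
        return sum(scores) / k

    best_spec = max(model_specs, key=avg_score)
    return train(best_spec, examples)                    # retrain best spec on all data
```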

  46. Summary of Neural Networks Non-linear regression technique that is trained with gradient descent. Question: How important is the biological metaphor?

  47. Summary of Neural Networks When are Neural Networks useful? • Instances represented by attribute-value pairs • Particularly when attributes are real valued • The target function is • Discrete-valued • Real-valued • Vector-valued • Training examples may contain errors • Fast evaluation times are necessary When not? • Fast training times are necessary • Understandability of the function is required

  48. Advanced Topics in Neural Nets • Batch Mode vs. Incremental • Hidden Layer Representations • Hopfield Nets • Neural Networks on Silicon • Neural Network language models

  49. Incremental vs. Batch Mode

  50. Incremental vs. Batch Mode • In Batch Mode we minimize: $E_D[\vec{w}] = \tfrac{1}{2}\sum_{d \in D}(t_d - o_d)^2$ • Same as computing: $\Delta \vec{w}_D = -\eta\,\nabla E_D[\vec{w}]$ • Then setting $\vec{w} \leftarrow \vec{w} + \Delta \vec{w}_D$ • In Incremental Mode we instead take a step on the per-example error $E_d[\vec{w}] = \tfrac{1}{2}(t_d - o_d)^2$ after each example
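
A sketch contrasting the two schedules for a linear unit with the delta-rule gradient from earlier; the toy data and learning rate are illustrative:

```python
def output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_epoch(w, data, eta):
    """Batch mode: accumulate the gradient over ALL examples, then update once."""
    delta = [0.0] * len(w)
    for inputs, t in data:
        x = [1.0] + list(inputs)
        err = t - output(w, x)
        delta = [d + eta * err * xi for d, xi in zip(delta, x)]
    return [wi + di for wi, di in zip(w, delta)]

def incremental_epoch(w, data, eta):
    """Incremental (stochastic) mode: update the weights after EACH example."""
    for inputs, t in data:
        x = [1.0] + list(inputs)
        err = t - output(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Noiseless toy data from t = 1 + 2*x1 - x2; both schedules approach [1, 2, -1].
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2)]
w_batch = w_incr = [0.0, 0.0, 0.0]
for _ in range(500):
    w_batch = batch_epoch(w_batch, data, eta=0.05)
    w_incr = incremental_epoch(w_incr, data, eta=0.05)
print("batch:      ", [round(wi, 2) for wi in w_batch])
print("incremental:", [round(wi, 2) for wi in w_incr])
```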
