
Machine Learning

Presentation Transcript


  1. Machine Learning Neural Networks Eric Xing Lecture 3, August 12, 2010 Reading: Chap. 5 CB

  2. Learning highly non-linear functions f: X → Y • f might be a non-linear function • X: (vector of) continuous and/or discrete variables • Y: (vector of) continuous and/or discrete variables • Examples: the XOR gate; speech recognition

  3. Perceptron and Neural Nets • From biological neuron to artificial neuron (perceptron) • Activation function • Artificial neuron networks • supervised learning • gradient descent

  4. Connectionist Models • Consider humans: • Neuron switching time ~0.001 second • Number of neurons ~10^10 • Connections per neuron ~10^4 to 10^5 • Scene recognition time ~0.1 second • 100 inference steps doesn't seem like enough → much parallel computation • Properties of artificial neural nets (ANN) • Many neuron-like threshold switching units • Many weighted interconnections among units • Highly parallel, distributed processes

  5. Abdominal Pain Perceptron [Figure: a perceptron for abdominal pain diagnosis. Inputs (with example values) include Age 20, Male 1, Temp 37, WBC 10, Pain Intensity 1, and Pain Duration 1; adjustable weights connect them to output units for diagnoses including Appendicitis, Pancreatitis, Cholecystitis, Diverticulitis, Duodenal Ulcer, Small Bowel Obstruction, Perforated [ulcer], and Non-specific Pain, with the predicted diagnosis output 1 and the others 0.]

  6. Myocardial Infarction Network [Figure: a perceptron-style network for myocardial infarction. Inputs (with example values) include Pain Intensity, Pain Duration, ECG ST Elevation, Smoker, Age, and Male; the output is the "probability" of MI, here 0.8.]

  7. The "Driver" Network

  8. Perceptrons [Figure: a single-layer perceptron with input units Headache and Cough and output units Pneumonia, No disease, Flu, and Meningitis; error = what we got - what we wanted; the Δ rule changes the weights to decrease the error.]

  9. Perceptrons. Input to unit $i$: $a_i$ = measured value of variable $i$. Input to unit $j$: $a_j = \sum_i w_{ij} a_i$. Output of unit $j$: $o_j = 1/(1 + e^{-(a_j + \theta_j)})$.

  10. Jargon Pseudo-Correspondence • Independent variable = input variable • Dependent variable = output variable • Coefficients = "weights" • Estimates = "targets" • Logistic Regression Model (the sigmoid unit): independent variables x1, x2, x3; coefficients a, b, c; dependent variable p (the prediction). [Figure: a single Σ/sigmoid unit with inputs Age = 34, Gender = 1, Stage = 4 and output 0.6, the "probability of being alive".]
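The correspondence can be made concrete in a few lines. Below is a minimal sketch of a single sigmoid unit acting as a logistic regression model; the input names and the weight values are illustrative, not taken from the figure:

```python
import math

def sigmoid_unit(inputs, weights, bias):
    """Logistic regression as a single 'neuron': weighted sum, then sigmoid."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-a))

# Illustrative only: a made-up "probability of being alive" from age, gender, stage.
p = sigmoid_unit(inputs=[34, 1, 4], weights=[-0.02, 0.3, -0.4], bias=1.5)
```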

  11. The perceptron learning algorithm • Recall the nice property of the sigmoid function • Consider the regression problem f: X → Y, for scalar Y: • Let's maximize the conditional data likelihood
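The "nice property" being recalled is presumably the derivative identity of the sigmoid, and one standard way to write the objective (the slide's own formulas are not preserved in the transcript) treats the sigmoid output as $P(y=1 \mid x, \mathbf{w})$ for a 0/1 target:

$$\sigma(a) = \frac{1}{1+e^{-a}}, \qquad \frac{d\sigma}{da} = \sigma(a)\,\big(1-\sigma(a)\big)$$

$$\ell(\mathbf{w}) = \sum_d \ln P(y_d \mid x_d, \mathbf{w}) = \sum_d \big[\, y_d \ln o_d + (1-y_d)\ln(1-o_d) \,\big], \qquad o_d = \sigma(\mathbf{w}\cdot x_d)$$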

  12. Gradient Descent. Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i.
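With this notation, the standard squared-error objective and gradient-descent rule that the following slides build on (the formulas themselves are not preserved in the transcript) are:

$$E_D[\mathbf{w}] = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2, \qquad \Delta w_i = -\eta\,\frac{\partial E_D[\mathbf{w}]}{\partial w_i}$$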

  13. The perceptron learning rules. Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i. • Batch mode: do until converged: 1. compute the gradient ∇E_D[w] over the whole training set D; 2. update w ← w − η ∇E_D[w]. • Incremental mode: do until converged: for each training example d in D: 1. compute the gradient ∇E_d[w]; 2. update w ← w − η ∇E_d[w]; where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)² and E_d[w] = ½ (t_d − o_d)².
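A compact sketch of the two modes for a single sigmoid unit with squared error; the helper names and the learning rate eta below are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_Ed(w, x_d, t_d):
    """Gradient of E_d[w] = 1/2 (t_d - o_d)^2 for one example and a sigmoid unit."""
    o_d = sigmoid(w @ x_d)
    return -(t_d - o_d) * o_d * (1 - o_d) * x_d

def batch_step(w, X, T, eta=0.1):
    """Batch mode: one step along the gradient of E_D[w] summed over all of D."""
    return w - eta * sum(grad_Ed(w, x_d, t_d) for x_d, t_d in zip(X, T))

def incremental_epoch(w, X, T, eta=0.1):
    """Incremental mode: update after computing the gradient of each E_d[w] in turn."""
    for x_d, t_d in zip(X, T):
        w = w - eta * grad_Ed(w, x_d, t_d)
    return w
```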

  14. MLE vs MAP • Maximum conditional likelihood estimate • Maximum a posteriori estimate
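Written out under the usual assumptions (the slide's own notation is not preserved), the two estimates are:

$$\mathbf{w}_{\mathrm{MLE}} = \arg\max_{\mathbf{w}} \sum_d \ln P(y_d \mid x_d, \mathbf{w}), \qquad \mathbf{w}_{\mathrm{MAP}} = \arg\max_{\mathbf{w}} \Big[\ln P(\mathbf{w}) + \sum_d \ln P(y_d \mid x_d, \mathbf{w})\Big]$$

With a zero-mean Gaussian prior on the weights, the MAP objective is the MLE objective minus a squared-weight penalty (cf. slide 38).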

  15. What decision surface does a perceptron define? (NAND) A single unit computes y = f(x1·w1 + x2·w2), with f(a) = 1 for a > θ, f(a) = 0 for a ≤ θ, and θ = 0.5. Truth table as shown: f(0·w1 + 0·w2) = 1, f(0·w1 + 1·w2) = 1, f(1·w1 + 0·w2) = 1, f(1·w1 + 1·w2) = 0. Some possible values for (w2, w1): (0.35, 0.20), (0.40, 0.20), (0.30, 0.25), (0.20, 0.40).

  16. What decision surface does a perceptron define? The truth table shown here is the XOR gate: f(0·w1 + 0·w2) = 0, f(0·w1 + 1·w2) = 1, f(1·w1 + 0·w2) = 1, f(1·w1 + 1·w2) = 0, again with θ = 0.5 and f(a) = 1 for a > θ, 0 for a ≤ θ. Possible values for w1 and w2: none; no single threshold unit can reproduce this truth table.

  17. What decision surface does a perceptron define? A two-layer network for XOR. [Figure: inputs x1, x2 feed two hidden threshold units through weights w1–w4, and the hidden units feed the output unit through weights w5, w6; θ = 0.5 for all units, with f(a) = 1 for a > θ and 0 for a ≤ θ.] A possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, -0.6, -0.7, 0.8, 1, 1).
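A quick check of these numbers, assuming (since the figure itself is not preserved) that w1 and w3 feed the first hidden unit from x1 and x2, w2 and w4 feed the second, and w5, w6 connect the hidden units to the output; under that wiring the listed weights reproduce the XOR truth table:

```python
def f(a, theta=0.5):
    """Threshold unit: 1 for a > theta, 0 otherwise."""
    return 1 if a > theta else 0

w1, w2, w3, w4, w5, w6 = 0.6, -0.6, -0.7, 0.8, 1.0, 1.0

def network(x1, x2):
    h1 = f(x1 * w1 + x2 * w3)    # first hidden unit (assumed wiring)
    h2 = f(x1 * w2 + x2 * w4)    # second hidden unit (assumed wiring)
    return f(h1 * w5 + h2 * w6)  # output unit

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, network(x1, x2))  # prints the XOR truth table: 0, 1, 1, 0
```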

  18. Non Linear Separation

  19. Neural Network Model [Figure: inputs Age = 34, Gender = 2, Stage = 4 (independent variables) connect through a first set of weights to a hidden layer of Σ units, which connect through a second set of weights to an output Σ unit; the output (dependent variable / prediction) is 0.6, the "probability of being alive".]

  20. "Combined logistic models" [Figure: the same network, highlighting the path through one hidden unit and its weights; output 0.6, the "probability of being alive".]

  21. [Figure: the same network, now highlighting a second hidden unit and its weights.]

  22. [Figure: the same network, now highlighting a third hidden unit and its weights.]

  23. Not really: there is no target for the hidden units... [Figure: the full network again; the hidden units, unlike the output, have no target values to fit against.]

  24. Perceptrons [Figure: the single-layer perceptron from slide 8 again: input units Headache and Cough, output units Pneumonia, No disease, Flu, Meningitis; error = what we got - what we wanted; the Δ rule changes the weights to decrease the error.]

  25. Hidden Units and Backpropagation

  26. Backpropagation Algorithm. Notation: x_d = input, t_d = target output, o_d = observed unit output, w_i = weight i. • Initialize all weights to small random numbers. • Until convergence, do: • Input the training example to the network and compute the network outputs • For each output unit k, compute its error term δ_k • For each hidden unit h, compute its error term δ_h • Update each network weight w_{i,j}
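A minimal sketch of the standard squared-error backpropagation rules for sigmoid units, which the slide's missing error-term and update formulas presumably state; the layer sizes, learning rate, and data below are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W_hid, W_out, eta=0.05):
    """One incremental backprop update for a single training example (x, t)."""
    # Forward pass
    h = sigmoid(W_hid @ x)    # hidden-unit outputs
    o = sigmoid(W_out @ h)    # output-unit outputs
    # Error terms for sigmoid units with squared error
    delta_out = o * (1 - o) * (t - o)                # one delta per output unit k
    delta_hid = h * (1 - h) * (W_out.T @ delta_out)  # one delta per hidden unit h
    # Gradient-descent weight updates
    W_out += eta * np.outer(delta_out, h)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out

# Toy usage: 3 inputs, 2 hidden units, 1 output, random initial weights.
rng = np.random.default_rng(0)
W_hid = rng.normal(scale=0.1, size=(2, 3))
W_out = rng.normal(scale=0.1, size=(1, 2))
x, t = np.array([0.2, 1.0, 0.5]), np.array([1.0])
W_hid, W_out = backprop_step(x, t, W_hid, W_out)
```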

  27. More on Backpropagation • It performs gradient descent over the entire network weight vector • Easily generalized to arbitrary directed graphs • Will find a local, not necessarily global, error minimum • In practice, it often works well (can run multiple times) • Often includes a weight momentum α • Minimizes error over the training examples • Will it generalize well to subsequent testing examples? • Training can take thousands of iterations → very slow! • Using the network after training is very fast
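The momentum term adds a fraction α of the previous weight update to the current one, which smooths the descent; a minimal sketch (the names eta and alpha are illustrative):

```python
def momentum_update(w, grad, prev_dw, eta=0.05, alpha=0.9):
    """Gradient-descent step with momentum: dw(t) = -eta * grad + alpha * dw(t-1)."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw  # return the new weights and the update to reuse next step
```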

  28. Learning Hidden Layer Representation • A network: • A target function: • Can this be learned?

  29. Learning Hidden Layer Representation • A network: • Learned hidden layer representation:

  30. Training

  31. Training

  32. Expressive Capabilities of ANNs • Boolean functions: • Every Boolean function can be represented by a network with a single hidden layer • But it might require a number of hidden units exponential in the number of inputs • Continuous functions: • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989] • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

  33. Application: ANN for Face Recognition [Figures: the network model and the learned hidden-unit weights.]

  34. Regression vs. Neural Networks

  35. Minimizing the Error [Figure: an error surface over a weight w; starting from the initial w (initial error), the negative derivative gives a positive change in w, and training moves w to a local minimum of the error surface (final error, trained w).]

  36. Overfitting in Neural Nets [Figures: an overfitted model vs. the "real" model of CHD against age, and error against training cycles for the training and holdout sets.]
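The training/holdout curves are the usual motivation for early stopping (listed on the final slide); a minimal sketch, where train_one_epoch and holdout_error are hypothetical placeholders for your own routines:

```python
def train_with_early_stopping(weights, train_one_epoch, holdout_error,
                              max_epochs=1000, patience=20):
    """Keep the weights with the lowest holdout error; stop when it stops improving."""
    best_w, best_err, since_best = weights, float("inf"), 0
    for _ in range(max_epochs):
        weights = train_one_epoch(weights)   # one pass over the training data
        err = holdout_error(weights)         # error on the held-out examples
        if err < best_err:
            best_w, best_err, since_best = weights, err, 0
        else:
            since_best += 1
            if since_best >= patience:       # holdout error has stopped improving
                break
    return best_w
```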

  37. Alternative Error Functions • Penalize large weights: • Training on target slopes as well as values • Tie together weights • E.g., in phoneme recognition
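The "penalize large weights" option is usually written as the squared error plus a squared-weight term; a standard form (the slide's own formula is not preserved in the transcript) is:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}\sum_{k \in \mathrm{outputs}}(t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$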

  38. Artificial neural networks – what you should know • Highly expressive non-linear functions • Highly parallel network of logistic function units • Minimizing the sum of squared training errors • Gives MLE estimates of the network weights if we assume zero-mean Gaussian noise on the output values • Minimizing the sum of squared errors plus squared weights (regularization) • Gives MAP estimates assuming the weight priors are zero-mean Gaussian • Gradient descent as the training procedure • How to derive your own gradient descent procedure • Discovers useful representations at the hidden units • Local minima are the greatest problem • Overfitting, regularization, early stopping
