
Feed-Forward Artificial Neural Networks


Presentation Transcript


  1. Feed-Forward Artificial Neural Networks. AMIA 2003 Machine Learning Tutorial. Constantin F. Aliferis & Ioannis Tsamardinos, Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University

  2. Binary Classification Example • Example: classification of breast cancer as malignant or benign from mammograms • Predictor 1: lump thickness • Predictor 2: single epithelial cell size [Figure: training instances plotted with the value of predictor 1 on the horizontal axis and the value of predictor 2 on the vertical axis]

  3. Possible Decision Areas [Figure: the predictor 1 / predictor 2 plane divided into two class areas, one for the green triangles and one for the red circles]

  4. Possible Decision Areas [Figure: another possible partition of the predictor 1 / predictor 2 plane into class areas]

  5. Possible Decision Areas [Figure: a third possible partition of the predictor 1 / predictor 2 plane into class areas]

  6. Binary Classification Example • The simplest non-trivial decision function is the straight line (in general, a hyperplane) • One decision surface • The decision surface partitions the space into two subspaces [Figure: a straight line separating the two classes in the predictor 1 / predictor 2 plane]

  7. Specifying a Line • Line equation: w2x2 + w1x1 + w0 = 0 • Classifier: • If w2x2 + w1x1 + w0 > 0: output one class (e.g., malignant) • Else: output the other class (e.g., benign) [Figure: the line w2x2 + w1x1 + w0 = 0 in the (x1, x2) plane; points on one side satisfy w2x2 + w1x1 + w0 > 0 and points on the other side satisfy w2x2 + w1x1 + w0 < 0]

  8. Classifying with Linear Surfaces • Use 1 for one class (the region where w2x2 + w1x1 + w0 > 0) • Use -1 for the other class (the region where w2x2 + w1x1 + w0 < 0) • The classifier becomes sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always [Figure: the same line, with the two half-planes labeled 1 and -1]
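A minimal sketch of this classifier in Python (the weight and input values below are illustrative, not taken from the slides):

    import numpy as np

    def sgn(z):
        # The two-class output convention from the slide: 1 or -1.
        return 1 if z >= 0 else -1

    def linear_classify(x, w):
        # x: input vector with the constant x0 = 1 as its first element;
        # w: one weight per input, with w[0] playing the role of w0.
        return sgn(np.dot(w, x))

    w = np.array([2.0, 4.0, -2.0])   # illustrative w0, w1, w2
    x = np.array([1.0, 0.5, 3.0])    # x0 = 1, then the two predictor values
    print(linear_classify(x, w))     # -> -1 for this particular x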

  9. The Perceptron • Input: x0, x1, x2, x3, x4, the attributes of the patient to classify (x0 = 1 always) • Weights: w4 = 3, w3 = 0, w2 = -2, w1 = 4, w0 = 2 • Output: the classification of the patient (malignant or benign)

  10. The Perceptron • An input instance is presented: x4 = 2, x3 = 3.1, x2 = 4, x1 = 0, x0 = 1 • Output: the classification of the patient (malignant or benign)

  11. The Perceptron • The unit computes the weighted sum of its inputs, Σi wi xi • Output: the classification of the patient (malignant or benign)

  12. The Perceptron • The sgn transfer function is applied to the weighted sum: sgn(3) = 1 • Output: the classification of the patient (malignant or benign)

  13. The Perceptron • The final output of the perceptron is 1: the patient is assigned to the class encoded as 1

  14. Learning with Perceptrons • Hypothesis space: the set of all possible weight vectors w (i.e., all linear decision surfaces) • Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function) • Search method: perceptron training rule, gradient descent, etc. • Remember: the problem is to find “good” weights

  15. Training Perceptrons • In the example, the perceptron outputs sgn(3) = 1 but the true output is -1 • Start with random weights • Update them in an intelligent way to improve them • Intuitively (here the sum is too large): • Decrease the weights that increase the sum • Increase the weights that decrease the sum • Repeat for all training instances until convergence

  16. Perceptron Training Rule • t: the output the current example should have • o: the output of the perceptron • xi: the value of predictor variable i • Rule: Δwi = η (t - o) xi, where η is a small positive constant (the learning rate) • t = o: no change (for correctly classified examples)

  17. Perceptron Training Rule • t=-1, o=1 : sum will decrease • t=1, o=-1 : sum will increase

  18. Perceptron Training Rule • In vector form: w ← w + η (t - o) x
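A sketch of a single application of this rule in Python, assuming the ±1 sgn classifier above; the learning rate eta and the numbers are illustrative:

    import numpy as np

    def perceptron_update(w, x, t, eta=0.1):
        # One step of the rule: w <- w + eta * (t - o) * x.
        o = 1 if np.dot(w, x) >= 0 else -1    # current perceptron output
        return w + eta * (t - o) * x          # unchanged when t == o

    w = np.array([0.5, -0.3, 0.2])            # illustrative weights (w0, w1, w2)
    x = np.array([1.0, 2.0, -1.0])            # x0 = 1, then the predictor values
    w = perceptron_update(w, x, t=1)          # example whose true output is 1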

  19. Example of Perceptron Training: The OR function [Figure: the four OR training examples in the (x1, x2) plane; (0, 0) is labeled -1, and (0, 1), (1, 0), (1, 1) are labeled 1]

  20. Example of Perceptron Training: The OR function • Initial random weights define the line x1 = 0.5 [Figure: the four OR examples with the line x1 = 0.5]

  21. Example of Perceptron Training: The OR function • The initial weights define the line x1 = 0.5 [Figure: the classifier outputs -1 in the area to the left of the line (x1 < 0.5) and 1 in the area to its right]

  22. Example of Perceptron Training: The OR function • The only misclassified example is x1 = 0, x2 = 1 (a positive example placed in the -1 area) [Figure: the misclassified point highlighted next to the line x1 = 0.5]

  23. Example of Perceptron Training: The OR function • The training rule is applied to the misclassified example x1 = 0, x2 = 1 to update the weights

  24. Example of Perceptron Training: The OR function • The updated weights define a new line: for x2 = 0 it crosses x1 = -0.5, and for x1 = 0 it crosses x2 = -0.5, so the new line is x1 + x2 = -0.5 [Figure: the old line x1 = 0.5 and the new line through (-0.5, 0) and (0, -0.5)]

  25. Example of Perceptron Training: The OR function [Figure: the new decision line, crossing the axes at x1 = -0.5 (for x2 = 0) and x2 = -0.5 (for x1 = 0)]

  26. Example of Perceptron Training: The OR function • Next iteration: one example is still misclassified, so the rule is applied again [Figure: the current line with the misclassified example highlighted]

  27. Example of Perceptron Training: The OR function • Perfect classification: all four examples are now classified correctly, so no change occurs in subsequent iterations [Figure: the final decision line separating the negative example (0, 0) from the three positive examples]

  28. Analysis of the Perceptron Training Rule • The algorithm will always converge within a finite number of iterations if the data are linearly separable • Otherwise, it may oscillate indefinitely (no convergence)
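A sketch of the full training loop in Python on the OR data from the example (the learning rate, the epoch limit, and starting from zero weights are illustrative choices); because OR is linearly separable, the loop stops once every example is classified correctly:

    import numpy as np

    # OR training set with x0 = 1 prepended to each instance; targets in {-1, 1}.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, 1])

    def train_perceptron(X, t, eta=0.1, max_epochs=100):
        w = np.zeros(X.shape[1])                  # could also start from small random weights
        for _ in range(max_epochs):
            errors = 0
            for x, target in zip(X, t):
                o = 1 if np.dot(w, x) >= 0 else -1
                if o != target:
                    w += eta * (target - o) * x   # perceptron training rule
                    errors += 1
            if errors == 0:                       # every example classified correctly
                break
        return w

    print(train_perceptron(X, t))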

  29. Training by Gradient Descent • Similar but: • Always converges • Generalizes to training networks of perceptrons (neural networks) • Idea: • Define an error function • Search for weights that minimize the error, i.e., find weights that zero the error gradient

  30. Setting Up the Gradient Descent • Mean squared error over the training examples d: E[w] = ½ Σd (td - od)² • Minima exist where the gradient is zero: ∂E/∂wi = 0 for every weight wi

  31. The Sign Function is not Differentiable [Figure: the OR examples and the line x1 = 0.5; the sgn output jumps discontinuously from -1 to 1 across the line]

  32. Use Differentiable Transfer Functions • Replace the sgn transfer function • with the sigmoid: σ(net) = 1 / (1 + e^(-net))
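A sketch of the sigmoid transfer function and a sigmoid unit in Python (weights and inputs are illustrative):

    import numpy as np

    def sigmoid(net):
        # Smooth, differentiable replacement for sgn; output lies in (0, 1).
        return 1.0 / (1.0 + np.exp(-net))

    def sigmoid_unit(x, w):
        # Same weighted sum as the perceptron, but with a differentiable output.
        return sigmoid(np.dot(w, x))

    w = np.array([0.5, -1.0, 2.0])   # illustrative weights (w0 is the bias)
    x = np.array([1.0, 0.2, 0.7])    # x0 = 1, then the predictor values
    print(sigmoid_unit(x, w))        # a value strictly between 0 and 1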

  33. Calculating the Gradient
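The formulas on this slide are not preserved in the transcript; the standard derivation for a single sigmoid unit with the squared error above (an assumption consistent with the rest of the tutorial, not a quotation of the slide) gives:

    E[\mathbf{w}] = \tfrac{1}{2} \sum_{d} (t_d - o_d)^2, \qquad o_d = \sigma(\mathbf{w} \cdot \mathbf{x}_d)

    \frac{\partial E}{\partial w_i}
      = -\sum_{d} (t_d - o_d)\, \sigma'(\mathbf{w} \cdot \mathbf{x}_d)\, x_{i,d}
      = -\sum_{d} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}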

  34. Updating the Weights with Gradient Descent • Update rule: wi ← wi - η ∂E/∂wi • Each weight update goes through all training instances • Each weight update is more expensive but more accurate • Always converges to a local minimum, regardless of the data
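A minimal batch gradient descent sketch in Python for a single sigmoid unit, using the gradient above; the data, learning rate, and epoch count are illustrative (here the OR function with 0/1 targets):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def train_sigmoid_unit(X, t, eta=0.5, epochs=1000):
        # X: one row per training instance, with x0 = 1 in the first column.
        # t: target outputs in [0, 1].
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = sigmoid(X @ w)                       # outputs for all instances
            grad = -(X.T @ ((t - o) * o * (1 - o)))  # gradient of E = 1/2 * sum (t - o)^2
            w -= eta * grad                          # one batch update per epoch
        return w

    # Illustrative data: the OR function with 0/1 targets.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([0.0, 1.0, 1.0, 1.0])
    print(sigmoid(X @ train_sigmoid_unit(X, t)))     # outputs move toward the targets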

  35. Multiclass Problems • E.g., 4 classes • Two possible solutions: • Use one perceptron (output unit) and interpret the results as follows: • Output 0 to 0.25: class 1 • Output 0.25 to 0.5: class 2 • Output 0.5 to 0.75: class 3 • Output 0.75 to 1: class 4 • Use four output units and a binary encoding of the output (1-of-M encoding)
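A sketch of the first option in Python, mapping a single output in [0, 1] to one of the four classes by the intervals on the slide (the function name is made up):

    def decode_single_output(o):
        # Interpret one output unit's value in [0, 1] as one of 4 classes.
        if o < 0.25:
            return 1
        elif o < 0.5:
            return 2
        elif o < 0.75:
            return 3
        else:
            return 4

    print(decode_single_output(0.62))   # -> 3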

  36. 1-of-M Encoding • Assign the instance to the class whose output unit produces the largest output [Figure: a network with inputs x0 to x4 and four output units, one for each of Class 1 to Class 4]
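A sketch of 1-of-M decoding in Python, assuming one weight vector (one output unit) per class; the weight matrix here is random and purely illustrative:

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def classify_one_of_m(x, W):
        # W has one row of weights per output unit (one per class).
        outputs = sigmoid(W @ x)             # one output per class
        return int(np.argmax(outputs)) + 1   # assign to the class with the largest output

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 5))          # 4 classes, inputs x0..x4
    x = np.array([1.0, 0.0, 4.0, 3.1, 2.0])  # x0 = 1, then four attribute values
    print(classify_one_of_m(x, W))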

  37. Feed-Forward Neural Networks [Figure: a layered network with an input layer (x0 to x4), Hidden Layer 1, Hidden Layer 2, and an output layer; each layer feeds forward into the next]
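A sketch of a forward pass through such a layered network in Python; the layer sizes and random weights are illustrative only:

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def forward(x, weights):
        # weights: list of matrices, one per layer; each layer's output
        # (with a constant 1 prepended as its x0) feeds the next layer.
        a = x
        for W in weights:
            a = sigmoid(W @ np.concatenate(([1.0], a)))
        return a

    # Illustrative architecture: 4 inputs -> 3 hidden -> 2 hidden -> 1 output.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((3, 5)),   # hidden layer 1
               rng.standard_normal((2, 4)),   # hidden layer 2
               rng.standard_normal((1, 3))]   # output layer
    x = np.array([0.0, 4.0, 3.1, 2.0])        # attribute values (x0 = 1 added inside)
    print(forward(x, weights))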

  38. Increased Expressiveness Example: Exclusive OR • No line (no set of three weights w0, w1, w2) can separate the training examples, i.e., learn the true function [Figure: the four XOR examples in the (x1, x2) plane, with (0, 0) and (1, 1) labeled -1 and (0, 1) and (1, 0) labeled 1, above a single unit with inputs x0, x1, x2]
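A quick numeric illustration of this claim in Python (a search over a coarse weight grid, not a proof): no set of three weights classifies all four XOR examples correctly.

    import numpy as np

    # XOR examples with x0 = 1 prepended; targets in {-1, 1}.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, -1])

    def sgn(z):
        return np.where(z >= 0, 1, -1)

    grid = np.linspace(-2, 2, 41)
    found = any(np.all(sgn(X @ np.array([w0, w1, w2])) == t)
                for w0 in grid for w1 in grid for w2 in grid)
    print(found)   # False: no (w0, w1, w2) in the grid separates the examples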

  39. Increased Expressiveness Example [Figure: the XOR plot next to a two-layer network: inputs x0, x1, x2 feed two hidden units through weights w0,1 ... w2,2, and the hidden units feed the output unit through weights w'1,1 and w'2,1]

  40. Increased Expressiveness Example [Figure: the same two-layer network with concrete weight values filled in for the hidden units H1, H2 and the output unit O]

  41. Increased Expressiveness Example [Figure: the two-layer network with its weights, ready to propagate the XOR training examples through H1 and H2 to the output O]

  42. Increased Expressiveness Example [Figure: the first example, x1 = 0, x2 = 0 (with x0 = 1), is presented at the input layer]

  43. Increased Expressiveness Example [Figure: the input x1 = 0, x2 = 0 is propagated through the hidden units to the output unit, which outputs -1]

  44. Increased Expressiveness Example [Figure: the next example, x1 = 1, x2 = 0, is presented and propagated through the network, which outputs 1]

  45. Increased Expressiveness Example [Figure: the example x1 = 0, x2 = 1 is presented and propagated through the network, which outputs 1]

  46. Increased Expressiveness Example [Figure: the last example, x1 = 1, x2 = 1, is presented at the input layer]

  47. Increased Expressiveness Example [Figure: the hidden units H1 and H2 compute their outputs for x1 = 1, x2 = 1]

  48. Increased Expressiveness Example [Figure: the output unit produces -1 for x1 = 1, x2 = 1; the network classifies all four XOR examples correctly]

  49. From the Viewpoint of the Output Layer [Figure: the training examples T1, T2, T3, T4 in the original (x1, x2) space, and the same examples after being mapped by the hidden layer into the (H1, H2) space seen by the output unit O]

  50. From the Viewpoint of the Output Layer • Each hidden layer maps the instances to a new instance space • Each hidden node is a newly constructed feature • The original problem may become separable (or easier) in the new space [Figure: the examples T1 to T4 in the original (x1, x2) space and their images in the (H1, H2) space]
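A sketch of this idea in Python for the XOR example. The hidden-unit weights below are a common textbook choice (roughly an OR unit and a NAND unit), not necessarily the values on the slides; they show how the hidden layer remaps the examples so that a single output unit can separate them.

    import numpy as np

    def sgn(z):
        return np.where(z >= 0, 1, -1)

    # XOR examples with x0 = 1 prepended; targets in {-1, 1}.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, 1, 1, -1])

    # Illustrative hidden units: H1 behaves like OR(x1, x2), H2 like NAND(x1, x2).
    W_hidden = np.array([[-0.5,  1.0,  1.0],   # H1: sgn(x1 + x2 - 0.5)
                         [ 1.5, -1.0, -1.0]])  # H2: sgn(1.5 - x1 - x2)
    H = sgn(X @ W_hidden.T)                    # each example mapped to (H1, H2)

    # In the new (H1, H2) space, one line (an AND of H1 and H2) separates the classes.
    H_aug = np.hstack([np.ones((4, 1)), H])    # prepend the constant input
    w_out = np.array([-1.5, 1.0, 1.0])
    print(sgn(H_aug @ w_out))                  # [-1  1  1 -1], the XOR targets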
