Presentation Transcript


  1. Brief Overview of Connectionism to understand Learning. Walter Schneider, P2476 Cognitive Neuroscience of Human Learning & Instruction, http://schneider.lrdc.pitt.edu/P2476/index.htm. Slides adapted from: U. Oxford Connectionist Summer School 1998, http://hincapie.psych.purdue.edu/CSS/index.html; Hinton lectures on connectionism, http://www.cs.toronto.edu/~hinton/csc321/index.html; David Plaut, http://www.cnbc.cmu.edu/~plaut/ICM/

  2. Specific Example: NetTalk • NetTalk: Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168. • Learning input: a phonetic transcription of a child's continuous speech.
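
A minimal sketch of the forward pass of a NetTalk-style network may help fix the architecture in mind. The layer sizes below (a 7-letter window with 29 units per letter position, 80 hidden units, 26 output units coding phoneme features) follow the commonly cited description of the model; the weights, the symbol coding, and the example window are placeholders, not the trained network.

```python
# Minimal sketch of a NetTalk-shaped feedforward pass (not the original code).
# 7-letter window x 29 units per position = 203 inputs, 80 hidden units,
# 26 output units coding phoneme/stress features. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 7 * 29, 80, 26

W_ih = rng.normal(0, 0.1, (n_hidden, n_in))   # input -> hidden weights
W_ho = rng.normal(0, 0.1, (n_out, n_hidden))  # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-hot encode a 7-letter window centred on the letter to be pronounced.
window = "_hello_"
alphabet = "abcdefghijklmnopqrstuvwxyz.,_"    # 29 symbols (illustrative coding)
x = np.zeros(n_in)
for pos, ch in enumerate(window):
    x[pos * 29 + alphabet.index(ch)] = 1.0

hidden = sigmoid(W_ih @ x)       # hidden-layer activations
output = sigmoid(W_ho @ hidden)  # 26 feature activations for the centre letter
print(output.shape)              # (26,)
```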

  3. Simple Units
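
Since the rest of the deck builds on the same kind of simple unit, here is a minimal sketch of one: a weighted sum of inputs passed through a logistic squashing function. The input values, weights, and bias are illustrative, not taken from the slides.

```python
# A "simple unit": weighted sum of inputs (net input), then a squashing function.
import math

def unit_activation(inputs, weights, bias=0.0):
    net = sum(w * a for w, a in zip(weights, inputs)) + bias  # net input (Sigma)
    return 1.0 / (1.0 + math.exp(-net))                       # logistic output

print(unit_activation([1.0, 0.5], [0.8, -0.3]))  # ~0.66
```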

  4. Learning Rules Change Connection Weights • Learning rules calculate the difference between the desired (correct) output and the actual output and use that difference to change the weights so as to reduce the error.
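
A sketch of the error-driven update just described, in the spirit of the delta rule that appears later in the deck: the weight change is proportional to the difference between the desired and actual output, scaled by the input activity. The numbers are illustrative.

```python
# Error-driven weight update: delta_w is proportional to (target - actual) * input.
def delta_rule_update(w, a_in, target, actual, lr=0.1):
    error = target - actual          # supervisor's error signal
    return w + lr * error * a_in     # nudge the weight to reduce the error

w = 0.2
w = delta_rule_update(w, a_in=1.0, target=1.0, actual=0.3)
print(w)  # 0.27
```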

  5. Learning over 50,000 trials. NetTalk audio demo (http://www.cnl.salk.edu/ParallelNetsPronounce/index.php): initial babbling at 0:46 (20 sec), learning the spaces between words at 2:17 (20 s), after 10K epochs at 3:50 (20 s), transfer at 5:19 (20 s). Transfer to new words from the same speaker: 78%. Note: if we assume 200 words per hour (a welfare-household estimate) and 5 hr/day of speech, that is about 1,000 words/day, so 50,000 trials corresponds to roughly 50 days.

  6. Graceful degradation and robust processing, with fast relearning

  7. More hidden units give better performance but slower learning

  8. Unit Coding Unclear in Distributed Code

  9. Hierarchical Clustering: Sensible groupings

  10. Performance characteristics • With 120 hidden units: • 98% correct on the trained items • 75% generalization to a dictionary of 20,012 words • 85% on the first pass, rising to 90% and then to 97.5% after 55 passes • Adding 2 hidden layers of 80 units slightly improved generalization (but slowed learning): • 97% after 55 passes, 80% generalization.

  11. Summary: Supervised Learning. NetTalk – an example of back-propagation learning • Performed its computation with simple units, connection-weight matrices, and parallel activation • A learning rule provided an error signal from a supervisor to change the connection weights • It took many (on the order of 10^5) trials to reach good performance, passing through babbling to word production • Learning speed and generalization varied with the number of units and layers • Showed good generalization to related words • Developed a similarity space consistent with human clustering data • Performance was robust to loss of units and connection noise • Needed an expert teacher with the ability to reach into the brain and set the correct output states

  12. How is this like and not like human learning? • Similar: • Lots of trials • Babbling for a while before the output makes sense • Ability to learn any language (e.g., Dutch) • Generalization to new words • Creates similarity spaces • Dissimilar: • The teacher shows exact correctness by activating the correct output units • Used DECtalk output, allowing only correct, simple output • Very simple network with a small number of units • Sequential presentation of targets • Learning reading rather than babbling/speech • Accuracy does not reach human levels • Unlikely to be biologically implementable (high-precision connections, back-propagating precise error signals across levels) • Does not learn from instruction, only from experience

  13. Switch to Contrastive Hebbian Learning

  14. Some Fundamental Concepts • Parallel Processing • Distributed Representations • Learning (multiple types) • Generalisation • Graceful Degradation

  15. Genres of Network Architecture

  16. Introduction to Neural Computation • Simplified neuron [Figure: input connections, cell body (Σ, threshold θ), output connections] • A layered neural network [Figure: a layer of input neurons connected to a layer of output neurons]

  17. Introduction to Neural Computation • A single output neuron [Figure: one output neuron receiving connections from a set of input neurons]

  18. The Mapping Principle • Patterns of Activity: An input pattern is transformed to an output pattern. • Activation States are Vectors: Each pattern of activity can be considered a unique point in a space of states; the activation vector identifies this point in space. • Mapping Functions: T = F(S). The network maps a source space S (the network inputs) to a target space T (the outputs). The mapping function F is most likely complex; no simple mathematical formula can capture it explicitly (see the sketch below). • Hyperspace: Input states generally have high dimensionality, so most network states are considered to populate hyperspace.
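
A sketch of the mapping T = F(S) as it is usually realised in these networks: a weight matrix projects a point in the source space onto the target space, followed by a squashing function. The matrix and vector values below are arbitrary illustrative choices.

```python
# The mapping principle T = F(S): an input activation vector is mapped to an
# output vector through a weight matrix and a squashing function.
import numpy as np

def F(S, W):
    return 1.0 / (1.0 + np.exp(-(W @ S)))   # one point in target space T

W = np.array([[ 0.5, -0.2,  0.1],
              [-0.4,  0.3,  0.8]])          # maps a 3-D source space to a 2-D target space
S = np.array([1.0, 0.0, 1.0])               # one point in source space S
print(F(S, W))                              # the corresponding point in T
```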

  19. The Principle of Superposition [Figure: Matrix 1 and Matrix 2 are added to form a composite matrix]
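
A sketch of superposition, assuming the composite matrix is formed by simply adding two Hebbian outer-product associations. With orthogonal input patterns, both associations are still recalled exactly; the patterns below are illustrative.

```python
# Superposition: two input-output associations stored as outer products and
# added into one composite weight matrix; orthogonal inputs recall both exactly.
import numpy as np

p1, t1 = np.array([1, 0, 0, 0.]), np.array([1, 0, 1.])   # association 1
p2, t2 = np.array([0, 1, 0, 0.]), np.array([0, 1, 1.])   # association 2

M1 = np.outer(t1, p1)        # Matrix 1
M2 = np.outer(t2, p2)        # Matrix 2
M  = M1 + M2                 # Composite matrix

print(M @ p1)                # recovers t1: [1. 0. 1.]
print(M @ p2)                # recovers t2: [0. 1. 1.]
```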

  20. Hebbian Learning • Cellular Association: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (Hebb 1949, p. 50) • Learning Connections: Take the product of the excitation of the two cells and change the value of the connection in proportion to this product. • The Learning Rule: Δw = ε · a_in · a_out, where ε is the learning rate. • Changing Connections: If a_in = 0.5, a_out = 0.75, and ε = 0.5, then Δw = 0.5(0.75)(0.5) = 0.1875; and if w_start = 0.0, then w_next = 0.1875. • Calculating Correlations
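
The learning rule and the worked example above can be reproduced directly; the sketch below simply restates Δw = ε · a_in · a_out in code.

```python
# Hebbian update: weight change is the product of pre- and post-synaptic
# activity, scaled by the learning rate epsilon.
def hebb_update(w, a_in, a_out, epsilon):
    return w + epsilon * a_in * a_out

w = hebb_update(0.0, a_in=0.5, a_out=0.75, epsilon=0.5)
print(w)  # 0.1875, matching the worked example above
```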

  21. Models of English past tense • PDP accounts • Single homogeneous architecture • Superposition • Competition between different verb types results in overregularisation and irregularisation • Vocabulary discontinuity • Rumelhart & McClelland 1986

  22. Using an Error Signal • Perceptron Convergence Rule: Learning in a single-weight network. Assume a teacher signal t_out. Adaptation of connection and threshold (Rosenblatt 1958). Note that the threshold always changes if the output is incorrect; blame is apportioned to a connection in proportion to the activity of the input line. • Orthogonality Constraint: The number of patterns is limited by the dimensionality of the network. Input patterns must be orthogonal to each other. Similarity effects.

  23. Using an Error Signal • Perceptron Convergence Rule: "The perceptron convergence rule guarantees to find a solution to a mapping problem, provided a solution exists." (Minsky & Papert 1969) • An Example of Perceptron Learning: Boolean OR – training the network (see the sketch below).
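
A minimal sketch of perceptron learning on Boolean OR under the rule described above: when the output is wrong, each weight changes in proportion to its input activity and the threshold also moves. The starting weights, threshold, and learning rate are illustrative choices, not taken from the slides.

```python
# Perceptron convergence sketch for Boolean OR.
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # Boolean OR

w = [0.0, 0.0]   # connection weights (illustrative start)
theta = 0.5      # threshold
lr = 0.25        # learning rate

for epoch in range(20):
    errors = 0
    for x, target in patterns:
        out = 1 if (w[0] * x[0] + w[1] * x[1]) >= theta else 0
        err = target - out
        if err != 0:
            errors += 1
            w = [w[i] + lr * err * x[i] for i in range(2)]  # blame by input activity
            theta -= lr * err                               # threshold moves on every error
    if errors == 0:   # no errors in a full pass: a solution has been found
        break

print(w, theta)   # a separating line for OR, e.g. [0.25, 0.25] 0.25
```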

  24. Gradient Descent • Calculating the Error Signal: Note that Perceptron Convergence and LMS use similar learning algorithms – the Delta Rule. • Error Landscapes: Gradient-descent algorithms adapt by moving downhill in a multi-dimensional landscape – the error surface (the ball-bearing analogy). In a smooth landscape the bottom will always be reached; however, the bottom may not correspond to zero error. • Least Mean Square Error (LMS): Define the error measure as the square of the discrepancy between the actual output and the desired output (Widrow-Hoff 1960). • Plot an error curve for a single-weight network. • Make weight adjustments by performing gradient descent – always move down the slope (see the sketch below).
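
A sketch of gradient descent on the LMS error of a single-weight linear unit, matching the "error curve for a single-weight network" bullet. The training pair, starting weight, and step size are illustrative.

```python
# Gradient descent on the squared error of a single-weight linear unit:
# E = 0.5 * (target - w * x)^2, with w moved down the slope dE/dw each step.
x, target = 1.0, 0.8      # one input-target pair (illustrative)
w, lr = -0.5, 0.3         # starting weight and learning rate

for step in range(10):
    out = w * x
    error = 0.5 * (target - out) ** 2
    grad = -(target - out) * x          # dE/dw
    w -= lr * grad                      # move downhill on the error surface
    print(f"step {step}: w={w:.3f} error={error:.4f}")
```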

  25. Past Tense Revisited • Vocabulary Discontinuity • Up to 10 epochs – 8 irregulars + 2 regulars. Thereafter – 420 verbs – mostly regular. • Justification: Irregulars are more frequent than regulars • Lack of Evidence • Vocabulary spurt at 2 years whereas overregularizations occur at 3 years. Furthermore, vocabulary spurt consists mostly of nouns. • Pinker and Prince (1988) show that regulars and irregulars are relatively balanced in early productive vocabularies

  26. Longitudinal evidence • Stages or phases in development? • Initial error-free performance. • Protracted period of overregularisation, but at low rates (typically < 10%). • Gradual recovery from error. • The rate of overregularisation is much lower than the rate of regularisation of regular verbs. (1992)

  27. Longitudinal evidence • Error Characteristics • High frequency irregulars are robust to overregularisation. • Some errors seem to be phonologically conditioned. • Irregularisations.

  28. Single system account • Multi-layered Perceptrons • Hidden unit representation • Error correction technique • Plunkett & Marchman 1991 • Type/Token distinction • Continuous training set

  29. Single system account • Incremental Vocabularies • Plunkett & Marchman (1993) • Initial small training set • Gradual expansion • Overregularisation • Initial error-free performance. • Protracted period of overregularisation but at low rates (typically < 5%). • High frequency irregulars are robust to overregularisation.

  38. Linear Separability • Boolean AND, OR and XOR • Partitioning Problem Space

  39. Internal Representations • Multi-layered Perceptrons: Solving XOR [Figure: two input units, two hidden units, and one output unit; input-to-hidden weights of +1.0 and -1.0, hidden-to-output weights of 1.0, threshold θ = 1 for every unit; the four input patterns (0,0), (0,1), (1,0), (1,1)] (see the sketch below) • Representing Similarity Relations: Hidden units transform the input.
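
One consistent reading of the weights in the figure is the classic hand-wired XOR network: each hidden unit detects one of the two "mismatch" inputs via +1.0/-1.0 weights, the output unit sums the hidden units with weights of 1.0, and every unit fires when its net input reaches θ = 1. The sketch below checks that this reading does compute XOR.

```python
# Hand-wired XOR network, assuming a unit fires when its net input >= theta = 1.
def step(net, theta=1.0):
    return 1 if net >= theta else 0

def xor_net(x1, x2):
    hA = step( 1.0 * x1 + -1.0 * x2)   # fires only for input (1, 0)
    hB = step(-1.0 * x1 +  1.0 * x2)   # fires only for input (0, 1)
    return step(1.0 * hA + 1.0 * hB)   # fires if either hidden unit fires

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))   # 0, 1, 1, 0
```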

  40. Back Propagation [Figure: an error surface over weight values, showing a local and a global minimum] • Assignment of Blame to Hidden Units • Local Minima • Activation Functions
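
A compact, generic back-propagation sketch (not the course's code): a small 2-4-1 sigmoid network trained on XOR, showing how the output error is passed back through the weights to assign blame to the hidden units. The network size, learning rate, and number of epochs are illustrative choices.

```python
# Back-propagation sketch: a 2-4-1 sigmoid network trained on XOR by
# passing the output error backwards through the weights.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1.0, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1.0, (4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10000):
    # forward pass
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # backward pass: output error, then blame assigned to the hidden units
    dY = (Y - T) * Y * (1 - Y)            # delta at the output layer
    dH = (dY @ W2.T) * H * (1 - H)        # error propagated back to the hidden layer
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

print(np.round(Y.ravel(), 2))   # typically ends up close to the XOR targets 0, 1, 1, 0
```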

  41. Learning Hierarchical Relations • Hinton Diagrams – Unit 1: Nationality; Unit 2: Generation; Unit 3: Branch of Tree • Isomorphic Family Trees • Family Tree Network
