Learning III

Learning III Introduction to Artificial Intelligence CS440/ECE448 Lecture 22 Next Lecture: Roxana Girju, Natural Language Processing New Homework Out Today

Last lecture • Identification trees This lecture • Neural networks: Perceptrons Reading • Chapter 20

Sunburn data There are 3 x 3 x 3 x 2 = 54 possible feature vectors

An ID tree consistent with the data Hair Color Blond Brown Red Alex Pete John Emily Lotion Used Yes No Sarah Annie Dana Katie Sunburned Not Sunburned

Another consistent ID tree Height Sunburned Not Sunburned Tall Dana Pete Short Average Hair Color Suit Color Brown Red Red Blond Yellow Blue Alex Hair Color Suit Color Sarah Red Blue Red Yellow Brown Annie Emily Katie John

An ideaSelect tests that divide as well as possible people into sets with homogeneous labels Hair Color Lotion used No Blond Yes Red Brown Sarah Annie Emily Pete John Sarah Annie Dana Katie Dana Alex Katie Alex Pete John Emily Height Suit Color

Problem: • For practical problems, it is unlikely that any test will produce one completely homogeneous subset. • Solution: • Minimize a measure of inhomogeneity or disorder. • Available from information theory.

Information • Let’s say we have a question which has n possible answers and call them vi. • Let’s say that answer vi occurs with probability P(vi), then the information content (entropy) measured in bits of knowing the answer is: • One bit of information is enough information to answer a yes or no question. • E.g. consider flipping a fair coin, how much information do you have if you know which side comes up? I(½, ½) = - (½ log2½ + ½ log2½) = 1bit

Information at a node • In our decision tree for a given feature (e.g. hair color), we have • b: number of branches (e.g. possible values for the feature) • Nb: number of samples in branch • Np: number of samples in all branches • Nbc: number of samples in class c in branch b. • Using frequencies as an estimate of the probabilities, we have • For a single branch, the information is simply

What is the amount of information required for classification after we have used the hair test? Hair Color Blond Red Brown Sarah Annie Dana Katie Alex Pete John Emily -1 log21 -0 log20 = 0 - 0 log20 - 3/3 log23/3 = 0 - 2/4 log22/4 - 2/4 log22/4 = 1 Information = 4/8*1 + 1/8*0 + 3/8*0 = 0.5

Hair Color Blond Red Brown Sarah Annie Dana Katie Alex Pete John Emily Selecting top level feature • Using the 8 samples we have so far, we get: TestInformation Hair 0.5 Height 0.69 Suit Color 0.94 Lotion 0.61 • Hair wins, least additional information needed for rest of classification. • This is used to build the first level of the identification tree:

Hair Color Blond Red Brown Sarah Annie Dana Katie Alex Pete John Emily Selecting second level feature • Let’s consider the remaining features for the blond branch (4 samples) TestInformation Height 0.5 Suit Color 1 Lotion 0 • Lotion wins, least additional information.

Thus we get to the tree we had arrived at earlier Hair Color Blond Brown Red Alex Pete John Emily Lotion Used Yes No Sarah Annie Dana Katie Sunburned Not Sunburned

Using the Identification tree as a classification procedure Hair Color Red Brown Blond Lotion Used OK Sunburn Yes No Sunburn OK Rules: • If Blond and uses lotion, then OK • If Blond and does not use lotion, then gets burned • If red-haired, then gets burned • If brown hair, then OK

Performance measurement How do we know that h ≈ f ? • Use theorems of computational/statistical learning theory • Try h on a new test set of examples (use same distribution over example space as training set) Learning curve = % correct on test set as a function of training set size

Neural Networks

What & Why • Artificial Neural Networks: a bottom-up attempt to model the functionality of the brain. • Two main areas of activity: • Biological: Try to model biological neural systems. • Computational: • Artificial neural networks are biologically inspired but not necessarily biologically plausible. • So may use other terms: Connectionism, Parallel Distributed Processing, Adaptive Systems Theory. • Interests in neural networks differ according to profession.

Some Applications NETTALK vs. DECTALK DECTALK is a system developed by DEC which reads English characters and produces, with a 95% accuracy, the correct pronunciation for an input word or text pattern. • DECTALK is an expert system which took 20 years to finalize. • It uses a list of pronunciation rules and a large dictionary of exceptions. NETTALK, a neural network version of DECTALK, was constructed over one summer vacation. • After 16 hours of training the system could read a 100-word sample text with 98% accuracy! • When trained with 15,000 words it achieved 86% accuracy on a test set. NETTALK is an example of one of the advantages of the neural network approach over the symbolic approach to Artificial Intelligence. • It is difficult to simulate the learning process in a symbolic system; rules and exceptions must be known. • On the other hand, neural systems exhibit learning very clearly; the network learns by example.

LeNet-5Convolutional Neural NetworksYann LeCun http://yann.lecun.com/exdb/lenet/index.html

Human Brain & Connectionist Model Consider humans: • Neuron switching time .001 second • Number of neurons 1010 • Connections per neuron 104 • Scene recognition time .1 second • 100 inference steps doesn’t look like enough - massively parallel computation Properties of ANNs: • Many neuron-like switching units • Many weighted interconnections among units • Highly parallel, distributed process • Emphasis on tuning weights automatically

Artificial neurons: Perceptrons Linear, threshold units Xi : inputs Wi :weights  : threshold 0

0 Artificial neurons: Perceptrons The threshold can be easily forced to 0 by introducing an additional weight input w0 = . w0 x0=-1

How powerful is a perceptron?

Concept Space & Linear Separability

Training Perceptrons

Artifical Neurons

Single-layer perceptrons

Activation functions

Gradient descent for perceptron learning

Learning III

Learning III

Presentation Transcript

Focus on Learning Title III

Latent Tree Models Part III: Learning Algorithms

TLC – The Teaching Learning Circle - Part III

English III Service Learning Project

III. Individual Differences in Second Language Learning

iii

Lecture 6. Learning (III): Associative Learning and Principal Component Analysis

Part III Learning structured representations Hierarchical Bayesian models

English III online learning

III. UNIT III

III. UNIT III

III. UNIT III

III. Demonstrated skills in teaching and learning...

Chapter 4 Learning (III)

III. UNIT III

Session III: International Service Learning (ISL)

III

Lecture III – Theories and Strategies for Active Learning

Feeding III: Specialized Structures, Learning

Part III: Theory of Learning

Session III: International Service Learning (ISL)