Donald “Godel” Rumsfeld

Winner of the 2003 Foot in Mouth Award
“Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns, there are things we know we know,” Rumsfeld said. “We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.”
• Rumsfeld talking about the reported lack of WMDs in Iraq (News Conference, April 2003)
“We think we know what he means,” said Plain English Campaign spokesman John Lister. “But we don't know if we really know.”

12/2 • Decisions.. Decisions • Vote on final • In-class • (16th 2:40pm) • OR Take-home • (will be due by 16th) • Clarification on HW5 • Participation survey

Learning Dimensions
What can be learned?
--Any of the boxes representing the agent’s knowledge
--Action descriptions, effect probabilities, causal relations in the world (and the probabilities of causation), utility models (sort of, through credit assignment), sensor data interpretation models
What feedback is available?
--Supervised, unsupervised, “reinforcement” learning
--Credit assignment problem
What prior knowledge is available?
--“Tabula rasa” (the agent’s head is a blank slate) or pre-existing knowledge

Inductive Learning (Classification Learning)
• Given a set of labeled examples, and a space of hypotheses
  • Find the rule that underlies the labeling
  • (so you can use it to predict future unlabeled examples)
  • Tabula rasa, fully supervised
• Idea:
  • Loop through all hypotheses
  • Rank each hypothesis in terms of its match to data
  • Pick the best hypothesis
Closely related to:
* Function learning or curve-fitting (regression)

A classification learning example
Predicting when Russell will wait for a table
--similar to predicting credit card fraud, or predicting when people are likely to respond to junk mail

Ranking hypotheses
A good hypothesis will have the fewest false positives (Fh+) and the fewest false negatives (Fh-) [ideally, we want both to be zero]
--False positive: the hypothesis classifies the example as +ve, but it is actually -ve
--False negative: the hypothesis classifies the example as -ve, but it is actually +ve
Rank(h) = f(Fh+, Fh-)
--f depends on the domain
--in a medical domain, false negatives are penalized more
--in a junk-mailing domain, false negatives are penalized less
H1: Russell waits only in Italian restaurants
--False +ves: X10; False -ves: X1, X3, X4, X8, X12
H2: Russell waits only in cheap French restaurants
--False +ves: (none listed); False -ves: X1, X3, X4, X6, X8, X12
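The ranking idea can be sketched in a few lines of Python. This is only an illustration: the toy examples, the hypothesis names, and the choice of f as a weighted sum of the two error counts are all assumptions, not the Russell restaurant table or the course's own f.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]
Labeled = Tuple[Example, bool]
Hypothesis = Callable[[Example], bool]


def count_errors(h: Hypothesis, data: List[Labeled]) -> Tuple[int, int]:
    """Return (false positives, false negatives) of h on the labeled data."""
    fp = sum(1 for x, label in data if h(x) and not label)
    fn = sum(1 for x, label in data if not h(x) and label)
    return fp, fn


def rank(h: Hypothesis, data: List[Labeled],
         fp_cost: float = 1.0, fn_cost: float = 1.0) -> float:
    """One possible f(Fh+, Fh-): a weighted error count (lower is better)."""
    fp, fn = count_errors(h, data)
    return fp_cost * fp + fn_cost * fn


def best_hypothesis(hypotheses: Dict[str, Hypothesis], data: List[Labeled],
                    fp_cost: float = 1.0, fn_cost: float = 1.0) -> str:
    """Loop through all hypotheses, rank each one, pick the best."""
    return min(hypotheses,
               key=lambda name: rank(hypotheses[name], data, fp_cost, fn_cost))


# Hypothetical toy data (NOT the Russell table): (attributes, did he wait?)
data: List[Labeled] = [
    ({"type": "italian", "price": "cheap"}, True),
    ({"type": "italian", "price": "expensive"}, False),
    ({"type": "french", "price": "cheap"}, True),
    ({"type": "thai", "price": "cheap"}, True),
    ({"type": "french", "price": "expensive"}, False),
]

hypotheses: Dict[str, Hypothesis] = {
    "italian": lambda x: x["type"] == "italian",
    "cheap": lambda x: x["price"] == "cheap",
    "cheap_french": lambda x: x["type"] == "french" and x["price"] == "cheap",
}

winner = best_hypothesis(hypotheses, data)
```

Raising `fn_cost` mimics the medical domain (a missed positive is expensive); lowering it mimics the junk-mail domain.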

Learning Curves: when do you know you have learned the concept well?
The ideal: you can classify all new instances (test cases) correctly, always
--Always? Maybe the training samples are not completely representative of the test samples. So, we go with “probably”
--Correctly? May be impossible if the training data has noise (the teacher may make mistakes too). So, we go with “approximately”
The goal of a learner, then, is to produce a probably approximately correct (PAC) hypothesis, for a given approximation (error rate) e and probability d.
When is a learner A better than learner B?
--For the same e, d bounds, A needs fewer training samples than B to reach PAC.

PAC learning Note: This result only holds for finite hypothesis spaces (e.g. not valid for the space of line hypotheses!)
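The formula on this slide did not survive extraction. As a stand-in, the sketch below uses the standard textbook bound for a learner that outputs a hypothesis consistent with the training data over a finite hypothesis space H: N >= (1/e) * (ln|H| + ln(1/d)). Treat it as an assumption about which result the slide refers to; the function name is hypothetical.

```python
import math


def pac_sample_bound(epsilon: float, delta: float, hypothesis_count: int) -> int:
    """Samples sufficient for a consistent learner over a FINITE hypothesis
    space to be, with probability >= 1 - delta, within error epsilon.

    Standard bound: N >= (1/epsilon) * (ln|H| + ln(1/delta)).
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    if hypothesis_count < 1:
        raise ValueError("the hypothesis space must be non-empty and finite")
    needed = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(needed)


# Conjunctive hypotheses over 10 boolean attributes: |H| = 3**10
n_conjunctive = pac_sample_bound(0.1, 0.05, 3 ** 10)
# All boolean functions over 10 attributes: |H| = 2**(2**10)
n_unrestricted = pac_sample_bound(0.1, 0.05, 2 ** (2 ** 10))
```

The bound grows only with log |H|, which is why restricting the hypothesis space (a stronger bias) cuts the number of samples needed so sharply.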

Inductive Learning (Classification Learning)
• Given a set of labeled examples, and a space of hypotheses
  • Find the rule that underlies the labeling
  • (so you can use it to predict future unlabeled examples)
  • Tabula rasa, fully supervised
• Idea:
  • Loop through all hypotheses
  • Rank each hypothesis in terms of its match to data
  • Pick the best hypothesis
• The main problem is that the space of hypotheses is too large
  • Given examples described in terms of n boolean variables
  • There are 2^(2^n) different hypotheses
  • For 6 features, there are 18,446,744,073,709,551,616 hypotheses
• It can be shown that the sample complexity of PAC learning is proportional to 1/e, log(1/d) AND log |H|
• Main variations:
  • Bias: what “sort” of rule are you looking for?
    • If you are looking for only conjunctive hypotheses, there are just 3^n
  • Search:
    • Greedy search: decision tree learner
    • Systematic search: version space learner
    • Iterative search: neural net learner
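The two counts on this slide are easy to check numerically. A small sketch (function names are hypothetical): a boolean function over n variables is a truth table with 2^n rows, each labeled true or false, and a conjunctive hypothesis lets each variable appear positively, negated, or not at all.

```python
def count_boolean_hypotheses(n: int) -> int:
    """Number of distinct boolean functions of n boolean variables: 2^(2^n)."""
    return 2 ** (2 ** n)


def count_conjunctive_hypotheses(n: int) -> int:
    """Number of conjunctions over n variables: each variable is
    positive, negated, or absent, giving 3^n."""
    return 3 ** n


all_functions_6 = count_boolean_hypotheses(6)
conjunctions_6 = count_conjunctive_hypotheses(6)
```

For 6 features the conjunctive bias shrinks the space from roughly 1.8 * 10^19 hypotheses to 729.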

IMPORTANCE OF Bias in Learning…
The more expressive the bias, the larger the hypothesis space, and the slower the learning
--Line fitting is faster than curve fitting
--Line fitting may miss non-line patterns
“Gavagai” example
--The “whole object” bias in language learning

Using different biases to predict Russell’s waiting habits
Decision Trees
--Examples are used to learn the topology and the order of questions
Association rules
--e.g. If patrons=full and day=Friday then wait (0.3/0.7); If wait>60 and Reservation=no then wait (0.4/0.9)
--Examples are used to learn the support and confidence of association rules
Neural Nets
--Examples are used to learn the topology and the edge weights
Naïve Bayes (Bayes net learning)
--Examples are used to learn the topology and the CPTs

Mirror, Mirror, on the wall, which learning bias is the best of all?
Well, there is no such thing, silly!
--Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:
  --A line fitter can fit the best line to the data very fast, but won’t know what to do if the data doesn’t fall on a line
  --A curve fitter can fit lines as well as curves… but takes longer to fit lines than a line fitter
--Different types of bias classes (decision trees, NNs etc.) provide different ways of naturally carving up the space of all possible hypotheses
So a more reasonable question is:
--What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
--In this bias class, what is the most restrictive bias that can still capture the true pattern in the data?
  --Decision trees can capture all boolean functions, but are faster at capturing conjunctive boolean functions
  --Neural nets can capture all boolean or real-valued functions, but are faster at capturing linearly separable functions
  --Bayesian learning can capture all probabilistic dependencies, but is faster at capturing single-level dependencies (naïve Bayes classifier)

12/4 Interactive Review next class!! Minh’s review: Next Monday evening Rao’s review: Reading day? Vote on participation credit: Should I consider participation credit or not?

Why Simple is Better?
The BIG TENSION…. Fitting test cases vs. predicting future cases
[Figure: three candidate fits to the same data, labeled 1, 2 and 3]
Why not the 3rd?

Learning Decision Trees---How?
Basic Idea:
--Pick an attribute (which one to pick?)
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve and some are -ve, continue splitting recursively

Depending on the order we pick, we can get smaller or bigger trees Which tree is better? Why do you think so??

Would you split on Patrons or Type?
Basic Idea:
--Pick an attribute
--Split examples in terms of that attribute
--If all examples are +ve, label Yes. Terminate
--If all examples are -ve, label No. Terminate
--If some are +ve and some are -ve, continue splitting recursively
--If no attributes are left to split on? (label with the majority element)
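The basic idea translates almost line for line into a recursive procedure. The sketch below is a minimal illustration, not the course's implementation: it naively picks attributes in the order given (so the order you pass in decides the size of the tree), and the example data and attribute names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple, Union

Example = Dict[str, Any]
Labeled = Tuple[Example, bool]
# A tree is either a leaf (True/False) or (attribute, {value: subtree}).
Tree = Union[bool, Tuple[str, Dict[Any, "Tree"]]]


def majority(labels: List[bool]) -> bool:
    """Most common label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def learn_tree(examples: List[Labeled], attributes: List[str],
               default: bool = False) -> Tree:
    if not examples:
        return default
    labels = [label for _, label in examples]
    if all(labels):            # all +ve: label Yes, terminate
        return True
    if not any(labels):        # all -ve: label No, terminate
        return False
    if not attributes:         # nothing left to split on: majority label
        return majority(labels)
    attr, rest = attributes[0], attributes[1:]   # naive attribute choice
    branches: Dict[Any, Tree] = {}
    for value in sorted({x[attr] for x, _ in examples}):
        subset = [(x, label) for x, label in examples if x[attr] == value]
        branches[value] = learn_tree(subset, rest, majority(labels))
    return (attr, branches)


def classify(tree: Tree, x: Example, default: bool = False) -> bool:
    while not isinstance(tree, bool):
        attr, branches = tree
        if x[attr] not in branches:   # attribute value never seen in training
            return default
        tree = branches[x[attr]]
    return tree


# Hypothetical toy data: the same examples, two attribute orders.
examples: List[Labeled] = [
    ({"patrons": "some", "hungry": "yes"}, True),
    ({"patrons": "full", "hungry": "yes"}, True),
    ({"patrons": "full", "hungry": "no"}, False),
    ({"patrons": "none", "hungry": "no"}, False),
]
big_tree = learn_tree(examples, ["patrons", "hungry"])
small_tree = learn_tree(examples, ["hungry", "patrons"])
```

Both trees classify the training examples correctly, but the second needs a single question. Replacing the naive `attributes[0]` choice with an informed one is exactly what the information gain heuristic is for.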

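The Information Gain Computation
Given k mutually exclusive and exhaustive events $E_1, \ldots, E_k$ whose probabilities are $p_1, \ldots, p_k$, the “information” content (entropy) is defined as
$$\sum_{i} -p_i \log_2 p_i$$
Before the split there are $N_+$ +ve and $N_-$ -ve examples, with $P_+ = N_+/(N_+ + N_-)$ and $P_- = N_-/(N_+ + N_-)$:
$$I(P_+, P_-) = -P_+ \log_2 P_+ - P_- \log_2 P_-$$
--This is the # of expected comparisons needed to tell whether a given example is +ve or -ve
Splitting on feature $f_k$ gives k branches with counts $(N_{1+}, N_{1-}), (N_{2+}, N_{2-}), \ldots, (N_{k+}, N_{k-})$ and information $I(P_{1+}, P_{1-}), I(P_{2+}, P_{2-}), \ldots, I(P_{k+}, P_{k-})$. The residual information after the split is
$$\sum_{i=1}^{k} \frac{N_{i+} + N_{i-}}{N_+ + N_-} \, I(P_{i+}, P_{i-})$$
The difference between the two is the information gain
--So, pick the feature with the largest info gain, i.e. the smallest residual info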

A simple example
I(1/2,1/2) = -1/2 * log2(1/2) - 1/2 * log2(1/2) = 1/2 + 1/2 = 1
I(1,0) = -1 * log2(1) - 0 * log2(0) = 0 (taking 0 * log2(0) to be 0)
V(M) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1
V(A) = 2/4 * I(1,0) + 2/4 * I(0,1) = 0
V(N) = 2/4 * I(1/2,1/2) + 2/4 * I(1/2,1/2) = 1
So Anxious (A) is the best attribute to split on: it leaves the smallest residual information
Once you split on Anxious, the problem is solved
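The arithmetic in this example can be reproduced with a short sketch. The function names are hypothetical; the branch counts are read off the V(·) expressions above (each attribute splits the 4 examples into two branches of 2), and 0 * log2(0) is taken to be 0.

```python
import math
from typing import List, Tuple

Counts = Tuple[int, int]  # (+ve count, -ve count)


def entropy(pos: int, neg: int) -> float:
    """I(P+, P-) in bits, with 0 * log2(0) treated as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def residual(branches: List[Counts]) -> float:
    """Weighted information remaining after a split (V in the example)."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy(p, n) for p, n in branches)


def information_gain(pos: int, neg: int, branches: List[Counts]) -> float:
    """Information before the split minus residual information after it."""
    return entropy(pos, neg) - residual(branches)


# Branch counts implied by the example: 2 +ve and 2 -ve examples overall.
v_m = residual([(1, 1), (1, 1)])   # M leaves both branches mixed
v_a = residual([(2, 0), (0, 2)])   # A separates the classes perfectly
gain_a = information_gain(2, 2, [(2, 0), (0, 2)])
```

Picking the attribute with the largest gain is the same as picking the one with the smallest residual, since the information before the split is the same for every attribute.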

Evaluating the Decision Trees
Learning curves…
Given N examples, partition them into Ntr, the training set, and Ntest, the test instances
Loop for i=1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (% correct)
[Figures: “Majority” function (say yes if the majority of attributes are yes); Russell domain]
Lesson: Every bias makes something easier to learn and others harder to learn…
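The learning-curve loop can be sketched as below. One deliberate departure from the slide: instead of enumerating every subset of each size (exponentially many), the sketch averages over a fixed number of randomly sampled subsets. The `train_majority` learner and the data are hypothetical stand-ins for a real learner such as a decision tree.

```python
import random
from typing import Any, Callable, List, Tuple

Labeled = Tuple[Any, bool]
Classifier = Callable[[Any], bool]


def learning_curve(train_set: List[Labeled], test_set: List[Labeled],
                   train: Callable[[List[Labeled]], Classifier],
                   sizes: List[int], trials: int = 20,
                   seed: int = 0) -> List[Tuple[int, float]]:
    """Average test accuracy as a function of training-set size."""
    rng = random.Random(seed)          # fixed seed for repeatable curves
    curve = []
    for size in sizes:
        accuracies = []
        for _ in range(trials):
            sample = rng.sample(train_set, size)   # random subset of Ntr
            classifier = train(sample)
            correct = sum(1 for x, label in test_set if classifier(x) == label)
            accuracies.append(correct / len(test_set))
        curve.append((size, sum(accuracies) / trials))
    return curve


def train_majority(sample: List[Labeled]) -> Classifier:
    """Trivial stand-in learner: always predict the sample's majority label."""
    positives = sum(1 for _, label in sample if label)
    guess = positives * 2 >= len(sample)
    return lambda x: guess


# Hypothetical data: every training label is True; the test set is 3:1.
train_set: List[Labeled] = [(i, True) for i in range(6)]
test_set: List[Labeled] = [(10, True), (11, True), (12, True), (13, False)]
curve = learning_curve(train_set, test_set, train_majority, sizes=[1, 2, 4], trials=3)
```

Plotting the second column of `curve` against the first gives the learning curve; swapping in a different `train` function lets two learners be compared on the same split.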

Problems with Info. Gain Heuristics
• Feature correlation: The Costanza party problem
  • No obvious solution…
• Overfitting: We may look too hard for patterns where there are none
  • E.g. coin tosses classified by the day of the week, the shirt I was wearing, the time of the day etc.
  • Solution: Don’t consider splitting if the information gain given by the best feature is below a minimum threshold
    • Can use the χ² test for statistical significance
    • Will also help when we have noisy samples…
• We may prefer features with very high branching
  • e.g. Branch on the “universal time string” for the Russell restaurant example
  • Branch on social security number to look for patterns on who will get an A
  • Solution: “gain ratio” -- the ratio of the information gain with the attribute A to the information content of answering the question “What is the value of A?”
    • The denominator is smaller for attributes with smaller domains.
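A minimal sketch of the gain ratio, assuming the usual reading of the definition above: the numerator is the information gain of splitting on A, and the denominator is the entropy of A's own value distribution. The data and attribute names (`id` playing the role of a social security number) are hypothetical.

```python
import math
from collections import Counter
from typing import Any, Dict, Hashable, List, Sequence, Tuple

Labeled = Tuple[Dict[str, Any], bool]


def entropy_of(values: Sequence[Hashable]) -> float:
    """Entropy (in bits) of the empirical distribution of the values."""
    total = len(values)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(examples: List[Labeled], attribute: str) -> float:
    labels = [label for _, label in examples]
    values = [x[attribute] for x, _ in examples]
    # Residual information after splitting on the attribute
    remainder = 0.0
    for value in set(values):
        subset = [label for x, label in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy_of(subset)
    gain = entropy_of(labels) - remainder
    # Information content of "What is the value of A?"
    split_info = entropy_of(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: 'id' is unique per example, 'anxious' is the real pattern.
examples: List[Labeled] = [
    ({"id": "a", "anxious": "yes"}, True),
    ({"id": "b", "anxious": "yes"}, True),
    ({"id": "c", "anxious": "no"}, False),
    ({"id": "d", "anxious": "no"}, False),
]
ratio_id = gain_ratio(examples, "id")
ratio_anxious = gain_ratio(examples, "anxious")
```

Both attributes have the same raw information gain here (1 bit), but the four-way `id` split is penalized by its larger denominator, so `anxious` wins.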

Neural Network Learning • Idea: Since classification is really a question of finding a surface to separate the +ve examples from the -ve examples, why not directly search in the space of possible surfaces? • Mathematically, a surface is a function • Need a way of learning functions • “Threshold units”

A “Neural Net” is a collection of (differentiable) threshold units with interconnections
--Threshold unit: output = 1 if w1I1 + w2I2 > k, = 0 otherwise
Feed Forward: uni-directional connections
--Single Layer: any linear decision surface can be represented by a single layer neural net
--Multi-Layer: any “continuous” decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net
Recurrent: bi-directional connections
--Can act as associative memory

The “Brain” Connection
A threshold unit …is sort of like a neuron
[Figure: threshold functions, including a differentiable one]

Perceptron Networks
What happened to the “Threshold”?
--Can model it as an extra weight with a static input
--A unit with inputs I1, I2, weights w1, w2 and threshold t=k == a unit with an extra input I0=-1 with weight w0=k, and threshold t=0
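The equivalence between a threshold and an extra weight can be checked directly. A small sketch (function names and the AND weights are illustrative assumptions): the first unit compares the weighted sum against k, the second folds k into a weight w0 on a constant input I0 = -1 and compares against 0.

```python
from itertools import product
from typing import Sequence


def threshold_unit(inputs: Sequence[float], weights: Sequence[float],
                   k: float) -> int:
    """Output 1 if the weighted input sum exceeds the threshold k, else 0."""
    return 1 if sum(w * i for w, i in zip(weights, inputs)) > k else 0


def bias_unit(inputs: Sequence[float], weights: Sequence[float],
              w0: float) -> int:
    """Same unit with threshold 0: the old threshold becomes weight w0
    on an extra static input I0 = -1."""
    total = w0 * -1.0 + sum(w * i for w, i in zip(weights, inputs))
    return 1 if total > 0 else 0


# Illustrative weights: w1 = w2 = 1 with k = 1.5 computes boolean AND.
and_outputs = [threshold_unit(x, (1.0, 1.0), 1.5)
               for x in product((0, 1), repeat=2)]
```

Treating the threshold as just another weight means a learning rule only has to adjust weights, with no separate rule for thresholds.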

Can Perceptrons Learn All Boolean Functions? --Are all boolean functions linearly separable?

Perceptron Training in Action
Any line that separates the +ve & -ve examples is a solution
--may want to get the line that is in some sense equidistant from the nearest +ve/-ve
--Need “support vector machines” for that
A nice applet at: http://neuron.eng.wayne.edu/java/Perceptron/New38.html
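The slides do not spell out the training rule, so the sketch below assumes the standard perceptron learning rule: for each misclassified example, move every weight by rate * (target - output) * input. It reuses the threshold-as-weight trick (constant input -1). Function names and the learning rate are assumptions.

```python
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], int]


def predict(weights: List[float], x: Sequence[float]) -> int:
    """weights[0] is the threshold weight, paired with a constant input -1."""
    total = -weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else 0


def train_perceptron(data: List[Labeled], rate: float = 1.0,
                     max_epochs: int = 100) -> Tuple[List[float], bool]:
    """Return (weights, converged). Converged means a full pass over the
    data made no mistakes, i.e. a separating line was found."""
    weights = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in data:
            error = target - predict(weights, x)
            if error:
                mistakes += 1
                weights[0] += rate * error * -1.0      # static input I0 = -1
                for i, xi in enumerate(x):
                    weights[i + 1] += rate * error * xi
        if mistakes == 0:
            return weights, True
    return weights, False


and_data: List[Labeled] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data: List[Labeled] = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_ok = train_perceptron(and_data)
xor_weights, xor_ok = train_perceptron(xor_data)
```

AND is linearly separable, so training stops with some separating line (not necessarily the equidistant one). XOR is not linearly separable, so no pass is ever mistake-free and the loop gives up after `max_epochs`.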

Comparing Perceptrons and Decision Trees in Majority Function and Russell Domain
[Figures: decision trees vs. perceptron on the majority function; decision trees vs. perceptron on the Russell domain]
Majority function is linearly separable.. Russell domain is apparently not....
Encoding: one input unit per attribute. The unit takes as many distinct real values as the size of the attribute domain
