
Learning From Observations



Presentation Transcript


  1. Learning From Observations Marco Loog

  2. Learning from Observations • Idea is that percepts should be used for improving the agent’s ability to act in the future, not only for acting per se

  3. Outline • Learning agents • Inductive learning • Decision tree learning

  4. Learning • Learning is essential for unknown environments, i.e., when designer lacks omniscience • Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down • Learning modifies the agent’s decision mechanisms to improve performance

  5. Learning Agent [Revisited] • Four conceptual components • Learning element : responsible for making improvements • Performance element : takes percepts and decides on actions • Critic : provides feedback on how agent is doing and determines how performance element should be modified • Problem generator : responsible for suggesting actions leading to new and informative experience

  6. Figure 2.15 [Revisited]

  7. Learning Element • Design of learning element is affected by • Which components of the performance element are to be learned • What feedback is available to learn these components • What representation is used for the components

  8. Agent’s Components • Direct mapping from conditions on current state to actions [instructor : brake!] • Means to infer relevant properties about world from percept sequence [learning from images] • Info about evolution of the world and results of possible actions [braking on wet road] • Utility indicating desirability of world state [no tip / component of utility function] • ... • Each component can be learned from appropriate feedback

  9. Types of Feedback • Supervised learning : correct answers for each example • Unsupervised learning : correct answers not given • Reinforcement learning : occasional rewards

  10. Inductive Learning • Simplest form : learn a function from examples • I.e. learn the target function f • Examples : input / output pairs (x, f(x))

  11. Inductive Learning • Problem • Find a hypothesis h, such that h ≈ f, based on given training set of examples • = highly simplified model of real learning • Ignores prior knowledge • Assumes examples are given

  12. Hypothesis • A good hypothesis will generalize well, i.e., it is able to predict well on unseen examples

  13. Inductive Learning Method • E.g. function fitting • Goal is to estimate real underlying functional relationship from example observations

  14. Inductive Learning Method • Construct h to agree with f on training set

  15. Inductive Learning Method • Construct h to agree with f on training set

  16. Inductive Learning Method • Construct h to agree with f on training set

  17. Inductive Learning Method • Construct h to agree with f on training set • h is consistent if it agrees with f on all examples

  18. Inductive Learning Method • Construct h to agree with f on training set • h is consistent if it agrees with f on all examples

  19. So, which ‘Fit’ is Best?

  20. So, which ‘Fit’ is Best? • Ockham’s razor : prefer simplest hypothesis consistent with the data

  21. So, which ‘Fit’ is Best? • Ockham’s razor : prefer simplest hypothesis consistent with the data • What’s consistent? What’s simple?

  22. Hypothesis • A good hypothesis will generalize well, i.e., it is able to predict well on unseen examples • Not-exactly-consistent may be preferable to exactly consistent • Nondeterministic behavior • Consistency is not even always possible • Nondeterministic functions : trade-off between complexity of hypothesis and degree of fit
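A minimal sketch of this fit-versus-complexity trade-off: fit polynomials of increasing degree to a small noisy sample and compare training and test error. The target function, sample size, and degrees below are illustrative assumptions, not taken from the slides.

```python
# Sketch: fitting hypotheses of increasing complexity to noisy examples (x, f(x)).
# The target f, the sample size, and the polynomial degrees are all illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                        # 10 training inputs
f = lambda v: np.sin(2 * np.pi * v)              # hypothetical target function f
y = f(x) + rng.normal(scale=0.1, size=x.shape)   # noisy observations of f(x)

x_test = np.linspace(0, 1, 100)                  # unseen examples
for degree in (1, 3, 9):                         # simple ... exactly consistent
    h = np.poly1d(np.polyfit(x, y, degree))      # hypothesis h: degree-d polynomial
    train_err = np.mean((h(x) - y) ** 2)
    test_err = np.mean((h(x_test) - f(x_test)) ** 2)
    print(f"degree {degree}: train error {train_err:.4f}, test error {test_err:.4f}")

# The degree-9 polynomial can fit the 10 points (almost) exactly, yet it
# typically generalizes worse than a simpler hypothesis: Ockham's razor in action.
```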

  23. Decision Trees • ‘Decision tree induction is one of the simplest, and yet most successful forms of learning algorithm’ • Good intro to the area of inductive learning

  24. Decision Tree • Input : object or situation described by set of attributes / features • Output [discrete or continuous] : decision / prediction • Continuous -> regression • Discrete -> classification • Boolean classification : output is binary / ‘true’ or ‘false’

  25. Decision Tree • Performs a sequence of tests in order to reach a decision • Tree [as in : graph without closed loops] • Internal node : test of the value of single property • Branches labeled with possible test outcomes • Leaf node : specifies output value • Resembles a ‘how to’ manual

  26. Decide whether to wait for a Table at a Restaurant • Based on the following attributes • Alternate : is there an alternative restaurant nearby? • Bar : is there a comfortable bar area to wait in? • Fri/Sat : is today Friday or Saturday? • Hungry : are we hungry? • Patrons : number of people in the restaurant [None, Some, Full] • Price : price range [$, $$, $$$] • Raining : is it raining outside? • Reservation : have we made a reservation? • Type : kind of restaurant [French, Italian, Thai, Burger] • WaitEstimate : estimated waiting time [0-10, 10-30, 30-60, >60]
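To make the representation concrete, one restaurant example could be encoded as a record of attribute values plus a Boolean decision. Only the attribute names come from the slide; the particular values and the label below are made up for illustration.

```python
# Hypothetical encoding of one restaurant example; attribute names follow the
# slide, the values and the label are invented for illustration only.
example = {
    "Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
    "Patrons": "Full", "Price": "$", "Raining": False, "Reservation": False,
    "Type": "Thai", "WaitEstimate": "30-60",
}
label = False  # the decision: do we wait for a table?
```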

  27. Attribute-Based Representations • Examples of decisions

  28. Decision Tree • Possible representation for hypotheses • Below is the ‘true’ tree [note Type? plays no role]

  29. Expressiveness • Decision trees can express any function of the input attributes • E.g., for Boolean functions, each truth table row corresponds to a path to a leaf

  30. Expressiveness • There is a consistent decision tree for any training set with one path to leaf for each example [unless f nondeterministic in x] but it probably won’t generalize to new examples • Prefer to find more compact decision trees [This Ockham again...]

  31. Attribute-Based Representations • Such a representation is simply a lookup table • Cannot generalize to unseen examples

  32. Decision Tree • Applying Ockham’s razor : smallest tree consistent with examples

  33. Decision Tree • Applying Ockham’s razor : smallest tree consistent with examples • Able to generalize to unseen examples • No need to program everything out / specify everything in detail • ‘true’ tree = smallest tree?

  34. Decision Tree Learning • Unfortunately, finding the ‘smallest’ tree is intractable in general • New aim : find a ‘smallish’ tree consistent with the training examples • Idea : [recursively] choose ‘most significant’ attribute as root of [sub]tree • ‘Most significant’ : making the most difference to the classification
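A sketch of this greedy recursion in Python; the example representation (attribute dictionary plus label), the helper names, and the majority fallback for exhausted attributes are my own assumptions, not the exact algorithm from the slides.

```python
# Sketch of greedy decision-tree learning: recursively pick the "most significant"
# attribute and split on it. Helper names and tie-breaking choices are assumptions.
from collections import Counter

def plurality(examples):
    """Most common label among the examples (fallback value for a leaf)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, choose_best_attribute):
    """examples: list of (attribute_dict, label); returns a nested-dict tree or a leaf label."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                 # all examples agree: make a leaf
        return labels.pop()
    if not attributes:                   # attributes exhausted: majority leaf
        return plurality(examples)
    best = choose_best_attribute(examples, attributes)   # e.g. by information gain
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:  # one branch per observed value
        subset = [(a, l) for a, l in examples if a[best] == value]
        remaining = [x for x in attributes if x != best]
        tree[best][value] = learn_tree(subset, remaining, choose_best_attribute)
    return tree
```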

  35. Choosing an Attribute Test • Idea : a good attribute splits the examples into subsets that are [ideally] ‘all positive’ or ‘all negative’ • Patrons? is a better choice

  36. Using Information Theory • Information content [entropy] : • I(P(v1), … , P(vn)) = Σi=1..n −P(vi) log2 P(vi) • For a training set containing p positive examples and n negative examples : I(p/(p+n), n/(p+n)) • Specifies the minimum number of bits of information needed to encode the classification of an arbitrary member
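The entropy formula as a small runnable sketch; the function name is my own choice, not from the slides.

```python
# General information content I(P(v1), ..., P(vn)) = sum_i -P(vi) * log2 P(vi),
# with the convention 0 * log2(0) = 0. Function name is my own.
from math import log2

def information_content(probabilities):
    return -sum(p * log2(p) for p in probabilities if p)

print(information_content([6 / 12, 6 / 12]))  # 1.0 bit, the case used on slide 39
```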

  37. Information Gain • Chosen attribute A divides training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values • Information gain [IG] : expected reduction in entropy caused by partitioning the examples

  38. Information Gain • Information gain [IG] : expected reduction in entropy caused by partitioning the examples • Choose the attribute with the largest IG • [Wanna know more : Google it...]

  39. Information Gain [E.g.] • For the training set : p = n = 6, I(6/12, 6/12) = 1 bit • Consider Patrons? and Type? [and others] • Patrons has the highest IG of all attributes and so is chosen as the root • Why is IG of Type? equal to zero?
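A sketch of that calculation; the per-value positive/negative counts below are the usual split of the 12 restaurant examples in the textbook and should be read as assumed here.

```python
# Information gain = entropy before splitting - expected entropy after splitting.
from math import log2

def entropy(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a p-positive / n-negative set."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

def information_gain(counts, p=6, n=6):
    """counts: per-branch (positive, negative) pairs for one attribute."""
    total = p + n
    remainder = sum((pi + ni) / total * entropy(pi, ni) for pi, ni in counts)
    return entropy(p, n) - remainder

# Per-value (positive, negative) counts; the usual split of the 12 restaurant
# examples in the textbook, treated as an assumption here.
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_ = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger

print(information_gain(patrons))  # ~0.541 bits -> Patrons? is chosen as the root
print(information_gain(type_))    # 0.0 bits: every branch keeps the same 50/50 mix
```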

  40. Decision Tree Learning • Plenty of other measures for ‘best’ attributes possible...

  41. Back to The Example... • ‘Training data’

  42. Decision Tree Learned • Based on the 12 examples; substantially simpler solution than the ‘true’ tree • More complex hypothesis isn’t justified by small amount of data

  43. Performance Measurement • How do we know that h ≈ f? • Or : how the h*ll do we know that our decision tree performs well? • Most often we don’t know... for sure

  44. Performance Measurement • However • prediction quality can be estimated using results from computational / statistical learning theory [PAC-learning] • Or we could, for example, simply try h on a new test set of examples • The crux being of course that there should actually be a new test set... • If no test set is available, several possibilities exist for creating ‘training’ and ‘test’ sets from the available data

  45. Performance Measurement • Learning curve : ‘%’ correct on test set as function of training set size
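One way such a curve could be estimated, sketched with scikit-learn purely for illustration; the dataset and the classifier choice are assumptions, not part of the slides.

```python
# Sketch of a learning curve: accuracy on a fixed held-out test set as a
# function of training-set size. Dataset and learner are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Keep a fixed test set that the learner never trains on.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (10, 20, 40, 80, len(X_train)):          # growing training-set sizes
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    acc = clf.score(X_test, y_test)               # '%' correct on the test set
    print(f"{m:3d} training examples -> test accuracy {acc:.2f}")
```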

  46. Bad Conduct in AI • Training on the test set! • May happen before you know it • Often very hard to justify... if at all possible • All I can say is : try to avoid it

  47. Ensemble-Learning-in-1-Slide • Idea : a collection [ensemble] of hypotheses is used / their predictions are combined • Motivation : hope that the ensemble is much less likely to misclassify [obviously!] • E.g. independence can be exploited • Examples : majority voting / boosting • Ensemble learning simply creates a new, more expressive hypothesis space
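A minimal sketch of majority voting; the three ‘hypotheses’ below are made-up predictor functions standing in for learned models.

```python
# Minimal majority-voting ensemble over Boolean predictions.
# The three hypotheses are invented stand-ins for learned classifiers.
from collections import Counter

def majority_vote(hypotheses, x):
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]

h1 = lambda x: x > 3          # three hypothetical, imperfect hypotheses
h2 = lambda x: x > 5
h3 = lambda x: x % 2 == 0

print(majority_vote([h1, h2, h3], 4))   # True: two of the three vote True
# If the hypotheses err independently, the ensemble is wrong only when a
# majority of them are wrong at once, which is (hopefully) much less likely.
```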

  48. Summary • In general : learning needed for unknown environments or lazy designers • Learning agent = performance element + learning element [Chapter 2] • Supervised learning : the aim is to find simple hypothesis [approximately] consistent with training examples • Decision tree learning using IG • Difficult to measure learning performance • Learning curve

  49. Next Week • More...
