## Advanced Artificial Intelligence Lecture 3: Learning


Bob McKay, School of Computer Science and Engineering, College of Engineering, Seoul National University

**Outline**
• Defining Learning
• Kinds of Learning
• Generalisation and Specialisation
• Some Simple Learning Algorithms

**References**
• Mitchell, Tom M: Machine Learning, McGraw-Hill, 1997, ISBN 0 07 115467 1

**Defining a Learning System (Mitchell)**
• “A program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

**Specifying a Learning System**
• Specifying the task T, the performance measure P and the experience E defines the learning problem.
• Specifying the learning system requires us to define:
• Exactly what knowledge is to be learnt
• How this knowledge is to be represented
• How this knowledge is to be learnt

**Specifying What is to be Learnt**
• Usually, the desired knowledge can be represented as a target valuation function V: I → D
• It takes in information about the problem and gives back a desired decision
• Often, it is unrealistic to expect to learn the ideal function V
• All that is required is a ‘good enough’ approximation, V’: I → D

**Specifying How Knowledge is to be Represented**
• The function V’ must be represented symbolically, in some language L
• The language may be a well-known language
• Boolean expressions
• Arithmetic functions
• …
• Or for some systems, the language may be defined by a grammar

**Specifying How the Knowledge is to be Learnt**
• If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V’
• That is, we must specify a search algorithm

**Structure of a Learning System**
• Four modules
• The Performance System
• The Critic
• The Generaliser (or sometimes Specialiser)
• The Experiment Generator

**Performance Module**
• This is the system which actually uses the function V’ as we learn it
• Learning task: learning to play checkers
• Performance module: the system for playing checkers (i.e. it makes the checkers moves)

**Critic Module**
• The critic module evaluates the performance of the current V’
• It produces a set of data from which the system can learn further

**Generaliser/Specialiser Module**
• Takes a set of data and produces a new V’ for the system to run again

**Experiment Generator**
• Takes the new V’
• Maybe also uses the previous history of the system
• Produces a new experiment for the performance system to undertake

**The Importance of Bias**
• Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible.
• Practical experience, of both machine and human learning, confirms this.
• To learn effectively, we must limit the class of V’s.
• Two approaches are used in machine learning:
• Language bias
• Search bias
• Combined bias: language and search bias are not mutually exclusive; most learning systems feature both

**Language Bias**
• The language L is restricted so that it cannot represent all possible target functions V
• This is usually on the basis of some knowledge we have about the likely form of V’
• It introduces risk
• Our system will fail if L does not contain an acceptable V’

**Search Bias**
• The order in which the system searches L is controlled, so that promising areas for V’ are searched first

**The Downside: No Free Lunches**
• Wolpert and Macready’s No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad).
• Conventional view
• The choice of a learning system cannot be universal
• It must be matched to the problem being solved
• In most systems, the bias is not explicit
• The ability to identify the language and search biases of a particular system is an important aspect of machine learning
• Some more recent systems permit the explicit and flexible specification of both language and search biases

**No Free Lunch: Does it Matter?**
• Alternative view
• We aren’t interested in all problems
• We are only interested in problems which have solutions of less than some bounded complexity
• (so that we can understand the solutions)
• The No Free Lunch Theorem may not apply in this case

**Some Dimensions of Learning**
• Induction vs Discovery
• Guided learning vs learning from raw data
• Learning How vs Learning That (vs Learning a Better That)
• Stochastic vs Deterministic; Symbolic vs Subsymbolic
• Clean vs Noisy Data
• Discrete vs Continuous Variables
• Attribute vs Relational Learning
• The Importance of Background Knowledge

**Induction vs Discovery**
• Has the target concept been previously identified?
• Pearson: cloud classifications from satellite data
• vs
• Autoclass and H-R diagrams
• AM and prime numbers
• BACON and Boyle's Law

**Guided Learning vs Learning from Raw Data**
• Does the learning system require carefully selected examples and counterexamples, as in a teacher-student situation?
• (allows fast learning)
• CIGOL learning sort/merge
• vs
• Garvan Institute's thyroid data

**Learning How vs Learning That vs Learning a Better That**
• Classifying handwritten symbols
• Distinguishing vowel sounds (Sejnowski & Rosenberg)
• Learning to fly a (simulated!) plane
• vs
• Michalski & learning diagnosis of soy diseases
• vs
• Mitchell & learning about chess forks

**Stochastic vs Deterministic; Symbolic vs Subsymbolic**
• Classifying handwritten symbols (stochastic, subsymbolic)
• vs
• Predicting plant distributions (stochastic, symbolic)
• vs
• Cloud classification (deterministic, symbolic)
• vs
• ? (deterministic, subsymbolic)

**Clean vs Noisy Data**
• Learning to diagnose errors in programs
• vs
• Greater gliders in the Coolangubra

**Discrete vs Continuous Variables**
• Quinlan's chess end games
• vs
• Pearson's clouds (e.g. cloud heights)

**Attribute vs Relational Learning**
• Predicting plant distributions
• vs
• Predicting animal distributions
• (because plants can’t move, they don’t care, much, about spatial relationships)

**The Importance of Background Knowledge**
• Learning about faults in a satellite power supply
• general electric circuit theory
• knowledge about the particular circuit

**Generalisation and Learning**
• What do we mean when we say of two propositions, S and G, that G is a generalisation of S?
• Suppose Skippy is a grey kangaroo.
• We would regard ‘Kangaroos are grey’ as a generalisation of ‘Skippy is grey’.
• In any world in which ‘Kangaroos are grey’ is true, ‘Skippy is grey’ will also be true.
• In other words, if G is a generalisation of the specialisation S, then S is 'at least as true' as G.
• That is, S is true in all states of the world in which G is, and perhaps in other states as well.

**Generalisation and Inference**
• In logic, we assume that if S is true in all worlds in which G is, then
• G → S
• That is, G is a generalisation of S exactly when G implies S
• So we can think of learning from S as a search for a suitable G for which G → S
• In propositional learning, this is often used as a definition:
• G is more general than S if and only if G → S

**Issues**
• Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed
• In the propositional calculus, checking validity takes exponential time in the worst case (with all known algorithms)
• In the predicate calculus, validity is an undecidable problem
• So the definition is not universally useful
• (although for some parts of logic, e.g. learning rules, it is perfectly adequate).

**A Common Misunderstanding**
• Suppose we have two rules,
• 1) A ∧ B → G
• 2) A ∧ B ∧ C → G
• Clearly, we would want 1 to be a generalisation of 2
• This is OK with our definition, because
• ((A ∧ B → G) → (A ∧ B ∧ C → G))
• is valid
• But the confusing thing is that ((A ∧ B ∧ C) → (A ∧ B)) is also valid
• If you only look at the hypotheses of the rule, rather than the whole rule, the implication is the wrong way around
• Note that some textbooks are themselves confused about this

**Defining Generalisation**
• We could try to define the properties that generalisation must satisfy,
• So let's write down some axioms. We need some notation.
• We will write 'S <G G' as shorthand for 'S is less general than G'.
• Axioms:
• Transitivity: If A <G B and B <G C then also A <G C
• Antisymmetry: If A <G B then it's not true that B <G A
• Top: there is a unique element, ⊤, for which it is always true that A <G ⊤.
• Bottom: there is a unique element, ⊥, for which it is always true that ⊥ <G A.

**Picturing Generalisation**
• We can draw a 'picture' of a generalisation hierarchy satisfying these axioms:

**Specifying Generalisation**
• In a particular domain, the generalisation hierarchy may be defined in either of two ways:
• By giving a general definition of what generalisation means in that domain
• Example: our earlier definition in terms of implication
• By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy

**Learning and Generalisation**
• How does learning relate to generalisation?
• We can view most learning as an attempt to find an appropriate generalisation of the examples.
• In noise-free domains, we usually want the generalisation to cover all the examples.
• Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is.
• In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy.
• The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually no higher 'than it needs to be'

**Searching the Generalisation Hierarchy**
• The commonest approaches are:
• generalising search
• the search is upward from the original examples, towards the more general hypotheses
• specialising search
• the search is downward from the most general hypothesis, towards the more special examples
• Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once.

**Completeness and Generalisation**
• Many approaches to axiomatising generalisation add an extra axiom:
• Completeness: For any set Σ of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties:
• 1) for every S in Σ, S <G L
• 2) if any other L' satisfies 1), then L <G L'
• If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic

**Restricting Generalisation**
• Let's go back to our original definition of generalisation:
• G generalises S iff G → S
• In the general predicate calculus case, this relation is uncomputable, so it's not very useful
• One approach to avoiding the problem is to limit the implications allowed

**Generalisation and Substitution**
• Very commonly, the generalisations we want to make involve turning a constant into a variable.
• Suppose we see a particular black crow, fred, and note:
• crow(fred) → black(fred)
• and we may wish to generalise this to
• ∀X(crow(X) → black(X))
• Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X'
• The original is a substitution instance of the generalisation
• So we could define a new, restricted generalisation:
• G subsumes S if S is a substitution instance of G
• This is a special case of our earlier definition, because a substitution instance is always implied by the original proposition.

**Learning Algorithms**
• For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell):

| Item | Sky | AirT | Hum | Wnd | Wtr | Fcst | Enjy |
|------|------|------|------|-----|------|------|------|
| 1 | Sun | Wrm | Nml | Str | Wrm | Sam | Yes |
| 2 | Sun | Wrm | High | Str | Wrm | Sam | Yes |
| 3 | Rain | Cold | High | Str | Wrm | Chng | No |
| 4 | Sun | Wrm | High | Str | Cool | Chng | Yes |

• First, we look at a really simple algorithm, Maximally Specific Learning

**Maximally Specific Learning**
• The learning language consists of tuples, one slot per attribute, constraining the values of these attributes
• A ‘?’ represents that any value is acceptable for this attribute
• A particular value represents that only that value is acceptable for this attribute
• A ‘φ’ represents that no value is acceptable for this attribute
• Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, moist days.
• Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ‘∧’) are allowed.

**Find-S**
• Find-S is a simple algorithm: its initial hypothesis is that water sport is never enjoyed
• It expands the hypothesis as positive data items are noted

**Running Find-S**
• Initial Hypothesis
• The most specific hypothesis (water sports are never enjoyed):
• h ← (φ,φ,φ,φ,φ,φ)
• After First Data Item
• Water sport is enjoyed only under the conditions of the first item:
• h ← (Sun,Wrm,Nml,Str,Wrm,Sam)
• After Second Data Item
• Water sport is enjoyed only under the common conditions of the first two items:
• h ← (Sun,Wrm,?,Str,Wrm,Sam)

**Running Find-S (continued)**
• After Third Data Item
• Since this item is negative, it has no effect on the learning hypothesis:
• h ← (Sun,Wrm,?,Str,Wrm,Sam)
• After Final Data Item
• Further generalises the conditions encountered:
• h ← (Sun,Wrm,?,Str,?,?)

**Discussion**
• We have found the most specific hypothesis consistent with the dataset in the restricted (conjunctive) language
• It is not clear that it is the best hypothesis
• If the best hypothesis is not conjunctive (e.g. if we enjoy swimming if it’s warm or sunny), it will not be found
• Find-S will not handle noise and inconsistencies well.
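
The Find-S trace above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the lecture: the names `DATA`, `generalise_slot` and `find_s` are hypothetical, and the strings `"?"` and `"φ"` simply mirror the notation used on the slides.

```python
# Training data from the lecture's table: six attribute values plus the Enjy label.
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

NONE = "φ"  # no value is acceptable for this attribute
ANY = "?"   # any value is acceptable for this attribute


def generalise_slot(h_value, x_value):
    """Minimally generalise one slot of the hypothesis so it accepts x_value."""
    if h_value == NONE:
        return x_value      # first positive example: adopt its value
    if h_value == x_value:
        return h_value      # already accepts this value
    return ANY              # conflicting values (or already '?'): accept anything


def find_s(examples, n_attributes=6):
    """Return the final hypothesis and the hypothesis after each example."""
    h = (NONE,) * n_attributes          # most specific hypothesis
    trace = [h]
    for x, label in examples:
        if label:                       # negative examples are ignored
            h = tuple(generalise_slot(hv, xv) for hv, xv in zip(h, x))
        trace.append(h)
    return h, trace


if __name__ == "__main__":
    final, trace = find_s(DATA)
    for step, h in enumerate(trace):
        print(step, h)
```

Because negative examples are skipped entirely, the sketch also makes the weakness noted in the discussion visible: a mislabelled positive item would be generalised over without any check.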
• In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here

**Version Spaces**
• One possible improvement on Find-S is to search many possible solutions in parallel
• Consistency
• A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does
• Version Space
• The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D

**List-then-Eliminate**
• Obvious algorithm
• The list-then-eliminate algorithm aims to find the version space in L for the given dataset D
• It can thus return all hypotheses which could explain D
• It works by beginning with L as its set of hypotheses H
• As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated
• The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands

**Version Space Representation**
• One of the problems with the previous algorithm is the representation of the search space
• We need to represent version spaces efficiently
• General Boundary
• The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D
• Specific Boundary
• The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D

**Version Space Representation 2**
• A version space may be represented by its general and specific boundary
• That is, given the general and specific boundaries, the whole version space may be recovered
• The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen
• Positive examples are used to generalise the specific boundary
• Negative examples permit the general boundary to be specialised.

**Candidate Elimination Algorithm**
Set G to the set of most general hypotheses in L
Set S to the set of most specific hypotheses in L
For each example d in D:
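
The definitions of consistency, version space and the two boundaries given earlier can be checked by brute force on the four-item dataset. The sketch below is a list-then-eliminate enumeration, not the Candidate Elimination Algorithm itself. It rests on assumptions that are not in the slides: the attribute domains are taken to be exactly the values that occur in the four items, generality is tested syntactically on the conjunctive tuples, and every function name is hypothetical.

```python
from itertools import product

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]
ANY, NONE = "?", "φ"


def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))


def more_general_or_equal(g, s):
    """Syntactic test: g accepts every instance that s accepts."""
    if NONE in s:
        return True  # s accepts nothing, so anything is at least as general
    return all(gv == ANY or gv == sv for gv, sv in zip(g, s))


def hypothesis_language(examples):
    """All conjunctive tuples over the observed values (an assumption), plus all-φ."""
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) for i in range(n)]
    slots = [[ANY] + d for d in domains]
    return [tuple(h) for h in product(*slots)] + [(NONE,) * n]


def list_then_eliminate(examples):
    """Start from the whole language and drop hypotheses inconsistent with each item."""
    H = hypothesis_language(examples)
    for x, label in examples:
        H = [h for h in H if covers(h, x) == label]
    return H


def boundaries(version_space):
    """Return (S, G): the maximally specific and maximally general members."""
    def strictly_more_general(a, b):
        return a != b and more_general_or_equal(a, b)

    G = [h for h in version_space
         if not any(strictly_more_general(o, h) for o in version_space)]
    S = [h for h in version_space
         if not any(strictly_more_general(h, o) for o in version_space)]
    return S, G


if __name__ == "__main__":
    vs = list_then_eliminate(DATA)
    S, G = boundaries(vs)
    print("version space:", vs)
    print("S:", S)
    print("G:", G)
```

Under these assumptions the specific boundary coincides with the Find-S result, which is what the definitions predict for a purely conjunctive language. Enumerating the language is only workable because it is tiny here; the slides' point that L is usually too large to list still stands.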