Advanced artificial intelligence lecture 3 learning

Advanced Artificial Intelligence Lecture 3: Learning. Bob McKay School of Computer Science and Engineering College of Engineering Seoul National University. Outline. Defining Learning Kinds of Learning Generalisation and Specialisation Some Simple Learning Algorithms. References.


Advanced Artificial Intelligence, Lecture 3: Learning

Bob McKay

School of Computer Science and Engineering

College of Engineering

Seoul National University

Outline

  • Defining Learning

  • Kinds of Learning

  • Generalisation and Specialisation

  • Some Simple Learning Algorithms

References

  • Mitchell, Tom M: Machine Learning, McGraw-Hill, 1997, ISBN 0 07 115467 1

Defining a Learning System (Mitchell)

  • “A program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

Specifying a Learning System

  • Specifying the task T, the performance P and the experience E defines the learning problem. Specifying the learning system requires us to define:

    • Exactly what knowledge is to be learnt

    • How this knowledge is to be represented

    • How this knowledge is to be learnt

Specifying What is to be Learnt

  • Usually, the desired knowledge can be represented as a target valuation function V: I → D

    • It takes in information about the problem and gives back a desired decision

  • Often, it is unrealistic to expect to learn the ideal function V

    • All that is required is a ‘good enough’ approximation, V’: I → D

Specifying How Knowledge is to be Represented

  • The function V’ must be represented symbolically, in some language L

    • The language may be a well-known language

      • Boolean expressions

      • Arithmetic functions

      • ….

    • Or for some systems, the language may be defined by a grammar

Specifying How the Knowledge is to be Learnt

  • If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V’

    • That is, we must specify a search algorithm

Structure of a Learning System

  • Four modules

    • The Performance System

    • The Critic

    • The Generaliser (or sometimes Specialiser)

    • The Experiment Generator

Performance Module

  • This is the system which actually uses the function V’ as we learn it

    • Learning Task

      • Learning to play checkers

    • Performance module

      • System for playing checkers

        • (I.e. makes the checkers moves)

Critic Module

  • The critic module evaluates the performance of the current V’

    • It produces a set of data from which the system can learn further

Generaliser/Specialiser Module

  • Takes a set of data and produces a new V’ for the system to run again

Experiment Generator

  • Takes the new V’

    • Maybe also uses the previous history of the system

  • Produces a new experiment for the performance system to undertake

The Importance of Bias

  • Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible.

    • Practical experience, of both machine and human learning, confirms this.

      • To learn effectively, we must limit the class of V’s.

  • Two kinds of bias are used in machine learning, separately or combined:

    • Language bias

    • Search bias

    • Combined bias

      • Language and search bias are not mutually exclusive: most learning systems feature both

Language Bias

  • The language L is restricted so that it cannot represent all possible target functions V

    • This is usually on the basis of some knowledge we have about the likely form of V’

    • It introduces risk

      • Our system will fail if L does not contain an acceptable V’

Search Bias

  • The order in which the system searches L is controlled, so that promising areas for V’ are searched first

The Downside: No Free Lunches

  • Wolpert and Macready’s No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad).

  • Conventional view

    • The choice of a learning system cannot be universal

      • It must be matched to the problem being solved

  • In most systems, the bias is not explicit

    • The ability to identify the language and search biases of a particular system is an important aspect of machine learning

  • Some more recent systems permit the explicit and flexible specification of both language and search biases

No Free Lunch: Does it Matter?

  • Alternative view

    • We aren’t interested in all problems

      • We are only interested in problems which have solutions of less than some bounded complexity

        • (so that we can understand the solutions)

    • The No Free Lunch Theorem may not apply in this case

Some Dimensions of Learning

  • Induction vs Discovery

  • Guided learning vs learning from raw data

  • Learning How vs Learning That (vs Learning a Better That)

  • Stochastic vs Deterministic; Symbolic vs Subsymbolic

  • Clean vs Noisy Data

  • Discrete vs continuous variables

  • Attribute vs Relational Learning

  • The Importance of Background Knowledge

Induction vs Discovery

  • Has the target concept been previously identified?

    • Pearson: cloud classifications from satellite data

  • vs

    • Autoclass and H - R diagrams

    • AM and prime numbers

    • BACON and Boyle's Law

Guided Learning vs Learning from Raw Data

  • Does the learning system require carefully selected examples and counterexamples, as in a teacher – student situation?

    • (allows fast learning)

    • CIGOL learning sort/merge

  • vs

    • Garvan institute's thyroid data

Learning How vs Learning That vs Learning a Better That

  • Classifying handwritten symbols

  • Distinguishing vowel sounds (Sejnowski & Rosenberg)

  • Learning to fly a (simulated!) plane

  • vs

    • Michalski & learning diagnosis of soy diseases

  • vs

    • Mitchell & learning about chess forks

Stochastic vs Deterministic; Symbolic vs Subsymbolic

    • Classifying handwritten symbols (stochastic, subsymbolic)

  • vs

    • Predicting plant distributions (stochastic, symbolic)

  • vs

    • Cloud classification (deterministic, symbolic)

  • vs

    • ? (deterministic, subsymbolic)

Clean vs Noisy Data

    • Learning to diagnose errors in programs

  • vs

    • Greater gliders in the Coolangubra

Discrete vs Continuous Variables

    • Quinlan's chess end games

  • vs

    • Pearson's clouds (eg cloud heights)

Attribute vs Relational Learning

    • Predicting plant distributions

  • vs

    • Predicting animal distributions

      • (because plants can’t move, they don’t care - much - about spatial relationships)

The Importance of Background Knowledge

    • Learning about faults in a satellite power supply

      • general electric circuit theory

      • knowledge about the particular circuit

    Generalisation and Learning

    • What do we mean when we say of two propositions, S and G, that G is a generalisation of S?

      • Suppose skippy is a grey kangaroo.

      • We would regard ‘Kangaroos are grey’ as a generalisation of ‘Skippy is grey’.

      • In any world in which ‘kangaroos are grey’ is true, ‘Skippy is grey’ will also be true.

    • In other words, if G is a generalisation of S, then S is 'at least as true' as G.

      • That is, S is true in all states of the world in which G is, and perhaps in other states as well.

    Generalisation and Inference

    • In logic, we assume that if S is true in all worlds in which G is, then

      • G → S

    • That is, G is a generalisation of S exactly when G implies S

      • So we can think of learning from S as a search for a suitable G for which G → S

    • In propositional learning, this is often used as a definition:

      • G is more general than S if and only if G → S

    Issues

    • Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed

      • In the propositional calculus, checking validity is co-NP-complete: all known algorithms take exponential time in the worst case

      • In the predicate calculus, validity is undecidable

    • So the definition is not universally useful

      • (although for some parts of logic - eg learning rules - it is perfectly adequate).

    A Common Misunderstanding

    • Suppose we have two rules,

      • 1) A ∧ B → G

      • 2) A ∧ B ∧ C → G

    • Clearly, we would want 1 to be a generalisation of 2

    • This is OK with our definition, because

      • ((A ∧ B → G) → (A ∧ B ∧ C → G))

    • is valid

      • But the confusing thing is that ((A ∧ B ∧ C) → (A ∧ B)) is also valid

        • If you only look at the hypotheses of the rule, rather than the whole rule, the implication is the wrong way around

        • Note that some textbooks are themselves confused about this
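The two directions of implication can be checked mechanically. The following is a minimal sketch, not part of the original slides: it tests validity by brute-force truth table (the helper names `implies` and `is_valid` are illustrative), which also makes the exponential cost visible, since n variables require 2^n assignments.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

def is_valid(formula, n_vars):
    """True if the formula holds under every one of the 2**n_vars assignments."""
    return all(formula(*values) for values in product([False, True], repeat=n_vars))

def rule1(a, b, c, g):
    return implies(a and b, g)          # A ∧ B → G

def rule2(a, b, c, g):
    return implies(a and b and c, g)    # A ∧ B ∧ C → G

# Whole rules: rule 1 implies rule 2, so rule 1 is the generalisation.
rule1_implies_rule2 = is_valid(lambda a, b, c, g: implies(rule1(a, b, c, g), rule2(a, b, c, g)), 4)
rule2_implies_rule1 = is_valid(lambda a, b, c, g: implies(rule2(a, b, c, g), rule1(a, b, c, g)), 4)

# Hypotheses only: the implication runs the other way.
body2_implies_body1 = is_valid(lambda a, b, c: implies(a and b and c, a and b), 3)
body1_implies_body2 = is_valid(lambda a, b, c: implies(a and b, a and b and c), 3)

print(rule1_implies_rule2, rule2_implies_rule1)   # whole rules
print(body2_implies_body1, body1_implies_body2)   # rule bodies only
```

Comparing whole rules gives "rule 1 is more general", while comparing only the rule bodies gives the reverse implication, which is exactly the source of the confusion.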

    Defining Generalisation

    • We could try to define the properties that generalisation must satisfy,

    • So let's write down some axioms. We need some notation.

      • We will write 'S <G G' as shorthand for 'S is less general than G'.

    • Axioms:

      • Transitivity: If A <G B and B <G C then also A <G C

      • Antisymmetry: If A <G B then it's not true that B <G A

      • Top: there is a unique element, ⊤, such that A <G ⊤ for every other element A.

      • Bottom: there is a unique element, ⊥, such that ⊥ <G A for every other element A.

    Picturing Generalisation

    • We can draw a 'picture' of a generalisation hierarchy satisfying these axioms:

    Specifying Generalisation

    • In a particular domain, the generalisation hierarchy may be defined in either of two ways:

      • By giving a general definition of what generalisation means in that domain

        • Example: our earlier definition in terms of implication

      • By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy

    Learning and Generalisation

    • How does learning relate to generalisation?

      • We can view most learning as an attempt to find an appropriate generalisation that generalises the examples.

      • In noise free domains, we usually want the generalisation to cover all the examples.

      • Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is.

    • In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy.

      • The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually, no higher 'than it needs to be'

    Searching the Generalisation Hierarchy

    • The commonest approaches are:

      • generalising search

        • the search is upward from the original examples, towards the more general hypotheses

      • specialising search

        • the search is downward from the most general hypothesis, towards the more special examples

      • Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once.

    Completeness and Generalisation

    • Many approaches to axiomatising generalisation add an extra axiom:

      • Completeness: For any set Σ of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties:

        • 1) for every S in Σ, S <G L

        • 2) if any other L' satisfies 1), then L <G L'

      • If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic
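The Least Common Multiple analogy can be made concrete. In this sketch (an illustration only, with 'a is less general than b' read as 'a divides b', and a non-strict order for simplicity), the LCM of a set plays the role of the least general generalisation: it is above every member, and below every other candidate that is above every member.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def least_upper_bound(numbers):
    """'Least general generalisation' under the divisibility order."""
    return reduce(lcm, numbers)

sigma = [4, 6, 10]
L = least_upper_bound(sigma)

# Property 1: every member of sigma is 'below' L (divides L).
prop1 = all(L % s == 0 for s in sigma)

# Property 2: any other common multiple (checked here only up to an arbitrary
# bound of 1000) is 'above' L, i.e. is itself a multiple of L.
others = [m for m in range(1, 1000) if all(m % s == 0 for s in sigma)]
prop2 = all(m % L == 0 for m in others)

print(L, prop1, prop2)
```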

    Restricting Generalisation

    • Let's go back to our original definition of generalisation:

      • G generalises S iff G → S

    • In the general predicate calculus case, this relation is uncomputable, so it's not very useful

    • One approach to avoiding the problem is to limit the implications allowed

    Generalisation and Substitution

    • Very commonly, the generalisations we want to make involve turning a constant into a variable.

      • So we see a particular black crow, fred, so we notice:

        • crow(fred) → black(fred)

      • and we may wish to generalise this to

        • ∀X(crow(X) → black(X))

    • Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X'

      • The original is a substitution instance of the generalisation

      • So we could define a new, restricted generalisation:

        • G subsumes S if S is a substitution instance of G

    • Subsumption is a special case of our earlier definition, because a substitution instance is always implied by the original (universally quantified) proposition.
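A substitution-instance test is easy to sketch. The encoding below is an assumption made for illustration: an atom is a tuple such as ("crow", "X"), a rule is a list of atoms matched position by position, and any term starting with an upper-case letter is a variable. This is a simplification of full subsumption checking, which would also allow atoms to be reordered.

```python
def is_variable(term):
    """Illustrative convention: variables start with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def match_atom(g_atom, s_atom, theta):
    """Extend substitution theta so that g_atom becomes s_atom, or return None."""
    if g_atom[0] != s_atom[0] or len(g_atom) != len(s_atom):
        return None
    theta = dict(theta)
    for g_term, s_term in zip(g_atom[1:], s_atom[1:]):
        if is_variable(g_term):
            # A variable must be bound to the same term everywhere it occurs.
            if theta.setdefault(g_term, s_term) != s_term:
                return None
        elif g_term != s_term:
            return None
    return theta

def substitution_for(general, specific):
    """Return a substitution turning `general` into `specific`, else None."""
    if len(general) != len(specific):
        return None
    theta = {}
    for g_atom, s_atom in zip(general, specific):
        theta = match_atom(g_atom, s_atom, theta)
        if theta is None:
            return None
    return theta

general = [("crow", "X"), ("black", "X")]          # crow(X) → black(X)
specific = [("crow", "fred"), ("black", "fred")]   # crow(fred) → black(fred)

theta = substitution_for(general, specific)
print(theta)                                  # substitution, if one exists
print(substitution_for(specific, general))    # the reverse direction
```

G subsumes S exactly when `substitution_for(G, S)` returns a substitution rather than None.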

    Learning Algorithms

    • For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell):

      Item  Sky   AirT  Hum   Wnd  Wtr   Fcst  Enjy
      1     Sun   Wrm   Nml   Str  Wrm   Sam   Yes
      2     Sun   Wrm   High  Str  Wrm   Sam   Yes
      3     Rain  Cold  High  Str  Wrm   Chng  No
      4     Sun   Wrm   High  Str  Cool  Chng  Yes

    • First, we look at a really simple algorithm, Maximally Specific Learning

    Maximally Specific Learning

    • The learning language consists of sets of tuples, representing the values of these attributes

      • A ‘?’ represents that any value is acceptable for this attribute

      • A particular value represents that only that value is acceptable for this attribute

      • A ‘φ’ represents that no value is acceptable for this attribute

      • Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, humid days.

    • Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ‘∧’) are allowed.
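As a minimal sketch of this hypothesis language (the function name `covers` and the tuple encoding are illustrative choices, not from the slides), a hypothesis accepts an example when every attribute constraint is satisfied:

```python
ANY = "?"     # any value is acceptable for this attribute
NONE = "φ"    # no value is acceptable for this attribute

def covers(hypothesis, example):
    """True if every attribute of the example satisfies the hypothesis."""
    return all(h == ANY or (h != NONE and h == x)
               for h, x in zip(hypothesis, example))

cold_humid = (ANY, "Cold", "High", ANY, ANY, ANY)
item1 = ("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")
item3 = ("Rain", "Cold", "High", "Str", "Wrm", "Chng")

print(covers(cold_humid, item1))       # item 1 is warm, so it is rejected
print(covers(cold_humid, item3))       # item 3 is cold and humid
print(covers((NONE,) * 6, item3))      # the all-φ hypothesis accepts nothing
```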

    Find-S

    • Find-S is a simple algorithm: its initial hypothesis is that water sport is never enjoyed

      • It expands the hypothesis as positive data items are noted

    Running Find-S

    • Initial Hypothesis

      • The most specific hypothesis (water sports are never enjoyed):

      • h ← (φ,φ,φ,φ,φ,φ)

    • After First Data Item

      • Water sport is enjoyed only under the conditions of the first item:

      • h ← (Sun,Wrm,Nml,Str,Wrm,Sam)

    • After Second Data Item

      • Water sport is enjoyed only under the common conditions of the first two items:

      • h ← (Sun,Wrm,?,Str,Wrm,Sam)


    • After Third Data Item

      • Since this item is negative, it has no effect on the learning hypothesis:

      • h ← (Sun,Wrm,?,Str,Wrm,Sam)

    • After Final Data Item

      • Further generalises the conditions encountered:

      • h ← (Sun,Wrm,?,Str,?,?)
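The run above can be reproduced with a short sketch of Find-S. The data encoding and the function names `generalise` and `find_s` are assumptions made for illustration; the dataset is the four-item table from the lecture.

```python
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def generalise(h, x):
    """Minimal generalisation of hypothesis h that also covers example x."""
    return tuple(
        xv if hv == "φ" else (hv if hv == xv else "?")
        for hv, xv in zip(h, x)
    )

def find_s(data, n_attributes=6):
    """Start from the most specific hypothesis; generalise on positives only."""
    h = ("φ",) * n_attributes
    trace = [h]
    for x, positive in data:
        if positive:            # negative items are ignored by Find-S
            h = generalise(h, x)
        trace.append(h)
    return h, trace

final, trace = find_s(DATA)
for step, h in enumerate(trace):
    print(step, h)
```

Each entry of `trace` corresponds to one of the hypotheses listed in the walk-through, starting with the all-φ hypothesis.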

    Discussion

    • We have found the most specific hypothesis corresponding to the dataset and the restricted (conjunctive) language

    • It is not clear it is the best hypothesis

      • If the best hypothesis is not conjunctive (eg if we enjoy swimming if it’s warm or sunny), it will not be found

      • Find-S will not handle noise and inconsistencies well.

      • In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here

    Version Spaces

    • One possible improvement on Find-S is to search many possible solutions in parallel

    • Consistency

      • A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does

    • Version Space

      • The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D

    List-Then-Eliminate

    • Obvious algorithm

      • The list-then-eliminate algorithm aims to find the version space in L for the given dataset D

      • It can thus return all hypotheses which could explain D

    • It works by beginning with L as its set of hypotheses H

      • As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated

    • The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands
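For a language as small as the conjunctive tuples, list-then-eliminate can actually be run. The sketch below makes one loud assumption: each attribute's domain is taken to be just the values that occur in the four-item dataset (the slides do not list the full domains). All names are illustrative.

```python
from itertools import product

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]

def covers(h, x):
    """True if hypothesis h accepts example x ('φ' never matches)."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def consistent(h, x, label):
    """h gives the same answer on x as the dataset does."""
    return covers(h, x) == label

# ASSUMPTION: attribute domains are the values observed in DATA.
domains = [sorted({x[i] for x, _ in DATA}) for i in range(6)]

# The language L: a value or '?' for each attribute, plus the all-φ hypothesis.
language = list(product(*[d + ["?"] for d in domains]))
language.append(("φ",) * 6)

def list_then_eliminate(language, data):
    """Start with all of L; drop hypotheses inconsistent with each item."""
    hypotheses = list(language)
    for x, label in data:
        hypotheses = [h for h in hypotheses if consistent(h, x, label)]
    return hypotheses

version_space = list_then_eliminate(language, DATA)
print(len(language), len(version_space))
for h in version_space:
    print(h)
```

Even here the whole language must be enumerated up front, which is what makes the approach impractical for larger or infinite languages.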

    Version Space Representation

    • One of the problems with the previous algorithm is the representation of the search space

      • We need to represent version spaces efficiently

    • General Boundary

      • The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D

    • Specific Boundary

      • The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D

    Version Space Representation 2

    • A version space may be represented by its general and specific boundary

    • That is, given the general and specific boundaries, the whole version space may be recovered

    • The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen

      • Positive examples are used to generalise the specific boundary

      • Negative examples permit the general boundary to be specialised.

    Candidate Elimination Algorithm

    Set G to the set of most general hypotheses in L
    Set S to the set of most specific hypotheses in L
    For each example d in D:
        If d is a positive example:
            Remove from G any hypothesis inconsistent with d
            For each hypothesis s in S that is not consistent with d:
                Remove s from S
                Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h
            Remove from S any hypothesis that is more general than another hypothesis in S
        If d is a negative example:
            Remove from S any hypothesis inconsistent with d
            For each hypothesis g in G that is not consistent with d:
                Remove g from G
                Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h
            Remove from G any hypothesis that is less general than another hypothesis in G
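The pseudocode can be sketched in Python for the conjunctive tuple language. This is an illustration under stated assumptions, not a definitive implementation: attribute domains are taken from the values seen in the lecture's four-item dataset, and the helper names are hypothetical.

```python
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]
N = 6
# ASSUMPTION: attribute domains are the values observed in DATA.
DOMAINS = [sorted({x[i] for x, _ in DATA}) for i in range(N)]

def covers(h, x):
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if h1 accepts every example that h2 accepts."""
    if "φ" in h2:               # h2 accepts nothing at all
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def min_generalisations(s, x):
    """In this language the minimal generalisation covering x is unique."""
    return [tuple(xv if sv == "φ" else (sv if sv == xv else "?")
                  for sv, xv in zip(s, x))]

def min_specialisations(g, x):
    """Replace one '?' by a value that differs from x, so x is excluded."""
    return [g[:i] + (v,) + g[i + 1:]
            for i in range(N) if g[i] == "?"
            for v in DOMAINS[i] if v != x[i]]

def candidate_elimination(data):
    G = [("?",) * N]
    S = [("φ",) * N]
    for x, positive in data:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:
                    new_S += [h for h in min_generalisations(s, x)
                              if any(more_general_or_equal(g, h) for g in G)]
            # Keep only the most specific members of S.
            S = [s for s in new_S
                 if not any(t != s and more_general_or_equal(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                else:
                    new_G += [h for h in min_specialisations(g, x)
                              if any(more_general_or_equal(h, s) for s in S)]
            # Keep only the most general members of G.
            G = [g for g in new_G
                 if not any(t != g and more_general_or_equal(t, g) for t in new_G)]
    return S, G

S, G = candidate_elimination(DATA)
print("S =", S)
print("G =", G)
```

On this dataset the specific boundary coincides with the Find-S result, and the general boundary contains two hypotheses, one constraining only Sky and one constraining only AirT.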

    Summary

    • Defining Learning

    • Kinds of Learning

    • Generalisation and Specialisation

    • Some Simple Learning Algorithms

      • Find-S

      • Version Spaces

        • List-then-Eliminate

        • Candidate Elimination