Advanced Artificial Intelligence Lecture 3: Learning

Advanced Artificial Intelligence Lecture 3: Learning. Bob McKay School of Computer Science and Engineering College of Engineering Seoul National University. Outline. Defining Learning Kinds of Learning Generalisation and Specialisation Some Simple Learning Algorithms. References.


Bob McKay

School of Computer Science and Engineering

College of Engineering

Seoul National University

Outline
  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms
References
  • Mitchell, Tom M: Machine Learning, McGraw-Hill, 1997, ISBN 0 07 115467 1
Defining a Learning System (Mitchell)
  • “A program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”
Specifying a Learning System
  • Specifying the task T, the performance P and the experience E defines the learning problem. Specifying the learning system requires us to define:
    • Exactly what knowledge is to be learnt
    • How this knowledge is to be represented
    • How this knowledge is to be learnt
Specifying What is to be Learnt
  • Usually, the desired knowledge can be represented as a target valuation function V: I → D
    • It takes in information about the problem and gives back a desired decision
  • Often, it is unrealistic to expect to learn the ideal function V
    • All that is required is a ‘good enough’ approximation, V’: I → D
Specifying How Knowledge is to be Represented
  • The function V’ must be represented symbolically, in some language L
    • The language may be a well-known language
      • Boolean expressions
      • Arithmetic functions
      • ….
    • Or for some systems, the language may be defined by a grammar
Specifying How the Knowledge is to be Learnt
  • If the learning system is to be implemented, we must specify an algorithm A, which defines the way in which the system is to search the language L for an acceptable V’
    • That is, we must specify a search algorithm
Structure of a Learning System
  • Four modules
    • The Performance System
    • The Critic
    • The Generaliser (or sometimes Specialiser)
    • The Experiment Generator
Performance Module
  • This is the system which actually uses the function V’ as we learn it
    • Learning Task
      • Learning to play checkers
    • Performance module
      • System for playing checkers
        • (I.e. makes the checkers moves)
Critic Module
  • The critic module evaluates the performance of the current V’
    • It produces a set of data from which the system can learn further
Generaliser/Specialiser Module
  • Takes a set of data and produces a new V’ for the system to run again
Experiment Generator
  • Takes the new V’
    • Maybe also uses the previous history of the system
  • Produces a new experiment for the performance system to undertake
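The way the four modules feed each other can be sketched as a loop. This is only an illustrative toy, not the checkers system from the slides: the task (learning a numeric threshold), the function names, and the hidden target value are all assumptions made for the example.

```python
import random

TRUE_THRESHOLD = 0.62  # hypothetical hidden target V, unknown to the learner


def target(x):
    """The ideal function V (an assumption of this sketch)."""
    return x >= TRUE_THRESHOLD


def performance_system(threshold, experiment):
    """Use the current V' (a threshold) to decide each input of the experiment."""
    return [(x, x >= threshold) for x in experiment]


def critic(trace):
    """Turn the performance trace into labelled data the system can learn from."""
    return [(x, target(x)) for x, _decision in trace]


def generaliser(examples, lo, hi):
    """Narrow the interval (lo, hi] that must contain the true threshold."""
    for x, label in examples:
        if label:
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return lo, hi


def experiment_generator(lo, hi, rng, size=5):
    """Propose new inputs where the current V' is still uncertain."""
    return [rng.uniform(lo, hi) for _ in range(size)]


rng = random.Random(0)
lo, hi = 0.0, 1.0
for _ in range(10):
    threshold = (lo + hi) / 2                       # current V'
    experiment = experiment_generator(lo, hi, rng)
    trace = performance_system(threshold, experiment)
    lo, hi = generaliser(critic(trace), lo, hi)

learned_threshold = (lo + hi) / 2
print(lo, hi, learned_threshold)
```

In this toy the experiment generator uses the history of the system (the current interval) to pick informative inputs, which is one possible reading of the module's role.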
The Importance of Bias
  • Important theoretical results from learning theory (PAC learning) tell us that learning without some presuppositions is infeasible.
    • Practical experience, of both machine and human learning, confirms this.
      • To learn effectively, we must limit the class of V’s.
  • Two approaches are used in machine learning:
    • Language bias
    • Search Bias
    • Combined Bias
      • Language and search bias are not mutually exclusive: most learning systems feature both
Language Bias
  • The language L is restricted so that it cannot represent all possible target functions V
    • This is usually on the basis of some knowledge we have about the likely form of V’
    • It introduces risk
      • Our system will fail if L does not contain an acceptable V’
Search Bias
  • The order in which the system searches L is controlled, so that promising areas for V’ are searched first
The Downside: No Free Lunches
  • Wolpert and Macready’s No Free Lunch Theorem states, in effect, that averaged over all problems, all biases are equally good (or bad).
  • Conventional view
    • The choice of a learning system cannot be universal
      • It must be matched to the problem being solved
  • In most systems, the bias is not explicit
    • The ability to identify the language and search biases of a particular system is an important aspect of machine learning
  • Some more recent systems permit the explicit and flexible specification of both language and search biases
No Free Lunch: Does it Matter?
  • Alternative view
    • We aren’t interested in all problems
      • We are only interested in problems which have solutions of less than some bounded complexity
        • (so that we can understand the solutions)
    • The No Free Lunch Theorem may not apply in this case
Some Dimensions of Learning
  • Induction vs Discovery
  • Guided learning vs learning from raw data
  • Learning How vs Learning That (vs Learning a Better That)
  • Stochastic vs Deterministic; Symbolic vs Subsymbolic
  • Clean vs Noisy Data
  • Discrete vs continuous variables
  • Attribute vs Relational Learning
  • The Importance of Background Knowledge
Induction vs Discovery
  • Has the target concept been previously identified?
    • Pearson: cloud classifications from satellite data
  • vs
    • Autoclass and H-R diagrams
    • AM and prime numbers
    • BACON and Boyle's Law
Guided Learning vs Learning from Raw Data
  • Does the learning system require carefully selected examples and counterexamples, as in a teacher – student situation?
    • (allows fast learning)
    • CIGOL learning sort/merge
  • vs
    • Garvan institute's thyroid data
Learning How vs Learning That vs Learning a Better That
    • Classifying handwritten symbols
    • Distinguishing vowel sounds (Sejnowski & Rosenberg)
    • Learning to fly a (simulated!) plane
  • vs
    • Michalski & learning diagnosis of soy diseases
  • vs
    • Mitchell & learning about chess forks
Stochastic vs Deterministic; Symbolic vs Subsymbolic
    • Classifying handwritten symbols (stochastic, subsymbolic)
  • vs
    • Predicting plant distributions (stochastic, symbolic)
  • vs
    • Cloud classification (deterministic, symbolic)
  • vs
    • ? (deterministic, subsymbolic)
Clean vs Noisy Data
    • Learning to diagnose errors in programs
  • vs
    • Greater gliders in the Coolangubra
Discrete vs Continuous Variables
    • Quinlan's chess end games
  • vs
    • Pearson's clouds (eg cloud heights)
Attribute vs Relational Learning
    • Predicting plant distributions
  • vs
    • Predicting animal distributions
      • (because plants can’t move, they don’t care - much - about spatial relationships)
The Importance of Background Knowledge
  • Learning about faults in a satellite power supply
    • general electric circuit theory
    • knowledge about the particular circuit
Generalisation and Learning
  • What do we mean when we say of two propositions, S and G, that G is a generalisation of S?
    • Suppose Skippy is a grey kangaroo.
    • We would regard ‘Kangaroos are grey’ as a generalisation of ‘Skippy is grey’.
    • In any world in which ‘Kangaroos are grey’ is true, ‘Skippy is grey’ will also be true.
  • In other words, if G is a generalisation of S, then S is 'at least as true' as G.
    • That is, S is true in all states of the world in which G is, and perhaps in other states as well.
Generalisation and Inference
  • In logic, we assume that if S is true in all worlds in which G is, then
    • G → S
  • That is, G is a generalisation of S exactly when G implies S
    • So we can think of learning from S as a search for a suitable G for which G → S
  • In propositional learning, this is often used as a definition:
    • G is more general than S if and only if G → S
  • Equating generalisation and logical implication is only useful if the validity of an implication can be readily computed
    • In the propositional calculus, checking validity takes exponential time in the worst case (for all known algorithms)
    • In the predicate calculus, validity is undecidable
  • So the definition is not universally useful
    • (although for some parts of logic - eg learning rules - it is perfectly adequate).
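To make the cost concrete, here is a minimal sketch of the propositional definition, checking G → S by brute-force truth tables. The formula representation (Python functions over an assignment dictionary) and the two-kangaroo world are assumptions made for illustration; the point is that the loop visits all 2^n assignments.

```python
from itertools import product
from typing import Callable, Dict, Sequence

# A formula is modelled as a function from a truth assignment to a truth value.
Formula = Callable[[Dict[str, bool]], bool]


def is_valid(formula: Formula, variables: Sequence[str]) -> bool:
    """True iff the formula holds under every one of the 2**n assignments."""
    return all(
        formula(dict(zip(variables, values)))
        for values in product([False, True], repeat=len(variables))
    )


def more_general(g: Formula, s: Formula, variables: Sequence[str]) -> bool:
    """G is more general than S iff the implication G -> S is valid."""
    return is_valid(lambda w: (not g(w)) or s(w), variables)


# Hypothetical world with just two kangaroos, Skippy and Joey.
variables = ["skippy_grey", "joey_grey"]


def kangaroos_are_grey(w):
    return w["skippy_grey"] and w["joey_grey"]


def skippy_is_grey(w):
    return w["skippy_grey"]


print(more_general(kangaroos_are_grey, skippy_is_grey, variables))  # True
print(more_general(skippy_is_grey, kangaroos_are_grey, variables))  # False
```

With n propositional variables the check enumerates 2^n rows, which is why this definition becomes impractical as the language grows.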
A Common Misunderstanding
  • Suppose we have two rules,
    • 1) A ∧ B → G
    • 2) A ∧ B ∧ C → G
  • Clearly, we would want 1 to be a generalisation of 2
  • This is OK with our definition, because
    • ((A ∧ B → G) → (A ∧ B ∧ C → G))
  • is valid
    • But the confusing thing is that ((A ∧ B ∧ C) → (A ∧ B)) is also valid
      • If you only look at the hypotheses (antecedents) of the rule, rather than the whole rule, the implication is the wrong way around
      • Note that some textbooks are themselves confused about this
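Both directions can be checked mechanically. The following sketch enumerates all sixteen truth assignments to A, B, C and G; the helper names are assumptions introduced for the example.

```python
from itertools import product


def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q


def valid(formula, arity):
    """True iff the formula is true under every truth assignment."""
    return all(formula(*values) for values in product([False, True], repeat=arity))


def rule1(a, b, c, g):
    return implies(a and b, g)            # A ∧ B → G


def rule2(a, b, c, g):
    return implies(a and b and c, g)      # A ∧ B ∧ C → G


# Whole rules: rule 1 implies rule 2, so rule 1 is the generalisation.
rule1_implies_rule2 = valid(lambda a, b, c, g: implies(rule1(a, b, c, g), rule2(a, b, c, g)), 4)
rule2_implies_rule1 = valid(lambda a, b, c, g: implies(rule2(a, b, c, g), rule1(a, b, c, g)), 4)

# Antecedents alone: the implication runs the other way.
body2_implies_body1 = valid(lambda a, b, c, g: implies(a and b and c, a and b), 4)
body1_implies_body2 = valid(lambda a, b, c, g: implies(a and b, a and b and c), 4)

print(rule1_implies_rule2, rule2_implies_rule1)   # True False
print(body2_implies_body1, body1_implies_body2)   # True False
```

The assignment A = B = true, C = G = false is the counterexample in the failing direction: rule 2 holds vacuously there, while rule 1 does not.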
Defining Generalisation
  • We could try to define the properties that generalisation must satisfy,
  • So let's write down some axioms. We need some notation.
    • We will write 'S <G G' as shorthand for 'S is less general than G'.
  • Axioms:
    • Transitivity: If A <G B and B <G C then also A <G C
    • Antisymmetry: If A <G B then it's not true that B <G A
    • Top: there is a unique element, ⊤, for which A <G ⊤ is true of every other element A.
    • Bottom: there is a unique element, ⊥, for which ⊥ <G A is true of every other element A.
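As a sanity check, the axioms can be tested on a small concrete hierarchy. The sketch below assumes one particular hierarchy that is not from the slides: hypotheses are identified with the set of instances they cover, and 'less general' means 'covers a proper subset'. Top and Bottom are read as 'every other element lies below the top element and above the bottom element'.

```python
from itertools import combinations

# Hypothetical instance space; each hypothesis is the set of instances it covers.
instances = ("a", "b", "c")
hierarchy = [
    frozenset(subset)
    for size in range(len(instances) + 1)
    for subset in combinations(instances, size)
]


def less_general(s, g):
    """S <G G: S covers a proper subset of what G covers."""
    return s < g


transitive = all(
    less_general(a, c)
    for a in hierarchy for b in hierarchy for c in hierarchy
    if less_general(a, b) and less_general(b, c)
)
antisymmetric = all(
    not less_general(b, a)
    for a in hierarchy for b in hierarchy
    if less_general(a, b)
)
tops = [t for t in hierarchy if all(less_general(a, t) for a in hierarchy if a != t)]
bottoms = [b for b in hierarchy if all(less_general(b, a) for a in hierarchy if a != b)]

print(transitive, antisymmetric)   # True True
print(tops, bottoms)               # the full set and the empty set respectively
```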
Picturing Generalisation
  • We can draw a 'picture' of a generalisation hierarchy satisfying these axioms:
Specifying Generalisation
  • In a particular domain, the generalisation hierarchy may be defined in either of two ways:
    • By giving a general definition of what generalisation means in that domain
      • Example: our earlier definition in terms of implication
    • By directly specifying the specialisation and generalisation operators that may be used to climb up and down the links in the generalisation hierarchy
Learning and Generalisation
  • How does learning relate to generalisation?
    • We can view most learning as an attempt to find an appropriate generalisation of the examples.
    • In noise free domains, we usually want the generalisation to cover all the examples.
    • Once we introduce noise, we want the generalisation to cover 'enough' examples, and the interesting bit is in defining what 'enough' is.
  • In our picture of a generalisation hierarchy, most learning algorithms can be viewed as methods for searching the hierarchy.
    • The examples can be pictured as locations low down in the hierarchy, and the learning algorithm attempts to find a location that is above all (or 'enough') of them in the hierarchy, but usually, no higher 'than it needs to be'
Searching the Generalisation Hierarchy
  • The commonest approaches are:
    • generalising search
      • the search is upward from the original examples, towards the more general hypotheses
    • specialising search
      • the search is downward from the most general hypothesis, towards the more special examples
    • Some algorithms use different approaches. Mitchell's version space approach, for example, tries to 'home in' on the right generalisation from both directions at once.
Completeness and Generalisation
  • Many approaches to axiomatising generalisation add an extra axiom:
    • Completeness: For any set Σ of members of the generalisation hierarchy, there is a unique 'least general generalisation' L, which satisfies two properties:
      • 1) for every S in Σ, S <G L
      • 2) if any other L' satisfies 1), then L <G L'
    • If this definition is hard to understand, compare it with the definition of 'Least Upper Bound' in set theory, or of 'Least Common Multiple' in arithmetic
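For a concrete feel, here is a sketch of a least general generalisation in one simple language: tuples of attribute values in which '?' stands for 'any value'. The language, the function names and the example tuples are assumptions for illustration; in this language the least general generalisation keeps a value where all members agree and uses '?' elsewhere.

```python
from functools import reduce

ANY = "?"


def lgg_pair(h1, h2):
    """Least general generalisation of two attribute tuples."""
    return tuple(a if a == b else ANY for a, b in zip(h1, h2))


def lgg(hypotheses):
    """Least general generalisation of a non-empty collection of tuples."""
    return reduce(lgg_pair, hypotheses)


def at_most_as_general(s, g):
    """True iff g is equal to s or more general than s."""
    return all(gv == ANY or gv == sv for sv, gv in zip(s, g))


# Hypothetical three-attribute examples (sky, air temperature, humidity).
sigma = [
    ("Sun", "Wrm", "Nml"),
    ("Sun", "Wrm", "High"),
    ("Sun", "Cold", "High"),
]
least = lgg(sigma)
print(least)  # ('Sun', '?', '?')

# Property 1: the result lies above every member of the set.
print(all(at_most_as_general(s, least) for s in sigma))  # True
```

Any other tuple lying above all three members, such as ('?', '?', '?'), is also above the result, which is property 2 of the completeness axiom in this small case.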
Restricting Generalisation
  • Let's go back to our original definition of generalisation:
    • G generalises S iff G → S
  • In the general predicate calculus case, this relation is uncomputable, so it's not very useful
  • One approach to avoiding the problem is to limit the implications allowed
Generalisation and Substitution
  • Very commonly, the generalisations we want to make involve turning a constant into a variable.
    • So we see a particular black crow, fred, so we notice:
      • crow(fred) → black(fred)
    • and we may wish to generalise this to
      • ∀X(crow(X) → black(X))
  • Notice that the original proposition can be recovered from the generalisation by substituting 'fred' for the variable 'X'
    • The original is a substitution instance of the generalisation
    • So we could define a new, restricted generalisation:
      • G subsumes S if S is a substitution instance of G
  • Subsumption is a special case of our earlier definition, because a substitution instance is always implied by the proposition it instantiates.
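Unlike general implication, the substitution-instance test is cheap to compute. The sketch below is one possible implementation under stated assumptions: formulas are nested tuples, and, following the Prolog convention used on the slide, names starting with an upper-case letter are variables while lower-case names are constants or predicate symbols. The function names are hypothetical.

```python
def is_variable(term):
    """Variables are strings starting with an upper-case letter (e.g. 'X')."""
    return isinstance(term, str) and term[:1].isupper()


def match(general, specific, subst=None):
    """Return a substitution mapping `general` onto `specific`, or None if none exists."""
    subst = dict(subst or {})
    if is_variable(general):
        if general in subst:
            # A variable must be replaced by the same term everywhere.
            return subst if subst[general] == specific else None
        subst[general] = specific
        return subst
    if isinstance(general, tuple) and isinstance(specific, tuple):
        if len(general) != len(specific):
            return None
        for g, s in zip(general, specific):
            subst = match(g, s, subst)
            if subst is None:
                return None
        return subst
    return subst if general == specific else None


def subsumes(general, specific):
    """G subsumes S if S is a substitution instance of G."""
    return match(general, specific) is not None


rule = ("implies", ("crow", "X"), ("black", "X"))
fred = ("implies", ("crow", "fred"), ("black", "fred"))
mixed = ("implies", ("crow", "fred"), ("black", "tweety"))

print(match(rule, fred))      # {'X': 'fred'}
print(subsumes(rule, mixed))  # False: X cannot be both fred and tweety
print(subsumes(fred, rule))   # False: a constant does not match a variable
```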
Learning Algorithms
  • For the rest of this lecture, we will work with a specific learning dataset (due to Mitchell):
    Item  Sky   AirT  Hum   Wnd  Wtr   Fcst  Enjy
    1     Sun   Wrm   Nml   Str  Wrm   Sam   Yes
    2     Sun   Wrm   High  Str  Wrm   Sam   Yes
    3     Rain  Cold  High  Str  Wrm   Chng  No
    4     Sun   Wrm   High  Str  Cool  Chng  Yes
  • First, we look at a really simple algorithm, Maximally Specific Learning
Maximally Specific Learning
  • The learning language consists of sets of tuples, representing the values of these attributes
    • A ‘?’ represents that any value is acceptable for this attribute
    • A particular value represents that only that value is acceptable for this attribute
    • A ‘φ’ represents that no value is acceptable for this attribute
    • Thus (?, Cold, High, ?, ?, ?) represents the hypothesis that water sport is enjoyed only on cold, moist days.
  • Note that our language is already heavily biased: only conjunctive hypotheses (hypotheses built with ‘∧’) are allowed.
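One way to make this language concrete is to store a hypothesis as a tuple and test whether it accepts a data item. The function name `covers` and the encoding of the three symbols as strings are assumptions of this sketch.

```python
ANY = "?"    # any value is acceptable for this attribute
NONE = "φ"   # no value is acceptable for this attribute


def covers(hypothesis, instance):
    """True iff every attribute of the instance is acceptable to the hypothesis.

    The test is a conjunction over attributes, which is exactly the
    conjunctive bias of the language. NONE never equals a data value,
    so a hypothesis containing it accepts nothing.
    """
    return all(h == ANY or h == x for h, x in zip(hypothesis, instance))


cold_and_humid = (ANY, "Cold", "High", ANY, ANY, ANY)
item1 = ("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam")
item3 = ("Rain", "Cold", "High", "Str", "Wrm", "Chng")

print(covers(cold_and_humid, item3))   # True
print(covers(cold_and_humid, item1))   # False
print(covers((NONE,) * 6, item1))      # False
```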
Find-S
  • Find-S is a simple algorithm: its initial hypothesis is that water sport is never enjoyed
    • It expands the hypothesis as positive data items are noted
Running Find-S
  • Initial Hypothesis
    • The most specific hypothesis (water sports are never enjoyed):
    • h ← (φ,φ,φ,φ,φ,φ)
  • After First Data Item
    • Water sport is enjoyed only under the conditions of the first item:
    • h ← (Sun,Wrm,Nml,Str,Wrm,Sam)
  • After Second Data Item
    • Water sport is enjoyed only under the common conditions of the first two items:
    • h ← (Sun,Wrm,?,Str,Wrm,Sam)
Running Find-S
  • After Third Data Item
    • Since this item is negative, it has no effect on the learning hypothesis:
    • h ← (Sun,Wrm,?,Str,Wrm,Sam)
  • After Final Data Item
    • Further generalises the conditions encountered:
    • h ← (Sun,Wrm,?,Str,?,?)
  • We have found the most specific hypothesis corresponding to the dataset and the restricted (conjunctive) language
  • It is not clear it is the best hypothesis
    • If the best hypothesis is not conjunctive (eg if we enjoy swimming if it’s warm or sunny), it will not be found
    • Find-S will not handle noise and inconsistencies well.
    • In other languages (not using pure conjunction) there may be more than one maximally specific hypothesis; Find-S will not work well here
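The run above can be reproduced with a few lines of code. This is a minimal sketch of Find-S for the conjunctive tuple language only; the data encoding and the function name are assumptions, and no attempt is made to handle noise.

```python
# The lecture's dataset: (attribute values, whether water sport was enjoyed).
DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]


def find_s(examples):
    """Return the final hypothesis and the hypothesis after each data item."""
    h = ("φ",) * len(examples[0][0])      # most specific hypothesis
    trace = [h]
    for x, positive in examples:
        if positive:
            # Minimally generalise h so that it accepts x.
            h = tuple(
                xv if hv == "φ" else (hv if hv == xv else "?")
                for hv, xv in zip(h, x)
            )
        # Negative items are ignored by Find-S.
        trace.append(h)
    return h, trace


hypothesis, trace = find_s(DATA)
for step in trace:
    print(step)
# final line printed: ('Sun', 'Wrm', '?', 'Str', '?', '?')
```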
Version Spaces
  • One possible improvement on Find-S is to search many possible solutions in parallel
  • Consistency
    • A hypothesis h is consistent with a dataset D of training examples iff h gives the same answer on every element of the dataset as the dataset does
  • Version Space
    • The version space with respect to the language L and the dataset D is the set of hypotheses h in the language L which are consistent with D
List-Then-Eliminate
  • Obvious algorithm
    • The list-then-eliminate algorithm aims to find the version space in L for the given dataset D
    • It can thus return all hypotheses which could explain D
  • It works by beginning with L as its set of hypotheses H
    • As each item d of the dataset D is examined in turn, any hypotheses in H which are inconsistent with d are eliminated
  • The language L is usually large, and often infinite, so this algorithm is computationally infeasible as it stands
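For the small weather dataset the language is finite, so the algorithm can actually be run. The sketch below assumes that each attribute's domain is just the set of values appearing in the data (the slides do not list the domains), and it includes a single all-'φ' hypothesis to stand for 'never enjoyed'.

```python
from itertools import product

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]


def covers(h, x):
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))


def consistent(h, examples):
    """h gives the same answer as the dataset on every example."""
    return all(covers(h, x) == label for x, label in examples)


def all_hypotheses(examples):
    """Enumerate the language L (domains are assumed to be the observed values)."""
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) + ["?"] for i in range(n)]
    yield ("φ",) * n
    yield from product(*domains)


def list_then_eliminate(examples):
    hypotheses = list(all_hypotheses(examples))      # start with H = L
    for d in examples:                               # examine each item in turn
        hypotheses = [h for h in hypotheses if consistent(h, [d])]
    return hypotheses


version_space = list_then_eliminate(DATA)
for h in version_space:
    print(h)
print(len(version_space))  # 6
```

Even here the initial list has several hundred members, and it grows multiplicatively with every added attribute or value, which is the infeasibility noted above.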
Version Space Representation
  • One of the problems with the previous algorithm is the representation of the search space
    • We need to represent version spaces efficiently
  • General Boundary
    • The general boundary G with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more general hypothesis in L which is consistent with D
  • Specific Boundary
    • The specific boundary S with respect to language L and dataset D is the set of hypotheses h in L which are consistent with D, and for which there is no more specific hypothesis in L which is consistent with D
Version Space Representation 2
  • A version space may be represented by its general and specific boundary
  • That is, given the general and specific boundaries, the whole version space may be recovered
  • The Candidate Elimination Algorithm traces the general and specific boundaries of the version space as more examples and counter-examples of the concept are seen
    • Positive examples are used to generalise the specific boundary
    • Negative examples permit the general boundary to be specialised.
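The boundary definitions can be illustrated on the weather dataset. The sketch below starts from the six conjunctive hypotheses that are consistent with all four data items (this list can be checked by hand against the table) and extracts the two boundaries; the function names are assumptions.

```python
# The version space for the lecture's dataset in the conjunctive language.
VERSION_SPACE = [
    ("Sun", "Wrm", "?", "Str", "?", "?"),
    ("Sun", "Wrm", "?", "?", "?", "?"),
    ("Sun", "?", "?", "Str", "?", "?"),
    ("?", "Wrm", "?", "Str", "?", "?"),
    ("Sun", "?", "?", "?", "?", "?"),
    ("?", "Wrm", "?", "?", "?", "?"),
]


def more_general_or_equal(g, s):
    return all(gv == "?" or gv == sv for gv, sv in zip(g, s))


def strictly_more_general(g, s):
    return g != s and more_general_or_equal(g, s)


def general_boundary(space):
    """Members with no strictly more general member in the space."""
    return [h for h in space if not any(strictly_more_general(o, h) for o in space)]


def specific_boundary(space):
    """Members with no strictly more specific member in the space."""
    return [h for h in space if not any(strictly_more_general(h, o) for o in space)]


G = general_boundary(VERSION_SPACE)
S = specific_boundary(VERSION_SPACE)
print(G)  # [('Sun', '?', '?', '?', '?', '?'), ('?', 'Wrm', '?', '?', '?', '?')]
print(S)  # [('Sun', 'Wrm', '?', 'Str', '?', '?')]

# Every member of the version space lies between some s in S and some g in G.
between = all(
    any(more_general_or_equal(g, h) for g in G)
    and any(more_general_or_equal(h, s) for s in S)
    for h in VERSION_SPACE
)
print(between)  # True
```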
Candidate Elimination Algorithm
  • Set G to the set of most general hypotheses in L
  • Set S to the set of most specific hypotheses in L
  • For each example d in D:
    • If d is a positive example
      • Remove from G any hypothesis inconsistent with d
      • For each hypothesis s in S that is not consistent with d
        • Remove s from S
        • Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h
      • Remove from S any hypothesis that is more general than another hypothesis in S
    • If d is a negative example
      • Remove from S any hypothesis inconsistent with d
      • For each hypothesis g in G that is not consistent with d
        • Remove g from G
        • Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h
      • Remove from G any hypothesis that is less general than another hypothesis in G
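The following is a minimal sketch of these steps for the conjunctive tuple language only. It relies on two assumptions that are not in the slides: attribute domains are taken to be the values observed in the data, and in this language the minimal generalisation of a hypothesis with respect to a positive example is unique. Function names are hypothetical.

```python
ANY, NONE = "?", "φ"

DATA = [
    (("Sun", "Wrm", "Nml", "Str", "Wrm", "Sam"), True),
    (("Sun", "Wrm", "High", "Str", "Wrm", "Sam"), True),
    (("Rain", "Cold", "High", "Str", "Wrm", "Chng"), False),
    (("Sun", "Wrm", "High", "Str", "Cool", "Chng"), True),
]
# Assumed domains: the values that occur in the data for each attribute.
DOMAINS = [sorted({x[i] for x, _ in DATA}) for i in range(6)]


def covers(h, x):
    return all(hv == ANY or hv == xv for hv, xv in zip(h, x))


def more_general_or_equal(g, s):
    if NONE in s:                      # s accepts nothing, so everything is above it
        return True
    return all(gv == ANY or gv == sv for gv, sv in zip(g, s))


def min_generalisation(s, x):
    return tuple(xv if sv == NONE else (sv if sv == xv else ANY) for sv, xv in zip(s, x))


def min_specialisations(g, x, domains):
    """Replace one '?' by a value that differs from the negative example's value."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, gv in enumerate(g) if gv == ANY
            for v in domains[i] if v != x[i]]


def candidate_elimination(examples, domains):
    G = {(ANY,) * len(domains)}
    S = {(NONE,) * len(domains)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            new_S = set()
            for s in S:
                h = s if covers(s, x) else min_generalisation(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            # Drop members that are more general than another member of S.
            S = {s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t) for t in new_S)}
        else:
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                candidates = [g] if not covers(g, x) else min_specialisations(g, x, domains)
                new_G.update(h for h in candidates
                             if any(more_general_or_equal(h, s) for s in S))
            # Drop members that are less general than another member of G.
            G = {g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g) for t in new_G)}
    return S, G


S, G = candidate_elimination(DATA, DOMAINS)
print(S)  # {('Sun', 'Wrm', '?', 'Str', '?', '?')}
print(G)  # the two hypotheses ('Sun', '?', ...) and ('?', 'Wrm', ...)
```

On this dataset the specific boundary ends up equal to the Find-S result, and the general boundary contains the two single-attribute hypotheses on Sky and AirT.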

  • Defining Learning
  • Kinds of Learning
  • Generalisation and Specialisation
  • Some Simple Learning Algorithms
    • Find-S
    • Version Spaces
      • List-then-Eliminate
      • Candidate Elimination