
Learning

R&N: ch 19, ch 20


Types of Learning

  • Supervised Learning - classification, prediction

  • Unsupervised Learning – clustering, segmentation, pattern discovery

  • Reinforcement Learning – learning MDPs, online learning


Supervised Learning

  • Someone gives you a bunch of examples, telling you what each one is

  • Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)


Supervised Learning

  • Logic-based approaches:

    • learn a function f(X) → {true, false}

  • Statistical/Probabilistic approaches

    • Learn a probability distribution p(Y|X)


Outline

  • Logic-based Approaches

    • Decision Trees (lecture 24)

    • Version Spaces (today)

    • PAC learning (we won’t cover, )

  • Instance-based approaches

  • Statistical Approaches


Predicate-Learning Methods

[Diagram: the hypothesis space H must be given an explicit representation and some “structure”.]

  • Version space


Version Spaces

  • The “version space” is the set of all hypotheses that are consistent with the training instances processed so far.

  • An algorithm:

    • V := H  ;; the version space V is initialized to ALL hypotheses in H

    • For each example e:

      • Eliminate any member of V that disagrees with e

      • If V is empty, FAIL

    • Return V as the set of consistent hypotheses
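The algorithm above can be sketched directly in Python, on a toy domain of two boolean attributes, representing each hypothesis by its extension (the set of instances it labels positive). All names and the example labels are illustrative.

```python
from itertools import combinations, product

def powerset(xs):
    # every subset of the instance set is a hypothesis
    for r in range(len(xs) + 1):
        for c in combinations(xs, r):
            yield frozenset(c)

instances = list(product([0, 1], repeat=2))   # N=2 attributes, k=2 values
V = list(powerset(instances))                 # V := H, all 2^(k^N) = 16 hypotheses

def update(V, example, label):
    # Eliminate any member of V that disagrees with the example
    return [h for h in V if (example in h) == label]

V = update(V, (0, 0), True)    # (0, 0) labeled positive
V = update(V, (1, 1), False)   # (1, 1) labeled negative
print(len(V))  # → 4 hypotheses remain consistent
```

Each consistent hypothesis contains (0, 0), excludes (1, 1), and is free to include or exclude the other two instances, hence 2² = 4 survivors.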


Version Spaces: The Problem

  • PROBLEM: V is huge!!

  • Suppose you have N attributes, each with k possible values

  • Suppose you allow a hypothesis to be any disjunction of instances

  • There are k^N possible instances ⇒ |H| = 2^(k^N)

  • If N=5 and k=2, |H| = 2^32!!
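This blow-up is easy to check numerically:

```python
# Counting check for the version-space blow-up: with N attributes of k
# values there are k**N instances, and any subset of instances is a
# hypothesis, giving 2**(k**N) hypotheses in total.
N, k = 5, 2
num_instances = k ** N               # 32 instances
num_hypotheses = 2 ** num_instances  # 2^32 hypotheses
print(num_instances, num_hypotheses == 2 ** 32)  # → 32 True
```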


Version Spaces: The Tricks

  • First Trick: Don’t allow arbitrary disjunctions

    • Organize the feature values into a hierarchy of allowed disjunctions, e.g.

            any-color
           /         \
        dark          pale
       /    \        /    \
   black    blue  white   yellow

  • Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed)

  • Second Trick: Define a partial ordering on H (“general to specific”) and only keep track of the upper bound and lower bound of the version space

  • RESULT: An incremental, possibly efficient algorithm!


    Rewarded Card Example

    (r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K) ⇔ ANY-RANK(r)
    (r=1) ∨ … ∨ (r=10) ⇔ NUM(r)
    (r=J) ∨ (r=Q) ∨ (r=K) ⇔ FACE(r)
    (s=♠) ∨ (s=♣) ∨ (s=♦) ∨ (s=♥) ⇔ ANY-SUIT(s)
    (s=♠) ∨ (s=♣) ⇔ BLACK(s)
    (s=♦) ∨ (s=♥) ⇔ RED(s)

    • A hypothesis is any sentence of the form: R(r) ∧ S(s) ⇒ IN-CLASS([r,s])

    • where:

    • R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)

    • S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)


    Simplified Representation

    • For simplicity, we represent a concept by rs, with:

    • r  {a, n, f, 1, …, 10, j, q, k}

    • s  {a, b, r, , , , }For example:

    • n represents: NUM(r)  (s=)  IN-CLASS([r,s])

    • aa represents:

    • ANY-RANK(r)  ANY-SUIT(s)  IN-CLASS([r,s])


    Extension of a Hypothesis

    The extension of a hypothesis h is the set of objects that satisfy h

    • Examples:

    • The extension of f is: {j, q, k}

    • The extension of aa is the set of all cards


    More General/Specific Relation

    • Let h1 and h2 be two hypotheses in H

    • h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2

    • Examples:

    • aa is more general than f♠

    • f♠ is more general than q♠

    • fr and nr are not comparable


    • The inverse of the “more general” relation is the “more specific” relation

    • The “more general” relation defines a partial ordering on the hypotheses in H


    Example: Subset of Partial Order

    [Diagram: a fragment of the partial order, from aa at the top down through na, ab, 4a, nb, a, 4b, n to the single instance 4.]


    Construction of Ordering Relation

    [Diagram: two hierarchies used to build the ordering. Ranks: a generalizes n and f; n generalizes 1, …, 10; f generalizes j, q, k. Suits: a generalizes b and r, which generalize the black and red suits respectively.]


    G-Boundary / S-Boundary of V

    • A hypothesis in V is most general iff no hypothesis in V is more general

    • G-boundary G of V: Set of most general hypotheses in V


    • A hypothesis in V is most specific iff no hypothesis in V is more specific

    • S-boundary S of V: Set of most specific hypotheses in V


    Example: G-/S-Boundaries of V

    Suppose that 4 is given as a positive example. We replace every hypothesis in S whose extension does not contain 4 by its generalization set.

    [Diagram: the partial order with the G-boundary {aa} at the top and the S-boundary at the bottom.]


    Example: G-/S-Boundaries of V

    [Diagram: the version space after the positive example 4.]

    Here, both G and S have size 1. This is not the case in general!


    Example: G-/S-Boundaries of V

    The generalization set of a hypothesis h is the set of the hypotheses that are immediately more general than h.

    [Diagram: the generalization set of 4.]

    Let 7 be the next (positive) example.


    Example: G-/S-Boundaries of V

    [Diagram: the version space updated after the positive example 7.]


    Example: G-/S-Boundaries of V

    [Diagram: the specialization set of aa.]

    Let 5 be the next (negative) example.


    Example: G-/S-Boundaries of V

    G and S, and all hypotheses in between, form exactly the version space.

    [Diagram: the remaining hypotheses ab, nb, a, n.]


    Example: G-/S-Boundaries of V

    At this stage: do 8, 6, j satisfy CONCEPT? (For the three cards the answers are No, Yes, and Maybe.)

    [Diagram: the current version space ab, nb, a, n.]


    Example: G-/S-Boundaries of V

    [Diagram: the current version space ab, nb, a, n.]

    Let 2 be the next (positive) example.


    Example: G-/S-Boundaries of V

    [Diagram: after the positive example 2, only ab and nb remain.]

    Let j be the next (negative) example.


    Example: G-/S-Boundaries of V

    Positive examples: 4, 7, 2.  Negative examples: 5, j.

    The version space converges to the single hypothesis nb:

    NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])


    Example: G-/S-Boundaries of V

    Let us return to the version space (ab, nb, a, n) … and let 8 be the next (negative) example.

    The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.


    Example: G-/S-Boundaries of V

    Let us return to the version space (ab, nb, a, n) … and let j be the next (positive) example.

    The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.


    Version Space Update

    • x  new example

    • If x is positive then (G,S)  POSITIVE-UPDATE(G,S,x)

    • Else(G,S)  NEGATIVE-UPDATE(G,S,x)

    • If G or S is empty then return failure


    POSITIVE-UPDATE(G,S,x)

    • Eliminate all hypotheses in G that do not agree with x

    • Minimally generalize all hypotheses in S until they are consistent with x (using the generalization sets of the hypotheses)

    • Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)

    • Remove from S every hypothesis that is more general than another hypothesis in S

    • Return (G,S)


    NEGATIVE-UPDATE(G,S,x)

    • Eliminate all hypotheses in S that do not agree with x

    • Minimally specialize all hypotheses in G until they are consistent with x

    • Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S

    • Remove from G every hypothesis that is more specific than another hypothesis in G

    • Return (G,S)


    Example-Selection Strategy (aka Active Learning)

    • Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example

    • Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses

    • Then a single hypothesis will be isolated in O(log |H|) steps


    Example

    [Diagram: the version space aa, na, ab, nb, a, n, with candidate query cards: a 9 and two different j cards. A good query splits the remaining hypotheses evenly.]


    Example-Selection Strategy

    • But picking the object that eliminates half the version space may be expensive
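The halving argument can be checked with a toy computation (illustrative only):

```python
import math

# If each queried example halves the version space, isolating a single
# hypothesis from |H| candidates takes about log2 |H| queries.
def queries_needed(H_size):
    n, steps = H_size, 0
    while n > 1:
        n //= 2        # an ideal query eliminates half the hypotheses
        steps += 1
    return steps

print(queries_needed(2 ** 32))         # → 32
print(math.ceil(math.log2(2 ** 32)))   # → 32
```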


    Noise

    • If some examples are misclassified, the version space may collapse

    • Possible solution:Maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc…


    Version Spaces

    • Useful for understanding logical (consistency-based) approaches to learning

    • Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses)

    • Don’t handle noise well


    VSL vs DTL

    • Decision tree learning (DTL) is more efficient if all examples are given in advance; else, it may produce successive hypotheses, each poorly related to the previous one

    • Version space learning (VSL) is incremental

    • DTL can produce simplified hypotheses that do not agree with all examples

    • DTL has been more widely used in practice


    Statistical Approaches

    • Instance-based Learning (20.4)

    • Statistical Learning (20.1)

    • Neural Networks (20.5)


    Nearest Neighbor Methods

    [Diagram: a query point x and its k=1 and k=6 nearest neighborhoods.]

    • To classify a new input vector x, examine the k-closest training data points to x and assign the object to the most frequently occurring class
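A minimal k-nearest-neighbor sketch along these lines (Euclidean distance, majority vote); the data and names are made up:

```python
import math
from collections import Counter

def knn_classify(x, data, k):
    # data: list of (vector, label); pick the k closest points to x
    neighbors = sorted(data, key=lambda p: math.dist(x, p[0]))[:k]
    # assign the most frequently occurring class among them
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), train, k=3))  # → A
```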


    Issues

    • Distance measure

      • Most common: Euclidean

      • Better distance measures: normalize each variable by standard deviation

    • Choosing k

      • Increasing k reduces variance, increases bias

    • In high-dimensional spaces, the nearest neighbor may not be very close at all!

    • Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.

    • Indexing the data can help; for example, k-d trees


    Bayesian Learning


    Example: Candy Bags

    • Candy comes in two flavors: cherry and lime

    • Candy is wrapped, can’t tell which flavor until opened

    • There are 5 kinds of bags of candy:

      • H1 = 100% cherry

      • H2 = 75% cherry, 25% lime

      • H3 = 50% cherry, 50% lime

      • H4 = 25% cherry, 75% lime

      • H5 = 100% lime

    • Given a new bag of candy, predict H

    • Observations: D1, D2 ,D3, …


    Bayesian Learning

    • Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e., use all the hypotheses, not just the single best)

    • Now, if we want to predict some unknown quantity X:

    P(X | d) = Σi P(X | hi) P(hi | d)


    Bayesian Learning cont.

    • Calculating P(h|d): P(h|d) = α P(d|h) P(h), where P(d|h) is the likelihood and P(h) is the prior

    • Assume the observations are i.i.d.—independent and identically distributed—so the likelihood factors: P(d|h) = Πj P(dj|h)


    Example:

    • Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}

    • Data: a sequence of observed candies d1, d2, …

    • Q1: After seeing d1, what is P(hi|d1)?

    • Q2: After seeing d1, what is P(d2 = lime | d1)?
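Assuming, as in the R&N version of this example, that the observed candies are lime, Q1 and Q2 can be computed directly:

```python
# Posterior over the five candy-bag hypotheses after one observation,
# assuming the first candy drawn is lime.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1..h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)

# Q1: P(hi | d1) ∝ P(d1 | hi) P(hi)
joint = [pl * pr for pl, pr in zip(p_lime, prior)]
posterior = [j / sum(joint) for j in joint]
print([round(p, 2) for p in posterior])  # → [0.0, 0.1, 0.4, 0.3, 0.2]

# Q2: P(d2 = lime | d1) = Σi P(lime | hi) P(hi | d1)
print(round(sum(pl * po for pl, po in zip(p_lime, posterior)), 2))  # → 0.65
```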


    Statistical Learning Approaches

    • Bayesian – predict using all hypotheses, weighted by their posterior probabilities

    • MAP – maximum a posteriori: predict using the single most probable hypothesis

    • MLE – maximum likelihood estimate: MAP with a uniform prior over H


    Naïve Bayes

    • aka Idiot Bayes

    • particularly simple BN

    • makes overly strong independence assumptions

    • but works surprisingly well in practice…


    Bayesian Diagnosis

    • suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn

    • suppose there are m boolean symptoms, E1, …, Em

    • to make a diagnosis we need the posterior: P(di | e1, …, em) ∝ P(e1, …, em | di) P(di)

    Naïve Bayes Assumption

    • Assume each piece of evidence (symptom) is independent given the diagnosis

    • then the likelihood factors: P(e1, …, em | di) = Πk P(ek | di)

    what is the structure of the corresponding BN?


    Naïve Bayes Example

    • possible diagnoses: Allergy, Cold, and OK

    • possible symptoms: Sneeze, Cough and Fever

    my symptoms are: sneeze & cough; what is the diagnosis?
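A sketch of this diagnosis; note the priors and conditional probabilities below are invented for illustration (the slides' actual table is not reproduced here):

```python
# Naive Bayes diagnosis sketch with made-up illustrative probabilities.
priors = {"Allergy": 0.05, "Cold": 0.05, "OK": 0.9}
cpt = {  # P(symptom = true | diagnosis)
    "Sneeze": {"Allergy": 0.9, "Cold": 0.9, "OK": 0.1},
    "Cough":  {"Allergy": 0.7, "Cold": 0.8, "OK": 0.1},
    "Fever":  {"Allergy": 0.1, "Cold": 0.7, "OK": 0.01},
}

def diagnose(symptoms):
    # P(d | e1..em) ∝ P(d) * Π P(ek | d), by the naive independence assumption
    scores = {}
    for d, p in priors.items():
        for s, present in symptoms.items():
            p *= cpt[s][d] if present else 1 - cpt[s][d]
        scores[d] = p
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}   # normalize

post = diagnose({"Sneeze": True, "Cough": True, "Fever": False})
print(max(post, key=post.get))  # → Allergy
```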


    Learning the Probabilities

    • aka parameter estimation

    • we need

      • P(di) – prior

      • P(ek|di) – conditional probability

    • use training data to estimate


    Maximum Likelihood Estimate (MLE)

    • use frequencies in the training set to estimate: P(di) = n_di / n and P(ek|di) = n_{ek∧di} / n_di

    where n_x is shorthand for the count of events x in the training set


    Example:

    what is:

    P(Allergy)?

    P(Sneeze| Allergy)?

    P(Cough| Allergy)?


    Laplace Estimate (smoothing)

    • use smoothing to eliminate zeros, e.g. the Laplace estimate: P(di) = (n_di + 1) / (n + nd) and P(ek|di) = (n_{ek∧di} + 1) / (n_di + 2)

    where nd is the number of possible values for d and e is assumed to have 2 possible values (many other smoothing schemes exist…)
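A toy comparison of the MLE and Laplace estimates on an invented training set (all counts illustrative):

```python
# Parameter estimation from counts: MLE uses raw frequencies; the
# Laplace estimate adds 1 to each count to avoid zero/one probabilities.
# Toy training set of (diagnosis, sneeze) records, invented for illustration.
data = [("Allergy", True), ("Allergy", True), ("Cold", True), ("Cold", False),
        ("OK", False), ("OK", False), ("OK", False), ("OK", True)]

n = len(data)
n_allergy = sum(1 for d, _ in data if d == "Allergy")
n_sneeze_and_allergy = sum(1 for d, s in data if d == "Allergy" and s)

# MLE: raw frequencies
p_mle = n_allergy / n                           # 2/8 = 0.25
p_cond_mle = n_sneeze_and_allergy / n_allergy   # 2/2 = 1.0 (a suspicious extreme!)

# Laplace smoothing, with 3 possible diagnoses and 2 values for the symptom
p_lap = (n_allergy + 1) / (n + 3)                           # 3/11
p_cond_lap = (n_sneeze_and_allergy + 1) / (n_allergy + 2)   # 3/4
print(p_mle, p_cond_mle, round(p_lap, 3), p_cond_lap)
```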


    Comments

    • Generally works well despite blanket assumption of independence

    • Experiments show it is competitive with decision trees on some well-known test sets (UCI)

    • handles noisy data


    Learning more complex Bayesian networks

    • Two subproblems:

    • learning structure: combinatorial search over space of networks

    • learning parameter values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’


    Neural Networks


    Neural function

    • Brain function (thought) occurs as the result of the firing of neurons

    • Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters

    • Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds

    • Learning occurs as a result of the synapses’ plasticity: They exhibit long-term changes in connection strength

    • There are about 10^11 neurons and about 10^14 synapses in the human brain


    Biology of a neuron


    Brain structure

    • Different areas of the brain have different functions

      • Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent

      • Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly

    • We don’t know how different functions are “assigned” or acquired

      • Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)

      • Partly the result of experience (learning)

    • We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”

    • Our neural networks are not nearly as complex or intricate as the actual brain structure


    Comparison of computing power

    • Computers are way faster than neurons…

    • But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel

    • Neural networks are designed to be massively parallel

    • The brain is effectively a billion times faster


    Neural networks

    • Neural networks are made up of nodes or units, connected by links

    • Each link has an associated weight and activation level

    • Each node has an input function (typically summing over weighted inputs), an activation function, and an output


    Neural unit


    Model Neuron

    • Neuron modeled as a unit j

    • weight on the input from unit i to j: wji

    • net input to unit j is: netj = Σi wji oi

    • threshold Tj

    • oj is 1 if netj > Tj, else 0


    Neural Computation

    • McCulloch and Pitts (1943) showed how a linear threshold unit (LTU) can be used to compute logical functions

      • AND?

      • OR?

      • NOT?

    • Two layers of LTUs can represent any boolean function
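These gates can be checked with a tiny LTU implementation; the weights and thresholds below are chosen by hand:

```python
# A single linear threshold unit (LTU): output 1 iff the weighted
# input sum exceeds the threshold.
def ltu(weights, threshold, inputs):
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net > threshold else 0

AND = lambda a, b: ltu([1, 1], 1.5, [a, b])    # fires only when both inputs are 1
OR  = lambda a, b: ltu([1, 1], 0.5, [a, b])    # fires when either input is 1
NOT = lambda a:    ltu([-1], -0.5, [a])        # inverts its single input

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # → [0, 1, 1, 1]
print(NOT(0), NOT(1))                               # → 1 0
```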


    Learning Rules

    • Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to learn to produce these outputs, using the perceptron learning rule

      • assumes binary valued input/outputs

      • assumes a single linear threshold unit


    Perceptron Learning rule

    • If the target output for unit j is tj, update each weight by: wji ← wji + η (tj – oj) oi

    • Equivalent to the intuitive rules:

      • If output is correct, don’t change the weights

      • If output is low (oj=0, tj=1), increment weights for all the inputs which are 1

      • If output is high (oj=1, tj=0), decrement weights for all inputs which are 1

      • Must also adjust the threshold. Or equivalently, assume there is a weight wj0 for an extra input unit that has o0=1


    Perceptron Learning Algorithm

    • Repeatedly iterate through examples adjusting weights according to the perceptron learning rule until all outputs are correct

      • Initialize the weights to all zero (or random)

      • Until outputs for all training examples are correct

        • for each training example e do

          • compute the current output oj

          • compare it to the target tj and update weights

    • each execution of outer loop is an epoch

    • for multiple category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold

    • Q: when will the algorithm terminate?
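A compact sketch of this algorithm, trained on the linearly separable AND function; the learning rate and names are illustrative, and weight w[0] plays the role of the threshold via a constant o0 = 1 input:

```python
# Perceptron learning on the AND function.
def train_perceptron(examples, lr=0.1, max_epochs=100):
    w = [0.0, 0.0, 0.0]                        # initialize the weights to zero
    for _ in range(max_epochs):                # each pass is one epoch
        errors = 0
        for x, t in examples:
            xs = [1] + list(x)                 # prepend the constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            if o != t:                         # perceptron rule: w += lr*(t-o)*x
                w = [wi + lr * (t - o) * xi for wi, xi in zip(w, xs)]
                errors += 1
        if errors == 0:                        # all outputs correct: converged
            return w
    return w

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(data)
preds = [1 if w[0] + w[1] * a + w[2] * b > 0 else 0 for (a, b), _ in data]
print(preds)  # → [0, 0, 0, 1]
```

Because AND is linearly separable, the algorithm terminates with all outputs correct, as the convergence theorem below promises.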


    Perceptron Video


    Representation Limitations of a Perceptron

    • Perceptrons can only represent linear threshold functions and can therefore only learn functions that linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space


    Perceptron Learnability

    • Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)

    • Unfortunately, many functions (like parity) cannot be represented by LTU


    Layered feed-forward network

    [Diagram: input units at the bottom, hidden units in the middle, output units at the top.]


    “Executing” neural networks

    • Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level

    • Working forward through the network, the input function of each unit is applied to compute the input value

      • Usually this is just the weighted sum of the activation on the links feeding into this node

    • The activation function transforms this input function into a final value

      • Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
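A minimal forward pass along these lines, with made-up weights and a sigmoid activation:

```python
import math

# Forward pass through a tiny layered feed-forward network:
# 2 inputs -> 2 hidden units -> 1 output, sigmoid activations.
# All weights are arbitrary illustrative values.
def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def layer(inputs, weights):
    # each unit: weighted sum of incoming activations, then the sigmoid
    return [sigmoid(sum(w, * [0])[0] if False else sum(w_i * a for w_i, a in zip(ws, inputs))) for ws in weights]

hidden_w = [[0.5, -0.4], [0.3, 0.8]]   # one weight row per hidden unit
output_w = [[1.0, -1.0]]               # one row for the single output unit
x = [1.0, 0.0]                         # input units set by an exterior function
out = layer(layer(x, hidden_w), output_w)
print(round(out[0], 3))
```

Working forward, each layer applies the weighted-sum input function and then the nonlinear activation, exactly as described above.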

