
## Learning

R&N: ch 19, ch 20

### Types of Learning

• Supervised Learning - classification, prediction

• Unsupervised Learning – clustering, segmentation, pattern discovery

• Reinforcement Learning – learning MDPs, online learning

### Supervised Learning

• Someone gives you a bunch of examples, telling you what each one is

• Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)

### Supervised Learning

• Logic-based approaches:

• learn a function f(X) → {true, false}

• Statistical/Probabilistic approaches

• Learn a probability distribution p(Y|X)

### Outline

• Logic-based Approaches

• Decision Trees (lecture 24)

• Version Spaces (today)

• PAC learning (we won’t cover)

• Instance-based approaches

• Statistical Approaches

Need to provide H with some “structure”: an explicit representation of the hypothesis space H

### Predicate-Learning Methods

• Version space

### Version Spaces

• The “version space” is the set of all hypotheses that are consistent with the training instances processed so far.

• An algorithm:

• V := H ;; the version space V is ALL hypotheses H

• For each example e:

• Eliminate any member of V that disagrees with e

• If V is empty, FAIL

• Return V as the set of consistent hypotheses
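The algorithm above can be sketched directly. This is a minimal illustration with a hypothetical representation: each hypothesis is a Boolean predicate, and each example is a (value, label) pair.

```python
# Naive version-space algorithm: keep every hypothesis that is
# consistent with all labeled examples seen so far.
def version_space(hypotheses, examples):
    V = list(hypotheses)                      # V := H
    for x, label in examples:
        V = [h for h in V if h(x) == label]   # eliminate members that disagree with the example
        if not V:
            raise ValueError("FAIL: no hypothesis is consistent")
    return V

# Toy hypothesis space over integers
hyps = [lambda x: x > 0, lambda x: x > 2, lambda x: x % 2 == 0]
consistent = version_space(hyps, [(4, True), (1, False)])
```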

### Version Spaces: The Problem

• PROBLEM: V is huge!!

• Suppose you have N attributes, each with k possible values

• Suppose you allow a hypothesis to be any disjunction of instances

• There are k^N possible instances ⇒ |H| = 2^(k^N)

• If N=5 and k=2, |H| = 2^32!!
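The count can be checked directly: every subset of the k^N instances is a distinct disjunctive hypothesis.

```python
# k^N possible instances; any subset of instances is one hypothesis,
# so |H| = 2^(k^N).
N, k = 5, 2
num_instances = k ** N               # 2^5 = 32 instances
num_hypotheses = 2 ** num_instances  # 2^32 = 4294967296 hypotheses
print(num_instances, num_hypotheses)
```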

### Version Spaces: The Tricks

• First Trick: Don’t allow arbitrary disjunctions

• Organize the feature values into a hierarchy of allowed disjunctions, e.g.

• any-color

• dark: {black, blue}

• pale: {white, yellow}

• Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed)

• Second Trick: Define a partial ordering on H (“general to specific”) and only keep track of the upper bound and lower bound of the version space

• RESULT: An incremental, possibly efficient algorithm!

### Rewarded Card Example

• (r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K) ⇔ ANY-RANK(r)

• (r=1) ∨ … ∨ (r=10) ⇔ NUM(r)

• (r=J) ∨ (r=Q) ∨ (r=K) ⇔ FACE(r)

• (s=♠) ∨ (s=♣) ∨ (s=♥) ∨ (s=♦) ⇔ ANY-SUIT(s)

• (s=♠) ∨ (s=♣) ⇔ BLACK(s)

• (s=♥) ∨ (s=♦) ⇔ RED(s)

• A hypothesis is any sentence of the form: R(r) ∧ S(s) ⇒ IN-CLASS([r,s])

• where:

• R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)

• S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)

### Simplified Representation

• For simplicity, we represent a concept by rs, with:

• r ∈ {a, n, f, 1, …, 10, j, q, k}

• s ∈ {a, b, r, ♠, ♣, ♥, ♦}

• For example:

• n♠ represents: NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s])

• aa represents:

• ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])

### Extension of a Hypothesis

The extension of a hypothesis h is the set of objects that satisfy h

• Examples:

• The extension of f is: {j, q, k}

• The extension of aa is the set of all cards

### More General/Specific Relation

• Let h1 and h2 be two hypotheses in H

• h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2

• Examples:

• aa is more general than f♠

• f♠ is more general than q♠

• fr and nr are not comparable


• The inverse of the “more general” relation is the “more specific” relation

• The “more general” relation defines a partial ordering on the hypotheses in H
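The generality test can be written directly against extensions. This is an illustrative sketch: extensions are Python sets, and the particular hypotheses below are made-up stand-ins for the card example.

```python
# "More general than", as defined above:
# h1 is more general than h2 iff ext(h1) is a proper superset of ext(h2).
def more_general(ext1, ext2):
    return ext1 > ext2  # Python's proper-superset test on sets

# Illustrative extensions (suit written out as a word)
RANKS = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'j', 'q', 'k']
FACES = ['j', 'q', 'k']
ext_a_spade = {(r, 'spade') for r in RANKS}   # ANY-RANK, spade
ext_f_spade = {(r, 'spade') for r in FACES}   # FACE, spade
ext_f_heart = {(r, 'heart') for r in FACES}   # FACE, heart
```

Note that `more_general(ext_f_spade, ext_f_heart)` and its converse are both false: the two hypotheses are not comparable, which is exactly why the relation is only a partial ordering.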

[Figure: the “more general” relation drawn as a lattice over H, with aa at the top, intermediate hypotheses such as na, ab, 4a, nb, 4b below it, and the fully specific hypotheses (single cards) at the bottom]

### G-Boundary / S-Boundary of V

• A hypothesis in V is most general iff no hypothesis in V is more general

• G-boundary G of V: Set of most general hypotheses in V

• A hypothesis in V is most specific iff no hypothesis in V is more specific

• S-boundary S of V: Set of most specific hypotheses in V

[Figure: the hypothesis lattice with the G-boundary (aa) highlighted at the top and the S-boundary highlighted at the bottom]

### Example: G-/S-Boundaries of V

Now suppose that 4 is given as a positive example. We replace every hypothesis in S whose extension does not contain 4 by its generalization set.

### Example: G-/S-Boundaries of V

[Figure: the lattice after the first positive example, with the generalization set of 4 indicated]

Here, both G and S have size 1. This is not the case in general!

### Example: G-/S-Boundaries of V

The generalization set of a hypothesis h is the set of hypotheses that are immediately more general than h.

Let 7 be the next (positive) example.

### Example: G-/S-Boundaries of V

[Figure: the lattice after the positive example 7, with the specialization set of aa indicated]

### Example: G-/S-Boundaries of V

Let 5 be the next (negative) example.

### Example: G-/S-Boundaries of V

G and S, and all hypotheses in between, form exactly the version space.

[Figure: the remaining lattice between G and S; an instance is classified Yes if every hypothesis in V accepts it, No if none does, and Maybe if only some do]

### Example: G-/S-Boundaries of V

At this stage, do 8, 6, j satisfy CONCEPT?

### Example: G-/S-Boundaries of V

Let 2 be the next (positive) example.

### Example: G-/S-Boundaries of V

Let j be the next (negative) example.

### Example: G-/S-Boundaries of V

After the positive examples 4, 7, 2 and the negative examples 5, j, the version space is reduced to the single hypothesis nb:

NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])

### Example: G-/S-Boundaries of V

Returning to the earlier version space, suppose instead that 8 is the next (negative) example: the only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.

### Example: G-/S-Boundaries of V

Returning to the earlier version space, suppose instead that j is the next (positive) example: the only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples.

### Version Space Update

• x ← new example

• If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)

• Else (G,S) ← NEGATIVE-UPDATE(G,S,x)

• If G or S is empty then return failure

### POSITIVE-UPDATE(G,S,x)

• Eliminate all hypotheses in G that do not agree with x

• Minimally generalize all hypotheses in S until they are consistent with x, using the generalization sets of the hypotheses (this step was not needed in the card example)

• Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G

• Remove from S every hypothesis that is more general than another hypothesis in S

• Return (G,S)

### NEGATIVE-UPDATE(G,S,x)

• Eliminate all hypotheses in S that do not agree with x

• Minimally specialize all hypotheses in G until they are consistent with x

• Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S

• Remove from G every hypothesis that is more specific than another hypothesis in G

• Return (G,S)
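For a small, explicitly enumerated H, the G- and S-boundaries that these updates maintain can also be computed by brute force, which makes a useful sanity check. This sketch uses a hypothetical representation in which each hypothesis is named and given directly by its extension (a set of objects).

```python
# Brute-force G- and S-boundaries: filter H down to the version space V,
# then keep the maximal (G) and minimal (S) elements under set inclusion.
def boundaries(hypotheses, examples):
    """hypotheses: dict name -> extension (set); examples: (object, label) pairs."""
    V = {name: ext for name, ext in hypotheses.items()
         if all((x in ext) == label for x, label in examples)}
    G = [n for n, e in V.items()
         if not any(e2 > e for e2 in V.values())]   # no strictly more general h in V
    S = [n for n, e in V.items()
         if not any(e2 < e for e2 in V.values())]   # no strictly more specific h in V
    return G, S

# Toy space over the objects 1..6
hyps = {'all': {1, 2, 3, 4, 5, 6}, 'even': {2, 4, 6},
        'small': {1, 2, 3}, 'two': {2}}
G, S = boundaries(hyps, [(2, True), (5, False)])
```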

### Example-Selection Strategy (aka Active Learning)

• Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example

• Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses

• Then a single hypothesis will be isolated in O(log |H|) steps

• But picking the object that eliminates half the version space may be expensive
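The query-selection idea can be sketched by scoring each candidate object by how evenly its two possible labels split the version space; the brute-force scan below is exactly the step the last bullet warns may be expensive. Representation as before: hypotheses given by their extensions.

```python
# Pick the object whose label, whichever way it comes back,
# removes as close to half of the remaining hypotheses as possible.
def best_query(V, objects):
    """V: list of extensions (sets); objects: candidate queries."""
    def imbalance(x):
        pos = sum(1 for ext in V if x in ext)  # hypotheses that say "positive"
        return abs(2 * pos - len(V))           # 0 = perfect halving
    return min(objects, key=imbalance)
```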

### Noise

• If some examples are misclassified, the version space may collapse

• Possible solution:Maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc…

### Version Spaces

• Useful for understanding logical (consistency-based) approaches to learning

• Practical if strong hypothesis bias (concept hierarchies, conjunctive hypothesis)

• Don’t handle noise well

### VSL vs DTL

• Decision tree learning (DTL) is more efficient if all examples are given in advance; used incrementally, it may produce a sequence of hypotheses, each poorly related to the previous one

• Version space learning (VSL) is incremental

• DTL can produce simplified hypotheses that do not agree with all examples

• DTL has been more widely used in practice

### Statistical Approaches

• Instance-based Learning (20.4)

• Statistical Learning (20.1)

• Neural Networks (20.5)

[Figure: nearest-neighbor classification of a query point x, shown for k=1 and k=6]

### Nearest Neighbor Methods

• To classify a new input vector x, examine the k-closest training data points to x and assign the object to the most frequently occurring class
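A minimal sketch of the method, using Euclidean distance and a majority vote among the k closest training points (`knn_classify` is a hypothetical name):

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (vector, class_label) pairs; x: query vector."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]  # k closest points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]   # most frequent class among the k neighbors

train = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 5), 'b'), ((5, 6), 'b')]
print(knn_classify(train, (1, 0), k=3))   # 'a'
```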

### Issues

• Distance measure

• Most common: Euclidean

• Better distance measures: normalize each variable by standard deviation

• Choosing k

• Increasing k reduces variance, increases bias

• In high-dimensional spaces, the nearest neighbor may not be very close at all!

• Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.

• Indexing the data can help; for example KD trees

### Example: Candy Bags

• Candy comes in two flavors: cherry and lime

• Candy is wrapped, can’t tell which flavor until opened

• There are 5 kinds of bags of candy:

• H1 = 100% cherry

• H2 = 75% cherry, 25% lime

• H3 = 50% cherry, 50% lime

• H4 = 25% cherry, 75% lime

• H5 = 100% lime

• Given a new bag of candy, predict H

• Observations: D1, D2 ,D3, …

### Bayesian Learning

• Calculate the probability of each hypothesis given the data, and make the prediction weighted by these probabilities (i.e., use all the hypotheses, not just the single best one)

• Now, if we want to predict some unknown quantity X: P(X|d) = Σi P(X|hi) P(hi|d)

### Bayesian Learning cont.

• Calculating P(h|d): by Bayes’ rule, P(h|d) = α P(d|h) P(h), where P(d|h) is the likelihood and P(h) is the prior

• Assume the observations are i.i.d. (independent and identically distributed), so that P(d|h) = Πj P(dj|h)

### Example:

• Hypothesis Prior over h1, …, h5 is {0.1,0.2,0.4,0.2,0.1}

• Data: a sequence of observed candies d1, d2, d3, …

• Q1: After seeing d1, what is P(hi|d1)?

• Q2: After seeing d1, what is P(d2 = lime | d1)?
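Assuming the observed candy d1 is lime (an assumption; the flavor glyph did not survive conversion), Q1 and Q2 can be computed directly:

```python
# Bayesian update for the candy-bag example.
# Assumption: d1 is observed to be lime.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

# Q1: P(hi | d1=lime) is proportional to P(d1=lime | hi) * P(hi)
unnorm = [l * p for l, p in zip(p_lime, priors)]
posterior = [u / sum(unnorm) for u in unnorm]   # ≈ [0.0, 0.1, 0.4, 0.3, 0.2]

# Q2: P(d2=lime | d1=lime) = sum_i P(lime | hi) * P(hi | d1)
p_next_lime = sum(l * p for l, p in zip(p_lime, posterior))  # ≈ 0.65
```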

### Statistical Learning Approaches

• Bayesian

• MAP – maximum a posteriori

• MLE – maximum likelihood estimate (equivalent to MAP with a uniform prior over H)

### Naïve Bayes

• aka Idiot Bayes

• particularly simple BN

• makes overly strong independence assumptions

• but works surprisingly well in practice…

we need: the prior P(d) and the conditionals P(ek|d)

### Bayesian Diagnosis

• suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn

• suppose there are m boolean symptoms, E1, …, Em

how do we make a diagnosis? By Bayes’ rule: P(di | e1, …, em) ∝ P(e1, …, em | di) P(di)

### Naïve Bayes Assumption

• Assume each piece of evidence (symptom) is independent given the diagnosis

• then: P(d | e1, …, em) ∝ P(d) Πk P(ek|d)

what is the structure of the corresponding BN?

### Naïve Bayes Example

• possible diagnosis: Allergy, Cold and OK

• possible symptoms: Sneeze, Cough and Fever

my symptoms are: sneeze & cough; what is the diagnosis?
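A sketch of the computation; the priors and conditional probabilities below are made-up illustrative numbers, not values from the slides:

```python
# Naive Bayes diagnosis. All probabilities here are illustrative assumptions.
priors = {'Allergy': 0.05, 'Cold': 0.05, 'OK': 0.9}
p_symptom = {   # P(symptom present | diagnosis)
    'Allergy': {'sneeze': 0.9, 'cough': 0.7, 'fever': 0.1},
    'Cold':    {'sneeze': 0.8, 'cough': 0.8, 'fever': 0.6},
    'OK':      {'sneeze': 0.1, 'cough': 0.1, 'fever': 0.01},
}

def diagnose(present, absent):
    """MAP diagnosis: argmax_d P(d) * prod_k P(e_k | d)."""
    def score(d):
        s = priors[d]
        for e in present:
            s *= p_symptom[d][e]       # symptom observed present
        for e in absent:
            s *= 1 - p_symptom[d][e]   # symptom observed absent
        return s
    return max(priors, key=score)
```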

### Learning the Probabilities

• aka parameter estimation

• we need

• P(di) – prior

• P(ek|di) – conditional probability

• use training data to estimate

### Maximum Likelihood Estimate (MLE)

• use frequencies in the training set to estimate: P(di) = n_di / n and P(ek|di) = n_(ek,di) / n_di

where n_x is shorthand for the count of events x in the training set

### Example:

what is:

P(Allergy)?

P(Sneeze| Allergy)?

P(Cough| Allergy)?

### Laplace Estimate (smoothing)

• use smoothing to eliminate zeros: P(di) = (n_di + 1) / (N + n) and P(ek|di) = (n_(ek,di) + 1) / (n_di + 2)

where N is the number of training examples, n is the number of possible values for d, and e is assumed to have 2 possible values (many other smoothing schemes exist)
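The smoothed estimates can be sketched the same way (hypothetical record format: a list of (diagnosis, set-of-symptoms) pairs); note that an unseen diagnosis or symptom now gets a nonzero probability.

```python
# Laplace (add-one) smoothing, with n diagnosis values and binary symptoms.
def laplace(records, diag, symptom, n_diag_values):
    N = len(records)
    n_d = sum(1 for d, _ in records if d == diag)
    n_ed = sum(1 for d, s in records if d == diag and symptom in s)
    p_d = (n_d + 1) / (N + n_diag_values)
    p_e_given_d = (n_ed + 1) / (n_d + 2)   # 2 = number of possible values of e
    return p_d, p_e_given_d

records = [('Cold', {'cough'}), ('Cold', {'cough'})]
# 'Allergy' never occurs, yet its estimates stay nonzero:
print(laplace(records, 'Allergy', 'sneeze', n_diag_values=3))   # (0.2, 0.5)
```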

• Naïve Bayes generally works well despite the blanket assumption of independence

• Experiments show it is competitive with decision trees on some well-known test sets (UCI)

• It handles noisy data

### Learning more complex Bayesian networks

• Two subproblems:

• learning structure: combinatorial search over space of networks

• learning parameters values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’

## Neural Networks

### Neural function

• Brain function (thought) occurs as the result of the firing of neurons

• Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters

• Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds

• Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength

• There are about 10^11 neurons and about 10^14 synapses in the human brain

### Brain structure

• Different areas of the brain have different functions

• Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent

• Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly

• We don’t know how different functions are “assigned” or acquired

• Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)

• Partly the result of experience (learning)

• We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”

• Our neural networks are not nearly as complex or intricate as the actual brain structure

### Comparison of computing power

• Computers are way faster than neurons…

• But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel

• Neural networks are designed to be massively parallel

• The brain is effectively a billion times faster

### Neural networks

• Neural networks are made up of nodes or units, connected by links

• Each link has an associated weight and activation level

• Each node has an input function (typically summing over weighted inputs), an activation function, and an output

### Model Neuron

• Neuron modeled as a unit j

• weight on the input from unit i to j: wji

• net input to unit j: netj = Σi wji oi

• threshold Tj

• output oj is 1 if netj > Tj, and 0 otherwise
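The unit above, written out as code (a minimal sketch; `ltu` is a hypothetical name):

```python
# Linear threshold unit: output 1 iff the weighted sum of inputs
# exceeds the threshold T_j.
def ltu(weights, inputs, threshold):
    net = sum(w * x for w, x in zip(weights, inputs))  # net_j = sum_i w_ji * o_i
    return 1 if net > threshold else 0
```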

### Neural Computation

• McCulloch and Pitts (1943) showed how linear threshold units (LTUs) can be used to compute logical functions

• AND?

• OR?

• NOT?

• Two layers of LTUs can represent any Boolean function
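One possible choice of weights and thresholds for AND, OR, and NOT, each computed by a single LTU (these particular values are an illustrative assumption; many others work):

```python
# Single-LTU implementations of the logical functions above.
step = lambda net, T: 1 if net > T else 0
AND = lambda a, b: step(a + b, 1.5)    # fires only when both inputs are 1
OR  = lambda a, b: step(a + b, 0.5)    # fires when at least one input is 1
NOT = lambda a:    step(-a, -0.5)      # negative weight inverts the input
```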

### Learning Rules

• Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to produce these outputs, using the perceptron learning rule

• assumes binary valued input/outputs

• assumes a single linear threshold unit

### Perceptron Learning rule

• If the target output for unit j is tj, update each weight: wji ← wji + η (tj – oj) xi, where η is the learning rate

• Equivalent to the intuitive rules:

• If output is correct, don’t change the weights

• If output is low (oj=0, tj=1), increment weights for all the inputs which are 1

• If output is high (oj=1, tj=0), decrement weights for all inputs which are 1

• Must also adjust the threshold, or equivalently assume there is a weight wj0 on an extra input unit that always has o0 = 1

### Perceptron Learning Algorithm

• Repeatedly iterate through examples adjusting weights according to the perceptron learning rule until all outputs are correct

• Initialize the weights to all zero (or random)

• Until outputs for all training examples are correct

• for each training example e do

• compute the current output oj

• compare it to the target tj and update weights

• each execution of outer loop is an epoch

• for multiple category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold

• Q: when will the algorithm terminate?
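A sketch of the full loop (learning rate 1 assumed; the bias is handled as an extra weight on a constant input, as described earlier), trained here on the linearly separable OR function:

```python
# Perceptron training: w <- w + (t - o) * x per misclassified example.
def train_perceptron(examples, n_inputs, max_epochs=100):
    w = [0.0] * (n_inputs + 1)              # w[0] is the bias weight
    for _ in range(max_epochs):             # each pass over the data is an epoch
        errors = 0
        for x, t in examples:
            xb = [1] + list(x)              # prepend the constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if o != t:
                errors += 1
                w = [wi + (t - o) * xi for wi, xi in zip(w, xb)]
        if errors == 0:                     # terminates iff the data is separable
            return w
    return w

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR
w = train_perceptron(examples, 2)
```

On non-separable data (e.g. XOR or parity) the inner loop never reaches zero errors, which is why the `max_epochs` cutoff is needed; this is the termination question above.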


### Representation Limitations of a Perceptron

• Perceptrons can only represent linear threshold functions and can therefore only learn functions that linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space

### Perceptron Learnability

• Perceptron Convergence Theorem: if there is a set of weights consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)

• Unfortunately, many functions (like parity) cannot be represented by LTU

[Figure: a multilayer network with input units at the bottom, hidden units in the middle, and output units at the top]

### “Executing” neural networks

• Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level

• Working forward through the network, the input function of each unit is applied to compute the input value

• Usually this is just the weighted sum of the activation on the links feeding into this node

• The activation function transforms this input function into a final value

• Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
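The forward pass described above can be sketched as follows (a hypothetical layer representation: each unit is a (bias, weights) pair, with a sigmoid activation):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(inputs, layers):
    """layers: list of layers; each layer is a list of (bias, weights) units."""
    a = list(inputs)                   # activations set by the input units
    for layer in layers:               # work forward through the network
        # each unit: weighted sum of incoming activations, then sigmoid
        a = [sigmoid(b + sum(w * x for w, x in zip(ws, a))) for b, ws in layer]
    return a

# A unit with zero bias and zero weight outputs sigmoid(0) = 0.5:
print(forward([1.0], [[(0.0, [0.0])]]))   # [0.5]
```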