Learning

R&N: ch 19, ch 20

- Supervised Learning - classification, prediction
- Unsupervised Learning – clustering, segmentation, pattern discovery
- Reinforcement Learning – learning MDPs, online learning

- Someone gives you a bunch of examples, telling you what each one is
- Eventually, you figure out the mapping from properties (features) of the examples to their type (classification)

- Logic-based approaches:
- learn a function f(X) → {true, false}

- Statistical/Probabilistic approaches
- Learn a probability distribution p(Y|X)

- Logic-based Approaches
- Decision Trees (lecture 24)
- Version Spaces (today)
- PAC learning (we won’t cover)

- Instance-based approaches
- Statistical Approaches

Need to provide H with some “structure”

Explicit representation of hypothesis space H

- Version space

- The “version space” is the set of all hypotheses that are consistent with the training instances processed so far.
- An algorithm:
- V := H ;; the version space V is ALL hypotheses H
- For each example e:
- Eliminate any member of V that disagrees with e
- If V is empty, FAIL

- Return V as the set of consistent hypotheses
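The elimination loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a hypothesis is represented as the set of instances it accepts, and the names `all_disjunctive_hypotheses` and `version_space` are hypothetical, not from the slides.

```python
from itertools import combinations, product

def all_disjunctive_hypotheses(instances):
    """Yield every subset of instances, i.e. every disjunction of instances."""
    for size in range(len(instances) + 1):
        for subset in combinations(instances, size):
            yield frozenset(subset)

def version_space(hypotheses, examples):
    """Return the hypotheses that agree with every (instance, is_positive) example."""
    v = list(hypotheses)                       # V := H
    for instance, is_positive in examples:
        # Eliminate any member of V that disagrees with the example.
        v = [h for h in v if (instance in h) == is_positive]
        if not v:
            raise ValueError("no hypothesis in H is consistent with the examples")
    return v

# Toy domain (assumed): N = 2 attributes, each with k = 2 values.
instances = list(product([0, 1], repeat=2))
H = list(all_disjunctive_hypotheses(instances))
V = version_space(H, [((0, 0), True), ((1, 1), False)])
print(len(H), len(V))
```

Even in this toy domain every hypothesis is stored explicitly, which is what makes the approach impractical for larger spaces.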

- PROBLEM: V is huge!!
- Suppose you have N attributes, each with k possible values
- Suppose you allow a hypothesis to be any disjunction of instances
- There are k^N possible instances, so |H| = 2^(k^N)
- If N=5 and k=2, |H| = 2^32!!
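As a quick check of the arithmetic, the count can be computed directly. The function name `hypothesis_space_size` is just an illustrative choice.

```python
def hypothesis_space_size(n_attributes, k_values):
    """Return (number of instances, number of disjunctive hypotheses)."""
    n_instances = k_values ** n_attributes      # k^N distinct instances
    return n_instances, 2 ** n_instances        # every subset of instances is a hypothesis

instances, hypotheses = hypothesis_space_size(5, 2)
print(instances, hypotheses)   # 32 4294967296
```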

- First Trick: Don’t allow arbitrary disjunctions
- Organize the feature values into a hierarchy of allowed disjunctions, e.g.

- any-color: dark or pale
- dark: black or blue
- pale: white or yellow

- Now there are only 7 “abstract values” instead of 16 disjunctive combinations (e.g., “black or white” isn’t allowed)

(r=1) v … v (r=10) v (r=J) v (r=Q) v (r=K) ⇔ ANY-RANK(r)
(r=1) v … v (r=10) ⇔ NUM(r)
(r=J) v (r=Q) v (r=K) ⇔ FACE(r)
(s=♠) v (s=♣) v (s=♦) v (s=♥) ⇔ ANY-SUIT(s)
(s=♠) v (s=♣) ⇔ BLACK(s)
(s=♦) v (s=♥) ⇔ RED(s)

- A hypothesis is any sentence of the form: R(r) ∧ S(s) ⇒ IN-CLASS([r,s])
- where:
- R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j)
- S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k)

- For simplicity, we represent a concept by rs, with:
- r ∈ {a, n, f, 1, …, 10, j, q, k}
- s ∈ {a, b, r, ♠, ♣, ♦, ♥}
- For example, n♠ represents: NUM(r) ∧ (s=♠) ⇒ IN-CLASS([r,s])
- aa represents: ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])

The extension of a hypothesis h is the set of objects that satisfies h

- Examples:
- The extension of f is: {j, q, k}
- The extension of aa is the set of all cards

- Let h1 and h2 be two hypotheses in H
- h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2

- Examples:
- aa is more general than f
- f is more general than q
- fr and nr are not comparable

- The inverse of the “more general” relation is the “more specific” relation
- The “more general” relation defines a partial ordering on the hypotheses in H
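The extension and the "more general" relation can be made concrete for the card hypotheses. The sketch below assumes a hypothesis is a pair (r, s) and spells suits out as words instead of symbols; the table and function names are hypothetical.

```python
RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
SUITS = ["spade", "club", "diamond", "heart"]

# Abstract rank values: a = any rank, n = numbered, f = face; plus each single rank.
RANK_CLASSES = {"a": set(RANKS), "n": set(RANKS[:10]), "f": {"j", "q", "k"}}
RANK_CLASSES.update({r: {r} for r in RANKS})

# Abstract suit values: a = any suit, b = black, r = red; plus each single suit.
SUIT_CLASSES = {"a": set(SUITS), "b": {"spade", "club"}, "r": {"diamond", "heart"}}
SUIT_CLASSES.update({s: {s} for s in SUITS})

def extension(h):
    """Set of cards (rank, suit) that satisfy hypothesis h = (r, s)."""
    r, s = h
    return {(rank, suit) for rank in RANK_CLASSES[r] for suit in SUIT_CLASSES[s]}

def more_general(h1, h2):
    """True iff the extension of h1 is a proper superset of the extension of h2."""
    return extension(h1) > extension(h2)

print(more_general(("a", "a"), ("f", "r")))   # True
print(more_general(("f", "r"), ("n", "r")))   # False
print(more_general(("n", "r"), ("f", "r")))   # False: fr and nr are not comparable
```

Because some pairs are incomparable in both directions, the relation is only a partial ordering.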

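[Figure: the rank and suit hierarchies, and part of the generality lattice over H running from aa at the top through na, ab, 4a, nb, 4b down to single cards]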

- A hypothesis in V is most general iff no hypothesis in V is more general
- G-boundary G of V: Set of most general hypotheses in V
- A hypothesis in V is most specific iff no hypothesis in V is more specific
- S-boundary S of V: Set of most specific hypotheses in V

[Figure: the card lattice with the G-boundary (aa) at the top and the S-boundary (single-card hypotheses) at the bottom]

Now suppose that 4 is given as a positive example

We replace every hypothesis in S whose extension does not contain 4 by its generalization set

Here, both G and S have size 1.

This is not the case in general!

[Figure: the lattice above 4, with the generalization set of 4 highlighted]

The generalization set of a hypothesis h is the set of the hypotheses that are immediately more general than h

Let 7 be the next (positive) example

Let 5 be the next (negative) example

[Figure: the lattice, with the specialization set of aa highlighted]

G and S, and all hypotheses in between form exactly the version space

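[Figure: the four hypotheses that remain in the version space]

At this stage …

Do 8, 6, j satisfy CONCEPT? The answer is Yes if every hypothesis in the version space accepts the card, No if none does, and Maybe otherwise.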

Let 2 be the next (positive) example

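[Figure: the version space shrinks to ab and nb]

Let j be the next (negative) example

Positive examples: 4, 7, 2. Negative examples: 5, j.

The version space converges to nb: NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])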

Let us return to the version space … and let 8 be the next (negative) example

The only most specific hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

Let us return to the version space … and let j be the next (positive) example

The only most general hypothesis disagrees with this example, so no hypothesis in H agrees with all examples

- x ← new example
- If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
- Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
- If G or S is empty then return failure

- POSITIVE-UPDATE(G,S,x):
- Eliminate all hypotheses in G that do not agree with x
- Minimally generalize all hypotheses in S until they are consistent with x, using the generalization sets of the hypotheses
- Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G (this step was not needed in the card example)
- Remove from S every hypothesis that is more general than another hypothesis in S
- Return (G,S)

- NEGATIVE-UPDATE(G,S,x):
- Eliminate all hypotheses in S that do not agree with x
- Minimally specialize all hypotheses in G until they are consistent with x
- Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
- Remove from G every hypothesis that is more specific than another hypothesis in G
- Return (G,S)
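The two update procedures can be sketched for the card domain as follows. This is a brute-force illustration, not an efficient implementation: minimal generalizations and specializations are found by scanning all of H. Suits are encoded as the letters S, C, D, H, all function names are hypothetical, and the suits in the example sequence are assumptions made for illustration.

```python
from itertools import product

RANKS = [str(n) for n in range(1, 11)] + ["j", "q", "k"]
SUITS = ["S", "C", "D", "H"]            # assumed encoding: spade, club, diamond, heart
RANK_SETS = {"a": frozenset(RANKS), "n": frozenset(RANKS[:10]), "f": frozenset("jqk"),
             **{r: frozenset([r]) for r in RANKS}}
SUIT_SETS = {"a": frozenset(SUITS), "b": frozenset("SC"), "r": frozenset("DH"),
             **{s: frozenset([s]) for s in SUITS}}
H = list(product(RANK_SETS, SUIT_SETS))  # every hypothesis (r, s)

def covers(h, card):
    return card[0] in RANK_SETS[h[0]] and card[1] in SUIT_SETS[h[1]]

def geq(h1, h2):
    """h1 is more general than or equal to h2."""
    return RANK_SETS[h1[0]] >= RANK_SETS[h2[0]] and SUIT_SETS[h1[1]] >= SUIT_SETS[h2[1]]

def gt(h1, h2):
    return h1 != h2 and geq(h1, h2)

def most_specific(hs):
    return {h for h in hs if not any(gt(h, o) for o in hs)}

def most_general(hs):
    return {h for h in hs if not any(gt(o, h) for o in hs)}

def positive_update(G, S, x):
    G = {g for g in G if covers(g, x)}
    # Minimally generalize each member of S that does not cover x.
    S = {h for s in S for h in ([s] if covers(s, x) else
         most_specific({c for c in H if geq(c, s) and covers(c, x)}))}
    S = {s for s in S if any(geq(g, s) for g in G)}
    return G, most_specific(S)

def negative_update(G, S, x):
    S = {s for s in S if not covers(s, x)}
    # Minimally specialize each member of G that covers x.
    G = {h for g in G for h in ([g] if not covers(g, x) else
         most_general({c for c in H if geq(g, c) and not covers(c, x)}))}
    G = {g for g in G if any(geq(g, s) for s in S)}
    return most_general(G), S

def learn(examples):
    G = {("a", "a")}                                  # most general hypothesis
    S = {(r, s) for r in RANKS for s in SUITS}        # all single-card hypotheses
    for card, positive in examples:
        G, S = (positive_update if positive else negative_update)(G, S, card)
        if not G or not S:
            return None                               # version space collapsed
    return G, S

examples = [(("4", "C"), True), (("7", "C"), True), (("5", "H"), False),
            (("2", "S"), True), (("j", "S"), False)]
print(learn(examples))   # ({('n', 'b')}, {('n', 'b')})
```

With these assumed suits, both boundaries converge to nb. Appending a negative 8 of clubs after the first three examples empties S, so `learn` returns `None`.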

- Suppose that at each step the learning procedure has the possibility to select the object (card) of the next example
- Let it pick the object such that, whether the example is positive or not, it will eliminate one-half of the remaining hypotheses
- Then a single hypothesis will be isolated in O(log |H|) steps
- But picking the object that eliminates half the version space may be expensive
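One way to picture the query-selection step is a sketch in which each hypothesis in the version space is represented as the set of objects it accepts (an assumption for illustration; `best_query` is a hypothetical name). The best query is the object whose answer splits the version space most evenly.

```python
def best_query(version_space, objects):
    """Pick the object accepted by as close to half of the hypotheses as possible."""
    half = len(version_space) / 2

    def imbalance(x):
        accepting = sum(1 for h in version_space if x in h)
        return abs(accepting - half)

    return min(objects, key=imbalance)

# Toy version space (assumed): four nested hypotheses over objects 0..3.
V = [frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2}), frozenset({0, 1, 2, 3})]
print(best_query(V, range(4)))   # 2
```

Each call tests every candidate object against every hypothesis, which is why the selection itself can be expensive when the version space is large.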

- If some examples are misclassified, the version space may collapse
- Possible solution: Maintain several G- and S-boundaries, e.g., consistent with all examples, all examples but one, etc.

- Useful for understanding logical (consistency-based) approaches to learning
- Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses)
- Don’t handle noise well

- Decision tree learning (DTL) is more efficient if all examples are given in advance; else, it may produce successive hypotheses, each poorly related to the previous one
- Version space learning (VSL) is incremental
- DTL can produce simplified hypotheses that do not agree with all examples
- DTL has been more widely used in practice

- Instance-based Learning (20.4)
- Statistical Learning (20.1)
- Neural Networks (20.5)

[Figure: the k nearest neighbors of a query point x, shown for k=1 and k=6]

- To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class among them

- Distance measure
- Most common: Euclidean
- Better distance measures: normalize each variable by standard deviation

- Choosing k
- Increasing k reduces variance, increases bias

- In high-dimensional spaces, the nearest neighbor may not be very close at all!
- Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets.
- Indexing the data can help; for example KD trees
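A minimal k-nearest-neighbor sketch, using Euclidean distance with each variable normalized by its standard deviation, might look as follows. It scans the whole training set for every query (no KD tree), and the function name and toy data are assumptions for illustration.

```python
import math
from collections import Counter
from statistics import pstdev

def knn_classify(train, x, k):
    """train: list of (vector, label) pairs; returns the majority label of the k nearest."""
    dims = range(len(x))
    # Per-variable standard deviation; fall back to 1.0 for a constant variable.
    scale = [pstdev([v[d] for v, _ in train]) or 1.0 for d in dims]

    def dist(u, v):
        return math.sqrt(sum(((u[d] - v[d]) / scale[d]) ** 2 for d in dims))

    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]   # full pass over data
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # A
```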

- Candy comes in two flavors: cherry and lime
- Candy is wrapped, can’t tell which flavor until opened
- There are 5 kinds of bags of candy:
- H1= all cherry
- H2= 75% cherry, 25% lime
- H3= 50% cherry, 50% lime
- H4= 25% cherry, 75% lime
- H5= 100% lime

- Given a new bag of candy, predict H
- Observations: D1, D2 ,D3, …

- Calculate the probability of each hypothesis, given the data, and make predictions weighted by this probability (i.e., use all the hypotheses, not just the single best)
- Now, if we want to predict some unknown quantity X: P(X|d) = Σi P(X|hi) P(hi|d)

- Calculating P(h|d): P(hi|d) ∝ P(d|hi) P(hi), i.e., likelihood × prior
- Assume the observations are i.i.d. (independent and identically distributed), so P(d|hi) = Πj P(dj|hi)

- Hypothesis Prior over h1, …, h5 is {0.1,0.2,0.4,0.2,0.1}
- Data:
- Q1: After seeing d1, what is P(hi|d1)?
- Q2: After seeing d1, what is P(d2= |d1)?
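The two questions can be worked through numerically. The sketch below assumes, purely for illustration, that the first candy d1 is lime; the priors and bag compositions are the ones listed above, and the function name `posterior` is hypothetical.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(lime | hi) for each kind of bag

def posterior(prior, likelihood):
    """Bayes' rule: normalize prior * likelihood over the hypotheses."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Q1: P(hi | d1), assuming d1 = lime.
post = posterior(priors, p_lime)
print([round(p, 3) for p in post])           # [0.0, 0.1, 0.4, 0.3, 0.2]

# Q2: P(d2 = lime | d1) = sum over i of P(lime | hi) * P(hi | d1).
p_next_lime = sum(l * p for l, p in zip(p_lime, post))
print(round(p_next_lime, 3))                 # 0.65
```

The prediction for d2 averages over all five hypotheses instead of committing to the single most probable bag.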

- Bayesian
- MAP – maximum a posteriori
- MLE – maximum likelihood; equivalent to MAP with a uniform prior over H

- aka Idiot Bayes
- particularly simple BN
- makes overly strong independence assumptions
- but works surprisingly well in practice…

we need:

- suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn
- suppose there are m boolean symptoms, E1, …, Em

how do we make diagnosis?

- Assume each piece of evidence (symptom) is independent given the diagnosis
- then P(di | e1, …, em) ∝ P(di) Πk P(ek | di)

what is the structure of the corresponding BN?

- possible diagnoses: Allergy, Cold and OK
- possible symptoms: Sneeze, Cough and Fever

my symptoms are: sneeze & cough, what is the diagnosis?

- aka parameter estimation
- we need
- P(di) – prior
- P(ek|di) – conditional probability

- use training data to estimate

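- use frequencies in training set to estimate:
- P(di) = n_di / N
- P(ek|di) = n_di,ek / n_di

where n_x is shorthand for the count of event x in the training set, and N is the total number of training examples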

what is:

P(Allergy)?

P(Sneeze| Allergy)?

P(Cough| Allergy)?

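- use smoothing to eliminate zeros:
- P(di) = (n_di + 1) / (N + n)
- P(ek|di) = (n_di,ek + 1) / (n_di + 2)

where N is the number of training examples, n is the number of possible values for d, and e is assumed to have 2 possible values

many other smoothing schemes…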
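A small naive Bayes sketch with add-one smoothing shows how the counts turn into a diagnosis. The six training records below are invented for illustration (the training table from the slides is not available), the function names are hypothetical, and unmentioned symptoms are treated as absent.

```python
from collections import Counter

def train_naive_bayes(examples, diagnoses, symptoms):
    """examples: list of (diagnosis, set of symptoms present)."""
    n_total = len(examples)
    n_d = Counter(d for d, _ in examples)
    n_de = Counter((d, e) for d, present in examples for e in present)
    # Add-one smoothing: +1 per count, +n diagnoses or +2 symptom values in the denominator.
    prior = {d: (n_d[d] + 1) / (n_total + len(diagnoses)) for d in diagnoses}
    cond = {(e, d): (n_de[(d, e)] + 1) / (n_d[d] + 2)
            for d in diagnoses for e in symptoms}
    return prior, cond

def diagnose(prior, cond, symptoms, present):
    """Return the normalized posterior over diagnoses given the symptoms present."""
    scores = {}
    for d in prior:
        score = prior[d]
        for e in symptoms:
            p = cond[(e, d)]
            score *= p if e in present else 1 - p
        scores[d] = score
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}

diagnoses = ["Allergy", "Cold", "OK"]
symptoms = ["Sneeze", "Cough", "Fever"]
examples = [("Allergy", {"Sneeze"}), ("Allergy", {"Sneeze", "Cough"}),   # invented data
            ("Cold", {"Sneeze", "Cough", "Fever"}), ("Cold", {"Cough"}),
            ("OK", set()), ("OK", set())]

prior, cond = train_naive_bayes(examples, diagnoses, symptoms)
post = diagnose(prior, cond, symptoms, present={"Sneeze", "Cough"})
print(max(post, key=post.get))   # Allergy
```

Without smoothing, P(Fever | Allergy) would be exactly zero on this data, and a single fever observation would rule out Allergy no matter what the other symptoms say.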

- Generally works well despite blanket assumption of independence
- Experiments show competitive with decision trees on some well known test sets (UCI)
- handles noisy data

- Two subproblems:
- learning structure: combinatorial search over space of networks
- learning parameter values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’

Neural Networks

- Brain function (thought) occurs as the result of the firing of neurons
- Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters
- Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds
- Learning occurs as a result of the synapses’ plasticity: They exhibit long-term changes in connection strength
- There are about 10^11 neurons and about 10^14 synapses in the human brain

- Different areas of the brain have different functions
- Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent
- Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly

- We don’t know how different functions are “assigned” or acquired
- Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
- Partly the result of experience (learning)

- We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought”
- Our neural networks are not nearly as complex or intricate as the actual brain structure

- Computers are way faster than neurons…
- But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel
- Neural networks are designed to be massively parallel
- The brain is effectively a billion times faster

- Neural networks are made up of nodes or units, connected by links
- Each link has an associated weight and activation level
- Each node has an input function (typically summing over weighted inputs), an activation function, and an output

- Neuron modeled as a unit j
- weight on the input from unit i to unit j: wji
- net input to unit j is: netj = Σi wji oi
- threshold Tj
- oj is 1 if netj > Tj, and 0 otherwise

- McCulloch and Pitts (1943) showed how LTUs can be used to compute logical functions
- AND?
- OR?
- NOT?
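One possible answer, sketched in Python: a single linear threshold unit with suitable weights and threshold computes each of the three functions. The particular weights and thresholds below are just one choice among many that work.

```python
def ltu(weights, threshold, inputs):
    """Linear threshold unit: output 1 if the weighted input sum exceeds the threshold."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net > threshold else 0

def logical_and(a, b):
    return ltu([1, 1], 1.5, [a, b])     # fires only when both inputs are 1

def logical_or(a, b):
    return ltu([1, 1], 0.5, [a, b])     # fires when at least one input is 1

def logical_not(a):
    return ltu([-1], -0.5, [a])         # fires only when the input is 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
print(logical_not(0), logical_not(1))   # 1 0
```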

- Two layers of LTUs can represent any boolean function

- Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, its weights can be changed incrementally to learn to produce these outputs using the perceptron learning rule
- assumes binary valued input/outputs
- assumes a single linear threshold unit

- If the target output for unit j is tj, update each weight by wji ← wji + η (tj − oj) oi, where η is the learning rate

- Equivalent to the intuitive rules:
- If output is correct, don’t change the weights
- If output is low (oj=0, tj=1), increment weights for all the inputs which are 1
- If output is high (oj=1, tj=0), decrement weights for all inputs which are 1
- Must also adjust threshold. Or equivalently assume there is a weight wj0 for an extra input unit that has o0=1

- Repeatedly iterate through examples adjusting weights according to the perceptron learning rule until all outputs are correct
- Initialize the weights to all zero (or random)
- Until outputs for all training examples are correct
- for each training example e do
- compute the current output oj
- compare it to the target tj and update weights

- each execution of outer loop is an epoch
- for multiple category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold
- Q: when will the algorithm terminate?
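The training loop can be sketched as below. It folds the threshold into a bias weight on a constant input of 1 and adds a `max_epochs` cap (an assumption, so the loop also stops on data that is not linearly separable); the function name and the learning rate `eta` are illustrative.

```python
def train_perceptron(examples, eta=1.0, max_epochs=100):
    """examples: list of (inputs, target) with binary values.

    Returns (weights, epochs_used); epochs_used is None if the cap was reached.
    """
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)                 # w[0] is the bias weight (input o0 = 1)
    for epoch in range(max_epochs):
        errors = 0
        for inputs, target in examples:
            x = [1] + list(inputs)
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if out != target:
                errors += 1
                # Perceptron rule: w <- w + eta * (t - o) * x
                w = [wi + eta * (target - out) * xi for wi, xi in zip(w, x)]
        if errors == 0:                        # every output correct for a full epoch
            return w, epoch + 1
    return w, None

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(train_perceptron(or_data))      # ([0.0, 1.0, 1.0], 4)
print(train_perceptron(xor_data)[1])  # None
```

On the OR data the loop stops after a few epochs. On XOR it never reaches an error-free epoch, so without the cap it would not terminate.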

Perceptron Video

- Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space

- Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969)
- Unfortunately, many functions (like parity) cannot be represented by a single LTU

Output units

Hidden units

Input units

- Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level
- Working forward through the network, the input function of each unit is applied to compute the input value
- Usually this is just the weighted sum of the activation on the links feeding into this node

- The activation function transforms this input value into a final value
- Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
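The forward computation described above can be sketched as follows. Each unit is assumed to be stored as a row `[bias, w1, w2, ...]`, and the weights in the example are hand-picked for illustration (they make a two-layer network approximate XOR, a parity function that a single LTU cannot represent).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    """Propagate inputs through the layers; each unit is a row [bias, w1, w2, ...]."""
    activations = list(inputs)
    for layer in layers:
        # Input function: weighted sum of incoming activations; then the sigmoid.
        activations = [sigmoid(row[0] + sum(w * a for w, a in zip(row[1:], activations)))
                       for row in layer]
    return activations

# Hand-picked weights (assumed): hidden units approximate OR and AND,
# and the output unit approximates "OR and not AND", i.e. XOR.
hidden = [[-5.0, 10.0, 10.0], [-15.0, 10.0, 10.0]]
output = [[-5.0, 10.0, -10.0]]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(forward([hidden, output], [a, b])[0]))
```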