Computational Learning Theory.
Outline: Introduction. The PAC Learning Framework. Finite Hypothesis Spaces. Examples of PAC Learnable Concepts.
Introduction. Computational learning theory (CLT) provides a theoretical analysis of learning.
CLT typically comprises three areas:
Sample Complexity. How many examples do we need to find a good hypothesis?
Computational Complexity. How much computational power do we need to find a good hypothesis?
Mistake Bound. How many mistakes will we make before finding a good hypothesis?
Let's start with a simple problem:
Assume a two-dimensional input space (R^2) containing positive and negative examples. Our goal is to find a rectangle that includes the positive examples but excludes the negative ones:
[Figure: the true concept, a rectangle in R^2 enclosing the positive (+) examples.]
The true error:
[Figure: true concept c and hypothesis h shown as overlapping rectangles; Region A is the part of c not covered by h, Region B is the part of h not covered by c.]
The true error of h is the probability that a randomly drawn example falls in region A or region B.
Region A: false negatives (points in c but not in h).
Region B: false positives (points in h but not in c).
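In symbols (standard PAC notation, not spelled out in the slides; D is the distribution over the input space, introduced formally below):
error_D(h) = Pr_{x ~ D} [ c(x) != h(x) ]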
PAC learning stands for "probably approximately correct."
Roughly, a class of concepts C (defined over an input space with examples of size n) is PAC learnable by a learning algorithm L if, for arbitrarily small δ and ε, for all concepts c in C, and for all distributions D over the input space, the hypothesis h selected from space H by L is, with probability at least 1 - δ, approximately correct (has error less than ε). L must run in time polynomial in 1/ε, 1/δ, n, and the size of c.
[Figure: true concept c and the most specific hypothesis h, the tightest rectangle around the positive (+) examples.]
Assume a learning algorithm L that outputs the most specific rectangle (the tightest rectangle whose border touches the outermost positive examples).
Question: Is this class of problems (rectangles in R^2) PAC learnable by L?
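A minimal sketch of such a most-specific learner, assuming axis-aligned rectangles and points given as (x, y) pairs; the function and variable names are illustrative, not from the original slides:

def most_specific_rectangle(examples):
    # examples: list of ((x, y), label) pairs, label True for '+' and False for '-'
    # returns the tightest axis-aligned rectangle around the positive examples,
    # as (x_min, x_max, y_min, y_max), or None if there are no positives
    positives = [point for point, label in examples if label]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rectangle, point):
    # classify a point as positive iff it falls inside the learned rectangle
    if rectangle is None:
        return False
    x_min, x_max, y_min, y_max = rectangle
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max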
[Figure: the error region, the area between the most specific hypothesis h and the true concept c.]
The error is the probability of the area between h and the true target rectangle c. How many examples do we need to make this error less than ε?
In general, the probability that m independent examples have NOT fallen within the error region is (1 - ε)^m, which we want to be less than δ.
In other words, we want:
(1 - ε)^m <= δ
Since (1 - x) <= e^(-x), we have
e^(-εm) <= δ, or
m >= (1/ε) ln(1/δ)
The result grows linearly in 1/ε and logarithmically in 1/δ.
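For example, with ε = 0.1 and δ = 0.05 (illustrative values, not from the slides), the bound gives m >= 10 ln 20 ≈ 30 examples. A quick check in Python:

import math

epsilon, delta = 0.1, 0.05  # illustrative accuracy and confidence parameters
m = math.ceil((1 / epsilon) * math.log(1 / delta))
print(m)  # prints 30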
The analysis above can be applied by considering each of the four stripes of the error region. Can you finish the analysis by considering each stripe separately and then combining their probabilities?
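A sketch of that argument, following the standard treatment of this example (the details are not in the original slides): if the error of h is greater than ε, then at least one of the four stripes must have probability mass greater than ε/4. The probability that m independent examples all miss one particular stripe of mass at least ε/4 is at most (1 - ε/4)^m <= e^(-εm/4), so by the union bound over the four stripes the probability that the error exceeds ε is at most 4 e^(-εm/4). Requiring this to be at most δ gives
m >= (4/ε) ln(4/δ)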
Definition:
A consistent learner outputs a hypothesis h in H that perfectly fits the training examples (whenever such a hypothesis exists).
How many examples do we need so that the hypothesis output by a consistent learner is, with high probability, approximately correct (has low error)?
This is the same as asking how many examples we need to make the version space contain no hypothesis with error greater than ε.
When a version space VS is such that no hypothesis in it has error greater than ε, we say the version space is ε-exhausted.
How many examples do we need to make a version space VS ε-exhausted?
The probability that the version space is not ε-exhausted after seeing m examples is the probability that some hypothesis with error greater than ε is still consistent with all m examples. Each such hypothesis remains consistent with probability at most (1 - ε)^m <= e^(-εm), and there are at most |H| such hypotheses, so that probability is at most
|H| e^(-εm)
If we make this less than δ, then we have
m >= (1/ε) (ln|H| + ln(1/δ))
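A small helper for this bound (the function name and the sample values are illustrative, not from the slides):

import math

def sample_complexity(h_size, epsilon, delta):
    # examples needed so that a consistent learner is probably approximately correct:
    # m >= (1/epsilon) * (ln|H| + ln(1/delta))
    return math.ceil((1 / epsilon) * (math.log(h_size) + math.log(1 / delta)))

print(sample_complexity(h_size=1000, epsilon=0.1, delta=0.05))  # prints 100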
What happens if our hypothesis space H does not contain the target concept c?
Then clearly we can never find a hypothesis h with zero error.
In that case we want an algorithm that simply outputs the hypothesis with minimum training error.
Can we say that concepts described by conjunctions of Boolean literals are PAC learnable?
First, how large is the hypothesis space when we have n Boolean attributes?
Answer: |H| = 3^n (each attribute can appear as a positive literal, as a negated literal, or not at all).
If we substitute this in our analysis of sample complexity for finite hypothesis spaces we have:
m >= (1/ε) (n ln 3 + ln(1/δ))
Thus the set of conjunctions of Boolean literals is
PAC learnable.
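For example, with n = 10 attributes, ε = 0.1, and δ = 0.05 (illustrative values), the bound gives m >= 10 (10 ln 3 + ln 20) ≈ 140 examples; equivalently, sample_complexity(3 ** 10, 0.1, 0.05) with the helper sketched above returns 140.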
Consider now the class of k-term DNF expressions.
These are expressions of the form
T1 V T2 V ... V Tk, where V stands for disjunction and each term Ti is a conjunction of literals over the n Boolean attributes.
The size of H is at most 3^(nk), since each of the k terms is one of the 3^n possible conjunctions. Using the equation for the sample complexity of finite hypothesis spaces:
m >= (1/ε) (nk ln 3 + ln(1/δ))
Although the sample complexity is polynomial in the main parameters, the problem of finding a consistent k-term DNF hypothesis is known to be NP-complete, so this class is not efficiently PAC learnable.
But it is interesting to see that another family of functions, the class of k-CNF expressions (conjunctions of clauses, each clause a disjunction of at most k literals), is PAC learnable.
This is interesting because the class of k-CNF expressions is strictly larger than the class of k-term DNF expressions.
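A brief illustration of that inclusion (a standard distributivity argument, not spelled out in the slides): any k-term DNF can be rewritten as a k-CNF by distributing the disjunction over the conjunctions. For k = 2, writing ^ for conjunction:
(a ^ b) V (c ^ d) = (a V c) ^ (a V d) ^ (b V c) ^ (b V d)
Each resulting clause picks one literal from each term, so it has at most k literals; hence every k-term DNF expression is also a k-CNF expression, while the converse does not hold in general.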