Computational Learning Theory and Kernel Methods


Tianyi Jiang, March 8, 2004

General Research Question

“Under what conditions is successful learning possible and impossible?”

“Under what conditions is a particular learning algorithm assured of learning successfully?”

-Mitchell, ‘97

Computational Learning Theory
• Sample Complexity
• Computational Complexity
• Mistake Bound
-Mitchell, ‘97
Problem Setting

Instance Space: X, with a stationary distribution D

Concept Class: C, s.t. $c: X \to \{0,1\}$

Hypothesis Space: H

General Learner: L

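Error of a Hypothesis

[Figure: instance space X with positive (+) and negative (-) instances and the regions covered by c and h; the error is the region where c and h disagree.]
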
PAC Learnability

True Error: $\mathrm{error}_D(h) \equiv \Pr_{x \sim D}[c(x) \neq h(x)]$

• Difficulties in getting 0 error:
• Multiple hypotheses may be consistent with the training examples
• Training examples can mislead the Learner
PAC-Learnable

Learner L will output a hypothesis h with probability at least $(1-\delta)$ s.t.

$\mathrm{error}_D(h) \le \epsilon$

in time that is polynomial in $1/\epsilon$, $1/\delta$, n, and size(c),

where

n = size of a training example

size(c) = encoding length of c in C

Consistent Learner & Version Space

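Consistent Learner: outputs hypotheses that perfectly fit the training data whenever possible

Version Space: $VS_{H,E} = \{h \in H \mid \forall \langle x, c(x) \rangle \in E,\ h(x) = c(x)\}$, where E is the set of training examples

$VS_{H,E}$ is $\epsilon$-exhausted with respect to c and D if:

$\forall h \in VS_{H,E},\ \mathrm{error}_D(h) < \epsilon$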

Version Space

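[Figure: Hypothesis space H ($\epsilon$ = .21) containing six hypotheses, each labeled with its true error and training error r: (error=.3, r=.4), (error=.1, r=.2), (error=.2, r=0), (error=.1, r=0), (error=.3, r=.1), (error=.2, r=.3). The hypotheses with r=0 make up $VS_{H,E}$.]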
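The figure's numbers can be checked mechanically. The sketch below is illustrative only: it takes the six (error, r) pairs listed for the figure, treats the version space as the hypotheses with training error r = 0, and tests whether every one of them has true error below a chosen epsilon. The function names are hypothetical.

```python
# (true error, training error r) for each hypothesis in the figure.
hypotheses = [(0.3, 0.4), (0.1, 0.2), (0.2, 0.0), (0.1, 0.0), (0.3, 0.1), (0.2, 0.3)]

def version_space(hyps):
    """Hypotheses consistent with the training data, i.e. training error r == 0."""
    return [h for h in hyps if h[1] == 0.0]

def is_epsilon_exhausted(vs, epsilon):
    """True if every hypothesis left in the version space has true error < epsilon."""
    return all(err < epsilon for err, _ in vs)

vs = version_space(hypotheses)
print(vs)
print(is_epsilon_exhausted(vs, 0.21))  # epsilon used in the figure
print(is_epsilon_exhausted(vs, 0.15))  # a stricter epsilon
```

With epsilon = .21 the version space is exhausted, since both remaining hypotheses have error below .21; with a stricter epsilon such as .15 it is not, because the hypothesis with error = .2 is still consistent with the training data.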

Sample Complexity for Finite Hypothesis Spaces

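Theorem ($\epsilon$-exhausting the version space):

If H is finite, the probability that $VS_{H,E}$ is NOT $\epsilon$-exhausted (with respect to c) is at most:

$|H|e^{-\epsilon m}$

where E is a sequence of $m \ge 1$ independent, randomly drawn examples of some target concept c, and $0 \le \epsilon \le 1$

Upper bound on sufficient number of training examples

If we set the probability of failure below some level $\delta$:

$|H|e^{-\epsilon m} \le \delta$

then…

$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$

… however, this bound is too loose because of the |H| term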

Agnostic Learning

What if concept $c \notin H$?

Agnostic Learner: simply finds the h with min. training error

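Find a bound on the number of examples m sufficient to ensure, with probability at least $(1-\delta)$, that

$\mathrm{error}_D(h_{best}) \le \mathrm{error}_E(h_{best}) + \epsilon$

where $h_{best}$ = h with lowest training error

Upper bound on sufficient number of training examples when $\mathrm{error}_E(h_{best}) \neq 0$

From Chernoff Bounds, we have:

$\Pr[\mathrm{error}_D(h) > \mathrm{error}_E(h) + \epsilon] \le e^{-2m\epsilon^2}$

then…

$\Pr[(\exists h \in H)\ \mathrm{error}_D(h) > \mathrm{error}_E(h) + \epsilon] \le |H|e^{-2m\epsilon^2}$

thus…

$m \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln(1/\delta)\right)$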
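As a rough illustration, the agnostic bound for a finite hypothesis space, in the form given in Mitchell's text, $m \ge \frac{1}{2\epsilon^2}(\ln|H| + \ln(1/\delta))$, can be evaluated directly. The function name and the sample values (|H| = 1000, epsilon = .1, delta = .05) are hypothetical and chosen only for the demonstration.

```python
import math

def agnostic_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1 / (2 * epsilon^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2))

# Hypothetical values, not taken from the slides.
m = agnostic_sample_bound(h_size=1000, epsilon=0.1, delta=0.05)
print(m)
```

Note the $1/\epsilon^2$ dependence: halving epsilon roughly quadruples the number of examples, whereas the bound for a consistent learner grows only as $1/\epsilon$.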

Example:

Given a consistent learner and a target concept of conjunctions of up to 10 Boolean literals, how many training examples are needed to learn a hypothesis with error < .1 95% of the time?

$|H| = 3^{10}$

$\epsilon = .1$

$\delta = .05$

Sample Complexity for Infinite Hypothesis Spaces

Consider a subset of instances $S \subseteq X$,

and $h \in H$ s.t. h imposes a dichotomy on S:

2 subsets: $\{x \in S \mid h(x)=1\}$ & $\{x \in S \mid h(x)=0\}$

Thus for any instance set S, there are $2^{|S|}$ possible dichotomies.

Definition: A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some $h \in H$ consistent with this dichotomy

Vapnik-Chervonenkis Dimension

Definition: VC(H) is the size of the largest finite subset of X shattered by H.

If arbitrarily large finite sets of X can be shattered by H, then $VC(H) = \infty$

For any finite H, $VC(H) \le \log_2|H|$

Example of VC Dimension

Along a line…

In a plane…
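The transcript does not say which hypothesis class the "along a line" figure used. Assuming the usual textbook choice of closed intervals [a, b] on the real line, a brute-force check of shattering might look like the sketch below. It enumerates every labeling an interval can produce on a set of distinct points and compares the count with $2^{|S|}$. All names are hypothetical.

```python
def realizable_dichotomies(points):
    """All labelings of `points` produced by some closed interval [a, b].

    It is enough to try intervals whose endpoints are points of the set,
    plus the empty interval, because any other interval labels the set
    the same way as one of these.
    """
    labelings = {tuple(False for _ in points)}  # empty interval
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(tuple(a <= x <= b for x in points))
    return labelings

def is_shattered_by_intervals(points):
    """True if intervals realize every one of the 2^|S| dichotomies."""
    return len(realizable_dichotomies(points)) == 2 ** len(points)

print(is_shattered_by_intervals([1.0, 2.0]))       # True
print(is_shattered_by_intervals([1.0, 2.0, 3.0]))  # False
```

Two points can be shattered, but three cannot, because no interval contains the outer two points while excluding the middle one. Under this assumption the VC dimension of intervals on a line is 2.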

VC dimensions in $\mathbb{R}^n$

Theorem: Consider some set of m points in $\mathbb{R}^n$. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes iff the position vectors of the remaining points are linearly independent.

So the VC dimension of the set of oriented hyperplanes in $\mathbb{R}^{10}$ is ?

(Answer: n + 1 = 11, since at most 10 position vectors in $\mathbb{R}^{10}$ can be linearly independent.)

Bounds on m with VC Dimension

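Upper Bound:

$m \ge \frac{1}{\epsilon}\left(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\epsilon)\right)$

$VC(H) \le \log_2|H|$

Lower Bound:

$m \ge \max\left[\frac{1}{\epsilon}\log(1/\delta),\ \frac{VC(C)-1}{32\epsilon}\right]$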
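For a sense of scale, the sketch below evaluates the VC-based upper bound in the form given in Mitchell's text (attributed there to Blumer et al.), $m \ge \frac{1}{\epsilon}(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\epsilon))$. Whether the slide showed exactly this form is an assumption, and the function name and sample values are hypothetical.

```python
import math

def vc_sample_bound(vc_dim, epsilon, delta):
    """Smallest integer m satisfying the VC-dimension upper bound on sample size."""
    total = 4.0 * math.log2(2.0 / delta) + 8.0 * vc_dim * math.log2(13.0 / epsilon)
    return math.ceil(total / epsilon)

# Hypothetical example: a hypothesis space with VC dimension 11.
m = vc_sample_bound(vc_dim=11, epsilon=0.1, delta=0.05)
print(m)
```

The bound grows linearly in VC(H), so it stays finite for infinite hypothesis spaces, where a bound based on ln|H| is unusable.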

Mistake Bound Model of Learning

“How many mistakes will the learner make in its predictions before it learns the target concept?”

The best algorithm in the worst-case scenario (hardest target concept, hardest training sequence) will make Opt(C) mistakes, where

$VC(C) \le Opt(C) \le \log_2|C|$

Linear Support Vector Machines

Consider a binary classification problem:

Training data: $\{x_i, y_i\}$, $i = 1, \dots, l$; $y_i \in \{-1, +1\}$; $x_i \in \mathbb{R}^d$

Points x that lie on the separating hyperplane satisfy:

$w \cdot x + b = 0$

where w is normal to the hyperplane

$|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin

$\|w\|$ is the Euclidean norm of w

Linear Support Vector Machine, Definitions

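Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example

Margin of a separating hyperplane $= d_+ + d_- = 1/\|w\| + 1/\|w\| = 2/\|w\|$

Constraints:

$x_i \cdot w + b \ge +1$ for $y_i = +1$

$x_i \cdot w + b \le -1$ for $y_i = -1$

Combined: $y_i(x_i \cdot w + b) - 1 \ge 0 \quad \forall i$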

Problem of Maximizing the Margins

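$H_1$ and $H_2$ (the hyperplanes $x \cdot w + b = +1$ and $x \cdot w + b = -1$) are parallel, with no training points in between

Thus we reformulate the problem as:

Maximize the margin by minimizing $\|w\|^2$

s.t. $y_i(x_i \cdot w + b) - 1 \ge 0 \quad \forall i$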
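To make the reformulation concrete, here is a minimal sketch that checks the standard hard-margin constraints $y_i(x_i \cdot w + b) - 1 \ge 0$ for a candidate (w, b) and reports the margin $2/\|w\|$. The toy data and the candidate hyperplane are hypothetical, chosen by hand rather than found by an optimizer.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w):
    """Distance between H1 and H2 for a hyperplane scaled to satisfy the constraints."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, data):
    """True if y_i * (x_i . w + b) - 1 >= 0 for every training point."""
    return all(y * (dot(w, x) + b) - 1.0 >= 0.0 for x, y in data)

# Hypothetical, linearly separable toy data: (point, label).
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

w, b = (0.5, 0.5), -1.0
print(satisfies_constraints(w, b, data))
print(margin(w))

# Shrinking w would enlarge 2/||w||, but the constraints then fail.
print(satisfies_constraints((0.25, 0.25), -0.5, data))
```

This shows why the constraints matter: minimizing $\|w\|^2$ without them would drive w to zero, while with them the smallest feasible $\|w\|$ gives the widest gap that still contains no training points.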

Ties to Least Squares

[Figure: plot with axes x and y; label b.]

Loss Function:

Lagrangian Formulation
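• Transform constraints into Lagrange multipliers
• Training data will only appear in dot product form
• Let $\alpha_i$, $i = 1, \dots, l$, be positive Lagrange multipliers

We have the Lagrangian:

$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$

Transform the convex quadratic programming problem

Observations: minimizing $L_P$ w.r.t. w and b, while simultaneously requiring that the derivatives of $L_P$ w.r.t. all the $\alpha_i$ vanish, subject to $\alpha_i \ge 0$, is a convex quadratic programming problem that can be easily solved in its Dual form

$L_P$’s Dual: Maximize $L_P$, subject to the constraints that the gradients of $L_P$ w.r.t. w and b vanish, and $\alpha_i \ge 0$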
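Carrying out the dual step explicitly may help. The sketch below assumes the primal Lagrangian has the standard hard-margin form $L_P = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i y_i (x_i \cdot w + b) + \sum_i \alpha_i$, as in Burges' SVM tutorial, which these slides appear to follow. Setting the gradients to zero and substituting back gives:

```latex
\begin{align*}
\frac{\partial L_P}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L_P}{\partial b} = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
L_D &= \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\end{align*}
```

The training data enter $L_D$ only through the dot products $x_i \cdot x_j$, which is the property that makes a kernel substitution possible.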

• There is a Lagrange multiplier $\alpha_i$ for every training point
• In the solution, points for which $\alpha_i > 0$ are called “support vectors”. They lie on either $H_1$ or $H_2$
• Support vectors are the critical elements of the training set; they lie closest to the “boundary”
• If all other points are removed or moved around (but not crossing $H_1$ or $H_2$), the same separating hyperplane would be found
Prediction
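• Solving the SVM problem is equivalent to finding a solution to the Karush-Kuhn-Tucker (KKT) conditions
• (The KKT conditions are satisfied at the solution of any constrained optimization problem, provided a regularity condition on the constraints holds; it always holds for SVMs because the constraints are linear)
• Once we have solved for w and b, we predict the class of x to be $\mathrm{sign}(w \cdot x + b)$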
Linear SVM: The Non-Separable Case

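We account for outliers by introducing slack variables $\xi_i \ge 0$ into the constraints:

$x_i \cdot w + b \ge +1 - \xi_i$ for $y_i = +1$

$x_i \cdot w + b \le -1 + \xi_i$ for $y_i = -1$

We penalize outliers by changing the cost function to:

$\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$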
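A small numeric sketch of the slack idea follows. It assumes the common linear-penalty form of the cost, $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$, and uses the smallest slack that satisfies each relaxed constraint, $\xi_i = \max(0,\ 1 - y_i(x_i \cdot w + b))$. The data, the hyperplane, and the value of C are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, data):
    """Smallest xi_i >= 0 such that y_i * (x_i . w + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]

def soft_margin_cost(w, b, data, C):
    """Assumed cost: 0.5 * ||w||^2 + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))

# Hypothetical data: the last positive point sits on the negative side of the hyperplane.
data = [((2.0, 2.0), +1), ((0.0, 0.0), -1), ((0.5, 0.5), +1)]
w, b = (0.5, 0.5), -1.0

print(slacks(w, b, data))
print(soft_margin_cost(w, b, data, C=1.0))
```

Points that satisfy the original constraints get zero slack, so only the outlier contributes to the penalty. A larger C makes violations more expensive relative to a wide margin.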

Linear SVM Classification Examples

Linearly Separable

Linearly Non-Separable

Nonlinear SVM

Observation: data appear only as dot products in the training problem

So we can use a mapping function $\Phi$ to map the data into a high dimensional space $\mathcal{H}$ where the points can be linearly separable:

$\Phi: \mathbb{R}^d \to \mathcal{H}$

To make things easier, we define a kernel function K s.t.

$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$

Nonlinear SVM (cont.)

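Kernel functions can compute dot products in the high dimensional space without explicitly working with $\Phi$

Example:

Rather than computing w, we make a prediction on x via:

$f(x) = \sum_{i=1}^{N_S} \alpha_i y_i \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N_S} \alpha_i y_i K(s_i, x) + b$

where the $s_i$ are the $N_S$ support vectors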

Example of $\Phi$ mapping

Image, in $\mathcal{H}$, of the square $[-1,1] \times [-1,1] \subset \mathbb{R}^2$ under the mapping $\Phi$
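The transcript does not preserve which mapping the figure used. A common textbook choice for inputs in $\mathbb{R}^2$, assumed here, is $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, which corresponds to the kernel $K(x, y) = (x \cdot y)^2$. The sketch below checks numerically that the kernel gives the same value as the explicit dot product in the mapped space. The function names and test points are hypothetical.

```python
import math

def phi(x):
    """Assumed explicit mapping from R^2 into R^3."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed entirely in the input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in the mapped space
implicit = poly2_kernel(x, y)                          # same value, no mapping needed
print(explicit, implicit)
```

The two values agree up to floating-point rounding, which is the kernel trick in miniature: the mapped space is never constructed when the kernel is used.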

Example Kernel Functions

Kernel functions must satisfy Mercer’s condition, or, simply, the Hessian matrix

$H_{ij} = y_i y_j K(x_i, x_j)$

must be positive semidefinite (non-negative eigenvalues).

Example Kernels:

Linearly Separable

Linearly Non-Separable
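The specific kernels shown on the slide are not preserved in the transcript. As a hedged illustration, the sketch below defines two widely used kernels, polynomial and Gaussian RBF, and runs a spot check of positive semidefiniteness on their Gram matrices by evaluating $z^T K z$ for random vectors z. This is only a necessary-condition check, not a proof, and all names, parameters, and sample points are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y, p=2):
    """Polynomial kernel (x . y + 1)^p."""
    return (dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]

def passes_psd_spot_check(matrix, trials=200, seed=0):
    """Return False if some random z gives z^T M z < 0 (beyond rounding error)."""
    rng = random.Random(seed)
    n = len(matrix)
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        quad = sum(z[i] * matrix[i][j] * z[j] for i in range(n) for j in range(n))
        if quad < -1e-9:
            return False
    return True

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(passes_psd_spot_check(gram_matrix(poly_kernel, points)))
print(passes_psd_spot_check(gram_matrix(rbf_kernel, points)))
print(passes_psd_spot_check([[0.0, 1.0], [1.0, 0.0]]))  # a symmetric matrix that is not PSD
```

Both kernel Gram matrices pass, while the last matrix, which has an eigenvalue of -1, is rejected.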

Multi-Class SVM
• One-against-all
• One-against-one (majority vote)
• One-against-one (DAGSVM)
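The one-against-one scheme with majority voting can be sketched as follows. One binary classifier is assumed per pair of classes, each returning one of its two class labels, and the class with the most votes wins. The pairwise classifiers here are hypothetical nearest-center stand-ins on 1-D inputs, not trained SVMs.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(x, classes, pairwise):
    """Majority vote over all pairwise classifiers; pairwise[(a, b)](x) returns a or b."""
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained binary SVMs: pick the nearer class center.
centers = {"A": 0.0, "B": 5.0, "C": 10.0}

def make_pairwise(a, b):
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

classes = sorted(centers)
pairwise = {(a, b): make_pairwise(a, b) for a, b in combinations(classes, 2)}

print(one_against_one_predict(4.0, classes, pairwise))  # B
print(one_against_one_predict(9.0, classes, pairwise))  # C
```

With k classes this scheme needs k(k-1)/2 binary classifiers, compared with k for one-against-all. Ties in the vote need an explicit rule; this sketch simply takes the first class to reach the top count.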
Global Solution and Uniqueness
• Every local solution is also global (a property of any convex programming problem)
• The solution is guaranteed to be unique if the objective function is strictly convex (Hessian matrix is positive definite)
Complexity and Scalability
• Curse of dimensionality:
• The proliferation of parameters causes intractable complexity
• The proliferation of parameters causes overfitting
• SVMs circumvent these via the use of:
• Kernel functions (the kernel trick), which are evaluated in $O(d_L)$, where $d_L$ is the dimension of the input space
• Support vectors, which focus on the “boundary”
Structural Risk Minimization

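Empirical Risk: $R_{emp}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$

Expected Risk: $R(\alpha) = \int \frac{1}{2}|y - f(x, \alpha)|\, dP(x, y)$

where $f(x, \alpha)$ is the learned function and $\alpha$ here denotes its adjustable parameters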

Structural Risk Minimization

Nested subsets of functions, ordered by VC dimension
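A rough sketch of how the nested subsets are compared follows. It assumes Vapnik's bound in the form given in Burges' tutorial: with probability $1 - \eta$, the expected risk is at most the empirical risk plus the VC confidence term $\sqrt{(h(\ln(2l/h) + 1) - \ln(\eta/4))/l}$, where h is the VC dimension and l the number of training examples. Whether the slides used this exact form is an assumption, and the candidate subsets and their empirical risks below are hypothetical.

```python
import math

def vc_confidence(h, l, eta):
    """VC confidence term for VC dimension h, sample size l, confidence 1 - eta."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)

def pick_by_srm(candidates, l, eta):
    """Choose the (name, h, empirical_risk) triple minimizing the risk bound."""
    return min(candidates, key=lambda c: c[2] + vc_confidence(c[1], l, eta))

# Hypothetical nested subsets: richer classes fit the data better but pay a larger penalty.
candidates = [("small", 10, 0.20), ("medium", 100, 0.02), ("large", 1000, 0.01)]
best = pick_by_srm(candidates, l=10000, eta=0.05)
print(best[0])
```

The choice balances the two terms: the smallest class has a low penalty but a high empirical risk, and the largest class has the reverse, so the bound is minimized by an intermediate class in this example.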