
Bounds on the Rate of Convergence of Learning Processes (Chapter 3)




  1. In the Name of God • Statistical Learning Theory • Bounds on the Rate of Convergence of Learning Processes (Chapter 3) • Supervisor: Dr. Shiry • Presented by: L. Pour Mohammad Bagher • Author: Vladimir N. Vapnik

  2. Introduction In this chapter: • we consider upper bounds on the rate of uniform convergence (lower bounds are not as important for controlling learning processes as the upper bounds).

  3. Introduction Bounds on the rate of convergence: • Distribution-dependent bounds (based on the annealed entropy function) • Distribution-independent bounds (based on the growth function) • Both kinds of bounds are nonconstructive; using the VC dimension of the set of functions (a scalar value that can be evaluated for any set of functions) one obtains constructive distribution-independent bounds.

  4. THE BASIC INEQUALITIES • $Q(z,\alpha),\ \alpha \in \Lambda$ : a set of indicator functions • $H^{\Lambda}(l)$ : the corresponding VC entropy • $H^{\Lambda}_{ann}(l)$ : the annealed entropy • $G^{\Lambda}(l)$ : the growth function • Theorems 3.1 and 3.2, reconstructed below, state the basic inequalities in the theory of bounds.
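The definitions and theorem statements were images in the original slides; they are reconstructed here from Vapnik's Chapter 3, with $N^{\Lambda}(z_1,\dots,z_l)$ denoting the number of different separations of a sample of size $l$ by the indicator functions of the set:

\[ H^{\Lambda}(l) = E \ln N^{\Lambda}(z_1,\dots,z_l), \qquad H^{\Lambda}_{ann}(l) = \ln E\, N^{\Lambda}(z_1,\dots,z_l), \qquad G^{\Lambda}(l) = \ln \sup_{z_1,\dots,z_l} N^{\Lambda}(z_1,\dots,z_l). \]

Theorem 3.1:
\[ P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - R_{emp}(\alpha) \right| > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{H^{\Lambda}_{ann}(2l)}{l} - \varepsilon^{2} \right) l \right\}. \]

Theorem 3.2:
\[ P\left\{ \sup_{\alpha \in \Lambda} \frac{R(\alpha) - R_{emp}(\alpha)}{\sqrt{R(\alpha)}} > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{H^{\Lambda}_{ann}(2l)}{l} - \frac{\varepsilon^{2}}{4} \right) l \right\}. \]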

  5. THE BASIC INEQUALITIES • The bounds are nontrivial if $\lim_{l \to \infty} H^{\Lambda}_{ann}(l)/l = 0$. (In Chapter 2 we called this condition the second milestone of learning theory.)

  6. THE BASIC INEQUALITIES • Theorem 3.1 estimates the rate of uniform convergence with respect to the norm of the deviation between probability and frequency. • The maximal difference occurs for the events with maximal variance. • In the Bernoulli case the variance is $\sigma^{2} = p(1-p)$; therefore the maximum of the variance is achieved for the events with probability $p = 1/2$: the largest deviations are associated with functions that possess large risk.
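A one-line check of the variance claim (a standard calculus step, not on the slide):

\[ \sigma^{2}(p) = p(1-p), \qquad \frac{d\sigma^{2}}{dp} = 1 - 2p = 0 \;\Rightarrow\; p = \frac{1}{2}, \qquad \sigma^{2}_{\max} = \frac{1}{4}. \]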

  7. THE BASIC INEQUALITIES • Theorem 3.2 considers relative uniform convergence $\sup_{\alpha \in \Lambda} \left( R(\alpha) - R_{emp}(\alpha) \right) / \sqrt{R(\alpha)}$ (from it we obtain a bound on the risk in which the confidence interval is determined by the rate of uniform convergence). • For small empirical risk, the upper bound on the risk obtained using Theorem 3.2 is much better than the upper bound obtained on the basis of Theorem 3.1, as the derivation below shows.
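To see this, equate the right-hand side of Theorem 3.2 to $\eta$ and solve the resulting quadratic inequality in $\sqrt{R(\alpha)}$; with probability at least $1-\eta$, simultaneously for all functions,

\[ R(\alpha) \le R_{emp}(\alpha) + \frac{\varepsilon^{2}}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}(\alpha)}{\varepsilon^{2}}} \right), \qquad \varepsilon^{2} = 4\, \frac{H^{\Lambda}_{ann}(2l) - \ln(\eta/4)}{l}. \]

(This $\varepsilon^{2}$ is the quantity denoted $\mathcal{E}$ on slide 16.) For $R_{emp}(\alpha) = 0$ the confidence interval shrinks to order $\varepsilon^{2}$, whereas Theorem 3.1 gives only order $\varepsilon$.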

  8. THE BASIC INEQUALITIES • The bounds obtained in Theorems 3.1 and 3.2 are distribution-dependent. • To construct distribution-independent bounds it is sufficient to note that the growth function is not less than the annealed entropy. • For any distribution function F(z):
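The chain of inequalities was an image on the slide; by Jensen's inequality and by taking the supremum over samples it reads

\[ H^{\Lambda}(l) \le H^{\Lambda}_{ann}(l) \le G^{\Lambda}(l), \]

so replacing the annealed entropy by the growth function in Theorems 3.1 and 3.2 yields distribution-independent bounds, for example

\[ P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - R_{emp}(\alpha) \right| > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{G^{\Lambda}(2l)}{l} - \varepsilon^{2} \right) l \right\}. \]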

  9. THE BASIC INEQUALITIES • These inequalities are nontrivial if $\lim_{l \to \infty} G^{\Lambda}(l)/l = 0$ (3.5), the necessary and sufficient condition for distribution-free uniform convergence. • If condition (3.5) is violated, then there exist probability measures F(z) on Z for which uniform convergence does not take place.

  10. Generalization for the set of real functions • Let $Q(z,\alpha),\ \alpha \in \Lambda$, be a set of real functions, with $A \le Q(z,\alpha) \le B$. • Let us construct the set of indicator functions $I(z,\alpha,\beta) = \theta\{ Q(z,\alpha) - \beta \},\ \beta \in (A,B)$, where $\theta(u)$ is the step function. For a set of indicator functions the constructed set of indicators coincides with this set of functions.

  11. Generalization In the generalization we distinguish three cases: • Totally bounded functions • Totally bounded nonnegative functions • Nonnegative (not necessarily bounded) functions • The following bounds are nontrivial if $\lim_{l \to \infty} H^{\Lambda}_{ann}(l)/l = 0$.

  12. Generalization • Totally bounded functions • Totally bounded nonnegative functions (both inequalities are reconstructed below)
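The two bounds on this slide were images; reconstructed from Vapnik's text (they follow from Theorems 3.1 and 3.2 by rescaling the functions to [0, 1]). For totally bounded functions $A \le Q(z,\alpha) \le B$,

\[ P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - R_{emp}(\alpha) \right| > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{H^{\Lambda}_{ann}(2l)}{l} - \frac{\varepsilon^{2}}{(B-A)^{2}} \right) l \right\}, \]

and for totally bounded nonnegative functions $0 \le Q(z,\alpha) \le B$,

\[ P\left\{ \sup_{\alpha \in \Lambda} \frac{R(\alpha) - R_{emp}(\alpha)}{\sqrt{R(\alpha)}} > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{H^{\Lambda}_{ann}(2l)}{l} - \frac{\varepsilon^{2}}{4B} \right) l \right\}. \]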

  13. Generalization • Nonnegative functions: let $Q(z,\alpha),\ \alpha \in \Lambda$, be a set of functions such that for some p > 2 the pth normalized moments of the random variables $\xi_{\alpha} = Q(z,\alpha)$ exist. The moment, the resulting bound, and the constant a(p) are reconstructed below.
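Reconstructed from Vapnik's text:

\[ m_{p}(\alpha) = \left( \int Q^{p}(z,\alpha)\, dF(z) \right)^{1/p}. \]

Therefore

\[ P\left\{ \sup_{\alpha \in \Lambda} \frac{R(\alpha) - R_{emp}(\alpha)}{m_{p}(\alpha)} > a(p)\, \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{H^{\Lambda}_{ann}(2l)}{l} - \frac{\varepsilon^{2}}{4} \right) l \right\}, \]

where

\[ a(p) = \sqrt[p]{\frac{1}{2} \left( \frac{p-1}{p-2} \right)^{p-1}}. \]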

  14. Distribution-independent bounds • The above bounds were distribution-dependent. • To obtain distribution-independent bounds one replaces the annealed entropy with the growth function. • The following inequalities are nontrivial if $\lim_{l \to \infty} G^{\Lambda}(l)/l = 0$.

  15. Distribution-independent bounds • For the set of totally bounded functions • For the set of nonnegative totally bounded functions • For the set of nonnegative real functions whose pth normalized moment exists for some p > 2 (an instance is reconstructed below)
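The three inequalities referenced here are those of slides 12 and 13 with $H^{\Lambda}_{ann}(2l)$ replaced by $G^{\Lambda}(2l)$; for instance, for the set of totally bounded functions,

\[ P\left\{ \sup_{\alpha \in \Lambda} \left| R(\alpha) - R_{emp}(\alpha) \right| > \varepsilon \right\} \le 4 \exp\left\{ \left( \frac{G^{\Lambda}(2l)}{l} - \frac{\varepsilon^{2}}{(B-A)^{2}} \right) l \right\}. \]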

  16. Bounds on the generalization ability of learning machines • What actual risk $R(\alpha_{l})$ is provided by the function $Q(z,\alpha_{l})$ that achieves minimal empirical risk $R_{emp}(\alpha_{l})$? • How close is this risk to the minimal possible $\inf_{\alpha \in \Lambda} R(\alpha)$ for the given set of functions? • Using the notation $\mathcal{E} = 4\, \frac{H^{\Lambda}_{ann}(2l) - \ln(\eta/4)}{l}$, the bounds below are nontrivial when $\mathcal{E} < 1$.

  17. Distribution-independent bounds (another form) • For the set of totally bounded functions: • with probability at least $1-\eta$, simultaneously for all functions of $Q(z,\alpha),\ \alpha \in \Lambda$; • with probability at least $1-2\eta$, for the function $Q(z,\alpha_{l})$ that minimizes the empirical risk (both inequalities are reconstructed below).
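Reconstructed from Vapnik's text (the second inequality combines the first with Hoeffding's inequality applied to the function with minimal actual risk):

\[ R(\alpha) \le R_{emp}(\alpha) + \frac{B-A}{2} \sqrt{\mathcal{E}}, \]

\[ R(\alpha_{l}) - \inf_{\alpha \in \Lambda} R(\alpha) \le (B-A) \sqrt{\frac{-\ln \eta}{2l}} + \frac{B-A}{2} \sqrt{\mathcal{E}}. \]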

  18. Distribution-independent bounds (another form) • For the set of totally bounded nonnegative functions: • with probability at least $1-\eta$, simultaneously for all functions; • with probability at least $1-2\eta$, for the function that minimizes the empirical risk (both inequalities are reconstructed below).
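Reconstructed from Vapnik's text, for $0 \le Q(z,\alpha) \le B$:

\[ R(\alpha) \le R_{emp}(\alpha) + \frac{B \mathcal{E}}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}(\alpha)}{B \mathcal{E}}} \right), \]

\[ R(\alpha_{l}) - \inf_{\alpha \in \Lambda} R(\alpha) \le B \sqrt{\frac{-\ln \eta}{2l}} + \frac{B \mathcal{E}}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}(\alpha_{l})}{B \mathcal{E}}} \right). \]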

  19. Distribution-independent bounds (another form) • For the set of unbounded nonnegative functions we are given a pair (p, τ) such that $\sup_{\alpha \in \Lambda} \frac{\left( \int Q^{p}(z,\alpha)\, dF(z) \right)^{1/p}}{\int Q(z,\alpha)\, dF(z)} \le \tau$. • With probability at least $1-\eta$, simultaneously for all functions (reconstructed below), where $(u)_{+} = \max(u,0)$. • With probability at least $1-2\eta$, an analogous inequality holds for the function that minimizes the empirical risk.
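Reconstructed from Vapnik's text:

\[ R(\alpha) \le \frac{R_{emp}(\alpha)}{\left( 1 - a(p)\, \tau \sqrt{\mathcal{E}} \right)_{+}}, \qquad a(p) = \sqrt[p]{\frac{1}{2} \left( \frac{p-1}{p-2} \right)^{p-1}}. \]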

  20. The structure of the growth function • To make the above bounds constructive one has to find a way to evaluate the annealed entropy and/or the growth function for a given set of functions. • We will find constructive bounds by using the concept of the VC dimension of the set of functions. • There is a remarkable connection between the concept of VC dimension and the growth function.

  21. The structure of the growth function • Theorem: any growth function either satisfies the equality $G^{\Lambda}(l) = l \ln 2$ or is bounded by the inequality $G^{\Lambda}(l) \le h \left( \ln \frac{l}{h} + 1 \right)$ for $l > h$, where h is the largest integer for which $G^{\Lambda}(h) = h \ln 2$. • Definition: we say that the VC dimension of a set of indicator functions is infinite if the growth function for this set of functions is linear, and finite and equal to h if the corresponding growth function is bounded by a logarithmic function with coefficient h.
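A concrete illustration (not on the slide): for the threshold indicators $\theta(z - b),\ b \in \mathbb{R}$, on the line, $l$ distinct points admit exactly $l + 1$ separations, so

\[ G^{\Lambda}(l) = \ln(l+1) \le h \left( \ln \frac{l}{h} + 1 \right) = \ln l + 1, \qquad h = 1, \]

since one point is always shattered while two points $z_{1} < z_{2}$ can never be labeled $(1, 0)$.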

  22. VC dimension • The finiteness of the VC dimension of the set of indicator functions is a sufficient condition for consistency of the ERM method independent of the probability measure, and it implies a fast rate of convergence. • It is a necessary and sufficient condition for distribution-independent consistency of ERM learning machines. • The VC dimension of a set of indicator functions is the maximum number h of vectors $z_{1}, \dots, z_{h}$ that can be separated into two classes in all possible ways (shattered) using functions of the set. • If for any n there exists a set of n vectors that can be shattered by the set of functions, then the VC dimension is equal to infinity.

  23. VC dimension • The VC dimension of a set of real functions: let $Q(z,\alpha),\ \alpha \in \Lambda$, be a set of real functions bounded by constants A and B. Consider the set of indicators $I(z,\alpha,\beta) = \theta\{ Q(z,\alpha) - \beta \}$, where $\theta(u)$ is the step function. The VC dimension of the set of real functions is defined to be the VC dimension of the set of corresponding indicators with parameters $\alpha \in \Lambda$ and $\beta \in (A,B)$.

  24. VC dimension - Example • The VC dimension of the set of linear indicator functions $Q(z,\alpha) = \theta\left( \sum_{k=1}^{n} \alpha_{k} z_{k} + \alpha_{0} \right)$ in n-dimensional coordinate space is equal to h = n + 1, since by using functions of this set one can shatter at most n + 1 vectors. • The VC dimension of the set of linear functions in n-dimensional coordinate space is also equal to h = n + 1, because the VC dimension of the set of corresponding linear indicator functions is equal to n + 1. (A numerical check for n = 2 follows.)
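A small numerical check of the n = 2 case, not from the slides: linear indicator functions in the plane shatter the 3 points below but not 4 points, consistent with h = n + 1 = 3. Separability of each labeling is tested with the perceptron rule, which provably converges whenever a separating line exists; the iteration cap only affects the heuristic "not separable" verdict. The helper name separable is mine, not Vapnik's.

from itertools import product
import numpy as np

def separable(points, labels, max_iter=10000):
    # Perceptron check: returns True iff some line realizes the labeling
    # (guaranteed for separable labelings; the cap flags the rest).
    X = np.hstack([points, np.ones((len(points), 1))])  # augment with a bias coordinate
    y = np.where(np.array(labels) == 1, 1.0, -1.0)
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
                w += yi * xi            # perceptron update
                updated = True
        if not updated:
            return True                 # every point strictly classified
    return False

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print("3 points shattered:",
      all(separable(three, d) for d in product([0, 1], repeat=3)))  # True
print("4 points shattered:",
      all(separable(four, d) for d in product([0, 1], repeat=4)))   # False (XOR labelings fail)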

  25. VC dimension - Example • The VC dimension of the set of functions $Q(z,\alpha) = \theta(\sin \alpha z),\ \alpha \in \mathbb{R}$, is infinite: the points $z_{i} = 10^{-i},\ i = 1, \dots, l$, on the line can be shattered by functions from this set. To separate these points into two classes determined by a sequence $\delta_{1}, \dots, \delta_{l}$ ($\delta_{i} \in \{0,1\}$), it is sufficient to choose the value of the parameter α to be $\alpha = \pi \left( \sum_{i=1}^{l} (1-\delta_{i})\,10^{i} + 1 \right)$. • By choosing an appropriate coefficient α one can, for any number of appropriately chosen points, approximate the values of any function bounded by (-1, +1) using $\sin \alpha z$.
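A direct verification of the slide's construction (numpy assumed; l = 4 keeps the floating-point phases well away from zero crossings): for every labeling δ the prescribed α makes $\theta(\sin \alpha z_{i})$ reproduce δ exactly.

from itertools import product
import numpy as np

l = 4
z = np.array([10.0 ** -(i + 1) for i in range(l)])       # z_i = 10^{-i}, i = 1..l

ok = True
for delta in product([0, 1], repeat=l):
    # alpha = pi * (sum_i (1 - delta_i) * 10^i + 1), as on the slide
    alpha = np.pi * (sum((1 - d) * 10 ** (i + 1) for i, d in enumerate(delta)) + 1)
    realized = (np.sin(alpha * z) > 0).astype(int)       # theta(sin(alpha z))
    ok = ok and np.array_equal(realized, np.array(delta))
print("all", 2 ** l, "labelings realized:", ok)          # expected: True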

  26. VC dimension • The VC dimension of a set of functions does not coincide with the number of parameters: it can be either larger or smaller than the number of parameters. • In the following we present the bounds on the risk functional that are used in Chapter 4 for constructing methods of controlling the generalization ability of learning machines.

  27. Constructive distribution-independent bounds • Consider sets of functions that possess a finite VC dimension h. • Therefore, in all inequalities of the section above, the following constructive expression (reconstructed below) can be used in the case of finite VC dimension. • We will also consider the case where the set of loss functions contains a finite number N of elements.
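Reconstructed expressions: since $H^{\Lambda}_{ann}(l) \le G^{\Lambda}(l) \le h(\ln(l/h) + 1)$, one may take

\[ \mathcal{E} = 4\, \frac{h \left( \ln \frac{2l}{h} + 1 \right) - \ln(\eta/4)}{l}, \]

and, for a set of loss functions containing a finite number N of elements (my reading of Vapnik's text; treat the constant as a best-effort transcription),

\[ \mathcal{E} = 2\, \frac{\ln N - \ln \eta}{l}. \]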

  28. Constructive distribution-independent bounds • For the set of totally bounded functions: • with probability at least $1-\eta$, simultaneously for all functions, the inequality of slide 17 holds with the constructive $\mathcal{E}$; • with probability at least $1-2\eta$, the corresponding inequality holds for the function that minimizes the empirical risk.

  29. Constructive distribution-independent bounds • For the set of totally bounded nonnegative functions: • with probability at least $1-\eta$, simultaneously for all functions, the inequality of slide 18 holds with the constructive $\mathcal{E}$; • with probability at least $1-2\eta$, the corresponding inequality holds for the function that minimizes the empirical risk.

  30. Constructive distribution-independent bounds • For the set of unbounded nonnegative functions: • with probability at least $1-\eta$, simultaneously for all functions, the inequality of slide 19 holds with the constructive $\mathcal{E}$; • with probability at least $1-2\eta$, the corresponding inequality holds for the function that minimizes the empirical risk. (A numerical sketch follows.)
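A small numerical sketch of how the constructive bound behaves, under the assumptions above (the reconstructed $\mathcal{E}$ and the slide-18 form of the bound with $0 \le Q \le B$; the helper names are mine):

import math

def constructive_eps(h, l, eta):
    # E = 4 * (h * (ln(2l/h) + 1) - ln(eta/4)) / l   (finite VC dimension h)
    return 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l

def risk_bound(r_emp, h, l, eta, B=1.0):
    # Upper bound on R(alpha) for bounded nonnegative losses, 0 <= Q <= B.
    e = constructive_eps(h, l, eta)
    return r_emp + (B * e / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / (B * e)))

# Example: h = 10, l = 10000 training examples, confidence 1 - eta = 0.95.
print(round(risk_bound(r_emp=0.05, h=10, l=10000, eta=0.05), 4))  # approx. 0.1143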

  31. References • Vapnik, Vladimir N., The Nature of Statistical Learning Theory, 2nd ed., Springer, 2000.
