Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center

**Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2**

**Contents**

- VC theory: reminders from the last lecture.
- ERM consistency (asymptotic results).
- ERM generalization bounds (non-asymptotic results).
- Structural Risk Minimization (SRM).
- Other theoretical frameworks for machine learning.

**Empirical Risk Minimization (ERM)**

Loss function: $Q(z, \alpha)$ is the loss incurred by the function with parameter $\alpha \in \Lambda$ on an example $z$.

Risk functional:

$$R(\alpha) = \int Q(z, \alpha)\, dF(z)$$

Risk minimization principle: find the function $Q(z, \alpha_0)$ minimizing $R(\alpha)$ over $\alpha \in \Lambda$, when the distribution $F(z)$ is unknown but an i.i.d. sample $z_1, \ldots, z_l$ drawn from it is given.

Empirical risk minimization principle: instead of $R(\alpha)$, minimize the empirical risk

$$R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)$$

**Regularization**

Regularized empirical risk:

$$R_{reg}(\alpha) = R_{emp}(\alpha) + \lambda\, \Omega(\alpha),$$

where $\Omega$ penalizes the complexity of the function and $\lambda > 0$ controls the trade-off between fit and complexity.

**Empirical Risk Minimization (ERM)**

Questions for ERM (statistical learning theory):

- Is ERM consistent? (Does the ERM solution converge weakly to the true one?)
- How fast is the convergence?
- How can the generalization be controlled?

**ERM Consistency**

Let $\alpha_l$ minimize the empirical risk on a sample of size $l$. Then $R(\alpha_l)$ estimates the risk of the true solution, and $R_{emp}(\alpha_l)$ estimates $R(\alpha_l)$: the ERM procedure combines two estimators. Is the combination consistent?

Consistency definition for ERM: the ERM principle is consistent if, as $l \to \infty$,

$$R(\alpha_l) \xrightarrow{P} \inf_{\alpha} R(\alpha) \quad\text{and}\quad R_{emp}(\alpha_l) \xrightarrow{P} \inf_{\alpha} R(\alpha).$$

Do we need both limits to hold? Counter-example: let the $Q(z, \alpha)$ be indicator functions, each equal to 1 for all $z$ except on a finite number of intervals of total measure $\varepsilon$, where it equals 0; the parameter $\alpha$ specifies those intervals. The set is rich enough that for any finite number of points $z_1, \ldots, z_l$ one can find a function taking the value 0 on all of them. Let $F(z)$ be the uniform distribution on $[0, 1]$.
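The ERM and regularization principles above can be made concrete. The following is a minimal sketch, not part of the lecture: hypothetical 1-D regression data with squared loss $Q(z, \alpha) = (y - \alpha x)^2$, a coarse parameter grid standing in for the function set, and a ridge-style penalty $\Omega(\alpha) = \alpha^2$.

```python
import random

random.seed(0)

# Hypothetical data: y = 2x + Gaussian noise, sample size l = 20
data = [(x / 10.0, 2 * (x / 10.0) + random.gauss(0, 0.1)) for x in range(20)]

def emp_risk(w, sample):
    """Empirical risk R_emp(w) = (1/l) * sum of squared losses Q(z_i, w)."""
    return sum((y - w * x) ** 2 for x, y in sample) / len(sample)

def reg_emp_risk(w, sample, lam):
    """Regularized empirical risk: R_emp(w) + lam * Omega(w), with Omega(w) = w^2."""
    return emp_risk(w, sample) + lam * w * w

# ERM over a crude grid of candidate parameters (the "function set")
grid = [i / 100.0 for i in range(-500, 501)]
w_erm = min(grid, key=lambda w: emp_risk(w, data))
w_reg = min(grid, key=lambda w: reg_emp_risk(w, data, lam=1.0))

print(w_erm)                        # close to the true slope 2
print(abs(w_reg) <= abs(w_erm))     # the penalty shrinks the estimate
```

With a large $\lambda$ the regularized minimizer is pulled toward 0; the grid search is only for transparency, since the quadratic objective could of course be minimized in closed form.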
We have, for this set: every sample admits a function with zero empirical risk, so $\inf_\alpha R_{emp}(\alpha) = 0$ for every $l$, while $R(\alpha) = 1 - \varepsilon$ for every $\alpha$. The first limit holds trivially ($R(\alpha_l) = 1 - \varepsilon = \inf_\alpha R(\alpha)$), but the second fails ($R_{emp}(\alpha_l) = 0 \not\to 1 - \varepsilon$): both limits are needed.

**ERM Consistency**

Strict consistency: ordinary consistency can be satisfied trivially, e.g. when the function set contains a minorizing function lying below all the others (the problem of minorization of function sets).

Strict consistency (note: only the second limit is needed): the ERM principle is strictly consistent if, for every $c$,

$$\inf_{\alpha:\, R(\alpha) \ge c} R_{emp}(\alpha) \xrightarrow{P} \inf_{\alpha:\, R(\alpha) \ge c} R(\alpha) \quad (l \to \infty).$$

**ERM Consistency**

Two-sided empirical processes — uniform convergence:

$$\lim_{l \to \infty} P\Big\{\sup_\alpha \big| R(\alpha) - R_{emp}(\alpha) \big| > \varepsilon\Big\} = 0 \quad \forall \varepsilon > 0.$$

One-sided empirical processes:

$$\lim_{l \to \infty} P\Big\{\sup_\alpha \big( R(\alpha) - R_{emp}(\alpha) \big) > \varepsilon\Big\} = 0 \quad \forall \varepsilon > 0.$$

**ERM Consistency**

Concentration inequality — Hoeffding's inequality: for $a \le Q(z, \alpha) \le b$,

$$P\big\{ \big| R(\alpha) - R_{emp}(\alpha) \big| > \varepsilon \big\} \le 2 \exp\!\left( \frac{-2 \varepsilon^2 l}{(b - a)^2} \right).$$

Hoeffding's inequality is distribution-independent. It describes the rate at which frequencies converge to their probabilities. With $a = 0$, $b = 1$ it reduces to Chernoff's inequality. It and its generalizations have been used extensively in the analysis of randomized algorithms and in learning theory.

**ERM Consistency**

Key theorem of learning (Vapnik and Chervonenkis): the ERM principle is strictly consistent if and only if one-sided uniform convergence, as defined above, holds.

**ERM Consistency**

Uniform convergence of frequencies to probabilities — Case 1: a finite set of functions. Let the set of events contain a finite number $N$ of events $A_k = \{z : Q(z, \alpha_k) > 0\}$, $k = 1, 2, \ldots, N$. For this set, uniform convergence does hold: by the union bound combined with Hoeffding's inequality,

$$P\Big\{ \sup_{1 \le k \le N} \big| P(A_k) - \nu_l(A_k) \big| > \varepsilon \Big\} \le 2 N e^{-2 \varepsilon^2 l},$$

where $\nu_l(A_k)$ is the frequency of $A_k$ in the sample.

**ERM Consistency**

Case 2: sets of indicator functions — entropy of a set of functions. Given a sample $z_1, z_2, \ldots, z_l$, each $\alpha$ yields a binary vector

$$q(\alpha) = \big( Q(z_1, \alpha), \ldots, Q(z_l, \alpha) \big),$$

a vertex of the $l$-dimensional hypercube. Let $N(z_1, \ldots, z_l)$ be the number of distinct vertices obtained as $\alpha$ varies; then $N(z_1, \ldots, z_l) \le 2^l$.
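Hoeffding's inequality can be checked numerically. This sketch (a hypothetical biased coin standing in for a $\{0,1\}$-valued loss; the parameters are arbitrary) compares the observed frequency of large deviations with the bound $2 e^{-2 \varepsilon^2 l}$:

```python
import math
import random

random.seed(1)

p, l, eps, trials = 0.3, 200, 0.1, 2000

# Estimate P{ |frequency - p| > eps } by repeated sampling
deviations = 0
for _ in range(trials):
    freq = sum(random.random() < p for _ in range(l)) / l
    if abs(freq - p) > eps:
        deviations += 1
observed = deviations / trials

# Hoeffding bound for [0,1]-valued losses (a=0, b=1): 2 exp(-2 eps^2 l)
bound = 2 * math.exp(-2 * eps * eps * l)

print(observed <= bound)  # the distribution-independent bound holds
```

Note that the bound is loose for any particular distribution: it must cover the worst case over all distributions on $[0, 1]$.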
Define the random entropy $H(z_1, \ldots, z_l) = \ln N(z_1, \ldots, z_l)$; the entropy of the function set is its expectation:

$$H(l) = E \ln N(z_1, \ldots, z_l).$$

**ERM Consistency**

Case 2 (continued): for sets of indicator functions, uniform two-sided convergence holds if and only if

$$\lim_{l \to \infty} \frac{H(l)}{l} = 0.$$

**ERM Consistency**

Case 3: real-valued bounded functions. Consider a set of functions with $|Q(z, \alpha)| \le C$. Similarly to the indicator case, given a sample $z_1, \ldots, z_l$, each vector $q(\alpha) = (Q(z_1, \alpha), \ldots, Q(z_l, \alpha))$ is a point in an $l$-dimensional cube. Define $N(\varepsilon; z_1, \ldots, z_l)$ as the number of vectors in a minimal $\varepsilon$-net of the set $\{q(\alpha)\}$ (as $\alpha$ varies). The random $\varepsilon$-entropy of the set $Q(z, \alpha)$ is defined as

$$H(\varepsilon; z_1, \ldots, z_l) = \ln N(\varepsilon; z_1, \ldots, z_l),$$

and the $\varepsilon$-entropy as its expectation:

$$H(\varepsilon; l) = E \ln N(\varepsilon; z_1, \ldots, z_l).$$

Uniform convergence then holds if and only if $\lim_{l \to \infty} H(\varepsilon; l)/l = 0$ for every $\varepsilon > 0$.

**ERM Consistency**

Case 4: functions with bounded expectations. The same $\varepsilon$-entropy machinery extends to sets of unbounded functions whose moments are suitably bounded.

**ERM Consistency**

Conditions of one-sided convergence: for strict consistency of ERM, only one-sided uniform convergence is required; the corresponding necessary and sufficient conditions are obtained by applying the entropy conditions above to a suitably modified function set.

**ERM Consistency**

Three milestones in learning theory. For pattern recognition (sets of indicator functions), define:

- Entropy: $H(l) = E \ln N(z_1, \ldots, z_l)$
- Annealed entropy: $H_{ann}(l) = \ln E\, N(z_1, \ldots, z_l)$
- Growth function: $G(l) = \ln \sup_{z_1, \ldots, z_l} N(z_1, \ldots, z_l)$

We have $H(l) \le H_{ann}(l) \le G(l)$.

First milestone — sufficient condition for consistency:

$$\lim_{l \to \infty} \frac{H(l)}{l} = 0.$$

Second milestone — sufficient condition for a fast convergence rate:

$$\lim_{l \to \infty} \frac{H_{ann}(l)}{l} = 0.$$

Third milestone — necessary and sufficient condition for consistency under any measure, and for a fast convergence rate:

$$\lim_{l \to \infty} \frac{G(l)}{l} = 0.$$

**ERM Generalization Bounds**

Non-asymptotic results: consistency is an asymptotic property; it tells neither the speed of convergence nor the confidence of the results of
ERM.

**ERM Generalization Bounds**

Non-asymptotic results: note the finite case, where $Q(z, \alpha)$ contains only $N$ (indicator) functions. For this case, using Chernoff's inequality and the union bound,

$$P\Big\{ \sup_{1 \le k \le N} \big| R(\alpha_k) - R_{emp}(\alpha_k) \big| > \varepsilon \Big\} \le 2 N e^{-2 \varepsilon^2 l},$$

so ERM is consistent and converges fast.

**ERM Generalization Bounds**

Non-asymptotic results: inverting the inequality, with probability $1 - \eta$, simultaneously for all $N$ functions,

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{\ln N - \ln \eta}{2l}},$$

and with probability $1 - 2\eta$, for the ERM minimizer $\alpha_l$,

$$R(\alpha_l) - \inf_{\alpha} R(\alpha) \le \sqrt{\frac{\ln N - \ln \eta}{2l}} + \sqrt{\frac{-\ln \eta}{2l}}.$$

**ERM Generalization Bounds**

Indicator functions — distribution-dependent bounds: replacing $\ln N$ by an entropy-based capacity term yields, with probability $1 - \eta$, simultaneously for all $\alpha$, a bound of the form

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{H_{ann}(2l) - \ln(\eta/4)}{l}},$$

and with probability $1 - 2\eta$ a corresponding bound on $R(\alpha_l) - \inf_\alpha R(\alpha)$ with the extra one-sided Hoeffding term, where $H_{ann}(2l)$ is the annealed entropy evaluated on samples of size $2l$.

**ERM Generalization Bounds**

Indicator functions — distribution-independent bounds. Reminder: the growth function is $G(l) = \ln \sup_{z_1, \ldots, z_l} N(z_1, \ldots, z_l)$, and $H(l) \le H_{ann}(l) \le G(l)$. Since $G(l)$ does not depend on the distribution, substituting $G(l)$ for the entropy term gives distribution-free bounds on the generalization error.

**ERM Generalization Bounds**

Indicator functions — VC dimension: the VC dimension $h$ of a set of indicator functions is the largest number of points that can be shattered by the set, i.e. separated into two classes in all $2^h$ possible ways; if arbitrarily large finite sets can be shattered, the VC dimension is infinite. When $h$ is finite, the growth function satisfies, for $l > h$,

$$G(l) \le h \left( \ln \frac{l}{h} + 1 \right),$$

so $G(l)/l \to 0$ and the distribution-free bounds apply.

Example — linear functions: the set of linear indicator functions (halfspaces) in $\mathbb{R}^n$,

$$Q(z, \alpha) = \theta\Big( \sum_{i=1}^{n} \alpha_i z_i + \alpha_0 \Big),$$

has VC dimension $h = n + 1$. For example, lines in the plane can shatter any 3 points in general position, but no set of 4 points can be shattered (consider the XOR configuration).
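The linear-function example can be checked by brute force. This sketch (hypothetical point sets; a coarse weight grid that happens to suffice for these tiny configurations) verifies that halfspaces in the plane realize all $2^3$ labelings of 3 points in general position, but cannot realize the XOR labeling of 4 points:

```python
from itertools import product

def realizable(points, labels, grid):
    """Search for a linear threshold function sign(w1*x + w2*y + b)
    producing the given +/-1 labels; the coarse grid is enough here."""
    for w1, w2, b in product(grid, repeat=3):
        if all((w1 * x + w2 * y + b > 0) == (lab > 0)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Three points in general position: every one of the 2^3 labelings is
# realizable, so halfspaces in the plane shatter them.
tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shatter3 = all(realizable(tri, labs, grid)
               for labs in product([-1, 1], repeat=3))

# Four points in the XOR configuration: the labeling (+,+,-,-) below is
# provably not linearly separable, so this 4-point set is not shattered.
quad = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_ok = realizable(quad, [1, 1, -1, -1], grid)

print(shatter3, xor_ok)  # True False
```

This is consistent with the VC dimension $h = n + 1 = 3$ for halfspaces in the plane (though failing on one 4-point set does not by itself prove the upper bound for all 4-point sets).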