  1. Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2 Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center

  2. Contents VC Theory: Reminders from the last lecture. ERM consistency (asymptotic results). ERM generalization bounds (non-asymptotic results). Structural Risk Minimization (SRM). Other theoretical frameworks for machine learning.

  3. Probabilistic Setting of ML
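The formulas on this slide were images and did not survive transcription; a minimal reconstruction of the standard setting, following Vapnik's notation:

```latex
% Probabilistic setting: examples are drawn i.i.d. from an unknown distribution F(z).
z_1, z_2, \ldots, z_l \overset{\text{i.i.d.}}{\sim} F(z),
\qquad z = (x, y) \in Z
```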

  4. Empirical Risk Minimization (ERM) Loss function:

  5. Empirical Risk Minimization (ERM) Risk Function:
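The loss and risk formulas on slides 4 and 5 were images; the standard definitions in Vapnik's notation are:

```latex
% Loss of the hypothesis indexed by alpha on the example z:
Q(z, \alpha), \qquad \alpha \in \Lambda
% Risk functional: expected loss under the unknown distribution F(z):
R(\alpha) = \int Q(z, \alpha) \, dF(z)
```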

  6. Empirical Risk Minimization (ERM) Risk Minimization Principle:

  7. Empirical Risk Minimization (ERM) Empirical Risk Minimization Principle:
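The formulas for slides 6 and 7 were also images; the standard statements are:

```latex
% Risk minimization: the ideal (unrealizable) choice minimizes the true risk:
\alpha_0 = \operatorname*{arg\,min}_{\alpha \in \Lambda} R(\alpha)
% Empirical risk: average loss over the observed sample z_1, ..., z_l:
R_{\mathrm{emp}}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)
% ERM principle: minimize the empirical risk instead:
\alpha_l = \operatorname*{arg\,min}_{\alpha \in \Lambda} R_{\mathrm{emp}}(\alpha)
```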

  8. Regularization Regularized Empirical Risk:
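The regularized-risk formula was an image; in general it takes the form Remp(f) + λΩ(f) for a penalty Ω and trade-off parameter λ. A minimal runnable sketch (not from the slides; NumPy assumed) using ridge regression, where Ω(w) = ‖w‖² and a closed-form minimizer exists:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # 50 samples, 10 features
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0                                     # regularization strength lambda
# Minimize (1/l) * ||Xw - y||^2 + lam * ||w||^2; setting the gradient to zero
# gives the closed form (X'X / l + lam * I) w = X'y / l.
w_hat = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(10),
                        X.T @ y / len(y))
print(w_hat.round(2))
```

Increasing lam shrinks w_hat toward zero, trading empirical fit for effectively restricting the hypothesis set.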

  9. Empirical Risk Minimization (ERM) Questions for ERM (Statistical Learning Theory): Is ERM consistent? (Consistency: weak convergence of the ERM solution to the true one.) How fast is the convergence? How can generalization be controlled?

  10. ERM Consistency R[f] is an estimator of the risk of the true solution as the sample size n grows, and Remp[f] is an estimator of R[f]. ERM therefore stacks one estimator on top of another. Is the combination consistent?

  11. ERM Consistency

  12. ERM Consistency

  13. ERM Consistency Consistency Definition for ERM:

  14. ERM Consistency Consistency Definition for ERM:
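The definition on slides 13 and 14 was an image; the classical definition of ERM consistency, following Vapnik, requires both limits below (in probability):

```latex
R(\alpha_l) \xrightarrow[l \to \infty]{P} \inf_{\alpha \in \Lambda} R(\alpha),
\qquad
R_{\mathrm{emp}}(\alpha_l) \xrightarrow[l \to \infty]{P} \inf_{\alpha \in \Lambda} R(\alpha)
```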

  15. ERM Consistency Do we need both limits to hold? Counterexample: let Q(z, α) be indicator functions. Each function in this set is equal to 1 for all z except on a finite number of intervals of total measure ε, where it is equal to 0. The parameters α define the intervals on which the function equals zero. The set of functions Q(z, α) is such that for any finite number of points z1,...,zl one can find a function that takes the value zero on all of those points. Let F(z) be the uniform distribution function on the interval [0,1].

  16. ERM Consistency Do we need both limits to hold? We have: R(α) = 1 − ε for every α, so the first limit holds trivially, while the ERM minimizer achieves Remp(αl) = 0, which does not converge to inf R(α) = 1 − ε, so the second limit fails. The two conditions are therefore not interchangeable.
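A small Monte Carlo sketch of this counterexample (not from the slides; NumPy assumed): the ERM choice places its zero-intervals on the sample points, so the empirical risk is exactly 0 while the true risk stays near 1 − ε:

```python
import numpy as np

rng = np.random.default_rng(1)
l, eps = 100, 0.01                  # sample size; total measure of zero-intervals
sample = rng.uniform(0, 1, size=l)

half = eps / (2 * l)                # half-width so the total measure is at most eps
def Q(z):
    # "Learned" function: 0 on tiny intervals around the sample points, 1 elsewhere.
    return (np.abs(z[:, None] - sample[None, :]) > half).all(axis=1).astype(float)

emp_risk = Q(sample).mean()         # 0.0: every sample point lies in a zero-interval
true_risk = Q(rng.uniform(0, 1, size=50_000)).mean()  # Monte Carlo estimate
print(emp_risk, true_risk)          # 0.0 vs roughly 1 - eps = 0.99
```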

  17. ERM Consistency Strict Consistency: if the function set can be minorized (it contains a function lying below all the others), consistency is satisfied trivially; strict consistency is defined to rule out such cases.

  18. ERM Consistency Strict Consistency: (note: only the second limit is needed)
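The formula here was an image; the standard definition of strict consistency is:

```latex
% For every c such that \Lambda(c) = \{ \alpha : R(\alpha) \ge c \} is nonempty:
\inf_{\alpha \in \Lambda(c)} R_{\mathrm{emp}}(\alpha)
\xrightarrow[l \to \infty]{P}
\inf_{\alpha \in \Lambda(c)} R(\alpha)
```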

  19. ERM Consistency Two-sided Empirical Processes: Uniform convergence:
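The uniform-convergence formula was an image; the standard two-sided statement is:

```latex
\lim_{l \to \infty} P\Big\{ \sup_{\alpha \in \Lambda}
\big| R(\alpha) - R_{\mathrm{emp}}(\alpha) \big| > \varepsilon \Big\} = 0,
\qquad \forall \varepsilon > 0
```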

  20. ERM Consistency One-sided Empirical Processes:
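The one-sided version drops the absolute value, controlling only the excess of true risk over empirical risk:

```latex
\lim_{l \to \infty} P\Big\{ \sup_{\alpha \in \Lambda}
\big( R(\alpha) - R_{\mathrm{emp}}(\alpha) \big) > \varepsilon \Big\} = 0,
\qquad \forall \varepsilon > 0
```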

  21. ERM Consistency Concentration Inequality: Hoeffding’s Inequality
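The inequality itself was an image; the standard form for i.i.d. random variables bounded in [a, b] is:

```latex
P\Big\{ \Big| \frac{1}{l} \sum_{i=1}^{l} X_i - \mathbb{E}X_1 \Big| > \varepsilon \Big\}
\le 2 \exp\!\Big( - \frac{2 l \varepsilon^2}{(b - a)^2} \Big)
```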

  22. ERM Consistency Concentration Inequality: Hoeffding's Inequality Hoeffding's inequality is distribution independent. It describes the rate of convergence of frequencies to their probabilities. When the variables are indicators (bounded by 0 and 1), it reduces to Chernoff's inequality. It and its generalizations have been used extensively in the analysis of randomized algorithms and in learning theory.
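A quick simulation (not from the slides; NumPy assumed) comparing the actual deviation probability of Bernoulli frequencies against the Hoeffding bound with a = 0, b = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
l, eps, p, trials = 200, 0.1, 0.3, 100_000
flips = rng.random((trials, l)) < p          # Bernoulli(p) samples in {0, 1}
deviation = np.abs(flips.mean(axis=1) - p)   # |frequency - probability| per trial
empirical = (deviation > eps).mean()
hoeffding = 2 * np.exp(-2 * l * eps**2)      # bound with a = 0, b = 1
print(empirical, hoeffding)                  # empirical rate sits below the bound
```

The empirical rate is typically far below the bound, which is the price of distribution independence.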

  23. ERM Consistency Key Theorem of Learning:
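The statement was an image; in Vapnik's formulation, for bounded loss functions, nontrivial consistency of ERM is equivalent to one-sided uniform convergence:

```latex
\lim_{l \to \infty} P\Big\{ \sup_{\alpha \in \Lambda}
\big( R(\alpha) - R_{\mathrm{emp}}(\alpha) \big) > \varepsilon \Big\} = 0,
\qquad \forall \varepsilon > 0
```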

  24. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions Let our set of events contain a finite number N of events Ak = {z : Q(z, αk) > 0}, k = 1, 2, ..., N. For this set, uniform convergence does hold.

  25. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 1: Finite set of functions (cont.) For uniform convergence:
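The bound on this slide was an image; a standard reconstruction via the union bound and Hoeffding's inequality:

```latex
P\Big\{ \max_{1 \le k \le N}
\big| R(\alpha_k) - R_{\mathrm{emp}}(\alpha_k) \big| > \varepsilon \Big\}
\;\le\; \sum_{k=1}^{N}
P\Big\{ \big| R(\alpha_k) - R_{\mathrm{emp}}(\alpha_k) \big| > \varepsilon \Big\}
\;\le\; 2 N e^{-2 \varepsilon^2 l}
```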

  26. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: Given a sample {z1, z2, …, zl}, for each α we have a binary vector q(α) = (Q(z1, α), …, Q(zl, α)). Each q(α) is a vertex of a hypercube.

  27. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions Entropy of a set of functions: N(z1,…,zl) is the number of distinct vertices; we have N(z1,…,zl) ≤ 2^l. Define the random entropy H(z1,…,zl) = ln N(z1,…,zl). The entropy of the function set is then defined as its expectation (see the illustration below):
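A concrete illustration (not from the slides; NumPy assumed): for threshold indicator functions on the line, a sample of l points induces exactly l + 1 distinct binary vectors, so the random entropy grows like ln(l + 1) and H(l)/l → 0:

```python
import numpy as np

rng = np.random.default_rng(3)
l = 20
z = np.sort(rng.uniform(0, 1, size=l))
# Hypothesis class: threshold indicators Q(z, theta) = 1 if z > theta.
# One representative theta per gap between sample points (plus both ends)
# realizes every achievable binary vector q(theta).
thetas = np.concatenate(([z[0] - 1], (z[:-1] + z[1:]) / 2, [z[-1] + 1]))
vectors = {tuple((z > t).astype(int)) for t in thetas}
print(len(vectors))               # N(z_1,...,z_l) = l + 1 = 21, far below 2**l
print(np.log(len(vectors)) / l)   # random entropy per example; -> 0 as l grows
```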

  28. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions
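The condition on this slide was an image; for indicator functions, the necessary and sufficient condition for two-sided uniform convergence is that the entropy per example vanishes:

```latex
\lim_{l \to \infty} \frac{H(l)}{l} = 0
```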

  29. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Consider a set of functions Q such that |Q(z, α)| < C. As with indicator functions, given a sample Z = z1,…,zl, for each α the vector q(α) = (Q(z1, α), …, Q(zl, α)) is a point in a hypercube.

  30. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions

  31. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions Let N(ε; z1,…,zl) be the number of vectors in a minimal ε-net of the set of vectors q(α) (as α varies). The random ε-entropy of Q(z, α) is defined as H(ε; z1,…,zl) = ln N(ε; z1,…,zl), and the ε-entropy is defined as its expectation:

  32. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions
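The condition here was also an image; the analogue for bounded real-valued functions requires the ε-entropy per example to vanish for every ε:

```latex
\lim_{l \to \infty} \frac{H(\varepsilon; l)}{l} = 0,
\qquad \forall \varepsilon > 0
```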

  33. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

  34. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

  35. ERM Consistency Conditions of One-Sided Convergence:

  36. ERM Consistency Conditions of One-Sided Convergence:

  37. ERM Consistency Three milestones in learning theory: For pattern recognition (a space of indicator functions), we have: Entropy: H(l) = E ln N(z1,…,zl) Annealed entropy: Hann(l) = ln E N(z1,…,zl) Growth function: G(l) = ln sup N(z1,…,zl) These satisfy H(l) ≤ Hann(l) ≤ G(l) (the first inequality follows from Jensen's inequality).

  38. ERM Consistency Three milestones in learning theory: First milestone: a sufficient condition for consistency. Second milestone: a sufficient condition for a fast rate of convergence. Third milestone: a necessary and sufficient condition for consistency for any measure, together with a fast rate of convergence. (See the reconstruction below.)
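The three conditions were images; as stated in Vapnik's texts they are:

```latex
% First milestone: sufficient for consistency.
\lim_{l \to \infty} \frac{H(l)}{l} = 0
% Second milestone: sufficient for a fast (exponential) rate of convergence.
\lim_{l \to \infty} \frac{H_{\mathrm{ann}}(l)}{l} = 0
% Third milestone: necessary and sufficient for consistency for any measure,
% and for a fast rate of convergence.
\lim_{l \to \infty} \frac{G(l)}{l} = 0
```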

  39. ERM Generalization Bounds Non-asymptotic results: Consistency is an asymptotic property; it says nothing about the speed of convergence or about the confidence one can place in the result of ERM on a finite sample.

  40. ERM Generalization Bounds Non-asymptotic results: Consider the finite case, where Q(z, α) contains only N (indicator) functions. For this case (using Chernoff's inequality and the union bound), ERM is consistent and converges fast.

  41. ERM Generalization Bounds Non-asymptotic results: a one-sided bound holds with probability 1 − η, and a two-sided bound with probability 1 − 2η (reconstructed below):
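The bounds were images; a plausible reconstruction of the standard finite-class bounds (N functions, 0/1 loss), obtained by inverting the union bound above:

```latex
% With probability at least 1 - eta, simultaneously for all alpha in the class:
R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{\ln N - \ln \eta}{2 l}}
% With probability at least 1 - 2 eta:
\big| R(\alpha) - R_{\mathrm{emp}}(\alpha) \big|
\le \sqrt{\frac{\ln N - \ln \eta}{2 l}}
```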

  42. ERM Generalization Bounds Indicator Functions: Distribution-Dependent

  43. ERM Generalization Bounds Indicator Functions: Distribution-Dependent With probability 1 − η (one-sided) and with probability 1 − 2η (two-sided), bounds of the following form hold (reconstructed below):
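The bound itself was an image; one standard form from Vapnik (the exact constants may differ from the slide) starts from P{ sup_α (R(α) − Remp(α)) > ε } ≤ 4 exp( (Hann(2l)/l − (ε − 1/l)²) l ), which, setting the right-hand side to η, gives with probability at least 1 − η:

```latex
R(\alpha) \le R_{\mathrm{emp}}(\alpha)
+ \sqrt{\frac{H_{\mathrm{ann}}(2l) - \ln(\eta/4)}{l}} + \frac{1}{l}
```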

  44. ERM Generalization Bounds Indicator Functions: Distribution Independent Reminder: Growth function: G(l) = ln sup N(z1,…,zl). We have H(l) ≤ Hann(l) ≤ G(l). G(l) does not depend on the distribution, so substituting G(l) for the entropy in the bounds yields distribution-free bounds on the generalization error.

  45. ERM Generalization Bounds Indicator Functions: VC dimension
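The definition was an image; the standard definition via the growth function, together with the Vapnik-Chervonenkis/Sauer bound:

```latex
% VC dimension: the largest l for which some l points are shattered
% (all 2^l labelings realized), i.e. the largest l with G(l) = l ln 2:
h = \max \{ l : G(l) = l \ln 2 \}
% If h is finite, then for all l > h:
G(l) \le h \Big( \ln \frac{l}{h} + 1 \Big)
```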

  46. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

  47. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

  48. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions
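A brute-force check of the linear-functions example (not from the slides): it tests shattering by solving a strict-separability LP for every labeling. NumPy and SciPy are assumed, and the point sets are illustrative.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: does (w1, w2, b) exist with y_i * (w . x_i + b) >= 1 for all i?
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0           # status 0: feasible; status 2: infeasible

def shattered(points):
    # A set is shattered if every +/-1 labeling is linearly separable.
    return all(separable(points, labs)
               for labs in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]            # three points in general position
four = [(0, 0), (1, 1), (1, 0), (0, 1)]     # the XOR labeling defeats any line
print(shattered(three))   # True: lines in the plane shatter 3 points
print(shattered(four))    # False: consistent with VC dimension 3 for halfplanes
```

One non-shattered 4-point set does not by itself prove h ≤ 3, but it is known that no 4 points in the plane can be shattered by halfplanes, so h = 3.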

  49. ERM Generalization Bounds Indicator Functions: VC dimension Example:

  50. ERM Generalization Bounds Indicator Functions: VC dimension Example: