##### Presentation Transcript

1. Learning, Generalization, and Regularization: A Glimpse of Statistical Machine Learning Theory – Part 2 Assoc. Prof. Nguyen Xuan Hoai, HANU IT R&D Center

2. Contents VC Theory: reminders from the last lecture. ERM consistency (asymptotic results). ERM generalization bounds (non-asymptotic results). Structural Risk Minimization (SRM). Other theoretical frameworks for machine learning.

3. Probabilistic Setting of ML

4. Empirical Risk Minimization (ERM) Loss function: Q(z, α), the loss incurred by the function indexed by α on an observation z (in supervised learning z = (x, y) and Q(z, α) = L(y, f(x, α))).

5. Empirical Risk Minimization (ERM) Risk function: R(α) = ∫ Q(z, α) dF(z), the expected loss under the (unknown) distribution F(z).

6. Empirical Risk Minimization (ERM) Risk Minimization Principle: choose the function minimizing R(α) over the given set, when F(z) is unknown but an i.i.d. sample z1, …, zl drawn from it is given.

7. Empirical Risk Minimization (ERM) Empirical Risk Minimization Principle: replace R(α) by the empirical risk Remp(α) = (1/l) Σ_{i=1..l} Q(zi, α) and choose the function that minimizes it.
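As a toy illustration (not from the slides), ERM over a finite function class can be carried out by direct enumeration; the data, the candidate thresholds, and the use of 0-1 loss below are all hypothetical:

```python
# ERM sketch: minimize the empirical risk over a finite set of
# 1-D threshold classifiers f_a(x) = 1 if x >= a else 0, under 0-1 loss.
# Data points (x, y) and candidate thresholds are hypothetical toy values.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]  # the finite function class

def emp_risk(a):
    """Remp(a) = (1/l) * sum of 0-1 losses on the sample."""
    return sum(((x >= a) != bool(y)) for x, y in data) / len(data)

best = min(thresholds, key=emp_risk)
print(best, emp_risk(best))  # 0.5 0.0 -- the threshold with zero empirical risk
```

With only five candidate functions the minimizer is found exactly; the theory in the following slides asks when this empirical minimizer also has small true risk.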

8. Regularization Regularized Empirical Risk: Rreg(α) = Remp(α) + λ Ω(α), where Ω penalizes complexity and λ > 0 trades off fit against complexity.
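A minimal sketch of a regularized empirical risk for a linear model with squared loss; the data, the value of λ, and the choice of an L2 penalty are illustrative assumptions, not from the slides:

```python
# Regularized empirical risk Rreg(w) = Remp(w) + lambda * Omega(w),
# here with squared loss and an L2 penalty Omega(w) = w^2 (toy example).
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]  # hypothetical (x, y) pairs

def emp_risk(w, b):
    """Mean squared error of the linear model x -> w*x + b on the sample."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def reg_risk(w, b, lam=0.1):
    return emp_risk(w, b) + lam * w * w  # penalty discourages large slopes
```

Increasing λ shifts the minimizer toward smaller-norm (simpler) functions, which is how regularization controls the effective capacity of the class.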

9. Empirical Risk Minimization (ERM) Questions for ERM (Statistical Learning Theory): Is ERM consistent? (Consistency: weak convergence of the ERM solution to the true one.) How fast is the convergence? How can generalization be controlled?

10. ERM Consistency The ERM solution is an estimator of the true solution as the sample size n grows, and Remp[f] is an estimator of R[f]. ERM therefore combines the two estimators. Is the combination consistent?

11. ERM Consistency

12. ERM Consistency

13. ERM Consistency Consistency Definition for ERM:

14. ERM Consistency Consistency Definition for ERM:

15. ERM Consistency Do we need both limits to hold true? Counter-example: let Q(z, α) be indicator functions. Each function of this set is equal to 1 for all z except on a finite number of intervals of total measure ε, where it is equal to 0. The parameters α define the intervals on which the function equals zero. The set of functions Q(z, α) is such that for any finite number of points z1, …, zl one can find a function that takes the value zero on all of these points. Let F(z) be the uniform distribution function on the interval [0, 1].

16. ERM Consistency Do we need both limits to hold true? We have: on every sample the empirical minimum is inf_α Remp(α) = 0 (a function vanishing on the sample points always exists), while R(α) = 1 − ε for every α. So the minimal empirical risk does not converge to the minimal true risk, although the risk of any chosen function trivially equals inf_α R(α): one limit holds while the other fails, so both are needed.

17. ERM Consistency Strict Consistency: consistency can also be obtained trivially by minorizing the function set (adding one function that lies below all others, which ERM then always selects) ⇒ plain consistency is satisfied trivially. Strict consistency is defined to exclude such cases.

18. ERM Consistency Strict Consistency: (note: only the second limit is needed)

19. ERM Consistency Two-sided Empirical Processes: Uniform (two-sided) convergence: P{sup_α |R(α) − Remp(α)| > ε} → 0 as l → ∞, for every ε > 0.

20. ERM Consistency One-sided Empirical Processes: P{sup_α (R(α) − Remp(α)) > ε} → 0 as l → ∞, for every ε > 0.

21. ERM Consistency Concentration Inequality: Hoeffding’s Inequality. For i.i.d. random variables ξ1, …, ξl with a ≤ ξi ≤ b: P{|(1/l) Σ ξi − Eξ| > ε} ≤ 2 exp(−2lε² / (b − a)²).

22. ERM Consistency Concentration Inequality: Hoeffding’s Inequality. Hoeffding’s inequality is distribution-independent. It describes the rate of convergence of frequencies to their probabilities. For indicator functions (bounds 0 and 1) it reduces to Chernoff’s inequality. It and its generalizations have been used extensively in the analysis of randomized algorithms and in learning theory.
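The inequality can be checked empirically by simulation; the sketch below (with made-up parameters p, l, ε) estimates the probability that a Bernoulli frequency deviates from its mean and compares it against the Hoeffding bound:

```python
import math
import random

# Monte Carlo check of Hoeffding's inequality for Bernoulli frequencies:
# P(|nu_l - p| >= eps) <= 2 * exp(-2 * l * eps^2)   (bounds 0 and 1)
random.seed(0)
p, l, eps, runs = 0.3, 200, 0.1, 2000

exceed = sum(
    abs(sum(random.random() < p for _ in range(l)) / l - p) >= eps
    for _ in range(runs)
) / runs

bound = 2 * math.exp(-2 * l * eps ** 2)
print(exceed, bound)  # the observed deviation frequency respects the bound
```

The bound is loose for a single function; the point of the following slides is what happens when it must hold uniformly over a whole set of functions.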

23. ERM Consistency Key Theorem of Learning (Vapnik–Chervonenkis): for bounded loss functions, ERM is (strictly) consistent if and only if the empirical risks converge to the actual risks uniformly one-sidedly: lim_{l→∞} P{sup_α (R(α) − Remp(α)) > ε} = 0 for every ε > 0.

24. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 1: Finite set of functions. Let our set of events contain a finite number N of events Ak = {z : Q(z, αk) > 0}, k = 1, 2, …, N. For this set, uniform convergence does hold.

25. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 1: Finite set of functions. Let our set of events contain a finite number N of events Ak = {z : Q(z, αk) > 0}, k = 1, 2, …, N. For this set, uniform convergence does hold. For uniform convergence, by the union bound over the N events: P{sup_k |P(Ak) − νl(Ak)| > ε} ≤ 2N exp(−2ε²l).

26. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 2: Sets of Indicator Functions. Entropy of a set of functions: given a sample {z1, z2, …, zl}, for each α we have a binary vector q(α) = (Q(z1, α), …, Q(zl, α)). Each q(α) is a vertex of the l-dimensional hypercube.

27. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 2: Sets of Indicator Functions. Entropy of a set of functions: N(z1,…,zl) is the number of distinct vertices; we have N(z1,…,zl) ≤ 2^l. Define the random entropy H(z1,…,zl) = ln N(z1,…,zl). The entropy of the function set is then defined as its expectation: H(l) = E H(z1,…,zl).
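The counting behind this definition can be made concrete with the (hypothetical) class of 1-D threshold indicator functions: on a sample of l points they induce far fewer than 2^l distinct binary vectors:

```python
import random

# Count the distinct binary vectors q(a) = (Q(z1,a), ..., Q(zl,a)) induced
# on a sample by threshold indicators Q(z, a) = 1 if z >= a else 0.
random.seed(1)
l = 10
sample = sorted(random.random() for _ in range(l))

# Thresholds below, between, and above the points realize every achievable vector
candidates = [-1.0] + [(sample[i] + sample[i + 1]) / 2 for i in range(l - 1)] + [2.0]
patterns = {tuple(int(z >= a) for z in sample) for a in candidates}

print(len(patterns), 2 ** l)  # N(z1,...,zl) = l + 1 = 11, far below 2^l = 1024
```

For this class the random entropy is ln(l + 1), so H(l)/l → 0, which by the milestones below is enough for consistency.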

28. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 2: Sets of Indicator Functions

29. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 3: Real-valued Bounded Functions. Consider a set of functions Q such that |Q(z, α)| ≤ C. As with indicator functions, given a sample Z = z1, …, zl, for each α the vector q(α) = (Q(z1, α), …, Q(zl, α)) is a point in a hypercube.

30. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions

31. ERM Consistency Uniform Convergence of Frequencies to Probabilities: Case 3: Real-valued Bounded Functions. Define N(ε; z1,…,zl) to be the number of vectors in a minimal ε-net of the set of vectors q(α) (as α varies). The random ε-entropy of Q(z, α) is defined as H(ε; z1,…,zl) = ln N(ε; z1,…,zl), and the ε-entropy is defined as its expectation: H(ε; l) = E H(ε; z1,…,zl).
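A greedy cover gives an upper bound on the size of a minimal ε-net in the sup-norm; the vectors and the value of ε below are hypothetical:

```python
# Greedy construction of an eps-net in the sup-norm: every vector ends up
# within eps of some net element, so the net size upper-bounds N(eps; z1..zl).
def greedy_eps_net(vectors, eps):
    net = []
    for v in vectors:
        if all(max(abs(a - b) for a, b in zip(v, c)) > eps for c in net):
            net.append(v)  # v is not yet eps-covered: add it to the net
    return net

vectors = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0), (1.0, 0.95)]
net = greedy_eps_net(vectors, eps=0.1)
print(len(net))  # 2: the four vectors cluster into two eps-balls
```

Smaller ε forces a finer net and hence a larger N(ε; ·), mirroring how the ε-entropy grows as the required resolution increases.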

32. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 3: Real-valued Bounded Functions

33. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

34. ERM Consistency Uniform Convergence of Frequencies to Prob: Case 4: Functions with bounded expectations

35. ERM Consistency Conditions of One-Sided Convergence:

36. ERM Consistency Conditions of One-Sided Convergence:

37. ERM Consistency Three milestones in learning theory: For pattern recognition (sets of indicator functions), we have: Entropy: H(l) = E ln N(z1,…,zl). Annealed Entropy: Hann(l) = ln E N(z1,…,zl). Growth Function: G(l) = ln sup_{z1,…,zl} N(z1,…,zl). We have: H(l) ≤ Hann(l) ≤ G(l).

38. ERM Consistency Three milestones in learning theory: First milestone — sufficient condition for consistency: lim_{l→∞} H(l)/l = 0. Second milestone — sufficient condition for a fast convergence rate: lim_{l→∞} Hann(l)/l = 0. Third milestone — necessary and sufficient condition for consistency for any measure, and for a fast convergence rate: lim_{l→∞} G(l)/l = 0.

39. ERM Generalization Bounds Non-asymptotic results: consistency is an asymptotic property; it tells us neither the speed of convergence nor the confidence we may place in the result of ERM.

40. ERM Generalization Bounds Non-asymptotic results: note the finite case, when the set Q(z, α) contains only N (indicator) functions. For this case (using Chernoff’s inequality and the union bound): ERM is consistent and ERM converges fast.

41. ERM Generalization Bounds Non-asymptotic results: with probability 1 − η, simultaneously for all N functions: R(α) − Remp(α) ≤ √((ln N − ln η) / (2l)). With probability 1 − 2η: |R(α) − Remp(α)| ≤ √((ln N − ln η) / (2l)).
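The finite-class confidence term (Hoeffding plus a union bound over the N functions) can be computed directly; the numbers below are made up for illustration:

```python
import math

# Finite class of N functions: with probability >= 1 - eta, simultaneously
# for all of them, R(a) <= Remp(a) + sqrt((ln N - ln eta) / (2 l)).
def finite_class_bound(remp, n_functions, l, eta):
    return remp + math.sqrt((math.log(n_functions) - math.log(eta)) / (2 * l))

# Illustrative values: the bound tightens as l grows and loosens as N grows
print(finite_class_bound(0.10, 1000, 10_000, 0.05))
print(finite_class_bound(0.10, 1000, 100_000, 0.05))
```

Note that N enters only through ln N, which is what makes the later step from finite classes to entropy and growth-function bounds natural.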

42. ERM Generalization Bounds Indicator Functions: Distribution-Dependent

43. ERM Generalization Bounds Indicator Functions: Distribution-Dependent with probability 1 − η: With probability 1 − 2η: Where:

44. ERM Generalization Bounds Indicator Functions: Distribution Independent. Reminder: Growth Function: G(l) = ln sup_{z1,…,zl} N(z1,…,zl). We have: H(l) ≤ Hann(l) ≤ G(l). G(l) does not depend on the distribution, so if we substitute G(l) into the bounds, we obtain distribution-free bounds on the generalization error.

45. ERM Generalization Bounds Indicator Functions: VC dimension. The VC dimension h of a set of indicator functions is the largest number of points z1, …, zl that can be shattered (split into all 2^l possible labelings) by functions of the set. If h is finite, then for l > h the growth function satisfies G(l) ≤ h(ln(l/h) + 1), i.e., it grows logarithmically rather than linearly in l.
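When the VC dimension h is finite, the number of realizable labelings is polynomial rather than exponential in l; a sketch of the Sauer–Shelah bound on the growth number, with illustrative values of h and l:

```python
from math import comb, log

# Sauer-Shelah: sup over samples of N(z1,...,zl) <= sum_{i=0}^{h} C(l, i),
# so for l > h the growth function G(l) = ln sup N grows like h * ln l,
# not linearly in l as l * ln 2 would.
def growth_number_bound(h, l):
    return sum(comb(l, i) for i in range(min(h, l) + 1))

h, l = 3, 10
print(growth_number_bound(h, l), 2 ** l)  # 176 vs 1024
print(log(growth_number_bound(h, l)) <= h * (log(l / h) + 1))  # Vapnik's form holds
```

Substituting this bound into the distribution-free results above is what turns the abstract growth function into the familiar VC generalization bounds.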

46. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions. Linear indicator functions f(x) = θ(w · x + b) on R^n have VC dimension h = n + 1.

47. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

48. ERM Generalization Bounds Indicator Functions: VC dimension Example: linear functions

49. ERM Generalization Bounds Indicator Functions: VC dimension Example:

50. ERM Generalization Bounds Indicator Functions: VC dimension Example:
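To illustrate the linear-function example (a toy sketch, not from the slides): three non-collinear points in the plane can be shattered by linear indicator functions, consistent with a VC dimension of n + 1 = 3 for lines in R^2:

```python
from itertools import product

# Three non-collinear points in the plane, to be shattered by
# linear indicators f(x) = 1 if w1*x1 + w2*x2 + b > 0 else 0.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def labels(w1, w2, b):
    return tuple(int(w1 * x + w2 * y + b > 0) for x, y in pts)

# A small grid of classifiers already realizes every labeling of these points
achieved = {labels(w1, w2, b)
            for w1, w2, b in product((-1, 0, 1), (-1, 0, 1), (-0.5, 0.5))}

print(len(achieved))  # 8 = 2^3 labelings: the three points are shattered
```

No four points in the plane can be shattered (for instance, the XOR labeling of the corners of a square is not linearly separable), so the VC dimension of linear indicators in R^2 is exactly 3.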