
Machine Learning



  1. Machine Learning Chen Yu, Institute of Computer Science and Technology, Peking University, Research Center for Information Security Engineering

  2. Course Information • Instructor: Chen Yu, chen_yu@pku.edu.cn, Tel: 82529680 • Teaching assistant: Cheng Zaixing, Tel: 62763742, wataloo@hotmail.com • Course homepage: http://www.icst.pku.edu.cn/course/jiqixuexi/jqxx2011.mht

  3. Ch5 Evaluating Hypotheses • Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy) • Given that hypothesis h outperforms h’ over some sample of data, how probable is it that h outperforms h’ in general? (difference between hypotheses) • When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

  4. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  5. Learning Problem Setting • Space of possible instances X (e.g. the set of all people) over which target functions may be defined. • Assume that different instances in X may be encountered with different frequencies. • Model this assumption as an unknown probability distribution D that defines the probability of encountering each instance in X. • Training examples are provided by drawing instances independently from X, according to D.

  6. Bias & Variance • With limited data, two difficulties arise when we try to estimate the accuracy of a learned hypothesis: • Bias: The training examples typically provide an optimistically biased estimate of the accuracy of the learned hypothesis over future examples (the overfitting problem). • Variance: Even if the accuracy is measured over an unbiased set of test examples, the makeup of the test set can still affect the measured accuracy of the learned hypothesis.

  7. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  8. Qs in Focus • Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D? • What is the probable error in this estimate?

  9. Sample Error & True Error • The sample error of hypothesis h w.r.t. target function f and a data set S of n samples is errorS(h) ≡ (1/n) Σx∈S δ(f(x)≠h(x)), where δ(f(x)≠h(x)) is 1 if f(x)≠h(x) and 0 otherwise • The true error of hypothesis h w.r.t. target function f and distribution D is errorD(h) ≡ Prx∈D[f(x)≠h(x)] • So the two Qs become: how well does errorS(h) estimate errorD(h)?
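A minimal Python sketch of the sample-error computation (not from the slides; the target f, hypothesis h, and sample below are hypothetical stand-ins):

    def sample_error(h, f, S):
        # errorS(h) = fraction of examples in S that h misclassifies w.r.t. target f
        return sum(1 for x in S if h(x) != f(x)) / len(S)

    # Toy usage: f labels even numbers positive, h labels everything positive,
    # so h errs on the 5 odd numbers among 0..9
    f = lambda x: x % 2 == 0
    h = lambda x: True
    print(sample_error(h, f, range(10)))  # 0.5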

  10. Confidence Intervals for Discrete-Valued Hypotheses • Assume • sample S contains n examples drawn independently of one another, and independently of h, according to distribution D, and • n ≥ 30 • Then • given no other information, the most probable value of errorD(h) is errorS(h); furthermore, • with approximately 95% probability, errorD(h) lies in the interval errorS(h) ± 1.96 sqrt[errorS(h)(1-errorS(h))/n]
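A sketch of this 95% interval in Python, assuming the n ≥ 30 condition above holds (the numeric inputs are hypothetical):

    import math

    def ci95(error_s, n):
        # errorS(h) +/- 1.96 * sqrt(errorS(h) * (1 - errorS(h)) / n)
        half = 1.96 * math.sqrt(error_s * (1 - error_s) / n)
        return (error_s - half, error_s + half)

    print(ci95(0.2, 40))  # approximately (0.076, 0.324)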

  11. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  12. Binomial Probability Distribution • The probability P(r) of r heads in n coin flips, given Pr(head in one flip) = p, is P(r) = n!/(r!(n-r)!) p^r (1-p)^(n-r) • The expected value of the binomial distribution X = b(n,p) is E[X] = np • The variance of X is Var(X) = np(1-p) • The standard deviation of X is σX = sqrt(np(1-p))
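These quantities are straightforward to check numerically; a small sketch (the n and p values are hypothetical):

    import math

    def binomial_pmf(r, n, p):
        # P(r) = n! / (r! (n-r)!) * p^r * (1-p)^(n-r)
        return math.comb(n, r) * p**r * (1 - p)**(n - r)

    n, p = 40, 0.3
    print(binomial_pmf(12, n, p))   # probability of exactly 12 heads
    print(n * p, n * p * (1 - p))   # E[X] = 12.0, Var(X) = 8.4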

  13. Example • [Figure: plot of a binomial probability distribution] • Remark: the figure is bell-shaped

  14. Compute errorS(h) • Assume h misclassifies r samples out of the n samples in set S; then errorS(h) = r/n • Since r follows the binomial distribution b(n, errorD(h)), errorS(h) has expected value errorD(h) and variance errorD(h)(1-errorD(h))/n

  15. Normal Distribution • 80% of the area of the probability density function of N(μ,σ) lies within μ±1.28σ • N% of the area of the probability density function of N(μ,σ) lies within μ±zNσ

  16. Approximation of errorS(h) • When n is large enough, errorS(h) can be approximated by the Normal distribution with the same expected value and variance, i.e. N(errorD(h), errorD(h)(1-errorD(h))/n) (a corollary of the Central Limit Theorem) • The rule of thumb is that • n ≥ 30, or • n × errorD(h)(1-errorD(h)) ≥ 5
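A one-function sketch of this rule of thumb (the sample values are hypothetical):

    def normal_approx_ok(n, p):
        # Rule of thumb from the slide: n >= 30, or n * p * (1 - p) >= 5
        return n >= 30 or n * p * (1 - p) >= 5

    print(normal_approx_ok(40, 0.3))   # True
    print(normal_approx_ok(20, 0.05))  # False: 20 * 0.05 * 0.95 = 0.95 < 5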

  17. Confidence Interval for Estimating errorD(h) • It follows that with approximately N% probability, errorS(h) lies in the interval errorD(h) ± zN sqrt[errorD(h)(1-errorD(h))/n] • Equivalently, errorD(h) lies in the interval errorS(h) ± zN sqrt[errorD(h)(1-errorD(h))/n], which can be approximated by (Bernoulli's law of large numbers) errorS(h) ± zN sqrt[errorS(h)(1-errorS(h))/n] • Therefore we have derived the confidence interval for discrete-valued hypotheses
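The general N% interval, as a sketch; the zN table values below are the standard two-sided ones (cf. Table 5.1 in the textbook):

    import math

    Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

    def confidence_interval(error_s, n, level=95):
        # errorS(h) +/- z_N * sqrt(errorS(h) * (1 - errorS(h)) / n)
        half = Z[level] * math.sqrt(error_s * (1 - error_s) / n)
        return (error_s - half, error_s + half)

    print(confidence_interval(0.3, 100, level=90))  # roughly (0.225, 0.375)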

  18. Two-Sided & One-Sided Bounds • Sometimes it is desirable to convert a two-sided bound into a one-sided bound, for example when we are interested in the question “What is the probability that errorD(h) is at most U (a certain upper bound)?” • Convert the two-sided bound into a one-sided bound using the symmetry of the normal distribution (Fig. 5.1 in the textbook)
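The conversion itself is just arithmetic on the confidence level; a minimal sketch:

    def one_sided_confidence(two_sided_pct):
        # By symmetry of the normal density, the upper endpoint of an N% two-sided
        # interval is a one-sided bound with confidence 100 - (100 - N)/2
        return 100 - (100 - two_sided_pct) / 2

    print(one_sided_confidence(90))  # 95.0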

  19. Qs in Focus • Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from D? A: Prefer an unbiased estimator with minimum variance • What is the probable error in this estimate? A: Derive a confidence interval

  20. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  21. General Approach • Pick the parameter p to be estimated • e.g. errorD(h) • Choose an estimator, desirably unbiased and with minimum variance • e.g. errorS(h) with large n • Determine the probability distribution that governs the estimator • Find an interval (L,U) such that N% of the probability mass falls in the interval

  22. Central Limit Theorem • Consider a set of independent, identically distributed (i.i.d.) random variables Y1…Yn, all governed by an arbitrary probability distribution with mean μ and finite variance σ2. Define the sample mean Ȳ ≡ (1/n) Σi Yi • Central Limit Theorem: As n→∞, the distribution governing Ȳ approaches N(μ, σ2/n).
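A quick simulation illustrating the theorem; Uniform(0,1) is a hypothetical choice of the underlying distribution, with μ = 0.5 and σ2 = 1/12:

    import random, statistics

    def clt_demo(n, trials=20000):
        # Distribution of sample means of n i.i.d. Uniform(0,1) draws;
        # by the CLT it approaches N(0.5, (1/12)/n)
        means = [statistics.fmean(random.random() for _ in range(n))
                 for _ in range(trials)]
        return statistics.fmean(means), statistics.variance(means)

    print(clt_demo(30))  # roughly (0.5, 0.00278)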

  23. Approximate errorS(h) by a Normal Distribution • In the Central Limit Theorem, take the underlying distribution to be a Bernoulli trial with success probability p = errorD(h), and we are done: each Yi indicates whether h misclassifies the i-th example, so the sample mean is exactly errorS(h)

  24. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  25. Ch5 Evaluating Hypotheses • Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy, done!) • Given that hypothesis h outperforms h’ over some sample, how probable is it that h outperforms h’ in general? (difference between hypotheses, this section) • When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

  26. Difference in Error • Test h1 on sample S1 and test h2 on S2 • Pick the parameter to be estimated: d ≡ errorD(h1)-errorD(h2) • Choose an estimator: d̂ ≡ errorS1(h1)-errorS2(h2) • Properties of d̂: • It is an unbiased estimator • When n is large enough, e.g. ≥ 30, it can be approximated by the difference of two Normal distributions, itself a Normal distribution, with mean d and, in case the two tests are independent, var = var(errorS1(h1)) + var(errorS2(h2)) ≈ errorS1(h1)(1-errorS1(h1))/n1 + errorS2(h2)(1-errorS2(h2))/n2 • ……
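A sketch of the estimator and its approximate standard deviation, assuming the two test sets are independent (the sample errors and sizes passed in are hypothetical):

    import math

    def diff_estimate(err1, n1, err2, n2):
        # d_hat = errorS1(h1) - errorS2(h2); the variances add for independent tests
        d_hat = err1 - err2
        sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
        return d_hat, sigma

    print(diff_estimate(0.25, 50, 0.15, 80))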

  27. Difference in Error (2) • Remark: when S1 = S2, the variance of the estimator usually becomes smaller (the difference in composition of the two sample sets is eliminated)

  28. Hypothesis Testing • Consider instead the question “What is the probability that errorD(h1) ≥ errorD(h2)?” • E.g. for S1, S2 of size 100, errorS1(h1) = 0.3 and errorS2(h2) = 0.2, hence d̂ = 0.1 and σd̂ ≈ 0.061, so d̂ ≈ 1.64σd̂ • Pr(d > 0) corresponds to a one-sided interval • 1.64σ corresponds to a two-sided interval with confidence level 90%, i.e. a one-sided interval with confidence level 95%, so Pr(errorD(h1) > errorD(h2)) ≈ 95%
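The arithmetic behind this example, spelled out (numbers as on the slide):

    import math

    err1, n1 = 0.3, 100  # errorS1(h1)
    err2, n2 = 0.2, 100  # errorS2(h2)
    d_hat = err1 - err2                                     # 0.10
    sigma = math.sqrt(err1*(1-err1)/n1 + err2*(1-err2)/n2)  # ~0.0608
    print(d_hat / sigma)  # ~1.64, hence the ~95% one-sided confidence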

  29. Agenda • Estimating hypothesis accuracy • Basics of sampling theory • Deriving confidence intervals (general approach) • Difference between hypotheses • Comparing learning algorithms

  30. Ch5 Evaluating Hypotheses • Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional samples? (hypothesis accuracy) • Given that hypothesis h outperforms h’ over some sample, how probable is it that h outperforms h’ in general? (difference between hypotheses) • When data is limited, what is the best way to use the data to both learn a hypothesis and estimate its accuracy? (comparing learning algorithms)

  31. Qs in Focus • Let LA and LB be two learning algorithms • What is an appropriate test for comparing LA and LB? • How can we determine whether an observed difference between them is statistically significant?

  32. Statement of Problem • We want to estimate ES⊂D[errorD(LA(S)) - errorD(LB(S))], where L(S) is the hypothesis output by learner L using training set S • Remark: the difference in errors is averaged over all training sets S of size n randomly drawn from D • In practice, given limited data D0, what is a good estimator? • Partition D0 into training set S0 and test set T0, and measure errorT0(LA(S0)) - errorT0(LB(S0)) • Even better, repeat the above many times and average the results
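A sketch of the single-split estimator just described; learn_A, learn_B, and error are hypothetical callables (a learner maps a training set to a hypothesis; error measures a hypothesis on a test set):

    import random

    def holdout_diff(learn_A, learn_B, error, D0, test_fraction=1/3):
        # Split D0 into training set S0 and test set T0, then measure
        # errorT0(LA(S0)) - errorT0(LB(S0))
        data = list(D0)
        random.shuffle(data)
        cut = int(len(data) * test_fraction)
        T0, S0 = data[:cut], data[cut:]
        return error(learn_A(S0), T0) - error(learn_B(S0), T0)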

  33. Procedure • Partition D0 into k disjoint subsets T1, T2, …, Tk of equal size, each of size at least 30 • For i from 1 to k, do: use Ti as the test set • Si ← D0 - Ti • hA ← LA(Si) • hB ← LB(Si) • δi ← errorTi(hA) - errorTi(hB) • Return the average δ̄ of the δi as the estimate (a code sketch of this procedure follows below)
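A sketch of the slide's procedure, under the same assumptions as above (learn_A, learn_B, and error are hypothetical callables; D0 is a sequence of examples):

    import random

    def paired_difference(learn_A, learn_B, error, D0, k=5):
        # k disjoint test sets T_i; train both learners on the remainder S_i
        # and average the per-fold error differences delta_i
        idx = list(range(len(D0)))
        random.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint index subsets
        deltas = []
        for test_idx in folds:
            held_out = set(test_idx)
            T_i = [D0[j] for j in test_idx]
            S_i = [D0[j] for j in range(len(D0)) if j not in held_out]
            h_A, h_B = learn_A(S_i), learn_B(S_i)
            deltas.append(error(h_A, T_i) - error(h_B, T_i))
        return sum(deltas) / k, deltas  # (delta_bar, individual delta_i)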

  34. Estimator • The approximate N% confidence interval for estimating d using δ̄ is given by δ̄ ± tN,k-1 sδ̄, where tN,k-1 is a constant playing the role of zN (now for a t distribution with k-1 degrees of freedom) and sδ̄ = sqrt[(1/(k(k-1))) Σi=1..k (δi-δ̄)2]
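A sketch of this interval; the per-fold differences and the t value below are hypothetical (for k = 5 and a 95% two-sided level, the t constant with 4 degrees of freedom is about 2.776):

    import math

    def paired_t_interval(deltas, t_value):
        # delta_bar +/- t_{N,k-1} * s, with
        # s = sqrt( (1/(k(k-1))) * sum_i (delta_i - delta_bar)^2 )
        k = len(deltas)
        d_bar = sum(deltas) / k
        s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
        return (d_bar - t_value * s, d_bar + t_value * s)

    print(paired_t_interval([0.05, 0.02, 0.04, 0.03, 0.06], 2.776))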

  35. Paired t Tests • To understand the justification for the confidence interval given on the previous slide, consider the following estimation problem: • We are given the observed values of a set of i.i.d. random variables Y1, Y2, …, Yk • We wish to estimate the expected value μ of these Yi • Use the sample mean Ȳ ≡ (1/k) Σi Yi as the estimator

  36. Problem with Limited Data D0 • δ1 … δk are not i.i.d., because they are based on overlapping sets of training examples drawn from D0 rather than from the full distribution D • Instead, view the algorithm on slide 33 as producing an estimate of ES⊂D0[errorD(LA(S)) - errorD(LB(S))], where S is a training set of size (1-1/k)|D0| drawn uniformly from D0

  37. HW • 5.4 (10pt, Due Monday, 10-24) • 5.6 (10pt, Due Monday, 10-24)
