Supervised learning involves learning an unknown mapping through a set of input-output pairs. This type of learning can be used to classify data or predict numerical values. Factors like sample size, hypothesis set, and learning algorithm influence the learnability of the process. Probably Approximately Correct (PAC) learning and Vapnik-Chervonenkis (VC) dimension help understand the complexities involved. Noise and model complexity are also important considerations in supervised learning.
CH. 2: Supervised Learning

2.1 Introduction

Supervised learning learns an unknown mapping f through a set of (input i, output o) pairs; f can be a numerical function or a classifier. Given a training sample X = {(i^t, o^t)}_{t=1}^N, where (i^t, o^t) is a training example, figure out f.
Example: Learn the class c of "family cars", i.e., learn a discriminant that separates c from other classes of cars. A car is represented as x = (x1, x2)^T, where x1: price and x2: engine power. Let r = 1 if the car is in class c, and r = 0 otherwise. Given a sample S = {(x^t, r^t)}_{t=1}^N.
Since a car x is represented as a 2D vector x = (x1, x2)^T, the family car class c forms a rectangle in the 2D space (x1, x2). A hypothesis set H induced from class c is then defined as H = { h | h: (p1 ≤ x1 ≤ e1) AND (p2 ≤ x2 ≤ e2) }.
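As a concrete sketch, the rectangle hypothesis's membership test (p1 ≤ x1 ≤ e1) AND (p2 ≤ x2 ≤ e2) can be written directly in Python. The specific price and engine-power bounds below are hypothetical illustration values, not from the text:

```python
def rectangle_hypothesis(p1, e1, p2, e2):
    """Return h(x): 1 if p1 <= x1 <= e1 and p2 <= x2 <= e2, else 0."""
    def h(x):
        x1, x2 = x
        return 1 if (p1 <= x1 <= e1) and (p2 <= x2 <= e2) else 0
    return h

# Hypothetical price / engine-power bounds for the "family car" rectangle.
h = rectangle_hypothesis(p1=10000, e1=20000, p2=60, e2=120)
print(h((15000, 90)))   # inside the rectangle -> 1
print(h((30000, 200)))  # outside the rectangle -> 0
```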
Problem: Given S and H, the learning algorithm A attempts to find a hypothesis h ∈ H that minimizes the empirical error E(h | S) = (1/N) Σ_{t=1}^N 1(h(x^t) ≠ r^t).
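The empirical error is just the fraction of training examples on which h disagrees with the label. A minimal sketch, where the sample and the threshold hypothesis are made-up toy values:

```python
def empirical_error(h, S):
    """E(h | S) = (1/N) * count of examples with h(x^t) != r^t."""
    return sum(1 for x, r in S if h(x) != r) / len(S)

# Toy sample of (x, r) pairs; the hypothesis errs on one of the four.
S = [((1, 1), 1), ((2, 2), 1), ((5, 5), 0), ((6, 6), 0)]
h = lambda x: 1 if x[0] <= 5 else 0  # misclassifies ((5, 5), 0)
print(empirical_error(h, S))  # 0.25
```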
2.2 Learnability

In the supervised learning process, three items: S (sample), H (hypothesis set), and A (learning algorithm), influence the learnability of the process. Probably approximately correct (PAC) learning and the Vapnik-Chervonenkis (VC) dimension answer the complexities of the sample and of the hypothesis set, respectively.
2.2.1 Probably Approximately Correct (PAC) Learning -- the answer to the sample complexity, i.e., the number of training examples required to achieve a satisfactory (probably approximately correct, PAC) answer.

Hypotheses s, g, and version space V
Hypotheses s (most specific) and g (most general) can be determined from the training set X by first looking for s from outside moving inward, then looking for g from s heading outward. Version space: V = { h | h is between s and g }. In general, choose the hypothesis h ∈ V with the largest margin, i.e., halfway between s and g.
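The most specific hypothesis s, the tightest axis-aligned rectangle around the positive examples, can be computed in a few lines. The sample below is hypothetical (x1: price, x2: engine power); g would then be found by expanding this box outward until it touches a negative example:

```python
def most_specific_rectangle(S):
    """s = tightest axis-aligned rectangle enclosing the positive examples."""
    pos = [x for x, r in S if r == 1]
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))  # (p1, e1, p2, e2)

# Hypothetical sample: three family cars (r = 1) and two others (r = 0).
S = [((12, 80), 1), ((14, 95), 1), ((18, 70), 1),
     ((30, 200), 0), ((5, 40), 0)]
print(most_specific_rectangle(S))  # (12, 18, 70, 95)
```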
PAC learnable -- using the tightest rectangle as the hypothesis, i.e., h = s, how many training examples N should we have such that, with probability at least 1 − δ, h has error at most ε? Mathematically, P(c Δ h ≤ ε) ≥ 1 − δ, where c Δ h is the region of difference between c and h.
In order that the probability of a positive car falling in c Δ h (i.e., an error) is at most ε:
- The probability of falling in one strip is upper bounded by ε/4.
- The probability of missing (i.e., being correct on) one strip: 1 − ε/4.
- The probability that N instances miss one strip: (1 − ε/4)^N.
- The probability that N instances miss the 4 strips: ≤ 4(1 − ε/4)^N.
This probability should be upper bounded by δ: 4(1 − ε/4)^N ≤ δ.
4(1 − ε/4)^N ≤ δ
⇒ (1 − ε/4)^N ≤ δ/4
⇒ N log(1 − ε/4) ≤ log(δ/4)
⇒ N ≥ log(4/δ) / (−log(1 − ε/4)) ----- (A)
From 1 − x ≤ e^{−x}, i.e., log(1 − x) ≤ −x:
−log(1 − ε/4) ≥ ε/4 ----- (B)
(A), (B) ⇒ N ≥ (4/ε) log(4/δ), which depends on the given ε, δ.
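The sample-size bound N ≥ (4/ε) log(4/δ) can be evaluated numerically (natural log, as in the derivation):

```python
import math

def pac_sample_size(eps, delta):
    """Smallest integer N satisfying N >= (4/eps) * ln(4/delta)."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# e.g. error at most 0.1 with probability at least 0.95:
print(pac_sample_size(0.1, 0.05))  # 176
```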
2.2.2 Vapnik-Chervonenkis (VC) Dimension -- a measure of hypothesis complexity.

N points can be labeled in 2^N ways as +/− (or 1/0), e.g., N = 3 gives 2^N = 8 labelings. If for every labeling there exists h ∈ H that separates the + and − examples, H shatters the N points. E.g., the line and the rectangle hypothesis classes can both shatter 3 points. VC(H): the maximum number of points that can be shattered by H.
Examples: (i) The VC dimension of the "line" hypothesis class in 2D space is 3, i.e., VC(line) = 3. [Figures: point configurations that can and cannot be shattered by the "line" hypothesis class.]
(ii) An axis-aligned (AA) rectangle shatters 4 points, i.e., VC(AA rectangle) = 4; 5 points cannot be shattered by a rectangle. [Figure: only the rectangles covering 2 of the points are shown; the cases of 1, 3, and 4 points are omitted.] (iii) VC(triangle) = 7.
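The shattering claim for axis-aligned rectangles can be checked by brute force: a labeling is realizable iff the bounding box of the + points contains no − point (the tightest rectangle is then a witness). A sketch, using a hypothetical diamond-shaped configuration of 4 points:

```python
from itertools import product

def rect_shatters(points):
    """Brute-force check that AA rectangles realize every +/- labeling."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, l in zip(points, labels) if l == 1]
        neg = [p for p, l in zip(points, labels) if l == 0]
        if not pos:
            continue  # an empty rectangle handles the all-negative labeling
        x_lo = min(p[0] for p in pos); x_hi = max(p[0] for p in pos)
        y_lo = min(p[1] for p in pos); y_hi = max(p[1] for p in pos)
        if any(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi for p in neg):
            return False  # a negative point is trapped inside the box
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(rect_shatters(diamond))              # True: these 4 points are shattered
print(rect_shatters(diamond + [(0, 0)]))   # False: the center can't be excluded
```

Note that the second call only shows that this particular 5-point set fails; VC(AA rectangle) = 4 requires that every 5-point set fail.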
Assignment: Show that for any finite H, VC(H) ≤ log2 |H|.

Extension -- Each point can be labeled as one of K labels, so N points can be labeled in K^N ways. If for every labeling there exists h ∈ H that separates the examples of different labels, H shatters the N points. VC(H): the maximum number of points that can be shattered by H.

Assignment (monotonicity property): Show that for any two hypothesis sets H′ ⊆ H, VC(H′) ≤ VC(H), e.g., H′ = "line" ⊆ H = "triangle".
2.6 Model Selection -- determine one model from a given set of models.

Example: Learn a Boolean function from examples.
Each training example removes half of the hypotheses, e.g., input (x1, x2) = (0, 1), output 0 removes h5, h6, h7, h8, h13, h14, h15, h16 because their outputs on (0, 1) are 1. This example illustrates that learning starts with all possible models (hypotheses) and, as more examples are seen, the inconsistent models are removed.

2.3 Noise and Model Complexity

Noise is due to error, imprecision, uncertainty, etc. Complicated hypotheses are generally necessary to cope with noise.
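The elimination can be simulated over the 16 Boolean functions of two inputs. The indexing convention below (h_i given by the truth-table bits of i−1, most significant bit on input (0, 0)) is an assumption chosen to match the slide's example:

```python
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def h(i, x):
    """h_i's output on x: bit of i-1 at the truth-table row of x (MSB first).

    Assumed convention: h1 outputs all 0s, h16 outputs all 1s.
    """
    row = inputs.index(x)
    return ((i - 1) >> (3 - row)) & 1

version_space = list(range(1, 17))
# One training example: input (0, 1), output 0 removes every h_i with
# h_i((0, 1)) = 1, i.e., half of the hypotheses.
version_space = [i for i in version_space if h(i, (0, 1)) == 0]
print(version_space)  # [1, 2, 3, 4, 9, 10, 11, 12]: h5-h8, h13-h16 removed
```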
1) Classification 2) Regression [figures]

However, simpler hypotheses make more sense because they are simple to use, easy to check, train, and explain, and they generalize well.
2.4 Learning Multiple Classes

Multiple classes C_i, i = 1, ..., K. Training set: X = {x^t, r^t}_{t=1}^N, where r_i^t = 1 if x^t ∈ C_i and r_i^t = 0 if x^t ∈ C_j, j ≠ i.

Treat a K-class classification problem as K 2-class problems, i.e., train hypotheses h_i(x), i = 1, ..., K, where h_i(x^t) = 1 if x^t ∈ C_i and 0 if x^t ∈ C_j, j ≠ i, that minimize the total error E = Σ_{t=1}^N Σ_{i=1}^K 1(h_i(x^t) ≠ r_i^t).

Problem: argmin_{h_1, ..., h_K} E(h_1, ..., h_K).
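The K 2-class problems can be built mechanically from the K-class training set, relabeling class i as positive and everything else as negative. A minimal sketch with made-up 1-D data and K = 2:

```python
def one_vs_rest(X, R, K):
    """Turn a K-class sample into K two-class samples (class i vs. rest)."""
    return [[(x, 1 if r == i else 0) for x, r in zip(X, R)]
            for i in range(K)]

# Toy 1-D data: class labels 0 and 1 standing in for C_1, C_2.
X = [(1,), (2,), (8,), (9,)]
R = [0, 0, 1, 1]
S0, S1 = one_vs_rest(X, R, K=2)
print(S0)  # class 0 vs. rest: first two examples positive
print(S1)  # class 1 vs. rest: last two examples positive
```

Each S_i would then be handed to the 2-class learner to train h_i.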
2.5 Regression

Find f, s.t. r^t = f(x^t), r^t ∈ R. Training set: X = {x^t, r^t}_{t=1}^N. Let g be the estimate of f.

Expected total error: E(g | X) = (1/N) Σ_{t=1}^N (r^t − g(x^t))^2.

For the linear model g(x) = w1 x + w0: E(w1, w0 | X) = (1/N) Σ_{t=1}^N (r^t − w1 x^t − w0)^2.

Problem: argmin_{w1, w0} E(w1, w0 | X).
Let ∂E(w1, w0 | X)/∂w1 = ∂[(1/N) Σ_{t=1}^N (r^t − w1 x^t − w0)^2]/∂w1 = 0
⇒ Σ_t (r^t − w1 x^t − w0) x^t = 0
⇒ Σ_t r^t x^t − w1 Σ_t (x^t)^2 − w0 Σ_t x^t = 0, where x̄ = Σ_t x^t / N
⇒ Σ_t r^t x^t − w1 Σ_t (x^t)^2 − w0 N x̄ = 0 ----- (1)

Let ∂E(w1, w0 | X)/∂w0 = 0
⇒ Σ_t (r^t − w1 x^t − w0) = 0, where r̄ = Σ_t r^t / N
⇒ Σ_t r^t − w1 Σ_t x^t − N w0 = 0
⇒ N r̄ − w1 N x̄ − N w0 = 0 ----- (2)

Solve (1) and (2) for w1 and w0:
w1 = (Σ_t x^t r^t − N x̄ r̄) / (Σ_t (x^t)^2 − N x̄^2), w0 = r̄ − w1 x̄.

For the quadratic model g(x) = w2 x^2 + w1 x + w0, solve similarly for w2, w1, and w0.
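The closed-form solution obtained from (1) and (2) is a few lines of Python. The data below are a made-up noise-free line, used only to check that w1 and w0 are recovered exactly:

```python
def fit_line(xs, rs):
    """Closed-form least squares for g(x) = w1*x + w0."""
    N = len(xs)
    x_bar = sum(xs) / N
    r_bar = sum(rs) / N
    # w1 = (sum x^t r^t - N x_bar r_bar) / (sum (x^t)^2 - N x_bar^2)
    w1 = (sum(x * r for x, r in zip(xs, rs)) - N * x_bar * r_bar) \
         / (sum(x * x for x in xs) - N * x_bar ** 2)
    w0 = r_bar - w1 * x_bar
    return w1, w0

# Noise-free data on the line r = 2x + 1 should be recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```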
Example: [figure]
2.7 Dimensions of a Supervised Learning Algorithm

Given a sample X = {x^t, r^t}_{t=1}^N:
1. Model g(x | θ): defines the hypothesis set H.
2. Error function E(θ | X) = Σ_t L(r^t, g(x^t | θ)), where the loss function L(·) expresses the difference between r^t (label) and g(x^t | θ) (return).
3. Optimization procedure: θ* = argmin_θ E(θ | X).
2.8 Notes

Ill-posed problem: the training examples are not sufficient to lead to a unique solution.
Inductive bias: additional information and prior knowledge for making learning possible.
Model selection: how to choose the right bias.
Generalization: how well a model trained on the training set predicts the right output for new instances.
Underfitting: hypothesis set H is less complex than the function underlying the data, e.g., fitting a line to data sampled from a 3rd-order polynomial.
Overfitting: hypothesis set H is more complex than the function, e.g., fitting a 3rd-order polynomial to data sampled from a line.
Triple trade-off: trade-off among 3 factors:
1. the size of the training set, N,
2. the complexity (or capacity) of H, O(H),
3. the generalization error, E, on new data.
As N increases, E decreases; as O(H) increases, E first decreases, then increases.

Cross-validation: data are divided into i) training set, ii) validation set, and iii) test set.
Training set: induces a hypothesis.
Validation set: tests the generalization ability of the induced hypothesis.
Test set: provides the expected error of the induced hypothesis.
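A minimal sketch of the three-way split; the 60/20/20 fractions and fixed seed are a common choice for illustration, not prescribed by the text:

```python
import random

def three_way_split(data, f_train=0.6, f_val=0.2, seed=0):
    """Shuffle and split into training / validation / test sets."""
    d = list(data)
    random.Random(seed).shuffle(d)  # seeded for reproducibility
    n_train = int(f_train * len(d))
    n_val = int(f_val * len(d))
    return d[:n_train], d[n_train:n_train + n_val], d[n_train + n_val:]

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```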
Summary

• Supervised Learning: Given S and H, A attempts to find an h ∈ H that minimizes E(h | S) = (1/N) Σ_{t=1}^N 1(h(x^t) ≠ r^t).
• Learnability depends on the complexities of the sample S, the hypothesis set H, and the learning algorithm A.
• PAC learning model -- answers the sample complexity: P(c Δ h ≤ ε) ≥ 1 − δ holds for any ε > 0 and δ > 0 with sampling size N ≥ (4/ε) log(4/δ).
• Vapnik-Chervonenkis (VC) dimension -- answers the hypothesis complexity: VC(H) is the maximum number of points that can be shattered by H.
H shatters N points: N points can be labeled in K^N ways if each point can be labeled as one of K different labels; for every labeling there exists h ∈ H that can separate the examples of different labels.

Noise and Model Complexity
• Complicated hypotheses are generally necessary to cope with noise. However, simpler hypotheses are easy to use, check, train, explain, and generalize.
• Dimensions of a Supervised Learning Algorithm
Given a sample X = {x^t, r^t}_{t=1}^N:
1. Model g(x | θ): defines the hypothesis set H.
2. Error function E(θ | X) = Σ_t L(r^t, g(x^t | θ)), where the loss function L(·) expresses the difference between r^t (label) and g(x^t | θ) (return).
3. Optimization procedure: θ* = argmin_θ E(θ | X).
Ill-posed problem -> Inductive bias -> Model selection -> Generalization -> Underfitting -> Overfitting -> Triple trade-off: trade-off among 3 factors:
1. the size of the training set, N,
2. the complexity (or capacity) of H, O(H),
3. the generalization error, E, on new data.
As N increases, E decreases; as O(H) increases, E first decreases, then increases.
Cross-validation: Data are divided into i) training set, ii) validation set, and iii) test set.
Training set: for inducing a hypothesis.
Validation set: for testing the generalization ability of the induced hypothesis.
Test set: for providing the expected error of the induced hypothesis.