Data Analysis in Class Projects: How to Process Data, Split Sets, and Avoid Overfitting

Mebi 591D – BHI Kaggle Class: Project Setup
http://winter2014-mebi591d-kaggleclass.weebly.com/

Project Setup (I.)
You have your data, now what?
1. Literature review
2. Basic statistics of the data
• How many training/test instances?
• How many categories?
• Frequency of features
• How many non-zero features, etc.
Due next week: a 5-10 minute presentation of your problem, the literature, and your data.

Examples
Dogs vs. cats (features given)
• 1000 dogs, 278 cats
• 10000 features
• 1356 non-zero features
Dog categorization (images given)
• 7 classes (number of images in each)
• Number of black-and-white images
• Number of images with 1 dog, 2 dogs, 3 dogs, etc.
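The statistics listed above (instance counts, class counts, non-zero features) can be gathered in a few lines. The following is a minimal sketch, not the course's own tooling: the toy `instances` list, the feature names, and the `summarize` helper are all hypothetical, and it assumes each instance is stored as a label plus a sparse feature dictionary.

```python
from collections import Counter

# Hypothetical toy data: each instance is (label, sparse feature dict).
# Real project data would be loaded from the competition files instead.
instances = [
    ("dog", {"f1": 1.0, "f7": 0.5}),
    ("dog", {"f1": 2.0}),
    ("cat", {"f3": 1.0, "f7": 0.0}),
    ("dog", {"f2": 1.0, "f3": 4.0}),
]


def summarize(data):
    """Return basic statistics for a list of (label, features) pairs."""
    # Number of instances per class.
    class_counts = Counter(label for label, _ in data)
    # For each feature, the number of instances in which it is non-zero.
    feature_freq = Counter(
        name
        for _, feats in data
        for name, value in feats.items()
        if value != 0
    )
    return {
        "n_instances": len(data),
        "class_counts": dict(class_counts),
        "n_nonzero_features": len(feature_freq),
        "feature_freq": dict(feature_freq),
    }


stats = summarize(instances)
for key, value in stats.items():
    print(key, value)
```

Running the same summary separately on the training and test portions gives the numbers asked for in the presentation.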

Project Setup (II.)
3. Split data
- Training set vs. test set
- If you don’t have test answers, either label the test data yourself or randomly split out your own test set from the labeled data
- Make sure to sample each class
4. Build the system using the training set only!
- Why shouldn’t I evaluate on my test set?
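One way to split out your own test set while sampling each class is a stratified random split. The sketch below is an illustration under assumptions: the `stratified_split` helper, the 20% test fraction, and the fixed seed are choices made here, not values given in the course material. The class sizes reuse the dogs vs. cats example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with every class sampled separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        if len(idx) > 1:
            # Hold out at least one example, but never the whole class.
            n_test = max(1, round(len(idx) * test_fraction))
            n_test = min(n_test, len(idx) - 1)
        else:
            # A class with a single example stays in the training set.
            n_test = 0
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes from the dogs vs. cats example.
labels = ["dog"] * 1000 + ["cat"] * 278
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these class sizes and a 20% test fraction, the split holds out 200 dogs and 56 cats, so the smaller class is still represented in the test set. After the split, the test indices are set aside and only `train_idx` is used while building the system.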

Overfitting
• …
• Another reason why you should not evaluate on the data you have trained on
• Fit a polynomial: Y = a0 + a1*x + a2*x^2 + … + a20*x^20
• Evaluate on ALL the data. What would happen?
Evaluating on the same set you trained on can give close to 100% accuracy, but you are overtraining! The true function will look different, because of unseen examples and noise.
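The polynomial example can be tried directly. This is a rough sketch under stated assumptions: the sine-shaped `true_function`, the noise level, the sample sizes, and the comparison degrees 1 and 3 are invented for illustration; only the degree-20 polynomial comes from the slide. It uses NumPy's `Polynomial.fit` for the least-squares fit and reports the error on the training points next to the error on separately drawn points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_function(x):
    # Assumed ground truth, chosen only for illustration.
    return np.sin(np.pi * x)


# 30 noisy training points and 200 noisy held-out points on [-1, 1].
x_train = np.linspace(-1.0, 1.0, 30)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.size)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# Map each degree to (training error, held-out error).
results = {}
for degree in (1, 3, 20):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(model, x_train, y_train),
        mse(model, x_test, y_test),
    )
    print(degree, results[degree])
```

The training error cannot increase as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The held-out error has no such guarantee, which is why it is the number to watch.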

Error vs. Complexity
[Figure: error (vertical axis) against model complexity (horizontal axis), with two curves labeled "true error" and "training error".]

Tasks
• Literature review (next week)
• Data statistics (next week)
• Split data
- Training vs. test
- (optional) development set
- (optional) cross-validation sets
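For the optional cross-validation sets, the training data can be divided into k folds, with each fold serving once as the held-out part. The following is a minimal sketch: the `k_fold_indices` helper, the choice of k = 5, and the seed are assumptions made here, and the folds are meant to be built from the training set only, leaving the test set untouched.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, held_out


# Ten training items split into five folds.
splits = list(k_fold_indices(10, k=5))
for train, held_out in splits:
    print(sorted(train), sorted(held_out))
```

Each item appears in exactly one held-out fold, so averaging the scores over the folds uses every training example for evaluation once without ever evaluating on data the model was fit to.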
