Introduction to Machine Learning

Laurent Orseau, AgroParisTech, laurent.orseau@agroparistech.fr • EFREI 2010-2011 • Based on slides by Antoine Cornuejols

Overview • Introduction to Induction (Laurent Orseau) • Neural Networks • Support Vector Machines • Decision Trees • Introduction to Data-Mining (Christine Martin) • Association Rules • Clustering • Genetic Algorithms

Overview: Introduction • Introduction to Induction • Examples of applications • Learning types • Supervised Learning • Reinforcement Learning • Unsupervised Learning • Machine Learning Theory • What questions to ask?

Introduction

Introduction What is Machine Learning ? • Memory • Knowledge acquisition • Neurosciences • Short-term (working) • Keep 7±2 objects at a time • Long-term • Procedural • Action sequences • Declarative • Semantic (concepts) • Episodic (facts) • Learning Types • By heart • From rules • By imitation / demonstration • By trial & error • Knowledge reuse • In similar situations

Introduction What is Machine Learning? • "The field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959) • Samuel's Checkers > Schaeffer 2007 (solved) + TD-Gammon, Tesauro 1992

Introduction What is Machine Learning? Given: • Experience E • A class of tasks T • A performance measure P • A computer is said to learn if its performance on a task of T, measured by P, increases with experience E (Tom Mitchell, 1997)

Introduction Terms related to Machine Learning • Robotics • Automatic Google Cars, Nao • Prediction / forecasting • Stock exchange, pollution peaks, … • Recognition • Face, language, writing, moves, … • Optimization • Subway speed, traveling salesman, … • Regulation • Heat, traffic, fridge temperature, … • Autonomy • Robots, hand prosthesis • Automatic problem solving • Adaptation • User preferences, robot in changing environment • Induction • Generalization • Automatic discovery • …

Some applications

Applications Learning to cook • Learning by imitation / demonstration • Procedural Learning (motor precision) • Object recognition

Applications DARPA Grand challenge (2005)

Applications > DARPA Grand Challenge 200km of desert Natural and artificial dangers No driver No remote control

Applications > DARPA Grand Challenge 5 Finalists

Applications > DARPA Grand Challenge Recognition of the road

Applications Learning to label images: Face recognition • “Face Recognition: Component-based versus Global Approaches” (B. Heisele, P. Ho, J. Wu and T. Poggio), Computer Vision and Image Understanding, Vol. 91, No. 1/2, 6-21, 2003.

Applications > Image recognition Feature combinations

Applications Hand prosthesis • Recognition of pronator and supinator signals • Imperfect sensors • Noise • Uncertainty

Applications Autonomous robot rover on Mars

Supervised Learning Learning by heart? UNUSABLE • Generalize • How to encode shapes?

Introduction to Machine Learning Theory

Introduction to Machine Learning theory • Supervised Learning • Reinforcement Learning • Unsupervised Learning (CM) • Genetic Algorithms (CM)

Supervised Learning • Set of examples x_i labeled u_i • Find a hypothesis h such that h(x_i) = u_i? • h(x_i): predicted label • Best hypothesis h*?

Supervised Learning 1st Example • Houses: Price / m² • Searching for h • Nearest neighbors? • Linear, polynomial regression? • More information • Location (x, y? or symbolic variable?), age of building, neighborhood, swimming pool, local taxes, temporal evolution, …?
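As a rough illustration of the two candidate hypotheses named on this slide, the sketch below predicts a price per m² from a single input, once with 1-nearest-neighbor and once with a least-squares line. The data, the choice of input variable (distance to the city centre), and the function names are all made up for the example.

```python
# Toy data (hypothetical): distance to city centre in km -> price per m² in euros.
distances = [1.0, 2.0, 4.0, 6.0, 9.0]
prices = [5200.0, 4800.0, 4100.0, 3500.0, 2600.0]


def nearest_neighbor(x, xs, us):
    """Return the label of the training example closest to x."""
    best = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return us[best]


def fit_line(xs, us):
    """Least-squares fit of h(x) = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_u = sum(us) / n
    cov = sum((x - mean_x) * (u - mean_u) for x, u in zip(xs, us))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_u - a * mean_x


a, b = fit_line(distances, prices)
query = 5.0  # a hypothetical house 5 km from the centre
print("1-NN prediction:  ", nearest_neighbor(query, distances, prices))
print("linear prediction:", round(a * query + b, 1))
```

The nearest-neighbor hypothesis can only return a price already seen in the training set, while the line interpolates between examples; which one generalizes better depends on the data.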

Supervised Learning Problem: predict the price per m² of a given house. • Modeling • Data gathering • Learning • Validation • Use in real cases • Ideal vs. practice

Supervised Learning 1) Modeling • Input space • What is the meaningful information? • Variables • Output space • What is to be predicted? • Hypothesis space • Input –(computation) Output • What (kind of) computation?

Supervised Learning > 1) Modeling 1-a) Input space: Variables • What is the meaningful information? • Should we get as much as possible? • Information quality? • Noise • Quantity • Cost of information gathering? • Economic • Time • Risk (invasive?) • Ethics • Law (CNIL) • Domain of each variable? • Symbolic, bounded numeric, unbounded, etc.

Supervised Learning > 1) Modeling > a) Variables Price per m²: Variables • Location • Continuous: (x, y) longitude, latitude? • Symbolic: city name? • Age of building • Year of construction? • Relative to present or to construction date? • Nature of soil • Swimming pool?

Supervised Learning > 1) Modeling 1-b) Output space • What do we want on output? • Symbolic classes? (classification) • Boolean Yes/No (concept learning) • Multi-valued A/B/C/D/… • Numeric? (regression) • [0 ; 1]? • (-∞ ; +∞)? • How many outputs? • Multi-valued → multi-class? • 1 output for each class • Learn a model for each output? • More "free" • Learn 1 model for all outputs? • Each model can use others' information

Supervised Learning > 1) Modeling 1-c) Hypothesis space • Critical! • Depends on the learning algorithm • Linear regression: h(x) = ax + b • Parameters: a and b • Polynomial regression • # parameters = polynomial degree + 1 • Neural Networks, SVM, Genetic Algorithms, … • …

Choice of hypothesis space [Figure: estimation error, approximation error, and total error]

Supervised Learning > 1) Modeling > c) Hypothesis space Choice of hypothesis space • Space too "poor" → inadequate solutions • Ex: model sin(x) with y = ax + b • Space too "rich" → risk of overfitting • Defined by set of parameters • High # params → learning more difficult • But prefer a richer hypothesis space! • Use of generic methods • Add regularization
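The sin(x) example can be tried numerically. The sketch below fits polynomials by least squares (normal equations solved with a small Gaussian elimination) and compares the training error of a line with that of a cubic on samples of sin(x) over [0, 2π]. The sampling grid, the choice of degree 3 as the "richer" space, and all function names are assumptions made for illustration.

```python
import math


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0..degree] of sum(c[k] * x**k)."""
    n = degree + 1
    # Normal equations A c = b, with A[j][k] = sum(x^(j+k)) and b[j] = sum(y * x^j).
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(A[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (b[r] - known) / A[r][r]
    return coeffs


def mean_squared_error(coeffs, xs, ys):
    """Average squared difference between the polynomial and the targets."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (prediction - y) ** 2
    return total / len(xs)


xs = [2 * math.pi * i / 49 for i in range(50)]
ys = [math.sin(x) for x in xs]
mse_line = mean_squared_error(fit_polynomial(xs, ys, 1), xs, ys)
mse_cubic = mean_squared_error(fit_polynomial(xs, ys, 3), xs, ys)
print("training MSE, y = ax + b:", round(mse_line, 4))
print("training MSE, cubic:     ", round(mse_cubic, 4))
```

Both numbers are training errors only, so they show that the linear space is too poor for sin(x) but say nothing yet about overfitting in the richer space.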

Supervised Learning 2) Data gathering • Gathering • Electronic sensors • Simulation • Polls • Automated on the Internet • … • Get as much data as possible • Collection cost • Data as "pure" as possible • Avoid all noise • Noise in variables • Noise in labels! • 1 example = 1 value for each variable • Missing value = useless example?

Supervised Learning > 2) Data gathering Gathered data • Inputs / Variables • Measured output / class / label • But the true label y is unreachable!

Supervised Learning > 2) Data gathering Data preprocessing • Clean up data • Ex: reduce background noise • Transform data • Final format adapted to task • Ex: Fourier transform of a radio signal: time/amplitude → frequency/amplitude
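The time/amplitude to frequency/amplitude transformation can be sketched with a naive discrete Fourier transform. The signal below (a 5 Hz tone plus a weaker 12 Hz tone, 64 samples over one second) is hypothetical, and the O(n²) loop is chosen for readability rather than speed.

```python
import cmath
import math


def dft_amplitudes(signal):
    """Naive O(n^2) discrete Fourier transform; returns |X_k| / n per frequency bin."""
    n = len(signal)
    amplitudes = []
    # For a real-valued signal, bins above n/2 mirror the lower ones, so stop there.
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        amplitudes.append(abs(coeff) / n)
    return amplitudes


# Hypothetical signal: 64 samples over one second, so bin k corresponds to k Hz.
n_samples = 64
signal = [
    math.sin(2 * math.pi * 5 * t / n_samples)
    + 0.3 * math.sin(2 * math.pi * 12 * t / n_samples)
    for t in range(n_samples)
]
amplitudes = dft_amplitudes(signal)
dominant = max(range(len(amplitudes)), key=lambda k: amplitudes[k])
print("dominant frequency bin:", dominant)
```

After this preprocessing step, a learner sees one amplitude per frequency instead of one amplitude per time step, which may be a better-adapted input format for some tasks.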

Supervised Learning 3) Learning • Choice of program parameters • Choice of inductive test • Running the learning program • Performance test If bad, return to a)…

Supervised Learning > 3) Learning a) Choice of program parameters • Max allocated computation time • Max accepted error • Learning parameters • Specific to model • Knowledge introduction • Initialize parameters to "ok" values? • …

Supervised Learning > 3) Learning b) Choice of inductive test • Goal: find the hypothesis h ∈ H minimizing the real risk (expected risk, generalization error): $R(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$ • P(x, y): joint probability law over X × Y • l: loss function • h(x): predicted label • y: true label (or desired label u)

Supervised Learning > 3) Learning > b) Inductive test Real risk: $R(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$ • Goal: minimize the real risk • The real risk is not known, in particular because P(X, Y) is not known. • Discrimination • Regression
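The loss formulas that went with the "Discrimination" and "Regression" bullets are not reproduced here. Two common choices, given as assumptions rather than as the slide's own definitions, are the 0-1 loss for discrimination and the squared loss for regression:

```latex
l\bigl(h(x), y\bigr) =
\begin{cases}
0 & \text{if } h(x) = y \\
1 & \text{otherwise}
\end{cases}
\qquad \text{(discrimination)}

l\bigl(h(x), y\bigr) = \bigl(h(x) - y\bigr)^2
\qquad \text{(regression)}
```

With the 0-1 loss, R(h) is the probability that h misclassifies an example drawn from P(X, Y); with the squared loss, it is the expected squared prediction error.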

Supervised Learning > 3) Learning > b) Inductive test Empirical Risk Minimization • ERM principle • Find h ∈ H minimizing the empirical risk • Least error on training set
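For a training set of m labeled examples (x_i, u_i), the empirical risk is usually written as the average loss, R_emp(h) = (1/m) Σ l(h(x_i), u_i). The sketch below applies the ERM principle to a small finite hypothesis space of threshold classifiers with the 0-1 loss. The data, the candidate thresholds, and the function names are hypothetical.

```python
def empirical_risk(h, examples):
    """Average 0-1 loss of hypothesis h over the labeled examples."""
    errors = sum(1 for x, u in examples if h(x) != u)
    return errors / len(examples)


def make_threshold(theta):
    """Hypothesis h_theta(x) = 1 if x >= theta, else 0."""
    return lambda x: 1 if x >= theta else 0


# Hypothetical training set of (x_i, u_i) pairs; the label at x = 2.5 looks noisy.
training_set = [
    (0.5, 0), (1.0, 0), (2.0, 0), (2.5, 1), (3.0, 0),
    (3.2, 0), (4.0, 1), (5.0, 1), (6.0, 1),
]

# Hypothesis space H: one threshold classifier per candidate theta.
thetas = [0.0, 1.5, 2.25, 2.75, 3.5, 4.5]
best_theta = min(thetas, key=lambda th: empirical_risk(make_threshold(th), training_set))
best_risk = empirical_risk(make_threshold(best_theta), training_set)
print("ERM threshold:", best_theta, "empirical risk:", round(best_risk, 3))
```

The selected hypothesis has the least error on the training set, which is not the same as the least real risk; that gap is what the test and validation step is meant to measure.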

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Learning curve • Data quantity is important! • [Figure: learning curve, "error" versus training set size]

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Test / Validation • Measures overfitting / generalization • Can acquired knowledge be reused in new circumstances? • Do NOT validate over the training set! • Validation over an additional test set • Cross-validation • Useful when data are scarce • Leave-p-out
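Leave-p-out cross-validation can be sketched as follows: every subset of p examples is held out in turn, the learner is trained on the rest, and the loss on the held-out examples is averaged. The two learners (a mean predictor and 1-nearest-neighbor), the squared loss, and the data are assumptions chosen to keep the example short.

```python
from itertools import combinations


def leave_p_out_error(examples, p, fit, loss):
    """Average held-out loss over every way of holding out p examples."""
    total, count = 0.0, 0
    indices = range(len(examples))
    for held_out in combinations(indices, p):
        train = [examples[i] for i in indices if i not in held_out]
        h = fit(train)  # the learner never sees the held-out examples
        for i in held_out:
            x, u = examples[i]
            total += loss(h(x), u)
            count += 1
    return total / count


def fit_mean(train):
    """Baseline learner: always predict the mean training label."""
    mean = sum(u for _, u in train) / len(train)
    return lambda x: mean


def fit_nearest_neighbor(train):
    """1-nearest-neighbor learner on a single numeric input."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]


def squared_loss(prediction, u):
    return (prediction - u) ** 2


# Hypothetical data lying roughly on u = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.9), (6.0, 12.1)]
for name, fit in [("mean", fit_mean), ("1-NN", fit_nearest_neighbor)]:
    print(name, round(leave_p_out_error(data, 2, fit, squared_loss), 3))
```

The number of held-out subsets grows combinatorially with p, so this exhaustive form is only practical for small data sets, which is also where cross-validation is most useful.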

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Overfitting • [Figure: real risk and empirical risk versus data quantity; overfitting labeled]

Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Regularization • Limit overfitting before measuring it on test set • Add penalization in inductive test • Ex: • Penalize large number • Penalize resource use • …
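One way to read "add penalization in the inductive test" is to minimize the empirical risk plus a penalty term weighted by a coefficient λ. The sketch below assumes a one-parameter hypothesis h(x) = a·x, the squared loss, and a penalty λ·a² on the size of the parameter (a ridge-style choice that is not stated on the slide). The data and names are hypothetical.

```python
def fit_slope(examples, lam):
    """Minimize (1/m) * sum((a*x - u)^2) + lam * a^2 over a (closed form)."""
    m = len(examples)
    sxu = sum(x * u for x, u in examples)
    sxx = sum(x * x for x, _ in examples)
    # Setting the derivative to zero gives a * (sxx + lam * m) = sxu.
    return sxu / (sxx + lam * m)


def penalized_risk(a, examples, lam):
    """Empirical squared-loss risk of h(x) = a*x plus the penalty lam * a^2."""
    m = len(examples)
    return sum((a * x - u) ** 2 for x, u in examples) / m + lam * a * a


examples = [(1.0, 2.2), (2.0, 3.8), (3.0, 6.3)]
for lam in [0.0, 0.1, 1.0, 10.0]:
    a = fit_slope(examples, lam)
    print("lambda =", lam, "slope =", round(a, 3),
          "penalized risk =", round(penalized_risk(a, examples, lam), 3))
```

Larger values of λ pull the slope toward zero: the fit to the training set gets worse in exchange for a more constrained hypothesis.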

Supervised Learning > 3) Learning > b) Inductive test Maximum a posteriori • Bayesian approach • We suppose there exists a prior probability distribution over the space H: $p_H(h)$ • Maximum A Posteriori (MAP) principle: search for the most probable h after observing data S • Ex: observation of sheep color • h = "A sheep is white"
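Written out with Bayes' rule, the MAP principle takes the following form. The notation P(S | h) for the likelihood of the data under h is an assumption, since the slide only names the prior over H:

```latex
h_{\mathrm{MAP}}
  = \arg\max_{h \in H} P(h \mid S)
  = \arg\max_{h \in H} \frac{P(S \mid h)\, p_H(h)}{P(S)}
  = \arg\max_{h \in H} P(S \mid h)\, p_H(h)
```

The last step holds because P(S) does not depend on h. For the sheep example, if h is read as "every sheep is white" (an interpretation, not spelled out on the slide), observing a single non-white sheep gives P(S | h) = 0, so h cannot be the MAP hypothesis whatever its prior.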

Supervised Learning > 3) Learning > b) Inductive test Minimum Description Length Principle • Occam's razor: "Prefer the simplest hypotheses" • Simplicity: size of h → maximum compression • Maximum a posteriori with $p_H(h) = 2^{-d(h)}$ • d(h): length of h in bits • Compression → generalization
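The link between maximum a posteriori and description length can be made explicit. Assuming the usual MAP criterion, which maximizes the likelihood P(S | h) times the prior over H (the likelihood notation is an assumption), substituting the prior 2^{-d(h)} and taking base-2 logarithms gives:

```latex
\begin{aligned}
h^{*} &= \arg\max_{h \in H} P(S \mid h)\, 2^{-d(h)} \\
      &= \arg\max_{h \in H} \bigl[\log_2 P(S \mid h) - d(h)\bigr] \\
      &= \arg\min_{h \in H} \bigl[\, d(h) - \log_2 P(S \mid h) \,\bigr]
\end{aligned}
```

The term -log₂ P(S | h) can be read as the number of bits needed to encode the data once h is known, so the selected hypothesis minimizes the total length of "hypothesis plus data given the hypothesis".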

Supervised Learning > 3) Learning c) Running the learning program • Search for h • Use examples of training set • One by one • All together • Minimize inductive test

Supervised Learning > 3) Learning > c) Running the program Finding the parameters of the model • Explore hypothesis space H • Best hypothesis given inductive test? • Fundamentally depends on H • Structured exploration • Local exploration • No exploration

Supervised Learning > 3) Learning > c) Running the program > Exploring H Structured exploration • Structured by generality relation (partial order) • Version space • ILP (Inductive Logic Programming) • EBL (Explanation Based Learning) • Grammatical inference • Program enumeration

Supervised Learning > 3) Learning > c) Running the program > Exploring H Representation of the version space Structured by: • Upper bound: G-set • Lower bound: S-set • G-set = Set of all most general hypotheses consistent with known examples • S-set = Set of all most specific hypotheses consistent with known examples

Supervised Learning > 3-c) > Exploring H > Version space Learning… … by iterated updates of the version space Idea: update S-set and G-set after each new example Candidate elimination algorithm • Example: rectangles (cf. blackboard…)
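The blackboard example is not reproduced here. As a stand-in, the sketch below keeps a version space of axis-parallel rectangles on a small integer grid and updates it after each labeled point. It is a brute-force illustration: it enumerates the whole (finite) hypothesis space and filters it, then reads off the S-set and G-set, whereas candidate elimination proper updates the two boundary sets directly. The grid size, the examples, and the function names are hypothetical.

```python
from itertools import product

GRID = range(5)  # hypothetical 5x5 integer grid of points


def all_rectangles():
    """Every axis-parallel rectangle (x1, y1, x2, y2) with integer corners on the grid."""
    return [(x1, y1, x2, y2)
            for x1, y1, x2, y2 in product(GRID, repeat=4)
            if x1 <= x2 and y1 <= y2]


def covers(rect, point):
    """A rectangle labels a point positive when the point lies inside it (borders included)."""
    x1, y1, x2, y2 = rect
    x, y = point
    return x1 <= x <= x2 and y1 <= y <= y2


def more_general(r1, r2):
    """True if r1 contains r2, i.e. r1 labels positive every point that r2 does."""
    return r1[0] <= r2[0] and r1[1] <= r2[1] and r1[2] >= r2[2] and r1[3] >= r2[3]


def update(version_space, point, label):
    """Keep only the hypotheses consistent with the new labeled example."""
    return [r for r in version_space if covers(r, point) == label]


def s_set(version_space):
    """Most specific consistent hypotheses (no consistent hypothesis strictly inside them)."""
    return [r for r in version_space
            if not any(o != r and more_general(r, o) for o in version_space)]


def g_set(version_space):
    """Most general consistent hypotheses (no consistent hypothesis strictly containing them)."""
    return [r for r in version_space
            if not any(o != r and more_general(o, r) for o in version_space)]


examples = [((1, 1), True), ((2, 3), True), ((4, 4), False), ((0, 2), False)]
vs = all_rectangles()
for point, label in examples:
    vs = update(vs, point, label)
    print(point, label, "| S:", s_set(vs), "| G:", g_set(vs))
```

Every hypothesis in the version space lies between some element of the S-set and some element of the G-set in the generality order, which is why storing the two boundary sets is enough to represent it.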
