Introduction to Machine Learning Laurent Orseau AgroParisTech email@example.com EFREI 2010-2011 Based on slides by Antoine Cornuejols
Overview • Introduction to Induction (Laurent Orseau) • Neural Networks • Support Vector Machines • Decision Trees • Introduction to Data-Mining (Christine Martin) • Association Rules • Clustering • Genetic Algorithms
Overview: Introduction • Introduction to Induction • Examples of applications • Learning types • Supervised Learning • Reinforcement Learning • Unsupervised Learning • Machine Learning Theory • What questions to ask?
Introduction What is Machine Learning ? • Memory • Knowledge acquisition • Neurosciences • Short-term (working) • Keep 7±2 objects at a time • Long-term • Procedural • Action sequences • Declarative • Semantic (concepts) • Episodic (facts) • Learning Types • By heart • From rules • By imitation / demonstration • By trial & error • Knowledge reuse • In similar situations
Introduction What is Machine Learning? • "The field of study that gives computers the ability to learn without being explicitly programmed" Arthur Samuel, 1959 Samuel's Checkers > Schaeffer 2007 (solved) + TD-Gammon, Tesauro 1992
Introduction What is Machine Learning? Given: • Experience E, • A class of tasks T, • A performance measure P, A computer is said to learn if its performance on a task of T, measured by P, increases with experience E Tom Mitchell, 1997
Introduction Terms related to Machine Learning • Robotics / Automation • Google Cars, Nao • Prediction / forecasting • Stock exchange, pollution peaks, … • Recognition • Faces, speech, handwriting, gestures, … • Optimization • Subway speed, traveling salesman, … • Regulation • Heat, traffic, fridge temperature, … • Autonomy • Robots, hand prosthesis • Automatic problem solving • Adaptation • User preferences, robot in a changing environment • Induction • Generalization • Automatic discovery • …
Applications Learning to cook • Learning by imitation / demonstration • Procedural Learning (motor precision) • Object recognition
Applications DARPA Grand challenge (2005)
Applications > DARPA Grand Challenge 200km of desert Natural and artificial dangers No driver No remote control
Applications > DARPA Grand Challenge 5 Finalists
Applications > DARPA Grand Challenge Recognition of the road
Applications Learning to label images: Face recognition “Face Recognition: Component-based versus Global Approaches” (B. Heisele, P. Ho, J. Wu and T. Poggio), Computer Vision and Image Understanding, Vol. 91, No. 1/2, 6-21, 2003.
Applications > Image recognition Feature combinations
Applications Hand prosthesis • Recognition of pronator and supinator signals • Imperfect sensors • Noise • Uncertainty
Applications Autonomous robot rover on Mars
Supervised Learning Learning by heart? Unexploitable! • Must generalize • How to encode the input forms?
Introduction to Machine Learning theory • Supervised Learning • Reinforcement Learning • Unsupervised Learning (CM) • Genetic Algorithms (CM)
Supervised Learning • Set of examples xi, each labeled ui • Find a hypothesis h such that: h(xi) = ui ? • h(xi): predicted label • Best hypothesis h*?
Supervised Learning Supervised Learning: 1st Example • Houses: Price / m² • Searching for h • Nearest neighbors? • Linear, polynomial regression? • More information • Localization (x, y ? or symbolic variable?), age of building, neighborhood, swimming-pool, local taxes, temporal evolution,…?
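The two candidate hypothesis families above (nearest neighbors and linear regression) can be sketched as follows; the house data and EUR/m² figures are made up for illustration:

```python
# Toy sketch (hypothetical data): predict price per m^2 from surface area.

def nearest_neighbour(train, x):
    """Return the label of the training example closest to x."""
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def linear_fit(train):
    """Least-squares fit of h(x) = a*x + b on (x, u) pairs."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    mu = sum(u for _, u in train) / n
    a = (sum((x - mx) * (u - mu) for x, u in train)
         / sum((x - mx) ** 2 for x, _ in train))
    b = mu - a * mx
    return lambda x: a * x + b

train = [(30, 5000), (50, 4500), (80, 4000), (120, 3500)]  # (m^2, EUR/m^2)
h = linear_fit(train)
```

Both are hypotheses h mapping an input x to a predicted label; the next slides ask which choice of hypothesis space is best.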
Supervised Learning Problem Predicting the price per m² for a given house. • Modeling • Data gathering • Learning • Validation • Use in real case (Ideal vs. practice)
Supervised Learning 1) Modeling • Input space • What is the meaningful information? • Variables • Output space • What is to be predicted? • Hypothesis space • Input → (computation) → Output • What (kind of) computation?
Supervised Learning > 1) Modeling 1-a) Input space: Variables • What is the meaningful information? • Should we get as much as possible? • Information quality? • Noise • Quantity • Cost of information gathering? • Economic • Time • Risk (invasive?) • Ethics • Law (CNIL) • Definition domain of each variable? • Symbolic, bounded numeric, unbounded, etc.
Supervised Learning > 1) Modeling > a) Variables Price per m²: Variables • Localization • Continuous: (x, y) longitude/latitude? • Symbolic: city name? • Age of building • Year of construction? • Relative to the present or to the construction date? • Nature of soil • Swimming-pool?
Supervised Learning > 1) Modeling 1-b) Output space • What do we want as output? • Symbolic classes? (classification) • Boolean Yes/No (concept learning) • Multi-valued A/B/C/D/… • Numeric? (regression) • [0 ; 1] ? • [-∞ ; +∞] ? • How many outputs? • Multi-valued / multi-class? • 1 output for each class • Learn a model for each output? • More "free" • Learn 1 model for all outputs? • Each model can use the others' information
Supervised Learning > 1) Modeling 1-c) Hypothesis space • Critical! • Depends on the learning algorithm • Linear regression: space = { ax + b } • Parameters: a and b • Polynomial regression • # parameters = polynomial degree + 1 • Neural Networks, SVM, Gen Algo, … • …
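A minimal illustration of how hypothesis-space richness tracks the number of parameters, using NumPy polynomial fits on a noisy sin sample (the data and degrees are assumptions, not from the slides):

```python
import numpy as np

# Hypothetical data: a noisy sample of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def train_error(degree):
    """Mean squared training error of a degree-d polynomial fit
    (a hypothesis space with degree + 1 parameters)."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))
```

A degree-1 space is too "poor" for sin; higher degrees fit the training set better, at the risk of overfitting discussed on the next slides.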
Choice of hypothesis space [Figure: total error = approximation error + estimation error]
Supervised Learning > 1) Modeling > c) Hypothesis space Choice of hypothesis space • Space too "poor" → inadequate solutions • Ex: modeling sin(x) with y = ax + b • Space too "rich" → risk of overfitting • Defined by a set of parameters • High # params → learning more difficult • But prefer a richer hypothesis space! • Use generic methods • Add regularization
Supervised Learning 2) Data gathering • Gathering • Electronic sensors • Simulation • Polls • Automated on the Internet • … • Gather as much data as possible • Collection cost • Data as "pure" as possible • Avoid all noise • Noise in variables • Noise in labels! • 1 example = 1 value for each variable • Missing value = useless example?
Supervised Learning > 2) Data gathering [Figure: gathered data as a table of inputs/variables with a measured output/class/label per example] But the true label y is unreachable!
Supervised Learning > 2) Data gathering Data preprocessing • Clean up data • Ex: reduce background noise • Transform data • Final format adapted to the task • Ex: Fourier transform of a radio signal: time/amplitude → frequency/amplitude
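The Fourier-transform preprocessing step can be sketched with NumPy's FFT; the 5 Hz toy signal and sampling rate below are assumptions for illustration:

```python
import numpy as np

# Hypothetical toy signal: a pure 5 Hz sine sampled at 100 Hz for 1 second.
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t)

# Preprocessing: time/amplitude -> frequency/amplitude.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
dominant = float(freqs[np.argmax(spectrum)])  # recovers the 5 Hz component
```

The learner then works on the frequency representation, where the relevant structure (here, a single peak) is much easier to exploit.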
Supervised Learning 3) Learning • Choice of program parameters • Choice of inductive test • Running the learning program • Performance test If bad, return to a)…
Supervised Learning > 3) Learning a) Choice of program parameters • Max allocated computation time • Max accepted error • Learning parameters • Specific to model • Knowledge introduction • Initialize parameters to "ok" values? • …
Supervised Learning > 3) Learning b) Choice of inductive test Goal: find a hypothesis h ∈ H minimizing the real risk (risk expectancy, generalization error): R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y) • P(x, y): joint probability law over X×Y • ℓ: loss function • h(x): predicted label • y: true label (or desired u)
Supervised Learning > 3) Learning > b) Inductive test Real risk R(h) = ∫_{X×Y} ℓ(h(x), y) dP(x, y) • Goal: minimize the real risk • The real risk is not known, in particular P(X, Y) • Discrimination • Regression
Supervised Learning > 3) Learning > b) Inductive test Empirical Risk Minimization • ERM principle • Find h ∈ H minimizing the empirical risk Remp(h) = (1/m) Σi ℓ(h(xi), ui) • Least error on the training set
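A toy sketch of the ERM principle with 0-1 loss; the threshold-classifier family and the four labelled points are hypothetical:

```python
# ERM sketch: among candidate hypotheses, pick the one with least
# empirical risk (mean 0-1 loss) on the training set.

def empirical_risk(h, sample):
    """Mean 0-1 loss of hypothesis h over labelled examples (x, u)."""
    return sum(h(x) != u for x, u in sample) / len(sample)

def erm(hypotheses, sample):
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# Hypothetical family: threshold classifiers h_t(x) = 1 iff x >= t.
sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
hypotheses = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]
best = erm(hypotheses, sample)
```

Here the threshold t = 0.5 separates the sample perfectly, so ERM selects it; nothing guarantees it also minimizes the real risk.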
"error" Learning curve Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Learning curve • Data quantity is important! Training set size
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Test / Validation • Measures overfitting / generalization • Can acquired knowledge be reused in new circumstances? • Do NOT validate on the training set! • Validation on an additional test set • Cross-validation • Useful when data is scarce • leave-p-out
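A minimal k-fold split helper illustrating why the validation fold is kept disjoint from the training folds (the contiguous fold layout is an assumption, and for simplicity leftover examples when k does not divide n are ignored; leave-p-out would instead enumerate all size-p test sets):

```python
# Sketch of k-fold cross-validation splitting: each example is tested on
# exactly once, and the test fold never overlaps the training folds.

def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over
    n examples. Simplification: drops the last n % k examples."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        train = [j for j in range(n) if j not in test]
        yield train, test
```

One would train on each `train` split, measure error on the matching `test` split, and average the k error estimates.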
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Overfitting [Figure: real risk vs. empirical risk as a function of data quantity]
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk Regularization • Limit overfitting before measuring it on the test set • Add a penalization term to the inductive test • Ex: • Penalize a large number of parameters • Penalize resource use • …
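One way to sketch a regularized inductive test: empirical loss plus an L2 (ridge-style) penalty on the parameters. The penalty form and the λ value are assumptions, not from the slides:

```python
# Sketch: regularized criterion = empirical squared loss + L2 penalty
# discouraging large parameters (ridge-style; lam is an assumed weight).

def regularized_risk(params, sample, h, lam=0.1):
    """Mean squared loss of h(params, x) plus lam * ||params||^2."""
    emp = sum((h(params, x) - u) ** 2 for x, u in sample) / len(sample)
    penalty = lam * sum(p * p for p in params)
    return emp + penalty

def linear(params, x):
    """h(x) = a*x + b with params = [a, b]."""
    return params[0] * x + params[1]

sample = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical training data
```

Minimizing this criterion instead of the raw empirical risk trades a little training error for smaller parameters, which limits overfitting.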
Supervised Learning > 3) Learning > b) Inductive test Maximum a posteriori • Bayesian approach • We suppose there exists a prior probability distribution over the space H: pH(h) Maximum A Posteriori principle (MAP): • Search for the most probable h after observing data S • Ex: Observation of sheep colors • h = "All sheep are white"
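The sheep-color example can be sketched as a discrete MAP computation; the prior values and likelihoods below are assumptions for illustration:

```python
# MAP sketch: pick the h maximizing p(h) * p(S | h) after observing S.

def map_hypothesis(prior, likelihood, data):
    """prior: {h: p(h)}; likelihood(h, x) = p(x | h); data: observations."""
    def posterior_score(h):
        score = prior[h]
        for x in data:
            score *= likelihood(h, x)
        return score
    return max(prior, key=posterior_score)

# Hypothetical prior and likelihoods for the sheep-color example.
prior = {"all sheep are white": 0.9, "some sheep are black": 0.1}

def likelihood(h, colour):
    if h == "all sheep are white":
        return 1.0 if colour == "white" else 0.0
    return 0.5  # under the other hypothesis, either colour is equally likely
```

A single black sheep drives the likelihood of "all sheep are white" to zero, so the MAP choice flips despite that hypothesis's high prior.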
Supervised Learning > 3) Learning > b) Inductive test Minimum Description Length Principle • Occam's razor: "Prefer the simplest hypotheses" • Simplicity: size of h → maximum compression • Maximum a posteriori with pH(h) = 2^-d(h) • d(h): length of h in bits • Compression → generalization
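A minimal sketch of MDL-style selection, assuming a naive one-byte-per-character code for d(h):

```python
# MDL sketch: among hypotheses consistent with the data, prefer the one
# with the shortest description. d(h) = 8 * len(h) is an assumed naive
# encoding (one byte per character of the hypothesis string).

def description_length(h):
    return 8 * len(h)

def mdl_choose(hypotheses, consistent):
    """Pick the shortest hypothesis among those consistent with the data."""
    candidates = [h for h in hypotheses if consistent(h)]
    return min(candidates, key=description_length)
```

This is exactly MAP with the prior pH(h) = 2^-d(h): shorter hypotheses get exponentially more prior mass.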
Supervised Learning > 3) Learning c) Running the learning program • Search for h • Use examples of training set • One by one • All together • Minimize inductive test
Supervised Learning > 3) Learning > c) Running the program Finding the parameters of the model • Explore hypothesis space H • Best hypothesis given inductive test? • Fundamentally depends on H • Structured exploration • Local exploration • No exploration
Supervised Learning > 3) Learning > c) Running the program > Exploring H Structured exploration • Structured by generality relation (partial order) • Version space • ILP (Inductive Logic Programming) • EBL (Explanation Based Learning) • Grammatical inference • Program enumeration
Supervised Learning > 3) Learning > c) Running the program > Exploring H Representation of the version space Structured by: • Upper bound: G-set • Lower bound: S-set • G-set = Set of all most general hypotheses consistent with known examples • S-set = Set of all most specific hypotheses consistent with known examples
Supervised Learning > 3-c) > Exploring H > Version space Learning… … by iterated updates of the version space Idea: update S-set and G-set after each new example Candidate elimination algorithm • Example: rectangles (cf. blackboard…)
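The S-set update for the rectangle example can be sketched as follows: the most specific consistent hypothesis is the smallest axis-aligned rectangle enclosing the positive examples seen so far (the points are made up; handling negative examples and the G-set is omitted):

```python
# Candidate-elimination sketch (S-set only) for axis-aligned rectangles:
# each new positive example can only generalize the most specific
# hypothesis, here by growing the bounding rectangle just enough.

def update_s_set(s, point):
    """s = (xmin, xmax, ymin, ymax) or None before any example;
    point = (x, y) is a positive example."""
    x, y = point
    if s is None:
        return (x, x, y, y)          # most specific: the point itself
    xmin, xmax, ymin, ymax = s
    return (min(xmin, x), max(xmax, x), min(ymin, y), max(ymax, y))

s = None
for p in [(2, 3), (5, 1), (4, 6)]:   # hypothetical positive examples
    s = update_s_set(s, p)
```

Negative examples would symmetrically specialize the G-set, and learning stops when the S-set and G-set meet.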