Machine Learning: Making Computer Science Scientific

Thomas G. Dietterich, Department of Computer Science, Oregon State University, Corvallis, Oregon 97331. http://www.cs.orst.edu/~tgd

Acknowledgements • VLSI Wafer Testing: Tony Fountain • Robot Navigation: Didac Busquets, Carles Sierra, Ramon Lopez de Mantaras • NSF grants IIS-0083292 and ITR-085836

Outline • Three scenarios where standard software engineering methods fail • Machine learning methods applied to these scenarios • Fundamental questions in machine learning • Statistical thinking in computer science

Scenario 1: Reading Checks Find and read “courtesy amount” on checks:

Possible Methods: • Method 1: Interview humans to find out what steps they follow in reading checks • Method 2: Collect examples of checks and the correct amounts. Train a machine learning system to recognize the amounts

Scenario 2: VLSI Wafer Testing • Wafer test: Functional test of each die (chip) while on the wafer

Which chips (and how many) should be tested? • Tradeoff: • Test all chips on the wafer: avoid the cost of packaging bad chips, but incur the cost of testing every chip • Test none of the chips on the wafer: no on-wafer testing cost, but some bad chips may be packaged

Possible Methods • Method 1: Guess the right tradeoff point • Method 2: Learn a probabilistic model that captures the probability that each chip will be bad • Plug this model into a Bayesian decision making procedure to optimize expected profit
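
The tradeoff between the two extreme strategies can be sketched with a little arithmetic. The following is a minimal illustration, not the fielded decision procedure. It assumes a perfect on-wafer test, a single failure probability shared by all chips, and that revenue from good chips is the same under both strategies, so only costs are compared. All function names and cost figures are hypothetical.

```python
def cost_test_all(p_bad, test_cost, package_cost):
    """Expected cost per chip when every chip is tested and only good ones are packaged."""
    return test_cost + (1.0 - p_bad) * package_cost


def cost_test_none(package_cost):
    """Cost per chip when nothing is tested on the wafer, so every chip is packaged."""
    return package_cost


def break_even_p_bad(test_cost, package_cost):
    """Failure probability above which testing every chip becomes the cheaper option."""
    return test_cost / package_cost


def cheaper_strategy(p_bad, test_cost, package_cost):
    """Pick the extreme strategy with the lower expected cost per chip."""
    if cost_test_all(p_bad, test_cost, package_cost) < cost_test_none(package_cost):
        return "test all"
    return "test none"


# Hypothetical costs: testing a chip costs 1 unit, packaging a chip costs 5 units.
threshold = break_even_p_bad(1.0, 5.0)
high_defect = cheaper_strategy(0.3, 1.0, 5.0)  # failure rate above the threshold
low_defect = cheaper_strategy(0.1, 1.0, 5.0)   # failure rate below the threshold
```

Under these assumptions the right strategy flips at p_bad = test_cost / package_cost, which is why a learned estimate of the failure probability is more useful than a guessed tradeoff point.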

Scenario 3: Allocating mobile robot camera • Binocular camera • No GPS

Camera tradeoff • Mobile robot uses camera both for obstacle avoidance and landmark-based navigation • Tradeoff: • If camera is used only for navigation, robot collides with objects • If camera is used only for obstacle avoidance, robot gets lost

Possible Methods • Method 1: Manually write a program to allocate the camera • Method 2: Experimentally learn a policy for switching between obstacle avoidance and landmark tracking

Challenges for SE Methodology • Standard SE methods fail when… • System requirements are hard to collect • The system must resolve difficult tradeoffs

(1) System requirements are hard to collect • There are no human experts (e.g., cellular telephone fraud) • Human experts are inarticulate (e.g., handwriting recognition) • The requirements are changing rapidly (e.g., computer intrusion detection) • Each user has different requirements (e.g., e-mail filtering)

(2) The system must resolve difficult tradeoffs • VLSI wafer testing: the tradeoff point depends on the probability of bad chips and on the relative costs of testing versus packaging • Camera allocation for mobile robot: the tradeoff depends on the probability of obstacles and on the number and quality of landmarks

Machine Learning: Replacing guesswork with data • In all of these cases, the standard SE methodology requires engineers to make guesses • Guessing how to do character recognition • Guessing the tradeoff point for wafer test • Guessing the tradeoff for camera allocation • Machine Learning provides a way of making these decisions based on data

Outline • Three scenarios where software engineering methods fail • Machine learning methods applied to these scenarios • Fundamental questions in machine learning • Statistical thinking in computer science

Basic Machine Learning Methods • Supervised Learning • Density Estimation • Reinforcement Learning

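Supervised Learning [Figure: training examples (handwritten digits labeled 1, 0, 6, 3, 8) are fed to a learning algorithm, which produces a classifier; the classifier labels a new example as 8]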

AT&T/NCR Check Reading System • Recognition transformer is a neural network trained on 500,000 examples of characters • The entire system is trained with entire checks as input and dollar amounts as output • LeCun, Bottou, Bengio & Haffner (1998), “Gradient-Based Learning Applied to Document Recognition”

Check Reader Performance • 82% of machine-printed checks correctly recognized • 1% of checks incorrectly recognized • 17% “rejected” – check is presented to a person for manual reading • Fielded by NCR in June 1996; reads millions of checks per month
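
The three-way split into correct, incorrect, and rejected checks is what a classifier with a reject option produces. The sketch below shows one common way to get such a split: thresholding a confidence score. It is an assumption for illustration, not a description of how the NCR system decides to reject. The `tally` function, the confidence values, and the sample amounts are all hypothetical.

```python
def tally(results, threshold):
    """Return (correct, incorrect, rejected) rates for a classifier with a reject option.

    results: list of (predicted_amount, confidence, true_amount) triples.
    A prediction is accepted only when its confidence reaches the threshold;
    otherwise the check is rejected and handed to a person.
    """
    n = len(results)
    correct = sum(1 for pred, conf, truth in results if conf >= threshold and pred == truth)
    wrong = sum(1 for pred, conf, truth in results if conf >= threshold and pred != truth)
    rejected = n - correct - wrong
    return correct / n, wrong / n, rejected / n


# Four made-up checks: two read correctly, one low-confidence read, one confident error.
results = [
    ("12.50", 0.99, "12.50"),
    ("80.00", 0.95, "80.00"),
    ("7.10", 0.60, "1.10"),
    ("45.00", 0.97, "46.00"),
]
rates = tally(results, threshold=0.9)
strict_rates = tally(results, threshold=0.98)
```

Raising the threshold trades fewer incorrect reads for more checks sent to manual reading, which is itself a cost tradeoff of the kind discussed in the scenarios.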

Supervised Learning Summary • Desired classifier is a function y = f(x) • Training examples are desired input-output pairs (x_i, y_i)
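
As a minimal sketch of this setup, the code below builds a classifier f from input-output pairs using a 1-nearest-neighbor rule. This is a stand-in chosen for brevity, not the neural network used in the check reader. The two-dimensional feature vectors and the labels are made up.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # one training pair (x_i, y_i)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_nearest_neighbor(examples: List[Example]) -> Callable[[Sequence[float]], str]:
    """Return a classifier f where f(x) is the label of the closest training input."""
    stored = list(examples)

    def f(x: Sequence[float]) -> str:
        _, label = min(stored, key=lambda ex: squared_distance(ex[0], x))
        return label

    return f


# Hypothetical training set: two clusters of points with labels "zero" and "one".
training = [
    ((0.0, 0.0), "zero"),
    ((0.1, 0.2), "zero"),
    ((1.0, 1.0), "one"),
    ((0.9, 1.1), "one"),
]
f = train_nearest_neighbor(training)
prediction = f((0.8, 0.9))  # a new example not seen during training
```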

Density Estimation [Figure: training examples are fed to a learning algorithm, which produces a density estimator; given a partially tested wafer, the estimator outputs, for example, P(chip_i is bad) = 0.42]

On-Wafer Testing System [Figure: network with a wafer-level node W connected to chip nodes C1, C2, C3, …, C209] • Trained density estimator on 600 wafers from mature product (HP; Corvallis, OR) • Probability model is “naïve Bayes” mixture model with four components (trained with EM)
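
To show how such a model turns test results into a prediction, here is a small sketch of inference in a naive Bayes mixture: a hidden wafer class W, with chips failing independently given W. The parameters below (two components, three chips) are invented for illustration. The system described above used four components and was fit with EM, which this sketch does not implement.

```python
def class_posterior(prior, p_bad, observed):
    """P(W = k | test results so far).

    prior[k]     = P(W = k)
    p_bad[k][j]  = P(chip j is bad | W = k)
    observed     = {chip index: True if the chip tested bad, False if good}
    """
    weights = []
    for p_k, theta in zip(prior, p_bad):
        w = p_k
        for j, is_bad in observed.items():
            w *= theta[j] if is_bad else 1.0 - theta[j]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def prob_bad(prior, p_bad, observed, i):
    """P(chip i is bad | test results so far), averaging over the hidden wafer class."""
    post = class_posterior(prior, p_bad, observed)
    return sum(w * theta[i] for w, theta in zip(post, p_bad))


# Hypothetical model: a "good wafer" class and a "bad wafer" class, equally likely.
prior = [0.5, 0.5]
p_bad = [[0.1, 0.1, 0.1], [0.8, 0.8, 0.8]]

before_testing = prob_bad(prior, p_bad, {}, 1)
after_one_failure = prob_bad(prior, p_bad, {0: True}, 1)
```

Seeing one bad chip shifts belief toward the bad-wafer class, so the predicted failure probability of the untested chips rises. This coupling between chips is what makes testing a few chips informative about the rest.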

One-Step Value of Information • Choose the larger of • Expected profit if we predict remaining chips, package, and re-test • Expected profit if we test chip Ci, then predict remaining chips, package, and re-test [for all Ci not yet tested]
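
The comparison above can be sketched as follows. This is a simplified illustration under stated assumptions, not the fielded system: the probability model is a toy naive Bayes mixture, a packaged chip sells for `value` if it is good, bad packaged chips are caught by the final test and earn nothing, and a chip is packaged only when its expected revenue exceeds `package_cost`. All names and numbers are hypothetical.

```python
def class_posterior(prior, p_bad, observed):
    """P(W = k | results so far) for a naive Bayes mixture over wafer classes."""
    weights = []
    for p_k, theta in zip(prior, p_bad):
        w = p_k
        for j, is_bad in observed.items():
            w *= theta[j] if is_bad else 1.0 - theta[j]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def prob_bad(prior, p_bad, observed, i):
    post = class_posterior(prior, p_bad, observed)
    return sum(w * theta[i] for w, theta in zip(post, p_bad))


def profit_if_stop(model, observed, value, package_cost):
    """Expected profit if we stop testing now and package based on predictions."""
    prior, p_bad = model
    total = 0.0
    for j in range(len(p_bad[0])):
        if j in observed:
            p_good = 0.0 if observed[j] else 1.0
        else:
            p_good = 1.0 - prob_bad(prior, p_bad, observed, j)
        # Package a chip only when its expected revenue covers the packaging cost.
        total += max(0.0, p_good * value - package_cost)
    return total


def profit_if_test(model, observed, i, value, package_cost, test_cost):
    """Expected profit if we pay to test chip i first, then stop."""
    prior, p_bad = model
    p_i_bad = prob_bad(prior, p_bad, observed, i)
    after_bad = profit_if_stop(model, {**observed, i: True}, value, package_cost)
    after_good = profit_if_stop(model, {**observed, i: False}, value, package_cost)
    return -test_cost + p_i_bad * after_bad + (1.0 - p_i_bad) * after_good


def one_step_voi(model, observed, value, package_cost, test_cost):
    """Return the better of stopping now or testing one more chip, with its expected profit."""
    best_action = "stop"
    best_profit = profit_if_stop(model, observed, value, package_cost)
    for i in range(len(model[1][0])):
        if i in observed:
            continue
        profit = profit_if_test(model, observed, i, value, package_cost, test_cost)
        if profit > best_profit:
            best_action, best_profit = ("test", i), profit
    return best_action, best_profit


# Toy model: two equally likely wafer classes, three chips.
model = ([0.5, 0.5], [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]])
cheap_test = one_step_voi(model, {}, value=10.0, package_cost=4.0, test_cost=0.5)
costly_test = one_step_voi(model, {}, value=10.0, package_cost=4.0, test_cost=10.0)
```

With a cheap test, one measurement reveals enough about the wafer class to more than pay for itself. With an expensive test, the same information is not worth buying and the procedure stops immediately.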

On-Wafer Chip Test Results • 3.8% increase in profit

Density Estimation Summary • Desired output is a joint probability distribution P(C1, C2, …, C203) • Training examples are points X= (C1, C2, …, C203) sampled from this distribution

Reinforcement Learning [Figure: the agent receives state s and reward r from the environment and sends action a back to it] • Agent’s goal: choose actions to maximize total reward • The action selection rule is called a “policy”: a = π(s)

Reinforcement Learning for Robot Navigation • Learning from rewards and punishments in the environment • Give reward for reaching goal • Give punishment for getting lost • Give punishment for collisions

Experimental Results: % trials robot reaches goal • Busquets, Lopez de Mantaras, Sierra, Dietterich (2002)

Reinforcement Learning Summary • Desired output is an action selection policy π • Training examples are <s,a,r,s’> tuples collected by the agent interacting with the environment
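
One well-known way to learn a policy from such tuples is tabular Q-learning. The sketch below applies it to a made-up four-state corridor, not to the robot navigation task, and it is not claimed to be the method used in the experiments above. For reproducibility it sweeps repeatedly over a fixed batch of tuples instead of collecting them through live interaction. The environment, the reward values, and the learning parameters are assumptions.

```python
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3  # terminal state of the hypothetical corridor with states 0..3


def step(s, a):
    """Toy environment: move along the corridor, reward 1 on reaching the goal."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return reward, s_next


def q_learning(experience, sweeps=200, alpha=0.5, gamma=0.9):
    """Estimate action values Q(s, a) from (s, a, r, s_next) tuples."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next in experience:
            # No future reward is available once the terminal state is reached.
            future = 0.0 if s_next == GOAL else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * future - q[(s, a)])
    return q


def greedy_policy(q, states):
    """The learned policy: in each state pick the action with the highest value."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}


# One tuple for every non-terminal state and action.
experience = []
for s in range(GOAL):
    for a in ACTIONS:
        r, s_next = step(s, a)
        experience.append((s, a, r, s_next))

q = q_learning(experience)
policy = greedy_policy(q, range(GOAL))
```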

Fundamental Issues in Machine Learning • Incorporating Prior Knowledge • Incorporating Learned Structures into Larger Systems • Making Reinforcement Learning Practical • Triple Tradeoff: accuracy, sample size, hypothesis complexity

Incorporating Prior Knowledge • How can we incorporate our prior knowledge into the learning algorithm? • Difficult for decision trees, neural networks, support-vector machines, etc.: there is a mismatch between the form of our knowledge and the way the algorithms work • Easier for Bayesian networks: express knowledge as constraints on the network

Incorporating Learned Structures into Larger Systems • Success story: Digit recognizer incorporated into check reader • Challenges: • Larger system may make several coordinated decisions, but the learning system treats each decision as independent • Larger system may have a complex cost function: an error in the thousands place matters more than an error in the cents place (e.g., $7,236.07)

Making Reinforcement Learning Practical • Current reinforcement learning methods do not scale well to large problems • Need robust reinforcement learning methodologies

The Triple Tradeoff • Fundamental relationship between • amount of training data • size and complexity of hypothesis space • accuracy of the learned hypothesis • Explains many phenomena observed in machine learning systems

Learning Algorithms • Set of data points • Class H of hypotheses • Optimization problem: Find the hypothesis h in H that best fits the data [Figure: training data mapped to a hypothesis h inside the hypothesis space]

Triple Tradeoff: amount of data, hypothesis complexity, accuracy [Figure: accuracy versus hypothesis space complexity, one curve each for N = 10, N = 100, and N = 1000]

Triple Tradeoff (2) [Figure: accuracy versus number of training examples N, one curve each for hypothesis spaces H1, H2, and H3, ordered by hypothesis complexity]

Intuition • With only a small amount of data, we can only discriminate between a small number of different hypotheses • As we get more data, we have more evidence, so we can consider more alternative hypotheses • Complex hypotheses give better fit to the data
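
One standard way to make this intuition quantitative, which is not taken from the slides, is the sample-size bound for a finite hypothesis space: a learner that outputs a hypothesis consistent with m independent training examples has true error at most epsilon with probability at least 1 - delta, provided m >= (ln|H| + ln(1/delta)) / epsilon. The sketch below just evaluates that bound. The hypothesis-space sizes are arbitrary examples.

```python
import math


def sample_size_bound(hypothesis_count, epsilon, delta):
    """Training examples sufficient for a consistent learner over a finite hypothesis space.

    Evaluates m >= (ln|H| + ln(1/delta)) / epsilon and rounds up.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Target: error at most 10% with 95% confidence.
small_space = sample_size_bound(2 ** 10, epsilon=0.1, delta=0.05)
large_space = sample_size_bound(2 ** 20, epsilon=0.1, delta=0.05)
```

The required sample size grows with the logarithm of the number of hypotheses, so squaring the size of the hypothesis space adds to the data requirement without doubling it.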

Fixed versus Variable-Sized Hypothesis Spaces • Fixed size: ordinary linear regression, Bayes nets with fixed structure, neural networks • Variable size: decision trees, Bayes nets with variable structure, support vector machines

Corollary 1: Fixed H will underfit [Figure: accuracy versus number of training examples N for H1 and H2, with the underfitting region marked]

Corollary 2: Variable-sized H will overfit [Figure: accuracy versus hypothesis space complexity for N = 100, with the overfitting region marked]

Ideal Learning Algorithm: Adapt complexity to data [Figure: accuracy versus hypothesis space complexity, one curve each for N = 10, N = 100, and N = 1000]

Adapting Hypothesis Complexity to Data Complexity • Find hypothesis h to minimize error(h) + λ complexity(h) • Many methods for adjusting λ • Cross-validation • MDL
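
A minimal sketch of the penalized objective, training error plus a weight times complexity, is given below. It uses piecewise-constant functions on k equal-width bins as the hypothesis space and the bin count k as the complexity measure. These choices, the synthetic data, and the hand-picked penalty weights (`lam`) are assumptions for illustration. In practice the weight would be set by a method such as cross-validation or MDL, which is not shown here.

```python
def fit_piecewise_constant(xs, ys, k):
    """Fit one constant level per bin on [0, 1): the mean of the y values in that bin."""
    sums = [0.0] * k
    counts = [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def mean_squared_error(levels, xs, ys):
    k = len(levels)
    return sum((y - levels[min(int(x * k), k - 1)]) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_complexity(xs, ys, candidates, lam):
    """Pick the bin count k that minimizes training error + lam * k."""
    def score(k):
        levels = fit_piecewise_constant(xs, ys, k)
        return mean_squared_error(levels, xs, ys) + lam * k
    return min(candidates, key=score)


# Synthetic data: a step from 0 to 1 at x = 0.5, plus alternating +/-0.1 "noise".
xs = [(i + 0.5) / 16 for i in range(16)]
ys = [(0.0 if x < 0.5 else 1.0) + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
candidates = [1, 2, 4, 8, 16]

no_penalty = select_complexity(xs, ys, candidates, lam=0.0)
mild_penalty = select_complexity(xs, ys, candidates, lam=0.001)
heavy_penalty = select_complexity(xs, ys, candidates, lam=0.3)
```

With no penalty the most complex hypothesis wins because it memorizes the noise. A mild penalty recovers the two-level structure that generated the data, and a heavy penalty forces an underfit single constant.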

The Data Explosion • NASA Data • 284 Terabytes (as of August, 1999) • Earth Observing System: 194 G/day • Landsat 7: 150 G/day • Hubble Space Telescope: 0.6 G/day http://spsosun.gsfc.nasa.gov/eosinfo/EOSDIS_Site/index.html

The Data Explosion (2) • Google indexes 2,073,418,204 web pages • US Year 2000 Census: 62 Terabytes of scanned images • Walmart Data Warehouse: 7 (500?) Terabytes • Missouri Botanical Garden TROPICOS plant image database: 700 Gbytes

Old Computer Science Conception of Data [Figure: Store, Retrieve]

New Computer Science Conception of Data [Figure: Problems, Store, Build Models, Solve Problems, Solutions]
