
CS 657/790 Machine Learning and Data Mining: Course Introduction

- Please hand in sheet of paper with:
- Your name and email address
- Your classification (e.g., 2nd-year computer science PhD student)
- Your experience with MATLAB (none, some, or much)
- Your undergraduate degree (when, what, where)
- Your AI experience (courses at UWM or elsewhere)
- Your programming experience

- Course Instructor: Joe Bockhorst
- email: joebock@uwm.edu
- office: 1155 EMS
- Course webpage: http://www.uwm.edu/~joebock/790.html
- office hours: ???
- Possible times:
- before class on Monday (3:30-5:30)
- Monday morning
- Wednesday morning
- after class Monday (7:00-9:00)

- Machine Learning (Tom Mitchell)
- Bookstore in union, $140 new
- Amazon.com hard cover: $125 new, $80 used
- Amazon.com soft cover: < $30

- Read (posted on class web page)
- Preface
- Chapter 1
- Sections 6.1, 6.2, 6.9, 6.10
- Sections 8.1, 8.2

- PowerPoint encourages words over pictures (not good)
- But PowerPoint can be saved, tweaked, easily shared, …
- Notes posted on course website following lecture

- Your thoughts?

- Slides are a combination of
- Jude Shavlik’s notes from the UW-Madison machine learning course (a professor I had)
- Textbook Slides (Google “machine learning textbook”)
- My notes

- Is there one?

- 1st half covers supervised learning
- Algorithms: support vector machines, neural networks, probabilistic models …
- Methodology

- 2nd half covers graphical probability models
- Powerful statistical models very useful for learning in complex and/or noisy settings

- Primarily algorithmic & experimental
- Some theory, both mathematical & conceptual (much on statistics)
- "Hands on" experience, interactive lectures/discussions
- Broad survey of many ML subfields
- "symbolic" (rules, decision trees)
- "connectionist" (neural nets)
- Support Vector Machines
- statistical ("Bayes rule")
- genetic algorithms (if time)

- to understand what a learning system should do
- to understand how (and how well) existing systems work

- Programming
- Data structures and algorithms
- CS 535

- Math
- Calculus (partial derivatives)
- Simple probability & statistics

- Why MATLAB?
- Fast prototyping
- Integrated plotting
- Widely used in academia (industry too?)
- Will save you time in the long run

- Why not MATLAB?
- Proprietary software
- Harder to work from home

- Optional Assignment: familiarize yourself with MATLAB, use MATLAB help system

- E256, E280, E285, E384, E270
- All have MATLAB installed under Windows XP

- Bi-weekly programming plus perhaps some “paper & pencil” homework
- "hands on" experience valuable
- HW0 – build a dataset
- HW1 & HW2 – supervised learning algorithms
- HW3 & HW4 – graphical probability models

- Midterm exam (after about 8-10 weeks)
- Final exam
- Find project of your choosing
- during last 4-5 weeks of class

- HW's: 25%
- Project: 20%
- Midterm: 20%
- Final: 30%
- Quality Discussion: 5%

- HW's due @ 4pm
- you have 5 late days to use over the semester
- (Fri 4pm → Mon 4pm is 1 late "day")

- SAVE UP late days!
- extensions only for extreme cases

- Penalty points after late days exhausted
- 10% per day

- Can't be more than one week late

- Machine Learning: computer algorithms that improve automatically through experience [Mitchell].
- Data Mining: Extracting knowledge from large amounts of data. [Han & Kamber] (synonym: knowledge discovery in databases (KDD))

[Diagram: overlapping scope of ML and DM]

- ML only: reinforcement learning, learning theory, evaluating learning systems, using domain knowledge, inductive logic programming, …
- Both ML and DM: supervised learning, decision trees, neural nets, Bayesian networks, k-nearest neighbor, genetic algorithms, unsupervised learning (clustering in DM jargon), …
- DM only: data warehouses, OLAP, query languages, association rules, presentation, …

We’ll try to cover the topics in red

- Learning = improving with experience
- Example: learn to play checkers

- Improve over task T,
- with respect to performance measure P,
- based on experience E

- T: Play Checkers
- P: % of games won
- E: games played against self

- T: find genes in DNA sequences
- ACGTGCATGTGTGAACGTGTGGGTCTGATGATGT…

- P: % of genes found
- E: experimentally verified genes

* Prediction of Complete Gene Structures in Human Genomic DNA, Burge & Karlin, J. Molecular Biology, 1997, 268:78-94

- T: drive vehicle
- P: reach destination
- E: machine observation of human driver

Stanford’s team won the 2005 driverless vehicle race across the Mojave Desert.

“The robot's software system relied predominately on state-of-the-art AI technologies, such as machine learning and probabilistic reasoning.”

[Winning the DARPA Grand Challenge, Thrun et al., Journal of Field Robotics, 2006]

- Data is plentiful
- Retail, video, images, speech, text, DNA, bio-medical measurements, …

- Computational power is available
- Budding Industry
- ML has great applications
- ML still relatively immature

- Think about this
- you will need to create it by the week after next

- Google to find:
- UCI archive (or UCI KDD archive)
- UCI ML archive (UCI machine learning repository)

- Step 1: Choose a Boolean (true/false) concept
- Subjective Judgement
- Books I like/dislike
- Movies I like/dislike
- Web pages I like/dislike

- “Time will tell” concepts
- Stocks to buy
- Medical outcomes

- Sensory interpretation
- Face recognition (See text)
- Handwritten digit recognition
- Sound recognition


- Step 2: Choose a feature space
- We will use fixed-length feature vectors
- Choose N features
- Each feature has V_i possible values
- Each example is represented by a vector of N feature values
(i.e., is a point in the feature space)

e.g.: <red, 50, round>  (color, weight, shape)
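A fixed-length feature vector like <red, 50, round> can be sketched in Python; the feature names and values here are illustrative, not prescribed by HW0:

```python
# A fixed-length feature vector: every example supplies a value for
# the same N features, so each example is a point in the feature
# space defined by (color, weight, shape).
FEATURES = ("color", "weight", "shape")

def make_example(color, weight, shape):
    """Return one example as an N-tuple of feature values."""
    return (color, weight, shape)

apple = make_example("red", 50, "round")
assert len(apple) == len(FEATURES)  # fixed length: N = 3
```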

- Feature Types
- Boolean
- Nominal
- Ordered
- Hierarchical

- Step 3: Collect examples (“I/O” pairs)

- Defines a space
- In HW0 we will use a subset (see next slide)

[Diagram: ISA hierarchy for shape: closed → polygon (square, triangle) and continuous (circle, ellipse)]

- Nominal
- No relationship among possible values
e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz)

- Linear (or Ordered)
- Possible values of the feature are totally ordered
e.g., size ∈ {small, medium, large} ← discrete
weight ∈ [0…500] ← continuous

- Hierarchical
- Possible values are partially ordered in an ISA hierarchy
e.g., for shape →
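The three feature types can be illustrated with a short Python sketch; the value sets are taken from the examples above (color, size, and the shape ISA hierarchy), and the `is_a` helper is an invented illustration:

```python
# Nominal: an unordered value set; only equality tests make sense.
COLOR_VALUES = {"red", "blue", "green"}

# Linear (ordered): values are totally ordered.
SIZE_ORDER = ["small", "medium", "large"]   # discrete, ordered
WEIGHT_RANGE = (0, 500)                     # continuous, ordered

# Hierarchical: values are partially ordered in an ISA hierarchy
# (child -> parent), following the shape hierarchy on the slide.
SHAPE_ISA = {
    "square": "polygon", "triangle": "polygon",
    "circle": "continuous", "ellipse": "continuous",
    "polygon": "closed", "continuous": "closed",
}

def is_a(value, ancestor):
    """True if `value` falls under `ancestor` in the ISA hierarchy."""
    while value in SHAPE_ISA:
        value = SHAPE_ISA[value]
        if value == ancestor:
            return True
    return False

assert "red" in COLOR_VALUES                                  # nominal: membership only
assert SIZE_ORDER.index("small") < SIZE_ORDER.index("large")  # ordered comparison
assert is_a("square", "closed")                               # square ISA polygon ISA closed
```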

[Diagram: ISA hierarchy for one feature, Product]
- 99 product classes (e.g., Pet Foods, Tea)
- 2302 product subclasses (e.g., Dried Cat Food, Canned Cat Food)
- ~30k products (e.g., Friskies Liver, 250g)

- Structure of one feature!
- “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.”
- From the editors’ intro to the special issue (on applications) of the KDD journal*, Vol 15, 2001

* Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers

- Discrete
- tokens (char strings, w/o quote marks and spaces)

- Continuous
- numbers (ints or floats)
- If only a few possible values (e.g., 0 & 1), use discrete

- i.e., merge nominal and discrete-ordered
(or convert discrete-ordered into 1, 2, …)

- We will ignore hierarchy info and only use the leaf values (it is rare anyway)
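The merging rules above might look like this in Python; this is a sketch (HW0’s actual file format may differ), with the size values and leaf shapes reused from earlier slides:

```python
# Convert a discrete-ordered feature into 1, 2, ... so it can be
# treated as a number, and reduce a hierarchical feature to its
# leaf values (internal hierarchy nodes are ignored).
SIZE_ORDER = ["small", "medium", "large"]
SHAPE_LEAVES = {"square", "triangle", "circle", "ellipse"}

def ordered_to_number(value):
    """Map small -> 1, medium -> 2, large -> 3."""
    return SIZE_ORDER.index(value) + 1

def leaf_only(value):
    """Keep only leaf values of the shape hierarchy."""
    if value not in SHAPE_LEAVES:
        raise ValueError(f"{value!r} is not a leaf shape")
    return value

assert ordered_to_number("medium") == 2
assert leaf_only("square") == "square"
```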

- Creating a dataset of fixed-length feature vectors
- HW0 out on-line
- Due next Monday

[Diagram: digitized camera image → Learned Function → steering angle]

[Diagram: medical record (age = 13, sex = M, wgt = 18) → Learned Function → ill vs. healthy]

- Car Steering (Pomerleau)
- Medical Diagnosis (Quinlan)
- DNA Categorization
- TV-pilot rating
- Chemical-plant control
- Backgammon playing
- WWW page scoring
- Credit application scoring


- Choose a dataset
- based on interest/familiarity
- meets basic requirements
- >1000 examples
- category (function) learned should be binary valued
- ~500 examples labeled class A,
other 500 labeled class B

→ Internet Movie Database (IMDb)

- IMDb has a lot of data that are neither discrete nor continuous, nor binary-valued for the target function (category)

[Diagram: IMDb entity-relationship schema]
- Studio: Name, Country, List of movies
- Actor: Name, Year of birth, Gender, Oscar nominations, List of movies
- Director/Producer: Name, Year of birth, List of movies
- Movie: Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season
- Relationships: Made, Directed, Acted in, Produced

- Choose a boolean or binary-valued target function (category)
- Opening weekend box office receipts > $2 million
- Movie is drama? (action, sci-fi,…)
- Movies I like/dislike (e.g. Tivo)
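A Boolean target function such as the first one can be sketched directly; the $2 million threshold is from the slide, while the record format (a dict with an `opening_gross` field) is hypothetical:

```python
# Boolean target function for the movie dataset: did the film gross
# more than $2 million on its opening weekend?  The dict layout is
# an invented illustration, not IMDb's actual schema.
def label(movie):
    return movie["opening_gross"] > 2_000_000

hit  = {"title": "A", "opening_gross": 5_400_000}
flop = {"title": "B", "opening_gross": 300_000}
assert label(hit) is True
assert label(flop) is False
```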

- How to transform the available attributes into example attributes (select predictive features)

- Movie
- Average age of actors
- Number of producers
- Percent female actors

- Studio
- Number of movies made
- Average movie gross
- Percent movies released in US

- Director/Producer
- Years of experience
- Most prevalent genre
- Number of award winning movies
- Average movie gross

- Actor
- Gender
- Has previous Oscar award or nominations
- Most prevalent genre

David Jensen’s group at UMass used Naïve Bayes (NB) to predict the following based on attributes they selected and a novel way of sampling from the data:

- Opening weekend box office receipts > $2 million
- 25 attributes
- Accuracy = 83.3%
- Default accuracy = 56%

- Movie is drama?
- 12 attributes
- Accuracy = 71.9%
- Default accuracy = 51%

- http://kdl.cs.umass.edu/proximity/about.html
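“Default accuracy” above is the majority-class baseline: the accuracy of always predicting the most frequent label. A minimal sketch (the 56/44 split is illustrative of the first task’s 56% default):

```python
from collections import Counter

def default_accuracy(labels):
    """Accuracy of always guessing the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# If 56% of movies miss the $2M mark, always predicting "under"
# is right 56% of the time -- the bar a learned model must beat.
labels = ["under"] * 56 + ["over"] * 44
assert default_accuracy(labels) == 0.56
```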

“Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time.”

- Herbert Simon

“Learning is making useful changes in our minds.”

- Marvin Minsky

Not in Mitchell’s textbook (will spend 0-2 lectures on this – but also in CS776)

- Inducing Functions from I/O Pairs
- Decision trees (e.g., Quinlan’s C4.5 [1993])
- Connectionism / neural networks (e.g., backprop)
- Nearest-neighbor methods
- Genetic algorithms
- SVMs

- Learning without a Teacher
- Conceptual clustering
- Self-organizing systems
- Discovery systems

Will be covered briefly

- Improving a Multi-Step Problem Solver
- Explanation-based learning
- Reinforcement learning

- Using Preexisting Domain Knowledge Inductively
- Analogical learning
- Case-based reasoning
- Inductive/explanatory hybrids