
Machine Learning & Data Mining Part 1: The Basics


Presentation Transcript


  1. Machine Learning & Data Mining Part 1: The Basics Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang) Carnegie Mellon University jgc@cs.cmu.edu © 2008, Jaime G. Carbonell

  2. Some Definitions (KBS vs ML) • Knowledge-Based Systems • Rules, procedures, semantic nets, Horn clauses • Inference: matching, inheritance, resolution • Acquisition: manually from human experts • Machine Learning • Data: tables, relations, attribute lists, … • Inference: rules, trees, decision functions, … • Acquisition: automated from data • Data Mining • Machine learning applied to large real problems • May be augmented with KBS © 2008, Jaime G. Carbonell

  3. Ingredients for Machine Learning • “Historical” data (e.g. DB tables) • E.g. products (features, marketing, support, …) • E.g. competition (products, pricing, customers) • E.g. customers (demographics, purchases, …) • Objective function (to be predicted or optimized) • E.g. maximize revenue per customer • E.g. minimize manufacturing defects • Scalable machine learning method(s) • E.g. decision-tree induction, logistic regression • E.g. “active” learning, clustering © 2008, Jaime G. Carbonell

  4. Sample ML/DM Applications I • Credit Scoring • Training: past applicant profiles, how much credit given, payback or default • Input: applicant profile (income, debts, …) • Objective: credit-score + max amount • Fraud Detection (e.g. credit-card transactions) • Training: past known legitimate & fraudulent transactions • Input: proposed transaction (loc, cust, $$, …) • Objective: approve/block decision © 2008, Jaime G. Carbonell

  5. Sample ML/DM Applications II • Demographic Segmentation • Training: past customer profiles (age, gender, education, income,…) + product preferences • Input: new product description (features) • Objective: predict market segment affinity • Marketing/Advertisement Effectiveness • Training: past advertisement campaigns, demographic targets, product categories • Input: proposed advertisement campaign • Objective: project effectiveness (sales increase modulated by marketing cost) © 2008, Jaime G. Carbonell

  6. Sample ML/DM Applications III • Product (or Part) Reliability • Training: past products/parts + specs at manufacturing + customer usage + maint rec • Input: new part + expected usage • Objective: mean-time-to-failure (replacement) • Manufacturing Tolerances • Training: past product/part manufacturing process, tolerances, inspections, … • Input: new part + expected usage • Objective: optimal manufacturing precision (minimize costs of failure + manufacture) © 2008, Jaime G. Carbonell

  7. Sample ML/DM Applications IV • Mechanical Diagnosis • Training: past observed symptoms at (or prior to) breakdown + underlying cause • Input: current symptoms • Objective: predict cause of failure • Mechanical Repair • Training: cause of failure + product usage + repair (or PM) effectiveness • Input: new failure cause + product usage • Objective: recommended repair (or preventive maintenance operation) © 2008, Jaime G. Carbonell

  8. Sample ML/DM Applications V • Billeting (job assignments) • Training: employee profiles, position profiles, employee performance in assigned position • Input: new employee or new position profile • Objective: predict performance in position • Text Mining & Routing (e.g. customer centers) • Training: electronic problem reports, customer requests + who should handle them • Input: new incoming texts • Objective: Assign category + route or reply © 2008, Jaime G. Carbonell

  9. Preparing Historical Data • Extract a DB table with all the needed information • Select, join, project, aggregate, … • Filter out rows with significant missing data • Determine predictor attributes (columns) • Ask domain expert for relevant attributes, or • Start with all attributes and automatically sub-select most predictive ones (feature selection) • Determine to-be-predicted attribute (column) • Objective of the DM (number, decision, …) © 2008, Jaime G. Carbonell
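
A minimal sketch of these preparation steps in Python with pandas; the file name, column names, and missing-data threshold below are hypothetical placeholders, not from the lecture.

```python
import pandas as pd

# Hypothetical extract: in practice this table comes from SQL
# select / join / project / aggregate over the operational DB.
df = pd.read_csv("customers.csv")

# Filter out rows with significant missing data (here: more than 2 empty cells).
df = df[df.isna().sum(axis=1) <= 2]

# Predictor attributes (from a domain expert or automatic feature selection)
# and the to-be-predicted objective attribute.
predictors = ["income_k", "job_now", "tot_delinq_accts", "owns_home", "credit_years"]
objective = "good_customer"

X = df[predictors]
y = df[objective]
```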

  10. Sample DB Table

      [predictor attributes]                                             [objective]
      Acct.   Income    Job    Tot Num        Max Num         Owns    Credit   Good
      numb.   in K/yr   Now?   Delinq accts   Delinq cycles   home?   years    cust.?
      -------------------------------------------------------------------------------
      1001    85        Y      1              1               N       2        Y
      1002    60        Y      3              2               Y       5        N
      1003    ?         N      0              0               N       2        N
      1004    95        Y      1              2               N       9        Y
      1005    110       Y      1              6               Y       3        Y
      1006    29        Y      2              1               Y       1        N
      1007    88        Y      6              4               Y       8        N
      1008    80        Y      0              0               Y       0        Y
      1009    31        Y      1              1               N       1        Y
      1011    ?         Y      ?              0               ?       7        Y
      1012    75        ?      2              4               N       2        N
      1013    20        N      1              1               N       3        N
      1014    65        Y      1              3               Y       1        Y
      1015    65        N      1              2               N       8        Y
      1016    20        N      0              0               N       0        N
      1017    75        Y      1              3               N       2        N
      1018    40        N      0              0               Y       1        Y

      © 2008, Jaime G. Carbonell

  11. Supervised Learning on DB Table • Given: DB table • With identified predictor attributes x1, x2, … • And objective attribute y • Find: a prediction function y = f(x1, x2, …) • Subject to: error minimization on data table M • Least-squares error, or L1-norm, or L∞-norm, … © 2008, Jaime G. Carbonell
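
As a concrete (made-up) instance of the error-minimization view, the sketch below fits a linear prediction function y ≈ w·x + b by least squares with NumPy; the tiny data table is invented for illustration.

```python
import numpy as np

# Toy data table M: rows = examples, columns = predictor attributes x1, x2.
X = np.array([[60.0, 5.0], [95.0, 9.0], [29.0, 1.0], [80.0, 0.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])              # objective attribute

# Add a constant column so the fitted function has an intercept term b.
Xb = np.hstack([X, np.ones((len(X), 1))])

# Least-squares error minimization: w minimizes ||Xb @ w - y||^2.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

y_hat = Xb @ w                                  # predictions on the training rows
```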

  12. Popular Predictor Functions • Linear Discriminators (next slides) • k-Nearest-Neighbors (lecture #2) • Decision Trees (lecture #5) • Linear & Logistic Regression (lecture #4) • Probabilistic Methods (lecture #3) • Neural Networks • 2-layer ≈ logistic regression • Multi-layer → difficult to scale up • Classification Rule Induction (in a few slides) © 2008, Jaime G. Carbonell

  13. Linear Discriminator Functions [Figure: a two-class problem plotted in the (x1, x2) plane] © 2008, Jaime G. Carbonell

  14. Linear Discriminator Functions [Figure: the same two-class problem in the (x1, x2) plane] © 2008, Jaime G. Carbonell

  15. Linear Discriminator Functions [Figure: the same two-class problem in the (x1, x2) plane] © 2008, Jaime G. Carbonell

  16. Linear Discriminator Functions [Figure: the same two-class problem in the (x1, x2) plane, with a new point to be classified] © 2008, Jaime G. Carbonell

  17. Issues with Linear Discriminators • What is the “best” placement of the discriminator? • Maximize the margin • In general → Support Vector Machines • What if there are k classes (k > 2)? • Must learn k different discriminators • Each discriminates class i vs all other classes (one-vs-rest) • What if the classes are not linearly separable? • Minimal-error (L1 or L2) placement (regression) • Give up on linear discriminators (→ other forms of predictor function f) © 2008, Jaime G. Carbonell
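
For illustration, here is one simple (non-maximum-margin) way to place a two-class linear discriminator: a perceptron-style update loop. This is a sketch on invented data, not the SVM formulation referred to above.

```python
import numpy as np

def train_linear_discriminator(X, y, epochs=100, lr=0.1):
    """Learn w, b so that sign(w.x + b) predicts labels y in {-1, +1} (perceptron rule)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified point -> nudge the boundary
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Two-class toy problem in the (x1, x2) plane.
X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, +1, +1])
w, b = train_linear_discriminator(X, y)
print(np.sign(X @ w + b))                    # reproduces the labels on this separable data
```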

  18. Maximizing the Margin [Figure: the two-class problem in the (x1, x2) plane; the separator is placed to maximize the margin between the two classes] © 2008, Jaime G. Carbonell

  19. Nearly-Separable Classes [Figure: a two-class problem in the (x1, x2) plane in which no line separates the classes perfectly] © 2008, Jaime G. Carbonell

  20. Nearly-Separable Classes [Figure: the same nearly-separable two-class problem in the (x1, x2) plane] © 2008, Jaime G. Carbonell

  21. Minimizing Training Error • Optimal placing of maximum-margin separator • Quadratic programming (Support Vector Machines) • Slack variables to accommodate training errors • Minimizing error metrics • Number of errors • Magnitude of error • Squared error • Chebyshev norm © 2008, Jaime G. Carbonell
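
A rough numerical sketch of the margin-vs-slack trade-off: subgradient descent on the regularized hinge loss, which is the unconstrained form of the soft-margin linear SVM objective. The toy data, regularization constant, and learning rate are all made up; a real system would use a QP or library solver.

```python
import numpy as np

def train_soft_margin(X, y, lam=0.01, lr=0.05, epochs=1000):
    """Minimize  lam*||w||^2 + mean(max(0, 1 - y*(w.x + b)))  by subgradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = (margins < 1).astype(float)          # 1 where the slack/hinge term is active
        grad_w = 2 * lam * w - (viol * y) @ X / len(X)
        grad_b = -(viol * y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0], [4.0, 4.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0])          # last point sits near the boundary
w, b = train_soft_margin(X, y)
print(np.sign(X @ w + b), w, b)                     # inspect learned boundary and predictions
```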

  22. Symbolic Rule Induction General idea • Labeled instances are DB tuples • Rules are generalized tuples • Generalization occurs at terms in tuples • Generalize on new E+ not correctly predicted • Specialize on new E- not correctly predicted • Ignore predicted E+ or E- (error-driven learning) © 2008, Jaime G. Carbonell

  23. Symbolic Rule Induction (2) Example term generalizations • Constant => disjunction, e.g. if a small portion of the value set has been seen • Constant => least-common-generalizer class, e.g. if a large portion of the value set has been seen • Number (or ordinal) => range, e.g. if there is dense sequential sampling © 2008, Jaime G. Carbonell

  24. Symbolic Rule Induction Example (1)

      Age   Gender   Temp   b-cult   c-cult   loc   Skin     disease
      ----------------------------------------------------------------
      65    M        101    +        .23      USA   normal   strep
      25    M        102    +        .00      CAN   normal   strep
      65    M        102    -        .78      BRA   rash     dengue
      36    F        99     -        .19      USA   normal   *none*
      11    F        103    +        .23      USA   flush    strep
      88    F        98     +        .21      CAN   normal   *none*
      39    F        100    +        .10      BRA   normal   strep
      12    M        101    +        .00      BRA   normal   strep
      15    F        101    +        .66      BRA   flush    dengue
      20    F        98     +        .00      USA   rash     *none*
      81    M        98     -        .99      BRA   rash     ec-12
      87    F        100    -        .89      USA   rash     ec-12
      12    F        102    +        ??       CAN   normal   strep
      14    F        101    +        .33      USA   normal
      67    M        102    +        .77      BRA   rash

  25. Symbolic Rule Induction Example (2) Candidate Rules:

      IF:   age = [12,65], gender = *any*, temp = [100,103], b-cult = +,
            c-cult = [.00,.23], loc = *any*, skin = (normal,flush)
      THEN: strep

      IF:   age = (15,65), gender = *any*, temp = [101,102], b-cult = *any*,
            c-cult = [.66,.78], loc = BRA, skin = rash
      THEN: dengue

      Disclaimer: These are not real medical records or rules
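
To make the generalized-tuple representation concrete, here is a small sketch of how such a rule could be stored and matched against a record in Python; the attribute names mirror the fictitious example above, and the test patient is invented.

```python
# A rule maps each attribute to a test: a (lo, hi) range, a set of allowed
# nominal values, or "*any*" (no constraint).
STREP_RULE = {
    "age": (12, 65), "gender": "*any*", "temp": (100, 103),
    "b-cult": {"+"}, "c-cult": (0.00, 0.23), "loc": "*any*",
    "skin": {"normal", "flush"},
}

def matches(rule, record):
    """True iff every attribute test in the rule accepts the record's value."""
    for attr, test in rule.items():
        value = record[attr]
        if test == "*any*":
            continue                                   # unconstrained term
        if isinstance(test, tuple):                    # numeric range [lo, hi]
            if not (test[0] <= value <= test[1]):
                return False
        elif value not in test:                        # disjunction of nominal values
            return False
    return True

patient = {"age": 12, "gender": "F", "temp": 102, "b-cult": "+",
           "c-cult": 0.10, "loc": "CAN", "skin": "normal"}
print(matches(STREP_RULE, patient))                    # True -> rule predicts strep
```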

  26. Types of Data Mining • “Supervised” Methods (this DM course) • Training data has both predictor attributes & objective (to be predicted) attributes • Predict discrete classes → classification • Predict continuous values → regression • Duality: classification ↔ regression • “Unsupervised” Methods • Training data without objective attributes • Goal: find novel & interesting patterns • Cutting-edge research, fewer success stories • Semi-supervised methods: market-basket, … © 2008, Jaime G. Carbonell

  27. Machine Learning Application Process in a Nutshell • Choose problem where • Prediction is valuable and non-trivial • Sufficient historical data is available • The objective is measurable (incl. in past data) • Prepare the data • Tabular form, clean, divide training & test sets • Select a Machine Learning algorithm • Human-readable decision fn → rules, trees, … • Robust with noisy data → kNN, logistic reg, … © 2008, Jaime G. Carbonell

  28. Machine Learning Application Process in a Nutshell (2) • Train ML Algorithm on Training Data Set • Each ML method has a different training process • Training uses both predictor & objective att’s • Run Trained ML Algorithm on Test Data Set • Testing uses only predictor att’s & outputs predictions on objective attributes • Compare predictions vs actual objective att’s (see lecture 2 for evaluation metrics) • If Accuracy ≥ threshold, done. • Else, try a different ML algorithm, different parameter settings, get more training data, … © 2008, Jaime G. Carbonell

  29. Sample DB Table (same)

      [predictor attributes]                                             [objective]
      Acct.   Income    Job    Tot Num        Max Num         Owns    Credit   Good
      numb.   in K/yr   Now?   Delinq accts   Delinq cycles   home?   years    cust.?
      -------------------------------------------------------------------------------
      1001    85        Y      1              1               N       2        Y
      1002    60        Y      3              2               Y       5        N
      1003    ?         N      0              0               N       2        N
      1004    95        Y      1              2               N       9        Y
      1005    100       Y      1              6               Y       3        Y
      1006    29        Y      2              1               Y       1        N
      1007    88        Y      6              4               Y       8        N
      1008    80        Y      0              0               Y       0        Y
      1009    31        Y      1              1               N       1        Y
      1011    ?         Y      ?              0               ?       7        Y
      1012    75        ?      2              4               N       2        N
      1013    20        N      1              1               N       3        N
      1014    65        Y      1              3               Y       1        Y
      1015    65        N      1              2               N       8        Y
      1016    20        N      0              0               N       0        N
      1017    75        Y      1              3               N       2        N
      1018    40        N      0              0               Y       10       Y

      © 2008, Jaime G. Carbonell

  30. Feature Vector Representation • Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are: R2 = [60 Y 3 2 Y 5] R4 = [95 Y 1 2 N 9] Converting to numbers (Y = 1, N = 0), we get: R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] © 2008, Jaime G. Carbonell
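
A one-function sketch of this conversion (Y → 1, N → 0), reproducing the two rows above; any other value is passed through unchanged.

```python
def to_vector(row):
    """Map a row of predictor attributes to a numeric feature vector (Y=1, N=0)."""
    yn = {"Y": 1, "N": 0}
    return [yn.get(value, value) for value in row]

R2 = to_vector([60, "Y", 3, 2, "Y", 5])   # -> [60, 1, 3, 2, 1, 5]
R4 = to_vector([95, "Y", 1, 2, "N", 9])   # -> [95, 1, 1, 2, 0, 9]
```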

  31. Vector Similarity • Suppose we have a new credit applicant R-new = [65 1 1 2 0 10] To which of R2 or R4 is she closer? R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] • What should we use as a SIMILARITY METRIC? • Should we first NORMALIZE the vectors? • If not, the largest component will dominate © 2008, Jaime G. Carbonell

  32. Normalizing Vector Attributes • Linear Normalization (often sufficient) • Find max & min values for each attribute • Normalize each attribute value x by: x′ = (x − min) / (max − min) • Apply to all vectors (historical + new) by normalizing each attribute in turn (worked example on the next slide) © 2008, Jaime G. Carbonell

  33. Normalizing Full Vectors • Normalizing the new applicant vector R-new = [65 1 1 2 0 10] → [.56 1 .17 .33 0 1] And normalizing the two past customer vectors R2 = [60 1 3 2 1 5] → [.50 1 .50 .33 1 .50] R4 = [95 1 1 2 0 9] → [.94 1 .17 .33 0 .90] • How about if some attributes are known to be more important, say salary (A1) & delinquencies (A3)? • Weight accordingly, e.g. ×2 for each • E.g., R-new-weighted: [1.12 1 .34 .33 0 1] © 2008, Jaime G. Carbonell
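
A sketch of the linear (min–max) normalization and the ×2 weighting just described. The per-attribute min/max values are assumed to be taken from the sample DB table (income 20–100, delinquent accounts 0–6, delinquency cycles 0–6, credit years 0–10); under those assumptions the code reproduces the normalized vectors on the slide.

```python
import numpy as np

# Assumed per-attribute minima and maxima from the sample table:
# income, job-now, delinq accts, delinq cycles, owns-home, credit years
mins = np.array([20.0, 0.0, 0.0, 0.0, 0.0, 0.0])
maxs = np.array([100.0, 1.0, 6.0, 6.0, 1.0, 10.0])

def normalize(v):
    """Linear (min-max) normalization of one feature vector to the [0, 1] range."""
    return (np.asarray(v, dtype=float) - mins) / (maxs - mins)

r_new = normalize([65, 1, 1, 2, 0, 10])        # ~ [.56 1 .17 .33 0 1]
r2    = normalize([60, 1, 3, 2, 1, 5])         # ~ [.50 1 .50 .33 1 .50]

# Optional importance weights: x2 on salary (A1) and delinquent accounts (A3).
weights = np.array([2.0, 1.0, 2.0, 1.0, 1.0, 1.0])
r_new_weighted = weights * r_new               # ~ [1.12 1 .34 .33 0 1]
```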

  34. Similarity Functions (inverse dist) • Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity? • Inverse sum of differences (L1): sim(A, B) = 1 / Σi |ai − bi| • Inverse Euclidean distance (L2): sim(A, B) = 1 / sqrt(Σi (ai − bi)²) © 2008, Jaime G. Carbonell

  35. Similarity Functions (direct) • Dot-Product Similarity: sim(A, B) = A · B = Σi ai bi • Cosine Similarity (dot product of unit vectors): sim(A, B) = (A · B) / (‖A‖ ‖B‖) © 2008, Jaime G. Carbonell
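
The four similarity measures from the last two slides as small Python functions. The tiny epsilon added to the inverse-distance versions (to avoid division by zero for identical vectors) is my addition, not part of the slides.

```python
import numpy as np

EPS = 1e-9   # guards against division by zero when two vectors are identical

def sim_l1(a, b):
    """Inverse sum of absolute differences (L1)."""
    return 1.0 / (np.abs(a - b).sum() + EPS)

def sim_l2(a, b):
    """Inverse Euclidean distance (L2)."""
    return 1.0 / (np.linalg.norm(a - b) + EPS)

def sim_dot(a, b):
    """Dot-product similarity."""
    return float(a @ b)

def sim_cosine(a, b):
    """Cosine similarity: dot product of the unit-length versions of a and b."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Applied to the normalized vectors of the previous slide, any of these functions can rank whether R2 or R4 is closer to R-new.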

  36. Alternative: Similarity Matrix for Non-Numeric Attributes

               tiny   little  small   medium  large   huge
      tiny     1.0    0.8     0.7     0.5     0.2     0.0
      little          1.0     0.9     0.7     0.3     0.1
      small                   1.0     0.7     0.3     0.2
      medium                          1.0     0.5     0.3
      large                                   1.0     0.8
      huge                                            1.0

      • Diagonal must be 1.0 • Monotonicity property must hold • Triangle inequality must hold • Transitive property must hold • Additivity/compositionality need not hold © 2008, Jaime G. Carbonell
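
One way to implement such a matrix is a symmetric lookup table over nominal values; the sketch below encodes the size matrix above (only the upper triangle is stored, and lookups are symmetrized).

```python
# Upper triangle of the size-similarity matrix; the diagonal is implicitly 1.0.
SIZE_SIM = {
    ("tiny", "little"): 0.8, ("tiny", "small"): 0.7, ("tiny", "medium"): 0.5,
    ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "medium"): 0.7, ("small", "large"): 0.3, ("small", "huge"): 0.2,
    ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "huge"): 0.8,
}

def nominal_sim(a, b):
    """Symmetric similarity lookup for non-numeric attribute values."""
    if a == b:
        return 1.0
    return SIZE_SIM.get((a, b), SIZE_SIM.get((b, a)))

print(nominal_sim("small", "huge"))    # 0.2
```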

  37. k-Nearest Neighbors Method • No explicit “training” phase • When a new case arrives (vector of predictor att’s) • Find the nearest k neighbors (max similarity) among previous cases (row vectors in DB table) • k neighbors vote for the objective attribute • Unweighted majority vote, or • Similarity-weighted vote • Works for both discrete and continuous objective attributes © 2008, Jaime G. Carbonell

  38. Similarity-Weighted Voting in kNN • If the Objective Attribute is Discrete: predict the class with the largest similarity-weighted vote, ŷ = argmax_c Σ_{i ∈ kNN(x)} sim(x, xi) · 1[yi = c] • If the Objective Attribute is Continuous: predict the similarity-weighted average, ŷ = Σ_{i ∈ kNN(x)} sim(x, xi) · yi / Σ_{i ∈ kNN(x)} sim(x, xi) © 2008, Jaime G. Carbonell
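
A compact sketch of the whole kNN prediction step with similarity-weighted voting, covering both the discrete and the continuous case; cosine similarity is just one possible choice of metric, as discussed on the earlier slides.

```python
import numpy as np
from collections import defaultdict

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(x_new, X, y, k=3, sim=cosine, discrete=True):
    """Similarity-weighted kNN vote of the k most similar historical rows."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    x_new = np.asarray(x_new, dtype=float)
    sims = np.array([sim(x_new, xi) for xi in X])
    nearest = np.argsort(-sims)[:k]                  # indices of the k most similar rows
    if discrete:                                     # weighted vote over class labels
        votes = defaultdict(float)
        for i in nearest:
            votes[y[i]] += sims[i]
        return max(votes, key=votes.get)
    # continuous objective: similarity-weighted average of neighbor values
    return float(sims[nearest] @ y[nearest].astype(float) / sims[nearest].sum())
```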

  39. Applying kNN to Real Problems 1 • How does one choose the vector representation? • Easy: Vector = predictor attributes • What if attributes are not numerical? • Convert: (e.g. High=2, Med=1, Low=0), • Or, use similarity function over nominal values • E.g. equality or edit-distance on strings • How does one choose a distance function? • Hard: No magic recipe; try simpler ones first • This implies a need for systematic testing (discussed in coming slides) © 2008, Jaime G. Carbonell

  40. Applying kNN to Real Problems 2 • How does one determine whether data should be normalized? • Normalization is usually a good idea • One can try kNN both ways to make sure • How does one determine “k” in kNN? • k is often determined empirically • Good start is: © 2008, Jaime G. Carbonell

  41. Evaluating Machine Learning • Accuracy = Correct-Predictions/Total-Predictions • Simplest & most popular metric • But misleading on very-rare event prediction • Precision, recall & F1 • Borrowed from Information Retrieval • Applicable to very-rare event prediction • Correlation (between predicted & actual values) for continuous objective attributes • R², kappa coefficient, … © 2008, Jaime G. Carbonell

  42. Sample Confusion Matrix [Figure: confusion matrix of predicted diagnoses vs true diagnoses] © 2008, Jaime G. Carbonell

  43. Measuring Accuracy • Accuracy = correct/total • Error = incorrect/total • Hence: accuracy = 1 – error • For the diagnosis example: • A = 340/386 = 0.88, E = 1 – A = 0.12 © 2008, Jaime G. Carbonell

  44. What About Rare Events? [Figure: confusion matrix of predicted vs true diagnoses over four diagnosis classes, one of which is rare] © 2008, Jaime G. Carbonell

  45. Rare Event Evaluation • Accuracy for example = 0.88 • …but NO correct predictions for “shorted power supply”, 1 of 4 diagnoses • Alternative: Per-diagnosis (per-class) accuracy: • A(“shorted PS”) = 0/22 = 0 • A(“not plugged in”) = 160/184 = 0.87 © 2008, Jaime G. Carbonell

  46. ROC Curves (ROC=Receiver Operating Characteristic) © 2008, Jaime G. Carbonell

  47. ROC Curves (ROC=Receiver Operating Characteristic) Sensitivity = TP/(TP+FN) Specificity = TN/(TN+FP) © 2008, Jaime G. Carbonell
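
The evaluation quantities from the last few slides expressed in code for a single two-by-two confusion matrix; the counts in the example call are made up and are not the lecture's diagnosis data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, error, sensitivity and specificity from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "sensitivity": tp / (tp + fn),   # TP / (TP + FN)
        "specificity": tn / (tn + fp),   # TN / (TN + FP)
    }

# Made-up counts for a rare positive class: accuracy looks good,
# but sensitivity on the rare class is poor.
print(binary_metrics(tp=2, fp=4, fn=20, tn=360))
```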

  48. If Plenty of Data, Evaluate with a Holdout Set [Diagram: the data is split into a training portion and a held-out evaluation portion; train on one, measure error on the other] • Often also used for parameter optimization © 2008, Jaime G. Carbonell
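
A minimal holdout-split sketch; the 30% test fraction and the fixed random seed are arbitrary choices for illustration.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.3, seed=0):
    """Randomly split (X, y) into a training set and a held-out test set."""
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

# Train on the first part, measure error on the held-out part:
# X_tr, y_tr, X_te, y_te = holdout_split(X, y)
```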

  49. Finite Cross-Validation Set • True error (true risk): error_D(h) = Pr_{x ∈ D} [ f(x) ≠ h(x) ], where D = all data, h = learned predictor, f = true target • Test error (empirical risk): error_S(h) = (1/m) Σ_{x ∈ S} δ( f(x) ≠ h(x) ), where S = test data and m = # test samples © 2008, Jaime G. Carbonell

  50. Confidence Intervals If • S contains m examples, drawn independently • m ≥ 30 Then • With approximately 95% probability, the true error error_D lies in the interval error_S ± 1.96 · sqrt( error_S (1 − error_S) / m ) © 2008, Jaime G. Carbonell
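
Assuming the interval on the slide is the usual normal-approximation bound error_S ± 1.96·sqrt(error_S(1 − error_S)/m), a quick sketch; the example numbers are invented.

```python
import math

def error_confidence_interval(error_s, m, z=1.96):
    """Approximate 95% interval for the true error, given test error error_s
    measured on m independently drawn test examples (valid for m >= 30)."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / m)
    return error_s - half_width, error_s + half_width

# e.g. a 12% error rate measured on a 386-example test set:
print(error_confidence_interval(0.12, 386))   # roughly (0.088, 0.152)
```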
