
KDD Cup 2009

Fast Scoring on a Large Database. Presentation of the results at the KDD Cup Workshop, June 28, 2009, by the organizing team (Orange Labs R&D).


Presentation Transcript


1. KDD Cup 2009 – Fast Scoring on a Large Database. Presentation of the Results at the KDD Cup Workshop, June 28, 2009. The Organizing Team.

2. KDD Cup 2009 Organizing Team
• Project team at Orange Labs R&D: Vincent Lemaire, Marc Boullé, Fabrice Clérot, Raphaël Féraud, Aurélie Le Cam, Pascal Gouzien
• Beta testing and proceedings editor: Gideon Dror
• Web site design: Olivier Guyon (MisterP.net, France)
• Coordination (KDD cup co-chairs): Isabelle Guyon, David Vogel

3. Thanks to our sponsors: Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS.

  4. Record KDD Cup Participation

5. Participation Statistics • 1299 registered teams • 7865 entries • 46 countries

6. A worldwide operator • One of the main telecommunication operators in the world • Providing services to more than 170 million customers over five continents • Including 120 million under the Orange brand

7. KDD Cup 2009 organized by Orange: Customer Relationship Management (CRM)
• Three marketing tasks: predict the propensity of customers
  • to switch provider: Churn
  • to buy new products or services: Appetency
  • to buy upgrades or new options proposed to them: Up-selling
• Objective: improve the return on investment (ROI) of marketing campaigns
  • Increase the efficiency of a campaign for a given campaign cost
  • Decrease the campaign cost for a given marketing objective
  • Better prediction leads to better ROI

8. Data, constraints and requirements
• Train and deploy requirements: about one hundred models per month; fast data preparation, modeling, and deployment
• Model requirements: robust, accurate, understandable
• Business requirement: return on investment for the whole process
• Input data: relational databases; numerical or categorical variables; noisy; missing values; heavily unbalanced distributions (see the loading sketch below)
• Train data: hundreds of thousands of instances, tens of thousands of variables
• Deployment: tens of millions of instances
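
These constraints are concrete enough to check in code. A minimal loading sketch in Python/pandas, assuming the challenge's tab-separated small-track file names (orange_small_train.data, orange_small_train_churn.labels with -1/+1 labels); the exact file layout is an assumption here:

```python
# Minimal sketch: load one task and verify the stated data properties
# (mixed types, missing values, heavily unbalanced target).
# File names and separator are assumptions based on the small-track release.
import pandas as pd

X = pd.read_csv("orange_small_train.data", sep="\t")
y = pd.read_csv("orange_small_train_churn.labels", header=None)[0]  # -1 / +1

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

print(f"{X.shape[0]} instances, {X.shape[1]} variables "
      f"({len(num_cols)} numerical, {len(cat_cols)} categorical)")
print(f"overall missing rate: {X.isna().mean().mean():.1%}")
print(f"positive rate: {(y == 1).mean():.1%}")  # heavily unbalanced
```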

9. In-house system: from raw data to scoring models
• Data warehouse: relational database
• Data mart: star schema
• Feature construction: PAC technology, which generates tens of thousands of variables
• Data preparation and modeling: Khiops technology
[Diagram: data feeding → PAC → Khiops → scoring model]

10. Design of the challenge
• Orange business objective: benchmark the in-house system against state-of-the-art techniques
• Data
  • Data store: not an option
  • Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
  • Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
• Tasks: three representative marketing tasks
• Requirements: fast data preparation and modeling (fully automatic), accurate, fast deployment, robust, understandable

11. Data sets extraction and preparation
• Input data: 10 relational tables, a few hundred fields, one million customers
• Instance selection: resampling given the three marketing tasks; keep 100 000 instances, with less unbalanced target distributions
• Variable construction: using PAC technology; 20 000 constructed variables to get a tabular representation; keep 15 000 variables (discard constant variables); small track: subset of 230 variables related to classical domain knowledge
• Anonymization: discard variable names and identifiers; randomize the order of variables; rescale each numerical variable by a random factor; recode each categorical variable using random category names (a sketch of these steps follows)
• Data samples: 50 000 train and test instances sampled randomly; 5000 validation instances sampled randomly from the test set
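
The anonymization recipe above is mechanical enough to sketch. The following is an illustration under stated assumptions (uniform rescaling range, sequential category codes), not the organizers' actual script:

```python
# Illustrative sketch of the anonymization steps: randomize variable order,
# discard names, rescale numerical variables by a random factor, recode
# categories with random names. Ranges and naming scheme are assumptions.
import numpy as np
import pandas as pd

def anonymize(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    df = df[rng.permutation(df.columns)].copy()               # random order
    df.columns = [f"Var{i + 1}" for i in range(df.shape[1])]  # discard names
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col] * rng.uniform(0.5, 2.0)         # random factor
        else:
            cats = df[col].dropna().unique()
            codes = {c: f"c{j}" for j, c in enumerate(rng.permutation(cats))}
            df[col] = df[col].map(codes)                      # random names
    return df
```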

12. Scientific and technical challenge
• Scientific objective
  • Fast data preparation and modeling: within five days
  • Large scale: 50 000 train and test instances, 15 000 variables
  • Heterogeneous data: numerical with missing values; categorical with hundreds of values; heavily unbalanced distributions
• KDD social meeting objective: attract as many participants as possible
  • Additional small track and slow track
  • Online feedback on the validation dataset
  • Toy problem (only one informative input variable)
  • Leverage challenge protocol overhead: one month to explore descriptive data and test the submission protocol
  • Attractive conditions: no intellectual property conditions, money prizes

13. Business impact of the challenge
• Bring Orange datasets to the data mining community
  • Benefit for the community: access to challenging data
  • Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs; evaluate the Orange in-house system
• High number of participants and high quality of results
• Orange in-house results:
  • Improved by a significant margin when leveraging all business requirements
  • Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
  • Need to study the best challenge methods to get more insights

14. KDD Cup 2009: Result Analysis
[Figure: test AUC over the period considered, comparing the best result, the in-house system (downloadable: www.khiops.com), and the baseline (Naïve Bayes)]
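
The baseline in this figure is a Naïve Bayes classifier scored by test AUC. A self-contained sketch of such a baseline on synthetic stand-in data; Gaussian NB, mean imputation and the synthetic generator are assumptions, since the slides do not specify the baseline's exact configuration:

```python
# Hedged sketch of a Naive Bayes baseline scored by test AUC.
# Synthetic unbalanced data stands in for the challenge data.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=20_000, n_features=100,
                           weights=[0.93], random_state=0)  # ~7% positives
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

baseline = make_pipeline(SimpleImputer(strategy="mean"), GaussianNB())
baseline.fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1]))
```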

15. Overall – Test AUC – Fast
[Figure: best results on each dataset and all submissions over time; good results were reached very quickly]

16. Overall – Test AUC – Fast
[Same figure as above, annotated for the in-house (Orange) system: no parameters; runs on 1 standard laptop (mono-processor); treats the three tasks as 3 different problems]

17. Overall – Test AUC – Fast
[Figure: good results very fast; small improvement after the first day (83.85 → 84.93)]

18. Overall – Test AUC – Slow
[Figure: very small improvement after the 5th day (84.93 → 85.2); improvement due to unscrambling?]

19. Overall – Test AUC – Submissions. Among submissions with AUC > 0.5:
• 23.24% scored below the baseline
• 15.25% scored above the in-house system
• 84.75% scored below the in-house system

20. Overall – Test AUC: 'correlation' between test and validation AUC
[Scatter plot of test AUC vs. validation AUC]

21. Overall – Test AUC: 'correlation' between test and train AUC
[Scatter plot of test AUC vs. train AUC, with annotated regions: random values submitted; boosting method or train target submitted; overfitting]

22. Overall – Test AUC
[Four panels: test AUC after 12 hours, 24 hours, 5 days, and 36 days]

23. Overall – Test AUC
• Difference between the best result at the end of the first day and the best result at the end of the 36 days: Δ = 1.35%
• Possible explanations: time to adjust model parameters? time to train ensemble methods? time to find more processors? time to test more methods? time to unscramble? …
[Panels: test AUC at 12 hours and at 36 days]

24. Test AUC = f(time)
[Three panels: Churn, Appetency and Up-selling test AUC over days ∈ [0, 36]. Which tasks are harder, which easier?]

25. Test AUC = f(time)
[Same three panels, annotated with the difference between the best result at the end of the first day and at the end of the 36 days: Churn Δ = 1.84%, Appetency Δ = 1.38%, Up-selling Δ = 0.11%. Which tasks are harder, which easier?]

26. Correlation Test AUC / Valid AUC (5 days)
[Three panels: Churn, Appetency and Up-selling test vs. validation AUC over days ∈ [0, 5]. Which tasks are harder, which easier?]

27. Correlation Train AUC / Valid AUC (36 days)
[Three panels: Churn, Appetency and Up-selling test vs. train AUC over days ∈ [0, 36]] Difficult to conclude anything…

28. Histogram Test AUC / Valid AUC ([0:5] or ]5:36] days)
[Three panels: Churn, Appetency and Up-selling test AUC over days ∈ [0, 36]] Does knowledge (parameters?) found during the first 5 days help afterwards?

29. Histogram Test AUC / Valid AUC ([0:5] or ]5:36] days)
[Six panels: Churn, Appetency and Up-selling test AUC for days ∈ [0, 36] and for days ∈ ]5, 36]] YES! Knowledge (parameters?) found during the first 5 days helps afterwards.

30. Fact Sheets: Preprocessing & Feature Selection
• Preprocessing (overall usage = 95%), by decreasing percentage of participants: replacement of missing values, discretization, normalizations, grouping modalities, other preprocessing, principal component analysis
• Feature selection (overall usage = 85%), by decreasing percentage of participants: feature ranking, filter method, other feature selection, forward/backward wrapper, embedded method, wrapper with search
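
The dominant choices in both charts compose naturally into one pipeline. A generic sketch; the concrete transformers, bin count and k are illustrative assumptions, not what any team reported:

```python
# Generic sketch of the most-used fact-sheet steps: missing-value
# replacement, discretization, and filter-style feature ranking.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=5_000, n_features=200, random_state=0)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),                    # missing values
    ("discretize", KBinsDiscretizer(n_bins=10, encode="ordinal")),
    ("select", SelectKBest(mutual_info_classif, k=50)),            # feature ranking
])
print(prep.fit_transform(X, y).shape)  # (5000, 50)
```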

31. Fact Sheets: Classifier
• Classifier (overall usage = 93%), by decreasing percentage of participants: decision trees, linear classifier, non-linear kernel, other classifier, neural network, Naïve Bayes, nearest neighbors, Bayesian network, Bayesian neural network
• About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss
• Less than 50% used regularization (20% 2-norm, 10% 1-norm)
• Only 13% used unlabeled data
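
The modal pairing in this chart, a linear classifier trained with logistic loss and 2-norm regularization, is a one-liner in most toolkits. A sketch with illustrative, untuned hyperparameters:

```python
# Sketch of the most common loss/regularizer pairing in the fact sheets:
# logistic loss with 2-norm (L2) regularization on a linear model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=10_000, n_features=100, random_state=0)
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
print("train AUC:", roc_auc_score(y, clf.predict_proba(X)[:, 1]))
```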

32. Fact Sheets: Model Selection
• Model selection (overall usage = 90%), by decreasing percentage of participants: 10% test set, K-fold or leave-one-out, out-of-bag estimate, bootstrap estimate, other model selection, other cross-validation, virtual leave-one-out, penalty-based, bi-level, Bayesian
• About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other)
• About 10% used unscrambling
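
K-fold cross-validation and ensembling, the two most common choices above, combine in a few lines. A sketch scored by AUC, the challenge metric; ensemble size and tree depth are illustrative assumptions:

```python
# Sketch: K-fold cross-validation (scored by AUC) around a bagged
# decision-tree ensemble, mirroring the most common fact-sheet choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, n_features=50, random_state=0)
model = BaggingClassifier(DecisionTreeClassifier(max_depth=8), n_estimators=50)
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"10-fold CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```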

33. Fact Sheets: Implementation
[Bar charts of participant counts per category]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Parallelism: none, multi-processor, run in parallel
• Software platform: Java, Matlab, C, C++, other (R, SAS)
• Operating system: Windows, Linux, Unix, Mac OS

34. Winning methods
• Fast track:
  • IBM Research, USA (+): ensemble of a wide variety of classifiers; effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
  • ID Analytics, Inc., USA (+): filter + wrapper feature selection; TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used
  • David Slate & Peter Frey, USA: grouping of modalities / discretization, filter feature selection, ensemble of decision trees
• Slow track:
  • University of Melbourne: CV-based feature selection targeting AUC; boosting with classification trees and shrinkage, using Bernoulli loss
  • Financial Engineering Group, Inc., Japan: grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting
  • National Taiwan University (+): average of 3 classifiers: (1) joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes
• (+: small dataset unscrambling)
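
A common thread in the winning entries is boosted decision-tree ensembles with shrinkage. A generic scikit-learn stand-in in that spirit (gradient boosting with log-loss matches the 'Bernoulli loss' the Melbourne entry cites; the hyperparameters are assumptions, and this reproduces no specific winner or commercial tool such as TreeNet):

```python
# Generic boosted-trees sketch: gradient boosting with shrinkage over
# shallow classification trees, evaluated by test AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=100,
                           weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,  # shrinkage
                                 max_depth=4, random_state=0)
gbt.fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1]))
```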

35. Conclusion
• Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
  • A problem of real industrial interest with challenging scientific and technical aspects
  • Prizes
• Lessons learned:
  • Do not underestimate the participants: five days were given for the fast challenge; a few hours sufficed for some participants.
  • Ensemble methods are effective.
  • Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and lots of missing values.
