
KDD-Cup A Survey: 1997-2012

KDD-Cup A Survey: 1997-2012. Special thanks to Prof. Qiang YANG’s course materials (partly based on Xinyue Liu’s slides @SFU and Nathan Liu’s slides @hkust). Hong Kong University of Science and Technology.


Presentation Transcript


  1. KDD-Cup A Survey: 1997-2012 • Special thanks to Prof. Qiang YANG’s course materials (partly based on Xinyue Liu’s slides @SFU and Nathan Liu’s slides @hkust) • Hong Kong University of Science and Technology

  2. About ACM KDDCUP • ACM KDD: Premier conference in knowledge discovery and data mining • ACM KDDCUP: • Worldwide competition held in conjunction with the ACM KDD conferences. • It aims at: • Showcasing the best methods for discovering higher-level knowledge from data • Helping to close the gap between research and industry • Stimulating further KDD research and development

  3. Statistics • Participation in KDD Cup grew steadily • Average person-hours per submission: 204 • Max person-hours per submission: 910

  4. KDD Cup 97 • A classification task: to predict direct mail response in the financial services industry • Winners • Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier • Silicon Graphics, Inc. with their software MineSet • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

  5. MineSet (Silicon Graphics Inc.) • A KDD tool that combines data access, transformation, classification, and visualization.

  6. KDD Cup 98: CRM Benchmark • URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html • A classification task: to analyze responses to fundraising mail sent by a non-profit organization • Winners • Urban Science Applications, Inc. with their software GainSmarts • SAS Institute, Inc. with their software SAS Enterprise Miner ™ • Quadstone Limited with their software Decisionhouse ™

  7. KDDCUP 1998 Results • Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) • Mail to Everyone Solution ($10,560 in profits with 96,367 mailed) • GainSmarts • SAS/Enterprise Miner • Quadstone/Decisionhouse
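The two reference lines on the 1998 results slide can be compared as profit per mailed piece. The following is a minimal sketch that uses only the two figures quoted on the slide; the function name `profit_per_mailing` is illustrative and not part of any KDD Cup material.

```python
def profit_per_mailing(total_profit: float, pieces_mailed: int) -> float:
    """Average profit earned per mailed piece."""
    return total_profit / pieces_mailed


# Figures quoted on the KDDCUP 1998 results slide.
MAX_PROFIT, MAX_MAILED = 72_776, 4_873      # maximum possible profit line
ALL_PROFIT, ALL_MAILED = 10_560, 96_367     # mail-to-everyone solution

best_rate = profit_per_mailing(MAX_PROFIT, MAX_MAILED)
baseline_rate = profit_per_mailing(ALL_PROFIT, ALL_MAILED)

# Share of the full list that the ideal selection would mail to.
fraction_mailed = MAX_MAILED / ALL_MAILED

print(f"ideal selection: ${best_rate:.2f} per piece")
print(f"mail to everyone: ${baseline_rate:.2f} per piece")
print(f"ideal selection mails {fraction_mailed:.1%} of the list")
```

Read this way, the task rewards models that rank recipients well enough to mail a small fraction of the list, rather than models that merely classify responders.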

  8. ACM KDD Cup 1999 • URL: www.cse.ucsd.edu/users/elkan/kdresults.html • Problem: to detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders • Data: from DoD • Winners • SAS Institute Inc. with their software Enterprise Miner • Amdocs with their Information Analysis Environment

  9. KDDCUP 2000: Data Set and Goal • Data collected from Gazelle.com, a legwear and legcare Web retailer • Pre-processed • Training set: 2 months • Test sets: one month • Data collected includes: • Click streams • Order information • The goal: to design models to support web-site personalization and to improve the profitability of the site by increasing customer response • Questions: when given a set of page views, • characterize heavy spenders • characterize killer pages • characterize which product brand a visitor will view in the remainder of the session

  10. KDD Cup 2001 • 3 Bioinformatics Tasks • Dataset 1: Prediction of Molecular Bioactivity for Drug Design • Half a gigabyte when uncompressed • Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3) • Dataset 2 is smaller and easier to understand • 7 megabytes uncompressed • A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

  11. 2001 Winners • Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce). Bayesian network learner and classifier • Task 2, Function: Mark-A. Krogel (University of Magdeburg). Inductive logic programming • Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo). K nearest neighbor • Task 2: the genes of one particular type of organism • A gene/protein can have more than one function, but only one localization.

  12. KDD Cup 2002 • Molecular biology: two tasks • Task 1: Document extraction from biological articles • Task 2: Classification of proteins based on gene deletion experiments • Winners: • Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein) • Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)

  13. 2003 KDDCUP • Information Retrieval/Citation Mining of scientific research papers • Based on a very large archive of research papers • First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference • Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources • Third Task: estimate each paper's popularity based on partial download logs • Last Task: participants devise their own questions

  14. 2004 Tasks and Results • (Particle physics; plus protein homology prediction) • Winners of the two tasks: • David S. Vogel, Eric Gottschalk, and Morgan C. Wang • Bernhard Pfahringer, Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao.

  15. Past KDDCUP Overview: 2005-2010

  16. KDDCUP’11 Dataset • 11 years of data • Rated items are • Tracks • Albums • Artists • Genres • Items arranged in a taxonomy • Two tasks

  17. Items in a Taxonomy

  18. Track 1 Details

  19. Track 1 Highlights • Largest publicly available dataset • Large number of items (50 times more than Netflix) • Extreme rating sparsity (20 times more sparse than Netflix) • Taxonomy can help in combating sparsely rated items. • Fine time stamps with both date and time allow sophisticated temporal modeling.

  20. Track 2 Details

  21. Track 2 Highlights • The performance metric focuses on ranking/classification, which differs from traditional collaborative filtering. • No validation data is provided; participants need to construct binary labeled data from the rating data themselves. • Unlike track 1, track 2 removed time stamps to focus more on long-term preference than on short-term behavior.

  22. Submission Stats

  23. Winners

  24. Chinese Teams at KDDCUP (NTU, CAS, HKUST) Nathan Liu: HKUST CSE PhD student

  25. KDDCUP 2012 • Tencent • Task 1: Micro-blog (Weibo) User Recommendation • Recommends a popular person / an organization / a group TO a user • Task 2: Ad click-through rate prediction from search log • How often will an Ad be clicked by a user?

  26. Task1: User recommendation UI • Popular user recommendation

  27. Task2: Ad click-through rate prediction

  28. Task1 Data – User-Item Matrix • rec_log_train.txt / rec_log_test.txt • Columns: UserID ItemID ?followed TimeStamp • Sample rows: 2088948 1760350 -1 1318348785; 2088948 1774722 -1 1318348785; 2088948 786313 -1 1318348785; 601635 1775029 -1 1318348785; 601635 1902321 -1 1318348785; 601635 462104 -1 1318348785; 1529353 1774509 -1 1318348786 • ~75M records in training data • ?followed: -1/1, whether the user accepts the recommendation or not • In test data, it is filled with 0, to be predicted as -1/1. • TimeStamp: Unix timestamp • Seconds since 1970-01-01 00:00:00 (UTC)
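The record layout on the Task 1 data slide (UserID, ItemID, ?followed, TimeStamp) can be read with a few lines of Python. This is a minimal sketch, assuming the four fields are whitespace-separated as they appear on the slide; the names `RecLogRecord` and `parse_rec_log_line` are hypothetical and not part of the competition kit.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RecLogRecord(NamedTuple):
    user_id: int
    item_id: int
    followed: int        # -1 or 1 in training data, 0 in test data
    timestamp: datetime  # converted from Unix seconds, in UTC


def parse_rec_log_line(line: str) -> RecLogRecord:
    """Parse one line of the assumed whitespace-separated record format."""
    user_id, item_id, followed, unix_seconds = line.split()
    return RecLogRecord(
        user_id=int(user_id),
        item_id=int(item_id),
        followed=int(followed),
        timestamp=datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc),
    )


# First sample row shown on the slide.
record = parse_rec_log_line("2088948 1760350 -1 1318348785")
print(record.user_id, record.item_id, record.followed, record.timestamp.isoformat())
```

With roughly 75M training records, reading the file line by line (rather than loading it whole) keeps memory use flat; the parser above works unchanged inside such a loop.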

  29. Task2 Data – Main Data Table • Extremely large training data • ~150M records • 10 GB raw CSV file + keywords + user profiles • Predicting CTR helps the search provider rank and price ads correctly
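Click-through rate (CTR) is the number of clicks divided by the number of impressions. The sketch below computes an empirical CTR per ad as a simple baseline. It assumes each row carries an ad identifier, a click count, and an impression count; that row layout, the function name `empirical_ctr`, and the optional smoothing parameters are illustrative assumptions, not the competition's schema or a winning method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def empirical_ctr(
    rows: Iterable[Tuple[str, int, int]],
    prior_clicks: float = 0.0,
    prior_impressions: float = 0.0,
) -> Dict[str, float]:
    """Aggregate (ad_id, clicks, impressions) rows into a CTR per ad.

    The optional priors add pseudo-counts (additive smoothing) so that ads
    with very few impressions do not get extreme estimates.
    """
    clicks: Dict[str, float] = defaultdict(float)
    impressions: Dict[str, float] = defaultdict(float)
    for ad_id, n_clicks, n_impressions in rows:
        clicks[ad_id] += n_clicks
        impressions[ad_id] += n_impressions

    ctr: Dict[str, float] = {}
    for ad_id, shown in impressions.items():
        denominator = shown + prior_impressions
        ctr[ad_id] = (clicks[ad_id] + prior_clicks) / denominator if denominator > 0 else 0.0
    return ctr


# Hypothetical toy rows: (ad_id, clicks, impressions).
toy_rows = [("ad_a", 1, 10), ("ad_a", 0, 10), ("ad_b", 3, 4)]
ctr_by_ad = empirical_ctr(toy_rows)
smoothed = empirical_ctr(toy_rows, prior_clicks=1.0, prior_impressions=20.0)
print(ctr_by_ad)
print(smoothed)
```

Because the function streams over its input once and keeps only per-ad totals, the same approach scales to a file with on the order of 150M records.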

  30. Winners

  31. Summary • Placing at the top of KDDCUP requires • Teamwork • Expertise in domain knowledge as well as mathematical tools • Top places often go to world-famous institutes and companies • Recent trends: • Datasets are increasingly realistic • Participants are increasingly professional • Tasks are increasingly difficult

  32. Summary • KDD Cup is an excellent source for learning state-of-the-art KDD techniques • KDDCUP datasets often become standard benchmarks for future research, development and teaching • Top winners are highly regarded and respected • References: http://www.sigkdd.org/kddcup/index.php
