
LIACS Data Mining course



Presentation Transcript


  1. LIACS Data Mining course: an introduction

  2. Course Information • Course website: http://datamining.liacs.nl/DaMi/ (will be updated periodically) • videos (EN and NL)

  3. Course Schedule • Start Tuesday Sept 3 • Last lecture Dec 3 • Lecture room varies • Pieter de la Court, room 1A20 (with the exception of Oct 1: room SC01) • Van Steenis, room F104 • Sept 17 (in two weeks), no lecture! • Oct 22, no lecture! • Two assignments, to be determined • Practical exercises • Oct 1, second hour • Oct 15, second hour • Exam: Jan 16, 14:00 – 17:00

  4. Course Textbook • Data Mining: Practical Machine Learning Tools and Techniques, third edition, by Ian Witten and Eibe Frank • Morgan Kaufmann, ISBN 978-0-12-374856-0 • EUR 45,95 at Amazon.de

  5. Course participants? • Bachelor Informatica 45 • … & Economie 18 • … & Biologie 5 • Minor Data Science 14 • other Science Faculty 4 • other programmes 2 • PhD students 0 • others? 7 • total 95

  6. Introduction to Data Mining: an overview and some examples

  7. Data Mining definitions • Data Mining: the concept of extracting previously unknown and potentially useful, interesting knowledge from large sets of data • secondary statistics: analyzing data that wasn’t originally collected for analysis

  8. Data Mining, the big idea • Organizations collect large amounts of data • Often for administrative purposes • Large body of experience • Learning from experience • Goals • Prediction/forecasting • Diagnostics • Optimization • …

  9. 2 Streams • Knowledge Discovery • understanding a domain • interpretable models • examples: medicine, production, maintenance • examine details of model • Prediction • don’t care how you do it, just do it well • interpretable model? • black box • examples: marketing, forecasting (financial, weather) • apply model to new data

  10. example: Direct Mail Optimize the response to a mailing by targeting only those who are likely to respond: • more response • fewer letters [Diagram labels: test mailing (customer information, response 3%); Data Mining; customer model; remainder; final mailing (customer information, response 30%)]
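The effect of targeting can be illustrated with a small calculation. The sketch below is not from the course material: the customer list, the `score` field (standing in for the output of a customer model) and the 30% mailing fraction are all hypothetical. It only shows why mailing the highest-scored customers yields more response with fewer letters.

```python
def response_rate(group):
    """Fraction of customers in the group who responded."""
    return sum(c["responded"] for c in group) / len(group)

def target_top(customers, fraction):
    """Select the given fraction of customers with the highest model score."""
    ranked = sorted(customers, key=lambda c: c["score"], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]

# Hypothetical data: model score and actual response for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
customers = [{"score": s, "responded": r} for s, r in zip(scores, responses)]

mail_everyone = response_rate(customers)               # 3 of 10 respond
mail_targeted = response_rate(target_top(customers, 0.3))  # 2 of 3 respond

print(f"untargeted: {mail_everyone:.0%}, targeted: {mail_targeted:.0%}")
```

In this toy data the targeted mailing sends three letters instead of ten and reaches two of the three responders; real gains depend entirely on how well the model ranks customers.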

  11. example: Bioinformatics • Find genes involved in disease (Parkinson’s, Celiac, Neuroblastoma) • Measurements from patients (1) and controls (0) • Gene expression: measurements of 20k genes • dataset 20,001 x 100 • Challenges • many variables • few examples (patients), testing is expensive • interactions between genes

  12. Taxonomy of the field [Diagram: Data Mining splits into supervised and unsupervised; supervised splits into regression and classification]

  13. Supervised (Classification and Regression) • Learning a ‘function’ from input to output • How does output depend on input? • Can we predict the output, given the input? [Diagram labels: target, predictive model]

  14. Taxonomy of the field [Diagram: Data Mining splits into supervised and unsupervised; supervised splits into regression (target is numeric) and classification (target is nominal)]

  15. Taxonomy of the field [Diagram: Data Mining splits into supervised and unsupervised; labels under unsupervised: outlier detection, pattern mining, clustering, community detection, frequent patterns]

  16. Data Mining paradigms • Classification • (binary) class variable • predict class of future cases • most popular paradigm • Regression • numeric target variable • Clustering • divide dataset into groups of similar cases • Frequent Patterns/Association • find dependencies between variables • basket analysis, …
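As an illustration of the Frequent Patterns paradigm, the following sketch counts how often pairs of items occur together in a set of shopping baskets. It is a brute-force count written for this transcript, not an algorithm taken from the course; the baskets, the function name `frequent_pairs` and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (fraction of baskets containing both) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical baskets for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]

print(frequent_pairs(baskets, min_support=0.5))
```

Here bread and milk appear together in three of four baskets (support 0.75), and eggs and milk in two of four (support 0.5). Counting all pairs this way does not scale to large item sets, which is why dedicated frequent-pattern algorithms exist.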

  17. Classification (decision tree) Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). [Figure: decision tree that splits on House, with a value at each node. Root: 0.2. Rent (0.4): Age < 35 → Yes (0.64), Age ≥ 35 → No (0.25). Buy (0.1): Price < 200K → Yes (0.51), Price ≥ 200K → No (0.01). Other → No (0.07).]

  18. Applying a classifier (decision tree) New customer: (House = Rent, Age = 32, …) prediction = Yes [Figure: decision tree. Rent: Age < 35 → Yes, Age ≥ 35 → No. Buy: Price < 200K → Yes, Price ≥ 200K → No. Other → No.]
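The tree on this slide can be written as a few nested tests. The sketch below is one possible encoding: the dictionary keys `House`, `Age` and `Price`, and the reading of 200K as 200,000, are assumptions made for illustration.

```python
def predict(customer):
    """Walk the decision tree from the slide and return 'Yes' or 'No'."""
    house = customer["House"]
    if house == "Rent":
        # Renters: the class depends on age.
        return "Yes" if customer["Age"] < 35 else "No"
    if house == "Buy":
        # Buyers: the class depends on price (200K read as 200,000).
        return "Yes" if customer["Price"] < 200_000 else "No"
    # House = Other
    return "No"

# The new customer from the slide.
new_customer = {"House": "Rent", "Age": 32}
print(predict(new_customer))  # Yes
```

Only the attributes on the path from the root to a leaf are consulted, so the renter above is classified without a `Price` value.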

  19. Classification • Tree makes attribute dependencies explicit • Class depends on House • The influence of other attributes is less • Dependencies are (often) fuzzy • multiple attributes are needed • Perfect predictions are rare [Figure: decision tree. Rent: Age < 35 → Yes, Age ≥ 35 → No. Buy: Price < 200K → Yes, Price ≥ 200K → No. Other → No.]

  20. Graphical interpretation • dataset with two attributes + 1 class (+/-) • graphical interpretation of decision tree [Figure: scatter plot of + and - points on x and y axes, partitioned by the tree splits x < t / x ≥ t and y < t’ / y ≥ t’]

  21. Graphical interpretation • dataset with two attributes + 1 class (+/-) • other classifiers: Support Vector Machine, Neural Network [Figure: scatter plot of + and - points on x and y axes]

  22. Applications of DM • Marketing • outgoing • incoming • Bioinformatics & Medicine • Fraud detection • Risk management • Insurance • Enterprise resource planning

  23. Break

  24. Data Mining Applications

  25. Training Data Speed Skating • Speed skating team LottoNL-Jumbo • Detailed historic data • training details • duration • intensity • competition results • Finding patterns of effective training • Visualise data

  26. Kjeld Nuis • 178 races • On average 2.89% above track record • Specialises in 1000 m (2.1%) • 2015-2016 • Dutch champion 1000 m, 1500 m • WC Distances: bronze 1000 m, silver 1500 m • WC Sprint: ‘silver’ • ISU World Cup: gold 1000 m, silver 1500 m • 2017-2018 • Olympic Champion 1000 m and 1500 m

  27. [Plot: total sum of load over last 5 days, morning sessions; annotations: undesired result due to over-training, advised upper limit]
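The quantity on this slide, the total load over the last 5 days, is a rolling sum. A minimal sketch of how such a feature and an upper-limit check could be computed is given below; the daily load values, the limit of 20 and the function names are hypothetical and do not come from the LottoNL-Jumbo data.

```python
def rolling_load(daily_load, window=5):
    """Sum of load over the last `window` days, for each day that has a full window."""
    return [
        sum(daily_load[i - window + 1 : i + 1])
        for i in range(window - 1, len(daily_load))
    ]

def days_over_limit(daily_load, limit, window=5):
    """Indices of days whose rolling load exceeds the advised upper limit."""
    totals = rolling_load(daily_load, window)
    return [i + window - 1 for i, total in enumerate(totals) if total > limit]

# Hypothetical morning-session load per day.
loads = [3, 4, 2, 5, 6, 7, 1, 0, 2, 3]

print(rolling_load(loads))               # [20, 24, 21, 19, 16, 13]
print(days_over_limit(loads, limit=20))  # [5, 6]
```

Days 5 and 6 (counting from 0) exceed the hypothetical limit because the heavy sessions on days 3 to 5 are still inside the 5-day window.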

  28. InfraWatch: monitoring of infrastructure Continuous monitoring of a large bridge ‘Hollandse Brug’ • 145 sensors • time-dependent, at frequencies up to 100 Hz • multi-modal (sensor, video, different freq.) • managing large data quantities, >5 Gb per day
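A back-of-envelope estimate shows how sensor counts and sampling rates of this size add up to gigabytes per day. The sensor count and the 100 Hz rate are from the slide; the 4 bytes per sample is an assumption, every sensor is treated as sampling at the maximum rate, and video is left out, so the result is only a rough indication.

```python
SENSORS = 145              # from the slide
MAX_RATE_HZ = 100          # from the slide; an upper bound, not every sensor samples this fast
BYTES_PER_SAMPLE = 4       # assumption: one 32-bit value per sample
SECONDS_PER_DAY = 24 * 60 * 60

samples_per_day = SENSORS * MAX_RATE_HZ * SECONDS_PER_DAY
bytes_per_day = samples_per_day * BYTES_PER_SAMPLE
gigabytes_per_day = bytes_per_day / 1e9

print(f"{samples_per_day:,} samples/day, about {gigabytes_per_day:.2f} GB/day")
```

Under these assumptions the raw sensor stream alone is on the order of 5 GB per day, the same order of magnitude as the figure quoted on the slide.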

  29. InfraWatch: monitoring of infrastructure • 34 geo-phones (vibration sensors) • 44 embedded strain-gauges, 47 gauges outside • 20 thermometers • video camera • weather station

  30. sensor mining

  31. Maintenance planning at KLM • Routine checks of aircraft • Maintenance requires up to 10k different parts • Ordering parts incurs delay (costs)… • … but so does keeping stock • In theory 10k individual predictions • Input • maintenance history • flight history, Sahara/North Pole • Only few parts predictable

  32. Cashflow Online • Online personal finance overview • All bank transactions are loaded into the application • transactions are classified into different categories • Data Mining predicts category

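  33. 67 Categories • Gas Water Licht (gas, water, electricity) • Onderhoud huis en tuin (home and garden maintenance) • Telefoon + Internet + TV (phone, internet, TV) • Contributie (sport-)verenigingen (membership fees for (sports) clubs) • Levensverzekering / Lijfrente (life insurance / annuity) • Rente ontvangen (interest received) • Boodschappen (groceries) • Hypotheekrente (mortgage interest) • Naar spaarrekening (to savings account) • Geldopname/chipknip (cash withdrawal / Chipknip) • Verzekeringen overig (other insurance) • Loterijen (lotteries) • Cadeau's (gifts) • Interne boeking (internal transfer) • Vakantie & Recreatie (holidays and recreation) • Uitgaan, hobby's en sport (going out, hobbies and sport) • Creditcard (credit card) • Ziektekostenverzekering (health insurance) • Brandstof (fuel) • Woonhuis / Opstalverzekering (home / buildings insurance) • Huishouden overig (other household) • School- en Studiekosten (school and study costs) • Inkomsten overig (other income) • Kleding & Schoenen (clothing and shoes) • Lenen (borrowing) • Openbaar vervoer/Taxi (public transport / taxi) • …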

  34. Fragmented results: Boodschappen (groceries), Contributie (membership fees)

  35. Decision Tree over all categories [Figure: decision tree with true/false branches]

  36. Data Mining at LIACS • Applications • bioinformatics (LUMC) • Sports Analytics (LottoNL-Jumbo, PSV) • Hollandse Brug (Strukton, TU Delft, RWS) • fraud detection at Achmea health insurance and NZa • ChartEx, medieval documents (English, Latin) • Complex data • graphical data (molecules) • relational data (criminal careers) • stream data (sensor data, click streams) • …
