
An Introduction to Knowledge Discovery and Data Mining




Presentation Transcript


  1. An Introduction to Knowledge Discovery and Data Mining

  2. Outline • Introduction • Data Mining Tasks • Applications

  3. The Hard Facts About Data • Enormous amounts of data are being stored in databases • Businesses are increasingly becoming data-rich, yet, paradoxically, they remain knowledge-poor • “We are drowning in information, but starving for knowledge” - John Naisbitt • Unless it is used to improve business practices, data is a liability, not an asset • Standard data analysis techniques are useful but insufficient and may miss valuable insight

  4. Real Examples • Consider the enormous amounts of data generated • Transactional data by credit card companies • Searches on Google, Yahoo, and MSN • Clickstream data (or Web usage data) • Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session • storage and analysis a big problem • Walmart reported to have 24 Tera-byte DB (likely even larger now) • AT&T handles billions of calls per day • data cannot be stored -- analysis is done on the fly

  5. Related Fields Machine Learning Visualization Data Mining and Knowledge Discovery Statistics Databases

  6. What Is Data Mining? Business Definition • Deployment of business processes, supported by adequate analytical techniques, to: • Take further advantage of data • Discover relevant knowledge • Act on the results

  7. From Data to Action • Knowledge • People who buy product X also buy product Y, P% of the time • Doctors who perform in excess of N operations of type T per month may be fraudulent • Molecules of class X are most likely carcinogenic • Actions • Offer product Y to owners of product X • Investigate potential frauds • Information • Mrs. X buys product Y • Product X costs Y dollars • Mr. X drives a car of type Y • Dr. X performed Y operations of type T • Data (raw) • Lifestyle • Transactions • Socio-demographics

  8. Knowledge Discovery and Data Mining: Academic Definition • Knowledge Discovery in Databases (KDD) • KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. [Fayyad et al. 1996] • Data Mining • Data mining is that step of the knowledge discovery process in which data analysis methods are applied to find interesting patterns. • It can be characterized by a set of types of tasks that have to be solved. • It uses methods from a variety of research areas (statistics, machine learning, artificial intelligence, databases, etc.)

  9. The Data Mining Process ELCA

  10. What Kind of Output?

  11. Illustration: Risk for Loan • [Figure: decision tree that tests Income (Low, Medium, High), Credit History (Unknown, Good, Bad) and Debt Level (High, Low); each leaf gives a loan risk of HIGH, MODERATE or LOW]

  12. Outline • Introduction • Data Mining Tasks • Applications

  13. Data Mining Tasks • Summarization • Classification / Prediction • Classification, Concept learning, Regression • Clustering • Dependency modeling • Anomaly detection • Link Analysis

  14. Summarization • To find a compact description for a subset of the data. • Producing the average down time of all plant equipment in a given month, computing the total income generated by each sales representative per region per year • Techniques: • Statistics, Information theory, OLAP, etc.
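The second summarization example on this slide (total income per sales representative per region per year) amounts to a group-by-and-sum. The sketch below is only an illustration: the record layout and the sample values are assumptions, not data from the presentation.

```python
from collections import defaultdict

# Hypothetical sales records: (representative, region, year, income).
sales = [
    ("alice", "north", 2023, 1200.0),
    ("alice", "north", 2023, 800.0),
    ("alice", "south", 2023, 500.0),
    ("bob", "north", 2023, 950.0),
    ("bob", "north", 2024, 1100.0),
]


def total_income(records):
    """Sum income for each (representative, region, year) group."""
    totals = defaultdict(float)
    for rep, region, year, income in records:
        totals[(rep, region, year)] += income
    return dict(totals)


summary = total_income(sales)
for key in sorted(summary):
    print(key, summary[key])
```

Five raw records collapse into four summary rows, which is the "compact description" the slide refers to.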

  15. Prediction • To learn a function that associates a data item with the value of a response variable. If the response variable is discrete, we talk of classification learning; if the response variable is continuous, we talk of regression learning. • Assessing credit worthiness in a loan underwriting business, assessing the probability of response to a direct marketing campaign • Techniques: • Decision trees, Neural networks, Naïve Bayes, Linear discriminants, Logistic regression, Nearest-neighbors, etc.
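As a rough illustration of classification learning, here is a minimal nearest-neighbors sketch (one of the techniques listed on this slide) applied to a toy credit-risk problem. The two features, the training examples and the function name `predict` are all assumptions made for the example.

```python
import math

# Hypothetical training set: (income in k$, debt in k$) -> risk label.
training = [
    ((20.0, 15.0), "high"),
    ((25.0, 12.0), "high"),
    ((60.0, 5.0), "low"),
    ((75.0, 10.0), "low"),
]


def predict(item, examples, k=3):
    """Return the majority label among the k examples closest to item."""
    nearest = sorted(examples, key=lambda ex: math.dist(item, ex[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


print(predict((22.0, 14.0), training))
print(predict((70.0, 8.0), training))
```

Because the response variable here is a discrete label, this is classification; averaging a numeric response over the same neighbors would turn it into a simple form of regression.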

  16. Clustering • To identify a set of (meaningful) categories or clusters to describe the data. Clustering relies on some notion of similarity among data items and strives to maximize intra-cluster similarity whilst minimizing inter-cluster similarity. • Segmenting a business’ customer base, building a taxonomy of animals in a zoological application • Techniques: • K-Means, Hierarchical clustering, Kohonen SOM, etc.
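A minimal K-Means sketch follows, assuming Euclidean distance as the notion of similarity. The customer data are invented, and taking the first k points as starting centroids is a simplification chosen for reproducibility rather than a recommended initialization.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-Means on tuples of numbers.

    Assumption: the first k points serve as the initial centroids.
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (orders per year, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (20, 95), (22, 100), (21, 98)]
centroids, clusters = kmeans(customers, k=2)
print(clusters)
```

On this toy data the occasional buyers and the frequent buyers end up in separate clusters, which is the customer-segmentation use case mentioned on the slide.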

  17. Dependency Modeling • To find a model that describes significant dependencies, associations or affinities among variables. • Analyzing market baskets in consumer goods retail, uncovering cause-effect relationships in medical treatments • Techniques: • Association rules, ILP, Graphical modeling, etc.
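Association rules are usually scored by support and confidence; confidence is the "P% of the time" in a statement such as "people who buy X also buy Y, P% of the time". The sketch below computes both on a handful of invented market baskets; the function names are illustrative only.

```python
# Hypothetical market baskets, one set of products per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

A rule miner would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds; that search is left out here.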

  18. Anomaly Detection • To discover the most significant changes in the data from previously measured or normative values. • Detecting fraudulent credit card usage, detecting anomalous turbine behavior in nuclear plants • Techniques: • Novelty detectors, Probability density models, etc.
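One simple way to compare new data with normative values is a z-score test: flag any reading that lies more than a few standard deviations from the baseline mean. This is a sketch under the assumption that the baseline is roughly bell-shaped; the readings and the threshold of 3 are invented for illustration.

```python
import statistics


def anomalies(baseline, new_values, threshold=3.0):
    """Return the new values whose z-score against the baseline exceeds threshold."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) / std > threshold]


# Hypothetical turbine readings under normal operation, then new readings.
normal = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7]
incoming = [50.1, 53.0, 49.9]
print(anomalies(normal, incoming))
```

The probability density models named on the slide generalize this idea: they estimate how likely a value is under normal behavior and flag the unlikely ones.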

  19. Data Mining Deliverables • Provides additional insight about the data and the business • Provides scientific confirmation of empirical/intuitive business observations • Discovers new, subtle pieces of business knowledge • In that order!

  20. Food for Thought • “Data mining can't be ignored -- the data is there, the methods are numerous, and the advantages that knowledge discovery brings to a business are tremendous.” • “People who can't see the value in data mining as a concept either don't have the data or don't have data with integrity.” • “Data mining is quickly becoming a necessity, and those who do not do it will soon be left in the dust. Data mining is one of the few software activities with measurable return on investment associated with it.”

  21. Outline • Introduction • Data Mining Tasks • Applications

  22. Application Domains (I) • Direct marketing and retail • Behavior analysis, Offer targeting, Market basket analysis, Up-selling, etc. • Banks and financial institutions • Credit risk assessment, Fraud detection, Portfolio management, Forecasting, etc. • Telecommunications • Churn prediction, Product/service development, campaign management, fraud detection, etc.

  23. Application Domains (II) • Healthcare • Public health monitoring (infectious outbreaks, etc), Outcomes measurement (performance, cost, success rate, etc), Diagnostic help, etc. • Pharmaceutical industry / Bio-informatics • Biological activity prediction, Coding sequence discovery, Animal tests reduction, etc. • Insurances • Cross-selling, Risk analysis, Premium setting, Claims analysis, Fraud detection, etc.

  24. Application Domains (III) • Transports • Network management, Booking optimization, Customer service, etc. • Manufacturing • Load forecasting, Production management, Equipment monitoring, Quality management, etc. • White Paper: 143 case studies!

  25. Examples

  26. Survey and Online Game

  27. Survey Analysis • [Figure labels: Simple, Complex] • 0-13136: Poor (21) • 13136-19453: Fair (91) • 19453-25769: Good (90) • 25769-32086: Excellent (39) • 32086+: Outstanding (15)

  28. Subscription & Mail Ordering • Potential applications: • Associations of products that sell together • Identification of potential « cancellers » • Segmentation of customers • Short audit: • Nice DWH, only 2 years old, not fully populated • Limited data on purchases and subscriptions • Business-relevant case studies: • Are there « interesting » product associations? • What are the signs of « cancellation »?

  29. Summarisation / Aggregation • Revenue distribution • 80% generated by 41.5% of subscribers • 60% generated by 18.3% of subscribers • 42.9% generated by top 5 products • Simple customer classes • Over 65 years old most profitable • Under 16 years old least profitable • Birthdate filled-in for only about 10% of subscribers!
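Figures such as "80% generated by 41.5% of subscribers" come from sorting subscribers by revenue and accumulating until the target share is reached. The sketch below shows that calculation on invented revenue numbers; the function name and the data are assumptions, not the study's.

```python
def subscriber_share_for(revenues, target):
    """Smallest fraction of subscribers, taken from the top spender down,
    whose combined revenue reaches the target share of the total."""
    total = sum(revenues)
    running = 0.0
    for count, revenue in enumerate(sorted(revenues, reverse=True), start=1):
        running += revenue
        if running >= target * total:
            return count / len(revenues)
    return 1.0


# Hypothetical yearly revenue per subscriber.
revenues = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]
for target in (0.6, 0.8):
    print(target, subscriber_share_for(revenues, target))
```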

  30. Product Association • About 21% of subscribers buy P4, P7 and P9 • P4 is most profitable product • P7 is ranked 6th • P9 is ranked 15th with only 2% of revenue • Several possible actions • Make a bundle offering of these products • Cross-sell from P9 to P4 • Temptation to remove P9 should be resisted

  31. Clustering • 30% of customers buy a single yearly product!

  32. Summary of Findings • Data Mining found: • A small percentage of the customers are responsible for a large share of the sales • Several groups of « strongly-connected » articles • A sizeable group of subscribers who buy a single article • What was learned: • First 2 findings: « we knew that! » • 3rd finding: « we could target these customers with a special offer! » • Lack of relevant data: the structure is in place but not being used systematically

  33. Actions • What must be done: • Engage in cross-departmental (e.g., marketing and customer care) discussions regarding data needs and data collection procedures • Improve quality of existing data • Implement ways to collect missing data (both internally and from outside sources)

  34. Spam Filtering • In many classification applications, the true prediction depends on some hidden context, not available directly to the mining algorithm • Spam filtering: features are words, which do change over time

  35. Our Solution • Ensemble of incremental learners • “Political committee” where members are voted in and out frequently • Decisions are made by the most popular committee members, and new members are voted in over old members when the old members are out of touch with “society's trends and issues” (i.e., become poor predictors)
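The slide does not give the algorithm, so the following is only one way the committee idea could be sketched, not the presenters' implementation. It assumes a tiny word-counting learner, votes weighted by recency-weighted accuracy, and periodic replacement of the weakest member; the class names, `decay`, `size` and `interval` are all hypothetical.

```python
from collections import Counter


class WordCounter:
    """Tiny incremental learner: counts word occurrences in spam and in ham."""

    def __init__(self, decay=0.8):
        self.spam = Counter()
        self.ham = Counter()
        self.accuracy = 0.0  # recency-weighted share of correct predictions
        self.decay = decay

    def predict(self, words):
        """Return True (spam) if the words were seen more in spam than in ham."""
        return sum(self.spam[w] - self.ham[w] for w in words) > 0

    def learn(self, words, is_spam):
        """Score the current prediction, then update the word counts."""
        correct = self.predict(words) == is_spam
        self.accuracy = self.decay * self.accuracy + (1 - self.decay) * float(correct)
        (self.spam if is_spam else self.ham).update(words)


class Committee:
    """Ensemble in which accurate members carry more weight and the
    weakest member is periodically voted out for a fresh one."""

    def __init__(self, size=3, interval=4):
        self.members = [WordCounter()]
        self.size = size
        self.interval = interval
        self.count = 0

    def predict(self, words):
        # Each member votes with a weight equal to its recent accuracy.
        score = sum(m.accuracy if m.predict(words) else -m.accuracy
                    for m in self.members)
        return score > 0

    def learn(self, words, is_spam):
        for m in self.members:
            m.learn(words, is_spam)
        self.count += 1
        if self.count % self.interval == 0:
            if len(self.members) == self.size:
                # Vote out the member that is most out of touch.
                self.members.remove(min(self.members, key=lambda m: m.accuracy))
            self.members.append(WordCounter())


committee = Committee()
for _ in range(4):
    committee.learn(["cheap", "offer"], True)
    committee.learn(["team", "meeting"], False)
print(committee.predict(["cheap", "offer", "now"]))
print(committee.predict(["team", "meeting"]))
```

Because fresh members start with empty word counts, they only know recent vocabulary; when spam wording drifts, they can overtake older members whose accuracy decays.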

  36. Sample Results

  37. Conclusion • Data Mining transforms data into actions • Data Mining is hard work • It is a process, not a single activity • Most companies are clueless and DM is an afterthought • Plan to learn through the process • Think big, start small • Data Mining is FUN!
