Knowledge discovery & data mining Tools, methods, and experiences

Fosca Giannotti and Dino Pedreschi, Pisa KDD Lab, CNUCE-CNR & Univ. Pisa, http://www-kdd.di.unipi.it/ A tutorial @ EDBT2000

Contributors and acknowledgements • The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students • The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-) • In particular: • Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial • Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs • Daniel A. KEIM, University of Halle • Daniel Silver, CogNova Technologies • The EDBT2000 board who accepted our tutorial proposal EDBT2000 tutorial - Intro

Tutorial goals • Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology • Provide a systematization of the many concepts around this area, along the following lines • the process • the methods applied to paradigmatic cases • the support environment • the research challenges • Important issues that will not be covered in this tutorial: • methods: time series, exception detection, neural nets • systems: parallel implementations

Tutorial Outline • Introduction and basic concepts • Motivations, applications, the KDD process, the techniques • Deeper into DM technology • Decision Trees and Fraud Detection • Association Rules and Market Basket Analysis • Clustering and Customer Segmentation • Trends in technology • Knowledge Discovery Support Environment • Tools, Languages and Systems • Research challenges

Introduction - module outline • Motivations • Application Areas • KDD Decisional Context • KDD Process • Architecture of a KDD system • The KDD steps in short

Evolution of Database Technology: from data management to data analysis • 1960s: • Data collection, database creation, IMS and network DBMS. • 1970s: • Relational data model, relational DBMS implementation. • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.). • 1990s: • Data mining and data warehousing, multimedia databases, and Web technology.

Motivations: “Necessity is the Mother of Invention” • Data explosion problem: • Automated data collection tools, mature database technology and the Internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. • We are drowning in information, but starving for knowledge! (John Naisbitt) • Data warehousing and data mining: • On-line analytical processing • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

A rapidly emerging field • Also referred to as: Data dredging, Data harvesting, Data archeology • A multidisciplinary field: • Database • Statistics • Artificial intelligence • Machine learning, Expert systems and Knowledge Acquisition • Visualization methods

Motivations for DM • Abundance of business and industry data • Competitive focus - Knowledge Management • Inexpensive, powerful computing engines • Strong theoretical/mathematical foundations • machine learning & logic • statistics • database management systems

What is DM useful for? Increase the knowledge to base decisions upon. E.g., impact on marketing. [Figure labels: Marketing; Database Marketing; KDD & Data Mining; Data Warehousing]

The Value Chain • Decision • Promote product A in region Z. • Mail ads to families of profile P • Cross-sell service B to clients C • Knowledge • A quantity Y of product A is used in region Z • Customers of class Y use x% of C during period D • Information • X lives in Z • S is Y years old • X and S moved • W has money in Z • Data • Customer data • Store data • Demographical Data • Geographical data

Application Areas and Opportunities • Marketing: segmentation, customer targeting, ... • Finance: investment support, portfolio management • Banking & Insurance: credit and policy approval • Security: fraud detection • Science and medicine: hypothesis discovery, prediction, classification, diagnosis • Manufacturing: process modeling, quality control, resource allocation • Engineering: simulation and analysis, pattern recognition, signal processing • Internet: smart search engines, web marketing

Classes of applications • Market analysis • target marketing, customer relationship management, market basket analysis, cross selling, market segmentation. • Risk analysis • forecasting, customer retention, improved underwriting, quality control, competitive analysis. • Fraud detection • Text (news group, email, documents) and Web analysis.

Market Analysis • Where are the data sources for analysis? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies. • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Conversion of a single to a joint bank account: marriage, etc. • Cross-market analysis • Associations/correlations between product sales • Prediction based on the association information.

Market Analysis (2) • Customer profiling • data mining can tell you what types of customers buy what products (clustering or classification). • Identifying customer requirements • identifying the best products for different customers • use prediction to find what factors will attract new customers • Provides summary information • various multidimensional summary reports; • statistical summary information (data central tendency and variation)

Risk Analysis • Finance planning and asset evaluation: • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning: • summarize and compare the resources and spending • Competition: • monitor competitors and market directions (CI: competitive intelligence). • group customers into classes and class-based pricing procedures • set pricing strategy in a highly competitive market

Fraud Detection • Applications: • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. • Approach: • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. • Examples: • auto insurance: detect a group of people who stage accidents to collect on insurance • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) • medical insurance: detect professional patients, rings of doctors and rings of references

Fraud Detection (2) • More examples: • Detecting inappropriate medical treatment: • Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr). • Detecting telephone fraud: • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. • Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and the Miami Heat. • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyze the effectiveness of Web marketing, improve Web site organization, etc. • Watch for the PRIVACY pitfall!

What is KDD? A process! • The selection and processing of data for: • the identification of novel, accurate, and useful patterns, and • the modeling of real-world phenomena. • Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process [Figure: Data Sources → Data Consolidation → Consolidated Data (Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g., p(x)=0.02) → Interpretation and Evaluation → Knowledge]

The KDD Process: Core Problems & Approaches • Problems: • identification of relevant data • representation of data • search for valid pattern or model • Approaches: • top-down deduction by expert • interactive visualization of data/models • bottom-up induction from data [slide labels: OLAP, Data Mining]

The steps of the KDD process • Learning the application domain: • relevant prior knowledge and goals of application • Data consolidation: creating a target data set • Selection and Preprocessing • Data cleaning (may take 60% of effort!) • Data reduction and projection: • find useful features, dimensionality/variable reduction, invariant representation. • Choosing functions of data mining • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s) • Data mining: search for patterns of interest • Interpretation and evaluation: analysis of results. • visualization, transformation, removing redundant patterns, … • Use of discovered knowledge

The virtuous cycle [Figure: a cycle with the steps Identify Problem or Opportunity, Act on Knowledge, Measure effect of Action; labels: Problem, Knowledge, Strategy, Results]

Applications, operations, techniques

Roles in the KDD process

Data mining and business intelligence [Figure: layers with increasing potential to support business decisions, from bottom to top: Data Sources (Paper, Files, Information Providers, Database Systems, OLTP) • Data Warehouses / Data Marts (OLAP, MDA) • Data Exploration (Statistical Analysis, Querying and Reporting) • Data Mining (Information Discovery) • Data Presentation (Visualization Techniques) • Making Decisions; roles shown alongside: DBA, Data Analyst, Business Analyst, End User]

Architecture of a KDD system [Figure: a Graphical User Interface on top of the components Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation; Data Sources feed the Warehouse, and the output is Knowledge]

A business intelligence environment

The KDD process [Figure repeated: the pipeline from Data Sources to Knowledge]

Data consolidation and preparation: garbage in, garbage out • The quality of results relates directly to the quality of the data • 50%-70% of KDD process effort is spent on data consolidation and preparation • Major justification for a corporate data warehouse

Data consolidation: from data sources to consolidated data repository [Figure: the sources RDBMS, Legacy DBMS, Flat Files and External feed Data Consolidation and Cleansing, which loads the Warehouse] • Object/Relational DBMS • Multidimensional DBMS • Deductive Database • Flat files

Data consolidation • Determine preliminary list of attributes • Consolidate data into working database • Internal and External sources • Eliminate or estimate missing values • Remove outliers (obvious exceptions) • Determine prior probabilities of categories and deal with volume bias
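
The consolidation steps above (estimating missing values, removing obvious outliers, determining prior probabilities) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tutorial's method: the median imputation, the two-standard-deviation cutoff, and the sample data are all assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, median, stdev

def impute_missing(values):
    """Estimate missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=2.0):
    """Keep values within k sample standard deviations of the mean (assumed rule)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

def class_priors(labels):
    """Relative frequency of each category in the working data set."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

# Hypothetical attribute with one missing value and one obvious exception.
incomes = [30, 32, None, 31, 29, 33, 30, 500]
filled = impute_missing(incomes)
cleaned = remove_outliers(filled)
priors = class_priors(["good", "good", "good", "bad"])
```

The median is used for imputation here because a mean would be pulled upward by the very outlier that the next step removes.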

The KDD process [Figure repeated: Data Consolidation (Warehouse), Selection and Preprocessing, Data Mining (p(x)=0.02), Interpretation and Evaluation, Knowledge]

Data selection and preprocessing • Generate a set of examples • choose sampling method • consider sample complexity • deal with volume bias issues • Reduce attribute dimensionality • remove redundant and/or correlated attributes • combine attributes (sum, multiply, difference) • Reduce attribute value ranges • group symbolic discrete values • quantize continuous numeric values • Transform data • de-correlate and normalize values • map time-series data to static representation • OLAP and visualization tools play a key role
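
Two of the preprocessing steps above, normalizing values and quantizing continuous numeric values, might look as follows. This is a minimal sketch: the min-max scaling, the bin edges, and the group names are assumptions made for illustration.

```python
import bisect

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def quantize(values, edges, labels):
    """Map each numeric value to the label of the bin it falls into.

    edges must be sorted; a value equal to an edge goes to the upper bin.
    """
    return [labels[bisect.bisect_right(edges, v)] for v in values]

# Hypothetical attribute: customer ages.
ages = [18, 25, 35, 47, 62]
normalized = min_max_normalize(ages)
age_groups = quantize(ages, edges=[30, 50], labels=["young", "middle", "senior"])
```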

The KDD process [Figure repeated: Data Consolidation (Warehouse), Selection and Preprocessing, Data Mining (p(x)=0.02), Interpretation and Evaluation, Knowledge]

Data mining tasks and methods • Automated Exploration/Discovery • e.g., discovering new market segments • clustering analysis • Prediction/Classification • e.g., forecasting gross sales given current factors • regression, neural networks, genetic algorithms, decision trees • Explanation/Description • e.g., characterizing customers by demographics and purchase history • decision trees, association rules, e.g., if age > 35 and income < $35k then ... [Figures: scatter plot over x1 and x2; curve f(x) over x]

Automated exploration and discovery • Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties. • Distance-based numerical clustering • metric grouping of examples (K-NN) • graphical visualization can be used • Bayesian clustering • search for the number of classes which results in the best fit of a probability distribution to the data • AutoClass (NASA) is one of the best examples
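
To make distance-based numerical clustering concrete, here is a small sketch using k-means. Note that k-means is swapped in as a familiar example and is not named on the slide; the naive initialization (the first k points) and the sample points are also assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Group points (tuples of numbers) into k clusters by Euclidean distance."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional examples forming two visible groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
```

With these points the two groups separate after a couple of iterations; real data usually needs a better initialization and several restarts.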

Prediction and classification • Learning a predictive model • Classification of a new case/sample • Many methods: • Artificial neural networks • Inductive decision tree and rule systems • Genetic algorithms • Nearest neighbor clustering algorithms • Statistical (parametric, and non-parametric)
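
As one illustration of classifying a new case from labeled examples, the sketch below uses a k-nearest-neighbor vote. It is only one of the method families listed above, and the feature layout (age, income in thousands), the labels, and the value of k are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Predict the label of query by majority vote among its k nearest examples.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    neighbors = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training cases: (age, income in thousands) -> spending class.
train = [
    ((25, 20), "low"), ((30, 25), "low"), ((28, 22), "low"),
    ((45, 80), "high"), ((50, 90), "high"), ((48, 85), "high"),
]
prediction_a = knn_classify(train, (47, 82))
prediction_b = knn_classify(train, (27, 21))
```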

Generalization and regression • The objective of learning is to achieve good generalization to new unseen cases. • Generalization can be defined as a mathematical interpolation or regression over a set of training points • Models can be validated with a previously unseen test set or using cross-validation methods [Figure: curve f(x) fitted over x]
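
The idea of validating a regression model on unseen cases can be sketched with a least-squares line and a simple k-fold cross-validation. The fold assignment (every folds-th point), the choice of mean squared error, and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cross_validate(xs, ys, folds=3):
    """Average mean squared error over the held-out folds."""
    errors = []
    for f in range(folds):
        test_idx = set(range(f, len(xs), folds))  # every folds-th point is held out
        pairs = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in pairs if i not in test_idx]
        test = [xy for i, xy in pairs if i in test_idx]
        slope, intercept = fit_line(*zip(*train))
        mse = sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)
        errors.append(mse)
    return sum(errors) / folds

# Hypothetical training points scattered around y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 19.1]
cv_error = cross_validate(xs, ys)
```

A low cross-validation error suggests the fitted line generalizes to points it was not trained on; a large gap between training and held-out error would suggest overfitting.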

Classification and prediction • Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage. • Use obtained model to predict some unknown or missing attribute values based on other information.

Summarizing: inductive modeling = learning • Objective: develop a general model or hypothesis from specific examples • Function approximation (curve fitting) [Figure: curve f(x) over x] • Classification (concept learning, pattern recognition) [Figure: classes A and B in the x1-x2 plane]

Explanation and description • Learn a generalized hypothesis (model) from selected data • Description/Interpretation of model provides new knowledge • Methods: • Inductive decision tree and rule systems • Association rule systems • Link Analysis • …

Exception/deviation detection • Generate a model of normal activity • Deviation from model causes alert • Methods: • Artificial neural networks • Inductive decision tree and rule systems • Statistical methods • Visualization tools
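
A very simple statistical instance of this scheme models normal activity by its mean and standard deviation and raises an alert when a new observation lies too far from the mean. The three-sigma threshold and the call durations below are assumptions for the sake of the sketch.

```python
from statistics import mean, stdev

def build_normal_model(history):
    """Model normal activity as the mean and sample standard deviation."""
    return mean(history), stdev(history)

def is_deviation(value, model, threshold=3.0):
    """Alert when value lies more than threshold standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > threshold * sigma

# Hypothetical history of call durations (minutes) for one account.
history = [4.0, 5.5, 3.8, 6.1, 4.9, 5.2, 4.4, 5.8]
model = build_normal_model(history)
alerts = [v for v in [5.0, 42.0, 4.7] if is_deviation(v, model)]
```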

Outlier and exception data analysis • Time-series analysis (trend and deviation): • regression, sequential patterns, similar sequences, e.g., stock analysis. • Similarity-based pattern-directed analysis • Full vs. partial periodicity analysis • Other pattern-directed or statistical analysis

The KDD process [Figure repeated: Data Consolidation and Warehousing (Warehouse), Selection and Preprocessing, Data Mining (p(x)=0.02), Interpretation and Evaluation, Knowledge]

Are all the discovered patterns interesting? • A data mining system/query may generate thousands of patterns; not all of them are interesting. • Interestingness measures: • easily understood by humans • valid on new or test data with some degree of certainty • potentially useful • novel, or validates some hypothesis that a user seeks to confirm • Objective vs. subjective interestingness measures • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. • Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.
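
The two objective measures named above, support and confidence, can be computed directly from a set of transactions. The sketch below uses their usual definitions for association rules; the baskets are made-up data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of consequent given antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 baskets with bread
```

A rule such as bread => milk would then be reported only if both values exceed user-chosen minimum thresholds.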

Completeness vs. optimization • Find all the interesting patterns: Completeness. • Can a data mining system find all the interesting patterns? • Search for only interesting patterns: Optimization. • Can a data mining system find only the interesting patterns? • Approaches • First generate all the patterns and then filter out the uninteresting ones. • Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation • Evaluation: • Statistical validation and significance testing • Qualitative review by experts in the field • Pilot surveys to evaluate model accuracy • Interpretation: • Inductive tree and rule models can be read directly • Clustering results can be graphed and tabulated • Code can be automatically generated by some systems (IDTs, Regression models)

Interpretation and evaluation • Visualization tools can be very helpful • sensitivity analysis (I/O relationship) • histograms of value distribution • time-series plots and animation • requires training and practice [Figure: plot of Response against Temp and Velocity]
