Integrating Data Mining with Databases Overview

Integrating Mining With Databases Nikolay Kojuharov Lauren Massa-Lochridge Hoa Nguyen Quoc Le

Roadmap • Motivation • Approaches • Standards • Optimizing Mining Predicates • Q&A

Current Status • Business drown in data disparate data sources and turn into actionable information. • Data mining: Used to be “run-and-done” process. But not any more. • Ability to feed the results of the data mining models right back into the analysis process. • … Or in integration phase. (e.g. when customers input profiles). • “Data mining becomes an integral part of the entire development lifecycle of a business intelligence (BI) application, data integration, data analysis, and reporting” – The Promise of Data Mining.

Why integrate DM & RDBMS? • Pros: • Leverage decades of experience and improvements • Robustness, portability, scalability, concurrency • Indexing, logging, recover, space management • Cons: • Worse performance • Limited SQL capabilities • Hard to express the variety of models and data types

Data Mining and RDBMS • Not integrated • Separate data warehouse, or files • Loosely integrated • Data is extracted from DB is processed separately; results are stored back • Data mining runs in a separate process • Tightly integrated • Stored procedures

Data Mining Implementation • User-defined functions (UDF) • Similar to stored procedures, but less overhead • Doesn’t utilize DBMS query capabilities • SQL language extensions • Express logic as a SQL program • Optimized by DBMS

XML Standards • PMML (’98):Application can produce/consume models. • Data Dictionary: Input attributes (values/range). • Mining Schema: Schema attributes (specific to a certain model) and their usage (active, predicted, supplementary). • Transformation Dictionary: Normalization, Discretization, Value mapping, aggregation. • Model Statistics: Various statistical information. • Model Parameters: specified by tags: regression, clustering, classification, decision tree. • XML/A: set of XML Message Interfaces that use SOAP to provide API for client application and analytical data provider (OLAP and data mining) working over the Internet.

XML Examples • <Neuron id=”10”> <Con from=”0” weight=”2.08148”/>

Standard APIs • SQL/MM Part 6: Data Mining: SQL interface to data mining applications and services.

ISO’s SQL/MM (Example) Purpose: Returns the number of rules contained in the DM_content of a value of DM_RuleModel. Definition CREATE METHOD DM_getNORules () RETURNS INTEGER FOR DM_RuleModel BEGIN -- -- !! See Description -- END Description • The result of the invocation DM_getNORules () is determined as follows: • Case: a) If SELF or SELF.DM_content is the null value, then the null value. b) Otherwise, the number of rules contained in DM_content.

Standard APIs • SQL/MM Part 6: Data Mining: SQL interface to data mining applications and services. • Java DM: JSR-73: Java and Web interface (now at JDM 2.0 & JSR-247).

Java Data Mining (Example) // Retrieve the model to get the leaf clusters and their details ClusteringModel customerSeg = (ClusteringModel) dmeConn.retrieveObject(“customerSegments”); Collection segments = customerSeg.getLeafClusters(); Iterator segmentsIterator = segments.iterator(); while( segmentsIterator.hasNext() ) { Cluster segment = (Cluster) segmentsIterator.next(); Predicate splitPredicate = segment.getSplitPredicate(); long segmentSize = segment.getCaseCount(); double support = segment.getSupport(); AttributeStatisticsSet attrStats = segment.getStatistics(); }

Standard APIs • SQL/MM Part 6: Data Mining: SQL interface to data mining applications and services. • Java DM: JSR-73: Java and Web interface (now at JDM 2.0 & JSR-247). • Microsoft OLE DB for DM: new capabilities including data taxonomies and transformation  Microsoft Analysis Services in Microsoft SQL Server 2000. (2005).

Other Standards • Process standard: (’97) CRoss Industry Standard Process for Data Mining (CRISP-DM): business problems  captures and understands data  applies DM techniques  interprets results  deploys gained knowledge. • Data mining metadata: OMG Common Warehouse Metadata (CWM) for Data Mining: (2000) metadata interchange among data warehousing.

Microsoft Analysis Services (OLEDB for DM) • Create a data mining module object • CREATE MINING MODEL [model_name] • Insert training data into the model and train it • INSERT INTO [model_name] • Use the data mining model • SELECT relation_name.[id], [model_name].[predict_attr] • consult DMM content in order to make predictions and browse statistics obtained by the model • Using DELETE to empty/reset • Predictions on datasets: PREDICTION JOIN between a model and a data set (tables) • Deploy DMM by just writing SQL queries!

IBM Universal Database • Create model in IBM’s Intelligent Miner (or other) • Export model as PMML and use through a set of UDFs as recommended by SQL/MM

Microsoft SQL Server 2005 • Integrated across all the SQL Server products, including SQL Server, SQL Server Integration Services, and Analysis Services • Ease of use: Simple yet rich API • Scalability: eliminates the sampling challenge by allowing the models to run against the entire data set. • Data Integration, Data Analysis and Report. • Extensibility

Example • SQL Query: Identify the top 25 [potential] churning customers. SELECT TOP 25 t.CustomerID FROM CustomerChurnModel NATURAL PREDICTION JOIN OPENQUERY('CustomerDataSource', 'SELECT * FROM Customers') ORDER BY PredictProbability([Churned],True) DESC • Supporting Models: Decision Trees, Association Rules, Naïve Bayes, Sequence Clustering, Time Series, Neural Nets, Text Mining + Extensibility. • http://www.microsoft.com/sql/evaluation/compare/OLAP_comparison.doc

Efficient Evaluation of Queries with Mining Predicates Sunita Sarawagi Indian Institute of Technology, Bombay Surajit Chaudhuri Vivek Narasayya Microsoft Research Most of slides comes from Surajit Chaudhuri and Sunita Sarawagi

Motivation • Relational data warehouses are common • Potentially lots of valuable information • Organizations seek business value from information • E.g., Mail campaign, Online targeted ads • Data mining capable of predictive models • E.g., Clustering, Classification

State of the Art • Most commercial databases support • Creation/Import of mining models • Application of mining model on relational data • E.g., IBM’s Intelligent Miner, Microsoft Analysis Server • Extensions to SQL • However… • Weak integration with database systems • Inefficient on large databases

Data To Predict Mining Model Training Data Create Mining Model Trained Mining Model Predicted Data Phases in Data Mining DM Engine DM Engine

Yes No Maybe Maybe No No Yes Maybe Computer Other MAC PC RAM RAM No high low high normal No Maybe Yes Maybe Example: Decision Trees Training Data: Will Buy Product X? computer Online RAM income 45K PC normal AOL 50K MAC low AOL 45K PC high MSN 60K Other high ATT 45K Other low ? 48K MAC medium MSN 70K PC normal AOL 67K Other high ATT

Create mining model Name of model CREATE MINING MODEL [WillBuy Prediction] ( [RAM] TEXT DISCRETE ATTRIBUTE, [Computer] TEXT DISCRETE ATTRIBUTE, [Will Buy Product X] BOOL DISCRETE ATTRIBUTE PREDICT, ) USING [Microsoft Decision Tree] Name of algorithm

Training a DMM • Training a DMM requires passing it “known” cases • Cases encapsulate all we know about each entity • Use an INSERT INTO in order to “insert” the data to the DMM • The DMM will usually not retain the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model) • INSERT [INTO] <mining model name> [(columns list)] <source data query>

INSERT INTO INSERT INTO [Age Prediction] ( [RAM],[Computer], [Will Buy Product X] ) OPENQUERY([Provider=MSOLESQL…, ‘SELECT [RAM], [Computer], [Will Buy Product X] FROM [Customers]’ )

Mining Predicate Example of Mining Query • E.g., on Microsoft Analysis Server -- OLE-DB for Data Mining • Find customers who are predicted to buy Product X SELECT T.CustomerID FROM Customer_Classifier M PREDICTION JOIN (SELECT CustomerID, Income, Online Computer, RAM FROM Customers) as T ON M.Income = T.Income and M.Online = T.Online and M.Computer = T.Computer and M.RAM = T.RAM WHERE M.WillBuyX = ‘Yes’

Problem • Processing SQL queries that apply mining models is inefficient • Mining model is treated as black box • => Evaluating mining predicate can be expensive • Cannot exploit standard physical design “goodies” • E.g., indexes

Computer Other MAC PC RAM RAM No high low high normal No Maybe Yes Maybe Idea: Upper-Envelopes • U = (Computer=‘PC’ AND RAM = ‘Normal’) is upper-envelope for P = (M.WillBuyX = ‘Yes’) • U => P • U can be exploited for access path selection • E.g. indexes on Computer, RAM columns

Train Mining Model System Catalog Upper-Envelopes Mining Query Execution Plan Query Optimizer Optimizing Mining Queries Apply Transformation Rules Include Upper-Envelopes

Caveats • Need access to content of mining model during optimization • Invalidate cached plans if mining model changes • Upper-envelopes useful only if optimizer uses them for access path selection • Assume inclusion does not increase execution cost • Relies on “well-behaved” optimizer

Contributions • Technique to improve efficiency of SQL queries with mining predicates • Via inclusion of upper-envelope predicates • Automatically infer such predicates for some discrete predictive models • Decision Trees, Naïve Bayes Classifiers, Clustering • Demonstrate promise of technique on several real databases • Implementation on Microsoft SQL Server

Roadmap • Deriving Upper-Envelopes • Naïve Bayes • Clustering • Handling Broader Class of Queries • Experimental Results • Related Work

Upper-Envelopes • Upper-Envelope for class c (Mc) • Is an expression of simple selection predicates on attribute values of table • A record belongs to class c only if Mc is true • Can be exploited for access path selection • Goal: Find tight upper-envelope • Few false positives

Pr(M12,C3) 0.5*0.05 * 0.3 = 0.0075 C1 C2 0.29*0.4 * 0.5 = 0.058 C3 0.5*0.05 * 0.3 = 0.004 Naïve Bayes Classifier: Example Consider Tuple (M12, M24)

Naïve Bayes Classifier • Assumes independence between dimensions • Predicted class of a tuple x is K = number of classes (1  k  K) n = number of dimensions

Finding Upper-Envelope of a Class • An obvious bottom-up approach • For each possible combination of dimension members, find “winning” class • Collect groups of contiguous combinations as predicates

Example: Bottom-up approach Upper-Envelope(C1) = ((D1 IN {M12,M13}) AND (D2 IN {M21,M22})) Upper-Envelope(C2): ((D1 = M11) OR ((D1=M12) AND (D2 IN{M23,M24}))) Limitation: Exponential in the number of dimensions

An Efficient Top-Down Algorithm • Intuition • Exploit upper and lower bounds on probabilities of class • Prune out large number of combinations without explicit enumeration • Upper/Lower bound for class C • Product of maximum/minimum probability for C along each dimension

Steps in Top-Down Algorithm Find-U(R, C1) “Must-Lose” Yes No part of R in U(C1) MaxP(C1) < MinP(Cj) No “Must-Win” Yes All of R in U(C1) MinP(C1) > MaxP(Cj) “Ambiguous” No Split R into R1 and R2, Find-U(R1,C1) and Find-U(R2,C1)

Example: Finding Upper-Envelope for Class C1 “Ambiguous” “Ambiguous” “Must-Lose” “Ambiguous” Add to U(C1) “Must-Win” “Must-Lose” “Must-Lose”

Clustering • Distance-based clustering • Form is analogous to Naïve Bayes. Hence similar algorithm can be used.

Handling Broader Class of Mining Queries • M.PredColumn IN (c1, c2, …) • Disjunction of individual upper-envelopes • Join between predicted column and data column • E.g., M1.PredColumn = T.DataColumn • Useful in cross-validation tasks • Join between two predicted columns • E.g., M1.PredColumn = M2.PredColumn • Useful when we need to see if different models agree

Experimental Evaluation • Databases • 9 UCI data sets, KDD dataset • Stored on Microsoft SQL Server • Database sizes: • Between 1-2 million rows each • Mining Models • Decision Tree using Microsoft Analysis Server • Naïve Bayes, Clustering from MLC++ machine learning library

Methodology • Goal: Evaluate potential benefit of including upper-envelope predicates • Measure impact on plan/running time • For each class/cluster • Compute upper-envelope predicate <p> • Generate query: SELECT * FROM T WHERE <p> • Tune physical design for above workload • Use Tuning Wizard for Microsoft SQL Server 2000 • Compare running time with: SELECT * FROM T

Reduction in Running Time Average over all databases

Impact on Plan: Decision Tree

Impact on Plan: Naïve Bayes

Integrating Data Mining with Databases Overview

Integrating Data Mining with Databases Overview

Presentation Transcript

Mining Multiple Private Databases

Efficiently Mining Long Patterns from Databases

Integrating Databases with the Web Using FileMaker Pro 5.0

Integrating Resources: Databases and Web Sites

Mining Association Rules in Large Databases

Data Mining with DDView+ and the PDF-4 Databases

Data Mining with DDView+ and the PDF-4 Databases

Integrating Databases with XML

Mining Multidimensional Databases

Data Mining with DDView+ and the PDF-4 Databases

Mining Several Databases with an Ensemble of Classifiers

Mining Time-Series Databases

Mining Frequent Itemsets over Uncertain Databases

Integrating Web Applications With Databases

Integrating Private Databases for Data Analysis

Integrating Perl and Databases

Mining Multidimensional Databases

Integrating Genomic Databases