1 / 47

ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

DAT204 Introduction to Data Mining with SQL Server 2000. ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation. Agenda. What is Data Mining The Data Mining Market OLE DB for Data Mining Overview of the Data Mining Features in SQL Server 2000 Demo Q&A.

caesar
Download Presentation

ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DAT204Introduction to Data Mining with SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

  2. Agenda • What is Data Mining • The Data Mining Market • OLE DB for Data Mining • Overview of the Data Mining Features in SQL Server 2000 • Demo • Q&A

  3. What Is Data Mining?

  4. What is DM? • A process of data exploration and analysis using automatic or semi-automatic means • Techniques origin from Machine Learning, statistics and database • “Exploring data” – scanning samples of known facts about “cases”. • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… • Once the “knowledge” is extracted it: • Can be browsed • Provides a very useful insight on the cases behavior • Can be used to predict values of other cases • Can serve as a key element in closed loop analysis

  5. What drives high school students to attend college?

  6. IQ=High IQ=Low Attend College: 79% Yes 21% No Attend College: 45% Yes 55% No Wealth = False Parents Encourage = Yes Parents Encourage = No Wealth = True Attend College: 70% Yes 30% No Attend College: 31% Yes 69% No Attend College: 94% Yes 6% No Attend College: 69% Yes 21% No The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? Wealth Parents Encourage?

  7. Business Oriented DM Problems • Targeted ads • “What banner should I display to this visitor?” • Cross sells • “What other products is this customer likely to buy? • Fraud detection • “Is this insurance claim a fraud?” • Churn analysis • “Who are those customers likely to churn?” • Risk Management • “Should I approve the loan to this customer?” • …

  8. Data To Predict Training Data Mining Model Mining Model Mining Model Predicted Data Mining Process - Illustrated DM Engine DM Engine

  9. The Data Mining Market

  10. The $$$: Market Size • DM Tools Market: • 1999: $341.3M • 2000: $455.1M • 2001: $449.5M * IDC

  11. The Players • Leading vendors • SAS • SPSS • IBM • Angoss • Hundreds of smaller vendors offering DM algorithms… • Oracle –Thinking Machines acquisition

  12. The Products • End-to-end horizontal DM tools • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. • The customer is the power-analyst • PhD in statistics is usually required… • Closed tools – no standard API • Total vendor lock-in • Limited integration with applications • DM an “outsider” in the Data Warehouse • Extensive consulting required • Sky rocketing prices • $60K+ for a single user license

  13. What the analysts say… • “Stand-alone Data Mining Is Dead” - Forrester • “The demise of [stand alone] data mining” – Gartner

  14. The Microsoft Approach

  15. DataPro Users Survey 1999-2001 “Data mining will be the fastest-growing BI technology…”

  16. Market Size of BI * IDC

  17. SQL Server 2000 - The Analysis Platform • SQL 2000 provides a complete Analysis Platform • Not an isolated, stand alone DM product • Platform means: • Standard based DM API’s (OLE DB for DM) for applications development • Integrated vision for all technologies, tools • Extensible • Scaleable

  18. Reports & Analysis Data Flow DM Apps DM OLTP DW OLAP

  19. DM Analysis Services 2000 –Components Tree View Control Manager UI DM Wizards Cluster View Control DM DTS Task Lift Chart Control DSO DMM OLE DB OLAP Sample Query Tool Client Analysis Server OLAP Engine (local) DM Engine (local) OLAP Engine DM Engine

  20. OLE DB for Data Mining…

  21. Why OLE DB for DM? Make DM a mass market technology by: • Leverage existing technologies and knowledge • SQL and OLE DB • Common industry wide concepts and data presentation • Changing DM market perception from “proprietary” to “open” • Increasing the number of players: • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering

  22. Integration With RDBMS • Customers would like to • Build DM models from within their RDBMS • Train the models directly off their relational tables • Perform predictions as relational queries (tables in, tables out) • Feel that DM is a native part of their database. • Therefore… • Data mining models are relational objects • All operations on the models are relational • The language used is SQL (w/Extensions) • The effect: every DBA and VB developer can become a DM developer

  23. Creating a Data Mining Model (DMM)

  24. Identifying the “Cases” • DM algorithms analyze “cases” • The “case” is the entity being categorized and classified • Examples • Customer credit risk analysis: Case = Customer • Product profitability analysis: Case = Product • Promotion success analysis: Case = Promotion • Each case encapsulate all we know about the entity

  25. A Simple Set of Cases

  26. More Complicated Cases

  27. A DMM is a Table! • A DMM structure is defined as a table • Training a DMM means inserting data (pattern) into the table • Predicting from a DMM means querying the table • All information describing the case are contained in columns

  28. Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees

  29. Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees

  30. Training a DMM

  31. Training a DMM • Training a DMM means passing it data for which the attributes to be predicted are known • Multiple passes are handled internally by the provider! • Use an INSERT INTO statement • The DMM will not persist the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query>

  32. INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [Students]

  33. When Insert Into Is Done… • The DMM is trained • The model can be retrained • Content (rules, trees, formulas) can be explored • OLE DB Schema rowset • SELECT * FROM <dmm>.CONTENT • XML string (PMML) • Prediction queries can be executed

  34. Predictions

  35. What are Predictions? • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values • Predictions = queries • The syntax is SQL - like • The output is a rowset • In order to predict you need: • Input data set • A trained DMM • Binding (mapping) information between the input data and the DMM

  36. The Truth Table Concept

  37. Prediction It’s a JOIN!

  38. The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>…

  39. Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ...

  40. Demo

  41. OLE DB for Data Mining Defines API Consumer Consumer … OLE DB for DM (API) Provider Provider Provider … OLE DB Misc. Data Source Cube RDBMS

  42. OLEDB for DM Configuration Options Demo Consumers OLEDB for DM Providers MS Analysis Manager ANGOSS Controls 2 3 1 4 MS DM Provider ANGOSS DM Provider

  43. Demo on OLE DB for DM API using Angoss Controls and Provider

  44. For more info… • DM URL • www.microsoft.com/data/oledb • www.microsoft.com/data/oledb/DMResKit.htm • News Group: • Microsoft.public.SQLserver.datamining • Communities.msn.com/AnalysisServicesDataMining • White papers: • Performance paper: www.unisys.com/windows2000/default-07.asp www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

  45. Questions ?

More Related