1 / 48

DAT205 Advanced Data Mining Using SQL Server 2000

DAT205 Advanced Data Mining Using SQL Server 2000. ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation. Agenda. Microsoft Data Mining Algorithms OLE DB for DM Data mining query Data Mining Case Study: Click Stream Analysis Customer Segmentation

deanna
Download Presentation

DAT205 Advanced Data Mining Using SQL Server 2000

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DAT205Advanced Data Mining Using SQL Server 2000 ZhaoHui Tang Program Manager SQL Server Analysis Services Microsoft Corporation

  2. Agenda • Microsoft Data Mining Algorithms • OLE DB for DM Data mining query • Data Mining Case Study: Click Stream Analysis • Customer Segmentation • Site affiliation • Target ads in banner • Performance of Microsoft Data Mining Algorithm • Q&A

  3. Data Mining Algorithms in SQL Server 2000

  4. All Students Attend College: 55% Yes 45% No IQ=High IQ < > High Attend College: 79% Yes 21% No Attend College: 35% Yes 65% No Parent Income = Low Parent Income = High Attend College: 69% Yes 31% No Attend College: 94% Yes 6% No Decision Tree • Popular technique for classification, Prediction task • Churn analysis • Credit risk analysis • … • Easy to understand • any path from node to leaf forms a rule • Fast to build • Prediction based on leaf node stats • Variation: C4.5, C5, CART, Chaid

  5. 1000 1800 900 1600 800 1400 700 1200 600 1000 500 800 400 300 600 200 400 Yes 100 200 0 0 IQ=High IQ=Medium IQ=Low PE=TRUE PE=FALSE 1200 1800 1600 1000 1400 800 1200 1000 600 800 No 400 600 400 200 200 0 0 Male Female PI=High PI=FALSE How tree works

  6. All Students College Plan 33% Yes 67% No Parent Encouragement = True Parent Encouragement = False College Plan 63% Yes 37% No College Plan 16% Yes 84% No Split recursively

  7. Microsoft Decision Trees • Probabilistic Classification Tree • Splitting methods: Bayesian score and Entropy • Forward pruning • Tree shape: Binary and Nary tree • Scalable framework

  8. Clustering Algorithm (EM) • A popular method for customer segmentation, mailing list, profiling… • Algorithm process • Assign a set of Initial Points • Assign initial cluster to each points • Assign data points to each cluster with a probability • Computer new central point based on weighted computation • Cycle until convergence

  9. X X X EM Illustration

  10. Microsoft Clustering Algorithm (Scalable EM) Identify Data to be Compressed Build/Update Model Fill Buffer Stop? Data Compressed date  Sufficient stats Final Model

  11. OLE DB for Data Mining

  12. OLE DB for DM • Industry standard for data mining • Based on existing technologies • SQL • OLE DB • Define common concepts for DM • Case, Nested Case • Mining Model • Model Creation • Model Training • Prediction • Language based API

  13. Customer Table

  14. DM Query Language Create Mining Model CreditRisk (CustomerID long key, Gender text discrete, Income long continuous, Profession text discrete, Risk text discrete predict) Using Microsoft_Decision_Trees Insert into CreditRisk (CustomerId, Gender, Income, Profession, Risk) Select CustomerID, Gender, Income, Profession,Risk From Customers Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk) From CreditRisk Prediction Join NewCustomers On CreditRisk.Gender=NewCustomer.Gender And CreditRisk.Income=NewCustomer.Income And CreditRisk.Profession=NewCustomer.Profession

  15. Schema Rowsets • Tabular data to provide meta data information • List of Schema Rowsets in OLE DB for DM • Mining_Services • Mining_Service_Parameters • Mining_Models • Mining_Columns • Mining_Model_Contents • Model_Content_PMML

  16. Mining Model Contents Schema Rowsets

  17. Schema Rowsets & Thin Client Browser

  18. Case Study: Click Stream Analysis

  19. Schema

  20. Web Customer Segmentation

  21. Web Visitors Segmentation

  22. Segmentation based on Customer table Create Mining Model CustomerClustering (CustomerID text key, DayTimeOnline long continuous NightTimeOnline long continuous, BrowserType text discrete, ChatTime long continuous, EmailTime long continuous, GeoLocation text discrete ) Using Microsoft_Clustering

  23. Segmentation based on Customer and WebClick Create Mining Model CustomerClustering (CustomerID text key, DayTimeOnline long continuous, NightTimeOnline long continuous, BrowserType text discrete, ChatTime long continuous, EmailTime long continuous, GeoLocation text discrete WebClick table ( UrlCategory text key ) )Using Microsoft_Clustering

  24. MSFTies Segmentation

  25. Web Site Affiliation

  26. Insurance Business No Loan Loan No Insurance Insurance Loan No Loan Stock No Stock Stock Loan No Business Business No Stock Stock Shopping No Shopping Insurance No Insurance Association analysis using Microsoft Decision Trees

  27. Association analysis using Microsoft Decision Trees Insurance Business No Loan Loan No Insurance Insurance Loan No Loan Stock No Stock Stock Loan No Business Business No Stock Stock Shopping No Shopping Insurance No Insurance

  28. Site Affiliation

  29. Site Affiliation Create Mining Model SiteAffiliation (CustomerID text key, WebClick table predict ( UrlCategory text key ) )Using Microsoft_Decision_Trees Insert into SiteAffiliation (CustomerID,WebClick (skip, UrlCategory)OpenRowset(‘MSDataShape’, 'data provider=SQLOLEDB;Server=myserver;UID=me; PWD=mypass' , 'Shape{Select CustomerID from Customer} Append ( {Select customerid, URLCategory from WebClick } relate CustomerID to CustomerID) as WebClick’ )

  30. Path Prediction

  31. Path Prediction

  32. Singleton Prediction • Select Flattened • Topcount((select URLCategory, $adjustedProbability as prob • From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5) • From • WebLog PREDICTION JOIN (select (select 'Business' as URLCategory) union (select ‘Telecom’ as URLCategory) as WebClick) as input • On • WebLog.[Web Click].URLCategory = input.WebClick.URLCategory

  33. DM Provider Web Customer IIS Internet ASP DMM Architecture ADO/DSO Real Time Prediction

  34. Performance of DM Algorithms

  35. DM Performance Study • Joint effort between Unisys & Microsoft • Two parts of the white paper: • First part: Use AS2k to build DM Models for a banking business scenario • Second Part: Performance results of DM algorithms study • Some results in this session… • Details in the paper and SQL Server magazine articles…

  36. Data Source for DMMs

  37. Training Performance Results…

  38. Sample Business Question for Non Nested MDT

  39. Non Nested: Training Times for varying Number of Input attributes • Assumptions: • 1 mm cases • 25 states • 1 predictable attribute Observations:

  40. Non Nested: Training Times for varying Number of Cases • Assumptions: • 20 attributes • 25 states • 1 predictable attribute Observations:

  41. Sample Business Question for Nested MDT

  42. Nested Cases: Training Times for varying Sample size of Case Table Assumptions: • Avg. customer purchases=25 • States in nested=200 • Nested key predictable Observations:

  43. Nested Cases: Training Times for varying Number of Products purchased per customer Assumptions: • 200000 cases • 1000 products in nested Observations:

  44. For more info… • DM URL • www.microsoft.com/data/oledb • www.microsoft.com/data/oledb/DMResKit.htm • News Group: • Microsoft.public.SQLserver.datamining • Communities.msn.com/AnalysisServicesDataMining • White papers: • Performance paper: www.unisys.com/windows2000/default-07.asp www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp

  45. Don’t forget to complete the on-line Session Feedback form on the Attendee Web sitehttps://web.mseventseurope.com/teched/

More Related