欢迎光临
This presentation is the property of its rightful owner.
Sponsored Links
1 / 54

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会 PowerPoint PPT Presentation


  • 167 Views
  • Uploaded on
  • Presentation posted in: General

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会. 今日安排. 微软 SQL 数据挖掘技术概述 左洪 微软公司 数据仓库在电信的应用 贝志城 明天高科 数据挖掘在 CRM 中的应用 王立军 中圣公司 灵通 IT Service 维护管理服务系统 邹雄文 广州灵通. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司. Agenda. What is Data Mining The Data Mining Market

Download Presentation

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Sql

欢迎光临微软SQL数据挖掘/数据仓库技术研讨会


Sql

今日安排

  • 微软SQL数据挖掘技术概述

    • 左洪 微软公司

  • 数据仓库在电信的应用

    • 贝志城 明天高科

  • 数据挖掘在CRM中的应用

    • 王立军 中圣公司

  • 灵通IT Service维护管理服务系统

    • 邹雄文 广州灵通


  • Sql

    Introduction to Data Mining with SQL Server 2000

    左洪 高级产品市场经理

    微软(中国)有限公司


    Agenda

    Agenda

    • What is Data Mining

    • The Data Mining Market

    • OLE DB for Data Mining

    • Overview of the Data Mining Features in SQL Server 2000

    • Q&A


    What is data mining

    What Is Data Mining?


    What is dm

    What is DM?

    • A process of data exploration and analysis using automatic or semi-automatic means

      • “Exploring data” – scanning samples of known facts about “cases”.

      • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules…

    • Once the “knowledge” is extracted it:

      • Can be browsed

        • Provides a very useful insight on the cases behavior

      • Can be used to predict values of other cases

        • Can serve as a key element in closed loop analysis


    What drive high school students to attend college

    What drive high school students to attend college?


    The deciding factors for high school students to attend college are

    IQ=High

    IQ=Low

    Attend College:

    79% Yes

    11% No

    Attend College:

    45% Yes

    55% No

    Wealth = False

    Parents

    Encourage = Yes

    Parents

    Encourage = No

    Wealth = True

    Attend College:

    70% Yes

    30% No

    Attend College:

    31% Yes

    69% No

    Attend College:

    94% Yes

    6% No

    Attend College:

    69% Yes

    21% No

    The deciding factors for high school students to attend college are…

    All Students

    Attend College:

    55% Yes

    45% No

    IQ ?

    Wealth

    Parents

    Encourage?


    Business oriented dm problems

    Business Oriented DM Problems

    • Targeted ads

      • “What banner should I display to this visitor?”

    • Cross sells

      • “What other products is this customer likely to buy?

    • Fraud detection

      • “Is this insurance claim a fraud?”

    • Churn analysis

      • “Who are those customers likely to churn?”

    • Risk Management

      • “Should I approve the loan to this customer?”


    Sql

    Http://www.tunes.com


    Mining process illustrated

    Data

    To Predict

    Training Data

    Mining Model

    Mining Model

    Mining Model

    Predicted Data

    Mining Process - Illustrated

    DM

    Engine

    DM

    Engine


    The data mining market

    The Data Mining Market


    The y2000 market size

    The $$$: Y2000 Market Size

    • DM Tools Market: $250M

      • 40% - license fees

      • 60% consulting

    * Gartner


    The players

    The Players

    • Leading vendors

      • SAS

      • SPSS

      • IBM

      • Hundreds of smaller vendors offering DM algorithms…

    • Oracle –Thinking Machines acquisition


    The products

    The Products

    • End-to-end Data Mining tools

      • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting….

    • The customer is the power-analyst

      • PhD in statistics is usually required…

    • Closed tools – no standard API

      • Total vendor lock-in

      • Limited integration with applications

    • DM an “outsider” in the Data Warehouse

    • Extensive consulting required

    • Sky rocketing prices

      • $60K+ for a single user license


    What the analysts say

    What the analysts say…

    • “Stand-alone Data Mining Is Dead” - Forrester

    • “The demise of [stand alone] data mining” – Gartner


    The microsoft approach

    The Microsoft Approach


    Datapro users survey 1999 2001

    DataPro Users Survey 1999-2001

    “Data mining will be the fastest-growing BI technology…”


    The 2000 market size

    The $$$: 2000 Market Size

    • DM Applications Market Size: $1.5B

    * IDC


    Sql server 2000 the analysis platform

    SQL Server 2000 - The Analysis Platform

    • SQL 2000 provides a complete Analysis Platform

      • Not an isolated, stand alone DM product

    • Platform means:

      • The infrastructure for applications

        • Not an application by itself

      • Integrated vision for all technologies, tools

      • Standard based API’s (OLE DB for DM)

      • Extensible

      • Scaleable


    Data flow

    Reports

    &

    Analysis

    Data Flow

    DM

    Apps

    DM

    OLTP

    DW

    OLAP


    Analysis services 2000 architecture

    DM

    Analysis Services 2000 - Architecture

    Manager UI

    DM Wizards

    DM DTS Task

    DSO

    DMM

    OLE DB

    OLAP

    Client

    Analysis Server

    OLAP

    Engine

    (local)

    DM

    Engine

    (local)

    OLAP

    Engine

    DM

    Engine

    Ext.

    Ext.


    Ole db for data mining

    OLE DB for Data Mining…


    Why ole db for dm

    Why OLE DB for DM?

    Make DM a mass market technology by:

    • Leverage existing technologies and knowledge

      • SQL and OLE DB

    • Common industry wide concepts and data presentation

    • Changing DM market perception from “proprietary” to “open”

    • Increasing the number of players:

      • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers

      • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering

      • Dramatically increase the number of DM developers


    Integration with rdbms

    Integration With RDBMS

    • Customers would like to

      • Build DM models from within their RDBMS

      • Train the models directly off their relational tables

      • Perform predictions as relational queries (tables in, tables out)

      • Feel that DM is a native part of their database.

    • Therefore…

      • Data mining models are relational objects

      • All operations on the models are relational

      • The language used is SQL (w/Extensions)

    • The effect: every DBA and VB developer can become a DM developer


    Creating a data mining model dmm

    Creating a Data Mining Model (DMM)


    Identifying the cases

    Identifying the “Cases”

    • DM algorithms analyze “cases”

    • The “case” is the entity being categorized and classified

    • Examples

      • Customer credit risk analysis: Case = Customer

      • Product profitability analysis: Case = Product

      • Promotion success analysis: Case = Promotion

    • Each case encapsulate all we know about the entity


    A simple set of cases

    A Simple Set of Cases


    More complicated cases

    More Complicated Cases


    A dmm is a table

    A DMM is a Table!

    • A DMM structure is defined as a table

      • Training a DMM means inserting data into the table

      • Predicting from a DMM means querying the table

    • All information describing the case are contained in columns


    Creating a mining model

    Creating a Mining Model

    CREATE MINING MODEL [Plans Prediction]

    (

    StudentID LONG KEY,

    Gender TEXT DISCRETE,

    ParentIncome LONG CONTINUOUS,

    IQ DOUBLE CONTINUOUS,

    Encouragement TEXT DISCRETE,

    CollegePlans TEXT DISCRETE PREDICT

    )

    USING Microsoft_Decision_Trees


    Creating a mining model with nested table

    Creating a mining model with nested table

    Create Mining Model MoviePrediction

    (

    CutomerId long key,

    Age long continuous,

    Gender discrete,

    Education discrete,

    MovieList table predict (

    MovieName text key

    )

    )

    using microsoft_decision_trees


    Training a dmm

    Training a DMM


    Training a dmm1

    Training a DMM

    • Training a DMM means passing it data for which the attributes to be predicted are known

      • Multiple passes are handled internally by the provider!

    • Use an INSERT INTO statement

    • The DMM will not persist the inserted data

    • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

      INSERT [INTO] <mining model name>

      [(columns list)]

      <source data query>


    Insert into

    INSERT INTO

    INSERT INTO [Plans Prediction]

    (

    StudentID, Gender, ParentIncome, IQ,

    Encouragement, CollegePlans

    )

    SELECT

    [StudentID], [Gender],

    [ParentIncome], [IQ],

    [Encouragement], [CollegePlans]

    FROM [CollegePlans]


    When insert into is done

    When Insert Into Is Done…

    • The DMM is trained

      • The model can be retrained

      • Content (rules, trees, formulas) can be explored

      • OLE DB Schema rowset

      • SELECT * FROM <dmm>.CONTENT

      • XML string (PMML)

    • Prediction queries can be executed


    Predictions

    Predictions


    What are predictions

    What are Predictions?

    • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values

    • Predictions = queries

      • The syntax is SQL - like

      • The output is a rowset

    • In order to predict you need:

      • Input data set

      • A trained DMM

      • Binding (mapping) information between the input data and the DMM

      • Specification of what to predict


    The truth table concept

    The Truth Table Concept


    Prediction

    Prediction

    It’s a JOIN!


    The prediction query syntax

    The Prediction Query Syntax

    SELECT <columns to return or predict>

    FROM

    <dmm> PREDICTION JOIN

    <input data set>

    ON <dmm column> = <dmm input column>…


    Example

    Example

    SELECT [New Students].[StudentID],

    [Plans Prediction].[CollegePlans],

    PredictProbability([CollegePlans])

    FROM

    [Plans Prediction] PREDICTION JOIN

    [New Students]

    ON [Plans Prediction].[Gender] =

    [New Students].[Gender] AND

    [Plans Prediction].[IQ] =

    [New Students].[IQ] AND ...


    Ole db dm sample provider with source

    OLE DB DM Sample Provider with Source

    • All required OLE DB objects, such as session, command, and rowset

    • The OLE DB for Data Mining syntax parser

    • Tokenization of input data

    • Query processing engine

    • A sample Naïve Bayes algorithm

    • Model persistence in XML and binary formats

    • Available at www.microsoft.com/data/oledb/DMResKit.htm


    Integrated olap and dm analysis

    Integrated OLAP and DM Analysis


    Why use dm with olap

    Why Use DM with OLAP

    • Relational DM is designed for:

      • Reports of patterns

      • Batch predictions fed into an OLTP system

      • Real-time singleton prediction in an operational environment

    • OLAP is designed for

      • interactive analysis by a knowledge worker

      • Consistent and convenient navigational model

      • Pre-aggregations of OLAP allow faster performance


    Understanding dm content decision trees

    Understanding DM Content – Decision Trees

    Customers having high debt and college education:

    Filter([Individual Customers].Members,

    Customers.CurrentMember.Properties(“Debt”) = “High”

    And Customers.CurrentMember.Properties(“Education”) = “College”)

    All Customers

    Credit Risk:

    65% Good

    35% Bad

    Debt ?

    Customers having low debt and are self employed:

    Filter([Individual Customers].Members,

    Customers.CurrentMember.Properties(“Debt”) = Low

    And Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”)

    Debt=Low

    Debt=High

    Credit Risk:

    89% Good

    11% Bad

    Credit Risk:

    45% Good

    55% Bad

    Employ-

    -ment Type?

    Education?

    ET = Self

    Employed

    Education=

    College

    Education=

    High School

    ET = Salaried

    Credit Risk:

    70% Good

    30% Bad

    Credit Risk:

    31% Good

    69% Bad

    Credit Risk:

    94% Good

    6% Bad

    Credit Risk:

    79% Good

    21% Bad


    Equivalent dm dimension

    …Equivalent DM Dimension

    All Customers

    Customers with low debt

    Customers with low debt and self employed

    Customers with low debt and salaried

    Customers with high debt

    Customers with high debt and college education

    Customers with high debt and high school education


    Tree dimension

    Tree = Dimension

    • Every node on the tree is a dimension member

    • The node statistics are the member properties

    • All members are calculated

      • Formula aggregates the case dimension members that apply to this node

      • The MDX is generated by the DM algorithm

    • Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube

    • Applies to

      • Classification (decision trees)

      • Segmentation (clusters)


    Browsing the virtual cube

    Browsing the Virtual Cube

    • Pivot the DM dimension:

    Credit Risk: 70% Good, 30% Bad


    Predictions1

    Predictions

    • You might want to view predictions for each case

    • For example:

      • What is the expected profitability of a product?

      • What is the credit risk of a specific customer?

      • What are the products this customer is likely to buy?

    • All of those predictions are available through MDX calculated members

    • Singleton query is created automatically


    Prediction calculated member

    Prediction Calculated Member

    Measures.[Probability of High Credit Risk]:

    PREDICT(Customers.CurrentMember,

    “Credit Risk Model”,

    “PredictionProbability(

    PredictionHistogram(“Credit Risk”),

    ‘High’)“

    )


    Predictions example

    Predictions Example


    Questions e mail billzuo@microsoft com

    Questions ?E-Mail: [email protected]

    http://www.microsoft.com/china/sql


  • Login