欢迎光临
Download
1 / 54

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会 - PowerPoint PPT Presentation


  • 219 Views
  • Uploaded on

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会. 今日安排. 微软 SQL 数据挖掘技术概述 左洪 微软公司 数据仓库在电信的应用 贝志城 明天高科 数据挖掘在 CRM 中的应用 王立军 中圣公司 灵通 IT Service 维护管理服务系统 邹雄文 广州灵通. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司. Agenda. What is Data Mining The Data Mining Market

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' 欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会' - neola


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
欢迎光临微软SQL数据挖掘/数据仓库技术研讨会


今日安排

  • 微软SQL数据挖掘技术概述

    • 左洪 微软公司

  • 数据仓库在电信的应用

    • 贝志城 明天高科

  • 数据挖掘在CRM中的应用

    • 王立军 中圣公司

  • 灵通IT Service维护管理服务系统

    • 邹雄文 广州灵通


  • Introduction to Data Mining with SQL Server 2000

    左洪 高级产品市场经理

    微软(中国)有限公司


    Agenda
    Agenda

    • What is Data Mining

    • The Data Mining Market

    • OLE DB for Data Mining

    • Overview of the Data Mining Features in SQL Server 2000

    • Q&A



    What is dm
    What is DM?

    • A process of data exploration and analysis using automatic or semi-automatic means

      • “Exploring data” – scanning samples of known facts about “cases”.

      • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules…

    • Once the “knowledge” is extracted it:

      • Can be browsed

        • Provides a very useful insight on the cases behavior

      • Can be used to predict values of other cases

        • Can serve as a key element in closed loop analysis



    The deciding factors for high school students to attend college are

    IQ=High

    IQ=Low

    Attend College:

    79% Yes

    11% No

    Attend College:

    45% Yes

    55% No

    Wealth = False

    Parents

    Encourage = Yes

    Parents

    Encourage = No

    Wealth = True

    Attend College:

    70% Yes

    30% No

    Attend College:

    31% Yes

    69% No

    Attend College:

    94% Yes

    6% No

    Attend College:

    69% Yes

    21% No

    The deciding factors for high school students to attend college are…

    All Students

    Attend College:

    55% Yes

    45% No

    IQ ?

    Wealth

    Parents

    Encourage?


    Business oriented dm problems
    Business Oriented DM Problems

    • Targeted ads

      • “What banner should I display to this visitor?”

    • Cross sells

      • “What other products is this customer likely to buy?

    • Fraud detection

      • “Is this insurance claim a fraud?”

    • Churn analysis

      • “Who are those customers likely to churn?”

    • Risk Management

      • “Should I approve the loan to this customer?”



    Mining process illustrated

    Data

    To Predict

    Training Data

    Mining Model

    Mining Model

    Mining Model

    Predicted Data

    Mining Process - Illustrated

    DM

    Engine

    DM

    Engine



    The y2000 market size
    The $$$: Y2000 Market Size

    • DM Tools Market: $250M

      • 40% - license fees

      • 60% consulting

    * Gartner


    The players
    The Players

    • Leading vendors

      • SAS

      • SPSS

      • IBM

      • Hundreds of smaller vendors offering DM algorithms…

    • Oracle –Thinking Machines acquisition


    The products
    The Products

    • End-to-end Data Mining tools

      • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting….

    • The customer is the power-analyst

      • PhD in statistics is usually required…

    • Closed tools – no standard API

      • Total vendor lock-in

      • Limited integration with applications

    • DM an “outsider” in the Data Warehouse

    • Extensive consulting required

    • Sky rocketing prices

      • $60K+ for a single user license


    What the analysts say
    What the analysts say…

    • “Stand-alone Data Mining Is Dead” - Forrester

    • “The demise of [stand alone] data mining” – Gartner



    Datapro users survey 1999 2001
    DataPro Users Survey 1999-2001

    “Data mining will be the fastest-growing BI technology…”


    The 2000 market size
    The $$$: 2000 Market Size

    • DM Applications Market Size: $1.5B

    * IDC


    Sql server 2000 the analysis platform
    SQL Server 2000 - The Analysis Platform

    • SQL 2000 provides a complete Analysis Platform

      • Not an isolated, stand alone DM product

    • Platform means:

      • The infrastructure for applications

        • Not an application by itself

      • Integrated vision for all technologies, tools

      • Standard based API’s (OLE DB for DM)

      • Extensible

      • Scaleable


    Data flow

    Reports

    &

    Analysis

    Data Flow

    DM

    Apps

    DM

    OLTP

    DW

    OLAP


    Analysis services 2000 architecture

    DM

    Analysis Services 2000 - Architecture

    Manager UI

    DM Wizards

    DM DTS Task

    DSO

    DMM

    OLE DB

    OLAP

    Client

    Analysis Server

    OLAP

    Engine

    (local)

    DM

    Engine

    (local)

    OLAP

    Engine

    DM

    Engine

    Ext.

    Ext.



    Why ole db for dm
    Why OLE DB for DM?

    Make DM a mass market technology by:

    • Leverage existing technologies and knowledge

      • SQL and OLE DB

    • Common industry wide concepts and data presentation

    • Changing DM market perception from “proprietary” to “open”

    • Increasing the number of players:

      • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers

      • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering

      • Dramatically increase the number of DM developers


    Integration with rdbms
    Integration With RDBMS

    • Customers would like to

      • Build DM models from within their RDBMS

      • Train the models directly off their relational tables

      • Perform predictions as relational queries (tables in, tables out)

      • Feel that DM is a native part of their database.

    • Therefore…

      • Data mining models are relational objects

      • All operations on the models are relational

      • The language used is SQL (w/Extensions)

    • The effect: every DBA and VB developer can become a DM developer



    Identifying the cases
    Identifying the “Cases”

    • DM algorithms analyze “cases”

    • The “case” is the entity being categorized and classified

    • Examples

      • Customer credit risk analysis: Case = Customer

      • Product profitability analysis: Case = Product

      • Promotion success analysis: Case = Promotion

    • Each case encapsulate all we know about the entity




    A dmm is a table
    A DMM is a Table!

    • A DMM structure is defined as a table

      • Training a DMM means inserting data into the table

      • Predicting from a DMM means querying the table

    • All information describing the case are contained in columns


    Creating a mining model
    Creating a Mining Model

    CREATE MINING MODEL [Plans Prediction]

    (

    StudentID LONG KEY,

    Gender TEXT DISCRETE,

    ParentIncome LONG CONTINUOUS,

    IQ DOUBLE CONTINUOUS,

    Encouragement TEXT DISCRETE,

    CollegePlans TEXT DISCRETE PREDICT

    )

    USING Microsoft_Decision_Trees


    Creating a mining model with nested table
    Creating a mining model with nested table

    Create Mining Model MoviePrediction

    (

    CutomerId long key,

    Age long continuous,

    Gender discrete,

    Education discrete,

    MovieList table predict (

    MovieName text key

    )

    )

    using microsoft_decision_trees



    Training a dmm1
    Training a DMM

    • Training a DMM means passing it data for which the attributes to be predicted are known

      • Multiple passes are handled internally by the provider!

    • Use an INSERT INTO statement

    • The DMM will not persist the inserted data

    • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

      INSERT [INTO] <mining model name>

      [(columns list)]

      <source data query>


    Insert into
    INSERT INTO

    INSERT INTO [Plans Prediction]

    (

    StudentID, Gender, ParentIncome, IQ,

    Encouragement, CollegePlans

    )

    SELECT

    [StudentID], [Gender],

    [ParentIncome], [IQ],

    [Encouragement], [CollegePlans]

    FROM [CollegePlans]


    When insert into is done
    When Insert Into Is Done…

    • The DMM is trained

      • The model can be retrained

      • Content (rules, trees, formulas) can be explored

      • OLE DB Schema rowset

      • SELECT * FROM <dmm>.CONTENT

      • XML string (PMML)

    • Prediction queries can be executed



    What are predictions
    What are Predictions?

    • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values

    • Predictions = queries

      • The syntax is SQL - like

      • The output is a rowset

    • In order to predict you need:

      • Input data set

      • A trained DMM

      • Binding (mapping) information between the input data and the DMM

      • Specification of what to predict



    Prediction
    Prediction

    It’s a JOIN!


    The prediction query syntax
    The Prediction Query Syntax

    SELECT <columns to return or predict>

    FROM

    <dmm> PREDICTION JOIN

    <input data set>

    ON <dmm column> = <dmm input column>…


    Example
    Example

    SELECT [New Students].[StudentID],

    [Plans Prediction].[CollegePlans],

    PredictProbability([CollegePlans])

    FROM

    [Plans Prediction] PREDICTION JOIN

    [New Students]

    ON [Plans Prediction].[Gender] =

    [New Students].[Gender] AND

    [Plans Prediction].[IQ] =

    [New Students].[IQ] AND ...


    Ole db dm sample provider with source
    OLE DB DM Sample Provider with Source

    • All required OLE DB objects, such as session, command, and rowset

    • The OLE DB for Data Mining syntax parser

    • Tokenization of input data

    • Query processing engine

    • A sample Naïve Bayes algorithm

    • Model persistence in XML and binary formats

    • Available at www.microsoft.com/data/oledb/DMResKit.htm



    Why use dm with olap
    Why Use DM with OLAP

    • Relational DM is designed for:

      • Reports of patterns

      • Batch predictions fed into an OLTP system

      • Real-time singleton prediction in an operational environment

    • OLAP is designed for

      • interactive analysis by a knowledge worker

      • Consistent and convenient navigational model

      • Pre-aggregations of OLAP allow faster performance


    Understanding dm content decision trees
    Understanding DM Content – Decision Trees

    Customers having high debt and college education:

    Filter([Individual Customers].Members,

    Customers.CurrentMember.Properties(“Debt”) = “High”

    And Customers.CurrentMember.Properties(“Education”) = “College”)

    All Customers

    Credit Risk:

    65% Good

    35% Bad

    Debt ?

    Customers having low debt and are self employed:

    Filter([Individual Customers].Members,

    Customers.CurrentMember.Properties(“Debt”) = Low

    And Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”)

    Debt=Low

    Debt=High

    Credit Risk:

    89% Good

    11% Bad

    Credit Risk:

    45% Good

    55% Bad

    Employ-

    -ment Type?

    Education?

    ET = Self

    Employed

    Education=

    College

    Education=

    High School

    ET = Salaried

    Credit Risk:

    70% Good

    30% Bad

    Credit Risk:

    31% Good

    69% Bad

    Credit Risk:

    94% Good

    6% Bad

    Credit Risk:

    79% Good

    21% Bad


    Equivalent dm dimension
    …Equivalent DM Dimension

    All Customers

    Customers with low debt

    Customers with low debt and self employed

    Customers with low debt and salaried

    Customers with high debt

    Customers with high debt and college education

    Customers with high debt and high school education


    Tree dimension
    Tree = Dimension

    • Every node on the tree is a dimension member

    • The node statistics are the member properties

    • All members are calculated

      • Formula aggregates the case dimension members that apply to this node

      • The MDX is generated by the DM algorithm

    • Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube

    • Applies to

      • Classification (decision trees)

      • Segmentation (clusters)


    Browsing the virtual cube
    Browsing the Virtual Cube

    • Pivot the DM dimension:

    Credit Risk: 70% Good, 30% Bad


    Predictions1
    Predictions

    • You might want to view predictions for each case

    • For example:

      • What is the expected profitability of a product?

      • What is the credit risk of a specific customer?

      • What are the products this customer is likely to buy?

    • All of those predictions are available through MDX calculated members

    • Singleton query is created automatically


    Prediction calculated member
    Prediction Calculated Member

    Measures.[Probability of High Credit Risk]:

    PREDICT(Customers.CurrentMember,

    “Credit Risk Model”,

    “PredictionProbability(

    PredictionHistogram(“Credit Risk”),

    ‘High’)“

    )



    Questions e mail billzuo@microsoft com

    Questions ?E-Mail: [email protected]

    http://www.microsoft.com/china/sql


    ad