slide1
Download
Skip this Video
Download Presentation
欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会

Loading in 2 Seconds...

play fullscreen
1 / 54

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会 - PowerPoint PPT Presentation


  • 221 Views
  • Uploaded on

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会. 今日安排. 微软 SQL 数据挖掘技术概述 左洪 微软公司 数据仓库在电信的应用 贝志城 明天高科 数据挖掘在 CRM 中的应用 王立军 中圣公司 灵通 IT Service 维护管理服务系统 邹雄文 广州灵通. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司. Agenda. What is Data Mining The Data Mining Market

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about '欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会' - neola


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
slide2
今日安排
  • 微软SQL数据挖掘技术概述
        • 左洪 微软公司
  • 数据仓库在电信的应用
        • 贝志城 明天高科
  • 数据挖掘在CRM中的应用
        • 王立军 中圣公司
  • 灵通IT Service维护管理服务系统
          • 邹雄文 广州灵通
slide3
Introduction to Data Mining with SQL Server 2000

左洪 高级产品市场经理

微软(中国)有限公司

agenda
Agenda
  • What is Data Mining
  • The Data Mining Market
  • OLE DB for Data Mining
  • Overview of the Data Mining Features in SQL Server 2000
  • Q&A
what is dm
What is DM?
  • A process of data exploration and analysis using automatic or semi-automatic means
    • “Exploring data” – scanning samples of known facts about “cases”.
    • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules…
  • Once the “knowledge” is extracted it:
    • Can be browsed
      • Provides a very useful insight on the cases behavior
    • Can be used to predict values of other cases
      • Can serve as a key element in closed loop analysis
the deciding factors for high school students to attend college are
IQ=High

IQ=Low

Attend College:

79% Yes

11% No

Attend College:

45% Yes

55% No

Wealth = False

Parents

Encourage = Yes

Parents

Encourage = No

Wealth = True

Attend College:

70% Yes

30% No

Attend College:

31% Yes

69% No

Attend College:

94% Yes

6% No

Attend College:

69% Yes

21% No

The deciding factors for high school students to attend college are…

All Students

Attend College:

55% Yes

45% No

IQ ?

Wealth

Parents

Encourage?

business oriented dm problems
Business Oriented DM Problems
  • Targeted ads
    • “What banner should I display to this visitor?”
  • Cross sells
    • “What other products is this customer likely to buy?
  • Fraud detection
    • “Is this insurance claim a fraud?”
  • Churn analysis
    • “Who are those customers likely to churn?”
  • Risk Management
    • “Should I approve the loan to this customer?”
mining process illustrated
Data

To Predict

Training Data

Mining Model

Mining Model

Mining Model

Predicted Data

Mining Process - Illustrated

DM

Engine

DM

Engine

the y2000 market size
The $$$: Y2000 Market Size
  • DM Tools Market: $250M
    • 40% - license fees
    • 60% consulting

* Gartner

the players
The Players
  • Leading vendors
    • SAS
    • SPSS
    • IBM
    • Hundreds of smaller vendors offering DM algorithms…
  • Oracle –Thinking Machines acquisition
the products
The Products
  • End-to-end Data Mining tools
    • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting….
  • The customer is the power-analyst
    • PhD in statistics is usually required…
  • Closed tools – no standard API
    • Total vendor lock-in
    • Limited integration with applications
  • DM an “outsider” in the Data Warehouse
  • Extensive consulting required
  • Sky rocketing prices
    • $60K+ for a single user license
what the analysts say
What the analysts say…
  • “Stand-alone Data Mining Is Dead” - Forrester
  • “The demise of [stand alone] data mining” – Gartner
datapro users survey 1999 2001
DataPro Users Survey 1999-2001

“Data mining will be the fastest-growing BI technology…”

the 2000 market size
The $$$: 2000 Market Size
  • DM Applications Market Size: $1.5B

* IDC

sql server 2000 the analysis platform
SQL Server 2000 - The Analysis Platform
  • SQL 2000 provides a complete Analysis Platform
    • Not an isolated, stand alone DM product
  • Platform means:
    • The infrastructure for applications
      • Not an application by itself
    • Integrated vision for all technologies, tools
    • Standard based API’s (OLE DB for DM)
    • Extensible
    • Scaleable
data flow
Reports

&

Analysis

Data Flow

DM

Apps

DM

OLTP

DW

OLAP

analysis services 2000 architecture
DMAnalysis Services 2000 - Architecture

Manager UI

DM Wizards

DM DTS Task

DSO

DMM

OLE DB

OLAP

Client

Analysis Server

OLAP

Engine

(local)

DM

Engine

(local)

OLAP

Engine

DM

Engine

Ext.

Ext.

why ole db for dm
Why OLE DB for DM?

Make DM a mass market technology by:

  • Leverage existing technologies and knowledge
    • SQL and OLE DB
  • Common industry wide concepts and data presentation
  • Changing DM market perception from “proprietary” to “open”
  • Increasing the number of players:
    • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers
    • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering
    • Dramatically increase the number of DM developers
integration with rdbms
Integration With RDBMS
  • Customers would like to
    • Build DM models from within their RDBMS
    • Train the models directly off their relational tables
    • Perform predictions as relational queries (tables in, tables out)
    • Feel that DM is a native part of their database.
  • Therefore…
    • Data mining models are relational objects
    • All operations on the models are relational
    • The language used is SQL (w/Extensions)
  • The effect: every DBA and VB developer can become a DM developer
identifying the cases
Identifying the “Cases”
  • DM algorithms analyze “cases”
  • The “case” is the entity being categorized and classified
  • Examples
    • Customer credit risk analysis: Case = Customer
    • Product profitability analysis: Case = Product
    • Promotion success analysis: Case = Promotion
  • Each case encapsulate all we know about the entity
a dmm is a table
A DMM is a Table!
  • A DMM structure is defined as a table
    • Training a DMM means inserting data into the table
    • Predicting from a DMM means querying the table
  • All information describing the case are contained in columns
creating a mining model
Creating a Mining Model

CREATE MINING MODEL [Plans Prediction]

(

StudentID LONG KEY,

Gender TEXT DISCRETE,

ParentIncome LONG CONTINUOUS,

IQ DOUBLE CONTINUOUS,

Encouragement TEXT DISCRETE,

CollegePlans TEXT DISCRETE PREDICT

)

USING Microsoft_Decision_Trees

creating a mining model with nested table
Creating a mining model with nested table

Create Mining Model MoviePrediction

(

CutomerId long key,

Age long continuous,

Gender discrete,

Education discrete,

MovieList table predict (

MovieName text key

)

)

using microsoft_decision_trees

training a dmm1
Training a DMM
  • Training a DMM means passing it data for which the attributes to be predicted are known
    • Multiple passes are handled internally by the provider!
  • Use an INSERT INTO statement
  • The DMM will not persist the inserted data
  • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)

INSERT [INTO]

[(columns list)]

insert into
INSERT INTO

INSERT INTO [Plans Prediction]

(

StudentID, Gender, ParentIncome, IQ,

Encouragement, CollegePlans

)

SELECT

[StudentID], [Gender],

[ParentIncome], [IQ],

[Encouragement], [CollegePlans]

FROM [CollegePlans]

when insert into is done
When Insert Into Is Done…
  • The DMM is trained
    • The model can be retrained
    • Content (rules, trees, formulas) can be explored
    • OLE DB Schema rowset
    • SELECT * FROM .CONTENT
    • XML string (PMML)
  • Prediction queries can be executed
what are predictions
What are Predictions?
  • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
  • Predictions = queries
    • The syntax is SQL - like
    • The output is a rowset
  • In order to predict you need:
    • Input data set
    • A trained DMM
    • Binding (mapping) information between the input data and the DMM
    • Specification of what to predict
prediction
Prediction

It’s a JOIN!

the prediction query syntax
The Prediction Query Syntax

SELECT

FROM

PREDICTION JOIN

ON =

example
Example

SELECT [New Students].[StudentID],

[Plans Prediction].[CollegePlans],

PredictProbability([CollegePlans])

FROM

[Plans Prediction] PREDICTION JOIN

[New Students]

ON [Plans Prediction].[Gender] =

[New Students].[Gender] AND

[Plans Prediction].[IQ] =

[New Students].[IQ] AND ...

ole db dm sample provider with source
OLE DB DM Sample Provider with Source
  • All required OLE DB objects, such as session, command, and rowset
  • The OLE DB for Data Mining syntax parser
  • Tokenization of input data
  • Query processing engine
  • A sample Naïve Bayes algorithm
  • Model persistence in XML and binary formats
  • Available at www.microsoft.com/data/oledb/DMResKit.htm
why use dm with olap
Why Use DM with OLAP
  • Relational DM is designed for:
    • Reports of patterns
    • Batch predictions fed into an OLTP system
    • Real-time singleton prediction in an operational environment
  • OLAP is designed for
    • interactive analysis by a knowledge worker
    • Consistent and convenient navigational model
    • Pre-aggregations of OLAP allow faster performance
understanding dm content decision trees
Understanding DM Content – Decision Trees

Customers having high debt and college education:

Filter([Individual Customers].Members,

Customers.CurrentMember.Properties(“Debt”) = “High”

And Customers.CurrentMember.Properties(“Education”) = “College”)

All Customers

Credit Risk:

65% Good

35% Bad

Debt ?

Customers having low debt and are self employed:

Filter([Individual Customers].Members,

Customers.CurrentMember.Properties(“Debt”) = Low

And Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”)

Debt=Low

Debt=High

Credit Risk:

89% Good

11% Bad

Credit Risk:

45% Good

55% Bad

Employ-

-ment Type?

Education?

ET = Self

Employed

Education=

College

Education=

High School

ET = Salaried

Credit Risk:

70% Good

30% Bad

Credit Risk:

31% Good

69% Bad

Credit Risk:

94% Good

6% Bad

Credit Risk:

79% Good

21% Bad

equivalent dm dimension
…Equivalent DM Dimension

All Customers

Customers with low debt

Customers with low debt and self employed

Customers with low debt and salaried

Customers with high debt

Customers with high debt and college education

Customers with high debt and high school education

tree dimension
Tree = Dimension
  • Every node on the tree is a dimension member
  • The node statistics are the member properties
  • All members are calculated
    • Formula aggregates the case dimension members that apply to this node
    • The MDX is generated by the DM algorithm
  • Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube
  • Applies to
    • Classification (decision trees)
    • Segmentation (clusters)
browsing the virtual cube
Browsing the Virtual Cube
  • Pivot the DM dimension:

Credit Risk: 70% Good, 30% Bad

predictions1
Predictions
  • You might want to view predictions for each case
  • For example:
    • What is the expected profitability of a product?
    • What is the credit risk of a specific customer?
    • What are the products this customer is likely to buy?
  • All of those predictions are available through MDX calculated members
  • Singleton query is created automatically
prediction calculated member
Prediction Calculated Member

Measures.[Probability of High Credit Risk]:

PREDICT(Customers.CurrentMember,

“Credit Risk Model”,

“PredictionProbability(

PredictionHistogram(“Credit Risk”),

‘High’)“

)

questions e mail billzuo@microsoft com

Questions ?E-Mail: [email protected]

http://www.microsoft.com/china/sql

ad