
Data Mining for Scientific & Engineering Applications


Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

Chandrika Kamath, Lawrence Livermore National Laboratory

Vipin Kumar, Army High Performance Research Center, University of Minnesota


- What are the four critical interfaces in a data mining system?
- Is data mining about rows or columns?
- What are the standards in data mining?
- What data mining systems are available?

10.1 Overview of Data Mining Systems

10.2 Case Study Using a System

10.3 Managing Data for Data Mining

10.4 Data Mining Standards

10.5 Commercial and Open Source Systems

Following R. L. Grossman, S. Bailey, A. Ramu and B. Malhi, P. Hallstrom, I. Pulleyn and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.

[Figure: first, second, third, and fourth generation data mining systems. Box labels: data mining algorithms, data management, predictive modeling, agents & Internet, mobile data.]

[Figure: two copies of a layered stack (agents, pred. models, data mining, data management), linked by Internet, NGI, and NGI with QoS.]

Move results and metadata: other protocols and services …

- Agents can move metadata around via net
- Warehouse can move data around via NGI

Move models: Predictive Model Markup Language (PMML)

Move data: DSTP, distributed databases, etc.

Phases in the Data Mining & Predictive Modeling Process

[Figure: process diagram. Phases B, C: Warehousing (operational data to data mining mart, via Data Mining Transformations, DXML). Phase D: Data Mining (learning set to PM or rule set, via Data Mining Primitives, DMP). Phase E: Predictive Modeling (Predictive Model Markup Language, PMML). Phase F: Deployment (operational data and PM or rule set produce scores).]

- Data Mining Transformation (DXML)
- Interface between operational data and data mining mart

- Data Mining Primitives (DMP)
- interface between data mining mart and data mining system

- Data Mining Application Interface (DM-API)
- interface between data mining applications and data mining system, DMQL, OLE DB for Data Mining, …

- Predictive Model Markup Language (PMML)
- interface between data mining system and predictive modeling system

- Define the data schema
- Clean and load the data
- Define the mining schema
- Compute derived attributes
- Build the model
- Analyze the model
- Deploy the model

Data Types: int, double, float, date-time, string, etc.

Select data schema.

Select data source: text, database, etc.

Select mining role: dependent, independent, excluded, key, etc.

Select mining type: continuous, ordinal, categorical, binary, etc.

Define petal_length/sepal_length

Select Data Store

Select Mining Schema

Select Parameters

Classification tree.

Select Model

Select Parameters

Analyze how well the model predicted class labels.

Move PMML files to scoring engine.
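The seven case-study steps can be sketched end to end in a few lines. This is a minimal illustration, not the system shown in the slides: the rows are made-up values (not real iris measurements), the class labels "A" and "B" are placeholders, and the one-split tree (`build_stump`) is a stand-in for a full classification tree. Only the derived attribute petal_length/sepal_length and the mining roles come from the slides.

```python
from collections import Counter

# Step 1: data schema. Field names follow the slide; types are assumptions.
DATA_SCHEMA = {"sepal_length": float, "petal_length": float, "species": str}

# Hypothetical raw text rows, made up for illustration only.
RAW_ROWS = [
    ("5.0", "1.5", "A"),
    ("5.2", "1.3", "A"),
    ("",    "1.4", "A"),   # missing value, dropped during cleaning
    ("6.0", "4.5", "B"),
    ("6.4", "4.8", "B"),
    ("5.8", "4.0", "B"),
]

def clean_and_load(raw_rows, schema):
    """Step 2: cast fields to their schema types, drop incomplete rows."""
    names = list(schema)
    rows = []
    for raw in raw_rows:
        if any(value == "" for value in raw):
            continue
        rows.append({n: schema[n](v) for n, v in zip(names, raw)})
    return rows

# Step 3: mining schema, i.e. the role of each attribute in the model.
MINING_SCHEMA = {"ratio": "independent", "species": "dependent",
                 "sepal_length": "excluded", "petal_length": "excluded"}

def add_derived(rows):
    """Step 4: derived attribute from the slide, petal_length / sepal_length."""
    for row in rows:
        row["ratio"] = row["petal_length"] / row["sepal_length"]
    return rows

def build_stump(rows, field, target):
    """Step 5: build a one-split classification tree (a decision stump)."""
    values = sorted({r[field] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = Counter(r[target] for r in rows if r[field] < threshold)
        right = Counter(r[target] for r in rows if r[field] >= threshold)
        errors = (sum(left.values()) - max(left.values())
                  + sum(right.values()) - max(right.values()))
        if best is None or errors < best[0]:
            best = (errors, threshold,
                    left.most_common(1)[0][0], right.most_common(1)[0][0])
    _, threshold, left_label, right_label = best
    return {"field": field, "threshold": threshold,
            "left": left_label, "right": right_label}

def score(model, row):
    """Step 7: apply the deployed model to one record."""
    if row[model["field"]] < model["threshold"]:
        return model["left"]
    return model["right"]

rows = add_derived(clean_and_load(RAW_ROWS, DATA_SCHEMA))
independent = [f for f, role in MINING_SCHEMA.items() if role == "independent"][0]
dependent = [f for f, role in MINING_SCHEMA.items() if role == "dependent"][0]
model = build_stump(rows, independent, dependent)

# Step 6: analyze how well the model predicted class labels.
accuracy = sum(score(model, r) == r[dependent] for r in rows) / len(rows)
print(model, accuracy)
```

In a real deployment the `model` dictionary would be written out as a PMML file and moved to a scoring engine rather than scored in the same process.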

Arranging data by record and by attribute; data mining primitives.

- For data on disk, the cost to access one record is essentially the same as the cost to access a whole block of records
- Use variants of techniques from databases to lower the cost of accessing out of memory data
- There are a variety of tree-based methods for efficiently indexing blocks of data, such as B+ trees
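Arranging data by record (horizontal) or by attribute (vertical) changes how much data a selection query must touch. The sketch below is a toy in-memory comparison under stated assumptions: the records, the attribute names (`energy`, `momentum`, `charge`), and the `touched` counter are all hypothetical, and the counter only stands in for I/O cost.

```python
# Made-up records for illustration: (object_id, energy, momentum, charge).
records = [
    (0, 4.0, 1.2, 1),
    (1, 12.5, 0.7, -1),
    (2, 9.9, 3.1, 1),
    (3, 15.0, 2.2, -1),
]

# Horizontal (by record): each object's attributes are stored together.
horizontal = list(records)

# Vertical (by attribute): each attribute is stored as its own column.
names = ("object_id", "energy", "momentum", "charge")
vertical = {name: [rec[i] for rec in records] for i, name in enumerate(names)}

def select_horizontal(rows, limit):
    """Scan whole records: every attribute of every object is read."""
    touched = 0
    hits = []
    for rec in rows:
        touched += len(rec)
        if rec[1] < limit:          # rec[1] is the energy attribute
            hits.append(rec[0])
    return hits, touched

def select_vertical(columns, limit):
    """Scan only the predicate column, then fetch ids of matching rows."""
    column = columns["energy"]
    hits = [columns["object_id"][i] for i, v in enumerate(column) if v < limit]
    touched = len(column) + len(hits)
    return hits, touched

print(select_horizontal(horizontal, 10))
print(select_vertical(vertical, 10))
```

Both layouts return the same objects; the vertical layout reads fewer values because the query involves a single attribute.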

Select all objects where is less than 10.

Select all objects where is less + than 10.

terabyte of complex objects

Vertical

Horizontal

NCMb/sGBSecEvents/s

134.41177564

4104.43590655

8174.421322811

16234.415517731

Horizontal

110.271549400

440.275664377

870.2732015482

16100.2722344590

Vertical

- For many algorithms, data infrastructure only needs to supply:
(Attribute Id, Attribute Value, Class Value, Count)

- Specialized data structures can be created to do this.
- SQL databases can be extended to do this.
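The (Attribute Id, Attribute Value, Class Value, Count) tuples can be produced in a single pass over the records. The following is a minimal sketch assuming the learning set fits in memory as a list of tuples; the function name `count_tuples` and the sample rows are hypothetical.

```python
from collections import Counter

def count_tuples(rows, class_index):
    """Return sorted (attribute_id, attribute_value, class_value, count) tuples."""
    counts = Counter()
    for row in rows:
        class_value = row[class_index]
        for attribute_id, attribute_value in enumerate(row):
            if attribute_id != class_index:
                counts[(attribute_id, attribute_value, class_value)] += 1
    return sorted((a, v, c, n) for (a, v, c), n in counts.items())

# Made-up rows: (outlook, humidity, class label).
rows = [
    ("sunny", "high", "no"),
    ("sunny", "normal", "yes"),
    ("rain", "high", "no"),
    ("sunny", "high", "no"),
]

tuples = count_tuples(rows, class_index=2)
for t in tuples:
    print(t)
```

Counts of this form are enough, for example, to evaluate candidate splits in a classification tree or to estimate naive Bayes probabilities without revisiting the raw records.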

See www.dmg.org for more information.


- Current Version 2.0
- Products shipping with PMML Version 1.1
- PMML Working Group Full Members
- IBM, Magnify, MineIt, NCR, Oracle, Salford Systems, SPSS, xChange, University of Illinois at Chicago

- PMML Working Group Supporting Members
- Angoss, Insightful, KXEN, Microsoft, SGI …

- Part of xml.org Repository & Source Forge


data mining algorithm

- One view of data mining:
- 1. Extract a learning set from a data warehouse
- 2. Apply a data mining algorithm
- 3. Produce a statistical model, data mining model, or rule set

<PMML version="1.1"
<TreeModel ModelName="response"
etc.
<Node frequency="freq_12_month">
etc.
</TreeModel>
</PMML>

- Models are deployed in proprietary formats
- Models are application dependent
- Models are system dependent
- Models are architecture dependent
- Time required to integrate models with other applications can be long.

[Figure: data split into partition 1, partition 2, and partition 3; each partition yields a PMML file, and the files are combined.]

1. Scatter the query.

2. Compute the classifiers independently.

3. Gather and merge the PMML files.
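The scatter, compute, and gather steps can be sketched as a small voting ensemble. This is an illustration under assumptions, not the method used by the authors: plain dictionaries stand in for the PMML files, the per-partition classifier is a simple one-dimensional threshold, and merging is done by majority vote. The data values are made up.

```python
from collections import Counter
from statistics import mean

def scatter(rows, n_partitions):
    """Step 1: split the learning set round-robin across partitions."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def build_classifier(partition):
    """Step 2: fit a threshold classifier using one partition only."""
    low = mean(x for x, label in partition if label == "low")
    high = mean(x for x, label in partition if label == "high")
    return {"threshold": (low + high) / 2}   # stand-in for a PMML file

def merge(models):
    """Step 3: combine the per-partition models into one voting ensemble."""
    def predict(x):
        votes = Counter("high" if x >= m["threshold"] else "low" for m in models)
        return votes.most_common(1)[0][0]
    return predict

# Made-up learning set: (value, class label).
rows = [
    (1.0, "low"), (2.0, "low"), (3.0, "low"),
    (1.5, "low"), (2.5, "low"), (3.5, "low"),
    (7.0, "high"), (8.0, "high"), (9.0, "high"),
    (7.5, "high"), (8.5, "high"), (9.5, "high"),
]

partitions = scatter(rows, 3)
models = [build_classifier(p) for p in partitions]
predict = merge(models)
print([m["threshold"] for m in models], predict(2.0), predict(9.0))
```

Because each partition only exchanges a small model description, the raw data never has to leave the site where it is stored.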

[Figure: data warehouses in Chicago and Amsterdam, each with a data mining system, connected by PMML to a predictive modeling system.]

<TreeModel modelName="golfing">
  <MiningSchema>
    <MiningField name="temperature"/>
    <MiningField name="humidity"/>
    …
  </MiningSchema>
  <Node score="play">
    <Predicate field="outlook" operator="equal" value="sunny"/>
    <Node score="play">
      <CompoundPredicate booleanOperator="and">
        <Predicate field="temperature" operator="lessThan" value="90F"/>

- Based on XML
- Benefits of PMML
- Open standard for Data Mining & Statistical Models
- Not concerned with the process of creating a model
- Provides independence from application, platform, and operating system
- Simplifies use of data mining models by other applications (consumers of data mining models)

- Very important to understand what PMML is not concerned with …
- PMML is a specification of a model, not an implementation of a model
- PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations
- Also, PMML includes the metadata required to deploy models
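Because PMML specifies a model rather than implementing it, a consumer has to supply the scoring logic itself. The sketch below shows one possible consumer for a tree model. It is an assumption-laden illustration: the XML is a simplified, PMML-like document that borrows the element names from the slide fragment (`TreeModel`, `Node`, `Predicate`) and is not claimed to be valid against any PMML DTD, and the tree contents and the `score` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, PMML-like tree (NOT schema-valid PMML); contents are made up.
MODEL_XML = """
<TreeModel modelName="golfing">
  <Node score="play">
    <Predicate field="outlook" operator="equal" value="sunny"/>
    <Node score="no play">
      <Predicate field="humidity" operator="greaterThan" value="80"/>
    </Node>
  </Node>
  <Node score="no play">
    <Predicate field="outlook" operator="equal" value="rain"/>
  </Node>
</TreeModel>
"""

# The consumer binds operator names in the document to executable tests.
OPERATORS = {
    "equal": lambda a, b: str(a) == b,
    "lessThan": lambda a, b: float(a) < float(b),
    "greaterThan": lambda a, b: float(a) > float(b),
}

def matches(node, record):
    """Evaluate a node's predicate against one record."""
    predicate = node.find("Predicate")
    if predicate is None:
        return True
    test = OPERATORS[predicate.get("operator")]
    return test(record[predicate.get("field")], predicate.get("value"))

def score(parent, record, default=None):
    """Descend into the first matching child; return the deepest score."""
    for child in parent.findall("Node"):
        if matches(child, record):
            return score(child, record, child.get("score"))
    return default

model = ET.fromstring(MODEL_XML)
print(score(model, {"outlook": "sunny", "humidity": 70}))
print(score(model, {"outlook": "sunny", "humidity": 90}))
```

The same document could be scored by any other application that implements the agreed operators, which is the independence the slides describe.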

- PMML Documents
- Data dictionary
- Transformation dictionary
- One or more PMML models
- Support for taxonomies/hierarchies

- PMML Model
- Mining Schema
- Univariate statistics (ModelStats)
- Optional extensions
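The document structure listed above can be sketched by assembling the pieces with a generic XML library. This is a rough skeleton only: it has not been validated against the PMML DTD, required attributes and the model body are omitted, and the helper `build_document` and the derived field name `temperature_scaled` are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_document(data_fields, derived_fields, mining_fields, model_name):
    """Assemble a skeleton document: dictionaries first, then one model."""
    pmml = ET.Element("PMML", version="2.0")

    # Data dictionary: every field known to the document.
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name, optype in data_fields:
        ET.SubElement(dictionary, "DataField", name=name, optype=optype)

    # Transformation dictionary: optional derived fields.
    transformations = ET.SubElement(pmml, "TransformationDictionary")
    for name in derived_fields:
        ET.SubElement(transformations, "DerivedField", name=name)

    # One model with its mining schema and univariate statistics.
    model = ET.SubElement(pmml, "TreeModel", modelName=model_name)
    schema = ET.SubElement(model, "MiningSchema")
    for name in mining_fields:
        ET.SubElement(schema, "MiningField", name=name)
    ET.SubElement(model, "ModelStats")
    return pmml

doc = build_document(
    data_fields=[("temperature", "continuous"),
                 ("humidity", "continuous"),
                 ("outlook", "categorical")],
    derived_fields=["temperature_scaled"],
    mining_fields=["temperature", "humidity"],
    model_name="golfing",
)
print(ET.tostring(doc, encoding="unicode"))
```

The mining schema names only the fields the model needs, so it can be a subset of the data dictionary.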

[Figure: PMML producers and consumers. Producers: data mining warehouse (dataFields) and data mining system (learning sets, miningFields). Consumers: operational data and campaign manager (derivedFields, miningFields, campaigns). PMML models pass from producers to consumers.]

- Data Dictionary defines data
- Mining Schema defines specific inputs (MiningFields) required for model
- Transformation Dictionary defines optional additional derived fields
- Two types of attributes:
- attributes defined by the mining schema
- derived attributes defined via transformations

- Models themselves can also support certain transformations

- polynomial regression
- general regression
- trees
- center based clusters
- density based clusters
- associations
- neural nets
- logistic regression
- naïve Bayes
- sequences

- Producer conformance
- An application is producer conformant if it can write valid PMML documents for at least one type of model

- Consumer conformance
- An application is consumer conformant if it can read valid PMML documents for at least one type of model

- Core and non-core features
- For a given model, certain features are identified as core by the DTD and must be supported
- Others are identified as optional

- OMG CWM DM: object model for representing data mining metadata: models, model results (UML/DTD/XML)
- SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL)
- DMG PMML: representation of data mining models for inter-vendor exchange (DTD/XML)
- JSR-073 JDMAPI: Java API for defining, creating, and applying data mining models, and obtaining their results (Java)
- OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)

What do you do when you get home?

- SAS
- SPSS
- Splus (open source R)
- Matlab (open source Octave)
- Many other specialized systems

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001 - a good introduction to data mining from the systems and database perspective.

Ian H. Witten and Eibe Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, 2000 - a good introduction which includes Java tools for the common algorithms.

Ian H. Witten, Alistair Moffat and Timothy C. Bell, Managing Gigabytes, Second Edition, Morgan Kaufmann, San Diego, 1999 - a good book describing the infrastructure and theory required for working with large collections of text or images.

J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.

Predictive Model Markup Language (PMML), see www.dmg.org