Data mining for scientific engineering applications


Data Mining for Scientific & Engineering Applications. Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify Chandrika Kamath, Lawrence Livermore National Laboratory Vipin Kumar, Army High Performance Research Center, University of Minnesota.


Data Mining for Scientific & Engineering Applications

Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

Chandrika Kamath, Lawrence Livermore National Laboratory

Vipin Kumar, Army High Performance Research Center, University of Minnesota



Chapter 10 – Data Mining Systems

Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify



Goals of Chapter 10

  • What are the four critical interfaces in a data mining system?

  • Is data mining about rows or columns?

  • What are the standards in data mining?

  • What data mining systems are available?



Outline

10.1 Overview of Data Mining Systems

10.2 Case Study Using a System

10.3 Managing Data for Data Mining

10.4 Data Mining Standards

10.5 Commercial and Open Source Systems



10.1 Overview of Data Mining Systems

Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.


Four Generations of DM Systems

  • First Generation: data mining algorithms

  • Second Generation: data mining algorithms, data management

  • Third Generation: predictive modeling, data mining algorithms, data management & Internet

  • Fourth Generation: agents & mobile data, predictive modeling, data mining algorithms, data management, NGI


Layered Systems for DM & PM

  • Layers (shown as two stacks linked by the Internet and by NGI with QoS): agents, pred. models, data mining, data management

  • Move results and metadata: other protocols and services …

    • Agents can move metadata around via net

    • Warehouse can move data around via NGI

  • Move models: Predictive Model Markup Language (PMML)

  • Move data: DSTP, distributed databases, etc.


Phases in the Data Mining & Predictive Modeling Process

  • Phase B, C: Warehousing

  • Phase D: Data Mining

  • Phase E: Predictive Modeling

  • Phase F: Deployment

Diagram labels: Operational data; Data Mining Transformations (DXML); Data Mining Mart; Data Mining Primitives (DMP); Learning set; Data Mining; PM or Rule Set; Predictive Model Markup Language (PMML); Scores.



Four Critical Interfaces

  • Data Mining Transformation (DXML)

    • Interface between operational data and data mining mart

  • Data Mining Primitives (DMP)

    • interface between data mining mart and data mining system

  • Data Mining Application Interface (DM-API)

    • interface between data mining applications and data mining system (examples: DMQL, OLE DB for Data Mining, …)

  • Predictive Model Markup Language (PMML)

    • interface between data mining system and predictive modeling system



10.2 Building a Model



Some (Selected) Steps to Build Models

  • Define the data schema

  • Clean and load the data

  • Define the mining schema

  • Compute derived attributes

  • Build the model

  • Analyze the model

  • Deploy the model



1. Define the Data Schema

Data Types: int, double, float, date-time, string, etc.
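A data schema of this kind can be sketched as a list of attribute names with a parser per type. This is only an illustration: the attribute names, the date format, and the helper `parse_row` are assumptions, not part of any particular data mining system.

```python
from datetime import datetime

# Hypothetical schema for an iris-like table: (attribute name, parser).
DATA_SCHEMA = [
    ("sepal_length", float),
    ("petal_length", float),
    ("measured_on", lambda s: datetime.strptime(s, "%Y-%m-%d")),
    ("species", str),
]

def parse_row(fields, schema=DATA_SCHEMA):
    """Convert a list of raw strings into a typed record (dict)."""
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    return {name: parse(raw.strip()) for (name, parse), raw in zip(schema, fields)}

record = parse_row(["5.1", "1.4", "2001-08-26", "setosa"])
```

Keeping the schema as data rather than code means the same loader can be reused when the set of attributes changes.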


2. Clean and Load the Data

Select data schema.

Select data source: text, database, etc.
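A minimal sketch of cleaning while loading from a text source, assuming a comma-separated file with a header row. The sample data, the column names, and the rule "reject rows whose numeric fields do not parse" are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw text source; one row has a missing value, one is malformed.
RAW = """sepal_length,petal_length,species
5.1,1.4,setosa
,4.7,versicolor
6.3,abc,virginica
5.9,5.1,virginica
"""

def load_clean(text):
    """Return (clean_rows, rejected_count), keeping rows whose numeric fields parse."""
    clean, rejected = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        try:
            clean.append({
                "sepal_length": float(row["sepal_length"]),
                "petal_length": float(row["petal_length"]),
                "species": row["species"].strip(),
            })
        except ValueError:
            rejected += 1  # empty or non-numeric field
    return clean, rejected

rows, rejected = load_clean(RAW)
```

Counting rejected rows, rather than dropping them silently, makes it easier to notice when a data source is dirtier than expected.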



3. Define the Mining Schema

Select mining role: dependent, independent, excluded, key, etc.

Select mining type: continuous, ordinal, categorical, binary, etc.
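The mining schema can be pictured as one entry per attribute carrying a role and a type. The sketch below uses the role and type names from the slide; the class `MiningField`, the helper `split_roles`, and the attribute names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiningField:
    name: str
    role: str    # "dependent", "independent", "excluded", or "key"
    mtype: str   # "continuous", "ordinal", "categorical", or "binary"

# Hypothetical mining schema for an iris-like data set.
MINING_SCHEMA = [
    MiningField("record_id", "key", "categorical"),
    MiningField("sepal_length", "independent", "continuous"),
    MiningField("petal_length", "independent", "continuous"),
    MiningField("collector", "excluded", "categorical"),
    MiningField("species", "dependent", "categorical"),
]

def split_roles(schema):
    """Return the model inputs and targets; key and excluded fields are skipped."""
    inputs = [f.name for f in schema if f.role == "independent"]
    targets = [f.name for f in schema if f.role == "dependent"]
    return inputs, targets

inputs, targets = split_roles(MINING_SCHEMA)
```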



4. Compute Derived Attributes

Define petal_length/sepal_length
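As a small sketch, the derived attribute can be computed per record as below. The output name `petal_sepal_ratio` and the choice to return `None` for a zero sepal length are assumptions.

```python
def add_petal_sepal_ratio(record):
    """Return a copy of the record with the derived attribute added."""
    derived = dict(record)
    sepal = record["sepal_length"]
    # Guard against division by zero; None marks an undefined ratio.
    derived["petal_sepal_ratio"] = record["petal_length"] / sepal if sepal else None
    return derived

row = add_petal_sepal_ratio({"sepal_length": 5.0, "petal_length": 1.5})
```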



5. Build the Model

Select Data Store

Select Mining Schema

Select Parameters



5. Build the Model (cont’d)

Classification tree.
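To make the tree-building step concrete, here is a sketch of how one split of a classification tree might be chosen on a single continuous attribute, using Gini impurity. The slide does not say which splitting criterion the system uses, so Gini is an assumption, and the data are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one continuous attribute that minimizes weighted Gini."""
    best = (float("inf"), None)
    for t in sorted(set(values))[1:]:  # candidate thresholds: value < t goes left
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[0]:
            best = (score, t)
    return best

# Made-up petal lengths and species labels.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]
impurity, threshold = best_split(petal_length, species)
```

A full tree would apply the same search recursively to each side of the chosen threshold until a stopping rule is met.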



5. Build the Model - Tuning

Select Model

Select Parameters



6. Analyze the Model

Analyze how well the model predicted class labels.
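One common way to analyze predicted class labels is a confusion matrix together with overall accuracy. The sketch below uses made-up labels; the function names are hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]
matrix = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)
```

The off-diagonal entries, such as `("virginica", "versicolor")`, show which classes the model confuses.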



7. Deploy the Model

Move PMML files to scoring engine.



10.3 Physical Data Management

Arranging data by record and by attribute; data mining primitives.



B+ Trees

  • The cost to access one record on disk is the same as the cost to access a whole block of records

  • Use variants of techniques from databases to lower the cost of accessing out-of-memory data

  • There are a variety of tree-based methods for efficiently indexing blocks of data, such as B+ trees
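The block idea can be illustrated with a single-level sparse index: one index entry per block, so a lookup reads exactly one block. This is a simplified stand-in and not a B+ tree (a real B+ tree has multiple index levels and supports inserts and deletes). The block size and function names are assumptions.

```python
from bisect import bisect_right

BLOCK_SIZE = 4  # records per block; an arbitrary value for illustration

def build_blocks(sorted_keys, block_size=BLOCK_SIZE):
    """Split sorted keys into blocks and keep the first key of each block as an index."""
    blocks = [sorted_keys[i:i + block_size] for i in range(0, len(sorted_keys), block_size)]
    first_keys = [b[0] for b in blocks]  # one index entry per block
    return blocks, first_keys

def lookup(key, blocks, first_keys):
    """Return (found, block_number); only one block is read per lookup."""
    pos = bisect_right(first_keys, key) - 1
    if pos < 0:
        return False, None  # key is smaller than every indexed key
    return key in blocks[pos], pos

blocks, first_keys = build_blocks(list(range(0, 40, 2)))
found, block_no = lookup(18, blocks, first_keys)
```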


Horizontal vs. Vertical

A terabyte of complex objects can be arranged horizontally (by record) or vertically (by attribute).

Example queries:

  • Select all objects where … is less than 10.

  • Select all objects where … + … is less than 10.
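The difference between the two layouts can be sketched with a tiny in-memory example. The attribute names (`energy`, `charge`) and the values are made up; the point is only that a query on one attribute touches every full record in the horizontal layout but a single array in the vertical one.

```python
# Made-up records with three attributes.
records = [
    {"id": 1, "energy": 4.2, "charge": -1},
    {"id": 2, "energy": 12.5, "charge": 1},
    {"id": 3, "energy": 9.9, "charge": 0},
]

# Horizontal layout: one entry per record; a query on one attribute
# still walks over every full record.
horizontal = records
hits_h = [r["id"] for r in horizontal if r["energy"] < 10]

# Vertical layout: one array per attribute; the same query reads only
# the "energy" column and uses positions to recover the ids.
vertical = {name: [r[name] for r in records] for name in records[0]}
hits_v = [vertical["id"][i] for i, e in enumerate(vertical["energy"]) if e < 10]
```

When records are large and queries use few attributes, the vertical layout reads far less data for the same answer.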


Thinking about Columns

Horizontal

NC    Mb/s    GB      Sec      Events/s
1     3       4.4     11775    64
4     10      4.4     3590     655
8     17      4.4     2132     2811
16    23      4.4     1551     7731

Vertical

NC    Mb/s    GB      Sec      Events/s
1     1       0.27    1549     400
4     4       0.27    566      4377
8     7       0.27    320      15482
16    10      0.27    223      44590



Data Mining Primitives

  • For many algorithms, data infrastructure only needs to supply:

    (Attribute Id, Attribute Value, Class Value, Count)

  • Specialized data structures can be created to do this.

  • SQL databases can be extended to do this.
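A minimal sketch of producing the (Attribute Id, Attribute Value, Class Value, Count) tuples in one pass over the data. The function name `count_primitives` and the sample rows are assumptions; many tree and naive Bayes style learners can work from counts of this shape.

```python
from collections import Counter

def count_primitives(rows, attributes, class_attr):
    """Count (attribute id, attribute value, class value) combinations in one pass."""
    counts = Counter()
    for row in rows:
        for attr_id, name in enumerate(attributes):
            counts[(attr_id, row[name], row[class_attr])] += 1
    return counts

# Made-up rows; "play" is the class attribute.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rain", "windy": False, "play": "yes"},
]
counts = count_primitives(rows, ["outlook", "windy"], "play")

# Flatten into (Attribute Id, Attribute Value, Class Value, Count) tuples.
tuples = [(a, v, c, n) for (a, v, c), n in counts.items()]
```

In SQL terms this corresponds roughly to a GROUP BY on attribute value and class value with a COUNT, repeated per attribute.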



10.4 Data Mining Standards

See www.dmg.org for more information.

Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.



Predictive Model Markup Language (PMML)

  • Current Version 2.0

  • Products shipping with PMML Version 1.1

  • PMML Working Group Full Members

    • IBM, Magnify, MineIt, NCR, Oracle, Salford Systems, SPSS, xChange, University of Illinois at Chicago

  • PMML Working Group Supporting Members

    • Angoss, Insightful, KXEN, Microsoft, SGI …

  • Part of xml.org Repository & Source Forge


Layered Systems for DM & PM

  • Layers (shown as two stacks linked by the Internet and by NGI with QoS): agents, pred. models, data mining, data management

  • Agents can move metadata around via net

  • Warehouse can move data around via NGI

  • Move models: Predictive Model Markup Language (PMML)


Point of View

  • View data mining as three steps:

    • 1. Extract a learning set from a data warehouse

    • 2. Apply a data mining algorithm

    • 3. Produce a statistical model, data mining model or rule set.

<PMML version="1.1"

<TreeModel ModelName="response"

etc.

<Node frequency="freq_12_month">

etc.

</TreeModel>

</PMML>



Problems with Current Techniques

  • Models are deployed in proprietary formats

  • Models are application dependent

  • Models are system dependent

  • Models are architecture dependent

  • Time required to integrate models with other applications can be long.


High Performance Data Mining & PMML

1. Scatter the query across the partitions (partition 1, partition 2, partition 3).

2. Compute the classifiers independently.

3. Gather and merge the PMML files.
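The scatter, compute, gather pattern can be sketched as follows. Several things here are assumptions: plain dictionaries stand in for the PMML files, each local classifier is a simple one-attribute rule, and the merge is a majority vote. The slide does not specify the classifier or the merge rule.

```python
from collections import Counter, defaultdict

def train_one_rule(rows):
    """Per partition: map each attribute value to its majority class."""
    by_value = defaultdict(Counter)
    for value, label in rows:
        by_value[value][label] += 1
    return {v: c.most_common(1)[0][0] for v, c in by_value.items()}

def merge_by_vote(models, value):
    """Gather step: each local model votes; models that never saw the value abstain."""
    votes = Counter(m[value] for m in models if value in m)
    return votes.most_common(1)[0][0] if votes else None

# Made-up (outlook, play) pairs.
data = [("sunny", "no"), ("rain", "yes"), ("sunny", "no"),
        ("overcast", "yes"), ("sunny", "yes"), ("rain", "yes")]
partitions = [data[i::3] for i in range(3)]        # 1. scatter
models = [train_one_rule(p) for p in partitions]   # 2. compute independently
prediction = merge_by_vote(models, "sunny")        # 3. gather and merge
```

Because each local model is built without looking at the other partitions, step 2 can run on separate machines and only the small model descriptions need to be moved.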


Distributed DM & PMML

  • Data - Chicago: Data Warehouse, Data Mining System

  • Data - Amsterdam: Data Warehouse, Data Mining System

  • PMML from each Data Mining System: Combine, Predictive Modeling System



Example: PMML

<TreeModel modelName="golfing">

<MiningSchema>

<MiningField name="temperature"/>

<MiningField name="humidity"/>

</MiningSchema>

<Node score="play">

<Predicate field="outlook" operator="equal" value="sunny"/>

<Node score="play">

<CompoundPredicate booleanOperator="and">

<Predicate field="temperature" operator="lessThan" value="90F"/>
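To show how a consumer might use a tree like the one above, here is a sketch that parses a small tree document and scores a record with Python's standard `xml.etree.ElementTree`. The document in `DOC` is a hypothetical, well-formed example written in the slide's style (same element and attribute names, numeric threshold without the "F" suffix); it is not the slide's full example and is not checked against any PMML DTD. `CompoundPredicate` is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed tree in the slide's style (not validated PMML).
DOC = """
<TreeModel modelName="golfing">
  <Node score="no play">
    <Node score="play">
      <Predicate field="outlook" operator="equal" value="sunny"/>
      <Node score="no play">
        <Predicate field="temperature" operator="greaterOrEqual" value="90"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

# Only the operators needed for this sketch.
OPS = {
    "equal": lambda a, b: str(a) == b,
    "lessThan": lambda a, b: float(a) < float(b),
    "greaterOrEqual": lambda a, b: float(a) >= float(b),
}

def matches(node, record):
    """A node with no Predicate child always matches."""
    pred = node.find("Predicate")
    if pred is None:
        return True
    return OPS[pred.get("operator")](record[pred.get("field")], pred.get("value"))

def score(node, record):
    """Descend into the first matching child; return the deepest matching node's score."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")

root = ET.fromstring(DOC).find("Node")
result = score(root, {"outlook": "sunny", "temperature": 75})
```

The scoring code knows nothing about how the tree was built, which is the separation between model producer and model consumer that PMML aims for.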



Predictive Model Markup Language (PMML)

  • Based on XML

  • Benefits of PMML

    • Open standard for Data Mining & Statistical Models

    • Not concerned with the process of creating a model

    • Provides independence from application, platform, and operating system

    • Simplifies use of data mining models by other applications (consumers of data mining models)



Philosophy

  • Very important to understand what PMML is not concerned with …

  • PMML is a specification of a model, not an implementation of a model

  • PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations

  • Also, PMML includes the metadata required to deploy models



PMML Document Structure

  • PMML Documents

    • Data dictionary

    • Transformation dictionary

    • One or more PMML models

    • Support for taxonomies/hierarchies

  • PMML Model

    • Mining Schema

    • Univariate statistics (ModelStats)

    • Optional extensions


PMML Producers, Consumers, & Data Flow

  • PMML Producers: Data Mining Warehouse, Data Mining System

  • PMML Consumers: Operational Data, Campaign Manager

  • Flows shown in the diagram: dataFields, learning sets, miningFields, derivedFields, PMML models, campaigns



Data Flow - Recap

  • Data Dictionary defines data

  • Mining Schema defines specific inputs (MiningFields) required for model

  • Transformation Dictionary defines optional additional derived fields

  • Two types of attributes:

    • attributes defined by the mining schema

    • derived attributes defined via transformations

  • Models themselves can also support certain transformations



Models in PMML v2.0

  • polynomial regression

  • general regression

  • trees

  • center based clusters

  • density based clusters

  • associations

  • neural nets

  • logistic regression

  • naïve Bayes

  • sequences



Conformance

  • Producer conformance

    • An application is a conforming producer if it can write valid PMML documents for at least one type of model

  • Consumer conformance

    • An application is a conforming consumer if it can read valid PMML documents for at least one type of model

  • Core and non-core features

    • For a given model, certain features are identified as core by the DTD and must be supported

    • Others are identified as optional


Other Data Mining Standards

  • OMG CWM DM: object model for representing data mining metadata: models, model results (UML/DTD/XML)

  • SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL)

  • DMG PMML: representation of data mining models for inter-vendor exchange (DTD/XML)

  • JSR-073 JDMAPI: Java API for defining, creating, and applying data mining models, and obtaining their results (Java)

  • OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)



10.5 Commercial & Open Source Systems

What do you do when you get home?



Data Mining and Related Systems

  • SAS

  • SPSS

  • Splus (open source R)

  • Matlab (open source Octave)

  • Many other specialized systems



References

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001 - a good introduction to data mining from the systems and database perspective.

Ian H. Witten and Eibe Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, 2000 - a good introduction which includes Java tools for the common algorithms.

Ian H. Witten, Alistair Moffat and Timothy C. Bell, Managing Gigabytes, Second Edition, Morgan Kaufmann, San Diego, 1999 - a good book describing the infrastructure and theory required for working with large collections of text or images.

J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.

Predictive Model Markup Language (PMML), see www.dmg.org

