
Data Mining for Scientific & Engineering Applications


Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

Chandrika Kamath, Lawrence Livermore National Laboratory

Vipin Kumar, Army High Performance Research Center, University of Minnesota



Chapter 10 – Data Mining Systems

Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

Goals of Chapter 10

- What are the four critical interfaces in a data mining system?
- Is data mining about rows or columns?
- What are the standards in data mining?
- What data mining systems are available?

Outline

10.1 Overview of Data Mining Systems

10.2 Case Study Using a System

10.3 Managing Data for Data Mining

10.4 Data Mining Standards

10.5 Commercial and Open Source Systems

10.1 Overview of Data Mining Systems

Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, "The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML)," Information and Software Technology, 1999.

Four Generations of DM Systems

[Figure: diagram of the generations of data mining systems. Legible labels: First Generation, Third Generation, Fourth Generation, mobile data, agents & Internet, predictive modeling, data mining algorithms, data management, NGI.]

Layered Systems for DM & PM

[Figure: layered diagram with the labels agents, Internet, pred. models, data mining, NGI with QoS, and data management.]

- Move results and metadata: other protocols and services …
- Agents can move metadata around via net
- Warehouse can move data around via NGI
- Move models: Predictive Model Markup Language (PMML)
- Move data: DSTP, distributed databases, etc.

Phases in the Data Mining & Predictive Modeling Process

- Phase B, C: Warehousing
- Phase D: Data Mining
- Phase E: Predictive Modeling
- Phase F: Deployment

[Figure: process diagram. Legible labels: operational data, Data Mining Mart, learning set, PM or rule set, scores. Interface labels: Data Mining Transformations (DXML), Data Mining Primitives (DMP), Predictive Model Markup Language (PMML).]

Four Critical Interfaces

- Data Mining Transformation (DXML)
  - interface between operational data and data mining mart

- Data Mining Primitives (DMP)
  - interface between data mining mart and data mining system

- Data Mining Application Interface (DM-API)
  - interface between data mining applications and data mining system, DMQL, OLE DB for Data Mining, …

- Predictive Model Markup Language (PMML)
  - interface between data mining system and predictive modeling system

Some (Selected) Steps to Build Models

- Define the data schema
- Clean and load the data
- Define the mining schema
- Compute derived attributes
- Build the model
- Analyze the model
- Deploy the model

1. Define the Data Schema

Data Types: int, double, float, date-time, string, etc.

3. Define the Mining Schema

Select mining role: dependent, independent, excluded, key, etc.

Select mining type: continuous, ordinal, categorical, binary, etc.
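As a rough illustration of steps 1 and 3, the sketch below declares a data schema and a mining schema for an iris-style table. The class names (`DataField`, `MiningField`), the column names, and the helper functions are assumptions made for this example, not the interface of any particular data mining system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataField:
    name: str
    data_type: str  # e.g. "int", "double", "string"

@dataclass(frozen=True)
class MiningField:
    name: str
    role: str         # "dependent", "independent", "excluded", "key"
    mining_type: str  # "continuous", "ordinal", "categorical", "binary"

# Step 1: data schema (hypothetical iris-style columns).
data_schema = [
    DataField("id", "int"),
    DataField("sepal_length", "double"),
    DataField("petal_length", "double"),
    DataField("species", "string"),
]

# Step 3: mining schema assigns a role and a mining type to each field.
mining_schema = [
    MiningField("id", "key", "categorical"),
    MiningField("sepal_length", "independent", "continuous"),
    MiningField("petal_length", "independent", "continuous"),
    MiningField("species", "dependent", "categorical"),
]

def fields_with_role(schema, role):
    """Names of the mining fields that play the given role."""
    return [f.name for f in schema if f.role == role]

def missing_from_data_schema(mining, data):
    """Mining fields that the data schema does not define."""
    known = {f.name for f in data}
    return [f.name for f in mining if f.name not in known]

inputs = fields_with_role(mining_schema, "independent")
target = fields_with_role(mining_schema, "dependent")
```

Keeping the two schemas separate mirrors the slides: the data schema says what is stored, while the mining schema says how a given model uses it.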

4. Compute Derived Attributes

Define petal_length/sepal_length
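A derived attribute such as the ratio above can be sketched in a few lines. The function name `add_ratio`, the output column name, and the two sample rows are made up for illustration.

```python
def add_ratio(records, numerator, denominator, name):
    """Return copies of the records with name = numerator / denominator added."""
    derived = []
    for rec in records:
        out = dict(rec)
        denom = rec[denominator]
        # Guard against division by zero; None marks an undefined ratio.
        out[name] = rec[numerator] / denom if denom else None
        derived.append(out)
    return derived

# Illustrative rows only, not taken from the presentation.
rows = [
    {"sepal_length": 5.0, "petal_length": 1.5, "species": "setosa"},
    {"sepal_length": 6.0, "petal_length": 4.5, "species": "versicolor"},
]
rows = add_ratio(rows, "petal_length", "sepal_length", "petal_sepal_ratio")
```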

5. Build the Model (cont’d)

Classification tree.

6. Analyze the Model

Analyze how well the model predicted class labels.
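One common way to check how well class labels were predicted is a confusion matrix plus overall accuracy. The sketch below assumes the actual and predicted labels are already available as two lists; the labels shown are invented for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Invented labels for illustration.
actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

The off-diagonal entries of the matrix, such as `("virginica", "versicolor")`, show which classes the model confuses.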

10.3 Physical Data Management

Arranging data by record and by attribute; data mining primitives.

B+ Trees

- The cost to access one record is exactly the same as to access a block of records
- Use variants of techniques from databases to lower the cost of accessing out of memory data
- There are a variety of tree-based methods for efficiently indexing blocks of data, such as B+ trees

Select all objects where [attribute] is less than 10.
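The sketch below is not a B+ tree. It only imitates the idea behind one: values are kept sorted in fixed-size blocks, and a small index of each block's first key tells a range query which blocks it can skip. The block size and the sample values are assumptions chosen to keep the example small.

```python
from bisect import bisect_left

BLOCK_SIZE = 4  # records per block (assumed; tiny for illustration)

def build_blocks(values, block_size=BLOCK_SIZE):
    """Sort the values and pack them into fixed-size blocks."""
    ordered = sorted(values)
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    first_keys = [block[0] for block in blocks]  # one-level index over the blocks
    return blocks, first_keys

def select_less_than(blocks, first_keys, bound):
    """Return (matching values, number of blocks read) for value < bound."""
    # A block whose first key is >= bound cannot contain a match.
    n_blocks = bisect_left(first_keys, bound)
    result = []
    for block in blocks[:n_blocks]:
        result.extend(v for v in block if v < bound)
    return result, n_blocks

values = [42, 7, 15, 3, 28, 9, 11, 1, 36, 20, 5, 17]
blocks, first_keys = build_blocks(values)
matches, blocks_read = select_less_than(blocks, first_keys, 10)
```

Counting blocks rather than records reflects the first bullet above: once a block is fetched, every record in it comes at no extra access cost.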

Horizontal vs. Vertical

[Figure: a terabyte of complex objects, partitioned vertically and horizontally.]

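Thinking about Columns

| Layout | NC | Mb/s | GB | Sec | Events/s |
|---|---|---|---|---|---|
| Horizontal | 1 | 3 | 4.4 | 11775 | 64 |
| Horizontal | 4 | 10 | 4.4 | 3590 | 655 |
| Horizontal | 8 | 17 | 4.4 | 2132 | 2811 |
| Horizontal | 16 | 23 | 4.4 | 1551 | 7731 |
| Vertical | 1 | 1 | 0.27 | 1549 | 400 |
| Vertical | 4 | 4 | 0.27 | 566 | 4377 |
| Vertical | 8 | 7 | 0.27 | 320 | 15482 |
| Vertical | 16 | 10 | 0.27 | 223 | 44590 |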
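To make the horizontal versus vertical distinction concrete, the sketch below stores the same records by row and by column and counts how many field values a scan of one attribute has to touch. The record fields and values are hypothetical, and the count is only a stand-in for I/O volume.

```python
def to_vertical(records):
    """Rearrange row-oriented records into one list per attribute."""
    columns = {}
    for rec in records:
        for name, value in rec.items():
            columns.setdefault(name, []).append(value)
    return columns

def scan_horizontal(records, attribute):
    """Return (values, fields touched) when whole records must be read."""
    touched = 0
    values = []
    for rec in records:
        touched += len(rec)  # the entire record is read to get one field
        values.append(rec[attribute])
    return values, touched

def scan_vertical(columns, attribute):
    """Return (values, fields touched) when only one column is read."""
    column = columns[attribute]
    return list(column), len(column)

# Hypothetical event records.
records = [
    {"event_id": 1, "energy": 12.5, "charge": -1, "detector": "A"},
    {"event_id": 2, "energy": 7.0, "charge": 1, "detector": "B"},
    {"event_id": 3, "energy": 9.25, "charge": -1, "detector": "A"},
]
columns = to_vertical(records)
h_values, h_touched = scan_horizontal(records, "energy")
v_values, v_touched = scan_vertical(columns, "energy")
```

Both scans return the same values, but the vertical layout touches one value per record instead of every field.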

Data Mining Primitives

- For many algorithms, data infrastructure only needs to supply:
(Attribute Id, Attribute Value, Class Value, Count)

- Specialized data structures can be created to do this.
- SQL databases can be extended to do this.
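A minimal sketch of computing the (Attribute Id, Attribute Value, Class Value, Count) tuples from a set of records follows. The function name, the choice of `play` as the class attribute, and the sample records are assumptions for illustration.

```python
from collections import Counter

def count_primitives(records, class_attribute):
    """Count occurrences of each (attribute id, attribute value, class value)."""
    counts = Counter()
    for rec in records:
        cls = rec[class_attribute]
        for attr, value in rec.items():
            if attr != class_attribute:
                counts[(attr, value, cls)] += 1
    return counts

# Invented records in the style of the golfing example.
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
]
counts = count_primitives(records, "play")

# Flatten into (Attribute Id, Attribute Value, Class Value, Count) rows.
rows = sorted((a, v, c, n) for (a, v, c), n in counts.items())
```

Counts of this form are enough, for example, to estimate the conditional probabilities used by naive Bayes on categorical attributes, without going back to the individual records.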

10.4 Data Mining Standards

See www.dmg.org for more information.

Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn, and X. Qin, "The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML)," Information and Software Technology, 1999.

Predictive Model Markup Language (PMML)

- Current Version 2.0
- Products shipping with PMML Version 1.1
- PMML Working Group Full Members
  - IBM, Magnify, MineIt, NCR, Oracle, Salford Systems, SPSS, xChange, University of Illinois at Chicago

- PMML Working Group Supporting Members
  - Angoss, Insightful, KXEN, Microsoft, SGI …

- Part of xml.org Repository & SourceForge

Layered Systems for DM & PM

[Figure: layered diagram with the labels agents, Internet, pred. models, data mining, NGI with QoS, and data management.]

- Agents can move metadata around via net
- Warehouse can move data around via NGI
- Move models: Predictive Model Markup Language (PMML)

Point of View

- View data mining as a three-step process:
  1. Extract a learning set from a data warehouse
  2. Apply a data mining algorithm
  3. Produce a statistical model, data mining model, or rule set

<PMML version="1.1">

<TreeModel ModelName="response">

etc.

<Node frequency="freq_12_month">

etc.

</TreeModel>

</PMML>

Problems with Current Techniques

- Models are deployed in proprietary formats
- Models are application dependent
- Models are system dependent
- Models are architecture dependent
- Time required to integrate models with other applications can be long.

High Performance Data Mining & PMML

1. Scatter the query.
2. Compute the classifiers independently.
3. Gather and merge the PMML files.

[Figure labels: partition 2, partition 3, PMML.]
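The scatter, compute, and gather steps can be sketched as follows. This is only an illustration under several assumptions: the per-partition classifier is a trivial one-attribute majority rule, plain dictionaries stand in for the PMML files, and merging is done by majority vote. The slides do not say which classifier or merge rule is used.

```python
from collections import Counter, defaultdict

def build_one_rule(partition, attribute="outlook", label="play"):
    """Step 2: classifier for one partition: majority label per attribute value."""
    by_value = defaultdict(Counter)
    for rec in partition:
        by_value[rec[attribute]][rec[label]] += 1
    return {value: c.most_common(1)[0][0] for value, c in by_value.items()}

def scatter(partitions, build):
    """Step 1: send the same build request to every partition."""
    # Each call is independent, so in practice these could run in parallel.
    return [build(partition) for partition in partitions]

def gather(models):
    """Step 3: merge the per-partition models into one voting ensemble."""
    def predict(value):
        votes = Counter(m[value] for m in models if value in m)
        return votes.most_common(1)[0][0] if votes else None
    return predict

# Invented partitions for illustration.
partitions = [
    [{"outlook": "sunny", "play": "yes"}, {"outlook": "rain", "play": "no"},
     {"outlook": "sunny", "play": "yes"}],
    [{"outlook": "sunny", "play": "yes"}, {"outlook": "rain", "play": "no"}],
    [{"outlook": "sunny", "play": "no"}, {"outlook": "overcast", "play": "yes"}],
]
models = scatter(partitions, build_one_rule)
predict = gather(models)
```

Because each partition's model is a self-contained description, the gather step only needs the model descriptions, not the data behind them.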

Distributed DM & PMML

[Figure labels: Data - Chicago and Data - Amsterdam, each with a Data Warehouse and a Data Mining System; Predictive Modeling System; PMML.]

Example: PMML

<TreeModel modelName="golfing">

<MiningSchema>

<MiningField name="temperature"/>

<MiningField name="humidity"/>

…

</MiningSchema>

<Node score="play">

<Predicate field="outlook" operator="equal" value="sunny"/>

<Node score="play">

<CompoundPredicate booleanOperator="and">

<Predicate field="temperature" operator="lessThan" value="90F"/>

Predictive Model Markup Language (PMML)

- Based on XML
- Benefits of PMML
  - Open standard for Data Mining & Statistical Models
  - Not concerned with the process of creating a model
  - Provides independence from application, platform, and operating system
  - Simplifies use of data mining models by other applications (consumers of data mining models)

Philosophy

- Very important to understand what PMML is not concerned with …
- PMML is a specification of a model, not an implementation of a model
- PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations
- Also, PMML includes the metadata required to deploy models
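To illustrate the point that a PMML document specifies a model while the consumer supplies the implementation, here is a small sketch of a consumer that reads a tree description and scores a record. The XML is a simplified fragment in the style of the golfing example: its element and attribute names follow the slide and have not been checked against the PMML DTD, the second branch and the plain numeric threshold are additions for the example, and `matches` and `score` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical model description (not validated PMML).
MODEL_XML = """
<TreeModel modelName="golfing">
  <Node score="play">
    <Predicate field="outlook" operator="equal" value="sunny"/>
    <Node score="play">
      <Predicate field="temperature" operator="lessThan" value="90"/>
    </Node>
    <Node score="no play">
      <Predicate field="temperature" operator="greaterOrEqual" value="90"/>
    </Node>
  </Node>
</TreeModel>
"""

def matches(predicate, record):
    """Evaluate one predicate element against a record."""
    field = predicate.get("field")
    op = predicate.get("operator")
    value = predicate.get("value")
    actual = record[field]
    if op == "equal":
        return str(actual) == value
    if op == "lessThan":
        return float(actual) < float(value)
    if op == "greaterOrEqual":
        return float(actual) >= float(value)
    raise ValueError(f"unsupported operator: {op}")

def score(node, record):
    """Score of the deepest matching node, or None if this node does not match."""
    predicate = node.find("Predicate")
    if predicate is not None and not matches(predicate, record):
        return None
    for child in node.findall("Node"):
        result = score(child, record)
        if result is not None:
            return result
    return node.get("score")

tree_root = ET.fromstring(MODEL_XML).find("Node")
mild = score(tree_root, {"outlook": "sunny", "temperature": 75})
hot = score(tree_root, {"outlook": "sunny", "temperature": 95})
rainy = score(tree_root, {"outlook": "rain", "temperature": 75})
```

The document only binds parameters (fields, operators, thresholds, scores) to values. All of the evaluation logic lives in the consumer.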

PMML Document Structure

- PMML Documents
  - Data dictionary
  - Transformation dictionary
  - One or more PMML models
  - Support for taxonomies/hierarchies

- PMML Model
  - Mining Schema
  - Univariate statistics (ModelStats)
  - Optional extensions

PMML Producers, Consumers, & Data Flow

[Figure: data flow diagram. Legible labels: PMML Producers, Data Mining Warehouse, dataFields, learning sets, miningFields, Data Mining System, PMML models, Operational Data, derivedFields, Campaign Manager, campaigns.]

Data Flow - Recap

- Data Dictionary defines data
- Mining Schema defines specific inputs (MiningFields) required for model
- Transformation Dictionary defines optional additional derived fields
- Two types of attributes:
  - attributes defined by the mining schema
  - derived attributes defined via transformations

- Models themselves can also support certain transformations
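The recap can be read as a three-stage pipeline that prepares a record for a model. The sketch below assumes a data dictionary that maps field names to types, a mining schema given as a list of required field names, and a transformation dictionary of derived-field functions. All names, including the derived field `is_hot`, are hypothetical.

```python
# Data dictionary: every known field and its type.
data_dictionary = {"temperature": float, "humidity": float, "outlook": str}

# Mining schema: the specific inputs this model requires.
mining_fields = ["temperature", "outlook"]

# Transformation dictionary: optional derived fields (hypothetical example).
derived_fields = {"is_hot": lambda rec: rec["temperature"] >= 90.0}

def prepare_inputs(record):
    """Turn a raw record into the inputs a model expects."""
    # 1. Data dictionary: coerce each raw value to its declared type.
    typed = {name: cast(record[name]) for name, cast in data_dictionary.items()}
    # 2. Mining schema: keep only the fields the model needs.
    inputs = {name: typed[name] for name in mining_fields}
    # 3. Transformation dictionary: add derived attributes.
    for name, transform in derived_fields.items():
        inputs[name] = transform(typed)
    return inputs

raw = {"temperature": "95", "humidity": "70", "outlook": "sunny"}
inputs = prepare_inputs(raw)
```

The result contains both kinds of attributes named in the recap: those defined by the mining schema and those derived via transformations.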

Models in PMML v2.0

- polynomial regression
- general regression
- trees
- center based clusters
- density based clusters
- associations
- neural nets
- logistic regression
- naïve Bayes
- sequences

Conformance

- Producer conformance
  - An application conforms as a producer if it can write valid PMML documents for at least one type of model

- Consumer conformance
  - An application conforms as a consumer if it can read valid PMML documents for at least one type of model

- Core and non-core features
  - For a given model, certain features are identified as core by the DTD and must be supported
  - Others are identified as optional

Other Data Mining Standards

- CWM DM: object model for representing data mining metadata: models, model results (UML/DTD/XML)
- SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL)
- DMG PMML: representation of data mining models for inter-vendor exchange (DTD/XML)
- JSR-073 JDMAPI: Java API for defining, creating, and applying data mining models, and obtaining their results (Java)
- OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)

10.5 Commercial & Open Source Systems

What do you do when you get home?

Data Mining and Related Systems

- SAS
- SPSS
- Splus (open source R)
- Matlab (open source Octave)
- Many other specialized systems

References

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001 - a good introduction to data mining from the systems and database perspective.

Ian H. Witten and Eibe Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, 2000 - a good introduction which includes Java tools for the common algorithms.

Ian H. Witten, Alistair Moffat and Timothy C. Bell, Managing Gigabytes, Second Edition, Morgan Kaufmann, San Diego, 1999 - a good book describing the infrastructure and theory required for working with large collections of text or images.

J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.

Predictive Model Markup Language (PMML), see www.dmg.org
