Data Mining for Scientific & Engineering Applications

Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify; Chandrika Kamath, Lawrence Livermore National Laboratory; Vipin Kumar, Army High Performance Research Center, University of Minnesota.

Presentation Transcript


  1. Data Mining for Scientific & Engineering Applications • Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify • Chandrika Kamath, Lawrence Livermore National Laboratory • Vipin Kumar, Army High Performance Research Center, University of Minnesota

  2. Chapter 10 – Data Mining Systems Robert Grossman, Laboratory for Advanced Computing, University of Illinois & Magnify

  3. Goals of Chapter 10 • What are the four critical interfaces in a data mining system? • Is data mining about rows or columns? • What are the standards in data mining? • What data mining systems are available?

  4. Outline 10.1 Overview of Data Mining Systems 10.2 Case Study Using a System 10.3 Managing Data for Data Mining 10.4 Data Mining Standards 10.5 Commercial and Open Source Systems

  5. 10.1 Overview of Data Mining Systems Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.

  6. Four Generations of DM Systems [Figure: layered diagrams for the First, Second, Third and Fourth Generation systems. Layer labels: data management, data mining algorithms, predictive modeling, agents & Internet, mobile data, NGI.]

  7. Layered Systems for DM & PM [Figure: two layered stacks (data management, data mining, pred. models, agents) linked by the Internet and by NGI with QoS.] • Move results and metadata: other protocols and services … • Agents can move metadata around via net • Warehouse can move data around via NGI • Move models: Predictive Model Markup Language (PMML) • Move data: DSTP, distributed databases, etc.

  8. Phases in the Data Mining & Predictive Modeling Process • Phase B, C: Warehousing • Phase D: Data Mining • Phase E: Predictive Modeling • Phase F: Deployment [Figure: stages labeled operational data, data mining mart, learning set, data mining, PM or rule set, scores; the interfaces between stages are labeled Data Mining Transformations (DXML), Data Mining Primitives (DMP) and Predictive Model Markup Language (PMML).]

  9. Four Critical Interfaces • Data Mining Transformation (DXML) • Interface between operational data and data mining mart • Data Mining Primitives (DMP) • interface between data mining mart and data mining system • Data Mining Application Interface (DM-API) • interface between data mining applications and data mining system, DMQL, OLE DB for Data Mining, … • Predictive Model Markup Language (PMML) • interface between data mining system and predictive modeling system

  10. 10.2 Building a Model

  11. Some (Selected) Steps to Build Models • Define the data schema • Clean and load the data • Define the mining schema • Compute derived attributes • Build the model • Analyze the model • Deploy the model

  12. 1. Define the Data Schema Data Types: int, double, float, date-time, string, etc.

  13. 2. Clean and Load the Data Select data schema. Select data source: text, database, etc.

  14. 3. Define the Mining Schema Select mining role: dependent, independent, excluded, key, etc. Select mining type: continuous, ordinal, categorical, binary, etc.
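The relationship between the data schema (step 1) and the mining schema (step 3) can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names `DataField` and `MiningField`, the helper functions, and the iris-style field names are hypothetical and do not come from any particular data mining system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataField:
    name: str
    data_type: str      # e.g. "int", "double", "string"

@dataclass(frozen=True)
class MiningField:
    name: str
    role: str           # "dependent", "independent", "excluded", "key"
    mining_type: str    # "continuous", "ordinal", "categorical", "binary"

# Hypothetical data schema for an iris-style table.
data_schema = [
    DataField("id", "int"),
    DataField("sepal_length", "double"),
    DataField("petal_length", "double"),
    DataField("species", "string"),
]

# Hypothetical mining schema layered on top of the data schema.
mining_schema = [
    MiningField("id", "key", "categorical"),
    MiningField("sepal_length", "independent", "continuous"),
    MiningField("petal_length", "independent", "continuous"),
    MiningField("species", "dependent", "categorical"),
]

def undefined_fields(data_schema, mining_schema):
    """Names used in the mining schema that the data schema does not define."""
    known = {f.name for f in data_schema}
    return [m.name for m in mining_schema if m.name not in known]

def model_inputs(mining_schema):
    """Fields the model may use as predictors."""
    return [m.name for m in mining_schema if m.role == "independent"]

def model_target(mining_schema):
    """The single dependent field the model predicts."""
    targets = [m.name for m in mining_schema if m.role == "dependent"]
    if len(targets) != 1:
        raise ValueError("expected exactly one dependent field")
    return targets[0]
```

Keeping the two schemas separate means the same loaded table can be reused with different roles, for example by marking a different field as dependent.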

  15. 4. Compute Derived Attributes Define petal_length/sepal_length
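A derived attribute such as petal_length/sepal_length is computed from existing fields before the model is built. The following is a minimal sketch, assuming records are plain dictionaries; the function name `add_ratio`, the output field name `petal_sepal_ratio`, and the sample values are made up for illustration.

```python
def add_ratio(records, numerator, denominator, name):
    """Return copies of the records with name = numerator / denominator.

    A zero denominator yields None rather than raising, so that one bad
    record does not abort the whole transformation.
    """
    out = []
    for rec in records:
        denom = rec[denominator]
        ratio = rec[numerator] / denom if denom else None
        out.append({**rec, name: ratio})
    return out

# Hypothetical sample rows.
rows = [
    {"sepal_length": 5.0, "petal_length": 1.5},
    {"sepal_length": 6.4, "petal_length": 4.8},
    {"sepal_length": 0.0, "petal_length": 1.0},
]

derived = add_ratio(rows, "petal_length", "sepal_length", "petal_sepal_ratio")
```

The input records are left unchanged, so the original learning set and the derived one can be compared side by side.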

  16. 5. Build the Model Select Data Store Select Mining Schema Select Parameters

  17. 5. Build the Model (cont’d) Classification tree.

  18. 5. Build the Model - Tuning Select Model Select Parameters

  19. 6. Analyze the Model Analyze how well the model predicted class labels.
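One common way to analyze how well a classifier predicted class labels is a confusion matrix together with overall accuracy. The sketch below assumes the actual and predicted labels are available as two parallel lists; the label values are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual label, predicted label) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    if not actual:
        raise ValueError("no records to evaluate")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for six held-out records.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

cm = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
```

Off-diagonal entries such as `cm[("versicolor", "virginica")]` show which classes the model confuses, which a single accuracy figure hides.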

  20. 7. Deploy the Model Move PMML files to scoring engine.

  21. 10.3 Physical Data Management Arranging data by record and by attribute; data mining primitives.

  22. B+ Trees • Reading a single record from disk costs the same as reading the whole block of records that contains it • Use variants of techniques from databases to lower the cost of accessing out-of-memory data • There are a variety of tree-based methods for efficiently indexing blocks of data, such as B+ trees

  23. Horizontal vs. Vertical [Figure: a terabyte of complex objects stored in a horizontal layout and in a vertical layout. Each layout is annotated with the query "Select all objects where [attribute] is less than 10."]

  24. Thinking about Columns

      Horizontal
      NC   Mb/s   GB      Sec     Events/s
       1     3    4.4    11775        64
       4    10    4.4     3590       655
       8    17    4.4     2132      2811
      16    23    4.4     1551      7731

      Vertical
      NC   Mb/s   GB      Sec     Events/s
       1     1    0.27    1549       400
       4     4    0.27     566      4377
       8     7    0.27     320     15482
      16    10    0.27     223     44590
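Why a vertical (by attribute) layout reads less data for a single-attribute query can be shown with a toy in-memory sketch. It counts the values each layout has to touch and is not a model of the measured system. The attribute names `energy` and `charge` and the data are invented for the example.

```python
def to_columns(rows):
    """Convert a horizontal layout (list of records) to a vertical one."""
    return {name: [r[name] for r in rows] for name in rows[0]}

def select_horizontal(rows, attr, limit):
    """Scan whole records; every attribute of every record is touched."""
    hits, touched = [], 0
    for r in rows:
        touched += len(r)
        if r[attr] < limit:
            hits.append(r["id"])
    return hits, touched

def select_vertical(columns, attr, limit):
    """Scan only the queried column, then look up ids for the matches."""
    col = columns[attr]
    touched = len(col)
    hits = [columns["id"][i] for i, v in enumerate(col) if v < limit]
    return hits, touched

# Hypothetical objects with three attributes each.
rows = [{"id": i, "energy": float(i % 20), "charge": i % 3} for i in range(100)]
columns = to_columns(rows)

h_hits, h_touched = select_horizontal(rows, "energy", 10)
v_hits, v_touched = select_vertical(columns, "energy", 10)
```

Both layouts return the same objects, but the horizontal scan touches every attribute of every record, so the gap grows with the number of attributes per object.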

  25. Data Mining Primitives • For many algorithms, data infrastructure only needs to supply: (Attribute Id, Attribute Value, Class Value, Count) • Specialized data structures can be created to do this. • SQL databases can be extended to do this.
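The (Attribute Id, Attribute Value, Class Value, Count) tuples can be produced in a single pass over the data. Below is a minimal sketch using an in-memory counter; the function name `count_primitives` and the golfing-style records are assumptions made for illustration, and a real system would compute the same counts inside a specialized data structure or a database.

```python
from collections import Counter

def count_primitives(records, attributes, class_attr):
    """Count occurrences of each (attribute id, attribute value, class value)."""
    counts = Counter()
    for rec in records:
        for attr in attributes:
            counts[(attr, rec[attr], rec[class_attr])] += 1
    return counts

# Hypothetical learning set.
records = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": False, "play": "yes"},
    {"outlook": "rain",     "windy": True,  "play": "no"},
]

counts = count_primitives(records, ["outlook", "windy"], "play")

# Flatten to the four-field form named on the slide.
primitives = [(a, v, c, n) for (a, v, c), n in counts.items()]
```

Class-conditional counts of this kind are what a naive Bayes classifier or a tree split criterion over categorical attributes consumes, so the algorithm itself never needs to see individual records.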

  26. 10.4 Data Mining Standards See www.dmg.org for more information. Following R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, P. Hallstrom, I. Pulleyn and X. Qin, The Management and Mining of Multiple Predictive Models Using the Predictive Modeling Markup Language (PMML), Information and Software Technology, 1999.

  27. Predictive Model Markup Language (PMML) • Current Version 2.0 • Products shipping with PMML Version 1.1 • PMML Working Group Full Members • IBM, Magnify, MineIt, NCR, Oracle, Salford Systems, SPSS, xChange, University of Illinois at Chicago • PMML Working Group Supporting Members • Angoss, Insightful, KXEN, Microsoft, SGI … • Part of xml.org Repository & Source Forge

  28. agents agents Internet pred. models pred. models data mining data mining NGI with QoS data management data management Layered Systems for DM & PM • Agents can move metadata around via net • Warehouse can move data around via NGI Move models: Predictive Model Markup Language (PMML)

  29. Point of View • View data mining as: • 1. Extract a learning set from a data warehouse • 2. Apply a data mining algorithm • 3. Produce a statistical model, data mining model or rule set. <PMML version="1.1"> <TreeModel ModelName="response" etc.> <Node frequency="freq_12_month"> etc. </TreeModel> </PMML>

  30. Problems with Current Techniques • Models are deployed in proprietary formats • Models are application dependent • Models are system dependent • Models are architecture dependent • Time required to integrate models with other applications can be long.

  31. High Performance Data Mining & PMML 1. Scatter the query. 2. Compute the classifiers independently. 3. Gather and merge the PMML files. [Figure: partition 1, partition 2 and partition 3 each produce PMML.]
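The scatter, compute, gather pattern can be sketched as follows. This is an illustration under strong assumptions: plain Python dictionaries stand in for the PMML files, each partition builds a trivial one-attribute rule model, and the merge step is a majority vote. The slides do not say how the PMML files are merged, so voting is just one possible choice, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def scatter(records, n_partitions):
    """Step 1: split the learning set into n round-robin partitions."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def build_rule_model(partition, attr, class_attr):
    """Step 2: per partition, map each attribute value to its majority class."""
    by_value = defaultdict(Counter)
    for rec in partition:
        by_value[rec[attr]][rec[class_attr]] += 1
    return {value: c.most_common(1)[0][0] for value, c in by_value.items()}

def merge_vote(models, value, default=None):
    """Step 3: gather the models and let them vote on a prediction."""
    votes = Counter(m[value] for m in models if value in m)
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical learning set of (outlook, play) records.
pairs = [("sunny", "no")] * 3 + [("overcast", "yes")] * 3 + \
        [("rain", "yes"), ("rain", "yes"), ("rain", "no")]
records = [{"outlook": o, "play": p} for o, p in pairs]

partitions = scatter(records, 3)
# Each call depends only on its own partition, so these could run on separate nodes.
models = [build_rule_model(p, "outlook", "play") for p in partitions]
```

Because only the small model descriptions are gathered, the data itself never has to leave the node that holds its partition.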

  32. Distributed DM & PMML [Figure: data warehouses holding data in Chicago and in Amsterdam each feed a data mining system; the PMML each one produces is combined in a predictive modeling system.]

  33. Example: PMML <TreeModel modelName="golfing"> <MiningSchema> <MiningField name="temperature"/> <MiningField name="humidity"/> … </MiningSchema> <Node score="play"> <Predicate field="outlook" operator="equal" value="sunny"/> <Node score="play"> <CompoundPredicate booleanOperator="and" > <Predicate field="temperature" operator="lessThan" value="90F" />
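A consumer of such a tree model needs only an XML parser and a small interpreter for the predicates. The sketch below uses Python's standard `xml.etree.ElementTree` on a PMML-like document. The slide's fragment is truncated, so the document here is a hypothetical completion that reuses the element and attribute names shown on the slide (with a plain numeric threshold instead of "90F"); it has not been validated against the PMML DTD, and the scoring rule is a simplification.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document in the style of the slide's fragment.
DOC = """
<TreeModel modelName="golfing">
  <MiningSchema>
    <MiningField name="outlook"/>
    <MiningField name="temperature"/>
    <MiningField name="humidity"/>
  </MiningSchema>
  <Node score="play">
    <Node score="no play">
      <Predicate field="outlook" operator="equal" value="sunny"/>
      <Node score="play">
        <CompoundPredicate booleanOperator="and">
          <Predicate field="temperature" operator="lessThan" value="90"/>
          <Predicate field="humidity" operator="lessThan" value="80"/>
        </CompoundPredicate>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

def holds(pred, record):
    """Evaluate a Predicate or CompoundPredicate element against a record."""
    if pred is None:                      # a node without a predicate always matches
        return True
    if pred.tag == "CompoundPredicate":
        results = [holds(p, record) for p in pred]
        return all(results) if pred.get("booleanOperator") == "and" else any(results)
    actual, value, op = record[pred.get("field")], pred.get("value"), pred.get("operator")
    if op == "equal":
        return str(actual) == value
    if op == "lessThan":
        return float(actual) < float(value)
    raise ValueError(f"unsupported operator: {op}")

def node_predicate(node):
    """Return the node's own predicate element, if it has one."""
    for child in node:
        if child.tag in ("Predicate", "CompoundPredicate"):
            return child
    return None

def score(node, record):
    """Descend into the first matching child; return the deepest node's score."""
    for child in node.findall("Node"):
        if holds(node_predicate(child), record):
            return score(child, record)
    return node.get("score")

model = ET.fromstring(DOC.strip())
fields = [f.get("name") for f in model.find("MiningSchema")]
root_node = model.find("Node")
```

For example, `score(root_node, {"outlook": "sunny", "temperature": 85, "humidity": 70})` walks from the root to the innermost node. The scoring code contains nothing specific to the golfing model, which is the point of exchanging the model as a document rather than as code.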

  34. Predictive Model Markup Language (PMML) • Based on XML • Benefits of PMML • Open standard for Data Mining & Statistical Models • Not concerned with the process of creating a model • Provides independence from application, platform, and operating system • Simplifies use of data mining models by other applications (consumers of data mining models)

  35. Philosophy • Very important to understand what PMML is not concerned with … • PMML is a specification of a model, not an implementation of a model • PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations • Also, PMML includes the metadata required to deploy models

  36. PMML Document Structure • PMML Documents • Data dictionary • Transformation dictionary • One or more PMML models • Support for taxonomies/hierarchies • PMML Model • Mining Schema • Univariate statistics (ModelStats) • Optional extensions

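  37. PMML Producers, Consumers, & Data Flow [Figure: data flow between PMML producers and PMML consumers. Producer side: Data Mining Warehouse and Data Mining System, with labels dataFields, miningFields and learning sets. Consumer side: Operational Data and Campaign Manager, with labels miningFields, derivedFields and campaigns. PMML models pass from producers to consumers.]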

  38. Data Flow - Recap • Data Dictionary defines data • Mining Schema defines specific inputs (MiningFields) required for model • Transformation Dictionary defines optional additional derived fields • Two types of attributes: • attributes defined by the mining schema • derived attributes defined via transformations • Models themselves can also support certain transformations

  39. Models in PMML v2.0 • polynomial regression • general regression • trees • center based clusters • density based clusters • associations • neural nets • logistic regression • naïve Bayes • sequences

  40. Conformance • Producer conformance • An application is a conforming producer if it can write valid PMML documents for at least one type of model • Consumer conformance • An application is a conforming consumer if it can read valid PMML documents for at least one type of model • Core and non-core features • For a given model, certain features are identified as core by the DTD and must be supported • Others are identified as optional

  41. Other Data Mining Standards • OMG CWM DM: Object model for representing data mining metadata: models, model results (UML/DTD/XML) • SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL) • DMG PMML: Representation of data mining models for inter-vendor exchange (DTD/XML) • JSR-073 JDMAPI: Java API for defining, creating, and applying data mining models, and obtaining their results (Java) • OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)

  42. 10.5 Commercial & Open Source Systems What do you do when you get home?

  43. Data Mining and Related Systems • SAS • SPSS • Splus (open source R) • Matlab (open source Octave) • Many other specialized systems

  44. References • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2001 - a good introduction to data mining from the systems and database perspective. • Ian H. Witten and Eibe Frank, Data Mining, Morgan Kaufmann Publishers, San Francisco, 2000 - a good introduction which includes Java tools for the common algorithms. • Ian H. Witten, Alistair Moffat and Timothy C. Bell, Managing Gigabytes, Second Edition, Morgan Kaufmann, San Diego, 1999 - a good book describing the infrastructure and theory required for working with large collections of text or images. • J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993. • Predictive Model Markup Language (PMML), see www.dmg.org
