System Support for High Performance Scientific Data Mining



Presentation Transcript


  1. System Support for High Performance Scientific Data Mining Gagan Agrawal Ruoming Jin Raghu Machiraju S. Parthasarathy Department of Computer and Information Sciences Ohio State University

  2. Scientific Data Mining Problem • Datasets used for scientific data mining are large – particularly from simulations • Our understanding of what algorithms and parameters will give desired insights is limited • Time required for implementing different algorithms and running them with different parameters on large datasets slows down the scientific data mining process

  3. Project Overview • FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system • Already demonstrated for a variety of standard mining algorithms • Currently being applied to feature analysis and mining of simulation data

  4. FREERIDE offers: • The ability to rapidly prototype a high-performance mining implementation • Distributed memory parallelization • Shared memory parallelization • Ability to process large, disk-resident datasets • All three of the above with only modest modifications to a sequential implementation

  5. Key Observation from Mining Algorithms • Popular algorithms have a common canonical loop • Can be used as the basis for supporting a common middleware

     While( ) {
       forall( data instances d ) {
         I = process(d)
         R(I) = R(I) op d
       }
       …….
     }
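The canonical loop on slide 5 can be sketched in Python as follows. This is a minimal illustration, not FREERIDE's actual interface: the names `generalized_reduction`, `nearest_center`, and `accumulate` are hypothetical, and the example covers only one pass of the inner `forall` (the outer `While` would repeat it until the algorithm converges). It uses one assignment pass of k-means on 1-D points, where the reduction object `R` holds a running (sum, count) per cluster.

```python
from collections import defaultdict


def generalized_reduction(data, process, op, init):
    """One pass of the canonical loop: for each d, R(I) = R(I) op d."""
    R = defaultdict(init)        # reduction object, indexed by I
    for d in data:               # forall(data instances d)
        I = process(d)           # which element of R this instance updates
        R[I] = op(R[I], d)       # R(I) = R(I) op d
    return R


# Hypothetical use: one k-means assignment pass over 1-D points.
centers = [0.0, 10.0]


def nearest_center(d):
    """Index of the center closest to point d."""
    return min(range(len(centers)), key=lambda i: abs(d - centers[i]))


def accumulate(acc, d):
    """Fold point d into a running (sum, count) pair."""
    total, count = acc
    return (total + d, count + 1)


points = [1.0, 2.0, 9.0, 11.0]
R = generalized_reduction(points, nearest_center, accumulate, lambda: (0.0, 0))

# New centers are the per-cluster means derived from the reduction object.
new_centers = [R[i][0] / R[i][1] for i in sorted(R)]
print(new_centers)
```

Because each data instance only touches `R` through `op`, the same loop body can serve different algorithms by swapping `process` and `op`, which is the property a common middleware can exploit.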

  6. Performance of Shared Memory Parallelization • K-means clustering (chart not reproduced in transcript)

  7. Performance on Cluster of SMPs • Apriori Association Mining (chart not reproduced in transcript)

  8. SPIES On (a) FREERIDE • Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES) • Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume • Does not require sorting of data, or partitioning and writing-back of records

  9. Broader Research Agenda

  10. Applying FREERIDE for Scientific Data Mining • Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al. • A feature is a region of interest in a dataset • The approach comprises a suite of algorithms for extracting and tracking features

  11. A Feature Analysis Algorithm [Diagram; labels recovered in transcript order: Data Transform Tour Grid Operator Aggregate Classify Points Denoise Track Rank Catalog ROIs Classify-Aggregate]

  12. Ongoing Work – Parallelization Using FREERIDE • Most of the steps involve generalized reductions, which FREERIDE supports well • The aggregation and tracking steps require extensions to FREERIDE • Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm
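As a rough illustration of why generalized reductions parallelize well, the sketch below splits the data into chunks, builds a private reduction object per worker, and merges the partial results at the end. It assumes the reduction operator is associative and commutative (here, integer addition of item counts, as in the counting step of association mining). The function names and the use of Python threads are assumptions made for the example; they do not reflect FREERIDE's API or its actual parallelization techniques.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def local_reduction(chunk):
    """Build a private reduction object (item counts) for one chunk."""
    counts = Counter()
    for transaction in chunk:
        for item in transaction:
            counts[item] += 1
    return counts


def merge(a, b):
    """Combine two partial reduction objects; addition is order-independent."""
    return a + b


def parallel_item_counts(transactions, n_workers=2):
    """Reduce each chunk independently, then merge the partial results."""
    # Round-robin split so each worker gets roughly the same amount of data.
    chunks = [transactions[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_reduction, chunks))
    return reduce(merge, partials, Counter())


transactions = [("a", "b"), ("a", "c"), ("b", "c"), ("a",)]
counts = parallel_item_counts(transactions)
print(counts)
```

Since no worker reads another worker's reduction object until the final merge, the result does not depend on how the data is partitioned. The same structure applies whether the chunks live on different threads, different nodes, or different disk blocks.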