Enhancing Data Mining Parallelism: Structural Insights

Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.

Who am I? • Min Cai: cmin@cs.okstate.edu • Ph.D. Student of Computer Science Department at Oklahoma State University • Research Interests: • Parallel and distributed computing • Data Mining

Introduction • Data warehouses of increasing size • Data mining  technique for discovering interesting properties in data • structural data mining • data represented as a graph • aim  substructure discovery  finding “interesting” and recurring subgraphs in a labeled graph

SUBDUE (1) • Discovers substructures utilizing minimum description length (MDL) principle • Cook, D.J., Holder, L.B., G alal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446 • Data objects  graph vertices • Relationships  graph edges • Substructure  connected subgraph • NOTE  graph algorithms are notorious for long execution times

SUBDUE (2) • Algorithm  two basic steps • substructure discovery • apply minimal description length (MDL) principle to find the “best” / “most important” structure in the graph • possibly stop here  this is the answer • substructure replacement • replace the substructure found in the first step by a single vertex and repeat the process • results • single substructure • hierarchy of substructures

Parallel SUBDUE • Data-parallel approach • Graph divided into subgraphs and send to separate processor • Processors find their best structure and communicate with the rest • The best overall substructure is found • Hierarchical process can be repeated

MPI-SUBDUE • Graph divided into subgraphs using METIS • point-to-point communications (MPI_Send and MPI_Recv) used to communicate between processors • NOTE  best structure in data set “7” may be dreadful when confronted with data set “18” • Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism In the Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174

NEW-MPI-SUBDUE • Improvements • use PARMETIS to divide the initial graph • use global communication (MPI_Allgatherv) • use binary summation • YES, these changes do not look like much 

NEW-MPI-SUBDUE Spawn P(0), P(1), P(2), ... , P(n) Apply PARMETIS to partition G into n partitions for all P(i) where 1 ≤ i ≤ n do discover the best substructure in partition broadcast best substructure to all other processors evaluate best substructure and broadcast results parallel-binary summation of results to find the best overall partition P(0) finds the best overall structure

EXPERIMENTAL SETUP • Mutagenesis data from OxUni • datasets collected in order to predict mutagenicity of aromatic and heteroaromatic nitro compounds • Graph 1 • 2844 vertices and 2883 edges • Graph 2 • 2896 vertices and 2934 edges • Graph 3 • 22268 vertices and 22823 edges • 16 node cluster (32 processors) • two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM, full-backplane Gigabit Ethernet switch • RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2

EXPERIMENTAL RESULTS I

EXPERIMENTAL RESULTS II

EXPERIMENTAL REULTS III

COMMENTS • Graphs 1 and 2 that were large in 2000 are small and “useless” today • Graph 3 gives realistic performance picture  gains about 33% • Speedup over original SUBDUE  268 on 32 processors for Graph 3 • this IS “cheating” as some information may be lost due to graph partitioning but… • Graph partitioning and balancing matter

THE END THANK YOU!

Enhancing Data Mining Parallelism: Structural Insights

Enhancing Data Mining Parallelism: Structural Insights

Presentation Transcript

data parallelism

Data mining in practice

Clustering in Data Mining

Optimization in Data Mining

Data Mining in Banking

Data Mining in Politics

Chapel: Data Parallelism

Data mining in Wikipedia

Nested data parallelism in Haskell

Improving Structural Understanding in Architecture

ADVENTURES IN DATA MINING

Ligand search and data mining of Structural Genomics structures

Ligand search and data mining of Structural Genomics structures

Improving Data Mining of FFATA Awards Data Base

Data Mining: Data

Hierarchies in Data Mining

Data Mining in Counterterrorism

Nested data parallelism in Haskell

PARALLELISM PARALLELISM PARALLELISM

Data Mining: Data