
Improving Parallelism in Structural Data Mining

Min Cai, Istvan Jonyer, Marcin Paprzycki. Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A. Min Cai: cmin@cs.okstate.edu



  1. Improving Parallelism in Structural Data Mining
     Min Cai, Istvan Jonyer, Marcin Paprzycki
     Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.

  2. Who am I?
     • Min Cai: cmin@cs.okstate.edu
     • Ph.D. student in the Computer Science Department at Oklahoma State University
     • Research interests:
       • Parallel and distributed computing
       • Data mining

  3. Introduction
     • Data warehouses of increasing size
     • Data mining: technique for discovering interesting properties in data
     • Structural data mining
       • data represented as a graph
       • aim: substructure discovery, i.e. finding “interesting” and recurring subgraphs in a labeled graph

  4. SUBDUE (1)
     • Discovers substructures using the minimum description length (MDL) principle
     • Cook, D.J., Holder, L.B., Galal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446
     • Data objects: graph vertices
     • Relationships: graph edges
     • Substructure: connected subgraph
     • NOTE: graph algorithms are notorious for long execution times

  5. SUBDUE (2)
     • Algorithm: two basic steps
       • substructure discovery
         • apply the minimum description length (MDL) principle to find the “best” / “most important” structure in the graph
         • possibly stop here: this is the answer
       • substructure replacement
         • replace the substructure found in the first step by a single vertex and repeat the process
     • Results
       • single substructure
       • hierarchy of substructures
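The discover-then-replace loop on this slide can be sketched on a toy graph. This is only an illustration, not SUBDUE itself: candidates are limited to single labeled edges, and a crude size ratio (vertices plus edges before and after compression) stands in for the real MDL measure. All function names, the `SUB_k` vertex labels, and the integer vertex ids are assumptions made for the example.

```python
def size(vertices, edges):
    # Crude stand-in for description length: count vertices and edges.
    return len(vertices) + len(edges)

def pattern_of(vertices, edge):
    # A candidate substructure here is one labeled edge: (label, edge label, label).
    u, v, elabel = edge
    a, b = sorted((vertices[u], vertices[v]))
    return (a, elabel, b)

def find_instances(vertices, edges, pattern):
    # Greedily collect non-overlapping edges that match the pattern.
    used, hits = set(), []
    for i, (u, v, _) in enumerate(edges):
        if u != v and u not in used and v not in used \
                and pattern_of(vertices, edges[i]) == pattern:
            hits.append(i)
            used.update((u, v))
    return hits

def compress(vertices, edges, pattern, name):
    # Replace every instance of the pattern by a single vertex labeled `name`.
    hits = find_instances(vertices, edges, pattern)
    new_vertices, remap = dict(vertices), {}
    next_id = max(vertices) + 1          # assumes integer vertex ids
    for i in hits:
        u, v, _ = edges[i]
        del new_vertices[u], new_vertices[v]
        new_vertices[next_id] = name
        remap[u] = remap[v] = next_id
        next_id += 1
    hit_set = set(hits)
    new_edges = [(remap.get(u, u), remap.get(v, v), label)
                 for i, (u, v, label) in enumerate(edges) if i not in hit_set]
    return new_vertices, new_edges

def best_pattern(vertices, edges):
    # Keep the pattern with the highest compression ratio, if any exceeds 1.
    best, best_score = None, 1.0
    for pattern in sorted({pattern_of(vertices, e) for e in edges if e[0] != e[1]}):
        nv, ne = compress(vertices, edges, pattern, "tmp")
        # The pattern itself costs 2 vertices + 1 edge to describe.
        score = size(vertices, edges) / (3 + size(nv, ne))
        if score > best_score:
            best, best_score = pattern, score
    return best

def discover_hierarchy(vertices, edges, max_levels=3):
    # Repeat discovery and replacement until nothing compresses the graph.
    found = []
    for level in range(1, max_levels + 1):
        pattern = best_pattern(vertices, edges)
        if pattern is None:
            break
        found.append(pattern)
        vertices, edges = compress(vertices, edges, pattern, f"SUB_{level}")
    return found

toy_vertices = {0: "C", 1: "O", 2: "C", 3: "O", 4: "C", 5: "O", 6: "N"}
toy_edges = [(0, 1, "bond"), (2, 3, "bond"), (4, 5, "bond"),
             (1, 6, "link"), (3, 6, "link"), (5, 6, "link")]
hierarchy = discover_hierarchy(toy_vertices, toy_edges)
```

Stopping after the first level corresponds to the "single substructure" result; letting the loop run gives the hierarchy, one entry per level.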

  6. Parallel SUBDUE
     • Data-parallel approach
     • Graph is divided into subgraphs, each sent to a separate processor
     • Each processor finds its best substructure and communicates it to the rest
     • The best overall substructure is found
     • The hierarchical process can be repeated

  7. MPI-SUBDUE
     • Graph divided into subgraphs using METIS
     • Point-to-point communication (MPI_Send and MPI_Recv) used between processors
     • NOTE: the best structure in data set “7” may be dreadful when confronted with data set “18”
     • Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174
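To make the cost of dividing the graph concrete, the sketch below splits an edge list according to a given vertex-to-partition assignment and reports which edges cross partition boundaries. It does not call METIS; the assignment is supplied by hand, and the names `split_by_partition` and `part_of` are hypothetical. Treating the crossing edges as unavailable to every processor is an assumption used here to illustrate how partitioning can lose information.

```python
def split_by_partition(edges, part_of):
    """Group edges by partition; edges whose endpoints differ are 'cut'.

    edges   : list of (u, v) pairs
    part_of : dict mapping each vertex to its partition id (assumed given,
              e.g. produced by a graph partitioner)
    """
    local, cut = {}, []
    for u, v in edges:
        if part_of[u] == part_of[v]:
            local.setdefault(part_of[u], []).append((u, v))
        else:
            cut.append((u, v))   # seen by no single processor
    return local, cut

# A six-vertex path split into two halves: only the middle edge is cut.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
local_edges, cut_edges = split_by_partition(path_edges, assignment)
```

A partitioner that keeps the cut small while balancing partition sizes limits both the lost edges and the load imbalance between processors.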

  8. NEW-MPI-SUBDUE
     • Improvements
       • use PARMETIS to divide the initial graph
       • use global communication (MPI_Allgatherv)
       • use binary summation
     • YES, these changes do not look like much
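As an illustration of the binary summation idea, the following sketch adds per-processor score vectors pairwise in a tree pattern, so that p vectors are combined in about log2(p) rounds instead of p - 1 sequential additions. It runs in a single process and only mimics the communication pattern; the function name `tree_sum` and the list-of-lists layout are assumptions, not the authors' implementation.

```python
def tree_sum(vectors):
    """Sum equal-length vectors pairwise; return (total, number of rounds).

    In round k, slot i absorbs slot i + 2**k, so slot 0 ends up holding
    the sum of all vectors. Requires at least one vector.
    """
    vals = [list(v) for v in vectors]
    step, rounds = 1, 0
    while step < len(vals):
        for i in range(0, len(vals) - step, 2 * step):
            vals[i] = [a + b for a, b in zip(vals[i], vals[i + step])]
        step *= 2
        rounds += 1
    return vals[0], rounds

# Three "processors", each holding scores for the same two candidates.
total, rounds = tree_sum([[1, 2], [3, 4], [5, 6]])
```

With 32 processors, as in the cluster used later in the talk, this pattern needs 5 rounds.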

  9. NEW-MPI-SUBDUE
     Spawn P(0), P(1), P(2), ... , P(n)
     Apply PARMETIS to partition G into n partitions
     for all P(i) where 1 ≤ i ≤ n do
         discover the best substructure in partition
         broadcast best substructure to all other processors
         evaluate best substructure and broadcast results
         parallel-binary summation of results to find the best overall partition
     P(0) finds the best overall structure
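The selection scheme in this pseudocode can be sketched in a single process. Each partition is reduced to a table of pattern counts (a stand-in for real substructure evaluation), a plain Python list plays the role of the all-gather, and a column-wise sum stands in for the parallel binary summation. The function name, the use of occurrence counts as the score, and the sample data are all assumptions made for the illustration.

```python
from collections import Counter

def select_global_best(partition_counts):
    """partition_counts: one non-empty Counter (pattern -> occurrences) per partition."""
    # Step 1: every "processor" proposes its locally best pattern.
    local_best = [counts.most_common(1)[0][0] for counts in partition_counts]
    # Step 2: all-gather, so every processor sees every proposal.
    candidates = sorted(set(local_best))
    # Step 3: every processor scores every candidate on its own partition.
    local_scores = [[counts[p] for p in candidates] for counts in partition_counts]
    # Step 4: combine the per-partition scores (stand-in for binary summation).
    totals = [sum(column) for column in zip(*local_scores)]
    # Step 5: the root picks the candidate with the best combined score.
    best_index = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best_index], dict(zip(candidates, totals))

partitions = [
    Counter({"A": 5, "B": 4}),   # "A" looks best here ...
    Counter({"B": 4, "A": 1}),
    Counter({"B": 3, "C": 2}),
]
winner, combined = select_global_best(partitions)
```

In this toy data, pattern "A" wins in the first partition but scores poorly elsewhere, which is why each proposal is re-evaluated on all partitions before the overall best is chosen.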

  10. EXPERIMENTAL SETUP
     • Mutagenesis data from OxUni
       • datasets collected in order to predict mutagenicity of aromatic and heteroaromatic nitro compounds
     • Graph 1: 2844 vertices and 2883 edges
     • Graph 2: 2896 vertices and 2934 edges
     • Graph 3: 22268 vertices and 22823 edges
     • 16-node cluster (32 processors)
       • two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM, full-backplane Gigabit Ethernet switch
       • RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2

  11. EXPERIMENTAL RESULTS I

  12. EXPERIMENTAL RESULTS II

  13. EXPERIMENTAL RESULTS III

  14. COMMENTS
     • Graphs 1 and 2, which were large in 2000, are small and “useless” today
     • Graph 3 gives a realistic performance picture: gains of about 33%
     • Speedup over original SUBDUE: 268 on 32 processors for Graph 3
       • this IS “cheating”, as some information may be lost due to graph partitioning, but…
     • Graph partitioning and balancing matter
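A quick calculation shows why the reported speedup is flagged as “cheating”. Using the standard definition of parallel efficiency (speedup divided by processor count), a speedup of 268 on 32 processors is far above the ideal value of 1. The helper name below is an assumption; the two input numbers come from the slide.

```python
def parallel_efficiency(speedup, processors):
    # Efficiency 1.0 means ideal linear scaling; above 1.0 is superlinear.
    return speedup / processors

efficiency = parallel_efficiency(268, 32)   # 8.375
```

An efficiency above 1 means the parallel runs are not doing the same work as the sequential run, which is consistent with the slide's remark that partitioning may discard some information.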

  15. THE END THANK YOU!
