
Geoff Oxberry 18.337 Project, Spring 2009



Presentation Transcript


  1. Parallelizing the graph isomorphism portion of an automatic reaction mechanism generation algorithm Geoff Oxberry 18.337 Project, Spring 2009

  2. Automatic reaction mechanism generation yields models quickly • Reaction mechanisms are used to model chemistry in a wide range of applications • Generating first-principles reaction mechanisms can take years and requires a lot of expertise • The Bill Green group developed software (RMG) that automatically generates these models based on rules

  3. For some problems, RMG takes days to generate a mechanism • We want it to take a day or less on a cluster • Big bottleneck for us is that we have to repeatedly solve a colored graph isomorphism (GI) problem • If we can speed it up, we can solve many more interesting chemistry problems • Parallelism is one option

  4. We want to see if parallelism can be used to speed up RMG • Want to see if a parallel version of RMG is faster than a serial version • Due to time constraints, I chose to implement skeletal prototypes of serial and parallel versions of RMG in Python • Idea is to use results for the prototypes to see if it is worth parallelizing the production-scale code

  5. Parallelism does speed up RMG on intermediate-sized case studies • When searching for graph isomorphisms in collections of 20 or fewer graphs, serial code is faster • When searching for graph isomorphisms in collections of ~100 graphs, parallel code is faster • When searching for graph isomorphisms in collections of ~2000 graphs, serial code is faster again

  6. Outline • Brief overview of graph isomorphism • Discussion of existing RMG algorithm and how to parallelize • Python prototypes of serial and parallel versions of RMG algorithm • Results • Discussion of obstacles • Conclusions

  7. Two graphs are isomorphic if there exists a bijection between their nodes that preserves edges • These two graphs are isomorphic: [Figure: two five-node graphs, nodes numbered 1-5] • Bijection here (L-R): 1-1, 3-4, 2-5, 5-2, 4-3
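The definition above can be checked mechanically: a node mapping is an isomorphism when it is one-to-one and carries the edge set of one graph exactly onto the edge set of the other. The sketch below is a minimal illustration. The node mapping is the one listed on the slide, but the edge lists are hypothetical, because the original figure did not survive in the transcript.

```python
def is_isomorphism(edges_a, edges_b, mapping):
    """Return True if `mapping` is a bijection on nodes that maps
    the edges of graph A exactly onto the edges of graph B.
    Nodes are inferred from the edge lists, so isolated nodes are ignored."""
    nodes_a = {n for edge in edges_a for n in edge}
    nodes_b = {n for edge in edges_b for n in edge}
    # The mapping must cover every node of A and hit every node of B once.
    if set(mapping) != nodes_a or set(mapping.values()) != nodes_b:
        return False
    if len(set(mapping.values())) != len(mapping):
        return False
    # Undirected edges are compared as unordered pairs.
    image = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
    return image == {frozenset(edge) for edge in edges_b}


# Hypothetical edge lists (assumption: the slide's figure is not available).
left_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
right_edges = [(1, 5), (5, 4), (4, 3), (3, 2)]

# Bijection from the slide, left node -> right node.
slide_mapping = {1: 1, 3: 4, 2: 5, 5: 2, 4: 3}
identity = {n: n for n in range(1, 6)}

print(is_isomorphism(left_edges, right_edges, slide_mapping))
print(is_isomorphism(left_edges, right_edges, identity))
```

Verifying a given mapping is cheap; the expensive part of GI is finding a mapping when none is supplied.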

  8. In RMG, ChemGraphs represent species • ChemGraphs are graphs with node labels and edge labels • Species are represented by a class of graphs equivalent under isomorphism • Example (methane): [Figure: two five-node graphs of methane with different node numberings] • Node labels refer to atom types, edge labels refer to bond types
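A colored (labeled) isomorphism must also preserve atom types on nodes and bond types on edges. The brute-force sketch below tries every node permutation, which is only practical for very small graphs. It is not how igraph or RMG perform the check. The dictionary layout and the bond label "S" (single bond) are assumptions made for illustration.

```python
from itertools import permutations


def colored_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force labeled graph isomorphism test.

    nodes_*: dict mapping node id -> atom label
    edges_*: dict mapping frozenset({u, v}) -> bond label
    """
    # Cheap rejections: atom label multisets and edge counts must agree.
    if sorted(nodes_a.values()) != sorted(nodes_b.values()):
        return False
    if len(edges_a) != len(edges_b):
        return False
    keys_a = list(nodes_a)
    for perm in permutations(nodes_b):
        mapping = dict(zip(keys_a, perm))
        # Node labels (atom types) must match under the mapping.
        if any(nodes_a[n] != nodes_b[mapping[n]] for n in keys_a):
            continue
        # Every edge must map to an edge with the same bond label.
        if all(
            edges_b.get(frozenset(mapping[n] for n in edge)) == label
            for edge, label in edges_a.items()
        ):
            return True
    return False


# Methane with carbon as node 1.
methane_a_nodes = {1: "C", 2: "H", 3: "H", 4: "H", 5: "H"}
methane_a_edges = {frozenset((1, h)): "S" for h in (2, 3, 4, 5)}

# Methane again, renumbered so that carbon is node 3.
methane_b_nodes = {1: "H", 2: "H", 3: "C", 4: "H", 5: "H"}
methane_b_edges = {frozenset((3, h)): "S" for h in (1, 2, 4, 5)}

# Same atoms and edge count, but one bond joins two hydrogens
# (a hypothetical graph used only as a negative example).
other_edges = {frozenset(e): "S" for e in ((3, 1), (3, 2), (3, 4), (4, 5))}

print(colored_isomorphic(methane_a_nodes, methane_a_edges,
                         methane_b_nodes, methane_b_edges))
print(colored_isomorphic(methane_a_nodes, methane_a_edges,
                         methane_b_nodes, other_edges))
```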

  9. RMG classifies species as one of three types • Core species make up all of the reactants of the reaction mechanism • Edge species are products of the reaction mechanism not included in the core; they may be added to the core over the course of the algorithm • Postulated species are proposed species that may be added to the edge over the course of the algorithm

  10. RMG algorithm manipulates graphs to generate a reaction mechanism • [Flowchart] Initialize set of core species → Generate postulated species using some rules → Use GI to discard postulated species based on various criteria → Add remaining postulated species to edge species → Determine if any edge species should be added to core → Are termination criteria met? (No: return to generating postulated species; Yes: done)

  11. Checking for duplicate graphs using GI looks parallelizable • The step "Use GI to discard postulated species based on various criteria" consists of: use GI to check for forbidden configurations; use GI to check for duplicates among postulated species; use GI to check that postulated species aren't duplicated in core; discard any duplicates • For example, could scatter postulated species over all processors and check for duplicates against core species in parallel • Could also do this with forbidden configs, etc.

  12. Instead of working with RMG directly, I created a prototype • RMG takes 18 mos. for a developer to get up to speed; this project was ~6 wks. • To save time, I built a prototype in Python because its syntax and available libraries enable rapid development • Also enabled me to focus on the parts of the code that matter (GI algorithms) and ignore the rest

  13. Serial prototype throws out everything but GI checking • [Flowchart] Initialize set of core species → Select postulated species from existing RMG output → Use GI to discard postulated species based on various criteria → Add remaining postulated species to core species → Are termination criteria met? (No: return to selecting postulated species; Yes: done)
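One iteration of the serial prototype's discard-and-add step might look roughly like the sketch below. The function names are hypothetical, and the equivalence test is passed in as a parameter. To keep the example self-contained, species are toy strings of atom symbols and "equivalent" means "same multiset of atoms", which is only a stand-in for a real GI test.

```python
def dedupe(candidates, same):
    """Keep the first representative of each equivalence class."""
    unique = []
    for cand in candidates:
        if not any(same(cand, kept) for kept in unique):
            unique.append(cand)
    return unique


def serial_step(core, postulated, same):
    """Discard postulated species that duplicate each other or the core,
    then add the survivors to the core."""
    fresh = dedupe(postulated, same)
    fresh = [p for p in fresh if not any(same(p, c) for c in core)]
    return core + fresh


def same_formula(a, b):
    # Toy stand-in for graph isomorphism (assumption): equal atom multisets.
    return sorted(a) == sorted(b)


core = ["CHHHH"]
postulated = ["HCHHH", "OHH", "HOH", "OO"]
new_core = serial_step(core, postulated, same_formula)
print(new_core)
```

The nested `any(...)` loops make the cost of each step grow with the product of the postulated and core set sizes, which is why the GI checks dominate the run time.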

  14. Parallel prototype parallelizes part of the GI comparisons • In the prototype, the step "Use GI to discard postulated species based on various criteria" consists of: use GI to check for duplicates among postulated species; use GI in parallel to check that postulated species aren't duplicated in core; discard any duplicates • Checking postulated species against core species is embarrassingly parallel • Postulated species are essentially independent in that step
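Because each postulated species is checked against the core independently, the work can be split into chunks, one per process. The sketch below simulates that decomposition serially in plain Python so that it runs without MPI. In an mpi4py version, the three marked steps would presumably correspond to `comm.scatter`, per-rank work on a core set shared with `comm.bcast`, and `comm.gather`. The helper names and the toy equivalence test are assumptions, not the prototype's code.

```python
def split_round_robin(items, n_workers):
    """Deal items out to workers like cards: one chunk per worker."""
    return [items[rank::n_workers] for rank in range(n_workers)]


def filter_against_core(chunk, core, same):
    """Work done by one worker: drop species already present in the core."""
    return [p for p in chunk if not any(same(p, c) for c in core)]


def scatter_check_gather(postulated, core, same, n_workers):
    chunks = split_round_robin(postulated, n_workers)      # "scatter"
    partial = [filter_against_core(chunk, core, same)      # per-worker work
               for chunk in chunks]
    return [p for part in partial for p in part]           # "gather"


def same_formula(a, b):
    # Toy stand-in for graph isomorphism (assumption): equal atom multisets.
    return sorted(a) == sorted(b)


core = ["CHHHH", "OO"]
postulated = ["HCHHH", "OHH", "NN", "OO", "CO", "HH"]
survivors = scatter_check_gather(postulated, core, same_formula, n_workers=3)
print(sorted(survivors))
```

No worker needs results from any other, so the only communication is the initial distribution and the final collection. The order of the gathered result depends on the chunking, so it is compared here after sorting.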

  15. Prototypes were implemented in Python/MPI on a cluster • Software: Python 2.5 (w/ C extensions); igraph module (graph data structure, GI algorithms); mpi4py module (MPI bindings for Python) • Hardware: 64-node cluster (pharos.mit.edu); 8 GB RAM per node; each node has 2 quad-core Xeon processors (either 2.33 GHz or 2.66 GHz)

  16. Parallel prototype was faster on intermediate-sized problems • Species database was obtained from existing RMG output • Initial set of core species was 50% of database, randomly chosen • Program ran until all species in database were moved into core, or it reached 100 iterations

  17. Communication is slow in large test cases due to passing graph objects • Graphs are implemented using a class in the igraph library • mpi4py converts non-native Python objects using cPickle, which is compute-intensive • cPickle is probably why the serial code is faster in large test cases • Alternative approach would use NumPy and define an MPI derived data type; would be faster
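The alternative mentioned in the last bullet amounts to sending graphs as flat numeric buffers instead of pickled objects. The sketch below shows only the flattening idea, using the standard-library `array` module as a stand-in for NumPy, and it omits node and edge labels. The function names are hypothetical, and no MPI derived data type is defined here.

```python
from array import array


def encode_edges(edges):
    """Flatten a list of (u, v) integer pairs into a contiguous byte buffer."""
    flat = array("i")
    for u, v in edges:
        flat.extend((u, v))
    return flat.tobytes()


def decode_edges(buf):
    """Rebuild the list of (u, v) pairs from a buffer made by encode_edges."""
    flat = array("i")
    flat.frombytes(buf)
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


# Hypothetical methane-like edge list: node 0 bonded to nodes 1-4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
buf = encode_edges(edges)
print(len(buf), decode_edges(buf) == edges)
```

A buffer like this has a fixed, known layout, so it can be handed to the communication layer directly rather than being serialized object by object.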

  18. Many technical problems occurred during the project • Laptop experienced hardware failures • Difficulties installing igraph and mpi4py on pharos • System libraries had to be recompiled • Environment variables were reset so igraph and mpi4py could be recognized on all nodes • Incomplete mpi4py documentation • Python extended debugger not installed; no graphical front-end

  19. Parallelism can be used to speed up RMG for some case studies • Saw speed up for intermediate-sized case studies on parallel prototype • Additional opportunities for parallelism within RMG algorithm • Can also decrease MPI communication costs w/ additional development, use of debugger/profiler

  20. Future Work • Install extended Python debugger/profiler • Use NumPy and MPI derived data type to reduce communication overhead • Try alternative strategies for parallelization: • Reorganize algorithm (check core species, then postulated species) • Parallelize checks of postulated species against themselves

  21. Acknowledgments • RMG team: Franklin Goldsmith, Sandeep Sharma, Josh Allen, Richard West, Michael Harper, Greg Magoon, Ray Speth, Kushal Kedia, Prof. Bill Green • DOE CSGF for funding
