
Enabling Rapid Development of Parallel Tree-Search Applications

Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis. Jeffrey P. Gardner, Andrew Connolly, Cameron McBride. Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University.


Presentation Transcript


  1. Enabling Rapid Development of Parallel Tree-Search Applications: Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis. Jeffrey P. Gardner, Andrew Connolly, Cameron McBride. Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University.

  2. How to turn astrophysics simulation output into scientific knowledge, using 300 processors (circa 1995): Step 1: Run simulation. Step 2: Analyze simulation on workstation. Step 3: Extract meaningful scientific knowledge. (happy scientist)

  3. How to turn astrophysics simulation output into scientific knowledge, using 1000 processors (circa 2000): Step 1: Run simulation. Step 2: Analyze simulation on server (in serial). Step 3: Extract meaningful scientific knowledge. (happy scientist)

  4. How to turn astrophysics simulation output into scientific knowledge, using 4000+ processors (circa 2006): Step 1: Run simulation. Step 2: Analyze simulation on ??? (unhappy scientist)

  5. Exploring the Universe can be (Computationally) Expensive • The size of simulations is no longer limited by computational power • It is limited by the parallelizability of data analysis tools • This situation will only get worse in the future.

  6. How to turn astrophysics simulation output into scientific knowledge, using ~1,000,000 cores? (circa 2012): Step 1: Run simulation. Step 2: Analyze simulation on ??? By 2012, we will have machines with many hundreds of thousands of cores!

  7. The Challenge of Data Analysis in a Multiprocessor Universe • Parallel programs are difficult to write! • Steep learning curve for parallel programming • Parallel programs are expensive to write! • Lengthy development time • Parallel world is dominated by simulations: • Code is often reused for many years by many people • Therefore, you can afford to invest lots of time writing the code. • Example: GASOLINE (a cosmology N-body code) • Required 10 FTE-years of development

  8. The Challenge of Data Analysis in a Multiprocessor Universe • Data analysis does not work this way: • Rapidly changing scientific inquiries • Less code reuse • Simulation groups do not even write their analysis code in parallel! • The data mining paradigm mandates rapid software development!

  9. How to turn observational data into scientific knowledge: Observe at Telescope (circa 1990). Step 1: Collect data. Step 2: Analyze data on workstation. Step 3: Extract meaningful scientific knowledge. (happy astronomer)

  10. How to turn observational data into scientific knowledge: Use Sky Survey Data (circa 2005). Sloan Digital Sky Survey (500,000 galaxies). Step 1: Collect data. Step 2: Analyze data on ??? 3-point correlation function: ~200,000 node-hours of computation. (unhappy astronomer)

  11. How to turn observational data into scientific knowledge: Use Sky Survey Data (circa 2012). Large Synoptic Survey Telescope (2,000,000 galaxies). 3-point correlation function: ~several petaflop-weeks of computation.

  12. Tightly-Coupled Parallelism(what this talk is about) • Data and computational domains overlap • Computational elements must communicate with one another • Examples: • Group finding • N-Point correlation functions • New object classification • Density estimation

  13. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe • Build a library that is: • Sophisticated enough to take care of all of the nasty parallel bits for you. • Flexible enough to be used for your own particular astrophysics data analysis application. • Scalable: scales well to thousands of processors.

  14. The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe • Astrophysics uses dynamic, irregular data structures: • Astronomy deals with point-like data in an N-dimensional parameter space • The most efficient methods for this kind of data use space-partitioning trees • The most common data structure is a kd-tree (a minimal node layout is sketched below) • Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements
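To make the data structure concrete, here is a minimal sketch of a kd-tree node for point-like data in a parameter space. The field names and layout are illustrative assumptions for this transcript, not N tropy's actual cell format.

    /* A minimal kd-tree node for point-like data in an N-dimensional
     * parameter space.  Illustrative sketch only; not N tropy's cell layout. */
    #include <stddef.h>

    #define NDIM 3                    /* dimensionality of the parameter space */

    typedef struct KDNode {
        double min[NDIM];             /* lower corner of the cell's bounding box */
        double max[NDIM];             /* upper corner of the cell's bounding box */
        int    split_dim;             /* dimension along which the cell is split */
        double split_val;             /* coordinate of the splitting plane       */
        size_t first, last;           /* index range of particles in this cell   */
        int    left, right;           /* child cell indices, -1 for a leaf       */
    } KDNode;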

  15. Challenges for scalable parallel application development: • Things that make parallel programs difficult to write: • Work orchestration • Data management • Things that inhibit scalability: • Granularity (synchronization, consistency) • Load balancing • Data locality • Structured data • Memory consistency

  16. Overview of existing paradigms: DSM • There are many existing distributed shared-memory (DSM) tools. • Compilers: • UPC • Co-Array Fortran • Titanium • ZPL • Linda • Libraries: • Global Arrays • TreadMarks • IVY • JIAJIA • Strings • Mirage • Munin • Quarks • CVM

  17. Overview of existing paradigms: DSM • The Good: These are quite simple to use. • The Good: Can manage data locality pretty well. • The Bad: Existing DSM approaches tend not to scale very well because of fine granularity. • The Ugly: Almost none support structured data (like trees).

  18. Overview of existing paradigms: DSM • There are some DSM approaches that do lend themselves to structured data: • e.g. Linda (tuple-space) • The Good: Almost universally flexible • The Bad: These tend to scale even worse than simple unstructured DSM approaches • Granularity is too fine

  19. Challenges for scalable parallel application development (DSM): • Things that make parallel programs difficult to write: • Work orchestration • Data management • Things that inhibit scalability: • Granularity • Load balancing • Data locality

  20. Overview of existing paradigms: RMI ("Remote Method Invocation") [Diagram: a master thread works through a computational agenda by calling rmi_broadcast(…, (*myFunction)); the RMI layer on each processor (Proc. 0–3) then executes myFunction() locally.] myFunction() is coarsely grained.
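As a rough illustration of the coarse-grained pattern on this slide, the sketch below uses a serial stand-in for rmi_broadcast(); in a real RMI layer the function would be shipped to, and executed on, every processor. The signature and behavior here are assumptions for illustration, not a specific library's API.

    /* Serial stand-in for the coarse-grained RMI pattern on the slide.
     * A real RMI layer would ship myFunction() to every processor and run it
     * there; this stand-in simply calls it locally so the sketch compiles. */
    #include <stdio.h>

    typedef void (*rmi_func_t)(void *arg);

    static void rmi_broadcast(void *arg, rmi_func_t fn)
    {
        fn(arg);                      /* stand-in for "invoke on every processor" */
    }

    static void myFunction(void *arg)
    {
        /* Coarse-grained unit of work: process an entire local chunk of data,
         * so the cost of the remote invocation is amortized over much work. */
        printf("processing chunk %s\n", (const char *)arg);
    }

    int main(void)
    {
        rmi_broadcast("0", myFunction);   /* issued by the master thread */
        return 0;
    }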

  21. RMI Performance Features • Coarse granularity • Thread virtualization • Queue many instances of myFunction() on each physical thread. • The RMI infrastructure can migrate these instances to achieve load balancing.

  22. Overview of existing paradigms: RMI • RMI can be language based: • Java • CHARM++ • Or library based: • RPC • ARMI

  23. Challenges for scalable parallel application development (RMI): • Things that make parallel programs difficult to write: • Work orchestration • Data management • Things that inhibit scalability: • Granularity • Load balancing • Data locality

  24. N tropy: A Library for Rapid Development of kd-tree Applications • No existing paradigm gives us everything we need. • Can we combine existing paradigms beneath a simple, yet flexible API?

  25. N tropy: A Library for Rapid Development of kd-tree Applications • Use RMI for orchestration • Use DSM for data management • Implementation of both is targeted towards astrophysics

  26. A Simple N tropy Example: N-body Gravity Calculation [Figure: a cosmological "N-body" simulation volume, 100 million light years across, domain-decomposed across Proc 0–Proc 8.] • 100,000,000 particles • 1 TB of RAM

  27. A Simple N tropy Example: N-body Gravity Calculation [Diagram: the master thread calls ntropy_Dynamic(…, (*myGravityFunc)); the computational agenda is the list of particles P1, P2, …, Pn on which to calculate the gravitational force, and the N tropy thread on each processor (Proc. 0–3) executes myGravityFunc() through its RMI layer.]
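The sketch below shows roughly what the user-side callback in this example might look like. Only the names ntropy_Dynamic and myGravityFunc come from the slide; the Particle layout and the callback's argument are assumptions, and the exact ntropy_Dynamic() argument list is not shown in the talk.

    /* Rough shape of the user-supplied work function in the gravity example.
     * The Particle type and the callback argument are assumptions; only the
     * names ntropy_Dynamic and myGravityFunc appear on the slide. */
    #define NDIM 3

    typedef struct {
        double pos[NDIM];             /* particle position                      */
        double mass;                  /* particle mass                          */
        double acc[NDIM];             /* accumulated gravitational acceleration */
    } Particle;

    /* Called once per work element (one target particle).  The body would walk
     * the distributed kd-tree to sum contributions from the whole dataset; the
     * walk itself is omitted here. */
    void myGravityFunc(Particle *p)
    {
        p->acc[0] = p->acc[1] = p->acc[2] = 0.0;
        /* ... tree walk over all 100,000,000 particles goes here ... */
    }

    /* Usage, as on the slide (the other arguments are elided there as well):
     *     ntropy_Dynamic(..., myGravityFunc);
     */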

  28. A Simple N tropy Example: N-body Gravity Calculation [Figure: the same cosmological "N-body" simulation (100,000,000 particles, 1 TB of RAM, 100 million light years across) split over Proc 0–Proc 8.] To resolve the gravitational force on any single particle requires the entire dataset.

  29. A Simple N tropy Example: N-body Gravity Calculation [Diagram: a kd-tree (nodes 0–14) distributed across Proc. 0–3; each processor runs an N tropy thread executing myGravityFunc() on top of an RMI layer, which moves work, and an N tropy DSM layer, which moves data.]
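A minimal sketch of the idea in this picture, assuming a hypothetical dsm_acquire_node() accessor: the tree walk never cares where a node lives, because the DSM layer fetches (and caches) remote cells on demand. The names and the purely local stand-in below are illustrative, not N tropy's interface.

    /* Tree walk through a DSM-style accessor.  dsm_acquire_node() is a
     * hypothetical helper: in a real distributed tree it would pull a remote
     * cell over the network and cache it; here it is a purely local stand-in. */
    typedef struct {
        int left, right;              /* child node ids, -1 for a leaf */
    } TreeNode;

    static TreeNode tree[] = {        /* local stand-in for the distributed tree */
        { 1, 2 }, { -1, -1 }, { -1, -1 }
    };

    static const TreeNode *dsm_acquire_node(int id)
    {
        return &tree[id];             /* real version: local, cached, or remote fetch */
    }

    static void walk(int id)
    {
        const TreeNode *n = dsm_acquire_node(id);
        if (n->left < 0)
            return;                   /* leaf: do the per-cell work here */
        walk(n->left);                /* children may live on other processors */
        walk(n->right);
    }

    int main(void)
    {
        walk(0);                      /* start from the root cell */
        return 0;
    }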

  30. N tropy Performance Features • DSM allows performance features to be provided "under the hood": • Interprocessor data caching for both reads and writes • Fewer than 1 in 100,000 off-PE requests actually results in communication. • Updates through the DSM interface must be commutative • Relaxed memory model allows multiple writers with no overhead • Consistency is enforced through global synchronization (a minimal sketch of this accumulate-then-synchronize pattern follows below)
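The accumulate-then-synchronize pattern might look roughly like the sketch below: because the updates are commutative (additions), each processor can buffer them locally and merge them only at the global synchronization. The buffer and function names are illustrative, and the single-processor reduction stands in for what would be a collective operation (e.g. an MPI reduction) in the real library.

    /* Why commutative updates allow a relaxed memory model: each processor
     * adds its contributions into a private shadow buffer, and the buffers
     * are merged only at the global synchronization point.  All names here
     * are illustrative, not N tropy's API. */
    #include <stddef.h>

    #define NCELLS 1024

    static double global_counts[NCELLS];   /* logically shared result array     */
    static double shadow[NCELLS];          /* this processor's buffered writes  */

    /* Commutative update: the order of additions does not matter, so no locks
     * and no immediate remote writes are required. */
    static void add_to_cell(size_t cell, double weight)
    {
        shadow[cell] += weight;
    }

    /* At the global synchronization, every processor's shadow buffer is summed
     * into the shared array; on a real machine this would be a collective
     * reduction (e.g. MPI_Allreduce) rather than the local loop shown here. */
    static void global_sync_reduce(void)
    {
        for (size_t i = 0; i < NCELLS; i++) {
            global_counts[i] += shadow[i];
            shadow[i] = 0.0;
        }
    }

    int main(void)
    {
        add_to_cell(7, 1.0);
        add_to_cell(7, 2.5);          /* commutes with the previous update */
        global_sync_reduce();
        return 0;
    }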

  31. N tropy Performance Features • RMI allows further performance features: • Thread virtualization • Divide the workload into many more pieces than physical threads • Dynamic load balancing is achieved by migrating work elements as the computation progresses.

  32. N tropy Performance [Scaling plot: spatial 3-point correlation function on 10 million particles, 3–4 Mpc bins; three curves compare (1) interprocessor data cache with load balancing, (2) interprocessor data cache without load balancing, and (3) no interprocessor data cache and no load balancing.]

  33. Why does the data cache make such a huge difference? [Figure: the distributed kd-tree (nodes 0–14) as traversed by myGravityFunc() running on Proc. 0.]
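One plausible way to picture the cache's effect, as a minimal sketch: remote tree cells are looked up in a local table first, and only a miss triggers communication, so repeated walks over the same region of the tree almost never leave the processor. The table layout and names below are assumptions, not N tropy internals.

    /* Sketch of an interprocessor read cache for tree cells: a node id is
     * looked up locally first, and only a miss triggers a remote fetch.
     * fetch_remote_node() and the direct-mapped table are illustrative. */
    #include <stdbool.h>

    #define CACHE_SLOTS 4096

    typedef struct { int left, right; } TreeNode;

    typedef struct {
        bool     valid;
        int      id;
        TreeNode node;
    } CacheSlot;

    static CacheSlot cache[CACHE_SLOTS];
    static long hits, misses;

    static TreeNode fetch_remote_node(int id)   /* stand-in for a network request */
    {
        (void)id;
        return (TreeNode){ -1, -1 };
    }

    static const TreeNode *acquire_node(int id)
    {
        CacheSlot *slot = &cache[id % CACHE_SLOTS];   /* direct-mapped lookup */
        if (slot->valid && slot->id == id) {
            hits++;                                    /* the common case     */
        } else {
            misses++;                                  /* rare: go off-PE     */
            slot->node  = fetch_remote_node(id);
            slot->id    = id;
            slot->valid = true;
        }
        return &slot->node;
    }

    int main(void)
    {
        acquire_node(42);             /* miss: fetched remotely      */
        acquire_node(42);             /* hit: served from the cache  */
        return 0;
    }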

  34. N tropy “Meaningful” Benchmarks • The purpose of this library is to minimize development time! • Development time for: • Parallel N-point correlation function calculator • 2 years -> 3 months • Parallel Friends-of-Friends group finder • 8 months -> 3 weeks

  35. Conclusions • Most approaches for parallel application development rely on providing a single paradigm in the most general possible manner • Many scientific problems do not map well onto a single paradigm • Providing an ultra-general single paradigm inhibits scalability

  36. Conclusions • Scientists often borrow from several paradigms and implement them in a restricted and targeted manner. • Almost all current HPC programs are written in MPI (“paradigm-less”): • MPI is a “lowest common denominator” upon which any paradigm can be imposed.

  37. Conclusions • N tropy provides: • Remote Method Invocation (RMI) • Distributed Shared-Memory (DSM) • Implementation of these paradigms is "lean and mean" • Targeted specifically for the problem domain • This approach successfully enables astrophysics data analysis • Substantially reduces application development time • Scales to thousands of processors • More Information: • Go to Wikipedia and search for "Ntropy"
