Emulating Massively Parallel (PetaFLOPS) Machines


Presentation Transcript


  1. Emulating Massively Parallel (PetaFLOPS) Machines Neelam Saboo, Arun Kumar Singla Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé http://charm.cs.uiuc.edu Department of Computer Science Parallel Programming Laboratory

  2. Roadmap • BlueGene Architecture • Need for an Emulator • Charm++ BlueGene • Converse BlueGene • Future Work

  3. Blue Gene: Processor-in-memory Case Study • Five steps to a PetaFLOPS, taken from http://www.research.ibm.com/bluegene/ • Hierarchy: Processor (1 GFlop/s, 0.5 MB) → Node/Chip (25 GFlop/s, 12.5 MB) → Board → Tower → Blue Gene (1 PFlop/s, 0.5 TB) • Functional model: 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
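
As a sanity check, the per-level figures above multiply out consistently. A minimal arithmetic sketch, assuming the ~41,616-node count implied by the 34 x 34 x 36 functional model (the slide does not spell out the board/tower grouping, so only the end points are checked):

#include <cstdio>

int main() {
    const double nodeGflops = 25 * 1.0;        // 25 processors x 1 GFlop/s each
    const double nodeMB     = 25 * 0.5;        // 25 processors x 0.5 MB each
    const long   nodes      = 34L * 34 * 36;   // functional model: 41,616 nodes

    std::printf("node: %.1f GFlop/s, %.1f MB\n", nodeGflops, nodeMB);
    std::printf("machine: ~%.2f PFlop/s, ~%.2f TB\n",
                nodes * nodeGflops / 1e6,      // GFlop/s -> PFlop/s
                nodes * nodeMB / 1e6);         // MB -> TB (decimal units)
}

This reproduces roughly 25 GFlop/s and 12.5 MB per node, and about 1 PFlop/s and 0.5 TB for the full machine.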

  4. SMP Node • 25 processors • 200 processing elements • Input/Output Buffer • 32 x 128 bytes • Network • Connected to six neighbors via duplex link • 16 bit @ 500 MHz = 1 Gigabyte/s • Latencies: • 5 cycles per hop • 75 cycles per turn
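
Using the figures on this slide (500 MHz clock, 5 cycles per hop, 75 cycles per turn, 1 Gigabyte/s per link), a rough one-way message latency can be written down. The function below is an illustrative sketch only, assuming dimension-ordered routing with at most two turns; it is not part of the emulator's interface.

#include <cstdio>
#include <cstdlib>

double estimateLatencyUs(int dx, int dy, int dz, int msgBytes) {
    const double cyclesPerHop  = 5.0;
    const double cyclesPerTurn = 75.0;
    const double clockHz       = 500e6;    // 500 MHz
    const double linkBytesPerS = 1e9;      // 1 Gigabyte/s per link

    int hops  = std::abs(dx) + std::abs(dy) + std::abs(dz);
    int turns = (dx != 0) + (dy != 0) + (dz != 0) - 1;   // dimension-ordered routing
    if (turns < 0) turns = 0;

    double routingS = (hops * cyclesPerHop + turns * cyclesPerTurn) / clockHz;
    double serialS  = msgBytes / linkBytesPerS;
    return (routingS + serialS) * 1e6;     // seconds -> microseconds
}

int main() {
    // Example: a 128-byte packet travelling 6 + 4 hops across two dimensions.
    std::printf("%.3f us\n", estimateLatencyUs(6, 4, 0, 128));
}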

  5. Processor • Stats: 500 MHz • Memory-side cache eliminates coherency problems • 10 cycles local cache • 20 cycles remote cache • 10 cycles cache miss • 8 integer units sharing 2 floating point units • 8 x 25 x ~40,000 = ~8 x 10^6 processing elements!

  6. Need for an Emulator • An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.

  7. Emulator Objectives • Emulate Blue Gene and other petaFLOPS machines. • Memory and time limitations on a single processor require that the simulation be performed on a parallel architecture. • Issues: • Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging. • Therefore no complex event queue/rollback is needed.

  8. Emulator Implementation • What are the basic data structures and interface? • Machine configuration (topology), handler registration • Nodes with node-level shared data • Threads (associated with each node) representing processing elements • Communication between nodes (see the sketch below) • How are all these objects handled on a parallel architecture? How is object-to-object communication handled? • Implementation difficulties are eased by using Charm++, an object-oriented parallel programming paradigm.
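
A minimal, hypothetical sketch of the data structures listed above. All names are illustrative; this does not reproduce the emulator's actual Charm++-based API.

#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

struct Config {                  // machine configuration (topology)
    int x, y, z;                 // 3-D mesh of nodes, e.g. 34 x 34 x 36
    int threadsPerNode;          // processing elements per node, e.g. 200
};

struct Message {
    int handlerId;               // index of the registered handler that consumes it
    std::vector<std::uint8_t> data;
};

struct Node;
using Handler = std::function<void(Node&, Message&)>;

struct Node {
    int x, y, z;                          // position in the mesh
    std::vector<std::uint8_t> sharedData; // node-level shared data
    std::deque<Message> inBuffer;         // messages awaiting this node's threads
};

struct Emulator {
    Config cfg;
    std::vector<Node> nodes;             // one entry per emulated node
    std::vector<Handler> handlers;       // handler registration table

    int registerHandler(Handler h) {
        handlers.push_back(std::move(h));
        return (int)handlers.size() - 1;
    }
    Node& at(int x, int y, int z) {      // linearize (x, y, z) into the node array
        return nodes[(x * cfg.y + y) * cfg.z + z];
    }
    // Node-to-node communication: enqueue into the target node's inBuffer;
    // that node's threads later dispatch the message to handlers[handlerId].
    void send(int x, int y, int z, Message m) {
        at(x, y, z).inBuffer.push_back(std::move(m));
    }
};

int main() {
    Emulator emu{{34, 34, 36, 200}, std::vector<Node>(34 * 34 * 36), {}};
    int h = emu.registerHandler([](Node&, Message&) { /* application handler */ });
    emu.send(1, 2, 3, Message{h, {}});
}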

  9. Experiments on Emulator • Sample applications implemented: • Primes • Jacobi relaxation • MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff • Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI Red processors [image: ApoA-I, 92k atoms]

  10. Collective Operations • Explore different algorithms for broadcasts and reductions: octree, line, and ring (compared in the sketch below) • Setup: a "primitive" 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster
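
The sketch below is only a rough, illustrative comparison of how many forwarding steps each strategy needs, under simplified assumptions (one node per step for the ring, one pass per axis for the line, fan-out of 8 for the octree); it is not the emulator's implementation of these collectives.

#include <cmath>
#include <cstdio>

int main() {
    const int nodes = 30 * 30 * 20;    // the emulated configuration on this slide

    // RING: the broadcast visits one node at a time around the machine.
    int ringSteps = nodes - 1;

    // LINE: forward along each dimension of the mesh in turn.
    int lineSteps = (30 - 1) + (30 - 1) + (20 - 1);

    // OCTREE: each node forwards to up to 8 children, so depth ~ log8(nodes).
    int octreeSteps = (int)std::ceil(std::log((double)nodes) / std::log(8.0));

    std::printf("ring: %d, line: %d, octree: %d forwarding steps\n",
                ringSteps, lineSteps, octreeSteps);
}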

  11. Converse BlueGene Emulator Objective • Performance estimation (with proper time stamping) • Provide API for building Charm++ on top of emulator.

  12. BlueGene Emulator: Node Structure • Worker thread • Communication threads • inBuffer • Affinity message queue • Non-affinity message queue (a data-structure sketch follows below)
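
A hedged sketch of this node structure, assuming that communication threads sort packets from the inBuffer into a worker thread's affinity queue (or into the shared non-affinity queue), which worker threads then drain. Field and type names are illustrative only.

#include <deque>
#include <vector>

struct BgMessage {
    int   targetThread;   // >= 0: affinity to a specific worker thread; -1: any worker
    int   handlerId;      // which registered handler should run it
    void *payload;        // application data (ownership omitted in this sketch)
};

struct BgNode {
    std::deque<BgMessage>              inBuffer;          // raw incoming packets
    std::vector<std::deque<BgMessage>> affinityQueues;    // one queue per worker thread
    std::deque<BgMessage>              nonAffinityQueue;  // any worker thread may take these

    // Work done by a communication thread: move one packet from the inBuffer
    // into the appropriate queue for the worker threads to drain.
    void sortOnePacket() {
        if (inBuffer.empty()) return;
        BgMessage m = inBuffer.front();
        inBuffer.pop_front();
        if (m.targetThread >= 0)
            affinityQueues[m.targetThread].push_back(m);
        else
            nonAffinityQueue.push_back(m);
    }
};

int main() {
    BgNode node;
    node.affinityQueues.resize(200);            // e.g. 200 processing elements per node
    node.inBuffer.push_back({7, 0, nullptr});   // packet with affinity to worker thread 7
    node.inBuffer.push_back({-1, 0, nullptr});  // packet any worker may handle
    node.sortOnePacket();
    node.sortOnePacket();
}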

  13. Performance • Pingpong: close to Converse pingpong, 81-103 µs vs. 92 µs RTT • Charm++ pingpong: 116 µs RTT • Charm++ BlueGene pingpong: 134-175 µs RTT

  14. Charm++ on top of the Emulator • A BlueGene thread represents a Charm++ node • Name conflicts (one possible workaround is sketched below): • Cpv, Ctv • MsgSend, etc. • CkMyPe(), CkNumPes(), etc.
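
One conventional way such conflicts can be worked around is to redirect the Charm++-level names to emulator-level queries when building on top of the emulator. This is an assumed illustration, not the port's actual renaming scheme; the Bg* functions below are stubs invented for this sketch.

#include <cstdio>

static int BgMyCharmPe()   { return 0; }     // stub: index of this emulated Charm++ PE
static int BgNumCharmPes() { return 200; }   // stub: number of emulated Charm++ PEs

// Redirect the conflicting Charm++ names to the emulator-level stubs above.
#define CkMyPe()   BgMyCharmPe()
#define CkNumPes() BgNumCharmPes()

int main() {
    std::printf("PE %d of %d\n", CkMyPe(), CkNumPes());
}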

  15. Future Work: Simulator • LeanMD: fully functional MD with cutoff only • How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system? • Several layers of detail to measure: • Basic: correctly model performance; timestamp messages with correction for out-of-order execution (see the sketch below) • More detailed: network performance, memory access, modeling sharing of the floating-point units, estimation techniques
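
A minimal sketch of the basic timestamping idea named above, assuming each emulated message carries a virtual send time and a modeled network latency, so that out-of-order delivery on the host machine does not corrupt the predicted timeline. Names are illustrative; this is not the simulator's implementation.

#include <algorithm>
#include <cstdio>

struct TimedMessage {
    double sendTime;        // virtual time at which the sender injected it
    double networkLatency;  // modeled latency for this hop pattern
};

struct VirtualClock {
    double now = 0.0;

    // Returns the virtual time at which this message is considered received.
    double onReceive(const TimedMessage &m) {
        now = std::max(now, m.sendTime + m.networkLatency);
        return now;
    }
};

int main() {
    VirtualClock clk;
    // Messages may arrive out of order on the host; virtual time stays consistent.
    std::printf("%.2f\n", clk.onReceive({5.0, 1.5}));  // -> 6.50
    std::printf("%.2f\n", clk.onReceive({2.0, 1.0}));  // late arrival -> 6.50 (no rollback)
}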
