Emulating Massively Parallel (PetaFLOPS) Machines

Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé

http://charm.cs.uiuc.edu

Department of Computer Science

Parallel Programming Laboratory


Roadmap

  • BlueGene Architecture

  • Need for an Emulator

  • Charm++ BlueGene

  • Converse BlueGene

  • Future Work


Blue Gene: Processor-in-memory Case Study

[Figure: packaging levels PROCESSOR (1 GFlop/s, 0.5 MB), NODE/CHIP (25 GFlop/s, 12.5 MB), BOARD, TOWER, and BLUE GENE (1 PFlop/s, 0.5 TB)]

  • Five steps to a PetaFLOPS, taken from:

    • http://www.research.ibm.com/bluegene/

FUNCTIONAL MODEL:

34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.
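The sizes quoted across these slides can be cross-checked with a little arithmetic. The sketch below is not part of the emulator: `MachineShape`, its helper functions, and `kBlueGene` are names invented here. The figure of 8 processing elements per processor comes from the processor slide (8 x 25 x ~40,000).

```cpp
#include <cstdint>

// Illustrative only: these names are not taken from the emulator.
struct MachineShape {
    int x, y, z;           // dimensions of the node cube
    int procsPerNode;      // processors on each shared-memory node
    int threadsPerProc;    // processing elements per processor
};

constexpr std::int64_t nodeCount(const MachineShape& m) {
    return static_cast<std::int64_t>(m.x) * m.y * m.z;
}

constexpr std::int64_t threadsPerNode(const MachineShape& m) {
    return static_cast<std::int64_t>(m.procsPerNode) * m.threadsPerProc;
}

constexpr std::int64_t totalThreads(const MachineShape& m) {
    return nodeCount(m) * threadsPerNode(m);
}

// Values quoted in the slides: 34 x 34 x 36 nodes, 25 processors, 8 threads each.
constexpr MachineShape kBlueGene{34, 34, 36, 25, 8};

// Checked at compile time against the rounded figures on the slides.
static_assert(nodeCount(kBlueGene) == 41616, "about 40,000 nodes");
static_assert(threadsPerNode(kBlueGene) == 200, "200 processing elements per node");
static_assert(totalThreads(kBlueGene) == 8323200, "about 8 x 10^6 threads");
```

The exact products (41,616 nodes and 8,323,200 threads) agree with the rounded "~40,000" and "~8 x 10^6" used elsewhere in the talk.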


SMP Node

  • 25 processors

  • 200 processing elements

  • Input/Output Buffer

    • 32 x 128 bytes

  • Network

    • Connected to six neighbors via duplex link

    • 16 bits @ 500 MHz = 1 Gigabyte/s

  • Latencies:

    • 5 cycles per hop

    • 75 cycles per turn
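The link figure follows from 16 bits (2 bytes) per cycle at 500 MHz. The two latency figures can also be combined into a rough per-message estimate. The sketch below does that under assumptions the slides do not state: messages take a shortest path along the grid axes with no wraparound links, and a "turn" means a change of axis. All names are hypothetical.

```cpp
// Figures from the slides.
constexpr double kClockHz = 500e6;   // 500 MHz
constexpr int kLinkBits = 16;        // link width, bits transferred per cycle
constexpr int kCyclesPerHop = 5;
constexpr int kCyclesPerTurn = 75;

// One-direction link bandwidth in bytes per second.
constexpr double linkBytesPerSecond() { return (kLinkBits / 8.0) * kClockHz; }

struct Coord { int x, y, z; };

constexpr int absDiff(int a, int b) { return a > b ? a - b : b - a; }

// ASSUMPTION: shortest path along the grid axes, no wraparound links.
constexpr int hopCount(Coord a, Coord b) {
    return absDiff(a.x, b.x) + absDiff(a.y, b.y) + absDiff(a.z, b.z);
}

// Number of axes along which the two coordinates differ.
constexpr int axesUsed(Coord a, Coord b) {
    return (a.x != b.x ? 1 : 0) + (a.y != b.y ? 1 : 0) + (a.z != b.z ? 1 : 0);
}

// ASSUMPTION: a "turn" is a change of axis, so a route using k axes makes k - 1 turns.
constexpr int turnCount(Coord a, Coord b) {
    return axesUsed(a, b) > 0 ? axesUsed(a, b) - 1 : 0;
}

constexpr int latencyCycles(Coord a, Coord b) {
    return hopCount(a, b) * kCyclesPerHop + turnCount(a, b) * kCyclesPerTurn;
}

constexpr double latencyNanoseconds(Coord a, Coord b) {
    return latencyCycles(a, b) * 1e9 / kClockHz;   // 2 ns per cycle at 500 MHz
}
```

Under these assumptions, a corner-to-corner message in the 34 x 34 x 36 cube, from (0,0,0) to (33,33,35), takes 101 hops and 2 turns, or 655 cycles.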


Processor

[Figure: processor with "in" and "out" connections]

  • STATS:

  • 500 MHz

  • Memory-side cache eliminates coherency problems

    • 10 cycles local cache

    • 20 cycles remote cache

    • 10 cycles cache miss

  • 8 integer units sharing 2 floating point units

  • 8 x 25 x ~40,000 = ~8 x 10^6 processing elements!


Need for Emulator

  • Emulator – enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine


Emulator Objectives

  • Emulate Blue Gene and other petaFLOPS machines.

  • Memory and time limitations on a single processor require that the simulation MUST be performed on a parallel architecture.

  • Issues:

    • Assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging.

    • Therefore the emulator doesn't need a complex event queue/rollback.


Emulator Implementation

  • What are the basic data structures/interface?

    • Machine configuration (topology), handler registration

    • Nodes with node-level shared data

    • Threads (associated with each node) representing processing elements

    • Communication between nodes

  • How to handle all these objects on a parallel architecture? How to handle object-to-object communication?

  • Difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
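To make the list of data structures concrete, here is a minimal sequential sketch: a machine configuration, a handler table, nodes with node-level shared data, and messages between nodes. Every name in it (`Emulator`, `Node`, `Message`, `registerHandler`, and so on) is hypothetical and is not the emulator's actual interface. The real emulator runs the emulated threads in parallel on top of Charm++, which this single-threaded stand-in does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the emulator's real API.
struct Message {
    int handlerId;   // index into the handler table
    int destNode;    // linear index of the destination node
    int payload;     // stand-in for user data
};

struct Node {
    long sharedSum = 0;           // node-level shared data
    int numThreads = 0;           // emulated processing elements on this node
    std::deque<Message> inbox;    // messages waiting to be handled
};

class Emulator {
public:
    using Handler = std::function<void(Emulator&, Node&, const Message&)>;

    // Machine configuration: an x by y by z grid of nodes.
    Emulator(int x, int y, int z, int threadsPerNode)
        : x_(x), y_(y), z_(z), nodes_(static_cast<std::size_t>(x) * y * z) {
        for (Node& n : nodes_) n.numThreads = threadsPerNode;
    }

    // Handler registration: returns the id that messages use to name the handler.
    int registerHandler(Handler h) {
        handlers_.push_back(std::move(h));
        return static_cast<int>(handlers_.size()) - 1;
    }

    int nodeIndex(int x, int y, int z) const {
        assert(x >= 0 && x < x_ && y >= 0 && y < y_ && z >= 0 && z < z_);
        return (z * y_ + y) * x_ + x;
    }
    int numNodes() const { return static_cast<int>(nodes_.size()); }
    Node& node(int i) { return nodes_.at(static_cast<std::size_t>(i)); }

    // Communication between nodes: queue the message at its destination.
    void send(const Message& m) {
        assert(m.handlerId >= 0 && m.handlerId < static_cast<int>(handlers_.size()));
        node(m.destNode).inbox.push_back(m);
    }

    // Sequential stand-in for a scheduler: drain inboxes until no work remains.
    // Returns the number of messages handled.
    int run() {
        int handled = 0;
        bool progress = true;
        while (progress) {
            progress = false;
            for (Node& n : nodes_) {
                while (!n.inbox.empty()) {
                    Message m = n.inbox.front();
                    n.inbox.pop_front();
                    Handler h = handlers_[static_cast<std::size_t>(m.handlerId)];  // copy: handlers may register more
                    h(*this, n, m);
                    ++handled;
                    progress = true;
                }
            }
        }
        return handled;
    }

private:
    int x_, y_, z_;
    std::vector<Node> nodes_;
    std::vector<Handler> handlers_;
};
```

A typical use registers a handler that updates `sharedSum` and forwards a smaller payload to the next node, sends one initial message, and calls `run()`. Because handlers only react to messages, nothing in this design depends on a global event order, which matches the stated assumption that programs tolerate out-of-order execution.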


Experiments on Emulator

  • Sample applications implemented:

    • Primes

    • Jacobi relaxation

    • MD prototype

      • 40,000 atoms, no bonds calculated, nearest neighbor cutoff

  • Ran full Blue Gene (with 8 x 10^6 threads) on ~100 ASCI-Red processors

[Figure: ApoA-I, 92k atoms]


Collective Operations

  • Explore different algorithms for broadcasts and reductions

[Figure: OCTREE, LINE, and RING communication patterns drawn on x, y, z axes]

Use "primitive" 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster
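The slide names the patterns but does not define them, so the following is only a rough illustration of why the choice matters. It compares the number of sequential forwarding steps in a chain (one plausible reading of LINE or RING) with the depth of a fanout-8 tree (one plausible reading of OCTREE). Both readings, and the function names, are assumptions made here.

```cpp
#include <cstdint>

// Forwarding steps when each node passes the message on to exactly one successor.
constexpr std::int64_t chainSteps(std::int64_t nodes) {
    return nodes > 0 ? nodes - 1 : 0;
}

// Depth of the shallowest complete tree with the given fanout that can hold
// `nodes` nodes (the root is at depth 0). Requires fanout >= 1.
constexpr int treeDepth(std::int64_t nodes, int fanout) {
    int depth = 0;
    std::int64_t capacity = 1;   // nodes held by a tree of the current depth
    std::int64_t level = 1;      // nodes on the deepest level so far
    while (capacity < nodes) {
        level *= fanout;
        capacity += level;
        ++depth;
    }
    return depth;
}

// The 30 x 30 x 20 configuration from the slide has 18,000 nodes.
static_assert(chainSteps(30 * 30 * 20) == 17999, "chain visits nodes one at a time");
static_assert(treeDepth(30 * 30 * 20, 8) == 5, "fanout-8 tree reaches all nodes in 5 levels");
```

These counts ignore hop distance and link contention, both of which differ between patterns on a real 3D grid, so they indicate only the order of magnitude at stake.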


Converse BlueGene Emulator Objective

  • Performance estimation (with proper time stamping)

  • Provide API for building Charm++ on top of emulator.


BlueGene Emulator

[Figure: Node Structure, showing communication threads, worker threads, inBuffer, affinity message queue, and non-affinity message queue]
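One way to read the node structure diagram: communication threads drain the inBuffer and sort each packet into either a per-worker affinity queue (when it names a destination thread) or a node-wide non-affinity queue (when any worker may take it). The sketch below follows that reading. The type and member names, the `kAnyThread` marker, and the rule that a worker checks its own queue first are all assumptions, not details given on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

constexpr int kAnyThread = -1;   // hypothetical marker: packet has no thread affinity

struct Packet {
    int destThread;   // worker index on the node, or kAnyThread
    int data;         // stand-in for the message body
};

struct EmulatedNode {
    std::deque<Packet> inBuffer;                  // packets arriving from the network
    std::vector<std::deque<Packet>> affinityQ;    // one queue per worker thread
    std::deque<Packet> nonAffinityQ;              // shared by all workers on the node

    explicit EmulatedNode(int workers)
        : affinityQ(static_cast<std::size_t>(workers)) {}

    // Role of a communication thread: move packets from inBuffer to the right queue.
    void dispatch() {
        while (!inBuffer.empty()) {
            Packet p = inBuffer.front();
            inBuffer.pop_front();
            if (p.destThread == kAnyThread) {
                nonAffinityQ.push_back(p);
            } else {
                assert(p.destThread >= 0 &&
                       static_cast<std::size_t>(p.destThread) < affinityQ.size());
                affinityQ[static_cast<std::size_t>(p.destThread)].push_back(p);
            }
        }
    }

    // Role of a worker thread: fetch the next packet it may process.
    // ASSUMPTION: a worker serves its own affinity queue before the shared queue.
    bool nextFor(int worker, Packet& out) {
        std::deque<Packet>& mine = affinityQ.at(static_cast<std::size_t>(worker));
        if (!mine.empty()) {
            out = mine.front();
            mine.pop_front();
            return true;
        }
        if (!nonAffinityQ.empty()) {
            out = nonAffinityQ.front();
            nonAffinityQ.pop_front();
            return true;
        }
        return false;   // nothing available for this worker
    }
};
```

A real implementation would run the communication and worker roles concurrently and protect the queues with locks; this version is single-threaded to keep the data flow visible.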


Performance

  • Pingpong

    • Close to Converse pingpong:

      • 81-103 us vs. 92 us RTT

    • Charm++ pingpong

      • 116 us RTT

    • Charm++ Bluegene pingpong

      • 134-175 us RTT


Charm++ on top of Emulator

  • A BlueGene thread represents a Charm++ node

  • Name conflicts:

    • Cpv, Ctv

    • MsgSend, etc

    • CkMyPe(), CkNumPes(), etc


Future Work: Simulator

  • LeanMD: Fully functional MD with only cutoff

  • How can we examine performance of algorithms on variants of processor-in-memory design in massive system?

  • Several layers of detail to measure

    • Basic: Correctly model performance, timestamp messages with correction for out-of-order execution

    • More detailed: network performance, memory access, modeling sharing of floating-point unit, estimation techniques
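As a minimal sketch of the "basic" layer, the code below keeps a virtual clock per emulated thread and timestamps each message with a predicted arrival time. A handler cannot start before its message arrives or before the thread is free. The code also counts messages that were handled after a message with a later timestamp, which is the situation a correction scheme would have to repair. The correction itself is not shown, and all names and the cost model are assumptions made here.

```cpp
#include <algorithm>
#include <cassert>

struct TimedMessage {
    double arrival;   // sender's virtual time at send plus modeled network latency
    double cost;      // modeled execution time of the handler
};

struct VirtualClock {
    double now = 0.0;          // virtual time at which the thread becomes free
    double lastArrival = 0.0;  // largest arrival timestamp handled so far
    int lateMessages = 0;      // messages handled after one with a later timestamp

    // Advances the clock for one message and returns the finish time.
    double handle(const TimedMessage& m) {
        assert(m.cost >= 0.0);
        if (m.arrival < lastArrival) ++lateMessages;     // handled out of timestamp order
        lastArrival = std::max(lastArrival, m.arrival);
        const double start = std::max(now, m.arrival);   // wait for arrival or a free thread
        now = start + m.cost;
        return now;
    }
};
```

If messages with arrival times 10, 12, and 4 are handled in that order, the third is counted as late: in the emulated machine it would have run first, so the finish times computed for the first two are optimistic and would need adjusting.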

