Blue Gene Simulator


Presentation Transcript


  1. Blue Gene Simulator Gengbin Zheng gzheng@uiuc.edu Gunavardhan Kakulapati kakulapa@uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign http://charm.cs.uiuc.edu

  2. Overview • Blue Gene Emulator • Blue Gene Simulator • Timing correction schemes • Performance and results

  3. Emulation on a Parallel Machine (diagram: BG/C nodes mapped onto hardware threads of the simulating host processors)

  4. Blue Gene Emulator: functional view (diagram of one Blue Gene/C node: communication threads, worker threads, inBuffer, correctionQ, affinity and non-affinity message queues)

  5. Blue Gene Emulator: functional view (diagram: multiple Blue Gene/C nodes, each with communication threads, worker threads, inBuff, correctionQ, and affinity/non-affinity message queues, driven by the Converse scheduler and Converse queue)

  6. What it is capable of … • Blue Gene API support • Blue Gene Charm++ • Structured Dagger • Trace Projections

  7. Emulator to Simulator • Emulator: study the programming model and application development • Simulator: adds performance prediction capability • Models communication latency based on a network model • Does not model on-chip memory access or network contention
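As an illustration of what "models communication latency based on a network model" could mean, the sketch below uses a simple fixed-overhead + per-hop + per-byte model; the parameter names and values are assumptions for illustration, not the simulator's actual network model.

```cpp
#include <cstddef>

// Hypothetical latency model: fixed per-message software overhead, a per-hop
// cost on the torus network, and a bandwidth term. Names and constants are
// illustrative only.
struct NetworkModel {
    double alphaUs    = 5.0;    // per-message software overhead (microseconds)
    double perHopUs   = 0.1;    // cost of each network hop
    double bytesPerUs = 350.0;  // effective bandwidth (bytes per microsecond)

    double latency(std::size_t bytes, int hops) const {
        return alphaUs + hops * perHopUs + bytes / bytesPerUs;
    }
};
```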

  8. Simulator • Parallel performance is hard to model • Communication subsystem • Out-of-order messages • Communication/computation overlap • Event dependencies • Parallel Discrete Event Simulation • The emulation program executes in parallel, with event timestamp correction • Exploit the inherent determinacy of the application

  9. How to simulate? • Time-stamping events • Per-thread timer (sharing one physical timer) • Time-stamping messages • Calculate communication latency based on the network model • Parallel event simulation • When a message is sent out, calculate its predicted arrival time at the destination BlueGene processor • When a message is received, update the current time: currTime = max(currTime, recvTime) • Timestamp correction

  10. Time Stamping Messages and Threads • Thread timer: curT • Message sent: RecvT(msg) = curT + Latency • Message scheduled: curT = max(curT, RecvT(msg))
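A minimal C++ sketch of the per-thread timer logic from slides 9 and 10; ThreadTimer, Msg, and netLatency are illustrative names rather than the emulator's actual API.

```cpp
#include <algorithm>

struct Msg {
    double recvTime;   // predicted arrival time at the destination node
    // ... payload ...
};

// Each worker thread keeps its own virtual clock, sharing one physical timer.
struct ThreadTimer {
    double curT = 0.0;

    // On send: stamp the message with its predicted arrival time.
    void onSend(Msg& m, double netLatency) {
        m.recvTime = curT + netLatency;
    }

    // On scheduling a received message: never let the clock move backwards.
    void onSchedule(const Msg& m) {
        curT = std::max(curT, m.recvTime);
    }

    // After executing the handler: advance by the measured execution time.
    void onExecuted(double execTime) {
        curT += execTime;
    }
};
```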

  11. Need for timestamp correction • Timestamp correction is needed for out-of-order messages • Out-of-order delivery can occur when a message arrives late while some other message has already advanced the thread time into the future • The late message then executes in that future context, although its predicted time is earlier

  12. Parallel correction algorithm • Sort message execution by receive time • Adjust timestamps when needed • Use correction messages to propagate the change in an event's startTime • Send correction messages along the path the original message was sent • Events already in the timeline may have to move
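This slide's algorithm might be sketched as follows (illustrative C++, not the simulator's code): the per-node timeline is kept sorted by receive time, a corrected receive time triggers a re-sort and a recomputation of start times, and every event that moves is recorded so that correction messages can be sent for the messages it originally sent. All names (TimelineEvent, applyCorrection) are invented.

```cpp
#include <algorithm>
#include <vector>

// Illustrative per-node timeline entry; not the simulator's real data structure.
struct TimelineEvent {
    int    msgId;       // id of the message that triggered this event
    double recvTime;    // (possibly corrected) receive time
    double startTime;   // when the event begins on this node
    double execTime;    // duration of the handler
};

// Events whose start times shifted; correction messages for the messages
// these events sent would be generated from this list and forwarded along
// the paths those messages took (not shown here).
struct Moved { int msgId; double newStartTime; };

std::vector<Moved> applyCorrection(std::vector<TimelineEvent>& timeline,
                                   int msgId, double newRecvTime) {
    std::vector<Moved> moved;

    // Update the corrected event's receive time.
    for (TimelineEvent& e : timeline)
        if (e.msgId == msgId) e.recvTime = newRecvTime;

    // Keep the timeline sorted by receive time (the linear-order invariant).
    std::sort(timeline.begin(), timeline.end(),
              [](const TimelineEvent& a, const TimelineEvent& b) {
                  return a.recvTime < b.recvTime;
              });

    // Recompute start times; record every event that had to move.
    double prevEnd = 0.0;
    for (TimelineEvent& e : timeline) {
        double newStart = std::max(e.recvTime, prevEnd);
        if (newStart != e.startTime) {
            e.startTime = newStart;
            moved.push_back({e.msgId, newStart});
        }
        prevEnd = e.startTime + e.execTime;
    }
    return moved;
}
```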

  13. Timestamps Correction (diagram: an execution timeline ordered by RecvTime, with messages M1–M8)

  14. Timestamps Correction (diagram: the late message M8 is reinserted into the timeline between M3 and M4 according to its RecvTime)

  15. Timestamps Correction (diagram: the timelines before and after the reinsertion of M8, with a correction message propagating the shifted times)

  16. Timestamps Correction (diagram: a correction message for M4 moves M4 within the timeline; further correction messages cascade to the later events)

  17. Linear-order correction • Works only when • the program has no alternative orders of execution • messages are processed in the same order across multiple executions • E.g.: MPI programs with no wildcard receives, Structured Dagger code with no "overlap" or "forall"

  18. Reasons the linear-order scheme fails: • The correction algorithm breaks the dependency logic • It is based only on receive time • Cases: • An event depends on several messages, and the last message triggers the computation • A message is buffered until some condition holds • Example where this correction scheme is invalid: Jacobi-1D

  19. Solution • Use Structured Dagger to retrieve dependence information • As the program runs, build a chain of bluegene logs that preserves the dependency information • Bluegene logs are kept for entry functions and Structured Dagger functions

  20. Timestamp correction scheme • Every event has a list of backward and forward dependents • An event cannot start until its backward dependents have finished • Define effRecvTime = max(recvTime, endOfBackDeps) • An event can start only after its effRecvTime: startTime = max(effRecvTime, timeline.last.endTime)
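A minimal sketch of these two rules, assuming an invented LogEvent record with pointers to its backward dependents (the real bluegene log structure is not shown in the slides).

```cpp
#include <algorithm>
#include <vector>

// Illustrative stand-in for a bluegene log entry: one executed event with
// links to the events it depends on (its backward dependents).
struct LogEvent {
    double recvTime;                        // predicted message arrival time
    double startTime;                       // when execution actually begins
    double endTime;                         // startTime + execution time
    std::vector<const LogEvent*> backDeps;  // events that must finish first
};

// effRecvTime = max(recvTime, end of all backward dependents)
double effRecvTime(const LogEvent& e) {
    double t = e.recvTime;
    for (const LogEvent* dep : e.backDeps)
        t = std::max(t, dep->endTime);
    return t;
}

// An event starts only after its effRecvTime and after the previous event
// on this node's timeline has finished.
double startTime(const LogEvent& e, const LogEvent* timelineLast) {
    double t = effRecvTime(e);
    if (timelineLast) t = std::max(t, timelineLast->endTime);
    return t;
}
```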

  21. Timestamp correction scheme • Unlike the previous scheme, the timeline is not sorted on the event's recvTime • It is sorted on effRecvTime • Steps to process a correction message: • Find the earliest event updated by the message • Cut the timeline from that event • Recalculate the effRecvTimes from that point • Reinsert into the timeline in effRecvTime order
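The four steps might look like this self-contained sketch (all names invented); the propagation of effRecvTime through backward dependents is simplified here to the event's own recvTime.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal event record for illustration (not the simulator's real log format).
struct Event {
    double recvTime;      // predicted arrival time of the triggering message
    double effRecvTime;   // max(recvTime, end of backward dependents)
    double startTime;
    double execTime;
    double endTime() const { return startTime + execTime; }
};

// Steps from the slide: find the earliest affected event, cut the timeline
// there, recompute times, and reinsert in effRecvTime order.
void processCorrection(std::vector<Event>& timeline,
                       std::size_t firstAffected, double newRecvTime) {
    if (firstAffected >= timeline.size()) return;

    // 1. Update the corrected event and cut the tail of the timeline.
    timeline[firstAffected].recvTime = newRecvTime;
    std::vector<Event> tail(timeline.begin() + firstAffected, timeline.end());
    timeline.erase(timeline.begin() + firstAffected, timeline.end());

    // 2. Recompute effRecvTime for the cut events; backward-dependent end
    //    times are omitted in this sketch, so effRecvTime is just recvTime.
    for (Event& e : tail) e.effRecvTime = e.recvTime;

    // 3. Reinsert in effRecvTime order.
    std::sort(tail.begin(), tail.end(),
              [](const Event& a, const Event& b) {
                  return a.effRecvTime < b.effRecvTime;
              });

    // 4. Recompute start times behind the unchanged prefix of the timeline.
    double prevEnd = timeline.empty() ? 0.0 : timeline.back().endTime();
    for (Event& e : tail) {
        e.startTime = std::max(e.effRecvTime, prevEnd);
        prevEnd = e.endTime();
        timeline.push_back(e);
    }
}
```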

  22. Non-linear-order correction scheme • The new scheme: • Takes the event dependencies into account • Works even when messages can be received in different orders in different runs • Requires all dependencies to be captured with Structured Dagger • But the timing correction is very slow; several optimizations are possible

  23. Optimizations to the online correction scheme • Overwrite old corrections: • An event can receive multiple correction messages • Reduce the number of corrections • The same scheme applies if a correction message arrives before the message itself • Use multisend: • Messages destined for the same real processor but different events can be sent collectively

  24. More optimizations • Prioritize messages by their predicted recvTime • Lazy processing • Process correction messages periodically • Allows corrections to be overwritten • Batch processing • Process many correction messages at a time • Many events may be affected: choose the earliest and reinsert in effRecvTime order • Ability to start corrections in the middle • Startup events can be ignored for timing correction
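One way to picture the overwrite, lazy, and batch ideas from the last two slides is a small buffer of pending corrections that keeps only the latest correction per event and is drained periodically. This is an illustrative sketch, not the simulator's implementation; all names are invented.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Buffer of pending corrections, keyed by the event (message) they target.
// A newer correction for the same event overwrites the older one, so only
// the latest correction is ever applied.
class CorrectionBuffer {
public:
    void add(int eventId, double newRecvTime) {
        pending_[eventId] = newRecvTime;   // overwrite any earlier correction
    }

    // Drained periodically (lazy/batch processing): return all corrections
    // sorted so the earliest new receive time is handled first, then clear.
    std::vector<std::pair<int, double>> drain() {
        std::vector<std::pair<int, double>> batch(pending_.begin(), pending_.end());
        std::sort(batch.begin(), batch.end(),
                  [](const auto& a, const auto& b) { return a.second < b.second; });
        pending_.clear();
        return batch;
    }

private:
    std::map<int, double> pending_;
};
```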

  25. Timing correction is still very slow • Observations: • Don't let the execution get far ahead of the correction wave • A large gap means many wrong events have to be corrected • But following the execution wave too closely also may not help • A new scheme • Similar to the one used for GVT (Global Virtual Time)

  26. GVT-like scheme • Use a heartbeat: periodically broadcast a request for the GVT • The GVT is the time after which events are invalid due to pending corrections • Compute the GVT as the minimum of the predicted recvTimes of all correction messages and the startTimes of all affected events • Use a parameter "leash": execution of the program cannot go beyond gvt + leash
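A hedged illustration of this scheme: the local GVT contribution is the minimum over the predicted receive times of outstanding corrections and the start times of the events they affect, and a node may only execute events up to gvt + leash. The heartbeat broadcast and the cross-processor reduction are omitted, and all names are invented.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Local contribution to the GVT: the minimum over the predicted recvTimes of
// pending correction messages and the startTimes of the events they affect.
// A real implementation would combine these across processors via a reduction
// triggered by the periodic heartbeat broadcast.
double localGvtContribution(const std::vector<double>& pendingCorrectionRecvTimes,
                            const std::vector<double>& affectedEventStartTimes) {
    double gvt = std::numeric_limits<double>::infinity();
    for (double t : pendingCorrectionRecvTimes) gvt = std::min(gvt, t);
    for (double t : affectedEventStartTimes)    gvt = std::min(gvt, t);
    return gvt;
}

// Throttle: an event may only be executed if its start time does not run
// past the correction wave by more than the configured leash.
bool mayExecute(double eventStartTime, double gvt, double leash) {
    return eventStartTime <= gvt + leash;
}
```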

  27. Projections before correction

  28. Projections after correction

  29. Correctness of the scheme (using Jacobi1D)

  30. Predicted time vs latency factor

  31. Predicted speedup

  32. More work • Ongoing work • Verify that the GVT scheme is correct • Future work • The presented scheme is online correction • Explore an offline (post-mortem) correction scheme using generated traces
