BigSim Tutorial

Presented by Gengbin Zheng and Ryan Mokos at the Charm++ Workshop 2009, Parallel Programming Laboratory, University of Illinois at Urbana-Champaign. Outline: overview, BigSim Emulator, BigSim Simulator, post-mortem simulation, BigNetSim build flow.

Presentation Transcript


  1. BigSim Tutorial
     Presented by Gengbin Zheng, Ryan Mokos
     Charm++ Workshop 2009
     Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

  2. Outline
     • Overview
     • BigSim Emulator
     • BigSim Simulator
     • Post-mortem simulation
     • BigNetSim build flow
     • Generic network model: Simple Latency Model
     • Specific network models
     • Extensibility

  3. BigSim Infrastructure
     • BigSim is for whole-system simulation of a large parallel machine.
     • Goal: support early application development and identification of performance bottlenecks.
     • What BigSim can do:
         • Provide an execution environment that runs both Charm++ and MPI applications on large-scale target machines
         • Require few or no changes to MPI application source code
         • Facilitate code development and debugging
         • Predict parallel performance at varying levels of resolution
         • Help tune and scale performance
         • Support machine vendors designing future machines

  4. BigSim Components
     • BigSim Emulator
         • Runs AMPI/Charm++ applications on the emulator
         • Captures computation and communication information
         • Parallel: each physical processor emulates multiple target processors, leveraging Charm++'s virtualization support
     • BigSim Simulator
         • PDES (parallel discrete event simulation), network contention
         • Produces performance data in a format compatible with the Projections graphical browser

  5. What BigSim Cannot Do
     • BigSim by itself:
         • does not predict cycle-accurate timing (that needs instruction-level simulation)
         • does not predict cache effects or virtual memory behavior
         • does not model OS jitter

  6. Outline
     • Overview
     • BigSim Emulator
     • BigSim Simulator
     • Post-mortem simulation
     • BigNetSim build flow
     • Generic network model: Simple Latency Model
     • Specific network models
     • Extensibility

  7. BigSim Emulator
     • Emulates a full machine on existing parallel machines
         • Actually runs a parallel program
         • E.g. multi-million objects on 128K target processors
     • The emulator is implemented on Charm++
         • Libraries that link to the user application
     • Simple architecture abstraction
         • Many multiprocessor (SMP) nodes connected via message passing

  8. BigSim Emulator: functional view
     [Figure: a real processor hosting target nodes. Labels: communication processors, worker processors, inBuff, CorrectionQ, affinity message queues, non-affinity message queues, Converse scheduler, Converse Q.]

  9. Install BigSim Emulator
     • Download Charm++ v6.1.2
         • http://charm.cs.uiuc.edu/download/downloads.shtml
     • Compile Charm++/AMPI with the "bigemulator" option:
         • ./build AMPI net-linux-x86_64 bigemulator -O
         • This builds the Charm++ and emulator libraries under net-linux-x86_64-bigemulator
     • Compiler wrappers for MPI applications:
         • charm/net-linux-x86_64-bigemulator/bin/mpicc, mpicxx, mpif90, etc.

  10. Prepare MPI Applications
     • Make sure applications are AMPI-compliant
         • Adaptive MPI (AMPI): an implementation of the MPI standard on Charm++
         • Multithreaded
     • Changes that may be needed:
         • Fortran: Program Main => Program MPI_Main
         • Handle global/static variables
             • Manual: group globals into one big structure and allocate it on the heap
             • Semi-automatic: use thread-local storage
                 • static __thread int var;
             • Automatic: -swapglobals compiler option (ELF binaries)
                 • Only handles globals, not statics
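
A minimal sketch of the manual approach, assuming a hypothetical `RankState` structure and helper names that do not come from AMPI: every former global becomes a field, and each rank allocates its own copy on the heap, so ranks that run as threads inside one process no longer share state.

```c
#include <stdlib.h>

/* Hypothetical per-rank state; these fields used to be global variables. */
typedef struct {
    int value;       /* running sum carried around the ring */
    int iterations;  /* number of updates applied so far */
} RankState;

/* Allocate a zero-initialized state block on the heap (one per rank). */
RankState *rank_state_create(void) {
    return calloc(1, sizeof(RankState));
}

/* Apply one update; all access goes through the pointer, never a global. */
void rank_state_step(RankState *s, int myid) {
    s->value += myid;
    s->iterations++;
}

/* Release the state block. */
void rank_state_destroy(RankState *s) {
    free(s);
}
```

In an MPI program, each rank would typically create its state after `MPI_Init` and pass the pointer to the routines that previously touched the globals.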

  11. Ring Example (ring.c)
     #include <stdio.h>
     #include "mpi.h"

     #define TIMES 10

     #if CMK_BLUEGENE_CHARM
     extern void BgPrintf(const char *);
     #define BGPRINTF(x) if (myid == 0) BgPrintf(x);
     #else
     #define BGPRINTF(x)
     #endif

     int value = 0;   /* global variable: not AMPI-compliant */

     int main(int argc, char *argv[])
     {
       int myid, numprocs, i;
       double time;
       MPI_Status status;

       MPI_Init(&argc,&argv);
       MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
       MPI_Comm_rank(MPI_COMM_WORLD,&myid);

       time = MPI_Wtime();
       BGPRINTF("Start of major loop at %f \n");
       for (i=0; i<TIMES; i++) {
         if (myid == 0) {
           MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD);
           MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status);
         } else {
           MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
           value += myid;
           MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
         }
       }
       BGPRINTF("End of major loop at %f \n");

       if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
       MPI_Finalize();
       return 0;
     }

  12. Ring Example (AMPI-compliant)
     #include <stdio.h>
     #include "mpi.h"

     #define TIMES 10

     #if CMK_BLUEGENE_CHARM
     extern void BgPrintf(const char *);
     #define BGPRINTF(x) if (myid == 0) BgPrintf(x);
     #else
     #define BGPRINTF(x)
     #endif

     int main(int argc, char *argv[])
     {
       int myid, numprocs, i, value=0;   /* value is now a local variable */
       double time;
       MPI_Status status;

       MPI_Init(&argc,&argv);
       MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
       MPI_Comm_rank(MPI_COMM_WORLD,&myid);

       time = MPI_Wtime();
       BGPRINTF("Start of major loop at %f \n");
       for (i=0; i<TIMES; i++) {
         if (myid == 0) {
           MPI_Send(&value,1,MPI_INT,myid+1,999,MPI_COMM_WORLD);
           MPI_Recv(&value,1,MPI_INT,numprocs-1,999,MPI_COMM_WORLD,&status);
         } else {
           MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
           value += myid;
           MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
         }
       }
       BGPRINTF("End of major loop at %f \n");

       if (myid==0) printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
       MPI_Finalize();
       return 0;
     }

  13. How to Compile and Run MPI Applications for the Emulator
     • Compile with AMPI and the emulator:
         • charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c
     • With the performance trace module:
         • charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c -tracemode projections
     • Run:
         • Use the mpirun provided by AMPI
         • Give the number of target processors as well as the number of real processors
     • Define the target machine:
         • Command-line options
             • +x +y +z (target machine dimensions, in nodes)
             • +cth +wth (communication and worker threads per node)
             • E.g. mpirun -np 4 ./ring +x10 +y10 +z10 +cth2 +wth4
         • Or use a config file
             • mpirun -np 4 ./ring +bgconfig config

  14. Bgconfig File Format
     • +bgconfig ./bg_config
           x 10
           y 10
           z 10
           cth 2
           wth 4
           stacksize 4000
           timing walltime
           #timing bgelapse
           #timing counter
           #cpufactor 1.0
           fpfactor 5e-7
           traceroot /tmp
           log yes
           correct no
           network bluegene
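
A small sketch of how the machine-shape entries combine, assuming (as the emulator's own report of totalWorkerProcs:8 for an 8x1x1 machine with one worker thread suggests) that the number of target worker processors is x * y * z * wth. The `BgMachine` structure and function names are hypothetical and are not part of the BigSim API.

```c
/* Hypothetical mirror of the x, y, z, cth and wth configuration entries. */
typedef struct {
    int x, y, z;  /* target machine dimensions, in nodes */
    int cth;      /* communication threads per node */
    int wth;      /* worker threads per node */
} BgMachine;

/* Number of target nodes. */
long bg_nodes(const BgMachine *m) {
    return (long)m->x * m->y * m->z;
}

/* Number of target worker processors (assumed: nodes times worker threads). */
long bg_worker_procs(const BgMachine *m) {
    return bg_nodes(m) * m->wth;
}

/* Number of target communication threads (assumed: nodes times cth). */
long bg_comm_threads(const BgMachine *m) {
    return bg_nodes(m) * m->cth;
}
```

Under that assumption, the configuration shown on this slide (10 x 10 x 10 nodes, wth 4, cth 2) describes 1000 nodes with 4000 worker processors and 2000 communication threads.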

  15. Ring Std Output
     Justice> mpirun -np 4 ./pgm +bgconfig ./bg_config
     Reading Bluegene Config file ./bg_config ...
     BG info> Simulating 8x1x1 nodes with 1 comm + 1 work threads each.
     BG info> Network type: ibmpower. alpha: 1.000000e-06 bandwidth :1.700000e+09.
     BG info> cpufactor is 1.000000.
     BG info> floating point factor is 0.000000.
     BG info> BG stack size: 30000 bytes.
     BG info> Using WallTimer for timing method.
     BG info> Generating timing log.
     BG info> bgTrace root is './'.
     LB> Load balancer ignores processor background load.
     Start of major loop at 0.268719
     End of major loop at 0.273697
     Sum=280, Time=0.00497856
     [0] Number is numX:8 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:4 totalWorkerProcs:8 bglog_ver:5
     [2] Wrote to disk for 2 BG nodes.
     [3] Wrote to disk for 2 BG nodes.
     [1] Wrote to disk for 2 BG nodes.
     [0] Wrote to disk for 2 BG nodes.
     BG> BlueGene emulator shutdown gracefully!
     BG> Emulation took 0.692498 seconds!
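
As a sanity check on the reported Sum=280, the ring can be replayed sequentially. This is only a sketch of the arithmetic, not emulator code, and `ring_expected_sum` is a hypothetical helper: in each round every rank except 0 adds its own id to the value before passing it on.

```c
/* Replays the ring without MPI: rank 0 injects the value, ranks 1..numprocs-1
   each add their id, and the value returns to rank 0 once per round.
   Assumes numprocs >= 2, as the ring program itself does. */
int ring_expected_sum(int numprocs, int times) {
    int value = 0;
    for (int round = 0; round < times; round++) {
        for (int myid = 1; myid < numprocs; myid++) {
            value += myid;
        }
    }
    return value;
}
```

With 8 target processors and TIMES set to 10, each round adds 1 + 2 + ... + 7 = 28, so ten rounds give 280, matching the output above.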

  16. Ring Output Files
     Justice> ls -l
     -rwxr-xr-x 1 gzheng kale 2194434 2009-04-15 00:03 ring
     -rw-r--r-- 1 gzheng kale   10105 2009-04-15 00:04 pgm.sts
     -rw-r--r-- 1 gzheng kale       0 2009-04-15 00:04 pgm.projrc
     -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.7.log
     -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.6.log
     -rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.5.log
     -rw-r--r-- 1 gzheng kale    4559 2009-04-15 00:04 pgm.4.log
     -rw-r--r-- 1 gzheng kale    4861 2009-04-15 00:04 pgm.3.log
     -rw-r--r-- 1 gzheng kale    5163 2009-04-15 00:04 pgm.2.log
     -rw-r--r-- 1 gzheng kale    5167 2009-04-15 00:04 pgm.1.log
     -rw-r--r-- 1 gzheng kale    6670 2009-04-15 00:04 pgm.0.log
     -rw-r--r-- 1 gzheng kale   23901 2009-04-15 00:04 bgTrace3
     -rw-r--r-- 1 gzheng kale   23938 2009-04-15 00:04 bgTrace2
     -rw-r--r-- 1 gzheng kale   24663 2009-04-15 00:04 bgTrace1
     -rw-r--r-- 1 gzheng kale   24242 2009-04-15 00:04 bgTrace0
     -rw-r--r-- 1 gzheng kale      60 2009-04-15 00:04 bgTrace
     Slide annotations: 8 Projections log files (pgm.0.log to pgm.7.log); only 4 bgTrace files (bgTrace0 to bgTrace3).

  17. What is in the Trace Logs?
     • Each bgTrace file holds traces for 2 target processors
     • Tools for reading bgTrace binary files:
         • charm/examples/bigsim/tools/loadlog
             • Converts to a human-readable format
         • charm/examples/bigsim/tools/log2proj
             • Converts to Projections trace log files
     • Each SEB (sequential execution block) has:
         • startTime, endTime
         • Incoming message ID
         • Outgoing messages
         • Dependences

  18. Ring Projections Timeline

  19. Performance Prediction
     • How to predict sequential performance?
     • Different levels of fidelity:
         • User-supplied timing expression
         • Wall clock time
         • Performance counters
         • Instruction-level simulation

  20. Sequential Time - BgElapse
     • BgElapse: manually advance processor time
           MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
           value += myid;
           ...
           BgElapse(0.000005);
           MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);
     • Run with +bgelapse

  21. Sequential Time - Using Wallclock
     • Wallclock measurements of the time can be used with a suitable multiplier (scale factor): T * factor
     • Run the application with +bgwalltime and +bgcpufactor, or
     • +bgconfig ./bgconfig, containing:
           timing walltime
           cpufactor 0.7
     • Good for predicting a larger machine using a fraction of the machine
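
The scaling rule T * factor amounts to the following sketch. `scaled_time` is a hypothetical helper written only to illustrate the arithmetic; it is not a BigSim function.

```c
/* Predicted time on the target processor for a block measured at
   measured_seconds on the host, using the wallclock scale factor. */
double scaled_time(double measured_seconds, double cpufactor) {
    return measured_seconds * cpufactor;
}
```

For example, with cpufactor 0.7 a block that takes 2.0 s of wallclock time on the host is charged 1.4 s on the target machine.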

  22. Sequential Time - Performance Counters
     • Count floating-point, integer, memory, and branch instructions (for example) with hardware counters
     • Convert these counts to an expected time on the target machine
     • Cache performance and memory footprint effects can be approximated by the percentage of memory accesses and the cache hit/miss ratio
     • Example of use, for a floating-point-intensive code:
         • +bgconfig ./bg_config, containing:
               timing counter
               fpfactor 5e-7
     • Perfex and PAPI are supported

  23. Sequential Time - Instruction-Level Simulation
     • Run an instruction-level simulator separately to get accurate timing information
     • Issues:
         • It is a different, third-party hardware simulator
         • Hard to integrate with BigSim
         • Sequential
         • Does not model communication
         • Slow!

  24. Interpolation
     • BigSim and the instruction-level simulator interact through logs
     • Reduce the problem size by sampling: an interpolation-based scheme
         • Run a smaller-sized problem, or
         • Run just one processor
     • Assume computation can be modelled by a set of parameters:
         • TC = Fn(p1, p2, p3, ...)
     • Use sample data from the instruction-level simulation to interpolate the large dataset
     • With the sampling data, do a least-squares fit to determine the coefficients of an approximating polynomial function
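
A minimal sketch of the least-squares step, assuming a single parameter p and a first-degree polynomial T = a + b*p. It is written in plain C for illustration and is not the code of the actual interpolation tool; the function names are hypothetical.

```c
#include <stddef.h>

/* Ordinary least-squares fit of t[i] ~ a + b * p[i].
   Returns 0 on success, -1 if the samples cannot determine a line. */
int fit_line(const double *p, const double *t, size_t n, double *a, double *b) {
    double sp = 0.0, st = 0.0, spp = 0.0, spt = 0.0;
    for (size_t i = 0; i < n; i++) {
        sp  += p[i];
        st  += t[i];
        spp += p[i] * p[i];
        spt += p[i] * t[i];
    }
    double denom = (double)n * spp - sp * sp;
    if (n < 2 || denom == 0.0) {
        return -1;  /* fewer than two samples, or all p[i] identical */
    }
    *b = ((double)n * spt - sp * st) / denom;  /* slope */
    *a = (st - *b * sp) / (double)n;           /* intercept */
    return 0;
}

/* Predicted time of a sequential block with parameter value p. */
double predict_time(double a, double b, double p) {
    return a + b * p;
}
```

The samples (p, t) would come from the instruction-level simulator; the fitted coefficients then supply times for parameter values that were never simulated directly.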

  25. Case study: BigSim / Mambo
     void func( )
     {
       startTraceBigSim( )
       …
       endTraceBigSim( )
     }
     [Figure: workflow diagram. Labels: BigSim Parallel Emulation; trace files; parameter files for sequential blocks; Mambo (cycle-accurate prediction of sequential blocks on POWER7 processor); interpolation + replace sequential timing; adjusted trace files; BigSim Parallel Simulation; prediction for target system.]

  26. Ring Example
     MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
     startTraceBigSim();
     value += myid;
     endTraceBigSim();
     char param[128];
     sprintf(param, "sum %d", myid);
     tagTraceBigSim(param);
     MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);

  27. Output Files
     justice> ls -l
     total 2328
     -rw-r--r-- 1 gzheng kale      60 2009-04-15 11:08 bgTrace
     -rw-r--r-- 1 gzheng kale   36757 2009-04-15 11:08 bgTrace0
     -rw-r--r-- 1 gzheng kale   37023 2009-04-15 11:08 bgTrace1
     -rwxr-xr-x 1 gzheng kale   94886 2009-04-14 09:46 charmrun*
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.0
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.1
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.2
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.3
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.4
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.5
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.6
     -rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.7
     -rwxr-xr-x 1 gzheng kale 2153700 2009-04-15 11:07 ring*
     -rw-r--r-- 1 gzheng kale    2965 2009-04-15 11:07 ring.C
     justice> cat param.7
     48 sum 7

  28. Run ring Through Instruction-level Simulator
     • Compile the normal version of ring (not the emulator version)
     • Run it sequentially through an instruction-level simulator
     • Sample line of Mambo output:
           10900820693: (10718653772): TRACE_END: sum 7

  29. Compile and Run Interpolation Tool
     • Install GSL, the GNU Scientific Library
     • cd charm/examples/bigsim/tools/rewritelog
     • Modify the file interpolatelog.C to suit your setup:
         • OUTPUTDIR specifies a directory for the new log files
         • CYCLE_TIMES_FILE specifies the file that contains the accurate timing information
     • make
     • Run the interpolation tool in the bgTrace directory:
         • ./interpolatelog

  30. Record/Replay
     • Record only a subset of special logs when running the full-size emulation
     • With the special logs, replay the execution of a particular target processor through a hardware simulator
     • Example:
         • ./pgm +x32768 +y1 +z1 +bgrecord +bgrecordprocessors 0-32767:1024
         • ./pgm +bgreplay 31744
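
Reading the +bgrecordprocessors argument as first-last:stride is an assumption, but it is consistent with the replay of processor 31744 (31 * 1024) in the example. Under that reading, the recorded processors can be enumerated with a hypothetical helper like this:

```c
#include <stddef.h>

/* Expands "first-last:stride" into the list of selected target processors.
   Writes at most cap entries to out and returns how many were selected. */
size_t expand_stride_range(int first, int last, int stride, int *out, size_t cap) {
    size_t n = 0;
    if (stride <= 0) {
        return 0;  /* reject a stride that would never terminate */
    }
    for (int pe = first; pe <= last; pe += stride) {
        if (n < cap) {
            out[n] = pe;
        }
        n++;
    }
    return n;
}
```

For 0-32767:1024 this selects 32 processors: 0, 1024, 2048, and so on up to 31744.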

  31. Out-of-core Emulation
     • Motivation
         • Applications with a large memory footprint
         • The virtual memory system cannot handle this well
     • Use the hard drive
         • Similar to checkpointing
     • Message-driven execution
         • Peek at the message queue to see what will execute next (prefetch)

  32. Using Out-of-core
     • Change the BigSim configuration file:
         • charm/tmp/conv-mach-bigemulator.h
         • #define BIGSIM_OUT_OF_CORE 1
     • Recompile Charm++ and the application
     • Run the application through the emulator with an additional command-line option:
         • +bgooc 1024

  33. Outline
     • Overview
     • BigSim Emulator
     • BigSim Simulator
     • Post-mortem simulation
     • BigNetSim build flow
     • Generic network model: Simple Latency Model
     • Specific network models
     • Extensibility

  34. Postmortem Simulation
     • Run the application once, get trace logs, and run the simulation with those logs for a variety of network configurations
     • Big Network Simulator (BigNetSim) is implemented on the POSE simulation framework
     • Particularly useful when message-passing performance is critical and strongly affected by network contention
     • Note: the BigSim emulator and BigSim simulator both use the same network models for latency-only calculations, located in charm/src/langs/bluegene/bigsim_network.h

  35. Implementation
     • Post-mortem network simulators are parallel discrete event simulations
     • Parallel Object Simulation Environment (POSE)
         • Network layer constructs (NIC, Switch, Node, etc.) are implemented as poser simulation objects
         • Network data constructs (message, packet, etc.) are implemented as event methods on simulation objects

  36. POSE

  37. Terms
     • Several network models available
         • Specific: e.g., BlueGene
             • Latency-only model: does not account for contention
             • Network contention model
         • Generic: Simple Latency Model, which uses a simple equation to determine message transmission time
     • Emulating processors: physical processors on which the emulation is run (+p?)
     • Simulating processors: physical processors on which the simulation (BigNetSim) is run (+p?)
     • Target processors: virtual (or simulated) processors on which emulation and simulation are run (+vp?)

  38. Outline
     • Overview
     • BigSim Emulator
     • BigSim Simulator
     • Post-mortem simulation
     • BigNetSim build flow
     • Generic network model: Simple Latency Model
     • Specific network models
     • Extensibility

  39. BigNetSim Build Flow
     • Download and compile charm
     • Compile POSE
     • Compile bigsim
     • Download BigNetSim
     • Compile BigNetSim
     • Run simulator
     • Output

  40. Download and compile charm (if not done already)
     • Download the latest version of charm from the PPL archives: http://charm.cs.uiuc.edu/download/downloads.shtml
     • Compile charm
         • cd charm
         • ./build charm++ net-linux

  41. Compile POSE
     • cd charm
     • ./build pose net-linux
     • Options are set in pose_config.h
         • Stats enabled by POSE_STATS_ON=1
         • User event tracing enabled by TRACE_DETAIL=1
         • More advanced configuration options
             • Speculation
             • Checkpoints
             • Load balancing

  42. Compile bigsim
     • cd charm/net-linux/tmp
     • make bigsim

  43. Download BigNetSim
     • Download the latest revision from the repository:
           svn co https://charm.cs.uiuc.edu/svn/repos/BigNetSim
     • Directory structure: BigNetSim/trunk/
         • BlueGene/ RedStorm/ and others: network models
         • SimpleLatency/: Simple Latency Model
         • Topology/ Routing/ InputVcSelection/ OutputVcSelection/: network configuration choices
         • Main/: main simulation files
         • tools/: tools directory
         • tmp/: working directory created during build

  44. Compile BigNetSim
     • Fix BigNetSim/trunk/Makefile.common so that CHARMBASE points to your charm directory
     • For the Simple Latency Model:
         • cd BigNetSim/trunk/SimpleLatency
         • For the parallel simulator: make
         • For the sequential simulator (runs on only 1 simulating processor): make SEQUENTIAL=1

  45. Run Simulator
     • cd BigNetSim/trunk/tmp
     • Copy the bgTrace files into this tmp directory
     • For the parallel build, run with:
         • ./charmrun +p4 bigsimulator -lat 1 -bw 1
     • For the sequential build, run with:
         • ./bigsimulator -lat 1 -bw 1

  46. Output
     • Simulation completion time
         • Specified in "GVT ticks" (GVT = Global Virtual Time)
         • GVT tick length is determined by the value of #define factor in BigNetSim/trunk/Main/TCsim.h
         • Divide the final GVT by factor to get the simulation time in seconds
             • factor = 1e8 => 1 tick = 10ns
             • factor = 1e9 => 1 tick = 1ns
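
The tick conversion can be written out as a short sketch. `gvt_to_seconds` is a hypothetical helper, not part of BigNetSim; `factor` stands for the value defined in TCsim.h.

```c
/* Converts a final GVT value (in ticks) to seconds, given the timing factor
   (ticks per second). */
double gvt_to_seconds(long long gvt_ticks, double factor) {
    return (double)gvt_ticks / factor;
}
```

For instance, a final GVT of 38129444 with factor 1e8 corresponds to about 0.381 seconds of simulated time.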

  47. Output (continued)
     • Use BgPrintf(char *) in the source code to print event times
         • Each BgPrintf() called at execution time in online execution mode is stored in the trace log as a printing event
         • In postmortem simulation, the strings associated with BgPrintf() events are printed when the event is committed
         • "%f" in the string is replaced by the committed time
         • Useful for determining iteration times during simulation as well as emulation
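
The "%f" substitution for the printing events described above can be mimicked in a few lines. This sketch only illustrates the documented behavior and is not the simulator's implementation; `format_print_event` is a hypothetical name, and only the first "%f" is replaced.

```c
#include <stdio.h>
#include <string.h>

/* Copies fmt into out, replacing the first "%f" with the committed time.
   Returns the value of snprintf (the length the full string would need). */
int format_print_event(char *out, size_t cap, const char *fmt, double committed_time) {
    const char *hit = strstr(fmt, "%f");
    if (hit == NULL) {
        return snprintf(out, cap, "%s", fmt);  /* nothing to substitute */
    }
    /* text before "%f", then the time, then the text after "%f" */
    return snprintf(out, cap, "%.*s%f%s",
                    (int)(hit - fmt), fmt, committed_time, hit + 2);
}
```

Given the string "Start of major loop at %f \n" and a committed time of 0.347418, this yields "Start of major loop at 0.347418 \n", the form seen in the simulator output.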

  48. Output (continued)
     • Projections
         • Copy the emulation Projections logs and sts file into BigNetSim/trunk/tmp
         • Two ways to use:
             • Command-line parameter: -projname <name>
                 • Creates a new set of logs by updating the emulation logs
                 • Assumes the emulation Projections logs are: <name>.*.log
                 • Output: <name>-bg.*.log
                 • Disadvantage: emulation Projections overhead is included
             • Command-line parameter: -tproj
                 • Creates a new set of logs from the trace files, ignoring the emulation logs
                 • Must first copy the <name>.sts file to tproj.sts
                 • Output: tproj.*.log
                 • Advantage: no emulation Projections overhead is included

  49. Ring Example
     • ./bigsimulator -lat 1 -bw 1
     Charm++: standalone mode (not using charmrun)
     Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
     Charm++> cpu topology info is being gathered!
     Charm++> 1 unique compute nodes detected!
     bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
     Opts: netsim on: 0
     Initializing POSE...
     POSE initialization complete.
     Using Inactivity Detection for termination.
     netsim skip_on 0 0
     Info> timing factor 1.000000e+08 ...
     Info> invoking startup task from proc 0 ...
     [0:RECV_RESUME] Start of major loop at 0.347418
     [0:RECV_RESUME] End of major loop at 0.349147
     Simulation inactive at time: 38129444
     Final GVT = 38129444
     1 PE Simulation finished at 0.052671.
     Program finished.

  50. Projections - Ring Example
     [Figure: Projections timelines for the ring example. Panels: emulation; simulation with -lat 1 (latency = 1 µs), generated with -tproj.]
