
Heterogeneous Thread Assignment Simulation

Heterogeneous Thread Assignment Simulation. Kris Lange, Nopparat suwaanarat, Pree Thiengburanathum. Agenda: Introduction, Motivation, Review concepts, M5 architecture, Configuring M5 Simulator, Simulation, Results and Analysis, Conclusion.


Presentation Transcript


  1. Heterogeneous Thread Assignment Simulation • Kris Lange • Nopparat suwaanarat • Pree Thiengburanathum

  2. Agenda • Introduction • Motivation • Review concepts • M5 architecture • Configuring M5 Simulator • Simulation • Results and Analysis • Conclusion

  3. Introduction • Basis: "Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures" • Paper makes 2 claims • Heterogeneous CMPs outperform homogeneous CMPs (for a fixed total die size) • Benefits of heterogeneous CMPs are enhanced by dynamic thread assignment policies

  4. Motivation • Gain deeper understanding of research paper • Verify results of this paper • Gain hands-on experience running a peer-reviewed experiment

  5. Review: Concepts • Heterogeneous CMP system • Homogeneous CMP system • Heterogeneous vs. homogeneous CMPs under multi-programmed workloads

  6. Review: Concepts • Heterogeneous CMP system • Many simple cores = higher thread parallelism • Fewer, larger cores = lower thread parallelism • We want to maximize resource utilization and achieve a high degree of inter-thread parallelism. • How? By mapping running tasks to cores and using a control mechanism.

  7. Review: Concepts • Which one has a better total execution time? • Control mechanism: thread assignment policies • Static thread assignment: random, best • Dynamic thread assignment: round robin, IPC driven

  8. Concepts: Assignment Policies • Static thread assignment • Usually assigns threads to the faster core. • A well-studied problem; threads are assigned before execution. • Solutions rely on heuristics • Random static assignment: the workloads and IPC are not known. • Pseudo-best static assignment: the workloads and IPC are known, and a heuristic is used to find the mapping. • Disadvantages: • Does not reassign threads at run time. • Does not optimize usage of the faster core(s). • "Slow" threads on slower core(s) penalize overall system performance.

  9. Concepts: Assignment Policies • Dynamic thread assignment • Round robin assignment • Rotates the assignment of threads to processors in a round robin fashion. • Ensures that the available faster cores are equally shared among the running programs.

  10. Concepts: Assignment Policies • IPC-driven assignment • Considers the characteristics of the executing threads. • Looks at each thread's IPC on the two core types, and the ratio between them, to decide the thread mapping. • Threads with a higher ratio run on the faster core. • Threads with a lower ratio run on the slower core.

  11. Simulation Approach • Goal: duplicate experiment in paper (peer-reviewed) • 2-phase simulation • 1) Obtain IPC trace values for Spec2000 programs • Using M5 simulator • Alpha EV5 + EV6 cores • 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

  12. Which simulator is suitable? • Rsim • Simple MP • SimOS • Simics • TFsim • SimFlex • GEMS

  13. Introduction & Overview • What is M5? • A brief peek inside

  14. What is M5? • A modular platform for simulating systems • Encompasses • System-level architecture • Processor microarchitecture

  15. Key properties of M5 • Pervasively object-oriented • Multiple interchangeable CPU models • Event-driven memory system • Multiprocessor / multi-system capability

  16. Overview of M5 Architecture [Figure: block diagram of connected M5 systems; each contains a CPU, L1 cache, L2 cache, memory, and an I/O device, joined by buses and bus bridges]

  17. M5’s Architecture • CPU models • ISA • Memory system • Cache • Buses

  18. CPU model • A simple CPU model • 2 detailed CPU models

  19. CPU model [Figure: pipeline stages Fetch, Decode, Rename, Issue/Execute/Writeback, Commit, with backward communication between stages]

  20. Instruction Set Architecture (ISA) • Goal • Allow a human-readable ISA description • Two parts • A simple part: describes the decode • A declaration part: describes the global information

  21. Memory System • Goal • combine the timing and functional models into one model • Simplify the memory system code • Make changes easier

  22. Memory Architecture [Figure: caches and memories attach to a bus through ports; each port is paired with a peer port]

  23. Cache • Coherency • Prefetching • BasePrefetcher • BHB Prefetcher • StridePrefetcher • TaggedPrefetcher

  24. Buses • Memory, I/O, CPUs • Master: closer to memory • Slave: closer to CPU

  25. Configuring the M5 Simulator • Setup for M5 Simulator • Windows Vista host running Fedora Core under VMware. • Download the simulator from the website. • www.m5sim.org (open source) • Required software: • g++, python, scons, zlib, swig

  26. Building, Compiling and Running M5 • FS mode • Full System mode. This mode simulates a complete system including a kernel, I/O devices, etc. This mode currently only works with the ALPHA architecture. • SE mode • Syscall Emulation mode. This mode simulates statically compiled binaries by functionally emulating any syscall they make. • Example commands to build and run M5 • % scons build/ALPHA_SE/m5.debug • % ./build/ALPHA_SE/m5.debug configs/example/se.py

  27. Cross Compilation • What is cross compilation? • Compiling a program for a target platform different from the platform the compiler runs on • M5 test programs must be compiled for Alpha+Linux • Why? • M5 implements the Alpha ISA and Linux syscalls • Since we don’t own Alpha hardware: cross-compile

  28. Cross Compilation: Take 1 • The toolchain must be built for the specific target • gcc, glibc, binutils, etc. • Dan Kegel’s crosstool makes this easier: • http://www.kegel.com/crosstool • Of the 3 Spec2000 programs we considered, we were only able to successfully cross-compile gzip

  29. Cross Compilation: Take 2 • Scour the net until you run across this link: • http://arch.cs.duke.edu/spec2000binaries.tar.bz2 • All Spec2000 binaries compiled for alpha-linux!

  30. M5 Output • M5 produces simulation results at end:
      ---------- Begin Simulation Statistics ----------
      host_inst_rate 86899 # Simulator instruction rate (inst/s)
      host_mem_usage 543680 # Number of bytes of host memory used
      host_seconds 0.07 # Real time elapsed on the host
      host_tick_rate 28827895 # Simulator tick rate (ticks/s)
      sim_freq 1000000000000 # Frequency of simulated ticks
      sim_insts 5997 # Number of instructions simulated
      sim_seconds 0.000002 # Number of seconds simulated
      sim_ticks 2005326 # Number of ticks simulated
      system.cpu0.dtb.accesses 0 # DTB accesses
      system.cpu0.dtb.acv 0 # DTB access violations
      system.cpu0.dtb.hits 0 # DTB hits
      system.cpu2.num_refs 1960 # Number of memory references
      :
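The statistics dump above is plain text in a "name value # description" layout, so it can be post-processed with a few lines of script. The following is a minimal sketch, not part of the original experiment: the function name `parse_m5_stats` and the `SAMPLE` string are illustrative assumptions, and only single-valued lines like the ones shown on the slide are handled.

```python
def parse_m5_stats(text):
    """Return {stat_name: float_value} for each 'name value # comment' line."""
    stats = {}
    for line in text.splitlines():
        # Drop the trailing description, then split into whitespace fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) != 2:
            continue  # banner lines, blank lines, anything not 'name value'
        name, value = fields
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # non-numeric value: skip rather than guess
    return stats


# A few lines copied from the slide's statistics dump.
SAMPLE = """\
---------- Begin Simulation Statistics ----------
host_seconds 0.07 # Real time elapsed on the host
sim_freq 1000000000000 # Frequency of simulated ticks
sim_insts 5997 # Number of instructions simulated
sim_ticks 2005326 # Number of ticks simulated
"""

stats = parse_m5_stats(SAMPLE)
# Simulated time follows from ticks divided by tick frequency.
simulated_seconds = stats["sim_ticks"] / stats["sim_freq"]
print(simulated_seconds)
```

Dividing `sim_ticks` by `sim_freq` gives roughly 2 microseconds of simulated time, which agrees with the rounded `sim_seconds` value in the dump.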

  31. Getting M5 to Output Trace • We want an IPC trace every 1 million cycles • So we patched:
      diff -Naur src/cpu/o3/cpu.cc /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc
      --- src/cpu/o3/cpu.cc 2007-11-01 19:13:05.000000000 -0600
      +++ /Users/klange/src/thirdparty/m5_2.0b4/src/cpu/o3/cpu.cc 2007-12-01 22:54:38.000000000 -0700
      @@ -422,6 +422,21 @@
           ++numCycles;
      +    ++totalCycles; // we could use numCycles...if only i could figure out how to stringificate
      +    ++currentCycles;
      +    if (currentCycles >= 1000000) {
      +        double currentIpc = (double)currentCommittedInsts / (double)currentCycles;
      +
      +        cout << "IPC: "
      +             << totalCycles << ","
      +             << totalCommittedInstsInt << ","
      +             << currentIpc << std::endl;
      +
      +        currentCommittedInsts = 0;
      +        currentCycles = 0;
      +    }
      +
      +
           // activity = false;
           //Tick each of the stages
      @@ -452,8 +467,10 @@
           if (removeInstsThisCycle) {
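The patch prints one line per million-cycle window in the form `IPC: totalCycles,totalCommittedInsts,currentIpc`. A second-phase simulator needs those lines turned back into numbers. The sketch below shows one way to do that; the function name `parse_ipc_trace` and the sample values are invented for illustration and are not taken from the actual traces.

```python
def parse_ipc_trace(lines):
    """Extract (total_cycles, total_insts, window_ipc) tuples from M5 output.

    Only lines starting with the 'IPC: ' prefix written by the patch are
    used; everything else on stdout is ignored.
    """
    prefix = "IPC: "
    samples = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        cycles, insts, ipc = line[len(prefix):].strip().split(",")
        samples.append((int(cycles), int(insts), float(ipc)))
    return samples


# Hypothetical output: the numbers below are made up for the example.
output = [
    "warn: some unrelated simulator message",
    "IPC: 1000000,850000,0.85",
    "IPC: 2000000,1950000,1.1",
]

trace = parse_ipc_trace(output)
ipc_values = [ipc for _, _, ipc in trace]
print(ipc_values)
```

Keeping the cumulative cycle count alongside each window IPC makes it possible to line up the EV5 and EV6 traces of the same program later.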

  32. Build the processor core

  33. EV5 configuration on M5

  34. EV6 configuration on M5

  35. Simulation Approach • Goal: duplicate experiment in paper (peer-reviewed) • 2-phase simulation • 1) Obtain IPC trace values for Spec2000 programs • Using M5 simulator • Alpha EV5 + EV6 cores • 2) Use our own simulator to model various heterogeneous CMP configurations and evaluate assignment policies

  36. Choosing Workload • Spec 2000 • Paper: • gzip • gcc • crafty (chess program) • parser (natural language processor) • bzip2 • wupwise (quantum chromodynamics) • swim (shallow water modeling) • mgrid (multi-grid solver in 3D potential field) • galgel (fluid dynamics modeling) • equake (earthquake modeling) • lucas (prime number test) • Us: • gzip • bzip2 • crafty

  37. Workload Input • Spec 2000 input is proprietary • Compromise: • gzip/bzip2 input: Shakespeare plays • crafty input: sample chess game

  38. IPC Traces • Obtained from M5

  39. IPC Traces

  40. IPC Traces

  41. CMP Simulator • Java • Modular design • Core simulator module • Common thread-assignment policy interface • Policy modules • Static • Round Robin (dynamic) • IPC-Driven (dynamic)

  42. CMP Simulator • Command-line interface • Example: CMPSim spec2000 10 2 1 roundrobin • Input: • Workload • Number of threads • Selected randomly from 3 Spec 2000 programs • # EV5 cores • # EV6 cores • Thread assignment policy
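The five positional inputs listed above map naturally onto a small argument parser. The sketch below is written in Python for brevity (the actual simulator is Java) and rests on assumptions: the argument order is inferred from the input list, so `10 2 1` is read as 10 threads, 2 EV5 cores, and 1 EV6 core, and only the policy name `roundrobin` appears on the slide; `static` and `ipcdriven` are hypothetical names.

```python
import argparse


def build_parser():
    """Command-line layout assumed from the slide's input list."""
    parser = argparse.ArgumentParser(prog="CMPSim")
    parser.add_argument("workload", help="workload name, e.g. spec2000")
    parser.add_argument("num_threads", type=int, help="number of threads")
    parser.add_argument("num_ev5", type=int, help="number of EV5 cores")
    parser.add_argument("num_ev6", type=int, help="number of EV6 cores")
    # 'roundrobin' is from the slide; the other two names are assumptions.
    parser.add_argument("policy", choices=["static", "roundrobin", "ipcdriven"])
    return parser


# Parse the example invocation from the slide.
args = build_parser().parse_args("spec2000 10 2 1 roundrobin".split())
print(args.workload, args.num_threads, args.num_ev5, args.num_ev6, args.policy)
```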

  43. CMP Simulator • Output:
      Threads,Experiment,System IPC
      1,20EV5 RR,0.905097784767538
      2,20EV5 RR,1.46127036511788
      3,20EV5 RR,2.06244067869053
      4,20EV5 RR,2.78590633860981
      5,20EV5 RR,3.35373843898152
      6,20EV5 RR,4.07299579068557
      7,20EV5 RR,4.17449020511364
      8,20EV5 RR,4.915937425
      9,20EV5 RR,5.47383727613636
      10,20EV5 RR,6.00090476193182
      11,20EV5 RR,6.64824888522727
      12,20EV5 RR,7.26460146590909
      13,20EV5 RR,7.90477401704545
      14,20EV5 RR,8.46545665397727
      15,20EV5 RR,9.23393584545455
      16,20EV5 RR,9.80104248465909
      17,20EV5 RR,10.3671315159091

  44. CMP Simulator Issue • IPC data are temporal sequences

  45. Static Policy • Randomly assign threads to cores at startup • Repeat process whenever core becomes idle • Weaknesses: • When one core becomes idle, it will persist in that state unless some unassigned thread exists. • In the case of a heterogeneous system, this results in underutilization of "faster" cores. • Execution of "slow" threads on "slower" cores may penalize overall system performance.
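A minimal sketch of the static policy as described above, written in Python for brevity rather than in the simulator's Java. The function names, the use of a FIFO for unassigned threads, and the thread and core labels are all assumptions made for the example.

```python
import random
from collections import deque


def static_startup(threads, cores, rng):
    """Randomly place threads on cores; leftover threads wait unassigned."""
    shuffled = list(threads)
    rng.shuffle(shuffled)
    assignment = dict(zip(cores, shuffled))   # one thread per core
    waiting = deque(shuffled[len(cores):])    # threads that did not fit
    return assignment, waiting


def on_core_idle(core, assignment, waiting):
    """A thread finished on `core`: refill it if any thread is unassigned."""
    if waiting:
        assignment[core] = waiting.popleft()
    else:
        # Nothing left to run: the core stays idle (the policy's weakness).
        assignment.pop(core, None)


rng = random.Random(0)  # fixed seed so the example is repeatable
threads = ["t0", "t1", "t2", "t3", "t4"]
cores = ["EV6-0", "EV5-0", "EV5-1"]
assignment, waiting = static_startup(threads, cores, rng)

next_up = waiting[0]
on_core_idle("EV5-0", assignment, waiting)
print(assignment, list(waiting))
```

Note that running threads never move: once the wait queue is empty, a freed EV6 core stays idle even while slower threads are still running on EV5 cores.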

  46. Round Robin Policy • Randomly assign threads to cores at startup • Define swap_period • Experimentally, swap_period = 20M cycles works well • if (current_cycle % swap_period == 0) • Migrate thread from EV6 -> wait queue • Migrate thread from EV5 -> EV6 • Migrate thread from wait queue -> EV5 • When core becomes idle, assign longest-waiting thread
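The rotation can be sketched as a single left-rotate over an ordered list of slots. This Python sketch is an illustration under stated assumptions, not the authors' Java code: slots are ordered EV6 cores first, then EV5 cores, then the wait queue; the last migration step is taken to refill the EV5 core that was just vacated; and a swap at cycle 0 is skipped so the startup assignment gets a full period.

```python
from collections import deque

SWAP_PERIOD = 20_000_000  # cycles; the value reported to work well


def maybe_swap(current_cycle, slots):
    """Rotate threads one slot toward the fast end every SWAP_PERIOD cycles.

    `slots` is ordered [EV6 cores..., EV5 cores..., wait queue...].
    Rotating left by one sends the first EV6 thread to the back of the
    wait queue, promotes an EV5 thread to EV6, and moves the
    longest-waiting thread onto an EV5 core.
    """
    if current_cycle > 0 and current_cycle % SWAP_PERIOD == 0:
        slots.rotate(-1)
        return True
    return False


# One EV6 core, one EV5 core, one waiting thread (hypothetical names).
slots = deque(["A", "B", "C"])  # A on EV6, B on EV5, C waiting
swapped = maybe_swap(20_000_000, slots)
print(swapped, list(slots))
```

After the swap, B runs on the EV6 core, C on the EV5 core, and A waits, so every thread gets an equal share of the fast core over time.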

  47. Modeling Thread Migration • Costs • Inter-core context switch • PC, registers, etc. must be transferred • Cache warmup • Simple model • switch_loss: 50% • switch_duration: 1M cycles
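One way to read the simple model is that a thread runs at (1 - switch_loss) of its trace IPC for switch_duration cycles after each migration. That interpretation is an assumption, as are the function and argument names in this Python sketch; it shows how the penalty would enter the instruction count for one window.

```python
SWITCH_LOSS = 0.5            # fraction of IPC lost while warming up
SWITCH_DURATION = 1_000_000  # cycles the penalty lasts after a migration


def instructions_in_window(trace_ipc, window_cycles, cycles_since_migration):
    """Instructions committed in a window, given time since last migration."""
    # Portion of this window that still falls inside the penalty period.
    penalized = max(0, min(window_cycles,
                           SWITCH_DURATION - cycles_since_migration))
    normal = window_cycles - penalized
    return trace_ipc * (penalized * (1.0 - SWITCH_LOSS) + normal)


# A thread with trace IPC 2.0 that was migrated at the start of a
# 20M-cycle swap period (hypothetical numbers).
with_penalty = instructions_in_window(2.0, 20_000_000, 0)
without_penalty = instructions_in_window(2.0, 20_000_000, SWITCH_DURATION)
print(with_penalty, without_penalty)
```

Under these assumptions a migration costs 1M of the 40M instructions in a 20M-cycle period, or 2.5%, which suggests why a long swap period keeps migration overhead small.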

  48. Round Robin Weakness • No effort is made to optimize thread-to-core mapping

  49. IPC-Driven Policy • Optimize thread-to-core mapping • Define IPC ratio = EV6 IPC / EV5 IPC • Heuristic: threads with highest IPC ratio are assigned to EV6 • System must compute average IPC for each core type • Requires forced migrations • To handle IPC spikes, use a weighted average: • Current IPC * 0.65 + Previous IPC * 0.35
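The weighted average and the IPC ratio are simple enough to state directly in code. In this Python sketch the function names are illustrative, and whether "Previous IPC" means the previous raw sample or the previous weighted value is left to the caller, since the slide does not say.

```python
W_CURRENT = 0.65   # weight on the most recent IPC measurement
W_PREVIOUS = 0.35  # weight on the previous IPC value


def weighted_ipc(current_ipc, previous_ipc):
    """Smooth IPC spikes by blending the current and previous values."""
    return W_CURRENT * current_ipc + W_PREVIOUS * previous_ipc


def ipc_ratio(ev6_ipc, ev5_ipc):
    """How much a thread gains from running on EV6 rather than EV5."""
    return ev6_ipc / ev5_ipc


# Hypothetical measurements: a spike from 1.0 to 2.0 on EV6 is damped.
smoothed_ev6 = weighted_ipc(2.0, 1.0)   # about 1.65
ratio = ipc_ratio(smoothed_ev6, 1.1)    # about 1.5
print(smoothed_ev6, ratio)
```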

  50. IPC-Driven Policy • Randomly assign threads to cores at startup • Again, define swap_period • Experimentally, swap_period = 20M cycles works well • if (current_cycle % swap_period == 0) • Sort threads by weighted IPC ratio • Migrate accordingly • When core becomes idle, assign thread from wait queue with highest IPC ratio
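The swap step can be sketched as a sort followed by slicing. This Python sketch assumes the threads with the lowest ratios are the ones left in the wait queue when there are more threads than cores; the slide only fixes that the highest ratios go to EV6. The function name and the ratio values are invented for the example.

```python
def ipc_driven_assignment(ratios, num_ev6, num_ev5):
    """Split threads into EV6, EV5, and waiting groups by IPC ratio.

    `ratios` maps thread name -> weighted EV6/EV5 IPC ratio.
    """
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    on_ev6 = ranked[:num_ev6]                      # biggest gain from EV6
    on_ev5 = ranked[num_ev6:num_ev6 + num_ev5]     # next best
    waiting = ranked[num_ev6 + num_ev5:]           # assumption: lowest wait
    return on_ev6, on_ev5, waiting


# Hypothetical weighted ratios for four threads on 1 EV6 + 2 EV5 cores.
ratios = {"gzip-a": 1.4, "bzip2": 1.9, "crafty": 1.6, "gzip-b": 1.2}
on_ev6, on_ev5, waiting = ipc_driven_assignment(ratios, num_ev6=1, num_ev5=2)
print(on_ev6, on_ev5, waiting)
```

Comparing this result with the current placement tells the simulator which threads need a forced migration at the swap boundary.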
