Introduction to SimpleScalar (Based on SimpleScalar Tutorial)

Presentation Transcript


  1. Introduction to SimpleScalar (Based on SimpleScalar Tutorial) CSCE614 Texas A&M University

  2. Overview
     • What is an architectural simulator?
        • A tool that reproduces the behavior of a computing device
     • Why use a simulator?
        • Leverage a faster, more flexible software development cycle
        • Permit more design space exploration
        • Facilitate validation before H/W becomes available
        • Level of abstraction is tailored to the design task
        • Possible to increase/improve system instrumentation
        • Usually less expensive than building a real system

  3. Advantages of SimpleScalar
     • Highly flexible
        • Functional simulator + performance simulator
     • Portable
        • Host: virtual target runs on most Unix-like systems
        • Target: simulators can support multiple ISAs
     • Extensible
        • Source is included for compiler, libraries, simulators
        • Easy to write simulators
     • Performance
        • Runs codes approaching ‘real’ sizes

  4. Architectural Simulators
     [Figure: taxonomy of simulation tools. Labels: Functional, Performance, Trace-Driven, Exec-Driven, Inst Schedulers, Cycle Timers, Interpreters, Direct Execution. Shaded tools are included in the SimpleScalar tool set.]

  5. Functional vs. Performance Simulators
     • Functional simulators implement the architecture
        • Perform real execution
        • Implement what programmers see
     • Performance simulators implement the microarchitecture
        • Model system resources/internals
        • Are concerned with timing
        • Do not implement what programmers see

  6. Trace-Driven vs. Execution-Driven Simulators
     • Trace-driven
        • Simulator reads a ‘trace’ of the instructions captured during a previous execution
        • Easy to implement
        • No functional components necessary
        • No feedback to trace (e.g., mis-prediction)
     • Execution-driven
        • Simulator runs the program (trace-on-the-fly)
        • Hard to implement
        • Advantages
           • Faster than tracing
           • No need to store traces
           • Register and memory values are available (they usually are not in a trace)
           • Supports mis-speculation cost modeling

  7. Instruction Schedulers vs. Cycle Timers
     • Instruction schedulers
        • Simulator schedules each instruction when resources are available
        • Instructions are processed one at a time
        • Simpler, but less detailed
     • Cycle timers
        • Simulator tracks microarchitecture state each cycle
        • Simulator state == microarchitecture state
        • Perfect for microarchitecture simulation

  8. SimpleScalar Release 3.0
     • SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP
     • All simulators now support external I/O traces (EIO traces), which are generated with a new simulator (sim-eio)
     • Supports more platforms
     • Explicit fault support
     • And many more

  9. Simulator Suite
     • Sim-Fast: 300 lines, functional, 4+ MIPS
     • Sim-Safe: 350 lines, functional w/ checks
     • Sim-Profile: 900 lines, functional, lots of stats
     • Sim-Cache, Sim-Cheetah, Sim-BPred: < 1000 lines, functional, cache stats, branch stats
     • Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS
     [Figure axes: Performance and Detail. The simulators are listed from fastest to most detailed.]

  10. Sim-Fast • Functional simulation • Optimized for speed • Assumes no cache • Assumes no instruction checking • Does not support Dlite! • Does not allow command line arguments • <300 lines of code

  11. Sim-Safe • Functional simulation • Checks for instruction errors • Optimized for speed • Assumes no cache • Supports Dlite! • Does not allow command line arguments

  12. Sim-Cache
     • Cache simulation
     • Ideal for fast simulation of caches (when the effect of cache performance on execution time is not needed)
     • Accepts command line arguments for:
        • Level 1 & 2 instruction and data caches
        • TLB configuration (data and instruction)
        • Flush and compress
        • And more
     • Ideal for performing high-level cache studies that don’t take access time of the caches into account

  13. Sim-Cache (cont'd)
     • Generates one- and two-level cache hierarchy statistics and profiles
     • Extra options (also supported on sim-outorder):
        -cache:dl1 <config> - level 1 data cache configuration
        -cache:dl2 <config> - level 2 data cache configuration
        -cache:il1 <config> - level 1 instruction cache configuration
        -cache:il2 <config> - level 2 instruction cache configuration
        -tlb:dtlb <config> - data TLB configuration
        -tlb:itlb <config> - instruction TLB configuration
        -flush <config> - flush caches on system calls
        -icompress - remaps 64-bit inst addresses to 32-bit equiv.
        -pcstat <stat> - record statistic <stat> by text address

  14. Specifying Cache Configurations
     • All cache and TLB configurations are specified with the same format:
        <name>:<nsets>:<bsize>:<assoc>:<repl>
     • where:
        <name> - cache name (make this unique)
        <nsets> - number of sets
        <bsize> - block size in bytes (page size for TLBs)
        <assoc> - associativity (number of “ways”)
        <repl> - set replacement policy
           l - for LRU
           f - for FIFO
           r - for RANDOM
     • Examples:
        il1:1024:32:2:l - 2-way set-assoc 64k-byte cache, LRU
        dtlb:1:4096:64:r - 64-entry fully assoc TLB w/ 4k pages, random replacement
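
The two examples on this slide can be checked with a little arithmetic: capacity is sets × block size × associativity. The following is a minimal sketch of that calculation; `CacheConfig` and `parse_cache_config` are hypothetical helper names for illustration and are not part of SimpleScalar.

```python
from dataclasses import dataclass

# Replacement-policy letters as listed on the slide.
REPL_POLICIES = {"l": "LRU", "f": "FIFO", "r": "RANDOM"}


@dataclass
class CacheConfig:
    name: str
    nsets: int
    bsize: int   # block size in bytes (page size for a TLB)
    assoc: int
    repl: str

    @property
    def capacity_bytes(self) -> int:
        # Total bytes covered: sets * block size * ways.
        return self.nsets * self.bsize * self.assoc

    @property
    def entries(self) -> int:
        # Number of blocks (or TLB entries): sets * ways.
        return self.nsets * self.assoc


def parse_cache_config(spec: str) -> CacheConfig:
    """Parse '<name>:<nsets>:<bsize>:<assoc>:<repl>' (hypothetical helper)."""
    name, nsets, bsize, assoc, repl = spec.split(":")
    if repl not in REPL_POLICIES:
        raise ValueError(f"unknown replacement policy: {repl!r}")
    return CacheConfig(name, int(nsets), int(bsize), int(assoc), repl)


il1 = parse_cache_config("il1:1024:32:2:l")
dtlb = parse_cache_config("dtlb:1:4096:64:r")
print(il1.capacity_bytes // 1024, "KB,", REPL_POLICIES[il1.repl])  # 64 KB, LRU
print(dtlb.entries, "entries,", dtlb.bsize, "byte pages")          # 64 entries, 4096 byte pages
```

A single set with 64 ways, as in the `dtlb` example, is what makes the structure fully associative.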

  15. Sim-Bpred
     • Simulates different branch prediction mechanisms
     • Generates prediction hit and miss rate reports
     • Does not simulate the effect of branch prediction on total execution time
     • Predictor types:
        nottaken
        taken
        perfect
        bimod - bimodal predictor
        2lev - 2-level adaptive predictor
        comb - combined predictor (bimodal and 2-level)

  16. Sim-Profile • Program Profiler • Generates detailed profiles, by symbol and by address • Keeps track of and reports • Dynamic instruction counts • Instruction class counts • Branch class counts • Usage of address modes • Profiles of the text & data segment

  17. Sim-Outorder
     • Most complicated and detailed simulator
     • Supports out-of-order issue and execution
     • Provides reports on:
        • Branch prediction
        • Cache
        • External memory
        • Various configurations

  18. Sim-Outorder: Detailed Performance Simulator
     • Generates timing statistics for a detailed out-of-order issue processor core with two-level cache memory hierarchy and main memory
     • Extra options:
        -fetch:ifqsize <size> - instruction fetch queue size (in insts)
        -fetch:mplat <cycles> - extra branch mis-prediction latency (cycles)
        -bpred <type> - specify the branch predictor
        -decode:width <insts> - decoder bandwidth (insts/cycle)
        -issue:width <insts> - RUU issue bandwidth (insts/cycle)
        -issue:inorder - constrain instruction issue to program order
        -issue:wrongpath - permit instruction issue after mis-speculation
        -ruu:size <insts> - capacity of RUU (insts)
        -lsq:size <insts> - capacity of load/store queue (insts)
        -cache:dl1 <config> - level 1 data cache configuration
        -cache:dl1lat <cycles> - level 1 data cache hit latency

  19. Sim-Outorder: Detailed Performance Simulator (cont'd)
        -cache:dl2 <config> - level 2 data cache configuration
        -cache:dl2lat <cycles> - level 2 data cache hit latency
        -cache:il1 <config> - level 1 instruction cache configuration
        -cache:il1lat <cycles> - level 1 instruction cache hit latency
        -cache:il2 <config> - level 2 instruction cache configuration
        -cache:il2lat <cycles> - level 2 instruction cache hit latency
        -cache:flush - flush all caches on system calls
        -cache:icompress - remap 64-bit inst addresses to 32-bit equiv.
        -mem:lat <1st> <next> - specify memory access latency (first, rest)
        -mem:width - specify width of memory bus (in bytes)
        -tlb:itlb <config> - instruction TLB configuration
        -tlb:dtlb <config> - data TLB configuration
        -tlb:lat <cycles> - latency (in cycles) to service a TLB miss
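
One way to read `-mem:lat <1st> <next>` together with `-mem:width` is that a cache block moves over the memory bus in bus-width chunks: the first chunk costs `<1st>` cycles and each remaining chunk costs `<next>` cycles. The sketch below assumes that model. The function name and the sample numbers are illustrative and do not come from the slides.

```python
def mem_access_latency(block_bytes, bus_width_bytes, first_lat, next_lat):
    """Cycles to transfer one block, assuming a first-chunk/next-chunk model."""
    # Number of bus transfers needed, rounded up.
    chunks = (block_bytes + bus_width_bytes - 1) // bus_width_bytes
    return first_lat + next_lat * (chunks - 1)


# Illustrative values only: 32-byte block, 8-byte bus, latencies of 18 and 2 cycles.
latency = mem_access_latency(block_bytes=32, bus_width_bytes=8,
                             first_lat=18, next_lat=2)
print(latency)  # 24
```

Under this model, widening the bus reduces the number of chunks, so only the `<next>` component of the miss penalty shrinks.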

  20. Sim-Outorder: Detailed Performance Simulator (cont'd)
        -res:ialu - specify number of integer ALUs
        -res:imult - specify number of integer multiplier/dividers
        -res:memports - specify number of first-level cache ports
        -res:fpalu - specify number of FP ALUs
        -res:fpmult - specify number of FP multiplier/dividers
        -pcstat <stat> - record statistic <stat> by text address
        -ptrace <file> <range> - generate pipetrace

  21. Specifying the Branch Predictor
     • Specifying the branch predictor type:
        -bpred <type>
     • The supported predictor types are:
        nottaken - always predict not taken
        taken - always predict taken
        perfect - perfect predictor
        bimod - bimodal predictor (BTB w/ 2-bit counters)
        2lev - 2-level adaptive predictor
     • Configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
        -bpred:bimod <size> - size of direct-mapped BTB
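
To make the "2-bit counters" of the bimodal predictor concrete, here is a minimal sketch of a direct-mapped table of 2-bit saturating counters. It is not SimpleScalar's implementation: the class name, the `pc % size` indexing, and the initial counter value of 2 (weakly taken) are assumptions made for illustration.

```python
class BimodalPredictor:
    """Direct-mapped table of 2-bit saturating counters (illustrative sketch)."""

    def __init__(self, size):
        # Assumption: every counter starts at 2 (weakly taken).
        self.counters = [2] * size

    def _index(self, pc):
        # Assumption: index the table with the branch address modulo its size.
        return pc % len(self.counters)

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A loop branch: taken nine times, then not taken once, repeated three times.
predictor = BimodalPredictor(2048)
outcomes = ([True] * 9 + [False]) * 3
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x400100) != taken:
        mispredictions += 1
    predictor.update(0x400100, taken)
print(mispredictions, "of", len(outcomes))  # 3 of 30
```

The counter absorbs the single not-taken outcome at each loop exit without flipping its prediction, so only the exit itself is mispredicted.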

  22. Specifying the Branch Predictor (cont'd)
     • Configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
        -bpred:2lev <l1size> <l2size> <hist_size> <xor>
     • Configurations: N, M, W, X
        N: # entries in first level (# of shift registers)
        M: # entries in 2nd level (# of counters, or other FSM)
        W: width of shift register(s) (# of bits in each shift register)
        X: (yes-1/no-0) xor history and address for 2nd level index (we use 0 for this homework)
     • Sample predictors:
        GAg: 1, M, W, 0 where M = 2^W
        GAp: 1, M, W, 0 where M = C * 2^W, C is # of per-address prediction tables
        PAg: N, M, W, 0 where M = 2^W
        PAp: N, M, W, 0 where M = N * 2^W
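
The sample predictors above differ only in how N and M are derived from the history width W. The sketch below turns those four rules into a helper that returns the (N, M, W, X) tuple; `two_level_config` and its parameter names are hypothetical and exist only to illustrate the arithmetic.

```python
def two_level_config(scheme, hist_width, n_regs=1, n_tables=1):
    """Return (N, M, W, X) for a sample predictor, with X fixed to 0.

    n_regs   -- N, the number of first-level shift registers (PAg, PAp)
    n_tables -- C, the number of per-address prediction tables (GAp)
    """
    patterns = 1 << hist_width  # 2^W possible history patterns
    if scheme == "GAg":
        return (1, patterns, hist_width, 0)
    if scheme == "GAp":
        return (1, n_tables * patterns, hist_width, 0)
    if scheme == "PAg":
        return (n_regs, patterns, hist_width, 0)
    if scheme == "PAp":
        return (n_regs, n_regs * patterns, hist_width, 0)
    raise ValueError(f"unknown scheme: {scheme!r}")


# GAp with a 2-bit global history and 8 per-address prediction tables.
gap = two_level_config("GAp", hist_width=2, n_tables=8)
print("-bpred:2lev", *gap)  # -bpred:2lev 1 32 2 0
```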

  23. Performance Comparison of GAg, GAp, PAg and PAp
     • GAp: 1 global history register and 8 per-address prediction tables
     [Figure: (a) GAp, (b) (2,2) predictor. Labels: branch address, 2-bit per-branch predictors, 2-bit global branch history, prediction.]

  24. Hack the State Machine of the Branch Predictor!
     [Figure: two state-machine diagrams]
     (a) A3 (same as shown in the textbook)
     (b) A2 (original SimpleScalar implementation)

  25. Sim-Outorder HW Architecture
     [Figure: pipeline block diagram. Labels: Fetch, Dispatch, Register Scheduler, Memory Scheduler, Exe, Mem, Writeback, Commit, I-Cache, I-TLB, D-Cache, D-TLB, Virtual Memory.]

  26. Sim-Outorder (Main Loop)
     • sim_main() in sim-outorder.c:
        ruu_init();
        for (;;) {
          ruu_commit();
          ruu_writeback();
          lsq_refresh();
          ruu_issue();
          ruu_dispatch();
          ruu_fetch();
        }
     • The loop body is executed once for each simulated machine cycle
     • Walks the pipeline from Commit back to Fetch
     • Reverse traversal handles inter-stage latch synchronization in a single pass
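
Why walking the pipeline backwards needs only one pass can be seen in a toy model. In the sketch below each list slot stands for the latch feeding one stage. Updating the last stage first means every stage reads what its upstream neighbor produced in the previous cycle, so no second copy of the latches is needed. The `step` function and the three-stage pipeline are illustrative assumptions, not code from sim-outorder.c.

```python
def step(latches, fetched):
    """Advance a toy pipeline by one cycle, walking from the last stage to the first."""
    retired = latches[-1]
    # Reverse order: each slot is overwritten only after its old value
    # has already been consumed by the stage downstream of it.
    for i in range(len(latches) - 1, 0, -1):
        latches[i] = latches[i - 1]
    latches[0] = fetched
    return retired


latches = [None, None, None]  # three stages, all empty
program = ["i0", "i1", "i2", "i3", None, None, None]
retired = [step(latches, inst) for inst in program]
print(retired)  # [None, None, None, 'i0', 'i1', 'i2', 'i3']
```

Walking the same loop in the forward direction would copy a newly fetched instruction through every latch within one cycle, which is the problem the reverse order avoids.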

  27. Sim-Outorder (RUU/LSQ)
     • RUU (Register Update Unit)
        • Handles register synchronization/communication
        • Serves as reorder buffer and reservation stations
        • Performs out-of-order issue when register and memory dependences are satisfied
     • LSQ (Load/Store Queue)
        • Handles memory synchronization/communication
        • Contains all loads and stores in program order
     • Relationship between RUU and LSQ
        • Memory dependences are resolved by the LSQ
        • Load/store effective address is calculated in the RUU

  28. Sim-Outorder: Fetch
     • ruu_fetch()
     • Models machine fetch bandwidth
     • Fetches instructions from one I-cache/memory
     • Blocks until I-cache misses are resolved
     • Instructions are put into the instruction fetch queue, named fetch_data in sim-outorder.c (it is also called the dispatch queue in the paper)
     • Probes the branch predictor to obtain the cache line for the next cycle

  29. Sim-Outorder: Dispatch • ruu_dispatch() • Models instruction decoding and register renaming • Takes instructions from fetch_data • Decodes instructions • Enters and links instructions into RUU and LSQ • Splits memory operations into two separate instructions

  30. Sim-Outorder: Scheduler • lsq_refresh() • Models instruction selection, wakeup and issue • Separate schedulers track register and memory dependences. • Locates instructions with all register inputs ready and all memory inputs ready • Issue of ready loads is stalled if there is a store with unresolved effective address in LSQ. • If earlier store address matches load address, target value is forwarded to load.
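
The two load rules on this slide (stall behind an unresolved store address, forward from a matching earlier store) can be sketched as a small decision function. This is a simplification for illustration: `MemOp` and `schedule_load` are hypothetical names, and the sketch assumes a store's value is ready whenever its address is known.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    is_store: bool
    addr: Optional[int]          # None means the effective address is not yet known
    value: Optional[int] = None  # store data (assumed ready when addr is known)


def schedule_load(lsq, load_idx):
    """Decide what happens to the load at lsq[load_idx]; lsq is in program order."""
    load = lsq[load_idx]
    earlier_stores = [op for op in lsq[:load_idx] if op.is_store]
    # Rule 1: stall if the load or any earlier store lacks an effective address.
    if load.addr is None or any(s.addr is None for s in earlier_stores):
        return ("stall", None)
    # Rule 2: forward from the youngest earlier store to the same address.
    for store in reversed(earlier_stores):
        if store.addr == load.addr:
            return ("forward", store.value)
    # Otherwise the load goes to the D-cache.
    return ("cache", None)


unresolved = [MemOp(True, 0x100, 7), MemOp(True, None), MemOp(False, 0x100)]
matching = [MemOp(True, 0x100, 1), MemOp(True, 0x100, 2), MemOp(False, 0x100)]
independent = [MemOp(True, 0x200, 5), MemOp(False, 0x100)]
print(schedule_load(unresolved, 2))   # ('stall', None)
print(schedule_load(matching, 2))     # ('forward', 2)
print(schedule_load(independent, 1))  # ('cache', None)
```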

  31. Sim-Outorder: Execute
     • ruu_issue()
     • Models functional units, D-cache issue and execute latencies
     • Gets instructions that are ready
     • Reserves a free functional unit
     • Schedules writeback events using the latency of the functional unit
     • Latencies are hardcoded in fu_config[] in sim-outorder.c
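
The reserve-then-schedule pattern described here can be sketched with a priority queue of writeback events keyed by completion cycle. The latency table, the unit names, and the `issue` function below are hypothetical; they do not reproduce the contents of fu_config[], and the sketch does not model when a unit becomes free again.

```python
import heapq

# Hypothetical latencies in cycles; the real values live in fu_config[].
FU_LATENCY = {"ialu": 1, "imult": 3, "fpalu": 2}


def issue(ready, free_units, now, event_queue):
    """Issue ready instructions for which a functional unit is free.

    ready       -- list of (inst_id, fu_class) in priority order
    free_units  -- dict mapping fu_class to the number of free units this cycle
    event_queue -- heap of (completion_cycle, inst_id) writeback events
    """
    issued = []
    for inst_id, fu_class in ready:
        if free_units.get(fu_class, 0) > 0:
            free_units[fu_class] -= 1  # reserve the unit
            heapq.heappush(event_queue, (now + FU_LATENCY[fu_class], inst_id))
            issued.append(inst_id)
    return issued


events = []
free = {"ialu": 1, "imult": 1}
issued = issue([(0, "ialu"), (1, "imult"), (2, "ialu")], free, now=10, event_queue=events)
print(issued)                 # [0, 1]
print(heapq.heappop(events))  # (11, 0)
print(heapq.heappop(events))  # (13, 1)
```

Instruction 2 is ready but stays unissued because the only integer ALU was already reserved in this cycle.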

  32. Sim-Outorder: Writeback
     • ruu_writeback()
     • Models writeback bandwidth, detects mis-predictions, initiates mis-prediction recovery sequence
     • Gets instructions that have finished execution (specified in the event queue)
     • Wakes up instructions that depend on the completed instruction, following the dependence chains of the instruction's output
     • Detects branch mis-prediction and rolls state back to checkpoint

  33. Sim-Outorder: Commit
     • ruu_commit()
     • Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
     • While head of RUU/LSQ is ready to commit:
        • D-TLB miss handling
        • Retire store to D-cache
        • Update register file and rename table
        • Reclaim RUU/LSQ resources
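
In-order retirement means only the head of the queue may leave, even if younger instructions have already completed. The sketch below shows that loop on a toy RUU. The `commit` function, the dictionary layout, and the `commit_width` limit are assumptions for illustration and omit store commits and D-TLB handling.

```python
from collections import deque


def commit(ruu, commit_width):
    """Retire completed instructions from the head of the RUU, in program order."""
    retired = []
    # Stop at the first incomplete instruction, even if later ones are done.
    while ruu and len(retired) < commit_width and ruu[0]["completed"]:
        retired.append(ruu.popleft()["id"])  # reclaim the RUU entry
    return retired


ruu = deque([
    {"id": 0, "completed": True},
    {"id": 1, "completed": False},  # still executing, blocks everything behind it
    {"id": 2, "completed": True},
])
first = commit(ruu, commit_width=4)
ruu[0]["completed"] = True          # instruction 1 finishes in a later cycle
second = commit(ruu, commit_width=4)
print(first, second)  # [0] [1, 2]
```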

  34. Sim-Outorder: Processor Core and Other Specifications
     • Instruction fetch, decode and issue bandwidth
     • Capacity of RUU and LSQ
     • Branch mis-prediction latency
     • Number of functional units
        • Integer ALU, integer multipliers/dividers
        • FP ALU, FP multipliers/dividers
     • Latency of I-cache/D-cache, memory and TLB
     • Record statistic by text address

  35. Global Options
     • These are supported on most simulators:
        -h - print help message
        -d - enable debug message
        -i - start up in Dlite! debugger
        -q - quit immediately (use with -dumpconfig)
        -config <file> - read config parameters from <file>
        -dumpconfig <file> - save config parameters into <file>

  36. How to get help from us • Drop by during TA’s office hour • E-Mail khkim@cse.tamu.edu
