
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors

Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic. Parallel Computing Lab, EECS, UC Berkeley. March 2010.



  1. RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors • Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic • Parallel Computing Lab, EECS, UC Berkeley • March 2010

  2. Outline • Overview • RAMP Gold HW Architecture and Implementation • RAMP Gold Software Infrastructure • Usage Case and Live Demo • Future work

  3. Overview • Purpose of RAMP Gold • An FPGA-based simulator of a shared-memory multicore target for the Par Lab • Usage case: architecture, OS, and applications • Highlights of RAMP Gold • Runs on a $750 Xilinx XUP V5 board • Written in SystemVerilog; needs no special CAD tools and works with standard FPGA CAD flows (Synplify/ISE/ModelSim) • Two orders of magnitude faster than Simics+GEMS • Runtime-configurable parameters without resynthesis • Full RTL verification environment and software infrastructure • BSD and GNU licenses

  4. Simulation Jargon • Target vs. Host • Target: the system/architecture being simulated, e.g. a SPARC v8 CMP • Host: the platform on which the simulator runs, e.g. FPGAs • Functional model and timing model • Functional: computes the result of each instruction • Timing: determines how long each instruction takes on the target

  5. RAMP Gold Overall Setup • Both functional and timing models on FPGA • App server: control and service syscall/IO

  6. Target Machine Template • 64-core SPARC v8 shared-memory machine • Configurable two-level cache + multichannel DRAM

  7. RAMP Gold Performance vs. Simics • PARSEC parallel benchmarks running on a research OS • >250x faster than a full-system simulator for a 64-core multiprocessor target

  8. Outline • Overview • RAMP Gold HW Architecture and Implementation • RAMP Gold Software Infrastructure • Usage Case and Live Demo • Future work

  9. RAMP Gold Model Key Concepts • Decoupled functional/timing model, both in hardware • Enables many FPGA-fabric-friendly optimizations • Increases modeling efficiency and module reuse • Host multithreading of both functional and timing models • Hides emulation latencies and improves resource utilization • Time-multiplexing effects are patched by the timing model • [Figure: functional model pipeline with architectural state, alongside timing model pipeline with timing state]

  10. Host multithreading • Example: simulating four independent CPUs • [Figure: target model of four CPUs (CPU0 to CPU3) mapped onto one functional CPU model on the FPGA; a thread-select counter (+1) steps through per-thread PCs and GPR files that share a single I$, decode, ALU and D$ pipeline]
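The host-multithreading idea on this slide can be illustrated in software. The following is a minimal Python sketch, not the RAMP Gold RTL: `TargetCPU`, `step` and `run_host_pipeline` are hypothetical names, the "instruction" is a toy, and strict round-robin thread selection is an assumption suggested by the "+1 / Thread Select" labels in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class TargetCPU:
    """Architectural state of one simulated (target) CPU."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)


def step(cpu: TargetCPU) -> None:
    """Toy 'instruction': bump r0 and advance the PC by one 4-byte word."""
    cpu.regs[0] += 1
    cpu.pc += 4


def run_host_pipeline(cpus, host_cycles):
    """Issue one target instruction per host cycle, round-robin over threads."""
    schedule = []
    for cycle in range(host_cycles):
        tid = cycle % len(cpus)  # assumed "+1" thread-select counter
        step(cpus[tid])          # one shared pipeline, per-thread state
        schedule.append(tid)
    return schedule


cpus = [TargetCPU() for _ in range(4)]
order = run_host_pipeline(cpus, host_cycles=8)
print(order)                 # thread IDs in issue order
print([c.pc for c in cpus])  # every target CPU advanced by the same amount
```

Only the per-thread state (PC and registers) is replicated; the execution logic is shared, which is why one host pipeline can stand in for four target CPUs.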

  11. Functional Model • Full SPARC v8 support (FP, MMU, I/O) • Passes the SPARC v8 certification tests • Runs Linux and a research OS

  12. Timing Model • Simple CPU timing but a detailed memory timing model (i.e., every instruction takes 1 cycle except LD/ST) • Cache models: only tags are stored in BRAMs • Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc. • Models the three Cs of cache misses (compulsory, capacity, conflict) but not the fourth C, coherence (coherence support coming soon) • DRAM model: bandwidth-delay pipe with optional QoS
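A cache timing model that stores only tags needs to answer one question per access: hit or miss, and therefore how many target cycles. The sketch below shows that idea in Python under stated assumptions: the class name `TagOnlyCache`, the LRU replacement policy, and the latency values are all illustrative choices, not details taken from RAMP Gold.

```python
class TagOnlyCache:
    """Set-associative cache timing model that keeps tags but no data."""

    def __init__(self, size_bytes, line_bytes, ways, hit_latency, miss_latency):
        # All parameters are plain fields, so they can change between runs
        # without rebuilding anything (mirroring "runtime configurable").
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        # One tag list per set, ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, paddr):
        """Return the modeled latency (in target cycles) of one access."""
        line = paddr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)      # mark as most recently used
            return self.hit_latency
        if len(tags) == self.ways:
            tags.pop(0)           # evict the LRU tag (assumed policy)
        tags.append(tag)
        return self.miss_latency


cache = TagOnlyCache(size_bytes=1024, line_bytes=32, ways=2,
                     hit_latency=1, miss_latency=20)
latencies = [cache.access(a) for a in (0x0, 0x4, 0x200, 0x400, 0x0)]
print(latencies)
```

In this run the second access hits the line fetched by the first, while the last access to `0x0` misses again because two other lines mapping to the same set evicted it.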

  13. Debugging and Simulation Configuration • Frontend app server • Reliable Gigabit Ethernet connection to the FPGA • Periodically polls the simulator to serve I/O requests • Transparent to the target (no side effects on simulated timing) • 64-bit hardware performance counters to collect runtime stats • 657 counters in the timing model + 10 host counters • Can be read by either target apps or the app server • Ring interconnect for counters (counters are easy to add and remove)

  14. Host Performance • Timing synchronization is the largest overhead • Tiny host $/TLBs are not on the performance-critical path • Host DRAM bandwidth is not a problem (<15% utilization)

  15. Implementation • Single FPGA: 64 cores @ 90 MHz, 2 GB DDR2 SODIMM • ~2 hours CAD turnaround time on a mid-range workstation • BRAM-bound, but logic resources remain to fit more pipelines

  16. Outline • Overview • RAMP Gold HW Architecture and Implementation • RAMP Gold Software Infrastructure • Usage Case and Live Demo • Future work

  17. Software Tools • SPARC cross compiler with binutils/gcc/glibc • Supports most POSIX programs • Static & dynamic linking support • Built from GNU GCC (4.3.2) • Full software and HW debugging suite • Low-cost XUP boards sometimes do not work out of the box • FPGA CAD tools are very bad

  18. Target Software • Proxy Kernel: single-protection-domain application host • Runs programs statically linked against glibc • Forwards I/O system calls to the x86/Linux host PC • Presents a simple “hard-threads” API for multithreaded programs • Very easy to modify • ROS: UCB’s manycore research OS • Provides multiprogramming support • Sufficiently POSIX-compliant to run many programs • Much easier to modify than Linux • Runs on more than 64 cores

  19. Infrastructure

  20. Case studies • Parallel application studies for software programmers • Parallel OS for system researchers • Adding hardware performance counters for advanced debugging • Micro-architecture studies: adding features and modifying existing timing models • Adding new instructions: changing the functional model

  21. Appserver 101
      • Appserver command-line options:
          Usage: sparc_app [-f<conf>] [-p<nprocs>] [-s] <htif> <kernel> [binary] [args]
      • Platform memory test:
      • App server memory test:
          sparc_app -p64 hw memtest none
      • Proxykernel memory test (stress test):
          sparc_app -p64 hw path/kernel.ramp path/memtest

  22. For application programmers
      • Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
      • Use ‘hard-threads’ to write a parallel ‘hello world’ program running on the proxykernel
      • Compile the program using the cross toolchain:
          sparc-ros-gcc -o hello hello.cpp -lhart
      • Measure performance using performance counters:
          sparc_app -s1 -p64 hw kernel.ramp hello
      • Change the target machine configuration on the fly and rerun the experiment: edit the file ‘appserver.conf’

  23. For OS Developers
      • Usage model similar to that for application programmers
      • Proxykernel is a good starting point for learning the bootstrapping process
      • ROS is a fully functional kernel
      • Demo: boot the ROS kernel using the appserver:
          sparc_app -p64 -fappserver_ros.conf hw your_kernel none

  24. Adding Hardware Performance Counters
      • Two types of counter interface
        • Global counter: <EN>
        • Local (per-core) counter: <TID, EN>
      • Modify the Verilog file to add more counters on the ring:
          perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global))
            gen_tm_counter(.gclk, .rst,
                           .bus_out(io_out), .bus_in(io_in), .bus_sel(), // IO bus interface
                           .global_inc(global_counter_inc),
                           .local_inc(local_counter_inc),
                           .local_tid(local_counter_tid));
      • Modify the app server to support more counters:
        • Add your counter definition in ‘TestAppServer/perfcnt.h’
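The two counter interfaces can be modeled in software to make the difference concrete: a global counter needs only an enable, while a local counter also needs the thread ID that says whose count to bump. The Python sketch below is a behavioral stand-in, not the `perfctr_io` module; the class name `PerfCounters` and the one-enable-per-cycle semantics are assumptions, and the argument names only echo the port names in the instantiation above.

```python
MASK64 = (1 << 64) - 1  # the deck describes 64-bit counters


class PerfCounters:
    """Software stand-in for a bank of global and per-thread counters."""

    def __init__(self, n_global, n_local, n_threads):
        self.global_counts = [0] * n_global
        # One row per local counter, one column per target thread.
        self.local_counts = [[0] * n_threads for _ in range(n_local)]

    def tick(self, global_inc, local_inc, local_tid):
        """Apply one cycle of increment signals.

        global_inc[i]              -- <EN> for global counter i
        local_inc[j], local_tid[j] -- <TID, EN> for local counter j
        """
        for i, en in enumerate(global_inc):
            if en:
                self.global_counts[i] = (self.global_counts[i] + 1) & MASK64
        for j, en in enumerate(local_inc):
            if en:
                tid = local_tid[j]
                self.local_counts[j][tid] = (self.local_counts[j][tid] + 1) & MASK64


ctrs = PerfCounters(n_global=1, n_local=1, n_threads=4)
ctrs.tick(global_inc=[1], local_inc=[1], local_tid=[2])
ctrs.tick(global_inc=[1], local_inc=[0], local_tid=[2])
ctrs.tick(global_inc=[0], local_inc=[1], local_tid=[3])
print(ctrs.global_counts)    # events counted machine-wide
print(ctrs.local_counts[0])  # the same kind of event, split per thread
```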

  25. Adding Features to Timing Models • Timing models are much simpler than functional models • ~1,000 LoC vs. 35,000 LoC • Example 1: Changing the cache replacement policy • Example 2: Adding memory QoS • Lee et al., “Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks”, ISCA’08 • ~100 lines of code added to the timing model • A new DRAM model • Several memory-mapped registers added on the functional I/O bus for configuration
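As background for the DRAM-side changes, the Timing Model slide describes the baseline DRAM model as a bandwidth-delay pipe. One common reading of that term, sketched below in Python, is a fixed access latency plus a limit on how often a new request may enter the pipe. The class name `BandwidthDelayPipe`, its parameters, and the numbers are illustrative assumptions, and the sketch deliberately omits the QoS extension.

```python
class BandwidthDelayPipe:
    """DRAM timing as a fixed delay plus a request-rate (bandwidth) limit."""

    def __init__(self, latency, cycles_per_request):
        self.latency = latency                        # delay of one access
        self.cycles_per_request = cycles_per_request  # inverse bandwidth
        self.next_free = 0                            # earliest next issue slot

    def issue(self, arrival_cycle):
        """Return the target cycle at which this request completes."""
        start = max(arrival_cycle, self.next_free)    # wait for a free slot
        self.next_free = start + self.cycles_per_request
        return start + self.latency


dram = BandwidthDelayPipe(latency=100, cycles_per_request=4)
done = [dram.issue(t) for t in (0, 0, 1, 50)]
print(done)
```

Back-to-back requests queue behind one another and finish 4 cycles apart, while the request arriving at cycle 50 finds the pipe idle and sees only the fixed latency.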

  26. Adding New Instructions • Adding instructions to a feed-through pipeline is straightforward • FPU instructions were added as “new” instructions within a week • Including: new register file, decode, exception/commit, and microcode • Example: adding new atomic instructions through microcode • 4 global scratchpad registers (not visible to the programmer) in the main integer register file for temporary storage • Two write ports to support scratchpad register updates alongside architectural register changes

  27. Steps of Adding Instructions
      • Add proper decoding logic in function “decode_dsp_add_logic” of “regacc_dma.sv”
      • Update the writeback/exception stage in file “exception_dma.sv” to trap to microcode
      • Edit function “decode_microcode_mode” to trap to microcode
      • Edit function “rd_gen” to write the address to scratch register 0 and load the data to scratch register 1
      • Edit the microcode ROM ‘Microcode.sv’:
          //----------SWAP*-------
          9: begin
            uco.uend    = '0;
            uco.cwp_rs1 = '0;
            uco.cwp_rd  = '0;
            uco.inst    = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
          end
          10: begin
            uco.uend    = '1;
            uco.cwp_rs1 = '0;
            uco.cwp_rd  = '0;
            uco.inst    = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
          end

  28. Future work • Cache coherence models (soon) • Realistic interconnect model (soon) • Better CPU core model (next major version) • Support for other ISAs (next major version)

  29. Further References • Research papers • Usage case: A Case for FAME: FPGA Architecture Model Execution, ISCA’10 • RAMP Gold design: RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors, DAC’10 • Beta release http://sites.google.com/site/rampgold

  30. Backup Slides

  31. Functional/Timing Model Interface
      // FM -> TM
      typedef struct {
        bit        valid;    // timing token between FM and TM
        bit [5:0]  tid;      // thread ID
        bit        run;      // CPU state
        bit        replay;   // this instruction needs to be replayed by the FM
        bit        retired;  // retiring an instruction
        bit [31:0] inst;     // the instruction that was retired
        bit [31:0] paddr;    // load/store physical address
        bit [31:0] npc;      // PC of the next fetched instruction
      } tm_cpu_ctrl_token_type;

      // TM -> FM
      typedef struct {
        bit        valid;    // timing token between FM and TM
        bit [5:0]  tid;      // thread ID
        bit        run;      // run bit
      } tm2cpu_token_type;
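To show how a timing model might consume the FM -> TM token, here is a minimal Python sketch that mirrors the fields of `tm_cpu_ctrl_token_type` in a dataclass and charges target cycles per retired instruction. It follows the rule stated on the Timing Model slide (every instruction takes 1 cycle except LD/ST); the fixed `mem_latency`, the helper names, and the example instruction words are assumptions for illustration. The load/store test relies on the SPARC v8 encoding, in which memory instructions have `op` (bits 31:30) equal to 3.

```python
from dataclasses import dataclass


@dataclass
class TmCpuCtrlToken:
    """Python mirror of the FM -> TM token fields shown above."""
    valid: bool
    tid: int
    run: bool
    replay: bool
    retired: bool
    inst: int
    paddr: int
    npc: int


def is_load_store(inst):
    """SPARC v8 memory instructions have op (bits 31:30) == 3."""
    return (inst >> 30) & 0x3 == 3


def target_cycles(tokens, mem_latency):
    """Accumulate target cycles per thread from retired-instruction tokens."""
    cycles = {}
    for tok in tokens:
        if not (tok.valid and tok.retired):
            continue  # nothing retired, so no target time is charged
        cost = mem_latency if is_load_store(tok.inst) else 1
        cycles[tok.tid] = cycles.get(tok.tid, 0) + cost
    return cycles


ALU_INST = 0x86004002  # op = 2 (arithmetic/logical format)
MEM_INST = 0xC4004000  # op = 3 (memory format)

tokens = [
    TmCpuCtrlToken(True, 0, True, False, True, ALU_INST, 0, 4),
    TmCpuCtrlToken(True, 0, True, False, True, MEM_INST, 0x1000, 8),
    TmCpuCtrlToken(True, 1, True, False, True, ALU_INST, 0, 4),
    TmCpuCtrlToken(True, 1, True, True, False, MEM_INST, 0x2000, 4),  # replayed
]
per_thread = target_cycles(tokens, mem_latency=10)
print(per_thread)
```

A fuller model would replace the fixed `mem_latency` with a lookup in cache and DRAM timing models keyed by `paddr`.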
