
HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing


Presentation Transcript


1. HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer

2. Simulating Multicores
• Simulating an N-core multicore target is fundamentally N times the work, plus the on-chip network.
[Figure: N CPUs connected by an on-chip network]
• Duplicating cores will quickly fill the FPGA; going multi-FPGA will slow simulation.

3. Trading Time for Space
• We can leverage the separation of the model clock and the FPGA clock to save space.
• Two techniques: serialization and time-multiplexing.
• But doesn't this just slow down our simulator? The tradeoff is a good idea if we can:
  • save a lot of space,
  • improve the FPGA critical path,
  • improve utilization, and
  • slow down rare events while keeping common events fast.
• The LI approach enables a wide range of tradeoff options.

  4. Serialization: A First Tradeoff

5. Example Tradeoff: Multi-Port Register File
• 2 read ports, 2 write ports
• 5-bit index, 32-bit data
• Reads take zero clock cycles
• On a Virtex-2 Pro FPGA: 9242 slices (>25%), 104 MHz
[Figure: 2R/2W register file with ports rd addr 1/2, rd val 1/2, wr addr 1/2, wr val 1/2]

6. Trading Time for Space
• Simulate the circuit sequentially using a BlockRAM: 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x).
• Simulation rate is 224 / 3 = 75 MHz: an FPGA-cycle to Model-cycle Ratio (FMR) of 3.
[Figure: 1R/1W BlockRAM plus an FSM standing in for the 2R/2W register file]
• Each module may have a different FMR.
• A-Ports allow us to connect many such modules together while maintaining a consistent notion of model time. A sketch of the serialized register file follows.
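The serialization on slide 6 can be illustrated with a minimal Python sketch (an illustration only, not HAsim's Bluespec code); the exact three-step schedule below is an assumption:

```python
# Sketch: simulate a 2R/2W register file over a memory with one read
# port and one write port, taking 3 FPGA cycles per model cycle (FMR = 3).

class SerializedRegFile:
    def __init__(self, entries=32):
        self.mem = [0] * entries          # stands in for the 1R/1W BlockRAM

    def model_cycle(self, rd_addr1, rd_addr2, wr1, wr2):
        """One model cycle: two reads, then two writes (wrX = (addr, val)).

        Both reads are scheduled before either write, preserving the
        target's zero-cycle read semantics."""
        val1 = self.mem[rd_addr1]         # FPGA cycle 1: read port 1
        val2 = self.mem[rd_addr2]         # FPGA cycle 2: read port 2 ...
        a, v = wr1
        self.mem[a] = v                   # ... overlapped with write port 1
        a, v = wr2
        self.mem[a] = v                   # FPGA cycle 3: write port 2
        return val1, val2
```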

7. Example: In-Order Front End
• Modules may simulate at any wall-clock rate.
• Corollary: adjacent modules may not be simulating the same model cycle.
[Figure: front-end pipeline (PC Resolve, Line Pred, Branch Pred, ITLB, IMEM/I$, Inst Q) connected by A-Ports (redirect, training, fault, pred, enq/deq, etc.) with latencies of 0, 1, or 2; a legend marks whether each module is ready to simulate]

8. Simulator "Slip"
• Adjacent modules may be simulating different model cycles!
• In the paper: a distributed resynchronization scheme (slide 35).
• This can speed up simulation. Case study: achieved 17% better performance than a centralized controller.
• Performance can approach the dynamic average. Let's see how...
[Figure: FET and DEC in lockstep vs. slipped, connected by 1-cycle ports]

9. Traditional Software Simulation
[Figure: simulation timeline; legend: one box = one model cycle]

10. Global Controller "Barrier" Synchronization
[Figure: simulation timeline; legend: one box = one model cycle]

11. A-Ports Distributed Synchronization
• Modules run ahead in time until port buffering fills.
• Long-running operations can overlap, even if they fall on different clock cycles.
• Takeaway: LI makes serialization tradeoffs more appealing.

12. Leveraging Latency-Insensitivity [with Parashar, Adler]
• Modeling large caches: the 256 KB L2$ model's cache controller is backed by a LEAP scratchpad, a hierarchy of BRAM (KBs, 1 FPGA cycle), LEAP SRAM (MBs, 10s of cycles), and system memory (GBs, 100s of cycles).
• Expensive instructions: the EXE stage calls out to the LEAP FPU and an instruction emulator (M5) on the CPU via RRR.

13. Time-Multiplexing: A Tradeoff to Scale Multicores (resume at 3:45)

14. Multicores Revisited
• What if we duplicate the cores?
[Figure: CORE 0, CORE 1, CORE 2, each with its own state]
• Benefits: simple to describe; maximum parallelism.
• Drawbacks: probably won't fit; low utilization of functional units.

15. Module Utilization
• A module is unutilized on an FPGA cycle if it is:
  • waiting for all input ports to be non-empty, or
  • waiting for all output ports to be non-full.
• Case study: in-order functional units were utilized on only 13% of FPGA cycles on average. A sketch of the readiness test follows.
[Figure: FET and DEC connected by 1-cycle ports]
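A minimal sketch of this readiness condition (the Port interface here is a hypothetical stand-in, not HAsim's actual API):

```python
# Sketch: a module is utilized on an FPGA cycle only when every incoming
# port has data and every outgoing port has space.

def can_simulate(module):
    inputs_ready = all(not p.empty() for p in module.in_ports)
    outputs_ready = all(not p.full() for p in module.out_ports)
    return inputs_ready and outputs_ready

def utilization(fired):
    """Fraction of FPGA cycles on which the module fired, given a
    per-cycle boolean trace (e.g., 0.13 for the case study above)."""
    return sum(fired) / len(fired)
```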

16. Time-Multiplexing: First Approach
• Duplicate the state, sequentially share the logic.
[Figure: one physical pipeline fed by the state of several virtual instances]
• Benefits: better unit utilization.
• Drawbacks: more expensive than duplication(!), since selecting among the virtual instances adds multiplexors.

17. Round-Robin Time-Multiplexing
• Fix the instance ordering and remove the multiplexors.
[Figure: physical pipeline with per-instance state accessed in round-robin order]
• Benefits: much better area; good unit utilization.
• Drawbacks: head-of-line blocking may limit performance.
• Need to limit the impact of slow events: pipeline at a fine granularity.
• Need a distributed, controller-free mechanism to coordinate...

18. Port-Based Time-Multiplexing
• Duplicate the local state in each module.
• Change the port implementation:
  • minimum buffering: N * latency + 1
  • initialize each FIFO with N * latency tokens
• Result: adjacent modules can simultaneously simulate different virtual instances, as in the sketch below.
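A hedged Python sketch of such a multiplexed port (names and structure are my illustration, not HAsim's implementation):

```python
from collections import deque

class MultiplexedPort:
    """A-Port shared by N virtual instances with a given model latency.

    Buffering of N*latency + 1 slots and N*latency initial tokens let
    the consumer start before the producer, so adjacent modules can
    simulate different virtual instances at the same time."""

    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1
        # Pre-fill with N*latency "no message" tokens.
        self.fifo = deque([None] * (n_instances * latency))

    def full(self):
        return len(self.fifo) == self.capacity

    def empty(self):
        return len(self.fifo) == 0

    def send(self, msg):       # producer: one message per model cycle
        assert not self.full()
        self.fifo.append(msg)

    def receive(self):         # consumer: one message per model cycle
        assert not self.empty()
        return self.fifo.popleft()
```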

19. The Front End, Multiplexed
[Figure: the in-order front end from slide 7, time-multiplexed; the legend now tracks which virtual instance (CPU 1 or CPU 2) each module is ready to simulate]

  20. On-Chip Networks in a Time-Multiplexed World

21. Problem: On-Chip Network
[Figure: CPU 0..2 with L1/L2 caches and a memory controller, connected by routers (r) exchanging msg and credit channels]
• Problem: routing wires to/from each router, similar to the "global controller" scheme.
• Also, utilization is low.

22. Multiplexing On-Chip Network Routers
• Simulate the network without a network: collapse Routers 0-3 into one multiplexed Router 0..3.
• Each inter-router link becomes a reordering (permutation) port between virtual instances: σ(x) = (x + 1) mod 4, σ(x) = (x + 2) mod 4, σ(x) = (x + 3) mod 4.
• A sketch of such a permutation port follows.
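A sketch of the reordering step (illustrative Python, not HAsim code): messages leave the multiplexed router in instance order 0..N-1, and the permutation decides which instance sees each message on the next pass.

```python
def make_sigma(offset, n):
    """Permutation for a regular link, e.g. sigma(x) = (x + 1) mod n."""
    return lambda x: (x + offset) % n

def permute(msgs, sigma):
    """Reorder one pass of N messages: slot sigma(x) receives the
    message that virtual instance x sent."""
    out = [None] * len(msgs)
    for x, m in enumerate(msgs):
        out[sigma(x)] = m
    return out

# Example: one link of a 4-node network, sigma(x) = (x + 1) mod 4.
sigma = make_sigma(1, 4)
print(permute(["m0", "m1", "m2", "m3"], sigma))
# ['m3', 'm0', 'm1', 'm2']: instance 1 now sees instance 0's message
```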

23. Ring/Double-Ring Topology, Multiplexed
• Routers 0-3 in a ring ("from prev" in, "to next" out) collapse to Router 0..3 with the permutation σ(x) = (x + 1) mod 4 fed back to itself.
• Opposite direction (double ring): flip to/from.

24. Implementing Permutations on FPGAs Efficiently
• Side buffer: fits networks like ring/torus (e.g., σ(x) = (x + 1) mod N); handles moves like "first to Nth", "Nth to first", "every K to N-K".
• Indirection table (a permutation table in RAM driving a reorder buffer): more general, but more expensive.
• A sketch of the side-buffer approach follows.
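For rotations, the side buffer reduces to a small delay line, since a shift by K across consecutive passes just holds the last K messages over to the next pass. A sketch under that assumption (illustrative Python, not HAsim's circuit):

```python
from collections import deque

def rotation_port(k=1):
    """Side-buffer implementation of sigma(x) = (x + k) mod N as a
    k-deep delay line over the message stream, pre-filled with
    NoMessage (None) tokens."""
    side = deque([None] * k)
    def step(msg):
        side.append(msg)
        return side.popleft()
    return step

# Example: one pass of a 4-node ring, sigma(x) = (x + 1) mod 4.
step = rotation_port(1)
print([step(m) for m in ["m0", "m1", "m2", "m3"]])
# [None, 'm0', 'm1', 'm2']: 'm3' is held for slot 0 of the next pass
```

A general σ would instead buffer a full pass and read it back through a stored permutation table, which is the indirection-table option above.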

25. Torus/Mesh Topology, Multiplexed
• Mesh: don't transmit on non-existent links.

26. Dealing with Heterogeneous Networks
• Compose "mux ports" with permutation ports.
• In the paper: generalized to any topology.

  27. Putting It All Together

28. Typical HAsim Model Leveraging These Techniques
Target:
• 16-core chip multiprocessor
• 10-stage pipeline (speculative, bypassed)
• 64-bit Alpha ISA, floating point
• 8 KB lockup-free L1 caches
• 256 KB 4-way set-associative L2 cache
• Network: 2 virtual channels, 4 slots, x-y wormhole routing
Implementation:
• Single detailed pipeline, 16-way time-multiplexed
• 64-bit Alpha functional partition, floating point
• Caches modeled with a different (FPGA-side) cache hierarchy
• Single multiplexed router, 4 permutations
[Figure: pipeline stages F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C, with ITLB/I$, DTLB/D$, L/S Q, L2$, and Route]

29-32. Time-Multiplexed Multicore Simulation Rate Scaling
[Figure: simulation-rate scaling results, built up across four slides]

33. Takeaways
• The latency-insensitive approach provides a unified way to make interesting tradeoffs.
• Serialization: leverage FPGA-efficient circuits at the cost of FMR.
  • A-Port-based synchronization can amortize the cost by giving the dynamic average, especially if long events are rare.
• Time-multiplexing: reuse datapaths and duplicate only the state.
  • The A-Port-based approach means not all modules are fully utilized.
  • Increased utilization means the performance degradation is sublinear.
• Time-multiplexing the on-chip network requires permutations.

34. Next Steps
• Here we were able to push one FPGA to its limits. What if we want to scale farther?
• Next, we'll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques.
• Also: how we can increase designer productivity by abstracting the platform.

35. Resynchronizing Ports
• Modules follow a modified scheme: if any incoming port is heavy, or any outgoing port is light, simulate the next cycle (when ready); see the sketch below.
• Result: balanced simulation without centralized coordination.
• Argument:
  • The modules farthest ahead in time will never proceed: ports into that set will be light and ports out of it heavy, so those modules will try to proceed but may not be able to.
  • There is also a set of modules farthest behind in time, and it is always able to proceed.
  • Since the graph is connected, simulating only enables more modules, so the scheme makes progress toward quiescence.
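A sketch of that rule (the Port fields and the heavy/light thresholds are assumptions for illustration; the paper defines them precisely):

```python
# Sketch: distributed resynchronization. A port is "heavy" when it holds
# more tokens than its balanced occupancy and "light" when it holds
# fewer; half of capacity is an assumed balance point.

def is_heavy(port):
    return port.occupancy > port.capacity // 2

def is_light(port):
    return port.occupancy < port.capacity // 2

def should_simulate(module):
    """Modified scheme: simulate the next model cycle (when ready) if any
    incoming port is heavy or any outgoing port is light."""
    return (any(is_heavy(p) for p in module.in_ports) or
            any(is_light(p) for p in module.out_ports))
```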

36. Other Topologies
• Tree
• Butterfly

37. Generalizing OCN Permutations
• Represent the model as a directed graph G = (M, P).
• Label the modules M with a simulation order 0..(N-1).
• Partition the ports into sets P_0..P_m where:
  • no two ports in a set P_m share a source, and
  • no two ports in a set P_m share a destination.
• Transform each P_m into a permutation σ_m: for all (s, d) in P_m, σ_m(s) = d.
  • Holes in the range represent "don't cares": always send NoMessage on those steps.
• Time-multiplex the module as usual, associating each σ_m with a physical port. A sketch of the construction follows.
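A hedged sketch of this construction (the greedy partitioning below is my assumption; the paper may partition differently):

```python
def partition_ports(ports):
    """Split directed ports (src, dst) into sets in which no two ports
    share a source and no two share a destination, so each set defines
    a partial permutation."""
    groups = []
    for s, d in ports:
        for g in groups:
            if all(s != gs and d != gd for gs, gd in g):
                g.append((s, d))
                break
        else:
            groups.append([(s, d)])
    return groups

def to_permutation(group, n):
    """sigma_m as a table; None marks holes, where NoMessage is sent."""
    sigma = [None] * n
    for s, d in group:
        sigma[s] = d
    return sigma

# Example: an irregular 4-module network.
ports = [(0, 1), (1, 2), (0, 3), (2, 0)]
for g in partition_ports(ports):
    print(to_permutation(g, 4))
# [1, 2, 0, None]
# [3, None, None, None]
```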

  38. Example: Arbitrary Network

39. Results: Multicore Simulation Rate
• Must simulate multiple cores to get the full benefit of time-multiplexed pipelines.
• Functional cache pressure is the rate-limiting factor.
