1 / 12

P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator

P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator. Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu. P ROTO F LEX/ S IM F LEX. IRQ controller. DMA controller. I/O MMU controller. CPU. Terminal. PCI Bus.

infinity
Download Presentation

P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu PROTOFLEX/SIMFLEX

  2. IRQ controller DMAcontroller I/O MMUcontroller CPU Terminal PCI Bus Ethernetcontroller SCSIcontroller Graphics card Disk Disk Multiprocessor Functional Simulation • Functionally simulating one processor in software is slow • Simulating many processors is of course even slower • Parallelism of FPGAs can scale up functional MP simulation perf  conduct large-scale (>64-way)SW research, cache simulations, perf sampling studies, etc. • But we can’t forfeit full-ISA, full-system fidelity (run stock OS) FPGAs = unprecedented level of scalability but full-system building effort can outweigh any benefits CPU FPGAs Memory

  3. I/O instr DMA 1 1 2 2 3 3 Combining FPGAs and simulators Target design FPGA Simulator Cpu Cpu • Advantages: • Leverage full-system simulators for reference designs • Infrequent, complex behaviors remain simulated: • TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, … Disk Mem Disk Mem • 3 ways to map target object to hybrid-simulation host Emulation-only Simulation-only Transplantable • Transplant runtime system • target processors switch modes between FPGA & simulator hosts • processors need not execute 100% in FPGA mode e.g., implement only the frequently used ISA subset in FPGA

  4. It Really Works Virtutech Simics (commercial simulator) Xilinx XUP Virtex-II Pro 30 OurSPARCV9 core Simics UltraSPARC Embedded PowerPC Transplant& messageinterface • BlueSPARC specs: • 7k lines Bluespec • UltraSPARC III ISA • Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.) • 41% all instr groups implemented + MMU • 8kB I/D direct-mapped caches • multi-cycle func model (CPIideal = 5 @ 100MHz) • 16K LUTs (50% of XUP Virtex-II Pro 30) Simulated target devices DDR memory Ethernet = SUN 3800 Server (1x UltraSPARC III, Solaris 8) + developed in 6 monthsx86 also works

  5. Is this the best we can do? CPIeffective= 1.1 CPIeffective= 2 • Reality check: transplants are expensive! (10ms=1,000,000 cycles) • given CPI= 1 @ 100 Mhz (100 MIPS), 1 transplant per 1 million instructionsincreases CPI to 2 (50 MIPS) • Recall lessons in hierarchical cache design … • Hierarchical transplants • Run “simulator kernel” on nearby embedded PowerPC • write SW to cover the entire ISA • only I/O operations need full transplant to SIMICS(a 10x reduction in our case) FPGA fabric coverage=99.9999% CPIraw= 1 coverage=99.9999% CPIraw= 1 • Advantages: • Now it makes sense to optimize towards CPIraw = 1 • You actually need fewer instructions in hardware (especially beneficial for x86) Embedded PPC ISAsim coverage=99.99999% CPI=1,000 full-system SIMICS coverage=100% CPItplant=1,000,000

  6. Demo

  7. How to build a 1024-node MP functional emulator, without building 1024 nodes?

  8. How fast do you need to simulate? “fast enough” for 1024-way arch. studies In the uniprocessor world • up to 100x slowdown for interactive software research (e.g. Simics) • 1k to 10k slowdown for design exploration (e.g. cache simulation) Aggregate Throughput

  9. Different approaches to scale to 1K • Even for 1K-node MP, only 1000 to 10,000 MIPS (aggregate) to do useful work • The obvious approach • build fast ISA core (estimate 100 MIPS per core) • physically replicate the core 1000 times  10x to 100x faster than needed, why spend effort and area on perf I don’t need? • The better approachthink in terms of MIPS • build 100 MIPS ISA emulation engine supporting multiple contexts • map 100 simulated processors onto single engine • with just 10 physical engines, I can emulate 1000-way system(10 x 100 MIPS = 1000 MIPS)

  10. CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPUN CPU CPU CPUP P-way FPGA emulation engines, P<<N Memory PROTOFLEXMP • Build 1000-MIPS simulator from 10s of emulation engines • multiplex large # of emulated contexts onto few emulation engines Decide # of emulation engines to build from desired performance, not from # nodes to emulate N-way target system

  11. Interleaved Emulation Engine • Statically interleaved emulation engine (ala HEP) • issue new instr from new context per cycle  maximize engine throughput • simple pipeline (no fwding or interlock if # context > # pipe stages) • deeper pipelines for higher frequency (or complex x86 instrs) • hide the latency of memory and transplants It is actually easier to optimize instruction throughput • Open issues • How to manage very large # of contexts? Do we have to dynamically “page” clusters of contexts in and out of the engine? • How to “fake” memory capacity? How much DRAM to emulate 1000-node system?

  12. Conclusion • Contributions • hybrid transplant simulation reduces FPGA development effort • proof-of-concept demonstrates up to 16 MIPS on select SPECINT  plan to run TPC-C on DB2 and Oracle on BEE2 (not enough DRAM on XUP) • Future work • 1024-way system on 10-way interleaved emulation engines • Thanks! Questions? echung@ece.cmu.edu • PROTOFLEX/SIMFLEX (http://www.ece.cmu.edu/~simflex)

More Related