P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator

PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu PROTOFLEX/SIMFLEX

IRQ controller DMAcontroller I/O MMUcontroller CPU Terminal PCI Bus Ethernetcontroller SCSIcontroller Graphics card Disk Disk Multiprocessor Functional Simulation • Functionally simulating one processor in software is slow • Simulating many processors is of course even slower • Parallelism of FPGAs can scale up functional MP simulation perf  conduct large-scale (>64-way)SW research, cache simulations, perf sampling studies, etc. • But we can’t forfeit full-ISA, full-system fidelity (run stock OS) FPGAs = unprecedented level of scalability but full-system building effort can outweigh any benefits CPU FPGAs Memory

I/O instr DMA 1 1 2 2 3 3 Combining FPGAs and simulators Target design FPGA Simulator Cpu Cpu • Advantages: • Leverage full-system simulators for reference designs • Infrequent, complex behaviors remain simulated: • TLB misses, block memory instrs, disk I/O instrs, SCSI disks, graphics, … Disk Mem Disk Mem • 3 ways to map target object to hybrid-simulation host Emulation-only Simulation-only Transplantable • Transplant runtime system • target processors switch modes between FPGA & simulator hosts • processors need not execute 100% in FPGA mode e.g., implement only the frequently used ISA subset in FPGA

It Really Works Virtutech Simics (commercial simulator) Xilinx XUP Virtex-II Pro 30 OurSPARCV9 core Simics UltraSPARC Embedded PowerPC Transplant& messageinterface • BlueSPARC specs: • 7k lines Bluespec • UltraSPARC III ISA • Validated against Simics w/ real apps (e.g., Solaris 8, SPEC2000, DB2, Oracle, etc.) • 41% all instr groups implemented + MMU • 8kB I/D direct-mapped caches • multi-cycle func model (CPIideal = 5 @ 100MHz) • 16K LUTs (50% of XUP Virtex-II Pro 30) Simulated target devices DDR memory Ethernet = SUN 3800 Server (1x UltraSPARC III, Solaris 8) + developed in 6 monthsx86 also works

Is this the best we can do? CPIeffective= 1.1 CPIeffective= 2 • Reality check: transplants are expensive! (10ms=1,000,000 cycles) • given CPI= 1 @ 100 Mhz (100 MIPS), 1 transplant per 1 million instructionsincreases CPI to 2 (50 MIPS) • Recall lessons in hierarchical cache design … • Hierarchical transplants • Run “simulator kernel” on nearby embedded PowerPC • write SW to cover the entire ISA • only I/O operations need full transplant to SIMICS(a 10x reduction in our case) FPGA fabric coverage=99.9999% CPIraw= 1 coverage=99.9999% CPIraw= 1 • Advantages: • Now it makes sense to optimize towards CPIraw = 1 • You actually need fewer instructions in hardware (especially beneficial for x86) Embedded PPC ISAsim coverage=99.99999% CPI=1,000 full-system SIMICS coverage=100% CPItplant=1,000,000

Demo

How to build a 1024-node MP functional emulator, without building 1024 nodes?

How fast do you need to simulate? “fast enough” for 1024-way arch. studies In the uniprocessor world • up to 100x slowdown for interactive software research (e.g. Simics) • 1k to 10k slowdown for design exploration (e.g. cache simulation) Aggregate Throughput

Different approaches to scale to 1K • Even for 1K-node MP, only 1000 to 10,000 MIPS (aggregate) to do useful work • The obvious approach • build fast ISA core (estimate 100 MIPS per core) • physically replicate the core 1000 times  10x to 100x faster than needed, why spend effort and area on perf I don’t need? • The better approachthink in terms of MIPS • build 100 MIPS ISA emulation engine supporting multiple contexts • map 100 simulated processors onto single engine • with just 10 physical engines, I can emulate 1000-way system(10 x 100 MIPS = 1000 MIPS)

CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPUN CPU CPU CPUP P-way FPGA emulation engines, P<<N Memory PROTOFLEXMP • Build 1000-MIPS simulator from 10s of emulation engines • multiplex large # of emulated contexts onto few emulation engines Decide # of emulation engines to build from desired performance, not from # nodes to emulate N-way target system

Interleaved Emulation Engine • Statically interleaved emulation engine (ala HEP) • issue new instr from new context per cycle  maximize engine throughput • simple pipeline (no fwding or interlock if # context > # pipe stages) • deeper pipelines for higher frequency (or complex x86 instrs) • hide the latency of memory and transplants It is actually easier to optimize instruction throughput • Open issues • How to manage very large # of contexts? Do we have to dynamically “page” clusters of contexts in and out of the engine? • How to “fake” memory capacity? How much DRAM to emulate 1000-node system?

Conclusion • Contributions • hybrid transplant simulation reduces FPGA development effort • proof-of-concept demonstrates up to 16 MIPS on select SPECINT  plan to run TPC-C on DB2 and Oracle on BEE2 (not enough DRAM on XUP) • Future work • 1024-way system on 10-way interleaved emulation engines • Thanks! Questions? echung@ece.cmu.edu • PROTOFLEX/SIMFLEX (http://www.ece.cmu.edu/~simflex)

P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator

P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator

Presentation Transcript

Generating FPGA-Accelerated DFT Libraries

FASED: FPGA-Accelerated Simulation and Evaluation of DRAM

FPGA Based Smoke Simulator

GPU Functional Simulator

FPGA Accelerated 3-D Tomography

P P F

LEX

Implementing a Functional/Timing Partitioned Microprocessor Simulator with an FPGA

A brief [f]lex tutorial

Hybrid Simulator for Compressible Fluid Flow

Lex

Single-Payment Factors (P/F, F/P)

FPGA-Accelerated Simulation Technologies (FAST)

Roto-Rooter

GPU Functional Simulator

Lex

Lex