
Mapping CMPs to Xilinx FPGAs



Presentation Transcript


  1. Mapping CMPs to Xilinx FPGAs
  Jan Gray, Architect, Office of the CTO, Microsoft (fpgacpu.org, fpga-cpu list)

  2. Outline
  • Why am I here? (1)
  • FPGA CMPs: a brief personal history (1)
  • A methodology for great quality of results (7)
  • Mapping a scalar RISC PE to an FPGA (5)
  • CMP and RAMP comments (4)

  3. Why Am I Here?
  • End of rapid clock frequency scaling – parallelism is imperative – we get it…
  • The vast design space of 1B-transistor SoCs, ~2010
  • Enables dozens of cores, integration, 100s of GFLOPS – but how can the millions of real-world developers (not grad students) exploit it?
  • Particularly a challenge in ‘client personal computing’ settings
  • By ~2010, industry must deliver loveable (mainstream, evolutionary) concurrency programming models
  • RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
  • [How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?

  4. My Journey into FPGA CMPs
  • Inspired by comp.arch, many Hot Chips conferences, H&P
  • 91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in a 4005
  • 95: jr32: 33÷2 MHz |4| 32-bit RISC + SoC in a 4010
  • 98: XSOC/xr16: 40 MHz |3| 16-bit RISC in 260 LCs in an S10
  • Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
  • 00: Altera NIOS
  • 01: gr1040: 200 LUTs + 1 BRAM, 80 MHz |2|; 60 in a V600E [P]
  • 01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
  • 04: 10 MicroBlaze in a 2V2000 via EDK 6.3i
  • 04: 24 multithreaded-‘MB’ in a 4VLX25 [PD]
  • 06: 24 ‘PowerPC’ in a 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
  • [P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x-stage pipeline, no FP

  5. A Methodology for Great Quality of Results: It’s Essential for CMPs!
  • It’s the golden age of FPGA development
  • Was: timing whack-a-mole, synthesis pushing on a rope
  • Now: good fast tools, fast computers, better fabrics
  • But: 2-10x better delay×area by tailoring the ISA, HW/SW partitioning, datapath, pipeline, technology mapping, and floorplan to the FPGA
  • Prefer 40 200-MHz processors/die to 20 100-MHz ones
  • Example: always @(posedge clk) q <= add ? q + a : b; (written out as a module below)
    • Hand technology mapped and floorplanned: 1 LUT/bit
    • Synthesis: 2 LUT/bit, +0.5 ns delay
  • 5X faster place and route → rapid (methodical) experiments
  • Efficiency of ASIC CPU models on FPGAs?
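
  The example bullet above, written out as a complete Verilog module — a minimal sketch, with the module name and 32-bit width as assumptions. Written behaviorally like this, synthesis typically spends one LUT/bit on the adder and a second on the q+a / b mux; the hand-mapped ADDMUX idiom on slide 9 folds both into a single LUT plus carry-chain bit.

      // Sketch only: slide 5's one-liner as a self-contained module.
      // Module name and width are illustrative assumptions.
      module addmux_reg #(parameter W = 32) (
          input              clk,
          input              add,
          input  [W-1:0]     a,
          input  [W-1:0]     b,
          output reg [W-1:0] q
      );
          always @(posedge clk)
              q <= add ? q + a : b;   // accumulate a, or load b
      endmodule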

  6. The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka
  • Great FPGA designers have The Knowledge
  • Best practices for great datapath QoR:
  • Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
  • Bottom-up experiments
  • Where it matters, use technology mapping tricks
  • Build up libraries of optimal datapath elements
  • Floorplan datapaths via Relationally Placed Macros (RPMs) – see the sketch below
  • VHDL (generate + attributes) or Python + Verilog
  • Synthesize 95% of the control unit – life is too short
  • Careful timing constraints; grok TRCE reports
  • Tune architecture and implementation together
  • Sweat the muxes
  • To iterate is divine
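
  A minimal Verilog sketch of the RPM idea above: pin each LUT/flip-flop pair of a datapath bit to a relative slice location with RLOC attributes, so the placed shape survives wherever place-and-route drops the macro. LUT4, FDRE, and the RLOC attribute are standard Xilinx library elements; the INIT values, module name, and two-bit width are illustrative assumptions (a real datapath would generate this per bit, e.g. from Python as the slide suggests).

      // Sketch: a 2-bit registered AND datapath column floorplanned as an RPM.
      // Primitives (LUT4, FDRE) and the RLOC attribute are Xilinx library
      // elements; placements and INIT values are illustrative only.
      module and_reg_rpm (
          input        clk,
          input  [1:0] a, b,
          output [1:0] q
      );
          wire [1:0] o;

          // Bit 0 in the bottom slice of the macro
          (* RLOC = "X0Y0" *) LUT4 #(.INIT(16'h8888)) lut0   // O = I0 & I1
              (.O(o[0]), .I0(a[0]), .I1(b[0]), .I2(1'b0), .I3(1'b0));
          (* RLOC = "X0Y0" *) FDRE ff0
              (.Q(q[0]), .C(clk), .CE(1'b1), .R(1'b0), .D(o[0]));

          // Bit 1 one slice up, keeping the column vertical
          (* RLOC = "X0Y1" *) LUT4 #(.INIT(16'h8888)) lut1
              (.O(o[1]), .I0(a[1]), .I1(b[1]), .I2(1'b0), .I3(1'b0));
          (* RLOC = "X0Y1" *) FDRE ff1
              (.Q(q[1]), .C(clk), .CE(1'b1), .R(1'b0), .D(o[1]));
      endmodule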

  7. The Knowledge in a Nutshell
  • The LUT and its DFF
  • Tech mapping opts to quash mux LUTs
  • The BRAM
  • The DSP48
  • (The DCM)

  8. The LUT and its D-FF
  • 4-input LUT
  • Ripple-carry adder via MUXCY and XORCY: ~2.5 ns 32-bit adder (one bit sketched below)
  • MULTAND: P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
  • Mux cascades – MUXF5 etc.
  • 16x1-bit LUT RAM; single- and dual-ported
  • SRL16 – 16-bit tapped shift register
  • D flip-flops: clock enable, synchronous reset, system reset regime
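
  One bit of the MUXCY/XORCY ripple-carry adder mentioned above, spelled out with the library primitives — a sketch only; in practice you replicate it per bit, or simply infer a + b and let the tools use the carry chain. The LUT computes the half-sum a^b, MUXCY steers the carry, and XORCY completes the sum.

      // Sketch: one ripple-carry bit from Xilinx LUT4/MUXCY/XORCY primitives.
      module adder_bit (
          input  a, b, cin,
          output sum, cout
      );
          wire half_sum;                       // a XOR b, computed in the LUT

          LUT4 #(.INIT(16'h6666)) xor_lut (    // O = I0 ^ I1
              .O(half_sum), .I0(a), .I1(b), .I2(1'b0), .I3(1'b0));

          MUXCY carry_mux (                    // cout = half_sum ? cin : a
              .O(cout), .CI(cin), .DI(a), .S(half_sum));

          XORCY sum_xor (                      // sum = half_sum ^ cin
              .O(sum), .CI(cin), .LI(half_sum));
      endmodule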

  9. 1 LUT/bit Technology Mapping Opts
  • ADDSUB: o <= add ? (a+b) : (a-b);
  • MUX2K: o <= k ? sel : (sel ? a : b);
  • MULTAND + carry chain
  • ADDMUX: o <= add ? (a+b) : c;
  • MUXADD: o <= addb ? (a+b) : (a+c);
  • ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
  • Fast carry-chain logic:
    • EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
    • EQZ: o[i/4] <= a[i+3:i] == 0
    • C, V conditions
  • Other cheap mux ideas:
    • 4:1 mux using two 2:1 muxes and a MUXF5
    • LUT RAM / SRL16 as a 16:1 mux
    • 4-input OR of 4 clearable registers
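
  Behavioral Verilog forms of several of the idioms listed above (module and port names are assumptions). Each right-hand side is shaped so that one 4-LUT per bit, plus the carry chain, can implement it; whether a given synthesis tool actually hits 1 LUT/bit without hand instantiation varies.

      // Sketch: a few of slide 9's 1-LUT/bit idioms, written behaviorally.
      module one_lut_per_bit_idioms #(parameter W = 16) (
          input          add, addb, s1, s2,
          input  [W-1:0] a, b, c,
          output [W-1:0] addsub,   // ADDSUB: shared adder/subtractor
          output [W-1:0] addmux,   // ADDMUX: sum or bypass
          output [W-1:0] muxadd,   // MUXADD: add one of two operands
          output [W-1:0] alu       // ALU: add/sub/and/xor
      );
          assign addsub = add  ? (a + b) : (a - b);
          assign addmux = add  ? (a + b) : c;
          assign muxadd = addb ? (a + b) : (a + c);
          assign alu    = s1 ? (s2 ? (a + b) : (a - b))
                             : (s2 ? (a & b) : (a ^ b));
      endmodule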

  10. BRAM
  • 18 Kb dual-port synchronous SRAM
  • Up to dual x32+x4 D/Q
  • 0 cycles: tAS ~0.5 ns; tCO ~2.1 ns
  • BRAM … adder … BRAM → 6 ns
  • Virtex-4:
    • Optional 1 cycle: DO*_REG: 0.9 ns
    • 400+ MHz
    • Byte write enables
  • FIFOs for ser/des rate matching
  • The Myriad Uses of BRAMs on fpgacpu.org
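
  A small inference-style sketch of the BRAM usage above (sizes, names, and the simple-dual-port arrangement are assumptions): a synchronous write port, a synchronous read port, and a second register stage standing in for the optional Virtex-4 DO*_REG output register that enables 400+ MHz operation.

      // Sketch: inferrable simple-dual-port block RAM with registered output.
      module bram_dp #(parameter AW = 9, DW = 32) (
          input               clk,
          // Port A: write
          input               wea,
          input  [AW-1:0]     addra,
          input  [DW-1:0]     dina,
          // Port B: registered read
          input  [AW-1:0]     addrb,
          output reg [DW-1:0] dob
      );
          reg [DW-1:0] mem [0:(1<<AW)-1];
          reg [DW-1:0] rd;                    // BRAM's synchronous read register

          always @(posedge clk) begin
              if (wea) mem[addra] <= dina;    // port A write
              rd  <= mem[addrb];              // port B synchronous read
              dob <= rd;                      // optional output register (DO*_REG)
          end
      endmodule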

  11. MULT/DSP48
  • Dozens to hundreds in V-4
  • Pipeline at 400+ MHz
  • Faster adders than the fabric
  • Basis for interesting fast simulated FPUs
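
  A sketch of a pipelined multiply that tools can pack into a DSP48 (the 18-bit width matches the block's native operand size; names and the two-stage register depth are assumptions). Registering the operands and the product maps onto the block's internal A/B and P registers, which is what makes 400+ MHz pipelining possible.

      // Sketch: fully registered 18x18 signed multiplier, DSP48-friendly.
      module mul_pipe #(parameter W = 18) (
          input                       clk,
          input  signed [W-1:0]       a, b,
          output reg signed [2*W-1:0] p
      );
          reg signed [W-1:0] a_r, b_r;

          always @(posedge clk) begin
              a_r <= a;                // input registers (A/B regs in the DSP48)
              b_r <= b;
              p   <= a_r * b_r;        // product register (P reg in the DSP48)
          end
      endmodule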

  12. QoR Examples
  • Xr16 core
    • ISA codesigned with the datapath
    • Elide 1 result forwarding mux, compensate in SW
    • Map result mux and shifter to TBUFs
  • gr1040 core – 200 LUTs + 1 BRAM
    • 2-stage pipeline – elide all result forwarding muxes
    • BRAM for instructions and data
    • Use 1 LUT/bit ‘ALU’ – delete OR operator – 67% smaller
    • Use ADDMUX – faster, 30% smaller
    • C, V, branch, and i-cache tag check in carry-chain logic

  13. Mapping a Scalar RISC PE to an FPGA
  • Instruction cache, data cache
    • Cache lines – 1+ BRAMs
    • Cache tags – LUT RAM or BRAM
    • Read-first mode for write-back caches
  • Register file
    • Single- or dual-ported LUT RAM (see the sketch below)
    • Multicontext register files in BRAM
  • ALU
    • Tech mapping tricks; DSP48?
  • Result forwarding muxes
  • Multithreading – MicroUnity, HEP → deep pipelines OK
    • Clock the pipeline faster than the operand regs → ALU → forwarding → operand regs recurrence
    • LUT RAM PCs, SPRs, PSWs; BRAM reg files
    • But probably too much pressure on tiny i-caches and d-caches?
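
  A minimal sketch of the "single or dual ported LUT RAM" register file bullet above (sizes and names are assumptions): LUT RAM gives one write port and asynchronous reads, so a 2-read/1-write register file is built by writing two identical copies and reading each through its own port.

      // Sketch: 16-entry, 32-bit, 2R1W register file in duplicated LUT RAM.
      module regfile_2r1w #(parameter AW = 4, DW = 32) (
          input           clk,
          input           we,
          input  [AW-1:0] waddr,
          input  [DW-1:0] wdata,
          input  [AW-1:0] raddr_a, raddr_b,
          output [DW-1:0] rdata_a, rdata_b
      );
          reg [DW-1:0] bank_a [0:(1<<AW)-1];  // one LUT-RAM copy per read port
          reg [DW-1:0] bank_b [0:(1<<AW)-1];

          always @(posedge clk)
              if (we) begin
                  bank_a[waddr] <= wdata;     // both copies written together
                  bank_b[waddr] <= wdata;
              end

          assign rdata_a = bank_a[raddr_a];   // asynchronous (combinational) reads
          assign rdata_b = bank_b[raddr_b];
      endmodule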

  14. Simple Is Beautiful
  • Simpler is smaller
  • Smaller is cheaper
  • More PEs per part
  • Smaller can be faster
  • Interconnect is slow, so the less, the better
  • Easier to optimize (retiming, floorplanning, technology mapping)
  • Smaller is more power-frugal
  • Simpler is easier to verify
  • Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA
  • (WCED?)

  15. “Jan’s Razor”
  • In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.
  • Small clusters of cores share mul/div, barrel shift, FPU, TLB, even a d-cache port (a toy sharing sketch follows)
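
  A purely illustrative Verilog sketch of Jan's Razor in action: two PEs in a cluster sharing one multiplier behind a fixed-priority arbiter. All names, widths, and the single-cycle request/grant handshake are assumptions, not anything from the slides; a real shared mul/div or FPU would pipeline the unit and queue requests.

      // Sketch: one multiplier shared by two PEs via fixed-priority arbitration.
      module shared_mul #(parameter W = 32) (
          input                clk,
          input        [1:0]   req,          // request from PE0 / PE1
          input  [W-1:0]       a0, b0,       // PE0 operands
          input  [W-1:0]       a1, b1,       // PE1 operands
          output reg   [1:0]   gnt,          // grant strobes
          output reg [2*W-1:0] prod          // shared product bus
      );
          always @(posedge clk) begin
              gnt <= 2'b00;
              if (req[0]) begin              // PE0 has fixed priority
                  prod <= a0 * b0;
                  gnt  <= 2'b01;
              end else if (req[1]) begin
                  prod <= a1 * b1;
                  gnt  <= 2'b10;
              end
          end
      endmodule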

  16. Silly Example: 70 ‘PowerPC-lite’ datapaths in a 2VP70

  17. Which ISAs for RAMP PEs?
  • Best fit in an FPGA fabric (== austerity)
    • MicroBlaze, MIPS, SPARC, PowerPC, x86
    • x86 + PC periphs via dynamic translation?
  • Extant soft cores: MB, SPARC
  • 2VP/4VFX + EDK (CoreConnect *) bonus
    • MB, PowerPC
  • Commercial workloads and tools
    • PowerPC!

  18. PE Figures of Merit
  • Area: #[LUTs, BRAMs, DSPs, DCMs]
  • Frequency, power, floorplanned? (fast PAR)
  • Simplicity / ease of modification
    • Some experiments will augment base CPU ISAs
  • Facilities
  • Validation
  • Debug support
  • Tools integration
  • Workloads
  • IP rights

  19. Speculation: How to Experiment Upon Commercial x86 Workloads on RAMP
  • x86 HW seems too complex for an area- and time-efficient large-n FPGA CMP
    • 386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
    • Don’t underestimate the complexity of the rest of the system components / cores
  • Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon it, run apps on that
  • Save/restore PC workloads to VHD images
  • (When you have many cores, you don’t mind if your simulator spends a few on dynamic translation)

  20. Other Thoughts
  • Compose optimized building blocks into synthesized (floorplanned?) system architectures
  • Synplify Pro has a great RTL viewer
  • MicroBlaze is an excellent Type B core
  • EDK is a great framework
    • Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
    • EDK ships with a broad complement of cores
    • Don’t reinvent all that! EDK vs. RDL?
  • QinetiQ (?) FPU IP

  21. Comments? Thanks.
