Using GPCE Principles for Hardware Systems and Accelerators

Using GPCE Principles for Hardware Systems and Accelerators (bridging the gap to HW design). GPCE '09, October 4, 2009. Rishiyur S. Nikhil, CTO, www.bluespec.com

This seems to be a conference about improving software development ... Generative and component approaches are revolutionizing software development ... GPCE provides a venue for researchers and practitioners interested in foundational techniques for enhancing the productivity, quality, and time-to-market in software development ... In addition to exploring cutting-edge techniques for developing generative and component-based software, our goal is to foster further cross-fertilization between the software engineering research community and the programming languages community. ... so why am I here talking about hardware design? Two reasons ...

Reason (1): you may be interested in seeing how the principles highlighted below ... Generative Programming (developing programs that synthesize other programs), Component Engineering (raising the level of modularization and analysis in application design), and Domain-Specific Languages (elevating program specifications to compact domain-specific notations that are easier to write, maintain, and analyze) are key technologies for automating program development. ... enhancing the productivity, quality, and time-to-market in software development that stems from deploying standard components and automating program generation. ... are used with equal capability and effectiveness in HW design

Reason (2): I would like to tempt you to upgrade from being only a software engineer (v 1.0) ... to “The Compleat Computation-ware Engineere (v 2.0)” ... where you think of hardware computation as an important (and easy to use) component in your toolbox when you solve your next problem.

The traditional HW creation “flow” (early 1990s to present)
• Source code (Verilog/VHDL)
• RTL simulation; run/debug/edit: “instant”
• FPGA path: Traditional FPGA synthesis* → Gate-level Verilog/VHDL → Place&Route, ..., FPGA download (minutes/hours, $100-10K)
• ASIC path: Traditional ASIC synthesis* → Gate-level Verilog/VHDL → Place&Route, ..., tape out, ... manufacture ... (10s of months, $10M-50M)
* “synthesis” is just jargon for a certain kind of compilation

New flows (not yet mainstream)
• Source code (High Level Language) → Simulation by compiled execution
• Source code (High Level Language) → “High Level” synthesis → Source code (Verilog/VHDL) → the traditional flow (RTL simulation, traditional FPGA/ASIC synthesis, Gate-level Verilog/VHDL, Place&Route, ..., FPGA download or tape out, ... manufacture ...)
• By raising level of abstraction,
  • improve design time by 10x (or more)
  • expressive power, simulation speed
  • with no loss of silicon quality (area, speed, power)
  • In fact, sometimes with better silicon quality (because improved flexibility can result in better architectures)

Some candidate high level languages
• Source code (C/C++/SystemC): classic limitations of automatic parallelization from sequential codes, cf. “dusty deck Fortran” ca. 1970s
• Source code (BSV): Bluespec’s fresh approach, inspired by
  • Term Rewriting Systems (parallel atomic transactions) to describe complex concurrent behavior. Related to: UNITY, TLA+, EventB, ...
  • Haskell (types, overloading, parameterization, generativity)
• Either source feeds “High Level” synthesis → Source code (Verilog/VHDL) → the traditional flow (RTL simulation, traditional FPGA/ASIC synthesis, Gate-level Verilog/VHDL, Place&Route, ..., FPGA download or tape out, ... manufacture ...)

HW languages have always been “generative”
Example (Verilog):
    module mkM1 (…);
      mkM3 m3b ( … ); // instantiates mkM3
      mkM2 m2 ( … ); // instantiates mkM2
    endmodule
    module mkM2 (…);
      mkM3 m3a ( … ); // instantiates mkM3
    endmodule
    module mkM3 (…);
      …
    endmodule
[Figure: two visualizations of the resulting module instance hierarchy: m1 (instance of mkM1) contains m3b and m2; m2 contains m3a.]

HW languages have long been “generative” (contd.)
• Verilog/VHDL have poor generative capabilities (weak afterthought!):
  • Not orthogonal, not reflective, not Turing-complete
• Static Elaboration (jargon for “generation”)
  • Execute the structural aspects of the program to produce the module hierarchy (structure)
• Execution
  • Execution within the fixed structure (behavior)
  • Essentially just the execution of a giant FSM
[Figure: the flow diagram from the previous slides, annotated to show where Static Elaboration and Execution take place.]
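To make “static elaboration” concrete for software readers, here is a minimal sketch in Python (a software model, not Verilog or BSV). It runs the structural part of the mkM1/mkM2/mkM3 example from the earlier slide and produces the instance hierarchy. The `Instance` class and the `paths` helper are illustrative names, not part of any HW tool.

```python
# Toy model of static elaboration: executing the "structural" code
# yields a tree of module instances (the fixed structure).

class Instance:
    def __init__(self, module, name):
        self.module = module      # which module this is an instance of
        self.name = name          # instance name inside its parent
        self.children = []

    def paths(self, prefix=""):
        """Return hierarchical instance paths, depth first."""
        here = prefix + self.name
        result = [here]
        for child in self.children:
            result.extend(child.paths(here + "."))
        return result

def mkM3(name):
    return Instance("mkM3", name)

def mkM2(name):
    inst = Instance("mkM2", name)
    inst.children.append(mkM3("m3a"))   # instantiates mkM3
    return inst

def mkM1(name):
    inst = Instance("mkM1", name)
    inst.children.append(mkM3("m3b"))   # instantiates mkM3
    inst.children.append(mkM2("m2"))    # instantiates mkM2
    return inst

hierarchy = mkM1("m1").paths()
print(hierarchy)
```

Once this tree exists, nothing in it changes at run time; “execution” is then the behavior of the fixed structure.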

I’m now going to show you some code examples for some non-trivial HW designs. I hope, at the end of this, you’ll say: “Hey! I could do that!” even if you’ve never designed HW before!

Verilog/VHDL module interfaces: wire oriented
Example: transferring a datum from one module to another
• In each module: declare input and output wires (data, RDY, ENA)
• In the parent: declaration of wires; connections to module interface wires; logic for RDY/ENA
• Protocol (proper behavior) specified separately using waveforms (RDY, ENA, data) and English text
• Very verbose, very error-prone

BSV module interfaces: “transactional” (object-oriented)
    Get#(Packet) g1 <- mkM1 (...);
    Put#(Packet) p1 <- mkM2 (...);
    Empty e <- mkConnection (g1, p1);
These interface definitions are sufficiently useful and reusable that they’re in standard BSV libraries:
    interface Get #(type t); // polymorphic
      method ActionValue #(t) get();
    endinterface
    interface Put #(type t);
      method Action put (t x);
    endinterface
    module mkConnection #(Get#(t) g, Put#(t) p) (Empty); // g and p are parameters
      rule connect;
        let x <- g.get();
        p.put (x);
      endrule
    endmodule
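For readers who have not seen rule-based hardware before, the following Python sketch models what the `connect` rule does: one transfer happens only when the producer has data and the consumer has room. The `FIFO` class, its `can_get`/`can_put` checks and the `depth` parameter are assumptions made for this model. In BSV the equivalent readiness conditions are implicit in the methods.

```python
from collections import deque

class FIFO:
    """Bounded FIFO exposing get/put with explicit readiness checks."""
    def __init__(self, depth=2):
        self.items = deque()
        self.depth = depth

    def can_get(self):
        return len(self.items) > 0

    def can_put(self):
        return len(self.items) < self.depth

    def get(self):
        assert self.can_get()
        return self.items.popleft()

    def put(self, x):
        assert self.can_put()
        self.items.append(x)

def mk_connection(g, p):
    """Return a 'rule': each call attempts one transfer from g to p."""
    def connect():
        if g.can_get() and p.can_put():
            p.put(g.get())
            return True       # rule fired
        return False          # rule did not fire; no state changed
    return connect

src, dst = FIFO(depth=4), FIFO(depth=2)
for packet in ["a", "b", "c"]:
    src.put(packet)

rule = mk_connection(src, dst)
fired = [rule() for _ in range(4)]
print(fired, list(src.items), list(dst.items))
```

The third and fourth attempts do not fire because `dst` is full, which is the flow control that the RDY/ENA wires implement by hand in the wire-oriented style.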

Interfaces can be composed
Get/Put pairs are very common, and duals of each other, so the BSV library defines Client/Server interfaces for this purpose:
    interface Client #(type req_t, type resp_t);
      interface Get#(req_t) request;
      interface Put#(resp_t) response;
    endinterface
    interface Server #(type req_t, type resp_t);
      interface Put#(req_t) request;
      interface Get#(resp_t) response;
    endinterface
    module mkConnection #(Client#(t1,t2) c, Server#(t1,t2) s) (Empty);
      mkConnection (c.request, s.request);
      mkConnection (s.response, c.response);
    endmodule
[Figure: a client (Get request, Put response) wired to a server (Put request, Get response); each direction carries data, RDY and ENA wires, with req_t flowing from client to server and resp_t flowing back.]
Note overloaded mkConnection (BSV uses Haskell’s Typeclass mechanism for user-extensible, recursive, statically typed overloading)

Example: a Butterfly cross-bar switch
Recursive structure: 1x1 → 2x2 → 4x4 → … → NxN
Basic building blocks: buffer (FIFO), 2x1 merge, routing logic
The entire interface can be defined in a few lines (polymorphic in the data type of packets flowing through the switch):
    interface XBar #(type t);
      interface List#(Put#(t)) input_ports;
      interface List#(Get#(t)) output_ports;
    endinterface

Butterfly switch: module implementation (module parameters)
    module mkXBar #(Integer n,                               // size of switch (# of ports)
                    function UInt #(32) destinationOf (t x), // used by routing logic
                    Module #(Merge2x1 #(t)) mkMerge2x1)      // 2x1 merge module
                  ( XBar #(t));                              // interface
    endmodule: mkXBar
Parameters are static arguments, and so can be of any type, including (unbounded) Integers, functions, modules, etc. Interfaces represent dynamic communications and can only carry hardware-representable types.

Butterfly switch: module implementation (base case: a buffer (FIFO))
    module mkXBar #(...) ( XBar #(t));
      List #(Put#(t)) iports;
      List #(Get#(t)) oports;
      if (n == 1) begin // ---- BASE CASE (n = 1)
        FIFO #(t) f <- mkFIFO;
        iports = cons (toPut (f), nil);
        oports = cons (toGet (f), nil);
      end
      else begin // ---- RECURSIVE CASE (n > 1)
      end
      interface input_ports = iports;
      interface output_ports = oports;
    endmodule: mkXBar

Butterfly switch: module implementation (recursive case: structure)
    module mkXBar #(...) ( XBar #(t));
      if (n == 1) begin // ---- BASE CASE (n = 1)
      end
      else begin // ---- RECURSIVE CASE (n > 1)
        XBar#(t) upper <- mkXBar (n/2, destinationOf, mkMerge2x1);
        XBar#(t) lower <- mkXBar (n/2, destinationOf, mkMerge2x1);
        List#(Merge2x1#(t)) merges <- replicateM (n, mkMerge2x1);
        iports = append (upper.input_ports, lower.input_ports);
        function Get#(t) oport_of (Merge2x1#(t) m) = m.oport;
        oports = map (oport_of, merges);
        ... routing behavior ...
      end
    endmodule: mkXBar

Butterfly switch: module implementation (recursive case: routing behavior)
    module mkXBar #(...) ( XBar #(t));
      if (n == 1) begin // ---- BASE CASE (n = 1)
      end
      else begin // ---- RECURSIVE CASE (n > 1)
        let ps = append (upper.output_ports, lower.output_ports);
        for (Integer j = 0; j < n; j = j + 1)
          rule route;
            let x <- ps[j].get ();
            case (flip (destinationOf (x), j, n)) matches
              tagged Invalid : merges [j] .iport0.put (x);
              tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);
            endcase
          endrule
      end
    endmodule: mkXBar

Butterfly switch: atomicity of rules
    for (Integer j = 0; j < n; j = j + 1)
      rule route;
        let x <- ps[j].get ();
        case (flip (destinationOf (x), j, n)) matches
          tagged Invalid : merges [j] .iport0.put (x);
          tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);
        endcase
      endrule
• May not be a packet to get
• May not be able to put a packet:
  • flow control
  • contention
The hardware control logic to manage these complex, dynamic (data-dependent), reactive control conditions is the most tedious and error-prone aspect of designing with RTL (Verilog, VHDL) and even with SystemC. Creation of this logic is automated (synthesized), based on the atomicity semantics of rules.
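The all-or-nothing behavior of a rule can be sketched in software. The Python model below is a simplification under stated assumptions: it routes by a plain destination index rather than the slide's `flip`/`iport0`/`iport1` scheme, it lets each destination accept at most one put per cycle to stand in for contention, and it resolves contention by fixed rule order. None of these names come from BSV.

```python
from collections import deque

class Queue:
    """FIFO that accepts at most one put per cycle (models port contention)."""
    def __init__(self, depth):
        self.items = deque()
        self.depth = depth
        self.put_this_cycle = False

    def can_put(self):
        return not self.put_this_cycle and len(self.items) < self.depth

    def put(self, x):
        self.items.append(x)
        self.put_this_cycle = True

def route_rule(src, dests, destination_of):
    """Fire atomically: both the get and the put happen, or neither does."""
    if not src.items:                      # may not be a packet to get
        return False
    target = dests[destination_of(src.items[0])]
    if not target.can_put():               # flow control or contention
        return False
    target.put(src.items.popleft())
    return True

def cycle(sources, dests, destination_of):
    """One clock cycle: try every source's rule once, in index order."""
    for d in dests:
        d.put_this_cycle = False
    return [route_rule(s, dests, destination_of) for s in sources]

sources = [Queue(4), Queue(4)]
sources[0].items.extend([0, 1])    # packets are just destination numbers here
sources[1].items.extend([0])
dests = [Queue(2), Queue(2)]

first = cycle(sources, dests, lambda p: p)
second = cycle(sources, dests, lambda p: p)
third = cycle(sources, dests, lambda p: p)
print(first, second, third)
```

In the first cycle both sources want destination 0, so only one rule fires and the other packet stays where it was. The `if` tests in `route_rule` correspond to the control logic that is written by hand in RTL and synthesized from rule semantics in BSV.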

Butterfly switch: summary observations
• The core mkXBar module is expressed in ~40-50 lines of code
  • Parameterized by packet type, size, routing function, 2x1 merge module
  • It’s fully synthesizable (550 MHz using Magma Synthesis, TSMC 0.18 micron libraries)
• Static elaboration (“generativity”) has the full power of Haskell evaluation
  • Higher-order functions, lists/vectors, recursion, ...
• There is no syntactic distinction between the “static elaboration” part and the “dynamic” part of the source code
  • An expression “a+b” may be used both for static elaboration and as a dynamic computation (i.e., an adder in the hardware)
• 2 layers: static elaboration produces a module hierarchy with rules
  • The rules are then synthesized according to atomicity semantics into the correct data paths and control logic
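As a rough illustration of what the recursive elaboration unfolds into, the Python sketch below mirrors the recursion of mkXBar as shown on the slides (one FIFO in the base case; two half-size switches plus n merges in the recursive case) and counts the primitive instances. It assumes n is a power of two and models structure only, not routing.

```python
def elaborate_xbar(n):
    """Mirror mkXBar's recursion; return counts of primitive instances."""
    if n == 1:
        return {"fifo": 1, "merge2x1": 0}          # base case: one buffer
    upper = elaborate_xbar(n // 2)
    lower = elaborate_xbar(n // 2)
    return {
        "fifo": upper["fifo"] + lower["fifo"],
        # n merge blocks are replicated at this level of the recursion
        "merge2x1": upper["merge2x1"] + lower["merge2x1"] + n,
    }

counts = {n: elaborate_xbar(n) for n in (1, 2, 4, 8)}
for n, c in counts.items():
    print(n, c)
```

Under these assumptions an NxN switch elaborates to N FIFOs and N·log2(N) merge blocks, all from one parameterized source.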

Example: IFFT in 802.11a wireless transmitter
[Figure: transmitter pipeline: headers and data (24 uncoded bits) → Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend.]
IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; it accounts for 85% of the area.

The IFFT computation (specification)
[Figure: inputs in0 ... in63 pass through three stages, each a column of 16 Bfly4 blocks followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 ... out63. Inset: a Bfly4 multiplies its inputs by t0 ... t3 and combines them with +, - and *j operations.]
All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
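As a software reference for what the block computes, here is a direct inverse DFT in Python. This is only a floating-point specification of the 64-point transform, not the Bfly4/permutation structure and not the fixed-point arithmetic used in the hardware. The 1/N scaling is one common convention and is an assumption here.

```python
import cmath

def idft(freq):
    """Direct inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+2*pi*j*k*n/N)."""
    n_points = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * n / n_points)
            for k in range(n_points)) / n_points
        for n in range(n_points)
    ]

# A single non-zero frequency bin gives one complex sinusoid in time.
spectrum = [0j] * 64
spectrum[1] = 64 + 0j
samples = idft(spectrum)
print(samples[0], samples[16])
```

A reference like this is useful mainly for checking the outputs of the different hardware architectures against one another.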

IFFT: the HW implementation space (varying in area, power, clock speed, latency, throughput)
• Direct combinational circuit
• Varying degrees of pipelining
• Iterate 1 stage thrice
• In any stage, use fewer than 16 Bfly4s (fewer Bfly4s, with serialization and unserialization)

Higher-order functions for building linear pipelines (“linear combinator”)
[Figure: mkLinearPipe () takes n_stages and mkStage () and chains stages 0 ... n_stages-1, each a Pipe with Put and Get ends, into a single Pipe.]
    module mkLinearPipe #(Integer n_stages,
                          Bool with_registers,
                          function Module #(Pipe#(a,a)) mkStage (Integer stage_j))
                        (Pipe#(a,a));
      ...
    endmodule
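The idea of a linear combinator can be sketched with ordinary higher-order functions. In the Python model below, `mk_linear_pipe` elaborates the stages once and returns their composition. It ignores `with_registers` and all timing, and the stage body is a hypothetical stand-in for a real stage.

```python
def mk_linear_pipe(n_stages, mk_stage):
    """Elaborate stages 0..n_stages-1 once, then return their composition."""
    stages = [mk_stage(j) for j in range(n_stages)]   # "static elaboration"
    def pipe(x):
        for stage in stages:                          # "execution"
            x = stage(x)
        return x
    return pipe

def mk_stage(j):
    # Hypothetical stage body: add the stage index.
    return lambda x: x + j

pipe3 = mk_linear_pipe(3, mk_stage)
result = pipe3(10)
print(result)
```

The stage generator receives its index, so each stage may differ, for example by using different constants per stage.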

Higher-order functions for building looped pipelines (“loop combinator”)
[Figure: mkLoopPipe () builds a Pipe (Put in, Get out) by circulating values through a single loop-body Pipe n times, each tagged with its iteration index j.]
    module mkLoopPipelined #(Integer n,
                             function Module#(PipeF #(Tuple2#(a, UInt#(logn)), a)) mkLoopBody ())
                           (PipeF #(a,a))
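A software analogue of the loop combinator is a single body applied n times, with the iteration index passed along with the value (the role of the Tuple2 in the signature above). The Python sketch below is a functional model only, with a hypothetical body, and says nothing about the multiplexers or control logic a real looped pipeline needs.

```python
def mk_loop_pipe(n, loop_body):
    """One shared body, applied n times; the body sees (value, iteration index)."""
    def pipe(x):
        for j in range(n):
            x = loop_body(x, j)
        return x
    return pipe

def loop_body(x, j):
    # Hypothetical body standing in for one reusable hardware stage.
    return 2 * x + j

looped = mk_loop_pipe(3, loop_body)
result = looped(1)

# The same computation, unrolled as three separate stage instances.
unrolled = loop_body(loop_body(loop_body(1, 0), 1), 2)
print(result, unrolled)
```

The looped and unrolled forms compute the same function; in hardware they differ in area, clock rate, latency and throughput.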

Generating all versions of IFFT
[Figure: the space of architectures from the earlier slide: direct combinational circuit; varying degrees of pipelining; iterate 1 stage thrice; in any stage, use fewer than 16 Bfly4s.]
“PAClib” (Pipeline Architecture Constructor Library) is a library of such higher-order pipeline combinators. Using PAClib, IFFT can be succinctly expressed in a single source code which, depending on the parameters supplied, will elaborate (unfold) into any one of the possible architectures in the space of architectures illustrated. PAClib enables a “pipeline DSL”.
• Which architecture is “best” depends on the requirements
  • Desired latency, throughput
  • Area, power, clock speed
  • Target silicon technology (FPGA, ASIC 90nm, ASIC 65nm, ...)

Another important reason for generativity: it enables rapid experimentation to determine the optimal architecture
• Architectural effects can be quite unpredictable. E.g.,
  • Hypothesis: linear pipe will take more silicon area than looped pipe
  • But the looped pipe has other silicon costs:
    • Needs multiplexers, control logic → area cost
    • Needs higher clock speed for same throughput → area cost, power cost
    • A kicker: disables some constant propagations → area cost, power cost
  • (for ASICs, silicon area directly affects price of chip)
• Bottom line:
  • Need to be able to experiment with different architectures
  • Generativity allows scripting the exploration of the space
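“Scripting the exploration” can be as simple as enumerating parameter settings and running an evaluation on each. The Python sketch below is an assumption-laden illustration: the parameter names are invented, and the evaluator only counts Bfly4 instances implied by the settings (three stages in hardware unless one stage is iterated). A real script would invoke synthesis and read back area, timing and power.

```python
from itertools import product

STYLES = ("combinational", "linear_pipeline", "looped")
BFLY4_PER_STAGE = (16, 8, 4, 2, 1)

def design_space():
    """Every combination of pipeline style and Bfly4s per stage."""
    return [{"style": s, "bfly4_per_stage": b}
            for s, b in product(STYLES, BFLY4_PER_STAGE)]

def explore(evaluate):
    """Run evaluate(config) on every point; return (config, result) pairs."""
    return [(cfg, evaluate(cfg)) for cfg in design_space()]

def bfly4_instances(cfg):
    # Placeholder evaluator: a looped design builds one stage and reuses it
    # three times; the other styles build all three stages.
    stages_in_hw = 1 if cfg["style"] == "looped" else 3
    return stages_in_hw * cfg["bfly4_per_stage"]

results = explore(bfly4_instances)
for cfg, n in results:
    print(cfg["style"], cfg["bfly4_per_stage"], n)
```

The count alone does not decide which design is smaller, for the reasons listed above (multiplexers, control logic, lost constant propagation), which is why measuring each generated design matters.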

I hope that by now you’re saying:
• “Hey! Writing HW programs doesn’t look too hard!”
  • (Has all the creature comforts of a modern high-level programming language.)
• But, so what?
  • Why would I want to compute something directly in HW?
  • Even if I want to, aren’t the costs and logistics of actually putting something in HW just too high a barrier?

Why implement things in HW?
Reason (1): Speed
• Direct implementation in HW typically
  • removes a layer of interpretation, and interpretation generally costs an order of magnitude in speed
  • can exploit more parallelism
• Interpret: instructions (program) for application X run on a fixed machine (e.g., x86, GPGPU, Cell)
• Run: application X runs directly as an X-machine (fine-grain parallel)
• Caveat: lots of devils in the details
  • Interpretation at GHz may still be faster than direct execution at MHz
  • Interpretation with monster memory bandwidth may still be faster than direct execution with anemic memory bandwidth

Why implement things in HW?
Reason (2): Power consumption
• Interpretation on fixed computing architectures costs power
• X-machine: pay energy cost for X-execution
• Interpret: instructions (program) for application X on a fixed machine (e.g., x86, GPGPU, Cell): also pay for fetch, decode, register management, cache management, extra data movement, branch misprediction, ...
• Portable devices: battery life
• Server farms/clouds: cost of power supply, air conditioning

Opportunity with today’s FPGA technology (Field Programmable Gate Arrays)
[Figure: your application software on the host, connected to an FPGA subsystem running your computation on the FPGA.]
• FPGA capacity:
  • millions of gates
• FPGA speeds:
  • 100s of MHz
• FPGA ↔ host communication links:
  • USB
  • 1Gb/10Gb Ethernet
  • PCI Express
  • ... new and exciting:
    • FPGA-in-processor-socket: AMD Hypertransport bus, Intel Front-Side Bus
    • FPGA-on-processor-chip: coming soon
• FPGA board costs:
  • As low as $100s
  • $1K-$10K typical
  • $10K-$100K for multi-FPGA boards
Example of what is possible: a single FPGA can easily run H.264 decoding at VGA resolution (640x480) and, with a good design, at HDTV (1920x1080) resolution

Making FPGA acceleration easy and routine
• Atop today’s FPGA technology, we provide the communication infrastructure:
  • Make it easy for SW to invoke a HW service or vice versa
  • Concurrent, pipelined, ...
  • Model: Concurrent RPCs (Remote Procedure Calls)
  • Auto-generate SW and HW (BSV) stubs from service specs
  • (like using IDL to specify distributed client/server communication)
[Figure: a “Communications Protocol Stack” on each side: SW app and HW app (BSV/RTL) at the top, then services, then SCE-MI, then the Link layer (sockets/PCIe/USB/Ethernet/FSB/Hypertransport). Analogy: RPC, socket, TCP/IP, Ethernet. The software is HW agnostic: FPGA (or Bluesim/Verilog sim).]

Putting it all together: BSV applies GPCE concepts to HW design: generation, parameterization, changeability; reusability; easy exploration of architecture space, ...
[Figure: your application is split into a SW part (e.g., C++) and a HW part (BSV), each with Get/Put/Client/Server interfaces joined by mkConnection connections. The stubs are generated; the SW side goes through gcc and the HW side through BSV synthesis, FPGA synthesis, etc. Each side sits on services, SCE-MI and a Link layer, and is linked/loaded onto the host and the FPGA respectively.]
FPGAs are compelling due to speed, lower power, low cost, fast communication with host

Example: CMU ProtoFlex (http://www.ece.cmu.edu/~protoflex)
[Figure: Virtex5 FPGA (BSV UltraSparc model) connected over Ethernet to Virtutech Simics.]
• Virtutech Simics: commercial SW simulator for whole-systems (OS/devices/apps)
  • (“Virtual Platform” for early SW development, before ASIC is available)
  • Problem: very clever tricks for fast simulation, but steady slowdown
    • for each added thread and core
    • for each added bit of instrumentation
• CMU ProtoFlex:
  • Fully operational model of 16-cpu UltraSPARC III SunFire 3800 Server, running unmodified Solaris 8; running on FPGA at 90 MHz
  • Hybrid simulation: continue to use Simics for modeling rest of system (I/O devices, ...)
  • Benchmark: TPC-C OLTP on Oracle 10g Enterprise Database Server
  • Also SPECINT (bzip2, crafty, gcc, gzip, parser, vortex)
  • Performance: 10-60 MIPS
  • 39x faster than Virtutech Simics alone on same system/benchmark
  • Written in BSV by 1 graduate student (Eric Chung) in 1 year!

Example: Univ. of Glasgow document retrieval experiment
[Figure: a document stream enters the FPGA (match algorithm), which consults SRAM (search terms) and emits a score stream.]
• E.g.,
  • find spam in emails
  • find similar patents
  • find relevant news stories
• Experiments on 3 collections, from ~1M to 1.5M documents each
• Ran same algorithm on
  • 1.6 GHz Itanium-2
  • Virtex-4 FPGA
• Power consumption: 130 Watts (Itanium), 1.25 Watts (FPGA)
• Speedup: ~10x to 20x
  • Itanium slows down as profile (search database) size increases
  • FPGA does not (parallelism)
“FPGA-Accelerated Information Retrieval: High-Efficiency Document Filtering”, W. Vanderbauwhede, L. Azzopardi, and M. Moadeli, in Proc. 19th IEEE Intl. Conf. on Field Programmable Logic and Applications (FPL'09), Prague, Czech Republic, Aug 31-Sep 2, 2009
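The power and speedup figures combine multiplicatively when the question is energy per job, since energy is power times time. The short Python calculation below works this out under the assumption that the quoted wattages are average power during the run and cover comparable parts of each system, which the slide does not state.

```python
def energy_ratio(cpu_watts, fpga_watts, speedup):
    """Energy(CPU) / Energy(FPGA) for the same job, using energy = power * time.

    speedup is time(CPU) / time(FPGA).
    """
    return (cpu_watts / fpga_watts) * speedup

low = energy_ratio(130.0, 1.25, 10)    # at the ~10x end of the speedup range
high = energy_ratio(130.0, 1.25, 20)   # at the ~20x end
print(low, high)
```

Under these assumptions the FPGA uses roughly 1000 to 2000 times less energy per job: a power ratio of 104 multiplied by a speedup of 10 to 20.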

Example: MEMOCODE’08 Design Contest (http://rijndael.ece.vt.edu/memocontest08/)
Goal: Speed up a software reference application running on the PowerPC on the Xilinx XUP reference board using SW/HW codesign
The application, on a large db of records in DRAM:
• decrypt
• sort
• re-encrypt
Time allotted: 4 weeks

Example: MEMOCODE’08 Design Contest Results (BSV) Records had to be repeatedly streamed through a “merge-sort” block. Advantage to those who could rapidly generate a variety of merge-sort architectures and find the best one to “fit” into the FPGA Reference: http://rijndael.ece.vt.edu/memocontest08/everybodywins/

In summary
[Figure: the SW part (e.g., C++) and HW part (BSV) of an application, with Get/Put/Client/Server interfaces joined by mkConnection connections; generated stubs; gcc on the SW side and BSV synthesis, FPGA synthesis, etc. on the HW side; services, SCE-MI and Link layer on each side; link/load onto host and FPGA.]
With languages that use GPCE principles, HW design is now ready for incorporation into your programming toolbox!
Thank you for your kind attention!

Acknowledgements
• James Hoe (MIT/CMU) and Arvind (MIT) for original technology for high-level synthesis from rules to RTL used in BSV today, 1997-2000
• Lennart Augustsson (Chalmers/Sandburst) for Haskell-based generative technology used in BSV today, 2000-2003
• My colleagues in the engineering teams at Sandburst and Bluespec for continuous and substantial improvements, 2000-2009
• Prof. Arvind’s group at MIT for their research and ideas, 2000-2009
