Reconfigurable Architectures

Reconfigurable Architectures
AMANO, Hideharu
hunga@am.ics.keio.ac.jp

Reconfigurable System (Custom Computing Machine)
• A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.
• High performance of special-purpose machines.
• High degree of flexibility of general-purpose machines.
• A completely different execution mechanism from stored-program computers.

PLD (Programmable Logic Device)
• Integrated circuit whose logic function can be defined by users (cf. standard IC, ASIC (Application Specific IC))
• SPLD (Simple PLD) / PLA (Programmable Logic Array)
  • Small-scale IC with an AND-OR array
• CPLD (Complex PLD)
  • Middle-scale IC with an AND-OR array
• FPGA (Field Programmable Gate Array)
  • Large-scale IC with LUTs
Caution! These terms are not well defined!

Rapid development of PLDs
From 1991 to 2000: gate count x45, speed x12, cost 1/100.
[Figure: gate count (10K, 100K, 1M, 10M) versus year (1980, 1990, 2000), with increasing performance. Labels: fuse PLA, EEPROM SPLD, CPLD, anti-fuse FPGA, SRAM FPGA; hierarchical structure, embedded core, low voltage.]

SPLD (Simple PLD: AND-OR / product-term)
[Figure: NOT, AND, and OR stages.]
Arbitrary logic is realized by changing the AND-OR connections.

AND/OR connection example
[Figure: inputs A, B, C, D pass through the NOT, AND, and OR stages. The AND stage forms A & B and C & D; the OR stage outputs A&B | C&D.]
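The A&B | C&D example can be modeled as a small sum-of-products evaluator. This is only a sketch: the dictionary encoding of the connections and the names `and_or_array` and `TERMS` are assumptions made for illustration, not anything defined on the slides.

```python
def and_or_array(product_terms, inputs):
    """Evaluate a sum of products.

    product_terms: list of dicts. Each dict is one AND gate and maps an
    input name to the level it requires (True = plain input,
    False = input taken through the NOT stage).
    inputs: dict mapping input names to booleans.
    """
    return any(
        all(inputs[name] == level for name, level in term.items())
        for term in product_terms
    )


# Hypothetical connection pattern for Z = A&B | C&D.
TERMS = [{"A": True, "B": True}, {"C": True, "D": True}]

if __name__ == "__main__":
    for bits in range(16):
        a, b, c, d = [bool((bits >> s) & 1) for s in (3, 2, 1, 0)]
        z = and_or_array(TERMS, {"A": a, "B": b, "C": c, "D": d})
        print(int(a), int(b), int(c), int(d), "->", int(z))
```

Changing the entries of `TERMS` plays the role of reprogramming the AND-OR connections.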

LUT: Look Up Table
A B C | Z
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 0
1 1 0 | 0
1 1 1 | 1
[Figure: a ROM/RAM whose address inputs are A, B, C and whose data output is Z.]
A simple ROM/RAM can be used as random logic. A combination of memory and multiplexers is commonly used.

An example using a LUT (Look Up Table)
[Figure: the inputs A, B, C = 110 address the table and the stored bit Z = 1 is read out. Stored Z column for addresses 000 to 111: 0 0 0 0 0 0 1 1.]
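A LUT lookup amounts to indexing a small memory with the input bits. The sketch below assumes that A is the most significant address bit and uses the hypothetical helpers `make_lut` and `lut_read`. The stored function (Z = A AND B) is chosen so that the table is 1 only at addresses 110 and 111, as in the example.

```python
def make_lut(func, n_inputs):
    """Fill a 2**n_inputs entry table by evaluating func on every address."""
    table = []
    for addr in range(2 ** n_inputs):
        # Split the address into bits, most significant bit first.
        bits = [(addr >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        table.append(func(*bits))
    return table


def lut_read(table, *bits):
    """Read the stored bit; the input bits form the memory address."""
    addr = 0
    for bit in bits:
        addr = (addr << 1) | bit
    return table[addr]


if __name__ == "__main__":
    table = make_lut(lambda a, b, c: a & b, 3)
    print(table)
    print(lut_read(table, 1, 1, 0))
```

Any other 3-input function is obtained by storing a different 8-bit pattern, which is why a LUT can realize any logic of its input width.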

AND-OR array vs. LUT
• AND-OR array (product-term)
  • Efficient for logic with multiple outputs
  • There is a type of logic which cannot be realized.
  • Suitable for EEPROM and Flash-ROM
• LUT
  • Any logic can be realized.
  • Efficient for logic with a single output
  • Suitable for Flash-ROM, anti-fuse, and SRAM.

Sequential circuits
[Figure: an AND-OR array or LUT with inputs and outputs. Each output module holds a D flip-flop fed from the AND/OR array, and the Q outputs are fed back to the array.]
A sequential circuit (state machine) can be built by attaching flip-flops and feedback loops.
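The flip-flop and feedback arrangement can be sketched as a next-state table that is addressed by the current state together with the input. The 2-bit counter with an enable input is an assumed example, and `NEXT_STATE_LUT` and `run_state_machine` are hypothetical names.

```python
# Next-state table for a 2-bit counter with an enable input.
# Address = (state << 1) | enable; stored data = next state.
NEXT_STATE_LUT = [((addr >> 1) + (addr & 1)) % 4 for addr in range(8)]


def run_state_machine(lut, inputs, state=0):
    """Clock the circuit once per input and record the flip-flop contents."""
    trace = []
    for enable in inputs:
        # Feedback loop: the current state is part of the LUT address.
        state = lut[(state << 1) | enable]
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(run_state_machine(NEXT_STATE_LUT, [1, 1, 0, 1, 1, 1]))
```

The combinational part stays a plain table; only the `state` variable, which stands in for the flip-flops, carries information from one clock to the next.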

CPLD (Complex PLD)
A matrix of SPLDs connected by programmable switches, arranged as a 2-dimensional array. Example: Altera's MAX.
[Figure: SPLD blocks placed around programmable switches.]

FPGA (Field Programmable Gate Array)
[Figure: island-style array of Configurable Logic Blocks (LUT and F.F.), connection blocks, and switch blocks, surrounded by IOBs.]
The LUT contents and the interconnection are determined by the configuration data.

Device for flexibility (1)
• Anti-fuse type
  • Programmed by destroying the isolation with a high voltage
  • High speed but one-time programmable
  • Actel, QuickLogic
• EEPROM / Flash-ROM
  • Switches for connections are realized by floating gates.
  • Re-programmable
  • Lattice, Altera's MAX series

Device for flexibility (2)
• SRAM
  • Data in SRAM represents the look-up tables and wire connections.
  • ISP (In System Programming) is available.
  • The configuration data is lost when the power is turned off.
  • Suitable for large-scale FPGAs. Has advanced rapidly in recent years.
  • Xilinx XC, Altera FLEX, Lucent ORCA
  • The advanced series: Xilinx Virtex, Altera Stratix
• Others
  • Magnetic memory
  • DRAM

Architectures and devices
• Anti-fuse: high speed, middle size, one-time programmable. Actel, QuickLogic.
• EEPROM / Flash-ROM (SPLD, CPLD): high speed, small/middle size, re-programmable, predictable delay. Lattice, Altera, Xilinx.
• SRAM (FPGA): large scale, rapid development. Xilinx, Altera.

Recent PLDs
• High-end: a large-scale chip with hierarchical structure
  • Xilinx's Virtex II, Virtex-4/LX, Altera's Stratix-3
• System on Programmable Device
  • Provides DLL, CPU, DSP, ROM, RAM, multipliers, high-speed links, and other hard IPs.
  • Xilinx's Virtex-4/EX, FX, Altera's Stratix-3
• Specialized for mass production
  • Low cost: Xilinx's Spartan, Altera's Cyclone
• Low voltage, multiple voltages, and low power consumption

Process and parameters (Xilinx Co.)

Xilinx Virtex II
[Figure: each slice contains LUTs, carry logic, MUXes, and D flip-flops; slice x 2 → CLB (Configurable Logic Block). The chip combines configurable logic (100000 CLBs), multipliers, RAM (3 Mbit), DCMs, a global clock MUX, and programmable IOs (IOBs).]

Altera Stratix II
[Figure: LABs (Logic Array Blocks, each consisting of 10 LEs with a 4-input LUT and F.F.), DSP blocks, M512 RAM blocks, M4K RAM blocks, Mega RAM blocks, and PLLs, connected by a hierarchical interconnect.]

SoPD (System on Programmable Device)
Various kinds of cores are embedded on an FPGA.
[Figure: Xilinx Virtex-II Pro with PowerPC cores, Rocket I/O multi-gigabit transceivers, DCMs, multipliers, block RAM, and CLBs.]

FPGA vs. ASIC [Kuon:FPGA2006]
• Pure FPGA without hard macros
  • Area: 40X
  • Speed: 1/3.2X
  • Power: 12X
• FPGA with hard macros
  • Area: 21X
  • Speed: 1/2.1X
  • Power: 9X

Technologies vs. Product
[Figure: product families plotted against process technology (90nm, 65nm, 60nm, 45nm, 40nm, 28nm).]
• High-end
  • Xilinx: Virtex-4 LX/FX/SX (200000 LC), Virtex-5 LX/LXT/SXT/FXT/TXT (330000 LC), Virtex-6 LXT/SXT/HXT/CXT (760000 LC), Virtex-7 T/XT/HT (2000000 LC)
  • Altera: Stratix-II/GX (179400 LE), Stratix-III/L/E (338000 LE), Stratix-IV/E/GX/GT (531200 LE), Stratix-V/E/GX/GS/GT (359200 ALM)
  • Growth of x1.5 to x2.5 per generation
• Middle range
  • Xilinx: Kintex-7 (480000 LC)
  • Altera: Arria, Arria-II, Arria-IV (174000 LE)
• Low-cost
  • Xilinx: Spartan-3A N/DSP (53000 LC), Spartan-6 LX/LXT (150000 LC), Artix-7 (360000 LC)
  • Altera: Cyclone II (68416 LE), Cyclone III/LS (119088 LE), Cyclone IV/E/GX (149760 LE), Cyclone V/E/GX/GS/GT (301000 LE)
• High-end versus low-cost: x3 to x5

Slice structure of Virtex-6 (from the Virtex-6 manual)
[Figure: four LUTs with 6-bit inputs, each usable as one 6-input LUT or two 5-input LUTs, followed by MUXes, carry logic, and flip-flops (eight FFs in total).]

Virtex-6 CLBs (from the Virtex-6 manual)
[Figure: a 2 x 2 array of CLBs, each containing two slices labeled by X/Y coordinates (Slice X0Y0 to Slice X3Y1); every slice has a carry input (CIN) and a carry output (COUT).]

Stratix-IV ALM structure (from the Stratix-IV manual)
[Figure: two halves, each with 4-bit data inputs, a MUX, a 6-input LUT, an adder (adder1, adder2), a MUX, and an FF, with carry, reg_carry, and shared_arith signals.]
LUT configurations:
• 4-in LUT x 2
• 5-in LUT + 3-in LUT
• 5-in LUT + 4-in LUT (1 input shared)
• 5-in LUT + 5-in LUT (2 inputs shared)
• 6-in LUT
• 6-in LUT + 6-in LUT (4 inputs shared)

Stratix-IV LAB structure (from the Stratix-IV manual)
[Figure: a LAB and an MLAB, each a group of ALMs with a local interconnect.]

Low-power FPGAs
• Actel: ProASIC 3/E → IGLOO
  • Flash ROM
  • IGLOO: with ARM core
  • Flash freeze: low-power stand-by mode (2 μW)
• Silicon Blue ICE65 series
  • Embedded Flash memory (NVCM)
  • 5 mA (1792 cells, 32 MHz)
  • 9 mA (3520 cells, 32 MHz)
• Altera Arria, Arria-II
  • Low-power mid-range
  • 8-input LUT

Spartan-3 Power Consumption [tuan2006]
• Dynamic power: about 200 mW (3S1000)
• Static power: about 60 mW (3S1000)

Power Gating for Spartan-3
[Figure: a tile of the FPGA core containing a CLB, an interconnect switch matrix, and config SRAMs, with a power gate and a virtual ground.]

Partial Reconfiguration
• A part of the configuration on an FPGA can be replaced during operation.
• Efficient use of FPGA area
• Virtex-II Pro → column by column
• Virtex-4 → rectangular regions
• Virtex-6 → a Dynamic Reconfiguration Port (DRP) is provided.
• Partial reconfiguration is controlled by the logic in the FPGA.
• Operating systems for FPGAs have been developed.

Partial Reconfiguration
[Figure: PRMs placed in reconfigurable frames, with the clock region boundary marked.]
A PRM (Partial Reconfigurable Module) is placed on a fixed frame. The problem is the interconnection between the partial reconfiguration module and the static part.

2.5D FPGA (http://www.xilix.com)

[Photos of devices: QuickLogic, Lattice GAL, Altera FLEX10K, Xilinx Virtex.]

Design of PLDs
• Mostly designed with common HDLs (Verilog-HDL, VHDL).
• C-level entry has come into use recently: Impulse-C, Vivado (Xilinx), OpenCL (Altera).
• Synthesis, optimization, and place and route are done automatically by vendors' tools.
• Tools from various vendors are now often integrated and combined.
• For a large circuit, a long time is required, especially for place and route.
• Use of IPs and clock/DLL adjustment are done manually.
• Optimization techniques differ between vendors and products.


High Performance and Flexibility
[Figure: performance versus flexibility, placing ASICs, reconfigurable systems on FPGAs (Design A to Design D), and a CPU running software (for i=0; i<K; i++ X[i]=X[i+j] ...).]

How is the performance enhanced?
• Performance enhancement by hardware execution itself, which removes:
  • The overhead of software execution (instruction fetch, data loads to registers, etc.)
  • The overhead of using fixed-size data
  • The overhead of using only two-way branches
However, these benefits are not so large, because embedded CPUs and DSPs are highly optimized.
The key to performance improvement is parallel processing.

Parallel processing in reconfigurable systems
• Various techniques can be used
  • SIMD execution
  • Pipelined structure
  • Systolic algorithm
  • Data driven control
• Parallel execution other than calculation
  • Parallel data access using internal memory units
  • Parallel data transfer including I/O accesses

SIMD (Single Instruction-stream / Multiple Data-stream)-like calculation
[Figure: stream data in → parallel processing parts, each with an internal memory module → stream data out.]
The same instruction is applied to different data streams. In reconfigurable systems, the operations are not required to be the same (hence SIMD-like calculation).
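The difference between SIMD and SIMD-like execution can be sketched with lanes that advance in lockstep. The function `run_lanes` and the per-lane operations are hypothetical, chosen only for illustration.

```python
def run_lanes(ops, streams):
    """Advance all lanes in lockstep; lane k applies ops[k] to streams[k]."""
    outputs = [[] for _ in ops]
    for step in zip(*streams):  # one element per lane per step
        for lane, (op, value) in enumerate(zip(ops, step)):
            outputs[lane].append(op(value))
    return outputs


if __name__ == "__main__":
    streams = [[1, 2], [3, 4], [5, 6]]
    # SIMD: every lane applies the same operation.
    print(run_lanes([lambda v: v * 2] * 3, streams))
    # SIMD-like: each lane may be configured with its own operation.
    print(run_lanes([lambda v: v + 1, lambda v: v * 2, lambda v: v * v], streams))
```

In hardware each lane would be a separate processing part, so the inner loop over lanes runs in parallel rather than sequentially.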

Pipelined structure
[Figure: stream data 1 to 5 pass one after another through a chain of processing parts with internal memory modules.]
The stream is divided and inserted periodically.
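A cycle-by-cycle sketch of a pipelined structure is given below. It assumes one register after each stage and uses `None` as an empty slot; `run_pipeline` and the three stage functions are hypothetical.

```python
def run_pipeline(stages, stream):
    """Simulate a pipeline with one register after each stage.

    Returns the results and the number of clock cycles used.
    None marks an empty pipeline slot, so data items must not be None.
    """
    regs = [None] * len(stages)
    results = []
    cycles = 0
    # Trailing empty slots flush the last items out of the pipeline.
    for item in list(stream) + [None] * len(stages):
        if regs[-1] is not None:
            results.append(regs[-1])
        # Update from the last stage backwards so that each stage reads
        # its predecessor's value from the previous cycle.
        for k in range(len(stages) - 1, 0, -1):
            regs[k] = stages[k](regs[k - 1]) if regs[k - 1] is not None else None
        regs[0] = stages[0](item) if item is not None else None
        cycles += 1
    return results, cycles


if __name__ == "__main__":
    stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    print(run_pipeline(stages, [1, 2, 3, 4, 5]))
```

With N items and S stages the run takes N + S cycles in this model, so a long stream approaches one result per cycle.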

Systolic Algorithm
[Figure: data streams x and y enter a computational array.]
Data streams x and y are inserted at a certain interval. When two streams meet each other, a calculation is executed.
→ Systolic: the beat of the heart

Band matrix multiply y = Ax

$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
$$

Each cell (X+) takes a, x, and an incoming partial sum $y_i$, and outputs $y_o = a x + y_i$.

[Figure sequence: the elements of A are fed into the array of X+ cells while x1, x2, x3 move through it.]
Partial results in successive steps:
• y1 = a11 x1
• y1 = a11 x1 + a12 x2, y2 = a21 x1
• y2 = a21 x1 + a22 x2
• y2 = a21 x1 + a22 x2 + a23 x3, y3 = a32 x2
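The band matrix multiply can be simulated with a three-cell linear array in which every cell performs the multiply-add y_o = a x + y_i. The flow directions and timing are assumptions that follow one common arrangement and are not taken from the slides: x moves toward higher cell numbers, y moves toward lower cell numbers, and a new element of each stream is injected every second cycle. Indices are 0-based, the function name is hypothetical, and the lookup `A[i][j]` stands in for the matrix element that would be fed into the cell from outside.

```python
def systolic_band_matvec(A, x, cells=3):
    """Compute y = A x for a tridiagonal A on a simulated systolic array."""
    n = len(x)
    x_reg = [None] * cells  # (column index, value) held in each cell
    y_reg = [None] * cells  # (row index, partial sum) held in each cell
    result = [0] * n
    for t in range(2 * n + cells):
        # A y value leaving cell 0 has collected all of its terms.
        if y_reg[0] is not None:
            row, acc = y_reg[0]
            result[row] = acc
        # Inject x into cell 0 and an empty y into the last cell
        # on every second cycle.
        k = t // 2
        inject = t % 2 == 0 and k < n
        x_reg = [(k, x[k]) if inject else None] + x_reg[:-1]
        y_reg = y_reg[1:] + [(k, 0) if inject else None]
        # Wherever an x and a y meet, the cell does y_o = a * x + y_i.
        for c in range(cells):
            if x_reg[c] is not None and y_reg[c] is not None:
                j, xv = x_reg[c]
                i, acc = y_reg[c]
                y_reg[c] = (i, acc + A[i][j] * xv)
    return result


if __name__ == "__main__":
    A = [[1, 2, 0, 0],
         [3, 4, 5, 0],
         [0, 6, 7, 8],
         [0, 0, 9, 10]]
    print(systolic_band_matvec(A, [1, 2, 3, 4]))
```

With this timing, row i meets columns i-1, i, and i+1 in three consecutive cells, so only elements inside the band are ever used.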

Data flow algorithm
[Figure: data flow graph for (a + b) x (c + (d x e)). Inputs a, b, c, d, e feed two adders and two multipliers.]
A process is activated by the availability of tokens (data).
The overhead of synchronization is large.
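Token-driven activation for (a + b) x (c + (d x e)) can be sketched as follows. The node names and the `run_dataflow` function are hypothetical. For simplicity the sketch never consumes a token once it has been produced, which a real data flow machine would do.

```python
import operator


def run_dataflow(nodes, tokens):
    """Fire every node whose two operand tokens are available.

    nodes: dict mapping a node name to (operation, left input, right input).
    tokens: dict mapping input names to values.
    Returns all tokens and the groups of nodes that fired together.
    """
    tokens = dict(tokens)
    pending = dict(nodes)
    rounds = []
    while pending:
        ready = [name for name, (_, left, right) in pending.items()
                 if left in tokens and right in tokens]
        if not ready:
            raise ValueError("no node can fire: tokens are missing")
        for name in ready:
            op, left, right = pending.pop(name)
            tokens[name] = op(tokens[left], tokens[right])
        rounds.append(ready)
    return tokens, rounds


# Graph for (a + b) x (c + (d x e)).
GRAPH = {
    "sum_ab": (operator.add, "a", "b"),
    "prod_de": (operator.mul, "d", "e"),
    "sum_c": (operator.add, "c", "prod_de"),
    "result": (operator.mul, "sum_ab", "sum_c"),
}

if __name__ == "__main__":
    tokens, rounds = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
    print(tokens["result"], rounds)
```

Nodes listed in the same round have no dependency on each other and could fire in parallel; the readiness check in each round is the synchronization whose overhead the slide mentions.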
