
Field-Programmable Technology: Today’s and Tomorrow’s


  1. Field-Programmable Technology: Today’s and Tomorrow’s. Wayne Luk, Imperial College London. TWEPP-07, Prague, 4 September 2007

  2. Outline: technology = devices + design
  1. overview: motivation and vision
  2. field-programmable devices: today - Xilinx Virtex-4, Virtex-5; Stretch S5
  3. field-programmable design: today - enhance optimality and re-use
  4. field-programmable devices: tomorrow - hybrid FPGA, die stacking
  5. field-programmable design: tomorrow - guided synthesis, representation, upgradability
  6. summary
  Thanks to colleagues, students and collaborators from Imperial College, University of British Columbia, Chinese University of Hong Kong, University of Massachusetts Amherst, UK Engineering and Physical Sciences Research Council, Stretch, Xilinx

  3. 1. Motivation: good - Moore’s Law
  • rising: capacity, speed
  • falling: power/MHz, price
  Source: Xilinx

  4. Not so good: design productivity gap
  [Chart: logic transistors per chip grow at a 58%/yr compound rate (Moore’s Law), while design productivity in transistors per staff-month grows at only 21%/yr, over 1981-2009.]
  Source: SEMATECH
  Challenge: reduce the design productivity gap

  5. Our vision: unified synthesis + analysis
  • application description: Progol, Java
  • manual/automatic: partition, compile, analysis, validate
  • system-specific programming interface
  • system-specific architecture: FPGAs + sensors, CPUs + DSPs

  6. 2a. Devices today: Virtex-4 FPGA
  • 450MHz PowerPC™ processors with Auxiliary Processor Unit
  • 500MHz multi-port distributed RAM
  • 500MHz DCM digital clock management
  • 500MHz flexible soft logic architecture
  • 0.6-11.1Gbps serial transceivers
  • 1Gbps differential I/O
  • 500MHz programmable DSP execution units
  Source: Xilinx
  • fine-grain fabric + special function units
  • challenge: use resources effectively

  7. 2b. Virtex-5 FPGA
  • capacity rises
  • logic speed rises
  • I/O speed rises
  • all good news?
  Source: Xilinx

  8. Growing gap: amount of gates vs I/O
  • I/O off-chip: serial
  • I/O on-chip: scalable, flexible, easy to use
  • interconnect: heterogeneous, customisable
  Source: Xilinx

  9. 2c. Software configurable engine: S5
  [Block diagram: S5 engine with 32KB I-cache, 32KB D-cache, 256KB SRAM, 32KB data RAM, MMU, 32-bit register file, 128-bit wide register file, control, ALUs, FPUs and the ISEF (Instruction-Set Extension Fabric).]
  • RISC processor: Tensilica Xtensa V; 32KB I & D cache; on-chip memory, MMU; 24 channels of DMA; FPU
  • Wide Register File (WRF): 32 wide registers (WR), 128-bit wide
  • load/store unit: 128-bit load/store; auto increment/decrement; immediate, indirect, circular; variable-byte and variable-bit load/store
  • ISEF: instruction specialization fabric; compute intensive; arbitrary bit-width operations; 3 inputs and 2 outputs; pipelined, bypassed, interlocked; random logic support; internal state registers
  Source: Stretch
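The ISEF’s 3-input, 2-output wide operations can be mimicked in software. A minimal C sketch, treating a 128-bit wide register as four 32-bit lanes; the operation chosen (lane-wise multiply-accumulate and XOR) and all names are illustrative, not a real S5 instruction:

```c
#include <stdint.h>

/* A 128-bit wide register modeled as 4 x 32-bit lanes. */
typedef struct { uint32_t lane[4]; } wr128;

/* ISEF-style custom instruction sketch: 3 wide inputs, 2 wide
 * outputs, all lanes computed in parallel in hardware.
 * out0 = lane-wise a*b + c (MAC); out1 = lane-wise a^b^c. */
void isef_mac_xor(wr128 a, wr128 b, wr128 c, wr128 *out0, wr128 *out1)
{
    for (int i = 0; i < 4; i++) {
        out0->lane[i] = a.lane[i] * b.lane[i] + c.lane[i]; /* MAC  */
        out1->lane[i] = a.lane[i] ^ b.lane[i] ^ c.lane[i]; /* XOR  */
    }
}
```

In the real flow the body of such a function would be compiled into the ISEF by the Stretch tools rather than executed lane by lane.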

  10. 3. Design today: overview
  • structural or register-transfer level (RTL): e.g. VHDL, Verilog; low-level, little automation, small designs
  • behavioural, system-level descriptions: e.g. SystemC (public-domain: systemc.org); MARTES: +UML for real-time embedded systems
  • general-purpose software languages: e.g. C, Java; with hardware support: Handel-C; high-level, large automation, large designs
  • special-purpose descriptions: e.g. System Generator (signal processing); high-level, domain-specific optimisations

  11. 3a. Enhance optimality and re-use
  • design optimality: quality
  - select algorithm and devices: meet requirements
  - mapping: regular parts to systolic arrays, the rest to processors
  - I/O: dictates on-chip parallelism, buffering schemes
  - control speed/area/power: pipelining, layout plan
  - partitioning: coarse vs fine grain logic and memory
  • design re-use: productivity
  - separate aspects specific to application/technology
  - library of customisable components with trade-offs
  - compose and customise to meet requirements
  - uniform interface to memory and I/O: hide details
  - pre-verified parts: ease system verification

  12. Example: systolic summation tree
  • n-input adder: tree of (n-1) 2-input adders
  • each adder has k stages; each stage has an s-bit adder
  • figure shows k = 3, s = 3
  • high k, low s: less cycle time, more area, less power
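The tree structure above can be sketched in C. This is a software model only (in hardware each tree level would be one pipeline stage of s-bit adds); the function name is mine:

```c
#include <stddef.h>

/* Sum n values with a balanced tree of 2-input adds, mirroring
 * the (n-1)-adder systolic summation tree: the two halves are
 * independent, as they would be in parallel hardware. */
long tree_sum(const int *x, size_t n)
{
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return tree_sum(x, half) + tree_sum(x + half, n - half);
}
```

A tree of depth ceil(log2 n) replaces a chain of n-1 sequential adds, which is where the cycle-time advantage comes from.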

  13. Finance application: value-at-risk
  • sampling from multivariate Gaussian distribution
  • DSP units: matrix multiplication
  • systolic tree: accumulate result
  • RC2000 board, Virtex-4 xc4vsx55 at 400MHz
  • 33 times faster than a 2.2GHz quad Opteron (including all I/O overheads and PC-FPGA communications, and using AMD-optimised BLAS for the software)
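The sampling step above can be sketched in C: multiplying a lower-triangular Cholesky factor L of the covariance matrix by a vector z of independent standard normals gives correlated samples y = Lz. The matrix multiply is what the slide maps to DSP units and the per-row accumulation to the systolic tree; the names and the example factor below are illustrative, not the RC2000 design:

```c
#include <stddef.h>

/* y = L z for a lower-triangular n x n matrix L (row-major),
 * turning independent normals z into correlated samples y. */
void correlate(const double *L, const double *z, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j <= i; j++)   /* L is lower triangular */
            acc += L[i * n + j] * z[j];   /* accumulate (tree in HW) */
        y[i] = acc;
    }
}
```

For example, L = [[2,0],[1,3]] is the Cholesky factor of the covariance [[4,2],[2,10]].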

  14. 3b. Stretch program development
  • profile code: identify hotspots
  • special instruction: implement ‘C’ functions in single instructions; bit-width optimisation
  • software compiler: generate instruction, schedule instruction
  • multiple data (WR): perform operations in parallel
  • efficient data movement: intrinsic load/store operations, 20+ DMA channels
  [Flow: application C/C++ plus new instructions go through the Instruction Definition Compiler, which tailors the ISEF to the application; instruction generation is automatic, producing compiled machine code.]
  Source: Stretch

  15. 4. Devices tomorrow: more diverse
  • same fabric as today’s Virtex-4: 450MHz PowerPC™ processors with Auxiliary Processor Unit, 500MHz multi-port distributed RAM, 500MHz DCM digital clock management, 500MHz flexible soft logic architecture, 0.6-11.1Gbps serial transceivers, 1Gbps differential I/O
  • 500MHz programmable DSP execution units: replaced by other functional units, e.g. floating-point units
  Source: Xilinx

  16. 4a. Hybrid FPGA: architecture
  • most digital circuits: datapath (regular, word-based logic) + control (irregular, bit-based logic)
  • hybrid FPGA: customised coarse-grained block for domain-specific requirements; fine-grained blocks from the existing FPGA architecture
  • good match to computing applications for a given domain

  17. Coarse-grained fabric library D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks

  18. Evaluation
  • 6 benchmark circuits: digital signal processing kernels (e.g. bfly, for FFT); linear algebra (e.g. matrix multiplication); complete application (e.g. bgm, a financial model)
  • circuits partitioned into control + datapath: control mapped by vendor tools to fine-grained units; datapath manually mapped to coarse-grained units
  • comparison: directly synthesized to Xilinx Virtex-II devices

  19. Results

  20. 4b. On-chip memory bandwidth
  • storage hierarchy: registers, LUT RAM, block RAM
  • processor cache: addresses the lack of off-chip I/O bandwidth
  Source: Xilinx

  21. High density die stacking Source: Xilinx

  22. Programmable circuit board
  • die stacking: 3D inter-chip connections
  • customisable system-in-package: a fix for the productivity gap?
  Source: Xilinx

  23. 5a. Design tomorrow: guided synthesis
  • guided transformation of design descriptions: automate tedious and error-prone steps; applicable to various levels of abstraction
  • focus: two timing models
  - strict timing model: cycle-accurate - efficiency
  - flexible timing model: behavioural - productivity
  • combine cycle-accurate and behavioural models: rapid development with high quality; design maintainability and portability
  • based on a high-level language: library developers provide optimised building blocks; application developers customise building blocks

  24. Timing models: strict vs flexible 2 a c MUX MUX b * b * << - 0 > true false c c 2 0 = == true false c c 0 num_sol = = 1 num_sol num_sol scheduling unscheduling b a c 2 * << * - executed at cycle1 delta strict timing(Handel-C) > flexible timing 0 == 0 2 1 executed at cycle 2 0 { delta = b*b - ((a*c) << 2); if (delta > 0) num_sol = 2; else if (delta == 0) num_sol = 1; else num_sol = 0; } num_sol behavioural model - partial-ordering - abstract operations cycle accurate model - total-ordering - resource-bound

  25. Rapid design: automated scheduling
  [Figure: the dataflow graph annotated with pipeline stages: the pipelined multipliers pmult[0] and pmult[1] occupy stages 1-7, the shift stage 8, the subtract stage 9, and the compare/select of num_sol stage 10.]
  Scheduled version (strict timing), obtained from the flexible-timing source by constraints + synthesis; unscheduling recovers the flexible-timing source:
  par {
    // ================== [stage 1]
    pipe_mult[0].in(b,b);
    pipe_mult[1].in(a,c);
    // ================== [stage 8]
    tmp0 = pipe_mult[0].q;
    tmp1 = pipe_mult[1].q << 2;
    // ================== [stage 9]
    tmp2 = tmp0 - tmp1;
    // ================== [stage 10]
    if (tmp2 > 0) num_sol = 2;
    else if (tmp2 == 0) num_sol = 1;
    else num_sol = 0;
    delta = tmp2;
  }
  • supports combination of manual and automatic scheduling
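The effect of such a schedule can be checked with a cycle-level software model. A sketch assuming the 7-cycle multiplier latency implied by stages 1-7 on the slide; the delay-line structure and all names are mine:

```c
#define MULT_LATENCY 7

/* A pipelined multiplier modeled as a delay line: a product
 * pushed in now emerges MULT_LATENCY cycles later. */
typedef struct { int q[MULT_LATENCY]; } pipe_mult;

int pipe_mult_step(pipe_mult *p, int x, int y)
{
    int out = p->q[MULT_LATENCY - 1];
    for (int i = MULT_LATENCY - 1; i > 0; i--)
        p->q[i] = p->q[i - 1];
    p->q[0] = x * y;
    return out;
}

/* Run the scheduled pipeline for one (a, b, c): multiplies at
 * stage 1, results at stage 8, subtract at 9, select at 10. */
int scheduled_num_sol(int a, int b, int c, int *delta_out)
{
    pipe_mult m0 = {{0}}, m1 = {{0}};
    int t0 = 0, t1 = 0;
    for (int cycle = 0; cycle <= MULT_LATENCY; cycle++) {
        t0 = pipe_mult_step(&m0, b, b);   /* b*b after the pipe  */
        t1 = pipe_mult_step(&m1, a, c);   /* a*c after the pipe  */
    }
    int tmp1 = t1 << 2;                   /* stage 8             */
    int tmp2 = t0 - tmp1;                 /* stage 9             */
    *delta_out = tmp2;
    if (tmp2 > 0) return 2;               /* stage 10            */
    else if (tmp2 == 0) return 1;
    else return 0;
}
```

The model produces the same delta and num_sol as the unscheduled behavioural code, which is exactly what scheduling must preserve.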

  26. Maintainability: retarget design
  [Figure: the same flexible-timing source retargeted to a single multiplier pmult[0] with stages 1-4: b*b and a*c are now issued on alternate cycles (cycle 1 and cycle 2), with the shift (stage 5), subtract (stage 6) and compare/select rescheduled accordingly.]
  Retargeted schedule (strict timing), from the unchanged flexible-timing source plus new constraints:
  par {
    { // ================== [stage 1]
      pipe_mult[0].in(b,b);
      pipe_mult[0].in(a,c); }
    ....
  }

  27. Implementation

  28. Automatically generated results
  [Table: results reported with respect to software and with respect to the smallest design, per benchmark.]
  • ffd: free-form deformation; dct: discrete cosine transform

  29. 5b. Data representation optimisation
  • In1..In5: known width
  • Out1..Out2: width determines accuracy, defined by user
  • find a representation minimising the width of internal nodes, e.g. X, Y
  • trade-off in speed, area, power, error

  30. Fixed-point vs floating-point
  • fixed-point - range: integer part; precision: fraction part
  • floating-point - range: exponent; precision: mantissa
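The fixed-point half of this split can be sketched in C: the integer bits set the range, the fraction bits set the precision, and narrowing either is the width/accuracy trade-off being optimised. The Q23.8 format chosen here is illustrative:

```c
#include <stdint.h>

#define FRAC_BITS 8            /* fraction bits: precision */

typedef int32_t fixed_t;       /* signed Q23.8: 23 integer bits */

fixed_t to_fixed(double x)   { return (fixed_t)(x * (1 << FRAC_BITS)); }
double  to_double(fixed_t f) { return (double)f / (1 << FRAC_BITS); }

/* Multiply: the raw product carries 2*FRAC_BITS fraction bits,
 * so shift back down; widen to 64 bits to avoid overflow. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FRAC_BITS);
}
```

Values that do not fit in 8 fraction bits are truncated, which is the quantisation error the slide trades against area and speed.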

  31. FIR filter and DFT: area vs error
  • 1% more error: 65% less area

  32. FIR filter and DFT: speed vs error
  • 2% more error: 20% higher speed

  33. 5c. Upgradable design
  [Chart: number of new users vs time in months; an upgradable design regains users at each upgrade after the initial release, while a non-upgradable design tails off.]
  • upgradability: minimise time-to-market
  • maximise time-in-market
  • add new functions, fix bugs
  • very rapid upgrade?
  Source: Xilinx

  34. Dynamic upgrade: turbo coder
  • error correction code: add redundancy
  • need: fast, low-power, adapt to noise level
  • recursive systematic convolutional (RSC) encoder, decoder, interleaver
  [Diagram: turbo encoder (RSC encoders plus interleaver) feeding a noisy communication channel; turbo decoder with two decoders, interleavers, a de-interleaver and a decision stage.]
  Source: Liang, Tessier, Goeckel
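A minimal software model of an RSC encoder, the building block named above. The memory-2 polynomials used here (feedback 7, feed-forward 5, in octal) are a common textbook choice, not necessarily those of the slide’s turbo coder:

```c
#include <stddef.h>

/* Rate-1/2 recursive systematic convolutional encoder, memory 2.
 * The systematic output is the input itself; this produces the
 * parity stream. Feedback taps (111 = 7 octal) make it recursive,
 * feed-forward taps (101 = 5 octal) form the parity. */
void rsc_encode(const int *u, int *parity, size_t n)
{
    int r1 = 0, r2 = 0;                  /* shift-register state   */
    for (size_t i = 0; i < n; i++) {
        int a = u[i] ^ r1 ^ r2;          /* recursive feedback     */
        parity[i] = a ^ r2;              /* feed-forward combine   */
        r2 = r1;                         /* shift the registers    */
        r1 = a;
    }
}
```

The recursion is what gives the impulse response an infinite tail, which (with the interleaver) is what makes turbo codes work.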

  35. Self-tuning: run-time reconfiguration
  [Diagram: controller samples the channel, looks up Nmax from the SNR in a LUT, and configures the FPGA when Nmax changes.]
  • adapt: less channel noise, so lower power
  • larger Nmax: better correction, more area/power
  • sample channel noise every 250K bits
  • find signal-to-noise ratio (SNR), select Nmax
  • if Nmax ≠ current Nmax, configure new bitstream
  • configuration overhead: about 30ms, 54.8mW
  Source: Liang, Tessier, Goeckel
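The LUT-based selection and the reconfigure-only-on-change rule can be sketched in C. The SNR thresholds and Nmax values below are invented for illustration; the slide does not give the actual table:

```c
/* Hypothetical lookup table: noisier channel (lower SNR)
 * selects a larger decoder iteration bound Nmax. */
typedef struct { double snr_db; int nmax; } lut_entry;

static const lut_entry lut[] = {
    { 0.5, 8 },   /* very noisy: most iterations, most area/power */
    { 1.5, 4 },
    { 2.5, 2 },
};

int select_nmax(double snr_db)
{
    for (int i = 0; i < 3; i++)
        if (snr_db < lut[i].snr_db)
            return lut[i].nmax;
    return 1;                  /* clean channel: minimal decoder */
}

/* Load a new bitstream only when the selected Nmax differs from
 * the current one, since configuration costs ~30ms. */
int needs_reconfig(double snr_db, int current_nmax)
{
    return select_nmax(snr_db) != current_nmax;
}
```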

  36. Performance of dynamic upgrade
  • up to 0.5x power and 2x speed over a static decoder
  • 100 times faster than a processor decoder
  Source: Liang, Tessier, Goeckel

  37. Other directions
  • domain-specific design automation: languages + tools, e.g. for particle physics systems?
  • multi-core, sensor network co-design: multiple hardware/software, FPGA + CPU + sensors
  • extending processor and compiler capabilities: static and dynamic optimizations, self-tuning
  • power-aware, radiation-aware design: transforms, e.g. pipelining; damage monitoring
  • rapid and informative design validation: simulation + FPGA prototype + formal verification

  38. 6. Summary
  • good: Moore’s Law; bad: productivity gap
  • vision: unified design synthesis and analysis
  • devices and design today: growing gap between amount of I/O and amount of logic; enhance optimality and re-use, I/O driven
  • devices tomorrow: hybrid FPGA (multi-granularity fabric); 3D FPGA (customisable system-in-package)
  • design tomorrow: guided synthesis (optimised and portable designs); data representation optimisation; upgradable and self-tuned design
