
ENG6530 Reconfigurable Computing Systems

ENG6530 Reconfigurable Computing Systems: Hardware/Software Co-design. Topics: H/S co-design definition, motivation, design steps, profiling, partitioning, allocation, Xilinx EDK. Reference: “Embedded System Design: A Unified Hardware/Software Introduction”, Frank Vahid, Wiley, 2002.


Presentation Transcript


  1. ENG6530 Reconfigurable Computing Systems Hardware Software Co-design ENG6530 RCS

  2. Topics • H/S Co-Design Definition • Motivation • Design Steps • Profiling • Partitioning • Allocation • Xilinx EDK

  3. References • “Embedded System Design: A Unified Hardware/Software Introduction”, Frank Vahid, Wiley, 2002. • “Hardware/Software Codesign: A systematic approach targeting data-intensive applications”, Wayne Luk, IEEE Signal Processing Magazine, May 2005. • “Hardware-Software Co-synthesis for Digital Systems”, R. Gupta, G. De Micheli, IEEE Design & Test of Computers, September 1993, pp. 29-41. • “Hardware/Software Design Space Exploration for a Reconfigurable Processor”, A. Rosa, 2003. • “A Framework for Hardware/Software Co-design”, S. Kumar, Q. Wulf, IEEE, 1993.

  4. Definition – Hardware/Software Co-Design • The design of computer systems that incorporate both standardized off-the-shelf processors (running software) and specialized hardware. • The cooperative design of hardware and software components. • The unification of currently separate hardware and software paths. • The movement of functionality between hardware and software.

  5. H/S Co-design: Example • Optical wheel speed sensor. • Functional blocks: Input Decoding, FIR Filter, Tick to Speed Inversion, Output Encoding • System constraints: area – 40 units, time – 100 cycles • This could be implemented using standardized processors, specialized hardware, or a combination of both

  6. H/S Co-design: Software • Design implemented in software • System constraints • Area – 48 units > 40 units • Time – 132 cycles > 100 cycles • Design Time – 2 months [Figure: Processor #1, Processor #2]

  7. H/S Co-design: Hardware • Design implemented in custom RTL hardware • System constraints • Area – 24 units < 40 units • Time – 52 cycles < 100 cycles • Beats both the area and the timing constraint by at least 40% • Design Time – 9 months • Such a delay in design is unacceptable in a competitive world.

  8. H/S Co-design • Design implemented in hardware & software • System constraints • Area – 37 units < 40 units • Time – 95 cycles < 100 cycles • Design Time – 3.5 months • Not as efficient as the hardware-only design • However, it strikes a balance between the two extremes. [Figure: Processor #1]
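
The comparison across the three implementations can be written down as a small feasibility check. The sketch below is only an illustration: the area, time, and design-time figures are the ones quoted on the slides, while the function names and the "shortest design time among feasible designs" selection rule are assumptions made here.

```python
# Constraints and per-design figures as quoted on the slides.
AREA_LIMIT = 40    # units
TIME_LIMIT = 100   # cycles

designs = {
    "software only": {"area": 48, "time": 132, "months": 2.0},
    "hardware only": {"area": 24, "time": 52, "months": 9.0},
    "co-design": {"area": 37, "time": 95, "months": 3.5},
}


def meets_constraints(design):
    """True when a design satisfies both the area and the timing limit."""
    return design["area"] <= AREA_LIMIT and design["time"] <= TIME_LIMIT


def fastest_feasible(candidates):
    """Assumed selection rule: the feasible design with the shortest design time."""
    feasible = {n: d for n, d in candidates.items() if meets_constraints(d)}
    return min(feasible, key=lambda n: feasible[n]["months"])


for name, d in designs.items():
    print(f"{name:14s} feasible={meets_constraints(d)} months={d['months']}")
print("chosen:", fastest_feasible(designs))
```

Under this rule the software-only design is rejected for violating both limits, and the co-design is preferred over the hardware-only design because it reaches feasibility in 3.5 months instead of 9.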

  9. Motivations • Achieve performance by moving software bottlenecks to hardware • Use hardware to meet time & area constraints that cannot be met using general-purpose processors alone. • Not possible to put everything in hardware due to limited resources • Some code is more appropriate for sequential implementation (i.e., to achieve flexibility) • Today’s designs focus on embedded systems, which require both hardware and software modules

  10. Motivations … cont • The complexity and functionality of computer systems are increasing at a dramatic rate → System-on-Chip (SoC). • It is difficult for custom systems to be designed, built, and verified within an acceptable time period, even with advanced CAD tools, unless standardized parts are used. (Solution?) • Take advantage of previously designed and tested IP cores and processors to reduce time and improve reliability.

  11. Trade-offs/Decisions • Given a set of specified goals and implementation technology, constraints, … designers consider trade-offs in how hardware and software components work together. • Decisions, Constraints and Evaluations? • Performance. • Area. • Power. • Flexibility (Programmability). • Development & Manufacturing costs. • Reliability. • Robustness. • Maintenance. • Design evolution.

  12. Hw/Sw Co-Design: Research • Research in hardware-software co-design encompasses many areas, such as: • System specification and modeling • Design Exploration • System co-verification and co-simulation • Code generation for hardware/software • Hardware/Software interfacing • Partitioning • Scheduling • However, the most important objective is to develop a unified design methodology/tool for creating systems containing both hardware and software.

  13. A Simple Approach [Figure: design flow with the stages Application, Profiling, Partitioning, Evaluation, Decision, and Schedule tasks (H/W, S/W)]

  14. Profiling and Partitioning • Benefits • Speedups of 2X to 10X typical • Far more potential than dynamic SW optimizations (1.2x) • Energy reductions of 25% to 95% typical [Figure: a profiler identifies critical regions of the software; these are moved to HW (ASIC/FPGA) while the remaining SW runs on the processor]

  15. Profiling • Profiling allows you to learn where your program spent its time and which functions called which other functions while it was executing. • The profiler uses information collected during the actual execution of your program; therefore, it can be used on programs that are too large or too complex to analyze by reading the source. • This information can show you which pieces of your program are slower than you expected. • These might be candidates for either: • Rewriting code to make your program execute faster. • Moving these functions to hardware.

  16. Profiling: Steps • You must compile and link your program with profiling enabled. • cc -o myprog.exe myprog.c utils.c -g -pg • You must then execute your program to generate a profile data file. • Your program will write the profile data into a file called `gmon.out` just before exiting. • You must run gprof to analyze the profile data. • gprof options myprog.exe gmon.out > outfile • The gprof program prints a flat profile and a call graph

  17. Profiling: Useful Hints • Options: • -e function_name: tells gprof NOT to print information about the function function_name (and its children …) in the call graph. • -f function_name: causes gprof to limit the call graph to the function function_name and its children. • -b: gprof doesn’t print the verbose blurbs that try to explain the meaning of all of the fields in the tables.

  18. Profiling: Flat Profile • % time: the percentage of the total execution time your program spent in this function. • cumulative seconds: the cumulative total number of seconds the computer spent executing this function, plus the time spent in all the functions listed above it in the table. • self seconds: the number of seconds accounted for by this function alone. • calls: the total number of times the function was called. • self ms/call: the average number of milliseconds spent in this function per call. • total ms/call: the average number of milliseconds spent in this function and its descendants per call. • name: the name of the function.
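
The relationship between the flat-profile columns can be illustrated by deriving them from raw per-function measurements. This is a sketch, not gprof's implementation: the function names and timings in `samples` are invented for illustration, and `total ms/call` is left out because it needs call-graph information that this toy input does not carry.

```python
def flat_profile(samples):
    """Derive flat-profile columns from {function: (self_seconds, calls)}.

    Rows are sorted by decreasing self time, as in a gprof flat profile.
    """
    total = sum(self_s for self_s, _ in samples.values())
    rows, cumulative = [], 0.0
    ordered = sorted(samples.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (self_s, calls) in ordered:
        cumulative += self_s  # this function plus all rows above it
        rows.append({
            "name": name,
            "pct_time": 100.0 * self_s / total,
            "cumulative_s": cumulative,
            "self_s": self_s,
            "calls": calls,
            "self_ms_per_call": 1000.0 * self_s / calls if calls else 0.0,
        })
    return rows


# Hypothetical measurements: (self seconds, number of calls).
samples = {"decode": (0.30, 500), "fir_filter": (0.60, 1000), "encode": (0.10, 500)}
rows = flat_profile(samples)
for r in rows:
    print(f"{r['pct_time']:6.2f} {r['cumulative_s']:6.2f} {r['self_s']:6.2f} "
          f"{r['calls']:6d} {r['self_ms_per_call']:8.3f}  {r['name']}")
```

The function at the top of such a table (here the hypothetical `fir_filter`) is the first candidate for rewriting or for moving to hardware.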

  19. Simple Approach: Drawbacks • Some functions might not be easily mapped onto hardware. • Decisions taken very early, at the profiling phase, might not be optimal. • No consideration for interfacing and communication. • If the application changes slightly, we need to re-profile and re-partition.

  20. Applications Not Suitable for RCS • Not all applications are suitable for Reconfigurable Computing: • Applications that involve extensive recursion, for example, are a poor match because the synthesized “hardware” must be of fixed size. • Applications that have only a small percentage of parallelism (1-5%) will not take advantage of RCS. • Applications that are I/O bound will also suffer due to memory I/O transfer. • Applications that require floating point arithmetic.
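
Why a small parallel fraction limits the benefit can be made concrete with Amdahl's law, which is not named on the slide but is the standard way to bound such speedups. The sketch below assumes that only a fraction of the original runtime is moved to hardware and that this part runs `kernel_speedup` times faster; both parameter names are chosen here for illustration.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0 or kernel_speedup <= 0:
        raise ValueError("fraction must be in [0, 1] and speedup must be positive")
    remaining = 1.0 - accelerated_fraction           # part left unchanged
    accelerated = accelerated_fraction / kernel_speedup
    return 1.0 / (remaining + accelerated)


# 5% of the runtime accelerated 100x: the whole program barely improves.
print(round(overall_speedup(0.05, 100), 3))
# 90% of the runtime accelerated 10x: a substantial overall gain.
print(round(overall_speedup(0.90, 10), 3))
```

Even an infinitely fast accelerator cannot push the first case beyond 1 / 0.95, which is why applications with 1-5% parallelism gain almost nothing.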

  21. Design Space Exploration • Which architecture is better suited for our application? [Figure: two candidate architectures (Architecture #1, Architecture #2) assembled from computation templates (RISC, DSP, Cipher, LookUp, FPGA, SDRAM, mE), communication templates, and scheduling/arbitration policies (TDMA, EDF, FCFS, WFQ, proportional share, static, dynamic, fixed priority)]

  22. H/S Codesign: A Framework • System Representation • System Evaluation • CoDesign Refinement (produce a hardware/software alternative via evaluation) • Decomposition (break down system functions into a collection of sub-functions) • H/S Partitioning (determine which of the sub-functions should be implemented in hardware and which in software) • System Integration

  23. Co-Synthesis/Co-Design

  24. Partitioning & Scheduling • Task partitioning and task scheduling are required in many applications, for instance co-design systems, multiprocessing systems, and high-level synthesis. • Sub-tasks extracted from the input description should be implemented: • Where? → In the right place (using the partitioner/placer) • When? → At the right time (using the scheduler) • It is well known that such scheduling and partitioning problems are NP-complete. • Optimization techniques based on heuristic methods are generally employed to explore the search space so that feasible and near-optimal solutions can be found.

  25. System Partitioning • Good partitioning mechanism: • Minimize communication across bus • Allows parallelism → both hardware (FPGA) and processor operating concurrently • Load balancing → near-peak processor utilization at all times (performing useful work) [Figure: design flow with the steps Specification, Capture, Model, Partition, and Synthesize, targeting an FPGA, a Processor, and an Interface; example specification: process (a, b, c) in port a, b; out port c; { read(a); … write(c); }; example partition fragment: Line () { a = … … detach }]

  26. Terminology: Hypergraphs • A graph $G = \langle V, E \rangle$: $V$ is a set of vertices; each edge $e \in E$ is a pair of vertices $(u, v)$ • A hypergraph $H = \langle V, E_h \rangle$: $V$ is a set of vertices; each hyperedge $h \in E_h$ is a subset of vertices, $h \in 2^V$ • A netlist is a hypergraph • Hypergraphs can be approximated as graphs by breaking each hyperedge into a clique of edges
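
The clique approximation mentioned in the last bullet can be sketched in a few lines. The netlist below is a made-up example, and representing each hyperedge as a Python set of vertex names is an assumption of this sketch.

```python
from itertools import combinations


def clique_expand(hyperedges):
    """Approximate a hypergraph by a graph: each hyperedge becomes a clique.

    Every pair of vertices that share a hyperedge gets one ordinary edge.
    """
    edges = set()
    for h in hyperedges:
        for u, v in combinations(sorted(h), 2):
            edges.add((u, v))
    return edges


# Hypothetical netlist: one 3-pin net and one 2-pin net.
netlist = [{"a", "b", "c"}, {"c", "d"}]
graph_edges = clique_expand(netlist)
print(sorted(graph_edges))
```

A k-pin net turns into k(k-1)/2 edges, so the approximation overstates the cost of cutting large nets; this is one reason hypergraph-native methods are often preferred for netlists.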

  27. Bi-partitioning Problem • Given a graph or hypergraph $G$ • Find a partition $P$ of $V$ into $V_1, V_2$ such that $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$ • Minimizing the weight of the edges that cross the cut: $\min c(P) = \sum_{h} w(h)$, summed over all $h$ that connect some $u \in V_1$ to some $v \in V_2$ • Subject to a capacity constraint $\alpha$: $\alpha^{-1} > |V_1| / |V_2| > \alpha$
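
The objective and the capacity constraint above can be evaluated directly for a given partition. In this sketch the example hyperedges, the unit weights, and the value of `alpha` are illustrative assumptions; only the definitions of the cut cost and the ratio bound come from the slide.

```python
def cut_cost(hyperedges, weights, v1):
    """Sum w(h) over hyperedges that have vertices on both sides of the cut."""
    total = 0
    for h, w in zip(hyperedges, weights):
        inside = sum(1 for v in h if v in v1)
        if 0 < inside < len(h):   # some vertices in V1 and some in V2
            total += w
    return total


def is_balanced(v1, v2, alpha):
    """Capacity constraint from the slide: 1/alpha > |V1| / |V2| > alpha."""
    ratio = len(v1) / len(v2)
    return alpha < ratio < 1.0 / alpha


# Hypothetical hypergraph on six cells, all weights equal to 1.
hyperedges = [{"a", "b"}, {"b", "c", "d"}, {"d", "e"}, {"e", "f"}]
weights = [1, 1, 1, 1]
v1, v2 = {"a", "b", "c"}, {"d", "e", "f"}
print(cut_cost(hyperedges, weights, v1), is_balanced(v1, v2, alpha=0.5))
```

Only the net {b, c, d} spans the cut in this example, so the cost is 1, and the 3/3 split satisfies the ratio bound for alpha = 0.5.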

  28. Bipartitioning Approaches • Exact Methods: • Mixed Integer Programming (using Branch and Bound) !! • min-cut / max-flow (Ford-Fulkerson 1962) • maximum flow through graph = minimum cut • useful for establishing unconstrained bound • Heuristics (Local Search) • Kernighan-Lin (1970) • operates on graphs • swap all nodes once, in pairs that yield max. gain • choose greatest gain over pass, repeat until no improvement • $O(n^2 \log n)$ • Fiduccia-Mattheyses (1982) • operates on hypergraphs • $O(p)$, linear time in the number of pins $p$! • Meta Heuristics (avoid getting stuck in local minima) • Simulated annealing • select some random moves based on “temperature” • design hopefully “cools” into optimal solution • computationally intensive • Tabu Search • Genetic Algorithms • Particle Swarm Optimization

  29. Fiduccia-Mattheyses • generate initial partition • calculate gain g(c) of moving each cell • while improvement { unlock all cells; while max g(c) > 0 | c ∉ locked { select cell with max g(c) | c ∉ locked; move c across the cut; c → locked; update g(c) for all of c’s neighbors; } } • one pass is O(p)
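
The loop structure of the pseudocode can be sketched as runnable code. This is a simplified illustration, not the linear-time algorithm: gains are recomputed from scratch instead of being kept in a bucket array, so a pass is far slower than O(p). The sample nets, the initial partition, and the `max_diff` balance tolerance are assumptions introduced for this sketch.

```python
def cut_size(nets, side):
    """Number of nets that have cells on both sides; side maps cell -> 0 or 1."""
    return sum(1 for net in nets if len({side[c] for c in net}) == 2)


def gain(nets, side, cell):
    """g(c): reduction in cut size if `cell` alone moves across the cut."""
    before = cut_size(nets, side)
    side[cell] ^= 1
    after = cut_size(nets, side)
    side[cell] ^= 1            # undo the trial move
    return before - after


def keeps_balance(side, cell, max_diff):
    """Would moving `cell` keep the block sizes within max_diff of each other?"""
    sizes = [0, 0]
    for c, s in side.items():
        sizes[s ^ 1 if c == cell else s] += 1
    return abs(sizes[0] - sizes[1]) <= max_diff


def fm_partition(nets, initial_side, max_diff=2):
    """Repeat passes; in each pass move unlocked cells while the best gain is > 0."""
    side = dict(initial_side)
    improved = True
    while improved:
        improved, locked = False, set()      # unlock all cells for this pass
        while True:
            candidates = [(gain(nets, side, c), c) for c in sorted(side)
                          if c not in locked and keeps_balance(side, c, max_diff)]
            if not candidates:
                break
            best_gain, cell = max(candidates, key=lambda gc: gc[0])
            if best_gain <= 0:
                break
            side[cell] ^= 1                  # move the cell across the cut
            locked.add(cell)
            improved = True
    return side


# Hypothetical netlist: two triangles joined by the net (c, d).
nets = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
        ("d", "e"), ("d", "f"), ("e", "f")]
start = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
result = fm_partition(nets, start)
print(cut_size(nets, start), "->", cut_size(nets, result))
```

Each accepted move strictly reduces the cut, so the loop terminates. A production implementation would also accept non-improving moves within a pass and roll back to the best prefix, which is what lets Fiduccia-Mattheyses climb out of shallow local minima.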

  30. Example • Goal: partition the graph into two disjoint halves so as to minimize the number of hyperedges that span the cut • All edges have unit weight • Given balance criterion: |V1| - 1 ≤ |V2| ≤ |V1| + 1 [Figure: example graph with cells a, b, c, d, e, f]

  31. Example (cont’d) • Step 1. A random partition is assigned that keeps balance • Number of cuts = 5 [Figure: cells a to f split into two blocks]

  32. Example (cont’d) • Step 2. Initial gains are calculated for each cell; results are placed into a bucket array • Number of cuts = 5 [Figure: cells a to f annotated with initial gains between -1 and +2]

  33. Example (cont’d) • Step 3. A cell is selected; gains of critical nets are updated; the cell is locked from further movement • Number of cuts = 3 [Figure: partition and updated gains after the move]

  34. Example (cont’d) • Step 3 (repeated). Another cell is selected; gains of critical nets are updated; the cell is locked from further movement • Number of cuts = 2 [Figure: partition and updated gains after the second move]

  35. Co-design: Tools • Ideally, co-design tools would provide an almost automatic framework for producing a balanced and optimized design from an initial high-level specification. • In practice, the goal of co-design tools and platforms is not to push towards this kind of total automation. • Designer interaction and continuous feedback are considered essential. • The main goal is to build into co-design tools the support for shifting functionality and implementation between HW ↔ SW, with effective and efficient evaluation.

  36. H/S Co-Design: Approaches • Opposite strategies • Vulcan (“primal” approach) • Functionality all in HW (HardwareC) initially • Move some to CPU to reduce architecture cost • Cosyma (“dual” approach) • Functionality all in SW (Cx) initially • Move some to ASIC to meet performance goals • Lycos • Convert all functionality to neutral form

  37. Partitioning Algorithms • Assume everything is initially in software • Select a task for swapping • Migrate it to hardware and evaluate the cost • Cost factors: timing, hardware resources, program and data storage, synchronization overhead • Cost evaluation and move evaluation are similar to what we’ve seen in the min-cut FM algorithm. [Figure: a task moving from the software list of tasks to the hardware list of tasks]
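
The migrate-and-evaluate loop above can be sketched as a greedy heuristic. Everything concrete here is an assumption for illustration: the task names and numbers are invented, tasks are assumed to run one after another, the cost model is reduced to execution time plus a fixed `comm_overhead` per migrated task, and candidates are ranked by time saved per unit of hardware area.

```python
def greedy_partition(tasks, area_budget, deadline, comm_overhead=0):
    """Start with all tasks in software; migrate tasks until the deadline is met.

    tasks: {name: {"sw_time": .., "hw_time": .., "hw_area": ..}} with hw_area > 0.
    Returns (set of hardware tasks, area used, total execution time).
    """
    in_hw, area_used = set(), 0
    total_time = sum(t["sw_time"] for t in tasks.values())
    while total_time > deadline:
        best, best_ratio = None, 0.0
        for name, t in tasks.items():
            if name in in_hw or area_used + t["hw_area"] > area_budget:
                continue                      # already moved, or does not fit
            saving = t["sw_time"] - t["hw_time"] - comm_overhead
            ratio = saving / t["hw_area"]     # cycles saved per area unit
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break                             # no beneficial move remains
        moved = tasks[best]
        in_hw.add(best)
        area_used += moved["hw_area"]
        total_time -= moved["sw_time"] - moved["hw_time"] - comm_overhead
    return in_hw, area_used, total_time


# Invented task data (cycles and area units), not taken from the slides.
tasks = {
    "input_decoding": {"sw_time": 20, "hw_time": 8, "hw_area": 10},
    "fir_filter": {"sw_time": 70, "hw_time": 20, "hw_area": 15},
    "tick_to_speed": {"sw_time": 25, "hw_time": 10, "hw_area": 12},
    "output_encoding": {"sw_time": 17, "hw_time": 9, "hw_area": 8},
}
hw_tasks, area, time = greedy_partition(tasks, area_budget=40, deadline=100,
                                        comm_overhead=2)
print(hw_tasks, area, time)
```

With these numbers a single migration of the most expensive task is enough to meet the deadline. A greedy choice like this can miss better combinations, which is why the FM-style and metaheuristic methods discussed earlier are used for larger systems.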

  38. Automation • The compiler and profiler determine dependences and rough performance estimates • The result of compilation is synthesizable HDL and assembly code for the processor

  39. Interfacing • Interfacing between software and hardware modules is crucial for successful co-design: • How data is passed between sub-modules efficiently. • The rate of exchange of information between modules. [Figure: flow with the stages System Description, Hw/Sw Partitioning, Co-Synthesis (Software, Interface, Hardware), System Integration, and Co-Simulation]

  40. Interface Models: FIFO • Synchronization through a FIFO • The FIFO can be implemented either in hardware or in software • Effectively reconfigure hardware (FPGA) to allocate buffer space as needed • Interrupts are used for the software version of the FIFO [Figure: p1, p2, p3 and the FPGA connected through a Control/Data FIFO, with labels d1, d2, d3 and r2, r3]
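
FIFO-based synchronization can be illustrated with a purely software model. In this sketch two threads stand in for a hardware producer and a software consumer, the bounded `queue.Queue` plays the role of the FIFO, and blocking `put`/`get` calls stand in for the back-pressure and interrupt mechanisms of a real interface. The doubling "computation", the `None` end marker, and the FIFO depth are assumptions made for the example.

```python
import queue
import threading


def producer(fifo, items):
    """Models the module that writes into the FIFO."""
    for item in items:
        fifo.put(item)       # blocks while the FIFO is full
    fifo.put(None)           # end-of-stream marker (an assumption of this sketch)


def consumer(fifo, results):
    """Models the module that reads from the FIFO and processes the data."""
    while True:
        item = fifo.get()    # blocks while the FIFO is empty
        if item is None:
            break
        results.append(item * 2)


def run(items, depth=2):
    """Connect producer and consumer through a FIFO with `depth` entries."""
    fifo = queue.Queue(maxsize=depth)
    results = []
    threads = [threading.Thread(target=producer, args=(fifo, items)),
               threading.Thread(target=consumer, args=(fifo, results))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run([1, 2, 3, 4, 5]))
```

Neither side needs to know the other's timing: a deeper FIFO tolerates burstier producers at the cost of more buffer space, which is the resource the slide suggests allocating on the FPGA as needed.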

  41. Warp Processors • 1. Initially execute the application in software only • 2. Profile the application to determine critical regions • 3. Partition critical regions to hardware • 4. Program configurable logic & update software binary • 5. Partitioned application executes faster with lower energy consumption [Figure: MIPS/ARM processor with I$ and D$, Profiler, Configurable Logic, and Dynamic Partitioning Module (DPM)]

  42. Summary • Hardware/software co-design is becoming the common design style for building systems. • H/S co-design allows the majority of a system to be designed quickly with standardized parts, while special-purpose hardware is used for time-critical portions of the system. • Xilinx and Altera provide a complete flow for H/S co-design. • Issues: • How to partition the system? • Communication overhead!! • Platforms to be used • Languages that support this paradigm.

  43. Extra Slides

  44. Embedded CPUs • PowerPC 405 (hard core) • 32-bit embedded PowerPC RISC architecture • Up to 450 MHz • 2 × 16 kB instruction and data caches • Memory management unit (MMU) • Embedded in Virtex-II Pro and Virtex-4 FX devices • ARM Cortex-A9 (hard core) • 32-bit multicore processor • Up to 900 MHz • Xilinx Zynq 7000 processing platform • Device is processor based, attached to FPGA • High level of performance • Reduces power, cost, size • MicroBlaze (soft core) • 32-bit RISC architecture • 2 × 64 kB instruction and data caches • Hardware multiply and divide • OPB and LMB bus interfaces...

  45. Embedded Processors • Hard core (PowerPC) • Faster • Fixed position • Few devices (Virtex-4 processors) • Soft core (MicroBlaze, PicoBlaze) • Slower • Can be placed anywhere • Applicable to many devices

  46. Soft and Hard cores in current FPGAs [Figure: board-level system with Power Supply, CLK, Ethernet MAC, Audio Codec, Interrupt Controller, Timer, GP I/O, Address Decode Unit, CPU (uP / DSP), UART, Co-Proc., Memory Controller, custom IF-logic, Display Controller, LC, SRAM, and SDRAM]

  47. Next Step... [Figure: the system of the previous slide with an FPGA; labels: Power Supply, CLK, Audio Codec, LC, SRAM, SDRAM, Ethernet MAC, Interrupt Controller, Timer, GP I/O, Address Decode Unit, CPU (uP / DSP), UART, Co-Proc., Memory Controller, custom IF-logic, Display Controller]

  48. Configurable System on a Chip (CSoC) [Figure: labels Audio Codec, EPROM, Power Supply, LC, SRAM, SDRAM]

  49. Soft CPU Core: “MicroBlaze” (Xilinx Inc.)

  50. PowerPC-based Embedded Design • Full system customization to meet performance, functionality, and cost goals • IBM CoreConnect™ on-chip bus standard: PLB, OPB, and DCR [Figure: PowerPC 405 core and RocketIO as dedicated hard IP, with ISOCM and DSOCM BRAM, instruction and data ports, and a DCR bus; Processor Local Bus (PLB) and On-Chip Peripheral Bus (OPB), each with an arbiter, joined by a bus bridge; flexible soft IP, e.g. memory controller, hi-speed peripheral, GB E-Net, on-chip peripheral, UART, GPIO; off-chip memory: ZBT SRAM, DDR SDRAM, SDRAM]
