
Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths

Presentation Transcript


    1. Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths. Zhining Huang, Sharad Malik. Department of Electrical Engineering, Princeton University.

    2. Dynamically Reconfigurable Datapaths: Speed up kernel loops using reconfigurable hardware.

    3. Outline: Application specific programmable platforms; methodology overview and architectural model; datapath design for kernel loops (direct mapping, pipelining); reconfigurable datapath design; case studies (GSM, MPEG II); conclusion.

    4. Why programmable platforms? Design cost, time to market. Different programmable platforms: bit level (FPGA based); word level (specialized VLIW, coarse grained reconfigurable coprocessors); thread level (multiple PEs with on-chip communication networks). Application Specific Programmable Platforms.

    5. Application Specific Programmable Platforms (contd.): Goal: approach the flexibility of GPPs with the efficiency of ASICs. Part of the MESCAL project (Modern Embedded Systems, Compilers, Architectures and Languages), a disciplined effort for application specific programmable platform development.

    6. Related Research: Various reconfigurable coprocessors: Garp [Hauser+97], PipeRench [Goldstein+99], Pleiades [Wan+00], Chameleon Systems, Morphics Technology. General reconfigurable fabrics + compiler: hardware resource, routing, compiler. Our approach: design automation of the application specific reconfigurable fabrics; coarse grained dynamically reconfigurable logic.

    7. Architectural Model: RISC + coarse grained reconfigurable datapath; fixed function units; reconfigurable interconnections.

    8. Methodology Overview: Designing the application specific reconfigurable datapath.

    9. Mapping Kernel Loops from C to Hardware: Generating a datapath for each kernel loop.

    10. Direct Mapping: Direct mapping from IR to hardware; one instruction to one function unit.

    11. Direct Mapping (contd.): Branch condition transforms.

    12. Intra-iteration Scheduling: Schedule FUs into different pipe stages.
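
The slide does not say which scheduling algorithm assigns FUs to pipe stages. As a minimal sketch, assuming an as-soon-as-possible (ASAP) policy in which every FU occupies exactly one stage, each FU can be placed one stage after its latest producer. The function name `asap_stages` and the example FU names are hypothetical.

```python
def asap_stages(deps):
    """Assign each FU to a pipe stage with an ASAP policy (an assumption).

    deps maps an FU name to the list of FUs whose results it consumes
    within the same loop iteration. Stages are numbered from 0.
    """
    stages = {}

    def stage_of(fu, visiting=()):
        if fu in stages:
            return stages[fu]
        if fu in visiting:
            raise ValueError("intra-iteration dependence cycle at " + fu)
        preds = deps.get(fu, [])
        if not preds:
            stages[fu] = 0
        else:
            # One stage after the latest producer.
            stages[fu] = 1 + max(stage_of(p, visiting + (fu,)) for p in preds)
        return stages[fu]

    for fu in deps:
        stage_of(fu)
    return stages


# Hypothetical loop body: store(add(mul(load_a, load_b)))
example = {
    "load_a": [],
    "load_b": [],
    "mul": ["load_a", "load_b"],
    "add": ["mul"],
    "store": ["add"],
}
stages = asap_stages(example)
```

A real flow would also account for FU delays and clock period when packing FUs into stages; this sketch ignores both.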

    13. Inter-iteration Scheduling: Pipelining the execution of loop iterations. Determine the Initial Interval (II) of a loop datapath.

    14. Inter-iteration Scheduling (contd.): Data dependence from FU i to FU j across loop iterations. Feedback connection. $II = \mathrm{PipeStage}(i) - \mathrm{PipeStage}(j) + \mathrm{FU\_Delay}(j)$, if $II > 0$.
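
The feedback constraint can be sketched as a small function. This reads the missing operator between the two PipeStage terms as a subtraction and treats a non-positive result as "no constraint"; both readings are assumptions, and the function name `feedback_ii` is hypothetical.

```python
def feedback_ii(stage_i, stage_j, fu_delay_j):
    """II imposed by a loop-carried dependence from FU i to FU j.

    Assumed reading of the slide:
        II = PipeStage(i) - PipeStage(j) + FU_Delay(j), if II > 0.
    Returns None when the value is not positive (no constraint).
    """
    ii = stage_i - stage_j + fu_delay_j
    return ii if ii > 0 else None


# Producer in stage 3 feeds back to a consumer in stage 1 with delay 1.
constraint = feedback_ii(stage_i=3, stage_j=1, fu_delay_j=1)
```

With several loop-carried dependences, taking the largest constraint as the loop's II would be the natural choice, although the slide does not state this.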

    15. Inter-iteration Scheduling (contd.): Data dependence on memory access. No feedback connections needed. $II = \lceil (\mathrm{PipeStage}(i) - \mathrm{PipeStage}(j) + 1) / k \rceil$, where k is the distance of dependent iterations, from data dependence analysis.
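
For the memory case, a minimal sketch, assuming the outer brackets in the slide's formula denote a ceiling and the two PipeStage terms are subtracted. The function name `memory_ii` is hypothetical.

```python
def memory_ii(stage_i, stage_j, k):
    """II imposed by a memory dependence between FU i and FU j.

    Assumed reading of the slide:
        II = ceil((PipeStage(i) - PipeStage(j) + 1) / k)
    where k is the dependence distance in iterations (k >= 1).
    """
    if k < 1:
        raise ValueError("dependence distance k must be at least 1")
    numerator = stage_i - stage_j + 1
    # Integer ceiling division, avoiding floating point.
    return -(-numerator // k)


# A dependence spanning 4 stages, carried over 2 and then 3 iterations.
ii_k2 = memory_ii(stage_i=4, stage_j=1, k=2)
ii_k3 = memory_ii(stage_i=4, stage_j=1, k=3)
```

A larger dependence distance k relaxes the constraint, since the dependent iteration starts k initiation intervals later.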

    16. Execution Time Estimation: S: total number of pipeline stages of the datapath; II: initial interval between the fetch of 2 consecutive iterations; N: number of loop iterations; O: configuration overhead; W: system write back. Example: T = 5 + 2 × (32 − 1) + 4 = 71.
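
The formula itself did not survive in the transcript, only the example. A sketch consistent with that example, assuming T = S + II × (N − 1) + overhead, where the final term lumps O and W together because the transcript does not show how the 4 in the example splits between them:

```python
def estimate_cycles(stages, ii, iterations, overhead):
    """Estimated cycle count for one kernel loop on the datapath.

    Assumed form, inferred from the slide's example:
        T = S + II * (N - 1) + overhead
    'overhead' stands for configuration overhead plus write back combined.
    """
    # S cycles for the first iteration to pass through the pipeline,
    # then one further iteration completes every II cycles.
    return stages + ii * (iterations - 1) + overhead


# Values from the slide's example: S = 5, II = 2, N = 32, overhead = 4.
t = estimate_cycles(stages=5, ii=2, iterations=32, overhead=4)
```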

    17. Reconfigurable Datapath Design: Embed individual datapaths into a single datapath. Datapath graph Gi: vertices are hardware resources (memories, registers, function units); edges are connections between them. Construct a single graph G such that each Gi ⊆ G and G has the fewest edges and vertices. Bipartite matching based algorithm [Huang+ 2001].
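
To illustrate what merging datapath graphs means, here is a deliberately simple greedy sketch. It is not the bipartite matching algorithm cited on the slide: it only reuses the first free merged vertex of the same resource type and makes no attempt to minimize edges. The graph encoding and the name `merge_datapaths` are assumptions.

```python
def merge_datapaths(graphs):
    """Greedily embed several datapath graphs into one merged graph.

    Each graph is (vertices, edges): vertices maps a local name to a
    resource type, edges is a set of (src, dst) local names.
    Returns (merged_types, merged_edges) using integer vertex ids.
    """
    merged_types = {}     # merged vertex id -> resource type
    merged_edges = set()  # (src id, dst id)
    for vertices, edges in graphs:
        used = set()      # merged vertices already taken by this graph
        mapping = {}
        for name, rtype in vertices.items():
            # Reuse the first free merged vertex of the same type, else add one.
            match = next((v for v, t in merged_types.items()
                          if t == rtype and v not in used), None)
            if match is None:
                match = len(merged_types)
                merged_types[match] = rtype
            used.add(match)
            mapping[name] = match
        merged_edges |= {(mapping[s], mapping[d]) for s, d in edges}
    return merged_types, merged_edges


# Two hypothetical kernel-loop datapaths sharing a multiplier and an adder.
g1 = ({"m": "mul", "a": "add"}, {("m", "a")})
g2 = ({"a1": "add", "a2": "add", "m": "mul"}, {("m", "a1"), ("a1", "a2")})
types, edges = merge_datapaths([g1, g2])
```

Here the merged graph needs three function units and two interconnects instead of five units and three interconnects, because the mul-to-add edge is shared. A matching-based method would choose vertex pairings that maximize this kind of edge overlap rather than taking the first free candidate.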

    18. Reconfigurable Datapath: Merged graph G to reconfigurable datapath: vertices to function units; edges to reconfigurable interconnects. By selecting a subset of the interconnections, any selected datapath can be generated and executed on the reconfigurable datapath. Appropriate interconnects in the merged datapath are enabled using configuration bits.

    19. Routing: Useful interconnections are selected. Routing box to select between multiple connections. Configuration contexts: configuration bits for routing boxes; control bits for some FUs; static register initialization.
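
A routing box behaves like a multiplexer in front of an FU input, so a configuration context can be sketched as one select index per routing box. The sketch below assumes one input per FU and a binary-encoded select of ceil(log2(n)) bits for n candidate sources; the slides do not give the actual encoding, and all names are hypothetical.

```python
def routing_box_bits(num_inputs):
    """Select bits for a routing box with num_inputs sources (binary encoding assumed)."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (num_inputs - 1).bit_length()


def build_context(candidates, datapath_edges):
    """Build the routing part of one configuration context.

    candidates: {dst: [src, ...]} taken from the merged datapath.
    datapath_edges: {dst: src} used by one kernel loop.
    Returns {dst: select index} for destinations that need a routing box.
    """
    context = {}
    for dst, src in datapath_edges.items():
        sources = candidates[dst]
        if len(sources) > 1:  # a single source is a direct wire, no box needed
            context[dst] = sources.index(src)
    return context


# Hypothetical merged datapath: add0 can read from three sources.
candidates = {"add0": ["mul0", "reg1", "mem0"], "mul0": ["reg0"]}
context = build_context(candidates, {"add0": "reg1", "mul0": "reg0"})
total_bits = sum(routing_box_bits(len(s)) for s in candidates.values())
```

Switching kernel loops then amounts to loading a different set of select indices, which is why the contexts can stay small.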

    20. Reconfiguration Overhead: Store configuration contexts of a limited number of kernel loops in distributed RAMs. Fast context switch for reconfigurable fabrics: NEC OmniPath [Furuta+00], Chameleon Systems. Reconfiguration overhead: read live-in register set; write live-out register set.

    21. Critical Path and Clock Speed: Critical path in the reconfigurable datapath: delay of FU; delay of routing box; delay of directly connected wires. Critical path in a general processor: no longer in the FU stage; branch control, decoding stage. The clock speed of the reconfigurable datapath should be no less than that of a general processor.

    22. Benchmark Studies: MPEG: overall speedup 3.57; 10 kernel loops account for 86% of execution time; max possible speedup 7.14.
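
The MPEG figures are consistent with Amdahl's law: if the kernel loops take 86% of execution time, the ceiling on overall speedup is 1 / (1 − 0.86) ≈ 7.14. The sketch below reproduces that bound and derives the kernel-loop speedup implied by the reported overall figure. The slide does not state that Amdahl's law was used, and the implied kernel speedup is a derived value, not a reported one.

```python
def amdahl_speedup(kernel_fraction, kernel_speedup):
    """Overall speedup when only kernel_fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


kernel_fraction = 0.86   # from the slide: 10 kernel loops, 86% of execution time
overall_speedup = 3.57   # from the slide

# Upper bound: the kernel loops take no time at all.
max_speedup = 1.0 / (1.0 - kernel_fraction)

# Kernel-loop speedup implied by the reported overall speedup (derived).
implied_kernel_speedup = kernel_fraction / (1.0 / overall_speedup - (1.0 - kernel_fraction))

print(round(max_speedup, 2))
```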

    23. Datapath Mapping Results: Significant overlap between datapaths is obtained. Configuration bits: MPEG < 500 bits, GSM < 1000 bits.

    24. Speed-up vs. Memory Bandwidth: Make multiple copies of the datapath. Constraint: number of memory ports.

    25. Clustered VLIW machine? Application specific clustered VLIW processor with one instruction per kernel loop; reconfiguration contexts as instructions; interconnections as application specific bypassing networks.

    26. Reconfigurable Datapath (RD) vs. VLIW

    27. Applicable Application Domain: Computation intensive applications; localized operational parallelism; a few areas account for most of the execution time.

    28. Conclusion: A methodology for the design of a dynamically reconfigurable datapath coprocessor: kernel loop IR to datapath hardware; datapath hardware merged into reconfigurable hardware. MPEG, GSM benchmark case studies. Examined reconfigurable datapaths vs. VLIW processors. Future research: increasing the datapath pipelining throughput through FU merging; fully automating the process.
