
Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit. Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Koji Inoue††, Mehdi Sedighi†


Presentation Transcript


  1. Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Koji Inoue††, Mehdi Sedighi† †Computer and IT Engineering Department, Amirkabir University of Technology {mehdipur,szamani,msedighi}@aut.ac.ir ††Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University noori@c.csce.kyushu-u.ac.jp, {murakami,inoue}@i.kyushu-u.ac.jp

  2. Agenda • Introduction • Application-specific instruction set extension • Temporal Partitioning • Some Definitions • General overview of the architecture • RFU Architecture: A Quantitative Approach • Generating Custom Instructions • Mapping Custom Instructions • Integrating RFU with base processor • Integrated framework for generating and mapping custom instructions • Performance Evaluation • References

  3. Introduction • An extensible processor with a reconfigurable functional unit (RFU) can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs) for achieving enhanced performance in embedded systems • ASICs: not flexible; expensive and time-consuming design process • GPPs: very flexible; may not offer the necessary performance

  4. Introduction • ASIPs: more flexible than ASICs; more potential than GPPs to meet the high-performance demands of embedded applications; require generating a complete instruction set architecture for the targeted application; a full-custom solution is too expensive and has long design turnaround times

  5. Application-specific instruction set extension • Another method for performance improvement • An extensible processor with a reconfigurable functional unit offers a favorable tradeoff between efficiency and flexibility while keeping design turnaround time much shorter • Critical portions of an application’s dataflow graph (DFG) are accelerated by using custom functional units • Nodes of the DFG -> instructions of the critical portions • Edges of the DFG -> dependencies between instructions

  6. Temporal Partitioning • Partitioning a data flow graph into a number of partitions such that • each partition can fit into the target hardware and • dependencies among the graph nodes are not violated.
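
The two conditions on this slide can be checked mechanically. The following is a minimal sketch, not the authors' tool: the DFG is assumed to be a dict mapping each node to its successors, a partitioning is an ordered list of node sets, and `capacity` (a maximum node count per partition) is a hypothetical stand-in for "fits into the target hardware".

```python
def is_valid_temporal_partitioning(dfg, partitions, capacity):
    """Check a temporal partitioning of a DFG.

    dfg:        {node: [successor, ...]}
    partitions: ordered list of sets of nodes (executed first to last)
    capacity:   assumed hardware limit, here simply a node count
    """
    index = {}
    for i, part in enumerate(partitions):
        if len(part) > capacity:          # partition does not fit
            return False
        for node in part:
            if node in index:             # node assigned twice
                return False
            index[node] = i
    all_nodes = set(dfg) | {s for succs in dfg.values() for s in succs}
    if set(index) != all_nodes:           # some node left unassigned
        return False
    # A producer must never be placed in a later partition than its consumer.
    return all(index[u] <= index[v] for u, succs in dfg.items() for v in succs)
```

Dependent nodes may share a partition in this sketch; only placing a consumer before its producer is rejected.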

  7. Some definitions • Hot Basic Block (HBB): a basic block whose execution frequency is greater than a given threshold specified in the profiler • Custom Instructions (CIs): extensions to the Instruction Set Architecture (ISA) that are executed on the RFU • Reconfigurable Functional Unit (RFU): custom hardware for executing CIs

  8. General overview of the architecture • Adaptive Dynamic Extensible Processor • Base Processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write; Reg File) • Augmented Hardware: Profiler, Sequencer, RFU • Profiler: detects start addresses of Hot Basic Blocks (HBBs) • Sequencer: switches between main processor and RFU • RFU: executes Custom Instructions

  9. Operation modes • Training Mode (binary-level profiling): the Profiler detects the start addresses of HBBs while applications run on the processor • Training Mode (binary rewriting): running tools for generating Custom Instructions, generating configuration data for ACC and initializing the Sequencer table • Normal Mode (executing CIs): the Sequencer monitors the PC and switches between main processor and ACC

  10. Tool Chain

  11. Reconfigurable Functional Unit (RFU) • RFU is a matrix of Functional Units (FUs) • RFU has a two-level configuration memory: a multi-context memory (keeps two or four configurations) and a cache • FUs support only logical operations, add/subtract, shifts and compare • RFU updates the PC • RFU has a variable delay which depends on the size of the Custom Instruction

  12. RFU Architecture: A Quantitative Approach • 22 programs of MiBench were chosen • The SimpleScalar toolset was utilized for simulation • RFU is a matrix of FUs; design parameters: number of inputs, number of outputs, number of FUs, connections, location of inputs and outputs • Some definitions: • Frequency and weight are considered in the measurements • Frequency: CI execution frequency • Weight: equal to the number of executed instructions • Average = Σ(Freq × Weight) over all CIs • Rejection: percentage of CIs that could not be mapped on the RFU • Coverage: percentage of CIs that could be mapped on the RFU • Basic Block: a sequence of instructions that terminates in a control instruction • Hot Basic Block: a basic block executed more than a threshold number of times
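
As an illustration of how frequency and weight could enter the coverage and rejection figures, here is a small sketch. It assumes that each CI contributes Freq × Weight to the total and that coverage is the mapped share of that total; the slide does not spell out the exact formula, so this weighting is an interpretation, and the tuple layout is hypothetical.

```python
def weighted_coverage(cis):
    """cis: list of (freq, weight, mapped) tuples, one per custom instruction.

    freq   = execution frequency of the CI
    weight = number of instructions the CI replaces (assumed meaning)
    mapped = True if the CI could be mapped on the RFU
    Returns (coverage %, rejection %), both weighted by freq * weight.
    """
    total = sum(freq * weight for freq, weight, _ in cis)
    if total == 0:
        return 0.0, 0.0
    covered = sum(freq * weight for freq, weight, mapped in cis if mapped)
    coverage = 100.0 * covered / total
    return coverage, 100.0 - coverage
```

With this weighting, a rarely executed CI that fails to map costs far less coverage than a hot one.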

  13. RFU Architecture • Distributing inputs in different rows: Row 1 = 7, Row 2 = 2, Row 3 = 2, Row 4 = 2, Row 5 = 1 • Connections with variable length: row1 → row3 = 1, row1 → row4 = 1, row1 → row5 = 1, row2 → row4 = 1 • Synthesis results using Hitachi 0.18 μm: Area: 1.1534 mm², Delay: 9.66 ns

  14. Integrating RFU with the Base Processor [Figure: register file (Reg0 ... Reg31), Config Mem, Decoder, Sequencer, DEC/EXE pipeline registers, FU1 to FU4 of the RFU, EXE/MEM pipeline registers]

  15. Generation of Custom Instructions • Custom instructions exclude floating point, multiply, divide and load instructions • They include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions • A simple algorithm is used for generating custom instructions • HBBs usually include 10~40 instructions for MiBench • The custom instruction generator is to be executed on the base processor (in online training mode)
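
The slide does not describe the "simple algorithm" itself. The sketch below is one possible reading, offered only as an illustration: it scans a hot basic block in program order and groups contiguous runs of allowed instructions, closing a CI at an excluded instruction or when a second store or branch would be added. The opcode class names and the rule that a CI needs at least two instructions are assumptions.

```python
EXCLUDED = {"fp", "mul", "div", "load"}   # hypothetical opcode classes

def generate_cis(block):
    """block: list of (name, opclass) in program order.

    Returns a list of custom instructions, each a list of instruction names.
    """
    cis, current, stores, branches = [], [], 0, 0

    def flush():
        nonlocal current, stores, branches
        if len(current) > 1:              # assumed: a CI groups >= 2 instructions
            cis.append(current)
        current, stores, branches = [], 0, 0

    for name, opclass in block:
        if opclass in EXCLUDED:
            flush()                       # excluded instruction ends the CI
            continue
        if (opclass == "store" and stores == 1) or \
           (opclass == "branch" and branches == 1):
            flush()                       # a second store/branch starts a new CI
        stores += opclass == "store"
        branches += opclass == "branch"
        current.append(name)
    flush()
    return cis
```

This version ignores data dependencies and instruction reordering, which a real generator working on the DFG would have to take into account.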

  16. Mapping Custom Instructions • Mapping is the same as the well-known placement problem: • Determining the appropriate positions for DFG nodes on the RFU. • Assigning CI instructions to FUs is done based on the priority of the nodes.

  17. Mapping Custom Instructions • The slack of each node represents its criticality and also its priority for partitioning. • A slack equal to 0 means that the node is on the critical path of the DFG and should be scheduled with the highest priority. • For nodes with the same criticality, their ASAP level determines their mapping order.
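
The priority rule can be sketched as follows. This is an illustrative implementation under common scheduling conventions, not the authors' code: slack is taken as ALAP level minus ASAP level, levels start at 1, and among nodes with equal slack the lower ASAP level is assumed to come first (the slide only says that the ASAP level decides).

```python
def mapping_order(dfg):
    """dfg: {node: [successor, ...]}, every node present as a key.

    Returns (nodes sorted by mapping priority, slack per node).
    """
    preds = {n: [] for n in dfg}
    for u, succs in dfg.items():
        for v in succs:
            preds[v].append(u)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in dfg}
    ready = [n for n in dfg if indeg[n] == 0]
    topo = []
    while ready:
        u = ready.pop()
        topo.append(u)
        for v in dfg[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # ASAP: earliest level, one past the latest predecessor.
    asap = {}
    for u in topo:
        asap[u] = 1 + max((asap[p] for p in preds[u]), default=0)
    depth = max(asap.values())

    # ALAP: latest level, one before the earliest successor.
    alap = {}
    for u in reversed(topo):
        alap[u] = min((alap[s] for s in dfg[u]), default=depth + 1) - 1

    slack = {n: alap[n] - asap[n] for n in dfg}
    return sorted(dfg, key=lambda n: (slack[n], asap[n])), slack
```

Nodes with slack 0 form the critical path and are therefore placed first.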

  18. Mapping Algorithm (1/2) • First Step: determining an appropriate row for the selected node • Row number = Last Row (if the selected node is on a critical path whose length is greater than or equal to the RFU depth) • Row number = ALAP - slack - 1 (to prevent the occupation of FUs in the lower RFU rows by nodes that do not belong to critical paths)

  19. Mapping Algorithm (2/2) • Second Step: determining an appropriate column • The column is determined according to the minimum connection length criterion. • For each row, a maximum capacity is set to avoid gathering too many nodes in one row. • The capacity of the rows is determined with respect to the longest critical path and the number of critical paths in the DFG.

  20. An Example: Mapping of a CI on the RFU

  21. Generating Custom Instructions for the Target RFU • In our primary CI generator we did not consider any constraints for the generated CIs and tried to generate CIs as large as possible. • Therefore, some of the generated CIs cannot be mapped on the proposed RFU due to its constraints.

  22. Customizing CI generator for the Target RFU – First Approach • Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool to generate CIs that are mappable. • In this approach the CI generator is unaware of the results of the mapping process • Some CIs may ultimately not be mapped to the RFU due to the routing constraints

  23. Customizing CI generator for the Target RFU – Second Approach • Integrated Framework • Performs an integrated temporal partitioning and mapping process • Takes rejected CIs as input • Partitions them into appropriate mappable CIs • Adds nodes to the current partition while architectural constraints are satisfied • The ASAP level of nodes represents their execution order according to their dependencies • Advantages • Reducing the number of rejected CIs • Using a mapping-aware temporal partitioning process
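
The "add nodes while constraints are satisfied" step can be sketched as a greedy loop. This is a simplified illustration: `fits` is a hypothetical predicate standing in for the real check (number of inputs, outputs, nodes and a successful mapping attempt), and the nodes are assumed to arrive already sorted by ASAP level.

```python
def partition_by_asap(nodes_in_asap_order, fits):
    """Greedily split a rejected CI into partitions that satisfy `fits`.

    nodes_in_asap_order: DFG nodes sorted by ASAP level (a topological order)
    fits: callable(list_of_nodes) -> bool, assumed architectural check
    """
    partitions, current = [], []
    for node in nodes_in_asap_order:
        # Close the current partition when adding the node would break a constraint.
        if current and not fits(current + [node]):
            partitions.append(current)
            current = []
        current.append(node)
    if current:
        partitions.append(current)
    return partitions
```

Because ASAP order places every node after all of its predecessors, no partition produced this way depends on a later one.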

  24. Integrated Framework - Temporal Partitioning Algorithms • HTTP: traverses DFG nodes horizontally according to the ASAP level of the nodes; usually brings about more parallelism for instruction execution; may require large intermediate data • The size of intermediate data affects the data transfer rate and the size of the configuration memory. • VTTP: traverses the DFG nodes vertically; creates partitions with longer critical paths; reduces the size of intermediate data
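
The difference between the two traversals can be illustrated with a dependency-respecting walk of the DFG. In this sketch, which is an interpretation rather than the algorithms from the slides, the horizontal order takes ready nodes first-in first-out (which visits nodes level by level in ASAP order), while the vertical order takes the most recently readied node (which follows one path as deep as it can before starting another).

```python
from collections import deque

def traversal_order(dfg, vertical):
    """dfg: {node: [successor, ...]}, every node present as a key.

    vertical=False: level-by-level (ASAP) order, HTTP-like.
    vertical=True:  path-following order, VTTP-like.
    A node is only visited after all of its predecessors.
    """
    indeg = {n: 0 for n in dfg}
    for succs in dfg.values():
        for v in succs:
            indeg[v] += 1
    ready = deque(n for n in dfg if indeg[n] == 0)
    order = []
    while ready:
        u = ready.pop() if vertical else ready.popleft()
        order.append(u)
        for v in dfg[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order
```

Feeding either order to a partitioner that fills one partition at a time gives wide, shallow partitions in the horizontal case and narrow, deep ones in the vertical case.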

  25. Integrated Framework- Incremental Temporal Partitioning Algorithm • Incremental temporal partitioning process is performed iteratively • Each partition which does not satisfy RFU constraints is modified • A new iteration starts. • Two different partition modification strategies are used for HTTP and VTTP • The main difference is in the way of selecting the nodes to be moved to the next partition.

  26. Integrated Framework - Incremental Temporal Partitioning Algorithm • Incremental HTTP • The node with the highest ASAP level is selected and moved to the subsequent partition. • Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
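
A minimal sketch of the incremental HTTP modification step follows. It assumes a hypothetical `fits` predicate for the RFU constraints and precomputed ASAP levels; tie-breaking between nodes of equal ASAP level is arbitrary here, so it will not necessarily reproduce the moving order listed on the slide.

```python
def incremental_http(nodes, asap, fits):
    """Repeatedly move the highest-ASAP node to the next partition.

    nodes: all nodes of the CI
    asap:  {node: ASAP level}
    fits:  callable(list_of_nodes) -> bool, assumed RFU constraint check
    """
    partitions = []
    current = list(nodes)
    while current:
        spill = []                        # nodes pushed to the subsequent partition
        while len(current) > 1 and not fits(current):
            victim = max(current, key=lambda n: asap[n])
            current.remove(victim)
            spill.append(victim)
        partitions.append(current)
        current = spill                   # new iteration on the moved nodes
    return partitions
```

Since a successor always has a higher ASAP level than its predecessors, it is moved out before them, so a partition never ends up depending on a later one.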

  27. Integrated Framework - Incremental Temporal Partitioning Algorithm • Incremental VTTP: • A node with the highest ASAP level is selected and moved. • The next nodes are selected, in ASAP level order, from the path on which the previously moved node was located. • Node selection and moving order: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7.

  28. Customizing Mapping Tool • Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU

  29. Performance Evaluation • SimpleScalar was configured to behave as a 4-issue in-order RISC processor. The base processor supports the MIPS instruction set. • 22 applications of MiBench

  30. Delay of RFU according to CI length • Synopsys Tools + Hitachi 0.18μm

  31. CI length for MiBench applications

  32. Intermediate data size

  33. Maximum critical path length for CIs

  34. Speedup comparison

  35. References • Arnold, M., Corporaal, H., Designing domain-specific processors. In Proceedings of the Design, Automation and Test in Europe Conf, 2001, pp. 61-66. • Atasu, K., Pozzi, L., Ienne, P., Automatic application-specific instruction-set extensions under microarchitectural constraints, 40th Design Automation Conference, 2003. • Bobda, C., Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement, Ph.D thesis, Faculty of Computer Science, Electrical Engineering and Mathematics, University of Paderborn, 2003. • Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K., Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization, In Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, 2004. • Karthikeya, M., Gajjala, P., Dinesh, B., Temporal partitioning and scheduling data flow graphs for reconfigurable computer, IEEE Transactions on Computers, vol. 48, no. 6, 1999, pp. 579–590.

  36. References • Kastner, R., Kaplan, A., Ogrenci Memik, S., Bozorgzadeh, E., Instruction generation for hybrid reconfigurable systems, ACM TODAES, vol. 7, no. 4, 2002, pp. 605-627. • Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., Vemuri, R., An integrated partitioning and synthesis system for dynamically reconfigurable multi-FPGA architectures, In Proceedings of the Reconfigurable Architecture Workshop, 1998, pp. 31-36. • Spillane, J., Owen, H., Temporal partitioning for partially reconfigurable field programmable gate arrays, IPPS/SPDP Workshops, 1998, pp. 37-42. • Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H., Temporal partitioning methodology optimizing FPGA resources for dynamically reconfigurable embedded real-time system, International Journal of Microprocessors and Microsystems, vol. 27, 2003, pp. 115-130. • Yu, P., Mitra, T., Characterizing embedded applications for instruction-set extensible processors, In Proceedings of Design and Automation Conference, 2004, pp. 723-728.

  37. Thank you for listening
