RISPP: Rotating Instruction Set Processing Platform

RISPP: Rotating Instruction Set Processing Platform Lars Bauer, Muhammad Shafique, Simon Kramer and Jörg Henkel Chair for Embedded Systems (CES) University of Karlsruhe

Outline • Motivation • Related Work • Our RISPP Approach: • Special Instructions (SIs) composition • Forecasting SI usages • Run-time architecture • Results & Evaluation

Development of Embedded Systems • Typical: • Static analysis of hot spots • Building tightly optimized system • Nowadays: • Increasing complexity • More functionality • Problem: • Statically chosen design point has to match all requirements • Typically inefficient for individual components (e.g. tasks or hot spots)

Possible Solution: Extensible Processors

Related Work: Extensible Processors • S Kobayashi, K Mita, Y Takeuchi, M Imai: “Design space exploration for dsp applications using the ASIP development system PEAS-III”, ICASSP 2002 • A Hoffmann, T Kogel, A Nohl, G Braun, O Schliesbusch, O Wahlen, A Wieferink, H Meyr: “A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language”, IEEE Trans. on CAD of Int. Circ. and Syst. 01 • K Atasu, L Pozzi, P Ienne: “Automatic application-specific instruction-set extensions under microarchitectural constraints”, DAC, 2003 • F Sun, S Ravi, A Raghunathan, NK Jha: “A scalable application-specific processor synthesis methodology”, ICCAD, 2003 • N Cheung, S Parameswaran, J Henkel: “A quantitative study and estimation models for extensible instructions in embedded processors”, ICCAD, 2004 • …

Problem: Various Hot-Spots

Related Work: Reconfigurable Computing • K Compton, S Hauck: “Reconfigurable computing: a survey of systems and software”, ACM Computing Surveys 2002 • F Barat, R Lauwereins: “Reconfigurable instruction set processors: a survey”, RSP 2000 • RD Wittig, P Chow: “OneChip: an FPGA processor with reconfigurable logic”, IEEE Symp. FCCM, 1996 • S Vassiliadis, S Wong, G Gaydadjiev, K Bertels, G Kuzmanov, EM Panainte: “The MOLEN polymorphic processor”, IEEE Transactions on Computers, 2004 • …

Dynamic System Behavior • Extensible Processor: choosing points in design space at design time • Reconfigurable Computing: typically fix at compile time when and how to deploy reconfigurable hardware • How to handle situations that are unknown at design and compile time? • (while still supporting various extensible instructions) • Depending on input data (e.g. different computational paths in video encoder) • Which tasks/applications will be executed together?

Our New Concept: Basic Idea and Overview • At design time: fix the amount of reconfigurable hardware • At compile time: compose Special Instructions (SIs) out of highly reusable data paths • At run time: dynamically determine the implementation of an SI • Altogether: Rotate the Instruction Set

Fundamental Idea: Atom / Molecule Model • Figure: one example Atom and two example Molecules • Key: • Multiple implementations per SI (Molecules) • Each Molecule is composed of Atoms • Implementation hierarchy • Atoms are more reusable • Molecules are more specific • Advantage: Enables dynamic trade-off • Drawback: Higher design effort • Atom: elementary data path (smaller granularity) • Molecule: combination of Atoms (bigger granularity) • Special Instr.: Application-specific assembly instruction

Formal Atom / Molecule Model: Example • Molecule relations are e.g. needed when Molecules comprise each other • In such cases we can first configure the smallest possible Molecule with the required functionality and then upgrade to faster implementations • Figure: Molecules plotted by # Atoms A1 and # Atoms A2 (in general: n-dimensional), e.g. (1,4) and (3,5); legend: Molecule, relation “is bigger or equal than”, infimum of the Molecules, supremum of the Molecules

Formal Atom / Molecule Model: Details • Main data structure: Set of all Molecules • Meta-Molecule to implement two Molecules, such that they can be executed consecutively, i.e. temporal domain (Abelian Group) • Meta-Molecule for the common Atoms (indicator for compatibility) • Relation (Complete Lattice), with • Supremum: Meta-Molecule that is needed to implement all Molecules • Infimum: Meta-Molecule that is collectively needed for all Molecules

Formal Atom / Molecule Model: Details • Determinant: number of Atoms needed to implement a Molecule • Upgrading: Atoms that are additionally needed to implement o, assuming m is already available
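The formulas of the formal model did not survive in this transcript, so the following is only a minimal sketch of how the described operations could look. It assumes that a Molecule is a vector whose i-th entry counts the instances of Atom type i, that the Meta-Molecule for consecutive execution is the component-wise maximum, and that the Meta-Molecule of common Atoms is the component-wise minimum. All function names are hypothetical.

```python
from functools import reduce
from typing import Iterable, Tuple

# Assumption: a Molecule is a vector; entry i = number of instances of Atom type i.
Molecule = Tuple[int, ...]


def union(m: Molecule, o: Molecule) -> Molecule:
    """Meta-Molecule able to implement m and o consecutively (component-wise max)."""
    return tuple(max(a, b) for a, b in zip(m, o))


def intersection(m: Molecule, o: Molecule) -> Molecule:
    """Meta-Molecule of the Atoms common to m and o (component-wise min)."""
    return tuple(min(a, b) for a, b in zip(m, o))


def leq(m: Molecule, o: Molecule) -> bool:
    """Relation: o is bigger than or equal to m in every Atom type."""
    return all(a <= b for a, b in zip(m, o))


def supremum(molecules: Iterable[Molecule]) -> Molecule:
    """Meta-Molecule needed to implement all given Molecules."""
    return reduce(union, molecules)


def infimum(molecules: Iterable[Molecule]) -> Molecule:
    """Meta-Molecule collectively needed for all given Molecules."""
    return reduce(intersection, molecules)


def determinant(m: Molecule) -> int:
    """Number of Atoms needed to implement m."""
    return sum(m)


def upgrade(m: Molecule, o: Molecule) -> Molecule:
    """Atoms additionally needed to implement o, assuming m is already available."""
    return tuple(max(b - a, 0) for a, b in zip(m, o))
```

With the two points from the example slide, `upgrade((1, 4), (3, 5))` gives `(2, 1)`: three more Atoms turn the smaller Molecule into the faster one.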

Instruction Set Rotation Time • Loading time depends on: • Atom size • Reconfiguration bandwidth • For our examples: 0.84 – 0.95 ms • Figure: Execution and reconfiguration times for SATD_4x4 for 1 frame • Altogether: Hardware has to be available when needed, so start loading early
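The dependency of the loading time on Atom size and reconfiguration bandwidth can be illustrated with a small calculation. This is a sketch under assumptions: the bitstream size and bandwidth below are hypothetical round numbers, not the values behind the 0.84 – 0.95 ms figure, and Atoms are assumed to be loaded one after another.

```python
def atom_loading_time_ms(bitstream_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to load one Atom: bitstream size divided by reconfiguration bandwidth."""
    return 1000.0 * bitstream_bytes / bandwidth_bytes_per_s


def latest_start_ms(needed_at_ms: float, atoms_to_load: int,
                    bitstream_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Latest point in time to start loading so that all Atoms are ready when needed.

    Assumption: Atoms are loaded sequentially, so loading times add up.
    """
    per_atom = atom_loading_time_ms(bitstream_bytes, bandwidth_bytes_per_s)
    return needed_at_ms - atoms_to_load * per_atom


# Hypothetical numbers: 50 kB per Atom bitstream, 50 MB/s reconfiguration bandwidth.
print(atom_loading_time_ms(50_000, 50_000_000))        # 1.0 ms per Atom
print(latest_start_ms(10.0, 3, 50_000, 50_000_000))    # start no later than 7.0 ms
```

The second value shows why loading has to start early: three Atoms needed at 10 ms must begin loading 3 ms beforehand.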

SI Forecasting: Example • Control-flow graph • Each node is a Base-Block (BB) • At compile time: • Determine points to forecast an SI • Add Forecast Instructions with forecast values (about the SI importance) to these points • At run time: • Use the Forecasts to determine the Instruction Set rotation • Dynamically update the importance of the forecasted SIs • Figure: control-flow graph with the Forecast Instruction “forecast SATD_4x4, 42”, the time for Instruction Set rotation, the executions of SATD_4x4, and the return from subroutine

Inserting Forecast Points (FCs): General Idea of Algorithm • I. Pre-computations from profiling data for each Special Instruction (SI) • II. For every SI determine Forecast Candidates • III. Optimize list of FC Candidates and select final forecasts

I. Pre-Computations • Pre-computations are done on control-flow graph using profiling-information • Temporal Distance from Base Block to SI execution • Probability that the SI executions are reached • Number of executions of this SI (if it is executed)
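Two of these pre-computations can be sketched on a small control-flow graph. The sketch assumes an acyclic graph whose edges carry branch probabilities from profiling and whose blocks carry a cycle count; loop handling and the exact definitions used by RISPP are not given in the slides. The function name and the data layout are hypothetical. The third value, the number of SI executions, is taken directly from profiling data and is not computed here.

```python
from functools import lru_cache


def precompute(cfg, cycles, si_block):
    """Return {block: (reach_probability, expected_distance_in_cycles)}.

    cfg:      {block: [(successor, branch_probability), ...]}, assumed acyclic
    cycles:   {block: execution time of the block in cycles}
    si_block: the Base Block that executes the Special Instruction
    """

    @lru_cache(maxsize=None)
    def reach(block):
        # Probability that execution starting in `block` reaches the SI.
        if block == si_block:
            return 1.0
        return sum(p * reach(succ) for succ, p in cfg.get(block, []))

    @lru_cache(maxsize=None)
    def distance(block):
        # Expected cycles until the SI, counted only over paths that reach it.
        if block == si_block:
            return 0.0
        r = reach(block)
        if r == 0.0:
            return float("inf")
        weighted = sum(p * reach(succ) * distance(succ)
                       for succ, p in cfg.get(block, []) if reach(succ) > 0.0)
        return cycles[block] + weighted / r

    blocks = set(cfg) | {succ for succs in cfg.values() for succ, _ in succs}
    return {b: (reach(b), distance(b)) for b in blocks}


# Hypothetical profile: the SI sits in block D; block C bypasses it.
example_cfg = {"A": [("B", 0.8), ("C", 0.2)], "B": [("D", 1.0)], "C": [("E", 1.0)]}
example_cycles = {"A": 10, "B": 20, "C": 5, "D": 50, "E": 5}
print(precompute(example_cfg, example_cycles, "D")["A"])
```

For block A the sketch yields a reach probability of 0.8 and a temporal distance of about 30 cycles, the two quantities a forecast decision would weigh against each other.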

II. Forecast Decision Function (FDF)

III. Optimize list of FC Candidates • General idea: While the forecasted SIs in a Base Block consume too much area: remove the forecast with the worst ratio of achieved speedup to exclusively used Atoms
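The general idea can be sketched as a greedy loop. This is an illustration under assumptions, not the authors' implementation: area is simplified to the number of distinct Atom types, a forecast whose Atoms are all shared is treated as free to keep, and the function name, Atom names, and speedup values are hypothetical.

```python
from typing import Dict, Set, Tuple

# Assumption: a forecast is (achieved speedup, set of Atom types it uses).
Forecast = Tuple[float, Set[str]]


def trim_forecasts(forecasts: Dict[str, Forecast], area_budget: int) -> Dict[str, Forecast]:
    """Drop forecasts of one Base Block until the used Atoms fit the area budget."""
    kept = dict(forecasts)

    def area() -> int:
        # Simplification: area = number of distinct Atom types over all kept forecasts.
        atoms: Set[str] = set()
        for _, used in kept.values():
            atoms |= used
        return len(atoms)

    def score(si: str) -> Tuple[float, float]:
        speedup, used = kept[si]
        others: Set[str] = set()
        for name, (_, other_used) in kept.items():
            if name != si:
                others |= other_used
        exclusive = used - others
        # Removing a forecast without exclusive Atoms frees no area: rank it last.
        ratio = speedup / len(exclusive) if exclusive else float("inf")
        return (ratio, speedup)

    while kept and area() > area_budget:
        worst = min(kept, key=score)  # worst speedup per exclusively used Atom
        del kept[worst]
    return kept


# Hypothetical speedups and Atom sets for the three SIs of the test application.
candidates = {
    "SATD_4x4": (10.0, {"A1", "A2", "A3"}),
    "DCT_4x4": (4.0, {"A1", "A2"}),
    "HT_4x4": (1.5, {"A4", "A5"}),
}
print(sorted(trim_forecasts(candidates, area_budget=3)))
```

With a budget of three Atoms, HT_4x4 is removed first: it gains the least speedup for the two Atoms that only it uses.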

Main Tasks of the Run-Time Architecture • a) Monitoring Forecasts and Special Instructions: • Fine-tune the forecasted importance to reflect varying run-time situation • b) Selecting Molecules to implement SIs: • Dynamically choose an SI implementation that matches the current needs of the application • c) Realize the taken decisions: • Determine a loading sequence for the Atoms & control the SI execution
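Task b) can be illustrated with a brute-force selection. The slides do not state the selection algorithm, so this is only a sketch under assumptions: each SI offers several Molecules (including a software fallback that needs no Atoms), the forecasted importance is an expected execution count, and the goal is the lowest total latency that fits into the available Atom Containers. The latencies and Atom vectors are hypothetical; the execution counts 256 and 16 are the per-MacroBlock counts of SATD_4x4 and DCT_4x4 from the results section.

```python
from itertools import product


def select_molecules(candidates, expected_execs, containers):
    """Pick one Molecule per SI with minimal expected latency within the Atom budget.

    candidates:     {si: [(atom_vector, latency_cycles), ...]}; the all-zero vector
                    stands for plain software execution (assumption)
    expected_execs: {si: forecasted number of executions}
    containers:     number of available Atom Containers
    """
    names = list(candidates)
    best_choice, best_cost = None, float("inf")
    for combo in product(*(candidates[name] for name in names)):
        # Atoms needed to hold all chosen Molecules at once: component-wise maximum.
        needed = [max(column) for column in zip(*(vector for vector, _ in combo))]
        if sum(needed) > containers:
            continue
        cost = sum(expected_execs[name] * latency
                   for name, (_, latency) in zip(names, combo))
        if cost < best_cost:
            best_choice, best_cost = dict(zip(names, combo)), cost
    return best_choice, best_cost


# Hypothetical Molecules over two Atom types: (atom_vector, latency in cycles).
molecules = {
    "SATD_4x4": [((0, 0), 300), ((1, 1), 40), ((2, 2), 20)],
    "DCT_4x4": [((0, 0), 200), ((1, 0), 30), ((2, 1), 15)],
}
execs = {"SATD_4x4": 256, "DCT_4x4": 16}
for budget in (2, 3, 4):
    print(budget, select_molecules(molecules, execs, budget))
```

Exhaustive search is only workable for a handful of SIs; it is used here because it makes the trade-off between Atom Containers and latency easy to see.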

Run-time Architecture example • Two tasks run alternately, sharing the available Atom Containers • Only one task may determine the content of an Atom Container, but both can use them • [SASO’07]: “A Self-Adaptive Extensible Embedded Processor” (IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Boston, July 9-11)

Results & Evaluation: Flow of Test Application • Core part of Encoding Engine of ITU-T H.264 • Special Instructions (# executions per MacroBlock): • SATD_4x4 (256) • DCT_4x4 (16) • HT_4x4 (1) • Focus: Proof of concept, not automatic SI detection

Designing an Atom for the three transform operations • Consider constraints • Max size of data path • Number of I/O signals • Number of control signals • Increase re-usability • Combine similar data paths (MUX)

Composing Molecules for SATD_4x4 • Increased re-usability

Performance vs. Area Trade-off • Figure: axis “Area requirements [# loaded Atoms]” with ticks 0, 5, 10, 15, max

Hardware Feasibility Study • Xilinx Virtex-II 3000 (xc2v3000-6ff1152) • Board: Xilinx HW-AFX-FF1152-200 • Floor-planning with PlanAhead

Special Instruction Execution Time for Different Resources

Application Execution Time

Time matters! • Design Time: • Fix the available reconfigurable hardware resources • Compile Time: • Determine Special Instructions • Determine composition out of Atoms / Molecules • Profile the application • Add Forecast Points to the application • Run Time: • Dynamically update the forecasted importance of the SIs • Choose Molecule implementation for SIs • The art is to find the right trade-off between design-/compile-time and run-time

Summary & Conclusion • Hierarchical Special Instruction (SI) composition • Atom / Molecule model • Use resources more efficiently • Offer multiple SI implementations • Forecasting SI usages at compile time • Pre-computations from profiling and graph analysis • Forecast Decision Function • Push more decisions to run time • Which SI implementation (dynamic trade-off) • Adapting to run-time situation • There is a large potential for improving the way current Extensible Processors work

Thank you for your attention! RISPP: Rotating Instruction Set Processing Platform Lars Bauer, Muhammad Shafique, Simon Kramer and Jörg Henkel Chair for Embedded Systems (CES) University of Karlsruhe http://ces.univ-karlsruhe.de Lars Bauer, CES, University of Karlsruhe, DAC 2007

Atom-Container Interconnections

III. Select final Forecasts • Optimization goals • As few FCs as possible (smaller code size, fewer executed cycles), as many as needed (provide all necessary information to the run-time system) • Choose FCs with a good trade-off between ‘sufficiently early’ and a ‘high execution probability’ • For each SI start Depth-First-Searches on the FC Candidates on the transposed Base Block graph (i.e. all edges reversed) • Figure legend: green BB = FC Candidate; result: final Forecast
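The graph mechanics of this step can be sketched as follows. The slide does not spell out the selection rule, so the sketch only shows how to transpose the Base Block graph and run a depth-first search from each FC Candidate to find the candidates that lie earlier in the control flow. The function names and the example graph are hypothetical.

```python
def transpose(cfg):
    """Reverse all edges of the Base Block graph {block: [successors]}."""
    reversed_cfg = {block: [] for block in cfg}
    for block, successors in cfg.items():
        for successor in successors:
            reversed_cfg.setdefault(successor, []).append(block)
    return reversed_cfg


def earlier_candidates(cfg, candidates):
    """For each FC Candidate, collect the candidates reachable on the transposed graph.

    These are the candidates that can execute before it, i.e. the alternatives
    that would forecast the SI earlier.
    """
    reversed_cfg = transpose(cfg)
    result = {}
    for start in candidates:
        seen = {start}
        stack = list(reversed_cfg.get(start, []))
        found = set()
        while stack:  # iterative depth-first search over predecessor edges
            block = stack.pop()
            if block in seen:
                continue
            seen.add(block)
            if block in candidates:
                found.add(block)
            stack.extend(reversed_cfg.get(block, []))
        result[start] = found
    return result


# Hypothetical Base Block graph with FC Candidates A, B and D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(earlier_candidates(graph, {"A", "B", "D"}))
```

A selection rule could then compare each candidate with its earlier alternatives, trading ‘sufficiently early’ against ‘high execution probability’ as stated in the optimization goals.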

RISPP Area Savings

II. FDF Details • Explanation and Parameter Description: • T: Time (Rot: for Rotation; SW: for SW Execution) • p: Probability • E: Energy • α: Parameter for Energy vs. Speedup fine-tuning
