
Lin, Hai; Fei, Yunsi


Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives


ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), 2010

Date: 2010/05/20

吳俊雄

- INTRODUCTION
- MULTI-OBJECTIVE ASIP DESIGN
- Two Algorithms for Custom Instruction Synthesis
- Mixed Integer Linear Programming
- Simulated Annealing Method

- EXPERIMENTAL RESULTS

- Traditional custom instruction synthesis flows for ASIPs mainly target performance improvement.
- We show that the existing custom instruction exploration algorithms
- Mixed Integer Linear Programming (MILP)
- Simulated Annealing Method

- And cost estimation methods
- Performance improvement
- Energy efficiency
- Area overhead

- Our work presented in this paper has three major contributions:
- We address the importance of energy and resource efficiency in ASIP design.
- We discuss a set of key factors during custom instruction selection.
- We show that traditional design space exploration algorithms are either infeasible or inefficient at estimating all the necessary factors.

- Since the theoretical complexity of exploring the design space thoroughly is O(2^n), most practical techniques adopt heuristics to prune the design space during the search.
- We present a holistic ASIP synthesis and simulation flow that allows the optimization goal to be adjusted among energy efficiency, area overhead and performance.

- There are two major energy factors:
- Instruction fetch consumes a considerable portion of the total energy within a processor.
- The data communication between operations is originally implemented through register file accesses within the base processor.

- The dynamic energy consumption is affected by the reduction in the number of instructions and data register file accesses.

Custom processor 1 with CFU1 achieves better performance improvement, because it utilizes operation parallelism in the DFG to reduce the total execution cycles.

Custom processor 2 with CFU2 achieves larger energy saving, because it realizes a sub-graph covering more operations and data transfer edges.

We show that generating custom instructions from a DFG can be viewed as solving an operation scheduling problem.

The scheduling scheme should ensure data dependency and that the input/output edges of each software stage satisfy the I/O constraint set by the register file ports.

For a scheduling scheme, the number of software stages with operations in them represents the number of instructions for the customized processor.

The edges across different software stages represent register file accesses.

S3,4 = 1 (operation 3 is assigned to software stage 4)
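The mapping from a scheduling scheme to these two quantities can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary-based DFG encoding and the five-operation example are hypothetical, and counting each cross-stage edge as exactly one register file access is a simplifying assumption.

```python
from typing import Dict, List, Tuple


def count_instructions_and_accesses(
    stage_of: Dict[str, int],
    edges: List[Tuple[str, str]],
) -> Tuple[int, int]:
    """Return (non-empty software stages, edges that cross stages)."""
    # Every stage that holds at least one operation becomes one instruction.
    num_instructions = len(set(stage_of.values()))
    # An edge whose endpoints sit in different stages goes through the
    # register file (assumption: one access per such edge).
    cross_stage_edges = sum(
        1 for src, dst in edges if stage_of[src] != stage_of[dst]
    )
    return num_instructions, cross_stage_edges


# Hypothetical DFG: op1 and op2 feed op3, which feeds op4 and op5.
edges = [("op1", "op3"), ("op2", "op3"), ("op3", "op4"), ("op3", "op5")]
stage_of = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}

sn, se = count_instructions_and_accesses(stage_of, edges)
print(sn, se)
```

Packing op1, op2 and op3 into one stage hides two data transfer edges inside the custom instruction, which is the effect the energy argument relies on.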

- Mixed Integer Linear Programming (MILP)
- Primary Variable definition:
i: index of the operations, l: index of software stages.

- Parameter definition: hardware execution delay
k is the index of operation types.

- Assistant Variable definition: execution cycle delay
- Constraints:
- data dependency constraint
- I/O constraint
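The two constraints can be illustrated as a feasibility check on a candidate schedule. The sketch below is not the MILP formulation itself: the function name and data layout are hypothetical, the defaults of 4 inputs and 2 outputs are taken from the register file setting used in the experiments, and primary inputs and outputs of the DFG are ignored for brevity.

```python
from typing import Dict, List, Set, Tuple


def is_feasible(
    stage_of: Dict[str, int],
    edges: List[Tuple[str, str]],
    max_inputs: int = 4,
    max_outputs: int = 2,
) -> bool:
    """Check data dependency and per-stage I/O limits for a schedule."""
    # Data dependency: a producer must not be placed after its consumer.
    for src, dst in edges:
        if stage_of[src] > stage_of[dst]:
            return False

    # I/O constraint: count distinct values entering and leaving each stage.
    inputs: Dict[int, Set[str]] = {}
    outputs: Dict[int, Set[str]] = {}
    for src, dst in edges:
        if stage_of[src] != stage_of[dst]:
            inputs.setdefault(stage_of[dst], set()).add(src)
            outputs.setdefault(stage_of[src], set()).add(src)

    inputs_ok = all(len(v) <= max_inputs for v in inputs.values())
    outputs_ok = all(len(v) <= max_outputs for v in outputs.values())
    return inputs_ok and outputs_ok


# Hypothetical DFG and schedule.
edges = [("op1", "op3"), ("op2", "op3"), ("op3", "op4"), ("op3", "op5")]
good = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}
bad = {"op1": 1, "op2": 0, "op3": 0, "op4": 1, "op5": 1}  # op1 after op3

print(is_feasible(good, edges), is_feasible(bad, edges))
```

A value consumed by several operations in another stage is counted once, on the assumption that it occupies a single register file port.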

Sd6=0.8


SN: the number of instructions

SE: the total number of data accesses

For multi-issue, out-of-order processors, the execution cycle count equals the longest execution path delay of the DFG.
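The longest execution path delay of a DFG can be computed with a single pass in topological order. The sketch below assumes each node carries a delay in cycles; the function name and the example delays are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def longest_path_delay(
    delay: Dict[str, int],
    edges: List[Tuple[str, str]],
) -> int:
    """Return the largest sum of node delays along any path of a DAG."""
    preds: Dict[str, List[str]] = {op: [] for op in delay}
    succs: Dict[str, List[str]] = {op: [] for op in delay}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Kahn-style traversal: a node is ready once all its producers finished.
    pending = {op: len(preds[op]) for op in delay}
    ready = [op for op in delay if pending[op] == 0]
    finish: Dict[str, int] = {}
    while ready:
        op = ready.pop()
        start = max((finish[p] for p in preds[op]), default=0)
        finish[op] = start + delay[op]
        for nxt in succs[op]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)

    if len(finish) != len(delay):
        raise ValueError("graph contains a cycle")
    return max(finish.values(), default=0)


# Hypothetical DFG with per-node delays in cycles.
edges = [("op1", "op3"), ("op2", "op3"), ("op3", "op4"), ("op3", "op5")]
delay = {"op1": 1, "op2": 2, "op3": 1, "op4": 1, "op5": 3}

print(longest_path_delay(delay, edges))
```

In this example the critical path is op2, op3, op5.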

- The largest number of operations of this type among the different software stages.
- The number of functional modules (operators) of type k needed in the final custom hardware extension.
- The unit hardware area of functional module type k.
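One way to read these area terms is sketched below. It assumes that the number of modules of type k equals the largest per-stage count of type-k operations, and that the total area is the sum over k of module count times unit area. Both are assumptions for illustration; the operation types and unit areas are made up.

```python
from collections import Counter
from typing import Dict


def area_overhead(
    stage_of: Dict[str, int],
    type_of: Dict[str, str],
    unit_area: Dict[str, float],
) -> float:
    """Estimate area as sum over types of (max per-stage count * unit area)."""
    # Count how many operations of each type every software stage contains.
    per_stage: Dict[int, Counter] = {}
    for op, stage in stage_of.items():
        per_stage.setdefault(stage, Counter())[type_of[op]] += 1

    total = 0.0
    for k, area_k in unit_area.items():
        # Modules can be reused across stages, so only the busiest stage
        # determines how many modules of type k are needed (assumption).
        modules_needed = max((c[k] for c in per_stage.values()), default=0)
        total += modules_needed * area_k
    return total


# Hypothetical schedule, operation types and unit areas.
stage_of = {"op1": 0, "op2": 0, "op3": 0, "op4": 1, "op5": 1}
type_of = {"op1": "add", "op2": "add", "op3": "mul", "op4": "add", "op5": "shift"}
unit_area = {"add": 1.0, "mul": 4.0, "shift": 0.5}

print(area_overhead(stage_of, type_of, unit_area))
```

Here stage 0 needs two adders and one multiplier, stage 1 needs one adder and one shifter, so two adders, one multiplier and one shifter are instantiated.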

Cost function terms: energy consumption, area overhead, execution cycle.

The advantage of applying MILP to solve the scheduling problem is that, theoretically, it can find the optimum solution to the problem with sufficient searching time.

Simulated Annealing Method

Solution Vector definition: OPv = {op1, op2, op3, ..., opn}

Solution variation mechanism:

In each iteration, we randomly select n operations and move them to a different software stage to generate a new solution.

n represents the maximum distance between current solution and the one it evolves to. t is the current temperature, T is the starting temperature and N is the total number of operations.

R=[3~8]

The allowable range for a certain operation to move around is determined by the location of its parent and child nodes.

In our algorithm, the actual moving range for an operation is further tightened by the current temperature: range = R * sqr(t/T). We randomly move the operation to a software stage within this range.
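The move described above can be sketched as follows. Several details are assumptions: `sqr` is passed in as a `shrink` function and defaults to the square root, which is only one possible reading; the window is truncated to an integer; an operation may share a stage with its parents or children; and the current schedule is assumed to be feasible. All names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List


def propose_stage(
    op: str,
    stage_of: Dict[str, int],
    parents: Dict[str, List[str]],
    children: Dict[str, List[str]],
    t: float,
    T: float,
    R: int,
    rng: random.Random,
    shrink: Callable[[float], float] = math.sqrt,  # assumed reading of "sqr"
) -> int:
    """Pick a new software stage for `op` within the tightened range."""
    current = stage_of[op]
    # Temperature-tightened window: range = R * sqr(t / T).
    window = int(R * shrink(t / T))
    low, high = current - window, current + window
    # Parents and children bound where the operation may legally go.
    if parents.get(op):
        low = max(low, max(stage_of[p] for p in parents[op]))
    if children.get(op):
        high = min(high, min(stage_of[c] for c in children[op]))
    low = max(low, 0)  # stages are numbered from 0 in this sketch
    return rng.randint(low, high)  # inclusive on both ends


# Hypothetical neighbourhood: op3 sits between its parents and its child.
stage_of = {"op1": 0, "op2": 0, "op3": 2, "op4": 4}
parents = {"op3": ["op1", "op2"]}
children = {"op3": ["op4"]}
rng = random.Random(0)

hot = propose_stage("op3", stage_of, parents, children, t=100.0, T=100.0, R=3, rng=rng)
cold = propose_stage("op3", stage_of, parents, children, t=1.0, T=100.0, R=3, rng=rng)
print(hot, cold)
```

At the starting temperature the operation can land anywhere between its parents and its child; once the temperature has dropped far enough the window shrinks to zero and the operation stays where it is.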

Solution acceptance mechanism: a new solution is accepted when its cost is smaller than that of the current solution; when its cost is larger, it can still be accepted with a probability of p.
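The expression for p is not reproduced in the text above. A common textbook choice is the Metropolis criterion, p = exp(-(new cost - current cost) / t), and the sketch below uses it purely as an assumption; it should not be read as the formula used in the paper.

```python
import math
import random


def accept(new_cost: float, current_cost: float, t: float, rng: random.Random) -> bool:
    """Decide whether to move to the new solution at temperature t."""
    # Better solutions are always accepted.
    if new_cost < current_cost:
        return True
    # Worse solutions are accepted with probability p, which shrinks as the
    # cost gap grows or the temperature falls (Metropolis rule, assumed).
    p = math.exp(-(new_cost - current_cost) / t)
    return rng.random() < p


rng = random.Random(0)
print(accept(1.0, 2.0, t=1.0, rng=rng))    # improvement, always accepted
print(accept(10.0, 1.0, t=1e-9, rng=rng))  # much worse at near-zero temperature
```

At high temperature almost any move is accepted, which lets the search escape local minima; as t approaches zero the rule degenerates into pure descent.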

The Simulated Annealing algorithm balances the trade-off between solution quality and search time.

CPLEX is used to solve the MILP problem for design space exploration.

The baseline processor is an out-of-order MIPS-style processor.

Set the ratio between the weight variables g1 and g2 to be 12.2 : 1.

Set the register file I/O constraints to be 4/2.

We perform experiments for energy reduction and for performance improvement by setting the variables ε2 and ε3 to zero, and ε1 and ε2 to zero, respectively.
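How zeroing two of the three weights selects a single objective can be shown with a weighted-sum cost. The weighted-sum form, the parameter names and the sample metric values below are assumptions made for illustration; the metrics are taken to be normalized so that they are comparable.

```python
def weighted_cost(
    energy: float,
    area: float,
    cycles: float,
    w_energy: float,
    w_area: float,
    w_cycles: float,
) -> float:
    """Combine the three objectives into one scalar cost (assumed form)."""
    return w_energy * energy + w_area * area + w_cycles * cycles


# Hypothetical normalized metrics of one candidate design.
energy, area, cycles = 0.8, 1.2, 0.7

# Energy-only exploration: area and cycle weights set to zero.
energy_only = weighted_cost(energy, area, cycles, 1.0, 0.0, 0.0)
# Performance-only exploration: energy and area weights set to zero.
performance_only = weighted_cost(energy, area, cycles, 0.0, 0.0, 1.0)

print(energy_only, performance_only)
```

Intermediate weight settings trade the objectives off against each other instead of optimizing one in isolation.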

The average speedups are 1.42 for Binary Tree, 1.64 for MILP (p.) and 1.56 for MILP (e.).

The average energy consumption reductions are 18.1%, 22.7% and 29.8%.

The custom instruction templates presented in (b) and (c) target performance and energy efficiency, respectively. There are more operations in the templates identified for energy efficiency, shown in (c), and they include longer critical paths than the sub-graphs shown in (b).

Settings compared (with ε3 = 0): ε1 = 1, ε2 = 0 and ε1 = ε2 = 0.5.

For different designs, the ratio between ε1 and ε2 can be varied to find the best trade-off between them.

The SA algorithm achieves an average performance speedup of 1.46, slightly lower than the 1.64 achieved by the MILP algorithm.