Optimizing Parallel Embedded Systems

Optimizing Parallel Embedded Systems
Dr. Edwin Sha
Professor, Computer Science
University of Texas at Dallas
http://www.utdallas.edu/~edsha
[email protected]


Parallel Architecture is Ubiquitous

  • Parallel Architecture is everywhere

    • As small as a cellular phone

    • Modern DSP processor (VLIW), network processors

    • Modern CPU (instruction-level parallelism)

    • Your home PC (small number of processors)

    • Application-specific systems (image processing, speech processing, network routers, look-up table, etc.)

    • File server

    • Database server or web server

    • Supercomputers

      We are particularly interested in domain-specific HW/SW parallel systems.


Organization of the Presentation

  • Introduction to parallel architectures

  • Using sorting as an example to show various implementations on parallel architectures.

  • Introduction to embedded systems: strict constraints

  • Timing optimization: parallelize loops and nested loops.

    • Retiming, Multi-dimensional Retiming

    • Full Parallelism: all the nodes can be executed in parallel

  • Design space exploration and optimizations for code size, data memory, low-power, etc.

  • Intelligent prefetching and partitioning to hide memory latency

  • Conclusions


Technology Trend

  • Microprocessor performance increases 50% - 100% per year

  • Where does the performance gain come from? Clock rate and capacity.

  • Clock Rate increases only 30% per year


Technology Trend (cont.)

  • Transistor count grows much faster than clock rate.

    • It increases about 40% per year,

    • an order of magnitude more contribution over two decades.


Exploit Parallelism at Every Level

  • Algorithm level

  • Thread level

    • E.g., each service request is created as a thread.

  • Iteration level (loop level)

    • E.g., for_all i = 1 to n do {loop body}: all n iterations can be parallelized (see the sketch after this list).

  • Loop-body level (instruction level)

    • Parallelize the instructions inside a loop body as much as possible.

  • Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.
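Below is a minimal sketch (not from the slides) of iteration-level parallelism in C: with an OpenMP pragma, all N independent iterations of the loop may run in parallel. The array names are illustrative, and the pragma takes effect only when compiled with OpenMP support (e.g. -fopenmp).

  #include <stdio.h>

  #define N 1000000

  int main(void) {
      static double a[N], b[N];
      for (int i = 0; i < N; i++) b[i] = i;          /* set up inputs */

      /* iteration level: for_all i = 1 to n do {loop body} */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = b[i] * 2.0 + 1.0;                   /* iterations are independent */

      printf("%f\n", a[N - 1]);
      return 0;
  }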


Sorting on Linear Array of Processors

  • Input: x1, x2, .. xn. Output: A sorted sequence (Ascending order)

  • Architecture: a linear array of k processors. Assume k=n at first.

    • What is the optimal time for sorting? Obviously it takes O(n) time for data to reach the rightmost processor.

    • Let's consider different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.

    • Selection Sort

    • Insertion Sort

    • Bubble Sort

    • Bucket Sort

    • Sample Sort


Selection Sort

  • Algorithm: for i = 1 to n, pick the i-th smallest element.

  • Example: 5,1,2,4 → keep 1 (5,2,4 remain) → keep 2 (5,4 remain) → keep 4.

  • Timing: (n-1) + … + 2 + 1 = n(n-1)/2. In this 4-element example, 3 + 2 + 1 = 6 steps.

  • Is it good?


Insertion Sort

  • Example input: 5,1,2,4.

  • Timing: n only! (4 clock cycles in this example.)

  • Problem: a global bus is needed.


Pipeline Sorting without Global Wire

  • Organization: a systolic array. Each cell stores a value y, receives a stream x from the left, and emits z to the right.

  • Initially, y = ∞. On each input x:
    if x > y then z ← x
    else { z ← y; y ← x }

  • Each cell keeps the smallest value it has seen and passes the larger one to the next cell, so no global wire is needed (a small simulation sketch follows).
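A minimal sequential simulation of this systolic array (an illustrative sketch, not from the slides). Each cell keeps the smallest value it has seen; y = ∞ is modeled with INT_MAX. The simulation pushes each input all the way through the array, which collapses the pipelined timing but leaves the same final cell contents.

  #include <stdio.h>
  #include <limits.h>

  #define N 4

  int main(void) {
      int input[N] = {5, 1, 2, 4};
      int y[N];                                    /* value held by each cell  */
      for (int c = 0; c < N; c++) y[c] = INT_MAX;  /* initially y = "infinity" */

      for (int t = 0; t < N; t++) {
          int x = input[t];                        /* new value enters cell 0  */
          for (int c = 0; c < N; c++) {
              if (x <= y[c]) {        /* keep the smaller value in the cell    */
                  int z = y[c];       /* z <- y                                */
                  y[c] = x;           /* y <- x                                */
                  x = z;              /* the old y continues rightward         */
              }                       /* else z <- x: x just moves on          */
          }
      }
      for (int c = 0; c < N; c++) printf("%d ", y[c]);   /* prints 1 2 4 5     */
      printf("\n");
      return 0;
  }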


Bubble Sorting

  • The worst algorithm in the sequential model, but a good one on a linear array of processors.

  • 7 clock cycles in this example. How about n?

  • Timing: 2n-1 for n processors, i.e. O(n) time (a compare-exchange sketch follows below).

  • O(n·n/k) time for k processors.

  • Can we get O((n/k) log (n/k))?
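The compare-exchange form of bubble sort that is usually run on a linear array is odd-even transposition sort; the sketch below (illustrative, not from the slides) simulates it sequentially. In each of the n phases all compared pairs are disjoint, so with one element per processor each phase takes one parallel step, giving O(n) time.

  #include <stdio.h>

  static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

  int main(void) {
      int a[] = {5, 1, 2, 4};
      int n = 4;

      for (int phase = 0; phase < n; phase++) {
          int start = phase % 2;                  /* even phase: pairs (0,1),(2,3),... */
          for (int i = start; i + 1 < n; i += 2)  /* odd phase:  pairs (1,2),(3,4),... */
              if (a[i] > a[i + 1])
                  swap(&a[i], &a[i + 1]);         /* pairs are independent, so all     */
      }                                           /* exchanges in a phase can run in parallel */

      for (int i = 0; i < n; i++) printf("%d ", a[i]);
      printf("\n");
      return 0;
  }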


Bucket Sort

  • Can the running time be lower than the comparison-sort lower bound Ω(n log n), i.e. O(n)?

  • But it assumes the n elements are uniformly distributed over an interval [a, b].

  • The interval [a, b] is divided into k equal-sized subintervals called buckets.

  • Scan through each element and put it into the corresponding bucket. The number of elements in each bucket is about n/k.

  • Example: elements such as 125, 167, 102, 201, 257, 207, 399, 336, 318, 19, 5, 98 over [1, 400], with splitters at 100, 200, 300.


Bucket Sort (cont.)

  • Then sort each bucket locally.

  • The sequential running time is O(n + k(n/k) log (n/k)) = O(n log (n/k)).

  • If k = n/128, then we get O(n) algorithm.

  • Parallelization is straightforward (a sequential sketch follows below).

  • It is pretty good. Very little communication required between processors.

  • But what happens when the input data are not uniformly distributed? One bucket may get almost all the elements.

    • How can we smartly pick the splitters so that each bucket has at most 2n/k elements? (Sample sort)
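A minimal sequential sketch of bucket sort over [a, b] (illustrative, not from the slides); the interval [1, 400] and sample values follow the figure above, and each bucket is sorted locally with qsort.

  #include <stdio.h>
  #include <stdlib.h>

  static int cmp_int(const void *x, const void *y) {
      return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
  }

  void bucket_sort(int *v, int n, int a, int b, int k) {
      int *count = calloc(k, sizeof *count);
      int **bucket = malloc(k * sizeof *bucket);
      for (int j = 0; j < k; j++) bucket[j] = malloc(n * sizeof **bucket);

      /* scatter: bucket index from the element's position within [a, b] */
      for (int i = 0; i < n; i++) {
          int j = (int)((long long)(v[i] - a) * k / (b - a + 1));
          bucket[j][count[j]++] = v[i];
      }
      /* sort each bucket locally, then concatenate the buckets */
      int pos = 0;
      for (int j = 0; j < k; j++) {
          qsort(bucket[j], count[j], sizeof *bucket[j], cmp_int);
          for (int i = 0; i < count[j]; i++) v[pos++] = bucket[j][i];
          free(bucket[j]);
      }
      free(bucket); free(count);
  }

  int main(void) {
      int v[] = {125, 167, 102, 201, 257, 207, 399, 336, 318, 19, 5, 98};
      int n = 12;
      bucket_sort(v, n, 1, 400, 4);
      for (int i = 0; i < n; i++) printf("%d ", v[i]);
      printf("\n");
      return 0;
  }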


Sample Sort

  • First step: splitter selection (an important step)

  • Smartly select k-1 splitters from some samples.

  • Second Step: Bucket sort using these splitters on k buckets.

  • Guarantee: Each bucket has at most 2n/k elements.

  • Directly divide n input elements into k blocks of size n/k each and sort each block.

  • From each sorted block it chooses k-1 evenly spaced elements. Then sort these k(k-1) elements.

  • Select the k-1 evenly spaced elements from these k(k-1) elements as the final splitters (a sketch follows below).

  • Scan through the n input elements and use these k-1 splitters to put each element to the corresponding bucket.
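A hedged sketch of the deterministic splitter selection (the first two steps); it assumes for brevity that n is a multiple of k, and the function and variable names are illustrative rather than taken from the slides.

  #include <stdlib.h>
  #include <string.h>

  static int cmp_int(const void *x, const void *y) {
      return (*(const int *)x > *(const int *)y) - (*(const int *)x < *(const int *)y);
  }

  /* Fills splitter[0..k-2]; assumes n is a multiple of k. */
  void select_splitters(const int *v, int n, int k, int *splitter) {
      int blk = n / k;
      int *work   = malloc(n * sizeof *work);
      int *sample = malloc(k * (k - 1) * sizeof *sample);
      memcpy(work, v, n * sizeof *work);

      /* Step 1: sort each of the k blocks and take k-1 evenly spaced samples. */
      for (int b = 0; b < k; b++) {
          qsort(work + b * blk, blk, sizeof *work, cmp_int);
          for (int j = 1; j < k; j++)
              sample[b * (k - 1) + (j - 1)] = work[b * blk + j * blk / k];
      }
      /* Step 2: sort the k(k-1) samples and take k-1 evenly spaced splitters. */
      qsort(sample, k * (k - 1), sizeof *sample, cmp_int);
      for (int j = 1; j < k; j++)
          splitter[j - 1] = sample[j * (k - 1)];

      free(work); free(sample);
  }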


Sample Sort (cont.)

  • Sequential: O(n log (n/k)) + O(k² log k) + O(n log (n/k)).

  • Not an O(n) alg. But it is very efficient for parallel implementation

  • Step 1: sort each of the k blocks. Step 2: sort the k(k-1) samples and pick the final splitters. Step 3: bucket sort using these splitters.


Randomized Sample Sort

  • Processor 0 randomly picks d × k samples, where d is the over-sampling ratio (such as 64 or 128).

  • Sort these samples and select k-1 evenly spaced numbers as splitters.

  • With high probability, the splitters are picked well. I.e. with low probability, there is a big bucket.

  • But cannot be used for hard real-time systems.

  • Sorting 5 million numbers on a SUN cluster of 4 machines using MPI, in our tests:

    • Randomized sample sort takes 5 seconds

    • Deterministic sample sort takes 10 seconds

    • Radix sort takes > 500 seconds (too much communication).


Embedded Systems Overview

  • Embedded computing systems

    • Computing systems embedded within electronic devices

    • Repeatedly carry out a particular function or a set of functions.

    • Nearly any computing system other than a desktop computer is an embedded system

    • Billions of units produced yearly, versus millions of desktop units

    • About 50 per household, 50 - 100 per automobile


Some Common Characteristics of Embedded Systems

  • Application Specific

    • Executes a single program, repeatedly

    • New ones might be adaptive and/or support multiple modes

  • Tightly-constrained

    • Low cost, low power, small, fast, etc.

  • Reactive and real-time

    • Continually reacts to changes in the system’s environment

    • Must compute certain results in real-time without delay


A “Short List” of Embedded Systems

Anti-lock brakes

Auto-focus cameras

Automatic teller machines

Automatic toll systems

Automatic transmission

Avionic systems

Battery chargers

Camcorders

Cell phones

Cell-phone base stations

Cordless phones

Cruise control

Curbside check-in systems

Digital cameras

Disk drives

Electronic card readers

Electronic instruments

Electronic toys/games

Factory control

Fax machines

Fingerprint identifiers

Home security systems

Life-support systems

Medical testing systems

Modems

MPEG decoders

Network cards

Network switches/routers

On-board navigation

Pagers

Photocopiers

Point-of-sale systems

Portable video games

Printers

Satellite phones

Scanners

Smart ovens/dishwashers

Speech recognizers

Stereo systems

Teleconferencing systems

Televisions

Temperature controllers

Theft tracking systems

TV set-top boxes

VCR’s, DVD players

Video game consoles

Video phones

Washers and dryers


  • And the list grows longer each year.


An Embedded System Example -- A Digital Camera

  • Digital camera chip blocks: lens, CCD, CCD preprocessor, A2D and D2A converters, pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, ISA bus interface, UART, LCD controller.

  • Single-functioned -- always a digital camera

  • Tightly-constrained -- low cost, low power, small, fast


Design Metric Competition -- Improving One May Worsen Others

Key competing metrics: power, performance, size, NRE cost.

Expertise with both software and hardware is needed to optimize design metrics -- not just a hardware or software expert, as is common.

A designer must be comfortable with various technologies in order to choose the best for a given application and constraints.

Serious design space exploration is needed.


Processor Technology

  • Processors vary in their customization for the problem at hand. The desired functionality, e.g.

    total = 0
    for i = 1 to N loop
      total += M[i]
    end loop

    can be mapped onto a general-purpose processor (software), an application-specific processor, or a single-purpose processor (hardware).


Design Productivity Gap

  • Figure: IC capacity (logic transistors per chip, in millions) grows much faster than designer productivity (K transistors per staff-month), 1981-2009, leaving a widening gap.

  • 1981 leading edge chip required 100 designer months

    • 10,000 transistors / 100 transistors/month

  • 2002 leading edge chip requires 30,000 designer months

    • 150,000,000 / 5000 transistors/month

  • Designer cost increase from $1M to $300M


More Challenges Coming

  • Parallel

    • Consists of multiple processors together with dedicated hardware.

  • Heterogeneous, Networked

    • Each processor has its own speed, memory, power, reliability, etc.

  • Fault-Tolerance, Reliability & Security

    • A major issue for critical applications

  • Design Space Explorations: timing, code-size, data memory, power consumption, cost, etc.

  • System-Level Design, Analysis, and Optimization are important.

  • Compiler is playing an important role. We need more research.

  • Let's start with timing optimization, then other optimizations and design-space issues.


Timing Optimization

  • Parallelization for Nested Loops

  • Focus on computation or data intensive applications.

  • Loops are the most critical parts.

  • Multi-dimensional (MD) systems: uniform nested loops.

  • Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.

    • ALU part: MD Retiming to fully parallelize computations.

    • Memory part: prefetching and partitioning to hide memory latencies.

  • Developed by Edwin Sha’s group. The results are exciting.


Graph Representation for Loops

  A[0] = A[1] = 0;
  for (i = 2; i < n; i++)
  {
    A[i] = D[i-2] / 3;
    B[i] = A[i] * 5;
    C[i] = A[i] + 7;
    D[i] = B[i] + C[i];
  }

  • DFG: nodes A, B, C, D with edges A→B, A→C, B→D, C→D (no delays) and D→A carrying two delays (from the reference to D[i-2]).


Schedule Looped DFG

  • DFG and static schedule: control step 1: A; step 2: B, C; step 3: D; repeated for every iteration.

  • Schedule length = 3.


Rotation: Loop Pipelining

  • Original schedule, regrouped schedule, and rotated schedule: node A of the next iteration is moved up ("rotated") to execute with B and C of the current iteration.

  • The result is a prologue (the first copy of A), a shorter repeating loop body, and an epilogue (the final B, C, D).


Graph Representation Using Retiming

  • Before retiming (DAG view of the loop body): longest zero-delay path = 3 (A → B/C → D).

  • After retiming node A: longest zero-delay path = 2.


Multi-dimensional Problems

  DO 10 J = 0, N
    DO 1 I = 0, M
      d(i,j) = b(i,j-1) * c(i-1,j)    (node D)
      a(i,j) = d(i,j) * 0.5           (node A)
      b(i,j) = a(i,j) + 1.0           (node B)
      c(i,j) = a(i,j) + 2.0           (node C)
  1   CONTINUE
  10 CONTINUE

Circuit optimization view (MDFG): edge B → D carries delay (0,1) (z2^-1) and edge C → D carries delay (1,0) (z1^-1); the edges from A to B and C and from D to A carry no delay.


An Example of DSP Processor: TI TMS320C64X

  • Clock speed: 1.1 GHz, up to 8800 MIPS.


One-Dimensional Retiming (Leiserson-Saxe, '91)

  • Figure: a simple loop (for I = 1, ...) containing a multiplication by 1.3, shown before and after one-dimensional retiming.

Another Example

  • Retimed loop (for I = 1, ...), with its prologue (a runnable sketch follows):

    A(1) = B(-1) + 1
    for I = 1, ...
      B(I) = A(I) × 1.3
      A(I+1) = B(I-1) + 1
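A runnable sketch of this transformation (an assumption-laden illustration, not taken from the slides): assuming the original loop was A(I) = B(I-2) + 1; B(I) = A(I) × 1.3, retiming the A node once yields the prologue A(1) = B(-1) + 1 and the loop body shown above, in which the two statements no longer form a zero-delay chain and can be issued in parallel.

  #include <stdio.h>

  #define N   10
  #define OFF 2                     /* offset so that indices -1, -2 are storable */

  int main(void) {
      double A[N + OFF + 2] = {0};  /* holds A(-2) .. A(N+1) */
      double B[N + OFF + 1] = {0};  /* holds B(-2) .. B(N)   */
      double *a = A + OFF, *b = B + OFF;

      a[1] = b[-1] + 1;             /* prologue produced by retiming              */
      for (int I = 1; I <= N; I++) {
          b[I]     = a[I] * 1.3;    /* after retiming, these two statements are   */
          a[I + 1] = b[I - 1] + 1;  /* independent within one iteration           */
      }
      printf("B(N) = %f\n", b[N]);
      return 0;
  }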


Retiming

  • An integer-valued transformation r on the nodes

  • Registers (delays) are redistributed

  • G = < V, E, d >  →  Gr = < V, E, dr >

  • dr(e) = d(e) + r(u) - r(v) ≥ 0 for every edge e: u → v  (legal retiming)

  • The number of delays along any cycle remains constant (a legality-check sketch follows).
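A minimal sketch (not from the slides) that applies a retiming function to the A, B, C, D example above and checks legality with dr(e) = d(e) + r(u) - r(v) ≥ 0; with r(A) = 1 every retimed edge weight stays non-negative, matching the retimed graph shown earlier.

  #include <stdio.h>

  enum { A, B, C, D, NODES };

  struct edge { int u, v, d; };

  int main(void) {
      /* Edges of the example DFG: D->A carries 2 delays, the rest carry 0. */
      struct edge e[] = { {A,B,0}, {A,C,0}, {B,D,0}, {C,D,0}, {D,A,2} };
      int n_edges = 5;
      int r[NODES] = {1, 0, 0, 0};   /* retime node A once: r(A) = 1 */

      int legal = 1;
      for (int i = 0; i < n_edges; i++) {
          int dr = e[i].d + r[e[i].u] - r[e[i].v];   /* dr(e) = d(e) + r(u) - r(v) */
          printf("edge %d->%d: d=%d, dr=%d\n", e[i].u, e[i].v, e[i].d, dr);
          if (dr < 0) legal = 0;
      }
      printf("retiming is %s\n", legal ? "legal" : "illegal");
      return 0;
  }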


Multi-Dimensional Retiming

  • A nested loop

  • Illegal cases

  • Retiming nested loops

  • New problems …







Iteration Space for Retimed Graph

Legal schedule with row-wise executions. S=(0,1)



Required Solution Needs:

  • To avoid illegal retiming

  • To be general

  • To obtain full parallelism

  • To be a fast Algorithm


Schedule Vector (wavefront processing)

Legal schedule: s · d > 0 for every delay vector d.





Chained MD Retiming

  • Schedule plane S+: we need s · (1,1) > 0 and s · (-1,1) > 0.

  • Pick s = (0,1); choose the retiming vector r orthogonal to s (r ⊥ s), e.g. r = (1,0).


S Must Be Feasible for New Delay Vectors

Let the new delay be d' = d + kr. We know s · d > 0 and s · r = 0, so

  s · (d + kr) = s · d + k (s · r) = s · d > 0,

hence s remains a legal schedule for d + kr. For example, with s = (0,1), r = (1,0) and d = (-1,1): s · d = 1 > 0, and d + 2r = (1,1) still gives s · (d + 2r) = 1 > 0.

  • Figure: the schedule plane with s, r, and the retimed delay vectors d1, d2, d3.





Synchronous Circuit Optimization Example – original design (cont.)

Critical path: 6 adders and 2 multipliers.



Synchronous Circuit Optimization Example – retimed design Gnanasekaran’88

The critical path is now the minimum possible.


Embedded System Design Review

  • Strict requirements

    • Time, power-consumption, code size, data memory size, hardware cost, areas, etc.

    • Time-to-market, time-to-prototype

  • Special architectural support

    • Harvard architecture, on-chip memory, register files, etc.

  • Increasing amount of software

    • Flexibility of software, short time-to-market, easy upgrades

    • The amount of software is doubling every two years.

  • How to generate high-quality code for embedded systems?

  • How to minimize and search the design space?

  • Compiler role?


Compiler in Embedded Systems

  • Ordinary C compilers for embedded processors are notoriously known for their poor code quality.

    • Data memory overhead for compiled code can reach a factor of 5

    • Cycle overhead can reach a factor of 8, compared with the manually generated assembly code. (Rozenberg et al., 1998)

  • Compilers matter both for code generation and for design space exploration: in general, compilers are included in the design-exploration control loop, and therefore they play an important role in the design phase.

  • Exploring efficient designs in a huge, n-dimensional space, where each dimension corresponds to a design choice.


Compiler in Design Space Exploration

  • Algorithm selection: to analyze dependencies between algorithms and processor architectures.

  • HW/SW partitioning

  • Memory related issues (program and data memory)

    • Optimally placing programs and data in on-chip memories, and hiding off-chip memory latencies by smart pre-fetching (Sha et al.).

    • Data mapping for processors with multiple on-chip memory modules. (Zhuge and Sha)

    • Code size reduction for software-pipelined applications. (Zhuge and Sha)

  • Instruction set options

    • Search for power-optimized instruction sets (Kin et al. 1999)

    • Scheduling for loops and DAG with the min. energy. (Shao and Sha)


Design Space Minimization

  • A direct design space is too large. Must do design space minimization before exploration.

  • Derive the properties of relationships of design metrics

    • A huge number of design points can be proved to be infeasible before performing time-consuming design space exploration.

  • Using our design space minimization algorithm, the design points can be reduced from 510 points to 6 points for synthesizing All-pole Filter, for example.

  • Approach:

    • Develop effective optimization techniques: code size, time, data memory, low-power

    • Understand the relationship among them

    • Design space minimization algorithms

    • Efficient design space exploration algorithms using fuzzy logic


Example Relation of Optimizations

  • Figure: the A, B, C, D data flow graph before and after retiming.

  • Retiming

    • Transforms a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.

    • The cycle period c(G) of a DFG G is the computation time of the longest zero-delay path.

  • Unfolding

    • The original DFG G is unfolded f times, so the unfolded graph G_f consists of f copies of the original node set.

    • Iteration period: P = c(G_f) / f.

    • Code size is increased f times; software pipelining increases it further (a small unfolding sketch follows).
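A minimal sketch (not from the slides) of unfolding the earlier A, B, C, D loop with f = 2: the loop body contains two copies of each node, so the code size doubles, and the iteration period becomes c(G_f)/f. D[0] and D[1] are initialized here only to make the sketch self-contained, and n is assumed even.

  #include <stdio.h>

  #define N 10

  int main(void) {
      double A[N], B[N], C[N], D[N];
      A[0] = A[1] = D[0] = D[1] = 0.0;

      for (int i = 2; i < N; i += 2) {      /* two original iterations per pass */
          /* copy 0 (iteration i) */
          A[i]   = D[i - 2] / 3;
          B[i]   = A[i] * 5;
          C[i]   = A[i] + 7;
          D[i]   = B[i] + C[i];
          /* copy 1 (iteration i+1) */
          A[i+1] = D[i - 1] / 3;
          B[i+1] = A[i+1] * 5;
          C[i+1] = A[i+1] + 7;
          D[i+1] = B[i+1] + C[i+1];
      }
      printf("%f\n", D[N - 1]);
      return 0;
  }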


Experimental Results

  • The search space size using our method is only 2% of the search space using the standard method on average.

  • The quality of the solutions found by our algorithm is better than that of the standard method.


A Design Space Minimization Problem

  • Clearly understand the relationships: retiming, unfolding and iteration period.


Program Memory Consideration with Code Size Minimization

  • Multiple on-chip memory banks, but usually only one program memory.

  • The capacity of an on-chip memory bank is very limited

    • Motorola's DSP56K has only 512 x 24-bit words of program memory

    • ARM940T uses a 4K instruction cache (Icache)

    • StrongARM SA-1110 uses a 16K cache

  • A widely used performance optimization technique, software pipelining, expands the code size to several times the original code size.

  • Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses.

  • The code size becomes a critical concern for many embedded processors.


Code Size Expansion Caused by Software Pipelining

Software-pipelined code (prologue, kernel, epilogue):

A[1] = E[-3] + 9;

A[2] = E[-2] + 9;

B[1] = A[1] * 0.5;

C[1] = A[1] + B[-1];

A[3] = E[-1] + 9;

B[2] = A[2] * 0.5;

C[2] = A[2] + B[0];

D[1] = A[1] * C[1];

for i=1 to n-3 do

A[i+3] = E[i-1] + 9;

B[i+2] = A[i+2] * 0.5;

C[i+2] = A[i+2] + B[i];

D[i+1] = A[i+1] * C[i+1];

E[i] = D[i] + 30;

End

B[n] = A[n] * 0.5;

C[n] = A[n] + B[n-2];

D[n-1] = A[n-1] * C[n-1];

E[n-2] = D[n-2] + 30;

D[n] = A[n] * C[n];

E[n-1] = D[n-1] + 30;

E[n] = D[n] + 30;

  • The schedule length is decreased from 4 cycles to 1 cycle.

  • The code size is expanded to 3 times the original code size.

The original loop:

for i=1 to n do

A[i] = E[i-4] + 9;

B[i] = A[i] * 0.5;

C[i] = A[i] + B[i-2];

D[i] = A[i] * C[i];

E[i] = D[i] + 30;

end


Rotation Scheduling

  • Resource-constrained loop scheduling based on the retiming concept.

  • Retiming gives a clear framework for the software pipelining depth.

  • Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule.

  • In each rotation step, the nodes in the first row are:

    • retimed once, by pushing one delay from each of the node's incoming edges and adding one delay to each of its outgoing edges;

    • rescheduled to an available location (such as the earliest one) in the schedule, based on the new precedence relations defined by the retimed graph.

  • The optimal schedule length can be obtained in polynomial time (2|V|) in most cases.

  • The techniques can be generalized to deal with code-size, switching activities, branches, nested loops, etc.



Schedule a Cyclic DFG

  • DFG and static schedule: step 1: A; step 2: B, C; step 3: D; repeated each iteration.

  • Schedule length = 3.


Rotation: Loop Pipelining

  • Figure: original schedule, rotation, and rescheduling. Node A of the next iteration is rotated into the first row and rescheduled next to B and C, producing a prologue (A), a repeating two-step body (B, C, A / D), and an epilogue (B, C / D).


Retiming View of Loop Pipelining

  • Figure: the DFG before and after retiming node A.

  • Before: cycle period = 3. After: cycle period = 2.


The Second Rotation

  • Figure: the schedule after the 1st rotation phase, the 2nd rotation, and the final schedule after rescheduling.

  • After the second rotation the loop body shrinks to a single control step (B, C, D, A), with a correspondingly longer prologue and epilogue.


Retiming View of Loop Pipelining (cont.)

  • After the first rotation: r(A)=1, r(B)=r(C)=r(D)=0; cycle period = 2.

  • After the second rotation: r(A)=2, r(B)=r(C)=1, r(D)=0; cycle period = 1.


Prologue and Retiming Function

  • Figure: the original schedule, the schedule after the 1st rotation (r(A)=1), and the schedule after the 2nd rotation (r(A)=2).

  • The number of copies of a node v in the prologue = r(v).

  • The number of copies of a node v in the epilogue = (max_u r(u)) - r(v), for u ∈ V.

  • For example, with r(A)=2 and max_u r(u)=2, node A appears twice in the prologue and not at all in the epilogue, while node D with r(D)=0 appears only in the epilogue (twice).


CRED Technique Using Predicate Registers

  • Predicate register

    • An instruction can be guarded by a predicate register.

    • The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.

  • Implement CRED using a predicate register with a counter (TI's TMS320C6x)

    • Set the initial value p = (max_u r(u)) - r(v).

    • Decrement p by one in each iteration.

    • The instruction is executed when 0 ≥ p > -n, where n is the loop count of the original loop. The instruction is disabled when p > 0 or p ≤ -n. (A small simulation sketch follows.)
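A small sketch (illustrative, not from the slides) simulating counter-based predicate registers for two of the nodes in the next slide's schedule (r(A) = 3, r(E) = 0, n = 5): each counter starts at (max_u r(u)) - r(v), is decremented every iteration, and enables its instruction exactly while 0 ≥ p > -n.

  #include <stdio.h>

  static int enabled(int p, int n) { return p <= 0 && p > -n; }

  int main(void) {
      int n = 5;                 /* original trip count                    */
      int r_max = 3;             /* max retiming value over all nodes      */
      int r_A = 3, r_E = 0;      /* example retiming values from the deck  */
      int pA = r_max - r_A;      /* predicate counter for node A: starts 0 */
      int pE = r_max - r_E;      /* predicate counter for node E: starts 3 */

      for (int i = 0; i < n + r_max; i++) {   /* pipelined iterations      */
          printf("iter %d: A %s, E %s\n", i,
                 enabled(pA, n) ? "on" : "off",
                 enabled(pE, n) ? "on" : "off");
          pA--; pE--;                          /* decrement every iteration */
      }
      return 0;
  }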


The New Execution Sequence

  • Figure: the software-pipelined loop schedule with r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0, and n=5; the execution sequence after performing CRED using 4 conditional registers; and the new code size.


Processor Classes

  • Processor Class 0: No predicate register

    • Motorola’s StarCore DSP processor

  • Processor Class 1: Has “condition code” bits in instruction, no predicate register

    • Intel’s StrongARM and other ARM architectures

  • Processor Class 2: Has 1-bit predicate registers

    • Philips' TriMedia multimedia processor

  • Processor Class 3: Has predicate registers with counters

    • TI’s TMS320C6x processor

  • Processor Class 4: Specialized hardware support for executing software-pipelined loops

    • IA64


Code Size Reduction for Class 3

Software-pipelined code (prologue, kernel, epilogue):

A[1] = E[-3] + 9;

A[2] = E[-2] + 9;

B[1] = A[1] * 0.5;

C[1] = A[1] + B[-1];

A[3] = E[-1] + 9;

B[2] = A[2] * 0.5;

C[2] = A[2] + B[0];

D[1] = A[1] * C[1];

for i=1 to n-3 do

A[i+3] = E[i-1] + 9;

B[i+2] = A[i+2] * 0.5;

C[i+2] = A[i+2] + B[i];

D[i+1] = A[i+1] * C[i+1];

E[i] = D[i] + 30;

End

B[n] = A[n] * 0.5;

C[n] = A[n] + B[n-2];

D[n-1] = A[n-1] * C[n-1];

E[n-2] = D[n-2] + 30;

D[n] = A[n] * C[n];

E[n-1] = D[n-1] + 30;

E[n] = D[n] + 30;

CRED version using four counted predicate registers p, q, r, s (only the kernel remains):

p=0;q=1;r=2;s=3

for i=1 to n-3 do

[p]A[i+3] = E[i-1] + 9;

p--;

[q]B[i+2] = A[i+2] * 0.5;

[q]C[i+2] = A[i+2] + B[i];

q--;

[r]D[i+1] = A[i+1] *C[i+1]

r--;

[s]E[i] = D[i] + 30;

s--;

end


CRED on Various Types of Processors

  • The TI model and IA64 are very efficient for code size reduction.

  • The TI model is very effective for DSP processors that support predicate registers but lack the specialized hardware found in IA64.


Experimental Results on Code Size/Performance Trade-off

  • Code size/performance exploration for the All-pole Filter on a modified TMS320C6x processor with only 2 predicate registers.

  • The code size increases as the software pipelining depth increases and the schedule length decreases.

  • Our approach finds the shortest schedule length satisfying a given code size constraint.


Code-size Reduction for Nested Loops

  • Figure: the cell dependency graph over the iteration space and the 12-node data flow graph, with dependence vectors (0,1) and (1,0).

  • Assume 8 functional units.

  • Traditional software pipelining can only achieve 6 clock cycles at best.

  • Interchanging the loop indices cannot help the optimization.

(a) The original loop:

  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 instructions)
    Inner loop (10 cycles, 12 instr., trip count = n)
    Outer 2 (5 cycles, 15 instr.)

  Assuming m = 1000, n = 10:
  Total cycles = m(6 + 10n + 5) = 10mn + 11m = 111,000
  Code size = 42 instr.


MD Retiming and Code Reduction

  • Figure: the retimed data flow graph, with delay vectors such as (-4,1) and (1,0) on its edges.

(b) Inner-outer combined software pipelining:

  Outer loop begin (trip count = m)
    Outer 1 & Prologue (12 cycles, 15+28 instr.)
    Inner loop (2 cycles, 12 instr., trip count = n-4)
    Outer 2 & Epilogue (12 cycles, 15+20 instr.)

  Total cycles = m(12 + 2(n-4) + 12) = 2mn + 16m = 36,000
  Code size = 90 instr.

(c) Code size reduction: remove the prologue and epilogue:

  Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15+4 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.)

  Total cycles = m(6 + 2(n+4) + 5) = 2mn + 19m = 39,000
  Code size = 50 instr.

Retimed DFG: r(1)=r(2)=r(3)=r(4)=(4,0), r(5)=r(6)=(3,0), r(7)=r(8)=(2,0), r(9)=r(10)=(1,0), r(11)=r(12)=(0,0)


Outer Loop Pipeline and Code Reduction

(d) Outer loop pipelining:

  Outer loop begin (trip count = m-1)
  Outer 1 (6 cycles, 19 instr.)
  Inner loop (2 cycles, 16 instr., trip count = n+4)
  Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 19 instr.)
  Inner loop (2 cycles, 16 instr., trip count = n+4)
  Outer 2 (5 cycles, 15 instr.)

  Total cycles = 6 + (m-1)(2(n+4) + 6) + 2(n+4) + 5 = 2mn + 14m + 5 = 34,005
  Code size = 100 instr.

(e) Reduce the new epilogue:

  Outer loop begin (trip count = m)
  Outer 1 (6 cycles, 19+1 instr.)
  Inner loop (2 cycles, 16 instr., trip count = n+4)
  Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 20 instr.)

  Total cycles = 6 + m(2(n+4) + 6) = 2mn + 14m + 6 = 34,006
  Code size = 71 instr.

(The totals are recomputed in the sketch below.)
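The quoted totals can be re-derived mechanically; the small sketch below (illustrative, not from the slides) just recomputes cases (a)-(e) for m = 1000 and n = 10.

  #include <stdio.h>

  int main(void) {
      long m = 1000, n = 10;
      long a = m * (6 + 10 * n + 5);                  /* (a) original loop            */
      long b = m * (12 + 2 * (n - 4) + 12);           /* (b) combined pipelining      */
      long c = m * (6 + 2 * (n + 4) + 5);             /* (c) prologue/epilogue removed */
      long d = 6 + (m - 1) * (2 * (n + 4) + 6)
                 + 2 * (n + 4) + 5;                   /* (d) outer loop pipelined     */
      long e = 6 + m * (2 * (n + 4) + 6);             /* (e) new epilogue reduced     */
      printf("(a)=%ld (b)=%ld (c)=%ld (d)=%ld (e)=%ld\n", a, b, c, d, e);
      /* prints: (a)=111000 (b)=36000 (c)=39000 (d)=34005 (e)=34006 */
      return 0;
  }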


Data Memory Consideration with Optimal Data Mapping

  • Architecture: the ALU is connected to two on-chip data memory banks (Bank 0 via data bus DB0, Bank 1 via DB1).

  • Multiple memory banks are accessible in parallel.

  • This provides higher memory bandwidth.

  • Many existing compilers cannot exploit this kind of architectural feature; instead, all variables are assigned to just one bank.

  • The technique of data mapping and scheduling becomes one of the most important factors in performance optimization.


IIR Filter -- Data Flow Graph

  • Figure: the IIR filter data flow graph (25 operation nodes, numbered 0-24, reading and writing the variables A through G).


Our Model -- Variable Independence Graph

  • Figure: nodes are the variables A through G; edge weights (e.g. 1/2, 7/8, 2) give the gain of separating the two variables; the graph is split into Partition 1 and Partition 2.

  • Weight(e = (u,v)) is the "gain" obtained by putting u and v in different memory modules. We want to find a maximum-weight partition (a simple greedy sketch follows).
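A simple greedy heuristic sketch for the maximum-weight bipartition (illustrative only; this is not the authors' repartitioning algorithm). Only a few of the edge weights from the figure are filled in; each variable is placed in whichever bank gains more weight from its already-placed neighbors.

  #include <stdio.h>

  #define V 7   /* variables A..G */

  int main(void) {
      /* w[u][v]: gain from putting u and v in different memory banks. */
      double w[V][V] = {{0}};
      w[0][1] = w[1][0] = 0.5;     /* A-B, illustrative weight */
      w[0][6] = w[6][0] = 0.875;   /* A-G, illustrative weight */
      w[3][4] = w[4][3] = 2.0;     /* D-E, illustrative weight */
      /* remaining edges omitted in this sketch */

      int bank[V];
      for (int u = 0; u < V; u++) {
          double gain0 = 0, gain1 = 0;       /* gain if u goes to bank 0 / bank 1 */
          for (int v = 0; v < u; v++) {
              if (bank[v] == 1) gain0 += w[u][v];   /* neighbor already in bank 1 */
              else              gain1 += w[u][v];   /* neighbor already in bank 0 */
          }
          bank[u] = (gain1 > gain0) ? 1 : 0;
      }
      for (int u = 0; u < V; u++)
          printf("%c -> bank %d\n", 'A' + u, bank[u]);
      return 0;
  }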


Experimental Results

  • The IG approach uses list scheduling and an interference graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).

  • Our approach uses rotation scheduling with variable repartitioning algorithm and variable independence graph.

  • Different approaches result in different variable partitions.

  • The largest improvement on schedule length using our approach is 52.9%. The average improvement on the benchmarks is 44.8%.


Conclusions

  • An exciting area: optimizations for parallel DSP and embedded systems. This talk gave an overview; much more work is needed.

  • Consider both architectures and compilers.

  • Presented techniques:

    • Multi-dimensional (MD) retiming, Rotation

    • Code-size minimization for software pipelined loops

    • Design space minimization

    • Optimal partitioning and prefetching to completely hide memory latencies, and to decide the minimum required on-chip memory.

  • Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in the tutorial. There is still a lot more.

  • Please check my web page for details: www.utdallas.edu/~edsha

