slide1

Optimizing Parallel Embedded Systems
Dr. Edwin Sha
Professor, Computer Science
University of Texas at Dallas
http://www.utdallas.edu/~edsha
edsha@utdallas.edu

parallel architecture is ubiquitous
Parallel Architecture is Ubiquitous
  • Parallel Architecture is everywhere
    • As small as cellular phone
    • Modern DSP processor (VLIW), network processors
    • Modern CPU (instruction-level parallelism)
    • Your home PC (small number of processors)
    • Application-specific systems (image processing, speech processing, network routers, look-up table, etc.)
    • File server
    • Database server or web server
    • Supercomputers

Interested in domain-specific HW/SW parallel systems

organization of the presentation
Organization of the Presentation
  • Introduction to parallel architectures
  • Using sorting as an example to show various implementations on parallel architectures.
  • Introduction to embedded systems: strict constraints
  • Timing optimization: parallelize loops and nested loops.
    • Retiming, Multi-dimensional Retiming
    • Full Parallelism: all the nodes can be executed in parallel
  • Design space exploration and optimizations for code size, data memory, low-power, etc.
  • Intelligent prefetching and partitioning to hide memory latency
  • Conclusions
technology trend
Technology Trend
  • Microprocessor performance increases 50% - 100% per year
  • Where does the performance gain come from? Clock rate and capacity.
  • Clock Rate increases only 30% per year
technology trend1
Technology Trend
  • Transistor count grows much faster than clock rate.
      • Increases about 40% per year
      • An order of magnitude more contribution over two decades
exploit parallelism at every level
Exploit Parallelism at Every Level
  • Algorithm level
  • Thread level
    • E.g. each request for service is created as a thread
  • Iteration level (loop level)
    • E.g. For_all i = 1 to n do {loop body}
    • All n iterations can be parallelized (see the sketch below)
  • Loop body level (instruction level)
    • Parallelize instructions inside a loop body as much as possible
  • Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.
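To make the iteration-level case concrete, here is a minimal C sketch (my own illustration, not from the slides) of a For_all-style loop expressed with an OpenMP pragma; compile with -fopenmp, otherwise the pragma is simply ignored.

#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* Iteration level: every iteration is independent, so all N
       iterations may run in parallel (cf. For_all i = 1 to n). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * 2.0 + 1.0;   /* loop body */
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}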
sorting on linear array of processors
Sorting on Linear Array of Processors
  • Input: x1, x2, .. xn. Output: A sorted sequence (Ascending order)
  • Architecture: a linear array of k processors. Assume k=n at first.
    • What is the optimal time for sorting? Obviously it takes O(n) time for data to reach the rightmost processor.
    • Let's consider the different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.
    • Selection Sort
    • Insertion Sort
    • Bubble Sort
    • Bucket Sort
    • Sample Sort
selection sort
Selection Sort
  • Algorithm: for i = 1 to n, pick the ith smallest element.
  • Example: 5, 1, 2, 4 → keep 1 (5, 2, 4 remain), keep 2 (5, 4 remain), keep 4 (5 remains).
  • Timing: (n-1) + … + 2 + 1 = n(n-1)/2. In this example: 3 + 2 + 1 = 6.
  • Is it good?
insertion sort
Insertion Sort
  • Example input: 5, 1, 2, 4; each new element is inserted into the sorted sequence held by the processors.
  • Timing: n only! (4 clock cycles in this example)
  • Problem: a global bus is needed.

pipeline sorting without global wire

Pipeline Sorting without Global Wire
  • Systolic array: each cell holds a value y (initially y = ∞) and receives an input x from the left, producing an output z to the right.
  • If x > y then z ← x; else z ← y and y ← x (the cell keeps the smaller value and passes the larger one on).
  • Organization: the cells are chained into a linear pipeline (a small simulation follows below).
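A minimal C simulation of the systolic cell rule above (my own sketch, not the slide's hardware): each call to step() pushes one input through the whole array, so it is functional rather than cycle-accurate, but it shows how every cell keeps the smaller value and passes the larger one on.

#include <stdio.h>
#include <limits.h>

#define K 4   /* number of cells (one per processor) */

/* One input step of the linear systolic array: each cell holds y
   (initially "infinity"); given input x it keeps min(x, y) and passes
   max(x, y) to the next cell. */
static void step(int y[K], int x) {
    for (int i = 0; i < K; i++) {
        int z;
        if (x > y[i]) {
            z = x;                 /* pass the larger value on */
        } else {
            z = y[i];              /* pass the old y on ...      */
            y[i] = x;              /* ... and keep the smaller x */
        }
        x = z;                     /* z becomes the next cell's input */
    }
}

int main(void) {
    int y[K], input[K] = {5, 1, 2, 4};
    for (int i = 0; i < K; i++) y[i] = INT_MAX;       /* y = "infinity" */

    for (int i = 0; i < K; i++) step(y, input[i]);

    for (int i = 0; i < K; i++) printf("%d ", y[i]);  /* prints: 1 2 4 5 */
    printf("\n");
    return 0;
}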

bubble sorting
Bubble Sorting
  • The worst algorithm in the sequential model! But a good one in this case.
  • 7 clock cycles in this example. How about n?
  • Timing: 2n − 1 for n processors, i.e. O(n) time (a sequential simulation of the idea follows below).
  • O(n·n/k) time for k processors.
  • Can we get O((n/k) log(n/k))?
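The O(n)-time claim for n processors is often illustrated with odd-even transposition sort, a parallel form of bubble sort; the sequential C sketch below is an illustration of that idea, not the slide's exact pipelined schedule.

#include <stdio.h>

#define N 8

/* Odd-even transposition sort: a parallel variant of bubble sort for a
   linear array of N processors, one element per processor.  In each
   phase all compare-exchanges happen in parallel; N phases suffice,
   giving O(n) time with n processors.  (Sequential simulation only.) */
static void odd_even_sort(int a[N]) {
    for (int phase = 0; phase < N; phase++) {
        int start = phase % 2;               /* even phase: pairs (0,1),(2,3)... */
        for (int i = start; i + 1 < N; i += 2) {
            if (a[i] > a[i + 1]) {           /* compare-exchange with neighbor */
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
        }
    }
}

int main(void) {
    int a[N] = {5, 1, 4, 2, 8, 7, 3, 6};
    odd_even_sort(a);
    for (int i = 0; i < N; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}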

bucket sort
Bucket Sort
  • Can sorting be faster than the lower bound Ω(n log n), i.e. O(n)?
  • Yes, but it assumes the n elements are uniformly distributed over an interval [a, b].
  • The interval [a, b] is divided into k equal-sized subintervals called buckets.
  • Scan through each element and put it into the corresponding bucket. The number of elements in each bucket is about n/k.
  • Figure: elements 125, 167, 102, 201, 257, 207, 399, 336, 318, 19, 5, 98 distributed over [1, 400] using splitters 100, 200, 300.
bucket sort1
Bucket Sort
  • Then sort each bucket locally.
  • The sequential running time is O(n + k(n/k) log (n/k)) = O(n log (n/k)).
  • If k = n/128, then we get O(n) algorithm.
  • Parallelization is straightforward.
  • It is pretty good. Very little communication required between processors.
  • But what happens when the input data are not uniformly distributed? One bucket may have almost all the elements.
    • How to smartly pick appropriate splitters so that each bucket has at most 2n/k elements? (Sample sort)
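A small sequential C sketch of the bucket sort just described (the interval [0, 400), bucket count, and data values are illustrative assumptions): one O(n) scan into buckets, then a local sort per bucket, which is the part that would run in parallel, one bucket per processor.

#include <stdio.h>
#include <stdlib.h>

#define N  16
#define K  4          /* number of buckets (one per processor) */
#define LO 0.0        /* data assumed uniform over [LO, HI) */
#define HI 400.0

static int cmp(const void *x, const void *y) {
    double d = *(const double *)x - *(const double *)y;
    return (d > 0) - (d < 0);
}

int main(void) {
    double in[N] = {125, 167, 102, 201, 257, 207, 399, 336,
                    318, 19, 5, 98, 1, 350, 77, 290};
    double bucket[K][N];
    int count[K] = {0};

    /* One O(n) scan: drop each element into its bucket. */
    for (int i = 0; i < N; i++) {
        int b = (int)((in[i] - LO) / (HI - LO) * K);
        if (b >= K) b = K - 1;
        bucket[b][count[b]++] = in[i];
    }

    /* Sort each bucket locally -- this is the per-processor work. */
    for (int b = 0; b < K; b++)
        qsort(bucket[b], (size_t)count[b], sizeof(double), cmp);

    for (int b = 0; b < K; b++)
        for (int i = 0; i < count[b]; i++)
            printf("%.0f ", bucket[b][i]);
    printf("\n");
    return 0;
}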
sample sort
Sample Sort
  • First Step:Splitter selection(An important step)
  • Smartly select k-1 splitters from some samples.
  • Second Step: Bucket sort using these splitters on k buckets.
  • Guarantee: Each bucket has at most 2n/k elements.
  • Directly divide n input elements into k blocks of size n/k each and sort each block.
  • From each sorted block it chooses k-1 evenly spaced elements. Then sort these k(k-1) elements.
  • Select the k-1 evenly spaced elements from these k(k-1) elements.
  • Scan through the n input elements and use these k-1 splitters to put each element to the corresponding bucket.
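A hedged C sketch of the deterministic splitter selection described above; the array sizes and the particular "evenly spaced" indexing are illustrative assumptions, not the authors' exact code.

#include <stdio.h>
#include <stdlib.h>

#define N 16
#define K 4                      /* number of buckets / processors */

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Deterministic splitter selection: sort each of the K blocks of size
   N/K, take K-1 evenly spaced samples per block, sort the K*(K-1)
   samples, then take K-1 evenly spaced final splitters. */
static void pick_splitters(int data[N], int splitters[K - 1]) {
    int block = N / K;
    int samples[K * (K - 1)], s = 0;

    for (int b = 0; b < K; b++) {
        qsort(data + b * block, (size_t)block, sizeof(int), cmp);
        for (int j = 1; j < K; j++)                /* K-1 samples per block */
            samples[s++] = data[b * block + j * block / K];
    }
    qsort(samples, (size_t)s, sizeof(int), cmp);
    for (int j = 1; j < K; j++)
        splitters[j - 1] = samples[j * s / K];      /* evenly spaced */
}

int main(void) {
    int data[N] = {125, 167, 102, 201, 257, 207, 399, 336,
                   318, 19, 5, 98, 1, 350, 77, 290};
    int splitters[K - 1];
    pick_splitters(data, splitters);
    for (int j = 0; j < K - 1; j++) printf("%d ", splitters[j]);
    printf("\n");
    return 0;
}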
sample sort1
Sample Sort
  • Sequential time: O(n log(n/k)) + O(k² log k) + O(n log(n/k)).
  • Not an O(n) algorithm, but it is very efficient for parallel implementation.

  • Figure: Step 1 – sort each of the k blocks; Step 2 – sort the sampled elements and pick the final splitters; Step 3 – bucket sort using these splitters.

randomized sample sort
Randomized Sample Sort
  • Processor 0 randomly picks d·k samples, where d is the over-sampling ratio (e.g., 64 or 128).
  • Sort these samples and select k-1 evenly spaced numbers as splitters.
  • With high probability, the splitters are picked well. I.e. with low probability, there is a big bucket.
  • But cannot be used for hard real-time systems.
  • To sort 5 million numbers in a SUN cluster with 4 machines using MPI in our tests:
    • Randomized sample sort takes 5 seconds
    • Deterministic sample sort takes 10 seconds
    • Radix sort takes > 500 seconds (too many communications).
embedded systems overview
Embedded Systems Overview
  • Embedded computing systems
    • Computing systems embedded within electronic devices
    • Repeatedly carry out a particular function or a set of functions.
    • Nearly any computing system other than a desktop computer is an embedded system
    • Billions of units produced yearly, versus millions of desktop units
    • About 50 per household, 50 - 100 per automobile
some common characteristics of embedded systems
Some common characteristics of embedded systems
  • Application Specific
    • Executes a single program, repeatedly
    • New ones might be adaptive, and/or multiple mode
  • Tightly-constrained
    • Low cost, low power, small, fast, etc.
  • Reactive and real-time
    • Continually reacts to changes in the system’s environment
    • Must compute certain results in real-time without delay
a short list of embedded systems

A “short list” of embedded systems
  • Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmission, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers.
  • And the list grows longer each year.
an embedded system example a digital camera

An embedded system example -- a digital camera
  • Digital camera chip: lens and CCD, CCD preprocessor, pixel coprocessor, A2D and D2A converters, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display and LCD controllers, memory controller, ISA bus interface, UART.
  • Single-functioned -- always a digital camera
  • Tightly-constrained -- low cost, low power, small, fast
design metric competition improving one may worsen others
Design metric competition -- improving one may worsen others
  • Competing metrics: performance, power, size, NRE cost.
  • Expertise with both software and hardware is needed to optimize design metrics
    • Not just a hardware or software expert, as is common
    • A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
  • Serious design space exploration is needed.
processor technology
Processor technology
  • Processors vary in their customization for the problem at hand

  • Example of desired functionality:
    total = 0
    for i = 1 to N loop
      total += M[i]
    end loop
  • Implementation options: general-purpose processor (software), application-specific processor, single-purpose processor (hardware).
design productivity gap

Design Productivity Gap
  • Chart: logic transistors per chip (IC capacity, in millions) versus design productivity (thousand transistors per staff-month), 1981–2009; the gap between capacity growth and productivity growth keeps widening.
  • 1981 leading edge chip required 100 designer months
    • 10,000 transistors / 100 transistors/month
  • 2002 leading edge chip requires 30,000 designer months
    • 150,000,000 / 5000 transistors/month
  • Designer cost increases from $1M to $300M
more challenges coming
More challenges coming
  • Parallel
    • Systems consist of multiple processors together with custom hardware.
  • Heterogeneous, Networked
    • Each processor has its own speed, memory, power, reliability, etc.
  • Fault-Tolerance, Reliability & Security
    • A major issue for critical applications
  • Design Space Explorations: timing, code-size, data memory, power consumption, cost, etc.
  • System-Level Design, Analysis, and Optimization are important.
  • Compiler is playing an important role. We need more research.
  • Let's start with timing optimizations, then other optimizations, and then design space issues.
timing optimization
Timing Optimization
  • Parallelization for Nested Loops
  • Focus on computation or data intensive applications.
  • Loops are the most critical parts.
  • Multi-dimensional (MD) systems: uniform nested loops.
  • Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.
    • ALU part: MD Retiming to fully parallelize computations.
    • Memory part: prefetching and partitioning to hide memory latencies.
  • Developed by Edwin Sha’s group. The results are exciting.
graph representation for loops
Graph Representation for Loops
  • Example loop:
    A[0] = A[1] = 0;
    for (i = 2; i < n; i++) {
      A[i] = D[i-2] / 3;
      B[i] = A[i] * 5;
      C[i] = A[i] + 7;
      D[i] = B[i] + C[i];
    }
  • Data-flow graph: nodes A, B, C, D; the edge from D back to A carries two delays (A[i] uses D[i-2]).

schedule looped dfg

Schedule looped DFG
  • DFG: nodes A, B, C, D.
  • Static schedule (each iteration): cycle 1: A; cycle 2: B, C; cycle 3: D.
  • Schedule length = 3.

rotation loop pipelining

Rotation: Loop pipelining
  • Original schedule: {A; B C; D} repeated for each iteration.
  • Regrouped schedule: node A of the next iteration is moved up, creating a prologue (A) and an epilogue (B C; D).
  • Rotated schedule: the steady-state body becomes {B C A; D}, giving schedule length 2.

graph representation using retiming

Graph Representation Using Retiming
  • Original DAG: longest path = 3.
  • After retiming: longest path = 2.

multi dimensional problems
Multi-dimensional Problems
  • Example nested loop:
    DO 10 J = 0, N
      DO 1 I = 0, M
        d(i,j) = b(i,j-1) * c(i-1,j)   [node D]
        a(i,j) = d(i,j) * .5           [node A]
        b(i,j) = a(i,j) + 1.           [node B]
        c(i,j) = a(i,j) + 2.           [node C]
    1  Continue
    10 Continue
  • Circuit optimization view: the MDFG has nodes A, B, C, D, with delay (0,1) on the edge from B to D (z2^-1) and delay (1,0) on the edge from C to D (z1^-1).

an example of dsp processor ti tms320c64x
An Example of DSP Processor: TI TMS320C64X
  • Clock speed: 1.1 GHz; up to 8800 MIPS.
slide33

One-Dimensional Retiming (Leiserson–Saxe, ’91)
  • Figure: a one-dimensional loop "For I = 1, … : … × 1.3" shown before and after retiming.

another example

Another Example
  • Original loop (figure): For I = 1, … with a "× 1.3" multiply node.
  • After retiming:
    A(1) = B(-1) + 1
    For I = 1, …
      B(I) = A(I) × 1.3
      A(I+1) = B(I-1) + 1

retiming
Retiming
  • An integer-valued transformation r on nodes
  • Registers (delays) are re-distributed
  • G = <V, E, d>  →  Gr = <V, E, dr>
  • For each edge e: u → v, dr(e) = d(e) + r(u) − r(v); the retiming is legal when dr(e) ≥ 0 for every edge (a small numeric check follows below).
  • The number of delays along any cycle remains constant.
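A small C sketch of the legality check above, applied to the earlier A/B/C/D example; the edge list and retiming values come from that example, while the helper legal_retiming is my own illustration.

#include <stdio.h>

/* Retiming as on the slide: for each edge e: u -> v,
   dr(e) = d(e) + r(u) - r(v); a retiming is legal iff dr(e) >= 0
   for every edge.  The delay count along any cycle is unchanged. */
typedef struct { int u, v, d; } Edge;

static int legal_retiming(const Edge *e, int m, const int *r, int *dr) {
    for (int i = 0; i < m; i++) {
        dr[i] = e[i].d + r[e[i].u] - r[e[i].v];
        if (dr[i] < 0) return 0;       /* negative delay: illegal */
    }
    return 1;
}

int main(void) {
    /* DFG from the earlier loop example: nodes A=0, B=1, C=2, D=3,
       edges A->B, A->C, B->D, C->D (0 delays) and D->A (2 delays). */
    Edge e[] = {{0,1,0}, {0,2,0}, {1,3,0}, {2,3,0}, {3,0,2}};
    int r[] = {1, 0, 0, 0};            /* retime A once: r(A) = 1 */
    int dr[5];

    if (legal_retiming(e, 5, r, dr))
        for (int i = 0; i < 5; i++) printf("dr(e%d) = %d\n", i, dr[i]);
    return 0;
}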

multi dimensional retiming
Multi-Dimensional Retiming
  • A nested loop
  • Illegal cases
  • Retiming nested loops
  • New problems …
iteration space for retimed graph
Iteration Space for Retimed Graph

Legal schedule with row-wise executions. S=(0,1)

required solution needs
Required Solution Needs:
  • To avoid illegal retiming
  • To be general
  • To obtain full parallelism
  • To be a fast algorithm
schedule vector wavefront processing
Schedule Vector (wavefront processing)
  • Legal schedule vector s: s · d > 0 for every (non-zero) delay vector d.

slide50

Chained MD Retiming
  • With delay vectors (1,1) and (−1,1), the schedule plane S+ contains all s with s · (1,1) > 0 and s · (−1,1) > 0.
  • Pick s = (0,1); choose the retiming vector r orthogonal to s (r ⊥ s), e.g. r = (1,0).

s must be feasible for new delay vectors
S must be feasible for new delay vectors
  • Let the new delay vector be d' = d + kr. We know s · d > 0 and s · r = 0.
  • Then s · (d + kr) = s · d + k (s · r) = s · d > 0, so s remains a legal schedule for d + kr.
  • Figure: schedule plane with delay vectors d1, d2, d3 and the retiming vector r.

slide55

Synchronous Circuit Optimization Example – original design (cont.)
  • Critical path: 6 adders and 2 multipliers.

embedded system design review
Embedded System Design Review
  • Strict requirements
    • Time, power-consumption, code size, data memory size, hardware cost, areas, etc.
    • Time-to-market, time-to-prototype
  • Special architectural support
    • Harvard architecture, on-chip memory, register files, etc.
  • Increasing amount of software
    • Flexibility of software, short time-to-market, easy upgrades
    • The amount of software is doubling every two years.
  • How to generate high-quality code for embedded systems?
  • How to minimize and search the design space?
  • Compiler role?
compiler in embedded systems
Compiler in Embedded Systems
  • Ordinary C compilers for embedded processors are notorious for their poor code quality.
    • Data memory overhead for compiled code can reach a factor of 5
    • Cycle overhead can reach a factor of 8, compared with the manually generated assembly code. (Rozenberg et al., 1998)
  • Compilers matter not only for code generation:
  • In general, compilers are included in the control loop of design space exploration, and therefore they play an important role in the design phase.
  • Design space exploration searches a huge, n-dimensional space, where each dimension corresponds to a design choice.
compiler in design space exploration
Compiler in Design Space Exploration
  • Algorithm selection: to analyze dependencies between algorithms and processor architectures.
  • HW/SW partitioning
  • Memory related issues (program and data memory)
    • Optimally placing programs and data in on-chip memories, and hiding off-chip memory latencies by smart pre-fetching (Sha et al.)
    • Data mapping for processors with multiple on-chip memory modules. (Zhuge and Sha)
    • Code size reduction for software-pipelined applications. (Zhuge and Sha)
  • Instruction set options
    • Search for power-optimized instruction sets (Kin et al. 1999)
    • Scheduling for loops and DAG with the min. energy. (Shao and Sha)
design space minimization
Design Space Minimization
  • A direct design space is too large. Must do design space minimization before exploration.
  • Derive the properties of relationships of design metrics
    • A huge number of design points can be proved to be infeasible before performing time-consuming design space exploration.
  • Using our design space minimization algorithm, the design points can be reduced from 510 points to 6 points for synthesizing All-pole Filter, for example.
  • Approach:
    • Develop effective optimization techniques: code size, time, data memory, low-power
    • Understand the relationship among them
    • Design space minimization algorithms
    • Efficient design space exploration algorithms using fuzzy logic
example relation of optimizations

Example Relation of Optimizations
  • Figure: a DFG with nodes A, B, C, D, shown before and after transformation.
  • Retiming
    • Transforms a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.
    • The cycle period c(G) of a DFG G is the computation time of its longest zero-delay path.
  • Unfolding
    • The original DFG G is unfolded f times, so the unfolded graph Gf consists of f copies of the original node set.
    • Iteration period: P = c(Gf)/f.
    • Code size is increased f times; software pipelining will increase it even more (see the unfolding sketch below).
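A minimal C illustration of unfolding with f = 2; the loop body and array names are made up for illustration only. The body is duplicated, so the code size doubles, matching the bullet above.

#include <stdio.h>
#define N 8

int main(void) {
    double A[N + 2], B[N + 2];
    for (int i = 0; i < N + 2; i++) { A[i] = i; B[i] = 0; }

    /* original loop:  for (i = 2; i < N; i++) B[i] = A[i-2] * 0.5; */

    /* unfolded loop (f = 2), assuming N is even: two copies of the body */
    for (int i = 2; i < N; i += 2) {
        B[i]     = A[i - 2] * 0.5;   /* copy 1 of the body */
        B[i + 1] = A[i - 1] * 0.5;   /* copy 2 of the body */
    }

    printf("%f %f\n", B[2], B[N - 1]);
    return 0;
}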
experimental results
Experimental Results
  • The search space size using our method is only 2% of the search space using the standard method on average.
  • The quality of the solutions found by our algorithm is better than that of the standard method.
a design space minimization problem
A Design Space Minimization Problem
  • Clearly understand the relationships among retiming, unfolding, and the iteration period.
program memory consideration with code size minimization
Program Memory Consideration with Code Size Minimization
  • Multiple on-chip memory banks, but usually only one program memory.
  • The capacity of an on-chip memory bank is very limited
    • Motorola’s DSP56K has only 512 x 24-bit program memory
    • ARM940T uses a 4K instruction cache (Icache)
    • StrongARM SA-1110 uses a 16K cache
  • A widely used performance optimization technique, software pipelining, expands the code size to several times the original code size.
  • Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses.
  • The code size becomes a critical concern for many embedded processors.
code size expansion caused by software pipelining
Code Size Expansion Caused by Software Pipelining
  • Original loop:
    for i=1 to n do
      A[i] = E[i-4] + 9;
      B[i] = A[i] * 0.5;
      C[i] = A[i] + B[i-2];
      D[i] = A[i] * C[i];
      E[i] = D[i] + 30;
    end
  • Software-pipelined loop:
    Prologue:
      A[1] = E[-3] + 9;
      A[2] = E[-2] + 9;
      B[1] = A[1] * 0.5;
      C[1] = A[1] + B[-1];
      A[3] = E[-1] + 9;
      B[2] = A[2] * 0.5;
      C[2] = A[2] + B[0];
      D[1] = A[1] * C[1];
    Kernel:
      for i=1 to n-3 do
        A[i+3] = E[i-1] + 9;
        B[i+2] = A[i+2] * 0.5;
        C[i+2] = A[i+2] + B[i];
        D[i+1] = A[i+1] * C[i+1];
        E[i] = D[i] + 30;
      end
    Epilogue:
      B[n] = A[n] * 0.5;
      C[n] = A[n] + B[n-2];
      D[n-1] = A[n-1] * C[n-1];
      E[n-2] = D[n-2] + 30;
      D[n] = A[n] * C[n];
      E[n-1] = D[n-1] + 30;
      E[n] = D[n] + 30;
  • Schedule length is decreased from 4 cycles to 1 cycle.
  • Code size is expanded to 3 times the original code size.

rotation scheduling
Rotation Scheduling
  • Resource Constrained Loop Scheduling based on Retiming concept.
  • Retiming gives a clear framework for software pipelining depth.
  • Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule.
  • In each step of rotation, the nodes in the first row:
    • retimed once, by pushing one delay from each incoming edge of the node and adding one delay to each of its outgoing edges;
    • rescheduled to an available location (such as the earliest one) in the schedule, based on the new precedence relations defined in the retimed graph.
  • The optimal schedule length can be obtained in polynomial time (about 2|V| rotations) in most cases.
  • The techniques can be generalized to deal with code-size, switching activities, branches, nested loops, etc.
schedule a cyclic dfg

Schedule a Cyclic DFG
  • DFG: nodes A, B, C, D.
  • Static schedule: cycle 1: A; cycle 2: B, C; cycle 3: D.
  • Schedule length = 3.

rotation loop pipelining1

Rotation: Loop Pipelining
  • Original schedule: {A; B C; D} repeated.
  • Rotation: node A is rotated down and rescheduled, giving a prologue (A), a steady-state schedule {B C A; D} of length 2, and an epilogue (B C; D).

retiming view of loop pipelining

Retiming View of Loop Pipelining
  • Before retiming: cycle period = 3.
  • After retiming: cycle period = 2.

the second rotation
The Second Rotation
  • The schedule after the 1st rotation phase: prologue A; steady state {B C A; D}.
  • The 2nd rotation: the first row (B, C, A) is rotated down and rescheduled.
  • The final schedule after rescheduling: steady state {B C D A} in a single cycle, with a longer prologue and a matching epilogue.

retiming view of loop pipelining1

Retiming View of Loop Pipelining
  • After the 1st rotation: r(A)=1, r(B)=r(C)=r(D)=0, cycle period = 2.
  • After the 2nd rotation: r(A)=2, r(B)=r(C)=1, r(D)=0, cycle period = 1.

prologue and retiming function

Prologue and Retiming Function
  • Figure: the original schedule, the schedule after the 1st rotation (r(A)=1), and the schedule after the 2nd rotation (r(A)=2), each with its prologue and epilogue.
  • The number of copies of node A in the prologue = r(A).
  • The number of copies of node A in the epilogue = (max_u r(u)) − r(A), for u ∈ V (evaluated in the small sketch below).
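A tiny C illustration of the two formulas above, using the retiming values from the second rotation (r(A)=2, r(B)=r(C)=1, r(D)=0); the program just evaluates the prologue and epilogue copy counts per node.

#include <stdio.h>

/* Copies of each node in the prologue/epilogue from its retiming value
   (slide formulas): prologue copies = r(v), epilogue copies = max_u r(u) - r(v). */
int main(void) {
    const char *name[] = {"A", "B", "C", "D"};
    int r[] = {2, 1, 1, 0};          /* after the 2nd rotation */
    int max_r = 0;

    for (int i = 0; i < 4; i++)
        if (r[i] > max_r) max_r = r[i];

    for (int i = 0; i < 4; i++)
        printf("%s: prologue copies = %d, epilogue copies = %d\n",
               name[i], r[i], max_r - r[i]);
    return 0;
}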

cred technique using predicate register
CRED Technique Using Predicate Register
  • Predicate register
    • An instruction can be guarded by a predicate register.
    • The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.
  • Implement CRED using predicate register with counter (TI’s TMS320C6x)
    • Set the initial value p = (max_u r(u)) − r(v).
    • Decrement p by one in each iteration.
    • The instruction is executed when 0 ≥ p > −n, where n is the loop counter of the original loop. The instruction is disabled when p > 0 or p ≤ −n (a small simulation follows below).
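A hedged C simulation of the counter-based predication rule above; the r values and trip count follow the later example (r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0, n=5), and running the guarded loop for n + max_u r(u) iterations is my assumption for the simulation, not TI's exact code.

#include <stdio.h>

/* Counter-based predication sketch: an instruction guarded by counter p
   executes only when 0 >= p > -n, where n is the original trip count;
   each counter is decremented once per iteration. */
static int enabled(int p, int n) {
    return p <= 0 && p > -n;
}

int main(void) {
    int n = 5;
    int p = 0, q = 1, r = 2, s = 3;   /* initial values (max_u r(u)) - r(v) */

    for (int i = 1; i <= n + 3; i++) {          /* n + max retiming iterations */
        printf("iter %d: A:%d BC:%d D:%d E:%d\n",
               i, enabled(p, n), enabled(q, n), enabled(r, n), enabled(s, n));
        p--; q--; r--; s--;
    }
    return 0;
}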
the new execution sequence
The New Execution Sequence

  • Software-pipelined loop schedule with r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0, and n=5.
  • The execution sequence after performing CRED using 4 conditional registers.
  • The new code size.

processor classes
Processor Classes
  • Processor Class 0: No predicate register
    • Motorola’s StarCore DSP processor
  • Processor Class 1: Has “condition code” bits in instruction, no predicate register
    • Intel’s StrongARM and other ARM architectures
  • Processor Class 2: Has 1-bit predicate registers
    • Philips’ TriMedia multimedia processor
  • Processor Class 3: Has predicate registers with counters
    • TI’s TMS320C6x processor
  • Processor Class 4: Specialized hardware support for executing software-pipelined loops
    • IA64
code size reduction for class 3
Code Size Reduction for Class 3
  • Software-pipelined code with explicit prologue and epilogue:
    A[1] = E[-3] + 9;
    A[2] = E[-2] + 9;
    B[1] = A[1] * 0.5;
    C[1] = A[1] + B[-1];
    A[3] = E[-1] + 9;
    B[2] = A[2] * 0.5;
    C[2] = A[2] + B[0];
    D[1] = A[1] * C[1];
    for i=1 to n-3 do
      A[i+3] = E[i-1] + 9;
      B[i+2] = A[i+2] * 0.5;
      C[i+2] = A[i+2] + B[i];
      D[i+1] = A[i+1] * C[i+1];
      E[i] = D[i] + 30;
    end
    B[n] = A[n] * 0.5;
    C[n] = A[n] + B[n-2];
    D[n-1] = A[n-1] * C[n-1];
    E[n-2] = D[n-2] + 30;
    D[n] = A[n] * C[n];
    E[n-1] = D[n-1] + 30;
    E[n] = D[n] + 30;
  • The same loop after CRED with counted predicate registers p, q, r, s:
    p=0; q=1; r=2; s=3
    for i=1 to n-3 do
      [p] A[i+3] = E[i-1] + 9;   p--;
      [q] B[i+2] = A[i+2] * 0.5;
      [q] C[i+2] = A[i+2] + B[i];   q--;
      [r] D[i+1] = A[i+1] * C[i+1];   r--;
      [s] E[i] = D[i] + 30;   s--;
    end

cred on various types of processors
CRED on Various Types of Processors
  • The TI model and IA64 are very efficient for code size reduction.
  • TI model is very effective for DSP processors supporting predicate registers but without specialized hardware as in IA64.
experimental results on code size performance trade off
Experimental Results on Code Size/Performance Trade-off
  • Code size/performance exploration for All-pole Filter on the modified TMS320C6x processor with only 2 predicate registers.
  • The code size is increased when software pipeline depth is increased and the schedule length is decreased.
  • Our approach finds the shortest schedule length satisfying a code size constraint.
code size reduction for nested loop

Code-size Reduction for Nested Loop
  • Figure: the cell dependency graph over iterations (i,j) and the data flow graph of 12 nodes with delay edges (0,1) and (1,0).
  • Assume 8 functional units.
  • Traditional software pipelining can only reach 6 clock cycles at best.
  • Interchanging the loop indices cannot help the optimization.
  • (a) The original loop:
    Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15 instructions)
    Inner loop (10 cycles, 12 instr., trip count = n)
    Outer 2 (5 cycles, 15 instr.)
    Assuming m = 1000, n = 10:
    Total cycles = m(6 + 10n + 5) = 10mn + 11m = 111,000
    Code size = 42 instr.

md retiming and code reduction

MD Retiming and Code Reduction
  • Retimed DFG: r(1)=r(2)=r(3)=r(4)=(4,0), r(5)=r(6)=(3,0), r(7)=r(8)=(2,0), r(9)=r(10)=(1,0), r(11)=r(12)=(0,0); the loop-carried delay becomes (-4,1), with (1,0) delays distributed along the chain.
  • (b) Inner-outer combined software pipelining:
    Outer loop begin (trip count = m)
    Outer 1 & Prologue (12 cycles, 15+28 instr.)
    Inner loop (2 cycles, 12 instr., trip count = n-4)
    Outer 2 & Epilogue (12 cycles, 15+20 instr.)
    Total cycles = m(12 + 2(n-4) + 12) = 2mn + 16m = 36,000
    Code size = 90 instr.
  • (c) Code-size reduction: remove prologue and epilogue:
    Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 15+4 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.)
    Total cycles = m(6 + 2(n+4) + 5) = 2mn + 19m = 39,000
    Code size = 50 instr.

outer loop pipeline and code reduction
Outer Loop Pipeline and Code Reduction
  • (d) Outer loop pipelining:
    Outer loop begin (trip count = m-1)
    Outer 1 (6 cycles, 19 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 19 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.)
    Total cycles = 6 + (m-1)(2(n+4) + 6) + 2(n+4) + 5 = 2mn + 14m + 5 = 34,005
    Code size = 100 instr.
  • (e) Reduce the new epilogue:
    Outer loop begin (trip count = m)
    Outer 1 (6 cycles, 19+1 instr.)
    Inner loop (2 cycles, 16 instr., trip count = n+4)
    Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 20 instr.)
    Total cycles = 6 + m(2(n+4) + 6) = 2mn + 14m + 6 = 34,006
    Code size = 71 instr.

data memory consideration with optimal data mapping

Data Memory Consideration with Optimal Data Mapping
  • Architecture (figure): an ALU connected to data memory Bank 0 and Bank 1 through buses DB0 and DB1.
  • Multiple memory banks are accessible in parallel, providing higher memory bandwidth.
  • Many existing compilers cannot exploit this kind of architectural feature well; instead, all variables are assigned to just one bank.
  • The technique of data mapping and scheduling becomes one of the most important factors in performance optimization.
iir filter data flow graph

IIR Filter – Data Flow Graph
  • Figure: the data flow graph of an IIR filter; its operation nodes read and write the variables A–G.
our model variable independence graph

Our Model – Variable Independence Graph
  • Nodes are the variables A–G; the figure shows edge weights such as 1/2 and 7/8 and a partition into Partition 1 and Partition 2.
  • Weight(e = (u,v)) is the “gain” of putting u and v in different memory modules. We want to find a maximum-weight partition (a greedy sketch follows below).
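As a sketch of what "find a maximum-weight partition" means, here is a simple greedy local-search max-cut in C over an illustrative variable independence graph; the edge weights are made up and this is not the authors' repartitioning algorithm.

#include <stdio.h>

#define V 7   /* variables A..G */

/* w[u][v] is the "gain" of placing u and v in different memory banks.
   A greedy local search for a maximum-weight partition (max-cut):
   flip any variable whose flip increases the cut weight, until none does. */
int main(void) {
    double w[V][V] = {0};
    int side[V] = {0};                 /* bank 0 or bank 1 for each variable */

    w[0][1] = w[1][0] = 0.5;           /* illustrative weights only */
    w[0][2] = w[2][0] = 0.875;
    w[1][3] = w[3][1] = 0.5;
    w[2][4] = w[4][2] = 0.5;
    w[3][5] = w[5][3] = 0.5;
    w[4][6] = w[6][4] = 0.5;

    int improved = 1;
    while (improved) {
        improved = 0;
        for (int u = 0; u < V; u++) {
            double gain = 0.0;         /* change in cut weight if u switches banks */
            for (int v = 0; v < V; v++)
                gain += (side[u] == side[v] ? w[u][v] : -w[u][v]);
            if (gain > 0) { side[u] ^= 1; improved = 1; }
        }
    }
    for (int u = 0; u < V; u++)
        printf("variable %c -> bank %d\n", 'A' + u, side[u]);
    return 0;
}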

experimental results1
Experimental Results
  • The IG approach uses list scheduling and an interference graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).
  • Our approach uses rotation scheduling with a variable repartitioning algorithm and the variable independence graph.
  • Different approaches result in different variable partitions.
  • The largest improvement in schedule length using our approach is 52.9%. The average improvement on the benchmarks is 44.8%.
conclusions
Conclusions
  • An exciting area: optimizations for parallel DSP and embedded systems. Gave an overview. Needs much more work.
  • Consider both architectures and compilers.
  • Presented techniques:
    • Multi-dimensional (MD) retiming, Rotation
    • Code-size minimization for software pipelined loops
    • Design space minimization
    • Optimal partitioning and prefetching to completely hide memory latencies, and to determine the minimum required on-chip memory.
  • Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in the tutorial. There is still a lot more.
  • Please check my web page for details: www.utdallas.edu/~edsha