Optimizing Parallel Embedded Systems

Dr. Edwin Sha
Professor, Computer Science
University of Texas at Dallas
http://www.utdallas.edu/~edsha
edsha@utdallas.edu

Parallel Architecture is Ubiquitous

- Parallel Architecture is everywhere
- As small as cellular phone
- Modern DSP processor (VLIW), network processors
- Modern CPU (instruction-level parallelism)
- Your home PC (small number of processors)
- Application-specific systems (image processing, speech processing, network routers, look-up table, etc.)
- File server
- Database server or web server
- Supercomputers

Interested in domain-specific HW/SW parallel systems

Organization of the Presentation

- Introduction to parallel architectures
- Using sorting as an example to show various implementations on parallel architectures.
- Introduction to embedded systems: strict constraints
- Timing optimization: parallelize loops and nested loops.
- Retiming, Multi-dimensional Retiming
- Full Parallelism: all the nodes can be executed in parallel
- Design space exploration and optimizations for code size, data memory, low-power, etc.
- Intelligent prefetching and partitioning to hide memory latency
- Conclusions

Technology Trend

- Microprocessor performance increases 50% - 100% per year
- Where does the performance gain come from? Clock rate and capacity.
- Clock Rate increases only 30% per year

Technology Trend

- Transistor count grows much faster than clock rate.
- Increases 40% per year
- An order of magnitude more contribution over 2 decades

Exploit Parallelism at Every Level

- Algorithms Level
- Thread level
- E.g. each service request is created as a thread
- Iteration level (loop level)
- E.g. For_all i = 1 to n do {loop body}.
- All n iterations can be parallelized.
- Loop body level (instruction-level)
- Parallelize instructions inside a loop body as much as possible
- Hardware level: parallelize and pipeline the execution of an instruction such as multiplication, etc.

Sorting on Linear Array of Processors

- Input: x1, x2, .. xn. Output: A sorted sequence (Ascending order)
- Architecture: a linear array of k processors. Assume k=n at first.
- What is the optimal time for sorting? Obviously it takes O(n) time for an element to reach the rightmost processor.
- Let's consider the different sequential algorithms and then think about how to use them on a linear array of processors. This is a good example.
- Selection Sort
- Insertion Sort
- Bubble Sort
- Bucket Sort
- Sample Sort

Selection Sort

- Algorithm: for i = 1 to n
- pick the ith smallest one
- Example with input 5,1,2,4: keep 1 (5,2,4 remain), keep 2 (5,4 remain), keep 4.
- Timing: (n-1) + … + 2 + 1 = n(n-1)/2. In this example, 3 + 2 + 1 = 6.
- Is it good?
Insertion Sort

- Example input: 5,1,2,4
- Timing: n only! 4 clock cycles in this example.
- Problem: needs a global bus.

[Figure: contents of the processors over time for the input 5,1,2,4.]

Bubble Sorting

- The worst algorithm in the sequential model! But a good one in this case.
- 7 clock cycles in this example. How about n?
- Timing: 2n-1 for n processors, i.e. O(n) time.
- O(n · n/k) for k processors.
- Can we get O((n/k) log (n/k)) time?
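The 2n-1 bound can be checked with a small simulation. The sketch below is one common way to sort on a linear array and may differ in detail from the scheme drawn on the slide: one input enters the leftmost processor per cycle, and each processor keeps the smaller of its stored and arriving values and passes the larger one to the right. The function name and data layout are illustrative assumptions.

```python
def linear_array_sort(values):
    """Simulate n processors in a row; returns (processor contents, cycles)."""
    n = len(values)
    stored = [None] * n      # value kept by each processor (None = empty)
    arriving = [None] * n    # value reaching each processor in this cycle
    pending = list(values)   # inputs still waiting to enter processor 0
    cycles = max(2 * n - 1, 0)
    for _ in range(cycles):
        arriving[0] = pending.pop(0) if pending else None
        leaving = [None] * n  # value each processor sends to its right
        for p in range(n):
            x = arriving[p]
            if x is None:
                continue
            if stored[p] is None:
                stored[p] = x                      # first value: keep it
            else:
                if x < stored[p]:
                    x, stored[p] = stored[p], x    # keep the smaller value
                leaving[p] = x                     # pass the larger one on
        arriving = [None] + leaving[:-1]
    return stored, cycles


print(linear_array_sort([5, 1, 2, 4]))
```

For the slide's input 5,1,2,4 the last value settles in the 7th cycle, which matches 2n-1 for n = 4.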

Bucket Sort

- Can it be lower than the lower bound (n log n), to be O(n)?
- But it assumes the n elements are uniformly distributed over an interval [a, b].
- The interval [a, b] is divided into k equal-sized subintervals called buckets.
- Scan through each element and put it into the corresponding bucket. The number of elements in each bucket is about n/k.

[Figure: elements from the interval [1, 400] placed into four buckets by the splitters 100, 200 and 300, e.g. 19, 5, 98, …; 125, 167, 102, …; 201, 257, 207, …; 399, 336, 318, ….]

Bucket Sort

- Then sort each bucket locally.
- The sequential running time is O(n + k(n/k) log (n/k)) = O(n log (n/k)).
- If k = n/128, then we get an O(n) algorithm.
- Parallelization is straightforward.
- It is pretty good. Very little communication is required between processors.
- But what happens when the input data are not uniformly distributed? One bucket may have almost all the elements.
- How can we smartly pick appropriate splitters so that each bucket has at most 2n/k elements? (Sample sort)
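A minimal sequential sketch of the bucket sort described above, assuming the inputs are numbers in a known interval [a, b] with b > a. The function name and the returned bucket sizes are illustrative additions. In a parallel version each bucket would be sorted by its own processor.

```python
def bucket_sort(data, k, a, b):
    """Sort numbers assumed to lie in [a, b] using k equal-width buckets."""
    width = (b - a) / k
    buckets = [[] for _ in range(k)]
    for x in data:
        # x == b would index bucket k, so clamp it into the last bucket.
        idx = min(int((x - a) / width), k - 1)
        buckets[idx].append(x)
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))   # local sort of about n/k elements
    return result, [len(bucket) for bucket in buckets]


values = [125, 167, 102, 201, 257, 207, 399, 336, 318, 19, 5, 98]
print(bucket_sort(values, k=4, a=1, b=400))
```

With skewed input the bucket sizes become unbalanced, which is the problem sample sort addresses.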

Sample Sort

- First step: splitter selection (an important step)
- Smartly select k-1 splitters from some samples.
- Second step: bucket sort using these splitters on k buckets.
- Guarantee: each bucket has at most 2n/k elements.

- Divide the n input elements directly into k blocks of size n/k each and sort each block.
- From each sorted block, choose k-1 evenly spaced elements. Then sort these k(k-1) elements.
- Select k-1 evenly spaced elements from these k(k-1) elements as the splitters.
- Scan through the n input elements and use these k-1 splitters to put each element into the corresponding bucket.
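The steps above can be sketched sequentially as follows. The exact positions of the "evenly spaced" samples are an assumption of this sketch (the slides do not fix them), so the code reports the bucket sizes rather than asserting the 2n/k bound. It assumes len(data) >= k.

```python
import bisect


def select_splitters(data, k):
    """Pick k-1 splitters by regular sampling from k sorted blocks."""
    block = len(data) // k
    samples = []
    for b in range(k):
        chunk = sorted(data[b * block:(b + 1) * block])
        # k-1 evenly spaced elements of this sorted block (assumed spacing)
        samples.extend(chunk[j * block // k] for j in range(1, k))
    samples.sort()                      # k(k-1) samples in total
    m = len(samples)
    return [samples[j * m // k] for j in range(1, k)]


def sample_sort(data, k):
    splitters = select_splitters(data, k)
    buckets = [[] for _ in range(k)]
    for x in data:
        # Bucket index = number of splitters that are <= x.
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out, splitters, [len(bucket) for bucket in buckets]


data = [(i * i) % 1000 for i in range(64)]   # deliberately non-uniform
out, splitters, sizes = sample_sort(data, k=4)
print(splitters, sizes)
```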

Sample Sort

- Sequential: O(n log (n/k)) + O(k² log k) + O(n log (n/k)).
- Not an O(n) algorithm, but it is very efficient for parallel implementation.

[Figure: Step 1: sort each block and take samples. Step 2: sort the samples to obtain the final splitters. Step 3: bucket sort using these splitters.]

Randomized Sample Sort

- Processor 0 randomly picks d × k samples, where d is the over-sampling ratio, such as 64 or 128.
- Sort these samples and select k-1 evenly spaced numbers as splitters.
- With high probability the splitters are picked well, i.e. a big bucket occurs only with low probability.
- But it cannot be used for hard real-time systems.
- To sort 5 million numbers in a SUN cluster with 4 machines using MPI in our tests:
- Randomized sample sort takes 5 seconds
- Deterministic sample sort takes 10 seconds
- Radix sort takes > 500 seconds (too many communications).
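A minimal sketch of the randomized splitter selection, assuming the data fit in one list on processor 0. The function name, the fixed seed and the default d = 64 are illustrative choices, not taken from the measured implementation.

```python
import random


def random_splitters(data, k, d=64, seed=0):
    """Draw d*k random samples and return k-1 evenly spaced splitters."""
    rng = random.Random(seed)            # fixed seed only for repeatability
    count = min(len(data), d * k)        # cannot sample more than we have
    samples = sorted(rng.sample(data, count))
    return [samples[j * count // k] for j in range(1, k)]


data = list(range(1000))
random.Random(1).shuffle(data)
print(random_splitters(data, k=4, d=8))
```

Because the splitters depend on random draws, the bucket sizes are only bounded with high probability, which is why the slide rules this variant out for hard real-time systems.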

Embedded Systems Overview

- Embedded computing systems
- Computing systems embedded within electronic devices
- Repeatedly carry out a particular function or a set of functions.
- Nearly any computing system other than a desktop computer is an embedded system
- Billions of units produced yearly, versus millions of desktop units
- About 50 per household, 50 - 100 per automobile

Some common characteristics of embedded systems

- Application Specific
- Executes a single program, repeatedly
- New ones might be adaptive, and/or multiple mode
- Tightly-constrained
- Low cost, low power, small, fast, etc.
- Reactive and real-time
- Continually reacts to changes in the system’s environment
- Must compute certain results in real-time without delay

A “short list” of embedded systems

- Auto-focus cameras
- Automatic teller machines
- Automatic toll systems
- Automatic transmission
- Avionic systems
- Battery chargers
- Camcorders
- Cell phones
- Cell-phone base stations
- Cordless phones
- Cruise control
- Curbside check-in systems
- Digital cameras
- Disk drives
- Electronic card readers
- Electronic instruments
- Electronic toys/games
- Factory control
- Fax machines
- Fingerprint identifiers
- Home security systems
- Life-support systems
- Medical testing systems
- Modems
- MPEG decoders
- Network cards
- Network switches/routers
- On-board navigation
- Pagers
- Photocopiers
- Point-of-sale systems
- Portable video games
- Printers
- Satellite phones
- Scanners
- Smart ovens/dishwashers
- Speech recognizers
- Stereo systems
- Teleconferencing systems
- Televisions
- Temperature controllers
- Theft tracking systems
- TV set-top boxes
- VCRs, DVD players
- Video game consoles
- Video phones
- Washers and dryers

And the list grows longer each year.

An embedded system example -- a digital camera

- Single-functioned -- always a digital camera
- Tightly-constrained -- Low cost, low power, small, fast

[Figure: digital camera block diagram with lens, CCD, A2D, D2A, CCD preprocessor, pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, memory controller, display controller, LCD controller, ISA bus interface and UART.]

Design metric competition -- improving one may worsen others

- Expertise with both software and hardware is needed to optimize design metrics
- Not just a hardware or software expert, as is common
- A designer must be comfortable with various technologies in order to choose the best for a given application and constraints
- Need serious Design Space Explorations

[Figure: competing design metrics: power, performance, size, NRE cost.]

Processor technology

- Processors vary in their customization for the problem at hand

Desired functionality:

total = 0
for i = 1 to N loop
  total += M[i]
end loop

Implementation options: general-purpose processor (software), application-specific processor, single-purpose processor (hardware).

Design Productivity Gap

- A 1981 leading-edge chip required 100 designer months
- 10,000 transistors / 100 transistors/month
- A 2002 leading-edge chip requires 30,000 designer months
- 150,000,000 transistors / 5,000 transistors/month
- Designer cost increased from $1M to $300M

[Figure: IC capacity (logic transistors per chip, in millions) and productivity (K transistors per staff-month) on logarithmic axes versus year, 1981 to 2009, with the gap between the two curves marked.]

More challenges coming

- Parallel
- Consists of multiple processors with hardware.
- Heterogeneous, Networked
- Each processor has its own speed, memory, power, reliability, etc.
- Fault-Tolerance, Reliability & Security
- A major issue for critical applications
- Design Space Explorations: timing, code-size, data memory, power consumption, cost, etc.
- System-Level Design, Analysis, and Optimization are important.
- The compiler plays an important role. We need more research.
- Let's start with timing optimizations, then other optimizations and design space issues.

Timing Optimization

- Parallelization for Nested Loops
- Focus on computation or data intensive applications.
- Loops are the most critical parts.
- Multi-dimensional systems (MD) Uniform nested loops.
- Develop efficient algorithms to obtain the schedule with the minimum execution time while hiding memory latencies.
- ALU part: MD Retiming to fully parallelize computations.
- Memory part: prefetching and partitioning to hide memory latencies.
- Developed by Edwin Sha’s group. The results are exciting.

Graph Representation for Loops

A[0] = A[1] = 0;
for (i=2; i<n; i++)
{
  A[i] = D[i-2] / 3;
  B[i] = A[i] * 5;
  C[i] = A[i] + 7;
  D[i] = B[i] + C[i];
}

[Figure: data flow graph of the loop with nodes A, B, C and D, with the delays marked.]

Rotation: Loop pipelining

[Figure: three schedules side by side (original schedule, regrouped schedule, rotated schedule), drawn as repeated rows of A, B C and D, with the prologue and epilogue marked.]

Multi-dimensional problems

DO 10 J = 0, N
  DO 1 I = 0, M
    d(i,j) = b(i,j-1) * c(i-1,j)    ! node D
    a(i,j) = d(i,j) * .5            ! node A
    b(i,j) = a(i,j) + 1.            ! node B
    c(i,j) = a(i,j) + 2.            ! node C
1 Continue
10 Continue

Circuit optimization

[Figure: multi-dimensional data flow graph with nodes A, B, C and D; one edge is labeled with delay (0,1) (z2^-1) and another with delay (1,0) (z1^-1).]

An Example of DSP Processor: TI TMS320C64X

- Clocking speed: 1.1 GHz, Up to 8800 MIPS.

Retiming

- An integer-valued transformation r on nodes
- Registers are re-distributed
- G = < V, E, d > becomes Gr = < V, E, dr >
- For an edge e from u to v: dr(e) = d(e) + r(u) - r(v) >= 0 for a legal retiming
- The number of delays in a cycle remains constant
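As a worked sketch, the retiming formula can be applied to the DFG of the earlier loop (A[i] = D[i-2]/3, B[i] = A[i]*5, C[i] = A[i]+7, D[i] = B[i]+C[i]). Unit computation time per node is an assumption of this example, and the function names are illustrative.

```python
# Edge (u, v) -> number of delays; D -> A has 2 delays because A[i] reads D[i-2].
EDGES = {("A", "B"): 0, ("A", "C"): 0, ("B", "D"): 0, ("C", "D"): 0, ("D", "A"): 2}
TIME = {"A": 1, "B": 1, "C": 1, "D": 1}   # assumed unit computation times


def retime(edges, r):
    """Apply dr(e) = d(e) + r(u) - r(v); reject illegal (negative) delays."""
    new = {(u, v): d + r[u] - r[v] for (u, v), d in edges.items()}
    if any(d < 0 for d in new.values()):
        raise ValueError("illegal retiming: negative delay")
    return new


def cycle_period(edges, time):
    """Computation time of the longest zero-delay path."""
    succ = {v: [] for v in time}
    for (u, v), d in edges.items():
        if d == 0:
            succ[u].append(v)
    memo = {}

    def longest(u):
        if u not in memo:
            memo[u] = time[u] + max((longest(v) for v in succ[u]), default=0)
        return memo[u]

    return max(longest(u) for u in time)


retimed = retime(EDGES, {"A": 1, "B": 0, "C": 0, "D": 0})
print(cycle_period(EDGES, TIME), cycle_period(retimed, TIME))
```

Retiming node A once moves one delay from the edge D -> A onto A -> B and A -> C, which shortens the longest zero-delay path from three nodes to two while the cycle still holds two delays.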

Multi-Dimensional Retiming

- A nested loop
- Illegal cases
- Retiming nested loops
- New problems …

Multi-Dimensional Retiming

Iteration Space

Iteration Space for Retimed Graph

Legal schedule with row-wise executions. S=(0,1)

Required Solution Needs:

- To avoid illegal retiming
- To be general
- To obtain full parallelism
- To be a fast Algorithm

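Schedule Vector (wavefront processing)

- Legal schedule: s · d > 0 for every delay vector d
- With delay vectors (1,1) and (-1,1): s · (1,1) > 0 and s · (-1,1) > 0
- Pick s = (0,1) and r ⊥ s, so r = (1,0)

[Figure: x-y plane showing the delay vectors (-1,1) and (1,1), the schedule vector s, the retiming vector r and the schedule plane S+.]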

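S must be feasible for new delay vectors

- Let the new delay vector be d' = d + kr. We know s · d > 0 and s · r = 0.
- s · (d + kr) = s · d + k (s · r) = s · d > 0
- => s is a legal schedule for d + kr.

[Figure: schedule plane with the schedule vector s, the delay vectors d1, d2, d3 and the retiming vector r.]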
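The argument can be checked numerically for the delay vectors (1,1) and (-1,1) with s = (0,1) and r = (1,0). The helper names below are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_legal_schedule(s, delays):
    """s is legal when s . d > 0 for every delay vector d."""
    return all(dot(s, d) > 0 for d in delays)


def shifted(d, r, k):
    """New delay vector d + k*r produced by retiming along r."""
    return tuple(di + k * ri for di, ri in zip(d, r))


delays = [(1, 1), (-1, 1)]
s, r = (0, 1), (1, 0)

print(is_legal_schedule(s, delays), dot(s, r))
for k in range(-3, 4):
    new_delays = [shifted(d, r, k) for d in delays]
    print(k, new_delays, is_legal_schedule(s, new_delays))
```

Since s · r = 0, adding any multiple of r to a delay vector leaves s · d unchanged, so s stays legal for every k. A schedule vector such as (1,0) is not legal here because (1,0) · (-1,1) < 0.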

Synchronous Circuit Optimization Example – original design (cont.)

Critical path: 6 adders & 2 mul.

Synchronous Circuit Optimization Example – Gnanasekaran’88

Synchronous Circuit Optimization Example – retimed design

Critical path is the minimum

Embedded System Design Review

- Strict requirements
- Time, power-consumption, code size, data memory size, hardware cost, areas, etc.
- Time-to-market, time-to-prototype
- Special architectural support
- Harvard architecture, on-chip memory, register files, etc.
- Increasing amount of software
- Flexibility of software, short time-to-market, easy upgrades
- The amount of software is doubling every two years.
- How to generate high-quality code for embedded systems?
- How to minimize and search the design space?
- Compiler role?

Compiler in Embedded Systems

- Ordinary C compilers for embedded processors are notorious for their poor code quality.
- Data memory overhead for compiled code can reach a factor of 5
- Cycle overhead can reach a factor of 8, compared with the manually generated assembly code. (Rozenberg et al., 1998)
- For code generation
- In general, compilers are part of the iterative loop of design space exploration and therefore play an important role in the design phase.
- Exploring efficient designs in a huge, n-dimensional space, where each dimension corresponds to a design choice.

Compiler in Design Space Exploration

- Algorithm selection: to analyze dependencies between algorithms and processor architectures.
- HW/SW partitioning
- Memory related issues (program and data memory)
- Optimally placing programs and data in on-chip memories, and hiding off-chip memory latencies by smart pre-fetching. (Sha et al.)
- Data mapping for processors with multiple on-chip memory modules. (Zhuge and Sha)
- Code size reduction for software-pipelined applications. (Zhuge and Sha)
- Instruction set options
- Search for power-optimized instruction sets (Kin et al. 1999)
- Scheduling for loops and DAG with the min. energy. (Shao and Sha)

Design Space Minimization

- A direct design space is too large. Must do design space minimization before exploration.
- Derive the properties of relationships of design metrics
- A huge number of design points can be proved to be infeasible before performing time-consuming design space exploration.
- Using our design space minimization algorithm, the design points can be reduced from 510 points to 6 points for synthesizing All-pole Filter, for example.
- Approach:
- Develop effective optimization techniques: code size, time, data memory, low-power
- Understand the relationship among them
- Design space minimization algorithms
- Efficient design space exploration algorithms using fuzzy logic

Example Relation of Optimizations

- Retiming
- Transform a DFG to minimize its cycle period in polynomial time by redistributing the delays in the DFG.
- Cycle period c(G) of DFG G is the computation time of the longest zero-delay path.
- Unfolding
- The original DFG G is unfolded f times, so the unfolded graph Gf consists of f copies of the original node set.
- Iteration period: P = c(Gf)/f.
- Code size is increased f times. Software pipelining will increase it more.

[Figure: example DFGs with nodes A, B, C and D.]
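A small sketch of unfolding and the iteration period P = c(Gf)/f, using the standard construction in which an edge u -> v with d delays becomes, for each copy i, an edge from u_i to v_((i+d) mod f) with floor((i+d)/f) delays. The example DFG is the loop A[i] = D[i-2]/3, B[i] = A[i]*5, C[i] = A[i]+7, D[i] = B[i]+C[i] with assumed unit computation times. All names are illustrative.

```python
EDGES = {("A", "B"): 0, ("A", "C"): 0, ("B", "D"): 0, ("C", "D"): 0, ("D", "A"): 2}
TIME = {"A": 1, "B": 1, "C": 1, "D": 1}   # assumed unit computation times


def unfold(edges, time, f):
    """Return the f-unfolded graph as (edges, node times)."""
    new_edges = {}
    for (u, v), d in edges.items():
        for i in range(f):
            new_edges[(f"{u}{i}", f"{v}{(i + d) % f}")] = (i + d) // f
    new_time = {f"{v}{i}": t for v, t in time.items() for i in range(f)}
    return new_edges, new_time


def cycle_period(edges, time):
    """Computation time of the longest zero-delay path."""
    succ = {v: [] for v in time}
    for (u, v), d in edges.items():
        if d == 0:
            succ[u].append(v)
    memo = {}

    def longest(u):
        if u not in memo:
            memo[u] = time[u] + max((longest(v) for v in succ[u]), default=0)
        return memo[u]

    return max(longest(u) for u in time)


def iteration_period(edges, time, f):
    unfolded_edges, unfolded_time = unfold(edges, time, f)
    return cycle_period(unfolded_edges, unfolded_time) / f


print(iteration_period(EDGES, TIME, 1), iteration_period(EDGES, TIME, 2))
```

Unfolding twice keeps the cycle period at 3 but covers two iterations, so the iteration period drops from 3 to 1.5, at the price of doubling the loop body.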

Experimental Results

- The search space size using our method is only 2% of the search space using the standard method on average.
- The quality of the solutions found by our algorithm is better than that of the standard method.

A Design Space Minimization Problem

- Clearly understand the relationships: retiming, unfolding and iteration period.

Program Memory Consideration with Code Size Minimization

- Multiple on-chip memory banks, but usually only one program memory.
- The capacity of an on-chip memory bank is very limited
- Motorola’s DSP56K has only a 512 × 24-bit program memory
- ARM940T uses a 4K instruction cache (Icache)
- StrongARM SA-1110 uses a 16K cache
- A widely used performance optimization technique, software pipelining, expands the code to several times its original size.
- Designers need to fit the code into the small on-chip memory to avoid slow (external) memory accesses.
- The code size becomes a critical concern for many embedded processors.

Code Size Expansion Caused by Software Pipelining

Original loop:

for i=1 to n do
  A[i] = E[i-4] + 9;
  B[i] = A[i] * 0.5;
  C[i] = A[i] + B[i-2];
  D[i] = A[i] * C[i];
  E[i] = D[i] + 30;
end

Software-pipelined code (prologue, loop body, epilogue):

A[1] = E[-3] + 9;
A[2] = E[-2] + 9;
B[1] = A[1] * 0.5;
C[1] = A[1] + B[-1];
A[3] = E[-1] + 9;
B[2] = A[2] * 0.5;
C[2] = A[2] + B[0];
D[1] = A[1] * C[1];
for i=1 to n-3 do
  A[i+3] = E[i-1] + 9;
  B[i+2] = A[i+2] * 0.5;
  C[i+2] = A[i+2] + B[i];
  D[i+1] = A[i+1] * C[i+1];
  E[i] = D[i] + 30;
end
B[n] = A[n] * 0.5;
C[n] = A[n] + B[n-2];
D[n-1] = A[n-1] * C[n-1];
E[n-2] = D[n-2] + 30;
D[n] = A[n] * C[n];
E[n-1] = D[n-1] + 30;
E[n] = D[n] + 30;

- Schedule length is decreased from 4 cycles to 1 cycle.
- Code size is expanded to 3 times larger than the original code size.
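The two versions can be checked against each other with a direct transcription into Python. The initial values for E[-3..0], B[-1] and B[0] are hypothetical (the slides do not give them), dictionaries stand in for the arrays so that negative indices work, and n >= 3 is assumed.

```python
def init_arrays():
    # Hypothetical initial values for elements read before they are written.
    B = {-1: 1.0, 0: 2.0}
    E = {-3: 1.0, -2: 2.0, -1: 3.0, 0: 4.0}
    return {}, B, {}, {}, E


def original_loop(n):
    A, B, C, D, E = init_arrays()
    for i in range(1, n + 1):
        A[i] = E[i - 4] + 9
        B[i] = A[i] * 0.5
        C[i] = A[i] + B[i - 2]
        D[i] = A[i] * C[i]
        E[i] = D[i] + 30
    return [E[i] for i in range(1, n + 1)]


def pipelined_loop(n):
    A, B, C, D, E = init_arrays()
    # Prologue
    A[1] = E[-3] + 9
    A[2] = E[-2] + 9; B[1] = A[1] * 0.5; C[1] = A[1] + B[-1]
    A[3] = E[-1] + 9; B[2] = A[2] * 0.5; C[2] = A[2] + B[0]; D[1] = A[1] * C[1]
    # Loop body: i = 1 .. n-3
    for i in range(1, n - 2):
        A[i + 3] = E[i - 1] + 9
        B[i + 2] = A[i + 2] * 0.5
        C[i + 2] = A[i + 2] + B[i]
        D[i + 1] = A[i + 1] * C[i + 1]
        E[i] = D[i] + 30
    # Epilogue
    B[n] = A[n] * 0.5; C[n] = A[n] + B[n - 2]
    D[n - 1] = A[n - 1] * C[n - 1]; E[n - 2] = D[n - 2] + 30
    D[n] = A[n] * C[n]; E[n - 1] = D[n - 1] + 30
    E[n] = D[n] + 30
    return [E[i] for i in range(1, n + 1)]


print(original_loop(10) == pipelined_loop(10))
```

The five statements in the pipelined loop body read only values produced in earlier iterations or in the prologue, which is what lets them share one control step on a machine with enough functional units.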

Rotation Scheduling

- Resource Constrained Loop Scheduling based on Retiming concept.
- Retiming gives a clear framework for software pipelining depth.
- Given an initial DAG schedule, rotation scheduling repeatedly rotates down the nodes in the first row of the schedule.
- In each step of rotation, the nodes in the first row are:
- retimed once, by removing one delay from each incoming edge of the node and adding one delay to each of its outgoing edges;
- rescheduled to available locations (such as the earliest ones) in the schedule, based on the new precedence relations defined in the retimed graph.
- The optimal schedule length can be obtained in polynomial time (2 |V|) in most cases.
- The techniques can be generalized to deal with code-size, switching activities, branches, nested loops, etc.

Rotation: Loop Pipelining

[Figure: the original schedule, the rotation of the first row, and the rescheduled result with its epilogue, drawn as repeated rows of A, B C and D.]

The Second Rotation

[Figure: the schedule after the 1st rotation phase, the 2nd rotation, and the final schedule after rescheduling, with the prologue marked.]

Retiming View of Loop Pipelining

- After the 1st rotation: r(A)=1, r(B)=r(C)=r(D)=0, cycle period = 2
- After the 2nd rotation: r(A)=2, r(B)=r(C)=1, r(D)=0, cycle period = 1

[Figure: the DFG with nodes A, B, C and D after each retiming.]

Prologue and Retiming Function

- Original schedule; the 1st rotation gives r(A)=1; the 2nd rotation gives r(A)=2
- The number of copies of node A in the prologue = r(A)
- The number of copies of node A in the epilogue = (max_u r(u)) - r(A), for u ∈ V.

[Figure: the schedules after each rotation, showing the copies of A in the prologue and the remaining nodes in the epilogue.]
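The two counting rules can be applied to the retiming used in the software-pipelined example of this presentation, r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0. The function name is illustrative.

```python
def prologue_epilogue_copies(r):
    """Copies of each node in the prologue and epilogue for retiming r."""
    top = max(r.values())
    prologue = dict(r)                               # r(v) copies
    epilogue = {v: top - rv for v, rv in r.items()}  # max r - r(v) copies
    return prologue, epilogue


r = {"A": 3, "B": 2, "C": 2, "D": 1, "E": 0}
prologue, epilogue = prologue_epilogue_copies(r)
body = len(r)
print(sum(prologue.values()), body, sum(epilogue.values()))
```

This gives 8 prologue statements, 5 loop-body statements and 7 epilogue statements, the same counts as in the software-pipelined code listed in the code size expansion example.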

CRED Technique Using Predicate Register

- Predicate register
- An instruction can be guarded by a predicate register.
- The instruction is executed when the value of the predicate register is true; otherwise, the instruction is disabled.
- Implement CRED using predicate register with counter (TI’s TMS320C6x)
- Set the initial value p = (max_u r(u)) - r(v).
- Decrement p by one in each iteration.
- The instruction is executed when 0 ≥ p > -n, where n is the loop counter of the original loop. The instruction is disabled when p > 0 or p ≤ -n.

The New Execution Sequence

Software-pipelined loop schedule with

r(A)=3, r(B)=r(C)=2, r(D)=1, r(E)=0,

and n=5.

The execution sequence after performing

CRED using 4 conditional registers.

The new code size.

Processor Classes

- Processor Class 0: No predicate register
- Motorola’s StarCore DSP processor
- Processor Class 1: Has “condition code” bits in instruction, no predicate register
- Intel’s StrongARM and other ARM architectures
- Processor Class 2: Has 1-bit predicate registers
- Philip’s TriMedia Multimedia processor
- Processor Class 3: Has predicate registers with counters
- TI’s TMS320C6x processor
- Processor Class 4: Specialized hardware support for executing software-pipelined loops
- IA64

Code Size Reduction for Class 3

The software-pipelined code shown earlier (prologue, loop body and epilogue) becomes a single loop whose instructions are guarded by the predicate registers p, q, r and s. The loop runs n+3 times so that every instruction is enabled exactly n times:

p=0; q=1; r=2; s=3;
for i=-2 to n do
  [p] A[i+3] = E[i-1] + 9;
  p--;
  [q] B[i+2] = A[i+2] * 0.5;
  [q] C[i+2] = A[i+2] + B[i];
  q--;
  [r] D[i+1] = A[i+1] * C[i+1];
  r--;
  [s] E[i] = D[i] + 30;
  s--;
end
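A minimal simulation of the predicated loop, under two assumptions that are not spelled out on the slide: the loop index runs from -2 to n (n + 3 iterations, so each guarded instruction is enabled exactly n times), and the elements read before being written (E[-3..0], B[-1], B[0]) have hypothetical initial values. The result is compared with the original, non-pipelined loop.

```python
def init_arrays():
    # Hypothetical initial values for elements read before they are written.
    B = {-1: 1.0, 0: 2.0}
    E = {-3: 1.0, -2: 2.0, -1: 3.0, 0: 4.0}
    return {}, B, {}, {}, E


def original_loop(n):
    A, B, C, D, E = init_arrays()
    for i in range(1, n + 1):
        A[i] = E[i - 4] + 9
        B[i] = A[i] * 0.5
        C[i] = A[i] + B[i - 2]
        D[i] = A[i] * C[i]
        E[i] = D[i] + 30
    return [E[i] for i in range(1, n + 1)]


def cred_loop(n):
    A, B, C, D, E = init_arrays()
    p, q, r, s = 0, 1, 2, 3      # max r(u) - r(v) for A, B/C, D, E

    def enabled(pred):
        return 0 >= pred > -n    # guard from the slide: 0 >= p > -n

    for i in range(-2, n + 1):   # assumed trip count: n + 3
        if enabled(p):
            A[i + 3] = E[i - 1] + 9
        p -= 1
        if enabled(q):
            B[i + 2] = A[i + 2] * 0.5
            C[i + 2] = A[i + 2] + B[i]
        q -= 1
        if enabled(r):
            D[i + 1] = A[i + 1] * C[i + 1]
        r -= 1
        if enabled(s):
            E[i] = D[i] + 30
        s -= 1
    return [E[i] for i in range(1, n + 1)]


print(cred_loop(10) == original_loop(10))
```

The guards switch the instructions on one group at a time during the first three iterations and off again during the last three, so the single loop body plays the role of prologue, steady state and epilogue.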

CRED on Various Types of Processors

- The TI model and IA64 are very efficient for code size reduction.
- The TI model is very effective for DSP processors that support predicate registers but lack the specialized hardware found in IA64.

Experimental Results on Code Size/Performance Trade-off

- Code size/performance exploration for All-pole Filter on the modified TMS320C6x processor with only 2 predicate registers.
- The code size is increased when software pipeline depth is increased and the schedule length is decreased.
- Our approach finds the shortest schedule length satisfying a code size constraint.

Code-size Reduction for Nested Loop

- Assume 8 functional units.
- Traditional software pipelining can achieve only 6 clock cycles at best.
- Interchanging the loop indices cannot help the optimization.

[Figure: cell dependency graph over the iteration space (0,0) to (2,2), and data flow graph with nodes 1 to 12 and edge delays (0,1) and (1,0).]

(a) The original loop:

Outer loop begin (trip count = m)
Outer 1 (6 cycles, 15 instructions)
Inner loop (10 cycles, 12 instr., trip count = n)
Outer 2 (5 cycles, 15 instr.)

Assuming m=1000, n=10

Total cycles = m(6+10n+5) = 10mn+11m = 111,000
Code size = 42 instr.

MD Retiming and Code Reduction

[Figure: retimed data flow graph with nodes 1 to 12 and edge delays (-4,1) and (1,0).]

Retimed DFG: r(1)=r(2)=r(3)=r(4)=(4,0), r(5)=r(6)=(3,0), r(7)=r(8)=(2,0), r(9)=r(10)=(1,0), r(11)=r(12)=(0,0)

(b) Inner-outer combined software pipelining:

Outer loop begin (trip count = m)
Outer 1 & Prologue (12 cycles, 15+28 instr.)
Inner loop (2 cycles, 12 instr., trip count = n-4)
Outer 2 & Epilogue (12 cycles, 15+20 instr.)

Total cycles = m(12+2(n-4)+12) = 2mn+16m = 36,000
Code size = 90 instr.

(c) Code size reduction: remove prologue and epilogue

Outer loop begin (trip count = m)
Outer 1 (6 cycles, 15+4 instr.)
Inner loop (2 cycles, 16 instr., trip count = n+4)
Outer 2 (5 cycles, 15 instr.)

Total cycles = m(6+2(n+4)+5) = 2mn+19m = 39,000
Code size = 50 instr.

Outer Loop Pipeline and Code Reduction

(d) Outer loop pipelining:

Outer loop begin (trip count = m-1)
Outer 1 (6 cycles, 19 instr.)
Inner loop (2 cycles, 16 instr., trip count = n+4)
Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 19 instr.)
Inner loop (2 cycles, 16 instr., trip count = n+4)
Outer 2 (5 cycles, 15 instr.)

Total cycles = 6+(m-1)(2(n+4)+6)+2(n+4)+5 = 2mn+14m+5 = 34,005
Code size = 100 instr.

(e) Reduce new epilogue:

Outer loop begin (trip count = m)
Outer 1 (6 cycles, 19+1 instr.)
Inner loop (2 cycles, 16 instr., trip count = n+4)
Outer 2 (5 cycles, 15 instr.); Outer 1 (6 cycles, 20 instr.)

Total cycles = 6+m(2(n+4)+6) = 2mn+14m+6 = 34,006
Code size = 71 instr.
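The cycle counts and code sizes of variants (a) to (e) follow directly from the per-section figures quoted above. The short script below recomputes them for m = 1000 and n = 10. The dictionary layout and function name are illustrative.

```python
def nested_loop_variants(m, n):
    """Return {variant: (total cycles, code size in instructions)}."""
    inner = 2 * (n + 4)   # pipelined inner loop with prologue/epilogue folded in
    return {
        "a": (m * (6 + 10 * n + 5), 15 + 12 + 15),
        "b": (m * (12 + 2 * (n - 4) + 12), (15 + 28) + 12 + (15 + 20)),
        "c": (m * (6 + inner + 5), (15 + 4) + 16 + 15),
        "d": (6 + (m - 1) * (inner + 6) + inner + 5, 19 + 16 + (15 + 19) + 16 + 15),
        "e": (6 + m * (inner + 6), (19 + 1) + 16 + (15 + 20)),
    }


for name, (cycles, size) in nested_loop_variants(1000, 10).items():
    print(name, cycles, size)
```

Variant (e) keeps almost all of the speed of (d), 34,006 versus 34,005 cycles, while cutting the code size from 100 to 71 instructions.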

Data Memory Consideration with Optimal Data Mapping

- Multiple memory banks are accessible in parallel.
- Provides higher memory bandwidth.
- Many existing compilers cannot exploit this kind of architectural feature. Instead, all variables are assigned to just one bank.
- Data mapping and scheduling become one of the most important factors in performance optimization.

[Figure: data memory with Bank 0 and Bank 1 connected to the ALU through the data buses DB0 and DB1.]

Our Model – Variable Independence Graph

[Figure: variable independence graph on the variables A, B, C, E, F and G with edge weights such as 1/2 and 7/8, split into Partition 1 and Partition 2.]

Weight(e = (u,v)): the “gain” of putting u and v in different memory modules. We want to find a maximum-weight partition.
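To make the objective concrete, the sketch below finds a maximum-weight two-bank partition by exhaustive search on a tiny graph. The edge weights are hypothetical (the weights in the figure cannot be recovered from the transcript), and brute force is used only for illustration; it is not the partitioning algorithm of the presentation and does not scale beyond a handful of variables.

```python
from itertools import combinations


def best_two_bank_partition(nodes, weights):
    """Maximize the total weight of edges whose endpoints land in different banks."""
    nodes = list(nodes)
    first, rest = nodes[0], nodes[1:]   # pin the first node to bank 0 (symmetry)
    best_gain, best = -1.0, None
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            bank0 = {first, *extra}
            gain = sum(w for (u, v), w in weights.items()
                       if (u in bank0) != (v in bank0))
            if gain > best_gain:
                best_gain, best = gain, (bank0, set(nodes) - bank0)
    return best_gain, best


# Hypothetical weights: the gain of separating each pair of variables.
weights = {("A", "B"): 1.0, ("B", "C"): 0.5, ("A", "C"): 0.5}
print(best_two_bank_partition(["A", "B", "C"], weights))
```

Here separating A from both B and C collects the edges A-B and A-C, for a total gain of 1.5.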

Experimental Results

- The IG approach uses list scheduling and an interference graph model (M. Saghir et al., University of Toronto, Canada; R. Leupers et al., University of Dortmund, Germany).
- Our approach uses rotation scheduling with a variable repartitioning algorithm and a variable independence graph.
- Different approaches result in different variable partitions.
- The largest improvement in schedule length using our approach is 52.9%. The average improvement on the benchmarks is 44.8%.

Conclusions

- An exciting area: optimizations for parallel DSP and embedded systems. Gave an overview. Needs much more work.
- Consider both architectures and compilers.
- Presented techniques:
- Multi-dimensional (MD) retiming, Rotation
- Code-size minimization for software pipelined loops
- Design space minimization
- Optimal partitioning and prefetching to completely hide memory latencies. And decide the minimum required on-chip memory
- Detailed retiming, unfolding, low-power scheduling, rate-optimal scheduling, etc. were presented in tutorial. Still a lot more.
- Please check my web page for details: www.utdallas.edu/~edsha
