Algorithmic transformations

Goals

  • The goal: put the DSP algorithm into a form amenable to implementation before synthesizing the design on the selected platform (FPGA or PDSP).

  • No changes to the actual algorithms, just changes to the way the algorithms are prepared for implementation.

  • This will require understanding aspects of

    • timing,

    • pipelining,

    • parallelism

(C)2002-2004 Yu Hen Hu


Overview

  • Algorithm Representations and Iteration Bound

  • Parallelism and Pipelining

  • Retiming

  • Unfolding

  • Folding


Data Flow Graph

  • Node: a computation, associated with a computing time.

  • Directed edge: a data path, possibly carrying a delay.

  • Delay: measured in iterations and labeled on the edge with D or a positive integer.

  • Example: y(n) = a*y(n-1) + b*u(n)

    The single delay on the feedback edge indicates that computing y(n+1) in the next iteration depends on the result y(n) of the present iteration.


DFG

  • Intra-iteration dependency: a directed edge without any delay.

  • Inter-iteration dependency: a directed edge with 1 or more delays.

  • Node computing delay is labeled in parentheses.

  • Critical path: the longest path between registers.

    Example: critical path delay = 4+2+2 = 8 t.u.

  • Recursive DFG: contains loops. There must be at least one delay element along every loop; otherwise the algorithm is NON-computable!

[Figure: DFG with input x(n), two delays D, multipliers M0, M1, M2 (4 t.u. each), adders A0, A1 (2 t.u. each), and output y(n).]


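Loop Bound and Iteration Bound

  • Loop bound: the total computing time of the nodes in a loop divided by the number of delays in that loop.

    T{A-B-A} = (2+4)/2 = 3 t.u.

  • Iteration bound: the maximum loop bound over all loops in the DFG.

    T = max{(2+4)/2, (2+4+5)/1} = max{3, 11} = 11 t.u.

[Figure: two DFGs. One has nodes A (2) and B (4) in a loop with 2D. The other has nodes A (2), B (4), and C (5), with delays D and 2D on its loops.]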
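
The two numbers on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes the loops have already been enumerated by hand, and the function names `loop_bound` and `iteration_bound` are illustrative, not part of the original slides.

```python
from fractions import Fraction

def loop_bound(node_times, delays):
    """Loop bound = total computing time of the loop / number of delays in it."""
    if delays <= 0:
        raise ValueError("a loop without delays is not computable")
    return Fraction(sum(node_times), delays)

def iteration_bound(loops):
    """Iteration bound = maximum loop bound over all listed loops."""
    return max(loop_bound(times, delays) for times, delays in loops)

# Loop A-B-A: t(A) = 2, t(B) = 4, two delays.
single_loop = [((2, 4), 2)]
# Same loop plus a loop through A, B, C (5 t.u.) that carries one delay.
two_loops = [((2, 4), 2), ((2, 4, 5), 1)]

print(iteration_bound(single_loop))  # 3
print(iteration_bound(two_loops))    # 11
```

Exact fractions are used so that non-integer loop bounds such as 4/3 compare correctly.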




Solution

  • To achieve high speed, the length of the critical path can be reduced by pipelining, and the throughput can be raised further by parallel processing.




Basic Ideas

  • Parallel processing

    • Less inter-processor communication

    • Complicated processor hardware

  • Pipelined processing

    • More inter-processor communication

    • Simpler processor hardware

[Figure: two time-versus-processor schedules for processors P1 to P4, one for parallel processing and one for pipelined processing, with entries a1 to d4. Colors: different types of operations performed. a, b, c, d: different data streams processed.]


Data Dependence

  • Parallel processing requires NO data dependence between processors.

  • Pipelined processing will involve inter-processor communication.

[Figure: time-versus-processor diagrams for P1 to P4 under parallel and pipelined processing.]


Usage of Pipelined Processing

  • By inserting latches or registers between combinational logic circuits, the critical path can be shortened.

  • Consequence:

    • reduced clock cycle time,

    • increased clock frequency.

  • Suitable for DSP applications that have (infinitely) long data streams.

  • Method to incorporate pipelining: cut-set retiming

  • Cut set: a set of edges of a graph such that, if these edges are removed from the original graph, the remaining graph becomes two separate graphs.

  • Retiming: the timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.


Pipelining


Pipelining of FIR filters


Fine-grain pipelining

  • To reduce the critical path further, the multiplier delay TM is split into two pipelined stages, TM1 and TM2.

  • Critical path = Max{TM1, TM2, TA}


Graphic Transpose Theorem

  • The transfer function of a signal flow graph remains unchanged if

    • the direction of each arc is reversed, and

    • the input and output labels are switched.

[Figure: a 3-tap FIR signal flow graph with input x[n], two z-1 delays, coefficients h[0], h[1], h[2], and output y[n], shown next to its transposed graph.]
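
As a quick sanity check of the theorem, the sketch below simulates a direct-form FIR filter and its transposed form and compares the outputs. The function names and the test coefficients are illustrative assumptions; integer data is used so the comparison is exact.

```python
def fir_direct(h, x):
    """Direct form: a delay line on the input, then multiply and sum."""
    taps = [0] * len(h)                    # taps[k] holds x[n-k]
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]
        out.append(sum(hk * xk for hk, xk in zip(h, taps)))
    return out

def fir_transposed(h, x):
    """Transposed form: the input feeds every multiplier at once,
    and the delays sit between the adders."""
    state = [0] * (len(h) - 1)             # registers between adders
    out = []
    for sample in x:
        products = [hk * sample for hk in h]
        out.append(products[0] + (state[0] if state else 0))
        # Each register stores its product plus the register behind it.
        state = [products[k + 1] + (state[k + 1] if k + 1 < len(state) else 0)
                 for k in range(len(state))]
    return out

h = [1, 2, 3]
x = [1, 0, 2, -1, 4]
print(fir_direct(h, x))      # [1, 2, 5, 3, 8]
print(fir_transposed(h, x))  # [1, 2, 5, 3, 8]
```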


Data Broadcast Structure

  • An algorithm transform may lead to a pipelined structure without adding delays.

  • Given an FIR filter SFG: critical path TM+2TA

  • Use the graph transposition theorem:

    • Reverse all arcs

    • Reverse input/output

  • We obtain the data broadcast structure: critical path TM+TA

  • No additional delay added!


Block Processing

  • One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).

  • Block vector: [x(3k) x(3k+1) x(3k+2)]

  • Clock cycle: can be 3 times longer

  • Original (FIR filter):

  • Rewrite 3 equations at a time:

  • Define block vector

  • Block formulation:
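
The filter equations for this slide did not survive in the transcript, so the sketch below assumes a hypothetical 3-tap FIR filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2) with zero initial conditions. It only illustrates the idea of producing three outputs per iteration from the block vector [x(3k) x(3k+1) x(3k+2)]; the function names are made up for the example.

```python
def fir_sample(a, b, c, x):
    """One output per iteration: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    xp = [0, 0] + list(x)                        # zero initial conditions
    return [a * xp[n + 2] + b * xp[n + 1] + c * xp[n] for n in range(len(x))]

def fir_block3(a, b, c, x):
    """Three outputs per iteration, computed from one block vector."""
    assert len(x) % 3 == 0, "sketch assumes a whole number of blocks"
    y = []
    x1 = x2 = 0                                  # x(3k-1) and x(3k-2)
    for k in range(len(x) // 3):
        x3k, x3k1, x3k2 = x[3 * k: 3 * k + 3]    # block vector X(k)
        y += [a * x3k + b * x1 + c * x2,         # y(3k)
              a * x3k1 + b * x3k + c * x1,       # y(3k+1)
              a * x3k2 + b * x3k1 + c * x3k]     # y(3k+2)
        x1, x2 = x3k2, x3k1                      # carry the last two samples
    return y

samples = [1, 2, 3, 4, 5, 6]
print(fir_sample(1, 2, 3, samples) == fir_block3(1, 2, 3, samples))  # True
```

Each block iteration does three times the work, which is why the clock cycle can be three times longer for the same sample rate.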


General approach for block processing


Timing Comparison

  • Pipelining

  • Block processing

[Figure: timing diagrams with a MAC row (inputs x(1) to x(4), outputs y(1) to y(4)), Add and Mul rows over time slots 1 to 8, and rows that split the input into odd samples x(1), x(3), x(5), x(7) and even samples x(2), x(4), x(6), x(8).]




Definitions

  • Retiming

    Retiming is a mapping from a given DFG, G, to a retimed DFG, Gr, such that the corresponding transfer functions of G and Gr differ by a pure delay z^-L.

  • Purposes

    • To facilitate pipelining to reduce clock cycle time

    • To reduce the number of registers needed.


Cut Set Retiming


Cut set delay transfer


Cut-set delay transfer failure


Cut-set Retiming

  • Two kinds of cut set: feed-forward cut-set and feed-back cut-set.

  • Delay transfer theorem

    • Adding the same arbitrary non-negative number of delays to every edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.

    • Transferring the same number of delays from the edges in one direction across a feed-back cut-set of a DFG to all edges in the opposite direction across the same cut-set will not alter the output, only its timing.


Feed-forward Cut-Set Retiming

  • Consider the FIR digital filter and its DFG:

    y(n) = b0x(n) + b1x(n-1)

    Critical path length = TM+TA

  • Select a cut set.

  • Insert one delay on each edge in the cut set.

  • Retiming:

    ynew(n) = b0x(n-1) + b1x(n-2)

    ynew(n) = y(n-1)

    Critical path = Max(TM, TA)

[Figure: DFG of the filter before and after retiming. Before: x(n) and x(n-1) feed multipliers b0 and b1, whose outputs are summed into y(n). After: a delay D sits on each multiplier output ahead of the adder.]
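
The claim ynew(n) = y(n-1) can be checked with a small cycle-by-cycle simulation. This is a sketch under the assumption that the cut set separates the two multipliers from the adder and that all registers start at zero; the function names are illustrative.

```python
def fir_original(b0, b1, x):
    """y(n) = b0*x(n) + b1*x(n-1), computed in one clock cycle."""
    x_prev, out = 0, []
    for xn in x:
        out.append(b0 * xn + b1 * x_prev)
        x_prev = xn
    return out

def fir_retimed(b0, b1, x):
    """Same filter with one delay inserted on each cut-set edge
    (multiplier output to adder input)."""
    x_prev = 0
    reg0 = reg1 = 0                          # the two inserted delays
    out = []
    for xn in x:
        out.append(reg0 + reg1)              # adder uses last cycle's products
        reg0, reg1 = b0 * xn, b1 * x_prev    # multipliers load the registers
        x_prev = xn
    return out

y = fir_original(2, 3, [1, 2, 3, 4])
y_new = fir_retimed(2, 3, [1, 2, 3, 4])
print(y)      # [2, 7, 12, 17]
print(y_new)  # [0, 2, 7, 12]
```

The retimed output is the original output delayed by one sample, while a multiply and an add no longer happen in the same clock cycle.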


Feed-back Cut Set Retiming

  • Consider an IIR digital filter

    y(n) = a·y(n-2) + x(n)

    loop bound = (TM+TA)/2

    clock cycle = TM+TA

  • Shift 1 delay to the other edge across a feed-back cut set.

    The filter remains unchanged.

    loop bound = (TM+TA)/2

    clock cycle = Max(TM, TA)

[Figure: the IIR filter DFG before retiming, with 2D on one loop edge, and after retiming, with one D on each of two loop edges.]


Feed-back Cut Set Retiming

  • Consider an IIR digital filter

    y(n) = ay(n-1) + x(n)

    loop bound = (TM+TA)

    throughput = 1/(TM+TA)

  • Slow the input down by a factor of 2 (index m, with D replaced by 2D in the loop). The slowed input is x(m) = x(k) for m = 2k-1 and x(m) = 0 for m = 2k.

    Clock period = (TM+TA)

    Throughput = 1/[2(TM+TA)]

[Figure: the original filter with input x(n), output y(n), and one delay D in the loop, and the slowed-down filter with input x(m), output y(m), and 2D in the loop.]


Time scaling


Slowing down the input rate


Loss of Efficiency


Slowdown + Retiming

  • Start with

    y(n) = a y(n-1) + x(n)

    clock cycle = Max(TM, TA)

    Throughput = 1/[2 Max(TM, TA)]

  • Start with

    y(n) = a y(n-2) + x(n)

    loop bound = (TM+TA)/2

    clock cycle = Max(TM, TA)

    throughput = 1/Max(TM, TA)

[Figure: the two filters after retiming, each with one delay D on each of two loop edges; the first uses the slowed-down indices x(m), y(m).]
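
The first case can be simulated to see why the slowed-down filter still computes the right answer at half the throughput. The sketch assumes a 2-slow version of y(n) = a*y(n-1) + x(n) in which the loop delay D becomes 2D and a zero is inserted after every input sample; indices start at 0 here, and the function names are illustrative.

```python
def iir1(a, x):
    """Original filter: y(n) = a*y(n-1) + x(n)."""
    y_prev, out = 0, []
    for xn in x:
        y_prev = a * y_prev + xn
        out.append(y_prev)
    return out

def iir1_two_slow(a, x):
    """2-slow filter: y(m) = a*y(m-2) + x(m), fed with zero-interleaved input."""
    slow_in = []
    for xn in x:
        slow_in += [xn, 0]                   # one real sample, one zero
    d1 = d2 = 0                              # d1 = y(m-1), d2 = y(m-2)
    out = []
    for xm in slow_in:
        ym = a * d2 + xm
        d1, d2 = ym, d1                      # shift the two loop delays
        out.append(ym)
    return out

reference = iir1(2, [1, 2, 3])
slowed = iir1_two_slow(2, [1, 2, 3])
print(reference)     # [1, 4, 11]
print(slowed[::2])   # [1, 4, 11]
```

Only every other output of the slowed filter carries data, which is the factor of 2 in the throughput expression. The two loop delays are what give retiming room to separate the multiplier from the adder.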


Slow Down for Cut-Set Retiming


Example of retiming

  • Node delay = 1 t.u.

  • Before retiming:

    • Critical path: a3 → a4 → a5 → a6

    • Clock cycle time = 4

    • 2 delay units

  • After cut-set retiming

    • Critical path: a3 → a5, a4 → a6

    • Clock cycle time = 2

    • 6 delay units

  • After additional retiming

    • Critical path: none

    • Clock cycle time = 1

    • 11 delay units

[Figure: three versions of a DFG with nodes a1 to a6, showing the delays (D, 2D) on its edges before retiming, after cut-set retiming, and after additional retiming.]


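Node Retiming

  • Transfer delay through a node in DFG:

    r(v) = # of delays transferred from out-going edges to incoming edges of node v

    w(e) = # of delays on edge e

    wr(e) = # of delays on edge e after retiming

  • Retiming equation, for an edge e from node u to node v:

    wr(e) = w(e) + r(v) – r(u)

    subject to wr(e) ≥ 0.

  • Let p be a path from v0 to vk, and let w(p) be the total number of delays on p; then the retiming values of the intermediate nodes cancel and

    wr(p) = w(p) + r(vk) – r(v0)

[Figure: an edge e from u to v; a node v retimed with r(v) = 2, which moves two delays from its out-going edges to its incoming edges; and a path p from v0 to vk through edges e0, e1, ..., ek.]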


Invariant Properties

  • Retiming does NOT change the total number of delays in each cycle.

  • Retiming does not change the loop bound or iteration bound of the DFG.

  • If a constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.
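
The retiming equation wr(e) = w(e) + r(v) – r(u) and the invariants above are easy to check numerically. The sketch below uses the edge weights of the four-node example worked through in these slides (w(e21) = 1, w(e13) = 1, w(e14) = 2, w(e32) = 0, w(e42) = 0); the dictionary layout and the function names are assumptions made for illustration.

```python
def retime(edges, r):
    """edges: {(u, v): w}; r: {node: retiming value}. Returns {(u, v): wr}."""
    retimed = {}
    for (u, v), w in edges.items():
        wr = w + r[v] - r[u]                 # retiming equation
        if wr < 0:
            raise ValueError(f"edge {u}->{v} would get {wr} delays")
        retimed[(u, v)] = wr
    return retimed

def loop_delays(edges, loop):
    """Total delays along a loop given as a list of nodes, e.g. [1, 3, 2]."""
    return sum(edges[(loop[i], loop[(i + 1) % len(loop)])]
               for i in range(len(loop)))

edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}
r = {1: 0, 2: 1, 3: 0, 4: 0}
retimed = retime(edges, r)
print(retimed)  # {(2, 1): 0, (1, 3): 1, (1, 4): 2, (3, 2): 1, (4, 2): 1}

# Invariant 1: the number of delays in each loop is unchanged.
print(loop_delays(edges, [1, 3, 2]), loop_delays(retimed, [1, 3, 2]))  # 2 2
# Invariant 3: adding a constant to every r(v) gives the same retimed graph.
shifted = retime(edges, {v: rv + 5 for v, rv in r.items()})
print(shifted == retimed)  # True
```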


Node Retiming Examples

  • r(2) = 1


DFG Illustration of the Example

  • T = max{(1+2+1)/2, (1+2+1)/3} = 2

    Cr. path delay = max{2, 2, 1+1} = 2 t.u.

  • T = max{(1+2+1)/2, (1+2+1)/3} = 2

    Cr. path delay = 2+1 = 3 t.u.


Retiming for Minimizing Clock Period

  • Note that retiming will NOT alter the iteration bound T.

  • The iteration bound is the theoretical minimum clock period to execute the algorithm.

  • Let edge e connect node u to node v. If the node computing time t(u) + t(v) > T and e carries no delay, then the clock period exceeds T. For such an edge, we require that

    wr(e) ≥ 1

  • To generalize, for any path p from v0 to vk whose total computing time exceeds T, we require

    wr(p) = w(p) + r(vk) – r(v0) ≥ 1

  • In other words, any possible critical path in the DFG that is longer than T must carry at least one delay after retiming.


Retiming Example Revisited

  • Constraints on the retimed edge weights:

    wr(e21) ≥ 0, since t(2)+t(1) = 2 = T.

    wr(e13) ≥ 1, since t(1)+t(3) = 3 > T.

    wr(e14) ≥ 1, since t(1)+t(4) = 3 > T.

    wr(e32) ≥ 1, since t(3)+t(2) = 3 > T.

    wr(e42) ≥ 1, since t(4)+t(2) = 3 > T.

  • Use eq. wr(euv) = w(euv) + r(v) – r(u):

    w(e21) + r(1) – r(2) = 1 + r(1) – r(2) ≥ 0

    w(e13) + r(3) – r(1) = 1 + r(3) – r(1) ≥ 1

    w(e14) + r(4) – r(1) = 2 + r(4) – r(1) ≥ 1

    w(e32) + r(2) – r(3) = 0 + r(2) – r(3) ≥ 1

    w(e42) + r(2) – r(4) = 0 + r(2) – r(4) ≥ 1


Solution continues

  • Since the retimed graph Gr remains the same if the same constant is added to all node retiming values, we can set r(1) = 0.

  • The inequalities become

    1 – r(2) ≥ 0 or r(2) ≤ 1

    1 + r(3) ≥ 1 or r(3) ≥ 0

    2 + r(4) ≥ 1 or r(4) ≥ –1

    r(2) – r(3) ≥ 1 or r(3) ≤ r(2) – 1

    r(2) – r(4) ≥ 1 or r(2) ≥ r(4) + 1

  • Since 1 ≥ r(2) ≥ r(3) + 1 ≥ 1, one must have r(2) = +1.

  • This implies r(3) ≤ 0. But we also have r(3) ≥ 0. Hence r(3) = 0.

  • These leave –1 ≤ r(4) ≤ 0.

  • Hence the two sets of solutions are:

    r(3) = 0, r(2) = +1, and r(4) = 0 or -1.


Systematic Solutions

  • Given a system of inequalities:

    r(i) – r(j) ≤ k; 1 ≤ i,j ≤ N

  • Construct a constraint graph:

    • Map each r(i) to node i. Add a node N+1.

    • For each inequality r(i) – r(j) ≤ k, draw an edge eji from node j to node i such that w(eji) = k.

    • Draw N edges eN+1,i from node N+1 to each node i, with w(eN+1,i) = 0.

  • The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.

  • If a solution exists, one solution is where r(i) is the minimum path length from node N+1 to node i.

  • Shortest path algorithms:

    • Bellman-Ford algorithm

    • Floyd-Warshall algorithm
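
A minimal Bellman-Ford sketch of this procedure is shown below, applied to the five constraints of the retiming example after rewriting each one in the form r(i) – r(j) ≤ k (for instance, 1 + r(1) – r(2) ≥ 0 becomes r(2) – r(1) ≤ 1). The function name and the tuple encoding of the constraints are assumptions for illustration.

```python
def solve_difference_constraints(n, constraints):
    """constraints: list of (i, j, k) meaning r(i) - r(j) <= k, nodes 1..n.
    Returns {i: r(i)}, or None if the constraint graph has a negative cycle."""
    source = n + 1
    edges = [(j, i, k) for i, j, k in constraints]        # edge j -> i, weight k
    edges += [(source, i, 0) for i in range(1, n + 1)]    # zero-weight edges
    dist = {v: float("inf") for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):                  # n+1 nodes need at most n passes
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:               # any further improvement: negative cycle
        if dist[u] + w < dist[v]:
            return None
    return {i: dist[i] for i in range(1, n + 1)}

constraints = [
    (2, 1, 1),    # r(2) - r(1) <= 1   from wr(e21) >= 0
    (1, 3, 0),    # r(1) - r(3) <= 0   from wr(e13) >= 1
    (1, 4, 1),    # r(1) - r(4) <= 1   from wr(e14) >= 1
    (3, 2, -1),   # r(3) - r(2) <= -1  from wr(e32) >= 1
    (4, 2, -1),   # r(4) - r(2) <= -1  from wr(e42) >= 1
]
solution = solve_difference_constraints(4, constraints)
print(solution)  # {1: -1, 2: 0, 3: -1, 4: -1}

# Shifting every value so that r(1) = 0 gives one of the hand-derived solutions.
normalized = {i: ri - solution[1] for i, ri in solution.items()}
print(normalized)  # {1: 0, 2: 1, 3: 0, 4: 0}
```

Because adding a constant to every r(i) leaves the retimed graph unchanged, the shortest-path answer and the hand-derived one describe the same retiming.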




Definitions

  • Unfolding is the process of unfolding a loop so that several iterations are unrolled into the same iteration.

  • Also known as

    • Loop unrolling (in compilers for parallel programs)

    • Block processing

  • Applications

    • Reducing the sampling period to achieve the iteration bound (desired throughput rate) T.

    • Parallel (block) processing to execute several iterations concurrently.

    • Digit-serial or bit-serial processing


An example

  • Before unfolding:

    For n = 0 to N-1,

    y(n)=a*y(n-9)+x(n)

    end

  • Unfolding once (J = 2)

    For k = 0 to N/2-1,

    y(2k)=a*y(2k-9)+x(2k)

    y(2k+1)=a*y(2k-8)+x(2k+1)

    end

  • Unfolding twice (J = 3)

    For k = 0 to N/3-1,

    y(3k)=a*y(3k-9)+x(3k)

    y(3k+1)=a*y(3k-8)+x(3k+1)

    y(3k+2)=a*y(3k-7)+x(3k+2)

    end

  • Block processing formulation

    • J = 3, 9/J = 3 (an integer)

      X(k) = [x(3k) x(3k+1) x(3k+2)]T

      Y(k) = [y(3k) y(3k+1) y(3k+2)]T

      Y(k) = a*Y(k-3) + X(k)

    • J = 2, 9/J = ? (not an integer)

      X(k) = [x(2k) x(2k+1)]T

      Y(k) = [y(2k) y(2k+1)]T

      Y(k) = a*Y(k- ? ) + X(k)


Unfolding the DFG

  • Rewrite the algorithm formulation:

    y(2k)=a*y(2k-9)+x(2k)

    y(2k+1)=a*y(2k-8)+x(2k+1)

    as

    y(2k)=a*y(2(k-5)+1)+x(2k)

    y(2k+1)=a*y(2(k-4))+x(2k+1)

  • After J-fold unfolding, the clock period T = J Ts, where Ts is the data sampling period.

[Figure: the original DFG (T = Ts) and the unfolded DFG (T = J Ts).]
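
The rewritten pair of equations can be checked against the original loop directly. The sketch below assumes zero initial conditions (any y with a negative index is 0) and an even number of samples; the function names are illustrative.

```python
def original_loop(a, x):
    """y(n) = a*y(n-9) + x(n), one sample per iteration."""
    y = {}
    for n in range(len(x)):
        y[n] = a * y.get(n - 9, 0) + x[n]
    return [y[n] for n in range(len(x))]

def unfolded_j2(a, x):
    """Two samples per iteration, using the rewritten indices."""
    assert len(x) % 2 == 0, "sketch assumes an even number of samples"
    y = {}
    for k in range(len(x) // 2):
        y[2 * k] = a * y.get(2 * (k - 5) + 1, 0) + x[2 * k]
        y[2 * k + 1] = a * y.get(2 * (k - 4), 0) + x[2 * k + 1]
    return [y[n] for n in range(len(x))]

data = list(range(1, 21))
print(original_loop(3, data) == unfolded_j2(3, data))  # True
```

The even output depends on an odd output from 5 iterations back, and the odd output depends on an even output from 4 iterations back, so the 9 original delays are split as 5 + 4.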


General DFG Unfolding Method

  • Step 1. For each node U in the original DFG, draw J nodes {Ui; 0 ≤ i ≤ J-1} in the unfolded DFG.

  • Step 2. For each edge from U to V with w delays, draw J edges from Ui to V(i+w)%J with ⌊(i+w)/J⌋ delays, where % denotes the remainder and ⌊·⌋ the floor.
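
The two steps translate almost line for line into code. The sketch below represents a DFG as a list of (U, V, w) edges, which is an assumed encoding for illustration, and names node copy Ui as the tuple (U, i).

```python
def unfold(edges, J):
    """edges: list of (U, V, w). Returns a list of ((U, i), (V, j), delays)."""
    unfolded = []
    for U, V, w in edges:
        for i in range(J):
            # Edge from U_i to V_((i+w) % J) with floor((i+w)/J) delays.
            unfolded.append(((U, i), (V, (i + w) % J), (i + w) // J))
    return unfolded

# Self-loop with 9 delays, as in y(n) = a*y(n-9) + x(n), unfolded with J = 2.
result = unfold([("A", "A", 9)], 2)
print(result)
# [(('A', 0), ('A', 1), 4), (('A', 1), ('A', 0), 5)]
print(sum(d for _, _, d in result))  # 9
```

The output matches the rewritten equations for J = 2: the even copy feeds the odd copy through 4 delays, the odd copy feeds the even copy through 5 delays, and the total of 9 delays is preserved.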


Another DFG Unfolding Example

  • J = 2

  • Step 1. Duplicate J copies of each node.

  • Step 2. Add all edges with 0 delay on them.

  • Step 3. Use the table on the left to figure out edges with delays.

[Figure: original DFG with nodes Q, R, S, T and edge delays 2D and 3D (T = 3), and the unfolded DFG with nodes Q0, Q1, R0, R1, S0, S1, T0, T1 and edge delays D and 2D (T = 6).]


Properties of Unfolding

  • Unfolding preserves the number of registers (delays) in a DFG.

  • A loop with w delays in a DFG that is unfolded J times leads to g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing

    • w/(g.c.d.(w, J)) delays and

    • J/(g.c.d.(w, J)) copies of each node that appears in the original loop.

  • Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.

  • A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.

  • Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore, it cannot create a critical path in the J-unfolded DFG.

  • Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG followed by J-unfolding.
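
The g.c.d. property can be checked by brute force for a single-node self-loop, which is the simplest loop with w delays. This is a sketch for illustration only; the function name `unfolded_loops` is an assumption, and each loop is reported as (number of node copies, total delays).

```python
from math import gcd

def unfolded_loops(w, J):
    """Unfold a self-loop with w delays by J and list the resulting loops
    as (number of node copies, total delays)."""
    # Copy i connects to copy (i+w) % J through floor((i+w)/J) delays.
    succ = {i: ((i + w) % J, (i + w) // J) for i in range(J)}
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        node, copies, delays = start, 0, 0
        while node not in seen:              # walk until the loop closes
            seen.add(node)
            nxt, d = succ[node]
            copies, delays, node = copies + 1, delays + d, nxt
        loops.append((copies, delays))
    return loops

print(unfolded_loops(6, 4))  # [(2, 3), (2, 3)]
print(unfolded_loops(9, 2))  # [(2, 9)]
print(gcd(6, 4), gcd(9, 2))  # 2 1
```

For w = 6 and J = 4 there are g.c.d.(6, 4) = 2 loops, each with 6/2 = 3 delays and 4/2 = 2 node copies, as the property states.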

