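Basic Ideas

Parallel processing
- Less inter-processor communication
- Complicated processor hardware

Pipelined processing
- More inter-processor communication
- Simpler processor hardware

[Figure: processor-versus-time schedules for P1 to P4. Parallel processing: P1 handles a1, a2, a3, a4; P2 handles b1, b2, b3, b4; P3 handles c1, c2, c3, c4; P4 handles d1, d2, d3, d4. Pipelined processing: P1 handles a1, b1, c1, d1; P2 handles a2, b2, c2, d2; P3 handles a3, b3, c3, d3; P4 handles a4, b4, c4, d4.]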

Colors: different types of operations performed

a, b, c, d: different data streams processed

(C) 1997-2006 by Yu Hen Hu

Data Dependence

- Parallel processing requires NO data dependence between processors.
- Pipelined processing will involve inter-processor communication.

[Figure: processor-versus-time diagrams for P1 to P4 under parallel and pipelined processing.]
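The two schedules can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the slides: the function names are hypothetical, and the one-slot stagger between pipeline stages is an assumption about how the pipelined diagram is laid out.

```python
def parallel_schedule(streams, n_ops):
    """Processor i runs every operation of stream i; no data crosses processors."""
    return {f"P{i + 1}": [f"{s}{op}" for op in range(1, n_ops + 1)]
            for i, s in enumerate(streams)}


def pipelined_schedule(streams, n_ops):
    """Processor j runs operation j of every stream; results pass from Pj to Pj+1.

    Each entry is (time_slot, task). Assumption: stage j starts j - 1 slots
    after stage 1, because it must wait for the previous stage's result.
    """
    return {f"P{j}": [(t + j - 1, f"{s}{j}") for t, s in enumerate(streams)]
            for j in range(1, n_ops + 1)}


streams = ["a", "b", "c", "d"]
par = parallel_schedule(streams, 4)
pipe = pipelined_schedule(streams, 4)
print(par["P1"])   # ['a1', 'a2', 'a3', 'a4']
print(pipe["P2"])  # [(1, 'a2'), (2, 'b2'), (3, 'c2'), (4, 'd2')]
```

In the parallel schedule each stream stays on one processor, so nothing is exchanged. In the pipelined schedule every stream visits all four processors, which is where the inter-processor communication comes from.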

Usage of Pipelined Processing

By inserting latches or registers between combinational logic circuits, the critical path can be shortened.

Consequence:
- reduced clock cycle time,
- increased clock frequency.

Suitable for DSP applications that have (infinitely) long data streams.

Method to incorporate pipelining: cut-set retiming

Cut set:
A cut set is a set of edges of a graph such that, if these edges are removed from the original graph, the remaining graph splits into two separate graphs.

Retiming:
The timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.
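The cut-set definition can be checked mechanically. The sketch below is a minimal illustration, assuming a small hypothetical graph (node names `x`, `m0`, `m1`, `a1`, `y` are made up for the example). It removes the candidate edges and counts the connected pieces that remain.

```python
from collections import defaultdict


def is_cut_set(nodes, edges, cut):
    """Return True if removing `cut` leaves the graph in exactly two pieces.

    Edges are (src, dst) pairs; connectivity ignores edge direction.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if (u, v) not in cut:
            adj[u].add(v)
            adj[v].add(u)

    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return components == 2


# Hypothetical graph: input x feeds two multipliers, which feed one adder.
nodes = ["x", "m0", "m1", "a1", "y"]
edges = [("x", "m0"), ("x", "m1"), ("m0", "a1"), ("m1", "a1"), ("a1", "y")]

print(is_cut_set(nodes, edges, {("m0", "a1"), ("m1", "a1")}))  # True
print(is_cut_set(nodes, edges, {("m0", "a1")}))                # False
```

Cutting both multiplier outputs separates the graph into two parts, so a register can be placed on each cut edge. Cutting only one of them leaves a path through the other multiplier, so it is not a cut set.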

Graph Transposition Theorem

The transfer function of a signal flow graph remains unchanged if
- the direction of each arc is reversed, and
- the input and output labels are switched.

[Figure: two 3-tap signal flow graphs, each with two z^-1 delays and coefficients h[0], h[1], h[2], with signal labels x[n], u[n], and y[n]; the slide asks whether the two graphs are equivalent.]
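The theorem can be checked numerically for a 3-tap FIR filter. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the initial state is assumed to be zero, and the coefficients and input are arbitrary test values.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, then multiply and sum."""
    delay = [0.0] * (len(h) - 1)          # holds x[n-1], x[n-2], ...
    y = []
    for sample in x:
        taps = [sample] + delay
        y.append(sum(c * t for c, t in zip(h, taps)))
        delay = taps[:-1]                 # shift the delay line by one sample
    return y


def fir_transposed(x, h):
    """Transposed form: input broadcast to all multipliers, delays between adders."""
    state = [0.0] * (len(h) - 1)          # registers between the adders
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (state[0] if state else 0.0))
        # each register stores the next product plus the register behind it
        state = [products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0.0)
                 for i in range(len(state))]
    return y


h = [1, 2, 3]          # arbitrary test coefficients h[0], h[1], h[2]
x = [1, 2, 3, 4, 5]    # arbitrary test input
print(fir_direct(x, h) == fir_transposed(x, h))  # True
```

Both forms use the same number of delays and multipliers. Only the position of the delays changes, from the input line to the adder chain.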

Data broadcast structure

Algorithm transformation may lead to a pipelined structure without adding additional delays. (TM: multiplier delay, TA: adder delay.)

Given a FIR filter SFG:
- Critical path: TM + 2TA

Use the graph transposition theorem:
- Reverse all arcs
- Reverse input/output

We obtain:
- Critical path: TM + TA
- No additional delay added!
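The two critical-path figures can be reproduced by treating the critical path as the longest chain of operations not broken by a delay. The sketch below is an illustration only: the graph encoding, the node names, and the numeric values of TM and TA are assumptions made for the example.

```python
def critical_path(node_time, edges):
    """Longest computation time along any path made only of zero-delay edges.

    node_time: {node: computation time}
    edges: (src, dst, n_delays) triples; an edge with n_delays > 0 breaks the path.
    Assumes the zero-delay subgraph has no cycles.
    """
    succ = {n: [] for n in node_time}
    for u, v, d in edges:
        if d == 0:
            succ[u].append(v)

    memo = {}

    def longest_from(n):
        if n not in memo:
            memo[n] = node_time[n] + max((longest_from(v) for v in succ[n]), default=0)
        return memo[n]

    return max(longest_from(n) for n in node_time)


TM, TA = 10, 4   # hypothetical multiplier and adder times

times = {"m0": TM, "m1": TM, "m2": TM, "a1": TA, "a2": TA}

# Direct form: delays sit on the input line, so the adder chain is combinational.
direct = [("m0", "a1", 0), ("m1", "a1", 0), ("a1", "a2", 0), ("m2", "a2", 0)]

# Data broadcast (transposed) form: a delay sits in front of each adder in the chain.
broadcast = [("m2", "a1", 1), ("m1", "a1", 0), ("a1", "a2", 1), ("m0", "a2", 0)]

print(critical_path(times, direct))     # 18, i.e. TM + 2*TA
print(critical_path(times, broadcast))  # 14, i.e. TM + TA
```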

Fine-grain pipelining

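To shorten the critical path further, split each multiplier into two pipelined stages with delays TM1 and TM2, so that TM no longer lies on a single combinational path.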

Critical Path = Max {TM1, TM2, TA}


Block Processing

One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).

- Block vector: [x(3k) x(3k+1) x(3k+2)]
- Clock cycle: can be 3 times longer

The derivation proceeds in four steps: start from the original FIR filter equation, rewrite 3 equations at a time, define the block vector, and obtain the block formulation.
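A block formulation with block size 3 can be sketched as follows. The filter equation is an assumption, since the slide's equations are not available here: a 3-tap FIR filter y(n) = h0 x(n) + h1 x(n-1) + h2 x(n-2) with zero initial conditions. The function names are hypothetical.

```python
def fir_sequential(x, h):
    """One output per sample period: y(n) = h0 x(n) + h1 x(n-1) + h2 x(n-2)."""
    h0, h1, h2 = h
    xp = [0, 0] + list(x)                 # zero history for n < 0
    return [h0 * xp[n + 2] + h1 * xp[n + 1] + h2 * xp[n] for n in range(len(x))]


def fir_block3(x, h):
    """Three outputs per clock period k, from block [x(3k), x(3k+1), x(3k+2)]."""
    h0, h1, h2 = h
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size")
    y = []
    xm1 = xm2 = 0                         # x(3k-1) and x(3k-2) from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k: 3 * k + 3]
        y += [h0 * x0 + h1 * xm1 + h2 * xm2,   # y(3k)
              h0 * x1 + h1 * x0 + h2 * xm1,    # y(3k+1)
              h0 * x2 + h1 * x1 + h2 * x0]     # y(3k+2)
        xm1, xm2 = x2, x1
    return y


h = [1, 2, 3]               # arbitrary test coefficients
x = [1, 2, 3, 4, 5, 6]      # arbitrary test input, two blocks of three
print(fir_sequential(x, h) == fir_block3(x, h))  # True
```

The three expressions inside the loop do not depend on each other, so they can run on three units in parallel. That is why the clock cycle can be 3 times longer while the sample rate stays the same.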


General approach for block processing


Block Processing for IIR Digital Filter

The derivation proceeds in four steps: state the original formulation, rewrite it, define the block vectors, and then obtain the block formulation.

Time indices:
- n: sampling period
- k: clock period (processor)
- k = 2n (one clock period spans two sampling periods)

Note:
- Pipelining: clock period = sampling period.
- Block (parallel): clock period not equal to sampling period.
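A block formulation for a recursive filter can be sketched in the same way. The recursion used here is an assumption, since the slide's equations are not available: a first-order filter y(n) = a y(n-1) + x(n) with zero initial conditions. Substituting the recursion into itself once gives y(2k) = a^2 y(2k-2) + a x(2k-1) + x(2k) and y(2k+1) = a^2 y(2k-1) + a x(2k) + x(2k+1), so each output depends only on the output two samples earlier. The function names are hypothetical.

```python
def iir_sequential(x, a):
    """y(n) = a*y(n-1) + x(n), one output per sample period."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y


def iir_block2(x, a):
    """Two outputs per clock period k; each uses only the previous block's state."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be a multiple of the block size")
    y = []
    y_even = y_odd = 0.0        # y(2(k-1)) and y(2(k-1)+1)
    x_prev = 0.0                # x(2k-1)
    for k in range(len(x) // 2):
        x0, x1 = x[2 * k], x[2 * k + 1]
        y_even = a * a * y_even + a * x_prev + x0   # y(2k)
        y_odd = a * a * y_odd + a * x0 + x1         # y(2k+1)
        y += [y_even, y_odd]
        x_prev = x1
    return y


a = 0.5                     # arbitrary test coefficient
x = [1, 2, 3, 4]            # arbitrary test input, two blocks of two
seq, blk = iir_sequential(x, a), iir_block2(x, a)
print(all(abs(p - q) < 1e-12 for p, q in zip(seq, blk)))  # True
```

The two update lines read only values from the previous block, so they can be evaluated side by side within one clock period of twice the sampling period.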

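Block IIR Filter

[Figure: block diagram. x(n) enters a serial-to-parallel (S/P) converter that produces x(2k) and x(2k+1). Two adders produce y(2k) and y(2k+1); each output is fed back through a delay D as y(2(k-1)) and y(2(k-1)+1). A parallel-to-serial (P/S) converter outputs y(n).]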

Timing Comparison

- Pipelining
- Block processing

[Figure: timing diagrams. MAC row: inputs x(1) to x(4), time slots 1 to 4, outputs y(1) to y(4). Pipelining: separate Add and Mul rows over time slots 1 to 8, with inputs x(1) onward, outputs y(1) onward, and the product a y(1) passed between them. Block processing: one row takes x(1), x(3), x(5), x(7) and the other takes x(2), x(4), x(6), x(8), with each sample occupying two time slots.]

