Problems with the superscalar approach. Limits to conventional exploitation of ILP: 1) Pipelined clock rate: Increasing the clock rate requires deeper pipelines with longer pipeline latency, which increases the CPI (longer branch penalty, other hazards).
2) Instruction Issue Rate: Limited instruction level parallelism (ILP) reduces actual instruction issue/completion rate. (vertical & horizontal waste)
3) Cache hit rate: Data-intensive scientific programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality (poor memory latency hiding).
4) Data Parallelism: Poor exploitation of data parallelism present in many scientific and multimedia applications, where similar independent computations are performed on large arrays of data (Limited ISA, hardware support).
Papers: VEC1, VEC2, VEC3
AMD Athlon T-Bird, 1 GHz
L1: 64K inst, 64K data (3-cycle latency), both 2-way
L2: 256K, 16-way, 64-bit, latency: 7 cycles
L1, L2 on-chip
Intel P4, 1.5 GHz
L1: 8K inst, 8K data (2-cycle latency), both 4-way; 96KB execution trace cache
L2: 256K, 8-way, 256-bit, latency: 7 cycles
L1, L2 on-chip
Intel PIII, 1 GHz
L1: 16K inst, 16K data (3-cycle latency), both 4-way
L2: 256K, 8-way, 256-bit, latency: 7 cycles
L1, L2 on-chip
Data working set larger than L2: impact of long memory latency for large data working sets.
Source: http://www1.anandtech.com/showdoc.html?i=1360&p=15
From 551
From 756 Lecture 1
Alternative Model: Vector Processing
[Figure: Scalar ISA (RISC or CISC) vs. Vector ISA. Scalar: Add.d F3, F1, F2 performs 1 operation on registers r1 and r2, result in r3. Vector: addv.d v3, v1, v2 performs N operations on vectors v1 and v2, result in v3, with vector length up to Maximum Vector Length (MVL).]
VEC1
Typical MVL = 64 (Cray)
Applications with a high degree of data parallelism (loop-level parallelism) are thus suitable for vector processing. Not limited to scientific computing.
for (i=1; i<=1000; i=i+1)
x[i] = x[i] + y[i];
Vector
Code
Scalar
Code
Load_vector V1, Rx
Load_vector V2, Ry
Add_vector V3, V1, V2
Store_vector V3, Rx
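The four vector instructions above can be sketched element by element in plain C. This is a behavioral sketch only (the names `vector_add` and `MVL` here are illustrative, with MVL = 64 assumed as in the slides); real hardware performs each inner loop as one pipelined vector instruction.

```c
#include <assert.h>

#define MVL 64  /* assumed maximum vector length, matching the slides */

/* Element-wise semantics of: Load_vector V1,Rx; Load_vector V2,Ry;
   Add_vector V3,V1,V2; Store_vector V3,Rx -- for one vector of vl <= MVL elements. */
void vector_add(double *x, const double *y, int vl) {
    double v1[MVL], v2[MVL], v3[MVL];
    for (int i = 0; i < vl; i++) v1[i] = x[i];          /* Load_vector V1, Rx */
    for (int i = 0; i < vl; i++) v2[i] = y[i];          /* Load_vector V2, Ry */
    for (int i = 0; i < vl; i++) v3[i] = v1[i] + v2[i]; /* Add_vector V3, V1, V2 */
    for (int i = 0; i < vl; i++) x[i] = v3[i];          /* Store_vector V3, Rx */
}
```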
From 551
e.g. in for (i=1; i<=1000; i++)
x[i] = x[i] + s;
the computation in each iteration is independent of the previous iterations, and the loop is thus parallel. The use of x[i] twice is within a single iteration.
From 551
for (i=1; i<=100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
From 551
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
A[1] = A[1] + B[1]; /* loop startup code (scalar) */
for (i=1; i<=99; i=i+1) { /* vectorizable code */
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100]; /* loop completion code (scalar) */
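The transformation above can be checked mechanically: both versions perform the same arithmetic on the same values, so they must produce identical arrays. A sketch (the function names are illustrative; arrays are sized 102 so indices 1..101 are valid, as in the slide):

```c
#include <assert.h>
#include <string.h>

/* Original loop: S2 of iteration i produces B[i+1], consumed by S1 of iteration i+1. */
void original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= 100; i++) {
        A[i] = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed version: startup code + parallel (vectorizable) loop + completion code. */
void transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];                /* loop startup code */
    for (int i = 1; i <= 99; i++) {    /* no loop-carried dependence now */
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];          /* loop completion code */
}
```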
LLP Analysis Example
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
Original Loop: iterations 1, 2, ..., 99, 100 each execute S1 then S2; S2 of iteration i produces B[i+1], which S1 of iteration i+1 consumes (a loop-carried dependence).
Modified Parallel Loop: loop startup code A[1] = A[1] + B[1]; then iterations 1, 2, ..., 98, 99 execute B[i+1] = C[i] + D[i] followed by A[i+1] = A[i+1] + B[i+1] (the dependence is now within a single iteration, not loop-carried); loop completion code B[101] = C[100] + D[100].
Vector memory systems must supply the high bandwidth needed and hide memory latency:
=> Amortize memory latency over many vector elements
=> No (data) caches usually used (an instruction cache is used)
=> Reduces branches and branch problems in pipelines
(In memory-to-memory vector architectures, all vector operations are memory to memory.)
Multi-banked memory for bandwidth and latency hiding
Pipelined vector functional units
Vector load-store units (LSUs)
Vector registers (MVL elements each)
Vector control registers:
VLR Vector Length Register
VM Vector Mask Register
VEC1
VEC1
1) Hide vector startup time
2) Lower instruction bandwidth
3) Tiled access to memory reduces scalar processor memory bandwidth needs
4) If the known maximum vector length of the application is < MVL, no strip mining (vector loop) overhead is needed
5) Better spatial locality for memory access
1) Diminishing returns on overhead savings as the number of elements keeps doubling
2) The natural application vector length needs to match the physical vector register length, or there is no help
VEC1
The machine shown in (a) has a single add pipeline and can complete one addition per cycle. The machine shown in (b) has four add pipelines and can complete four additions per cycle.
One Lane
Four Lanes
MVL lanes? Data parallel system, SIMD array?
VEC1
Vector FP
Vector
Memory
Vector Index
Vector Mask
Vector Length
VEC1
LV (Load Vector), SV (Store Vector):
LV V1, R1 Load vector register V1 from memory starting at address R1
SV R1, V1 Store vector register V1 into memory starting at address R1.
LVWS (Load Vector With Stride), SVWS (Store Vector With Stride):
LVWS V1,(R1,R2) Load V1 from address at R1 with stride in R2, i.e., R1+i × R2.
SVWS (R1,R2),V1 Store V1 from address at R1 with stride in R2, i.e., R1+i × R2.
LVI (Load Vector Indexed or Gather), SVI (Store Vector Indexed or Scatter):
LVI V1,(R1+V2) Load V1 with vector whose elements are at R1+V2(i), i.e., V2 is an index.
SVI (R1+V2),V1 Store V1 to vector whose elements are at R1+V2(i), i.e., V2 is an index.
(where i is the element index)
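The strided and indexed addressing modes can be sketched in C, using element-unit strides for clarity (hardware strides such as R2 are byte addresses; the function names here are illustrative):

```c
#include <assert.h>

/* LVWS semantics: element i of the vector comes from base + i * stride
   (stride counted in elements here, rather than bytes as in R2). */
void load_with_stride(const double *base, long stride, double *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = base[i * stride];
}

/* LVI (gather) semantics: element i comes from base + index[i]. */
void load_indexed(const double *base, const long *index, double *v, int n) {
    for (int i = 0; i < n; i++)
        v[i] = base[index[i]];
}
```

A stride equal to the row length fetches a matrix column from row-major storage, which is exactly the case LVWS exists for.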
VEC1
Assuming vectors X, Y are length 64 =MVL
Scalar vs. Vector
L.D F0,a ;load scalar a
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar mult.
LV V3,Ry ;load vector Y
ADDV.D V4,V2,V3 ;add
SV Ry,V4 ;store the result
L.D F0,a
DADDIU R4,Rx,#512 ;last address to load
loop: L.D F2, 0(Rx) ;load X(i)
MUL.D F2,F0,F2 ;a*X(i)
L.D F4, 0(Ry) ;load Y(i)
ADD.D F4,F2, F4 ;a*X(i) + Y(i)
S.D F4,0(Ry) ;store into Y(i)
DADDIU Rx,Rx,#8 ;increment index to X
DADDIU Ry,Ry,#8 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,loop ;check if done
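The scalar loop above implements the DAXPY kernel Y = a*X + Y over 64 8-byte elements (hence the #512 last-address bound). A minimal C equivalent of what both code versions compute:

```c
#include <assert.h>

/* C equivalent of the scalar MIPS loop: Y(i) = a*X(i) + Y(i) (DAXPY). */
void daxpy(double a, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* load X(i), multiply by a, add Y(i), store */
}
```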
VLR = 64
VM = (1,1,1,1 ..1)
578 (2+9*64) vs. 321 (1+5*64) ops (1.8X)
578 (2+9*64) vs. 6 instructions (96X)
64-operation vectors + no loop overhead; also 64X fewer pipeline hazards
1: LV V1,Rx ;load vector X
2: MULV V2,F0,V1 ;vector-scalar mult.
LV V3,Ry ;load vector Y
3: ADDV V4,V2,V3 ;add
4: SV Ry,V4 ;store the result
Vector Execution Time (assuming one lane is used):
4 convoys, 1 lane, VL = 64
=> 4 x 64 = 256 cycles
(or 4 cycles per result vector element)
VEC1
Vector FU Startup Time
Assume convoys don't overlap; vector length = n:

Convoy        Start    1st result    Last result
1. LV         0        12            11+n  (12+n-1)
2. MULV, LV   12+n     12+n+12       23+2n  (load startup)
3. ADDV       24+2n    24+2n+6       29+3n  (waits for convoy 2)
4. SV         30+3n    30+3n+12      41+4n  (waits for convoy 3)
Startup cycles
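The table's arithmetic can be sketched as a short recurrence. This is a sketch only; the per-convoy startup values (12 for the load/store convoys, 6 for the add) are read off the table, and `last_result_cycle` is an illustrative name:

```c
#include <assert.h>

/* Cycle of the last result of the 4th convoy, assuming convoys don't overlap.
   Startups: 12 (LV), 12 (MULV+LV convoy, load dominates), 6 (ADDV), 12 (SV). */
long last_result_cycle(int n) {
    const int startup[4] = {12, 12, 6, 12};
    long start = 0, last = 0;
    for (int c = 0; c < 4; c++) {
        long first = start + startup[c];   /* 1st result of this convoy */
        last = first + n - 1;              /* last of the n results */
        start = last + 1;                  /* next convoy waits for this one */
    }
    return last;
}
```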
VEC1
1) Support multiple loads/stores per cycle => multiple banks & the ability to address banks independently
2) Support non-sequential accesses (non-unit stride)
[Figure: bank timing diagram. With access latency l and m banks, words 0, 1, 2, ..., m-1 are delivered on clocks l, l+1, l+2, ..., l+m-1, and word m on clock 2l.]
VEC1
do 10 i = 1, n
10 Y(i) = a * X(i) + Y(i)
Vector length = n
low = 1
VL = (n mod MVL)          /*find the odd size piece*/
do 1 j = 0,(n / MVL)      /*outer loop*/
do 10 i = low,low+VL-1    /*runs for length VL*/
Y(i) = a*X(i) + Y(i)      /*main operation*/
10 continue
low = low+VL              /*start of next vector*/
VL = MVL                  /*reset the length to max*/
1 continue
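The same strip-mining structure in C (a sketch with illustrative names; 0-based indexing, first strip = the odd-size piece of n mod MVL elements, every later strip exactly MVL elements):

```c
#include <assert.h>

#define MVL 64   /* maximum vector length, as in the slides */

/* Strip-mined DAXPY: Y(i) = a*X(i) + Y(i) for arbitrary n. */
void daxpy_stripmined(double a, const double *x, double *y, int n) {
    int low = 0;
    int vl = n % MVL;                       /* odd-size piece first */
    for (int j = 0; j <= n / MVL; j++) {    /* outer (strip) loop */
        for (int i = low; i < low + vl; i++)
            y[i] = a * x[i] + y[i];         /* would be one vector sequence */
        low += vl;
        vl = MVL;                           /* full-length strips from now on */
    }
}
```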
Time for loop: ⌈n/MVL⌉ vector loop iterations are needed.
First iteration (shorter vector): set VL = n MOD MVL, processing the odd-size piece of n MOD MVL elements (0 < size < MVL; for MVL = 64, VL = 1-63).
Second iteration onwards: set VL = MVL, processing MVL elements per iteration (e.g. VL = MVL = 64).
Answer
VLR = n MOD 64 = 200 MOD 64 = 8 for the first iteration only.
VLR = MVL = 64 for the second iteration onwards.
MTC1 VLR,R1 Move contents of R1 to the vector-length register.
4 vector loop iterations are needed.
Tloop = loop overhead = 15 cycles
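The strip-mining bookkeeping above reduces to two small formulas; a sketch with illustrative names (note that when n is an exact multiple of MVL, the "first strip" is empty):

```c
#include <assert.h>

/* Length of the first (odd-size) strip: n mod MVL. */
int first_vl(int n, int mvl) { return n % mvl; }

/* Number of vector loop iterations actually doing work: ceil(n / MVL). */
int vector_iterations(int n, int mvl) { return (n + mvl - 1) / mvl; }
```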
VEC1
The total execution time per element and the total overhead time per
element versus the vector length for the strip mining example
MVL supported = 64
do 10 i = 1,100
do 10 j = 1,100
A(i,j) = 0.0
do 10 k = 1,100
10 A(i,j) = A(i,j)+B(i,k)*C(k,j)
LVWS V1,(R1,R2) Load V1 from address at R1 with stride in R2, i.e., R1+i × R2.
=> SVWS (store vector with stride) instruction
SVWS (R1,R2),V1 Store V1 from address at R1 with stride in R2, i.e., R1+i × R2.
Answer
MULV.D V1,V2,V3
ADDV.D V4,V1,V5 ; separate convoys?
Vector length + Startup time(ADDV) + Startup time(MULV)
MULV.D V1,V2,V3
ADDV.D V4,V1,V5
both unchained and chained.
m convoys with n elements take: startup + m x n cycles.
Here startup = 7 + 6 = 13 cycles, n = 64.

Unchained (two convoys, m = 2):
= 7 + 64 + 6 + 64
= startup + m x n = 13 + 2 x 64 = 141 cycles

Chained (one convoy, m = 1):
= 7 + 6 + 64
= startup + m x n = 13 + 1 x 64 = 77 cycles

141 / 77 = 1.83 times faster with chaining
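The chaining comparison above is simple enough to encode directly (startup values 7 for MULV and 6 for ADDV are those of the example; the function names are illustrative):

```c
#include <assert.h>

/* MULV.D followed by a dependent ADDV.D over n elements. */
int unchained_cycles(int n) {
    return (7 + n) + (6 + n);   /* two convoys: ADDV waits for the whole MULV result */
}
int chained_cycles(int n) {
    return 7 + 6 + n;           /* one convoy: ADDV consumes MULV results as they emerge */
}
```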
VEC1
do 100 i = 1, 64
if (A(i) .ne. 0) then
A(i) = A(i) – B(i)
endif
100 continue
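The masked execution of this loop can be sketched in C: a compare against zero fills the mask, then the subtraction runs only where the mask is set. A behavioral sketch (`masked_sub` and the byte-per-bit mask array are illustrative; hardware packs VM as bits):

```c
#include <assert.h>

#define MVL 64

/* Mask semantics of: if (A(i) .ne. 0) A(i) = A(i) - B(i), for vl <= MVL elements. */
void masked_sub(double *a, const double *b, int vl) {
    unsigned char vm[MVL];                 /* vector mask register, one flag per element */
    for (int i = 0; i < vl; i++)
        vm[i] = (a[i] != 0.0);             /* compare-NE against scalar 0 sets VM */
    for (int i = 0; i < vl; i++)
        if (vm[i])
            a[i] = a[i] - b[i];            /* SUBV executes only under the mask */
}
```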
Vector operations are performed only on elements whose corresponding bits in the vector-mask register are 1.
Compare the elements (EQ, NE, GT, LT, GE, LE) in V1 and V2. If the condition is true, put a 1 in the corresponding bit vector; otherwise put a 0. Put the resulting bit vector in the vector mask register (VM). The scalar form performs the same compare but uses a scalar value as one operand:
S--V.D V1, V2
S--VS.D V1, F0
(where -- is the condition: EQ, NE, GT, LT, GE, LE)
LV, SV Load/Store vector with stride 1
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))
LVI V1,(R1+V2) Load V1 with vector whose elements are at R1+V2(i), i.e., V2 is an index.
SVI (R1+V2),V1 Store V1 to vector whose elements are at R1+V2(i), i.e., V2 is an index.
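The sparse update A(K(i)) = A(K(i)) + C(M(i)) then decomposes into two gathers, a vector add, and a scatter. A behavioral sketch (`sparse_update` is an illustrative name; like the slide's code, it assumes the index vector K has no duplicate entries within one vector):

```c
#include <assert.h>

#define MVL 64

/* Gather/scatter semantics for vl <= MVL elements. */
void sparse_update(double *a, const double *c, const int *k, const int *m, int vl) {
    double va[MVL], vc[MVL];
    for (int i = 0; i < vl; i++) va[i] = a[k[i]];   /* LVI Va,(Ra+Vk)  gather */
    for (int i = 0; i < vl; i++) vc[i] = c[m[i]];   /* LVI Vc,(Rc+Vm)  gather */
    for (int i = 0; i < vl; i++) va[i] += vc[i];    /* ADDV Va,Va,Vc */
    for (int i = 0; i < vl; i++) a[k[i]] = va[i];   /* SVI (Ra+Vk),Va  scatter */
}
```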
Assuming that Ra, Rc, Rk, and Rm contain the starting addresses of the
vectors in the previous sequence, the inner loop of the sequence
can be coded with vector instructions such as:
(index vector)
(index vector)
LVI V1, (R1+V2) (Gather) Load V1 with vector whose elements are at R1+V2(i),
i.e., V2 is an index.
SVI (R1+V2), V1 (Scatter) Store V1 to vector whose elements are at R1+V2(i),
i.e., V2 is an index.
Index vectors Vk Vm already initialized
CVI V1,R1 Create an index vector by storing the values 0, 1 × R1, 2 × R1,...,63 × R1 into V1.
V2 Index Vector
VM Vector Mask
VEC1
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
sum = 0;
for (t=0; t<k; t++)
{
sum += a[i][t] * b[t][j];
}
c[i][j] = sum;
}
}
C mxn = A mxk X B kxn
Dot product
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
sum = 0;
for (t=0; t<k; t++)
{
sum += a[i][t] * b[t][j];
}
c[i][j] = sum;
}
}
Inner loop t = 1 to k (vector dot product loop)
(for a given i, j it produces one element, C(i, j))
[Figure: C(m, n) = A(m, k) X B(k, n); the inner t loop forms the dot product of row i of A with column j of B, producing the single element C(i, j).]
Second loop j = 1 to n
Outer loop i = 1 to m
For one iteration of outer loop (on i) and second loop (on j)
inner loop (t = 1 to k) produces one element of C, C(i, j)
Inner loop (one element of C, C(i, j) produced)
Vectorize the inner t loop? Instead, vectorize the j loop rather than the innermost t loop:
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++)
{
for (j=0; j<n; j+=32) /* Step j 32 at a time. */
{
sum[0:31] = 0; /* Initialize a vector register to zeros. */
for (t=0; t<k; t++)
{
a_scalar = a[i][t]; /* Get scalar from a matrix. */
b_vector[0:31] = b[t][j:j+31]; /* Get vector from b matrix. */
prod[0:31] = b_vector[0:31]*a_scalar;
/* Do a vector-scalar multiply. */
/* Vector-vector add into results. */
sum[0:31] += prod[0:31];
}
/* Unit-stride store of vector of results. */
c[i][j:j+31] = sum[0:31];
}
}
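The slice notation above is pseudocode; the same strip-mined structure can be written as plain C, with an array standing in for the vector register. A sketch (flat row-major arrays and the name `matmul_vec` are illustrative; as in the slide, n is assumed to be a multiple of 32 = MVL):

```c
#include <assert.h>

#define VLEN 32   /* assumed MVL; n must be a multiple of VLEN */

/* c[m][n] = a[m][k] * b[k][n], j loop vectorized in strips of VLEN. */
void matmul_vec(int m, int n, int k,
                const double *a, const double *b, double *c) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j += VLEN) {
            double sum[VLEN] = {0};                     /* zero the "vector register" */
            for (int t = 0; t < k; t++) {
                double a_scalar = a[i*k + t];           /* scalar from a */
                for (int v = 0; v < VLEN; v++)          /* vector-scalar multiply-add */
                    sum[v] += b[t*n + j + v] * a_scalar;
            }
            for (int v = 0; v < VLEN; v++)
                c[i*n + j + v] = sum[v];                /* unit-stride store */
        }
}
```

Note the inner v loop walks a row of b at unit stride, which is exactly why vectorizing over j is preferred to vectorizing the t dot-product loop.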
Vectorize j
loop
32 = MVL elements done
Inner loop t = 1 to k (vector dot product loop for MVL = 32 elements)
(for a given i, j it produces a 32-element vector, C(i, j : j+31))
[Figure: C(m, n) = A(m, k) X B(k, n); for a given i and j, the t loop accumulates a 32-element (= MVL) vector C(i, j : j+31) from 32-element rows of B, each scaled by a scalar from row i of A.]
Second loop j = 1 to n/32
(vectorized in steps of 32)
Outer loop i = 1 to m
Not vectorized
For one iteration of outer loop (on i) and vectorized second loop (on j)
inner loop (t = 1 to k) produces 32 elements of C, C(i, j : j+31)
Assume MVL =32
and n multiple of 32 (no odd size vector)
Inner loop (32 element vector of C produced)
Faster than scalar mode for a given benchmark or program (larger version of Linpack: 1000x1000).
With one LSU, 3 convoys are needed: Tchime = m = 3.
With 3 LSUs: m = 1 convoy, not 3.
Speedup = 1.7 (going from m = 3 to m = 1), not 3.
VEC1
Vector for Multimedia?
Short vectors: load, add, store of 8 8-bit operands; MVL = 8 for byte elements.
Kernel vector length vs. MVL? + strip mining
(from Pradeep Dubey - IBM, http://www.research.ibm.com/people/p/pradeep/tutor.html)
[Figure: conventional organization vs. IRAM. Left: a logic-fab processor chip (Proc, L1 $, L2$) connected over a bus to separate DRAM chips and I/O. Right: IRAM places the processor, DRAM, and I/O together on one chip.]
Vector Processing & VLSI: Intelligent RAM (IRAM)
Effort towards a full vector processor on a chip:
How to meet vector processing's high memory bandwidth and low latency requirements?
Microprocessor & DRAM on a single chip:
VEC2, VEC3
Cray 1
133 MFLOPS
Peak
Maximum Vector Length (mvl) = # elts per register
[Figure: vector register file vr0, vr1, ..., vr31 (data registers); each register holds one element per virtual processor VP0, VP1, ..., VP(vl-1); element width = vpw.]
To handle variable-width data (8, 16, 32, 64-bit):
VEC2
Scalar: standard scalar instruction set (e.g., ARM, MIPS).
Vector ALU: operations + - x ÷ & | shl shr; forms .vv, .vs, .sv; widths 8/16/32/64; types s.int, u.int, s.fp, d.fp; saturate/overflow; masked or unmasked.
Vector Memory: load, store; widths 8/16/32/64; addressing unit, constant (strided), indexed; types s.int, u.int; masked or unmasked.
Vector Registers: 32 x 32 x 64b (or 32 x 64 x 32b, or 32 x 128 x 16b) + 32 x 128 x 1b flag.
Plus: flag, convert, DSP, and transfer operations.
VIRAM-1 (2000)
256 Mbit generation (0.20 µm)
Die size = 1.5X 256 Mb die
1.5-2.0 V logic, 2-10 watts
100-500 MHz
4 64-bit pipes/lanes
1-4 GFLOPS (64b) / 6-16 GOPS (16b)
30-50 GB/sec memory BW
32 MB capacity + DRAM bus
Several fast serial I/O
Goal for Vector IRAM Generations
2 arithmetic units:
both execute integer operations
one executes FP operations
4 64-bit datapaths (lanes) per unit
2 flag processing units:
for conditional execution and speculation support
1 load-store unit:
optimized for strides 1, 2, 3, and 4
4 addresses/cycle for indexed and strided operations
decoupled indexed and strided stores
VIRAM-1 Microarchitecture: CPU + $, 4 vector pipes/lanes.
Tentative VIRAM-1 Floorplan
[Figure: two memory halves (each 128 Mbits / 16 MBytes, banks x 256b, 128 subbanks) around a ring-based switch, the CPU+$, the 4 vector lanes, and I/O.]
[Figure: vector IRAM organization. A 2-way superscalar processor with 8K I-cache and 8K D-cache feeds a vector instruction queue; vector registers connect to functional units (+, x, ÷) and a load/store unit, each datapath configurable as 8 x 64b, 16 x 32b, or 32 x 16b; a memory crossbar switch links the lanes to many DRAM banks (M) and to I/O.]
VIRAM-2: 0.13 µm, 1 GHz, 16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB
VIRAM-2 Floorplan
[Figure: two memory halves (each 512 Mbits / 64 MBytes), a crossbar switch, CPU, I/O, and 8 vector pipes (+ 1 spare).]
VIRAM Compiler
Frontends: C, C++, Fortran → Vectorizer (PDGCS) → Code Generators: C90, IRAM