Problems with the Superscalar Approach
Limits to conventional exploitation of ILP:
1) pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
2) instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
Alternative Model: Vector Processing
[Figure: scalar add (1 operation) vs. vector add (N operations)]
Scalar (1 operation):  add r3, r1, r2    ; r3 = r1 + r2
Vector (N operations): add.vv v3, v1, v2 ; v3(i) = v1(i) + v2(i), i = 1..vector length
Operation & Instruction Count: RISC vs. Vector Processor
Spec92fp: operations (millions) and instructions (millions)

Program    Ops: RISC  Vector  R/V     Instr: RISC  Vector  R/V
swim256         115     95    1.1x          115     0.8    142x
hydro2d          58     40    1.4x           58     0.8     71x
nasa7            69     41    1.7x           69     2.2     31x
su2cor           51     35    1.4x           51     1.8     29x
tomcatv          15     10    1.4x           15     1.3     11x
wave5            27     25    1.1x           27     7.2      4x
mdljdp2          32     52    0.6x           32    15.8      2x

Vector reduces ops by 1.2X, instructions by 20X
1) Hide vector startup
2) lower instruction bandwidth
3) tiled access to memory reduces scalar processor memory bandwidth needs
4) if the application's max vector length is known to be < the max hardware vector length, no strip mining overhead
5) Better spatial locality for memory access
1) diminishing returns on overhead savings as the number of elements keeps doubling
2) need natural app. vector length to match physical register length, or no help
1) Reduces vector register “spills” (save/restore)
2) aggressive scheduling of vector instructions: better compiling to take advantage of ILP
Fewer bits in instruction format (usually 3 fields)
[Table: vector instructions, with columns Instr. / Operands / Operation / Comment]
Assuming vectors X, Y are length 64
Scalar vs. Vector
LD F0,a ;load scalar a
LV V1,Rx ;load vector X
MULTS V2,F0,V1 ;vector-scalar mult.
LV V3,Ry ;load vector Y
ADDV V4,V2,V3 ;add
SV Ry,V4 ;store the result
LD F0,a
ADDI R4,Rx,#512 ;last address to load
loop: LD F2, 0(Rx) ;load X(i)
MULTD F2,F0,F2 ;a*X(i)
LD F4, 0(Ry) ;load Y(i)
ADDD F4,F2, F4 ;a*X(i) + Y(i)
SD F4,0(Ry) ;store into Y(i)
ADDI Rx,Rx,#8 ;increment index to X
ADDI Ry,Ry,#8 ;increment index to Y
SUB R20,R4,Rx ;compute bound
BNZ R20,loop ;check if done
578 (2+9*64) vs. 321 (1+5*64) ops (1.8X)
578 (2+9*64) vs. 6 instructions (96X)
64 operation vectors + no loop overhead
also 64X fewer pipeline hazards
[Table: vector machine comparison, with columns Machine / Year / Clock / Linpack 100x100 / 1k x 1k / Peak (Procs)]
Vector Architectural State
[Figure: virtual processors VP0 .. VP($vlr-1); 32 general-purpose (data) registers vr0 .. vr31, $vdw bits each; 32 control registers vcr0 .. vcr31, 32 bits each; 32 flag registers vf0 .. vf31, 1 bit each]
1: LV V1,Rx ;load vector X
2: MULV V2,F0,V1 ;vector-scalar mult.
LV V3,Ry ;load vector Y
3: ADDV V4,V2,V3 ;add
4: SV Ry,V4 ;store the result
Vector Execution Time
4 convoys, 1 lane, VL=64
=> 4 x 64 = 256 clocks
(or 4 clocks per result)
Assume convoys don't overlap; vector length = n:
Convoy        Start    1st result    Last result
1. LV          0       12            11+n  (12+n-1)
2. MULV, LV   12+n     12+n+12       23+2n   (load startup)
3. ADDV       24+2n    24+2n+6       29+3n   (wait for convoy 2)
4. SV         30+3n    30+3n+12      41+4n   (wait for convoy 3)
1) support multiple loads/stores per cycle => multiple banks & address banks independently
2) support non-sequential accesses (see soon)
clock:  0 … l   l+1  l+2 … l+m-1   l+m … 2l
word:   - … 0    1    2  …  m-1     -  …  m
do 10 i = 1, n
10 Y(i) = a * X(i) + Y(i)
low = 1
VL = (n mod MVL)        /*find the odd size piece*/
do 1 j = 0,(n / MVL)    /*outer loop*/
do 10 i = low,low+VL-1  /*runs for length VL*/
Y(i) = a*X(i) + Y(i)    /*main operation*/
10 continue
low = low+VL            /*start of next vector*/
VL = MVL                /*reset the length to max*/
1 continue
faster than scalar mode
do 10 i = 1,100
do 10 j = 1,100
A(i,j) = 0.0
do 10 k = 1,100
10 A(i,j) = A(i,j)+B(i,k)*C(k,j)
MULV V1,V2,V3
ADDV V4,V1,V5 ; separate convoy?
[Figure: chained vector unit: vector multiply pipeline, vector adder pipeline, and vector memory pipeline, plus a scalar unit; 8 lanes, vector length 32, chaining]
do 100 i = 1, 64
if (A(i) .ne. 0) then
A(i) = A(i) - B(i)
endif
100 continue
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))
              IBM RS/6000   Cray Y-MP
Clock          72 MHz       167 MHz
Cache         256 KB        0.25 KB
Linpack       140 MFLOPS    160 MFLOPS (1.1x)
Sparse Matrix  17 MFLOPS    125 MFLOPS (7.3x)
(Cholesky Blocked)
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++)
{
  for (j=0; j<n; j++)
  {
    sum = 0;
    for (t=0; t<k; t++)
    {
      sum += a[i][t] * b[t][j];
    }
    c[i][j] = sum;
  }
}
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=0; i<m; i++)
{
  for (j=0; j<n; j+=32) /* Step j 32 at a time. */
  {
    sum[0:31] = 0; /* Initialize a vector register to zeros. */
    for (t=0; t<k; t++)
    {
      a_scalar = a[i][t]; /* Get scalar from a matrix. */
      b_vector[0:31] = b[t][j:j+31]; /* Get vector from b matrix. */
      prod[0:31] = b_vector[0:31]*a_scalar;
      /* Do a vector-scalar multiply. */
      /* Vector-vector add into results. */
      sum[0:31] += prod[0:31];
    }
    /* Unit-stride store of vector of results. */
    c[i][j:j+31] = sum[0:31];
  }
}
Not limited to scientific computing
[Table: kernel vs. vector length]
(from Pradeep Dubey  IBM,
http://www.research.ibm.com/people/p/pradeep/tutor.html)
Intelligent RAM (IRAM)
Microprocessor & DRAM on a single chip:
[Figure: conventional system (Proc, $, L2$, bus, I/O, with separate DRAM chips from a DRAM fab) vs. IRAM (processor and DRAM integrated on one chip in a logic fab)]
Maximum Vector Length (mvl) = # elts per register
[Figure: vector register file: virtual processors VP0 .. VP(vl-1); data registers vr0 .. vr31, vpw bits per element]
To handle variable-width data (8, 16, 32, 64-bit):
Standard scalar instruction set (e.g., ARM, MIPS)
Scalar ops: +, -, x, ÷, &, |, shl, shr
Vector ALU ops: .vv, .vs, .sv forms; 8, 16, 32, 64-bit; s.int, u.int, s.fp, d.fp; saturate, overflow; masked or unmasked
Vector memory ops: load, store; unit, constant, indexed addressing; 8, 16, 32, 64-bit; s.int, u.int; masked or unmasked
Vector registers: 32 x 32 x 64b (or 32 x 64 x 32b or 32 x 128 x 16b) + 32 x 128 x 1b flag
Plus: flag, convert, DSP, and transfer operations
Goal for Vector IRAM Generations
256 Mbit generation (0.20 µm)
Die size = 1.5X a 256 Mb die
1.5-2.0 V logic, 2-10 watts
100-500 MHz
4 64-bit pipes/lanes
1-4 GFLOPS (64b) / 6-16 GOPS (16b)
30-50 GB/sec memory BW
32 MB capacity + DRAM bus
Several fast serial I/O
2 arithmetic units:
both execute integer operations
one executes FP operations
4 64bit datapaths (lanes) per unit
2 flag processing units
for conditional execution and speculation support
1 load-store unit
optimized for strides 1,2,3, and 4
4 addresses/cycle for indexed and strided operations
decoupled indexed and strided stores
VIRAM-1 Microarchitecture
[Figure: processor unit plus cache (PU+$) with 4 vector pipes/lanes]
Tentative VIRAM-1 Floorplan
[Figure: CPU+$ and 4 vector pipes/lanes between two memory blocks (128 Mbits / 16 MBytes each; banks x 256b, 128 subbanks), connected by a ring-based switch, plus I/O]
[Figure: VIRAM organization: 2-way superscalar processor with 8K I cache and 8K D cache; vector instruction queue; vector registers; arithmetic pipelines (+, x, ÷), each splittable as 8 x 64b, 16 x 32b, or 32 x 16b; load/store unit; I/O; memory crossbar switch with 8 x 64b ports to the DRAM banks (M)]
VIRAM-2: 0.13 µm, Fast Logic, 1 GHz
16 GFLOPS (64b) / 64 GOPS (16b) / 128 MB
VIRAM-2 Floorplan
[Figure: crossbar switch, CPU, I/O, 8 vector pipes (+ 1 spare), two memory blocks (512 Mbits / 64 MBytes each)]
IRAM Compiler Status
[Figure: compiler structure: frontends (C, C++, Fortran) feed PDGCS, which feeds code generators for C90 and IRAM]