
Improving Memory System Performance for Soft Vector Processors

Peter Yiannacouras

J. Gregory Steffan

Jonathan Rose

WoSPS – Oct 26, 2008

Soft Processors in FPGA Systems

Data-level parallelism → soft vector processors

[Slide diagram: FPGA systems combine a soft processor, programmed in C with a compiler (easier), with custom logic designed in HDL with CAD (faster, smaller, less power)]

 Configurable – how can we make use of this?

Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

Each vector instruction holds many units of independent operations.

[Slide animation: with 1 vector lane, the 16 element operations b[15]+=a[15] … b[0]+=a[0] of the vadd issue one at a time]

Vector Processing Primer

16x speedup

(Same C and vectorized code as the previous slide.)

Each vector instruction holds many units of independent operations.

[Slide animation: with 16 vector lanes, all 16 element operations b[15]+=a[15] … b[0]+=a[0] execute in parallel, giving the 16x speedup; a scaling sketch follows below]
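To make the scaling intuition concrete, the following is a minimal model (not from the original slides) of how a vector instruction of length VL maps onto a configurable number of lanes: the element operations take roughly ceil(VL / lanes) cycles, which is where the ideal 16x comes from for VL = 16 on 16 lanes, and which memory stalls (examined next) erode in practice.

/* Illustrative model only: cycles for the element operations of one
 * vector instruction, ignoring memory stalls and pipeline fill. */
#include <stdio.h>

static unsigned element_cycles(unsigned vl, unsigned lanes)
{
    return (vl + lanes - 1) / lanes;           /* ceil(vl / lanes) */
}

int main(void)
{
    unsigned vl = 16;                          /* vector length, as in the example */
    for (unsigned lanes = 1; lanes <= 16; lanes *= 2)
        printf("lanes=%2u -> %2u cycles (%2ux vs 1 lane)\n",
               lanes, element_cycles(vl, lanes),
               element_cycles(vl, 1) / element_cycles(vl, lanes));
    return 0;
}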

Sub-Linear Scalability

Vector lanes are not being fully utilized, so speedup falls short of the number of lanes.

Where Are The Cycles Spent?

2/3 of cycles are spent waiting on the memory unit, often due to cache misses.

[Chart: cycle breakdown at 16 lanes; 67% of cycles stall in the memory unit]

Our Goals
  • Improve memory system
    • Better cache design
    • Hardware prefetching
  • Evaluate improvements for real:
    • Using a complete hardware design (in Verilog)
    • On real FPGA hardware (Stratix 1S80C6)
    • Running full benchmarks (EEMBC)
    • From off-chip memory (DDR-133MHz)
Current Infrastructure

[Pipeline diagram: the VPU datapath. Decode, replicate, and hazard-check stages feed the vector control (VC), vector scalar (VS), and vector register (VR) register files and writeback, the memory unit, and per-lane MUXes, ALUs, saturation, multiply-and-saturate, and right-shift logic]

SOFTWARE flow: EEMBC C benchmarks are compiled with GCC and linked (ld) for the scalar μP; the resulting ELF binary, together with vectorized assembly subroutines assembled by GNU as, runs on the MINT instruction set simulator (extended with VPU vector support) for verification.

HARDWARE flow: the Verilog design is simulated in Modelsim (RTL simulator) to obtain cycle counts and for verification, and compiled with Altera Quartus II v 8.0 to obtain area and frequency.

VESPA Architecture Design

[Architecture diagram: a 3-stage scalar pipeline (Icache, decode, RF, ALU, MUX, WB) shares the Dcache with the vector coprocessor, which comprises a 3-stage vector control pipeline and a 6-stage vector pipeline (decode, replicate, hazard check, VC/VS/VR register files and writeback, memory unit, and per-lane MUXes, ALUs, saturation, multiply-and-saturate, and right-shift logic)]

Supports integer and fixed-point operations, and predication

32-bit datapaths

Memory System Design

vld.w (load 16 contiguous 32-bit words)

[Diagram: 16-lane VESPA. The scalar processor and the vector coprocessor's lanes 0-15 connect through the vector memory crossbar to a 4KB Dcache with 16B lines, backed by DDR with a 9-cycle access latency]

Memory System Design

Reduced cache accesses + some prefetching (see the worked example after the diagram)

vld.w (load 16 contiguous 32-bit words)

[Diagram: the same 16-lane VESPA system, now with a 4x larger Dcache (16KB) and 4x wider cache lines (64B) between the vector memory crossbar and DDR (9-cycle access)]
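As a rough worked example (an illustrative calculation, not from the slides, assuming a line-aligned access): a 16-lane vld.w of 32-bit words touches 16 × 4B = 64B of contiguous data, so 16B lines require 4 cache accesses while a single 64B line satisfies all lanes at once, and the rest of that wide line acts as a small amount of prefetching for the next access.

/* Illustrative only (assumes a line-aligned access): cache lines touched
 * by a unit-stride vector load of `lanes` elements of `elem_bytes` each. */
#include <stdio.h>

static unsigned lines_touched(unsigned lanes, unsigned elem_bytes,
                              unsigned line_bytes)
{
    unsigned total = lanes * elem_bytes;             /* bytes requested */
    return (total + line_bytes - 1) / line_bytes;    /* ceiling         */
}

int main(void)
{
    printf("16B lines: %u cache accesses\n", lines_touched(16, 4, 16));  /* 4 */
    printf("64B lines: %u cache access\n",   lines_touched(16, 4, 64));  /* 1 */
    return 0;
}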

Improving Cache Design
  • Vary the cache depth & cache line size
    • Using parameterized design
    • Cache line size: 16, 32, 64, 128 bytes
    • Cache depth: 4, 8, 16, 32, 64 KB
  • Measure performance on 9 benchmarks
    • 6 from EEMBC, all executed in hardware
  • Measure area cost
    • Equate silicon area of all resources used
      • Report in units of Equivalent LEs
Cache Design Space – Performance (Wall Clock Time)

Best cache design almost doubles the performance of the original VESPA.

More pipelining/retiming could reduce the clock frequency penalty.

Cache line size is more important than cache depth (lots of streaming).

[Chart: wall-clock performance across cache configurations; clock frequency ranges from 122MHz to 129MHz depending on the configuration]

Cache Design Space – Area

System area almost doubled in the worst case.

[Chart: system area across cache configurations, broken down into M4K and MRAM block RAM usage. Inset: a 64B (512-bit) cache line striped across 16-bit-wide M4K blocks (4096 bits each) needs 32 M4Ks, which together provide 16KB of storage]
Cache Design Space – Area

a) Choose the cache depth that fills the block RAMs already required by the line size (see the sketch below)

b) Don't use MRAMs: big, few, and overkill

[Chart: area breakdown into M4K and MRAM usage across cache configurations]
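A small worked example of rule a), using the M4K numbers from the previous slide (4096 bits per block, used 16 bits wide); the code is only an illustration of the arithmetic:

/* Illustrative sketch of the block-RAM sizing from the area slide:
 * the line width fixes the number of M4Ks, and those M4Ks in turn
 * fix how much cache depth comes "for free". */
#include <stdio.h>

int main(void)
{
    unsigned line_bits  = 64 * 8;                   /* 64B line = 512 bits  */
    unsigned m4k_width  = 16;                       /* bits read per M4K    */
    unsigned m4k_bits   = 4096;                     /* capacity per M4K     */

    unsigned num_m4ks   = line_bits / m4k_width;    /* 512/16 = 32 blocks   */
    unsigned storage_kb = num_m4ks * m4k_bits / 8 / 1024;  /* 32*4Kb = 16KB */

    printf("%u M4Ks per line -> %u KB of cache depth for free\n",
           num_m4ks, storage_kb);                   /* 32 M4Ks, 16 KB       */
    return 0;
}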

Hardware Prefetching Example

[Diagram: with no prefetching, consecutive vld.w instructions each miss in the Dcache and pay the 9-cycle DDR access penalty; with prefetching of 3 blocks, the first vld.w misses and fetches its block plus the next 3, so the following vld.w hits]

Hardware Data Prefetching

We measure performance/area using a 64B, 16KB dcache

  • Advantages
    • Little area overhead
    • Parallelize memory fetching with computation
    • Use full memory bandwidth
  • Disadvantages
    • Cache pollution
  • We use Sequential Prefetching (sketched below) triggered on:
    • a) any miss, or
    • b) a sequential vector instruction miss
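A minimal sketch of the prefetcher's behaviour in C (illustrative only; the real design is Verilog, and cache_lookup/issue_line_fill are made-up stand-ins for the cache controller): on a triggering miss, the missing line plus the next K sequential lines are requested.

/* Illustrative sketch of sequential prefetching (not the VESPA RTL). */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64u                 /* 64B lines, as in the evaluated cache */

enum trigger { ANY_MISS, SEQ_VECTOR_MISS };

/* Hypothetical stand-ins for the cache controller hooks. */
static bool cache_lookup(uint32_t addr)    { (void)addr; return false; }
static void issue_line_fill(uint32_t line) { printf("fill 0x%08x\n", (unsigned)line); }

/* On a triggering miss, fetch the missing line and the next k lines. */
static void access_with_prefetch(uint32_t addr, unsigned k,
                                 enum trigger policy, bool is_seq_vector)
{
    if (cache_lookup(addr))
        return;                                        /* hit: nothing to do    */

    uint32_t line = addr & ~(uint32_t)(LINE_BYTES - 1);
    issue_line_fill(line);                             /* demand fetch          */

    if (policy == ANY_MISS || is_seq_vector)           /* trigger (a) or (b)    */
        for (unsigned i = 1; i <= k; i++)
            issue_line_fill(line + i * LINE_BYTES);    /* prefetch next k lines */
}

int main(void)
{
    access_with_prefetch(0x1000, 3, SEQ_VECTOR_MISS, true);  /* miss + 3 prefetches */
    return 0;
}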
Prefetching K Blocks – Any Miss

Only half the benchmarks are significantly sped up; the average speedup peaks at 28%, with a maximum of 2.2x, and the remaining benchmarks are not receptive.

[Chart: speedup vs number of prefetched blocks K, per benchmark]

Prefetching Area Cost: Writeback Buffer

Prefetching 3 blocks

  • Prefetched blocks can evict dirty lines; two options:
    • Deny the prefetch
    • Buffer all dirty lines (sketched below)
  • Area cost is small
    • 1.6% of system area
    • Mostly block RAMs
    • Little logic
  • No clock frequency impact

[Diagram: a vld.w miss triggers a prefetch; dirty lines evicted along the way are held in the writeback (WB) buffer between the Dcache and DDR (9-cycle penalty)]
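A minimal sketch of that dirty-line handling (illustrative C with made-up names and an assumed buffer depth; the 1.6% area figure above corresponds to adding the buffer): when a prefetch would evict a dirty line, the controller can either deny the prefetch or park the line in the writeback buffer and drain it to DDR later.

/* Illustrative sketch, not the VESPA RTL: a small writeback buffer that
 * holds dirty lines evicted by prefetches so the prefetch need not stall. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 8u                            /* depth is an assumption */

static uint32_t wb_buffer[WB_DEPTH];           /* addresses of buffered dirty lines */
static unsigned wb_count;

/* Hypothetical stand-in for a DDR line writeback. */
static void ddr_write_line(uint32_t addr)
{
    printf("writeback 0x%08x\n", (unsigned)addr);
}

/* Called when a prefetched block would evict a dirty line:
 * option a) deny the prefetch, option b) buffer the dirty line. */
static bool allow_prefetch(uint32_t dirty_addr, bool buffer_dirty_lines)
{
    if (!buffer_dirty_lines || wb_count == WB_DEPTH)
        return false;                          /* a) deny the prefetch     */
    wb_buffer[wb_count++] = dirty_addr;        /* b) buffer the dirty line */
    return true;                               /* prefetch proceeds        */
}

/* Drain buffered lines to DDR when the memory bus is idle. */
static void wb_drain(void)
{
    while (wb_count > 0)
        ddr_write_line(wb_buffer[--wb_count]);
}

int main(void)
{
    if (allow_prefetch(0x2040, true))          /* dirty line gets buffered */
        puts("prefetch allowed");
    wb_drain();
    return 0;
}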

Any Miss vs Sequential Vector Miss

Collinear – nearly all misses in our benchmarks come from sequential vector memory instructions

Vector Length Prefetching
  • Previously: a constant number of cache lines prefetched
  • Now: prefetch a multiple of the vector length
    • Only for sequential vector memory instructions
    • E.g. a vector load of 32 elements
  • Guarantees <= 1 miss per vector memory instruction (see the sketch below)

[Diagram: a vld.w over elements 0–31 — fetch + prefetch 28*k]
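A minimal sketch of the policy (illustrative C, assuming 64B lines and 32-bit elements; issue_line_fill is a made-up stand-in for the cache controller): the prefetch amount scales with the current vector length instead of being a fixed constant, which is what bounds the misses to at most one per vector memory instruction.

/* Illustrative sketch of vector length prefetching (not the VESPA RTL). */
#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 64u                 /* 64B cache lines */
#define ELEM_BYTES 4u                  /* 32-bit elements */

/* Hypothetical stand-in for the controller's line-fill request. */
static void issue_line_fill(uint32_t line) { printf("fill 0x%08x\n", (unsigned)line); }

/* On a sequential vector miss, fetch the missing line and prefetch enough
 * further lines to cover k * VL elements. */
static void vector_length_prefetch(uint32_t miss_addr, unsigned vl, unsigned k)
{
    uint32_t line  = miss_addr & ~(uint32_t)(LINE_BYTES - 1);
    unsigned bytes = k * vl * ELEM_BYTES;                    /* k*VL elements        */
    unsigned lines = (bytes + LINE_BYTES - 1) / LINE_BYTES;  /* lines to cover them  */

    issue_line_fill(line);                                   /* demand fetch         */
    for (unsigned i = 1; i <= lines; i++)
        issue_line_fill(line + i * LINE_BYTES);              /* prefetch             */
}

int main(void)
{
    vector_length_prefetch(0x1000, 32, 1);   /* vld.w of 32 elements, 1*VL */
    return 0;
}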

Vector Length Prefetching - Performance

1*VL prefetching provides a good speedup (21% on average) without tuning; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x, and no cache pollution. Some benchmarks remain not receptive.

[Chart: speedup vs prefetch amount in multiples of VL, per benchmark]

Overall Memory System Performance

A wider cache line plus prefetching reduces memory unit stall cycles significantly, eliminating all but 4% of miss cycles.

[Chart: memory unit stall cycles drop from 67% with the original 4KB cache to 48% and then 31% with the improved 16KB cache and prefetching; miss cycles drop to 4%]

Improved Scalability

Previous: 3-8x range, average of 5x for 16 lanes

Now: 6-13x range, average of 10x for 16 lanes

Summary
  • Explored cache design
    • ~2x performance for ~2x system area
      • Area growth due largely to memory crossbar
    • Widened cache line size to 64B and depth to 16KB
  • Enhanced VESPA w/ hardware data prefetching
    • Up to 2.2x performance, average of 28% for K=15
    • Vector length prefetcher gains 21% on average for 1*VL
      • Good for mixed workloads, no tuning, no cache pollution
      • Peak at 8*VL, average of 29% speedup
  • Overall improved VESPA memory system & scalability
    • Decreased miss cycles to 4%
    • Decreased memory unit stall cycles to 31%
Vector Memory Unit

[Diagram: the vector memory unit, shown for Memory Lanes = 4. For each lane i = 0 … L (where L = # Lanes - 1), a MUX selects either stride*i or index_i, which is added to base to form the lane's address, and the requests enter the memory request queue. Read data returns through the read crossbar to rddata0 … rddataL, and write data wrdata0 … wrdataL passes through the write crossbar and the memory write queue to the Dcache. A sketch of this address generation follows below]
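A minimal sketch of that per-lane address generation (illustrative C, not the Verilog; the function and struct names are made up): each lane's MUX picks stride*i or its index, adds base, and the resulting requests feed the memory request queue.

/* Illustrative sketch, not the VESPA Verilog: per-lane address generation
 * for the vector memory unit, mirroring the MUX + adder in the diagram. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MEM_LANES 4u                     /* "Memory Lanes = 4" in the diagram */

/* One entry per lane for the memory request queue. */
struct mem_request {
    uint32_t addr;
    uint32_t wrdata;                     /* store data, headed for the write crossbar */
};

static void generate_requests(uint32_t base, uint32_t stride, bool indexed,
                              const uint32_t index[MEM_LANES],
                              const uint32_t wrdata[MEM_LANES],
                              struct mem_request out[MEM_LANES])
{
    for (uint32_t i = 0; i < MEM_LANES; i++) {
        /* MUX: stride*i for strided accesses, index_i for indexed ones;
         * the selected offset is added to base. */
        uint32_t offset = indexed ? index[i] : stride * i;
        out[i].addr     = base + offset;
        out[i].wrdata   = wrdata[i];
    }
}

int main(void)
{
    uint32_t index[MEM_LANES]  = {0, 8, 16, 24};
    uint32_t wrdata[MEM_LANES] = {0, 0, 0, 0};
    struct mem_request rq[MEM_LANES];

    generate_requests(0x1000, 4, false, index, wrdata, rq);  /* unit-stride 32-bit load */
    for (unsigned i = 0; i < MEM_LANES; i++)
        printf("lane %u -> 0x%08x\n", i, (unsigned)rq[i].addr);
    return 0;
}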