Design of a Parallel Vector Access Unit for SDRAM Memory Systems

Impulse Group

Department of Computer Science

University of Utah

Presented by Binu K. Mathew


Motivation

  • Current microprocessors are very powerful, but ...

  • Irregular applications still perform poorly

  • Vectorizable loop

    • e.g. : for(i = 0; i < L * S; i += S) y[i] += a[i] * x[i];

    • Poor cache utilization

    • Cache pollution

    • Poor bus utilization

    • Access pattern may be predictable

  • Memory system enhancements for vectors

  • Vectors are back again
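The "poor cache utilization" point can be made concrete by counting how many cache lines a vector touches. The following is a minimal sketch, not taken from the presentation: it assumes 4-byte words and a 128-byte line (32 words per line), and the name `lines_touched` is hypothetical.

```c
enum { LINE_WORDS = 32 };  /* assumed: 128-byte cache line / 4-byte words */

/* Count the distinct cache lines touched by `length` elements that start at
 * word address `base` and are `stride` words apart. Addresses only increase,
 * so a new line is touched exactly when the line index changes. */
unsigned lines_touched(unsigned base, unsigned stride, unsigned length)
{
    unsigned count = 0;
    unsigned prev_line = 0;

    for (unsigned i = 0; i < length; i++) {
        unsigned line = (base + i * stride) / LINE_WORDS;
        if (i == 0 || line != prev_line)
            count++;
        prev_line = line;
    }
    return count;
}
```

Under these assumptions a unit-stride vector of 16 words starting at word 1024 fits in one line, while the same 16 words at stride 32 touch 16 different lines, so only one word in each fetched line is useful.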


A Vector Memory Controller

  • Handle both strided and normal accesses

    • Fast scatter/gather

    • Efficient cache-line fills

  • New fast Parallel Vector Access Algorithm

  • Scheduling heuristics

  • Prototype implementation


The Serial Vector Access Problem

Vector = < Base Address, Stride, Length >

[Figure: memory banks 0 to 7]


The Serial Vector Access Problem

V = < 1024, 1, 16 >, cache-line fill

[Figure: element addresses 1024, 1025, 1026, 1027, ... on a Cache-line Interleaved Serial Memory System with banks 0 to 7]


The Serial Vector Access Problem

V = < 1024, 32, 16 >, strided access

[Figure: element addresses 1024, 1056, 1088, 1120, ... on a Cache-line Interleaved Serial Memory System with banks 0 to 7]


Parallel Vector Access

  • Serial vector access : Low throughput

    • Exploit bank parallelism

    • Exploit internal parallelism

  • History of Parallel Vector Access

    • CVMS : Corbal, Espasa, Valero

      • Two to 15 cycle algorithm

      • Interconnect and crossbar

    • Module stride : Steven Moyer

  • Our PVA Algorithm

    • Two to three cycles, Merge on Bus

    • Scalable

    • Word and Block interleave


Memory Organization


The PVA Problem

V = < 1024, 1, 16 >, cache-line fill

[Figure: element addresses 1024 to 1039 in steps of 1, mapped onto banks 0 to 7]

Bank Access Sequence :

0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7


The PVA Problem : Stride 2

V = < 1024, 2, 16 >

[Figure: element addresses 1024 to 1054 in steps of 2, mapped onto banks 0 to 7]

Bank Access Sequence :

0, 2, 4, 6, 0, 2, 4, 6, ...


The PVA Problem : Stride 3

V = < 1024, 3, 16 >

[Figure: element addresses 1024 to 1069 in steps of 3, mapped onto banks 0 to 7]

Bank Access Sequence :

0, 3, 6, 1, 4, 7, 2, 5, 0, ...
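The bank access sequences on these slides are consistent with a word-interleaved mapping in which the bank number is the word address modulo the number of banks. The sketch below reproduces them under that assumption, with eight banks as drawn in the figures. The names `bank_of` and `bank_sequence` are illustrative and do not come from the presentation.

```c
enum { NUM_BANKS = 8 };  /* assumed: eight word-interleaved banks */

/* Bank that holds a given word address (assumed mapping: address mod banks). */
unsigned bank_of(unsigned addr)
{
    return addr % NUM_BANKS;
}

/* Fill seq[0 .. length-1] with the bank visited by each element of the
 * vector < base, stride, length >. */
void bank_sequence(unsigned base, unsigned stride, unsigned length,
                   unsigned *seq)
{
    for (unsigned i = 0; i < length; i++)
        seq[i] = bank_of(base + i * stride);
}
```

With base 1024, stride 1 visits every bank in order, stride 2 visits only the even banks, and stride 3 visits all eight banks in a permuted order. A serial controller walks this sequence one element at a time, whereas a parallel scheme lets each bank pick out its own elements.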


Our PVA Solution

  • Functions

    • FirstHit(V,b) : Compute first vector element of V that hits b

      Operations: Table lookup, multiply or shift and add

    • NextHit(V.S) : Compute incremental index of next element

      Operations : Trivial PLA

  • Bank Controller Algorithm

    • Compute i = FirstHit(V, b)

    • If there is no hit, continue

    • Until the end of the vector is reached, do :

      • Schedule access memory location V.B + i * V.S

      • i = i + NextHit(V.S)

  • Scheduling heuristics

    • Early row open

    • Reordering and interleaving requests
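The bank controller algorithm above can be sketched in software as follows. This is an illustrative model, not the hardware design. It assumes eight word-interleaved banks with bank = address mod 8, and it finds FirstHit by a short search where the slides describe a table lookup plus multiply or shift-and-add. The lower-case names, the `Vector` struct and the `gcd` helper are hypothetical.

```c
enum { NUM_BANKS = 8 };  /* assumed: power of two, word interleaved */

typedef struct {
    unsigned base;    /* V.B: word address of element 0 */
    unsigned stride;  /* V.S: distance between elements, in words */
    unsigned length;  /* V.L: number of elements */
} Vector;

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* NextHit: index increment between consecutive elements that fall in the
 * same bank. The bank pattern repeats with this period. */
unsigned next_hit(unsigned stride)
{
    return NUM_BANKS / gcd(stride % NUM_BANKS, NUM_BANKS);
}

/* FirstHit: index of the first element of v stored in `bank`, or -1 if no
 * element hits. Only the first next_hit() indices need checking, because
 * the pattern is periodic. */
int first_hit(const Vector *v, unsigned bank)
{
    unsigned delta = next_hit(v->stride);

    for (unsigned i = 0; i < delta && i < v->length; i++)
        if ((v->base + i * v->stride) % NUM_BANKS == bank)
            return (int)i;
    return -1;
}

/* One bank controller: write the addresses this bank must access into out[]
 * (at most cap entries) and return how many were written. */
unsigned bank_schedule(const Vector *v, unsigned bank,
                       unsigned *out, unsigned cap)
{
    int first = first_hit(v, bank);
    unsigned delta = next_hit(v->stride);
    unsigned n = 0;

    if (first < 0)
        return 0;  /* no hit: this bank does nothing for v */
    for (unsigned i = (unsigned)first; i < v->length && n < cap; i += delta)
        out[n++] = v->base + i * v->stride;
    return n;
}
```

For V = < 1024, 2, 16 > the increment is 4. Bank 2 has FirstHit index 1 and accesses addresses 1026, 1034, 1042 and 1050, while the odd-numbered banks get no hit. Each bank works only from V and its own bank number, so all banks can run this in parallel without a central sequencer.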


Our PVA Solution : Stride 2

V = < 1024, 2, 16 >

[Figure: element addresses 1024 to 1054 in steps of 2, mapped onto banks 0 to 7]

  • Banks 0, 2, 4, 6 : Hit, with FirstHit index 0, 1, 2, 3 respectively

  • Banks 1, 3, 5, 7 : No Hit

  • δ = 4 for each bank that hits



Hardware Prototype

  • Verilog model : Approximately 3600 lines of code

  • Timing estimates with FPGA, gate level simulation

  • Hardware cost per bank controller

    • Approximately 11000 gates

    • 2K bytes on-chip RAM

  • Target

    • CPU : R10000

    • L2 cache-line size : 128 bytes

    • System bus : 64 bits wide, split transaction

    • Four outstanding memory requests

    • Memory: 256 Mbit Micron SDRAM at 100 MHz

    • 32 bits x 16 banks

    • Word interleaved


PVA Implementation

[Figure: block diagram with labels Vector Bus, Request FIFO, Register File, Vector Contexts, FirstHit Predict, FirstHit Calculate, Access Scheduler, Staging Unit, SDRAM Interface, SDRAM Bus]


Performance Evaluation

  • Kernels

  • 240 data points

  • Compared PVA with :

    • Cache-line interleaved SDRAM

      64 bits wide

      Burst length of 16

    • Scatter/Gather serial SDRAM

      32 bits wide, 16 banks

      Overlapped RAS and precharge

    • Parallel Vector Access SRAM

      32 bits wide, 16 banks

      Single cycle latency, pipelined SRAM


Results : Cache-line fills


Results : Strided Access

[Figure: results chart; labeled values 10.86 and 32.69]


Results : SRAM Comparison


Future Work

  • Full program simulation

  • Integration with virtual memory

  • Parallel access techniques for other patterns

    • Indirect vectors

    • FFT

  • Impulse ASIC


Summary

  • New and improved PVA Algorithm

    • Up to five times improvement over older method

    • Speedups in the range 1.0 to 32.8

    • Technique for block interleaving

    • Scalable

  • Hardware prototype designed

  • Moderate hardware complexity

