
Lecture 10: Memory Dependence Detection and Speculation

Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha 21264 Example


Register and Memory Dependences

Store: SW Rt, A(Rs)

Calculate effective memory address → dependent on Rs

Write to D-Cache → dependent on Rt, and cannot be speculative

Compare “ADD Rd, Rs, Rt”

What is the difference?

LW Rt, A(Rs)

Calculate effective memory address → dependent on Rs

Read D-Cache → could be memory-dependent on pending writes!

When is the memory dependence known?

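As a hypothetical C illustration of the question above: the register dependences of both instructions are visible at decode time, but a memory dependence between the store and the load exists only when the two effective addresses turn out to be equal, which is not known until both addresses have been calculated.

    /* The load aliases the store only when i == j, which is unknown until
       both effective addresses (&A[i] and &A[j]) have been computed. */
    void scale_and_read(int *A, int i, int j, int *out)
    {
        A[i] = A[i] * 2;   /* store: effective address depends on i */
        *out = A[j];       /* load: memory-dependent on the store only if i == j */
    }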


Memory Correctness and Performance

Correctness conditions:

  • Only committed store instructions can write to memory

  • Any load instruction receives its memory operand from its parent store (the most recent earlier store to the same address)

  • At the end of execution, any memory word receives the value of the last write

    Performance: Exploit memory level parallelism


Load/Store Buffer in Tomasulo

Original Tomasulo: load/store addresses are pre-calculated before scheduling

Loads are then not dependent on any other instruction

Stores are dependent on the instructions producing the store data

Provide dynamic memory disambiguation: check the memory dependence between stores and loads


[Block diagram: Fetch Unit (IM) → Decode/Rename → Reorder Buffer and Regfile; reservation stations (RS), a store buffer (S-buf), and a load buffer (L-buf) feed the functional units (FU1, FU2) and the data memory (DM)]


Dynamic Scheduling with Integer Instructions

Centralized design example

Centralized reservation stations usually include the load buffer

Integer units are shared by load/store and ALU instructions

What is the challenge in detecting memory dependence?


[Block diagram: Fetch Unit (IM) → Decode/Rename → Reorder Buffer and Regfile; a centralized RS issues to the integer units (I-FU) and the other FUs; store addresses and data go to the S-buf and then to the D-Cache]


Load/Store with Dynamic Execution

  • Only committed store instructions can write to memory

     → Use the store buffer as a temporary place for store results

  • Any memory word receives the value of the last write

     → Store instructions write to memory in program order

  • Any load instruction receives its memory operand from its parent (a store instruction)

  • Memory-level parallelism should be exploited

     → Non-speculative solution: load bypassing and load forwarding

     → Speculative solution: speculative load execution


Store Buffer Design Example

Store instruction lifecycle:

  • Wait in the RS until the base address and the store data are ready

  • Calculate the address and move it to the store buffer; move the data directly to the store buffer

  • Wait for commit

  • If there is no exception/mis-prediction: wait for a memory port, then write to the D-cache

  • Otherwise: the entry is flushed before it ever writes to the D-cache


[Diagram: the store buffer sits between the RS/address unit and the D-cache; entries (addr, data, completion bit) are ordered from young to old, and the oldest completed entries ("Arch. states") drain to the D-cache]
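A minimal C sketch of the store buffer described above; the type, field, and function names are illustrative rather than taken from the slide. Each entry keeps the calculated address, the store data, and status bits; only committed entries may drain to the D-cache, and uncommitted entries are simply dropped on a flush.

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_ENTRIES 16                    /* illustrative size */

    typedef struct {
        uint64_t addr;                       /* effective address from the address unit */
        uint64_t data;                       /* store data from the RS/register file    */
        bool     addr_valid;                 /* address has been calculated             */
        bool     data_valid;                 /* data has arrived                        */
        bool     committed;                  /* the ROB has committed this store        */
    } sb_entry_t;

    typedef struct {
        sb_entry_t e[SB_ENTRIES];            /* circular queue, head = oldest entry     */
        int head, count;
    } store_buffer_t;

    /* Stand-in for the real D-cache write port. */
    static void dcache_write(uint64_t addr, uint64_t data) { (void)addr; (void)data; }

    /* Drain one entry: only a committed store may write the D-cache, and
       stores leave the buffer in program order (oldest first). */
    static void sb_drain_one(store_buffer_t *sb)
    {
        if (sb->count > 0 && sb->e[sb->head].committed) {
            dcache_write(sb->e[sb->head].addr, sb->e[sb->head].data);
            sb->head = (sb->head + 1) % SB_ENTRIES;
            sb->count--;
        }
    }

    /* On an exception or branch mis-prediction, uncommitted entries are
       flushed before they ever reach the D-cache. */
    static void sb_flush_uncommitted(store_buffer_t *sb)
    {
        while (sb->count > 0 &&
               !sb->e[(sb->head + sb->count - 1) % SB_ENTRIES].committed)
            sb->count--;                     /* discard youngest-first */
    }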


Memory Dependence

Any load instruction receives the memory operand from its parent (a store instruction)

  • If any previous store has not written the D-cache, what to do?

  • If any previous store has not finished, what to do?

    Simple design: delay a load until every earlier store has finished; but what about performance?
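As a minimal sketch (reusing the hypothetical store_buffer_t from the earlier sketch), the simple design amounts to the check below: hold the load back whenever any earlier store is still sitting in the store buffer. This is always safe but gives up all memory-level parallelism.

    /* Conservative policy: issue a load only when no earlier store is still
       pending in the store buffer (safe, but serializes memory accesses). */
    static bool load_may_issue_conservative(const store_buffer_t *sb)
    {
        return sb->count == 0;
    }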


Memory-level Parallelism

Significant improvement over sequential reads/writes

for (i = 0; i < 100; i++)
    A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop


[Figure: the loop's reads can overlap one another and its writes can overlap one another; in particular, the load of iteration i+1 addresses a different word than the pending store of iteration i, so it can proceed early once the two are disambiguated]


Load Bypassing and Load Forwarding

Non-speculative solution

  • Dynamic disambiguation: match the load address against the addresses of all previous (pending) stores

  • Load bypassing: start the cache read if no match is found

  • Load forwarding: use the store buffer value if a match is found

  • In-order execution limitation: the load must wait until all previous stores have finished

[Diagram: the load address from the address unit (I-FU) is matched against the store unit/store buffer entries before the D-cache is accessed]
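A minimal C sketch of the matching step, continuing the earlier hypothetical store-buffer sketch and assuming word-sized, aligned accesses with every buffer entry older than the load: the youngest matching store forwards its value, an older store whose address is still unknown forces the load to wait, and otherwise the load bypasses the stores and reads the D-cache.

    typedef enum { LOAD_BYPASS, LOAD_FORWARD, LOAD_WAIT } load_action_t;

    /* Scan older stores from youngest to oldest so the most recent store wins. */
    static load_action_t load_disambiguate(const store_buffer_t *sb,
                                           uint64_t load_addr, uint64_t *fwd_value)
    {
        for (int i = sb->count - 1; i >= 0; i--) {
            const sb_entry_t *e = &sb->e[(sb->head + i) % SB_ENTRIES];
            if (!e->addr_valid)
                return LOAD_WAIT;            /* cannot disambiguate yet            */
            if (e->addr == load_addr) {
                if (!e->data_valid)
                    return LOAD_WAIT;        /* matching store, data not ready yet */
                *fwd_value = e->data;        /* load forwarding                    */
                return LOAD_FORWARD;
            }
        }
        return LOAD_BYPASS;                  /* load bypassing: read the D-cache   */
    }

The LOAD_WAIT cases are where the non-speculative scheme loses performance: the load cannot proceed while any older store's address (or a matching store's data) is still unknown.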


In-order Execution Limitation

Example 1: When is the SW result available, and when can the next load start?

Possible solution: start store address calculation early → more complex design

Example 2: When is the address “a->b->c” available?


Example 1:

for (i = 0; i < 100; i++)
    A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

Example 2:

a->b->c = 100;
d = x;

(The store's address depends on first loading the pointer a->b from memory, so it is known late; the following load of x is independent and could start earlier if disambiguation allowed it.)


Speculative Load Execution

If no dependence predicted

  • Send loads out even if dependence is unknown

  • Do address matching at store commits

    • Match found: memory dependence violation, flush pipeline;

    • Otherwise: continue

      Note: may still need load forwarding (not shown)

[Diagram: loads and stores issue from the RS through the address units (I-FU) into a load queue and a store queue; loads access the D-cache, and addresses are matched between the two queues]
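A minimal C sketch of the commit-time check described above, continuing the earlier sketches; the load-queue fields and names are illustrative. When a store commits, its address is compared against younger loads that have already executed, and any hit is a memory-dependence violation that flushes the pipeline.

    #define LQ_ENTRIES 32

    typedef struct {
        uint64_t addr;                        /* load effective address            */
        uint64_t seq;                         /* program-order sequence number     */
        bool     executed;                    /* load has already read the D-cache */
    } lq_entry_t;

    typedef struct {
        lq_entry_t e[LQ_ENTRIES];
        int count;
    } load_queue_t;

    /* Returns true if the committing store hits a younger, already-executed
       load to the same address: the speculation was wrong and the pipeline
       must be flushed and restarted from the offending load. */
    static bool store_commit_check(const load_queue_t *lq,
                                   uint64_t store_addr, uint64_t store_seq)
    {
        for (int i = 0; i < lq->count; i++) {
            const lq_entry_t *l = &lq->e[i];
            if (l->executed && l->seq > store_seq && l->addr == store_addr)
                return true;                  /* memory dependence violation */
        }
        return false;                         /* speculation was correct */
    }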


Alpha 21264 Pipeline


Alpha 21264 Load/Store Queues

[Diagram: the integer issue queue issues to two address ALUs and two integer ALUs backed by two copies of the 80-entry integer register file; the FP issue queue issues to two FP ALUs backed by the 72-entry FP register file; memory operations go through the D-TLB to the load queue (L-Q), the store queue (S-Q), the AF, and the dual-ported D-Cache]

32-entry load queue, 32-entry store queue


Load Bypassing, Forwarding, and RAW Detection

[Diagram: at commit, the ROB head checks whether the retiring instruction is a load or a store and matches it against the load queue and store queue; committed store-queue entries drain to the D-cache]

At commit:

  • Load: WAIT if the load-queue head is not yet completed, then move the LQ head

  • Store: mark the store-queue head as completed, then move the SQ head

Address matching:

  • Load address vs. store queue: if a match is found, forward the store data to the load

  • Store address vs. load queue: if a match is found, mark a store-load trap to flush the pipeline (at commit)
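A small sketch of the two commit rules above, continuing the earlier C sketches; the types and helper names are illustrative. A load at the ROB head retires only once its load-queue head entry has completed, while a store retires immediately by marking its store-queue head completed; the actual D-cache write happens later, when the store queue drains.

    typedef struct { bool completed; } lsq_head_t;   /* minimal view of an LQ/SQ head entry */

    /* Load at the ROB head: WAIT (return false) until the LQ head has
       completed; only then is the LQ head moved past this entry. */
    static bool try_commit_load(const lsq_head_t *lq_head)
    {
        return lq_head->completed;
    }

    /* Store at the ROB head: mark the SQ head completed so it may later
       drain to the D-cache, then move the SQ head. */
    static void commit_store(lsq_head_t *sq_head)
    {
        sq_head->completed = true;
    }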


Speculative Memory Disambiguation

[Diagram: the load PC indexes a 1024-entry, 1-bit stWait table; the bit is read at fetch and carried with the renamed instruction into the integer issue queue]

To help predict memory dependence:

  • Whenever a load causes a violation, set its stWait bit in the table

  • When the load is fetched, read its stWait bit from the table and send it to the issue queue with the load instruction

  • A load waits in the issue queue while its stWait bit is set and any previous store is still present

  • The table is cleared periodically
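A minimal C sketch of the stWait mechanism described above (same includes as the earlier sketches); the table size follows the slide, while the indexing function and the reset interval are assumptions.

    #define STWAIT_ENTRIES 1024

    static uint8_t stwait_table[STWAIT_ENTRIES];             /* 1 bit per entry */

    static unsigned stwait_index(uint64_t pc)
    {
        return (unsigned)((pc >> 2) & (STWAIT_ENTRIES - 1)); /* assumed hash: low PC bits */
    }

    /* A load caused a store-load violation: remember to make it wait next time. */
    static void stwait_on_violation(uint64_t load_pc)
    {
        stwait_table[stwait_index(load_pc)] = 1;
    }

    /* At fetch: read the bit and carry it into the issue queue with the load.
       The load then waits there while its stWait bit is set and any earlier
       store is still in the queue. */
    static bool stwait_lookup(uint64_t load_pc)
    {
        return stwait_table[stwait_index(load_pc)] != 0;
    }

    /* Cleared periodically so stale predictions do not delay loads forever. */
    static void stwait_clear(void)
    {
        for (int i = 0; i < STWAIT_ENTRIES; i++)
            stwait_table[i] = 0;
    }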


Architectural Memory States

Memory request: search the hierarchy from top to bottom


[Diagram: memory state, searched from top to bottom]

  • LQ / SQ: completed entries (not yet committed)

  • Committed states: L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk/Tape, etc.
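A small C sketch of the top-to-bottom search order, continuing the earlier sketches; every lookup function below is an illustrative stub standing in for the real structure at that level.

    /* Stubs standing in for the store queue and each level of the hierarchy. */
    static bool sq_forward(uint64_t a, uint64_t *v) { (void)a; (void)v; return false; }
    static bool l1_lookup (uint64_t a, uint64_t *v) { (void)a; (void)v; return false; }
    static bool l2_lookup (uint64_t a, uint64_t *v) { (void)a; (void)v; return false; }
    static uint64_t mem_lookup(uint64_t a)          { (void)a; return 0; }

    /* A memory read searches its own forwarding sources before the caches. */
    static uint64_t memory_read(uint64_t addr)
    {
        uint64_t v;
        if (sq_forward(addr, &v)) return v;   /* completed (not yet committed) store entries */
        if (l1_lookup(addr, &v))  return v;   /* L1 D-cache                                  */
        if (l2_lookup(addr, &v))  return v;   /* L2 (and optional L3) cache                  */
        return mem_lookup(addr);              /* main memory, backed by disk                 */
    }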


Summary of Superscalar Execution

  • Instruction flow techniques

    Branch prediction, branch target prediction, and instruction prefetch

  • Register data flow techniques

    Register renaming, instruction scheduling, in-order commit, mis-prediction recovery

  • Memory data flow techniques

    Load/store units, memory consistency

    Source: Shen & Lipasti reference book

