Computer Architecture: The P6 Micro-Architecture, an Example of an Out-Of-Order Microprocessor

Lihu Rappoport and Adi Yoaz


The P6 Family

  • Features

    • Out Of Order execution

    • Register renaming

    • Speculative Execution

    • Multiple Branch prediction

    • Super pipeline: 12 pipe stages

[Block diagram of the P6: External Bus, BIU, IFU, BPU, ID, MS, RAT, RS, ROB, IEU, FEU, AGU, MIU, DCU, MOB, L2 (*off die)]

P6 Arch

  • In-Order Front End

    • BIU: Bus Interface Unit

    • IFU: Instruction Fetch Unit (includes IC)

    • BPU: Branch Prediction Unit

    • ID: Instruction Decoder

    • MS: Micro-Instruction Sequencer

    • RAT: Register Alias Table

  • Out-of-order Core

    • ROB: Reorder Buffer

    • RRF: Real Register File

    • RS: Reservation Stations

    • IEU: Integer Execution Unit

    • FEU: Floating-point Execution Unit

    • AGU: Address Generation Unit

    • MIU: Memory Interface Unit

    • DCU: Data Cache Unit

    • MOB: Memory Order Buffer

    • L2: Level 2 cache

  • In-Order Retire



P6 Pipeline

In-Order Front End:

1: Next IP

2: ICache lookup

3: ILD (instruction length decode)

4: rotate

5: ID1

6: ID2

7: RAT – rename sources, ALLOC – assign destinations

8: ROB – read sources; RS – schedule data-ready uops for dispatch

Out-of-order Core:

9: RS – dispatch uops

10: EX

In-order Retirement:

11: Retirement


In-Order Front End

[Front-end diagram: bytes → instructions → uops; BPU and Next IP mux steer the IFU; MS supplies uops to the decoder]

  • BPU – Branch Prediction Unit – predict next fetch address

  • IFU – Instruction Fetch Unit

    • iTLB translates virtual to physical address (access PMH on miss)

    • IC supplies 16 bytes/cycle (access L2 cache on miss)

  • ILD – Instruction Length Decode – split bytes into instructions

  • IQ – Instruction Queue – buffer the instructions

  • ID – Instruction Decode – decode instructions into uops

  • MS – Micro-Sequencer – provides uops for complex instructions



Branch Prediction

  • Implementation

    • Use local history to predict direction

    • Need to predict multiple branches

    • Need to predict branches before previous branches are resolved

    • Branch history updated first based on prediction, later based on actual execution (speculative history)

    • Target address taken from BTB

  • Prediction rate: ~92%

    • ~60 instructions between mispredictions

    • A high prediction rate is crucial for long pipelines

    • Especially important for OOOE, speculative execution:

      • On misprediction all instructions following the branch in the instruction window are flushed

      • Effective size of the window is determined by prediction accuracy

  • RSB (Return Stack Buffer) used for Call/Return pairs
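The local-history scheme above can be sketched in a few lines of Python. This is a hypothetical minimal model, not Intel's exact logic: a 4-bit local history (the length is an assumption) indexes 2-bit saturating counters, the prediction is the counter MSB, and the history is updated speculatively at predict time and repaired when a misprediction resolves.

```python
class LocalPredictor:
    """Per-branch 2-level predictor: local history -> 2-bit counters."""
    def __init__(self, hist_bits=4):
        self.mask = (1 << hist_bits) - 1
        self.history = 0                         # speculative local history
        self.counters = [1] * (1 << hist_bits)   # 2-bit ctrs, weakly not-taken

    def predict(self):
        idx = self.history
        taken = self.counters[idx] >= 2          # prediction = counter MSB
        # speculative history update, before the branch resolves
        self.history = ((self.history << 1) | taken) & self.mask
        return taken, idx

    def resolve(self, idx, pred, actual):
        c = self.counters[idx]
        self.counters[idx] = min(c + 1, 3) if actual else max(c - 1, 0)
        if pred != actual:                       # repair speculative history
            self.history = ((idx << 1) | actual) & self.mask
```

After warm-up such a predictor captures any repeating pattern no longer than its history, e.g. a loop branch taken three times out of four.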


[Diagram: branches jumping within and out of a fetch line; each branch annotated predict taken / predict not taken]

Branch Prediction – Clustering

  • Need to provide predictions for the entire fetch line each cycle

    • Predict the first taken branch in the line, following the fetch IP

  • Implemented by

    • Splitting IP into offset within line, set, and tag

    • If the tag of more than one way matches the fetch IP

      • The offsets of the matching ways are ordered

      • Ways with offset smaller than the fetch IP offset are discarded

      • The first branch that is predicted taken is chosen as the predicted branch
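The selection steps above can be sketched as follows. The 16-byte fetch line and 128 sets come from the surrounding slides; the dict-based BTB entries are purely illustrative.

```python
LINE_BYTES = 16        # fetch line size
NUM_SETS = 128

def split_ip(ip):
    """Split a fetch IP into (offset within line, set, tag)."""
    offset = ip % LINE_BYTES
    line = ip // LINE_BYTES
    return offset, line % NUM_SETS, line // NUM_SETS

def predict_line(ip, ways):
    """ways: tag-matched BTB entries, dicts with 'tag','ofst','taken','target'."""
    offset, _set, tag = split_ip(ip)
    # keep tag-matching ways at or after the fetch offset, ordered by offset
    hits = sorted((w for w in ways if w['tag'] == tag and w['ofst'] >= offset),
                  key=lambda w: w['ofst'])
    for w in hits:
        if w['taken']:
            return w['target']   # first predicted-taken branch in the line
    return None                  # no taken branch: fall through to next line
```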


[BTB diagram: 128 sets × 4 ways; each entry holds a tag, a 32-bit target, an offset within the line, a local history, and valid/type/prediction bits; per-set 2-bit counters, prediction = MSB of counter; branch type encoded as 00 – cond, 01 – ret, 10 – call, 11 – uncond; a Return Stack Buffer sits alongside]
The P6 BTB

  • 2-level, local histories, per-set counters

  • 4-way set associative: 512 entries in 128 sets

  • Up to 4 branches can have a tag match


[Decoder diagram: four decoders D0–D3; D0 produces up to 4 uops, D1–D3 produce 1 uop each]

In-Order Front End: Decoder

16 instruction bytes from IFU → Instruction Length Decode (determine where each IA instruction starts) → IQ (buffer instructions)

  • ID: converts instructions into uops

    • If an instruction aligned with decoder 1/2/3 decodes into more than 1 uop, it is deferred to the next cycle

  • IDQ: buffers uops

    • Smooths the decoder’s variable throughput


Micro Operations (Uops)

  • Each CISC inst is broken into one or more RISC uops

    • Simplicity:

      • Each uop is (relatively) simple

      • Canonical representation of src/dest (2 src, 1 dest)

    • Increased ILP

      • e.g., pop eax becomes: esp1 ← esp0 + 4, eax1 ← [esp0]

  • Simple instructions translate to a few uops

    • Typical uop count (it is not necessarily cycle count!):

      • Reg-Reg ALU/Mov inst: 1 uop

      • Mem-Reg Mov (load): 1 uop

      • Mem-Reg ALU (load + op): 2 uops

      • Reg-Mem Mov (store): 2 uops (st addr, st data)

      • Reg-Mem ALU (ld + op + st): 4 uops

  • Complex instructions need ucode
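The uop counts above amount to a small template table. A sketch, where the instruction-class and uop names are illustrative, not Intel's encoding:

```python
# Each x86 instruction class expands to a fixed RISC-uop template
# (each uop: up to 2 sources, 1 destination).
UOP_TEMPLATES = {
    'reg_reg_alu': ['alu'],                      # 1 uop
    'load':        ['ld'],                       # 1 uop
    'mem_reg_alu': ['ld', 'alu'],                # 2 uops: load + op
    'store':       ['sta', 'std'],               # 2 uops: store addr + data
    'reg_mem_alu': ['ld', 'alu', 'sta', 'std'],  # 4 uops: ld + op + st
}

def decode(inst_class):
    """Return the uop sequence for one instruction of the given class."""
    return list(UOP_TEMPLATES[inst_class])
```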



Out-of-order Core

  • Reorder Buffer (ROB):

    • Holds “not yet retired” instructions

    • 40 entries, in-order

  • Reservation Stations (RS):

    • Holds “not yet executed” instructions

    • 20 entries

  • Execution Units

    • IEU: Integer Execution Unit

    • FEU: Floating-point Execution Unit

  • Memory related units

    • AGU: Address Generation Unit

    • MIU: Memory Interface Unit

    • DCU: Data Cache Unit

    • MOB: Memory Order Buffer – orders memory operations

    • L2: Level 2 cache


Alloc & RAT

  • Perform register allocation and renaming for ≤4 uops/cyc

  • The Register Alias Table (RAT)

    • Maps architectural registers into physical registers

      • For each arch reg, holds the number of the latest phy reg that updates it

    • When a new uop that writes to an arch reg R is allocated

      • Record phyreg allocated to the uop as the latest reg that updates R

  • The Allocator (Alloc)

    • Assigns each uop an entry number in the ROB / RS

    • For each one of the sources (architectural registers) of the uop

      • Lookup the RAT to find out the latest phyreg updating it

      • Write it into the RS entry

    • Allocate Load & Store buffers in the MOB
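The lookup-then-overwrite order described above can be sketched as follows; the RRF/ROB tagging is an illustrative simplification.

```python
class RAT:
    """Maps each arch reg to the RRF or to the ROB entry of its latest writer."""
    def __init__(self, arch_regs):
        # initially every arch reg reads its committed value from the RRF
        self.map = {r: ('RRF', r) for r in arch_regs}
        self.next_rob = 0

    def rename(self, dst, srcs):
        renamed_srcs = [self.map[s] for s in srcs]   # look up sources first
        rob_entry = self.next_rob                    # phys reg = ROB entry no.
        self.next_rob += 1
        self.map[dst] = ('ROB', rob_entry)           # dst is now latest writer
        return rob_entry, renamed_srcs
```

Renaming back-to-back writers of EAX shows the key property: the second uop's EAX source reads the first uop's ROB entry, not the RRF.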


Re-order Buffer (ROB)

  • Holds 40 uops that are not yet committed

    • In the same order as in the program

  • Provide a large physical register space for register renaming

      • One physical register per ROB entry

      • physical register number = entry number

      • Each uop has only one destination

      • Buffer the execution results until retirement

    • Valid data is set after uop executed and result written to physical reg


RRF – Real Register File

  • Holds the Architectural Register File

    • Architectural Register are numbered: 0 – EAX, 1 – EBX, …

  • The value of an architectural register

    • is the value written to it by the last instruction committed which writes to this register

RRF:


Uop flow through the ROB

  • Uops are entered in order

    • Registers renamed by the entry number

  • Once assigned: execution order unimportant

  • After execution:

    • Entries marked “executed” and wait for retirement

    • An executed entry can be “retired” once all prior instructions have retired

    • Commit architectural state only after speculation (branch, exception) has resolved

  • Retirement

    • Detect exceptions and mispredictions

      • Initiate repair to get machine back on right track

    • Update “real registers” with value of renamed registers

    • Update memory

    • Leave the ROB
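The in-order retire walk above, sketched with a 3-wide retirement limit as in the slides; the dict-based ROB entries are illustrative.

```python
def retire(rob, rrf, width=3):
    """rob: in-order list of {'dst','value','executed'}; rrf: committed regs."""
    retired = 0
    while rob and retired < width and rob[0]['executed']:
        uop = rob.pop(0)                 # head of ROB = oldest uop
        rrf[uop['dst']] = uop['value']   # commit renamed value to the RRF
        retired += 1
    return retired
```

An unexecuted entry at the head blocks all younger entries, even executed ones.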


Reservation station (RS)

  • Pool of all “not yet executed” uops

    • Holds the uop attributes and the uop source data until it is dispatched

  • When a uop is allocated in RS, operand values are updated

    • If operand is from an architectural register, value is taken from the RRF

    • If operand is from a phy reg, with data valid set, value taken from ROB

    • If operand is from a phy reg, with data valid not set, wait for value

  • The RS maintains operand status: “ready/not-ready”

    • Each cycle, executed uops make more operands “ready”

      • The RS arbitrates the WB busses between the units

      • The RS monitors the WB bus to capture data needed by awaiting uops

      • Data can be bypassed directly from WB bus to execution unit

    • Uops whose operands are all ready can be dispatched for execution

      • Dispatcher chooses which of the ready uops to execute next

      • Dispatches chosen uops to functional units
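A sketch of RS allocation, write-back snooping, and dispatch. The tags standing in for physical registers and the dict layout are assumptions of this model.

```python
class RS:
    """Reservation station: uops wait here until all operands arrive."""
    def __init__(self):
        self.entries = []   # each: {'uop': name, 'srcs': {tag: value or None}}

    def allocate(self, uop, srcs):
        self.entries.append({'uop': uop, 'srcs': dict(srcs)})

    def writeback(self, tag, value):
        for e in self.entries:              # snoop the WB bus
            if e['srcs'].get(tag, 0) is None:
                e['srcs'][tag] = value      # capture data for a waiting uop

    def dispatch(self):
        ready = [e for e in self.entries
                 if all(v is not None for v in e['srcs'].values())]
        for e in ready:
            self.entries.remove(e)          # reclaim the RS entry
        return [e['uop'] for e in ready]
```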


Register Renaming example

IDQ: Add EAX, EBX, EAX

After RAT / Alloc: Add ROB37, ROB19, RRF0

(dst EAX gets the new entry ROB37; src EBX is renamed to ROB19; src EAX is read from the RRF)


Register Renaming example (2)

IDQ: sub EAX, ECX, EAX

After RAT / Alloc: sub ROB38, ROB23, ROB37

(dst EAX gets ROB38; src EAX now reads ROB37, the previous Add’s result)


Out-of-order Core: Execution Units

  • Internal 0-delay bypass within each EU

  • 1st bypass in MIU, 2nd bypass in RS

[Execution-port diagram: Port 0 – IEU, FAU, FMU, FDIV, IDIV, SHF; Port 1 – IEU, JEU; Port 2 – AGU (load address); Ports 3,4 – AGU (store address), SDB; RS ↔ MIU ↔ DCU]



In-Order Retire

  • ROB:

    • Retires up to 3 uops per clock

    • Copies the values to the RRF

    • Retirement is done In-order

    • Performs exception checking


In-order Retirement

  • The process of committing the results to the architectural state of the processor

  • Retires up to 3 uops per clock

  • Copies the values to the RRF

  • Retirement is done In Order

  • Performs exception checking

  • An instruction is retired after the following checks

    • Instruction has executed

    • All previous instructions have retired

    • Instruction isn’t mis-predicted

    • no exceptions


[Pipeline diagram: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire; queues: IQ, IDQ, RS, ROB]

Pipeline: Fetch

  • Fetch 16B from I$

  • Length-decode instructions within 16B

  • Write instructions into IQ


Pipeline: Decode


  • Read 4 instructions from IQ

  • Translate instructions into uops

    • Asymmetric decoders (4-1-1-1)

  • Write resulting uops into IDQ



Pipeline: Allocate


  • Allocate, port bind and rename 4 uops

  • Allocate ROB/RS entry per uop

    • If source data is available from ROB or RRF, write data to RS

    • Otherwise, mark data not ready in RS




Pipeline: EXE

  • Ready/Schedule

    • Check for data-ready uops whose needed functional unit is available

    • Select and dispatch ≤6 ready uops/clock to EXE

    • Reclaim RS entries

  • Write back results into RS/ROB

    • Write results into result buffer

    • Snoop write-back ports for results that are sources to uops in RS

    • Update data-ready status of these uops in the RS


Pipeline: Retire


  • Retire ≤4 oldest uops in ROB

    • Uop may retire if

      • its ready bit is set

      • it does not cause an exception

      • all preceding candidates are eligible for retirement

    • Commit results from result buffer to RRF

    • Reclaim ROB entry

  • In case of exception

    • Nuke and restart



Jump Misprediction – Flush at Retire

  • Flush the pipe when the mispredicted jump retires

    • Retirement is in-order, so all the instructions preceding the jump have already retired

    • All the instructions remaining in the pipe follow the jump

    • Therefore, flush all the instructions in the pipe

  • Disadvantage

    • The misprediction is known only after the jump executes

    • Continue fetching instructions from the wrong path until the branch retires


Jump Misprediction – Flush at Execute

  • When the JEU detects a jump misprediction, it:

    • Flush the in-order front-end

    • Instructions already in the OOO part continue to execute

      • Including instructions following the wrong jump, which take execution resource, and waste power, but will never be committed

    • Start fetching and decoding from the “correct” path

      • The “correct” path may still be wrong

        • A preceding uop that hasn’t executed may cause an exception

        • A preceding jump executed OOO can also mispredict

    • The “correct” instruction stream is stalled at the RAT

      • The RAT was wrongly updated by wrong-path instructions too

  • When the mispredicted branch retires

    • Resets all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.)

      • Only instructions following the jump are left – they must all be flushed

      • Reset the RAT to point only to architectural registers

    • Un-stalls the in-order machine

    • RS gets uops from RAT and starts scheduling and dispatching them


Pipeline: Branch gets to EXE

[Pipeline diagram: the mispredicted branch reaches the JEU while younger uops occupy the IQ, IDQ, RS, and ROB behind it]



Pipeline: Mispredicted Branch EXE

  • Flush front-end and re-steer it to correct path

  • RAT state already updated by wrong path

    • Block further allocation

  • Block younger branches from clearing

  • Update BPU



Pipeline: Mispredicted Branch Retires


  • When mispredicted branch retires

    • Flush OOO

    • Allow allocation of uops from correct path



Instant Reclamation

  • Allow a faster recovery after jump misprediction

    • Allow execution/allocation of uops from correct path before mispredicted jump retires

  • Every few cycles take a checkpoint of the RAT

  • In case of misprediction

    • Flush the frontend and re-steer it to the correct path

    • Recover RAT to latest checkpoint taken prior to misprediction

    • Recover RAT to exact state at misprediction

      • Rename 4 uops/cycle from checkpoint and until branch

    • Flush all uops younger than the branch in the OOO
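The checkpoint-and-replay recovery above can be sketched as follows. The checkpoint interval, sequence numbers, and rename-log format are assumptions of this model, not the P6's actual structures.

```python
def recover_rat(checkpoints, rename_log, branch_seq):
    """checkpoints: [(seq, rat_snapshot)]; rename_log: [(seq, dst, mapping)].

    Restore the newest checkpoint at or before the mispredicted branch,
    then replay the renames between the checkpoint and the branch.
    """
    seq, snapshot = max((c for c in checkpoints if c[0] <= branch_seq),
                        key=lambda c: c[0])
    rat = dict(snapshot)                     # restore the checkpoint
    for s, dst, mapping in rename_log:       # re-rename up to the branch
        if seq < s <= branch_seq:
            rat[dst] = mapping               # younger (wrong-path) renames skipped
    return rat
```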



Instant Reclamation: Mispredicted Branch EXE

  • JEClear raised on mispredicted macro-branches



Instant Reclamation: Mispredicted Branch EXE

  • JEClear raised on mispredicted macro-branches

    • Flush frontend and re-steer it to the correct path

    • Flush all younger uops in OOO

    • Update BPU

    • Block further allocation



Pipeline: Instant Reclamation: EXE

  • Restore RAT from latest check-point before branch

  • Recover RAT to its state just after the branch

    • Before any instruction on the wrong path

  • Meanwhile front-end starts fetching and decoding instructions from the correct path



Pipeline: Instant Reclamation: EXE

  • Once done restoring the RAT

    • allow allocation of uops from correct path



Large ROB and RS are Important

  • Large RS

    • Increases the window in which we look for independent instructions

      • Exposes more parallelism potential

  • Large ROB

    • The ROB is a superset of the RS ⇒ ROB size ≥ RS size

    • Allows covering long-latency operations (cache miss, divide)

  • Example

    • Assume there is a Load that misses the L1 cache

      • Data takes ~10 cycles to return ⇒ ~30 new instructions get into the pipeline

    • Instructions following the Load cannot commit ⇒ they pile up in the ROB

    • Instructions independent of the load are executed, and leave the RS

      • As long as the ROB is not full, we can keep executing instructions

    • A 40 entry ROB can cover for an L1 cache miss

      • Cannot cover for an L2 cache miss, which is hundreds of cycles


OOO Execution of Memory Operations


P6 Caches

  • Blocking caches severely hurt OOO

    • A cache miss prevents other cache requests (which could be hits) from being served

    • This hurts one of the main gains of OOO – hiding cache misses

  • Both L1 and L2 cache in the P6 are non-blocking

    • Initiate the actions necessary to return the missing data, while continuing to respond to subsequent cached data requests

    • Support up to 4 outstanding misses

      • Misses translate into outstanding requests on the P6 bus

      • The bus can support up to 8 outstanding requests

  • Squash subsequent requests for the same missed cache line

    • Squashed requests not counted in number of outstanding requests

    • Once the engine has executed beyond the 4 outstanding requests

      • subsequent load requests are placed in the load buffer
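A sketch of the miss-handling policy above: up to 4 outstanding misses, squashing of repeated requests to an in-flight line, and parking further loads in the load buffer. The data structures are illustrative.

```python
MAX_OUTSTANDING = 4          # outstanding misses supported by the cache

def handle_request(line, cache, fill_buffers, load_buffer):
    """Classify one cache request; fill_buffers tracks in-flight miss lines."""
    if line in cache:
        return 'hit'
    if line in fill_buffers:
        return 'squashed'                 # merge with the in-flight miss
    if len(fill_buffers) < MAX_OUTSTANDING:
        fill_buffers.add(line)            # new outstanding bus request
        return 'miss'
    load_buffer.append(line)              # all miss slots busy: park the load
    return 'blocked'
```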


Memory Operations

  • Loads

    • Loads need to specify

      • memory address to be accessed

      • width of data being retrieved

      • the destination register

    • Loads are encoded into a single uop

  • Stores

    • Stores need to provide

      • a memory address, a data width, and the data to be written

    • Stores require two uops:

      • One to generate the address

      • One to generate the data

    • Uops are scheduled independently to maximize concurrency

      • Can calculate the address before the data to be stored is known

      • must re-combine in the store buffer for the store to complete


The Memory Execution Unit

  • As a black box, the MEU is just another execution unit

    • Receives memory reads and writes from the RS

    • Returns data back to the RS and to the ROB

  • Unlike many execution units, the MEU has side effects

    • May cause a bus request to external memory

  • The RS dispatches memory uops

    • When all the data used for address calculation is ready, and

    • Both the MOB and the Address Generation Unit (AGU) are free

    • The AGU computes the linear address:

      Segment-Base + Base-Address + (Scale*Index) + Displacement

      • Sends linear address to MOB, where it is stored with the uop in the appropriate Load Buffer or Store Buffer entry
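The AGU's linear-address formula above, as a one-line sketch (operand values in the test are arbitrary examples):

```python
def linear_address(seg_base, base, index, scale, disp):
    # Segment-Base + Base-Address + (Scale * Index) + Displacement
    return seg_base + base + scale * index + disp
```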


The Memory Execution Unit (cont.)

  • The RS operates based on register dependencies

    • Cannot detect memory dependencies, even as simple as:

      i1: movl %ebx, -4(%ebp) # MEM[ebp-4] ← ebx (store)

      i2: movl -4(%ebp), %eax # eax ← MEM[ebp-4] (load)

    • The RS may dispatch these operations in any order

  • MEU is in-charge of memory ordering

    • If MEU blindly executes above operations, eax may get wrong value

  • Could require memory operations to dispatch non-speculatively

    • x86 has a small register set ⇒ uses memory often

    • Preventing Stores from passing Stores/Loads: 3%~5% perf. loss

      • P6 chooses not to allow Stores to pass Stores/Loads

    • Preventing Loads from passing Loads/Stores: big performance loss

      • Need to allow loads to pass stores, and loads to pass loads

  • MEU allows speculative execution as much as possible


OOO Execution of Memory Operations

  • Stores are not executed OOO

    • Stores are never performed speculatively

      • there is no transparent way to undo them

    • Stores are also never re-ordered among themselves

      • The Store Buffer dispatches a store only when

        • the store has both its address and its data, and

        • there are no older stores awaiting dispatch

  • Resolving memory dependencies

    • Some memory dependencies can be resolved statically

      • store r1, a

      • load r2, b

      ⇒ can advance the load before the store

    • Problem: some cannot

      • store r1, [r3]

      • load r2, b

      ⇒ the load must wait till r3 is known


Memory Order Buffer (MOB)

  • Every memory uop is allocated an entry in order

    • De-allocated when retires

  • Address & data (for stores), are updated when known

  • Load is checked against all previous stores

    • Waits if a store to the same address exists, but its data is not ready

      • If store data exists, just use it: Store → Load forwarding

    • Waits till all previous store addresses are resolved

      • In case of no address collision – go to memory

  • Memory Disambiguation

    • MOB predicts if a load can proceed despite unknown STAs

      • Predict colliding ⇒ block the Load if there is an unknown STA (as usual)

      • Predict non-colliding ⇒ execute even if there are unknown STAs

    • In case of wrong prediction

      • The entire pipeline is flushed when the load retires
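The load check above, sketched conservatively: this version always waits for unresolved store addresses, whereas the real MOB may predict past them as just described. The data layout is illustrative.

```python
def check_load(addr, older_stores):
    """older_stores: oldest-first list of {'addr': int|None, 'data': int|None}."""
    for st in reversed(older_stores):        # youngest older store first
        if st['addr'] is None:
            return ('block', None)           # unresolved STA may collide: wait
        if st['addr'] == addr:
            if st['data'] is None:
                return ('block', None)       # matching store, data not ready
            return ('forward', st['data'])   # store -> load forwarding
    return ('memory', None)                  # no collision: read the cache
```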


Pipeline: Load: Allocate

[Load pipeline diagram: Alloc → Schedule → AGU → LB Write → MOB → DTLB → DCU → WB → Retire; queues: IDQ, RS, ROB, LB]

  • Allocate ROB/RS, MOB entries

  • Assign Store ID to enable ordering



Pipeline: Bypassed Load: EXE


  • RS checks when data used for address calculation is ready

  • AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp.

  • Write load into Load Buffer

  • DTLB Virtual → Physical + DCU set access

  • MOB checks blocking and forwarding

  • DCU read / Store Data Buffer read (Store → Load forwarding)

  • Write back data / write block code



Pipeline: Blocked Load Re-dispatch


  • MOB determines which loads are ready, and schedules one

  • Load arbitrates for MEU

  • DTLB Virtual → Physical + DCU set access

  • MOB checks blocking/forwarding

  • DCU way select / Store Data Buffer read

  • write back data / write block code



Pipeline: Load: Retire


  • Reclaim ROB, LB entries

  • Commit results to RRF



DCU Miss

  • If load misses in the DCU, the DCU

    • Marks the write-back data as invalid

    • Assigns a fill buffer to the load

    • Starts MLC access

    • When critical chunk is returned

      • DCU writes it back on WB bus



Pipeline: Store: Allocate


  • Allocate ROB/RS

  • Allocate Store Buffer entry




Pipeline: Store: STA EXE


  • RS checks when data used for address calculation is ready

    • dispatches STA to AGU

  • AGU calculates linear address

  • Write linear address to Store Buffer

  • DTLB Virtual → Physical

  • Load Buffer Memory Disambiguation verification

  • Write physical address to Store Buffer




Pipeline: Store: STD EXE


  • RS checks when data for STD is ready

    • dispatches STD

  • Write data to Store Buffer




Pipeline: Senior Store Retirement


  • When STA (and thus STD) retires

    • Store Buffer entry marked as senior

  • When the DCU is idle ⇒ MOB dispatches the senior store

  • Read senior entry

    • Store Buffer sends data and physical address

  • DCU writes data

  • Reclaim SB entry


