Computer Architecture: The P6 Micro-Architecture – An Example of an Out-Of-Order Micro-processor. Lihu Rappoport and Adi Yoaz
The P6 Family • Features • Out-Of-Order execution • Register renaming • Speculative execution • Multiple branch prediction • Super pipeline: 12 pipe stages
P6 Arch [Block diagram: BIU on the external bus and L2; IFU/BPU/ID/MS front end feeding the RAT; RS and ROB in the out-of-order core with IEU, FEU, AGU; MIU, DCU and MOB in the memory subsystem] • In-Order Front End • BIU: Bus Interface Unit • IFU: Instruction Fetch Unit (includes IC) • BPU: Branch Prediction Unit • ID: Instruction Decoder • MS: Micro-Instruction Sequencer • RAT: Register Alias Table • Out-of-order Core • ROB: Reorder Buffer • RRF: Real Register File • RS: Reservation Stations • IEU: Integer Execution Unit • FEU: Floating-point Execution Unit • AGU: Address Generation Unit • MIU: Memory Interface Unit • DCU: Data Cache Unit • MOB: Memory Order Buffer • L2: Level 2 cache • In-Order Retire
P6 Pipeline • In-Order Front End: 1: Next IP 2: ICache lookup 3: ILD (instruction length decode) 4: Rotate 5: ID1 6: ID2 7: RAT – rename sources, ALLOC – assign destinations 8: ROB – read sources, RS – schedule data-ready uops for dispatch • Out-of-order Core: 9: RS – dispatch uops 10: EX • In-order Retirement: 11: Retirement
In-Order Front End • BPU – Branch Prediction Unit – predicts the next fetch address • IFU – Instruction Fetch Unit • iTLB translates virtual to physical addresses (accesses the PMH on a miss) • IC supplies 16 bytes/cycle (accesses the L2 cache on a miss) • ILD – Instruction Length Decode – splits bytes into instructions • IQ – Instruction Queue – buffers the instructions • ID – Instruction Decode – decodes instructions into uops • MS – Micro-Sequencer – provides uops for complex instructions
Branch Prediction • Implementation • Use local history to predict direction • Need to predict multiple branches • Need to predict branches before previous branches are resolved • Branch history is updated first based on the prediction, later corrected based on actual execution (speculative history) • Target address taken from the BTB • Prediction rate: ~92% • ~60 instructions between mispredictions • A high prediction rate is crucial for long pipelines • Especially important for OOOE, speculative execution: • On a misprediction, all instructions following the branch in the instruction window are flushed • Effective size of the window is determined by prediction accuracy • RSB used for Call/Return pairs
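To make the two-level local-history scheme concrete, here is a minimal Python sketch. The class, method names, 4-bit history and 2-bit counters are illustrative and only loosely mirror the BTB described on the next slide: the direction comes from the MSB of a counter selected by the branch's local history, the history is updated speculatively at prediction time, and the counter is trained when the branch resolves.

```python
# Minimal sketch of a 2-level predictor with per-branch local histories and 2-bit
# saturating counters (illustrative; not the exact P6 BTB organization or sizes).

class LocalPredictor:
    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.history = {}    # branch IP -> local history bits
        self.counters = {}   # (branch IP, history pattern) -> 2-bit saturating counter

    def predict(self, ip):
        h = self.history.get(ip, 0)
        ctr = self.counters.get((ip, h), 2)     # default: weakly taken
        return ctr >= 2, h                      # prediction = MSB of the counter

    def update_history(self, ip, predicted_taken):
        # Speculative update: shift the prediction into the history right away
        mask = (1 << self.hist_bits) - 1
        h = self.history.get(ip, 0)
        self.history[ip] = ((h << 1) | int(predicted_taken)) & mask

    def resolve(self, ip, hist_at_predict, taken):
        # Actual outcome known at execution: train the counter that made the prediction
        ctr = self.counters.get((ip, hist_at_predict), 2)
        self.counters[(ip, hist_at_predict)] = min(3, ctr + 1) if taken else max(0, ctr - 1)

bp = LocalPredictor()
pred, h = bp.predict(0x401000)   # predict the branch at this (made-up) IP
bp.update_history(0x401000, pred)
bp.resolve(0x401000, h, taken=True)
```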
Branch Prediction – Clustering [Figure: a fetch line containing several jmp instructions, each marked predict taken / predict not taken; a taken jmp either jumps out of the line or into the next fetch line] • Need to provide predictions for the entire fetch line each cycle • Predict the first taken branch in the line, following the fetch IP • Implemented by • Splitting the IP into offset within line, set, and tag • If the tag of more than one way matches the fetch IP • The offsets of the matching ways are ordered • Ways with offset smaller than the fetch IP offset are discarded • The first branch that is predicted taken is chosen as the predicted branch
The P6 BTB • 2-level, local histories, per-set counters • 4-way set associative: 512 entries in 128 sets • Up to 4 branches can have a tag match [Diagram: each BTB entry holds a valid bit, tag, target, offset within the line, local history, prediction bit and a 2-bit branch type (00 – cond, 01 – ret, 10 – call, 11 – uncond); prediction = MSB of the per-set counter selected by the history; returns use the Return Stack Buffer]
In-Order Front End: Decoder • 16 instruction bytes from the IFU • Instruction Length Decode determines where each IA instruction starts • IQ buffers the instructions • Four decoders convert instructions into uops (D0: up to 4 uops; D1/D2/D3: 1 uop each) • If an instruction aligned with D1/D2/D3 decodes into more than 1 uop, it is deferred to the next cycle • IDQ buffers the uops and smooths the decoder's variable throughput
Micro Operations (Uops) • Each CISC inst is broken into one or more RISC uops • Simplicity: • Each uop is (relatively) simple • Canonical representation of src/dest (2 src, 1 dest) • Increased ILP • e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0] • Simple instructions translate to a few uops • Typical uop count (it is not necessarily cycle count!): Reg-Reg ALU/Mov inst: 1 uop • Mem-Reg Mov (load): 1 uop • Mem-Reg ALU (load + op): 2 uops • Reg-Mem Mov (store): 2 uops (st addr, st data) • Reg-Mem ALU (ld + op + st): 4 uops • Complex instructions need ucode
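The hypothetical table below just restates the uop counts above as data; only the counts come from the slide, while the Intel-style instruction forms and the uop mnemonics (alu/load/sta/std) are made up for readability.

```python
# Illustrative decomposition of x86 instruction forms into uops (counts per the table above;
# the uop mnemonics and operand names are invented for this sketch).
UOP_BREAKDOWN = {
    "add ebx, eax":   ["alu  ebx <- ebx + eax"],                      # reg-reg ALU: 1 uop
    "mov eax, [mem]": ["load eax <- [mem]"],                          # load: 1 uop
    "add eax, [mem]": ["load tmp <- [mem]", "alu eax <- eax + tmp"],  # load + op: 2 uops
    "mov [mem], eax": ["sta  [mem]", "std  eax"],                     # store: address + data uops
    "add [mem], eax": ["load tmp <- [mem]", "alu tmp <- tmp + eax",
                       "sta  [mem]", "std  tmp"],                     # ld + op + st: 4 uops
    "pop eax":        ["alu  esp1 <- esp0 + 4", "load eax1 <- [esp0]"],
}

for inst, uops in UOP_BREAKDOWN.items():
    print(f"{inst:16s} -> {len(uops)} uop(s): {uops}")
```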
Out-of-order Core • Reorder Buffer (ROB): • Holds "not yet retired" instructions • 40 entries, in-order • Reservation Stations (RS): • Holds "not yet executed" instructions • 20 entries • Execution Units • IEU: Integer Execution Unit • FEU: Floating-point Execution Unit • Memory related units • AGU: Address Generation Unit • MIU: Memory Interface Unit • DCU: Data Cache Unit • MOB: Memory Order Buffer – orders memory operations • L2: Level 2 cache
Alloc & RAT • Perform register allocation and renaming for ≤4 uops/cyc • The Register Alias Table (RAT) • Maps architectural registers into physical registers • For each arch reg, holds the number of the latest phy reg that updates it • When a new uop that writes to an arch reg R is allocated • Record the phy reg allocated to the uop as the latest reg that updates R • The Allocator (Alloc) • Assigns each uop an entry number in the ROB / RS • For each of the sources (architectural registers) of the uop • Looks up the RAT to find the latest phy reg updating it • Writes it into the RS entry • Allocates Load & Store buffers in the MOB
Re-order Buffer (ROB) • Holds 40 uops which are not yet committed • In the same order as in the program • Provides a large physical register space for register renaming • One physical register per ROB entry • Physical register number = entry number • Each uop has only one destination • Buffers the execution results until retirement • Valid data is set after the uop has executed and its result is written to the physical reg
RRF – Real Register File • Holds the Architectural Register File • Architectural registers are numbered: 0 – EAX, 1 – EBX, … • The value of an architectural register is the value written to it by the last committed instruction that writes to this register
Uop flow through the ROB • Uops are entered in order • Registers renamed by the entry number • Once assigned: execution order unimportant • After execution: • Entries marked "executed" and wait for retirement • An executed entry can be "retired" once all prior instructions have retired • Commit architectural state only after speculation (branch, exception) has resolved • Retirement • Detect exceptions and mispredictions • Initiate repair to get the machine back on the right track • Update "real registers" with the values of renamed registers • Update memory • Leave the ROB
Reservation Station (RS) • Pool of all "not yet executed" uops • Holds the uop attributes and the uop source data until it is dispatched • When a uop is allocated in the RS, operand values are updated • If an operand is from an architectural register, the value is taken from the RRF • If an operand is from a phy reg with data valid set, the value is taken from the ROB • If an operand is from a phy reg with data valid not set, wait for the value • The RS maintains operand status "ready/not-ready" • Each cycle, executed uops make more operands "ready" • The RS arbitrates the WB buses between the units • The RS monitors the WB bus to capture data needed by waiting uops • Data can be bypassed directly from the WB bus to the execution unit • Uops whose operands are all ready can be dispatched for execution • The dispatcher chooses which of the ready uops to execute next • Dispatches the chosen uops to the functional units
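A minimal Python sketch of the allocate / wakeup / dispatch flow just described. The 20-entry capacity and ≤6 dispatched uops per clock come from the slides; the data structures, the oldest-first selection policy and the field names are illustrative assumptions, not the actual P6 arbitration logic.

```python
# Sketch of RS allocate / wakeup (WB-bus snoop) / dispatch.
from dataclasses import dataclass

@dataclass
class Source:
    tag: int | None = None     # ROB entry that will produce the value; None if already known
    value: int | None = None   # filled at allocation (RRF/ROB) or captured from the WB bus

@dataclass
class RSEntry:
    rob_id: int
    srcs: tuple                # up to two Source operands

    def ready(self):
        return all(s.value is not None for s in self.srcs)

class ReservationStation:
    def __init__(self, size=20):
        self.size, self.entries = size, []

    def allocate(self, entry):
        assert len(self.entries) < self.size, "RS full: allocation stalls"
        self.entries.append(entry)

    def capture(self, wb_tag, value):
        # Snoop the write-back bus: complete operands that were waiting on this tag
        for e in self.entries:
            for s in e.srcs:
                if s.tag == wb_tag and s.value is None:
                    s.value = value

    def dispatch(self, width=6):
        # Choose up to `width` data-ready uops (oldest first here) and reclaim their entries
        chosen = [e for e in self.entries if e.ready()][:width]
        for e in chosen:
            self.entries.remove(e)
        return chosen
```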
Register Renaming example • IDQ: add EAX, EBX, EAX • RAT / Alloc: add ROB37, ROB19, RRF0
Register Renaming example (2) • IDQ: sub EAX, ECX, EAX • RAT / Alloc: sub ROB38, ROB23, ROB37
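A tiny sketch that replays the two renaming steps above. The starting RAT contents (EAX still in RRF0, EBX mapped to ROB19, ECX to ROB23) and the next free ROB entry (37) are taken from the example; the uop format is dest, src1, src2 as on the slides, and everything else is illustrative.

```python
# Sketch of RAT-based renaming, reproducing the two example uops above.
rat = {"EAX": "RRF0", "EBX": "ROB19", "ECX": "ROB23"}   # arch reg -> latest location
next_rob = 37                                           # next free ROB entry (from the example)

def rename(dst, src1, src2):
    global next_rob
    renamed = (f"ROB{next_rob}", rat[src1], rat[src2])  # look up sources, allocate a new dest
    rat[dst] = f"ROB{next_rob}"                         # dest arch reg now maps to the new entry
    next_rob += 1
    return renamed

print("add", rename("EAX", "EBX", "EAX"))   # -> ('ROB37', 'ROB19', 'RRF0')
print("sub", rename("EAX", "ECX", "EAX"))   # -> ('ROB38', 'ROB23', 'ROB37')
```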
Out-of-order Core: Execution Units • Port 0: IEU, SHF, FMU, FAU, FDIV, IDIV • Port 1: IEU, JEU • Port 2: AGU (load address) • Ports 3,4: AGU (store address), SDB • Bypasses: internal 0-delay bypass within each EU, 1st bypass in the MIU, 2nd bypass in the RS
In-Order Retire • ROB: • Retires up to 3 uops per clock • Copies the values to the RRF • Retirement is done in-order • Performs exception checking
In-order Retirement • The process of committing the results to the architectural state of the processor • Retires up to 3 uops per clock • Copies the values to the RRF • Retirement is done in order • Performs exception checking • An instruction is retired after the following checks • The instruction has executed • All previous instructions have retired • The instruction isn't mis-predicted • No exceptions
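A minimal sketch of these retirement checks, assuming a simple ROB entry record with illustrative field names: up to three of the oldest entries commit per call, an exception at the head triggers a nuke, and a mispredicted branch flushes everything younger once it retires.

```python
from dataclasses import dataclass

@dataclass
class ROBEntry:
    dest_arch_reg: str | None   # architectural register written (None for stores/branches)
    value: int | None = None
    executed: bool = False      # result has been written back
    exception: bool = False
    mispredicted: bool = False  # mispredicted branch

def retire(rob, rrf, width=3):
    """Retire up to `width` oldest ROB entries, committing values to the RRF in order."""
    retired = 0
    while rob and retired < width:
        head = rob[0]
        if not head.executed:
            break                                  # oldest uop not done: nothing younger may retire
        if head.exception:
            rob.clear()                            # nuke and restart at the handler
            break
        if head.dest_arch_reg is not None:
            rrf[head.dest_arch_reg] = head.value   # commit architectural state
        rob.pop(0)
        retired += 1
        if head.mispredicted:
            rob.clear()                            # flush everything younger than the bad branch
            break
    return retired
```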
Pipeline: Fetch • Fetch 16B from the I$ • Length-decode instructions within the 16B • Write instructions into the IQ
Pipeline: Decode • Read 4 instructions from the IQ • Translate instructions into uops • Asymmetric decoders (4-1-1-1) • Write the resulting uops into the IDQ
Pipeline: Allocate • Allocate, port-bind and rename 4 uops • Allocate a ROB/RS entry per uop • If source data is available from the ROB or RRF, write the data to the RS • Otherwise, mark the data not ready in the RS
Pipeline: EXE • Ready/Schedule • Check for data-ready uops whose needed functional unit is available • Select and dispatch ≤6 ready uops/clock to EXE • Reclaim RS entries • Write back results into RS/ROB • Write results into the result buffer • Snoop write-back ports for results that are sources to uops in the RS • Update the data-ready status of these uops in the RS
Pipeline: Retire • Retire the ≤4 oldest uops in the ROB • A uop may retire if • its ready bit is set • it does not cause an exception • all preceding candidates are eligible for retirement • Commit results from the result buffer to the RRF • Reclaim the ROB entry • In case of exception • Nuke and restart
Jump Misprediction – Flush at Retire • Flush the pipe when the mispredicted jump retires • Retirement is in-order • All the instructions preceding the jump have already retired • All the instructions remaining in the pipe follow the jump • Flush all the instructions in the pipe • Disadvantage • The misprediction is known once the jump executes, but the flush waits until it retires • The machine continues fetching instructions from the wrong path until the branch retires
Jump Misprediction – Flush at Execute • When the JEU detects a jump misprediction it • Flushes the in-order front-end • Instructions already in the OOO part continue to execute • Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed • Starts fetching and decoding from the "correct" path • The "correct" path may still be wrong • A preceding uop that hasn't executed may cause an exception • A preceding jump executed OOO can also mispredict • The "correct" instruction stream is stalled at the RAT • The RAT was wrongly updated by wrong-path instructions • When the mispredicted branch retires • Reset all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.) • Only instructions following the jump are left – they must all be flushed • Reset the RAT to point only to architectural registers • Un-stall the in-order machine • The RS gets uops from the RAT and starts scheduling and dispatching them
Pipeline: Branch gets to EXE
Pipeline: Mispredicted Branch EXE • Flush front-end and re-steer it to correct path • RAT state already updated by wrong path • Block further allocation • Block younger branches from clearing • Update BPU
Pipeline: Mispredicted Branch Retires • When mispredicted branch retires • Flush OOO • Allow allocation of uops from correct path
Instant Reclamation • Allows a faster recovery after a jump misprediction • Allows execution/allocation of uops from the correct path before the mispredicted jump retires • Every few cycles take a checkpoint of the RAT • In case of misprediction • Flush the front-end and re-steer it to the correct path • Recover the RAT to the latest checkpoint taken prior to the misprediction • Then recover the RAT to its exact state at the misprediction: re-rename 4 uops/cycle from the checkpoint up to the branch • Flush all uops younger than the branch in the OOO
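A rough sketch of the checkpoint-and-replay idea described above. The checkpoint interval, the rename log and all names are illustrative assumptions (the real mechanism re-renames from the front end at 4 uops/cycle rather than keeping a software log).

```python
import copy

class CheckpointedRAT:
    """Sketch: periodic RAT checkpoints plus replay of renames up to the mispredicted branch."""
    def __init__(self, rat, interval=4):
        self.rat = rat                 # live RAT: arch reg -> physical location
        self.interval = interval       # take a checkpoint every few allocated uops
        self.checkpoints = []          # list of (uop_index, RAT snapshot taken before that uop)
        self.rename_log = []           # (uop_index, dest arch reg, new phys reg)

    def rename(self, idx, dst, phys):
        if idx % self.interval == 0:
            self.checkpoints.append((idx, copy.deepcopy(self.rat)))
        self.rename_log.append((idx, dst, phys))
        self.rat[dst] = phys

    def recover(self, branch_idx):
        # Restore the latest checkpoint taken at or before the mispredicted branch...
        ck_idx, snapshot = max((c for c in self.checkpoints if c[0] <= branch_idx),
                               key=lambda c: c[0])
        self.rat = copy.deepcopy(snapshot)
        # ...then replay the renames (a few per cycle in hardware) up to and including the branch,
        # leaving the RAT exactly as it was just after the branch was renamed.
        for idx, dst, phys in self.rename_log:
            if ck_idx <= idx <= branch_idx:
                self.rat[dst] = phys
```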
Instant Reclamation: Mispredicted Branch EXE • JEClear raised on mispredicted macro-branches
Instant Reclamation: Mispredicted Branch EXE • JEClear raised on mispredicted macro-branches • Flush frontend and re-steer it to the correct path • Flush all younger uops in OOO • Update BPU • Block further allocation
Pipeline: Instant Reclamation: EXE • Restore the RAT from the latest checkpoint before the branch • Recover the RAT to its state just after the branch • Before any instruction on the wrong path • Meanwhile the front-end starts fetching and decoding instructions from the correct path
Pipeline: Instant Reclamation: EXE • Once done restoring the RAT • Allow allocation of uops from the correct path
Large ROB and RS are Important • Large RS • Increases the window in which we look for independent instructions • Exposes more parallelism potential • Large ROB • The ROB is a superset of the RS ⇒ ROB size ≥ RS size • Allows covering long-latency operations (cache miss, divide) • Example (see the sketch below) • Assume there is a Load that misses the L1 cache • Data takes ~10 cycles to return ⇒ ~30 new instructions get into the pipeline • Instructions following the Load cannot commit ⇒ they pile up in the ROB • Instructions independent of the Load are executed and leave the RS • As long as the ROB is not full, we can keep executing instructions • A 40-entry ROB can cover an L1 cache miss • It cannot cover an L2 cache miss, which takes hundreds of cycles
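A back-of-the-envelope restatement of the example; the ~3 uops/cycle fill rate and the miss latencies are the slide's illustrative numbers, and the L2 figure is only an order of magnitude.

```python
rob_size  = 40     # P6 ROB entries
fill_rate = 3      # ~uops entering the ROB per cycle while the front end keeps running
l1_miss   = 10     # ~cycles for the data to return on an L1 miss
l2_miss   = 200    # an L2 miss costs hundreds of cycles (order of magnitude)

print("uops piled up during an L1 miss:", fill_rate * l1_miss)  # ~30 < 40   -> ROB covers it
print("uops piled up during an L2 miss:", fill_rate * l2_miss)  # ~600 >> 40 -> ROB fills, stall
```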
P6 Caches • Blocking caches severely hurt OOO • A cache miss prevents other cache requests (which could possibly be hits) from being served • This defeats one of the main gains from OOO – hiding cache misses • Both the L1 and L2 caches in the P6 are non-blocking • Initiate the actions necessary to return data for a cache miss while continuing to respond to subsequent cached data requests • Support up to 4 outstanding misses • Misses translate into outstanding requests on the P6 bus • The bus can support up to 8 outstanding requests • Squash subsequent requests for the same missed cache line • Squashed requests are not counted in the number of outstanding requests • Once the engine has executed beyond the 4 outstanding requests • subsequent load requests are placed in the load buffer
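A minimal sketch of the miss handling described above: up to 4 outstanding misses, later requests to an already-missed line squashed (merged) instead of counted, and further misses parked in the load buffer. The class, method names and 32-byte line size are illustrative assumptions.

```python
class NonBlockingCache:
    MAX_OUTSTANDING = 4

    def __init__(self):
        self.lines = set()           # cache-resident line addresses
        self.outstanding = {}        # missed line -> list of requests waiting on its fill
        self.load_buffer = []        # requests deferred once all 4 miss slots are busy

    def access(self, addr, line_size=32):
        line = addr // line_size
        if line in self.lines:
            return "hit"
        if line in self.outstanding:            # squash: merge with the in-flight miss
            self.outstanding[line].append(addr)
            return "squashed"
        if len(self.outstanding) < self.MAX_OUTSTANDING:
            self.outstanding[line] = [addr]     # allocate a miss slot, issue a bus request
            return "miss issued"
        self.load_buffer.append(addr)           # beyond 4 misses: park the load in the load buffer
        return "deferred"

    def fill(self, line):
        # Data returned from L2 / memory: install the line and wake the waiting requests
        waiters = self.outstanding.pop(line, [])
        self.lines.add(line)
        return waiters
```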
Memory Operations • Loads • Loads need to specify • memory address to be accessed • width of data being retrieved • the destination register • Loads are encoded into a single uop • Stores • Stores need to provide • a memory address, a data width, and the data to be written • Stores require two uops: • One to generate the address • One to generate the data • Uops are scheduled independently to maximize concurrency • Can calculate the address before the data to be stored is known • must re-combine in the store buffer for the store to complete
The Memory Execution Unit • As a black box, the MEU is just another execution unit • Receives memory reads and writes from the RS • Returns data back to the RS and to the ROB • Unlike many execution units, the MEU has side effects • May cause a bus request to external memory • The RS dispatches memory uops • When all the data used for address calculation is ready • And both the MOB and the Address Generation Unit (AGU) are free • The AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement • Sends the linear address to the MOB, where it is stored with the uop in the appropriate Load Buffer or Store Buffer entry
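The AGU's address formula written out as a tiny helper; the operand values in the example are made up.

```python
def linear_address(segment_base, base, index, scale, displacement):
    """Linear address = segment base + base + scale*index + displacement (scale in {1,2,4,8})."""
    assert scale in (1, 2, 4, 8)
    return (segment_base + base + scale * index + displacement) & 0xFFFFFFFF  # 32-bit wrap

# e.g. movl 8(%ebp,%esi,4), %eax with a zero DS base, ebp=0x400000, esi=0x10:
print(hex(linear_address(0x0, 0x0040_0000, 0x10, 4, 8)))   # 0x400048
```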
The Memory Execution Unit (cont.) • The RS operates based on register dependencies • It cannot detect memory dependencies, even as simple as: i1: movl %ebx, -4(%ebp) # MEM[ebp-4] ← ebx i2: movl -4(%ebp), %eax # eax ← MEM[ebp-4] • The RS may dispatch these operations in any order • The MEU is in charge of memory ordering • If the MEU blindly executes the above operations, eax may get the wrong value • Could require memory operations to dispatch non-speculatively • x86 has a small register set ⇒ uses memory often • Preventing Stores from passing Stores/Loads: 3%~5% perf. loss • P6 chooses not to allow Stores to pass Stores/Loads • Preventing Loads from passing Loads/Stores ⇒ big performance loss • Need to allow loads to pass stores, and loads to pass loads • The MEU allows speculative execution as much as possible
OOO Execution of Memory Operations • Stores are not executed OOO • Stores are never performed speculatively • There is no transparent way to undo them • Stores are also never re-ordered among themselves • The Store Buffer dispatches a store only when • the store has both its address and its data, and • there are no older stores awaiting dispatch • Resolving memory dependencies • Some memory dependencies can be resolved statically: store r1,a ; load r2,b → can advance the load before the store • Problem: some cannot: store r1,[r3] ; load r2,b → the load must wait till r3 is known
Memory Order Buffer (MOB) • Every memory uop is allocated an entry in order • De-allocated when the uop retires • Address & data (for stores) are updated when known • A load is checked against all previous stores • Waits if a store to the same address exists but its data is not ready • If the store data exists, just use it: Store → Load forwarding • Waits till all previous store addresses are resolved • In case of no address collision – go to memory • Memory Disambiguation • The MOB predicts whether a load can proceed despite unknown STAs • Predict colliding ⇒ block the load if there is an unknown STA (as usual) • Predict non-colliding ⇒ execute even if there are unknown STAs • In case of a wrong prediction • The entire pipeline is flushed when the load retires
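A simplified sketch of the load check above: scan older stores from youngest to oldest, forward when an address matches and the data is ready, and block when the matching data (or, conservatively, any older store address) is unknown. The memory-disambiguation predictor is omitted and the field names are illustrative.

```python
def check_load(load_addr, older_stores):
    """older_stores: list of dicts {'addr': int or None, 'data': int or None}, oldest first.
    Returns ('forward', data), ('block', reason) or ('memory', None)."""
    for st in reversed(older_stores):                # youngest matching older store wins
        if st['addr'] is None:
            return ('block', 'unknown store address (unless predicted non-colliding)')
        if st['addr'] == load_addr:
            if st['data'] is not None:
                return ('forward', st['data'])       # store -> load forwarding
            return ('block', 'store data not ready yet')
    return ('memory', None)                          # no collision: go to the cache

# Example: an older store to 0x1000 with data ready lets a load from 0x1000 forward.
print(check_load(0x1000, [{'addr': 0x1000, 'data': 42}]))   # ('forward', 42)
```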
Pipeline: Load: Allocate • Allocate ROB/RS, MOB entries • Assign Store ID to enable ordering
Pipeline: Bypassed Load: EXE • RS checks when the data used for address calculation is ready • AGU calculates the linear address: DS-Base + base + (Scale*Index) + Disp. • Write the load into the Load Buffer • DTLB Virtual → Physical + DCU set access • MOB checks blocking and forwarding • DCU read / Store Data Buffer read (Store → Load forwarding) • Write back data / write block code
Pipeline: Blocked Load Re-dispatch • MOB determines which loads are ready, and schedules one • The load arbitrates for the MEU • DTLB Virtual → Physical + DCU set access • MOB checks blocking/forwarding • DCU way select / Store Data Buffer read • Write back data / write block code
Pipeline: Load: Retire • Reclaim ROB, LB entries • Commit results to the RRF
DCU Miss • If load misses in the DCU, the DCU • Marks the write-back data as invalid • Assigns a fill buffer to the load • Starts MLC access • When critical chunk is returned • DCU writes it back on WB bus
Pipeline: Store: Allocate • Allocate ROB/RS entries • Allocate a Store Buffer entry