Chapter 6 The PowerPC 620

The PowerPC 620 • The 620 was the first 64-bit superscalar processor to employ: • True out-of-order execution, aggressive branch prediction, distributed multientry reservation stations, dynamic renaming for all register files, six pipelined execution units, and a completion buffer to ensure precise exceptions • An instruction-level, or machine-cycle level, performance evaluation of the 620 microarchitecture • Using a VMW-generated performance simulator of the 620

The PowerPC 620 (cont.) • Describe the IBM POWER3 and POWER4 designs • Highlight how they differ from the predecessor PowerPC 620 • They are fundamentally similar in that they aggressively extract instruction-level parallelism from sequential code • To highlight recent trends in processor implementation: increased memory bandwidth through aggressive cache hierarchies, better branch prediction, more execution resources, and deeper pipelining

Introduction • The PowerPC Architecture is the result of the PowerPC alliance among IBM, Motorola, and Apple • Based on the Performance Optimized with Enhanced RISC (POWER) Architecture • To facilitate parallel instruction execution and to scale well with advancing technology • The PowerPC alliance has released and announced a number of chips • The fourth chip was the 64-bit 620

Introduction (cont.) • Motorola and IBM have pursued independent development of general-purpose PowerPC-compatible parts • Motorola has focused on 32-bit desktop chips for Apple • PowerPC G3 and G4 are derived from the PowerPC 603, with short pipelines, limited execution resources, but very low cost • IBM has concentrated on server parts for its Unix (AIX) and business (OS/400) systems • Consider the PowerPC 620 and its heirs at the high-performance end: the POWER3 and the POWER4

Introduction (cont.) • The PowerPC Architecture has 32 general-purpose registers (GPRs) and 32 floating-point registers (FPRs) • It also has a condition register which can be addressed as one 32-bit register (CR) • Or as a register file of 8 four-bit fields (CRFs) • Or as 32 single-bit fields • The architecture has a count register (CTR) and a link register (LR) • Primarily used for branch instructions
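
The three views of the condition register can be illustrated with a few lines of Python. This is a minimal sketch, not taken from the slides: the helper names `cr_field` and `cr_bit` are hypothetical, and it relies on the PowerPC convention that bit 0 is the most significant bit, so field 0 occupies the top four bits of the 32-bit CR.

```python
def cr_field(cr: int, n: int) -> int:
    """Return 4-bit condition register field n (0..7).

    Field 0 is the most significant nibble of the 32-bit CR.
    """
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be in 0..7")
    return (cr >> (28 - 4 * n)) & 0xF


def cr_bit(cr: int, i: int) -> int:
    """Return single CR bit i (0..31), where bit 0 is the most significant bit."""
    if not 0 <= i <= 31:
        raise ValueError("CR bit index must be in 0..31")
    return (cr >> (31 - i)) & 1


cr = 0x8000_0002           # the CR viewed as one 32-bit register
first_field = cr_field(cr, 0)   # the same state viewed as a 4-bit field
last_bit = cr_bit(cr, 31)       # the same state viewed as a single bit
```

All three views read the same underlying 32 bits; only the granularity of addressing changes.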

Introduction (cont.) • Also an integer exception register (XER) and a floating-point status and control register (FPSCR) • To record the exception status of the appropriate instruction types • The PowerPC instructions are typical RISC instructions, with the addition of: • Floating-point fused multiply-add instructions • Load/store instructions with addressing modes that update the effective address • Instructions to set, manipulate, and branch off of the condition register bits

Introduction (cont.) • The 620 is a four-wide superscalar machine • Aggressive branch prediction to fetch instructions as early as possible • A dispatch policy to distribute those instructions to the execution units • The 620 uses six parallel execution units: • Two simple (single-cycle) integer units • One complex (multicycle) integer unit • One floating-point unit (three stages) • One load/store unit (two stages) • A branch unit • Distributed reservation stations and register renaming to implement out-of-order execution

Introduction (cont.) • The 620 processes instructions in five major stages: • The fetch, dispatch, execute, complete, and writeback stages • Some of these stages are separated by buffers to take up slack in the dynamic variation of available parallelism • The instruction buffer, the reservation stations, and the completion buffer • Some of the units in the execute stage are actually multistage pipelines

Fetch Stage • The fetch unit accesses the instruction cache to fetch up to four instructions per cycle into the instruction buffer • The end of a cache line or a taken branch can prevent the fetch unit from fetching four useful instructions in a cycle • A mispredicted branch can waste cycles while fetching from the wrong path • During the fetch stage, a preliminary branch prediction is made • Using the branch target address cache (BTAC) to obtain the target address for fetching in the next cycle

Instruction Buffer • The instruction buffer holds instructions between the fetch and dispatch stages • If the dispatch unit cannot keep up with the fetch unit, instructions are buffered until the dispatch unit can process them • A maximum of eight instructions can be buffered at a time • Instructions are buffered and shifted in groups of two to simplify the logic

Dispatch Stage • It decodes instructions in the instruction buffer and checks whether they can be dispatched to the reservation stations • Allocates a reservation station entry, a completion buffer entry, and an entry in the rename buffer for the destination, if needed • If all dispatch conditions are fulfilled for an instruction • Each of the six execution units can accept at most one instruction per cycle • Up to four instructions can be dispatched in program order per cycle

Dispatch Stage (cont.) • Certain infrequent serialization constraints can also stall instruction dispatch • There are eight integer register rename buffers, eight floating-point register rename buffers, and 16 condition register field rename buffers • The count register and the link register have one shadow register each used for renaming • During dispatch, the appropriate buffers are allocated
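
The dispatch conditions described so far can be sketched as a per-cycle resource check. This is a simplified model under stated assumptions, not the 620's actual logic: the class and unit names are hypothetical, each instruction is assumed to be pre-assigned to one execution unit, the free reservation station counts are supplied by the caller, and serialization constraints are ignored. The rename buffer counts (8 GPR, 8 FPR, 16 CR field) and the 16-entry completion buffer follow the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instr:
    unit: str                    # hypothetical unit name, e.g. "int0", "lsu"
    dest: Optional[str] = None   # "gpr", "fpr", "crf", or None if no destination


@dataclass
class Resources:
    rs_free: Dict[str, int]      # free reservation station entries per unit
    completion_free: int = 16    # free completion buffer entries
    rename_free: Dict[str, int] = field(
        default_factory=lambda: {"gpr": 8, "fpr": 8, "crf": 16}
    )


def dispatch(window: List[Instr], res: Resources, width: int = 4) -> int:
    """Dispatch up to `width` instructions in program order; return the count."""
    units_used = set()
    count = 0
    for ins in window[:width]:
        if ins.unit in units_used:               # one instruction per unit per cycle
            break
        if res.rs_free.get(ins.unit, 0) == 0:    # no reservation station entry
            break
        if res.completion_free == 0:             # no completion buffer entry
            break
        if ins.dest is not None and res.rename_free[ins.dest] == 0:
            break                                # no rename buffer for the destination
        units_used.add(ins.unit)
        res.rs_free[ins.unit] -= 1
        res.completion_free -= 1
        if ins.dest is not None:
            res.rename_free[ins.dest] -= 1
        count += 1
    return count
```

Because dispatch is in program order, the loop stops at the first instruction that fails any condition rather than skipping over it.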

Dispatch Stage (cont.) • Any source operands which have been renamed by previous instructions are marked with the tags of the associated rename buffers • If the source operand is not available when the instruction is dispatched, the appropriate result buses for forwarding results are watched to obtain the operand data • Source operands which have not been renamed by previous instructions are read from the architected register files
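
The source operand handling at dispatch can be sketched as a lookup with three outcomes. This is an illustrative model only: the function name `read_source` and the dictionary-based rename map are hypothetical, chosen to mirror the three cases on this slide.

```python
def read_source(reg, rename_map, rename_buffers, arch_regs):
    """Return ("value", v) if the operand is available, else ("tag", t).

    rename_map:     architected register -> tag of the newest rename buffer
    rename_buffers: tag -> {"valid": bool, "value": int or None}
    arch_regs:      architected register -> committed value
    """
    tag = rename_map.get(reg)
    if tag is None:
        # Not renamed by any in-flight instruction: read the architected file.
        return ("value", arch_regs[reg])
    buf = rename_buffers[tag]
    if buf["valid"]:
        # The producer has finished: the rename buffer already holds the result.
        return ("value", buf["value"])
    # The producer is still in flight: keep the tag and watch the result buses.
    return ("tag", tag)
```

An instruction that receives a tag waits in its reservation station until a result carrying that tag is forwarded.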

Dispatch Stage (cont.) • If a branch is being dispatched, resolution of the branch is attempted immediately • If resolution is still pending, i.e., the branch depends on an operand that is not yet available, it is predicted using the branch history table (BHT) • If the prediction made by the BHT disagrees with the prediction made earlier by the BTAC, the BTAC-based prediction is discarded • Fetching proceeds along the direction predicted by the BHT

Reservation Stations • Each execution unit in the execute stage has an associated reservation station • Each holds those instructions waiting to execute • A reservation station can hold two to four instruction entries, depending on the execution unit • Each dispatched instruction waits in a reservation station until all its source operands have been read or forwarded and the execution unit is available • Instructions can leave reservation stations and be issued into the execution units out of order • Except for FPU and branch unit (BRU)
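
The difference between out-of-order issue and the in-order issue used by the FPU and BRU can be sketched as a selection function. This is a minimal sketch: the name `select_issue` is hypothetical, and choosing the oldest ready entry is an assumption, since the slides do not state the selection policy.

```python
def select_issue(entries, in_order=False):
    """Return the index of the entry to issue this cycle, or None.

    entries: reservation station contents, oldest first; each entry is a
             dict with a boolean "ready" (all source operands available).
    in_order: True for units that must issue in program order.
    """
    if in_order:
        # Only the oldest entry may issue, and only if it is ready.
        return 0 if entries and entries[0]["ready"] else None
    # Out of order: assumed policy is oldest ready entry first.
    for i, entry in enumerate(entries):
        if entry["ready"]:
            return i
    return None


station = [{"ready": False}, {"ready": True}]
out_of_order_choice = select_issue(station)               # younger entry may bypass
in_order_choice = select_issue(station, in_order=True)    # stalls behind the oldest
```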

Execute Stage • This major stage can require multiple cycles to produce its results • Depending on the type of instruction • The load/store unit is a two-stage pipeline • The floating-point unit is a three-stage pipeline • The instruction results are sent to the destination rename buffers and forwarded to any waiting instructions

Completion Buffer • The 16-entry completion buffer records the state of the in-flight instructions until they are architecturally complete • An entry is allocated for each instruction during the dispatch stage • The execute stage marks an instruction as finished when the unit is done executing the instruction • Eligible for completion

Complete Stage • During the completion stage, finished instructions are removed from the completion buffer in order • Up to four at a time • Passed to the writeback stage • Fewer instructions will complete in a cycle if there is an insufficient number of write ports to the architected register files • The architected registers hold the correct state up to the most recently completed instruction • Precise exceptions are maintained even with aggressive out-of-order execution
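
In-order completion can be sketched as draining the head of a queue. This is a simplified model, not the 620's actual logic: the function name, the entry fields, and the default number of write ports are assumptions, since the slides do not give a port count.

```python
from collections import deque


def complete_cycle(buf, width=4, write_ports=4):
    """Remove finished instructions from the head of the completion buffer.

    buf: deque of entries, oldest first; each entry is a dict with
         "tag", "finished" (bool), and "writes" (register writes needed).
    Returns the tags completed this cycle, in program order.
    """
    done = []
    ports_left = write_ports
    while buf and len(done) < width:
        head = buf[0]
        if not head["finished"]:
            break                      # an unfinished instruction blocks younger ones
        if head["writes"] > ports_left:
            break                      # not enough write ports left this cycle
        ports_left -= head["writes"]
        done.append(buf.popleft()["tag"])
    return done
```

A finished instruction behind an unfinished one stays in the buffer, which is what keeps the architected state precise.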

Writeback Stage • The writeback logic retires those instructions completed in the previous cycle • By committing their results from the rename buffers to the architected register files

Experimental Framework • The performance simulator for the 620 was implemented using the VMW framework • Developed based on design documents provided and periodically updated by the 620 design team • Instruction and data traces are generated on an existing PowerPC 601 microprocessor via software instrumentation • Traces for several SPEC 92 benchmarks, four integer and three floating-point, are generated • The benchmarks and their dynamic instruction mixes are shown below:

Experimental Framework (cont.) • Most integer benchmarks have similar instruction mixes • li contains more multicycle instructions than the rest • Most of these instructions move values to and from special-purpose registers • There is greater diversity among the floating-point benchmarks • Hydro2d uses more nonpipelined floating-point instructions • These instructions are all floating-point divides, which require 18 cycles on the 620

Experimental Framework (cont.) • Instructions with variable latency are assumed to have the minimum latency • Integer multiply/divide and floating-point divide • No speculative instructions that are later discarded due to misprediction are included in the simulation runs • Both I-cache and D-cache activities are included in the simulation • 32K bytes and 8-way set-associative (I-cache) • The D-cache is two-way interleaved • Cache miss latency of eight cycles • A perfect unified L2 cache is also assumed

Experimental Framework (cont.) The IPC rating reflects the overall degree of instruction-level parallelism achieved by the 620 microarchitecture

Instruction Fetching • Provided that the instruction buffer is not saturated, the 620's fetch unit is capable of fetching four instructions in every cycle • Machine execution would be drastically slowed by the bottleneck in fetching down taken branches • If the fetch unit were to wait for branch resolution before continuing to fetch nonspeculatively • If it were to bias naively for branch-not-taken • Accurate branch prediction is crucial in keeping a wide superscalar processor busy

Branch Prediction • Branch prediction in the 620 takes place in two phases • The first prediction uses the BTAC to provide a preliminary guess of the target address when a branch is encountered during instruction fetch • Done in the fetch stage • The second, more accurate prediction is based on the two history bits • Done in the dispatch stage using the BHT, which contains branch history • During the dispatch stage, the 620 attempts to resolve a branch immediately • Based on available information

Branch Prediction (cont.) • No branch prediction is necessary: • If the branch is unconditional • If the condition register has the appropriate bits ready • In either case, the branch is executed immediately • If the source condition register bits are unavailable because the instruction generating them is not finished: • Branch prediction is made using the BHT • The BHT predicts whether the branch will be taken or not taken
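
The decision made for a branch at dispatch can be written as a small function. This is a sketch of the three cases on this slide; the function and parameter names are hypothetical.

```python
def direction_at_dispatch(unconditional, cr_bits_ready, condition_true,
                          bht_predicts_taken):
    """Return (how, taken) for a branch reaching the dispatch stage.

    how is "resolved" when no prediction is needed, else "predicted".
    """
    if unconditional:
        return ("resolved", True)            # always taken, nothing to predict
    if cr_bits_ready:
        return ("resolved", condition_true)  # condition bits available: execute now
    # The producer of the condition bits has not finished: fall back to the BHT.
    return ("predicted", bht_predicts_taken)
```

Only the last case can later turn out to be a misprediction.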

Branch Prediction (cont.) • It contains two history bits per entry that are accessed during the dispatch stage • Upon resolution of the predicted branch, the BHT is updated with the actual direction of the branch • The 2048-entry BHT is a direct-mapped table • There is no concept of a hit or a miss • If two branches that update the BHT are an exact multiple of 2048 instructions apart, i.e., aliased, they will affect each other's predictions • The BTAC is an associative cache • The 620 can resolve or predict a branch at the dispatch stage
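
The aliasing behavior of a direct-mapped, tagless history table can be demonstrated with a short model. This is a sketch under assumptions not stated in the slides: the two history bits are modeled as a 2-bit saturating counter, the table is indexed by the instruction word address modulo 2048, and every entry starts as weakly not taken. The class name `BHT` is used here only for illustration.

```python
class BHT:
    ENTRIES = 2048

    def __init__(self):
        # Assumed initial state: 1 = weakly not taken (counter range 0..3).
        self.counters = [1] * self.ENTRIES

    @staticmethod
    def index(pc):
        # Instructions are 4 bytes, so drop the two low address bits.
        return (pc >> 2) % BHT.ENTRIES

    def predict(self, pc):
        """Predict taken when the counter is in one of the two upper states."""
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        """Record the resolved direction by saturating up or down."""
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BHT()
branch_a = 0x1000
branch_b = branch_a + 2048 * 4   # exactly 2048 instructions later: same entry
bht.update(branch_a, taken=True)
aliased_prediction = bht.predict(branch_b)   # influenced by branch_a's history
```

Because there are no tags, the table cannot tell the two branches apart, which is why there is no notion of a hit or a miss.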

Branch Prediction (cont.) • This can incur one cycle delay until the new target of the branch can be fetched • The 620 makes a preliminary prediction during the fetch stage • Based solely on the address of the instruction that it is currently fetching • If one of these addresses hits in the BTAC, the target address stored in the BTAC is used as the fetch address in the next cycle • The BTAC has 256 entries • It is two-way set-associative • It holds only the targets of those branches that are predicted taken
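
A 256-entry, two-way set-associative BTAC has 128 sets. The following sketch models lookup and update; the indexing by word address, the use of the full address as the tag, and the least-recently-inserted replacement policy are assumptions for illustration, as are the method names.

```python
class BTAC:
    SETS = 128   # 256 entries / 2 ways
    WAYS = 2

    def __init__(self):
        # Each set is a list of (branch_addr, target) pairs, oldest first.
        self.sets = [[] for _ in range(self.SETS)]

    def _set(self, addr):
        return self.sets[(addr >> 2) % self.SETS]

    def lookup(self, addr):
        """Return the stored target on a hit (predict taken), else None."""
        for tag, target in self._set(addr):
            if tag == addr:
                return target
        return None

    def record_taken(self, addr, target):
        """Store the target of a branch that is predicted taken."""
        s = self._set(addr)
        s[:] = [e for e in s if e[0] != addr]
        if len(s) == self.WAYS:
            s.pop(0)                     # assumed policy: evict the oldest entry
        s.append((addr, target))

    def record_not_taken(self, addr):
        """Branches predicted not taken are not kept in the BTAC."""
        s = self._set(addr)
        s[:] = [e for e in s if e[0] != addr]
```

A hit supplies the fetch address for the next cycle; a miss simply lets fetch continue sequentially.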

Branch Prediction (cont.) • Branches that are predicted not taken (fall through) are not stored in the BTAC • Only unconditional and PC-relative conditional branches use the BTAC • Branches to the count register or the link register have unpredictable target addresses • They are never stored in the BTAC • Effectively, these branches are always predicted not taken by the BTAC in the fetch stage • A link register stack is used for predicting conditional return instructions • It stores the addresses of subroutine returns
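
A link register stack behaves like a small hardware return-address stack. The sketch below is illustrative only: the class name, the method names, and the depth are assumptions, since the slides do not give the stack size. The return address is the call address plus 4 because PowerPC instructions are four bytes long.

```python
from collections import deque


class LinkStack:
    def __init__(self, depth=8):          # depth is a hypothetical value
        # A bounded deque silently drops the oldest entry on overflow.
        self.stack = deque(maxlen=depth)

    def on_call(self, call_addr):
        """A branch-and-link pushes the address of the next instruction."""
        self.stack.append(call_addr + 4)

    def predict_return(self):
        """A return pops the most recent return address, if any."""
        return self.stack.pop() if self.stack else None
```

Nested calls and returns then pair up naturally, which the BTAC cannot do because the target of a return changes with each call site.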

Branch Prediction (cont.) • Four possible cases in the BTAC prediction: • A BTAC miss for which the branch is not taken • Correct prediction • A BTAC miss for which the branch is taken • Incorrect prediction • A BTAC hit for a taken branch • Correct prediction • A BTAC hit for a not-taken branch • Incorrect prediction • The BTAC can never hit on a taken branch and get the wrong target address
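
The four cases reduce to a single comparison: the BTAC prediction is correct exactly when "hit" and "taken" agree. A minimal sketch, with a hypothetical function name:

```python
def btac_prediction_correct(btac_hit: bool, branch_taken: bool) -> bool:
    """A hit means predicted taken; a miss means predicted not taken."""
    return btac_hit == branch_taken


# Enumerate the four cases listed on the slide.
cases = {
    (hit, taken): btac_prediction_correct(hit, taken)
    for hit in (False, True)
    for taken in (False, True)
}
```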

Branch Prediction (cont.) • Only PC-relative branches can hit in the BTAC • They must always use the same target address • Two predictions are made for each branch • Once by the BTAC in the fetch stage • Another by the BHT in the dispatch stage • If the BHT prediction disagrees with the BTAC prediction, the BHT prediction is used • The BTAC prediction is discarded • If the predictions agree and are correct, all instructions that are speculatively fetched are used and no penalty is incurred

Branch Prediction (cont.) • In combining the possible predictions and resolutions of the BHT and BTAC, there are six possible outcomes • The predictions made by the BTAC and BHT are strongly correlated • In a small fraction of cases, a wrong prediction made by the BTAC is corrected by the right prediction of the BHT • There is also the unusual possibility of a correct prediction made by the BTAC being undone by an incorrect prediction of the BHT • However, such cases are quite rare

Branch Prediction (cont.) • The BTAC makes an early prediction without using branch history • A hit in the BTAC effectively implies that the branch is predicted taken • A miss in the BTAC means a not-taken prediction • The BHT prediction is based on branch history and is more accurate • It can potentially incur a one-cycle penalty if its prediction differs from that made by the BTAC • The BHT tracks the branch history and updates the entries in the BTAC • This is the reason for the strong correlation between the two predictions

Branch Prediction (cont.) • Summary of the branch prediction statistics for the benchmarks • The BTAC prediction accuracy for the integer benchmarks ranges from 75% to 84% • For the floating-point benchmarks it ranges from 88% to 94% • For these correct predictions by the BTAC, no branch penalty is incurred if they are likewise predicted correctly by the BHT • The overall branch prediction accuracy is determined by the BHT • For the integer benchmarks, about 17% to 29% of the branches are resolved by the time they reach the dispatch stage

Branch Prediction (cont.) • For the floating-point benchmarks, this range is 17% to 45% • The overall misprediction rate for the integer benchmarks ranges from 8.7% to 11.4% • For the floating-point benchmarks it ranges from 0.9% to 5.8% • The existing branch prediction mechanisms work quite well for the floating-point benchmarks • There is still room for improvement in the integer benchmarks

Fetching and Speculation • The purpose of branch prediction is to sustain a high instruction fetch bandwidth • To keep the rest of the superscalar machine busy • Misprediction translates into wasted fetch cycles • It reduces the effective instruction fetch bandwidth • Another source of fetch bandwidth loss is I-cache misses • The effects of these two impediments on fetch bandwidth for the benchmarks • For the integer benchmarks, significant percentages (6.7% to 11.8%) of the fetch cycles are lost due to misprediction

Fetching and Speculation (cont.) • For all the benchmarks, the I-cache misses resulted in the loss of less than 1% of the fetch cycles

Fetching and Speculation (cont.) • Branch prediction is a form of speculation • When speculation is done effectively, it can increase the performance of the machine • By alleviating the constraints imposed by control dependences • The 620 can speculate past up to four predicted branches before stalling the fifth branch at the dispatch stage • Speculative instructions are allowed to move down the pipeline stages until the branches are resolved • If the speculation proves to be incorrect, the speculated instructions are canceled

Fetching and Speculation (cont.) • Speculative instructions can potentially finish execution and reach the completion stage prior to branch resolution • They are not allowed to complete until the resolution of the branch • The frequency of bypassing specific numbers of branches • This reflects the degree of speculation sustained • Determined by obtaining the number of correctly predicted branches that are bypassed in each cycle • Once a branch is determined to be mispredicted, speculation of instructions beyond that branch is not simulated

Fetching and Speculation (cont.) • For the integer benchmarks, in 34% to 51% of the cycles, the 620 is speculatively executing beyond one or more branches • For floating-point benchmarks, the degree of speculation is lower • The frequency of misprediction is related to the combination of the average number of branches bypassed and the prediction accuracy
