The anatomy of a modern superscalar processor

Constantinos Kourouyiannis

Madhava Rao Andagunda

Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation
Introduction
  • Superscalar processing is the ability to initiate multiple instructions during the same cycle.
  • It aims at producing ever faster microprocessors.
  • A typical superscalar processor fetches and decodes several instructions at a time.
  • Instructions are executed in parallel, based on the availability of their operand data rather than their original program sequence.
  • Upon completion, instructions are re-sequenced so that they update the process state in the correct program order.
Microarchitecture
  • Instruction Fetch and Branch Prediction
  • Decode and Register Dependence Analysis
  • Issue and Execution
  • Memory Operation Analysis and Execution
  • Instruction Reorder and Commit
Instruction Fetch and Branch Prediction
  • The fetch phase supplies instructions to the rest of the processing pipeline.
  • An instruction cache is used to reduce the latency and increase the bandwidth of the instruction fetch process.
  • The PC is used to search the cache contents to determine whether the instruction being addressed is present in one of the cache lines.
  • In a superscalar implementation, the fetch phase fetches multiple instructions per cycle from cache memory.
Branch Instructions
  • Recognizing conditional branches
    • Decode information (extra bits) is held in the instruction cache with every instruction
  • Determining the branch outcome
    • Branch prediction using the past history of branch outcomes.
  • Computing the branch target
    • Usually an integer addition (PC + offset value)
    • A Branch Target Buffer (BTB) holds the target address used the last time the branch was executed
  • Transferring control
    • If the branch is taken, there is at least one clock cycle of delay to recognize the branch, modify the PC, and fetch instructions from the target address
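As an illustration only (not part of the original slides), here is a minimal Python sketch of the prediction machinery described above, using a table of 2-bit saturating counters plus a branch target buffer; the table size and indexing are invented, and this is simpler than the 21264's tournament predictor.

# Minimal sketch of a bimodal predictor plus branch target buffer (illustrative sizes).
class BranchPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries   # 2-bit saturating counters, start weakly not-taken
        self.btb = {}                   # PC -> target address used last time
        self.entries = entries

    def predict(self, pc):
        taken = self.counters[pc % self.entries] >= 2
        target = self.btb.get(pc)       # None if this branch has not been seen yet
        return taken, target

    def update(self, pc, taken, target):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[pc] = target       # remember the target used this time
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BranchPredictor()
print(bp.predict(0x120001a4))           # (False, None) before any training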
Instruction Decode
  • Instructions are removed from the fetch buffers, examined, and their data dependence linkages are set up.
  • Data dependences
    • True dependences: can cause a read-after-write (RAW) hazard
    • Artificial dependences: can cause write-after-read (WAR) and write-after-write (WAW) hazards.
  • Hazards
    • RAW: occurs when a consuming instruction reads a value before the producing instruction writes it.
    • WAR: occurs when an instruction writes a value before a preceding instruction reads it.
    • WAW: occurs when multiple instructions update the same storage location but not in the proper order.
Instruction Decode (cont.)
  • Example of data hazards (the slide's figure is not reproduced here; see the sketch after this slide)
  • For each instruction, the decode phase sets up the operation to be executed, the identities of the storage elements where its inputs reside, and the locations where its results must be placed.
  • Artificial dependences are eliminated through register renaming.
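Since the slide's hazard figure is missing, here is a small illustrative Python sketch (register names and the helper function are invented) that classifies the dependence between an earlier and a later instruction from their destination and source registers.

# Classify the data dependences between an earlier and a later instruction.
def classify_hazards(early_dst, early_srcs, late_dst, late_srcs):
    hazards = []
    if early_dst in late_srcs:
        hazards.append("RAW")   # later instruction reads what the earlier one writes
    if late_dst in early_srcs:
        hazards.append("WAR")   # later instruction writes what the earlier one reads
    if early_dst is not None and early_dst == late_dst:
        hazards.append("WAW")   # both write the same register
    return hazards

# Example: ADD R1, R2, R3 followed by SUB R4, R1, R5 -> RAW on R1
print(classify_hazards("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))   # ['RAW']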
Instruction Issue and Parallel Execution
  • Run-time checking for availability of data and resources.
  • An instruction is ready to execute as soon as its input operands are available. However, there are other constraints, such as the availability of execution units and register-file ports.
  • An issue queue is responsible for holding the instructions until their input operands are available.
  • Out-of-order execution: the ability to execute instructions not in program order but as soon as their operands are ready.
Handling Memory Operations
  • For memory operations, the decode phase cannot identify the memory locations that will be accessed.
  • The determination of the memory location that will be accessed requires an address calculation, usually integer addition.
  • Once a valid address is obtained, the load or store operation is submitted to memory.
Committing State
  • The effects of an instruction are allowed to modify the logical process state.
  • The purpose of this phase is to implement the appearance of a sequential execution model, even though the actual execution is not sequential.
  • Machine state is separated into physical and logical. Physical state is updated as operations complete, while logical state is updated in sequential program order.
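A minimal sketch of this idea, assuming a simple reorder-buffer-like structure (illustrative only, not the 21264's actual commit logic): results complete out of order, but the logical (architectural) state is updated strictly in program order.

# Sketch: completion is out of order, commit (retirement) is in order.
# 'rob' holds instructions in program order; each entry notes whether it has completed
# and which architectural register it will update with which value.
def commit_ready(rob, arch_regs):
    while rob and rob[0]["done"]:
        entry = rob.pop(0)                          # oldest instruction first
        if entry["dest"] is not None:
            arch_regs[entry["dest"]] = entry["value"]   # logical state updated in order
    return arch_regs

rob = [
    {"done": True,  "dest": "R1", "value": 7},
    {"done": False, "dest": "R2", "value": None},   # blocks everything younger
    {"done": True,  "dest": "R3", "value": 9},
]
print(commit_ready(rob, {}))   # {'R1': 7}; R3 must wait for R2 to complete first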
Instruction Fetch
  • Fetches 4 instructions per cycle
  • Large 64 KB 2-way associative instruction cache
  • Branch predictor: dynamically chooses between local and global history
Register Renaming
  • Assignment of a unique storage location to each write-reference to a register.
  • The register allocated becomes part of the architectural state only when the instruction commits.
  • Elimination of WAW and WAR dependences but preservation of RAW dependences necessary for correct computation.
  • 64 architectural + 41 integer + 41 floating-point registers are available to hold speculative results prior to instruction retirement, within an 80-instruction in-flight window.
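Purely as an illustration (not the 21264's map logic), a small Python sketch of renaming with a map table and a free list: each write gets a fresh physical register, so WAR and WAW disappear while RAW is preserved through the map.

# Sketch of register renaming with a map table and a free list (names invented).
free_list = ["P1", "P2", "P3", "P4", "P5"]
rename_map = {}            # architectural register -> current physical register

def rename(dest, srcs):
    phys_srcs = [rename_map.get(s, s) for s in srcs]   # preserve RAW through the map
    phys_dest = free_list.pop(0)                        # fresh destination removes WAR/WAW
    rename_map[dest] = phys_dest
    return phys_dest, phys_srcs

print(rename("R1", ["R2", "R3"]))   # ('P1', ['R2', 'R3'])
print(rename("R1", ["R1", "R4"]))   # ('P2', ['P1', 'R4']) -- WAW on R1 removed, RAW kept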
Issue Queues
  • 20 entry integer queue
    • can issue 4 instructions per cycle
  • 15 entry floating-point queue
    • can issue 2 instructions per cycle
  • A list of pending instructions is kept, and each cycle the queues select from these instructions those whose input data are ready.
  • The queues issue instructions speculatively, and older instructions are given priority over newer ones in the queue.
    • An issue queue entry becomes available when the instruction issues or is squashed due to mis-speculation.
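An illustrative sketch of the select step under the policy just described: instructions whose inputs are ready compete, older instructions win, and at most the issue width is chosen per cycle (the data structures and names are invented).

# Each queue entry records its age (program order) and the source registers it waits for.
def select_for_issue(queue, ready_regs, width=4):
    ready = [e for e in queue if e["srcs"] <= ready_regs]   # wakeup: all inputs available
    ready.sort(key=lambda e: e["age"])                       # older instructions win
    issued = ready[:width]                                   # at most 'width' per cycle
    for e in issued:
        queue.remove(e)                                      # entry freed on issue
    return issued

q = [
    {"age": 3, "srcs": {"R1"}},
    {"age": 1, "srcs": {"R2", "R9"}},   # R9 not ready yet, so this one waits
    {"age": 2, "srcs": set()},
]
print([e["age"] for e in select_for_issue(q, {"R1", "R2"})])   # [2, 3]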
Execution Engine
  • All execution units require access to the register file.
  • The register file is split into two clusters that contain duplicates of the 80-entry register file.
  • Two pipes access a single register file to form a cluster and the two clusters are combined to support 4-way integer execution.
  • Two floating point execution pipes are organized in a single cluster with a single 72-entry register file.
Memory System
  • Supports in-flight memory references and out-of-order operation
  • Receives up to 2 memory operations from the integer execution pipes every cycle
  • 64 KB 2-way set associative data cache and direct mapped level-two cache (ranges from 1 to 16 MB)
  • 3-cycle latency for integer loads and 4 cycles for FP loads
Store/Load Memory Ordering
  • Memory system supports capabilities of out-of-order execution but maintains an in-order architectural memory model.
  • It would be incorrect for a later load to issue before an earlier store to the same address.
  • This RAW memory dependence cannot be handled by the rename logic, because memory addresses are not known before the instructions issue.
  • If a load is incorrectly issued before an earlier store to the same address, the 21264 trains the out-of-order execution core to avoid this on subsequent executions of the same load: it sets a bit in a load wait table that forces the issue point of the load to be delayed until all prior stores have issued.
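A rough sketch of the training idea, with invented names and structure (the real mechanism is a hardware table indexed by the load's PC): after a violation is detected, the load's wait bit forces it to hold back until prior stores have issued.

# Sketch: a 'wait' bit per load PC, set after an ordering violation is detected.
load_wait_table = set()

def may_issue_load(load_pc, older_stores_unissued):
    if load_pc in load_wait_table and older_stores_unissued:
        return False                      # trained: wait until all prior stores issue
    return True

def on_order_violation(load_pc):
    load_wait_table.add(load_pc)          # train the core to avoid this next time

# First execution: load issues too early, violation detected, table trained.
on_order_violation(0x400be4)
print(may_issue_load(0x400be4, older_stores_unissued=True))    # False on later executions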
Load Hit/ Miss Prediction
  • To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing if the load hit or missed in the data cache.
  • If the load eventually misses, two integer cycles are squashed and all integer instructions that issued during those cycles are pulled back into the issue queue to be re-issued later.
  • The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case. Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss.
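An illustrative sketch of the trade-off, using a small saturating counter per load as the hit/miss predictor (an assumption for illustration, not the 21264's actual predictor): predicted hits let consumers issue speculatively, predicted misses hold them back.

# Illustrative hit/miss predictor: a small saturating counter per load PC.
hitmiss = {}

def predict_hit(load_pc):
    return hitmiss.get(load_pc, 3) >= 2        # optimistic by default: assume a hit

def train(load_pc, was_hit):
    c = hitmiss.get(load_pc, 3)
    hitmiss[load_pc] = min(3, c + 1) if was_hit else max(0, c - 1)

# Predicted hit -> consumers issue speculatively (3-cycle latency if right, squash if wrong).
# Predicted miss -> consumers wait (5 effective cycles if the load actually hit).
train(0x40100c, was_hit=False)
train(0x40100c, was_hit=False)
print(predict_hit(0x40100c))   # False: consumers of this load will now wait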
Sim-alpha simulator
  • Sim-alpha is a simulator that models the Alpha 21264.
  • It models the implementation constraints and low-level features in the 21264.
  • Allows the user to vary different parameters of the processor, such as the fetch width, reorder buffer size, and issue queue sizes.
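The parameter names below are invented for illustration and are not sim-alpha's actual options; they only show the kind of knobs the slide refers to and how one might sweep them.

# Invented parameter names for illustration; not sim-alpha's real configuration interface.
config = {
    "fetch_width": 4,        # instructions fetched per cycle
    "reorder_buffer": 80,    # in-flight window size
    "int_issue_queue": 20,   # integer issue queue entries
    "fp_issue_queue": 15,    # floating-point issue queue entries
}

# Sweeping one parameter at a time is how such a simulator is typically used.
for width in (2, 4, 8):
    run_config = dict(config, fetch_width=width)
    print(run_config["fetch_width"], run_config["int_issue_queue"])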
Out-Of-Order Execution
  • Why out-of-order execution?
    • In-order processors stall: the pipeline may not stay full because of frequent stalls.
    • (The slide's example is not reproduced here.)
  • In out-of-order processors:
    • An instruction with no outstanding dependency is moved to execution.
    • In other words, instructions that are ready are allowed to proceed.

Out-Of-Order Execution
  • Theme: stall only on RAW hazards and structural hazards
  • RAW hazard
    • Example:
      LD  R4, 10(R5)
      ADD R6, R4, R8   (the ADD must wait for R4 produced by the load)
  • Structural hazard
    • Occurs because of resource conflicts.
    • Example: if the CPU is designed with a single interface to memory, that interface is always used during IF and is also used in the MEM stage for LOAD and STORE operations, so when a load or store is in the MEM stage, IF must stall.
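To make the contrast concrete, here is a toy Python sketch (latencies and the extra SUB instruction are illustrative, not from the slides): issue time is driven purely by when source operands become ready, so an independent instruction does not wait behind the stalled ADD.

# Toy schedule: each instruction is (name, source regs, dest reg, latency in cycles).
prog = [("LD",  {"R5"},       "R4", 3),
        ("ADD", {"R4", "R8"}, "R6", 1),   # RAW on R4 -> must wait for the load
        ("SUB", {"R1", "R2"}, "R7", 1)]   # independent of both

def ooo_issue_cycles(prog):
    ready_at = {}                          # register -> cycle its value becomes available
    cycles = {}
    for name, srcs, dst, lat in prog:
        start = max([ready_at.get(r, 0) for r in srcs] + [0])
        cycles[name] = start
        ready_at[dst] = start + lat
    return cycles

print(ooo_issue_cycles(prog))   # {'LD': 0, 'ADD': 3, 'SUB': 0} -- SUB need not wait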

Prediction & Speculation
  • Problem: serialization due to dependences
    • Consumers must wait until the producing instruction executes.
    • But we need optimal utilization of resources.
  • Solution:
    • Predict the unknown information.
    • Allow the processor to proceed, assuming the predicted information is correct.
    • If the prediction is correct, proceed normally.
    • If not, squash the speculatively executed instructions, restore the state, and restart in the correct direction.
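An illustrative sketch of the recover-on-misprediction flow described above (all names invented): checkpoint the state, run ahead on the prediction, and either keep the speculative results or restore the checkpoint.

# Sketch of speculate / verify / squash-and-restore.
def speculate(state, predicted_path, actual_path, execute):
    checkpoint = dict(state)                           # snapshot before speculating
    spec_state = execute(dict(state), predicted_path)  # run ahead on the prediction
    if predicted_path == actual_path:
        return spec_state                              # prediction correct: keep results
    return checkpoint                                  # wrong: squash and restore

def exec_fn(st, path):
    return {**st, "R1": st.get("R1", 0) + 1, "path": path}

print(speculate({"R1": 5}, "taken", "not_taken", exec_fn))   # {'R1': 5} -- state restored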

Taxonomy Of Speculation

  • Branch outcome (control speculation): two possibilities
    • Worst case: 50 % accuracy
  • Value in a 32-bit register (data speculation): 2^32 possibilities
    • Worst case: ? (only 1 in 2^32 when guessing blindly)

Prediction & Speculation
  • Control speculation
    • Current branch predictors are highly accurate.
    • Implemented in commercial processors.
  • Data speculation
    • A lot of research (value profiling, value prediction, etc.).
    • Not implemented in current processors.
    • So far it has had very little effect on current designs.