The anatomy of a modern superscalar processor

Constantinos Kourouyiannis

Madhava Rao Andagunda

Outline
  • Introduction
  • Microarchitecture
  • Alpha 21264 processor
  • Sim-alpha simulator
  • Out of order execution
  • Prediction-Speculation
Introduction
  • Superscalar processing is the ability to initiate multiple instructions during the same cycle.
  • It aims at producing ever faster microprocessors.
  • A typical superscalar processor fetches and decodes several instructions at a time.
  • Instructions are executed in parallel, based on the availability of their operand data rather than their original program sequence.
  • Upon completion, instructions are re-sequenced so that they update the process state in the correct program order.
Microarchitecture
  • Instruction Fetch and Branch Prediction
  • Decode and Register Dependence Analysis
  • Issue and Execution
  • Memory Operation Analysis and Execution
  • Instruction Reorder and Commit
Instruction Fetch and Branch Prediction
  • The fetch phase supplies instructions to the rest of the processing pipeline.
  • An instruction cache is used to reduce the latency and increase the bandwidth of the instruction fetch process.
  • The PC is used to search the cache contents to determine whether the instruction being addressed is present in one of the cache lines.
  • In a superscalar implementation, the fetch phase fetches multiple instructions per cycle from cache memory.
Branch Instructions
  • Recognizing conditional branches
    • Decode information (extra bits) is held in the instruction cache with every instruction
  • Determining the branch outcome
    • Branch prediction using the past history of branch outcomes.
  • Computing the branch target
    • Usually an integer addition (PC + offset value)
    • A Branch Target Buffer (BTB) holds the target address used the last time the branch was executed
  • Transferring control
    • If the branch is taken, there is at least one clock cycle of delay to recognize the branch, modify the PC, and fetch instructions from the target address
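As an illustration only (not part of the original slides), here is a minimal Python sketch of the prediction machinery described above, using a table of 2-bit saturating counters plus a branch target buffer; the table size and indexing are invented, and this is simpler than the 21264's tournament predictor.

# Minimal sketch of a bimodal predictor plus branch target buffer (illustrative sizes).
class BranchPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries   # 2-bit saturating counters, start weakly not-taken
        self.btb = {}                   # PC -> target address used last time
        self.entries = entries

    def predict(self, pc):
        taken = self.counters[pc % self.entries] >= 2
        target = self.btb.get(pc)       # None if this branch has not been seen yet
        return taken, target

    def update(self, pc, taken, target):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[pc] = target       # remember the target used this time
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = BranchPredictor()
print(bp.predict(0x120001a4))           # (False, None) before any training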
Instruction Decode
  • Instructions are removed from the fetch buffers, examined, and their data dependence linkages are set up.
  • Data dependences
    • True dependences: can cause a read-after-write (RAW) hazard
    • Artificial dependences: can cause write-after-read (WAR) and write-after-write (WAW) hazards.
  • Hazards
    • RAW: occurs when a consuming instruction reads a value before the producing instruction writes it.
    • WAR: occurs when an instruction writes a value before a preceding instruction reads it.
    • WAW: occurs when multiple instructions update the same storage location but not in the proper order.
Instruction Decode (cont.)
  • Example of data hazards (the slide's figure is not reproduced here; see the sketch after this slide)
  • For each instruction, the decode phase sets up the operation to be executed, the identities of the storage elements where its inputs reside, and the locations where its results must be placed.
  • Artificial dependences are eliminated through register renaming.
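Since the slide's hazard figure is missing, here is a small illustrative Python sketch (register names and the helper function are invented) that classifies the dependence between an earlier and a later instruction from their destination and source registers.

# Classify the data dependences between an earlier and a later instruction.
def classify_hazards(early_dst, early_srcs, late_dst, late_srcs):
    hazards = []
    if early_dst in late_srcs:
        hazards.append("RAW")   # later instruction reads what the earlier one writes
    if late_dst in early_srcs:
        hazards.append("WAR")   # later instruction writes what the earlier one reads
    if early_dst is not None and early_dst == late_dst:
        hazards.append("WAW")   # both write the same register
    return hazards

# Example: ADD R1, R2, R3 followed by SUB R4, R1, R5 -> RAW on R1
print(classify_hazards("R1", {"R2", "R3"}, "R4", {"R1", "R5"}))   # ['RAW']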
Instruction Issue and Parallel Execution
  • Run-time checking for availability of data and resources.
  • An instruction is ready to execute as soon as its input operands are available. However, there are other constraints, such as the availability of execution units and register-file ports.
  • An issue queue is responsible for holding the instructions until their input operands are available.
  • Out-of-order execution: the ability to execute instructions not in program order but as soon as their operands are ready.
Handling Memory Operations
  • For memory operations, the decode phase cannot identify the memory locations that will be accessed.
  • The determination of the memory location that will be accessed requires an address calculation, usually integer addition.
  • Once a valid address is obtained, the load or store operation is submitted to memory.
Committing State
  • The effects of an instruction are allowed to modify the logical process state.
  • The purpose of this phase is to implement the appearance of a sequential execution model, even though the actual execution is not sequential.
  • Machine state is separated into physical and logical. Physical state is updated as operations complete, while logical state is updated in sequential program order.
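A minimal sketch of this idea, assuming a simple reorder-buffer-like structure (illustrative only, not the 21264's actual commit logic): results complete out of order, but the logical (architectural) state is updated strictly in program order.

# Sketch: completion is out of order, commit (retirement) is in order.
# 'rob' holds instructions in program order; each entry notes whether it has completed
# and which architectural register it will update with which value.
def commit_ready(rob, arch_regs):
    while rob and rob[0]["done"]:
        entry = rob.pop(0)                          # oldest instruction first
        if entry["dest"] is not None:
            arch_regs[entry["dest"]] = entry["value"]   # logical state updated in order
    return arch_regs

rob = [
    {"done": True,  "dest": "R1", "value": 7},
    {"done": False, "dest": "R2", "value": None},   # blocks everything younger
    {"done": True,  "dest": "R3", "value": 9},
]
print(commit_ready(rob, {}))   # {'R1': 7}; R3 must wait for R2 to complete first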
Instruction Fetch
  • Fetches 4 instructions per cycle
  • Large 64 KB 2-way associative instruction cache
  • Branch predictor: dynamically chooses between local and global history
Register Renaming
  • Assignment of a unique storage location to each write-reference to a register.
  • The register allocated becomes part of the architectural state only when the instruction commits.
  • Elimination of WAW and WAR dependences but preservation of RAW dependences necessary for correct computation.
  • 64 architectural + 41 integer + 41 floating-point registers are available to hold speculative results prior to instruction retirement, within an 80-instruction in-flight window.
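Purely as an illustration (not the 21264's map logic), a small Python sketch of renaming with a map table and a free list: each write gets a fresh physical register, so WAR and WAW disappear while RAW is preserved through the map.

# Sketch of register renaming with a map table and a free list (names invented).
free_list = ["P1", "P2", "P3", "P4", "P5"]
rename_map = {}            # architectural register -> current physical register

def rename(dest, srcs):
    phys_srcs = [rename_map.get(s, s) for s in srcs]   # preserve RAW through the map
    phys_dest = free_list.pop(0)                        # fresh destination removes WAR/WAW
    rename_map[dest] = phys_dest
    return phys_dest, phys_srcs

print(rename("R1", ["R2", "R3"]))   # ('P1', ['R2', 'R3'])
print(rename("R1", ["R1", "R4"]))   # ('P2', ['P1', 'R4']) -- WAW on R1 removed, RAW kept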
Issue Queues
  • 20 entry integer queue
    • can issue 4 instructions per cycle
  • 15 entry floating-point queue
    • can issue 2 instructions per cycle
  • A list of pending instructions is kept, and each cycle the queues select from these instructions those whose input data are ready.
  • The queues issue instructions speculatively, and older instructions are given priority over newer ones in the queue.
    • An issue queue entry becomes available when the instruction issues or is squashed due to mis-speculation.
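An illustrative sketch of the select step under the policy just described: instructions whose inputs are ready compete, older instructions win, and at most the issue width is chosen per cycle (the data structures and names are invented).

# Each queue entry records its age (program order) and the source registers it waits for.
def select_for_issue(queue, ready_regs, width=4):
    ready = [e for e in queue if e["srcs"] <= ready_regs]   # wakeup: all inputs available
    ready.sort(key=lambda e: e["age"])                       # older instructions win
    issued = ready[:width]                                   # at most 'width' per cycle
    for e in issued:
        queue.remove(e)                                      # entry freed on issue
    return issued

q = [
    {"age": 3, "srcs": {"R1"}},
    {"age": 1, "srcs": {"R2", "R9"}},   # R9 not ready yet, so this one waits
    {"age": 2, "srcs": set()},
]
print([e["age"] for e in select_for_issue(q, {"R1", "R2"})])   # [2, 3]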
Execution Engine
  • All execution units require access to the register file.
  • The register file is split into two clusters that contain duplicates of the 80-entry register file.
  • Two pipes access a single register file to form a cluster and the two clusters are combined to support 4-way integer execution.
  • Two floating point execution pipes are organized in a single cluster with a single 72-entry register file.
Memory System
  • Supports in-flight memory references and out-of-order operation
  • Receives up to 2 memory operations from the integer execution pipes every cycle
  • 64 KB 2-way set associative data cache and direct mapped level-two cache (ranges from 1 to 16 MB)
  • 3-cycle latency for integer loads and 4 cycles for FP loads
Store/Load Memory Ordering
  • Memory system supports capabilities of out-of-order execution but maintains an in-order architectural memory model.
  • It would be incorrect for a later load to issue before an earlier store to the same address.
  • This RAW memory dependence cannot be handled by the rename logic, because memory addresses are not known before the instructions issue.
  • If a load is incorrectly issued before an earlier store to the same address, the 21264 trains the out-of-order execution core to avoid this on subsequent executions of the same load: it sets a bit in a load wait table that forces the issue point of the load to be delayed until all prior stores have issued.
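A rough sketch of the training idea, with invented names and structure (the real mechanism is a hardware table indexed by the load's PC): after a violation is detected, the load's wait bit forces it to hold back until prior stores have issued.

# Sketch: a 'wait' bit per load PC, set after an ordering violation is detected.
load_wait_table = set()

def may_issue_load(load_pc, older_stores_unissued):
    if load_pc in load_wait_table and older_stores_unissued:
        return False                      # trained: wait until all prior stores issue
    return True

def on_order_violation(load_pc):
    load_wait_table.add(load_pc)          # train the core to avoid this next time

# First execution: load issues too early, violation detected, table trained.
on_order_violation(0x400be4)
print(may_issue_load(0x400be4, older_stores_unissued=True))    # False on later executions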
Load Hit/ Miss Prediction
  • To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing if the load hit or missed in the data cache.
  • If the load eventually misses, two integer cycles are squashed and all integer instructions that issued during those cycles are pulled back into the issue queue to be re-issued later.
  • The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case. Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss.
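An illustrative sketch of the trade-off, using a small saturating counter per load as the hit/miss predictor (an assumption for illustration, not the 21264's actual predictor): predicted hits let consumers issue speculatively, predicted misses hold them back.

# Illustrative hit/miss predictor: a small saturating counter per load PC.
hitmiss = {}

def predict_hit(load_pc):
    return hitmiss.get(load_pc, 3) >= 2        # optimistic by default: assume a hit

def train(load_pc, was_hit):
    c = hitmiss.get(load_pc, 3)
    hitmiss[load_pc] = min(3, c + 1) if was_hit else max(0, c - 1)

# Predicted hit -> consumers issue speculatively (3-cycle latency if right, squash if wrong).
# Predicted miss -> consumers wait (5 effective cycles if the load actually hit).
train(0x40100c, was_hit=False)
train(0x40100c, was_hit=False)
print(predict_hit(0x40100c))   # False: consumers of this load will now wait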
Sim-alpha simulator
  • Sim-alpha is a simulator that models the Alpha 21264.
  • It models the implementation constraints and low-level features in the 21264.
  • Allows the user to vary different parameters of the processor, such as the fetch width, reorder buffer size, and issue queue sizes.
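The parameter names below are invented for illustration and are not sim-alpha's actual options; they only show the kind of knobs the slide refers to and how one might sweep them.

# Invented parameter names for illustration; not sim-alpha's real configuration interface.
config = {
    "fetch_width": 4,        # instructions fetched per cycle
    "reorder_buffer": 80,    # in-flight window size
    "int_issue_queue": 20,   # integer issue queue entries
    "fp_issue_queue": 15,    # floating-point issue queue entries
}

# Sweeping one parameter at a time is how such a simulator is typically used.
for width in (2, 4, 8):
    run_config = dict(config, fetch_width=width)
    print(run_config["fetch_width"], run_config["int_issue_queue"])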
Out-Of-Order Execution
  • Why out-of-order execution?
    • In-order processors stall: the pipeline may not stay full because of frequent stalls.
    • (The slide's example is not reproduced here.)
  • In out-of-order processors:
    • An instruction with no outstanding dependency is moved to execution.
    • In other words, instructions that are ready are allowed to proceed.

Out-Of-Order Execution
  • Theme: stall only on RAW hazards and structural hazards
  • RAW hazard
    • Example:
      LD  R4, 10(R5)
      ADD R6, R4, R8   (the ADD must wait for R4 produced by the load)
  • Structural hazard
    • Occurs because of resource conflicts.
    • Example: if the CPU is designed with a single interface to memory, that interface is always used during IF and is also used in the MEM stage for LOAD and STORE operations, so when a load or store is in the MEM stage, IF must stall.
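To make the contrast concrete, here is a toy Python sketch (latencies and the extra SUB instruction are illustrative, not from the slides): issue time is driven purely by when source operands become ready, so an independent instruction does not wait behind the stalled ADD.

# Toy schedule: each instruction is (name, source regs, dest reg, latency in cycles).
prog = [("LD",  {"R5"},       "R4", 3),
        ("ADD", {"R4", "R8"}, "R6", 1),   # RAW on R4 -> must wait for the load
        ("SUB", {"R1", "R2"}, "R7", 1)]   # independent of both

def ooo_issue_cycles(prog):
    ready_at = {}                          # register -> cycle its value becomes available
    cycles = {}
    for name, srcs, dst, lat in prog:
        start = max([ready_at.get(r, 0) for r in srcs] + [0])
        cycles[name] = start
        ready_at[dst] = start + lat
    return cycles

print(ooo_issue_cycles(prog))   # {'LD': 0, 'ADD': 3, 'SUB': 0} -- SUB need not wait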

Prediction & Speculation
  • Problem: serialization due to dependences
    • Consumers must wait until the producing instruction executes.
    • But we need optimal utilization of resources.
  • Solution:
    • Predict the unknown information.
    • Allow the processor to proceed, assuming the predicted information is correct.
    • If the prediction is correct, proceed normally.
    • If not, squash the speculatively executed instructions, restore the state, and restart in the correct direction.
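An illustrative sketch of the recover-on-misprediction flow described above (all names invented): checkpoint the state, run ahead on the prediction, and either keep the speculative results or restore the checkpoint.

# Sketch of speculate / verify / squash-and-restore.
def speculate(state, predicted_path, actual_path, execute):
    checkpoint = dict(state)                           # snapshot before speculating
    spec_state = execute(dict(state), predicted_path)  # run ahead on the prediction
    if predicted_path == actual_path:
        return spec_state                              # prediction correct: keep results
    return checkpoint                                  # wrong: squash and restore

def exec_fn(st, path):
    return {**st, "R1": st.get("R1", 0) + 1, "path": path}

print(speculate({"R1": 5}, "taken", "not_taken", exec_fn))   # {'R1': 5} -- state restored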

Taxonomy Of Speculation

  • Branch outcome (control speculation): two possibilities
    • Worst case: 50 % accuracy
  • Value in a 32-bit register (data speculation): 2^32 possibilities
    • Worst case: ? (only 1 in 2^32 when guessing blindly)

Prediction & Speculation
  • Control speculation
    • Current branch predictors are highly accurate.
    • Implemented in commercial processors.
  • Data speculation
    • A lot of research (value profiling, value prediction, etc.).
    • Not implemented in current processors.
    • So far it has had very little effect on current designs.