Itanium Processor Microarchitecture

by Harsh Sharangpani and Ken Arora

Presented by Teresa Watkins

4/16/02


General Information

First implementation of the IA-64 instruction set architecture

Targets memory latency, memory address disambiguation, and control flow dependencies

0.18-micron process, 800 MHz

The EPIC design style shifts more responsibility to the compiler.

Challenge

Try to identify which improvements discussed in this class found their way into the Itanium.


EPIC Conceptual View

Idea

The compiler has a larger instruction window than the hardware does.

Communicate to the hardware more of the information gleaned at compile time.


Hardware Pipeline

Six instructions wide and ten stages deep

Tries to minimize the latency of the most frequent operations

Hardware support for compile-time indeterminacies


Front End

  • Software-initiated prefetch (requests are filtered by the instruction cache)
    • prefetch must be issued at least 12 cycles before the branch to hide latency
    • L2 -> streaming buffer -> instruction cache
  • Four-level branch-predictor hierarchy to avoid a 9-cycle pipeline stall
  • Decoupling buffer holds up to 8 bundles of code

Branch Predictors

  • Compiler provides branch hint directives
    • explicit branch predict (BRP) instructions
    • hint specifiers on branch instructions
  • These hints provide
    • branch target addresses
    • static hints on branch direction
    • indicators for when to use dynamic predictors
  • Four types of predictors (a sketch of Resteer 1 follows this list)
    • Resteer 1: single-cycle predictor (4 BRP-programmed TARs)
    • Resteer 2: adaptive multi-way and return predictors (dynamic)
    • Resteer 3 & 4: branch address calculation and correction
      • Resteer 3 includes a “perfect-loop-exit predictor”
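
A rough illustrative model of the Resteer 1 stage: four target address registers (TARs) programmed by BRP hints and looked up in a single cycle. The slide specifies only the four BRP-programmed TARs; the round-robin replacement and all names below are assumptions for illustration, not the hardware's actual design.

#include <stdint.h>
#include <stdbool.h>

#define NUM_TARS 4

typedef struct {
    bool     valid;
    uint64_t branch_ip;  /* address of the hinted branch */
    uint64_t target_ip;  /* predicted target from the BRP hint */
} Tar;

static Tar tars[NUM_TARS];
static int next_victim = 0;  /* assumed round-robin replacement */

/* A BRP instruction installs (branch, target) ahead of the branch. */
void brp_install(uint64_t branch_ip, uint64_t target_ip) {
    tars[next_victim] = (Tar){ .valid = true,
                               .branch_ip = branch_ip,
                               .target_ip = target_ip };
    next_victim = (next_victim + 1) % NUM_TARS;
}

/* Single-cycle lookup: if the fetch address matches a programmed
 * branch, resteer fetch to the hinted target with no bubble. */
bool tar_lookup(uint64_t fetch_ip, uint64_t *resteer_ip) {
    for (int i = 0; i < NUM_TARS; i++) {
        if (tars[i].valid && tars[i].branch_ip == fetch_ip) {
            *resteer_ip = tars[i].target_ip;
            return true;
        }
    }
    return false;  /* fall through to the later resteer stages */
}

On a miss, the later (slower) resteer stages get their chance, which is why the hierarchy can still avoid the full front-end stall.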

Instruction Delivery

  • Plentiful resources
    • four integer units
    • four multimedia units
    • two load/store units
    • three branch units
    • two extended-precision FP units
    • two single-precision FP units
  • SIMD allows up to 20 parallel operations per clock
  • Dispersal follows the high-level semantics provided by the IA-64 ISA
  • Checks for (see the sketch after this list):
    • Independence (determined by stop bits)
    • Oversubscription (determined by the bundle's 5-bit instruction template)
  • The template allows for simplified dispersal routing
  • Organized around 9 issue ports
    • two memory
    • two integer
    • two FP
    • three branch
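
A minimal sketch of the two dispersal checks named above, using the port counts from this slide. The slot encoding, the split policy, and all names are illustrative assumptions, not the hardware's actual routing logic.

#include <stdbool.h>

typedef enum { M_UNIT, I_UNIT, F_UNIT, B_UNIT } SlotType;

typedef struct {
    SlotType type;
    bool     stop_after;  /* stop bit: no later instruction may issue
                             in the same group */
} Slot;

/* Issue ports per the slide: 2 memory, 2 integer, 2 FP, 3 branch. */
static const int port_limit[4] = { 2, 2, 2, 3 };

/* Return how many of the n slots can issue this cycle. */
int disperse(const Slot *slots, int n) {
    int used[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < n; i++) {
        used[slots[i].type]++;
        if (used[slots[i].type] > port_limit[slots[i].type])
            return i;              /* oversubscribed: split here */
        if (slots[i].stop_after)
            return i + 1;          /* stop bit: group ends here */
    }
    return n;                      /* whole window issues */
}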

Registers

Two types of register renaming (virtual register addressing):

  • Register Stacking
    • reduces function call and return overhead by stacking a new register frame on top of the old frame, avoiding an explicit save of the caller’s registers (not supported in the FP registers)
  • Register Rotation
    • supports software pipelining by accessing the registers through an indirection based on the iteration count (see the sketch below)

If software allocates more virtual registers than are physically available (overflow), the Register Stack Engine takes control of the pipeline to spill register values to memory, and does the reverse for underflow. No pipeline flushes required :)
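
A minimal sketch of the rotation indirection, assuming the whole r32–r127 region of the integer file rotates (in hardware the rotating-region size is configurable); function and variable names are illustrative.

#include <assert.h>

#define ROT_BASE 32   /* r0-r31 are static; rotation starts at r32 */
#define ROT_SIZE 96   /* assume the whole r32-r127 region rotates */

/* rrb: register rename base, kept in [0, ROT_SIZE) and advanced by
 * the loop-type branch each iteration, so the same virtual register
 * name reaches a different physical register on every iteration of
 * a software-pipelined loop. */
int rotate(int virtual_reg, int rrb) {
    assert(virtual_reg >= ROT_BASE && virtual_reg < ROT_BASE + ROT_SIZE);
    assert(rrb >= 0 && rrb < ROT_SIZE);
    return ROT_BASE + (virtual_reg - ROT_BASE + rrb) % ROT_SIZE;
}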


Register Files

  • Integer register file
    • 128 entries
    • 8 read ports
    • 6 write ports
    • post-increment address updates are performed by otherwise idle ALUs and write ports
  • FP register file
    • 128 entries
    • 8 read ports
    • 4 write ports, separated into odd and even banks
    • supports double extended-precision arithmetic
  • Predicate register file: 1-bit entries with 15 read and 11 write ports

Execution Core

Non-blocking cache with a scoreboard-based stall-on-use control strategy

The pipeline stalls only when data is actually needed, not on other hazards.

A deferred-stall strategy (hazard evaluation in the REG stage) allows more time for dependencies to resolve.

Stalls are taken in the EXE stage, where input latches snoop returning data values for the correct data using the existing register bypass hardware.

Predication: turns a control dependency into a data dependency by executing both sides of a predicated branch and squashing the incorrect instructions before they change machine state (speculative vs. architectural predicate register file); a sketch follows below.

Executes up to three parallel branches per cycle, using priority encoding to determine the earliest taken branch.
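
A minimal C illustration of the if-conversion that predication enables. The booleans stand in for the complementary predicate registers a single compare would set; names are illustrative. Both sides execute, and a data dependency on the predicate replaces the control dependency on a predicted branch.

int predicated_abs(int x) {
    int p = (x < 0);      /* compare: sets complementary predicates p, !p */
    int r_neg = -x;       /* executes under predicate p  (squashed if !p) */
    int r_pos =  x;       /* executes under predicate !p (squashed if p)  */
    return p ? r_neg : r_pos;  /* only the true-predicate result commits */
}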


Data Speculation

Exception tokens

In the FP registers, a deferred exception is noted by storing a special NaTVal value in the NaN space, but each integer register gains an extra bit for the exception token (NaT). Because this 65th bit does not fit in a 64-bit memory word, the NaT bits must be collected into a special UNaT register when registers are spilled, and they are restored during fills. (A sketch follows.)

ALAT structure

If a store writes to the same memory location between the time the speculative (advanced) load reads it and the time the value is consumed, the ALAT invalidates the speculative load value and recovery is initiated. ALAT checks can be issued in parallel with the consuming instruction. (A sketch follows.)

On-Chip Memory

First Level Cache

  • separate data and instruction caches
  • 16 Kbytes each, 32-byte line size (the I-cache delivers 6 instructions/cycle; a geometry sketch follows this slide)
  • four-way set-associative
  • dual ported
  • 2-cycle latency, fully pipelined
  • write-through
  • physically addressed and tagged
  • single-cycle, 64-entry, fully associative iTLB (backed up by an on-chip hardware page walker)
  • iTLB and cache tags have an additional port to check addresses for misses

Second Level Cache

  • combined data and instructions
  • 96 Kbytes
  • six-way set-associative
  • 64-byte line size
  • two banks
  • four-state MESI for multiprocessor coherence
  • delivers 4 double-precision operands per clock to the FP register file
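
A worked sketch of the L1 geometry from this slide: 16 Kbytes, 4-way set-associative, 32-byte lines gives 16384 / (4 × 32) = 128 sets, hence 5 offset bits and 7 index bits. Field and function names are illustrative; the real lookup uses physical addresses, since the cache is physically addressed and tagged.

#include <stdint.h>

#define LINE_BYTES   32
#define WAYS         4
#define CACHE_BYTES  (16 * 1024)
#define NUM_SETS     (CACHE_BYTES / (WAYS * LINE_BYTES))  /* 128 */

typedef struct { uint64_t tag; unsigned set, offset; } CacheAddr;

CacheAddr split_address(uint64_t paddr) {
    CacheAddr a;
    a.offset = paddr % LINE_BYTES;                /* bits [4:0]   */
    a.set    = (paddr / LINE_BYTES) % NUM_SETS;   /* bits [11:5]  */
    a.tag    = paddr / (LINE_BYTES * NUM_SETS);   /* remaining bits */
    return a;
}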


On-Chip Memory cont.

  • Third Level Cache
    • 4 Mbytes
    • 64-byte line size
    • four-way set-associative
    • 128-bit, core-speed bus (12 Gbytes/s bandwidth)
    • MESI protocol
  • Optimal Cache Management
    • Memory locality hints
      • guide allocation and replacement strategies
    • Bias hints
      • optimize MESI latency

System Bus

  • 64-bit system bus with source-synchronous data transfer (2.1 Gbytes/s)
  • Multi-drop shared system bus using the MESI coherence protocol
  • Four-way glueless multiprocessor system support (4 processor nodes)
  • Multiple nodes connected through high-speed interconnects
  • Transaction-based bus protocol allows 56 pending transactions
  • A ‘defer mechanism’ enables out-of-order data transfers and transaction completion

Comparison to Previous Work

  • Non-blocking caches, as seen in “Lockup-free instruction fetch/prefetch cache organization”
  • Prefetch
    • decoupled prefetch based on branch hints, as seen in “A Scalable Front-End Architecture for Fast Instruction Delivery”
    • software-initiated prefetch, as seen in “Design and Evaluation of a Compiler Algorithm for Prefetching”
  • Memory locality hints for more efficient use of caches
  • Speculation: an extra bit for deferred exception tokens

What else?

Do you think they made a simple, scalable hardware implementation?