Itanium Processor Microarchitecture
by Harsh Sharangpani and Ken Arora
Presented by Teresa Watkins, 4/16/02

General Information
First implementation of the IA-64 instruction set architecture
Targets memory latency, memory address disambiguation, and control flow dependencies
0.18 micron process, 800MHz
EPIC design style shifts more responsibilities to compiler
Try to identify which improvements discussed in this class found their way into the Itanium.
The compiler has a larger instruction window than the hardware, so EPIC conveys to the hardware more of the information gleaned at compile time.
Six instructions wide and ten stages deep
Tries to minimize latency of the most frequent operations
Hardware support for compile-time indeterminacies
Two types of register renaming (virtual register addressing):
If software allocates more virtual registers than are physically available (overflow), the Register Stack Engine takes control of the pipeline to spill register values to memory, and the reverse on underflow. No pipeline flushes required :)
Non-blocking cache with a scoreboard-based stall-on-use control strategy
Pipeline stalls only when data is actually needed, not on other hazards
Deferred-stall strategy (hazard evaluation in the REG stage) allows more time for dependencies to resolve
Stalls are taken in the EXE stage, where input latches snoop returning data values off the existing register bypass hardware.
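The stall-on-use idea can be illustrated with a small sketch (a hypothetical model, not the actual scoreboard logic): a missing load marks its destination register pending, and a stall occurs only when an instruction actually reads a pending register.

```python
class Scoreboard:
    """Sketch of stall-on-use: a cache-missing load marks its
    destination register pending but does not stall the pipeline;
    a stall happens only when an instruction reads a pending
    register before the data has returned."""
    def __init__(self):
        self.pending = set()

    def issue_load(self, dst):
        self.pending.add(dst)       # miss outstanding; keep going

    def data_return(self, dst):
        self.pending.discard(dst)   # snooped off the bypass network

    def must_stall(self, srcs):
        # stall only if a source operand is still outstanding
        return any(r in self.pending for r in srcs)

sb = Scoreboard()
sb.issue_load("r4")                 # r4 misses in the cache
print(sb.must_stall(["r1", "r2"]))  # False: independent work proceeds
print(sb.must_stall(["r4"]))        # True: the consumer must wait
sb.data_return("r4")
print(sb.must_stall(["r4"]))        # False: data has returned
```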
Predication: turns a control dependency into a data dependency by executing both sides of a branch and squashing the wrong-path instructions before they change machine state (speculative predicate register file vs. architectural predicate register file)
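If-conversion can be sketched as follows (a hypothetical Python illustration, not IA-64 code): both arms execute, each guarded by a predicate, and only the predicated-true result commits.

```python
def predicated_max(a, b):
    """If-conversion sketch: compute both arms of `if a > b`,
    then commit only the result whose predicate is true. The
    control dependency (a branch) becomes a data dependency on
    the complementary predicates p_t / p_f."""
    p_t = a > b        # compare sets complementary predicates
    p_f = not p_t
    r1 = a             # arm 1 executes unconditionally...
    r2 = b             # ...and so does arm 2
    # squash: only the instruction with a true predicate updates state
    return r1 if p_t else r2

print(predicated_max(3, 7))  # 7
```

The payoff is that no branch needs predicting here, so there is no misprediction penalty to pay.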
Executes up to three branches in parallel per cycle, using priority encoding to select the earliest taken branch.
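The priority encoding can be sketched like this (a toy model under my own naming, not the actual hardware): the earliest slot predicted taken wins, and later branches in the same cycle are ignored.

```python
def select_taken_branch(predictions):
    """Priority-encoding sketch across up to three branches issued
    in one cycle: the earliest (lowest-slot) branch predicted taken
    redirects fetch; all later slots are squashed.
    `predictions` is a list of (taken, target) tuples per slot."""
    for slot, (taken, target) in enumerate(predictions):
        if taken:
            return slot, target   # earliest taken branch wins
    return None, None             # no branch taken: fall through

slot, target = select_taken_branch(
    [(False, 0x100), (True, 0x200), (True, 0x300)])
print(slot, hex(target))  # 1 0x200
```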
In FP registers, deferred exceptions are flagged by storing the NaTVal value in the NaN space; integer registers instead get an extra bit for the exception token (NaT). Because the NaT bit does not fit in memory, it must be saved to the special UNaT register when a register is spilled, and it is restored during fills.
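Deferred-exception propagation can be modeled with a sentinel standing in for the NaT token (a hypothetical sketch; `spec_load`, `add`, and `chk` are illustrative names, with `chk` loosely playing the role of the speculation check):

```python
NAT = object()  # sentinel for the deferred-exception token (the NaT bit)

def spec_load(addr, page_valid):
    """Control-speculative load: instead of faulting immediately,
    a bad access returns the NaT token to be dealt with later."""
    return NAT if not page_valid else addr * 2  # dummy loaded value

def add(a, b):
    # NaT propagates through dependent computation untouched
    if a is NAT or b is NAT:
        return NAT
    return a + b

def chk(value):
    # the check at the original, non-speculative program point raises
    # the exception only if the speculated value is actually used
    if value is NAT:
        raise RuntimeError("deferred exception: run recovery code")
    return value

v = add(spec_load(0x10, page_valid=True), 5)
print(chk(v))  # 37: the speculation succeeded
```

If the load had been hoisted above a branch that is never taken, the NaT value is simply never checked and the exception never fires.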
If a store writes to the same memory address between the time the speculative (advanced) load executes and the time its value is consumed, the ALAT invalidates the speculative load value and recovery is initiated. ALAT checks can be issued in parallel with the consuming instruction.
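A toy model of the check (a sketch of the idea, not the real Advanced Load Address Table; method names are made up for illustration):

```python
class ALAT:
    """Sketch of the Advanced Load Address Table: an advanced load
    records its target address; any later store to the same address
    invalidates the entry; the check at the original load point
    triggers recovery code if the entry is gone."""
    def __init__(self):
        self.entries = {}   # destination register -> loaded address

    def advanced_load(self, reg, addr, memory):
        self.entries[reg] = addr      # load hoisted above a store
        return memory[addr]

    def store(self, addr, value, memory):
        memory[addr] = value
        # invalidate any advanced load that read this address
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        # ld.c-style check: True means the speculated value is safe
        return reg in self.entries

mem = {0x40: 7}
alat = ALAT()
v = alat.advanced_load("r3", 0x40, mem)  # load hoisted above the store
alat.store(0x40, 9, mem)                 # aliasing store invalidates it
print(alat.check("r3"))                  # False: recovery must re-load
```

When the store does not alias the load, the entry survives and the speculated value is used with no penalty.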
First Level Caches
Separate data and instruction caches
16 KB each, 32-byte line size (6 instructions/cycle from the I-cache)
2-cycle latency, fully pipelined
Physically addressed and tagged
Single-cycle, 64-entry, fully associative iTLB (backed up by an on-chip hardware page walker)
iTLB and cache tags have an additional port to check addresses for misses
Second Level Cache
Combined data and instructions
64-byte line size
Four-state MESI protocol for multiprocessor coherence
Delivers 4 double-precision operands per clock to the FP register file
Non-blocking caches, as seen in "Lockup-free instruction fetch cache organization"
Prefetch: decoupled prefetch based on branch hints, as seen in "A Scalable Front-End Architecture for Fast Instruction Delivery", and software-initiated prefetch, as seen in "Design and Evaluation of a Compiler Algorithm for Prefetching"
Memory locality hints for more efficient use of caches
Speculation - extra bit for deferred exception tokens
Do you think they made a simple, scalable hardware implementation?