Advanced Microarchitecture

Advanced Microarchitecture Lecture 13: Commit, Exceptions, Interrupts

The End of the Road (um… Pipe) • Commit is typically the last stage of the pipeline • Anything that an instruction does at this point is irrevocable • therefore, only actions that correspond to a sequential execution can be allowed to pass this point • E.g., wrong path instructions may not commit because they do not exist in the sequential execution Lecture 13: Commit, Exceptions, Interrupts

Everything In-Order • The ISA defines program execution with respect to the sequential order of the instructions • To the outside world, the CPU must appear to execute in-order • What does it mean to “appear”? • … when someone looks • ok, so what does it mean to “look”? Lecture 13: Commit, Exceptions, Interrupts

“Looking” at CPU State • Examples • when the OS task swaps a program out, it copies the current program state (requires “looking”) so the program can be resumed later • when a program has a fault (e.g., page fault), again the OS usually steps in and it needs to “look” at the “current” CPU state Lecture 13: Commit, Exceptions, Interrupts

When Can Someone Look? • Divide-by-Zero • only on a divide • Page fault • only on a load or store? • can occur anytime if instruction fetch is on an unmapped page • OS Task/Process Scheduling • anytime Lecture 13: Commit, Exceptions, Interrupts

“Proper” State Always Required • So what does it mean to have a proper state? • one that corresponds to sequential execution • For a superscalar processor with degree N, the commit stage must be able to “retire” at least N instructions per cycle • retirement includes updating processor state in the same fashion that an instruction would for sequential execution Lecture 13: Commit, Exceptions, Interrupts

A B C A D B E F A A B B A C C B C D A E B F C G D A A A A B B B B C C C C D D D D E E E E F F G G H H Superscalar Commit is like Sampling Scalar Commit Processor States Superscalar Commit Processor States A Each “state” in the superscalar machine always corresponds to one state of the scalar machine (but not necessarily the other way around), and the ordering of states is preserved Lecture 13: Commit, Exceptions, Interrupts

Implementation in the CPU • The architected register file contains the state of the machine corresponding only to the committed instructions • commit happens in-order, so the ARF states always correspond to some RF state of a sequential execution • … this means anyone who wants to “look” will have to look in the ARF • What about the other instructions that have executed out of order? Lecture 13: Commit, Exceptions, Interrupts

LSQ Memory RS ROB PC RF PRF fPC Nothing Else Matters Memory PC RF Just ignore everything else! Sequential View of the Processor State of the Superscalar Out-of-Order Processor Lecture 13: Commit, Exceptions, Interrupts

Committing Instructions • “Retire” vs. “Commit” • sometimes people use this to mean the same thing • sometimes they mean different things • check the context! • An instruction commits by making its “effects” known/visible to the architected state • architected state: (A)RF, Memory/$, PC, other regs • speculative state: everything else (ROB, RS, LSQ, etc.) Lecture 13: Commit, Exceptions, Interrupts

Committing Instructions (2) • When an instruction executes, it modifies processor state • update a register • update memory • update the PC (almost all instructions do this) • To “make appear”, the processor just copies values • copy value from Physical Reg to Architected Reg • copy value from LSQ to memory/cache • copy value from ROB to Architected PC Lecture 13: Commit, Exceptions, Interrupts

Blocked Commit • To commit N instructions per cycle, the ROB needs to support N read ports • (in addition to other ports such as for reading from the PRF, write ports for allocation, and write ports for updates/writebacks) Helps to keep ROB port requirements under control, at cost of slight IPC loss due to occasional alloc stalls ROB ROB inst 1 inst 2 inst 3 inst 4 Four read ports for four commits inst 1 inst 2 inst 3 One wide read port inst 4 Can’t reuse (reallocate) any of these ROB entries until all have committed Lecture 13: Commit, Exceptions, Interrupts

Commit Restrictions • If any N instructions can commit per cycle, may require heavy multi-porting of other structures • Stores: need N extra DL1 write ports, N extra DTLB read ports • Branches: need N branch predictor update ports; also need to deallocate N RAT checkpoints • Others: if processor has memory dependence predictor, then may need N update ports for loads • Solution: limit maximum number of commits per cycle of special types of instructions (e.g., max one branch per cycle) Don’t we need to check DTLB during store-address computation anyway? Do we need to do it again here? Lecture 13: Commit, Exceptions, Interrupts

x86 Commit (macros vs. uops) • ROB contains uops, but outside world only knows about macro-ops ROB ADD EAX, EBX SUB EBX, ECX ???? ADD EDX, [EAX] uop 1 (ADD) commit uop 1 (SUB) uop 1 (LD) uop 2 (ADD) uop 1 uop 2 commit uop 3 uop 4 If we take an interrupt right now, we’ll see a half-executed instruction! uop 5 POPA uop 6 uop 7 uop 8 Lecture 13: Commit, Exceptions, Interrupts

x86 Commit (con’t) • Works fine when uop-flow length ≤ commit width • What to do with long flows? • In all cases: can’t commit until alluops in a flow have completed • Just commit N uops per cycle • ... but make commit uninteruptable commit commit POPA ROB uop 1 uop 2 uop 3 uop 4 uop 5 uop 6 Now do something about the interrupt. uop 7 Timer interrupt! Defer: Can’t act on this yet... uop 8 Lecture 13: Commit, Exceptions, Interrupts

Handling REP-prefixed Instructions • Ex. REP STOSB (memset with EAX’s value) • The entire sequence is effectively one x86 instruction • What if this REP’s for 1,000,000,000 iterations? • Does the CPU “lock up” for a billion or so cycles while we wait for the entire instruction to commit? • We also can’t wait until the entire instruction has completed to commit, because we can’t even fetch the entire instruction! • At the ISA level, REP iterations are interruptable... • so treat each iteration as a separate “macro-op” Lecture 13: Commit, Exceptions, Interrupts

REP (con’t) All of these are interruptible points (commit can stop and effects be seen by outside world), since they all have well-defined ISA-level states: A: ECX=3, EDI = ptr+1 B: ECX=2, EDI = ptr+2 C: ECX=1, EDI = ptr+3 D: ECX=0, EDI = ptr+4 • MOV EDI, <pointer> ; the array we want to memset • SUB EAX, EAX ; zero • CLD ; clear direction flag (REP fwd) • MOV ECX, 4 ; do for 100 iterations • REP STOSB ; memset! • ADD EBX, EDX ; unrelated instruction MOV EDI, xxx STA tmp, EDI STA tmp, EDI MOVS flow MOVS flow SUB EAX, EAX STD EAX, tmp STD EAX, tmp CLD ADD EDI, 1 ADD EDI, 1 REP 4thiter MOV ECX, 4 SUB ECX, 1 SUB ECX, 1 REP overhead for 2nd iter. uCMP ECX, 0 Check for zero iterations (could happen with MOV ECX, 0 ) uCMP ECX, 0 uCMP ECX, 0 uJCZ uJCZ uJCZ D: B: STA tmp, EDI STA tmp, EDI ADD EBX, EDX MOVS flow MOVS flow STD EAX, tmp STD EAX, tmp ADD EDI, 1 ADD EDI, 1 SUB ECX, 1 SUB ECX, 1 REP overhead for 1st iteration REP overhead for 3rditer uCMP ECX, 0 uCMP ECX, 0 uJCZ uJCZ A: C: Lecture 13: Commit, Exceptions, Interrupts

D$ st st 17 17 ld ld ld At commit, store Updates cache Store Retirement • Stores need to forward values to later loads from the same address • Normally, LSQ provides this facility D$ D$ st 33 ld After store has left the LSQ, the D$ can provide the correct value Lecture 13: Commit, Exceptions, Interrupts

store All instructions may have successfully executed, but none can commit! Store Retirement (2) • Store shouldn’t vacate LSQ entry until actual writeback has completed • this way it can continue to forward values until it is guaranteed that later loads can get the value from the cache • This can stall commit • what if there’s a cache miss? • what if there’s a TLB miss? Lecture 13: Commit, Exceptions, Interrupts

Writeback Buffer • Want to get stores out of the way Even if the store misses in cache, entering the WB buffer counts as committing. This allows other insts to commit as well. D$ store ld ld store The WB buffer is part of the cache hierarchy now. It may need to provide values to later loads WB Buffer Eventually, the cache update occurs, the WB buffer entry is emptied The cache can now provide the correct value Lecture 13: Commit, Exceptions, Interrupts

cache coherence interface st To other CPUs As an “official” write, the store must also be visible to other processors Writeback Buffer (2) • The WBB is part of the “Architected” state Processor stalls if there are no available writeback buffer entries • Loads must check both structures for a hit • more routing to get request to both places • yet another possible bypass source • also want to normalize latencies for easier scheduling D$ ld WBB Lecture 13: Commit, Exceptions, Interrupts

No one can “see” this store anymore! Load 42 Store 42 Load 42 42 5678 Writeback Buffer (3) • Stores enter WB Buffer in program order • If there are multiple stores to the same address, only the last one is “visible” Addr Value next to write to cache 42 1234 oldest 13 -1 8 90901 youngest Lecture 13: Commit, Exceptions, Interrupts

Load 42 Load 42 42 5678 Write Combining Buffer • Augment WBB to combine writes together Addr Value Only one writeback Now instead of two 42 1234 Store 42 If writing to same address, just combine the writes (similar to writeback cache) Lecture 13: Commit, Exceptions, Interrupts

One cache write can serve multiple original store instructions Store 84 Write Combining Buffer (2) • Can combine stores to same cache line $-Line Addr Cache Line Data 5678 80 1234 Aggressiveness of write-combining may be limited by memory ordering model Benefit: reduces cache traffic, reduces pressure on store buffers Writeback/combining buffer can be implemented in/integrated with the MSHRs Only certain memory regions may be “write-combinable” (e.g., USWC in x86) Lecture 13: Commit, Exceptions, Interrupts

Senior Store Queue • Use STQ as WBB (not necessarily write combining) STQ Store DL1 L2 Store STQ head STQ head STQ head Store Store Store Store Store While stores are completing, other accesses (loads, snoops) can continue getting the values from the “senior” STQ Can continue allocating stores Store STQ tail STQ tail Store Store New stores cannot allocate into Senior STQ entries until stores complete Benefit: don’t need separate WBB, but we don’t need to stall commit on stores either Lecture 13: Commit, Exceptions, Interrupts

Silent Stores • Many stores write the same value to memory that’s already there D$ 42 Store 42 This is a “silent store” Ideally, this store shouldn’t have to happen Silent Stores are another way to exploit value locality; they are like the “dual” of value-predictable loads that always (frequently) load the same value Lecture 13: Commit, Exceptions, Interrupts

data hit = Store executes as a Load Silent Store Verification • Pro: stores can commit faster (reduces commit-time cache port pressure) • Con: more total accesses • every store converted to load • non-silent still has to store D$ Silent? Fetch Decode Rename Dispatch Schedule Execute WB Commit Check load value against store data for equality If equal, then store is silent, suppress store at commit Lecture 13: Commit, Exceptions, Interrupts

Other Duties • Besides updating architected state, commit needs to deallocate microarchitecture resources • ROB/LSQ entries • Physical register • “colors” of various sorts if used • RAT checkpoints • Most are FIFO’s or Queues, so alloc/dealloc is usually just inc/dec head/tail pointers • Unified PRF requires a little more work Lecture 13: Commit, Exceptions, Interrupts

In-Order Commit is Sufficient • … but not necessary • Necessary conditions for retirement: • Finished • result produced as defined by ISA semantics • No WAR hazard • in the architected register file • No unresolved branches • No unresolved exceptions • No memory ordering violations • No memory consistency problems Lecture 13: Commit, Exceptions, Interrupts

Commitability % of ROB Combination Many instructions could commit (all requirements satisfied) but don’t due to in-order constraint WAR Earlier agen Earlier Branch Not Finished None Figure from Bell and Lipasti, “Deconstructing Commit,” in ISPASS’04 Lecture 13: Commit, Exceptions, Interrupts

Backup Slides on Interrupts/Faults Lecture 13: Commit, Exceptions, Interrupts

LSQ Memory RS ROB PC ARF PRF fPC External Appearance Maintained If you want to peek in, all you get to see are the (arch) register file, the PC and the cache state (incl. WB buffers) But what if there’s no ARF? Lecture 13: Commit, Exceptions, Interrupts

ARF View of the Unified Register File If you need to “see” a register, you go through the aRAT first. sRAT PRF aRAT Lecture 13: Commit, Exceptions, Interrupts

LSQ Memory RS ROB PC ARF PRF fPC Mispredicted Branch View of Branch Mispredictions Wrong-path instructions are flushed… architected state has never been touched Fetch correct path instructions Which can update the architected state when they commit Lecture 13: Commit, Exceptions, Interrupts

DBZ! Trap? (when?) Trap Divide may have executed before other instructions due to OOO scheduling! (resume execution) Faults • Div-by-Zero, Overflow, Page-Fault • All occur at a specific point in execution (precise) DBZ! • CPU must maintain the appearance of sequential execution Lecture 13: Commit, Exceptions, Interrupts

Let earlier instructions commit Timing of DBZ Fault • Need to hold on to your faults On a fault, flush the machine and switch to the kernel ROB Architected State RS Now, the arch. state is the same as just before the divide executed in the sequential order Exec: DBZ Now, raise the DBZ fault and when you switch to the kernel, everything appears as it should Just make note of the fault, but don’t do anything (yet) Lecture 13: Commit, Exceptions, Interrupts

Speculative Faults • Faults might not be faults… Buffering faults until commit filters out speculative faults ROB Branch Mispredict (flush wrong-path) DBZ! The fault goes away, too Which is what we want since in a sequential execution, the wrong-path divide would not have executed (and faulted) Lecture 13: Commit, Exceptions, Interrupts

Walk page-table, may find a page fault … Trap (resume execution) Re-execute store Timing of TLB Miss • Store must re-execute (or re-commit, really), but this means it cannot leave the ROB • Store TLB miss can stall the processor TLB miss Lecture 13: Commit, Exceptions, Interrupts

Load Faults are Similar • Load issues, misses in TLB • Must wait! When load is oldest, allow switch to kernel to do page-table walk • …could be painful; there are lots of loads • Some processors have hardware page-table walkers • OS loads a few registers with PT information (pointers) and then simple logic fetches mapping info from memory OS must support particular PT format to make use of HW page-table walker Lecture 13: Commit, Exceptions, Interrupts

Key Pressed Key Pressed Key Pressed Asynchronous Interrupts • Some interrupts are not associated with any particular instruction • timer interrupt • disk, network interrupts (I/O) • low battery, UPS shutdown • When the CPU “notices” doesn’t matter (too much) Lecture 13: Commit, Exceptions, Interrupts

Handling Async Interrupts • Handle right away • Just use the current architected state, and flush the pipeline • Deferred • Stop fetching, let the processor drain, then switch to the interrupt handler • What if CPU takes a fault in the mean time? • Which came “first”, the External interrupt or the fault? Lecture 13: Commit, Exceptions, Interrupts

Advanced Microarchitecture