Exceptions

thai

Presentation Transcript


  1. Exceptions

  2. Exceptions
     Just when you thought pipelines were not that hard…
     • Pipeline benefit = overlapped instructions
       • Good: Increased throughput
       • Sort of bad: Hazards – but we’ve seen how to deal with them
       • Bad: Exceptions
     • Exception Oddities
       • Multi-stage / multi-cycle instructions
       • Exceptions can happen anywhere
       • Instruction order and exception order might be different
       • Handling exceptions in instruction order is required
     • So what’s the strategy?
       • Depends on exception type…

  3. Exception Types
     • Terminology varies all over the place
     • I/O device request
     • Invoking an OS service from a user program
     • Tracing instruction execution
     • Breakpoints
     • Integer or FP arithmetic error such as overflow
     • Misaligned memory access
     • Page fault
     • Memory protection violation
     • Undefined instruction
       • Used on old Macs to invoke an OS service…
     • Hardware malfunction (like parity or ECC error)
     • Power failure

  4. Response Requirements – 5 axes
     • Synchronous vs. Asynchronous
       • Synchronous caused by a particular instruction
       • Asynchronous caused by external devices and HW failures
     • User requested vs. Coerced
       • Requested is predictable and can happen after the instruction
     • User maskable vs. user non-maskable
       • E.g. arithmetic overflow is maskable on some machines
     • Within vs. Between instructions
       • Where is the exception located?
       • Does excepting instruction complete?
     • Resume vs. Terminate
       • Implications for how much state must be preserved

  5. Examples of Exception Types

  6. Biggest Problem: Within and Resume
     • For DLX these tend to occur in the EX or MEM stage (i.e. late in the pipe)
     • Pipeline must be shut down safely
     • PC must be saved so the restart point is known
       • If the restart instruction is a branch, it will need to be re-executed
       • Which means its condition must not change
     • Steps (in DLX)
       • Force a TRAP instruction into the pipe
       • Kill all following instructions (i.e. prevent state updates)
       • Let all preceding instructions finish if they can
       • Save the restart PC value (faulting inst. or faulting inst. + 1)
       • Let the OS handle the exception
       • TRAP says where the handler code lives

  7. Making things harder
     Consider delayed branches
     • A single restart PC isn’t enough
     • Assume we have 2 branch delay slots
       • Branch is fine, and in this case is “taken”
       • First delay slot causes a page fault
       • Second slot is killed
       • Exception is handled and default restart is first delay slot
       • Second slot instruction is executed
       • Then the next instruction following the slot is executed…
       • OOPS! No branch!
     • This is a side effect of the effective instruction reordering due to the delayed branch
     • Hence, must save (number of delay slots + 1) PCs
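
The restart problem on this slide can be replayed with a toy model. This is only a sketch under stated assumptions: the address numbering, the `resume` helper, and the constant names are made up for illustration and are not DLX code. It compares what gets executed after the handler returns when one PC was saved versus when (delay slots + 1) PCs were saved.

```python
# Toy model (assumed, not real DLX): instruction addresses are small integers.
# A taken branch at address 0 has two delay slots (addresses 1 and 2)
# and jumps to address 10. Address 3 is the fall-through instruction.
BRANCH, SLOT1, SLOT2, FALL_THROUGH, TARGET = 0, 1, 2, 3, 10
DELAY_SLOTS = 2

def resume(saved_pcs, steps=4):
    """Return the addresses executed after the handler returns.

    The saved PCs are replayed in order; after the last one,
    fetch simply continues sequentially.
    """
    trace = list(saved_pcs)
    pc = saved_pcs[-1] + 1
    while len(trace) < steps:
        trace.append(pc)
        pc += 1
    return trace[:steps]

# Page fault in the first delay slot.
# Saving only the faulting PC loses the pending taken branch:
one_pc = resume([SLOT1])
# Saving slot 1, slot 2, and the branch target keeps it:
three_pcs = resume([SLOT1, SLOT2, TARGET])
```

With a single saved PC the trace runs through the fall-through instruction at address 3, which is the "OOPS! No branch!" case. With three saved PCs the trace reaches the branch target after the two slots.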

  8. Precise Interrupts
     • All instructions before the fault complete
     • All instructions after the fault can be restarted from scratch
     • Note the assumption that the faulting instruction doesn’t change state
       • In some cases this can be relaxed, while in others it will be a requirement for precise interrupts to work
     • Example: Floating point exceptions
       • Longer pipeline, may have written result before fault is known
       • Particularly bad if the destination was also a source…
       • Hence, must save the original operands as part of the argument stream passed to the exception handler
       • I.e. enough state to reconstruct things after all possible exceptions

  10. Precise and Non-Precise Modes
      • Typical in today’s high-performance processors
        • E.g. Alpha 21164, MIPS R8000, R10000, Power-3
      • Precise mode is as much as 10x slower
        • Biggest source of the problem is the FPU and out-of-order completion
        • Hence, in precise mode overlap (i.e. pipelining) is constrained
        • Result is LOTS of bubbles
      • Use precise mode when debugging
        • Also a requirement in many systems – e.g. IEEE FP standard handlers, virtual memory support, OS interfaces…
        • Not too difficult for the integer pipe anyway…
      • Use non-precise mode when you think your code works
        • Done laughing yet?

  11. Even for DLX
      Exceptions can happen anywhere
      • IF
        • Page fault, misaligned address, memory protection violation
      • ID
        • Undefined or illegal opcode
      • EX
        • Arithmetic exception
      • MEM
        • Page fault, misaligned address, memory protection violation
      • WB
        • None…
      • So, within any 1 clock cycle, 4 exceptions could occur!

  12. Making DLX Precise
      • Exception order may not match pipeline order
        • But we must take them in pipeline order to be precise
        • Currently, program order = pipeline order for DLX
      • Committing an instruction
        • When an instruction is guaranteed to complete, it commits
        • In the DLX, this happens at the end of the MEM stage
      • Prior to commit, carry exception state
        • Exception type and restart PC value pass through the pipe
        • ALL destructive writes (memory or RF) are deferred until the commit point
      • Easy for DLX, hard for VAX (surprised?)
        • VAX HW must save back-out state (i.e. undo autoinc, etc.)
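
The "carry exception state, act only at commit" idea can be sketched in a few lines. This is a minimal illustration, not the DLX control logic: the `Instr` record, the `record` and `commit_in_order` helpers, and the PC values are all assumptions made for the example. A stage that detects a fault only tags the instruction; the fault is taken when that instruction reaches the commit point, so faults are taken in program order even when they are detected out of order.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    pc: int                          # restart PC carried down the pipe
    exception: Optional[str] = None  # first fault seen for this instruction

def record(instr, cause):
    """A stage detected a fault: tag the instruction, but do not act yet."""
    if instr.exception is None:
        instr.exception = cause

def commit_in_order(instrs):
    """Commit in program order and stop at the first tagged instruction.

    Returns (committed_pcs, (cause, restart_pc)) or (committed_pcs, None).
    """
    committed = []
    for instr in instrs:
        if instr.exception is not None:
            return committed, (instr.exception, instr.pc)
        committed.append(instr.pc)
    return committed, None

# Program order: i0, i1, i2 (hypothetical PCs).
i0, i1, i2 = Instr(pc=96), Instr(pc=100), Instr(pc=104)
# i2 faults in IF one cycle before the older i1 faults in MEM.
record(i2, "page fault on fetch (IF)")   # detected first in time
record(i1, "page fault on data (MEM)")   # detected later, but older in program order
committed, taken = commit_in_order([i0, i1, i2])
```

Here `i0` commits, and the fault that is taken belongs to `i1`, even though the fault on `i2` was detected earlier in time.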

  13. Possible Design Decision Problems
      Decisions that complicate precise exception handling
      • Early register changes
        • VAX auto-increment, auto-decrement address modes
      • Iterative instructions
        • IBM 360: block memory move
        • How much moved before the fault?
      • Use of registers as working storage
        • 80x86 string instructions
      • Condition codes, many machines use them
        • Problem if they can be set in multiple stages
        • If set early, then they must be restored on exception
      • Multi-cycle instructions (all machines)

  14. More about Multi-Cycle Instructions
      Abundant options in the VAX ISA
      • This can be fixed in a good ISA design
        • After all, the VAX is ancient and we’re all smarter now…
        • Sure… Look at some modern ISAs…
      • Common modern multi-cycle causes:
        • Simple loads and stores that miss in the L1 cache
        • Reasonable cost FPU latencies are 5x+ of the IU
        • Co-processor or SFU instructions
      • Result
        • Stuck with the reality of multi-cycle instructions or stages
        • Real complication for laminar pipeline design goals, and fast precise exception modes…

  15. A Multi-Cycle DLX

  16. Latency vs. Repeat Cycle
      • Latency = number of cycles to complete
        • Defined to be the cycle distance between the instruction producing a value and the instructions that use that result
      • Repeat / Initiation interval
        • Number of cycles that must elapse between issue of instructions of the same type
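
The two quantities answer different questions, which a small calculation makes concrete. This sketch assumes latency counts the intervening cycles between a producer and a stall-free consumer, so a latency of 0 means the consumer can issue on the very next cycle. The helper names and the example unit parameters are illustrative assumptions, not values taken from the slide's figure.

```python
def consumer_issue_cycle(producer_cycle, latency):
    """Earliest cycle a dependent instruction can issue without stalling.

    Assumes latency = number of intervening cycles between producer
    and consumer (latency 0 means back-to-back issue works).
    """
    return producer_cycle + latency + 1

def next_same_type_cycle(previous_cycle, initiation_interval):
    """Earliest cycle another instruction of the same type can issue."""
    return previous_cycle + initiation_interval

# Illustrative units (assumed values):
PIPELINED_UNIT = {"latency": 3, "interval": 1}     # new op every cycle
UNPIPELINED_UNIT = {"latency": 24, "interval": 25}  # unit busy until done

# A dependent instruction waits on latency...
dependent = consumer_issue_cycle(1, PIPELINED_UNIT["latency"])
# ...while an independent instruction of the same type waits on the interval.
independent = next_same_type_cycle(1, PIPELINED_UNIT["interval"])
blocked = next_same_type_cycle(1, UNPIPELINED_UNIT["interval"])
```

For the pipelined unit an independent operation can follow immediately, while a dependent one must wait for the result. For the unpipelined unit even an independent operation of the same type waits for the whole interval.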

  17. DLX FP Pipe
      • Note: the number of stages is 1 + latency

  18. New Hazard and Forwarding Problems
      • Structural hazards increase
        • Unpiped divide causes huge 24 cycle delays
        • Number of register writes in a cycle goes up
        • 3 FPR writes possible now…
      • WAW hazards are now possible since instructions no longer reach WB in order
      • WAR hazards are still no problem since reads happen early (in the ID stage)
      • Out-of-order completion complicates exception handling
      • RAW stalls will be more frequent due to longer-latency instructions
      Was it worth it? How would you determine this?

  19. New Structural Hazard Source
      • Scan columns for common resource requirements
        • At cycle 10, 3 requirements for MEM
        • At cycle 11, 3 requirements for RF Write

  20. Dealing with Structural Hazards
      Consider a single write-port FPR in the previous example
      • Option 1:
        • Keep track of issued instructions and when they will write back to the FPR
        • Stall instruction in ID if there’s a collision
        • Just takes a 1-bit, 25-deep shift register for all 3 pipes
      • Option 2:
        • Stall instructions at MEM entry
        • May also want to give preference to longest latency
        • Longest is most likely to cause RAW stalls anyway
        • Problem is that the control path has to go all the way back to the front of the pipe, which is costly!
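
Option 1 can be sketched as a software model of the shift register. This is an illustration under assumptions: the `WritePortReservation` class, its method names, and the write-back distances used below are hypothetical. Bit i of the register means "the single FPR write port is already claimed i cycles from now"; the register shifts once per clock.

```python
class WritePortReservation:
    """1-bit-wide shift register tracking future use of one write port."""

    def __init__(self, depth=25):
        self.depth = depth
        self.bits = 0  # bit i set: port is taken i cycles from now

    def tick(self):
        """One clock passes: every reservation moves one cycle closer."""
        self.bits >>= 1

    def try_issue(self, cycles_to_writeback):
        """Reserve the port for an issuing instruction.

        Returns False on a collision, meaning the instruction stalls in ID.
        """
        if not 0 < cycles_to_writeback < self.depth:
            raise ValueError("write-back distance outside the register")
        mask = 1 << cycles_to_writeback
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

port = WritePortReservation()
first = port.try_issue(7)      # long op issues, writes back in 7 cycles
port.tick()
collision = port.try_issue(6)  # would write back in the same cycle: stall
port.tick()                    # the stalled instruction waits one cycle
retry = port.try_issue(6)      # now lands one cycle after the long op
```

The check is a single AND against one bit, which is why the slide calls the hardware cost small. The stall decision is made in ID, before the instruction enters any functional unit.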

  21. Consider RAW Hazard Stalls
      • Long-latency pipes cause RAW stall frequency to go up
      • Finally, on cycle 16 SD gets to enter MEM
      • EX doesn’t need to stall since it’s the EFA calc which uses R2
      • Note that figure 3.46 in the text is wrong

  22. New Hazard Sources
      • Dependencies between GPR and FPR
        • MOVI2FP and MOVFP2I instructions
      • Avoid the new ones with ID stage issue checks
        • Structural
          • Repeat interval check
          • Make sure register write port will be available when needed
        • RAW
          • List all pending destination registers
          • Don’t issue an instruction whose source is a pending destination until it clears (value is available via forwarding logic)
        • WAW
          • Use same list of pending destination registers
          • Don’t issue an instruction whose destination matches a pending one until that one has finished WB
      • Can we do better? (i.e. like forwarding)
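
The RAW and WAW issue checks above can be sketched as one function over the pending-destination list. This is a minimal sketch with assumed names: `issue_check`, `awaiting_result`, and `awaiting_wb` are hypothetical. The two sets model the two clearing conditions on the slide: a register leaves `awaiting_result` once its value can be forwarded, and leaves `awaiting_wb` only after write-back.

```python
def issue_check(sources, dest, awaiting_result, awaiting_wb):
    """ID-stage check. Returns the hazard that blocks issue, or None.

    awaiting_result: destinations whose value is not yet forwardable (RAW).
    awaiting_wb:     destinations that have not finished WB (WAW).
    """
    if any(reg in awaiting_result for reg in sources):
        return "RAW"
    if dest in awaiting_wb:
        return "WAW"
    return None

# A long divide writing F0 is in flight: result not ready, WB not done.
awaiting_result = {"F0"}
awaiting_wb = {"F0"}

reads_f0 = issue_check(["F0", "F2"], "F4", awaiting_result, awaiting_wb)
writes_f0 = issue_check(["F2", "F4"], "F0", awaiting_result, awaiting_wb)
unrelated = issue_check(["F2", "F4"], "F6", awaiting_result, awaiting_wb)

# Later the divide result becomes forwardable but has not been written back.
awaiting_result.discard("F0")
reads_f0_later = issue_check(["F0", "F2"], "F4", awaiting_result, awaiting_wb)
writes_f0_later = issue_check(["F2", "F4"], "F0", awaiting_result, awaiting_wb)
```

Once the value is forwardable the reader can issue, but a second writer of F0 still waits until write-back finishes.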

  23. Precise Exceptions and Long Pipes
      • Consider:
          DIVF F0, F2, F4
          ADDF F10, F10, F8
          SUBF F12, F12, F14
      • Piece of cake! No dependencies!
      • Unfortunately, wrong…
        • Both ADDF and SUBF will complete before DIVF
        • What happens if DIVF causes an exception…
        • Ideas?
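
A quick calculation shows why the three-instruction sequence completes out of order. The EX-stage counts below are illustrative assumptions (a long unpipelined divide, a short add/subtract pipe), and `completion_order` is a hypothetical helper; only the relative lengths matter for the argument.

```python
# Assumed EX-stage cycle counts, for illustration only.
EX_CYCLES = {"DIVF": 25, "ADDF": 4, "SUBF": 4}
PROGRAM = ["DIVF", "ADDF", "SUBF"]

def completion_order(program):
    """Issue one instruction per cycle and sort by the cycle EX finishes."""
    finished = [
        (issue_cycle + EX_CYCLES[op], op)
        for issue_cycle, op in enumerate(program, start=1)
    ]
    return sorted(finished)

order = completion_order(PROGRAM)
names = [op for _, op in order]
```

ADDF and SUBF have already updated F10 and F12 by the time DIVF could signal a fault. Because each of them overwrites one of its own source operands, simply re-running them after the handler would produce the wrong values.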

  24. Four Precision Possibilities
      • Punt on precision
        • Old supercomputer trick
        • Not really viable today with IEEE standard exception handling and virtual memory
      • Buffer Something – two variants
        • Future File
          • Buffer results at commit – post them in program order
        • History File
          • Post ASAP
          • Buffer original operands and roll back to proper state if an exception occurs
        • Forwarding still required
        • So both get more expensive as pipelines get longer
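
The history file variant can be sketched as an undo log. This is a toy model under assumptions: the `HistoryFile` class, its methods, the sequence numbers, and the register values are hypothetical, and it assumes the issue logic already prevents WAW conflicts among in-flight instructions. Results are posted to the register file as soon as they are ready, and the overwritten values are kept so the state can be rolled back.

```python
class HistoryFile:
    """Post results immediately, but remember what they overwrote."""

    def __init__(self, regs):
        self.regs = regs
        self.log = []  # (program_seq, register, old_value), completion order

    def write(self, seq, reg, value):
        """Post a result ASAP, logging the value it replaces."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq):
        """Undo every write by the faulting instruction or a later one."""
        for seq, reg, old_value in reversed(self.log):
            if seq >= faulting_seq:
                self.regs[reg] = old_value
        self.log = [entry for entry in self.log if entry[0] < faulting_seq]

# Program order: 0 = DIVF, 1 = ADDF F10, 2 = SUBF F12 (values are made up).
regs = {"F10": 1.0, "F12": 5.0}
history = HistoryFile(regs)
history.write(1, "F10", 3.0)   # ADDF finishes early and posts its result
history.write(2, "F12", -2.0)  # SUBF finishes early and posts its result
after_posting = dict(regs)
history.rollback(0)            # DIVF (seq 0) faults: restore older state
```

After the rollback the register file looks as if nothing at or after the divide had executed, so all three instructions can be restarted from scratch. The cost is the log itself, which grows with the number of instructions that can be in flight.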

  25. Possibilities 3 and 4
      • Go imprecise with SW fixup
        • Keep enough state around to do fixup
        • Let SW emulate the instructions that are not yet finished but prior to the excepting instruction
        • In general this is too hard
        • If the only uncompleted ones are FP, then it’s more tractable
      • Stall issue until previous instructions commit
        • Move commit point as far forward as possible in the pipe
        • Which, realistically, isn’t that far
        • Used by MIPS R4k and Pentium

  26. DLX Pipeline Performance • Stalls per FP operation

  27. Total Stalls / Instruction

  28. DLX FP Performance Averages
      Basis: SPEC FP benchmarks
      • Stalls per FP operation = approx. 50% of latency
        • 1.7 cycles for add/subtract/convert (56% of the 3 cycle latency)
        • 2.8 for multiply (40% of the 7 cycle latency)
        • 14.2 for divide (57% of the 25 cycle latency)
      • Total stalls
        • Varies with application
        • Range from 0.65 (su2cor) to 1.21 (doduc) per instruction
        • Average over SPEC benchmarks is 0.87 / instruction
        • Note that this is for all instructions, hence CPI = ideal CPI + 0.87
      • Major contributor is RAW result wait
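
The CPI relation on this slide is simple enough to check numerically. The sketch below assumes an ideal CPI of 1.0, which the slide does not state; the function names are also made up for the example. The stall figures are the ones quoted above.

```python
def cpi(ideal_cpi, stalls_per_instruction):
    """CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + stalls_per_instruction

def stall_fraction(stall_cycles, latency):
    """Average stall cycles expressed as a fraction of the unit's latency."""
    return stall_cycles / latency

IDEAL_CPI = 1.0  # assumption: one instruction per cycle when nothing stalls

best = cpi(IDEAL_CPI, 0.65)      # su2cor
average = cpi(IDEAL_CPI, 0.87)   # SPEC FP average
worst = cpi(IDEAL_CPI, 1.21)     # doduc
multiply_fraction = stall_fraction(2.8, 7)
```

Under that assumption the average program runs at a CPI of about 1.87, so stalls cost nearly as many cycles as useful issue slots.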

  29. ISA Design and Pipeline Complexity
      Things you can do wrong
      • Variable instruction lengths and CPIs
        • Imbalance will cause stall frequency to increase
        • Caches do this, but performance offsets the cost usually
      • Sophisticated address modes
        • Trashing registers during EFA calculation means state must be saved
        • I.e. auto-increment or auto-decrement mode
      • Permit self-modifying code (e.g. 80x86)
        • What if the instruction in the pipe is overwritten?
        • Then restarting becomes tricky: the old value must be saved
        • 80x86 takes extra undecoded instruction all the way to commit
      • Non-uniform, implicitly set condition codes
        • Newer machines use a uniform set stage
        • Plus an explicit bit in the instruction to enable set-cc
