
CS 7810 Lecture 21








  1. CS 7810 Lecture 21
  Threaded Multiple Path Execution
  S. Wallace, B. Calder, D. Tullsen
  Proceedings of ISCA-25, June 1998

  2. Leveraging SMT
  • Recall branch fan-out from “Limits of ILP”
  • Future processors will likely have no shortage of idle thread contexts
  • Spawned threads are parallel, but have dependences with earlier instructions: registers, uncommitted stores, data cache values
  • SMT may be an ideal candidate, as threads share the same set of resources

  3. SMT vs. CMP
  • A multi-threaded workload (on an SMT) is more tolerant of branch mispredicts – TME makes most sense if there is a shortage of threads
  • Power overheads are enormous – on an SMT, we may not have the option to execute speculative threads on low-power pipelines
  • What about energy?
  • Is CMP a better candidate?

  4. Renaming Overview
  [Figure: renaming example – successive writes to r1 allocate new physical registers (p1, p5, …), with a checkpoint taken at each branch in the instruction stream]
  • Every branch causes a checkpoint of mappings, so we can recover quickly on a mispredict
  • Each thread in the SMT can have 8 checkpoints
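The checkpoint-per-branch scheme above can be sketched in a few lines. This is an illustrative model only, not the paper's hardware: the names `RenameMap` and `MAX_CHECKPOINTS` are assumptions, and a real rename table would be SRAM with shadow copies rather than Python lists.

```python
MAX_CHECKPOINTS = 8  # each SMT thread context holds up to 8 checkpoints


class RenameMap:
    """Toy model of a register rename map with per-branch checkpoints."""

    def __init__(self, num_arch_regs=32):
        # arch register index -> physical register index (identity to start)
        self.table = list(range(num_arch_regs))
        self.checkpoints = []  # saved copies of the map, one per in-flight branch

    def checkpoint(self):
        """Save the full map at a branch so a mispredict can be repaired quickly."""
        if len(self.checkpoints) >= MAX_CHECKPOINTS:
            raise RuntimeError("out of checkpoints; stall until a branch resolves")
        self.checkpoints.append(self.table.copy())

    def recover(self, checkpoint_id):
        """On a mispredict, reinstate the map saved at that branch."""
        self.table = self.checkpoints[checkpoint_id]
        del self.checkpoints[checkpoint_id:]  # younger checkpoints are squashed too
```

For example, renaming r1, checkpointing at a branch, renaming r1 again on the wrong path, and then recovering restores the pre-branch mapping in one step.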

  5. Threaded Multi-Path Execution
  Key elements in TME:
  • Identifying low-confidence branches
  • Efficient thread spawning
  • Efficient recovery on branch resolution
  • Fetch priorities for each thread on SMT

  6. Path Selection
  • Only the primary path can spawn threads (prevents an exponential increase in threads)
  • For each bpred entry, keep track of successive correct predictions (reset on mispredict) – if the counter is less than a threshold, the branch is low-confidence – note that a small counter size is more selective in picking low-confidence branches
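The counter scheme on this slide can be sketched as follows. This is a hedged sketch, not the paper's exact estimator: the class name, the dictionary keyed by branch PC, and the default parameter values are all assumptions for illustration.

```python
class ConfidenceEstimator:
    """Per-branch saturating counter of successive correct predictions."""

    def __init__(self, counter_bits=2, threshold=3):
        # A smaller counter saturates at a lower value, so fewer branches can
        # climb above the threshold transiently: the filter is more selective.
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.counts = {}  # branch PC -> count of successive correct predictions

    def update(self, pc, prediction_correct):
        if prediction_correct:
            self.counts[pc] = min(self.counts.get(pc, 0) + 1, self.max_count)
        else:
            self.counts[pc] = 0  # reset on mispredict

    def low_confidence(self, pc):
        """Below the threshold means the branch is a candidate for TME."""
        return self.counts.get(pc, 0) < self.threshold
```

A branch starts out low-confidence, becomes high-confidence only after a run of correct predictions, and drops back to low-confidence on a single mispredict.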

  7. Register Mappings
  • In SMT, each thread can read any physical register
  • Thread spawning requires a copy of the register mappings at that branch
  • A copy involves a transfer of 32 x 9 bits – the new thread cannot begin renaming until this copy is complete – the copy may also hold up the primary thread if map table read ports are scarce
  • Alternatively, every new mapping can be placed on a bus, and idle threads can snoop and keep pace
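The bus-snooping alternative can be sketched like this: instead of a bulk copy at spawn time, the primary thread broadcasts each new mapping as it renames, and idle contexts apply the updates so their maps are always current. The `MapBus` name and list-based maps are illustrative assumptions, not the paper's mechanism.

```python
class MapBus:
    """Toy model of a mapping bus: idle contexts snoop the primary's renames."""

    def __init__(self):
        self.snoopers = []  # rename maps (lists) of idle contexts keeping pace

    def attach(self, rename_map):
        """An idle context starts snooping (its map must already match)."""
        self.snoopers.append(rename_map)

    def broadcast(self, arch_reg, phys_reg):
        """The primary thread publishes every new mapping it creates."""
        for snooped_map in self.snoopers:
            snooped_map[arch_reg] = phys_reg
```

Because the idle context's map never goes stale, a spawn at a low-confidence branch needs no 32-entry copy and no extra map-table read ports.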

  8. Spawning Algorithm

  9. Spawning Algorithm
  • When threads are idle, they keep pace and spawn a thread as soon as a low-confidence branch is encountered
  • When a thread context becomes free and a low-confidence checkpoint already exists, the new context synchronizes mappings with the primary context and executes the primary path, while the old primary context executes the alternate path after reinstating the checkpoint
  • If a newly idle thread has a low-confidence checkpoint, it starts executing the alternate path
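The two spawning cases above can be sketched as plain functions. This is a simplified model under stated assumptions: contexts are dictionaries, and the function names, field names, and FIFO ordering of pending checkpoints are all hypothetical.

```python
def spawn_on_low_confidence(idle_contexts, checkpoint):
    """Case 1: an idle context that has kept pace takes the alternate path
    immediately at a low-confidence branch."""
    if not idle_contexts:
        return None  # no free context; the checkpoint stays pending
    ctx = idle_contexts.pop()
    ctx["map"] = dict(checkpoint["map"])  # start from the branch's mappings
    ctx["path"] = "alternate"
    return ctx


def on_context_freed(freed, primary, pending_checkpoints):
    """Case 2: a newly freed context synchronizes with the primary and takes
    over the primary path, while the old primary context reinstates the
    pending low-confidence checkpoint and runs the alternate path."""
    if not pending_checkpoints:
        return primary  # nothing pending; roles are unchanged
    chk = pending_checkpoints.pop(0)
    freed["map"] = dict(primary["map"])   # synchronize mappings
    freed["path"] = "primary"
    primary["map"] = dict(chk["map"])     # reinstate the checkpoint
    primary["path"] = "alternate"
    return freed                          # the new primary context
```

Swapping roles in case 2 avoids copying state out of the old primary: its map is already positioned at the checkpointed branch, so it is the cheaper context to redirect down the alternate path.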

  10. Introduced Complexity
  • Book-keeping to manage checkpoint locations – every branch has to track the location of its checkpoint
  • Who frees a register value?
  • What about memory dependences?
  • Loads can ignore stores that are not predecessors
  • Maintain an array of bits to represent the path taken (each basic block corresponds to a bit in the array)
  • Check for memory dependences only if the store’s path is a subset of the load’s path
  [Figure: successive writes to r1 mapped to p5, p7, p8 along different paths]
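The path-subset test reduces to one bitwise operation once each path is a bit vector with one bit per basic block. The function name is an assumption; the subset logic follows directly from the slide.

```python
def must_check_dependence(store_path_bits, load_path_bits):
    """True when every basic block on the store's path is also on the load's
    path, i.e. the store is a predecessor of the load and its value may be
    forwarded; otherwise the store is on a different speculative path and
    the load can ignore it."""
    return store_path_bits & load_path_bits == store_path_bits
```

For instance, a store on path `0b0011` is a predecessor of a load on path `0b0111`, but a store on path `0b0100` is not a predecessor of a load on path `0b0011`.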

  11. Processor Parameters
  • Eight-wide processor with up to eight contexts; each context has eight checkpoints
  • 32-entry issue queues, 4Kb gshare branch predictor, 7-cycle mispredict penalty, memory latency of 62 cycles
  • ICOUNT 2.8: the first thread can bring in up to 8 instrs and the second thread fills in unused slots; occupancy in the front-end determines priority
  • Focus on branch-limited programs: compress (20%), gcc (18%), go (30%), li (6%)

  12. Results: Spare Contexts

  13. Results: Bus Latency

  14. Results: Branch Confidence

  15. Results: Path Selection

  16. Results: Fetch Policy

  17. Results: Mpred Penalty

  18. Conclusions
  • Too much complexity/power overhead, too little benefit?
  • Benefits may be higher for deeper pipelines; larger windows (this paper evaluates 8 windows of 48 instrs; does 2 x 192 yield better results?); longer memory latencies
  • There is room for improvement with better branch confidence metrics
  • CMPs will incur greater cost during thread spawning, but may be more power-efficient

