
Hardware Multithreading




  1. Hardware Multithreading COMP25212

  2. Increasing CPU Performance • By increasing clock frequency – pipelining • By increasing Instructions per Clock – superscalar • Minimizing memory access impact – caches • Maximizing pipeline utilization – branch prediction • Maximizing pipeline utilization – forwarding • Maximizing instruction issue – dynamic scheduling

  3. Increasing Parallelism • The amount of parallelism that we can exploit is limited by the programs • Some areas exhibit great parallelism • Some others are essentially sequential • In the latter case, where can we find additional independent instructions? • In a different program!

  4. Software Multithreading - Revision • Modern Operating Systems support several processes/threads to be run concurrently • Transparent to the user – all of them appear to be running at the same time • BUT, actually, they are scheduled (and interleaved) by the OS

  5. OS Thread Switching - Revision [Diagram: threads T0 and T1 alternate between Exec and Wait; on each switch the OS saves the outgoing thread’s state into its PCB and loads the incoming thread’s state from the other PCB] (COMP25111 – Lect. 5)

  6. Process Control Block (PCB) - Revision PCBs store information about the state of ‘alive’ processes handled by the OS Process ID Process State PC Stack Pointer General Registers Memory Management Info Open File List, with positions Network Connections CPU time used Parent Process ID

  7. OS Process States - Revision [State diagram: New → Ready (waiting for a CPU); Ready → Running on a CPU (dispatched); Running → Blocked waiting for event (wait, e.g. I/O); Blocked → Ready (event occurs); Running → Ready (pre-empted); Running → Terminated] (COMP25111 – Lect. 5)

  8. Hardware Multithreading • Allow multiple threads to share a single processor • Requires replicating the independent state of each thread • Virtual memory can be used to share memory among threads

  9. CPU Support for Multithreading [Diagram: pipeline with per-thread state replicated — PCA/PCB in fetch, RegA/RegB register files, VA MappingA/MappingB for address translation — while the Fetch/Decode/Exec/Mem/Write logic and the instruction and data caches are shared]

  10. Hardware Multithreading Issues • How HW MT is presented to the OS • Normally present each hardware thread as a virtual processor (Linux, UNIX, Windows) • Requires multiprocessor support from the OS • Needs to share or replicate resources • Registers – normally replicated • Caches – normally shared • Each thread will use a fraction of the cache • Cache thrashing issues – harm performance

  11. Example of Thrashing - Revision [Diagram: two addresses with the same index map to the same line of a direct-mapped cache, repeatedly evicting each other]

  12. Hardware Multithreading • Different ways to exploit this new source of parallelism • When & how to switch threads? • Coarse-grain Multithreading • Fine-grain Multithreading • Simultaneous Multithreading

  13. Coarse-Grain Multithreading

  14. Coarse-Grain Multithreading • Issue instructions from a single thread • Operate like a simple pipeline • Switch Thread on “expensive” operation: • E.g. I-cache miss • E.g. D-cache miss

  15. Switch Threads on Icache miss • Remove Inst c and switch to ‘grey’ thread • ‘Grey’ thread will continue its execution until there is another I-cache or D-cache miss

  16. Switch Threads on Dcache miss • Remove Inst a and switch to ‘grey’ thread • Remove the issued instructions in the shadow of the miss from the ‘white’ thread (marked “Abort these” in the diagram) • Roll back ‘white’ PC to point to Inst a

  17. Coarse Grain Multithreading • Good to compensate for infrequent, but expensive, pipeline disruptions • Minimal pipeline changes • Need to abort all the instructions in the “shadow” of a Dcache miss → overhead • Swap instruction streams • Data and control hazards are not solved

  18. Fine-Grain Multithreading

  19. Fine-Grain Multithreading • Interleave the execution of several threads • Usually using Round Robin among all the ready hardware threads • Requires instantaneous thread switching • Complex hardware

  20. Fine-Grain Multithreading Multithreading helps alleviate fine-grain dependencies (e.g. those otherwise handled by forwarding)

  21. I-cache misses in Fine Grain Multithreading • Inst b is removed and ‘white’ is marked as not ‘ready’ • ‘White’ thread is not ready so ‘grey’ is executed • An I-cache miss is overcome transparently

  22. D-cache misses in Fine Grain Multithreading • ‘White’ marked as not ‘ready’. Remove Inst b. Update PC. • ‘White’ thread is not ready so ‘grey’ is executed • Mark the thread as not ‘ready’ and issue only from the other thread(s)

  23. Fine-Grain Multithreading in Out-of-order Processors • In an out-of-order processor we may continue issuing instructions from both threads • Unless the O-o-O algorithm stalls one of the threads

  24. Fine Grain Multithreading • Utilization of pipeline resources increased, i.e. better overall performance • Impact of short stalls is alleviated by executing instructions from other threads • Single-thread execution is slowed • Requires an instantaneous thread switching mechanism • Expensive in terms of hardware

  25. Simultaneous Multi-Threading

  26. Simultaneous Multi-Threading • The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time • In a superscalar processor, issue instructions from different threads in the same cycle • Instructions from different threads can be using the same stage of the pipeline

  27. Simultaneous Multi-Threading [Diagram: issue slots in each cycle filled with instructions from the same thread and from different threads]

  28. SMT issues • Asymmetric pipeline stall (from superscalar) • One part of pipeline stalls – we want the other pipeline to continue • Overtaking – want non-stalled threads to make progress • Existing implementations on O-o-O, register renamed architectures (similar to Tomasulo) • e.g. Intel Hyperthreading

  29. SMT: Glimpse into the Future • Scout threads • A thread to prefetch memory – reduce cache miss overhead • Speculative threads • Allow a thread to execute speculatively way past branches/jumps/calls/misses/etc. • Needs revised O-o-O logic • Needs extra memory support

  30. Simultaneous Multi-Threading • Extracts the most parallelism from instructions and threads • Implemented only in out-of-order processors because they are the only ones able to exploit that much parallelism • Has a significant hardware overhead

  31. Example Consider we want to execute 2 programs with 100 instructions each. The first program suffers an i-cache miss at instruction #30, and the second program suffers another at instruction #70. Assume that: + There is enough parallelism to execute all instructions independently (no hazards) + Switching threads can be done instantaneously + A cache miss requires 20 cycles to get the instruction to the cache + The two programs would not interfere with each other’s cache lines Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload: a) Sequentially (no multithreading), b) With coarse-grain multithreading, c) With fine-grain multithreading, d) With 2-way simultaneous multithreading.

  32. Summary of Hardware Multithreading

  33. Benefits of Hardware Multithreading • Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance • If the different threads are accessing the same input data they may be using the same regions of memory • Cache efficiency improves in these cases

  34. Disadvantages of Hardware Multithreading • The single-thread performance may be degraded compared with a single-thread CPU • Multiple threads interfering with each other • Shared caches mean that, effectively, threads would use a fraction of the whole cache • Thrashing may exacerbate this issue • Thread scheduling at hardware level adds high complexity to processor design • Thread state, managing priorities, OS-level information, …

  35. Multithreading Summary A cost-effective way of finding additional parallelism for the CPU pipeline Available in x86, Itanium, Power and SPARC Present each additional hardware thread as an additional virtual CPU to the Operating System Operating Systems Beware!!! (why?)

  36. Comparison of Multithreading Techniques – 4-way superscalar
