Hardware Multithreading COMP25212
Increasing CPU Performance
• By increasing clock frequency – pipelining
• By increasing Instructions per Clock – superscalar
• Minimizing memory access impact – caches
• Maximizing pipeline utilization – branch prediction
• Maximizing pipeline utilization – forwarding
• Maximizing instruction issue – dynamic scheduling
Increasing Parallelism
• The amount of parallelism that we can exploit is limited by the programs
• Some areas exhibit great parallelism
• Some others are essentially sequential
• In the latter case, where can we find additional independent instructions?
• In a different program!
Software Multithreading - Revision
• Modern Operating Systems allow several processes/threads to run concurrently
• Transparent to the user – all of them appear to be running at the same time
• BUT, actually, they are scheduled (and interleaved) by the OS
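A minimal sketch of this in C with POSIX threads (the thread names and iteration counts are purely illustrative): both threads appear to run at the same time, but the OS decides how their execution interleaves.

/* Two software threads scheduled transparently by the OS. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    const char *name = (const char *)arg;
    for (int i = 0; i < 3; i++)
        printf("%s: iteration %d\n", name, i); /* interleaving chosen by the OS */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, "thread 0");
    pthread_create(&t1, NULL, worker, "thread 1");
    pthread_join(t0, NULL); /* wait for both threads to terminate */
    pthread_join(t1, NULL);
    return 0;
}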
OS Thread Switching - Revision
[Diagram: execution alternates between Thread T0 and Thread T1; on each switch the Operating System saves the running thread’s state into its PCB and loads the other thread’s state from its PCB, so one thread executes while the other waits.]
COMP25111 – Lect. 5
Process Control Block (PCB) - Revision
PCBs store information about the state of ‘alive’ processes handled by the OS:
• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID
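As a rough illustration, a PCB can be sketched as a C struct mirroring the list above. The field names and sizes are assumptions for illustration only, and some fields (network connections, file positions) are omitted; a real kernel structure, e.g. Linux’s task_struct, holds far more.

#include <stdint.h>

/* Simplified Process Control Block (illustrative field names). */
struct pcb {
    int      pid;            /* Process ID */
    int      parent_pid;     /* Parent Process ID */
    int      state;          /* Process State (see the state diagram below) */
    uint64_t pc;             /* saved Program Counter */
    uint64_t sp;             /* saved Stack Pointer */
    uint64_t regs[32];       /* saved General Registers */
    void    *mm_info;        /* Memory Management Info (e.g. page tables) */
    int      open_files[16]; /* Open File List (positions omitted here) */
    uint64_t cpu_time_used;  /* CPU time used, for accounting */
};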
OS Process States - Revision
[Diagram: New → Ready (waiting for a CPU); Ready → Running on a CPU when Dispatched; Running → Ready when Pre-empted; Running → Blocked waiting for event on a Wait (e.g. I/O); Blocked → Ready when the Event occurs; Running → Terminated.]
COMP25111 – Lect. 5
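The states and transitions in the diagram can be captured as a C enum (a sketch; real OSes use more states than these):

/* OS process states; the transitions from the diagram are in comments. */
enum proc_state {
    PROC_NEW,        /* just created, not yet admitted */
    PROC_READY,      /* waiting for a CPU; Dispatched -> RUNNING */
    PROC_RUNNING,    /* on a CPU; Pre-empted -> READY, Wait -> BLOCKED */
    PROC_BLOCKED,    /* waiting for an event; Event occurs -> READY */
    PROC_TERMINATED  /* finished */
};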
Hardware Multithreading
• Allow multiple threads to share a single processor
• Requires replicating the independent state of each thread
• Virtual memory can be used to share memory among threads
CPU Support for Multithreading
[Diagram: a pipeline with Fetch, Decode, Exec, Mem and Write logic shared by two threads; the program counters (PCA, PCB), register files (RegA, RegB) and virtual-address mappings (VA MappingA, VA MappingB) are replicated per thread, while address translation and the instruction and data caches are shared.]
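A minimal sketch of the per-thread state that the hardware must replicate (field names are illustrative); everything else in the diagram – the caches and the pipeline logic – is shared between the threads:

#include <stdint.h>

/* One copy of this per hardware thread (e.g. A and B above). */
struct hw_thread_context {
    uint64_t pc;       /* per-thread program counter (PCA / PCB) */
    uint64_t regs[32]; /* per-thread register file (RegA / RegB) */
    uint64_t asid;     /* per-thread VA mapping / address-space ID */
    int      ready;    /* can this thread issue instructions? */
};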
Hardware Multithreading Issues
• How HW MT is presented to the OS
• Normally present each hardware thread as a virtual processor (Linux, UNIX, Windows)
• Requires multiprocessor support from the OS
• Needs to share or replicate resources
• Registers – normally replicated
• Caches – normally shared
• Each thread will use a fraction of the cache
• Cache thrashing issues – harm performance
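Because each hardware thread is presented as a virtual processor, the OS simply counts it as another logical CPU. A small sketch of how that shows up on Linux/UNIX:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Counts logical CPUs, i.e. cores times hardware threads per core. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical CPUs: %ld\n", n);
    return 0;
}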
Example of Thrashing - Revision
[Diagram: in a direct-mapped cache, two addresses with the same index bits map to the same cache line, so each access evicts the other.]
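A minimal sketch of why two addresses thrash; the cache geometry (64-byte lines, 256 sets) is an illustrative assumption:

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_SETS  256

/* In a direct-mapped cache the index bits alone select the line. */
static unsigned cache_index(uint64_t addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void) {
    uint64_t a = 0x10000;                  /* accessed by one thread */
    uint64_t b = a + LINE_SIZE * NUM_SETS; /* accessed by the other thread */
    /* Same index, different tags: each access evicts the other's line. */
    printf("index(a)=%u index(b)=%u\n", cache_index(a), cache_index(b));
    return 0;
}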
Hardware Multithreading
• Different ways to exploit this new source of parallelism
• When & how to switch threads?
• Coarse-grain Multithreading
• Fine-grain Multithreading
• Simultaneous Multithreading
Coarse-Grain Multithreading
• Issue instructions from a single thread
• Operate like a simple pipeline
• Switch thread on an “expensive” operation:
• E.g. I-cache miss
• E.g. D-cache miss
Switch Threads on I-cache miss
• Remove Inst c and switch to the ‘grey’ thread
• The ‘grey’ thread will continue its execution until there is another I-cache or D-cache miss
Switch Threads on D-cache miss
[Diagram callout: “Abort these” – the instructions issued after Inst a]
• Remove Inst a and switch to the ‘grey’ thread
• Remove issued instructions from the ‘white’ thread
• Roll back the ‘white’ PC to point to Inst a
Coarse-Grain Multithreading
• Good at compensating for infrequent, but expensive, pipeline disruptions
• Minimal pipeline changes, but with some overheads:
• Need to abort all the instructions in the “shadow” of a D-cache miss – overhead
• Need to swap instruction streams – overhead
• Data/control hazards are not solved
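A minimal sketch of the switch-on-miss policy described above (purely illustrative, not any particular processor’s logic); flush_younger_than is a stand-in for the hardware that squashes the instructions in the miss shadow:

#include <stdint.h>

typedef struct {
    uint64_t pc;
    int      ready; /* 0 while a cache miss is outstanding */
} hw_thread;

static int current = 0; /* the thread that currently owns the pipeline */

/* Stub: in hardware this squashes in-flight instructions younger than pc. */
static void flush_younger_than(uint64_t pc) { (void)pc; }

void on_cache_miss(hw_thread t[2], int is_dcache_miss, uint64_t miss_pc) {
    if (is_dcache_miss) {
        flush_younger_than(miss_pc); /* abort the miss-shadow instructions */
        t[current].pc = miss_pc;     /* roll the PC back to the missing inst */
    }
    t[current].ready = 0;            /* sleep until the cache line arrives */
    if (t[1 - current].ready)
        current = 1 - current;       /* hand the pipeline to the other thread */
}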
Fine-Grain Multithreading
• Interleave the execution of several threads
• Usually using Round Robin among all the ready hardware threads
• Requires instantaneous thread switching
• Complex hardware
Fine-Grain Multithreading
Multithreading helps alleviate fine-grain dependencies (e.g. could it remove the need for forwarding?)
I-cache misses in Fine-Grain Multithreading
• Inst b is removed and the ‘white’ thread is marked as not ‘ready’
• The ‘white’ thread is not ready, so ‘grey’ is executed
• An I-cache miss is overcome transparently
D-cache misses in Fine-Grain Multithreading
• The ‘white’ thread is marked as not ‘ready’. Remove Inst b. Update the PC.
• The ‘white’ thread is not ready, so ‘grey’ is executed
• Mark the thread as not ‘ready’ and issue only from the other thread(s)
Fine-Grain Multithreading in Out-of-Order Processors
• In an out-of-order processor we may continue issuing instructions from both threads
• Unless the O-o-O algorithm stalls one of the threads
Fine-Grain Multithreading
• Utilization of pipeline resources is increased, i.e. better overall performance
• The impact of short stalls is alleviated by executing instructions from other threads
• Single-thread execution is slowed
• Requires an instantaneous thread-switching mechanism
• Expensive in terms of hardware
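A minimal sketch of fine-grain, round-robin selection among the ready hardware threads (called once per cycle; all names and the thread count are illustrative):

#define NTHREADS 4

typedef struct {
    unsigned long pc;
    int ready; /* cleared on a cache miss, set again when it resolves */
} hw_thread;

/* Pick the next ready thread after `last`; -1 means a pipeline bubble. */
int select_thread(const hw_thread t[NTHREADS], int last) {
    for (int i = 1; i <= NTHREADS; i++) {
        int cand = (last + i) % NTHREADS;
        if (t[cand].ready)
            return cand; /* fetch from this thread in this cycle */
    }
    return -1; /* every thread is stalled */
}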
Simultaneous Multi-Threading
• The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time
• In a superscalar processor, issue instructions from different threads in the same cycle
• Instructions from different threads can be using the same stage of the pipeline
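A minimal sketch of SMT issue (the issue width, thread count and rotating-priority policy are all illustrative assumptions): within one cycle, the issue slots can be filled with instructions from different threads.

#include <stdio.h>

#define ISSUE_WIDTH 4
#define NTHREADS    2

typedef struct { int ready_insts; } hw_thread; /* instructions ready to issue */

/* One cycle: fill up to ISSUE_WIDTH slots from whichever threads have work. */
int issue_cycle(hw_thread t[NTHREADS]) {
    int issued = 0;
    for (int slot = 0; slot < ISSUE_WIDTH; slot++) {
        for (int i = 0; i < NTHREADS; i++) {
            int tid = (slot + i) % NTHREADS; /* rotate priority across slots */
            if (t[tid].ready_insts > 0) {
                t[tid].ready_insts--;        /* issue one inst from thread tid */
                issued++;
                break;
            }
        }
    }
    return issued;
}

int main(void) {
    hw_thread t[NTHREADS] = { { 3 }, { 5 } };
    printf("issued %d instructions this cycle\n", issue_cycle(t));
    return 0;
}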
Simultaneous Multi-Threading
[Diagram: pipeline issue slots over successive cycles; some slots in a cycle are filled from the same thread, others from a different thread.]
SMT issues
• Asymmetric pipeline stall (from superscalar)
• One part of the pipeline stalls – we want the other pipeline to continue
• Overtaking – we want non-stalled threads to make progress
• Existing implementations are on O-o-O, register-renamed architectures (similar to Tomasulo)
• e.g. Intel Hyperthreading
SMT: Glimpse into the Future
• Scout threads
• A thread to prefetch memory – reduces cache miss overhead
• Speculative threads
• Allow a thread to execute speculatively way past a branch/jump/call/miss/etc.
• Needs revised O-o-O logic
• Needs extra memory support
Simultaneous Multi-Threading
• Extracts the most parallelism from instructions and threads
• Implemented only in out-of-order processors, because they are the only ones able to exploit that much parallelism
• Has a significant hardware overhead
Example
Suppose we want to execute 2 programs with 100 instructions each. The first program suffers an i-cache miss at instruction #30, and the second program another at instruction #70. Assume that:
+ There is enough parallelism to execute all instructions independently (no hazards)
+ Switching threads can be done instantaneously
+ A cache miss requires 20 cycles to get the instruction to the cache
+ The two programs would not interfere with each other’s cache lines
Calculate the execution time observed by each of the programs (cycles elapsed between the execution of the first and the last instruction of that application) and the total time to execute the workload:
a) Sequentially (no multithreading)
b) With coarse-grain multithreading
c) With fine-grain multithreading
d) With 2-way simultaneous multithreading
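As a sanity check on the method (not a full solution), here is a worked sketch of case (a), assuming an ideal 1-instruction-per-cycle pipeline in which a miss simply adds 20 stall cycles:

T(P1) = 100 instruction cycles + 20 stall cycles = 120 cycles
T(P2) = 100 instruction cycles + 20 stall cycles = 120 cycles
T(workload) = 120 + 120 = 240 cycles

For case (b), note that under these assumptions each thread’s 20-cycle miss can be completely hidden by running the other thread, so the whole workload needs only its 200 instruction cycles in total.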
Benefits of Hardware Multithreading
• Multithreading techniques improve the utilisation of processor resources and, hence, the overall performance
• If the different threads are accessing the same input data, they may be using the same regions of memory
• Cache efficiency improves in these cases
Disadvantages of Hardware Multithreading
• The single-thread performance may be degraded when compared with a single-threaded CPU
• Multiple threads interfering with each other
• Shared caches mean that, effectively, each thread uses only a fraction of the whole cache
• Thrashing may exacerbate this issue
• Thread scheduling at the hardware level adds high complexity to processor design
• Thread state, managing priorities, OS-level information, …
Multithreading Summary
• A cost-effective way of finding additional parallelism for the CPU pipeline
• Available in x86, Itanium, Power and SPARC
• Presents each additional hardware thread as an additional virtual CPU to the Operating System
• Operating Systems Beware!!! (why?)
Comparison of Multithreading Techniques – 4-way superscalar
[Diagram: how coarse-grain, fine-grain and simultaneous multithreading fill the issue slots of a 4-way superscalar pipeline over time.]