Multithreaded Processors

Multithreaded Processors Dezső Sima Spring 2007 (Ver. 2.1)  Dezső Sima, 2007

Overview 1. Introduction 2. Overview of multithreaded cores 3. Thread scheduling 4. Case examples 4.1. Coarse grained multithreaded cores 4.2. Fine grained multithreaded cores 4.3. SMT cores

1. Introduction

1. Introduction (1)
Aim of multithreading: to raise performance (beyond superscalar or EPIC execution) by introducing and utilizing parallelism at execution that is finer grained than multitasking.
Thread: a flow of control, i.e. a dynamic sequence of instructions to be executed.

1. Introduction (2)
Figure 1.1: Principle of sequential-, multitasked- and multithreaded programming (process/thread management example; figure labels: processes P1 to P3, threads T1 to T6, and the calls fork(), exec(), CreateProcess(), CreateThread() and join())

1. Introduction (3)
Main features of multithreading
Threads
• belong to the same process,
• usually share a common address space (otherwise multiple virtual-to-real address translation paths need to be maintained concurrently),
• are executed concurrently (simultaneously, i.e. overlapped by time sharing, or in parallel), depending on the implementation of multithreading.
Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.
Basic thread states
• thread program state (state of the ISA), including the PC, FX/FP architectural registers and state registers,
• thread microstate (supplementary state of the microarchitecture), including the rename register mappings, branch history, ROB etc.
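The shared address space and the basic thread management tasks (creation, termination) listed above can be illustrated with a minimal Python sketch. The names `run_shared_counter`, `worker` and `state` are illustrative assumptions, not part of the slides; the example only shows that all threads of a process see the same memory, so updates have to be synchronized.

```python
import threading


def run_shared_counter(num_threads: int = 4, increments: int = 1000) -> int:
    """Start num_threads threads that all update one shared counter."""
    state = {"counter": 0}      # lives in the single address space of the process
    lock = threading.Lock()     # required because every thread sees the same memory

    def worker() -> None:
        for _ in range(increments):
            with lock:
                state["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()               # thread creation
    for t in threads:
        t.join()                # wait for thread termination
    return state["counter"]


if __name__ == "__main__":
    print(run_shared_counter())
```

Because the counter is shared rather than copied per thread, the result equals the number of threads times the number of increments.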

1. Introduction (4)
Implementation of multithreading (while executing multithreaded apps/OSs)
Software multithreading
• Execution of multithreaded apps/OSs on a single threaded processor simultaneously (i.e. by time sharing).
• Multiple threads are maintained simultaneously by the OS.
• Provided by multithreaded OSs.
Hardware multithreading
• Execution of multithreaded apps/OSs on a multithreaded processor concurrently.
• Multiple threads are maintained concurrently by the processor.
• Provided by multithreaded processors.
Fast context switching between threads is required.

1. Introduction (5)
Multithreaded processors are implemented either as multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing) or as multithreaded cores.
(Figure labels: a chip with two cores versus a chip with an MT core; L2/L3; L3/Memory.)

1. Introduction (6)
Requirement of software multithreading
• Maintaining multiple thread program states concurrently by the OS, including the PC, FX/FP architectural registers and state registers.
Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including the PC, FX/FP architectural registers and state registers.
• Maintaining multiple thread microstates, pertaining to the rename register mappings, the RAS (Return Address Stack), the ROB etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer, the store queue and, in case of merged architectural and rename registers, appropriately large register files (FX/FP) etc.
Options to provide multiple states
• Implementing individual per thread structures, like 2 or 4 sets of FX registers,
• Implementing tagged structures, like a tagged ROB, a tagged buffer etc.
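The two options for providing multiple states can be contrasted in a small Python sketch. It is a simplified model under stated assumptions: the class names (`ThreadProgramState`, `RobEntry`, `TaggedRob`), the register count of 32 and the flush operation are hypothetical and do not describe any particular processor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadProgramState:
    """Option 1: an individual per-thread structure (one instance per hardware thread)."""
    pc: int = 0
    fx_regs: List[int] = field(default_factory=lambda: [0] * 32)  # 32 is an assumed size


@dataclass
class RobEntry:
    """Option 2: an entry of a shared structure, tagged with the owning thread."""
    thread_id: int
    instr: str


class TaggedRob:
    """A single ROB shared by all threads; the tag tells the entries apart."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[RobEntry] = []

    def allocate(self, thread_id: int, instr: str) -> bool:
        """Append an entry in program order; fail if the shared ROB is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(RobEntry(thread_id, instr))
        return True

    def flush_thread(self, thread_id: int) -> int:
        """Discard only the entries of one thread and return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e.thread_id != thread_id]
        return before - len(self.entries)
```

Replicated structures keep the threads fully isolated at the cost of area, whereas a tagged structure lets the threads compete for a common pool of entries.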

1. Introduction (7)

1. Introduction (8) Multithreaded OSs • Windows NT • OS/2 • Unix w/Posix • most OSs developed from the 90’s on

Introduction (9)
Figure 1.2: Contrasting sequential-, multitasked- and multithreaded execution (2) (table headings: Description, Key Features, Key Issues)

Introduction (10)
Figure 1.3: Contrasting sequential-, multitasked- and multithreaded execution (2) (table headings: OS Support, Performance Level, Software Development)

2. Overview of multithreaded cores

2. Overview of multithreaded cores (1)
Figure 2.1: Intel’s multithreaded desktop families (timeline 2002 to 2006; the QCMT and 8CMT rows are empty)
DCMT
• 5/05: Pentium EE 840 (Smithfield), 90 nm/2*103 mm2, 230 mtrs./130 W, 2-way MT/core
• 1/06: Pentium EE 955/965 (Presler), 65 nm/2*81 mm2, 2*188 mtrs./130 W, 2-way MT/core
SCMT
• 11/02: Pentium 4 (Northwood B), 130 nm/146 mm2, 55 mtrs./82 W, 2-way MT
• 02/04: Pentium 4 (Prescott), 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT

2. Overview of multithreaded cores (2)
Figure 2.2: Intel’s multithreaded Xeon DP-families (timeline 2002 to 2006; the QCMT and 8CMT rows are empty)
DCMT
• 10/05: Xeon DP 2.8 (Paxville DP), 90 nm/2*135 mm2, 2*169 mtrs./135 W, 2-way MT/core
• 6/06: Xeon 5000 (Dempsey), 65 nm/2*81 mm2, 2*188 mtrs./95/130 W, 2-way MT/core
SCMT (dates shown in this row: 2/02, 11/03, 6/04)
• Pentium 4 (Prestonia-A), 130 nm/146 mm2, 55 mtrs./55 W, 2-way MT
• Pentium 4 (Irwindale-A), 130 nm/135 mm2, 169 mtrs./110 W, 2-way MT
• Pentium 4 (Nocona), 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT

2. Overview of multithreaded cores (3)
Figure 2.3: Intel’s multithreaded Xeon MP-families (timeline 2002 to 2006; the QCMT and 8CMT rows are empty)
DCMT
• 11/05: Xeon 7000 (Paxville MP), 90 nm/2*135 mm2, 2*169 mtrs./95/150 W, 2-way MT/core
• 8/06: Xeon 7100 (Tulsa), 65 nm/435 mm2, 1328 mtrs./95/150 W, 2-way MT/core
SCMT
• 3/02: Pentium 4 (Foster-MP), 180 nm/n/a, 108 mtrs./64 W, 2-way MT
• 3/04: Pentium 4 (Gallatin), 130 nm/310 mm2, 178/286 mtrs./77 W, 2-way MT
• 3/05: Pentium 4 (Potomac), 90 nm/339 mm2, 675 mtrs./95/129 W, 2-way MT

2. Overview of multithreaded cores (4)
Figure 2.4: Intel’s multithreaded EPIC based server family (timeline 2002 to 2006; the SCMT, QCMT and 8CMT rows are empty)
DCMT
• 7/06: 9x00 (Montecito), 90 nm/596 mm2, 1720 mtrs./104 W, 2-way MT/core

2. Overview of multithreaded cores (5)
Figure 2.5: IBM’s multithreaded server families (timeline 2000 and 2004 to 2007; the QCMT and 8CMT rows are empty)
DCMT
• 5/04: POWER5, 130 nm/389 mm2, 276 mtrs./80 W (est.), 2-way MT/core
• 10/05: POWER5+, 90 nm/230 mm2, 276 mtrs./70 W, 2-way MT/core
• 2007: POWER6, 65 nm/341 mm2, 750 mtrs./~100 W, 2-way MT/core
SCMT
• RS 64 IV (Sstar), 180 nm/n/a, 44 mtrs./n/a, 2-way MT (date label in the figure: 5/04; the time axis also includes 2000)
• 2006: Cell BE PPE, 90 nm/221* mm2, 234* mtrs./95* W, 2-way MT (*: entire proc.)

2. Overview of multithreaded cores (6)
Figure 2.6: Sun’s and Fujitsu’s multithreaded server families (timeline 2004 to 2008; the SCMT row is empty)
8CMT
• 11/2005: UltraSPARC T1 (Niagara), 90 nm/379 mm2, 279 mtrs./63 W, 4-way MT/core
• 2007: UltraSPARC T2 (Niagara II), 65 nm/342 mm2, 72 W (est.), 8-way MT/core
QCMT
• 2008: APL SPARC64 VII (Jupiter), 65 nm/464 mm2, ~120 W, 2-way MT/core
DCMT
• 2007: APL SPARC64 VI (Olympus), 90 nm/421 mm2, 540 mtrs./120 W, 2-way MT/core

2. Overview of multithreaded cores (7)
Figure 2.7: RMI’s multithreaded XLR family (scalar RISC) (timeline 2002 to 2006; the SCMT, DCMT and QCMT rows are empty)
8CMT
• 5/05: XLR 5xx, 90 nm/~220 mm2, 333 mtrs./10-50 W, 4-way MT/core

2. Overview of multithreaded cores (8)
Figure 2.8: DEC’s/Compaq’s multithreaded processor (timeline 2002 to 2006; the DCMT, QCMT and 8CMT rows are empty)
SCMT
• 2003: Alpha 21464 (V8), 130 nm/n/a, 250 mtrs./10-50 W, 4-way MT (cancelled 6/2001)

2. Overview of multithreaded cores (9)
Classification by underlying core(s)
Scalar core(s)
• SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• RMI XLR 5xx (2005), 8 cores/4T
Superscalar core(s)
• IBM RS64 IV (2000) (SStar), single-core/2T
• Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
• DEC 21464 (2003), single-core/4T
• IBM POWER5 (2005), dual-core/2T
• PPE of Cell BE (2006), single-core/2T
• Fujitsu SPARC64 VI / VII, dual-core/quad-core/2T
VLIW core(s)
• SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
• Intel Montecito (2006), dual-core/2T

3. Thread scheduling

3. Thread scheduling (1)
Thread scheduling in software multithreading on a traditional superscalar processor
The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).
Figure 3.1: Thread scheduling assuming software multithreading on a 4-way superscalar processor (dispatch slots vs. clock cycles; Thread1, context switch, Thread2)
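The cost of software context switches under time sharing can be sketched with a simple cycle-level model. This is an illustrative assumption, not taken from the slides: the function names, the fixed time slice and the fixed switch cost are hypothetical, and real OS schedulers behave far less regularly.

```python
from typing import List, Sequence


def time_shared_timeline(threads: Sequence[str], slice_cycles: int,
                         switch_cycles: int, total_cycles: int) -> List[str]:
    """Label every clock cycle with the running thread or with 'switch'."""
    if not threads or slice_cycles < 1 or switch_cycles < 0:
        raise ValueError("need at least one thread, slice >= 1 and switch cost >= 0")
    timeline: List[str] = []
    current = 0
    while len(timeline) < total_cycles:
        timeline.extend([threads[current]] * slice_cycles)   # thread executes
        timeline.extend(["switch"] * switch_cycles)          # save/load thread state
        current = (current + 1) % len(threads)
    return timeline[:total_cycles]


def useful_fraction(timeline: Sequence[str]) -> float:
    """Fraction of cycles in which some thread (not the context switch) runs."""
    return sum(1 for label in timeline if label != "switch") / len(timeline)


if __name__ == "__main__":
    tl = time_shared_timeline(["T1", "T2"], slice_cycles=3, switch_cycles=1, total_cycles=8)
    print(tl, useful_fraction(tl))
```

In this model the dispatch slots stay empty during every switch cycle, which is the overhead that hardware-supported thread switching aims to reduce.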

3. Thread scheduling (2)
Thread scheduling in multicore processors (CMPs)
Both 4-way superscalar cores execute different threads independently.
Figure 3.2: Thread scheduling in a dual core processor (dispatch slots vs. clock cycles; Thread1 and Thread2, one per core)

3. Thread scheduling (3) Thread scheduling in multithreaded cores Coarse grained MT

3. Thread scheduling (4)
Threads are switched by means of rapid, HW-supported context switches.
Figure 3.3: Thread scheduling in a 4-way coarse grained multithreaded processor (dispatch/issue slots vs. clock cycles; Thread1, context switch, Thread2)

3. Thread scheduling (5)
Coarse grained MT
• Scalar based: none listed
• Superscalar based: IBM RS64 IV (2000) (SStar), single-core/2T
• VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006?), dual-core/2T

3. Thread scheduling (6) Thread scheduling in multithreaded cores Coarse grained MT Fine grained MT

3. Thread scheduling (7)
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.
Figure 3.4: Thread scheduling in a 4-way fine grained multithreaded processor (dispatch/issue slots vs. clock cycles; Thread1 to Thread4)
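The per-cycle thread selection of fine grained multithreading can be sketched as a small Python simulation with a round robin policy. The model is an assumption made for illustration: `remaining` (instructions left per thread) and `ilp` (instructions a thread can supply per cycle) are hypothetical inputs, and dependencies, latencies and stalls are ignored.

```python
from typing import Dict, List, Optional


def fine_grained_schedule(remaining: Dict[str, int], ilp: Dict[str, int],
                          width: int = 4) -> List[List[Optional[str]]]:
    """Return one row per clock cycle; each row lists the owner of every issue slot."""
    if width < 1 or any(ilp[t] < 1 for t in remaining):
        raise ValueError("width and every per-thread ILP must be at least 1")
    order = list(remaining)
    left = dict(remaining)
    grid: List[List[Optional[str]]] = []
    nxt = 0
    while any(left.values()):
        # Round robin: skip threads that have no instructions left.
        while left[order[nxt]] == 0:
            nxt = (nxt + 1) % len(order)
        thread = order[nxt]
        issued = min(width, ilp[thread], left[thread])
        left[thread] -= issued
        # All slots of this cycle belong to the selected thread; the rest stay empty.
        grid.append([thread] * issued + [None] * (width - issued))
        nxt = (nxt + 1) % len(order)
    return grid


if __name__ == "__main__":
    for row in fine_grained_schedule({"T1": 3, "T2": 2}, {"T1": 2, "T2": 1}):
        print(row)
```

Each row contains instructions of a single thread only, so slots that the selected thread cannot fill remain unused in that cycle.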

3. Thread scheduling (8)
Fine grained MT, classified by selection policy (round robin or priority based) and by underlying core (scalar, superscalar or VLIW based)
Examples:
• SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• PPE of Cell BE (2006), single-core/2T

3. Thread scheduling (9) Thread scheduling in multithreaded cores Coarse grained MT Fine grained MT Simultaneous MT (SMT)

3. Thread scheduling (10)
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
Figure 3.5: Thread scheduling in a 4-way simultaneous multithreaded processor (dispatch/issue slots vs. clock cycles; Thread1 to Thread4)
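Slot filling in an SMT core can be sketched in Python with a fixed priority order as the selection policy. This is a simplified model under stated assumptions: `remaining`, `ilp` and `priority` are hypothetical inputs, and the sketch ignores dependencies, latencies and resource conflicts.

```python
from typing import Dict, List, Optional, Sequence


def smt_schedule(remaining: Dict[str, int], ilp: Dict[str, int],
                 priority: Sequence[str], width: int = 4) -> List[List[Optional[str]]]:
    """Return one row per clock cycle; slots of a row may belong to different threads."""
    if width < 1 or set(priority) != set(remaining):
        raise ValueError("priority must list every thread exactly; width must be >= 1")
    if any(ilp[t] < 1 for t in remaining):
        raise ValueError("every per-thread ILP must be at least 1")
    left = dict(remaining)
    grid: List[List[Optional[str]]] = []
    while any(left.values()):
        row: List[Optional[str]] = []
        for thread in priority:                      # highest priority first
            issued = min(width - len(row), ilp[thread], left[thread])
            row.extend([thread] * issued)            # fill free slots of this cycle
            left[thread] -= issued
        grid.append(row + [None] * (width - len(row)))
    return grid


if __name__ == "__main__":
    for row in smt_schedule({"T1": 3, "T2": 2}, {"T1": 2, "T2": 1}, ["T1", "T2"]):
        print(row)
```

Unlike a scheme that selects a single thread per cycle, slots left free by one thread are offered to the next thread within the same cycle.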

3. Thread scheduling (11)
SMT cores
• Scalar based: none listed
• Superscalar based: Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-); DEC 21464 (2003), single-core/4T (canceled in 2001); IBM POWER5 (2005), dual-core/2T
• VLIW based: none listed

4. Case examples 4.1. Coarse grained multithreading 4.2. Fine grained multithreading 4.3. SMT multithreading

4.1 Coarse grained multithreaded processors 4.1.1. IBM RS64 IV 4.1.2. SUN MAJC 5200 4.1.3. Intel Montecito

4.1. Coarse grained multithreaded processors Thread scheduling in multithreaded cores Coarse grained MT Fine grained MT Simultaneous MT (SMT)

4.1.1. IBM RS 64 IV (1)
Microarchitecture: 4-way superscalar, dual-threaded.
Used in IBM’s iSeries and pSeries commercial servers.
Optimized for commercial server workloads, such as on-line transaction processing, Web-serving and ERP (Enterprise Resource Planning).
Characteristics of server workloads:
• large working sets,
• poor locality of references,
• frequently occurring task switches,
• high cache miss rates,
• memory bandwidth and latency strongly limit performance.
Resulting requirements: high instruction and data fetch bandwidth, large L1 $s, and multithreading to hide memory latency.

4.1.1. IBM RS 64 IV (2) Main microarchitectural features of the RS64 IV to support commercial workloads: • 128 KB L1 D$ and L1 I$, • instruction fetch width: 8 instr./cycle, • dual-threaded core.

4.1.1. IBM RS 64 IV (3)
Figure 4.1.1: Microarchitecture of IBM’s RS 64 IV (figure labels include the 6XX bus; IERAT: effective to real address translation cache, 2x64 entries)
Source: Borkenhagen J.M. et al., “A multithreaded PowerPC processor for commercial servers”, IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

4.1.1. IBM RS 64 IV (4)
Multithreading policy (strongly simplified)
Coarse grained MT with two threads (Ts): a foreground T and a background T.
• The foreground T executes until a long latency event, such as a cache miss or an IERAT miss, occurs.
• Subsequently, a T switch is performed and the background T begins to execute.
• After the long latency event is serviced, a T switch occurs back to the foreground T.
Both single threaded and multithreaded modes of execution are supported. Threads can be allocated different priorities by explicit instructions.
Implementation of multithreading
Dual architectural states are maintained for:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode reg.s, such as the MSR (machine state reg.),
• status and control reg.s, such as T priority.
Each T executes in its own effective address space (an unusual feature of multithreaded cores). Units used for address translation, such as the SRs (Segment Address Reg.s), need to be duplicated.
The Thread Switch Buffer holds up to 8 instructions from the background T, to shorten context switching by eliminating the latency of the I$.
Additional die area needed for multithreading: ~5 %.
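The strongly simplified foreground/background policy described above can be sketched as a cycle-by-cycle trace in Python. The sketch is a hypothetical model, not IBM's implementation: `fg_misses` maps a clock cycle to the latency of a long latency event that the foreground thread incurs in that cycle, and thread switch penalties, priorities and events of the background thread are ignored.

```python
from typing import Dict, List


def foreground_background_trace(fg_misses: Dict[int, int], total_cycles: int) -> List[str]:
    """Return 'FG' or 'BG' for every cycle, following the switch-on-event policy."""
    trace: List[str] = []
    resume_at = 0                 # first cycle in which the foreground thread may run
    for cycle in range(total_cycles):
        if cycle >= resume_at:
            trace.append("FG")    # foreground thread executes
            if cycle in fg_misses:
                # Long latency event: run the background thread until it is serviced.
                resume_at = cycle + 1 + fg_misses[cycle]
        else:
            trace.append("BG")    # background thread fills the waiting time
    return trace


if __name__ == "__main__":
    print(foreground_background_trace({2: 3}, total_cycles=8))
```

Entries of `fg_misses` that fall into cycles in which the background thread is running have no effect in this model, because the foreground thread is not executing then.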

4.1.1. IBM RS 64 IV (5)
Figure 4.1.2: Thread switch on data cache miss in IBM’s RS 64 IV (figure annotations: 2 instructions/cycle; 8 cycles penalty)
Source: Borkenhagen J.M. et al., “A multithreaded PowerPC processor for commercial servers”, IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

4.1.2. SUN MAJC 5200 (1)
Aim: dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture:
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs., e.g. 64 regs.,
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
Each processor is a 4-wide VLIW and can be 4-way multithreaded.

4.1.2. SUN MAJC 5200 (2) Figure 4.1.3: General view of SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

4.1.2. SUN MAJC 5200 (3) Figure 4.1.4: The principle of private, unified register files associated with each FU Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

4.1.2. SUN MAJC 5200 (4)
Threading
Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun).
Implementation of 4-way multithreading: by executing each T by one of the 4 FUs (“Vertical multithreading”).
Thread switch
Following a cache miss, the processor saves the T state and begins to process the next T.
Example
Comparison of program execution without and with multithreading on a 4-wide VLIW.
Considered program:
• it consists of 100 instructions,
• on average 2.5 instrs./cycle are executed,
• a cache miss occurs after every 20 instructions,
• latency of serving a cache miss: 75 cycles.
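The numbers of the example above can be turned into a back-of-the-envelope estimate. The following Python sketch is a simplified model of my own, not the calculation behind Sun's figures: it assumes that every 20th instruction (including the last one) causes a blocking miss, that thread switches cost nothing, and that the miss latencies of different threads overlap completely. The function names are hypothetical.

```python
def single_thread_cycles(instructions: int = 100, ipc: float = 2.5,
                         miss_interval: int = 20, miss_latency: int = 75) -> float:
    """Execution time if the processor stalls on every cache miss."""
    compute_cycles = instructions / ipc
    misses = instructions // miss_interval
    return compute_cycles + misses * miss_latency


def multithreaded_ipc(threads: int = 4, ipc: float = 2.5,
                      miss_interval: int = 20, miss_latency: int = 75) -> float:
    """Steady-state throughput when the processor switches threads on every miss."""
    burst = miss_interval / ipc                     # cycles a thread runs between misses
    # One period: either a thread's own run plus wait, or all bursts back to back.
    period = max(burst + miss_latency, threads * burst)
    return threads * miss_interval / period


if __name__ == "__main__":
    print(single_thread_cycles())        # total cycles of the single threaded run
    print(multithreaded_ipc(threads=1))  # effective instrs./cycle with one thread
    print(multithreaded_ipc(threads=4))  # effective instrs./cycle with four threads
```

Under these assumptions the single threaded run takes 40 compute cycles plus 5 × 75 stall cycles, i.e. 415 cycles, and four threads raise the effective throughput from 20/83 to 80/83 instructions per cycle, because three 8-cycle bursts of other threads fit into each 75-cycle miss.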

4.1.2. SUN MAJC 5200 (5) Figure 4.1.5: Execution for subsequent cache misses in a single threaded processor Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

4.1.2. SUN MAJC 5200 (6) Figure 4.1.6: Execution for subsequent cache misses in SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

4.1.3. Intel Montecito (1)
Aim: high end servers
Main enhancements of Montecito over Itanium2
• split L2 caches for data and instructions,
• larger unified L3 cache (for each core),
• duplicated architectural states are maintained for the FX/FP registers, the branch and predicate registers and the next address register,
• (Foxton technology for power management/frequency boost, planned but not implemented).
Additional support for dual-threading (duplicated microarchitectural states)
• the branch prediction structures provide T tagging,
• per thread return address stacks,
• per thread ALATs (Advanced Load Address Table).
Additional core area needed for multithreading: ~2 %.
