Simultaneous Multithreading (SMT)


Presentation Transcript


  1. Simultaneous Multithreading (SMT) • An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims to reduce resource waste in wide-issue processors. • SMT has the potential to greatly enhance processor computational capabilities by: • Exploiting thread-level parallelism (TLP): simultaneously executing instructions from different threads during the same cycle. • Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.

  2. Microprocessor Architecture Trends

  3. Performance Increase of Workstation-Class Microprocessors, 1987-1997 (chart: Integer SPEC92 performance)

  4. Microprocessor Logic Density • Transistor counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million. • Moore's Law: 2X transistors/chip every 1.5 years.

  5. Increase of Capacity of VLSI Dynamic RAM Chips
      Year    Size (Megabit)
      1980    0.0625
      1983    0.25
      1986    1
      1989    4
      1992    16
      1996    64
      1999    256
      2000    1024
      Growth of about 1.55X/year, i.e., doubling every 1.6 years.

  6. CPU Architecture Evolution: Single Threaded Pipeline • Traditional 5-stage pipeline. • Increases Throughput: Ideal CPI = 1

  7. CPU Architecture Evolution: Superscalar Architectures • Fetch, decode, execute, etc. more than one instruction per cycle (CPI < 1). • Limited by instruction-level parallelism (ILP).

  8. Advanced CPU Architectures: Traditional Multithreaded Processor • Multiple HW contexts (PC, SP, and registers). • One context gets the CPU for x cycles at a time. • Limited by thread-level parallelism (TLP).
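
A minimal sketch, assuming a round-robin policy and an illustrative time slice, of the execution model described above, in which only one hardware context uses the pipeline at a time:

    #include <stdio.h>

    #define NUM_CONTEXTS 4
    #define TIME_SLICE   16   /* cycles each context keeps the CPU (illustrative) */

    int main(void) {
        int current = 0;                       /* active hardware context */
        for (long cycle = 0; cycle < 64; cycle++) {
            if (cycle > 0 && cycle % TIME_SLICE == 0)
                current = (current + 1) % NUM_CONTEXTS;   /* hardware context switch */
            /* Only 'current' may issue this cycle; the other contexts idle,
             * so per-cycle issue is still bounded by one thread's ILP. */
            printf("cycle %ld: context %d owns the pipeline\n", cycle, current);
        }
        return 0;
    }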

  9. Advanced CPU Architectures: VLIW: Intel/HP Explicitly Parallel Instruction Computing (EPIC) • Strengths: • Allows for a high level of instruction-level parallelism (ILP). • Takes much of the dependency analysis out of HW and places the burden on smart compilers. • Weaknesses: • Keeping Functional Units (FUs) busy (control hazards). • Static FU scheduling limits performance gains.

  10. Advanced CPU Architectures: Single Chip Multiprocessor • Strengths: • Create a single processor block and duplicate it across the chip. • Takes much of the dependency analysis out of HW and places the burden on smart compilers. • Weakness: • Performance limited by individual thread performance (ILP).

  11. Advanced CPU Architectures: Single Chip Multiprocessor

  12. SMT: Simultaneous Multithreading • Multiple hardware contexts running at the same time (HW context: registers, PC, and SP). • Avoids both horizontal and vertical waste by having multiple threads keep functional units busy during every cycle. • Builds on top of current, time-proven advancements in CPU design: superscalar execution, dynamic scheduling, hardware speculation, and dynamic HW branch prediction. • Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
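
As an illustration only, a hardware context as defined on this slide (registers, PC, and SP; slide 33 gives 32 integer + 32 FP registers per context) might be modeled like this; the field names and the count of four contexts are assumptions:

    #include <stdint.h>

    /* One SMT hardware context (PC, SP, registers); the 32 integer + 32 FP
     * registers follow slide 33, and the four-context count is assumed. */
    typedef struct {
        uint64_t pc;              /* program counter */
        uint64_t sp;              /* stack pointer   */
        uint64_t int_regs[32];    /* integer register file slice */
        double   fp_regs[32];     /* FP register file slice      */
        int      active;          /* context currently holds a runnable thread */
    } hw_context;

    hw_context contexts[4];       /* an SMT core exposes several contexts at once */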

  13. SMT • With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions can be hidden. • Pipelines are separate (per thread) until the issue stage. • Functional units are shared among all contexts during every cycle. • More complicated writeback stage. • More threads issuing to the functional units results in higher resource utilization.
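
A minimal sketch (not the actual hardware) of the issue behavior described above, assuming an 8-wide issue stage and a simple per-context count of ready instructions:

    #define NUM_CONTEXTS 4
    #define ISSUE_WIDTH  8   /* issue slots per cycle (assumed) */

    /* Instructions ready to issue per context this cycle (illustrative
     * values; thread 1 is stalled on, say, a cache miss). */
    static int ready[NUM_CONTEXTS] = {3, 0, 2, 4};

    /* One SMT issue cycle: slots are filled from every context, so a
     * stalled thread leaves no functional unit idle while others have work. */
    int issue_cycle(void) {
        int slots = ISSUE_WIDTH, issued = 0;
        for (int ctx = 0; ctx < NUM_CONTEXTS && slots > 0; ctx++) {
            while (slots > 0 && ready[ctx] > 0) {
                ready[ctx]--;
                slots--;
                issued++;
            }
        }
        return issued;   /* 8 here: 3 + 0 + 2 + three of thread 3's four */
    }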

  14. SMT: Simultaneous Multithreading

  15. The Power of SMT (issue-slot diagram) • Time (processor cycles) runs down the page; each row of squares represents the issue slots of one cycle. • A box labeled x: an instruction issued from thread x. • An empty box: the slot is wasted. • Panels compare Superscalar, Traditional Multithreaded, and Simultaneous Multithreading.
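
Assuming the definitions implied by the diagram (vertical waste: cycles in which no instruction issues; horizontal waste: unused slots in cycles that do issue), a small sketch of how the waste rates reported on slides 31-32 could be computed from a per-cycle issue trace; the trace values and issue width are illustrative:

    #include <stdio.h>

    #define ISSUE_WIDTH 8

    int main(void) {
        int issued[] = {3, 0, 5, 8, 0, 2, 6, 1};   /* instructions issued per cycle */
        int cycles = sizeof issued / sizeof issued[0];
        int horizontal = 0, vertical = 0;

        for (int c = 0; c < cycles; c++) {
            if (issued[c] == 0)
                vertical += ISSUE_WIDTH;               /* whole cycle wasted     */
            else
                horizontal += ISSUE_WIDTH - issued[c]; /* leftover slots wasted  */
        }

        int total = cycles * ISSUE_WIDTH;
        printf("horizontal waste rate: %.1f%%\n", 100.0 * horizontal / total);
        printf("vertical waste rate:   %.1f%%\n", 100.0 * vertical / total);
        return 0;
    }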

  16. SMT Performance Comparison • Instruction throughput from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads:

      Multiprogramming workload
      Threads   Superscalar   Traditional Multithreading   SMT
      1         2.7           2.6                          3.1
      2         -             3.3                          3.5
      4         -             3.6                          5.7
      8         -             2.8                          6.2

      Parallel workload
      Threads   Superscalar   MP2   MP4   Traditional Multithreading   SMT
      1         3.3           2.4   1.5   3.3                          3.3
      2         -             4.3   2.6   4.1                          4.7
      4         -             -     4.2   4.2                          5.6
      8         -             -     -     3.5                          6.1

  17. SMT Instruction Scheduling Methods • Round Robin: • Instructions from Thread 1, then Thread 2, then Thread 3, etc. • I-Count: • Highest priority is assigned to the thread with the fewest instructions in the static (pre-issue) portion of the pipeline. • Other: • Branch First: branch instructions are issued first. • Spec Last: speculative instructions are given the lowest priority.
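
A small sketch of the I-Count selection described above; the data structures are assumptions, not the actual design:

    #define NUM_CONTEXTS 4

    /* Instructions currently in the static (pre-issue) pipeline stages,
     * tracked per thread (assumed bookkeeping). */
    int preissue_count[NUM_CONTEXTS];

    /* I-Count: give priority to the thread with the fewest instructions
     * queued before the issue stage. */
    int icount_pick(void) {
        int best = 0;
        for (int t = 1; t < NUM_CONTEXTS; t++)
            if (preissue_count[t] < preissue_count[best])
                best = t;
        return best;   /* highest-priority thread this cycle */
    }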

  18. SMT Performance Example • Functional units: • 4 integer ALUs (1-cycle latency) • 1 integer multiplier/divider (3-cycle latency) • 3 memory ports (2-cycle latency, assume cache hit) • 2 FP ALUs (5-cycle latency) • Assume all functional units are fully pipelined.

      Inst   Code             Description       Functional unit
      A      LUI R5,100       R5 = 100          Int ALU
      B      FMUL F1,F2,F3    F1 = F2 x F3      FP ALU
      C      ADD R4,R4,8      R4 = R4 + 8       Int ALU
      D      MUL R3,R4,R5     R3 = R4 x R5      Int mul/div
      E      LW R6,R4         R6 = (R4)         Memory port
      F      ADD R1,R2,R3     R1 = R2 + R3      Int ALU
      G      NOT R7,R7        R7 = !R7          Int ALU
      H      FADD F4,F1,F2    F4 = F1 + F2      FP ALU
      I      XOR R8,R1,R7     R8 = R1 XOR R7    Int ALU
      J      SUBI R2,R1,4     R2 = R1 - 4       Int ALU
      K      SW ADDR,R2       (ADDR) = R2       Memory port
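
The functional unit mix and latencies above could be recorded in a table such as the following sketch (field names are illustrative):

    /* Functional unit classes from this example: counts and latencies.
     * All units are fully pipelined (a new operation may start each cycle). */
    typedef struct {
        const char *name;
        int count;     /* number of units of this class */
        int latency;   /* cycles from issue to result   */
    } fu_class;

    static const fu_class fu_pool[] = {
        { "Int ALU",     4, 1 },  /* 4 integer ALUs, 1-cycle latency         */
        { "Int mul/div", 1, 3 },  /* 1 integer multiplier/divider, 3 cycles  */
        { "Memory port", 3, 2 },  /* 3 memory ports, 2 cycles (cache hit)    */
        { "FP ALU",      2, 5 },  /* 2 FP ALUs, 5-cycle latency              */
    };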

  19. SMT Performance Example (continued) • 2 additional cycles to complete program 2 • Throughput: • Superscalar: 11 inst/7 cycles = 1.57 IPC • SMT: 22 inst/9 cycles = 2.44 IPC

  20. Simulator (sim-SMT) @ RIT CE • Execution-driven performance simulator. • Derived from the SimpleScalar tool set. • Simulates caches, branch prediction, and five pipeline stages. • Flexible: • A configuration file controls cache sizes, buffer sizes, and the number of functional units. • A cross compiler is used to generate SimpleScalar assembly language. • Binary utilities, compiler, and assembler are available. • The standard C library (libc) has been ported.
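
The slides do not show the actual sim-SMT configuration file format; purely as a hypothetical illustration, the parameters it controls might be grouped as follows:

    /* Hypothetical sketch of the knobs the sim-SMT configuration file
     * exposes (the real file format is not shown on the slides). */
    typedef struct {
        int icache_kb;        /* instruction cache size (KB) */
        int dcache_kb;        /* data cache size (KB)        */
        int fetch_buffer;     /* fetch/issue buffer entries  */
        int num_int_alu;      /* integer ALUs                */
        int num_int_muldiv;   /* integer multiplier/dividers */
        int num_mem_ports;    /* memory ports                */
        int num_fp_alu;       /* FP ALUs                     */
        int num_contexts;     /* hardware thread contexts    */
    } smt_config;

    /* Slide 22's experiment (adding one integer multiplier/divider) would
     * then amount to changing a single field, e.g. num_int_muldiv = 2. */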

  21. Simulator Memory Address Space

  22. Alternate Functional Unit Configurations • New functional unit configurations were attempted (by adding one of each type of FU): • +1 integer multiplier/divider: • +2.8% IPC and issue rate • 74% fewer occurrences of no FU being available • The simulator is very flexible (only one line in the configuration file required a change).

  23. Sim-SMT Simulator Limitations • Does not maintain precise exceptions. • Instructions executed within system calls are not tracked. • Limited memory space: • Four test programs' memory spaces run within one simulator memory space. • Easy to run out of stack space.

  24. Simulation Runs & Results • Test programs used: • Newton interpolation. • Matrix solver using LU decomposition. • Integer test program. • FP test program. • Simulations of a single program: • 1, 2, and 4 threads. • System simulations involve a combination of all programs running simultaneously: • Several different combinations were run. • From simulation results: • Performance increase: • The biggest increase occurs when going from one to two threads. • Higher issue rate and functional unit utilization.

  25. Simulation Results: Performance (IPC)

  26. Simulation Results: Simulation Time

  27. Simulation Results: Instruction Issue Rate

  28. Simulation Results: Performance vs. Issue Bandwidth

  29. Simulation Results: Functional Unit Utilization

  30. Simulation Results: No Functional Unit Available

  31. Simulation Results: Horizontal Waste Rate

  32. Simulation Results: Vertical Waste Rate

  33. SMT: Simultaneous Multithreading • Strengths: • Overcomes the limitations imposed by low single-thread instruction-level parallelism. • Multiple running threads hide individual control hazards (branch mispredictions). • Weaknesses: • Additional stress placed on the memory hierarchy. • Control unit complexity. • Sizing of resources (cache, branch prediction, etc.). • Accessing registers (32 integer + 32 FP for each HW context): • Some designs devote two clock cycles each to register reads and register writes.

  34. SMT: Simultaneous Multithreading Kernel Code • Many, if not all, benchmarks involve only limited interaction with kernel code. • How can kernel overhead (context switching, process management, etc.) be minimized? • CHAOS (Context Hardware Accelerated Operating System). • Introduce a lightweight, dedicated kernel context to handle process management: • With 4 contexts there is a good chance that one of them will continue to run, so why take an (expensive) chance on swapping it out when the swapper (process management) will bring it right back in?

  35. SMT & Technology • SMT has not yet been implemented in any existing commercial microprocessor (first 4-thread SMT CPU: Alpha EV8, ~2001). • Current technology has the potential for 4-8 simultaneous threads: • Based on transistor count and design complexity.

  36. RIT-CE SMT Project Goals • Investigate performance gains from exploiting Thread-Level Parallelism (TLP) in addition to current Instruction-Level Parallelism (ILP) in processor design. • Design and simulate an architecture incorporating Simultaneous Multithreading (SMT). • Study operating system and compiler modifications needed to support SMT processor architectures. • Define a standard interface for efficient SMT-processor/OS kernel interaction. • Modify an existing OS kernel (Linux?) to take advantage of hardware multithreading capabilities. • Long term: VLSI implementation of an SMT prototype.

  37. Current Project Status • Architecture/OS interface definition. • Study of design alternatives and impact on performance. • SMT Simulator Development: • System call development, kernel support, and compiler/assembler changes. • Development of code (programs and OS kernel) is key to getting results.

  38. Short-Term Project Chart

  39. Current/Future Project Goals • SMT simulator completion, refinement, and further testing. • Development of an SMT-capable OS kernel. • Extensive performance studies with various workloads using the simulator/OS/compiler: • Suitability for fine-grained parallel applications? • Effect on multimedia applications? • Architectural changes based on benchmarks. • Investigation of cache impact on SMT performance. • Investigation of an in-order SMT processor (C or VHDL model). • MOSIS Tiny Chip (partial/full) implementation. • Investigate the suitability of SMT processors as building blocks for MPPs.
