
12. Multithreaded Processors

Dezső Sima

Fall 2006

 D. Sima, 2006


Overview

1 Introduction

2 Overview

3 Coarse grain multithreading

4 Fine grain multithreading

5 Simultaneous multithreading


1. Introduction (1)

Aim of multithreading:

to raise performance compared to superscalar execution or multitasking by increasing parallelism during execution.

Thread: flow of control

Main features of multithreading:

Threads

  • belong to the same process,

  • share a common address space (usually; otherwise multiple virtual-to-real address translation paths need to be maintained in parallel),

  • are executed simultaneously (overlapped or in parallel).

Thread management:

  • creation, control and termination of threads,

  • maintaining multiple sets of thread states,

  • context switching between threads.
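
The thread features and management steps listed above can be illustrated with a minimal sketch in Python. It is only an illustration of software threads created through the standard `threading` module; the names `worker` and `run_threads` are hypothetical and do not come from the slides. All threads belong to one process and update the same variable, which is possible because they share one address space.

```python
import threading

counter = 0                  # lives in the address space shared by all threads
lock = threading.Lock()      # controls access to the shared variable

def worker(increments):
    """One flow of control; it updates the variable shared with its siblings."""
    global counter
    for _ in range(increments):
        with lock:
            counter += 1

def run_threads(n_threads=4, increments=1000):
    """Create, start and terminate (join) n_threads threads of one process."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(n_threads)]
    for t in threads:        # creation and start
        t.start()
    for t in threads:        # termination: wait until each thread has finished
        t.join()
    return counter
```

Without the lock the increments of different threads could interleave and be lost, which is one reason why thread control is part of thread management.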


1. Introduction (2)

Implementation of multithreading

(while executing multithreaded apps/OSs)

Software implementation:

  • Execution of multithreaded apps/OSs on a single-threaded processor by time sharing

  • Maintaining multiple threads concurrently by the OS

  • Multithreaded OSs

Hardware implementation:

  • Execution of multithreaded apps/OSs on a multithreaded processor concurrently

  • Maintaining multiple threads concurrently by the processor

  • Multithreaded processors

Fast context switching between threads required.


1. Introduction (3)

Basic options to implement multithreaded processors

  • Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing)

  • Multithreaded cores

[Figure: a chip with two cores and a chip with one multithreaded (MT) core, each attached to L2/L3 and L3/memory]


1. Introduction (4)

Requirement of software multithreading:

Maintaining multiple thread states concurrently by the OS, including:

PC, FX/FP registers, state registers

Core enhancements needed in case of multithreaded cores:

  • Maintaining multiple thread states, including:

PC, architectural registers, state registers

(in case of merged architectural and rename registers, by providing appropriately large register files (FX/FP))

  • Maintaining multiple thread microstates, pertaining to:

rename mappings, the RAS (Return Address Stack), ROB, etc.

  • Providing increased sizes for scarce or sensitive resources, such as:

the instruction buffer, store queue, etc.



1. Introduction (6)

Multithreaded OSs:

  • Windows NT

  • OS/2

  • Unix w/Posix

  • most OSs developed from the 90’s on


Principle of sequential, multitask and multithreaded programming

[Figure: Process / Thread Management Example. Labels in the figure: processes P1, P2, P3; threads T1 to T6; calls fork(), exec(), CreateProcess(), CreateThread(), join().]


Execution of sequential-, multitask- and multithreaded programs

[Table with the columns: Description, Key Advantages, Key Issues]


Implementation of multiprocessing and multithreading (2)

[Table with the columns: OS Support, Performance Level, Software Development]


2. Overview

2.1 Thread scheduling

Thread scheduling

while implementing software multithreading on a traditional superscalar processor

The execution of a new thread is initiated by a context switch

(needed to save the state of the suspended thread and to load the state of the thread to be executed).

Figure 2.1: Thread scheduling in a traditional superscalar processor

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
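
The save/load step of a software context switch can be sketched as follows. This is a simplified model under stated assumptions, not the mechanism of any particular OS: the thread state is reduced to a PC and four registers, and the names `ThreadState`, `Cpu` and `context_switch` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadState:
    """Saved state of a suspended thread (assumed: PC plus four registers)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)

class Cpu:
    """Single-threaded core: it holds the state of exactly one running thread."""
    def __init__(self):
        self.pc = 0
        self.regs = [0] * 4

    def save(self):
        return ThreadState(self.pc, list(self.regs))

    def load(self, state):
        self.pc = state.pc
        self.regs = list(state.regs)

def context_switch(cpu, table, old_id, new_id):
    """Save the state of the suspended thread, then load the thread to be executed."""
    table[old_id] = cpu.save()
    cpu.load(table[new_id])
```

Every switch copies the complete state in both directions, which is why software multithreading on a single-threaded processor depends on fast context switching.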


Thread scheduling in CMPs

Cores execute different threads independently.

Figure 2.2: Thread scheduling in a CMP

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading


2. Overview

Thread scheduling in multithreaded cores

Coarse grain MT


Threads are switched by means of rapid, HW-supported context switches.

Figure 2.3: Thread scheduling in a 4-way coarse grained multithreaded processor

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading


2. Overview

Thread scheduling in multithreaded cores

Coarse grain MT

Fine grain MT


The hardware thread scheduler chooses a thread in each cycle, and

instructions from this thread are dispatched/issued in this cycle.

Figure 2.4: Thread scheduling in a 4-way fine grained multithreaded processor

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
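
Per-cycle thread selection can be sketched with a small simulation. The round-robin policy, the function name `fine_grain_schedule` and the `unavailable` parameter are assumptions chosen for illustration; the slide itself only states that one thread is chosen in each cycle.

```python
def fine_grain_schedule(n_threads, n_cycles, unavailable=None):
    """Return the thread chosen in each cycle (None means a pipeline bubble).

    unavailable maps a thread id to the set of cycles in which that thread
    cannot issue (for example because it waits for memory).
    """
    unavailable = unavailable or {}
    schedule = []
    last = -1                                  # thread chosen most recently
    for cycle in range(n_cycles):
        chosen = None
        for offset in range(1, n_threads + 1):
            candidate = (last + offset) % n_threads
            if cycle not in unavailable.get(candidate, set()):
                chosen = candidate
                break
        schedule.append(chosen)
        if chosen is not None:
            last = chosen
    return schedule
```

With four threads that are always available the result is the repeating pattern 0, 1, 2, 3; a stalled thread is simply skipped, so its stall does not cost an issue cycle as long as another thread is available.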


2. Overview

Thread scheduling in multithreaded cores

Coarse grain MT

Fine grain MT

Simultaneous MT (SMT)


Available instructions from multiple threads (chosen according to an appropriate selection policy, such as the priority of the threads)

are dispatched/issued for execution in each cycle.

SMT: Proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).

Figure 2.5: Thread scheduling in a 4-way simultaneous multithreaded processor

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
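
The difference from fine grain multithreading is that the issue slots of a single cycle may be filled from several threads. The following sketch assumes a strict priority order among the threads; the function name `smt_issue` and its parameters are hypothetical.

```python
def smt_issue(ready, issue_width, priority):
    """Choose the instructions issued in one cycle of an SMT core.

    ready:       dict thread id -> number of instructions ready to issue
    issue_width: number of issue slots per cycle
    priority:    thread ids, highest priority first (assumed selection policy)
    Returns a dict thread id -> number of instructions issued this cycle.
    """
    issued = {}
    slots = issue_width
    for thread in priority:
        if slots == 0:
            break
        n = min(ready.get(thread, 0), slots)
        if n:
            issued[thread] = n
            slots -= n
    return issued
```

For a 4-wide core, `smt_issue({0: 1, 1: 3, 2: 2}, 4, [0, 1, 2])` fills one slot from thread 0 and three from thread 1, so slots left empty by one thread are used by another within the same cycle.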


2.2 Overview of multithreaded cores (1)

Superscalars, RISCs:

  • IBM RS64 IV (SStar) (2000): single core, multithreaded; 2T; 0.18 µm / 44 mtrs (million transistors)

  • IBM POWER5 (2004): dual core, multithreaded; 2T; 0.13 µm / 276 mtrs

  • DEC/Compaq Alpha 21464 (V8) (2003): single core, multithreaded; 4T; 0.13 µm / 250 mtrs

  • Sun UltraSPARC T1 (Niagara) (2005): multi core, multithreaded; 8 cores/4T; 0.09 µm / 279 mtrs

Figure 2.6: Multithreaded cores (1)


2.2 Overview of multithreaded cores (2)

Superscalars, CISCs (Intel):

  • Pentium 4 (Northwood) (2002): single core, multithreaded; 0.13 µm / 55 mtrs

  • Pentium EE 840 (4/2005): dual core, multithreaded; 0.09 µm / 230 mtrs

  • Pentium EE 955/965 (Presler) (4/2005): dual core, multithreaded; 0.065 µm / 2*188 mtrs

VLIWs (Intel):

  • Montecito (2006?): dual core, multithreaded; 2*Itanium 2 (Madison); 0.09 µm / 1730 mtrs

Figure 2.7: Multithreaded cores (2)


2.2 Overview of multithreaded cores (3)

Underlying core(s)

Scalar core(s):

  • SUN UltraSPARC T1 (Niagara) (2005): up to 8 cores, 4 threads

Superscalar core(s):

  • IBM RS64 IV (SStar) (2000): 2-way

  • Pentium 4 (2002): 2-way

  • DEC 21464 (2003): 4-way

  • IBM POWER5 (2004): Dual-core/2-way

  • Pentium EE 840 (2005): Dual-core/2-way

  • Pentium EE 955/965 (2005): Dual-core/2-way

VLIW core(s):

  • SUN MAJC 5200 (2000): Quad-core/4-way (dedicated use)

  • Intel Montecito (2006?): Dual-core/2-way


3. Coarse grain multithreading

3.1 Overview (1)

Thread scheduling in multithreaded cores

Coarse grain MT

Fine grain MT

Simultaneous MT (SMT)


3. Coarse grain multithreading

3.1 Overview (2)

Coarse grain MT

  • Scalar based

  • Superscalar based: IBM RS64 IV (SStar) (2000), 2T

  • VLIW based: SUN MAJC 5200 (2000), Quad-core/4T (dedicated use); Intel Montecito (2006?), Dual-core/2T


3.2 Case example 1: IBM RS 64 IV (1)

Microarchitecture

4-way superscalar, dual-threaded.

Used in IBM’s iSeries and pSeries commercial servers.

Optimized for commercial server workloads, such as

on-line transaction processing, Web-serving, ERP (Enterprise Resource Planning).

Instruction fetch width: 8 instr./cycle

Architectural state:

  • GPRs, FPRs, CR (condition reg.), CTR (count reg.),

  • special-purpose privileged-mode regs., such as the MSR (machine state reg.),

  • status and control reg.s, such as T priority.

Each T executes in its own effective address space.

Units used for address translation need to be duplicated, such as the SRs (Segment Address Reg.s)

Duplicated resources:

~ + 5 % chip area

Both single threaded and multithreaded modes of execution.


3.2 Case example 1: IBM RS 64 IV (2)

[Figure labels: 6XX bus; IERAT: effective-to-real address translation cache (2x64 entries)]

Figure 3.1: Microarchitecture of IBM’s RS 64 IV

Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898


3.2 Case example 1: IBM RS 64 IV (3)

Aim: Commercial workloads

  • large working sets and

  • frequently occurring task switches

  • need for large L1$s

  • high cache miss rates

Thread switching (strongly simplified):

Two Ts are implemented; a foreground T and a background T.

The foreground T executes until a long latency event, such as a cache miss or an IERAT miss occurs.

Subsequently, a T switch is performed and the background T begins to execute.

After the miss is serviced, a T switch back to the foreground T occurs.

The Thread Switch Buffer holds up to 8 instructions from the background T, to eliminate the latency of the I$.

Threads can be allocated different priorities by explicit instructions.
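
The foreground/background switching described above can be sketched as a cycle trace. The model is deliberately simplified and rests on assumptions that are not in the slides: one instruction per cycle, no switch penalty, and a background thread that never misses itself. The name `coarse_grain_trace` is hypothetical.

```python
def coarse_grain_trace(foreground, miss_latency):
    """Return which thread owns each cycle: 'F' (foreground) or 'B' (background).

    foreground:   list of booleans, True where the foreground instruction
                  causes a long-latency event such as a cache miss
    miss_latency: cycles needed to service the miss
    """
    trace = []
    for causes_miss in foreground:
        trace.append("F")                        # foreground instruction executes
        if causes_miss:
            # switch to the background thread until the miss is serviced,
            # then switch back to the foreground thread
            trace.extend(["B"] * miss_latency)
    return trace
```

In this model every miss cycle of the foreground thread is turned into useful work for the background thread; a real design also has to account for the switch itself.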


3.2 Case example 1: IBM RS 64 IV (4)

Figure 3.2: Thread switch on data cache miss in IBM’s RS 64 IV

Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898


3.2 Case example 2: SUN MAJC 5200 (1)

Aim:

Dedicated use, high-end graphics, networking with wire-speed computational demands.

Microarchitecture:

  • up to 4 processors on a die,

  • each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,

  • each FU has its private logic and register set (e.g. 32 or 64 regs.),

  • the 4 FUs of a processor share a set of global regs., e.g. 64 regs.,

  • all registers are unified (not split into FX/FP files),

  • any FU can process any data type.

Each processor is a 4-wide VLIW and can be 4-way multithreaded.


3.2 Case example 2: SUN MAJC 5200 (2)

Figure 3.3: General view of SUN’s MAJC 5200

Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc


3.2 Case example 2: SUN MAJC 5200 (3)

Figure 3.4: The principle of private, unified register files associated with each FU

Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc


3.2 Case example 2: SUN MAJC 5200 (4)

Threading

Each processor with its 4 FUs can be operated in a 4-way multithreaded mode

(called Vertical Multithreading by Sun)

Implementation of 4-way multithreading:

by executing each T by one of the 4 FUs („Vertical multithreading”)

Thread switch:

Following a cache miss, the processor saves the T state and begins to process the next T.

Example:

Comparison of program execution without and with multithreading on a 4-wide VLIW

Considered program:

  • It consists of 100 instructions,

  • on average 2.5 instrs./cycle are executed,

  • a cache miss occurs after every 20 instructions,

  • latency of serving a cache miss: 75 cycles.
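
The parameters of the considered program allow a rough cycle estimate. The sketch below is a back-of-the-envelope model and not the calculation shown in Sun's tutorial: it assumes that every burst of 20 instructions ends in a miss (including the last one), that a thread switch costs nothing, and that all threads behave identically. The function names are hypothetical.

```python
def burst_cycles(instrs_per_miss=20, ipc=2.5):
    """Cycles of useful work between two cache misses."""
    return instrs_per_miss / ipc

def single_thread_cycles(n_instrs=100, instrs_per_miss=20, ipc=2.5, miss_latency=75):
    """Execution time of one thread that stalls on every cache miss."""
    bursts = n_instrs // instrs_per_miss
    return bursts * (burst_cycles(instrs_per_miss, ipc) + miss_latency)

def utilization(n_threads, instrs_per_miss=20, ipc=2.5, miss_latency=75):
    """Fraction of cycles doing useful work when misses are overlapped."""
    busy = burst_cycles(instrs_per_miss, ipc)
    period = busy + miss_latency     # time until the same thread can run again
    return min(1.0, n_threads * busy / period)
```

Under these assumptions a burst takes 20 / 2.5 = 8 cycles, so the single-threaded run needs 5 * (8 + 75) = 415 cycles, of which only 40 are useful. With four threads the busy share rises from 8/83 to 32/83 of the cycles, because the 8-cycle bursts of the other three threads cover only part of each 75-cycle miss.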


3.2 Case example 2: SUN MAJC 5200 (5)

Figure 3.5: Execution for subsequent cache misses in a single threaded processor

Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc


3.2 Case example 2: SUN MAJC 5200 (6)

Figure 3.6: Execution for subsequent cache misses in SUN’s MAJC 5200

Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc


3.2 Case example 3: Intel Montecito (1)

Aim:

High end servers

Main differences between Itanium 2 and Montecito:

  • Split L2 caches,

  • larger unified L3 cache,

  • duplicated architectural states maintained.

Additional support of dual-threading:

  • the branch prediction structures provide T tagging,

  • per-thread return stack structures,

  • per-thread ALATs (Advanced Load Address Tables).

Additional core area needed: ~ 2 %.


3.2 Case example 3: Intel Montecito (2)

Figure 3.7: Microarchitecture of Intel’s Itanium 2

Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55


3.2 Case example 3: Intel Montecito (3)

Figure 3.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table)

Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20


3.2 Case example 3: Intel Montecito (4)

Thread switches:

5 event types cause thread switches, such as L3 cache misses and programmed switch hints.

Total switch penalty: 15 cycles

Example for thread switching:

If control logic detects that a thread doesn’t make progress, a thread switch will be initiated.


3.2 Case example 3: Intel Montecito (5)

Figure 3.9: Thread switch in Intel’s Montecito vs single thread execution

Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20


4. Fine grain multithreading

4.1 Overview (1)

Thread scheduling in multithreaded cores

Coarse grain MT

Fine grain MT

Simultaneous MT (SMT)


4. Fine grain multithreading

4.1 Overview (2)

Fine grain MT

  • Round robin selection policy: scalar based, superscalar based, VLIW based

  • Priority based selection policy: scalar based, superscalar based, VLIW based

Example (scalar based): SUN UltraSPARC T1 (Niagara) (2005), up to 8 cores/4T


4.2 Case example: SUN UltraSPARC T1 (1)

Aim: Commercial server applications, such as

  • web servicing,

  • transaction processing,

  • ERP (Enterprise Resource Planning),

  • DSS (Decision Support Systems)

Characteristics of commercial server applications:

  • large working sets,

  • poor locality of memory references,

  • high cache miss rates,

  • low prediction accuracy for data dependent branches.

Memory latency strongly limits performance.

Multithreading to hide memory latency.


4.2 Case example: SUN UltraSPARC T1 (3)

Figure 4.3: Block diagram of SUN’s UltraSPARC T1

Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29


4.2 Case example: SUN UltraSPARC T1 (2)

Structure:

  • 8 scalar cores, 4-way multithreaded each.

  • All 32 threads share an L2 cache of 3 MB, built up of 4 banks,

  • 4 memory channels with on chip DDR2 memory controllers.

It runs under Solaris.


4.2 Case example: SUN UltraSPARC T1 (4)

Figure 4.3: SUN’s UltraSPARC T1 chip

Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf


4.2 Case example: SUN UltraSPARC T1 (6)

Figure 4.3: Microarchitecture of the core of SUN’s UltraSPARC T1

Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29


4.2 Case example: SUN UltraSPARC T1 (5)

Processor Elements (Sparc pipes):

  • Scalar FX-units, 6-stage pipeline

  • all Processor Elements share a single FP-unit

Each thread of a processor element has its private:

  • PC-logic

  • register file,

  • instruction buffer,

  • store buffer.

No thread switch penalty!


4.2 Case example: SUN UltraSPARC T1 (7)

Thread switch:

Threads are switched on a per cycle basis.

Selection of threads:

In the thread select pipeline stage

the thread select multiplexer selects a thread from the set of available threads in each clock cycle and

issues the subsequent instr. of this thread into the pipeline for execution.

Thread selection policy: the least recently used policy.

Threads become unavailable due to:

  • long-latency instructions, such as loads, branches, multiplies, divides,

  • pipeline stalls because of cache misses, traps, resource conflicts.

Example 1:

  • all 4 threads are available.


4.2 Case example: SUN UltraSPARC T1 (8)

Figure 4.3: Thread switch in the SUN’s UltraSPARC T1 when all threads are available

Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29


4.2 Case example: SUN UltraSPARC T1 (9)

Example 2:

  • There are only 2 threads available,

  • speculative execution of instructions following a load.

(Data referenced by a load instruction arrive in the third cycle after decoding, assuming a cache hit.

So, after issuing a load, the thread becomes unavailable for the next two cycles.)
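
The interplay of the least recently used selection policy and the two-cycle unavailability after a load can be sketched with a small simulation. It is a simplified model under stated assumptions: one instruction is issued per cycle, only loads make a thread unavailable, and every load hits in the cache. The function name `simulate` and the instruction names are hypothetical.

```python
def simulate(programs, n_cycles, load_delay=2):
    """Per-cycle issue trace of a fine grain multithreaded pipeline.

    programs:   dict thread id -> list of instruction names
    load_delay: cycles a thread stays unavailable after issuing a load
    Returns a list with one (thread, instruction) pair per cycle, or None
    for a cycle in which no thread is available.
    """
    order = sorted(programs)                  # least recently used thread first
    next_idx = {t: 0 for t in programs}
    ready_at = {t: 0 for t in programs}
    trace = []
    for cycle in range(n_cycles):
        issued = None
        for t in order:
            if ready_at[t] <= cycle and next_idx[t] < len(programs[t]):
                instr = programs[t][next_idx[t]]
                next_idx[t] += 1
                if instr == "load":
                    ready_at[t] = cycle + 1 + load_delay
                order.remove(t)
                order.append(t)               # t is now the most recently used
                issued = (t, instr)
                break
        trace.append(issued)
    return trace
```

With the programs `{0: ["load", "add"], 1: ["sub", "load", "and"]}` thread 0 issues its load in cycle 0, thread 1 fills cycles 1 and 2, and the add of thread 0 enters the pipeline in cycle 3, as soon as its thread becomes available again.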


4.2 Case example: SUN UltraSPARC T1 (10)

Figure 4.3: Thread switch in the SUN’s UltraSPARC T1 when only two threads are available

(The add instruction from thread t0 is speculatively switched into the pipeline assuming a cache hit.)

Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29


5. Simultaneous multithreading

5.1 Overview (1)

Thread scheduling in multithreaded cores

Coarse grain MT

Fine grain MT

Simultaneous MT (SMT)


5. Simultaneous multithreading

5.1 Overview (2)

Simultaneous MT

  • Scalar based

  • Superscalar based:

      • Pentium 4 (2002), 2T

      • DEC 21464 (2003), 4T

      • IBM POWER5 (2004), Dual-core/2T

      • Pentium EE 840 (2005), Dual-core/2T

      • Pentium EE 955/965 (2005), Dual-core/2T

  • VLIW based


5.2 Case example 1: Intel Pentium 4 / HT (2)

Figure 5.1. Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous pointers and control logic, but these are too small to point out.

Source: Koufaty D. and Marr D.T. „Hyperthreading Technology in the Netburst Microarchitecture”, IEEE Micro, Vol. 23, No. 2, March-April 2003, pp. 56-65.


5.2 Case example 1: Intel Pentium 4 / HT (3)

Figure 5.2: SMT pipeline in Intel’s Pentium 4/HT

Source: Marr D.T. et al. „Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16


5.2 Case example 1: Intel Pentium 4 / HT (1)

Intel designates SMT as Hyperthreading (HT)

Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.

(called the Prestonia and Foster MP cores),

followed by the Northwood core for desktops in 11/2002.

Additions for implementing MT:

  • Duplicated architectural state, including

      • instruction pointer,

      • the general purpose regs.,

      • the control regs.,

      • the APIC (Advanced Programmable Interrupt Controller) regs.,

      • some machine state regs.

  • Further enhancements to support MT (thread microstate):

      • TC-entries (Trace cache) are tagged,

      • BHB (Branch History Buffer) is duplicated,

      • Global History Table is tagged,

      • RAS (Return Address Stack) is duplicated,

      • Rename tables are duplicated,

      • ROB is tagged.

More chip area required for MT: less than 5 %.

Single thread/dual thread modes:

To prevent single thread performance degradation: in single thread mode partitioned resources are recombined.
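
The recombination of partitioned resources can be sketched with a buffer whose per-thread entry limit depends on the number of active threads. The even split, the class name `PartitionedQueue` and the buffer size used below are assumptions for illustration, not details of the Pentium 4.

```python
class PartitionedQueue:
    """A buffer whose entries are split evenly among the active threads."""

    def __init__(self, total_entries):
        self.total_entries = total_entries
        self.used = {}                     # thread id -> entries in use

    def limit(self, active_threads):
        """Entries one thread may occupy; the full buffer in single thread mode."""
        return self.total_entries // active_threads

    def try_allocate(self, thread, active_threads):
        """Allocate one entry for thread unless its partition is full."""
        used = self.used.get(thread, 0)
        if used >= self.limit(active_threads):
            return False
        self.used[thread] = used + 1
        return True
```

With 8 entries, a thread is limited to 4 entries in dual thread mode but may use all 8 in single thread mode, so a lone thread is not slowed down by the partitioning.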


5.2 Case example 2: Alpha 21464 (V8) (2)

Figure 5.3: SMT pipeline in the Alpha 21464 (V8)

Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com


5.2 Case example 2: Alpha 21464 (V8) (1)

8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.

In 2001 all Alpha intellectual property rights were sold to Intel.

Core enhancements for 4-way multithreading:

  • Providing replicated (4 x) thread states for:

PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):

Alpha 21264: 80 GPRs, 80 FPRs

Alpha 21464: 512

  • Providing replicated (4 x) thread microstates for:

Register Maps,

Additional core area needed for SMT: ~ 6 %

Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading”, Proc. ISSCC, 2002, pp. 334-243


5.2 Case example 3: IBM POWER5 (2)

Figure 5.14: POWER4 and POWER5 system structures (figure label: Fabric Controller)

Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.


5.2 Case example 3: IBM POWER5 (1)

POWER5 enhancements vs the POWER4:

  • on-chip memory control,

  • separate L3/memory attachment,

  • dual threaded.


5.2 Case example 3: IBM POWER5 (3)

Figure 5.3: Microarchitecture of IBM’s POWER5

Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003


5.2 Case example 3: IBM POWER5 (4)

Figure 5.3: IBM POWER5 chip

Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003


5.2 Case example 3: IBM POWER5 (6)

Figure 5.3: SMT pipeline of IBM’s POWER5

Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003


5.2 Case example 3: IBM POWER5 (5)

Core enhancements for multithreading:

  • Providing duplicated thread states for:

PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):

GPRs: 80 (POWER4), 120 (POWER5)

FPRs: 72 (POWER4), 120 (POWER5)

  • Providing duplicated thread microstates for:

Return Address Stack, Group Completion (ROB)

  • Providing increased (duplicated) size for scarce or sensitive resources, such as:

Instruction Buffer, Store Queue

Additional core area needed for SMT: ~ 10 %


5.2 Case example 3: IBM POWER5 (7)

Unbalanced execution of threads:

(an enhancement of the single mode/dual mode thread execution model)

  • Threads have 8 priority levels (0...7) controlled by HW/SW,

  • the decode rate of each thread will be controlled according to the associated priority
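
How a priority difference might translate into decode rates can be sketched as follows. The slides do not state the allocation rule, so the rule used here is an assumption made for illustration: with a priority difference d, the lower priority thread receives one decode cycle out of every 2^(d+1), and equal priorities alternate. The function name `decode_schedule` is hypothetical.

```python
def decode_schedule(prio0, prio1, n_cycles):
    """Return the thread (0 or 1) that receives each decode cycle.

    Assumed rule: R = 2 ** (|prio0 - prio1| + 1); the lower priority
    thread gets one cycle out of R, the higher priority thread the rest.
    """
    if prio0 == prio1:
        return [cycle % 2 for cycle in range(n_cycles)]
    ratio = 2 ** (abs(prio0 - prio1) + 1)
    high, low = (0, 1) if prio0 > prio1 else (1, 0)
    return [low if cycle % ratio == ratio - 1 else high
            for cycle in range(n_cycles)]
```

Under this assumed rule a difference of one priority level already gives the favoured thread three out of four decode cycles, which shows how a small number of priority levels can cover a wide range of execution balances.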

Figure 5.3: Unbalanced execution of threads in IBM’s POWER5

Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003


5.2 Case example 3: IBM POWER5 (8)

Development effort:

  • Concept phase: ~ 10 persons / 4 months

  • High level design phase: ~ 50 persons / 6 months

  • Implementation phase: ~ 200 persons / 12-18 months

Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

