
Motivation for Multithreaded Architectures

Processors not executing code at their hardware potential

  • late 70’s: performance lost to memory latency

  • 90’s: performance not in line with the increasingly complex parallel hardware

    • instruction issue bandwidth is increasing

    • number of functional units is increasing

    • instructions can execute out of order & complete in order

    • still, processor utilization was decreasing & instruction throughput was not increasing in proportion to the issue width

Motivation for Multithreaded Architectures

Major cause is the lack of instruction-level parallelism in a single executing thread

Therefore the solution has to be more general than building a smarter cache or a more accurate branch predictor


Multithreaded Processors

Multithreaded processors can increase the pool of independent instructions & consequently address multiple causes of processor stalling

  • hold processor state for more than one thread of execution

    • registers

    • PC

    • each thread’s state is a hardware context

  • execute the instruction stream from multiple threads without software context switching

  • utilize thread-level parallelism (TLP) to compensate for a lack of ILP


Multithreading

Traditional multithreaded processors switch to a different context in hardware to avoid processor stalls

3 styles

1. coarse-grain multithreading

  • switch on a long-latency operation (e.g., L2 cache miss)

  • another thread executes while the miss is handled

  • modest increase in instruction throughput with potentially no slowdown to the thread with the miss

    • doesn’t hide latency of short-latency operations

    • doesn’t add much to an already long stall

    • need to fill the pipeline on a switch

    • no switch if no long-latency operations

  • HEP, IBM RS64 III
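As a minimal illustration of this switch-on-event policy, the C sketch below models a coarse-grain context switch on an L2 miss. The context layout and all names are invented for illustration; real hardware does this in pipeline control logic, not software.

```c
/* Minimal sketch of coarse-grain ("switch-on-event") multithreading.
 * The context layout and names are illustrative assumptions. */
#include <stdbool.h>

enum { NUM_CONTEXTS = 4 };

struct hw_context {
    unsigned pc;      /* per-thread program counter           */
    bool     stalled; /* true while an L2 miss is outstanding */
};

static struct hw_context ctx[NUM_CONTEXTS];
static int current = 0;

/* Called when the running thread takes a long-latency event
 * (e.g., an L2 cache miss): mark it stalled and pick another
 * ready context; if none is ready, stay put and wait. */
int switch_on_l2_miss(void)
{
    ctx[current].stalled = true;
    for (int i = 1; i < NUM_CONTEXTS; i++) {
        int cand = (current + i) % NUM_CONTEXTS;
        if (!ctx[cand].stalled) {
            current = cand; /* pipeline refills from cand's PC */
            break;
        }
    }
    return current;
}
```

Note the costs listed above show up directly in this sketch: nothing happens on short-latency operations, and every successful switch implies refilling the pipeline from the new PC.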


Traditional Multithreading

3 styles

2. fine-grain multithreading

  • can switch to a different thread each cycle (usually round robin)

  • hides latencies of all kinds

  • larger increase in instruction throughput but slows down the execution of each thread

  • Cray (Tera) MTA


Comparison of Issue Capabilities

[figure-only slide: diagram comparing issue capabilities]

Multithreading

3 styles

3. simultaneous multithreading (SMT)

  • issues multiple instructions from multiple threads each cycle

  • no hardware context switching

  • huge boost in instruction throughput with less degradation to individual threads


Comparison of Issue Capabilities

[figure-only slide: diagram comparing issue capabilities]

Cray (Tera) MTA

Fine-grain multithreaded processor

  • can switch to a different thread each cycle

    • switches to ready threads only

    • allows execution to remain with the current thread for a specific number of cycles (discussed under compiler support)

  • up to 128 hardware contexts

    • lots of latency to hide, mostly from the multi-hop interconnection network

    • average instruction latency was ~70 cycles (at one point), i.e., on average 70 instruction streams were needed to hide all latency

  • processor state for all 128 contexts

    • GPRs (total of 4K registers!)

    • status registers (includes the PC)

    • branch target registers
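The selection policy above can be sketched as a round-robin scan over ready streams. Everything here (the ready bit, the names, a software loop standing in for hardware selection) is an illustrative assumption.

```c
/* Sketch of MTA-style fine-grain scheduling: each cycle, pick the
 * next ready stream in round-robin order. */
#include <stdbool.h>

enum { NUM_STREAMS = 128 };

static bool ready[NUM_STREAMS]; /* no blocking dependence outstanding */
static int  last = 0;

int select_stream_this_cycle(void)
{
    for (int i = 1; i <= NUM_STREAMS; i++) {
        int s = (last + i) % NUM_STREAMS;
        if (ready[s]) {
            last = s;
            return s;  /* issue one instruction from stream s */
        }
    }
    return -1;         /* no stream ready: the cycle is wasted */
}
```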


Cray (Tera) MTA

Interesting features

  • no data caches

    • to avoid having to keep caches coherent (topic of the next lecture)

    • increases the latency for data accesses but reduces the variation

    • L1 & L2 instruction caches

      • instruction accesses are more predictable & have no coherency problem

      • prefetch straight-line & target code


Cray (Tera) MTA

Interesting features

  • trade-off between avoiding memory bank conflicts & exploiting spatial locality

    • memory distributed among processors

    • memory addresses are randomized to avoid conflicts

      • want to fully utilize all memory bandwidth

    • run-time system can confine consecutive virtual addresses to a single (close-by) memory unit

      • reduces latency (used mainly for instructions)
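A hedged sketch of the randomization idea: hash each address to a bank so consecutive lines spread across all banks and no single bank becomes a hot spot. The bank count and hash function below are invented, not the MTA's actual scrambling.

```c
/* Sketch of address randomization across memory banks. */
#include <stdint.h>

enum { NUM_BANKS = 512 };          /* illustrative bank count */

unsigned bank_of(uint64_t addr)
{
    uint64_t line = addr >> 6;     /* 64-byte granularity     */
    line ^= line >> 9;             /* fold high bits into low */
    line *= 0x9E3779B97F4A7C15ULL; /* multiplicative scramble */
    return (unsigned)(line % NUM_BANKS);
}
```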


Cray (Tera) MTA

Interesting features

  • tagged memory for synchronization, pointer-chasing, trap handling

    • indirectly set full/empty bits to prevent data races

      • prevents a consumer from loading a value before the producer has written it & a producer from overwriting a value before the consumer has read it

        • set to empty when producer instruction starts

        • consumer instructions block if they try to read the producer's value

        • set to full when producer value is written

        • consumers can now read a valid value

    • explicitly set full/empty bits for synchronization

      • primarily used to synchronize threads that are accessing shared data (topic of the next lecture)

        • lock: read memory location & set to empty

        • other readers are blocked

        • unlock: write & set to full
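The full/empty discipline above can be modeled in software roughly as follows. The MTA blocks a stream in hardware on such accesses (exposed as tagged load/store variants); the spin loop, struct, and names here are illustrative, and a real software version would need atomics.

```c
/* Software model of full/empty-bit semantics on a tagged word. */
#include <stdbool.h>
#include <stdint.h>

struct tagged_word {
    int64_t       value;
    volatile bool full;   /* the full/empty tag bit */
};

/* Producer: write the value, then set the tag to full. */
void write_full(struct tagged_word *w, int64_t v)
{
    w->value = v;
    w->full  = true;      /* consumers may now read */
}

/* Consumer: block until full, read the value, and leave the
 * word empty so other readers block. */
int64_t read_and_empty(struct tagged_word *w)
{
    while (!w->full)
        ;                 /* hardware would block the stream here */
    int64_t v = w->value;
    w->full = false;
    return v;
}
```

Under this discipline the explicit lock above is just read_and_empty() on the lock word, and unlock is a later write_full().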


Cray (Tera) MTA

Interesting features

  • virtual memory system

    • no paging: want pages pinned down in memory

    • page size is 256MB

  • user-mode trap handlers

    • fatal exceptions, normalizing floating point numbers


Cray (Tera) MTA

Compiler support

  • each instruction is a VLIW instruction

    • memory/arithmetic/branch

      • load/store architecture

    • need a good code scheduler

  • explicit dependence look-ahead

    • field in a memory instruction that specifies the number of independent VLIW instructions that follow (see the sketch after this list)

      • deviation from fine-grain multithreading

    • 8 memory operations can be outstanding

  • handling branches

    • instruction to store a branch target in a register before the branch is executed

      • can start prefetching the target code
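Here is a sketch of how issue logic might consume the look-ahead field flagged above: a stream may keep issuing past a pending memory access as long as the compiler declared the following instructions independent. The stream bookkeeping and function names are assumptions for illustration.

```c
/* Sketch of explicit dependence look-ahead. */
#include <stdbool.h>

struct stream {
    int mem_outstanding; /* pending memory ops (up to 8 on the MTA)    */
    int lookahead_left;  /* declared-independent instrs still issuable */
};

bool can_issue_next(const struct stream *s)
{
    return s->mem_outstanding == 0 || s->lookahead_left > 0;
}

void issue_mem_op(struct stream *s, int lookahead_field)
{
    if (s->mem_outstanding < 8) {
        s->mem_outstanding++;
        s->lookahead_left = lookahead_field;
    }
}

void issued_one(struct stream *s) /* consume one unit of look-ahead */
{
    if (s->mem_outstanding > 0 && s->lookahead_left > 0)
        s->lookahead_left--;
}
```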


Cray (Tera) MTA

Run-time support

  • thread manipulation

    • protection domains: group of threads executing in the same virtual address space

    • OS sets the maximum number of thread contexts (instruction streams) a domain is allowed

    • domain can create & kill threads within that limit, depending on its need for them

  • OS bases resource allocations on resource usage


SMT: The Executive Summary

Simultaneous multithreaded (SMT) processors combine designs from:

  • superscalar processors

  • traditional multithreaded processors

    The combination:

  • SMT issues & executes instructions from multiple threads simultaneously

    => converting TLP to ILP

  • threads share almost all hardware resources


Performance Implications

Multiprogramming workload

  • 2.5X on SPEC95, 4X on SPEC2000

Parallel programs

  • ~.7 on SPLASH2

Commercial databases

  • 2-3X on TPC-B; 1.5X on TPC-D

Web servers & OS

  • 4X on Apache and Digital Unix


Does this Processor Sound Familiar?

Huge performance boost + straightforward implementation =>

  • 2-context Intel Hyperthreading

  • 2-context Sun UltraSPARC on 4-processor CMP

  • 4-context Compaq 21464

  • network processor & mobile device start-ups

  • others in the wings


An SMT Architecture

Three primary goals for this architecture:

1. Achieve significant throughput gains with multiple threads

2. Minimize the performance impact on a single thread executing alone

3. Minimize the architectural impact on a conventional out-of-order superscalar design


Implementing SMT

Most hardware on current out-of-order processors can be used as is

Out-of-order renaming & instruction scheduling mechanisms

  • physical register pool model

  • renaming hardware eliminates false dependences both within a thread (just like a superscalar) & between threads

  • map thread-specific architectural registers onto a pool of thread-independent physical registers

  • operands are thereafter called by their physical names

  • an instruction is issued when its operands become available & a functional unit is free

  • the instruction scheduler does not consider thread IDs when dispatching instructions to functional units (unless threads have different priorities)
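A sketch of the thread-shared renaming described above: (thread, architectural register) pairs map onto one pool of physical registers. The table sizes, the free list, and the omission of retire-time freeing are illustrative assumptions.

```c
/* Sketch of SMT register renaming over a shared physical pool. */
enum { NUM_THREADS = 8, ARCH_REGS = 32, PHYS_REGS = 352 };

static int rename_map[NUM_THREADS][ARCH_REGS]; /* arch -> phys */
static int free_list[PHYS_REGS];
static int free_top;

void rename_init(void)
{
    for (int p = 0; p < PHYS_REGS; p++)
        free_list[p] = p;             /* initially all regs free */
    free_top = PHYS_REGS;
}

/* Destination: grab a free physical register; false dependences on
 * the old mapping vanish, both within a thread and across threads.
 * (Freeing the old physical register at retire is omitted here.) */
int rename_dest(int tid, int arch_reg)
{
    int phys = free_list[--free_top];
    rename_map[tid][arch_reg] = phys;
    return phys;
}

/* Sources are thereafter called by their physical names. */
int rename_src(int tid, int arch_reg)
{
    return rename_map[tid][arch_reg];
}
```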


From Superscalar to SMT

Extra pipeline stages for accessing thread-shared register files

  • 8 threads * 32 registers + renaming registers

    SMT instruction fetcher (ICOUNT)

  • fetch from 2 threads each cycle

    • count the number of instructions for each thread in the pre-execution stages

    • pick the 2 threads with the lowest number

  • in essence fetching from the two highest throughput threads
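A sketch of ICOUNT's choice step, selecting the two threads with the fewest instructions in the pre-execution stages; the counters and the pick-two scan are illustrative, not the actual fetch hardware.

```c
/* Sketch of the ICOUNT fetch chooser. */
enum { NTHREADS = 8 };

static int icount[NTHREADS]; /* instrs in decode/rename/queues, per thread */

/* out[0], out[1] receive the IDs of the two lowest-count threads. */
void icount_pick2(int out[2])
{
    out[0] = out[1] = -1;
    for (int t = 0; t < NTHREADS; t++) {
        if (out[0] < 0 || icount[t] < icount[out[0]]) {
            out[1] = out[0];  /* old best becomes second best */
            out[0] = t;
        } else if (out[1] < 0 || icount[t] < icount[out[1]]) {
            out[1] = t;
        }
    }
}
```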


From Superscalar to SMT

Per-thread hardware

  • small stuff

  • all part of current out-of-order processors

  • none endangers the cycle time

  • other per-thread processor state, e.g.,

    • program counters

    • return stacks

    • thread identifiers, e.g., with BTB entries, TLB entries

  • per-thread bookkeeping for

    • instruction retirement

    • trapping

    • instruction queue flush

      This is why there is only a 10% increase in Alpha 21464 chip area.


Implementing SMT

Thread-shared hardware:

  • fetch buffers

  • branch prediction structures

  • instruction queues

  • functional units

  • active list

  • all caches & TLBs

  • MSHRs

  • store buffers

    This is why there is little single-thread performance degradation (~1.5%).


Architecture Research

Concept & potential of Simultaneous Multithreading: ISCA ’95 & ISCA 25th Anniversary Anthology

Designing the microarchitecture: ISCA ’96

  • straightforward extension of out-of-order superscalars

    I-fetch thread chooser: ISCA ’96

  • 40% faster than round-robin

    The lockbox for cheap synchronization: HPCA ’98

  • orders of magnitude faster

  • can parallelize previously unparallelizable codes


Architecture Research

Software-directed register deallocation: TPDS ’99

  • large register-file performance w. small register file

    Mini-threads: HPCA ’03

  • large SMT performance w. small SMTs

    SMT instruction speculation: TOCS ‘03

  • a good thread mix is the most important performance factor


Compiler Research

Tuning compiler optimizations for SMT: Micro ‘97 & IJPP ’99

  • data decomposition: use cyclic iteration scheduling (see the sketch after this list)

  • tiling: use cyclic tiling; no tile size sweet spot

  • software speculation: generates too much overhead code on integer programs

    Communicate last-use info to HW for early register deallocation: TPDS ’99

  • now need ¼ the renaming registers

    Compiling for fewer registers/thread: HPCA ’03

  • surprisingly little additional spill code (avg. 3%)
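The cyclic iteration scheduling flagged above amounts to interleaving iterations across threads: thread t executes iterations t, t+T, t+2T, and so on. The daxpy-style loop and its names are hypothetical.

```c
/* Sketch of cyclic (interleaved) iteration scheduling. */
void daxpy_cyclic(int tid, int nthreads, int n,
                  double a, const double *x, double *y)
{
    for (int i = tid; i < n; i += nthreads)
        y[i] += a * x[i];
}
```

The intuition is that with a blocked decomposition each thread works in a distant region of the arrays, while cyclic scheduling keeps the threads' working sets overlapping in the caches they share.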


OS Research

Analysis of OS behavior on SMT: ASPLOS ‘00

  • Kernel-kernel conflicts in I$ & D$ & branch mispredictions ameliorated by SMT instruction issue + thread-sharing in HW

    OS/runtime support for mini-threads: HPCA ’03

  • dedicated server: recompile for fewer registers

  • multiprogrammed environment: multiple versions

    OS/runtime support for executing threaded programs: ISCA ’98 & PPoPP ’03

  • dynamic memory allocation, synchronization, stack offsetting, page mapping


Others are Now Carrying the Ball

Fault detection & recovery

Thread-level speculation

Instruction & data prefetching

Instruction issue hardware design

Thread scheduling

Single-thread execution

Profiling executing threads

SMT-CMP hybrids


SMT Collaborators

UW: Hank Levy, Steve Gribble, Dean Tullsen (UCSD), Jack Lo (Transmeta), Sujay Parekh (IBM Yorktown), Brian Dewey (Microsoft), Manu Thambi (Microsoft), Josh Redstone (interviewing), Mike Swift, Luke McDowell, Steve Swanson, Aaron Eakin (HP), Dimitriy Portnov (Google)

DEC/Compaq: Joel Emer (now Intel), Rebecca Stamm, Luiz Barroso (now Google), Kourosh Gharachorloo (now Google)

For more info on SMT: http://www.cs.washington.edu/research/smt


SMT Performance

Apache

SMT improves throughput by converting TLP from multiple contexts into ILP

Performance increases with more contexts

[chart: Apache IPC vs. number of hardware contexts (1, 2, 4, 8)]


A Problem with Large SMTs

Register file will not scale in future technologies

  • e.g., 4-context Alpha 21464:

    • 3 cycles to access

    • 4 times the size of the 64KB instruction cache

      First SMT implementations will be small

  • limited to 2 hardware contexts

  • expect lower throughput gains

    Goal of mini-threads: achieve performance gains of larger SMTs without the register cost


Mini-threads: the Concept

Execute more than one thread, called a mini-thread, within each context

Mini-threads in a context share the context’s architectural register set


SMT Processor Model

[diagram: each context has its own PC and its own architectural register set]

Mini-thread Processor Model

[diagram: each context now has two PCs that share the context's architectural register set]

Mini-threads Executing on SMT

The HW side: Add extra PCs to each hardware context

  • mini-threads execute using their own PC

  • mini-threads share the context’s architectural register set


Mini-threads Executing on SMT

The SW side: applications can choose to trade off registers for TLP

  • ignore the PCs and the machine is SMT

  • use the extra PCs for higher thread-level parallelism

    • an application is responsible for coordinating register usage across mini-threads within its context


The Register/TLP Trade-off

+ Throughput benefit of extra mini-threads (more TLP)

- Instruction cost of fewer registers per mini-thread (spill code)

+/- Spill code impact on throughput

+/- Instruction impact of extra mini-threads


Two Mini-threads/Context

Workload Average

  • 1 Context: 63%

[chart: % increase in IPC from a second mini-thread per context, for Fmm, Barnes, Apache, Water-spatial, Raytrace]


Two Mini-threads/Context

Workload Average

  • 1 Context: 63%
  • 2 Contexts: 40%
  • 4 Contexts: 26%
  • 8 Contexts: 9%

[chart: % increase in IPC for 1, 2, 4 & 8 contexts, for Fmm, Barnes, Apache, Water-spatial, Raytrace]


16 Registers per Mini-thread

  • Only 3% on average!
  • IPC cost can be greater
  • 7% savings in spill code
  • Kernel: 0.8%

[chart: % change in dynamic instructions with 16 registers per mini-thread, for Fmm, Barnes, Apache, Water-spatial, Raytrace]


Bottom Line for 2 Mini-threads on SMT

Workload Average

  • 1 Context: 60%
  • 2 Contexts: 38%
  • 4 Contexts: 20%
  • 8 Contexts: -2%

[chart: % speedup for 1, 2, 4 & 8 contexts, for Fmm, Barnes, Apache, Water-spatial, Raytrace]


Bottom Line for 2 Mini-threads on SMT

Workload Average

  • If free to choose:
    • 1 Context: 60%
    • 2 Contexts: 38%
    • 4 Contexts: 22%
    • 8 Contexts: 6%

[chart: % speedup when each application chooses whether to use mini-threads]


Operating System & Runtime Support

The issues:

  • Runtime and/or OS support for managing mini-threads

  • Compatible register usage between mini-threads & OS/runtime


Operating System & Runtime Support

1. Dedicated server, e.g., web server

  • Goal: maximum OS performance

  • Solution:

    • OS/runtime recompiled for half the register set

    • each mini-thread uses different registers

    • hardware partition bit steers register access to high or low set

      + Both mini-threads in a context can execute in the OS at the same time

      - Only 1 partition allowed & no sharing of values
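One way to picture the partition bit is as a sketch of the register-index logic: each mini-thread names 16 registers, and its bit steers those names into the high or low half of the context's 32-register set. The indexing below is an assumption for illustration, not the actual hardware.

```c
/* Sketch of partition-bit register steering. */
int arch_index(int reg4bit, int partition_bit)
{
    /* mini-thread 0 -> r0..r15, mini-thread 1 -> r16..r31 */
    return (partition_bit << 4) | (reg4bit & 0xF);
}
```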


Operating System & Runtime Support

2. Multiprogrammed environment

  • Goal: maximum flexibility to use registers

  • Solution:

    • standard OS compile, multiple versions of the runtime

    • runtime schedules mini-threads

    • hardware blocks 1 mini-thread when the other enters the OS

      + No restrictions on register usage and sharing among mini-threads

      - Lower throughput because mini-threads are blocked


Conclusion for 2 Mini-threads on SMT

Technique for obtaining large-scale SMT throughput with a small-scale SMT implementation

Mini-threads bypass the register scaling problem on SMT

  • average boost of 38% on a 2-context SMT, 20% on a 4-context SMT, -2% on an 8-context SMT

    Performance increases further when applications use mini-threads only when it pays (38%/22%/6%)

    Reducing registers has low impact!

  • 3% average increase in instructions


A Problem or an Opportunity?

Database workload

+ low throughput (0.8 IPC on an 8-wide superscalar)

+ naturally threaded (and widely used) application

- already high cache miss rates on a single-threaded machine

Key question:

Can SMT’s simultaneous multi-thread instruction issue hide the latencies of the additional misses in these already memory-intensive databases?


Database Bottom Line

SMT gets good speedups on commercial database workloads

  • 3x for debit/credit systems, 1.7x for big search problems

    The reasons:

  • Limits inter-thread data interference in the caches to superscalar levels by:

    • a virtual-to-physical page mapping policy that exploits temporal locality in the physically-addressed L2 cache

    • address offsetting for thread-private data in the L1 data cache (see the sketch after this list)

  • Exploits SMT’s inter-thread instruction sharing (35% reduction in instruction cache miss rates)

  • Hides latencies of all kinds with simultaneous, multi-thread instruction issue
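The address offsetting flagged above can be sketched as skewing each thread's stack so thread-private data from different threads maps to different L1 sets. All constants and names below are invented for illustration.

```c
/* Sketch of per-thread stack address offsetting. */
#include <stdint.h>

enum {
    STACK_BYTES   = 1 << 20, /* per-thread stack reservation     */
    OFFSET_STRIDE = 2176     /* per-thread skew, chosen not to be
                                a multiple of the L1 set span    */
};

uintptr_t thread_stack_top(uintptr_t region_base, int tid)
{
    uintptr_t base = region_base
                   + (uintptr_t)tid * (STACK_BYTES + OFFSET_STRIDE);
    return base + STACK_BYTES; /* stacks grow down from here */
}
```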

SMT Performance

[figure-only slide]

SMT Register Deallocation Strategies

(1) Compiler/architecture support to free registers after their last use (sketched below)

  • small SMT register files achieve large-register-file performance
  • use a small register file if register access sets the cycle time

(2) OS support to free registers in idle contexts

  • enables SMT registers to be shared by all contexts (traditional multithreaded processors have separate contexts per thread)
  • increases performance with a small register file: up to 50% with 8 threads, 3x with fewer threads

Goal: design a good throughput CPU (i.e., SMT) without penalizing the performance of a single or a few threads
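A sketch of the early-deallocation idea in (1), with the compiler marking last uses; the interface and free-list bookkeeping are invented for illustration.

```c
/* Sketch of freeing a physical register at its last use. */
#include <stdbool.h>

enum { PHYS_REGS = 128 };

static int free_list[PHYS_REGS];
static int free_top; /* grows as registers are released */

/* Conventionally a physical register is freed only when a later
 * redefinition of the same architectural register retires; with
 * last-use information it is freed as soon as it is read for the
 * last time, shrinking the register file a workload needs. */
void read_operand(int phys, bool compiler_says_last_use)
{
    /* ... deliver the value to the functional unit ... */
    if (compiler_says_last_use)
        free_list[free_top++] = phys; /* free early */
}
```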

Deallocating Registers After Last Use

[figure-only slide]

Putting This in Context

[figure-only slide]