
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

Electrical and Computer Engineering, University of Alabama in Huntsville

Aleksandar Milenkovic
milenka@ece.uah.edu

http://www.ece.uah.edu/~milenka

Outline
  • Trends in microarchitecture
  • Exploiting thread-level parallelism
  • Exploiting TLP within a processor
  • Resource sharing
  • Performance implications
  • Design challenges
  • Intel’s HT technology
Trends in microarchitecture
  • Higher clock speeds
    • To achieve high clock frequencies, make the pipeline deeper (superpipelining)
    • Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
  • ILP: Instruction Level Parallelism
    • Extract parallelism in a single program
    • Superscalar processors have multiple execution units working in parallel
    • Challenge to find enough instructions that can be executed concurrently
    • Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order (see the dependence sketch below)
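
As a concrete, made-up example of the dependence distinction: the first group of statements below forms a serial chain, while the second group is fully independent – exactly the kind of parallelism a superscalar out-of-order core can exploit.

```c
#include <stdio.h>

int main(void) {
    long x = 1, y = 2, z = 3, w = 4;   /* made-up inputs */

    /* Serial dependence chain: each result feeds the next, so these
       three operations must execute one after another. */
    long a = x + y;
    long b = a * z;
    long c = b - w;

    /* Independent operations: no result feeds another, so a superscalar
       core can issue all three to separate execution units in one cycle. */
    long d = x + y;
    long e = z * z;
    long f = w - 1;

    printf("%ld %ld %ld %ld %ld %ld\n", a, b, c, d, e, f);
    return 0;
}
```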
Trends in microarchitecture
  • Cache hierarchies
    • Processor-memory speed gap
    • Use caches to reduce memory latency
    • Multiple levels of caches: smaller and faster closer to the processor core
  • Thread-level Parallelism
    • Multiple programs execute concurrently
      • Web servers have an abundance of software threads
      • Users: surfing the web, listening to music, encoding/decoding video streams, etc. (a two-thread sketch follows)
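
A minimal sketch of software thread-level parallelism, assuming POSIX threads are available; the work functions are placeholders standing in for independent tasks:

```c
#include <pthread.h>
#include <stdio.h>

/* Placeholder work functions standing in for independent tasks
   (e.g., decoding a media stream and serving a web request). */
static void *decode_stream(void *arg) { (void)arg; puts("decoding stream"); return NULL; }
static void *serve_request(void *arg) { (void)arg; puts("serving request"); return NULL; }

int main(void) {
    pthread_t t1, t2;

    /* Two software threads the OS can schedule concurrently; on an
       SMT or CMP processor they can also run on the same chip at once. */
    pthread_create(&t1, NULL, decode_stream, NULL);
    pthread_create(&t2, NULL, serve_request, NULL);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```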
Exploiting thread-level parallelism
  • CMP – Chip Multiprocessing
    • Multiple processors, each with a full set of architectural resources, reside on the same die
    • Processors may share an on-chip cache or each can have its own cache
    • Examples: HP Mako, IBM Power4
    • Challenges: Power, Die area (cost)
  • Time-slice multithreading
    • Processor switches between software threads after a predefined time slice
    • Can minimize the effects of long-lasting events
    • Still, some execution slots are wasted
Multithreading Within a Processor
  • Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?
  • Why is this desirable?
    • inexpensive – one CPU, no external interconnects
    • no remote or coherence misses (more capacity misses)
  • Why does this make sense?
    • most processors can’t find enough work – peak IPC is 6, average IPC is 1.5!
    • threads can share resources => we can increase the number of threads without a corresponding linear increase in area
What Resources are Shared?
  • Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
  • For correctness, each thread needs its own PC, its own logical regs (and its own mapping from logical to phys regs)
  • For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing => better utilization of resources
  • Each additional thread costs a PC, rename table, and ROB – cheap! (see the struct sketch below)
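
As a rough illustration of how cheap an extra hardware thread is, here is a hypothetical per-thread context (field names and sizes invented for this sketch): little more than a PC, a rename map, and private ROB pointers, while caches, execution units, and the physical register file stay shared.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_ARCH_REGS 32   /* assumed architectural register count */

/* Hypothetical per-thread context: the state an SMT core replicates
   for each hardware thread. Caches, execution units, and the physical
   register file are shared, so they do not appear here. */
struct thread_context {
    uint64_t pc;                        /* per-thread program counter      */
    uint16_t rename_map[NUM_ARCH_REGS]; /* logical -> physical reg mapping */
    uint32_t rob_head, rob_tail;        /* private reorder-buffer pointers */
};

int main(void) {
    printf("replicated state per thread: %zu bytes\n",
           sizeof(struct thread_context));
    return 0;
}
```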
Approaches to Multithreading Within a Processor
  • Fine-grained multithreading: switches threads on every clock cycle
    • Pro: hides latency from both short and long stalls
    • Con: slows down execution of the individual threads that are ready to go
  • Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 miss stalls)
    • Pros: no switching each clock cycle, no slowdown for ready-to-go threads
    • Con: limited ability to hide shorter stalls
  • Simultaneous Multithreading: exploits TLP at the same time it exploits ILP (a sketch of the two switching policies follows)
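
A minimal sketch of the fine-grained and coarse-grained switching policies; the thread count and the stall trigger are made up:

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Fine-grained policy: rotate to the next thread every clock cycle. */
static int fine_grained_next(int current) {
    return (current + 1) % NUM_THREADS;
}

/* Coarse-grained policy: keep the current thread until it suffers a
   costly stall (e.g., an L2 miss), then switch. */
static int coarse_grained_next(int current, bool costly_stall) {
    return costly_stall ? (current + 1) % NUM_THREADS : current;
}

int main(void) {
    int fg = 0, cg = 0;
    for (int cycle = 0; cycle < 6; cycle++) {
        bool l2_miss = (cycle == 3);   /* pretend a miss happens at cycle 3 */
        printf("cycle %d: fine-grained runs thread %d, coarse-grained runs thread %d\n",
               cycle, fg, cg);
        fg = fine_grained_next(fg);
        cg = coarse_grained_next(cg, l2_miss);
    }
    return 0;
}
```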
How Are Resources Shared?

[Figure: issue-slot diagrams for four organizations – Superscalar, Coarse-grained Multithreading, Fine-grained Multithreading, and Simultaneous Multithreading. Each box represents an issue slot for a functional unit; peak throughput is 4 IPC. Rows are cycles, and shading marks which of Threads 1-4 occupies each slot, or whether it is idle.]

  • A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
  • Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot fill every issue slot every cycle, but cache misses can be tolerated
  • Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot (see the issue sketch below)
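
A toy sketch of SMT issue under the figure's assumptions (4 threads, 4-wide issue; the per-thread ready counts are made up): slots are filled from whichever threads have work, rather than from one thread only.

```c
#include <stdio.h>

#define NUM_THREADS 4
#define ISSUE_WIDTH 4

int main(void) {
    /* ready[t] = ready instructions thread t has this cycle (made up). */
    int ready[NUM_THREADS] = {1, 0, 2, 3};
    int slots = ISSUE_WIDTH;

    /* SMT issue: fill slots from any thread with ready work. A
       fine-grained design would instead take this cycle's slots from
       a single thread only, leaving more slots idle. */
    for (int t = 0; t < NUM_THREADS && slots > 0; t++) {
        int take = ready[t] < slots ? ready[t] : slots;
        if (take > 0)
            printf("issue %d instruction(s) from thread %d\n", take, t);
        slots -= take;
    }
    printf("idle issue slots this cycle: %d\n", slots);
    return 0;
}
```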
Resource Sharing

[Figure: two threads flowing through a shared SMT pipeline – per-thread Instr Fetch and Instr Rename stages feed one shared Issue Queue, Register File, and four FUs. The rename stage maps each thread's architectural registers onto disjoint physical registers:]

Thread-1:                 after rename:
  R1 ← R1 + R2            P73 ← P1 + P2
  R3 ← R1 + R4            P74 ← P73 + P4
  R5 ← R1 + R3            P75 ← P73 + P74

Thread-2:                 after rename:
  R2 ← R1 + R2            P76 ← P33 + P34
  R5 ← R1 + R2            P77 ← P33 + P76
  R3 ← R5 + R3            P78 ← P77 + P35
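
A minimal C sketch of the rename step shown above. The free list starting at P73 and the initial identity mapping are chosen only to reproduce the figure's numbers:

```c
#include <stdio.h>

#define ARCH_REGS 32

/* Per-thread rename tables in front of a shared physical register file:
   source operands are read from the thread's current map, and each
   destination gets a fresh physical register from a shared free list,
   so instructions from both threads can mix safely in one issue queue. */
static int rename_map[2][ARCH_REGS];
static int next_free = 73;   /* next free physical register (as in the figure) */

static void rename_add(int thread, int dst, int s1, int s2) {
    int p1 = rename_map[thread][s1], p2 = rename_map[thread][s2];
    rename_map[thread][dst] = next_free;
    printf("T%d: R%d <- R%d + R%d   becomes   P%d <- P%d + P%d\n",
           thread + 1, dst, s1, s2, next_free++, p1, p2);
}

int main(void) {
    /* Assume Thread-1's logical regs initially map to P1..P31 (illustrative). */
    for (int r = 0; r < ARCH_REGS; r++) rename_map[0][r] = r;
    rename_add(0, 1, 1, 2);   /* R1 <- R1 + R2  =>  P73 <- P1 + P2   */
    rename_add(0, 3, 1, 4);   /* R3 <- R1 + R4  =>  P74 <- P73 + P4  */
    rename_add(0, 5, 1, 3);   /* R5 <- R1 + R3  =>  P75 <- P73 + P74 */
    return 0;
}
```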

Performance Implications of SMT
  • Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
  • While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT) is to fetch from the thread with the fewest instructions in flight, so that each thread gets an equal share of processor resources (see the sketch after this list)
  • With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4
  • Alpha 21464 and Intel Pentium 4 are examples of SMT
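
A minimal sketch of an ICOUNT-style fetch pick; the thread count and in-flight counts are made up:

```c
#include <stdio.h>

#define NUM_THREADS 8

/* ICOUNT-style heuristic: each cycle, fetch from the thread with the
   fewest instructions in the pre-issue pipeline stages, which naturally
   balances each thread's share of processor resources. */
static int icount_pick(const int in_flight[NUM_THREADS]) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;
    return best;
}

int main(void) {
    int in_flight[NUM_THREADS] = {12, 3, 7, 9, 4, 15, 8, 6};  /* made up */
    printf("fetch from thread %d\n", icount_pick(in_flight)); /* thread 1 */
    return 0;
}
```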
Design Challenges
  • How many threads?
    • Need many threads to find enough parallelism
    • However, mixing many threads will compromise execution of individual threads
  • Processor front-end (instruction fetch)
    • Fetch as far as possible in a single thread (to maximize thread performance)
    • However, this limits the number of instructions available for scheduling from other threads
  • Larger register files (multiple contexts)
  • Minimize clock cycle time
  • Cache conflicts
Pentium 4: Hyperthreading architecture
  • One physical processor appears as multiple logical processors
  • HT implementation on NetBurst microarchitecture has 2 logical processors

[Figure: one physical processor exposing two logical processors – two copies of the Architectural State sit above a single set of shared processor execution resources.]

  • Architectural state (one copy per logical processor):
    • general-purpose registers
    • control registers
    • APIC: advanced programmable interrupt controller
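
A hypothetical sketch (field sizes are illustrative, not the actual NetBurst layout) of what is duplicated versus shared:

```c
#include <stdint.h>
#include <stdio.h>

/* Each logical processor gets a full copy of the architectural state,
   while the execution resources exist once and are shared. */
struct arch_state {
    uint32_t gpr[8];          /* IA-32 general-purpose registers       */
    uint32_t control_regs[5]; /* CR0..CR4-style control registers      */
    uint8_t  apic_state[64];  /* local APIC state (size illustrative)  */
};

struct physical_processor {
    struct arch_state logical[2];  /* duplicated: one per logical CPU  */
    /* caches, branch predictors, execution units, buses: shared,
       so they are not replicated here */
};

int main(void) {
    printf("architectural state per logical processor: %zu bytes (illustrative)\n",
           sizeof(struct arch_state));
    return 0;
}
```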

Pentium 4: Hyperthreading architecture
  • Main processor resources are shared
    • caches, branch predictors, execution units, buses, control logic
  • Duplicated resources
    • register alias tables (map the architectural registers to physical rename registers)
    • next instruction pointer and associated control logic
    • return stack pointer
    • instruction streaming buffer and trace cache fill buffers
Pentium 4: Resources sharing schemes
  • Partition – dedicate equal resources to each logical processor
    • Good when utilization is expected to be high and somewhat unpredictable
  • Threshold – flexible resource sharing with a limit on maximum resource usage
    • Good for small resources with bursty utilization and when the micro-ops stay in the structure for short predictable periods
  • Full sharing – flexible with no limits
    • Good for large structures with variable working-set sizes (a sketch of the three schemes follows)
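
A toy sketch of the three schemes, modeled as per-logical-processor allocation caps on one shared queue; the sizes and occupancies are made up:

```c
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_SIZE  32
#define NUM_LOGICAL 2

static int occupancy[NUM_LOGICAL];   /* entries each logical CPU holds */

/* All three schemes reduce to a per-logical-processor cap:
   partitioned = hard equal split, threshold = soft cap below the
   full size, full sharing = no cap beyond the structure itself. */
static bool can_allocate(int lp, int cap) {
    int total = occupancy[0] + occupancy[1];
    if (total >= QUEUE_SIZE) return false;   /* structurally full */
    return occupancy[lp] < cap;              /* per-LP limit      */
}

int main(void) {
    int partitioned = QUEUE_SIZE / NUM_LOGICAL;  /* hard 50% split      */
    int threshold   = (QUEUE_SIZE * 3) / 4;      /* soft cap, e.g., 75% */
    int full_share  = QUEUE_SIZE;                /* no per-LP limit     */

    occupancy[0] = 20;   /* logical CPU 0 already holds 20 entries */
    printf("partitioned: %d, threshold: %d, full sharing: %d\n",
           can_allocate(0, partitioned),
           can_allocate(0, threshold),
           can_allocate(0, full_share));        /* prints 0, 1, 1 */
    return 0;
}
```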
NetBurst Pipeline

[Figure: the NetBurst pipeline, with each major queue marked as either threshold-shared or partitioned between the two logical processors.]

Pentium 4: Shared vs. partitioned resources
  • Partitioned
    • E.g., major pipeline queues
  • Threshold
    • Puts a threshold on the number of resource entries a logical processor can have
    • E.g., scheduler
  • Fully shared resources
    • E.g., caches
      • Modest interference
      • Benefit if we have shared code and/or data