Observations of the Simultaneous Multithreading Pentium 4 Processor

This presentation discusses initial observations of the simultaneous multithreading (Hyper-Threading) Pentium 4 processor, including its architecture, performance improvements, static and dynamic resource partitioning, and the impact on multi-programmed speedup. It also explores the challenges and unanswered questions that remain for SMT processors.



Presentation Transcript


  1. CS 7960-4, Lecture 20: "Initial Observations of the Simultaneous Multithreading Pentium 4 Processor", N. Tuck and D.M. Tullsen, Proceedings of PACT-12, September 2003

  2. Pentium 4 Architecture
  • Fetch/commit width = 3 μops, execution width = 6 μops
  • 128 physical registers; up to 126 in-flight instructions, of which up to 48 loads and 24 stores
  • Trace cache holds 12K μops, 6 μops per line
  • Latencies: L1 – 2 cycles, L2 – 18 cycles, memory – 361 cycles (a measurement sketch follows)
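Latency figures like these can be approximated with a dependent pointer-chasing loop timed with the x86 time-stamp counter. A minimal sketch in C, assuming x86 rdtsc and a 64 MB working set chased in a random cycle; this is an illustration of the technique, not the paper's measurement methodology:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the x86 time-stamp counter. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    /* 64 MB of pointers, chased in a random cycle so hardware
       prefetching cannot hide the load-to-use latency. */
    size_t n = (64u << 20) / sizeof(void *), i;
    void **buf = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));

    /* Build a random permutation; buf[perm[i]] points to the next element. */
    for (i = 0; i < n; i++) perm[i] = i;
    for (i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1), t = perm[i];
        perm[i] = perm[j]; perm[j] = t;
    }
    for (i = 0; i < n; i++) buf[perm[i]] = &buf[perm[(i + 1) % n]];

    /* Chase the pointers; each load depends on the previous one. */
    void **p = &buf[perm[0]];
    uint64_t start = rdtsc();
    for (i = 0; i < n; i++) p = (void **)*p;
    uint64_t cycles = rdtsc() - start;

    printf("%.1f cycles per dependent load (p=%p)\n",
           (double)cycles / n, (void *)p);
    free(perm); free(buf);
    return 0;
}
```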

  3. Hyper-Threading
  • Two hardware threads – the Linux operating system operates as if it is executing on a two-processor system (a thread-pinning sketch follows)
  • When only one thread is available, the processor behaves like a regular single-threaded superscalar
  • Statically divided resources: ROB, LSQ, issue queues – a slow thread will not cripple throughput (though this may not scale)
  • Dynamically shared resources: trace cache and decode (fine-grained multithreaded, round-robin), functional units, data cache, branch predictor
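Because Linux exposes the two hardware contexts as two logical CPUs, an experiment can place one thread on each context of the same physical core with the standard GNU affinity API. A minimal sketch, assuming a hyper-threaded machine where logical CPUs 0 and 1 are siblings (on a real system, check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Worker: report which logical CPU this thread ended up on. */
static void *worker(void *arg) {
    printf("thread %ld running on logical CPU %d\n",
           (long)arg, sched_getcpu());
    /* ... measurement kernel would run here ... */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    pthread_attr_t attr[2];

    for (long cpu = 0; cpu < 2; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);                  /* pin to one logical CPU */
        pthread_attr_init(&attr[cpu]);
        pthread_attr_setaffinity_np(&attr[cpu], sizeof(set), &set);
        pthread_create(&t[cpu], &attr[cpu], worker, (void *)cpu);
    }
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}
```

Build with gcc -pthread; pthread_attr_setaffinity_np and sched_getcpu are GNU/Linux extensions.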

  4. Results: Throughput has gone from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

  5. Methodology
  • Three workloads: single-threaded base, parallel workload (two parallel threads of the same SPLASH application), and heterogeneous workload (a single-threaded app running with each of the other apps)
  • For heterogeneous workloads: execute two threads together and restart a program whenever it finishes; do this 12 times, discard the last execution, and compute the average IPC for each thread
  • Multi-programmed speedup is the sum of per-thread efficiencies: if thread A executes at 85% of its single-threaded rate and thread B at 75%, speedup equals 1.6 (a worked sketch follows)
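A tiny sketch of that arithmetic, using the slide's example numbers (0.85 and 0.75 are the relative efficiencies from the slide, not measurements):

```c
#include <stdio.h>

/* Multi-programmed speedup: sum of per-thread IPCs relative to
   each thread's single-threaded IPC. */
static double mp_speedup(double ipc_a_smt, double ipc_a_alone,
                         double ipc_b_smt, double ipc_b_alone) {
    return ipc_a_smt / ipc_a_alone + ipc_b_smt / ipc_b_alone;
}

int main(void) {
    /* 85% + 75% of single-threaded efficiency -> speedup of 1.60. */
    printf("speedup = %.2f\n", mp_speedup(0.85, 1.0, 0.75, 1.0));
    return 0;
}
```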

  6. Static Partitioning
  • A single thread is statically assigned half the queues – this impacts IPC
  • A dummy thread ensures that there is no contention for dynamically assigned resources (caches, bpred) – this helps isolate the effect of static partitioning (a sketch follows)
  • SPEC-int achieves 83% efficiency and SPEC-fp achieves 85%; range: 71-98%
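The slide does not spell out how the dummy thread is constructed; one plausible approximation is a partner thread that keeps the second hardware context occupied while touching as few dynamically shared resources as possible, for example by spinning on the x86 pause instruction. A rough sketch under that assumption, not the paper's actual setup:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int stop;

/* Dummy partner: keep the sibling context busy while generating almost
   no cache or branch-predictor traffic (tight pause loop). */
static void *dummy_thread(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed))
        __asm__ __volatile__("pause");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, dummy_thread, NULL);

    /* ... run and time the measured benchmark thread here,
           sharing the physical core with the dummy ... */
    printf("benchmark would run here\n");

    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    return 0;
}
```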

  7. Multi-Programmed Speedup

  8. Multi-Programmed Speedup
  • sixtrack and eon do not degrade their partners (small working sets?)
  • swim and art degrade their partners (cache contention?)
  • Best combination: swim & sixtrack; worst combination: swim & art
  • Static partitioning ensures low interference – the worst slowdown is 0.9

  9. Static vs. Dynamic
  • Statically partitioned resources (queues, ROB): threads run at 83-85% efficiency
  • Dynamically partitioned resources (fetch bandwidth, caches, bpred): threads run at ~60% efficiency
  • Both contribute roughly equally – however, without static partitioning, the effect of dynamic partitioning could go out of control

  10. Parallel Threads
  • Traditional multiprocessor: high communication cost, no interference
  • Parallelism on SMT: low communication cost, high interference
  • Parallel threads have similar characteristics and put more pressure on shared resources
  • Running a parallel version vs. running two copies of a single thread: depends on the algorithm and synchronization cost

  11. Parallel Thread Results

  12. Communication Speed
  • Locking and reading a value takes 68 cycles
  • Locking and updating a value takes 171 cycles (lower than the memory access time)
  • To parallelize efficiently, each loop must contain enough parallel work to offset synchronization costs – roughly 20,000 computations for SMT vs. 200,000 for an SMP; the synchronization mechanism assumed in past research was more optimistic than the real design (a measurement sketch follows)
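Numbers like the 171-cycle lock-and-update cost can be approximated by timing a lock/increment/unlock sequence with rdtsc. A rough sketch using a GCC test-and-set spinlock; the lock implementation and iteration count are assumptions here, whereas the paper measured the processor's actual synchronization primitives:

```c
#include <stdint.h>
#include <stdio.h>

static volatile int lock;      /* simple test-and-set spinlock */
static volatile long shared;   /* value protected by the lock  */

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline void acquire(void) {
    while (__sync_lock_test_and_set(&lock, 1))
        __asm__ __volatile__("pause");
}

static inline void release(void) {
    __sync_lock_release(&lock);
}

int main(void) {
    const long iters = 1000000;
    uint64_t start = rdtsc();
    for (long i = 0; i < iters; i++) {
        acquire();
        shared++;              /* lock + update of one shared value */
        release();
    }
    uint64_t cycles = rdtsc() - start;
    printf("%.1f cycles per lock+update\n", (double)cycles / iters);
    return 0;
}
```

Comparing this per-synchronization cost against the parallel work available in each loop iteration is what yields break-even granularities of the kind quoted on the slide.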

  13. Computation vs. Communication

  14. Thread Co-Scheduling
  • Diverse programs interfere less with each other
  • Average speedup is 1.20, but running two copies of the same thread yields only 1.11; int-int is 1.17, fp-fp is 1.20, and int-fp is 1.21
  • Symbiotic job scheduling: each thread has two favorable partners – construct a schedule such that every thread is co-scheduled only with its partners – average speedup of 1.27 (a pairing sketch follows)
  • Linux can't exploit this – it has two independent schedulers
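Symbiotic co-scheduling can be viewed as a pairing problem over measured pairwise speedups: repeatedly co-schedule the best remaining pair. A greedy sketch over a hypothetical speedup table; the job names reuse SPEC benchmarks mentioned on these slides, but the numbers are made up for illustration:

```c
#include <stdio.h>

#define N 4

/* Hypothetical measured speedups for co-scheduling job i with job j. */
static const double speedup[N][N] = {
    /*            gzip   swim   eon    art  */
    /* gzip */ { 0.00,  1.15,  1.22,  1.10 },
    /* swim */ { 1.15,  0.00,  1.28,  1.05 },
    /* eon  */ { 1.22,  1.28,  0.00,  1.18 },
    /* art  */ { 1.10,  1.05,  1.18,  0.00 },
};

static const char *name[N] = { "gzip", "swim", "eon", "art" };

int main(void) {
    int used[N] = { 0 };

    /* Greedily pick the best remaining pair until all jobs are scheduled. */
    for (int k = 0; k < N / 2; k++) {
        int bi = -1, bj = -1;
        double best = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (!used[i] && !used[j] && speedup[i][j] > best) {
                    best = speedup[i][j]; bi = i; bj = j;
                }
        used[bi] = used[bj] = 1;
        printf("co-schedule %s with %s (speedup %.2f)\n",
               name[bi], name[bj], best);
    }
    return 0;
}
```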

  15. Compiler Optimizations
  • Multithreading is tolerant of low-ILP codes
  • Higher optimization levels improve overall performance, but reduce the speedup from SMT

  16. Unanswered Questions
  • Area overhead of SMT? (multiple rename tables, return address stacks, PC registers)
  • Register utilization
  • Effect of fetch policies – is fetch a bottleneck?
  • Influence on power, energy, and temperature

  17. Conclusions
  • The real design matches simulation-based expectations
  • Static partitioning is important to minimize conflicts and control throughput losses
  • Dynamic partitioning might be required for 8 threads
  • Synchronization is an order of magnitude faster than on an SMP, but there is still room for improvement

  18. Next Class' Paper
  • "The Case for a Single-Chip Multiprocessor", K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Proceedings of ASPLOS-VII, October 1996

