Observations of the Simultaneous Multithreading Pentium 4 Processor

This presentation discusses initial observations of the simultaneous multithreading (Hyper-Threading) Pentium 4 processor, including its architecture, performance improvements, static and dynamic resource partitioning, and the impact on multi-programmed speedup. It also explores the challenges and unanswered questions that remain for SMT processors.



Presentation Transcript


  1. CS 7960-4, Lecture 20: "Initial Observations of the Simultaneous Multithreading Pentium 4 Processor", N. Tuck and D.M. Tullsen, Proceedings of PACT-12, September 2003

  2. Pentium 4 Architecture
  • Fetch/commit width = 3 μops, execution width = 6 μops
  • 128 physical registers; up to 126 in-flight instructions, of which up to 48 loads and 24 stores
  • Trace cache holds 12K μops, 6 μops per line
  • Latencies: L1 – 2 cycles, L2 – 18 cycles, memory – 361 cycles (a measurement sketch follows)
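Latency figures like these can be approximated with a dependent pointer-chasing loop timed with the x86 time-stamp counter. A minimal sketch in C, assuming x86 rdtsc and a 64 MB working set chased in a random cycle; this is an illustration of the technique, not the paper's measurement methodology:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the x86 time-stamp counter. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    /* 64 MB of pointers, chased in a random cycle so hardware
       prefetching cannot hide the load-to-use latency. */
    size_t n = (64u << 20) / sizeof(void *), i;
    void **buf = malloc(n * sizeof(void *));
    size_t *perm = malloc(n * sizeof(size_t));

    /* Build a random permutation; buf[perm[i]] points to the next element. */
    for (i = 0; i < n; i++) perm[i] = i;
    for (i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1), t = perm[i];
        perm[i] = perm[j]; perm[j] = t;
    }
    for (i = 0; i < n; i++) buf[perm[i]] = &buf[perm[(i + 1) % n]];

    /* Chase the pointers; each load depends on the previous one. */
    void **p = &buf[perm[0]];
    uint64_t start = rdtsc();
    for (i = 0; i < n; i++) p = (void **)*p;
    uint64_t cycles = rdtsc() - start;

    printf("%.1f cycles per dependent load (p=%p)\n",
           (double)cycles / n, (void *)p);
    free(perm); free(buf);
    return 0;
}
```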

  3. Hyper-Threading
  • Two hardware threads – the Linux operating system operates as if it is executing on a two-processor system (a thread-pinning sketch follows)
  • When only one thread is available, the processor behaves like a regular single-threaded superscalar
  • Statically divided resources: ROB, LSQ, issue queues – a slow thread will not cripple throughput (though this may not scale)
  • Dynamically shared resources: trace cache and decode (fine-grained multithreaded, round-robin), functional units, data cache, branch predictor
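Because Linux exposes the two hardware contexts as two logical CPUs, an experiment can place one thread on each context of the same physical core with the standard GNU affinity API. A minimal sketch, assuming a hyper-threaded machine where logical CPUs 0 and 1 are siblings (on a real system, check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Worker: report which logical CPU this thread ended up on. */
static void *worker(void *arg) {
    printf("thread %ld running on logical CPU %d\n",
           (long)arg, sched_getcpu());
    /* ... measurement kernel would run here ... */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    pthread_attr_t attr[2];

    for (long cpu = 0; cpu < 2; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);                  /* pin to one logical CPU */
        pthread_attr_init(&attr[cpu]);
        pthread_attr_setaffinity_np(&attr[cpu], sizeof(set), &set);
        pthread_create(&t[cpu], &attr[cpu], worker, (void *)cpu);
    }
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}
```

Build with gcc -pthread; pthread_attr_setaffinity_np and sched_getcpu are GNU/Linux extensions.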

  4. Results: Throughput has gone from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

  5. Methodology
  • Three workloads: single-threaded base, parallel workload (two parallel threads of the same SPLASH application), and heterogeneous workload (a single-threaded app running with each of the other apps)
  • For heterogeneous workloads: execute two threads together and restart a program whenever it finishes; do this 12 times, discard the last execution, and compute the average IPC for each thread
  • Multi-programmed speedup is the sum of per-thread efficiencies: if thread A executes at 85% of its single-threaded rate and thread B at 75%, speedup equals 1.6 (a worked sketch follows)
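A tiny sketch of that arithmetic, using the slide's example numbers (0.85 and 0.75 are the relative efficiencies from the slide, not measurements):

```c
#include <stdio.h>

/* Multi-programmed speedup: sum of per-thread IPCs relative to
   each thread's single-threaded IPC. */
static double mp_speedup(double ipc_a_smt, double ipc_a_alone,
                         double ipc_b_smt, double ipc_b_alone) {
    return ipc_a_smt / ipc_a_alone + ipc_b_smt / ipc_b_alone;
}

int main(void) {
    /* 85% + 75% of single-threaded efficiency -> speedup of 1.60. */
    printf("speedup = %.2f\n", mp_speedup(0.85, 1.0, 0.75, 1.0));
    return 0;
}
```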

  6. Static Partitioning
  • A single thread is statically assigned half the queues – this impacts IPC
  • A dummy thread ensures that there is no contention for dynamically assigned resources (caches, bpred) – this helps isolate the effect of static partitioning (a sketch follows)
  • SPEC-int achieves 83% efficiency and SPEC-fp achieves 85%; range: 71-98%
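The slide does not spell out how the dummy thread is constructed; one plausible approximation is a partner thread that keeps the second hardware context occupied while touching as few dynamically shared resources as possible, for example by spinning on the x86 pause instruction. A rough sketch under that assumption, not the paper's actual setup:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int stop;

/* Dummy partner: keep the sibling context busy while generating almost
   no cache or branch-predictor traffic (tight pause loop). */
static void *dummy_thread(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed))
        __asm__ __volatile__("pause");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, dummy_thread, NULL);

    /* ... run and time the measured benchmark thread here,
           sharing the physical core with the dummy ... */
    printf("benchmark would run here\n");

    atomic_store(&stop, 1);
    pthread_join(t, NULL);
    return 0;
}
```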

  7. Multi-Programmed Speedup

  8. Multi-Programmed Speedup
  • sixtrack and eon do not degrade their partners (small working sets?)
  • swim and art degrade their partners (cache contention?)
  • Best combination: swim & sixtrack; worst combination: swim & art
  • Static partitioning ensures low interference – the worst slowdown is 0.9

  9. Static vs. Dynamic
  • Statically partitioned resources (queues, ROB): threads run at 83-85% efficiency
  • Dynamically partitioned resources (fetch bandwidth, caches, bpred): threads run at ~60% efficiency
  • Both contribute roughly equally – however, without static partitioning, the effect of dynamic partitioning could go out of control

  10. Parallel Threads
  • Traditional multiprocessor: high communication cost, no interference
  • Parallelism on SMT: low communication cost, high interference
  • Parallel threads have similar characteristics and put more pressure on shared resources
  • Running a parallel version vs. running two copies of a single thread: depends on the algorithm and synchronization cost

  11. Parallel Thread Results

  12. Communication Speed
  • Locking and reading a value takes 68 cycles
  • Locking and updating a value takes 171 cycles (lower than the memory access time)
  • To parallelize efficiently, each loop must contain enough parallel work to offset synchronization costs – roughly 20,000 computations for SMT vs. 200,000 for an SMP; the synchronization mechanism assumed in past research was more optimistic than the real design (a measurement sketch follows)
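Numbers like the 171-cycle lock-and-update cost can be approximated by timing a lock/increment/unlock sequence with rdtsc. A rough sketch using a GCC test-and-set spinlock; the lock implementation and iteration count are assumptions here, whereas the paper measured the processor's actual synchronization primitives:

```c
#include <stdint.h>
#include <stdio.h>

static volatile int lock;      /* simple test-and-set spinlock */
static volatile long shared;   /* value protected by the lock  */

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline void acquire(void) {
    while (__sync_lock_test_and_set(&lock, 1))
        __asm__ __volatile__("pause");
}

static inline void release(void) {
    __sync_lock_release(&lock);
}

int main(void) {
    const long iters = 1000000;
    uint64_t start = rdtsc();
    for (long i = 0; i < iters; i++) {
        acquire();
        shared++;              /* lock + update of one shared value */
        release();
    }
    uint64_t cycles = rdtsc() - start;
    printf("%.1f cycles per lock+update\n", (double)cycles / iters);
    return 0;
}
```

Comparing this per-synchronization cost against the parallel work available in each loop iteration is what yields break-even granularities of the kind quoted on the slide.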

  13. Computation vs. Communication

  14. Thread Co-Scheduling
  • Diverse programs interfere less with each other
  • Average speedup is 1.20, but running two copies of the same thread yields only 1.11; int-int is 1.17, fp-fp is 1.20, and int-fp is 1.21
  • Symbiotic job scheduling: each thread has two favorable partners – construct a schedule such that every thread is co-scheduled only with its partners – average speedup of 1.27 (a pairing sketch follows)
  • Linux can't exploit this – it has two independent schedulers
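Symbiotic co-scheduling can be viewed as a pairing problem over measured pairwise speedups: repeatedly co-schedule the best remaining pair. A greedy sketch over a hypothetical speedup table; the job names reuse SPEC benchmarks mentioned on these slides, but the numbers are made up for illustration:

```c
#include <stdio.h>

#define N 4

/* Hypothetical measured speedups for co-scheduling job i with job j. */
static const double speedup[N][N] = {
    /*            gzip   swim   eon    art  */
    /* gzip */ { 0.00,  1.15,  1.22,  1.10 },
    /* swim */ { 1.15,  0.00,  1.28,  1.05 },
    /* eon  */ { 1.22,  1.28,  0.00,  1.18 },
    /* art  */ { 1.10,  1.05,  1.18,  0.00 },
};

static const char *name[N] = { "gzip", "swim", "eon", "art" };

int main(void) {
    int used[N] = { 0 };

    /* Greedily pick the best remaining pair until all jobs are scheduled. */
    for (int k = 0; k < N / 2; k++) {
        int bi = -1, bj = -1;
        double best = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++)
                if (!used[i] && !used[j] && speedup[i][j] > best) {
                    best = speedup[i][j]; bi = i; bj = j;
                }
        used[bi] = used[bj] = 1;
        printf("co-schedule %s with %s (speedup %.2f)\n",
               name[bi], name[bj], best);
    }
    return 0;
}
```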

  15. Compiler Optimizations
  • Multithreading is tolerant of low-ILP codes
  • Higher optimization levels improve overall performance, but reduce the speedup from SMT

  16. Unanswered Questions
  • Area overhead of SMT? (multiple rename tables, return address stacks, PC registers)
  • Register utilization
  • Effect of fetch policies – is fetch a bottleneck?
  • Influence on power, energy, and temperature

  17. Conclusions
  • The real design matches simulation-based expectations
  • Static partitioning is important to minimize conflicts and control throughput losses
  • Dynamic partitioning might be required for 8 threads
  • Synchronization is an order of magnitude faster than on an SMP, but there is still room for improvement

  18. Next Class' Paper
  • "The Case for a Single-Chip Multiprocessor", K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Proceedings of ASPLOS-VII, October 1996

