
eXtreme Virtual Pipelining (XVP): Moving Towards Scalable Multithreaded Processors
Korey Sewell*, Trevor Mudge*, Steven K. Reinhardt*†
*Advanced Computer Architecture Laboratory (ACAL), University of Michigan, Ann Arbor
†Advanced Micro Devices (AMD)
ASPLOS – WACI '09

The Comp. Arch. Research Train
P = Processor(s), T = Thread(s)
• Uniprocessor-Place (1P, 1T)
• Multithreading-Ville (1P, ~2-4T)
• Multicore-Estates (2-4P, ~2-4T)
• Many-Core Mansion (~32-64P, ~2-4T)
Did we miss a stop on the way? What about "Many"-Threading?!

Why "Many-Threading"?
• CHANGES the way we think about architecture…
• Moving from 2-4 threads per core to 16, 32, or even 64 threads per core
• Threads aren't just Parallel… They're Adjacent!
• What would you create if you had "threads to throw away"?
• Hmmmmmmm…

WACI "Many"-Threading Possibilities
• "Coherence-Free" Synchronization & Communication
  • Why suffer from non-deterministic memory latency when so many threads are adjacent (on the same core)?
[Figure: two CPUs, each running threads T0, T1, T2, …, TN, connected to the memory system]

WACI "Many"-Threading Possibilities
• Extremely Speculative Multithreading
  • Use extra threads during speculative events (e.g., branch misprediction, cache miss)
  • Fast-forward execution by traversing the speculation tree and then switching threads
[Figure: speculation tree of T/F branch outcomes, with one path labeled "Branch Misprediction"]

WACI "Many"-Threading Possibilities
• Super Virtual Machines
  • Security: every application given its own VM?
• Many-Many Systems!
  • Many Threads, Many Cores
  • 1000-thread system ≈ 64 cores × 16 threads per core (1,024 threads)
• Redundant Multithreading
• This list keeps going… and going… and going!!!

How do we get to Many-Threading?
• A design that avoids the non-scalable pitfalls of conventional multithreading, such as…
  • Replication of per-thread resources
  • Extensive size increases of shared resources
  • Complex methods for distributing resources among threads

WACI Solution: eXtreme Virtual Pipelining (XVP)
• Provide each thread the illusion that it has all the processor resources to itself
• Traditionally, simultaneously executing threads have a shared pipeline view
[Figure: pipeline (F, D, R, EXE) with IQ, RF, ROB, and LSQ, drawn once as a single pipeline shared by T0-TN and once as a separate pipeline view for each of T0, T1, …, TN]

WACI Solution: eXtreme Virtual Pipelining (XVP)
• Pipeline Virtualization: resource entries are mapped into each thread's address space
[Figure: entries 0 … 7 of Resource "X" in the CPU map to memory at BaseT0 + 0 … 7, BaseT1 + 0 … 7, …, BaseTN + 0 … 7 for threads T0, T1, …, TN]
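The mapping on this slide can be sketched in a few lines of Python. This is only an illustration, assuming that each thread's copy of a resource occupies a contiguous region starting at a per-thread base address. The names `VirtualResource`, `thread_bases`, and the 8-byte entry size are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualResource:
    """One pipeline resource (e.g. Resource "X") mapped into memory."""
    name: str
    entries: int          # entries visible to each thread (0 .. entries-1)
    entry_bytes: int = 8  # hypothetical entry size, not given in the slides

    def address(self, thread_base: int, index: int) -> int:
        """Memory address of entry `index` in the region at `thread_base`."""
        if not 0 <= index < self.entries:
            raise IndexError(f"{self.name}: entry {index} out of range")
        return thread_base + index * self.entry_bytes


def thread_bases(first_base: int, n_threads: int, res: VirtualResource) -> list[int]:
    """Assumed layout: per-thread regions packed back to back in memory."""
    stride = res.entries * res.entry_bytes
    return [first_base + t * stride for t in range(n_threads)]


# Resource "X" with entries 0..7, as in the figure, for three threads.
resource_x = VirtualResource("X", entries=8)
bases = thread_bases(0x1000, 3, resource_x)
for t, base in enumerate(bases):
    first = resource_x.address(base, 0)
    last = resource_x.address(base, 7)
    print(f"T{t}: entries 0..7 -> {first:#x}..{last:#x}")
```

Every thread indexes the same entries 0 … 7, but the differing base addresses keep the regions disjoint, which is what gives each thread a private view of the resource.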

WACI Solution: eXtreme Virtual Pipelining (XVP)
• XVP extends the notion of a hardware context to include pipeline resources
• Add a C-Cache (Context Cache) to avoid D-Cache thrashing and potentially reduce the memory footprint of workloads
• Each stallable resource is matched with its own "on-demand" Fill-Spill Unit (FSU)
  • Ex: spill the IQ on a dependent load miss; fill when the miss resolves
• FSUs allow resources to dynamically partition themselves for arbitrary workloads
• Virtualize all stalling processor resources to memory
  • Fetch Buffer, Instruction Queue, Load/Store Queue, Register File, Reorder Buffer
[Figure: pipeline (F, D, R, EXE) with IQ, RF, ROB, and LSQ, plus FSUs and the C-Cache]
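The spill/fill example above (spill the IQ on a dependent load miss, fill when the miss resolves) can be modeled roughly as follows. This is a toy software sketch, not the hardware design: the class `FillSpillUnit`, its methods, and the per-thread spill lists standing in for the C-Cache or memory are all assumptions made for illustration.

```python
from collections import defaultdict


class FillSpillUnit:
    """Toy model of an FSU guarding one fixed-size resource (e.g. the IQ)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = []                 # (thread_id, entry) held in the resource
        self.spilled = defaultdict(list)   # thread_id -> entries in backing storage

    def insert(self, thread_id: int, entry: str) -> bool:
        """Place an entry in the resource; False means the resource is full."""
        if len(self.resident) >= self.capacity:
            return False
        self.resident.append((thread_id, entry))
        return True

    def spill(self, thread_id: int) -> int:
        """Move a stalled thread's entries out, e.g. on a dependent load miss."""
        keep = []
        for tid, entry in self.resident:
            if tid == thread_id:
                self.spilled[tid].append(entry)
            else:
                keep.append((tid, entry))
        moved = len(self.resident) - len(keep)
        self.resident = keep
        return moved

    def fill(self, thread_id: int) -> int:
        """Bring entries back in program order once the miss resolves."""
        pending = self.spilled[thread_id]
        moved = 0
        while pending and len(self.resident) < self.capacity:
            self.resident.append((thread_id, pending.pop(0)))
            moved += 1
        return moved


iq = FillSpillUnit(capacity=4)
for tid, op in [(0, "ld"), (0, "add"), (1, "mul"), (1, "sub")]:
    iq.insert(tid, op)
iq.spill(0)            # T0 stalls on a load miss; its entries leave the IQ
iq.insert(2, "or")     # the freed slots go to another thread
iq.spill(2)            # hypothetical: T2 stalls later, freeing slots again
iq.fill(0)             # T0's miss resolved; its entries return
print(iq.resident)
```

The point of the sketch is the dynamic partitioning: a stalled thread occupies no entries, so the same small structure can serve whichever threads are able to make progress.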

WACI Conclusion: eXtreme Virtual Pipelining (XVP)
• A high number of threads per core opens up interesting multithreading research angles
• XVP's pipeline virtualization moves toward scalable many-threads-per-core designs
  • Each thread has the illusion that it has its own pipeline
• XVP can also benefit single-thread processors…
  • Because XVP's virtualization provides more resources than are traditionally available

Thanks for Listening!
