Heterogeneous Multi-Core Processors Jeremy Sugerman GCafe May 3, 2007
Context • Exploring the future CPU / GPU relationship • Joint work and thinking with Kayvon • Much kibitzing from Pat, Mike, Tim, Daniel • Vision and opinion, not experiments and results • More of a talk than a paper • The value is more conceptual than algorithmic • Broader gcafe audience appeal than our near-term, elbows-deep plans to dive into GPU guts
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Introduction • Multi-core is the status quo for forthcoming CPUs • A variety of emerging (for “general purpose”) architectures try to offer a discontinuous performance boost over traditional CPUs • GPU, Cell SPEs, Niagara, Larrabee, … • CPU vendors have a history of co-opting special purpose units for targeted performance wins: • FPU, SSE/Altivec, VT/SVM • CPUs should co-opt entire “compute” cores!
Introduction • Industry is already exploring hybrid models • Cell: 1 PowerPC and 8 SPEs • AMD Fusion: Slideware CPU + GPU • Intel Larrabee: Weirder, NDA encumbered • The programming model for communication between cores deserves to be architecturally defined. • Tighter integration than the current “host + accelerator” model eases porting and improves efficiency. • Work queues / buffers allow integrated coordination with decoupled execution.
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
CPU “Special Features” • CPUs are built for general purpose flexibility… • … but have always stolen fixed function units in the name of performance. • Old CPUs had schedulers, malloc burned in! • CISC instructions really were faster • Hardware managed TLBs and caches • Arguably, all virtual memory support
CPU “Special Features” • More relevantly, dedicated hardware has been adopted for domain-specific workloads. • … when the domain was sufficiently large / lucrative / influential • … and the increase in performance over software implementation / emulation was BIG • … and the cost in “design budget” (transistors, power, area, etc.) was acceptable. • Examples: FPUs, SIMD and Non-Temporal accesses, CPU virtualization
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Compute-Maximizing Processors • “Important” common apps are FLOP hungry • Video processing, Rendering • Physics / Game “Physics” • Even OS compositing managers! • HPC apps are FLOP hungry too • Computational Bio, Finance, Simulations, … • All can soak vastly more compute than current CPUs can deliver. • All can utilize thread or data parallelism. • Increased interest in custom / non-”general” processors
Compute-Maximizing Processors • Or “throughput oriented” • Packed with ALUs / FPUs • Application-specified parallelism replaces the focus on single-thread ILP • Available in many flavours: • SIMD • Highly threaded cores • High numbers of tiny cores • Stream processors • Real-life examples generally mix and match
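A minimal sketch (illustrative, not from the talk) of the common thread across these flavours: each loop iteration below is independent, so the same kernel maps equally well onto SIMD lanes, hardware threads, or many tiny cores.

```cpp
#include <cstddef>
#include <vector>

// SAXPY: a canonical throughput-oriented kernel. No iteration depends on
// any other, so the hardware is free to spread them across SIMD lanes,
// hardware threads, or many tiny cores.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];   // independent per-element work
}
```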
Compute-Maximizing Processors • Offer an order-of-magnitude potential performance boost… if the workload sustains high processor utilization • Mapping / porting algorithms is a labour-intensive and complex effort. • This is intrinsic. Within any design budget, a BIG performance win comes at a cost… • If it didn’t, the CPU designers would steal it.
Compute-Maximizing Programming • Generally offered as off-board “accelerators” • Data “tossed over the wall” and back • Only portions of computations achieve a speedup if offloaded • Accelerators mono-task one kernel at a time • Applications are sliced into successive statically defined phases separated by resorting, repacking, or converting entire datasets. • Limited to a single dataset-wide feed forward pipeline. Effectively back to batch processing
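A toy rendering of that batch model (the upload / launch / download names here are illustrative, not any real accelerator API): each phase ships the whole dataset over the wall, mono-tasks exactly one kernel across it, and ships the results back.

```cpp
#include <cstdio>
#include <vector>

std::vector<float> device_mem;   // stand-in for off-board accelerator memory

void upload(const std::vector<float>& host) { device_mem = host; }
void download(std::vector<float>& host)     { host = device_mem; }

// The accelerator mono-tasks: one kernel over the entire dataset per pass.
void launch(void (*kernel)(float&)) {
    for (float& v : device_mem) kernel(v);
}

int main() {
    std::vector<float> data(1024, 1.0f);
    upload(data);                            // toss the data over the wall...
    launch([](float& v) { v *= 2.0f; });     // statically defined phase 1
    launch([](float& v) { v += 1.0f; });     // statically defined phase 2
    download(data);                          // ...and back
    std::printf("%f\n", data[0]);            // prints 3.000000
}
```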
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Synthesis • Add at least one compute-max core to CPUs • Workloads that use it get a BIG performance win • Programmers are struggling to get any performance win from additional conventional cores • Being on-chip, architecturally defined, and ubiquitous is huge for application use of compute-max • Compute core exposed as a programmable, independent, multithreaded execution engine • A lot like adding (only!) fragment shaders • Largely agnostic on hardware “flavour”
Extensions • Unified address space • Coherency is nice, but still valuable without it • Multiple kernels “bound” (loaded) at a time • All part of the same application, for now • “Work” delivered to compute cores through work queues • Dequeuing batches / schedules for coherence, not necessarily FIFO • Compute and CPU cores can insert on remote queues
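A single-threaded sketch of the proposed model (names and structure are illustrative, not a hardware spec): several kernels bound at once, one work queue per kernel, batched draining for coherence, and insertion allowed onto any queue from any running kernel or CPU thread.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <string>

struct WorkItem { int element; };

struct ComputeCore {
    std::map<std::string, std::function<void(WorkItem)>> kernels;  // "bound" kernels
    std::map<std::string, std::deque<WorkItem>> queues;            // one queue per kernel

    void bind(const std::string& k, std::function<void(WorkItem)> fn) {
        kernels[k] = std::move(fn);
        queues[k];                           // create the kernel's queue
    }
    // Any core, CPU or compute, inserts work here (remote queues included).
    void enqueue(const std::string& k, WorkItem w) { queues[k].push_back(w); }

    void run() {                             // run until every queue drains
        for (bool busy = true; busy; ) {
            busy = false;
            for (auto& [k, q] : queues)
                while (!q.empty()) {         // drain a coherent batch; not FIFO across queues
                    WorkItem w = q.front(); q.pop_front();
                    kernels[k](w);           // may enqueue onto other queues
                    busy = true;
                }
        }
    }
};
```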
Extensions CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization. • First part is easy: • Obvious per-data-element state machine • Dynamic insertion of new “work” • Instead of sitting idle as the live thread count in a “pass” drops, a core can pull in “work” from other “passes” (queues).
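Continuing the ComputeCore sketch above (again illustrative), the per-data-element state machine falls out of the queues: an element’s state is simply which queue it sits in, and kernels forward elements onward as they finish instead of waiting for a dataset-wide pass boundary.

```cpp
#include <cstdio>
// (Continues the ComputeCore sketch above.)
int main() {
    ComputeCore core;
    core.bind("intersect", [&core](WorkItem w) {
        if (w.element % 3 != 0)                 // pretend two thirds of rays hit
            core.enqueue("shade", w);           // dynamic insertion of new "work"
    });
    core.bind("shade", [&core](WorkItem w) {
        if (w.element % 2 != 0)
            core.enqueue("intersect", {w.element / 2});   // spawn a secondary ray
        else
            std::printf("element %d done\n", w.element);
    });
    for (int i = 0; i < 8; ++i)
        core.enqueue("intersect", {i});         // CPU-side inserts
    core.run();   // stays busy pulling from whichever queue has live work
}
```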
Extensions CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization. • Second part is more controversial: • “Lots” of data quantized into a “few” states should have plentiful, easy coherence. • If the workload as a whole has coherence • Pigeonhole argument, basically • Also mitigates SIMD performance constraints • Coherence can be built / specified dynamically
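The pigeonhole claim in concrete form: with N live elements quantized into S states, some queue must hold at least ⌈N/S⌉ of them, which is far wider than typical SIMD widths whenever N is much larger than S. A toy check with arbitrary numbers:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Pigeonhole: however N live elements spread across S state queues, the
// fullest queue holds at least ceil(N/S) elements -- a coherent batch.
int main() {
    const int N = 100000, S = 8;                 // "lots" of data, "few" states
    std::vector<int> depth(S, 0);
    unsigned h = 12345;
    for (int i = 0; i < N; ++i) {                // arbitrary, skewed spread
        h = h * 1664525u + 1013904223u;
        depth[h % S]++;
    }
    int fullest = *std::max_element(depth.begin(), depth.end());
    std::printf("fullest of %d queues: %d elements (>= ceil(N/S) = %d)\n",
                S, fullest, (N + S - 1) / S);
}
```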
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Audience Participation • Do you believe my argument conceptually? • For the heterogeneous / hybrid CPU in general? • For queues and multiple kernels? • What persuades you that 3 x86 cores + a compute core is preferable to quad x86? • What app / class of apps, and how much of a win? 10x? 5x? • How skeptical are you that queues can match the performance of multi-pass / batching? • What would you find a compelling flexibility / expressiveness justification for adding queues? • Performance wins from regaining coherence in existing branching / looping shaders? • New algorithms if shaders and CPU threads can dynamically insert additional “work”?