
Redefining the Role of the CPU in the Era of CPU-GPU Integration


Presentation Transcript


  1. Redefining the Role of the CPU in the Era of CPU-GPU Integration. Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden and Dean Tullsen. Computer Science and Engineering, UC San Diego. IEEE Micro, Nov–Dec 2012. Presented at AMD Research, August 20th, 2012.

  2. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  3. Historical Progression [Diagram: general-purpose applications drove multicore CPUs; throughput applications drove energy-efficient GPUs; GPGPU and then APU integration brought performance/energy gains. Question: where should next-gen APU improvements focus? Scaling, improved GPGPU, CPU architecture, improved memory systems, easier programming, ...]

  4. The CPU-GPU Era
  • AMD APU products: Llano (2011, Husky/K10 CPU + NI GPU), Trinity (2012, Piledriver CPU + SI GPU), Kaveri (2013, Steamroller CPU + Sea Islands GPU)
  • CPU-only counterparts: consumer Phenom/Athlon II and server Barcelona (K10); consumer Vishera and server Delhi/Abu Dhabi (Piledriver)
  • APUs have essentially the same CPU cores as CPU-only parts

  5. Example CPU-GPU Benchmark
  • KMeans (implementation from Rodinia); see the sketch below
  • [Flowchart: randomly pick centers → find the closest center for each point (GPU: easy data parallelism over each point) → find new centers (CPU: few centers, with possibly different #points per center) → repeat]
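
To make the split concrete, here is a minimal C++ sketch of the KMeans iteration described above. It follows the Rodinia partitioning in spirit only; the names and types are invented for illustration, and the per-point assignment loop is the part a GPU port turns into a kernel.

    // Hypothetical sketch of the KMeans split described on the slide.
    // The per-point assignment loop is the data-parallel part that a
    // GPU port offloads; the center update stays on the CPU.
    #include <cfloat>
    #include <cstddef>
    #include <vector>

    struct Point { float x, y; };

    // GPU portion: one independent computation per point.
    void assign_centers(const std::vector<Point>& pts,
                        const std::vector<Point>& centers,
                        std::vector<int>& label) {
        for (std::size_t i = 0; i < pts.size(); ++i) {   // parallel over points
            float best = FLT_MAX; int arg = 0;
            for (std::size_t c = 0; c < centers.size(); ++c) {
                float dx = pts[i].x - centers[c].x;
                float dy = pts[i].y - centers[c].y;
                float d = dx * dx + dy * dy;
                if (d < best) { best = d; arg = (int)c; }
            }
            label[i] = arg;
        }
    }

    // CPU portion: few centers, data-dependent counts per center.
    void update_centers(const std::vector<Point>& pts,
                        const std::vector<int>& label,
                        std::vector<Point>& centers) {
        std::vector<int> count(centers.size(), 0);
        for (auto& c : centers) c = {0.0f, 0.0f};
        for (std::size_t i = 0; i < pts.size(); ++i) {
            centers[label[i]].x += pts[i].x;
            centers[label[i]].y += pts[i].y;
            ++count[label[i]];
        }
        for (std::size_t c = 0; c < centers.size(); ++c)
            if (count[c]) { centers[c].x /= count[c]; centers[c].y /= count[c]; }
    }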

  6. Properties of KMeans
  • The CPU portion remains performance critical
  • Adding the GPU drastically changes the properties of the code left on the CPU
  • Aim: understand and evaluate this "new" CPU workload

  7. The Need to Rethink CPU Design
  • APUs: a prime example of heterogeneous systems
  • Heterogeneity: compose cores that each run a subset of the code well
  • The CPU need not be fully general-purpose; it is sufficient to optimize it for the non-GPU code
  • Approach: investigate the non-GPU code and let it guide CPU design

  8. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  9. Benchmarks [Diagram: spectrum of applications from CPU-only to GPU-heavy: serial apps and parallel apps (CPU-Heavy), partitioned apps (Mixed), and GPU-Heavy apps, spanning CPU and GPU]

  10. Benchmarks
  • CPU-Heavy (11 apps): important computing apps with no evidence of GPU ports
    • SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [Serial]
    • Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [Parallel]
  • Mixed and GPU-Heavy (11 + 11 apps)
    • Rodinia (7 apps)
    • SPEC/Parsec mapped to GPUs (15 apps)

  11. Mixed [Table: the Mixed benchmark applications; list not captured in the transcript]

  12. GPU-Heavy [Table: the GPU-Heavy benchmark applications; list not captured in the transcript]

  13. Methodology
  • Interested in the non-GPU portions of CPU-GPU code
  • Ideal scenario: port all applications to the GPU and use hardware counters
    • Requires man-hours and domain expertise; code is platform- and architecture-dependent
  • Instead: CPU-GPU partitioning based on expert information
    • Publicly available source code (Rodinia)
    • Details of GPU portions from publications and our own implementations (SPEC/Parsec)

  14. Methodology
  • Microarchitectural simulations
    • Marked the GPU portions in the application code
    • Ran the marked applications through Pin-based microarchitectural simulators (ILP, branches, loads and stores)
  • Machine measurements
    • Using the marked code (CPU criticality)
    • Used parallel CPU source code when available (TLP studies)
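
As a concrete illustration of the marking step, here is a hedged sketch of source-level region markers that a Pin tool can intercept by symbol name. The marker names and mechanism are hypothetical, not the authors' actual instrumentation.

    // Hypothetical region markers of the kind a Pin tool can find by
    // routine name. The names roi_gpu_begin/roi_gpu_end are invented
    // for illustration; GCC/Clang attributes keep them as real calls.
    extern "C" {
    void __attribute__((noinline)) roi_gpu_begin() { asm volatile(""); }
    void __attribute__((noinline)) roi_gpu_end()   { asm volatile(""); }
    }

    void kmeans_iteration() {
        roi_gpu_begin();   // instructions here count as the GPU portion
        // ... data-parallel assignment step ...
        roi_gpu_end();     // back to the CPU portion
        // ... center update step, measured as the "new" CPU workload ...
    }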

  15. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  16. CPU Criticality [Chart: fraction of execution time spent on the CPU vs. the GPU per application]

  17. CPU Criticality [Chart: CPU vs. GPU time for Mixed and GPU-Heavy apps; "Future" bars with averages weighted by conservative CPU time]
  • Mixed: even though 80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU
  • GPU-Heavy: the CPU still executes 7-14% of the time
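
A back-of-the-envelope calculation (illustrative numbers, not the paper's measurements) shows why mapping most of the code to the GPU can still leave the CPU as the bottleneck: if 80% of the original execution time maps to the GPU and the GPU runs that portion 10x faster, total time becomes 0.20 + 0.80/10 = 0.28, of which 0.20/0.28 ≈ 71% is spent on the CPU.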

  18. Instruction Level Parallelism
  • Measures the inherent parallelism of the instruction stream
  • ILP measured assuming perfect memory and perfect branch prediction
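
A minimal sketch of how such a limit study can be computed over a dynamic trace, assuming unit-latency ops, a register-only dependence model, and an in-order-retire scheduling window. The trace format and the 256-register cap are invented for illustration; this is not the paper's simulator.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <vector>

    struct TraceOp { int dst, src1, src2; };  // register ids (< 256), -1 if unused

    // ILP = instructions / critical-path cycles when only data dependences
    // and a finite scheduling window constrain issue. Perfect memory and
    // perfect branch prediction are modeled by simply ignoring both.
    double ilp_limit(const std::vector<TraceOp>& trace, std::size_t window) {
        std::vector<uint64_t> ready(256, 0);  // cycle each register is ready
        std::deque<uint64_t> done_q;          // completion cycles, program order
        uint64_t last = 0;
        for (const TraceOp& op : trace) {
            uint64_t start = 0;
            if (op.src1 >= 0) start = std::max(start, ready[op.src1]);
            if (op.src2 >= 0) start = std::max(start, ready[op.src2]);
            if (done_q.size() == window) {    // can't issue until the op that
                start = std::max(start, done_q.front());  // opened the window retires
                done_q.pop_front();
            }
            uint64_t done = start + 1;        // unit latency for every op
            if (op.dst >= 0) ready[op.dst] = done;
            done_q.push_back(done);
            last = std::max(last, done);
        }
        return trace.empty() ? 0.0 : double(trace.size()) / double(last);
    }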

  19. [Chart: ILP for CPU-Heavy apps; averages 12.7 and 9.6, presumably for the 512- and 128-entry windows of the next slide]

  20. [Chart: ILP of the CPU code before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps. Overall: 9.9 → 9.5 (128-entry window), 13.7 → 12.2 (512). Mixed: 10.3 → 9.2 (128), 15.3 → 11.1 (512). GPU-Heavy: 14.6 → 13.7. CPU-Heavy reference: 12.7 and 9.6.]

  21. Instruction Level Parallelism
  • ILP dropped in 17 of 22 applications: by 4% for a 128-entry window and 10.9% for a 512-entry window
  • Dropped by half for 5 applications; Mixed-app ILP dropped by as much as 27.5%
  • Common case: independent loops are mapped to the GPU, leaving less regular, dependence-heavy code on the CPU
  • Occasionally the long dependent chains sit on the GPU instead, e.g. Blackscholes (5 of 22 apps are such outliers)
  • Implication: potential gains from larger instruction windows will be degraded
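
The common case above is easy to see in code. The following generic contrast (not taken from the paper's benchmarks) shows the kind of loop that migrates to the GPU next to the kind that stays behind on the CPU.

    #include <cstddef>
    #include <vector>

    // Independent iterations: high ILP, an easy GPU target.
    void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
        for (std::size_t i = 0; i < y.size(); ++i)
            y[i] = a * x[i] + y[i];          // no loop-carried dependence
    }

    // Loop-carried dependence chain: low ILP, typically left on the CPU.
    float decayed_sum(const std::vector<float>& x) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < x.size(); ++i)
            acc = acc * 0.99f + x[i];        // each step needs the previous one
        return acc;
    }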

  22. Branches
  • Branches categorized into 4 categories (see the classifier sketch below):
    • Biased (> 95% same direction)
    • Patterned (> 95% accuracy on a very large local predictor)
    • Correlated (> 95% accuracy on a very large gshare predictor)
    • Hard (remaining)
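
A minimal sketch of how this taxonomy can be applied to a trace of (pc, taken) outcomes. The 95% thresholds follow the slide; the 16-bit histories, 2-bit-style counters, and XOR indexing are simplifying assumptions, not the paper's exact predictor configurations.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct BranchStats {
        uint64_t execs = 0, taken = 0, local_hits = 0, gshare_hits = 0;
    };

    class BranchClassifier {
        static const int HIST = 16;                        // history bits (assumed)
        std::unordered_map<uint64_t, uint32_t> local_hist; // per-pc local history
        std::vector<int8_t> local_ctr, gshare_ctr;         // saturating counters
        uint32_t ghr = 0;                                  // global history
        std::unordered_map<uint64_t, BranchStats> stats;
        static void bump(int8_t& c, bool t) {              // 2-bit counter in [-2, 1]
            c = (int8_t)std::max(-2, std::min(1, c + (t ? 1 : -1)));
        }
    public:
        BranchClassifier() : local_ctr(1 << HIST, 0), gshare_ctr(1 << HIST, 0) {}

        void record(uint64_t pc, bool taken) {
            BranchStats& s = stats[pc];
            s.execs++; s.taken += taken;
            uint32_t lh = local_hist[pc] & ((1u << HIST) - 1);
            std::size_t li = (pc ^ lh)  & ((1u << HIST) - 1);  // local index
            std::size_t gi = (pc ^ ghr) & ((1u << HIST) - 1);  // gshare index
            if ((local_ctr[li]  >= 0) == taken) s.local_hits++;
            if ((gshare_ctr[gi] >= 0) == taken) s.gshare_hits++;
            bump(local_ctr[li], taken);
            bump(gshare_ctr[gi], taken);
            local_hist[pc] = (lh << 1) | taken;
            ghr = (ghr << 1) | taken;
        }

        std::string classify(uint64_t pc) {                // run after the trace
            const BranchStats& s = stats[pc];
            if (s.execs == 0) return "hard";
            double bias = double(std::max(s.taken, s.execs - s.taken)) / s.execs;
            if (bias > 0.95)                                return "biased";
            if (double(s.local_hits)  / s.execs > 0.95)     return "patterned";
            if (double(s.gshare_hits) / s.execs > 0.95)     return "correlated";
            return "hard";
        }
    };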

  23. [Chart: branch-category breakdown for CPU-Heavy apps; segments of 24.7%, 7.0%, 13.1%, and 55.2% across the four categories]

  24. [Chart: hard branches in the CPU code before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 11.3%, 5.1%, 9.4%, 18.6%. Callouts: effect of CPU-heavy apps; effect of data-dependent branches in GPU-Heavy apps.] Overall: branch predictors tuned for generic CPU execution may not be sufficient.

  25. Loads and Stores
  • Loads and stores categorized into 4 categories (see the classifier sketch below):
    • Static (> 95% same address)
    • Strided (> 95% accuracy on a very large stride predictor)
    • Patterned (> 95% accuracy on a very large Markov predictor)
    • Hard (remaining)
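
A minimal sketch of the load/store taxonomy over a trace of (pc, address) pairs. The "very large" predictors are idealized here as unbounded per-pc maps; the thresholds follow the slide, and the rest is an illustrative assumption rather than the paper's exact setup.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct MemStats {
        uint64_t execs = 0, same_addr = 0, stride_hits = 0, markov_hits = 0;
        uint64_t last_addr = 0; int64_t last_stride = 0; bool seen = false;
        std::unordered_map<uint64_t, uint64_t> markov;  // last addr -> next addr
    };

    class MemClassifier {
        std::unordered_map<uint64_t, MemStats> stats;
    public:
        void record(uint64_t pc, uint64_t addr) {
            MemStats& s = stats[pc];
            if (s.seen) {                              // no prediction on 1st access
                s.execs++;
                if (addr == s.last_addr) s.same_addr++;
                int64_t stride = int64_t(addr) - int64_t(s.last_addr);
                if (stride == s.last_stride) s.stride_hits++;   // stride hit
                auto it = s.markov.find(s.last_addr);           // Markov hit
                if (it != s.markov.end() && it->second == addr) s.markov_hits++;
                s.markov[s.last_addr] = addr;
                s.last_stride = stride;
            }
            s.last_addr = addr; s.seen = true;
        }

        std::string classify(uint64_t pc) {            // run after the trace
            const MemStats& s = stats[pc];
            if (s.execs == 0) return "static";         // seen at most once
            if (double(s.same_addr)   / s.execs > 0.95) return "static";
            if (double(s.stride_hits) / s.execs > 0.95) return "strided";
            if (double(s.markov_hits) / s.execs > 0.95) return "patterned";
            return "hard";
        }
    };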

  26. [Chart: load-category breakdown for CPU-Heavy apps; values 77.5%, 5.9%, 16.6%]

  27. [Chart: store-category breakdown for CPU-Heavy apps; values 71.7%, 10.2%, 18.1%]

  28. [Chart: hard loads before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 27.0%, 44.4%, 47.3%, 61.6%. Callout: effect of kernels with irregular accesses moving to the GPU.] Overall: stride or next-line predictors will struggle.

  29. [Chart: hard stores before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 34.9%, 38.6%, 48.6%, 51.3%.] Overall: slightly less pronounced, but results similar to loads.

  30. [Chart: vector (SSE) instruction fraction for CPU-Heavy apps: 7.3%]

  31. [Chart: vector instruction fraction before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 16.9%, 15.0%, 9.6%, 8.5%.] Takeaway: vector ISA enhancements target the same regions of code as the GPU.

  32. [Chart: TLP (multicore speedup) for CPU-Heavy apps]

  33. [Chart: TLP speedups before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps]
  • GPU-Heavy: abundant parallelism disappears, 14.0x → 2.1x; no gain going from 8 cores to 32 cores
  • Mixed: gains drop from 4x to 1.4x
  • Overall: only a 10% gain going from 8 cores to 32 cores; 32-core TLP dropped 60%, from 5.5x to 2.2x

  34. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  35. CPU Design in the post-GPU Era
  • Only modest gains from increasing window sizes
  • Considerably increased pressure on the branch predictor, in spite of fewer static branches
    • Adopt techniques targeting a few difficult branches (L-TAGE, Seznec 2007)
  • Memory accesses will continue to be a major bottleneck
    • Stride or next-line prefetching becomes significantly less relevant
    • Lots of literature but little adoption in real machines, e.g. helper-thread prefetching or mechanisms targeting pointer chains; see the sketch below
  • SSE rendered significantly less important: every core need not have it, or cores could share SSE hardware
  • Extra CPU cores/threads not of much use because of the lack of TLP
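
As one example of the pointer-chain techniques alluded to above, here is a sketch of jump-pointer prefetching in the spirit of Roth and Sohi (1999). The fixed lookahead and the GCC/Clang __builtin_prefetch intrinsic are choices made here for illustration; the paper does not prescribe this particular mechanism.

    // Each node stores a "jump" pointer to a node k hops ahead, so the
    // traversal can prefetch well past the serial next-pointer chain.
    struct Node { int payload; Node* next; Node* jump; };

    // Precompute jump pointers (done once, off the hot path).
    void set_jump_pointers(Node* head, int k) {
        Node* lead = head;
        for (int i = 0; i < k && lead; ++i) lead = lead->next;
        for (Node* n = head; n; n = n->next) {
            n->jump = lead;
            if (lead) lead = lead->next;
        }
    }

    int sum_list(Node* head) {
        int sum = 0;
        for (Node* n = head; n; n = n->next) {
            if (n->jump)
                __builtin_prefetch(n->jump, /*rw=*/0, /*locality=*/1);
            sum += n->payload;               // latency of later hops is hidden
        }
        return sum;
    }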

  36. CPU Design in the post-GPU Era
  • Clear case for big cores (with a focus on loads/stores/branches, not ILP) plus GPUs
  • Need to start adopting proposals for few-thread performance
  • Start by revisiting old techniques from this new perspective

  37. Backup

  38. On Using Unmodified Source Code
  • Most common memory layout change in GPU ports: AOS -> SOA (see the sketch below); this is still only a change in stride value
  • AOS is already well captured by stride/Markov predictors
  • CPU-only code has even better locality, also well captured by stride/Markov predictors
  • But those locality-enhanced accesses are exactly what maps to the GPU
  • Minimal impact on the CPU code that remains with a GPU: its accesses are still irregular
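
A generic illustration of the layout change discussed above (not Rodinia code): traversing one field is strided under both layouts, so a stride predictor captures either; the transformation only changes the stride value.

    #include <vector>

    // Array of Structures: the fields of one element are adjacent.
    struct Particle { float x, y, z, mass; };
    float sum_mass_aos(const std::vector<Particle>& p) {
        float s = 0.0f;
        for (const Particle& q : p) s += q.mass;  // stride = sizeof(Particle)
        return s;
    }

    // Structure of Arrays: one field is contiguous across elements.
    struct Particles { std::vector<float> x, y, z, mass; };
    float sum_mass_soa(const Particles& p) {
        float s = 0.0f;
        for (float m : p.mass) s += m;            // unit stride
        return s;
    }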
