
Redefining the Role of the CPU in the Era of CPU-GPU Integration


Presentation Transcript


  1. Redefining the Role of the CPU in the Era of CPU-GPU Integration. Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden and Dean Tullsen. Computer Science and Engineering, UC San Diego. IEEE Micro, Nov–Dec 2012. Presented at AMD Research, August 20th, 2012.

  2. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  3. Historical Progression [Diagram: general-purpose applications drove multicore CPUs; throughput applications drove energy-efficient GPUs; GPGPU and then APU integration brought performance/energy gains. Question: where should next-gen APU improvements focus? Scaling, improved GPGPU, CPU architecture, improved memory systems, easier programming, ...]

  4. The CPU-GPU Era
  • AMD APU products: Llano (2011, Husky/K10 CPU + NI GPU), Trinity (2012, Piledriver CPU + SI GPU), Kaveri (2013, Steamroller CPU + Sea Islands GPU)
  • CPU-only counterparts: consumer Phenom/Athlon II and server Barcelona (K10); consumer Vishera and server Delhi/Abu Dhabi (Piledriver)
  • APUs have essentially the same CPU cores as CPU-only parts

  5. Example CPU-GPU Benchmark
  • KMeans (implementation from Rodinia); see the sketch below
  • [Flowchart: randomly pick centers → find the closest center for each point (GPU: easy data parallelism over each point) → find new centers (CPU: few centers, with possibly different #points per center) → repeat]
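
To make the split concrete, here is a minimal C++ sketch of the KMeans iteration described above. It follows the Rodinia partitioning in spirit only; the names and types are invented for illustration, and the per-point assignment loop is the part a GPU port turns into a kernel.

    // Hypothetical sketch of the KMeans split described on the slide.
    // The per-point assignment loop is the data-parallel part that a
    // GPU port offloads; the center update stays on the CPU.
    #include <cfloat>
    #include <cstddef>
    #include <vector>

    struct Point { float x, y; };

    // GPU portion: one independent computation per point.
    void assign_centers(const std::vector<Point>& pts,
                        const std::vector<Point>& centers,
                        std::vector<int>& label) {
        for (std::size_t i = 0; i < pts.size(); ++i) {   // parallel over points
            float best = FLT_MAX; int arg = 0;
            for (std::size_t c = 0; c < centers.size(); ++c) {
                float dx = pts[i].x - centers[c].x;
                float dy = pts[i].y - centers[c].y;
                float d = dx * dx + dy * dy;
                if (d < best) { best = d; arg = (int)c; }
            }
            label[i] = arg;
        }
    }

    // CPU portion: few centers, data-dependent counts per center.
    void update_centers(const std::vector<Point>& pts,
                        const std::vector<int>& label,
                        std::vector<Point>& centers) {
        std::vector<int> count(centers.size(), 0);
        for (auto& c : centers) c = {0.0f, 0.0f};
        for (std::size_t i = 0; i < pts.size(); ++i) {
            centers[label[i]].x += pts[i].x;
            centers[label[i]].y += pts[i].y;
            ++count[label[i]];
        }
        for (std::size_t c = 0; c < centers.size(); ++c)
            if (count[c]) { centers[c].x /= count[c]; centers[c].y /= count[c]; }
    }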

  6. Properties of KMeans
  • The CPU portion remains performance critical
  • Adding the GPU drastically changes the properties of the code left on the CPU
  • Aim: understand and evaluate this "new" CPU workload

  7. The Need to Rethink CPU Design
  • APUs: a prime example of heterogeneous systems
  • Heterogeneity: compose cores that each run a subset of the code well
  • The CPU need not be fully general-purpose; it is sufficient to optimize it for the non-GPU code
  • Approach: investigate the non-GPU code and let it guide CPU design

  8. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  9. Benchmarks [Diagram: spectrum of applications from CPU-only to GPU-heavy: serial apps and parallel apps (CPU-Heavy), partitioned apps (Mixed), and GPU-Heavy apps, spanning CPU and GPU]

  10. Benchmarks
  • CPU-Heavy (11 apps): important computing apps with no evidence of GPU ports
    • SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [Serial]
    • Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [Parallel]
  • Mixed and GPU-Heavy (11 + 11 apps)
    • Rodinia (7 apps)
    • SPEC/Parsec mapped to GPUs (15 apps)

  11. Mixed [Table: the Mixed benchmark applications; list not captured in the transcript]

  12. GPU-Heavy [Table: the GPU-Heavy benchmark applications; list not captured in the transcript]

  13. Methodology
  • Interested in the non-GPU portions of CPU-GPU code
  • Ideal scenario: port all applications to the GPU and use hardware counters
    • Requires man-hours and domain expertise; code is platform- and architecture-dependent
  • Instead: CPU-GPU partitioning based on expert information
    • Publicly available source code (Rodinia)
    • Details of GPU portions from publications and our own implementations (SPEC/Parsec)

  14. Methodology
  • Microarchitectural simulations
    • Marked the GPU portions in the application code
    • Ran the marked applications through Pin-based microarchitectural simulators (ILP, branches, loads and stores)
  • Machine measurements
    • Using the marked code (CPU criticality)
    • Used parallel CPU source code when available (TLP studies)
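
As a concrete illustration of the marking step, here is a hedged sketch of source-level region markers that a Pin tool can intercept by symbol name. The marker names and mechanism are hypothetical, not the authors' actual instrumentation.

    // Hypothetical region markers of the kind a Pin tool can find by
    // routine name. The names roi_gpu_begin/roi_gpu_end are invented
    // for illustration; GCC/Clang attributes keep them as real calls.
    extern "C" {
    void __attribute__((noinline)) roi_gpu_begin() { asm volatile(""); }
    void __attribute__((noinline)) roi_gpu_end()   { asm volatile(""); }
    }

    void kmeans_iteration() {
        roi_gpu_begin();   // instructions here count as the GPU portion
        // ... data-parallel assignment step ...
        roi_gpu_end();     // back to the CPU portion
        // ... center update step, measured as the "new" CPU workload ...
    }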

  15. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  16. CPU Criticality [Chart: fraction of execution time spent on the CPU vs. the GPU per application]

  17. CPU Criticality [Chart: CPU vs. GPU time for Mixed and GPU-Heavy apps; "Future" bars with averages weighted by conservative CPU time]
  • Mixed: even though 80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU
  • GPU-Heavy: the CPU still executes 7-14% of the time
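
A back-of-the-envelope calculation (illustrative numbers, not the paper's measurements) shows why mapping most of the code to the GPU can still leave the CPU as the bottleneck: if 80% of the original execution time maps to the GPU and the GPU runs that portion 10x faster, total time becomes 0.20 + 0.80/10 = 0.28, of which 0.20/0.28 ≈ 71% is spent on the CPU.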

  18. Instruction Level Parallelism
  • Measures the inherent parallelism of the instruction stream
  • ILP measured assuming perfect memory and perfect branch prediction
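
A minimal sketch of how such a limit study can be computed over a dynamic trace, assuming unit-latency ops, a register-only dependence model, and an in-order-retire scheduling window. The trace format and the 256-register cap are invented for illustration; this is not the paper's simulator.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <vector>

    struct TraceOp { int dst, src1, src2; };  // register ids (< 256), -1 if unused

    // ILP = instructions / critical-path cycles when only data dependences
    // and a finite scheduling window constrain issue. Perfect memory and
    // perfect branch prediction are modeled by simply ignoring both.
    double ilp_limit(const std::vector<TraceOp>& trace, std::size_t window) {
        std::vector<uint64_t> ready(256, 0);  // cycle each register is ready
        std::deque<uint64_t> done_q;          // completion cycles, program order
        uint64_t last = 0;
        for (const TraceOp& op : trace) {
            uint64_t start = 0;
            if (op.src1 >= 0) start = std::max(start, ready[op.src1]);
            if (op.src2 >= 0) start = std::max(start, ready[op.src2]);
            if (done_q.size() == window) {    // can't issue until the op that
                start = std::max(start, done_q.front());  // opened the window retires
                done_q.pop_front();
            }
            uint64_t done = start + 1;        // unit latency for every op
            if (op.dst >= 0) ready[op.dst] = done;
            done_q.push_back(done);
            last = std::max(last, done);
        }
        return trace.empty() ? 0.0 : double(trace.size()) / double(last);
    }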

  19. [Chart: ILP for CPU-Heavy apps; averages 12.7 and 9.6, presumably for the 512- and 128-entry windows of the next slide]

  20. [Chart: ILP of the CPU code before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps. Overall: 9.9 → 9.5 (128-entry window), 13.7 → 12.2 (512). Mixed: 10.3 → 9.2 (128), 15.3 → 11.1 (512). GPU-Heavy: 14.6 → 13.7. CPU-Heavy reference: 12.7 and 9.6.]

  21. Instruction Level Parallelism
  • ILP dropped in 17 of 22 applications: by 4% for a 128-entry window and 10.9% for a 512-entry window
  • Dropped by half for 5 applications; Mixed-app ILP dropped by as much as 27.5%
  • Common case: independent loops are mapped to the GPU, leaving less regular, dependence-heavy code on the CPU
  • Occasionally the long dependent chains sit on the GPU instead, e.g. Blackscholes (5 of 22 apps are such outliers)
  • Implication: potential gains from larger instruction windows will be degraded
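
The common case above is easy to see in code. The following generic contrast (not taken from the paper's benchmarks) shows the kind of loop that migrates to the GPU next to the kind that stays behind on the CPU.

    #include <cstddef>
    #include <vector>

    // Independent iterations: high ILP, an easy GPU target.
    void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
        for (std::size_t i = 0; i < y.size(); ++i)
            y[i] = a * x[i] + y[i];          // no loop-carried dependence
    }

    // Loop-carried dependence chain: low ILP, typically left on the CPU.
    float decayed_sum(const std::vector<float>& x) {
        float acc = 0.0f;
        for (std::size_t i = 0; i < x.size(); ++i)
            acc = acc * 0.99f + x[i];        // each step needs the previous one
        return acc;
    }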

  22. Branches
  • Branches categorized into 4 categories (see the classifier sketch below):
    • Biased (> 95% same direction)
    • Patterned (> 95% accuracy on a very large local predictor)
    • Correlated (> 95% accuracy on a very large gshare predictor)
    • Hard (remaining)
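
A minimal sketch of how this taxonomy can be applied to a trace of (pc, taken) outcomes. The 95% thresholds follow the slide; the 16-bit histories, 2-bit-style counters, and XOR indexing are simplifying assumptions, not the paper's exact predictor configurations.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct BranchStats {
        uint64_t execs = 0, taken = 0, local_hits = 0, gshare_hits = 0;
    };

    class BranchClassifier {
        static const int HIST = 16;                        // history bits (assumed)
        std::unordered_map<uint64_t, uint32_t> local_hist; // per-pc local history
        std::vector<int8_t> local_ctr, gshare_ctr;         // saturating counters
        uint32_t ghr = 0;                                  // global history
        std::unordered_map<uint64_t, BranchStats> stats;
        static void bump(int8_t& c, bool t) {              // 2-bit counter in [-2, 1]
            c = (int8_t)std::max(-2, std::min(1, c + (t ? 1 : -1)));
        }
    public:
        BranchClassifier() : local_ctr(1 << HIST, 0), gshare_ctr(1 << HIST, 0) {}

        void record(uint64_t pc, bool taken) {
            BranchStats& s = stats[pc];
            s.execs++; s.taken += taken;
            uint32_t lh = local_hist[pc] & ((1u << HIST) - 1);
            std::size_t li = (pc ^ lh)  & ((1u << HIST) - 1);  // local index
            std::size_t gi = (pc ^ ghr) & ((1u << HIST) - 1);  // gshare index
            if ((local_ctr[li]  >= 0) == taken) s.local_hits++;
            if ((gshare_ctr[gi] >= 0) == taken) s.gshare_hits++;
            bump(local_ctr[li], taken);
            bump(gshare_ctr[gi], taken);
            local_hist[pc] = (lh << 1) | taken;
            ghr = (ghr << 1) | taken;
        }

        std::string classify(uint64_t pc) {                // run after the trace
            const BranchStats& s = stats[pc];
            if (s.execs == 0) return "hard";
            double bias = double(std::max(s.taken, s.execs - s.taken)) / s.execs;
            if (bias > 0.95)                                return "biased";
            if (double(s.local_hits)  / s.execs > 0.95)     return "patterned";
            if (double(s.gshare_hits) / s.execs > 0.95)     return "correlated";
            return "hard";
        }
    };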

  23. [Chart: branch-category breakdown for CPU-Heavy apps; segments of 24.7%, 7.0%, 13.1%, and 55.2% across the four categories]

  24. [Chart: hard branches in the CPU code before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 11.3%, 5.1%, 9.4%, 18.6%. Callouts: effect of CPU-heavy apps; effect of data-dependent branches in GPU-Heavy apps.] Overall: branch predictors tuned for generic CPU execution may not be sufficient.

  25. Loads and Stores
  • Loads and stores categorized into 4 categories (see the classifier sketch below):
    • Static (> 95% same address)
    • Strided (> 95% accuracy on a very large stride predictor)
    • Patterned (> 95% accuracy on a very large Markov predictor)
    • Hard (remaining)
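
A minimal sketch of the load/store taxonomy over a trace of (pc, address) pairs. The "very large" predictors are idealized here as unbounded per-pc maps; the thresholds follow the slide, and the rest is an illustrative assumption rather than the paper's exact setup.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct MemStats {
        uint64_t execs = 0, same_addr = 0, stride_hits = 0, markov_hits = 0;
        uint64_t last_addr = 0; int64_t last_stride = 0; bool seen = false;
        std::unordered_map<uint64_t, uint64_t> markov;  // last addr -> next addr
    };

    class MemClassifier {
        std::unordered_map<uint64_t, MemStats> stats;
    public:
        void record(uint64_t pc, uint64_t addr) {
            MemStats& s = stats[pc];
            if (s.seen) {                              // no prediction on 1st access
                s.execs++;
                if (addr == s.last_addr) s.same_addr++;
                int64_t stride = int64_t(addr) - int64_t(s.last_addr);
                if (stride == s.last_stride) s.stride_hits++;   // stride hit
                auto it = s.markov.find(s.last_addr);           // Markov hit
                if (it != s.markov.end() && it->second == addr) s.markov_hits++;
                s.markov[s.last_addr] = addr;
                s.last_stride = stride;
            }
            s.last_addr = addr; s.seen = true;
        }

        std::string classify(uint64_t pc) {            // run after the trace
            const MemStats& s = stats[pc];
            if (s.execs == 0) return "static";         // seen at most once
            if (double(s.same_addr)   / s.execs > 0.95) return "static";
            if (double(s.stride_hits) / s.execs > 0.95) return "strided";
            if (double(s.markov_hits) / s.execs > 0.95) return "patterned";
            return "hard";
        }
    };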

  26. [Chart: load-category breakdown for CPU-Heavy apps; values 77.5%, 5.9%, 16.6%]

  27. [Chart: store-category breakdown for CPU-Heavy apps; values 71.7%, 10.2%, 18.1%]

  28. [Chart: hard loads before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 27.0%, 44.4%, 47.3%, 61.6%. Callout: effect of kernels with irregular accesses moving to the GPU.] Overall: stride or next-line predictors will struggle.

  29. [Chart: hard stores before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 34.9%, 38.6%, 48.6%, 51.3%.] Overall: slightly less pronounced, but results similar to loads.

  30. [Chart: vector (SSE) instruction fraction for CPU-Heavy apps: 7.3%]

  31. [Chart: vector instruction fraction before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps; values 16.9%, 15.0%, 9.6%, 8.5%.] Takeaway: vector ISA enhancements target the same regions of code as the GPU.

  32. [Chart: TLP (multicore speedup) for CPU-Heavy apps]

  33. [Chart: TLP speedups before (CPU) and after GPU partitioning (+GPU), for Mixed and GPU-Heavy apps]
  • GPU-Heavy: abundant parallelism disappears, 14.0x → 2.1x; no gain going from 8 cores to 32 cores
  • Mixed: gains drop from 4x to 1.4x
  • Overall: only a 10% gain going from 8 cores to 32 cores; 32-core TLP dropped 60%, from 5.5x to 2.2x

  34. Overview
  • Motivation
  • Benchmarks and Methodology
  • Analysis
    • CPU Criticality
    • ILP
    • Branches
    • Loads and Stores
    • Vector Instructions
    • TLP
  • Impact on CPU Design

  35. CPU Design in the post-GPU Era
  • Only modest gains from increasing window sizes
  • Considerably increased pressure on the branch predictor, in spite of fewer static branches
    • Adopt techniques targeting a few difficult branches (L-TAGE, Seznec 2007)
  • Memory accesses will continue to be a major bottleneck
    • Stride or next-line prefetching becomes significantly less relevant
    • Lots of literature but little adoption in real machines, e.g. helper-thread prefetching or mechanisms targeting pointer chains; see the sketch below
  • SSE rendered significantly less important: every core need not have it, or cores could share SSE hardware
  • Extra CPU cores/threads not of much use because of the lack of TLP
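
As one example of the pointer-chain techniques alluded to above, here is a sketch of jump-pointer prefetching in the spirit of Roth and Sohi (1999). The fixed lookahead and the GCC/Clang __builtin_prefetch intrinsic are choices made here for illustration; the paper does not prescribe this particular mechanism.

    // Each node stores a "jump" pointer to a node k hops ahead, so the
    // traversal can prefetch well past the serial next-pointer chain.
    struct Node { int payload; Node* next; Node* jump; };

    // Precompute jump pointers (done once, off the hot path).
    void set_jump_pointers(Node* head, int k) {
        Node* lead = head;
        for (int i = 0; i < k && lead; ++i) lead = lead->next;
        for (Node* n = head; n; n = n->next) {
            n->jump = lead;
            if (lead) lead = lead->next;
        }
    }

    int sum_list(Node* head) {
        int sum = 0;
        for (Node* n = head; n; n = n->next) {
            if (n->jump)
                __builtin_prefetch(n->jump, /*rw=*/0, /*locality=*/1);
            sum += n->payload;               // latency of later hops is hidden
        }
        return sum;
    }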

  36. CPU Design in the post-GPU Era
  • Clear case for big cores (with a focus on loads/stores/branches, not ILP) plus GPUs
  • Need to start adopting proposals for few-thread performance
  • Start by revisiting old techniques from this new perspective

  37. Backup

  38. On Using Unmodified Source Code
  • Most common memory layout change in GPU ports: AOS -> SOA (see the sketch below); this is still only a change in stride value
  • AOS is already well captured by stride/Markov predictors
  • CPU-only code has even better locality, also well captured by stride/Markov predictors
  • But those locality-enhanced accesses are exactly what maps to the GPU
  • Minimal impact on the CPU code that remains with a GPU: its accesses are still irregular
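
A generic illustration of the layout change discussed above (not Rodinia code): traversing one field is strided under both layouts, so a stride predictor captures either; the transformation only changes the stride value.

    #include <vector>

    // Array of Structures: the fields of one element are adjacent.
    struct Particle { float x, y, z, mass; };
    float sum_mass_aos(const std::vector<Particle>& p) {
        float s = 0.0f;
        for (const Particle& q : p) s += q.mass;  // stride = sizeof(Particle)
        return s;
    }

    // Structure of Arrays: one field is contiguous across elements.
    struct Particles { std::vector<float> x, y, z, mass; };
    float sum_mass_soa(const Particles& p) {
        float s = 0.0f;
        for (float m : p.mass) s += m;            // unit stride
        return s;
    }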
