Heterogeneous CPU/GPU co-processor clusters

Michael Fruchtman

Current State
Eight of the top ten most efficient clusters are heterogeneous [1].
Power law of efficiency.

Current State
At today’s efficiencies:
An exascale (10^18 FLOPS) cluster will require 200 megawatts [2].
Cluster efficiency must grow by 66% a year to keep up with Moore’s Law.
The most efficient cluster has improved at a normalized average of 61.4% per year.
This gap represents the increase in power requirements needed to grow from petascale to exascale.
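The compounding effect of that gap can be illustrated with a short calculation. This is only a sketch, under an assumption the slide does not state: performance grows at the required 66% per year while efficiency grows at the observed 61.4%, so total power must grow by the ratio of the two each year. The function names are hypothetical.

```python
import math


def years_to_scale(factor, annual_growth):
    """Years of compound growth at annual_growth (0.66 means 66%) to reach factor."""
    return math.log(factor) / math.log(1.0 + annual_growth)


def power_growth(required, observed, years):
    """Factor by which power rises when efficiency grows at `observed` per year
    while performance grows at `required` per year (assumption, see lead-in)."""
    return ((1.0 + required) / (1.0 + observed)) ** years


# Petascale to exascale is a factor of 1000 in performance.
years = years_to_scale(1000.0, 0.66)
factor = power_growth(0.66, 0.614, years)
print(f"{years:.1f} years, power grows by a factor of {factor:.2f}")
```

Under these assumptions the transition takes between 13 and 14 years, and the power budget rises by a little under 1.5 times over that period.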

Power Efficient Amdahl’s Law [3]
Three transitions from P: P to P*, P to c*, and P to P+c*.
Speedup per watt.
f: fraction of parallel execution.
N: total number of cores.
P+c*:
Wc: power draw of c as a fraction of the power draw of P.
Kc: idle power draw of c as a fraction of its active power draw.
K: power draw of P.
Sc: performance of c relative to P.

Power Efficient Amdahl’s Law [3]
Given Wc = 0.25, Sc = 0.5, Kc = 0.60.
N varies with the power budget; K = 1.
Top: f = 0.3. Bottom: f = 0.9.
P+c* is superior with increased parallelization.
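To make the effect of f concrete, here is a minimal sketch of a speedup-per-watt calculation for the P+c* configuration. It is derived from first principles under stated assumptions and is not necessarily the exact formulation in [3]: the power and performance of P are normalized to 1, the sequential phase runs on P while the N-1 c cores idle, and the parallel phase runs on the N-1 c cores only. The function name and the `p_idle` parameter (power drawn by P during the parallel phase) are hypothetical.

```python
def perf_per_watt_p_plus_c(f, n, s_c, w_c, k_c, p_idle=0.0):
    """Return (speedup, performance per watt) relative to a single P.

    f: parallel fraction, n: total cores (one P plus n-1 c cores),
    s_c: performance of c relative to P, w_c: active power of c relative to P,
    k_c: idle power of c relative to its active power,
    p_idle: power of P while the c cores work (assumed, not from the slides).
    """
    if n < 2:
        raise ValueError("P+c* needs one P and at least one c core")
    t_seq = 1.0 - f                            # sequential phase runs on P
    t_par = f / ((n - 1) * s_c)                # parallel phase runs on the c cores
    power_seq = 1.0 + (n - 1) * w_c * k_c      # P active, c cores idle
    power_par = p_idle + (n - 1) * w_c         # c cores active
    energy = t_seq * power_seq + t_par * power_par
    speedup = 1.0 / (t_seq + t_par)
    # A single P needs time 1 at power 1, so its energy is 1 and
    # relative performance per watt reduces to 1 / energy.
    return speedup, 1.0 / energy


# Slide parameters Wc = 0.25, Sc = 0.5, Kc = 0.60, with an example N of 9.
for f in (0.3, 0.9):
    speedup, ppw = perf_per_watt_p_plus_c(f, n=9, s_c=0.5, w_c=0.25, k_c=0.60)
    print(f"f={f}: speedup={speedup:.2f}, perf/watt={ppw:.2f}")
```

With these inputs the configuration is less efficient than a single P at f = 0.3 and more efficient at f = 0.9, which matches the slide's observation that P+c* benefits from increased parallelization.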

GPU Architecture [4]

Power Efficient Amdahl’s Law and GPU
Wc = 0.00417: 0.5 watts per core, K = 120 (Intel i7 980 XE).
Kc = 0.115: turning on a GPU is 71% of power draw [5].
Sc is harder to measure: is the workload memory bound or computation bound? The GPU memory architecture makes this difficult to measure.
Sc = 0.172, assuming a computation-bound workload on the GTX 580.
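The Wc figure follows from simple division, sketched here for reference. The sketch assumes K = 120 is the CPU power draw in watts, which the slide does not state explicitly. As a cross-check it also divides the 244 W load figure quoted later in the deck by 512, the GTX 580's CUDA core count (a specification that does not appear on the slides). The function name is hypothetical.

```python
def relative_core_power(core_watts, cpu_watts):
    """Power draw of one c core as a fraction of the power draw of P."""
    return core_watts / cpu_watts


w_c = relative_core_power(0.5, 120.0)   # assumes K = 120 is in watts
watts_per_core = 244.0 / 512            # GTX 580 load power over its CUDA cores

print(f"Wc = {w_c:.5f}")
print(f"watts per core = {watts_per_core:.3f}")
```

The per-core cross-check comes out just under 0.5 W, which is consistent with the 0.5 watts per core used on the slide.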

Threads, Blocks and Performance [5]

Formal Power Modeling [6]
Average Geometric Error of Power Prediction = 9.18%.

Temperature Model [6]
RC_Rise = 35 and RC_Decay = 65 are GPU-dependent constants.

Conditions for GPU Use
The GTX 580 draws 244 W under load.
Speedup must be greater than 2, or 3 for safety.
f must be very high, preferably 0.9 or higher.
Improved energy efficiency is based on performance.
Example: GPUDB SQL queries.
Without joins, speedup is 20+ [7].
With joins, speedup is 2-7 [8].
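The speedup thresholds can be read as an energy break-even condition, sketched below. The sketch assumes the CPU draws 120 W (the K value from the earlier slide, taken to be in watts) and compares running a job on the CPU alone against running it with the GPU added. The function name and the `cpu_active_fraction` parameter are hypothetical.

```python
def break_even_speedup(cpu_watts, gpu_watts, cpu_active_fraction):
    """Speedup at which the GPU-assisted run uses the same energy as the CPU alone.

    CPU-only energy:     cpu_watts * t
    GPU-assisted energy: (gpu_watts + cpu_active_fraction * cpu_watts) * t / speedup
    Setting the two equal and solving for speedup gives the ratio below.
    """
    return (gpu_watts + cpu_active_fraction * cpu_watts) / cpu_watts


# Best case: the CPU draws nothing while the GPU works.
best = break_even_speedup(120.0, 244.0, cpu_active_fraction=0.0)
# Worst case: the CPU stays at full power alongside the GPU.
worst = break_even_speedup(120.0, 244.0, cpu_active_fraction=1.0)
print(f"break-even speedup: {best:.2f} to {worst:.2f}")
```

Under these assumptions the break-even speedup falls just above 2 when the CPU's own draw is ignored and just above 3 when the CPU stays fully powered, which is consistent with the thresholds on the slide.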

Reducing GPU Power Usage
Power gating.
Improved memory coalescence.
Memory coalescence models.
Incoherent branching.
Incoherent branching models.
NVIDIA Optimus reduces idle power to near zero.

References
[1] Feng, Wu-chun and Kirk W. Cameron. "The Green 500 List - November 2010." The Green 500. Virginia Tech and Virginia Polytechnic Institute and State University. November 2010. Web. March 15, 2011.
[2] T. Agerwala. Challenges on the road to exascale computing. Proceedings of the 22nd annual international conference on Supercomputing (ICS '08). ACM, New York, NY, USA, 2-2. 2008.
[3] D. Woo and H-H Lee. Extending Amdahl's Law for Energy-Efficient Computing in the Multi-Core Era. IEEE Xplore. IEEE Computer Society. December 2008. Web. March 15, 2011.
[4] R. Smith. "NVIDIA's GeForce GTX 580: Fermi Redefined." AnandTech. November 9, 2010. Web. March 16, 2011. http://www.anandtech.com/show/4008/nvidias-geforce-gtx-580
[5] R. Suda and D. Ren. Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels towards Power Optimized High Performance Computing. International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE Computer Society. pp. 432-438. 2009.
[6] S. Hong and H. Kim. An Integrated GPU Power and Performance Model. ISCA '10 Proceedings of the 37th annual international symposium on Computer architecture. ACM, New York, NY, USA. pp. 280-289. 2010.
[7] P. Bakkum and K. Skadron. Accelerating SQL Database Operations on a GPU with CUDA. GPGPU '10 Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, New York, NY, USA. pp. 94-103.
[8] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational Joins on Graphics Processors. SIGMOD '08 Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, New York, NY, USA. pp. 511-524. 2008.
