
Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU



  1. Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. Victor W. Lee, et al., Intel Corporation. ISCA ’10, June 19-23, 2010, Saint-Malo, France

  2. MythBusters’ view on the topic • CPU vs GPU • http://videosift.com/video/MythBusters-CPU-vs-GPU-or-Paintball-Cannons-are-Cool • Full movie: • http://www.nvidia.com/object/nvision08_gpu_v_cpu.html

  3. The Initial Claim • Over the past 4 years NVIDIA has made a great many claims regarding how porting various types of applications to run on GPUs instead of CPUs can tremendously improve performance by anywhere from 10x to 500x. • But it actually began much earlier (SIGGRAPH 2004) • http://pl887.pairlitesite.com/talks/2004-08-08-GP2-CPU-vs-GPU-BillMark.pdf

  4. Intel’s Response? •  Intel, unsurprisingly, sees the situation differently, but has remained relatively quiet on the issue, possibly because Larrabee was going to be positioned as a discrete GPU. 

  5. Intel’s Response? • The recent announcement that Larrabee has been repurposed as an HPC/scientific computing solution may therefore be partially responsible for Intel ramping up an offensive against NVIDIA's claims regarding GPU computing. • At the International Symposium On Computer Architecture (ISCA) this June, a team from Intel presented a whitepaper purporting to investigate the real-world performance delta between CPUs and GPUs. 

  6. But before that… • December 16, 2009 • One month after ISCA’s final papers were due. • The Federal Trade Commission filed an antitrust-related lawsuit against Intel Wednesday, accusing the chip maker of deliberately attempting to hurt its competition and ultimately consumers. • The Federal Trade Commission's complaint against Intel for alleged anticompetitive practices has a new twist: graphics chips.

  7. 2009 was expensive for Intel • The European Commission fined Intel nearly 1.5 billion USD, • the US Federal Trade Commission sued Intel on anti-trust grounds, and • Intel settled with AMD for another 1.25 billion USD. • If nothing else, it was an expensive year, and while Intel settling with AMD was a significant milestone for the company, it was not the end of their troubles.

  8. Finally, the settlement(s) • The EU fine is still under appeal ($1.45B) • 8/4/2010: Intel settles with the FTC • Then there is the whole Dell issue…

  9. So back to the paper: what did Intel say? • Throughput Computing • Kernels • What is a kernel? (see the sketch below) • Kernels selected: • SGEMM, MC, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat
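To make "kernel" concrete: in the paper's usage, a kernel is the compute-heavy inner portion of a larger application, extracted so it can be tuned and measured on its own. Below is a minimal sketch of one of the listed kernels, SAXPY (y = a*x + y), written once as a plain CPU loop and once as a CUDA kernel; the function names and launch parameters are illustrative assumptions, not code from the paper.

    // SAXPY: y[i] = a * x[i] + y[i]  (illustrative sketch, not the paper's code)

    // CPU version: a simple, memory-bandwidth-bound loop.
    void saxpy_cpu(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // GPU version: one CUDA thread per element.
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Example launch (assumes d_x and d_y already reside in device memory):
    //   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

Either version does two reads, one write, and two arithmetic operations per element, which is why SAXPY ends up limited by memory bandwidth rather than raw FLOPS on both machines.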

  10. The Hardware selected • CPU: • 3.2GHz Core i7-960, 6GB RAM • GPU: • 1.3GHz eVGA GeForce GTX 280 w/ 1GB
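For scale, and using public peak figures rather than anything on the slide: the Core i7-960 tops out around 4 cores x 3.2 GHz x 8 single-precision flops per cycle ≈ 102 GFLOP/s with roughly 32 GB/s of memory bandwidth, while the GTX 280 is rated at roughly 930 GFLOP/s single precision with roughly 141 GB/s of bandwidth, i.e. about a 9x compute gap and a 4-5x bandwidth gap on paper.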

  11. Optimizations: • CPU • Multithreading, • cache blocking (sketched below), and • reorganization of memory accesses for SIMDification • GPU • Minimizing global synchronization, and • using local shared buffers.
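As an illustration of the CPU-side cache-blocking idea (this is not Intel's code), here is a loop-tiled, SGEMM-style triple loop. The tile size is an assumption chosen so a few tiles fit in cache, and the unit-stride inner loop over j is the kind of access-pattern reorganization that makes SIMDification practical.

    // Illustrative cache-blocked (tiled) SGEMM-style loop; C is assumed zero-initialized.
    // The tile size is a placeholder, not a value from the paper.
    const int TILE = 64;

    void sgemm_blocked(int n, const float *A, const float *B, float *C) {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    // Work on one tile at a time so the operands stay cache-resident.
                    for (int i = ii; i < ii + TILE && i < n; ++i)
                        for (int k = kk; k < kk + TILE && k < n; ++k) {
                            float a = A[i * n + k];
                            // Unit-stride inner loop: contiguous loads and stores
                            // that the compiler (or SSE intrinsics) can vectorize.
                            for (int j = jj; j < jj + TILE && j < n; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

On the GPU side, the analogous trick is staging tiles in per-block shared memory (the "local shared buffers" of the slide) so threads reuse data instead of re-reading global memory.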

  12. This even made Slashdot • Hardware: Intel, NVIDIA Take Shots At CPU vs. GPU Performance

  13. And PCWorld • Intel: 2-year-old Nvidia GPU Outperforms 3.2GHz Core i7 • Intel researchers have published the results of a performance comparison between their latest quad-core Core i7 processor and a two-year-old Nvidia graphics card, and found that the Intel processor can't match the graphics chip's parallel processing performance. • http://www.pcworld.com/article/199758/intel_2yearold_nvidia_gpu_outperforms_32ghz_core_i7.html

  14. From the paper's abstract: • In the past few years there have been many studies claiming GPUs deliver substantial speedups ...over multi-core CPUs...[W]e perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. • Do you have a problem with this statement?

  15. Intel's own paper indirectly raises a question when it notes: • The previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X.
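A back-of-the-envelope check on that 5X, assuming the public peak-bandwidth figures mentioned above: once a kernel is memory-bandwidth bound on both machines, the best-case gap is simply the bandwidth ratio, roughly 141 / 32 ≈ 4.4x, regardless of how large the raw arithmetic gap is.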

  16. What is important about the context? • The International Symposium on Computer Architecture (ISCA) in Saint-Malo, France, interestingly enough, is the same event where NVIDIA’s Chief Scientist Bill Dally received the prestigious 2010 Eckert-Mauchly Award for his pioneering work in architecture for parallel computing.

  17. NVIDIA Blog Response: • It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs. • http://blogs.nvidia.com/ntersect/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html

  18. NVIDIA Blog Response: (cont) • The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements.

  19. Undergraduate students learning parallel programming at M.I.T. disputed this when they looked at the performance increase they could get from different processor types and compared this with the amount of time they needed to spend in re-writing their code. • According to them, for the same investment of time as coding for a CPU, they could get more than 35x the performance from a GPU.

  20. Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts like those Intel recruited for its analysis. • In contrast, the CUDA parallel computing architecture from NVIDIA is a little over 3 years old and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10 to 100x using NVIDIA GPUs.

  21. Questions • Where did the 2.5x, 5x, and 14x come from? • How big were the problems that Intel used for comparisons? [compare w/ cache size] • How were they selected? • What optimizations were done?

  22. Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.

  23. Bottom Line • Parallelization is hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.

  24. Other Reading • On the Limits of GPU Acceleration http://www.usenix.org/event/hotpar10/tech/full_papers/Vuduc.pdf
