Research on porting Memcached to GPU and APU architectures for better efficiency, throughput, and latency while tackling irregular memory access patterns.
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems
Tayler H. Hetheringtonɣ, Timothy G. Rogersɣ, Lisa Hsu*, Mike O’Connor*, Tor M. Aamodtɣ
ɣ UBC (University of British Columbia), * AMD
In Proc. 2012 ACM/IEEE Int’l Symp. on Performance Analysis of Systems and Software (ISPASS)
Image: Rich Miler – www.datacenterknowledge.com

Motivation
• New types of workloads
  • Non-HPC
  • Server applications
• Server applications
  • Memcached
• Programmer’s initial intuition into an application’s behavior
• Server farms require a lot of power
  • Need for efficient, cost-effective solutions
  • GPU/APUs
Image: Bruno Giussani – www.wired.com
Tayler Hetherington, Timothy Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt Memcached Key-value Store on GPU/APU

Background - Memcached
*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar

Memcached - Compatible with GPU?
• Irregular control flow
• Irregular memory access patterns
• Large memory requirements
• Highly input data dependent

Porting Memcached - Simple key-value lookup
• READ (GET) requests on GPU
• WRITE (SET) requests on CPU
[Figure: GET request flow on Server2: Hash, Hash chaining in Memory, Key Comparison, Return Hit/Miss]

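The lookup flow named on this slide (hash the key, walk the chain in that bucket, compare keys, return hit or miss) can be sketched in a few lines of Python. This is only an illustration of hash chaining, not the Memcached or GPU code from the talk: the class name `ChainedTable`, the bucket count, and the use of `zlib.crc32` as the hash function are all assumptions made for the example.

```python
import zlib


class ChainedTable:
    """Hypothetical chained hash table; not Memcached's actual data structure."""

    def __init__(self, num_buckets=8):
        # Each bucket holds a chain (list) of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Hash step: map the key bytes to a bucket index.
        return zlib.crc32(key) % len(self.buckets)

    def set(self, key, value):
        # WRITE (SET) path: update an existing entry or append to the chain.
        chain = self.buckets[self._index(key)]
        for i, (stored_key, _) in enumerate(chain):
            if stored_key == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        # READ (GET) path: walk the chain and compare keys.
        for stored_key, value in self.buckets[self._index(key)]:
            if stored_key == key:
                return True, value   # hit
        return False, None           # miss
```

With only a couple of buckets, several keys share a chain, which exercises the key comparison step: `ChainedTable(2)` followed by a few `set` calls and a `get` for an absent key returns `(False, None)`.
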
Porting Memcached - Batching
[Figure: the GET lookup flow (GET, Hash, Hash chaining in Memory, Key Comparison, Return Hit/Miss) replicated side by side for a batch of requests; labels: Server2, Servern]

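Batching can be pictured as collecting GET requests into fixed-size groups and resolving each group in one pass, with slot i of the batch standing in for work item i. The sketch below runs sequentially on the CPU and uses a plain `dict` as the store; the function names `make_batches` and `batched_get` are hypothetical and the code is not taken from the authors' implementation.

```python
def make_batches(requests, batch_size):
    """Split a stream of requests into consecutive batches of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def batched_get(store, keys):
    """Resolve one batch of GET keys; slot i plays the role of work item i.

    On a GPU the loop iterations would execute in parallel, one per work item.
    Here they run sequentially, purely to show the data layout of the results.
    """
    hits = [False] * len(keys)
    values = [None] * len(keys)
    for work_item, key in enumerate(keys):
        if key in store:
            hits[work_item] = True
            values[work_item] = store[key]
    return hits, values
```

Larger batches expose more parallel work per launch, at the cost of each request waiting for its batch to fill and complete, which is the throughput versus latency trade-off the later slides discuss.
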
Porting Memcached
• Main Goals
  • Increase request throughput
  • Keep request latency reasonable
• Main Challenges
  • Irregular memory access patterns
  • Irregular control flow
  • Data transfer overheads

Methodology
• Hardware
  • AMD Radeon HD 5870 (Discrete)
  • AMD Llano A8-3850 (Fusion)
  • AMD Zacate E-350 (Fusion)
• Simulators
  • GPGPU-Sim v3.x
  • In-house GPU control flow simulator
• Testing and Simulation
  • Traces of Wikipedia accesses

Porting Memcached - Memory Access
• One request per work item
• Data accesses for GET requests are input data dependent
• Data can be anywhere in memory
• Poor performance on GPU?

Porting Memcached - Memory Divergence

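One rough way to quantify memory divergence is to count how many distinct cache lines a group of work items touches with a single load. The sketch below assumes a 64-byte line and a group of 64 work items; both numbers, and the function name `memory_transactions`, are assumptions chosen for illustration rather than figures from the talk.

```python
def memory_transactions(addresses, line_size=64):
    """Count the distinct cache lines touched by one group of work items.

    addresses: one byte address per work item for the same load instruction.
    line_size: assumed cache line size in bytes (hypothetical default).
    """
    return len({addr // line_size for addr in addresses})


# Regular access: 64 work items read consecutive 4-byte words.
coalesced = [4 * i for i in range(64)]

# Irregular access: every work item lands on a different line, as can happen
# when each GET request hashes to an unrelated location in memory.
scattered = [4096 * i for i in range(64)]
```

Under these assumptions the consecutive pattern touches 4 lines while the scattered pattern touches 64, so the same instruction needs many more memory transactions when addresses depend on the input keys.
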
Porting Memcached - Control Flow
• Recall the control flow graph
• Many branch outcomes are input data dependent
[Figure: control flow graph annotated with the work item IDs active in each block: 1 – 2 – 3 – 4 – 5; 3 – 4; 1 – 2 – 5; 1 – 5; 2; 3 – 4]

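The cost of divergent branches can be estimated as SIMD efficiency: the fraction of lanes doing useful work across all blocks the group has to step through. The sketch below reads the work item ID labels on this slide as the set of active work items per block, which is an interpretation of the figure, and it assumes every block costs one issue slot. The function name `simd_efficiency` is hypothetical.

```python
def simd_efficiency(active_sets, width):
    """Fraction of lane slots that do useful work.

    active_sets: one set of active work item IDs per executed block.
    width: number of lanes that execute in lockstep.
    Assumes each block occupies exactly one issue slot for all lanes.
    """
    issued = len(active_sets) * width               # lane slots paid for
    useful = sum(len(active) for active in active_sets)  # lane slots actually used
    return useful / issued


# Active work items per block, read off the slide's labels (an interpretation).
blocks = [{1, 2, 3, 4, 5}, {3, 4}, {1, 2, 5}, {1, 5}, {2}, {3, 4}]
efficiency = simd_efficiency(blocks, width=5)
```

For this reading of the figure, 15 of the 30 lane slots are useful, giving an efficiency of 0.5. Real hardware executes blocks of different lengths, so this is only a first-order picture.
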
Porting Memcached - Control Flow
[Figure: chart labels 29%, 51%, 62%, Overall, 15%, 40%]

Porting Memcached - Data Management
• Dynamic memory manager
• Transfer memory regions to device
• Virtual addresses different on host and device

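When a memory region is copied to a device where virtual addresses differ, raw pointers stored inside the region stop being valid. One common workaround, shown here as a sketch and not necessarily what the authors did, is to store offsets relative to the start of the region instead of pointers. The record layout, the `NIL` sentinel, and the function names below are all hypothetical.

```python
import struct

NIL = 0xFFFFFFFF       # assumed sentinel meaning "no next item"
HEADER = "<III"        # next_offset, key_len, value_len (little-endian uint32)
HEADER_SIZE = struct.calcsize(HEADER)


def store_item(region, offset, key, value, next_offset):
    """Write one item at `offset` and return the offset just past it.

    The chain link is an offset into the region, not a virtual address,
    so it stays valid wherever the region is mapped.
    """
    record = struct.pack(HEADER, next_offset, len(key), len(value)) + key + value
    end = offset + len(record)
    region[offset:end] = record   # same-length slice assignment, region size unchanged
    return end


def load_item(region, offset):
    """Read back (key, value, next_offset) for the item stored at `offset`."""
    next_offset, key_len, value_len = struct.unpack_from(HEADER, region, offset)
    start = offset + HEADER_SIZE
    key = bytes(region[start:start + key_len])
    value = bytes(region[start + key_len:start + key_len + value_len])
    return key, value, next_offset
```

Copying the whole `bytearray` (standing in for a host-to-device transfer) leaves every link usable, because `load_item` only ever adds an offset to whatever base the copy happens to have.
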
Porting Memcached - Data Transfer Reduction
• Fusion Systems
  • Physical shared memory region between host and device
  • Zero-copy data
• Discrete Systems
  • Possible transfer reduction techniques
    • Reduction in unnecessary transfers
    • Acyclic data transfers (Overlap comm. with comp.)
    • Automatic data transfer frameworks

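The benefit of overlapping communication with computation on a discrete system can be estimated with a simple pipeline model. The sketch below assumes every batch has the same copy-in, kernel, and copy-out time, and that the three stages can run concurrently on different batches; those assumptions and the function names are illustrative only, and the time units are arbitrary.

```python
def serial_time(n_batches, t_in, t_kernel, t_out):
    """Total time when each batch is copied in, processed, and copied out in turn."""
    return n_batches * (t_in + t_kernel + t_out)


def overlapped_time(n_batches, t_in, t_kernel, t_out):
    """Total time for an idealized three-stage pipeline.

    The first batch pays for all three stages; after that, one batch
    completes every `bottleneck` time units, where the bottleneck is the
    slowest stage. Assumes the stages overlap perfectly across batches.
    """
    bottleneck = max(t_in, t_kernel, t_out)
    return t_in + t_kernel + t_out + (n_batches - 1) * bottleneck
```

For example, with 10 batches and stage times of 1, 2, and 1 units, the serial schedule takes 40 units and the overlapped schedule takes 22. On a Fusion system with zero-copy data the transfer stages would shrink or vanish, which this model reflects by setting `t_in` and `t_out` close to zero.
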
Porting Memcached

Results - Radeon HD 5870
• ~8000 requests yield the highest ratio of throughput to latency

Summary
• Programmer intuition doesn’t always paint the whole picture
• We exploited the available parallelism on GPUs by batching requests, showing a 7.5X performance increase on the Llano system
• Data transfer overheads can have a large impact on overall performance
Thank you – Questions?
Image: Rich Miler – www.datacenterknowledge.com