
Harvesting the Opportunity of GPU-based Acceleration Matei Ripeanu


Presentation Transcript


  1. Harvesting the Opportunity of GPU-based Acceleration. Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia. Joint work with Abdullah Gharaibeh and Samer Al-Kiswany.

  2. Networked Systems Laboratory (NetSysLab), University of British Columbia: a golf course … … a (nudist) beach … (and 199 days of rain each year).

  3. Hybrid architectures in the Top500 list [Nov '10]

  4. Hybrid architectures
  • High compute power / memory bandwidth
  • Energy efficient [though operated today at low efficiency]
  Agenda for this talk:
  • GPU architecture intuition: what generates the above characteristics?
  • Progress on efficiently harnessing hybrid (GPU-based) architectures

  5.–11. [GPU architecture intuition: figure slides] Acknowledgement: slides borrowed from a presentation by Kayvon Fatahalian.

  12. Idea #3: Feed the cores with data. The processing elements are data hungry! → Wide, high-throughput memory bus
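To make the idea concrete, a minimal CUDA sketch (illustrative only, not from the talk): when thread i reads element i, the 32 threads of a warp touch consecutive addresses, and the hardware coalesces those loads into a few wide transactions that use the full width of the memory bus.

    // Thread i handles element i: the warp's 32 loads fall on one
    // contiguous segment, so the hardware coalesces them into a few
    // wide memory-bus transactions.
    __global__ void scale(float *out, const float *in, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * in[i];  // contiguous, coalesced access pattern
    }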

  13. Idea #4: Hide memory access latency → hardware-supported multithreading. 10,000x parallelism!
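Again a hedged sketch rather than anything from the talk: the launch below oversubscribes the cores by orders of magnitude, so whenever one warp stalls for hundreds of cycles on a global-memory load, the hardware scheduler swaps in another ready warp.

    #include <cuda_runtime.h>

    // Trivial kernel: each thread touches one element of global memory.
    __global__ void touch(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        int n = 1 << 24;  // ~16M elements: vastly more threads than cores
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        // ~65K blocks x 256 threads: while one warp waits on a memory
        // access, others run, hiding the latency rather than avoiding it.
        touch<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }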

  14. The Resulting GPU Architecture
  [Diagram: host machine with host memory, connected to the GPU; multiprocessors 1..N, each with cores 1..M, registers, a shared memory, and an instruction unit; device-wide constant, texture, and global memories]
  NVIDIA Tesla 2050:
  • 448 cores
  • Four 'memories':
    • Shared: fast (~4 cycles), small (48KB)
    • Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
    • Texture: read-only
    • Constant: read-only
  • Hybrid: host-device link over PCIe x16, ~4GB/s
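A minimal sketch of this hierarchy in use (illustrative; assumes 256-thread blocks and an input length that is a multiple of 256): stage data from the slow, large global memory into the fast, small shared memory, then keep the repeated accesses on-chip.

    // Per-block sum: one global read per thread and one global write per
    // block; all intermediate traffic stays in shared memory (~4 cycles)
    // instead of global memory (400-600 cycles).
    __global__ void sumTile(const float *in, float *blockSums) {
        __shared__ float tile[256];            // fast on-chip memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];             // one slow global read
        __syncthreads();                       // tile now visible to the block
        // Tree reduction entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = tile[0];   // one slow global write
    }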

  15. GPUs offer different characteristics
  • Pluses: high peak compute power, high peak memory bandwidth
  • Minuses: high host-device communication overhead, limited memory space, complex to program

  16. Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
  • Porting applications to efficiently exploit GPU characteristics
    • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
    • Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
  • Middleware runtime support to simplify application development
    • CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
  • GPU-optimized building blocks: data structures and libraries
    • GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
    • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
    • A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
    • On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

  17. Motivating question: how should we design applications to efficiently exploit GPU characteristics?
  Context: a bioinformatics problem, sequence alignment
  • A string-matching problem
  • Data intensive (10^2 GB)
  [Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10]

  18. Past work: sequence alignment on GPUs
  MUMmerGPU [Schatz 07, Trapnell 09]:
  • A GPU port of the sequence-alignment tool MUMmer [Kurtz 04]
  • ~4x speedup (end-to-end) compared to the CPU version
  [Chart: execution-time breakdown; >50% is overhead]
  Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

  19. Idea: trade off time for space
  • Use a space-efficient data structure (though from a higher computational-complexity class): the suffix array
  • 4x speedup compared to the suffix-tree version on the GPU
  • Significant overhead reduction
  Consequences:
  • Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  • Focus shifts toward optimizing the compute stage
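To see where the higher complexity class comes from, a hypothetical sketch of suffix-array lookup (not the paper's actual kernel): each match is a binary search over the sorted suffixes, O(qry_len * log ref_len) per query versus the tree's O(qry_len), but the index costs roughly one 4-byte integer per reference symbol instead of ~20 bytes.

    // sa[k] is the start position of the k-th smallest suffix of ref.
    // Returns the first index whose suffix is >= qry (a lower bound);
    // the caller then checks whether qry is a prefix of that suffix.
    __device__ int saLowerBound(const char *ref, int refLen, const int *sa,
                                const char *qry, int qryLen) {
        int lo = 0, hi = refLen;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = 0;  // compare qry against the suffix at sa[mid]
            for (int j = 0; j < qryLen && cmp == 0; j++) {
                char c = (sa[mid] + j < refLen) ? ref[sa[mid] + j] : '\0';
                cmp = (qry[j] > c) - (qry[j] < c);
            }
            if (cmp > 0) lo = mid + 1;  // qry sorts after this suffix
            else         hi = mid;
        }
        return lo;
    }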

  20. Outline for the rest of this talk
  • Sequence alignment: background and offloading to the GPU
  • Space/time trade-off analysis
  • Evaluation

  21. Background: the sequence alignment problem
  Problem: find where each query most likely originated from.
  • Queries: 10^8 of them, 10^1 to 10^2 symbols per query
  • Reference: 10^6 to 10^11 symbols (up to ~400GB)
  [Figure: short query strings such as ...TAGGC... and ...GGCTA ATGCG... matched against a long reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

  22. GPU Offloading: Opportunity and Challenges
  Opportunity: sequence alignment is easy to partition and memory intensive; the GPU is massively parallel with high memory bandwidth.
  Challenges:
  • Data intensive
  • Large output size
  • Limited memory space
  • No direct access to other I/O devices (e.g., disk)

  23. GPU Offloading: addressing the challenges
  • Data-intensive problem, limited memory space → divide and compute in rounds; search-optimized data structures
  • Large output size → compressed output representation (decompressed on the CPU)
  High-level algorithm (executed on the host):
  subrefs = DivideRef(ref)
  subqrysets = DivideQrys(qrys)
  foreach subqryset in subqrysets {
      results = NULL
      CopyToGPU(subqryset)
      foreach subref in subrefs {
          CopyToGPU(subref)
          MatchKernel(subqryset, subref)
          CopyFromGPU(results)
      }
      Decompress(results)
  }
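In CUDA terms the host loop might look like the following hedged sketch; MatchKernel, the launch geometry, and the buffer bookkeeping are placeholders for illustration, not the paper's actual interfaces.

    #include <cuda_runtime.h>

    // Placeholder: the real kernel matches queries against the
    // sub-reference and writes compressed results.
    __global__ void MatchKernel(const char *qrys, const char *ref,
                                char *results) {}

    // Placeholder: expands the compressed matches on the host.
    void decompressOnCPU(char *results) {}

    void alignInRounds(char **qrySets, size_t numQrySets, size_t qryBytes,
                       char **subRefs, size_t numSubRefs, size_t refBytes,
                       char *d_qrys, char *d_ref, char *d_results,
                       char *h_results, size_t resultBytes) {
        for (size_t q = 0; q < numQrySets; q++) {
            // This round's queries stay resident on the device.
            cudaMemcpy(d_qrys, qrySets[q], qryBytes, cudaMemcpyHostToDevice);
            for (size_t r = 0; r < numSubRefs; r++) {
                // Stream one sub-reference at a time so reference plus
                // queries fit in the GPU's limited global memory.
                cudaMemcpy(d_ref, subRefs[r], refBytes,
                           cudaMemcpyHostToDevice);
                MatchKernel<<<1024, 256>>>(d_qrys, d_ref, d_results);
                cudaMemcpy(h_results, d_results, resultBytes,
                           cudaMemcpyDeviceToHost);  // compressed output back
            }
            decompressOnCPU(h_results);              // post-process on the host
        }
    }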

  24. Space/Time Trade-off Analysis

  25. The core data structure
  A massive number of queries and a long reference => pre-process the reference into an index.
  Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  • Search: O(qry_len) per query
  • Space: O(ref_len), but the constant is high, ~20x ref_len
  • Post-processing: DFS traversal per query, O(4^(qry_len - min_match_len))

  26. The core data structure, annotated on the host algorithm from slide 23:
  • Search, O(qry_len) per query: efficient
  • Space, O(ref_len) with a high ~20x ref_len constant, which drives the CopyToGPU rounds: expensive
  • Post-processing, O(4^(qry_len - min_match_len)) DFS traversal per query: expensive

  27. A better matching data structure?
  [Figure: suffix tree vs. suffix array]
  Impact 1: reduced communication. The suffix array means less data to transfer per compute round.

  28. A better matching data structure
  [Figure: suffix tree vs. suffix array]
  Impact 2: better data locality, achieved at the cost of additional per-thread processing time. The smaller index leaves space for longer sub-references => fewer processing rounds.

  29. A better matching data structure
  [Figure: suffix tree vs. suffix array]
  Impact 3: lower post-processing overhead.

  30. Evaluation

  31. Evaluation setup
  • Testbed: low-end GeForce 9800 GX2 GPU (512MB); high-end Tesla C1060 (4GB)
  • Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
  • Success metrics: performance, energy consumption
  • Workloads: NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces)

  32. Speedup: array-based over tree-based

  33. Dissecting the overheads
  Significant reduction in data transfers and post-processing.
  [Workload: HS1, ~78M queries, ~238M reference length, on the GeForce]

  34. Comparing with CPU performance [baseline: single-core CPU performance]
  [Chart panels: suffix tree, suffix array, suffix tree]

  35. Summary
  • GPUs have drastically different performance characteristics
  • Reconsidering the choice of data structure is necessary when porting applications to the GPU
  • A well-matched data structure ensures:
    • Low communication overhead
    • Data locality (possibly achieved at the cost of additional per-thread processing time)
    • Low post-processing overhead

  36. Code, benchmarks and papers available at: netsyslab.ece.ubc.ca
