
Harvesting the Opportunity of GPU-based Acceleration Matei Ripeanu


Presentation Transcript


  1. Harvesting the Opportunity of GPU-based Acceleration. Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia. Joint work with Abdullah Gharaibeh and Samer Al-Kiswany.

  2. Networked Systems Laboratory (NetSysLab), University of British Columbia: a golf course … … a (nudist) beach … (and 199 days of rain each year).

  3. Hybrid architectures in the Top500 list [Nov '10]

  4. Hybrid architectures
  • High compute power / memory bandwidth
  • Energy efficient [though operated today at low efficiency]
  Agenda for this talk:
  • GPU architecture intuition: what generates the above characteristics?
  • Progress on efficiently harnessing hybrid (GPU-based) architectures

  5.–11. [GPU architecture intuition: figure slides] Acknowledgement: slides borrowed from a presentation by Kayvon Fatahalian.

  12. Idea #3: Feed the cores with data. The processing elements are data hungry! → Wide, high-throughput memory bus
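To make the idea concrete, a minimal CUDA sketch (illustrative only, not from the talk): when thread i reads element i, the 32 threads of a warp touch consecutive addresses, and the hardware coalesces those loads into a few wide transactions that use the full width of the memory bus.

    // Thread i handles element i: the warp's 32 loads fall on one
    // contiguous segment, so the hardware coalesces them into a few
    // wide memory-bus transactions.
    __global__ void scale(float *out, const float *in, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * in[i];  // contiguous, coalesced access pattern
    }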

  13. Idea #4: Hide memory access latency → hardware-supported multithreading. 10,000x parallelism!
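Again a hedged sketch rather than anything from the talk: the launch below oversubscribes the cores by orders of magnitude, so whenever one warp stalls for hundreds of cycles on a global-memory load, the hardware scheduler swaps in another ready warp.

    #include <cuda_runtime.h>

    // Trivial kernel: each thread touches one element of global memory.
    __global__ void touch(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        int n = 1 << 24;  // ~16M elements: vastly more threads than cores
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        // ~65K blocks x 256 threads: while one warp waits on a memory
        // access, others run, hiding the latency rather than avoiding it.
        touch<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }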

  14. The Resulting GPU Architecture
  [Diagram: host machine with host memory, connected to the GPU; multiprocessors 1..N, each with cores 1..M, registers, a shared memory, and an instruction unit; device-wide constant, texture, and global memories]
  NVIDIA Tesla 2050:
  • 448 cores
  • Four 'memories':
    • Shared: fast (~4 cycles), small (48KB)
    • Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
    • Texture: read-only
    • Constant: read-only
  • Hybrid: host-device link over PCIe x16, ~4GB/s
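A minimal sketch of this hierarchy in use (illustrative; assumes 256-thread blocks and an input length that is a multiple of 256): stage data from the slow, large global memory into the fast, small shared memory, then keep the repeated accesses on-chip.

    // Per-block sum: one global read per thread and one global write per
    // block; all intermediate traffic stays in shared memory (~4 cycles)
    // instead of global memory (400-600 cycles).
    __global__ void sumTile(const float *in, float *blockSums) {
        __shared__ float tile[256];            // fast on-chip memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];             // one slow global read
        __syncthreads();                       // tile now visible to the block
        // Tree reduction entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = tile[0];   // one slow global write
    }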

  15. GPUs offer different characteristics
  • Pluses: high peak compute power, high peak memory bandwidth
  • Minuses: high host-device communication overhead, limited memory space, complex to program

  16. Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
  • Porting applications to efficiently exploit GPU characteristics
    • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
    • Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
  • Middleware runtime support to simplify application development
    • CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
  • GPU-optimized building blocks: data structures and libraries
    • GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
    • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
    • A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
    • On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

  17. Motivating question: how should we design applications to efficiently exploit GPU characteristics?
  Context: a bioinformatics problem, sequence alignment
  • A string-matching problem
  • Data intensive (10^2 GB)
  [Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10]

  18. Past work: sequence alignment on GPUs
  MUMmerGPU [Schatz 07, Trapnell 09]:
  • A GPU port of the sequence-alignment tool MUMmer [Kurtz 04]
  • ~4x speedup (end-to-end) compared to the CPU version
  [Chart: execution-time breakdown; >50% is overhead]
  Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

  19. Idea: trade off time for space
  • Use a space-efficient data structure (though from a higher computational-complexity class): the suffix array
  • 4x speedup compared to the suffix-tree version on the GPU
  • Significant overhead reduction
  Consequences:
  • Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
  • Focus shifts toward optimizing the compute stage
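To see where the higher complexity class comes from, a hypothetical sketch of suffix-array lookup (not the paper's actual kernel): each match is a binary search over the sorted suffixes, O(qry_len * log ref_len) per query versus the tree's O(qry_len), but the index costs roughly one 4-byte integer per reference symbol instead of ~20 bytes.

    // sa[k] is the start position of the k-th smallest suffix of ref.
    // Returns the first index whose suffix is >= qry (a lower bound);
    // the caller then checks whether qry is a prefix of that suffix.
    __device__ int saLowerBound(const char *ref, int refLen, const int *sa,
                                const char *qry, int qryLen) {
        int lo = 0, hi = refLen;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = 0;  // compare qry against the suffix at sa[mid]
            for (int j = 0; j < qryLen && cmp == 0; j++) {
                char c = (sa[mid] + j < refLen) ? ref[sa[mid] + j] : '\0';
                cmp = (qry[j] > c) - (qry[j] < c);
            }
            if (cmp > 0) lo = mid + 1;  // qry sorts after this suffix
            else         hi = mid;
        }
        return lo;
    }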

  20. Outline for the rest of this talk
  • Sequence alignment: background and offloading to the GPU
  • Space/time trade-off analysis
  • Evaluation

  21. Background: the sequence alignment problem
  Problem: find where each query most likely originated from.
  • Queries: 10^8 of them, 10^1 to 10^2 symbols per query
  • Reference: 10^6 to 10^11 symbols (up to ~400GB)
  [Figure: short query strings such as ...TAGGC... and ...GGCTA ATGCG... matched against a long reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

  22. GPU Offloading: Opportunity and Challenges
  Opportunity: sequence alignment is easy to partition and memory intensive; the GPU is massively parallel with high memory bandwidth.
  Challenges:
  • Data intensive
  • Large output size
  • Limited memory space
  • No direct access to other I/O devices (e.g., disk)

  23. GPU Offloading: addressing the challenges
  • Data-intensive problem, limited memory space → divide and compute in rounds; search-optimized data structures
  • Large output size → compressed output representation (decompressed on the CPU)
  High-level algorithm (executed on the host):
  subrefs = DivideRef(ref)
  subqrysets = DivideQrys(qrys)
  foreach subqryset in subqrysets {
      results = NULL
      CopyToGPU(subqryset)
      foreach subref in subrefs {
          CopyToGPU(subref)
          MatchKernel(subqryset, subref)
          CopyFromGPU(results)
      }
      Decompress(results)
  }
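In CUDA terms the host loop might look like the following hedged sketch; MatchKernel, the launch geometry, and the buffer bookkeeping are placeholders for illustration, not the paper's actual interfaces.

    #include <cuda_runtime.h>

    // Placeholder: the real kernel matches queries against the
    // sub-reference and writes compressed results.
    __global__ void MatchKernel(const char *qrys, const char *ref,
                                char *results) {}

    // Placeholder: expands the compressed matches on the host.
    void decompressOnCPU(char *results) {}

    void alignInRounds(char **qrySets, size_t numQrySets, size_t qryBytes,
                       char **subRefs, size_t numSubRefs, size_t refBytes,
                       char *d_qrys, char *d_ref, char *d_results,
                       char *h_results, size_t resultBytes) {
        for (size_t q = 0; q < numQrySets; q++) {
            // This round's queries stay resident on the device.
            cudaMemcpy(d_qrys, qrySets[q], qryBytes, cudaMemcpyHostToDevice);
            for (size_t r = 0; r < numSubRefs; r++) {
                // Stream one sub-reference at a time so reference plus
                // queries fit in the GPU's limited global memory.
                cudaMemcpy(d_ref, subRefs[r], refBytes,
                           cudaMemcpyHostToDevice);
                MatchKernel<<<1024, 256>>>(d_qrys, d_ref, d_results);
                cudaMemcpy(h_results, d_results, resultBytes,
                           cudaMemcpyDeviceToHost);  // compressed output back
            }
            decompressOnCPU(h_results);              // post-process on the host
        }
    }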

  24. Space/Time Trade-off Analysis

  25. The core data structure
  A massive number of queries and a long reference => pre-process the reference into an index.
  Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
  • Search: O(qry_len) per query
  • Space: O(ref_len), but the constant is high, ~20x ref_len
  • Post-processing: DFS traversal per query, O(4^(qry_len - min_match_len))

  26. The core data structure, annotated on the host algorithm from slide 23:
  • Search, O(qry_len) per query: efficient
  • Space, O(ref_len) with a high ~20x ref_len constant, which drives the CopyToGPU rounds: expensive
  • Post-processing, O(4^(qry_len - min_match_len)) DFS traversal per query: expensive

  27. A better matching data structure?
  [Figure: suffix tree vs. suffix array]
  Impact 1: reduced communication. The suffix array means less data to transfer per compute round.

  28. A better matching data structure
  [Figure: suffix tree vs. suffix array]
  Impact 2: better data locality, achieved at the cost of additional per-thread processing time. The smaller index leaves space for longer sub-references => fewer processing rounds.

  29. A better matching data structure
  [Figure: suffix tree vs. suffix array]
  Impact 3: lower post-processing overhead.

  30. Evaluation

  31. Evaluation setup
  • Testbed: low-end GeForce 9800 GX2 GPU (512MB); high-end Tesla C1060 (4GB)
  • Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
  • Success metrics: performance, energy consumption
  • Workloads: NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces)

  32. Speedup: array-based over tree-based

  33. Dissecting the overheads
  Significant reduction in data transfers and post-processing.
  [Workload: HS1, ~78M queries, ~238M reference length, on the GeForce]

  34. Comparing with CPU performance [baseline: single-core CPU performance]
  [Chart panels: suffix tree, suffix array, suffix tree]

  35. Summary
  • GPUs have drastically different performance characteristics
  • Reconsidering the choice of data structure is necessary when porting applications to the GPU
  • A well-matched data structure ensures:
    • Low communication overhead
    • Data locality (possibly achieved at the cost of additional per-thread processing time)
    • Low post-processing overhead

  36. Code, benchmarks and papers available at: netsyslab.ece.ubc.ca
