
Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications


Presentation Transcript


  1. Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany

  2. Networked Systems Laboratory (NetSysLab), University of British Columbia
     [Photos: a golf course … a (nudist) beach … (and 199 days of rain each year)]

  3. Hybrid architectures in Top 500 [Nov’10]

  4. Hybrid architectures
     • High compute power / memory bandwidth
     • Energy efficient [operated today at low overall efficiency]
     Agenda for this talk:
     • GPU architecture intuition: what generates the above characteristics?
     • Progress on efficiently harnessing hybrid (GPU-based) architectures

  5.–11. [Slides building GPU architecture intuition.] Acknowledgement: slides borrowed from a presentation by Kayvon Fatahalian.

  12. Idea #3: Feed the cores with data
      The processing elements are data hungry! => Wide, high-throughput memory bus

  13. Idea #4: Hide memory access latency
      10,000x parallelism! => Hardware-supported multithreading

  14. The Resulting GPU Architecture
      [Block diagram: a host machine (CPU, host memory) connected over PCIe to the GPU; the GPU holds N multiprocessors, each with M cores, registers, shared memory, and an instruction unit, plus device-wide constant, texture, and global memory.]
      NVIDIA Tesla 2050:
      • 448 cores
      • Four 'memories':
        • Shared: fast (4 cycles), small (48KB)
        • Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
        • Texture: read-only
        • Constant: read-only
      • Hybrid: PCIe 16x, 4GB/s
      (A minimal kernel illustrating these ideas follows.)
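
A minimal CUDA sketch of ideas #3 and #4 and the memory hierarchy above (illustrative only; the kernel and sizes are not from the talk): the launch oversubscribes the 448 cores with about a million threads so the hardware scheduler can hide global-memory latency, and each block stages data through fast shared memory.

    #include <cuda_runtime.h>

    // Each block copies a 256-element tile from slow global memory into fast
    // shared memory, then computes on the fast copy. With 4096 blocks in
    // flight, stalled warps are swapped out while their loads are pending.
    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];                  // lives in the 48KB shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one global read per thread
        __syncthreads();                             // tile is now fully populated
        if (i < n) out[i] = 2.0f * tile[threadIdx.x];
    }

    int main() {
        int n = 1 << 20;                             // 1M elements => far more threads than cores
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }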

  15. GPU characteristics
      • High peak compute power
      • High peak memory bandwidth
      • High host-device communication overhead
      • Limited memory space
      • Complex to program (SIMD, co-processor model)

  16. Roadmap: Two Projects
      StoreGPU
      • Context: distributed storage systems
      • Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
      MummerGPU++
      • Context: porting a bioinformatics application (sequence alignment), a string-matching problem, data-intensive (10^2 GB)
      • Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

  17. Computationally Intensive Operations in Distributed (Storage) Systems
      These operations are computationally intensive and limit performance; each technique maps to an enabling operation:
      • Similarity detection (deduplication), content addressability, integrity checks, load balancing => Hashing (see the deduplication sketch below)
      • Redundancy => Erasure coding
      • Security => Encryption/decryption
      • Summary cache => Membership testing (Bloom filter)
      • Storage efficiency => Compression
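
To make the first mapping concrete, here is a minimal host-side sketch of hash-based deduplication (illustrative only; the names are made up, and std::hash stands in for the cryptographic hashes such systems use): a block is stored only if its digest has not been seen before.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Block = std::vector<uint8_t>;

    // Digest -> stored block. Identical blocks share one copy; this lookup is
    // what makes hashing throughput critical on the write path.
    struct DedupStore {
        std::unordered_map<size_t, Block> blocks;

        // Returns true if the block was new and actually stored.
        bool put(const Block &b) {
            size_t digest = std::hash<std::string>{}(std::string(b.begin(), b.end()));
            return blocks.emplace(digest, b).second;
        }
    };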

  18. Distributed Storage System Architecture (MosaStore, http://mosastore.net)
      [Architecture diagram: an application talks through the FS API to the client access module; files are divided into a stream of blocks (b1, b2, b3, ..., bn) that flow to storage nodes, with a metadata manager coordinating. Techniques to improve performance/reliability (redundancy, integrity checks, deduplication, security) rely on enabling operations (hashing, encryption/decryption, encoding/decoding, compression) that run on the CPU or on the GPU via an offloading layer.]

  19. GPU-accelerated deduplication
      • Design / prototype implementation that integrates similarity detection and GPU support
      • End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload

  20. Challenges
      [Diagram: files divided into a stream of blocks (b1 ... bn) feed similarity detection, which calls hashing through the offloading layer onto the GPU.]
      Integration challenges:
      • Minimizing the integration effort
      • Transparency
      • Separation of concerns
      Extracting major performance gains:
      • Hiding memory allocation overheads
      • Hiding data transfer overheads
      • Efficient utilization of the GPU memory units
      • Use of multi-GPU systems

  21. Hashing on GPUs
      HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems (hashing a stream of blocks; see the sketch of the pattern below).
      One performance data point: accelerates hashing by up to 5x compared to a single-core CPU.
      However, significant speedup is achieved only for large blocks (>16MB) => not suitable for efficient similarity detection.
      [1] Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08
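
A minimal CUDA sketch of the HashGPU pattern (illustrative, not the library's code; FNV-1a stands in for the MD5/SHA-style hashes it actually computes): each thread hashes one fixed-size data block, and the host pays the allocation and transfer costs that the next slide profiles.

    #include <cstdint>
    #include <cuda_runtime.h>

    // One thread per data block; FNV-1a is a stand-in hash.
    __global__ void hashBlocks(const uint8_t *data, uint64_t *digests,
                               int numBlocks, int blockSize) {
        int b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b >= numBlocks) return;
        const uint8_t *p = data + (size_t)b * blockSize;
        uint64_t h = 1469598103934665603ULL;         // FNV offset basis
        for (int i = 0; i < blockSize; ++i)
            h = (h ^ p[i]) * 1099511628211ULL;       // FNV prime
        digests[b] = h;
    }

    // Host side: allocate, copy in, launch, copy out. For small inputs these
    // setup steps dominate the runtime, consistent with the >16MB observation.
    void hashStream(const uint8_t *host, uint64_t *out, int numBlocks, int blockSize) {
        uint8_t *dData; uint64_t *dDigests;
        cudaMalloc(&dData, (size_t)numBlocks * blockSize);
        cudaMalloc(&dDigests, numBlocks * sizeof(uint64_t));
        cudaMemcpy(dData, host, (size_t)numBlocks * blockSize, cudaMemcpyHostToDevice);
        hashBlocks<<<(numBlocks + 255) / 256, 256>>>(dData, dDigests, numBlocks, blockSize);
        cudaMemcpy(out, dDigests, numBlocks * sizeof(uint64_t), cudaMemcpyDeviceToHost);
        cudaFree(dData); cudaFree(dDigests);
    }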

  22. Profiling HashGPU
      [Chart: execution time breakdown showing at least 75% overhead.]
      Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

  23. CrystalGPU
      CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations.
      [Stack diagram: files divided into a stream of blocks (b1 ... bn) => similarity detection => HashGPU => CrystalGPU => offloading layer => GPU.]
      One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.

  24. CrystalGPU Opportunities and Enablers
      • Opportunity: reusing GPU memory buffers. Enabler: a high-level memory manager.
      • Opportunity: overlapping communication and computation. Enabler: double buffering and asynchronous kernel launch (see the sketch after this list).
      • Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters). Enabler: a task queue manager.
      [Diagram: CrystalGPU sits between HashGPU and the offloading layer, providing the memory manager, task queue, and double buffering.]
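
A minimal CUDA sketch of the double-buffering enabler (illustrative; CrystalGPU's actual interface is not shown here, and process() is a stand-in kernel): two device buffers and two streams let the copy engine move batch i+1 while the GPU computes on batch i, and reusing the buffers across batches amortizes allocation cost.

    #include <cuda_runtime.h>

    __global__ void process(float *buf, int n) {      // stand-in for the real kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0f;
    }

    // `host` must be pinned memory (cudaHostAlloc) for copies to truly overlap.
    void processBatches(float *host, int numBatches, int batchLen) {
        float *dev[2]; cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {                 // buffers allocated once, reused
            cudaMalloc(&dev[i], batchLen * sizeof(float));
            cudaStreamCreate(&s[i]);
        }
        for (int b = 0; b < numBatches; ++b) {
            int cur = b & 1;                          // alternate buffer/stream per batch
            float *src = host + (size_t)b * batchLen;
            cudaMemcpyAsync(dev[cur], src, batchLen * sizeof(float),
                            cudaMemcpyHostToDevice, s[cur]);
            process<<<(batchLen + 255) / 256, 256, 0, s[cur]>>>(dev[cur], batchLen);
            cudaMemcpyAsync(src, dev[cur], batchLen * sizeof(float),
                            cudaMemcpyDeviceToHost, s[cur]);
        }
        cudaDeviceSynchronize();                      // drain both streams
        for (int i = 0; i < 2; ++i) { cudaFree(dev[i]); cudaStreamDestroy(s[i]); }
    }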

  25. HashGPU performance on top of CrystalGPU
      [Chart: speedup; baseline: single-core CPU.]
      The gains enabled by the three optimizations can be realized!

  26. End-to-end system evaluation

  27. End-to-End System Evaluation
      • Testbed: four storage nodes and one metadata server; one client with a 9800 GX2 GPU
      • Three configurations:
        • No similarity detection (without-SD)
        • Similarity detection on the CPU (4 cores @ 2.6GHz) (SD-CPU)
        • Similarity detection on the GPU (9800 GX2) (SD-GPU)
      • Three workloads:
        • Real checkpointing workload
        • Completely similar files: maximum gains in terms of data savings
        • Completely different files: only overheads, no gains
      • Success metrics:
        • System throughput
        • Impact on a competing application: compute- or I/O-intensive
      A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10

  28. System Throughput (Checkpointing Workload)
      [Chart: 1.8x improvement.]
      The integrated system preserves the throughput gains on a realistic workload!

  29. System Throughput (Synthetic Workload of Similar Files)
      [Chart: room for 2x improvement.]
      Offloading to the GPU enables close-to-optimal performance!

  30. Impact on a Competing (Compute-Intensive) Application
      [Chart: writing checkpoints back to back; 2x throughput improvement with only a 7% reduction for the competing application.]
      Frees resources (CPU) for competing applications while preserving throughput gains!

  31. Summary

  32. Distributed Storage System Architecture
      [Architecture diagram: application, client access module, metadata manager, storage nodes. MosaStore, http://mosastore.net]

  33. StoreGPU summary
      Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?
      [Diagram: files divided into a stream of blocks (b1 ... bn); techniques to improve performance/reliability (redundancy, integrity checks, deduplication, security) rely on enabling operations (hashing, encryption/decryption, encoding/decoding, compression) offloaded from the CPU to the GPU.]
      Results so far:
      • StoreGPU: a storage system prototype that offloads to the GPU
      • Evaluated the feasibility of GPU offloading and the impact on competing applications

  34. Roadmap: Two Projects (revisited)
      StoreGPU
      • Context: distributed storage systems
      • Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
      MummerGPU++
      • Context: porting a bioinformatics application (sequence alignment), a string-matching problem, data-intensive (10^2 GB)
      • Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

  35. Background: the Sequence Alignment Problem
      Problem: find where each query most likely originated from.
      • Queries: 10^8 queries, 10^1 to 10^2 symbols per query
      • Reference: 10^6 to 10^11 symbols (up to ~400GB)
      [Figure: query fragments (e.g., CCAT GGCT..., ...TAGGC TGCGC...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

  36. Sequence Alignment on GPUs
      • MUMmerGPU [Schatz 07, Trapnell 09]:
        • A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
        • Achieves good speedup compared to the CPU version
        • Based on a suffix tree
        • However, suffers from significant communication and post-processing overheads (> 50% overhead)
      • MUMmerGPU++ [Gharaibeh 10]:
        • Uses a space-efficient data structure (though from a higher computational complexity class): the suffix array
        • Achieves significant speedup compared to the suffix tree-based GPU implementation
      Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
      Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011.

  37. Speedup Evaluation
      [Chart: suffix array vs. suffix tree; over 60% improvement. Workload: human, ~10M queries, ~30M reference length.]

  38. Space/Time Trade-off Analysis

  39. GPU Offloading: Addressing the Challenges
      • Data-intensive problem and limited memory space => divide and compute in rounds; search-optimized data structures
      • Large output size => compressed output representation (decompress on the CPU)
      High-level algorithm (executed on the host):

          subrefs = DivideRef(ref)
          subqrysets = DivideQrys(qrys)
          foreach subqryset in subqrysets {
              results = NULL
              CopyToGPU(subqryset)
              foreach subref in subrefs {
                  CopyToGPU(subref)
                  MatchKernel(subqryset, subref)
                  CopyFromGPU(results)
              }
              Decompress(results)
          }

  40. The core data structure
      A massive number of queries and a long reference => pre-process the reference into an index.
      Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
      • Search: O(qry_len) per query
      • Space: O(ref_len), but the constant is high: ~20 x ref_len
      • Post-processing: DFS traversal per query, O(4^(qry_len - min_match_len))

  41. The core data structure (cont.)
      Mapping the suffix tree's costs onto the stages of the high-level algorithm:

          subrefs = DivideRef(ref)
          subqrysets = DivideQrys(qrys)
          foreach subqryset in subqrysets {
              results = NULL
              CopyToGPU(subqryset)
              foreach subref in subrefs {
                  CopyToGPU(subref)                 // expensive: space is ~20 x ref_len
                  MatchKernel(subqryset, subref)    // efficient: O(qry_len) per query
                  CopyFromGPU(results)
              }
              Decompress(results)                   // expensive: O(4^(qry_len - min_match_len)) per query
          }

  42. A better matching data structure?
      [Chart: suffix tree vs. suffix array; less data to transfer for the compute stage.]
      Impact 1: reduced communication.

  43. A better matching data structure
      [Chart: the suffix array leaves space for longer sub-references => fewer processing rounds.]
      Impact 2: better data locality, achieved at the cost of additional per-thread processing time.

  44. A better matching data structure
      [Chart: suffix tree vs. suffix array post-processing.]
      Impact 3: lower post-processing overhead.
      (A sketch of suffix-array matching follows.)
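
To see where the space/time trade-off comes from, here is a minimal host-side sketch of suffix-array matching (illustrative only, not MUMmerGPU++ code): the index is a single integer per reference position, which is what shrinks transfers and rounds, while lookup becomes a binary search costing O(qry_len x log ref_len) per query instead of the tree's O(qry_len).

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <vector>

    // Suffix array: reference positions sorted by the suffix starting there.
    // One integer per symbol, far below the suffix tree's ~20 x ref_len.
    // (Naive O(n^2 log n) worst-case construction; fine for a sketch.)
    std::vector<int> buildSuffixArray(const std::string &ref) {
        std::vector<int> sa(ref.size());
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Binary search for one occurrence of qry: the extra per-thread work
    // (log ref_len comparisons) traded for the smaller index.
    int find(const std::string &ref, const std::vector<int> &sa,
             const std::string &qry) {
        auto it = std::lower_bound(sa.begin(), sa.end(), qry,
            [&](int pos, const std::string &q) {
                return ref.compare(pos, q.size(), q) < 0;
            });
        if (it != sa.end() && ref.compare(*it, qry.size(), qry) == 0) return *it;
        return -1;    // query does not occur in the reference
    }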

  45. Evaluation

  46. Evaluation setup
      • Testbed:
        • Low-end GeForce 9800 GX2 GPU (512MB)
        • High-end Tesla C1060 (4GB)
      • Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
      • Success metrics: performance, energy consumption
      • Workloads: NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces

  47. Speedup: array-based over tree-based

  48. Dissecting the overheads
      [Chart: time breakdown. Workload: HS1, ~78M queries, ~238M reference length, on the GeForce.]
      Consequences:
      • Focus shifts to optimizing the compute stage
      • Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)

  49. MummerGPU++ Summary
      Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
      • The choice of an appropriate data structure can be crucial when porting applications to the GPU
      • A good matching data structure ensures:
        • Low communication overhead
        • Data locality (can be achieved at the cost of additional per-thread processing time)
        • Low post-processing overhead

  50. StoreGPU and MummerGPU++: closing thoughts
      StoreGPU motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
      MummerGPU++ motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
      Hybrid platforms will gain wider adoption. Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective.
