
Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance


Presentation Transcript


  1. Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
     Abdullah Gharaibeh, Matei Ripeanu
     NetSysLab, The University of British Columbia

  2. GPUs offer different characteristics
     • High peak compute power, but high communication overhead
     • High peak memory bandwidth, but limited memory space
     Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms

  3. Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
     • Context: a bioinformatics problem, sequence alignment
       • A string matching problem
       • Data intensive (~10^2 GB)

  4. Past work: sequence alignment on GPUs
     MUMmerGPU [Schatz 07, Trapnell 09]:
     • A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
     • ~4x speedup (end-to-end) compared to the CPU version
     • > 50% of the runtime is overhead
     Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

  5. Idea: trade off time for space
     • Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
     • 4x speedup compared to the suffix tree-based GPU version, with a significant overhead reduction
     Consequences:
     • Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
     • Focus shifts toward optimizing the compute stage

  6. Outline
     • Sequence alignment: background and offloading to GPU
     • Space/Time trade-off analysis
     • Evaluation

  7. Background: the sequence alignment problem
     Find where each query most likely originated from.
     • Queries: 10^8 queries, 10^1 to 10^2 symbols per query
     • Reference: 10^6 to 10^11 symbols
     [Figure: short query fragments (e.g., ...TAGGC, ...GGCTA ATGCG...) aligned against positions in a long reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

  8. GPU Offloading: opportunity and challenges
     Opportunity:
     • Sequence alignment is easy to partition and memory intensive
     • The GPU is massively parallel with high memory bandwidth
     Challenges:
     • Data intensive
     • Large output size
     • Limited memory space
     • No direct access to other I/O devices (e.g., disk)

  9. GPU Offloading: addressing the challenges
     • Data intensive problem and limited memory space => divide and compute in rounds
     • Large output size => compressed output representation (decompress on the CPU)
     High-level algorithm (executed on the host):
       subrefs = DivideRef(ref)
       subqrysets = DivideQrys(qrys)
       foreach subqryset in subqrysets {
         results = NULL
         CopyToGPU(subqryset)
         foreach subref in subrefs {
           CopyToGPU(subref)
           MatchKernel(subqryset, subref)
           CopyFromGPU(results)
         }
         Decompress(results)
       }
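     To make the round structure concrete, the sketch below writes the same divide-and-compute-in-rounds loop against the CUDA runtime API. All names (match_kernel, the chunk sizes, the fixed-size result buffer) are illustrative assumptions, not MUMmerGPU's actual code; the point is only the per-round copy / kernel / copy-back pattern.

       // Sketch of the round-based offload loop (hypothetical names and sizes).
       #include <cuda_runtime.h>
       #include <cstdio>
       #include <vector>

       // Placeholder kernel: one thread per query; the real matching logic is omitted.
       __global__ void match_kernel(const char *sub_ref, size_t ref_len,
                                    const char *sub_qrys, size_t qrys_len,
                                    int *results, int n_queries) {
         int q = blockIdx.x * blockDim.x + threadIdx.x;
         if (q < n_queries)
           results[q] = -1;   // "no match" sentinel; a real kernel writes match records
       }

       int main() {
         // Chunks that DivideRef()/DivideQrys() would produce on the host.
         std::vector<std::vector<char>> sub_refs     = {std::vector<char>(1 << 20, 'A')};
         std::vector<std::vector<char>> sub_qry_sets = {std::vector<char>(1 << 16, 'A')};
         const int n_queries = 1024;                    // queries per chunk (assumed)

         char *d_ref, *d_qrys; int *d_results;
         cudaMalloc(&d_ref,  sub_refs[0].size());
         cudaMalloc(&d_qrys, sub_qry_sets[0].size());
         cudaMalloc(&d_results, n_queries * sizeof(int));
         std::vector<int> results(n_queries);

         for (auto &qrys : sub_qry_sets) {              // outer round: query chunks
           cudaMemcpy(d_qrys, qrys.data(), qrys.size(), cudaMemcpyHostToDevice);
           for (auto &ref : sub_refs) {                 // inner round: reference chunks
             cudaMemcpy(d_ref, ref.data(), ref.size(), cudaMemcpyHostToDevice);
             match_kernel<<<(n_queries + 255) / 256, 256>>>(
                 d_ref, ref.size(), d_qrys, qrys.size(), d_results, n_queries);
             cudaMemcpy(results.data(), d_results, n_queries * sizeof(int),
                        cudaMemcpyDeviceToHost);        // also synchronizes with the kernel
           }
           // Decompress(results) runs here on the CPU in the real pipeline.
         }
         cudaFree(d_ref); cudaFree(d_qrys); cudaFree(d_results);
         printf("done\n");
         return 0;
       }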

  10. Space/Time Trade-off Analysis

  11. The core data structure
     Massive number of queries and a long reference => pre-process the reference into an index
     Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
     • Search: O(qry_len) per query
     • Space: O(ref_len), but the constant is high: ~20x ref_len
     • Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

  12. The core data structure (cont.)
     The same suffix-tree costs (MUMmerGPU [Schatz 07]) mapped onto the stages of the slide 9 algorithm:
     • Search: O(qry_len) per query => efficient
     • Space: O(ref_len), with a high ~20x ref_len constant => expensive
     • Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query => expensive
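     The ~20x constant is easiest to see from per-element footprints. The layouts below are illustrative only (assumed, not MUMmerGPU's actual node format): a suffix-tree node carries child pointers plus edge and link bookkeeping for each of the O(ref_len) nodes, while a suffix array stores a single 32-bit offset per reference symbol.

       // Illustrative per-element footprints (assumed layouts, not MUMmerGPU's format).
       #include <cstdint>
       #include <cstdio>

       struct SuffixTreeNode {            // one of O(ref_len) nodes in the tree
         uint32_t children[4];            // child node index per symbol {A, C, G, T}
         uint32_t suffix_link;            // followed during construction/matching
         uint32_t edge_start;             // edge label = ref[edge_start, edge_end)
         uint32_t edge_end;
       };

       typedef uint32_t SuffixArrayEntry; // one sorted suffix offset per reference symbol

       int main() {
         printf("suffix-tree node:   %zu bytes\n", sizeof(SuffixTreeNode));    // 28 bytes
         printf("suffix-array entry: %zu bytes\n", sizeof(SuffixArrayEntry));  //  4 bytes
         return 0;
       }

     With one entry per suffix, the array costs roughly 4 bytes per reference symbol (plus the reference text itself), which is what buys the shorter transfers and longer sub-references on the next slides.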

  13. A better matching data structure: suffix tree => suffix array
     Impact 1: reduced communication (less data to transfer)

  14. A better matching data structure: suffix tree => suffix array
     Impact 2: better data locality, achieved at the cost of additional per-thread processing time
     (space for longer sub-references => fewer processing rounds)

  15. A better matching data structure: suffix tree => suffix array
     Impact 3: lower post-processing overhead
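     The time side of the trade shows up in the matching kernel. The sketch below is a minimal, assumed implementation (names, data layout, and the one-match-per-query simplification are not from the paper): each GPU thread binary-searches the suffix array for one query, costing O(qry_len * log ref_len) per query instead of the tree's O(qry_len) walk, in exchange for the much smaller index.

       // Sketch: one thread per query, exact-prefix match via suffix-array binary search.
       #include <cuda_runtime.h>
       #include <algorithm>
       #include <cstdio>
       #include <cstring>
       #include <vector>

       __device__ int cmp_suffix(const char *ref, int ref_len, int suf,
                                 const char *qry, int qry_len) {
         for (int i = 0; i < qry_len; ++i) {
           if (suf + i >= ref_len) return -1;   // suffix ended: it sorts before the query
           if (ref[suf + i] < qry[i]) return -1;
           if (ref[suf + i] > qry[i]) return 1;
         }
         return 0;                              // query is a prefix of this suffix
       }

       __global__ void sa_match_kernel(const char *ref, int ref_len, const int *sa,
                                       const char *qrys, const int *qry_off,
                                       const int *qry_len, int n_queries, int *match_pos) {
         int q = blockIdx.x * blockDim.x + threadIdx.x;
         if (q >= n_queries) return;
         const char *qry = qrys + qry_off[q];
         int lo = 0, hi = ref_len - 1, found = -1;
         while (lo <= hi) {                     // O(qry_len * log ref_len) per query
           int mid = lo + (hi - lo) / 2;
           int c = cmp_suffix(ref, ref_len, sa[mid], qry, qry_len[q]);
           if (c == 0) { found = sa[mid]; break; }   // a real kernel reports all matches
           if (c < 0) lo = mid + 1; else hi = mid - 1;
         }
         match_pos[q] = found;
       }

       int main() {
         const char ref[] = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG";  // slide 7's example
         const int ref_len = (int)strlen(ref);
         std::vector<int> sa(ref_len);
         for (int i = 0; i < ref_len; ++i) sa[i] = i;
         std::sort(sa.begin(), sa.end(), [&](int a, int b) {          // naive host-side build
           return strcmp(ref + a, ref + b) < 0;
         });

         const char qrys[] = "TAGGCCAATT";                            // two queries, concatenated
         int qry_off[] = {0, 5}, qry_len[] = {5, 5}, n_queries = 2;

         char *d_ref, *d_qrys; int *d_sa, *d_off, *d_len, *d_pos;
         cudaMalloc(&d_ref, ref_len);              cudaMemcpy(d_ref, ref, ref_len, cudaMemcpyHostToDevice);
         cudaMalloc(&d_sa, ref_len * sizeof(int)); cudaMemcpy(d_sa, sa.data(), ref_len * sizeof(int), cudaMemcpyHostToDevice);
         cudaMalloc(&d_qrys, sizeof(qrys));        cudaMemcpy(d_qrys, qrys, sizeof(qrys), cudaMemcpyHostToDevice);
         cudaMalloc(&d_off, sizeof(qry_off));      cudaMemcpy(d_off, qry_off, sizeof(qry_off), cudaMemcpyHostToDevice);
         cudaMalloc(&d_len, sizeof(qry_len));      cudaMemcpy(d_len, qry_len, sizeof(qry_len), cudaMemcpyHostToDevice);
         cudaMalloc(&d_pos, n_queries * sizeof(int));

         sa_match_kernel<<<1, 32>>>(d_ref, ref_len, d_sa, d_qrys, d_off, d_len, n_queries, d_pos);
         int pos[2];
         cudaMemcpy(pos, d_pos, sizeof(pos), cudaMemcpyDeviceToHost);
         for (int q = 0; q < n_queries; ++q)
           printf("query %d matches reference at offset %d\n", q, pos[q]);
         return 0;
       }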

  16. Evaluation

  17. Evaluation setup
     • Testbed
       • Low-end GeForce 9800 GX2 GPU (512 MB)
       • High-end Tesla C1060 (4 GB)
     • Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
     • Success metrics
       • Performance
       • Energy consumption
     • Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

  18. Speedup: array-based over tree-based

  19. Dissecting the overheads
     Significant reduction in data transfers and post-processing.
     Workload: HS1, ~78M queries, ~238M reference length, on the GeForce

  20. Summary
     • GPUs have drastically different performance characteristics
     • Reconsidering the choice of data structure is necessary when porting applications to the GPU
     • A good matching data structure ensures:
       • Low communication overhead
       • Data locality, possibly achieved at the cost of additional per-thread processing time
       • Low post-processing overhead

  21. Code available at: netsyslab.ece.ubc.ca
