New Techniques for Programming GPU Clusters. Yifeng Chen, School of EECS, Peking University, China.

Presentation Transcript


  1. New Techniques for Programming GPU Clusters. Yifeng Chen, School of EECS, Peking University, China.

  2. Two Conflicting Approaches for Programmability in HPC
  • Top-down approach
    • Core programming model is high-level (e.g. a functional parallel language)
    • Must rely on heavy heuristic runtime optimization
    • Adds low-level program constructs to improve low-level control
    • Risks:
      • Programmers tend to avoid using the “extra” constructs.
      • Low-level controls do not fit well into the core model.
  • Bottom-up approach (PARRAY, PPoPP’12)
    • Core programming model exposes the memory hierarchy
    • Same algorithm, same performance, same intellectual challenge, but shorter code

  3. GPU Clusters. Tianhe: 1 GPU / 2 CPUs; Tsubame: 3 GPUs / 2 CPUs; Mole-8.5: 6 GPUs / 2 CPUs; PKU McClus: 2 GPUs / 1 CPU.

  4. Motivating Examples for PARRAY

  5. Basic Notation: Dimension Tree, Type Reference

  6. Thread Arrays

  7. Generating CUDA+Pthread
  #parray {pthd [2]} P
  #parray {paged float [2][[2048][4096]]} H
  #parray {dmem float # H_1} D
  #parray {[#P][#D]} G
  float* host; _pa_pthd* p;
  #mainhost{
    #create P(p)
    #create H(host)
    #detour P(p) {
      float* dev;
      INIT_GPU($tid$);
      #create D(dev)
      #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy P(p)
  }
  The generated code uses pthread_create, sem_post, sem_wait and pthread_join.
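  As a rough illustration only (not PARRAY's actual output), the #detour P(p) block above may expand into host code along the following lines: two Pthreads are created, each selects one GPU and copies its 2048 x 4096 slice of the paged host array. The names worker, arg_t and SLICE are placeholders, and the sem_post/sem_wait handshakes with the main thread are omitted.

    /* Hypothetical expansion sketch; identifiers are illustrative. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define SLICE (2048 * 4096)                 /* elements per GPU */

    typedef struct { int tid; float *host; } arg_t;

    static void *worker(void *p)                /* body of #detour P(p) */
    {
        arg_t *a = (arg_t *)p;
        float *dev;
        cudaSetDevice(a->tid);                                 /* INIT_GPU($tid$) */
        cudaMalloc((void **)&dev, SLICE * sizeof(float));      /* #create D(dev)  */
        cudaMemcpy(dev, a->host + (size_t)a->tid * SLICE,      /* DataTransfer    */
                   SLICE * sizeof(float), cudaMemcpyHostToDevice);
        /* ... kernel launches would go here ... */
        cudaFree(dev);
        return NULL;
    }

    int main(void)
    {
        float *host = (float *)malloc(2 * (size_t)SLICE * sizeof(float)); /* #create H(host) */
        pthread_t t[2];                                                   /* #create P(p)    */
        arg_t a[2] = { {0, host}, {1, host} };
        for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &a[i]);
        for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);             /* #destroy P(p)   */
        free(host);                                                       /* #destroy H(host)*/
        return 0;
    }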

  8. Generating MPI or IB/verbs
  #parray { mpi [2] } M
  #parray { paged float [2][[2048][4096]] } H
  #parray { [#M][#H_1] } G
  float* host; _pa_mpi* m;
  #mainhosts{
    #create M(m)
    #create H(host)
    #detour M(m) {
      float* dev;
      #create H_1(dev)
      #insert DataTransfer(dev, G, host, H){}
    }
    #destroy H(host)
    #destroy M(m)
  }
  The generated code uses MPI_Scatter.
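  By analogy, a minimal sketch of what the MPI path might generate (again with placeholder names, not the real generated identifiers): rank 0 holds the full 2 x (2048 x 4096) array and MPI_Scatter hands each of the two ranks its slice.

    #include <mpi.h>
    #include <stdlib.h>

    #define SLICE (2048 * 4096)                 /* elements per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                 /* #create M(m) */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *host = NULL;
        if (rank == 0)                          /* #create H(host) on the root only */
            host = (float *)malloc(2 * (size_t)SLICE * sizeof(float));

        float *local = (float *)malloc((size_t)SLICE * sizeof(float));   /* #create H_1(dev) */

        /* #insert DataTransfer(dev, G, host, H): one collective call */
        MPI_Scatter(host, SLICE, MPI_FLOAT,
                    local, SLICE, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(local);
        free(host);                             /* #destroy H(host); free(NULL) is a no-op */
        MPI_Finalize();                         /* #destroy M(m) */
        return 0;
    }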

  9. Other Communication Patterns: ALLTOALL, BCAST

  10. Generating Code for IB/verbs and YH Communication Layer
  • Semi-bypassing the MPI layer
  • Patching the Infiniband layer
  • Discontiguous RDMA communication pattern achieving zero-copy (see the sketch below)
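  The zero-copy point can be illustrated with a hedged verbs sketch (assuming the protection domain pd, queue pair qp and the peer's remote_addr/rkey were exchanged during setup; the function and parameter names are mine, not the YH layer's): the discontiguous strips are described by a scatter/gather list inside a single RDMA write, so no intermediate packing buffer is touched.

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Post one RDMA write that gathers `nstrips` discontiguous strips
       (each `strip` floats long, spaced `stride` floats apart, strip <= stride)
       straight from the registered host buffer: no packing step. */
    int rdma_write_strips(struct ibv_pd *pd, struct ibv_qp *qp,
                          float *buf, size_t stride, size_t strip, int nstrips,
                          uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf,
                                       stride * nstrips * sizeof(float),
                                       IBV_ACCESS_LOCAL_WRITE);
        struct ibv_sge sge[16];                 /* one entry per strip (<= device max_sge) */
        for (int i = 0; i < nstrips; ++i) {
            sge[i].addr   = (uintptr_t)(buf + i * stride);
            sge[i].length = (uint32_t)(strip * sizeof(float));
            sge[i].lkey   = mr->lkey;
        }
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof wr);
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = sge;
        wr.num_sge             = nstrips;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }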

  11. Large-Scale FFT in 20 Lines. Deeply optimized algorithm (ICS 2010); zero-copy for hmem.
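  Presumably the 20-line PARRAY program orchestrates the usual decomposed 3-D FFT: each GPU runs batched 1-D FFTs along the dimension that is currently local, then a global transpose (cf. the ALLTOALL pattern on slide 9) rotates the data before the next dimension is transformed. A minimal per-GPU sketch using cuFFT, with my own placeholder function name:

    #include <cufft.h>

    /* Batched 1-D FFTs along the locally stored dimension; a global
       transpose would follow before this is called for the next dimension. */
    void fft_local_dim(cufftComplex *d_data, int n, int batch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }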

  12. (Before Nov 2011)

  13. Direct Simulation of Turbulent Flows
  • Scale
    • Up to a 14336³ single-precision grid in 3-D
    • 12 distributed arrays, each holding 11 TB of data (128 TB total)
    • Entire Tianhe-1A with 7168 nodes
  • Progress
    • 4096³ completed
    • 8192³ half-way
    • 14336³ tested for performance
  • Software technologies
    • PARRAY code is only 300 lines.
    • Programming-level resilience technology for stable computation
  • Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.

  14. Generated Code

  15. Discussions
  • Other programming models?
    • MPI (more expressive datatypes)
    • OpenACC (optimization for coalescing accesses; see the sketch below)
    • PGAS (generating PGAS library calls)
    • IB/verbs (directly generating zero-copy IB calls)
  • We need a software stack!
  • Irregular structures must be encoded into arrays, which then benefit from PARRAY.
  • A runtime workflow is possible above PARRAY.
  • Generates Pthread + CUDA + MPI + macros (future support for FPGA and MIC is possible).
  • Macros are compiled out: no performance loss.
  • Typical training = 3 days; friendly to engineers…
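  For context on the OpenACC bullet, “coalescing accesses” means that adjacent threads touch adjacent addresses. A tiny CUDA illustration, not taken from the presentation:

    __global__ void scale_coalesced(float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread i -> element i */
        if (i < n) a[i] *= s;     /* adjacent threads read/write adjacent words */
    }
    /* Indexing a[(size_t)i * stride] with a large stride instead would break
       coalescing and sharply reduce effective memory bandwidth. */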
