
Evaluating Coprocessor Effectiveness for the Data Assimilation Research Testbed






Presentation Transcript


  1. Evaluating Coprocessor Effectiveness for the Data Assimilation Research Testbed. Ye Feng, IMAGe/DAReS SIParCS, University of Wyoming

  2. Introduction • Task: evaluate the feasibility and effectiveness of coprocessors for DART. • Target: get_close_obs (profiling result: computationally intensive and executed many times during a typical DART run). • Coprocessor: NVIDIA GPUs with CUDA Fortran. • Result: the parallel version of the exhaustive search on the GPU is faster.

  3. Problem. Calculate the horizontal distances between the base location and the observation locations.
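The slides do not show which distance formula DART uses, so as an illustrative stand-in, a great-circle (haversine) distance on the unit sphere — a common choice for horizontal distances between geographic locations — can be sketched in Python (the original code is CUDA Fortran; this is only an algorithm sketch, and the function name and (lon, lat) convention are assumptions):

```python
from math import radians, sin, cos, asin, sqrt

def horizontal_distance(base, obs):
    """Great-circle distance in radians on the unit sphere between a base
    location and one observation, each given as (lon, lat) in degrees.
    Haversine is a stand-in here; the exact metric DART uses is not
    shown on the slides."""
    lon1, lat1 = map(radians, base)
    lon2, lat2 = map(radians, obs)
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a))
```

On the GPU, each thread would evaluate this independently for one observation, which is why this step parallelizes trivially.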

  4. Find the close observations: those within a distance maxdist of the base location.

  5. Outputs: cdist (the distances of the close observations) and cclose_ind (their indices). Easy! Or is it?

  6. Data Dependency. It is easy on the CPU, but the GPU doesn't work this way! • Problems with data dependency usually don't scale well on the GPU. • cnum_close depends on its previous value. • cclose_ind and cdist both depend on cnum_close.

  7. GPU Scan. Compute diff = dist - maxdist and take its most significant (sign) bit: 1 if dist < maxdist (close), 0 if dist > maxdist (not close). A prefix sum (psum) over these bits yields cnum_close.
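The flag-plus-prefix-sum step can be sketched as follows (a serial Python illustration of the idea; a real GPU scan runs in O(log n) parallel steps, but produces the same result):

```python
def close_flags_and_psum(dist, maxdist):
    """Each 'thread' independently flags its element (1 = close, 0 = not
    close), then an inclusive prefix sum over the flags gives, at each
    position, how many close observations occur up to and including it.
    The last entry of the prefix sum is cnum_close."""
    diff = [1 if d < maxdist else 0 for d in dist]
    psum, running = [], 0
    for f in diff:
        running += f
        psum.append(running)
    return diff, psum
```

The key point is that the flagging is fully independent per element, and the scan is the one well-studied collective operation needed to remove the running-count dependency.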

  8. GPU Scan. (Figure: the flags diff computed from dist are scanned into the prefix sum psum, giving Diff_sum, which locates each entry of cdist.)

  9. Extract. What we have: Diff_sum, the thread IDs, and cdist with gaps. What we want: compact cclose_ind and cdist. How can we independently eliminate the zeros and extract the indices?

  10. Solution? If diff /= 0, then cclose_ind = thread ID; if diff = 0, then throw it away.

  11. If diff /= 0, then cclose_ind = thread ID; if diff = 0, then throw it away. Because Diff_sum already tells each thread where its result goes, this needs no branching!

  12. Solution! (Figure: using Diff_sum, each thread ID scatters its entry into the compact cclose_ind and cdist arrays; the last Diff_sum entry is cnum_close.)
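Putting the pieces together, the extraction is a stream compaction: the scan result gives every flagged thread its own write position, so all writes are independent. A serial Python sketch of the parallel idea (names follow the slides; the masked `if` stands in for a predicated GPU write):

```python
def extract(dist, maxdist):
    """Stream compaction: Diff_sum[tid] tells each flagged 'thread'
    exactly where to scatter its index and distance, so no thread waits
    on a running count."""
    diff = [1 if d < maxdist else 0 for d in dist]
    diff_sum, running = [], 0
    for f in diff:
        running += f
        diff_sum.append(running)
    cnum_close = diff_sum[-1] if diff_sum else 0
    cclose_ind = [0] * cnum_close
    cdist = [0.0] * cnum_close
    for tid in range(len(dist)):        # each iteration plays one thread
        if diff[tid]:                   # on a GPU: a masked/predicated write
            cclose_ind[diff_sum[tid] - 1] = tid
            cdist[diff_sum[tid] - 1] = dist[tid]
    return cclose_ind, cdist, cnum_close
```

Note that the output matches the sequential CPU filter exactly, yet every scatter position is known up front, which is what makes the GPU version scale.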

  13. Device Functions: • gpu_dist • gpu_scan — si: the number of iterations performed in this kernel. • extract — sn: the number of gpu_scan blocks that each extract block in this kernel handles. (Figure example: si = 2 with 8 threads/block gives 16 elements/block, so the dist array is split into Block 1 and Block 2; extract consumes the gpu_scan results with sn = 4.)
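The block structure behind si and sn can be illustrated with a two-level scan sketch: per-block prefix sums whose block totals are carried across blocks, so the extract step sees a globally correct Diff_sum. This is a serial Python illustration under the assumption that one "block" scans a fixed number of elements; the exact kernel decomposition in the CUDA Fortran code is not shown on the slides:

```python
def blocked_scan(flags, elems_per_block):
    """Per-block inclusive prefix sums with block totals carried forward,
    so the result equals a single global inclusive scan. elems_per_block
    plays the role of threads/block * si from the slide; tuning it (and
    sn on the extract side) changes how much work each block does."""
    n = len(flags)
    psum = [0] * n
    carry = 0                       # sum of all previous blocks
    for start in range(0, n, elems_per_block):
        running = 0
        for i in range(start, min(start + elems_per_block, n)):
            running += flags[i]
            psum[i] = carry + running
        carry += running
    return psum
```

Whatever the block size, the final array is the same; only the work distribution changes, which is why si and sn are performance-tuning knobs rather than correctness parameters.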

  14. Conclusion • CUDA Fortran on the GPU gave significant speedup over the CPU (10x+). • Step outside the box: redesign the algorithm. • To get good performance, si and sn need to be tuned. • Be careful when using device memory. • There is still room to improve the performance of this project.

  15. Acknowledgements. DAReS/IMAGe: Helen Kershaw (mentor), Nancy Collins (mentor), Jeff Anderson, Tim Hoar, Kevin Raeder. UCAR, NCAR, University of Wyoming: Kristin Mooney, Silvia Gentile, Carolyn Mueller, Richard Loft, Raghu Raj Prasanna Kumar.
