
Why use HPC with R? Accelerating mKrig & Krig


Presentation Transcript


  1. Isaac Lyngaas (irlyngaas@gmail.com) John Paige (paigejo@gmail.com) Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu) SIParCS, July 31, 2014

  2. Why use HPC with R? • Accelerating mKrig & Krig • Parallel Cholesky • Software Packages • Parallel Eigen Decomposition • Conclusions & Future Work

  3. Accelerate the ‘fields’ Krig and mKrig functions (a minimal usage example follows below) • Survey of parallel linear algebra software • Multicore (Shared Memory) • GPU • Xeon Phi
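As a reminder of the user-facing calls being accelerated, here is a minimal fields sketch; the toy data and the parameter values (theta, lambda) are arbitrary choices for illustration, not values from the project.

library(fields)   # CRAN package providing Krig() and mKrig()

# Toy spatial data set: 200 locations in 2D with noisy observations
set.seed(1)
x <- matrix(runif(400), ncol = 2)
y <- sin(4 * x[, 1]) + cos(4 * x[, 2]) + rnorm(200, sd = 0.1)

fit.krig  <- Krig(x, y, theta = 0.2)                  # internally relies on an eigendecomposition
fit.mkrig <- mKrig(x, y, theta = 0.2, lambda = 0.01)  # internally relies on a Cholesky decomposition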

  4. Many developers & users in the field of Statistics • Readily available code base • Problem: R is slow for large problem sizes

  5. Bottleneck in Linear Algebra operations • mKrig – Cholesky Decomposition • Krig – Eigen Decomposition • R uses sequential algorithms • Strategy: Use C-interoperable libraries to parallelize linear algebra • C functions callable through the R environment (see the sketch below)
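A minimal sketch of the wrapper approach, assuming a C wrapper around a parallel Cholesky routine has been compiled into a shared library; the names plasma_chol.so and plasma_chol_wrapper are hypothetical placeholders for illustration, not the project's actual interface.

# Hypothetical: call a C wrapper (built against a parallel library such as PLASMA) via .C()
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)   # symmetric positive definite test matrix

dyn.load("plasma_chol.so")                 # hypothetical shared library
out <- .C("plasma_chol_wrapper",           # hypothetical C entry point
          A    = as.double(A),             # matrix passed as a length n*n double vector
          n    = as.integer(n),
          info = integer(1))               # LAPACK-style status flag
L <- matrix(out$A, n, n)                   # wrapper is assumed to overwrite A with the factor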

  6. Symmetric positive definite -> triangular factor • A = LL^T • Nice properties for determinant calculation (see the sketch below)
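The determinant property is what makes the Cholesky factor convenient here: det(A) = prod(diag(L))^2, so log det(A) = 2 * sum(log(diag(L))). A quick check in base R (serial, but the identity is the same for any Cholesky backend):

n <- 500
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)   # symmetric positive definite

U <- chol(A)                          # base R returns the upper-triangular factor, A = t(U) %*% U
logdet <- 2 * sum(log(diag(U)))       # log-determinant read off the factor's diagonal

all.equal(logdet, as.numeric(determinant(A)$modulus))  # TRUE (up to rounding)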

  7. PLASMA (Multicore Shared Memory) • http://icl.cs.utk.edu/plasma/ • MAGMA (GPU & Xeon Phi) • http://icl.cs.utk.edu/magma/ • CULA (GPU) • http://www.culatools.com/

  8. Multicore (Shared Memory) • Block Scheduling • Determines what operations should be done on which core • Block Size optimization • Dependent on Cache Memory (a tuning sketch follows below)
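The optimal block size reported on the next slides can be found empirically. A sketch of such a sweep, assuming the hypothetical wrapper from slide 5 also accepts a block-size argument nb that it forwards to the library's tuning parameter (the wrapper name and its nb argument are assumptions for illustration only):

# Hypothetical block-size sweep; 'plasma_chol_wrapper' and its 'nb' argument are placeholders
n <- 5000
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)

for (nb in c(128, 256, 384, 512)) {
  elapsed <- system.time(
    .C("plasma_chol_wrapper",
       A = as.double(A), n = as.integer(n),
       nb = as.integer(nb),            # candidate tile/block size
       info = integer(1))
  )["elapsed"]
  cat("block size", nb, ":", elapsed, "sec\n")
}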

  9. PLASMA using 1 node (# of observations = 25,000) • [Figure] Speedup vs. 1 core (0–15) plotted against # of cores (1, 2, 4, 8, 12, 16), with the measured speedup compared to the optimal speedup

  10. PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, cores = 16) • [Figure] Time (sec, 3–7) vs. block size (500–1500), with 256 KB and 40 MB cache sizes annotated

  11. PLASMA optimal block sizes (cores = 16) • [Figure] Optimal block size (0–600) vs. # of observations (10,000–40,000)

  12. Utilizes GPUs or Xeon Phi for parallelization • Multiple GPU & multiple Xeon Phi implementations available • 1 CPU core drives 1 GPU • Block Scheduling • Similar to PLASMA • Block Size dependent on accelerator architecture

  13. CUDA-based proprietary linear algebra package • Capable of doing LAPACK operations using 1 GPU • API written in C • Dense & sparse operations available

  14. 1 node of Caldera or Pronghorn • 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node • 64 GB RAM (~59 GB available) • Cache per core: L1 = 32 KB, L2 = 256 KB • Cache per socket: L3 = 20 MB • 2 x Nvidia Tesla M2070Q GPU (Caldera) • ~5.2 GB RAM per device • 1 core drives 1 GPU • 2 x Xeon Phi 5110P (Pronghorn) • ~7.4 GB RAM per device

  15. Accelerated hardware has room for improvement • [Figure] GFLOP/sec (0–400) vs. # of observations (up to 40,000) for PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, and CULA • Serial R: ~3 GFLOP/sec • Theoretical peak performance • 16-core Xeon Sandy Bridge: ~333 GFLOP/sec • 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec • 1 Xeon Phi 5110P: ~1,011 GFLOP/sec

  16. All parallel Cholesky implementations are faster than serial R • [Figure] Time (sec, 0.01–1000, log scale) vs. # of observations (up to 40,000) for serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, and MAGMA 2 Xeon Phis • >100x speedup over serial R when # of observations = 10k

  17. Eigendecomposition also faster on accelerated hardware • [Figure] Time (sec, 0–300) vs. # of observations (up to 10,000) for serial R, CULA, MAGMA 1 GPU, and MAGMA 2 GPUs • ~6x speedup over serial R when # of observations = 10k

  18. Can run ~30 Cholesky decompositions per eigendecomposition • [Figure] Ratio of eigendecomposition time to Cholesky time (0–30) vs. # of observations (up to 10,000) • Both times taken using MAGMA w/ 2 GPUs

  19. Parallel Cholesky beats parallel R for moderate to large matrices • [Figure] Speedup vs. parallel R (0–25) plotted against # of observations (up to 20,000) for PLASMA and MAGMA 2 GPUs • If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when the speedup is > 16

  20. Using Caldera • Single Cholesky decomposition • Matrix size < 20k: use PLASMA (16 cores w/ optimal block size) • Matrix size 20k – 35k: use MAGMA w/ 2 GPUs • Matrix size > 35k: use PLASMA (16 cores w/ optimal block size) • Dependent on computing resources available (a dispatch sketch follows below)
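A sketch of how the guidance above could be encoded as a dispatcher in R, assuming hypothetical wrapper functions plasma_chol() and magma_chol_2gpu() exist; the function names are placeholders, and the thresholds follow this slide for one node of Caldera.

# Hypothetical dispatcher for a single Cholesky decomposition on Caldera
chol_dispatch <- function(A) {
  n <- nrow(A)
  if (n >= 20000 && n <= 35000) {
    magma_chol_2gpu(A)   # 20k–35k: MAGMA w/ 2 GPUs
  } else {
    plasma_chol(A)       # < 20k or > 35k: PLASMA, 16 cores w/ optimal block size
  }
}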

  21. Explored implementation on accelerated hardware • GPUs • Multicore (Shared Memory) • Xeon Phis • Installed third-party linear algebra packages & programmed wrappers that call these packages from R • Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani • Future Work • Multicore Distributed Memory • Single Precision

  22. Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014. URL: http://CRAN.R-project.org/package=fields. R package version 7.1. • Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009. • Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010. • Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.

  23. [Figure] Task DAG for the tile Cholesky factorization: an xPOTRF on a diagonal tile, xTRSM updates of the tiles below it, then xSYRK and xGEMM updates of the trailing submatrix, repeated across tiles 0–3 down to a final xPOTRF • http://www.netlib.org/lapack/lawnspdf/lawn223.pdf
