
Why use HPC with R? Accelerating mKrig & Krig


Presentation Transcript


  1. Isaac Lyngaas (irlyngaas@gmail.com) John Paige (paigejo@gmail.com) Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu) SIParCS, July 31, 2014

  2. Why use HPC with R? • Accelerating mKrig & Krig • Parallel Cholesky • Software Packages • Parallel Eigen Decomposition • Conclusions & Future Work

  3. Accelerate the ‘fields’ Krig and mKrig functions (a minimal usage example follows below) • Survey of parallel linear algebra software • Multicore (Shared Memory) • GPU • Xeon Phi
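As a reminder of the user-facing calls being accelerated, here is a minimal fields sketch; the toy data and the parameter values (theta, lambda) are arbitrary choices for illustration, not values from the project.

library(fields)   # CRAN package providing Krig() and mKrig()

# Toy spatial data set: 200 locations in 2D with noisy observations
set.seed(1)
x <- matrix(runif(400), ncol = 2)
y <- sin(4 * x[, 1]) + cos(4 * x[, 2]) + rnorm(200, sd = 0.1)

fit.krig  <- Krig(x, y, theta = 0.2)                  # internally relies on an eigendecomposition
fit.mkrig <- mKrig(x, y, theta = 0.2, lambda = 0.01)  # internally relies on a Cholesky decomposition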

  4. Many developers & users in the field of Statistics • Readily available code base • Problem: R is slow for large problem sizes

  5. Bottleneck in Linear Algebra operations • mKrig – Cholesky Decomposition • Krig – Eigen Decomposition • R uses sequential algorithms • Strategy: Use C-interoperable libraries to parallelize linear algebra • C functions callable through the R environment (see the sketch below)
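A minimal sketch of the wrapper approach, assuming a C wrapper around a parallel Cholesky routine has been compiled into a shared library; the names plasma_chol.so and plasma_chol_wrapper are hypothetical placeholders for illustration, not the project's actual interface.

# Hypothetical: call a C wrapper (built against a parallel library such as PLASMA) via .C()
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)   # symmetric positive definite test matrix

dyn.load("plasma_chol.so")                 # hypothetical shared library
out <- .C("plasma_chol_wrapper",           # hypothetical C entry point
          A    = as.double(A),             # matrix passed as a length n*n double vector
          n    = as.integer(n),
          info = integer(1))               # LAPACK-style status flag
L <- matrix(out$A, n, n)                   # wrapper is assumed to overwrite A with the factor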

  6. Symmetric positive definite -> triangular factor • A = LL^T • Nice properties for determinant calculation (see the sketch below)
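The determinant property is what makes the Cholesky factor convenient here: det(A) = prod(diag(L))^2, so log det(A) = 2 * sum(log(diag(L))). A quick check in base R (serial, but the identity is the same for any Cholesky backend):

n <- 500
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)   # symmetric positive definite

U <- chol(A)                          # base R returns the upper-triangular factor, A = t(U) %*% U
logdet <- 2 * sum(log(diag(U)))       # log-determinant read off the factor's diagonal

all.equal(logdet, as.numeric(determinant(A)$modulus))  # TRUE (up to rounding)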

  7. PLASMA (Multicore Shared Memory) • http://icl.cs.utk.edu/plasma/ • MAGMA (GPU & Xeon Phi) • http://icl.cs.utk.edu/magma/ • CULA (GPU) • http://www.culatools.com/

  8. Multicore (Shared Memory) • Block Scheduling • Determines what operations should be done on which core • Block Size optimization • Dependent on Cache Memory (a tuning sketch follows below)
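The optimal block size reported on the next slides can be found empirically. A sketch of such a sweep, assuming the hypothetical wrapper from slide 5 also accepts a block-size argument nb that it forwards to the library's tuning parameter (the wrapper name and its nb argument are assumptions for illustration only):

# Hypothetical block-size sweep; 'plasma_chol_wrapper' and its 'nb' argument are placeholders
n <- 5000
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)

for (nb in c(128, 256, 384, 512)) {
  elapsed <- system.time(
    .C("plasma_chol_wrapper",
       A = as.double(A), n = as.integer(n),
       nb = as.integer(nb),            # candidate tile/block size
       info = integer(1))
  )["elapsed"]
  cat("block size", nb, ":", elapsed, "sec\n")
}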

  9. PLASMA using 1 node (# of observations = 25,000) • [Figure] Speedup vs. 1 core (0–15) plotted against # of cores (1, 2, 4, 8, 12, 16), with the measured speedup compared to the optimal speedup

  10. PLASMA on dual-socket Sandy Bridge (# of observations = 15,000, cores = 16) • [Figure] Time (sec, 3–7) vs. block size (500–1500), with 256 KB and 40 MB cache sizes annotated

  11. PLASMA optimal block sizes (cores = 16) • [Figure] Optimal block size (0–600) vs. # of observations (10,000–40,000)

  12. Utilizes GPUs or Xeon Phi for parallelization • Multiple GPU & multiple Xeon Phi implementations available • 1 CPU core drives 1 GPU • Block Scheduling • Similar to PLASMA • Block Size dependent on accelerator architecture

  13. CUDA-based proprietary linear algebra package • Capable of doing LAPACK operations using 1 GPU • API written in C • Dense & sparse operations available

  14. 1 node of Caldera or Pronghorn • 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node • 64 GB RAM (~59 GB available) • Cache per core: L1 = 32 KB, L2 = 256 KB • Cache per socket: L3 = 20 MB • 2 x Nvidia Tesla M2070Q GPU (Caldera) • ~5.2 GB RAM per device • 1 core drives 1 GPU • 2 x Xeon Phi 5110P (Pronghorn) • ~7.4 GB RAM per device

  15. Accelerated hardware has room for improvement • [Figure] GFLOP/sec (0–400) vs. # of observations (up to 40,000) for PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, and CULA • Serial R: ~3 GFLOP/sec • Theoretical peak performance • 16-core Xeon Sandy Bridge: ~333 GFLOP/sec • 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec • 1 Xeon Phi 5110P: ~1,011 GFLOP/sec

  16. All parallel Cholesky implementations are faster than serial R • [Figure] Time (sec, 0.01–1000, log scale) vs. # of observations (up to 40,000) for serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, and MAGMA 2 Xeon Phis • >100x speedup over serial R when # of observations = 10k

  17. Eigendecomposition also faster on accelerated hardware • [Figure] Time (sec, 0–300) vs. # of observations (up to 10,000) for serial R, CULA, MAGMA 1 GPU, and MAGMA 2 GPUs • ~6x speedup over serial R when # of observations = 10k

  18. Can run ~30 Cholesky decompositions per eigendecomposition • [Figure] Ratio of eigendecomposition time to Cholesky time (0–30) vs. # of observations (up to 10,000) • Both times taken using MAGMA w/ 2 GPUs

  19. Parallel Cholesky beats parallel R for moderate to large matrices • [Figure] Speedup vs. parallel R (0–25) plotted against # of observations (up to 20,000) for PLASMA and MAGMA 2 GPUs • If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when the speedup is > 16

  20. Using Caldera • Single Cholesky decomposition • Matrix size < 20k: use PLASMA (16 cores w/ optimal block size) • Matrix size 20k – 35k: use MAGMA w/ 2 GPUs • Matrix size > 35k: use PLASMA (16 cores w/ optimal block size) • Dependent on computing resources available (a dispatch sketch follows below)
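A sketch of how the guidance above could be encoded as a dispatcher in R, assuming hypothetical wrapper functions plasma_chol() and magma_chol_2gpu() exist; the function names are placeholders, and the thresholds follow this slide for one node of Caldera.

# Hypothetical dispatcher for a single Cholesky decomposition on Caldera
chol_dispatch <- function(A) {
  n <- nrow(A)
  if (n >= 20000 && n <= 35000) {
    magma_chol_2gpu(A)   # 20k–35k: MAGMA w/ 2 GPUs
  } else {
    plasma_chol(A)       # < 20k or > 35k: PLASMA, 16 cores w/ optimal block size
  }
}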

  21. Explored implementation on accelerated hardware • GPUs • Multicore (Shared Memory) • Xeon Phis • Installed third-party linear algebra packages & programmed wrappers that call these packages from R • Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani • Future Work • Multicore Distributed Memory • Single Precision

  22. Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014. URL: http://CRAN.R-project.org/package=fields. R package version 7.1. • Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009. • Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010. • Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.

  23. [Figure] Task DAG for the tile Cholesky factorization: an xPOTRF on a diagonal tile, xTRSM updates of the tiles below it, then xSYRK and xGEMM updates of the trailing submatrix, repeated across tiles 0–3 down to a final xPOTRF • http://www.netlib.org/lapack/lawnspdf/lawn223.pdf
