IceCube simulation with PPC

IceCube simulation with PPC Photon propagation code PPC: 2009 - now Photonics: 2000 – up to now

Photonics: conventional on CPU • First, run photonics to fill space with photons, tabulate the result • Create such tables for nominal light sources: cascade and uniform half-muon • Simulate photon propagation by looking up photon density in tabulated distributions • Table generation is slow • Simulation suffers from a wide range of binning artifacts • Simulation is also slow! (most time is spent loading the tables)

Direct photon tracking with PPC photon propagation code • simulating flasher/standard candle photons • same code for muon/cascade simulation • using precise scattering function: linear combination of HG+SAM • using tabulated (in 10 m depth slices) layered ice structure • employing 6-parameter ice model to extrapolate in wavelength • tilt in the ice layer structure is properly taken into account • transparent folding of acceptance and efficiencies • precise tracking through layers of ice, no interpolation needed • precise simulation of the longitudinal development of cascades and • angular distribution of particles emitting Cherenkov photons

PPC simulation on GPU graphics processing unit execution threads propagation steps (between scatterings) photon absorbed new photon created (taken from the pool) threads complete their execution (no more photons) Running on an NVidia GTX 295 CUDA-capable card, ppc is configured with: 448 threads in 30 blocks (total of 13440 threads) average of ~ 1024 photons per thread (total of 1.38 . 107 photons per call)

Photon Propagation Code: PPC • There are 5 versions of the ppc: • original c++ • "fast" c++ • in Assembly • for CUDA GPU • icetray module All versions verified to produce identical results comparison with i3mcml http://icecube.wisc.edu/~dima/work/WISC/ppc/

ppc icetray module • at http://code.icecube.wisc.edu/svn/projects/ppc/trunk/ • uses a wrapper: private/ppc/i3ppc.cxx, which compiles by cmake system into the libppc.so • it is necessary to compile an additional library libxppc.so by running make in private/ppc/gpu: • “make glib” compiles gpu-accelerated version (needs cuda tools) • “make clib” compiles cpu version (from the same sources!) • link to libxppc.so and libcudart.so (if gpu version) from build/lib directory • this library file must be loaded before the libppc.so wrapper library • These steps are automated with a resouces/make.sh script

ppc homepage http://icecube.wisc.edu/~dima/work/WISC/ppc

GPU scaling Original: 1/2.08 1/2.70 CPU c++: 1.00 1.00 Assembly: 1.25 1.37 GTX 295: 147 157 GTX/Ori: 307 424 C1060: 104 112 C2050: 157 150 GTX 480: 210 204 Uses cudaGetDeviceProperties() to get the number of multiprocessors, Uses cudaFuncGetAttributes() to get the maximum number of threads On GTX 295: 1.296 GHz Running on 30 MPs x 448 threads Kernel uses: l=0 r=35 s=8176 c=62400 On GTX 480: 1.401 GHz Running on 15 MPs x 768 threads Kernel uses: l=0 r=40 s=3960 c=62400 On C1060: 1.296 GHz Running on 30 MPs x 448 threads Kernel uses: l=0 r=35 s=3992 c=62400 On C2050: 1.147 GHz Running on 14 MPs x 768 threads Kernel uses: l=0 r=41 s=3960 c=62400

GTX 480 vs. GTX 295 • GTX 480 has 1 GPU • 480 MPs in 15 cores • 32 MPs per core • 4 single-precision sFPUs • 60 sFPUs per GPU • 480 cores per card • 60 sFPUs per card (!) shared memory • up to 48Kb of per core • up to 720 Kb per card • GTX 295 has 2 GPUs • 240 MPs in 30 cores • 8 MPs per core • 2 single-precision sFPUs • 60 sFPUs per GPU • 480 cores per card • 120 sFPUs per card Shared memory: • 16Kb per core • 960 Kb per card Why is ppc not a factor 2 faster on GTX 480 GPU than on GTX 295 GPU?

Device enumeration nvidia-smi -lsa GPU 0: Product Name : GeForce GTX 295 Serial : 1803836293359 PCI ID : 5eb10de Temperature : 87 C GPU 1: Product Name : GeForce GTX 295 Serial : 2497590956570 PCI ID : 5eb10de Temperature : 90 C GPU 2: Product Name : GeForce GTX 295 Serial : 1247671583504 PCI ID : 5eb10de Temperature : 100 C GPU 3: Product Name : GeForce GTX 295 Serial : 2353575330598 PCI ID : 5eb10de Temperature : 105 C GPU 4: Product Name : GeForce GTX 295 Serial : 1939228426794 PCI ID : 5eb10de Temperature : 100 C GPU 5: Product Name : GeForce GTX 295 Serial : 2347233542940 PCI ID : 5eb10de Temperature : 103 C cudatest: Found 6 devices, driver 2030, runtime 2030 0(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 1(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 2(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 3(1.3): GeForce GTX 295 1.296 GHz G(938803200) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 4(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 5(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) 2 and 3 4 and 5 0 and 1 PSU 3 GTX 295 cards, each with 2 GPUs

Device enumeration nvidia-smi -a GPU 0: Product Name : GeForce GTX 295 PCI ID : 5eb10de Temperature : 68 C GPU 1: Product Name : GeForce GTX 295 PCI ID : 5eb10de Temperature : 73 C GPU 2: Product Name : GeForce GTX 480 PCI ID : 6c010de Temperature : 106 C GPU 3: Product Name : GeForce GTX 295 PCI ID : 5eb10de Temperature : 90 C GPU 4: Product Name : GeForce GTX 295 PCI ID : 5eb10de Temperature : 91 C cuda002: Found 5 devices, driver 3010, runtime 3010 0(2.0): GeForce GTX 480 1.401 GHz G(1610285056) S(49152) C(65536) R(32768) W(32) l0 o1 c0 h1 i0 m15 a512 M(2147483647) T(1024: 1024,1024,64) G(65535,65535,1) 1(1.3): GeForce GTX 295 1.242 GHz G(938803200) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1) 2(1.3): GeForce GTX 295 1.242 GHz G(939327488) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1) 3(1.3): GeForce GTX 295 1.242 GHz G(939327488) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1) 4(1.3): GeForce GTX 295 1.242 GHz G(939327488) S(16384) C(65536) R(16384) W(32) l0 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1) 0 2 3 and 4 3 and 4 1 and 2 0 and 1 PSU 2 GTX 295 cards, 1 GTX 480 card

Fermi vs. Tesla Flasher f2kmuon 35587.8 13797.0 34246.7 10990.8 40989.2 13563.2 40423.2 11514.9 29114.5 11361.4 27343.8 8760.0 27346.6 8755.5 29024.8 11429.2 27631.5 8833.2 27630.9 8833.3 28950.1 9079.5 28955.1 9073.2 cudatest: Found 6 devices, driver 2030, runtime 2030 0(1.3): GeForce GTX 295 1.296 GHz G(939261952) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(262144) T(512: 512,512,64) G(65535,65535,1) tesla: Found 1 devices, driver 3000, runtime 3000 0(1.3): Tesla C1060 1.296 GHz G(4294770688) S(16384) C(65536) R(16384) W(32) l1 o1 c0 h1 i0 m30 a256 M(2147483647) T(512: 512,512,64) G(65535,65535,1) fermi: Found 1 devices, driver 3000, runtime 3000 0(2.0): Tesla C2050 1.147 GHz G(2817982464) S(49152) C(65536) R(32768) W(32) l1 o1 c0 h1 i0 m14 a512 M(2147483647) T(1024: 1024,1024,64) G(65535,65535,1) beta: Found 1 devices, driver 3010, runtime 3010 0(2.0): Tesla C2050 1.147 GHz G(2817982464) S(49152) C(65536) R(32768) W(32) l1 o1 c0 h1 i0 m14 a512 M(2147483647) T(1024: 1024,1024,64) G(65535,65535,1) 11: arch=11 make gpu 12: arch=12 make gpu (default/best) 1x: arch=12 make gpu with -ftz=true -prec-div=false -prec-sqrt=false 20: arch=20 make gpu 2x: arch=20 make gpu with -ftz=true -prec-div=false -prec-sqrt=false

Kernel time calculation Run 3232 (corsika) IC86 processing on cuda002 (per file): GTX 295: Device time: 1123741.1 (in-kernel: 1115487.9...1122539.1) [ms] GTX 480: Device time: 693447.8 (in-kernel: 691775.9...693586.2) [ms] If more than 1 thread is running using same GPU: Device time: 1417203.1 (in-kernel: 1072643.6...1079405.0) [ms] 3 counters: 1. time difference before/after kernel launch in host code 2. in-kernel, using cycle counter: min thread time 3. max thread time Also, real/user/sys times of top: gpus 6 cpus 1 cores 8 files 693 Real 749m4.693s User 3456m10.888s sys 39m50.369s Device time: 245312940.1 216887330.9 218253017.2 [ms] files: 693 real: 64.8553 user: 37.8357 gpu: 58.9978 kernel: 52.4899 [seconds] 81%-91% GPU utilization

Concurrent execution time Thread 1: CPU GPU CPU GPU Thread 2: GPU CPU GPU CPU Create track segments CPU CPU CPU CPU One thread: GPU GPU GPU GPU Process photon hits Copy photon hits from GPU However: have 2 buffers: 1 on host and 1 on GPU! Just need to synchronize before the buffers are re-used Need 2 buffers for track segments and photon hits Copy track segments to GPU

BAD multiprocessors (MPs) clist cudatest 0 1 2 3 4 5 cuda001 0 1 2 3 4 5 cuda002 0 1 2 3 4 5 cuda003 0 1 2 3 4 5 #badmps cuda001 3 22 cuda002 2 20 cuda002 4 10 Configured: xR=5 eff=0.95 sf=0.2 g=0.943 Loaded 12 angsens coefficients Loaded 6x170 dust layer points Loaded 16028 random multipliers Loaded 42 wavelenth points Loaded 171 ice layers Loaded 3540 DOMs (19x19) Processing f2k muons from stdin on device 2 Total GPU memory usage: 83053520 photons: 13762560 hits: 991 Error: TOT was a nan or an inf 1 times! Bad MP #20 photons: 13762560 hits: 393 photons: 13762560 hits: 570 photons: 13762560 hits: 501 photons: 13762560 hits: 832 photons: 13762560 hits: 717 CUDA Error: unspecified launch failure [dima@cuda002 gpu]$ cat mmc.1.f2k | BADMP=20 ./ppc 2 > /dev/null Configured: xR=5 eff=0.95 sf=0.2 g=0.943 Loaded 12 angsens coefficients Loaded 6x170 dust layer points Loaded 16028 random multipliers Loaded 42 wavelenth points Loaded 171 ice layers Loaded 3540 DOMs (19x19) Processing f2k muons from stdin on device 2 Not using MP #20 Total GPU memory usage: 83053520 photons: 13762560 hits: 871 … photons: 1813560 hits: 114 Device time: 31970.7 (in-kernel: 31725.6...31954.8) [ms] Total GPU memory usage: 83053520 photons: 13762560 hits: 938 Error: TOT was a nan or an inf 9 times! Bad MP #20 #20 #20 #20 photons: 13762560 hits: 442 photons: 13762560 hits: 627 CUDA Error: unspecified launch failure Disable 3 bad GPUs out of 24: 12.5% Disable 3 bad MPs out of 720: 0.4%! Failure rates:

Typical run times • corsika: run 3232: 10493 10.0345 sec files • ic86/spx/3232 on cuda00[123] (53.4 seconds per job) • 1.2 days of real detector time in 6.5 days • nugen: run 2972: 9993 200000-event files; E^-2 weighted • ic86/spx/2972 on cudatest (25.1 seconds per job) • entire 10k set of files in 2.9 days  this is enough for an atmnu/diffuse analysis! • Considerations: • Maximize GPU utilization by running only mmc+ppc parts on the GPU nodes • still, IC40 mmc+ppc+detector was run with ~80% GPU utilization • run with 100% DOM efficiency, save all ppc events with at least 1 MC hit • apply a range of allowed efficiencies (70-100%) later with ppc-eff module

Other • Consider: • building production computers with only 2 cards, leaving a space in between • using 6-core CPUs if paired with 3 GPU cards • 4-way Tesla GPU-only servers a possible solution • Consumer GTX card much faster than Tesla/Fermi cards • GTX 295 was so far found to be a better choice than GTX 480 • but: no loger available! • Reliability: • 0.4% loss of advertised capacity in GTX 295 cards • however: 2 of 3 affected cards were “refurbished” • do cards deteriorate over time? The failed MPs did not change in ~3 months

IceCube simulation with PPC