
Exascale radio astronomy


Presentation Transcript


  1. Exascale radio astronomy M Clark

  2. Contents • GPU Computing • GPUs for Radio Astronomy • The problem is power • Astronomy at the Exascale

  3. The March of GPUs

  4. What is a GPU? • Kepler K20X (2012) • 2688 processing cores • 3,995 SP Gflops peak • Effective SIMD width of 32 threads (warp) • Deep memory hierarchy • As we move away from registers • Bandwidth decreases • Latency increases • Limited on-chip memory • 65,536 32-bit registers per SM • 48 KiB shared memory per SM • 1.5 MiB L2 • Programmed using a thread model
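
A minimal sketch of that thread model, assuming nothing beyond the figures on the slide: threads are launched in blocks, blocks stage data in the fast 48 KiB of shared memory per SM, and threads execute in 32-wide warps. The kernel and its names are illustrative, not from the talk.

```cuda
#include <cuda_runtime.h>

// Block-wide sum: each block of 256 threads stages data in shared memory
// (a small slice of the 48 KiB available per SM) and reduces it in a tree.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();

    // Tree reduction; the final steps happen inside a single 32-thread warp.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Launch example (block size must stay a power of two for the loop above):
//   block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```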

  5. Minimum Port, Big Speed-up • [diagram: the application code is split so that only the critical functions are ported to the GPU, while the rest of the sequential code stays on the CPU]

  6. Strong CUDA GPU Roadmap • [chart: normalized SGEMM / W per generation, 2008–2016] • Tesla (CUDA) • Fermi (FP64) • Kepler (Dynamic Parallelism) • Maxwell (DX12) • Pascal (Unified Memory, 3D Memory, NVLink)

  7. Introducing NVLink and Stacked Memory • NVLink • GPU high-speed interconnect • 80-200 GB/s • Planned support for POWER CPUs • Stacked Memory • 4x higher bandwidth (~1 TB/s) • 3x larger capacity • 4x more energy efficient per bit

  8. NVLink Enables Data Transfer at the Speed of CPU Memory • [diagram: a Tesla GPU with stacked memory (HBM, 1 TB/s) connected over NVLink (80 GB/s) to a CPU with DDR4 memory (50-75 GB/s)]

  9. Radio Telescope Data Flow • [diagram: N antennas → RF samplers → digital stages, O(N) → correlator, real-time, O(N²) → calibration & imaging, real-time and post-real-time, O(N log N)–O(N²)]

  10. Where can GPUs be Applied? • Cross correlation – GPUs are ideal • Performance similar to CGEMM • High performance open-source library https://github.com/GPU-correlators/xGPU • Calibration and Imaging • Gridding - Coordinate mapping of input data to a regular grid • Arithmetic intensity scales with kernel convolution size • Compute-bound problem maps well to GPUs • Dominant time sink in compute pipeline • Other image processing steps • CUFFT – Highly optimized Fast Fourier Transform library • PFB – Computational intensity increases with number of taps • Coordinate transformations and resampling
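
As an illustration of the CUFFT step, the fragment below turns a gridded uv plane into a dirty image with a 2-D complex-to-complex FFT. The in-place transform, the function name, and the grid dimensions are assumptions made for the sketch, not details from the talk.

```cuda
#include <cufft.h>

// Invert a gridded uv plane of size nx x ny that already lives on the GPU.
void uv_to_image(cufftComplex *d_grid, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);              // plan a 2-D C2C transform
    cufftExecC2C(plan, d_grid, d_grid, CUFFT_INVERSE);  // uv plane -> dirty image (in place)
    cufftDestroy(plan);
    // Note: cuFFT transforms are unnormalized; scale by 1/(nx*ny) if required.
}
```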

  11. GPUs in Radio Astronomy • Already an essential tool in radio astronomy • ASKAP – Western Australia • LEDA – United States of America • LOFAR – Netherlands (+ Europe) • MWA – Western Australia • NCRA – India • PAPER – South Africa

  12. Cross Correlation on GPUs • Cross correlation is essentially GEMM • Hierarchical locality
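
Because the correlation matrix is just X·Xᴴ accumulated over time samples, its mathematical shape is a single rank-k update, as in the hedged sketch below. xGPU uses its own kernel tuned for this hierarchical locality; the cuBLAS call, the column-major layout, and the names here are assumptions made to keep the example short.

```cuda
#include <cublas_v2.h>

// Accumulate the visibility matrix V += X * X^H for one integration.
// X: n_inputs x n_samples, column-major, one column per time sample.
// V: n_inputs x n_inputs Hermitian matrix (lower triangle stored).
void correlate(cublasHandle_t handle, const cuComplex *d_X, cuComplex *d_V,
               int n_inputs, int n_samples)
{
    const float alpha = 1.0f;
    const float beta  = 1.0f;   // beta = 1 accumulates successive integrations
    cublasCherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n_inputs, n_samples,
                &alpha, d_X, n_inputs,
                &beta,  d_V, n_inputs);
}
```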

  13. Correlator Efficiency • [chart: X-engine GFLOPS per watt vs. year, 2008–2016, rising from Tesla through Fermi, Kepler and Maxwell to Pascal] • Tesla: 0.35 TFLOPS sustained • Fermi: >1 TFLOPS sustained • Kepler: >2.5 TFLOPS sustained

  14. Software Correlation Flexibility • Why do software correlation? • Software correlators inherently have a great degree of flexibility • Software correlation can do on-the-fly reconfiguration • Subset correlation at increased bandwidth • Subset correlation at decreased integration time • Pulsar binning • Easy classification of data (RFI threshold) • Software is portable, correlator unchanged since 2010 • Already running on 2016 architecture

  15. HPC’s Biggest Challenge: Power • Power of a 300 Petaflop CPU-only supercomputer = power for the city of San Francisco

  16. GPUs Power World’s 10 Greenest Supercomputers

  17. The End of Historic Scaling C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

  18. The End of Voltage Scaling • In the Good Old Days • Leakage was not important, and voltage scaled with feature size: L’ = L/2, V’ = V/2, E’ = CV² = E/8, f’ = 2f, D’ = 1/L² = 4D, P’ = P • Halve L and get 4x the transistors and 8x the capability for the same power • The New Reality • Leakage has limited threshold voltage, largely ending voltage scaling: L’ = L/2, V’ ≈ V, E’ = CV² = E/2, f’ = 2f, D’ = 1/L² = 4D, P’ = 4P • Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area
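
One way to read the arithmetic above (my reading, not spelled out on the slide) is to write total power as switching energy per operation times clock frequency times transistor density; the two regimes then follow directly:

```latex
P = E f D, \qquad E = C V^2
\text{Dennard scaling } (V' = V/2):\quad P' = E' f' D' = \tfrac{E}{8}\,(2f)\,(4D) = P
\text{Post-Dennard } (V' \approx V):\quad P' = E' f' D' = \tfrac{E}{2}\,(2f)\,(4D) = 4P
```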

  19. Major Software Implications • Need to expose massive concurrency • Exaflop at O(GHz) clocks ⇒ O(billion-way) parallelism! • Need to expose and exploit locality • Data motion more expensive than computation • >100:1 global vs. local energy • Need to start addressing resiliency in the applications
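
The billion-way figure is just the ratio of the target throughput to the clock rate:

```latex
\frac{10^{18}\ \text{flop/s (exaflop)}}{\mathcal{O}(10^{9}\ \text{Hz})\ \text{clock}}
\;\approx\; \mathcal{O}(10^{9})\ \text{operations in flight at any instant}
```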

  20. How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Correlator • 5 Pflops of computation • Data-parallel across visibilities • Task-parallel across frequency channels • O(trillion-way) parallelism
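
As a consistency check (the arithmetic is mine, using the usual 8 flops per complex multiply-accumulate), the visibility count and the 5 Pflops figure follow directly from the specification:

```latex
N_{\text{signals}} = 2 \times 1024 = 2048, \qquad
N_{\text{vis}} = \tfrac{1}{2}\,(2048 \times 2047) = 2{,}096{,}128
R \approx 8 \times 2{,}096{,}128 \times 300\times10^{6}\ \text{Hz}
  \approx 5\times10^{15}\ \text{flop/s} = 5\ \text{Pflops}
```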

  21. How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Gridding (W-projection) • Kernel size 100x100 • Parallel across kernel size and visibilities (J. Romein) • O(10 billion-way) parallelism
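
A hedged sketch of where that parallelism comes from: one thread per (visibility, kernel-cell) pair, accumulating onto the grid with atomics. Romein's scheme is cleverer, keeping partial sums in registers while consecutive visibilities land on nearby grid points, but even the naive mapping below exposes nvis × 100 × 100 ≈ 10¹⁰ independent work items. All structure and names are illustrative.

```cuda
#include <cuda_runtime.h>

struct Vis { int u, v; float2 val; };  // visibility with pre-quantized uv grid coordinates

// One thread per (visibility, kernel cell); ksize is the convolution
// kernel width (~100 for the W-projection case quoted above).
__global__ void grid_naive(const Vis *vis, long nvis,
                           const float *ckernel, int ksize,
                           float2 *grid, int gwidth)
{
    long tid   = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long total = nvis * ksize * ksize;
    if (tid >= total) return;

    int  cell = tid % (ksize * ksize);   // which cell of the kernel
    long ivis = tid / (ksize * ksize);   // which visibility
    int  du = cell % ksize, dv = cell / ksize;

    Vis s   = vis[ivis];
    float c = ckernel[cell];
    float2 *g = &grid[(s.v + dv) * (long)gwidth + (s.u + du)];
    atomicAdd(&g->x, c * s.val.x);       // convolve the sample onto the grid
    atomicAdd(&g->y, c * s.val.y);
}
// Launch with enough 1-D blocks that blocks * threads >= nvis * ksize * ksize.
```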

  22. Energy Efficiency Drives Locality • [diagram: energy per operation on a 28 nm IC — a 64-bit DP operation costs ~20 pJ and a 256-bit access to an 8 kB SRAM ~50 pJ, while moving 256 bits 20 mm across the chip costs ~256 pJ, an efficient off-chip link ~500 pJ, and a DRAM read/write ~16,000 pJ]

  23. Energy Efficiency Drives Locality • [chart: energy per operation, in picojoules]

  24. Energy Efficiency Drives Locality • This is observable today • We have lots of tunable parameters: • Register tile size: how much work should each thread do? • Thread block size: how many threads should work together? • Input precision: size of the input words • Quick and dirty cross correlation example • 4x4 => 8x8 register tiling • 8.5% faster, 5.5% lower power => 14% improvement in GFLOPS / watt
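
A hedged sketch of the register-tile knob in a cross-correlation-style kernel: each thread keeps a TILE × TILE block of accumulators in registers, so growing TILE raises the flops performed per word loaded, at the cost of register pressure (and hence occupancy). The data layout, names and divisibility assumptions are mine, not from the talk.

```cuda
#include <cuda_runtime.h>

// x is laid out time-major: sample t of input i is x[t * n + i].
// Assumes n is a multiple of TILE * blockDim so no bounds handling is needed.
template <int TILE>
__global__ void corr_tiled(const float2 *x, int n, int nsamp, float2 *out)
{
    int ti = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;  // row tile origin
    int tj = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;  // column tile origin

    float2 acc[TILE][TILE] = {};          // accumulators live in registers

    for (int t = 0; t < nsamp; t++) {
        float2 a[TILE], b[TILE];
        for (int i = 0; i < TILE; i++) a[i] = x[t * n + ti + i];
        for (int j = 0; j < TILE; j++) b[j] = x[t * n + tj + j];

        // TILE*TILE complex multiply-accumulates for every 2*TILE loads
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++) {
                acc[i][j].x += a[i].x * b[j].x + a[i].y * b[j].y;  // Re(a * conj(b))
                acc[i][j].y += a[i].y * b[j].x - a[i].x * b[j].y;  // Im(a * conj(b))
            }
    }
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            out[(ti + i) * n + (tj + j)] = acc[i][j];
}
// e.g. corr_tiled<8><<<dim3(n/(8*16), n/(8*16)), dim3(16, 16)>>>(d_x, n, nsamp, d_out);
```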

  25. SKA1-LOW Sketch • [diagram: N = 1024 stations → RF samplers, 8-bit digitization → 50 Tb/s into the correlator, O(10) PFLOPS → 10 Tb/s into calibration & imaging, O(100) PFLOPS]

  26. Energy Efficiency Drives Locality • [chart repeated: energy per operation, in picojoules]

  27. Do we need Moore’s Law? • Moore’s law comes from shrinking the process • Moore’s law is slowing down • Dennard scaling is dead • Increasing wafer costs mean that it takes longer to move to the next process

  28. Improving Energy Efficiency @ Iso-Process • We don’t know how to build the perfect processor • Huge focus on improved architecture efficiency • Better understanding of a given process • Compare Fermi vs. Kepler vs. Maxwell architectures @ 28 nm • GF117: 96 cores, peak 192 Gflops • GK107: 384 cores, peak 770 Gflops • GM107: 640 cores, peak 1330 Gflops • Use cross-correlation benchmark • Only measure GPU power

  29. Improving Energy Efficiency @ 28 nm • [chart: cross-correlation GFLOPS / watt for GF117, GK107 and GM107 on the same 28 nm process, with generation-over-generation efficiency gains of roughly 55% and 80%]

  30. How Hot is Your Supercomputer? • 1. TSUBAME-KFC, Tokyo Tech, oil cooled: 4,503 MFLOPS / watt • 2. Wilkes Cluster, U. Cambridge, air cooled: 3,631 MFLOPS / watt • Number 1 is 24% more efficient than number 2

  31. Temperature is Power • Power is dynamic and static • Dynamic power is work • Static power is leakage • The dominant term is sub-threshold leakage, of the form I_leak ∝ A (W/L) v_T² exp((V_s − V_th) / (n v_T)) • Voltage terms: V_s – gate-to-source voltage; V_th – switching threshold voltage; n – transistor sub-threshold swing coefficient • Device specifics: A – technology-specific constant; L, W – device channel length and width • Thermal voltage v_T = kT/q, with k = 8.62×10⁻⁵ eV/K, about 26 mV at room temperature

  32. Temperature is Power • [chart: GeForce GTX 580 (GF110, 40 nm) vs. Tesla K20m (GK110, 28 nm)]

  33. Tuning for Power Efficiency • A given processor does not have a fixed power efficiency • Dependent on • Clock frequency • Voltage • Temperature • Algorithm • Tune in this multi-dimensional space for power efficiency • E.g., cross-correlation on Kepler K20 • 12.96 -> 18.34 GFLOPS / watt • Bad news: no two processors are exactly alike
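
A hedged sketch of how such a tuning loop can observe the GPU: NVML (shipped with the driver; link with -lnvidia-ml) exposes board power, die temperature and the current SM clock, and application clocks can be changed with nvmlDeviceSetApplicationsClocks where the driver permits it. The device index 0 and the single-sample read are assumptions.

```cuda
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int power_mW, temp_C, sm_clock_MHz;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlDeviceGetPowerUsage(dev, &power_mW);                      // board power in milliwatts
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_C); // die temperature in C
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock_MHz);    // current SM clock in MHz

    printf("%.1f W, %u C, %u MHz\n", power_mW / 1000.0, temp_C, sm_clock_MHz);

    nvmlShutdown();
    return 0;
}
```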

  34. Precision is Power • Multiplier power scales approximately with the square of the operand width • Most computation is done in FP32 / FP64 • Should use the minimum precision required by the science • Maxwell GPUs have 16-bit integer multiply-add at FP32 rate • Algorithms should increasingly use hierarchical precision • Only invoke high precision when necessary • Signal processing folks have known this for a long time • Lesson feeding back into the HPC community...
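
A hedged sketch of hierarchical precision in the correlator context: 8-bit complex samples, exact 32-bit integer accumulation, and a promotion to floating point only when the integration is dumped. One thread block per baseline is used purely to keep the example short; names and layout are illustrative.

```cuda
#include <cuda_runtime.h>

// Correlate two 8-bit complex streams; the integer accumulation is exact
// for short integrations and moves 4x less data than FP32 inputs would.
__global__ void corr_int8(const char2 *a, const char2 *b, int nsamp, int2 *vis)
{
    int2 acc = make_int2(0, 0);
    for (int t = threadIdx.x; t < nsamp; t += blockDim.x) {
        char2 s = a[t], r = b[t];
        acc.x += (int)s.x * r.x + (int)s.y * r.y;   // Re(s * conj(r))
        acc.y += (int)s.y * r.x - (int)s.x * r.y;   // Im(s * conj(r))
    }
    // Combine per-thread partial sums (atomics keep the sketch short).
    atomicAdd(&vis->x, acc.x);
    atomicAdd(&vis->y, acc.y);
    // Promote to FP32/FP64 only when dumping the integration.
}
```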

  35. Conclusions • Astronomy has an insatiable appetite for compute • Many-core processors are a perfect match to the processing pipeline • Power is a problem, but • Astronomy has oodles of parallelism • Key algorithms possess locality • Precision requirements are well understood • Scientists and engineers are wedded to the problem • Astronomy is perhaps the ideal application for the exascale
