
Exascale radio astronomy


Presentation Transcript


  1. Exascale radio astronomy M Clark

  2. Contents • GPU Computing • GPUs for Radio Astronomy • The problem is power • Astronomy at the Exascale

  3. The March of GPUs

  4. What is a GPU? • Kepler K20X (2012) • 2688 processing cores • 3,995 SP Gflops peak • Effective SIMD width of 32 threads (warp) • Deep memory hierarchy • As we move away from registers • Bandwidth decreases • Latency increases • Limited on-chip memory • 65,536 32-bit registers per SM • 48 KiB shared memory per SM • 1.5 MiB L2 • Programmed using a thread model
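
A minimal sketch of that thread model, assuming nothing beyond the figures on the slide: threads are launched in blocks, blocks stage data in the fast 48 KiB of shared memory per SM, and threads execute in 32-wide warps. The kernel and its names are illustrative, not from the talk.

```cuda
#include <cuda_runtime.h>

// Block-wide sum: each block of 256 threads stages data in shared memory
// (a small slice of the 48 KiB available per SM) and reduces it in a tree.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();

    // Tree reduction; the final steps happen inside a single 32-thread warp.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Launch example (block size must stay a power of two for the loop above):
//   block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```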

  5. Minimum Port, Big Speed-up • [diagram: the application code is split so that only the critical functions are ported to the GPU, while the rest of the sequential code stays on the CPU]

  6. Strong CUDA GPU Roadmap • [chart: normalized SGEMM / W per generation, 2008–2016] • Tesla (CUDA) • Fermi (FP64) • Kepler (Dynamic Parallelism) • Maxwell (DX12) • Pascal (Unified Memory, 3D Memory, NVLink)

  7. Introducing NVLink and Stacked Memory • NVLink • GPU high-speed interconnect • 80-200 GB/s • Planned support for POWER CPUs • Stacked Memory • 4x higher bandwidth (~1 TB/s) • 3x larger capacity • 4x more energy efficient per bit

  8. NVLink Enables Data Transfer at the Speed of CPU Memory • [diagram: a Tesla GPU with stacked memory (HBM, 1 TB/s) connected over NVLink (80 GB/s) to a CPU with DDR4 memory (50-75 GB/s)]

  9. Radio Telescope Data Flow • [diagram: N antennas → RF samplers → digital stages, O(N) → correlator, real-time, O(N²) → calibration & imaging, real-time and post-real-time, O(N log N)–O(N²)]

  10. Where can GPUs be Applied? • Cross correlation – GPUs are ideal • Performance similar to CGEMM • High performance open-source library https://github.com/GPU-correlators/xGPU • Calibration and Imaging • Gridding - Coordinate mapping of input data to a regular grid • Arithmetic intensity scales with kernel convolution size • Compute-bound problem maps well to GPUs • Dominant time sink in compute pipeline • Other image processing steps • CUFFT – Highly optimized Fast Fourier Transform library • PFB – Computational intensity increases with number of taps • Coordinate transformations and resampling
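
As an illustration of the CUFFT step, the fragment below turns a gridded uv plane into a dirty image with a 2-D complex-to-complex FFT. The in-place transform, the function name, and the grid dimensions are assumptions made for the sketch, not details from the talk.

```cuda
#include <cufft.h>

// Invert a gridded uv plane of size nx x ny that already lives on the GPU.
void uv_to_image(cufftComplex *d_grid, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);              // plan a 2-D C2C transform
    cufftExecC2C(plan, d_grid, d_grid, CUFFT_INVERSE);  // uv plane -> dirty image (in place)
    cufftDestroy(plan);
    // Note: cuFFT transforms are unnormalized; scale by 1/(nx*ny) if required.
}
```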

  11. GPUs in Radio Astronomy • Already an essential tool in radio astronomy • ASKAP – Western Australia • LEDA – United States of America • LOFAR – Netherlands (+ Europe) • MWA – Western Australia • NCRA – India • PAPER – South Africa

  12. Cross Correlation on GPUs • Cross correlation is essentially GEMM • Hierarchical locality
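
Because the correlation matrix is just X·Xᴴ accumulated over time samples, its mathematical shape is a single rank-k update, as in the hedged sketch below. xGPU uses its own kernel tuned for this hierarchical locality; the cuBLAS call, the column-major layout, and the names here are assumptions made to keep the example short.

```cuda
#include <cublas_v2.h>

// Accumulate the visibility matrix V += X * X^H for one integration.
// X: n_inputs x n_samples, column-major, one column per time sample.
// V: n_inputs x n_inputs Hermitian matrix (lower triangle stored).
void correlate(cublasHandle_t handle, const cuComplex *d_X, cuComplex *d_V,
               int n_inputs, int n_samples)
{
    const float alpha = 1.0f;
    const float beta  = 1.0f;   // beta = 1 accumulates successive integrations
    cublasCherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n_inputs, n_samples,
                &alpha, d_X, n_inputs,
                &beta,  d_V, n_inputs);
}
```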

  13. Correlator Efficiency • [chart: X-engine GFLOPS per watt vs. year, 2008–2016, rising from Tesla through Fermi, Kepler and Maxwell to Pascal] • Tesla: 0.35 TFLOPS sustained • Fermi: >1 TFLOPS sustained • Kepler: >2.5 TFLOPS sustained

  14. Software Correlation Flexibility • Why do software correlation? • Software correlators inherently have a great degree of flexibility • Software correlation can do on-the-fly reconfiguration • Subset correlation at increased bandwidth • Subset correlation at decreased integration time • Pulsar binning • Easy classification of data (RFI threshold) • Software is portable, correlator unchanged since 2010 • Already running on 2016 architecture

  15. HPC’s Biggest Challenge: Power • Power of a 300 Petaflop CPU-only supercomputer = power for the city of San Francisco

  16. GPUs Power World’s 10 Greenest Supercomputers

  17. The End of Historic Scaling C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

  18. The End of Voltage Scaling • In the Good Old Days • Leakage was not important, and voltage scaled with feature size: L’ = L/2, V’ = V/2, E’ = CV² = E/8, f’ = 2f, D’ = 1/L² = 4D, P’ = P • Halve L and get 4x the transistors and 8x the capability for the same power • The New Reality • Leakage has limited threshold voltage, largely ending voltage scaling: L’ = L/2, V’ ≈ V, E’ = CV² = E/2, f’ = 2f, D’ = 1/L² = 4D, P’ = 4P • Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area
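
One way to read the arithmetic above (my reading, not spelled out on the slide) is to write total power as switching energy per operation times clock frequency times transistor density; the two regimes then follow directly:

```latex
P = E f D, \qquad E = C V^2
\text{Dennard scaling } (V' = V/2):\quad P' = E' f' D' = \tfrac{E}{8}\,(2f)\,(4D) = P
\text{Post-Dennard } (V' \approx V):\quad P' = E' f' D' = \tfrac{E}{2}\,(2f)\,(4D) = 4P
```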

  19. Major Software Implications • Need to expose massive concurrency • Exaflop at O(GHz) clocks ⇒ O(billion-way) parallelism! • Need to expose and exploit locality • Data motion more expensive than computation • >100:1 global vs. local energy • Need to start addressing resiliency in the applications
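
The billion-way figure is just the ratio of the target throughput to the clock rate:

```latex
\frac{10^{18}\ \text{flop/s (exaflop)}}{\mathcal{O}(10^{9}\ \text{Hz})\ \text{clock}}
\;\approx\; \mathcal{O}(10^{9})\ \text{operations in flight at any instant}
```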

  20. How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Correlator • 5 Pflops of computation • Data-parallel across visibilities • Task-parallel across frequency channels • O(trillion-way) parallelism
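
As a consistency check (the arithmetic is mine, using the usual 8 flops per complex multiply-accumulate), the visibility count and the 5 Pflops figure follow directly from the specification:

```latex
N_{\text{signals}} = 2 \times 1024 = 2048, \qquad
N_{\text{vis}} = \tfrac{1}{2}\,(2048 \times 2047) = 2{,}096{,}128
R \approx 8 \times 2{,}096{,}128 \times 300\times10^{6}\ \text{Hz}
  \approx 5\times10^{15}\ \text{flop/s} = 5\ \text{Pflops}
```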

  21. How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Gridding (W-projection) • Kernel size 100x100 • Parallel across kernel size and visibilities (J. Romein) • O(10 billion-way) parallelism
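
A hedged sketch of where that parallelism comes from: one thread per (visibility, kernel-cell) pair, accumulating onto the grid with atomics. Romein's scheme is cleverer, keeping partial sums in registers while consecutive visibilities land on nearby grid points, but even the naive mapping below exposes nvis × 100 × 100 ≈ 10¹⁰ independent work items. All structure and names are illustrative.

```cuda
#include <cuda_runtime.h>

struct Vis { int u, v; float2 val; };  // visibility with pre-quantized uv grid coordinates

// One thread per (visibility, kernel cell); ksize is the convolution
// kernel width (~100 for the W-projection case quoted above).
__global__ void grid_naive(const Vis *vis, long nvis,
                           const float *ckernel, int ksize,
                           float2 *grid, int gwidth)
{
    long tid   = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long total = nvis * ksize * ksize;
    if (tid >= total) return;

    int  cell = tid % (ksize * ksize);   // which cell of the kernel
    long ivis = tid / (ksize * ksize);   // which visibility
    int  du = cell % ksize, dv = cell / ksize;

    Vis s   = vis[ivis];
    float c = ckernel[cell];
    float2 *g = &grid[(s.v + dv) * (long)gwidth + (s.u + du)];
    atomicAdd(&g->x, c * s.val.x);       // convolve the sample onto the grid
    atomicAdd(&g->y, c * s.val.y);
}
// Launch with enough 1-D blocks that blocks * threads >= nvis * ksize * ksize.
```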

  22. Energy Efficiency Drives Locality • [diagram: energy per operation on a 28 nm IC — a 64-bit DP operation costs ~20 pJ and a 256-bit access to an 8 kB SRAM ~50 pJ, while moving 256 bits 20 mm across the chip costs ~256 pJ, an efficient off-chip link ~500 pJ, and a DRAM read/write ~16,000 pJ]

  23. Energy Efficiency Drives Locality • [chart: energy per operation, in picojoules]

  24. Energy Efficiency Drives Locality • This is observable today • We have lots of tunable parameters: • Register tile size: how much work should each thread do? • Thread block size: how many threads should work together? • Input precision: size of the input words • Quick and dirty cross correlation example • 4x4 => 8x8 register tiling • 8.5% faster, 5.5% lower power => 14% improvement in GFLOPS / watt
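
A hedged sketch of the register-tile knob in a cross-correlation-style kernel: each thread keeps a TILE × TILE block of accumulators in registers, so growing TILE raises the flops performed per word loaded, at the cost of register pressure (and hence occupancy). The data layout, names and divisibility assumptions are mine, not from the talk.

```cuda
#include <cuda_runtime.h>

// x is laid out time-major: sample t of input i is x[t * n + i].
// Assumes n is a multiple of TILE * blockDim so no bounds handling is needed.
template <int TILE>
__global__ void corr_tiled(const float2 *x, int n, int nsamp, float2 *out)
{
    int ti = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;  // row tile origin
    int tj = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;  // column tile origin

    float2 acc[TILE][TILE] = {};          // accumulators live in registers

    for (int t = 0; t < nsamp; t++) {
        float2 a[TILE], b[TILE];
        for (int i = 0; i < TILE; i++) a[i] = x[t * n + ti + i];
        for (int j = 0; j < TILE; j++) b[j] = x[t * n + tj + j];

        // TILE*TILE complex multiply-accumulates for every 2*TILE loads
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++) {
                acc[i][j].x += a[i].x * b[j].x + a[i].y * b[j].y;  // Re(a * conj(b))
                acc[i][j].y += a[i].y * b[j].x - a[i].x * b[j].y;  // Im(a * conj(b))
            }
    }
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            out[(ti + i) * n + (tj + j)] = acc[i][j];
}
// e.g. corr_tiled<8><<<dim3(n/(8*16), n/(8*16)), dim3(16, 16)>>>(d_x, n, nsamp, d_out);
```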

  25. SKA1-LOW Sketch • [diagram: N = 1024 stations → RF samplers, 8-bit digitization → 50 Tb/s into the correlator, O(10) PFLOPS → 10 Tb/s into calibration & imaging, O(100) PFLOPS]

  26. Energy Efficiency Drives Locality • [chart repeated: energy per operation, in picojoules]

  27. Do we need Moore’s Law? • Moore’s law comes from shrinking the process • Moore’s law is slowing down • Dennard scaling is dead • Increasing wafer costs mean that it takes longer to move to the next process

  28. Improving Energy Efficiency @ Iso-Process • We don’t know how to build the perfect processor • Huge focus on improved architecture efficiency • Better understanding of a given process • Compare Fermi vs. Kepler vs. Maxwell architectures @ 28 nm • GF117: 96 cores, peak 192 Gflops • GK107: 384 cores, peak 770 Gflops • GM107: 640 cores, peak 1330 Gflops • Use cross-correlation benchmark • Only measure GPU power

  29. Improving Energy Efficiency @ 28 nm • [chart: cross-correlation GFLOPS / watt for GF117, GK107 and GM107 on the same 28 nm process, with generation-over-generation efficiency gains of roughly 55% and 80%]

  30. How Hot is Your Supercomputer? • 1. TSUBAME-KFC, Tokyo Tech, oil cooled: 4,503 MFLOPS / watt • 2. Wilkes Cluster, U. Cambridge, air cooled: 3,631 MFLOPS / watt • Number 1 is 24% more efficient than number 2

  31. Temperature is Power • Power is dynamic and static • Dynamic power is work • Static power is leakage • The dominant term is sub-threshold leakage, of the form I_leak ∝ A (W/L) v_T² exp((V_s − V_th) / (n v_T)) • Voltage terms: V_s – gate-to-source voltage; V_th – switching threshold voltage; n – transistor sub-threshold swing coefficient • Device specifics: A – technology-specific constant; L, W – device channel length and width • Thermal voltage v_T = kT/q, with k = 8.62×10⁻⁵ eV/K, about 26 mV at room temperature

  32. Temperature is Power • [chart: GeForce GTX 580 (GF110, 40 nm) vs. Tesla K20m (GK110, 28 nm)]

  33. Tuning for Power Efficiency • A given processor does not have a fixed power efficiency • Dependent on • Clock frequency • Voltage • Temperature • Algorithm • Tune in this multi-dimensional space for power efficiency • E.g., cross-correlation on Kepler K20 • 12.96 -> 18.34 GFLOPS / watt • Bad news: no two processors are exactly alike
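
A hedged sketch of how such a tuning loop can observe the GPU: NVML (shipped with the driver; link with -lnvidia-ml) exposes board power, die temperature and the current SM clock, and application clocks can be changed with nvmlDeviceSetApplicationsClocks where the driver permits it. The device index 0 and the single-sample read are assumptions.

```cuda
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int power_mW, temp_C, sm_clock_MHz;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlDeviceGetPowerUsage(dev, &power_mW);                      // board power in milliwatts
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_C); // die temperature in C
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock_MHz);    // current SM clock in MHz

    printf("%.1f W, %u C, %u MHz\n", power_mW / 1000.0, temp_C, sm_clock_MHz);

    nvmlShutdown();
    return 0;
}
```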

  34. Precision is Power • Multiplier power scales approximately with the square of the operand width • Most computation is done in FP32 / FP64 • Should use the minimum precision required by the science • Maxwell GPUs have 16-bit integer multiply-add at FP32 rate • Algorithms should increasingly use hierarchical precision • Only invoke high precision when necessary • Signal processing folks have known this for a long time • Lesson feeding back into the HPC community...
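
A hedged sketch of hierarchical precision in the correlator context: 8-bit complex samples, exact 32-bit integer accumulation, and a promotion to floating point only when the integration is dumped. One thread block per baseline is used purely to keep the example short; names and layout are illustrative.

```cuda
#include <cuda_runtime.h>

// Correlate two 8-bit complex streams; the integer accumulation is exact
// for short integrations and moves 4x less data than FP32 inputs would.
__global__ void corr_int8(const char2 *a, const char2 *b, int nsamp, int2 *vis)
{
    int2 acc = make_int2(0, 0);
    for (int t = threadIdx.x; t < nsamp; t += blockDim.x) {
        char2 s = a[t], r = b[t];
        acc.x += (int)s.x * r.x + (int)s.y * r.y;   // Re(s * conj(r))
        acc.y += (int)s.y * r.x - (int)s.x * r.y;   // Im(s * conj(r))
    }
    // Combine per-thread partial sums (atomics keep the sketch short).
    atomicAdd(&vis->x, acc.x);
    atomicAdd(&vis->y, acc.y);
    // Promote to FP32/FP64 only when dumping the integration.
}
```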

  35. Conclusions • Astronomy has an insatiable appetite for compute • Many-core processors are a perfect match to the processing pipeline • Power is a problem, but • Astronomy has oodles of parallelism • Key algorithms possess locality • Precision requirements are well understood • Scientists and engineers are wedded to the problem • Astronomy is perhaps the ideal application for the exascale
