
PRACE Keynote, Linz


Presentation Transcript


  1. PRACE Keynote, Linz Oskar Mencer, April 2014 Computing in Space

  2. Thinking Fast and Slow (Daniel Kahneman, Nobel Prize in Economics, 2002). 14 × 27 = ? Kahneman splits thinking into: System 1 (fast, hard to control) and System 2 (slow, easier to control). ….. 300 ….. 378
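  For reference, the System 2 route to the exact answer is a short deliberate decomposition:
  14 × 27 = 14 × 20 + 14 × 7 = 280 + 98 = 378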

  3. Assembly-line computing in action [Diagram labels: SYSTEM 1: x86 cores; SYSTEM 2: flexible memory plus logic; Optimal Encoding; Low Latency Memory System; High Throughput Memory; minimize data movement]

  4. Temporal Computing (1D)
  • A program is a sequence of instructions
  • Performance is dominated by:
    • Memory latency
    • ALU availability
  [Timeline diagram (CPU, Memory): Get Inst. 1-3; for each item, Read data, COMP, Write Result; "Actual computation time" marks only the COMP segments]

  5. Spatial Computing (2D)
  • Synchronous data movement
  • Throughput dominated
  [Diagram: data in flows through an array of ALUs with control and buffers to data out; Read data [1..N], Computation and Write results [1..N] overlap along the time axis]

  6. Computing in Time vs Computing in Space
  Computing in Time: 512 controlflow cores, 2 GHz, 10 KB on-chip SRAM, 8 GB on-board DRAM, 1 result every 100* clock cycles (*depending on application!)
  Computing in Space: 10,000* dataflow cores, 200 MHz, 5 MB on-chip SRAM, 96 GB of DRAM per DFE, >10 TB/s, 1 result every clock cycle
  => *200x faster per manycore card
  => *10x less power
  => *10x bigger problems per node
  => *10x fewer nodes needed
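  The 200x figure follows from the numbers on this slide: 512 cores × 2 GHz ÷ 100 cycles per result ≈ 10 × 10^9 results/s for the manycore card, versus 10,000 dataflow cores × 200 MHz × 1 result per clock cycle = 2,000 × 10^9 results/s for the DFE, a ratio of roughly 200.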

  7. OpenSPL in Practice New CME Electronic Trading Gateway will be going live in March 2014! Webinar Page: http://www.cmegroup.com/education/new-ilink-architecture-webinar.html CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago, and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. …. [from Wikipedia]

  8. Maxeler Seismic Imaging Platform
  • Maxeler provides hardware plus application software for seismic modeling.
  • MaxSkins allow access to Ultrafast Modelling and RTM for research and development of RTM and Full Waveform Inversion (FWI) from MATLAB, Python, R, C/C++ and Fortran.
  • Bonus: MaxGenFD is a MaxCompiler plugin that lets the user specify any 3D finite-difference problem, including the PDE, coefficients, boundary conditions, etc., and automatically generates a fully parallelized implementation for a whole rack of Maxeler MPC nodes.
  • Application areas: O&G, weather, 3D PDE solvers, high-energy physics, medical imaging

  9. Example: data flow graph generated by MaxCompiler 4866 static dataflow cores in 1 chip

  10. Mission Impossible?

  11. Computing in Space - Why Now?
  • Semiconductor technology is ready
    • Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5B (Xeon Phi)
  • Memory performance isn't keeping up
    • Memory density has followed the trend set by Moore's law
    • But memory latency has increased from 10s to 100s of CPU clock cycles
    • As a result, the on-die cache share of die area increased from 15% (1 µm) to 40% (32 nm)
    • The memory latency gap could eliminate most of the benefits of CPU improvements
  • Petascale challenges (10^15 FLOPS)
    • Clock frequencies stagnated in the few-GHz range
    • Energy usage and power wastage of modern HPC systems are becoming a huge economic burden that cannot be ignored any longer
    • Requirements for annual performance improvements grow steadily
    • Programmers continue to rely on sequential execution (the 1D approach)
  • For affordable petascale systems => a novel approach is needed

  12. OpenSPL Example: x² + 30
  SCSVar x = io.input("x", scsInt(32));
  SCSVar result = x * x + 30;
  io.output("y", result, scsInt(32));
  [Dataflow graph: inputs x and 30 feed the arithmetic nodes, producing output y]

  13. OpenSPL Example: Moving Average
  Y = (X[n-1] + X[n] + X[n+1]) / 3
  SCSVar x = io.input("x", scsFloat(7,17));
  SCSVar prev = stream.offset(x, -1);
  SCSVar next = stream.offset(x, 1);
  SCSVar sum = prev + x + next;
  SCSVar result = sum / 3;
  io.output("y", result, scsFloat(7,17));
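  The same offset mechanism generalises to wider windows. As a hedged illustration (not from the original slides), a 5-point average written with only the SCSVar, io and stream.offset constructs used above:

  // Hypothetical 5-point moving average in the same style as the slide-13 example
  SCSVar x = io.input("x", scsFloat(7,17));
  // Neighbouring stream elements, two on each side of the current one
  SCSVar prev2 = stream.offset(x, -2);
  SCSVar prev1 = stream.offset(x, -1);
  SCSVar next1 = stream.offset(x, 1);
  SCSVar next2 = stream.offset(x, 2);
  SCSVar sum = prev2 + prev1 + x + next1 + next2;
  io.output("y", sum / 5, scsFloat(7,17));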

  14. OpenSPL Example: Choices
  SCSVar x = io.input("x", scsUInt(24));
  SCSVar result = (x > 10) ? x + 1 : x - 1;
  io.output("y", result, scsUInt(24));
  [Dataflow graph: x feeds a comparison with 10 that selects between x + 1 and x - 1, producing y]
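  The select composes naturally with the streaming constructs of the previous example. A minimal sketch (again not from the original deck) that reuses stream.offset and the ternary form to compute an absolute difference of neighbouring elements without unsigned wrap-around:

  SCSVar x = io.input("x", scsUInt(24));
  SCSVar next = stream.offset(x, 1);
  // Select the larger operand first so the unsigned subtraction never wraps
  SCSVar diff = (next > x) ? next - x : x - next;
  io.output("y", diff, scsUInt(24));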

  15. OpenSPL and MaxAcademy
  17 lectures/exercises: Theory and Practice of Computing in Space
  LECTURE 1: Concepts for Computing in Space
  LECTURE 2: Converting Temporal Code to Graphs
  LECTURE 3: Computing, Storage and Networking
  LECTURE 4: OpenSPL
  LECTURE 5: Dataflow Engines (DFEs)
  LECTURE 6: Programming DFEs (Basics)
  LECTURE 7: Programming DFEs (Advanced)
  LECTURE 8: Programming DFEs (Dynamic and multiple kernels)
  LECTURE 9: Application Case Studies I
  LECTURE 10: Making things go fast
  LECTURE 11: Numerics
  LECTURE 12: Application Case Studies II
  LECTURE 13: System Perspective
  LECTURE 14: Verifying Results
  LECTURE 15: Performance Modelling
  LECTURE 16: Economics of Computing in Space
  LECTURE 17: Summary and Conclusions

  16. Maxeler Dataflow Engine Platforms
  • High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288 GB of RAM
  • The Dataflow Appliance: dense compute with 8 DFEs, 384 GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
  • The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections

  17. Bringing Scalability and Efficiency to the Datacenter

  18. 3000³ Modeling*
  Compared to 32 x 3 GHz x86 cores parallelized using MPI (*presented at SEG 2010):
  8 full Intel racks, ~100 kW => 2 MaxNodes (2U) Maxeler system, <1 kW

  19. Typical Scalability of Sparse Matrix
  Visage (Geomechanics; 2-node Nehalem, 2.93 GHz); Eclipse Benchmark (2-node Westmere, 3.06 GHz)
  [Scalability charts for the two benchmarks]

  20. Sparse Matrix Solving (O. Lindtjorn et al., 2010)
  • Given matrix A and vector b, find vector x in: Ax = b
  • Typically memory bound, not parallelisable
  • 1 MaxNode achieved 20-40x the performance of an x86 node
  • Domain Specific Address and Data Encoding

  21. Global Weather Simulation
  • Atmospheric equations: Shallow Water Equations (SWEs)
  [L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, FPL 2013]

  22. Always double precision needed?
  • Range analysis to track the absolute values of all variables
  [Diagram labels: fixed-point (x3), reduced-precision (x2)]

  23. What about error vs area tradeoffs?
  • Bit-accurate simulations for different bit-width configurations

  24. Accuracy validation • [Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, et al. ‘A Peta-scalable CPU-GPU Algorithm for Global Atmospheric Simulations’, PPoPP’2013]

  25. And there is also a performance gain
  MaxNode speedup over Tianhe node: 14x
  [Chart: speedup vs. mesh size]

  26. And power efficiency too
  MaxNode is 9x more power efficient
  [Chart: power efficiency vs. mesh size]

  27. Weather and climate models on DFEs
  Which one is better? A finer grid and higher precision are obviously preferred, but the computational requirements will increase => power usage => $$.
  What about using reduced precision? (15 bits instead of 64-bit double-precision FP)
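  To make the idea concrete at the OpenSPL level, here is a hedged sketch (not from the slides) of the slide-13 moving-average kernel declared in a narrower floating-point format, assuming the scsFloat(exponent_bits, mantissa_bits) convention of the earlier examples; the 4/11 split of a 15-bit budget is purely illustrative:

  // Only the declared number format changes; the dataflow structure is unchanged
  SCSVar x = io.input("x", scsFloat(4, 11));   // hypothetical 15-bit float
  SCSVar prev = stream.offset(x, -1);
  SCSVar next = stream.offset(x, 1);
  io.output("y", (prev + x + next) / 3, scsFloat(4, 11));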

  28. Weather models precision comparison

  29. What about 15 days of simulation? Surface pressure after 15 days of simulation for the double-precision and the reduced-precision runs (quality of the simulation hardly reduced)

  30. MAX-UP: Astro Chemistry
  [Diagram labels: CPU, DFE]

  31. Does it work? Test problem:
  • 2D linear advection
  • 4th-order Runge-Kutta
  • Regular torus mesh
  • Gaussian bump
  • The bump is advected across the torus mesh
  • After 20 timesteps it should be back where it started
  [Figure: bump at t=20]

  32. CFD Performance
  • For this 2D linear advection test problem we achieve ca. 450M degree-of-freedom updates per second
  • For comparison, a GPU implementation (of a Navier-Stokes solver) achieves ca. 50M DOFs/s
  (Max3A workstation with Xilinx Virtex-6 475T + 4-core i7)

  33. CFD Conclusions
  • You really can do unstructured meshes on a dataflow accelerator
  • You really can max out the DRAM bandwidth
  • You really can get exciting performance
  • You have to work pretty hard
    • Or build on the work of others
  • This was not an acceleration project
    • We designed a generic architecture for a family of problems

  34. We're Hiring. Candidate profiles: Acceleration Architect (UK), Application Engineer (USA), System Administrator (UK), Senior PCB Designer (UK), Hardware Engineer (UK), Networking Engineer (UK), Electronics Technician (UK)
