
PRACE Keynote, Linz


Presentation Transcript


  1. PRACE Keynote, Linz Oskar Mencer, April 2014 Computing in Space

  2. Thinking Fast and Slow (Daniel Kahneman, Nobel Prize in Economics, 2002). 14 × 27 = ? Kahneman splits thinking into: System 1 (fast, hard to control) and System 2 (slow, easier to control). ….. 300 ….. 378
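  For reference, the System 2 route to the exact answer is a short deliberate decomposition:
  14 × 27 = 14 × 20 + 14 × 7 = 280 + 98 = 378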

  3. Assembly-line computing in action [Diagram labels: SYSTEM 1: x86 cores; SYSTEM 2: flexible memory plus logic; Optimal Encoding; Low Latency Memory System; High Throughput Memory; minimize data movement]

  4. Temporal Computing (1D)
  • A program is a sequence of instructions
  • Performance is dominated by:
    • Memory latency
    • ALU availability
  [Timeline diagram (CPU, Memory): Get Inst. 1-3; for each item, Read data, COMP, Write Result; "Actual computation time" marks only the COMP segments]

  5. Spatial Computing (2D)
  • Synchronous data movement
  • Throughput dominated
  [Diagram: data in flows through an array of ALUs with control and buffers to data out; Read data [1..N], Computation and Write results [1..N] overlap along the time axis]

  6. Computing in Time vs Computing in Space
  Computing in Time: 512 controlflow cores, 2 GHz, 10 KB on-chip SRAM, 8 GB on-board DRAM, 1 result every 100* clock cycles (*depending on application!)
  Computing in Space: 10,000* dataflow cores, 200 MHz, 5 MB on-chip SRAM, 96 GB of DRAM per DFE, >10 TB/s, 1 result every clock cycle
  => *200x faster per manycore card
  => *10x less power
  => *10x bigger problems per node
  => *10x fewer nodes needed
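  The 200x figure follows from the numbers on this slide: 512 cores × 2 GHz ÷ 100 cycles per result ≈ 10 × 10^9 results/s for the manycore card, versus 10,000 dataflow cores × 200 MHz × 1 result per clock cycle = 2,000 × 10^9 results/s for the DFE, a ratio of roughly 200.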

  7. OpenSPL in Practice New CME Electronic Trading Gateway will be going live in March 2014! Webinar Page: http://www.cmegroup.com/education/new-ilink-architecture-webinar.html CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago, and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. …. [from Wikipedia]

  8. Maxeler Seismic Imaging Platform
  • Maxeler provides hardware plus application software for seismic modeling.
  • MaxSkins allow access to Ultrafast Modelling and RTM for research and development of RTM and Full Waveform Inversion (FWI) from MATLAB, Python, R, C/C++ and Fortran.
  • Bonus: MaxGenFD is a MaxCompiler plugin that lets the user specify any 3D finite-difference problem, including the PDE, coefficients, boundary conditions, etc., and automatically generates a fully parallelized implementation for a whole rack of Maxeler MPC nodes.
  • Application areas: O&G, weather, 3D PDE solvers, high-energy physics, medical imaging

  9. Example: data flow graph generated by MaxCompiler 4866 static dataflow cores in 1 chip

  10. Mission Impossible?

  11. Computing in Space - Why Now?
  • Semiconductor technology is ready
    • Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5B (Xeon Phi)
  • Memory performance isn't keeping up
    • Memory density has followed the trend set by Moore's law
    • But memory latency has increased from 10s to 100s of CPU clock cycles
    • As a result, the on-die cache share of die area increased from 15% (1 µm) to 40% (32 nm)
    • The memory latency gap could eliminate most of the benefits of CPU improvements
  • Petascale challenges (10^15 FLOPS)
    • Clock frequencies stagnated in the few-GHz range
    • Energy usage and power wastage of modern HPC systems are becoming a huge economic burden that cannot be ignored any longer
    • Requirements for annual performance improvements grow steadily
    • Programmers continue to rely on sequential execution (the 1D approach)
  • For affordable petascale systems => a novel approach is needed

  12. OpenSPL Example: x² + 30
  SCSVar x = io.input("x", scsInt(32));
  SCSVar result = x * x + 30;
  io.output("y", result, scsInt(32));
  [Dataflow graph: inputs x and 30 feed the arithmetic nodes, producing output y]

  13. OpenSPL Example: Moving Average
  Y = (X[n-1] + X[n] + X[n+1]) / 3
  SCSVar x = io.input("x", scsFloat(7,17));
  SCSVar prev = stream.offset(x, -1);
  SCSVar next = stream.offset(x, 1);
  SCSVar sum = prev + x + next;
  SCSVar result = sum / 3;
  io.output("y", result, scsFloat(7,17));
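  The same offset mechanism generalises to wider windows. As a hedged illustration (not from the original slides), a 5-point average written with only the SCSVar, io and stream.offset constructs used above:

  // Hypothetical 5-point moving average in the same style as the slide-13 example
  SCSVar x = io.input("x", scsFloat(7,17));
  // Neighbouring stream elements, two on each side of the current one
  SCSVar prev2 = stream.offset(x, -2);
  SCSVar prev1 = stream.offset(x, -1);
  SCSVar next1 = stream.offset(x, 1);
  SCSVar next2 = stream.offset(x, 2);
  SCSVar sum = prev2 + prev1 + x + next1 + next2;
  io.output("y", sum / 5, scsFloat(7,17));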

  14. OpenSPL Example: Choices
  SCSVar x = io.input("x", scsUInt(24));
  SCSVar result = (x > 10) ? x + 1 : x - 1;
  io.output("y", result, scsUInt(24));
  [Dataflow graph: x feeds a comparison with 10 that selects between x + 1 and x - 1, producing y]
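  The select composes naturally with the streaming constructs of the previous example. A minimal sketch (again not from the original deck) that reuses stream.offset and the ternary form to compute an absolute difference of neighbouring elements without unsigned wrap-around:

  SCSVar x = io.input("x", scsUInt(24));
  SCSVar next = stream.offset(x, 1);
  // Select the larger operand first so the unsigned subtraction never wraps
  SCSVar diff = (next > x) ? next - x : x - next;
  io.output("y", diff, scsUInt(24));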

  15. OpenSPL and MaxAcademy
  17 lectures/exercises: Theory and Practice of Computing in Space
  LECTURE 1: Concepts for Computing in Space
  LECTURE 2: Converting Temporal Code to Graphs
  LECTURE 3: Computing, Storage and Networking
  LECTURE 4: OpenSPL
  LECTURE 5: Dataflow Engines (DFEs)
  LECTURE 6: Programming DFEs (Basics)
  LECTURE 7: Programming DFEs (Advanced)
  LECTURE 8: Programming DFEs (Dynamic and multiple kernels)
  LECTURE 9: Application Case Studies I
  LECTURE 10: Making things go fast
  LECTURE 11: Numerics
  LECTURE 12: Application Case Studies II
  LECTURE 13: System Perspective
  LECTURE 14: Verifying Results
  LECTURE 15: Performance Modelling
  LECTURE 16: Economics of Computing in Space
  LECTURE 17: Summary and Conclusions

  16. Maxeler Dataflow Engine Platforms
  • High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288 GB of RAM
  • The Dataflow Appliance: dense compute with 8 DFEs, 384 GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
  • The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections

  17. Bringing Scalability and Efficiency to the Datacenter

  18. 3000³ Modeling*
  Compared to 32 x 3 GHz x86 cores parallelized using MPI (*presented at SEG 2010):
  8 full Intel racks, ~100 kW => 2 MaxNodes (2U) Maxeler system, <1 kW

  19. Typical Scalability of Sparse Matrix
  Visage (Geomechanics; 2-node Nehalem, 2.93 GHz); Eclipse Benchmark (2-node Westmere, 3.06 GHz)
  [Scalability charts for the two benchmarks]

  20. Sparse Matrix Solving (O. Lindtjorn et al., 2010)
  • Given matrix A and vector b, find vector x in: Ax = b
  • Typically memory bound, not parallelisable
  • 1 MaxNode achieved 20-40x the performance of an x86 node
  • Domain Specific Address and Data Encoding

  21. Global Weather Simulation
  • Atmospheric equations: Shallow Water Equations (SWEs)
  [L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, FPL 2013]

  22. Always double precision needed?
  • Range analysis to track the absolute values of all variables
  [Diagram labels: fixed-point (x3), reduced-precision (x2)]

  23. What about error vs area tradeoffs?
  • Bit-accurate simulations for different bit-width configurations

  24. Accuracy validation • [Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, et al. ‘A Peta-scalable CPU-GPU Algorithm for Global Atmospheric Simulations’, PPoPP’2013]

  25. And there is also a performance gain
  MaxNode speedup over Tianhe node: 14x
  [Chart: speedup vs. mesh size]

  26. And power efficiency too
  MaxNode is 9x more power efficient
  [Chart: power efficiency vs. mesh size]

  27. Weather and climate models on DFEs
  Which one is better? A finer grid and higher precision are obviously preferred, but the computational requirements will increase => power usage => $$.
  What about using reduced precision? (15 bits instead of 64-bit double-precision FP)
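  To make the idea concrete at the OpenSPL level, here is a hedged sketch (not from the slides) of the slide-13 moving-average kernel declared in a narrower floating-point format, assuming the scsFloat(exponent_bits, mantissa_bits) convention of the earlier examples; the 4/11 split of a 15-bit budget is purely illustrative:

  // Only the declared number format changes; the dataflow structure is unchanged
  SCSVar x = io.input("x", scsFloat(4, 11));   // hypothetical 15-bit float
  SCSVar prev = stream.offset(x, -1);
  SCSVar next = stream.offset(x, 1);
  io.output("y", (prev + x + next) / 3, scsFloat(4, 11));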

  28. Weather models precision comparison

  29. What about 15 days of simulation? Surface pressure after 15 days of simulation for the double-precision and the reduced-precision runs (quality of the simulation hardly reduced)

  30. MAX-UP: Astro Chemistry
  [Diagram labels: CPU, DFE]

  31. Does it work? Test problem:
  • 2D linear advection
  • 4th-order Runge-Kutta
  • Regular torus mesh
  • Gaussian bump
  • The bump is advected across the torus mesh
  • After 20 timesteps it should be back where it started
  [Figure: bump at t=20]

  32. CFD Performance
  • For this 2D linear advection test problem we achieve ca. 450M degree-of-freedom updates per second
  • For comparison, a GPU implementation (of a Navier-Stokes solver) achieves ca. 50M DOFs/s
  (Max3A workstation with Xilinx Virtex-6 475T + 4-core i7)

  33. CFD Conclusions
  • You really can do unstructured meshes on a dataflow accelerator
  • You really can max out the DRAM bandwidth
  • You really can get exciting performance
  • You have to work pretty hard
    • Or build on the work of others
  • This was not an acceleration project
    • We designed a generic architecture for a family of problems

  34. We're Hiring. Candidate profiles: Acceleration Architect (UK), Application Engineer (USA), System Administrator (UK), Senior PCB Designer (UK), Hardware Engineer (UK), Networking Engineer (UK), Electronics Technician (UK)
