
New HPC architectures landscape and impact on code developments

This article discusses the impact of new HPC architectures on code development for materials science simulations, with a focus on preparing codes for future exascale systems. It presents a pragmatic approach to the challenges of exascale computing and highlights the expected outcomes and impact of these efforts.


Presentation Transcript


  1. New HPC architectures landscape and impact on code developments. Carlo Cavazzoni, Cineca & MaX

  2. Enabling Exascale Transition • GOAL: "modernize" community codes and make them ready to best exploit future exascale systems for material science simulations (MaX Obj. 1.4) • CHALLENGE: there is not yet a solution that fits all needs, and this is common to all computational science domains • STRATEGY: a pragmatic approach based on building knowledge about exascale-related problems, running proofs of concept to field-test solutions, and finally deriving best practices that can consolidate into real solutions for the full applications and make their way into the public code releases • OUTCOME: new, validated code versions; libraries and modules publicly available beyond MaX; extensive dissemination activities • IMPACT: modern codes, exploitation of today's HPC systems, benefits for other application fields as well as for technology providers

  3. Changes in the road-map to Exascale: while Intel's Data Center Group GM Trish Damkroger was describing the company's exascale strategy and the topics they were presenting at the SC17 conference, she offhandedly mentioned that the Knights Hill product is dead. More specifically, she said that the chip will be replaced in favor of "a new platform and new microarchitecture specifically designed for exascale."

  4. Specialized cores

  5. Exascale: how serious is the situation? Working hypothesis: peak performance of 10^18 Flops = number of FPUs x FPU performance, i.e. about 10^9 FPUs each delivering about 10^9 Flops (Moore's law keeps increasing the FPU count, while the end of Dennard scaling caps the per-FPU performance). Possible layouts: 10^5 FPUs in 10^4 servers, or 10^4 FPUs in 10^5 servers. Exascale architectures will be heterogeneous.
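
The working hypothesis above is simple arithmetic; spelled out with the slide's own round numbers (the node-layout labels in the second line are just a reading of the two options sketched on the slide):

```latex
% Exascale working hypothesis, restated
P_{\mathrm{peak}} = N_{\mathrm{FPU}} \times P_{\mathrm{FPU}}
  \;\Longrightarrow\;
  10^{18}~\mathrm{Flop/s} \approx 10^{9}~\mathrm{FPUs} \times 10^{9}~\mathrm{Flop/s\ per\ FPU}

10^{9}~\mathrm{FPUs} =
  10^{4}~\mathrm{servers} \times 10^{5}~\mathrm{FPUs/server}
  \ \text{(fat, accelerated nodes)}
  \quad\text{or}\quad
  10^{5}~\mathrm{servers} \times 10^{4}~\mathrm{FPUs/server}
  \ \text{(many thinner nodes)}
```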

  6. General Considerations • Exascale is not (only) about scalability and Flops performance! • In an exascale machine there will be ~10^9 FPUs; bringing data in and out will be the main challenge • ~10^4 nodes, but ~10^5 FPUs inside each node! • There is no silver bullet (so far) • Heterogeneity is here to stay • Deeper memory hierarchies

  7. Exascale… some guesses • From GPUs to specialized cores (tensor cores) • Specialized memory modules (HBM) • Specialized non-volatile memory (NVRAM) • Performance modelling • Refactor code to better fit architectures with specialized HW • Avoid WRONG TURNS!

  8. Paradigm and co-design: identify the latency and throughput parts of the application workflow (at the subroutine/module/class level), re-factor them into latency code and throughput code, and map them to the heterogeneous HW (latency code on the host, throughput code on the KNL/accelerator); then map and overlap the communication between the two.
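
A minimal sketch of the "map and overlap communication" idea, assuming a generic MPI code; the kernels compute_interior and compute_boundary are hypothetical placeholders, only the MPI calls are standard API:

```c
/* Sketch: overlap the halo exchange (latency part, on the host) with the bulk
 * computation (throughput part, on the many-core/accelerator side).          */
#include <mpi.h>

void compute_interior(void);   /* placeholder: work independent of the halo   */
void compute_boundary(void);   /* placeholder: work that needs the received halo */

void step(double *halo_send, double *halo_recv, int n_halo,
          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post the non-blocking halo exchange first. */
    MPI_Irecv(halo_recv, n_halo, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(halo_send, n_halo, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* Throughput work proceeds while messages are in flight. */
    compute_interior();

    /* Complete communication before touching halo-dependent data. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    compute_boundary();
}
```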

  9. MaX Activities: programming paradigms, libraries, co-design • Profiling: hot spots, performance issues, bottlenecks, Flops and Watts efficiency • DSLs, kernel libraries, modules, performance models • New architectures and vendors • New MPI and OpenMP standards, new paradigms (OmpSs, CUDA) • New, more efficient code versions • Libraries/modules shared among codes, community and vendors • Feedback to scientists/developers (WP1) • Dissemination of best practices, schools and workshops • Collaborations, e.g. CoE/FET/PRACE

  10. Performance modelling, results: absolute time estimates for MnSi bulk, 64 atoms, 14 k-points.

  11. CoE questions: what is my target architecture? How can I cope with GPUs, many-cores, FPGAs? I like homogeneous architectures, why should I care about heterogeneous ones? Heterogeneity is here to stay!!! The answer: DSLs and kernel libraries, modularization, APIs, encapsulation, separation of concerns. DSLs: Sirius, CheSS (SIESTA), SDDK, FFTXlib (QE, YAMBO), LAXlib (QE, ~YAMBO), FLEUR-LA (FLEUR). Kernel libraries: ELPA (QE, YAMBO, FLEUR).
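
To make "modularization, API, encapsulation, separation of concern" concrete, here is a minimal interface sketch; the names are hypothetical and do not reproduce the actual FFTXlib, LAXlib or ELPA APIs:

```c
/* Sketch of a kernel-library API that hides the heterogeneous backend.       */
#include <stddef.h>
#include <complex.h>

typedef enum { KLIB_BACKEND_CPU, KLIB_BACKEND_GPU, KLIB_BACKEND_AUTO } klib_backend_t;

typedef struct klib_fft_plan klib_fft_plan;   /* opaque handle: encapsulation  */

/* The application code sees only these calls (separation of concerns); the
 * library decides internally whether the transform runs on host cores, a GPU
 * or another accelerator, so the community code stays architecture-agnostic. */
klib_fft_plan *klib_fft_create(const size_t n[3], klib_backend_t backend);
void           klib_fft_forward(klib_fft_plan *plan, double complex *data);
void           klib_fft_destroy(klib_fft_plan *plan);
```

A code built on such an interface can be retargeted to a new architecture by adding a backend inside the library, without touching the physics layers above it.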

  12. One size does not fit all: there are 10^9 FPUs to leverage, and the best algorithm for 1 FPU is not the best algorithm for 10^9 FPUs. Implement the best algorithm for each scale, e.g. the two FFT and data-distribution schemes in QE 6.2. Autotuning: choose the best one at runtime (a minimal sketch follows).
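
The "choose the best at runtime" bullet can be as simple as timing the candidate implementations once and keeping the winner; a minimal autotuning sketch with placeholder kernels (not the actual QE 6.2 mechanism):

```c
/* Sketch: runtime autotuning between two algorithm variants. */
#include <mpi.h>

void fft_variant_a(void);   /* placeholder, e.g. one data-distribution scheme */
void fft_variant_b(void);   /* placeholder, e.g. the alternative scheme       */

typedef void (*kernel_fn)(void);

kernel_fn autotune(void)
{
    kernel_fn candidates[2] = { fft_variant_a, fft_variant_b };
    kernel_fn best = candidates[0];
    double best_time = 1e30;

    for (int i = 0; i < 2; ++i) {
        double t0 = MPI_Wtime();
        candidates[i]();                 /* one measured call per variant */
        double t = MPI_Wtime() - t0;
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    return best;                         /* reuse the winner for the rest of the run */
}
```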

  13. Beyond Modularization: QE libraries (FFTXlib, LAXlib) are shared with other codes and exercised through mini-apps.

  14. Scaling-out YAMBO: a single GW calculation ran on 1000 Intel Knights Landing (KNL) nodes of the new Tier-0 MARCONI KNL partition, corresponding to 68000 cores and ~3 PFlop/s. The simulation, related to the growth of complex graphene nanoribbons on a metal surface, is part of an active research project combining computational spectroscopy with cutting-edge experimental data from teams in Austria, Italy, and Switzerland. The simulations were performed exploiting computational resources granted by PRACE (via call 14). http://www.max-centre.eu/2017/04/19/a-new-scalability-record-in-a-materials-science-application/

  15. Planning for Exascale: performance model, co-design, proof-of-concept code re-factoring. Today: Yambo @ 3 PFlops, socket performance 3-5 TFlops. Pre-exascale: Yambo @ 10-15 PFlops, socket performance 10-15 TFlops. Exascale: Yambo @ 50 PFlops, socket performance 20-40 TFlops.

  16. Marconi convergent HPC solution • Cloud/Data Processing Scale-Out: 792 Lenovo NeXtScale servers, Intel E5-2697 v4 Broadwell (216 Ethernet nodes for the INFN HT cloud, 216 Ethernet nodes for the HPC/DP cloud, 360 QDR nodes for Tier-1 HPC) • 100 Lenovo NeXtScale servers, Intel E5-2630 v3 Haswell, QDR + Nvidia K80 • 2300 Lenovo Stark servers, > 7 PFlops, Intel Skylake, 24 cores @ 2.1 GHz, 196 GByte per node • 3600 Intel/Lenovo servers, > 11 PFlops, Intel Xeon Phi (code name Knights Landing), 68 cores @ 1.4 GHz, single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM • 720 Lenovo NeXtScale servers, Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz, 128 GByte per node • Storage: Lenovo GSS + SFA12K + IBM Flash, > 30 PByte

  17. Cineca "sustainable" roadmap toward exascale (roadmap chart): 100 TF @ 1 MW; 2 PF @ 1 MW (1x latency cores); 11 PF + 9 PF (latency cores) @ 3.5 MW (2x latency cores); 50 PF + 10 PF @ ~4 MW; > 250 PF + > 20 PF @ ~8 MW (pre-exascale), with roughly 5x-20x capability steps between generations and a paradigm change required along the way.

  18. What does 5x really mean? Peak performance? Linpack? HPCG? Time to solution? Energy to solution? Time to science? A combination of all of the above? We need to define the right metric!

  19. The data centers at the Science Park. ECMWF DC main characteristics: • 2 power lines up to 10 MW (one a backup of the other) • Expansion to 20 MW • Photovoltaic cells on the roofs (500 MWh/year) • Redundancy N+1 (mechanical and electrical) • 5 x 2 MW DRUPS • Cooling: 4 dry coolers (1850 kW each), 4 groundwater wells, 5 refrigerator units (1400 kW each) • Peak PUE 1.35 / maximum annualized PUE 1.18. Site layout: electrical substation (HV/MV), outdoor chillers and mechanical plants, diesel generators, ECMWF plants, ECMWF DC 1, ECMWF DC 2, INFN DC, CINECA DC, ECMWF expansion. INFN-CINECA DC main characteristics: • up to 20 MW • Possible use of Combined Heat and Power Fuel Cell technology • Redundancy strategy under study • Cooling still under study: dry coolers, groundwater wells, refrigerator units • PUE < 1.2-1.3. Plant rooms: outdoor chillers, electrical plant rooms, DRUPS rooms, mechanical plant rooms, POP 1 + POP 2, switch rooms, gas storage rooms, general utilities.

  20. HPC and Verticals: value delivered to users. From the HW infrastructure (clusters, storage, network, devices) up through accelerated computing, co-design, big data, 3D visualization, cloud services and AI, to applications integration (Meteo, Astro, Materials, Visit, Repo, Eng., Analytics, etc.): toward an end-to-end optimized infrastructure.

  21. Thank you!

  22. Backup slides

  23. Exascale “node”, according to Intel https://www.hpcwire.com/2018/01/25/hpc-ai-two-communities-future/

  24. Memory! Analysis of memory allocation during an SCF cycle, and of memory bandwidth usage on the different types of memory. Critical behaviour: the code is slowed down; a better memory access pattern and better communications are needed.

  25. Exascale system. Al Gara's vision for the unification of the "3 Pillars" of HPC currently underway: "The convergence of AI, data analytics and traditional simulation will result in systems with broader capabilities and configurability as well as cross pollination."

  26. RMA MPI, Intel 2017 • RMA as a substitute for Alltoall • Source code and data shared with Intel • Benchmark on BDW (Broadwell), 36 MPI tasks
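
As an illustration of the idea on this slide (not the code actually shared with Intel), an all-to-all exchange can be rewritten with standard MPI one-sided (RMA) calls, letting each rank put its blocks directly into the receive buffers of the others:

```c
/* Sketch: one-sided replacement of MPI_Alltoall using MPI_Put into a window. */
#include <mpi.h>
#include <stddef.h>

void alltoall_rma(const double *sendbuf, double *recvbuf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Expose the receive buffer of every rank as an RMA window. */
    MPI_Win win;
    MPI_Win_create(recvbuf, (MPI_Aint)size * count * sizeof(double),
                   sizeof(double), MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                       /* open the access epoch  */
    for (int dest = 0; dest < size; ++dest) {
        /* Write my block for 'dest' into slot 'rank' of its window. */
        MPI_Put(&sendbuf[(size_t)dest * count], count, MPI_DOUBLE,
                dest, (MPI_Aint)rank * count, count, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                       /* complete all puts      */

    MPI_Win_free(&win);
}
```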

  27. QE: Linear Algebra on KNL with OpenMP
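
The slide itself shows a performance chart; as a generic reminder of what OpenMP-threaded linear algebra looks like on a many-core node such as KNL, here is an illustrative (not QE-specific) matrix-multiply kernel; production codes would normally call a threaded BLAS dgemm instead:

```c
/* Generic OpenMP-threaded dense matrix multiply, C = A * B (row-major). */
#include <stddef.h>

void matmul(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
}
```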

  28. The CINECA-INFN Plan

  29. Exascale: how serious is the situation? Working hypothesis: peak performance of 10^18 Flops = number of FPUs x FPU performance, i.e. about 10^9 FPUs each delivering about 10^9 Flops (Moore's law vs. the end of Dennard scaling). Possible layouts: 10^5 FPUs in 10^4 servers, or 10^4 FPUs in 10^5 servers. Exascale architectures will be heterogeneous.

  30. Exascale… some guesses • From GPUs to specialized cores (tensor cores) • Specialized memory modules (HBM) • Specialized non-volatile memory (NVRAM) • Performance modelling • Refactor code to better fit architectures with specialized HW

  31. Bologna Big Data Science Park: Civil Protection (Protezione Civile) and the regional agency for development and innovation; CINECA & INFN exascale supercomputing center; ECMWF data center; conference and education center; Big Data Foundation; national meteorological agency (Agenzia Nazionale Meteo); university centers; «Ballette innovation and creativity center»; IOR biobank; ENEA center; Industry 4.0 competence center.

  32. New Cineca HPC infrastructure design point (block diagram): D.A.V.I.D.E. (prototype); Marconi A1-A4 on OPA; GSS storage on OPA; login and gateway nodes; PPI4HPC & EuroHPC; Internet; HBP, Eurofusion, PRACE/EUDAT, CNAF; visualization; ETH core + Mellanox gateway (IB + 25/100 Gbit Ethernet); tape library and fibre switches; ex-PICO 5100 NFS servers; cloud, big data, AI, interactive and data-processing cluster on Mellanox FDR; FEC servers; TMS.

  33. Al Gara (Intel): the same architecture will cover HPC, AI, and data analytics through configuration, which means there needs to be a consistent software story across these different hardware backends to address HPC plus AI workloads.

  34. Exascale system. Al Gara's vision for the unification of the "3 Pillars" of HPC currently underway: "The convergence of AI, data analytics and traditional simulation will result in systems with broader capabilities and configurability as well as cross pollination."

  35. Exascale “node”, according to Intel https://www.hpcwire.com/2018/01/25/hpc-ai-two-communities-future/

  36. D.A.V.I.D.E. "Intelligenza Artificiale: Dall'Università alle Aziende" (Artificial Intelligence: from universities to companies), Bologna. http://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-datasheet.pdf

  37. HPC and Verticals: value delivered to users. From the HW infrastructure (clusters, storage, network, devices) and its procurement, up through accelerated computing, co-design, big data, AI, 3D visualization, high throughput and connectors to other infrastructures, to applications integration (Meteo, Astro, Materials, Visit, Repo, Eng., Analytics, etc.): toward an end-to-end optimized infrastructure.

  38. Power projection: peak performance (DP) @ 10 MW.

  39. Technical Project: goal of the procurement • New PRACE Tier-0 system • Target: 5x increase of system capability • Maximize efficiency (capability/W) • Sustain production for at least 3 years • Integrated into the current infrastructure • Possibly hosted in the same data center as ECMWF
