PGI Compilers & Tools Update - March 2018

Check out the March 2018 PGI Compilers & Tools update for news about the HPC SDK, OpenACC, CUDA, and more.

Presentation Transcript

    PGI COMPILERS & TOOLS UPDATE

    PGI Compilers for Heterogeneous Supercomputing, March 2018

PGI - THE NVIDIA HPC SDK

Fortran, C & C++ Compilers
• Optimizing, SIMD Vectorizing, OpenMP

Accelerated Computing Features
• OpenACC Directives
• CUDA Fortran

Multi-Platform Solution
• Multicore x86-64 and OpenPOWER CPUs, NVIDIA Tesla GPUs
• Supported on Linux, macOS, Windows

MPI/OpenMP/OpenACC Tools
• Debugger
• Performance Profiler
• Interoperable with DDT, TotalView

OPENACC FOR EVERYONE

PGI Community Edition Now Available: FREE

                    Community          Professional       Enterprise
PROGRAMMING MODELS  OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran compilers and tools (all editions)
PLATFORMS           x86, OpenPOWER, NVIDIA GPU (all editions)
UPDATES             1-2 times a year   6-9 times a year   6-9 times a year
SUPPORT             User Forums        PGI Support        PGI Premier Services
LICENSE             Annual             Perpetual          Volume/Site

LATEST CPU SUPPORT

• Intel Skylake
• AMD Zen
• IBM POWER9

• Full OpenACC 2.6
• OpenMP 4.5 for multicore CPUs
• AVX-512 code generation
• Integrated CUDA 9.1 toolkit/libraries
• New fastmath intrinsics library
• Partial C++17 support
• Optional LLVM-based x86 code generator

pgicompilers.com/whats-new

SPEC ACCEL 1.2 BENCHMARKS

[Chart: geometric mean run time in seconds (lower is better) for the SPEC ACCEL 1.2 OpenACC and OpenMP 4.5 suites, Intel 2018 vs PGI 18.1, on 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), and 2-socket Broadwell (40 cores / 80 threads); PGI 18.1 OpenACC on a single Volta V100 shows a 4.4x speed-up over 2-socket Broadwell.]

Performance measured February, 2018. Skylake: Two 20 core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: Two 24 core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: Two 20 core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. Volta: NVIDIA DGX1 system with two 20 core Intel Xeon E5-2698 v4 CPUs @ 2.20GHz, 256GB memory, one NVIDIA Tesla V100-SXM2-16GB GPU @ 1.53GHz. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).

SPEC CPU 2017 FP SPEED BENCHMARKS

[Chart: geometric mean run time in seconds (lower is better), Intel 2018 vs PGI 18.1, on 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), and 2-socket Broadwell (40 cores / 80 threads).]

Performance measured February, 2018. Skylake: Two 20 core Intel Xeon Gold 6148 CPUs @ 2.4GHz w/ 376GB memory, hyperthreading enabled. EPYC: Two 24 core AMD EPYC 7451 CPUs @ 2.3GHz w/ 256GB memory. Broadwell: Two 20 core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz w/ 256GB memory, hyperthreading enabled. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).

OPENACC DIRECTIVES

• Incremental
• Single source base
• Interoperable
• Performance portable
• CPU, GPU, Manycore

#pragma acc data copyin(a,b) copyout(c)   // Manage Data Movement
{
  ...
  #pragma acc parallel                    // Initiate Parallel Execution
  {
    #pragma acc loop gang vector          // Optimize Loop Mappings
    for (i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
      ...
    }
  }
  ...
}
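For reference, here is a minimal self-contained Fortran sketch of the same pattern (a hypothetical vector add, not code from the deck), exercising all three directive roles: managing data movement, initiating parallel execution, and mapping the loop. Compiling the identical source with -ta=multicore instead of -ta=tesla retargets it to a multicore CPU.

! Hypothetical vector-add sketch, not from the original slides.
! Build: pgfortran -ta=tesla -Minfo=acc vecadd.f90     (GPU)
!        pgfortran -ta=multicore -Minfo=acc vecadd.f90 (multicore CPU)
program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  integer :: i

  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0

  ! Manage data movement: copy a and b to the device, copy c back
  !$acc data copyin(a,b) copyout(c)
  ! Initiate parallel execution; map iterations across gangs and vector lanes
  !$acc parallel loop gang vector
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$acc end data

  print *, 'c(1) =', c(1)   ! expect 3.0
  deallocate(a, b, c)
end program vecadd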

OPENACC IS FOR MULTICORE CPUS & GPUS

 98 !$ACC KERNELS
 99 !$ACC LOOP INDEPENDENT
100 DO k=y_min-depth,y_max+depth
101 !$ACC LOOP INDEPENDENT
102   DO j=1,depth
103     density0(x_min-j,k)=left_density0(left_xmax+1-j,k)
104   ENDDO
105 ENDDO
106 !$ACC END KERNELS

CPU:
% pgfortran -ta=multicore -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
     Generating Multicore code
     100, !$acc loop gang
102, Loop is parallelizable

GPU:
% pgfortran -ta=tesla -fast -Minfo=acc -c update_tile_halo_kernel.f90
. . .
100, Loop is parallelizable
102, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
     100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
     102, !$acc loop gang, vector(32) ! blockidx%x threadidx%x

CLOVERLEAF

AWE Hydrodynamics mini-App, bm32 data set
http://uk-mac.github.io/CloverLeaf

[Chart: speedup vs a single Haswell core, PGI 18.1 OpenACC vs Intel 2018 OpenMP: 7.6x / 7.9x on multicore Haswell, 10x / 10x on multicore Broadwell, 14.8x / 15x on multicore Skylake, 11x on Kepler, 40x / 67x / 109x on Pascal (1x, 2x, 4x GPUs), and 142x on Volta V100.]

Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10). Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01); Broadwell server, eight V100s (dgx07). Skylake: 2x20 core Xeon Gold server (sky-4).
Compilers: Intel 2018.0.128, PGI 18.1
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC)
Data compiled by PGI February 2018.


OPENACC UPTAKE IN HPC

Applications
• 3 of Top 5 HPC Apps: ANSYS Fluent & Gaussian released; VASP in development
• 5 ORNL CAAR Codes: GTC, XGC, ACME, FLASH, LSDalton
• 94 codes GPU accelerated to date
• 109 apps total being tracked

Hackathons
• 5 events in 2017; all hackathons are initiated by users
• 6-9 events in 2018
• Expertise: 94 mentors registered

Training
• OpenACC and DLI: 10 new modules; instructor certification
• 18+ workshops in 2018: ECMWF, KAUST, PSC, CESGA
• Online courses: 5K+ attended over last 3 years
• Online labs: 4.3K+ taken over last 3 years

Community
• User Group: SC17 participation up 27% vs SC16
• Slack channel: 2x growth in last 6 months
• Downloads: PGI Community Edition quarterly downloads up 136% in 2017

Parallelization Strategy

Within Gaussian 16, GPUs are used for a small fraction of code that consumes a large fraction of the execution time. The implementation of GPU parallelism conforms to Gaussian's general parallelization strategy. Its main tenets are to avoid changing the underlying source code and to avoid modifications which negatively affect CPU performance. For these reasons, OpenACC was used for GPU parallelization.

PGI Accelerator Compilers with OpenACC

PGI compilers fully support the current OpenACC standard as well as important extensions to it. PGI is an important contributor to the ongoing development of OpenACC.

OpenACC enables developers to implement GPU parallelism by adding compiler directives to their source code, often eliminating the need for rewriting or restructuring. For example, the following Fortran compiler directive identifies a loop which the compiler should parallelize:

!$acc parallel loop

Other directives allocate GPU memory, copy data to/from GPUs, specify data to remain on the GPU, combine or split loops and other code sections, and generally provide hints for optimal work distribution management, and more.

The OpenACC project is very active, and the specifications and tools are changing fairly rapidly. This has been true throughout the lifetime of this project. Indeed, one of its major challenges has been using OpenACC in the midst of its development. The talented people at PGI were instrumental in addressing issues that arose in one of the very first uses of OpenACC for a large commercial software package.

The Gaussian approach to parallelization relies on environment-specific parallelization frameworks and tools: OpenMP for shared-memory, Linda for cluster and network parallelization across discrete nodes, and OpenACC for GPUs.

The process of implementing GPU support involved many different aspects:

• Identifying places where GPUs could be beneficial. These are a subset of areas which are parallelized for other execution contexts, because using GPUs requires fine-grained parallelism.

• Understanding and optimizing data movement/storage at a high level to maximize GPU efficiency.

PGI's sophisticated profiling and performance evaluation tools were vital to the success of the effort.

Specifying GPUs to Gaussian 16

The GPU implementation in Gaussian 16 is sophisticated and complex, but using it is simple and straightforward. GPUs are specified with one additional Link 0 command (or the equivalent Default.Route file entry or command line option). For example, the following commands tell Gaussian to run the calculation using 24 compute cores plus 8 GPUs + 8 controlling cores (32 cores total):

%CPU=0-31         Request 32 CPUs for the calculation: 24 cores for computation, and 8 cores to control GPUs (see below).
%GPUCPU=0-7=0-7   Use GPUs 0-7 with CPUs 0-7 as their controllers.

Detailed information is available on our website.

GAUSSIAN 16

"Using OpenACC allowed us to continue development of our fundamental algorithms and software capabilities simultaneously with the GPU-related work. In the end, we could use the same code base for SMP, cluster/network and GPU parallelism. PGI's compilers were essential to the success of our efforts."

Mike Frisch, Ph.D.
President and CEO
Gaussian, Inc.

Project Contributors: Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian), Brent Leback (NVIDIA/PGI), Giovanni Scalmani (Gaussian)

Gaussian, Inc.
340 Quinnipiac St. Bldg. 40
Wallingford, CT 06492 USA
custserv@gaussian.com

Gaussian is a registered trademark of Gaussian, Inc. All other trademarks and registered trademarks are the properties of their respective holders. Specifications subject to change without notice. Copyright © 2017, Gaussian, Inc. All rights reserved.

ANSYS FLUENT

"We've effectively used OpenACC for heterogeneous computing in ANSYS Fluent with impressive performance. We're now applying this work to more of our models and new platforms."

Sunil Sathe
Lead Software Developer
ANSYS Fluent

VASP

"For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts. We're excited to collaborate with NVIDIA and PGI as an early adopter of CUDA Unified Memory."

Prof. Georg Kresse
Computational Materials Physics
University of Vienna

MPAS-A

"Our team has been evaluating OpenACC as a pathway to performance portability for the Model for Prediction Across Scales (MPAS) atmospheric model. Using this approach on the MPAS dynamical core, we have achieved performance on a single P100 GPU equivalent to 2.7 dual socketed Intel Xeon nodes on our new Cheyenne supercomputer."

Richard Loft
Director, Technology Development
NCAR

Image courtesy: NCAR

NUMECA FINE/Open

"Porting our unstructured C++ CFD solver FINE/Open to GPUs using OpenACC would have been impossible two or three years ago, but OpenACC has developed enough that we're now getting some really good results."

David Gutzwiller
Lead Software Developer
NUMECA

COSMO

"OpenACC made it practical to develop for GPU-based hardware while retaining a single source for almost all the COSMO physics code."

Dr. Oliver Fuhrer
Senior Scientist
MeteoSwiss

GAMERA FOR GPU

"With OpenACC and a compute node based on NVIDIA's Tesla P100 GPU, we achieved more than a 14X speed-up over a K computer node running our earthquake disaster simulation code."

Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Lalith Wijerathne
The University of Tokyo

Map courtesy University of Tokyo

QUANTUM ESPRESSO

"CUDA Fortran gives us the full performance potential of the CUDA programming model and NVIDIA GPUs. !$CUF KERNELS directives give us productivity and source code maintainability. It's the best of both worlds."

Filippo Spiga
Head of Research Software Engineering
University of Cambridge
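To illustrate the directives the quote refers to, here is a minimal CUDA Fortran sketch of a !$cuf kernel do directive (a hypothetical saxpy loop, not code from QUANTUM ESPRESSO); the directive asks the compiler to generate and launch a GPU kernel for the loop that follows.

! Hypothetical CUDA Fortran sketch of a !$cuf kernel do directive.
! Build: pgfortran cufsaxpy.cuf   (the .cuf suffix enables CUDA Fortran)
module saxpy_mod
contains
  subroutine saxpy(n, a, x, y)
    use cudafor
    integer, intent(in) :: n
    real, intent(in) :: a
    real, device :: x(n), y(n)   ! device-resident arrays
    integer :: i
    ! Generate and launch a GPU kernel for this loop; <<<*,*>>> lets
    ! the compiler choose the launch configuration
    !$cuf kernel do <<<*,*>>>
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do
  end subroutine saxpy
end module saxpy_mod

program test_saxpy
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 1024
  real, device :: x_d(n), y_d(n)
  real :: y(n)
  x_d = 1.0                     ! host-to-device assignment
  y_d = 2.0
  call saxpy(n, 3.0, x_d, y_d)
  y = y_d                       ! copy result back to the host
  print *, 'y(1) =', y(1)       ! expect 5.0
end program test_saxpy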

Programming GPU-Accelerated Systems

CUDA Unified Memory for Dynamically Allocated Data

[Diagram: in the conventional GPU developer view, System Memory and GPU Memory are separate, connected over PCIe; with CUDA Unified Memory, the developer sees a single Unified Memory space.]

PGI OpenACC and CUDA Unified Memory

Compiling with the -ta=tesla:managed option

#pragma acc data copyin(a,b) copyout(c)
{
  ...
  #pragma acc parallel
  {
    #pragma acc loop gang vector
    for (i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
      ...
    }
  }
  ...
}

C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory

PGI OpenACC and CUDA Unified Memory

Compiling with the -ta=tesla:managed option

With allocations automatically managed in CUDA Unified Memory, the explicit data directives from the previous slide can be omitted:

...
#pragma acc parallel
{
  #pragma acc loop gang vector
  for (i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
    ...
  }
}
...

C malloc, C++ new, Fortran allocate all mapped to CUDA Unified Memory
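A minimal Fortran sketch of the same idea (a hypothetical example, not from the deck): when compiled with -ta=tesla:managed, the allocatable arrays land in CUDA Unified Memory and migrate between CPU and GPU on demand, so no data construct is needed.

! Hypothetical sketch: OpenACC with CUDA Unified Memory.
! Build: pgfortran -ta=tesla:managed -Minfo=acc vecadd_managed.f90
! Fortran allocate is mapped to managed memory, so a, b, and c are
! accessible from both host and device without data directives.
program vecadd_managed
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  integer :: i

  allocate(a(n), b(n), c(n))   ! placed in CUDA Unified Memory
  a = 1.0
  b = 2.0

  !$acc parallel loop gang vector
  do i = 1, n
     c(i) = a(i) + b(i)
  end do

  print *, 'c(1) =', c(1)   ! expect 3.0
  deallocate(a, b, c)
end program vecadd_managed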

GTC: An OpenACC Production Application

Being ported for runs on the ORNL Summit supercomputer

The gyrokinetic toroidal code (GTC) is a massively parallel, particle-in-cell production code for turbulence simulation in support of the burning plasma experiment ITER, the crucial next step in the quest for fusion energy.

http://phoenix.ps.uci.edu/gtc_group

GTC Performance using OpenACC

OpenPOWER | NVLink | Unified Memory | P100 | V100

[Chart: speedup vs a 20-core POWER8 node. P8+2xP100: 5.9X with Unified Memory, 6.1X with data directives; P8+4xP100: 12X with Unified Memory, 12.1X with data directives; x64+4xV100 with data directives: 16.5X.]

P8: IBM POWER8NVL, 2 sockets, 20 cores, NVLINK
UM: no data directives in sources, compiled with -ta=tesla:managed

OPENACC RESOURCES

Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
Compilers and Tools: https://www.openacc.org/tools
Events: https://www.openacc.org/events

Support Options
• OpenACC Slack: www.openacc.org/community#slack
• PGI User Forums: pgicompilers.com/userforum
• Stack Overflow: stackoverflow.com/questions/tagged/openacc