
GPU based cloud computing

Dairsie Latimer, Petapath, UK

Petapath



About Petapath

Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets

Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme

The system is a testbed for new ideas in the usability, scalability and efficiency of large computer installations

Active in exploiting emerging standards for acceleration technologies; we are members of the Khronos Group and sit on the OpenCL working committee

We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems




What is Heterogeneous or GPU Computing?

[Diagram: computing with CPU + GPU, an x86 host and a GPU connected by the PCIe bus (heterogeneous computing)]



Low Latency or High Throughput?

  • CPU

  • Optimised for low-latency access to cached data sets

  • Control logic for out-of-order and speculative execution

  • GPU

  • Optimised for data-parallel, throughput computation

  • Architecture tolerant of memory latency

  • More transistors dedicated to computation
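To make the contrast above concrete, here is a minimal CUDA sketch (not from the original deck) that adds two vectors: the host stages data across the PCIe bus, and the GPU runs one lightweight thread per element so that memory latency is hidden by sheer thread count. Kernel and variable names are illustrative.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// GPU version: one lightweight thread per element; throughput comes from
// running many threads concurrently so memory latency can be hidden.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host data
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device data, copied across the PCIe bus
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %f\n", hc[0]);   // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}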



NVIDIA GPU Computing Ecosystem

[Diagram: ecosystem of partner roles: CUDA development specialists, CUDA training companies, TPP/OEMs, ISVs, hardware architects and VARs]



Science is Desperate for Throughput

[Chart: sustained Gigaflops required for molecular simulation targets over time, on a log scale from 1 Gigaflop to 1 Exaflop (1,000,000,000 Gigaflops): BPTI (3K atoms, 1982), Estrogen Receptor (36K atoms, 1997), F1-ATPase (327K atoms, 2003; ran for 8 months to simulate 2 nanoseconds), Ribosome (2.7M atoms, 2006), Chromatophore (50M atoms, ~1 Petaflop, 2010) and a bacterium with 100s of chromatophores (~1 Exaflop, 2012)]



Power Crisis in Supercomputing

[Chart: household power equivalent of supercomputer power draw over time: Gigaflop (1982) is roughly a block at 60,000 Watts; Teraflop (1996) a neighborhood at 850,000 Watts; Petaflop (2008; Jaguar, Los Alamos) a town at 7,000,000 Watts; Exaflop (projected 2020) a city at 25,000,000 Watts]



Enter the GPU

NVIDIA GPU Product Families:
• GeForce® (Entertainment)
• Quadro® (Design & Creation)
• Tesla™ (High-Performance Computing)



NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’


Introducing the ‘Fermi’ Tesla Architecture: The Soul of a Supercomputer in the Body of a GPU

3 billion transistors

Up to 2× the cores (C2050 has 448)

Up to 8× the peak DP performance

ECC on all memories

L1 and L2 caches

Improved memory bandwidth (GDDR5)

Up to 1 Terabyte of GPU memory

Concurrent kernels

Hardware support for C++

[Diagram: Fermi die overview: host interface, GigaThread engine, six DRAM interfaces and the unified L2 cache]



Design Goal of Fermi

Expand performance sweet spot of the GPU

Bring more users, more applications to the GPU

[Diagram: a spectrum from instruction-parallel work with many decisions (CPU) to data-parallel work on large data sets (GPU)]



Streaming Multiprocessor Architecture


  • 32 CUDA cores per SM (512 total)

  • 8× peak double precision floating point performance

    • 50% of peak single precision

  • Dual Thread Scheduler

  • 64 KB of RAM for shared memory and L1 cache (configurable)

[Diagram: SM block: instruction cache, two schedulers and two dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable shared memory/L1 cache and uniform cache]
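To show how a kernel actually uses the SM's on-chip shared memory mentioned in the bullets above, here is a minimal block-wise reduction sketch (illustrative, not from the deck): each block stages its slice of the input in shared memory and reduces it co-operatively before one thread writes the block's partial sum.

#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block stages its slice of the input in the SM's on-chip
// shared memory, then reduces it co-operatively before one thread writes
// the block's partial sum back to global memory.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];                    // lives in the SM's shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // make all loads visible block-wide

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];
}

int main()
{
    const int n = 1 << 20, blocks = (n + 255) / 256;
    float *in, *partial;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partial, blocks * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));          // zero input so the launch is well-defined

    blockSum<<<blocks, 256>>>(in, partial, n);
    cudaDeviceSynchronize();
    printf("launched %d blocks\n", blocks);

    cudaFree(in); cudaFree(partial);
    return 0;
}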



CUDA Core Architecture

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

New integer ALU optimized for 64-bit and extended precision operations

[Diagram: the same SM block with one CUDA core expanded: dispatch port, operand collector, FP unit, INT unit and result queue]
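A small sketch of the fused multiply-add the new cores implement (illustrative, not from the deck): fmaf() computes a*b + c with a single IEEE 754-2008 rounding, whereas a separate multiply and add may round twice (though the compiler will often contract the pair into an FMA anyway).

// Fused multiply-add: a*b + c computed with a single rounding step, as
// specified by IEEE 754-2008. fmaf()/fma() map to one FMA instruction on
// hardware that supports it; __fmaf_rn/__fma_rn are the explicit
// round-to-nearest intrinsics.
__global__ void fmaDemo(const float *a, const float *b, const float *c,
                        float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float two_step = a[i] * b[i] + c[i];     // may round twice (unless the compiler fuses it)
        float fused    = fmaf(a[i], b[i], c[i]); // guaranteed single rounding
        out[i] = fused - two_step;               // usually zero, can differ in the last bit
    }
}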



Cached Memory Hierarchy

  • First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

  • L1 Cache per SM (32 cores)

    • Improves bandwidth and reduces latency

  • Unified L2 Cache (768 KB)

    • Fast, coherent data sharing across all cores in the GPU

[Diagram: Parallel DataCache™ memory hierarchy: host interface, GigaThread engine, six DRAM interfaces and the unified L2 cache shared by all SMs]
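On toolkits with Fermi support, the runtime exposes the shared-memory/L1 split through cudaFuncSetCacheConfig. A minimal sketch follows; the stencil kernel is a hypothetical example of code that benefits from a larger L1.

#include <cuda_runtime.h>

// Hypothetical kernel that benefits from a larger L1: it makes scattered,
// reused reads from global memory rather than staging data explicitly.
__global__ void stencilKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

int main()
{
    // Split each SM's 64 KB of on-chip RAM in favour of L1 cache for this
    // kernel; kernels that stage data in shared memory would ask for
    // cudaFuncCachePreferShared instead.
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferL1);
    // ... allocate buffers and launch stencilKernel as usual ...
    return 0;
}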



Larger, Faster, Resilient Memory Interface

GDDR5 memory interface

2× signaling speed of GDDR3

Up to 1 Terabyte of memory attached to GPU

Operate on larger data sets (3 and 6 GB Cards)

ECC protection for GDDR5 DRAM

All major internal memories are ECC protected

Register file, L1 cache, L2 cache

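ECC state can be inspected per device from the CUDA runtime; a minimal sketch using cudaGetDeviceProperties and its ECCEnabled field follows (ECC itself is toggled per board with nvidia-smi).

#include <cstdio>
#include <cuda_runtime.h>

// Report which GPUs in the system have ECC enabled and how much memory they expose.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s, %.1f GB, ECC %s\n",
               dev, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.ECCEnabled ? "on" : "off");
    }
    return 0;
}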



GigaThread Hardware Thread Scheduler

Concurrent Kernel Execution + Faster Context Switch

[Diagram: timeline comparing serial kernel execution (Kernels 1-5 run back to back) with parallel kernel execution (Kernels 1-5 overlap in time)]
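A minimal sketch of concurrent kernel execution using CUDA streams (illustrative, not from the deck): independent kernels issued to different streams may overlap on Fermi-class GPUs, whereas on earlier hardware they serialise.

#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 100; ++k)       // a little arithmetic per element
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main()
{
    const int nKernels = 4, n = 1 << 16;
    cudaStream_t streams[nKernels];
    float *buffers[nKernels];

    // Independent kernels issued to different streams may run concurrently.
    for (int k = 0; k < nKernels; ++k) {
        cudaStreamCreate(&streams[k]);
        cudaMalloc(&buffers[k], n * sizeof(float));
        busyKernel<<<(n + 255) / 256, 256, 0, streams[k]>>>(buffers[k], n);
    }

    cudaDeviceSynchronize();                // wait for all streams to drain
    for (int k = 0; k < nKernels; ++k) {
        cudaStreamDestroy(streams[k]);
        cudaFree(buffers[k]);
    }
    return 0;
}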



GigaThread Streaming Data Transfer Engine

Dual DMA engines

Simultaneous CPU-to-GPU and GPU-to-CPU data transfer

Fully overlapped with CPU and GPU processing time

[Diagram: activity snapshot: for Kernels 0-3, CPU work, GPU kernel execution and transfers on the two streaming data transfer engines (SDT0 and SDT1) all proceed concurrently]
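A minimal sketch of the overlap the dual DMA engines make possible (illustrative, not from the deck): pinned host memory plus cudaMemcpyAsync in per-chunk streams lets uploads, kernels and downloads for different chunks proceed at the same time.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int nChunks = 4, chunk = 1 << 20;
    const size_t bytes = chunk * sizeof(float);

    // Overlap requires page-locked (pinned) host memory and per-chunk streams.
    float *host;
    cudaMallocHost(&host, nChunks * bytes);
    float *dev;
    cudaMalloc(&dev, nChunks * bytes);
    cudaStream_t streams[nChunks];

    for (int c = 0; c < nChunks; ++c) {
        cudaStreamCreate(&streams[c]);
        float *h = host + (size_t)c * chunk;
        float *d = dev  + (size_t)c * chunk;

        // Copy in, compute, copy out, all queued on the same stream, so one
        // chunk's download can overlap another chunk's upload and kernel.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d, chunk);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    cudaDeviceSynchronize();
    for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFreeHost(host); cudaFree(dev);
    return 0;
}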



Enhanced Software Support

Many new features in CUDA Toolkit 3.0

To be released on Friday

Including early support for the Fermi architecture:

Native 64-bit GPU support

Multiple Copy Engine support

ECC reporting

Concurrent Kernel Execution

Fermi HW debugging support in cuda-gdb



Enhanced Software Support

OpenCL 1.0 Support

First class language citizen in CUDA Architecture

Supports ICD (so interoperability between vendors is a possibility)

Profiling support available

Debug support coming to Parallel Nsight (NEXUS) soon

gDebugger CL from graphicREMEDY

Third party OpenCL profiler/debugger/memory checker

Software Tools Ecosystem is starting to grow

Given boost by existence of OpenCL



“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer."

Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops….

…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30, 2009



NVIDIA Tesla Products



Tesla GPU Computing Products: 10 Series

SuperMicro 1U GPU SuperServer
Tesla S1070 1U System
Tesla C1060 Computing Board
Tesla Personal Supercomputer



Tesla GPU Computing Products: 20 Series

Tesla S2050 1U System
Tesla S2070 1U System
Tesla C2050 Computing Board
Tesla C2070 Computing Board



HETEROGENEOUS CLUSTERS



Data Centers: Space and Energy Limited

Traditional Data Center Cluster
• 1000's of servers, 1000's of cores (quad-core CPUs, 8 cores per server)
• 2x performance requires 2x the number of servers

Heterogeneous Data Center Cluster
• 100's of servers, 10,000's of cores
• GPUs augment/replace the host servers



Cluster Deployment

  • Now a number of GPU aware Cluster Management Systems

    • ActiveEon ProActive Parallel Suite® Version 4.2

    • Platform Cluster Manager and HPC Workgroup

    • Streamline Computing GPU Environment (SCGE)

  • Not just installation aids

    • i.e. putting the driver and toolkits in the right place

    • now starting to provide GPU node discovery and job steering (a minimal discovery sketch follows this list)

  • NVIDIA and Mellanox

    • Better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs

    • Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application
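As referenced in the list above, here is a minimal sketch of the node-level GPU discovery such tools build on: enumerate the devices on a host and bind each local process to one. The LOCAL_RANK environment variable is an assumption standing in for whatever the cluster manager or MPI launcher actually provides.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal device-discovery sketch: list the GPUs on a node and bind this
// process to one of them. LOCAL_RANK is an illustrative assumption.
int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices == 0) { fprintf(stderr, "no CUDA devices on this node\n"); return 1; }

    for (int d = 0; d < nDevices; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
    }

    const char *rankStr = getenv("LOCAL_RANK");
    int localRank = rankStr ? atoi(rankStr) : 0;
    cudaSetDevice(localRank % nDevices);     // one GPU per local process
    return 0;
}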



Cluster Deployment

  • A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla

  • Allinea® DDT for NVIDIA CUDA

    • Extends well known Distributed Debugging Tool (DDT) with CUDA support

  • TotalView® debugger (part of an Early Experience Program)

    • Extends with CUDA support, have also announced intentions to support OpenCL

  • Both based on the Parallel Nsight (NEXUS) Debugging API



NVIDIA RealityServer 3.0

  • Cloud computing platform for running 3D web applications

  • Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images

  • Deployed in a number of different sizes

    • From 2 to 100's of 1U servers

  • iray® - Interactive Photorealistic Rendering Technology

    • Streams interactive 3D applications to any web connected device

    • Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions



DISTRIBUTED COMPUTING PROJECTS



Distributed Computing Projects

Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)

Typically have 000’s to 10,000’s of contributors

Folding@home has access to 6.5 PFLOPS of compute

Of which ~95% comes from GPUs or PS3s

Many are bio-informatics, molecular dynamics and quantum chemistry codes

Represent the current sweet spot applications

Ubiquity of GPUs in home systems helps



Distributed Computing Projects

  • Folding@home

  • Directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)

  • Most recent GPU3 Core based on OpenMM 1.0 (https://simtk.org/home/openmm)

    • OpenMM library provides tools for molecular modeling simulation

    • Can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort

    • OpenMM has a strong emphasis on hardware acceleration, providing not just a consistent API, but much greater performance

  • Current NVIDIA target is via CUDA Toolkit 2.3

  • OpenMM 1.0 also provides Beta support for OpenCL

  • OpenCL is long term convergence software platform



Distributed Computing Projects

  • Berkeley Open Infrastructure for Network Computing

  • BOINC project (http://boinc.berkeley.edu/)

    • Platform infrastructure originally evolved from SETI@home

  • Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing)

  • Examples include:

    • GPUGRID.net

    • SETI@home

    • MilkyWay@home (IEEE 754 double precision capable GPU required)

    • Einstein@home

    • Lattice

    • Collatz Conjecture



Distributed Computing Projects

  • GPUGRID.net

  • Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona

  • Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)

    • ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA

    • It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface

  • A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)

    • What makes this particularly interesting is that it is implemented using OpenCL



Distributed Computing Projects

Have had to use brute force methods to deal with robustness

Run the same WU with multiple users and compare results

Running on purpose designed heterogeneous grids with ECC

Means that some of the paranoia can be relaxed (can at least detect that there have been soft errors or WU corruption)

Results in better throughput on these systems

But does result in divergence between Consumer and HPC devices

Should be compensated for by HPC class devices being about 4x faster
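A minimal sketch of the brute-force validation described above: the same work unit is computed by two independent contributors and only accepted if the results agree within a tolerance, so silent corruption can be flagged and the WU re-issued. Function names and the tolerance are illustrative.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch of result validation for redundantly-computed work units: accept a
// WU only if every value from the two independent runs agrees within relTol.
bool resultsAgree(const std::vector<double> &runA,
                  const std::vector<double> &runB,
                  double relTol = 1e-6)
{
    if (runA.size() != runB.size()) return false;
    for (size_t i = 0; i < runA.size(); ++i) {
        double scale = std::max(std::fabs(runA[i]), std::fabs(runB[i]));
        if (std::fabs(runA[i] - runB[i]) > relTol * std::max(scale, 1.0))
            return false;                  // possible soft error / WU corruption
    }
    return true;
}

int main()
{
    std::vector<double> a = {1.0, 2.0, 3.0};
    std::vector<double> b = {1.0, 2.0, 3.0000001};
    printf("work unit %s\n", resultsAgree(a, b) ? "accepted" : "flagged for re-run");
    return 0;
}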



Tesla Bio Workbench

Accelerating New Science

January, 2010

http://www.nvidia.com/bio_workbench



Introducing Tesla Bio WorkBench

  • Applications: TeraChem, LAMMPS, MUMmerGPU, GPU-AutoDock
  • Community: downloads, documentation, technical papers, discussion forums, benchmarks & configurations
  • Platforms: Tesla GPU Clusters, Tesla Personal Supercomputer



Tesla Bio Workbench Applications

Molecular dynamics and quantum chemistry
• AMBER (MD)
• ACEMD (MD)
• GROMACS (MD)
• GROMOS (MD)
• LAMMPS (MD)
• NAMD (MD)
• TeraChem (QC)
• VMD (Visualization, MD & QC)

Docking
• GPU AutoDock

Sequence analysis
• CUDASW++ (Smith-Waterman)
• MUMmerGPU
• GPU-HMMER
• CUDA-MEME Motif Discovery



Recommended Hardware Configurations

Tesla Personal Supercomputer
• Up to 4 Tesla C1060s per workstation
• 4 GB main memory per GPU

Tesla GPU Clusters
• Tesla S1070 1U: 4 GPUs per 1U
• Integrated CPU-GPU server: 2 GPUs per 1U + 2 CPUs

Specifics at http://www.nvidia.com/bio_workbench



Molecular Dynamics andQuantum Chemistry Applications



Molecular Dynamics andQuantum Chemistry Applications

  • AMBER (MD)
  • ACEMD (MD)
  • HOOMD (MD)
  • GROMACS (MD)
  • LAMMPS (MD)
  • NAMD (MD)
  • TeraChem (QC)
  • VMD (Viz. MD & QC)

  • Typical speed-ups of 3-8x on a single Tesla C1060 vs a modern 1U server
  • Some applications (compute bound) show 20-100x speed-ups



Usage of TeraGrid National Supercomputing Grid

[Chart: these molecular dynamics and quantum chemistry applications account for roughly half of the cycles on the TeraGrid]



Summary



Summary

‘Fermi’ debuts HPC/Enterprise features

Particularly ECC and high performance double precision

Software development environments are now more mature

Significant software ecosystem is starting to emerge

Broadening availability of development tools, libraries and applications

Heterogeneous (GPU) aware cluster management systems

Economics, open standards and improving programming methodologies are gradually changing the long-held perception that heterogeneous computing is just an ‘exotic’ niche technology



Questions?



Supporting Slides



AMBER Molecular Dynamics

Implicit solvent GB results: 1 Tesla GPU 8x faster than 2 quad-core CPUs

• Generalized Born: Alpha, available now
• PME (Particle Mesh Ewald): Beta release, Q1 2010
• Multi-GPU + MPI support: Beta 2 release, Q2 2010

[Chart: Generalized Born simulations, speed-ups of 7x and 8.6x]

More info: http://www.nvidia.com/object/amber_on_tesla.html
Data courtesy of San Diego Supercomputing Center



GROMACS Molecular Dynamics

PME results: 1 Tesla GPU 3.5x-4.7x faster than CPU

• Beta (now): Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions
• Beta 2 release (Q2 2010): multi-GPU + MPI support

[Chart: GROMACS on Tesla GPU vs CPU for reaction-field, cutoff and Particle-Mesh-Ewald (PME) runs, with speed-ups of 22x, 3.5x and 5.2x]

More info: http://www.nvidia.com/object/gromacs_on_tesla.html
Data courtesy of Stockholm Center for Biomembrane Research



HOOMD Blue Molecular Dynamics

Written bottom-up for CUDA GPUs

Modeled after LAMMPS

Supports multiple GPUs

1 Tesla GPU outperforms 32 CPUs running LAMMPS

More Info

http://www.nvidia.com/object/hoomd_on_tesla.html

Data courtesy of University of Michigan



LAMMPS: Molecular Dynamics on a GPU Cluster

  • Available as beta on CUDA

  • Cut-off based non-bonded terms

    • 2 GPUs outperforms 24 CPUs

  • PME-based electrostatics

    • Preliminary results: 5X speed-up

  • Multiple GPU + MPI support enabled

2 GPUs = 24 CPUs

More Info

http://www.nvidia.com/object/lammps_on_tesla.html

Data courtesy of Scott Hampton & Pratul K. Agarwal

Oak Ridge National Laboratory



NAMD: Scaling Molecular Dynamics on a GPU Cluster

  • Feature complete on CUDA : available in NAMD 2.7 Beta 2

    • Full electrostatics with PME

    • Multiple time-stepping

    • 1-4 Exclusions

  • 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers

  • Scales to a GPU cluster

4 GPUs = 16 CPUs

More Info

http://www.nvidia.com/object/namd_on_tesla.html

Data courtesy of Theoretical and Computational Bio-physics Group, UIUC



TeraChem: Quantum Chemistry Package for GPUs

First QC SW written ground-up for GPUs

4 Tesla GPUs outperform 256 quad-core CPUs

  • Beta (available now): HF, Kohn-Sham, DFT; multiple GPUs supported
  • Full release (Q1 2010): MPI support

More Info

http://www.nvidia.com/object/terachem_on_tesla.html



VMD: Acceleration using CUDA GPUs

Several CUDA applications in VMD 1.8.7

Molecular Orbital Display

Coulomb-based Ion Placement

Implicit Ligand Sampling

Speed-ups: 20x-100x

Multiple GPU support enabled

More Info

http://www.nvidia.com/object/vmd_on_tesla.html

Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC



GPU-HMMER: Protein Sequence Alignment

Protein sequence alignment using profile HMMs

Available now

Supports multiple GPUs

Speedups range from 60-100x faster than CPU

Download

http://www.mpihmmer.org/releases.htm




MUMmerGPU: Genome Sequence Alignment

High-throughput pair-wise local sequence alignment

Designed for large sequences

Drop-in replacement for “mummer” component in MUMmer software

Speedups 3.5x to 3.75x

Download

http://mummergpu.sourceforge.net

