Presentation Transcript


"Towards Petascale Grids as a Foundation of E-Science"

Satoshi Matsuoka

Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics

Oct. 1, 2007, EGEE07 Presentation @ Budapest, Hungary


Vision of Grid Infrastructure in the past…

  • Very divergent & distributed supercomputers, storage, etc. tied together & "virtualized"

OR

  • A bunch of networked PCs virtualized to be a supercomputer

The "dream" is for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications.


But this is not meant to be

[Picture: Don Quixote, or a dog barking up the wrong tree]


TSUBAME: the First 100 TeraFlops Supercomputer for Grids, 2006-2010

  • Compute: Sun Galaxy 4 (Opteron dual-core, 8-socket), 10,480 cores / 655 nodes, 32-128 GB memory per node, 21.4 TeraBytes total memory, 50.4 TeraFlops; OS: Linux (SuSE 9, 10); NAREGI Grid middleware

  • Unified IB network: Voltaire ISR9288 Infiniband, 10 Gbps x2, ~1,310+50 ports, ~13.5 Terabits/s (3 Tbit/s bisection); 10 Gbps+ external network

  • Storage: 1.0 Petabyte (Sun "Thumper", 48 x 500 GB disks per unit) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth (diagram labels: 1.5 PB, 60 GB/s)

  • Acceleration: ClearSpeed CSX600 SIMD accelerators, 360 → 648 boards, 35 → 52.2 TeraFlops; NEC SX-8i (for porting); Sun Blade Integer Workload Accelerator (90 nodes, 720 CPUs)

  • "Fastest Supercomputer in Asia" (29th Top500 list); 103 TeraFlops peak as of Oct. 31st!


TSUBAME Job Statistics, Dec. 2006 - Aug. 2007 (#Jobs)

  • 797,886 jobs (~3,270 daily)

  • 597,438 serial jobs (74.8%)

  • 121,108 jobs using <= 8 processors (15.2%)

  • 129,398 ISV application jobs (16.2%)

  • Serial and <= 8-processor jobs together make up ~90% of all jobs

  • However, > 32-processor jobs account for 2/3 of cumulative CPU usage

  • Coexistence of ease of use for both short-duration parameter surveys and large-scale MPI; this fits the TSUBAME design well


In the Supercomputing Landscape, Petaflops Class Is Already Here… in Early 2008

  • 2008: LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling

  • 2008Q1: TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB cx4 copper cabling, 2 Petabytes HDD

  • Other Petaflops systems 2008/2009: LANL/IBM "Roadrunner", JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Julich…), …

  • > 10 Petaflops, > a million cores, 10s of Petabytes: planned for 2011-2012 in the US, Japan, (EU), (other APAC)


In fact we can build one now (!)

  • @Tokyo: one of the largest IDCs in the world (in Tokyo...)

  • Could easily fit a 10 PF system here (> 20 Rangers)

  • On top of a 55KV/6GW Substation

  • 150m diameter (small baseball stadium)

  • 140,000 m2 IDC floorspace

  • 70+70 MW power

  • Size of entire Google(?) (~million LP nodes)

  • Source of “Cloud” infrastructure


Gilder's Law – Will make thin-client accessibility to servers essentially "free"

[Chart: performance per dollar spent vs. number of years (0-5) for optical fiber (bits per second, doubling time 9 months), data storage (bits per square inch, doubling time 12 months), and silicon computer chips (number of transistors, doubling time 18 months). Source: Scientific American, January 2001.]

(Original slide courtesy Phil Papadopoulos @ SDSC)
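
As a rough illustration of what these doubling times imply, here is a minimal sketch; the only inputs taken from the chart are the 9/12/18-month doubling times, and the 5-year horizon is simply the chart's x-axis range:

```python
# Relative performance per dollar after `years`, given a doubling time in months.
# The 9/12/18-month doubling times are the ones quoted on the chart above.

def growth(years: float, doubling_months: float) -> float:
    """Multiplicative improvement after `years` at the given doubling time."""
    return 2.0 ** (12.0 * years / doubling_months)

for name, months in [("optical fiber", 9), ("data storage", 12), ("silicon chips", 18)]:
    print(f"{name:>14}: x{growth(5, months):6.1f} after 5 years")

# Roughly: fiber ~x102, storage ~x32, silicon ~x10 over 5 years -- bandwidth growth
# outpaces compute, which is what makes remote ("thin client") access cheap.
```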


DOE SC Applications Overview (following slides courtesy John Shalf @ LBL NERSC)

| NAME    | Discipline       | Problem/Method     | Structure        |
|---------|------------------|--------------------|------------------|
| MADCAP  | Cosmology        | CMB Analysis       | Dense Matrix     |
| FVCAM   | Climate Modeling | AGCM               | 3D Grid          |
| CACTUS  | Astrophysics     | General Relativity | 3D Grid          |
| LBMHD   | Plasma Physics   | MHD                | 2D/3D Lattice    |
| GTC     | Magnetic Fusion  | Vlasov-Poisson     | Particle in Cell |
| PARATEC | Material Science | DFT                | Fourier/Grid     |
| SuperLU | Multi-Discipline | LU Factorization   | Sparse Matrix    |
| PMEMD   | Life Sciences    | Molecular Dynamics | Particle         |


Latency Bound vs. Bandwidth Bound?

| System          | Technology      | MPI Latency | Peak Bandwidth | Bandwidth-Delay Product |
|-----------------|-----------------|-------------|----------------|-------------------------|
| SGI Altix       | Numalink-4      | 1.1 us      | 1.9 GB/s       | 2 KB                    |
| Cray X1         | Cray Custom     | 7.3 us      | 6.3 GB/s       | 46 KB                   |
| NEC ES          | NEC Custom      | 5.6 us      | 1.5 GB/s       | 8.4 KB                  |
| Myrinet Cluster | Myrinet 2000    | 5.7 us      | 500 MB/s       | 2.8 KB                  |
| Cray XD1        | RapidArray/IB4x | 1.7 us      | 2 GB/s         | 3.4 KB                  |

  • How large does a message have to be in order to saturate a dedicated circuit on the interconnect?

    • N½ from the early days of vector computing

    • Bandwidth Delay Product in TCP

  • Bandwidth Bound if msg size > Bandwidth*Delay

  • Latency Bound if msg size < Bandwidth*Delay

    • Except if pipelined (unlikely with MPI due to overhead)

    • Cannot pipeline MPI collectives (but can in Titanium)

(Original slide courtesy John Shalf @ LBL)
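
To make the rule of thumb above concrete, here is a minimal Python sketch; the latency and bandwidth figures are taken from the table above, and the classification is exactly the slide's rule (bandwidth bound if message size > bandwidth * delay):

```python
# Classify a message as latency-bound or bandwidth-bound by comparing its size
# to the interconnect's bandwidth-delay product (BDP = MPI latency * peak bandwidth).

SYSTEMS = {                      # (MPI latency [s], peak bandwidth [bytes/s]) from the table above
    "SGI Altix":       (1.1e-6, 1.9e9),
    "Cray X1":         (7.3e-6, 6.3e9),
    "NEC ES":          (5.6e-6, 1.5e9),
    "Myrinet Cluster": (5.7e-6, 0.5e9),
    "Cray XD1":        (1.7e-6, 2.0e9),
}

def classify(msg_bytes: int, latency_s: float, bandwidth_Bps: float) -> str:
    bdp = latency_s * bandwidth_Bps          # bytes needed to "fill the pipe"
    return "bandwidth-bound" if msg_bytes > bdp else "latency-bound"

for name, (lat, bw) in SYSTEMS.items():
    print(f"{name:>16}: BDP = {lat * bw / 1024:5.1f} KB, "
          f"1 KB msg is {classify(1024, lat, bw)}, 1 MB msg is {classify(2**20, lat, bw)}")
```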


Diagram of Message Size Distribution Function (MADBench-P2P)

  • 60% of messages are > 1 MB: bandwidth dominant; could be executed over a WAN

(Original slide courtesy John Shalf @ LBL)


Message Size Distributions (SuperLU-PTP)

  • > 95% of messages are < 1 KByte: requires a low-latency, tightly coupled LAN

(Original slide courtesy John Shalf @ LBL)


Collective Buffer Sizes - the Demise of Metacomputing -

  • 95% latency bound!!!

  • => For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps

(Original slide courtesy John Shalf @ LBL)



So what does this tell us?

  • A “grid” programming model for parallelizing a single app is not worthwhile

    • Either it is a simple parameter sweep / workflow, or it will not work

    • We will have enough problems programming a single system with millions of threads (e.g., Jack’s keynote)

  • Grid programming at “diplomacy” level

    • Must look at multiple applications, and how they compete / coordinate

      • The apps' execution environment should be virtualized, with the grid being transparent to applications

      • Zillions of apps in the overall infrastructure, competing for resources

      • Hundreds to thousands of application components that coordinate (workflow, coupled multi-physics interactions, etc.)

    • NAREGI focuses on these scenarios


Use Case in NAREGI: RISM-FMO Coupled Simulation

The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules.

  • RISM (solvent distribution), suitable for SMP, and FMO (electronic structure), suitable for clusters, are coupled via GridMPI through Mediator components

  • The solvent charge distribution is transformed from regular to irregular meshes

  • Mulliken charges are transferred as the partial charges of the solute molecules

*The original RISM and FMO codes are developed by the Institute for Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
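
A minimal sketch of the self-consistent coupling loop described above; this is purely illustrative: the callables `run_rism`, `run_fmo`, and `to_irregular_mesh` and the convergence test are hypothetical stand-ins, not the NAREGI Mediator API:

```python
# Illustrative self-consistent RISM-FMO coupling loop (hypothetical helper callables).
import numpy as np

def coupled_rism_fmo(run_rism, run_fmo, to_irregular_mesh,
                     solute_charges, tol=1e-5, max_iter=50):
    """Iterate until the solute partial (Mulliken) charges stop changing."""
    for it in range(max_iter):
        # RISM side: solvent distribution from the current solute partial charges.
        solvent_regular = run_rism(solute_charges)            # on a regular mesh
        # Mediator role: convert the solvent charge distribution for FMO.
        solvent_irregular = to_irregular_mesh(solvent_regular)
        # FMO side: electronic structure in the solvent field -> new Mulliken charges.
        new_charges = run_fmo(solvent_irregular)
        if np.max(np.abs(new_charges - solute_charges)) < tol:
            return new_charges, it + 1                         # converged
        solute_charges = new_charges
    raise RuntimeError("coupling did not converge")
```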


Registration & Deployment of Applications

  • The application developer registers the application with the PSE server: application summary, program source files, input files, resource requirements, etc.

  • Applications are stored in the ACS (Application Contents Service) for sharing within research communities

  • Flow: ① register the application, ② select a compiling host (using resource info from the Information Service), ③ compile and test-run on the chosen server (OK! / NG!), ④ send back the compiled application environment, ⑤ select deployment hosts, ⑥ deploy to the target servers (Server#1, Server#2, Server#3), ⑦ register the deployment information
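
The same ①-⑦ flow expressed as a hedged sketch; the `pse` client object and all of its method names are hypothetical illustrations of the sequence, not the actual NAREGI PSE interface:

```python
# Hypothetical client-side view of the PSE registration/deployment sequence above.
def register_and_deploy(pse, app_meta, sources, deploy_count=3):
    app_id = pse.register_application(app_meta, sources)            # (1) register
    build_host = pse.select_compiling_host(app_meta["resources"])   # (2) host chosen via Information Service
    build = pse.compile(app_id, build_host)                         # (3) compile + test run
    if not build.ok:                                                 #     "NG!" -> stop here
        raise RuntimeError(f"build failed on {build_host}: {build.log}")
    env = pse.fetch_compiled_environment(build)                      # (4) compiled application environment
    targets = pse.select_deployment_hosts(app_meta["resources"],
                                          n=deploy_count)            # (5) pick deployment hosts
    for host in targets:
        pse.deploy(env, host)                                        # (6) deploy to Server#1..#N
    pse.register_deployment_info(app_id, targets)                    # (7) record where it lives
    return app_id, targets
```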


Description of Workflow and Job Submission Requirements

  • The user composes the workflow in a browser applet (program icons such as Appli-A / Appli-B and data icons such as /gfarm/..), talking over http(s) to a Workflow Servlet (Tomcat behind an Apache web server)

  • The workflow description is written in NAREGI-WFML and handed to the NAREGI JM I/F module

  • It is translated into BPEL plus JSDL, e.g. <invoke name=EPS-jobA> ↓ JSDL-A, <invoke name=BES-jobA> ↓ JSDL-A, …, together with global file information

  • The BPEL+JSDL document and application information go to the SuperScheduler, which works with the Information Service, DataGrid, and PSE server; stdout/stderr are transferred via GridFTP
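
For illustration, a minimal JSDL job description of the kind referenced above can be generated as follows. This is only a sketch: the element names follow the OGF JSDL 1.0 schema, but the executable path, argument, and CPU count are invented, and NAREGI wraps such JSDL inside BPEL <invoke> elements as shown on the slide:

```python
# Build a minimal JSDL (Job Submission Description Language) document.
import xml.etree.ElementTree as ET

JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
ET.register_namespace("jsdl", JSDL)
ET.register_namespace("jsdl-posix", POSIX)

job = ET.Element(f"{{{JSDL}}}JobDefinition")
desc = ET.SubElement(job, f"{{{JSDL}}}JobDescription")
app = ET.SubElement(desc, f"{{{JSDL}}}Application")
posix = ET.SubElement(app, f"{{{POSIX}}}POSIXApplication")
ET.SubElement(posix, f"{{{POSIX}}}Executable").text = "/home/user/appli-A"  # hypothetical
ET.SubElement(posix, f"{{{POSIX}}}Argument").text = "input.dat"             # hypothetical
res = ET.SubElement(desc, f"{{{JSDL}}}Resources")
cpu = ET.SubElement(res, f"{{{JSDL}}}TotalCPUCount")
ET.SubElement(cpu, f"{{{JSDL}}}Exact").text = "32"                          # request 32 CPUs

print(ET.tostring(job, encoding="unicode"))
```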


Reservation-Based Co-Allocation

  • The client's workflow arrives at the SuperScheduler as abstract JSDL

  • The SuperScheduler issues resource queries to the Distributed Information Service (CIM-based, accessed via DAI; holds resource and accounting information, UR/RUS)

  • Abstract JSDL is turned into concrete JSDL per resource, and reservation, submission, query, and control requests are sent to the GridVM on each computing resource

  • Co-allocation for heterogeneous architectures and applications

  • Used for advanced science applications, huge MPI jobs, realtime visualization on the grid, etc.
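
A minimal sketch of the reservation step at the heart of co-allocation: finding the earliest time slot that is simultaneously free on every required resource. This is purely illustrative; the free-window data structure and the greedy search are assumptions, not the SuperScheduler's actual algorithm:

```python
# Find the earliest common start time for a co-allocated reservation of given duration,
# given each resource's list of free windows as (start, end) pairs in seconds.

def earliest_co_allocation(free_windows_per_resource, duration):
    # Candidate start times: the start of every free window on any resource.
    candidates = sorted(start for windows in free_windows_per_resource
                        for (start, _end) in windows)
    for t in candidates:
        if all(any(start <= t and t + duration <= end for (start, end) in windows)
               for windows in free_windows_per_resource):
            return t            # every resource can host [t, t + duration)
    return None                 # no common slot exists

# Example: two clusters with different free windows, one 2-hour (7200 s) MPI job.
cluster_a = [(0, 3600), (10_000, 30_000)]
cluster_b = [(5_000, 20_000)]
print(earliest_co_allocation([cluster_a, cluster_b], 7200))   # -> 10000
```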


Communication Libraries and Tools

  • Modules

    • GridMPI: MPI-1 and -2 compliant, grid-ready MPI library

    • GridRPC: OGF/GridRPC compliant GridRPC library

    • Mediator: communication tool for heterogeneous applications

    • SBC: storage-based communication tool

  • Features

    • GridMPI

      • MPI for a collection of geographically distributed resources

      • High performance, optimized for high-bandwidth networks

    • GridRPC

      • Task-parallel, simple, seamless programming

    • Mediator

      • Communication library for heterogeneous applications

      • Data format conversion

    • SBC

      • Storage-based communication for heterogeneous applications

  • Supporting standards

    • MPI-1 and 2

    • OGF/GridRPC


Grid-Ready Programming Libraries

  • Standards-compliant GridMPI and GridRPC

  • GridRPC (Ninf-G2): task-parallel, simple, seamless programming via RPC, scaling to ~100,000 CPUs

  • GridMPI: data-parallel programming with MPI compatibility, typically 100-500 CPUs
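
Since GridMPI is MPI-1/2 compliant, ordinary MPI programs should run on it unchanged. As a small illustration (written with mpi4py here purely for brevity; the choice of library and the array size are assumptions), a data-parallel reduction looks the same whether the ranks sit in one cluster or are spread across sites:

```python
# Plain MPI code: nothing here is GridMPI-specific, which is the point --
# an MPI-compliant library lets the same program span grid resources.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank works on its own slice of a (hypothetical) global array of 10^6 elements.
n_global = 1_000_000
local = np.arange(rank, n_global, size, dtype=np.float64)

local_sum = local.sum()
total = comm.allreduce(local_sum, op=MPI.SUM)   # data-parallel collective

if rank == 0:
    print(f"{size} ranks, global sum = {total:.0f}")
```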


Communication Tools for Co-Allocation Jobs

  • Mediator: Application-1 ↔ Mediator ↔ Mediator ↔ Application-2 over GridMPI, with data format conversion performed on each side

  • SBC (Storage-Based Communication): Application-3 ↔ SBC library ↔ SBC library ↔ Application-2 over the SBC protocol


Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI)

  • Two clusters: Cluster A (fast CPUs, slow networks) and Cluster B (high bandwidth, large memory), each host running MPI processes inside VMs

  • A resource manager, aware of individual application characteristics, places App A (high bandwidth) on Cluster B and App B (CPU-bound) on Cluster A

  • VM job migration enables power optimization; MPI communication log redistribution (via our ABARIS FT-MPI) lets running MPI jobs be migrated
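
A toy version of the placement decision the resource manager makes in this scenario. It is a sketch under stated assumptions: the attribute names and scores are invented, and the real manager would also drive VM migration and power optimization:

```python
# Match each application's dominant demand to the cluster that best provides it.

CLUSTERS = {
    "Cluster A": {"cpu": 0.9, "bandwidth": 0.2, "memory": 0.4},   # fast CPU, slow network
    "Cluster B": {"cpu": 0.5, "bandwidth": 0.9, "memory": 0.9},   # high BW, large memory
}

APPS = {
    "App A": {"cpu": 0.3, "bandwidth": 0.9, "memory": 0.5},       # bandwidth-hungry MPI job
    "App B": {"cpu": 0.9, "bandwidth": 0.1, "memory": 0.3},       # CPU-bound job
}

def place(app_demand, clusters):
    # Score = how well the cluster's strengths line up with the app's demands.
    score = lambda c: sum(app_demand[k] * c[k] for k in app_demand)
    return max(clusters, key=lambda name: score(clusters[name]))

for app, demand in APPS.items():
    print(f"{app} -> {place(demand, CLUSTERS)}")
# Expected: App A -> Cluster B, App B -> Cluster A
```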


  • Login