The status of “SAKURA” project and proposals for “Instant Grid” in NEGST project



  1. The status of “SAKURA” project and proposals for “Instant Grid” in NEGST project
Mitsuhisa Sato, University of Tsukuba, Japan
1st NEGST WS (JST-CNRS)

  2. “SAKURA” project
• Title: “Research on software and network technologies for an international large-scale P2P distributed computing infrastructure”
• Period: 2005 to 2006 (2 years)
• Research topics and teams:
  • Distributed computing – parallel programming, storage
    • U. of Tsukuba (Sato, Boku, Nakajima)
    • INRIA (Fedak, Cappello)
  • Network measurement
    • AIST (Kudoh, Kodama)
    • ENS Lyon (P. Primet, Gluck, Otal)
• Status and outcome:
  • Integration of OmniRPC and XtremWeb
    • “Grid RPC System Integrating Computing Resources on Multiple Grid-enabled Job Scheduling Systems”
  • Data communication layer for OmniRPC
    • OmniStorage (and P2P data sharing)

  3. Integrating Computing Resources on Multiple Grid-enabled Job Scheduling Systems Through a Grid RPC System
Yoshihiro Nakajima†, Mitsuhisa Sato†, Yoshiaki Aida†, Taisuke Boku†, Franck Cappello††
†University of Tsukuba, Japan  ††INRIA, France

  4. Presentation outline
• Grid RPC system integrating computing resources on multiple Grid-enabled Job Scheduling Systems
• “SAKURA”
• Experimental results
• Summary

  5. Motivation: provide an RPC-style programming model on GJSS
• Need for high-throughput computing (e.g., simulations for drug design, circuit design, …)
• Many kinds of Grid-enabled Job Scheduling Systems (GJSS) have been developed (XtremWeb, Condor, Grid Engine, CyberGRIP, GridMP)
• Users want to use massive computing resources on different sites easily
  • Each site has a different management policy and middleware
  • Users must write extra code to adapt to each environment
• Users do not want faults to stop their computation (need for a fault-tolerance mechanism)

  6. Objectives of a Grid RPC system integrating computing resources on multiple GJSSs
• Provide a unified, parallel programming model by RPC on GJSSs
• Provide fault-tolerant features for the Grid RPC system on the worker programs
• Exploit massive computing resources on different sites simultaneously

  7. Target Grid-enabled Job Scheduling Systems
• Grid-enabled Job Scheduling System or workflow manager (e.g., CyberGRIP, XtremWeb)
• Used as a batch job system
• The basic work unit is an independent job
• A typical job reads input data from files, computes, and writes output data to files

  8. Design of a Grid RPC system integrating computing resources on multiple GJSSs
• Decouple computation and data transmission from the RPC mechanism
• Design an agent mechanism to bridge between Grid RPC and GJSS
• Use document-based communication rather than connection-based communication
The proposed system can:
• Submit an RPC computation as a job to a GJSS
• Guarantee fault-tolerant execution on the side of the worker program
[Figure: Grid RPC client program – agent bridge – GJSS – Grid RPC worker program; the client submits jobs and gets results through the bridge]

  9. An example of the proposed Grid RPC system as an extension of OmniRPC
• Provides seamless parallel programming from a local cluster up to multi-cluster grid environments
• Makes use of remote PC clusters as grid computing resources
• OmniRPC consists of three kinds of components: client, remote executable (rex), and agent
• The OmniRPC agent works as a proxy for the communication between the client program and the remote executables
[Figure: client – agent over the Internet/network – remote executables (rex), with agent invocation and multiplexed I/O]

  10. Extensions of OmniRPC for the proposed system, and implementations
• We have designed class libraries and an interface to adapt to different GJSSs (a sketch follows below)
• The agent handles conversion between the OmniRPC protocol and the GJSS protocol
• The agent takes care of submitted jobs for fault tolerance of the worker program
• The remote executable performs its I/O through files
• Implementations:
  • XtremWeb (by INRIA) -> OmniRPC/XW
  • CyberGRIP (by Fujitsu Lab) -> OmniRPC/CG
  • Condor (by U. of Wisconsin-Madison) -> OmniRPC/C
  • Open Source Grid Engine (by Sun) -> OmniRPC/GE
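The interface itself is not shown on the slide; the following is a minimal sketch, in a C table-of-function-pointers style, of how one back end per GJSS could be plugged into the agent. All names (gjss_ops_t, submit, poll, fetch_result, resubmit) are invented for illustration and are not the actual OmniRPC internals.

    /* Hypothetical per-GJSS back-end interface (illustrative only). */
    typedef struct gjss_job gjss_job_t;            /* opaque per-job handle */

    typedef struct gjss_ops {
        /* submit a worker executable plus its input document as one job */
        gjss_job_t *(*submit)(const char *executable, const char *input_file);
        /* poll job status: 0 = running, 1 = done, -1 = failed/lost */
        int (*poll)(gjss_job_t *job);
        /* copy the job's output file back to complete the RPC result */
        int (*fetch_result)(gjss_job_t *job, const char *output_file);
        /* resubmit a failed job: the agent-side fault tolerance */
        gjss_job_t *(*resubmit)(gjss_job_t *job);
    } gjss_ops_t;

    /* one table per supported system; the agent selects the matching one,
     * e.g. for OmniRPC/XW, OmniRPC/CG, OmniRPC/C and OmniRPC/GE */
    extern const gjss_ops_t xtremweb_ops, cybergrip_ops, condor_ops, sge_ops;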

  11. Experiment
• Question: how much performance improvement do we get by using two different GJSSs concurrently?
• Application: parallelized version of Codeml in the PAML package [Yang97]
  • Analyzes the phylogeny of DNA or protein sequences using maximum likelihood
  • Processes 1000 DNA sequences using 200 asynchronous RPC calls
• Experimental setting:
  • Open Source Grid Engine (Dennis) + Condor (Alice cluster)

  12. Result
• Sample data: 200 asynchronous RPCs; amount of data per RPC: IN 100 KB, OUT 30 KB
• 20.6 times speedup by using the two clusters
• The performance improvement was limited in contrast with using a single cluster
• Load imbalance among the RPC executions limits the performance improvement

  13. OmniStorage: Performance Improvement by a Data Management Layer in a Grid RPC System

  14. Parameter search applications
• Parameter search applications often need a large amount of common data
• Example: master-worker parallel eigenvalue solver
  • Solves large-scale eigenvalue problems with the RPC model
  • Common data = a large-scale sparse matrix
  • If the matrix is very large, it takes a long time to send it to every worker
[Figure: the master sends parameters plus the same large initial data to each worker]

  15. Problems in Grid RPC
• In the RPC model, the master communicates with the workers one by one
• Only direct communication between master and worker is supported
• This does not correspond to the actual network topology
• The network bandwidth between master and workers may become a performance bottleneck (a rough estimate follows below)
[Figure: a single master connected directly to many workers]
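As a back-of-the-envelope illustration, take the figures from the experiment later in the talk (80 workers, 50 MB of common data) and assume, purely for illustration, a 100 Mbit/s uplink at the master:

    50 MB x 80 workers = 4 GB = 32 Gbit of identical data leaving the master;
    32 Gbit / 100 Mbit/s = 320 s spent on distribution alone, serialized on one link.
    If a relay inside each remote cluster received the data instead, the WAN
    would carry the 50 MB only once (about 4 s at the same rate).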

  16. Our proposal
• We propose a programming model that decouples the data transfer layer from the RPC layer
• This enables optimization of data transfers between the master and the workers
• We developed OmniStorage as a prototype to investigate this model

  17. OmniStorage overview
• OmniRPC layer: the master invokes the workers one by one, passing invocation + arguments
• Data transfer layer (“OmniStorage”): handles the large data
  • OmstPutData(): register data
  • OmstGetData(): retrieve data
[Figure: master and workers connected through the OmniRPC layer for invocation and through OmniStorage for large data transfer]

  18. Programming example using OmniRPC only
Master program:
    int main(){
        double initialdata[1000*1000], output[100][1000];
        OmniRpcRequest req[100];  /* request handles (declaration added for completeness) */
        ...
        for(i = 0; i < 100; i++){
            /* initialdata is shipped again on every one of the 100 calls */
            req[i] = OmniRpcCallAsync("MyProcedure", i, initialdata, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }
Worker program (worker's IDL):
    Define MyProcedure(int IN i, double IN initialdata[1000*1000],
                       double OUT output[1000]){
        ... /* program in C */
    }

  19. Programming example using OmniRPC with OmniStorage
Master program:
    int main(){
        double initialdata[1000*1000], output[100][1000];
        ...
        /* register the common data once: identifier, pointer, data size */
        OmstPutData("MyInitialData", initialdata, sizeof(double)*1000*1000);
        for(i = 0; i < 100; i++){
            /* the RPC call no longer carries the large initial data */
            req[i] = OmniRpcCallAsync("MyProcedure", i, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }
Worker program (worker's IDL):
    Define MyProcedure(int IN i, double OUT output[1000]){
        double initialdata[1000*1000];  /* now a worker-local buffer */
        /* retrieve the common data by its identifier */
        OmstGetData("MyInitialData", initialdata, sizeof(double)*1000*1000);
        ... /* program in C */
    }

  20. OmniStorage prototype
• First objective: speed up the distribution of (large) initial data
• Relay nodes and workers can cache received data
• Data transfer of the same content occurs only once
[Figure: the OmniRPC work management layer (agents, invocation of workers) on top of the OmniStorage data management layer (servers), accessed via OmstPutData()/OmstGetData()]

  21. Prototype implementation of OmniStorage
• Communication through the WAN is done only once: the OmniRPC client host at site A calls OmstPutData(), and the data and the request cross the WAN a single time
• A relay node at site B (the master node of the cluster) forwards the data inside the cluster
• Workers call OmstGetData(); requests and data then travel only on the cluster-internal network before processing begins
[Figure: site A (OmniRPC client host, OmstPutData()) – WAN – site B relay node – cluster-internal requests/data to the OmniRPC workers]

  22. Performance
• We used the master-worker parallel eigenvalue solver (Sakurai et al.)
  • It has 80 jobs, and each job takes about 30 seconds
  • Each worker requires 50 MB of common data
[Figure: execution time and performance ratio vs. 1 node]

  23. Coarse-grain eigenvalue solver: moment-based method
• Find all of the eigenvalues that lie inside a given domain (e.g., a circular region)
• A small matrix pencil that has only the desired eigenvalues is derived by solving large sparse systems of linear equations constructed from A and B
• Since these equations can be solved independently, we solve them on remote hosts in parallel (a sketch of the formulas is given below)
• This approach is suitable for master-worker programming models
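The slide names the method without formulas; the following is a brief sketch in the usual moment-based (Sakurai-Sugiura style) formulation, assuming a circle \Gamma with center \gamma and radius \rho and probe vectors u, v:

    % contour moments (k = 0, 1, ..., 2m-1)
    \mu_k = \frac{1}{2\pi i} \oint_\Gamma (z-\gamma)^k \, u^H (zB - A)^{-1} v \, dz

    % N-point trapezoidal rule on the circle, z_j = \gamma + \rho e^{2\pi i j/N}:
    \mu_k \approx \frac{1}{N} \sum_{j=0}^{N-1} (z_j - \gamma)^{k+1} \, u^H y_j,
    \qquad (z_j B - A)\, y_j = v

    % the N large sparse solves for y_j are mutually independent, so each can
    % be one RPC job; a small Hankel pencil then yields the eigenvalues:
    H_m = [\mu_{i+j-2}]_{i,j=1}^{m}, \quad H_m^{<} = [\mu_{i+j-1}]_{i,j=1}^{m}, \quad
    H_m^{<} x = \zeta H_m x \;\Rightarrow\; \lambda = \gamma + \zeta \ \text{inside } \Gamma

The independence of the N linear solves is what makes the master-worker decomposition of the previous slides natural.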

  24. Future work
• Implement a function for collecting result data from the workers back to the master
  • The current version of OmniStorage can only send data from the master to the workers
• Examine other systems for the data management layer, since the API does not depend on the implementation:
  • Distributed hash tables
  • Gfarm – distributed file system
  • BitTorrent – P2P file sharing

  25. Proposals for “Instant Grid” in the NEGST project
• Instant Grid
• Network measurement
• Interoperability

  26. Background
Project: “Study on P2P grid infrastructure for ‘large-capacity’ distributed computing”
• Supported by JSPS; project period: 2005 to 2007 (3 years)
• P2P grid = grid + P2P distributed computing
  • A “grid” infrastructure that exploits computing power using P2P (peer-to-peer) technologies
• “Large-capacity” distributed computing:
  • Large-scale computing
    • Parameter searches; not large-scale parallel programs with MPI
  • Large amounts of data
    • We want to handle large-scale data that cannot be stored at a single site

  27. Our research topics for “Instant Grid”
• P2P infrastructure
  • P2P network layer, traversing firewalls/NAT
    • Overlay networks and UDP hole punching
  • Interoperability with other grid/P2P middleware (starting from OmniRPC/XW)
  • Virtualization of resources (VM technology)
    • BEE: Linux binary execution environment on Windows
    • Checkpointing/migration, …
• Data management
  • P2P data storage system for large-scale persistent data in a volatile P2P grid environment
    • Gfarm for P2P environments (by Tatebe)
  • OmniStorage: data layer for Grid RPC
• Numerical algorithms for P2P grids
  • Collaboration with Serge Petiton’s group

  28. Overlay network for P2P
• Computing resources in a P2P grid will be PCs in offices and homes
• P2P ad-hoc networking through NAT
  • More and more home and office PCs run with private IP addresses behind NAT (or a personal firewall)
  • For P2P computing with them, we need to solve the problem of NAT-traversal communication
  • TCP is useful, but only outbound connections from inside to outside the NAT can be established
• Overlay network for the P2P grid
  • A logical network to aggregate the computing resources in the P2P grid

  29. Several ways of NAT traversal
• UDP hole punching
  • When what looks like the reply to an outbound UDP packet arrives, the NAT lets it pass
  • Only available for UDP (unreliable, connection-less communication)
  • The NAT judges a packet to be a reply by the combination of IP address and port number, and by the time interval since the outbound packet
  • It is therefore easy to pass an ordinary packet off as a “fake” reply
• TCP hole punching
  • Same idea as UDP hole punching, but also available on TCP [Bryan06]
  • The NAT uses more detailed information from the TCP packet header than for UDP, so “faked” packets are usually rejected
  • With packet-header modification, it is possible to get TCP packets through the NAT
    • But packet modification requires modified network drivers, which is not suitable for P2P communication
  • Not many NATs allow it
• NAT with UPnP
  • Pushed by Microsoft and other companies; most home NATs accept it
  • Not available on high-end routers (those not intended for home use)

  30. UDP hole punching
• An arbitration server is required to begin the communication
• After that, direct client-to-client P2P communication is possible
• Limited to UDP (a minimal client-side sketch follows below)
[Figure: client1 (domain1 behind NAT1) and client2 (domain2 behind NAT2) each open an outgoing TCP connection to the arbitration server, then open a direct UDP connection to each other]
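A minimal client-side sketch (C, POSIX sockets), assuming the arbitration step has already happened and each client knows the peer's public IP and port; the address constants below stand in for what the server would supply, and error handling is omitted:

    /* Minimal UDP hole-punching client sketch. Both peers run this:
     * the first outbound datagram opens a mapping in the local NAT,
     * after which the peer's datagrams are accepted as "replies". */
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(40000);                       /* from arbitration server */
        inet_pton(AF_INET, "198.51.100.7", &peer.sin_addr); /* from arbitration server */

        for (int i = 0; i < 5; i++) {                       /* punch: send first */
            sendto(sock, "punch", 5, 0, (struct sockaddr *)&peer, sizeof(peer));
            sleep(1);
        }

        char buf[1500];
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
        if (n > 0)
            printf("direct UDP path established (%zd bytes)\n", n);
        close(sock);
        return 0;
    }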

  31. RI2N/UDP for UDP hole punching
RI2N: Redundant Interconnection with Inexpensive Network, on UDP
• Originally designed and implemented for reliable communication on PC clusters with multi-link Ethernet
• User-level implementation as a communication library whose API is similar to the TCP socket interface (a hypothetical sketch follows below)
• To avoid lock-in-kernel states (no response to the user), it is implemented on UDP with flow control and retransmission
• Available in any UDP-compatible environment; no kernel modification is required
• Although the low-level protocol is UDP, RI2N/UDP provides reliable stream communication like TCP
• We use RI2N/UDP as the low-level layer to traverse NATs for our P2P overlay network
• Status and plan:
  • Experiments and evaluation of RI2N/UDP for hole punching are done
  • Design the high-level algorithms for routing/discovery
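The talk does not show the RI2N/UDP API, so the sketch below is only a guess at what a "TCP-socket-like interface over UDP" might look like; every ri2n_* name is invented for illustration:

    /* Hypothetical socket-like API in the spirit of RI2N/UDP; the real
     * library's interface is not given in the talk. Sequencing, ACKs,
     * retransmission and flow control would all live behind these calls,
     * entirely in user space (no kernel modification). */
    #include <sys/types.h>

    typedef struct ri2n_conn ri2n_conn_t;   /* opaque connection state */

    /* establish a reliable stream to a peer reached via hole punching */
    ri2n_conn_t *ri2n_connect(const char *peer_host, int peer_port);

    /* same shape as write()/read() on a TCP socket */
    ssize_t ri2n_send(ri2n_conn_t *c, const void *buf, size_t len);
    ssize_t ri2n_recv(ri2n_conn_t *c, void *buf, size_t len);

    void ri2n_close(ri2n_conn_t *c);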

  32. Linux binary execution environment on Windows for Grid RPC
• Objective
  • Exploit computing resources running under Windows as OmniRPC workers
• Approach
  • Direct execution of the Linux binary of the RPC worker, without recompilation (cf. Wine, LINE); not a VM
• BEE: Linux binary execution environment on Windows for Grid RPC
  • Loader for Linux binaries on Windows
  • Emulation of system calls (currently limited for network system calls; a toy sketch follows below)
• Status and plan
  • First prototype is finished (Linux 2.4)
  • Linux 2.6 support and process migration
[Figure: the Linux-host RPC client calls the OmniRPC agent, which uploads and invokes the Linux binary of the RPC worker; BEE loads it on the Windows remote host and returns the result]
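The slide does not show the emulation internals; the toy sketch below only illustrates the general idea of dispatching trapped Linux 2.4 i386 syscall numbers to Win32 calls. regs_t and emulate_syscall() are invented names; a loader like BEE would trap "int 0x80" and arrive here with the register values:

    /* Toy Linux-i386-syscall-to-Win32 dispatcher (illustrative only). */
    #include <windows.h>

    typedef struct { unsigned eax, ebx, ecx, edx; } regs_t;

    unsigned emulate_syscall(regs_t *r) {
        switch (r->eax) {                /* Linux i386 syscall number */
        case 1:                          /* exit(status) */
            ExitProcess(r->ebx);         /* does not return */
        case 4: {                        /* write(fd, buf, count) */
            DWORD written = 0;
            HANDLE h = GetStdHandle(r->ebx == 2 ? STD_ERROR_HANDLE
                                                : STD_OUTPUT_HANDLE);
            WriteFile(h, (const void *)r->ecx, r->edx, &written, NULL);
            return written;
        }
        default:
            return (unsigned)-38;        /* -ENOSYS: not emulated yet */
        }
    }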

  33. End … Q & A?
