The Design Concept and Initial Implementation of AgentTeamwork Grid Computing Middleware

Munehiro Fukuda, Computing & Software Systems, University of Washington, Bothell
Koichi Kashiwagi and Shinya Kobayashi, Computer Science, Ehime University
Funded by [funder logo not preserved]
IEEE PacRim 2005

Background
• Most grid-computing systems
  • Centralized resource/job management
  • Two drawbacks
    • A powerful central server essential to manage all slave computing nodes
    • Applications based on master-slave or parameter-sweep model
• Mobile agents
  • An execution model previously highlighted as a prospective infrastructure of distributed systems.
  • No more than an alternative approach to centralized grid middleware implementation.
• Our motivation
  • Decentralized job distribution and coordination
  • Decentralized fault tolerance
  • Applications based on a variety of communication models

Objective
• A mobile agent execution platform fitted to grid computing
  • Allowing an agent to identify which MPI rank to handle and which agent to send a job snapshot to.
• A fault-tolerant inter-process communication
  • Recovering lost messages.
  • Allowing over-gateway connections.
• Agent-collaborative algorithms for job coordination
  • Allocating computing nodes in a distributed manner.
  • Implementing decentralized snapshot maintenance and job recovery.

System Overview
[Figure: User A and User B with their commander agents, resource agents, sentinel agents, and bookkeeper agents. Three user processes are drawn (two for User A, one for User B), each labeled with a sentinel agent, a user program wrapper, GridTCP, and snapshot methods, and linked by TCP communication. Other labels: snapshots, results, FTP server.]

Execution Layer (top to bottom)
• Java user applications
• mpiJava API
• mpiJava-S, mpiJava-A
• Java socket, GridTcp
• User program wrapper
• Commander, resource, sentinel, and bookkeeper agents
• UWAgents mobile agent execution platform
• Operating systems

UWAgents Execution Platform
• An agent domain is created per submission from the Unix shell
• The number of children each agent can spawn is given at the initial submission
• No name server
• Messages forwarded through an agent tree
• A user job scheduled as a thread, using suspend/resume
[Figure: UWInject submits a new agent from the shell. Two agent domains are shown, (time=3:30pm, 8/25/05, ip=medusa.uwb.edu, name=fukuda) and (time=3:31pm, 8/25/05, ip=perseus.uwb.edu, name=fukuda), each rooted at an agent with id 0; the submissions are labeled -m 4 and -m 3. Agents id 0 through id 12 are drawn as trees. Other labels: User A, user job, UWPlace.]
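The agent ids drawn on these slides are consistent with a simple rule: with -m set to m, the children of agent p get ids p * m + i. The sketch below assumes that rule and shows one plausible reading of "messages forwarded through an agent tree", namely climbing to the lowest common ancestor and then descending. The class and method names (AgentTreeSketch, childIds, parentId, route) are hypothetical and are not part of UWAgents.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the id rule is inferred from the ids drawn on the slides,
// not taken from the UWAgents source or API.
class AgentTreeSketch {

    // Ids of the children of parentId when each agent may spawn up to m children.
    static List<Integer> childIds(int parentId, int m) {
        requireValid(parentId, m);
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            int id = parentId * m + i;
            if (id != parentId) { // agent 0 would otherwise list itself as a child
                ids.add(id);
            }
        }
        return ids;
    }

    // Inverse of the rule above; the root (id 0) is its own parent.
    static int parentId(int id, int m) {
        requireValid(id, m);
        return id / m;
    }

    // Hops a message would take if it is relayed up to the lowest common
    // ancestor of the two agents and then down (an assumption of this sketch).
    static List<Integer> route(int from, int to, int m) {
        List<Integer> up = pathToRoot(from, m);
        List<Integer> down = pathToRoot(to, m);
        int i = up.size() - 1;
        int j = down.size() - 1;
        // Both paths end at the root; walk back while they still agree.
        while (i > 0 && j > 0 && up.get(i - 1).equals(down.get(j - 1))) {
            i--;
            j--;
        }
        List<Integer> path = new ArrayList<>(up.subList(0, i + 1));
        for (int k = j - 1; k >= 0; k--) {
            path.add(down.get(k));
        }
        return path;
    }

    private static List<Integer> pathToRoot(int id, int m) {
        requireValid(id, m);
        List<Integer> path = new ArrayList<>();
        path.add(id);
        while (id != 0) {
            id = id / m;
            path.add(id);
        }
        return path;
    }

    private static void requireValid(int id, int m) {
        if (id < 0 || m < 2) {
            throw new IllegalArgumentException("need id >= 0 and m >= 2");
        }
    }
}
```

With m = 4, childIds(2, 4) gives 8, 9, 10, 11, which matches the sentinel ids drawn on the job distribution slide, and route(8, 12, 4) passes through agents 2, 0, and 3.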

User Job Distribution
[Figure: agent tree created at job submission. id: agent id, rank: MPI rank. Arrows labeled "snapshot" appear between agents.]
• Commander id 0 (job submission, spawn)
• Resource id 1 (XML query, eXist)
• rank 0: Sentinel id 2, Bookkeeper id 3
• rank 1: Sentinel id 8, Bookkeeper id 12
• rank 2: Sentinel id 9, Bookkeeper id 13
• rank 3: Sentinel id 10, Bookkeeper id 14
• rank 4: Sentinel id 11, Bookkeeper id 15
• rank 5: Sentinel id 32, Bookkeeper id 48
• rank 6: Sentinel id 33, Bookkeeper id 49
• rank 7: Sentinel id 34, Bookkeeper id 50
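The id and rank pairs on this slide suggest how an agent could work out, from its own id alone, which MPI rank it handles and which bookkeeper receives its snapshots. The sketch below assumes four children per agent, sentinel ids rooted at 2, bookkeeper ids rooted at 3, and ranks assigned level by level within each subtree. These rules are inferred from the eight pairs shown, and the names (RankMappingSketch, level, rank, bookkeeperFor) are hypothetical.

```java
// Sketch only: rank and bookkeeper rules are inferred from the ids on the slide.
class RankMappingSketch {
    static final int M = 4;               // children per agent, as drawn (assumption)
    static final int SENTINEL_ROOT = 2;   // Sentinel id 2 handles rank 0
    static final int BOOKKEEPER_ROOT = 3; // Bookkeeper id 3 handles rank 0

    // Depth of agent id below the given subtree root, or -1 if it is not in that subtree.
    static int level(int id, int root) {
        int level = 0;
        while (id > root) {
            id /= M;
            level++;
        }
        return id == root ? level : -1;
    }

    // M raised to the given level, i.e. the number of agents on that level of a full subtree.
    static int width(int level) {
        int w = 1;
        for (int i = 0; i < level; i++) {
            w *= M;
        }
        return w;
    }

    // MPI rank of a sentinel (root = 2) or bookkeeper (root = 3), counted level by level.
    static int rank(int id, int root) {
        int level = level(id, root);
        if (level < 0) {
            throw new IllegalArgumentException("agent " + id + " is not under root " + root);
        }
        int w = width(level);
        int ranksOnUpperLevels = (w - 1) / (M - 1); // 1 + M + ... + M^(level-1)
        return ranksOnUpperLevels + (id - root * w);
    }

    // Id of the bookkeeper that has the same rank as the given sentinel.
    static int bookkeeperFor(int sentinelId) {
        int level = level(sentinelId, SENTINEL_ROOT);
        if (level < 0) {
            throw new IllegalArgumentException("agent " + sentinelId + " is not a sentinel");
        }
        return sentinelId + width(level);
    }
}
```

Under these assumptions, rank(34, SENTINEL_ROOT) is 7 and bookkeeperFor(11) is 15, matching the pairs drawn on the slides.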

Resource Allocation
• Job submission from the user to Commander id 0: total nodes x multiplier
• Resource id 1 and eXist: an XML query (CPU architecture, OS, memory, disk, total nodes, multiplier) returns a list of available nodes
• Spawn
• Case 1: Total nodes = 2, Multiplier = 1.5
• Case 2: Total nodes = 2, Multiplier = 3
[Figure: Nodes 0 through 5. In both cases, Sentinel id 2 (rank 0), Sentinel id 8 (rank 1), and two bookkeepers are placed on the allocated nodes; the remaining nodes are marked "Future use".]
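The two cases reduce to a small piece of arithmetic: the number of nodes requested is total nodes times the multiplier, so Case 1 asks for 3 nodes and Case 2 for 6. A minimal sketch follows; rounding a fractional product up is an assumption of this example, and the names NodeRequestSketch and nodesToRequest are hypothetical.

```java
// Sketch only: rounding up is an assumption; the slide gives just the two cases.
class NodeRequestSketch {

    // Number of nodes to ask the resource agent for.
    static int nodesToRequest(int totalNodes, double multiplier) {
        if (totalNodes < 1 || multiplier < 1.0) {
            throw new IllegalArgumentException("need totalNodes >= 1 and multiplier >= 1");
        }
        return (int) Math.ceil(totalNodes * multiplier);
    }

    public static void main(String[] args) {
        System.out.println(nodesToRequest(2, 1.5)); // Case 1
        System.out.println(nodesToRequest(2, 3.0)); // Case 2
    }
}
```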

Job Resumption by a Parent Sentinel
(0) Send a new snapshot periodically
(1) Detect a ping error
(2) Search for the latest snapshot
(3) Retrieve the snapshot
(4) Send a new agent
(5) Restart a user program
[Figure: Sentinel id 2 (rank 0) with Sentinels id 8 (rank 1), id 9 (rank 2), id 10 (rank 3), and id 11 (rank 4), linked by MPI connections. Also shown: New Sentinel id 11 (rank 4) and Bookkeeper id 15 (rank 4).]
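Step (2), searching for the latest snapshot, can be illustrated in isolation. The sketch below assumes each stored snapshot carries its MPI rank and a monotonically increasing sequence number; the Snapshot type, its fields, and latestFor are hypothetical and stand in for whatever the bookkeeper agents actually store.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch only: the snapshot record and its sequence number are assumptions.
class SnapshotLookupSketch {

    static final class Snapshot {
        final int rank;      // MPI rank the snapshot belongs to
        final long sequence; // hypothetical counter; a larger value means a newer snapshot

        Snapshot(int rank, long sequence) {
            this.rank = rank;
            this.sequence = sequence;
        }
    }

    // Newest stored snapshot for the given rank, or empty if none exists.
    static Optional<Snapshot> latestFor(int rank, List<Snapshot> stored) {
        return stored.stream()
                .filter(s -> s.rank == rank)
                .max(Comparator.comparingLong((Snapshot s) -> s.sequence));
    }
}
```

An empty result would mean there is nothing to retrieve in step (3), in which case the user program could only be restarted from its beginning.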

Job Resumption by a Child Sentinel
(1) No pings for 8 * 5 (= 40 sec)
(2) Search for the latest snapshot
(3) Search for the latest snapshot
(4) Retrieve the snapshot
(5) Send a new agent
(6) No pings for 2 * 5 (= 10 sec)
(7) Search for the latest snapshot
(8) Search for the latest snapshot
(9) Retrieve the snapshot
(10) Send a new agent
(11) Detect a ping error
(12) Restart a new resource agent from its beginning
(13) Detect a ping error and follow the same child resumption procedure as in p9.
Unnumbered label: No pings for 12 * 5 (= 60 sec)
[Figure: agents shown are Commander id 0, New Resource id 1, Sentinel id 2 (rank 0), Bookkeeper id 3 (rank 0), Sentinel id 8 (rank 1), and Bookkeeper id 12 (rank 1).]
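The three timeouts on this slide (2 * 5, 8 * 5, and 12 * 5 seconds) line up with the ids of Sentinel id 2, Sentinel id 8, and Bookkeeper id 12, which suggests that an agent waits for its own id times 5 seconds before treating missing pings as a failure. The sketch below assumes exactly that rule; it is an inference from the figure, and PingTimeoutSketch, pingTimeout, and presumedFailed are hypothetical names.

```java
import java.time.Duration;

// Sketch only: "timeout = agent id * 5 seconds" is inferred from the three
// values on the slide, not confirmed by the source.
class PingTimeoutSketch {
    static final int SECONDS_PER_ID = 5;

    // How long an agent waits without pings before starting resumption.
    static Duration pingTimeout(int agentId) {
        if (agentId < 1) {
            throw new IllegalArgumentException("only agents with id >= 1 wait for pings here");
        }
        return Duration.ofSeconds((long) agentId * SECONDS_PER_ID);
    }

    // True once the silence has lasted at least as long as the agent's timeout.
    static boolean presumedFailed(int agentId, Duration sinceLastPing) {
        return sinceLastPing.compareTo(pingTimeout(agentId)) >= 0;
    }
}
```

One consequence of such a rule is that agents with smaller ids react first, so two agents watching the same failure do not both start a resumption at the same moment.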

Computational Granularity 1

Performance Evaluation - Series

Performance Evaluation - RayTracer

Performance Evaluation - MolDyn

Overhead of Job Resumption

Conclusions
• Our focus
  • A decentralized job execution and fault-tolerant environment
  • Applications not restricted to the master-slave or parameter-sweeping model.
• Applications
  • 40,000 doubles x 10,000 floating-point operations
  • Moderate data transfer combined with massive/collective communication
  • At least three times larger than its computational granularity
• Future work
  • UWAgents enhancement: over-gateway deployment and security
  • Programming support: preprocessor implementation
  • Job scheduling algorithms: priority-based agent migration
