1 / 43

AgentTeamwork : Mobile-Agent-Based Middleware for Distributed Job Coordination

AgentTeamwork : Mobile-Agent-Based Middleware for Distributed Job Coordination. Munehiro Fukuda Computing & Software Systems, University of Washington, Bothell Funded by. Outline. Introduction Execution Model System Design Performance Evaluation Related Work Conclusions. 1. Introduction.

garyalvarez
Download Presentation

AgentTeamwork : Mobile-Agent-Based Middleware for Distributed Job Coordination

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. AgentTeamwork: Mobile-Agent-Based Middleware for Distributed Job Coordination Munehiro Fukuda Computing & Software Systems, University of Washington, Bothell Funded by AgentTeamwork

  2. Outline • Introduction • Execution Model • System Design • Performance Evaluation • Related Work • Conclusions AgentTeamwork

  3. 1. Introduction • Why Grid Computing • Background • Objective • Project Overview AgentTeamwork

  4. Why Grid Computing • Textbooks say: • Only 30% CPU utilization • Only episodic job requirements • Anyone and anywhere like a power grid • Many research prototypes and commercial products: • Globus, Condor, Legion(Avaki), NetSolve, Ninf, Entropia PCGrid, Sun Grid Engine, etc. • Then, have you ever used them? • Probably not so many of you. • What is a big hurdle? • You don’t need it anyway. Or, what? AgentTeamwork

  5. BackgroundMost Grid Systems • Functional viewpoints: • Centralized resource/job management • Two drawbacks • A powerful central server essential to manage all slave computing nodes • Applications based on master-slave or parameter-sweep model • Out motivation • Decentralized job distribution, coordination, and fault tolerance • Applications based on a variety of communication models • Practical viewpoints: • Systems dedicated to large institutions/companies • Two drawbacks • A lot of installation work required under the root account. • A group of individual computer owners not targeted at. • Our motivation • Easy participation in grid-computing and easy installation AgentTeamwork

  6. BackgroundHow to Pursue Our Motivation • Use of mobile agents • We are experts in mobile agents. • Most mobile agents • An execution model previously highlighted as a prospective infrastructure of distributed systems. • No more than an alternative approach to centralized grid middleware implementation. • Our initial goal • Decentralized middleware design with mobile agents AgentTeamwork

  7. Objective • A mobile agent execution platform fitted to grid computing • Allowing an agent to identify which MPI rank to handle and which agent to send a job snapshot to. • A fault-tolerant inter-process communication • Recovering lost messages. • Allowing over-gateway connections. • Agent-collaborative algorithms for job coordination • Allocating computing nodes in a distributed manner. • Implementing decentralized snapshot maintenance and job recovery. AgentTeamwork

  8. Project Overview • Funded by: NSF Middleware Initiative • Sponsored by: University of Washington • In Collaboration of: Ehime University • In a Team of: UWB Undergraduates AgentTeamwork

  9. 2. Execution Model • System Overview • Execution Layer • Programming Interface AgentTeamwork

  10. Snapshot Methods Snapshot Methods GridTCP GridTCP User program wrapper User program wrapper Results Results snapshot snapshot snapshot User A User B FTP Server snapshots snapshots System Overview User A’s Process User A’s Process User B’s Process TCP Communication Snapshot Methods GridTCP User program wrapper Sentinel Agent Sentinel Agent Sentinel Agent Commander Agent Resource Agent Resource Agent Commander Agent AgentTeamwork BookkeeperAgent Bookkeeper Agent

  11. mpiJava-S mpiJava-A Execution Layer Java user applications mpiJava API Java socket GridTcp User program wrapper Commander, resource, sentinel, and bookkeeper agents UWAgents mobile agent execution platform Operating systems AgentTeamwork

  12. Programming Interface public class MyApplication { public GridIpEntry ipEntry[]; // used by the GridTcp socket library public int funcId; // used by the user program wrapper public GridTcp tcp; // the GridTcp error-recoverable socket public int nprocess; // #processors public int myRank; // processor id ( or mpi rank) public int func_0( String args[] ) { // constructor MPJ.Init( args, ipEntry, tcp );// invoke mpiJava-A .....; // more statements to be inserted return 1; // calls func_1( ) } public int func_1( ) { // called from func_0 if ( MPJ.COMM_WORLD.Rank( ) == 0 ) MPJ.COMM_WORLD.Send( ... ); else MPJ.COMM_WORLD.Recv( ... ); .....; // more statements to be inserted return 2; // calls func_2( ) } public int func_2( ) { // called from func_2, the last function .....; // more statements to be inserted MPJ.finalize( );// stops mpiJava-A return -2; // application terminated } } AgentTeamwork

  13. 3. System Design • Mobile Agents • Job Coordination • Distribution • Monitoring • Resumption and migration • Programming Support • Language preprocessing • Communication check-pointing AgentTeamwork

  14. UWInject: submits a new agent from shell. id 0 Agent domain (time=3:30pm, 8/25/05 ip = medusa.uwb.edu name = fukuda) id 0 -m 4 -m 3 id 1 id 1 id 2 id 3 id 2 User A user job UWPlace id 12 Agent domain (time=3:31pm, 8/25/05 ip = perseus.uwb.edu name = fukuda) id 4 id 5 id 6 id 7 id 8 id 9 id 10 id 11 UWAgents – Concept of Agent Domain • Agent domain created per each submission from the Unix shell • # children each agent can spawn is given upon the initial submission • No name server • Messages forwarded through an agent tree • A user job scheduled as a thread, using suspend/resume AgentTeamwork

  15. UWAgents – Over Gateway Migration AgentTeamwork

  16. User snapshot snapshot Job Distribution Job Submission Commander id 0 XML Query Spawn eXist Resource id 1 Sentinel id 2 rank 0 Bookkeeper id 3 rank 0 Sensor id 4 Sensor id 5 Sentinel id 8 rank 1 Sentinel id 9 rank 2 Sentinel id 10 rank 3 Sentinel id 11 rank 4 Bookkeeper id 12 rank 1 Bookkeeper id 13 rank 2 Bookkeeper id 14 rank 3 Bookkeeper id 15 rank 4 Sentinel id 32 rank 5 Sentinel id 33 rank 6 Sentinel id 34 rank 7 Bookkeeper id 48 rank 5 Bookkeeper id 49 rank 6 Bookkeeper id 50 rank 7 id: agent id rank: MPI Rank AgentTeamwork

  17. User Node 4 Node 1 Node 0 Node 2 Node 2 Node 0 Node 1 Node 3 Node5 Resource Allocation Job submission total nodes x multiplier eXist Resource id 1 Commander id 0 An XML query CPU Architecture OS Memory Disk Total nodes Multiplier A list of available nodes Spawn Case 1: Total nodes = 2 Multiplier = 1.5 Sentinel id 2 rank 0 Sentinel id 8 rank 1 Bookkeeper id 12 rank 5 Bookkeeper id 2 rank 0 Future use Sentinel id 2 rank 0 Sentinel id 8 rank 1 Bookkeeper id 12 rank 5 Bookkeeper id 2 rank 0 Case 2: Total nodes = 2 Multiplier = 3 Future use Future use AgentTeamwork

  18. An XML query Sensor id 4 Sensor id 5 Performance data ttcp Sensor id 16 Sensor id 17 Sensor id 18 Sensor id 19 Sensor id 20 Sensor id 21 Sensor id 22 Sensor id 23 ttcp ttcp Resource Monitoring A resource request eXist Resource id 1 Commander id 0 A list of available nodes Spawn • Current restrictions • Minimum interval: 3secs • Static distribution of sensor agents • Future extensions • Sensor migration • Use of NWS at each site AgentTeamwork

  19. (2) Search for the latest snapshot (3) Retrieve the snapshot (4) Send a new agent (1) Detect a ping error New Sentinel id 11 rank 4 (5) Restart a user program (0) Send a new snapshot periodically Job Resumption by a Parent Sentinel Sentinel id 2 rank 0 MPI connections Sentinel id 8 rank 1 Sentinel id 9 rank 2 Sentinel id 10 rank 3 Sentinel id 11 rank 4 Bookkeeper id 15 rank 4 AgentTeamwork

  20. (12) Restart a new resource agent from its beginning Commander id 0 (11) Detect a ping error (13) Detect a ping error and follow the same child resumption procedure as in p9. (10) Send a new agent (6) No pings for 2 * 5 (= 10sec) (7) Search for the latest snapshot New Resource id 1 Sentinel id 2 rank 0 (2) Search for the latest snapshot (8) Search for the latest snapshot (1) No pings for 8 * 5 (= 40sec) (9) Retrieve the snapshot No pings for 12 * 5 (= 60sec) (5) Send a new agent (3) Search for the latest snapshot (4) Retrieve the snapshot Job Resumption by a Child Sentinel Commander id 0 New Resource id 1 Sentinel id 2 rank 0 Bookkeeper id 3 rank 0 Sentinel id 8 rank 1 Bookkeeper id 12 rank 1 AgentTeamwork

  21. User Program Wrapper User Program Wrapper Source Code func_0( ) { statement_1; statement_2; statement_3; return 1; } func_1( ) { statement_4; statement_5; statement_6; return 2; } func_2( ) { statement_7; statement_8; statement_9; return -2; } int fid = 1; while( fid == -2) { switch( func_id ) { case 0: fid = func_0( ); case 1: fid = func_1( ); case 2: fid = func_2( ); } } check_point( ) { // save this object // including func_id // into a file } statement_1; statement_2; statement_3; statement_4; statement_5; statement_6; statement_7; statement_8; statement_9; check_point( ); check_point( ); check_point( ); Preprocessed AgentTeamwork

  22. Preproccesser and Drawback Preprocessed Code Preprocessed Source Code int func_0( ) { statement_1; statement_2; statement_3; return 1; } int func_1( ) { while(…) { statement_4; if (…) { statement_5; return 2; } else statement_7; statement_8; } } int func_2( ) { statement_6; statement_8; while(…) { statement_4; if (…) { statement_5; return 2; } else statement_7; statement8; } } statement_1; statement_2; statement_3; check_point( ); while (…) { statement_4; if (…) { statement_5; check_point( ); statement_6; } else statement_7; statement_8; } check_point( ); • No recursions • Useless source line numbers indicated upon errors • Still need of explicit snapshot points. Before check_point( ) in if-clause After check_point( ) in if-clause AgentTeamwork

  23. User Program Wrapper rank ip 1 n1.uwb.edu user program 2 n2.uwb.edu n3.uwb.edu incoming TCP ougoing backup User Program Wrapper rank ip 1 n1.uwb.edu user program 2 n3.uwb.edu incoming TCP ougoing backup GridTcp – Check-Pointed Connection User Program Wrapper user program TCP outgoing backup incoming Snapshot maintenance n1.uwb.edu n2.uwb.edu • Outgoing packets saved in a backup queue • All packets serialized in a backup file every check pointing • Upon a migration • Packets de-serialized from a backup file • Backup packets restored in outgoing queue • IP table updated n3.uwb.edu AgentTeamwork

  24. GridTcp – Over-Gateway Connection User Program Wrapper User Program Wrapper User Program Wrapper User Program Wrapper user program user program user program user program medusa.uwb.edu (rank 1) uw1-320.uwb.edu (rank 2) • RIP-like connection • Restriction: each node name must be unique. uw1-320-00 (rank 3) mnode0 (rank 0) AgentTeamwork

  25. MPJ Package MPJ Init( ), Rank( ), Size( ), and Finalize( ) Communicator All communication functions: Send( ), Recv( ), Gather( ), Reduce( ), etc. JavaComm mpiJava-S: uses java sockets and server sockets. GridComm mpiJava-A: uses GridTcp sockets. DataType MPJ.INT, MPJ.LONG, etc. • InputStream for each rank • OutputStream for each rank • User a permanent 64K buffer for serialization • Emulate collective communication sending the same data to each OutputStream, which deteriorates performance MPJMessage getStatus( ), getMessage( ), etc. Op Operate( ) etc Other utilities AgentTeamwork

  26. Sentinel Agent User Program Wrapper rank ip Main Thread SendSnapshot Thread 1 n1.uwb.edu 2 n2.uwb.edu n3.uwb.edu Bookkeeper Agent snapshot snapshot user program TCP ReceiveMsg Thread TCPError Thread outgoing backup Resumed Sentinel Agent incoming Restart message (a new rank/ip pair) MPI Connection MPI Job Execution UWPlace (UWAgent Execution Platform) AgentTeamwork

  27. 4. Performance Evaluation • Evaluation Environment: • A 8-node Myrinet-2000 cluster: 2.8GHz pentium4-Xeon w/ 512MB • A 24-node Giga-Ethernet cluster: 3.4GHz Pentium4-Xeon w/512MB • Computation Granularity • Java Grande MPJ Benchmark • Process Resumption Overhead AgentTeamwork

  28. MPJ.Send and Recv Performance AgentTeamwork

  29. Computational Granularity 1 AgentTeamwork

  30. Computational Granularity 2 AgentTeamwork

  31. Computational Granularity 3 AgentTeamwork

  32. Performance Evaluation - Series AgentTeamwork

  33. Performance Evaluation - RayTracer AgentTeamwork

  34. Performance Evaluation – MolDyn AgentTeamwork

  35. Overhead of Job Resumption AgentTeamwork

  36. 5. Related Work From the viewpoints of: • System Architecture • Mobile Agents • Fault Tolerance AgentTeamwork

  37. System Architecture • Difference from Catalina/J-SEAL2 • They are not fully implemented. • They are based on a master-slave model AgentTeamwork

  38. Mobile Agents AgentTeamwork

  39. Fault Tolerance AgentTeamwork

  40. 6. Conclusions • Project Summary • Next Two Years AgentTeamwork

  41. Project summary • Our focus • A decentralized job execution and fault-tolerant environment • Applications not restricted to the master-slave or parameter-sweeping model. • Applications • 40,000 doubles x 10,000 floating-point operations • Moderate data transfer combined with massive/collective communication • At least three times larger than its computational granularity • Current status • UWAgent: completed • Agent behavioral design: basic job deployment/resumption implemented • User program wrapper: completed except security feature • GridTcp/mpiJava: in testing • Preprocessor: in design AgentTeamwork

  42. Next Two Years • Application support • Preprocessor implementation • Efficient input/output file transfer • Security enhancement in remote execution • GUI improvement • Agent algorithms • Over-gateway application deployment • Dynamic resource monitoring • Priority-based agent migration • Performance evaluation • Dissemination AgentTeamwork

  43. Questions? AgentTeamwork

More Related