
Civilian Worms: Ensuring Reliability in an Unreliable Environment



  1. Civilian Worms: Ensuring Reliability in an Unreliable Environment Sanjeev R. Kulkarni University of Wisconsin-Madison sanjeevk@cs.wisc.edu Joint Work with Sambavi Muthukrishnan

  2. Outline • Motivation and Goals • Civilian Worms • Master-Worker Model • Leader Election • Forward Progress • Correctness • Parallel Applications

  3. What’s happening today • Move towards clusters • Resource Managers • e.g., Condor • Dynamic environment

  4. Motivation • Large Parallel/Standalone Applications • Non-Dedicated Resources • e.g., a Condor environment • Machines can disappear at any time • Unreliable commodity clusters • Hardware failures • Network failures • Security Attacks!

  5. What’s available • Parallel Platforms • MPI • MPI-1: machines can’t go away! • MPI-2: any takers? • PVM • Shoot the master! • Condor • Shoot the Central Manager!

  6. Goal • Bottleneck-free infrastructure in an unreliable environment • Ensure “normal termination” of applications • Users submit their jobs • Get e-mail upon completion!

  7. Focus of this talk • Approaches for Reliability • Standalone Applications • Monitor framework (worms!) • Replication • Parallel Applications • Future work!

  8. Worms are here again! • Usual Worms • Self replicating • Hard to detect and kill • Civilian Worms • Controlled replication • Spread legally! • Monitor applications

  9. Desired Monitoring System [Figure: the desired monitoring setup; W = worm, C = computation]

  10. Issues • Management of worms • Distributed State detection • Very hard • Forward Progress • Checkpointing • Correctness

  11. Management Models • Master-Worker • Simple • Effective • Our Choice! • Symmetric • Difficult to manage the model itself!

  12. Our Implementation Model [Figure: a master worm coordinating worker worms; W = worm, C = computation]

  13. Worm States • Master • Maintains the state of all the worm segments • Listens on a particular socket • Respawns failed worm segments • Worker • Periodically pings the master • Starts the encapsulated process if instructed • Leader Election • Invokes the LE algorithm to elect a new master • Note: worm state is independent of application state (worker side sketched below)
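
A minimal sketch of the worker side of this state machine, in Python. The master address, message strings, timeout value, and helper functions here are assumptions for illustration; the slides do not give the actual wire protocol.

    import socket
    import time

    MASTER_ADDR = ("master.example.edu", 9000)  # assumed master host/port
    PING_INTERVAL = 1.0                         # seconds, as in the stress test
    PING_TIMEOUT = 3.0                          # assumed timeout value

    def start_encapsulated_process():
        pass  # exec the wrapped computation (details elided)

    def enter_leader_election(worm_id):
        pass  # switch to the LE state (see the LE sketch below)

    def worker_loop(worm_id):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(PING_TIMEOUT)
        while True:
            # periodically ping the master
            sock.sendto(b"PING %d" % worm_id, MASTER_ADDR)
            try:
                reply, _ = sock.recvfrom(64)
                if reply == b"START":
                    start_encapsulated_process()  # master instructed us to run
            except socket.timeout:
                enter_leader_election(worm_id)    # master presumed dead
                return
            time.sleep(PING_INTERVAL)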

  14. Leader Election • The woes begin! • Master goes down • Detection • Worker’s ping times out (after a tunable timeout value) • Worker gets an LE message • Action • Worker goes into the LE state

  15. LE algorithm • Each worm segment is given an ID • Only the master assigns IDs • Workers broadcast their IDs • The worker with the lowest ID wins

  16. Brief Skeleton • While in LE • Broadcast an LE message with your ID • Set min = your ID • On getting an LE message with ID i • If i >= min, ignore it • Else min = i • min is the new master (a runnable sketch follows)
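
A runnable sketch of this skeleton, assuming UDP broadcast on a made-up port and a fixed number of rounds; the real implementation’s timeouts and message layout are not given in the slides.

    import socket
    import time

    ELECTION_PORT = 9001  # assumed port for LE broadcasts

    def run_election(my_id, rounds=5, interval=0.5):
        """Return True if this segment wins, i.e. has the lowest ID seen."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", ELECTION_PORT))
        sock.settimeout(interval)
        lowest = my_id
        deadline = time.time() + rounds * interval
        while time.time() < deadline:
            # broadcast an LE message carrying our ID
            sock.sendto(b"LE %d" % my_id, ("<broadcast>", ELECTION_PORT))
            try:
                msg, _ = sock.recvfrom(64)
                if msg.startswith(b"LE "):
                    other = int(msg.split()[1])
                    if other < lowest:   # i < min: remember the lower ID
                        lowest = other   # (i >= min is simply ignored)
            except socket.timeout:
                pass
        return lowest == my_id           # the lowest ID is the new master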

  17. LE in action (1) [Figure: master M0 goes down, leaving workers W1 and W2]

  18. LE in action (2) [Figure: L1 and L2 send out LE messages carrying their IDs]

  19. LE in action (3) [Figure: L1 gets LE,2 and ignores it; L2 gets LE,1 and sends COORD_ACK]

  20. LE in action (4) [Figure: the new master M1 sends COORD to W2 and spawns W0]

  21. Implementation Problems • Too many cases • Many unclear cases • Time to Converge • Timeout values • Network Partition

  22. What happens if? • Master still up? • Incoming ID < self ID => master goes into LE mode • Else => master sends back a COORD message • Next master in line goes down? • Timeout on COORD message receipt • Late COORD_ACK? • Send a KILL message (handlers sketched below)
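
A sketch of how these cases might be handled on the receiving side. The message names (COORD, KILL) follow the slides; the function shapes and the enter_le_mode callback are assumptions.

    def on_le_message(self_id, incoming_id, sock, addr, enter_le_mode):
        """A still-live master receiving an LE message."""
        if incoming_id < self_id:
            enter_le_mode()               # a lower ID is contending: step down
        else:
            sock.sendto(b"COORD", addr)   # reassert mastership

    def on_late_coord_ack(sock, addr):
        """A COORD_ACK arriving after the election already converged."""
        sock.sendto(b"KILL", addr)        # tell the stale contender to stand down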

  23. More Bizarre cases • Multiple masters? • The master broadcasts its ID periodically • Conflict is resolved using the lowest-ID method • No master? • Workers will time out soon!

  24. Test-Bed • 64 dual-processor 550 MHz P-III nodes • Linux 2.2.12 • 2 GB RAM • Fast interconnect (100 Mbps) • Master-worker communication via UDP

  25. A Stress Test for LE • Test (driver sketched below) • Worker pings every second • Kill n/4 workers • After 1 sec, kill the master • After 0.5 sec, kill the next master in line • Kill n/4 workers again
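
A hypothetical driver reproducing this schedule, assuming each worm segment runs as a local process we can signal; only the timing comes from the slide.

    import os
    import random
    import signal
    import time

    def kill_all(pids):
        for pid in pids:
            os.kill(pid, signal.SIGKILL)

    def stress_test(worker_pids, master_pid, next_master_pid):
        n = len(worker_pids)
        victims = random.sample(worker_pids, n // 4)
        kill_all(victims)                             # kill n/4 workers
        survivors = [p for p in worker_pids if p not in victims]
        time.sleep(1.0)
        kill_all([master_pid])                        # after 1 sec, kill the master
        time.sleep(0.5)
        kill_all([next_master_pid])                   # then the next master in line
        kill_all(random.sample(survivors, n // 4))    # kill n/4 workers again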

  26. Convergence

  27. Forward Progress • Why? • MTTF < application time • Solutions • Checkpointing • Application level • Process level • Start from the checkpoint image!

  28. Checkpoint • Address Space • Condor checkpoint library • Rewrites object files • Writes a checkpoint to a file on SIGUSR2 (toy analogue below) • Files • Assumption: a common file system
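
This is not the Condor library itself: a toy Python analogue of the same signal-driven style, where SIGUSR2 triggers a dump of in-memory state to a file.

    import os
    import pickle
    import signal

    state = {"step": 0}  # stand-in for the process's address space

    def write_checkpoint(signum, frame):
        # write the checkpoint to a file on SIGUSR2
        with open("ckpt.%d" % os.getpid(), "wb") as f:
            pickle.dump(state, f)

    signal.signal(signal.SIGUSR2, write_checkpoint)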

  29. Correctness • File Access • Read-only: no problems • Writes • Possible inconsistency if multiple processes access the same file • Inconsistency across checkpoints? • Need a new file access algorithm

  30. Solution: Individual Versions • File Access Algorithm (sketched after the next slide) • On open • If first open • Read: nothing special • Write: create a local copy and set a mapping • Else • If mapped: access the mapped file • If write and not yet mapped: create a local copy and set a mapping • On close • Preserve the mapping

  31. File Access cont. • Commit Point • On completion of the computation • Checkpoint • Includes mapped files
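
A compact sketch of the per-file versioning and commit point from the last two slides, assuming a common file system; the mapping table and the local-copy naming scheme are made up here.

    import os
    import shutil

    mapping = {}  # original path -> private local copy; preserved across closes

    def worm_open(path, mode="r"):
        if path in mapping:
            return open(mapping[path], mode)       # already mapped: use the copy
        if any(flag in mode for flag in ("w", "a", "+")):
            local = "%s.worm.%d" % (path, os.getpid())
            if os.path.exists(path):
                shutil.copy(path, local)           # create a local copy
            mapping[path] = local                  # and set the mapping
            return open(local, mode)
        return open(path, mode)                    # first read-only open: nothing special

    def commit():
        # commit point: on completion of the computation, install the private versions
        for orig, local in mapping.items():
            os.replace(local, orig)
        mapping.clear()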

  32. Being more Fancy • Security Attacks • Civilian-to-military transition • Hide from ps • Re-fork periodically to avoid detection (toy sketch below)
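
A toy illustration of the periodic re-fork trick only: the parent exits and the child carries on under a fresh PID. The interval and the work_step callback are assumptions.

    import os
    import sys
    import time

    REFORK_INTERVAL = 60.0  # assumed period between re-forks

    def refork_periodically(work_step):
        while True:
            deadline = time.time() + REFORK_INTERVAL
            while time.time() < deadline:
                work_step()        # keep doing useful work
            if os.fork() > 0:
                sys.exit(0)        # parent dies; child continues with a new PID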

  33. Conclusion • LE is VERY HARD • Don’t take it for a course project! • Does our system work? • 16 nodes: YES • 32 nodes: NO • Quite Reliable

  34. Future Direction • Robustness • Extension to parallel programs • Re-write send/recv calls • Routing issues • Scalability issues? • A hierarchical design?

  35. References • F. B. Cohen, “A Case for Benevolent Viruses”, http://www.all.net/books/integ/goodvcase.html • M. Litzkow and M. Solomon, “Supporting Checkpointing and Process Migration Outside the UNIX Kernel”, Usenix Conference Proceedings, San Francisco, CA, January 1992 • G. Singh, “Leader Election in Complete Networks”, PODC ’92

  36. Implementation Arch. [Figure: worm internals: Communicator, Dispatcher, Dequeuer, and Checkpointer components; the computation is wrapped with prepend/append and remove-checkpoint steps]

  37. Parallel Programs • Communication • Connectivity across failures • Re-write send/recv socket calls • Limitations of Master-Worker Model? • Not really!

  38. Communication • Checkpoint markers • Buffer all data between checkpoint markers • Help of the master in rerouting (sketched below)
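
A sketch of the marker-and-buffer idea, assuming an in-band marker value and master-driven replay after a failure; none of these names come from the slides.

    class BufferedChannel:
        """Wraps a rewritten send call so in-flight data can survive failures."""
        MARKER = b"\x00CKPT\x00"            # hypothetical in-band checkpoint marker

        def __init__(self, send_fn):
            self.send_fn = send_fn          # the underlying (rewritten) send
            self.since_marker = []          # all data since the last marker

        def send(self, data):
            self.since_marker.append(data)  # buffer between checkpoint markers
            self.send_fn(data)

        def checkpoint_marker(self):
            self.send_fn(self.MARKER)       # delimit a consistent cut
            self.since_marker.clear()       # earlier data is covered by the checkpoint

        def replay(self, resend_fn):
            # on failure, the master reroutes: replay data since the last marker
            for chunk in self.since_marker:
                resend_fn(chunk)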
