CSE 490dp Check-pointing and Migration

Robert Grimm


  1. CSE 490dp Check-pointing and Migration • Robert Grimm

  2. Problem • How to capture the state of an application? • Save and restore application • Clone application • Move application to a different node • Technical issue

  3. Motivation • Failure resilience • Restart application after failure • Performance • Balance load across several nodes • Co-locate application with (remote) data • Availability • Move away from nodes that are going to go down • Follow a user as she moves through the physical world

  4. Application State • Internal data • Memory, objects • Execution state • Thread-based: Stack, registers • Event-based: Event queue • Connections • Open files, sockets • Outside data • Executables • Stored data

  5. What State to Capture? • Issue: Degree of transparency • Fully transparent • Application cannot tell the difference • No transparency • Application needs to do everything itself

  6. Internal Data • Most basic application state • Memory – copy • C, C++ • Objects – serialize • Modula-3 • Java
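As a rough illustration of the "Objects – serialize" bullet, the sketch below checkpoints a small object by serializing its fields and later rebuilds an independent copy. Python's `json` module stands in for Modula-3 or Java serialization here; the `Inventory` class and the `checkpoint`/`restore` helpers are hypothetical names, not taken from any system in these slides.

```python
import json

class Inventory:
    """Hypothetical application object; its fields are the internal data."""
    def __init__(self, owner, items=None):
        self.owner = owner
        self.items = dict(items or {})

    def add(self, item, quantity):
        self.items[item] = self.items.get(item, 0) + quantity

def checkpoint(inventory):
    """Serialize the object's fields into a string that can be stored or sent."""
    return json.dumps({"owner": inventory.owner, "items": inventory.items})

def restore(blob):
    """Rebuild an independent object from a serialized checkpoint."""
    state = json.loads(blob)
    return Inventory(state["owner"], state["items"])

live = Inventory("alice")
live.add("disk", 2)
saved = checkpoint(live)   # capture state after the first update
live.add("disk", 1)        # later changes are not part of the checkpoint
restored = restore(saved)  # restored sees only the checkpointed state
```

Serializing named fields, rather than copying raw memory, is what lets the restored object live in a different process or on a different node.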

  7. Execution State • System must be quiescent • All execution is suspended • Thread-based: State is implicit • Stack • Registers, including PC • Condition variable queues • Very low level • Event-based: State is explicit • Event queue
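To make the "Event-based: State is explicit" bullet concrete, here is a minimal sketch of an event loop whose execution state is just its data plus the queue of pending events. The checkpoint is taken between handlers, when the loop is quiescent. `EventLoop` and its methods are hypothetical and kept deliberately small.

```python
import json
from collections import deque

class EventLoop:
    """Hypothetical event-based application that sums posted amounts."""
    def __init__(self, total=0, pending=()):
        self.total = total
        self.queue = deque(pending)

    def post(self, amount):
        self.queue.append(amount)

    def step(self):
        """Handle one event; return False when the queue is empty."""
        if not self.queue:
            return False
        self.total += self.queue.popleft()
        return True

    def checkpoint(self):
        # Only safe between handlers, when no event is half-processed.
        return json.dumps({"total": self.total, "pending": list(self.queue)})

    @classmethod
    def restore(cls, blob):
        state = json.loads(blob)
        return cls(state["total"], state["pending"])

loop = EventLoop()
for amount in (5, 10, 20):
    loop.post(amount)
loop.step()                        # handle the first event only
blob = loop.checkpoint()           # two events are still pending
resumed = EventLoop.restore(blob)
while resumed.step():              # the restored loop finishes the work
    pass
```

A thread-based program would need its stacks and registers captured to achieve the same effect, which is why the slide calls that approach "very low level".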

  8. Connections • Open files, sockets, etc. • Problems • May change while application is not executing • Check-points • May not be available on new node • Migration

  9. Alternative • Let application restore its connections • Harder for thread-based systems • Thread may be accessing file or socket • Easier for event-based systems • Tell application to restore connections • Explicit event
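One way to picture the "explicit event" approach: the application checkpoints only a description of its connection (here a path and a byte offset) and reopens the file when it is told it has been restored. The event names `"checkpointing"`, `"restored"`, and `"read"`, and the `LogReader` class, are assumptions made for this sketch.

```python
import os
import tempfile

class LogReader:
    """Hypothetical component that reads a file one line at a time."""
    def __init__(self, path, offset=0):
        self.path = path
        self.offset = offset
        self.handle = None  # the open file itself is never checkpointed

    def handle_event(self, event):
        if event == "restored":
            # Re-acquire the connection from the saved description.
            self.handle = open(self.path, "rb")
            self.handle.seek(self.offset)
        elif event == "read":
            line = self.handle.readline()
            self.offset = self.handle.tell()
            return line.decode().rstrip("\n")
        elif event == "checkpointing":
            self.handle.close()
            self.handle = None
            return {"path": self.path, "offset": self.offset}

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first\nsecond\n")

reader = LogReader(path)
reader.handle_event("restored")               # initial acquisition
first = reader.handle_event("read")
state = reader.handle_event("checkpointing")  # plain data, safe to store

resumed = LogReader(**state)
resumed.handle_event("restored")              # reopen and seek
second = resumed.handle_event("read")
resumed.handle.close()
os.remove(path)
```

If the file changed or is missing on the new node, the failure surfaces in the `"restored"` handler, where the application can decide how to react.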

  10. Outside Data • Executables, stored data • Make data available everywhere • Distributed file system • Move executable(s) with application • Support moving code but not other data • Group data and applications • Environments in one.world • Hierarchy moved as one unit

  11. Three Points in the Design Space • Sprite [Douglis & Ousterhout 91] • Aglets [Lange & Oshima 98] • Representative of Java-based agent systems • one.world

  12. Sprite • Process migration motivated by performance • Use idle machines • Transferred application state • Data • Execution state • Open connections • “It turned out to be particularly difficult in Sprite to migrate the state associated with open files”

  13. Transparency in Sprite • Application seems to be on “home machine” • Location-independent kernel calls • File system • Transfer execution state • VM, open files, PIDs, UIDs, resource usage statistics • Call back to home machine • gettimeofday • Modify state on both machines • fork, exit, wait
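The three call categories on this slide can be sketched as a toy simulation: some calls are answered from transferred state, some are forwarded to the home machine, and some update both machines. This is not Sprite's kernel interface; `Machine`, `MigratedProcess`, and their fields are hypothetical stand-ins.

```python
class Machine:
    """Hypothetical machine with a clock and a table of known processes."""
    def __init__(self, name, clock):
        self.name = name
        self.clock = clock        # stand-in for the machine's time of day
        self.processes = set()    # stand-in for kernel process state

class MigratedProcess:
    """A process that runs on `current` but appears to run on `home`."""
    def __init__(self, pid, home, current):
        self.pid = pid
        self.home = home
        self.current = current
        home.processes.add(pid)
        current.processes.add(pid)

    def getpid(self):
        # PIDs travel with the execution state, so this is answered locally.
        return self.pid

    def gettimeofday(self):
        # Forwarded to the home machine so the process sees a consistent clock.
        return self.home.clock

    def exit(self):
        # Calls such as exit modify state on both machines.
        for machine in (self.home, self.current):
            machine.processes.discard(self.pid)

home = Machine("home", clock=1000)
idle = Machine("idle", clock=1007)
proc = MigratedProcess(42, home, idle)
seen_time = proc.gettimeofday()   # the home clock, not the idle machine's
proc.exit()
```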

  14. Aglets • Mobile agent system • “Clean” platform for experimenting with mobile agents • Transferred application state • Data • Relies on Java serialization • Executables • Lazily – only currently used classes

  15. Limitations • Not transferred • Execution state • Not supported by Java • Applications need to implement their own state machines • Outside data beyond executables • Not part of platform
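Because only data is transferred, an agent has to record where it was in its own work. A minimal sketch of such a hand-written state machine follows; it is written in Python with `json` standing in for Java serialization, and `SurveyAgent`, `dispatch`, and `arrive` are hypothetical names rather than the Aglets API.

```python
import json

class SurveyAgent:
    """Hypothetical agent that collects two load readings, then reports."""
    def __init__(self, phase="collect", readings=None):
        self.phase = phase                   # explicit "program counter"
        self.readings = list(readings or [])

    def step(self, host_load):
        """Do one unit of work on the current host, based on the saved phase."""
        if self.phase == "collect":
            self.readings.append(host_load)
            if len(self.readings) == 2:
                self.phase = "report"
        elif self.phase == "report":
            self.phase = "done"
            return sum(self.readings) / len(self.readings)
        return None

    def dispatch(self):
        # Only data travels; the call stack is lost at this point.
        return json.dumps({"phase": self.phase, "readings": self.readings})

    @classmethod
    def arrive(cls, blob):
        state = json.loads(blob)
        return cls(state["phase"], state["readings"])

agent = SurveyAgent()
agent.step(0.5)                                 # on the first host
agent = SurveyAgent.arrive(agent.dispatch())    # hop to a second host
agent.step(1.5)
agent = SurveyAgent.arrive(agent.dispatch())    # hop back home
average = agent.step(0.0)                       # report phase ignores the argument
```

Every method must begin by dispatching on `phase`, which is the extra burden the slide attributes to systems that cannot capture execution state.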

  16. one.world • Failure resilience, availability, (performance) • checkpoint, restore, move, clone • Transferred state • Data • Execution state • Event queue • Outside data • Environment hierarchy • Not transferred • Open connections
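The four operations named on this slide can be sketched over a tree of environments, where a subtree is always handled as one unit. This is an illustration only, not the one.world API: the `Environment` class below, its fields, and its method signatures are assumptions, and a real system would also carry the pending event queue.

```python
import copy

class Environment:
    """Hypothetical container for an application's data and nested environments."""
    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})
        self.children = []
        self.parent = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def checkpoint(self):
        """Capture this environment and everything below it as plain data."""
        return {"name": self.name,
                "data": copy.deepcopy(self.data),
                "children": [c.checkpoint() for c in self.children]}

    @classmethod
    def restore(cls, snapshot):
        env = cls(snapshot["name"], copy.deepcopy(snapshot["data"]))
        for child in snapshot["children"]:
            env.add(cls.restore(child))
        return env

    def clone(self):
        return type(self).restore(self.checkpoint())

    def move(self, new_parent):
        """Detach the whole subtree and attach it under another node."""
        if self.parent is not None:
            self.parent.children.remove(self)
        new_parent.add(self)

node_a = Environment("node-a")
node_b = Environment("node-b")
chat = node_a.add(Environment("chat", {"user": "alice"}))
chat.add(Environment("log", {"lines": ["hi"]}))

twin = chat.clone()                              # independent copy
twin.children[0].data["lines"].append("bye")
chat.move(node_b)                                # nested "log" moves with it
```

Open connections are absent from the snapshot on purpose, matching the "Not transferred" bullet.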

  17. Programming for Change • Pervasive computing environment • Highly dynamic • Tens of thousands of nodes and services come and go • Applications • Cannot assume existence or availability of resources • Need to be prepared to re-acquire any resource at any time
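A small sketch of what "prepared to re-acquire any resource at any time" might look like in application code: every use goes through a wrapper that rediscovers the resource when it has vanished. `ResourceGone`, `FlakyPrinter`, and `Reacquiring` are hypothetical names, and the retry limit is an arbitrary choice for the example.

```python
class ResourceGone(Exception):
    """Raised when a previously acquired resource is no longer available."""

class FlakyPrinter:
    """Hypothetical service that disappears after a fixed number of uses."""
    def __init__(self, name, uses):
        self.name = name
        self.uses = uses

    def print(self, text):
        if self.uses == 0:
            raise ResourceGone(self.name)
        self.uses -= 1
        return f"{self.name}: {text}"

class Reacquiring:
    """Wraps a discovery function and re-runs it whenever the resource is lost."""
    def __init__(self, discover):
        self._discover = discover
        self._resource = None

    def call(self, operation, attempts=3):
        for _ in range(attempts):
            if self._resource is None:
                self._resource = self._discover()
            try:
                return operation(self._resource)
            except ResourceGone:
                self._resource = None   # forget it and look again
        raise ResourceGone("no resource available")

available = iter([FlakyPrinter("p1", 1), FlakyPrinter("p2", 5)])
handle = Reacquiring(available.__next__)  # discovery returns the next printer

output = []
for text in ("a", "b"):
    output.append(handle.call(lambda printer, t=text: printer.print(t)))
```

The second job lands on `p2` without the caller noticing that `p1` went away, so the same code path also covers restoring after a checkpoint or a move.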

  18. Summary • Sprite • Full migration, full transparency • Does not scale across a global network • Aglets • Limited environment with limited migration • one.world • Better balance between no migration and full migration (?)
