
Error Scope on a Computational Grid

Douglas Thain, University of Wisconsin, 4 March 2002



  1. Error Scope on a Computational Grid Douglas Thain University of Wisconsin 4 March 2002

  2. Overview
     • We have added a Java Universe to Condor. (More from Todd.)
     • Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.
     • A lesson: One must consider the scope of an error as well as its detail.

  3. Java for Scientific Computing
     • Java is emerging as a tool for large-scale (Grande) scientific computing.
       • More accessible to domain scientists.
       • Simplified porting.
       • Faster development, debugging.
     • User communities are forming:
       • ACM Java Grande Conference
       • The Java Grande Forum

  4. The Hype:
     • Java:
       • “Write once, run anywhere!”
     • Condor:
       • “Submit once, run everywhere!”
     • The Grid:
       • Uniform, dependable, consistent, pervasive, and inexpensive computing.

  5. The Reality:
     • Coupling systems is not trivial!
     • The easy part:
       • Putting java in front of the program name.
     • The tricky parts:
       • Dealing with unexpected events!
       • Bad java installation.
       • Unavailable file system.
       • Temporary resource exhaustion.

  6. Architecture
     • Execution:
       • User just specifies “java” universe.
       • Execution site gives details of JVM.
     • I/O:
       • Know all of your files?
         • Condor transfers whole files for you.
       • Need online I/O?
         • Link program with Chirp I/O Library.
         • Execution site provides proxy to home site.

  7. [Diagram] Labels: Submission Site, Execution Site, shadow, starter, Home File System.

  8. [Diagram] Labels: Submission Site, Execution Site, shadow, starter, Fork, JVM, Home File System.

  9. [Diagram] Labels: Submission Site, Execution Site, shadow, starter, Fork, JVM, Home File System, The Job.

  10. [Diagram] Labels: Submission Site, Execution Site, shadow, starter, Secure Remote I/O, I/O Server, I/O Proxy, Local I/O (Chirp), Fork, Local System Calls, JVM, Home File System, The Job, I/O Library.

  11. Initial Experience
      • Bad news: Nearly any unexpected failure would cause the job to be returned to the user:
        • Out of memory at execution site.
        • Java misconfigured at execution site.
        • I/O proxy can’t initialize.
        • Home file system offline.

  12. What Do Users Want?
      • This was correct in a certain sense:
        • The information was true.
        • But it was still frustrating.
      • Users want to know when their program fails by design (for example, with a NullPointerException), but not when it fails because of the environment.

  13. What Did We Do Wrong?
      • We thought that we were very careful to propagate errors:
        • I/O errors: server->proxy->library->job
        • JVM exit code: JVM->starter->home
      • But we failed to draw a distinction between:
        • Errors that are a natural property of the program.
        • Errors that are an incidental result of the environment.

  14. Scope and Detail
      • The scope of an error is the portion of the system that it invalidates.
      • The detail of an error describes its philosophical cause.
      • An error must be delivered according to the handler that manages its scope.
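
The scope/detail distinction can be sketched in code. This is a minimal illustration, not the Condor implementation: the three scope names, the class names, and the handler policies ("return to user", "retry at another site", "hold job") are all assumptions made for the example. Each error carries a scope and a detail, and a router delivers it to whichever handler manages that scope.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

// Demo entry point. All names and policies here are hypothetical.
final class ScopeDemo {
    static ErrorRouter defaultRouter() {
        ErrorRouter router = new ErrorRouter();
        // Assumed policies: only program-scope errors go back to the user.
        router.register(ErrorScope.PROGRAM, e -> "return to user: " + e.detail);
        router.register(ErrorScope.EXECUTION_SITE, e -> "retry at another site: " + e.detail);
        router.register(ErrorScope.JOB, e -> "hold job: " + e.detail);
        return router;
    }

    public static void main(String[] args) {
        ErrorRouter router = defaultRouter();
        System.out.println(router.deliver(
                new ScopedError(ErrorScope.PROGRAM, "NullPointerException")));
        System.out.println(router.deliver(
                new ScopedError(ErrorScope.EXECUTION_SITE, "Java misconfigured")));
    }
}

// Hypothetical scopes, narrowest first. The slides do not enumerate them.
enum ErrorScope { PROGRAM, EXECUTION_SITE, JOB }

// Pairs the portion of the system an error invalidates (scope)
// with a description of its cause (detail).
final class ScopedError {
    final ErrorScope scope;
    final String detail;

    ScopedError(ErrorScope scope, String detail) {
        this.scope = scope;
        this.detail = detail;
    }
}

// Delivers each error to the handler registered for its scope.
final class ErrorRouter {
    private final Map<ErrorScope, Function<ScopedError, String>> handlers =
            new EnumMap<>(ErrorScope.class);

    void register(ErrorScope scope, Function<ScopedError, String> handler) {
        handlers.put(scope, handler);
    }

    String deliver(ScopedError error) {
        Function<ScopedError, String> handler = handlers.get(error.scope);
        if (handler == null) {
            throw new IllegalStateException("no handler for scope " + error.scope);
        }
        return handler.apply(error);
    }
}
```

The point of the design is that the detail string never decides where an error goes; only the scope does.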

  15. Examples

  16. An Example
      • With this understanding, we reconsidered many elements of the Java Universe.
      • One example:
        • The JVM exit code is not a useful result.
        • It gives results that ignore error scope.
      • Solution:
        • Trap the program exit at a higher level.
        • Report the result and scope on a separate channel.
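
The solution above might look roughly like the following wrapper. It is a sketch under stated assumptions, not the actual Java Universe wrapper: the `Job` interface, the `Outcome` class, the result-file format, and the classification rule (any `Exception` is program scope, any `Error` is beyond program scope) are invented for illustration, and the rule is deliberately crude. It also does not trap calls to `System.exit`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class JobWrapper {
    // Hypothetical stand-in for the user's program entry point.
    interface Job { void run() throws Exception; }

    // What happened, and which part of the system it invalidates.
    static final class Outcome {
        final String result;
        final String scope;

        Outcome(String result, String scope) {
            this.result = result;
            this.scope = scope;
        }
    }

    // Traps the program exit inside the JVM instead of relying on the exit code.
    static Outcome runJob(Job job) {
        try {
            job.run();
            return new Outcome("normal exit", "program");
        } catch (Exception e) {
            // Assumption: exceptions are a natural property of the program.
            return new Outcome("exception " + e.getClass().getName(), "program");
        } catch (Error e) {
            // Assumption: JVM-level errors point at the environment.
            return new Outcome("error " + e.getClass().getName(), "beyond-program");
        }
    }

    // Reports result and scope on a separate channel: a two-line result file.
    static void report(Outcome outcome, Path resultFile) throws IOException {
        Files.write(resultFile,
                Arrays.asList("result=" + outcome.result, "scope=" + outcome.scope),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path resultFile = Files.createTempFile("result", ".txt");
        report(runJob(() -> { throw new NullPointerException("bug in job"); }), resultFile);
        for (String line : Files.readAllLines(resultFile, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
        Files.delete(resultFile);
    }
}
```

Because the outcome travels in the result file, whatever launches the JVM can treat the JVM's own exit code as information about the JVM rather than about the program.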

  17. JVM Exit Code

  18. [Diagram] Labels: shadow, starter, Starter Result + Program Result, Result File, JVM Result, JVM, Home File System, Program Result or Error and Scope, Wrapper, The Job, I/O Library.

  19. [Diagram] Labels: shadow, starter, Starter Result + Program Result, Local I/O (Chirp), I/O Proxy, Result File, JVM Result, JVM, Home File System, Wrapper, The Job, I/O Library, Errors of Larger Scope, Errors Inside Program Scope.

  20. Conclusion
      • We started building the Java Universe with some naive assumptions about errors.
      • On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail.
      • By routing errors according to their scope, we made the system more robust and usable.
      • Details in an upcoming paper.

  21. Deeper Problems
      • Systems have deep semantic differences that cross multiple functions.
      • Consider this self-cleaning program:
        • Open a file.
        • Delete the file.
        • Close the file.
      • Works on UNIX, fails on WinNT.
      • Can we really provide a uniform interface?
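
The self-cleaning program can be written in a few lines of Java. This is a sketch with hypothetical class and method names. On UNIX-like systems, deleting the open file typically succeeds and the data disappears once the stream is closed. On Windows, `File.delete()` generally returns false while the stream is still open, so the same code behaves differently.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SelfCleaning {
    // Returns true if the file could be deleted while it was still open.
    static boolean deleteWhileOpen() {
        try {
            File file = File.createTempFile("selfclean", ".tmp");
            boolean deleted;
            try (FileOutputStream out = new FileOutputStream(file)) { // open
                out.write(42);
                deleted = file.delete();                              // delete while open
            }                                                         // close
            if (!deleted) {
                file.delete(); // clean up now that the handle is closed
            }
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("os.name = " + System.getProperty("os.name"));
        System.out.println("deleted while open = " + deleteWhileOpen());
    }
}
```

Java's API is identical on both platforms here; the difference only shows up in the return value, which is the kind of semantic gap the slide is pointing at.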

  22. More Info:
      • Demo on Wednesday Morning
      • Room 3381 CS anytime
      • The Condor Project:
        • http://www.cs.wisc.edu/condor
      • These slides:
        • http://www.cs.wisc.edu/~thain
      • Douglas Thain
        • thain@cs.wisc.edu
      • Questions now?
