Atlas: An Infrastructure for Global Computing

  1. Atlas: An Infrastructure for Global Computing

  2. People • Eric Baldeschwieler (UC Berkeley) • Bobby Blumofe (UT Austin) • Eric Brewer (UC Berkeley)

  3. Outline • Introduction • Programming model • Architecture • Examples • Discussion • Limitations & Conclusion

  4. Introduction: Properties of an Internet computing infrastructure • Scalability: to 10⁶ nodes • Heterogeneity: of machines & OSs • Fault tolerance: completion probability comparable to that of a sequential program • Adaptive parallelism: a dynamic set of resources

  5. Properties ... • Safety: Hosts must be secure • Anonymity: Secure the privacy of the client's data & program • Hierarchy: Locality of communication (local bandwidth is typically higher) • Ease of use: Minimize the “costs” of participating. • Reasonable performance: Low overhead → benefit from a small set of machines.

  6. Introduction ... • Atlas combines mechanisms from: • Cilk • Java • with new mechanisms. • Java “ensures”: • heterogeneity • safety

  7. Introduction ... Atlas: • extends Cilk’s work-stealing scheduler to a hierarchical Internet setting • uses Cilk-NOW’s mechanisms for: • adaptive parallelism • fault tolerance

  8. Programming Model • Applications are written in Java • When a native library is used, heterogeneity is limited to the platforms that support it. • The programming model is: • a Java-based implementation of Cilk: • non-blocking threads with explicit continuation passing • a Unix-like, URL-based file system & local caching with coherence.
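
To make “non-blocking, explicit continuation passing threads” concrete, here is a minimal Java sketch in the style of Cilk's continuation-passing `fib`: a thread never waits for its children. Instead it creates a successor closure (`sum`) with empty argument slots and spawns children that send their results into those slots; the successor becomes ready once every slot is filled. The names are hypothetical, not Atlas's API, and a single-worker deque stands in for the distributed scheduler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Illustrative sketch only: class and method names are hypothetical, not Atlas's API.
final class ContinuationFib {
    // Successor closure for "sum": waits for two argument slots, then becomes
    // ready and is pushed onto the ready deque.
    static final class SumClosure {
        private final int[] slots = new int[2];
        private int missing = 2;
        private final IntConsumer k; // continuation that receives slots[0] + slots[1]

        SumClosure(IntConsumer k) {
            this.k = k;
        }

        // Analogue of Cilk's send_argument: fill a slot; when the last one
        // arrives, the sum thread is ready to run.
        void send(int slot, int value, Deque<Runnable> ready) {
            slots[slot] = value;
            if (--missing == 0) {
                ready.push(() -> k.accept(slots[0] + slots[1]));
            }
        }
    }

    // fib never waits: it either passes its result to k or creates a successor
    // closure and spawns two children whose results go into its slots.
    static void fib(int n, IntConsumer k, Deque<Runnable> ready) {
        if (n < 2) {
            k.accept(n);
            return;
        }
        SumClosure sum = new SumClosure(k);
        ready.push(() -> fib(n - 1, v -> sum.send(0, v, ready), ready));
        ready.push(() -> fib(n - 2, v -> sum.send(1, v, ready), ready));
    }

    // One worker: run ready threads from the top of its deque until none remain.
    static int run(int n) {
        Deque<Runnable> ready = new ArrayDeque<>();
        int[] result = new int[1];
        ready.push(() -> fib(n, v -> result[0] = v, ready));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(run(24)); // prints 46368
    }
}
```

Because no thread blocks, any ready closure can be handed to another worker, which is what makes this style a natural fit for work stealing.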

  9. Architecture • Basic architecture [Figure: Client, Manager, and several Compute Servers; software stack: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++)]

  10. Architecture ... • Client is a Java application • connects to compute servers on machines other than its manager’s. • Idle servers steal work from busy ones.

  11. Architecture • Compute server: • relinquishes control when there is non-Atlas work (a screensaver?) • runs as a daemon that alternates between: • working • pinging its manager & siblings for work to steal
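
The daemon's two modes (working, and looking for work to steal) can be sketched as a small single-threaded Java simulation in the style of Cilk's scheduler: each compute server takes its newest task from the bottom of its own deque, and an idle server steals the oldest task from the top of a randomly chosen sibling's deque. Task "sizes", the random victim choice and all names are assumptions for illustration; real servers would run concurrently and exchange steal requests over the network, and the manager is not modeled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

// Hypothetical simulation of sibling work stealing; not Atlas code.
final class WorkStealingSim {
    static final class Server {
        final Deque<Integer> deque = new ArrayDeque<>(); // each task is a work size
        int executed = 0; // unit tasks this server ran

        // Owner end: take the most recently spawned task.
        boolean workOneStep() {
            Integer size = deque.pollLast();
            if (size == null) {
                return false; // idle
            }
            if (size > 1) {
                // Spawn two children (a divide-and-conquer split).
                deque.addLast(size / 2);
                deque.addLast(size - size / 2);
            } else {
                executed++; // one unit of real work
            }
            return true;
        }
    }

    // Round-robin steps: every server either works or, if idle, steals the
    // oldest task from the top of a randomly chosen sibling's deque.
    static int[] run(int servers, int rootSize, long seed) {
        List<Server> all = new ArrayList<>();
        for (int i = 0; i < servers; i++) {
            all.add(new Server());
        }
        all.get(0).deque.addLast(rootSize);
        Random rng = new Random(seed);
        boolean pending = true;
        while (pending) {
            for (Server s : all) {
                if (!s.workOneStep()) {
                    Server victim = all.get(rng.nextInt(servers));
                    Integer stolen = victim.deque.pollFirst();
                    if (stolen != null) {
                        s.deque.addLast(stolen);
                    }
                }
            }
            pending = false;
            for (Server s : all) {
                pending |= !s.deque.isEmpty();
            }
        }
        int[] counts = new int[servers];
        for (int i = 0; i < servers; i++) {
            counts[i] = all.get(i).executed;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4, 1000, 42L)));
    }
}
```

Stealing from the top matters in divide-and-conquer programs: the oldest task is usually the largest remaining piece, so one steal moves a lot of work and steals stay rare.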

  12. Architecture: Porting Atlas • Needs a Java runtime system • Must port: • the natively written URL-based file system • some support routines.

  13. Hierarchical Work Stealing [Figure: a tree of Managers whose leaves are Compute Servers]

  14. Hierarchical Work Stealing ... • A manager keeps track of when its subtree is idle • If a manager’s subtree is idle, the manager steals work from its siblings • If a subtree has “too much” work, it “allows” work stealing from above • What is the definition & implementation of “too much”?
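
A hypothetical sketch of these stealing rules: compute servers are leaves, managers are inner nodes, and a manager sums its subtree's queued tasks to know when the subtree is idle. Since “too much” is left undefined, the sketch assumes a fixed `THRESHOLD` on a sibling subtree's load; that constant, the class names and the choice of the most loaded server as the one that gives up work are all assumptions, not Atlas's actual policy.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; THRESHOLD is an assumed stand-in for "too much" work.
final class HierarchicalStealing {
    static final int THRESHOLD = 2; // assumed: a remote subtree shares only above this load

    static final class Node {
        final Node parent;
        final List<Node> children = new ArrayList<>();
        final Deque<String> deque = new ArrayDeque<>(); // only leaves (compute servers) hold tasks

        Node(Node parent) {
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        boolean isLeaf() {
            return children.isEmpty();
        }

        // Queued tasks in this subtree; a manager uses this to tell when its subtree is idle.
        int load() {
            if (isLeaf()) {
                return deque.size();
            }
            int sum = 0;
            for (Node c : children) {
                sum += c.load();
            }
            return sum;
        }

        // Hand over the oldest task of the most loaded compute server below this node.
        String giveUpTask() {
            if (isLeaf()) {
                return deque.pollFirst();
            }
            Node busiest = children.get(0);
            for (Node c : children) {
                if (c.load() > busiest.load()) {
                    busiest = c;
                }
            }
            return busiest.giveUpTask();
        }
    }

    // An idle compute server first steals from siblings under its own manager.
    // Only when that manager's whole subtree is idle does the search move up a
    // level, where a sibling subtree gives work only if it has "too much".
    static String steal(Node thief) {
        Node level = thief;
        int minLoad = 0; // local siblings share any work
        while (level.parent != null) {
            Node manager = level.parent;
            for (Node sibling : manager.children) {
                if (sibling != level && sibling.load() > minLoad) {
                    return sibling.giveUpTask();
                }
            }
            if (manager.load() > 0) {
                return null; // subtree not idle: do not steal from above yet
            }
            level = manager;
            minLoad = THRESHOLD;
        }
        return null;
    }
}
```

For example, with managers A and B under a root, a server under A gets work from B only after all of A's servers are idle and B's subtree holds more than `THRESHOLD` tasks; servers under B can still steal from each other at any time, which keeps most communication local.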

  15. Hierarchical Work Stealing • The authors claim that proven properties of Cilk hold in this hierarchical setting. • Goals: • Localize communication • Subtrees map to the domain hierarchy • Administrators can control thread migration: • Outflow: Privacy • Inflow: Host security
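
One way to read “subtrees map to the domain hierarchy” together with the administrators' migration controls, sketched in Java under stated assumptions: a host's manager path is its DNS name read from the top-level domain down; an outflow rule keeps work that originates inside a private domain from leaving it; and an inflow rule lets a host accept work only from trusted source domains. The rule shapes and all names are hypothetical; the slide does not say how Atlas expresses these policies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical migration policy; not Atlas's actual configuration mechanism.
final class DomainPolicy {
    // "cs.berkeley.edu" -> [edu, berkeley, cs]: one manager per domain level, root first.
    static List<String> managerPath(String host) {
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        return labels;
    }

    // True if host is the domain itself or lies beneath it.
    static boolean inDomain(String host, String domain) {
        return host.equals(domain) || host.endsWith("." + domain);
    }

    // Outflow (privacy): work never leaves a private domain it started in.
    // Inflow (host security): the destination accepts work only from trusted domains.
    static boolean mayMigrate(String from, String to,
                              Set<String> privateDomains, Set<String> trustedSources) {
        for (String d : privateDomains) {
            if (inDomain(from, d) && !inDomain(to, d)) {
                return false;
            }
        }
        for (String d : trustedSources) {
            if (inDomain(from, d)) {
                return true;
            }
        }
        return false;
    }
}
```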

  16. Examples • Fib: fine grained threads • POV-Ray: coarse grained threads

| | Base | 1 Node | 3 Nodes | 8 Nodes |
|---|---|---|---|---|
| Fib(24) | 1.3 | 80 | 40 (2.0) | 31 (2.6) |
| POV-Ray | 20700 | 21000 | - | 2700 (7.8) |

Numbers in ( ) are speedups over 1-node case.
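
As a check on the table, the parenthesized speedups are the 1-node time divided by the P-node time. This short Java sketch recomputes them, along with efficiency (speedup divided by P) and the ratio of each 1-node time to its Base time. It takes the numbers as printed; the slide gives no units, and reading Base as a run without the Atlas runtime is an assumption.

```java
import java.util.Locale;

final class Speedup {
    // Speedup over the 1-node run: S_P = T_1 / T_P.
    static double speedup(double t1, double tp) {
        return t1 / tp;
    }

    // Efficiency: fraction of ideal (linear) speedup achieved on p nodes.
    static double efficiency(double t1, double tp, int p) {
        return speedup(t1, tp) / p;
    }

    public static void main(String[] args) {
        // Values copied from the table; units are not given on the slide.
        System.out.println(String.format(Locale.ROOT, "Fib(24), 3 nodes: %.1f", speedup(80, 40)));
        System.out.println(String.format(Locale.ROOT, "Fib(24), 8 nodes: %.1f", speedup(80, 31)));
        System.out.println(String.format(Locale.ROOT, "POV-Ray, 8 nodes: %.1f (efficiency %.3f)",
                speedup(21000, 2700), efficiency(21000, 2700, 8)));
        // 1-node Atlas time relative to the Base column.
        System.out.println(String.format(Locale.ROOT, "Fib(24) 1-node / base: %.1f", 80 / 1.3));
        System.out.println(String.format(Locale.ROOT, "POV-Ray 1-node / base: %.3f", 21000.0 / 20700));
    }
}
```

Running `main` reproduces the table's 2.0, 2.6 and 7.8, gives a POV-Ray efficiency of 0.972 on 8 nodes, and shows 1-node/base ratios of 61.5 for Fib(24) versus 1.014 for POV-Ray: the fine- versus coarse-grained contrast the slide draws.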

  17. Examples ... • POV-Ray is not written in Java • Partitioning is done in Java • 8 nodes: only 2% overhead. • What about larger P (number of nodes)?

  18. Discussion • Scalable: Yes. • Heterogeneity: Incomplete until Atlas divorces itself from all native libraries. • Safety: • Java: OK. • Native libraries: ?

  19. Discussion ... • Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by its subtree (manager?) • What is the effect of checkpointing on performance? • The subtree rooted at a thread is its subcomputation.

  20. Fault Tolerance ... Subcomputations are transactions: • Authors claim: side effects can be undone • How does this relate to hierarchical work stealing?
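
The claim that subcomputations are transactions can be illustrated with a minimal Java sketch, assuming a stolen subcomputation writes only to a private log that is applied to the client's state when it completes. If the thief times out (modeled here, as an assumption, by a `RuntimeException`), the log is discarded, which undoes its side effects, and the work is recomputed from its checkpoint, here just its input. All names are hypothetical; the sketch says nothing about how Atlas actually stores checkpoints or how this interacts with hierarchical stealing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of transactional subcomputations; not Atlas code.
final class TransactionalSubcomputation {
    // A subcomputation's body: reads its input, records writes in a private log.
    interface Work {
        void run(int input, Map<String, Integer> log);
    }

    // Checkpoint = everything needed to recompute the subcomputation (here, its input).
    static final class Checkpoint {
        final int input;

        Checkpoint(int input) {
            this.input = input;
        }
    }

    // Try `remote`; a RuntimeException stands in for a timeout. On failure the
    // buffered writes are discarded and the work is recomputed from the checkpoint.
    // Only a completed log is applied to `state`.
    static void execute(Checkpoint cp, Work remote, Work local, Map<String, Integer> state) {
        Map<String, Integer> log = new HashMap<>();
        try {
            remote.run(cp.input, log);
        } catch (RuntimeException timedOut) {
            log.clear();              // undo: nothing had reached `state` yet
            local.run(cp.input, log); // recompute from the checkpoint
        }
        state.putAll(log);            // commit the buffered writes
    }
}
```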

  21. Discussion ... • Anonymity: A host executing a stolen subtree cannot determine the client. • Managers are assumed to be trustworthy • Hierarchy: Yes, via the manager hierarchy. • Ease of use: Interface incomplete. • Clients submit jobs via a special “shell”

  22. Discussion ... • Adaptive parallelism: • The “owner” (?) of a compute server sets a policy that defines when the server is idle. • How? • When a compute server becomes unavailable for Atlas work, all its subcomputations are moved to another compute server.
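
The slide leaves open how an owner expresses “idle”. A hypothetical answer, sketched in Java: the policy is a predicate over local observations (minutes since the last user input and CPU load are assumed inputs, not Atlas's), and a server that stops being idle hands its queued subcomputations to another server. Subcomputations already running, and the parent/child bookkeeping that migration needs, are not modeled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical owner-defined idle policy.
interface IdlePolicy {
    boolean isIdle(int minutesSinceInput, double cpuLoad);

    // Example policy: idle once the user has been away `minutes` and load is low.
    static IdlePolicy awayAndQuiet(int minutes, double maxLoad) {
        return (away, load) -> away >= minutes && load < maxLoad;
    }
}

final class RetreatingServer {
    final Deque<String> work = new ArrayDeque<>(); // queued subcomputations
    final IdlePolicy policy;

    RetreatingServer(IdlePolicy policy) {
        this.policy = policy;
    }

    // Polled periodically. If the machine is no longer idle, move every queued
    // subcomputation to `fallback` and report that this server has retreated.
    boolean checkAndRetreat(int minutesSinceInput, double cpuLoad, RetreatingServer fallback) {
        if (policy.isIdle(minutesSinceInput, cpuLoad)) {
            return false;
        }
        while (!work.isEmpty()) {
            fallback.work.addLast(work.pollFirst());
        }
        return true;
    }
}
```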

  23. Adaptive Parallelism ... • Moving a subcomputation requires updating the information linking the subcomputation to its: • parent • children • How long does it take to retreat? • Is the subcomputation restarted? From a checkpoint?
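
A minimal in-memory Java sketch of this bookkeeping, assuming each subcomputation caches the hosts of its parent and children: moving one to a new host costs one update for the parent plus one per child, which gives a rough feel for how retreat time might grow with the number of children. The classes, host strings and message count are illustrative assumptions; real updates would be network messages, and whether work restarts from a checkpoint is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of re-linking a migrated subcomputation; not Atlas code.
final class Migration {
    static final class Subcomputation {
        final String id;
        String host;
        final Subcomputation parent;
        final List<Subcomputation> children = new ArrayList<>();
        // This subcomputation's cached view of where its neighbours live.
        final Map<String, String> knownHosts = new HashMap<>();

        Subcomputation(String id, String host, Subcomputation parent) {
            this.id = id;
            this.host = host;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
                parent.knownHosts.put(id, host);
                knownHosts.put(parent.id, parent.host);
            }
        }
    }

    // Moves `s` to `newHost`; returns the number of link updates sent.
    static int move(Subcomputation s, String newHost) {
        s.host = newHost;
        int messages = 0;
        if (s.parent != null) {
            s.parent.knownHosts.put(s.id, newHost); // tell the parent
            messages++;
        }
        for (Subcomputation c : s.children) {
            c.knownHosts.put(s.id, newHost);        // tell each child
            messages++;
        }
        return messages;
    }
}
```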

  24. Limitations • Atlas inherits the tree-structured program limitation from Cilk. • But this is still a rich set! • Generalizing to non-tree-structured programs seems hard. • No shared variables among threads. • The global file system is read-only.

  25. Conclusion • Jicos design goals = those for Atlas. • Use JXTA to give Jicos a “file system” • Then, Jicos becomes Atlas’s heir.