People • Eric Baldeschwieler (UC Berkeley) • Bobby Blumofe (UT Austin) • Eric Brewer (UC Berkeley)
Outline • Introduction • Programming model • Architecture • Examples • Discussion • Limitations & Conclusion
Introduction Properties of an Internet computing infrastructure • Scalability: to 10^6 nodes • Heterogeneity: of machines & OSs • Fault tolerance: completion probability comparable to sequential program • Adaptive parallelism: dynamic set of resources
Properties ... • Safety: Hosts must be secure • Anonymity: Secure privacy of client: data & program • Hierarchy: Locality of communication (local bandwidth typically is higher) • Ease of use: Minimize “costs” of participating. • Reasonable performance: • Low overhead • Benefit from a small set of machines.
Introduction ... • Atlas combines mechanisms from: • Cilk • Java • with new mechanisms. • Java “ensures”: • heterogeneity • safety
Introduction ... Atlas: • extends Cilk’s work-stealing scheduler to a hierarchical Internet setting • uses Cilk-NOW’s mechanisms for: • adaptive parallelism • fault tolerance
Programming Model • Applications are written in Java • When a native library is used, heterogeneity is limited to platforms that support it. • Programming model is: • a Java-based implementation of Cilk: • Non-blocking, explicit continuation passing threads • a Unix-like URL-based file system & local caching with coherence.
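The "non-blocking, explicit continuation passing threads" bullet can be illustrated with a minimal sketch. This is not the Atlas API: the class `CpsFib`, the join counter `Sum`, and the single-threaded ready queue are hypothetical names chosen only to show the style. Each thread runs to completion, never waits for a child, and hands its result to an explicit continuation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

public class CpsFib {
    // Ready queue of threads; each one runs to completion without blocking.
    static final Deque<Runnable> ready = new ArrayDeque<>();

    // Hypothetical join counter: collects two child results,
    // then schedules the continuation with their sum.
    static final class Sum {
        private final IntConsumer k;
        private int missing = 2;
        private int total = 0;

        Sum(IntConsumer k) { this.k = k; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                final int result = total;
                ready.push(() -> k.accept(result));
            }
        }
    }

    // fib never waits for its children; it passes its result to continuation k.
    static void fib(int n, IntConsumer k) {
        if (n < 2) {
            k.accept(n);
            return;
        }
        Sum sum = new Sum(k);
        ready.push(() -> fib(n - 1, sum::send)); // spawn child thread
        ready.push(() -> fib(n - 2, sum::send)); // spawn child thread
    }

    // Drain the ready queue on one processor and return the final result.
    static int run(int n) {
        int[] out = new int[1];
        ready.push(() -> fib(n, v -> out[0] = v));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(run(24)); // prints 46368
    }
}
```

Because no thread ever blocks, any queued `Runnable` is a self-contained unit of work, which is what makes it a candidate for stealing by another processor.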
Architecture Basic architecture [Figure: boxes labeled Client, Manager, and Compute Server (several); software stack labeled Application (Java), Runtime library, Java interpreter, Native libraries (C or C++)]
Architecture ... • Client is a Java application • connects to compute servers on machines other than its manager’s. • Idle servers steal work from busy ones.
Architecture • Compute server: • relinquishes control when there is non-Atlas work (a screensaver?) • Runs as a daemon that is either: • working, or • pinging its manager & siblings for work to steal
Architecture: Porting Atlas • A Java runtime system • Port: • natively written URL-based file system • some support routines.
Hierarchical Work Stealing [Figure: hierarchy of Managers with Compute Servers beneath them]
Hierarchical Work Stealing ... • Manager keeps track of when its subtree is idle • If manager’s subtree is idle, manager steals work from its siblings • If a subtree has “too much” work, it “allows” work stealing from above What is the definition & implementation of “too much”?
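The stealing rules on this slide can be sketched as follows. This is an illustration under assumptions, not the Atlas implementation: the `Node` type, the `steal` routine, and especially the `TOO_MUCH` threshold are hypothetical, since the slides leave the definition of "too much" open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HierarchicalStealing {
    // A node is a compute server (no children) or a manager (has children).
    static final class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        // Compute servers push new work at the front; thieves take from the back.
        final Deque<String> tasks = new ArrayDeque<>();

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        // Total queued work in the subtree rooted at this node.
        int load() {
            int n = tasks.size();
            for (Node c : children) n += c.load();
            return n;
        }

        // Give up the oldest task of the most loaded child subtree.
        String give() {
            if (children.isEmpty()) return tasks.pollLast();
            Node busiest = null;
            for (Node c : children) {
                if (busiest == null || c.load() > busiest.load()) busiest = c;
            }
            return busiest == null ? null : busiest.give();
        }
    }

    // ASSUMPTION: a subtree has "too much" work at this many queued tasks.
    static final int TOO_MUCH = 2;

    // An idle server first tries its siblings; only when a whole subtree is
    // idle does the search climb, and then only overloaded subtrees are robbed.
    static String steal(Node thief) {
        for (Node n = thief; n.parent != null; n = n.parent) {
            if (n.load() > 0) return null; // this subtree is not idle
            boolean local = (n == thief);
            for (Node sibling : n.parent.children) {
                if (sibling == n) continue;
                int load = sibling.load();
                boolean allowed = local ? load > 0 : load >= TOO_MUCH;
                if (allowed) return sibling.give();
            }
        }
        return null;
    }
}
```

Trying siblings before climbing keeps most steals inside one manager's subtree, which is the "localize communication" goal; the threshold only limits steals that cross subtree boundaries.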
Hierarchical Work Stealing • The authors claim that proven properties of Cilk hold in this hierarchical setting. • Goals: • Localize communication • Sub-trees map to domain hierarchy Administrators can control thread migration: • Outflow: Privacy • Inflow: Host security
Examples • Fib: fine grained threads • POV-Ray: coarse grained threads
            Base     1 Node   3 Nodes    8 Nodes
Fib (24)    1.3      80       40 (2.0)   31 (2.6)
POV-Ray     20700    21000    -          2700 (7.8)
Numbers in ( ) are speedups over 1-node case.
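The parenthesized speedups follow directly from the table entries. Writing T_1 for the 1-node time and T_P for the P-node time (this notation is introduced here, not taken from the slides), the arithmetic is:

```latex
\[
S_P = \frac{T_1}{T_P}, \qquad
S_3^{\mathrm{Fib}} = \frac{80}{40} = 2.0, \qquad
S_8^{\mathrm{Fib}} = \frac{80}{31} \approx 2.6, \qquad
S_8^{\mathrm{POV\text{-}Ray}} = \frac{21000}{2700} \approx 7.8
\]
```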
Examples ... • POV-Ray is not written in Java • Partitioning is done in Java • 8 nodes: only 2% overhead. • What about larger P?
Discussion • Scalable: Yes. • Heterogeneity: Incomplete until Atlas divorces itself from all native libraries. • Safety: • Java: OK. • Native libraries: ?
Discussion ... • Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by subtree (manager?) • What is the effect of checkpointing on performance? Subtree rooted at a thread is its subcomputation.
Fault Tolerance ... Subcomputations are transactions: • Authors claim: side effects can be undone • How does this relate to hierarchical work stealing?
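The "timed-out thread is recomputed from a checkpoint" idea can be sketched in a few lines. This is a simplified stand-in, not the Cilk-NOW or Atlas mechanism: here the "checkpoint" is just the retained `Callable`, and `runWithRetry`, `timeoutMs`, and `maxAttempts` are hypothetical names. It assumes the subcomputation has no side effects that must be undone, so rerunning it is safe.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CheckpointRetry {
    // Run a subcomputation; if no result arrives within timeoutMs,
    // abandon that attempt and recompute from the retained checkpoint.
    static <T> T runWithRetry(Callable<T> checkpoint, long timeoutMs, int maxAttempts) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> result = pool.submit(checkpoint);
                try {
                    return result.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    result.cancel(true); // treat the worker as lost
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            throw new IllegalStateException("no result after " + maxAttempts + " attempts");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The sketch makes the open performance question concrete: every subcomputation that might be lost must keep enough state to be resubmitted, and a lost attempt costs at least one full timeout.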
Discussion ... • Anonymity: A host executing a stolen subtree cannot determine client. • Managers are assumed to be trustworthy • Hierarchy: Yes, via manager hierarchy. • Ease of use: Interface incomplete. • clients submit jobs via a special “shell”
Discussion ... • Adaptive parallelism: • “Owner” (?) of compute server sets a policy that defines when server is idle. • How? • When compute server becomes unavailable for Atlas work, all its sub-computations are moved to another compute server.
Adaptive Parallelism ... • Moving a subcomputation requires updating information linking subcomputation to its: • parent • children • How long does it take to retreat? • Is sub-computation restarted? From checkpoint?
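The link updates required by a move can be sketched as below. The data layout is an assumption made for illustration: each `Subcomputation` records the host of its parent and of each child, and `migrate` rewrites those records. In a distributed setting these updates would be messages rather than field writes.

```java
import java.util.HashMap;
import java.util.Map;

public class Migration {
    static final class Subcomputation {
        final String id;
        String host;                  // compute server currently holding it
        String parentHost;            // where results are sent; null at the root
        final Subcomputation parent;
        final Map<String, Subcomputation> children = new HashMap<>();
        final Map<String, String> childHosts = new HashMap<>();

        Subcomputation(String id, String host, Subcomputation parent) {
            this.id = id;
            this.host = host;
            this.parent = parent;
            if (parent != null) {
                parentHost = parent.host;
                parent.children.put(id, this);
                parent.childHosts.put(id, host);
            }
        }
    }

    // Move s to newHost and repair the links held by its parent and children.
    static void migrate(Subcomputation s, String newHost) {
        s.host = newHost;
        if (s.parent != null) {
            s.parent.childHosts.put(s.id, newHost); // parent learns the new location
        }
        for (Subcomputation child : s.children.values()) {
            child.parentHost = newHost;             // children redirect their results
        }
    }
}
```

The cost of a move therefore grows with the number of children of the moved subcomputation, which bears on the "how long does it take to retreat?" question.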
Limitations • Atlas inherits tree-structured program limitation from Cilk. • But this is still a rich set! • Generalizing to non-tree-structured programs seems hard. • No shared variables among threads. • Global file system is read-only.
Conclusion • Jicos design goals = those for Atlas. • Use JXTA to give Jicos a “file system” • Then, Jicos becomes Atlas’s heir.