
Scalable Cluster Management: Frameworks, Tools, and Systems

David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell. Sandia National Laboratories. Lilith: a tool framework for very large clusters.


Presentation Transcript


  1. Scalable Cluster Management: Frameworks, Tools, and Systems. David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell. Sandia National Laboratories

  2. Lilith: a tool framework for very large clusters • Most current tools for clusters are designed as monolithic programs, to do one task well. • If you need a new task, you need a new tool. • The Lilith framework allows users to easily construct new tools using a component framework.

  3. Control of large distributed systems • System administration • Auditing & job control by users • Interrogation of processes • Simple applications. Example: 1 sec program on 1000 nodes (figure labels: 10 sec, 16 min)
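The transcript does not explain the "10 sec" and "16 min" labels on this slide. One reading that fits the arithmetic, offered here only as an assumption, is that running a 1 sec program on 1000 nodes one after another takes 1000 sec (about 16.7 min), while a binary tree spanning the same nodes is about 10 levels deep. The Python sketch below works through those numbers. The function names, the binary fan-out, and the idea that cost grows with tree depth are illustrative choices, not taken from the slides.

```python
def serial_seconds(nodes, per_node_seconds):
    """Total time if the program is run on each node in turn."""
    return nodes * per_node_seconds


def tree_levels(nodes, fanout):
    """Levels in the shallowest complete tree with this fan-out that holds `nodes` nodes."""
    levels, capacity, level_size = 0, 0, 1
    while capacity < nodes:
        capacity += level_size   # add one more full level
        level_size *= fanout     # the next level is `fanout` times wider
        levels += 1
    return levels


if __name__ == "__main__":
    total = serial_seconds(1000, 1)
    print(f"serial: {total} s = {total / 60:.1f} min")
    print(f"binary tree levels: {tree_levels(1000, 2)}")
```

Under this reading, a larger fan-out would give a shallower tree; the slide itself does not state which fan-out was used.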

  4. Lilith: Scalable component framework • Lilith spans a tree of machines executing user-defined code. • User code (Lilim/Lilly) provides component functionality on a single node. • Provides scalable distribution and result collection.

  5. Component Methods • MO[] distributeOnTree(MO, int[]) • data distribution down the tree • MO onTree(MO) • component action on the node • MO collateOnTree(MO[]) • result collection and condensation
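To make the division of labour among the three methods concrete, here is a minimal Python sketch of a component run over a tree. It is not Lilith's code: the `Node` class, the `HostnameComponent` example, and the `run_on_tree` driver are hypothetical, the method names are Python-style renderings of the slide's signatures, and `on_tree` takes an extra `node` argument so the sketch can run in one process. The recursion stands in for what would be remote calls between machines.

```python
class Node:
    """Hypothetical tree node: a host name plus child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


class HostnameComponent:
    """Illustrative component that gathers the name of every node in the tree."""

    def distribute_on_tree(self, mo, child_ids):
        # Data distribution down the tree: here every child gets the same message.
        return [mo for _ in child_ids]

    def on_tree(self, mo, node):
        # Component action on the node: report this node's name.
        return [node.name]

    def collate_on_tree(self, results):
        # Result collection and condensation: merge the lists into one sorted list.
        merged = []
        for result in results:
            merged.extend(result)
        return sorted(merged)


def run_on_tree(node, mo, component):
    """Distribute to children, act locally, then collate local and child results."""
    child_ids = list(range(len(node.children)))  # assumed meaning of the int[] argument
    parts = component.distribute_on_tree(mo, child_ids)
    local = component.on_tree(mo, node)
    child_results = [run_on_tree(child, part, component)
                     for child, part in zip(node.children, parts)]
    return component.collate_on_tree([local] + child_results)


if __name__ == "__main__":
    tree = Node("n0", [Node("n1", [Node("n3")]), Node("n2")])
    print(run_on_tree(tree, "who is there?", HostnameComponent()))
```

Because each node condenses its subtree's results before passing them up, the root only ever handles one message per child rather than one per node.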

  6. Security • Uses purely Java 2 mechanisms at this time. • User sends credential with call • LilithHost creates ProtectionDomain from user credential • LilithHost calls checkPermission • Sandbox set up similarly using the user credential and PolicyFile (diagram labels: LilithHost, Keys, Policy, Method invocation)

  7. Prototypical tools • System monitoring tool to track the state of a cluster of machines • PS-tool to get sortable process information from selected nodes of the cluster

  8. Lilith Lights tool • Snake toy app • demo that draws a snake over front panel • no global repository for state --- all info distributed • Snake’s movement was limited to left half of machine • program error in declaration of drand48() biased results

  9. Who serves who? • Programmers adapt to: • The OS that runs on the machine, • The system configuration chosen by the admins • Changing system environments • economically driven to heterogeneous distributed computing • Why can’t the user dictate the software environment as a resource request?

  10. DASE • Dynamically Adaptive Software Environment • Provide multi-OS/multi-environment capability • Manage multiple SW environments • “save” user environment for reuse later • Integration with SW component architectures

  11. DASE Service Object Model (diagram; labels: Logical partitioning, Physical system, “system” model, Resource Space, Scheduler, Mesher, Mapping, App Object (resource spec, data/map objects), Partitioner, Resource Request, Visualizer, Solver, App Space)

  12. Flexible Resource Management

  13. Scalable Unit (diagram; labels: 8 Myrinet LAN cables, power controllers, terminal servers, 16 port Myrinet switches, compute nodes, service nodes, sss0, 100BaseT hubs, link to system support network; legend: Myrinet, power, serial, Ethernet)

  14. System Support Hierarchy (diagram; labels: Admin access, Master copy of system software, sss1, three sss0 each with an in-use copy of system software, three Scalable Units whose nodes NFS mount root from SSS0)

  15. Hardware Management • Discovery and Control • Perl scripts that • control individual devices (power controller, terminal server, machine, switch) • build a database of configuration info (MAC and IP addresses, serial numbers, etc.) • Roles • database is augmented with each component's role in the system (compute, sss0, terminal server, etc.)
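The slide describes Perl scripts; the Python sketch below is only a stand-in to illustrate the two-step idea of building a configuration database from discovered devices and then augmenting it with roles. The record fields, function names, and the "unassigned" default are assumptions made for illustration, not the Cplant schema.

```python
def build_database(devices):
    """Index discovered device records by MAC address (hypothetical key choice)."""
    return {device["mac"]: dict(device) for device in devices}


def assign_roles(database, roles):
    """Augment each record with its role; devices without a known role stay 'unassigned'."""
    for mac, record in database.items():
        record["role"] = roles.get(mac, "unassigned")
    return database


def devices_with_role(database, role):
    """Return the sorted IP addresses of all devices holding the given role."""
    return sorted(r["ip"] for r in database.values() if r["role"] == role)


if __name__ == "__main__":
    # Made-up sample records, not real Cplant data.
    discovered = [
        {"mac": "00:00:00:00:00:01", "ip": "10.0.0.1", "serial": "A1"},
        {"mac": "00:00:00:00:00:02", "ip": "10.0.0.2", "serial": "A2"},
        {"mac": "00:00:00:00:00:03", "ip": "10.0.0.3", "serial": "A3"},
    ]
    roles = {"00:00:00:00:00:01": "sss0", "00:00:00:00:00:02": "compute"}
    db = assign_roles(build_database(discovered), roles)
    print(devices_with_role(db, "compute"))
```

Keeping roles as a separate pass means discovery can be rerun without losing role assignments held elsewhere; whether the original scripts were organised this way is not stated in the slides.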

  16. “Virtual Machines” • Allow arbitrary grouping of scalable units that use the same system software • Operations to update system software and boot nodes, scalable units, or machines • Update system software on an SU in 1 min. • Update system software on 24 SUs in 1.5 min. • Boot an SU in 5 min. (staged for power drain) • Boot 24 SUs in 10 min.
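As a back-of-envelope check on these timings, the Python sketch below compares the quoted 24-SU figures with a hypothetical serial baseline in which each SU takes as long as a single SU and they are handled one after another. That baseline is an assumption introduced here for comparison; the slides report only the measured times.

```python
def serial_minutes(per_su_minutes, su_count):
    """Hypothetical baseline: handle the SUs one after another."""
    return per_su_minutes * su_count


def speedup(serial, measured):
    """How many times faster the measured time is than the serial baseline."""
    return serial / measured


if __name__ == "__main__":
    # Figures from the slide: 1 min per SU update, 1.5 min for 24 SUs.
    update = speedup(serial_minutes(1, 24), 1.5)
    # Figures from the slide: 5 min per SU boot, 10 min for 24 SUs.
    boot = speedup(serial_minutes(5, 24), 10)
    print(f"update speedup: {update}")  # 16.0
    print(f"boot speedup: {boot}")      # 12.0
```

Against that baseline, the update would be about 16 times faster and the boot about 12 times faster, consistent with the operations running largely in parallel across SUs.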

  17. “Virtual Machines” • SU configuration database • Uses rdist to push system software down (diagram; labels: Production, Alpha, Beta, Linux 2.3, sss1, three sss0 each with an in-use copy of system software, three Scalable Units whose nodes NFS mount root from SSS0)

  18. http://dancer.ca.sandia.gov • http://www.cplant.ca.sandia.gov • http://www.cs.sandia.gov/cplant
