
Next Generation Job Management System for Extreme Scales



  1. Next Generation Job Management System for Extreme Scales Speaker: Ke Wang Advisor: Ioan Raicu Datasys Lab, Illinois Institute of Technology November 25th, 2013, at the Datasys Mini-Workshop, Fall 2013

  2. Contents • SLURM++: a distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission) • MATRIX: a distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)

  3. Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work

  4. Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work

  5. Exascale Computing • Today (June 2013): 34 Petaflops • O(100K) nodes • O(1M) cores • Near future (~2020): Exaflop computing • ~1M nodes • ~1B processor-cores/threads

  6. Job Management Systems for Exascale Computing • Ensemble computing • Over-decomposition • Many-Task Computing • Jobs/tasks are finer-grained • Requirements • high availability • extremely high throughput (1M tasks/sec) • low latency

  7. Current Job Management Systems • Batch-scheduled HPC workloads • Lack support for ensemble workloads • Centralized design • Poor scalability • Single point of failure • SLURM has a maximum throughput of 500 jobs/sec • A decentralized design is needed

  8. Goal • Architect and design job management systems for exascale ensemble computing • Identify the challenges and solutions towards supporting job management systems at extreme scales • Evaluate and compare different design choices at large scale

  9. Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work

  10. Contributions • Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales • Designed and developed a novel distributed resource stealing algorithm for efficient HPC job launch • Designed and implemented SLURM++, a distributed job launch prototype for extreme scales, by leveraging SLURM and ZHT • Evaluated SLURM and SLURM++ up to 500 nodes with micro-benchmarks of different job sizes, with excellent results: up to 10X higher throughput

  11. Architecture • Controllers are fully connected • Ratio and partition size are configurable for HPC and MTC • Data servers are also fully connected

  12. Job and Resource Metadata
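As a rough illustration of what this metadata might look like in the DKVS, the sketch below stores one free-node record per controller and per-job records keyed by job id. The key layout and the dkvs_put stub are assumptions for illustration, not the actual SLURM++ schema or the ZHT API:

```c
#include <stdio.h>

/* Stub standing in for a ZHT put; the real client API differs. */
static int dkvs_put(const char *key, const char *value)
{
    printf("PUT %-16s -> %s\n", key, value);
    return 0;
}

int main(void)
{
    /* Per-partition resource record, keyed by the controller id. */
    dkvs_put("ctrl/7/free", "node350 node351 node352");

    /* Per-job records, keyed by the job id. */
    dkvs_put("job/1042/nodes", "node350 node351");
    dkvs_put("job/1042/state", "RUNNING");
    return 0;
}
```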

  13. Resource Stealing • The procedure of stealing resources from other partitions • Why steal resources? • When to steal resources? • Where and how to steal resources? • What if there are no resources to steal? (a sketch follows below)
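The sketch below shows one way those questions could play out in code, assuming random victim selection and a back-off retry when no partition has free nodes; the real SLURM++ algorithm coordinates allocations through the DKVS, as described on the next slide:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_PART 10

/* Toy per-partition free-node counts standing in for DKVS state. */
static int avail[NUM_PART] = { 0, 4, 0, 2, 0, 0, 8, 0, 1, 0 };

static int try_allocate(int victim, int want)
{
    int take = avail[victim] < want ? avail[victim] : want;
    avail[victim] -= take;      /* CAS-protected in the real system */
    return take;
}

/* Steal until the job's demand is met; if no partition has anything,
 * back off and retry (the "what if" case above). */
static int steal_resources(int self, int needed)
{
    int stolen = 0;
    while (stolen < needed) {
        int progress = 0;
        for (int i = 0; i < NUM_PART && stolen < needed; i++) {
            int victim = rand() % NUM_PART;   /* random victim partition */
            if (victim == self)
                continue;
            int got = try_allocate(victim, needed - stolen);
            stolen += got;
            progress += got;
        }
        if (!progress)
            usleep(100 * 1000);  /* nothing anywhere: wait, then retry */
    }
    return stolen;
}

int main(void)
{
    printf("stole %d nodes\n", steal_resources(0, 6));
    return 0;
}
```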

  14. Resource Contention • Occurs when different controllers try to allocate the same resources • The naive solution is a global lock for each queried key in the DKVS • Instead, an atomic compare-and-swap operation in the DKVS tells a controller whether its resource allocation succeeded
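A minimal sketch of that compare-and-swap retry loop, using a C11 atomic integer as a stand-in for the per-partition record (ZHT's compare-and-swap actually operates on string values):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Shared free-node count standing in for the DKVS record. */
static _Atomic int free_nodes = 8;

/* Lock-free allocation: read the current count, then compare-and-swap
 * the decremented value back. If another controller raced us, the CAS
 * fails and we retry against the fresh state. */
static int allocate(int n)
{
    int seen = atomic_load(&free_nodes);
    while (seen >= n) {
        if (atomic_compare_exchange_weak(&free_nodes, &seen, seen - n))
            return 1;   /* allocation succeeded */
        /* CAS failed: 'seen' now holds the value another controller
         * wrote; loop and re-check against it. */
    }
    return 0;           /* not enough nodes: go steal elsewhere */
}

int main(void)
{
    printf("alloc 3 -> %d\n", allocate(3));   /* 8 -> 5 */
    printf("alloc 6 -> %d\n", allocate(6));   /* fails: only 5 left */
    printf("alloc 5 -> %d\n", allocate(5));   /* 5 -> 0 */
    return 0;
}
```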

  15. Blocking State Change Callback • A controller needs to wait for a specific state change before moving on • Having the client keep polling the server is inefficient • Instead, the server offers a blocking state-change callback operation
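A minimal sketch of such a callback using a condition variable, so the waiter sleeps inside the server instead of busy-polling; this illustrates the idea, not the ZHT implementation (compile with -pthread):

```c
#include <pthread.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static char state[32] = "PENDING";

/* Block until the state differs from what the caller last saw. */
void wait_for_change(const char *seen, char *out, size_t len)
{
    pthread_mutex_lock(&lock);
    while (strcmp(state, seen) == 0)
        pthread_cond_wait(&changed, &lock);   /* sleeps, no polling */
    strncpy(out, state, len - 1);
    out[len - 1] = '\0';
    pthread_mutex_unlock(&lock);
}

/* Writers update the state and wake all blocked watchers. */
void set_state(const char *next)
{
    pthread_mutex_lock(&lock);
    strncpy(state, next, sizeof state - 1);
    pthread_cond_broadcast(&changed);
    pthread_mutex_unlock(&lock);
}
```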

  16. SLURM++ Design and Implementation • SLURM description • Light-weight controller acting as a ZHT client • Job launching runs as a separate thread • Implements the resource stealing algorithm • Developed in C: 3K lines of code, plus SLURM (50K lines) and ZHT (8K lines)
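For illustration, a sketch of the launch-in-a-separate-thread idea with POSIX threads; the job struct and timing are made up, and the real controller launches jobs through SLURM's mechanisms:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative job record; not the SLURM++ struct. */
typedef struct { int id; int nodes; } job_t;

/* Launch runs in its own thread so the controller's main loop can
 * keep accepting jobs while earlier launches are still in flight. */
static void *launch_job(void *arg)
{
    job_t *j = (job_t *)arg;
    printf("launching job %d on %d nodes\n", j->id, j->nodes);
    sleep(1);   /* stands in for the actual srun-style launch */
    printf("job %d done, releasing nodes\n", j->id);
    return NULL;
}

int main(void)
{
    static job_t j = { 42, 4 };
    pthread_t tid;
    pthread_create(&tid, NULL, launch_job, &j);
    /* ... controller continues serving new requests here ... */
    pthread_join(tid, NULL);
    return 0;
}
```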

  17. Evaluation • LANL Kodiak machine, up to 500 nodes • Each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory • SLURM version 2.5.3 • Partition size of 50 • Metrics: • Efficiency • Number of ZHT messages

  18. Small-Job Workload • Conclusions: SLURM's throughput remains almost constant, while SLURM++'s increases with scale • The MTC configuration is preferable to the HPC configuration for MTC workloads

  19. Medium-Job Workload • Conclusions: SLURM's throughput remains constant, while SLURM++'s increases linearly with scale • The medium-job workload introduces a small amount of resource-stealing overhead

  20. Big-Job Workload • Conclusions: SLURM begins to saturate at large scales, while SLURM++ keeps an increasing trend with scale • The more partitions there are, the better the chance that a controller can steal resources

  21. Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work

  22. Contributions • Designed and implemented MATRIX, a distributed MTC task execution fabric that uses an adaptive work stealing technique for distributed load balancing and employs a DKVS to store task metadata • Explored the parameter space of work stealing as applied to exascale-class systems through SimMatrix, a light-weight job scheduling system simulator, up to millions of nodes, billions of cores, and trillions of tasks • Evaluated and compared MATRIX with other task schedulers (Falkon, Sparrow, and CloudKon) on an IBM Blue Gene/P machine and the Amazon cloud, using both micro-benchmarks and real workload traces

  23. Work Stealing • A distributed load balancing technique: an idle scheduler steals tasks from an overloaded one • Tuning parameters (a stealing round is sketched after this list): • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors • Poll interval
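A sketch of one stealing round under these parameters, assuming an idle scheduler probes k random neighbors and steals half the queue of the most loaded one; the toy queue-length array stands in for remote load queries:

```c
#include <stdio.h>
#include <stdlib.h>

#define NSCHED 64

/* Toy queue lengths standing in for remote queue-length queries. */
static int qlen[NSCHED];

/* One stealing round: probe k random neighbors, pick the most loaded,
 * and steal half its tasks (the "steal half" policy, see slide 34). */
static int steal_round(int self, int k)
{
    int victim = -1, best = 0;
    for (int i = 0; i < k; i++) {
        int cand = rand() % NSCHED;
        if (cand == self)
            continue;
        if (qlen[cand] > best) {
            best = qlen[cand];
            victim = cand;
        }
    }
    if (victim < 0 || best < 2)
        return 0;                   /* nothing worth stealing */
    int take = best / 2;
    qlen[victim] -= take;
    qlen[self]   += take;
    return take;
}

int main(void)
{
    qlen[13] = 1000;                /* one overloaded scheduler */
    printf("stole %d tasks\n", steal_round(0, 8));
    return 0;
}
```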

  24. Work Stealing • Select Neighbors
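As a sketch of the neighbor-selection step, dynamic neighbor selection can draw k distinct schedulers uniformly at random each round instead of keeping a fixed neighbor list; this is an illustrative assumption (and assumes k is smaller than the number of schedulers):

```c
#include <stdlib.h>

/* Pick k distinct random neighbors out of n schedulers (k < n). */
void select_neighbors(int self, int n, int k, int *out)
{
    char *picked = calloc(n, 1);    /* which ids are already taken */
    picked[self] = 1;               /* never select ourselves */
    for (int chosen = 0; chosen < k; ) {
        int cand = rand() % n;
        if (!picked[cand]) {
            picked[cand] = 1;
            out[chosen++] = cand;
        }
    }
    free(picked);
}
```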

  25. Work Stealing • Dynamic Poll Interval
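A sketch of the dynamic poll interval, reusing steal_round from the earlier sketch: the wait doubles while steals keep failing and resets on success, so idle schedulers don't flood busy ones with probes. The bounds are illustrative guesses, not MATRIX's values:

```c
#include <unistd.h>

int steal_round(int self, int k);   /* from the previous sketch */

#define POLL_MIN_US 1000            /* 1 ms  */
#define POLL_MAX_US (1 << 20)       /* ~1 s  */

void stealing_loop(int self)
{
    long poll = POLL_MIN_US;
    for (;;) {
        if (steal_round(self, 8) > 0) {
            poll = POLL_MIN_US;     /* success: poll eagerly again */
        } else {
            usleep(poll);           /* failure: back off, then retry */
            if (poll < POLL_MAX_US)
                poll *= 2;          /* exponential back-off */
        }
    }
}
```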

  26. MATRIX Architecture

  27. Task Submission • Worst case: all tasks are submitted to one scheduler • Best case: tasks are evenly distributed across all schedulers
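A toy sketch of the best case, spreading tasks round-robin across the schedulers (the worst case would send every task to scheduler 0):

```c
#include <stdio.h>

#define NSCHED 64

int main(void)
{
    int ntasks = 10;
    for (int t = 0; t < ntasks; t++) {
        int target = t % NSCHED;    /* even, round-robin spread */
        printf("task %d -> scheduler %d\n", t, target);
    }
    return 0;
}
```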

  28. Execute Unit

  29. Client Monitoring • The client doesn't have to stay alive • A monitoring program polls the task execution progress periodically • It records logs about the system state for visualization
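A toy sketch of such a monitor loop; the fake progress counter stands in for queries against the schedulers' task state, and the poll period is illustrative:

```c
#include <stdio.h>
#include <unistd.h>

static int done = 0, total = 20;

/* Fake progress query: pretend 4 tasks finish per poll. */
static int tasks_done(void)
{
    done += 4;
    return done > total ? (done = total) : done;
}

int main(void)
{
    int d;
    do {
        d = tasks_done();
        fprintf(stderr, "progress: %d/%d tasks done\n", d, total);
        sleep(1);       /* poll period */
    } while (d < total);
    return 0;
}
```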

  30. Evaluation • Focus on the Amazon cloud environment • The "m1.medium" instance: 1 virtual CPU (2 compute units), 3.7GB memory, 410GB hard disk • Ran MATRIX, Sparrow, and CloudKon on up to 64 instances • Each system uses 2 executing threads

  31. Throughput Comparison • Reasons: MATRIX achieves near-perfect load balancing when submitting tasks • Sparrow needs to send probe messages to push tasks • The CloudKon clients need to push tasks to and pull them from SQS

  32. Efficiency Comparison • Conclusions: SQS and DynamoDB are designed specifically for the cloud • Sparrow's probing-and-pushing method scales poorly for heterogeneous workloads • MATRIX's network communication layer needs significant improvement

  33. Exploration of the Work Stealing Parameter Space • Through SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors

  34. Number of Tasks to Steal • Conclusions: steal half of the tasks • A small number of static neighbors is not sufficient • The more tasks stolen (up to half), the better the performance

  35. Number of Static Neighbors • Conclusions: the optimal number of static neighbors is an eighth of all nodes • This is infeasible at exascale (128K neighbors) • Dynamic neighbors are needed to reduce the neighbor count

  36. Number of Dynamic Neighbors • Conclusions: a square-root number of dynamic neighbors is optimal • This is reasonable at exascale (1K neighbors)

  37. Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work

  38. Conclusions • Applications for exascale computing are becoming ensemble-based and finer-grained • Task schedulers for exascale computing need to be distributed and scalable • SLURM++ should be integrated with MATRIX

  39. Future Work • Re-implement MATRIX • Large-scale runs • Integration of SLURM++ and MATRIX • Workflow integration • Data-aware scheduling • Distributed MapReduce framework support

  40. Acknowledgements • Xiaobing Zhou • Hao Chen • Kiran Ramamurthy • Iman Sadooghi • Michael Lang • Ioan Raicu

  41. More Information • http://datasys.cs.iit.edu/~kewang/ • Contact: kwang22@hawk.iit.edu • Questions?
