    Presentation Transcript
    1. Next Generation Job Management System for Extreme Scales
    Speaker: Ke Wang
    Advisor: Ioan Raicu
    Datasys Lab, Illinois Institute of Technology
    November 25th, 2013, at the Datasys Mini-Workshop, Fall 2013

    2. Contents
    • SLURM++: a distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission)
    • MATRIX: a distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)

    3. Outline
    • Introduction & Motivation
    • SLURM++
    • MATRIX
    • Conclusion & Future Work

    4. Outline
    • Introduction & Motivation
    • SLURM++
    • MATRIX
    • Conclusion & Future Work

    5. Exascale Computing
    • Today (June 2013): 34 Petaflops
      • O(100K) nodes
      • O(1M) cores
    • Near future (~2020): Exaflop computing
      • ~1M nodes
      • ~1B processor cores/threads

    6. Job Management Systems for Exascale Computing
    • Ensemble computing
      • over-decomposition
      • Many-Task Computing
      • jobs/tasks are finer-grained
    • Requirements:
      • high availability
      • extremely high throughput (1M tasks/sec)
      • low latency

    7. Current Job Management Systems
    • Batch-scheduled HPC workloads
    • Lack support for ensemble workloads
    • Centralized design
      • poor scalability
      • single point of failure
    • SLURM maximum throughput of 500 jobs/sec
    • A decentralized design is needed

    8. Goal
    • Architect and design job management systems for exascale ensemble computing
    • Identify the challenges and solutions towards supporting job management systems at extreme scales
    • Evaluate and compare different design choices at large scale

    9. Outline
    • Introduction & Motivation
    • SLURM++
    • MATRIX
    • Conclusion & Future Work

    10. Contributions
    • Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales
    • Designed and developed a novel distributed resource stealing algorithm for efficient HPC job launch
    • Designed and implemented a distributed job launch prototype, SLURM++, for extreme scales by leveraging SLURM and ZHT
    • Evaluated SLURM and SLURM++ up to 500 nodes with micro-benchmarks of different job sizes, with excellent results: up to 10X higher throughput

    11. Architecture
    • Controllers are fully connected
    • Ratio and partition size are configurable for HPC and MTC
    • Data servers are also fully connected

    12. Job and Resource Metadata

    13. Resource Stealing
    • The procedure of stealing resources from other partitions
    • Why steal resources?
    • When to steal resources?
    • Where and how to steal resources?
    • What if there are no resources to steal?
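The four questions above outline a stealing loop. A minimal sketch in Python (all names here are hypothetical, not the actual SLURM++ C implementation): a controller drains its own partition first, then visits remote partitions in random order, and rolls everything back if the job still cannot be satisfied.

```python
import random

def allocate(job_size, partitions, my_id):
    """Steal-based allocation sketch. `partitions` maps a controller id
    to its list of free nodes."""
    taken = []                                   # (owner, node) pairs

    def grab(owner, want):
        free = partitions[owner]
        for _ in range(min(want, len(free))):
            taken.append((owner, free.pop()))

    grab(my_id, job_size)                        # "when": only on local shortage
    victims = [p for p in partitions if p != my_id]
    random.shuffle(victims)                      # "where": random victim order
    for v in victims:
        if len(taken) >= job_size:
            break
        grab(v, job_size - len(taken))           # "how": take the shortfall
    if len(taken) < job_size:                    # "what if none": roll back
        for owner, node in taken:
            partitions[owner].append(node)
        return None
    return [node for _, node in taken]
```

A real implementation would have to make each grab atomic against concurrent controllers; that is exactly the resource contention problem addressed on the next slide.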

    14. Resource Contention
    • Occurs when different controllers try to allocate the same resources
    • The naive solution is to add a global lock for each queried key in the DKVS
    • Instead, an atomic compare-and-swap operation in the DKVS tells each controller whether its resource allocation succeeded
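The compare-and-swap pattern can be sketched as follows: a losing controller simply re-reads the free list and retries, so no global lock is ever held. The toy store below stands in for ZHT; all names are illustrative.

```python
import threading

class DKVS:
    """Toy key-value store with an atomic compare-and-swap (stand-in for ZHT)."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        with self._lock:                  # the store applies this atomically
            if self._data.get(key) != expected:
                return False              # another controller raced us
            self._data[key] = new
            return True

def claim_nodes(store, key, want):
    """Allocate `want` nodes from the free list under `key` without locks:
    re-read and retry until the CAS succeeds."""
    while True:
        free = store.get(key) or []
        if len(free) < want:
            return None                   # not enough resources here
        keep, mine = free[:-want], free[-want:]
        if store.compare_and_swap(key, free, keep):
            return mine
```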

    15. Blocking State Change Callback
    • A controller needs to wait for a specific state change before moving on
    • Inefficient when the client keeps polling the server
    • Instead, the server provides a blocking state change callback operation
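The idea can be sketched with a condition variable: the waiter blocks inside the server until the key reaches the desired value, rather than polling in a loop. This is an illustrative sketch, not the ZHT interface.

```python
import threading

class StateServer:
    """Server-side blocking wait on a state change (hypothetical sketch)."""
    def __init__(self):
        self._state = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._state[key] = value
            self._cond.notify_all()       # wake every blocked waiter

    def wait_for(self, key, value, timeout=5.0):
        # Blocks until state[key] == value; returns False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._state.get(key) == value, timeout)
```

A controller would call `srv.wait_for("job-42", "COMPLETE")` once instead of issuing repeated get requests, eliminating the polling traffic.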

    16. SLURM++ Design and Implementation
    • SLURM description
    • Light-weight controller as a ZHT client
    • Job launching as a separate thread
    • Implements the resource stealing algorithm
    • Developed in C
    • 3K lines of code, plus SLURM (50K lines) and ZHT (8K lines)

    17. Evaluation
    • LANL Kodiak machine, up to 500 nodes
      • each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory
    • SLURM version 2.5.3
    • Partition size of 50
    • Metrics:
      • efficiency
      • number of ZHT messages

    18. Small-Job Workload
    • Conclusions:
      • SLURM's throughput remains almost constant, while SLURM++'s increases with scale
      • the MTC configuration is preferable to the HPC configuration for MTC workloads

    19. Medium-Job Workload
    • Conclusions:
      • SLURM's throughput stays constant, while SLURM++'s increases linearly with scale
      • the medium-job workload introduces a small amount of resource stealing overhead

    20. Big-Job Workload
    • Conclusions:
      • SLURM begins to saturate at large scales, while SLURM++ keeps an increasing trend with scale
      • the more partitions there are, the better the chance that a controller can steal resources

    21. Outline
    • Introduction & Motivation
    • SLURM++
    • MATRIX
    • Conclusion & Future Work

    22. Contributions
    • Designed and implemented a distributed MTC task execution fabric (MATRIX) that uses an adaptive work stealing technique for distributed load balancing, and employs a DKVS to store task metadata
    • Explored the parameter space of work stealing as applied to exascale-class systems, through a light-weight job scheduling system simulator, SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks
    • Evaluated and compared MATRIX with various task schedulers (Falkon, Sparrow, and CloudKon) on an IBM Blue Gene/P machine and the Amazon cloud, using both micro-benchmarks and real workload traces

    23. Work Stealing
    • Distributed load balancing technique
    • An idle scheduler steals tasks from an over-loaded one
    • Parameters:
      • number of tasks to steal
      • number of static neighbors
      • number of dynamic neighbors
      • poll interval
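One stealing round ties these parameters together; a minimal sketch (names and defaults are illustrative, not MATRIX's actual code): the idle scheduler polls a random subset of neighbors and steals a fraction of the most loaded one's queue.

```python
import random

def steal(schedulers, idle_id, num_neighbors, fraction=0.5):
    """One work-stealing round. `schedulers` maps a scheduler id to its
    task queue; returns the number of tasks stolen."""
    others = [s for s in schedulers if s != idle_id]
    neighbors = random.sample(others, min(num_neighbors, len(others)))
    victim = max(neighbors, key=lambda s: len(schedulers[s]))
    n = int(len(schedulers[victim]) * fraction)   # "number of tasks to steal"
    stolen = [schedulers[victim].pop() for _ in range(n)]
    schedulers[idle_id].extend(stolen)
    return len(stolen)
```

When `steal` returns 0 (all sampled neighbors were idle too), the scheduler sleeps for the poll interval before retrying; how that interval adapts is the subject of slide 25.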

    24. Work Stealing: Select Neighbors

    25. Work Stealing: Dynamic Poll Interval
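A dynamic poll interval is commonly realized as exponential back-off; a sketch under that assumption (the base, factor, and cap here are illustrative, not MATRIX's tuned values): after each failed steal the idle scheduler waits longer before retrying, and the interval resets once a steal succeeds.

```python
def next_interval(interval, stole_tasks, base=1.0, factor=2.0, cap=64.0):
    """Return the next poll interval for an idle scheduler."""
    if stole_tasks:
        return base                          # work found: poll eagerly again
    return min(interval * factor, cap)       # idle neighbors: back off
```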

    26. MATRIX Architecture

    27. Task Submission
    • Worst case: all tasks are submitted to one scheduler
    • Best case: all tasks are evenly distributed to all schedulers
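The best case can be approximated by simple round-robin submission (a sketch, not the MATRIX client protocol); the worst case corresponds to appending the whole list to a single scheduler's queue.

```python
def submit_best_case(tasks, num_schedulers):
    """Round-robin tasks evenly across all schedulers' queues."""
    queues = [[] for _ in range(num_schedulers)]
    for i, task in enumerate(tasks):
        queues[i % num_schedulers].append(task)   # spread one task at a time
    return queues
```

In the worst case, work stealing must then redistribute everything from scheduler 0; in the best case there is almost nothing to steal.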

    28. Execute Unit

    29. Client Monitoring
    • The client doesn't have to stay alive
    • A monitoring program polls the task execution progress periodically
    • Records logs about the system state for visualization

    30. Evaluation
    • Focused on the Amazon cloud environment
    • The "m1.medium" instance:
      • 1 virtual CPU, 2 compute units
      • 3.7GB memory
      • 410GB hard disk
    • Ran MATRIX, Sparrow, and CloudKon up to 64 instances
    • The number of executing threads is 2 for all of the systems

    31. Throughput Comparison
    • Reasons:
      • MATRIX has near-perfect load balancing when submitting tasks
      • Sparrow needs to send probe messages to push tasks
      • the clients of CloudKon need to push tasks to and pull tasks from SQS

    32. Efficiency Comparison
    • Conclusions:
      • SQS and DynamoDB are designed specifically for the cloud
      • the probing and pushing method in Sparrow scales poorly for heterogeneous workloads
      • the network communication layer of MATRIX needs significant improvement

    33. Exploration of the Work Stealing Parameter Space
    • Through SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks
    • Number of tasks to steal
    • Number of static neighbors
    • Number of dynamic neighbors

    34. Number of Tasks to Steal
    • Conclusions:
      • steal half of the tasks
      • a small number of static neighbors is not sufficient
      • the more tasks (up to half) stolen, the better the performance

    35. Number of Static Neighbors
    • Conclusions:
      • the optimal number of static neighbors is an eighth of all nodes
      • impossible at exascale, with 128K neighbors per node
      • dynamic neighbors are needed to reduce the neighbor count

    36. Number of Dynamic Neighbors
    • Conclusions:
      • a square-root number of dynamic neighbors is optimal
      • reasonable at exascale (1K neighbors)
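The square-root rule is cheap to apply: at 1M nodes each scheduler samples only sqrt(1,000,000) = 1,000 random neighbors per steal attempt. A sketch of the selection (illustrative, not the MATRIX implementation):

```python
import math
import random

def dynamic_neighbors(my_id, num_nodes):
    """Sample sqrt(num_nodes) distinct random neighbors, excluding self;
    re-sampled on every steal attempt."""
    k = int(math.sqrt(num_nodes))
    picked = set()
    while len(picked) < k:
        n = random.randrange(num_nodes)
        if n != my_id:
            picked.add(n)
    return picked
```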

    37. Outline
    • Introduction & Motivation
    • SLURM++
    • MATRIX
    • Conclusion & Future Work

    38. Conclusions
    • Applications for exascale computing are becoming ensemble-based and finer-grained
    • Task schedulers for exascale computing need to be distributed and scalable
    • SLURM++ should be integrated with MATRIX

    39. Future Work
    • Re-implement MATRIX
    • Large scale
    • Integration of SLURM++ and MATRIX
    • Workflow integration
    • Data-aware scheduling
    • Distributed Map-Reduce framework support

    40. Acknowledgements
    • Xiaobing Zhou
    • Hao Chen
    • Kiran Ramamurthy
    • Iman Sadooghi
    • Michael Lang
    • Ioan Raicu

    41. More Information
    • More information: http://datasys.cs.iit.edu/~kewang/
    • Contact: kwang22@hawk.iit.edu
    • Questions?