
Scheduling and Resource Management for Next-generation Clusters

Scheduling and Resource Management for Next-generation Clusters. Yanyong Zhang Penn State University www.cse.psu.edu/~yyzhang. What is a Cluster?. Cost effective Easily scalable Highly available Readily upgradeable. Scientific & Engineering Applications.



Presentation Transcript


  1. Scheduling and Resource Management for Next-generation Clusters Yanyong Zhang Penn State University www.cse.psu.edu/~yyzhang

  2. What is a Cluster? • Cost effective • Easily scalable • Highly available • Readily upgradeable

  3. Scientific & Engineering Applications • HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm) • Sandia's expansion of their Alpha-based C-plant system. • Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm) • A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html) • The PC-cluster-based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)

  4. Commercial Applications • Business applications • Transaction Processing (IBM DB2, Oracle …) • Decision Support System (IBM DB2, Oracle …) • Internet applications • Web serving / searching (Google.com …) • Infowares (Yahoo.com, AOL.com) • Email, eChat, ePhone, eBook, eBank, eSociety, eAnything • Computing portal

  5. Resource Management • Each application is demanding • Several applications/users can be present at the same time • Resource management and quality of service become important.

  6. System Model • Each node is independent • Maximum MPL • Arrival queue [Figure: nodes P0 to P4 connected by a high-speed network and fed from an arrival queue]

  7. Two Phases in Resource Management • Allocation Issues • Admission Control • Arrival Queue Principle • Scheduling Issues (CPU Scheduling) • Resource Isolation • Co-allocation

  8. Co-allocation / Co-scheduling [Figure: timeline from t0 to t1 on nodes P0 and P1, with labels SEND, RECV, switch, and scheduling skewness]

  9. Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2

  10. Contribution 1: Boosting CPU Utilization at Supercomputing Centers

  11. Objective • Minimize slowdown = Response Time / Execute Time in Isolation • Response Time = Wait Time + Execute Time • Waiting occurs in the arrival Q and in the ready/blocked Q
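The slowdown metric on this slide can be illustrated with a short sketch. The record type, field names, and sample numbers below are assumptions made for illustration; they are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job measurements (names are illustrative only)."""
    arrival: float           # time the job entered the arrival queue
    completion: float        # time the job finished
    isolated_runtime: float  # execute time when the job runs in isolation

    @property
    def response_time(self) -> float:
        # Response time covers all waiting plus execution.
        return self.completion - self.arrival

    @property
    def slowdown(self) -> float:
        # slowdown = response time / execute time in isolation
        return self.response_time / self.isolated_runtime


def mean_slowdown(jobs):
    """Average slowdown over a non-empty list of jobs."""
    return sum(job.slowdown for job in jobs) / len(jobs)


# Made-up sample data: the first job waits a long time, the second does not.
jobs = [JobRecord(0.0, 30.0, 10.0), JobRecord(5.0, 25.0, 20.0)]
print([job.slowdown for job in jobs], mean_slowdown(jobs))
```

A slowdown of 1 means the job saw no delay relative to running alone, so a scheduler that minimizes slowdown favors short jobs that would otherwise wait disproportionately long.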

  12. Existing Techniques • Back Filling (BF) • Gang Scheduling (GS) • Migration (M) [Figure: jobs of different sizes packed on a time (vertical) by space (horizontal) chart; # of CPUs = 14]

  13. Proposed Scheme • MBGS = GS + BF + M • Use GS as the basic framework • At each row of GS matrix, apply BF technique • Whenever GS matrix is re-calculated, M should be considered.
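As a rough illustration of the "GS matrix" the scheme builds on, the sketch below packs jobs into rows (time slices) of a gang-scheduling matrix using plain first fit. This is not the MBGS algorithm: backfilling within a row and migration on re-calculation are omitted, and the function name and job sizes are assumptions.

```python
def build_gs_matrix(job_sizes, num_cpus):
    """First-fit packing of jobs into rows of a gang-scheduling matrix.

    Each row is one time slice; all jobs in a row run concurrently, so the
    CPU counts of the jobs in a row may not exceed num_cpus.
    Returns a list of rows, each {"free": idle CPUs, "jobs": [job ids]}.
    """
    rows = []
    for job_id, size in enumerate(job_sizes):
        if size > num_cpus:
            raise ValueError(f"job {job_id} needs {size} CPUs, only {num_cpus} exist")
        for row in rows:
            if row["free"] >= size:
                # The job fits into an existing time slice.
                row["jobs"].append(job_id)
                row["free"] -= size
                break
        else:
            # No existing slice has room: open a new time slice.
            rows.append({"free": num_cpus - size, "jobs": [job_id]})
    return rows


# Hypothetical job sizes on a 14-CPU machine.
matrix = build_gs_matrix([8, 8, 3, 2, 6, 2], num_cpus=14)
for slice_no, row in enumerate(matrix):
    print(slice_no, row["jobs"], "idle CPUs:", row["free"])
```

The idle CPUs left in each row are the holes that a per-row backfilling step would try to fill.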

  14. How Does MBGS Perform?

  15. Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2

  16. Contribution 2: Reducing Response Times for Commercial Applications

  17. Objective • Response Time = Wait Time + Execute Time • Waiting occurs in the arrival Q and in the ready/blocked Q • Minimize wait time • Minimize response time

  18. Previous Work I: Gang Scheduling (GS) • (1) MINUTES! • (2) GS is not responsive enough! [Figure label: wasted]

  19. Previous Work II: Dynamic Co-scheduling • The scheduler on each node makes independent decisions based on local events, without global synchronization. [Figure: nodes P0 to P3 and jobs A to D, annotated "It's A's turn", "C just finishes I/O", "B just gets a msg", "Everybody else is blocked"]

  20. Dynamic Co-scheduling Heuristics

      What do you do on        | How do you wait for a message?
      message arrival?         | Busy Wait | Spin Block | Spin Yield
      -------------------------+-----------+------------+-----------
      No Explicit Reschedule   | Local     | SB         | SY
      Interrupt & Reschedule   | DCS       | DCS-SB     | DCS-SY
      Periodically Reschedule  | PB        | PB-SB      | PB-SY
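The "Spin Block" waiting strategy can be sketched at user level: poll for the message for a bounded time, then give up the CPU and block. The sketch below is an assumption-laden illustration using Python threads; the function name and the `threading.Event` stand-in for message arrival are not from the presentation, which concerns scheduling of communicating processes in a cluster.

```python
import threading
import time


def spin_block_wait(message_arrived: threading.Event, spin_time: float) -> str:
    """Wait for a message: spin for at most spin_time seconds, then block.

    Returns "spin" if the message arrived while spinning and "block" if the
    caller had to block. The return value exists only for illustration.
    """
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:
        if message_arrived.is_set():
            return "spin"      # message arrived while we still held the CPU
    message_arrived.wait()     # spin budget exhausted: release the CPU
    return "block"


# Case 1: the message is already there, so spinning succeeds at once.
ready = threading.Event()
ready.set()
fast = spin_block_wait(ready, spin_time=0.01)

# Case 2: the message arrives long after the spin budget, so the waiter blocks.
late = threading.Event()
threading.Timer(0.05, late.set).start()
slow = spin_block_wait(late, spin_time=0.001)
print(fast, slow)
```

The spin time is the tuning knob: too short and the waiter pays for blocking and rescheduling even when the message is almost there, too long and CPU time is burned that another job could have used.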

  21. Simulation Study • A detailed simulator at a microsecond granularity • System parameters • System configurations (maximum MPL, to partition or not) • System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)

  22. Simulation Study (Cont’d) • Application parameters • Injection load • Characteristics (CPU intensive, IO intensive, communication intensive or somewhere in the middle)

  23. Impact of Load

  24. Impact of Workload Characteristics [Panels: Comm intensive; I/O intensive]

  25. Periodic Boost Heuristics • S1: Compute Phase • S2: S1 + Unconsumed Msg. • S3: Recv. + Msg. Arrived • S4: Recv. + No Msg. • A: S3 -> {S2, S1} • B: S3 -> S2 -> S1 • C: {S3, S2, S1} • D: {S3, S2} -> S1 • E: S2 -> S3 -> S1
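One way to read the five heuristics is as priority orders over process states. The sketch below assumes that "->" means "strictly preferred over" and that braces group states of equal priority; that reading, the function name, and the queue representation are assumptions, since the transcript does not spell out the notation. Processes in S4 (receiving with no message) are never picked under this reading.

```python
# Priority levels per heuristic, highest first (assumed interpretation).
HEURISTICS = {
    "A": [{"S3"}, {"S2", "S1"}],
    "B": [{"S3"}, {"S2"}, {"S1"}],
    "C": [{"S3", "S2", "S1"}],
    "D": [{"S3", "S2"}, {"S1"}],
    "E": [{"S2"}, {"S3"}, {"S1"}],
}


def pick_to_boost(process_states, heuristic):
    """Return the pid to boost, or None if no process qualifies.

    process_states is a list of (pid, state) pairs in queue order; within a
    priority level the first matching process in queue order wins.
    """
    for level in HEURISTICS[heuristic]:
        for pid, state in process_states:
            if state in level:
                return pid
    return None


# Hypothetical queue on one node.
queue = [(1, "S1"), (2, "S2"), (3, "S3"), (4, "S4")]
choices = {name: pick_to_boost(queue, name) for name in HEURISTICS}
print(choices)
```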

  26. Analytical Modeling Study • The state space is impossible to handle. [Figure: nodes P0, P1, P2, P3, …, Pp connected by a high-speed network, with dynamic job arrival]

  27. Analysis Description • Original state space: impossible to handle!! • Reduced state space: much more tractable!! • Assumption: The state of each processor is stochastically independent and identical to the state of the other processors. [The state-space definitions, including the number of jobs on node k, are garbled in this transcript and are not reproduced.]

  28. Analysis Description (Cont) • Address the state transition rates using a continuous-time Markov model; build the generator matrix Q • Get the invariant probability vector π by solving πQ = 0 and πe = 1 • Use fixed-point iteration to get the solution
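For a fixed generator matrix Q, the invariant probability vector π satisfies πQ = 0 together with the normalization πe = 1. The sketch below solves that linear system for a small chain in plain Python. The two-state example and the function name are assumptions; the model in the slides is far larger, and the slides' fixed-point iteration (which would re-solve such a system repeatedly) is not shown.

```python
def stationary_distribution(Q):
    """Solve pi Q = 0, pi e = 1 for an irreducible CTMC generator Q.

    Q is a square list of rows whose off-diagonal entries are transition
    rates and whose rows sum to zero.
    """
    n = len(Q)
    # pi Q = 0 is equivalent to Q^T pi^T = 0; replace the last (redundant)
    # equation with the normalization sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - tail) / A[r][r]
    return pi


# Hypothetical two-state chain: state 0 -> 1 at rate 2, state 1 -> 0 at rate 1.
Q = [[-2.0, 2.0],
     [1.0, -1.0]]
pi = stationary_distribution(Q)
print(pi)
```

For this chain the closed form is π = (1/3, 2/3), which gives a quick sanity check on the solver.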

  29. SB Example [Figure: state-transition diagram for the SB example with transition rates r1, r2, …; the states and rate expressions are garbled in this transcript and are not reproduced.]

  30. Results [Panels: Optimal PB Frequency; Optimal Spin Time for SB]

  31. Results: Optimal Quantum Length [Panels: CPU Intensive; Comm Intensive; I/O Intensive]

  32. Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2

  33. Contribution 3: Scheduling Multiple Classes of Applications • interactive • real time • batch

  34. Objective BE RT How long did it take me to finish?? Response time How many deadlines have been missed? Miss rate cluster

  35. Fairness Ratio (x:y) • The cluster resource is divided into shares x/(x+y) and y/(x+y)
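The arithmetic behind a fairness ratio x:y is small enough to work through directly. The helper below is a sketch with an assumed name; it uses exact fractions so that the two shares always sum to one.

```python
from fractions import Fraction


def fairness_shares(x: int, y: int):
    """Split the cluster resource according to the ratio x:y.

    Returns (x/(x+y), y/(x+y)) as exact fractions.
    """
    total = x + y
    return Fraction(x, total), Fraction(y, total)


# The ratios 2:1, 1:9, and 9:1 appear later in the deck.
for x, y in [(2, 1), (1, 9), (9, 1)]:
    first, second = fairness_shares(x, y)
    print(f"{x}:{y} -> {first} and {second}")
```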

  36. How to Adhere to Fairness Ratio? [Figure: timelines on P0 and P1 for three schemes, 1GS, 2DCS-TDM, and 2DCS-PS, with x:y = 2:1; labels RT, RT1, RT2, and BE]

  37. BE Response Time [Panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]

  38. RT Deadline Miss Rate [Panels: RT : BE = 1:9; RT : BE = 2:1; RT : BE = 9:1]

  39. Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Characterizing decision support workloads on the clustered database server • Resource management for transaction processing workloads on the clustered database server

  40. Experiment Setup • IBM DB2 Universal Database for Linux, EEE, Version 7.2 • 8 dual node Linux/Pentium cluster with 256 MB RAM and 18 GB disk on each node • TPC-H workload. Queries are run sequentially (Q1 – Q20), and the completion time of each query is measured.

  41. Platform [Figure: a client sends "Select * from T" to the coordinator node; the rows of table T (A 001, B 002, C 003, D 004) are shown on the server nodes, which are connected by Myrinet]

  42. Methodology • Identify the components with high system overhead. • For each such component, characterize the request distribution. • Come up with ways of optimization. • Quantify potential benefits from the optimization.

  43. Sampling OS Statistics • Sample the statistics provided by stat, net/dev, process/stat. • User/system CPU % • # of page faults • # of blocks read/written • # of reads/writes • # of packets sent/received • CPU utilization during I/O
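User and system CPU percentages are typically derived from two samples of the aggregate "cpu" line in Linux's /proc/stat, whose leading fields are cumulative user, nice, system, and idle times in clock ticks. The sketch below parses two such lines and computes the percentages over the interval. The function names and sample lines are made up, and this is not claimed to be the sampling tool used in the study.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(value) for value in fields[1:9]]  # user .. steal, if present
    user, nice, system = ticks[0], ticks[1], ticks[2]
    return {
        "user": user + nice,   # all user-mode time, niced or not
        "system": system,
        "total": sum(ticks),   # guest fields are skipped: already in user/nice
    }


def cpu_percentages(before_line, after_line):
    """Return (user %, system %) for the interval between two samples."""
    before, after = parse_cpu_line(before_line), parse_cpu_line(after_line)
    elapsed = after["total"] - before["total"]
    if elapsed <= 0:
        raise ValueError("samples must be taken in increasing time order")
    user_pct = 100.0 * (after["user"] - before["user"]) / elapsed
    system_pct = 100.0 * (after["system"] - before["system"]) / elapsed
    return user_pct, system_pct


# Made-up samples; on a live Linux system each would be the first line of /proc/stat.
sample_1 = "cpu  100 0 50 850 0 0 0 0 0 0"
sample_2 = "cpu  160 0 70 970 0 0 0 0 0 0"
percentages = cpu_percentages(sample_1, sample_2)
print(percentages)
```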

  44. Kernel Instrumentation • Instrument each system call in the kernel. [Figure: timeline of one system call with the events enter system call, block, unblock, resume execution, and exit system call marked]

  45. Operating System Profile • A considerable part of the execution time is taken by the pread system call. • There is good overlap of computation with I/O for some queries. • There are more reads than writes.

  46. TPC-H pread Overhead • pread overhead = (# of preads) × (overhead per pread)

  47. pread Optimization

      pread(dest, chunk) {
          for each page in the chunk {
              if the page is not in cache {
                  bring it in from disk
              }
              copy the page into dest
          }
      }

      • Optimization: • Re-mapping the buffer • Copy on write [Figure labels: page table, user space, page cache, steps 1 and 2; annotation "30s"]
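The difference between copying pages out of the page cache and mapping them can be seen from user space with existing system interfaces. The sketch below is only an analogy for the kernel-level optimization on this slide, not an implementation of it: `os.pread` copies file data into a new buffer, while a private mapping (`mmap.ACCESS_COPY`) shares pages until the first write and then copies on write. It assumes a Unix-like system; the file contents and function name are made up.

```python
import mmap
import os
import tempfile


def pread_vs_private_mapping():
    """Contrast a copying read with a copy-on-write mapping of the same file."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"page-data" * 100)
        path = handle.name

    fd = os.open(path, os.O_RDONLY)
    try:
        # Copying read: the data is copied into a fresh bytes object.
        copied = os.pread(fd, 9, 0)

        # Private mapping: nothing is copied up front; a write touches only
        # this process's private copy of the page, never the file.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_COPY) as mapping:
            mapped = bytes(mapping[:9])
            mapping[:4] = b"PAGE"          # triggers copy-on-write
            private = bytes(mapping[:9])

        on_disk = os.pread(fd, 9, 0)       # the file itself is unchanged
    finally:
        os.close(fd)
        os.remove(path)
    return copied, mapped, private, on_disk


copied, mapped, private, on_disk = pread_vs_private_mapping()
print(copied, mapped, private, on_disk)
```

If most buffers are only read after being filled, the copy is deferred indefinitely, which is the case the copy-on-write optimization targets.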

  48. Copy-on-write • % reduction = 1 − (# of copy-on-writes / # of preads) [Figure: user-space buffer mapped read-only onto the page cache]

  49. Operating System Profile • Socket calls are the next dominant system calls.

  50. Message Characteristics [Panels for Q11 and Q16: Message Size (bytes); Message Inter-injection Time (milliseconds); Message Destination]
