Scheduling and Resource Management for Next-generation Clusters
Presentation Transcript

  1. Scheduling and Resource Management for Next-generation Clusters
     Yanyong Zhang, Penn State University, www.cse.psu.edu/~yyzhang

  2. What is a Cluster?
     • Cost effective
     • Easily scalable
     • Highly available
     • Readily upgradeable

  3. Scientific & Engineering Applications
     • HPTi won a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)
     • Sandia's expansion of their Alpha-based C-plant system.
     • Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
     • A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html)
     • The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)

  4. Commercial Applications
     • Business applications
       • Transaction processing (IBM DB2, Oracle, …)
       • Decision support systems (IBM DB2, Oracle, …)
     • Internet applications
       • Web serving / searching (Google.com, …)
       • Infowares (Yahoo.com, AOL.com)
       • Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
       • Computing portals

  5. Resource Management
     • Each application is demanding.
     • Several applications/users can be present at the same time.
     Resource management and quality of service become important.

  6. System Model
     [Figure: nodes P0 through P4 connected by a high-speed network, fed by an arrival queue]
     • Each node is independent
     • Each node has a maximum multiprogramming level (MPL)
     • Jobs wait in an arrival queue
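To keep this model concrete for later slides, here is a minimal Python sketch of the system model as just described (independent nodes, a per-node maximum MPL, and a shared arrival queue). The class and field names are invented for illustration and are not from the presentation.

    from collections import deque
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        """One independent cluster node with a bounded multiprogramming level (MPL)."""
        node_id: int
        max_mpl: int                               # maximum number of co-resident jobs
        running: List[str] = field(default_factory=list)

        def has_slot(self) -> bool:
            return len(self.running) < self.max_mpl

    @dataclass
    class Cluster:
        """Independent nodes connected by a high-speed network, fed by one arrival queue."""
        nodes: List[Node]
        arrival_queue: deque = field(default_factory=deque)

        def admit(self, job: str, node_ids: List[int]) -> bool:
            """Start the job if every requested node has a free MPL slot,
            otherwise leave it waiting in the arrival queue."""
            chosen = [self.nodes[i] for i in node_ids]
            if all(n.has_slot() for n in chosen):
                for n in chosen:
                    n.running.append(job)
                return True
            self.arrival_queue.append(job)
            return False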

  7. Two Phases in Resource Management
     • Allocation issues
       • Admission control
       • Arrival-queue principle
     • Scheduling issues (CPU scheduling)
       • Resource isolation
       • Co-allocation

  8. Co-allocation / Co-scheduling
     [Figure: timeline (t0 to t1) of a SEND from P0 and the matching RECV on P1 across the switch, illustrating scheduling skew between sender and receiver]

  9. Outline
     • From the OS's perspective
       • Contribution 1: boosting the CPU utilization at supercomputing centers
       • Contribution 2: providing quick responses for commercial workloads
       • Contribution 3: scheduling multiple classes of applications
     • From the application's perspective
       • Contribution 4: optimizing clustered DB2

  10. Contribution 1: Boosting CPU Utilization at Supercomputing Centers

  11. Objective
     • slowdown = Response Time / Execute Time in Isolation
     • Response Time = Wait Time + Execute Time
     • Wait Time = time waiting in the arrival queue + time waiting in the ready/blocked queue
     • Goal: minimize slowdown

  12. Existing Techniques
     • Back Filling (BF)
     • Gang Scheduling (GS)
     • Migration (M)
     [Figure: space-time diagram of jobs of different widths packed onto a cluster with 14 CPUs]

  13. Proposed Scheme
     • MBGS = GS + BF + M
     • Use GS as the basic framework
     • At each row of the GS matrix, apply the BF technique (a sketch follows this slide)
     • Whenever the GS matrix is re-calculated, consider migration
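To make the combination concrete, the following Python sketch backfills waiting jobs into the idle CPUs of each gang-scheduling row, which is the structural core of MBGS as stated above. It is a simplified, reservation-free first-fit backfill under an assumed fixed CPU count, not the scheme's actual implementation; migration is omitted and all names are illustrative.

    TOTAL_CPUS = 14   # cluster size from the earlier example; illustrative only

    def backfill_row(row, waiting_jobs):
        """Fill the idle CPUs of one gang-scheduling row with waiting jobs.

        `row` holds (job_id, cpus) pairs already assigned to this time slice;
        `waiting_jobs` is the arrival queue as (job_id, cpus) pairs in FCFS order.
        A waiting job is backfilled if it fits in the CPUs the row leaves idle.
        """
        used = sum(cpus for _, cpus in row)
        still_waiting = []
        for job_id, cpus in waiting_jobs:
            if used + cpus <= TOTAL_CPUS:
                row.append((job_id, cpus))     # backfill into this row's idle CPUs
                used += cpus
            else:
                still_waiting.append((job_id, cpus))
        return still_waiting

    def mbgs_pass(matrix, waiting_jobs):
        """MBGS-style pass: treat each row of the GS matrix as one time slice
        and apply backfilling row by row (migration is not modeled here)."""
        for row in matrix:
            waiting_jobs = backfill_row(row, waiting_jobs)
        return matrix, waiting_jobs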

  14. How Does MBGS Perform?

  15. Outline
     • From the OS's perspective
       • Contribution 1: boosting the CPU utilization at supercomputing centers
       • Contribution 2: providing quick responses for commercial workloads
       • Contribution 3: scheduling multiple classes of applications
     • From the application's perspective
       • Contribution 4: optimizing clustered DB2

  16. Contribution 2: Reducing Response Times for Commercial Applications

  17. Objective
     • Response Time = Wait Time + Execute Time (waiting in the arrival queue and in the ready/blocked queue)
     • Minimize wait time
     • Minimize response time

  18. Previous Work I: Gang Scheduling (GS)
     • (1) The time scale is MINUTES, and CPU cycles are wasted.
     • (2) GS is not responsive enough.

  19. Previous Work II: Dynamic Co-scheduling
     [Figure: four nodes P0 through P3 running jobs A, B, C, D; it is A's turn on P0, C just finishes I/O, B just gets a message, and everybody else is blocked]
     The scheduler on each node makes independent decisions based on local events, without global synchronization.

  20. Dynamic Co-scheduling Heuristics
     (rows: what you do on message arrival; columns: how you wait for a message)

                                 Busy Wait   Spin Block   Spin Yield
     No Explicit Reschedule      Local       SB           SY
     Interrupt & Reschedule      DCS         DCS-SB       DCS-SY
     Periodically Reschedule     PB          PB-SB        PB-SY
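As an illustration of the "Spin Block" column, here is a hedged Python sketch of a spin-then-block wait: poll for the message for a bounded time, then yield the CPU until the arrival path wakes the process. The polling callable, the event object, and the spin limit are stand-ins invented for the example, not the real communication-layer interfaces.

    import threading
    import time

    def spin_block_wait(message_arrived, wakeup, spin_limit_us=200):
        """Spin Block (SB): poll for the message for a bounded time, then block.

        `message_arrived` is a zero-argument callable that polls for the message
        (a stand-in for checking the NIC's receive queue); `wakeup` is a
        threading.Event that the interrupt/boost path sets on message arrival.
        """
        deadline = time.monotonic() + spin_limit_us / 1e6
        while time.monotonic() < deadline:       # spin phase: stay on the CPU
            if message_arrived():
                return True
        wakeup.wait()                            # block phase: yield the CPU
        return message_arrived()

    # Example wiring: the interrupt or periodic-boost path would call
    # wakeup.set() when a message arrives for this process.
    wakeup = threading.Event()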

  21. Simulation Study
     • A detailed simulator at a microsecond granularity
     • System parameters
       • System configuration (maximum MPL, to partition or not)
       • System overheads (context-switch overhead, interrupt costs, costs of manipulating queues)

  22. Simulation Study (Cont'd)
     • Application parameters
       • Injection load
       • Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)

  23. Impact of Load

  24. Impact of Workload Characteristics
     [Plots: communication-intensive and I/O-intensive workloads]

  25. Periodic Boost Heuristics
     Process states:
     • S1: compute phase
     • S2: S1 + unconsumed message
     • S3: in a receive, message has arrived
     • S4: in a receive, no message
     Boost orderings (a sketch of ordering B follows this slide):
     • A: S3 -> {S2, S1}
     • B: S3 -> S2 -> S1
     • C: {S3, S2, S1}
     • D: {S3, S2} -> S1
     • E: S2 -> S3 -> S1
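As a small illustration of how such an ordering might be applied, the sketch below picks which co-resident process a periodic-boost tick would favor under ordering B (S3, then S2, then S1). The state encoding and function names are invented for the example and are not the presentation's implementation.

    # Hypothetical state labels matching the slide: S1 compute, S2 compute with an
    # unconsumed message, S3 in a receive with the message arrived, S4 in a
    # receive with no message.
    BOOST_ORDER_B = ["S3", "S2", "S1"]      # ordering B from the slide

    def pick_process_to_boost(processes):
        """Return the process a periodic-boost tick should favor under ordering B.

        `processes` is a list of (pid, state) pairs for the jobs co-resident on
        the node; S4 processes are never boosted since they have nothing to consume.
        """
        for wanted in BOOST_ORDER_B:
            for pid, state in processes:
                if state == wanted:
                    return pid
        return None   # nothing worth boosting; leave the current schedule alone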

  26. Analytical Modeling Study
     [Figure: nodes P0 through Pp connected by a high-speed network, with dynamic job arrivals]
     • The state space is impossible to handle.

  27. Analysis Description
     • Original state space (impossible to handle!): tracks the number and state of the jobs on every node k jointly.
     • Reduced state space (much more tractable!).
     • Assumption: the state of each processor is stochastically independent and identical to the state of the other processors.
     [Formal state-space notation not recoverable from this transcript]

  28. Analysis Description (Cont'd)
     • Express the state transition rates with a continuous-time Markov model; build the generator matrix Q.
     • Get the invariant probability vector π by solving πQ = 0 and πe = 1 (a numerical sketch follows this slide).
     • Use fixed-point iteration to get the solution.
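For readers who want to see the linear-algebra step concretely, the following Python/NumPy sketch solves πQ = 0 with πe = 1 for a small, made-up generator matrix; the fixed-point iteration mentioned on the slide is not shown.

    import numpy as np

    def stationary_distribution(Q):
        """Solve pi Q = 0 subject to pi e = 1 for a CTMC generator matrix Q.

        One balance equation is replaced with the normalization constraint,
        which is the standard way to make the linear system non-singular."""
        n = Q.shape[0]
        A = Q.T.copy()
        A[-1, :] = 1.0                 # overwrite the last equation with sum(pi) = 1
        b = np.zeros(n)
        b[-1] = 1.0
        return np.linalg.solve(A, b)

    # Toy 3-state generator (rows sum to zero); the values are illustrative only.
    Q = np.array([[-0.5,  0.3,  0.2],
                  [ 0.4, -0.9,  0.5],
                  [ 0.1,  0.6, -0.7]])
    pi = stationary_distribution(Q)
    print(pi, pi @ Q)                  # pi @ Q should be numerically zero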

  29. Example (SB)
     [Figure: per-node state-transition diagram for the SB heuristic, with transition rates r1, r2, … computed from the state probabilities of the other processors; details not recoverable from this transcript]

  30. Results
     [Plots: optimal PB frequency; optimal spin time for SB]

  31. Results – Optimal Quantum Length
     [Plots: CPU-intensive, communication-intensive, and I/O-intensive workloads]

  32. Outline
     • From the OS's perspective
       • Contribution 1: boosting the CPU utilization at supercomputing centers
       • Contribution 2: providing quick responses for commercial workloads
       • Contribution 3: scheduling multiple classes of applications
     • From the application's perspective
       • Contribution 4: optimizing clustered DB2

  33. Contribution 3: Scheduling Multiple Classes of Applications
     • interactive
     • real-time
     • batch

  34. Objective
     [Figure: a cluster serving both RT and BE jobs]
     • BE (best-effort) jobs: response time ("How long did it take me to finish?")
     • RT (real-time) jobs: deadline miss rate ("How many deadlines have been missed?")

  35. Fairness Ratio (x:y)
     • The RT class receives x/(x+y) of the cluster resource and the BE class receives y/(x+y).

  36. How to Adhere to the Fairness Ratio?
     [Figure: time-sharing of RT and BE jobs on P0 and P1 under 1) GS, 2) DCS-TDM, and 2) DCS-PS, for x:y = 2:1; a sketch of TDM-style slicing follows this slide]
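As a toy illustration of time-division multiplexing against a fairness ratio, the sketch below hands out scheduling quanta to the RT and BE classes in an x:y pattern. It is a simplification invented for this transcript, not one of the evaluated schemes (GS, DCS-TDM, DCS-PS).

    from itertools import cycle

    def tdm_class_sequence(x, y):
        """Yield 'RT'/'BE' labels so that, over every x+y quanta, the RT class
        gets x quanta and the BE class gets y (the fairness ratio x:y)."""
        pattern = ["RT"] * x + ["BE"] * y
        return cycle(pattern)

    # Example: x:y = 2:1 gives the repeating quantum pattern RT, RT, BE, ...
    seq = tdm_class_sequence(2, 1)
    print([next(seq) for _ in range(6)])   # ['RT', 'RT', 'BE', 'RT', 'RT', 'BE']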

  37. BE Response Time
     [Plots for RT:BE = 2:1, 1:9, and 9:1]

  38. RT Deadline Miss Rate
     [Plots for RT:BE = 1:9, 2:1, and 9:1]

  39. Outline
     • From the OS's perspective
       • Contribution 1: boosting the CPU utilization at supercomputing centers
       • Contribution 2: providing quick responses for commercial workloads
       • Contribution 3: scheduling multiple classes of applications
     • From the application's perspective
       • Characterizing decision support workloads on the clustered database server
       • Resource management for transaction processing workloads on the clustered database server

  40. Experiment Setup
     • IBM DB2 Universal Database for Linux, EEE, Version 7.2
     • An 8-node Linux/Pentium cluster of dual-processor nodes, each with 256 MB RAM and an 18 GB disk
     • TPC-H workload; the queries (Q1 – Q20) are run sequentially, and the completion time of each query is measured

  41. Platform
     [Figure: a client submits "Select * from T" to the coordinator node; table T is partitioned across server nodes connected by Myrinet, each node holding a subset of the rows (A 001, B 002, C 003, D 004)]

  42. Methodology
     • Identify the components with high system overhead.
     • For each such component, characterize the request distribution.
     • Come up with ways to optimize it.
     • Quantify the potential benefits of the optimization.

  43. Sampling OS Statistics
     • Sample the statistics provided by /proc/stat, /proc/net/dev, and the per-process stat files (a sampling sketch follows this slide):
       • User/system CPU %
       • # of page faults
       • # of blocks read/written
       • # of reads/writes
       • # of packets sent/received
       • CPU utilization during I/O
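A minimal Python sketch of this kind of sampling on a Linux node is shown below; it reads two of the counters named on the slide (/proc/stat CPU jiffies and /proc/net/dev packet counts) at a one-second interval. The interface name, the interval, and the choice of counters are assumptions made for the example.

    import time

    def read_cpu_jiffies():
        """Return (user+nice, system, idle) jiffies from the first line of /proc/stat."""
        with open("/proc/stat") as f:
            fields = f.readline().split()          # 'cpu user nice system idle ...'
        user, nice, system, idle = map(int, fields[1:5])
        return user + nice, system, idle

    def read_net_packets(iface="eth0"):
        """Return (rx_packets, tx_packets) for one interface from /proc/net/dev."""
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    cols = line.split(":", 1)[1].split()
                    return int(cols[1]), int(cols[9])   # rx packets, tx packets
        return 0, 0

    # Sample every second and print the deltas.
    prev_cpu, prev_net = read_cpu_jiffies(), read_net_packets()
    for _ in range(5):
        time.sleep(1)
        cpu, net = read_cpu_jiffies(), read_net_packets()
        print("cpu delta:", [c - p for c, p in zip(cpu, prev_cpu)],
              "packet delta:", [n - p for n, p in zip(net, prev_net)])
        prev_cpu, prev_net = cpu, net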

  44. Kernel Instrumentation
     • Instrument each system call in the kernel.
     [Figure: instrumentation points at system-call entry and exit, and at the block, unblock, and resume-execution events in between]

  45. Operating System Profile
     • A considerable part of the execution time is spent in the pread system call.
     • There is good overlap of computation with I/O for some queries.
     • There are more reads than writes.

  46. TPC-H pread Overhead
     pread overhead = (# of preads) × (overhead per pread)

  47. pread Optimization
     [Figure: pread copies data from the page cache (1) into the user-space buffer via the page table (2)]
     pread(dest, chunk) {
         for each page in the chunk {
             if the page is not in the page cache {
                 bring it in from disk
             }
             copy the page into dest
         }
     }
     • Optimizations (a user-level analogue follows this slide):
       • Re-mapping the buffer
       • Copy-on-write
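As a user-level analogue of why re-mapping beats copying, the sketch below reads the same file once with pread (the kernel copies each page into a private buffer) and once through mmap (the page-cache pages are mapped read-only into the address space, so no per-page copy is needed). This only illustrates the idea; it is not the kernel modification discussed on the slide, and the file path is hypothetical.

    import mmap
    import os

    PATH = "/tmp/tpch_chunk.bin"       # hypothetical file; assumed to be at least CHUNK bytes
    CHUNK = 256 * 1024

    fd = os.open(PATH, os.O_RDONLY)

    # pread-style access: the kernel copies CHUNK bytes from the page cache
    # into this private user-space buffer.
    buf = os.pread(fd, CHUNK, 0)
    checksum_copy = sum(buf[::4096])

    # mmap-style access: the same page-cache pages are mapped read-only,
    # so the process touches them without an extra copy.
    m = mmap.mmap(fd, CHUNK, prot=mmap.PROT_READ)
    checksum_map = sum(m[i] for i in range(0, CHUNK, 4096))
    m.close()
    os.close(fd)

    print(checksum_copy == checksum_map)   # both paths see the same data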

  48. Copy-on-Write
     [Figure: page-cache pages mapped read-only into user space]
     % reduction = 1 - (# of copy-on-writes) / (# of preads)

  49. Operating System Profile
     • Socket calls are the next most dominant system calls.

  50. Message Characteristics
     [Plots for Q11 and Q16: message size (bytes), message inter-injection time (milliseconds), and message destination]