
Meta scheduler with AppLeS Local Schedulers


Presentation Transcript


  1. Meta scheduler with AppLeS Local Schedulers Sathish Vadhiyar

  2. Goals • Overcome the deficiencies of using plain AppLeS agents • Provide global scheduling policies • Resolve competing resource claims of applications • Improve the response times of individual applications • Adapt to load dynamics • Work done as part of the GrADS project • Grid Application Development Software • A collaboration between several universities

  3. Initial GrADS Architecture – ScaLAPACK LU factorization demo [Diagram: the User supplies matrix size and block size to the Grid Routine / Application Manager, which consults the Resource Selector (backed by MDS and NWS for resource and problem characteristics) and the Performance Modeler to produce the final schedule – a subset of resources]

  4. Performance Modeler The Grid Routine / Application Manager passes all resources and problem parameters to the Performance Modeler; its scheduling heuristic passes on only those candidate schedules that have "sufficient" memory – determined by calling a function in the simulation model – and returns the final schedule, a subset of resources. [Diagram: inside the Performance Modeler, the Scheduling Heuristic exchanges candidate resources and execution costs with the Simulation Model]

  5. Simulation Model • Simulates the ScaLAPACK right-looking LU factorization • More about the application • Iterative – each iteration corresponds to a block • A parallel application in which columns are block-cyclically distributed • Right-looking LU is based on Gaussian elimination

  6. Gaussian Elimination – Review For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
  for j = i+1 to n                 /* for each row j below row i */
    A(j, i) = A(j, i) / A(i, i)
    for k = i+1 to n
      A(j, k) = A(j, k) - A(j, i) * A(i, k)

What GE really computes:

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) / A(i, i)
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

These are BLAS1 and BLAS2 operations; the finished multipliers A(i+1:n, i) are stored in place. [Diagram: at step i, the pivot A(i,i), the active column A(i+1:n, i), and the trailing submatrix A(i+1:n, i+1:n)]
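The point-wise algorithm above can be written out in runnable form. This is a minimal Python illustration of the review (no pivoting, multipliers stored in place) – an illustration only, not the ScaLAPACK code:

```python
def gaussian_elimination(A):
    """In-place LU factorization without pivoting.

    Afterwards the upper triangle of A holds U and the strictly
    lower triangle holds the multipliers (unit lower triangle of L).
    """
    n = len(A)
    for i in range(n - 1):
        for j in range(i + 1, n):          # each row j below row i
            A[j][i] = A[j][i] / A[i][i]    # multiplier (BLAS1 scaling)
            for k in range(i + 1, n):      # rank-1 update (BLAS2)
                A[j][k] = A[j][k] - A[j][i] * A[i][k]
    return A
```

For example, `gaussian_elimination([[4.0, 3.0], [6.0, 3.0]])` leaves U = [[4, 3], [0, -1.5]] with the multiplier 1.5 stored below the diagonal.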

  7. Need for blocking – BLAS • Basic Linear Algebra Subprograms • The memory hierarchy is exploited more efficiently by higher-level BLAS • 3 levels: Level 1 (vector-vector), Level 2 (matrix-vector), Level 3 (matrix-matrix)

  8. Converting BLAS2 to BLAS3 • Use blocking for optimized matrix-multiplies (BLAS3) • Matrix multiplies by delayed updates • Save several updates to trailing matrices • Apply several updates in the form of matrix multiply

  9. Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)

for ib = 1 to n-1 step b          /* process matrix b columns at a time */
  end = ib + b - 1
  /* Apply the BLAS2 version of GE to factor A(ib:n, ib:end).
     Let LL denote the strictly lower triangular portion of
     A(ib:end, ib:end) plus the identity (unit lower triangular) */
  A(ib:end, end+1:n) = LL⁻¹ A(ib:end, end+1:n)      /* update next b rows of U */
  A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                  /* apply delayed updates with a single matrix multiply */

[Diagram: the completed parts of L and U, the b-column panel A(ib:n, ib:end), and the trailing submatrix A(end+1:n, end+1:n)]
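A runnable sketch of the blocked algorithm, in plain Python for clarity. The three phases – panel factorization, triangular solve for the U rows, delayed update – correspond to the steps above; this is an illustration, not Dongarra's code:

```python
def lu_blocked(A, b):
    """Blocked in-place LU without pivoting: process b columns at a time."""
    n = len(A)
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)               # one past the panel's last column
        # 1. BLAS2-style factorization of the panel A[ib:n, ib:end]
        for i in range(ib, end):
            for j in range(i + 1, n):
                A[j][i] /= A[i][i]
                for k in range(i + 1, end):
                    A[j][k] -= A[j][i] * A[i][k]
        # 2. Update the next b rows of U: forward-solve LL * X = A[ib:end, end:n]
        for col in range(end, n):
            for i in range(ib, end):
                for r in range(ib, i):
                    A[i][col] -= A[i][r] * A[r][col]
        # 3. Delayed updates applied as a single matrix multiply (BLAS3):
        #    A[end:n, end:n] -= A[end:n, ib:end] * A[ib:end, end:n]
        for j in range(end, n):
            for k in range(end, n):
                s = 0.0
                for r in range(ib, end):
                    s += A[j][r] * A[r][k]
                A[j][k] -= s
    return A
```

The result matches the unblocked factorization: for example, `lu_blocked([[2,1,1],[4,3,3],[8,7,9]], 2)` produces U = [[2,1,1],[0,1,1],[0,0,2]] with the L multipliers stored below the diagonal.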

  10. Operations So, in each iteration the LU application involves: • Block factorization of the panel A(ib:n, ib:end) • Broadcast of the factored panel for the multiply – message size approximately n * block_size elements • Each process performs its own multiply: the remaining columns are divided among the processors

  11. Back to the simulation model

double getExecTimeCost(int matrix_size, int block_size, candidate_schedule) {
  for (i = 0; i < number_of_blocks; i++) {
    /* Find the processor owning this block column; note its speed
       and its connections to the other processors. */
    tfact += ...;     /* simulate block factorization: depends on
                         processor speed, machine load, and the flop
                         count of the factorization */
    tbcast += max(broadcast time over all processors);
                      /* ScaLAPACK uses a split-ring broadcast; simulate
                         the broadcast algorithm for each processor:
                         depends on the number of matrix elements broadcast,
                         connection bandwidth, and latency */
    tupdate += max(matrix-multiply time over all processors);
                      /* depends on the flop count of the matrix multiply,
                         processor speed, and load */
  }
  return (tfact + tbcast + tupdate);
}
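A toy, runnable version of such a cost model may make the structure concrete. Every formula and parameter here (flop counts, ring-step cost, the `speeds` list) is an assumption chosen for illustration, not the actual GrADS simulation model:

```python
def lu_exec_time_cost(n, nb, speeds, bandwidth, latency):
    """Toy per-iteration cost model in the spirit of getExecTimeCost.

    speeds    -- effective flop rate of each processor (flops/s)
    bandwidth -- link bandwidth in bytes/s; latency in seconds
    """
    p = len(speeds)
    tfact = tbcast = tupdate = 0.0
    for i in range(n // nb):                 # one iteration per block column
        rows = n - i * nb                    # trailing-matrix dimension
        owner = speeds[i % p]                # block-cyclic column owner
        # Panel factorization on the owning processor
        tfact += (rows * nb * nb) / owner
        # One broadcast step of the factored panel (8-byte doubles)
        tbcast += latency + (rows * nb * 8) / bandwidth
        # Each processor multiplies its share of the remaining columns;
        # the slowest processor determines the step time
        cols_each = (rows - nb) / p
        tupdate += max(2.0 * (rows - nb) * nb * cols_each / s for s in speeds)
    return tfact + tbcast + tupdate
```

As expected of an execution-time estimate, the cost grows with the matrix size for a fixed machine configuration.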

  12. Initial GrADS Architecture – ScaLAPACK LU factorization demo [Diagram: as before, the User supplies matrix size and block size to the Grid Routine / Application Manager, which uses the Resource Selector (MDS, NWS) and Performance Modeler; Contract Development asks "Is this final schedule OK?", then the App Launcher starts the Application with the problem parameters, application location, and final schedule, and the Contract Monitor watches the running Application]

  13. Contract Monitor Architecture • Uses the Autopilot (UIUC) infrastructure • The application is instrumented with calls to Autopilot • For registering with the Autopilot manager • For forking off sensor threads • For passing variable information to the sensors • A sensor client (in this case, the contract monitor) can connect to the sensors and gather variable information

  14. Contract Monitor Architecture [Diagram: the Application registers with the Autopilot Manager and forks off Sensors; the Contract Monitor obtains sensor information from the manager and connects to the sensors to obtain information about a variable x]

  15. GrADS Project – GrADS Testbed [Map: sites in the GrADS project testbed]

  16. Performance Model Evaluation

  17. GrADS Benefits – Examples [Chart: runs on the combined MSC & TORC cluster (8 mscs + 8 torcs, 8 mscs + 8 torcs, 8 mscs + 7 torcs) and on the MSC cluster alone (8, 7, and 5 machines)]

  18. GrADS Limitations Needed: A Metascheduler that has global knowledge of all applications !

  19. Metascheduler • To ensure that applications are scheduled based on correct resource information • To accommodate as many new applications as possible • To improve the performance contract of new applications • To minimize the impact of new applications on executing applications • To employ policies to migrate executing applications

  20. Metascheduler Components The metascheduler receives requests from applications for permission to execute on the Grid, application-level schedules from the applications, and requests for migration. Its components: • Permission Service – decisions based on resource capacities; can stop an executing resource-consuming application • Contract Negotiator – can accept or reject contracts; acts as queue manager; ensures scheduling based on correct information; improves performance contracts; minimizes impact • Database Manager – storing and retrieval of the states of the applications • Rescheduler – reschedules executing applications, to escape from heavy load or to use free resources

  21. Modified GrADS Architecture [Diagram: the earlier architecture – User, Grid Routine / Application Manager, Resource Selector (MDS, NWS), Performance Modeler, Contract Developer, App Launcher, Application, Contract Monitor – extended with the metascheduler components: Permission Service, Database Manager, Contract Negotiator, Rescheduler, and the RSS]

  22. Database Manager • A persistent service listening for requests from the clients • Maintains global clock • Has event notification capabilities – clients can express their interests in various events. • Stores various information: • Application’s states • Initial machines • Resource information • Final schedule • Location of various daemons • Average number of contract violations • Actual performance of application • Various times

  23. Database Manager (Contd…) When an application stops or completes, the database manager calculates the percentage completion of the application, using: • time_diff = current_time – time when the application instance started • avg_ratio = average of (actual costs / predicted costs)
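The slide lists the two ingredients but not how they combine. A plausible reconstruction (the combination itself is an assumption, not taken from the slides) scales the predicted total time by the observed slowdown and compares it against the elapsed time:

```python
def percentage_completion(current_time, start_time,
                          actual_costs, predicted_costs,
                          predicted_total_time):
    """Estimate how far along an application was when it stopped."""
    time_diff = current_time - start_time
    # Observed slowdown: average of per-phase actual / predicted costs
    avg_ratio = sum(a / p for a, p in zip(actual_costs, predicted_costs)) \
                / len(actual_costs)
    # Compare elapsed time against the slowdown-adjusted total prediction
    return min(100.0, 100.0 * time_diff / (avg_ratio * predicted_total_time))
```

For instance, an application predicted to take 100 s that has run for 150 s at half speed (avg_ratio = 2.0) is estimated to be 75% complete.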

  24. Permission Service • After collecting resource information from NWS, the GrADS applications contact the Permission Service (PS) • The PS makes decisions based on the problem requirements and the resource characteristics • If the resources have enough capacity, permission is given • If not, the permission service either • Waits for resource-consuming applications that will end soon, or • Preempts resource-consuming applications to accommodate short applications

  25. Permission Service (Pseudo code)

  26. Permission Service (Pseudo code)
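Slides 25–26 carried the pseudo code, which did not survive transcription. The following Python sketch reconstructs the decision logic described on slide 24; the `RunningApp` record, the threshold value, and all names are assumptions made for illustration:

```python
from dataclasses import dataclass

SOON_THRESHOLD = 60.0  # seconds; hypothetical cutoff for "ending soon"

@dataclass
class RunningApp:
    name: str
    remaining_time: float    # estimated seconds until completion
    resource_usage: float    # fraction of grid capacity consumed

def permission_decision(required, available, running_apps):
    """Grant, wait, or preempt, per the Permission Service description."""
    # Enough free capacity on every requested resource: admit immediately
    if all(available.get(r, 0) >= need for r, need in required.items()):
        return ("GRANT", None)
    if not running_apps:
        return ("REJECT", None)
    # A resource-consuming application will end soon: ask the new app to wait
    soonest = min(running_apps, key=lambda a: a.remaining_time)
    if soonest.remaining_time <= SOON_THRESHOLD:
        return ("WAIT", soonest.name)
    # Otherwise preempt the most resource-consuming application
    victim = max(running_apps, key=lambda a: a.resource_usage)
    return ("PREEMPT", victim.name)
```

With one heavy application 30 s from completion, a large request waits; if the heavy application still has 500 s to run, it is preempted instead.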

  27. Permission Service – determining resource consuming applications • For each currently executing GrADS app., contact DBM, obtain NWS resource information. • Determine change of resources caused by app. i • Add the change to current resource characteristics to obtain resource parameters in the absence of app. i

  28. Determining remaining execution time • Whenever a metascheduler component wants to determine the remaining execution time of an application, it contacts the application's contract monitor • Retrieves the average of the ratios between actual times and predicted times • Uses {average ratio, predicted time, percentage completion} to determine the remaining execution time

  29. Determining remaining execution time (pseudo code)

  30. Determining remaining execution time (pseudo code)
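Slides 29–30 held the pseudo code; based on the three quantities listed on slide 28, a plausible reconstruction (assumed, not taken from the slides) scales the prediction by the observed slowdown and takes the uncompleted fraction:

```python
def remaining_execution_time(avg_ratio, predicted_total, pct_complete):
    """Estimate seconds left for a running application.

    avg_ratio       -- average of (actual time / predicted time) so far
    predicted_total -- predicted total execution time (seconds)
    pct_complete    -- percentage completion, 0..100
    """
    adjusted_total = avg_ratio * predicted_total   # slowdown-adjusted prediction
    return adjusted_total * (1.0 - pct_complete / 100.0)
```

For example, an application predicted at 100 s, running at half speed (avg_ratio = 2.0) and 75% complete, has an estimated 50 s remaining.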

  31. Contract Negotiator • Main functionalities • Ensures applications made scheduling decisions based on up-to-date resource information • Improves the performance of a current application, possibly by stopping and later continuing an executing big application • Reduces the impact caused by current applications on executing applications • When a contract is approved, the application starts using resources • When a contract is rejected, the application goes back to obtain new resource characteristics and generates a new schedule • Enforces an ordering among applications whose application-level schedules use the same resources • Approves the contract of one application • Waits for that application to start using resources • Rejects the contract of the other

  32. Contract Negotiator (Pseudo code) Ensuring app. has made scheduling decision based on correct resource information

  33. Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

  34. Contract Negotiator – 3 scenarios • t1 – average completion time of the current app. and the big app. when the big app. is preempted, the current app. is accommodated, and the big app. is then continued • t2 – average completion time of the current app. and the big app. when the big app. is allowed to complete and then the current app. is accommodated • t3 – average completion time of the current app. and the big app. when both applications are executed simultaneously

if (t1 < 25% of min(t2, t3)) → case 1
else if (t3 > 1.2 * t2) → case 2
else → case 3

  35. Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

  36. Contract Negotiator (Pseudo code) Reducing the impact of the current app. on executing app. by modifying the schedule

  37. Contract Negotiator (Pseudo code) Reducing the impact of the current app. on executing app. by modifying the schedule

  38. Application and Metascheduler Interactions [Flowchart: the User supplies problem parameters → Resource Selection produces an initial list of machines → the application requests permission from the Permission Service; if permission is denied, it gets new resource information and retries, or aborts → Application-Specific Scheduling produces an application-specific schedule → Contract Development with the Contract Negotiator; if the contract is not approved, the application gets new resource information and reschedules → Application Launching with the problem parameters and final schedule → on completion, exit; if the application was stopped, it waits for a restart signal and gets new resource information]

  39. Experiments and Results – Demonstration of Permission Service • Application 1 – LU, matrix size 13000, uses 4 opus, 1 torc and 2 cypher machines • Application 2 – LU, varying matrix sizes, uses 4 opus machines • The permission service stopped application 1 to accommodate application 2

  40. Experiments and Results – Practical Experiments • 5 applications were integrated into GrADS – ScaLAPACK LU, QR, Eigen, PETSc CG and a heat equation solver • Integration involved developing performance models and instrumenting with SRS • 50 problems with different arrival rates – a Poisson distribution with different mean arrival rates for job submission, uniform distributions for problem types and problem sizes • Various statistics were collected • Runs were performed with the metascheduler enabled or disabled

  41. Experiments and Results – Practical Experiments – Total Throughput Comparison

  42. Experiments and Results – Practical Experiments – Performance Contract Violations Contract violation: (measured time / expected time) > maximum allowed (measured time / expected time)

  43. Metascheduler Behavior – Number of metascheduling decisions

  44. References • Vadhiyar, S., Dongarra, J. and YarKhan, A. "GrADSolve – RPC for High Performance Computing on the Grid". Euro-Par 2003, 9th International Euro-Par Conference, Proceedings, Springer, LNCS 2790, pp. 394-403, August 26-29, 2003. • Vadhiyar, S. and Dongarra, J. "Metascheduler for the Grid". Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, pp. 343-351, July 2002, Edinburgh, Scotland. • Vadhiyar, S. and Dongarra, J. "GrADSolve – A Grid-based RPC System for Parallel Computing with Application-level Scheduling". Journal of Parallel and Distributed Computing, Volume 64, pp. 774-783, 2004. • Petitet, A., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K., Vadhiyar, S. "Numerical Libraries and the Grid: The GrADS Experiments with ScaLAPACK". International Journal of High Performance Computing Applications, Vol. 15, No. 4 (Winter 2001): 359-374.

  45. GrADSolve • A practical demonstration of the metascheduling system • A usable integrated framework containing the metascheduler and preemptible applications • Based on NetSolve experience • Provides many more capabilities • Separate domains for administrators, service providers and end users • Powerful data distribution strategies • First RPC system for maintaining and using execution traces

  46. GrADSolve Architecture [Diagram: Administrators add user and machine information; Service Providers / Library writers call add_problem, get_perfmodel_template and add_perfmodel against an Apache Xindice XML database, supplying a problem specification (PROBLEM qrwrapper, C FUNCTION qrwrapper(IN int N, INOUT double A[N][N]…), TYPE = parallel, CONTINUE_CAPABILITY = yes, RECONFIGURATION_CAPABILITY = yes) and performance-model stubs to fill up – areResourcesSufficient(<problem parameters>, <resource characteristics>), getExecutionTimeCost(<problem parameters>, <resource characteristics>), mapper(<problem parameters>, <resource characteristics>); End Users call gradsolve("qrwrapper", N, NB, A, B), download the execution model, stage out input data, launch the application on the GrADSolve resources, and stage in output data; the metascheduler components (Performance Modeler, Contract Negotiator, Permission Service, Rescheduler) interact with a PostgreSQL database]

  47. Metascheduler Internals • Applications need to make at least 20% progress between preemptions • An application cannot be preempted if another application waits for its completion

  48. Experiments and Results – Demonstration of Contract Negotiator • Application 1 – LU, varying large matrix sizes on N cypher machines • Application 2 – LU, matrix size 7500 on (N+1) cypher machines, N of which were occupied by application 1 • The Contract Negotiator stopped application 1 and made all (N+1) machines available to application 2 • util_val = (performance gain for application 2) / (performance loss for application 1)

  49. Metascheduler Behavior – Performance Contract Violations
