
Meta scheduler with AppLeS Local Schedulers


Presentation Transcript


  1. Meta scheduler with AppLeS Local Schedulers Sathish Vadhiyar

  2. Goals • Overcome the deficiencies of using plain AppLeS agents • Provide global scheduling policies • Resolve competing resource claims of applications • Improve the response times of individual applications • Adapt to load dynamics • Work done as part of the GrADS project • Grid Application Development Software • A collaboration between several universities

  3. Initial GrADS Architecture – ScaLAPACK LU factorization demo [Diagram: the User supplies matrix size and block size to the Grid Routine / Application Manager, which consults the Resource Selector (backed by MDS and NWS for resource and problem characteristics) and the Performance Modeler to produce the final schedule – a subset of resources]

  4. Performance Modeler The Grid Routine / Application Manager passes all resources and problem parameters to the Performance Modeler; its scheduling heuristic passes on only those candidate schedules that have "sufficient" memory – determined by calling a function in the simulation model – and returns the final schedule, a subset of resources. [Diagram: inside the Performance Modeler, the Scheduling Heuristic exchanges candidate resources and execution costs with the Simulation Model]

  5. Simulation Model • Simulates the ScaLAPACK right-looking LU factorization • More about the application • Iterative – each iteration corresponds to a block • A parallel application in which columns are block-cyclically distributed • Right-looking LU is based on Gaussian elimination

  6. Gaussian Elimination – Review For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

for i = 1 to n-1
  for j = i+1 to n                 /* for each row j below row i */
    A(j, i) = A(j, i) / A(i, i)
    for k = i+1 to n
      A(j, k) = A(j, k) - A(j, i) * A(i, k)

What GE really computes:

for i = 1 to n-1
  A(i+1:n, i) = A(i+1:n, i) / A(i, i)
  A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

These are BLAS1 and BLAS2 operations; the finished multipliers A(i+1:n, i) are stored in place. [Diagram: at step i, the pivot A(i,i), the active column A(i+1:n, i), and the trailing submatrix A(i+1:n, i+1:n)]
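The point-wise algorithm above can be written out in runnable form. This is a minimal Python illustration of the review (no pivoting, multipliers stored in place) – an illustration only, not the ScaLAPACK code:

```python
def gaussian_elimination(A):
    """In-place LU factorization without pivoting.

    Afterwards the upper triangle of A holds U and the strictly
    lower triangle holds the multipliers (unit lower triangle of L).
    """
    n = len(A)
    for i in range(n - 1):
        for j in range(i + 1, n):          # each row j below row i
            A[j][i] = A[j][i] / A[i][i]    # multiplier (BLAS1 scaling)
            for k in range(i + 1, n):      # rank-1 update (BLAS2)
                A[j][k] = A[j][k] - A[j][i] * A[i][k]
    return A
```

For example, `gaussian_elimination([[4.0, 3.0], [6.0, 3.0]])` leaves U = [[4, 3], [0, -1.5]] with the multiplier 1.5 stored below the diagonal.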

  7. Need for blocking – BLAS • Basic Linear Algebra Subprograms • The memory hierarchy is exploited more efficiently by higher-level BLAS • 3 levels: Level 1 (vector-vector), Level 2 (matrix-vector), Level 3 (matrix-matrix)

  8. Converting BLAS2 to BLAS3 • Use blocking for optimized matrix-multiplies (BLAS3) • Matrix multiplies by delayed updates • Save several updates to trailing matrices • Apply several updates in the form of matrix multiply

  9. Modified GE using BLAS3 (Courtesy: Dr. Jack Dongarra)

for ib = 1 to n-1 step b          /* process matrix b columns at a time */
  end = ib + b - 1
  /* Apply the BLAS2 version of GE to factor A(ib:n, ib:end).
     Let LL denote the strictly lower triangular portion of
     A(ib:end, ib:end) plus the identity (unit lower triangular) */
  A(ib:end, end+1:n) = LL⁻¹ A(ib:end, end+1:n)      /* update next b rows of U */
  A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
                                  /* apply delayed updates with a single matrix multiply */

[Diagram: the completed parts of L and U, the b-column panel A(ib:n, ib:end), and the trailing submatrix A(end+1:n, end+1:n)]
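A runnable sketch of the blocked algorithm, in plain Python for clarity. The three phases – panel factorization, triangular solve for the U rows, delayed update – correspond to the steps above; this is an illustration, not Dongarra's code:

```python
def lu_blocked(A, b):
    """Blocked in-place LU without pivoting: process b columns at a time."""
    n = len(A)
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)               # one past the panel's last column
        # 1. BLAS2-style factorization of the panel A[ib:n, ib:end]
        for i in range(ib, end):
            for j in range(i + 1, n):
                A[j][i] /= A[i][i]
                for k in range(i + 1, end):
                    A[j][k] -= A[j][i] * A[i][k]
        # 2. Update the next b rows of U: forward-solve LL * X = A[ib:end, end:n]
        for col in range(end, n):
            for i in range(ib, end):
                for r in range(ib, i):
                    A[i][col] -= A[i][r] * A[r][col]
        # 3. Delayed updates applied as a single matrix multiply (BLAS3):
        #    A[end:n, end:n] -= A[end:n, ib:end] * A[ib:end, end:n]
        for j in range(end, n):
            for k in range(end, n):
                s = 0.0
                for r in range(ib, end):
                    s += A[j][r] * A[r][k]
                A[j][k] -= s
    return A
```

The result matches the unblocked factorization: for example, `lu_blocked([[2,1,1],[4,3,3],[8,7,9]], 2)` produces U = [[2,1,1],[0,1,1],[0,0,2]] with the L multipliers stored below the diagonal.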

  10. Operations So, in each iteration the LU application involves: • Block factorization of the panel A(ib:n, ib:end) • Broadcast of the factored panel for the multiply – message size approximately n * block_size elements • Each process performs its own multiply: the remaining columns are divided among the processors

  11. Back to the simulation model

double getExecTimeCost(int matrix_size, int block_size, candidate_schedule) {
  for (i = 0; i < number_of_blocks; i++) {
    /* Find the processor owning this block column; note its speed
       and its connections to the other processors. */
    tfact += ...;     /* simulate block factorization: depends on
                         processor speed, machine load, and the flop
                         count of the factorization */
    tbcast += max(broadcast time over all processors);
                      /* ScaLAPACK uses a split-ring broadcast; simulate
                         the broadcast algorithm for each processor:
                         depends on the number of matrix elements broadcast,
                         connection bandwidth, and latency */
    tupdate += max(matrix-multiply time over all processors);
                      /* depends on the flop count of the matrix multiply,
                         processor speed, and load */
  }
  return (tfact + tbcast + tupdate);
}
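A toy, runnable version of such a cost model may make the structure concrete. Every formula and parameter here (flop counts, ring-step cost, the `speeds` list) is an assumption chosen for illustration, not the actual GrADS simulation model:

```python
def lu_exec_time_cost(n, nb, speeds, bandwidth, latency):
    """Toy per-iteration cost model in the spirit of getExecTimeCost.

    speeds    -- effective flop rate of each processor (flops/s)
    bandwidth -- link bandwidth in bytes/s; latency in seconds
    """
    p = len(speeds)
    tfact = tbcast = tupdate = 0.0
    for i in range(n // nb):                 # one iteration per block column
        rows = n - i * nb                    # trailing-matrix dimension
        owner = speeds[i % p]                # block-cyclic column owner
        # Panel factorization on the owning processor
        tfact += (rows * nb * nb) / owner
        # One broadcast step of the factored panel (8-byte doubles)
        tbcast += latency + (rows * nb * 8) / bandwidth
        # Each processor multiplies its share of the remaining columns;
        # the slowest processor determines the step time
        cols_each = (rows - nb) / p
        tupdate += max(2.0 * (rows - nb) * nb * cols_each / s for s in speeds)
    return tfact + tbcast + tupdate
```

As expected of an execution-time estimate, the cost grows with the matrix size for a fixed machine configuration.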

  12. Initial GrADS Architecture – ScaLAPACK LU factorization demo [Diagram: as before, the User supplies matrix size and block size to the Grid Routine / Application Manager, which uses the Resource Selector (MDS, NWS) and Performance Modeler; Contract Development asks "Is this final schedule OK?", then the App Launcher starts the Application with the problem parameters, application location, and final schedule, and the Contract Monitor watches the running Application]

  13. Contract Monitor Architecture • Uses the Autopilot (UIUC) infrastructure • The application is instrumented with calls to Autopilot • For registering with the Autopilot manager • For forking off sensor threads • For passing variable information to the sensors • A sensor client (in this case, the contract monitor) can connect to the sensors and gather variable information

  14. Contract Monitor Architecture [Diagram: the Application registers with the Autopilot Manager and forks off Sensors; the Contract Monitor obtains sensor information from the manager and connects to the sensors to obtain information about a variable x]

  15. GrADS Project – GrADS Testbed [Map: sites in the GrADS project testbed]

  16. Performance Model Evaluation

  17. GrADS Benefits – Examples [Chart: runs on the combined MSC & TORC cluster (8 mscs + 8 torcs, 8 mscs + 8 torcs, 8 mscs + 7 torcs) and on the MSC cluster alone (8, 7, and 5 machines)]

  18. GrADS Limitations Needed: A Metascheduler that has global knowledge of all applications !

  19. Metascheduler • To ensure that applications are scheduled based on correct resource information • To accommodate as many new applications as possible • To improve the performance contract of new applications • To minimize the impact of new applications on executing applications • To employ policies to migrate executing applications

  20. Metascheduler Components The metascheduler receives requests from applications for permission to execute on the Grid, application-level schedules from the applications, and requests for migration. Its components: • Permission Service – decisions based on resource capacities; can stop an executing resource-consuming application • Contract Negotiator – can accept or reject contracts; acts as queue manager; ensures scheduling based on correct information; improves performance contracts; minimizes impact • Database Manager – storing and retrieval of the states of the applications • Rescheduler – reschedules executing applications, to escape from heavy load or to use free resources

  21. Modified GrADS Architecture [Diagram: the earlier architecture – User, Grid Routine / Application Manager, Resource Selector (MDS, NWS), Performance Modeler, Contract Developer, App Launcher, Application, Contract Monitor – extended with the metascheduler components: Permission Service, Database Manager, Contract Negotiator, Rescheduler, and the RSS]

  22. Database Manager • A persistent service listening for requests from the clients • Maintains global clock • Has event notification capabilities – clients can express their interests in various events. • Stores various information: • Application’s states • Initial machines • Resource information • Final schedule • Location of various daemons • Average number of contract violations • Actual performance of application • Various times

  23. Database Manager (Contd…) When an application stops or completes, the database manager calculates the percentage completion of the application, using: • time_diff = current_time – time when the application instance started • avg_ratio = average of (actual costs / predicted costs)
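The slide lists the two ingredients but not how they combine. A plausible reconstruction (the combination itself is an assumption, not taken from the slides) scales the predicted total time by the observed slowdown and compares it against the elapsed time:

```python
def percentage_completion(current_time, start_time,
                          actual_costs, predicted_costs,
                          predicted_total_time):
    """Estimate how far along an application was when it stopped."""
    time_diff = current_time - start_time
    # Observed slowdown: average of per-phase actual / predicted costs
    avg_ratio = sum(a / p for a, p in zip(actual_costs, predicted_costs)) \
                / len(actual_costs)
    # Compare elapsed time against the slowdown-adjusted total prediction
    return min(100.0, 100.0 * time_diff / (avg_ratio * predicted_total_time))
```

For instance, an application predicted to take 100 s that has run for 150 s at half speed (avg_ratio = 2.0) is estimated to be 75% complete.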

  24. Permission Service • After collecting resource information from NWS, the GrADS applications contact the Permission Service (PS) • The PS makes decisions based on the problem requirements and the resource characteristics • If the resources have enough capacity, permission is given • If not, the permission service either • Waits for resource-consuming applications that will end soon, or • Preempts resource-consuming applications to accommodate short applications

  25. Permission Service (Pseudo code)

  26. Permission Service (Pseudo code)
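Slides 25–26 carried the pseudo code, which did not survive transcription. The following Python sketch reconstructs the decision logic described on slide 24; the `RunningApp` record, the threshold value, and all names are assumptions made for illustration:

```python
from dataclasses import dataclass

SOON_THRESHOLD = 60.0  # seconds; hypothetical cutoff for "ending soon"

@dataclass
class RunningApp:
    name: str
    remaining_time: float    # estimated seconds until completion
    resource_usage: float    # fraction of grid capacity consumed

def permission_decision(required, available, running_apps):
    """Grant, wait, or preempt, per the Permission Service description."""
    # Enough free capacity on every requested resource: admit immediately
    if all(available.get(r, 0) >= need for r, need in required.items()):
        return ("GRANT", None)
    if not running_apps:
        return ("REJECT", None)
    # A resource-consuming application will end soon: ask the new app to wait
    soonest = min(running_apps, key=lambda a: a.remaining_time)
    if soonest.remaining_time <= SOON_THRESHOLD:
        return ("WAIT", soonest.name)
    # Otherwise preempt the most resource-consuming application
    victim = max(running_apps, key=lambda a: a.resource_usage)
    return ("PREEMPT", victim.name)
```

With one heavy application 30 s from completion, a large request waits; if the heavy application still has 500 s to run, it is preempted instead.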

  27. Permission Service – determining resource consuming applications • For each currently executing GrADS app., contact DBM, obtain NWS resource information. • Determine change of resources caused by app. i • Add the change to current resource characteristics to obtain resource parameters in the absence of app. i

  28. Determining remaining execution time • Whenever a metascheduler component wants to determine the remaining execution time of an application, it contacts the application's contract monitor • Retrieves the average of the ratios between actual times and predicted times • Uses {average ratio, predicted time, percentage completion} to determine the remaining execution time

  29. Determining remaining execution time (pseudo code)

  30. Determining remaining execution time (pseudo code)
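Slides 29–30 held the pseudo code; based on the three quantities listed on slide 28, a plausible reconstruction (assumed, not taken from the slides) scales the prediction by the observed slowdown and takes the uncompleted fraction:

```python
def remaining_execution_time(avg_ratio, predicted_total, pct_complete):
    """Estimate seconds left for a running application.

    avg_ratio       -- average of (actual time / predicted time) so far
    predicted_total -- predicted total execution time (seconds)
    pct_complete    -- percentage completion, 0..100
    """
    adjusted_total = avg_ratio * predicted_total   # slowdown-adjusted prediction
    return adjusted_total * (1.0 - pct_complete / 100.0)
```

For example, an application predicted at 100 s, running at half speed (avg_ratio = 2.0) and 75% complete, has an estimated 50 s remaining.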

  31. Contract Negotiator • Main functionalities • Ensures applications made scheduling decisions based on up-to-date resource information • Improves the performance of a current application, possibly by stopping and later continuing an executing big application • Reduces the impact caused by current applications on executing applications • When a contract is approved, the application starts using resources • When a contract is rejected, the application goes back to obtain new resource characteristics and generates a new schedule • Enforces an ordering among applications whose application-level schedules use the same resources • Approves the contract of one application • Waits for that application to start using resources • Rejects the contract of the other

  32. Contract Negotiator (Pseudo code) Ensuring app. has made scheduling decision based on correct resource information

  33. Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

  34. Contract Negotiator – 3 scenarios • t1 – average completion time of the current app. and the big app. when the big app. is preempted, the current app. is accommodated, and the big app. is then continued • t2 – average completion time of the current app. and the big app. when the big app. is allowed to complete and then the current app. is accommodated • t3 – average completion time of the current app. and the big app. when both applications are executed simultaneously

if (t1 < 25% of min(t2, t3)) → case 1
else if (t3 > 1.2 * t2) → case 2
else → case 3

  35. Contract Negotiator (Pseudo code) Improving the performance of the current app. by preempting an executing large app.

  36. Contract Negotiator (Pseudo code) Reducing the impact of the current app. on executing app. by modifying the schedule

  37. Contract Negotiator (Pseudo code) Reducing the impact of the current app. on executing app. by modifying the schedule

  38. Application and Metascheduler Interactions [Flowchart: the User supplies problem parameters → Resource Selection produces an initial list of machines → the application requests permission from the Permission Service; if permission is denied, it gets new resource information and retries, or aborts → Application-Specific Scheduling produces an application-specific schedule → Contract Development with the Contract Negotiator; if the contract is not approved, the application gets new resource information and reschedules → Application Launching with the problem parameters and final schedule → on completion, exit; if the application was stopped, it waits for a restart signal and gets new resource information]

  39. Experiments and Results – Demonstration of Permission Service • Application 1 – LU, matrix size 13000, uses 4 opus, 1 torc and 2 cypher machines • Application 2 – LU, varying matrix sizes, uses 4 opus machines • The permission service stopped application 1 to accommodate application 2

  40. Experiments and Results – Practical Experiments • 5 applications were integrated into GrADS – ScaLAPACK LU, QR, Eigen, PETSc CG and a heat equation solver • Integration involved developing performance models and instrumenting with SRS • 50 problems with different arrival rates – a Poisson distribution with different mean arrival rates for job submission, uniform distributions for problem types and problem sizes • Various statistics were collected • Runs were performed with the metascheduler enabled or disabled

  41. Experiments and Results – Practical Experiments – Total Throughput Comparison

  42. Experiments and Results – Practical Experiments – Performance Contract Violations Contract violation: (measured time / expected time) > maximum allowed (measured time / expected time)

  43. Metascheduler Behavior – Number of metascheduling decisions

  44. References • Vadhiyar, S., Dongarra, J. and YarKhan, A. "GrADSolve – RPC for High Performance Computing on the Grid". Euro-Par 2003, 9th International Euro-Par Conference, Proceedings, Springer, LNCS 2790, pp. 394-403, August 26-29, 2003. • Vadhiyar, S. and Dongarra, J. "Metascheduler for the Grid". Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, pp. 343-351, July 2002, Edinburgh, Scotland. • Vadhiyar, S. and Dongarra, J. "GrADSolve – A Grid-based RPC System for Parallel Computing with Application-level Scheduling". Journal of Parallel and Distributed Computing, Volume 64, pp. 774-783, 2004. • Petitet, A., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K., Vadhiyar, S. "Numerical Libraries and the Grid: The GrADS Experiments with ScaLAPACK". International Journal of High Performance Computing Applications, Vol. 15, No. 4 (Winter 2001): 359-374.

  45. GrADSolve • A practical demonstration of the metascheduling system • A usable integrated framework containing the metascheduler and preemptible applications • Based on NetSolve experience • Provides many more capabilities • Separate domains for administrators, service providers and end users • Powerful data distribution strategies • First RPC system for maintaining and using execution traces

  46. GrADSolve Architecture [Diagram: Administrators add user and machine information; Service Providers / Library writers call add_problem, get_perfmodel_template and add_perfmodel against an Apache Xindice XML database, supplying a problem specification (PROBLEM qrwrapper, C FUNCTION qrwrapper(IN int N, INOUT double A[N][N]…), TYPE = parallel, CONTINUE_CAPABILITY = yes, RECONFIGURATION_CAPABILITY = yes) and performance-model stubs to fill up – areResourcesSufficient(<problem parameters>, <resource characteristics>), getExecutionTimeCost(<problem parameters>, <resource characteristics>), mapper(<problem parameters>, <resource characteristics>); End Users call gradsolve("qrwrapper", N, NB, A, B), download the execution model, stage out input data, launch the application on the GrADSolve resources, and stage in output data; the metascheduler components (Performance Modeler, Contract Negotiator, Permission Service, Rescheduler) interact with a PostgreSQL database]

  47. Metascheduler Internals • Applications need to make at least 20% progress between preemptions • An application cannot be preempted if another application waits for its completion

  48. Experiments and Results – Demonstration of Contract Negotiator • Application 1 – LU, varying large matrix sizes on N cypher machines • Application 2 – LU, matrix size 7500 on (N+1) cypher machines, N of which were occupied by application 1 • The Contract Negotiator stopped application 1 and made all (N+1) machines available to application 2 • util_val = (performance gain for application 2) / (performance loss for application 1)

  49. Metascheduler Behavior – Performance Contract Violations
