

Automatic Statistical Evaluation of Resources for Condor

Daniel Nurmi, John Brevik, Rich Wolski

University of California, Santa Barbara


Motivation

  • Distributed system/Grid applications execute on a wide variety of architectures

    • Clusters

    • Large SMP systems

    • Interactive workstation networks

  • Condor provides a vast, easily accessible resource pool, but it is best suited to Condor applications


Condor As Resource Pool

  • Provides many required features

    • Resource manager

    • Account manager

    • Scheduler

  • Resource availability very dynamic

    • Controlled by large number of variables including overall load, user priority, occupancy time, owner revocation, etc.

    • Resources free up and drop out frequently

  • Long running apps must be checkpointed


Checkpointing Schemes

  • Condor checkpointing

    • Standard Universe uses system call liftoff

    • Core file is used to capture process state for restart

  • Application-level checkpointing:

    • Application developer must generate checkpoints from within the application

    • Disk storage may be limited (none available locally)


Condor Checkpointing

  • Checkpointing is invisible to application developer, but…

    • No threads

    • No forking

    • Single architecture support

    • Must use compiler supported by Condor (e.g. no GMP)


Application-Level Checkpointing

  • No support from Condor for checkpointing in Vanilla universe

    • Left to the application

  • No restrictions on system calls or compilation

    • If it compiles it will run

  • No local disk storage

    • Checkpoints must traverse the network to a machine with stable storage

  • Checkpoint schedule major performance concern


Checkpoint Scheduling

  • Given a long-running application and a volatile resource, determine how much time to spend on useful computation between checkpoints so that the overhead of checkpointing is minimized

  • Well studied

    • K. M. Chandy, C. V. Ramamoorthy. Rollback and recovery strategies for computer systems.

    • M. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson. A survey of rollback-recovery protocols in message passing systems.

    • A. Duda. The effects of checkpointing on program execution time.

    • N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme.

  • We use the Markov-model-based approach proposed by N. H. Vaidya.
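
The interval-selection problem above can be illustrated with a minimal sketch. It is not the authors' scheduler and not Vaidya's full Markov model: it assumes exponentially distributed availability, a fixed checkpoint cost, and a fixed recovery cost, and it assumes a failure during recovery restarts the recovery. All function and parameter names are hypothetical.

```python
import math

def expected_segment_time(work, ckpt_cost, recovery_cost, failure_rate):
    """Expected wall-clock time to complete `work` seconds of computation
    followed by one checkpoint, assuming exponential time to failure
    (rate `failure_rate`) and a restart cost paid after every failure."""
    lam = failure_rate
    return math.exp(lam * recovery_cost) * (math.exp(lam * (work + ckpt_cost)) - 1.0) / lam

def overhead_ratio(work, ckpt_cost, recovery_cost, failure_rate):
    """Extra expected time per unit of useful work (0 would mean no overhead)."""
    return expected_segment_time(work, ckpt_cost, recovery_cost, failure_rate) / work - 1.0

def best_interval(ckpt_cost, recovery_cost, failure_rate, iters=200):
    """Ternary search for the work interval that minimizes the overhead ratio.
    Under this model the ratio is unimodal and its minimum lies below 1/rate."""
    lo = 1e-6 / failure_rate
    hi = 1.0 / failure_rate
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if overhead_ratio(m1, ckpt_cost, recovery_cost, failure_rate) < \
           overhead_ratio(m2, ckpt_cost, recovery_cost, failure_rate):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Hypothetical numbers: 60 s checkpoint, 60 s recovery, 1 hour mean availability.
    t = best_interval(60.0, 60.0, 1.0 / 3600.0)
    print(f"interval = {t:.0f} s, overhead = {overhead_ratio(t, 60.0, 60.0, 1.0 / 3600.0):.3f}")
```

With these assumed numbers the search settles on an interval on the order of ten minutes. Replacing the exponential assumption with a Weibull or hyperexponential model changes the expected-time expression, which is where a Markov-model solver becomes necessary.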


Checkpoint Interval Selection

  • Model requires statistical distribution describing resource availability

    • Vaidya, and later Plank, assume exponential distributions


What is the Availability Distribution?

  • Weibull

    • T. Heath, P. M. Martin, T. D. Nguyen. The shape of failure.

    • J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked Windows NT system field failure data analysis.

  • Hyperexponential

    • M. Mutka, M. Livny. Profiling workstations’ available capacity for remote execution.

    • I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh. Measurement-based evaluation of operating system fault tolerance.
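
For reference, the two candidate families can be written as survival functions, that is, the probability that a resource is still available after time t. This is a minimal sketch with hypothetical function names, using the standard textbook forms of each distribution rather than anything taken from the cited papers.

```python
import math

def weibull_survival(t, shape, scale):
    """P(availability > t) for a Weibull distribution.
    shape == 1 reduces to an exponential with rate 1/scale."""
    return math.exp(-((t / scale) ** shape))

def hyperexp_survival(t, weights, rates):
    """P(availability > t) for a hyperexponential distribution:
    a probability-weighted mixture of exponential phases."""
    return sum(p * math.exp(-r * t) for p, r in zip(weights, rates))

def conditional_survival(survival, age, extra):
    """P(lasting `extra` more seconds | already available for `age` seconds)."""
    return survival(age + extra) / survival(age)

if __name__ == "__main__":
    # Hypothetical parameters: a Weibull with shape < 1 is not memoryless,
    # so a machine that has already been up for a while is more likely to stay up.
    s = lambda t: weibull_survival(t, shape=0.5, scale=100.0)
    print(conditional_survival(s, 0.0, 100.0), conditional_survival(s, 10.0, 100.0))
```

The exponential distribution is the only memoryless case here, which is why the choice of family matters to a checkpoint scheduler: under the other two, the best interval can depend on how long the resource has already been available.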


Generating Statistical Models

  • Network Weather Service monitoring of a Condor pool over a 2-year period

    • 708 machines observed

  • Automatic model fitting software

    • Takes as input distribution type and historical Condor uptime values

    • Outputs best fit parameters for given distribution

  • Design experiment to test overall work efficiency of checkpointing scheme using four different distributions
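
The fitting step described above (distribution type and historical uptimes in, best-fit parameters out) might look like the following for two of the families. This is a sketch of ordinary maximum-likelihood estimation, not the authors' software; the function names are hypothetical, and the hyperexponential case, which needs an EM-style procedure, is left out.

```python
import math

def fit_exponential(uptimes):
    """Maximum-likelihood rate for an exponential distribution: n / sum(x)."""
    return len(uptimes) / sum(uptimes)

def fit_weibull(uptimes, tol=1e-9):
    """Maximum-likelihood (shape, scale) for a Weibull distribution.
    The shape solves the profile-likelihood equation by bisection."""
    if any(x <= 0 for x in uptimes):
        raise ValueError("uptimes must be positive")
    n = len(uptimes)
    xmax = max(uptimes)
    xs = [x / xmax for x in uptimes]      # rescale so x**k cannot overflow
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n

    def g(k):
        # Derivative of the profile log-likelihood, up to a positive factor;
        # it increases with k and is zero at the MLE.
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs)) / sum(powers)
        return weighted - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while g(hi) < 0:                      # expand until the root is bracketed
        hi *= 2.0
        if hi > 1e3:
            raise ValueError("shape estimate did not converge")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = xmax * (sum(x ** shape for x in xs) / n) ** (1.0 / shape)
    return shape, scale
```

A caller would pass the list of observed uptime durations for one machine, for example `fit_weibull(uptimes)`, and hand the returned parameters to the checkpoint scheduler.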


Checkpoint Experiment

  • Test application submitted to Condor and when it runs…

    • Sends resource information to central server

    • Model fitting software estimates model parameters using MLE or EMpht methods

    • Checkpoint scheduler solves the Markov model using the distribution under test

    • Application uses schedule, checkpoints its memory, and records performance

  • Test different distributions

  • Checkpointing to disks at UCSB
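
To make "records performance" concrete, here is a small trace-replay sketch of the two quantities the results slides report: work efficiency and the number of checkpoints sent over the network. It is a simplification and not the experiment's code. It assumes a fixed work interval, that every availability period begins with a recovery, and that work not yet checkpointed when the resource disappears is lost. The names are hypothetical.

```python
def replay_trace(uptimes, interval, ckpt_cost, recovery_cost):
    """Replay a list of availability durations (seconds) against a fixed schedule.

    Returns (efficiency, checkpoints): the fraction of available time spent on
    work that was safely checkpointed, and how many checkpoints were written.
    """
    useful = 0.0
    checkpoints = 0
    for up in uptimes:
        remaining = up - recovery_cost            # restore state from the last checkpoint
        while remaining >= interval + ckpt_cost:  # a full work + checkpoint cycle fits
            remaining -= interval + ckpt_cost
            useful += interval
            checkpoints += 1
        # Whatever is left in `remaining` is work lost when the resource is reclaimed.
    return useful / sum(uptimes), checkpoints

if __name__ == "__main__":
    # Hypothetical trace: three availability periods, 20 s of work per 5 s checkpoint.
    print(replay_trace([100.0, 40.0, 260.0], interval=20.0, ckpt_cost=5.0, recovery_cost=10.0))
```

Multiplying the checkpoint count by the checkpoint size gives a rough figure for network traffic, which is why a longer interval can reduce network load substantially even when execution time barely changes.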


Empirical Results: Execution Time


Empirical Results: Network Utilization


Moral

  • We can determine optimal checkpoint schedules for Condor jobs automatically

    • Execution performance impact is about the same until checkpoint costs become large

    • Network load improvements are substantial (particularly useful in wide area)

  • Software is real, but non-NWS parts are in prototype

    • We want to bring them into the NWS release cycle

  • Paper in submission to HPDC


What’s Next

  • Better Models

    • Brevik Method: we can predict the percentiles of availability with provable confidence bounds using less data

    • Can’t use it (yet) for Markov model

  • Better Utility

    • Provide information to Condor itself

    • Automatic fault and anomaly detection

  • Better Information for users

    • Publish availability predictions in the matchmaker


Thanks

  • Rich Wolski

  • John Brevik

  • Miron Livny

  • NSF Next Generation Software program

  • VGrADS Project (NSF ITR, Ken Kennedy, PI)

  • NSF Middleware Initiative (NWS)

  • Questions?


Simulation Results: Execution Time


Simulation Results: Network Utilization
