
Automatic Statistical Evaluation of Resources for Condor

Daniel Nurmi, John Brevik, Rich Wolski

University of California, Santa Barbara


Motivation

  • Distributed System/Grid applications execute on wide variety of architectures

    • Clusters

    • Large SMP systems

    • Interactive workstation networks

  • Condor provides vast, easily accessible resource pool, but is best suited to Condor applications


Condor As Resource Pool

  • Provides many required features

    • Resource manager

    • Account manager

    • Scheduler

  • Resource availability very dynamic

    • Controlled by large number of variables including overall load, user priority, occupancy time, owner revocation, etc.

    • Resources free up and drop out frequently

  • Long running apps must be checkpointed


Checkpointing Schemes

  • Condor checkpointing

    • Standard Universe uses system call liftoff

    • Core file is used to capture process state for restart

  • Application-level checkpointing:

    • Application developer must generate checkpoints from within the application

    • Disk storage may be limited (none available locally)


Condor Checkpointing

  • Checkpointing is invisible to application developer, but…

    • No threads

    • No forking

    • Single architecture support

    • Must use compiler supported by Condor (e.g. no GMP)


Application-Level Checkpointing

  • No support from Condor for checkpointing in Vanilla universe

    • Left to the application

  • No restrictions on system calls or compilation

    • If it compiles it will run

  • No local disk storage

    • Checkpoints must traverse the network to a machine with stable storage

  • Checkpoint schedule major performance concern


Checkpoint Scheduling

  • Given a long-running application and a volatile resource, determine how long to perform useful computation between checkpoints so that the overhead of checkpointing is minimized

  • Well studied

    • K. M. Chandy, C. V. Ramamoorthy. Rollback and recovery strategies for computer systems.

    • M. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson. A survey of rollback-recovery protocols in message passing systems.

    • A. Duda. The effects of checkpointing on program execution time.

    • N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme

  • We use the Markov-model-based approach proposed by N. H. Vaidya.


Checkpoint Interval Selection

  • Model requires statistical distribution describing resource availability

    • Vaidya, and later Plank, assume exponential distributions
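
To make the interval-selection problem concrete, here is a minimal sketch for the exponential case only. It is not Vaidya's full Markov model: it assumes failures arrive at a constant rate, that a failed segment is redone from the last checkpoint after a fixed recovery cost, and it picks the work interval that minimizes the expected overhead ratio. The function names, the ternary search, and the example numbers are illustrative assumptions. Young's well-known first-order approximation, sqrt(2 * C / lambda), is included for comparison.

```python
import math


def expected_segment_time(work, ckpt_cost, recovery_cost, failure_rate):
    """Expected wall-clock time to complete `work` seconds of computation
    followed by one checkpoint, assuming exponentially distributed
    availability (constant failure rate) and a recovery cost paid after
    every failure before the segment is retried."""
    segment = work + ckpt_cost
    return (math.exp(failure_rate * recovery_cost)
            * math.expm1(failure_rate * segment) / failure_rate)


def overhead_ratio(work, ckpt_cost, recovery_cost, failure_rate):
    """Extra expected time per unit of useful work (0 means no overhead)."""
    return expected_segment_time(
        work, ckpt_cost, recovery_cost, failure_rate) / work - 1.0


def best_interval(ckpt_cost, recovery_cost, failure_rate, iters=200):
    """Ternary search for the work interval with the lowest overhead ratio.
    The overhead ratio has a single minimum in `work` under this model."""
    lo, hi = 1.0, 10.0 / failure_rate
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (overhead_ratio(m1, ckpt_cost, recovery_cost, failure_rate)
                < overhead_ratio(m2, ckpt_cost, recovery_cost, failure_rate)):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def young_interval(ckpt_cost, failure_rate):
    """Young's first-order approximation of the optimal interval."""
    return math.sqrt(2.0 * ckpt_cost / failure_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s checkpoint, 10 h mean availability.
    rate = 1.0 / 36000.0
    print(best_interval(60.0, 0.0, rate), young_interval(60.0, rate))
```

With a non-exponential availability distribution the failure rate is no longer constant, so the best interval depends on how long the resource has already been up. That is why the choice of distribution matters for the schedule.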


What is the Availability Distribution?

  • Weibull

    • T. Heath, P. M. Martin, T. D. Nguyen. The shape of failure

    • J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked Windows NT system field failure data analysis

  • Hyperexponential

    • M. Mutka, M. Livny. Profiling workstations’ available capacity for remote execution.

    • I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh. Measurement-based evaluation of operating system fault tolerance.
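
For reference, the standard textbook forms of the candidate distributions are written out below. The symbols (shape α, scale β, rates λ_i, mixing weights p_i, number of phases k) are notation chosen here and do not come from the slides.

```latex
% Exponential (constant failure rate, memoryless)
f_{\mathrm{exp}}(x) = \lambda e^{-\lambda x}, \qquad x \ge 0

% Weibull with shape \alpha > 0 and scale \beta > 0
f_{\mathrm{W}}(x) = \frac{\alpha}{\beta}
  \left(\frac{x}{\beta}\right)^{\alpha - 1}
  e^{-(x/\beta)^{\alpha}},
\qquad
F_{\mathrm{W}}(x) = 1 - e^{-(x/\beta)^{\alpha}}

% k-phase hyperexponential: a probabilistic mixture of exponentials
f_{\mathrm{H}}(x) = \sum_{i=1}^{k} p_i \lambda_i e^{-\lambda_i x},
\qquad
\sum_{i=1}^{k} p_i = 1, \quad p_i \ge 0
```

The Weibull reduces to the exponential when α = 1. For α < 1 its hazard rate decreases with time, so a machine that has already been available for a while is less likely to fail soon. A hyperexponential with distinct rates also has a decreasing hazard rate.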


Generating Statistical Models

  • Network Weather Service monitoring of Condor pool over 2 year period

    • 708 machines observed

  • Automatic model fitting software

    • Takes as input distribution type and historical Condor uptime values

    • Outputs best fit parameters for given distribution

  • Design experiment to test overall work efficiency of checkpointing scheme using four different distributions
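
As an illustration of what the model-fitting step might look like, the sketch below computes maximum-likelihood parameters for the exponential and Weibull cases from a list of uptime values. It is not the authors' software. The function names are hypothetical, the uptimes are assumed to be strictly positive, and `synthetic_uptimes` generates made-up data for trying the fit, not real Condor measurements. Hyperexponential fitting (for example via EM) is not covered here.

```python
import math
import random


def fit_exponential(uptimes):
    """MLE of the exponential rate: number of samples over total uptime."""
    return len(uptimes) / sum(uptimes)


def fit_weibull(uptimes, lo=0.01, hi=50.0, iters=100):
    """MLE of Weibull (shape, scale) for strictly positive uptimes.

    The shape k solves
        sum(x^k * ln x) / sum(x^k) - 1/k - mean(ln x) = 0,
    whose left side increases with k, so bisection is sufficient.
    Values are divided by the maximum to keep x^k from overflowing."""
    n = len(uptimes)
    x_max = max(uptimes)
    logs = [math.log(x) for x in uptimes]
    mean_log = sum(logs) / n
    scaled = [x / x_max for x in uptimes]

    def score(k):
        weights = [y ** k for y in scaled]
        weighted_log = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
        return weighted_log - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    # Given the shape, the scale has a closed form.
    scale = x_max * (sum(y ** shape for y in scaled) / n) ** (1.0 / shape)
    return shape, scale


def synthetic_uptimes(n, shape, scale, seed=0):
    """Made-up Weibull-distributed uptimes for exercising the fit."""
    rng = random.Random(seed)
    # random.weibullvariate takes (scale, shape) in that order.
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


if __name__ == "__main__":
    data = synthetic_uptimes(5000, shape=0.7, scale=1000.0)
    print(fit_weibull(data), fit_exponential(data))
```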


Checkpoint Experiment

  • Test application submitted to Condor and when it runs…

    • Sends resource information to central server

    • Model fitting software estimates model parameters using MLE or EMpht methods

    • Checkpoint scheduler solves the Markov model using tested distribution

    • Application uses schedule, checkpoints its memory, and records performance

  • Test different distributions

  • Checkpointing to disks at UCSB
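
The two quantities the experiment measures, work efficiency and network traffic, can be approximated offline by replaying a checkpoint schedule against a trace of availability periods. The sketch below is a simplified stand-in for the real experiment, under stated assumptions: a fixed work interval, fixed checkpoint and recovery costs, a recovery at the start of every period long enough to complete one (including the first), and work counted only once its checkpoint has completed. The function name and the example trace are hypothetical.

```python
def replay_schedule(uptimes, work_interval, ckpt_cost, recovery_cost):
    """Replay a fixed checkpoint schedule over a list of availability
    periods (seconds). Returns (efficiency, transfers), where efficiency
    is useful work divided by total occupied time and transfers counts
    checkpoint writes plus recovery reads across the network."""
    useful = 0.0
    total = 0.0
    transfers = 0
    for duration in uptimes:
        total += duration
        usable = duration - recovery_cost
        if usable <= 0:
            continue  # evicted before recovery finished; nothing gained
        transfers += 1  # recovery read of the last checkpoint
        # Only complete work-plus-checkpoint cycles preserve progress.
        cycles = int(usable // (work_interval + ckpt_cost))
        useful += cycles * work_interval
        transfers += cycles  # one checkpoint write per completed cycle
    efficiency = useful / total if total > 0 else 0.0
    return efficiency, transfers


if __name__ == "__main__":
    # Hypothetical trace of three availability periods, in seconds.
    trace = [100.0, 3.0, 400.0]
    print(replay_schedule(trace, work_interval=20.0,
                          ckpt_cost=5.0, recovery_cost=5.0))
```

Replaying the same trace with intervals derived from different fitted distributions gives a rough way to compare schedules on both efficiency and transfer count before running on the live pool.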


Empirical Results: Execution Time


Empirical Results: Network Utilization


Moral

  • We can determine optimal checkpoint schedules for Condor jobs automatically

    • Execution performance impact is about the same until checkpoint costs get big

    • Network load improvements are substantial (particularly useful in wide area)

  • Software is real, but non-NWS parts are in prototype

    • We want to bring them into the NWS release cycle

  • Paper in submission to HPDC


What’s Next

  • Better Models

    • Brevik Method: we can predict the percentiles of availability with provable confidence bounds using less data

    • Can’t use it (yet) for Markov model

  • Better Utility

    • Provide information to Condor itself

    • Automatic fault and anomaly detection

  • Better Information for users

    • Publish availability predictions in the matchmaker


Thanks

  • Rich Wolski

  • John Brevik

  • Miron Livny

  • NSF Next Generation Software program

  • VGrADS Project (NSF ITR, Ken Kennedy, PI)

  • NSF Middleware Initiative (NWS)

  • Questions?


Simulation Results: Execution Time


Simulation Results: Network Utilization

