Automatic Statistical Evaluation of Resources for Condor


Daniel Nurmi, John Brevik, Rich Wolski

University of California, Santa Barbara

Motivation
  • Distributed System/Grid applications execute on a wide variety of architectures
    • Clusters
    • Large SMP systems
    • Interactive workstation networks
  • Condor provides a vast, easily accessible resource pool, but is best suited to Condor applications
Condor As Resource Pool
  • Provides many required features
    • Resource manager
    • Account manager
    • Scheduler
  • Resource availability very dynamic
    • Controlled by large number of variables including overall load, user priority, occupancy time, owner revocation, etc.
    • Resources free up and drop out frequently
  • Long-running apps must be checkpointed
Checkpointing Schemes
  • Condor checkpointing
    • Standard Universe uses system call liftoff
    • Core file is used to capture process state for restart
  • Application-level checkpointing:
    • Application developer must generate checkpoints from within the application
    • Disk storage may be limited (none available locally)
Condor Checkpointing
  • Checkpointing is invisible to application developer, but…
    • No threads
    • No forking
    • Single architecture support
    • Must use a compiler supported by Condor (e.g. no GMP)
Application-Level Checkpointing
  • No support from Condor for checkpointing in Vanilla universe
    • Left to the application
  • No restrictions on system calls or compilation
    • If it compiles it will run
  • No local disk storage
    • Checkpoints must traverse the network to a machine with stable storage
  • Checkpoint schedule major performance concern
Checkpoint Scheduling
  • Given a long-running application and a volatile resource, determine how much time to spend on useful computation between checkpoints so that the overhead of checkpointing is minimized
  • Well studied
    • K. M. Chandy, C. V. Ramamoorthy. Rollback and recovery strategies for computer systems.
    • M. Elnozahy, L. Alvisi, Y. M. Wang, D. B. Johnson. A survey of rollback-recovery protocols in message passing systems.
    • A. Duda. The effects of checkpointing on program execution time.
    • N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme.
  • We use the Markov-model-based approach proposed by N. H. Vaidya.
Checkpoint Interval Selection
  • Model requires statistical distribution describing resource availability
    • Vaidya, and later Plank, assume exponential distributions
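To make the exponential assumption concrete, here is a minimal Python sketch of interval selection under it. This is not Vaidya's Markov model: it uses the standard closed form for the expected time to complete one work-plus-checkpoint segment when failures are exponential, and compares the numeric optimum with Young's well-known approximation sqrt(2C/λ). The names `expected_segment_time`, `overhead_ratio`, and `best_interval`, the parameter values, and the assumption that recovery itself is failure-free are all illustrative choices, not taken from the slides.

```python
import math

def expected_segment_time(T, C, R, lam):
    """Expected wall-clock time to finish T seconds of work plus a checkpoint
    of cost C, with exponential failures at rate lam. Each failure costs R
    seconds of recovery (assumed failure-free) and restarts the segment."""
    return (1.0 / lam + R) * (math.exp(lam * (T + C)) - 1.0)

def overhead_ratio(T, C, R, lam):
    """Extra time spent per second of useful work."""
    return expected_segment_time(T, C, R, lam) / T - 1.0

def best_interval(C, R, lam, iters=100):
    """Golden-section search for the T that minimizes the overhead ratio.
    The ratio has a single minimum in T, and that minimum lies below 1/lam."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-6 / lam, 1.0 / lam
    for _ in range(iters):
        a = hi - phi * (hi - lo)
        b = lo + phi * (hi - lo)
        if overhead_ratio(a, C, R, lam) < overhead_ratio(b, C, R, lam):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

# Hypothetical values: 60 s checkpoint, 60 s recovery, 10 h mean availability.
C, R, lam = 60.0, 60.0, 1.0 / 36000.0
T_opt = best_interval(C, R, lam)
T_young = math.sqrt(2.0 * C / lam)  # Young's first-order approximation
print(f"numeric optimum: {T_opt:.0f} s, Young: {T_young:.0f} s")
```

For non-exponential availability no such closed form applies, which is why the fitted distribution matters for the schedule.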
What is the Availability Distribution?
  • Weibull
    • T. Heath, P. M. Martin, T. D. Nguyen. The shape of failure
    • J. Xu, Z. Kalbarczyk, R. K. Iyer. Networked Windows NT system field failure data analysis
  • Hyperexponential
    • M. Mutka, M. Livny. Profiling workstations’ available capacity for remote execution.
    • I. Lee, D. Tang, R. K. Iyer, M. C. Hsueh. Measurement-based evaluation of operating system fault tolerance.
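The candidate families differ mainly in how the chance of staying available changes with the time already spent available. The following Python sketch writes out the three survival functions and a conditional-survival helper; all parameter values are made up for illustration and are not fitted to any Condor data.

```python
import math

def exp_survival(t, lam):
    """P(availability lasts longer than t) for an exponential with rate lam."""
    return math.exp(-lam * t)

def weibull_survival(t, shape, scale):
    """Weibull survival function; shape < 1 gives a decreasing hazard rate."""
    return math.exp(-((t / scale) ** shape))

def hyperexp_survival(t, weights, rates):
    """Mixture of exponentials: phase i is chosen with probability weights[i]."""
    return sum(p * math.exp(-r * t) for p, r in zip(weights, rates))

def conditional_survival(surv, age, extra):
    """P(available for another `extra` seconds | already available for `age`)."""
    return surv(age + extra) / surv(age)

# Hypothetical parameters, chosen only to show the qualitative difference.
models = {
    "exponential": lambda t: exp_survival(t, 1.0 / 3600.0),
    "weibull": lambda t: weibull_survival(t, 0.5, 3600.0),
    "hyperexponential": lambda t: hyperexp_survival(
        t, (0.5, 0.5), (1.0 / 600.0, 1.0 / 36000.0)),
}
for name, surv in models.items():
    fresh = conditional_survival(surv, 0.0, 600.0)
    aged = conditional_survival(surv, 3600.0, 600.0)
    print(f"{name}: fresh={fresh:.3f} after 1h={aged:.3f}")
```

The exponential is memoryless, so both numbers agree for it, while a Weibull with shape below 1 and a hyperexponential both make a machine that has already been up for a while look more likely to stay up.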
Generating Statistical Models
  • Network Weather Service monitoring of a Condor pool over a 2-year period
    • 708 machines observed
  • Automatic model fitting software
    • Takes as input distribution type and historical Condor uptime values
    • Outputs best fit parameters for given distribution
  • Design experiment to test overall work efficiency of checkpointing scheme using four different distributions
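As a rough illustration of what such fitting software does, the sketch below computes maximum-likelihood parameters for an exponential and a Weibull distribution from a list of uptime values. It is not the authors' software: the function names, the bisection solver, and the synthetic data are assumptions made for this example, and hyperexponential fitting (which needs an EM-style method) is left out.

```python
import math
import random

def fit_exponential(uptimes):
    """MLE of the exponential rate: number of samples over total uptime."""
    return len(uptimes) / sum(uptimes)

def fit_weibull(uptimes, lo=0.01, hi=20.0, iters=100):
    """MLE of Weibull (shape, scale), solving the shape equation by bisection.
    Assumes all uptimes are positive and the shape lies in [lo, hi]."""
    m = max(uptimes)
    xs = [u / m for u in uptimes]            # rescale to avoid overflow
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k):
        # Derivative of the profile log-likelihood; increasing in k.
        w = [x ** k for x in xs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = m * (sum(x ** shape for x in xs) / len(xs)) ** (1.0 / shape)
    return shape, scale

# Synthetic stand-in for historical uptime values (seconds).
rng = random.Random(0)
history = [rng.weibullvariate(3600.0, 0.5) for _ in range(5000)]
print("exponential rate:", fit_exponential(history))
print("weibull (shape, scale):", fit_weibull(history))
```

With real traces, the fitted parameters for each distribution type would then be handed to the checkpoint scheduler.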
Checkpoint Experiment
  • Test application submitted to Condor and when it runs…
    • Sends resource information to central server
    • Model fitting software estimates model parameters using MLE or EMpht methods
    • Checkpoint scheduler solves the Markov model using tested distribution
    • Application uses schedule, checkpoints its memory, and records performance
  • Test different distributions
  • Checkpointing to disks at UCSB
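One simple way to think about the quantities such an experiment records is to replay a checkpoint interval against a list of observed availability periods. The Python sketch below does this under strong simplifying assumptions that are not from the slides: a fixed interval, a fixed checkpoint cost, a recovery cost paid at the start of every period, and efficiency defined as useful work time over total occupied time. The function name and the trace values are hypothetical.

```python
def work_efficiency(up_times, interval, ckpt_cost, recovery_cost):
    """Replay a fixed checkpoint interval over availability periods.

    Returns (efficiency, checkpoints): the fraction of occupied time that
    became committed work, and the number of checkpoints written, which is
    a proxy for network load when checkpoints travel to remote storage."""
    useful = 0.0
    total = 0.0
    checkpoints = 0
    for up in up_times:
        total += up
        usable = up - recovery_cost          # time left after restoring state
        if usable <= 0:
            continue                         # evicted before recovery finished
        done = int(usable // (interval + ckpt_cost))
        useful += done * interval            # only checkpointed work counts
        checkpoints += done
    return useful / total, checkpoints

# Hypothetical availability trace (seconds) and candidate intervals.
trace = [1200.0, 300.0, 5400.0, 90.0, 2500.0, 7200.0]
for interval in (120.0, 600.0, 1800.0):
    eff, n = work_efficiency(trace, interval, ckpt_cost=30.0, recovery_cost=30.0)
    print(f"interval={interval:.0f}s efficiency={eff:.2f} checkpoints={n}")
```

Short intervals waste time on checkpoints and generate more network traffic, while long intervals lose more work at each eviction.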
  • We can determine optimal checkpoint schedules for Condor jobs automatically
    • Execution performance impact is about the same until checkpoint costs get big
    • Network load improvements are substantial (particularly useful in wide area)
  • Software is real, but non-NWS parts are in prototype
    • We want to bring them into the NWS release cycle
  • Paper in submission to HPDC
What’s Next
  • Better Models
    • Brevik Method: we can predict the percentiles of availability with provable confidence bounds using less data
    • Can’t use it (yet) for Markov model
  • Better Utility
    • Provide information to Condor itself
    • Automatic fault and anomaly detection
  • Better Information for users
    • Publish availability predictions in the matchmaker
  • Rich Wolski
  • John Brevik
  • Miron Livny
  • NSF Next Generation Software program
  • VGrADS Project (NSF ITR, Ken Kennedy, PI)
  • NSF Middleware Initiative (NWS)
  • Questions?