1 / 27

Bust a Move Young MC

Bust a Move Young MC. Modeling and Predicting Machine Availability in Volatile Computing Environments Rich Wolski John Brevik Dan Nurmi University of California, Santa Barbara. Explorations In Grid Computing.

mouton
Download Presentation

Bust a Move Young MC

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Bust a Move Young MC

  2. Modeling and Predicting Machine Availability in Volatile Computing Environments Rich Wolski John Brevik Dan Nurmi UniversityofCalifornia,SantaBarbara

  3. Explorations In Grid Computing • Performance: How can programs extract high performance levels given that the resource pool is heterogeneous and dynamically changing? • The Network Weather Service • On-line performance monitoring and prediction • Programming: What programming abstractions are needed to enable the Grid paradigm? • EveryWare • Toolkit for building global programs • Analysis: How do we reason about the Grid globally? • G-Commerce • Systemwide efficiency, stability, etc.

  4. Fortune Telling • Grid resource performance varies dynamically • Machines, networks and storage systems are shared by competing applications • Federation • Either the system or the application itself must “tolerate” performance variation • Dynamic scheduling • Scheduling requires a prediction of future performance levels • What performance level will be deliverable?

  5. Skepticism • Is it really possible to predict future performance levels? • Self-similarity • Non-stationarity • With what accuracy? • For how long into the future? • NWS On-line, semi non-parametric time series techniques • Use running tabulation of forecast error to choose between competing forecasters • Bandwidth, latency, CPU load, available memory, battery power

  6. What About Machine Availability?

  7. The “Normal” Approach • Each measurement is modeled as a “sample” from a random variable • Time invariant • IID (independent, identically distributed) • Stationary (IID forever) • Well studied in the literature • Exponential distributions • Compose well • Memoryless • Popular in database and fault-tolerance communities • Pareto distributions • Potentially related to self-similarity • “heavy-tailed” implying non-predictability • Popular in networking, Internet, and Dist. System communities

  8. Our “Abnormal” Approach • Measure availability as “lifetime” in a variety of settings • Student lab at UCSB, Condor pool • New NWS availability sensors • Data used in fault-tolerance community for checkpointing research • Predicting optimal checkpoint • Develop robust software for MLE parameter estimation • Fit Exponential, Pareto, and Weibull distributions • Compare the fits • Visually • Goodness of fit tests • Goal is to provide an automated mechanism for the NWS • Let the best distribution win

  9. UCSB Student Computing Labs • Approximately 85 machines running Red Hat Linux located in three separate buildings • Open to all Computer Science graduate and undergraduates • Only graduates have building keys • Power-switch is not protected • Anyone with physical access to the machine can reboot it by power cycling it • Students routinely “clean off” competing users or intrusive processes to gain better performance response • NWS deployed and monitoring duration between restarts • Can we model the time-to-reboot?

  10. UCSB Empirical CDF

  11. MLE Weibull Fit to UCSB Data

  12. Comparing Fits at UCSB

  13. The Visual Acid Test

  14. More Systems • Condor: Cycle harvesting system (M. Livny, U. Wisconsin) • Workstations in a “pool” run the (trusted) Condor daemons • When a machine running a Condor job becomes “busy” Job is terminated (vanilla universe) • Unknown and constantly changing number of workstations in UWisc Condor Pool (~ 1000 Linux Workstations) • Long, Muir, Golding Internet Survey (1995) • Pinged the rpc.statd as a heartbeat • Used extensive in fault-tolerance community to model host failure • 1170 hosts covering 3 months of Spring

  15. The Condor Picture April 2003 through Oct 2004, 600 hosts

  16. More Condor April 2003 through July 2005, 900 hosts

  17. Condor Clusters April 2003 through July 2005, 730 hosts

  18. Condor Non-cluster April 2003 through July 2005, 170 hosts

  19. Modeling Lessons • Machine availability looks like it is well-modeled by a Weibull, but Condor process lifetime is trickier • Hyper-exponentials do well, but hard to fit and use • Log-normal looks better in the large (need to investigate more) • May be able to do piece-wise fit for “desktops” • Who should care? • Grid simulators • Availability is critical • P2P systems • Oceanstore, CAN, TAPESTRY, etc. all assume very basic availability distributions in their proofs • Replication systems • It does not mean, that model fitting works best for predicting availability => data shortage

  20. Predicting Individual Machine Behavior • Estimating Mean Time to Failure (MTTF) is relatively easy • Unless the data is Pareto, the mean is the “expected value” • Probably not what is needed to support scheduling • The cost of being below the mean is not the same as the cost of being above it • “At least how much time will elapse before this machine reboots with 95% certainty?” • The answer is the 0.05 quantile (not an expectation) from the cumulative distribution function (CDF)

  21. Certainty in an Uncertain World • Predictions of the form • “For at least how long with this machine be available with X% certainty?” • Requires two estimates if certainty is to be quantified • Estimate the (1-X) quantile for the distribution of availability => Qx • Estimate the lower X% confidence bound on the statistic Qx=> Q(x,lb) • If the estimates are unbiased, and the distribution is stationary, future availability duration will be larger than Q(x,lb)X% of the time, guaranteed

  22. Neo-classical Methods • The classical (parametric) method has some drawbacks • Which distribution? • MLE is computationally challenging or impossible for some distributions and/or data sets • Requires quite a bit of data to get a good fit • Quantiles near the tails are “squeezed” so fit error is significant • Estimating confidence bounds for high-order models is computationally (and theoretically) difficult • Non-parametric techniques • Can usually only recover a statistic and not the distribution • Those that appeal to the CLT may have an asymptote problem • New non-parametric inventionbased on Binomial assumptions

  23. Experiments in Fortune Telling • CSIL, Condor (2 years), and Long data sets • Split into training and experimental periods • Use only machines with 20 training samples or more • Using synthetic data we noticed that be best method needed at least 20 samples • Use methods to estimate 95% confidence on 0.05 quantile from training period • Record success if 95%-100% of the remaining experimental availability durations >= estimate • Report % success (want to see 95%)

  24. Non-parametric methods seem to work • Weibull over-estimates the tail for Condor data • Bootstrapping works okay, but is very computationally expensive

  25. On-going Work with Condor • Checkpoint scheduling • Parametric method reduces network load dramatically • Applications • LDPC investigation => lowest observed error rates • Ramsey search • GridSAT • UCSBGrid • Automatic Program Overlay • On-demand Condor as a grid programming middleware • NWS Condor Integration • Publishing NWS forecasts via Hawkeye • Incorporating machine availability predictor

  26. Thanks and More • Miron Livny and the Condor group at the University of Wisconsin • Darrell Long (UCSC) and James Plank (UTK) • UCSB Facilities Staff • NSF SCI and DOE • Middleware and Applications Yielding Heterogeneous Environments for Metacomputing at UCSB • Students: Matthew Allen, Wahid Chrabakh, Ryan Garver, Andrew Mutz, Dan Nurmi, Erik Peterson, Fred Tu, Lamia Youseff, Ye Wen • Research Staff: John Brevik, Graziano Obertelli • www.cs.ucsb.edu/~rich • rich@cs.ucsb.edu

More Related