Job Failure Analysis and Its Implications in a Large-scale Production Grid

Hui Li, Leiden University, Dec 5, 2006.

Presentation Transcript


  1. Job Failure Analysis and Its Implications in a Large-scale Production Grid Hui Li Leiden University Dec 5, 2006

  2. Outline • Background • The Grid-level Workload • Failure Analysis • Temporal and Spatial Behavior • Cross-correlation • Implications • Modeling • Failure-aware Strategies • Summary IEEE eScience2006

  3. Related Work • Failure Analysis, Modeling, and Fault-tolerance • Logs at the node level on large-scale server clusters [Sahoo ’04, Zhang ’04] • Fault-tolerant resource management systems at the cluster and the Grid level [Hwang ’03, Limaye ’05] • Fault-tolerant techniques at the application level

  4. Our Approach • Another view of failures through workload data at the Grid level • Monitoring and data collection are harder in Grids than in single systems (nodes, disks, networks, stacks of middleware, libraries, applications, human and policy issues, etc.) • Higher-level statistical analysis of failed jobs in Grids

  5. Motivation • Insights into … • What? -> types, distributions • Why? -> possible explanations • When? -> temporal behavior • Where? -> spatial behavior • How? -> Modeling and failure-aware strategies

  6. Workload Data • LHC Computing Grid (LCG) • Data-intensive sciences • ~180 sites, 30k CPUs, 4 petabytes of storage • Virtual Organizations (VOs) • Resource Brokers (RBs) • LCG Real Time Monitor (RTM) developed by Imperial College London • Monitors most of the major RBs • Representative at the Grid level

  7. RTM view of LCG

  8. Workload Description • Comprehensive in terms of recorded attributes: timestamps, VO, user, RB, CE, WN, status • VOs: lhcb, atlas, cms, dteam, etc. • Exponential decay (80%-20% rule)

  9. Summary statistics of jobs with different status

  10. Job failures by VOs

  11. Temporal Behavior

  12. Failure interarrival time

  13. Failure life span

  14. Spatial Behavior

  15. Short Summary • Temporal and spatial burstiness • Correlations in failure interarrival times and life span • A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207-211, 2005.
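The two properties named on this slide can be checked on a failure log with a few lines of Python. The following is a minimal sketch, not the analysis used in the talk: the timestamps are made-up illustrative values, and the function names are hypothetical. It derives interarrival times from failure timestamps, uses the coefficient of variation as a rough burstiness indicator (about 1 for a Poisson process, larger for bursty arrivals), and computes a sample autocorrelation of the gaps.

```python
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Gaps between consecutive failure timestamps (input is sorted first)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def coefficient_of_variation(xs):
    """Standard deviation divided by mean; values well above 1 suggest burstiness."""
    return pstdev(xs) / mean(xs)


def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag (xs must not be constant)."""
    m = mean(xs)
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    return cov / var


# Hypothetical failure timestamps in seconds: two bursts separated by a long gap.
failure_times = [0, 1, 2, 3, 4, 600, 601, 602, 603, 604]

gaps = interarrival_times(failure_times)
cv = coefficient_of_variation(gaps)
acf1 = autocorrelation(gaps, lag=1)
```

On a real trace one would plot the autocorrelation over many lags rather than inspect a single value; the toy series here is too short for the lag-1 value to be meaningful.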

  16. Cross-correlation

  17. Cross-correlation

  18. Modeling • Hui Li and Michael Muskulus. Analysis and Modeling of Job Arrivals in a Production Grid. ACM SIGMETRICS Performance Evaluation Review, December issue, 2006, to appear. http://www.liacs.nl/~hli

  19. Failure-aware Strategies • Shortcomings in the current scheduling strategies: • One-attribute resource ranking after matchmaking • Job arrival patterns are not taken into account • “Memoryless” with respect to job failures

  20. Historical Awareness • Inspired by the fairshare scheme in the Maui scheduler • Track historical job failures at the Resource Broker level via “historical failure” (HF)
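The transcript does not preserve the HF definition itself, so the following is only a sketch of one plausible form, assuming HF is computed like a fairshare-style value: failure counts are kept per time window and older windows are discounted by a decay factor. The function names, the decay value of 0.8, the CE names, and the counts are all hypothetical.

```python
def historical_failure(failures_per_window, decay=0.8):
    """Decay-weighted failure count; index 0 is the most recent window.

    Assumed form: HF = sum(count_i * decay**i). Not taken from the talk.
    """
    return sum(count * decay ** age for age, count in enumerate(failures_per_window))


def rank_by_historical_failure(history_by_ce, decay=0.8):
    """Order CE names from lowest to highest decayed failure count."""
    return sorted(
        history_by_ce,
        key=lambda ce: historical_failure(history_by_ce[ce], decay),
    )


# Hypothetical per-CE failure counts, most recent window first.
history = {
    "ce-a": [0, 1, 9],  # failed a lot two windows ago, quiet since
    "ce-b": [5, 0, 0],  # failing right now
    "ce-c": [1, 1, 1],  # steady low failure rate
}

ranking = rank_by_historical_failure(history)
```

A Resource Broker could use such a ranking as a second attribute next to its existing rank, so that a CE with many recent failures is tried later. With a smaller decay the memory is shorter and "ce-a" would recover its position faster.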

  21. Historical Failure

  22. Illustration

  23. Proactive Awareness • Bags instead of individual jobs • Divide the jobs in a burst period into bags according to the CE “effective capacity” • Hui Li. Machine Learning for Performance Predictions on Space-shared Computing Environments. Intl. Transactions on Systems Science and Applications, ISSN 1751-1461, invited paper, to appear. • Proactive data replication
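The bag idea can be illustrated as follows. This is a minimal sketch under stated assumptions: each CE's effective capacity is taken as a given positive number (how it is estimated is not covered here), bag sizes are made proportional to capacity, and the function name, CE names, and values are hypothetical.

```python
def split_into_bags(jobs, capacities):
    """Split jobs into one contiguous bag per CE, sized proportionally to capacity.

    Cumulative rounding is used so that every job lands in exactly one bag.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    bags = {}
    start = 0
    assigned_capacity = 0
    for ce, capacity in capacities.items():
        assigned_capacity += capacity
        end = round(len(jobs) * assigned_capacity / total)
        bags[ce] = jobs[start:end]
        start = end
    return bags


# Ten hypothetical jobs arriving in one burst, and three CEs.
burst_jobs = list(range(10))
effective_capacity = {"ce-a": 5, "ce-b": 3, "ce-c": 2}

bags = split_into_bags(burst_jobs, effective_capacity)
```

Submitting a bag at a time, rather than ranking each job independently, keeps a burst from piling onto whichever CE happened to rank first at that moment.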

  24. Accountability • Effort is needed on both the system side and the client side • Negative effect of “place-holder” jobs • Users are held responsible for their behavior in the Grid, whether through priority policies or money

  25. Summary • A comprehensive statistical analysis of job failures in a large-scale Grid • Summary statistics, temporal and spatial behavior, cross-correlation • Modeling and distribution fitting • Scheduling strategies for failure awareness

  26. Acknowledgements • Gidon Moont and David J. Colling (Imperial College London) • David Groep, Jeff Templon (NIKHEF) • Michael Muskulus (Mathematics, Leiden)