Job Failure Analysis and Its Implications in a Large-scale Production Grid

Hui Li, Leiden University, Dec 5, 2006

Outline • Background • The Grid-level Workload • Failure Analysis • Temporal and Spatial Behavior • Cross-correlation • Implications • Modeling • Failure-aware Strategies • Summary IEEE eScience2006

Related Work • Failure analysis, modeling, and fault tolerance • Node-level logs on large-scale server clusters [Sahoo ’04, Zhang ’04] • Fault-tolerant resource management systems at the cluster and the Grid level [Hwang ’03, Limaye ’05] • Fault-tolerant techniques at the application level

Our Approach • Another view of failures through workload data at the Grid level • Monitoring and data collection are harder in Grids than in single systems (nodes, disks, networks, stacks of middleware, libraries, applications, human and policy issues, etc.) • Higher-level statistical analysis of failed jobs in Grids

Motivation • Insights into … • What? -> types, distributions • Why? -> possible explanations • When? -> temporal behavior • Where? -> spatial behavior • How? -> Modeling and failure-aware strategies

Workload Data • LHC Computing Grid (LCG) • Data-intensive sciences • ~180 sites, 30k CPUs, 4 petabytes of storage • Virtual Organizations (VOs) • Resource Brokers (RBs) • LCG Real Time Monitor (RTM), developed by Imperial College London • Monitors most of the major RBs • Representative at the Grid level

RTM view of LCG

Workload Description • Comprehensive in terms of recorded attributes: timestamps, VO, user, RB, CE, WN, status • VOs: lhcb, atlas, cms, dteam, etc. • Exponential decay (80%-20% rule)

Summary statistics of jobs with different status

Job failures by VOs

Temporal Behavior

Failure interarrival time
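As a minimal sketch of the quantity named on this slide, the interarrival time can be computed as the gap between consecutive failure events. The function name `failure_interarrival_times` and the sample timestamps are illustrative assumptions; they are not taken from the RTM data or from the talk.

```python
from datetime import datetime


def failure_interarrival_times(timestamps):
    """Return the gaps, in seconds, between consecutive failure events.

    `timestamps` is any iterable of datetime objects (hypothetical input;
    the real trace format is not shown in the slides).
    """
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


# Made-up failure times: three failures close together, then a long gap.
failures = [
    datetime(2006, 1, 1, 12, 0, 0),
    datetime(2006, 1, 1, 12, 0, 5),
    datetime(2006, 1, 1, 12, 0, 7),
    datetime(2006, 1, 1, 13, 0, 7),
]
gaps = failure_interarrival_times(failures)
print(gaps)
```

Many short gaps mixed with a few very long ones is the kind of pattern the later summary slide calls burstiness.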

Failure life span

Spatial Behavior

Short Summary • Temporal and spatial burstiness • Correlations in failure interarrival times and life span • A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207-211, 2005.

Cross-correlation
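The transcript does not say which series are cross-correlated. As a hedged illustration only, the sketch below assumes two equally spaced count series, `arrivals` and `failures` per time interval (both hypothetical), and computes the Pearson correlation between one series and a lagged copy of the other. This is a simple windowed estimator chosen for clarity, not necessarily the one used in the talk.

```python
from math import sqrt


def cross_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag], for lag >= 0.

    Both series are assumed to be equally spaced and of equal length.
    Returns 0.0 if either overlapping window has zero variance.
    """
    if lag < 0 or len(x) != len(y) or lag >= len(x):
        raise ValueError("need equal-length series and 0 <= lag < len(x)")
    n = len(x) - lag
    xs, ys = x[:n], y[lag:lag + n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in xs))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


# Made-up counts: failures follow arrivals with a delay of one interval.
arrivals = [1, 5, 2, 8, 3, 9, 1]
failures = [0, 1, 5, 2, 8, 3, 9]
r_lag1 = cross_correlation(arrivals, failures, 1)
print(round(r_lag1, 6))
```

Scanning `lag` over a range and looking for the peak is one way to ask whether failures tend to follow arrival bursts after a delay.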

Modeling • Hui Li and Michael Muskulus. Analysis and Modeling of Job Arrivals in a Production Grid. ACM SIGMETRICS Performance Evaluation Review, December 2006, to appear. http://www.liacs.nl/~hli

Failure-aware Strategies • Shortcomings in the current scheduling strategies: • One-attribute resource ranking after matchmaking • Job arrival patterns are not taken into account • “Memoryless” with respect to job failures

Historical Awareness • Inspired by the fairshare scheme in the Maui scheduler • Track historical job failures at the Resource Broker level via “historical failure” (HF)

Historical Failure
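The HF definition itself did not survive in this transcript. The sketch below is therefore only an assumption about what a fairshare-style metric could look like: failures are counted per time window and older windows are discounted by a decay factor, in the spirit of decayed usage windows in fairshare scheduling. The function name, the `(failed, total)` window format, and the `decay` value are all hypothetical.

```python
def historical_failure(windows, decay=0.8):
    """Decay-weighted failure ratio for one resource (e.g. one CE).

    `windows` is a list of (failed_jobs, total_jobs) tuples, most recent
    window first. Each older window is weighted by one more factor of
    `decay`. This is an illustrative assumption, not the formula from
    the talk.
    """
    weighted_failed = 0.0
    weighted_total = 0.0
    for age, (failed, total) in enumerate(windows):
        weight = decay ** age
        weighted_failed += weight * failed
        weighted_total += weight * total
    return weighted_failed / weighted_total if weighted_total else 0.0


# Two made-up resources with the same totals but different failure timing.
hf_recent = historical_failure([(5, 10), (0, 10)], decay=0.5)
hf_old = historical_failure([(0, 10), (5, 10)], decay=0.5)
print(hf_recent, hf_old)
```

Under this assumption a Resource Broker could rank matched CEs by ascending HF, so that a site with recent failures is penalized more than one whose failures lie further in the past.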

Illustration

Proactive Awareness • Bags instead of individual jobs • Divide the jobs in a burst period into bags according to the CE “effective capacity” • Hui Li. Machine Learning for Performance Predictions on Space-shared Computing Environments. Intl. Transactions on Systems Science and Applications, ISSN 1751-1461, invited paper, to appear. • Proactive data replication
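One possible reading of the bag idea, sketched under stated assumptions: the jobs of a burst are cut into bags whose sizes equal the effective capacity of the CE that will receive them, cycling over the available CEs. The slides do not define how effective capacity is estimated, so the `capacities` mapping and the CE names below are hypothetical inputs.

```python
def divide_into_bags(jobs, capacities):
    """Split a burst of jobs into (ce_name, bag) pairs.

    `capacities` maps a CE name to its effective capacity in jobs
    (hypothetical values). CEs are visited round-robin and each bag
    holds at most that CE's capacity.
    """
    usable = [(ce, cap) for ce, cap in capacities.items() if cap > 0]
    if not usable:
        raise ValueError("at least one CE with positive capacity is required")
    bags = []
    start = 0
    turn = 0
    while start < len(jobs):
        ce, cap = usable[turn % len(usable)]
        bags.append((ce, jobs[start:start + cap]))
        start += cap
        turn += 1
    return bags


# Seven made-up job ids spread over two made-up CEs.
bags = divide_into_bags(list(range(7)), {"ce-a": 3, "ce-b": 2})
print(bags)
```

Scheduling whole bags rather than single jobs means one placement decision covers a group of jobs from the same burst.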

Accountability • Efforts are needed from both the system and the client side • Negative effect of “place-holder” jobs • Users are held responsible for their behavior in the Grid, whether in terms of priority policies or money

Summary • A comprehensive statistical analysis of job failures in a large-scale Grid • Summary statistics, temporal and spatial behavior, cross-correlation • Modeling and distribution fitting • Scheduling strategies for failure awareness

Acknowledgements • Gidon Moont and David J. Colling (Imperial College London) • David Groep, Jeff Templon (NIKHEF) • Michael Muskulus (Mathematics, Leiden)