
Efficient Response Time Predictions by Exploiting Application and Resource State Similarities


Presentation Transcript


  1. Efficient Response Time Predictions by Exploiting Application and Resource State Similarities Hui Li, David Groep, Lex Wolters. Grid'05, Seattle, WA, Nov 14th, 2005

  2. Outline • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  3. Problem Statement • Context: Large scale Grids like LCG • Target: Computing resources like clusters and parallel supercomputers • Source: Historical workload traces • Goal: Develop a practically useful technique for job response time predictions • Purpose: Provide dynamic information for metascheduling decision support

  4. The LCG case (http://lcg.web.cern.ch/LCG/)

  5. The LCG Challenges • Scalable production environment (~211 sites, 16,854 CPUs, 5 PB storage) • Many options remain after matchmaking and authorization filtering • How does the resource broker make a good selection of candidate sites? • What makes a good metric? Sites may not want to publish their policies.

  6. The NIKHEF Site

  7. Job Response Times on Resources • Job response time as a dynamic performance metric, defined as the time elapsed from a job’s submission to completion. • Response time = Application Run Time + Queue Wait Time

  8. Related Work • Predictions based on historical observations • Similarity Templates [Smith et al, 98] - Run Time • Instance Based Learning [Kapadia et al, 99] - Run Time • Scheduler Simulation [Smith 99, Li et al 04] - Wait Time • “Learning it from data” • Can scheduling rules and policies be discovered by mining historical data? • How to use it for wait time predictions?

  9. Progress • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  10. Job Similarity • Job attributes recorded in traces that characterize a job • Group, user, queue, executable name, #CPUs, requested run time, arrival time of day (executable arguments*, node specification*) • These attributes are natural predictors for run times; here they are used for queue wait times as well
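
As a rough sketch of how such a job instance might be encoded (the field names are illustrative assumptions, not identifiers from the paper or the PDM toolkit), mixing nominal and numeric attributes:

```python
from dataclasses import dataclass

# Illustrative encoding of the job attributes listed above; field names
# are assumptions, not taken from the paper or the PDM toolkit.
@dataclass
class Job:
    group: str          # nominal: VO, e.g. "Atlas"
    user: str           # nominal
    queue: str          # nominal
    executable: str     # nominal: executable name
    cpus: int           # numeric: #CPUs requested
    req_runtime: float  # numeric: requested run time, in seconds
    arrival_tod: float  # numeric: arrival time of day, seconds since midnight
```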

  11. Resource State Similarity • Definition: the pool of running and queued jobs on the resource at the time a prediction is made • Assumption: “similar” jobs under “similar” resource states would most likely have similar wait times • Key problems: • How to define attributes to represent a resource state? • How to incorporate local policies into the attributes for more fine-grained similarity comparison?

  12. Resource State Attributes • VecRunJobs: categorized number of running jobs • VecQueueJobs: categorized number of queued jobs • VecAlreadyRun: categorized sum of elapsed run time multiplied by #CPUs of running jobs • VecRunRemain: categorized sum of remaining run time multiplied by #CPUs of running jobs • AlreadyQueue: categorized sum of time already queued multiplied by #CPUs of queued jobs • QueueDemand: categorized sum of requested run time multiplied by #CPUs of queued jobs
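
A minimal sketch of how these aggregates could be computed from the job pool at prediction time, assuming each job record additionally carries start_time and submit_time stamps; for readability it computes plain totals, whereas the Vec* attributes keep one such total per policy category (next slide):

```python
def resource_state(running, queued, now):
    """Aggregate resource-state attributes from the current job pool.

    `running` and `queued` are lists of job records; `now` is the time
    at which the prediction is made. Plain totals only; the paper's
    categorized (Vec*) variants split each total per policy category.
    """
    return {
        "RunJobs":      len(running),
        "QueueJobs":    len(queued),
        "AlreadyRun":   sum((now - j.start_time) * j.cpus for j in running),
        "RunRemain":    sum(max(j.req_runtime - (now - j.start_time), 0.0) * j.cpus
                            for j in running),
        "AlreadyQueue": sum((now - j.submit_time) * j.cpus for j in queued),
        "QueueDemand":  sum(j.req_runtime * j.cpus for j in queued),
    }
```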

  13. Policy Attributes • Credential attributes usually used in scheduling policy expressions • Group (VO), user, and queue • Schedulers: Maui (NIKHEF), Catalina (SDSC) • Embedding the policy attributes into resource state attributes via categorization

  14. Resource State Example • Policy attribute set = <group>, resource attributes = VecRunJobs and VecQueueJobs

               State 1              State 2
  RunJobs      Atlas 30, Lhcb 60    cms 30, Alice 60
  QueueJobs    Atlas 45, Lhcb 50    Atlas 45, Lhcb 50
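
The point of the example: the two states carry the same aggregate load, yet differ once jobs are counted per group. A small sketch of the two states as categorized vectors:

```python
# The two example states above, encoded as per-group (categorized) vectors.
state1 = {"VecRunJobs":   {"Atlas": 30, "Lhcb": 60},
          "VecQueueJobs": {"Atlas": 45, "Lhcb": 50}}
state2 = {"VecRunJobs":   {"cms": 30, "Alice": 60},
          "VecQueueJobs": {"Atlas": 45, "Lhcb": 50}}

# Uncategorized totals are identical (90 running jobs in each state)...
assert sum(state1["VecRunJobs"].values()) == sum(state2["VecRunJobs"].values())
# ...but the per-group vectors differ, which matters when the scheduler
# enforces per-VO policies (e.g. a cap on concurrently running Atlas jobs).
```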

  15. Progress • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  16. Instance Based Learning • Nonparametric learning technique • Store training data in a historical database, and make predictions by applying an induction model to data entries “near” the query • Two components define the predictor: the distance function and the induction model
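
A minimal sketch of this prediction loop, with `distance` and `induce` as placeholders for the components defined on the following slides:

```python
def ibl_predict(history, query, k, distance, induce):
    """Predict the target value for `query` from its k nearest neighbors.

    `history` holds (instance, observed_value) pairs, e.g. completed
    jobs together with their measured run or wait times.
    """
    # Rank all stored instances by their distance to the query.
    scored = sorted(((distance(inst, query), inst, val)
                     for inst, val in history), key=lambda t: t[0])
    # Apply the induction model (weighted average or LLWR, slide 19)
    # to the k nearest entries.
    return induce(scored[:k], query)
```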

  17. The Distance Function • An extended Heterogeneous Euclidean-Overlap Metric (HEOM)
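
The formula itself did not survive the transcript. For reference, the standard HEOM is given below; the per-attribute weights w_a are an assumed rendering of the paper's extension (they correspond to the W* genes tuned by the GA on slide 22):

```latex
% Standard HEOM over m attributes; the weights w_a are assumed to be
% the paper's extension (the W* genes optimized by the GA).
d(x, y) = \sqrt{\sum_{a=1}^{m} w_a \, d_a(x_a, y_a)^2},
\qquad
d_a(x_a, y_a) =
\begin{cases}
  1 & \text{if } x_a \text{ or } y_a \text{ is missing,}\\[2pt]
  0 \text{ if } x_a = y_a,\ \text{else } 1 & \text{if } a \text{ is nominal (overlap),}\\[2pt]
  \dfrac{|x_a - y_a|}{\max_a - \min_a} & \text{if } a \text{ is numeric.}
\end{cases}
```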

  18. The Distance Function (cont’d)
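
A direct sketch of the metric above, with instances represented as attribute dicts (the representation is an assumption for illustration):

```python
def heom(x, y, weights, ranges):
    """Weighted HEOM distance between two instances (attribute dicts).

    `weights` maps attribute name -> weight (GA-tuned in the paper);
    `ranges` maps each numeric attribute name -> (min, max) observed
    in the history. Attributes not in `ranges` are treated as nominal.
    """
    total = 0.0
    for a, w in weights.items():
        xa, ya = x.get(a), y.get(a)
        if xa is None or ya is None:
            d = 1.0                          # missing-value convention
        elif a in ranges:
            lo, hi = ranges[a]
            d = abs(xa - ya) / (hi - lo) if hi > lo else 0.0
        else:
            d = 0.0 if xa == ya else 1.0     # overlap metric for nominals
        total += w * d * d
    return total ** 0.5
```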

  19. The Induction Models • Weighted Average (WA) • Linear Locally Weighted Regression (LLWR)
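
Hedged sketches of both models, assuming a Gaussian kernel over neighbor distances (the actual weighting falls under the "bandwidth type" and "bandwidth" parameters the GA tunes on slide 22):

```python
import numpy as np

def kernel(dists, h):
    # Gaussian weighting of neighbors by distance; h is the bandwidth.
    return np.exp(-(np.asarray(dists) / h) ** 2)

def weighted_average(dists, values, h):
    # WA: kernel-weighted mean of the neighbors' observed values.
    w = kernel(dists, h)
    return float(w @ np.asarray(values) / w.sum())

def llwr(X, values, dists, x_query, h):
    """LLWR: fit a linear model to the neighbors' numeric features X by
    weighted least squares, then evaluate it at the query point."""
    w = np.sqrt(kernel(dists, h))            # sqrt so residuals get weight w^2
    A = np.column_stack([np.ones(len(X)), X]) * w[:, None]
    b = np.asarray(values, dtype=float) * w
    beta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(beta @ np.concatenate([[1.0], np.asarray(x_query, dtype=float)]))
```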

  20. Progress • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  21. Parameter Optimization by GA • Genetic Algorithm implementation using standard operators such as selection, mutation, and crossover • Real encoding vs. binary encoding • Chromosomes are structured to match different objectives (i.e. run time or wait time) • Objective function: the average prediction error, to be minimized

  22. Chromosomes • Run Time • (WAg, WAu, WAe, WAn, WAr, WAtod), (#CPUs), (method), (neighbor size), (history size), (bandwidth type), (bandwidth) • Wait Time • (WPg, WPu, WPq), (WAg, WAu, WAe, WAn, WAr, WAtod), (WSrj, WSqj, WSalrr, WSalrq, WSrrem, WSqdem), (#CPUs, queue demand credential, queue demand total), (method), (neighbor size), (history size), (bandwidth type), (bandwidth)
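
A minimal real-encoded GA along these lines; the operator choices, rates, and population sizes here are illustrative assumptions, not the paper's settings. The fitness function would evaluate the average prediction error on a training trace for a candidate parameter vector:

```python
import random

def ga_minimize(fitness, bounds, pop_size=40, generations=100,
                p_cross=0.8, p_mut=0.1):
    """Minimize `fitness` over real-encoded chromosomes.

    `bounds` gives (lo, hi) per gene, matching a chromosome layout such
    as the run-time one above: attribute weights, #CPUs weight, method,
    neighbor size, history size, bandwidth type, bandwidth.
    """
    def random_individual():
        return [random.uniform(lo, hi) for lo, hi in bounds]

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness)
        nxt = ranked[:2]                                  # elitism
        while len(nxt) < pop_size:
            p1, p2 = random.sample(ranked[:pop_size // 2], 2)  # truncation selection
            if random.random() < p_cross:                 # uniform crossover
                child = [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]
            else:
                child = p1[:]
            for i, (lo, hi) in enumerate(bounds):         # gene-reset mutation
                if random.random() < p_mut:
                    child[i] = random.uniform(lo, hi)
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)
```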

  23. Progress • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  24. Experimental Setup • Real traces with diverse characteristics • NIKHEF cluster: ~300 CPUs, up to 3 GB memory per node, Ethernet connections. Maui scheduler with backfilling, policies based on groups (VOs) and users. • SDSC Blue Horizon: IBM SP, 1152 CPUs. Catalina scheduler with backfilling, policies based on queues. • Evaluation is done on multiple Intel Xeon machines with 4 CPUs and 3 GB shared memory

  25. Methodology • Prediction accuracy • Average Absolute Error (AAE) • Average Relative Error = AAE / Average Real Value • Relative Error = (Est - Real) / (Est + Real) • Prediction time • Average execution time per prediction in milliseconds • Workload traces are divided into training sets and test sets • On NIKHEF, we test on one month of trace data at a time over consecutive months, with parameters trained on the preceding two months. • On SDSC, we test on three months of data at a time, with parameters trained on the preceding six months.
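
The three accuracy metrics above, written out as code for concreteness (est and real are the predicted and observed times per job):

```python
import numpy as np

def prediction_errors(est, real):
    """Accuracy metrics from the methodology above."""
    est, real = np.asarray(est, dtype=float), np.asarray(real, dtype=float)
    aae = float(np.mean(np.abs(est - real)))   # Average Absolute Error
    avg_rel = aae / float(np.mean(real))       # AAE / average real value
    rel = (est - real) / (est + real)          # per-job relative error
    return aae, avg_rel, rel
```

Note that this form of relative error is bounded in (-1, 1) for positive times, which keeps a few badly mispredicted jobs from dominating the average.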

  26. Absolute Prediction Error

  27. Relative Prediction Error (Run Time)

  28. Relative Prediction Error (Wait Time)

  29. Error Analysis

  30. Error Analysis

  31. Optimized Parameters

  32. Prediction Time

  33. Progress • Problem Statement • Similarity Definition • The IBL-based Prediction Algorithm • Parameter Optimization via GA • Experimental Results • Conclusions and Future Work

  34. Conclusions • A response time prediction technique based on Instance Based Learning • A novel resource state similarity that incorporates policies • Automatic parameter selection • “Efficient” and “more general” • Enables queries like: “I’m VO 1, how many jobs can you tolerate before reaching a maximum response time of X?”

  35. Future Work • Accuracy (global vs local tuning) • Performance (search structure) • PDM: A Java-based Toolkit for mining performance data in the Grid

  36. References • Mining Performance Data for Metascheduling Decision Support in the Grid, Technical Report 2005-07, LIACS, Leiden University, 2005. • http://www.liacs.nl/~hli/pub.htm • PDM Toolkit • http://www.liacs.nl/~hli/pdm
