This work focuses on predicting batch-queue delay for individual TeraGrid jobs, a problem complicated by poorly understood workload distributions, partially hidden and dynamically changing scheduling policies, and hard-to-predict job execution times. The proposed predictive methodology (BMBP) combines a new quantile estimator, a changepoint detector, and a job-clustering technique to produce more accurate bounds. Inverted "deadline scheduling" predictions and an overview of the deployed system are also discussed.
Predicting Queue Waiting Time For Individual TeraGrid Jobs
Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli, Ryan Garver
Computer Science Department, University of California, Santa Barbara
Problem: Predicting Delay in Batch Queues
• Time in the queue is experienced as application delay
• Sounds like an easy problem, but:
  • The distribution of load generated by users is a matter of some debate
  • Scheduling policy is partially hidden
  • Sites need to change their policies dynamically and without warning
  • Job execution times are difficult to predict
• Much research in this area over the past 20 years, but few solutions
• Current commercial systems provide high-variance estimates:
  • On-line simulation based on the maximum requested time
  • "Expected value" predictions
  • Most sites simply disable these features
For Scheduling: It's All About the Big Q
• Predictions of the form:
  • "What is the maximum time my job will wait, with X% certainty?"
  • "What is the minimum time my job will wait, with X% certainty?"
• Quantifying the certainty requires two estimates:
  • Estimate the (1-X) quantile of the distribution of availability => Qx
  • Estimate the upper or lower X% confidence bound on the statistic Qx => Q(x,b)
• If the estimates are unbiased and the distribution is stationary, future availability durations will be larger than Q(x,b) X% of the time, guaranteed
(A sketch of the underlying order-statistic bound follows below.)
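The quantile-plus-confidence-bound pair can be made concrete with the classic binomial order-statistic argument: among n independent historical waits, the number falling below the true q-quantile is Binomial(n, q), so the probability that the j-th smallest observation lies at or above that quantile is the Binomial CDF evaluated at j-1. Below is a minimal Python sketch; the name upper_quantile_bound and its defaults are illustrative rather than the authors' code, and the carefully engineered combinatorics of the real system are approximated here by math.comb's exact integers and ordinary floating point.

from math import comb

def upper_quantile_bound(samples, q=0.95, conf=0.95):
    """Upper conf-level confidence bound on the q-quantile of the
    distribution generating `samples` (hypothetical helper)."""
    xs = sorted(samples)
    n = len(xs)
    cdf = 0.0
    for j in range(1, n + 1):
        # P(exactly j-1 of the n observations fall below the true q-quantile)
        cdf += comb(n, j - 1) * q ** (j - 1) * (1 - q) ** (n - (j - 1))
        if cdf >= conf:
            return xs[j - 1]   # the j-th order statistic (1-indexed)
    return None  # history too short for this q/conf combination

The minimum-wait question is symmetric: a lower confidence bound comes from choosing the largest order statistic that still lies below the quantile with the required probability.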
BMBP: A New Predictive Methodology
• A new quantile estimator based on the Binomial distribution
  • Requires a carefully engineered numerical system to deal with large-scale combinatorics
• A new changepoint detector
  • The Binomial method is difficult to apply in a time-series context
  • Need a system for determining:
    • Stationary regions in the data
    • The minimum statistically meaningful history in each region
• A new clustering methodology
  • More accurate estimates are possible when predictions are made from jobs with similar characteristics
  • Takes dynamic policy changes into account more effectively
(A sketch of the changepoint idea follows below.)
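The slides leave the detector unspecified, so the following is only a plausible reading of a run-based test, reusing upper_quantile_bound from the previous sketch: under stationarity each new wait exceeds the true q-quantile independently with probability 1-q, so a run of k consecutive exceedances has probability (1-q)^k; when that falls below a threshold, assume the region changed and keep only the history since the run began.

def trim_history(history, q=0.95, alpha=0.01):
    """Return the suffix of `history` believed to belong to the current
    stationary region (an illustrative assumption, not the BMBP detector)."""
    run_start = None
    for i in range(1, len(history)):
        est = upper_quantile_bound(history[:i], q=q, conf=0.5)
        if est is not None and history[i] > est:
            if run_start is None:
                run_start = i
            # probability of this run of exceedances under stationarity
            if (1 - q) ** (i - run_start + 1) < alpha:
                return history[run_start:]
        else:
            run_start = None
    return history  # no changepoint detected; all history is meaningful

A symmetric check on runs below a lower-quantile estimate would catch downward shifts in the same way.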
Predicting Things Upside Down
• Deadline scheduling: "My job needs to start in the next X seconds for the results to be meaningful"
  • Amitava Mujumdar, Tharaka Devaditha, Adam Birnbaum (SDSC)
  • Example: a 4-minute image reconstruction that must complete within the next 8 minutes
• Given a:
  • Machine
  • Queue
  • Processor count
  • Run time
  • Deadline
• What is the probability that the job will meet the deadline?
• http://nws.cs.ucsb.edu/batchq/invbqueue.php
(A sketch of this inverted calculation follows below.)
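The inverted service fixes the tolerable queue delay (deadline minus run time) and returns a probability rather than a bound. A minimal empirical sketch, assuming wait-time history for the matching machine/queue/processor-count combination is at hand; the live invbqueue.php service is presumably more careful, and the sample waits below are made up:

def prob_meets_deadline(wait_history, runtime, deadline):
    slack = deadline - runtime       # queue delay the job can tolerate
    if slack < 0:
        return 0.0                   # cannot finish even if it starts now
    # empirical fraction of past jobs that waited no longer than the slack
    return sum(1 for w in wait_history if w <= slack) / len(wait_history)

# The SDSC example above: a 240 s reconstruction with a 480 s deadline
# tolerates at most 240 s in the queue (wait times are hypothetical).
p = prob_meets_deadline([30, 90, 200, 600, 45], runtime=240, deadline=480)  # 0.8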
See It in Action
• http://nws.cs.ucsb.edu/batchq
How Does It Work?
• NWS sensors at each site read the batch queue scheduler logs
• Records are sanitized down to:
  • Machine name
  • Queue name
  • Node/core count
  • Max run time
  • Submit time
  • Start time
• Sensors periodically send updated log records to UCSB
• At UCSB:
  • NWS log data is extracted
  • Forward and inverted predictions are made asynchronously for all machine/queue/cluster combinations
  • Data is served through multiple interfaces: web service, HTML, BQP
(A sketch of the sanitized record follows below.)
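The six sanitized fields map naturally onto a small record type. A sketch with assumed field names (only the field list itself comes from the slide):

from dataclasses import dataclass

@dataclass
class QueueLogRecord:
    machine: str       # e.g., "ucteragrid"
    queue: str         # e.g., "dque"
    nodes: int         # node/core count requested
    max_runtime: int   # requested max run time, in seconds
    submit_time: int   # UNIX timestamp of submission
    start_time: int    # UNIX timestamp of job start

    @property
    def wait(self) -> int:
        # the quantity the predictions are actually about: time in the queue
        return self.start_time - self.submit_time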
What Are the Problems?
• Batch queue scheduler logs are designed to support accounting
  • Each scheduler uses a different format and logs different information
  • Accuracy is not considered important
• Not all scheduler-relevant events are logged
  • e.g., node decommissioning/addition
• Static metadata is not provided
  • Queue constraints
  • Whether cores or nodes are scheduled
  • Number of processing elements (nodes/cores)
• Better information is needed going forward, to support:
  • Evaluating scheduling policy changes
  • Urgent computing
  • Co-allocation/advance reservations
Static Metadata Proposal
• Per machine:
  • A short one-word 'tag' identifying the machine (e.g., "ncsateragrid")
  • List of login hostnames that users log in to
  • Hostname of a machine with a static hostname-to-IP mapping (network-accessible services run here)
  • Machine name (e.g., "NCSA ia64 TeraGrid")
  • Number of nodes
  • Number of processing elements per node
• Per queue:
  • Unit of computational elements ("core", "processor", "node", ...)
  • Default queue? (boolean)
  • Job restrictions placed on a 'normal user' for this queue:
    • Max number of computational elements available for request (int)
    • Max walltime request (int)
ANL Example

<machine>
  <tag>ucteragrid</tag>
  <sensorhost>tg-grid.uc.teragrid.org</sensorhost>
  <sensorport>8062</sensorport>
  <totalcores>314</totalcores>
  <loginhosts>
    <host>tg-login.uc.teragrid.org</host>
    <host>tg-login1.uc.teragrid.org</host>
    <host>tg-login2.uc.teragrid.org</host>
  </loginhosts>
  <label>UofC/ANL TeraGrid Cluster</label>
  <defqueue>dque</defqueue>
  <queues>
    <queue>
      <name>dque</name>
      <procunit>cores</procunit>
      <proclimit>2048</proclimit>
      <walllimit>86400</walllimit>
    </queue>
    <queue>
      <name>high</name>
      <procunit>cores</procunit>
      <proclimit>512</proclimit>
      <walllimit>43200</walllimit>
    </queue>
    <queue>
      <name>interactive</name>
      <procunit>nodes</procunit>
      <proclimit>1</proclimit>
      <walllimit>3600</walllimit>
    </queue>
  </queues>
</machine>
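A document in the proposed format parses with nothing more than the Python standard library; this sketch assumes only the schema shown in the example above:

import xml.etree.ElementTree as ET

def parse_machine(xml_text):
    """Turn one <machine> document into a plain dict (illustrative helper)."""
    m = ET.fromstring(xml_text)
    return {
        "tag": m.findtext("tag"),
        "label": m.findtext("label"),
        "totalcores": int(m.findtext("totalcores")),
        "loginhosts": [h.text for h in m.findall("loginhosts/host")],
        "defqueue": m.findtext("defqueue"),
        "queues": {
            q.findtext("name"): {
                "procunit": q.findtext("procunit"),
                "proclimit": int(q.findtext("proclimit")),
                "walllimit": int(q.findtext("walllimit")),
            }
            for q in m.findall("queues/queue")
        },
    }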