- By
**marty** - Follow User

- 83 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Predictions for Parallel Applications and Systems' - marty

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Predictions for Parallel Applications and Systems

GARL Research

Sathish Vadhiyar

Grid Applications Research Laboratory (GARL)

GARL Research

- Grid Applications
- Climate Modeling
- Gene Mutations
- Performance Modeling
- Rescheduling
- Others
- Prediction of queue wait times

GARL Research

- Grid Applications
- Climate Modeling
- Gene Mutations
- Performance Modeling
- Rescheduling
- Others
- Prediction of queue wait times

Rescheduling

- The base is a parallel checkpointing library called SRS
- Checkpointing? – storing application’s state so as to continue from the previous state after interruption
- Interruption either by a scheduler or system faults
- SRS allows processor reconfiguration

Optimal Checkpoint Interval

- Storing checkpoints periodically will help in fault-tolerance
- How periodic?
- What is the optimal checkpoint interval?
- More checkpointing will lead to increased checkpoint overhead
- Less checkpointing frequency will lead to increase times for recovery from failures

Dynamic Determination of Optimal Checkpointing Intervals

- Start the application on a set of resources
- Predict the next failure on the set of resources
- Checkpoint “just before” the next failure
- The prediction has to be really accurate
- But no prediction can be 100% accurate

Probability Distribution of Failures

- Use a probability distribution of failures on the resources
- Need to know: The next time of failure with x% certainty
- But more certainty is also not good

Markov Chains

For parallel M-M checkpointing

In SRS, there is almost no system down phase

For sequential applications

In SRS, transition from state 0 can lead to many states

GARL Research

- Grid Applications
- Climate Modeling
- Gene Mutations
- Performance Modeling
- Rescheduling
- Others
- Prediction of queue wait times

Motivation for Queue Wait Times

- A Grid consisting of number of batch queues
- A meta system that will:
- predict the wait times and execution times of jobs
- Decide which queue is “most suitable” for the job

What is a good predictor?

- There are number of prediction strategies
- Evaluating a predictor’s goodness:
- Mean Absolute Percentage Error (MAPE)
- Upper bound for actual/predicted
- Average of (actual-predicted) [absolute error]
- Absolute error/actual wait time [relative error]
- Average error/average queue wait time
- Coefficient of correlation
- Each of these metrics has flaws

Illustration

Method 1

Method 2

Metric 3 value of Method 1 < Metric 3 value of Method 2

i.e. Method 1 is better

Our goals

- To define useful metrics that can clearly say whether a method is “good” or “bad”
- Goodness of predictors
- In terms of absolute wait times
- In terms of execution times
- In terms of resource demand

What we want to do…

- Define metrics that can evaluate a method in the “absolute” sense, not “comparative” sense
- Stare at a single graph and ask “Is this graph good” as much as possible
- In some cases, it may just not be possible
- Use comparisons
- Evaluate the existing methods on these sets of metrics
- Come up with a method that performs the best in terms of all of the defined metrics

- Grid Applications
- Climate Modeling
- Gene Mutations
- Performance Modeling
- Rescheduling
- Others
- Prediction of queue wait times

Motivation

- Certain large computational phases of climate modeling (CCSM) are done only by some processors
- Load balancing – offload work from these processors to other processors
- Increased processor utilization
- Decreased execution time
- How much offloading?
- Need to predict workload based on previous computations

What should happen…

Proc 0

Proc 1

Proc 2

Proc 3

Proc 4

Phase 1

Phase 2

For this, we need to know the workload in phase 1

We predict the workload based on previous time steps

GARLians

- Yadnyesh Joshi (M.Sc)
- Karthikeyan Raman (M.Tech, jointly with Prof. Govindarajan)
- H.A. Sanjay (Ph.D, jointly with Prof. Ravi Nanjundiah, CAOS)
- Sivagama Sundari (Ph.D)
- Ashish Srivatsava (Project Assistant)
- Alumni
- 1 student intern from INSA, Lyon, France
- Summer interns
- Project assistants
- 2 M.Scs

Download Presentation

Connecting to Server..