The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems. Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema. Parallel and Distributed Systems Group, TU Delft. ACM/IEEE Int’l. Symposium on High Performance Distributed Computing.

The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems

Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema

Parallel and Distributed Systems Group, TU Delft

ACM/IEEE Int’l. Symposium on High Performance Distributed Computing


The VL-e project

Natural gas price → $$ for grid computing

  • A grid project in the Netherlands (2004-)

  • Funded from natural gas money: VL-e receives 45 MEuro out of an 800 MEuro total research package

  • Overall aim:

    … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).

  • Goals:

    • create prototypes of application-specific e-science environments

    • design and develop re-usable ICT/grid components

    • validate with real-life applications in testbeds


The VL-e project: application areas

[Figure: VL-e layered diagram. Labels: Philips, IBM, Unilever; application areas Data Intensive Science, Medical Diagnosis & Imaging, Bio-Diversity, Bio-Informatics, Food Informatics, Dutch Telescience; Virtual Laboratory (VL), Application Oriented Services; Grid Services, Harness multi-domain distributed resources; Management of comm. & computing. Two further builds of the slide add the label Bags-of-Tasks.]


The Challenge

  • Complete scientific work better, …

    • User-oriented performance metrics (time is a critical performance component)

    • Bags-of-tasks for ease-of-use

  • … in real systems

    • Workloads (now that real traces are available)

    • Information unavailability

  • What to do?

    • Hint: the next 10% improvement won’t cut it!


The Challenge (cont’d.)

  • System model: What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks?

  • Input model: What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?

  • What is the best setup for such system/input?

    • How to find the best?

    • If a best is found, can there be another?


The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems

  • Introduction and Motivation

  • Context: System Model

  • Workload Model

  • Design Space Exploration

  • Conclusion


Context: System Model [1/4] Overview

  • System Model

    • Clusters: execute jobs

    • Resource managers: coordinate job execution

    • Resource management architectures: route jobs among resource managers

    • Task selection policies: create the eligible set

    • Task scheduling policies: schedule the eligible set


Context: System Model [2/4] Resource Management Architectures: route jobs among resource managers

  • Separated Clusters (sep-c)

  • Centralized (csp)

  • Decentralized (fcondor)


Context: System Model [3/4] Task Selection Policies: create the eligible set

  • Age-based:

    • S-T: Select Tasks in the order of their arrival.

    • S-BoT: Select BoTs in the order of their arrival.

  • User priority based:

    • S-U-Prio: Select the tasks of the User with the highest Priority.

  • Based on fairness in resource consumption:

    • S-U-T: Select the Tasks of the User with the lowest resource consumption.

    • S-U-BoT: Select the BoTs of the User with the lowest resource consumption.

    • S-U-GRR: Select the User in Round-Robin order; all tasks of this user are selected.

    • S-U-RR: Select the User in Round-Robin order; one task of this user is selected.
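A minimal sketch of three of the selection policies above, assuming the eligible set is returned as an ordered list. The `Task` fields, the function names, and the `consumed` mapping (resource consumption per user so far) are hypothetical names chosen for illustration; they are not taken from the presentation or from the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    bot_id: int     # bag-of-tasks this task belongs to
    user: str
    arrival: float  # arrival time of the task


def select_s_t(queue):
    """S-T: order tasks by their own arrival time."""
    return sorted(queue, key=lambda t: (t.arrival, t.task_id))


def select_s_bot(queue):
    """S-BoT: order whole BoTs by the arrival of their earliest task."""
    bot_arrival = {}
    for t in queue:
        bot_arrival[t.bot_id] = min(bot_arrival.get(t.bot_id, t.arrival), t.arrival)
    return sorted(
        queue,
        key=lambda t: (bot_arrival[t.bot_id], t.bot_id, t.arrival, t.task_id),
    )


def select_s_u_t(queue, consumed):
    """S-U-T: tasks of the user with the lowest resource consumption so far."""
    users = {t.user for t in queue}
    if not users:
        return []
    # Ties between users are broken by name (an arbitrary assumption).
    poorest = min(users, key=lambda u: (consumed.get(u, 0.0), u))
    return select_s_t([t for t in queue if t.user == poorest])


queue = [
    Task(1, 10, "alice", 5.0),
    Task(2, 11, "bob", 1.0),
    Task(3, 10, "alice", 0.0),
    Task(4, 11, "bob", 6.0),
]
by_task = [t.task_id for t in select_s_t(queue)]
by_bot = [t.task_id for t in select_s_bot(queue)]
by_user = [t.task_id for t in select_s_u_t(queue, {"alice": 100.0, "bob": 20.0})]
```

With this queue, S-T interleaves the two bags, S-BoT keeps each bag together, and S-U-T returns only the tasks of the user who has consumed the least.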


Context: System Model [4/4] Task Scheduling Policies: schedule the eligible set

[Figure: grid of task scheduling policies by task information (K, H, U) and resource information (K, H, U): ECT, FPLT, ECT-P, FPF, DFPLT, MQD, RR, WQR, STFR.]

  • Information availability:

    • Known (K)

    • Unknown (U)

    • Historical records (H)

  • Sample policies:

    • Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))

    • Fastest Processor First (FPF)

    • (Dynamic) Fastest Processor Largest Task ((D)FPLT)

    • Shortest Task First w/ Replication (STFR)

    • Work Queue w/ Replication (WQR)
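As an illustration of how information availability shapes a policy, the sketch below contrasts FPF, which needs only processor speeds, with ECT, which also needs the task's amount of work. The processor dictionaries with `name`, `speed`, and `free_at` fields are assumptions made for this example, and ECT-P could be approximated by passing a predicted value as `task_work`.

```python
def fpf(processors):
    """Fastest Processor First: pick the processor with the highest speed.

    Uses resource information only; nothing about the task is needed.
    """
    return max(processors, key=lambda p: p["speed"])


def ect(task_work, processors, now=0.0):
    """Earliest Completion Time: pick the processor where the task ends soonest.

    task_work is the task's runtime on a speed-1.0 processor (assumed known).
    """
    def completion(p):
        start = max(now, p["free_at"])  # wait until the processor is free
        return start + task_work / p["speed"]

    return min(processors, key=completion)


processors = [
    {"name": "a", "speed": 1.0, "free_at": 0.0},
    {"name": "b", "speed": 1.75, "free_at": 50.0},
]
fastest = fpf(processors)["name"]
short_task_choice = ect(35.0, processors)["name"]   # a: 35.0, b: 50.0 + 20.0
long_task_choice = ect(350.0, processors)["name"]   # a: 350.0, b: 50.0 + 200.0
```

The short task goes to the slower but idle processor, while the long task is worth the wait for the faster one; FPF cannot make that distinction because it ignores task information.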



Workload Modeling 101: What Matters

  • Job arrival process & job service time:

    • Self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94]

    • Job grouping: bags-of-tasks are the dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07]

  • Job size: almost always 1 CPU [IosupDELW Grid’06]

[Figure: No. Packets/Time Unit vs. Time Units, plotted for TimeUnit=100s and TimeUnit=0.01s; annotation: Longer queues.]


A Bag-of-Tasks Workload Model

  • Model:

    • Users, Bags-of-Tasks, Tasks

    • Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads

    • More details (e.g., parameter values): see article

  • Validation data: the Grid Workloads Archive

    • 7 long-term grid traces

    • >5 million tasks

    • >2500 users

    • >40k CPUs

    • Domains: HEP, graphics, AI, math, biomed, climate, finance, aero…

http://gwa.ewi.tudelft.nl/
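A rough sketch of a generator following the Users / Bags-of-Tasks / Tasks structure above. The choice of Weibull and Pareto distributions and every parameter value here are placeholders picked for illustration; the actual distributions and fitted parameters are given in the article, not on the slide.

```python
import random


def generate_workload(n_bots, n_users=5, seed=0):
    """Generate a synthetic list of single-CPU tasks grouped into BoTs."""
    rng = random.Random(seed)
    now = 0.0
    tasks = []
    for bot_id in range(n_bots):
        # Heavy-tailed BoT inter-arrival time (Weibull, shape < 1; assumed).
        now += rng.weibullvariate(100.0, 0.5)
        user = f"user{rng.randrange(n_users)}"
        # Number of tasks in the bag: at least 1, occasionally very large.
        size = int(rng.paretovariate(1.5))
        for _ in range(size):
            tasks.append({
                "bot": bot_id,
                "user": user,
                "arrival": now,  # all tasks of a BoT arrive together
                "runtime": rng.weibullvariate(600.0, 0.7),
                "cpus": 1,       # job size is almost always 1 CPU
            })
    return tasks


workload = generate_workload(20, seed=1)
```

Fixing the seed keeps a generated workload reproducible, which matters when the same input is replayed against many system configurations.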



Design Space Exploration [1/5] Overview

  • Design space exploration: time to understand how our solutions fit into the complete system.

  • Study the impact of:

    • The Task Scheduling Policy (s policies)

    • The Workload Characteristics (P characteristics)

    • The Dynamic System Information (I levels)

    • The Task Selection Policy (S policies)

    • The Resource Management Architecture (A policies)

s x 7P x I x S x A x (environment) → >2M design points
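The product above grows quickly because every dimension multiplies the number of simulations. A small sketch of enumerating such a space follows; the values listed per dimension are a tiny illustrative subset chosen for this example, not the sets used in the study.

```python
from itertools import product
from math import prod

# Hypothetical, heavily reduced dimensions of the design space.
design_space = {
    "scheduling_policy": ["ECT", "FPF", "STFR", "WQR"],
    "workload": ["real", "load-20%", "load-95%"],
    "information": ["K", "H", "U"],
    "selection_policy": ["S-T", "S-BoT", "S-U-Prio"],
    "architecture": ["sep-c", "csp", "fcondor"],
}


def design_points(space):
    """Yield one dict per combination of values, one key per dimension."""
    names = list(space)
    for combo in product(*(space[n] for n in names)):
        yield dict(zip(names, combo))


n_points = prod(len(values) for values in design_space.values())
first_point = next(design_points(design_space))
```

Even this reduced space has 4 x 3 x 3 x 3 x 3 = 324 points, each of which would need its own simulation run.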


Design Space Exploration [2/5] Experimental Setup

  • Simulator:

    • DGSim [IosupETFL SC’07, IosupSE EuroPar’08]

  • System:

    • DAS + Grid’5000 [Cappello & Bal CCGrid’07]

    • >3,000 CPUs; relative performance between 1 and 1.75

  • Metrics:

    • Makespan

    • Normalized Schedule Length ~ speed-up

  • Workloads:

    • Real: DAS + Grid’5000

    • Realistic: system load 20-95% (from workload model)
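A small sketch of the two metrics for a single bag-of-tasks. The makespan is taken here as the time from the bag's arrival to the completion of its last task. The normalization used for NSL below (makespan divided by the sum of the task runtimes) is an assumption made for illustration; the exact definition is in the article.

```python
def makespan(bot_arrival, task_finish_times):
    """Time from the BoT's arrival until its last task finishes."""
    return max(task_finish_times) - bot_arrival


def normalized_schedule_length(bot_arrival, task_finish_times, task_runtimes):
    """Makespan divided by the total task runtime (assumed normalization)."""
    return makespan(bot_arrival, task_finish_times) / sum(task_runtimes)


# One BoT with three tasks, arriving at t = 10.0.
ms = makespan(10.0, [40.0, 25.0, 70.0])
nsl = normalized_schedule_length(10.0, [40.0, 25.0, 70.0], [30.0, 15.0, 15.0])
```

Under this normalization, a value of 1.0 means the bag took as long as running its tasks back to back on one processor; lower values indicate that parallel execution paid off.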


Design Space Exploration [3/5] Selected Results A: Design Guidelines for Scheduling Policies

  • Influence of the information type:

    • (K,K): best balance between MS and NSL

    • (*,U),(U,*): surprisingly good (FPF) to surprisingly poor (WQR4x)

    • (*,H),(H,*): poor. Simple runtime predictors don’t work (see article)

  • Where to invest time?

    • K → H, K → U: adapt for information type with lowest variation

[Figure: plot with WQR4x and FPF highlighted.]


Design Space Exploration [4/5] Selected Results B: Task Selection Only for Busy Systems

  • Not much difference until system load over 50%.

    • For DAS + Grid’5000: no need to change the task selection policy.

[Figure: S-BoT and S-T, annotated "Same performance".]


Design Space Exploration [5/5] Selected Results C: Resource Management Architecture

  • Centralized, separated, or distributed?

    • Centralized is best [Note: job overhead not considered.]

    • Distributed: good for system load below 50%; over 50% it does not finish all tasks.



Conclusion

[Figure: grid of task scheduling policies by task information (K, H, U) and resource information (K, H, U): ECT, FPLT, ECT-P, FPF, DFPLT, MQD, RR, WQR, STFR.]

• System Model = Resource Management Architecture + Task Selection Policy + Task Scheduling Policy

• Information availability framework

• BoT workload model

• Design space exploration: the performance of bags-of-tasks


Future Work

  • Better predictors

  • (H,H) task scheduling policies


Thank you! Questions? Remarks? Observations?

  • Contact: [email protected] [google “Iosup“]

  • Web sites:

    • http://www.vl-e.nl : VL-e project

    • http://www.pds.ewi.tudelft.nl : PDS group articles & software

Help build the Grid Workloads Archive: http://gwa.ewi.tudelft.nl

