Resource and Test Management in Grids

Dick Epema, Catalin Dumitrescu, Hashim Mohamed, Alexandru Iosup, Ozan Sonmez

Parallel and Distributed Systems Group, Delft University of Technology

The Grid Initiative Summer School, Bucharest, RO, 2006.


Outline

  • Koala: Processor and Data Co-Allocation in Grids

    • The Co-Allocation Problem in Grids

    • The Design of Koala

    • Koala and the DAS Community

    • The Future of Koala

  • GrenchMark: Analyzing, Testing, and Comparing Grids

    • A Brief Introduction to Grid Computing

    • Grid Performance Evaluation Issues

    • The GrenchMark Architecture

    • Experience with GrenchMark

  • Take home message


The Co-Allocation Problem in Grids (1): Motivation

  • Co-allocation = the simultaneous allocation of resources in multiple clusters to a single application that consists of multiple components

  • Reasons

    • Use more resources than are available at a single cluster at a given time

    • Create a specific virtual environment (e.g., a visualization cluster, geographically spread data)

    • Achieve reliability through replication on multiple clusters

    • Avoid resource contention on the same site (e.g., batches)


The Co-Allocation Problem in Grids (2): Overall Example

[Figure: KOALA maintains a global queue on top of the local queues with local schedulers (LS) of the clusters; local jobs enter the local queues, global jobs enter the global queue, and KOALA performs load sharing and co-allocation across the clusters. Source: Dick Epema]


The Co-Allocation Problem in Grids (3): Details on Processor and Data Co-Allocation

  • Jobs have access to processors and data from many sites

    • Files stored at different file sites, replicas may exist

    • Scheduler decides on job component placement at execution sites

    • Jobs can be of high or low priority

Source: Hashim Mohamed


The Co-Allocation Problem in Grids (4): Details on Co-Allocated Job Types

  • Fixed jobs: job component sizes and placement fixed by the user

  • Non-fixed jobs: job component sizes fixed by the user, placement by scheduler decision

  • Semi-fixed jobs: job component sizes and placement partly fixed by the user, partly by scheduler decision

  • Flexible jobs: job component sizes and placement by scheduler decision
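A minimal data-structure sketch of these four job types, with hypothetical names rather than Koala's actual interface, could look like this:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class JobType(Enum):
    FIXED = auto()       # component sizes and execution sites fixed by the user
    NON_FIXED = auto()   # sizes fixed by the user, sites chosen by the scheduler
    SEMI_FIXED = auto()  # sites fixed by the user for some components only
    FLEXIBLE = auto()    # scheduler decides both sizes and placement

@dataclass
class JobComponent:
    size: int                   # processors requested for this component
    site: Optional[str] = None  # execution site; None = scheduler's choice

@dataclass
class CoAllocatedJob:
    job_type: JobType
    components: List[JobComponent]

# Example: a non-fixed job with two 16-processor components.
job = CoAllocatedJob(JobType.NON_FIXED, [JobComponent(16), JobComponent(16)])
```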


The Koala Design

Source: Hashim Mohamed

  • Selection: placing job components

  • Control: transfer of the executable and input files

  • Instantiation: claiming the resources selected for each job component

  • Run: submit, then monitor job execution (fault tolerance)
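Read together, the four phases form a simple job life cycle. The sketch below is purely illustrative: the scheduler, site, and component objects and their methods (place_components, stage_in, claim, submit, monitor) are hypothetical names, not Koala's real API.

```python
def run_co_allocated_job(job, scheduler, max_retries=3):
    """Illustrative walk through the four phases for one co-allocated job."""
    for _ in range(max_retries):
        placement = scheduler.place_components(job)        # Selection
        if placement is None:
            continue                                        # no fit right now: retry
        for component, site in placement.items():
            site.stage_in(component.executable,             # Control
                          component.input_files)
        if not all(site.claim(component.size)               # Instantiation
                   for component, site in placement.items()):
            continue                                        # claiming failed: retry
        handle = scheduler.submit(placement)                # Run
        return scheduler.monitor(handle)                    # monitor for fault tolerance
    raise RuntimeError("job could not be placed and claimed")
```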


The Koala Selection Step: Many Placement Policies

  • Originally supported co-allocation policies:

    • Worst-Fit: balance job components across sites

    • Close-to-Files: take into account the locations of input files to minimize transfer times

    • (Flexible) Cluster Minimization: mitigate inter-cluster communication; can also split the job automatically

  • But, different application types require different ways of component placement

  • So:

    • Modular structure with pluggable policies

    • Take into account internal structure of applications

Use monitoring information (dynamic placement)
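As an illustration of the first policy, Worst-Fit can be written in a few lines; the input structures below (a list of component sizes and a map of idle processors per site) are assumptions made for the sketch, not Koala's internal representation.

```python
def worst_fit(component_sizes, idle_per_site):
    """Worst-Fit sketch: put each component on the site with the most idle
    processors, so the load stays balanced across the sites."""
    idle = dict(idle_per_site)             # site name -> idle processor count
    placement = []
    for size in sorted(component_sizes, reverse=True):
        site = max(idle, key=idle.get)     # currently least-loaded site
        if idle[site] < size:
            return None                    # no site can hold this component
        placement.append((size, site))
        idle[site] -= size
    return placement

# e.g. worst_fit([16, 16, 8], {"delft": 64, "amsterdam": 32, "leiden": 24})
```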


The Koala Instantiation Step: Advance Reservations or On-the-Fly

Source: Hashim Mohamed

Use system feedback (dynamic claiming)


The Koala Selection Step: HOCs (Exploiting Application Structure)

  • Higher-Order Components:

    • Pre-packaged software components with generic patterns of parallel behavior

    • Patterns: master-worker, pipelines, wavefront

  • Benefits:

    • Facilitates parallel programming in grids

    • Enables user-transparent scheduling in grids

  • Most important additional middleware:

    • Translation layer that builds a performance model from the HOC patterns and the user-supplied application parameters

  • Supported by KOALA (with Univ. of Münster)

  • Initial results: up to 50% reduction in runtimes
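To make the translation-layer idea concrete, here is a toy performance model for the master-worker pattern; the formula and parameter names are illustrative and far simpler than the actual HOC models.

```python
import math

def master_worker_makespan(num_tasks, task_time, workers, overhead_per_task=0.0):
    """Toy makespan estimate: tasks run in rounds of `workers` parallel tasks,
    each paying a fixed per-task middleware overhead."""
    rounds = math.ceil(num_tasks / workers)
    return rounds * (task_time + overhead_per_task)

# A scheduler could evaluate such a model for the worker counts offered by
# candidate sites and place the HOC where the predicted makespan is lowest.
```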



The Koala Instantiation Step: The Runners

  • Problem: How to support many application types, each with specific (and difficult) requirements?

  • Solution: runners (=interface modules)

  • Currently supported:

    • Any type of single-component job

    • MPI/DUROC jobs

    • Ibis jobs

    • HOC applications

  • API for extensions: write your own!
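For intuition, a runner can be thought of as a small adapter between one application type and the scheduler; the interface below uses hypothetical method names and is not the actual Koala runners API.

```python
from abc import ABC, abstractmethod

class Runner(ABC):
    """Hypothetical runner interface: one subclass per application type."""

    @abstractmethod
    def prepare(self, job):
        """Stage files and build the type-specific submission description."""

    @abstractmethod
    def submit(self, job, placement):
        """Start the job components on the selected sites."""

    @abstractmethod
    def monitor(self, job):
        """Poll component status and trigger rescheduling on failure."""

class MPIRunner(Runner):
    def prepare(self, job): ...
    def submit(self, job, placement): ...
    def monitor(self, job): ...
```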


Koala and the DAS Community

  • Extensive experience working in real Grid environments: over 25,000 completed jobs!

  • Koala was released on the DAS in September 2005 [www.st.ewi.tudelft.nl/koala/]

    • Hands-on Tutorials (last in Spring 2006)

    • Documentation (web-site)

    • Papers

      • IEEE Cluster’04, Dagstuhl FGG’04, EGC’05, IEEE CCGrid’05, IEEE eScience’06, etc.

  • Koala helps you get results:

    • IEEE CCGrid’06, others submitted


The Future of Koala (1)

[Figure: DAS-3, shown as CPU clusters, each with a local router (R), interconnected through a NOC]

  • Support for more application types, e.g.,

    • Workflows, parameter-sweep applications

  • Scheduling your application?

  • Communication-aware and application-aware scheduling policies:

    • Take into account the communication pattern of applications when co-allocating

    • Also schedule bandwidth (in DAS3)



The Future of Koala (2)

  • Support heterogeneity

    • DAS3

    • DAS2 + DAS3

    • DAS3 + Grid’5000 + RoGRID

  • Peer-to-peer structure instead of hierarchical grid scheduler


The Future of Koala (3)

  • Usage Service Level Agreements (uSLAs)

    • Give each partner its share of the resources

    • Prevent abusive behavior

    • Rules for a “mostly free” system usage pattern (decay-based uSLA mechanism)
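One way to read “decay-based” is that older consumption counts less and less, so a partner that stays idle gradually regains its full share. The sketch below illustrates that idea with an exponential decay; the semantics and parameter names are assumptions, not the actual uSLA rules.

```python
import math

def decayed_usage(usage_events, now, half_life=3600.0):
    """Sum of past consumption, exponentially decayed.
    usage_events is a list of (timestamp, cpu_seconds) pairs."""
    return sum(cpu * math.exp(-math.log(2) * (now - t) / half_life)
               for t, cpu in usage_events)

def within_share(usage_events, now, entitled_budget):
    """Admit new work only while the decayed usage stays under the budget."""
    return decayed_usage(usage_events, now) <= entitled_budget
```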


Other Koala-related

  • Resource Provisioning and Scheduling for World-Wide Data-Sharing Services

    • Grid systems provisioning resources for world-wide data-sharing services, e.g., P2P file sharing

    • Provision resources quickly, give guarantees

    • No adverse impact on the current level of service

    • Prevent abusive behavior


Outline

  • Koala: Processor and Data Co-Allocation in Grids

    • The Co-Allocation Problem in Grids

    • The Design of Koala

    • Koala and the DAS Community

    • The Future of Koala

  • GrenchMark: Analyzing, Testing, and Comparing Grids

    • A Brief Introduction to Grid Computing

    • Grid Performance Evaluation Issues

    • The GrenchMark Architecture

    • Experience with GrenchMark

  • Take home message


A Brief Introduction to Grid Computing

  • Typical grid environment, e.g., the DAS

    • Applications [!]

    • Resources

      • Compute (Clusters)

      • Storage

      • (Dedicated) Network

    • Virtual Organizations, Projects (e.g., VL-e), Groups, Users

  • Grids vs. (traditional) parallel production environments

    • Dynamic

    • Heterogeneous

    • Very large-scale (world)

    • No central administration

      → Most problems are NP-hard, need experimental validation


Experimental Environments: Real-World Testbeds

  • Real-World Testbed

    • DAS, NorduGrid, Grid3/OSG, Grid’5000…

  • Pros

    • True performance, also shows “it works!”

    • Infrastructure in place

  • Cons

    • Time-intensive

    • Exclusive access required for repeatability

    • Controlled environment problem (limited scenarios)

    • Workload structure (little or no realistic data)

    • What to measure (new environment)


Experimental Environments: Simulated and Emulated Testbeds

  • Simulated and Emulated Testbeds

    • GridSim, SimGrid, GangSim, MicroGrid …

    • Essentially a trade-off between precision and speed

  • Pros

    • Exclusive access (repeatability)

    • Controlled environment (unlimited scenarios)

  • Cons

    • Synthetic Grids: What to generate? How to generate? Clusters, Disks, Network, VOs, Groups, Users, Applications, etc.

    • Workload structure (little or no realistic data)

    • What to measure (new environment)

    • Validity of results (accuracy vs. time)


Grid Performance Evaluation: Current Practice

  • Performance Indicators

    • Define my own metrics, or use utilization (U) and average wait/response times (AWT/ART), or both

  • Workload Structure

    • Run my own workload; mostly the “all users are created equal” assumption (unrealistic)

    • Do not make comparisons (incompatible workloads)

    • No repeatability of results (e.g., background load)

Need a common performance evaluation framework for grids


Test Management: The Generic Problem of Analyzing, Testing, and Comparing Grids

  • Use cases for automatically analyzing, testing, and comparing Grids

    • Comparisons for system design and procurement

    • Functionality testing and system tuning

    • Performance testing/analysis of grid applications

  • For grids, this problem is hard!

    • Testing in real environments is difficult

    • Grids change rapidly

    • Validity of tests


Test Management: A Generic Solution to Analyzing, Testing, and Comparing Grids

  • “Generate and run synthetic grid workloads, based on real and synthetic applications”

  • Current alternatives (not covering all problems)

    • Benchmarking with real/synthetic applications (representative?)

    • User-defined test management (statistically sound?)

  • Advantages of using synthetic grid workloads

    • Statistically sound composition of benchmarks

    • Statistically sound test management

    • Generic: covers the broad spectrum of use cases (to be shown)


Grid Performance Evaluation: Current Issues

  • Test Management

    • Perform a representative test in a real(istic) Grid

  • Performance Indicators

    • What are the metrics for the new environment?

  • Workload Structure

    • Which general aspects are important?

    • Which Grid-specific aspects need to be addressed?

Need a common performance evaluation framework for grids: GrenchMark


GrenchMark: A Framework for Analyzing, Testing, and Comparing Grids

  • What’s in a name? Grid benchmark → working towards a generic tool for the whole community, to help standardize the testing procedures; but benchmarks are too early, so we use synthetic grid workloads instead

  • What’s it about? A systematic approach to analyzing, testing, and comparing grid settings, based on synthetic workloads

    • A set of metrics and workload units for analyzing grid settings [JSSPP’06]

    • A set of representative grid applications

      • Both real and synthetic

    • Easy-to-use tools to create synthetic grid workloads

    • Flexible, extensible framework


GrenchMark Overview: Easy to Generate and Run Synthetic Workloads


… but More Complicated Than You Think

  • Workload structure: user-defined and statistical models; dynamic job arrivals; burstiness and self-similarity; feedback, background load; machine usage assumptions; users, VOs

  • Metrics: A(W) Run/Wait/Response Time; Efficiency; MakeSpan; failure rate [!]

  • (Grid) notions: co-allocation, interactive jobs, malleable, moldable, …

  • Measurement methods: long workloads; saturated / non-saturated system; start-up, production, and cool-down scenarios; scaling the workload to the system

  • Applications: synthetic and real

  • Workload definition language: base language layer, extended language layer

  • Other: the same workload can be used for both simulations and real environments
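As one small example of the “dynamic job arrivals” and “burstiness” points above, a synthetic arrival process can be generated as follows; the model and its parameters are illustrative, not GrenchMark's actual workload model.

```python
import random

def synthetic_arrivals(num_jobs, mean_interarrival, burst_prob=0.1, burst_factor=10.0):
    """Exponential inter-arrival times, with occasional bursts in which jobs
    arrive `burst_factor` times faster, to mimic bursty grid workloads."""
    t, arrivals = 0.0, []
    for _ in range(num_jobs):
        scale = burst_factor if random.random() < burst_prob else 1.0
        t += random.expovariate(scale / mean_interarrival)
        arrivals.append(t)
    return arrivals
```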


GrenchMark Overview: Unitary and Composite Applications

  • Unitary applications

    • sequential, MPI, Java RMI, Ibis, …

  • Composite applications

    • Bag of tasks

    • Chain of jobs

    • Directed Acyclic Graph-based (Standard Task Graph Archive)


GrenchMark Overview: Workload Description Files

  • Format:

    • Number of jobs

    • Co-allocation and number of components

    • Language extensions

    • Composition and application types

    • Inter-arrival and start time

    • Combining four workloads into one
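Purely as an illustration of the fields listed above (the real GrenchMark file syntax differs), one workload unit could be described as a small structure like the following; several such units can then be combined into a single workload.

```python
# Hypothetical workload unit; field names are illustrative only.
workload_unit = {
    "jobs": 100,                         # number of jobs
    "components": 2,                     # co-allocation: components per job
    "application": "synthetic-mpi",      # composition and application type
    "arrival": {                         # inter-arrival and start time
        "process": "poisson",
        "mean_interarrival_s": 30,
        "start_at_s": 0,
    },
    "extensions": {"site_hint": "any"},  # language-extension fields
}
```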


Performance Indicators

  • Time-, Resource-, and System-Related Metrics

    • Traditional: utilization, A(W)RT, A(W)WT, A(W)SD (average (weighted) response time, wait time, and slowdown)

    • New: waste, fairness (or service quality reliability)

  • Workload Completion and Failure Metrics

    “In Grids, functionality may be even more important than performance”

    • Workload Completion (WC)

    • Task and Enabled Task Completion (TC, ETC)

    • System Failure Factor (SFF)

A. Iosup, D.H.J. Epema (TU Delft), C. Franke, A. Papaspyrou, L. Schley, B. Song, R. Yahyapour (U Dortmund), On modeling synthetic workloads for Grid performance evaluation, JSSPP’06.
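The completion and failure metrics above can be computed from per-job and per-task status records; the sketch below uses simplified definitions and an assumed record layout, not the exact formulations of the JSSPP’06 paper.

```python
def completion_metrics(jobs):
    """jobs: list of dicts with a 'status' field and an optional 'tasks' list,
    where each task also has a 'status' field."""
    done = sum(1 for j in jobs if j["status"] == "completed")
    tasks = [t for j in jobs for t in j.get("tasks", [])]
    tasks_done = sum(1 for t in tasks if t["status"] == "completed")
    return {
        "workload_completion": done / len(jobs),
        "task_completion": tasks_done / len(tasks) if tasks else None,
        "failure_fraction": 1.0 - done / len(jobs),  # crude stand-in for SFF
    }
```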


Using GrenchMark: Grid System Analysis

  • Performance testing: test the performance of an application (for sequential, MPI, Ibis applications)

    • Report runtimes, waiting times, grid middleware overhead

    • Automatic results analysis

  • What-if analysis: evaluate potential situations

    • System change

    • Grid inter-operability

    • Special situations: spikes in demand


Using GrenchMark: Functionality Testing in Grid Environments

  • System functionality testing: show the ability of the system to run various types of applications

    • Report failure rates [arguably, functionality in grids is even more important than performance! → 10% job failure rate in a controlled system like the DAS]

  • Periodic system testing: evaluate the current state of the grid

    • Replay workloads


Using GrenchMark: Comparing Grid Settings

  • Single-site vs. co-allocated jobs: compare the success rate of single-site and co-allocated jobs, in a system without reservation capabilities

    • Single-site jobs 20% better vs. small co-allocated jobs (<32 CPUs), 30% better vs. large co-allocated jobs [setting- and workload-dependent!]

  • Unitary vs. composite jobs: compare the success rate of unitary and composite jobs, with and without failure handling mechanisms

    • Both 100% with a simple retry mechanism [setting- and workload-dependent!]


A GrenchMark Success Story: Releasing the Koala Grid Scheduler on the DAS

  • Koala [ http://www.st.ewi.tudelft.nl/koala/]

    • Grid Scheduler with co-allocation capabilities

  • DAS: The Dutch Grid, ~200 researchers

  • Initially

    • Koala, a tested (!) scheduler, pre-release version

  • Test specifics

    • 3 different job submission modules

    • Workloads with different job requirements, inter-arrival rates, co-allocated vs. single-site jobs, …

    • Evaluate: job success rate, Koala overhead and bottlenecks

  • Results

    • 5,000+ jobs successfully run (all workloads); functionality tests

    • 2 major bugs first day, 10+ bugs overall (all fixed)

    • KOALA is now officially released on the DAS (full credit to the KOALA developers; thanks for testing with GrenchMark)


GrenchMark: Iterative Research Roadmap

  • Simple functional system: A. Iosup, J. Maassen, R.V. van Nieuwpoort, D.H.J. Epema, Synthetic Grid Workloads with Ibis, KOALA, and GrenchMark, CoreGRID IW, Nov 2005.

  • Complex extensible system: A. Iosup, D.H.J. Epema, GrenchMark: A Framework for Analyzing, Testing, and Comparing Grids, IEEE CCGrid'06, May 2006.

  • Open-GrenchMark: community effort (with the University of Dortmund), JSSPP'06.


Towards Open-GrenchMark: Grid Traces, Simulators, Benchmarks

  • Distributed testing

    • Integrate with DiPerF (C. Dumitrescu, I. Raicu, M. Ripeanu, and M.I. Andreica)

  • Grid traces analysis

    • Automatic tools for grid traces analysis

  • Use in conjunction with simulators

    • Ability to generate workloads which can be used in simulated environments (e.g., GangSim, GridSim, …)

  • Grid benchmarks

    • Analyze the requirements for domain-specific grid benchmarks

A. Iosup, C. Dumitrescu, D.H.J. Epema (TU Delft), H. Li, L. Wolters (U Leiden), How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications, IEEE Grid’06.


Outline

  • Koala: Processor and Data Co-Allocation in Grids

    • The Co-Allocation Problem in Grids

    • The Design of Koala

    • Koala and the DAS Community

    • The Future of Koala

  • GrenchMark: Analyzing, Testing, and Comparing Grids

    • A Brief Introduction to Grid Computing

    • Grid Performance Evaluation Issues

    • The GrenchMark Architecture

    • Experience with GrenchMark

  • Take home message


Take home message

  • PDS Group/TU Delft - resource and test management in Grid systems

  • Koala: Processor and Data Co-Allocation in Grids [www.st.ewi.tudelft.nl/koala/]

    • Grid scheduling with co-allocation and fault tolerance

    • Many placement policies available

    • Extensible runners system

    • Tutorials, on-line documentation, papers

  • GrenchMark: Analyzing, Testing, and Comparing Grids [grenchmark.st.ewi.tudelft.nl]

    • Generic tool for the whole community

    • Generates diverse grid workloads

    • Easy-to-use, flexible, portable, extensible, …

    • Tutorials, papers


Thank you!

Questions? Remarks? Observations? All welcome!

www.st.ewi.tudelft.nl/koala

grenchmark.st.ewi.tudelft.nl/

