Welcome to Research Group Meeting on Reliability and Robustness in Grid Computing Systems

Chris Dabrowski and Geoff Fox

OGF21, Seattle, Washington, USA

October 17, 2007

Proposed Meeting Agenda

I. Introduction

II. Presentation/Review of Draft OGF Informational Document “Reliability in Grid Computing Systems”

  • A work in progress

III. Discussion

IV. Close

Grid Reliability and Robustness RG

Purpose: Make recommendations and explore methods for improving reliability and robustness of standards-based grid systems.

Main Product: Produce an OGF Informational Document that summarizes the state of work on Grid system reliability and identifies reliability and robustness issues/requirements for grid systems

  • First draft in progress. Contributions and review needed!

Additional Products:

  • Facilitate collaborations between researchers on grid reliability
  • Preliminary requirements for reliability measurement methods and tools

Web pages and reflector:

  • Official: https://forge.gridforum.org/sf/projects/gridrel-rg
  • Unofficial: http://gridreliability.nist.gov/ (list of resources, in progress)
  • Reflector: [email protected]

OGF Informational Document

Title: Reliability in Grid Computing Systems

  • Summarizes the state of work on Grid system reliability based on input from grid system practitioners/researchers
  • Identifies issues that must be addressed/solved to ensure reliability and robustness in grid systems
  • Provides a basis for identifying requirements for establishing and maintaining high levels of reliability in large-scale Grids
  • Provides a basis for preliminary requirements for methods and tools to measure grid system reliability
  • Focuses on current practices and research that provide insight into how WS and grid specifications may affect grid reliability
  • Serves as a resource on reliability issues for OGF working groups developing specifications and for grid developers

Document basis: previous workshops on grid reliability

First workshop (GGF16, Athens, Greece)

  • Site Assessment and Probabilistic Risk Analysis (PRA) of Grid Computing Facilities, by Joe Higgins and Robert Sewell of Sun Microsystems
    • Methods for analyzing risks involved in deploying and configuring grid computing sites
  • Reliable Messaging for Grids and Web Services, by Geoffrey Fox, Shrideep Pallickara, Damodar Yemme, Hasan Bulut and Sima Patel, Community Grids Lab, Indiana University
    • NaradaBrokering: scalable, standards-based management architecture for fault-tolerant grids
  • Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH), by Heon Y. Yeom of Distributed Computing Systems Laboratory, Seoul National University
    • Fault-tolerant MPI (FT-MPICH) with coordinated checkpointing of interacting, parallel processes
  • QoS-Aware Fault Tolerance in Grid Computing, by L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi, Scuola Superiore of Sant’Anna and CNIT, Pisa, Italy
    • Fault tolerance through integrating replicated services and a QoS-capable network protocol layer
  • A Program of Work for Understanding Emergent Behavior in Global Grid Systems, by Kevin Mills and Chris Dabrowski, of the U.S. NIST
    • Developing methods for understanding and controlling complex systems behavior in grids

Document basis: previous workshops on grid reliability (continued)

Second workshop (OGF19, Chapel Hill, USA)

  • Using a Large-Scale Survivability Architecture to Control Grids: A Status Report, by Zach Hill, Jonathan Rowanhill, Jim Basney, Glenn Wasson, John Knight, Anh Nguyen-Tuong, Andrew Grimshaw and Marty Humphrey, University of Virginia and NCSA/University of Illinois, Urbana-Champaign
    • Reconfigurable Grid system architecture (Willow) for promoting survivability & dependability
  • Platform Symphony Reliability, by Nick Werstiuk, Platform Computing
    • Grid architecture for promoting reliability & dependability through failure detection and failover
  • Managing Grid and Web Services and their exchanged messages, by Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, and Marlon Pierce, Indiana University
    • Results showing performance, scalability and cost-effectiveness of NaradaBrokering architecture
  • Reliability Assessment of Grid Software Systems Using Emergent Features, by Carol Song, Umut Topkara, Jungha Woo, and Sang Phill Park, Purdue University
    • Method for identifying centralized software components likely to impact grid system reliability
  • Reflections on Reliability Issues in OGSA, by Matti Hiltunen, AT&T Labs
    • Summary of requirements for ensuring reliability and availability of OGSA-based services

Document Outline: Reliability in Grid Computing Systems
  • Introduction
  • Definitions
  • Current Practices on Grid System Reliability
  • Reliability of Grid Applications
  • Reliability Issues and Preliminary Requirements
  • Reliability Metrics and Preliminary Measurement Requirements
  • Summary
  • Resources
2. Definitions
  • Source
    • Avizienis, A., Laprie, J., Randell, B., and Landwehr, C., “Basic Concepts and Taxonomy of Dependable and Secure Computing”

  • Key definitions:
    • Reliability, availability, dependability, and fault tolerance
    • Grid resources
  • Decomposition of Grid Reliability concerns
    • Hardware and Software computing resources accessible via grid
    • Core infrastructure and resource management services
      • Allocate and manage grid resources
      • Examples: discovery, negotiation, execution management, notification, security, etc.
    • Underlying connection and data transport facilities: grid network
    • Overall system perspective
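
As a rough companion to the key definitions above, the sketch below computes two textbook quantities often attached to these terms: steady-state availability, MTTF / (MTTF + MTTR), and reliability over a mission time under a constant failure rate, exp(-t / MTTF). The formulas, the constant-failure-rate assumption, and the function names are illustrative assumptions, not definitions taken from the cited taxonomy or from the draft document.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: long-run fraction of time the resource is up.

    mttf_hours: mean time to failure; mttr_hours: mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t_hours without failure.

    Assumes a constant failure rate of 1 / mttf_hours (exponential model).
    """
    return math.exp(-t_hours / mttf_hours)
```

For example, a resource with a mean time to failure of 990 hours and a mean repair time of 10 hours has an availability of 0.99 under this model.
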
3. Current practices/research on grid system reliability
  • Some main points. Grid reliability methods:
    • Still leverage redundancy
    • In deployed systems, are based on methods used in cluster computing
    • Must face scalability and administrative-boundary issues
  • Areas covered
    • Fault tolerance of grid resources
      • Fault detection
      • Recovery methods for grid resources
        • Checkpoint and recovery through process migration, grid resource replication, replication in data grids

      • Fault removal through testing and code certification
    • Reliability of supporting infrastructure and management services
    • Grid connection and transport reliability
      • Specifications, fault tolerant grid networks, reliable multicasting
    • Reliability from overall system perspective
      • Architectural perspective, complex systems perspective
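
The checkpoint-and-recovery item above can be illustrated with a minimal sketch, assuming a job whose state fits in a small JSON file. The function name `run_with_checkpoints`, the `fail_at` fault-injection parameter, and the file layout are hypothetical and are not drawn from any system surveyed in the document. If the checkpoint file lives on storage visible to another node, the same resume step is roughly what a migrated process would perform.

```python
import json
import os


def run_with_checkpoints(n_steps, path, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    If a checkpoint already exists at `path`, resume from it instead of
    starting over. `fail_at` injects a simulated fault at that step.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # recovery: restart from the last checkpoint

    for step in range(state["next_step"], n_steps):
        if step == fail_at:
            raise RuntimeError(f"simulated fault at step {step}")
        state["total"] += step
        state["next_step"] = step + 1

        # Write to a temporary file, then rename, so that a crash during the
        # write never leaves a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)

    return state["total"]
```

A first call with `fail_at=6` stops after completing steps 0 to 5; a second call with the same `path` resumes at step 6 and returns the same total an uninterrupted run would have produced.
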
4. Reliability of grid applications
  • Some main points:
    • Grid applications may/should ensure their reliability themselves (perspective of GCPR WG?)
    • Merging of grid user/client fault tolerance (FT) methods and provider FT methods?
    • What’s being done for FT in grid workflows?
  • Areas covered
    • Fault tolerance of remote application processes
    • Fault tolerance of grid resource compositions and workflows
      • Workflows composed with languages/tools for grid environments
      • Workflows composed with languages/tools for generic web service environments
    • Merging application and provider fault tolerance strategies
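
One simple client-side strategy for a workflow step is to retry and then fail over to a replica of the same service. The sketch below is only an illustration of that idea under the assumption that replicas are interchangeable callables; `call_with_failover` and its parameters are hypothetical names, not part of any grid workflow language or toolkit mentioned in the slides.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]],
                       retries_per_replica: int = 2) -> T:
    """Try each replica in order and return the first successful result.

    Each replica is attempted up to `retries_per_replica` times before the
    next one is tried. Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for replica in replicas:
        for _ in range(retries_per_replica):
            try:
                return replica()
            except Exception as err:  # a real client would catch narrower fault types
                last_error = err
    raise RuntimeError("all replicas failed") from last_error
```

Here the retry logic lives entirely in the user/client; a provider-side mechanism (for example, a service that hides replication behind one endpoint) would make it redundant, which is one reason the slide asks how the two might be merged.
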
5. Reliability issues and preliminary requirements
  • Fault removal
    • Cost-benefit analysis of testing grid components, to determine which functions to test and which kinds of tests are needed (component, integration, or interaction tests)
  • Fault Tolerance
    • Fault detection: need for scalability of methods, fault taxonomies
    • Recovery: tradeoffs between methods, understanding which methods to use and when, and coordinated checkpoint methods.
  • Special requirements for infrastructure and resource management services
    • Criticality of services leads to different tradeoff dynamics
  • Fault tolerance for grid networking and data transport
    • FT/control in overlays, combining overlays, dedicated networks, enhance specs for reliability(?), reliable multicasting?
  • Fault tolerance of grid applications
    • User vs provider FT, FT considerations for workflow languages?
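
To make the fault-detection item concrete, here is a minimal sketch of a timeout-based heartbeat detector, one common approach among several. The class name `HeartbeatDetector` and its methods are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow. A single central detector like this is exactly the kind of method that raises the scalability concern noted above.

```python
class HeartbeatDetector:
    """Suspect a resource as failed when its last heartbeat is too old."""

    def __init__(self, timeout: float):
        self.timeout = timeout   # seconds of silence tolerated
        self.last_seen = {}      # resource name -> time of last heartbeat

    def heartbeat(self, resource: str, now: float) -> None:
        """Record that `resource` reported in at time `now`."""
        self.last_seen[resource] = now

    def suspected(self, now: float) -> list:
        """Return the sorted names of resources silent for longer than the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

With a 5-second timeout, a resource that last reported at time 0 is suspected at time 6, while one that reported at time 4 is not.
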
6. Metrics and preliminary measurement requirements
  • Preliminary work on grid reliability metrics
    • OGF Network measurement working group (2004), analysis of reliability of a grid by Xie and colleagues (2004).
  • Preliminary requirements for metrics, three classes:
    • Network measurement metrics (OGF NM WG)
    • Metrics to measure availability and reliability of individual grid resources (needed by grid users for evaluation purposes)
    • Metrics to measure reliability of entire grid or significant subsections (as above)
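
As an illustration of the second and third metric classes, the sketch below derives per-resource figures from an outage log and combines per-resource availabilities into a grid-level figure. The log format, the function names, and the independence assumption in `all_up_availability` are assumptions made for this example; the slides do not prescribe specific metrics.

```python
import math


def resource_metrics(window_hours, outages):
    """Summarize one resource over an observation window.

    outages: list of (down_at, up_at) pairs, in hours, non-overlapping and
    lying inside the window.
    """
    downtime = sum(up_at - down_at for down_at, up_at in outages)
    uptime = window_hours - downtime
    n = len(outages)
    return {
        "availability": uptime / window_hours,
        "mean_uptime_hours": uptime / n if n else float("inf"),
        "mean_repair_hours": downtime / n if n else 0.0,
    }


def all_up_availability(availabilities):
    """Probability that every listed resource is up at the same time.

    Assumes failures are independent, which real grids may well violate
    (for example, shared networks or shared administrative domains).
    """
    return math.prod(availabilities)
```

For a 100-hour window with outages from hour 10 to 12 and from hour 50 to 58, the resource was down 10 hours, giving an availability of 0.9, a mean uptime of 45 hours per failure, and a mean repair time of 5 hours.
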
7. Summary
  • TBD

8. Resources

  • Over 180 cited
  • Organized topically in an appendix
  • Additional sources to be worked in
Presentation Summary
  • Document work in progress
    • Please review and comment!
    • Please contribute!