Research and Development

R&D Agenda

  • Security

  • Bulk Data Movement

  • Data Replication and Mirroring

  • Monitoring

  • Metrics

  • Versioning

  • Product Services

Security: Single Sign-On Solutions

Goal: Single Sign-On (SSO) across browsers and non-browser clients

Public Key Infrastructure (PKI) SSO

SSO for non-browser applications, like GridFTP

SSO through X.509 public key certificates issued by MyProxy

Online Certification Authority (CA) with username/password

Auto-provisioning of trust configuration


SSO for http/https applications through OpenID

OpenID Identity Provider (IdP) with username/password

Web-SSO & PKI-SSO share username/password DB

Single primary authentication mechanism for end user

Security: Integrated WebSSO & PKI-SSO

Security: MyProxy as Online CA

  • MyProxy: Open Source software from NCSA

  • Online CA is one of its many capabilities

  • Different primary authentication mechanisms through standardized Pluggable Authentication Module (PAM)

  • Shipped with Globus Toolkit, supported on various platforms

  • Client package as separate deployment, including Java clients and API

Earth System Grid Center for Enabling Technologies (ESG-CET)

Security: Auto-Provisioning

  • PKI-SSO solutions require configuration of trust-roots

    • Identity providers (IdPs), Certification Authorities (CAs)

    • Revocation lists

  • Up-to-date configuration required at servers and clients

    • Scalability issues with large numbers of clients

  • MyProxy provides auto-provisioning option

    • Integrated with login

    • Transparently updates CAs and CRLs

    • Also extended for use in server provisioning

Security: OpenID

OpenID provides SSO across multiple servers and can leverage multiple IdPs

OpenID satisfies ESG security requirements

OpenID uses standard HTTP/HTTPS protocol

Use ESG-specific OpenID profile to ensure safe deployment

All communication with the IdP requires SSL (both Client-to-IdP and IdP-to-RP)

Yadis IDs (URIs) for OpenID identifiers

Resource Providers (RP) enforce a white list of IdPs
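
The white-list rule above can be illustrated with a minimal Python sketch. It is not taken from the ESG code base: the host names and the function name are hypothetical, and the check is simplified to the host of the identifier URL, whereas a real deployment would check the IdP endpoint found during discovery.

```python
from urllib.parse import urlparse

# Hypothetical white list of trusted IdP hosts (an assumption for illustration).
TRUSTED_IDP_HOSTS = {"idp.example-gateway.org", "idp.example-mirror.org"}


def is_acceptable_openid(identifier: str) -> bool:
    """Accept only https identifiers whose host is on the white list."""
    parts = urlparse(identifier)
    if parts.scheme != "https":
        # The profile requires SSL for all communication with the IdP.
        return False
    return parts.hostname in TRUSTED_IDP_HOSTS
```

An identifier served over plain http, or hosted by an IdP that is not on the list, is rejected before any authentication request is sent.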

Security: OpenID4Java

OpenID4Java: Open Source software

ESG developers contribute enhancements back

Deployable as independent package into standard application servers

Integrates well with ESG’s application server software

Built-in support:

SSL (encrypted communication)

User attributes push

Java API to write authentication filters and identity providers

Extended to support attributes and multiple identity providers

Bulk Data Movement

Access all data holdings through uniform interfaces, including disk pools and mass storage systems on various nodes, using various security models

Allocate space quotas to users dynamically on gateways in order to serve files to clients

Manage file lifetimes in the allocated spaces, and automatically clean up spaces for reuse

Provide easy-to-use user facilities to download many files

Manage large-scale robust data movement for replication of core data between nodes

Storage Resource Management (SRM) tools support these requirements in ESG

Bulk Data Movement: SRM Technology, and BeStMan

Storage Resource Managers (SRMs) are middleware components layered over shared, distributed storage systems. They provide:

Dynamic space allocation

Dynamic file management in spaces

Uniform interface to all storage systems

The Berkeley Storage Manager (BeStMan) is an implementation of the SRM standard

The SRM specification is an OGF (Open Grid Forum) standard that was developed over the last 7 years

BeStMan is used in ESG, several High-Energy-Physics (HEP) experiments, and other applications

BeStMan in ESG (see figure next slide)

Used for coordinating space allocation and transparent access and file movement between ESG nodes and the gateway

Currently interfaces to HPSS in NERSC and ORNL, to MSS at NCAR, and to disk systems at LLNL and LANL

Also used to manage space on the NCAR gateway
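
Dynamic space allocation and file lifetimes can be pictured with a small Python sketch. This is not the SRM interface or BeStMan code. The class and method names are hypothetical, and the `now` argument exists only to make the example deterministic.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Space:
    quota_bytes: int
    files: dict = field(default_factory=dict)  # name -> (size, expiry time)

    def used(self) -> int:
        return sum(size for size, _ in self.files.values())


class SpaceManager:
    """Toy space manager: per-user quotas plus file lifetimes (illustrative only)."""

    def __init__(self):
        self.spaces = {}

    def allocate(self, user, quota_bytes):
        self.spaces[user] = Space(quota_bytes)
        return self.spaces[user]

    def cleanup(self, user, now=None):
        """Remove files whose lifetime has expired so the space can be reused."""
        now = time.time() if now is None else now
        space = self.spaces[user]
        expired = [n for n, (_, exp) in space.files.items() if exp <= now]
        for name in expired:
            del space.files[name]
        return expired

    def put(self, user, name, size, lifetime_s, now=None):
        """Register a file in the user's space, reclaiming expired files first."""
        now = time.time() if now is None else now
        self.cleanup(user, now)
        space = self.spaces[user]
        if space.used() + size > space.quota_bytes:
            raise RuntimeError("quota exceeded")
        space.files[name] = (size, now + lifetime_s)
```

A request that would exceed the quota fails until the lifetime of an earlier file has passed, after which the space is reclaimed automatically.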

Bulk Data Movement: Use of BeStMan in ESG

The BeStMan at the gateway contacts the BeStMan instances on the other nodes to get requested files (highlighted in purple in the figure)

DataMoverLite (DML): Simplifying Data Movement to Clients

Goal: automate pulling of files into user’s workstation

  • Using various transfer protocols (GridFTP, bbcp, https, …)

  • GUI shows transfer progress; command-line mode shows summary progress

  • Supports entire directory transfers

  • Supports suspend/resume operations

  • DML available on Linux, PC, Mac

  • GUI shows info on completed, active, pending transfers

  • Also, file sizes, transfer times, transfer speed

Bulk Data Movement Service Requirements

Move terabytes to petabytes (many thousands of files)

Asynchronous long-lasting operation

Recovery from transient failures and automatic restart

Take advantage of (dynamic) network provisioning

Use GridFTP, other protocols if necessary

Space verification at target

Support for data checksums

On-demand transfer status information

On-demand completion time estimates

Statistics collection

For security reasons bulk data movement needs to be done in “pull mode”
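
A few of these requirements (pull mode, recovery from transient failures, checksum support) can be sketched in Python. This is an illustration under stated assumptions, not the planned service. `fetch` stands in for whatever transfer protocol is used, `TransientError` and `pull_files` are hypothetical names, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib


class TransientError(Exception):
    """A failure worth retrying, such as a dropped connection."""


def pull_files(names, fetch, expected_sha256, max_attempts=3):
    """Pull each file with fetch(name) -> bytes; return (done, failed).

    A file counts as done only if its SHA-256 digest matches the expected one.
    """
    done, failed = {}, []
    for name in names:
        for _attempt in range(max_attempts):
            try:
                data = fetch(name)
            except TransientError:
                continue  # transient failure: try again
            if hashlib.sha256(data).hexdigest() == expected_sha256[name]:
                done[name] = data
                break
            # checksum mismatch: fall through and retry
        else:
            failed.append(name)  # every attempt failed
    return done, failed
```

The `failed` list is what a follow-up request would be built from, so a long-running transfer can be restarted for only the files that did not arrive intact.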

Workflow for Future Bulk Data Movement Service

[Figure: workflow diagram. Only fragments of its labels survive: "plan using", "at Target", "Compose request for failed files", "and generate".]

Data Replication and Mirroring

Requirement: several mirror sites around the world want to host key subsets (called a “core”) of ESG data sets

This is a new requirement for ESG

Replication of climate data sets was not originally an ESG goal

Originally considered impractical because of large size of climate data sets

With increasing importance of the IPCC data, international sites want to replicate or “mirror” key data sets

Give scientists in a geographical region access to a “local” copy

Reduce wide area latencies for data access

Provide increased fault tolerance and disaster protection, since data sets are available at multiple sites

Impact of Data Replication/Mirroring

This work will make ESG data sets more accessible to climate scientists outside of the ESG-CET project

Initial planned mirror sites:

UK’s British Atmospheric Data Centre (BADC)

Germany’s Max Planck Institute for Meteorology (MPIM)

Both have participated in design discussions for mirroring functionality

Other mirror sites likely (e.g., in Asia)

Global network topology considerations

Impact will be to increase the use of ESG and CMIP5 data sets by scientists around the world, thus advancing climate science discoveries

Requirements for Data Mirroring

Newly published data set(s) are added to a common core produced at a gateway

A mirror site replicates some or all of the data sets from the common core published by a gateway

Changes to existing data sets (additions, deletions, replacements, modifications) are propagated from publishing gateway to mirror sites
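
These three requirements amount to keeping a mirror's holdings in step with the subset of the common core it has chosen to host. The Python sketch below shows one way to work out what must change. The function name and the representation of a catalog as a mapping from data set ID to version are assumptions made for illustration.

```python
def plan_mirror_updates(core, mirror, subscribed):
    """Work out what a mirror must add, replace, or delete.

    core, mirror: dict mapping data set id -> version identifier.
    subscribed: set of data set ids the mirror site wants to host.
    """
    wanted = {d: v for d, v in core.items() if d in subscribed}
    return {
        # newly published data sets the mirror does not hold yet
        "add": sorted(d for d in wanted if d not in mirror),
        # data sets whose version at the gateway differs from the mirror's copy
        "replace": sorted(d for d in wanted if d in mirror and mirror[d] != wanted[d]),
        # data sets withdrawn from the core or no longer subscribed
        "delete": sorted(d for d in mirror if d not in wanted),
    }
```

For example, with a core of `{"ds1": 2, "ds2": 1, "ds3": 1}`, a mirror holding `{"ds1": 1, "ds4": 1}`, and a subscription to `ds1` and `ds2`, the plan is to add `ds2`, replace `ds1`, and delete `ds4`.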

Data Mirroring Plans Going Forward

Implementation plan (currently in progress) involves integration of several key ESG components and new functionality

Choose among available source replicas for data and metadata

Invoke the Bulk Data Movement component to copy data sets reliably to the mirror site’s data node

Use existing ESG metadata API operations to query the relevant metadata at the publishing node

Use a modified version of the ESG publication client to publish newly replicated data sets at the mirror site’s gateway

Use versioning functionality to identify updates that need to be propagated to mirror sites

Technical objectives

In the next year: complete initial implementation and deployment; evaluate data mirroring at sites in ESG and Europe

Add functionality, including support for automatic subscription and notification of mirrored data sets

Replication Architecture (1)

Replication Architecture (2)



Monitoring

  • Monitoring has contributed significantly to the robustness of the ESG infrastructure

  • Based on the Globus Monitoring and Discovery System (MDS)

  • ESG uses MDS to monitor the status of components in the distributed system

    • GridFTP data transfer services

    • Storage Resource Managers (SRMs)

    • NCAR portal

    • HTTP data services

    • OpenDAP services

    • Replica Location Services (RLSs)

Globus Monitoring and Discovery System

  • MDS Index Service

    • Collects status information from information providers at each component

    • Reports whether a particular monitored service is currently working correctly

  • MDS Trigger Service

    • Takes actions based on monitored conditions

    • Sends emails to the Earth System Grid administrators’ mailing list when components fail
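
The split between an index that collects status and a trigger that acts on it can be sketched in a few lines of Python. This does not use the MDS API. `collect_status`, `run_triggers`, and the `notify` callback (standing in for sending email) are hypothetical names.

```python
def collect_status(providers):
    """Index step: ask each information provider whether its service is healthy.

    providers: dict mapping component name -> zero-argument callable
    that returns True when the component is working.
    """
    status = {}
    for name, probe in providers.items():
        try:
            status[name] = bool(probe())
        except Exception:
            status[name] = False  # a probe that raises counts as a failure
    return status


def run_triggers(status, notify):
    """Trigger step: call notify(message) once if any component has failed."""
    failed = sorted(name for name, ok in status.items() if not ok)
    if failed:
        notify("ESG components failing: " + ", ".join(failed))
    return failed
```

Passing `notify` as a parameter keeps the condition check separate from the action, so the same trigger could send email, write a log entry, or restart a service.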

Impact of Monitoring and Future Plans

  • Has resulted in much faster recovery of failed services in the distributed ESG infrastructure

  • Lower downtime of our infrastructure

  • The ESG team is quickly informed when components fail

    • Allows the team to quickly restart failed services

    • Often before failures are encountered by users

  • We plan to deploy yet more sophisticated monitoring

    • ESG infrastructure increasingly distributed, federated

    • Also want to monitor status of mirror sites worldwide

    • Monitor service performance as well as availability

      • Investigating NetLogger, PerfSONAR



Metrics

Metrics are required to track and record user interactions with the ESG enterprise system

Reporting is required to show the benefits of the ESG enterprise system to the scientific community at large

ESG Gateway requests metric data from its Data Nodes

An ESG Gateway will periodically download metrics data (SRM, OPeNDAP, LAS, server hardware performance) gathered by a Data Node for a given interval of time

Returned metrics data will then be stored at the ESG Gateway for future metrics reports

Metrics Requirements

Metrics Progress

The gathering of important metrics for the ESG Gateway has been completed

  • User registrations

  • User logins

  • File downloads

  • User clickstreams

  • Browser type usage

Report generation for key metrics has been completed

  • Total users registered, including monthly trends

  • Total files downloaded, including monthly trends

Metrics Plan Going Forward

Several improvements are required in the near term for Metrics

  • Design and development of the Data Node “black box” metrics gathering software

  • Design and development of auto-generated report notifications via email

  • Design and development of a star schema for the metrics database
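
To make the star-schema item concrete, here is a minimal Python sketch using the standard `sqlite3` module. One fact table of downloads references user, file, and date dimensions. The table names, column names, and sample rows are all invented for illustration and do not describe the actual ESG metrics database.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by dimension tables.
SCHEMA = """
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_file (file_id INTEGER PRIMARY KEY, path TEXT, size_bytes INTEGER);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE fact_download (
    user_id INTEGER REFERENCES dim_user(user_id),
    file_id INTEGER REFERENCES dim_file(file_id),
    date_id INTEGER REFERENCES dim_date(date_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dim_user VALUES (1, 'alice')")
    conn.executemany("INSERT INTO dim_file VALUES (?, ?, ?)",
                     [(1, "/data/a.nc", 100), (2, "/data/b.nc", 200)])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                     [(1, 2009, 1, 15), (2, 2009, 1, 20), (3, 2009, 2, 3)])
    conn.executemany("INSERT INTO fact_download VALUES (?, ?, ?)",
                     [(1, 1, 1), (1, 2, 2), (1, 1, 3)])
    return conn


def monthly_downloads(conn):
    """Total files downloaded per month, joining the fact table to the date dimension."""
    return conn.execute(
        """SELECT d.year, d.month, COUNT(*)
           FROM fact_download f JOIN dim_date d ON d.date_id = f.date_id
           GROUP BY d.year, d.month
           ORDER BY d.year, d.month"""
    ).fetchall()
```

With this layout, a monthly-trend report like the ones listed under Metrics Progress becomes a single join and group-by, and new breakdowns (per user, per file) only need a different dimension in the query.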

Data Versioning

  • Data changes, even after publication

    • Errors in simulation, processing, metadata, etc.

  • Critically important that data publishers and consumers can identify which version of data they are working with

    • Changes to data may affect results of analyses

  • Versioning previously handled manually

    • Adequate for moderate amounts of closely controlled data (current production archives)

    • Insufficient for global scale, especially with replication (key driver)

  • Now putting versioning on formal footing

    • In collaboration with BADC, MPIM

  • Initial focus on identification of key use cases, developing and evaluating preliminary software designs
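
One way to let publishers and consumers confirm that they hold the same version of a data set is to derive an identifier from its contents. The Python sketch below is only an illustration of that idea and is not the design proposed in this presentation. The function name and the use of SHA-256 are assumptions.

```python
import hashlib


def dataset_fingerprint(files):
    """Return a digest that changes whenever any file name or content changes.

    files: dict mapping file path -> file content as bytes.
    Paths are processed in sorted order so the result does not depend on
    the order in which files were listed.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")  # separator between the name and the content digest
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()
```

Two sites holding identical files compute the same fingerprint, while a corrected or replaced file yields a different one, which is the kind of signal a mirror needs to detect that an update must be propagated.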

Proposed Versioning Software Design

Product Services: Delivering Visualization and Analysis to Users

  • Product Services provide a web-based easy-to-use interface to a vast array of interactive, science-relevant information products

    • Make plots in 1 and 2 dimensions along any axis or combination of two axes including animation along the time axis

    • Control plot appearance

    • Launch external tools either via scripts to access data in desktop tools or direct launch of Google Earth

    • Compare different data sets and variables in specialized user interface

    • Request server-side analysis and view the results

    • Support plots of curvilinear data grids and on-the-fly re-gridding to rectangular grids

  • Web-based administrative interface for cache management

Product Services Architecture

Designed to integrate many data types and products from many legacy applications into a unified user-controlled environment

Combines the incoming request with metadata to learn where the data are and what protocol is needed to read them, then instructs backend services to read the data and create products
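
The request flow described here (look up metadata, choose a protocol, hand off to a backend) can be sketched as a small dispatcher in Python. The catalog entries, URLs, function names, and the string results that stand in for real data reads are all hypothetical.

```python
# Hypothetical metadata catalog: data set name -> where the data live and how to read them.
CATALOG = {
    "tas_monthly": {"location": "http://node.example.org/dods/tas", "protocol": "opendap"},
    "pr_daily": {"location": "/archive/pr_daily.nc", "protocol": "file"},
}


def read_opendap(location):
    return f"remote read of {location}"  # stand-in for a real OPeNDAP read


def read_file(location):
    return f"local read of {location}"  # stand-in for reading a local file


# One backend reader per protocol named in the catalog.
BACKENDS = {"opendap": read_opendap, "file": read_file}


def handle_request(request, catalog=CATALOG, backends=BACKENDS):
    """Combine the request with metadata, then dispatch to the matching backend."""
    meta = catalog[request["dataset"]]
    data = backends[meta["protocol"]](meta["location"])
    return {"product": request["product"], "data": data}
```

Because the protocol is chosen from metadata rather than by the caller, a new data type can be supported by adding a catalog entry and a backend reader without changing the request format.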

Product Services Offer Diverse Capabilities (1/2)

Product Services provide a Web-based easy-to-use interface to a vast array of interactive, science-relevant information products

Launch external tools like Google Earth, Matlab and others

Compute on-the-fly analysis via efficient server-side functions and plot the result

Product Services Offer Diverse Capabilities (2/2)

Make comparisons along an axis and/or between data sets

Make comparisons along different cutting planes and/or between data sets
