
Yet Another Grid Project: The Open Science Grid at SLAC

Matteo Melani, Booker Bense and Wei Yang

SLAC

HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA


July 22nd, 2005

  • “The Open Science Grid Consortium today officially inaugurated the Open Science Grid, a national grid computing infrastructure for large scale science. The OSG is built and operated by teams from U.S. universities and national laboratories, and is open to small and large research groups nationwide from many different scientific disciplines.”

    - Science Grid This Week -



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Final thought



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Final thought


Once upon a time there was…

[Figure: ATLAS DC2 and CMS DC04 data challenges running on the shared grid infrastructure]

  • Goal: to build a shared Grid infrastructure to support opportunistic use of resources for stakeholders. The stakeholders are the NSF- and DOE-sponsored Grid projects (PPDG, GriPhyN, iVDGL) and the US LHC software program. A team of computer and domain scientists deployed (simple) services in a common infrastructure, with common interfaces, across existing computing facilities. It operated stably for over a year in support of computationally intensive applications and added communities without perturbation.

• 30 sites, ~3,600 CPUs



Vision (1)

The Open Science Grid: a production-quality national grid infrastructure for large-scale science.

  • Robust and scalable

  • Fully managed

  • Interoperates with other Grids



Vision (2)



What is the Open Science Grid? (Ian Foster)

  • Open

    • A new sort of multidisciplinary cyberinfrastructure community

    • An experiment in governance, incentives, architecture

    • Part of a larger whole, with TeraGrid, EGEE, LCG, etc.

  • Science

    • Driven by demanding scientific goals and projects that need results today (or yesterday)

    • Also a computer science experimental platform

  • Grid

    • Standardized protocols and interfaces

    • Software implementing infrastructure, services, applications

    • Physical infrastructure—computing, storage, networks

  • People who know & understand these things!



OSG Consortium

Members of the OSG Consortium are those organizations that have made agreements to contribute to the Consortium.

  • DOE Labs: SLAC, BNL, FNAL

  • Universities: CCR (University at Buffalo)

  • Grid Projects: iVDGL, PPDG, Grid3, GriPhyN

  • Experiments: LIGO, US CMS, US ATLAS, CDF Computing, D0 Computing, STAR, SDSS

  • Middleware Projects: Condor, Globus, SRM Collaboration, VDT

    Partners are those organizations with whom we are interfacing to work on interoperation of grid infrastructures and services.

  • LCG, EGEE, TeraGrid



Character of Open Science Grid (1)

  • Pragmatic approach:

    • Experiments/users drive requirements

    • “Keep it simple and make it more reliable”

  • Guaranteed and opportunistic use of resources provided through Facility-VO contracts.

  • Validated, supported core services based on VDT and NMI Middleware. (Currently GT3 based, moving soon to GT4)

  • Adiabatic evolution to increase scale and complexity.

  • Services and applications contributed from external projects. Low threshold to contributions and new services.



Character of Open Science Grid (2)

  • Heterogeneous Infrastructure

    • All Linux, but different versions of the software stack at different sites.

  • Site autonomy:

    • Distributed ownership of resources with diverse local policies, priorities, and capabilities.

    • “no” Grid software on compute nodes.

      • But users want direct access for diagnosis and monitoring:

      • Quote from physicist on CDF: “Experiments need to keep under control the progress of their application to take proper actions, helping the Grid to work by having it expose much of its status to the users”



Architecture



Services

  • Computing Service: GRAM from GT3.2.1 + patches (a submission sketch follows this list)

  • Storage Service: SRM interface (v1.1) as the common interface to storage (DRM and dCache); most sites use NFS + GridFTP, and we are looking into an SRM-xrootd solution

  • File Transfer Service: GridFTP

  • VO Management Service: INFN VOMS

  • AA: GUMS v1.0.1, PRIMA v0.3, gPlazma

  • Monitoring Service: MonALISA v1.2.34, MDS

  • Information Service: jClarens v0.5.3-2, GridCat

  • Accounting Service: partially provided by MonALISA
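To make the Computing Service entry concrete, here is a minimal sketch (not from the slides) of submitting a test job through a GRAM gatekeeper with the standard globus-job-run client. The gatekeeper contact string is a placeholder rather than the real PROD_SLAC endpoint, and a valid grid proxy (grid-proxy-init or voms-proxy-init) is assumed to already exist:

"""Minimal sketch, under assumptions: run a test job on an OSG Compute
Element through the GRAM client tools shipped with the VDT."""

import subprocess

# Hypothetical contact string: gatekeeper host plus the LSF jobmanager.
GATEKEEPER = "osg-gate.slac.stanford.edu/jobmanager-lsf"

def run_test_job() -> str:
    # globus-job-run submits, waits for completion, and returns the job's stdout.
    result = subprocess.run(
        ["globus-job-run", GATEKEEPER, "/bin/hostname"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print("Test job ran on worker node:", run_test_job())

Production workloads are submitted through Condor-G from a submit host, as the architecture slides below show, but the jobmanager contact string takes the same host/jobmanager form.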


Open Science Grid Release 0.2

[Architecture diagram, courtesy of Ruth Pordes: users and portals submit through a submit host running Condor-G with Globus RSL; catalogs and displays (GridCat, ACDC, MonALISA) and Virtual Organization management sit outside the site boundary (WAN -> LAN). Inside the site, the Compute Element runs GT2 GRAM with the grid monitor, monitoring and information services (GridCat, ACDC, MonALISA, SiteVerify), and a Storage Element exposing SRM v1.1 and GridFTP. Worker nodes see $WN_TMP plus common space across WNs ($DATA on the local SE, $APP, $TMP). Identity and roles are carried in X.509 certificates; authentication mapping is done by GUMS, called out through PRIMA and gPlazma, and feeds batch queue job priority.]


OSG 0.4

[Architecture diagram, courtesy of Ruth Pordes: the same structure as Release 0.2, with additions including service discovery, an Edge Services Framework (Xen) hosting lifetime-managed VO services, GT4 GRAM alongside GT2 GRAM on the Compute Element, a full local Storage Element, accounting, job monitoring with exit-code reporting, a GIP + BDII information network, and bandwidth management at some sites.]



Software distribution

  • Software is contributed by individual OSG members into collections we call "packages".

  • OSG provides collections of software for common services built on top of the VDT to facilitate participation.

  • There is very little OSG-specific software, and we strive to use standards-based interfaces where possible.

  • OSG software packages are currently distributed as Pacman caches.

  • The latest release, on May 24th, is based on VDT 1.3.6.



OSG’s deployed Grids

OSG Consortium operates two grids:

  • OSG is the production grid:

    • Stable; for sustained production

    • 14 VOs

    • 38 sites, ~5,000 CPUs, 10 VOs.

    • Support provided

    • http://osg-cat.grid.iu.edu/

  • OSG-ITB is the test and development grid:

    • For testing new services, technologies, versions…

    • 29 sites, ~2,400 CPUs

    • http://osg-itb.ivdgl.org/gridcat/



Operations and support

  • VOs are responsible for 1st level support

  • Distributed Operations and Support model from the outset.

  • Difficult to explain, but scalable, and it puts most support “locally”.

    • The key core component is a central ticketing system with automated routing and import/export capabilities to other ticketing systems and text-based information.

  • Grid Operations Center (iGOC)

  • Incident Response Framework, coordinated with EGEE.



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Final thought



PROD_SLAC

  • 100 job slots available in true resource sharing

  • 0.5 TB of disk space

  • [email protected]

  • LSF 5.1 batch system

  • VO role-based authentication and authorization

  • VOs: Babar, US ATLAS, US CMS, LIGO, iVDGL



PROD_SLAC

  • 4 Sun V20z dual-processor machines

  • Storage is provided via NFS: three directories, $APP, $DATA and $TMP (a usage sketch follows this list)

  • We do not run Ganglia or GRIS
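As a usage sketch for the NFS areas above: the environment variable names APP, DATA and WN_TMP are assumptions taken from the slides' $APP/$DATA/$WN_TMP notation, and the job logic is purely illustrative:

"""Illustrative sketch only: how a grid job might use the shared software
area, the shared data area, and node-local scratch on this site."""

import os
import shutil
import subprocess

app = os.environ["APP"]                     # VO-installed software area (shared, read-mostly)
data = os.environ["DATA"]                   # shared data area on the local SE
scratch = os.environ.get("WN_TMP", "/tmp")  # node-local scratch space

workdir = os.path.join(scratch, "myjob")
os.makedirs(workdir, exist_ok=True)

# List what the VO has installed in the shared software area (illustrative only).
subprocess.run(["ls", app], check=True)

# Produce output in local scratch, then copy it back to the shared $DATA area.
output = os.path.join(workdir, "output.txt")
with open(output, "w") as f:
    f.write("stand-in for real application output\n")
shutil.copy(output, os.path.join(data, "output.txt"))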



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Conclusions



AA using GUMS



UNIX account issue

The Problem:

  • SLAC Unix accounts did not fit the OSG model:

    • Normal SLAC accounts have too many default privileges

    • Gatekeeper-AFS interaction is problematic

The Solution:

  • Created a new class of Unix accounts just for the Grid (a provisioning sketch follows this list)

    • Created a new process for this new type of account

  • The new account type has minimal privileges:

    • no email, no login access

    • home directory on Grid-dedicated NFS, no write access beyond the Grid NFS server
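A sketch of how such a restricted account could be provisioned with standard useradd options is shown below; the group name, NFS path and example account name are hypothetical, while the /bin/false shell and the reserved UID range come from the surrounding slides:

"""Sketch, under assumptions: provision one restricted grid-only account."""

import subprocess

GRID_GROUP = "gridusers"        # single UNIX group for all grid accounts (name assumed)
GRID_HOME_BASE = "/grid/home"   # hypothetical Grid-dedicated NFS area

def create_grid_account(name: str, uid: int) -> None:
    # Grid UIDs live in a reserved range and are never reused after revocation.
    if not (1_000_000 < uid < 10_000_000):
        raise ValueError("UID outside the reserved grid range")
    subprocess.run(
        ["useradd",
         "-u", str(uid),
         "-g", GRID_GROUP,
         "-d", f"{GRID_HOME_BASE}/{name}",
         "-m",                   # create the home directory on the grid NFS server
         "-s", "/bin/false",     # no interactive login
         name],
        check=True,
    )

create_grid_account("osguscms00001", 1_000_001)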



DN-UID mapping

  • Each (DN, voGroup) pair is mapped to a unique UNIX account

  • No group mapping

  • Account name schema: osg + VOname + VOgroup + NNNNN (see the sketch after this list)

    Example:

    A DN in the USCMS VO (voGroup /uscms/) => osguscms00001

    iVDGL VO, group mis (voGroup /ivdgl/mis) => osgivdglmis00001

  • If revoked, the account name/UID will never be reused (unlike ordinary UNIX accounts)

  • Grid UNIX accounts are tracked like ordinary UNIX user accounts (in RES); 1,000,000 < UID < 10,000,000

  • All Grid UNIX accounts belong to one single UNIX group

  • Home directories are on Grid-dedicated NFS; shells are /bin/false
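The naming schema above can be illustrated with a short sketch; the in-memory counters below stand in for state that GUMS and the account management system keep persistently, so this is illustrative only and not the actual GUMS mapping code:

"""Illustrative sketch: the (DN, voGroup) -> account-name schema from this slide."""

import re

_serials = {}    # account-name stem -> next unused 5-digit serial number
_assigned = {}   # (DN, voGroup) -> account name; never reused once issued

def account_name(dn: str, vo_group: str) -> str:
    """Map a certificate DN and VO group path to a name like osgivdglmis00001,
    following the schema osg + VOname + VOgroup + NNNNN."""
    key = (dn, vo_group)
    if key in _assigned:          # stable mapping: same (DN, voGroup) -> same account
        return _assigned[key]
    parts = [p for p in vo_group.split("/") if p]   # '/ivdgl/mis' -> ['ivdgl', 'mis']
    stem = "osg" + "".join(re.sub(r"[^a-z0-9]", "", p.lower()) for p in parts)
    serial = _serials.get(stem, 1)
    _serials[stem] = serial + 1
    name = f"{stem}{serial:05d}"
    _assigned[key] = name
    return name

# Examples from the slide:
print(account_name("/DC=org/DC=doegrids/OU=People/CN=A User", "/uscms/"))     # osguscms00001
print(account_name("/DC=org/DC=doegrids/OU=People/CN=B User", "/ivdgl/mis"))  # osgivdglmis00001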



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • OSG-LSF integration

  • Running applications: US CMS and US ATLAS

  • Final thought



GRAM Issue

The Problem:

  • The gatekeeper polls job status over-aggressively; this overloads the LSF scheduler

  • Race conditions: the LSF job manager is unable to distinguish between an error condition and a loaded system (we usually have more than 2K jobs running)

    • May be reduced in the next version of LSF

The Solution:

  • Rewrote part of the LSF job manager (lsf.pm)

  • Looking into writing a custom bjobs wrapper with local caching (see the sketch below)
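A minimal sketch of the caching idea, assuming a file-based cache and a 60-second refresh interval (both made up rather than SLAC's actual values): one real bjobs -u all query serves all callers within the interval, so gatekeeper jobmanagers and monitoring tools no longer each hit the LSF scheduler directly:

"""Sketch, under assumptions: a caching wrapper around `bjobs -u all -w`."""

import os
import subprocess
import time

CACHE_FILE = "/var/tmp/bjobs.cache"   # hypothetical cache location
CACHE_TTL = 60                        # seconds between real queries to LSF

def cached_bjobs() -> str:
    """Return `bjobs -u all -w` output that is at most CACHE_TTL seconds old."""
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
            with open(CACHE_FILE) as f:
                return f.read()
    except OSError:
        pass                          # no cache yet: fall through to a real query
    out = subprocess.run(["bjobs", "-u", "all", "-w"],
                         capture_output=True, text=True, check=True).stdout
    tmp = CACHE_FILE + ".tmp"
    with open(tmp, "w") as f:         # write-then-rename so readers never see a partial file
        f.write(out)
    os.replace(tmp, CACHE_FILE)
    return out

if __name__ == "__main__":
    print(cached_bjobs(), end="")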



The straw that broke the camel’s back

  • SLAC has more than 4,000 job slots being scheduled by a single machine

  • We operate in full production mode: operational disruption has to be avoided at all costs

  • Too many monitoring tools (ACDC, MonALISA, users’ own monitoring tools…) can easily overload the LSF scheduler by running bjobs -u all

  • The implementation of monitoring is a concern!



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Final thought



US CMS Application

Intentionally left blank!

We could run 10-100 jobs right away



US ATLAS Application

  • ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere

  • SLAC batch nodes don't have internet access

  • The solution is to host a clone of the database within the SLAC network or to create a tunnel (see the sketch below)
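A sketch of the tunnelling option, assuming an SSH port forward through a gateway host that does have outside connectivity; the gateway host, database endpoint and port numbers are placeholders, not real SLAC or CERN values:

"""Sketch, under assumptions: let batch nodes without internet access reach
a remote database server through a local port forward on a gateway host."""

import subprocess

GATEWAY = "grid-gw.slac.stanford.edu"          # hypothetical host with WAN access
DB_HOST, DB_PORT = "atlas-db.cern.ch", 3306    # hypothetical remote database endpoint
LOCAL_PORT = 13306

# ssh -N opens the port forward without running a remote command; the job
# then connects to localhost:13306 instead of the remote server directly.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", f"{LOCAL_PORT}:{DB_HOST}:{DB_PORT}", GATEWAY]
)
# Call tunnel.terminate() once the job's database work is done.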



Outline

  • OSG in a nutshell

  • OSG at SLAC: “PROD_SLAC” site

  • Authentication and Authorization in OSG

  • LSF-OSG integration

  • Running applications: US CMS and US ATLAS

  • Final thought



Final thought

“PARVA SED APTA MIHI SED…” (“Small, but suited to me, but…”)

- Ludovico Ariosto



QUESTIONS?



Spare



Governance


Ticket Routing Example

[Diagram: physical view of the ticket flow between a user in VO1, Support Centers SC-C and SC-F, and Resource Provider RP3, with the steps numbered 1-10 as listed below; OSG infrastructure and SC private infrastructure are shown as separate regions.]

User in VO1 notices problem at RP3, notifies their SC (1).

SC-C opens ticket (2) and assigns to SC-F.

SC-F gets automatic notice (3) and contacts RP3 (4).

Admin at RP3 fixes and replies to SC-F (5).

SC-F notes resolution in ticket and marks it resolved (6).

SC-C gets automatic notice of update to ticket (7).

SC-C notifies user of resolution (8).

User can complain if dissatisfied and SC-C can re-open ticket (9,10).
