
Yet Another Grid Project: The Open Science Grid at SLAC

Yet Another Grid Project: The Open Science Grid at SLAC. Matteo Melani, Booker Bense and Wei Yang, SLAC. HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA.


Presentation Transcript


  1. Yet Another Grid Project: The Open Science Grid at SLAC. Matteo Melani, Booker Bense and Wei Yang, SLAC. HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA

  2. July 22nd, 2005 • “The Open Science Grid Consortium today officially inaugurated the Open Science Grid, a national grid computing infrastructure for large scale science. The OSG is built and operated by teams from U.S. universities and national laboratories, and is open to small and large research groups nationwide from many different scientific disciplines.” (Science Grid This Week)

  3. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought

  4. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought

  5. Once upon a time there was… [slide shows ATLAS DC2 and CMS DC04 results] • Goal: build a shared Grid infrastructure to support opportunistic use of resources for stakeholders. Stakeholders are the NSF- and DOE-sponsored Grid projects (PPDG, GriPhyN, iVDGL) and the US LHC software program. • A team of computer and domain scientists deployed (simple) services with a common infrastructure and common interfaces across existing computing facilities. • Operating stably for over a year in support of computationally intensive applications. • Added communities without perturbation. • 30 sites, ~3600 CPUs

  6. Vision (1) The Open Science Grid: A production quality national grid infrastructure for large scale science. • Robust and scalable • Fully managed • Interoperates with other Grids

  7. Vision (2)

  8. What is the Open Science Grid? (Ian Foster) • Open • A new sort of multidisciplinary cyberinfrastructure community • An experiment in governance, incentives, architecture • Part of a larger whole, with TeraGrid, EGEE, LCG, etc. • Science • Driven by demanding scientific goals and projects who need results today (or yesterday) • Also a computer science experimental platform • Grid • Standardized protocols and interfaces • Software implementing infrastructure, services, applications • Physical infrastructure—computing, storage, networks • People who know & understand these things!

  9. OSG Consortium Members of the OSG Consortium are those organizations that have made agreements to contribute to the Consortium. • DOE Labs: SLAC, BNL, FNAL • Universities: CCR - University at Buffalo • Grid Projects: iVDGL, PPDG, Grid3, GriPhyN • Experiments: LIGO, US CMS, US ATLAS, CDF Computing, D0 Computing, STAR, SDSS • Middleware Projects: Condor, Globus, SRM Collaboration, VDT • Partners are those organizations with whom we are interfacing to work on interoperation of grid infrastructures and services: LCG, EGEE, TeraGrid

  10. Character of Open Science Grid (1) • Pragmatic approach: • Experiments/users drive requirements • “Keep it simple and make more reliable” • Guaranteed and opportunistic use of resources provided through Facility-VO contracts. • Validated, supported core services based on VDT and NMI middleware (currently GT3 based, moving soon to GT4). • Adiabatic evolution to increase scale and complexity. • Services and applications contributed from external projects. Low threshold for contributions and new services.

  11. Character of Open Science Grid (2) • Heterogeneous Infrastructure • All Linux, but different versions of the software stack at different sites. • Site autonomy: • Distributed ownership of resources with diverse local policies, priorities, and capabilities. • “No” Grid software on compute nodes. • But users want direct access for diagnosis and monitoring: • Quote from a CDF physicist: “Experiments need to keep under control the progress of their application to take proper actions, helping the Grid to work by having it expose much of its status to the users”

  12. Architecture

  13. Services • Computing Service: GRAM from GT3.2.1 + patches • Storage Service: SRM interface (v1.1) as the common interface to storage (DRM and dCache); most sites use NFS + GridFTP; we are looking into an SRM-xrootd solution • File Transfer Service: GridFTP • VO Management Service: INFN VOMS • AA: GUMS v1.0.1, PRIMA v0.3, gPlazma • Monitoring Service: MonALISA v1.2.34, MDS • Information Service: jClarens v0.5.3-2, GridCat • Accounting Service: partially provided by MonALISA
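
  To give a feel for how a job reaches the computing service, here is a minimal sketch (not from the talk) of submitting a test job through pre-WS GRAM with a simple RSL description. It assumes the Globus client tools shipped with the VDT are on the PATH; the gatekeeper contact string is hypothetical.

```python
import subprocess

# Hypothetical gatekeeper contact: host plus the LSF job manager.
CONTACT = "osg-gate.slac.stanford.edu/jobmanager-lsf"

# A simple RSL job description: run /bin/hostname once.
RSL = "&(executable=/bin/hostname)(count=1)"

# globusrun -o streams the job's stdout back to the submitting client.
result = subprocess.run(["globusrun", "-o", "-r", CONTACT, RSL],
                        capture_output=True, text=True)
print(result.stdout)
```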

  14. Open Science Grid Release 0.2 [architecture diagram, courtesy of Ruth Pordes: user portal and submit host (Condor-G, Globus RSL); catalogs and displays (GridCat, ACDC, MonALISA); VO management; identity and roles via X.509 certificates; site boundary (WAN -> LAN); compute elements with GT2 GRAM and grid monitor; worker nodes with $WN_TMP; storage element with SRM v1.1 and GridFTP; authentication mapping via GUMS, with PRIMA/gPlazma and batch-queue job priority; monitoring and information (GridCat, ACDC, MonALISA, SiteVerify); common space across worker nodes: $DATA (local SE), $APP, $TMP]

  15. OSG 0.4 [architecture diagram, courtesy of Ruth Pordes: as in Release 0.2, plus service discovery via GIP + a BDII network, a GT4 GRAM CE, an Edge Service Framework (Xen) with lifetime-managed VO services, a full local SE, accounting, job monitoring and exit-code reporting, and bandwidth management at some sites]

  16. Software distribution • Software is contributed by individual OSG members into collections we call “packages”. • OSG provides collections of software for common services built on top of the VDT to facilitate participation. • There is very little OSG-specific software, and we strive to use standards-based interfaces where possible. • OSG software packages are currently distributed as Pacman caches. • The latest release (May 24th) is based on VDT 1.3.6

  17. OSG’s deployed Grids The OSG Consortium operates two grids: • OSG is the production grid: • Stable; for sustained production • 14 VOs • 38 sites, ~5,000 CPUs • Support provided • http://osg-cat.grid.iu.edu/ • OSG-ITB is the test and development grid: • For testing new services, technologies, versions… • 29 sites, ~2,400 CPUs • http://osg-itb.ivdgl.org/gridcat/

  18. Operations and support • VOs are responsible for 1st level support • Distributed Operations and Support model from the outset. • Difficult to explain, but scalable and putting most support “locally”. • Key core component is central ticketing system with automated routing and import/export capabilities to other ticketing systems and text based information. • Grid Operations Center (iGOC) • Incident Response Framework, coordinated with EGEE.

  19. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought

  20. PROD_SLAC • 100 job slots available in true resource sharing • 0.5 TB of disk space • osg-support@slac.stanford.edu • LSF 5.1 batch system • VO role-based authentication and authorization • VOs: BaBar, US ATLAS, US CMS, LIGO, iVDGL

  21. PROD_SLAC • 4 Sun V20z dual-processor machines • Storage is provided over NFS: 3 directories, $APP, $DATA and $TMP • We do not run Ganglia or GRIS
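
  As a rough illustration of how a Grid job might use these three shared areas, here is a sketch that assumes the site publishes their locations through environment variables; the variable names, paths and input file below are assumptions, since the slide only names the directories.

```python
import os
import shutil

# Assumed environment variables and fallback paths for the three NFS areas.
app = os.environ.get("APP", "/nfs/osg/app")     # VO software installations
data = os.environ.get("DATA", "/nfs/osg/data")  # shared input/output data
tmp = os.environ.get("TMP", "/nfs/osg/tmp")     # shared scratch space

# Typical pattern: run the VO's software from $APP, stage input from $DATA,
# work in a private subdirectory of $TMP, and copy results back to $DATA.
workdir = os.path.join(tmp, "example-job")
os.makedirs(workdir, exist_ok=True)
shutil.copy(os.path.join(data, "input.dat"), workdir)   # hypothetical input file
```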

  22. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Conclusions

  23. AA using GUMS

  24. UNIX account issue The Problem: • SLAC UNIX accounts did not fit the OSG model: • A normal SLAC account has too many default privileges • The gatekeeper-AFS interaction is problematic The Solution: • Created a new class of UNIX accounts just for the Grid • Created a new process for this new type of account • The new account type has minimum privileges: • no email, no login access, • home directories on Grid-dedicated NFS, no write access beyond the Grid NFS server
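
  A minimal sketch of what provisioning one of these locked-down accounts could look like with useradd; the group name, NFS path and UID below are assumptions made for illustration, not SLAC's actual procedure.

```python
import subprocess

def create_grid_account(name, uid):
    """Create a minimally privileged Grid account: no login shell,
    home on the Grid-dedicated NFS area, one shared UNIX group."""
    subprocess.run([
        "useradd",
        "-u", str(uid),                  # UID in the reserved Grid range
        "-g", "osggrid",                 # shared UNIX group (name assumed)
        "-d", "/nfs/osg/home/" + name,   # Grid-dedicated NFS home (path assumed)
        "-s", "/bin/false",              # no interactive login
        name,
    ], check=True)

create_grid_account("osguscms00001", 1000001)
```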

  25. DN-UID mapping • Each (DN, voGroup) pair is mapped to a unique UNIX account • No group mapping • Account name schema: osg + VOname + VOgroup + NNNNN. Examples: a DN in the US CMS VO (voGroup /uscms/) => osguscms00001; the iVDGL VO, group mis (voGroup /ivdgl/mis) => osgivdglmis00001 (see the sketch below) • If revoked, the account name/UID is never reused (unlike ordinary UNIX accounts) • Grid UNIX accounts are tracked like ordinary UNIX user accounts (in RES), with 1,000,000 < UID < 10,000,000 • All Grid UNIX accounts belong to one single UNIX group • Home directories are on Grid-dedicated NFS; shells are /bin/false
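
  The naming schema itself is simple enough to sketch; the toy serial argument below is an assumption, since the slide only gives the pattern and the two examples.

```python
def grid_account_name(vo_name, vo_group, serial):
    """Account name schema from the slide: osg + VOname + VOgroup + NNNNN.
    vo_group is the VO group with slashes stripped (empty for the root group)."""
    return "osg{}{}{:05d}".format(vo_name, vo_group, serial)

# The two examples from the slide:
print(grid_account_name("uscms", "", 1))     # -> osguscms00001
print(grid_account_name("ivdgl", "mis", 1))  # -> osgivdglmis00001
```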

  26. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • OSG-LSF integration • Running applications: US CMS and US ATLAS • Final thought

  27. GRAM Issue The Problem: • The gatekeeper polls job status too aggressively; it overloads the LSF scheduler • Race conditions: the LSF job manager is unable to distinguish between an error condition and a loaded system (we usually have more than 2K jobs running) • This may be reduced in the next version of LSF The Solution: • Rewrite part of the LSF job manager: lsf.pm • Looking into writing a custom bjobs with local caching (see the sketch below)
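
  The caching bjobs idea could look roughly like this sketch (ours, not the deployed code): a wrapper that refreshes a cached snapshot of the real bjobs output at most once per TTL, so repeated polling by the job manager and the monitoring tools does not hit the LSF scheduler every time. The paths, the TTL and the renamed real binary are assumptions.

```python
import os
import subprocess
import sys
import time

REAL_BJOBS = "/usr/local/lsf/bin/bjobs.real"   # renamed real binary (path assumed)
CACHE = "/var/tmp/bjobs-all.cache"             # shared cache file (path assumed)
TTL = 60                                       # refresh at most once per minute

def main():
    stale = (not os.path.exists(CACHE)
             or time.time() - os.path.getmtime(CACHE) > TTL)
    if stale:
        # Query the scheduler once for everything and cache the snapshot.
        out = subprocess.run([REAL_BJOBS, "-u", "all"],
                             capture_output=True, text=True).stdout
        with open(CACHE, "w") as cache:
            cache.write(out)
    # A real wrapper would filter the snapshot according to its arguments;
    # here we simply dump the cached output.
    with open(CACHE) as cache:
        sys.stdout.write(cache.read())

if __name__ == "__main__":
    main()
```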

  28. The straw that broke the camel’s back • SLAC has more than 4,000 job slots being scheduled by a single machine • We operate in full production mode: operational disruption has to be avoided at all costs • Too many monitoring tools (ACDC, MonALISA, users’ monitoring tools…) can easily overload the LSF scheduler by running bjobs -u all • The implementation of monitoring is a concern!

  29. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought

  30. US CMS Application Intentionally left blank! We could run 10-100 jobs right away

  31. US ATLAS Application • ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere • SLAC batch nodes don't have internet access • The solution is to host a clone of the database within the SLAC network or to create a tunnel (see the sketch below)
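
  The tunnel option could be as simple as an SSH port forward through a host that does have outbound connectivity, as in this sketch; the hostnames and ports are illustrative, not the actual ATLAS setup.

```python
import subprocess

LOCAL_PORT = 10520
DB_HOST, DB_PORT = "conditions-db.cern.ch", 10520   # hypothetical remote DB server
GATEWAY = "grid-gateway.slac.stanford.edu"          # hypothetical host with WAN access

# -N: no remote command; -L: forward LOCAL_PORT to DB_HOST:DB_PORT via GATEWAY.
subprocess.Popen(["ssh", "-N", "-L",
                  "{}:{}:{}".format(LOCAL_PORT, DB_HOST, DB_PORT),
                  GATEWAY])

# Jobs on the batch node would then point their database client at
# localhost:10520 instead of the remote server.
```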

  32. Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought

  33. Final thought “Parva sed apta mihi, sed…” (“Small, but fit for me, yet…”) - Ludovico Ariosto

  34. QUESTIONS?

  35. Spare

  36. Governance

  37. Physical View: Ticketing Routing Example [diagram of numbered steps across the OSG infrastructure and the SCs’ private infrastructure] A user in VO1 notices a problem at RP3 and notifies their SC (1). SC-C opens a ticket (2) and assigns it to SC-F. SC-F gets automatic notice (3) and contacts RP3 (4). The admin at RP3 fixes the problem and replies to SC-F (5). SC-F notes the resolution in the ticket and marks it resolved (6). SC-C gets automatic notice of the update to the ticket (7). SC-C notifies the user of the resolution (8). The user can complain if dissatisfied, and SC-C can re-open the ticket (9, 10).
