
Status of the European DataGrid Project

Status of the European DataGrid Project. Charles Loomis (LAL/CNRS) LAL December 12, 2002. European DataGrid (EDG). European DataGrid EU-funded, 3-year project (2001-3) Goals: develop grid middleware deploy onto working testbed demonstrate grid technology with working applications


Presentation Transcript


  1. Status of the European DataGrid Project Charles Loomis (LAL/CNRS) LAL December 12, 2002

  2. European DataGrid (EDG) European DataGrid • EU-funded, 3-year project (2001-3) • Goals: • develop grid middleware • deploy onto working testbed • demonstrate grid technology with working applications • Strong application component unique! 6 Partners; 21 Associates

  3. EDG Goals Transparent Access • Allow users transparent access to authorized resources with single authentication. • Allow users to delegate authorization to services. • High-level selection of resources, including datasets. Virtual Organizations • Allow groups of people to acquire resources from sites. • Allow organization to manage resource use among members. Optimization • Allow optimal use of resources at site and grid levels.

  4. EDG Architecture Global Batch System: • Centralized Architecture. • Heavy infrastructure. [Diagram: User Interface, Resource Broker, Information Systems (MDS, Replica Catalogs), and Site X (Computing Element, Storage Element). Arrow labels: submit, retrieve, query, publish state. Note on diagram: broker chooses optimal site for job.]
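The brokering step on the architecture slide (sites publish state, the broker queries it and chooses a site for the job) can be sketched roughly as below. This is a toy illustration, not the EDG Resource Broker: the `SiteState` fields, the `choose_site` function, and the ranking rule (most free CPUs among sites that accept the user's VO and hold the requested dataset) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SiteState:
    """Hypothetical snapshot of what a site publishes to the information system."""
    name: str
    free_cpus: int
    allowed_vos: set = field(default_factory=set)
    datasets: set = field(default_factory=set)


def choose_site(sites, vo, dataset=None):
    """Return the eligible site with the most free CPUs, or None if none is eligible.

    A site is eligible when it accepts the VO, has at least one free CPU,
    and (if a dataset is requested) holds a copy of that dataset.
    The ranking rule is an assumption, not EDG's actual policy.
    """
    candidates = [
        s for s in sites
        if vo in s.allowed_vos
        and s.free_cpus > 0
        and (dataset is None or dataset in s.datasets)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.free_cpus)
```

Even in this reduced form the broker needs a current, complete view of every site, which is the dependency on the information system that the slides go on to criticize.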

  5. Comments Optimization of Resources ⇒ Centralized Architecture • Resource Broker • must know state of grid and schedule effectively • requires knowledge of site policies and user/job details • Information System (MDS & RC) • must respond quickly to high-volume and high-rate queries Central Points-of-Failure • Resource Broker (redundancy at VO-level) • MDS (unique hierarchy; some redundancy possible) With high-rate submissions: • RB requires lots of memory, CPU, disk space. • MDS requires lots of file descriptors, CPU.

  6. Authentication & Authorization [Diagram: User, Certification Authorities (~15 National CAs: France, INFN, …), Virtual Organizations (~10 Different VOs: ATLAS, CMS, …), and Site X (Computing Element, Storage Element). Arrow labels: request certificate; register; accept/reject request; proxy sent for authentication; Update CRL; retrieve membership lists. Example certificate subject: /C=FR/O=CNRS/OU=LAL/CN=Charles Loomis/Email=loomis@lal.in2p3.fr]
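The site-side checks named on this slide (a CRL updated from the CAs, membership lists retrieved from the VOs, then accept or reject) can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the use of certificate serial numbers as CRL entries, and the dictionary layout of the membership lists are hypothetical, and the cryptographic verification of the proxy and certificate chain that a real service performs is left out entirely.

```python
def parse_subject(dn):
    """Split a slash-separated certificate subject into (key, value) pairs.

    Simplification: assumes no value itself contains a '/'.
    """
    parts = [p for p in dn.split("/") if p]
    return [tuple(p.split("=", 1)) for p in parts]


def site_decision(subject, serial, revoked_serials, vo_members):
    """Return (accepted, detail) for an already-verified certificate.

    revoked_serials: set of serial numbers taken from the cached CRL (assumed layout).
    vo_members: dict mapping VO name -> set of member subjects (assumed layout).
    """
    if serial in revoked_serials:
        return (False, "certificate revoked")
    vos = sorted(vo for vo, members in vo_members.items() if subject in members)
    if not vos:
        return (False, "not a member of any supported VO")
    return (True, vos)


# Usage with the subject shown on the slide; the serial and lists are made up.
dn = "/C=FR/O=CNRS/OU=LAL/CN=Charles Loomis/Email=loomis@lal.in2p3.fr"
fields = parse_subject(dn)
accepted, detail = site_decision(dn, 42, {7, 13}, {"atlas": {dn}, "cms": set()})
```

Because both inputs are cached copies, the decision is only as fresh as the last CRL update and membership-list retrieval.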

  7. Comments Infrastructure • ~15 National CAs as production service • 10 Virtual Organizations • High-Energy Physics: ALICE, BaBar, ATLAS, CMS, DZero, LHCb • Earth Observation • Biomedical Applications • Misc.: WP6, ITeam, Guidelines Limited Central Points-of-Failure • VO Membership Server (for VO members) • Certification Authority (for CA members) Caching and infrequent updates minimize problems but compromise security.

  8. Development Testbed (1.4.0) To facilitate testing and integration of new middleware. 3 sites (3 countries) Deployment & Use Application Use • CMS Event Simulation • ATLAS Event Simulation • Regular Tutorial Use Stability • Filled Grid this week! Production Testbed (1.4.0) • For applications to use & stress software in “semi-production” environment. • 8 sites (5 countries)

  9. Globus Experience GSI Security (OK) • Some limitations with size of proxies. GridFTP (OK) • Recent protocol change because of security fix. Replica Catalog (OK, limited) • Unannounced, unnecessary schema change. GateKeeper/JobManager (Poor) • Race conditions under load leading to failures. • High resource use; poor response to errors. Information System-MDS (Poor) • Serious problems with stability. • Query times increase dramatically under load.

  10. Globus Experience (cont.) Interaction • Generally responsive to identified problems. • Little advance warning of major changes. • Schema changes. • Rewrite of JobManager/Batch System interface. Testing • Essentially non-existent by Globus. • Major delays in EDG because of MDS and Gatekeeper. • Finding/testing/fixing of major problems done outside Globus. Globus “high-level” services inappropriate for production environment.

  11. Condor Experience CondorG • Used for reliable job submission from Resource Broker. • Responsive to problems and provide quick fixes. • Encountered few problems in our testing. Condor • Supported “batch” system for EDG. • Largely untested, but expect to use with next major release.

  12. Typical Failure Modes Operations: • CRL generation (CA); CRL update (sites) • Network accessibility (VO LDAP servers) • Misconfiguration of services (typically SE) Poor implementation (BUGS) • Most catastrophic ones eliminated. Resource Exhaustion • File descriptors, ports, disk space. Design Limitations • Central points-of-failure (RB, MDS).

  13. Future Developments EDG Plans: • Advanced data management • Real “Storage Element”. • Replica Location Service (distributed Replica Catalog) • Replica Manager (higher-level user interface) • Job Management • job splitting, checkpointing • interactive jobs • Replace MDS with R-GMA. • More robust, consistent security model. • Local resources better tied to grid credentials. OGSA (Open Grid Services Architecture) • New services written as web services. • Probably no complete conversion within the EDG lifetime.

  14. SlashGrid Grid File System: • Uses grid credentials for access to local files. • Frees grid user from local unix account. • Simplifies mapping of users to accounts. • Allows true account recycling. More Uses: • Could hide remote access to data. • Provide compatibility to Globus security model. • … Implementation: • User-space daemon on top of CODA kernel module. • Plug-in interface allows easy extension.

  15. Authentication & Authorization (VOMS) [Diagram: User, Certification Authorities (~15 National CAs: France, INFN, …), VOMS, and Site X (Computing Element, Storage Element). Arrow labels: request certificate; request “ticket”; accept/reject request; proxy sent for authentication and authorization; Update CRL.] Local Authorization Decision!
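The "local authorization decision" in the VOMS flow can be sketched as below: the user's "ticket" travels with the proxy, and the site applies its own policy to the attributes it carries. Everything here is an assumption made for illustration: the `Ticket` fields, the shape of `site_policy`, and the ban list are hypothetical, and verification of the ticket's signature is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical stand-in for the attributes a VOMS 'ticket' carries."""
    subject: str
    vo: str
    groups: frozenset


def local_decision(ticket, site_policy, banned_subjects=frozenset()):
    """Decide locally whether to accept a request.

    site_policy: dict mapping VO name -> set of groups the site accepts (assumed layout).
    banned_subjects: subjects the site refuses regardless of VO attributes.
    """
    if ticket.subject in banned_subjects:
        return False
    allowed_groups = site_policy.get(ticket.vo)
    if allowed_groups is None:
        return False  # the site does not support this VO at all
    # Accept if the ticket carries at least one group the site allows.
    return bool(ticket.groups & allowed_groups)
```

In this sketch the VO states what the user is and the site alone decides what that entitles the user to, using only the attributes presented with the request.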

  16. Conclusions Software & Testbed: • Production-quality security infrastructure in place. • Production and development testbeds: • Deployed. • Starting to see heavy use by end-users. • Reasonable stability for the first time. • Failure modes: • Moving from bugs and operations problems to design and resource limitations. Unanswered Questions: • Can optimization be achieved? At what level? • How can resources be limited, reserved, and shared? • Can efficient scheduling be done with inhomogeneous site policies?
