gLite: forthcoming middleware components


Presentation Transcript


  1. gLite: forthcoming middleware components
  Claudio Grandi, INFN and CERN
  SA3 All Hands meeting, CERN, 27-29 November 2006

  2. Summary of current activities
  • Security
    • Enabling glexec on Worker Nodes - testing on the preview
    • Address user and security policy requirements in VOMS, VOMSAdmin
    • Proxy renewal library repackaged without WMS dependencies
    • Shibboleth short-lived credential service and interaction with VOMS
  • Job Management
    • Working on certification of the "3.1 branch" version of WMS and LB
    • Certification of the DGAS accounting system on INFNGrid (then SA3)
    • ICE-CREAM, G-PBox, Job Provenance testing on the preview
  • Data Management
    • Adding support for SRM v2.2 in DPM, GFAL, FTS, lcg_utils
    • Common rfio library in DPM & Castor, Xrootd plug-ins in DPM
    • LFC: SSL session reuse, Oracle backend studies
    • FTS proxy renewal and delegation
    • Working with NA4 on the new Encrypted Data Storage based on GFAL/LFC
  • Information
    • Adding a bootstrapping procedure in Service Discovery
    • Improvements in R-GMA
    • Development for GLUE 1.3

  3. VOMS/VOMSAdmin
  • New features:
    • Insertion of AA certificates into the AC
    • VO-specific directories in vomsdir
    • Generic attributes (see the sketch of AC contents below)
    • Full Java APIs
    • Support for PKCS12-formatted credentials
    • Support for GT4
  • Foreseen changes:
    • Java APIs fully in sync with the C/C++ APIs
    • libvomsapic.so.* is going to be dropped
    • Version 2.0 of VOMSAdmin
  • New VOMSAdmin 2.0:
    • Redesign for stability, performance and extensibility
    • New persistence manager based on Hibernate: better performance, easier to manage, extend and debug
    • "Usable" web administrative interface: search, pagination, authorization work; based on Struts-Tiles MVC technology
    • Simple notification service: admins can send email notifications to specific roles/groups
    • Lightweight registration service: users can ask for VO membership, group membership, role assignment
  • First release: 12/06 (all the features of 1.2.x); new features: 3/07
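The attribute data that VOMS embeds in the attribute certificate can be pictured, very roughly, as a set of FQANs plus (name, value, qualifier) generic attributes bound to a holder DN. The following is a purely illustrative Python model, not the VOMS C/C++ or Java API; all class, field and method names are invented for the example:

# Purely illustrative model of the attribute payload a VOMS AC carries:
# FQANs plus (name, value, qualifier) generic attributes for a holder DN.
# This is not the VOMS API; names here are invented for the sketch.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple

@dataclass
class VomsAttributes:
    holder_dn: str
    vo: str
    fqans: List[str]
    generic_attributes: List[Tuple[str, str, str]] = field(default_factory=list)
    not_after: datetime = field(default_factory=lambda: datetime.utcnow() + timedelta(hours=12))

    def has_role(self, group: str, role: str) -> bool:
        # e.g. has_role("/cms/HeavyIons", "production")
        return f"{group}/Role={role}" in self.fqans

ac = VomsAttributes(
    holder_dn="/DC=org/DC=example/CN=Some User",
    vo="cms",
    fqans=["/cms", "/cms/HeavyIons/Role=production"],
    generic_attributes=[("nickname", "suser", "cms")],
)
print(ac.has_role("/cms/HeavyIons", "production"))  # True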

  4. glexec
  • glexec is used by the gLite Computing Elements to change the local uid as a function of the user identity (DN & FQAN)
  • Several VOs submit 'pilot' jobs with a single identity for the whole VO
    • The pilot job gets user jobs in 'some' way and executes them with the placeholder's identity
    • The site does not 'see' the original submitter
  • Allowing the VO pilot job to run glexec on the WN could 'recover' the user identity and isolate the user job from the pilot job
  • Issue: sites don't like to run sudo code on the WNs
    • Also unknown batch-system behaviour
  • A possibility is to run glexec in "null" mode: log the uid-change request but do not perform it (see the sketch below)
    • The original user identity is recovered but there is no isolation of user and pilot
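The difference between a real uid switch and the "null" (log-only) mode can be sketched as follows. This is a conceptual illustration only: glexec itself is a setuid C wrapper built on LCAS/LCMAPS, and every name below (the mapping function, the account name, the null_mode flag) is hypothetical.

# Conceptual sketch only (glexec is a C setuid wrapper using LCAS/LCMAPS);
# it contrasts a real uid switch with the "null"/logging-only mode above.
import logging, os, pwd, subprocess

logging.basicConfig(level=logging.INFO)

def map_to_local_account(dn: str, fqan: str) -> str:
    # Stand-in for the LCMAPS DN/FQAN -> pool-account mapping.
    return "pool_user001"

def run_as_user(dn: str, fqan: str, cmd: list, null_mode: bool) -> int:
    account = map_to_local_account(dn, fqan)
    logging.info("uid-change request: %s %s -> %s", dn, fqan, account)
    if null_mode:
        # Log the mapping but keep the pilot's identity: traceability
        # without isolation between pilot and user payload.
        return subprocess.call(cmd)
    uid = pwd.getpwnam(account).pw_uid  # requires root privileges
    return subprocess.call(cmd, preexec_fn=lambda: os.setuid(uid))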

  5. gLite WMS and L&B
  • WMProxy: web service interface to the WMS
    • Decouples interaction with the UI from the internal procedures (logging to L&B, match-making, submission)
    • Added a load limiter to avoid service hangs (users may go to other instances)
  • Renewal of VOMS proxies
  • Support for compound jobs (Compound, Parametric, DAGs)
    • One-shot submission of a group of jobs
    • Submission time reduction (single call to the WMProxy server)
    • Shared input sandboxes
    • Single job ID to manage the group (individual job IDs still available)
  • Support for 'scattered' input/output sandboxes
  • Support for shallow resubmission (sketched below)
    • Resubmission happens in case of failure only when the job didn't start
  • Issues:
    • Needed fine tuning to work at the production scale
    • Difficulties in the management of DAGs
      • Will work to decouple Compound and Parametric jobs from DAGs
    • Implied a migration to Condor 6.7.19 (gang-matchmaking problem); still has manageability issues
  • Longer-term work for high-availability WMS and L&B
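The shallow-resubmission behaviour can be sketched as a retry loop that only resubmits when the payload never started. The submit_once() callable and its (succeeded, payload_started) return value are assumptions made for this sketch, not WMS code; the retry budget plays the role of the JDL ShallowRetryCount attribute.

# Sketch of the shallow-resubmission idea (not the WMS implementation):
# resubmit only when the failure happened before the payload started.
from typing import Callable, Tuple

def run_with_shallow_retries(submit_once: Callable[[], Tuple[bool, bool]],
                             shallow_retry_count: int = 3) -> bool:
    """submit_once() returns (succeeded, payload_started)."""
    for attempt in range(1 + shallow_retry_count):
        succeeded, payload_started = submit_once()
        if succeeded:
            return True
        if payload_started:
            # Deep failure: the job actually ran and failed; do not retry blindly.
            return False
        # Shallow failure (e.g. the CE never accepted the job): safe to retry.
    return False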

  6. gLite WMS - CMS Results
  • ~20000 jobs submitted
    • 3 parallel UIs
    • 33 Computing Elements
    • 200 jobs/collection
    • Bulk submission
  • Performance
    • ~2.5 h to submit all jobs (0.5 seconds/job)
    • ~17 hours to transfer all jobs to a CE (3 seconds/job), i.e. 26000 jobs/day
  • Job failures
    • Negligible fraction of failures due to the gLite WMS
    • Either application errors or site problems
  • For comparison: ~7000 jobs/day on the LCG RB
  (By A. Sciabà, 27 September 2006)

  7. gLite WMS - ATLAS Results
  • Official Monte Carlo production (from 7 June to 9 November)
    • Up to ~3000 jobs/day
    • Less than 1% of jobs failed because of the WMS in a period of 24 hours
  • Synthetic tests
    • Shallow resubmission greatly improves the success rate for site-related problems
    • Efficiency = 98% after at most 4 submissions
  (By A. Sciabà, 10 November 2006)

  8. Job Provenance
  • Stores and retains data on finished jobs
    • Complementary to L&B
    • Optionally adds input sandbox files and additional user annotations
    • Provides data-mining capabilities
    • Allows job re-execution
  • JP Primary Storage (JPPS)
    • Stores data coming from L&B
    • L&B can be purged after job completion
    • Now a 1-n relation between L&B and JPPS
  • JP Index Server (JPIS)
    • User front-end (WSDL interface)
    • May be VO-customized (predefined queries & indices); a minimal sketch follows
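A rough picture of what is kept per finished job, and of the kind of indexed query a JPIS-style front-end can answer, might look like the following. The record fields and the query helper are invented for illustration; they are not the JP schema or WSDL interface.

# Invented, minimal picture of a per-job provenance record and an index:
# final job data kept after the L&B purge, sandbox references and user
# annotations, with a simple index for JPIS-like filtering.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class JPRecord:
    job_id: str
    owner_dn: str
    final_state: str                      # e.g. "Done" or "Aborted"
    ce: str
    input_sandbox: List[str] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)

class JPIndex:
    def __init__(self, records: List[JPRecord]):
        self.by_owner: Dict[str, List[JPRecord]] = {}
        for r in records:
            self.by_owner.setdefault(r.owner_dn, []).append(r)

    def failed_jobs_of(self, owner_dn: str) -> List[JPRecord]:
        # Example of a predefined query: all aborted jobs of one user.
        return [r for r in self.by_owner.get(owner_dn, [])
                if r.final_state == "Aborted"]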

  9. WMS/LB/JP - Proposed Strategy
  • Phase out the LCG-RB (GT2, SL3); proposed for end of April 2007
  • Switch to the gLite WMS v3.1 as soon as possible (under applications' test now)
    • Cleaner and more manageable code
      • Features were developed in 3.1 and back-ported to 3.0 in a quick-and-dirty way
    • Better error reporting and logging, and performance improvements
    • Critical and non-critical bug fixes (e.g. the gang-matchmaking bug)
    • A few new functionalities
  • Continue to address current management issues
    • Number of WMProxy processes, logmonitor, interlogd
  • Deploy L&B on a separate node (more WMSs attached to the same L&B)
  • Start the deployment of JP (in test on the preview test-bed now)
    • JPPS: start with one or two instances per VO, statically linked by the L&Bs
    • JPIS: deploy a few per VO and the client code on the UIs
    • Purge the L&Bs after the information goes to JP

  10. Computing Element
  • Three flavours available now:
    • LCG-CE (GT2 GRAM): in production now but will be phased out next year
    • gLite-CE (GSI-enabled Condor-C): already deployed but still needs thorough testing and tuning, being done now
    • CREAM (WS-I based interface): deployed on the JRA1 preview test-bed; after a first testing phase it will be certified and deployed together with the gLite-CE
      • Our contribution to the OGF-BES group for a standard WS-I based CE interface
      • CREAM and WMProxy demo at SC06!
  • BLAH is the interface to the local resource manager (via plug-ins) for CREAM and the gLite-CE
    • Information pass-through: pass parameters to the LRMS to help job scheduling (see the sketch below)
  [Diagram: WMS and clients, information system (bdII, R-GMA, CEMon), site Computing Element with glexec + LCAS/LCMAPS, BLAH, the LRMS and the WNs]
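The plug-in idea behind BLAH and the information pass-through can be sketched as a generic submit call that dispatches to a per-LRMS plug-in and forwards scheduling hints into the batch-system command line. This is illustrative only, not the BLAH interface; the hint names and plug-in functions are made up, while qsub/bsub and their -q and -l options are the usual PBS/LSF flags.

# Sketch (not BLAH): per-LRMS plug-ins plus pass-through of scheduling hints.
from typing import Dict, List

def submit_pbs(executable: str, hints: Dict[str, str]) -> List[str]:
    args = ["qsub"]
    if "queue" in hints:
        args += ["-q", hints["queue"]]
    if "walltime" in hints:
        args += ["-l", f"walltime={hints['walltime']}"]
    return args + [executable]

def submit_lsf(executable: str, hints: Dict[str, str]) -> List[str]:
    args = ["bsub"]
    if "queue" in hints:
        args += ["-q", hints["queue"]]
    return args + [executable]

PLUGINS = {"pbs": submit_pbs, "lsf": submit_lsf}

def submit(lrms: str, executable: str, **hints: str) -> List[str]:
    return PLUGINS[lrms](executable, hints)

print(submit("pbs", "/opt/vo/run_job.sh", queue="grid", walltime="01:30:00"))
# ['qsub', '-q', 'grid', '-l', 'walltime=01:30:00', '/opt/vo/run_job.sh']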

  11. ICE/CREAM/CEMon
  • CREAM: lightweight web-service Compute Element
    • WSDL interface
    • See the OGSA-HPCP-WG demo at SC06: http://forge.ggf.org/sf/wiki/do/viewPage/projects.ogsa-hpcp-wg/wiki/HomePage
    • Direct submission from the UI is possible
    • Improved security via glexec: no "fork scheduler"
    • Will support bulk jobs on the CE, with optimization of the staging of input sandboxes for jobs with shared files (sketched below)
  • ICE: Interface to CREAM Environment
    • Includes Condor-G functionalities for submission to CREAM
    • Already in the gLite 3.1 WMS
  • CEMon: CE and job status
    • Allows notification of job status changes
    • CE status also published via the bdII
  • Under test on the preview test-bed
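The input-sandbox optimization for bulk jobs can be sketched as follows: when jobs in the same bulk request share input files, each distinct file is staged once into a cache and then linked into the per-job directories. The directory layout, the jobs mapping and the commented-out fetch step are placeholders for this sketch, not CREAM code.

# Sketch of deduplicated sandbox staging for a bulk submission.
import hashlib
import os
from typing import Dict

def stage_bulk_sandboxes(jobs: Dict[str, Dict[str, str]], cache_dir: str) -> None:
    """jobs maps job_id -> {sandbox_filename: source_url_or_path}."""
    os.makedirs(cache_dir, exist_ok=True)
    staged: Dict[str, str] = {}
    for job_id, sandbox in jobs.items():
        job_dir = os.path.join(cache_dir, job_id)
        os.makedirs(job_dir, exist_ok=True)
        for name, source in sandbox.items():
            if source not in staged:
                # First job that needs this file: fetch it once into the cache.
                cached = os.path.join(cache_dir, hashlib.sha1(source.encode()).hexdigest())
                # fetch(source, cached)  # placeholder for the actual transfer
                staged[source] = cached
            # Every job gets a link to the single cached copy.
            os.symlink(staged[source], os.path.join(job_dir, name))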

  12. CREAM tests on the preview WMS
  • Submit a bunch of 1000 jobs to the WMS with ICE and JC switched off
  • Turn on ICE or JC
  • Measure the throughput to the LRMS and the job success rate
  • CREAM inefficiency probably dominated by bug #20357 (BLAH)
  • gLite-CE limited throughput due to bug #21529 (max 100 jobs/CE handled by Condor)
  • The Torino CE had DGAS running (with Maarten's patch)
  [Diagram: WM filelists feeding ICE and JC + Condor + LM, which submit through BLAH to CREAM and the gLite-CE, and through the jobmanager to the LCG-CE, each in front of an LRMS]

  13. CE - Proposed Strategy
  • Support the LCG-CE with GT2 on SL3 until June 2007
    • Note that it may be used as a front-end to gLite 3.1 clusters
  • Assume that the application interface to the CE is Condor-G
    • Sites that support VOs needing direct GRAM access have to maintain the LCG-CE and the gLite-CE together while the applications migrate to Condor-G submission
  • GT4 pre-WS jobmanagers will be added to the gLite-CE packaging
    • EGEE will not provide certification for this at the moment
    • Sites that are interested in this should become part of the PPS
    • Relevant at the end of life of the LCG-CE
  • CREAM-CE
    • Certification as soon as users are happy on the preview test-bed
    • JRA1 should investigate adding a GT4-WS GRAM interface
    • JRA1 should continue to work with OGF on standard interfaces

  14. DGAS
  • Resource metering: collect usage records on the CE (see the sketch below)
    • Combines information from the LRMS (CPU, ...) and the Grid (DN, ...)
    • Working on a standard LRMS usage-record format with OSG
  • Site HLR: stores and possibly aggregates information from local resources
  • VO or central HLR: stores and possibly aggregates information from all sites
    • Build a hierarchy of 2nd-level HLRs to facilitate information collection
  • Currently information is sent to APEL from the site HLR
    • In future, regions will be able to choose to send information to APEL from any layer
  • Access to (high-level) HLRs depends on VOMS groups/roles
  • Test of LCG-CE sensors on INFNGrid, then SA3 certification
  [Diagram: DGAS layers - resource metering at the resource layer, site HLR at the site layer, 2nd-level and VO or central HLRs at the user/VO layer, feeding APEL and the GOC]
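The resource-metering step amounts to merging LRMS accounting fields with grid-level identity into one usage record and handing it to the site HLR. The field names and the in-memory "site HLR" below are made up for this sketch and do not follow the DGAS or OSG usage-record schema.

# Illustrative merge of LRMS and grid information into one usage record.
from dataclasses import dataclass, asdict

@dataclass
class UsageRecord:
    job_id: str
    user_dn: str
    fqan: str
    cpu_time_s: int
    wall_time_s: int
    site: str

def build_record(lrms_info: dict, grid_info: dict, site: str) -> UsageRecord:
    return UsageRecord(
        job_id=lrms_info["job_id"],
        user_dn=grid_info["dn"],
        fqan=grid_info["fqan"],
        cpu_time_s=lrms_info["cpu"],
        wall_time_s=lrms_info["wall"],
        site=site,
    )

site_hlr = []  # stand-in for the site-level HLR store
site_hlr.append(asdict(build_record(
    {"job_id": "12345.ce01", "cpu": 5400, "wall": 6200},
    {"dn": "/DC=org/DC=example/CN=Some User", "fqan": "/cms/Role=production"},
    site="INFN-EXAMPLE")))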

  15. GPBOX
  • Distributed VO policy management tool
    • Interface to define fine-grained VO policies based on VOMS groups and roles
    • Stores policies and propagates them to sites
    • Enforcement of policies at sites; sites may accept or reject policies
    • May be interfaced to dynamic sources of information, e.g. an accounting system, to provide fair shares
  • GPBOX plug-ins for LCAS/LCMAPS are invoked during authorization
  • Being tested on the preview test-bed now
    • Long-term solution
    • Tests on the PPS based on VOViews only (available with GLUE 1.2)
  • Standards compliant (RBAC, XACML, GSI)
  • Example policy (see the sketch below): group A may use high- and low-priority CEs, group B only low-priority CEs
  [Diagram: VOMS server and VO PBox propagating policies to the site PBox; the WMS steers group A to CEHIGH and CELOW, group B to CELOW]
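The example policy from the slide can be written down as a simple mapping from VOMS group to allowed CEs. The policy format below is invented purely for illustration (real G-PBox policies are expressed in XACML); the group and CE names follow the slide's example.

# Minimal sketch of the slide's example VO policy and its evaluation.
POLICY = {
    "/vo/groupA": {"CEHIGH", "CELOW"},  # group A: high- and low-priority CEs
    "/vo/groupB": {"CELOW"},            # group B: low-priority CEs only
}

def authorized(fqan_group: str, ce: str) -> bool:
    return ce in POLICY.get(fqan_group, set())

assert authorized("/vo/groupA", "CEHIGH")
assert not authorized("/vo/groupB", "CEHIGH")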

  16. Data Management
  • Most of the work in DM is for the support of SRM 2.2
  • New FTS v2.0
    • New schema and state machines (sketched below)
      • Reviewed database schema to avoid row locking: performance improvement and reduced deadlock risk
    • Hooks for extensibility (e.g. VO policy, pre-staging)
    • Delegation: no need to pass the MyProxy pass-phrase anymore; a VOMS proxy is used to contact the SRM
    • SRM 2.2: the new version will be backwards compatible with the old one
    • Future development (low priority): pre-staging, state notification, integrated cataloguing
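A transfer state machine of the kind a new schema introduces can be sketched as a table of allowed transitions plus a guard. The state names and transitions below are an assumption made for the example, not the actual FTS 2.0 states.

# Illustrative transfer state machine with an explicit transition table.
TRANSITIONS = {
    "Submitted": {"Ready", "Canceled"},
    "Ready": {"Active", "Canceled"},
    "Active": {"Done", "Failed"},
    "Failed": {"Ready"},   # eligible for retry
}

def advance(state: str, new_state: str) -> str:
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = "Submitted"
for s in ("Ready", "Active", "Done"):
    state = advance(state, s)
print(state)  # Done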

  17. gLite development plans
  • Complete the migration to VDT 1.3.X and support for SL(C)4 and 64-bit
  • Complete the migration to the ETICS build system
  • Work according to the work plans available at: https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEgLiteWorkPlans
  • In particular:
    • Continue work on making all services VOMS-aware, including job priorities
    • Improve error reporting and logging/monitoring of services
    • Consolidation, in particular of WMS and LB
    • Support for all batch systems in the production infrastructure on the CE
    • Use the information pass-through by BLAH to control job execution on the CE
    • Complete support for SRM v2.2
    • Complete the new Encrypted Data Storage based on GFAL/LFC
    • Complete and test glexec on Worker Nodes
    • Standardization of usage records for accounting
  • Interoperation with other projects and adherence to standards
  • Collaboration with EUChinaGrid on IPv6 compliance
