
The SAM-Grid / LCG Interoperability Test Bed




  1. The SAM-Grid / LCG Interoperability Test Bed Gabriele Garzoglio (garzogli@fnal.gov) Speaker: Pierre Girard (pierre.girard@in2p3.fr)

  2. Overview • The Interoperability Test Bed • Motivations • Architecture • Status Report • Lesson learned / Problems encountered • Still discussing… • Conclusions

  3. Motivations for the interoperability project • The SAM-Grid is a convenient meta-computing system for the RunII experiments because it offers… • …transparent access to the experiment data through SAM • …integrated application management (job environment preparation, application-sensitive policies, job aggregation) • But deployment is expensive… • The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid data and application management

  4. Basic Architecture • [Diagram: Flow of Job Submission through the SAM-Grid, the SAM-Grid / LCG Forwarding Node, and LCG; SAM-Grid VO-Specific Services; arrow legend "Offers services to …"] • Main issues to track down: • Accessibility of the services • Usability of the resources • Scalability

  5. Service/Resource Multiplicity • [Diagram: the SAM-Grid with several Forwarding Nodes (FW), LCG Clusters (C), and VO-Services (S). Legend: Network Boundaries, Forwarding Node, LCG Cluster, VO-Service (SAM), Job Flow, Offers Service]

  6. Current Test Bed Configuration • [Diagram: the SAM-Grid with Forwarding Nodes (FW), LCG Clusters (C), and VO-Services (S). Sites labeled: Wuppertal, RAL, Clermont-Ferrand, Imperial College, Lancaster, CCIN2P3. Legend: Network Boundaries, Forwarding Node, LCG Cluster, Integration in Progress, VO-Service (SAM), Job Flow, Offers Service]

  7. Job Scheduling System Adaptation I • The SAM-Grid sees the FW node as another gateway • The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation): the Batch System is one of them. • Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.

  8. Job Scheduling System Adaptation II • This mechanism is flexible enough that it allowed the adaptation of SAM-Grid to LCG • Job Management (submit, status poll, kill, output gathering, …) is implemented via an LCG “scheduler” handler • The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere)
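The handler idea on slides 7 and 8 can be illustrated with a minimal sketch. It is not the SAM-Grid code: the class names, the method set, and the `ui-submit` / `ui-status` / `ui-cancel` command names are all placeholders invented for illustration, and the real LCG UI commands are deliberately not reproduced here. The point is only that the job-manager talks to one abstract interface, and an LCG-specific handler forwards each call to the UI.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

# Hypothetical type: a function that runs a command line and returns its output.
CommandRunner = Callable[[List[str]], str]


class SchedulerHandler(ABC):
    """Uniform job-management interface a job-manager could call (assumed shape)."""

    @abstractmethod
    def submit(self, job_description: str) -> str: ...

    @abstractmethod
    def status(self, job_id: str) -> str: ...

    @abstractmethod
    def kill(self, job_id: str) -> None: ...


class LCGHandler(SchedulerHandler):
    """Forwards each operation to a UI command; command names are placeholders."""

    def __init__(self, run: CommandRunner) -> None:
        self._run = run

    def submit(self, job_description: str) -> str:
        return self._run(["ui-submit", job_description]).strip()

    def status(self, job_id: str) -> str:
        return self._run(["ui-status", job_id]).strip()

    def kill(self, job_id: str) -> None:
        self._run(["ui-cancel", job_id])


def fake_runner(argv: List[str]) -> str:
    """Stand-in for a real UI so the sketch runs anywhere."""
    return "job-001\n" if argv[0] == "ui-submit" else "Running\n"


handler = LCGHandler(fake_runner)
job_id = handler.submit("job.jdl")
print(job_id, handler.status(job_id))
```

Injecting the command runner keeps the handler testable without a grid UI installed; a handler for a local batch system would implement the same three methods.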

  9. Overview • The Interoperability Test Bed • Motivations • Architecture • Status Report • Lesson learned / Problems encountered • Still discussing… • Conclusions

  10. Status Report • We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid • Jobs land on the available LCG clusters • Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output • …see the SAM-Grid monitoring

  11. Problems/Lesson Learned I • Scratch management is the responsibility of the site OR the application. • DZero requirements on local scratch space • Cannot run on NFS because of intensive I/O • Need 4 GB of local space • SAM-Grid uses job wrappers to do “smart” scratch management (find the best scratch area to use) • These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …) • Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined
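As a rough illustration of the scratch selection described on slide 11, the sketch below picks the writable candidate directory with the most free space, subject to the 4 GB minimum. It is an assumption about how such a wrapper could work, not the SAM-Grid wrapper itself: the function names are hypothetical, `TMPDIR` is added as an assumed second variable next to the `$TMP_DIR` named on the slide, and the sketch does not try to detect NFS mounts.

```python
import os
import shutil
from typing import Callable, Iterable, List, Optional

MIN_FREE_BYTES = 4 * 1024**3  # the 4 GB requirement quoted on the slide


def free_bytes(path: str) -> int:
    """Free space on the file system that holds `path`."""
    return shutil.disk_usage(path).free


def pick_scratch(candidates: Iterable[str],
                 min_free: int = MIN_FREE_BYTES,
                 probe: Callable[[str], int] = free_bytes) -> Optional[str]:
    """Return the writable candidate with the most free space, or None."""
    best, best_free = None, -1
    for path in candidates:
        # Skip unset variables, missing directories, and read-only areas.
        if not path or not os.path.isdir(path) or not os.access(path, os.W_OK):
            continue
        free = probe(path)
        if free >= min_free and free > best_free:
            best, best_free = path, free
    return best


def candidates_from_env(names=("TMP_DIR", "TMPDIR")) -> List[str]:
    """Collect scratch candidates from environment variables (names assumed)."""
    return [os.environ[n] for n in names if os.environ.get(n)]


print(pick_scratch(candidates_from_env()))
```

When no variable is set the candidate list is empty and the function returns `None`, which is the situation the certification proposal on the slide aims to rule out.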

  12. Problems/Lesson Learned II • Use of the LCG brokers • Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM) • Needed administrative action to resolve the problem • Possibly mitigated since we can use multiple brokers (tested with Wuppertal and CCIN2P3 brokers)

  13. Problems/Lesson Learned III • Job Failure Analysis • In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox. • We observed problems retrieving the output of “aborted” LCG jobs • “Maradona” fails in handling the output • In this case, it is tough to understand what went wrong with the job
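The bookkeeping behind slide 13 can be sketched as follows: one SAM-Grid job owns several LCG jobs, and the forwarding node needs an overall state plus the list of jobs whose output never came back. The state names (`done`, `running`, `aborted`) and the data layout are assumptions made for this example, not the states used by the LCG middleware.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Maps an LCG job id to (state, path of the retrieved output or None).
Children = Dict[str, Tuple[str, Optional[str]]]


def summarize(children: Children) -> Tuple[str, Dict[str, int], List[str]]:
    """Return (overall state, per-state counts, ids of jobs with no output)."""
    counts = Counter(state for state, _ in children.values())
    no_output = sorted(job for job, (_, out) in children.items() if out is None)
    if counts.get("aborted"):
        overall = "failed"
    elif all(state == "done" for state, _ in children.values()):
        overall = "done"
    else:
        overall = "running"
    return overall, dict(counts), no_output


jobs = {
    "lcg-1": ("done", "/out/lcg-1.tgz"),
    "lcg-2": ("aborted", None),
    "lcg-3": ("running", None),
}
print(summarize(jobs))
```

Listing the jobs without output separately makes the aborted case visible even when the sandbox itself cannot be retrieved.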

  14. Problems/Lesson Learned IV • Resubmission of non-reentrant jobs • Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity • Problems overriding retrials of job submission from the JDL and the UI configuration • Is this a known bug? A configuration problem on our part?
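For reference, the kind of job description involved in slide 14 might be generated as below. This assumes that `RetryCount` is the JDL attribute controlling resubmission and that `Executable` and `InputSandbox` are written as shown; the helper name and file names are hypothetical. The slide reports that overriding retrials this way did not behave as expected, so the sketch shows the attempted setting, not a confirmed fix.

```python
from typing import List


def make_jdl(executable: str, input_sandbox: List[str], retry_count: int = 0) -> str:
    """Build a small JDL text; retry_count=0 asks for no resubmission (assumed)."""
    files = ", ".join(f'"{name}"' for name in input_sandbox)
    lines = [
        f'Executable = "{executable}";',
        f"InputSandbox = {{{files}}};",
        f"RetryCount = {retry_count};",
    ]
    return "[\n  " + "\n  ".join(lines) + "\n]"


print(make_jdl("wrapper.sh", ["wrapper.sh", "job.cfg"]))
```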

  15. Problems/Lesson Learned V • Network configuration • Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) to allow data handling control and transport • SAM should be modified to provide port range control
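Port range control, as requested on slide 15, usually means that a service only listens on ports inside a range the site firewall has opened. The sketch below shows the general technique with plain TCP sockets; it is not SAM code, and the function name and the port numbers are assumptions chosen for the example.

```python
import socket
from typing import Tuple


def bind_in_range(host: str, first: int, last: int) -> Tuple[socket.socket, int]:
    """Listen on the first free TCP port in [first, last]; raise if none is free."""
    for port in range(first, last + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port already in use (or not permitted): try the next one.
            sock.close()
            continue
        sock.listen()
        return sock, port
    raise OSError(f"no free port in {first}-{last}")


server, port = bind_in_range("127.0.0.1", 50000, 50050)
print("listening on", port)
server.close()
```

With such a range in place, a hosting site can open only those ports to the FW node and the worker nodes instead of allowing all incoming traffic.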

  16. Problems/Lesson Learned VI • SAM configuration • SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN) • SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces)

  17. Still discussing... I • What does it mean to certify LCG for a certain DZero activity? • For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase • The cluster processes a well-known dataset, then the results are compared with a reference result • What do we do for LCG? Should every individual cluster be certified? Should the LCG as a whole be certified? • The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …)
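The certification step on slide 17 (process a known dataset, compare with a reference) can be pictured with a small comparison routine. The quantity names, the numbers, and the tolerance are invented for illustration; the slides do not say which quantities DZero compares or how closely they must agree.

```python
import math
from typing import Dict, List


def compare_to_reference(result: Dict[str, float],
                         reference: Dict[str, float],
                         rel_tol: float = 1e-6) -> List[str]:
    """Return a list of discrepancies; an empty list means the cluster passes."""
    problems = []
    for name, expected in reference.items():
        if name not in result:
            problems.append(f"{name}: missing")
        elif not math.isclose(result[name], expected, rel_tol=rel_tol):
            problems.append(f"{name}: got {result[name]}, expected {expected}")
    return problems


reference = {"events": 10000.0, "mean_pt": 24.5}   # hypothetical reference values
candidate = {"events": 10000.0, "mean_pt": 24.7}   # hypothetical cluster output
print(compare_to_reference(candidate, reference))
```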

  18. Still discussing... II • Who operates the SAM-Grid / LCG interoperability system? • For the SAM-Grid DZero reprocessing, people at the facilities had interest in having their resources utilized: people at each facility have run operations submitting jobs to their own facilities • Running “operations” means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …) • How do we organize the people that operate the LCG interoperability system? Is one responsible person enough?

  19. Still discussing... III • Support on LCG • In case something goes wrong on the LCG, DZero has to learn the best channels to request support • What response can DZero expect now and in 2 years? • As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures. LCG will get reports for failures on the SAM-Grid side… and vice versa.

  20. Overview • The Interoperability Test Bed • Motivations • Architecture • Status Report • Lesson learned / Problems encountered • Still discussing… • Conclusions

  21. Conclusions / SAM • We are moving the test bed to “production” by • expanding the system • ramping up usage • We are discussing open issues in operating the interoperability system • LCG certification • Organizing the operations • Obtaining support for LCG problems • Our principal target production application is Monte Carlo for DZero

  22. Conclusions / LCG • Grid batch job environment variables • Proposal for standardization made at the last HEPIX and the last Operations Workshop (Bologna) • http://edms.cern.ch/document/630962 • What is the next step? How to proceed with implementation? • Make MW error handling easier • By using a well-defined set of MW error codes? • Suitable for automatic handling
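To show why a well-defined set of middleware error codes would suit automatic handling, here is a minimal sketch. The codes, their names, and the actions are entirely hypothetical; no such standard set is given in the slides.

```python
from enum import Enum


class MWError(Enum):
    """Hypothetical middleware error codes, invented for this example."""
    SANDBOX_FULL = 1
    OUTPUT_LOST = 2
    SUBMIT_TIMEOUT = 3


# Hypothetical policy: what an automated operator would do for each code.
ACTIONS = {
    MWError.SANDBOX_FULL: "switch to another broker",
    MWError.OUTPUT_LOST: "mark for recovery, do not resubmit",
    MWError.SUBMIT_TIMEOUT: "retry submission later",
}


def action_for(code: int) -> str:
    """Map a numeric code to an action; unknown codes go to a human."""
    try:
        return ACTIONS[MWError(code)]
    except ValueError:
        return "escalate to operator"


print(action_for(1), "|", action_for(99))
```

A fixed code set lets the dispatch table replace the manual reading of log files for the common failure cases, leaving only unknown codes to the operators.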

  23. More info at… • http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration.pdf • http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration-Lyon-report.pdf • http://samgrid.fnal.gov:8080/ • http://www-d0.fnal.gov/computing/grid/ • http://d0db.fnal.gov/sam/
