Grid Job and Information Management (JIM) for D0 and CDF

Grid Job and Information Management (JIM) for D0 and CDF. Gabriele Garzoglio for the JIM Team. Overview. Introduction Grid-level Management SAM-Grid = SAM + JIM Job Management Information Management Fabric-level Management Running jobs on grid resources Local sandbox management

Presentation Transcript


  1. Grid Job and Information Management (JIM) for D0 and CDF Gabriele Garzoglio for the JIM Team

  2. Overview • Introduction • Grid-level Management • SAM-Grid = SAM + JIM • Job Management • Information Management • Fabric-level Management • Running jobs on grid resources • Local sandbox management • The DZero Application Framework • Running MC at UWisc

  3. Context • D0 Grid project started in 2001-2002 to handle D0’s expanded needs for globally distributed computing • JIM complements the data handling system (SAM) with job and information management • JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK) • Collaborative effort with the experiments • CDF joined later in 2002

  4. History • Delivered JIM prototype for D0, Oct 10, 2002: • Remote job submission • Brokering based on data cached • Web-based monitoring • SC-2002 demo – 11 sites (D0, CDF), big success • May 2003 – started deployment of V1 • Now – working on running MC in production on the Grid

  5. Overview • Introduction • Grid-level Management • SAM-Grid = SAM + JIM • Job Management • Information Management • Fabric-level Management • Running jobs on grid resources • Local sandbox management • The DZero Application Framework • Running MC at UWisc

  6. SAM-Grid Logistics [architecture diagram; legend: flow of data, meta-data, job] • Components shown: User Interface, Submission, Global Job Queue, Resource Selector, Grid Client, Match Making, Info Gatherer, Info Manager, Info Collector, Global DH Services, SAM Naming Server, Site Data Handling, Local Job Handling, Cluster, XML DB server, SAM Log Server, Site Conf., Grid Gateway, SAM Station (+other servs), Glob/Loc JID map, Resource Optimizer, Local Job Handler (CAF, D0MC, BS, ...), SAM DB Server, Web Serv, SAM Stager(s), MDS, JIM Advertise, RC MetaData Catalog, Grid Monitoring, Info Providers, Worker Nodes, Bookkeeping Service, Cache, MSS, User Tools, Dist.FS, AAA, Site

  7. Job Management Highlights • We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster) • We consider 3 types of jobs • analysis: data intensive • Monte Carlo: CPU intensive • reconstruction: data and CPU intensive

  8. Job Management: Distinct JIM Features • Decision making is based on both: • Information that exists irrespective of jobs (resource description) • Functions of (job, resource) • Decision making is interfaced with the data handling middleware • Decision making is entirely within the Condor framework (no resource broker of our own): strong promotion of standards and interoperability • Brokering algorithms can be extended via plug-ins
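
The brokering idea on this slide can be illustrated with a minimal Python sketch. It is not JIM or Condor code: the names `Resource`, `Job`, `cached_files_rank`, `free_slots_rank`, and `select_resource`, as well as the sample sites, are hypothetical. Each resource carries a static description that exists irrespective of jobs, and every rank plug-in is a function of (job, resource); one plug-in scores a site by how many of the job's input files it already caches.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Resource:
    """Static resource description, independent of any job (hypothetical fields)."""
    name: str
    free_slots: int
    cached_files: Set[str] = field(default_factory=set)


@dataclass
class Job:
    """A job and the input files it needs (hypothetical fields)."""
    kind: str
    input_files: Set[str] = field(default_factory=set)


# A rank plug-in is any function of (job, resource) that returns a score.
RankPlugin = Callable[[Job, Resource], float]


def cached_files_rank(job: Job, res: Resource) -> float:
    """Score a resource by how many of the job's input files it already caches."""
    return float(len(job.input_files & res.cached_files))


def free_slots_rank(job: Job, res: Resource) -> float:
    """Small tie-breaker that favours resources with more free slots."""
    return res.free_slots / 1000.0


def select_resource(job: Job, resources: List[Resource],
                    plugins: List[RankPlugin]) -> Optional[Resource]:
    """Grid-level scheduling: pick the cluster with the highest summed rank."""
    # Requirement: the resource must be able to accept a job at all.
    candidates = [r for r in resources if r.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda r: sum(p(job, r) for p in plugins))


# Made-up sample data for illustration only.
sites = [
    Resource("siteA", free_slots=10, cached_files={"f1"}),
    Resource("siteB", free_slots=2, cached_files={"f1", "f2", "f3"}),
    Resource("siteC", free_slots=0, cached_files={"f1", "f2", "f3"}),
]
plugins = [cached_files_rank, free_slots_rank]

analysis_job = Job("analysis", input_files={"f1", "f2", "f3"})
mc_job = Job("monte carlo")  # CPU intensive, no input data

chosen_analysis = select_resource(analysis_job, sites, plugins)
chosen_mc = select_resource(mc_job, sites, plugins)
```

In this sketch, extending the brokering algorithm means appending another function to `plugins`; distribution of the job within the selected cluster is left to the local scheduler.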

  9. Job Management [diagram] • Components shown: User Interface, Submission Client, Match Making Service, Broker, Queuing System, Information Collector, JOB, Data Handling System, Execution Site #1 ... Execution Site #n, Computing Element, Storage Element, Grid Sensors

  10. Information Management • In JIM’s view, this includes: • configuration framework • resource description for job brokering • infrastructure for monitoring • Main features • Site (resource) and job monitoring • Distributed knowledge about jobs, etc. • Incremental knowledge building • GMA for current-state inquiries, logging for recent-history studies • All Web based

  11. Information Management via Site Configuration [diagram] • Elements shown: Main Site/cluster Config (XML), Template XML, XMLDB, XSLT transformations • Outputs shown: Resource Advertisement (classad), Monitoring Configuration (LDIF), Service Instantiation (XML)
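
The slide names a single site/cluster configuration in XML from which a classad and an LDIF document are derived. A minimal Python sketch of that "one source, several views" idea follows. It swaps in plain Python functions for the XSLT stylesheets named on the slide, and the XML layout, attribute names, and LDIF fields are all hypothetical, not JIM's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical site configuration; the element and attribute names are assumptions.
SITE_XML = """
<site name="example-site">
  <cluster name="example-cluster" batch_system="condor" worker_nodes="1000"/>
</site>
"""


def load_site(xml_text):
    """Parse the single source of truth into a plain dictionary."""
    root = ET.fromstring(xml_text)
    cluster = root.find("cluster")
    return {
        "site": root.get("name"),
        "cluster": cluster.get("name"),
        "batch_system": cluster.get("batch_system"),
        "worker_nodes": int(cluster.get("worker_nodes")),
    }


def to_classad(cfg):
    """Render a ClassAd-style attribute list for resource advertisement."""
    return "\n".join([
        'Site = "%s"' % cfg["site"],
        'Cluster = "%s"' % cfg["cluster"],
        'BatchSystem = "%s"' % cfg["batch_system"],
        "WorkerNodes = %d" % cfg["worker_nodes"],
    ])


def to_ldif(cfg):
    """Render an LDIF-style entry for the monitoring configuration."""
    return "\n".join([
        "dn: cluster=%s,site=%s" % (cfg["cluster"], cfg["site"]),
        "cluster: %s" % cfg["cluster"],
        "batchSystem: %s" % cfg["batch_system"],
        "workerNodes: %d" % cfg["worker_nodes"],
    ])


cfg = load_site(SITE_XML)
classad_text = to_classad(cfg)
ldif_text = to_ldif(cfg)
```

Because both views are generated from the same parsed configuration, a change to the site description only has to be made in one place.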

  12. Overview • Introduction • Grid-level Management • SAM-Grid = SAM + JIM • Job Management • Information Management • Fabric-level Management • Running jobs on grid resources • Local sandbox management • The DZero Application Framework • Running MC at UWisc

  13. Running jobs on Grid resources • The trend: Grid resources are not dedicated to a single experiment • Translation: • no daemons running on the worker nodes of a Batch System • no experiment specific software installed

  14. Running jobs on Grid resources • The situation today is transitioning: • Worker nodes typically access the software via shared FS: not scalable! • Generally, experiments can install specific services on a node close to the cluster. • Local resource configuration still too diverse to easily plug into the Grid

  15. The JIM local sandbox management • It keeps the job executable (from the Grid) at the head node and knows where its product dependencies are • It transports the software to the worker node and installs it there • It can instantiate services at the worker node • It sets up the environment for the job to run • It packages the output and hands it over to the Grid, so that it becomes available for download at the submission site
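
The sandbox steps listed on this slide (install the software on the worker node, set up the environment, run the job, package the output) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JIM implementation: the function name `run_in_sandbox` and the environment variables `SANDBOX_ROOT` and `SANDBOX_OUTPUT` are hypothetical, service instantiation is omitted, and the transport between head node and worker node is reduced to reading and writing local tarballs.

```python
import os
import subprocess
import tarfile
import tempfile


def run_in_sandbox(software_tarball, command, output_tarball, extra_env=None):
    """Unpack software into a scratch area, run a command, package its output.

    Returns the exit code of the command. The tarball is assumed to be trusted.
    """
    with tempfile.TemporaryDirectory() as scratch:
        # 1. Install the software on the (simulated) worker node.
        with tarfile.open(software_tarball) as tar:
            tar.extractall(scratch)

        # 2. Set up the environment for the job (variable names are assumptions).
        out_dir = os.path.join(scratch, "output")
        os.makedirs(out_dir, exist_ok=True)
        env = dict(os.environ)
        env["SANDBOX_ROOT"] = scratch
        env["SANDBOX_OUTPUT"] = out_dir
        env.update(extra_env or {})

        # 3. Run the job inside the scratch area.
        result = subprocess.run(command, cwd=scratch, env=env)

        # 4. Package the output so it can be handed back to the submitter.
        with tarfile.open(output_tarball, "w:gz") as tar:
            tar.add(out_dir, arcname="output")

        return result.returncode
```

A caller would pass the path to a software tarball, the command to run relative to the unpacked tree, and the path where the output tarball should be written; the job is expected to write its results under the directory named by `SANDBOX_OUTPUT`.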

  16. Running a DZero application • We have the JIM sandbox: where is the problem now? • The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but • Not all the DZero packages are RTE compliant • Users don’t have the experience or incentives to use it today

  17. Overview • Introduction • Grid-level Management • SAM-Grid = SAM + JIM • Job Management • Information Management • Fabric-level Management • Running jobs on grid resources • Local sandbox management • The DZero Application Framework • Running MC at UWisc

  18. Running Monte Carlo at UWisc • The University of Wisconsin offered DZero the opportunity to use a 1000-node non-dedicated Condor cluster • We are concentrating on putting it to use to run MC with mc_runjob (in production by year end)

  19. The challenges I • MC code is not RTE compliant today • Chain of 3-5 stages. Each binary is 50-200 MB, dynamically linked • The binaries are compiled from 40 packages (out of 621 in total for D0). These packages are needed at run time for RPC files • Root, Motif, X11, and Ace libraries are found as dependencies (for MC generators…) • MC tarballs exist but are hand-crafted (and bug-prone) every time. Size unpacked is 2 GB (versus 12-15 GB for the full D0 app tree).

  20. The challenges II • Nearly every advanced C++ feature, every libc library call, and every system call is used • One can get different results on two RedHat 7.2 systems • The total release tree takes N hours (up to 20+) to build: not something easy to do dynamically at a remote site

  21. Summary • The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management • JIM provides Fabric-level management tools for sandboxing • The applications need to be improved to run on Grid resources
