Evaluation of the Globus GRAM Service

Massimo Sgaravatto, INFN Padova

Presentation Transcript


  1. Evaluation of the Globus GRAM Service. Massimo Sgaravatto, INFN Padova

  2. Evaluation of GRAM Service [Diagram: jobs are submitted using Globus tools to GRAM instances at Site1, Site2, and Site3, which sit in front of Condor, LSF, and PBS; the GIS holds information on characteristics and status of local resources]

  3. Evaluation of GRAM Service
     • Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
     • GRAM as a uniform interface to different underlying resource management systems
     • “Cooperation” between GRAM and GIS
     • Evaluation of RSL as a uniform language to specify resources
     • Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines

  4. GRAM & fork system call [Diagram: client (Globus) submitting to a server (Globus) that starts the job via fork]

  5. GRAM & Condor [Diagram: client (Globus) submitting to a server, the Condor front-end machine (Globus, Condor), which feeds the Condor pool]

  6. GRAM & Condor
     • Tests considering:
       • Standard Condor jobs (relinked with the Condor library)
         • INFN WAN Condor pool configured as a Globus resource
         • ~200 machines spread across different sites
         • Heterogeneous environment
         • No single file system and UID domain
       • Vanilla jobs (“normal” jobs)
         • PC farm configured as a Globus resource
         • Single file system and UID domain

  7. GRAM & LSF [Diagram: client (Globus) submitting to a server, the LSF front-end machine (Globus, LSF), which feeds the LSF cluster]

  8. Results
     • Some bugs found and fixed (fixes included in the INFNGRID 1.1 distribution)
       • Standard output and error for vanilla Condor jobs
       • globus-job-status
       • …
     • Some bugs can be solved without major re-design and/or re-implementation:
       • For LSF, the RSL parameter (count=x) is translated into: bsub -n x …
         • This just allocates x processors and dispatches the job to the first one
         • Used for parallel applications
         • Should be: bsub … x times
         • Maybe we don’t need to solve this problem (see later…)
       • …
     • Two major problems:
       • Scalability
       • Fault tolerance
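
The (count=x) issue on LSF can be made concrete with a small sketch. The two function names are hypothetical and are not taken from the Globus LSF scripts; only `bsub -n` is an actual LSF option. The first function mirrors the translation described in the slide, the second mirrors the behaviour the slide says it should have (x independent submissions).

```python
import shlex


def lsf_commands_current(executable, count):
    """Translation described in the slide: one bsub asking for `count` processors."""
    return [f"bsub -n {count} {shlex.quote(executable)}"]


def lsf_commands_wanted(executable, count):
    """Behaviour the slide asks for: `count` independent bsub submissions."""
    return [f"bsub {shlex.quote(executable)}" for _ in range(count)]


if __name__ == "__main__":
    for line in lsf_commands_current("/diskCms/startcmsim.sh", 3):
        print(line)
    for line in lsf_commands_wanted("/diskCms/startcmsim.sh", 3):
        print(line)
```

With `count=3`, the first variant yields a single command line and the second yields three, which is the difference between one parallel allocation and three separate jobs.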

  9. Globus GRAM Architecture
     [Diagram: a client (pc1) runs globusrun against the Globus front-end machine (pc2), where a jobmanager hands the job to LSF/Condor/PBS/…]

     pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
          -f file.rsl

     file.rsl:
     & (executable=/diskCms/startcmsim.sh)
       (stdin=/diskCms/PythiaOut/filename)
       (stdout=/diskCms/Cmsim/filename)
       (count=1)

  10. Scalability
      • One jobmanager for each globusrun
      • What if I want to submit 1000 jobs?
        • 1000 globusrun invocations
        • 1000 jobmanagers running on the front-end machine!
      • Alternative: a single submission with (count=1000)

        % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

        file.rsl:
        & (executable=/diskCms/startcmsim.sh)
          (stdin=/diskCms/PythiaOut/filename)
          (stdout=/diskCms/CmsimOut/filename)
          (count=1000)

        • It is not possible to specify 1000 different input files and 1000 different output files in the RSL file
          • Condor handles this with $(Process)
        • Problems with job monitoring (globus-job-status)
      • Therefore (count=x) with x>1 is not very useful!
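
A client-side workaround for the missing per-job file names is to generate one RSL string per job, roughly what `$(Process)` does inside a Condor submit file. The sketch below is only an illustration: the function names and the `filename.<n>` naming scheme are assumptions, not something the slides specify.

```python
def rsl_for_job(process,
                executable="/diskCms/startcmsim.sh",
                in_dir="/diskCms/PythiaOut",
                out_dir="/diskCms/CmsimOut"):
    """Build a single-job RSL string whose stdin/stdout depend on the job index."""
    # "filename.<process>" is a hypothetical naming scheme chosen for this sketch.
    return (f"& (executable={executable}) "
            f"(stdin={in_dir}/filename.{process}) "
            f"(stdout={out_dir}/filename.{process}) "
            "(count=1)")


def rsl_batch(n_jobs):
    """One RSL string per job, indexed 0 .. n_jobs-1."""
    return [rsl_for_job(i) for i in range(n_jobs)]


if __name__ == "__main__":
    for rsl in rsl_batch(3):
        print(rsl)
```

Each generated RSL still needs its own globusrun, so this gives distinct files per job but leaves the count of 1000 jobmanagers on the front-end machine unchanged.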

  11. Fault tolerance
      • The jobmanager is not persistent
      • If the jobmanager can’t be contacted, Globus assumes that the job(s) have completed
      • Example of the problem:
        • Submission of n jobs on a cluster managed by a local resource management system
        • Reboot of the front-end machine
        • The jobmanager(s) do not restart
        • Orphan jobs → Globus assumes that the jobs have completed successfully
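
The failure mode can be modelled in a few lines. This is a toy model, not Globus code: the `Status` values and both functions are hypothetical, and the "conservative" variant is only one possible alternative, shown for contrast.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    UNKNOWN = "unknown"


def status_as_described(jobmanager_reachable, last_reported):
    """Behaviour described in the slide: an unreachable jobmanager is read as DONE."""
    if not jobmanager_reachable:
        return Status.DONE
    return last_reported


def status_conservative(jobmanager_reachable, last_reported):
    """Hypothetical alternative: admit that the state is unknown instead of guessing."""
    if not jobmanager_reachable:
        return Status.UNKNOWN
    return last_reported


if __name__ == "__main__":
    # Front-end machine rebooted while the job was still running in the cluster.
    print(status_as_described(False, Status.ACTIVE))
    print(status_conservative(False, Status.ACTIVE))
```

In the reboot scenario the first function reports a job as done while it is still running in the local resource management system, which is exactly the orphan-job case.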

  12. GRAM & GIS
      • How do the local GRAMs provide the GIS with the characteristics and status of local resources?
      • Tests performed considering:
        • Condor pool
        • LSF cluster

  13. GRAM & Condor & GIS

  14. GRAM & LSF & GIS: must be fixed

  15. Jobs & GIS
      • Info on Globus jobs published in the GIS:
        • User
          • Subject of certificate
          • Local user name
        • RSL string
        • Globus job id
        • LSF/Condor/… job id
        • Status: Run/Pending/…
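
The per-job information listed above maps naturally onto a small record type. The sketch below is only a way to visualise it: the class name and field names are hypothetical and are not the attribute names of the GIS schema, and the example values are placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass
class GlobusJobRecord:
    certificate_subject: str  # "Subject of certificate"
    local_user: str           # "Local user name"
    rsl: str                  # "RSL string"
    globus_job_id: str        # "Globus job id"
    local_job_id: str         # "LSF/Condor/… job id"
    status: str               # "Status: Run/Pending/…"


# Placeholder values, invented purely for illustration.
example = GlobusJobRecord(
    certificate_subject="/O=Example/CN=Example User",
    local_user="exampleuser",
    rsl="& (executable=/diskCms/startcmsim.sh) (count=1)",
    globus_job_id="example-globus-id",
    local_job_id="example-local-id",
    status="Pending",
)

if __name__ == "__main__":
    for key, value in asdict(example).items():
        print(f"{key}: {value}")
```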

  16. GRAM & GIS
      • The information on characteristics and status of local resources and on jobs is not sufficient
        • As local resources we must consider farms, not single workstations
        • Other information (e.g. total and available CPU power) is needed
      • Fortunately, the default schema can be supplemented with other information provided by specific agents
      • The needed information must be identified first
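
To illustrate what "farm-level" information could look like, the sketch below aggregates per-node data into total and available CPU power. Everything in it is an assumption made for illustration: the `Node` type, the notion of a node being simply busy or free, and the unit of CPU power are not defined in the slides.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_power: float  # arbitrary unit, hypothetical
    busy: bool


def farm_summary(nodes):
    """Collapse single workstations into one farm-level record."""
    total = sum(n.cpu_power for n in nodes)
    available = sum(n.cpu_power for n in nodes if not n.busy)
    return {
        "nodes": len(nodes),
        "total_cpu_power": total,
        "available_cpu_power": available,
    }


if __name__ == "__main__":
    farm = [
        Node("pc1", 10.0, busy=False),
        Node("pc2", 20.0, busy=True),
        Node("pc3", 30.0, busy=False),
    ]
    print(farm_summary(farm))
```

A specific agent of the kind mentioned in the slide could publish a record like this next to the default schema.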

  17. RSL
      • We need a uniform language to specify resources across different resource management systems
      • The RSL syntax model seems suitable for defining even complicated resource specification expressions
      • The common set of RSL attributes is often not sufficient
      • Attributes that do not belong to the common set are ignored
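
The "ignored attributes" behaviour can be mimicked with a toy parser. This is not the Globus RSL parser: it only handles flat `(name=value)` relations, the `COMMON` set is an illustrative subset built from the attributes that appear in these slides, and `min_cpu_power` is an invented attribute.

```python
import re

# Illustrative subset only; the real common set is larger.
COMMON = {"executable", "stdin", "stdout", "count"}

# Matches flat relations of the form (name=value) with no nesting.
_RELATION = re.compile(r"\(\s*([A-Za-z_][\w-]*)\s*=\s*([^()]*?)\s*\)")


def parse_simple_rsl(rsl):
    """Return {attribute: value} for a flat RSL conjunction (names lower-cased)."""
    return {name.lower(): value for name, value in _RELATION.findall(rsl)}


def split_attributes(rsl, supported=COMMON):
    """Separate the attributes a back end would use from those it would drop."""
    attrs = parse_simple_rsl(rsl)
    used = {k: v for k, v in attrs.items() if k in supported}
    ignored = {k: v for k, v in attrs.items() if k not in supported}
    return used, ignored


if __name__ == "__main__":
    used, ignored = split_attributes(
        "& (executable=/diskCms/startcmsim.sh) (count=1) (min_cpu_power=500)"
    )
    print("used:", used)
    print("ignored:", ignored)
```

The user's `min_cpu_power` constraint is dropped without any feedback, so the job may land on a resource that does not satisfy it.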

  18. RSL
      • More flexibility is required
        • Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor ClassAds model)
        • Using the same language to describe the offered resources and the requested resources (Condor ClassAds model) seems a better approach
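
The idea of one language for both offers and requests can be sketched as a symmetric match. This is a toy model in the spirit of ClassAds, not the ClassAd language itself: the dictionary layout, the `matches` function, and the attribute names (including the site-defined `cpu_power`) are all assumptions made for illustration.

```python
def matches(request, offer):
    """Two-sided match: each side's requirements see the other side's attributes."""
    return (request["requirements"](offer["attributes"])
            and offer["requirements"](request["attributes"]))


# Offer published by a resource administrator; cpu_power is a site-defined attribute.
farm_offer = {
    "attributes": {"type": "farm", "lrms": "LSF", "cpu_power": 400},
    "requirements": lambda job: job.get("count", 1) <= 32,
}

# Requests written by users in the same vocabulary, including cpu_power.
job_request = {
    "attributes": {"type": "job", "count": 1},
    "requirements": lambda res: res.get("cpu_power", 0) >= 300,
}

too_demanding = {
    "attributes": {"type": "job", "count": 1},
    "requirements": lambda res: res.get("cpu_power", 0) >= 1000,
}

if __name__ == "__main__":
    print(matches(job_request, farm_offer))
    print(matches(too_demanding, farm_offer))
```

Because both sides use the same attribute vocabulary, a new attribute introduced by an administrator is immediately usable in a user's requirements, which is the flexibility the slide asks for.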

  19. Next steps
      • Bug fixes
        • Modification of the Globus LSF scripts for GIS
        • Problem with (count=x) on LSF?
      • Tests with real applications and real environments (CMS fall production)
      • Define a small set of attributes of a Condor pool, LSF cluster, or PBS cluster that should be reported to the GIS, and try to implement it
        • Let’s start with the information provided by the underlying resource management system
      • Tests with the GRAM API
      • Tests with other resource management systems are not necessary
      • Scalability and robustness problems
        • Not so simple and straightforward!
        • Up to the Workload Management WP; possible collaboration with the Globus team and the Condor team
