
GRID workload management system and CMS fall production



Presentation Transcript


  1. GRID workload management system and CMS fall production
     Massimo Sgaravatto, INFN Padova

  2. What do we want to implement (simplified design)
     [Architecture diagram; components from top to bottom]
     • Submit jobs (using Class-Ads) to the Master
     • Master: chooses in which Globus resources the jobs must be submitted
     • Resource Discovery through the Grid Information Service (GIS): information on characteristics and status of local resources
     • condor_submit (Globus Universe): submission to Condor-G
     • Condor-G: able to provide reliability; use of Condor tools for job monitoring, logging, …
     • globusrun: from Condor-G to Globus GRAM
     • Globus GRAM (one per site): uniform interface to different local resource management systems
     • Local Resource Management Systems: CONDOR, LSF, …
     • Farms: Site1, Site2, Site3
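The Master's role in the design above (choosing a Globus resource from the information published in the GIS) can be illustrated with a small sketch. This is not the ClassAd language and not the actual Master: the dictionary fields (`contact`, `lrms`, `free_cpus`), the host names, and the "most free CPUs" ranking are all assumptions made for illustration.

```python
# Toy resource selection, loosely modelled on the Master's task.
# All field names and the ranking policy are hypothetical.

def choose_resource(job, resources):
    """Return the contact string of the best matching resource, or None.

    job:       dict with 'min_cpus' (int) and 'accepted_lrms' (set of str)
    resources: list of dicts with 'contact', 'lrms' and 'free_cpus',
               standing in for the status information held by the GIS
    """
    # Keep only the resources that satisfy the job's requirements.
    candidates = [
        r for r in resources
        if r["lrms"] in job["accepted_lrms"] and r["free_cpus"] >= job["min_cpus"]
    ]
    if not candidates:
        return None
    # Rank the survivors: here, simply prefer the most free CPUs.
    best = max(candidates, key=lambda r: r["free_cpus"])
    return best["contact"]


if __name__ == "__main__":
    # Hypothetical snapshot of three sites.
    sites = [
        {"contact": "site1.example.org/jobmanager-condor", "lrms": "condor", "free_cpus": 4},
        {"contact": "site2.example.org/jobmanager-lsf", "lrms": "lsf", "free_cpus": 12},
        {"contact": "site3.example.org/jobmanager-lsf", "lrms": "lsf", "free_cpus": 2},
    ]
    job = {"min_cpus": 3, "accepted_lrms": {"lsf", "condor"}}
    print(choose_resource(job, sites))
```

In the full design this decision would be driven by Class-Ads and live GIS data rather than by a hard-coded list.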

  3. What can be implemented now
     [Architecture diagram; same as the previous slide, without the Master]
     • Submit jobs with condor_submit (Globus Universe) to Condor-G
     • Grid Information Service (GIS): information on characteristics and status of local resources (not very useful in this model)
     • Condor-G: able to provide reliability; use of Condor tools for job monitoring, logging, …
     • globusrun: from Condor-G to Globus GRAM
     • Globus GRAM (one per site): uniform interface to different local resource management systems
     • Local Resource Management Systems: CONDOR, LSF, …
     • Farms: Site1, Site2, Site3

  4. Status
     • Tests on basic capabilities and functionalities have been performed
     • Problems with scalability and fault tolerance were found
     • The CMS production is a useful exercise to test everything with real applications and real environments

  5. CMS production
     • Application: Pythia + Cmsim
       • “Traditional” applications
     • Overview
       • Job management (submission, monitoring) from a single machine using Condor tools
       • The user must explicitly define to which Globus resource (which farm) the jobs must be submitted
       • The applications and the input files must be stored in the file system of the executing machine
       • The output files will be created in the file system of the executing machine
       • We can try to have just the standard output/error files (useful to check the “status” of the production) created on the submitting machine, using bypass and/or Globus GASS
       • CMS wants to test bypass as a second step
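As an illustration of the submission model above, the following sketch generates a Condor-G submit description file for an explicitly chosen Globus resource. The helper name, the host name, the script path, and the file naming scheme are hypothetical. The submit commands (`universe = globus`, `globusscheduler`, `transfer_executable`) follow the Condor-G syntax of that period as far as I know; check them against the manual of the installed Condor version before use.

```python
# Sketch: build the text of a Condor-G submit description file.
# Helper name, host, paths and file names are illustrative assumptions.

# Conventional GRAM jobmanager names for the local resource managers
# listed in the slides (assumed to be configured this way on each farm).
JOBMANAGERS = {
    "lsf": "jobmanager-lsf",
    "condor": "jobmanager-condor",
    "pbs": "jobmanager-pbs",
    "fork": "jobmanager-fork",
}


def build_submit_description(gatekeeper, lrms, executable, run_id, n_jobs=1):
    """Return submit-file text for n_jobs jobs on one Globus resource."""
    if lrms not in JOBMANAGERS:
        raise ValueError(f"unknown local resource manager: {lrms}")
    lines = [
        "universe = globus",
        # The user names the target farm explicitly (no resource discovery).
        f"globusscheduler = {gatekeeper}/{JOBMANAGERS[lrms]}",
        f"executable = {executable}",
        # The application already resides on the executing machine.
        "transfer_executable = false",
        # Standard output/error come back to the submitting machine.
        f"output = {run_id}.$(Process).out",
        f"error = {run_id}.$(Process).err",
        f"log = {run_id}.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_submit_description(
        "farm.example.org", "lsf", "/cms/bin/run_pythia_cmsim.sh", "fall_prod", 3))
```

The resulting text would be saved to a file and passed to `condor_submit`; the jobs can then be followed with `condor_q` from the same machine.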

  6. Bypass vs. GASS
     • Bypass
       • Written by Douglas Thain (Condor team)
       • Redirection of standard input/output/error of a program to a remote machine while the program is running
       • Can be used for dynamically linked programs
       • Successfully tested with Pythia
       • Use of Globus Security Infrastructure
     • Globus GASS
       • Possibility to copy the input file to the remote machine before the execution, and to get the output file back after the execution (otherwise it is necessary to modify the source code)

  7. What is necessary
     • Local farms with shared file system between the various nodes
       • Done using CMS installation toolkit
       • Installation and support up to CMS/local administrators
     • Installation of CMS environment on these farms
       • Done using CMS installation toolkit
       • Support up to CMS

  8. What is necessary
     • Local resource management system to manage the local farm
       • LSF
         • Installation and support up to CMS/local administrators
         • We should define in a “common” way how to configure the queue/s where the jobs run
       • Local Condor pool
         • Installation and configuration (for “dedicated” machines) using CMS toolkit
         • Support ???
       • PBS
         • Are there sites where PBS will be used ???
         • Tests on Condor-G – Globus – PBS not performed yet
       • Fork
         • Strongly discouraged (even for a single machine)
         • Necessary to install Globus on each machine
         • Job queuing up to the production manager

  9. What is necessary
     • Globus
       • One installation per farm (on a “visible” node)
       • Use of personal certificates and host certificates signed by INFN CA
         • User certificates signed by Globus CA are accepted as well
         • By default it is not possible to “use” Globus resources outside INFN using personal certificates signed by INFN CA
           • Workaround 1: Users also have personal certificates signed by Globus CA
           • Workaround 2: “Small” modification in the Globus configuration of these resources outside INFN in order to accept “our” certificates too
       • Installation
         • Installation done by CMS/local administrators/WP1 member (if present) using distribution and procedures provided by INFN GRID release team (http://www.pi.infn.it/GRID/GRID_INST_1.1.html)
         • In case of problems: globus@infn.it

  10. What is necessary
     • Condor-G
       • Just one installation, used by the production manager (Ivano Lippi ?)
       • Installation and maintenance: Massimo Sgaravatto ???
     • Scripts to run CMS production using this GRID environment
       • Up to CMS
     • Tools to “monitor” production
       • condor_q
       • Condor Job Viewer (Java GUI)
     • Run the production
       • Up to production manager

  11. Some items/actors missing ???
     • When ???
     • Relations with other activities ???
       • Data Management (GDMP, …) ???
     • ???
