GRID workload management system and CMS fall production

Massimo Sgaravatto, INFN Padova

What do we want to implement (simplified design)
[Diagram; components from top to bottom]
• Job submission: jobs are submitted to the Master using Class-Ads
• Master: performs resource discovery and chooses on which Globus resources the jobs must be submitted
• Grid Information Service (GIS): information on characteristics and status of local resources
• Condor-G: receives the jobs via condor_submit (Globus Universe)
  • Condor-G is able to provide reliability
  • Use of Condor tools for job monitoring, logging, …
• Globus GRAM (one per site, reached via globusrun): uniform interface to different local resource management systems
• Local Resource Management Systems (CONDOR, LSF, …) managing the farms at Site1, Site2, Site3
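As a rough illustration of the Master's role in this design, the sketch below picks a Globus resource from status information of the kind a GIS could publish. The `FarmStatus` record, its fields, the host names, and the "least loaded farm" policy are all assumptions made for illustration; the slides do not specify how the Master ranks resources.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FarmStatus:
    """Hypothetical snapshot of one Globus resource, as a GIS might publish it."""
    contact: str      # GRAM contact string of the farm (hypothetical hosts below)
    free_cpus: int    # CPUs currently idle
    queued_jobs: int  # jobs already waiting in the local queue


def choose_farm(farms: Sequence[FarmStatus], cpus_needed: int = 1) -> Optional[str]:
    """Return the contact of the least loaded farm that has enough free CPUs.

    The ranking (fewest queued jobs, then most free CPUs) is an assumed policy.
    """
    candidates = [f for f in farms if f.free_cpus >= cpus_needed]
    if not candidates:
        return None  # no farm can take the job right now
    best = min(candidates, key=lambda f: (f.queued_jobs, -f.free_cpus))
    return best.contact


farms = [
    FarmStatus("site1.example.org/jobmanager-lsf", free_cpus=0, queued_jobs=12),
    FarmStatus("site2.example.org/jobmanager-condor", free_cpus=8, queued_jobs=3),
    FarmStatus("site3.example.org/jobmanager-lsf", free_cpus=4, queued_jobs=0),
]
print(choose_farm(farms))
```

In the full design this choice would presumably be expressed through Class-Ads matching rather than a hard-coded ranking; the function only shows where the GIS information enters the decision.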

What can be implemented now
[Diagram; same layout as the simplified design, without the Master]
• Job submission: the user submits jobs directly to Condor-G via condor_submit (Globus Universe)
• Grid Information Service (GIS): information on characteristics and status of local resources (annotated in the diagram as "not very useful in this model")
• Condor-G
  • Condor-G is able to provide reliability
  • Use of Condor tools for job monitoring, logging, …
• Globus GRAM (one per site, reached via globusrun): uniform interface to different local resource management systems
• Local Resource Management Systems (CONDOR, LSF, …) managing the farms at Site1, Site2, Site3

Status
• Tests on basic capabilities and functionalities have been performed
• Problems with scalability and fault tolerance were found
• The CMS production is a useful exercise to test everything with real applications and real environments

CMS production
• Application: Pythia + Cmsim
  • “Traditional” applications
• Overview
  • Job management (submission, monitoring) is done from a single machine using Condor tools
  • The user must explicitly define to which Globus resource (which farm) the jobs are submitted
  • The applications and the input files must be stored in the file system of the executing machine
  • The output files will be created in the file system of the executing machine
  • We can try to have just the standard output/error files (useful to check the “status” of the production) created on the submitting machine, using bypass and/or Globus GASS
  • CMS wants to test bypass as a second step
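The submission model above can be sketched as a small helper that writes a Condor-G submit description for a farm chosen explicitly by the user. The submit commands used (`universe`, `globusscheduler`, `executable`, `transfer_executable`, `output`, `error`, `log`, `queue`) are standard Condor submit commands as far as we know, but their exact behaviour depends on the Condor-G version; the host name, executable path, and job name are hypothetical.

```python
def build_submit_description(contact: str, executable: str, job_name: str) -> str:
    """Build a Condor-G submit description for one job on one Globus resource.

    contact    -- GRAM contact string of the farm, chosen explicitly by the user
    executable -- path of the application on the executing machine
    job_name   -- used only to name the local stdout/stderr/log files
    """
    lines = [
        "universe = globus",
        f"globusscheduler = {contact}",
        f"executable = {executable}",
        # The application is already installed on the farm, so do not ship it.
        "transfer_executable = false",
        # Standard output/error and the job log are kept on the submitting machine.
        f"output = {job_name}.out",
        f"error = {job_name}.err",
        f"log = {job_name}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


description = build_submit_description(
    contact="farm.example.org/jobmanager-lsf",  # hypothetical farm
    executable="/cms/bin/run_pythia.sh",        # hypothetical path on the farm
    job_name="pythia_run001",                   # hypothetical job name
)
print(description)
```

The resulting text would be saved to a file and passed to condor_submit on the submitting machine; the job can then be followed with condor_q.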

Bypass vs. GASS
• Bypass
  • Written by Douglas Thain (Condor team)
  • Redirects the standard input/output/error of a program to a remote machine while the program is running
  • Can be used for dynamically linked programs
  • Successfully tested with Pythia
  • Uses the Globus Security Infrastructure
• Globus GASS
  • Makes it possible to copy the input file to the remote machine before the execution, and to get the output file back after the execution (otherwise it is necessary to modify the source code)

What is necessary
• Local farms with a shared file system between the various nodes
  • Done using CMS installation toolkit
  • Installation and support up to CMS/local administrators
• Installation of CMS environment on these farms
  • Done using CMS installation toolkit
  • Support up to CMS

What is necessary
• Local resource management system to manage the local farm
  • LSF
    • Installation and support up to CMS/local administrators
    • We should define in a “common” way how to configure the queue/s where the jobs run
  • Local Condor pool
    • Installation and configuration (for “dedicated” machines) using CMS toolkit
    • Support ???
  • PBS
    • Are there sites where PBS will be used ???
    • Tests on Condor-G – Globus – PBS not performed yet
  • Fork
    • Strongly discouraged (even for a single machine)
    • Necessary to install Globus on each machine
    • Job queuing up to the production manager
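Since Globus GRAM hides the local resource management system behind a job manager service, the choice between LSF, Condor, PBS, and fork shows up for the submitter mainly in the GRAM contact string. The sketch below assumes the conventional job manager names (`jobmanager-lsf`, `jobmanager-condor`, `jobmanager-pbs`, `jobmanager-fork`); a site may configure different names, and the host name is hypothetical.

```python
# Conventional GRAM job manager service names; a site may use other names.
JOBMANAGER_BY_LRMS = {
    "lsf": "jobmanager-lsf",
    "condor": "jobmanager-condor",
    "pbs": "jobmanager-pbs",
    "fork": "jobmanager-fork",
}


def gram_contact(host: str, lrms: str) -> str:
    """Return the GRAM contact string for a farm's gatekeeper host and its LRMS."""
    try:
        service = JOBMANAGER_BY_LRMS[lrms.lower()]
    except KeyError:
        raise ValueError(f"unsupported local resource management system: {lrms!r}")
    return f"{host}/{service}"


print(gram_contact("farm.example.org", "LSF"))
```

With fork there is no local queue behind the gatekeeper, which is consistent with the remark above that job queuing would then fall to the production manager.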

What is necessary
• Globus
  • One installation per farm (on a “visible” node)
  • Use of personal certificates and host certificates signed by the INFN CA
    • User certificates signed by the Globus CA are accepted as well
    • By default it is not possible to “use” Globus resources outside INFN with personal certificates signed by the INFN CA
      • Workaround 1: users also have personal certificates signed by the Globus CA
      • Workaround 2: “small” modification in the Globus configuration of these resources outside INFN so that they accept “our” certificates too
  • Installation
    • Installation done by CMS/local administrators/WP1 member (if present) using the distribution and procedures provided by the INFN GRID release team (http://www.pi.infn.it/GRID/GRID_INST_1.1.html)
    • In case of problems: globus@infn.it

What is necessary
• Condor-G
  • Just one installation, used by the production manager (Ivano Lippi ?)
  • Installation and maintenance: Massimo Sgaravatto ???
• Scripts to run CMS production using this GRID environment
  • Up to CMS
• Tools to “monitor” production
  • condor_q
  • Condor Job Viewer (Java GUI)
• Run the production
  • Up to production manager
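To give an idea of what “monitoring” the production with condor_q could amount to, the sketch below tallies jobs by the one-letter status codes shown in the condor_q `ST` column (I, R, H, C, X). It assumes the (job id, status) pairs have already been extracted from the condor_q output; the extraction step and the sample job ids are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# One-letter status codes as displayed in the condor_q "ST" column.
STATUS_NAMES = {
    "I": "idle",
    "R": "running",
    "H": "held",
    "C": "completed",
    "X": "removed",
}


def summarise_production(jobs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count jobs per status; codes not listed above are reported as 'unknown'."""
    counts = Counter(STATUS_NAMES.get(status, "unknown") for _, status in jobs)
    return dict(counts)


# Hypothetical (job id, status) pairs, e.g. taken from condor_q output.
jobs = [("12.0", "R"), ("12.1", "R"), ("12.2", "I"), ("12.3", "H"), ("12.4", "?")]
print(summarise_production(jobs))
```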

Some items/actors missing ???
• When ???
• Relations with other activities ???
  • Data Management (GDMP, …) ???
  • ???
