
The Workload Management And Logging Bookkeeping System



Presentation Transcript


  1. The Workload Management And Logging Bookkeeping System Di Qing Grid Deployment Group Academia Sinica & CERN WLCG Collaboration Workshop, 25 January, 2007

  2. Outline • Introduction • Installation and configuration • Test your site • Troubleshooting WLCG Collaboration Workshop

  3. Introduction (I) • Focus on the gLite job management system, especially on troubleshooting • Workload Management System (WMS) • Backward compatible with LCG-2 • WMProxy • Web service interface to the WMS • Supports bulk submissions and jobs with shared sandboxes • Support for shallow resubmission • Supports BDII, R-GMA and CEMon as resource information repositories • Support for data management interfaces (DLI and StorageIndex) • Support for DAG jobs • …… WLCG Collaboration Workshop

  4. Introduction (II) • Logging and Bookkeeping (L&B) • Tracks jobs during their lifetime (in terms of events) • L&B Proxy • Provides faster, synchronous and more efficient access to L&B services for the Workload Management Services • Computing Element (CE) • Service representing a computing resource • CE moving towards a VO-based local scheduler • Batch Local ASCII Helper (BLAH) • More efficient parsing of log files (these can be left residing on a remote machine) • Support for hold and resume in BLAH, to be used e.g. to put a job on hold while waiting for the staging of the input data • Condor-C GSI enabled • CE Monitor (not enabled in gLite 3.0) • Better support for the pull mode; more efficient handling of CEMon reporting • Security support WLCG Collaboration Workshop

  5. WMS (architecture diagram; some components are marked "Not enabled in gLite 3.0" or "Not yet") WLCG Collaboration Workshop

  6. gLite CE (architecture diagram: Gatekeeper with LCAS/LCMAPS and WSS launching Condor-C; Grid CEMon not enabled in gLite 3.0; blahpd and Condor-C submitting jobs to the local batch system: LSF, Condor, PBS/Torque) WLCG Collaboration Workshop

  7. LB (architecture diagram) WLCG Collaboration Workshop

  8. WMS Components (architecture diagram) WLCG Collaboration Workshop

  9. Installation (I) • apt-get, yum, manual installation, tarball • glite-WMS, glite-LB, glite-WMSLB • Architecture (CPU) • If you have a non-i386 architecture and you need to install some i386 packages, DON'T use apt: it makes its choice by the priority of the repository, not by the architecture • Yum does not have this problem and always gives priority to the CPU architecture • Dependencies • These tools are made to satisfy dependencies automatically • If foo depends on bar, which depends on tux, all 3 packages will be installed with just: apt-get install foo • Problems arise when tux is not present • The error message will not mention tux but bar ("bar is not installable") WLCG Collaboration Workshop

  10. Installation (II) • Try to install bar to check which package is really missing • apt-get install bar • Then find tux somewhere and install it by hand • apt-get dist-upgrade may help (see the sketch below) WLCG Collaboration Workshop
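  A minimal sketch of this diagnosis flow, using the hypothetical packages foo, bar and tux from the slide (the tux RPM file name is illustrative):

    apt-get install foo            # fails, complaining only that "bar is not installable"
    # install the intermediate dependency directly to expose the real culprit
    apt-get install bar            # the error now names tux
    # fetch tux from another repository and install it by hand
    rpm -ivh tux-1.0-1.i386.rpm    # hypothetical file name
    apt-get install foo            # the dependency chain now resolves
    apt-get dist-upgrade           # may also help pull the node consistent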

  11. YAIM configuration of gLite services • gLite 3.0 is a merge of the LCG 2.7.0 and gLite 1.5.0 middleware stacks • The two stacks used different configuration approaches (YAIM / gLite Python configuration system) • Unification of configuration approaches • YAIM2gLiteConverter: tool to transform YAIM configuration values into gLite XML configuration files • Transparent configuration of gLite services • No additional administrative overhead • For example, configure_node <site-info.def> WMSLB (see the sketch below) • Function config-glite-wms • Calls YAIM2gLiteConvertor (parameter transformation) • Calls glite-wms-config.py --config (service configuration) • Calls glite-wms-config.py --start (service startup) • Function config-glite-lb • Calls YAIM2gLiteConvertor (parameter transformation) • Calls glite-lb-config.py --config (service configuration) • Calls glite-lb-config.py --start (service startup) WLCG Collaboration Workshop
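  A sketch of what the YAIM entry point runs under the hood for a combined WMS+LB node, using only the calls named on this slide (the site-info.def path is an assumption):

    # one YAIM call configures the whole node
    ./configure_node /root/site-info.def WMSLB

    # for the WMS part this is roughly equivalent to:
    #   YAIM2gLiteConvertor            -> site-info.def to glite-wms.cfg.xml
    #   glite-wms-config.py --config   -> write the service configuration
    #   glite-wms-config.py --start    -> start the WMS services
    # and likewise glite-lb-config.py --config / --start for the LB part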

  12. YAIM2gLiteConvertor (diagram: site-info.def → glite-*.cfg.xml) WLCG Collaboration Workshop

  13. Transformation • YAIM parameters are read from the site-info.def file • gLite XML files are updated • with new parameters from template files • Parameter values are: • mapped to their YAIM equivalents • derived from YAIM parameters • defaulted • Necessary structures are created • gLite parameters not managed by YAIM are not modified by the converter! WLCG Collaboration Workshop

  14. Troubleshooting on configuration • Important places and files • ${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml: the modified/updated XML files, to verify whether the conversion was O.K. • glite-wms.cfg.xml and glite-lb.cfg.xml for WMS and LB • If you suspect the conversion was not O.K. and you don't have any modified parameters, the simplest solution is: # rm -f ${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml (normally not needed; see the sketch below) • ${INSTALL_ROOT}/glite/yaim/libexec/: code of the YAIM2gLiteConvertor and support files (parameter mapping, defaults, container definitions) • /opt/glite/etc/glite_wms.conf, /opt/glite/etc/glite_wms_wmproxy_httpd.conf, /opt/glite/etc/glite_wms_wmproxy.gacl, /opt/condor-c/etc/condor_config, /opt/condor-c/local.<hostname>/condor_config.local WLCG Collaboration Workshop
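  A minimal recovery sketch, assuming the node was originally configured with the configure_node call from slide 11 and that no hand-edited parameters need to be preserved:

    # drop the possibly mis-converted XML files (normally not needed)
    rm -f ${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml
    # rerun YAIM so YAIM2gLiteConvertor regenerates them from site-info.def
    ./configure_node /root/site-info.def WMSLB
    # then eyeball the regenerated files
    less ${INSTALL_ROOT}/glite/etc/config/glite-wms.cfg.xml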

  15. Test your CE (I) • Starting from gLite 3.0, the gLite CE only supports VOMS proxies!! • Test the gatekeeper (see the sketch below) • globus-job-run <yourCE> /usr/bin/id • Submit the job to jobmanager-fork • other jobmanagers for batch systems don't exist, since gLite 3.0 uses Condor-C to submit jobs to the batch system via BLAH • Logs: /var/log/glite/gatekeeper.log and /var/log/messages • Something wrong? • grid-mapfile should contain only VOMS groups and roles, e.g.:
    "/atlas/Role=production/Capability=NULL" atlasprd
    "/atlas/Role=NULL/Capability=NULL" .atlas
    "/atlas" .atlas
    ....
  • lcas and lcmaps correctly configured? • hostkey read-only, time synchronization, VO membership, certificate expired? • Check the logs in /var/log/glite/gatekeeper.log WLCG Collaboration Workshop
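  A minimal gatekeeper smoke test, assuming your certificate is registered in the VO (the VO name dteam is an assumption):

    # a plain grid proxy is not enough for the gLite CE: get a VOMS proxy
    voms-proxy-init --voms dteam
    # run a trivial command through the fork jobmanager
    globus-job-run <yourCE>/jobmanager-fork /usr/bin/id
    # the output should show the pool account your VOMS FQAN is mapped to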

  16. Test your CE (II) • Test your batch system • Test job submission with a batch system command, like qsub for Torque, on the CE as a pool account (see the sketch below) • ssh problem from WN to CE? • When a WN is reinstalled, it is suggested to remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE and rerun /opt/edg/sbin/edg-pbs-knownhosts and /opt/edg/sbin/edg-pbs-shostsequiv • Check whether the set of Condor-C daemons is running as the VO pool account • Condor_gridmanager log files: /tmp/GridmanagerLog.<poolaccount> • And /<home>/<poolaccount>/Condor_glidein/ WLCG Collaboration Workshop
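  A direct batch-system test, run on the CE as one of the pool accounts (the account name dteam001 and queue name dteam are assumptions):

    su - dteam001
    # submit a trivial job straight to Torque, bypassing the grid layer
    echo "/bin/hostname" | qsub -q dteam
    # watch it run; if it stays queued, the problem is in the batch system,
    # not in the gLite middleware
    qstat
    # also confirm the Condor-C daemons for this pool account are alive
    ps -u dteam001 | grep condor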

  17. Test your CE (III) • Test BLAH job execution • As a pool account, run /opt/glite/bin/blahpd and type:
    ASYNC_MODE_ON
    COMMANDS
    BLAH_JOB_SUBMIT 23 [cmd="/bin/date";out="/tmp/out";gridtype="pbs"];
    quit
  • Check /tmp/out to see if it is correct • If it failed, check whether the BLParser is running on the batch system head node (see the sketch below), like: • /opt/glite/bin/BLParserPBS -p 33332 -s <location of log> for PBS • /opt/glite/bin/BLParserLSF -p 33333 -s <location of log> for LSF • Ports may be different • pbs_submit.sh, pbs_status.sh, lsf_submit.sh, lsf_status.sh • http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/ce_blahp.shtml • Test your CE through the WMS • If your CE is not in the information system, try the "-r" option • glite-job-submit -r <yource>:2119/blah-pbs-dteam your.jdl WLCG Collaboration Workshop
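  A quick liveness check for the BLParser, assuming the PBS parser port 33332 from this slide and a head node named <headnode>:

    # on the batch system head node: is the parser process there?
    ps -ef | grep BLParserPBS
    # from the CE: is the parser port reachable?
    telnet <headnode> 33332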

  18. Test your WMS • Create a very simple JDL, then glite-job-list-match to check your network server and workload manager (see the sketch below) • For WMProxy, it's glite-wms-list-match • If no resource matched, check /var/glite/workload_manager/ismdump.fl • If it's empty, check whether you can contact the BDII defined in /opt/glite/etc/glite_wms.conf with ldapsearch • Submit the job with glite-job-submit • For WMProxy, it's glite-wms-job-submit • glite-job-status <jobid> • Or glite-wms-job-status with the WMProxy interface • Check the job status • glite-job-logging-info (glite-wms-job-logging-info for WMProxy) can give more information and also verifies the LB services WLCG Collaboration Workshop
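  A minimal end-to-end sketch; the commands are the ones named on this slide, while the JDL content and the -v verbosity flag are illustrative assumptions:

    # a trivial JDL
    cat > test.jdl <<EOF
    Executable = "/bin/hostname";
    StdOutput  = "std.out";
    StdError   = "std.err";
    OutputSandbox = {"std.out", "std.err"};
    EOF

    glite-job-list-match test.jdl        # exercises NS + workload manager
    glite-job-submit test.jdl            # returns a https://... job id
    glite-job-status <jobid>             # follow the job
    glite-job-logging-info -v 2 <jobid>  # detailed L&B events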

  19. Test your WMS (II) • On the WMS, Condor commands can be used to query the status of your jobs in the Condor queue, for example: • "condor_q -long" gives verbose output about the jobs • "condor_q -global" also shows the status of jobs on the CE • Before running those Condor commands, first "source /etc/glite/profile.d/glite_setenv.sh" • condorc-launcher, condorc-advertiser, condorc-authorizer • condorc-launcher-s… launches Condor-C on the gLite CE • JobWrapper.https_3… are the user jobs • condor_status instead displays the status of the Condor pool, like • "condor_status -schedd" shows the attributes, including total running jobs, of the schedds, for example the Condor schedd on the gLite CE • And others: condor_history, … (see the sketch below) WLCG Collaboration Workshop
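  A typical query session on the WMS, using only the commands from this slide:

    # set up the Condor environment first
    source /etc/glite/profile.d/glite_setenv.sh
    condor_q -global          # jobs in the local queue and on the CEs
    condor_q -long            # full classad of every queued job
    condor_status -schedd     # schedds in the pool, incl. the gLite CE's
    condor_history            # recently completed jobs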

  20. Test your WMS (III) • "/etc/init.d/gLite status" gives the status of all services • It calls glite-wms-config.py and glite-lb-config.py • For an individual service, try the daemon script under /opt/glite/etc/init.d/ • Or you can restart all services with "/etc/init.d/gLite restart" (see the sketch below) • You can find most of the log info under /var/log/glite/ • Network Server: networkserver_events.log • Workload Manager: workload_manager_events.log • Log Monitor: logmonitor_events.log • Job Controller: jobcontroller_events.log • WMProxy: wmproxy.log, lcmaps.log (authentication) • But for the LB and proxy renewal services, the log messages are in /var/log/messages • The log info of CondorG for jobs is normally in /var/glite/logmonitor/CondorG.log • And you can also find the logs for Condor-C under /var/local/condor/log/ WLCG Collaboration Workshop
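  A service health pass on the WMS, assuming the standard paths from this slide (the individual daemon script name is an assumption about the usual layout under /opt/glite/etc/init.d/):

    /etc/init.d/gLite status                       # all services at once
    /opt/glite/etc/init.d/glite-wms-wm status      # one daemon (name assumed)
    # follow the main event logs while submitting a test job
    tail -f /var/log/glite/workload_manager_events.log \
            /var/log/glite/logmonitor_events.log
    grep -i error /var/log/messages                # LB and proxy renewal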

  21. What's wrong? (I) • Error while calling the "NSClient::multi" native api AuthenticationException: Failed to establish security context... • Check whether your DN is in the grid-mapfile on the WMS • Restart the network server (possibly) • Many jobs stay in running or ready status forever • Probably the logmonitor daemon died and could not be restarted • Increase the log level of logmonitor in the logmonitor section of glite_wms.conf and restart it to check which log file causes it to crash (see the sketch below) • Remove the corrupted log file from /var/glite/logmonitor/CondorG.log/ and restart it • problems with grid_monitor.sh • Interlogd on the WMS got stuck; restart it • A workaround patch is available • glite-job-logging-info shows "Cannot take token!" • Check whether the edg-gridftp-clients or glite-gridftp-clients package is installed on the WN • Check whether you can globus-url-copy files to and from the WMS on the WNs • The proxy expired before the job executed and could not be renewed WLCG Collaboration Workshop
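  A sketch of the logmonitor recovery procedure described above (the init script name and the LogLevel attribute are assumptions about the usual gLite layout):

    # raise the verbosity in the LogMonitor section of /opt/glite/etc/glite_wms.conf,
    # e.g. LogLevel = 6; then restart and watch which file it chokes on
    /opt/glite/etc/init.d/glite-wms-lm restart
    tail -f /var/log/glite/logmonitor_events.log
    # move the corrupted CondorG log out of the way and restart again
    mv /var/glite/logmonitor/CondorG.log/<badfile> /tmp/
    /opt/glite/etc/init.d/glite-wms-lm restart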

  22. What's wrong? (II) • "Got a job held event, reason: Spooling input data files" • It may fail with "Globus error 7: authentication with the remote server failed" • Race condition between the gridmanager on machine A querying the status of the job on machine B and the schedd on machine B releasing the job after file stage-in; fixed in a later version of Condor • glite-lb-bkserverd: "Database call failed (The table 'long_fields' is full)" in /var/log/messages • The LB database reached the 4GB MySQL table limit • May cause incomplete log events • Increase the limits with (see the sketch below): • alter table short_fields max_rows=1000000000; • alter table long_fields max_rows=55000000; • alter table states max_rows=9500000; • alter table events max_rows=175000000; WLCG Collaboration Workshop
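  One way to apply those statements, assuming the default MySQL setup of the LB server (the database name lbserver20 and user lbserver are assumptions; check your bkserverd configuration):

    mysql -u lbserver -p lbserver20 <<EOF
    alter table short_fields max_rows=1000000000;
    alter table long_fields  max_rows=55000000;
    alter table states       max_rows=9500000;
    alter table events       max_rows=175000000;
    EOF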

  23. What's wrong (III) • glite-job-logging-info shows: Cannot read JobWrapper output, both from Condor and from Maradona • Similar to the LCG workload management system; more info on http://goc.grid.sinica.edu.tw/gocwiki • glite-job-logging-info shows: Got a job held event, reason: "The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE" • Condor could not submit the job to the CE within 900 seconds • Probably Condor-C on the CE could not be launched • because authentication failed • The previous launcher jobs failed but are still in the Condor queue; remove them with condor_rm (see the sketch below) • the IP address is incorrect in /etc/hosts • possibly because of a firewall WLCG Collaboration Workshop
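  A sketch for clearing stale launcher jobs on the WMS (the job id 1234.0 is a placeholder):

    source /etc/glite/profile.d/glite_setenv.sh
    # find failed condorc-launcher jobs stuck in the queue (status H = held)
    condor_q -global | grep condorc-launcher
    # remove the stale one so a fresh launcher can start Condor-C on the CE
    condor_rm 1234.0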

  24. What's wrong (IV) • glite-job-logging-info shows "Got a job held event, reason: Error connecting to schedd …" • Condor hit a timeout when connecting to the schedd on the gLite CE • Possibly because of an unstable network, or a disk filling up somewhere • glite-job-logging-info shows "Got a job held event, reason: Attempts to submit failed" • It means that the job could not be successfully handed over to the batch system by the non-privileged user that resulted from the GRAM/LCMAPS mapping • For example, the BLParser is not running on the batch system head node • Failed to load GSI credential: edg_wll_gss_acquire_cred_gsi() failed • The locallogger puts log events under /tmp, but /tmp is full (see the sketch below) WLCG Collaboration Workshop
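  Quick checks for the last symptom (the locallogger init script name is an assumption about the usual layout):

    # a full /tmp starves the locallogger and breaks GSI credential loading
    df -h /tmp
    # free some space, then restart the locallogger daemon
    /opt/glite/etc/init.d/glite-lb-locallogger restart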

  25. What's wrong (V) • Unable to delegate the credential to the endpoint: https://<HOSTNAME>:7443/glite_wms_wmproxy_server • The VOMS extension of your proxy is not in /opt/glite/etc/glite_wms_wmproxy.gacl (see the sketch below) • The VOMS extension is missing in your proxy and WMProxy is configured to support only VOMS proxies • Unable to register the job to the service: https://<HOSTNAME>:7443/glite_wms_wmproxy_server • The LB or LBProxy is too busy; increase the timeout • The LB is in a bad state; restart it WLCG Collaboration Workshop
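  A quick way to check the first failure mode, using standard VOMS client tools:

    # what FQANs does my proxy actually carry?
    voms-proxy-info -fqan
    # is one of them granted in the WMProxy access control list?
    grep -A2 fqan /opt/glite/etc/glite_wms_wmproxy.gacl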

  26. Performance and stability improvement • Deploy a standalone LB • One standalone LB can serve several WMSes • Create a file containing the DNs of the allowed WMSes • Add the option "--super-users-file <LOCATION>/<DNFILENAME>" to the startup line of bkserverd • You can add the DN of your own certificate there in order to debug and get log info of other users' jobs • Split /var/glite into several partitions or hard disks • Reduce the planner number per DAG job • In case there are too many DAG jobs • For example, add DagmanMaxPre=2 in the JobController section of glite_wms.conf to reduce it to 2 • Increase timeouts • Set GLITE_WMS_QUERY_TIMEOUT and GLITE_PR_TIMEOUT in /etc/glite/profile.d/glite-setenv.* and add "PassEnv GLITE_PR_TIMEOUT" in /opt/glite/etc/glite_wms_wmproxy_httpd.conf (see the sketch below) WLCG Collaboration Workshop
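  A sketch of the timeout tuning, using only the variable and directive names from this slide (the 300-second values are illustrative assumptions):

    # /etc/glite/profile.d/glite-setenv.sh
    export GLITE_WMS_QUERY_TIMEOUT=300
    export GLITE_PR_TIMEOUT=300

    # /opt/glite/etc/glite_wms_wmproxy_httpd.conf
    # let the WMProxy httpd see the variable
    PassEnv GLITE_PR_TIMEOUT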
