
Preliminary tests with co-scheduling and the Condor parallel universe



Presentation Transcript


  1. Preliminary tests with co-scheduling and the Condor parallel universe
     Marian ZUREK for WP2
     ETICS All Hands meeting, Bologna, October 23-25, 2006

  2. What’s the …
     • Context
     • Use case
     • Past
     • Condor / NMI setup
     • Results
     • gLite-specific issues
     • Next steps
     • Discussion
     Bologna -- All Hands Meeting

  3. Context
     • Testing of the gLite software stack is highly manual, which motivated automating the process and making it easier to reproduce
     • In the future, system tests should become part of the release process (reports stored in the DB and easily accessible for trend analysis, performance analysis, bug reproduction, etc.)

  4. Service Overview
     [Diagram. Labels: Clients, Web Application, Via browser, Command-Line tools, Web Service, NMI Scheduler, NMI Client Wrapper, WNs, Project DB, Report DB, Build/Test Artefacts, Continuous Builds, ETICS Infrastructure]

  5. Use case for gLite
     • We have to deploy six services on six different nodes: UI, CE, WMSLB, VOMS_mysql, RGMA, WN
     • There are interdependencies between them:
       • UI: [RGMA, VOMS_mysql, WMSLB, CE, WN]
       • CE: [WMSLB, WN]
       • WMSLB: [VOMS_mysql]
       • VOMS_mysql: [RGMA]
       • WN: [CE]
       • RGMA: []
     • No auto-discovery is possible; the order of service startup must be preserved and the run-time environment defined
     • A successful “real job” submission requires all the services to be operational
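
Because the startup order must be preserved, it can help to check whether a dependency listing admits a strict order at all. The following is a minimal sketch, assuming the listing is held in a plain Python dict; the function and variable names are illustrative and do not come from the ETICS code. It reports pairs of services that name each other as dependencies.

```python
def mutual_dependencies(deps):
    """Return sorted pairs of services that list each other as dependencies."""
    pairs = set()
    for service, needs in deps.items():
        for other in needs:
            if service in deps.get(other, []):
                pairs.add(tuple(sorted((service, other))))
    return sorted(pairs)


# Dependency listing copied from the slide above.
DEPS = {
    "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE", "WN"],
    "CE": ["WMSLB", "WN"],
    "WMSLB": ["VOMS_mysql"],
    "VOMS_mysql": ["RGMA"],
    "WN": ["CE"],
    "RGMA": [],
}

if __name__ == "__main__":
    print(mutual_dependencies(DEPS))
```

For the listing above this reports the pair CE and WN, which reference each other. The slides do not say how that pair was handled in the actual setup.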

  6. Back to the past
     • The gLite software stack requires administration rights on the target node, so a root-enabled schema was developed to address this
     • Root-enabled jobs should be performed only on predefined sets of hosts
     • The service installation should reflect its operational status by writing to a file, e.g. /etc/nmi/publish_services.list:
       • runs_VOMS_server="true", timeOut=3600
       • runs_RGMA_server="true"
     • The timeOut (expressed in seconds) defines the service operational time; after the timeOut the node is released. If no timeOut is given, the machine is released immediately after the job has finished.
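
The two sample lines suggest a comma-separated key=value layout. Below is a minimal parsing sketch under that assumption; the full file format is not given in the slides, and the function names `parse_service_line` and `release_delay` are hypothetical.

```python
def parse_service_line(line):
    """Parse one 'key=value, key=value' line into a dict of strings.

    Assumes comma-separated fields with optional double quotes around
    values, as in the two sample lines on the slide.
    """
    entry = {}
    for field in line.split(","):
        field = field.strip()
        if not field:
            continue
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {field!r}")
        entry[key.strip()] = value.strip().strip('"')
    return entry


def release_delay(entry):
    """Seconds to keep the node after the job; 0 when no timeOut is given."""
    return int(entry.get("timeOut", 0))


if __name__ == "__main__":
    voms = parse_service_line('runs_VOMS_server="true", timeOut=3600')
    rgma = parse_service_line('runs_RGMA_server="true"')
    print(voms, release_delay(voms))
    print(rgma, release_delay(rgma))
```

A missing timeOut maps to a delay of zero, matching the rule that the machine is released immediately after the job has finished.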

  7. Condor / NMI setup
     • Experiment on the predefined set of nodes
       • special STARTD expressions for defining the Condor VMx availability
       • nodes still available for the regular submissions
       • synchronisation using Condor-chirp messages
     • Custom (outside the NMI/Condor) scratching mechanism:
       • watchdog style (outside process monitoring the node’s “limbo” state)
     • Initial trouble with lost/stuck jobs resolved with extra wait time
     • Node down-time < 10 mins
     • Very good candidate for virtualisation, as no re-installation is needed (a simple VM restart is enough)

  8. Results
     • Using NMI and the Condor parallel universe, we were able to address the scenario described above
     • The delays were minimal, and the experimental timeOuts were adjusted for optimal performance
     • The developed code can be consulted in CVS, module: org.etics.nmi.system-tests
     • Non-conditional persistency: the node on which the service runs remains operational for a predefined period of time
       • Sleep appended to the code
       • Expiry-time communication via the NMI/Hawkeye module

  9. Results
     • Conditional persistency: the node should be frozen if the job fails (not implemented yet, but easy)
     • Failure propagation: should one of the parallel tests fail, the whole job flow is immediately aborted
     • The set of parallel job nodes exits immediately when the node_0 job exits (let node_0 be the “last” in the chain)
     • Output format definition is up to the submitter
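
The failure-propagation policy can be illustrated in-process. This is only a sketch of the policy using Python threads, not the Condor mechanism itself; `run_flow` and the cooperative `abort` event are assumptions made for the example.

```python
import threading
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def run_flow(jobs):
    """Run all jobs in parallel; signal an abort as soon as one of them fails.

    jobs maps a job name to a callable taking a threading.Event.
    Jobs are expected to check the event and stop early once it is set.
    Returns (ok, failed_names).
    """
    abort = threading.Event()
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = {pool.submit(fn, abort): name for name, fn in jobs.items()}
        # Returns at the first exception, or when every job has completed.
        done, _pending = wait(futures, return_when=FIRST_EXCEPTION)
        failed = sorted(futures[f] for f in done if f.exception() is not None)
        if failed:
            abort.set()  # tell the remaining jobs to give up
    return (not failed, failed)
```

Threads cannot be killed from outside, so the remaining jobs stop only when they observe the event. A real parallel-universe job would rely on the scheduler to tear down the other nodes.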

  10. Results
      • Context, name spaces: assured thanks to the Condor/NMI design
      • A tester wants to use their own (external) service instance, e.g. VOMS_server: possible, but reproducibility is not guaranteed
      • Multi-site / across-firewall tests: possible (see Andy’s talk)
      • Is the test job different from a standard build submission? Not from the WP2 point of view
      • Proposal of the YAML format for the dependency definitions (see flow-spec.yaml)

  11. flow-spec.yaml

      # First, a list of all jobs.
      ---
      - UI
      - CE
      - WMSLB
      - VOMS_mysql
      - RGMA

      # Now, mapping the job name to its script.
      ---
      UI: UI.sh
      CE: CE.sh
      WMSLB: WMSLB.sh
      VOMS_mysql: VOMS_mysql.sh
      RGMA: RGMA.sh

      # Now, a hash mapping each job to its dependencies at the nodeid discovery
      # stage.
      ---
      UI: [RGMA, VOMS_mysql, WMSLB, CE]
      CE: [WMSLB]
      WMSLB: [VOMS_mysql]
      VOMS_mysql: [RGMA]
      RGMA: []

      # Timeouts for nodeid discovery stage.
      ---
      UI: 35
      CE: 25
      WMSLB: 15
      VOMS_mysql: 10
      RGMA: 0
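
A dependency mapping like the one in flow-spec.yaml determines a startup order through a topological sort. The sketch below assumes the mapping and the timeouts have already been loaded into dicts (no YAML parser is used here) and relies on `graphlib` from the Python 3.9+ standard library; `start_order` is an illustrative name, not part of the ETICS module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Third and fourth documents of flow-spec.yaml, copied by hand.
DEPS = {
    "UI": ["RGMA", "VOMS_mysql", "WMSLB", "CE"],
    "CE": ["WMSLB"],
    "WMSLB": ["VOMS_mysql"],
    "VOMS_mysql": ["RGMA"],
    "RGMA": [],
}
TIMEOUTS = {"UI": 35, "CE": 25, "WMSLB": 15, "VOMS_mysql": 10, "RGMA": 0}


def start_order(deps):
    """Return the jobs so that every job comes after all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for name in start_order(DEPS):
        print(name, TIMEOUTS[name])
```

For this mapping the order is RGMA, VOMS_mysql, WMSLB, CE, UI, and the listed timeouts grow along that same order.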

  12. gLite/general issues
      • Do we adopt the YAML format?
      • Do we need to create temporary CA servers, or do we expect this from the testers/code submitters?
        • pass-phrase problem
      • Do we write the site-info.def file upfront, or do we assume future auto-discovery?

  13. Next steps
      • Virtualisation using the WoD (WindowsOnDemand) service
        • Initial assessment very positive
        • Customized installation à la ETICS WN
        • Candidate for the “freeze” scenario: one can programmatically export/import the VM
        • Free as of today, paid in the future (should we run a dedicated/private server?)
      • Virtualisation using VMware
        • Base installation (Alberto can say much more)
        • API exists
      • Virtualisation with Condor: see Andy’s talk

  14. Next steps
      • Demo for the PM12 (review)?

  15. Discussion
      • Q & A
