D0 Grid Data Production Initiative: Coordination Mtg 13

Version 1.0 (meeting edition), 11 December 2008
Rob Kennedy and Adam Lyon
Attending: …

D0 Grid Data Production Outline
• Summary
  • Deployment 1 follow-up done as far as possible.
  • System running very smoothly since fcpd fixes. Events per day is up… but need more statistics before drawing conclusions.
  • Deployment 2 scaled back to observe system under stable conditions, FWD6 re-re-tasked.
• News
  • Exec Mtg w/D0 Spokes and Vicky on Dec 12.
  • Stay tuned for whether next week’s Coordination Meeting is on…

D0 Grid Data Production Deployment 1 Status, Follow-up
• Deployment 1: Split Data/MC Production Services – Completed, with follow-up on:
  • a. Queuing nodes now set up using latest WM installer.
    • Auto-update code in place (v0.3), not enabled on purpose to avoid possible format confusion.
    • Defer downtime to cleanly enable auto-update and format changes to Jan ’09. Hand edit gridmap files until then.
  • b. QUE2 health?
    • Health issue was: power cord knocked out and samgrid2 did not auto-restart (fixed).
    • Check samgrid for this issue.
    • AL’s monitoring still being productized, not running on QUE2 yet.
  • c. SWITCH: QUE1 = MC, and QUE2 = DATA PROD… Happened yet?
    • Keep DATA PROD information on QUE1 for expected time.
    • Remote MC users no longer need to change their usage with this switch, simpler to implement.
  • d. FWD3 not rebooted yet, so has not picked up ulimit-FileMaxPerProcess… not pressing.
    • When opportunity arises. No hurry.
  • e. New Condor version: evidence yet that Periodic Expression no longer blocking system?
    • YES, it is gone. No more 2-per-day “freezes”. BIG IMPROVEMENT!!!
  • f. Integrating experience into installation procedures and formalizing hand-off from dev to ops.
    • Lessons Learned plus… Done 12/8/2008. See next slide.
  • g. Fcpd version, monitoring, and restart mechanism (server_run issue)
    • New fcpd and server_run deployed by AL everywhere now.

D0 Grid Data Production Deployment 1 Review Highlights
• Umbrella/Installer Product approach a success
  • Tailoring effort same as before, but product installation and version support are now quite simple. Adam created WM Installer v0.3.
  • How do the XMLDB and text config files work, interact?
  • How to keep in sync? How to deploy without wiping XMLDB?
• Dev/int systems still needed to support a more complete and managed dev/int/prd process
  • Note: FWD6 to be LCG node, not OSG FWD node after all.
  • 64-bit dev machine, dedicated integration/test stand hardware
  • (related: Spare depth and recovery plan)
• More Issues (notes are posted in usual places)
  • Problems reading .doc format files (Installation doc)
  • New machine deployments – best way to bootstrap UPS
  • Enabling “operations-driven development” by REX/Ops

D0 Grid Data Production DZGDP System Status
[Plots: “Past Month” and “This Week”; annotations: “deployment”, “fcpd fails, jobs wait for data”, “Normal queue submission cycle”]
• Data Production: Stable since fcpd/server_run issue resolved
  • Now: Slots are being kept occupied, CPUs are kept loaded. NEvents/day is up (not shown)
  • Fcpd/server_run issue perhaps more easily resolved since there were no other issues to disentangle.
• MC Production: set OSG resource efficiency record (Abhishek Singh Rana, 12/4/2008)
  • Maybe some collateral improvements? 8^)
  • And separate work underway to get LCG side of MC Production up to capability too.

D0 Grid Data Production Deployment 2 “Feature” List
• Deployment 2: Optimize Data and MC Production Configurations after splitting of services in Deployment 1
• Time frame: December 8-10, with 1 week+ observation before holidays… NOW.
• 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length to reduce number of “touches” per day.
  • RDK: consensus seems to be: leave config as is. Put effort into next layer of issues that are now clearer with the Grid layer so much more stable than before. No config param changes.
• 2. New SAM-Grid Release with support for new Job status value at Queuing node
  • Defect in new Condor (Qedit, old version OK) prevents feature from working.
  • Kluge alternative or wait for Condor fix? PM posed question to MD…
• 3. Deploy FWD6 (Samgfwd06). Takes over current FWD5 role, and FWD5 becomes MC Merge FWD node.
  • Repeated problems with LCG FWD node and related services → FWD6 to return to its original role as an LCG FWD node and related services host instead of an OSG FWD node.
  • FWD5 to be used until a modern replacement is procured.
• 4. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  • Not in project plan yet. DEFER TO JANUARY
• 5. Formalize transfer of QUE1 (samgrid.fnal.gov) to FEF from FGS (before an OS upgrade)
  • Not in project plan yet. DEFER TO JANUARY

D0 Grid Data Production Deployment Configuration (Green = now, Blue = in progress, Yellow = future)
• Reco
  • FWD1: GridMgrMaxSubmitJobs/Resource = 1250 (was 750, default 100)
  • FWD5: 1250
• MC, MC Merge
  • FWD2: 1250 (was 750, default 100)
  • FWD4: 1250
• Reco Merge
  • FWD3: 750/300 grid each
• QUE1: MC, MC Merge - not used by MC Prod at first, now is.
• QUE2: Reco, Reco Merge – keep here to maintain history
• Switch these to simplify transition... Remote MC users make no change.
• SAM Station: All job types
• Jim Client: submit to QUE1 or QUE2 depending on qualifier, QUE1 default
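The assignments on this slide can be read as a small routing table. The sketch below only transcribes them for illustration: the dictionary and function names are hypothetical, this is not the Jim Client's actual code, and the slide's "switch these" note leaves open which QUE assignment is live at any given time.

```python
# Illustrative transcription of the slide's assignments (all names are hypothetical).
QUEUE_FOR_JOB_TYPE = {
    "MC": "QUE1",
    "MC Merge": "QUE1",
    "Reco": "QUE2",
    "Reco Merge": "QUE2",
}

FWD_FOR_JOB_TYPE = {
    "Reco": ["FWD1", "FWD5"],
    "MC": ["FWD2", "FWD4"],
    "MC Merge": ["FWD2", "FWD4"],
    "Reco Merge": ["FWD3"],
}

DEFAULT_QUEUE = "QUE1"  # the slide lists QUE1 as the default


def pick_queue(job_type=None):
    """Return the queuing node for a job type, falling back to the default."""
    return QUEUE_FOR_JOB_TYPE.get(job_type, DEFAULT_QUEUE)


def pick_forwarders(job_type):
    """Return the forwarding nodes listed for a job type (empty list if unknown)."""
    return list(FWD_FOR_JOB_TYPE.get(job_type, []))
```

Keeping the default in one place means a submission without a qualifier still lands on QUE1, which matches the note that remote MC users make no change.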

D0 Grid Data Production Task Status (1 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product GG,AL Mon 10/27/08 Fri 12/5/08 28d
  • 1.1.8.16 New FWD Install Proc/Doc hand-off to REX/Ops AL AL Tue 11/18/08 Fri 11/21/08 4d
  • 1.1.8.6 Umbrella Product: Update FWD Installation Procedure AL JB Mon 12/1/08 Tue 12/2/08 2d
  • 1.1.8.14 Add ulimitOpenFileMax setting to FWD Installation Procedure AL REX Mon 12/1/08 Mon 12/1/08 1d
  • 1.1.8.15 New QUE Install Proc/Doc hand-off to REX/Ops AL AL Tue 11/18/08 Fri 11/21/08 4d
  • 1.1.8.10 Umbrella Product: Update QUE Installation Procedure AL JB Mon 12/1/08 Tue 12/2/08 2d
  • 1.1.8.13 Umbrella Product: FWD and QUE Installation Procedures archived AL REX Wed 12/3/08 Wed 12/3/08 1d
  • 1.1.8.17 Umbrella Product: FWD, QUE Auto-Maint/Monitoring into a package AL AL Thu 12/4/08 Fri 12/5/08 2d
  • 1.1.8.11 Milestone: FWD, QUE Pkging with Version-Based Umbrella Prod done GG,AL Fri 12/5/08 Fri 12/5/08 0d
  • Tasks to accomplish some of the follow-up are not relabeled as such above.
• 1.1.14 Forwarding Node 6 (Fwd6) --- NEW --- AL Mon 12/1/08 Fri 12/12/08 10d
  • 1.1.14.1 Fwd6: Server Hardware OS Install AL FEF Mon 12/1/08 Tue 12/2/08 2d
  • 1.1.14.2 Fwd6: Increase ulimitOpenFileMax to 16k AL FEF Wed 12/3/08 Wed 12/3/08 1d
  • 1.1.14.3 Fwd6: Server Hardware Burn-in AL FEF Thu 12/4/08 Fri 12/5/08 2d
  • 1.1.14.4 Fwd6: Verify Platform Installation AL JB Thu 12/4/08 Fri 12/5/08 2d
    • Log disk space insufficient relative to FWD node specification.
  • 1.1.14.5 Fwd6: Request and Install Grid Certs AL JB Thu 12/4/08 Mon 12/8/08 3d
  • 1.1.14.6 Fwd6: Install with Version-Based FWD Umbrella Product AL JB Tue 12/9/08 Tue 12/9/08 1d
  • 1.1.14.7 Fwd6: Single Job Small-Scale Test AL JB Wed 12/10/08 Wed 12/10/08 1d
  • 1.1.14.8 Fwd6: Large-Scale Tests AL JB,MD,JS Thu 12/11/08 Fri 12/12/08 2d
  • 1.1.14.9 Fwd6: Setup Automated Maintenance, Monitoring AL JB Thu 12/11/08 Fri 12/12/08 2d
  • 1.1.14.10 Milestone: Fwd6 Ready to Deploy AL Fri 12/12/08 Fri 12/12/08 0d
  • Notes: FWD6 re-re-tasked to be LCG FWD node (out of scope for Initiative Phase 1)

D0 Grid Data Production Task Status (2 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.3.1 SAM-Grid Job Status Info
  • 1.3.1.7 New Job Status Value at QUE Node: Later Work GG PM Tue 11/18/08 Mon 11/24/08 5d
    • Blocked from working by Condor 7 bug in Qedit. Works in Condor 6.
  • 1.3.1.1 Use "Same" Proxy for Gridftps GG PM Thu 11/20/08 Mon 11/24/08 3d
  • 1.3.1.3 SAM-Grid Release with Job Status Info feature GG PM Tue 11/25/08 Wed 11/26/08 2d
  • 1.3.1.6 Pre-deployment test of new SAM-Grid Release AL REX Mon 12/1/08 Fri 12/5/08 5d
  • 1.3.1.4 Upgrade D0Runjob version used by Data Production AL MD,AL Thu 10/30/08 Fri 10/31/08 2d
  • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Production AL REX Fri 12/5/08 Fri 12/5/08 0d
• 1.1.11 Deployment 1 Review AL Tue 12/2/08 Tue 12/2/08 1d
• 1.1.7 Deployment Stage 2
  • 1.1.7.1 Deployment 2: Plan: Optimize Data, MC Prod Configurations AL ALL Wed 12/3/08 Fri 12/5/08 3d
  • 1.1.7.2 Deployment 2: Execute AL REX Mon 12/8/08 Wed 12/10/08 3d
  • 1.1.7.3 Deployment 2: Monitor AL REX Thu 12/11/08 Tue 12/16/08 4d
  • 1.1.7.8 Deployment 2: Complete Grid Production Configuration AL REX Wed 12/17/08 Wed 12/17/08 1d
  • 1.1.7.4 Deployment 2: Sign-off AL REX Thu 12/18/08 Thu 12/18/08 1d
  • 1.1.7.5 MILE 2: Deployment 2 Completed AL Thu 12/18/08 Thu 12/18/08 0d
• 1.1.13 Deployment 2 Review AL Fri 12/19/08 Fri 12/19/08 1d
  • Moot, under this plan. Similar review as part of Next Phase planning in January.

D0 Grid Data Production Task Status (3 of 3) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
• 1.4 Metrics
  • nSubmissions plot for Sep ’08 Mike? By Thurs end of day PLEASE???
  • nEvents/day at 5-8 MEvts/day so far vs. 5.2 MEvts/day average for 3 months before
    • A little better than expected since only 11% “slot downtime” observed in Sep ’08. Could be a multiplicative effect on top of the “slot downtime”, so greater impact overall on NEvents/day.
    • Let’s see if this holds.
  • CPU/Wall time for d0farm is down to 84% (Nov), 73% (Dec). CAB Accounting
    • Dec stats include fcpd downtimes which would have reduced this ratio as observed.
    • Expect to see rebound since fcpd/server_run issues resolved.
    • Congestion in fcpd queues or in some other data path?
    • Some other source of CPU inefficiency “underneath” the Grid layer?
    • Another 5-15% improvement possible? This is a question for January 2009.
• Topics/Tasks not Covered/Planned Yet
  • Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
    • Tool to track cert expiration? Run to query known certs once/month? (need to maintain cert list).
  • Cost-benefit: push FWD, QUE nodes to be appliances
    • spec’d from OS (including ulimit-FileMaxPerProcess setting) to apps to grid system configuration
    • rapid wipe and re-install as part of spares plan and recovery procedure
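The cert-expiration tool raised above could start as a monthly check over a maintained list. The following is a minimal sketch under stated assumptions: the cert names and expiry dates are made up for illustration, the list is assumed to be kept by hand, and nothing here queries a real certificate.

```python
from datetime import date, timedelta


def certs_expiring_soon(cert_expiry, today, warn_days=30):
    """Return (name, days_left) for certs expiring within warn_days, soonest first.

    Already expired certs are included with a negative days_left.
    """
    cutoff = today + timedelta(days=warn_days)
    soon = [
        (name, (expires - today).days)
        for name, expires in cert_expiry.items()
        if expires <= cutoff
    ]
    return sorted(soon, key=lambda item: item[1])


# Hypothetical hand-maintained cert list (names and dates are illustrative only).
known_certs = {
    "samgfwd01-host": date(2009, 1, 5),
    "samgfwd06-host": date(2009, 12, 8),
    "samgrid-host": date(2008, 12, 1),
}

report = certs_expiring_soon(known_certs, today=date(2008, 12, 11))
for name, days_left in report:
    print(f"{name}: {days_left} days left")
```

Run once a month with a 30-day warning window, this flags anything that would lapse before the next run; the open question from the slide (who maintains the list) is unchanged.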

D0 Grid Data Production Issues List (p.1/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid.
  • SAM-Grid Job Status development (see discussion on earlier slides)
• 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting.
  • AL: Improved script to identify them and treat symptoms. Has not happened recently. But why is it happening at all?
  • Not specific to SAM-Grid Grid Production
• 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet.
  • Condor 7 Upgrade – RESOLVED!
• 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node.
  • Move SAM station off of FWD1 – DONE!
  • Context Server move as well – DONE!
• 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.
  • Nothing in Phase 1. Later Phase may include a less effort-intensive approach to accomplish same result.
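To make issue 5 concrete, load-aware matching could look like the sketch below. It is an illustration only, not SAM-Grid's matching logic: the function name, the input shape, and the job counts are assumptions.

```python
def pick_forwarding_node(nodes):
    """Pick the forwarding node with the most free slots.

    nodes maps a node name to (current_jobs, max_jobs). Nodes that have
    reached their limit are never chosen; None is returned if all are full.
    Ties go to the alphabetically first name.
    """
    best_name, best_free = None, 0
    for name in sorted(nodes):
        current, limit = nodes[name]
        free = limit - current
        if free > best_free:
            best_name, best_free = name, free
    return best_name


# Hypothetical snapshot: FWD1 is at its limit, FWD5 still has room.
snapshot = {"FWD1": (1250, 1250), "FWD5": (400, 1250)}
choice = pick_forwarding_node(snapshot)
```

Unlike random assignment, this never hands a job to a node that is already at its maximum while another node has slots available.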

D0 Grid Data Production Issues List (p.2/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge this problem will worsen.
  • Nothing in Phase 1. Later Phase may include decoupling of durable location servers?
  • No automatic handling of hardware failure. System keeps trying even if storage server down.
• 7) CurMatch limit on forwarding nodes: We need to increase this limit which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable.
  • Decouple FWD nodes between Data and MC Production and tune separately for each.
  • Decoupling done. Can tune for Data Production.
• 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each.
  • Decoupling done. Can tune for Data Production.
• 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.
  • Decouple FWD nodes between Data and MC Production and tune separately for each.
  • Decoupling done. Can tune for Data Production.
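The reasoning in issue 8 can be checked with a little arithmetic. In the sketch below the total of 1500 processing slots is an assumption, chosen only because two nodes at 750 each then lose exactly half on a single failure, as the issue describes; the function name is hypothetical, and the 1250 figure is the per-node value from the Deployment Configuration slide.

```python
def surviving_fraction(node_limits, total_slots, failed_node):
    """Fraction of processing slots still reachable after one node fails."""
    remaining = sum(
        limit for name, limit in node_limits.items() if name != failed_node
    )
    return min(1.0, remaining / total_slots)


TOTAL_SLOTS = 1500  # assumed total; not stated in the slides

# Old limit: 750 per node, so one failure cuts off half the slots.
old = surviving_fraction({"FWD1": 750, "FWD5": 750}, TOTAL_SLOTS, "FWD1")

# Raised limit: 1250 per node, so one failure leaves most slots usable.
new = surviving_fraction({"FWD1": 1250, "FWD5": 1250}, TOTAL_SLOTS, "FWD1")

print(f"750 per node: {old:.0%} of slots survive; 1250 per node: {new:.0%}")
```

Under these assumed numbers a single forwarding node failure degrades throughput instead of halving it, which is the outcome the issue asks for.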

D0 Grid Data Production Issues List (p.3/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 10) 32,001 Directory problem: Band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system given the need to retain job state for tens of thousands of jobs in a large production system.
  • Not currently an issue. Acceptable band-aid is in place.
  • Already a cron job to move information into sub-directories to avoid this.
• 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system.
  • Condor 7 Upgrade? May be different causes in other episodes... Only one was understood.
  • Decouple FWD nodes between Data and MC Production and tune separately for each. (mitigation only)
• 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night, that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete and in some cases all log information is lost.
  • Later Phase to include more detailed debugging with more modern software in use.
  • At least some issues are not SAM-Grid specific and known not to be fixed by VDT 1.10.1m. (KC).
  • For example: GAHP server... Part of Condor
• 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc) needs to be set up to automatically restart all necessary services on reboot. We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear to not get any notification when some of these nodes reboot.
  • Done during Evaluation Phase. Make sure this is set up on new nodes as well. – DONE!
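Issue 10 mentions a cron job that moves job information into sub-directories. The deployed script is not shown in these slides, so the following is only a sketch of the general idea, assuming entries can be bucketed by a hash of their name; the `shard_` prefix, function names, and demo directory names are all hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def bucket_for(name, width=2):
    """Stable bucket label for an entry name (first hex digits of its MD5)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()[:width]


def shard_directory(root, prefix="shard_"):
    """Move every entry directly under root into a hash-named sub-directory.

    Entries that are already shard directories are skipped, so the function
    can be re-run safely (for example from cron). Returns the number moved.
    """
    moved = 0
    for entry in os.listdir(root):
        if entry.startswith(prefix):
            continue
        bucket = os.path.join(root, prefix + bucket_for(entry))
        os.makedirs(bucket, exist_ok=True)
        shutil.move(os.path.join(root, entry), os.path.join(bucket, entry))
        moved += 1
    return moved


# Demo on a throwaway directory with ten hypothetical job-state directories.
with tempfile.TemporaryDirectory() as root:
    for i in range(10):
        os.mkdir(os.path.join(root, f"job_{i}"))
    moved_first = shard_directory(root)   # moves all ten entries
    moved_second = shard_directory(root)  # nothing left to move
    top_level = sorted(os.listdir(root))
```

With two hex digits there are at most 256 buckets, so the top-level directory stays small regardless of how many jobs accumulate; whether anything that reads the job state can follow the moved paths is a separate question this sketch does not address.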

D0 Grid Data Production Issues List (p.4/4) (Red = not treated in Phase 1, Green = treated or non-issue, Yellow = notes)
• 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation).
  • Nothing in Phase 1. Later Phase to include decoupling of SAM stations, 1 each for Data and MC Production.
• 15) Lack of Transparency: No correlation between the distinct grid and PBS id’s and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, especially important for debugging.
  • Tool identified in Evaluation Phase to help with this. Consider refinement in later Phase.
• 16) Periods of Slow Fwd node to CAB Job transitions: newly added
  • Condor 7 Upgrade and increase ulimit-OpenFileMaxPerProcess to high value used elsewhere.
  • Cures all observed cases? Not yet sure.
• MC-specific Issue #1) File Delivery bottlenecks: use of SRM at site helps
  • Out of scope for Phase 1. SRM specification mechanism inadequate. Should go by the site name or something more specific.
• MC-specific 2) Redundant SAM caches needed in the field
  • Out of scope for Phase 1
• MC-specific 3) ReSS improvements needed, avoid problem sites,….
  • Out of scope for Phase 1. PM sent doc, met with Joel.
• MC-specific 4) Get LCG forwarding nodes up and running reliably
  • Out of scope for Phase 1. This is being worked on outside of Initiative Phase 1 though.
  • FWD nodes resolved. Staging issue recently though…
