
D0 Grid Data Production Initiative: Coordination Mtg



  1. D0 Grid Data Production Initiative: Coordination Mtg • Version 1.0 (meeting edition) • 19 March 2009 • Rob Kennedy and Adam Lyon • Attending: RDK, AL, …

  2. D0 Grid Data Production Overview • News and Summary • System mostly OK • PBS Upgrade issues resolved • All-CAB2 begun Fri 13th March • 11.9 MEvts (unmerged) produced on Sat 14th March • Some issues have surfaced after this 2x capacity increase, expected and otherwise… • Backlog: 156 MEvts (Raw – Unmerged) • Down from ~170 MEvts after 5.5 days of increased capacity. • Model: average of 9.5 MEvts/day unmerged, 5.5 MEvts/day raw = 4 MEvts/day catch-up (worked through in the sketch below). • ~5.5 weeks left to work this down to 35 MEvts while maintaining high resource utilization. • Still on target to finish early (limited stats), but need steady operations to remain so. • Agenda • All-CAB2 Operations Issues • Phase 2 Other Tasks
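A quick check of the catch-up model quoted above, using only the slide's own round numbers (nothing else is assumed): the net drain is about 4 MEvts/day, so working 156 MEvts down to the 35 MEvts target takes roughly 30 days of steady running, which is why finishing inside the ~5.5 weeks remaining looks feasible.

```python
# Sketch of the catch-up model using the slide's nominal rates (not a new measurement).
produced_per_day = 9.5   # MEvts/day unmerged production with all of CAB2
logged_per_day   = 5.5   # MEvts/day of new raw data
backlog_now      = 156.0 # MEvts (raw minus unmerged) as of this meeting
backlog_target   = 35.0  # MEvts, roughly one week of backlog

net_drain = produced_per_day - logged_per_day          # ~4 MEvts/day catch-up
days_needed = (backlog_now - backlog_target) / net_drain
print(f"net drain {net_drain:.1f} MEvts/day, "
      f"~{days_needed:.0f} days (~{days_needed/7:.1f} weeks) to reach target")
# ~30 days (~4.3 weeks), comfortably inside the ~5.5 weeks remaining before end of April.
```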

  3. D0 Grid Data Production All-CAB2 Processing Issues (1) • PBS Upgrade on CAB: restore normal ops before starting All-CAB2 use. • Several issues cropped up after the PBS upgrade on 3/10/2009: • FIXED: Rebooted all worker nodes after the PBS client upgrade. Several nodes were subsequently re-installed at the OS level at this time. A CFEngine communication problem led to some of these node re-installations being done with incorrect configuration (i.e. missing or unlinked local scratch areas). • FIXED (is it now?): Monitoring for WN "black holes" (jobs fail quickly, so the WN sinks all jobs in the queue) appeared not to work correctly; it should have taken the misconfigured re-installed nodes offline soon after jobs started failing en masse… but did not (an illustrative check is sketched below). • RESOLVED (mitigated, not fixed): Possible race condition related to scratch area handling. More likely to occur if many batch jobs start on the same node in a short time. Mitigated with a node selection change: most powerful + (new) most available + … • Decision: Impact was considered low enough to proceed with All-CAB2. Proceeded. • …
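The "black hole" monitoring mentioned above is the piece that apparently failed to take the misconfigured nodes offline. As an illustration only, here is a minimal sketch of the kind of check such monitoring performs; the threshold values, the job-record format, and the flagging policy are hypothetical assumptions, not the actual CAB tooling.

```python
from collections import defaultdict

# Hypothetical policy values; the real monitoring's thresholds are not given in the slide.
MIN_JOBS = 20                  # ignore nodes that have only seen a handful of jobs
FAIL_FRACTION_THRESHOLD = 0.9  # nearly every job failing
FAST_FAIL_RUNTIME_S = 60       # ...and failing suspiciously quickly

def find_black_holes(job_records):
    """job_records: iterable of (node, succeeded: bool, runtime_s: float) tuples."""
    per_node = defaultdict(list)
    for node, ok, runtime in job_records:
        per_node[node].append((ok, runtime))
    suspects = []
    for node, jobs in per_node.items():
        if len(jobs) < MIN_JOBS:
            continue
        fail_frac = sum(1 for ok, _ in jobs if not ok) / len(jobs)
        mean_runtime = sum(r for _, r in jobs) / len(jobs)
        if fail_frac >= FAIL_FRACTION_THRESHOLD and mean_runtime <= FAST_FAIL_RUNTIME_S:
            suspects.append(node)  # candidate for being marked offline in the batch system
    return suspects
```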

  4. D0 Grid Data Production All-CAB2 Processing Issues (2) • Forward node load balancing issue (AL) – appears to be mitigated. • Condor dev found a mistake in the Condor Negotiator implementation: it is not round-robin in the code. • Simplify new Condor deployments with separate Condor/VDT layers (part of the Phase 2 plan). • How many bug fixes do we want in this Condor version before pressing it into production? Can wait… • Short-term mitigation/work-around: "randomize rank" seems to work (idea illustrated below). • Not clear that we and the developers have a complete understanding of the ranking implementation, timing, etc. • FWD4, FWD5 swapped in Data and MC Production to handle the greater load in All-CAB2 processing. • With "rank randomization" and the FWD4/5 swap, we are ready on this issue for All-CAB2 processing. • Decrease frequency of queries (in Condor)? We chose last week to defer, but… • High load on FWD4 observed, related to the pings. • D0srv063 repair/replace (FEF via tickets) – no change; create a contingency plan. • 3 crash episodes two weeks ago, but not sure what happened. No problems since then… • Services did not restart after reboots… and appeared to be set up to do so. Why not? • Node is off maintenance. Should be replaced… no new nodes in the short term. ('65 replaced) • We chose to create a contingency plan in case '063 fails again… and it did. • 1. AL rerouted unmerged TMB from '063 to '071, '072 (skip a hop). Still goes to '065 as before. • 2. Bringing up '077 as an additional (SAM cache or durable?) storage location.
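For intuition on the "randomize rank" work-around: if the matchmaker always picks the highest-ranked match and ranks are deterministic, one forwarding node absorbs everything; adding a random term breaks the tie. This is a toy sketch of that idea only; the node names, base ranks, and selection loop are illustrative assumptions, not the actual Condor negotiator or our FWD configuration.

```python
import random
from collections import Counter

# Illustrative base ranks: a purely deterministic pick would always choose the same max.
base_rank = {"fwd1": 10.0, "fwd2": 10.0, "fwd3": 10.0}

def pick_node(randomize: bool) -> str:
    # With randomize=True, a small random term breaks ties so load spreads out,
    # which is the effect the "randomize rank" mitigation aims for.
    ranks = {n: r + (random.random() if randomize else 0.0)
             for n, r in base_rank.items()}
    return max(ranks, key=ranks.get)

for randomize in (False, True):
    counts = Counter(pick_node(randomize) for _ in range(1000))
    print("randomized" if randomize else "deterministic", dict(counts))
# Deterministic ranking piles all 1000 picks onto one node; the randomized
# version distributes them roughly evenly across fwd1-3.
```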

  5. D0 Grid Data Production All-CAB2 Processing Issues (3) • Station and Context Server load issues (AL, RI) • Issue 1: Station log > 2 GB causes a station crash. The log rotation script had not been working since the move of the station to '097. Fixed 19 Mar 2009 by RI. • Issue 2: ? • Disk space on '071, '072 (AL, RI) • Status?

  6. D0 Grid Data Production Near-Term Work: Action Items (1) • Merge Processing slowdowns (MD, RI) • … suspect an issue in the SAM station, possibly DB-server related? Has not recurred in a while, and no news. • Test Data Production with Opportunistic Usage (MD) – no news this week. • MD: Can submit to a specific cluster; FGS prefers submitting more generally with a constraint string. Working on escaping the string to get it all the way through ('/' mangled at the ClassAds stage?). PM helping. (General escaping idea sketched below.) • Recent: GPFarms and CDF nodes worked, but a recent test of CMS nodes did not work any more. • All-CAB2 load has pushed the D0GDP system to the edge. We may want to defer large-scale work in this area until afterward. • CDF node deployment (FEF): almost done. • Put these last few CDF worker nodes into the system once the PBS head nodes are upgraded. Low priority. • JA: Desire to have a PBS test stand for validating new releases, testing off to the side.
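On the constraint-string mangling: when a constraint expression is passed through several shell or wrapper layers, characters such as '/' and quotes can be rewritten before they reach the scheduler. A minimal sketch of the general defence is to quote the string once per layer it traverses; the constraint text and submit command below are invented for illustration, not the actual SAMGrid submission path.

```python
import shlex

# Hypothetical constraint expression containing '/', quotes, and spaces.
constraint = 'OpportunisticOK == True && Site == "FNAL_GPFARM/cms"'

# Quote once for each shell layer the string will traverse, innermost first.
quoted_once  = shlex.quote(constraint)   # survives one shell
quoted_twice = shlex.quote(quoted_once)  # survives a wrapper that re-invokes a shell

# Illustrative command line only; the real submission tool and flag names differ.
print(f"submit_tool --constraint {quoted_once}")
```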

  7. D0 Grid Data Production Near-Term Work: Action Items (2) • Condor and/or Globus Driven Issues • Condor: Follow up on the Qedit defect – PM is actively working with Condor dev on this. • Globus: Address the "stuck state" issue affecting both Data and MC Production – PM has some examples. Update: one instance (MC Prod) was due to the LSF job manager implementation; a GOC ticket was created. Similar issue in other job managers? Need live "stuck" examples to pass to Condor dev to figure this out. • Condor release procedure: Enable REX to deploy Condor as an overlay on top of VDT. • Speeds up deployment of Condor fixes, since we no longer have to wait for the VDT production validation cycle to complete. • Decouples this more frequent task from GRID developer schedules. • Phase 1 Follow-up • Enable auto-update of gridmap files on queuing nodes: AL: #2 done. #1 is more complicated, not done… no downtime required though. • Enable monitoring on queuing nodes – AL: not all done yet. • Phase 2 Notes • WHY were the differences in FWD node config not captured beforehand? (FWD4/5 swap) • Suggestion: specializations could be localized to a standard configuration variant, e.g. a "Data Prod" versus "MC Prod" forward node config. • Requirement for the future: change config in one place and propagate from there (pattern sketched below).
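One way to read the "change config in one place" requirement is a single base configuration with small named variants layered on top, so that a Data Prod versus MC Prod forwarding node differs only in its overlay. The sketch below shows that pattern under those assumptions; all keys and values are invented placeholders, not the real FWD node settings.

```python
# Invented placeholder settings; real FWD configuration keys differ.
BASE = {
    "grid_monitor": True,
    "max_jobs": 200,
    "scratch_dir": "/local/scratch",
}

VARIANTS = {
    "data_prod": {"max_jobs": 400},   # e.g. deeper queues for data production
    "mc_prod":   {"max_jobs": 150},
}

def render_config(variant: str) -> dict:
    # The base is edited in one place; each node only names its variant.
    cfg = dict(BASE)
    cfg.update(VARIANTS[variant])
    return cfg

print(render_config("data_prod"))
print(render_config("mc_prod"))
```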

  8. D0 Grid Data Production High-Level Schedule Proposal • March 2009 • PBS Head Node Upgrade (FEF) – March 10 • All-CAB2 Data Processing – March 12 through April … highest priority • Issues Impacting All-CAB2 Data Processing – March • Listed in previous slides. Also adjusting config for deeper queues. • May involve a Condor kluge to mitigate the FWD balancing issue. • April 2009 • All-CAB2 Data Processing – March 12 through April … highest priority • Scale back by end of April; re-establish "normal" operations, albeit with deeper queues, etc. • CAB Config Review – early to mid April • Cover topics listed in the backup slide (slide 12, see amber bullet "review"). • Some topic bullet points may be moot by this time… it will help if people consider these from their own point of view earlier. • Monitoring Workshop – late April • Assess what we all have now, where our gaps are, and what would be most cost-effective to address. • See Gabriele's white paper on D0 grid job tracking (includes monitoring, focus on OSG). In draft form now. • May 2009 • Release new SAMGrid with the added state feature – early May • Upgrade the production release of Condor with fixes – early May • Initiative close-out processes, including review and workshop follow-up – early to mid May

  9. D0 Grid Data Production Backup Slides • Not expected to be presented at the meeting.

  10. D0 Grid Data Production All-CAB2 Processing Plan • March 10 (Tues): PBS Head Node Upgrade – h/w and PBS s/w • Done by FEF. Then get 2 days of experience with the upgraded system to isolate "new PBS" issues from "expanded d0farm" issues. • March 12 (Thu): Increase d0farm over time from ~1800 slots (5.4 MEvts/day) to ~3200 slots (9.5 MEvts/day) • System needs to be stable with the new version of PBS first, restoring the previous service level. All parties sign off, then begin expansion to All-CAB2. • Mike D. will judge how quickly to push this. Enhanced vigilance: everyone looks for bottlenecks or service problems. • March 16 (Mon): Goal date for d0farm to reach the full ~3200 slots • All parties assess the system and sign off on continued running at full size. • March 17 (Tues) onward: Work through the data until any one end condition is met. Monitor weekly. • End conditions: May 1, 2009 –or– down to 1 week of backlog remaining –or– D0 analysis users must have slots back. • Assume the backlog starts at ~175 MEvts. Based on ~1800-slot capability at recent higher luminosity values, one week of backlog is about 35 MEvts (estimate). • Processing incoming data + 140 MEvts of backlog would require about 4-5 weeks of All-CAB2 processing. • March 17 – April 28 = 6 weeks, the maximum allowed (rounding). • Exploit opportunistic CPU usage during this time too, to mitigate the risk of higher luminosity leading to slower processing. • We could reach the "backlog down to 1 week" end condition after as little as 4 weeks, assuming everything runs smoothly and Tevatron luminosity stays about where it is now, though more likely after 4.5-5 weeks (straight-line projection sketched below). Set a stretch goal internally for 4 weeks (April 14). • April 14 (Tues): Stretch goal to meet an end condition early (down to 1 week of backlog) • Evaluate end conditions in detail. End now –or– continue for 1 more week –or– continue to a maximum of 2 more weeks? Is D0 analysis OK? • April 28 (Tues): Latest sign-off on reverting to the original ~1800-slot system. • Ensure all parties are ready for the necessary changes and potential config tweaks (like FWD nodes able to handle the full load each). • April 30 (Thu): Latest date for reverting to the original ~1800-slot system. • Allows a day or so of running in the normal configuration before the weekend.
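As a check on the dates above, here is a small straight-line projection of the weekly end-condition review using the plan's own numbers (start ~175 MEvts, ~9.5 MEvts/day produced at full size, ~5.5 MEvts/day logged, one week ≈ 35 MEvts). It ignores downtime, luminosity changes, and opportunistic CPU, so it is only the average-rate baseline behind the April 14 stretch goal, not a prediction.

```python
from datetime import date, timedelta

backlog = 175.0            # MEvts assumed at the start of All-CAB2 running
net_drain_per_day = 9.5 - 5.5
one_week_backlog = 35.0
start = date(2009, 3, 17)  # full ~3200 slots in place
deadline = date(2009, 5, 1)

d = start
while backlog > one_week_backlog and d < deadline:
    d += timedelta(days=7)                 # end conditions reviewed weekly
    backlog -= net_drain_per_day * 7
print(f"backlog ~{backlog:.0f} MEvts at the weekly review on {d}")
# Reaches the one-week level at the 5th weekly review (~April 21), i.e. a bit after
# the April 14 stretch goal unless rates beat the averages or opportunistic CPU helps.
```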

  11. D0 Grid Data Production All-CAB2 Processing Project • Early March → April 2009: Keep-Up Level + Work-through-Backlog Level • Temporary expanded CAB2 use by Data Production, 2/20/2009 via email: "Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following condition happens: - when the backlog has been reduced to be less than one week of data; or - May 1, 2009; or - when there is an analysis need for more CPUs than CAB1 can provide. Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800+1400 job slots, temporarily. Thank you, Qizhong"

  12. D0 Grid Data Production Phase 2 Work List Outline • 2.1 Capacity Management: Data Prod is not keeping up with data logging. • Capacity Planning: Model nEvents per day – forecast CPU needed. • Capacity Deployment: Procure, acquire, borrow CPU. We believe the infrastructure is capable. • Resource Utilization: Use what we have as much as possible. Maintain improvements. • 2.2 Availability & Continuity Management: The expanded system needs higher reliability. • Decoupling: deferred. Phase 1 work has proven sufficient for the near term. • Stability, Reduced Effort: Deeper queues. Goal is fewer manual submissions per week. • Resilience: Add/improve redundancy at the infrastructure-service and CAB level. • Configuration Recovery: Capture configuration and artefacts in CVS consistently. • 2.3 Operations-Driven Projects • Monitoring: Execute a workshop to share what we have, identify gaps and cost/benefits. • Issues: Address the "stuck state" issue affecting both Data and MC Production. • Features: Add state at the queuing node (from Phase 1). Distribute jobs "evenly" across FWD nodes. • Processes: Enable REX/Ops to deploy new Condor… new bug fixes coming soon. • Phase 1 Follow-up: A few minor tasks remain from the rush to deploy… dotting the i's and crossing the t's. • Deferred Work List: maintain, with reasons for deferring work.

  13. D0 Grid Data Production 2.1 Capacity Management • Highest priority: will treat in Phase 2. Refine the work list to be more detailed. • Mostly covered in "short-term work"… • 2.1.1 Capacity Planning: Model nEvents per day – improve to handle different Tevatron store profiles, etc. • Goal: What is required to keep up with data logging by 01 April 2009? • Goal: What is required to reduce the backlog to 1 week's worth by 01 June 2009? … Use all of CAB2 to catch up in Mar/Apr. • Follow-up: Planning meeting Monday? • Goal: What infrastructure is required to handle the CPU capacity determined above? • Have latencies impacted CPU utilization? → MD, Adam looking into it (consider oversubscription if so). • Ling Ho: 8-core servers (2008) have scratch/data on the system disk (only 1 disk). Have observed contention slowing jobs down. • Data Handling tuning (AL, RI): tuning the SAM station, investigating whether CPUs are waiting on data. • 2.1.2 Capacity Deployment: Goals to be determined by capacity planning • Added retired CDF nodes (FEF, REX): 17 Feb 2009 • Upgrade PBS head nodes (FEF): 10 March 2009 • Plan and execute the temporary expansion of "d0farm" onto all of CAB2 to work through the backlog in March and into April. • Plan and execute some level of opportunistic CPU use for data production. • Infrastructure capacity: appears sufficient for a 25% CPU increase, probably OK for a larger increase. • Fcp config: achieved the optimal config to max out network bandwidth? Done. • 2.1.3 Resource Utilization: top-down as well as bottom-up investigation. Examples… • Goal: > 90% of available CPU used by Data Prod (assuming demand-limited at all times) • Goal: > 90% of available job slots used by Data Prod, averaged over time (assuming demand-limited at all times; one way to compute this is sketched below) • Goal: TBD… uptime goal, downtimes limited. Activities to maintain/achieve this overlap with the following…
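The >90% utilization goals above are easiest to track if the metric is pinned down. Below is a sketch of one plausible definition, slot-hours used by Data Prod divided by slot-hours available over the window; the numbers are illustrative and the real accounting source (PBS logs, SAM metadata, …) would need to be agreed at the planning meeting.

```python
def slot_utilization(used_slot_hours: float, total_slots: int, window_hours: float) -> float:
    """Fraction of available slot-hours consumed by Data Prod jobs in the window."""
    available = total_slots * window_hours
    return used_slot_hours / available if available else 0.0

# Illustrative numbers only: ~3200 slots over one day, 70,000 slot-hours used.
util = slot_utilization(used_slot_hours=70_000, total_slots=3200, window_hours=24)
print(f"job-slot utilization: {util:.1%}")   # 91.1% -> meets the >90% goal
```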

  14. D0 Grid Data Production 2.2 Availability & Continuity Mgmt • Some tasks should be done now; others are deferred to after April (post-Initiative). • The Initiative will keep a list of these for the historical record. • 2.2.1 Decoupling • All work deferred post-Initiative. • 2.2.2 Queue Optimization for Stability, High Utilization, Reduced Effort • Goal: System runnable day-to-day by a Shifter, supervised by an expert Coordinator. • Deeper queues. • Increase limits to allow either FWD node to handle the entire load if the other were to go down. • Revisit after the PBS head node upgrade and experience with the larger-capacity system. • 2.2.3 Resilience • Eliminate SPOFs, add/improve redundancy (see following slides). • 2.2.4 Configuration Recovery • Capture configuration and artefacts in CVS consistently. (Continue to do.)

  15. D0 Grid Data Production FGS Recommendations • Configure d0cabosg[1,2] to have the capability to manage jobs on both d0cabsrv[1,2]. (Yes, do in Phase 2, by FGS, at low priority.) • Requires a bit of "hackery" in the jobmanager-pbs script, but is doable. Both d0cabosg[1,2] would have jobmanager-d0cabsrv[1,2]; jobmanager-pbs would become a symlink to the appropriate jobmanager-d0cabsrv[1,2] (wiring sketched below). • This would be in addition to the work being performed by FermiGrid on high-availability gatekeepers. • Open up additional slots for opportunistic use on both clusters (CAB1 first). • Ideally make all PBS job slots available for opportunistic use by the Grid. • Research/develop an automatic "eviction" policy for PBS when a slot is needed by DZero (as with Condor). (Requires D0 CPB input, should go through QL. This is in-kind for D0 opportunistic use.) • Review Globus errors: Adam's text table of held reasons. • Conduct review(s) covering the following topics, and plan for future work: • Consider extra queues to segregate D0 MC production (J. Snow) from other VO opportunistic usage. • Consider adding extra roles in the dzero VO (such as /Role=mc or /Role=monte-carlo). • Investigate whether special-purpose d0farm nodes are still needed (FEF, FGS, Grid Dev). If so, do the coding that should have been done long ago to report them accurately. • Review the layout of D0 resources advertising to ReSS, to see whether it can be done in a more uniform way as opposed to the special-case hackery done for CAB now (related to the above 2 bullets). • Do we want to have VOMS:/dzero/users access rules? • Can d0cabsrv1 worker nodes be increased to 10 GB of scratch (like d0cabsrv2 worker nodes already have) instead of 4 GB? Irrelevant due to retirements? • A "higher RAM per node" pool for large-memory jobs? • Eliminate specialized queues in favor of priorities to allow greater CPU utilization. • Load-balancing/combining CAB1/CAB2 (avoid users manually load balancing).
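The jobmanager "hackery" described above amounts to installing both cluster-specific job manager scripts on each gatekeeper and pointing the generic jobmanager-pbs name at the right one. This is a sketch of that wiring only; the directory path and script names follow the naming in the slide but are assumptions, not verified against the actual gatekeeper layout.

```python
import os

# Hypothetical gatekeeper layout based on the naming used in the slide.
JM_DIR = "/usr/local/globus/jobmanagers"
cluster = "d0cabsrv1"          # which PBS server this gatekeeper should drive

target = os.path.join(JM_DIR, f"jobmanager-{cluster}")
link   = os.path.join(JM_DIR, "jobmanager-pbs")

# Replace the generic jobmanager-pbs with a symlink to the cluster-specific script.
if os.path.lexists(link):
    os.remove(link)
os.symlink(target, link)
print(f"{link} -> {target}")
```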

  16. D0 Grid Data Production 2.3 Operations-Driven Projects • 2.3.1 Monitoring • Workshop to assess what we all have now, where our gaps are, and what would be most cost-effective to address. • Can we "see" enough in real time? Collect what we all have, define requirements (within available resources), and execute. • Can we trace jobs "up" as well as down? Enhance the existing script to automate batch-job-to-grid-job "drill-up"? (Join sketched below.) • 2.3.2 Issues • Address the "stuck state" issue affecting both Data and MC Production – PM has some examples. Update? • Large-job issue from Mike? (RDK to research what this was. Was it 2+ GB memory use? If so, add memory to a few machines to create a small "BIG MEMORY" pool?) • 2.3.3 Features • Add state at the queuing node (from Phase 1 work). Waiting on Condor dev. PM following up on this; GG to try to push it. • FWD load balancing: Distribute jobs "evenly" across FWD nodes… however it is easiest to do or approximate. • 2.3.4 Processes • Enable REX/Ops to deploy new Condor. In Phase 2, but lower priority. Condor deployments for bug fixes coming up. • Revisit capacity/config twice a year? Continuous Service Improvement Plan – RDK, towards the end of Phase 2. • 2.3.5 Phase 1 Follow-up • Enable auto-update of gridmap files on queuing nodes. Enable monitoring on queuing nodes. AL: partly done. • Lessons learned from recent ops experience: <to be discussed> (RDK to revive the list, reconsider towards the end of Phase 2).
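The batch-to-grid "drill-up" mentioned above is essentially a join between two job-ID namespaces. Below is a minimal sketch of that join, assuming we can already produce one mapping file per layer (local batch ID → grid job ID, grid job ID → SAMGrid request); the file formats and field names are placeholders, not the existing script's actual inputs.

```python
import csv

def load_map(path, key_field, value_field):
    """Load a two-column CSV mapping (placeholder format for illustration)."""
    with open(path, newline="") as f:
        return {row[key_field]: row[value_field] for row in csv.DictReader(f)}

def drill_up(batch_id, batch_to_grid, grid_to_request):
    """Trace a local batch job 'up' to its grid job and SAMGrid request."""
    grid_id = batch_to_grid.get(batch_id)
    request = grid_to_request.get(grid_id) if grid_id else None
    return {"batch": batch_id, "grid": grid_id, "request": request}

# Hypothetical usage:
# b2g = load_map("pbs_to_grid.csv", "pbs_id", "grid_id")
# g2r = load_map("grid_to_request.csv", "grid_id", "request_id")
# print(drill_up("1234567.d0cabsrv2", b2g, g2r))
```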

  17. D0 Grid Data Production Phase 2 Deferred Work List (1): Do not do until proven necessary and worthwhile • Phase 1 Follow-up Work • Uniform OSes: Upgrade FWD1-3 and QUE1 to the latest SLF 4.0, same as FWD4-5. • Only a minor OS version difference. Wait until this is needed, to avoid another disruption and its risk. • 2.2.1 Decoupling • @ SAM station… no proven value after the Station bug fix? → DEFER • @ Durable Storage… does decoupling this have as much value now? → DEFER • Virtualization & FWD4 (FWD5 was used for decoupling; would like to retire it) → DEFER • Estimate for FEF readiness to deploy virtualization? JA: months away… (starting from 0, using Virtual Iron) • 2.2.2 Queue Optimization for Stability, High Utilization, Reduced Effort • A few long batch jobs holding grid job slots – not an issue since we split QUE/FWD → DEFER • 2.2.3 Resilience • Means to implement or emulate Durable Storage fail-over? (hardwired config in FWD nodes) → DEFER • 2.3.2 Issues • Context Server issues? Believe these are cured by the cron'd restart by RI. → DEFER

  18. D0 Grid Data Production Phase 2 Deferred Work List (2): Do not do until proven necessary and worthwhile • FGS Recommendations • Add slots for I/O-bound jobs on worker nodes. (Defer. Not as motivated now by the current job mix; may have been motivated by unusual jobs. Needs study.) • 1-2 additional slots per system to handle I/O. • Balance dzeropro job slots across both clusters. (Defer, reconsider with new WNs.) • Moving the corresponding worker nodes is probably best, since it would also balance job slots. • This mitigates what the next bullet would attempt to provide outright. Could be part of the CAB config review too. • Explore mechanisms for redundant PBS masters. (FEF. Long time scale?) • Would allow access to the worker nodes to continue even if the "primary" PBS master is down. • Monitor the SAMGrid forwarding nodes and look at the Globus errors going into CAB, to see if there are any patterns. • Consider extra queues to segregate D0 MC production (J. Snow) from other VO opportunistic usage. • Consider adding extra roles in the dzero VO (such as /Role=mc or /Role=monte-carlo, for instance). • Investigate whether special-purpose d0farm nodes are still needed (FEF, FGS, Grid Dev). • If so, do the coding that should have been done long ago to report them accurately. • Review the layout of D0 resources advertising to ReSS, to see whether it can be done in a more uniform way as opposed to the special-case hackery done for CAB now (related to the above 2 bullets). • Do we want to have VOMS:/dzero/users access rules? • Can d0cabsrv1 worker nodes be increased to 10 GB of scratch (like d0cabsrv2 worker nodes already have) instead of 4 GB?

  19. D0 Grid Data Production Data Flow for Data Production • [Diagram: numbered data flow (steps 0-7) among Enstore LTO4-G, the SAM Cache (d0srv071, d0srv072), worker-node scratch space, the Durable Store/Stager Space (d0srv063, d0srv065), and Enstore LTO4-F; paths include raw data, tarballs (initiated by the reco job), unmerged TMB, merged TMB, other data destined for tape storage (initiated by the merge job, via gridftp), and IN2P3 remote uploads.] • Also on cache nodes: 0-bias skim, LCG cache. • Durable Storage and Stager Space are on separate partitions. • Shared with analysis users. • No automated failover between '63, '65.
