
D0 Grid Data Production: Evaluation Mtg 3






Presentation Transcript


  1. D0 Grid Data Production: Evaluation Mtg 3 • Version 1.0, 18 September 2008 • Rob Kennedy and Adam Lyon • (Attendance will be impacted by major IT downtime today)

  2. D0 Grid Data Production Outline • Server Expansion and Decoupling Data/MC • And other topics involving FEF • Metrics • What we have identified so far, what we want • Small Quick Wins – information gathering • SAM-Grid Status Values • Next candidate topic for discussion? • Rough Plan (due by end of month) • Work plan to pursue 4 task chains • Estimate of achievable goal levels in metrics.

  3. D0 Grid Data Production Priorities (1 of 4) • 1. Server Expansion and Decoupling of Data/MC Prod at Services • Breakdown into Major Activities • A. Install Forwarding Node 4, Queuing Node 2: from procurement arrival to production service • B. Update Grid Production system configuration to decouple Data/MC Production services • C. SAM Station moved from FWD1 to distinct hardware (one station each for Data and MC Production) • A. Install Fwd4, Que2 – Rough Timeline • Nodes arrive between the last week of September and the first week of October • Time from arrival of “raw” hardware to “OS setup done”: estimate? • Past experience: 20-30 days (4-6 weeks) to deploy a new “OS setup done” Fwd or Que node into production service. Will these two be treated in series or in parallel… staffing, percent on tasks? • Should be able to reduce this time; it is basically a cloning operation. Possibly coupled to the VDT upgrade task chain. • Support, Location Issues • Transfer support: Queue node support from FGS to FEF • Transfer support: CAB head nodes from FEF to FGS • Location (and subnet) in FCC1, FCC2, or GCC? Not a 24x7 service, but critical servers for the service.

  4. D0 Grid Data Production Priorities (1b of 4) • 1. Server Expansion and Decoupling of Data/MC Prod at Services (cont.) • B. Config: optimal service configuration for Data and MC Production with decoupling • Rough timeline = ? A meeting + test deployment + production deployment? • C. SAM Station moved from FWD1 to distinct hardware (one station each for Data and MC Production) • Rough timeline = ? • Related topics • Virtualization may be a bigger project than staff/time permit for Phase 1? • Follow-up issue: sufficient spares to rapidly replace a misbehaving server node? Deploying more servers… can we sustain this configuration over time?

  5. D0 Grid Data Production Priorities (2 of 4) • 2. Condor 7 Upgrade and Support • (placeholder: not on agenda this week) • (note VDT packaging dependency)

  6. D0 Grid Data Production Priorities (3 of 4) Metrics • Resource Utilization: We want to maximize resource utilization, mitigate SPOF, understand system capacity • % Job slots occupied from point of view of job queuing system • FermiGrid plots and summary numbers • CAB plots • Adam’s annotated plot reported to CD Ops weekly • % CPU used for assigned job slots – includes some wait time for DH, DB access • … also in above? … • MEvts/day successfully processed • Historical: in CD Ops reports for REX Dept. --- Source = ? • Effort to Run Data Production: Reduce effort to operate • Mean time between touches: how often does coordinator have to interact with system just to keep job queues full • Hours spent per week: Launching jobs, working on error recovery, debugging jobs, etc. ESTIMATE.... • Challenge: define something meaningful that CAN be measured with little added effort and which still tracks this issue. • Metrics to Quantify Data Production Service Quality: Quantify Rework, “Data Sample” Defect Risk • First-pass success rate: Fraction of jobs succeeding on first try • May or may not be able to disentangle user executable failure from this • N-pass success rate: Fraction of jobs succeeding after N retries – do jobs eventually succeed? • Mean tries to success: Average number of tries until job succeeds – quantifies “rework” effort • (Mike has/will have his #s on web, will show and discuss next week.) • Metrics: Decision – can we complete this (or similar) set of “automated” metrics by end of 2008?
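As a rough illustration of how the service-quality metrics above could be automated, the sketch below assumes each production job can be summarized as a (tries, succeeded) pair; the record format is hypothetical, and the real inputs would have to come from SAM-Grid or queuing-system logs.

```python
# Minimal sketch of the proposed service-quality metrics, assuming each
# production job is summarized as (tries, succeeded). The record format is
# hypothetical; real numbers would come from SAM-Grid / queue logs.

def service_quality_metrics(jobs):
    """jobs: iterable of (tries, succeeded) tuples, one per job."""
    jobs = list(jobs)
    total = len(jobs)
    if total == 0:
        return {}
    first_pass = sum(1 for tries, ok in jobs if ok and tries == 1)
    eventually = [(tries, ok) for tries, ok in jobs if ok]
    return {
        "first_pass_success_rate": first_pass / float(total),
        "n_pass_success_rate": len(eventually) / float(total),
        "mean_tries_to_success": (
            sum(tries for tries, _ in eventually) / float(len(eventually))
            if eventually else None
        ),
    }

# Example: 3 jobs succeed on the first try, 1 needs 3 tries, 1 never succeeds.
print(service_quality_metrics([(1, True), (1, True), (1, True), (3, True), (5, False)]))
```

For the example input this reports a first-pass rate of 0.6, an N-pass rate of 0.8, and a mean of 1.5 tries per successful job.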

  7. D0 Grid Data Production Priorities (4 of 4) • 4. Small Quick Wins: Scale = 1-2 FTE-weeks from 1-2 people • a. Reliable status info returned by SAM-Grid: subject of meeting Monday • Some situations lead to a job being labeled at the queuing node as “Complete” when in fact some parts of the job are still running or have failed. • Resolution involves a change to the gridftp cert used plus a change to state aggregation in SAM-Grid • Rough evaluation: 1-2 FTE-weeks (SAM-Grid Dev) to address. • Positive reception by all parties, no “decision” made. This depends on the total work and consulting being requested of Dev. • b. Grid → Batch JobID tracking – try out GG's procedure. May be no added effort required in Phase 1. • Root access on CAB is not required to use the procedure, but user access to CAB is required. • c. Auto-restart on reboot (or make as restartable as possible). Integrating D0Farm occupancy into ops reporting: Robert I. is working on this now. • d. Why are nodes rebooting? Can we reduce/mitigate downtime due to hardware failure with altered processes/procedures? • Related issues: depth of spare pool? Fast procedure to quickly swap out a flaky/degraded server? Virtualization? Aggressive replacement of servers if any repeated error is seen? Ability to requalify such pulled servers on a test stand? • Need to define this a little better as a package of work IMHO. • e. Slow FWD node to CAB job transition investigation – we can see when it is slow, but is the cause upstream or downstream? • f. Intelligent selection of Fwd nodes from a pool of candidates (taking into account which are busy), or some mechanism which avoids sending jobs to a “full” Fwd node while another Fwd node is available (see the selection sketch after this slide). • (new) Process concerns from Steve Timm – improve communications on follow-ups and recommendations. • 4. Small Quick Wins: Decision on task set • Given the rough benefit/cost of the above, are (a-d, process concern) the best feasible proposal to pursue for Phase 1? • This depends on the work plan that comes out of (d), so propose to have a meeting on (d) as soon as all parties are available. • Else, shall we discuss benefit/cost for (e) or (f) next week? I propose putting them off to Phase 2 (after reassessment).
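Item (f) amounts to a load-aware choice of forwarding node instead of a random one. The sketch below only illustrates such a policy: the per-node fields (current matches, CurMatch limit, slot counts) are assumptions standing in for whatever SAM-Grid actually exposes.

```python
# Hedged sketch of item (f): prefer a forwarding node that still has headroom,
# instead of assigning jobs at random. Field names are illustrative only.

def pick_forwarding_node(nodes):
    """nodes: list of dicts with name, cur_match, cur_match_max, slots_used, slots_max.
       Returns the least-loaded node with headroom, or None if all are full."""
    candidates = [
        n for n in nodes
        if n["cur_match"] < n["cur_match_max"] and n["slots_used"] < n["slots_max"]
    ]
    if not candidates:
        return None  # all forwarding nodes are at their limits
    # Least fractional slot occupancy first.
    return min(candidates, key=lambda n: n["slots_used"] / float(n["slots_max"]))

fwd_nodes = [
    {"name": "samgfwd01", "cur_match": 10, "cur_match_max": 10, "slots_used": 700, "slots_max": 750},
    {"name": "samgfwd02", "cur_match": 4,  "cur_match_max": 10, "slots_used": 300, "slots_max": 750},
]
chosen = pick_forwarding_node(fwd_nodes)
print(chosen["name"] if chosen else "no forwarding node available")  # -> samgfwd02
```

In this example samgfwd01 is at its CurMatch limit, so the job would go to samgfwd02 rather than being rejected or queued behind a full node.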

  8. D0 Grid Data Production Rough Plan • Outline work that could be done in the time frame • 1. Server Expansion and Decoupling Data/MC Production at Services • A. Install hardware and services on nodes • B. Configuration to optimize production, decouple Data and MC Production • C. SAM Station move, one each for Data and MC Production • 2. Condor 7.0 Upgrade and Support • 3. Metrics • 4. Small Quick Wins • Determine a feasible package of work in the time frame • Rough scheduling to check that this is feasible with available staff • Prefer to do fewer different task chains and finish earlier than to do all possible useful work. • Decision: Consensus • Agreement by all parties to a feasible package of work. • May include items “to be discussed further” if agreement is imprecise. • Present to CD Head, D0 Spokespeople “near end of September”, date yet to be set. The CD Budget Review next week may push this to the first week of October.

  9. D0 Grid Data Production Backup Slides • … not used in the meeting …

  10. D0 Grid Data Production Roadmap • September 2008: Planning (verbatim from last week) • Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead the effort to pursue this. • First stage is to list, understand, and prioritize the problems and the work in progress. • Next, develop a broad, coarse-grained plan to address issues and improve efficiency. • Present plan to Vicky and D0 Spokespeople towards the end of September 2008. • October 2008 – December 2008: Phase 1 (in detail in later slides) • 1. Server Expansion and Decoupling Data/MC Production at Services • 2. Condor 7.0 Upgrade and Support • 3. Metrics • 4. Small Quick Wins • Balance the benefits and costs to get the most improvement soonest without overwhelming disruption • EB: Note impact on FY09 plans with the sharp SAM-Grid dev ramp-down. GG: 20% to 10% to 5% (just consult). • January 2009: Re-Assess • 1. Formally re-assess D0 Grid Data Production • 2. Plan new work as needed… • January/February 2009 – April-ish 2009: Phase 2 … very rough picture • 1. Additional work on systems as needed… • 2. Once basic Data Production goals are achieved, though, recommend moving on to MC Production issues

  11. D0 Grid Data Production Priorities (2 of 4) • 2. Condor 7 Upgrade and Support • Issues: Eliminates (we are told) the Periodic Expression hangs • Communication: establish closer communications with REX/FGS/Condor Support • Good discussion (Steve, Keith, Rob, Adam) after the last FermiGrid Users Meeting (AL to post notes… URLs of some useful plots included). • Upgrade is a prerequisite for working closely with Condor developers on issues • Packaging: In the past, depended on VDT packaging by the Grid/OSG group. Major dependency. Avoidable? • Requires root to install? ST: Maybe not, but it is very hard to do and not recommended. • CDF Offline is also facing this upgrade; opportunity for leveraging? (not as much as hoped) • Standard install of VDT. Thin ups/upd packaging of Condor. • Not the same Condor use case as D0 efforts though. CDF uses fewer of its services. • SAM-Grid repackaging may be involved if the distribution mechanism is changed… complicated. • Is the cost of reducing this VDT ↔ SAM-Grid packaging coupling now worth the benefit of less-complicated upgrades and less SAM-Grid-specific VDT/Condor repackaging by developers later? • Consensus: Do recommend separate VDT and Condor installations to allow Condor to be upgraded independently and more quickly.

  12. D0 Grid Data Production Metrics List • Resource Utilization • Issue: We want to maximize resource utilization, mitigate SPOF, understand system capacity • % Job slots occupied from point of view of job queuing system • % CPU used for assigned job slots: includes data handling and DB access wait time outside of our scope. • MEvts/day successfully processed: another top-level metric. Compare to data-taking rate. • Effort to Run Data Production • Issue: Reduce effort to operate • Mean time between touches: how often does coordinator have to interact with system just to keep job queues full, assuming no unusual errors in the system. • Hours spent per week: Launching jobs, working on error recovery, debugging jobs, etc. ESTIMATE.... • Metrics to Quantify Data Production Service Quality (what can we measure?) • First-pass success rate: Fraction of jobs succeeding on first try • May or may not be able to disentangle user executable failure from this • N-pass success rate: Fraction of jobs succeeding after N retries – do jobs eventually succeed? • Mean tries to success: Average number of tries until job succeeds – quantifies “rework” effort
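Of the effort metrics above, “mean time between touches” is the least well defined. One low-effort option, sketched below, assumes the coordinator simply records a timestamp for each manual intervention and takes the mean gap between consecutive touches; no such log exists today, so this is only a possible definition.

```python
# Sketch of a possible "mean time between touches" definition: the mean gap
# between timestamps at which the coordinator had to intervene. Assumes a
# hypothetical list of intervention timestamps; not an existing SAM-Grid tool.
from datetime import datetime

def mean_hours_between_touches(timestamps):
    """timestamps: chronologically sorted datetimes, one per manual intervention."""
    if len(timestamps) < 2:
        return None
    gaps = [
        (later - earlier).total_seconds() / 3600.0
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    return sum(gaps) / len(gaps)

touches = [datetime(2008, 9, 15, 8, 0), datetime(2008, 9, 15, 20, 0), datetime(2008, 9, 16, 14, 0)]
print(mean_hours_between_touches(touches))  # gaps of 12 h and 18 h -> 15.0
```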

  13. D0 Grid Data Production Issues List (p.1/4) • 1) Unreliable state information returned by SAM-Grid: SAM-Grid under some circumstances does not return correct state information for jobs. Fixing this may entail adding some logic to SAM-Grid. • 2) Cleanup of globus-job-managers on forwarding nodes, a.k.a. “stale jobs”: The globus job managers on the forwarding nodes are sometimes left running long after the jobs have actually terminated. This eventually blocks new jobs from starting. • 3) Scriptrunner on samgrid needs to be controlled, a.k.a. the Periodic Expression problem: This is now locking us out of all operation for ~1 hour each day. This is due to a feature in Condor 6 which we do not use, but which cannot be fully disabled either. Developers say this is fixed in Condor 7, but this has not been proven yet. • 4) CORBA communication problems with SAM station: The actual source of all CORBA problems is hard to pin down, but at least some of them seem to be associated with heavy load on samgfwd01 where the SAM station runs. Since the forwarding nodes are prone to getting bogged down at times, the SAM station needs to be moved to a separate node. • 5) Intelligent job matching to forwarding nodes: SAM-Grid appears to assign jobs to the forwarding nodes at random without regard to the current load on the forwarding nodes. It will assign jobs to a forwarding node that has reached CurMatch max even if another forwarding node has job slots available.
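Issue 2 is essentially an orphan-process cleanup: globus-job-manager processes that outlive their jobs. The sketch below only illustrates the shape of such a check; how a process is mapped back to its grid job ID is left as an assumption (in practice it would have to come from the process arguments or the Globus job state files), and any actual cleanup should remain a report-and-review step.

```python
# Hedged sketch for issue 2: flag globus-job-manager processes whose associated
# grid job is no longer active. The pid -> job-ID mapping is an assumption.
import subprocess

def list_job_manager_pids():
    """PIDs of running globus-job-manager processes (empty list if none)."""
    try:
        out = subprocess.check_output(["pgrep", "-f", "globus-job-manager"],
                                      universal_newlines=True)
    except subprocess.CalledProcessError:
        return []          # pgrep exits non-zero when nothing matches
    return [int(line) for line in out.split()]

def stale_job_managers(active_job_ids, job_id_for_pid):
    """active_job_ids: set of grid job IDs still known to the queuing system.
       job_id_for_pid: callable mapping a pid to its grid job ID, or None if unknown.
       Returns (pid, job_id) pairs that look orphaned."""
    stale = []
    for pid in list_job_manager_pids():
        job_id = job_id_for_pid(pid)   # assumption: recovered from /proc cmdline or Globus state files
        if job_id is not None and job_id not in active_job_ids:
            stale.append((pid, job_id))
    return stale  # candidates to report (or terminate) after operator review
```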

  14. D0 Grid Data Production Issues List (p.2/4) • 6) Capacity of durable location servers: Merge jobs frequently fail due to delivery timeouts of the unmerged thumbnails. We need to examine carefully what functions the durable location servers are providing and limit activity here to production operations. Note that when we stop running Recocert as part of the merge, this problem will worsen. • 7) CurMatch limit on forwarding nodes: We need to increase this limit, which probably implies adding more forwarding nodes. We would also like to have MC and data production separated on different forwarding nodes so response is more predictable. • 8) Job slot limit on forwarding nodes: The current limit of 750 job slots handled by each forwarding node has to be increased. Ideally this would be large enough that one forwarding node going down only results in slower throughput to CAB rather than a complete cutoff of half the processing slots (see the capacity sketch after this slide). Could be addressed by optimizing fwd node config for data production. • 9) Input queues on CAB: We have to be able to fill the input queues on CAB to their capacity of ~1000 jobs. The configuration coupling between MC and data production that currently limits this to ~200 has to be removed. Could be addressed by optimizing fwd node config for data production.
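Issues 7-9 are largely capacity arithmetic: with the 750-slot per-node limit and the ~1000-job CAB input queue figure quoted above (used here purely as illustrative numbers), checking whether losing one forwarding node still leaves enough slots looks like the sketch below.

```python
# Sketch of the capacity reasoning behind issues 7-9: with N forwarding nodes of
# slots_per_node each, losing one node should still leave enough slots to keep
# the CAB input queues full. Numbers are taken from the slides as illustrations.

def survives_single_node_loss(n_fwd_nodes, slots_per_node, required_slots):
    remaining = (n_fwd_nodes - 1) * slots_per_node
    return remaining >= required_slots

# Two 750-slot forwarding nodes: losing one leaves 750 slots, short of ~1000.
print(survives_single_node_loss(2, 750, 1000))   # -> False
# Three 750-slot nodes: losing one leaves 1500 slots, comfortable headroom.
print(survives_single_node_loss(3, 750, 1000))   # -> True
```

Two 750-slot nodes cannot cover a ~1000-slot target with one node down, while three can, which is in line with the slide's point that a single node failure should only slow throughput rather than cut it off.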

  15. D0 Grid Data Production Issues List (p.3/4) • 10) 32,001 Directory problem: A band-aid is in place, but we should follow up with Condor developers to communicate the scaling issue of storing job state in a file system, given the need to retain job state for tens of thousands of jobs in a large production system. • 11) Spiral of Death problem: See for example reports from 19-21 July 2008. Rare, but stops all processing. We do not understand the underlying cause yet. The only known way to address this situation is to do a complete kill/cold-stop and restart of the system. • 12) Various Globus errors: We have repeated episodes where a significant number of jobs lose all state information and fall into a "Held" state due to various Globus errors. These errors are usually something like "Job state file doesn't exist", "Couldn't open std out or std err", "Unspecified job manager error". Mike doesn't think we have ever clearly identified the source of these errors. His guess is that they have a common cause. The above errors tend to occur in clusters (about half a dozen showed up last night; that's what brought it to mind). They usually don't result in the job failing, but such jobs have to be tracked by hand until complete, and in some cases all log information is lost. • 13) Automatic restart of services on reboot: Every node in the system (samgrid, samgfwd, d0cabosg, etc.) needs to be set up to automatically restart all necessary services on reboot (see the watchdog sketch after this slide). We have lost a lot of time when nodes reboot and services do not come back up. SAM people appear not to get any notification when some of these nodes reboot.
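Issue 13 would normally be handled with the nodes' own init scripts, but a cron-driven watchdog is another lightweight possibility. The sketch below is illustrative only; the service patterns and start commands are placeholders rather than the actual SAM/Globus commands on these nodes.

```python
# Hedged sketch for issue 13: a watchdog that could run from cron (including an
# @reboot entry) to ensure required services are up. Service names and start
# commands are placeholders; real nodes would use their actual init scripts.
import subprocess

SERVICES = {
    # pattern to look for in the process list : command that (re)starts the service
    "sam_station":       ["service", "sam_station", "start"],        # placeholder
    "globus-gatekeeper": ["service", "globus-gatekeeper", "start"],  # placeholder
}

def is_running(pattern):
    """True if any process command line matches the pattern."""
    return subprocess.call(["pgrep", "-f", pattern]) == 0

def ensure_services_running():
    for pattern, start_cmd in SERVICES.items():
        if not is_running(pattern):
            print("restarting: %s" % pattern)
            subprocess.call(start_cmd)   # should also log/notify operators

if __name__ == "__main__":
    ensure_services_running()
```

Run periodically, this would both bring services back after a reboot and catch services that die later; it could also send the operator notification that the slide notes is currently missing.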

  16. D0 Grid Data Production Issues List (p.4/4) • 14) SRM needs to be cleanly isolated from the rest of the operation: This might come about as a natural consequence of some of the other decoupling actions. A more global statement would be that we have to ensure that problems at remote sites cannot stop our local operations (especially if the problematic interaction has nothing to do with the data processing operation). • 15) Lack of transparency: No correlation between the distinct grid and PBS IDs and inadequate monitoring mean it is very difficult to track a single job through the entire grid system, which is especially important for debugging (see the ID-correlation sketch after this slide). • 16) Periods of slow Fwd node to CAB job transitions: newly added • MC-specific 1) File delivery bottlenecks: use of SRM at the site helps • MC-specific 2) Redundant SAM caches needed in the field • MC-specific 3) ReSS improvements needed, avoid problem sites, … • MC-specific 4) Get LCG forwarding nodes up and running reliably • MC Note) 80% efficiency on CAB, but 90% on CMS – why? Something about CAB/PBS to note here for data production expectations too?
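Issue 15 (and quick win (b) earlier) reduces to correlating the grid job ID with the local PBS ID so a single job can be traced end to end. The sketch below assumes a hypothetical forwarding-node log line that mentions both IDs; the real mapping would come from GG's procedure or from whatever the forwarding-node logs actually record.

```python
# Hedged sketch for issue 15: build a grid-ID -> PBS-ID lookup from forwarding
# node logs. The log line format used here is hypothetical; the real procedure
# would define where the two IDs actually appear.
import re

# Hypothetical line: "... submitted grid job <gridid> as pbs job <pbsid> ..."
LINE_RE = re.compile(r"grid job (\S+) as pbs job (\S+)")

def grid_to_pbs_map(log_lines):
    mapping = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            grid_id, pbs_id = m.groups()
            mapping[grid_id] = pbs_id
    return mapping

sample = ["2008-09-18 10:02:11 submitted grid job https://samgrid:9443/12345 as pbs job 67890.d0cabsrv1"]
print(grid_to_pbs_map(sample))  # -> {'https://samgrid:9443/12345': '67890.d0cabsrv1'}
```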
