
D0 Grid Data Production Initiative: Coordination Mtg 7

Version 1.0 (pre-meeting edition) 23 October 2008 Rob Kennedy and Adam Lyon Attending: … Unable to Attend: EB, KC, ST, CB. D0 Grid Data Production Initiative: Coordination Mtg 7. Outline. Summary and News Open Action Items Deployment “Feature List”: drives what is critical


Presentation Transcript


  1. D0 Grid Data Production Initiative: Coordination Mtg 7
     Version 1.0 (pre-meeting edition), 23 October 2008
     Rob Kennedy and Adam Lyon
     Attending: …
     Unable to Attend: EB, KC, ST, CB

  2. Outline
     • Summary and News
     • Open Action Items
     • Deployment “Feature List”: drives what is critical
     • Overview of Draft Baseline Schedule
     • Task Status (4 slides)
     • Metrics Summary Work

  3. Summary and News
     • Summary:
         • (details to follow… just to set the tone)
         • Roughly on time, on “budget”
     • News:
         • Alain Roy (VDT Team) delivered an official release EARLY with just our needed fix. THANKS!
         • We can roll this into the Nov ‘08 deployment 1 instead of waiting until the Dec ‘08 deployment 2

  4. Open Action Items (Green = effectively done, Yellow = added notes, Blue = coming week)
     • RDK: Baseline schedule with:
         • Resource names, Current status, Adjustments per feedback
         • Links to the related JIRA tickets (organized tasks to match, at higher level)
         • D0runjob upgrade task: need to understand what is involved: today?
         • Post schedule/plan and basic Gantt Chart to web-accessible area
             • This afternoon with posting of slides+notes, will send URLs.
     • AL/JA: Time Estimate for: 1.1.3.2 "Re-install OS on d0srv015, rename d0samgfwd5": Done
     • AL/JA: Time Estimate for: 1.1.7.2 Repurpose, OS Install on new SAM head node: Done
     • RDK: Add Status Overview to initial slide with 1 sentence summary (but we should avoid drilling down into it).

  5. Current Deployment “Feature” Lists
     • Deployment 1: Split Data/MC Production Services
         • Time frame: November 13-17, with 1 week+ observation before holidays
         • 1. Config: Basic Splitting of Fwd, Que Services between Data and MC Production, with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
         • 2. Fwd4 deployed (w/o virtualization)
         • 3. Fwd5 deployed
         • 4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
         • 5. New SAM Station (moved off of FWD1)
         • 6. Condor 7 via “new” 1.10.1 official release from UWisc
         • 7. FileMax increase on all Fwd nodes to handle large nJob actions
     • Deployment 2: Optimize Data and MC Production Configurations
         • Time frame: December 8-10, with 1 week+ observation before holidays
         • 1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
         • 2. D0Runjob Upgrade for Data Production (being conservative until better understood)
         • 3. New SAM-Grid Release with support for new Job status value at Queuing node

  6. Schedule v0.9.5 (Phase 1)
     [Gantt chart. Timeline markers: Today, Th-Day, Holiday. Rows: Fwd 4 Prep; Fwd 5 Prep; Que 2 Prep (status?); SAM’ Prep; Deploy 1; Deploy 2; VDT “new”; D0Runjob+Job Status Dev; Filemax; Metrics Summaries]

  7. Task Status (1 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
     • 1.1.1 Forwarding Node 4 (Fwd4) AL Wed 10/1/08 Mon 11/10/08 29d
         • 1.1.1.1 INPUT: Fwd4 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d
         • 1.1.1.2 Fwd4: Server Hardware OS Install AL FEF Wed 10/1/08 Thu 10/16/08 12d
         • 1.1.1.3 Fwd4: Server Hardware Burn-in AL FEF Fri 10/17/08 Fri 10/17/08 1d
         • 1.1.1.4 Fwd4: Verify Platform Installation AL JB Fri 10/17/08 Fri 10/17/08 1d
         • 1.1.1.5 Fwd4: Install VDT 1.10.1 "old"+patches AL JB Fri 10/17/08 Tue 10/21/08 3d
         • 1.1.1.6 Fwd4: Request and Install Grid Certs AL JB Tue 10/21/08 Wed 10/22/08 2d
         • 1.1.1.7 Fwd4: Install FWD-node-specific Components… AL JB Wed 10/22/08 Fri 10/24/08 3d
         • 1.1.1.8 Fwd4: Pre-Deployment As-Is Test AL JB Mon 10/27/08 Mon 11/3/08 6d
         • 1.1.1.9 Fwd4: Pre-Deployment FileMax=16k Test AL JB Tue 11/4/08 Mon 11/10/08 5d
         • 1.1.1.10 Milestone: Fwd4 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d
     • 1.1.2 Forwarding Node 5 (Fwd5) AL Tue 10/14/08 Mon 11/10/08 20d
         • 1.1.2.1 Fwd5: d0srv015 Request Platform Prep AL AL Tue 10/14/08 Tue 10/14/08 1d
         • 1.1.2.2 Fwd5: Re-install OS on d0srv015 AL FEF Fri 10/17/08 Mon 10/20/08 2d
         • 1.1.2.3 Fwd5: Verify Platform Installation AL JB Tue 10/21/08 Tue 10/21/08 1d
         • 1.1.2.4 Fwd5: Setup VDT 1.10.1 "old"+patches AL JB Tue 10/21/08 Thu 10/23/08 3d
         • 1.1.2.5 Fwd5: Request and Install Grid Certs AL JB Thu 10/23/08 Fri 10/24/08 2d
         • 1.1.2.6 Fwd5: Install FWD-node-specific Components… AL JB Fri 10/24/08 Tue 10/28/08 3d
         • 1.1.2.7 Fwd5: Pre-Deployment As-Is Test AL JB Wed 10/29/08 Mon 11/3/08 4d
         • 1.1.2.8 Fwd5: Pre-Deployment FileMax=16k Test AL JB Tue 11/4/08 Mon 11/10/08 5d
         • 1.1.2.9 Milestone: Fwd5 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d

  8. Task Status (2 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
     • 1.1.3 Queuing Node 2 (Que2) AL Wed 10/1/08 Mon 11/10/08 29d
         • 1.1.3.1 INPUT: Que2 Server Hardware On-site AL FEF Wed 10/1/08 Wed 10/1/08 0d
         • 1.1.3.2 Que2 Server Hardware OS Install AL FEF Wed 10/1/08 Mon 10/20/08 14d
         • 1.1.3.3 Que2 Server Hardware Burn-in AL FEF Tue 10/21/08 Tue 10/21/08 1d
         • 1.1.3.4 Que2: Verify Installation AL JB Tue 10/21/08 Wed 10/22/08 2d
         • 1.1.3.5 Que2: Setup VDT 1.10.1 "old"+patches AL JB Wed 10/22/08 Mon 10/27/08 4d
         • 1.1.3.6 Que2: Request and Install Grid Certs AL JB Tue 10/28/08 Wed 10/29/08 2d
         • 1.1.3.7 Que2: Install QUE-node-specific Components… AL JB Tue 10/28/08 Fri 10/31/08 4d
         • 1.1.3.8 Que2: Test w/1-QUE Client AL JB Mon 11/3/08 Wed 11/5/08 3d
         • 1.1.3.9 Que2: Integration Test w/2-QUE Client AL JB Thu 11/6/08 Mon 11/10/08 3d
         • 1.1.3.10 Que2: Jim_Client 2-QUE Support: Client Deploy AL DEV,JB Mon 11/10/08 Mon 11/10/08 1d
         • 1.1.3.11 Milestone: Que2 Ready to Deploy AL Mon 11/10/08 Mon 11/10/08 0d
     • 1.1.4 Jim_Client Development for 2 Queue Nodes Support GG Mon 11/3/08 Wed 11/5/08 3d
         • 1.1.4.1 Jim_Client: 2-QUE Node Support: Develop, Package GG ABa Mon 11/3/08 Tue 11/4/08 2d
         • 1.1.4.2 Jim_Client: 2-QUE Node Support: Test w/o Que2 GG ABa Wed 11/5/08 Wed 11/5/08 1d

  9. Task Status (3 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
     • 1.1.5 New Distinct Sam Station AL Wed 10/1/08 Fri 11/14/08 33d
         • 1.1.5.1 SAM Station: Identify Hardware For Role AL FEF Wed 10/1/08 Wed 10/15/08 11d
         • 1.1.5.2 SAM Station: Repurpose, OS Install AL FEF Thu 10/16/08 Fri 10/17/08 2d
         • 1.1.5.3 SAM Station: Verify Platform Installation AL AL Mon 10/20/08 Tue 10/21/08 2d
         • 1.1.5.4 SAM Station: Setup Station AL AL Thu 10/23/08 Wed 10/29/08 5d
         • 1.1.5.5 SAM Station: Pre-Deployment Test AL AL Thu 10/30/08 Wed 11/5/08 5d
         • 1.1.5.6 SAM Station: Deployment Plan AL AL Thu 11/6/08 Thu 11/6/08 1d
         • 1.1.5.7 Milestone: SAM Station Ready to Deploy AL Thu 11/6/08 Thu 11/6/08 0d
         • 1.1.5.8 SAM Station: Setup Context Server AL AL Thu 11/13/08 Fri 11/14/08 2d
     • 1.1.6 Deployment Stage 1 AL Mon 11/10/08 Tue 11/25/08 12d
         • 1.1.6.1 Deployment 1: Plan: Split Data/MC Production Services AL ALL Mon 11/10/08 Wed 11/12/08 3d
         • 1.1.6.2 Deployment 1: Execute AL REX Thu 11/13/08 Mon 11/17/08 3d
         • 1.1.6.3 Deployment 1: Monitor AL REX Tue 11/18/08 Mon 11/24/08 5d
         • 1.1.6.4 Deployment 1: Sign-off AL REX Tue 11/25/08 Tue 11/25/08 1d
         • 1.1.6.5 MILE 1: Deployment 1 Completed AL Tue 11/25/08 Tue 11/25/08 0d

  10. Task Status (4 of 4) (Red = a critical task chain, Green = effectively done, Yellow = added notes)
     • 1.3.1 SAM-Grid Job Status Info GG Mon 10/13/08 Tue 11/11/08 22d
         • 1.3.1.1 Use "Same" Proxy for Gridftps GG PM Wed 11/5/08 Fri 11/7/08 3d
         • 1.3.1.2 New Job Status Value at QUE Node GG PM Mon 10/13/08 Fri 11/7/08 18d
         • 1.3.1.3 SAM-Grid Release with Job Status Info feature GG PM Mon 11/10/08 Tue 11/11/08 2d
         • 1.3.1.4 Upgrade D0Runjob: Test, Make Workable AL MD Mon 10/27/08 Fri 11/7/08 10d
             • What does this require to be done?
         • 1.3.1.5 Milestone: SAM-Grid Release Deployable for Data Prod AL REX Tue 11/11/08 Tue 11/11/08 0d
     • 1.3.2 Slow Fwd-CAB Job Transition AL,GG Wed 10/1/08 Mon 11/17/08 34d
         • 1.3.2.1 Investigation and Recommendations GG PM Wed 10/1/08 Tue 10/14/08 10d
         • 1.3.2.2 LINK: Condor 7 Upgrade Fixes Deployed AL Mon 11/17/08 Mon 11/17/08 0d
         • 1.3.2.3 Increase FileMax value to 16k on FWD1-3 on the fly AL REX Tue 11/4/08 Tue 11/4/08 1d
         • 1.3.2.4 Add FileMax value change to FWD Install Document AL REX Wed 11/5/08 Wed 11/5/08 1d
         • 1.3.2.5 Milestone: FileMax Value Change Deployed AL Wed 11/5/08 Wed 11/5/08 0d
         • 1.3.2.6 Milestone: Known Palliatives for Slow Job Transitions Deployed AL Mon 11/17/08 Mon 11/17/08 0d
     • 1.3.3 Improved H/w Uptime AL Mon 10/13/08 Mon 10/13/08 1d
         • 1.3.3.1 Consider a FWD5: Full decoupling w/o virtualization, improved robustness to FWD node failures AL AL Mon 10/13/08 Mon 10/13/08 1d
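     Note on durations: the "d" values in the task status slides appear to count working days (Monday to Friday), inclusive of both the start and finish dates, with no adjustment for holidays. This is inferred from the listed numbers, not stated on the slides. Milestones and INPUT items are listed as 0d and are not covered by this rule. A minimal Python sketch of that reading, using a hypothetical helper name `working_days`:

```python
from datetime import date, timedelta


def working_days(start: date, finish: date) -> int:
    """Count Monday-Friday days from start to finish, both inclusive.

    Assumption: no holiday calendar is applied, which is what the
    durations on the slides seem to imply.
    """
    if finish < start:
        raise ValueError("finish precedes start")
    days = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days += 1
        current += timedelta(days=1)
    return days


# Checks against summary tasks listed on the slides.
print(working_days(date(2008, 10, 1), date(2008, 11, 10)))   # 1.1.1 Fwd4, listed as 29d
print(working_days(date(2008, 10, 14), date(2008, 11, 10)))  # 1.1.2 Fwd5, listed as 20d
print(working_days(date(2008, 10, 13), date(2008, 11, 11)))  # 1.3.1 Job Status Info, listed as 22d
```

     A check like this could be used to spot entries whose listed duration does not match their dates after a schedule adjustment.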

  11. Metrics Summaries
     • Work In Progress, Breaking Down Possible States Involved
     • We want to account for downtime: be fair, but from customer’s view
         • “Draining farm for new version of Reco” = customer driven, not D0 Grid
         • “Scheduled downtime” = charged to D0 Grid Service, discretionary (or not)
         • “SRM at Purdue jams up SAM station” = charged to D0 Grid Service
     • So far, it appears that MOST resource non-utilization in September 2008 was due to something other than slots being empty. Rework = wasted CPU cycles too.
     • Questions
         • CPU vs Slots Used: Do we have a ganglia summary for D0Farm?
         • Slots Used vs Events Produced: Do we count an event only once even if resubmitted N times?
             • This would mean rework leads to busy slots, busy CPUs, and NO additional events/day
         • Rework Causes – break down observed causes… will arrange outside of this meeting
             • Complete September plot for nSubmissions per dataset?
     • Stepping Back: How well are the original 16+4 issues covered in Phase 1?
         • (next week)
     <end>
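     The "Slots Used vs Events Produced" question can be made concrete with a small sketch. It assumes that each event counts once toward production no matter how many times its job was submitted, and that every submission reprocesses the job's full set of events (a simplification). The record layout, the field names and the numbers below are hypothetical and are not taken from the September 2008 data.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class JobRecord:
    dataset: str       # hypothetical dataset label
    events: int        # events in the job's final output
    submissions: int   # times the job was submitted (1 = no rework)


def summarize(records: Iterable[JobRecord]) -> Tuple[int, int, float]:
    """Return (events produced, events processed, rework fraction).

    Produced counts each event once; processed counts it once per
    submission, so the difference is work that kept slots busy
    without adding events/day.
    """
    records = list(records)
    produced = sum(r.events for r in records)
    processed = sum(r.events * r.submissions for r in records)
    rework = (processed - produced) / processed if processed else 0.0
    return produced, processed, rework


# Illustrative numbers only.
sample = [JobRecord("A", 1000, 1), JobRecord("B", 1000, 3)]
produced, processed, rework = summarize(sample)
print(produced, processed, rework)
```

     Under these assumptions, a per-dataset submission count (the "nSubmissions per dataset" plot) is the input needed to estimate how much slot usage went to rework.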
