
Tier-1 – Final preparations for data


Presentation Transcript


  1. Tier-1 – Final preparations for data Andrew Sansum 9th September 2009

  2. Themes (last 9 months) • Improve planning • Recruitment • Re-engineer production and operations processes • Enhance resilience • Test it works (STEP09) • Move to R89 • “Test” new Disaster Management System • Final preparations for data taking

  3. The Plan [Timeline chart, April–November: prepare for STEP, prepare for R89, prepare for data taking, with update/freeze periods and contingency. Milestones: SRM + nameserver, SL5 upgrade, CASTOR upgrade, STEP, LFC/FTS/3D, R89 migration, test disaster management system, new hardware, CASTOR hardware resilience.]

  4. Recruitment complete • Recruitment has been tough (but a good team is in place now) • Initially the STFC recruitment freeze • Later, simply hard to recruit

  5. Meeting Experiment Needs • VO survey carried out in April • Based on a series of qualitative and quantitative questions • Very helpful and considered feedback from the most significant VOs • Generally very positive. Key findings: • Communication between Tier-1 and VOs generally working well • Production team have made a big difference • Meeting commitments/expectations of the LHC VOs • VOs not always clear on Tier-1 priorities (since tried to address this via a liaison meeting) • Non-LHC VOs particularly commented that although support was good, the Tier-1 did not always deliver service on agreed timescales (unfortunately intentional, reflecting priorities – expectations management?) • Documentation poor (still need to work on this)

  6. Production Team/Production Ops • Daytime team of 3 staff (Gareth Smith, John Kelly, Tiju Idiculla): • Handle operational exceptions (NAGIOS alerts/pager callouts) • Track tickets • Monitor routine metrics, loads, network rates • Ensure operational status is communicated to VOs • Represent the Tier-1 at WLCG daily operations • Oversee downtime planning, agree near-term downtime plan • Oversee progression of Service Incident reports • (Re-)engineer operational processes • Night-time/weekend team of 5 staff on call at any time (2-hour response): • Primary on-call (triage and fix easy faults) • Secondary on-call: CASTOR, Grid, Fabric, Database

  7. Callout rate • Big improvement over 2009 – recent deterioration owing to development activity and major incidents

  8. Process Improvement • Service is complex • Frequent routine interventions, e.g.: • Add disk servers to a class • Take disk servers offline • Mistakes occur if not engineered out (see the sketch below) • Work in progress but critical if we are to meet high expectations
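
As an illustration of the kind of engineering meant here, a minimal sketch (hypothetical states and server name, not the actual Tier-1/CASTOR tooling) that refuses disallowed disk-server state transitions instead of relying on the operator to remember the procedure:

```python
# Minimal sketch of "engineering out" mistakes in routine disk-server
# interventions. States, transitions and the server name are illustrative
# only and do not reflect the real Tier-1/CASTOR tools.

ALLOWED_TRANSITIONS = {
    "production": {"draining"},
    "draining":   {"offline"},
    "offline":    {"under_test"},
    "under_test": {"production", "offline"},   # must pass checks before production
}

def change_state(server, current, target):
    """Refuse any transition that is not explicitly allowed."""
    allowed = ALLOWED_TRANSITIONS.get(current, set())
    if target not in allowed:
        raise ValueError(f"{server}: illegal transition {current} -> {target}; "
                         f"allowed: {sorted(allowed)}")
    print(f"{server}: {current} -> {target}")
    return target

# Example: a server cannot jump straight from 'offline' back to 'production'.
state = "offline"
state = change_state("gdss123", state, "under_test")
state = change_state("gdss123", state, "production")
```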

  9. CASTOR (I) • Process of gradual improvement, tracking down causes of individual transfer failures and improving processes (e.g. disk server intervention status) • Applied ORACLE patch to fix the Big ID bug • Series of CASTOR minor version upgrades to 2.1.7-27. These have predominantly been bug fixes, including one workaround to prevent the ORACLE Crosstalk bug from recurring • Reconfiguration of the internal LSF scheduler to improve stability and scalability (move from NFS to HTTP) • Tuning changes • ORACLE migration to new hardware (two EMC RAID arrays), which provides additional resilience, improved performance and better maintenance • SRM upgrades to version 2.7.15

  10. CASTOR: Downtime (2008–2009) [Chart of CASTOR downtime over 2008–2009, annotated with the 2.1.7 upgrade and the R89 move.]

  11. CASTOR (III): Plans • September • Nameserver upgraded to 2.1.8 • SRM upgrade to version 2.8 • CIP upgrade to version 2 (in progress) • 2009Q4 • Optimising the ORACLE database • Additional resilience • Disaster recovery testing

  12. STEP09: Operations Overview • Generally very smooth operation: • Most service systems relatively unloaded, plenty of spare capacity • Calm atmosphere • Daytime “production team” monitored the service • Only one callout • Most of the team even took two days off site for a department meeting! • Very good liaison with VOs and a good idea of what was going on • In regular informal contact with UK representatives • Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all handled satisfactorily and fixed. Did not visibly impact experiments. • Robot broke down for several hours (stuck handbot led to all drives being de-configured in CASTOR). Caught up quickly. • Very useful exercise – learned a lot, and very reassuring • More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

  13. STEP09: Batch Service • Farm typically running > 2000 jobs. By 9th June at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%) • Problem 1: ATLAS job submission exceeded 32K files on a CE • See the hole on the 9th. We thought ATLAS had paused, so it took time to spot. • Problem 2: Fair shares not honoured, as aggressive ALICE submission beat ATLAS to job starts • Need more ATLAS jobs in the queue faster. Manually capped ALICE. Fixed by 9th June. See the decrease in (red) ALICE work. • Problem 3: Occupancy initially poor (around 90%). Short on memory (2GB/core but ATLAS jobs needed 3GB vmem). Gradually increased the MAUI over-commit on memory to 50%. Occupancy rose to 98% (see the arithmetic sketch below).
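
A back-of-the-envelope check on Problem 3; the numbers are taken from the slide, the arithmetic is just illustrative:

```python
# Rough arithmetic behind Problem 3 (figures from the slide).
phys_mem_per_core_gb = 2.0      # installed RAM per core
atlas_vmem_gb        = 3.0      # virtual memory needed by ATLAS jobs
overcommit_factor    = 1.5      # 50% over-commit on memory in the scheduler

schedulable_mem_gb = phys_mem_per_core_gb * overcommit_factor
print(f"Schedulable memory per core: {schedulable_mem_gb:.1f} GB")   # 3.0 GB

# With 3 GB of schedulable memory per core, a 3 GB-vmem ATLAS job now fits on
# a single core, consistent with occupancy rising from ~90% to ~98%.
assert schedulable_mem_gb >= atlas_vmem_gb
```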

  14. STEP09: Network • Batch farm drawing approx. 3Gb/s from CASTOR during reprocessing; peaked at 30Gb/s for CMS reprocessing without lazy download • Total OPN traffic: inbound 3.5Gb/s, outbound 1Gb/s • RAL->Tier-2 outbound rate averaged 1.5Gb/s, but with 6Gb/s spikes!

  15. STEP09: Tape • Tape system worked well. Sustained 4Gb/s during peak load on 13 drives (ATLAS+CMS), 15 drives with LHCb. We played with a mix of dedicated and shared drives (4 ATLAS, 4 CMS, 2 LHCb, 5 shared). • Typical average rate of 35MB/s per drive (1-day average) • Lower than we would like (looking for nearer 45MB/s) • On the CMS instance, a modified write policy gave > 60MB/s, but reads are more challenging to optimise (see the conversion sketch below)
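
The per-drive figure follows from the aggregate rate quoted above; a minimal sketch of the unit conversion, using only the numbers on this slide:

```python
# Aggregate-to-per-drive conversion for the STEP09 tape figures on this slide.
aggregate_gbit_per_s = 4.0      # sustained rate during peak load
drives               = 13       # ATLAS + CMS drives in use at the peak

aggregate_mbyte_per_s = aggregate_gbit_per_s * 1000 / 8      # Gb/s -> MB/s
per_drive = aggregate_mbyte_per_s / drives
print(f"{per_drive:.0f} MB/s per drive at peak")             # ~38 MB/s

# Close to the quoted ~35 MB/s one-day average and short of the ~45 MB/s target.
```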

  16. R89: Migration • Migration planning started early 2008 (building early 2006) • Detailed equipment documentation, together with a requirements document, was sent to vendors during September 2008 • Workshop hosted during November. Vendors committed to 3 racks (each) per day (we believe 5-6 was feasible) • Orders placed at the end of November to move 77 racks of equipment (and the robot) to an agreed schedule (T1 = 43 racks) • Started 22nd July and ended 6th August • Completed to schedule

  17. R89 Migration • 43 racks moved [Timeline chart of the move, mid-July to 6th August, with entries for: drain CEs, drain FTS, drain WMS, CASTOR core + disk, critical services, batch workers start, batch workers complete, disk complete, CASTOR restarting, restart.]

  18. Disasters: Swine Flu • First test of the new disaster management system • Easy to handle – trivial to generate a contingency plan based on an existing template • Situation regularly assessed. Tier-1 response initially running ahead of RAL site planning. • Reached level 2 in the DMS, with assessment meetings every 2 weeks. Work mainly on remote working and a communication strategy. • Now downgraded to level 1 until a significant rise in case frequency • Expect to dust it off again before Christmas

  19. Disasters: Air-conditioning (I) • Two cooling failures in 3 days • Monday (daytime): both chiller systems shut down, restarted quickly • Tuesday: one chiller shut down and failed over to the second chiller • Wednesday night: both chillers shut down and could not be restarted • After the third event, decided not to restart [Temperature chart: cold-aisle and hot-aisle temperatures (roughly 15–45°C) through the shutdown, the Tier-1 room reaching equilibrium, and the chiller restart.]

  20. Disasters: Air-conditioning (II) • Initial post-mortem started after the first (daytime) event • Thermal monitoring, callout and automated shutdown in R89 not fully implemented/working correctly • Urgent remedial work underway • The second, night-time incident raised further concerns • Tier-1 called out and rapidly escalated • But automated shutdown still in test mode • Forced to do a manual shutdown • Operations thermal callout failed to work as required • Site security did not escalate the BMS alarm (not an expected alarm) • Escalation to building services very slow (owing to R89 still being under warranty/acceptance) • Chillers could not be restarted • No explanation of the cause of the outage • Concluded we would not restart the Tier-1 until issues resolved

  21. Disasters: Air-conditioning (III) • Critical services continued to run: • Separate, redundant cooling system in the UPS room • Tape robotics and CASTOR core OK too (low-temperature room) • By Friday: • Tier-1 response at disaster level 3 (meeting held with VOs and PMB) • Building services believed that cooling was stable and the fault could not recur • All necessary automation, callout and escalation processes in place • Nevertheless, the Tier-1 team was not prepared to run hardware unattended over the weekend • On Monday: • Full service restart • Plan to baby-sit the service during Monday/Tuesday evenings • Forensics and post-mortem continued

  22. Disasters: Air-conditioning (IV) • Monday 10th incident believed to be caused by a planned reboot of the Building Management System (BMS) • Caused pumps to stop • Low pressure caused chiller valves to close • BMS returned but the system deadlocked • Tuesday 11th – single chiller trip followed by failover • Logs do not allow diagnosis • Wednesday 12th – BMS detected overpressure in the cooling system and triggered a shutdown • Probably a genuine overpressure (1.9 bar) • Setting (1.7 bar) considered to be too low • Now raised to 2.5 bar and only triggers a callout • System tested to 6 bar • Investigations continue

  23. Disasters: Water Leak • Water found dripping on the tape robot! • An “I don’t believe this is happening” moment • Should not be able to happen, as there are no planned water supplies above the machine room • “Fortunately” the Tier-1 was already shut down, so the robot was turned off too • STK engineer investigated and concluded that the damage is mainly superficial splash damage: drive heads not contaminated, tapes (60 splashed) probably OK • Indications that this had been occurring occasionally for several weeks

  24. Disasters: Water leak • Cause: condensation from the 1st-floor cooling system • Incorrect damper setting (air intake) led to excess condensation • Condensation collected in a “drip tray” and was pumped away • Tray too small and pump inadequate • Water overflowed the tray and tracked along the floor to a hole • Remedy: • Place an umbrella over the robot • Chillers switched off – 1st floor inspected daily! • Planning underway to re-engineer drip trays/pumps, alarms, etc. • Monitor tape error rate

  25. Procurements • Disk, CPU and robotics procurements delayed from January/February delivery dates • New SL8500 tape robot entirely for GridPP, 2PB of disk – 24 drive units (50% Areca/WD, 50% 3Ware/Seagate), and CPU capacity • Eventually delivered in May, but entangled in the R89 migration • New robot in production in July • CPU completed acceptance testing and is being deployed into SL5 • One lot of disk (1PB) ready for deployment • Second lot failed acceptance (many drive ejects) • Positive aspects of the acceptance failure: • The two-lot risk-avoidance strategy worked • The vendor’s 1-week load test failed to find the fault • Our 28-day acceptance caught the fault before the kit reached production

  26. LFC, FTS and 3D • Now complete: • Upgrade of back-end RAID arrays and Oracle servers • Replacement of elderly RAID arrays with a pair of new EMC RAID arrays • Better support (we hope) • Better performance • Move to ORACLE RAC for LFC/FTS (increased resilience) • Separate ATLAS LFC from the general LFC • Upgrade of 3D servers and move to the new RAID arrays • Work commenced on testing replication of the LFC for disaster contingency

  27. Quattor – Story so Far • Began work in earnest in June 2009 • Set up a Quattor Working Group (QWG) instance to manage deployment and configuration of new hardware • Leverages strong QWG support for gLite • Have the SL5 torque/maui server under Quattor control • Are (as of today) deploying 220+ new WNs in the SL5 batch service • Significant work to get up and running. A new way of working. • Have uncovered and helped fix a number of bugs and issues in the process

  28. Quattor – Next Steps • As we move existing WNs to SL5 (need 75% of our capacity in SL5) we will quattorise them • Move CEs and other grid service nodes to Quattor • Gradually migrate non-grid services to Quattor control • AQUILON: • Database back end to Quattor, developed by Morgan Stanley • Improves scalability and manageability (MS are managing >15,000 nodes) • Will first deploy at RAL • Then plan to make Aquilon usable by other grid sites as well

  29. Dashboard • Available at http://www.gridpp.rl.ac.uk/status • Constantly evolving • Components can be added/updated/removed • Present components • SAM Tests • Latest test results for critical services • Locally cached for 10 minutes to reduce load • Downtimes • Ongoing and upcoming downtimes pulled from GOCDB • Red colour for OUTAGE and yellow for AT_RISK • Notices • Latest information on Tier 1 operations • Only Tier 1 staff can post • Ganglia plots of key components from the Tier1 farm • Feedback welcome
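
A minimal sketch of the caching and colour-coding behaviour described above, assuming a simple in-process cache; the function names and the fetch step are hypothetical and not taken from the real dashboard code:

```python
import time

CACHE_TTL = 10 * 60          # SAM results are cached locally for 10 minutes
_cache = {}                  # component -> (timestamp, payload)

def cached(component, fetch):
    """Return cached data for a component, refreshing only after the TTL expires."""
    ts, payload = _cache.get(component, (0.0, None))
    if time.time() - ts > CACHE_TTL:
        payload = fetch()                      # e.g. query SAM / GOCDB (not shown here)
        _cache[component] = (time.time(), payload)
    return payload

def downtime_colour(severity):
    """Colour convention from the slide: red for OUTAGE, yellow for AT_RISK."""
    return {"OUTAGE": "red", "AT_RISK": "yellow"}.get(severity, "green")

# Hypothetical usage: a real fetch() would call the SAM/GOCDB interfaces.
results = cached("sam_tests", fetch=lambda: {"CE": "ok", "SRM": "ok"})
print(results, downtime_colour("AT_RISK"))
```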

  30. SL5 Migration (I) • Next week – 14th-18th September! • LHC only (for now) – but all VOs affected • New batch service – lcgbatch01 • Quattorised torque/maui server • Quattorised worker nodes • New LCG-CEs (6-8) for the LHC VOs – old LHC CEs (3-5) being retired, other CEs reconfigured • Same queue configuration • Use a submit filter script on the CEs to add the SLX property requirement as required (see the sketch below)
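
A minimal sketch of what such a submit filter could look like, assuming the standard Torque arrangement of piping the job script through the filter on stdin; the "sl5" property name and the rewrite rule are illustrative assumptions, not the actual Tier-1 filter:

```python
#!/usr/bin/env python
# Sketch of a qsub submit filter in the spirit of this slide: qsub pipes the
# job script through the filter on stdin, and the filter writes the (possibly
# modified) script to stdout. Appending the "sl5" node property to the nodes
# request is an illustrative assumption, not the Tier-1's actual filter logic.
import re
import sys

OS_PROPERTY = "sl5"   # hypothetical worker-node property for the SL5 farm

for line in sys.stdin:
    # Append the OS property to any "#PBS -l nodes=..." request that lacks it.
    if line.startswith("#PBS") and "-l nodes=" in line and OS_PROPERTY not in line:
        line = re.sub(r"(-l nodes=\S+)", r"\1:" + OS_PROPERTY, line.rstrip()) + "\n"
    sys.stdout.write(line)
```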

  31. SL5 Migration (II) • CPU08 going straight into SL5 now (~1800 job slots) • All 64-bit-capable existing WNs will eventually be reinstalled • Non-LHC VOs will get a new CE for migration after the dust settles • No plan to retire SL4 WNs completely yet

  32. October Freeze • No planned upgrades beyond September, except possibly a network upgrade • Recognise that some change will have to take place • Need to put in place a lightweight change-control process • Allow changes where benefit outweighs risk • Expect increased stability as downtimes reduce • Apply pressure once more to reduce low-grade failures

  33. Conclusion • Recent staff additions have had a huge impact on the quality of service we operate • Tier-1 development plan for 2009 nearly complete • Positive feedback from STEP09 that the service meets requirements • Still a few major items (like SL5) to get through (fingers crossed) • Probably still some R89 surprises in the pipeline • Looking forward to the start of data taking
