
Deployment issues and SC3

This document discusses current deployment issues in GridPP, including gLite migration, dCache, data migration, security, Ganglia deployment, and use of the ticketing system. It also provides an update on deployment progress, lessons learned from LCG-2_4_0, the plan for the next release, and preparations for Service Challenge 3 (SC3).

Presentation Transcript


  1. Deployment issues and SC3 Jeremy Coles GridPP Tier-2 Board and Deployment Board Glasgow, 1st June 2005

  2. Current deployment issues Main GridPP concerns: • gLite migration, fabric management & future of YAIM • dCache • Data migration – classic SE to SRM SE • Security • Ganglia deployment • Use of ticketing system • Use of UK testzone General • Jobs at sites – improving (nb. Freedom of Choice is coming!) • Few general EGEE VOs supported at GridPP sites Deployment update

  3. 2nd LCG Operations Workshop • Took place in Bologna last week: http://infnforge.cnaf.infn.it/cdsagenda//fullAgenda.php?ida=a0517 • Covered the following areas: • Daily operations • Pre-production service • gLite deployment and migration • Future monitoring (metrics) • Interoperation with OSG • User support (Executive Support Committee!) • VO management processes • Fabric management • Accounting (DGAS and APEL) • Little on security! Romain presented potential tools. Deployment update

  4. LCG-2_4_0 – CPUs by middleware version: 2_4_0: 10642; 2_3_1: 912; 2_3_0: 2167. [Chart: deployment plan] Deployment update

  5. Version change in the last 100 days – [chart: all sites in LCG-2 by version; “Others” = sites on older versions or down] Deployment update

  6. [Charts: sites and status by region – Russia, Canada, Italy, Germany/Switzerland; regions with fewer than 5 sites are not shown] Deployment update

  7. [Charts: sites and status by region – France, South-West Europe, Northern Europe, Asia Pacific] Deployment update

  8. [Charts: sites and status by region – Central Europe, South-East Europe] Deployment update

  9. [Chart: sites and status for UKI] Deployment update

  10. LCG-2_4_0 Lessons learned: • Harder than expected (rate independent of packaging) • Differences between regions --> ROCs matter • Release definition is non-trivial with 3-month intervals • Component dependencies • X without Y and V is useless… • During certification we still find problems • Upgrade and installation from scratch both needed (time consuming) • Test pilots for deployment are useful • Early announcement of releases is useful • We need to introduce “updates” via APT to fix bugs that show up during deployment • Number of sites is the wrong metric to measure success • CPUs on the new release need to be tracked, not sites (see the sketch below) Deployment update
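
(Illustrative addition, not from the original slides.) A minimal sketch of the "count CPUs, not sites" point, using only the per-version CPU totals quoted on slide 4; everything else is for illustration.

```python
# Measure release uptake by published CPU capacity rather than by site count.
# The per-version totals below are those quoted on slide 4.
cpus_by_version = {
    "LCG-2_4_0": 10642,
    "LCG-2_3_1": 912,
    "LCG-2_3_0": 2167,
}

total_cpus = sum(cpus_by_version.values())
for version, cpus in cpus_by_version.items():
    share = 100.0 * cpus / total_cpus
    print(f"{version}: {cpus:6d} CPUs ({share:4.1f}% of published capacity)")
```

By this measure the bulk of the published capacity, not just a majority of sites, was already on 2_4_0 at the time of the talk.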

  11. The next release • Why? • SC3 is approaching and the needed components are not deployed at the sites • What? • File transfer service (will need VDT 1.2.2) • Servers for Tier-1 and Tier-0, clients for the rest • Improved monitoring sensors for GridFTP • RFC proxy extension for VOMS • New version of the GLUE schema (compatible) • LFC production service • Interoperability with GRID3/OSG • User-level stdio monitoring (maybe later) • Bug fixes … as always • When? • Aimed at mid-June • Who? • Tier-1 centers and Tier-2 centers participating in SC3 • As fast as possible • Others? • At their own pace • An updated release (fixes from the 1st release) is expected by July 1st. Deployment update
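
(Illustrative addition, not part of the talk.) One practical way to follow whether sites have picked up new components such as SRM SEs is to query the information system. A rough sketch, assuming a top-level BDII reachable over LDAP; the endpoint, base DN and GLUE attribute names are assumptions to adapt for your own infrastructure.

```python
# Poll a top-level BDII over LDAP and list the storage elements sites publish.
import ldap  # python-ldap

BDII_URI = "ldap://lcg-bdii.example.org:2170"  # hypothetical endpoint
BASE_DN = "o=grid"

conn = ldap.initialize(BDII_URI)
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                        "(objectClass=GlueSE)",
                        ["GlueSEUniqueID", "GlueSEName"])

for dn, attrs in results:
    unique_id = attrs.get("GlueSEUniqueID", [b"?"])[0].decode()
    name = attrs.get("GlueSEName", [b"?"])[0].decode()
    print(f"{unique_id}  ({name})")
```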

  12. Coexistence & Extended Pre-Production – [architecture diagram with LCG and gLite services side by side: VOMS, LFC, RB, gLite WLM, FIREMAN, myProxy, BD-II, APEL, DGAS, R-GMA, UIs, gLite-IO, LCG CE, gLite-CE, site WNs, FTS, SRM-SE. Annotations: catalogue and access control; independent IS, though the R-GMAs can be merged (security ON); CEs use the same batch system; FTS shared – FTS for LCG uses the user proxy, gLite uses a service cert; data from LCG is owned by VO and role, the gLite-IO service owns gLite data.] Deployment update

  13. Gradual Transition 1 – [architecture diagram: shared VOMS, myProxy, BD-II, APEL, R-GMA and UIs; LCG RB plus an optional additional gLite WLM; data management stays LCG (LFC); optional DGAS accounting; the LCG CE and gLite-CE at the site use the same batch system and WNs; FTS for LCG uses the user proxy, gLite uses a service cert; shared SRM-SE.] Deployment update

  14. Gradual Transition 2 – [architecture diagram: shared VOMS, myProxy, BD-II, DGAS, APEL, R-GMA and UIs; LCG WLM removed, gLite WLM in use; LFC with FIREMAN as an optional catalogue; R-GMA in gLite mode; the site runs the gLite-CE and WNs; FTS and SRM-SE.] Deployment update

  15. Gradual Transition 3 – [architecture diagram: as Transition 2, now adding gLite-IO as a second path to the data with an additional security model; data migration phase; data from LCG is owned by VO and role, the gLite-IO service owns gLite data; shared FTS and SRM-SE.] Deployment update

  16. Gradual Transition 4 – [architecture diagram: gLite WLM, FIREMAN, gLite-IO and gLite-CE in use; finalize the switch to the new security model; the LFC becomes a local catalogue under VO control; the BDII is later replaced by R-GMA; shared VOMS, myProxy, DGAS, APEL, UIs, FTS and SRM-SE.] Deployment update

  17. Metrics - EGEE • General agreement on the concept • Detailed discussions on: • time windows • sliding windows (week, month, 3 months) • quantities to watch (RCs, ROCs, CICs…) • ROCs based on RCs • CICs based on services • Release quality has to be measured • To make progress: a working group to define the quantities • Organized by: Ognjen Prnjat (oprnjat@admin.grnet.gr) • Small (~5): Ognjen, Markus, Helene, Jeff T. and Jeremy • Ognjen will collect input • ROCs, CICs and OMC have to agree on ONE set of quantities Deployment update
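
(Illustrative addition.) A minimal sketch of the sliding-window idea being discussed, computing availability over 7-, 30- and 90-day windows. The input format and sample data are invented; real figures would come from something like the Site Functional Tests.

```python
# Availability of a resource centre over sliding windows of different lengths,
# computed from (date, test passed?) pairs. Sample data is made up.
from datetime import date, timedelta

results = [(date(2005, 5, d), d % 4 != 0) for d in range(1, 31)]

def availability(results, window_days, today):
    cutoff = today - timedelta(days=window_days)
    in_window = [ok for day, ok in results if day > cutoff]
    return sum(in_window) / len(in_window) if in_window else None

today = date(2005, 5, 31)
for window in (7, 30, 90):
    avail = availability(results, window, today)
    if avail is None:
        print(f"{window:3d}-day window: no data")
    else:
        print(f"{window:3d}-day window: {avail:.0%} of tests passed")
```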

  18. Operations summary • CIC On Duty is now well established • COD is just 6 months old! • Tools have evolved at a dramatic pace • Portal, SFT, … • Many rapid iterations • Truly distributed effort • Integration of the new COD partner (Russia) went smoothly • Tuning of procedures is an ongoing process • No dramatic changes (take resource size more into account) Deployment update

  19. Accounting Last November this was still an area of concern • APEL is now well established • Support for batch systems is improving • Several privacy-related problems have been understood and solved • gLite accounting: DGAS • Some concerns about the amount of information published • Can this be handled by proper authorization? • Collaboration with APEL on batch sensors (BBQS, Condor, …) • DGAS agreed to provide them • Will be introduced initially on a voluntary basis • Sites will give feedback (including privacy issues) Deployment update
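
(Illustrative addition.) A very rough sketch of what a batch accounting sensor of the APEL/DGAS kind does: turn job-end records into per-user CPU time. The PBS-style record layout below is only approximate and invented for illustration; the real sensors handle the actual log formats.

```python
# Aggregate CPU time per user from PBS-style job-end ("E") accounting records.
from collections import defaultdict

sample_log = [
    # date;type;jobid;key=value pairs  (illustrative, not a real log)
    "28/05/2005 23:59:59;E;1234.ce.example.ac.uk;user=atlas001 "
    "queue=long resources_used.cput=01:30:00",
    "28/05/2005 23:59:59;E;1235.ce.example.ac.uk;user=lhcb002 "
    "queue=short resources_used.cput=00:20:00",
]

def hms_to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

cpu_seconds = defaultdict(int)
for line in sample_log:
    _, rec_type, _, payload = line.split(";", 3)
    if rec_type != "E":          # only completed jobs carry usage
        continue
    fields = dict(kv.split("=", 1) for kv in payload.split())
    cpu_seconds[fields["user"]] += hms_to_seconds(fields["resources_used.cput"])

for user, secs in cpu_seconds.items():
    print(f"{user}: {secs / 3600:.2f} CPU hours")
```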

  20. Current deployment issues (recap) Main GridPP concerns: • gLite migration, fabric management & future of YAIM • dCache • Data migration – classic SE to SRM SE • Security • Ganglia deployment • Use of ticketing system • Use of UK testzone General • Jobs at sites – improving (nb. Freedom of Choice is coming!) • Few general EGEE VOs supported at GridPP sites Deployment update

  21. Freedom of choice - VO Page Deployment update

  22. Service Challenge 3 Deployment update

  23. SC timelines – [timeline chart, 2005–2008: SC2, SC3, SC4, LHC service operation; first physics / cosmics, first beams, full physics run] Milestones: June 05 – Technical Design Report; Sep 05 – SC3 service phase; May 06 – SC4 service phase; Sep 06 – initial LHC service in stable operation; Apr 07 – LHC service commissioned. SC2 – reliable data transfer (disk-network-disk), 5 Tier-1s, aggregate 500 MB/sec sustained at CERN. SC3 – reliable base service, most Tier-1s, some Tier-2s, basic experiment software chain, grid data throughput 500 MB/sec including mass storage (~25% of the nominal final throughput for the proton period). SC4 – all Tier-1s, major Tier-2s, capable of supporting the full experiment software chain incl. analysis, sustain nominal final grid data throughput. LHC Service in Operation – September 2006, ramp up to full operational capacity by April 2007, capable of handling twice the nominal data throughput. Deployment update

  24. Service Challenge 3 - Phases High level view: • Throughput phase • 2 weeks sustained in July 2005 • “Obvious target” – GDB of July 20th • Primary goals: • 150MB/s disk – disk to Tier1s; • 60MB/s disk (T0) – tape (T1s) • Secondary goals: • Include a few named T2 sites (T2 -> T1 transfers) • Encourage remaining T1s to start disk – disk transfers • Service phase • September – end 2005 • Start with ALICE & CMS, add ATLAS and LHCb October/November • All offline use cases except for analysis • More components: WMS, VOMS, catalogs, experiment-specific solutions • Implies production setup (CE, SE, …) Deployment update
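
(Illustrative addition.) A back-of-the-envelope check of the throughput-phase targets above, assuming the 150 MB/s and 60 MB/s figures are per-Tier-1 rates sustained for the full two weeks (the slide does not spell this out).

```python
# Total data volume implied by a sustained transfer rate over the two-week
# throughput phase (decimal units: 1 TB = 1e6 MB).
SECONDS_PER_DAY = 86400

def total_tb(rate_mb_per_s, days):
    return rate_mb_per_s * SECONDS_PER_DAY * days / 1e6  # MB -> TB

print(f"disk-disk, 150 MB/s for 14 days: ~{total_tb(150, 14):.0f} TB per Tier-1")
print(f"disk-tape,  60 MB/s for 14 days: ~{total_tb(60, 14):.0f} TB per Tier-1")
```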

  25. SC implications • SC3 will involve the Tier 1 sites (+ a few large Tier 2) in July • Must have the release to be used in SC3 available in mid-June • Involved sites must upgrade for July • Not reasonable to expect those sites to commit to other significant work (pre-production etc) on that timescale • T1: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and • Expect SC3 release to include FTS, LFC, DPM, but otherwise be very similar to LCG-2.4.0 • September-December: experiment “production” verification of SC3 services; in parallel set up for SC4 • Expect “normal” support infrastructure (CICs, ROCs, GGUS) to support service challenge usage • Bio-med also planning data challenges • Must make sure these are all correctly scheduled Deployment update

  26. SC3 issues • The Tier-1 network is being extensively re-configured. Tests showed up to 40% packet loss! Waiting for UKLight to be fixed. Not intending to use dual-homing, but dCache have provided a solution • Lancaster link is up at the link level • What is the bandwidth of the Lancaster connection? • Edinburgh hardware problem with the RAID array to be used as the SE – IBM investigating • Lancaster set up a test system. Now deploying more hardware • Need clarification about the classification of volatile vs permanent data with respect to Tier-2s • The file transfer service should be ready now but has problems with the client component • RAL would like a longer period for testing tape than suggested in the SC3 plans • There has been an issue with CMS preferring to use Phedex and not FTS for transfers. We need to add into the plans a period of Phedex-only transfer tests • The dCache mailing list is very active now. There have been problems with the installation scripts Deployment update
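
(Illustrative addition.) To see why the observed packet loss matters for the transfer targets, the Mathis et al. rule of thumb for single-stream TCP throughput is a useful yardstick. It only holds for small loss rates, but that is exactly the point: even sub-percent loss caps a stream far below the SC3 figures, and at the 40% seen in the tests TCP is effectively unusable. The MSS and RTT below are assumed values.

```python
# Mathis et al. approximation: rate ~ (MSS / RTT) * (C / sqrt(loss)), C ~ 1.22.
from math import sqrt

def mathis_throughput_mb_s(mss_bytes, rtt_s, loss):
    return (mss_bytes / rtt_s) * (1.22 / sqrt(loss)) / 1e6  # bytes/s -> MB/s

mss = 1460    # typical Ethernet MSS in bytes
rtt = 0.020   # assumed 20 ms round-trip time on a UK wide-area path
for loss in (1e-4, 1e-3, 1e-2):
    rate = mathis_throughput_mb_s(mss, rtt, loss)
    print(f"loss {loss:.2%}: ~{rate:.1f} MB/s per TCP stream")
```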

  27. SC3 issues continued • We have questions about whether FTS uses SRM-put or SRM-cp • From September onwards the SC3 infrastructure is to provide a production-quality service for all experiments – remember the comments about UKLight being a research network – risk!? • Differing engagement with the experiments. Edinburgh needs a better relationship with LHCb • There is an LCG workshop in mid-June where the experiment plans should be almost final! • GridPP needs to do more load testing than is anticipated in SC3 • Planning for SC4 needs to start soon. Currently we are pushing dCache but DPM is also supposed to be available. Deployment update

  28. Imperial (London Tier-2) • SRM/dCache status • Production server installed • gfe02.hep.ph.ic.ac.uk • Information provider still under development • 1.5TB pool node added • RHEL 4, 64-bit system • Installed using the dcache.org instructions http://www.dcache.org/downloads/dCache-instructions.txt • Extra 1.5TB ready to add when CMS is ready • 6TB being purchased. Should be in place by the start of the setup phase • CMS software • Service node provided • Phedex installed • Confirmation sought on the FTS/Phedex issue Deployment update

  29. Edinburgh • Current LCG production setup: • Compute Element (CE), Classic Storage Element (SE), 3 Worker Nodes (2 machines, 3 CPUs). Monitoring takes place on the SE, running LCG 2.4.0. About to add 2 Worker Nodes (2 CPUs in 1 machine) and have a User Interface (UI) in testing. We have a 22TB datastore available • Plans • £2000 available for 2 machines - one for dCache work and one to connect to EPCC's SAN (10 TBytes promised). • Considering the procurement of more WNs but have no clear requirements from LHCb. Deployment update

  30. Lancaster (current) – [diagram of the current Lancaster setup] Deployment update

  31. Lancaster (planned) • LightPath and terminal end-box installed. • Still require some hardware for our internal network topology. • Increase in storage to ~84TB, possibly ~92TB with a working resilient dCache from the CE Deployment update

  32. Other areas… Deployment update

  33. JRA4 request • We have some idea of requirements from networking experts within JRA4 • Draft requirements document available here: • https://edms.cern.ch/document/593620/1 • Draft use case document available here: • https://edms.cern.ch/document/591777/1 • We’re looking for more input from NOCs and GOCs • If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us • Even if you don’t have ideas at the moment, but would like to be involved in the process, please get in contact • Contact details are at the end of the talk Deployment update

  34. DTEAM discussion • Review of team objectives – what is the team focus for the next 3 & 5 months • Communications with the experiments • Using a project tool to work better as a team • Metrics!! • Review of plans and what needs to be done to keep them up to date, including GridPP challenges and SC4 • Web-page status • Areas raised at the T2B and DB meetings • Security challenge involvement • Accounting – status and making further progress • Libraries and understanding expt. needs • Review dCache efforts • Address issues with quarterly reports & weekly reports • Next release, test-zone and test-zone machines • Data management – guidelines required • Improving robustness • GI – (documentation (esp. releases), multi-Tier R-GMA, intro. new sites, LCFGng distribution (Kickstart & Pixieboot…), jobs – how to get Deployment update
