

  1. SA1
  Ian Bird, SA1 Activity Leader, CERN IT Department
  EGEE Final Review, 23rd – 24th May 2006

  2. Outline
  • Recommendations from the intermediate focused review
  • Highlights of the last 3 months of the project
  • Summary of SA1 achievements and open issues

  3. SA1 Achievements
  • Scale of the infrastructure: has grown steadily during the project; growth has now slowed, with expansion coming through related projects
  • Sustained real production use of the infrastructure, which is supported by the operations teams
  • Maturing but evolving operations procedures, dealing with all aspects of operations
  • User support: GGUS is becoming the central coordination point, and its use is growing
  • Middleware distribution: it is now clear how to evolve the production service – convergence between the existing LCG-2.x and gLite-1.x
  • Progress on interoperability and interoperation: significant progress with OSG; progress with ARC
  • Related projects

  4. Recommendation 16 – i
  "Plan the migration procedure of service support for gLite in full production service more clearly with precise dates and mandates for each site, and advertise to the users well in advance." And the accompanying comment: "Pre-production service must not take on a life of its own…"
  • Early set-up of the TCG: the forum for agreeing schedules across the technical and application activities
  • Schedule proposed and agreed for 2006 – see the next slide

  5. Recommendation 16 – ii: Deployment schedule for 2006
  • Deliver and deploy LCG-2.7.0, end January 2006
    • Bug fixes, patches, etc. accumulated since the last major release in August; delivered on time and deployed
  • Prepare gLite-3.0 for initial deployment in May 2006
    • Convergence of LCG-2.x and gLite-1.x; evolutionary from the deployment point of view – it will not be a big-bang change of the production service
    • Schedule driven by the LCG service challenges
  • Foresee a second major "release" on the October/November timescale
    • Added functionality – driven by the applications via the TCG
  • Quick fixes and security patches may be produced at any time and deployed with the agreement of the TCG
  • Client tools may be updated more frequently and can be deployed rapidly without the need for major upgrades
  • Other stand-alone services may be deployed centrally or at a few sites, to demonstrate functionality or provide new facilities; these usually need by-hand installation
  In general we try to move away from big-bang releases:
  • Focus on service/component upgrades where possible
  • Check-point releases to consolidate changes and to give new sites a starting point
  • Think of it more like a Linux distribution – major releases with continual component updates, security patches, etc. (see the sketch after this list)
  • Pre-production service – now an integral part of the release process – should demonstrate new releases
  • A continuous process of integration, certification and pre-production testing, leading to eventual deployment
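To make the Linux-distribution analogy concrete, here is a minimal illustrative sketch (not EGEE or gLite tooling; the component names and versions are hypothetical) of the incremental-update idea: compare installed component versions against a release manifest and apply only what has changed, rather than a big-bang upgrade.

    # Illustrative sketch only: compare installed middleware component
    # versions against a release manifest and list the incremental updates
    # to apply. Component names and versions are hypothetical examples.
    from typing import Dict, List

    def pending_updates(installed: Dict[str, str],
                        manifest: Dict[str, str]) -> List[str]:
        """Return components whose installed version differs from the manifest."""
        updates = []
        for component, wanted in manifest.items():
            have = installed.get(component)
            if have != wanted:
                updates.append(f"{component}: {have or 'not installed'} -> {wanted}")
        return updates

    if __name__ == "__main__":
        installed = {"lcg-CE": "2.7.0", "glite-WMS": "1.5.2", "lfc": "1.3.8"}
        manifest = {"lcg-CE": "3.0.0", "glite-WMS": "3.0.0", "lfc": "1.5.6"}
        for line in pending_updates(installed, manifest):
            print(line)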

  6. Recommendation 17 – i
  "Help to establish exemplary procedures for interoperations of more divergent infrastructures and take the lead in such activities."
  • Several avenues:
    • Collaborative activities – security and operational policy
    • Interoperability
    • Interoperation / shared operation – workshops
    • Other projects
  • Joint collaborative activities:
    • Security – JSPG, MWSG, GridPMAs
    • Grid Interoperability Now (GIN) group – many projects; very active at GGF17

  7. Recommendation 17 – ii: Interoperability
  Several initiatives are at various stages.
  • With OSG – the most advanced: cross job submission has been put in place for WLCG
    • Used in production by US-CMS for several months
    • The EGEE Generic Info Provider is installed on OSG sites (now in VDT), allowing all sites to be seen in the information system (see the sketch below)
    • GStat and SFT can run on OSG sites
    • EGEE clients are installed on OSG-LCG sites; inversely, EGEE sites can run OSG jobs
    • All use SRM storage elements; file catalogues are an application choice – the LFC is widely used
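To show roughly what an information provider does, here is a hedged sketch (not the actual Generic Info Provider; the site name, queue and numbers are invented) that prints a GLUE-style LDIF record for a computing element, which a BDII-like information system could then collect:

    # Illustrative sketch of a GLUE-style information provider: it prints
    # an LDIF record describing a computing element. Attribute names follow
    # the GLUE 1.x schema; the site, queue and numbers are invented.
    SITE = "example-osg-site.org"
    QUEUE = "cms"

    def emit_ce_record(free_cpus: int, total_cpus: int) -> str:
        ce_id = f"{SITE}:2119/jobmanager-condor-{QUEUE}"
        lines = [
            f"dn: GlueCEUniqueID={ce_id},mds-vo-name=local,o=grid",
            "objectClass: GlueCE",
            f"GlueCEUniqueID: {ce_id}",
            f"GlueCEInfoTotalCPUs: {total_cpus}",
            f"GlueCEStateFreeCPUs: {free_cpus}",
            "GlueCEStateStatus: Production",
        ]
        return "\n".join(lines)

    if __name__ == "__main__":
        # In a real provider these numbers would come from the batch system.
        print(emit_ce_record(free_cpus=120, total_cpus=512))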

  8. Interoperability – cont.
  • With ARC/NorduGrid – strategies:
    1. Agree standard interfaces at the site level and evolve services towards these interfaces
    2. Present these interfaces at the Grid boundary, with a portal to forward and translate
    3. Deploy EGEE and ARC CEs in parallel (at large sites, for LCG)
    • 1 is the long-term goal; 2 is the medium-term solution
    • Several workshops to follow progress; work on the information system (GLUE); EGEE → ARC submission works
  • With NAREGI – first workshop in March; several joint activities agreed, with work just starting:
    • Information system translators (GLUE ↔ CIM)
    • Data management tools – NAREGI will test the EGEE LFC, FTS and DPM
    • Job management (JDL ↔ JSDL, etc.) – see the sketch below
    • Security
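To illustrate what the JDL ↔ JSDL work involves, here is a toy one-way translation sketch (not the agreed EGEE/NAREGI translator; the attribute mapping is a simplified assumption) for a few common JDL attributes:

    # Toy sketch of a one-way JDL -> JSDL translation for a handful of
    # attributes. Real translators cover far more of both languages; the
    # mapping here is a simplified assumption for illustration.
    import re
    from xml.sax.saxutils import escape

    def parse_jdl(jdl: str) -> dict:
        """Parse 'Attribute = "value";' pairs from a ClassAd-style JDL string."""
        return {m.group(1): m.group(2)
                for m in re.finditer(r'(\w+)\s*=\s*"([^"]*)"\s*;', jdl)}

    def to_jsdl(attrs: dict) -> str:
        exe = escape(attrs.get("Executable", ""))
        args = escape(attrs.get("Arguments", ""))
        out = escape(attrs.get("StdOutput", ""))
        return f"""<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
      <jsdl:JobDescription>
        <jsdl:Application>
          <jsdl-posix:POSIXApplication
              xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
            <jsdl-posix:Executable>{exe}</jsdl-posix:Executable>
            <jsdl-posix:Argument>{args}</jsdl-posix:Argument>
            <jsdl-posix:Output>{out}</jsdl-posix:Output>
          </jsdl-posix:POSIXApplication>
        </jsdl:Application>
      </jsdl:JobDescription>
    </jsdl:JobDefinition>"""

    if __name__ == "__main__":
        jdl = 'Executable = "/bin/hostname"; Arguments = "-f"; StdOutput = "out.txt";'
        print(to_jsdl(parse_jdl(jdl)))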

  9. Recommendation 17 – iii: Operations (interoperations)
  • Joint operations – WLCG is a strong driver, bringing together EGEE and OSG grid operations:
    • Extend the ROC concept
    • Structures for routing tickets – a prototype to be demonstrated in June
    • Use of the GOC-DB for OSG sites
    • OSG sites join the weekly operations meeting
    • SFTs run on LCG production sites in OSG
    • An "ops" VO agreed for joint operations
    • Accounting – for LCG – uses the GGF usage record (see the sketch below)
  • Related projects – EUMedGrid, BalticGrid, EELA, EUChinaGrid, SEE-Grid – implement EGEE operational concepts and procedures
  • Operations workshops: explicitly joint with OSG, ensuring the related projects attend
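As a hedged illustration of accounting interchange, the sketch below emits a record for a finished job in the style of the GGF Usage Record format. The exact namespace and element names are assumptions of this sketch (recalled from the UR-WG drafts), and the job data is invented:

    # Minimal sketch of emitting a GGF-style usage record so accounting
    # systems can exchange job data. Treat the namespace and element names
    # as assumptions of this sketch; the job identifiers are invented.
    from xml.sax.saxutils import escape

    URF_NS = "http://schema.ogf.org/urf/2003/09/urf"

    def usage_record(job_id: str, user: str, cpu_secs: int, wall_secs: int) -> str:
        return f"""<urwg:UsageRecord xmlns:urwg="{URF_NS}">
      <urwg:RecordIdentity urwg:recordId="{escape(job_id)}-ur"/>
      <urwg:JobIdentity>
        <urwg:GlobalJobId>{escape(job_id)}</urwg:GlobalJobId>
      </urwg:JobIdentity>
      <urwg:UserIdentity>
        <urwg:GlobalUserName>{escape(user)}</urwg:GlobalUserName>
      </urwg:UserIdentity>
      <urwg:CpuDuration>PT{cpu_secs}S</urwg:CpuDuration>
      <urwg:WallDuration>PT{wall_secs}S</urwg:WallDuration>
    </urwg:UsageRecord>"""

    if __name__ == "__main__":
        print(usage_record("https://ce.example.org/12345",
                           "/DC=org/CN=Example User",
                           cpu_secs=5400, wall_secs=7200))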

  10. Recommendation 17 – iv
  • Future:
    • Shared operations will be a reality – required for LCG: EGEE, OSG, ARC, NAREGI
    • EGEE-II has explicit tasks on interoperability with ARC and UNICORE
    • The expectation is for coexisting campus, local, regional, national and international grid infrastructures
    • Coexistence, interoperability, interoperation and common policies will be a way of life
    • The long-term sustainable infrastructure after EGEE-II will be built on this work

  11. Recommendation 18
  "Move away from present primary dependence on particular flavours of both processors and Linux and provide support for more heterogeneous resources, including supercomputers, to allow increased collaborative adoption at major computing centres."
  • Current porting status: several ports to other architectures (IA64, several Linux flavours), available a few months after the main release; done by partners, outside the main build and integration system
  • Future:
    • Important to have several important ports close to, or part of, the main integration and testing
    • Include 64-bit cleanliness in the build test – flagged as a failure (see the sketch below)
    • Move to ETICS to provide a distributed build system supporting many platforms; this helps tie porting partners into the central process – a partner interested in a particular port can provide build and test hardware, and ETICS can help integrate this into the process
    • The TCG should agree a reasonable/realistic set of standard primary platforms to be provided as part of the base release, e.g. SL4 + Debian on 32- and 64-bit
    • Other ports can be asynchronous and should be certified by the partners providing the resources
    • Supercomputers should be supported by ports to the relevant OS and MPI; collaboration with DEISA in EGEE-II
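As a hedged illustration of a 64-bit cleanliness check (not ETICS code; the warning patterns and log path are assumptions of the sketch), a post-build step could scan compiler output for pointer-truncation warnings and fail the build when any appear:

    # Illustrative post-build check: scan a gcc build log for warnings that
    # typically indicate 64-bit-unclean code (pointer/int truncation) and
    # exit non-zero so the build is flagged as failed. The patterns and the
    # default log path are assumptions for the sketch, not ETICS conventions.
    import re
    import sys

    SUSPECT_PATTERNS = [
        r"cast from pointer to integer of different size",
        r"cast to pointer from integer of different size",
    ]

    def check_log(path: str) -> int:
        hits = 0
        with open(path, encoding="utf-8", errors="replace") as log:
            for lineno, line in enumerate(log, 1):
                if any(re.search(p, line) for p in SUSPECT_PATTERNS):
                    print(f"{path}:{lineno}: {line.strip()}")
                    hits += 1
        return hits

    if __name__ == "__main__":
        n = check_log(sys.argv[1] if len(sys.argv) > 1 else "build.log")
        if n:
            print(f"FAIL: {n} 64-bit cleanliness warning(s) found")
            sys.exit(1)
        print("OK: no 64-bit cleanliness warnings")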

  12. SA1 Highlights

  13. EGEE Grid Sites: Q1 2006
  [Chart: growth in CPUs and sites over the lifetime of the project]
  • EGEE: steady growth over the lifetime of the project
  • EGEE: > 180 sites in 40 countries; > 24,000 processors; ~ 5 PB storage

  14. A global, federated e-Infrastructure
  • EGEE infrastructure: ~ 200 sites in 39 countries, ~ 20,000 CPUs, > 5 PB storage, > 20,000 concurrent jobs per day, > 60 Virtual Organisations
  • Related projects and collaborations are where the future expansion of resources will come from: BalticGrid, NAREGI, SEE-GRID, OSG, EUChinaGrid, EUMedGrid, EUIndiaGrid, EELA

  15. Use of the infrastructure
  • Sustained and regular workloads of > 30,000 jobs/day
    • Spread across the full infrastructure
    • Doubled/tripled in the last 6 months – with no effect on operations

  16. Use of the infrastructure
  • Massive data transfers: > 1.5 GB/s
  • Several applications now depend on EGEE as their primary computing resource
  • Sustainability: usage can (and does) grow without the need for additional operational effort

  17. EGEE Operations Process
  • Grid operator on duty
    • 6 teams working in weekly rotation: CERN, IN2P3, INFN, UK/I, Russia, Taipei
    • Crucial in improving site stability and management
    • Expanding to all ROCs in EGEE-II
  • Operations coordination
    • Weekly operations meetings
    • Regular ROC managers meetings
    • Series of EGEE Operations Workshops: Nov 04, May 05, Sep 05, June 06
  • Geographically distributed responsibility for operations – there is no "central" operation
    • Tools are developed and hosted at different sites: GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
  • Procedures described in the Operations Manual: introducing new sites, site downtime scheduling, suspending a site, escalation procedures, etc.
  Highlights:
  • Distributed operation
  • Evolving and maturing procedures
  • Procedures being introduced into, and shared with, the related infrastructure projects

  18. Site Functional Tests
  • Very important in stabilising sites:
    • Applications use only good sites
    • Bad sites are automatically excluded
    • Sites work hard to fix problems
  • Site Functional Tests (SFT)
    • A framework to test (a sample of) services at all sites
    • Shows a results matrix; a detailed test log is available for troubleshooting and debugging
    • The history of individual tests is kept
    • Can include VO-specific tests (e.g. software environment)
    • Normally > 80% of sites pass the SFTs; NB of 180 sites, some are not well managed
  • Extending to service availability (see the sketch below):
    • Measure availability by service, site and VO
    • Each service has an associated service class defining the required availability (critical, highly available, etc.)
    • A first approach to SLAs
    • Used to generate alarms, open trouble tickets and call out support staff
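As a hedged illustration of the availability idea (a minimal sketch, not the SFT framework; the service classes, targets and sample data are invented), one can aggregate per-site test results, compare them against the availability required by each site's service class, and raise an alarm when a site falls below its target:

    # Minimal sketch: aggregate pass/fail test results per site, compare
    # the resulting availability against the target of the site's service
    # class, and flag sites that need an alarm or trouble ticket. Classes,
    # targets and sample data are invented for illustration.
    from collections import defaultdict

    TARGETS = {"critical": 0.99, "highly_available": 0.95, "best_effort": 0.80}

    def availability(results):
        """results: iterable of (site, passed). Returns {site: pass fraction}."""
        passed, total = defaultdict(int), defaultdict(int)
        for site, ok in results:
            total[site] += 1
            passed[site] += int(ok)
        return {site: passed[site] / total[site] for site in total}

    def alarms(results, site_class):
        """Yield (site, availability, target) for sites below their target."""
        for site, avail in availability(results).items():
            target = TARGETS[site_class.get(site, "best_effort")]
            if avail < target:
                yield site, avail, target

    if __name__ == "__main__":
        results = [("siteA", True), ("siteA", True), ("siteA", False),
                   ("siteB", True), ("siteB", True)]
        classes = {"siteA": "critical", "siteB": "best_effort"}
        for site, avail, target in alarms(results, classes):
            print(f"ALARM {site}: availability {avail:.2f} below target {target:.2f}")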

  19. Middleware Distributions and Stacks
  • Terminology:
    • EGEE deploys a middleware distribution, drawn from various middleware products, stacks, etc.
    • Do not confuse the distribution with development projects or with software packages
    • Count on 6 months from a software developer's "release" to production deployment
  • The EGEE distribution:
    • Current production version labelled LCG-2.7.0; new production version labelled gLite-3.0
    • The name change should (hopefully) reduce confusion
  • EGEE distribution contents – evolution:
    • LCG-2.7.0: VDT (packaging Globus 2.4, Condor, MyProxy); EDG workload management; LCG components (BDII information system, LFC catalogue, DPM, data management libraries and CLI tools, monitoring tools); gLite components R-GMA, VOMS and FTS
    • gLite-3.0: based on LCG-2.7.0, plus gLite workload management
    • Other gLite components, not in the distribution but provided as services: AMGA, Hydra, Fireman, gLite-IO

  20. Process to deployment
  [Diagram: middleware providers (JRA1, VDT/OSG, OMII-Europe) feed into integration (SA3), then testing and certification (SA3), then the pre-production service (SA3 + SA1), and finally the production service (SA1), with support, analysis and debugging feeding back into the certification activities]
  The staged flow is sketched below.
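As a minimal sketch of the staged promotion idea in the diagram (stage names follow the diagram; the check functions are stand-ins, not real certification tests), a release candidate only advances when the current stage's checks pass:

    # Minimal sketch of staged promotion: a release candidate advances
    # through integration, certification and pre-production, reaching
    # production only if every stage's checks pass. The checks here are
    # stand-ins, not real certification tests.
    STAGES = ["integration", "certification", "pre-production", "production"]

    def promote(release, checks):
        """Advance `release` stage by stage; stop at the first failed check."""
        for stage in STAGES:
            print(f"{release}: entered {stage}")
            if stage == "production":
                return stage  # deployed
            if not checks[stage](release):
                print(f"{release}: failed {stage} checks, promotion stopped")
                return stage

    if __name__ == "__main__":
        # Pretend certification finds a problem in this candidate.
        checks = {
            "integration": lambda r: True,
            "certification": lambda r: False,
            "pre-production": lambda r: True,
        }
        promote("gLite-3.0-rc1", checks)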

  21. The Support Model: "Regional Support with Central Coordination"
  The ROCs, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA) and service groups (SA) are connected via a central integration platform provided by GGUS.
  • GGUS is now being used for all problem reporting: operational, deployment and user support
  • VOs are using it for their support systems
  • Its use is growing steadily
  [Diagram: users reach the central GGUS application through an interface/web portal; the TPM routes tickets to regional support units (ROC 1 … ROC 10) and to technical support units: operations support, deployment support, middleware support, VO support, network support]
  A toy routing sketch follows below.
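As a hedged illustration of the routing idea (a toy dispatcher, not the GGUS implementation; the categories and unit names are assumptions based on the diagram above), central coordination amounts to classifying each ticket and handing it to the appropriate regional or technical support unit:

    # Toy ticket dispatcher illustrating "regional support with central
    # coordination": classify a ticket and route it to one support unit.
    # Categories, unit names and sample tickets are assumptions for the
    # sketch, not the actual GGUS data model.
    from dataclasses import dataclass

    TECHNICAL_UNITS = {
        "middleware": "Middleware Support",
        "network": "Network Support",
        "deployment": "Deployment Support",
        "vo": "VO Support",
    }

    @dataclass
    class Ticket:
        ticket_id: int
        category: str   # e.g. "middleware", "network", "site"
        site: str = ""  # set for site-specific problems

    def responsible_roc(site: str) -> str:
        """Look up the Regional Operations Centre for a site (stubbed here)."""
        return f"ROC responsible for {site}"

    def route(ticket: Ticket) -> str:
        """Central coordination step: hand the ticket to one support unit."""
        if ticket.category == "site":
            return responsible_roc(ticket.site)  # regional support
        return TECHNICAL_UNITS.get(ticket.category, "Operations Support")

    if __name__ == "__main__":
        print(route(Ticket(1, "middleware")))
        print(route(Ticket(2, "site", site="ce01.example-site.org")))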

  22. Collaborative policy development
  Many policy aspects are collaborative works, e.g.:
  • Joint Security Policy Group
  • Certification Authorities: EUGridPMA → IGTF, etc.
  • Grid Acceptable Use Policy (AUP): a common, general and simple AUP for all VO members using many Grid infrastructures – EGEE, OSG, SEE-GRID, DEISA, national Grids, …
  • Incident Handling and Response: defines basic communication paths and requirements (MUSTs) for incident response; not intended to replace or interfere with local response plans
  [Diagram: the Security & Policy document set – Incident Response, Certification Authorities, Audit Requirements, Usage Rules, Security & Availability Policy, VO Security, Application Development & Network Admin Guide, User Registration & VO Management]

  23. SA1 goals for EGEE-II
  Key goal: we have a large running production infrastructure, but EGEE-II MUST take what we have now and make it:
  • Reliable
    • Middleware components fail and error reporting is missing; there is an application responsibility here too – it needs effort
    • … but: the service has been running non-stop for > 2 years
  • Robust: must continue to address service aspects – move away from prototypes
  • Usable: it is still hard to use for many users, and still too slow to introduce new VOs
  • Acceptable: it must be easy to deploy in a wide variety of environments and coexist with other grid infrastructures
  • Sustainable: the infrastructure must become sustainable for the long term

  24. SA1 Outlook
  • The LHC VOs must achieve reliable production and analysis in 2006
    • They will be making significant use of resources
    • Applications must bring resources → show commitment
  • Consolidate and improve existing services, focusing on reliability, robustness, manageability, performance, scalability, etc.
    • Evolution or replacement of services driven by the needs of the applications (or operations/security/manageability); the TCG has a key role here
  • Expand grid operations
    • Spread expertise to the ROCs
    • Collaboration with OSG, Asia-Pacific, etc. and the related projects
    • Start to negotiate SLAs
  • Sustainability: processes evolving, spread of expertise and tasks
  • Resource sharing and negotiation must become streamlined; we will need a mechanism for cost/credit for the use of resources

  25. Summary
  • SA1 has built a large production grid infrastructure
    • In constant and extensive daily production use; several applications depend on it for resources
  • Tools and processes are maturing and evolving; security and usage policies are also evolving
  • We have a basic set of middleware that addresses most requirements
    • The production middleware has now converged: LCG-2 + gLite → gLite 3
  • EGEE-II will focus on making this sustainable and really usable
