
WP2: Infrastructure and Service Management

WP2: Infrastructure and Service Management. Status Report ETICS All-Hands – 23 October 2006 CERN: Marian Zurek INFN: Matteo Selmi UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo. Personnel News. Changes @ UW-Madison Tolya Karp replaced by Andy Pavlo and Becky Gietzel


Presentation Transcript


  1. WP2: Infrastructure and Service Management Status Report ETICS All-Hands – 23 October 2006 CERN: Marian Zurek INFN: Matteo Selmi UW-Madison: Peter Couvares, Becky Gietzel, Andy Pavlo

  2. Personnel News • Changes @ UW-Madison • Tolya Karp replaced by Andy Pavlo and Becky Gietzel • Peter still here :) • Carlos to join WP2 @ CERN in November • Much needed sysadmin help for Marian!

  3. Deliverables • D2.2 - Infrastructure installation and usage documentation (PM06) • Delivered (a little late -- PM07) • D2.3 - Status of certification, integration and validation testbed setup (prototype) (PM12) • Document not yet started -- but will contain positive news: prototype testbeds are up and have been operational for >6 months.

  4. Major Tasks Performed • Certification, Integration and Validation Infrastructure Expansion: CERN Facility • Due entirely to Marian’s ongoing hard work, WP2 has expanded the NMI Build/Test Facility at CERN and improved its operation. • etics.cern.ch: official ETICS WS/submission node, production host • 19 CPUs: ia32, x86_64, ia64, ppc • SLC3, SLC4, RHES3, Deb3, FC3, FC4, FC5, WinXP, MacOS • 2500+ jobs (as of 17 October 2006) vs. 1300+ jobs (as of 22 May 2006) • etics-test.cern.ch: test submission node • a few machines with SLC3, SLC4 on ia32 • 2200+ jobs (as of 17 October 2006) vs. 450+ jobs (as of 22 May 2006) • etics-dev.cern.ch: development node, non-stable • a few machines with SLC3, SLC4 on ia32 • 1650+ jobs (as of 17 October 2006) • etics-hd.cern.ch: new host for SLC4 WS/submission node prototype • Operational setup • WNs status page: http://etics.cern.ch/nmi/?page=pool/index • Job status page: http://etics.cern.ch/nmi/?page=results/overview
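The job counters on slide 4 can be turned into a rough average submission rate. The sketch below is only back-of-the-envelope arithmetic: it assumes each "N+" figure is exactly N and that jobs arrived evenly between the two snapshot dates, neither of which the slide states. The names `SNAPSHOTS` and `jobs_per_day` are made up for this illustration.

```python
from datetime import date

# Job counters quoted on the slide; each "N+" is treated as exactly N (an assumption).
SNAPSHOTS = {
    "etics.cern.ch": (1300, 2500),
    "etics-test.cern.ch": (450, 2200),
}
START, END = date(2006, 5, 22), date(2006, 10, 17)


def jobs_per_day(first, last, start=START, end=END):
    """Average number of jobs added per calendar day between two snapshots."""
    return (last - first) / (end - start).days


for host, (first, last) in SNAPSHOTS.items():
    print(f"{host}: ~{jobs_per_day(first, last):.1f} jobs/day")
```

Because the counters are lower bounds, the resulting rates are approximate and say nothing about peak load.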

  5. ETICS, 4th EGEE Conference, Pisa, Italy, November 2005

  6. Major Tasks Performed • Certification, Integration and Validation Infrastructure Expansion (Cont.) • INFN Facility • Thanks to Matteo, WP2 has also expanded NMI Build/Test Facility at INFN • etics-01.cnaf.infn.it: ETICS WS/submission node • 5 CPUs: ia32, x86_64, ppc • SLC3/SLC4/CentOS4/MacOSX • 330+ jobs • UW-Madison Facility • 100+ CPUs, 43+ platforms, and still growing… • Thanks to Becky, local ETICS WS currently being deployed

  7. Major Tasks Performed • Parallel Testing Feature Delivered • Allows co-scheduling of multiple heterogeneous resources, e.g. to dynamically deploy a custom testbed for testing client/server or p2p s/w. • Originally an end of Q4 goal, delivered ~5 months early in response to gLite demands • D2.2 Infrastructure Installation and Usage Document Completed • Thanks to all of WP2 for content & reviewers for helpful feedback • gLite System Testing Prototype • To be described in detail by Marian tomorrow… • Continued Improvements to NMI Infrastructure • Many are a result of Marian & Matteo’s feedback & experiences setting up facilities at CERN and INFN. • Additional NMI documentation • New NMI website (http://nmi.cs.wisc.edu) • LISA ‘06 NMI paper, etc.

  8. Major Tasks Performed • Implemented short-term solution for root-level testing @ CERN • Initial approach is only loosely integrated with NMI • To be replaced by future NMI virtual machine capability? • Participation in OMII-Europe • Continued involvement to ensure infrastructure harmony • Cross-site job migration is also a top OMII-Europe goal • And last but not least: Boring system administration … every day • OS updates/upgrades, reboots, backups, disk space mgmt., disappearing WNs, crashes, power outages, filesystem failures, etc. • As CERN is the facility with the most usage, most of this falls onto Marian • The etics.cern.ch service is highly available. No significant downtime was caused by the WP2 infrastructure

  9. Issues • Capacity Planning / Scalability • Marian: “How many more needed?” • Good question! I have no idea. • Major new users/projects may need to provide new resources. • We need to better understand how easy/quick it is to add resources to an existing facility, and how many can be added in the same manner before new scalability issues arise. • NMI has been demonstrated to scale to 100’s of nodes, and Condor to 1000’s… but ETICS + NMI + Condor? It also depends on specific workload… • Additional ETICS Testbeds for Development • Marian: “Does every developer need their own ETICS installation?” • Combined deployment of NMI submit node + ETICS WS is not trivial or fully automated (no simple RPM or “plug’n’play”) • WP2 needs help from other WPs to better automate their deployment

  10. Issues • Uneven Facility Utilization • Was an issue in May, still an issue today • 3/3 sites set up, 1/3 in use • CERN facility set up, already in use, production-ready • INFN facility set up, but less used • UW facility set up, but not yet in regular use by ETICS • Why? Two reasons: • Minor: CERN facility known to work, other facilities less stress-tested. • Major: inconvenience of submitting to multiple ETICS sites with multiple DBs & WS interfaces • Upcoming cross-site job migration capabilities should largely address both issues -- if jobs automatically migrate, users don’t need to think about it, and all three pools will be exercised • To be described in more detail by Andy tomorrow…
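To illustrate why automatic migration would spread load across all three pools, here is a toy site-selection sketch. It is not the ETICS/NMI migration design, which the slides do not describe. The `Site` class, the `pick_site` function, the queue lengths and the UW-Madison platform list are all hypothetical. Only the CPU counts and the CERN/INFN platform names are taken from the slides, and "100+" CPUs is treated as 100.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    cpus: int
    platforms: frozenset
    queued: int = 0  # jobs currently waiting (hypothetical figure)


def pick_site(sites, platform):
    """Return the site offering `platform` with the fewest queued jobs per CPU."""
    candidates = [s for s in sites if platform in s.platforms]
    if not candidates:
        raise LookupError(f"no site offers platform {platform!r}")
    return min(candidates, key=lambda s: s.queued / s.cpus)


sites = [
    Site("CERN", 19, frozenset({"SLC3", "SLC4", "FC5", "WinXP"}), queued=40),
    Site("INFN", 5, frozenset({"SLC3", "SLC4", "CentOS4", "MacOSX"}), queued=2),
    # Platform list for UW-Madison is invented for the example.
    Site("UW-Madison", 100, frozenset({"SLC3", "SLC4", "FC5"}), queued=10),
]

for platform in ("SLC4", "CentOS4", "WinXP"):
    print(platform, "->", pick_site(sites, platform).name)
```

With a rule like this the user submits once, and the less-used pools receive work whenever they can run the requested platform with a shorter relative queue.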

  11. Issues • Communication • Evening in Europe == Morning in Madison • Bi-weekly calls stopped happening over summer • I’ve been slow to address the problem • Matteo in May: • “I think we need more coordination among the three sites. It is quite difficult for us at INFN to understand what are the urgent operations to be done.” • Marian in October: same complaint! • Sysadmin Work • Only one person @ CERN • Frequent OS updates/upgrades • Reboots • caused by power cuts (too hot), kernel updates/upgrades, HW failures • Marian: “I know it is not interesting for you, but this must work !! !! !!” • Heterogeneous clusters inherently harder to manage than homogeneous clusters of the same size • Complex s/w stack: ETICS client -> ETICS WS -> NMI -> Condor -> OS

  12. Workplan • Q4 Top Priorities • Develop/deploy/test cross-facility job migration capability. • …and increase utilization of INFN and UW-Madison pools as a result. • Keep up with increasing sysadmin demands -- keep infrastructure running smoothly for ETICS users & developers • Responding to Hardware/OS/Service issues • Automation of currently manual tasks • Deployment of new systems & services • Scalability work • Prepare D2.3 report on infrastructure status.

  13. Workplan • Q4/Q5 Unprioritized (next steps and/or resources unclear): • Hardware Virtualisation • WoD (WindowsOnDemand) service, VMWare and/or Xen • Service Monitoring (Service Level Status) • already available at http://sls.cern.ch/sls/service.php?id=ETICS • Your feedback is needed • Security issues • Passwords present in the CVS • Public / private resource allocation • A project wants to use ETICS, brings in its own private nodes, and wants their full capacity to stay private • Steering the project’s jobs to these nodes, preventing other jobs from landing there • Already supported by NMI/Condor, needs to be documented/customized for ETICS • Steering jobs to/identifying nodes with specific resources • Already supported by NMI/Condor, needs to be documented/customized for ETICS • Documentation • Needs to be updated & improved • ETICS-generic WS installation & configuration docs • CERN/INFN/UW facility-specific configuration & administration docs • Extracting info from Savannah issue DB
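The public/private allocation item on slide 13 is a two-sided match: the node decides which jobs it accepts, and the job states which resources it needs. The sketch below models that idea in plain Python. It is not Condor or NMI configuration, where the same policy would be written with Condor's own matchmaking expressions. The `Node` class, the `eligible_nodes` function, the node names and the project name `projX` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    attrs: dict                       # advertised resources, e.g. platform and arch
    private_to: Optional[str] = None  # owning project, or None for a public node


def eligible_nodes(nodes, project, requirements):
    """Names of nodes that accept `project` and satisfy every requirement.

    A private node accepts only its owner's jobs; the owner's private
    nodes are listed first so its jobs are steered there.
    """
    matches = [
        n for n in nodes
        if n.private_to in (None, project)
        and all(n.attrs.get(key) == value for key, value in requirements.items())
    ]
    matches.sort(key=lambda n: n.private_to != project)  # stable: private nodes first
    return [n.name for n in matches]


nodes = [
    Node("public-slc4", {"platform": "SLC4", "arch": "ia32"}),
    Node("projx-slc4", {"platform": "SLC4", "arch": "ia32"}, private_to="projX"),
    Node("public-ia64", {"platform": "SLC4", "arch": "ia64"}),
]

wanted = {"platform": "SLC4", "arch": "ia32"}
print(eligible_nodes(nodes, "projX", wanted))
print(eligible_nodes(nodes, "other", wanted))
```

The same requirement check covers the "nodes with specific resources" bullet: a job simply lists the attributes it needs.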

  14. Metrics • Bugs, jobs, tasks • 15 open NMI/Condor bugs/issues • 14 closed/addressed bugs/issues • Details available at: • bugs: https://savannah.cern.ch/bugs/?group=etics and select category=NMI • 5 open tasks, 1 closed • Details available at: https://savannah.cern.ch/task/?group=etics select category=NMI

  15. Conclusion • Discussion/Questions/Etc.
