
Automatic server registration and burn-in framework

Automatic server registration and burn-in framework. HEPiX ’13, 28th October 2013. Speaker: Afroditi XAFI. Co-authors: Olof BÄRRING, Eric BONFILLOU, Liviu VALSAN. Outline: Motivation, Preparation, Implementation, Workflow, Results of 1k+ bulk delivery: Network Registration


Presentation Transcript


  1. Automatic server registration and burn-in framework
     HEPiX ’13, 28th October 2013
     Speaker: Afroditi XAFI
     Co-authors: Olof BÄRRING, Eric BONFILLOU, Liviu VALSAN

  2. Outline
     • Motivation
     • Preparation
     • Implementation
     • Workflow
     • Results of 1k+ bulk delivery:
       • Network Registration
       • Burn-in & Performance Tests
     • Conclusions
     • Future work

  3. Motivation
     Up to the beginning of this year, running acceptance tests meant:
     • Manually registering the servers in the network database and in the system administration toolkit
       • Error-prone: based on input given by the suppliers in Excel format (cells not in the right format)
       • Not being able to register the servers would prevent the acceptance tests from starting
     • Installing the servers with Linux SLC
       • For very large deliveries, the parallel installation could fail because the installation servers were overloaded
     • Reviewing the test results was not straightforward
       • It was a semi-automated log analysis, with no dashboards
     • It required significant effort to follow up a given delivery:
       • On average, one person was assigned full time per delivery
       • Every single error had to be understood and addressed manually

  4. Motivation
     Ultimately, the goals we wanted to achieve by automating the process were to:
     • Reduce the number of errors at network registration time, and detect them better
     • Avoid unnecessary installation and early registration in the system administration toolkits
     • Minimize the effort needed to carry out the acceptance
     • Ease the analysis of the results
     • Deliver the resources to the users more quickly (provided there are no generic hardware issues)

  5. Preparation
     We had to define our requirements to the vendors more systematically:
     • Infrastructure requirements prior to delivery:
       • Sticker with a unique ID in barcode format, and its location on the chassis, to ease asset management
       • I/O port schema provided to ease the physical installation and cabling process
     • Remote access given by the suppliers to the first production systems prior to delivery:
       • Allows the procurement team to define the desired hardware configuration of the systems (e.g. BIOS settings, boot list order)
     [Figure labels: Purchase Order, Serial Number]

  6. Implementation
     • Python application running on the live image
     • Monitors hardware and software failures
     • Lemon agent running on the live image, embedding all the necessary hardware sensors
     • Reporting events to Splunk
     • Maintains a hardware profile of each server in a DB
     • x86 architecture, soon ARM
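
The idea of keeping a hardware profile per server and reporting failures as events can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `HardwareProfile` fields, the event keys, and the JSON-lines output are hypothetical, not the framework's actual schema, and forwarding the lines to Splunk is left out.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical per-server profile; the field names are illustrative only."""
    serial: str
    cpu_count: int
    memory_gb: int
    disk_count: int


def profile_mismatches(expected, found):
    """Return one event dict per field whose discovered value differs."""
    exp, got = asdict(expected), asdict(found)
    events = []
    for field in exp:
        if exp[field] != got[field]:
            events.append({
                "serial": found.serial,
                "check": field,
                "expected": exp[field],
                "found": got[field],
            })
    return events


def to_json_lines(events):
    """Serialize events as one JSON object per line for a log collector."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)


# Example: a server ordered with 64 GB of memory is discovered with 48 GB.
expected = HardwareProfile("SN-0001", cpu_count=2, memory_gb=64, disk_count=4)
found = HardwareProfile("SN-0001", cpu_count=2, memory_gb=48, disk_count=4)
events = profile_mismatches(expected, found)
print(to_json_lines(events))
```

Comparing against the ordered configuration, rather than only checking for faults, would also catch servers that work but were not delivered as requested.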

  7. Process Steps – Registration
     [Workflow diagram; labels as extracted: Get Certificates, Register asset info, Register DHCP, Discover MAC addresses, HW Discovery, PXEboot, Start burn-in, Permanent IP, Temporary IP, HW Inventory, Network DB, Load Live image]
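
A registration workflow of this kind can be sketched as an ordered list of steps, each retried a bounded number of times before the server is flagged. The step names below come from the slide labels, but their order, the retry limit, and the simulated step functions are assumptions; the transcript does not preserve the diagram's arrows.

```python
from typing import Callable, Dict, List, Tuple

# Step names are the slide labels; this ordering is an assumption.
STEP_ORDER = [
    "PXEboot",
    "Load Live image",
    "HW Discovery",
    "Discover MAC addresses",
    "Register asset info",
    "Register DHCP",
    "Get Certificates",
    "Start burn-in",
]


def run_registration(
    steps: Dict[str, Callable[[], bool]], max_attempts: int = 3
) -> Tuple[bool, List[Tuple[str, int, bool]]]:
    """Run each step in order, retrying a failing step up to max_attempts times.

    Returns (overall_success, history), where history holds one
    (step, attempt, ok) tuple per attempt.
    """
    history = []
    for name in STEP_ORDER:
        for attempt in range(1, max_attempts + 1):
            ok = steps[name]()
            history.append((name, attempt, ok))
            if ok:
                break
        else:
            # All attempts failed: stop here so an operator can look at it.
            return False, history
    return True, history


def flaky(failures: int) -> Callable[[], bool]:
    """Simulated step that fails the given number of times, then succeeds."""
    remaining = [failures]

    def step() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return step


# Example: DHCP registration fails once (say, a cable not fully plugged in).
steps = {name: flaky(0) for name in STEP_ORDER}
steps["Register DHCP"] = flaky(1)
ok, history = run_registration(steps)
print(ok, len(history))
```

Recording every attempt, not just the final outcome, is what would make failures and retries visible on a dashboard.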

  8. Burn-in & performance tests
     • Run as part of the live (in-memory) image
     • Memory (memtest) and CPU (burnK7 or burnP6, and burnMMX) endurance tests
     • Disk endurance tests (badblocks, SMART self-tests)
     • Disk and CPU performance tests (HEP-SPEC06, FIO)
     • Based on HATS, presented at HEPiX Spring ’13
     • Performance tests are aimed at certifying conformance to the technical specifications, and are quite efficient at finding hardware failures
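
A test runner for such a suite can be sketched as a loop over external commands that records a status for each one. This is not the HATS implementation: the `run_test` helper, the result fields, and the plan format are hypothetical, and the plan uses harmless stand-in commands where a real run would invoke tools such as memtest, badblocks, or FIO.

```python
import subprocess
import sys


def run_test(name, argv, timeout_s):
    """Run one test command and return a result record with its status."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
        status = "pass" if proc.returncode == 0 else "fail"
        detail = (proc.stderr or proc.stdout).strip()
    except subprocess.TimeoutExpired:
        # A hung endurance test is reported, not waited on forever.
        status, detail = "timeout", f"exceeded {timeout_s}s"
    except FileNotFoundError:
        # The tool is missing from the image: a setup error, not a hardware fault.
        status, detail = "error", f"command not found: {argv[0]}"
    return {"test": name, "status": status, "detail": detail}


# Stand-in commands: a real plan would list the actual test tools.
plan = [
    ("stand-in-ok", [sys.executable, "-c", "raise SystemExit(0)"], 30),
    ("stand-in-bad", [sys.executable, "-c", "raise SystemExit(1)"], 30),
]
results = [run_test(name, argv, timeout_s) for name, argv, timeout_s in plan]
for r in results:
    print(r["test"], r["status"])
```

Keeping "error" and "timeout" separate from "fail" means a missing tool or a hung process is not mistaken for faulty hardware.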

  9. Results – Registration

  10. Results – Registration

  11. Results – Registration
      Some reasons for the failures and retries in the process:
      • Faulty cabling, i.e. wrong port cabled, or cable not fully plugged in
      • Faulty switch ports or settings
      • Faulty main-board
      • Not a failure: a few racks were missing switch up-links at CERN, which prevented PXE boot of some servers until the problem was fixed

  12. Results – Burn-in & Performance tests
      • Burn-in tests
      • HEP-SPEC06
        • Total HEP-SPEC06 of ~260k

  13. Conclusions
      Impact on our procurement activities:
      • The current framework allows acceptance tests to be run over a very short period
        • 1000+ servers and attached storage went through the process in about 1.5 weeks (instead of 3 to 4 months)
      • It requires a minimal amount of effort and resources
        • One person follows up what is happening using dashboards for about one hour per day, if no errors are detected
      • However, it can only work that well if the servers are delivered as requested
        • Preparation is key to success!

  14. Future work
      Functionality that we plan to add in the future to further automate the process:
      • Integration of a fully automated P2P network test
      • Better integration of RAID controllers
        • They require 3rd-party tools and specific hardware sensors to detect errors
      • Automation of the allocation process
        • If the server is error-free, direct registration to Foreman
      • Decouple it from CERN infrastructure so we can distribute it

  15. Thank you. Questions?

  16. contact: it-dep-cf-fpp@cern.ch
