
Advanced Fabric Management at CERN: Challenges and Solutions for Large-Scale Installations

This document outlines the strategies employed by CERN's IT team for advanced fabric management in a large-scale computing environment of approximately 2,800 nodes, projected to approach 10,000 by 2008. It covers challenges such as frequent mass installations, daily hardware failures, and the need for effective monitoring and planning. Key tools such as quattor, Lemon, and LEAF (comprising SMS and HMS) are highlighted for automatic configuration, monitoring, and state and hardware management, with emphasis on traceable workflows to maximize availability and track hardware status.


Presentation Transcript


  1. Advanced Fabric Management • Bill Tomlin for CERN IT/FIO • GRIDPP 10th Collaboration Meeting, June 2004

  2. Managing a large installation • ~2800 nodes in the CERN CC • Approaching 10,000 by 2008 • Frequent mass installs, moves, retirements • Daily failures of hardware • Heterogeneous H/W & S/W • Multiple functionality (batch, disk, tape, DB, web etc.) • Planning required • Data challenges, test-beds, capacity • Not easy to meet needs: • Find things • Know what’s happening • Maximize availability • Resource CC operations

  3. Fabric Management in a nutshell [diagram labels] • quattor • Automatic configuration • Automatic installation • LEAF • SMS • High-level control • Effectively Managed Fabric • HMS • Workflow tools • Visualization tools • Lemon • Managed hardware • Effective monitoring

  4. Extremely Large Fabric management system [figure: an equation of the form "= + +"; its images are not preserved]

  5. Quattor: configuration, installation and management [diagram labels] • CDB • pan • RDBMS • SQL • SOAP • HTTP • XML • GUI • CLI • Scripts • CCM • Cache • Node • Node Management Agents

  6. CDB interfaces…

  7. LEAF – LHC Era Automated Fabric • SMS: State Management System • Issue high level configuration commands • Nodes automatically take themselves into and out of production • Used during software interventions e.g. kernel upgrade for a cluster • Used during hardware interventions e.g. move a rack of machines • Validates state transitions • Keep history – who, when, why • Handles concurrent requests
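The SMS behaviour listed on this slide (validated state transitions, a who/when/why history, and safe handling of concurrent requests) can be illustrated with a minimal sketch. This is not the SMS implementation: the state names `production`, `standby` and `maintenance`, the transition table, the class `StateManager` and the node name `lxb0001` are all assumptions made for illustration.

```python
import threading
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical transition table: which target states each state may move to.
ALLOWED = {
    "production": {"standby"},
    "standby": {"production", "maintenance"},
    "maintenance": {"standby"},
}


@dataclass(frozen=True)
class HistoryEntry:
    """One audit record: who changed which node, when, and why."""
    node: str
    old: str
    new: str
    who: str
    when: datetime
    why: str


class StateManager:
    """Toy state manager: validates transitions and keeps a history."""

    def __init__(self):
        self._states = {}
        self._history = []
        # A single lock serializes concurrent requests (an assumed strategy).
        self._lock = threading.Lock()

    def add_node(self, node, state="standby"):
        if state not in ALLOWED:
            raise ValueError(f"unknown state: {state}")
        with self._lock:
            self._states[node] = state

    def set_state(self, node, new, who, why):
        with self._lock:
            old = self._states[node]
            # Reject any transition not listed in the table.
            if new not in ALLOWED.get(old, set()):
                raise ValueError(f"{node}: {old} -> {new} is not allowed")
            self._states[node] = new
            self._history.append(
                HistoryEntry(node, old, new, who, datetime.now(timezone.utc), why)
            )

    def state(self, node):
        with self._lock:
            return self._states[node]

    def history(self, node):
        with self._lock:
            return [e for e in self._history if e.node == node]
```

For example, a node added in `production` can be moved to `standby` for a kernel upgrade, while a direct request for `maintenance` from `production` is refused and leaves no history entry.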

  8. LEAF – LHC Era Automated Fabric • HMS: Hardware Management System • Result of process reengineering • Provides consistent, traceable workflows • Manages: • Installs • Moves • Renames • Retirements • Repairs • Implemented using Remedy • Web interface available • Allows visualization & searching for objects

  9. Use Case: Move rack of machines [diagram; box labels: Node, HMS, Sysadmins, Operations, CDB] • 1. Import • 2. Set to standby • 3. Update • 4. Refresh • 5. Take out of production • 6. Shutdown work order • 7. Request move • 8. Update SMS • 9. Update LAN DB • 10. Install work order • 11. Set to production • 12. Update • 13. Refresh • 14. Put into production
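The fourteen numbered steps of this use case can be modelled as a strictly ordered, traceable sequence. The sketch below is only an illustration: the step labels are taken from the slide as extracted (adjacent box labels such as HMS and CDB are left out because their attachment to a step is ambiguous), while the class `RackMove`, its methods, and the rack and user names are hypothetical.

```python
from dataclasses import dataclass, field

# Step labels as they appear on the slide, keyed by step number.
STEPS = {
    1: "Import",
    2: "Set to standby",
    3: "Update",
    4: "Refresh",
    5: "Take out of production",
    6: "Shutdown work order",
    7: "Request move",
    8: "Update SMS",
    9: "Update LAN DB",
    10: "Install work order",
    11: "Set to production",
    12: "Update",
    13: "Refresh",
    14: "Put into production",
}


@dataclass
class RackMove:
    """Hypothetical tracker that only accepts steps in numeric order."""
    rack: str
    done: list = field(default_factory=list)

    @property
    def finished(self):
        return len(self.done) == len(STEPS)

    def complete(self, number, who):
        if self.finished:
            raise RuntimeError("workflow already finished")
        expected = len(self.done) + 1
        if number != expected:
            raise RuntimeError(f"expected step {expected}, got step {number}")
        # Record the step together with who performed it, for traceability.
        self.done.append((number, STEPS[number], who))
```

Calling `complete(1, ...)` through `complete(14, ...)` in order finishes the move; skipping ahead, for instance requesting step 7 before the nodes are out of production, raises an error.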

  10. LEAF screenshots

  11. LEAF Status • HMS • In production since late 2002 (installs only) • Rapid evolution – 16 production releases last year • Used successfully to move & install 100’s machines • Fully integrated (LAN DB, CDB, SMS, other workflow apps) • SMS • First production release January (stable CDB) • Now for all quattor-managed nodes (>2000) • All batch and interactive nodes change state automatically

  12. LEAF Next Steps • Consolidate • Evolve smoother processes • Documentation • Populating data (warranties etc.) • Phase-out legacy components • Extend HMS to other equipment types, individual components • Extend SMS for more clusters, states (like shutdown) • Visualization tool to: • Get/set properties and states • Initialize workflows
