

  1. Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL

  2. Background • Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government. • BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments. • The RHIC Computing Facility (RCF) was formed in the mid 90’s to address the computing needs of the RHIC experiments.

  3. Background (cont.) • BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN. • RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).

  4. Background (cont.) • The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF • RCF/ACF is transforming itself from a local resource into a national and global resource • Growing design and operational complexity • Increasing staffing levels to handle additional responsibilities

  5. RCF/ACF Structure

  6. Staff Growth at the RCF/ACF

  7. The Pre-Grid Era • Rack-mounted commodity hardware • Self-contained, localized resources • Resources available only to local users • Little interaction with external resources at remote locations • Considerable freedom to set own usage policies

  8. The (Near-Term) Future • Resources available globally • Distributed computing architecture • Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth • Constraints on freedom to set own policies

  9. How do we get there? • Change in management philosophy • Evolution in hardware requirements • Evolution in software packages • Different security protocol(s) • Change in access policy

  10. Change in Management Philosophy • Automated monitoring & management of servers in large clusters is a must (see the sketch below) • Remote power management, predictive hardware failure analysis and preventive maintenance are important • High availability based on a large number of identical servers, not on 24-hour support • Increasingly large clusters are only manageable if servers are identical → avoid specialized servers
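A minimal sketch of the kind of automated health check this philosophy calls for is given below. The node names, the ping-based probe and the alert action are illustrative assumptions, not the actual RCF/ACF tooling (which, as later slides note, is based on Ganglia).

#!/usr/bin/env python3
# Minimal automated farm health check (illustrative only; hypothetical node
# names, not the RCF/ACF tooling). Probes each server and flags unreachable
# ones, so operators do not have to watch thousands of identical nodes by hand.
import subprocess

NODES = ["node%04d" % i for i in range(1, 11)]  # hypothetical node names

def is_alive(host, timeout_s=2):
    """Return True if the host answers a single ICMP ping within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for node in NODES:
        if not is_alive(node):
            # A production system would open a ticket or trigger remote
            # power management here; printing stands in for that action.
            print("ALERT: %s is unreachable" % node)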

  11. Evolution in Hardware Requirements • Early acquisitions emphasized CPU power over local storage capacity • Increasing affordability of local disk storage has changed this philosophy • Hardware chosen by optimal combination of CPU power, storage capacity, server density and price • Buy from high-quality vendors to avoid labor-intensive maintenance issues
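As a toy illustration of balancing those four factors, candidate servers can be ranked by a simple figure of merit. The numbers and the weighting below are invented for illustration; this is not the RCF/ACF procurement formula.

# Toy ranking of candidate servers by CPU power, storage, density and price.
# All numbers and the weighting are invented; not the RCF/ACF procurement formula.

candidates = [
    # (name, relative CPU power, local disk in GB, rack units, price in USD)
    ("vendor A, 1U", 1.0, 250, 1, 2500),
    ("vendor B, 2U", 1.2, 1000, 2, 3800),
]

def figure_of_merit(cpu, disk_gb, rack_units, price_usd):
    """Higher is better: CPU plus storage delivered per dollar and per rack unit."""
    return (cpu + disk_gb / 1000.0) / (price_usd * rack_units)

for name, cpu, disk_gb, rack_units, price in candidates:
    print("%-14s score = %.2e" % (name, figure_of_merit(cpu, disk_gb, rack_units, price)))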

  12. The Growth of the Linux Farm

  13. Drop in Server Price as a Function of Performance

  14. Drop in Cost of Local Storage

  15. Total Distributed Storage Capacity

  16. Growth of Storage Capacity per Server

  17. Server Reliability

  18. The Factors Driving the Evolution of Software Packages • Cost • Farm size / scalability • Security • External influences / wide acceptance

  19. Cost • Red Hat Linux → Scientific Linux • LSF → Condor

  20. Farm Size / Scalability • Home-built batch system for data reconstruction → Condor-based batch system • Home-built monitoring system → Ganglia
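As a rough illustration of what the move to Condor means for job submission, the sketch below writes a minimal vanilla-universe submit description and hands it to condor_submit. The executable and file names are made up; this is not the experiments' actual reconstruction workflow, and it assumes condor_submit is installed and on the PATH.

#!/usr/bin/env python3
# Sketch of submitting a job to a Condor pool (illustrative names; assumes
# condor_submit is installed and on the PATH).
import subprocess

SUBMIT_FILE = "reco.sub"

# A minimal vanilla-universe submit description.
submit_description = """\
universe   = vanilla
executable = reco.sh
arguments  = run1234
output     = reco.out
error      = reco.err
log        = reco.log
queue
"""

with open(SUBMIT_FILE, "w") as f:
    f.write(submit_description)

# condor_submit parses the description and places the job in the local queue.
subprocess.run(["condor_submit", SUBMIT_FILE], check=True)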

  21. Security • Started with NIS/telnet in the 90’s • Cyber-security threats prompted the installation of firewalls and gatekeepers and the migration to ssh → stricter security standards than in the past • Ongoing change to Kerberos 5 and phase-out of NIS passwords • Testing GSI → limited support for GSI

  22. Security Changes (cont.) • Authorization & authentication controlled by the local site (NIS and Kerberos) • Migration to GSI requires a central CA and regional VOs for authentication → the local site performs final authentication before granting access • Accept certificates from multiple CAs? • Difficult transition from complete to partial control over security issues
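The final local authorization step can be pictured as a grid-mapfile-style lookup: a certificate subject (DN) issued by a trusted CA is mapped onto a local account before access is granted. The DNs and account names below are made up for illustration; this is a sketch of the idea, not the RCF/ACF configuration.

# Sketch of grid-mapfile-style authorization: an authenticated certificate
# subject (DN) is mapped onto a local account, or rejected if unknown.
# The DNs and account names are made up for illustration.

GRID_MAP = {
    "/DC=org/DC=example/OU=People/CN=Jane Doe": "atlas001",
    "/DC=org/DC=example/OU=People/CN=John Roe": "rhic042",
}

def authorize(subject_dn):
    """Return the local account for a known DN, or None to deny access."""
    return GRID_MAP.get(subject_dn)

if __name__ == "__main__":
    dn = "/DC=org/DC=example/OU=People/CN=Jane Doe"
    account = authorize(dn)
    if account is None:
        print("access denied for %s" % dn)
    else:
        print("%s mapped to local account %s" % (dn, account))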

  23. External Influences / Wide Acceptance • Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission • HRM / dCache – used by other labs • Condor – widely used by the ATLAS community

  24. Software Evolution – Summary

  25. Ganglia at the RCF/ACF

  26. Condor at the RCF/ACF

  27. Summary • RCF/ACF is going through a transition from a local facility to a regional (global) facility → many changes • A Linux Farm built with commodity hardware is increasingly affordable and reliable • Distributed storage is also increasingly affordable → management software issues remain

  28. Summary (cont.) • Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices • The transition brings security and access issues • Migration will take longer and be more difficult than generally expected → changes in hardware and software need to be complemented by a change in management philosophy
