

  1. Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL

  2. Background • Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government. • BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments. • The RHIC Computing Facility (RCF) was formed in the mid 90’s to address the computing needs of the RHIC experiments.

  3. Background (cont.) • BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN. • RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc).

  4. Background (cont.) • The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF • RCF/ACF is transforming itself from a local resource into a national and global resource • Growing design and operational complexity • Increasing staffing levels to handle additional responsibilities

  5. RCF/ACF Structure

  6. Staff Growth at the RCF/ACF

  7. The Pre-Grid Era • Rack-mounted commodity hardware • Self-contained, localized resources • Resources available only to local users • Little interaction with external resources at remote locations • Considerable freedom to set own usage policies

  8. The (Near-Term) Future • Resources available globally • Distributed computing architecture • Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth • Constraints on freedom to set own policies

  9. How do we get there? • Change in management philosophy • Evolution in hardware requirements • Evolution in software packages • Different security protocol(s) • Change in access policy

  10. Change in Management Philosophy • Automated monitoring & management of servers in large clusters is a must (see the sketch below) • Remote power management, predictive hardware failure analysis and preventive maintenance are important • High availability based on a large number of identical servers, not on 24-hour support • Increasingly large clusters are only manageable if servers are identical → avoid specialized servers
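A minimal sketch of the kind of automated health check this philosophy calls for is given below. The node names, the ping-based probe and the alert action are illustrative assumptions, not the actual RCF/ACF tooling (which, as later slides note, is based on Ganglia).

#!/usr/bin/env python3
# Minimal automated farm health check (illustrative only; hypothetical node
# names, not the RCF/ACF tooling). Probes each server and flags unreachable
# ones, so operators do not have to watch thousands of identical nodes by hand.
import subprocess

NODES = ["node%04d" % i for i in range(1, 11)]  # hypothetical node names

def is_alive(host, timeout_s=2):
    """Return True if the host answers a single ICMP ping within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for node in NODES:
        if not is_alive(node):
            # A production system would open a ticket or trigger remote
            # power management here; printing stands in for that action.
            print("ALERT: %s is unreachable" % node)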

  11. Evolution in Hardware Requirements • Early acquisitions emphasized CPU power over local storage capacity • Increasing affordability of local disk storage has changed this philosophy • Hardware chosen by optimal combination of CPU power, storage capacity, server density and price • Buy from high-quality vendors to avoid labor-intensive maintenance issues
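As a toy illustration of balancing those four factors, candidate servers can be ranked by a simple figure of merit. The numbers and the weighting below are invented for illustration; this is not the RCF/ACF procurement formula.

# Toy ranking of candidate servers by CPU power, storage, density and price.
# All numbers and the weighting are invented; not the RCF/ACF procurement formula.

candidates = [
    # (name, relative CPU power, local disk in GB, rack units, price in USD)
    ("vendor A, 1U", 1.0, 250, 1, 2500),
    ("vendor B, 2U", 1.2, 1000, 2, 3800),
]

def figure_of_merit(cpu, disk_gb, rack_units, price_usd):
    """Higher is better: CPU plus storage delivered per dollar and per rack unit."""
    return (cpu + disk_gb / 1000.0) / (price_usd * rack_units)

for name, cpu, disk_gb, rack_units, price in candidates:
    print("%-14s score = %.2e" % (name, figure_of_merit(cpu, disk_gb, rack_units, price)))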

  12. The Growth of the Linux Farm

  13. Drop in Server Price as a Function of Performance

  14. Drop in Cost of Local Storage

  15. Total Distributed Storage Capacity

  16. Growth of Storage Capacity per Server

  17. Server Reliability

  18. The Factors Driving the Evolution of Software Packages • Cost • Farm size / scalability • Security • External influences / wide acceptance

  19. Cost • Red Hat Linux → Scientific Linux • LSF → Condor

  20. Farm Size / Scalability • Home-built batch system for data reconstruction → Condor-based batch system • Home-built monitoring system → Ganglia
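As a rough illustration of what the move to Condor means for job submission, the sketch below writes a minimal vanilla-universe submit description and hands it to condor_submit. The executable and file names are made up; this is not the experiments' actual reconstruction workflow, and it assumes condor_submit is installed and on the PATH.

#!/usr/bin/env python3
# Sketch of submitting a job to a Condor pool (illustrative names; assumes
# condor_submit is installed and on the PATH).
import subprocess

SUBMIT_FILE = "reco.sub"

# A minimal vanilla-universe submit description.
submit_description = """\
universe   = vanilla
executable = reco.sh
arguments  = run1234
output     = reco.out
error      = reco.err
log        = reco.log
queue
"""

with open(SUBMIT_FILE, "w") as f:
    f.write(submit_description)

# condor_submit parses the description and places the job in the local queue.
subprocess.run(["condor_submit", SUBMIT_FILE], check=True)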

  21. Security • Started with NIS/telnet in the 90’s • Cyber-security threats prompted the installation of firewalls and gatekeepers and the migration to ssh → stricter security standards than in the past • Ongoing change to Kerberos 5 and phase-out of NIS passwords • Testing GSI → limited support for GSI

  22. Security Changes (cont.) • Authorization & authentication controlled by the local site (NIS and Kerberos) • Migration to GSI requires a central CA and regional VOs for authentication → the local site performs final authentication before granting access • Accept certificates from multiple CAs? • Difficult transition from complete to partial control over security issues
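The final local authorization step can be pictured as a grid-mapfile-style lookup: a certificate subject (DN) issued by a trusted CA is mapped onto a local account before access is granted. The DNs and account names below are made up for illustration; this is a sketch of the idea, not the RCF/ACF configuration.

# Sketch of grid-mapfile-style authorization: an authenticated certificate
# subject (DN) is mapped onto a local account, or rejected if unknown.
# The DNs and account names are made up for illustration.

GRID_MAP = {
    "/DC=org/DC=example/OU=People/CN=Jane Doe": "atlas001",
    "/DC=org/DC=example/OU=People/CN=John Roe": "rhic042",
}

def authorize(subject_dn):
    """Return the local account for a known DN, or None to deny access."""
    return GRID_MAP.get(subject_dn)

if __name__ == "__main__":
    dn = "/DC=org/DC=example/OU=People/CN=Jane Doe"
    account = authorize(dn)
    if account is None:
        print("access denied for %s" % dn)
    else:
        print("%s mapped to local account %s" % (dn, account))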

  23. External Influences / Wide Acceptance • Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission • HRM / dCache – used by other labs • Condor – widely used by the ATLAS community

  24. Software Evolution – Summary

  25. Ganglia at the RCF/ACF

  26. Condor at the RCF/ACF

  27. Summary • RCF/ACF is going through a transition from a local facility to a regional (global) facility → many changes • A Linux Farm built with commodity hardware is increasingly affordable and reliable • Distributed storage is also increasingly affordable → management software issues remain

  28. Summary (cont.) • Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices • The transition brings security and access issues • Migration will take longer and be more difficult than generally expected → changes in hardware and software need to be complemented by a change in management philosophy
