LCG Job Submission


Presentation Transcript


  1. CCLRC provides the GRIDPP collaboration (funded by PPARC) with a large computing facility at RAL based on commodity hardware and open-source software. The current resources (~1000 CPUs, ~200 TB of usable disk and 180 TB of tape) provide an integrated service as a prototype Tier-1 centre for the LHC project and a Tier-A centre for the BaBar experiment based at SLAC. The service is expected to grow substantially over the next few years.

  The concept of a Tier-1 centre predates the grid. The Monarc Project described a model of distributed computing for the LHC with a series of layers, or 'tiers': the Tier-0 at CERN holds all the raw data, each Tier-1 holds all the reconstructed data, and Tier-2s down to the individual institutes and physicists hold progressively smaller abstractions of the data. The model also works partly in reverse, with the Tier-1 and Tier-2 centres producing large amounts of simulated data that migrates from Tier-2 to Tier-1 to CERN.

  • The principal objective of the Tier-1/A is to assist GRIDPP in the deployment of experimental GRID technologies and in meeting its international commitments to various projects:
  • The Tier-1 service consists of about 500 CPUs and is accessible only via LCG GRID middleware. It provides a production-quality GRID service to a growing number of experiments such as ATLAS, CMS, LHCb, ALICE, ZEUS and BaBar.
  • The Tier-A service consists of a further 500 CPUs and provides a "classical" computing service to the BaBar experiment and a number of other smaller experiments.

  LCG Job Submission
  Job submission into the Tier-1/A can be carried out from any networked workstation with access to the LCG User Interface software. The LCG software provides the following tools (a sketch of a complete submission cycle and of a data transfer follows this slide):
  1. Authentication – grid-proxy-init
  2. Job submission – edg-job-submit
  3. Monitoring and control – edg-job-status, edg-job-cancel, edg-job-get-output
  4. Data publication and replication – globus-url-copy, RLS
  5. Resource scheduling and use of Mass Storage Systems – JDL, sandboxes, storage elements

  Cluster Infrastructure
  The Tier-1/A uses Red Hat Linux deployed on Pentium III and Pentium 4 rack-mounted PC hardware. Disk storage is provided by external SCSI/IDE RAID controllers and commodity IDE disks. An STK Powderhorn silo provides near-line tape access (see the Datastore poster).
  COMPUTING FARM: 15 racks holding 492 dual Pentium III/P4 CPUs.
  TAPE ROBOT: Upgraded last year; uses 200 GB STK 9940B tapes. Current capacity is 65 TB; the silo could hold 1 PB.
  DISK STORE: a 200 TB disk-based mass storage unit after RAID 5 overhead.
  PCs are clustered on network switches with up to 8 x 1000 Mb Ethernet.
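
  The following is a minimal sketch of the submit-monitor-retrieve cycle listed above, as run from a machine with the LCG User Interface installed. The JDL contents and file names (hello.jdl, jobid.txt) are purely illustrative, and the exact command options can differ between UI releases.

# Sketch only: assumes a valid grid certificate on the UI machine and
# membership of a VO known to the resource broker.
grid-proxy-init                        # 1. authenticate: create a short-lived proxy

cat > hello.jdl <<'EOF'                # 2. describe the job in JDL
Executable    = "/bin/hostname";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
EOF
edg-job-submit -o jobid.txt hello.jdl  # submit and record the job identifier

edg-job-status -i jobid.txt            # 3. poll the job state
edg-job-get-output -i jobid.txt        # fetch the output sandbox once the job is Done
# edg-job-cancel -i jobid.txt          # or cancel it instead, if required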

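  For step 4 above (data publication and replication), a single file can be copied to a storage element over GridFTP with globus-url-copy, again assuming a valid proxy. The storage element host and paths below are placeholders rather than real Tier-1/A names; registering the copy in the Replica Location Service (RLS) is then done with the replica-management clients shipped with the UI, whose command names vary by release.

# Sketch only: hypothetical host and paths; requires a valid grid proxy.
globus-url-copy file:///home/user/hello.out \
    gsiftp://se.example.ac.uk/storage/user/hello.out
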
  2. The service needs to be flexible and able to rapidly deploy staff and hardware to meet the changing needs of GRIDPP. Access is either by traditional login with manual job submission and editing, or via one of several deployed GRID gateways. There are five independent GRID testbeds available, ranging from a few standalone nodes for EDG work-package developers to ten or more systems deployed into the international EDG and LCG production testbeds, which take part in wide-area work. A wide range of work runs on the EDG testbed, including work from the Biomedical and Earth Observation work packages.

  Cluster Management Tools

  Real-time Performance Monitoring
  On a large, complex cluster, real-time monitoring of the system statistics of all components is needed to understand workflow and assist in problem resolution. We use Ganglia to provide detailed system metrics such as CPU load, memory utilisation and disk I/O rates. Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters and relies on a multicast-based listen/announce protocol to monitor state within clusters (a quick command-line check is sketched below).

  Near Real-time CPU Accounting
  CPU accounting data is vital to ensure that resource allocations are met and to monitor activity on the cluster. The Tier-1/A service takes accounting information from the OpenPBS system and loads it into a MySQL database for post-processing by locally written Perl scripts (a rough sketch of this flow also follows below).

  A number of GRID gateways have been installed into the production farm, where the bulk of the hardware is available. Some, such as the LCG gateways, are maintained by Tier-1/A staff; others, such as SAM-Grid, are managed by the user community in close collaboration with the Tier-1/A. Once through the gateway, jobs are scheduled within the main production facility.

  Helpdesk
  Request Tracker is used to provide a web-based helpdesk. Queries can be submitted either by email or by web before being queued into subject queues in the system and assigned ticket numbers. Specialists can then take ownership of a problem, and an escalation and notification system ensures that tickets do not get overlooked.

  Behind the gateways, the hardware is deployed into a number of logical hardware pools running three separate releases of Red Hat Linux, all controlled by the OpenPBS batch system and the MAUI scheduler. With so many systems, it is necessary to automate as much of the installation process as possible. We use LCFG on the GRID testbeds and a complex Kickstart infrastructure on the main production service.

  Behind the scenes are a large number of service systems and business processes, such as:
  · File servers
  · Security scanners and password crackers
  · RPM package monitoring and update services
  · Automation and system monitoring
  · Performance monitoring and accounting
  · System consoles
  · Change control
  · Helpdesk
  Gradually over the next year it is likely that some of the above will be replaced by GRID-based fabric management tools being developed by EDG.

  Automate
  The commercial Automate package, coupled to CERN's SURE monitoring system, is used to maintain a watch on the service. Alarms are raised in SURE if systems break or critical system parameters are found to be out of bounds. In the event of a mission-critical fault, SURE notifies Automate, which in turn makes a decision based on the time of day and the agreed service level agreement before escalating the problem – for example, out of office hours, by paging a Computer Operator.
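
  As a quick illustration of the Ganglia monitoring described above: gmond publishes its cluster state as XML on TCP port 8649, so a single metric can be checked from the command line without the web front end. The host name below is a placeholder, not a real farm node.

# Sketch only: node001.example.ac.uk stands in for a farm node running gmond.
nc node001.example.ac.uk 8649 \
  | grep -o 'METRIC NAME="load_one" VAL="[^"]*"'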

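  The CPU-accounting flow mentioned above can be sketched roughly as follows: extract the end-of-job ("E") records from a day's OpenPBS accounting log and bulk-load them into MySQL for post-processing. The log path, database name (pbsacct) and table layout (raw_jobs with columns end_time, jobid, attrs) are assumptions made for illustration; the production service itself uses locally written Perl scripts for this step.

# Sketch only: paths, database and table names are illustrative, and the
# MySQL server must allow LOCAL INFILE loads.
LOG=/usr/spool/PBS/server_priv/accounting/$(date +%Y%m%d)

# OpenPBS accounting records are semicolon-separated:
#   datetime;record_type;jobid;attribute list
awk -F';' '$2 == "E" { print $1 "\t" $3 "\t" $4 }' "$LOG" > jobs.tsv

mysql --local-infile=1 pbsacct \
  -e "LOAD DATA LOCAL INFILE 'jobs.tsv' INTO TABLE raw_jobs (end_time, jobid, attrs)"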