Large-Scale Data Flow Optimization in Local and GRID Environments for ALICE and LHCb

Large scale data flow in local and GRID environment
V. Kolosov, I. Korolko, S. Makarychev
ITEP Moscow

Research objectives
Plans:
• Large-scale data flow simulation in local and GRID environments
Done:
• Data flow optimization in a realistic Data Challenge (DC) environment: ALICE and LHCb MC production
• Simulation of intensive data flow during data analysis (CMS-like jobs)

ITEP LHC computer farm (1)
Main components:
• 64 Pentium IV PC modules (as of 01.01.2004)
• Head of the ITEP-LHC farm: A. Selivanov (ITEP-ALICE)

ITEP LHC computer farm (2)
Batch nodes: 32 (LCG) + 32 (PBS)
• CPU: 64 PIV 2.4 GHz (hyperthreading)
• RAM: 1 GB
• Disks: 80 GB
Mass storage:
• 18 TB disk space on a Gbit/s network
Link labels from the farm diagram: 100 Mbit/s, ~622 Mbit/s, CERN

ITEP LHC farm since 2005
ITEP view from GOC Accounting Services
• 4 LHC experiments use ITEP facilities on a permanent basis
• Until now we have mainly been producing MC samples

ALICE and LHCb DC (2004)
ALICE
• Determine readiness of the off-line framework for data processing
• Validate the distributed computing model
• 10% test of the final capacity
• Physics: hard probes (jets, heavy flavours) and pp physics
LHCb
• Studies of high-level triggers
• S/B studies, consolidation of background estimates, background properties
• Robustness test of the LHCb software and production system
• Test of the LHCb distributed computing model
Massive MC production: 100-200 M events in 3 months

ALICE and LHCb DC (2004)
ALICE (AliEn)
• 1 job: 1 event
• Raw event size: 2 GB
• ESD size: 0.5-50 MB
• CPU time: 5-20 hours
• RAM usage: huge
• Store local copies
• Backup sent to CERN
LHCb (DIRAC)
• 1 job: 500 events
• Raw event size: ~1.3 MB
• DST size: 0.3-0.5 MB
• CPU time: 28-32 hours
• RAM usage: moderate
• Store local copies of DSTs
• DSTs and LOGs sent to CERN
Massive data exchange with local disk servers
Frequent communication with central services

Optimization
April: start of massive LHCb DC
• 1 job/CPU: everything OK
• Using hyperthreading (2 jobs/CPU) increases efficiency by 30-40%
May: start of massive ALICE DC
• Bad interference with LHCb jobs, frequent NFS crashes
• Restrict the ALICE queue to 10 simultaneous jobs, optimize communication with the disk server
June-September: smooth running
• Share resources: LHCb in June-July, ALICE in August-September
• Careful online monitoring of jobs (on top of the usual monitoring from the collaborations)

Monitoring
Frequent power cuts in summer (4-5 times): -5%
• All intermediate steps are lost (…)
• Provide a reserve power line and a more powerful UPS
Stalled jobs: -10%
• Infinite loops in GEANT4 (LHCb)
• Crashes of central services
• Write a simple check script and kill such jobs (bug report is not sent…)
Slow data transfer to CERN
• Poor and restricted link to CERN
• Problems with CASTOR
• Automatic retry
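The check script itself is not shown in the slides. A minimal sketch of the idea, assuming the monitor takes two snapshots of each job (accumulated CPU time and output size) some minutes apart: a job whose output has not grown is flagged, and labelled "looping" if it is still burning CPU (the GEANT4 case) or "idle" if it is not (for example, waiting on a crashed central service). The `JobSample` type, the field names and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """One monitoring snapshot of a batch job (hypothetical structure)."""
    job_id: str
    cpu_seconds: float   # CPU time accumulated by the job so far
    output_bytes: int    # total size of the job's output and log files


def classify_stalled(previous, current, min_cpu_delta=1.0):
    """Return {job_id: "looping" | "idle"} for jobs with no output progress.

    previous, current: lists of JobSample taken at two successive checks.
    min_cpu_delta: CPU seconds below which a job counts as not computing
                   (an assumed threshold, to be tuned to the check interval).
    """
    prev_by_id = {sample.job_id: sample for sample in previous}
    stalled = {}
    for sample in current:
        before = prev_by_id.get(sample.job_id)
        if before is None:
            continue  # new job: nothing to compare against yet
        if sample.output_bytes > before.output_bytes:
            continue  # output grew, so the job is making progress
        cpu_delta = sample.cpu_seconds - before.cpu_seconds
        stalled[sample.job_id] = "looping" if cpu_delta >= min_cpu_delta else "idle"
    return stalled
```

Jobs returned by `classify_stalled` would then be candidates for killing. A real script would need a check interval long enough that healthy jobs always write something between two snapshots.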

DC Summary
Quite visible participation in the ALICE and LHCb DCs
• ALICE → ~5% contribution (ITEP part ~70%)
• LHCb → ~5% contribution (ITEP part ~70%)
• With only 44 CPUs
Problems reported to colleagues in the collaborations
Today MC production is a routine task running on LCG (LCG efficiency is still rather low)

Data Analysis
Distributed analysis: a very different pattern of work load
CMS
• Event size: 300 kB
• CPU time: 0.25 kSI2k·s/event
LHCb
• Event size: 75 kB
• CPU time: 0.3 kSI2k·s/event
Modern CPUs are ~1 kSI2k → 4 events/sec
In 2 years from now: 2-3 kSI2k → 8-12 events/sec
Data reading rate: ~3 MB/sec
Many (up to 100) jobs running in parallel
Should we expect serious degradation of cluster performance during simultaneous data analysis by all LHC experiments?
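The rates on this slide follow from simple arithmetic, sketched below. It assumes the quoted CPU cost is in kSI2k·s per event (which is what makes 1 kSI2k give 4 events/sec for CMS) and takes 1 MB = 1000 kB. The function names are illustrative only.

```python
def events_per_second(cpu_power_ksi2k, cost_ksi2k_s_per_event):
    """Event processing rate of one CPU of the given power."""
    return cpu_power_ksi2k / cost_ksi2k_s_per_event


def read_rate_mb_per_s(event_rate, event_size_kb):
    """Input data rate needed to sustain the given event rate (1 MB = 1000 kB)."""
    return event_rate * event_size_kb / 1000.0


# CMS numbers from the slide: 300 kB/event, 0.25 kSI2k*s/event
for power in (1.0, 2.0, 3.0):
    rate = events_per_second(power, 0.25)
    print(f"CMS, {power:.0f} kSI2k: {rate:.0f} events/s, "
          f"{read_rate_mb_per_s(rate, 300):.1f} MB/s")
```

For the 2-3 kSI2k CPUs this gives 8-12 events/sec and 2.4-3.6 MB/sec per job, which matches the quoted ~3 MB/sec. With the LHCb numbers (75 kB, 0.3 kSI2k·s/event) the same functions give a much lower per-job reading rate.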

Simulation of data analysis
• A CMS-like job analyses 1000 events in 100 seconds
• DST files are stored on a single file server
• Smoothly increase the number of parallel jobs, measuring the DST reading time
• Increase the number of allowed NFS daemons (default value: 8)
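The measurement procedure can be sketched as follows. This is not the authors' test harness: it is a minimal stand-in that uses threads in one process instead of batch jobs, and the helper names `timed_read` and `ramp_readers` are hypothetical. It reads a file with a growing number of concurrent readers and records the mean reading time per reader.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file in chunks; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def ramp_readers(path, job_counts):
    """For each job count, run that many concurrent readers of `path`.

    Returns {n_jobs: mean reading time in seconds}.
    """
    results = {}
    for n_jobs in job_counts:
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            timings = list(pool.map(timed_read, [path] * n_jobs))
        results[n_jobs] = sum(elapsed for _, elapsed in timings) / n_jobs
    return results
```

A call such as `ramp_readers(dst_path, [1, 2, 4, 8, 16, 32])`, with `dst_path` pointing at a file on the NFS-mounted server, would give a reading-time curve. In a realistic test each reader should open a different file from a separate node, so that client-side caching does not hide the server load.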

Simulation of data analysis
• 10-15 simultaneous jobs reading data from a single file server run without significant degradation of performance
• A further increase in the number of jobs is dangerous
• A full load of the cluster with analysis jobs decreases the efficiency of CPU usage by a factor of 2 (32 CPUs only…)
[Plot: file server load]

Summary
To analyze LHC data (in 2 years from now) we have to improve our clusters considerably:
• use faster disks for data storage (currently 70 MB/s)
• use a 10 Gbit network for file servers
• distribute data over many file servers
• optimize the structure of the cluster
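A rough back-of-envelope check of these recommendations, using only numbers quoted in the slides (up to 100 parallel jobs, ~3 MB/sec per job, 70 MB/s disks, Gbit/s network). It assumes ideal sequential throughput with no protocol or seek overhead, so real requirements would be higher. The function names are illustrative.

```python
import math


def aggregate_demand_mb_s(n_jobs, per_job_mb_s):
    """Total read rate requested by n_jobs parallel analysis jobs."""
    return n_jobs * per_job_mb_s


def gbit_to_mb_s(gbit_per_s):
    """Raw link capacity in MB/s (1 Gbit/s = 125 MB/s, overhead ignored)."""
    return gbit_per_s * 1000.0 / 8.0


def servers_needed(demand_mb_s, per_server_mb_s):
    """Minimum number of file servers if each sustains per_server_mb_s."""
    return math.ceil(demand_mb_s / per_server_mb_s)


demand = aggregate_demand_mb_s(100, 3.0)               # 300 MB/s
print("demand:", demand, "MB/s")
print("1 Gbit link:", gbit_to_mb_s(1), "MB/s")         # 125 MB/s
print("10 Gbit link:", gbit_to_mb_s(10), "MB/s")       # 1250 MB/s
print("servers at 70 MB/s:", servers_needed(demand, 70.0))  # 5
```

Under these assumptions a single 70 MB/s server on a 1 Gbit/s link cannot feed 100 such jobs, which is consistent with the call for faster disks, faster links and data spread over several servers.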
