
Large scale data flow in local and GRID environment

V.Kolosov, I.Korolko, S.Makarychev

ITEP Moscow


Research objectives

Plans: Large scale data flow simulation in local and GRID environment.

Done:

  • Data flow optimization in a realistic DC environment (ALICE and LHCb MC production)

  • Simulation of intensive data flow during data analysis (CMS-like jobs)


ITEP LHC computer farm (1)

Main components

A. Selivanov (ITEP-ALICE), head of the ITEP-LHC farm

64 Pentium IV PC modules (01.01.2004)


[Farm diagram: 32 (LCG) + 32 (PBS) nodes, 100 Mbit/s and ~622 Mbit/s network links]

ITEP LHC computer farm (2)

BATCH nodes

CPU: 64 × P-IV 2.4 GHz (hyperthreading)

RAM: 1 GB

Disks: 80 GB

Mass storage

18 TB disk space on Gbit/s network

[Diagram label: CERN]



ITEP LHC FARM since 2005

ITEP view from GOC Accounting Services

All 4 LHC experiments are using the ITEP facilities permanently

Until now we have mainly been producing MC samples



ALICE and LHCb DC (2004)

ALICE

  • Determine readiness of the off-line framework for data processing

  • Validate the distributed computing model

  • 10% test of the final capacity

  • physics: hard probes (jets, heavy flavours) & pp physics

LHCb

  • Studies of high level triggers

  • S/B studies, consolidate background estimates, background properties

  • Robustness test of the LHCb software and production system

  • Test of the LHCb distributed computing model

Massive MC production: 100-200 M events in 3 months



ALICE and LHCb DC (2004)

ALICE - AliEn

  • 1 job – 1 event
  • Raw event size: 2 GB
  • ESD size: 0.5-50 MB
  • CPU time: 5-20 hours
  • RAM usage: huge
  • Store local copies
  • Backup sent to CERN

LHCb - DIRAC

  • 1 job – 500 events
  • Raw event size: ~1.3 MB
  • DST size: 0.3-0.5 MB
  • CPU time: 28-32 hours
  • RAM usage: moderate
  • Store local copies of DSTs
  • DSTs and LOGs sent to CERN

Massive data exchange with local disk servers

Frequent communication with central services
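To put these per-job figures in perspective, here is a back-of-envelope estimate (arithmetic on the numbers above only, not a figure from the original slides) of how much data a single job moves through the local disk servers:

```python
# Rough per-job I/O volume, derived only from the figures quoted above.
# Illustrative estimate, not a number from the original presentation.

# LHCb (DIRAC): 500 events/job, ~1.3 MB raw + 0.3-0.5 MB DST per event
lhcb_job_mb = 500 * (1.3 + 0.4)          # ~850 MB moved per job
# ALICE (AliEn): 1 event/job, 2 GB raw + up to 50 MB ESD
alice_job_gb = 2.0 + 0.05                # ~2 GB moved per job

print(f"LHCb job:  ~{lhcb_job_mb / 1024:.1f} GB")
print(f"ALICE job: ~{alice_job_gb:.1f} GB")
```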



Optimization

April – start massive LHCb DC

1 job/CPU – everything OK

use hyperthreading (2 jobs/CPU): increases efficiency by 30-40%

May – start massive ALICE DC

bad interference with LHCb jobs

frequent NFS crashes

restrict the ALICE queue to 10 simultaneous jobs (see the sketch at the end of this slide),

optimize communication with the disk server

June – September: smooth running

share resources: LHCb – June/July, ALICE – August/September

careful online monitoring of jobs (on top of usual monitoring from collaboration)
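The transcript does not say how the 10-job cap on the ALICE queue was enforced. On a Torque/PBS farm such a limit is typically set with the max_running queue attribute; a minimal sketch, with 'alice' as a hypothetical queue name:

```python
import subprocess

# Sketch only: cap a hypothetical 'alice' batch queue at 10 running jobs.
# max_running is the Torque/PBS queue attribute limiting how many jobs of
# that queue may execute simultaneously.
subprocess.run(["qmgr", "-c", "set queue alice max_running = 10"], check=True)
```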



Monitoring

Frequent power cuts in summer (4-5 times): ~5% loss

all intermediate steps are lost (…)

provide a reserve power line and a more powerful UPS

Stalled jobs: ~10% loss

infinite loops in GEANT4 (LHCb)

crashes of central services

write a simple check script and kill such jobs (bug report is not sent…); see the sketch below

Slow data transfer to CERN

poor and restricted link to CERN

problems with CASTOR

automatic retry
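The check script itself is not included in the transcript. Below is a minimal sketch of what such a stalled-job watchdog could look like on a batch node, assuming jobs are identified by a (hypothetical) executable name and treating a job whose accumulated CPU time does not advance between two samples as stalled, e.g. stuck in a GEANT4 infinite loop:

```python
#!/usr/bin/env python
"""Sketch of a stalled-job watchdog (not the original ITEP script).

Assumption: jobs are found by executable name (the name 'Gauss' below is
hypothetical); a job whose CPU time does not advance during CHECK_INTERVAL
seconds is treated as stalled and killed.
"""
import os
import signal
import time

JOB_NAME = "Gauss"       # hypothetical executable name of the MC job
CHECK_INTERVAL = 1800    # seconds between the two CPU-time samples

def cpu_ticks(pid):
    """Accumulated user+system CPU ticks of a process, from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()  # drop the 'pid (comm)' prefix
    return int(fields[11]) + int(fields[12])         # utime + stime

def job_pids(name):
    """PIDs of all processes whose command name matches `name`."""
    pids = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as f:
                    if f.read().strip() == name:
                        pids.append(int(entry))
            except OSError:
                pass     # process exited while scanning
    return pids

before = {}
for pid in job_pids(JOB_NAME):
    try:
        before[pid] = cpu_ticks(pid)
    except OSError:
        pass

time.sleep(CHECK_INTERVAL)

for pid, ticks in before.items():
    try:
        if cpu_ticks(pid) <= ticks:              # CPU counter did not move
            print(f"killing stalled job pid={pid}")
            os.kill(pid, signal.SIGKILL)
    except OSError:
        pass             # job already finished, nothing to do
```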



DC Summary

Quite visible participation in ALICE and LHCb DCs

ALICE → ~5% contribution (ITEP part ~70%)

LHCb → ~5% contribution (ITEP part ~70%)

With only 44 CPUs

Problems reported to colleagues in collaborations

Today MC production is a routine task running on LCG

(LCG efficiency is still rather low)



Data Analysis

Distributed analysis – very different pattern of work load

CMS

  • event size: 300 kB
  • CPU time: 0.25 kSI2k/event

LHCb

  • event size: 75 kB
  • CPU time: 0.3 kSI2k/event

Modern CPUs are ~1 kSI2k → 4 events/sec.

In 2 years from now: 2-3 kSI2k → 8-12 events/sec.

Data reading rate: ~3 MB/s

Many (up to 100) jobs running in parallel

Should we expect serious degradation of cluster performance during simultaneous data analysis by all LHC experiments?
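A rough way to see why this question matters, using only the numbers quoted on this slide (the 2.5 kSI2k value is the middle of the quoted 2-3 kSI2k range):

```python
# Aggregate read load during analysis, estimated from the figures above.
event_size_mb = 0.3      # CMS-like event size, 300 kB
cpu_per_event = 0.25     # kSI2k·s per event
cpu_power     = 2.5      # kSI2k, a CPU "in 2 years from now" (2-3 kSI2k)
parallel_jobs = 100      # "many (up to 100) jobs running in parallel"

events_per_s   = cpu_power / cpu_per_event       # ~10 events/s per job
job_read_mb_s  = events_per_s * event_size_mb    # ~3 MB/s per job
farm_read_mb_s = job_read_mb_s * parallel_jobs   # ~300 MB/s in total

print(f"per-job read rate:   ~{job_read_mb_s:.0f} MB/s")
print(f"farm aggregate rate: ~{farm_read_mb_s:.0f} MB/s")
# Far beyond what a single file server with ~70 MB/s disks can deliver
# (see the Summary slide), hence the question above.
```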



Simulation of data analysis

A CMS-like job analyses 1000 events in 100 seconds

DST files are stored on a single file server

Smoothly increase the number of parallel jobs, measuring the DST reading time

increase the number of allowed NFS daemons (default value: 8)
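The measuring tool is not shown in the transcript; a minimal sketch of such a test, assuming the DST files sit under an NFS-mounted directory (path and file pattern below are hypothetical) and approximating one analysis job by a sequential read of one file:

```python
#!/usr/bin/env python
"""Sketch of the parallel DST-reading test (not the original ITEP tool).

Assumptions: DST files live under an NFS mount (hypothetical path and
pattern below); one "analysis job" is approximated by reading one file
sequentially; we time how long N concurrent readers take as N grows.
"""
import glob
import time
from concurrent.futures import ThreadPoolExecutor

DST_DIR = "/nfs/dst"        # hypothetical NFS-mounted DST directory
CHUNK = 1024 * 1024         # read in 1 MB chunks

def read_file(path):
    """Read one DST file end to end, like a CMS-like analysis job would."""
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass

files = sorted(glob.glob(f"{DST_DIR}/*.dst"))

for n_jobs in (1, 2, 5, 10, 15, 20, 30):
    batch = files[:n_jobs]          # one file per simulated job
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(read_file, batch))
    elapsed = time.time() - start
    print(f"{n_jobs:3d} parallel readers: {elapsed:.1f} s")
```

On the server side, the knob mentioned on the slide is the number of NFS daemons (nfsd threads), raised above its default of 8 while repeating the test.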



Simulation of data analysis

10-15 simultaneous jobs getting data from a single file server run without significant degradation of performance

Further increasing the number of jobs is dangerous

Fully loading the cluster with analysis jobs decreases the efficiency of CPU usage by a factor of 2 (only 32 CPUs…)

[Plot: file server load]



Summary

To analyze LHC data (in 2 years from now) we have to improve our clusters considerably:

  • use faster disks for data storage (currently 70 MB/s)

  • use 10 Gbit network for file servers

  • distribute data over many file servers

  • optimize structure of cluster

