LHCb on-line / off-line computing. INFN CSN1 Lecce, 24.9.2003. Domenico Galli, Bologna.


LHCb on-line / off-line computing


Lecce, 24.9.2003

Domenico Galli, Bologna

Off-line computing

  • We plan to keep LHCb-Italy off-line computing resources as centralized as possible.

    • Put as much computing power as possible in the CNAF Tier-1.

      • To minimize system administration manpower.

      • To optimize resource exploitation.

    • “Distributed” for us means distributed among CNAF and other European Regional Centres.

    • Potential drawback: strong dependence on CNAF resource sharing.

  • The improvement to be gained by setting up Tier-3s in major INFN sites for parallel n-tuple analysis should be evaluated later.

LHCb on-line / off-line computing.2

D. Galli

2003 Activities

  • In 2003 LHCb-Italy contributed to DC03 (production of MC samples for TDR).

  • 47 Mevt / 60 d.

    • 32 Mevt minimum bias;

    • 10 Mevt inclusive b;

    • 50 signal samples, whose size is 50 to 100 kevt.

  • 18 computing centres involved.

    • Italian contribution: 11.5% (should be 15%).

2003 Activities (II)

  • The Italian contribution to DC03 has been obtained using limited resources (40 kSI2000, i.e. 100 1 GHz PIII CPUs).

  • Larger contributions (Karlsruhe, D; Imperial College, UK) come from the huge, dynamically allocated resources of these centres.

  • DIRAC, the LHCb distributed MC production system, has been used to run 36600 jobs; 85% of them ran outside CERN, with 92% mean efficiency.

2003 Activities (III)

  • DC03 has also been used to validate the LHCb distributed analysis model.

    • Distribution to Tier-1 centres of signal and background MC samples stored at CERN during production.

    • Samples have been pre-reduced based on kinematical or trigger criteria.

    • Selection algorithms for specific decay channels (~30) have been executed.

    • Events have been classified by means of tagging algorithms.

  • LHCb-Italy contributed to the implementation of selection algorithms for B decays into 2 charged pions/kaons.

2003 Activities (IV)

  • To analyse high-statistics data samples, the PVFS distributed file system has been used.

  • 110 MB/s aggregate I/O using 100Base-T Ethernet connections (to be compared with 50 MB/s of a typical 1000Base-T NAS).
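
As a rough cross-check of the aggregate figure, a single 100Base-T link carries at most 100 Mb/s, i.e. 12.5 MB/s, so 110 MB/s can only be reached by summing several I/O links. The sketch below computes the minimum number of links under that assumption, ignoring protocol overhead; the function name is illustrative and the actual number of PVFS I/O nodes is not given in the slides.

```python
import math

def min_links(aggregate_mb_per_s, link_mbit_per_s):
    """Smallest number of links whose raw capacity covers the aggregate rate."""
    link_mb_per_s = link_mbit_per_s / 8.0   # Mb/s -> MB/s
    return math.ceil(aggregate_mb_per_s / link_mb_per_s)

# 110 MB/s aggregate over 100 Mb/s links
print(min_links(110, 100))
```

With protocol overhead taken into account the real number of links would be somewhat larger.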

2003 Activities (V)

  • Analysis work by LHCb-Italy has been included in the “Reoptimized Detector Design and Performance” TDR (two-hadron channel + tagging).

  • 3 LHCb internal notes have been written:

    • CERN-LHCb/2003-123: Bologna group, “Selection of B/Bs → h+h- decays at LHCb”;

    • CERN-LHCb/2003-124: Bologna group, “CP sensitivity with B/Bs → h+h- decays at LHCb”;

    • CERN-LHCb/2003-115: Milano group, “LHCb flavour tagging performance”.

Software Roadmap

DC04 (April-June 2004) – Physics Goals

  • Demonstrate performance of HLTs (needed for computing TDR)

    • Large minimum bias sample + signal

  • Improve B/S estimates of optimisation TDR

    • Large bb sample + signal

  • Physics improvements to generators

DC04 – Computing Goals

  • Main goal: gather information to be used for writing LHCb computing TDR

    • Robustness test of the LHCb software and production system

      • Using software as realistic as possible in terms of performance

    • Test of the LHCb distributed computing model

      • Including distributed analyses

    • Incorporation of the LCG application area software into the LHCb production environment

    • Use of LCG resources as a substantial fraction of the production capacity

DC04 – Production Scenario

  • Generate (Gauss, “SIM” output):

    • 150 Million events minimum bias

    • 50 Million events inclusive b decays

    • 20 Million exclusive b decays in the channels of interest

  • Digitize (Boole, “DIGI” output):

    • All events, apply L0+L1 trigger decision

  • Reconstruct (Brunel, “DST” output):

    • Minimum bias and inclusive b decays passing L0 and L1 trigger

    • Entire exclusive b-decay sample

  • Store:

    • SIM+DIGI+DST of all reconstructed events

Goal: Robustness Test of the LHCb Software and Production System

  • First use of the simulation program Gauss based on Geant4

  • Introduction of the new digitisation program, Boole

    • With HLTEvent as output

  • Robustness of the reconstruction program, Brunel

    • Including any new tuning or other available improvements

    • Not including mis-alignment/calibration

  • Pre-selection of events based on physics criteria (DaVinci)

    • AKA “stripping”

    • Performed by production system after the reconstruction

    • Producing multiple DST output streams

  • Further development of production tools (Dirac etc.)

    • e.g. integration of stripping

    • e.g. Book-keeping improvements

    • e.g. Monitoring improvements

Goal: Test of the LHCb Computing Model

  • Distributed data production

    • As in 2003, will be run on all available production sites

      • Including LCG1

      • Controlled by the production manager at CERN

      • In close collaboration with the LHCb production site managers

  • Distributed data sets

    • CERN:

      • Complete DST (copied from production centres)

      • Master copies of pre-selections (stripped DST)

    • Tier1:

      • Complete replica of pre-selections

      • Master copy of DST produced at associated sites

      • Master (unique!) copy of SIM+DIGI produced at associated sites

  • Distributed analysis

Goal: Incorporation of the LCG Software

  • Gaudi will be updated to:

    • Use POOL (persistency hybrid implementation) mechanism

    • Use certain SEAL (general framework services) services

      • e.g. Plug-in manager

  • All the applications will use the new Gaudi

    • Should be ~transparent but must be commissioned

  • N.B.:

    • POOL provides existing functionality of ROOT I/O

      • And more: e.g. location independent event collections

    • But incompatible with existing TDR data

      • May need to convert it if we want just one data format

Needed Resources for DC04

  • CPU requirement is 10 times what was needed for DC03

  • Current resource estimates indicate DC04 will last 3 months

    • Assumes that Gauss is twice as slow as SICBMC

    • Currently planned for April-June

  • GOAL: use of LCG resources as a substantial fraction of the production capacity

    • We can hope for up to 50%

  • Storage requirement:

    • 6 TB at CERN for complete DST

    • 19 TB distributed among Tier-1s for locally produced SIM+DIGI+DST

    • up to 1 TB per Tier-1 for pre-selected DSTs
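
One way to see where a factor of about 10 in CPU could come from is to combine the event counts quoted for the two data challenges with the assumed Gauss slow-down. This is only a back-of-the-envelope sketch: the slides do not spell out the actual estimate, and it ignores digitisation and reconstruction costs.

```python
# Event counts quoted in the slides (millions of events).
DC03_EVENTS = 47
DC04_EVENTS = 150 + 50 + 20     # minimum bias + inclusive b + exclusive b
GAUSS_SLOWDOWN = 2              # Gauss assumed twice as slow as SICBMC

def cpu_ratio(new_events, old_events, slowdown):
    """CPU need of the new production relative to the old one (simulation only)."""
    return new_events / old_events * slowdown

ratio = cpu_ratio(DC04_EVENTS, DC03_EVENTS, GAUSS_SLOWDOWN)
print(f"DC04 / DC03 CPU ratio: {ratio:.1f}")
```

The result is a little under 10, consistent with the order of magnitude stated above.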

Resources Request to Bologna Tier-1 for DC04

  • CPU power: 200 kSI2000 (500 1 GHz PIII CPUs).

  • Disk: 5 TB

  • Tape: 5 TB

Tier-1 Growth in the Next Years

Online Computing

  • LHCb-Italy has been involved in the online group to design the L1/HLT trigger farm.

    • Sezione di Bologna

      • G. Avoni, A. Carbone, D. Galli, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni

    • Sezione di Milano

      • T. Bellunato, L. Carbone, P. Dini

    • Sezione di Ferrara

      • A. Gianoli

Online Computing (II)

  • Lots of changes since the Online TDR

    • abandoned Network Processors

    • included Level-1 DAQ

    • have now Ethernet from the readout boards

    • destination assignment by TFC (Timing and Fast Control)

  • Main ideas the same

    • large gigabit Ethernet Local Area Network to connect detector sources to CPU destinations

    • simple (push) protocol, no event-manager

    • commodity components wherever possible

    • everything controlled, configured and monitored by ECS (Experimental Control System)

DAQ Architecture

[Figure: DAQ architecture diagram. Blocks: Front-end Electronics, Readout Network, Switch, ~1800 CPUs. Legend: Gb Ethernet, Level-1 Traffic, HLT Traffic, Mixed Traffic. Labels, in the order they appear: 4 kHz; 1.6 GB/s; 44 kHz; 5.5-11.0 GB/s; 29 Switches; 62-87 Switches; 64-137 Links; 88 kHz; 32 Links; 94-175 Links; 7.1-12.6 GB/s.]

Following the Data-Flow

[Figure: DAQ diagram used to follow the data flow. Blocks: Front-end Electronics, Readout Network, Switch, ~1800 CPUs. Legend: Level-1 Traffic, Gb Ethernet, HLT Traffic, Mixed Traffic. Labels: 94 Links; 7.1 GB/s.]

Design Studies

  • Items under study:

    • Physical farm implementation (choice of cases, cooling, etc.)

    • Farm management (bootstrap procedure, monitoring)

    • Subfarm Controllers (event-builders, load-balancing queue)

    • Ethernet Switches

    • Integration with TFC and ECS

    • System Simulation

  • LHCb-Italy is involved in Farm management, Subfarm Controllers and their communication with Subfarm Nodes.

Tests in Bologna

  • To begin the activity in Bologna we started from scratch (August 2003), transferring data through 1000Base-T (gigabit Ethernet on copper cables) from PC to PC and measuring the performance.

  • We plan to use an unreliable protocol (RAW Ethernet, RAW IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged datagrams) introduce unpredictable latency. Therefore, together with throughput and latency, we also need to benchmark data loss.
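
A loss benchmark over an unreliable protocol can be sketched by numbering the datagrams and counting the sequence numbers that never arrive. The example below is a minimal illustration over the loopback interface using UDP; it is not the benchmark program used for the measurements, and the function name, datagram count and timeout are arbitrary choices.

```python
import socket
import struct
import threading

def measure_udp_loss(n_datagrams=200, payload_size=4096, timeout=0.5):
    """Send numbered UDP datagrams over loopback; return (sent, received)."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    rx.settimeout(timeout)
    addr = rx.getsockname()
    seen = set()

    def receiver():
        try:
            while len(seen) < n_datagrams:
                data, _ = rx.recvfrom(65535)
                (seq,) = struct.unpack("!I", data[:4])   # 4-byte sequence number
                seen.add(seq)
        except socket.timeout:
            pass  # nothing arrived for `timeout` seconds: assume the stream ended

    thread = threading.Thread(target=receiver)
    thread.start()
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    padding = bytes(payload_size - 4)
    for seq in range(n_datagrams):
        tx.sendto(struct.pack("!I", seq) + padding, addr)
    thread.join()
    tx.close()
    rx.close()
    return n_datagrams, len(seen)

sent, received = measure_udp_loss()
print(f"sent={sent} received={received} lost={sent - received}")
```

Because nothing is retransmitted, any datagram dropped in the sender, the network or the receiver simply shows up as a missing sequence number.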

Tests in Bologna (II) – Previous Results

  • In the IEEE 802.3 standard specifications, for 100 m long cat5e cables, the BER (Bit Error Rate) is said to be < 10^-10.

  • Previous measurements, performed by A. Barczyc, B. Jost and N. Neufeld using Network Processors (not real PCs) and 100 m long cat5e cables, showed a BER < 10^-14.

  • Recent measurements (presented by A. Barczyc in Zürich, 18.09.2003), performed using PCs, gave a frame drop rate of O(10^-6).

  • ⇒ A lot of data (too much for L1!) gets lost inside the kernel network stack implementation in PCs.

Tests in Bologna (III)

  • Transferring data on 1000Base-T Ethernet is not as trivial as it was for 100Base-TX Ethernet.

    • A new bus (PCI-X) and new chipsets (e.g. Intel E7501, 875P) have been designed to support gigabit NIC data flow (the PCI bus and old chipsets do not have enough bandwidth to support a gigabit NIC at gigabit rate).

    • The Linux kernel implementation of the network stack has been rewritten 2 times since kernel 2.4 to support gigabit data flow (networking code is 20% of the kernel source). The last modification implies a change of the kernel-to-driver interface (network drivers must be rewritten).

    • The standard Linux RedHat 9A setup uses the back-compatibility path and loses packets.

    • Not many people are interested in achieving very low packet loss (except for video streaming).

    • A DATATAG group is also working on packet losses (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J.P. Martin-Flatin, “A map of the networking code in Linux Kernel 2.4.20”, draft 8, 18 August 2003).

Tests in Bologna – Results Summary

  • Throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern.

  • We have understood, first (at least) within the LHCb collaboration, how to send IP datagrams at gigabit/second rate from Linux to Linux on 1000Base-T Ethernet without datagram loss (4 datagrams lost / 2.0x10^10 datagrams sent).

  • This required:

    • using the appropriate software:

      • NAPI kernel (≥ 2.4.20).

      • NAPI-enabled drivers (for the Intel e1000 driver, recompilation with a special flag set was needed).

    • tuning of kernel parameters (buffer & queue length).

    • 1000Base-T flow control enabled on the NIC.
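
To put the measured 957 Mb/s in context, the sketch below estimates the maximum UDP payload rate on a 1 Gb/s link from standard framing overheads: 8 bytes of preamble, 14 of Ethernet header, 4 of FCS, a 12-byte inter-frame gap, a 20-byte IPv4 header per fragment and an 8-byte UDP header per datagram. It assumes no VLAN tags and no IP options; the function name and the 4096-byte datagram size (the one quoted later for the best results) are choices made for the example.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, header, FCS, inter-frame gap (bytes)
IP_HEADER = 20                   # IPv4 header without options
UDP_HEADER = 8

def udp_payload_rate(payload, mtu=1500, line_rate_mbps=1000.0):
    """Upper bound on UDP payload throughput (Mb/s) for datagrams of `payload` bytes."""
    ip_payload = payload + UDP_HEADER
    per_fragment = (mtu - IP_HEADER) // 8 * 8    # fragment data is a multiple of 8 bytes
    fragments = -(-ip_payload // per_fragment)   # ceiling division
    wire_bytes = ip_payload + fragments * (IP_HEADER + ETH_OVERHEAD)
    return line_rate_mbps * payload / wire_bytes

print(round(udp_payload_rate(4096), 1))             # standard 1500-byte MTU
print(round(udp_payload_rate(4096, mtu=9000), 1))   # jumbo frames
```

Under these assumptions the bound for a 1500-byte MTU comes out between 957 and 958 Mb/s, which suggests the measured rate is essentially the line rate. With a 9000-byte MTU the same datagram fits in one frame and the bound is higher.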

Test-bed 0

  • 2 x PC with 3 x 1000Base-T interfaces each

    • Motherboard: SuperMicro X5DPL-iGM

      • Dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM

      • Chipset Intel E7501

      • 400/533 MHz FSB (front side bus)

      • Bus Controller Hub Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz)

      • Ethernet controller Intel 82545EM: 1 x 1000Base-T interface (supports Jumbo Frames)

    • Plugged-in PCI-X Ethernet Card: Intel Pro/1000 MT Dual Port Server Adapter

      • Ethernet controller Intel 82546EB: 2 x 1000Base-T interfaces (supports Jumbo Frames)

  • 1000Base-T 8 ports switch: HP ProCurve 6108

    • 16 Gbps backplane: non-blocking architecture

    • latency: < 12.5 µs (LIFO 64-byte packets)

    • throughput: 11.9 million pps (64-byte packets)

    • switching capacity: 16 Gbps

  • Cat. 6e cables

    • max 500 MHz (cf. 125 MHz for 1000Base-T)

    Test-bed 0 (II)

    [Figure: test-bed network diagram; surviving label: 1000Base-T switch.]

    • echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter

      • so that only one interface receives the packets belonging to a given network (131.154.10, 10.10.0 and 10.10.1).

    Test-bed 0 (III)

    SuperMicro X5DPL-iGM Motherboard (Chipset Intel E7501)

    • Chipset internal bandwidth is guaranteed

      • 6.4 Gb/s min

    Benchmark Software

    • We used 2 benchmark programs:

      • Netperf 2.2p14 UDP_STREAM

      • Self-made basic sender & receiver programs using UDP & RAW IP

    • We discovered a bug in netperf on the Linux platform:

      • On Linux, setsockopt(SO_SNDBUF) and setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while getsockopt(SO_SNDBUF) and getsockopt(SO_RCVBUF) return the actual buffer size. Netperf uses the same variable for both system calls, so when it iterates to achieve the requested precision in the results, it doubles the buffer size at each iteration.
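
The effect described above can be reproduced with a few lines of Python. This is an illustration of the pattern (one variable reused for both calls), not netperf's code; the function name and the starting size are arbitrary. On Linux the reported size is expected to grow at each iteration until the system-wide limit is reached, while other platforms may simply return the requested value.

```python
import socket

def iterate_buffer_size(requested=16384, iterations=4):
    """Reuse one variable for setsockopt and getsockopt and record its value."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    size = requested
    history = []
    try:
        for _ in range(iterations):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
            # The same variable is overwritten with what the kernel reports back.
            size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
            history.append(size)
    finally:
        sock.close()
    return history

print(iterate_buffer_size())
```

Keeping the requested size in a separate variable from the reported size avoids the drift.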

    Benchmark Environment

    • Kernel 2.4.20-18.9smp

    • GigaEthernet driver: e1000

      • version 5.0.43-k1 (RedHat 9A)

      • version 5.2.16 recompiled with NAPI flag enabled

    • System disconnected from public network

    • Runlevel 3 (X11 stopped)

    • Daemons stopped (crond, atd, sendmail, etc.)

    • Flow control on (on both NICs and switch)

    • Number of descriptors allocated by the driver rings: 256, 4096

    • IP send buffer size: 524288 (x2) Bytes

    • IP receive buffer size: 524288 (x2), 1048576 (x2) Bytes

    • Tx queue length: 100, 1600

    First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning.

    • First benchmark results about datagram loss showed big fluctuations which, in principle, can be due to packet queue resets, other CPU processes, interrupts, soft_irqs, broadcast network traffic, etc.

    • The resulting distribution is multi-modal.

    • Mean loss: 1 datagram lost every 20000 datagrams sent. Too much for LHCb L1!

    First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning (II)

    • We think that the peak behaviour is due to kernel queue resets (all queued packets are silently dropped when the queue is full).

    Changes in Linux Network Stack Implementation

    • 2.1 → 2.2: netlink, bottom halves, HFC (hardware flow control)

      • As little computation as possible while in interrupt context (interrupts disabled).

      • Part of the processing is deferred from the interrupt handler to bottom halves, to be executed at a later time (with interrupts enabled).

      • HFC (to prevent interrupt livelock): when the backlog queue is completely filled, interrupts are disabled until the backlog queue is emptied.

      • Bottom-half execution is strictly serialized among CPUs; only one packet at a time can enter the system.

    • 2.3.43 → 2.4: softnet, softirq

      • softirqs are software threads that replace bottom halves.

      • possible parallelism on SMP machines

    • 2.5.53 → 2.4.20 (N.B.: back-port): NAPI (new application program interface)

      • interrupt mitigation technology (mixture of interrupt and polling mechanisms)

    Interrupt Livelock

    • Given the interrupt rate coming in, the IP processing thread never gets a chance to remove any packets off the system.

    • So many interrupts come into the system that no useful work is done.

    • Packets go all the way to be queued, but are dropped because the backlog queue is full.

    • System resources are abused extensively but no useful work is accomplished.

    NAPI (New API)

    • NAPI is an interrupt mitigation mechanism consisting of a mixture of interrupt and polling mechanisms:

      • Polling:

        • useful under heavy load.

        • introduces more latency under light load.

        • abuses the CPU by polling devices that have no packet to offer.

      • Interrupts:

        • improve latency under light load.

        • make the system vulnerable to livelock when the interrupt load exceeds the MLFFR (Maximum Loss Free Forwarding Rate).

    Packet Reception in Linux Kernel ≤ 2.4.19 (softnet) and ≥ 2.4.20 (NAPI)

    [Figure: two packet-reception diagrams, titled “Softnet (kernel ≤ 2.4.19)” and “NAPI (kernel ≥ 2.4.20)”.]

    NAPI (II)

    • Under low load, before the MLFFR is reached, the system converges toward an interrupt driven system: packets/interrupt ratio is lower and latency is reduced.

    • Under heavy load, the system takes its time to poll the registered devices. Interrupts are allowed as fast as the system can process them: the packets/interrupt ratio is higher and latency is increased.
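
The difference between a purely interrupt-driven receiver and a NAPI-like one can be illustrated with a toy model. It is a deliberately crude sketch, not a description of the kernel code: CPU time per tick, the cost of an interrupt and the cost of processing a packet are invented units, and all names are hypothetical. In interrupt mode every packet raises an interrupt that is served with priority; in NAPI mode one interrupt is taken and the device is then polled.

```python
def delivered_per_tick(arrivals, budget=1000, irq_cost=10, pkt_cost=20, napi=False):
    """Packets fully processed in one tick of CPU time (toy model)."""
    if arrivals == 0:
        return 0
    # Interrupt-driven: one interrupt per packet, served before any processing.
    # NAPI-like: a single interrupt, then polling with interrupts disabled.
    irq_time = irq_cost if napi else arrivals * irq_cost
    cpu_left = max(0, budget - irq_time)
    return min(arrivals, cpu_left // pkt_cost)

for load in (10, 33, 50, 100, 200):
    irq = delivered_per_tick(load)
    poll = delivered_per_tick(load, napi=True)
    print(f"offered={load:4d}  interrupt-driven={irq:3d}  NAPI-like={poll:3d}")
```

In this model the interrupt-driven throughput rises with the load, then falls to zero once interrupt handling alone consumes the whole budget (livelock), whereas the NAPI-like throughput saturates at a plateau.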

    NAPI (III)

    • NAPI changes driver-to-kernel interfaces.

      • all network drivers should be rewritten.

    • In order to accommodate devices not NAPI-aware, the old interface (backlog queue) is still available for the old drivers (back-compatibility).

    • Backlog queues, when used in back-compatibility mode, are polled just like other devices.

    True NAPI vs Back-Compatibility Mode NAPI

    [Figure: two diagrams, titled “NAPI kernel with NAPI driver” and “NAPI kernel with old (not NAPI-aware) driver”.]

    The Intel e1000 Driver

    • Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the use of the driver also in kernels ≤ 2.4.19).

    • To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option: make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI

    Best Results

    • Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s.

    • Mean datagram loss fraction (@ 957 Mb/s): 2.0x10^-10 (4 datagrams lost for 2.0x10^10 4k-datagrams sent)

      • corresponding to a BER of 6.2x10^-15 (using 1 m cat6e cables) if data loss is totally due to hardware CRC errors.
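
The relation between the datagram loss fraction and the quoted bit error rate can be checked with a short calculation. The sketch assumes that each lost datagram corresponds to exactly one corrupted bit and counts only the 4096-byte payload; both are simplifying assumptions made here, not statements from the slides.

```python
DATAGRAMS_SENT = 2.0e10
DATAGRAMS_LOST = 4
PAYLOAD_BITS = 4096 * 8   # assumption: only the 4096-byte payload is counted

loss_fraction = DATAGRAMS_LOST / DATAGRAMS_SENT
# One corrupted bit per lost datagram (assumption):
ber = DATAGRAMS_LOST / (DATAGRAMS_SENT * PAYLOAD_BITS)
print(f"loss fraction = {loss_fraction:.1e}, BER = {ber:.1e}")
```

Under these assumptions the BER comes out at about 6x10^-15, in line with the value quoted above; the small difference presumably reflects rounding of the datagram count.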

    To Be Tested to Improve Further

    • kernel 2.5

      • fully preemptive (real time)

      • sysenter & sysexit (instead of int 0x80) for context switching following system calls (3-4 times faster).

      • Asynchronous datagram receiving

    • Jumbo frames

      • Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500. Fewer fragments per IP datagram.

    • Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml/)

      • KML is a technology that enables the execution of ordinary user-space programs inside kernel space.

      • Protection-by-software (like in Java bytecode) instead of protection-by-hardware.

      • System calls become function calls (132 times faster than int 0x80, 36 times faster than sysenter/sysexit).

    Milestones

    • 8.2004 – Streaming benchmarks:

      • Maximum streaming throughput and packet loss using UDP, RAW IP and RAW Ethernet with a loopback cable.

      • Test of switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames).

      • Maximum streaming throughput and packet loss using UDP, RAW IP and RAW Ethernet for 2 or 3 simultaneous connections on the same PC.

      • Test of event building (receive 2 message streams and send 1 joined message stream)

    • 12.2004 – SFC (Sub Farm Controller) to nodes communication:

      • Definition of SFC-to-nodes communication protocol.

      • Definition of SFC queueing and scheduling mechanism

      • First implementation of queueing/scheduling procedures (possibly zero-copy).
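
The event-building test mentioned in the first milestone (two incoming message streams joined into one outgoing stream) could look like the sketch below. The message layout (source, event number, payload), the source names and the function name are hypothetical; the slides do not define the protocol, whose specification is itself one of the milestones.

```python
def build_events(messages, sources=("A", "B")):
    """Join fragments carrying the same event number into complete events."""
    pending = {}                      # event_id -> {source: payload}
    for source, event_id, payload in messages:
        parts = pending.setdefault(event_id, {})
        parts[source] = payload
        if all(s in parts for s in sources):
            del pending[event_id]     # event complete: release its buffer
            yield event_id, b"".join(parts[s] for s in sources)

# Fragments from two sources arriving interleaved and out of order.
incoming = [("A", 1, b"a1"), ("B", 1, b"b1"), ("B", 2, b"b2"), ("A", 2, b"a2")]
for event in build_events(incoming):
    print(event)
```

A real implementation would also need a timeout for events whose fragments never all arrive, since with an unreliable protocol a lost fragment would otherwise keep its partner buffered forever.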

    Milestones (II)

    • OS test (if performance needs to be improved)

      • kernel Linux 2.5.53.

      • KML (kernel mode linux).

    • Design and test of bootstrap procedures:

      • Measurement of the failure rate of the simultaneous boot of a cluster of PCs, using pxe/dhcp and tftp.

      • Test of node switch on/off and power cycle using ASF.

      • Design of the bootstrap system (ratio of nodes/proxy servers/servers, software alignment among servers)

    • Definition of requirements for the trigger software:

      • error trapping.

      • timeout.
