
Online Performance Monitoring of the Third ALICE Data Challenge

This paper discusses the online performance monitoring of the third ALICE Data Challenge, including the testbed infrastructure, monitoring system, and performance results. Conclusions and future directions are also presented.


Presentation Transcript


  1. Online Performance Monitoring of the Third ALICE Data Challenge
  W. Carena¹, R. Divia¹, P. Saiz², K. Schossmaier¹, A. Vascotto¹, P. Vande Vyvre¹ (CERN EP-AID¹, EP-AIP²)
  NEC2001, Varna, Bulgaria, 12-18 September 2001

  2. Contents
  • ALICE Data Challenges
  • Testbed infrastructure
  • Monitoring system
  • Performance results
  • Conclusions

  3. ALICE Data Acquisition (final system!)
  • ALICE detectors → up to 20 GB/s
  • Local Data Concentrators (LDC), readout, ~300 nodes → up to 2.5 GB/s
  • Global Data Collectors (GDC), event building, ~100 nodes → up to 1.25 GB/s
  • CASTOR system: mass storage system

  4. ALICE Data Challenges
  • What? Put together components to demonstrate the feasibility, reliability, and performance of our present prototypes.
  • Where? The ALICE common testbed uses the hardware of the common CERN LHC testbed.
  • When? This exercise is repeated every year, progressively enlarging the testbed.
  • Who? A joint effort between the ALICE online and offline groups and two groups of the CERN IT division.
  • ADC I: March 1999; ADC II: March-April 2000; ADC III: January-March 2001; ADC IV: 2nd half of 2002?

  5. Goals of the ADC III
  • Performance, scalability, and stability of the system (10% of the final system)
  • 300 MB/s event-building bandwidth
  • 100 MB/s over the full chain during a week
  • 80 TB into the mass storage system
  • Online monitoring tools

  6. ADC III Testbed Hardware
  • Farm: 80 standard PCs, dual PIII @ 800 MHz, Fast and Gigabit Ethernet, Linux kernel 2.2.17
  • Network: 6 switches from 3 manufacturers, copper and fiber media, Fast and Gigabit Ethernet
  • Disks: 8 disk servers, dual PIII @ 700 MHz, 20 IDE data disks, 750 GB mirrored
  • Tapes: 3 HP NetServers, 12 tape drives, 1000 cartridges, 60 GB capacity, 10 MB/s bandwidth

  7. ADC III Monitoring
  • Minimum requirements:
    • LDC/GDC throughput (individual and aggregate)
    • Data volume (individual and aggregate)
    • CPU load (user and system)
    • Identification: time stamp, run number
    • Plots accessible on the Web
  • Online monitoring tools:
    • PEM (Performance and Exception Monitoring) from CERN IT-PDP (was not ready for ADC III)
    • Fabric monitoring: developed by CERN IT-PDP
    • ROOT I/O: measures mass storage throughput
    • CASTOR: measures disk/tape/pool statistics
    • DATESTAT: prototype developed by EP-AID and EP-AIP

  8. Fabric Monitoring
  • Collect CPU, network I/O, and swap statistics
  • Send UDP packets to a server
  • Display current status and history using Tcl/Tk scripts
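The slides describe the fabric-monitoring agents only at this level of detail. As a minimal sketch of the UDP reporting path, the datagram layout, collector address, and function names below are our own assumptions, not the real IT-PDP code:

```python
import json
import socket
import time

COLLECTOR = ("monitor.example", 9125)  # hypothetical collector host and port

def make_packet(host, cpu_user, cpu_sys, swap_used):
    """Pack one sample into a small self-describing UDP datagram.
    The field layout here is invented for illustration."""
    return json.dumps({
        "host": host,
        "ts": int(time.time()),
        "cpu_user": cpu_user,    # fraction of CPU time in user mode
        "cpu_sys": cpu_sys,      # fraction of CPU time in system mode
        "swap_used": swap_used,  # bytes of swap in use
    }).encode()

def send_sample(packet, sock=None):
    """Fire-and-forget: one datagram per sample, no acknowledgement."""
    sock = sock or socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet, COLLECTOR)

# Build (but do not send) a sample like one taken on an LDC node.
packet = make_packet("ldc01", 0.12, 0.27, 0)
print(json.loads(packet)["host"])  # ldc01
```

UDP fits this job well: a lost statistics packet is harmless, and the agents never block on the collector.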

  9. ROOT I/O Monitoring
  • Measures aggregate throughput to the mass storage system
  • Collect measurements in a MySQL database
  • Display history and histograms on Web pages using ROOT
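The bookkeeping idea is one throughput sample per row, keyed by run number and time stamp, with the Web plots driven by simple queries. ADC III used MySQL; the sketch below uses sqlite3 only so it is self-contained, and the table schema and sample values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE throughput (
        run  INTEGER,  -- DATE run number
        ts   INTEGER,  -- sample time (epoch seconds)
        mbps REAL      -- aggregate rate to mass storage, MB/s
    )
""")

# Invented samples for one run (one measurement per minute).
samples = [(1423, 1000, 86.9), (1423, 1060, 88.1), (1423, 1120, 87.8)]
conn.executemany("INSERT INTO throughput VALUES (?, ?, ?)", samples)

# The kind of summary query behind a history plot or histogram.
avg, peak = conn.execute(
    "SELECT AVG(mbps), MAX(mbps) FROM throughput WHERE run = 1423"
).fetchone()
print(f"run 1423: avg {avg:.1f} MB/s, peak {peak:.1f} MB/s")
```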

  10. DATESTAT Architecture
  • DATE v3.7: LDC and GDC nodes run dateStat.c (plus top and DAQCONTROL)
  • DATE Info Logger → log files (~200 KB/hour/node)
  • Perl script → statistics files → gnuplot script
  • C program → MySQL database → gnuplot/CGI script
  • Results at http://alicedb.cern.ch/statistics
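The log-to-statistics step can be illustrated with a small stand-in, written in Python rather than Perl, and using an invented "epoch node bytes" record layout since the real DATE log format is not shown here:

```python
from collections import defaultdict

def parse_line(line):
    """Parse one 'epoch node bytes' record (invented layout)."""
    ts, node, nbytes = line.split()
    return int(ts), node, int(nbytes)

def per_node_rates(lines):
    """Average MB/s per node, assuming one record per node per second."""
    first, last = {}, {}
    total = defaultdict(int)
    for line in lines:
        ts, node, nbytes = parse_line(line)
        first.setdefault(node, ts)
        last[node] = ts
        total[node] += nbytes
    return {node: total[node] / 1e6 / (last[node] - first[node] + 1)
            for node in total}

# Two LDCs, each reporting ~27.1 MB in each of two one-second intervals.
log = [
    "1000 ldc01 27100000",
    "1001 ldc01 27100000",
    "1000 ldc02 27100000",
    "1001 ldc02 27100000",
]
rates = per_node_rates(log)
print(rates["ldc01"])       # 27.1 (MB/s, individual)
print(sum(rates.values()))  # aggregate over all nodes
```

This is the shape of aggregation the Perl script performs before the results are plotted or loaded into MySQL.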

  11. Selected DATESTAT Results
  • Result 1: DATE standalone run, equal subevent size
  • Result 2: Dependence on subevent size
  • Result 3: Dependence on the number of LDCs/GDCs
  • Result 4: Full chain, ALICE-like subevents

  12. Result 1/1: DATE standalone, 11 LDC × 11 GDC nodes, 420-440 KB subevents, 18 hours. Aggregate rate: 304 MB/s; volume: 19.8 TB (4E6 events)

  13. Result 1/2: DATE standalone, 11 LDC × 11 GDC nodes, 420-440 KB subevents, 18 hours. LDC rate: 27.1 MB/s; LDC load: 12% user, 27% system

  14. Result 1/3: DATE standalone, 11 LDC × 11 GDC nodes, 420-440 KB subevents, 18 hours. GDC rate: 27.7 MB/s; GDC load: 1% user, 37% system

  15. Result 2: Dependence on subevent size. DATE standalone, 13 LDC × 13 GDC nodes, 50-60 KB subevents, 1.1 hours. Aggregate rate: 556 MB/s

  16. Result 3: Dependence on #LDC/#GDC. DATE standalone. Gigabit Ethernet: max. 30 MB/s per LDC, max. 60 MB/s per GDC

  17. Result 4/1: Full chain, 20 LDC × 13 GDC nodes, ALICE-like subevents, 59 hours. Aggregate rate: 87.6 MB/s; volume: 18.4 TB (3.7E6 events)

  18. Result 4/2: Full chain, 20 LDC × 13 GDC nodes, ALICE-like subevents, 59 hours. GDC rate: 6.8 MB/s; GDC load: 6% user, 23% system

  19. Result 4/3: Full chain, 20 LDC × 13 GDC nodes, ALICE-like subevents, 59 hours. LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet); LDC load: 0.8% user, 2.7% system

  20. Grand Total
  • Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic
  • Maximum throughput in full chain: 120 MB/s without migration, 86 MB/s with migration
  • Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with full chain
  • Total volume through DATE: at least 500 TB
  • Total volume through full chain: 110 TB
  • Maximum duration per run: 86 hours
  • Maximum events per run: 21E6
  • Maximum subevent size: 9 MB
  • Maximum number of nodes: 20×15
  • Number of runs: 2200

  21. Summary
  • Most of the ADC III goals were achieved:
    • PC/Linux platforms are stable and reliable
    • Ethernet technology is reliable and scalable
    • DATE standalone is running well
    • The full chain needs to be analyzed further
    • Next ALICE Data Challenge in the 2nd half of 2002
  • Online performance monitoring:
    • The DATESTAT prototype performed well
    • It helped to spot bottlenecks in the DAQ system
    • The team from Zagreb is re-designing and re-engineering the DATESTAT prototype

  22. Future Work
  • Polling agent:
    • obtain performance data from all components
    • keep the agent simple, uniform, and extendable
    • support several platforms (UNIX, application software)
  • Transport & storage:
    • use communication with low overhead
    • maintain a common format in a central database
  • Processing:
    • apply efficient algorithms to filter and correlate logged data
    • store performance results permanently in a database
  • Visualization:
    • use a common GUI (Web-based, ROOT objects)
    • provide different views (levels, time scales, color codes)
    • automatically generate plots, histograms, reports, e-mail, ...
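One possible shape for the "simple, uniform, and extendable" polling agent proposed above is a single sampling loop that per-component collector functions plug into. This is purely our sketch of the idea, not a design from the slides; the collector and its fields are placeholders:

```python
import time

def cpu_collector():
    """Placeholder collector; a real agent would read /proc or platform APIs."""
    return {"metric": "cpu_load", "value": 0.27}

def poll(collectors, cycles, interval=0.0):
    """Run every collector once per cycle, tag each sample with its
    timestamp, and return the collected samples."""
    samples = []
    for _ in range(cycles):
        now = int(time.time())
        for collect in collectors:
            sample = collect()
            sample["ts"] = now
            samples.append(sample)
        time.sleep(interval)
    return samples

print(len(poll([cpu_collector], cycles=3)))  # 3
```

Adding a new component then means adding one collector function, which keeps the agent uniform across platforms.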
