Status of PDC’05/SC3 System stress test


1. Status of PDC’05/SC3 System stress test
   LCG + ALICE + Site experts
   ALICE-LCG TF meeting, Geneva, December 08, 2005

2. General running statistics
   • Event sample (last 2 months of running):
     • 22500 jobs completed (Pb+Pb and p+p)
     • Average duration 8 hours, 67500 job cycles
     • Total CPU work: 540 KSi2K hours
     • Total output: 20 TB (90% CASTOR2, 10% site SEs)
   • Participating centres (22 total):
     • 4 T1’s: CERN, CNAF, GridKa, CCIN2P3
     • 18 T2’s: Bari (I), Clermont (FR), GSI (D), Houston (USA), ITEP (RUS), JINR (RUS), KNU (UKR), Muenster (D), NIHAM (RO), OSC (USA), PNPI (RUS), SPbSU (RUS), Prague (CZ), RMKI (HU), SARA (NL), Sejong (SK), Torino (I), UiB (NO)

3. General running statistics (2)
   • Jobs-done repartition per site:
     • T1’s: CERN 19%, CNAF 17%, GridKa 31%, CCIN2P3 22%
       • Very even distribution among the T1’s
     • T2’s: 11% in total
       • Extremely good stability at Prague, Torino, NIHAM, Muenster, GSI, OSC
       • Some under-utilization of T2 resources – more centres were available but could not install the Grid software needed to use them fully

4. Efficiency numbers
   • Event failures:
     • 562 jobs with persistent AliRoot failures after up to 3 retries (2.5%)
     • Errors saving or downloading input files – non-persistent, due to temporary service malfunctions
   • All other error classes (application software area not visible, connectivity issues, black holes) are non-existent with the job agent model – jobs are simply not pulled from the TQ by unhealthy sites (see the sketch below).
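To make the pull model concrete, here is a minimal sketch in Python. The names (TaskQueue, software_area_visible, central_services_reachable) and the path are illustrative assumptions, not the AliEn API; the point is only that a worker which fails its local sanity checks never requests work, so jobs stay in the central queue instead of being lost to a broken site.

```python
# Minimal sketch of the pull-model job agent (illustrative only, not AliEn code).
import os
import time
from collections import deque

class TaskQueue:
    """Toy stand-in for the central task queue (TQ)."""
    def __init__(self, jobs=()):
        self._jobs = deque(jobs)

    def pull(self):
        # Hand out the next waiting job, or None if the queue is empty.
        return self._jobs.popleft() if self._jobs else None

def software_area_visible() -> bool:
    # Hypothetical mount point for the experiment software area.
    return os.path.isdir("/opt/alice/sw")

def central_services_reachable() -> bool:
    # Stand-in; a real agent would probe the central service endpoint.
    return True

def run_agent(tq: TaskQueue, poll_seconds: int = 60) -> None:
    while True:
        # A node that fails its sanity checks never asks for work, so jobs
        # stay safely in the TQ instead of dying on a "black hole" site.
        if software_area_visible() and central_services_reachable():
            job = tq.pull()
            if job is not None:
                job()  # in this toy model each job is a plain callable
        time.sleep(poll_seconds)
```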

5. System stress test
   • Goals of the test:
     • Central services behaviour:
       • Many of the past problems (large number of proxies, overload of server machines, etc.) improved with AliEn v.2-5 and through redistribution of services over several servers
     • Site services behaviour (VO-boxes, interaction with LCG):
       • Connection to central services, stability, job submission to the RB – improved with AliEn v.2-5
     • CERN SE behaviour (CASTOR2):
       • Overflow of the xrootd tactical buffer – improved with additional protection in the migration scripts (a sketch of the idea follows below)
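The exact protection added to the migration scripts is not described in the slides. As a hedged illustration only, a high-water-mark guard of roughly the following shape would keep the tactical buffer from overflowing; the threshold, buffer path, and trigger_migration call are assumptions, not CASTOR2 or xrootd APIs.

```python
# Hedged illustration of a high-water-mark guard in a migration script.
# Threshold, path, and trigger_migration() are assumptions, not real APIs.
import shutil

HIGH_WATER = 0.85                 # assumed fraction of buffer capacity
BUFFER_PATH = "/data/xrd-buffer"  # hypothetical tactical-buffer mount point

def buffer_usage(path: str) -> float:
    # Fraction of the buffer file system currently in use.
    total, used, _free = shutil.disk_usage(path)
    return used / total

def trigger_migration(path: str) -> None:
    # Stand-in for asking the mass storage system to drain the buffer.
    print(f"migration to tape requested for {path}")

def can_accept_write(path: str = BUFFER_PATH) -> bool:
    # Refuse new writes once the high-water mark is crossed, and start
    # draining the buffer instead of letting it overflow.
    if buffer_usage(path) >= HIGH_WATER:
        trigger_migration(path)
        return False
    return True
```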

6. System stress test (2)
   • General targets:
     • Number of concurrently running jobs: 2500, sustained for 24 hours (7500 jobs in total)
     • Storage: CASTOR2, 15K files (2 per job), each file an archive of 5 ROOT files, 7.5 TB in total
   • Special target:
     • GridKa provides 1200 job slots – test of the VO-box
   (A quick consistency check of these targets is sketched below.)
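Using only the numbers quoted in the slides (2500 concurrent jobs, the 8-hour average duration from slide 2, 2 files per job, 7.5 TB), the targets are mutually consistent and imply an archive size of roughly 0.5 GB:

```python
# Consistency check of the stress-test targets, using only numbers from
# the slides above (no new data).
concurrent_jobs = 2500
test_hours = 24
avg_job_hours = 8                        # average duration, slide 2

cycles = test_hours // avg_job_hours     # 3 cycles of jobs
total_jobs = concurrent_jobs * cycles    # 7500 jobs, matching the target

files_per_job = 2
total_files = total_jobs * files_per_job # 15,000 files, matching 15K

total_output_tb = 7.5
gb_per_file = total_output_tb * 1000 / total_files  # ~0.5 GB per archive

print(total_jobs, total_files, round(gb_per_file, 2))  # 7500 15000 0.5
```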

7. Results: 2450 jobs
   • Running job profile (plot): peak of 2450 concurrently running jobs
   • Negative slope: see Results (4)

8. Results:
   • CPU utilization across 15 sites (80% T1 / 20% T2):
     • T1’s: CERN 8%, CCIN2P3 12%, CNAF 20%, GridKa 41%
     • T2’s: Bari 0.5%, GSI 2%, Houston 2%, Muenster 3.5%, NIHAM 1%, OSC 2.5%, Prague 4%, Torino 2%, ITEP 1%, SARA 0.1%, Clermont 0.5%
   • Number of concurrently running jobs: 98% of the target (2450 of 2500)
     • Special thanks to Kilian and the GridKa team for making 1200 CPUs available for the test
   • Duration: 12 hours (half of the target duration)
   • Jobs done: 2500 (33% of the target number)
   • Storage: 33% of the target

9. Results (2):
   • VO-box behaviour:
     • No problems with the services running; no interventions necessary
     • Load profile on the VO-boxes – on average proportional to the number of jobs running at the site, nothing unusual
   (Load-profile plots: CERN, GridKa)

10. Results (3):
   • Storage behaviour:
     • xrootd (interface) and CASTOR2 – no problems
     • However, the objective was not to stress-test the MSS and the network
   • Central AliEn services behaviour:
     • Job submission: 3000 jobs (6 master jobs) submitted, split, and available in the TQ in 2 hours (0.8 jobs/sec); a toy sketch of master-job splitting follows below
     • Job starting and running phases – no problem with the number of jobs, no special load on the proxy, DB, or any other service
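To illustrate the submission/splitting mechanism described above, here is a minimal sketch assuming invented SubJob/TaskQueue types (not AliEn structures) and an even split of 500 subjobs per master: each master job expands into many subjobs pushed into the central task queue, from which the job agents later pull them.

```python
# Toy sketch of master-job splitting into the task queue (TQ).
# SubJob and TaskQueue are invented for illustration, not AliEn structures.
from collections import deque
from dataclasses import dataclass

@dataclass
class SubJob:
    master_id: int
    index: int

class TaskQueue:
    def __init__(self):
        self.jobs = deque()

    def push(self, job: SubJob) -> None:
        self.jobs.append(job)

def split_master(master_id: int, n_subjobs: int, tq: TaskQueue) -> None:
    # One master job expands into n_subjobs entries in the TQ, each of
    # which a job agent can later pull and run independently.
    for i in range(n_subjobs):
        tq.push(SubJob(master_id, i))

# As in the test above: 6 master jobs split into 3000 subjobs in total
# (assuming 500 per master for illustration).
tq = TaskQueue()
for master in range(6):
    split_master(master, 500, tq)
assert len(tq.jobs) == 3000
```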

11. Results (4):
   • Negative slope on the number-of-jobs plot:
     • Occurred during the job-saving phase
     • Post-mortem analysis by the experts (Pablo and Predrag)
     • Prevented us from reaching the target duration of the exercise

12. Conclusions
   • These are still preliminary – the exercise de facto ended at 02:00 this morning
   • VO-box model: shows scalability up to 1000 jobs running concurrently at a given site (the maximum number of CPUs available):
     • We are confident that it can handle much more than that
   • Storage (CASTOR2) – stable interface and storage behaviour; the next target is to test throughput performance
   • Central services:
     • Job submission/splitting – high performance
     • Starting/running – no problem (limited by the number of available CPUs)
     • Saving – server (DB) overload – will be analysed and fixed by the experts
   • We should repeat the exercise soon…

13. Acknowledgements
   • Many thanks to the site experts for the excellent support throughout PDC’05/SC3 so far
   • Special thanks to Kilian and the GridKa team for making 1200 CPUs available for the stress test
   • And, as usual: Patricia, Stefano, Pablo, Predrag, Andreas, and Derek
