
Monitoring during production: why, what, when, how (CERN, May '11)




Presentation Transcript


  1. Monitoring during production: why, what, when, how
CERN, May '11
Vincenzo Innocente, Peter Elmer
Performance Task Force
High Performance Computing for High Energy Physics
V.I. -- MultiCore R&D

  2. Motivation
• Each CMSSW release is carefully validated both in terms of physics and computing performance
• Still, reality can differ from controlled tests:
  • Statistics are higher
  • The environment is not the same
  • Data/calibrations can evolve fast: some “misbehaviors” can be detected earlier in production than in RelVals
  • In the HLT case we do not have, offline, a fully realistic sample of input events as they come out of the DAQ
• It is also important to compare performance between the production and RelVal environments, as we often miss critical details that affect one and not the other

  3. Monitoring application performance
• Non-intrusive at system level
• Non-intrusive at application level
• Intrusive application monitoring
• Detailed performance profiling

  4. Non-intrusive at system level
• Modern hardware optimizes resources at the system-wide level: it is important to have a global view
• The behavior of an application running alone can be very different when it runs concurrently with other jobs, either similar or very different ones
• On NUMA architectures, affinity and job scheduling may affect performance
• Overbooking of resources is the norm on a multicore machine: what matters is controlling the overall status, not the behavior of a single process
• What to measure:
  • Memory usage: total VSS, RSS, PSS
    • Not easy at the moment on SLC5; possible using/adapting the smem script or similar
  • CPU usage: the fraction of time processes spend in each “state”
  • Effective memory bandwidth and latency (or equivalent)
  • Effective disk & network bandwidth and latency
  • Swap rate and size (should be zero!)
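The VSS/RSS/PSS numbers above ultimately come from the kernel's per-process accounting. A minimal sketch of how a tool like smem obtains them, assuming a Linux `/proc` filesystem (the field names `Rss`, `Pss`, `Swap` are the kernel's own):

```python
# Sketch: per-process memory metrics summed from /proc/<pid>/smaps, the same
# source the smem script reads. Linux-only; values in the file are in kB.
def memory_kb(pid="self"):
    totals = {"Rss": 0, "Pss": 0, "Swap": 0}
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            key = line.split(":", 1)[0]
            if key in totals:
                totals[key] += int(line.split()[1])  # second field is the kB count
    return totals

stats = memory_kb()  # metrics for the current process
```

For a system-wide view, the same function would be run over every PID in `/proc` and the PSS values summed, which is essentially what `smem -t` does.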

  5. A “random example”
Tommaso dixit:
We had a strange problem in Pisa, and it took us some time to debug. I would like to report it here since it might be worth looking into.
Baseline: with the GEN-SIM production (4_1_3) launched 3 days ago, we were getting ~20% CPU efficiency for no apparent reason. The system was in state "D" most of the time, which is not compatible with a GEN-SIM production that has no input dataset (it was Summer11-QCD_BLepEnriched_TuneZ2_7TeV-pythia6-evtgen-START311_V2_413_v1).
It took us some time to understand why (also because it was happening on new systems, with brand-new and never really tested 12-core AMD Opteron(tm) 6174 processors), and in the end it was a filesystem problem (we switched to XFS recently to avoid the huge mkfs times on > 1 TB disks).
In summary: that workflow has a highly selective GEN-level filter and thus generates GEN events at ~50 Hz. Very few events (less than one a minute) pass the filter and are sent to the SIM step. The culprit is that for all the events, _NOT_ only those passing to SIM, the code does
open("RandomEngineState.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 134
writev(134, [{"<RandomEngineStates>\n<Event>\n<Mo"..., 8189}, {"    147122325", 13}], 2) = 8202
one file per job, each file opened/closed at ~50 Hz.
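The pattern the strace revealed, reopening and truncating the same state file once per event, can be reproduced in miniature. This sketch (file name taken from the trace above; everything else is invented for illustration) contrasts it with keeping one handle open and rewriting in place:

```python
import os
import tempfile
import timeit

STATE = "<RandomEngineStates>\n<Event>\n...\n</RandomEngineStates>\n"

def reopen_each_event(path, n):
    # What the strace showed: open with O_TRUNC, write, close -- once per event.
    for _ in range(n):
        with open(path, "w") as f:
            f.write(STATE)

def keep_open(path, n):
    # Alternative: open once, then rewind and truncate in place per event.
    with open(path, "w") as f:
        for _ in range(n):
            f.seek(0)
            f.truncate()
            f.write(STATE)
            f.flush()

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "RandomEngineState.log")
    t1 = timeit.timeit(lambda: reopen_each_event(p, 1000), number=1)
    t2 = timeit.timeit(lambda: keep_open(p, 1000), number=1)
```

How much the open/close cycle costs depends entirely on the filesystem; the point of the Pisa incident is that on their XFS setup the metadata traffic was enough to push processes into state "D" and CPU efficiency down to ~20%.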

  6. Memory accounting using smem
15 CMS reco processes forked by one master. Claimed total virtual memory: 21 GB; real (total PSS): 5.7 GB.

smem:
  PID User     Command                      Swap      USS      PSS       RSS
32116 innocent top                             0      616      651      1204
31962 innocent -tcsh                           0     1552     1789      2532
30747 innocent -tcsh                           0     2860     3309      3864
32123 innocent /usr/bin/python /afs/cern.c     0     7216     7257      7880
31911 innocent cmsRun reco_RAW2DIGI_RECO_p     0    84176   137545    940336
31945 innocent cmsRun reco_RAW2DIGI_RECO_p     0   303436   357363   1170280
31936 innocent cmsRun reco_RAW2DIGI_RECO_p     0   304552   358555   1172184
31937 innocent cmsRun reco_RAW2DIGI_RECO_p     0   309060   362986   1175968
31944 innocent cmsRun reco_RAW2DIGI_RECO_p     0   309860   363762   1176520
31931 innocent cmsRun reco_RAW2DIGI_RECO_p     0   311472   365484   1179052
31939 innocent cmsRun reco_RAW2DIGI_RECO_p     0   313060   366972   1179796
31942 innocent cmsRun reco_RAW2DIGI_RECO_p     0   313232   367179   1180212
31943 innocent cmsRun reco_RAW2DIGI_RECO_p     0   313920   367814   1180312
31938 innocent cmsRun reco_RAW2DIGI_RECO_p     0   314840   368784   1181944
31935 innocent cmsRun reco_RAW2DIGI_RECO_p     0   315172   369093   1182048
31934 innocent cmsRun reco_RAW2DIGI_RECO_p     0   315220   369173   1182436
31933 innocent cmsRun reco_RAW2DIGI_RECO_p     0   315520   369491   1182824
31932 innocent cmsRun reco_RAW2DIGI_RECO_p     0   316208   370235   1183892
31940 innocent cmsRun reco_RAW2DIGI_RECO_p     0   318144   372083   1185212
31941 innocent cmsRun reco_RAW2DIGI_RECO_p     0   329432   383356   1196240
---------------------------------------------------------------------------
   70 1                                        0  4799548  5662670  18664736

top:
  PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  P CODE DATA COMMAND
31931 innocent 20  0 1315m 1.1g 133m R 100.0  4.8 3:27.43  0  108 1.1g cmsRun
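The gap between the 18.7 GB total RSS and the 5.7 GB total PSS in the table above is just how shared-page accounting works: a page shared by N forked processes counts fully in each RSS but only 1/N in each PSS. A toy illustration of the arithmetic (the per-process numbers are invented, chosen only to be of the same order as the slide's):

```python
# Toy USS/PSS/RSS accounting, as done by smem: private pages count once,
# shared pages count fully in RSS but are split among their sharers in PSS.
def accounting(private_kb, shared_kb, n_sharers):
    uss = private_kb                            # unique set: private only
    pss = private_kb + shared_kb // n_sharers   # proportional set
    rss = private_kb + shared_kb                # resident set: everything mapped
    return uss, pss, rss

# 15 forked reco processes, each with ~300 MB private and ~870 MB shared pages.
procs = [accounting(private_kb=300_000, shared_kb=870_000, n_sharers=15)
         for _ in range(15)]
total_uss = sum(u for u, _, _ in procs)
total_pss = sum(p for _, p, _ in procs)
total_rss = sum(r for _, _, r in procs)
```

Summing RSS over the processes triple-counts the shared master image, which is why total PSS (not total RSS) is the honest measure of real memory consumed.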

  7. System behavior: swap
One process needs more memory: the whole system starts to swap.

top (all processes running, state R):
  PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  P SWAP DATA nFLT nDRT WCHAN     COMMAND
30694 innocent 20  0 2484m 1.3g  43m R 100.0 16.7 3:39.71  2 1.2g 1.4g   64    0 -         cmsRun
30695 innocent 20  0 2507m 1.3g  43m R 100.0 17.1 3:39.15  3 1.1g 1.4g   58    0 -         cmsRun
30693 innocent 20  0 2431m 1.2g  43m R  99.9 16.3 3:38.81  0 1.1g 1.3g   72    0 -         cmsRun
30696 innocent 20  0 2971m 1.7g  44m R  99.9 23.0 3:36.63  6 1.2g 1.8g  475    0 -         cmsRun
30698 innocent 20  0 2402m 1.2g  43m R  99.3 15.6 3:39.87  7 1.2g 1.3g   51    0 -         cmsRun
30699 innocent 20  0 2437m 1.2g  43m R  99.3 16.1 3:39.31  4 1.2g 1.3g   79    0 -         cmsRun
30697 innocent 20  0 2423m 1.2g  43m R  98.9 16.1 3:38.63  5 1.1g 1.3g  104    0 -         cmsRun

top (later, swapping: processes stuck in state D):
30693 innocent 20  0 2346m 1.2g  43m D   6.3 15.3 4:27.06  2 1.1g 1.2g  152    0 sync_page cmsRun
30694 innocent 20  0 2396m 1.2g  47m D   2.7 16.0 4:29.02  6 1.1g 1.3g  895    0 sync_page cmsRun
30696 innocent 20  0 2867m 1.7g  47m D   2.7 22.0 4:38.50  4 1.1g 1.7g  826    0 sync_page cmsRun
30698 innocent 20  0 2298m 1.1g  47m D   2.7 14.7 4:30.68  2 1.1g 1.2g  533    0 sync_page cmsRun
30695 innocent 20  0 2571m 1.4g  47m D   2.0 18.3 4:31.90  6 1.1g 1.5g  488    0 sync_page cmsRun
30697 innocent 20  0 2337m 1.2g  47m D   2.0 15.4 4:33.02  0 1.1g 1.2g  603    0 sync_page cmsRun
30699 innocent 20  0 2357m 1.2g  47m D   1.7 15.5 4:30.13  2 1.1g 1.2g  658    0 sync_page cmsRun

Successive `smem -t` snapshots, showing swap growing:

[vinavx0] /data/runPU $ ~/w1/smem-0.9/smem -t
  PID User     Command      Swap      USS      PSS       RSS
12288 innocent cmsRun     131364   271264   311207    636320
12296 innocent cmsRun     157684   764976   807179   1147128
12298 innocent cmsRun     149720   769936   811987   1151384
12299 innocent cmsRun     146440   769896   812058   1151884
12294 innocent cmsRun     121564   796356   838509   1178536
12295 innocent cmsRun     132348   804020   846115   1186032
12293 innocent cmsRun     116436   830424   872624   1212948
12292 innocent cmsRun     120664   899128   941399   1282416
12297 innocent cmsRun     136224  1084424  1126090   1463356
--------------------------------------------------------------
   88 1                  1216168  7008384  7385552  10431216

[vinavx0] /data/runPU $ ~/w1/smem-0.9/smem -t
  PID User     Command      Swap      USS      PSS       RSS
12288 innocent cmsRun     254980   201432   235208    510600
12298 innocent cmsRun     235356   752772   786905   1062456
12296 innocent cmsRun     232752   757788   792474   1072224
12299 innocent cmsRun     222904   769460   803581   1079044
12294 innocent cmsRun     187792   792652   827462   1108592
12293 innocent cmsRun     187028   811596   846568   1128696
12295 innocent cmsRun     214528   829056   863775   1144344
12292 innocent cmsRun     188576   892672   927876   1212080
12297 innocent cmsRun     220588  1249900  1283788   1558444
--------------------------------------------------------------
   88 1                  1948240  7075240  7385942   9897524

[vinavx0] /data/runPU $ ~/w1/smem-0.9/smem -t
  PID User     Command      Swap      USS      PSS       RSS
12288 innocent cmsRun     503440    58696    80402    259292
12296 innocent cmsRun     354140   747528   771833    964972
12298 innocent cmsRun     328416   778232   802770    996148
12299 innocent cmsRun     318288   779888   803973    996716
12294 innocent cmsRun     286576   801148   824805   1016988
12293 innocent cmsRun     288980   808276   831947   1024192
12295 innocent cmsRun     317508   824236   847842   1039920
12292 innocent cmsRun     296152   884604   908494   1101028
12297 innocent cmsRun     326036  1465916  1489499   1681460
--------------------------------------------------------------
   88 1                  3023292  7166440  7379865   9101708

  8. Non-intrusive at process level
• The metrics are the same as in the previous case:
  • VSS, RSS, PSS, CPU, mem/disk/network rates
• This information should be readily available as “summary” plots (one entry per job) at various granularities (dataset, tier, day/week)
• Besides aggregated and average quantities, it is extremely useful to have access to the time evolution of the “performance metrics” during a single job. This will allow us to understand:
  • How and when memory is allocated and released
  • Any particular bottleneck at startup and close-out
  • How things correlate (I/O and memory, for instance)
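The per-job time evolution described above only requires sampling a metric at event boundaries. A minimal, dependency-free sketch (the class name and sampling cadence are invented; `ru_maxrss` is used just to keep the example portable, and is a high-water mark, kB on Linux):

```python
import resource
import time

# Sketch: record (elapsed seconds, max RSS) periodically during a job so that
# memory growth and start-up spikes can be plotted after the fact.
class MetricTimeline:
    def __init__(self):
        self.t0 = time.monotonic()
        self.samples = []  # list of (seconds since start, ru_maxrss)

    def sample(self):
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        self.samples.append((time.monotonic() - self.t0, rss))

timeline = MetricTimeline()
for event in range(100):
    _ = [0] * 10_000       # stand-in for per-event work
    if event % 10 == 0:    # sample every 10th event
        timeline.sample()
```

In a real job one would sample RSS/PSS from `/proc` (not just the high-water mark) and ship the series to the monitoring system alongside the summary numbers.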

  9. Intrusive at application level
• CMSSW is a complex application; still, it performs a quite simple task: it processes events sequentially and for each event runs a set of modules in a specified sequence
• CMSSW is equipped with very mild instrumentation at the framework level that allows simple performance metrics to be collected at event and module level
• In addition, specific instrumentation exists for the I/O (db, frontier, root, rfio, xrootd)
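The "mild instrumentation" pattern, since the framework already calls each module in sequence, it can time each call, can be sketched as follows (the module names and the `ModuleTimer` class are invented for illustration; this is not the CMSSW API):

```python
import time
from collections import defaultdict

# Sketch: wrap each module invocation and accumulate per-module wall time and
# call counts, the kind of event/module-level metric the framework can collect.
class ModuleTimer:
    def __init__(self):
        self.totals = defaultdict(float)  # module name -> total seconds
        self.calls = defaultdict(int)     # module name -> invocation count

    def run(self, name, fn, *args):
        start = time.perf_counter()
        try:
            return fn(*args)
        finally:
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

timer = ModuleTimer()
for event in range(5):                                   # toy event loop
    timer.run("rawToDigi", lambda: sum(range(1000)))     # stand-in modules
    timer.run("reco", lambda: sorted(range(500), reverse=True))
```

Because the wrapper sits only at module boundaries, the overhead is one clock read per call, negligible next to real reconstruction work.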

  10. Intrusive at application level
• At the moment the performance metrics are written to log and summary files
• At the moment they are difficult to access
  • We shall discuss how to implement a different mechanism to collect them that better matches the monitoring infrastructure
• Over the years, many people have developed scripts that digest this information and summarize it in text or graphic form
• If we start to collect these metrics in large quantities we will of course need new algorithms to consolidate and analyze them
  • This is not very different from the many “analytics” run on data warehouses for business purposes
• The main use of this detailed information is to better pinpoint the origin of any difference in behavior between production and RelVal, or between sites, as well as to detect early any change in performance due to changes in data/calibration content
• We can also envisage using the EDM and DQM infrastructures to store and access these metrics (DQM/CorSoftwareClient?)
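The "analytics" step amounts to folding per-job metric records into summaries keyed by dataset, site, or time window. A sketch under invented field names and numbers (the low Pisa CPU efficiency echoes the example on slide 5):

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical per-job metric records, one dict per finished job.
jobs = [
    {"dataset": "QCD", "site": "T2_IT_Pisa", "rss_mb": 1180, "cpu_eff": 0.21},
    {"dataset": "QCD", "site": "T2_IT_Pisa", "rss_mb": 1150, "cpu_eff": 0.19},
    {"dataset": "QCD", "site": "T1_CH_CERN", "rss_mb": 1165, "cpu_eff": 0.92},
]

def summarize(jobs, key="dataset", metric="cpu_eff"):
    # Group jobs by the chosen key and compute summary statistics per group.
    groups = defaultdict(list)
    for j in jobs:
        groups[j[key]].append(j[metric])
    return {k: {"n": len(v), "mean": mean(v), "median": median(v)}
            for k, v in groups.items()}

by_site = summarize(jobs, key="site")
```

Summaries like `by_site` are exactly what makes a per-site anomaly (here, ~20% CPU efficiency at one site against ~90% elsewhere) stand out without reading individual log files.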

  11. Detailed performance profiling
• IgProf is a very powerful, yet very lightweight, CPU/memory profiler
  • Its overhead as a CPU profiler is minimal
  • As a memory profiler it may double the resources required
• IgProf results can be analyzed offline and published on the web
  • A recent development allows them to be stored in a central database and used to produce regression plots and comparison tables
• We can envisage running 1% of the jobs “under IgProf” (even in HLT), collecting the results regularly and analyzing them automatically for even more precise pinpointing of “new” hotspots
• Running IgProf systematically on the very many different hardware architectures available to us on the Grid will also be a unique opportunity to understand how the hardware itself affects performance at the “cycle” level
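Selecting "1% of the jobs" to profile is best done deterministically, so the choice is reproducible and spread evenly across datasets and sites. One common technique, sketched here with an invented job-id format (the slides do not specify a selection mechanism):

```python
import hashlib

# Sketch: hash the job identifier and keep jobs landing in the first
# `fraction_pct` buckets out of 100 -- a stable, reproducible ~1% sample.
def profile_this_job(job_id: str, fraction_pct: int = 1) -> bool:
    h = int(hashlib.sha1(job_id.encode()).hexdigest(), 16)
    return h % 100 < fraction_pct

# Over many jobs this selects close to 1%, and always the same jobs.
selected = [i for i in range(10_000) if profile_this_job(f"job-{i}")]
```

A wrapper script would then prefix the selected jobs' `cmsRun` invocation with the profiler, while the other 99% run untouched.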
