Metrics & monitoring: Understanding of workflows, DQ2 traces, file read fractions, …


Presentation Transcript


  1. Metrics & monitoring: Understanding of workflows, DQ2 traces, file read fractions, …
Doug Benjamin (Duke University)

  2. Understanding analysis workflows
• User analysis is an interactive activity
• Some steps are very repetitive
• Much of the repetition is done
• Understanding the workflows requires getting information from users
• Long-term goal: understand the patterns all the way down to the local analysis cluster
• User analysis workflow is evolving
  • Data volume in 2011 allowed for certain behavior; data volume in 2012 requires changes from some users
• Analyses are migrating from simple cut-and-count to more sophisticated ones (multivariate analyses)
  • This implies an increase in computing resources
• Input data products evolve: the AOD to D3PD mix changes with time

  3. User analysis interviews
• Recently conducted in-depth interviews with over 20 people
• Goal: understand how people actually work
• Develop a series of questions that can be put into the US ATLAS analysis survey and the ATLAS-wide distributed analysis survey (both will go out before the Software and Computing week in mid-October)
• General themes are emerging, but there are many different solutions
  • Solutions vary person to person and task to task
• Some efforts require AODs, others group D3PDs
• Most users work both on the grid and off of it
• ~10% of user datasets are used as inputs to further processing steps on the grid

  4. Some basic workflows
• On the grid, process AODs to produce private ntuples; the private ntuples are fetched to local computing (usually with dq2-get)
• On the grid, process AODs to produce histograms; the histograms are fetched to local computing
• On the grid, use group-produced D3PDs to skim events by simple cuts, and bring the skimmed D3PDs to local computing
  • Data volume in 2012 makes this almost impossible
• On the grid, use group-produced D3PDs to skim events and slim the D3PDs by dropping branches; pull the results to local computing for further analysis (a sketch of this skim/slim step follows this list)
• On the grid, take histogram files as input, run pseudo-experiments, and output histogram files for further review
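A minimal PyROOT sketch of the skim/slim pattern described above; the tree name, branch list and cut are illustrative placeholders, not taken from the talk:

```python
# Skim (event cut) and slim (drop branches) a group D3PD with PyROOT.
# Tree name, branch names and the cut are hypothetical placeholders.
import ROOT

fin = ROOT.TFile.Open("group.D3PD.root")
tree = fin.Get("physics")

# Slim: deactivate all branches, then re-enable only the ones needed.
tree.SetBranchStatus("*", 0)
for name in ("RunNumber", "EventNumber", "el_pt", "el_eta", "jet_pt"):
    tree.SetBranchStatus(name, 1)

fout = ROOT.TFile("user.skimslim.root", "RECREATE")
# Skim: CopyTree keeps only events passing a simple cut, and copies
# only the branches that are still active.
skimmed = tree.CopyTree("el_pt[0] > 25000.")
skimmed.Write()
fout.Close()
fin.Close()
```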

  5. Workflows at CERN
• The EOS system is highly successful and popular with users
  • On Sept 8th almost 700 users had EOS space, close to 50% of the people using Panda on the grid
  • During the past two months almost 900 TB of data were read from user space within EOS
• EOS can deliver files over the WAN
• Interesting workflow at CERN: dq2-get from lxbatch nodes (sketched after this list)
  • Talked with one user to understand why they are doing this
  • They are doing trigger studies with code only at CERN and data elsewhere; it is faster to fetch the data to a batch node than to transfer it via DATRI to scratch disk
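A rough sketch of that batch-node pattern, assuming the DQ2 client is available on the worker node; the dataset name and analysis command are placeholders, and dq2-get behaviour may vary by version:

```python
# Sketch of the lxbatch workflow: pull the input dataset to node-local
# scratch with dq2-get, then run the analysis on the local copy.
# Dataset name and analysis script are hypothetical placeholders.
import os
import subprocess

dataset = "user.someone.trigger.ntup"        # hypothetical dataset name
scratch = os.environ.get("TMPDIR", "/tmp")   # node-local scratch area

# dq2-get downloads into a directory named after the dataset in the
# current working directory (exact behaviour/options may vary).
subprocess.check_call(["dq2-get", dataset], cwd=scratch)

# Run the trigger study over the local copy (placeholder command).
subprocess.check_call(
    ["python", "run_trigger_study.py", os.path.join(scratch, dataset)])
```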

  6. Workflow conclusions
• Physicists are intelligent and creative people
  • They have a job to do, so they will use creative ways to get the job done
• There is no one way to work; any system must be adaptable
• Increasing data volumes will provide challenges this year and in the future
• Wide-area access and caching of the data could and should be part of the solution

  7. DQ2 trace analysis
• DQ2 trace data dumps, provided by Thomas Beerman, were vital for the results presented here
• Have dumps for the past two weeks and for April, May, June and July 2012; the analysis here is from the last 2 weeks of activity (a sketch of this kind of trace aggregation follows this list)
• At CERN:
  • 6 users use dq2-get on lxbatch: 20418 dq2-get requests for 666 unique datasets that were not at CERN
  • 457 users ran dq2-get on lxplus or other machines at CERN (90 US people stationed at CERN, 19.7% of users)
  • 161 TB transferred (11.5 TB by US users)
  • 24559 requests (925 from US users) for ~20500 unique datasets, mostly user datasets
• Activity away from CERN:
  • 25210 requests, ~70 TB, 428 users, 355 nodes (US numbers: 15.7 TB, 72 nodes, 101 users)
  • Single site with the most activity: Univ. of Chicago Tier 3 (they also have a Tier 2 there)
  • In July, 168 sites (away from CERN) received data via dq2-get
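The numbers above come from aggregating trace records. A minimal sketch of that kind of aggregation, assuming each trace record is a JSON object with hypothetical fields 'usrdn' (user), 'dataset', 'hostname' and 'filesize' (bytes):

```python
# Aggregate DQ2 trace dumps into per-period totals.
# File name and record fields are hypothetical placeholders.
import json

users, datasets, hosts = set(), set(), set()
total_bytes = 0
requests = 0

with open("dq2_traces.json") as f:       # one JSON record per line
    for line in f:
        rec = json.loads(line)
        requests += 1
        users.add(rec["usrdn"])
        datasets.add(rec["dataset"])
        hosts.add(rec["hostname"])
        total_bytes += rec.get("filesize", 0)

print("requests: %d" % requests)
print("unique users: %d" % len(users))
print("unique datasets: %d" % len(datasets))
print("unique nodes: %d" % len(hosts))
print("total transferred: %.1f TB" % (total_bytes / 1e12))
```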

  8. How much data do users read in D3PDs?
• Centrally produced D3PDs are a very important part of most users' analysis activities
• Group D3PDs are the OR of all possible variables: many thousands of branches, of which users only use a few hundred
• Activity has started on many different fronts to determine which branches are really being used (one way to instrument this on the user side is sketched after this list)
• Using the information collected by Sarah Williams (Indiana University, MWT2) and published on the web
  • This shows what is happening from the storage system's point of view (thanks to direct reads over the LAN); extremely helpful, though incomplete
• The EOS team at CERN collected similar information
  • Effort between ATLAS and CERN IT to understand the information
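On the user side, ROOT's TTreePerfStats I/O monitor records what an event loop actually reads from the file. A minimal sketch, with hypothetical tree and file names:

```python
# Instrument an event loop with TTreePerfStats to see how much of the
# file the analysis actually reads. Names are hypothetical placeholders.
import ROOT

f = ROOT.TFile.Open("group.D3PD.root")
tree = f.Get("physics")

ps = ROOT.TTreePerfStats("ioperf", tree)   # start I/O monitoring

for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    # ... user analysis on a few hundred of the thousands of branches ...

ps.Finish()
ps.Print()                # bytes read, number of read calls, cache use
ps.SaveAs("ioperf.root")  # can be inspected graphically later
```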

  9. Fraction of file read – May ‘12

  10. Fraction of file read Aug/Sep ‘12

  11. Integral (removing the peak at 100% read, which corresponds to file transfers)
• 85% of all reads of group D3PDs read less than 20% of the file
• 76% of all reads of user files read less than 20% of the file
• Files are full of a lot of unread data (the per-job measurement is sketched after this list)
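For a single job, the read fraction quoted above can be measured directly in ROOT, since TFile keeps I/O counters. A minimal sketch, with hypothetical names:

```python
# Measure what fraction of a file one analysis pass actually reads.
# File, tree and branch names are hypothetical placeholders.
import ROOT

f = ROOT.TFile.Open("group.D3PD.root")
tree = f.Get("physics")

# Enable only the branches the analysis uses; with all branches active,
# GetEntry() would deserialize the whole file.
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("el_pt", 1)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)

# Compare bytes actually read with the total file size.
fraction = float(f.GetBytesRead()) / f.GetSize()
print("read %.1f%% of the file" % (100.0 * fraction))
```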

  12. Efforts to understand what users are doing
• Users are productive (proof: the Higgs result and the impressive paper output of ATLAS)
• prun, while very effective, is missing critical monitoring features
  • Cannot determine what users are actually reading
• Simple changes to existing user code can result in significant speed increases (see the sketch after this list)
• Important to start from the user code rather than telling users from on high how to change what they are doing
  • Want evolutionary change; do not want to cause users to be unproductive
• In the US, started an activity to review user code and work to speed it up
  • This helps the users with faster code, and we see which variables they are using
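A minimal sketch of the kind of simple change meant here, with the processing rate reported in Hz as slide 13 describes; all names are hypothetical placeholders:

```python
# A typical "simple change": enable only the branches the analysis
# touches, and report the event rate in Hz so it can be compared with
# other jobs on the same dataset. Names are hypothetical placeholders.
import time
import ROOT

f = ROOT.TFile.Open("group.D3PD.root")
tree = f.Get("physics")

# Before: every GetEntry() deserializes all several-thousand branches.
# After: only the branches the analysis uses are active.
tree.SetBranchStatus("*", 0)
for name in ("el_pt", "el_eta", "jet_pt"):
    tree.SetBranchStatus(name, 1)

n = tree.GetEntries()
t0 = time.time()
for i in range(n):
    tree.GetEntry(i)
    # ... cuts / histogram filling ...
rate = n / (time.time() - t0)
print("processed %d events at %.0f Hz" % (n, rate))
```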

  13. D3PD analysis wiki
• Helped at least 5 users see a 3-4 times increase in processing speed
• A tool for others to read and speed up their own code
• As part of the message from Panda, plan to tell users their processing speed (Hz) and compare it to other jobs using the same dataset
• https://twiki.cern.ch/twiki/bin/view/Atlas/D3PDanalysisOptimizations

  14. Conclusions
• Users are creative, adaptive and goal-oriented; the primary goal is to get the science out
• Increasing data volumes present a challenge (including this year)
• WAN access could be a good tool for the users; it needs to be trivial to use and have good performance
• Cannot expect radical changes in user code; changes must be evolutionary to really happen
• Can partial event caching and block IO help us? (a caching sketch follows this list)
• Need better monitoring so we can increase the information density of the files being used: want them to contain the data that users really want and little else
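On the caching question, ROOT already offers one relevant building block: TTreeCache prefetches the baskets of the branches in use in large blocks, which matters most for WAN (e.g. xrootd) reads. A minimal sketch, with hypothetical URL, tree and branch names:

```python
# Enable ROOT's TTreeCache so WAN reads fetch baskets in large blocks
# instead of one small request per basket. Names are hypothetical.
import ROOT

f = ROOT.TFile.Open("root://somesite//atlas/group.D3PD.root")
tree = f.Get("physics")

tree.SetCacheSize(30 * 1024 * 1024)      # 30 MB read cache
tree.AddBranchToCache("el_pt", True)     # cache only the branches in use
tree.AddBranchToCache("jet_pt", True)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    # ... analysis ...

print("bytes read: %d in %d read calls" % (f.GetBytesRead(), f.GetReadCalls()))
```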
