Data production using CernVM and lxCloud Dag Toppe Larsen Belgrade 2013-05-28



  1. Data production using CernVM and lxCloud • Dag Toppe Larsen • Belgrade 2013-05-28

  2. NA61/NA49 meeting, Belgrade Outline • New data production scripts • Virtualised data production • Data production manager • Next steps

  3. Data production sequence

  4. New data production scripts • New set of scripts • prodna61-produce-reaction.sh • prodna61-produce-chunk.sh • prodna61-find-chunk-errors.sh • Details on next slides • Exclusively use the xRootd interface to Castor • Initially, the scripts focused mainly on CernVM; recent involvement in “normal” data production provided an opportunity to cover lxBatch as well • That involvement also gave a much better overview/understanding of the requirements for “normal” data production • Scripts “work”, but there are some issues that need to be addressed for fully automated usage • Will save significant work and reduce the chance of mistakes in data productions, even when executed by hand from the command line • To be executed from the web data production manager

  5. New data production scripts • prodna61-produce-reaction.sh <reaction> • e.g. prodna61-produce-reaction.sh BeBe160 • Initiates production of a reaction • Gets lists of: • chunks from bookkeeping/Castor • software from file system • global keys from KEY DB • Takes the latest global key and software by default (can be overridden with additional parameters) • Submits jobs to the batch system (either CernVM or lxBatch) • Jobs run the prodna61-produce-chunk.sh script (next slide) • Small differences between the lxBatch and CernVM versions (related to the different batch systems)

  6. New data production scripts • prodna61-produce-chunk.sh • Several parameters defining paths, global key, software versions, etc. • Designed to be flexible, also with regard to processing outside CERN • Configuration parameters such as magnetic field are not passed; the job determines them itself • Modifies a template set-up file (prodna61-setup) to fit requirements • Steps: • Get raw file from Castor, unpack it • Run legacy software • Run ROOT61 • Convert legacy to SHOE • Run native Shine reconstruction (PSD) • Merge converted legacy data and native Shine data • Create mini-Shoe • Run Anar's QA (on chunk, intended to be merged later) • Upload files to Castor and/or local disk • Compress log file and store to Castor • The process will be simpler after the switch to Shine native reconstruction, since most complications are related to the legacy chain • Typically not run by the user directly (although possible), but submitted to the batch system by prodna61-produce-reaction.sh • Same version for CernVM/lxBatch
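
The step sequence on the slide above can be pictured as an ordered pipeline. The following is only a minimal Python sketch, not the actual shell script: the step names follow the slide, while `run_chunk_steps`, `demo_steps` and the stop-at-first-failure behaviour are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]

# Step names taken from the slide; the actions attached to them are placeholders.
STEP_NAMES = [
    "get raw file from Castor and unpack",
    "run legacy software",
    "run ROOT61",
    "convert legacy to SHOE",
    "run native Shine reconstruction (PSD)",
    "merge legacy and Shine data",
    "create mini-Shoe",
    "run QA",
    "upload files",
    "compress and store log",
]


def run_chunk_steps(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure (assumed behaviour).

    Returns (all_ok, names_of_completed_steps).
    """
    done: List[str] = []
    for name, action in steps:
        if not action():
            return False, done
        done.append(name)
    return True, done


def demo_steps(fail_at: Optional[str] = None) -> List[Step]:
    """Build placeholder steps; the step named `fail_at` reports failure."""
    return [(name, (lambda name=name: name != fail_at)) for name in STEP_NAMES]


if __name__ == "__main__":
    ok, done = run_chunk_steps(demo_steps(fail_at="run ROOT61"))
    print(ok, done)
```

Recording which steps completed is one way to give the error search a starting point when a chunk fails part-way through.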

  7. New data production scripts • prodna61-find-chunk-errors.sh • Searches for errors for a given chunk • Errors searched for: • Too small, empty or non-existent DSPACK, ROOT, SHOE, MINI-SHOE, LOG or QA file on Castor • Failed events in the log file (above a given threshold), and jobs that exited or were killed/terminated • Intended to be run as an acrontab job on behalf of the production manager web page, for all finished jobs • Same version for CernVM/lxBatch
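
The two kinds of check described on the slide above (output file sizes and log contents) could look roughly like the Python sketch below. It is not the real prodna61-find-chunk-errors.sh: the function name, the thresholds and in particular the log patterns are hypothetical, since the slides do not show the log format.

```python
import re
from typing import Dict, List

# Output types listed on the slide.
EXPECTED_OUTPUTS = ("DSPACK", "ROOT", "SHOE", "MINI-SHOE", "LOG", "QA")

# Hypothetical log patterns; the real log format is not given in the slides.
FAILED_EVENT = re.compile(r"event \d+ failed", re.IGNORECASE)
JOB_ABORTED = re.compile(r"\b(exited|killed|terminated)\b", re.IGNORECASE)


def find_chunk_errors(
    sizes: Dict[str, int],
    log_text: str,
    min_bytes: int = 1,
    max_failed_events: int = 0,
) -> List[str]:
    """Return a list of human-readable problems found for one chunk.

    `sizes` maps an output type to its file size in bytes; a missing key
    means the file does not exist.
    """
    errors: List[str] = []
    for kind in EXPECTED_OUTPUTS:
        size = sizes.get(kind)
        if size is None:
            errors.append(f"{kind}: missing")
        elif size < min_bytes:
            errors.append(f"{kind}: too small ({size} bytes)")
    failed = len(FAILED_EVENT.findall(log_text))
    if failed > max_failed_events:
        errors.append(f"log: {failed} failed events")
    if JOB_ABORTED.search(log_text):
        errors.append("log: job exited/killed/terminated")
    return errors
```

An empty return value means the chunk passed every check, which is convenient for a periodic job that only needs to flag chunks for resubmission.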

  8. Remaining issues for DP scripts • The reconstruction can be run with either “-pA” or “-pp” for fitting of the primary vertex • Which one is preferable for a given reaction typically depends on target length • Run-script: • run_keys="-d all -256 -pp -keep -minipoint -points -f $setup" • run_keys="-d all -256 -pA -keep -minipoint -points -f $setup" • Set-up file: • Exec $v0find -s $DSPACK_SERVER -d all -pp • Exec $v0find -s $DSPACK_SERVER -d all -pA • Exec $xi_fit -s $DSPACK_SERVER -d all -f 13 -pp • Exec $xi_fit -s $DSPACK_SERVER -d all -f 13 -pA • Question: • Would it be possible to have a KEY for this? • Otherwise it needs to be stored in a separate “database”, increasing complexity • Why is -pp/-pA passed both as a parameter to the run-script and inside the set-up file? • Does it ever happen that the same option is not used in both places?
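
Until a KEY exists for the vertex-fit option, one way to keep the run-script and set-up file consistent is to derive both from a single setting. The Python sketch below only illustrates that idea; the strings are copied from the slide above, while the function names and the single `fit_mode` parameter are hypothetical.

```python
from typing import List

VALID_FIT_MODES = ("pp", "pA")


def build_run_keys(fit_mode: str, setup: str = "$setup") -> str:
    """Return the run_keys string for the chosen primary-vertex fit mode."""
    if fit_mode not in VALID_FIT_MODES:
        raise ValueError(f"fit_mode must be one of {VALID_FIT_MODES}, got {fit_mode!r}")
    return f"-d all -256 -{fit_mode} -keep -minipoint -points -f {setup}"


def build_setup_lines(fit_mode: str) -> List[str]:
    """Return the matching Exec lines for the set-up file."""
    if fit_mode not in VALID_FIT_MODES:
        raise ValueError(f"fit_mode must be one of {VALID_FIT_MODES}, got {fit_mode!r}")
    return [
        f"Exec $v0find -s $DSPACK_SERVER -d all -{fit_mode}",
        f"Exec $xi_fit -s $DSPACK_SERVER -d all -f 13 -{fit_mode}",
    ]


if __name__ == "__main__":
    print(build_run_keys("pA"))
    for line in build_setup_lines("pA"):
        print(line)
```

Generating both places from one value makes a mismatch impossible by construction, which would also answer the last question on the slide for productions driven by the new scripts.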

  9. Remaining issues for DP scripts • KEY5 • e.g. KEY5=CALC/STD+ • Question: • Is there any reason why this is set explicitly in the set-up file, and not taken from the global key? • Residual corrections • e.g. Exec res_corr -s $DSPACK_SERVER -vt1_chris $CORR_DIR/vt1_2009_pp158.corr -vt2_chris $CORR_DIR/vt2_2009_pp158.corr -mtl_chris $CORR_DIR/mtl_2009_pp158.corr -mtr_chris $CORR_DIR/mtr_2009_pp158.corr -p $CORR_DIR_OLD/vdrift_2009.txt • Question: • Can we have a KEY for this as well? • Currently, we only have one set of residual correction files. Are more envisioned?

  10. Remaining issues for DP scripts • The xRootd replacement for “nsls <path>” is “xrd castorpublic dirlist <path>” • Very slow for directories with many files or for deep directory trees • Takes several minutes to return data • Not very practical for user interaction • Used for obtaining the list of chunks for a reaction from Castor • Possible solutions: • Ask IT if it can be improved • Obtain the data from the bookkeeping database instead • PSD reconstruction for Shine needs different files for different run conditions • PSDReconstructor.xml, PSDCalibXMLConfig.xml • Question: • Can this be made more automatic on the Shine side?

  11. Any other parameters? • The set-up file is currently modified from a template for magnetic field, residual corrections and -pp/-pA • Either by hand for the “old” data production scripts • Or automatically for the “new” scripts • Question: • Are there any other parameters that need to be taken into account for the data production?

  12. Create/destroy virtual clusters • Scripts have been created for creating/destroying virtual clusters of virtual machines on lxCloud (or other clouds) • It will be possible to launch older virtual machines for data preservation (running older versions of the data production software) • Need some tuning for the latest iteration of the test lxCloud • On the final lxCloud, processing is charged per hour a virtual machine is running (whether or not it is processing) • It is therefore important to be able to create/destroy virtual clusters on demand • The creation/destruction of virtual machines must be controlled by the web production manager • Some control logic needs to be developed on the web production manager side

  13. Virtualised data production • So far, legacy software v12j has been used for testing on the virtual machine • Now installing v13b (or c?) to be able to use the latest versions (also for the global key) for a test of a whole reaction • Can use the modified version of Anar's QA (ratio, difference) to compare the outputs • However, a large part of the differences may be due to “random” missing events in either production?

  14. Virtualised data production resource estimate • Processing time of a chunk depends on the reaction • BeBe160 ~1.5h • pp ~45min • Consistent with experience from lxBatch • Cost estimate based on currently processed data • Whole run 15252 (BeBe40, 170 files) produced on virtual machines • 10 virtual machines • Made sure chunks were staged on Castor first • Processing time on test lxCloud ~1h/chunk (slightly less) • Assuming 1h/chunk, 10 000 chunks per reaction, and 2 days as a “reasonable” processing time for a reaction • 10 000 / 24 / 2 ≈ 208 virtual machines for the cluster • Have been allocated a quota of 200 VMs by IT for testing • The production of a full reaction for data validation will give a better estimate • Installing latest legacy software 13b (c?) for this
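
The cluster-size estimate on the slide above is simple arithmetic: total chunk-hours divided by the wall-clock hours available. A small Python sketch follows; the function name is hypothetical and the inputs are the slide's own assumptions. Rounding up to whole machines gives 209, one more than the truncated 208 quoted on the slide.

```python
import math


def vms_needed(chunks: int, hours_per_chunk: float, wall_days: float) -> int:
    """Number of VMs needed to process `chunks` within `wall_days`.

    Assumes each VM processes one chunk at a time and is busy for the
    whole period, so this is a lower bound.
    """
    chunk_hours = chunks * hours_per_chunk
    return math.ceil(chunk_hours / (wall_days * 24))


if __name__ == "__main__":
    # Slide assumptions: 10 000 chunks, 1 h per chunk, 2 days.
    print(vms_needed(10_000, 1.0, 2))
```

With the same assumptions and the allocated quota of 200 VMs, the reaction would take 10 000 / 200 = 50 hours, slightly more than the 2-day target.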

  15. Web data production manager • Dimitije created the current production manager • Since he left NA61, I have started looking into how it works • Two “parts” • Web page displaying information • Background acrontab jobs updating files with information to be accessed by the web page • Missing/incomplete parts: • Interface to new production scripts • Authentication (to verify that the user has rights to start a production) • Production database (stores information about the status of running/finished chunk jobs, initiates the search for errors in chunks, and resubmits chunks as needed) • Interface to bookkeeping database (upload of finished reactions) • Interface to create/kill virtual clusters

  16. Data production manager database • Needed to keep track of the status of ongoing/finished jobs (chunks) for productions • Some initial scripts created • Initiated from acrontab job • Search for finished jobs, update database • Check if jobs were successful, update database • Resubmit failed jobs, update database • Based on SQLite • Scripts will be the back-end for the web production manager
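
The job-tracking cycle on the slide above (record a job, update its status, resubmit failures) maps naturally onto a single SQLite table. The Python sketch below uses the standard `sqlite3` module; the table layout, column names, status values and function names are all assumptions, as the slides do not describe the actual schema.

```python
import sqlite3
from typing import List

# Hypothetical schema: one row per (reaction, chunk) job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_jobs (
    reaction TEXT NOT NULL,
    chunk    TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'submitted',
    attempts INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (reaction, chunk)
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the job database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def submit(conn: sqlite3.Connection, reaction: str, chunk: str) -> None:
    """Record a newly submitted chunk job."""
    conn.execute(
        "INSERT INTO chunk_jobs (reaction, chunk) VALUES (?, ?)", (reaction, chunk)
    )


def mark(conn: sqlite3.Connection, reaction: str, chunk: str, status: str) -> None:
    """Set the status of a job, e.g. 'ok' or 'failed'."""
    conn.execute(
        "UPDATE chunk_jobs SET status = ? WHERE reaction = ? AND chunk = ?",
        (status, reaction, chunk),
    )


def resubmit_failed(conn: sqlite3.Connection, reaction: str) -> List[str]:
    """Flag all failed chunks of a reaction as resubmitted and return them."""
    rows = conn.execute(
        "SELECT chunk FROM chunk_jobs WHERE reaction = ? AND status = 'failed' "
        "ORDER BY chunk",
        (reaction,),
    ).fetchall()
    conn.execute(
        "UPDATE chunk_jobs SET status = 'submitted', attempts = attempts + 1 "
        "WHERE reaction = ? AND status = 'failed'",
        (reaction,),
    )
    return [row[0] for row in rows]
```

Keeping an `attempts` counter is one way to stop resubmitting a chunk that fails repeatedly; the actual submission to the batch system is left out of the sketch.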

  17. Automatic update of bookkeeping database • The bookkeeping database (Alexander's) needs to be updated when a production has finished • Should be done automatically by the web production manager (database) • The interface between the production manager and the bookkeeping database must work for both CernVM and lxBatch • Must not depend on AFS • Must also work for CernVM processing outside CERN • HTTP-based • First step is to create scripts to do the update by hand • e.g. prodna61-update-bookkeeping.sh <production details>

  18. Next steps • Short-term • Address remaining issues for the unified CernVM/lxBatch production scripts • Install legacy software 13b (c?) on CvmFS for further validation • Further investigation of missing events • Sometimes an event can fail, but succeed next time? • By-hand way to upload data to the bookkeeping database • Long-term • Web production manager • Interface to production scripts • Database for status of ongoing productions • Automatic upload of data to the bookkeeping database

  19. NA61/NA49 meeting, CERN Roadmap

  20. Volunteers • For participating in the “normal” data production team? • If anybody is interested, we can have a “mini-workshop” later this week
