Dataset publication for analysis


Nicola De Filippis

Dipartimento Interateneo di Fisica dell’Università e del Politecnico di Bari and INFN

  • Outline:

  • goals of the procedure

  • preparation of the POOL catalogues

  • attaching the runs to the META data, and validation

  • conclusions


Goals of the procedure

  • to provide easy and fast local access to the data for people performing analyses on physics channels

  • to test the handling of MCinfo, Hit, Digi and Pileup information at a Tier-1/2 site, also via Grid tools

  • to understand the relationship between the META data and the event files in COBRA and POOL (still not so clear!)

  • to gain experience with data management and transfer


Datasets for H → ZZ → 2e2μ analysis

Datasets (PCP04) and POOL xml catalogues with Hits, Digi and pileup fully attached to virgin META data at Bari:

  • Signal events:

eg03_hzz_2e2mu_160, eg03_hzz_2e2mu_170, eg03_hzz_2e2mu_180, eg03_hzz_2e2mu_190, eg03_hzz_2e2mu_200, eg03_hzz_2e2mu_250, eg03_hzz_2e2mu_300, eg03_hzz_2e2mu_450, eg03_hzz_2e2mu_500, eg03_hzz_2e2mu_600, hg03_hzz_2e2mu_115a, hg03_hzz_2e2mu_120a, hg03_hzz_2e2mu_130a, hg03_hzz_2e2mu_140a, hg03_hzz_2e2mu_150a

  • Background:

eg03_zz_2e2mu, hg03_zbb_2e2mu_compHEP, eg03_tt_2e2mu (copying), hg03_zbb_cc_2e2mu_compHEP, hg03_zbb_lc_2e2mu_compHEP (the last two still in production)

  • The samples not available locally were transferred using castorgrid or SRB

  • A set of scripts was created in order to prepare local catalogues and attach runs

  • The analysis jobs ran on a cluster (70 CPUs) with a “hybrid” configuration, in order to run CMS production and analysis both locally and in a Grid environment


D. Giordano is one of the people responsible for the H → ZZ → 2e2μ analysis



Kit for preparing event samples (1)

  • (a) Preparation of local POOL xml catalogues in a few steps:

  • Downloading “virgin” (without runs) META data from CERN: http://cmsdoc.cern.ch/cms/production/www/cgi/data/META and preparing the related POOL xml catalogue

  • Preparation of the POOL xml catalogue of the HIT and DIGI runs by extracting the POOL compressed string of the runs from RefDB (pileup data and catalogue assumed to be already available locally)

  • Publishing the POOL catalog of META data, hits, digis and pileup in just one complete POOL file catalogue

  • Changing the physical filenames of the files in the catalogue according to the local path of the files, or to the rfio path

  • Making sure that the META data are accessed locally and not via rfio

The POOL catalogue is READY to be used
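The publishing step above, which combines the META data, hit, digi and pileup catalogues into one complete POOL file catalogue, can be sketched roughly as follows. This is a minimal illustration and not the actual Bari kit scripts; it assumes the standard POOL XML catalogue layout, with File entries (each carrying a unique ID) inside a POOLFILECATALOG root.

```python
# Minimal sketch: merge several POOL XML catalogues (META data, hits,
# digis, pileup) into one complete catalogue. The layout assumed here
# (<POOLFILECATALOG><File ID=...>...</File></POOLFILECATALOG>) is the
# standard POOL XML file-catalogue format; the real kit may differ.
import xml.etree.ElementTree as ET

def merge_catalogues(xml_texts):
    """Combine the <File> entries of several catalogues into one."""
    merged = ET.Element("POOLFILECATALOG")
    seen = set()
    for text in xml_texts:
        root = ET.fromstring(text)
        for entry in root.findall("File"):
            fid = entry.get("ID")
            if fid not in seen:  # skip entries already published
                seen.add(fid)
                merged.append(entry)
    return ET.tostring(merged, encoding="unicode")

# Hypothetical two-catalogue example (file IDs invented for illustration)
meta = '<POOLFILECATALOG><File ID="A1"/></POOLFILECATALOG>'
digi = '<POOLFILECATALOG><File ID="B7"/><File ID="A1"/></POOLFILECATALOG>'
print(merge_catalogues([meta, digi]))
```

Deduplicating on the file ID keeps a run that appears in more than one input catalogue from being published twice.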


Kit for preparing event samples (2)

  • (b) Attaching runs to virgin META data in a few steps:

  • Extracting the CARF:Resume runid string of the Digis from RefDB, or from summary files if available

  • Attaching the runs, fixing the final collection and checking the attached META data with dsDump

  • Validation by running the ExSimHitStatistics and ExDigiStatistics ORCA executables to check local access to hits and digis

The Data sample is READY to be analysed

In my experience, running ORCA analysis code only requires attaching the DIGI runs, plus access to the META data and to the hit EVD files (which do not need to be attached).
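The first step, pulling the CARF:Resume runid out of a job summary (smry) file, amounts to simple text extraction and can be sketched as below. The exact smry line format shown here is an assumption for illustration; as noted on the next slide, RefDB is the authoritative source when resubmitted jobs make the smry files disagree with it.

```python
# Minimal sketch: extract the CARF:Resume runid from a summary (.smry)
# file. The line format assumed here is hypothetical; only the
# "CARF:Resume" marker is taken from the procedure described above.
def extract_runids(smry_text):
    runids = []
    for line in smry_text.splitlines():
        if "CARF:Resume" in line:
            # assumption: the runid is the last whitespace-separated token
            runids.append(line.split()[-1])
    return runids

# Hypothetical smry fragment
sample = "OutputStream: ok\nCARF:Resume runid 2180001\n"
print(extract_runids(sample))  # -> ['2180001']
```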


Warnings with POOL catalogs

  • all the procedures are based on parsing the RefDB web pages and depend strongly on the structure of the RefDB tables; that structure can change because of multiple hit or digi fields used for tests, or because of empty fields in the tables.

  • it can happen that, after decompression of the POOL strings, stray characters (like “41435a”) remain in the POOL fragment of a run; in that case they have to be removed with the sed command already included in the scripts.

  • in addition to the expected META data belonging to the right owners, other META data (mostly “Configuration” files) sometimes have to be downloaded and published in the POOL xml catalogue. This is caused by an occasionally wrong initialization procedure of the Digi META data at CERN and cannot be avoided; an estimated 5–10% of datasets are affected.

  • catalogues produced with an old version of POOL cause problems; such a catalogue has to be migrated to the new format with the command XMLmigrate_POOL1toPOOL1.4

  • FCrenamePFN is very slow on large xml catalogues when replacing the dummy path with the local path of the files!
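The stray-character cleanup mentioned above, done with sed in the kit, can be sketched in a few lines. The specific junk sequence "41435a" comes from the slide; treating any remaining non-printable bytes as decompression leftovers is an extra assumption for illustration.

```python
# Minimal sketch of the sed-style cleanup: strip spurious sequences
# (such as "41435a") that can survive decompression of a run's POOL
# string, before the XML fragment is published. The sample fragment
# below is hypothetical.
import re

def clean_fragment(fragment, junk=("41435a",)):
    for seq in junk:
        fragment = fragment.replace(seq, "")
    # assumption: also drop non-printable bytes left by decompression
    return re.sub(r"[^\x20-\x7e\n]", "", fragment)

raw = '<pfn filetype="ROOT_All" 41435aname="EVD0_run1.root"/>'
print(clean_fragment(raw))
```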


Warnings with AttachRun

  • Sometimes the CARF:Resume runid string in the smry files differs from the one in RefDB because of multiply submitted jobs (the RefDB tables are only updated the first time the smry file is sent), so it is better to extract the runids from RefDB in order to access validated information

  • Sometimes the runid extracted from the RefDB tables is not correct, so the script has to be tuned to work properly (only the field number has to be changed).
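The field-number tuning above can be sketched as follows; keeping the column index as a single constant is the point, since only that value changes when RefDB's table layout shifts. The row format and column position here are invented for illustration.

```python
# Minimal sketch of the RefDB-row parsing that sometimes needs tuning:
# the runid sits in a fixed column of the parsed table row, and when
# the RefDB layout changes, only FIELD_NUMBER has to be adjusted.
# The row format below is a hypothetical example.
FIELD_NUMBER = 2  # column holding the runid; tune if RefDB changes

def runid_from_row(row, field=FIELD_NUMBER):
    fields = row.split()
    return fields[field]

row = "eg03_hzz_2e2mu_200 Digi 2180001 done"
print(runid_from_row(row))  # -> 2180001
```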


Conclusions

  • Publishing data for analysis is possible in a Tier 1-2 now!

  • The global procedure can be optimized and automated (this is under discussion in DAPROM)

  • I am in contact with KA (A. Schmidt) about creating scripts for analysis job submission on the Grid

All the scripts and the documentation are available in the file kit_for_analysis.tar.gz at:

http://webcms.ba.infn.it/cms-software/orca

For information mail to: [email protected]
