This document outlines guidelines for managing large data volumes in the CMS Grid Computing Model. It emphasizes the need for efficient data access, reduction of data processing strain, and data segmentation into manageable datasets for analysis.
CMS Data Distribution

Outline:
- General Thoughts
- Data Volume
- Computing Model
- Data Processing Guidelines
- The Plan
- Primary & Secondary Datasets
- Skims

Journées “Grilles France”, October 16, 2009, IPN de Lyon
Tibor Kurča, Institut de Physique Nucléaire de Lyon
General Thoughts
• Imperative: data must be accessible to all of CMS quickly
  - the initial data distribution model should be simple; it will later be adapted to reality
• How to use the data most efficiently?
  - large data volumes, mostly background
  - computing resources distributed at an unprecedented level
  - very broad physics program with diverse needs
• Reduce the amount of data to be processed → reduce the strain on computing resources
• A core set of data has to be small enough and very representative for a given analysis
  - most analyses don’t need access to all data → split data into datasets
  - easily manageable day-to-day jobs
  - enable full tests of analysis components before running on the full statistics
  - allow prioritisation of data analysis
CMS Data Volume
• Estimated rates / data volumes for 2009/10:
  - 70 days of running
  - 300 Hz rate in the physics stream → 2.3×10⁹ events
  - assume 26% mean overlap
• 3.3 PB RAW data (1.5 MB/evt): detector data + L1 and HLT results after online formatting
• 1.1 PB RECO (0.5 MB/evt): reconstructed objects with their associated hits
• 220 TB AOD (0.1 MB/evt): main analysis format: clusters, tracks, particle ID
• Multiple re-reco passes
• Data placement:
  - RAW/RECO: one copy across all T1s, disk1tape1
  - Sim RAW/RECO: one copy across all T1s, on tape with a 10% disk cache
  - AOD: one copy at each T1, disk1tape1
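The quoted volumes follow directly from the running time, trigger rate, overlap and per-event sizes on this slide. A minimal back-of-the-envelope check in Python; the inputs are taken from the slide, the script itself is only illustrative:

```python
# Rough cross-check of the 2009/10 data-volume estimates quoted above.
SECONDS_PER_DAY = 86_400

days    = 70      # days of running
rate_hz = 300     # physics-stream rate
overlap = 0.26    # assumed mean overlap between datasets

events = rate_hz * days * SECONDS_PER_DAY * (1 + overlap)   # ~2.3e9 events

sizes_mb = {"RAW": 1.5, "RECO": 0.5, "AOD": 0.1}             # MB per event
for tier, size in sizes_mb.items():
    volume_pb = events * size / 1e9                          # MB -> PB
    print(f"{tier:4s}: {volume_pb:5.2f} PB")
# Comes out close to the quoted 3.3 PB RAW, 1.1 PB RECO, 0.22 PB AOD.
```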
Tier-0 / Tier-1 / Tier-2
• T0: prompt reconstruction (24/24), FEVT storage, data distribution
• T1: data storage, processing (re-reco, skims, AOD extraction), raw data access, Tier-2 support, data serving to T2s
• T2: analysis, MC production, specialised support tasks, local + group use
T1/T2 Associations
• Associated Tier-1: hosts MC production + serves as the reference for AOD serving
  - full AOD sample at the Tier-1 (after T1→T1 transfers for re-recoed AODs)
• Stream “allocation” ~ available disk storage at the centre
T2 – PAG, POG, DPG Associations
(figure: association of Tier-2 sites to the physics analysis, physics object and detector performance groups)
Data Processing Guidelines
• We aim for prompt reconstruction and analysis → reduce backlogs
• We need the possibility of prioritisation
  - cope with backlogs without delaying critical data
  - prompt calibration using low-latency data
• We are using data streaming based on trigger bits → need to understand the trigger and event selection
  - early event classification allows later prioritisation
  1. Express stream of: ‘hot’ physics events, calibration events, Data Quality Monitoring (DQM) events
  2. Physics stream: propose O(7) ‘primary datasets’, immutable but allowed to overlap
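A minimal sketch of the streaming idea described above: an event's fired trigger bits decide whether it goes to the express stream and which physics-stream primary datasets it lands in. The trigger-bit and dataset names here are invented for illustration, not the actual CMS trigger table:

```python
# Illustrative routing of events into streams / primary datasets by trigger bit.
# Trigger-bit and dataset names are hypothetical.
EXPRESS_BITS = {"HLT_HotPhysics", "HLT_AlCa", "HLT_DQM"}

PD_MAP = {                      # trigger bit -> primary dataset(s); overlap allowed
    "HLT_Jet50":  ["JetMET"],
    "HLT_Mu9":    ["Muon"],
    "HLT_Ele15":  ["Electron"],
    "HLT_Photon": ["EGamma"],
}

def route(fired_bits):
    """Return (goes_to_express, set of primary datasets) for one event."""
    express = bool(EXPRESS_BITS & fired_bits)
    datasets = {pd for bit in fired_bits for pd in PD_MAP.get(bit, [])}
    return express, datasets

print(route({"HLT_Mu9", "HLT_DQM"}))   # (True, {'Muon'})
```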
Commissioning / Express Streams → T1s
(figure: flow of the commissioning and express streams to the Tier-1 centres)
Trigger – Dataset Connection
• The goal: create datasets based on triggers for specific physics objects
  - datasets distributed to central storage at T2s
  - run skims on those datasets (or a skim of a skim …)
• Purpose / benefits of this kind of dataset:
  - group similar events together to facilitate data analyses
  - data separation (streaming) is based on trigger bits → no need for additional processing
  - triggers are persistent and datasets resilient
  - recognized drawback: events are not selected with optimal reconstruction & AlCa
• Obvious consequences:
  - every trigger has to be assigned to at least one dataset
  - increasing dataset overlaps → increasing storage requirements
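The two consequences above lend themselves to a simple bookkeeping check: every trigger must map to at least one dataset, and events matching several datasets are stored several times. A hedged sketch with an invented assignment, only the logic follows the slide:

```python
# Illustrative consistency check for a trigger -> dataset assignment.
trigger_to_datasets = {                  # hypothetical assignment
    "HLT_Jet50": {"JetMET"},
    "HLT_MET35": {"JetMET"},
    "HLT_Mu9":   {"Muon"},
    "HLT_MuJet": {"Muon", "JetMET"},     # deliberately duplicated -> overlap
}

# Every trigger must be assigned to at least one dataset.
unassigned = [t for t, ds in trigger_to_datasets.items() if not ds]
assert not unassigned, f"triggers without a dataset: {unassigned}"

def storage_factor(events):
    """events: list of sets of fired triggers; returns average copies stored per event."""
    copies = sum(len({d for t in ev for d in trigger_to_datasets[t]}) for ev in events)
    return copies / len(events)

# Overlap pushes the factor above 1: here (1 + 2 + 2) / 3 ≈ 1.67 copies per event.
print(storage_factor([{"HLT_Jet50"}, {"HLT_MuJet"}, {"HLT_Mu9", "HLT_MET35"}]))
```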
The Plan
1. Primary Datasets (PD)
   - immutable, based on triggers, split at the T0
   - RAW/RECO/AOD PDs distributed (limited) to central storage at T1s
2. Secondary Datasets (SD)
   - produced from PDs at T1s by DataOps and distributed to central storage at T2s
   - RECO or AOD format, trigger based as well
3. Central skims
   - produced at T1s by DataOps
   - very few initially, for key applications that cannot be served otherwise
4. Group skims
   - run on datasets stored at T2s; flexibility in the choice of event content, but provenance must be maintained
   - approved by group conveners; a tool allowing them to be registered in the global DBS is expected, tested in the October exercise!
   - subscribable to group space
5. User analysis skims
   - a dataset that is no more than one skim away from provenance
From PD to SD & Skims
• The goal: provide easy access to interesting data
• A PD could be quite large → reduce the amount of data to easily manageable sizes
• Secondary Datasets (SD)
  - each SD is a centrally produced subset of one PD using trigger info … not more than ~30% of its events
  - RECO format initially, later AOD
• Central skims
  - produced centrally at T1s, 1-2 per PD → 10% of the most interesting events or a uniform subset (prescale 10)
• Group skims
  - designed by groups, run on any data at a T2
  - could also be run on a PD if it is at the T2
  - ideally with central skims as input → 1% of the PD … manageable day-to-day
• User skims: final skims by individual users for their needs
• Procedures tested during the October exercise
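The fractions quoted above (SD ≤ ~30% of the PD, central skims ~10%, group skims ~1%) chain into progressively smaller samples. A minimal sketch of the bookkeeping; the fractions come from the slide, the PD size is purely hypothetical:

```python
# Rough size bookkeeping for the PD -> SD -> central skim -> group skim chain.
pd_events = 200_000_000                 # hypothetical primary-dataset size (events)

sd_events    = pd_events * 0.30         # SD: not more than ~30% of the PD
central_skim = pd_events * 0.10         # central skim: ~10% of the PD (or prescale 10)
group_skim   = pd_events * 0.01         # group skim: ~1% of the PD, manageable day-to-day

for name, n in [("SD", sd_events), ("central skim", central_skim), ("group skim", group_skim)]:
    print(f"{name:13s}: {n:12,.0f} events")
```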
Physics Objects (example only)
• Physics objects are not well balanced in size → combine or split them for balance … keep overlaps low
Object DS → Primary DS
• For rate equalization:
  - some are too big → split
  - some are too small → merge
• From object datasets to PDs:
  - splitting based only on trigger bits
  - merge correlated triggers
  - keep unprescaled triggers together
  - allow duplication of triggers if meaningful from the physics point of view
• The example here is for L >> 8E29; for L = 8E29 only 8 PDs are needed
(J. Incandela)
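One way to read the "split the big ones, merge the small ones" rule is as a simple rate-equalization pass over the object datasets. The rates and thresholds below are invented; only the merge/split logic mirrors the slide:

```python
# Toy rate equalization: merge object datasets below a minimum rate,
# flag those above a maximum rate for splitting by trigger bit.
rates_hz = {"Jet": 90.0, "MultiJet": 30.0, "Bjet": 2.0, "Muon": 45.0, "LepX": 1.5}
MIN_HZ, MAX_HZ = 10.0, 60.0             # hypothetical balance window

too_small = {k: v for k, v in rates_hz.items() if v < MIN_HZ}
too_big   = {k: v for k, v in rates_hz.items() if v > MAX_HZ}

# Merge each small dataset into the smallest dataset above threshold, as a
# stand-in for "merge Bjet with MultiJet", "absorb LepX into the lepton datasets".
target = min((k for k in rates_hz if k not in too_small), key=rates_hz.get)
print("merge", list(too_small), "into", target)     # merge ['Bjet', 'LepX'] into MultiJet
print("split by trigger bit:", list(too_big))        # split ['Jet']
```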
Object DS → Primary DS (2)
• Bjet and Lepton+X datasets have very small rates
  - Bjet: merged with the MultiJet dataset
  - LepX: these are combined-object triggers → split & absorbed into the 2 relevant lepton datasets → the same trigger appears in 2 datasets
1st Iteration of Secondary Datasets (8E29 Primary Datasets)
For each Secondary Dataset (SD) the following is specified:
- Dataset/skim name
- Data format (RECO, AOD, reduced AOD, etc.)
- Prescale with respect to the PD data
- Rate in Hz
- Fraction of events with respect to the total parent PD
- Fraction of disk with respect to the PD size (RECO assumed)
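The six attributes listed for each SD map naturally onto a small record type. A minimal sketch; field names and the example values are chosen here for illustration, not taken from any official schema:

```python
from dataclasses import dataclass

@dataclass
class SecondaryDataset:
    """Bookkeeping fields listed on the slide for each SD (illustrative names)."""
    name: str               # dataset/skim name
    data_format: str        # RECO, AOD, reduced AOD, ...
    prescale: int           # prescale with respect to the parent PD
    rate_hz: float          # rate in Hz
    event_fraction: float   # fraction of events w.r.t. the parent PD
    disk_fraction: float    # fraction of disk w.r.t. the PD size (RECO assumed)

# Hypothetical values, for illustration only.
example = SecondaryDataset("DiJetAve15", "reduced RECO", 1, 5.0, 0.12, 0.04)
```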
Secondary Datasets, L = 8E29: Jet SDs
• Low-pT jet triggers, already prescaled at HLT, are further prescaled at SD level
• Full DiJetAve15 statistics, needed for the JEC, are stored in a reduced format
• Keep 50 GeV as the lowest unprescaled single-jet threshold
• Full statistics for DiJetAve30, again needed for the JEC … reduced event size
• Keep 35 GeV as the lowest unprescaled MET trigger
• Also keep events from the Btag and HSCP triggers
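"Further prescaled at SD level" simply means keeping one event in N offline, on top of any HLT prescale. A minimal sketch of such a filter; the prescale value of 10 is an assumption for the example:

```python
def prescaled(events, prescale):
    """Yield every prescale-th event; prescale=1 keeps everything."""
    for i, ev in enumerate(events):
        if i % prescale == 0:
            yield ev

# e.g. prescale an already-HLT-prescaled low-pT jet trigger by a further 10 at SD level
sd_events = list(prescaled(range(100), prescale=10))
print(len(sd_events))   # keeps 10 of the 100 input events
```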
Summary
• The goal of the CMS data distribution model is to make data access easier, more reliable and more efficient
• Many components are already in place:
  - the Tier structure
  - T2 associations and data transfer tools
  - trigger tables
  - Primary Datasets and a 1st iteration of SDs based on triggers
• PDs & SDs in standard formats will be distributed to T2s (AOD, RECO if possible)
• Central and group skims run on PDs accessible at T2s and are more manageable
• No restrictions on group and user skims … even if special data formats are required
• A post-mortem analysis of the October exercise, where all of this was tested before real data taking, is in progress