A first look at CitusDB & in-database physics analysis

M. Limper, 19/06/2014
Introduction
  • Physics Analysis is currently file-based
  • Scanning through large datasets can be cumbersome
    • The idea: send jobs to the computing grid
    • In practice: bored of waiting for grid-jobs to finish, scientists filter datasets, throwing away data until it fits on the physicist's laptop
  • What if we could provide access to large datasets via a database?
Introduction

In-database physics analysis:

  • SQL goes in, results come out!

[Plot: invariant-mass spectrum showing the J/ψ and ψ(3686) peaks]
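As an illustration of that workflow, a dimuon invariant-mass histogram like the one sketched above could come out of a single query. The sketch below is mine, not from the slides: it assumes a hypothetical muon table muon203779_c with one row per muon, an in-event index column mu_i, four-momentum columns mu_E/mu_px/mu_py/mu_pz and a mu_charge column (all names illustrative, following the el_* naming pattern), and plain-PostgreSQL functions; whether this CitusDB version handles such a self-join is untested.

-- Sketch: histogram of the opposite-sign dimuon invariant mass in 50 MeV bins
-- (table and column names are hypothetical; ATLAS ntuples store momenta in MeV)
SELECT floor(sqrt(power(m1.mu_E  + m2.mu_E , 2)
                - power(m1.mu_px + m2.mu_px, 2)
                - power(m1.mu_py + m2.mu_py, 2)
                - power(m1.mu_pz + m2.mu_pz, 2)) / 50) * 50 AS mass_bin_mev,
       count(*) AS n_pairs
FROM muon203779_c m1
JOIN muon203779_c m2
  ON m1.EventNumber = m2.EventNumber
 AND m1.mu_i < m2.mu_i                 -- count each pair once
 AND m1.mu_charge * m2.mu_charge < 0   -- opposite-sign pairs only
GROUP BY 1 ORDER BY 1;

The J/ψ would then show up as an excess around the 3097 MeV bins, the ψ(3686) around 3686 MeV.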

My data

Test sample of 127 ntuple-files of collision-data recorded by the ATLAS experiment

=> a subset representing 3 fat 'LHC runs', ~0.5% of the total dataset

  • 7.1 million events total
  • 6022 “branches” per event
    • 2053 “scalar”-type branches
    • 3527 “vector”-type branches
    • 379 “vector-of-vector”-type branches
    • 63 “vector-of-vector-of-vector”-type branches
  • ~200 GB of data
Ntuple branch examples
  • 2053 scalar-type variables

Missing Energy: one value per branch per event

Float_t MET_RefFinal_em_etx;
Float_t MET_RefFinal_em_ety;
Float_t MET_RefFinal_em_phi;
Float_t MET_RefFinal_em_et;
Float_t MET_RefFinal_em_sumet;
Float_t MET_RefFinal_etx;
Float_t MET_RefFinal_ety;
Float_t MET_RefFinal_phi;

Event Filter: one value per branch per event

Bool_t EF_2b55_loose_j145_j55_a4tchad;
Bool_t EF_2e12Tvh_loose1;
Bool_t EF_2e5_tight1_Jpsi;
Bool_t EF_2e7T_loose1_mu6;
Bool_t EF_2e7T_medium1_mu6;
Bool_t EF_2g15vh_medium_g10_medium;
Bool_t EF_2g20vh_medium;

Lots of variables, but a relatively small fraction of the total dataset

Ntuple branch examples
  • 3527 vector-type variables

One value per electron per event

vector<float> *el_E;
vector<float> *el_Et;
vector<float> *el_pt;
vector<float> *el_m;
vector<float> *el_eta;
vector<float> *el_phi;
vector<float> *el_px;
vector<float> *el_py;
vector<float> *el_pz;
vector<float> *el_charge;
vector<int> *el_author;

One value per muon per event

vector<unsigned short> *mu_allauthor;
vector<int> *mu_author;
vector<float> *mu_beta;
vector<float> *mu_isMuonLikelihood;
vector<float> *mu_matchchi2;
vector<int> *mu_matchndof;
vector<float> *mu_etcone20;
vector<float> *mu_etcone30;
vector<float> *mu_etcone40;
vector<float> *mu_nucone20;
vector<float> *mu_nucone30;
vector<float> *mu_nucone40;

One value per photon per event

vector<float> *ph_CaloPointing_eta;
vector<float> *ph_CaloPointing_sigma_eta;
vector<float> *ph_CaloPointing_zvertex;
vector<float> *ph_CaloPointing_sigma_zvertex;
vector<float> *ph_HPV_eta;
vector<float> *ph_HPV_sigma_eta;
vector<float> *ph_HPV_zvertex;
vector<float> *ph_HPV_sigma_zvertex;
vector<int> *ph_NN_passes;
vector<float> *ph_NN_discriminant;

Representing the bulk of the data (many particles per event!). Analysis relies heavily on filtering events by selecting particles with certain properties.
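For instance, such a selection might look like the sketch below, assuming the conversion program (described later) flattens each vector-type branch into one row per particle keyed by (EventNumber, el_i); the table name, the el_i index column and the cut values are illustrative only.

-- Sketch: select events with at least two good electron candidates
-- (el_pt in MeV; table name and index column el_i are hypothetical)
SELECT EventNumber
FROM electron203779_c
WHERE el_pt > 25000            -- pT above 25 GeV
  AND abs(el_eta) < 2.47       -- within tracking acceptance
GROUP BY EventNumber
HAVING count(*) >= 2;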

Ntuple branch examples
  • 379 vector-of-vector type variables

One value per ‘SpaceTime’-measurement on each muon per event:

vector<vector<int> > *mu_SpaceTime_detID;
vector<vector<float> > *mu_SpaceTime_t;
vector<vector<float> > *mu_SpaceTime_tError;
vector<vector<float> > *mu_SpaceTime_weight;

One value per vertex per photon per event:

vector<vector<float> > *ph_vx_px;
vector<vector<float> > *ph_vx_py;
vector<vector<float> > *ph_vx_pz;
vector<vector<float> > *ph_vx_E;
vector<vector<float> > *ph_vx_m;
vector<vector<int> > *ph_vx_nTracks;

Used for certain reconstruction performance studies

To be stored in CLOB or separate table…
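The separate-table option could mirror the foreign-table pattern used for the other data (see the create-table slide further on). A minimal sketch, with the table name and the st_i index column as my own assumptions:

-- Sketch: one row per SpaceTime-measurement, keyed back to its muon
CREATE FOREIGN TABLE mu_spacetime203779_c (
    RunNumber   INTEGER NOT NULL,
    EventNumber INTEGER NOT NULL,
    mu_i INT NOT NULL,  -- index of the muon within the event
    st_i INT NOT NULL,  -- index of the measurement on that muon
    "mu_SpaceTime_detID"  INT,
    "mu_SpaceTime_t"      FLOAT,
    "mu_SpaceTime_tError" FLOAT,
    "mu_SpaceTime_weight" FLOAT
)
DISTRIBUTE BY APPEND (EventNumber)
SERVER cstore_server OPTIONS(filename '', compression 'pglz');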

Ntuple branch examples
  • 63 vector-of-vector-of-vector-type variables

One value per track per vertex per photon per event:

vector<vector<vector<int> > > *ph_vx_convTrk_nSiHits;
vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTHitsRatio;
vector<vector<vector<float> > > *ph_vx_convTrk_TRTHighTOutliersRatio;
vector<vector<vector<float> > > *ph_vx_convTrk_eProbabilityComb;

Not using any of these in my queries; typically used for final corrections or certain in-depth studies of reconstruction performance

To be stored in CLOB or separate table…

Converting ntuples to tables
  • Self-made program to convert ntuples into database tables
  • One physics-object is represented by one table
  • Each table still has hundreds of columns!
SQL analysis

SQL analysis involves predicate filtering to select good objects and JOINs to put information from different tables together.

The CitusDB + column-store extension looks interesting:

  • Object selection involves only a few out of many columns => would benefit from column storage
  • When the preselection passes many objects, JOINs can potentially become huge => would benefit from sharding, with shard-distribution based on EventNumber to reduce the JOIN-size
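Putting both points together, a typical analysis query might look like the following sketch; the electron/muon table names are hypothetical (only the eventdata columns appear in the actual create-table statement later on), and the cut values are illustrative.

-- Sketch: predicate filtering plus JOINs across object tables;
-- sharding on EventNumber keeps the joined rows of one event together
SELECT el.EventNumber, el.el_pt, mu.mu_matchchi2
FROM electron203779_c el
JOIN muon203779_c mu ON mu.EventNumber = el.EventNumber
JOIN eventdata203779_c e ON e.EventNumber = el.EventNumber
WHERE el.el_pt > 20000         -- object selection touches few of the many columns
  AND mu.mu_etcone20 < 4000    -- => column storage only reads what is needed
  AND e."collcand_pass";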
Storing ntuple-data into CitusDB
  • Re-wrote my program to store data in CitusDB
    • Read data from all branches with a specific prefix
    • Write data as comma-delimited values to a temporary csv-file
    • After the csv-file passes 5000 lines of data, store the data into CitusDB
      • Program triggers a command-line call to psql to execute a psql-macro
      • The psql-macro uses the \STAGE command to load the data
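A minimal sketch of that load step; the macro file name, csv path and psql invocation are hypothetical, and the \STAGE syntax follows the statements shown on the later slides:

-- stage_eventdata.sql, executed by the conversion program as e.g.:
--   psql -d citusdb -f stage_eventdata.sql
\STAGE eventdata203779_c FROM '/data1/citus_db/csv/tmp_eventdata.csv' (FORMAT CSV)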
Create table statement

CREATE FOREIGN TABLE eventdata203779_c (
  RunNumber INTEGER NOT NULL, EventNumber INTEGER NOT NULL, lbn INTEGER NOT NULL,
  "bunch_configID" INT, "timestamp" INT, "timestamp_ns" INT, "bcid" INT,
  "detmask0" INT, "detmask1" INT, "actualIntPerXing" FLOAT, "averageIntPerXing" FLOAT,
  "pixelFlags" INT, "sctFlags" INT, "trtFlags" INT, "larFlags" INT, "tileFlags" INT,
  "fwdFlags" INT, "coreFlags" INT, "pixelError" INT, "sctError" INT, "trtError" INT,
  "larError" INT, "tileError" INT, "fwdError" INT, "coreError" INT,
  "streamDecision_Egamma" BOOLEAN, "streamDecision_Muons" BOOLEAN,
  "streamDecision_JetTauEtmiss" BOOLEAN, "isSimulation" BOOLEAN,
  "isCalibration" BOOLEAN, "isTestBeam" BOOLEAN,
  "el_n" INT, "v0_n" INT, "ph_n" INT, "mu_n" INT, "tau_n" INT, "trk_n" INT,
  "jet_n" INT, "vxp_n" INT, "top_hfor_type" INT,
  "Muon_Total_Staco_STVF_etx" FLOAT, "Muon_Total_Staco_STVF_ety" FLOAT,
  "Muon_Total_Staco_STVF_phi" FLOAT, "Muon_Total_Staco_STVF_et" FLOAT,
  "Muon_Total_Staco_STVF_sumet" FLOAT, "Muon_Total_Staco_STVF_top_etx" FLOAT,
  "Muon_Total_Staco_STVF_top_ety" FLOAT, "Muon_Total_Staco_STVF_top_phi" FLOAT,
  "Muon_Total_Staco_STVF_top_et" FLOAT, "Muon_Total_Staco_STVF_top_sumet" FLOAT,
  "mb_n" INT, "collcand_passCaloTime" BOOLEAN, "collcand_passMBTSTime" BOOLEAN,
  "collcand_passTrigger" BOOLEAN, "collcand_pass" BOOLEAN
)

DISTRIBUTE BY APPEND (EventNumber)

SERVER cstore_server OPTIONS(filename '', compression 'pglz');

  • Create foreign tables stored using column-store extension
  • Distribute shards by EventNumber
  • Keep data from the same event together
  • Facilitate joins between different tables
  • One table per RunNumber: distributing shards by (RunNumber, EventNumber) is not possible
\STAGE statement

\STAGE eventdata203779_c (RunNumber,EventNumber,lbn,"bunch_configID","timestamp","timestamp_ns","bcid","detmask0","detmask1","actualIntPerXing","averageIntPerXing","pixelFlags","sctFlags","trtFlags","larFlags","tileFlags","fwdFlags","coreFlags","pixelError","sctError","trtError","larError","tileError","fwdError","coreError","streamDecision_Egamma","streamDecision_Muons","streamDecision_JetTauEtmiss","isSimulation","isCalibration","isTestBeam","el_n","v0_n","ph_n","mu_n","tau_n","trk_n","jet_n","vxp_n","top_hfor_type","Muon_Total_Staco_STVF_etx","Muon_Total_Staco_STVF_ety","Muon_Total_Staco_STVF_phi","Muon_Total_Staco_STVF_et","Muon_Total_Staco_STVF_sumet","Muon_Total_Staco_STVF_top_etx","Muon_Total_Staco_STVF_top_ety","Muon_Total_Staco_STVF_top_phi","Muon_Total_Staco_STVF_top_et","Muon_Total_Staco_STVF_top_sumet","mb_n","collcand_passCaloTime","collcand_passMBTSTime","collcand_passTrigger","collcand_pass") FROM '/data1/citus_db/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV)

  • Gives: \copy: ERROR:  copy column list is not supported
  • I can’t define columns when inserting into foreign tables using \STAGE
  • Too bad: I found it useful to specify the columns, as different ntuples can contain different branches; if a column is not specified in the csv, it should insert null
  • Similarly, I’d like an option to add columns (is this possible? Didn’t look at it yet)
  • Instead I’ll use the simple \STAGE command:

\STAGE eventdata203779_c FROM '/data_citusdb/csv/NTUP_TOPEL.00872780.NTUP_TOPEL.00872780._000001.root.1.eventdata.csv' (FORMAT CSV)
Primary Key issues
  • I’d like to set a primary key on (RunNumber, EventNumber, ObjectNumber)
  • Ntuple-files occasionally store the same event twice
  • Due to the way the experiment records data in ‘streams’, there is some overlap between different streams
  • Sorting out doubles is yet another hassle for physicists to deal with; a database with a primary key ensures unique events are stored… but:
  • Currently, a \STAGE insert of an entire .csv-file fails when it finds 1 double among the 50000 lines
  • Work-around = no primary-key constraint for now…
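Until then, duplicates could be filtered at query time instead; a sketch using plain-PostgreSQL DISTINCT ON (whether CitusDB distributes this form efficiently is untested here):

-- Sketch: keep one row per (RunNumber, EventNumber) at query time
SELECT DISTINCT ON (RunNumber, EventNumber) *
FROM eventdata203779_c
ORDER BY RunNumber, EventNumber;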
Other issues

While testing, I frequently decide to recreate some tables, but how do I drop a FOREIGN TABLE including its shards?
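One candidate, based on the Citus documentation for append-distributed tables (untested against this CitusDB version, so treat it as an assumption): delete the shards first with master_apply_delete_command, then drop the table itself.

-- Sketch: delete all shards of the append-distributed table, then the table
SELECT master_apply_delete_command('DELETE FROM eventdata203779_c');
DROP FOREIGN TABLE eventdata203779_c;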

Storing ntuple-data into CitusDB

Example of insert-program churning through the data…

Query test

15:34 Do I still have time to test something before the call??

To-do
  • Insert all my data:
  • I need to find a good way to use my 18 disks per node (mystery errors were coming from my raid setup)
  • Maybe I just mount each disk separately and run one worker per disk? (see the config sketch below)
  • Get some queries going!
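A sketch of what the one-worker-per-disk idea could look like in CitusDB's pg_worker_list.conf on the master node, assuming one worker postgres instance per mounted disk on the same host; the hostname, ports and disk paths are illustrative:

# pg_worker_list.conf (hypothetical): one "hostname port" entry per worker,
# each worker's data directory placed on its own disk
worker-node-1 9700
worker-node-1 9701
worker-node-1 9702
# ... one entry per disk, 18 in total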