Summary of the First Database Survey

Summary of the First Database Survey J.N. Butler Oct. 11, 2001

Goals of Survey • Get people to begin to think about the requirements in this area -- nothing final or “binding” • Get some idea of scope of needs, sizes of databases, access patterns to begin the discussion of which DBMS’s to use • Explore commonalities

Arbitrary Categories 1. Detector construction tracking databases 2. Calibration databases 3. Configuration databases 4. Monitoring databases 5. Event-related databases 6. Analysis and Simulation support databases 7. Documentation and general information databases

Summary of Questions • purpose of the database • source of the data -- where is the information that populates the database generated? • quantity of the data (e.g. number of channels, data per channel, frequency of the data for each channel, etc) • Anticipated use of the data/access patterns

Summary of Groups Responding • Pixel • EMCAL • RICH • Muon • Forward Silicon • Forward Straws • Data Acquisition System • Trigger
The individual group responses and some contributed opinions are on the web page. Also on the web page are WORD documents with the responses organized by category rather than by group.

Summary of Detector Construction Tracking Databases • All construction subprojects plan to have them • They are all likely to be small, <~1 Gbyte bfi (before inflation) • They will be used to track components and subassemblies, to do trending of yields, selection of acceptable components, matching components to their environment, etc. • Basic output is probably response on terminal or paper • Query rate will not be high • Personal comment: People will probably demand a GUI for data entry and queries

Muon Response
For each ASDQ channel (~75K): • 2 measures of input voltage (+/-) • 2 measures of output voltage (+/-) • 1 threshold • 5 measures of output width • 2 measures of calibration pulse, for a total of 12 values/channel
For each tube (~75K channels): • Wire used • Delrin plug used • Brass pin used • Who strung the tube, when, and where • What the tension is; who measured it, when, and where • What the efficiency is; who measured it, when, and where
In principle the tension and efficiency for a tube can be measured multiple times, but for most tubes this will be a once-off during
For each ASDQ chip we store: • Serial number • Lot • Position in card • Delivery date
For each plank (~2500 items) we store: • Its serial number (barcode) • Its length • Who made it, where, when
THERE ARE LINKS FROM AN OBJECT TO ITS PARENTS
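The muon group's requirement of links from an object to its parents, combined with measurements that can be repeated, could be modeled roughly as follows. This is only a minimal sketch using SQLite from Python; the table names, column names, serial formats, and the assumption that a tube's parent is a plank are all illustrative and do not come from the survey responses.

```python
import sqlite3

def build_tracking_db():
    """Create an in-memory construction-tracking database (illustrative schema)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE component (
            id        INTEGER PRIMARY KEY,
            kind      TEXT NOT NULL,         -- e.g. 'plank', 'tube', 'asdq_chip'
            serial    TEXT NOT NULL UNIQUE,  -- barcode or serial number
            parent_id INTEGER REFERENCES component(id)  -- link to parent object
        );
        CREATE TABLE measurement (
            id           INTEGER PRIMARY KEY,
            component_id INTEGER NOT NULL REFERENCES component(id),
            quantity     TEXT NOT NULL,      -- 'tension' or 'efficiency'
            value        REAL NOT NULL,
            measured_by  TEXT,
            measured_at  TEXT,               -- ISO date string
            location     TEXT
        );
    """)
    return con

def ancestors(con, serial):
    """Follow parent links upward; returns parent serials, nearest first."""
    rows = con.execute("""
        WITH RECURSIVE chain(id, serial, parent_id, depth) AS (
            SELECT id, serial, parent_id, 0 FROM component WHERE serial = ?
            UNION ALL
            SELECT c.id, c.serial, c.parent_id, chain.depth + 1
            FROM component c JOIN chain ON c.id = chain.parent_id
        )
        SELECT serial FROM chain WHERE depth > 0 ORDER BY depth
    """, (serial,)).fetchall()
    return [r[0] for r in rows]
```

Keeping measurements in their own table means a tube measured once and a tube measured several times are stored the same way, at the cost of a join when reporting.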

EMCAL Construction Construction database with 1) purpose: tracking parts, such as the status of the crystal, PMT, and electronics for each channel, position in the detector, and cable interconnections 2) source of data: crystal status from the quality control DB; PMT and electronics from some other DB or directly from the testing setup; some info may come from an operator 3) quantity of data: 100 bytes x 23000 channels x several stages (testing, installation, going from, going to, etc.), total ~23 Mbytes 4) anticipated use of the data/access patterns: reports; later may be used for repairs (which cable goes to which channel of electronics)
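The EMCAL size estimate can be checked with a few lines of arithmetic. The sketch below uses the quoted 100 bytes per record and 23000 channels; the stage count of 10 is an assumption inferred from the quoted ~23 Mbytes total, since the response only says "several stages".

```python
BYTES_PER_RECORD = 100   # from the EMCAL response
CHANNELS = 23_000        # from the EMCAL response

def construction_db_bytes(stages):
    """Total size if every channel has one record per tracked stage."""
    return BYTES_PER_RECORD * CHANNELS * stages

per_stage = construction_db_bytes(1)    # 2,300,000 bytes, i.e. 2.3 Mbytes per stage
assumed_stages = 10                     # assumption: implied by the ~23 Mbytes total
total = construction_db_bytes(assumed_stages)
print(f"{per_stage / 1e6:.1f} Mbytes per stage, {total / 1e6:.0f} Mbytes in total")
```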

Summary of Calibration Databases • People tended to look upon this as meaning initial calibrations, some of which would be done on the bench, or with cosmic rays or with test beam or initial beam exposure, and perhaps repeated at fixed, not too frequent intervals. • Forward silicon saw all calibration done through DAQ, which is consistent with their experience • Trigger and DAQ need access to calibration databases for initialization of hardware and programs, but do not themselves have needs in this area

Configuration Databases • Gets updated once a new calibration indicates there is a need to change threshold values • Definite relationship with the Calibration Constants database, which is the source of data needed to compute a new set of initialization constants • Possible relationship with the Monitor Values database: the latter gets its reference values from here to check against monitored ones • Possible relationship with the Detector Construction database: initialization might be done only on components declared installed by the latter database • Size will depend on how often the configuration needs to change • Change of configuration may need to be quick once it is determined to be necessary • Open questions: when to make a change, and how to know when it has occurred

Monitoring Databases • The detector groups saw this as mostly monitoring temperatures, pressures, flow rates and • Monitoring gains, thresholds, pedestals, drift speeds, high voltage, current draw, and occupancy • Trigger and DAQ were interested in data rates, trigger rates, event sizes and • physics indicators, numerous distributions that could be used to track performance from individual channels through fairly high level physics quantities -- like specific “golden B decay modes” • Total database sizes are not large except for the last category. Stability will determine recording frequency

Event Related Databases We plan for 40 billion events/year. This is a rate of 200 Mbytes/beam-second. An event catalog would probably be pretty hard to search. If we use file catalogs for the archived data, how many files might we have? The answer, for raw data, comes to 20,000 files/year x 100 Gbytes/filesize (in Gbytes). Each file would hold 2 million events x filesize (in Gbytes)/100 Gbytes. So there would be of order 100,000 entries/year in a “files” database for the raw data if we used 20 Gbyte files. The data on a file might not represent events taken in an approximately contiguous time, but might be scrambled; if they were contiguous, 100 Gbytes would represent ~8 minutes of running. Assume three times this many files for physics analysis and simulation datasets.
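The file-catalog estimate can be reproduced as a short calculation. The sketch below takes the quoted figures as inputs (40 billion events/year, 200 Mbytes/beam-second, 20,000 files/year at 100 Gbytes); the average event size and the number of beam-seconds per year are not stated in the talk and are only derived here from those figures, with 1 Gbyte taken as 10^9 bytes.

```python
EVENTS_PER_YEAR = 40e9            # quoted
RATE_BYTES_PER_BEAM_SECOND = 200e6  # quoted
FILES_PER_YEAR_AT_100GB = 20_000  # quoted
GB = 1e9                          # assumption: decimal Gbytes

def files_per_year(file_size_gb):
    """Raw-data files per year for a given file size."""
    return FILES_PER_YEAR_AT_100GB * 100 / file_size_gb

def events_per_file(file_size_gb):
    """Events held by one raw-data file of the given size."""
    return 2e6 * file_size_gb / 100

def minutes_of_running(file_size_gb):
    """Beam time represented by one file if its events were contiguous."""
    return file_size_gb * GB / RATE_BYTES_PER_BEAM_SECOND / 60

raw_bytes_per_year = FILES_PER_YEAR_AT_100GB * 100 * GB              # 2e15 bytes
implied_event_bytes = raw_bytes_per_year / EVENTS_PER_YEAR           # derived: 50,000 bytes
implied_beam_seconds = raw_bytes_per_year / RATE_BYTES_PER_BEAM_SECOND  # derived: 1e7 s

raw_entries = files_per_year(20)   # catalog entries/year with 20 Gbyte files
extra_entries = 3 * raw_entries    # analysis and simulation datasets
print(raw_entries, extra_entries, round(minutes_of_running(100), 1))
```

With 20 Gbyte files this gives 100,000 raw-data entries per year plus three times that for analysis and simulation, so the "files" catalog stays small by database standards even though the data volume is large.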

Run State Database Another type of event is the run transition. For reconstruction purposes (including L2/L3), we need a record of the detector status and configuration at begin-run time, at the time when we enable the trigger (start run), etc. We could also include periodic updates. What is a RUN? Presumably it implies a period which has a stable, well-defined set of constants. Given that the events are only approximately ordered, how do we define such a period? How do we synchronize parameter changes?

Pathological Event Database This database would be used to locate events saved in the storage systems. The data for this database will come from the trigger system, which will tag events that seem unusual and should be studied. Pathological events can provide indications of trigger-system failures that were not caught by the monitoring system, bugs in trigger algorithms, failures in detector components, or hints of unanticipated physics signals.

Analysis and Simulation Support Databases • Analysis and simulation programs clearly need access to many of the other databases • There needs to be a run history database that they access • There will be an electronic logbook, using a database for information storage, with shift information and probably a streamlined version for use by reconstruction and analysis programs • Analysis History and Conditions database. It is very important to know exactly what program was run and with what inputs. For production jobs, we will probably follow the rule of having a status word on each event that records a “code” which defines the complete production code and environment (databases, etc.) used in the processing. For physics analysis, this is probably impractical. It may be that a standard use of the electronic logbook for this purpose can be adopted. This is not an easy problem.
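One way to picture the per-event status word is as a small integer that indexes a registry of complete processing environments. The sketch below is purely hypothetical: the class name, its methods, the program name, and the tag names are all invented for illustration and do not describe any BTeV design.

```python
class ProductionCodeRegistry:
    """Maps small integer codes to complete processing environments (hypothetical)."""

    def __init__(self):
        self._by_code = {}  # code -> environment description
        self._by_env = {}   # hashable environment key -> code

    def register(self, program, version, database_tags):
        """Return the code for this environment, allocating a new one if unseen."""
        env = (program, version, tuple(sorted(database_tags.items())))
        if env not in self._by_env:
            code = len(self._by_code) + 1
            self._by_env[env] = code
            self._by_code[code] = {
                "program": program,
                "version": version,
                "database_tags": dict(database_tags),
            }
        return self._by_env[env]

    def describe(self, code):
        """Look up what was run for the code stored in an event's status word."""
        return self._by_code[code]

registry = ProductionCodeRegistry()
code = registry.register("reco", "v1", {"calibration": "tagA", "geometry": "tagB"})
print(code, registry.describe(code))
```

Each event then carries only the integer, while the registry (itself a small database) records the program version and database tags needed to reproduce the processing.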

Summary of Document Databases • The purpose of the database is to have one central collection of all BTeV documentation. It will maintain a local copy of each document. Documents may be as simple as a picture of the detector. Among other things, we expect that presentations at our meetings will be entered into the database • Most data will be entered by hand via a web interface by members of the collaboration. • If we are entering small “documents”, then I expect that we could easily have 40K or so. I believe that CDF now has 5K documents using a more traditional definition of what a document is. Information to be stored for each document includes title, author list, location, category list (each document is allowed to reside in multiple categories), submission date, revision date, revision number, and abstract or short description. • This will be used continually by collaborators and other physicists wishing to find information about the experiment. All data must be backed up. The documents themselves will not reside in the database, but in a structured directory tree which must be accessible by the web server (and which also must be backed up). We may want to mirror this at a couple of sites (e.g. Italy) for ease of access. However, there should not be more than a handful of mirror sites so that maintenance does not become a headache.
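The per-document information listed above maps naturally onto a simple record type. The following is a minimal sketch, not a proposed schema: the field names are paraphrased from the list in the slide, and the helper function, file paths, and category names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set

@dataclass
class DocumentRecord:
    """Metadata only; the document file itself lives in the directory tree."""
    title: str
    authors: List[str]
    location: str              # path of the file under the web server's tree
    categories: Set[str]       # a document may reside in multiple categories
    submission_date: date
    revision_date: date
    revision_number: int = 1
    abstract: str = ""

def titles_in_category(records, category):
    """Return the titles of all documents filed under the given category."""
    return [r.title for r in records if category in r.categories]

docs = [
    DocumentRecord("Detector picture", ["A. Author"], "docs/0001/detector.png",
                   {"pictures", "general"}, date(2001, 10, 1), date(2001, 10, 1)),
    DocumentRecord("Database survey talk", ["B. Author"], "docs/0002/talk.ppt",
                   {"presentations", "general"}, date(2001, 10, 11), date(2001, 10, 11)),
]
print(titles_in_category(docs, "general"))
```

Storing only the location in the record matches the stated plan of keeping the files outside the database, so the metadata and the directory tree need to be backed up together.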

Conclusion and Next Steps • There will be many databases and most will be small in terms of disk size -- 1 to 100 Gbytes (bfi) • Storage is rarely a consideration, but speed of access for typical queries, user interface, and application program interface are issues • Classic database issues all apply • Use patterns are not clear, but BTeV is committed to facilitating local analysis and distributed analysis, so this will be a design consideration

Conclusion and Next Steps • Separation into categories is probably helpful, but monitoring and calibration probably need better delineation, and “static” vs “active” monitoring may be a useful distinction • The “category” view of this survey needs to be completed • We need to capture, as preliminary, whatever numerical information we have, although I think at this point it is not very precise • We need to write a document that can trigger another, more complete round of work • We need to think about what a “run” means and how parameters can be changed in an organized, trackable fashion • We need to at least define “requirement categories” such as speed of access for typical transactions, requirements for backup, mirroring, distribution, access rules and security, etc. • We have a lot of work ahead, but I think this survey provided a good beginning and I want to thank everyone who contributed

Arbitrary “Categories” • Detector construction tracking databases: These are used to track parts inventories, processing steps including the progress and location of subassemblies, quality assurance test results, etc. • Calibration databases: calibration constants determined in test beam runs, cosmic ray runs, from pulser data, or from event data (e.g. in situ calibrations). These could include pedestals, gains, start times, velocities, etc. • Configuration databases: These are constants that must be downloaded to initialize and run the system. These could include hardware configurations such as physical and logical addresses of modules, and such quantities as high voltages, thresholds, pedestals, masks (e.g. to suppress bad channels), trigger masks and trigger configurations/definitions and cuts, alignment parameters, etc. Some of these may be static and others may be determined from the calibration, monitoring or analysis systems, and may change each run or with some other frequency. • Monitoring databases: These include monitoring of environmental conditions such as temperatures, barometric pressures, luminosity monitoring, trigger rates, data sizes, pedestals, gains, alignments, physics quantities, and physics signals, tracked over the whole time of the experiment.

Arbitrary “Categories” • Event-related databases: At present, BTeV does not plan to have an event database but does plan to have an extensive metadata database, or data catalog, to locate events on the various storage systems. This would catalog datasets associated with raw and reconstructed data and many datasets generated for physics analysis. Simulation data sets would presumably be handled in the same catalog. The metadata catalog might hold information on each event that could be used by itself to make some high level selection of the data. • Analysis and Simulation support databases: These include the geometry database, and various run databases and simulation sample databases, databases associated with the control room logbook, computer processing databases (what has been processed through each step of the analysis chain), etc • Documentation and general information databases: These include catalogs of BTeV and external documentation needed by the group
