
Design and Performance of the CDF Experiment Online Control and Configuration System



  1. Design and Performance of the CDF Experiment Online Control and Configuration System William Badgett, Fermilab for the CDF Collaboration 2006 Computing in High Energy and Nuclear Physics Conference Online Computing Session 2, OC-2, Id 363 February 13, 2006 Mumbai, India

  2. Introduction: CDF Online Configuration and Control
  • CDF Run IIa & b status
  • Brief overview of the CDF DAQ
  • System configuration and conditions
  • Overview of the online databases
  • Hardware database and API
  • Run Control; run configurations & conditions db
  • Operational experience during data taking
  • Performance, availability
  • Conclusions
  • Wish list…

  3. Tevatron Upgrades, Run IIa & b
  • Run IIa (2001-2005), goal 2 fb^-1
  • New Main Injector accelerator gives a ×5 increase
  • Recycler gives a further ~2 to 3× increase by preserving anti-protons; first physics use in 2004!
  • More frequent bunch crossings: 396 ns spacing gives 36 bunches
  • Higher beam energy: ~980 GeV (from 900 GeV)
  • Peak luminosity goal 2×10^32 cm^-2 s^-1
  • Run IIb (2005-2009? LHC?), goal 15 fb^-1
  • Electron cooling, crossing angle, anti-proton intensity, electron lens: ~2 to 3× increase
  • Peak luminosity goal 3.3×10^32 cm^-2 s^-1
  • Trigger and DAQ upgrades

  4. Collecting Luminosity
  • Red: delivered by the Tevatron, 1.55 fb^-1; Blue: recorded by CDF (live), 1.25 fb^-1
  • Data samples can be further reduced by detector malfunctions, according to the event selection
  • Nominal "good" data taking starts around March 2002
  • Data collection now greatly exceeds CDF Run I, with increased detector sensitivity as well

  5. Improving the Beam
  • Luminosity continues to improve…
  • [Plots: Tevatron peak luminosity to date, compared with the Run I peak and the planning goals for Run IIb]

  6. Data Acquisition Overview
  • Front end VME crates digitize, time, etc.; a subset of the data is split off and sent to the trigger over optical fibres
  • The Trigger Supervisor controls the entire operation, acting as the communications hub between DAQ and trigger
  • The Event Builder collects event fragments and forwards them to the Level 3 trigger farm for the final decision
  • Level 3 farms (commercial Linux boxes) do offline-style processing and cuts
  • The Data Logger sends data to the computing center tape robots and sends a fraction to disk and to the online monitors
  • Plus monitoring and control messages published over ethernet

  7. Operational Efficiency
  • Sources of down time:
  • Beam losses too high
  • High voltage trips
  • Detector malfunctions
  • Beam time calibrations
  • DAQ or trigger malfunction
  • Pipeline jump (sync)
  • Hardware failure
  • Software crash or system failure: database, RunControl
  • Trigger/DAQ deadtime
  • Human error
  • Efficiency improves with time, then becomes asymptotic, with the last percentage points exponentially more difficult to recover…
  • The silicon tracker is particularly sensitive to beam losses and has experienced damage from problematic beam aborts

  8. Detecting & Fixing DAQ Errors (control and configuration messages, data and status messages flow between the components below)
  • RunControl: fast recovery; crate reset access; starting and stopping runs
  • FrontEnd crates: check crate data consistency every L2 accept (fast); send a data mini-bank
  • FrontEnd Monitors: regular status and heart-beat messages from the crates
  • Event Builder: assembles the full data event record
  • Level 3 Trigger: uses the full event to find data acquisition errors, in addition to the physics triggers
  • ConsumerServer: logs data on disk and tape, fans out event samples to the DAQ consumers
  • DAQ Error Consumer: verifies an error and determines its source; constructs the error message
  • ConsumerError to Online Interface: converts from offline to online message format and forwards errors to all online monitors
  • ErrorHandler: processes the many error sources and sends a recommended reset or run recovery action to RunControl
  • Error recovery is normally completely automatic; build in redundancy and constant cross checking: "What goes around, comes around…"
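
A minimal Java sketch of the automatic-recovery decision described above; the real ErrorHandler weighs many error sources, while this toy simply maps an error report onto a recommended RunControl action, and all names are illustrative rather than the actual CDF code:

```java
// Sketch only: map an error report to a recommended recovery action.
public class ErrorHandlerSketch {
    enum Action { IGNORE, HALT_RECOVER_RUN, RESET_CRATE, ABORT_RUN, CALL_EXPERT }

    /** Recommend a recovery action from an error's source, severity and history. */
    static Action recommend(String source, int severity, boolean repeating) {
        if (severity <= 1 && !repeating) return Action.IGNORE;                   // transient, just log it
        if (source.startsWith("CRATE") && !repeating) return Action.RESET_CRATE; // single crate out of sync
        if (severity <= 3) return Action.HALT_RECOVER_RUN;                       // fast Halt/Recover/Run cycle
        if (severity <= 4) return Action.ABORT_RUN;
        return Action.CALL_EXPERT;                                               // human intervention required
    }
}
```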

  9. Online Software
  • User interfaces, control and real-time monitoring:
  • Control and monitoring in Java, JDK v1_4_2 (Sun)
  • Commercial PCs running FNAL Scientific Linux 3.0.5
  • Not tied to a particular CPU, architecture or operating system
  • Oracle database v9.2 running on a Sun 450
  • Readout crate controllers: FrontEnd crates running VxWorks, C language; simplicity, close to the hardware
  • Level 3 and data monitoring: Linux, with the C++ offline Analysis Control framework; giving physicists a dangerous weapon

  10. Online Database Schemas
  • Hardware*: pseudo-static; slots, delays, basic timing; Δdata-style history tables
  • Run*: configurations for user selection in RunControl (INPUT tables); conditions for DAQ and trigger, rates, latencies, etc. (OUTPUT tables)
  • Trigger: trigger thresholds and algorithms; immutable physics objects
  • Calibration: detector characterization and correction constants
  • SlowControls: record the environmental state of the detector; voltages, temperatures, etc.
  *described in detail in the following slides

  11. Database Growth
  • Many application revisions at first to control exponential growth
  • Since then, steady growth except for extended shutdowns

  12. Database Availability
  • The CDF data acquisition operates in close coöperation with the online production database
  • CDF runs 24 hours per day, 7 days per week, even during Tevatron shutdown periods
  • Unscheduled downtimes can lose data; since March 2002:
  • 1 db disk failure where the RAID failed to fail over (!)
  • 1 db memory card failure
  • 1 big db "human error"
  • RunControl online Java API bugs, crashes
  • Maintenance downtimes are necessary but painful to schedule
  • Detector maintenance work requires the database and RunControl up & running

  13. DownTime Impact
  • Downtime events directly attributable to database or RunControl pathologies only (does not include configuration time triggered by external failures)
  • ΣL ~ 1.5 fb^-1

  14. CDF Database Replication
  • Use Oracle Streams replication:
  • automatic propagation of DML and DDL, in a leap-frog style, to an unlimited number of database instances
  • minimizes the load on the online and offline production instances
  • essentially instantaneous push of new data
  • Diagram: the online instance (Run, Hardware, Trigger, Calibration, SlowControl) is replicated to the offline production instance and an offline user replica (each holding Run, Hardware, Trigger, Calibration, FileCatalog/SAM), with a color key distinguishing read+write from read-only schemas; the rest of the world gets access directly or via additional instances, remote SAM stations and the FroNTier cache
  • Online clients: L3 trigger, web servlets, shift crew electronic logBook (!), RunControl, monitors, calibrations, consumers
  • Offline clients: offline production farms, luminosity calculations, user analysis farms, general database web browser

  15. Hardware Database
  • Need a complete image of the configuration data loaded by RunControl; ~30 seconds to load
  • Updates at a low rate, but critical for operations
  • Core tables and Java classes describe all electronics: crates, cards, etc.
  • All updates to core tables are logged in history tables automatically via database triggers; the tables grow steadily with time
  • Java classes read the incremental updates before runs, and use reflection methods to update the core data image on the fly, quickly and transparently, in less than a few milliseconds (see the sketches below)
  • This is a flexible and unified design, used for all detector components at CDF!
  • Every second counts when configuring a run!
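
A minimal sketch, under assumed table and column names, of reading only the history rows written since the last load instead of the full hardware image; the actual CDF schema and API differ, this just shows the Δdata idea:

```java
// Sketch only: pull incremental history rows newer than the last load time.
import java.sql.*;

public class DeltaReaderSketch {
    /** Fetch history rows newer than lastLoad and hand each one to the image updater. */
    static void readDeltas(Connection db, Timestamp lastLoad) throws SQLException {
        String sql = "SELECT crate_name, slot, column_name, new_value "
                   + "FROM hdw_cards_history WHERE change_time > ? ORDER BY change_time";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setTimestamp(1, lastLoad);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // each row is one changed attribute of one card; apply it to the
                // in-memory image (see the reflection sketch under the Java API slide)
                System.out.printf("%s slot %d: %s -> %s%n",
                        rs.getString(1), rs.getInt(2), rs.getString(3), rs.getString(4));
            }
        }
    }
}
```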

  16. Hardware Database Java API
  • Electronics card inheritance tree rooted at hdwdb.Card: e.g. hdwdb.Tracer (boards to configure), and hdwdb.BankCard with hdwdb.AdMem and hdwdb.AdMemTof (boards to read out)
  • Image object containment tree: hdwdb.Crate (static Hashtable) contains a Hashtable of hdwdb.Card, each containing a Hashtable of hdwdb.Channel
  • Incremental updates from the history table are used together with Java reflection to dynamically update the Java data image in milliseconds
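
A minimal sketch, not the actual hdwdb API, of using Java reflection to apply one incremental history-table update to an in-memory card image; it assumes the convention that a history column name matches a public field name on the card object:

```java
// Sketch only: apply one history-table delta to the in-memory image via reflection.
import java.lang.reflect.Field;
import java.util.Hashtable;

public class CardImageUpdater {
    // crate name -> (slot -> card image object), mirroring the containment tree above
    private static final Hashtable crates = new Hashtable();

    /** Apply one delta: set field `column` of the card in (crate, slot) to `value`. */
    public static void applyDelta(String crate, Integer slot,
                                  String column, Object value) throws Exception {
        Hashtable cards = (Hashtable) crates.get(crate);
        if (cards == null) return;                   // crate not part of this run's image
        Object card = cards.get(slot);
        if (card == null) return;                    // slot not populated
        Field f = card.getClass().getField(column);  // reflection lookup by column/field name
        f.set(card, value);                          // update the image in place, no full reload
    }
}
```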

  17. Hardware Database Web Interface
  • Light-weight Apache+Tomcat servlets for browsing the hierarchical database structures
  • Dynamic links point to other database objects
  • Read-only policy on the web for security reasons
  • Write access requires Kerberos authentication to get inside the firewall
  • Screenshots: crate hardware database details with the contained cards' data; real-time crate data acquisition status
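
A minimal sketch of a read-only browsing servlet in the spirit described above; the class name, JDBC URL, account and table names are all illustrative, not the actual CDF servlets or schema:

```java
// Sketch only: one page per crate, one dynamic link per contained card.
import java.io.PrintWriter;
import java.sql.*;
import javax.servlet.http.*;

public class CrateBrowserServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws java.io.IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        String crate = req.getParameter("crate");           // e.g. ?crate=TRAC01 (illustrative)
        out.println("<html><body><h2>Crate " + crate + "</h2><ul>");
        try (Connection c = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521:cdfonl", "reader", "readonly");
             PreparedStatement ps = c.prepareStatement(
                 "SELECT slot, card_type FROM hdw_cards WHERE crate_name = ? ORDER BY slot")) {
            ps.setString(1, crate);
            ResultSet rs = ps.executeQuery();
            while (rs.next())                                // hyperlink to each card's own page
                out.println("<li><a href=\"card?crate=" + crate + "&slot=" + rs.getInt(1)
                            + "\">slot " + rs.getInt(1) + ": " + rs.getString(2) + "</a></li>");
        } catch (SQLException e) {
            out.println("<li>database error: " + e.getMessage() + "</li>");
        }
        out.println("</ul></body></html>");
    }
}
```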

  18. CDF RunControl
  • Central control program directing, configuring and synchronizing the actions of ~150 clients
  • Real-time Java multi-threaded application, approximately ten threads at any one time
  • SmartSockets™, commercial TCP/IP name services, for communication to and from clients in a publish/subscribe model
  • Provides the run configuration for the hardware and software clients
  • Closely linked to the database, describing hardware, run options, calibration constants, trigger table, etc.
  • Front-line monitoring and error reporting for the DAQ system
  • Works with ErrorHandler, an auxiliary process logging errors and making informed decisions on recovery procedures, automatic and human intervention

  19. CDF RunControl StateManager
  • The user initiates transitions between the different states
  • The goal is to stay in the Active state until the run is complete, taking recovery actions as necessary
  • Extensibility of the object-oriented design: easy to implement any other diagram, e.g. TDC testing, source runs
  • Ported for use by the FNAL fixed-target program with few changes
  • Ideas for the transitions and state flow diagrams, cf. the ZEUS experiment RunControl, Chris Youngman et al.

  20. Transitions (a state flow sketch follows this slide)
  • Partition: select front end crates and clients for the run; configure trigger and return crosspoints
  • Config/Setup: configure crates and clients with information that could change run by run, without adding or subtracting RC clients (slowest transition); most work is done here!
  • Activate: final step to enable the system to take data; fast
  • End: normal end of run; produces end-of-run summaries
  • Abort: return to Idle when no other option is available
  • Pause/Resume: briefly stop data taking (HV trips, flying wires, inhibits)
  • Halt/Recover/Run: fast system error recovery, the first option to use when an error occurs during data taking; critical to maintaining operational efficiency
  • Reset: return to the Start state from Idle, or when no other options are available
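
An illustrative Java sketch of the state flow these transitions walk through; the state and transition names follow the slide, but the class and the exact flow are simplified, not the real CDF StateManager:

```java
// Sketch only: map a requested transition to the resulting state.
public class StateFlowSketch {
    enum State { START, IDLE, PARTITIONED, CONFIGURED, ACTIVE, PAUSED, HALTED, ERROR }

    /** A real StateManager would also check that the current state `s` allows the transition. */
    static State apply(State s, String transition) {
        switch (transition) {
            case "Partition": return State.PARTITIONED; // select crates/clients, set up trigger paths
            case "Config":    return State.CONFIGURED;  // run-by-run configuration, slowest step
            case "Activate":  return State.ACTIVE;      // enable data taking, fast
            case "Pause":     return State.PAUSED;      // HV trip, flying wires, inhibits
            case "Resume":    return State.ACTIVE;
            case "Halt":      return State.HALTED;      // first option on an error during data taking
            case "Recover":   return State.CONFIGURED;  // ...then "Run" returns to ACTIVE
            case "Run":       return State.ACTIVE;
            case "End":       return State.IDLE;        // normal end of run, summaries produced
            case "Abort":     return State.IDLE;        // when nothing else works
            case "Reset":     return State.START;
            default:          return State.ERROR;       // unknown transition request
        }
    }
}
```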

  21. Typical Transition Performance
  • Slow pokes: L3 distribution, Silicon Vertex Trigger, Muon-Track Trigger (L3+SVT+μ)
  • The L3 Config time shows a tail when the calibration or trigger executable is not cached
  • Source: the large L3 farm distribution, and the large trigger look-up tables
  • Need social engineering for each transition-time improvement
  • Pathological tails: remote client software crashes, etc.
  • The client reply time is plotted; the RunControl setup time itself is < ~1 sec

  22. RunConfiguration Selector (the Run database, visualization)
  • Select from predefined run configurations organized hierarchically in folders related to function
  • Each entry represents a set of relational entries in several RunConfiguration database tables, mapped onto an object (Java and C++) using container objects to express the relations (see the sketch below)
  • Contents change from run to run
  • Human-readable and selectable RunConfigurations are flexible and non-binding
  • RunConditions contains a copy when a run is executed
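
A hedged sketch of the "relational rows mapped onto a container object" idea; the class name and fields are illustrative, not the real CDF Run database API:

```java
// Sketch only: a run configuration as a container object built from several relations.
import java.util.ArrayList;
import java.util.List;

public class RunConfigurationSketch {
    public final String name;          // human-readable, selectable configuration name
    public final int triggerTableId;   // points into the trigger database
    public final List<String> crates = new ArrayList<>();  // front end crates included in the run

    public RunConfigurationSketch(String name, int triggerTableId) {
        this.name = name;
        this.triggerTableId = triggerTableId;
    }

    /** One call per row of a (hypothetical) crate-selection relation. */
    public void addCrate(String crateName) { crates.add(crateName); }
}
```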

  23. Graphical Representation of the RunConfiguration Object
  • The global DAQ RunType is coupled to the Trigger Table; the Run database in turn points to entities in the trigger, calibration, and hardware databases
  • Front end crate selection: move a crate to the left to include it, or to the right to exclude it
  • The tabbed panes contain detailed information about the RunConfiguration
  • Java Tomcat servlets provide a web-browsable version from anywhere

  24. Run Database Schema (subset of the whole run schema)
  • Run Configurations, "input" tables: configure the DAQ according to the type of run, and record it for posterity
  • Run Conditions, "output" tables: record settings, trigger rates, luminosity and background rates, run quality status, etc.

  25. Configuration Messages Structure
  • rc.ConfigMess: sent to every client, with the destination specified; contains global common variables (runNumber, runType, etc.)
  • rc.ReadoutRun: sent to every client with readout to perform; carries the list of banks
  • rc.ReadoutList and its detector-component-specific subclasses carry the detailed configuration: rc.phys.COTReadoutList, rc.phys.MuonReadoutList, rc.phys.CalReadoutList, rc.phys.CalSmxrReadoutList
  • Collate information from the Hardware, Run, Trigger, and Calibration databases
  • Class inheritance as needed according to the type of client (electronics crate or software server application, L3 trigger, etc.)
  • The desired message is picked up dynamically from the Hardware database
  • Java classes generate the C code and headers automatically
  • A unified system avoids much duplicated work!!!
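
A hedged sketch of the message inheritance named on this slide; the fields and the nested-class layout are illustrative, not the real rc.* classes:

```java
// Sketch only: class inheritance tailors one base message to each kind of client.
public class ConfigMessageSketch {
    /** Base configuration message, sent to every client with its destination set. */
    static class ConfigMess {
        String destination;   // which client this copy is addressed to
        int runNumber;        // global common variables shared by all clients
        String runType;
    }
    /** Sent to every client that has readout to perform. */
    static class ReadoutRun extends ConfigMess {
        String[] banks;       // list of data banks to read out
    }
    /** Generic readout list; detector-specific subclasses add their own details. */
    static class ReadoutList extends ReadoutRun { }
    static class COTReadoutList  extends ReadoutList { int[] tdcDelays; }      // illustrative field
    static class MuonReadoutList extends ReadoutList { int[] adcThresholds; }  // illustrative field
}
```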

  26. Real Time Monitoring (Java Swing)
  • Publish/subscribe based monitoring allows easy-to-read monitor panels, arrayed around the control room
  • Monitors may be run anywhere, and also produce HTML web files
  • Examples: Status Summary monitor, Tevatron Loss monitor, crate VxWorks monitor, Rate Monitor and Dynamic Prescaler
  • And, of course, panic situations give voice alarms, too

  27. Data Acquisition / Control Room The primary Data Acquisition consoles: RunControl, online monitoring

  28. Web Based Monitoring, RunSummary
  • http://www-cdfonline.fnal.gov/ (follow RunSum and related links)
  • Run summary pages are dynamically produced, with almost every quantity hyperlinked; many of the links draw plots of the quantity of interest, and link to error logs and all run settings
  • ROOT is used for the plotting
  • Publicly accessible!

  29. Freeware Experience
  • Java experience has been quite positive
  • Easy to build complex programs without the headaches of C and C++
  • Extensibility of Java classes has proven invaluable
  • All CDF RunControl and monitoring applications can run anywhere; not reliant on a particular CPU or operating system
  • 100% availability so far
  • JDK/Linux releases: Sun is phasing out v1_4_2 support
  • Downsides, when you really push Java:
  • It's not really platform independent! Various subtle differences (threads, look & feel)
  • The Java Virtual Machine is a complicated creature, with sometimes mysterious and impossible-to-debug behaviour and crashes

  30. Operating Systems
  • Linux experience also positive
  • Linux disk and web servers are reliable
  • Very difficult (impossible?) to get our programs to crash the operating system
  • Perhaps Linux can replace Sun for the database system; testing in the offline realm has so far been positive
  • But we miss that VMS system API (!)
  • We still have not made the leap to an Oracle database on Linux for the critical servers…
  • Cannot argue with success: unscheduled database downtime is extremely rare
  • Offline replicas on Linux are in good shape

  31. Commercial Software Experience
  • Oracle Database:
  • Generally impervious to crashes; robust, reliable
  • Fulfills our database and communications needs
  • Oracle provides a nice support forum (but see below)
  • Downsides: money $$, lots of it; many people fear it; can't see the source, but you probably wouldn't want to
  • SmartSockets (Talarian/TIBCO):
  • Remarkably good performance for a centralized TCP communications server
  • Features and support sometimes lacking
  • Downsides: again, money $$$, and the price of a single client license keeps going up; a small company with a short lifespan; in this case, you probably would like to see the source code; crashes on VxWorks we cannot debug
  • But beware false economies!

  32. Wish List
  • Cross-experiment and cross-lab development of software could be quite beneficial in some common areas:
  • IP message passing (multi-platform, multi-language)
  • Database servers (!)
  • …other software?
  • Virtually every experiment needs such beasts, but the effort is often duplicated
  • Avoid expensive licenses with no source code access
  • Should be tailored to HEP requirements, with continuing support (everything is always in development!)
  • PAW, ROOT, and data handling have been successes in common tools
  • Hearing murmurs… what's out there?

  33. Conclusions
  • CDF is running well, taking data during Tevatron Run II, 2001 through 2009
  • We have designed and implemented a set of database schemas and associated Java APIs to configure and control the CDF online data acquisition system in real time
  • Through object-oriented programming, we have created a powerful and flexible approach to run configurations that is used by all components of the experiment
  • A suite of web-interfaced control and monitoring software has been developed; the shift crew's job is now easier and more efficient
  • Through replication, web interfaces and offline database hooks, we have an extensible database available to users world-wide

  34. Backup Slides: CDF Related Topics

  35. Resource Allocation
  • Multiple RunControls can run simultaneously: partitions
  • The Resource Manager controls ownership of the front end crates and other virtual resources (a bookkeeping sketch follows this slide)
  • Allocation is recorded centrally in the Hardware Database
  • Real-time database event notifications keep all clients informed
  • A Java monitoring thread listens to the events and updates the object images
  • Screenshot: real-time Java color-coded display representing device allocation
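
A minimal bookkeeping sketch of resource ownership by partition; the real Resource Manager records allocations in the Hardware Database and relies on its event notifications, whereas this toy only keeps an in-memory map:

```java
// Sketch only: which partition owns which front end crate or virtual resource.
import java.util.concurrent.ConcurrentHashMap;

public class ResourceManagerSketch {
    // resource name (e.g. a front end crate) -> owning partition number
    private final ConcurrentHashMap<String, Integer> owners = new ConcurrentHashMap<>();

    /** Try to allocate a resource to a partition; false if someone else owns it. */
    public boolean allocate(String resource, int partition) {
        return owners.putIfAbsent(resource, partition) == null;
    }

    /** Release everything owned by a partition, e.g. when its RunControl exits. */
    public void releasePartition(int partition) {
        owners.values().removeIf(p -> p == partition);
    }
}
```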

  36. DAQ Performance

  37. Bandwidth Usage Maximization: Dynamic Prescaling
  • As the luminosity decreases during a store, the trigger rates also decrease
  • To maximize the usage of the DAQ bandwidth, automatically lower the prescales of Level 1 triggers to increase the trigger rate during a data acquisition run, within bounds (see the sketch below)
  • Used for the Level 1 two-track trigger (B physics), ~85% of the Level 1 bandwidth
  • Heavily prescaled at the start of a run for safety
  • Plots: Level 1 trigger rate (triggers per second) and Level 1 trigger cross section (trigger counts normalized by luminosity); red arrows indicate changes of the prescale values; the run is paused, the hardware set, and the run resumed
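
An illustrative sketch of the dynamic-prescaling idea: as the raw trigger rate falls with luminosity, the prescale is lowered (within bounds) so the accepted rate stays near the bandwidth budget; all numbers here are invented, not CDF's actual limits:

```java
// Sketch only: pick a prescale that keeps rawRate / prescale near the target rate.
public class DynamicPrescaleSketch {
    static final double TARGET_RATE_HZ = 20000.0;  // assumed Level 1 bandwidth budget
    static final int MIN_PRESCALE = 1;             // take every trigger at low luminosity
    static final int MAX_PRESCALE = 250;           // safety ceiling at the start of a run

    /** Choose a prescale so that rawRateHz / prescale stays near the target. */
    static int choosePrescale(double rawRateHz) {
        int p = (int) Math.ceil(rawRateHz / TARGET_RATE_HZ);
        return Math.max(MIN_PRESCALE, Math.min(MAX_PRESCALE, p));
    }
}
```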

  38. Complete Operational Efficiency
  • Efficiency factors:
  • Intrinsic system limits: instantaneous deadtime; limited by system throughput performance; adjusted through physics choices via trigger cuts
  • Accelerator beam quality: losses prevent detector operation, trips and tolerances; little or no experimental control
  • Operational downtimes: starting and stopping runs; failures of services (e.g. the database server); detector malfunctions; data acquisition and trigger electronics malfunctions; test runs, beam time calibrations; human errors… and others

  39. Efficiency Tabulation, to date
  • Downtime occurrences are automatically tabulated and linked to the shift crew's electronic logbook, for each DAQ run and Tevatron store
  • Browse and group by category, lost time, and lost luminosity; over a years-long time scale the category assignments proliferate, so the operational utility is on the small time scale
  • …several smaller categories suppressed

  40. Efficiency Tabulation, intrinsic
  • Category totals as on the previous slide; intra-run downtime below
  • Intrinsic dead time during data acquisition runs; runs too small to process; net efficiency

  41. Client MicroManagement
  • Each client is monitored continuously for participation in the run and for possible errors
  • Each client has its own individual control panel; complete resets and recovery are one-touch; all configuration and response information is available here
  • The client status window indicates the transition status of the clients:
  • Butter yellow: RC has not sent the transition
  • Margarine yellow: RC has sent the transition and is waiting for an acknowledgment
  • Green: the client sent a successful acknowledgment
  • Red: the client sent an error

  42. State Management
  • RunControl maintains synchronization of activities through the StateManager and its flow
  • The basic functionality is expressed in the base class StateManager
  • Different run types require different control flows; specific StateManagers inherit from the base class and extend it as necessary (see the sketch below), e.g. a TDC testing diagram, or calorimeter radioactive source runs, which require source motion control transitions
  • Configuration messages are also easily extensible according to the needs of individual detectors
  • Avoid duplicating lots of work! There's only one RunControl at CDF
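
A hedged sketch of the base-class/extension pattern described above; the method names and the source-run example are illustrative, not the actual CDF StateManager API:

```java
// Sketch only: a base flow shared by all run types, specialized by inheritance.
public abstract class StateManagerSketch {
    /** The basic flow shared by every run type. */
    public void runSequence() {
        partition();
        configure();
        activate();
    }
    protected abstract void partition();
    protected abstract void configure();
    protected abstract void activate();
}

/** A specialized flow, e.g. calorimeter source runs, adds extra transitions. */
class SourceRunStateManager extends StateManagerSketch {
    protected void partition() { /* select the calorimeter crates only */ }
    protected void configure() { /* load source-run constants */ }
    protected void activate()  { moveSource(); /* then enable data taking */ }
    private void moveSource()  { /* extra source motion control transition */ }
}
```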
