Design and Performance of the CDF Experiment Online Control and Configuration System

William Badgett, Fermilab

for the CDF Collaboration

2006 Computing in High Energy and Nuclear Physics Conference

Online Computing Session 2, OC-2, Id 363

February 13, 2006

Mumbai, India

Introduction

CDF Online Configuration and Control

  • CDF Run IIa & b Status
    • Brief overview of CDF DAQ
  • System Configuration and Conditions
    • Overview of Online Databases
    • Hardware Database and API
    • Run Control; Run Configurations & Conditions db
  • Operational experience during data taking
    • Performance, availability
  • Conclusions
  • Wish list…
Tevatron Upgrades Run IIa,b
  • Run IIa (2001-2005): Goal 2 fb^-1
    • New accelerator, the Main Injector, gives a ×5 increase
    • Recycler gives a further ~2 to 3× increase by preserving anti-protons

first physics use in 2004!

    • Shorter bunch spacing of 396 ns gives 36 bunches
    • Higher beam energy of ~980 GeV (up from 900 GeV)
    • Peak luminosity goal 2×10^32 cm^-2 s^-1
  • Run IIb (2005-2009? LHC?): Goal 15 fb^-1
    • Electron cooling, crossing angle, anti-proton intensity, electron lens: a further ~2 to 3× increase
    • Peak luminosity goal 3.3×10^32 cm^-2 s^-1
    • Trigger and DAQ Upgrades
    • Trigger and DAQ Upgrades
Collecting Luminosity

Red: Delivered by the Tevatron, 1.55 fb^-1

Blue: Recorded by CDF, 1.25 fb^-1 (Live)

Data samples can be further reduced by detector malfunctions according to event selection

Nominal “good” data taking starts around March 2002

Data collection now greatly exceeds CDF Run I, also with increased detector sensitivity

Improving the Beam

Peak luminosity to date:

Luminosity continues to improve…

Planning for Run IIb:

Compare to Run I peak:

Data Acquisition Overview

Front end VME crates digitize, time, etc.; a subset of the data is split off and sent to the trigger

Trigger Supervisor controls the entire operation, the communications hub between DAQ and trigger

Event Builder collects event fragments and forwards them to the Level 3 trigger farm for the final decision

Level 3 farms, commercial Linux boxes, do offline-style processing and cuts

Data Logger sends data to computing center tape robots; sends a fraction to disk and to online monitors

Plus monitoring and control messages published (Ethernet); event data travels over optical fibres

Operational Efficiency
  • Sources of down time
  • Beam losses too high
  • High Voltage trips
  • Detector malfunctions
  • Beam time calibrations
  • DAQ or Trigger malfunction
    • Pipeline jump (sync)
    • Hardware failure
    • Software crash or system failure, Database, RunControl
  • Trigger/DAQ deadtime
  • Human error

Improving with time, then becoming asymptotic, with the last percentage points becoming exponentially more difficult to recover…

  • The Silicon Tracker is particularly sensitive to beam losses
  • Has experienced damage from problematic beam aborts
Detecting & Fixing DAQ Errors

Control and configuration messages flow among the components below, with built-in redundancy and constant cross-checking; error recovery is normally completely automatic. “What goes around, comes around…”

  • RunControl: fast recovery; crate reset access; starting and stopping runs
  • FrontEnd Crates: regular status and heartbeat messages; a data mini-bank carries status into the event record
  • FrontEnd Monitors: check crate data consistency on every L2 accept (fast)
  • Event Builder: assembles the full data event record
  • Level 3 Trigger: uses the full event to detect data acquisition errors, in addition to the physics triggers
  • ConsumerServer / Data Logger: log data to disk and tape; fan out event samples to DAQ consumers
  • DAQ Error Consumer: verifies an error and determines its source; constructs the error message
  • ConsumerError to Online Interface: converts from offline to online message format and forwards errors to all online monitors
  • ErrorHandler: processes many error sources and sends a recommended reset or run-recovery action to RunControl
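The automatic error-recovery decision described above — the ErrorHandler collating error reports and recommending a reset or run-recovery action to RunControl — can be sketched as follows. All class, signature, and action names here are invented for illustration; they are not the actual CDF API.

```java
// Hypothetical sketch of an ErrorHandler policy table: map known error
// signatures to recommended recovery actions, escalating unknowns to a human.
import java.util.HashMap;
import java.util.Map;

public class ErrorHandlerSketch {
    public enum Action { RESET_CRATE, HALT_RECOVER_RUN, PAGE_EXPERT }

    // Illustrative policy: error signature -> recommended action.
    private static final Map<String, Action> POLICY = new HashMap<>();
    static {
        POLICY.put("PIPELINE_SYNC", Action.HALT_RECOVER_RUN); // pipeline jump
        POLICY.put("CRATE_TIMEOUT", Action.RESET_CRATE);      // heartbeat lost
    }

    /** Recommend an action; unknown error sources escalate to an expert. */
    public static Action recommend(String errorSignature) {
        return POLICY.getOrDefault(errorSignature, Action.PAGE_EXPERT);
    }
}
```

The point of the table-driven design is that most recoveries need no human intervention: RunControl simply executes the recommended transition.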

Online Software

User Interfaces, control and real-time monitoring

  • Control and Monitoring in Java JDK v1_4_2 (Sun)
  • Commercial PCs running FNAL Scientific Linux 3.0.5
  • Not limited to CPU, architecture or operating system
  • Oracle database v9.2 running on Sun 450

Readout Crate Controllers

  • FrontEnd crates running VxWorks, C language
  • Simplicity, close to hardware

Level 3 and Data Monitoring

  • Linux, with C++ offline Analysis Control framework
  • Giving physicists a dangerous weapon
Online Database Schemas
  • Hardware*
    • Pseudo-static, slots, delays, basic timing
    • Δdata style history tables
  • Run*
    • Configurations for user selection on RunControl
      • INPUT tables
    • Conditions for DAQ and Trigger, rates, latencies, etc.
      • OUTPUT tables
  • Trigger
    • Trigger thresholds and algorithms
    • Immutable physics objects
  • Calibration
    • Detector characterization and correction constants
  • SlowControls
    • Record the environmental state of the detector
    • Voltage, temperatures, etc.

*will describe in detail

Database Growth

Many application revisions at first to control exponential growth

Since then, steady growth except for extended shutdowns

Database Availability
  • The CDF Data Acquisition operates in close coöperation with the online production Database
  • CDF runs 24 hours per day, 7 days per week, even during Tevatron shutdown periods
  • Unscheduled downtimes can lose data, since March 2002:
    • 1 db disk failure where the RAID failed to fail over (!)
    • 1 db memory card failure
    • 1 big db “human error”
    • RunControl online Java API bugs, crashes
  • Maintenance downtimes necessary but painful to schedule
    • Detector maintenance work requires Database and RunControl up & running
DownTime Impact

DownTime events directly attributable to Database or RunControl pathologies only

(does not include configuration time triggered by external failures)

ΣL ~ 1.5 fb^-1

CDF Database Replication
  • Use Oracle Streams replication:
  • automatic propagation of DML and DDL in a leap-frog style to unlimited database instances
  • minimize load on online and offline production instances
  • essentially instantaneous push of new data

DB Color Key: Read+Write vs ReadOnly

Online (Read+Write): Run, Hardware, Trigger, Calibration, SlowControl

Offline Production (ReadOnly): Run, Hardware, Trigger, Calibration, FileCatalog/SAM

Offline User Replica (ReadOnly): Run, Hardware, Trigger, Calibration, FileCatalog/SAM

Access for the rest of the world, direct or via additional instances: remote SAM stations, FroNTier cache

  • L3 Trigger
  • Web servlets
  • ShiftCrew electronic logBook (!)
  • RunControl
  • Monitors
  • Calibrations
  • Consumers
  • Offline Production Farms
  • Luminosity calculations
  • User analysis farms
  • General database web browser
Hardware Database
  • Need a complete image of the configuration data, loaded by RunControl; ~30 seconds to load
  • Updates at a low rate, but critical for operations
  • Core tables and Java classes describe all electronics Crates, Cards, etc.
  • All updates to core tables logged in history tables automatically via database triggers; tables grow steadily with time
  • Java classes read incremental updates before runs, and use reflection methods to update core data image on the fly, quickly and transparently, < few milliseconds
  • This is a flexible and unified design, used for all detector components at CDF!
  • Every second counts when configuring a run!
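The incremental-update mechanism described above — Java classes reading history-table deltas and patching the in-memory core data image via reflection — can be sketched as follows. The class and column names (Card, slot, delay) are invented stand-ins, not the actual hdwdb schema.

```java
// Minimal sketch of a reflection-driven incremental update: one generic
// method can service every (table, class) pair because the history-table
// column name is used directly as the Java field name.
import java.lang.reflect.Field;
import java.util.Map;

public class ReflectionUpdate {
    /** Toy stand-in for one row of a core hardware table. */
    public static class Card {
        public int slot;
        public int delay;
    }

    /** Apply one history-table delta (column name -> new value) in place. */
    public static void applyDelta(Object image, Map<String, ?> delta) {
        try {
            for (Map.Entry<String, ?> e : delta.entrySet()) {
                Field f = image.getClass().getField(e.getKey()); // column == field
                f.set(image, e.getValue()); // patch the live image in place
            }
        } catch (ReflectiveOperationException ex) {
            throw new RuntimeException(ex);
        }
    }
}
```

Because only changed columns are touched, an update of this kind completes in well under a few milliseconds, consistent with the timing quoted on the slide.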
Hardware Database Java API

Electronics Card Inheritance Tree:

  • hdwdb.Card (boards to configure)
    • hdwdb.Tracer
    • hdwdb.BankCard (boards to readout)
      • hdwdb.AdMem
        • hdwdb.AdMemTof

Image object containment tree:

  • hdwdb.Crate (static Hashtable)
    • Hashtable of hdwdb.Card
      • Hashtable of hdwdb.Channel

Incremental updates from history table used together with Java reflection to dynamically update Java data image in milliseconds
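The inheritance and containment trees above can be expressed as a small class skeleton. This is a hedged simplification: the nested-class layout, field names, and Integer keys are illustrative choices, not the real hdwdb package.

```java
// Sketch of the hdwdb-style object image: a Crate holds Cards by slot,
// each Card holds Channels, and readout boards specialize a common Card base,
// mirroring hdwdb.Card -> hdwdb.BankCard -> hdwdb.AdMem.
import java.util.Hashtable;

public class HdwdbSketch {
    public static class Channel {
        public double pedestal; // illustrative per-channel constant
    }

    // Base class for boards to configure; boards to read out subclass it.
    public static class Card {
        public final Hashtable<Integer, Channel> channels = new Hashtable<>();
    }

    public static class BankCard extends Card {
        public String bankName; // readout boards carry a data-bank name
    }

    public static class Crate {
        // In the real system this is a static Hashtable of all crates.
        public final Hashtable<Integer, Card> cards = new Hashtable<>();
    }
}
```

Walking the containment tree from Crate down to Channel gives the complete configuration image that RunControl loads before a run.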

Hardware Database Web Interface
  • Light-weight Apache+Tomcat servlets for browsing hierarchical database structures
  • Dynamic links point to other database objects
  • Read-only policy on the web for security reasons
  • Write access requires Kerberos authentication to get inside the firewall

Crate hardware database details with contained cards data

Real time crate data acquisition status


CDF RunControl

RunControl

  • Central Control Program directing, configuring and synchronizing the actions of ~150 clients
  • Real-time Java multi-threaded application, approximately ten threads at any one time
  • SmartSockets™ commercial TCP/IP name services for communication to and from clients in a publish/subscribe model
  • Provides run configuration for the hardware and software clients
  • Closely linked to the database, describing hardware, run options, calibration constants, trigger table, etc.
  • Front line monitoring and error reporting for the DAQ system
  • Works with ErrorHandler, an auxiliary process logging errors and making informed decisions as to recovery procedures, automatic and human intervention
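SmartSockets itself is commercial, but the publish/subscribe pattern RunControl relies on can be shown with a minimal in-process sketch. Everything here (class name, subject strings, String payloads) is an illustrative assumption, not the SmartSockets API.

```java
// Minimal publish/subscribe bus: clients register interest in a named
// subject and receive every message later published to that subject,
// decoupling RunControl from its ~150 clients.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class PubSubSketch {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    /** Register a client callback for one subject. */
    public void subscribe(String subject, Consumer<String> client) {
        subscribers.computeIfAbsent(subject, k -> new ArrayList<>()).add(client);
    }

    /** Deliver a message to every subscriber of the subject. */
    public void publish(String subject, String message) {
        for (Consumer<String> c : subscribers.getOrDefault(subject, List.of())) {
            c.accept(message);
        }
    }
}
```

In the real system a central TCP name server does the routing between processes; the decoupling it buys — publishers need not know who is listening — is the same.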
CDF RunControl
  • StateManager
  • User initiates transitions between different states
  • Goal is to stay in the Active state until run is complete, taking recovery actions as necessary
  • Extensibility of the Object Oriented design:
  • Easy to implement any other diagram, e.g. TDC testing, source runs
  • Ported for use at FNAL Fixed Target program with few changes

Ideas for transitions and state flow diagrams: cf. the ZEUS Experiment RunControl, Chris Youngman et al.

Transitions
  • Partition: Select front end crates and clients for the run; configure trigger and return crosspoints
  • Config/Setup: Configure crates and clients with info that could change run by run, without adding or subtracting RC clients (slowest transition). Most work is done here!
  • Activate: Final step to enable system to take data, fast
  • End: Normal end of run, produces end of run summaries
  • Abort: Return to Idle when no other option available
  • Pause/Resume: Briefly stop data taking (HV trips, flying wires, inhibits)
  • Halt/Recover/Run: Fast system error recovery, first option to use when an error occurs during data taking; critical to maintaining operational efficiency
  • Reset: Return to Start state from Idle, or when no other options are available
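The transition list above amounts to a state machine. The sketch below encodes a plausible reading of it as an allowed-moves table; the state names and the exact edge set are inferred from the slide, not taken from the real StateManager code.

```java
// Transition flow as an allowed-moves table: Partition takes Idle to
// Partitioned, Config/Setup to Configured, Activate to Active; Pause and
// Halt/Recover leave and re-enter Active; End/Abort return to Idle.
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class TransitionSketch {
    public enum State { START, IDLE, PARTITIONED, CONFIGURED, ACTIVE, PAUSED, HALTED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.START, EnumSet.of(State.IDLE));
        ALLOWED.put(State.IDLE, EnumSet.of(State.PARTITIONED, State.START));      // Partition / Reset
        ALLOWED.put(State.PARTITIONED, EnumSet.of(State.CONFIGURED, State.IDLE)); // Config / Abort
        ALLOWED.put(State.CONFIGURED, EnumSet.of(State.ACTIVE, State.IDLE));      // Activate / Abort
        ALLOWED.put(State.ACTIVE, EnumSet.of(State.PAUSED, State.HALTED, State.IDLE)); // Pause/Halt/End
        ALLOWED.put(State.PAUSED, EnumSet.of(State.ACTIVE));                      // Resume
        ALLOWED.put(State.HALTED, EnumSet.of(State.ACTIVE));                      // Recover/Run
    }

    public static boolean canMove(State from, State to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }
}
```

Halt/Recover/Run being a short cycle back into Active, rather than a trip through Idle, is exactly why it is the first recovery option: it avoids repeating the slow Config/Setup work.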
Typical Transition Performance

L3+SVT+μ

  • Slow pokes
  • L3 distribution
  • Silicon Vertex Trigger
  • Muon-Track Trigger
  • *L3 Config tail when calibration or trigger executable not cached
  • Source:
  • Large L3 farm distribution, and large trigger look-up tables
  • Need social engineering for each transition time improvement

L3 farm*

Pathological tails: remote client software crashes, etc.

Client reply time plotted, RunControl setup time < ~ 1 sec

RunConfiguration Selector
  • Select from predefined run configurations organized hierarchically in folders related to function:
  • Each entry represents a set of relational entries in several RunConfiguration database tables, mapped onto an Object (Java and C++) using container objects to express relations
  • Contents change from run to run
  • Human readable and selectable RunConfigurations are flexible and non-binding
  • RunConditions contains copy when a run is executed

The Run Database, Visualization
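Mapping a set of relational RunConfiguration rows onto one container object, as described above, can be sketched minimally. The table and column names (RUN_TYPE, the crate-list rows) are hypothetical, chosen only to illustrate the relational-to-object step.

```java
// Sketch: rows from two related RunConfiguration tables are collected into
// a single Java container object that the rest of the system passes around.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RunConfigSketch {
    /** Container object expressing the relations between the tables. */
    public static class RunConfiguration {
        public String runType;
        public final List<Integer> crates = new ArrayList<>(); // child rows
    }

    /** Build one object from a parent row plus its child crate rows. */
    public static RunConfiguration fromRows(Map<String, String> configRow,
                                            List<Integer> crateRows) {
        RunConfiguration rc = new RunConfiguration();
        rc.runType = configRow.get("RUN_TYPE");
        rc.crates.addAll(crateRows);
        return rc;
    }
}
```

Because the object is rebuilt from the tables each time, the human-selectable configurations stay flexible and non-binding; the copy frozen into RunConditions is what records what actually ran.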

Graphical Representation of RunConfiguration object

Global DAQ RunType

Trigger Table, coupled

Run database in turn points to entities in the trigger, calibration, and hardware databases

Front end crate selection

Move to left to include, or right to exclude

Java Tomcat servlets provide a web-browsable version from anywhere

The tabbed panes contain detailed information about the RunConfiguration

Run Database Schema

(subset of whole run schema)

Run Conditions “Output” tables

Record settings, trigger rates, luminosity and background rates, run quality status, etc.

Run Configurations “Input” tables

Configure DAQ according to type of run and record for posterity

Configuration Messages Structure

rc.ConfigMess: sent to every client, with destination specified; contains the global common variables (runNumber, runType, etc.)

rc.ReadoutRun: sent to every client with readout to perform; carries the list of banks

rc.ReadoutList and its detector-specific subclasses carry component-specific configuration details: rc.phys.COTReadoutList, rc.phys.MuonReadoutList, rc.phys.CalReadoutList, rc.phys.CalSmxrReadoutList

  • Collate information from Hardware, Run, Trigger, and Calibration databases
  • Class Inheritance as needed according to type of client (electronics crate or software server application, L3 trigger, etc.)
  • Pick up desired message dynamically from Hardware database
  • Java classes generate C code and headers automatically
  • Unified system avoids much duplicated work!!!
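The message inheritance described above, and the dynamic pickup of the desired message class, can be sketched as follows. The class names echo the slide (rc.ConfigMess, rc.ReadoutRun, rc.phys.COTReadoutList) but the fields and the lookup mechanism are illustrative assumptions.

```java
// Sketch of the configuration-message hierarchy: a common base carries
// global run variables; subclasses add readout and detector-specific detail.
// The factory picks a message class by name, standing in for the dynamic
// selection driven by the Hardware database.
public class ConfigMessSketch {
    public static class ConfigMess {                 // to every client
        public int runNumber;
        public String runType;
    }

    public static class ReadoutRun extends ConfigMess { // clients with readout
        public String[] banks = new String[0];           // list of banks
    }

    public static class COTReadoutList extends ReadoutRun { // detector-specific
        public int tdcWindow;                                // illustrative field
    }

    /** Instantiate the desired message class dynamically, by name. */
    public static ConfigMess create(String className) {
        try {
            return (ConfigMess) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

One hierarchy serving crates, software servers, and the L3 trigger alike is what lets the Java classes also generate the matching C code and headers without duplicated work.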
Real Time Monitoring (java swing)

Status Summary

Monitors may be run anywhere, and also provide HTML web files

Tevatron Loss Monitor

Crate VxWorks Monitor

Publish/subscribe based monitoring allows implementation of easy to read monitor panels, arrayed around the control room

Rate Monitor and Dynamic Prescaler

And, of course, panic situations will give voice alarms, too

Data Acquisition / Control Room

The primary Data Acquisition consoles: RunControl, online monitoring

Web Based Monitoring, RunSummary

http://www-cdfonline.fnal.gov/

Follow RunSum and related links

Run summary pages are dynamically produced, with almost every quantity hyper-linked, with many of the links drawing plots of the quantity of interest

& links to error logs and all run settings

Root used for plotting

Publicly accessible!

Freeware Experience
  • Java Experience has been quite positive
    • Easy to build complex programs without headaches of C and C++
    • Extensibility of Java classes has proven invaluable
    • All CDF RunControl and monitoring applications can run anywhere! Not reliant on a CPU nor operating system!
      • 100% availability so far
    • JDK/Linux releases: Sun phasing out v1_4_2 support
    • Downsides, when you really push Java:
      • It’s not really platform independent! Various subtle differences (threads, look & feel)
      • Java Virtual Machine is a complicated creature, with sometimes mysterious and impossible to debug behaviour, crashes
Operating Systems
  • Linux experience also positive
    • Linux disk and web servers reliable
    • Very difficult (impossible?) to get our programs to crash the operating system
    • Perhaps Linux can replace Sun for the database system
      • Testing in the offline realm has so far been positive
    • But we miss that VMS system API (!)
  • But still have not made leap to Oracle database on Linux for critical servers…
    • Cannot argue with success – unscheduled database downtime extremely rare
    • Offline replicas on Linux in good shape
Commercial Software Experience

Commercial Software

Oracle Database

  • Generally impervious to crashes, robust, reliable
  • Fulfills our database and communications needs
  • Oracle provides a nice support forum (but see below)
  • Downsides
    • Money $$ Lots of it
    • Many people fear it
    • Can’t see the source; but you probably wouldn’t want to

SmartSockets (Talarian/TibCo)

  • Remarkably good performance for a centralized TCP communications server
  • Features and support sometimes lacking
  • Downsides
    • Again, Money $$$ The price of a single client license keeps going up
    • Small company, short lifespan
    • In this case, you probably would like to see the source code
    • Crashes on VxWorks that we cannot debug

But beware false economies!

Wish List
  • Cross-experimental and cross-lab development of software could be quite beneficial in some common areas:
    • IP message passing (multi-platform, multi-language)
    • Database servers (!)
    • …other software?
  • Virtually every experiment needs such beasts, but often effort is duplicated
  • Avoid expensive licenses, with no source code access
  • Should tailor to HEP requirements, and provide continuing support (everything is always in development!)
  • Paw, Root, and data handling, have been successes in common tools
  • Hearing murmurs … what’s out there?
Conclusions
  • CDF is running well, taking data during the Tevatron Run II, 2001 through 2009
  • We have designed and implemented a set of database schemas and associated Java APIs to configure and control the CDF online data acquisition system in real time
  • Through object oriented programming, we have created a powerful and flexible approach to run configurations that is used by all components of the experiment
  • A suite of control and monitoring software, web interfaced, has been developed; shift crew’s job is now easier and more efficient
  • Through replication, web interfaces and offline database hooks, we have an extensible database available to users world-wide
CDF Related Topics

Backup Slides

Resource Allocation

Real-time Java color-coded display

representing device allocation

  • Multiple RunControls can run simultaneously: Partitions
  • Resource Manager controls ownership of front end crates and other virtual resources
  • Allocation recorded centrally in the Hardware Database
  • Real-time database event notifications keep all clients informed
  • Java monitoring Thread listens to events and updates object images
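The Resource Manager bookkeeping described above — each front end crate or virtual resource owned by at most one partition at a time — can be sketched in a few lines. The crate name and in-memory table are illustrative; the real allocation is recorded centrally in the Hardware Database.

```java
// Sketch of Resource Manager allocation: a resource may be claimed by only
// one RunControl partition; a second claim fails until the owner releases it.
import java.util.Hashtable;

public class ResourceManagerSketch {
    // resource name -> owning partition number
    private final Hashtable<String, Integer> owner = new Hashtable<>();

    /** True if the resource was free and is now owned by this partition. */
    public synchronized boolean allocate(String resource, int partition) {
        return owner.putIfAbsent(resource, partition) == null;
    }

    /** Release only succeeds for the current owner (two-arg remove). */
    public synchronized void release(String resource, int partition) {
        owner.remove(resource, partition);
    }
}
```

With ownership visible to all clients via database event notifications, several RunControl partitions can safely run simultaneously, as the slide notes.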
Bandwidth Usage Maximization

Level 1 Trigger rate plot, triggers per second

  • Dynamic Prescaling
    • As luminosity decreases, trigger rates also decrease
    • To maximize usage of DAQ bandwidth, automatically lower prescales of triggers at Level 1 to increase trigger rate during a data acquisition run, within bounds
    • Used for the Level 1 two-track trigger (for B → ππ), ~85% of Level 1 bandwidth
    • Heavily prescaled at start of run for safety

Red arrows indicate change of prescale values

Run is paused, hardware set, run resumed

L1 Trigger Cross Section plot, trigger counts normalized by luminosity
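The dynamic-prescaling arithmetic above is simple: as luminosity falls, the raw trigger rate falls, so the prescale can be lowered to keep the accepted rate (raw rate divided by prescale) near the bandwidth budget. The method name and numbers below are illustrative only.

```java
// Sketch of prescale selection: pick the smallest integer prescale that
// keeps accepted = rawRateHz / prescale at or below the bandwidth budget.
public class PrescaleSketch {
    public static int choosePrescale(double rawRateHz, double budgetHz) {
        int p = (int) Math.ceil(rawRateHz / budgetHz);
        return Math.max(p, 1); // prescale 1 means "take every trigger"
    }
}
```

For example, a 20 kHz raw rate against a 5 kHz budget needs a prescale of 4; once the rate has decayed below the budget, the prescale drops to 1. Starting the run heavily prescaled, as the slide notes, is the safe end of this same calculation.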

Complete Operational Efficiency
  • Efficiency factors:
    • Intrinsic system limits: instantaneous deadtime
      • Limited by system throughput performance
      • Adjusted through physics choices via trigger cuts
    • Accelerator beam quality
      • Losses prevent detector operation, trips and tolerances
      • little or no experimental control
    • Operational downtimes
      • Starting and stopping runs
      • Failures of services (e.g. database server)
      • Detector malfunctions
      • Data acquisition and trigger electronics malfunctions
      • Test runs, beam time calibrations
      • Human errors… others
Efficiency Tabulation, to date

Downtime occurrences automatically tabulated and linked to the shift crew’s electronic logbook, for each DAQ run and Tevatron store

Browse and group by category, lost time, lost luminosity

Over a timescale of years, category assignments proliferate; the tabulation’s operational utility is on short time scales

...several smaller categories suppressed

Efficiency Tabulation, intrinsic

Category totals, previous

Intra-run downtime below

Intrinsic dead time during data acquisition runs

Runs too small to process

Net efficiency


This window indicates the transition status of clients:

  • Butter yellow: RC has not sent transition
  • Margarine yellow: RC has sent transition, waiting for acknowledgment
  • Green: Client sent successful acknowledgment
  • Red: Client sent error
Client MicroManagement

Each client is monitored continuously for participation in the run and for possible errors

Each client has its own individual control panel; complete resets and recovery are one-touch; all configuration and response information are available here

State Management
  • RunControl maintains synchronization of activities through the StateManager and its flow
  • Basic functionality expressed in the base class StateManager
  • Different run types require differing control flows
  • Specific StateManagers inherit from base class and extend as necessary
  • Configuration messages are also easily extensible according to the needs of individual detectors
  • Avoid duplicating lots of work!
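The extension pattern above — a base StateManager expressing the common flow, with specific run types inheriting and extending it — can be sketched as follows. The transition names and the SourceRunStateManager subclass are illustrative, standing in for cases like TDC testing or calorimeter source runs.

```java
// Sketch of StateManager inheritance: a special run type adds its own
// transition (here, source motion control) without touching the base flow.
import java.util.Arrays;

public class StateManagerSketch {
    public static class StateManager {
        /** Transitions of the standard flow (simplified). */
        public String[] transitions() {
            return new String[] { "Partition", "Config", "Activate", "End" };
        }
    }

    /** Source runs need an extra motion-control transition. */
    public static class SourceRunStateManager extends StateManager {
        @Override public String[] transitions() {
            String[] base = super.transitions();
            String[] ext = Arrays.copyOf(base, base.length + 1);
            ext[base.length] = "MoveSource"; // appended, base flow untouched
            return ext;
        }
    }
}
```

Subclassing rather than copying the flow is what made ports such as the FNAL Fixed Target reuse cheap.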

TDC Testing Diagram

Calorimeter Radioactive Source runs

Requires source motion control transitions

There’s only one RunControl at CDF