
First operational experience with the CMS Run Control System

Hannes Sakulin, CERN/PH

on behalf of the CMS DAQ group

17th IEEE Real Time Conference, 24-28 May 2010, Lisbon, Portugal

The Compact Muon Solenoid Experiment

[Detector schematic: Drift-Tube chambers, Resistive Plate Chambers, Cathode Strip Chambers, Iron Yoke, 4 T Superconducting Coil, Hadronic Calorimeter, Electromagnetic Calorimeter, Silicon Strip and Silicon Pixel Trackers]

LHC
  • p-p collisions, E_CM = 14 TeV (2010: 7 TeV), heavy ion
  • Bunch crossing frequency 40 MHz

CMS
  • Multi-purpose detector, broad physics programme
  • 55 million readout channels
CMS Trigger and DAQ design
  • First Level Trigger (hardware)
    • up to 100 kHz
  • Central DAQ builds events at 100 kHz, 100 GB/s
    • 2 stages
    • 8 independent event builder / filter slices
  • High level trigger running on filter farm
    • ~700 PCs
    • ~6000 cores
  • In total around 10000 applications to control

[Diagram: frontend readout links feeding the two-stage event builder and the filter farm]
CMS Control Systems

[Diagram: the Run Control System (Java, web technologies) steers sub-system nodes (DCS, Trigger, Tracker, ECAL, DAQ); below it, the Trigger Supervisor and the XDAQ (C++) online software control the front-end electronics, the front-end drivers and first level trigger, and the central DAQ & high level trigger farm with its event-builder slices.]

CMS Control Systems

[Diagram: same picture, now highlighting the Detector Control System: built on PVSS (Siemens ETM) and SMI (State Management Interface), it controls low voltage, high voltage, gas and magnet for the sub-detectors (Tracker, ECAL, ...), alongside the Run Control System (Java, web technologies) and the XDAQ (C++) layer.]

CMS Run Control System

Run Control World – Java, Web Technologies
  • Defines the control structure
  • GUI in a web browser: HTML, CSS, JavaScript, AJAX
  • Run Control Web Application: Apache Tomcat Servlet Container, Java Server Pages, Tag Libraries, Web Services (WSDL, Axis, SOAP)
  • Function Manager: node in the Run Control tree; defines a state machine & parameters
  • User function managers dynamically loaded into the web application

XDAQ World – C++, XML, SOAP
  • XDAQ applications control hardware and data flow
  • XDAQ is the framework of the CMS online software. It provides hardware access, transport protocols, services etc.
  • ~10000 applications to control
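To make the Function Manager idea concrete, here is a minimal, self-contained Java sketch of a control-tree node that owns a state, a parameter set and child nodes. All class and method names are hypothetical illustrations, not the actual RCMS framework API.

```java
// Hypothetical, simplified sketch of a function manager: a node in the
// control tree with a state, parameters and children. Not the RCMS API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FunctionManagerSketch {
    private String state = "Created";
    private final String name;
    private final Map<String, String> parameters = new HashMap<>();
    private final List<FunctionManagerSketch> children = new ArrayList<>();

    FunctionManagerSketch(String name) { this.name = name; }

    void addChild(FunctionManagerSketch child) { children.add(child); }

    void setParameter(String key, String value) { parameters.put(key, value); }

    // A command from the parent (or the GUI) is forwarded to all children;
    // in the real system this is a SOAP call and the children answer with
    // asynchronous state notifications.
    void handleCommand(String command, String targetState) {
        for (FunctionManagerSketch child : children) {
            child.handleCommand(command, targetState);
        }
        state = targetState;
        System.out.println(name + ": " + command + " -> " + state);
    }

    public static void main(String[] args) {
        FunctionManagerSketch level0 = new FunctionManagerSketch("Level-0");
        FunctionManagerSketch tracker = new FunctionManagerSketch("Tracker");
        FunctionManagerSketch daq = new FunctionManagerSketch("DAQ");
        level0.addChild(tracker);
        level0.addChild(daq);
        level0.setParameter("RUN_KEY", "cosmics"); // illustrative run key
        level0.handleCommand("Configure", "Configured");
        level0.handleCommand("Start", "Running");
    }
}
```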

Function Manager Framework

[Diagram: inside a Function Manager, an Event Processor dispatches events to event handlers (custom code callbacks) and to the State Machine Engine, which is driven by a State Machine Definition and a Parameter Set. Towards the parent Function Manager / GUI, a servlet / web-service interface receives lifecycle, command and parameter requests and returns asynchronous notifications of states, errors and parameters. Towards the children, resource proxies (Run Control, XDAQ, PSX) and Job Control forward commands and parameters to child function managers, XDAQ applications and the Detector Control System; a Monitor watches the children. Legend: framework code vs. custom code.]

Function Manager Framework

[Diagram: the same Function Manager internals, now with the surrounding services: the configuration of FM + XDAQ comes from the Resource Service DB and the DAQ Structure DB, run conditions are recorded in the Run Info DB, logs are sent to a Log Collector, and errors are reported to the XDAQ Monitoring & Alarming System.]
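As a rough illustration of the event-driven core shown in these diagrams (asynchronous notifications queued and dispatched to handler callbacks that drive the state machine engine), here is a self-contained Java sketch; the class names echo the diagram, but the code is hypothetical, not the framework's.

```java
// Hypothetical sketch of the event-processor pattern from the diagram:
// asynchronous notifications are queued and dispatched to registered
// event-handler callbacks. Not the actual framework code.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class EventProcessorSketch implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Map<String, Consumer<String>> handlers = new HashMap<>();

    void registerHandler(String eventType, Consumer<String> callback) {
        handlers.put(eventType, callback);
    }

    // Called e.g. by a resource proxy when a child reports a state change.
    void post(String eventType) { queue.offer(eventType); }

    @Override
    public void run() {
        try {
            while (true) {
                String event = queue.take(); // wait for the next notification
                Consumer<String> handler = handlers.get(event);
                if (handler != null) handler.accept(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        EventProcessorSketch processor = new EventProcessorSketch();
        processor.registerHandler("ChildConfigured",
                ev -> System.out.println("state-machine engine handles " + ev));
        Thread t = new Thread(processor);
        t.setDaemon(true); // let the JVM exit after main finishes
        t.start();
        processor.post("ChildConfigured"); // asynchronous notification
        Thread.sleep(100); // give the processor time to dispatch
    }
}
```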

Entire DAQ System Structure is Configurable

[Diagram: high-level tools generate configurations and store them, with versioning, in the database through the Resource Service API; the configuration data then flow via SOAP to the function managers and, as XML through the Job Control Service, to the XDAQ Executives.]

  • Control structure
    • Function Managers to load (URL)
    • Parameters
    • Child nodes
  • Configuration of XDAQ Executives (XML)
    • libraries to be loaded
    • applications (e.g. builder unit, filter unit) & parameters
    • network connections
    • collaborating applications
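The control-structure part of such a configuration can be pictured as a small tree of records, as in the Java sketch below; the field names and the URL are purely illustrative, not the Resource Service schema.

```java
// Purely illustrative sketch of the data describing one control-structure
// node as it might be stored (versioned) via the Resource Service API.
import java.util.List;
import java.util.Map;

record ControlNode(
        String functionManagerUrl,      // where the FM code is loaded from
        Map<String, String> parameters, // node parameters
        List<ControlNode> children) {   // the sub-tree below this node

    public static void main(String[] args) {
        ControlNode slice0 = new ControlNode(
                "http://example.org/fm/daq-slice.jar", // hypothetical URL
                Map.of("SLICE_NUMBER", "0"),
                List.of());
        ControlNode daq = new ControlNode(
                "http://example.org/fm/daq.jar",
                Map.of(),
                List.of(slice0));
        System.out.println("DAQ node has " + daq.children().size() + " child");
    }
}
```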

CMS Control Tree

GUI (web browser) at the top, controlling:

  • Level-0: central control and parameterization of the run
  • Level-1: one node per sub-system (Trigger, Tracker, DT, RPC, ECAL, DAQ) with a common state machine and common parameters
  • Level-2 and below: sub-system specific, e.g. Frontend Controller (FEC), Frontend Driver (FED) and Trigger Throttling System (TTS) nodes for a detector, and Slice 0 … Slice 7 nodes for the DAQ, each with FED Builder (FB), Readout Builder (RB) and High Level Trigger (HLT) below

Framework and top-level Run Control developed by the central team; sub-system Run Control developed by the sub-system teams.

RCMS Level-1 State Machine (simplified)

  • Creation (load & start the Level-1 Function Managers) → Created
  • Initialization (start further levels of function managers; start all XDAQ processes on the cluster) → Halted
  • New: Pre-Configuration (trigger only, a few seconds; sets up the clock and periodic timing signals) → Pre-Configured
  • Configuration (load configuration from database; configure hardware and applications) → Configured
  • Start run → Running; Stop run → back to Configured
  • Pause / Resume (pauses / resumes the trigger, and trackers which may need to change settings) → Paused / Running
  • Halt → back to Halted
  • Error state on failures
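A compact way to picture this state machine is as a transition table; the Java sketch below encodes the states and transitions named on the slide (routing undefined transitions to Error is my simplification, not necessarily the framework's behaviour).

```java
// Sketch of the simplified Level-1 state machine, encoded as a transition
// table. This is an illustration, not the framework's code.
import java.util.Map;

enum State { CREATED, HALTED, PRE_CONFIGURED, CONFIGURED, RUNNING, PAUSED, ERROR }

class Level1StateMachine {
    private State state = State.CREATED;

    // (state, command) -> next state, following the slide's transitions
    private static final Map<String, State> TRANSITIONS = Map.ofEntries(
            Map.entry("CREATED:Initialize",       State.HALTED),
            Map.entry("HALTED:PreConfigure",      State.PRE_CONFIGURED), // trigger only
            Map.entry("HALTED:Configure",         State.CONFIGURED),
            Map.entry("PRE_CONFIGURED:Configure", State.CONFIGURED),
            Map.entry("CONFIGURED:Start",         State.RUNNING),
            Map.entry("RUNNING:Stop",             State.CONFIGURED),
            Map.entry("RUNNING:Pause",            State.PAUSED),
            Map.entry("PAUSED:Resume",            State.RUNNING),
            Map.entry("CONFIGURED:Halt",          State.HALTED));

    void fire(String command) {
        State next = TRANSITIONS.get(state + ":" + command);
        state = (next != null) ? next : State.ERROR; // undefined -> Error (simplification)
        System.out.println(command + " -> " + state);
    }

    public static void main(String[] args) {
        Level1StateMachine fm = new Level1StateMachine();
        fm.fire("Initialize");
        fm.fire("Configure");
        fm.fire("Start");
        fm.fire("Pause");
        fm.fire("Resume");
        fm.fire("Stop");
    }
}
```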

Top-Level Run Control (Level-0)
  • Central point of control
    • Global State Machine
  • Level-0 allows the operator to parameterize the configuration
    • Sub-system Run Key (e.g. level of zero suppression)
    • First Level Trigger Key / High Level Trigger Key
    • Clock source (LHC / local)
Masking of components
  • Level-0 allows the operator to mask out components
    • Remove/add sub-systems from control and readout
    • Remove/add detector partitions
    • Remove/add individual Frontend-Drivers (masking)
      • Connection to readout (SLINK)
      • Connection to Trigger Throttling System
    • Mask out DAQ slices (each slice = 1/8 of the central DAQ)
Commissioning and First Operation
  • Independent parallel commissioning of sub-detectors
    • Mini DAQ setups allow for standalone operation
Mini DAQ (“partitioning”)
  • Dedicated small DAQ setups for most sub-systems
  • Low bandwidth but sufficient for most tests
  • Mini DAQ may be used in parallel to the Global Runs

[Diagram: a MiniDAQ run (heavily used in the commissioning phase) with its own Level-0 controlling e.g. ECAL and a MiniDAQ readout, triggered by a Local Trigger Controller (or the Global Trigger), running in parallel to a Global Run whose Level-0 controls the Global Trigger, Tracker, DT and the Global DAQ with Slices 0–7.]
Commissioning and First Operation
  • Run start time
    • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold start)
Optimization of run startup time
  • Globally
    • Optimized the global state model (pre-configuration)
    • Provided tools for parallelization of user code (Parameter handling)
    • Sub-system specific performance improvements
  • Central DAQ
    • Developed tool to analyze log files and plot timelines of all operations
    • Distributed central DAQ control over 5 Apache Tomcat servers (previously 1)
    • Reduced message traffic between Run Control and XDAQ applications
      • combine commands and parameters into a single message (see the sketch below)
    • New startup method for High Level Trigger processes on multi-core machines
      • Initialize and Configure mother process, then fork child processes
      • Reduced memory footprint due to copy-on-write
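To illustrate the message-traffic reduction, the sketch below builds one SOAP message that carries a command together with its parameters, using the standard SAAJ API (javax.xml.soap, bundled with Java 8; a separate dependency on newer JDKs). The element and parameter names are invented for the example, not the actual Run Control / XDAQ schema.

```java
// Sketch: one SOAP message carrying a command plus its parameters, instead
// of separate parameter and command messages. Element names are invented.
import javax.xml.namespace.QName;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBodyElement;
import javax.xml.soap.SOAPMessage;

class CombinedCommandSketch {
    public static void main(String[] args) throws Exception {
        SOAPMessage message = MessageFactory.newInstance().createMessage();
        SOAPBodyElement command = message.getSOAPBody()
                .addBodyElement(new QName("urn:example:rc", "Configure", "rc"));
        // The parameters ride along inside the command element,
        // so command + parameters cost a single round trip.
        command.addChildElement("parameter")
               .addAttribute(new QName("name"), "RUN_KEY")
               .addTextNode("cosmics");
        message.saveChanges();
        message.writeTo(System.out); // would be POSTed to the application
    }
}
```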
Run Start timing (May 2010)
  • Globally 4 ¼ minutes, Central DAQ: 1 ¼ minutes (Initialize, Configure, Start)
  • Configuration time now dominated by frontend configuration (Tracker)
  • Pause/Resume 7x faster than Stop/Start

[Chart: configuration time per sub-system, in seconds]

Commissioning and First Operation
  • Run start time
    • End of 2008: Globally 8.5 minutes, Central DAQ: 5 minutes (Cold Start)
    • Now: Globally < 4 ¼ minutes, Central DAQ: 1 ¼ minute
  • Initially some stability issues
    • Problems solved by debugging user code (thread leaks)
Commissioning and First Operation
  • Recovery from sub-system faults
    • Control of individual sub-systems from top-level control node
    • Fast masking / unmasking of components (partial re-configuration only)
Commissioning and First Operation
  • Operator efficiency
    • Operation is complex
      • Subsystem inter-dependencies when configuring partially
      • Dependencies on internal & external parameters
      • Procedures to follow (Clock change)
    • Operators are no longer DAQ experts but colleagues from the entire collaboration
    • Built-in cross checks to guide the operator
Built-in cross-checks
  • Built-in cross-checks guide the shifter
    • Indicate sub-systems to re-configure if
      • A parameter is changed in the GUI
      • A sub-system / FED is added/removed
      • External parameters change
    • Enforce correct order of re-configuration
    • Enforce re-configuration of CMS if the clock source has changed or the LHC has been unstable

Improved operator efficiency
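The bookkeeping behind such cross-checks can be sketched in a few lines of Java: when a parameter changes, every sub-system that depends on it is flagged as stale until it has been re-configured. The dependency table and all names below are invented for illustration.

```java
// Illustrative sketch of a built-in cross-check: a changed parameter flags
// the dependent sub-systems as needing re-configuration. Names are invented.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CrossCheckSketch {
    // which sub-systems must be re-configured when a parameter changes
    private static final Map<String, Set<String>> DEPENDENTS = Map.of(
            "HLT_KEY", Set.of("DAQ"),
            "CLOCK_SOURCE", Set.of("Trigger", "Tracker", "ECAL", "DAQ"));

    private final Set<String> stale = new HashSet<>();

    void onParameterChanged(String parameter) {
        stale.addAll(DEPENDENTS.getOrDefault(parameter, Set.of()));
    }

    void onReconfigured(String subsystem) { stale.remove(subsystem); }

    // the GUI would warn the shifter until this returns true
    boolean mayStartRun() { return stale.isEmpty(); }

    public static void main(String[] args) {
        CrossCheckSketch checks = new CrossCheckSketch();
        checks.onParameterChanged("CLOCK_SOURCE");
        System.out.println("may start: " + checks.mayStartRun()); // false
        for (String s : Set.of("Trigger", "Tracker", "ECAL", "DAQ")) {
            checks.onReconfigured(s);
        }
        System.out.println("may start: " + checks.mayStartRun()); // true
    }
}
```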

Operation with the LHC
  • Cosmic run
    • Bring the detector into the desired state (Detector Control system)
    • Start Data Acquisition (Run Control System)
  • LHC
    • Detector state and DAQ state depend on the LHC
    • Want to keep DAQ going before beams are stable to ensure that we are ready

[Plot: LHC dipole current vs. time; clock variations during the ramp may unlock some links in the trigger, and the LHC clock becomes stable once the ramp is done; the tracking-detector high voltage is only ramped up when beams are stable (detector safety).]
Integration with DCS & automatic actions

  • In order to keep DAQ going, Run Control needs to be aware of the LHC and detector states
  • Top-level control node is notified about changes and propagates them to the concerned systems (Trigger + Trackers)
    • Trigger masks channels while LHC is ramping
    • Silicon-Strip Tracker masks payload when running with HV off (noise)
    • Silicon-Pixel Tracker reduces gains when running with HV off (high currents)
  • Top-level control node triggers automatic pause/resume when relevant DCS / LHC states change during a run

[Diagram: the DCS hierarchy (LHC, Tracker, ECAL, ...) is connected to the Run Control Level-0 and the sub-system nodes (Tracker, DAQ, ...) through PSX, the PVSS SOAP eXchange XDAQ service.]
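The pause/resume logic can be sketched as a simple mapping from state changes to actions, as below; the state names and actions paraphrase the next slide's timeline and are illustrative, not the Level-0 code.

```java
// Illustrative sketch of the automatic actions: a relevant LHC/DCS state
// change pauses the run, applies the change, and resumes, without a stop.
import java.util.Map;

class AutomaticActionSketch {
    private static final Map<String, String> ACTIONS = Map.of(
            "LHC_RAMP_START", "mask beam-sensitive trigger channels",
            "LHC_RAMP_DONE", "unmask beam-sensitive trigger channels",
            "TRACKER_HV_ON", "enable payload, lower thresholds, log HV state",
            "TRACKER_HV_OFF", "disable payload, raise thresholds, log HV state");

    void onStateChange(String newState) {
        String action = ACTIONS.get(newState);
        if (action == null) return;        // no automatic action needed
        System.out.println("pause run");   // automatic pause ...
        System.out.println("apply: " + action);
        System.out.println("resume run");  // ... and resume: the run survives
    }

    public static void main(String[] args) {
        AutomaticActionSketch actions = new AutomaticActionSketch();
        actions.onStateChange("LHC_RAMP_START");
        actions.onStateChange("LHC_RAMP_DONE");
    }
}
```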

Automatic actions

[Timeline of a CMS run against the LHC dipole current: at ramp start, beam-sensitive trigger channels are masked; once the ramp is done and the LHC clock is stable, they are unmasked; when the tracker HV is ramped up, the payload is enabled, thresholds are lowered and the HV state is logged in the data; when the tracker HV is ramped down, the payload is disabled, thresholds are raised and the HV state is logged again. The run itself keeps going from start to stop.]

Observations
  • Standardizing the experiment’s software is important for long-term maintenance
    • Almost successful considering the size of the collaboration
    • Run Control Framework was available early in the development of the experiment’s software (2003)
    • Adopted by all sub-systems
    • But some sub-systems built their own framework underneath
  • Ease-of-use becomes more and more important
    • Run Control / DAQ is now operated by members of the entire CMS collaboration
    • Running with high live-time: > 95% so far for stable-beam periods in 2010
Observations – Web Technology
  • Operations
    • Typical advantages of a web application: multiple clients, remote login
    • Stability of the server (Apache Tomcat + Run Control Web Application) very good: running for weeks
    • Stability of the GUI depends on third-party products (browser)
      • Behavior changes from one release to the next
      • Not a big problem - GUI can be restarted without affecting the run
  • Development
    • Knowledge of Java and the Run Control Framework sufficient for basic function managers
      • Web-based GUI & web technologies handled by framework
    • Development of complex GUIs such as the top-level control node more difficult
      • Many technologies need to be mastered
      • Modern web toolkits not yet used by Run Control
Summary & Outlook
  • CMS Run Control System is based on Java & Web Technologies
  • Good stability
  • Top-Level Control node optimized for efficiency
    • Flexible operation of individual sub-systems
    • Built-in cross-checks to guide the operator
    • Automatic actions triggered by detector and LHC state
  • High CMS data-taking efficiency
    • live-time > 95%
  • Next developments
    • Further improve fault tolerance
    • Automatic recovery procedures
    • Auto Pilot

[Closing slide: candidate event display]