DØ Level 3 Trigger/DAQ System Status

G. Watts (for the DØ L3/DAQ Group)

“More reliable than an airline*” (* or the GRID)
Overview of DØ Trigger/DAQ

Standard HEP Tiered Trigger System (diagram): Level 1 (firmware) sees the 1.7 MHz crossing rate and accepts at 2 kHz into Level 2 (firmware + software); Level 2 accepts at 1 kHz (300 MB/sec) into the commodity DAQ and L3 Trigger Farm; L3 writes 100 Hz (30 MB/sec) to the Online System.

  • Full Detector Readout After Level 2 Accept
  • Single Node in L3 Farm makes the L3 Trigger Decision
  • Standard Scatter/Gather Architecture
  • Event size is now about 300 KB.
  • First full detector readout
    • L1 and L2 use some fast-outs


Overview Of Performance

System has been fully operational since March 2002.

Tevatron Increases Luminosity

Physicists Get Clever

  • Trigger software is written by a large collection of physicists who are not real-time programmers.
    • CPU time/event has more than tripled.
  • Continuous upgrades since operation started
    • Have added about 10 new crates
    • Started with 90 nodes, now have almost 250, none of them original
    • Single core at start; the latest purchase is dual quad-core.
  • No major unplanned outages

Diagram: the number of multiple interactions increases, the trigger list changes, noise hits increase, CPU time per event grows, rejection moves to L1/L2, data size increases, and the trigger HW improves. An overwhelming success.

24/7

(Plot) Over an order of magnitude increase in peak luminosity. Constant pressure: L3 deadtime shows up in this gap!

Basic Operation

Data Flow
  • Directed, unidirectional flow
  • Minimize copying of data
  • Buffered at origin and at destination

Per-Event Control Flow
  • 100% TCP/IP
  • Bundle small messages to decrease network overhead
  • Compress messages via configured lookup tables (see the sketch below)
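As an illustration of the last two items, here is a minimal sketch of how a configured lookup table might let both ends of a connection exchange small integer IDs instead of node names. The class and its interface are assumptions for this example, not the actual DØ code.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical run-time lookup table, built from the run configuration.
// Because the sender and the receiver are configured with the same table,
// control messages can carry a 2-byte index instead of a full node name.
class NodeTable {
public:
    explicit NodeTable(const std::vector<std::string>& configured_nodes)
        : names_(configured_nodes) {
        for (std::size_t i = 0; i < names_.size(); ++i)
            index_[names_[i]] = static_cast<std::uint16_t>(i);
    }

    std::uint16_t compress(const std::string& name) const { return index_.at(name); }
    const std::string& expand(std::uint16_t id) const { return names_.at(id); }

private:
    std::vector<std::string> names_;
    std::map<std::string, std::uint16_t> index_;
};
```

Bundling then amounts to collecting several such compressed records into a single TCP send, which is how the per-event control messages described later (for example, the Routing Master's decisions) keep their overhead low.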


The DAQ/L3 Trigger End Points

Read Out Crates (ROCs) are VME crates that receive data from the detector.


  • Most data is digitized on the detector and sent to the Movable Counting House
    • Detector specific cards in the ROC
  • DAQ HW reads out the cards and makes the data format uniform

Farm Nodes are located about 20 m away (electrically isolated)

  • The event is built in the Farm Node (see the sketch below)
    • There is no event builder
  • Level 3 Trigger Decision is rendered in the node.

Between the two is a very large Cisco switch…
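Because there is no separate event builder, each farm node assembles its own events from the fragments that arrive from the ROCs. A minimal sketch of that bookkeeping, with all types and field names invented for illustration:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical per-node event assembly: the run configuration tells the node
// which crates to expect, and an event is complete once a fragment has
// arrived from every one of them.
struct Fragment {
    std::uint32_t event_number;
    std::uint16_t crate_id;
    std::vector<std::uint8_t> payload;
};

class EventAssembler {
public:
    explicit EventAssembler(std::set<std::uint16_t> expected_crates)
        : expected_(std::move(expected_crates)) {}

    // Store a fragment; returns true when the event is fully built and can
    // be handed to a filter process for the L3 decision.
    bool add(const Fragment& f) {
        auto& fragments = building_[f.event_number];
        fragments[f.crate_id] = f.payload;
        return fragments.size() == expected_.size();
    }

private:
    std::set<std::uint16_t> expected_;
    // event number -> (crate id -> fragment payload)
    std::map<std::uint32_t, std::map<std::uint16_t, std::vector<std::uint8_t>>> building_;
};
```

When add() returns true, the node would pass the completed event to one of its filter processes for the Level 3 decision.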


Hardware
  • ROCs contain a Single Board Computer (SBC) to control the readout.
    • VMIC 7750s: PIII, 933 MHz
    • 128 MB RAM
    • VME access via a PCI Universe II chip
    • Dual 100 Mb Ethernet
    • 4 have been upgraded to Gb Ethernet due to increased data size
  • Farm Nodes: 288 total, 2 and 4 cores per pizza box
    • AMDs and Xeons of differing classes and speeds
    • Single 100 Mb Ethernet
    • Less than last CHEP!
  • Cisco 6509 switch
    • 16 Gb/s backplane
    • 9 module slots, all full
    • 8-port Gb modules
    • 112 MB shared output buffer per 48 ports


Data Flow
  • The Routing Master (RM) coordinates all data flow
  • The RM is an SBC installed in a special VME crate interfaced to the DØ Trigger Framework
    • The TFW manages the L1 and L2 triggers
  • The RM receives an event number and the trigger bit mask of the L2 triggers (a sketch of such a record follows this list).
  • The TFW also tells the ROCs to send that event’s data to the SBCs, where it is buffered.
    • The data is pushed to the SBCs
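As a concrete picture of what the RM receives, here is a hypothetical layout of the L2-accept record. The 128-bit width of the trigger mask is an assumption made for this illustration.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical layout of what the Trigger Framework hands the Routing Master
// on each Level 2 accept.
struct L2Accept {
    std::uint32_t event_number;        // identifies the event being read out
    std::bitset<128> l2_trigger_mask;  // which L2 trigger bits fired
};
```

The RM uses this mask, together with the run configuration, to choose a destination node, as described on the next slide.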

Diagram: ROCs, Farm Nodes, and the Routing Master; the DØ Trigger Framework sends the L2 Accept to the Routing Master.

Data Flow
  • The RM assigns a node
  • The RM decides which Farm Node should process the event
    • Uses the trigger mask from the TFW
    • Uses run configuration information
    • Factors in how busy each node is, which automatically takes into account the node’s processing ability.
  • 10 decisions are accumulated before being sent out
    • Reduces network traffic (a sketch of the node-selection step follows this list).
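A toy sketch of such a routing decision, choosing the least-busy node among those configured for one of the fired trigger bits. The credit counter and all names are illustrative assumptions, not the DØ algorithm itself.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view the Routing Master keeps of each farm node.  "credits"
// stands in for however the real system tracks how many more events a node
// is willing to accept; a busy or slow node simply returns credits slowly.
struct NodeState {
    std::uint16_t node_id = 0;
    std::bitset<128> trigger_bits;  // trigger bits this node is configured for
    int credits = 0;                // free event slots reported by the node
};

// Pick the least-busy node whose configuration covers one of the fired
// trigger bits.  This is an illustrative policy, not the actual DØ algorithm.
int pick_node(const std::vector<NodeState>& nodes,
              const std::bitset<128>& fired) {
    int best = -1;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if ((nodes[i].trigger_bits & fired).none()) continue;  // wrong triggers
        if (nodes[i].credits <= 0) continue;                   // no room
        if (best < 0 || nodes[i].credits > nodes[best].credits)
            best = static_cast<int>(i);
    }
    return best;  // -1 -> no suitable node is currently available
}
```

In the real system ten such decisions are accumulated before being sent out, which keeps the control traffic small.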


Data Flow


  • The data moves
  • The SBCs send all event fragments to their proper node
  • Once all event fragments have been received, the farm node notifies the RM (if it has room for more events); a sketch of this back-pressure follows.
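A minimal sketch of that back-pressure, assuming the node simply counts free event buffers (all names invented):

```cpp
// Hypothetical farm-node bookkeeping: the node only tells the Routing Master
// it is ready for another event when it actually has a free buffer, so a slow
// node automatically receives fewer events (back-pressure instead of loss).
class FarmNodeBuffers {
public:
    explicit FarmNodeBuffers(int capacity) : free_(capacity) {}

    // Called once an event has been fully assembled from its fragments.
    // Returns true if the node should notify the RM that it still has room.
    bool on_event_built() {
        --free_;
        return free_ > 0;
    }

    // Called when a filter process is done with an event and its buffer.
    void on_event_released() { ++free_; }

private:
    int free_;
};
```

A node that is slow to free buffers stops advertising room, which is consistent with the RM factoring in how busy each node is.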


Configuration

DØ Online Configuration Design

Level 3 Supervisor

Configuration Performance


The Static Configuration
  • Farm Nodes
    • What read out crates to expect on every event
    • What is the trigger list programming
    • Where to send the accepted events and how to tag them
  • Routing Master
    • What read out crates to expect for a given set of trigger bits
    • What nodes are configured to handle particular trigger bits
  • Front End Crates (SBC’s)
    • List of all nodes that data can be sent to

Much of this configuration information is cached in lookup tables for efficient communication at run time
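A rough sketch of how those cached tables might be organized; every structure and field name here is an assumption made for illustration.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical shape of the cached static configuration, built once at
// configure time and then only consulted while data is flowing.
struct StaticConfig {
    // Routing Master: crates to expect and nodes allowed for each trigger bit.
    std::map<int, std::set<std::uint16_t>> crates_for_trigger_bit;
    std::map<int, std::vector<std::uint16_t>> nodes_for_trigger_bit;

    // Farm node: crates expected on every event, plus where accepted events
    // are sent and how they are tagged.
    std::set<std::uint16_t> expected_crates;
    std::string output_destination;
    std::string output_tag;

    // Front-end SBCs: every node that data may be sent to.
    std::vector<std::uint16_t> routable_nodes;
};
```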


The DØ Control System

Standard Hierarchical Layout (diagram): the DØ Shifter drives “COOR”dinate (the master Run Control), backed by the Configuration Database; COOR controls the Calorimeter, Level 2, Level 3, and Online Examines clients; Level 3 in turn controls the Routing Master, the Farm Nodes, and the SBCs.

  • Intent flows from the shifter down to the lowest levels of the system
  • Errors flow in reverse


Level 3 Supervisor

Input

  • COOR sends a state description down to Level 3
    • Trigger List
    • Minimum # of nodes
    • Where to send the output data
    • What ROCs should be used
  • L3 component failures, crashes, etc.
    • The Supervisor updates its state and reconfigures as necessary.

Output

  • Commands are sent to each L3 component to change the state of that component.

State Configuration (diagram: the Level 3 Supervisor turns the current state plus the desired state into commands to effect the change)

  • Calculates the minimum number of steps to get from the current state to the desired state.
  • The complete sequence is calculated before the first command is issued (a sketch follows this list).
    • Takes ~9 seconds for this calculation.
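A toy version of that calculation: diff the cached actual state of each component against the desired state and build the full command list before sending anything. The component names, state strings, and one-command-per-component simplification are all assumptions, not the real supervisor logic.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical state reconciliation: compare the cached actual state of each
// L3 component with the desired state derived from COOR's request, and emit
// commands only for components that need to change.  The whole sequence is
// built before anything is sent.
struct Command {
    std::string component;     // e.g. "farm-node-17", "routing-master"
    std::string target_state;  // the state the component should move to
};

std::vector<Command> plan(const std::map<std::string, std::string>& actual,
                          const std::map<std::string, std::string>& desired) {
    std::vector<Command> commands;
    for (const auto& [component, wanted] : desired) {
        auto it = actual.find(component);
        if (it == actual.end() || it->second != wanted)
            commands.push_back({component, wanted});  // bring it to the wanted state
    }
    return commands;  // issued as one pre-computed sequence
}
```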


General Comments


Boundary Conditions

  • Dead time: beam in the machine and no data flowing
    • Run changes contribute to downtime!
    • Current operating efficiency is 90-94%
      • Includes 3-4% trigger deadtime
    • That translates to less than 1-2 minutes per day of downtime.
  • Any configuration change means a new run
    • Prescales for luminosity changes, for example.
    • A system fails and needs to be reset
    • Clearly, run transitions have to be very quick!

Push responsibility down a level

Don’t touch the configuration unless it must be changed

We didn’t start out this way!


Some Timings

State transitions and timings (diagram):
  • Start → Configured: Configure – 26 sec
  • Configured → Start: Unconfigure – 4 sec
  • Configured → Running: Start Run #nnnnn – 1 sec
  • Running → Configured: Stop Run #nnnnn – 1 sec
  • Running → Paused: Pause – 1 sec
  • Paused → Running: Resume – 1 sec


Caching

“COOR”dinate (master Run Control)
  • COOR caches the complete state of the system
  • Single point of failure for the whole system. But it never crashes!
  • Recovery is about 45 minutes
  • Can re-issue commands if L3 crashes

Level 3 Supervisor
  • Caches the desired state of the system (COOR’s view) and the actual state
  • COOR is never aware of any difficulties unless L3 can’t take data due to a failure
  • Some reconfigurations require a minor interruption of data flow (1-2 seconds)

Farm Nodes
  • Farm nodes cache the trigger filter programming
  • If a filter process goes bad, no one outside the node has to know (a sketch follows)
  • The event that caused the problem is saved locally

Fast response to problems == less downtime
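A minimal sketch of that local containment, assuming the node watches its filter processes and keeps its trigger programming cached; the function and all names are invented for illustration.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical local handling of a misbehaving filter process: the event that
// triggered the failure is written to the node's local disk for debugging and
// the filter is restarted from the trigger programming already cached on the
// node, so nothing outside the node needs to be told.
void handle_filter_crash(const std::vector<unsigned char>& offending_event,
                         const std::string& local_dir,
                         unsigned int event_number) {
    std::ofstream dump(local_dir + "/bad_event_" + std::to_string(event_number) + ".raw",
                       std::ios::binary);
    dump.write(reinterpret_cast<const char*>(offending_event.data()),
               static_cast<std::streamsize>(offending_event.size()));
    // The filter process would then be restarted from the cached trigger-list
    // programming -- no round trip to the supervisor or COOR is required.
}
```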


Upgrades

Farm Nodes

  • We continue to purchase farm nodes in small increments as old nodes pass their 3-4 year lifetimes.
  • 8-core CPUs run 8-9 parallel processes filtering events with no obvious degradation in performance.
  • Original plan called for 90 single-processor nodes
    • “Much easier to purchase extra nodes than rewrite the tracking software from scratch.”

SBCs

Finally used up our cache of spares

No capability upgrades required

But we have not been able to make the new-model SBCs operate at the efficiency required by the highest-occupancy crates.

Other New Ideas

  • We have had lots of ideas, with varying degrees of craziness.
  • Management has been very reluctant to make major changes at this point
    • Management has been very smart.


Conclusion
  • The DØ DAQ/L3 Trigger has taken every single physics event for DØ since the experiment started taking data in 2002.
  • 63 VME sources powered by Single Board Computers send data to 328 off-the-shelf commodity CPUs.
  • The data flow architecture is push-based, and is crash and glitch resistant.
  • It has survived all the hardware, trigger, and luminosity upgrades smoothly
    • The farm was upgraded from 90 to 328 nodes with no major change in architecture.
  • The evolved control architecture means minimal deadtime is incurred during normal operations
  • Primary responsibility is carried out by 3 people (who also work on physics analysis)
    • This works only because the Fermilab Computing Division takes care of the farm nodes...
