CSC Online Error Monitoring with the DDU

J. Gilmore, CSC-DPG #41, July 17, 2008

Presentation Transcript


  1. CSC Online Error Monitoring with the DDU. J. Gilmore, CSC-DPG #41, July 17, 2008

  2. DDU Overview
     [Figure: DDU board. Labels: FMM output port (sTTS); VME FPGA; Input FPGA; Control FPGA; Input FIFOs; SLINK; GbE FIFO; Mezz Board; 15 Optical Fiber Inputs (reads a 20-degree slice through an endcap); GbE/SPY To Local DAQ]
     • Functions
         • Merge data from 15 CSCs
         • Perform online data unpacking and status monitoring in real-time (CRC, word count, format quality, BXN, L1A number, buffer status, link status)
         • Send CSC status to FMM
     • Large Buffer Capacity
         • 2.5 MB buffer
         • Average DDU data volume estimated to be 0.4 kB per L1A at LHC (at a luminosity of 10^34 cm^-2 s^-1)
         • Buffer can hold over 6000 events
     • Status info accessed via VME
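The "over 6000 events" figure follows directly from the two numbers quoted on the slide. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes) and average-sized events; the function name is illustrative only:

```python
# Figures quoted on the slide (decimal units are an assumption).
BUFFER_BYTES = 2_500_000   # 2.5 MB DDU buffer
BYTES_PER_L1A = 400        # 0.4 kB average DDU data volume per L1A


def events_that_fit(buffer_bytes: int, bytes_per_event: int) -> int:
    """Number of average-sized events that fit in the buffer."""
    return buffer_bytes // bytes_per_event


capacity = events_that_fit(BUFFER_BYTES, BYTES_PER_L1A)
print(capacity)  # 6250
```

With binary megabytes (2.5 x 2^20 bytes) the same calculation gives 6553 events, so the slide's estimate holds under either convention.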

  3. Data Unpacking in the DDU
     • Scan data for evidence of SEUs, determine if Reset is needed
         • Data errors are an indicator for SEU
         • Requires Hard Reset, report it to FMM
     • Monitor front-end data for event sync loss
         • Requires Sync Reset, report it to FMM
     • Watch for buffer warning signals, avoid Overflows!
         • Set FMM Warning as needed, at half to 3/4 full (many events!)
         • Beyond ~90% full, DDU will set FMM Busy
         • As buffers get near empty, DDU returns to FMM Ready
         • Note that Buffer Overflows will lead to other errors if not Reset: Sync loss, Data corruption, Timeout errors
     • Diagnose cause and source of problems
         • Track which CSCs have set which error types
         • Report “Reset Required” states via VME Interrupt
         • Tracking for chronic problems in offline log files
     • Provide VME registers for diagnostics and monitoring
     • Include status and error information in the DDU Trailer
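The buffer-level behaviour described on this slide can be pictured as a small state machine with hysteresis. The sketch below is an illustration, not the DDU firmware logic: the 50% Warning level is the lower end of the slide's "half to 3/4" range, the 10% "near empty" level is an assumption, and all names are hypothetical.

```python
from enum import Enum


class FmmState(Enum):
    READY = "Ready"
    WARNING = "Warning"
    BUSY = "Busy"


WARN_LEVEL = 0.5    # slide: Warning at half to 3/4 full
BUSY_LEVEL = 0.9    # slide: Busy beyond ~90% full
READY_LEVEL = 0.1   # assumption: what counts as "near empty"


def next_state(state: FmmState, fill: float) -> FmmState:
    """Return the FMM state for a buffer fill fraction between 0 and 1."""
    if fill > BUSY_LEVEL:
        return FmmState.BUSY
    if fill <= READY_LEVEL:
        return FmmState.READY
    if state is FmmState.READY and fill >= WARN_LEVEL:
        return FmmState.WARNING
    # Between the thresholds the previous state is held (hysteresis).
    return state


state = FmmState.READY
history = []
for fill in (0.2, 0.6, 0.95, 0.6, 0.05):
    state = next_state(state, fill)
    history.append(state.value)
print(history)  # ['Ready', 'Warning', 'Busy', 'Busy', 'Ready']
```

Holding Busy until the buffer is nearly drained avoids rapid toggling of the sTTS state when the fill level hovers around a threshold.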

  4. Reported Error Categories I
     • Configuration failures
         • Constants loaded on a board are not correct
         • Caused by communication errors, bad timing or hardware
         • Often leads to data errors: Timeout, bad DAV, sync loss, buffer overflow, dead or hot channels, format errors, data corruption
     • Format error, Consistency error or Not Present
         • An expected format marker is not detected in the proper position
         • Can cause DDU to misidentify a board header/trailer word
         • May show as “missing” board in event
         • May show as bad L1A, CRC or word count
         • Caused by config fail, bad hardware or signal timing/quality
     • Hot/dead channels or Empty/Missing CSC
         • Caused by HV, config fail, bad hardware or signal timing/quality
         • Can lead to buffer overflows
         • Missing CSCs are caused by LV-off or disabled CSCs
     • DAV-LCT mismatch
         • A CFEB was triggered but it failed to send data
         • Caused by config fail, bad hardware or signal timing/quality
         • Can lead to buffer overflows or Timeout errors
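A DAV-LCT mismatch amounts to comparing two sets of CFEBs: those that were triggered and those that actually delivered data. The sketch below assumes one bit per CFEB and five CFEBs per CSC; the mask layout and function name are hypothetical and are not taken from the DDU data format.

```python
N_CFEB = 5  # assumption: number of CFEBs read out per CSC


def dav_lct_mismatch(lct_mask: int, dav_mask: int) -> list:
    """Return indices of CFEBs that were triggered but sent no data.

    lct_mask: bit i set if CFEB i was triggered (hypothetical layout).
    dav_mask: bit i set if CFEB i reported data available.
    """
    missing = lct_mask & ~dav_mask
    return [i for i in range(N_CFEB) if (missing >> i) & 1]


# CFEBs 1 and 2 triggered, only CFEB 1 delivered data.
print(dav_lct_mismatch(0b00110, 0b00010))  # [2]
```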

  5. Reported Error Categories II
     • Full FIFO @DMB (ALCT or CFEB buffer overflow)
         • Caused by config fail, bad hardware or signal timing/quality
         • Can cause Sync loss, Data corruption, or Timeout
     • L1A Number Mismatch Errors
         • Fundamental sign of sync loss
         • Caused by problem with hardware or signal timing/quality
         • Possibly SEU related
     • CRC error: bit error detected in transmission
         • Generally a minor concern, affecting only one event
         • Only serious if it affects multiple Header/Trailer bits
         • May be an indicator of a deeper problem
         • CSC electronics have a CRC at every level to detect bit errors: CFEB, ALCT, TMB, DMB and DDU
     • Overall severity of an error is hard to predict
         • Cases that appear as “Critical” require a Reset as they usually lead to more errors, but sometimes may be self-correcting
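How a CRC exposes a transmission bit error can be shown with a generic example. The slides do not state which CRC polynomial the CSC electronics use, so the sketch below substitutes the CRC-CCITT routine from Python's standard library (`binascii.crc_hqx`) purely for illustration; the payload is made up.

```python
import binascii


def crc16(payload: bytes) -> int:
    """16-bit CRC-CCITT of the payload (stand-in for the real CSC CRC)."""
    return binascii.crc_hqx(payload, 0)


def crc_ok(payload: bytes, received_crc: int) -> bool:
    """Recompute the CRC on the receiving side and compare."""
    return crc16(payload) == received_crc


payload = bytes(range(16))   # made-up event data
sent_crc = crc16(payload)    # CRC computed by the sender

corrupted = bytearray(payload)
corrupted[3] ^= 0x04         # flip a single bit in transit

print(crc_ok(payload, sent_crc))           # True
print(crc_ok(bytes(corrupted), sent_crc))  # False
```

A single flipped bit always changes the CRC, which is why an isolated CRC error usually points to one damaged event rather than a loss of synchronisation.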

  6. Event Quality Indicators from DDU
     • The “Single Error” flag in DDU trailer: Do Not Analyze Event
         • Any events with non-perfect data checks will get this
         • Minor bit errors or format problems, SCA Full
     • “Single Warning” if problem might not affect the data payload
         • Clean single-bit error in a header/trailer-word marker
         • Fiber receiver/link error that may have occurred between events
         • DCM phase-lock-loss that may occur between events
     • The “Critical Error” Sync Lost case: Data Integrity Failure
         • L1A mismatch detected twice on one CSC
             • Two different boards in the same event
             • Separate occurrences in two different events
         • Buffer Overflow at DMB or DDU
         • Note: offline analysis might not see the loss in data integrity
             • At the full point, a buffer still has many “good” events to read out before the compromised data is observed, and sTTS actions can conceal all this
     • The “Critical Error” Hard Reset case: Unpacker Failure Likely
         • Anything that corrupts the data irreversibly
         • Violation of event boundaries, can’t determine end-of-CSC data stream
         • Anything that “looks” like an SEU, e.g. repeated trivial errors
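The "L1A mismatch detected twice on one CSC" rule can be read as a per-CSC counter that trips on the second mismatch, whether the two mismatches come from two boards in one event or from two separate events. The sketch below is one possible reading of that rule, not the DDU implementation; the class name, the dictionary input and the use of full-width L1A numbers are assumptions.

```python
from collections import Counter


class L1aMismatchTracker:
    """Count L1A mismatches per CSC and flag Sync Lost on the second one."""

    def __init__(self) -> None:
        self.mismatches = Counter()  # CSC index -> mismatches seen so far

    def record_event(self, csc: int, expected_l1a: int, board_l1as: dict) -> bool:
        """Check each board's L1A number; return True once the CSC is Sync Lost."""
        for l1a in board_l1as.values():
            if l1a != expected_l1a:
                self.mismatches[csc] += 1
        return self.mismatches[csc] >= 2


tracker = L1aMismatchTracker()
print(tracker.record_event(3, 100, {"DMB": 100, "TMB": 100}))  # False
print(tracker.record_event(3, 101, {"DMB": 101, "TMB": 100}))  # False (first mismatch)
print(tracker.record_event(3, 102, {"DMB": 101, "TMB": 102}))  # True  (second event)
print(tracker.record_event(7, 50, {"DMB": 49, "TMB": 48}))     # True  (two boards, one event)
```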

  7. Summary
     • The DDU performs online CSC error monitoring in real-time
     • The monitor status is in the DDU Trailer for every event
     • The DDU monitoring results are useful for offline data quality checking
     • Details of DDU monitoring status can be found here: http://www.physics.ohio-state.edu/~cms/ddu/ddu2_pro.html#tr-1

  8. DDU Error Table I
     [1] Error bits resulting in RESET REQUIRED persist until the RESET occurs. Questionable cases (in gold) indicate that a reset is only required for mitigation of recurring errors. TBD: sync/hard reset distinctions.
     [2] Found inside an event, i.e. between Beginning-Of-Event (=Header1 signature) and End-Of-Event (=combination Trailer1+Trailer2 signatures), at least one of the following: Extra DMB_Header1, Extra DMB_Header2, Lone Word, Extra TMB/ALCT_Trailer, Extra DMB_Trailer1, DMB_Trailer2.
     [3] Missing TMB/ALCT_Trailer word, missing DMB Header word, Wrong First word, or Extra Control words.
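Footnote [1] describes latched ("sticky") error bits: once set, they stay set until a reset. A minimal sketch of that behaviour, assuming a plain integer bit field; the bit positions and class name are hypothetical and do not correspond to the real DDU status word.

```python
class StickyErrorRegister:
    """Error bits accumulate with OR and are cleared only by a reset."""

    def __init__(self) -> None:
        self.bits = 0

    def update(self, event_error_bits: int) -> int:
        """Merge the error bits of one event into the latched status."""
        self.bits |= event_error_bits
        return self.bits

    def reset(self) -> None:
        """Model the RESET that clears all latched bits."""
        self.bits = 0


reg = StickyErrorRegister()
reg.update(0b0100)          # an error in one event sets bit 2
status = reg.update(0b0000) # a later clean event does not clear it
print(bin(status))          # 0b100
reg.reset()
print(reg.bits)             # 0
```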

  9. DDU Error Table II (footnotes [1] to [3] as listed under DDU Error Table I)

  10. DDU Error Table III
     • Footnotes for the error table: [1], [2] and [3], as listed under DDU Error Table I
