
DDU Functions




  1. DDU Functions
     • EMU DDU: not just for data handling
       • Scan data for evidence of SEUs, determine if Reset needed
         • Data format errors a likely indicator of SEU: needs Hard Reset via FMM
         • Monitor front-end data for event synch loss: needs Sync Reset (FMM)
       • Watch for buffer warning signals, avoid Overflows!
         • Set FMM Warning as needed, at half-to-3/4 full (many events!)
         • Beyond ~90% full DDU will set FMM Busy
         • As buffers get near empty, DDU returns to FMM Ready
         • Note that Buffer Overflows will lead to other errors
           • Synch loss, Data corruption, Timeout errors
     • Diagnose cause and source of errors
       • Report Reset Request states via VME Interrupt
       • Provide VME registers for diagnostics and monitoring
       • Track which CSCs have set which error types
       • Allows for a discriminated error response in specific cases
         • One occurrence is just a bad event, no Reset
         • Several occurrences could indicate SEU, needs Reset
         • We can apply this for L1A, CRC, DAV, data format errors
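
The buffer-level behaviour on this slide can be read as a small state machine with hysteresis. The sketch below is only an illustration of that reading, not the DDU firmware logic: the names `FmmState` and `next_fmm_state` are hypothetical, the Warning threshold is taken as the lower end of "half-to-3/4 full", and the 10% release point is an assumption because the slide only says "near empty".

```python
from enum import Enum


class FmmState(Enum):
    READY = "Ready"
    WARNING = "Warning"
    BUSY = "Busy"


# Assumed thresholds (fractions of buffer capacity).
WARN_LEVEL = 0.50   # slide: "half-to-3/4 full"
BUSY_LEVEL = 0.90   # slide: "beyond ~90% full"
READY_LEVEL = 0.10  # assumption: the slide only says "near empty"


def next_fmm_state(state: FmmState, fill: float) -> FmmState:
    """Return the next FMM state for a buffer fill fraction in [0, 1]."""
    if fill > BUSY_LEVEL:
        return FmmState.BUSY
    if fill <= READY_LEVEL:
        return FmmState.READY
    if state is FmmState.READY and fill >= WARN_LEVEL:
        return FmmState.WARNING
    # Between the thresholds the current state is held (hysteresis).
    return state


def trace(fills):
    """Apply next_fmm_state to a sequence of fill levels, starting from Ready."""
    state = FmmState.READY
    states = []
    for fill in fills:
        state = next_fmm_state(state, fill)
        states.append(state)
    return states
```

Holding Warning or Busy until the buffer is nearly empty is one possible interpretation of "as buffers get near empty, DDU returns to FMM Ready"; whether Busy steps back through Warning on the way down is not stated in the slides.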

  2. DDU Capabilities
     • Resilient against single bit errors
       • Bit errors ought to be rare, but an occasional CRC error should never cause a critical problem
       • Watch out for repetition: can indicate an SEU or hardware problem
     • Some bit errors can destroy an entire 16-bit word
       • Fiber data encoded with 8-bit/10-bit protocol
       • Try to continue operation after this occurs, assumed rare
       • DDU Firmware adds “filler” words as needed to the end of CSC data stream to maintain the 64-bit word boundary
       • Right now sets critical “corrupted data” error, will adjust this
     • A “stuck” bit can cause critical problems, may indicate SEU
     • Critical “data corruption” errors require a Reset
       • e.g. when DDU cannot detect the ending of the CSC data stream
     • Some types of errors may be “single loss” events
       • Automatic self-recovery, no Reset needed
       • Such events set “bad event” signal; e.g. bad CRC
       • Repetition can indicate an SEU or other hardware problem
     • FMM Errors must be “approved” by the VME IRQ Handler
       • DDU Error reporting to FMM is disabled until nCSCerrors > nThresh
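
The filler-word arithmetic follows from the word sizes on this slide: a 64-bit boundary holds four 16-bit words, so a stream is padded up to the next multiple of four. The sketch below shows only that arithmetic. The function name `pad_to_64bit` and the filler value `0xFFFF` are hypothetical; the slides do not say what the firmware writes as filler.

```python
WORD_BITS = 16
BOUNDARY_BITS = 64
WORDS_PER_BOUNDARY = BOUNDARY_BITS // WORD_BITS  # 4 words of 16 bits

# Hypothetical filler value; the actual firmware filler word is not given.
FILLER_WORD = 0xFFFF


def pad_to_64bit(words):
    """Return a copy of a 16-bit word list padded to a 64-bit boundary."""
    missing = -len(words) % WORDS_PER_BOUNDARY  # 0..3 filler words needed
    return list(words) + [FILLER_WORD] * missing
```

For example, a CSC stream that has lost one word and arrives with five 16-bit words would receive three filler words, so that the data following it starts on a 64-bit boundary again.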

  3. Recent Error Experience
     • General error categories
       • Configuration failures
         • Caused by communication errors, bad timing or hardware
         • Causes many error symptoms: Timeout, bad DAV, sync loss, buffer overflow, dead or hot channels, format errors, data corruption
       • Format errors
         • Caused by config fail, bad hardware or signal timing/quality
         • Can cause DDU to misidentify a board header/trailer word
           • May show as “missing” board in event
           • May show as bad L1A, CRC or word count
         • A critical format error can cause data corruption
       • Hot/dead channels
         • Caused by config fail, bad hardware or signal timing/quality
         • Can lead to buffer overflows
       • DAV-LCT mismatch
         • Caused by config fail, bad hardware or signal timing/quality
         • This can cause buffer overflows or timeout errors
       • Full FIFO @DMB (buffer overflow)
         • Caused by config fail, bad hardware or signal timing/quality
         • Overflows can cause Synch loss, Data corruption, or Timeout
       • CRC errors

  4. Defining an Error at the DDU
     • Setting the “bad event” signal in the DDU trailer
       • Any events with non-perfect data checks will get this
         • Minor bit errors or format problems, SCA Full
       • Exceptions where data payload may not be affected:
         • Clean single-bit error in a header/trailer-word marker
         • Fiber Rx error that may have occurred between events
         • DCM phase-lock-lost that may occur between events
       • To add: 64-bit boundary violation (rather than Hard Reset)
     • Requesting a Sync Reset (via VME IRQ, then FMM)
       • L1A mismatch detected twice on one CSC
         • Two different boards in the same event
         • Separate occurrences in two different events
           • Either the same board or different boards
       • Buffer Overflow at DMB or DDU
     • Requesting a Hard Reset (via VME IRQ, then FMM)
       • Anything that corrupts the data irreversibly
       • Anything that “looks” like an SEU, e.g. repeated trivial errors
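
The three response levels on this slide (bad event, Sync Reset request, Hard Reset request) can be sketched as a small classifier. This is a rough illustration under stated assumptions, not the DDU logic: `Action`, `EventChecks` and `ResetMonitor` are hypothetical names, Hard Reset is assumed to take priority over Sync Reset, and `repeat_limit` stands in for the unquantified "repeated trivial errors".

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    NONE = "none"
    BAD_EVENT = "bad event"
    SYNC_RESET = "sync reset request"
    HARD_RESET = "hard reset request"


@dataclass
class EventChecks:
    """Per-event check results; all field names are hypothetical."""
    l1a_mismatches: list = field(default_factory=list)  # (csc, board) pairs
    minor_errors: int = 0          # e.g. bad CRC, minor format problems
    buffer_overflow: bool = False  # at DMB or DDU
    data_corrupted: bool = False   # irreversible corruption


class ResetMonitor:
    def __init__(self, repeat_limit: int = 3):
        self.repeat_limit = repeat_limit  # assumed value for "repeated"
        self.l1a_by_csc = Counter()
        self.minor_total = 0

    def process(self, ev: EventChecks) -> Action:
        # Count L1A mismatches per CSC, across boards and across events.
        for csc, _board in ev.l1a_mismatches:
            self.l1a_by_csc[csc] += 1
        self.minor_total += ev.minor_errors
        if ev.data_corrupted or self.minor_total >= self.repeat_limit:
            return Action.HARD_RESET
        if ev.buffer_overflow or any(n >= 2 for n in self.l1a_by_csc.values()):
            return Action.SYNC_RESET
        if ev.l1a_mismatches or ev.minor_errors:
            return Action.BAD_EVENT
        return Action.NONE

    def clear(self):
        """Forget all counts, as would happen after a Reset."""
        self.l1a_by_csc.clear()
        self.minor_total = 0
```

Counting mismatches per CSC covers both cases listed on the slide: two different boards in one event, and separate occurrences in two events on the same or different boards.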

  5. Limitations & Considerations
     • We do not know how frequently any particular error may occur
       • We may need to modify definitions as we get LHC experience
     • Failure modes can be complex
       • Obvious error symptoms may be caused by more subtle problems
       • E.g. we often see a “CFEB problem” which is caused by the corrupted ALCT data that precedes it (bad ALCT headers & 64-bit violations)
       • We will learn more from LHC experience
     • We can kill fibers for known, frequent problems
       • But we don’t want to kill everything!
       • At some low rate, we must be allowed to request a Reset
     • We may see spontaneous critical problems that repeat
       • For these, we may need to automatically set “Ignore Fiber”
       • This would be temporary, set in real time by DDU logic
       • Only use in case of a repeated Critical Error from a CSC
       • Notification of any action is always registered in the data stream
       • We already send a complete “Live Fiber” list in _every_ event
       • At next Reset, all “Ignore” settings get cleared to normal state
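
The proposed automatic “Ignore Fiber” behaviour could look roughly like the sketch below. It is an illustration of the idea only: the class `FiberMask`, its method names, and the default `critical_limit` of 2 (one reading of "repeated Critical Error") are all assumptions, and the number of fibers is left as a parameter because the slides do not state it.

```python
class FiberMask:
    """Tracks which input fibers are temporarily ignored (hypothetical model)."""

    def __init__(self, n_fibers: int, critical_limit: int = 2):
        self.n_fibers = n_fibers
        self.critical_limit = critical_limit  # assumed meaning of "repeated"
        self.critical_counts = [0] * n_fibers
        self.ignored = set()

    def record_critical_error(self, fiber: int) -> bool:
        """Count a critical error; return True if the fiber is now ignored."""
        self.critical_counts[fiber] += 1
        if self.critical_counts[fiber] >= self.critical_limit:
            self.ignored.add(fiber)
        return fiber in self.ignored

    def live_fibers(self):
        """List of fibers still read out, as would be sent in every event."""
        return [f for f in range(self.n_fibers) if f not in self.ignored]

    def reset(self):
        """At the next Reset all Ignore settings return to the normal state."""
        self.critical_counts = [0] * self.n_fibers
        self.ignored.clear()
```

Because the live list is recomputed from the ignore set each time, any automatic masking shows up in the per-event data, which matches the requirement that notification of any action is registered in the data stream.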

  6. DDU Error Table I
     [1] Error bits resulting in RESET REQUIRED persist until the RESET occurs. Questionable cases (in gold) indicate that a reset is only required for mitigation of recurring errors. TBD: sync/hard reset distinctions.
     [2] Found inside an event, i.e. between Beginning-Of-Event (=Header1 signature) and End-Of-Event (=combination Trailer1+Trailer2 signatures), at least one of the following: Extra DMB_Header1, Extra DMB_Header2, Lone Word, Extra TMB/ALCT_Trailer, Extra DMB_Trailer1, DMB_Trailer2.
     [3] Missing TMB/ALCT_Trailer word, missing DMB Header word, Wrong First word, or Extra Control words.

  7. DDU Error Table II

  8. DDU Error Table III
