
CMS Week

DT Electronics Issues and plans. Cristina Fernández Bedoya on behalf of the DT group. CMS Week. December 7th, 2010. Outline: 1-Failures during 2010 and spares 2-Activities during shutdown 3-DT Upgrade summary.

Presentation Transcript


  1. DT Electronics Issues and plans Cristina Fernández Bedoya on behalf of the DT group CMS Week December 7th, 2010

  2. Outline
     1-Failures during 2010 and spares
     2-Activities during shutdown
     3-DT Upgrade summary

  3. DT LV and HV failures during operation
     LV: Changed 1 A3100 and 2 A3050 during winter shutdown. None after that. Two other modules exchanged due to Anderson Power connectors overheating.
     HV: In 2010, 12 interventions in UXC to replace A877 boards. 3 interventions in USC to replace A876 boards.
     A877:
     • Sep 1st 2010. One A877 YB+2 MB3 S11
     • Aug 3rd and 5th 2010. Two A877 YB-1 MB1 S07 (the first board changed has a recurrent problem, not seen at CAEN; we decided not to use it anymore)
     • July 25th 2010. One A877 YB-1 MB4 S05
     • May 11th 2010. One A877 YB+1 MB1 S02
     • May 1st 2010. Two A877 YB0 MB3 and MB4 S09
     • April 8th 2010. One A877 YB+1 MB2 S04
     • April 4th-2nd 2010. One A877 YB-1 MB2 S04
     • February 15th-16th 2010. One A877 in YB-2 MB4 S07
     • February 25th 2010. One A877 YB-1 MB2 S04
     • January 12th 2010. One A877 YB-1 MB3 S03
     A876:
     • Aug 2nd 2010. One A876 YB0 S06
     • July 14th 2010. One A876 YB+2 S09
     • July 13th 2010. One A876 YB+2 S04
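As a cross-check of the A877 list on slide 3, the replacements can be tallied per wheel. This is only a bookkeeping sketch: the data are copied from the slide, the names `A877_REPLACEMENTS` and `boards_per_wheel` are hypothetical, and it assumes each replaced board counts as one of the 12 UXC interventions quoted.

```python
from collections import Counter

# (date as written on the slide, boards replaced, wheel) for each A877 entry.
A877_REPLACEMENTS = [
    ("Sep 1st", 1, "YB+2"),
    ("Aug 3rd and 5th", 2, "YB-1"),
    ("July 25th", 1, "YB-1"),
    ("May 11th", 1, "YB+1"),
    ("May 1st", 2, "YB0"),
    ("April 8th", 1, "YB+1"),
    ("April 4th-2nd", 1, "YB-1"),
    ("February 15th-16th", 1, "YB-2"),
    ("February 25th", 1, "YB-1"),
    ("January 12th", 1, "YB-1"),
]


def boards_per_wheel(entries):
    """Sum the number of replaced boards for each wheel."""
    counts = Counter()
    for _date, n_boards, wheel in entries:
        counts[wheel] += n_boards
    return counts


if __name__ == "__main__":
    counts = boards_per_wheel(A877_REPLACEMENTS)
    for wheel, n_boards in counts.most_common():
        print(f"{wheel}: {n_boards}")
    print("total:", sum(counts.values()))
```

With the entries as listed, the total is 12 boards, half of them in YB-1.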

  4. CAEN DT LV MODULES at P5 Material Barrack. LV module repairs

  5. CAEN DT HV MODULES at P5 Material Barrack

  6. CAEN DT HV MODULES
     -The number of failures in A877 has slightly reduced but it is still very high (11 modules sent to CAEN for repair)
     -The number of spares (15-19) could be adequate, though there are many doubts about the reliability of the spares
     -Sometimes problems cannot be found by CAEN
     -Some faults cannot be reproduced even in our test bench
     -Faulty boards might damage the chambers
     -The test bench in 904 has evolved slowly this year (travelling issues didn't help)
     -Clean-up of 904 took place (thank you very much Franco, Lorenzo)
     We need a new HV (A877) test bench to reproduce/diagnose the failures observed on the detector.

  7. CAEN LV ANDERSON POWER CONNECTORS
     -Overheating problems in the PP75 Anderson Power connectors of the CAEN LV modules (A3050, A3100, MAO)
     -Failures in 2009: 46 (i.e. 3.8/month)
     -In January 2010 we ran a large campaign on all the cables (adding soft extension cables, recrimping, spring on housing, Vdrop or temperature monitoring everywhere). It has improved the situation but has not completely solved the problem
     -Failures in 2010: 11 (i.e. 0.8/month) (6 A3050, 1 MAO). Only one of them (YB0S2 MB4) was also problematic last year.
     -On September 1st 2010, campaign of adding Santovac lubricant to all the positive PP75 in YB0. No clear improvement: two modules failed again (they had already failed).
     *As in the past, when moving the connector slightly the temperature drops (though now the channel does not trip), and again all problems found are in the red (positive) connector
     *YB0 seems to be the worst: first modules installed downstairs and many "movements" due to lack of modules.

  8. CAEN LV ANDERSON POWER CONNECTORS
     A3050 changed:
     -3051000209002000056 (CAEN 27): changed on Jun 23 (received Jan 2006)
     -3051000209002000063 (CAEN 33): changed on Apr 27 (received Jun 2006)
     A3050 which gave VCon Err:
     -YB-2 MB1S12 (error on Jul 21): 351 20 90 020 00122 (CAEN 124) (received Feb 2008)
     -YB-1 MB3S10 (error on Sep 26): 351 20 90 020 00099 (CAEN 117) (received Nov 2007)
     -YB0 MB2S10 (error on Mar 8): 351 20 90 020 00072 (CAEN 25) (received Feb 2006)
     -YB0 MB1S11 (error on Aug 24): 351 20 90 020 00075 (CAEN 91) (received Jan 2007)
     MAO which gave VCon Err:
     -351 20 90 023 00034: received Feb 2008
     No clear correlation with the production date of the module.
     *Again, we observed once a moderate increase of the MAO contact temperatures, during the magnet ramp in September
     *In the 2 cases where the module was changed, it didn't fail again
     *The ultimate cause of the failure remains unknown
     *At least the improvements made to monitor any problem work very well
     *Also, we have observed a few (~4) A3009 small connectors that are at higher temperature (low current, no risk, but worrisome). HCAL has also seen this.

  9. CAEN LV ANDERSON POWER CONNECTORS
     PLAN FOR THIS SHUTDOWN:
     *Measure all the temperatures before the power cut (in particular YB0 S10 top A3100, which looks suspicious)
     *Add lubricant in all of the wheels (it does not seem to make things worse and, if anything, may help)
     *Change the modules that have failed repeatedly, and ask CAEN to replace the connector and submit it for analysis
     *Recrimp the A3009 connectors that are at higher temperature
     *Keep on monitoring… replacing with bolts does not seem a satisfactory solution for the moment (we may still be seeing problems only in already damaged modules)

  10. HV Problems Summary
     • No new problems since the end of February, but the old YB+1 MB3 S07 problem came back
     • The last "real" HV trip (YB+1 MB2S08 SL1) occurred on July 19 during the magnet ramp down, plus 2 additional ones in October
     [Table: channels lost. Source: MUB workshop, 27-Sep-10]

  11. Failures 2010: FrontEnd
     MB3 S1 W-1 PHI1: noisy FEB, 4 channels lost
     MB1 S7 W+1 SL3: all layers, ch 0 to 3
     MB2 S1 W-2 SL1: all layers, ch 53 to 56
     2 FEB lost
     +Minor Testpulse issues
     YB+2 S5 MB1 SL2: 1 superlayer lost (sporadic)
     -Not the first time we have seen this type of problem. In the past:
     -Usually related to the LV FE connection in the chamber (connector not fully plugged)
     MB2 S9 W-2 PHI3: 1 SL dead
     MB2 S12 W+2 PHI3: 1 SL dead

  12. Chambers. Spares: HVB, HVC, FEB

  13. Minicrate Failures 2010
     CCB Link
     -MB2 S1 W+2: primary link receiver amp=0. Secondary OK
     ROB - ROBUS
     -MB3 S1 W+2: robus ROB 2
     -MB2 S3 W+1: robus ROB 5
     -2 ROB lost sporadically
     -MB2 S2 W+2: ROB L1 buffer parity. Data OK
     -MB4 S8 W-1: robstate error
     -MB3 S3 W+2: robstate error
     -Only problem in robstate line. All ROB OK
     MC generic
     -MB4 S4 W+2: Power Ccbid79: bad contact on CCB power cable Vdd (sporadic)
     -MB1 S11 W+2: pwr RPC
     -MB3 S2 W0: pwr RPC
     -RPC backup comm lost (likely RPC problem)
     PADC/ALI lost
     -MB4 S7 W+1: no comm. MC -> PADC/ALI

  14. Minicrate Failures 2010
     Only 3 new BTI errors (low impact):
     -YB+1 S10 MB1 (2 BTI errors)
     -YB+1 S10 MB2 (1 BTI error)
     -YB-1 S10 MB4-10(9) (2 BTI errors)
     11 TRB errors identified with the SEU tool (probably there for a long time), not critical (we just need to disable the SEU test in those BTIs).
     Higher impact remaining problems:
     Configuration
     -YB0 S9 MB2: 2 BTI errors, low efficiency
     -YB0 S10 MB1: 8 BTI errors when configuring
     -YB0 S10 MB2: 18 BTI errors, low efficiency
     -YB0 S10 MB3: 2 BTI errors, low efficiency
     Cables
     -YB0 S11 MB2 TRB2 & TRB3: connection missing. Maybe flex connector.
     -YB+1 S4 MB1 TRB-PHI TRB2 -> TRB3: connection missing. Maybe flex connector.
     Clock problems
     -YB-1 S1 MB3: TRB 0 no clock
     -YB+2 S3 MB3: TRB 2 losing clock sporadically
     Power problems
     -YB-1 S6 MB1: TRB OFF
     -YB-2 S8 MB1: TRB 6 sporadic problems when powering the TRB

  15. Chambers and Minicrates Summary
     Summary of problems in terms of location:
     -Problems in the system have low impact on the detector performance and tend to be sporadic
     -The reduced number of power cycles has improved ROBUS and TRB behaviour
     -CCBlink problems are also in the past
     -Slight increase in the number of FEB failures (2 FEBs dead this year + 1 SL)
     -In either case, it is not problematic to run another year without accessing the detector

  16. Minicrates. Spares
     146 good BTIM for TRB repairs (number to be verified)

  17. Sector Collector Failures 2010
     • In 2009 the two ROS exchanges were due to very similar problems
     -GOL problem hopefully improved with new firmware
     -CEROS problem not reproduced in the lab; related to power distribution in that crate?
     -1 TIM and 1 Linco problem this year
     -In general we were lucky with the problems, because we had chances to fix them very rapidly
     -In some cases the repair was not easy and was very time consuming

  18. Sector Collector Summary
     Summary of problems in terms of location:

  19. Failures 2010 versus Spares
     Worst figures (none critical):
     -A877 and A876
     -TSC and ROS (need to retrieve the ones under repair)
     -TRB? Maybe not anymore? Good recovery of problematic ones

  20. DDU. Spares
     2 proto DDU can be used as spares (after a little work)

  21. DCS/DSS. Spares

  22. Performance during LHC 2010
     -DT has behaved very well during LHC running
     -Clock ramp sensitivity at the beginning of LHC has been solved
     -Our downtime has been very low (and not because problems are ignored): we contributed 1% of the CMS downtime (DT downtime was less than 0.1% of the total time). Downtime was mainly due to the manual Resync commands (1 minute). Automatic resync has been enabled since August 27th, and since then the downtime is negligible.
     • More than 30 interventions in the cavern (we may not have that much access in the future!):
       • Overheating of the CAEN LV module Anderson Power connectors. The failure rate has decreased during the last months and appears concentrated in some particular modules.
       • HV module exchanges
       • Interventions in the SC have been few but painful
     -Less than 0.4% of the detector lost. Most of the problems are sporadic.
     • The failure rate of the electronic modules has been low (also for TRB).
     • The number of spare boards should be enough to guarantee smooth operations in 2011 and 2012.
     • To follow up: HV module failures, OptoRX monitoring, CAEN LV connectors

  23. Activities during this shutdown
     Many things happening (for a "short" shutdown with the detector not opening)
     *Centrally:
       -Replacing batteries of old VME PCs (Dell PowerEdge 1425SC and 2850)
       -Reinstalling all PCs in the CMS network (WIPING OUT THE SYSTEM DISC). Should not affect us.
     *Replacement of 3 DTTF crates
     *New BS firmware
     *Study OptoRX JTAG interface problems
     *Try Linco with DTTF
     *Finish cabling the DT Technical Trigger in order to be able to trigger on a single chamber for debugging purposes (i.e. study MB4 occupancy?, etc.)
     *New DDU firmware
     *Move from 10 to 5 DDUs
     *Test new Linco PCI bridge
     *Test new Opto485 board for MC secondary link
     *Fix TSC problem in YB-2 S6
     *New ROS firmware
       -Better monitoring of maximum event size
       -Avoid GOL powering off on each configuration
       -Implement hard reset for FPGA reloading (configuration will be lost…)
     *Plus the LV interventions previously mentioned

  24. -Will be done in this shutdown
     -It would be nice to have the remote firmware tools (manpower needed!!)

  25. DTTF
     • Replacement of 3 DTTF crates with modified power distribution
       -The present power regulation in those 3 crates (only) does not work properly and had to be slowed down (as a side effect, FPGAs may not load properly at power up). It is NOT related to our OptoRX problems.
       -It should be "straightforward", meaning:
         -uncabling
         -removing all PHTF, OptoRX, etc.
         -putting everything back in place…
     CAEN VME PCI boards
     -High number of failures (optical transmitter?): 5 boards out of 8 in DTTF
     -Tracker also reported a high number of failures (though not as large)
     -Still waiting for an answer from CAEN about the cause
     -Janos has purchased more spares
     -We haven't had problems in the DDU (but we should make sure we have our spare in hand)
     -Also, CAEN is delivering (soon…) the new CAEN VME PCIexpress board

  26. LINCO+OptoRX
     5V protection for the Linco VME board in the SC crate
     *The solution of adding an extender to the connector proved unreliable, so we decided to start the production of a new PCI-VME carrier with active protections on board (the commercial carrier was not available any more).
     *We are still in the prototype phase and we don't plan to make any intervention during this shutdown.
     New Linco PCI board
     *We tested it in the November technical stop but we faced some problems
     *They have been solved in Padova and the board will be tested again in this shutdown
     Test Linco in DTTF crates (OptoRX access)
     *We haven't been able to reproduce the OptoRX problems in the lab
     *Janos suggested it may be related to the way the CAEN VME controller handles the accesses
     *We would like to check whether we can reproduce the same problems using this LINCO controller (it could be advantageous for everybody if it goes smoothly)
     *LINCO uses HAL libraries (and is PCIexpress compatible)
     *Tests in the lab soon, and at P5 by the end of January?
     *DT Trigger Supervisor needs to be modified to use the Linco drivers

  27. Move from 10 to 5 DDUs
     [Figure: run 147219, pp collisions, October 5th 2010, luminosity 44.285 × 10^30 cm^-2 s^-1]

  28. Move from 10 to 5 DDUs
     -Average event size ~250 bytes (25 MBps), with no significant dependence on luminosity or HI.
     -With double the event size (50 MBps), we are still well below the DDU limit (250 MBps).
     -If we are to see problems, we want to know as soon as possible (intervention taking place this winter)
     -In the meantime, we increase the lifetime of the DDU spares
     -In order to facilitate the recovery from a DDU failure, we decided to leave one more DDU in the crate (powered), not plugged to any fiber.
     -WE NEED TSC zero suppression ENABLED (Trigger Supervisor to do it automatically)
     *Also new DDU FW with new thresholds on the number of blocked ROBs needed to go out-of-sync.
       Old thresholds: 2, 4, 8, 16, 32, 64, 128, 256
       New thresholds:
         3 ==> very sensitive to any kind of problem
         9 ==> one minicrate + one ROB of margin
         ~15 ==> ~two minicrates
         >~25 ==> ~one sector
         >~75 ==> ~one quadrant
         >300 ==> never
     *Test spare DDU crate PS in system crate
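The bandwidth argument on slide 28 can be checked with a few lines of arithmetic. This is a minimal sketch under assumptions: the ~250-byte figure is taken to be the average event size per DDU in the 10-DDU layout, and the trigger rate is taken as 100 kHz, which is the rate implied by 250 bytes giving 25 MBps. The function names are hypothetical.

```python
L1_RATE_HZ = 100_000         # assumption: rate implied by 250 bytes <-> 25 MBps
DDU_LIMIT_MB_PER_S = 250.0   # DDU limit quoted on the slide


def throughput_mb_per_s(event_size_bytes, l1_rate_hz=L1_RATE_HZ):
    """Average data throughput in MB/s for a given event size and trigger rate."""
    return event_size_bytes * l1_rate_hz / 1e6


def headroom(event_size_bytes, limit_mb_per_s=DDU_LIMIT_MB_PER_S):
    """How many times the throughput fits below the DDU limit."""
    return limit_mb_per_s / throughput_mb_per_s(event_size_bytes)


if __name__ == "__main__":
    for label, size in (("10 DDUs", 250), ("5 DDUs", 500)):
        print(label, throughput_mb_per_s(size), "MB/s, headroom x", headroom(size))
```

Under these assumptions, halving the number of DDUs doubles the per-DDU throughput to 50 MB/s, which still leaves a factor of 5 below the quoted limit.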

  29. New 485 BOARD (Franco Gonella)
     For the secondary (copper) link to the Minicrates.
     Diagram labels:
     -(fastest) 485 chain termination & overvoltage protection
     -RS485 38.4 Kbps
     -RS422 link to ADLINK PMC8681 (PMC board already mounted on VME SC crate controller), 230.4 Kbits full duplex
     -485 link controllers, 1/sector: S1/S7, S2/S8, S3/S9, S4/S10, S5/S11, S6/S12
     -USC interface controllers
     -Backup optical link, 38.4 Kbps for compatibility with the present system (upgradable to 230.4 full duplex)
     -Power from LV CAEN module
     A prototype will be tested in the detector during this shutdown.
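To give a feeling for what the move from 38.4 Kbps to 230.4 Kbps means on the secondary link, the sketch below estimates the time needed to push a block of data over a serial line. The payload size is purely hypothetical, and the 10 bits per byte (start bit, 8 data bits, stop bit) is an assumption about the framing, not something stated on the slide.

```python
PRESENT_BAUD = 38_400     # present link speed quoted on the slide
UPGRADED_BAUD = 230_400   # speed of the new RS422 link quoted on the slide


def transfer_time_s(payload_bytes, baud, bits_per_byte=10):
    """Time to send payload_bytes over an asynchronous serial line.

    bits_per_byte=10 assumes 8N1 framing (hypothetical for this link).
    """
    return payload_bytes * bits_per_byte / baud


if __name__ == "__main__":
    payload = 64 * 1024  # hypothetical configuration block, in bytes
    slow = transfer_time_s(payload, PRESENT_BAUD)
    fast = transfer_time_s(payload, UPGRADED_BAUD)
    print(f"present: {slow:.1f} s, upgraded: {fast:.1f} s, speed-up x{slow / fast:.0f}")
```

Whatever the payload and framing, the ratio of the two speeds is a factor of 6.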

  30. 904: facility for present system and upgrades
     *DT database in place (thanks to Luca Ciano)
     *DAQ (à la P5) will come soon
     *Trigger part… a polite reminder
      (It may not be a priority now, but it is a good investment for the future)

  31. CMS PHASE 1 Technical Proposal
     https://cms-docdb.cern.ch/cgi-bin/DocDB/ShowDocument?docid=2717
     And CMS Upgrade week October 25th 2010: http://indico.cern.ch/conferenceDisplay.py?confId=74958
     REVIEW OF THE TECHNICAL PROPOSAL BY THE LHCC
     R&D plans for the new muon trigger electronics look preliminary. DT has been singled out as needing motivation for physics cases (resolution, efficiency...)
     Dec 15th - Respond to the LHCC questions.
     January - The upgrade technical proposal will be updated with new studies (SIMULATION)
     March - A second update of the TP will happen before the March LHCC meeting. This is our last chance to add studies and beef up the Physics case.

  32. DT Upgrade Phase 1
     Present proposal for Phase 1:
     *Build new TRB theta based on FPGA
       -Gain spares
     *Move Sector Collector electronics to USC
       -Simplify future upgrades
       -Minimize downtime and impact in case of failures
     *Redesign DTTF system
       -Get rid of the present problems with sector interconnections
     Obstacles:
     -Lack of physics motivations (except degraded performance, but no simulations to show)
     -Space for crates in USC close to DTTF
     -L1A latency
     -Lack of budget
     -Lack of manpower

  33. DT Upgrade Phase 1: TRB
     *4 BTI (==1 BTIM) have been satisfactorily integrated into 1 Actel FPGA, both A3P3000L-1 and A3PE3000-2
     *Timing fully closed with a good margin
     *They have been tested satisfactorily under radiation
     *Power scheme identified (radiation test on regulator ongoing)
     *First TRB theta prototypes by Q1 2011
     *Increased resolution in theta not foreseen (new cabling from Minicrates to balconies)
     From F. Montecassiano @ CMS Upgrade http://indico.cern.ch/contributionDisplay.py?sessionId=2&contribId=13&confId=74958

  34. DT Upgrade Phase 1: CuOF
     Present proposal is to make a 1 to 1 channel Cu-OF (present links are copper based, and their length cannot be increased without compromising reliability)
     [Diagram: copper links, 25 @ 240 Mbps and 32 @ 480 Mbps, converted to optical fiber at the same rates in the tower racks (substituting the present SC)]
     • OF extracted from the back of the SC crate
     • Power from present PS
     Torino has agreed to take care of this and to study possible usage of the CERN Versatile Link project
     CIEMAT will take care of the modified ROS (OF to Cu)

  35. USC
     In principle, there is enough space below the false floor in S1 USC to recover extra cable lengths (though it depends on the exact racks to be used).
     The main problem is to accommodate the SC crates in S1:
     -10 SC crates (11U each)
     -To minimize L1A latency, they should be close to the DTTF racks (S1D01 and S2D02)
     -In the DT racks at present there is only space for 6 SC crates (and not very close to DTTF)
     (Relocation can be done in batches of half a wheel)
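The rack-space problem on slide 35 is simple arithmetic, sketched here with the numbers from the slide (10 crates of 11U each, room for 6 in the present DT racks). The function name is hypothetical, and the sketch ignores any extra space needed for cooling, patch panels or cable routing.

```python
CRATES_NEEDED = 10    # SC crates to relocate (from the slide)
CRATE_HEIGHT_U = 11   # height of each SC crate in rack units (from the slide)
CRATES_THAT_FIT = 6   # crates that fit in the present DT racks (from the slide)


def rack_space_summary(needed=CRATES_NEEDED, fit=CRATES_THAT_FIT, height_u=CRATE_HEIGHT_U):
    """Return total, available and missing rack units for the SC crates."""
    total_u = needed * height_u
    available_u = fit * height_u
    return {
        "total_u": total_u,
        "available_u": available_u,
        "missing_crates": needed - fit,
        "missing_u": total_u - available_u,
    }


if __name__ == "__main__":
    print(rack_space_summary())
```

With these inputs, 4 crates (44U) would still need a place near the DTTF racks.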

  36. Beam instrumentation terminations. BLOWING TECHNIQUE: fewer fibers, but may require an additional rack for the patch panel!! (to be verified). Micro duct cabling at CERN

  37. PLAN A (minimal modifications in TSC boards) Use same SC boards but add OF to CU transducers in the nearby slot

  38. PLAN B (SC+OptoRX+DTTF new unit completely integrated) -With uTCA not all the needed input fibers fit in one board -Not easy to maintain compatibility with present system

  39. PLAN B

  40. DT Upgrade Phase 1: USC
     In any case, separation of the Readout and Trigger functionalities is required. This means that we have to rely on DCC to check the correctness of the input signals. Is anyone using DDU data anymore? May be needed with a new DTTF?
     -The DTTF should go to uTCA (or ATCA).
     -The proposal to compress 3 sectors (same wedge) in each board looks reasonable, since wedge sorter and eta-DTTF could be naturally included in the board
     -New Barrel Sorter
     Commercial, slow control DAQ+TTC

  41. DT Upgrade Phase 1: USC
     Conclusion:
     -Still at the discussion stage; it is not easy to find the best approach given the constraints
     -Effort in simulation and study of physics cases is missing

  42. BACK UP

  43. TRG/RO. Spares USC October 2009

  44. MC secondary link upgrade for 2012 shutdown
     Replacement of the 485 boards (10) housed in SC crates
     Improvements to the secondary link system made 2 years ago have solved the many RS485 IC ruptures on the MC linkboard
     But:
     -Improvements were realized with many 'handmade' patches added to the boards
     -The UXC-USC link for a half wheel is slow, 38.4 Kbps
     -485 boards often lose communication with DCS (last week 2 of 10, sometimes more)
     -Recovery requires cycling the SC crate off and on
     Enhancement of MC communications reliability:
     -Integration of all patches on the PCB of the new 485 board
     -Maximization of USC-UXC link speed
     -Boards remain compatible with present hardware
     -Requires the modification of part of the DCS server software
     Cost: about 15 Keuro. Manpower by INFN PD

  45. MC SECONDARY LINK: present system after 2008 improvements
     Primary serial link -> optical fiber MC communication
     Secondary serial link -> RS485 copper chain
     [Diagram: half-wheel RS485 board in the UXC upper/bottom SC 9U crate, with a controller, driver485 blocks and 38.4 Kbps chains to sectors 1/7, 2/8, 3/9, 4/10, 5/11 and 6/12; UXC-USC optical link at 38.4 Kbps; 485 chain termination & overvoltage protection]

  46. A Snapshot of DT and RPC (Barrel) Maintenance Work (as of today)
     • Thanks to Cristina and Gianni for providing the list for the MC maintenance.
     • Only work requiring access to the detector is entered in the table
     • Only RPC maintenance that requires moving the chambers is included
     Source: MUB workshop, 27-Sep-10
