C-RORC PRR

Presentation Transcript


  1. C-RORC PRR ALICE / ATLAS ROS team C-RORC PRR

  2. Agenda
     • Introduction
     • ALICE, by H. Engel
     • ATLAS
     • Concluding remarks

  3. Introduction
     • C-RORC: hardware design of ALICE
     • Types of firmware
     • Test firmware used during production: mainly developed by ALICE, with test procedures discussed between ALICE and ATLAS (the loopback connector used for the tests was developed by ATLAS)
     • ALICE specific
     • ATLAS specific
     • RobinNP: C-RORC to become the Gen-III ROS ROBIN
     • “Dozolar”: data source for 12 S-links, for testing the S-link inputs of the RobinNP
     • RoIBuilder: if the C-RORC replaces the VME-based RoI Builder, specific firmware may be needed; it is not excluded that the RobinNP firmware can be used

  4. C-RORC
     • C-RORC picture with some explanation
     • Annotation on picture: could be removed to improve air circulation (will be discussed later)

  5. This review
     • Production readiness of the C-RORC hardware
     • First prototypes produced by Cerntech (Hungary)
     • PCBs from Exception PCB, UK
     • After tendering, the production contract was awarded to Hapro (Norway)
     • PCBs from Suntak, China
     • 20 pre-production cards under test since mid-February

  6. Hapro and Cerntech C-RORC
     • PCB build is different: copper balancing on the Cerntech board (better spread of heat during manufacturing of the board)
     • Cooler + fan are different: the Hapro board is within the PCIe height limit
     • Hapro FPGA: commercial grade (0 – 85 °C); Cerntech FPGA: industrial grade (-40 – 100 °C)
     • Picture labels: Cerntech, Hapro

  7. Pre-series test at contractor’s site and tests by ALICE
     • Described in the presentation by H. Engel

  8. Tests performed by ATLAS
     • Visual inspection of the pre-series cards
     • Already mentioned by H. Engel: on one card, 3 LEDs were soldered on one side only
     • Fixed by the CERN SMD Workshop
     • Card at Nikhef: some vias filled with solder
     • Picture labels: Hapro, Cerntech, Hapro, Cerntech

  9. Tests performed by ATLAS I
     • With RobinNP firmware:
     • robinnpbist program:
     • Checks register contents
     • Measures FPGA temperature
     • Sets clock frequencies for S-Links
     • On-board memory tests
     • DMA speed tests
     • Interrupt tests, including performance benchmarking
     • Tests of speed and data integrity for page handling and transfers into buffer memory
     • Temperature measurements, read out via PCIe or via JTAG using ChipScope

  10. Tests performed by ATLAS II
      • Standard data taking environment using ReadoutApplication:
      • “Indexing” incoming data and managing buffer memory pages
      • Receiving requests via network from the ROSTester program
      • Forwarding requests for data to RobinNP
      • Sending data via network to the ROSTester program
      • Data generated by the internal test generator, by DOLARs or by MDT RODs
      • For short fragments (50 words), stable running has been seen over periods of 11 hours (limited by ROSTester)
      • Fragments larger than ~180 words cause a lockup of the firmware after a short time (10 – 50 s) when the request fraction is 100%. The cause is a logic error in the FPGA's internal arbitration for access to shared resources. There is no obvious dependence on features of the C-RORC hardware. A fix for the lockups has been found, consisting of minor (but clearly significant) changes to a couple of state transitions in the Memory Controller's Finite State Machines. The memory is being operated at 303 MHz DDR; it is likely that with more work this can be scaled up
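
The "303 MHz DDR" figure can be related to the "DDR3@606" label used elsewhere in the slides: a double-data-rate interface makes two transfers per clock cycle. The sketch below works this out and also estimates a theoretical peak bandwidth. The function names and the 64-bit bus width are assumptions made for illustration; the slides do not state the data bus width.

```python
def ddr_transfer_rate_mts(clock_mhz: float) -> float:
    """Mega-transfers per second for a double-data-rate bus (two per clock cycle)."""
    return 2 * clock_mhz


def peak_bandwidth_mbytes_per_s(clock_mhz: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth in MB/s (10**6 bytes per second).

    bus_width_bits=64 is an ASSUMPTION (a common DDR3 module data width),
    not a figure taken from the slides.
    """
    return ddr_transfer_rate_mts(clock_mhz) * bus_width_bits / 8


if __name__ == "__main__":
    print(ddr_transfer_rate_mts(303))        # 606
    print(peak_bandwidth_mbytes_per_s(303))  # 4848.0
```

The peak figure is an upper bound only; the sustained rate seen by the firmware would be lower and is not given in the slides.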

  11. Test setups
      • Nikhef: Intel dual-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs, 1 40 GbE NIC
      • RHUL: single-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs
      • CERN:
      • 2 C-RORCs used as Dozolar
      • 2 Gen-III candidate PCs with 2 RobinNPs each
      • 1 PC with 3 DOLAR cards

  12. Observations
      • Current for a few cards is ~10% higher than for the other cards, but these cards function normally
      • Boards from Cerntech seem to be less sensitive to air flow
      • With good air flow and a functioning fan, the FPGA temperature is not a problem (< ~65 °C)

  13. C-RORC FPGA Core Temperatures
      • Accuracy of FPGA temperature sensor: ± 4 °C
      • RHUL: 2 x DDR3@606, single rank, 100 MHz oscillator; measurements at RHUL for system without lid
      • Nikhef: 100 MHz oscillator, 1 subROB configuration
      • 2 subROBs: ~ +5 °C

  14. Infrared photos
      • Hapro C-RORC in machine with Supermicro MB at Nikhef, 4 U high machine with lid open
      • Photo labels (in slide order): ALICE test firmware, RobinNP firmware
      • FPGA sensor readings (in slide order): ~ 70 °C, ~ 64 °C

  15. Temperature
      • High data rates: no significant change of the FPGA temperature
      • No relation with presence or absence of QSFPs
      • Reporting and monitoring of fan failure and over-temperature: via Icinga (Nagios), with automatic flushing of the FPGA configuration to reduce power dissipation
      • To be implemented
      • Discuss common solution with ALICE
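
As the monitoring is still to be implemented, the following is only a minimal sketch of what a Nagios/Icinga-style check could look like. It uses the standard plugin return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function name and both thresholds are hypothetical: the critical default is simply the 85 °C commercial-grade limit minus the ± 4 °C sensor accuracy quoted in these slides, and the warning default is an arbitrary illustrative value. How the temperature and fan status are actually read from the card is not covered here.

```python
from typing import Optional, Tuple

# Standard Nagios/Icinga plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_fpga(temp_c: Optional[float], fan_ok: bool,
               warn_c: float = 70.0, crit_c: float = 81.0) -> Tuple[int, str]:
    """Map an FPGA temperature reading and fan status to a plugin status.

    warn_c and crit_c are HYPOTHETICAL defaults, not values from the slides.
    """
    if not fan_ok:
        return CRITICAL, "CRITICAL - fan failure"
    if temp_c is None:
        return UNKNOWN, "UNKNOWN - no temperature reading"
    if temp_c >= crit_c:
        return CRITICAL, f"CRITICAL - FPGA at {temp_c:.0f} C"
    if temp_c >= warn_c:
        return WARNING, f"WARNING - FPGA at {temp_c:.0f} C"
    return OK, f"OK - FPGA at {temp_c:.0f} C"


if __name__ == "__main__":
    code, message = check_fpga(64.0, fan_ok=True)
    print(code, message)
```

A real plugin would print the message and exit with the status code; an action such as flushing the FPGA configuration would be triggered separately on a CRITICAL result.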

  16. Identification of cards
      • DNA id of FPGA: unique number
      • Hapro serial number printed on PCB of card
      • ATLAS number
      • No registration of all QSFPs / memory modules, but the ROS team will keep a record of malfunctioning devices

  17. Number of cards to be produced
      • Total, including pre-production: 210 for ATLAS, 170 for ALICE
      • ATLAS:
      • Sub-detectors have been asked whether they would like to purchase C-RORCs for test setups; deadline for requests: 15 April. Two requests received so far
      • ATLAS with 210 C-RORCs: about 10% spares + ~10 cards for the validation system at CERN and test systems at developer labs
      • Need for a (small) additional batch of C-RORCs, to be discussed
      • Plan to have complete Gen-III ROS PCs available as spares (at least 4, depends on plans with pre-series)

  18. Testing upon arrival
      • Repeat Hapro test on a small sample
      • Subset of Hapro test for all cards (no loopback, no FMC)
      • robinnpbist testing with RobinNP firmware
      • Run a test partition with DOLARs or Dozolar sending test data to the C-RORC under test, together with the ReadoutApplication and ROSTester programs
      • After installation of the Gen-III ROS PCs, run again with the test partition and verify that loading new firmware is OK

  19. Deployment environment
      • USA-15
      • 2 U high server PCs
      • 2 C-RORCs + 2 dual-port 10 GbE NICs per PC
      • Purchase contract for PCs not yet awarded; tendering closed; two candidate PCs under test in building 4

  20. S-Link tolerance test
      • QSFP related:
      • Set up a ROL between a DOLAR and a RobinNP and measure with a variable attenuator at what attenuation the link starts to fail
      • LC-MPO fan-out will be tested at the same time
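
The attenuation measurement described above amounts to a simple sweep: increase the attenuation in steps and record the first setting at which the link fails. The sketch below illustrates that procedure only. The function name, the `link_ok` callback, the step size and the simulated failure point are all hypothetical; in a real setup `link_ok` would set the attenuator and check the link status, neither of which is described in the slides.

```python
from typing import Callable, Optional


def find_failure_attenuation(link_ok: Callable[[float], bool],
                             start_db: float = 0.0,
                             stop_db: float = 20.0,
                             step_db: float = 0.5) -> Optional[float]:
    """Return the first attenuation (dB) at which link_ok reports failure.

    Returns None if the link stays up over the whole sweep range.
    """
    if step_db <= 0:
        raise ValueError("step_db must be positive")
    n_steps = int(round((stop_db - start_db) / step_db))
    for i in range(n_steps + 1):
        # Compute each setting from the index to avoid accumulating rounding errors.
        attenuation_db = start_db + i * step_db
        if not link_ok(attenuation_db):
            return attenuation_db
    return None


if __name__ == "__main__":
    # Simulated link that drops at 7.5 dB; this value is purely illustrative.
    simulated_link = lambda attenuation_db: attenuation_db < 7.5
    print(find_failure_attenuation(simulated_link))
```

Repeating the sweep with and without the LC-MPO fan-out in the path would give the two results a comparable basis.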

  21. Schedule slippages
      • There have been some significant slippages in the schedule. In particular:
      • Delivery of the pre-series C-RORC cards was delayed, initially by a change of FPGA fan (to meet the PCIe thickness spec) and then more significantly by changes in the PCB build requested by the company (NB: without the efforts of Tivadar Kiss these probably could not have been solved).
      • The RobinNP firmware has taken longer to produce than expected. Although it now all exists, there are still issues remaining: a fix has been found for the issue in the buffer handling for full-size fragments from multiple channels, but optimization and further checking of the firmware are needed.
      • Procurement of the Gen-III ROS PCs has been delayed, mainly in getting the tender launched, so that tests of the candidate PCs are only just starting

  22. Effect on testing of schedule slippages
      • Thus we are not yet able to start a long-duration stability test using pre-series cards in the final configuration
      • But there is a growing body of evidence from tests by ALICE and by ourselves at CERN, RHUL and Nikhef that the C-RORC H/W works reliably
      • Thus we no longer plan to run a long-duration (6-week) stability test prior to the main C-RORC production: the risk of not running the test is small and is outweighed by the consequence of the extra delay it would cause

  23. Support
      • 5-year warranty by Hapro
      • Test setup at CERN for first diagnosis, remote access by experts possible
      • Test setups at RHUL and Nikhef for further investigations

  24. Installation schedule
      • Boundary conditions:
      • The ROS system has to be stable and tested by 1 February 2015
      • In case of a major problem with the Gen-III, re-installing and re-testing the Gen-II H/W takes ~6 weeks

  25. Concluding remarks
      • RobinNP firmware not yet finalized, but to the best of our knowledge there are no hardware-related issues
      • ALICE is happy with starting the production
      • If we do not start production now, deployment of the Gen-III ROS for 2015 is not likely to be possible

  26. Backup

  29. Picture labels: Cerntech, Cerntech, Hapro, Hapro

  30. Test machine at Nikhef: Intel server with S2600CP motherboard

  31. Test setup at Nikhef
      • Intel server with 2 C-RORCs
      • VME crate with 12 MRODs and SBC
      • Rack with Gen-I and Gen-II ROS PCs with DOLARs and 10 GbE NICs, and with an E5-1620 based machine with a 40 GbE dual-port NIC and 10 GbE NICs

  32. Test machine at RHUL with Supermicro X9SRL-F board

  33. Machine with Supermicro MB at CERN
      • Fan may not be optimally positioned for maximum air flow over the PCIe cards
      • Picture from the Supermicro web site; the machine at CERN has 1 CPU

  34. Test setup configuration (Nikhef)
      • 1 word = 4 bytes
      • Diagram labels: ROS PC running ReadoutApplication (2 x E5-2690 CPU, only 1 CPU used, SLC6 64-bit) with a Cerntech C-RORC and a Hapro C-RORC; two Gen-II ROS PCs running ROSTester; Gen-I ROS PC; PC with E5-1620 CPU; DOLARs; Intel 2-port 10 Gb/s NICs; 12 S-links; 1 subROB

  35. Test with fix for lockup: 10% readout fraction, 12 x 150 word fragments. 4 x 10**9 events generated: ROSTester stops

  36. Test with fix for lockup: 55% readout fraction, 12 x 350 word fragments

  37. Test with fix for lockup: 55% readout fraction, 12 x 250 word fragments

  38. Test with fix for lockup: 55% readout fraction, 12 x 200 word fragments

  39. Test with fix for lockup: 4 x 10**9 events, 45% readout fraction, 12 x 200 word fragments

  40. Test with fix for lockup: 40% readout fraction, 12 x 200 word fragments

  41. Test with fix for lockup: 70% *) readout fraction, 12 x 200 word fragments
      *) 1 ROSTester requesting 100% of fragments, other ROSTester requesting 40% of fragments
      Slide corrected on 15 April

  42. FPGA temperature for test of previous slide
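
The test configurations in the backup slides can be put into bytes using the figures given there: 1 word = 4 bytes and 12 S-links per C-RORC. The sketch below is illustrative arithmetic only. The function names are hypothetical, and it assumes that the "readout fraction" is the percentage of fragments requested; the 70% quoted for slide 41 is consistent with the mean of the two ROSTester fractions (100% and 40%).

```python
from typing import Sequence

BYTES_PER_WORD = 4   # from the slides: 1 word = 4 bytes
N_SLINKS = 12        # from the slides: 12 S-links


def event_size_bytes(words_per_fragment: int, n_links: int = N_SLINKS) -> int:
    """Total bytes arriving per event over all S-links."""
    return n_links * words_per_fragment * BYTES_PER_WORD


def requested_bytes_per_event(words_per_fragment: int, readout_percent: float,
                              n_links: int = N_SLINKS) -> float:
    """Average bytes requested per event (ASSUMES readout fraction = share of fragments requested)."""
    return event_size_bytes(words_per_fragment, n_links) * readout_percent / 100


def mean_readout_percent(percents: Sequence[float]) -> float:
    """Mean of the request fractions of several ROSTesters."""
    return sum(percents) / len(percents)


if __name__ == "__main__":
    print(event_size_bytes(200))               # 9600
    print(requested_bytes_per_event(200, 55))  # 5280.0
    print(mean_readout_percent([100, 40]))     # 70.0
```

Multiplying these per-event figures by an event rate would give a bandwidth, but no event rate is stated in this transcript.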
