Advanced Monitoring Techniques for the ATLAS TDAQ Network
Presentation Transcript

  1. Advanced Monitoring Techniques for the ATLAS TDAQ Network • Matei Ciobotaru (CERN; University of California, Irvine; “Politehnica” University of Bucharest), on behalf of the ATLAS Networking Group: B. Martin, A. Al-Shabibi, S. Batraneanu, S. Stancu, L. Leahu, L. Darlea, M. Ivanovici

  2. The ATLAS TDAQ Network – Role • The ATLAS Trigger and Data Acquisition (TDAQ) network handles the data transfers from the ATLAS detector to the analysis and storage nodes • Built with Gigabit Ethernet switches and routers • Sustained rates of 150 Gbit/s • The experiment relies on the network to function 24/7 with a minimal number of failures [Figure: ATLAS detector; TDAQ system]

  3. The ATLAS TDAQ Network – Photos • Almost 3000 devices and 5000 network connections… • How to make sure everything is working correctly? [Photos: 2500 computers installed in 90 racks; 2 concentrator switches per rack; 5 “big” chassis-based devices at the core]

  4. Inside this talk • Requirements in terms of network management • Commercial software we are using • Tools we developed in-house • Services for users, integration with ATLAS • Plans for the future • The big picture

  5. ATLAS Requirements • Installation: ease the equipment registration, inventory and verification; configure the devices • Operation: check the state of health of devices and links; monitor traffic conditions and raise alarms when needed; assist the user in navigating the realm of information; integrate with the ATLAS TDAQ software • Diagnostics: provide aids to the admin in case something goes wrong; be able to suggest solutions to problems • Overall goal: manage a large local area network which has to be very reliable and which has very high throughput requirements [Figure label: Complexity]

  6. Equipment registration • ATLAS equipment needs to be registered in four databases • Only some databases support batch registrations, others require manual intervention → may lead to inconsistencies • Developed a web application to cope with this situation • Central place for querying all the information about a device • Ability to cross-check the data across all databases → detect incomplete/incorrect registrations
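
The cross-check described on this slide could look roughly like the sketch below. It is only an illustration: the slides do not say which fields the four databases hold, so the database names (`db_a` to `db_d`) and the choice of a MAC address as the compared field are assumptions.

```python
# Minimal sketch of a registration cross-check (Python 3).
# Assumption: each database can be dumped as {hostname: mac_address}.
# The database names and the compared field are hypothetical.

def cross_check(databases):
    """Return {hostname: [problems]} for incomplete or inconsistent registrations."""
    all_hosts = set()
    for records in databases.values():
        all_hosts.update(records)

    problems = {}
    for host in sorted(all_hosts):
        issues = []
        # Incomplete registration: the host is absent from at least one database.
        missing = [name for name, records in databases.items() if host not in records]
        if missing:
            issues.append("missing from: " + ", ".join(sorted(missing)))
        # Incorrect registration: the databases disagree on the recorded value.
        macs = {records[host] for records in databases.values() if host in records}
        if len(macs) > 1:
            issues.append("conflicting MAC addresses: " + ", ".join(sorted(macs)))
        if issues:
            problems[host] = issues
    return problems


databases = {
    "db_a": {"sw-01": "00:11:22:33:44:55", "pc-01": "00:11:22:33:44:66"},
    "db_b": {"sw-01": "00:11:22:33:44:55", "pc-01": "00:11:22:33:44:67"},
    "db_c": {"sw-01": "00:11:22:33:44:55", "pc-01": "00:11:22:33:44:66"},
    "db_d": {"sw-01": "00:11:22:33:44:55"},
}
for host, issues in cross_check(databases).items():
    print(host, issues)
```

Here `sw-01` is consistent everywhere, while `pc-01` is flagged both as missing from one database and as having conflicting entries.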

  7. Equipment inventory • Network diagrams for ATLAS are made in Microsoft Visio using the NetDesign package • We created tools which discover what really exists in the network (what is connected where) [Figure: Visio; Network Discovery] • Developed an application which compares the two data sources (Visio and auto-discovery) → mismatches are detected and corrected in the field if necessary • For the network documentation – we also automatically generate a printable “report” with all the connectivity
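
Comparing the documented and the discovered connectivity amounts to a set difference over links. The sketch below assumes both sources can be exported as pairs of (device, port) endpoints; the export format and the device names are hypothetical, not taken from the Visio or discovery tools.

```python
# Minimal sketch (Python 3): compare documented links against discovered links.
# Assumption: a link is a pair of (device, port) endpoints; direction is irrelevant.

def normalize(links):
    """Turn each link into an unordered pair so A-B and B-A compare equal."""
    return {frozenset(link) for link in links}


def compare_links(documented, discovered):
    doc, found = normalize(documented), normalize(discovered)
    return {
        "missing_in_network": doc - found,  # drawn in the diagram, not seen in the field
        "undocumented": found - doc,        # seen in the field, not in the diagram
    }


documented = [
    (("sw-a", "1/1"), ("sw-b", "2/1")),
    (("sw-a", "1/2"), ("pc-1", "eth0")),
]
discovered = [
    (("sw-b", "2/1"), ("sw-a", "1/1")),   # same link, endpoints listed in reverse
    (("sw-a", "1/3"), ("pc-1", "eth0")),  # cable plugged into a different port
]
for kind, links in compare_links(documented, discovered).items():
    for link in links:
        print(kind, sorted(link))
```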

  8. Network configuration (1) • In ATLAS we have more than 200 switches • Different vendors • Different mechanisms for configuration and monitoring (telnet, SNMP, web) • Q: How to access all devices in a transparent manner? • A: Bring them all under a common denominator (common interface) • Q: How to automate network management tasks? • A: Write scripts (little programs) • switches + scripting = sw_script • sw_script = Set of Python modules which can be used as building blocks for network management solutions • Common programming interface to all devices (object-oriented) • “Intelligent” tools for configuration and monitoring can be developed
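
The "common denominator" idea can be illustrated with an abstract base class. This is a sketch in Python 3 and not the actual sw_script class hierarchy: the class names, the `get_link_state` method and the canned data are all hypothetical; only the `get_port_names` method name is borrowed from the interactive session in the talk.

```python
# Sketch of a vendor-independent switch interface (hypothetical, not sw_script itself).
from abc import ABC, abstractmethod


class Switch(ABC):
    """Common interface that every vendor-specific class must implement."""

    def __init__(self, ip_address):
        self.ip_address = ip_address

    @abstractmethod
    def get_port_names(self):
        """Return the list of port names on the device."""

    @abstractmethod
    def get_link_state(self, port):
        """Return 'up' or 'down' for one port."""

    def ports_down(self):
        """Vendor-independent helper built only on the common interface."""
        return [p for p in self.get_port_names() if self.get_link_state(p) != "up"]


class SnmpSwitch(Switch):
    """Stand-in for a device queried via SNMP; the table is canned data."""
    _table = {"1/1": "up", "1/2": "down"}

    def get_port_names(self):
        return sorted(self._table)

    def get_link_state(self, port):
        return self._table[port]


class TelnetSwitch(Switch):
    """Stand-in for a device whose command-line output has to be parsed."""
    _cli_output = "ge0 up\nge1 up\nge2 down\n"

    def _parse(self):
        return dict(line.split() for line in self._cli_output.splitlines())

    def get_port_names(self):
        return sorted(self._parse())

    def get_link_state(self, port):
        return self._parse()[port]


def report(switches):
    """The calling script never needs to know which vendor it is talking to."""
    return {sw.ip_address: sw.ports_down() for sw in switches}


print(report([SnmpSwitch("192.0.2.1"), TelnetSwitch("192.0.2.2")]))
```

The addresses come from the 192.0.2.0/24 documentation range and do not refer to real ATLAS devices.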

  9. Interactive session with sw_script

         # Start the Python interpreter
         $ python2.5
         # Load the sw_script module
         >>> import sw_script
         # Create an object associated with the switch (a Cisco device in this case)
         >>> sw = sw_script.Cisco_Catalyst_6500_Switch(ip_address = "")
         # List the ports available on this device
         >>> sw.get_port_names()
         ['1/1', '1/2', '1/3', '1/4', ....
         # Get all the information available for an interface
         >>> sw.get("1/4")
         [('rx_packets', 519.0), ('rx_bytes', 127937.0), ('rx_discards', 0.0),
          ('rx_errors', 0.0), ('tx_packets', 11199.0), ('tx_bytes', 1111661.0),
          ('tx_discards', 0.0), ('tx_errors', 0.0),
          ('description', 'GigabitEthernet1/4'), ('link_state', 'up'),
          ('mac_addr', ['00:90:27:8F:94:E3'])]
         # Set the description (ifAlias) of an interface
         >>> sw.set_interface_alias("1/4", "Uplink to Core Router")
         # Show the serial number of this device
         >>> print sw.get_serial_number()
         FOC0913U075

     sw_script is responsible for more than half of our network management toolbox • Features • Supports devices from different vendors • Network topology auto-discovery • Can do traffic monitoring in real-time • Works as a module, can be easily embedded into other apps

  10. Network configuration (2) • In ATLAS, we have programs which use sw_script to perform configuration changes on devices: • defining VLANs • enabling protocols: spanning tree, time synchronization, etc. • setting interface aliases (descriptions) • We use Python scripts to perform unattended firmware upgrades • For keeping track of configuration files we plan to use ZipTie (open-source software)

  11. Basic monitoring • Spectrum from Computer Associates → software for device health and traffic monitoring (used by the CERN IT department) • Monitors devices, raises alarms in case of failures • Auto-discovery for almost all network connections • Historical info – Gathers statistics from all devices • Throughput and error rates saved every 30 seconds • Limitations • The Spectrum GUI is hard to use • It is not easy to integrate with 3rd party apps • Limited support for network performance monitoring • Basic support for querying historical traffic data • No support for device configuration • Virtually no features for diagnostics [Figure: Spectrum GUI] • We developed software to fill in the gaps
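
A throughput value for a 30-second interval is typically derived from two readings of a cumulative byte counter, such as the `rx_bytes` value returned by sw_script. The slides do not say how Spectrum computes its rates, so the following is only a sketch of the usual arithmetic; the function name and the wrap-around handling are assumptions.

```python
# Sketch (Python 3): throughput from two samples of a cumulative byte counter.
# Assumption: the counter is an unsigned integer that wraps at 2**counter_bits
# (SNMP offers both 32-bit and 64-bit octet counters).

def rate_bits_per_second(prev_bytes, curr_bytes, interval_s=30.0, counter_bits=64):
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    return 8.0 * delta / interval_s


# 3.75 GB transferred in 30 s corresponds to a fully loaded Gigabit link.
print(rate_bits_per_second(0, 3750000000))
# A 32-bit counter that wrapped: 1000 bytes before the wrap plus 500 after it.
print(rate_bits_per_second(2 ** 32 - 1000, 500, counter_bits=32))
```

With a 32-bit counter and a 30-second interval, a saturated Gigabit link can wrap more than once between samples, which this arithmetic cannot detect; that is one reason to prefer 64-bit counters where the device offers them.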

  12. Navigating in the realm of monitoring data • Spectrum produces 3 plots for each network interface. With 5000 ports, that means 15000 plots to look at… • We developed tools to browse, query and analyze the traffic plots.

  13. Network browser

  14. Searching and aggregating plots

  15. Scanning for traffic events

  16. Integration with ATLAS software • Network Panel • Shows network monitoring information relevant to an ATLAS data acquisition run • Alarm Watcher • Forwards alarms from Spectrum into the ATLAS “official” messaging channels • IS Feeder • Publishes network statistics to the Information Services, a monitoring sub-system in ATLAS [Figure: the Network Panel]

  17. Network visualization – 2D approach • Application which shows a topological map of the network • Colors the connections in real-time according to their state and usage • The overloaded links are detected easily • Good navigation features (zoom, pan) • Based on GUESS, a Java application for visualizing graphs • We developed a network monitoring plug-in for GUESS
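
Coloring a connection by state and usage reduces to a small mapping function. The sketch below is not the GUESS plug-in code: the color names, the 50% and 90% thresholds and the default Gigabit capacity are assumptions chosen for illustration.

```python
# Sketch (Python 3): map link state and utilization to a display color.
# Thresholds and color names are hypothetical.

def link_color(link_state, rate_bps, capacity_bps=1e9):
    if link_state != "up":
        return "gray"      # link down: no traffic to show
    utilization = rate_bps / capacity_bps
    if utilization >= 0.9:
        return "red"       # overloaded link, should stand out on the map
    if utilization >= 0.5:
        return "orange"
    return "green"


for state, rate in [("down", 0.0), ("up", 1.0e8), ("up", 6.0e8), ("up", 9.5e8)]:
    print(state, rate, link_color(state, rate))
```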

  18. Network visualization – 3D approach (1) • Each object contains a panel with traffic information (updated in real-time) • Containers (racks, rooms) show aggregate values • Technologies used: X3D, Java and the Octaga Player • 3D model of the network • Racks, switches and computers → furniture in the 3D space • Navigation similar to Google Earth
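
The aggregate values shown by containers can be computed with a recursive sum over the room, rack and device hierarchy. The nested-dictionary layout and field names below are assumptions for illustration; the talk does not describe the data model behind the 3D scene.

```python
# Sketch (Python 3): aggregate traffic over a room -> rack -> device hierarchy.
# Assumption: each node is a dict with an optional "rate_bps" and optional "children".

def aggregate(node):
    """Store and return the node's own rate plus the rates of all its descendants."""
    total = node.get("rate_bps", 0)
    for child in node.get("children", []):
        total += aggregate(child)
    node["aggregate_bps"] = total
    return total


room = {
    "name": "room-1",
    "children": [
        {"name": "rack-01", "children": [
            {"name": "pc-01", "rate_bps": 100000000},
            {"name": "pc-02", "rate_bps": 200000000},
            {"name": "sw-01", "rate_bps": 300000000},
        ]},
        {"name": "rack-02", "children": [
            {"name": "pc-03", "rate_bps": 400000000},
        ]},
    ],
}
aggregate(room)
for rack in room["children"]:
    print(rack["name"], rack["aggregate_bps"])
print(room["name"], room["aggregate_bps"])
```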

  19. Network visualization – 3D approach (2)

  20. Real-time traffic monitoring • Real-time global top (most active connections) • Connections for one switch (with traffic values) • The ATLAS applications currently running in the network
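
A "global top" view is a top-N selection over the current per-connection rates. The sketch below uses `heapq.nlargest` from the Python standard library; the input format (a dictionary keyed by switch and port) is an assumption.

```python
# Sketch (Python 3): the N most active connections from a snapshot of current rates.
import heapq


def top_connections(rates, n=3):
    """rates maps (switch, port) to bit/s; returns the n busiest entries, busiest first."""
    return heapq.nlargest(n, rates.items(), key=lambda item: item[1])


rates = {
    ("sw-a", "1/1"): 900000000,
    ("sw-a", "1/2"): 12000000,
    ("sw-b", "2/1"): 450000000,
    ("sw-c", "1/4"): 730000000,
}
for (switch, port), rate in top_connections(rates, n=2):
    print(switch, port, rate)
```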

  21. Diagnostics • For immediate response, we look in Spectrum and in the sw_script web pages • Human inspection of traffic plots (aggregates) – we search for abnormal patterns and correlations between plots • We have a collection of scripts to test different things • Checking that machines are configured properly and connections are ok • For bandwidth-related issues we use iperf • All the network operations are documented in a knowledge base (wiki)

  22. Plans for the future • Better visualization techniques for traffic plots • Analysis tools for monitoring data. Pattern detection and recognition (periodic events, monotonic variations, etc.) • Add support for sFlow, the standard for statistical sampling – very useful to diagnose network congestion • Design and implement an expert system which will help us troubleshoot network issues
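
As an illustration of what detecting "periodic events" and "monotonic variations" could mean for a series of traffic samples, here is a minimal sketch. It is not the planned analysis tool: the function names and the autocorrelation-based approach are assumptions, and a real tool would need noise handling that is omitted here.

```python
# Sketch (Python 3): two simple pattern detectors for a list of traffic samples.

def is_monotonic_increase(samples, min_step=0.0):
    """True if every sample exceeds the previous one by more than min_step."""
    return all(b - a > min_step for a, b in zip(samples, samples[1:]))


def dominant_period(samples, max_lag=None):
    """Return the lag (in samples) with the highest positive autocorrelation, or None."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return None  # constant series: no periodic structure
    max_lag = max_lag or n // 2
    best_lag, best_score = None, 0.0
    for lag in range(1, max_lag + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


burst_every_fourth_sample = [0, 0, 0, 10] * 6
print(dominant_period(burst_every_fourth_sample))
print(is_monotonic_increase([1, 2, 3, 5]))
```

With samples taken every 30 seconds, a period of 4 samples would correspond to an event repeating every 2 minutes.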

  23. The big picture [Diagram] • Browse, search and aggregate • 2D and 3D network visualization • Dynamic web pages • Historical traffic data • Real-time traffic info • Spectrum • sw_script & co. • Device health monitoring • ATLAS software – network status and alarms • Equipment auto-discovery, inventory and registration • Equipment configuration • Legend: commercial package; in-house development