1 / 30

DoE Enterprise Performance Monitoring Status Report

This report discusses the implementation of network monitoring tools to troubleshoot and analyze network and data issues within the DOE community. It includes information on the deployment of perfSONAR in the WAN and the need for LAN information for a complete end-to-end solution. The report also outlines deliverables for Year 1 and Year 2, and potential deliverables for Year 3.

sherronj
Download Presentation

DoE Enterprise Performance Monitoring Status Report

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DoE Enterprise Performance MonitoringStatus Report • Dantong Yu (BNL), Yee-Ting Li (SLAC), Susan Hicks (ORNL) • ASCR Meeting - March 2012

  2. Goal • Promote the use of network monitoring to aid troubleshooting of network and data issues • perfSONAR deployed in WAN, now also need information about the LAN for complete end-to-end solution • Make available such data in an effective, secure and applicable means to the DOE community

  3. Participants • BNL • ORNL • SLAC • Fermilab • ESnet

  4. Year 1: Deliverables • Complete • full perimeter perfSONAR deployment at each site • passive: border snmp statistics • active: owamp, bwctl, ping, traceroute to ESnetPoP and other sites • support on-demand owamp and bwctl at perimeter • coordinate data collection of WAN performance data for e-Center • draft site support document for AA for e-Center • Ongoing: • draft risk assessment for perfSONAR services • draft training document for local staff

  5. Year 2: Deliverables • Ongoing: • leverage and adapt enterprise monitoring tools • configure internal network monitoring & topology service • provide snmp data for internal network devices • implement persistent full mesh active measurements between sites • document site security policies for topology and performance data • finalize user support document for local admins • help develop and feedback into e-Center tools and presentation • implement site local AA with e-Center

  6. (Potential) Deliverables • Year 3 (unfunded): • GridFTP MA • build interactive and integrated web GUI(s) to allow users to navigate, retrieve, visualize, manage and control data presentation • visualization tools to run large amounts of monitoring data into easy-to-understand forms for human visual analytics • turn data into knowledge and intelligence • conduct performance and behavior analysis of network traffic and build profiles of each application and site performance • identify bottlenecks in site-to-site performance that may lead to system-wide failures • traffic analysis to identify highly-utilized network resource and nodes, provide recommendations for improvement

  7. Tying it all together... (1) SLAC BNL ATLAS scientist transfers file from BNL to SLAC - notices that recent transfers are taking longer than they did last week

  8. Tying it all together... (2) SLAC BNL Use GridFTP MA, see all transfer from BNL to SLAC are slow

  9. Tying it all together... (3) ESnet Core ESnet West Coast PoP ESnet East Coast PoP SLAC BNL SRM dCache Door Use Traceroute MA service to determine end-to-end path

  10. Tying it all together... (4) ESnet Core ESnet West Coast PoP ESnet East Coast PoP SLAC BNL GridFTP GridFTP Use SNMP-MA to determine site perimeters looks okay

  11. Tying it all together... (5) ESnet Core ESnet West Coast PoP ESnet East Coast PoP SLAC BNL GridFTP GridFTP Use SNMP-MA to ensure ESnet POPs and cores looks okay

  12. Tying it all together... (6) ESnet West Coast PoP SLAC core GridFTP farmcore border farm02 Use SLAC’s Topology Service to determine Internal network path

  13. Tying it all together... (7) ESnet West Coast PoP SLAC core GridFTP farmcore border farm02 Use SNMP-MA to determined that SLAC farm02 is overloaded

  14. Internal Deployment Challenges • (Cyber) security • identification of classes of data and who should be allowed to view it • different sites use different internal tools to collect performance and topology information • integrate with site’s AA policies and procedures

  15. (Data) Security Policies • Group data policies according to • Internal Information (LAN) • External Information (WAN) • And whether the data should be available to • “any” - user (ie anonymous) • “trusted” - someone with an affiliation with site • “peer” - institution (eg. other lab) • “local” - site local user account only • https://docs.google.com/document/d/18ExHiXjcw9708emE1zh8XGN5NHr1E22zDcK2SWiZQUw/edit?hl=en_US

  16. perfSONAR - Data Availability • I - Internal data • E - External data • A - “all” users • T - "trusted" users • P - “peer” institution • L - “local” users • All external data available to “any” • Except ORNL - SNMP only to “trusted” users and Topology to “peer”Only FNAL allowing internal data to “any” • other sites: only to “local” users

  17. ECenter - Data Availability • The use of E-center would not change the availability policies of both internal and external network data • I - Internal data • E - External data • A - “all” users • T - "trusted" usersP - “peer” institutionL - “local” users

  18. ECenter - Authentication/Authorization • S - supported • R - required • D - desired • No standard AA mechanism available across Labs • only ACLs • Moot as only “local” users would be allowed access

  19. Security Policies • BNL, SLAC and ORNL • cybersecurity: do not share internal data with others - only to “local”users • BNL will provide summary data (netflow, statistics, but not raw)

  20. Enterprise Monitoring • Integrated, consistent and easy to use UI • Dashboards • Graphs • Reports • (Automated) Alerts and Notifications • Emails, paging etc • Typically focused on application/service performance • With perfSONAR, this becomes a presentation and transformation layer service • Data Access Layer will use perfSONAR client API • GUI(s), alerts

  21. Enterprise Monitoring • Network Centric • MRTG • Cacti • NeDi • Netflow • collectd • Host Centric • Nagios • Groundworks • OpenView • Spectrum • Tivoli

  22. GUI’s • build interactive and integrated web GUI(s) to allow users to navigate, retrieve, visualize, manage and control data presentation • visualisation tools to run large amounts of monitoring data into easy-to-understand forms for human visual analytics • Provide useful dashboard of perfSONAR data for US-ATLAS community (our biggest user) • Interactive perfSONAR data presentation

  23. USATLAS Monitoring

  24. IEPM-Sonar

  25. Provide Support Tools • Problem: Current SNMP MA requires RRDs to be local on perfSONAR machine • Most sites already have RRDs of performance data for routers/switches • via MRTG/Cacti/In-house etc. • Solution: Provide socket support to read RRDs via RRDSRV • enables SNMP MA to remote access RRD data • Patch for existing SNMP MA code • Status: Complete; to be integrated into pS-Toolkit

  26. Configure xinetd FILE: /etc/xinetd.d/rrdsrvservice rrdsrv{ disable = no socket_type = stream wait = no user = netmon server = /usr/local/bin/rrdtool server_args = - /srv/snmp/rrd only_from = 172.18.192.0/24 log_on_failure += USERID } FILE: /etc/services rrdsrv 390/tcp # RRD server

  27. Modify SNMP Store XML <nmwg:data xmlns:nmwg="http://ggf.org/ns/nmwg/base/2.0/" id="data1" metadataIdRef="rtr-border1.slac.stanford.edu:in"> <nmwg:key id="key:rtr-border1.slac.stanford.edu:in"> <nmwg:parameters id="param:rtr-border1.slac.stanford.edu:in"> <nmwg:parameter name="type"> rrdsrv </nmwg:parameter> <nmwg:parameter name="valueUnits">Bps</nmwg:parameter> <nmwg:parameter name="file"> rrd://lanmon-store01.slac.stanford.edu:391//srv/tops/rrd/all/rtr-border1.slac.stanford.edu/rtr-border1.slac.stanford.edu_rfc2863_1.rrd </nmwg:parameter> <nmwg:parameter name="dataSource">inOctets</nmwg:parameter> <nmwg:parameter name="eventType"> http://ggf.org/ns/nmwg/characteristic/utilization/2.0 </nmwg:parameter> </nmwg:parameters> </nmwg:key> </nmwg:data>

  28. GridFTP • Problem: Need visibility into a scientist's remote file transfer performance • Instrument DoE experiment transfers • Solution: Develop a File Transfer MA to provide statistical information regarding the history performance records of file transfers • Status: Beta

  29. Components

  30. Future Work • Determine best balance between security policies and desire to aid collaborative monitoring • Apply Filters that conform to site security policy for different users • Integration of Enterprise Monitoring into perfSONAR • SNMP MA • GridFTP • GUI’s • E-Center use and improvement • Custom dashboards • Documentation • Best practice papers for deployment and security analysis

More Related