Grid services monitoring EMI IPv6 testbed

Grid services monitoring EMI IPv6 testbed. Dusan Klinec Supervisor: Andrew Elwell IT-GT-SL. Outline. Brief introduction to IPv6 enabled server code Nagios dual stack service monitoring Nagios probes extension to support dual stack check EMI IPv6 Testbed. IPv6 enabled server.


Presentation Transcript


  1. Grid services monitoring EMI IPv6 testbed Dusan Klinec Supervisor: Andrew Elwell IT-GT-SL

  2. Outline
     • Brief introduction to IPv6 enabled server code
     • Nagios dual stack service monitoring
     • Nagios probes extension to support dual stack check
     • EMI IPv6 Testbed

  3. IPv6 enabled server

  4. Dual stack: server binds both the 0.0.0.0 and :: addresses (05/09/2012)

  5. socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 4
     setsockopt(4, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
     setsockopt(4, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
     setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
     fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
     fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
     setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
     bind(4, {sa_family=AF_INET6, sin6_port=htons(12366), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
     listen(4, 100) = 0
     socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
     setsockopt(5, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
     setsockopt(5, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
     setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
     fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
     fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
     bind(5, {sa_family=AF_INET, sin_port=htons(12366), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
     listen(5, 100) = 0
     select(6, [4 5], [4 5], [4 5], {30, 0}) = 0 (Timeout)
     select(6, [4 5], [4 5], [4 5], {30, 0}) = 1 (in [5], left {12, 929341})
     fcntl(5, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
     fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
     accept(5, {sa_family=AF_INET, sin_port=htons(34736), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
     setsockopt(6, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
     setsockopt(6, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
     setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
     recvfrom(6, "POST / HTTP/1.1\r\nHost: 127.0.0.1"..., 65536, 0, NULL, NULL) = 606
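
The system call trace on this slide shows one IPv6-only socket and one IPv4 socket bound to the same port and multiplexed with select(). A minimal Python sketch of that pattern follows. The function names `open_listeners` and `wait_for_client` are hypothetical, and the buffer size and TCP_NODELAY options from the trace are left out for brevity; this is an illustration, not the code of the traced server.

```python
import select
import socket


def open_listeners(port, backlog=100):
    """Open one IPv6-only and one IPv4 listening socket on the same port."""
    listeners = []
    specs = [(socket.AF_INET6, "::"), (socket.AF_INET, "0.0.0.0")]
    for family, address in specs:
        try:
            sock = socket.socket(family, socket.SOCK_STREAM)
        except OSError:
            continue  # this address family is not available on the host
        try:
            if family == socket.AF_INET6:
                # Keep the IPv6 socket from also claiming IPv4 traffic,
                # so that the separate IPv4 bind() on the same port can succeed.
                sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
            sock.bind((address, port))
            sock.listen(backlog)
            sock.setblocking(False)
        except OSError:
            sock.close()
            continue
        listeners.append(sock)
    return listeners


def wait_for_client(listeners, timeout=30.0):
    """Return (connection, peer_address) from the first ready listener, or None."""
    readable, _, _ = select.select(listeners, [], [], timeout)
    if not readable:
        return None
    return readable[0].accept()
```

A caller would typically loop over `wait_for_client(listeners)` and treat a `None` result as the 30 second timeout visible in the trace.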

  6. Dual stack: server binds only the :: address and serves IPv4 clients through IPv4-mapped addresses (RFC 4038)
     ...but IPv4-mapped addresses are not supported by OpenBSD or Windows {2000, XP, 2003}
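
The single-socket variant can be sketched as follows, assuming a platform that supports IPv4-mapped addresses. `open_dual_stack_listener` and `describe_peer` are hypothetical helper names; the second one shows how a server might recover the real IPv4 address from a mapped peer such as `::ffff:128.142.18.55`.

```python
import ipaddress
import socket


def open_dual_stack_listener(port, backlog=100):
    """Listen on :: and accept IPv4 clients through IPv4-mapped addresses."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    # 0 switches IPv6-only mode off; platforms without mapped-address
    # support may reject this option.
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    sock.bind(("::", port))
    sock.listen(backlog)
    return sock


def describe_peer(address):
    """Report the protocol a client really used, unwrapping ::ffff:a.b.c.d."""
    parsed = ipaddress.ip_address(address)
    if parsed.version == 6 and parsed.ipv4_mapped is not None:
        return ("IPv4", str(parsed.ipv4_mapped))
    return ("IPv%d" % parsed.version, str(parsed))
```

Python 3.8 and later also offer `socket.create_server(..., dualstack_ipv6=True)` together with `socket.has_dualstack_ipv6()`, which wrap the same socket option.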

  7. Dual stack: server binds only 0.0.0.0 => the service is unavailable over IPv6
     Problem if:
     • Host has an AAAA DNS record
     • Client prefers IPv6 to IPv4
     • Client does not retry over IPv4 after a failure
     OK if: clients are aware of this (MySQL) => they do not attempt an IPv6 connection
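
The third problem condition (a client that gives up after the first failed address) can be avoided on the client side by trying every address the resolver returns. The sketch below is one way to do that; `connect_with_fallback` is a hypothetical name, and the timeout value is an arbitrary assumption.

```python
import socket


def connect_with_fallback(host, port, timeout=5.0):
    """Try every address getaddrinfo() returns until one connection succeeds."""
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as error:
            # Remember the failure and move on to the next address,
            # for example from an IPv6 address to an IPv4 one.
            last_error = error
            if sock is not None:
                sock.close()
    raise last_error or OSError("no addresses returned for %r" % host)
```

The standard library function `socket.create_connection()` follows the same loop, so clients built on it already fall back to IPv4 when the IPv6 attempt is refused.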

  8. getaddrinfo()
     • IP protocol version agnostic
     • Used for DNS queries
     • Used for acquiring addresses for bind()
     • Returns a linked list
     • The sorting function used within getaddrinfo() is defined in RFC 3484
     • The order can be tweaked for a particular system by editing /etc/gai.conf (05/09/2012)
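
As a rough illustration of the bind() use case, Python exposes the same call as `socket.getaddrinfo()`. The sketch below asks for passive (wildcard) addresses of any family and prints them in the order the resolver returns them. `passive_addresses` is a hypothetical helper, port 8446 is borrowed from the SRM examples in this deck, and the output format only imitates the getaddrinfo.app tool shown later, whose source is not part of the slides.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def passive_addresses(port):
    """List wildcard addresses suitable for bind(), in resolver order."""
    results = socket.getaddrinfo(
        None, port,
        family=socket.AF_UNSPEC,      # do not restrict the IP version
        type=socket.SOCK_STREAM,
        flags=socket.AI_PASSIVE,      # ask for addresses meant for bind()
    )
    return [(FAMILY_NAMES.get(family, str(family)), sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in results]


for index, (name, address) in enumerate(passive_addresses(8446)):
    print("#%02d: %s address: %s" % (index, name, address))
```

A server that binds only the first entry of this list inherits whatever order the system policy produces, which is why the ordering matters in the examples that follow.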

  9. Real world example 1
     [root@gtv6-emi14 ~]# cat /etc/redhat-release
     Scientific Linux release 6.3 (Carbon)
     [root@gtv6-emi14 ~]# uname -a
     Linux gtv6-emi14 2.6.32-131.2.1.el6.x86_64 #1 SMP Thu Jun 2 09:49:26 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux
     [root@gtv6-emi14 ~]# ./getaddrinfo.app
     #00: IPv4 address: 0.0.0.0 (-)
     #01: IPv6 address: :: (-)
     [root@gtv6-emi14 ~]# netstat -tunap | grep srm
     tcp 0 0 0.0.0.0:8446 0.0.0.0:* LISTEN 1092/srmv2.2

  10. Consequence: client is unable to connect via IPv6
      SYN:
      14:02:57.041337 IP6 2001:1458:301:a873::100:140.53729 > 2001:1458:301:a87c::100:1cd.8446: S 117254423:117254423(0) win 5760 <mss 1440,sackOK,timestamp 2159405331 0,nop,wscale 7>
      RESET:
      14:02:57.041923 IP6 2001:1458:301:a87c::100:1cd.8446 > 2001:1458:301:a873::100:140.53729: R 0:0(0) ack 117254424 win 0

  11. Consequence detail

  12. Quick hack
      [root@gtv6-emi14 ~]# cat /etc/gai.conf
      label ::/0 0
      label 0.0.0.0/0 1
      precedence ::/0 40
      precedence 0.0.0.0/0 10
      [root@gtv6-emi14 ~]# ./getaddrinfo.app
      #00: IPv6 address: :: (-)
      #01: IPv4 address: 0.0.0.0 (-)
      [root@gtv6-emi14 ~]# netstat -tunap | grep srm
      tcp 0 0 :::8446 :::* LISTEN 4215/srmv2.2

  13. Nagios dual stack monitoring

  14. Nagios checks dual stack: before using NCG patch / after using NCG patch

  15. Nagios configuration

  16. NCG Hash.pm
      • Metric configuration change to declare dual stack check support:
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{native} = "Nagios";
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{config} = {%{$SERVICE_TEMPL->{60}}};
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{probe} = 'org.sam/SRM-probe';
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{metricset} = "org.sam.SRM";
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{dependency}->{"hr.srce.SRM2-CertLifetime"} = 1;
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{dependency}->{"hr.srce.GridProxy-Valid"} = 0;
      # line declaring that this service supports --4 and --6 switches
      $WLCG_SERVICE->{'org.sam.SRM-All'}->{flags}->{'DEFAULTDUALSTACK'} = 1;
      An AAAA DNS record is required for the host to use dual stack checks

  17. Probe extension to support dual stack tests

  18. Problem
      We want probes to support the --4 and --6 switches
      • If neither --4 nor --6 is provided: -> default behavior (let the system resolver decide)
      • If --4 is provided: -> the probe MUST use SOME IPv4 address from the DNS response to test the service
      • If --6 is provided: -> the probe MUST use SOME IPv6 address from the DNS response to test the service
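
The three rules above can be sketched with the Python standard library as shown below. This is only an illustration of the requirement, not the probe code from the talk: `build_parser`, `pick_address`, the `-H/--hostname` option and the `resolve` parameter (present so the lookup can be replaced in tests) are all assumptions.

```python
import argparse
import socket


def build_parser():
    """Command line with mutually exclusive --4 / --6 switches."""
    parser = argparse.ArgumentParser(description="dual stack probe sketch")
    parser.add_argument("-H", "--hostname", required=True)
    stack = parser.add_mutually_exclusive_group()
    stack.add_argument("--4", dest="family", action="store_const",
                       const=socket.AF_INET, help="test over IPv4 only")
    stack.add_argument("--6", dest="family", action="store_const",
                       const=socket.AF_INET6, help="test over IPv6 only")
    # No switch: AF_UNSPEC lets the system resolver decide.
    parser.set_defaults(family=socket.AF_UNSPEC)
    return parser


def pick_address(hostname, family, resolve=socket.getaddrinfo):
    """Return one address of the requested family from the DNS response."""
    try:
        results = resolve(hostname, None, family, socket.SOCK_STREAM)
    except socket.gaierror as error:
        raise SystemExit("UNKNOWN: cannot resolve %s: %s" % (hostname, error))
    # The first entry is "some" address of the requested family.
    return results[0][4][0]
```

A probe would call `pick_address(args.hostname, args.family)` once and then run its test against the returned literal address instead of the host name.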

  19. Probes
      Some probes use clients that already support testing a particular IP stack => no hack is needed to test that stack.
      hr.srce.check_nmap_tcp uses nmap (network scanner)
      [root@gtv6-emi02 ~]# nmap gtv6-emi14.cern.ch -p 8446 -v
      Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2012-09-04 14:53
      Initiating ARP Ping Scan against 128.142.136.156 [1 port] at 14:53
      Initiating SYN Stealth Scan against gtv6-emi14.cern.ch (128.142.136.156)
      Interesting ports on gtv6-emi14.cern.ch (128.142.136.156):
      PORT STATE SERVICE
      8446/tcp open unknown
      MAC Address: 00:15:5D:FF:53:79 (Microsoft)

  20. Probes
      Some probes use clients that already support testing a particular IP stack => no hack is needed to test that stack.
      hr.srce.check_nmap_tcp uses nmap (network scanner)
      [root@gtv6-emi02 ~]# nmap gtv6-emi14.cern.ch -p 8446 -v -6
      Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2012-09-04 14:55
      Machine 2001:1458:301:a87c::100:1cd is actually LISTENING on probe port 80
      Initiating Connect() Scan against gtv6-emi14.cern.ch (2001:1458:301:a87c::100:1cd) [1 port] at 14:55
      Discovered open port 8446/tcp on 2001:1458:301:a87c::100:1cd
      Interesting ports on gtv6-emi14.cern.ch (2001:1458:301:a87c::100:1cd):
      PORT STATE SERVICE
      8446/tcp open unknown

  21. Wrapping
      Wrapping does not work with many probes (BDII lookup):
      2012-09-04T12:44:35Z Querying BDII ldap://emiipv6bdiit.cern.ch:2170
      2012-09-04T12:44:35Z No information for [base: o=grid; filter: (|(&(GlueChunkKey=GlueSEUniqueID=128.142.136.156) (|(GlueSAAccessControlBaseRule=dteam) (GlueSAAccessControlBaseRule=VO:dteam))) (&(GlueChunkKey=GlueSEUniqueID=128.142.136.156) (|(GlueVOInfoAccessControlBaseRule=dteam) (GlueVOInfoAccessControlBaseRule=VO:dteam))) (&(GlueServiceUniqueID=*://128.142.136.156*) (GlueServiceVersion=2.*) (GlueServiceType=srm*))); attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']] in [ldap://emiipv6bdiit.cern.ch:2170 [128.142.140.128]].
      CRITICAL: METRIC FAILED [org.sam.SRM-GetSURLs]: CRITICAL: No information for [attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']] in [ldap://emiipv6bdiit.cern.ch:2170 [128.142.140.128]].

  22. Resolver

  23. Resolver

  24. Resolver

  25. Resolver

  26. Resolver

  27. Resolver

  28. Resolver

  29. Resolver

  30. Using resolver
      With --4 switch
      2012-09-04T11:50:55Z Querying BDII ldap://gtv6-emi03.cern.ch:2170
      2012-09-04T11:50:55Z GlueServiceEndpoint: httpg://gtv6-emi14.cern.ch:8446/srm/managerv2
      Resolving gtv6-emi14.cern.ch to 128.142.136.156
      SRM endpoint(s) to test: srm://128.142.136.156:8446/srm/managerv2?SFN=/dpm/cern.ch/home/dteam
      With --6 switch
      2012-09-04T12:06:17Z Querying BDII ldap://gtv6-emi03.cern.ch:2170
      2012-09-04T12:06:17Z GlueServiceEndpoint: httpg://gtv6-emi14.cern.ch:8446/srm/managerv2
      GlueVOInfoPath: /dpm/cern.ch/home/dteam
      Resolving gtv6-emi14.cern.ch to [2001:1458:301:a87c::100:1cd]
      SRM endpoint(s) to test: srm://[2001:1458:301:a87c::100:1cd]:8446/srm/managerv2?SFN=/dpm/cern.ch/home/dteam
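
The log above shows the host name in the endpoint being replaced by a literal address, with the IPv6 literal wrapped in square brackets so that the `:8446` port separator stays unambiguous. A small sketch of that substitution follows; `url_host` and `pin_endpoint` are hypothetical names and not part of the probe library.

```python
import ipaddress


def url_host(address):
    """Wrap IPv6 literals in brackets; leave IPv4 literals and names alone."""
    try:
        parsed = ipaddress.ip_address(address)
    except ValueError:
        return address  # not an IP literal, keep the host name as-is
    return "[%s]" % address if parsed.version == 6 else address


def pin_endpoint(endpoint, hostname, address):
    """Replace the host name in an endpoint URL with a resolved address."""
    return endpoint.replace(hostname, url_host(address))
```

For example, `pin_endpoint("httpg://gtv6-emi14.cern.ch:8446/srm/managerv2", "gtv6-emi14.cern.ch", "2001:1458:301:a87c::100:1cd")` yields an endpoint of the bracketed form seen in the `--6` log.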

  31. Gridmon python probes
      Framework for Nagios probes
      • Central metric invocation => suitable for extension
      Resolver object:
      resolver.setRecord('cern.ch', '127.0.0.1')
      resolver.resolve('cern.ch')  # will return 127.0.0.1
      resolver.unsetRecord('cern.ch')
      resolver.resolve('cern.ch')  # will return cern.ch
      Methods in base class:
      def setResolver(self, resolver):
          """Set another resolver to probe"""
      def resolveHost(self, host):
          """Resolve host with internal resolver, if none - use identity"""
      Usage in metric (base class takes care of resolving):
      endpoint2 = endpoint.replace(self.hostName, self.resolveHost(self.hostName))
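
The slide lists only the interface of the resolver and of the base class hooks. One way the described behavior could be implemented is sketched below; the class bodies, the `ProbeBase` name and the constructor are assumptions made for illustration, and only the method names and the identity fallback come from the slide.

```python
class Resolver:
    """Hypothetical stand-in for the resolver object shown on the slide."""

    def __init__(self):
        self._records = {}

    def setRecord(self, host, address):
        """Pin a host name to a fixed address."""
        self._records[host] = address

    def unsetRecord(self, host):
        """Forget a pinned address; unknown hosts are ignored."""
        self._records.pop(host, None)

    def resolve(self, host):
        """Return the pinned address, or the host itself (identity)."""
        return self._records.get(host, host)


class ProbeBase:
    """Hypothetical base class carrying the two hooks from the slide."""

    def __init__(self, hostName):
        self.hostName = hostName
        self._resolver = None

    def setResolver(self, resolver):
        """Set another resolver to probe"""
        self._resolver = resolver

    def resolveHost(self, host):
        """Resolve host with internal resolver, if none - use identity"""
        if self._resolver is None:
            return host
        return self._resolver.resolve(host)
```

With this shape, a metric that builds its endpoint through `resolveHost()` keeps working unchanged when no resolver is set, which matches the default behavior required when neither switch is given.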

  32. Gridmon perl probes
      Library for Nagios probes
      • Lack of central metric invocation, only helper classes
      Usage:
      use GridMon::DualStackUtils qw( &getResolver );
      use GridMon::DualStackResolver;
      use Socket qw( AF_INET AF_INET6 );
      my $resolver = getResolver( $plugin->opts->hostname, Socket::AF_INET );
      $ENV{DPNS_HOST} = $resolver->resolve( $plugin->opts->hostname );

  33. Want to know more? See documentation:
      https://tomtools.cern.ch/jira/browse/GTSL-32
      https://tomtools.cern.ch/jira/browse/GTSL-33

  34. EMI IPv6 testbed

  35. Testbed Site name: cert-tb6-cern VO: emiipv6

  36. Testbed

  37. Java IPv6 compliance “Using IPv6 in Java is easy; it is transparent and automatic. Unlike in many other languages, no porting is necessary. In fact, there is no need to even recompile the source files.” [ http://docs.oracle.com/javase/1.5.0/docs/guide/net/ipv6_guide/index.html ] => Java-based services should work with high probability

  38. Java based services

  39. Does it really work?

  40. Service testing
      • All daemons are running, according to the system administrator guide or service reference card.
      • No critical errors or important warnings were found in log files
      • SAM Nagios for all services -> no critical problem reported
      • Tested by user interface (UI) client applications (if applicable -> service has all needed dependencies to run)
      • no error found, everything works
      • test of high-level services (FTS, CE) documented in testing protocol ({packet trace, log files, system call trace} available)

  41. FTS test coverage

  42. CE Test coverage

  43. Testbed summary
      • All deployed services are running
      • The majority of services were tested by client programs
      • Services are being tested by Nagios probes
      • Grid testbed is usable - submit job, transmission job, dpm, voms, etc... (no problem was found)
      But it is still not perfect...

  44. IPv4 only services

  45. Strange binding svcs
      ARGUS:
      tcp 0 0 ::ffff:128.142.18.55:8150 :::* LISTEN 55256/java
      tcp 0 0 ::ffff:128.142.18.55:8152 :::* LISTEN 61197/java
      tcp 0 0 ::ffff:128.142.18.55:8154 :::* LISTEN 61102/java
      Cause: YAIM configuration uses the host FQDN to specify the socket to bind -> DNS resolves it to the public IP address
      [root@gtv6-emi13 ~]# telnet 127.0.0.1 8150
      Trying 127.0.0.1...
      telnet: connect to address 127.0.0.1: Connection refused
      [root@gtv6-emi13 ~]# telnet 128.142.18.55 8150
      Trying 128.142.18.55...
      Connected to 128.142.18.55.
      Escape character is '^]'.

  46. Tools for testing

  47. Manipulated DNS server
      Hostname: emiipv6dns.cern.ch
      Recursive DNS server + ...
      Uses the Response Policy Zone (RPZ) mechanism to answer DNS queries for foreign zones with a defined answer, i.e. the ability to tamper with DNS responses.
      [root@emiipv6dns dklinec]# cat /var/named/rpz
      $TTL 60
      @ IN SOA localhost. root.localhost. (
          100 ; serial
          10m ; refresh
          10m ; retry
          10m ; expiry
          10m) ; minimum
        IN NS localhost.
      ; just testing record
      non-existing-domain.com CNAME www.cern.ch.
      emi-ipv6-ce.cern.ch IN A 137.138.163.53
      ; hide IPv6 record
      ;emi-ipv6-ce.cern.ch IN AAAA 2001:1458:201:b30a:215:5dff:feff:449b

  48. getaddrinfo() result order
      Returns the result of getaddrinfo() suitable for binding
      Using PF_UNSPEC, AI_PASSIVE
      Usage:
      [root@gtv6-emi14 ~]# ./getaddrinfo.app
      #00: IPv6 address: :: (-)
      #01: IPv4 address: 0.0.0.0 (-)

  49. Port binding check
      Helps to reveal IPv4-only services, improperly configured services and firewall configuration problems
      ################################################################################
      #Netstat analysis host: gtv6-emi13.cern.ch
      ################################################################################
      All listening services:
      tcp 2170 (8076/slapd) |W: 0.0.0.0 1 :: 0
      tcp 8150 (55256/java) |W: 0.0.0.0 0 :: 1
      tcp 8152 (61197/java) |W: 0.0.0.0 0 :: 1
      tcp 8154 (61102/java) |W: 0.0.0.0 0 :: 1
      IPv6 Only services:
      tcp 8150 (55256/java) |W: 0.0.0.0 0 :: 1
      tcp 8152 (61197/java) |W: 0.0.0.0 0 :: 1
      tcp 8154 (61102/java) |W: 0.0.0.0 0 :: 1
      IPv4 Only services!!! :
      tcp 2170 (8076/slapd) |W: 0.0.0.0 1 :: 0
      Results:
      L4 Protocol: tcp
      ! Problem with tcp 8150 (55256/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
      ! Problem with tcp 8152 (61197/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
      ! Problem with tcp 8154 (61102/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
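
The classification step of such a check can be approximated from `netstat -tunap` output alone. The sketch below is not the tool from the slide: `classify_listeners` is a hypothetical function, it assumes the column layout of the netstat lines shown earlier in this deck, and it labels sockets by address family only (an AF_INET6 socket without IPV6_V6ONLY may still accept IPv4 clients).

```python
from collections import defaultdict


def classify_listeners(netstat_output):
    """Map (protocol, port) to 'IPv4 only', 'IPv6 only' or 'dual stack'."""
    families = defaultdict(set)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue
        protocol = fields[0].lower().rstrip("6")  # treat tcp6 like tcp
        address, _, port = fields[3].rpartition(":")
        # Any colon left in the address part means an AF_INET6 socket,
        # including IPv4-mapped forms such as ::ffff:128.142.18.55.
        family = "IPv6" if ":" in address else "IPv4"
        families[(protocol, int(port))].add(family)
    labels = {}
    for key, seen in families.items():
        labels[key] = "dual stack" if len(seen) == 2 else "%s only" % seen.pop()
    return labels
```

A complete check would additionally try to connect to each port over both protocols, as the "Closed port on ..." lines in the report suggest, since a firewall can block a port that netstat reports as listening.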

  50. Artifact collection (wrapper)
      $ ./wrapper.py --cmd './longtest.sh' --strace --tcpdump --cmdid longtest \
          --destdir /tmp/longtest/ --prefix alfa --suffix t0 --wait 10
      ## Starting TCPDump: /usr/bin/sudo /usr/sbin/tcpdump -w "/tmp/longtest//alfa-tcpdump-longtest-t0.pcap"
      ## Starting work job: ./longtest.sh
      ## Starting blocking operation: ['/usr/bin/strace', '-f', '-s', '512', '-v', '-o', '/tmp/longtest//alfa-strace-longtest-t0', '--', './longtest.sh']
      ## Thread should be stopped now: ['/usr/bin/strace', '-f', '-s', '512', '-v', '-o', '/tmp/longtest//alfa-strace-longtest-t0', '--', './longtest.sh']
      ## Work finished!
      ## Stdout+stderr (/tmp/longtest//alfa-stdout-longtest-t0):
      ================================================================================
      PING 128.142.18.54 (128.142.18.54) 56(84) bytes of data.
      64 bytes from 128.142.18.54: icmp_req=1 ttl=58 time=0.965 ms
      64 bytes from 128.142.18.54: icmp_req=2 ttl=58 time=1.24 ms
      64 bytes from 128.142.18.54: icmp_req=3 ttl=58 time=1.08 ms
      --- 128.142.18.54 ping statistics ---
      3 packets transmitted, 3 received, 0% packet loss, time 2003ms
      rtt min/avg/max/mdev = 0.965/1.097/1.243/0.117 ms
      OK
      Ending Now!
      ================================================================================
      ## Going to sleep for 10 seconds.
      ^C## Exception reported, ending waiting,
      ## Stopping tcpdumps
      ## Stopping dumper <cmdRunner(Thread-2, started 140725401147136)>
      ## Stopping tailers
      ## Thread should be stopped now: ['/usr/bin/sudo', '/usr/sbin/tcpdump', '-w', '/tmp/longtest//alfa-tcpdump-longtest-t0.pcap']
