Grid services monitoring
EMI IPv6 testbed
Dusan Klinec
Supervisor: Andrew Elwell, IT-GT-SL

Outline
• Brief introduction to IPv6 enabled server code
• Nagios dual stack service monitoring
• Nagios probes extension to support dual stack check
• EMI IPv6 Testbed
Dual stack
Server binds both 0.0.0.0 and :: addresses
05/09/2012
socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 4
setsockopt(4, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(4, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(4, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
bind(4, {sa_family=AF_INET6, sin6_port=htons(12366), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
listen(4, 100) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
setsockopt(5, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(5, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(12366), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
listen(5, 100) = 0
select(6, [4 5], [4 5], [4 5], {30, 0}) = 0 (Timeout)
select(6, [4 5], [4 5], [4 5], {30, 0}) = 1 (in [5], left {12, 929341})
fcntl(5, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
accept(5, {sa_family=AF_INET, sin_port=htons(34736), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
setsockopt(6, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(6, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
recvfrom(6, "POST / HTTP/1.1\r\nHost: 127.0.0.1"..., 65536, 0, NULL, NULL) = 606
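The system call trace above can be approximated in a few lines of Python. This is only a sketch of the two-socket pattern (one IPv6 socket with IPV6_V6ONLY set, one IPv4 socket, both watched with select()), not the traced server's code. The function names are hypothetical, SO_REUSEADDR is an addition not present in the trace, and the buffer size and TCP_NODELAY options from the trace are left out.

```python
import select
import socket


def open_dual_stack_listeners(port, backlog=100):
    """Open one listening TCP socket per address family on the given port."""
    listeners = []
    for family, wildcard in ((socket.AF_INET6, "::"), (socket.AF_INET, "0.0.0.0")):
        try:
            sock = socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        except OSError:
            continue  # address family not available on this host
        try:
            # Assumption: SO_REUSEADDR is added for convenience, it is not in the trace.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            if family == socket.AF_INET6:
                # Restrict the IPv6 socket to IPv6 so the IPv4 bind on the same port succeeds.
                sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
            sock.bind((wildcard, port))
            sock.listen(backlog)
            sock.setblocking(False)
        except OSError:
            sock.close()
            continue  # could not bind this family, keep the other one
        listeners.append(sock)
    return listeners


def accept_one(listeners, timeout=30.0):
    """Wait until any listener is readable, then accept a single connection."""
    readable, _, _ = select.select(listeners, [], [], timeout)
    if not readable:
        return None  # timeout, like the first select() in the trace
    return readable[0].accept()
```

A caller would loop over accept_one() and inspect the returned peer address to see which stack the client used.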
Dual stack
Server binds the :: address, using IPv4-mapped addresses (RFC 4038)
...but: IPv4-mapped addresses are not supported by OpenBSD or Windows {2000, XP, 2003}
Dual stack
Server binds only 0.0.0.0 => service is unavailable on the IPv6 socket
Problem if:
• Host has an AAAA DNS record
• Client prefers IPv6 to IPv4
• On failure, the client does not fall back to IPv4
OK if:
• Clients are aware of this (MySQL) => they do not attempt an IPv6 connection
Getaddrinfo()
• IP protocol version agnostic
• Used for DNS queries
• Used for acquiring addresses for bind()
• Returns a linked list
• The sorting function used within getaddrinfo() is defined in RFC 3484
• The order can be tweaked for a particular system by editing /etc/gai.conf
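Because getaddrinfo() returns every candidate address in preference order, a client can walk the list and fall back to the next entry when a connection fails. The sketch below illustrates that loop in Python; the function name connect_with_fallback is hypothetical, and the standard library's socket.create_connection() already behaves in a similar way.

```python
import socket


def connect_with_fallback(host, port, timeout=5.0):
    """Try every address getaddrinfo() returns, in order, until one connects."""
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, socket.AF_UNSPEC, socket.SOCK_STREAM
    ):
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock  # first address that accepts the connection wins
        except OSError as exc:
            last_error = exc  # remember the failure and try the next address
            if sock is not None:
                sock.close()
    raise last_error or OSError("no addresses returned for %r" % (host,))
```

With a host that has an AAAA record but an IPv4-only service, the IPv6 attempt is refused and the loop continues with the IPv4 address instead of giving up.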
Real world example 1
[root@gtv6-emi14 ~]# cat /etc/redhat-release
Scientific Linux release 6.3 (Carbon)
[root@gtv6-emi14 ~]# uname -a
Linux gtv6-emi14 2.6.32-131.2.1.el6.x86_64 #1 SMP Thu Jun 2 09:49:26 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@gtv6-emi14 ~]# ./getaddrinfo.app
#00: IPv4 address: 0.0.0.0 (-)
#01: IPv6 address: :: (-)
[root@gtv6-emi14 ~]# netstat -tunap | grep srm
tcp 0 0 0.0.0.0:8446 0.0.0.0:* LISTEN 1092/srmv2.2
Consequence
Client is unable to connect via IPv6
SYN:
14:02:57.041337 IP6 2001:1458:301:a873::100:140.53729 > 2001:1458:301:a87c::100:1cd.8446: S 117254423:117254423(0) win 5760 <mss 1440,sackOK,timestamp 2159405331 0,nop,wscale 7>
RESET:
14:02:57.041923 IP6 2001:1458:301:a87c::100:1cd.8446 > 2001:1458:301:a873::100:140.53729: R 0:0(0) ack 117254424 win 0
Quick hack
[root@gtv6-emi14 ~]# cat /etc/gai.conf
label ::/0 0
label 0.0.0.0/0 1
precedence ::/0 40
precedence 0.0.0.0/0 10
[root@gtv6-emi14 ~]# ./getaddrinfo.app
#00: IPv6 address: :: (-)
#01: IPv4 address: 0.0.0.0 (-)
[root@gtv6-emi14 ~]# netstat -tunap | grep srm
tcp 0 0 :::8446 :::* LISTEN 4215/srmv2.2
Nagios checks dual stack
[Screenshots: before using NCG patch; after using NCG patch]
Nagios configuration
NCG Hash.pm
• Metric configuration change to declare dual stack check support:
$WLCG_SERVICE->{'org.sam.SRM-All'}->{native} = "Nagios";
$WLCG_SERVICE->{'org.sam.SRM-All'}->{config} = {%{$SERVICE_TEMPL->{60}}};
$WLCG_SERVICE->{'org.sam.SRM-All'}->{probe} = 'org.sam/SRM-probe';
$WLCG_SERVICE->{'org.sam.SRM-All'}->{metricset} = "org.sam.SRM";
$WLCG_SERVICE->{'org.sam.SRM-All'}->{dependency}->{"hr.srce.SRM2-CertLifetime"} = 1;
$WLCG_SERVICE->{'org.sam.SRM-All'}->{dependency}->{"hr.srce.GridProxy-Valid"} = 0;
# line declaring that this service supports --4 and --6 switches
$WLCG_SERVICE->{'org.sam.SRM-All'}->{flags}->{'DEFAULTDUALSTACK'} = 1;
An AAAA DNS record is required for a host to use dual stack checks.
Problem
We want probes to support the --4 and --6 switches
• If no switch from {--4, --6} is provided:
  -> default behavior (let the system resolver decide)
• If --4 is provided:
  -> probe MUST use SOME IPv4 address from the DNS response for testing the service
• If --6 is provided:
  -> probe MUST use SOME IPv6 address from the DNS response for testing the service
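The three cases above map naturally onto the address family argument of getaddrinfo(). The following Python sketch shows one way a probe could parse the switches and pick an address; build_parser and resolve_for_family are hypothetical names and this is not the code of the actual probes.

```python
import argparse
import socket


def build_parser():
    """Command line with mutually exclusive --4 / --6 switches."""
    parser = argparse.ArgumentParser(description="dual stack probe sketch")
    family = parser.add_mutually_exclusive_group()
    family.add_argument("--4", dest="family", action="store_const",
                        const=socket.AF_INET, help="test over IPv4 only")
    family.add_argument("--6", dest="family", action="store_const",
                        const=socket.AF_INET6, help="test over IPv6 only")
    parser.add_argument("hostname")
    # No switch: AF_UNSPEC lets the system resolver decide.
    parser.set_defaults(family=socket.AF_UNSPEC)
    return parser


def resolve_for_family(hostname, family=socket.AF_UNSPEC):
    """Return the first address of the requested family for hostname.

    socket.gaierror propagates when the host has no address of that family,
    so a --6 check against an IPv4-only host fails instead of silently
    testing the wrong stack.
    """
    infos = socket.getaddrinfo(hostname, None, family, socket.SOCK_STREAM)
    return infos[0][4][0]
```

A probe would call resolve_for_family(args.hostname, args.family) once and use the returned literal address for the rest of the check.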
Probes
Some probes use clients that already support testing a particular IP stack => no hack is needed to test a particular stack.
hr.srce.check_nmap_tcp uses nmap (network scanner)
[root@gtv6-emi02 ~]# nmap gtv6-emi14.cern.ch -p 8446 -v
Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2012-09-04 14:53
Initiating ARP Ping Scan against 128.142.136.156 [1 port] at 14:53
Initiating SYN Stealth Scan against gtv6-emi14.cern.ch (128.142.136.156)
Interesting ports on gtv6-emi14.cern.ch (128.142.136.156):
PORT STATE SERVICE
8446/tcp open unknown
MAC Address: 00:15:5D:FF:53:79 (Microsoft)
Probes
nmap with the -6 switch:
[root@gtv6-emi02 ~]# nmap gtv6-emi14.cern.ch -p 8446 -v -6
Starting Nmap 4.11 ( http://www.insecure.org/nmap/ ) at 2012-09-04 14:55
Machine 2001:1458:301:a87c::100:1cd is actually LISTENING on probe port 80
Initiating Connect() Scan against gtv6-emi14.cern.ch (2001:1458:301:a87c::100:1cd) [1 port] at 14:55
Discovered open port 8446/tcp on 2001:1458:301:a87c::100:1cd
Interesting ports on gtv6-emi14.cern.ch (2001:1458:301:a87c::100:1cd):
PORT STATE SERVICE
8446/tcp open unknown
Wrapping
Wrapping does not work with many probes (BDII lookup):
2012-09-04T12:44:35Z Querying BDII ldap://emiipv6bdiit.cern.ch:2170
2012-09-04T12:44:35Z No information for [base: o=grid; filter:
  (|(&(GlueChunkKey=GlueSEUniqueID=128.142.136.156)
      (|(GlueSAAccessControlBaseRule=dteam) (GlueSAAccessControlBaseRule=VO:dteam)))
    (&(GlueChunkKey=GlueSEUniqueID=128.142.136.156)
      (|(GlueVOInfoAccessControlBaseRule=dteam) (GlueVOInfoAccessControlBaseRule=VO:dteam)))
    (&(GlueServiceUniqueID=*://128.142.136.156*) (GlueServiceVersion=2.*) (GlueServiceType=srm*)));
  attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']] in [ldap://emiipv6bdiit.cern.ch:2170 [128.142.140.128]].
CRITICAL: METRIC FAILED [org.sam.SRM-GetSURLs]: CRITICAL: No information for [attribute(s): ['GlueServiceEndpoint', 'GlueSAPath', 'GlueVOInfoPath']] in [ldap://emiipv6bdiit.cern.ch:2170 [128.142.140.128]].
Using resolver
With --4 switch
2012-09-04T11:50:55Z Querying BDII ldap://gtv6-emi03.cern.ch:2170
2012-09-04T11:50:55Z GlueServiceEndpoint: httpg://gtv6-emi14.cern.ch:8446/srm/managerv2
Resolving gtv6-emi14.cern.ch to 128.142.136.156
SRM endpoint(s) to test:
srm://128.142.136.156:8446/srm/managerv2?SFN=/dpm/cern.ch/home/dteam
With --6 switch
2012-09-04T12:06:17Z Querying BDII ldap://gtv6-emi03.cern.ch:2170
2012-09-04T12:06:17Z GlueServiceEndpoint: httpg://gtv6-emi14.cern.ch:8446/srm/managerv2
GlueVOInfoPath: /dpm/cern.ch/home/dteam
Resolving gtv6-emi14.cern.ch to [2001:1458:301:a87c::100:1cd]
SRM endpoint(s) to test:
srm://[2001:1458:301:a87c::100:1cd]:8446/srm/managerv2?SFN=/dpm/cern.ch/home/dteam
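The two log excerpts differ in one detail: the IPv6 literal is wrapped in square brackets before it is placed in the endpoint URL, so that the port separator stays unambiguous. A minimal Python sketch of that substitution follows; format_host_for_url and pin_endpoint are hypothetical helper names, not functions taken from the probes.

```python
import ipaddress


def format_host_for_url(address):
    """Wrap IPv6 literals in brackets; leave IPv4 literals and host names alone."""
    try:
        if ipaddress.ip_address(address).version == 6:
            return "[" + address + "]"
    except ValueError:
        pass  # not an IP literal (for example a host name): use it unchanged
    return address


def pin_endpoint(endpoint, hostname, address):
    """Replace the first occurrence of hostname in endpoint with the resolved address."""
    return endpoint.replace(hostname, format_host_for_url(address), 1)
```

Replacing only the first occurrence keeps path components such as /dpm/cern.ch/home/dteam untouched even when they contain part of the host name.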
Gridmon python probes
Framework for Nagios probes
• Central metric invocation => suitable for extension
Resolver object:
resolver.setRecord('cern.ch', '127.0.0.1')
resolver.resolve('cern.ch') # will return 127.0.0.1
resolver.unsetRecord('cern.ch')
resolver.resolve('cern.ch') # will return cern.ch
Methods in base class:
def setResolver(self, resolver):
    """Set another resolver to probe"""
def resolveHost(self, host):
    """Resolve host with internal resolver, if none - use identity"""
Usage in metric (base class takes care of resolving):
endpoint2 = endpoint.replace(self.hostName, self.resolveHost(self.hostName))
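To make the behaviour described above concrete, here is a minimal stand-in for the resolver and the base class hooks. Only the method names and the documented behaviour (identity when no record or no resolver is set) come from the slide; the class bodies and the class name ProbeBase are assumptions, and the real Gridmon implementation may differ.

```python
class Resolver:
    """Hypothetical in-memory resolver mirroring the calls shown on the slide."""

    def __init__(self):
        self._records = {}

    def setRecord(self, host, address):
        self._records[host] = address

    def unsetRecord(self, host):
        self._records.pop(host, None)

    def resolve(self, host):
        # Identity fallback: an unknown host is returned unchanged.
        return self._records.get(host, host)


class ProbeBase:
    """Hypothetical base class showing only the two resolver hooks."""

    def __init__(self, hostName):
        self.hostName = hostName
        self._resolver = None

    def setResolver(self, resolver):
        """Set another resolver to probe"""
        self._resolver = resolver

    def resolveHost(self, host):
        """Resolve host with internal resolver, if none - use identity"""
        if self._resolver is None:
            return host
        return self._resolver.resolve(host)
```

With this shape, the framework can install a resolver pinned to an IPv4 or IPv6 address before invoking a metric, and metrics that call resolveHost() need no further changes.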
Gridmon perl probes
Library for Nagios probes
• Lack of central metric invocation, only helper classes
Usage:
use GridMon::DualStackUtils qw( &getResolver );
use GridMon::DualStackResolver;
use Socket qw( AF_INET AF_INET6 );
my $resolver = getResolver( $plugin->opts->hostname, Socket::AF_INET );
$ENV{DPNS_HOST} = $resolver->resolve( $plugin->opts->hostname );
Want to know more?
See documentation:
https://tomtools.cern.ch/jira/browse/GTSL-32
https://tomtools.cern.ch/jira/browse/GTSL-33
Testbed
Site name: cert-tb6-cern
VO: emiipv6
Java IPv6 compliance
“Using IPv6 in Java is easy; it is transparent and automatic. Unlike in many other languages, no porting is necessary. In fact, there is no need to even recompile the source files.”
[ http://docs.oracle.com/javase/1.5.0/docs/guide/net/ipv6_guide/index.html ]
=> Java-based services should work with high probability
Service testing
• All daemons are running, according to the System Administrator Guide or service reference card
• No critical errors or important warnings were found in log files
• SAM Nagios for all services -> no critical problem reported
• Tested by user interface (UI) client applications (if applicable -> service has all needed dependencies to run)
• No error found, everything works
• Tests of high-level services (FTS, CE) are documented in the testing protocol ({packet trace, log files, system call trace} available)
Testbed summary
• All deployed services are running
• The majority of services were tested by client programs
• Services are being tested by Nagios probes
• Grid testbed is usable: job submission, job transmission, DPM, VOMS, etc. (no problem was found)
But it is still not perfect...
Strange binding services
ARGUS:
tcp 0 0 ::ffff:128.142.18.55:8150 :::* LISTEN 55256/java
tcp 0 0 ::ffff:128.142.18.55:8152 :::* LISTEN 61197/java
tcp 0 0 ::ffff:128.142.18.55:8154 :::* LISTEN 61102/java
Cause: YAIM configuration uses the host FQDN to specify the socket to bind -> resolved by DNS to the public IP address
[root@gtv6-emi13 ~]# telnet 127.0.0.1 8150
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root@gtv6-emi13 ~]# telnet 128.142.18.55 8150
Trying 128.142.18.55...
Connected to 128.142.18.55.
Escape character is '^]'.
Manipulated DNS server
Hostname: emiipv6dns.cern.ch
Recursive DNS server + ...
Uses the Response Policy Zone (RPZ) mechanism to answer DNS queries for a foreign zone with a defined answer: the ability to tamper with DNS responses.
[root@emiipv6dns dklinec]# cat /var/named/rpz
$TTL 60
@ IN SOA localhost. root.localhost. (
        100 ; serial
        10m ; refresh
        10m ; retry
        10m ; expiry
        10m) ; minimum
        IN NS localhost.
; just testing record
non-existing-domain.com CNAME www.cern.ch.
emi-ipv6-ce.cern.ch IN A 137.138.163.53
; hide IPv6 record
;emi-ipv6-ce.cern.ch IN AAAA 2001:1458:201:b30a:215:5dff:feff:449b
Getaddrinfo() result order
Returns the result of getaddrinfo() suitable for binding
Using PF_UNSPEC, AI_PASSIVE
Usage:
[root@gtv6-emi14 ~]# ./getaddrinfo.app
#00: IPv6 address: :: (-)
#01: IPv4 address: 0.0.0.0 (-)
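The same query can be reproduced from Python, which calls the system getaddrinfo() underneath. This is a sketch of an equivalent check, not the source of getaddrinfo.app; the function name passive_addresses is hypothetical and port 8446 is borrowed from the SRM example only as a placeholder.

```python
import socket


def passive_addresses(port):
    """List wildcard bind addresses in the order the system resolver returns them."""
    infos = socket.getaddrinfo(
        None,                 # no host name: ask for wildcard addresses
        port,
        socket.AF_UNSPEC,     # both IPv4 and IPv6
        socket.SOCK_STREAM,
        0,
        socket.AI_PASSIVE,    # results are meant for bind()
    )
    rows = []
    for family, _socktype, _proto, _canon, sockaddr in infos:
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        rows.append((label, sockaddr[0]))
    return rows


if __name__ == "__main__":
    for index, (label, address) in enumerate(passive_addresses(8446)):
        print("#%02d: %s address: %s" % (index, label, address))
```

A service that binds only the first entry inherits whatever order the system returns, which is why the order matters for the SRM example.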
Port binding check
Helps to reveal IPv4-only services, improperly configured services, and firewall configuration problems
################################################################################
#Netstat analysis host: gtv6-emi13.cern.ch
################################################################################
All listening services:
tcp 2170 (8076/slapd) |W: 0.0.0.0 1 :: 0
tcp 8150 (55256/java) |W: 0.0.0.0 0 :: 1
tcp 8152 (61197/java) |W: 0.0.0.0 0 :: 1
tcp 8154 (61102/java) |W: 0.0.0.0 0 :: 1
IPv6 Only services:
tcp 8150 (55256/java) |W: 0.0.0.0 0 :: 1
tcp 8152 (61197/java) |W: 0.0.0.0 0 :: 1
tcp 8154 (61102/java) |W: 0.0.0.0 0 :: 1
IPv4 Only services!!! :
tcp 2170 (8076/slapd) |W: 0.0.0.0 1 :: 0
Results:
L4 Protocol: tcp
! Problem with tcp 8150 (55256/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
! Problem with tcp 8152 (61197/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
! Problem with tcp 8154 (61102/java) on :: IPv4: 0 IPv6: 1; Closed port on 2001:1458:301:a868::100:2a
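The classification step of such a check can be sketched in Python as below. It assumes input lines in the netstat format shown earlier in these slides (protocol, two queue counters, local address, foreign address, state) and covers only the grouping by address family; the connectivity test behind the "Closed port" lines is not reproduced. The function name classify_listeners is hypothetical.

```python
def classify_listeners(netstat_lines):
    """Group listening TCP ports by the address families they are bound on."""
    families_by_port = {}
    for line in netstat_lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        if not fields[0].lower().startswith("tcp") or fields[5] != "LISTEN":
            continue
        # The local address looks like 0.0.0.0:8446 or :::8446; split at the last colon.
        address, _, port = fields[3].rpartition(":")
        family = "IPv6" if ":" in address else "IPv4"
        families_by_port.setdefault(int(port), set()).add(family)

    report = {"IPv4 only": [], "IPv6 only": [], "dual stack": []}
    for port, families in sorted(families_by_port.items()):
        if families == {"IPv4"}:
            report["IPv4 only"].append(port)
        elif families == {"IPv6"}:
            report["IPv6 only"].append(port)
        else:
            report["dual stack"].append(port)
    return report
```

Note that a port listed only on :: may still accept IPv4 clients through IPv4-mapped addresses when IPV6_V6ONLY is off, so an "IPv6 only" entry needs a real connection test before it is reported as a problem.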
Artifact collection (wrapper)
$ ./wrapper.py --cmd './longtest.sh' --strace --tcpdump --cmdid longtest \
    --destdir /tmp/longtest/ --prefix alfa --suffix t0 --wait 10
## Starting TCPDump: /usr/bin/sudo /usr/sbin/tcpdump -w "/tmp/longtest//alfa-tcpdump-longtest-t0.pcap"
## Starting work job: ./longtest.sh
## Starting blocking operation: ['/usr/bin/strace', '-f', '-s', '512', '-v', '-o', '/tmp/longtest//alfa-strace-longtest-t0', '--', './longtest.sh']
## Thread should be stopped now: ['/usr/bin/strace', '-f', '-s', '512', '-v', '-o', '/tmp/longtest//alfa-strace-longtest-t0', '--', './longtest.sh']
## Work finished!
## Stdout+stderr (/tmp/longtest//alfa-stdout-longtest-t0):
================================================================================
PING 128.142.18.54 (128.142.18.54) 56(84) bytes of data.
64 bytes from 128.142.18.54: icmp_req=1 ttl=58 time=0.965 ms
64 bytes from 128.142.18.54: icmp_req=2 ttl=58 time=1.24 ms
64 bytes from 128.142.18.54: icmp_req=3 ttl=58 time=1.08 ms
--- 128.142.18.54 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.965/1.097/1.243/0.117 ms
OK Ending Now!
================================================================================
## Going to sleep for 10 seconds.
^C## Exception reported, ending waiting,
## Stopping tcpdumps
## Stopping dumper <cmdRunner(Thread-2, started 140725401147136)>
## Stopping tailers
## Thread should be stopped now: ['/usr/bin/sudo', '/usr/sbin/tcpdump', '-w', '/tmp/longtest//alfa-tcpdump-longtest-t0.pcap']