
The SG Cluster with Load Balance and Fault Tolerance

The SG Cluster with Load Balance and Fault Tolerance. Shang Rong Tsai, Department of Electrical Engineering, National Cheng-Kung University, 2001 Nov. 20.


  1. The SG Cluster with Load Balance and Fault Tolerance Shang Rong Tsai Department of Electrical Engineering National Cheng-Kung University 2001 Nov. 20

  2. What is an SG Cluster • The SG cluster is a mixture of a load-balancing cluster and a high-availability cluster. It enables you to create load-balancing, fault-tolerant, high-availability clusters for most existing applications. • A typical SG cluster contains one or two load balancers and several back-end application servers. Using more than one load balancer tolerates the failure of a load balancer. • It was developed at the DSLab, EE.NCKU.

  3. Features of SG Clusters • Client Transparent • A group of back-end application servers, possibly running on different platforms, appears as a single server to the client • Scalable • System service capacity can be increased by adding new servers to the cluster • Extensible • Various read/write models • Turns existing applications into a scalable system with little or no modification • Manageable • Simple to install (single floppy) and easy to administer (web interface)

  4. Features of SG Clusters (continued) • Load Balancing • Incoming requests are routed to the least-loaded servers, based on various policies, for optimal performance • Fault Tolerant • The load balancer monitors the availability of the back-end servers and routes clients' requests only to those that are alive • More than one load balancer can be set up to avoid a single point of failure in the system • High Availability • The SG cluster can mask faults in the load balancer or the back-end servers if there is sufficient redundancy. It can also keep the service available during system upgrades • Robust against denial-of-service attacks

  5. A Typical Physical Wiring [Diagram: a primary SG load balancer (public address 140.116.72.114) and a backup SG load balancer sit between the Internet access link and two high-speed switches; the switches connect to the server pool at private addresses 192.168.1.1-192.168.1.5.]

  6. Logical View [Diagram: user requests arrive at the virtual server addresses (140.116.72.114:*, 140.116.72.115:23) on the SG load balancer (140.116.72.219), which maps each request onto one of the real servers 192.168.1.1-192.168.1.5, e.g. onto port 23 for telnet.]

  7. Network Address Translation • A technique that converts the IP address fields in an IP packet between a private IP and a public IP • The same private IP addresses can be reused in different private networks at the same time, conserving the public IP addresses needed • Private IP ranges • 10.0.0.0-10.255.255.255 • 172.16.0.0-172.31.255.255 • 192.168.0.0-192.168.255.255 • Applications that embed IP addresses in protocol contents may have problems • Private IP addresses are generally used by client-only hosts
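The address rewriting above can be sketched as a translation table keyed by a public port. This is a minimal illustration under assumed names and table layout, not the SG or any real NAT implementation:

```c
/* Sketch of NAT source-address rewriting: outbound packets from a
 * private endpoint get a public port, and replies are mapped back.
 * All names and the flat table are illustrative assumptions. */
#include <stdint.h>

#define NAT_TABLE_SIZE 1024

typedef struct {
    uint32_t priv_ip;    /* private source IP of the client */
    uint16_t priv_port;  /* private source port             */
    uint16_t pub_port;   /* public port handed out by NAT   */
    int      in_use;
} nat_entry;

static nat_entry table[NAT_TABLE_SIZE];
static uint16_t  next_port = 40000;

/* Outbound: replace (priv_ip, priv_port) with (public IP, pub_port). */
uint16_t nat_outbound(uint32_t priv_ip, uint16_t priv_port)
{
    for (int i = 0; i < NAT_TABLE_SIZE; i++)
        if (table[i].in_use && table[i].priv_ip == priv_ip &&
            table[i].priv_port == priv_port)
            return table[i].pub_port;          /* reuse existing mapping */
    for (int i = 0; i < NAT_TABLE_SIZE; i++)
        if (!table[i].in_use) {
            table[i] = (nat_entry){priv_ip, priv_port, next_port++, 1};
            return table[i].pub_port;
        }
    return 0; /* table full */
}

/* Inbound: recover the private endpoint from the public port. */
int nat_inbound(uint16_t pub_port, uint32_t *priv_ip, uint16_t *priv_port)
{
    for (int i = 0; i < NAT_TABLE_SIZE; i++)
        if (table[i].in_use && table[i].pub_port == pub_port) {
            *priv_ip   = table[i].priv_ip;
            *priv_port = table[i].priv_port;
            return 1;
        }
    return 0;
}
```

Because a private endpoint reuses its existing mapping, replies from the Internet side can always be routed back to the right client-only host.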

  8. The Operational Principle of NAT [Diagram: a client-only host with private address IPs1 sends a packet with source (IPs1, S-port1) and destination (IPd1, D-port1); the NAT device rewrites the source address to its public address IPpub before forwarding the packet to the Internet host IPd1, and rewrites the destination of the reply back to (IPs1, S-port1). The other client-only hosts IPs2 and IPs3 share the same public address in the same way.]

  9. The Overall Architecture [Diagram: on the load balancer (140.116.72.219), NATD rewrites IP packets between the public address 140.116.72.114:* and the application servers 192.168.1.1:*-192.168.1.3:*, guided by the shared Server Group Properties; bidd exchanges heartbeats with the bidds on other load balancers for SG failover; SGmon probes whether each server is alive; mrouted forwards multicast requests; SGhb, SGctrld and SGcmd communicate with the balancer over the feedback protocol.]

  10. The Major Components in SG Cluster • Bidd • Used for the election of a new primary. The bidd on the primary load balancer generates heartbeats, and the bidds on the backups monitor them. • Uses a bidding model • Each server bids with a price (a unique value); the server giving the highest price becomes the new primary • Fully symmetric: each node can have exactly the same configuration • Independent of the service being supported
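The bidding model above can be sketched as follows. The real bidd exchanges bids over the network; the data layout here is hypothetical:

```c
/* Sketch of the bidding election: every node submits a unique price,
 * and the node whose bid is highest becomes the new primary. */
typedef struct {
    int node_id;
    int bid;      /* unique price, e.g. derived from the node address */
} bid_msg;

/* Return the node_id that wins the election, or -1 if no bids. */
int elect_primary(const bid_msg *bids, int n)
{
    int winner = -1, best = -1;
    for (int i = 0; i < n; i++)
        if (bids[i].bid > best) {   /* bids are unique, so no ties */
            best   = bids[i].bid;
            winner = bids[i].node_id;
        }
    return winner;
}
```

Because the rule depends only on the bids, every node computes the same winner independently, which is what makes the scheme fully symmetric.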

  11. The Major Components in SG Cluster (continued) • Server Group Properties • This is a block of shared memory accessed by the SG processes. It contains the membership, the load-balancing policy, the properties of each server group, and statistical information about all servers • NATD • The key component of the SG load balancer; it is responsible for rewriting the IP addresses in the IP packet header based on the Server Group Properties • mrouted • The SG cluster supports not only the "select one" model of service but also the "write all" model. A write request under the "write all" model is multicast to all servers in the server group. A modified mrouted is used to support multicast service in the SG cluster

  12. The Major Components in SG Cluster (continued) • sgctrld • Sgctrld provides an interface for processes outside the load balancer to modify the "Server Group Properties". Processes use the "feedback protocol" to communicate with sgctrld and make changes to the "Server Group Properties". For example, an application server can feed its current load back to the load balancer for a specific load-balancing policy. • sgcmd • This is a client of sgctrld that provides a command-line interface to the "feedback protocol". It can be used from a shell script or by the user interactively.

  13. The Major Components in SG Cluster (continued) • sgmon • Normally, NATD can detect the failure of a server when the server does not respond to a client's request. But NATD cannot detect the failure if no requests are arriving at all. Furthermore, NATD cannot detect the recovery of a dead server, since no requests are sent to a dead server. Sgmon monitors the failure and recovery of servers by periodically sending requests to the application servers. • sghb • This component is optional. It is a small monitor process executed on the application servers. Since not all server components are network-reachable, sghb can be used to monitor those quiet servers and generate heartbeats to the SG load balancer

  14. Load Balancing • Balancing Type • Whole server • A specific service port • Balancing Policy • By round robin • By connection count • By packet traffic • By external counter • The application service program can define its own load metric and update it into this external counter • Weighted versions of the above counters
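A weighted policy of the kind listed above can be sketched as a least-loaded selection, where the "load" field holds whichever counter the policy uses (connection count, packet traffic, or an externally fed value). The field names are illustrative, not SG's:

```c
/* Sketch of weighted least-loaded server selection. */
#include <limits.h>

typedef struct {
    int  alive;    /* only alive servers are candidates          */
    long load;     /* counter chosen by the balancing policy     */
    int  weight;   /* higher weight = larger share of requests   */
} server;

/* Pick the alive server with the lowest weighted load; -1 if none. */
int pick_server(const server *s, int n)
{
    int  best = -1;
    long best_score = LONG_MAX;
    for (int i = 0; i < n; i++) {
        if (!s[i].alive || s[i].weight <= 0)
            continue;
        long score = s[i].load / s[i].weight;  /* normalize by weight */
        if (score < best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

With all weights equal this degenerates to plain least-connection (or least-traffic) selection.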

  15. Load Balancing (continued) Link creation • Load balancing is done by selecting the target server when a link is created. • A link becomes active when a response packet is seen on that link • Once a link is active, its mapping will not change until the link is closed or removed • If the target server of a link dies before the link becomes active, the load balancer remaps the link to another target server [Diagram: a request from 140.116.72.118:1029 to the virtual address 140.116.72.114:23 creates the link (140.116.72.118:1029, 140.116.72.114:23, 192.168.1.1:23) to a real server.]

  16. Load Balancing (continued) Keep Same Server • The target server for a newly created link is chosen by the balancing policy. But sometimes two different links are actually related, and all packets from a particular client should be redirected to the same target server • Examples: • Port mapper: an RPC client asks the port mapper which port a specific service is bound to, and then sends its request to that port • Squid: a Squid proxy uses ICP to query its neighbors and parent for a specific object, and uses HTTP to fetch the object from them on a cache hit. • "Keep Same Server" redirects packets to the same target server as long as any link from that client is still present in the SG internal link table.
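"Keep Same Server" amounts to an affinity lookup before the normal policy runs. The flat table below stands in for SG's internal link table; the layout is an assumption for illustration:

```c
/* Sketch of client affinity: before applying the balancing policy,
 * scan the link table for any live link from the same client and
 * reuse its target server. */
typedef struct {
    unsigned client_ip;
    int      target;    /* index of the back-end server */
    int      active;    /* link still present in the table? */
} link_entry;

/* Return the target already bound to this client, or -1 if none. */
int same_server_target(const link_entry *links, int n, unsigned client_ip)
{
    for (int i = 0; i < n; i++)
        if (links[i].active && links[i].client_ip == client_ip)
            return links[i].target;
    return -1;  /* caller falls back to the normal balancing policy */
}
```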

  17. Examples using 'Keep Same Server' [Diagram 1: the RPC server ZZ registers port 2345 with the port mapper; the RPC client asks the port mapper which port server ZZ is on, receives the reply "server ZZ is port 2345", and sends its request to port 2345. The packets to the port mapper and the packets to RPC server ZZ are related.] [Diagram 2: one squid sends the ICP request "do you have xx.html?" to another squid, receives the ICP reply "yes", and then issues "HTTP: get xx.html". The packets in ICP and the packets in HTTP are related.]

  18. Read/Write Models Supported by SG • ReadAny • For TCP/UDP read-only services • The data on every application server is identical • Unicast is used to forward requests • ReadOne/WriteAll • For UDP read/write services • The data on every application server is identical • Multicast is used to forward requests • ReadFirst/WriteAll • For UDP read/write services • The data may be partitioned across the application server cluster • Multicast is used to forward requests

  19. ReadAny [Diagram: a request to the virtual server 140.116.72.114:* is forwarded to one of the real servers 192.168.1.1:*-192.168.1.3:*.] • Operation • When a connection is created, SG selects one real server to serve the request • Benefit • TCP, UDP and ICMP are supported • No modification of the service program is required • Requirement/Limitation • The data must be fully identical on all servers • If any data modification is required, it must be handled through a central database or file server on the back end

  20. ReadOne/WriteAll [Diagram: a read request to the virtual server 140.116.72.114:* goes to one real server; a write request is multicast to group 234.116.72.114, reaching all of 192.168.1.1:*-192.168.1.3:*.] • Operation • ReadOne • If no write is in progress, read from any server • If a write is in progress, switch to read-preferred to guarantee a consistent view for clients • WriteAll • Multicast the write • Collect all replies • Any server replying with failure is turned off immediately • Servers replying success are grouped by their return value • The value that the majority of servers agree on is returned
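The WriteAll reply handling can be sketched as a majority vote over the collected replies. The struct layout is illustrative; in SG the status and return value travel in an IP option:

```c
/* Sketch of WriteAll reply aggregation: failed replies are skipped
 * (their servers are taken offline), successful replies are grouped
 * by return value, and the majority value is sent to the client.
 * Return values are assumed non-negative; -1 means no success. */
typedef struct {
    int ok;       /* status flag taken from the IP option    */
    int retval;   /* return value reported by the server     */
} reply;

int writeall_result(const reply *r, int n)
{
    int best_val = -1, best_count = 0;
    for (int i = 0; i < n; i++) {
        if (!r[i].ok)
            continue;                 /* failed server: not counted */
        int count = 0;
        for (int j = 0; j < n; j++)
            if (r[j].ok && r[j].retval == r[i].retval)
                count++;
        if (count > best_count) {
            best_count = count;
            best_val = r[i].retval;
        }
    }
    return best_val;
}
```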

  21. ReadOne/WriteAll (continued) • Benefit • Supports both read and write operations • Application service programs do not have to track the membership of the service group • Requirement/Limitation • The data must be identical on all servers • Any session key (a value that uniquely identifies an application session, e.g. the transaction id in an RPC) generated by the service program must be deterministic • The service program needs a little modification • Join the multicast group at startup • When serving a write request, the application server has to set an IP option representing the return status (SG uses this information to determine whether a write request succeeded or failed) • A new packet analyzer needs to be implemented to support protocols other than RPC-type services (such as the NFS service).

  22. ReadFirst/WriteAll [Diagram: read and write requests to the virtual server 140.116.72.114:* are multicast to group 234.116.72.114, reaching all of 192.168.1.1:*-192.168.1.3:*.] • Operation • ReadFirst • Multicast the read • Return the earliest reply to the client • WriteAll • Multicast the write • Collect all replies • Any server replying with failure is turned off immediately • Servers replying success are grouped by their return value • The reply that the majority of servers agree on is returned

  23. ReadFirst/WriteAll (continued) • Benefit • Supports both read and write operations • The data can be partitioned across the server cluster • Requirement/Limitation • The service program has to know the membership for job assignment • Any session key generated by the service program must be deterministic • The service program needs a little modification • Join the multicast group at startup • When serving a read request, an application server uses the membership information to determine which server is responsible for the request (a server that is not responsible simply drops it) • When serving a write request, an application server has to set an IP option representing the return status (a server not responsible for the request just returns ok) • A new packet analyzer needs to be implemented to support protocols other than RPC-type services.

  24. Mcast Service Support Routines These routines are used by an application service program to set the status and return value of a mcast request into the IP option, which the load balancer inspects to determine whether a reply succeeded (in the read/write models). • int sock_joingroup(int sockfd, struct in_addr groupaddr, int ttl); • joins a sockfd into the groupaddr • used after the creation of the server socket • int prepare_ipopt_mcast(u_short type, int retval); • sets the return type and return value into a global variable • used before returning from a write function of a mcast service • int sock_set_ipopt_mcast(int sockfd); • sets the IP option to the value set by prepare_ipopt_mcast • used before sending the reply • int sock_clear_ipopt_mcast(int sockfd) • clears the IP option • used after sending the reply
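The call sequence of a write handler can be made concrete with stubs. The stubs below only record the prepared status; the real routines manipulate IP options on the socket, and the constant name MCAST_SUCCESS is an assumption borrowed from the NFS example later in the deck:

```c
/* Illustrative stubs for the mcast support routines, showing the
 * prepare / set / send / clear sequence of a write handler. */
#define MCAST_SUCCESS 1

static unsigned short g_type;    /* stand-ins for the library globals */
static int g_retval;
static int g_opt_set;

int prepare_ipopt_mcast(unsigned short type, int retval)
{
    g_type = type;               /* real routine also stores globally */
    g_retval = retval;
    return 0;
}

int sock_set_ipopt_mcast(int sockfd)
{
    (void)sockfd;                /* real routine sets the IP option */
    g_opt_set = 1;
    return 0;
}

int sock_clear_ipopt_mcast(int sockfd)
{
    (void)sockfd;
    g_opt_set = 0;
    return 0;
}

/* A write handler runs: prepare, set, send the reply, clear. */
int serve_write(int sockfd, int result)
{
    prepare_ipopt_mcast(MCAST_SUCCESS, result);
    sock_set_ipopt_mcast(sockfd);
    /* ... sendto(sockfd, reply, ...) would go here ... */
    sock_clear_ipopt_mcast(sockfd);
    return 0;
}
```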

  25. Packet Analyzer API A packet analyzer is used by NATD to parse the request/reply packets of a multicast service, return their unique ids, and check whether a request is a write. A designer (for the read/write request models) should implement the following API, to be called by NATD: • int mcast_init_xxxx(void); • initialize internal data structures • int mcast_check_port_xxxx(u_short port); • return whether the service is located on a specific port • int mcast_check_request_xxxx(struct ip *pip, int *id, int *rwmode); • validate the structure of a request packet • get the unique id and read/write mode of the request • int mcast_check_reply_xxxx(struct ip *pip, int *id); • validate the structure of a reply packet • get the unique id of the reply

  26. Feedback Protocol The feedback protocol is designed for updating the group or server properties stored in the shared memory of the load balancer • UDP-based • Command message: id, handle, class, op, group, server, property, datalen, data… • Result message: id, status, datalen, data… • A library, libsgmsg.a, is available for application server developers and eases the use of the feedback protocol • An executable, sgcmd, is available for system administrators; it can be used from shell scripts, so existing applications can make use of the feedback protocol too • A web interface to the feedback protocol is also available for interactive administration
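The command message can be sketched as a packed header followed by the payload. The slide gives only the field order, so the field widths and types below are assumptions for illustration:

```c
/* Sketch of serializing the feedback-protocol command message into a
 * UDP datagram buffer.  Field widths are assumed, not SG's. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    uint32_t id;        /* echoed back in the result message */
    uint16_t handle;
    uint8_t  cls;       /* "class" field from the slide */
    uint8_t  op;
    uint8_t  group;
    uint8_t  server;
    uint16_t property;
    uint16_t datalen;
    uint8_t  data[256];
} sg_command;

/* Copy header fields and payload into buf; return the total length. */
size_t pack_command(const sg_command *c, uint8_t *buf)
{
    size_t off = 0;
    memcpy(buf + off, &c->id, 4);       off += 4;
    memcpy(buf + off, &c->handle, 2);   off += 2;
    buf[off++] = c->cls;
    buf[off++] = c->op;
    buf[off++] = c->group;
    buf[off++] = c->server;
    memcpy(buf + off, &c->property, 2); off += 2;
    memcpy(buf + off, &c->datalen, 2);  off += 2;
    memcpy(buf + off, c->data, c->datalen); off += c->datalen;
    return off;
}
```

A real implementation would also fix the byte order of the multi-byte fields before sending.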

  27. Fault Tolerance Support • Fault detection • Packet snooping • Port test • Heartbeat monitor • Multicast write result comparison • Fault recovery • The recovery happens on the real server, so the SG system can simply wait for the recovery to complete • Recovery detection • Packet snooping • Port test • Heartbeat monitor • Triggered by the server through the feedback protocol

  28. Server status transition • Alive → Pending when any of: key-port packet delta in > P with response timeout > T; sgmon port-test errors > E; heartbeat timeout > H; mcast errors > M • Pending → Alive when a key-port packet is answered, an sgmon port test succeeds, or a heartbeat is received • Pending → Dead when the counters exceed twice the thresholds: key-port packet delta in > P with timeout > 2T; sgmon port-test errors > 2E; heartbeat timeout > 2H; mcast errors > 2M • Dead → Alive by user recovery • Thresholds: P: packet delta threshold, T: response timeout threshold, E: port-test error threshold, H: heartbeat timeout threshold, M: mcast error threshold

  29. Server status transition (continued) • A server has three states: Alive, Pending, or Dead. • Various fault/recovery detection mechanisms are used in the SG system. The server status is calculated by sorting all fault/recovery events by timestamp; the latest event decides the result. • Candidates for load-balancing selection • Alive: the default candidates • Pending: candidates only when no alive server is available • Dead: candidates only when all servers are dead • Why a Pending state? • A server not responding to a client's request or the monitor's test may have crashed, or may just be busy serving others under heavy load. We put such a server into the pending state first, instead of the dead state, expecting it to come back later.
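The three-state scheme can be sketched as a threshold check over the detector counters, with the doubled thresholds separating Pending from Dead as on the previous slide. The counter names are illustrative, and the key-port packet condition is folded into the port-test counter for brevity:

```c
/* Sketch of the Alive / Pending / Dead evaluation from detector
 * counters, using the E, H, M thresholds of the transition diagram. */
typedef enum { ALIVE, PENDING, DEAD } sg_status;

typedef struct {
    int porttest_errors;    /* consecutive sgmon port-test failures */
    int heartbeat_timeout;  /* missed-heartbeat count               */
    int mcast_errors;       /* multicast write failures             */
} detectors;

sg_status server_status(const detectors *d, int E, int H, int M)
{
    if (d->porttest_errors > 2 * E || d->heartbeat_timeout > 2 * H ||
        d->mcast_errors > 2 * M)
        return DEAD;        /* twice the threshold: give up          */
    if (d->porttest_errors > E || d->heartbeat_timeout > H ||
        d->mcast_errors > M)
        return PENDING;     /* suspicious, but may just be busy      */
    return ALIVE;
}
```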

  30. Fault Recovery • The load balancer does not handle the recovery of application servers • The recovery happens on the real server; the SG load balancer can simply wait for it to complete • A recovering server should not respond to requests before the recovery is done • The detection in SG targets fail-stop faults • The group should be in read-only mode while doing a state transfer • For a read-only service, a dead server can transfer state from an alive server directly • For a readone/writeall service, the dead server should turn the server group read-only before the state transfer and turn it back to read-write mode when the transfer is complete.

  31. Denial-of-Service Attacks Attacks on Unix systems typically come in two forms: • Process saturation • Some servers have a limit on the TCP connections they can handle; they stop responding to clients when this limit is reached • Servers using fork() to handle new connections consume system resources (e.g. the process table) • Mbuf exhaustion • The mbufs related to a connection are not released while the connection stays in the FIN_WAIT_1 state • A BSD machine has only 1536 mbufs when maxusers=64 • A Linux machine does not have such a limit, but since mbufs and mbuf clusters are non-pageable, a malicious client can lock a lot of physical memory away from other users

  32. Protection Against DoS Attacks • Per-client limits • Max connections • Max connection rate • Max TCP connections in the FIN_WAIT_1 state • Any client exceeding these limits is denied new connections; the deny interval can be specified by the SG administrator • Per-server ACLs • Allow/deny clients' requests based on their IP/subnet addresses • Servers in the same group can have different ACLs, providing differentiated service to different clients • Example: reserving the best machine in a group for internal use in a computing cluster
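The per-client limits above amount to a simple admission check per client, with a deny window started when any cap is exceeded. The field names and the flat per-client record are illustrative assumptions:

```c
/* Sketch of the per-client DoS limits: a client breaking any cap is
 * denied new connections until the deny interval expires. */
typedef struct {
    int  connections;    /* current connections from this client   */
    int  conn_rate;      /* new connections per second             */
    int  fin_wait_1;     /* TCP links stuck in FIN_WAIT_1          */
    long denied_until;   /* timestamp; 0 when the client is clean  */
} client_stat;

/* Return 1 if a new connection from this client should be denied. */
int deny_client(client_stat *c, long now, int max_conn, int max_rate,
                int max_fw1, int deny_interval)
{
    if (now < c->denied_until)
        return 1;                        /* still in the deny window */
    if (c->connections > max_conn || c->conn_rate > max_rate ||
        c->fin_wait_1 > max_fw1) {
        c->denied_until = now + deny_interval;  /* start deny window */
        return 1;
    }
    return 0;
}
```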

  33. A Distributed NFS File Server Cluster Based on SG • UDP service • Based on synchronized RPC • ReadOne/WriteAll • Modifications • Derive the file handle from the pathname; this guarantees that the same handle maps to the same file on different servers • After server socket creation: • sock_joingroup(int sockfd, struct in_addr groupaddr, int ttl) • At the end of each write function: • prepare_ipopt_mcast(MCAST_SUCCESS, return_value) • Before sending the reply: • sock_set_ipopt_mcast(int sockfd) • After sending the reply: • sock_clear_ipopt_mcast(int sockfd)

  34. Performance Test [Diagram: a client at 140.116.72.128 on a 100 Mb/s LAN writes through the SG load balancer (140.116.72.114:*) to an NFS group on multicast group 234.116.72.114; the individual servers 192.168.1.2:*-192.168.1.4:* sustain 581.44K/s, 559.77K/s and 421.13K/s, while the cluster sustains 373.40K/s.] NFS write efficiency: 373.40/421.13 = 88.66%

  35. Performance Test [Diagram: ping from 140.116.72.128 on a 100 Mb/s LAN; direct echoes to 140.116.72.115:* and 192.168.1.1:* take 0.293 ms and 0.489 ms, while an echo through the SG load balancer takes 0.891 ms.] Ping echo efficiency: (0.293+0.489)/0.891 = 87%

  36. Performance Test [Diagram: FTP download by 140.116.72.128 on a 100 Mb/s LAN; direct downloads from 140.116.72.115:* and 192.168.1.1:* reach 2.67 MB/s and 4.33 MB/s, while a download through the SG load balancer reaches 2.24 MB/s.] FTP download efficiency: 2.24/2.67 = 83.89%

  37. Some Other Application Examples • Web Proxy Server Cluster • Web Server Cluster • Telnet Server Cluster • Mail Server Cluster

  38. Proxy Server Cluster [Diagram: requests to the virtual proxy server 140.116.72.114:* are distributed over the cache servers 192.168.1.1:*-192.168.1.3:*, each with its own cache.] • Configuration • Each proxy server has its own disk for its cache pool • Each proxy server sets the others as its siblings • Data Consistency • Each proxy server uses the ICP protocol to query objects in the others' cache pools and fetches an object from them if needed

  39. Web Server Cluster [Diagram: requests to the virtual web server 140.116.72.114:* are distributed over the web servers 192.168.1.1:*-192.168.1.3:*, each with its own disk; a DB server and an NFS server sit in the back end.] • Configuration • Each web server has its own disk to store static data (web pages, images) • A common DB server and NFS server in the back end store dynamic data (customer input, sessions, …) • Data Consistency • The multiple copies of the static data are maintained by administrators • There is only one copy of the dynamic data, in the central DB/NFS server, so no maintenance is required

  40. Telnet Server Cluster [Diagram: requests to the virtual telnet server 140.116.72.114:* are distributed over the telnet servers 192.168.1.1:*-192.168.1.3:*; an NIS server (accounts) and an NFS server sit in the back end.] • Configuration • An NIS server is used to store the user accounts • An NFS server is used to store the user home directories and the mail spool (/var/mail) • Data Consistency • There is only one copy of the user data/mail, so no maintenance is required

  41. Mail Server Cluster [Diagram: requests to the virtual mail server 140.116.72.114:* are distributed over the sendmail servers 192.168.1.1:*-192.168.1.3:*; an NIS server and an NFS server sit in the back end.] • Configuration • An NIS server is used to store the user accounts • An NFS server is used to store the user home directories and the mail spool (/var/mail) • The sendmail daemon on each server must accept mail addressed to the virtual mail server • The sendmail daemon on each server masquerades each outgoing mail as if it were sent from the virtual mail server

  42. Mail Server Cluster (continued) • Data Consistency • There is only one copy of the user data/mail, so no maintenance is required • Sendmail Setup • Accept mail addressed to the virtual mail server • Search sendmail.cf for a line like Fw-o /etc/mail/sendmail.cw • Add the hostname of the virtual mail server to sendmail.cw • Masquerade outgoing mail as if it were sent from the virtual mail server • Search sendmail.cf for the line containing only 'DM' • Change the line to 'DMyour.virtual.mail.server'

  43. Epilogue • A free, working clustering system; all required binaries fit on a 1.4 MB floppy (http://turtle.ee.ncku.edu.tw/sgcluster/) • Some good features • Load balancing with various policies • Fault-tolerance support for both the application servers and the load balancer • Supports the readany, readone/writeall and readfirst/writeall models • Enables quick development of applications on load-balancing and high-availability clusters • The bidding algorithm elects a primary server in a symmetric way; it is used for the fault tolerance of the load balancer • Flexible application cluster configuration • Denial-of-service protection and access control • The feedback protocol permits customized policy control and administration • The SG cluster has been used by the Tainan City Education Network Center (台南市教育網路中心) for one year to support a proxy cluster
