
Isis 2 Design Choices


ellis

Presentation Transcript


  1. Isis2 Design Choices A few puzzles to think about when considering use of Isis2 in your work

  2. A Service With Mobile Clients • Suppose that you are creating a service that will have external clients using web apps or browsers. • Your goals are to load-balance the requests over your service, but your service depends upon some form of dynamically evolving replicated state. • The questions that follow relate to how best to use Isis2 as a tool in solving this kind of problem. [Figure: clients A, B and C. Your server will run in a cloud-hosted data center. Your users are remote and mobile.]

  3. A Service with Mobile Clients • True or False: A good use of Isis2 would be for direct communication of updates between the client systems.

  4. A Service with Mobile Clients • False: Isis2 is poorly suited to P2P settings, where there can be a wide variety of communication barriers. The best use of Isis2 is internal to the data center where the server runs. Direct peer-to-peer connectivity is often difficult due to firewalls, network address translation and slow links. This is an issue even within a single household! We can count on fairly good connections back to the hosted server in the data center. [Figure: clients A, B and C.]

  5. Client to Server Connectivity • Which of these is not a good choice? • A. Connect the clients to the data center using a prebuilt web services solution, such as the RESTful service architecture. • B. Employ Visual Studio and tell it you want to create a new WCF application. Build on the automatically created WCF client and server templates. • C. Launch Isis2 in all systems, but have the client applications use the built-in “Client of a group” API in Isis2, and have the group run purely on nodes inside the data center.

  6. Client to Server Connectivity • C is a poor choice. By default, Isis2 probably won’t even start correctly in this setting. • It uses IP multicast to find peers during its startup protocol. • Using ISIS_HOSTS and ISIS_UNICAST_ONLY you can help Isis2 start in this setting, but the overheads of doing so would be pretty high compared to the WCF or RESTful approach. • The “Client” API internal to Isis2 is intended for cases where one group is using services from another group, not for mobile external users.

  7. Isis2 can help with… • Maintaining seamless connectivity, so that the mobile users never see a disconnection. • Maintaining the game state, so that every user sees a consistent, dynamically updated state even when connected to different server instances. • Real-time coordination, so that activities like multiuser battles are easier to script.

  8. Isis2 can help with… • Maintaining seamless connectivity, so that the mobile users never see a disconnection. Isis2 won’t even know about the network links, which will probably use TCP. The Cornell TCP-R technology offers unbreakable TCP links. You could deploy it side by side with Isis2 to create seamless connectivity. • Maintaining the game state, so that every user sees a consistent, dynamically updated state even when connected to different server instances. • Real-time coordination, so that activities like multiuser battles are easier to script. Isis2 is fast, but it is not a real-time technology. By synchronizing clocks (e.g. using NTP with a good-quality NTP stratum 0 time source) on your servers, you could employ Isis2 as part of a real-time system.

  9. The best option for guaranteed actions is… • Suppose a mobile user does some action and we want to guarantee that it will be performed exactly once. We’re running Isis2 within our data center on the game servers. • Isis2 can automatically handle this through a form of primary-backup coordination • Isis2 lacks a solution to this but provides tools that can be used to create a solution in any of several ways, depending on your specific goals.

  10. The best option for guaranteed actions is… • Suppose a mobile user does some action and we want to guarantee that it will be performed exactly once. We’re running Isis2 within our data center on the game servers. • Isis2 can automatically handle this through a form of primary-backup coordination. Isis2 won’t even know about the incoming requests since they will arrive as WCF or REST events, delivered as upcalls to individual group members. Also, Isis2 lacks a built-in “do this fault-tolerantly” option. • Isis2 lacks a solution to this but provides tools that can be used to create a solution in any of several ways, depending on your specific goals.

  11. Take an action fault-tolerantly • Suppose our group has members {P,Q,R} • Some request arrives at member P from a client, and we wish to perform it exactly once even if failures occur. Which option is best? • P should relay the request to the whole group, e.g. using g.OrderedSend(). If a client timeout occurs, the client can reissue the request. • We will need to use the Isis2 g.SafeSend() disk durability option to solve this problem.

  12. Take an action fault-tolerantly • Suppose our group has members {P,Q,R} • Some request arrives at member P from a client, and we wish to perform it exactly once even if failures occur. Which option is best? • P should relay the request to the whole group, e.g. using g.OrderedSend(). If a client timeout occurs, the client can reissue the request. • We will need to use the Isis2 g.SafeSend() disk durability option to solve this problem. The SafeSend() protocol in Isis2 is used when a group is employed as a “wrapper” around replicas of a database external to the group (e.g. a replicated MySQL database, or an Oracle database). For gaming applications, running over a replicated durable database would be too slow, so this is not a good design for the application we have in mind.

  13. Relaying a Request • In the previous question we decided that P should relay the request, but that if P fails, the mobile client system might reissue it. • In this situation, Isis2 would automatically sense a reissued request. Thus if P uses OrderedSend to relay client request X, but then the client asks Q to relay the same request, it will only be delivered once. • Isis2 cannot sense this form of duplication. Application code of your own would be needed to sense duplicate requests and perform them just once.

  14. Relaying a Request • In the previous question we decided that P should relay the request, but that if P fails, the mobile client system might reissue it. • In this situation, Isis2 would automatically sense a reissued request. Thus if P uses OrderedSend to relay client request X, but then the client asks Q to relay the same request, it will only be delivered once. • Isis2 cannot sense this form of duplication. Application code of your own would be needed to sense duplicate requests and perform them just once. When designing your gaming application, give each request a unique id. Then, if the group receives a duplicated request, you can just replay the same response under the assumption that the mobile application timed out for some reason and missed the original response.
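
The unique-id idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DedupExecutor`, `deliver` and `move_player` are hypothetical names, not part of Isis2, and the sketch assumes every group member sees requests in the same order (for example because they were relayed with an ordered multicast), so that each member's response cache stays identical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DedupExecutor:
    """Runs each request at most once, keyed by a client-chosen request id."""
    apply: Callable[[str], str]                      # hypothetical game action
    responses: Dict[str, str] = field(default_factory=dict)

    def deliver(self, request_id: str, payload: str) -> str:
        # A reissued request (same id) is not executed again;
        # the cached response is replayed instead.
        if request_id in self.responses:
            return self.responses[request_id]
        result = self.apply(payload)
        self.responses[request_id] = result
        return result


calls = []  # records how many times the action really ran


def move_player(payload: str) -> str:
    calls.append(payload)
    return f"done:{payload}"


member = DedupExecutor(apply=move_player)
first = member.deliver("client7-req42", "north")
retry = member.deliver("client7-req42", "north")   # client timed out and resent
```

In a real service the cache would need some form of garbage collection, for instance discarding ids once the client has acknowledged the response.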

  15. Sensing Failures • The best way to sense failures would be • Let Isis2 do this automatically. You are unlikely to do better and Isis2 will be very fast in any case. • One by one ask what failures can occur. For each case try to design a super-fast failure handling solution, which could include telling Isis2 that one of the group members has failed. • Connect your service to the Amazon EC2 fault sensing and reporting framework.

  16. Sensing Failures • The best way to sense failures would be • Let Isis2 do this automatically. You are unlikely to do better and Isis2 will be very fast in any case. Isis2 rapidly senses and resends lost messages internal to the data center, so that one case will be handled automatically. But outright failures of the group members will be sensed slowly, after 45-90s by default. • One by one ask what failures can occur. For each case try to design a super-fast failure handling solution, which could include telling Isis2 that one of the group members has failed. • Connect your service to the Amazon EC2 fault sensing and reporting framework. Surprisingly, there is no EC2 fault sensing and reporting framework. Most gaming applications end up designing a rapid sensing framework of their own.
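
One common shape for a home-grown rapid sensing framework is a heartbeat monitor with a short timeout. The sketch below is an assumption about what such a framework might look like, not something the slides prescribe: `HeartbeatMonitor` and its methods are hypothetical, time is passed in explicitly so the logic is easy to test, and the step of reporting a suspect to Isis2 is deliberately left out.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Suspects any member whose last heartbeat is older than timeout_s."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, member: str, now_s: float) -> None:
        # Record the most recent time we heard from this member.
        self.last_seen[member] = now_s

    def suspects(self, now_s: float) -> List[str]:
        # Members silent for longer than the timeout, in a stable order.
        return sorted(m for m, t in self.last_seen.items()
                      if now_s - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.heartbeat("P", 0.0)
monitor.heartbeat("Q", 0.0)
monitor.heartbeat("Q", 3.0)          # Q is still alive, P has gone quiet
suspected = monitor.suspects(now_s=3.5)
```

A timeout this short trades speed for accuracy: a member that is merely slow (for example, descheduled on a shared host) may be suspected wrongly.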

  17. Real-Time In Isis2 • Your gaming system needs a kind of real-time “pulse” that will trigger periodic actions by all the members. But you want consistency! • A. Have one leader track the time and then use g.Send() to trigger the pulse • B. Same as A but use g.RawSend() for better speed • C. Synchronize time across the whole group, and just have each group member take actions at the pre-agreed “pulse time” points

  18. Real-Time In Isis2 • Your gaming system needs a kind of real-time “pulse” that will trigger periodic actions by all the members. But you want consistency! • A. Have one leader track the time and then use g.Send() to trigger the pulse. The CAP theorem tells us that we have a tradeoff here. g.Send() is always consistent, and will normally be very fast. If consistency matters, this is probably the best way to achieve it. • B. Same as A but use g.RawSend() for better speed. RawSend() won’t necessarily reach every member. So it has steadier timing on delivery, but some members might fail to pulse (e.g. if a message is lost, RawSend() won’t try to recover it). • C. Synchronize time across the whole group, and just have each group member take actions at the pre-agreed “pulse time” points. This gives the best timing but completely lacks any kind of strong consistency. Also, keep in mind that on shared, virtualized platforms like EC2, even with NTP one may have trouble synchronizing clocks to better than 25-50ms. By renting heavy-weight EC2 instances you can reduce the risk of disruptive scheduling delays.
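
To make the “pre-agreed pulse time” option concrete, here is a minimal Python sketch of how each member could compute the next pulse boundary from its own clock. The function name, the millisecond units and the shared `epoch_ms` are illustrative assumptions. Members whose clocks differ by a few tens of milliseconds will usually agree on the boundary, but not when the skew straddles one, which is the consistency weakness the slide points out.

```python
def next_pulse_ms(now_ms: int, period_ms: int, epoch_ms: int = 0) -> int:
    """Return the first pulse boundary strictly after now_ms."""
    if period_ms <= 0:
        raise ValueError("period must be positive")
    elapsed = now_ms - epoch_ms
    # Integer arithmetic keeps every member's result bit-for-bit identical.
    return epoch_ms + (elapsed // period_ms + 1) * period_ms


# Two members whose clocks are 40 ms apart, with a 100 ms pulse period.
a = next_pulse_ms(1_000_030, 100)
b = next_pulse_ms(1_000_070, 100)

# A member whose clock has already crossed the boundary picks the next one.
c = next_pulse_ms(1_000_100, 100)
```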

  19. Duplicated Computing • Certain gaming requests require fairly heavy computing. We want to have two group members perform each such request for fault-tolerance, but how should they be picked? • Relay the request via OrderedSend, then on receipt, use the group view to select 2 members. They compute the identical answer because data is consistent and both reply. The client takes the first reply and ignores the duplicate. • Have the external client just send the same request twice. Again, the client just takes the first reply.

  20. Duplicated Computing • Certain gaming requests require fairly heavy computing. We want to have two group members perform each such request for fault-tolerance, but how should they be picked? • Relay the request via OrderedSend, then on receipt, use the group view to select 2 members. They compute the identical answer because data is consistent and both reply. The client takes the first reply and ignores the duplicate. For example, take the request-id and hash it to a number k ∈ 0…N-1. Then have group members k and k+1 (mod N) run the operation for this request. • Have the external client just send the same request twice. Again, the client just takes the first reply. This could work, but keep in mind that the two requests might end up assigned to the same group member. It is hard to completely control the EC2 load-balancer!
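
The hashing rule on this slide can be sketched as follows. `pick_two` is a hypothetical helper, and the member list stands in for the group view; the only requirement is that every member evaluates it over the same view in the same order. SHA-256 is used here as one possible stable hash, since Python's built-in `hash()` for strings is salted per process and different members would disagree on k.

```python
import hashlib
from typing import List, Tuple


def pick_two(request_id: str, members: List[str]) -> Tuple[str, str]:
    """Map a request id to members k and (k + 1) mod N of the current view."""
    n = len(members)
    if n < 2:
        raise ValueError("need at least two members")
    # Stable across processes and machines, unlike the built-in hash().
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    k = int.from_bytes(digest[:8], "big") % n
    return members[k], members[(k + 1) % n]


view = ["P", "Q", "R"]
primary, secondary = pick_two("client7-req42", view)
```

Each member runs the same function on receipt and performs the computation only if its own name is one of the two returned.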

  21. TCP-R • We mentioned the Cornell TCP-R technology. The role of TCP-R is: • To allow a group member to “take over” a TCP endpoint seamlessly, thus allowing transparent fail-over or migration of computing roles. • To enhance performance of TCP for real-time and gaming uses by changing flow-control behavior. • To allow a TCP connection to terminate at a group of endpoints, like the members of an Isis2 group. All endpoints would deliver identical data.

  22. TCP-R • We mentioned the Cornell TCP-R technology. The role of TCP-R is: • To allow a group member to “take over” a TCP endpoint seamlessly, thus allowing transparent fail-over or migration of computing roles. When used correctly, a new server (perhaps a backup) can “splice” a new TCP connection to an already-open one that connected to some prior server (perhaps a primary that crashed). TCP-R ensures that not a byte is duplicated or lost, but it does require application help: code you write to checkpoint the TCP-R state and the data sent/received on the connection. • To enhance performance of TCP for real-time and gaming uses by changing flow-control behavior. In fact there are a number of special versions of TCP for real-time settings. However, to use them on systems like EC2 you would need to run them as application-layer libraries, which is a little tricky to do. The same can be said of TCP-R: none of these options are transparent. • To allow a TCP connection to terminate at a group of endpoints, like the members of an Isis2 group. All endpoints would deliver identical data. There are also TCP fault-tolerance solutions that work this way.

  23. TCP-R in action [Figure labels: Initial Server; TCP-R black box; tcp connection; checkpoints; new tcp connection; Standby Server.] Mobile client sees no disruption at all and the spliced TCP connection looks identical to the old one. Not a byte is duplicated or lost in either direction.

  24. Isis2 plus TCP-R • When we say that the application could combine these, we mean that one could use TCP-R to talk to a server group that uses Isis2 internally to maintain replicated state • The replicated state functions as the checkpoint • However, this is still not at all transparent • You must deploy TCP-R and Isis2 • Your server must still include TCP-R state into the data replicated in the group, and must checkpoint at the proper points in time (as per the TCP-R user manual)

  25. Backup takes over • Consider a general setting in which a group replicates state such as “actions the external users have requested” or “the game state” • Now a member fails and a backup takes over • With Isis2 this is transparent and seamless • Isis2 delivers events that can trigger the take-over but the backup will still need to “figure out” what the member had done prior to failing

  26. Backup takes over • Consider a general setting in which a group replicates state such as “actions the external users have requested” or “the game state” • Now a member fails and a backup takes over • With Isis2 this is transparent and seamless • Isis2 delivers events that can trigger the take-over but the backup will still need to “figure out” what the member had done prior to failing. The new-view event tells you who failed, and you also know that any multicasts sent prior to the failure either have been delivered, or were completely erased by the crash. But the backup would often need to query the “external world” to know if actions the failed process was performing had succeeded or not, e.g. if it was updating a database or activating a piece of hardware or performing other kinds of “external” actions.

  27. Out-of-Band Tool • The Isis2 OOB (out of band file transfer) tool: • Is used to copy memory-mapped files from node to node, at locations where an Isis2 application has group members. • Is helpful when dealing with remote clients that are using web services to send data outside of the Isis2 system • Provides a way for an application to implement a control layer that oversees some other communication technology, such as with SDN

  28. Out-of-Band Tool • The Isis2 OOB (out of band file transfer) tool: • Is used to copy memory-mapped files from node to node, at locations where an Isis2 application has group members. Isis2 multicast works best for small objects, so with the OOB tools, you can move gigabyte objects as memory-mapped files. The multicasts talk about file names and sizes, but the data itself is moved externally to the group, at very high data rates using a form of nearly direct DMA transfer from source to destination(s). • Is helpful when dealing with remote clients that are using web services to send data outside of the Isis2 system. Although your application can certainly use WCF or RESTful technology to support remote mobile clients, Isis2 wouldn’t have any direct knowledge about them. The OOB technology only works between members of an Isis2 process group. • Provides a way for an application to implement a control layer that oversees some other communication technology, such as with SDN. Although it would certainly be possible to build new tools similar to the OOB tool for managing a software defined network, we haven’t tried doing that yet with Isis2.

  29. OOB for State Transfer • When using the OOB tool to accelerate a state transfer, which of the following is not true? • One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. • One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer • OOB cannot be used in this case because the process is not yet a member of the group

  30. When deleting an OOB replica… • Suppose that in group {P, Q, R, …} P initially has some large object “X” and uses OOB replication to create new replicas at Q and R. • The copy at P can be deleted in the same OOBRereplicate request that created the copies at Q and R • The copy at P should not be deleted until after the copies for Q and R have been made

  31. When deleting an OOB replica… • Suppose that in group {P, Q, R, …} P initially has some large object “X” and uses OOB replication to create new replicas at Q and R. • The copy at P can be deleted in the same OOBRereplicate request that created the copies at Q and R • The copy at P should not be deleted until after the copies for Q and R have been made

  32. OOB for State Transfer • When using the OOB tool to accelerate a state transfer, which of the following is not true? • One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. • One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer • OOB cannot be used in this case because the process is not yet a member of the group

  33. OOB for State Transfer • When using the OOB tool to accelerate a state transfer, which of the following is not true? • One option is to put the state in a mapped file, transfer it via OOB, and have the state transfer itself just point to the mapped file. • One option is to pre-transfer state, then have the state transfer include just the delta of updates that may have occurred after that pre-transfer • OOB cannot be used in this case because the process is not yet a member of the group There are several ways to work around the “must be a member” limitation. One can do the OOB transfer in some other group, created just for the purpose, or can perform the OOB ReReplicate “during” the state transfer event.
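
The “pre-transfer, then send only the delta” option can be illustrated with a small Python sketch. Nothing here uses the OOB tool itself: the snapshot stands in for the bulk state moved ahead of time, and `Update`, `delta_since` and `apply_delta` are hypothetical names. The sketch assumes updates carry sequence numbers and that the snapshot records the last sequence number it reflects.

```python
from typing import Dict, List, Tuple

Update = Tuple[int, str, int]   # (sequence number, key, new value)


def delta_since(log: List[Update], snapshot_seq: int) -> List[Update]:
    """Updates that occurred after the pre-transferred snapshot was taken."""
    return [u for u in log if u[0] > snapshot_seq]


def apply_delta(state: Dict[str, int], delta: List[Update]) -> Dict[str, int]:
    """Bring a snapshot up to date by replaying the delta in sequence order."""
    result = dict(state)
    for _seq, key, value in sorted(delta):
        result[key] = value
    return result


log = [(1, "gold", 10), (2, "hp", 90), (3, "gold", 25)]
snapshot = {"gold": 10, "hp": 90}    # pre-transferred; reflects updates up to seq 2
joined = apply_delta(snapshot, delta_since(log, 2))
```

The state transfer itself then only has to carry the short delta, while the bulk of the data has already arrived out of band.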
