Team 7: SUPER BORBAMAN, 18-749: Fault-Tolerant Distributed Systems

Team Members

Mike Seto

[email protected]

Jeremy Ng

[email protected]

Wee Ming

[email protected]

Ian Kalinowski

[email protected]

or just for ‘borbaman’

Baseline Application 1
    • Fault-tolerant, real-time multiplayer game
    • Inspired by Hudsonsoft’s Bomberman
    • Players interact with other players, and their actions affect the shared environment
    • Players plant timed bombs which can destroy walls, players and other bombs.
    • Last player standing is the winner!
    • Middleware: Orbacus (CORBA/C++) on Linux
    • Graphics: NCURSES
    • Backend: MySQL


Baseline Application 2
    • Front end: one client per player
    • Middle-tier game servers
    • Back end: database
    • A client participates in a game with other clients
    • 4 clients per game (more challenging and has real-time elements)
    • A client may only belong to one game
    • A server may support multiple games
    • Game does not start until 4 players have joined
Fault-Tolerance Goals
    • Preserves current game state under server failure
      • Coordinates of players, and player state
      • Bomb locations, timers
      • State of map
      • Score
    • Switch from failed server to another server within 1 second
    • Players who “drop” may rejoin game within 2 seconds
FT-Baseline Architecture
  • Stateless Servers
    • Two servers run on two machines with a single shared database
    • Passive replication:
      • No distinction between “primary” and “backup” servers
      • No checkpointing
    • Each server replica can receive and process client requests
    • But…clients only talk to one replica at any one time
    • The Naming Service and the database are single points of failure.
FT-Baseline Architecture
  • Guaranteeing Determinism
    • State is committed to the reliable database at every client invocation
    • State is read from the DB before processing any requests, and committed back to the DB after processing
    • Table locking per replica in the database ensures atomic access per game and guarantees determinism between our replicas
    • Transactions with non-increasing sequence numbers are discarded
  • Transaction processing
    • Database transactions are guaranteed atomic
    • Consistent state is achieved by having the servers read state from the database before beginning any transaction
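The read, process, commit cycle above can be illustrated with a small Python sketch. It is not the team's Orbacus/C++ code: `GameRecord`, `StatelessServer`, the in-memory `db` dictionary, and the per-game `threading.Lock` are hypothetical stand-ins for the MySQL tables and table locks, and the sketch assumes one sequence counter per game, which the slides do not specify.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class GameRecord:
    """Stand-in for one game's rows in the shared database (hypothetical)."""
    state: dict = field(default_factory=dict)
    last_seq: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


class StatelessServer:
    """Keeps no game state between requests; everything lives in `db`."""

    def __init__(self, db):
        self.db = db  # shared by all replicas: {game_id: GameRecord}

    def handle(self, game_id, seq, player, position):
        record = self.db[game_id]
        with record.lock:                   # per-game lock, standing in for a table lock
            if seq <= record.last_seq:      # non-increasing sequence number
                return False                # discard the duplicate or stale request
            state = dict(record.state)      # read state before processing
            state[player] = position        # apply the client's action
            record.state = state            # commit state back
            record.last_seq = seq
            return True


db = {"game-1": GameRecord()}
replica_a, replica_b = StatelessServer(db), StatelessServer(db)

accepted = replica_a.handle("game-1", 1, "p1", (2, 3))
retried = replica_b.handle("game-1", 1, "p1", (2, 3))  # same request resent after fail-over
moved = replica_b.handle("game-1", 2, "p1", (2, 4))
```

Because both replicas read from and commit to the same store, a request resent to the second replica after a fail-over is recognized by its sequence number and dropped instead of being applied twice.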
Mechanisms for Fail-Over
  • Failing-Over
      • Client detects failure by catching a CORBA COMM_FAILURE or TRANSIENT exception
      • Client queries Naming Service for list of servers
      • Client connects to first available server in order listed in Naming Service
      • If the list is empty, the client waits until a new server registers with the Naming Service
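The fail-over steps above might look roughly like the following Python sketch. Everything here is hypothetical scaffolding: `ServerDown` stands in for the CORBA exceptions the real client catches, `FakeServer` for a server reference, and `list_servers` for the Naming Service query.

```python
import time


class ServerDown(Exception):
    """Stand-in for the CORBA system exceptions the client catches."""


class FakeServer:
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive

    def move(self, direction):
        if not self.alive:
            raise ServerDown(self.name)
        return f"{self.name} handled {direction}"


class Client:
    def __init__(self, list_servers, poll_interval=0.05):
        self.list_servers = list_servers    # returns servers in naming-service order
        self.poll_interval = poll_interval
        self.server = None

    def _reconnect(self):
        while True:
            for candidate in self.list_servers():
                if candidate.alive:         # a real client would attempt a connection here
                    self.server = candidate
                    return
            time.sleep(self.poll_interval)  # nothing usable yet; wait for a registration

    def move(self, direction):
        if self.server is None:
            self._reconnect()
        while True:
            try:
                return self.server.move(direction)
            except ServerDown:
                self._reconnect()           # fail over, then retry the same request


primary, backup = FakeServer("server-1"), FakeServer("server-2")
client = Client(lambda: [primary, backup])
first = client.move("left")
primary.alive = False                       # simulate a server crash
second = client.move("left")
```

Note that every step of `_reconnect` runs only after the failure has been detected, so the lookup and connection cost is paid on the request that hits the fault.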
Fail-Over Measurements
  • Problem: failure-induced RTT is too high
    • Average Fault-Free RTT: 14.7 ms
    • Average Failure-Induced RTT: 78.8 ms
    • Maximum Failure-Induced RTT: 1045.8 ms
  • Solution: Have the client pre-resolve servers and pre-establish connections with working servers.

RT-FT-Baseline Architecture
  • What we tried:
    • Clients create a low-priority Update thread which contacts the Naming Service at a regular interval, caches references of working servers, and attempts to pre-establish connections.
    • This thread also performs maintenance on existing connections and repopulates the cache with newly launched servers
  • What we expected:



RT-FT Optimization – Part 1

Before and after multi-threaded optimization

What went wrong?

Bounded “Real-Time” Fail-Over Measurements
  • Jitter:
    • Maximum Jitter BEFORE: 36 ms
    • Maximum Jitter AFTER: 176 ms
    • We have “improved” the jitter in our system by -389% !
  • RTT:
    • Average RTT BEFORE: 13 ms
    • Average RTT AFTER: 21 ms
    • We have “improved” the average RTT by -59% !
  • Why??
    • High overhead from the Update thread
    • Queried the Naming Service every 200 µs!
      • Oops….
RT-FT Optimization – Part 2

Reduced the update period from 200 µs to 500 ms
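A periodic update thread with the period exposed as a parameter could be sketched in Python as follows. The names `ServerCache` and `list_servers` are hypothetical, the naming-service lookup is faked with a plain list, and thread priority and connection pre-establishment are not modeled.

```python
import threading


class ServerCache:
    """Caches references to working servers and refreshes them periodically."""

    def __init__(self, list_servers, period_s=0.5):
        self.list_servers = list_servers    # hypothetical naming-service lookup
        self.period_s = period_s            # 0.5 s corresponds to the 500 ms period
        self._servers = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh_once(self):
        fresh = list(self.list_servers())
        with self._lock:
            self._servers = fresh

    def servers(self):
        with self._lock:
            return list(self._servers)

    def _run(self):
        # Event.wait doubles as the sleep, so stop() takes effect promptly.
        while not self._stop.wait(self.period_s):
            self.refresh_once()

    def start(self):
        self.refresh_once()                 # populate before the first period elapses
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


registry = ["server-1"]
cache = ServerCache(lambda: registry, period_s=0.01)  # short period only for the demo
cache.start()
initial = cache.servers()
registry.append("server-2")
cache.refresh_once()    # the background thread would also pick this up within one period
refreshed = cache.servers()
cache.stop()
```

Keeping the period as a named parameter makes the trade-off visible: a shorter period notices new servers sooner but spends more time querying the Naming Service.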

RT-FT Optimization – Part 2

With faults…but why the high periodic jitter?

13 spikes above 200 ms

RT-FT Optimization – Part 2

Bug discovered and fixed from analyzing results

3 spikes above 200 ms

RT-FT Fail-Over Measurements
  • Average RTT:
    • 41 ms
  • Jitter:
    • Average Faulty Jitter: 81 ms
    • Maximum Jitter: 480 ms
  • Failover time:
    • Previous max: 210 ms
    • Current max: 230 ms

However, these numbers are not realistically useful because

- Cluster variability influences jitter considerably

- Measurements are environment dependent

% of Outliers before = 0.1286%

% of Outliers after = 0.0296%

RT-FT-Performance Strategy
    • Load balancing is performed on the client-side
    • Client randomly selects initial server to connect to
    • Upon failure, client randomly connects to another live server
    • Take advantage of multiple servers
    • Explore the effects of spreading a single game across multiple servers
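Client-side random selection, as described above, can be sketched in a few lines of Python. `pick_server` and the server names are hypothetical, and the list of live servers is assumed to come from the Naming Service.

```python
import random


def pick_server(servers, rng, exclude=None):
    """Return a random server from `servers`, skipping `exclude` if given."""
    candidates = [s for s in servers if s != exclude]
    if not candidates:
        raise LookupError("no live server available")
    return rng.choice(candidates)


rng = random.Random(7)                      # seeded only to make the example repeatable
live = ["server-1", "server-2", "server-3"]
initial = pick_server(live, rng)            # random initial server
fallback = pick_server(live, rng, exclude=initial)  # chosen after `initial` fails
```

Random choice needs no coordination between clients, but it also ignores how loaded each server currently is.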


Performance Measurements

Load Balancing Worsens RTT Performance

Performance Measurements
  • Load balancing decreased performance
    • This is counter-intuitive
    • One single-threaded server should be slower than multiple single-threaded servers
      • Load balancing should have improved RTT since multiple servers could service separate games simultaneously
  • Server code was not modified in implementing load balancing
    • Problem has to be with concurrent accesses to the database
    • This pointed us to a bug in the database table locking code
      • Transactions and table locks were out of order, causing table locks to be released prematurely
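One way to make the lock and transaction ordering hard to get wrong is to wrap both in a single helper. The Python sketch below only records the order of operations: `FakeDB` and `locked_transaction` are hypothetical, and the operation names are labels, not the statements the team's MySQL code issued.

```python
from contextlib import contextmanager


class FakeDB:
    """Records the order of operations; a real client would issue SQL instead."""

    def __init__(self):
        self.log = []

    def run(self, op):
        self.log.append(op)


@contextmanager
def locked_transaction(db):
    db.run("lock")
    try:
        db.run("begin")
        try:
            yield db
        except Exception:
            db.run("rollback")
            raise
        else:
            db.run("commit")
    finally:
        db.run("unlock")    # released only after the commit or rollback


db = FakeDB()
with locked_transaction(db):
    db.run("update game state")

failed = FakeDB()
try:
    with locked_transaction(failed):
        raise ValueError("simulated failure while processing")
except ValueError:
    pass
```

The lock is held for the whole transaction in both the success and the failure path, so no other replica can observe a half-applied update.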
Performance Measurements

Figure: average RTT (µs) with load balancing, after the DB lock bug was fixed

Performance Measurements
  • Corrected Load Balancing
    • Load balancing resulted in improved performance
      • Non-balanced average RTT: 454 ms
      • Balanced average RTT: 255 ms
Insights from Measurements
  • FT-Baseline
    • Can’t assume failover time is consistent or bounded
  • RT-FT-Optimization (update thread)
    • Reducing jitter resulted in increased average RTT
    • Scheduling the update thread too frequently results in increased jitter and overhead
  • Load Balancing
    • Load balancing can easily be done incorrectly
      • Spreading games across multiple machines does not necessarily improve performance
      • It can be difficult to select the right server to fail-over to
      • Single shared resource can be the bottleneck
Open Issues
  • Let’s discuss some issues…
    • Newer cluster software would be nice!
      • Newer MySQL with finer-grained locking, stored procedures, bug fixes, etc.
      • Newer gcc needed for certain libraries
    • Clients can’t rejoin the game if the client crashes
    • Database server is a huge scalability bottleneck
  • If only we had more time…
    • GUI using real graphics instead of ASCII art
    • Login screen and lobby for clients to interact before game starts
    • Rankings for top ‘Borbamen’ based on some scoring system
    • Multiple maps, power-ups (e.g. drop more bombs, move faster, etc.)
  • What we’ve learned…
    • Stateless servers make server recovery relatively simple
      • But this moves the performance bottleneck to the database
    • Testing to see that your system works is not good enough – looking at performance measurements can also point out implementation bugs
    • Really hard to test performance on shared servers…
    • Testing can be fun when you have a fun project
  • Accomplishments
    • Failover is really fast for non-loaded servers
    • We created a “real-time” application
    • BORBAMAN = CORBA + BOMBERMAN is really fun!
  • For the next time around…
    • Pick a different middleware that will let us run clients on machines outside of the games cluster
    • Run our own SQL server that doesn’t crash during stress testing O:-)
    • Language interoperability (e.g. Java clients with C++ servers) could be cool
      • Orbacus supposedly supports this
    • Store some state on the servers to reduce the load on the database
  • No Questions Please.
  • GG EZ$$$ kthxbye

Varying Server Load Affects RTT