Team 7: SUPER BORBAMAN

18-749: Fault-Tolerant Distributed Systems

Team Members

Mike Seto


Jeremy Ng


Wee Ming


Ian Kalinowski


or just for ‘borbaman’

Baseline Application 1


    • Fault-tolerant, real-time multiplayer game

    • Inspired by Hudsonsoft’s Bomberman

    • Players interact with other players, and their actions affect the shared environment

    • Players plant timed bombs, which can destroy walls, players, and other bombs

    • Last player standing is the winner!


    • Middleware: Orbacus (CORBA/C++) on Linux

    • Graphics: NCURSES

    • Backend: MySQL


Baseline Application 2


    • Front end: one client per player

    • Middle-tier game servers

    • Back end: database


    • A client participates in a game with other clients

    • 4 clients per game (more challenging, with real-time elements)

    • A client may only belong to one game

    • A server may support multiple games

    • Game does not start until 4 players have joined

Fault-Tolerance Goals


    • Preserve the current game state under server failure

      • Coordinates of players, and player state

      • Bomb locations, timers

      • State of map

      • Score

    • Switch from failed server to another server within 1 second

    • Players who “drop” may rejoin game within 2 seconds

FT-Baseline Architecture

  • Stateless Servers

    • Two servers run on two machines with a single shared database

    • Passive replication:

      • No distinction between “primary” and “backup” servers

      • No checkpointing

    • Each server replica can receive and process client requests

    • But…clients only talk to one replica at any one time

    • The Naming Service and the database are single points of failure

FT-Baseline Architecture

  • Guaranteeing Determinism

    • State is committed to the reliable database at every client invocation

    • State is read from the DB before processing any requests, and committed back to the DB after processing

    • Per-replica table locking in the database ensures atomic access per game and guarantees determinism between our replicas

    • Transactions with non-increasing sequence numbers are discarded

  • Transaction processing

    • Database transactions are guaranteed atomic

    • Consistent state is achieved by having the servers read state from the database before beginning any transaction

Mechanisms for Fail-Over

  • Failing-Over

    • Client detects failure by catching a COMM_FAILURE/TRANSIENT exception

    • Client queries Naming Service for list of servers

    • Client connects to first available server in order listed in Naming Service

    • If this list is null, the client waits until a new server registers with the naming service

Fail-Over Measurements

  • Problem: failure-induced round-trip times are too high

    • Average Fault-Free RTT: 14.7 ms

    • Average Failure-Induced RTT: 78.8 ms

    • Maximum Failure-Induced RTT: 1045.8 ms (too high!)

  • Solution: have clients pre-resolve server references and pre-establish connections with working servers

RT-FT-Baseline Architecture

  • What we tried:

    • Clients create a low-priority Update thread which contacts the Naming Service at a regular interval, caches references of working servers, and attempts to pre-establish connections.

    • This thread also performs maintenance on existing connections and repopulates the cache with newly launched servers

  • What we expected:



RT-FT Optimization – Part 1

Before and after multi-threaded optimization

What went wrong?

Bounded “Real-Time” Fail-Over Measurements

  • Jitter:

    • Maximum Jitter BEFORE: 36 ms

    • Maximum Jitter AFTER: 176 ms

    • We have “improved” the jitter in our system by -389% !

  • RTT:

    • Average RTT BEFORE: 13 ms

    • Average RTT AFTER: 21 ms

    • We have “improved” the average RTT by -59% !

  • Why??

    • High overhead from the Update thread

    • Queried the Naming Service every 200 µs!

      • Oops….

RT-FT Optimization – Part 2

Reduced the update period from 200 µs to 500 ms

RT-FT Optimization – Part 2

With faults…but why the high periodic jitter?

13 spikes above 200 ms

RT-FT Optimization – Part 2

Bug discovered and fixed from analyzing results

3 spikes above 200 ms

RT-FT Fail-Over Measurements

  • Average RTT:

    • 41 ms

  • Jitter:

    • Average Faulty Jitter: 81 ms

    • Maximum Jitter: 480 ms

  • Failover time:

    • Previous max: 210 ms

    • Current max: 230 ms

      However, these numbers are not realistically useful because:

      • Cluster variability influences jitter considerably

      • Measurements are environment-dependent

      % of outliers before = 0.1286%

      % of outliers after = 0.0296%

RT-FT-Performance Strategy


    • Load balancing is performed on the client-side

    • Client randomly selects initial server to connect to

    • Upon failure, client randomly connects to another alive server


    • Take advantage of multiple servers

    • Explore the effects of spreading a single game across multiple servers


Performance Measurements

Load Balancing Worsens RTT Performance

Performance Measurements

  • Load balancing decreased performance

    • This is counter-intuitive

    • One single-threaded server should be slower than multiple single-threaded servers

      • Load balancing should have improved RTT since multiple servers could service separate games simultaneously

  • Server code was not modified in implementing load balancing

    • The problem therefore had to be with concurrent accesses to the database

    • This pointed us to a bug in the database table locking code

      • Transactions and table locks were out of order, causing table locks to be released prematurely

Performance Measurements

Chart: Average RTT (µs), load balancing with the DB lock bug fixed

Performance Measurements

  • Corrected Load Balancing

    • Load balancing resulted in improved performance

      • Non-balanced average RTT: 454 ms

      • Balanced average RTT: 255 ms

Insights from Measurements

  • FT-Baseline

    • Can’t assume failover time is consistent or bounded

  • RT-FT-Optimization (update thread)

    • Reducing jitter resulted in increased average RTT

    • Scheduling the update thread too frequently results in increased jitter and overhead

  • Load Balancing

    • Load balancing can easily be done incorrectly

      • Spreading games across multiple machines does not necessarily improve performance

      • It can be difficult to select the right server to fail-over to

      • Single shared resource can be the bottleneck

Open Issues

  • Let’s discuss some issues…

    • Newer cluster software would be nice!

      • Newer MySQL with finer-grained locking, stored procedures, bug fixes, etc.

      • Newer gcc needed for certain libraries

    • Players can’t rejoin the game if their client crashes

    • Database server is a huge scalability bottleneck

  • If only we had more time…

    • GUI using real graphics instead of ASCII art

    • Login screen and lobby for clients to interact before game starts

    • Rankings for top ‘Borbamen’ based on some scoring system

    • Multiple maps, power-ups (e.g. drop more bombs, move faster, etc.)


  • What we’ve learned…

    • Stateless servers make server recovery relatively simple

      • But this moves the performance bottleneck to the database

    • Testing to see that your system works is not good enough – looking at performance measurements can also point out implementation bugs

    • Really hard to test performance on shared servers…

    • Testing can be fun when you have a fun project 

  • Accomplishments

    • Failover is really fast for non-loaded servers

    • We created a “real-time” application

    • BORBAMAN= CORBA+BOMBERMAN is really fun!


  • For the next time around…

    • Pick a different middleware that will let us run clients on machines outside of the games cluster

    • Run our own SQL server that doesn’t crash during stress testing O:-)

    • Language interoperability (e.g. Java clients with C++ servers) could be cool

      • Orbacus supposedly supports this

    • Store some state on the servers to reduce the load on the database


  • No Questions Please.

  • GG EZ$$$ kthxbye


Chart: Varying Server Load Affects RTT