
Team 7: SUPER BORBAMAN 18-749: Fault-Tolerant Distributed Systems



Presentation Transcript


Team 7: SUPER BORBAMAN

18-749: Fault-Tolerant Distributed Systems


Team Members

  • Mike Seto (mseto@)

  • Jeremy Ng (jwng@)

  • Wee Ming (wmc@)

  • Ian Kalinowski (igk@)

http://www.ece.cmu.edu/~ece749/teams/team7/

or just for ‘borbaman’


Baseline Application 1

  • SYNOPSIS:

    • Fault-tolerant, real-time multiplayer game

    • Inspired by Hudson Soft’s Bomberman

    • Players interact with other players, and their actions affect the shared environment

    • Players plant timed bombs which can destroy walls, players and other bombs.

    • Last player standing is the winner!

  • PLATFORM:

    • Middleware: Orbacus (CORBA/C++) on Linux

    • Graphics: NCURSES

    • Backend: MySQL

Baseline Application 2

  • COMPONENTS:

    • Front end: one client per player

    • Middle-tier game servers

    • Back end: database

  • STRUCTURE:

    • A client participates in a game with other clients

    • 4 clients per game (more challenging and has real-time elements)

    • A client may only belong to one game

    • A server may support multiple games

    • Game does not start until 4 players have joined


Baseline Architecture


Fail-Over Measurements – Fault-free


Fault-Tolerance Goals

  • ORIGINAL FAULT-TOLERANCE GOALS:

    • Preserve current game state under server failure

      • Coordinates of players, and player state

      • Bomb locations, timers

      • State of map

      • Score

    • Switch from failed server to another server within 1 second

    • Players who “drop” may rejoin game within 2 seconds


FT-Baseline Architecture

  • Stateless Servers

    • Two servers run on two machines with a single shared database

    • Passive replication:

      • No distinction between “primary” and “backup” servers

      • No checkpointing

    • Each server replica can receive and process client requests

    • But…clients only talk to one replica at any one time

    • The Naming Service and the database system are single points of failure.


FT-Baseline Architecture

  • Guaranteeing Determinism

    • State is committed to the reliable database at every client invocation

    • State is read from the DB before processing any requests, and committed back to the DB after processing

    • Table locking per replica in the database ensures atomic access per game and guarantees determinism between our replicas

    • Transactions with non-increasing sequence numbers are discarded

  • Transaction processing

    • Database transactions are guaranteed atomic

    • Consistent state is achieved by having the servers read state from the database before beginning any transaction
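The read-process-commit cycle described on this slide can be sketched roughly as follows. This is a minimal single-process Python sketch, not the team's Orbacus/C++ code. The in-memory `FakeDatabase`, the `GameState` fields, the `handle_request` function, and the per-game lock standing in for database table locking are all hypothetical names introduced for illustration. It also assumes sequence numbers are tracked per game, which the slides do not specify.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class GameState:
    """State a stateless server reloads before every request (hypothetical fields)."""
    last_seq: int = 0                               # highest sequence number committed so far
    positions: dict = field(default_factory=dict)   # player name -> (x, y)


class FakeDatabase:
    """Hypothetical stand-in for the shared database, with one lock per game."""

    def __init__(self):
        self._states = {}
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, game_id):
        with self._guard:
            return self._locks.setdefault(game_id, threading.Lock())

    def read(self, game_id):
        return self._states.setdefault(game_id, GameState())

    def commit(self, game_id, state):
        self._states[game_id] = state


def handle_request(db, game_id, seq, player, new_pos):
    """Process one client invocation on any replica; returns True if it was applied."""
    with db.lock_for(game_id):          # atomic access per game
        state = db.read(game_id)        # always start from the committed state
        if seq <= state.last_seq:       # non-increasing sequence number: discard
            return False
        state.positions[player] = new_pos
        state.last_seq = seq
        db.commit(game_id, state)       # commit before replying to the client
        return True
```

Because every replica starts from the committed state and rejects stale sequence numbers, a request that a client re-sends after failing over is applied at most once in this model.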


FT-Baseline Architecture


Mechanisms for Fail-Over

  • Failing-Over

    • Client detects failure by catching a COMM_FAILURE or TRANSIENT exception

    • Client queries Naming Service for list of servers

    • Client connects to first available server in order listed in Naming Service

    • If this list is empty, the client waits until a new server registers with the Naming Service
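The fail-over steps above can be sketched as a small simulation. This is an illustrative Python model under stated assumptions, not CORBA code: `ServerDown` stands in for the CORBA system exceptions, and `FakeServer`, `FakeNamingService`, and `Client` are hypothetical classes. Where the real client waits for a new server to register, this sketch simply returns `None` and leaves the retry to the caller.

```python
class ServerDown(Exception):
    """Stands in for the CORBA system exceptions a client would catch."""


class FakeServer:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def move(self, player, direction):
        if not self.alive:
            raise ServerDown(self.name)
        return f"{self.name} moved {player} {direction}"


class FakeNamingService:
    """Hypothetical registry that lists servers in registration order."""

    def __init__(self, servers):
        self.servers = list(servers)

    def list_servers(self):
        return list(self.servers)


class Client:
    def __init__(self, naming):
        self.naming = naming
        self.server = None

    def invoke(self, player, direction):
        """Send a request, failing over to the first reachable listed server."""
        if self.server is not None:
            try:
                return self.server.move(player, direction)
            except ServerDown:
                self.server = None              # failure detected
        for candidate in self.naming.list_servers():
            try:
                reply = candidate.move(player, direction)
            except ServerDown:
                continue                        # stale entry; try the next one
            self.server = candidate
            return reply
        return None                             # nothing reachable: caller waits and retries
```

Note that the naming-service lookup sits on the request path here, which is the cost the later pre-resolution optimization tries to remove.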


Fail-Over Measurements – 16 Faults


Fail-Over Measurements – Breakdown with 16 Faults


Fail-Over Measurements

  • Problem: failure-induced RTT is too high

    • Average Fault-Free RTT: 14.7 ms

    • Average Failure-Induced RTT: 78.8 ms

    • Maximum Failure-Induced RTT: 1045.8 ms

  • Solution: Have the client pre-resolve servers and pre-establish connections with working servers.


RT-FT-Baseline Architecture

  • What we tried:

    • Clients create a low-priority Update thread which contacts the Naming Service at a regular interval, caches references of working servers, and attempts to pre-establish connections.

    • This thread also performs maintenance on existing connections and repopulates the cache with newly launched servers

  • What we expected:

RT-FT Optimization – Part 1

Before and after multi-threaded optimization

What went wrong?


Bounded “Real-Time” Fail-Over Measurements

  • Jitter:

    • Maximum Jitter BEFORE: 36 ms

    • Maximum Jitter AFTER: 176 ms

    • We have “improved” the jitter in our system by -389% !

  • RTT:

    • Average RTT BEFORE: 13 ms

    • Average RTT AFTER: 21 ms

    • We have “improved” the average RTT by -59% !

  • Why??

    • High overhead from the Update thread

    • The Update thread queried the Naming Service every 200 µs!

      • Oops….
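The "improvement" percentages on this slide are relative changes against the BEFORE value. A short worked check in Python, using only the rounded figures quoted above (the helper name `percent_improvement` is ours):

```python
def percent_improvement(before, after):
    """Relative improvement in percent; a negative value means the metric got worse."""
    return (before - after) / before * 100.0


jitter = percent_improvement(36, 176)   # maximum jitter in ms, before vs. after
rtt = percent_improvement(13, 21)       # average RTT in ms, before vs. after
print(f"jitter: {jitter:.0f}%, RTT: {rtt:.0f}%")
```

The jitter figure reproduces the slide's -389%. With the rounded 13 ms and 21 ms averages the RTT change comes out near -62% rather than -59%, so the slide's number was presumably computed from unrounded measurements that are not in this transcript.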


RT-FT Optimization – Part 2

Reduced the update period from 200 µs to 500 ms
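A background refresher with a configurable period might look like the following. This is a hedged Python sketch of the idea only: `ServerCache` and its methods are hypothetical, the `list_servers` callable stands in for a Naming Service query, and Python threads have no priority setting, so the "low-priority" aspect of the original Update thread is not modeled.

```python
import threading


class ServerCache:
    """Client-side cache of server references, refreshed by a background thread."""

    def __init__(self, list_servers, period_s=0.5):
        self._list_servers = list_servers   # callable standing in for a Naming Service query
        self._period_s = period_s           # 0.5 s corresponds to the 500 ms period above
        self._servers = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.refresh()                      # populate once before the first request
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def refresh(self):
        servers = list(self._list_servers())
        with self._lock:
            self._servers = servers

    def snapshot(self):
        with self._lock:
            return list(self._servers)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._period_s):
            self.refresh()
```

The period is the knob that went wrong in Part 1: at a few hundred microseconds the refresh loop competes with the game requests themselves, while at 500 ms it stays in the background.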


RT-FT Optimization – Part 2

With faults…but why the high periodic jitter?

13 spikes above 200 ms


RT-FT Optimization – Part 2

Bug discovered and fixed from analyzing results

3 spikes above 200 ms


RT-FT Fail-Over Measurements

  • Average RTT:

    • 41 ms

  • Jitter:

    • Average Faulty Jitter: 81 ms

    • Maximum Jitter: 480 ms

  • Failover time:

    • Previous max: 210 ms

    • Current max: 230 ms

  • However, these numbers are not realistically useful because:

    • Cluster variability influences jitter considerably

    • Measurements are environment-dependent

  • Outliers:

    • Before: 0.1286%

    • After: 0.0296%


RT-FT-Performance Strategy

  • LOAD BALANCING:

    • Load balancing is performed on the client-side

    • Client randomly selects initial server to connect to

    • Upon failure, client randomly connects to another live server

  • MOTIVATION:

    • Take advantage of multiple servers

    • Explore the effects of spreading a single game across multiple servers
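Client-side random selection, as described above, can be sketched in a few lines. This is an illustrative Python version under our own assumptions: servers are modeled as plain dictionaries with hypothetical `name` and `alive` keys, and `pick_server` is a name introduced here rather than anything from the project.

```python
import random


def pick_server(servers, exclude=None, rng=random):
    """Pick a random live server, optionally skipping the one that just failed."""
    candidates = [s for s in servers if s["alive"] and s is not exclude]
    if not candidates:
        return None                 # nothing available; the caller has to wait
    return rng.choice(candidates)


servers = [
    {"name": "A", "alive": True},
    {"name": "B", "alive": True},
    {"name": "C", "alive": False},
]
first = pick_server(servers)                    # initial connection
first["alive"] = False                          # simulate a crash of that server
second = pick_server(servers, exclude=first)    # random fail-over target
```

Random choice needs no coordination between clients, but it also means the four clients of one game can end up on different servers, which is the effect the team set out to explore.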

Performance Measurements

Load Balancing Worsens RTT Performance


Performance Measurements

  • Load balancing decreased performance

    • This is counter-intuitive

    • One single-threaded server should be slower than multiple single-threaded servers

      • Load balancing should have improved RTT since multiple servers could service separate games simultaneously

  • Server code was not modified in implementing load balancing

    • Problem has to be with concurrent accesses to the database

    • This pointed us to a bug in the database table locking code

      • Transactions and table locks were out of order, causing table locks to be released prematurely
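The slides do not show the faulty code, so the following is only one plausible illustration of how such an ordering bug can arise. MySQL's documentation states that beginning a transaction with START TRANSACTION releases table locks already acquired with LOCK TABLES, and recommends `SET autocommit = 0` followed by LOCK TABLES for transactional tables. The toy Python model below encodes just that one rule; `ToySession`, the `game` table, and both statement lists are hypothetical.

```python
class ToySession:
    """Toy model of one database session; not a real MySQL client."""

    def __init__(self):
        self.locked = False
        self.log = []

    def execute(self, statement):
        if statement.startswith("LOCK TABLES"):
            self.locked = True
        elif statement in ("START TRANSACTION", "UNLOCK TABLES"):
            # Modeled rule: starting a transaction drops table locks already held.
            self.locked = False
        self.log.append((statement, self.locked))


def lock_held_during_update(statements):
    """Return True if the table lock is still held whenever an UPDATE runs."""
    session = ToySession()
    for statement in statements:
        session.execute(statement)
    return all(held for stmt, held in session.log if stmt.startswith("UPDATE"))


# Lock first, then start the transaction: the lock is gone before the UPDATE.
BUGGY = ["LOCK TABLES game WRITE", "START TRANSACTION",
         "UPDATE game SET seq = seq + 1", "COMMIT", "UNLOCK TABLES"]

# Ordering recommended for transactional tables: no START TRANSACTION after locking.
FIXED = ["SET autocommit = 0", "LOCK TABLES game WRITE",
         "UPDATE game SET seq = seq + 1", "COMMIT", "UNLOCK TABLES"]
```

With a single server the premature release goes unnoticed because nothing else competes for the table; it only shows up once several servers hit the database concurrently, which matches how load balancing exposed the bug.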


Performance Measurements

Figure: average RTT (µs) with load balancing, after the DB lock bug was fixed


Performance Measurements

  • Corrected Load Balancing

    • Load balancing resulted in improved performance

      • Non-balanced average RTT: 454 ms

      • Balanced average RTT: 255 ms


Insights from Measurements

  • FT-Baseline

    • Can’t assume failover time is consistent or bounded

  • RT-FT-Optimization (update thread)

    • Reducing jitter resulted in increased average RTT

    • Scheduling the update thread too frequently results in increased jitter and overhead

  • Load Balancing

    • Load balancing can easily be done incorrectly

      • Spreading games across multiple machines does not necessarily improve performance

      • It can be difficult to select the right server to fail-over to

      • Single shared resource can be the bottleneck


Open Issues

  • Let’s discuss some issues…

    • Newer cluster software would be nice!

      • Newer MySQL with finer-grained locking, stored procedures, bug fixes, etc.

      • Newer gcc needed for certain libraries

    • Clients can’t rejoin the game if the client crashes

    • Database server is a huge scalability bottleneck

  • If only we had more time…

    • GUI using real graphics instead of ASCII art

    • Login screen and lobby for clients to interact before game starts

    • Rankings for top ‘Borbamen’ based on some scoring system

    • Multiple maps, power-ups (e.g. drop more bombs, move faster, etc.)


Conclusions

  • What we’ve learned…

    • Stateless servers make server recovery relatively simple

      • But this moves the performance bottleneck to the database

    • Testing to see that your system works is not good enough – looking at performance measurements can also point out implementation bugs

    • Really hard to test performance on shared servers…

    • Testing can be fun when you have a fun project 

  • Accomplishments

    • Failover is really fast for non-loaded servers

    • We created a “real-time” application

    • BORBAMAN = CORBA + BOMBERMAN is really fun!


Conclusions

  • For the next time around…

    • Pick a different middleware that will let us run clients on machines outside of the games cluster

    • Run our own SQL server that doesn’t crash during stress testing O:-)

    • Language interoperability (e.g. Java clients with C++ servers) could be cool

      • Orbacus supposedly supports this

    • Store some state on the servers to reduce the load on the database


And the winner goes to …..


Finale

Finale!

  • No Questions Please.

  • GG EZ$$$ kthxbye


Appendix

Varying Server Load Affects RTT


RT-FT Fail-Over Measurements

