Reliability in cloud and mobile apps

Traditional client-server vs cloud
  • Traditional client-server
    • Usually highly-reliable server available on demand
  • Cloud
    • Garbage hardware that could fail at any time
    • Challenge: ensure reliability of apps nonetheless
BTW, cloud servers aren’t necessarily very well-configured, either
  • Example GAE servers
    • 128MB-1GB RAM; 600MHz-4.8GHz CPU

  • Amazon EC2 “medium” servers
    • 3.75 GB RAM; 2.0-2.4 GHz 2007 Opteron CPU

  • My cheap, busted up, 4-year old laptop
    • 3 GB RAM; 2x2.4 GHz Intel (Core Duo) CPU cores

May 14, 2012

Yet, from the current GAE Service Level Agreement (SLA)

They really mean to be highly reliable!!!! So how do they do it? How can you make the most of it?

SLAs often quote reliability as “nines”
  • Two nines: 99%, 3.65 days downtime every year
    • Easy to do with cheap hardware + backup
  • Three nines: 99.9%, about 8.8 hours every year
    • Can be done with reasonably good hardware
  • Four nines: 99.99%, < 1 hour every year
    • Not all systems can do this
  • Five nines: 99.999%, about 5 minutes every year
    • Very hard to achieve, and very expensive
    • Each “nine” approximately doubles the cost 
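
The downtime figures follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the class and method names are made up for illustration):

```java
// Sketch: convert an availability percentage into allowed downtime per year.
// Assumes a 365-day year; class and method names are illustrative only.
class Nines {
    static final double MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600 minutes

    // Minutes of downtime per year permitted at the given availability (e.g. 99.9).
    static double downtimeMinutesPerYear(double availabilityPercent) {
        return MINUTES_PER_YEAR * (1.0 - availabilityPercent / 100.0);
    }

    public static void main(String[] args) {
        double[] levels = {99.0, 99.9, 99.99, 99.999};
        for (double level : levels) {
            System.out.printf("%.3f%% -> %.1f minutes/year%n",
                    level, downtimeMinutesPerYear(level));
        }
    }
}
```

Two nines comes out at 5,256 minutes (3.65 days) and five nines at roughly 5.3 minutes.
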
Key reliability principles
  • Replication
  • Provide a means for monitoring
  • Consider using a hybrid cloud
Replication of computation
  • GAE will automatically copy your code
    • Starting up multiple servers to handle requests
      • If your server generally responds quickly to requests
      • And there is extra hardware available at the moment
    • Automatically balancing load

Replication Monitoring  Hybridize

Data also needs replication
  • You can control the level of replication
  • Old-fashioned (traditional client-server)
    • Set up a “master” database server
    • Configure the master to copy its data to “slaves” (e.g., every night)
  • Cloud-based approach
    • Let the infrastructure replicate data automatically
    • GAE: You have two options… master/slave, and high-replication datastore

High-replication datastore (HRD) vs master/slave datastore (MSD)
  • HRD makes backup copies across datacenters (and > 2 copies; MSD has only 2 copies)

  • HRD includes a more sophisticated algorithm for resolving errors on (some) servers
    • MSD: writes all go to the master (if available); master copied to slaves; reads all go to the master (if available) [Deprecated!]
    • HRD: more sophisticated algorithm where the different servers (no master) form a consensus

Pros of using HRD
  • Pro: Reliability is vastly improved
    • Largely due to replication of data across datacenters
  • Pro: support for cross-group transactions in Python
    • Apparently? Test before relying on it!
    • Maybe available in Java?
    • Config change needed?

Cons: Latency and eventual consistency
  • Con: Latency can be pretty big (> 1 second)
    • Writes (and reads) go to multiple servers, multiple datacenters
  • Con: Data just written might not appear in a read
    • GAE might write to server X but then read from Y
    • Data on X might not be copied to Y right away
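
The "write to X, read from Y" problem can be reproduced in a toy model. This is only a sketch of the behaviour, not of how GAE is implemented; the class, the two replicas, and the explicit `replicate()` step are assumptions made for illustration:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of eventual consistency: writes land on replica X immediately
// and reach replica Y only when replicate() runs.
class EventualStore {
    private final Map<String, String> replicaX = new HashMap<>();
    private final Map<String, String> replicaY = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>();

    void write(String key, String value) {
        replicaX.put(key, value);
        pending.add(new String[] {key, value}); // the copy to Y happens later
    }

    // Reads are served by Y, which may lag behind X.
    String read(String key) {
        return replicaY.get(key);
    }

    // Apply all queued writes to Y (stands in for background replication).
    void replicate() {
        String[] entry;
        while ((entry = pending.poll()) != null) {
            replicaY.put(entry[0], entry[1]);
        }
    }
}
```

A `read` issued right after a `write` returns `null` until `replicate()` has run, which is exactly the window the application has to cope with.
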
Coping with problem #1, latency
  • Cache a copy of data on client
    • Eliminates the need to hit the server
    • Bonus: improves reliability when server is offline
  • Write a copy to memcache
    • So you can read back faster
    • Only do this for data you read a lot, of course
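
The memcache idea amounts to a write-through cache. A minimal sketch, with two in-memory maps standing in for the datastore and for memcache (both stand-ins, and all names, are assumptions rather than the GAE API):

```java
import java.util.HashMap;
import java.util.Map;

// Write-through cache sketch. Both maps are in-memory stand-ins:
// `datastore` for the slow replicated store, `cache` for memcache.
class CachedCourses {
    private final Map<String, String> datastore = new HashMap<>();
    private final Map<String, String> cache = new HashMap<>();
    int datastoreReads = 0; // counts how often the slow path is taken

    void save(String id, String title) {
        datastore.put(id, title); // durable write
        cache.put(id, title);     // copy kept for fast read-back
    }

    String load(String id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;         // cache hit: no datastore round trip
        }
        datastoreReads++;
        String stored = datastore.get(id);
        if (stored != null) {
            cache.put(id, stored); // repopulate after an eviction
        }
        return stored;
    }

    // A cache may drop entries at any time; simulate that here.
    void evict(String id) {
        cache.remove(id);
    }
}
```

The fallback to the datastore matters: a cache entry can vanish at any moment, so the cache can only ever be a shortcut, never the sole copy.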

Coping with problem #2, writes not appearing on read
  • Don’t assume that an entity you just wrote will immediately appear in a query (in HRD)
    • Wait a few seconds to read back
    • Or automatically append the written entity to the query results if you don’t see it

Example pseudocode (must be fancier for sorted queries)

Course mycourse = create a new entity

List<Course> courses = query for courses

boolean sawit = false

foreach (Course course in courses)
  if (course is the same entity as mycourse) { sawit = true; break; }

if (!sawit) courses.add(mycourse)

foreach (Course course in courses)
  do something with course
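
For sorted queries, the "fancier" version has to insert the written entity at the right position instead of appending it. One possible sketch in plain Java follows; the `Course` class, its fields, and the helper name are hypothetical, and the query results are assumed to arrive already sorted:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch: make sure a just-written entity shows up in query results,
// keeping the list sorted. Course is a hypothetical entity type.
class ReadYourWrites {
    static final class Course {
        final String id;
        final String title;
        Course(String id, String title) { this.id = id; this.title = title; }
    }

    // Returns a copy of `results` that contains `written`, inserted where the
    // sort order requires if the query did not return it.
    // Assumes `results` is already sorted by `order`.
    static List<Course> withWritten(List<Course> results, Course written,
                                    Comparator<Course> order) {
        List<Course> merged = new ArrayList<>(results);
        for (Course c : merged) {
            if (c.id.equals(written.id)) {
                return merged; // the query already saw the write
            }
        }
        int pos = Collections.binarySearch(merged, written, order);
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point this way
        }
        merged.add(pos, written);
        return merged;
    }
}
```

Calling `withWritten(results, mycourse, Comparator.comparing(c -> c.title))` leaves the list untouched when the query already returned the entity.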

Coping with another reliability problem (#3), exception on commit
  • If you use transactions (locks), you will get exceptions on multiple simultaneous writes
    • True for MSD, HRD, or any other platform that relies on optimistic locking
  • Use a try/catch/retry approach
    • Repeatedly try to write your updates if they fail on the first try

Example pseudocode

int retries = 10

while (--retries >= 0) {
  try {
    start transaction
    Course mycourse = get the course entity
    make modifications to mycourse
    commit transaction
    retries = 0
  } catch (JDOException e) {
    log the exception
    roll back the transaction if it is still active
  }
}
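
The same try/catch/retry idea can be packaged as a reusable helper. This is a sketch under stated assumptions: `ConcurrentModificationException` stands in for whatever the persistence layer throws on a failed commit (the pseudocode above catches `JDOException`), and the helper name and the short pause between attempts are choices made here, not taken from the slides:

```java
import java.util.ConcurrentModificationException;
import java.util.function.Supplier;

// Generic try/catch/retry helper. The supplier is expected to begin a
// transaction, apply the changes, and commit.
class Retry {
    static <T> T withRetries(int maxAttempts, Supplier<T> transaction) {
        ConcurrentModificationException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return transaction.get();
            } catch (ConcurrentModificationException e) {
                last = e;
                System.err.println("attempt " + attempt + " failed: " + e.getMessage());
                pause(10L * attempt); // brief, growing pause before the next try
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }
}
```

Giving up with an exception after the last attempt keeps a persistent failure visible to the caller, so an update is never dropped silently.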



Provide a means for monitoring
  • You should provide a means of monitoring your system’s uptime
  • Common approach: Script on client elsewhere
    • Could be another cloud service (e.g., EC2)
    • Script accesses the server
    • Client tracks success rate + latency
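
A monitoring client along these lines might look like the sketch below. It assumes Java 11 or later for `java.net.http`. The class name, the 5-second timeout, and the "2xx means healthy" rule are illustrative choices, not something the slides prescribe:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Sketch of a client-side monitor: run a probe repeatedly and track
// success rate and latency.
class UptimeMonitor {
    private int attempts = 0;
    private int successes = 0;
    private long totalLatencyNanos = 0;

    // Run one probe; a thrown runtime exception counts as a failure.
    void check(BooleanSupplier probe) {
        long start = System.nanoTime();
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false;
        }
        totalLatencyNanos += System.nanoTime() - start;
        attempts++;
        if (ok) {
            successes++;
        }
    }

    double successRate() {
        return attempts == 0 ? 0.0 : (double) successes / attempts;
    }

    double meanLatencyMillis() {
        return attempts == 0 ? 0.0 : totalLatencyNanos / 1e6 / attempts;
    }

    // Example probe: true if the URL answers with a 2xx status within 5 seconds.
    static boolean httpOk(HttpClient client, String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            int status = client.send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A scheduled job could call `monitor.check(() -> UptimeMonitor.httpOk(client, url))` every minute for each monitored URL and log the two statistics.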

What to monitor
  • The services of the application itself
    • You probably need to include some test data
  • Also three other “dummy” services
    • One that just returns
    • One that reads from datastore
    • One that writes to datastore and reads back

Things you can do with data
  • Detect when one/some of your application’s services have crashed
    • Or are getting slow
  • Detect if any problems are your fault
    • i.e., one of your own application’s services has failed but the dummy services are working
  • Decide whether/when/how to redesign
    • Changes to your own application
    • Integrate a different cloud platform
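
The "is it my fault?" check can be written down as a small decision rule. In this sketch the verdict names are invented for illustration, and a failing dummy service is read as "probably the platform or the network", which is an assumption rather than a certainty:

```java
// Sketch: interpret monitoring results from the application's own services
// and from the "dummy" services.
class Diagnosis {
    enum Verdict { HEALTHY, LIKELY_YOUR_FAULT, LIKELY_PLATFORM_OR_NETWORK }

    static Verdict diagnose(boolean appServicesOk, boolean dummyServicesOk) {
        if (!dummyServicesOk) {
            // Even the trivial services fail, so the cause is probably outside your code.
            return Verdict.LIKELY_PLATFORM_OR_NETWORK;
        }
        if (!appServicesOk) {
            // The platform answers, but your own services do not.
            return Verdict.LIKELY_YOUR_FAULT;
        }
        return Verdict.HEALTHY;
    }
}
```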

Consider using a hybrid cloud
  • Distributing code and data across platforms
    • Example: EC2 + GAE
    • Example: EC2 + your own servers
  • Ways that hybrid can help
    • Taking advantage of specialized APIs
    • Fail-over when one platform fails
    • Protecting access to data

Hybrid cloud scenario #1
  • Your application analyzes some binary files. The analyzer code only runs on Windows. Unfortunately, Azure is very expensive.
  • Solution:
    • Deploy the analyzer on Azure
    • Expose its functionality via network calls
    • Deploy most of the code on GAE (nice and cheap)
    • The GAE part of the application calls the Azure part of the application and stores result in GAE

Hybrid cloud scenario #2
  • Your application is on EC2 and has demonstrated high performance + reliability. But the outage a few years back scared your manager.
  • Solution:
    • Tweak the application to run on GoGrid (very similar to EC2)
    • But continue hosting on EC2, where your application has shown excellent performance.
    • Tweak your client so that if your EC2 server stops responding, then it calls GoGrid instead
    • Write scripts on GoGrid and EC2 to sync data.
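
The client-side tweak is essentially an ordered fail-over. In this sketch the backends are modelled as plain functions so the logic can be exercised without a network; in the scenario they would wrap calls to the EC2 and GoGrid deployments. All names are hypothetical:

```java
import java.util.List;
import java.util.function.Function;

// Client-side fail-over sketch: try each backend in order and return the
// first successful answer.
class FailoverClient {
    private final List<Function<String, String>> backends;

    FailoverClient(List<Function<String, String>> backends) {
        this.backends = backends;
    }

    String call(String request) {
        RuntimeException last = null;
        for (Function<String, String> backend : backends) {
            try {
                return backend.apply(request);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next backend
            }
        }
        throw new IllegalStateException("all backends failed", last);
    }
}
```

Listing EC2 first keeps normal traffic on the platform that has already shown good performance. GoGrid is only reached when the EC2 call throws.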

Hybrid cloud scenario #3
  • Some of your data is very sensitive and cannot be trusted to cloud providers. Other data and associated computations are not sensitive and have periodic demand spikes.
  • Solution:
    • Deploy the sensitive data on your server and the not-so-sensitive data+computation on cloud.
    • In your client, invoke the company server for computations on sensitive data and invoke cloud servers for not-so-sensitive data+computation.
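
In the client, this split comes down to a routing decision per request. A minimal sketch follows; both URLs are placeholders and the enum is an assumption introduced for illustration:

```java
// Routing sketch for scenario #3: choose a backend by data sensitivity.
class SensitivityRouter {
    enum Sensitivity { SENSITIVE, NOT_SO_SENSITIVE }

    // Placeholder endpoints, not real servers.
    static final String COMPANY_SERVER = "https://internal.example.com";
    static final String CLOUD_SERVER = "https://cloud.example.com";

    static String endpointFor(Sensitivity level) {
        // Sensitive data never leaves the company server; everything else
        // goes to the cloud, which can absorb demand spikes.
        return level == Sensitivity.SENSITIVE ? COMPANY_SERVER : CLOUD_SERVER;
    }
}
```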

Key reliability principles
  • Replicate
    • Replicate your code
    • Use the high-replication datastore
    • Be prepared to cope with problems
      • Replicate data to client and memcache
      • Detect and handle writes-not-appearing-on-read
      • Try/catch/retry approach to handle failure
  • Provide a means for monitoring
  • Consider using a hybrid cloud
    • For APIs, fail-over, securing data