Reliability in cloud and mobile apps

wilma

Presentation Transcript


  1. Doh! Reliability in cloud and mobile apps http://www.flickr.com/photos/johanl/4934459020

  2. Traditional client-server vs cloud • Traditional client-server • Usually a highly reliable server, available on demand • Cloud • Garbage hardware that could fail at any time • Challenge: ensure reliability of apps nonetheless

  3. BTW, cloud servers aren’t necessarily very well configured, either • Example GAE servers • 128 MB-1 GB RAM; 600 MHz-4.8 GHz CPU https://developers.google.com/appengine/docs/java/config/backends • Amazon EC2 “medium” servers • 3.75 GB RAM; 2.0-2.4 GHz 2007 Opteron CPU http://aws.amazon.com/ec2/instance-types/ • My cheap, busted-up, 4-year-old laptop • 3 GB RAM; 2 x 2.4 GHz Intel (Core Duo) CPU cores • (Figures as of May 14, 2012)

  4. Yet, from the current GAE Service Level Agreement (SLA) • They really mean to be highly reliable! • So how do they do it? • How can you make the most of it? https://developers.google.com/appengine/sla

  5. SLAs often quote reliability as “nines” • Two nines: 99%, 3.65 days of downtime every year • Easy to do with cheap hardware + backup • Three nines: 99.9%, about 8.8 hours every year • Can be done with reasonably good hardware • Four nines: 99.99%, < 1 hour (about 53 minutes) every year • Not all systems can do this • Five nines: 99.999%, about 5 minutes every year • Very hard to achieve, and very expensive • Each “nine” approximately doubles the cost
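
The downtime behind each "nine" is just (1 - availability) times one year. A minimal sketch of that arithmetic, assuming a 365-day year; the class and method names are made up for illustration:

```java
import java.time.Duration;

final class Nines {
    /** Allowed downtime per 365-day year for an availability of the given number of nines. */
    static Duration yearlyDowntime(int nines) {
        // 2 nines -> 0.01 unavailable, 3 nines -> 0.001, and so on.
        double unavailableFraction = Math.pow(10, -nines);
        long secondsPerYear = 365L * 24 * 60 * 60;
        return Duration.ofSeconds(Math.round(secondsPerYear * unavailableFraction));
    }
}
```

For example, `Nines.yearlyDowntime(3).toHours()` gives the whole hours of downtime that a three-nines SLA tolerates per year.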

  6. Key reliability principles • Replication • Provide a means for monitoring • Consider using a hybrid cloud

  7. Replication of computation • GAE will automatically copy your code • Starts up multiple servers to handle requests • If your server generally responds quickly to requests • And if extra hardware is available at the moment • Automatically balances load Replication Monitoring  Hybridize

  8. Data also needs replication • You can control the level of replication • Old-fashioned (traditional client-server) • Set up a “master” database server • Configure the master to copy its data to “slaves” (e.g., every night) • Cloud-based approach • Let the infrastructure replicate data automatically • GAE: You have two options… master/slave, and high-replication datastore

  9. High-replication datastore (HRD) vs master/slave datastore (MSD) • HRD makes backup copies across datacenters (and more than 2 copies; MSD has only 2) • HRD includes a more sophisticated algorithm for resolving errors on (some) servers • MSD: all writes go to the master (if available); the master is copied to slaves; all reads go to the master (if available) [Deprecated!] • HRD: a more sophisticated algorithm in which the different servers (no master) form a consensus

  10. Pros of using HRD • Pro: Reliability is vastly improved • Largely due to replication of data across datacenters • Pro: Support for cross-group transactions in Python • Apparently? Test before relying on it! • Maybe available in Java? • Config change needed? https://developers.google.com/appengine/docs/python/datastore/overview#Cross_Group_Transactions

  11. Cons: Latency and eventual consistency • Con: Latency can be pretty big (> 1 second) • Writes (and reads) go to multiple servers, multiple datacenters • Con: Data just written might not appear in a read • GAE might write to server X but then read from Y • Data on X might not be copied to Y right away

  12. Coping with problem #1, latency • Cache a copy of data on client • Eliminates the need to hit the server • Bonus: improves reliability when server is offline • Write a copy to memcache • So you can read back faster • Only do this for data you read a lot, of course
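
The "write a copy to memcache" idea can be sketched without any GAE dependency. In this hedged example, two in-memory maps stand in for memcache and the datastore, and `CachedStore` and its members are hypothetical names, not part of any GAE API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Write-through, read-through cache in front of a slow store (illustrative only). */
final class CachedStore<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>(); // stand-in for memcache
    private final Map<K, V> store;                              // stand-in for the datastore
    int storeReads = 0;                                         // counts slow reads, for illustration

    CachedStore(Map<K, V> store) {
        this.store = store;
    }

    void write(K key, V value) {
        store.put(key, value); // the durable write
        cache.put(key, value); // keep a copy so it can be read back faster
    }

    V read(K key) {
        V hit = cache.get(key);
        if (hit != null) {
            return hit;        // served from the cache, no datastore round trip
        }
        storeReads++;
        V value = store.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }
}
```

A real memcache can evict entries at any time, so the fallback read from the store is not optional. The same shape applies to a copy cached on the client.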

  13. Coping with problem #2, writes not appearing on read • Don’t assume that an entity you just wrote will immediately appear in a query (in HRD) • Wait a few seconds to read back • Or automatically append the written entity to the query results if you don’t see it

  14. Example pseudocode (must be fancier for sorted queries)
      Course mycourse = create a new entity
      pm.makePersistent(mycourse)
      List<Course> courses = query for courses
      boolean sawit = false
      foreach (Course course in courses)
        if (course.id == mycourse.id) { sawit = true; break; }
      if (!sawit) courses.add(mycourse)
      foreach (Course course in courses)
        do something with course
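
One way the append-if-missing step could look in plain Java, including the sorted-query case. This is a sketch under assumptions: the `Course` class is a hypothetical minimal entity, a plain list stands in for the query result, and the query is assumed to be sorted by the same comparator that is passed in:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class ReadYourWrites {
    /** Hypothetical minimal entity with an id and one sortable field. */
    static final class Course {
        final long id;
        final String title;

        Course(long id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /**
     * Returns a copy of the query results that is guaranteed to contain the
     * entity that was just written. If the (possibly stale) query missed it,
     * the entity is inserted at its sorted position.
     */
    static List<Course> withJustWritten(List<Course> queried, Course written,
                                        Comparator<Course> order) {
        List<Course> courses = new ArrayList<>(queried);
        for (Course c : courses) {
            if (c.id == written.id) {
                return courses; // the write is already visible
            }
        }
        int pos = Collections.binarySearch(courses, written, order);
        // A negative result encodes the insertion point as -(point) - 1.
        courses.add(pos >= 0 ? pos : -pos - 1, written);
        return courses;
    }
}
```

This only covers newly created entities. If the write was an update, a stale query may still return the old version, and the caller would need to replace it rather than skip it.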

  15. Coping with another reliability problem (#3), exception on commit • If you use transactions (locks), you will get exceptions when multiple writes to the same data happen simultaneously • True for MSD, HRD, or any other platform that relies on optimistic locking • Use a try/catch/retry approach • Retry your updates repeatedly if they fail on the first try

  16. Example pseudocode
      int retries = 10
      while (--retries >= 0) {
        try {
          start transaction
          Course mycourse = get the course entity
          make modifications to mycourse
          pm.makePersistent(mycourse)
          commit transaction
          break
        } catch (JDOException) {
          log the exception
          roll back the transaction if it is still active
        }
      }
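
A runnable sketch of the try/catch/retry loop, kept free of JDO so it can be tried anywhere. Two things here are assumptions rather than part of the slides: `java.util.ConcurrentModificationException` stands in for whatever contention exception the persistence layer throws, and the short, growing pause between attempts is an addition. `Retry` and `withRetries` are hypothetical names:

```java
import java.util.ConcurrentModificationException;
import java.util.function.Supplier;

final class Retry {
    /**
     * Runs a transactional unit of work, retrying when it fails with a
     * contention error. Rethrows the last error once attempts run out,
     * so a permanent failure is never silently swallowed.
     */
    static <T> T withRetries(int maxAttempts, long delayMillis, Supplier<T> work) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConcurrentModificationException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.get(); // begin, read, modify, commit
            } catch (ConcurrentModificationException e) {
                last = e;          // stand-in for the platform's contention exception
                if (attempt < maxAttempts) {
                    pause(delayMillis * attempt); // wait a little longer each time
                }
            }
        }
        throw last;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

The work passed in should re-read the entity on every attempt, since the point of the retry is that someone else changed it in the meantime.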

  17. Monitoring • You should provide a means of monitoring your system’s uptime • Common approach: Script on client elsewhere • Could be another cloud service (e.g., EC2) • Script accesses the server • Client tracks success rate + latency

  18. What to monitor • The services of the application itself • You probably need to include some test data • Also three other “dummy” services • One that just returns • One that reads from datastore • One that writes to datastore and reads back
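
A monitoring client of this kind mostly needs to time each check and count successes. The sketch below is one possible shape, with hypothetical names throughout; `httpOk` uses the standard `HttpURLConnection` with an arbitrary 5-second timeout, and any URL passed to it would be your own service or dummy endpoint:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.function.BooleanSupplier;

/** Tracks success rate and latency for one monitored service (illustrative only). */
final class UptimeProbe {
    int attempts = 0;
    int successes = 0;
    long totalLatencyNanos = 0;

    /** Runs one check and records its outcome and how long it took. */
    void runOnce(BooleanSupplier check) {
        long start = System.nanoTime();
        boolean ok;
        try {
            ok = check.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a thrown error counts as a failed probe
        }
        totalLatencyNanos += System.nanoTime() - start;
        attempts++;
        if (ok) {
            successes++;
        }
    }

    double successRate() {
        return attempts == 0 ? 0.0 : (double) successes / attempts;
    }

    double meanLatencyMillis() {
        return attempts == 0 ? 0.0 : totalLatencyNanos / 1e6 / attempts;
    }

    /** A check that passes when an HTTP GET on the URL returns status 200. */
    static BooleanSupplier httpOk(String url) {
        return () -> {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                return conn.getResponseCode() == 200;
            } catch (IOException e) {
                return false;
            }
        };
    }
}
```

One probe per service (the application's own services plus the three dummies) keeps the numbers separate, which matters when deciding where a failure comes from.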

  19. Things you can do with data • Detect when one/some of your application’s services have crashed • Or are getting slow • Detect if any problems are your fault • i.e., one of your own application’s services has failed but the dummy services are working • Decide whether/when/how to redesign • Changes to your own application • Integrate a different cloud platform
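
The "is it your fault?" rule reduces to comparing the application's result with the dummy services' results. A small sketch of that decision, with hypothetical names and deliberately coarse verdicts:

```java
final class Diagnosis {
    enum Verdict { ALL_OK, LIKELY_OUR_BUG, LIKELY_PLATFORM_PROBLEM }

    /**
     * appOk: the application's own service passed its check.
     * The other three flags are the dummy services: one that just returns,
     * one that reads the datastore, one that writes and reads back.
     */
    static Verdict classify(boolean appOk, boolean pingOk, boolean readOk, boolean writeReadOk) {
        boolean dummiesOk = pingOk && readOk && writeReadOk;
        if (appOk && dummiesOk) {
            return Verdict.ALL_OK;
        }
        if (!appOk && dummiesOk) {
            return Verdict.LIKELY_OUR_BUG; // platform looks healthy, our code does not
        }
        return Verdict.LIKELY_PLATFORM_PROBLEM; // at least one dummy service failed
    }
}
```

Which dummy failed narrows it further: a failing "just returns" service points at the serving layer, while a failing read or write points at the datastore.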

  20. Consider using a hybrid cloud • Distributing code and data across platforms • Example: EC2 + GAE • Example: EC2 + your own servers • Ways that hybrid can help • Taking advantage of specialized APIs • Fail-over when one platform fails • Protecting access to data

  21. Hybrid cloud scenario #1 • Your application analyzes some binary files. The analyzer code only runs on Windows. Unfortunately, Azure is very expensive. • Solution: • Deploy the analyzer on Azure • Expose its functionality via network calls • Deploy most of the code on GAE (nice and cheap) • The GAE part of the application calls the Azure part of the application and stores the result in GAE

  22. Hybrid cloud scenario #2 • Your application is on EC2 and has demonstrated high performance + reliability. But the outage a few years back scared your manager. • Solution: • Tweak the application to run on GoGrid (very similar to EC2) • But continue hosting on EC2, where your application has shown excellent performance. • Tweak your client so that if your EC2 server stops responding, then it calls GoGrid instead • Write scripts on GoGrid and EC2 to sync data.
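
The client-side tweak in this scenario amounts to trying the primary host and falling back to the secondary when the call fails. A minimal sketch, assuming each backend is wrapped as a function that throws a runtime exception on failure or timeout; `FailoverClient` and `callWithFailover` are hypothetical names:

```java
import java.util.List;
import java.util.function.Function;

final class FailoverClient {
    /**
     * Tries each backend in order (for example EC2 first, then GoGrid) and
     * returns the first successful response.
     */
    static <R> R callWithFailover(List<Function<String, R>> backends, String request) {
        RuntimeException last = null;
        for (Function<String, R> backend : backends) {
            try {
                return backend.apply(request);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next backend
            }
        }
        throw new IllegalStateException("all backends failed", last);
    }
}
```

Fail-over only helps if the fallback has current data, which is why the data-sync scripts are part of the same solution.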

  23. Hybrid cloud scenario #3 • Some of your data is very sensitive and cannot be trusted to cloud providers. Other data and associated computations are not sensitive and have periodic demand spikes. • Solution: • Deploy the sensitive data on your own server and the not-so-sensitive data+computation on the cloud. • In your client, invoke the company server for computations on sensitive data and invoke cloud servers for not-so-sensitive data+computation.

  24. Key reliability principles • Replicate • Replicate your code • Use the high-replication datastore • Be prepared to cope with problems • Replicate data to client and memcache • Detect and handle writes-not-appearing-on-read • Try/catch/retry approach to handle failure • Provide a means for monitoring • Consider using a hybrid cloud • For APIs, fail-over, securing data
