
Building Fault-Tolerant Enterprise Applications

Building Fault-Tolerant Enterprise Applications. Greg Hinkle Chariot Solutions chariotsolutions.com. Adapted from original presentation by: Erin Mulder & Brian McCallister. Agenda. Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure


Presentation Transcript


  1. Building Fault-Tolerant Enterprise Applications Greg Hinkle Chariot Solutions chariotsolutions.com Adapted from original presentation by: Erin Mulder & Brian McCallister

  2. Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

  3. Goals of Fault Tolerance What are we really worried about? • Availability • Integrity • Confidentiality • Usability • Cost

  4. Goals of Fault Tolerance What can go wrong? • User Error • Concurrent Changes • Bugs • Resource Failure/Downtime • System Overload • Misconfiguration • Sabotage

  5. Goals of Fault Tolerance Themes we’ll keep visiting… • Prevention • Code Guidelines & Reviews • Automated Validation & Regression Testing • Performance / Stress Testing • Negative / Security Testing • Detection • Logging and Auditing • Validation Patterns • Monitoring • Recovery • Exception handling patterns • Error feedback loop • Redundancy

  6. Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

  7. User Recoverable Errors Simple validation error • What do you do when the user: • Leaves a required field blank • Enters a value too big for the database field • Types letters in a numeric field • Selects inconsistent options • Tries to do things in the wrong order

  8. User Recoverable Errors Simple validation error • Fault tolerance is more than detection • Prevent the user from making errors • Set maxlengths on input fields • Use character masks • Specify units • Show example input • Don’t allow the selection of inconsistent options • Don’t present navigation options that aren’t meant to be followed • Guide the user through longer processes

  9. User Recoverable Errors Simple validation error • Help the user recover quickly • Highlight all errors clearly • Show help text and examples for invalid fields • If some other action is required first, launch it instead of interrupting the flow with frustrating errors • Perception is everything! • Log the error for later analysis • Save enough information to recreate • Start automatically handling common mistakes

  10. User Recoverable Errors Optimistic concurrency clash • Everything looks good until the save • Then… • Item has just gone out of stock • Another user has just updated the same document • Time has passed and action is no longer allowed

  11. User Recoverable Errors Optimistic concurrency clash • Increase save points • Alert user to potential risk: • Low stock • Another user just accessed this record • Another user has “soft lock” on record • Offer useful options for resolving collision: • Merge changes • Backorder • Automatically retry later • “Email me when it is available” • Give tips for avoiding future collisions
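One way to detect the clash described above is a version check at save time. The following is a minimal sketch, assuming an in-memory map as a stand-in for a table with a version column; `VersionedDoc`, `DocStore` and `StaleUpdateException` are hypothetical names, not part of the original presentation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Immutable snapshot of a record plus its version counter (hypothetical model). */
final class VersionedDoc {
    final String body;
    final long version;

    VersionedDoc(String body, long version) {
        this.body = body;
        this.version = version;
    }
}

/** Raised when the stored version differs from the one the user started editing. */
final class StaleUpdateException extends Exception {
    final VersionedDoc current; // latest stored state, useful for offering a merge

    StaleUpdateException(VersionedDoc current) {
        super("Record changed since it was loaded");
        this.current = current;
    }
}

/** In-memory stand-in for a table with a version column. */
final class DocStore {
    private final Map<String, VersionedDoc> rows = new ConcurrentHashMap<>();

    void create(String id, String body) {
        rows.put(id, new VersionedDoc(body, 1));
    }

    VersionedDoc load(String id) {
        return rows.get(id);
    }

    /** Saves only if nobody else has saved since expectedVersion was read. */
    VersionedDoc save(String id, String newBody, long expectedVersion)
            throws StaleUpdateException {
        VersionedDoc current = rows.get(id);
        if (current == null || current.version != expectedVersion) {
            throw new StaleUpdateException(current);
        }
        VersionedDoc next = new VersionedDoc(newBody, expectedVersion + 1);
        // replace() is atomic: it fails if another thread swapped the row meanwhile.
        if (!rows.replace(id, current, next)) {
            throw new StaleUpdateException(rows.get(id));
        }
        return next;
    }
}
```

The exception carries the current stored record so the caller can present resolution options such as merging changes, rather than only reporting a failure.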

  12. User Recoverable Errors Bookmarks, back buttons and browsers • User escapes normal page flow • Bookmarks login page or internal page • Uses back button • Opens a new window within same session • Session times out • Missing context from previous requests • Next click is like bookmark to internal page • Other browser oddities • Double-clicking submit buttons • Pressing stop button in the middle of a request

  13. User Recoverable Errors Bookmarks, back buttons and sessions • Prevention is difficult – the user is in control • Javascript can sometimes help • Javascript can sometimes hurt • Plan for and test each of these scenarios • Plan for handling out-of-sequence requests • Limit state, or key it uniquely

  14. User Recoverable Errors Bookmarks, back buttons and sessions • To seamlessly handle session timeouts and out-of-sequence requests, consider: • Persistent sessions (saved to database) • Passing state in every request (form fields or URL rewriting) • Storing state in custom cookies • Adding custom logic to recover from timed-out sequences • Resubmit requests after re-authentication • To simply detect and alert, consider: • Using listener to catch session expiration • Using state validation to catch out-of-sequence requests • Redirecting user to session expiration page • To improve process: • Log session losses (requests within expired session) • Consider increasing session timeout • Consider using prevention techniques described above

  15. User Recoverable Errors Bookmarks, back buttons and sessions • To minimize impact of back button, consider: • Techniques described for out-of-sequence requests • Redirecting to GETs instead of returning responses to POSTs • To work around double submissions, consider: • Utilize unique transaction identifiers stored in session • Forward action submissions to separated response pages • Response pages automatically display on double submit • To handle multiple windows, consider: • Passing state in every request • Pass state in hidden fields throughout a wizard • Adapting web frameworks to map state (e.g. Struts form beans) by primary key or request ID instead of a static name
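The unique transaction identifier idea for double submissions can be sketched as a one-time token. This is an illustration only: `SubmitTokens` is a hypothetical class, and in a real web application the token set would be stored in the user's session and the token embedded in a hidden form field.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks one-time form tokens; a web app would keep this set in the session. */
final class SubmitTokens {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    /** Call when rendering the form and embed the result in a hidden field. */
    String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    /** Returns true only for the first submission that carries this token. */
    boolean consume(String token) {
        // remove() reports whether the token was still outstanding.
        return token != null && outstanding.remove(token);
    }
}
```

When `consume` returns false, the handler can skip the action and show the response page for the earlier submission instead of processing the request twice.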

  16. Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

  17. Expected Application Errors Resource is unavailable… • Database is down for maintenance • No connection to integrated partner service • Resource is overloaded: • Out of DB connections • JMS Queue full

  18. Expected Application Errors Resource is unavailable… • To prevent, consider: • Coordinating maintenance schedules • Planning for failover at the resource level • Increasing hardware budget  • Increasing transaction timeout seconds (caution – last resort) • To handle, analyze transactional requirements: • Is immediate user response necessary? • Can the resource access be handled asynchronously with an extended, logical transaction? • Plan rollbacks carefully to allow for retries (consider idempotence, sub-transactions) • Alert operator/admin if out of SLA • Log all outages (study for patterns)
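The point about planning for retries can be illustrated with a small helper. This is a sketch under stated assumptions: the operation passed in must be idempotent, every exception is treated as transient (real code would retry only on specific failures), and the names `Retry` and `withRetry` are hypothetical.

```java
import java.util.concurrent.Callable;

final class Retry {
    /**
     * Runs an idempotent operation up to maxAttempts times, doubling the
     * pause after each failure, and rethrows the last failure if all fail.
     */
    static <T> T withRetry(Callable<T> op, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // simplification: any exception counts as retryable
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

Retrying is only safe when a repeated call has the same effect as a single call, which is why idempotence and careful rollback planning matter here.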

  19. Expected Application Errors Application is overloaded… • Mentioned on CNBC • Linked from Slashdot • Denial of Service

  20. Expected Application Errors Application is overloaded… • Test under heavy load • Plan for growth • Tune hot spots • Run with excess capacity • Throttle at network level • Use JMS and other asynchronous technologies to throttle on backend • Tune application server to degrade gracefully • Monitor carefully • Be prepared to scale out, not just up
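The slide recommends throttling at the network level and with JMS on the backend. As an application-level stand-in for the same idea, the sketch below caps concurrent work with a `java.util.concurrent.Semaphore` and returns a fallback when the cap is reached; the `Throttle` class and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs work if a slot is free; otherwise returns the fallback immediately. */
    <T> T call(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback; // shed load instead of queueing without bound
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always free the slot, even if work throws
        }
    }
}
```

Returning a quick "busy" response is one way to degrade gracefully: requests that cannot be served fail fast rather than piling up and exhausting threads or connections.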

  21. Expected Application Errors Bugs and other undocumented features… • Friendly bug: • Triggers invalid state • Causes VM or app server to throw exception • Greedy bug: • Monopolizes resources • Leaks connections • Silent and deadly bug: • Corrupts data

  22. Expected Application Errors Bugs and other undocumented features… • To handle friendly bugs: • Bulletproof your transactions & rollbacks • Write coding and design guidelines • Conduct peer code reviews (share best practices) • For client applications, catch Throwable • Map exception handling in server container • The finally clause is your friend • Display sanitized errors to user • Give enough information to map back to logs • Log carefully to allow easy debugging • Configure timestamp, thread id output • Log data together not individually • Alert operator/administrator
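Several of these points (the finally clause, sanitized errors, and giving enough information to map back to the logs) fit in one small handler. The sketch below uses `java.util.logging`; `SafeHandler` and its `Resource` interface are hypothetical stand-ins for a request handler and something like a pooled connection.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

final class SafeHandler {
    private static final Logger LOG = Logger.getLogger(SafeHandler.class.getName());

    /** Hypothetical resource that must always be released. */
    interface Resource extends AutoCloseable {
        String query() throws Exception;

        @Override
        void close();
    }

    /** Returns the result, or a sanitized message with an ID that maps to the log. */
    static String handle(Resource resource) {
        try {
            return resource.query();
        } catch (Exception e) {
            String errorId = UUID.randomUUID().toString();
            // Full detail goes to the log in a single call; the user sees only the ID.
            LOG.log(Level.SEVERE, "errorId=" + errorId + " request failed", e);
            return "Sorry, something went wrong. Reference: " + errorId;
        } finally {
            resource.close(); // runs on success and on failure
        }
    }
}
```

The reference ID lets support staff find the matching log entry without exposing stack traces or internal host names to the user.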

  23. Expected Application Errors Bugs and other undocumented features… • To handle greedy bugs: • Reduce transaction timeout seconds • Handle timeouts in the same way as friendly bugs • Monitor carefully • Log statistics (# of transaction timeouts, CPU usage, memory usage, GC, network traffic, stuck threads…) • Automate log analysis • Trigger a thread dump (kill -3) during hot spots • Alert operator/administrator to hot spots • Use clustering to contain damage

  24. Expected Application Errors Bugs and other undocumented features… • To handle silent and deadly bugs: • Bulletproof transaction settings • Validate on multiple levels, use referential integrity • Audit everything • Unless performance/cost prohibits, keep a complete audit trail on every table (easy with triggers, aspects or code generators), try to include transaction ID • Flush caches regularly • After a save, load the record from the database and display back to the user • Run periodic audits with human review • Plan for how to use audit trail to recover from data corruption • Early detection is key… escalate user concerns!

  25. Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

  26. System Failure Never have an “unplanned” outage • Determine acceptable downtime • Plan clustering / failover accordingly • Monitor carefully so outages are detected immediately • Be ready with a tiny “planned outage” page and server in advance • Consider offsite host • Build this functionality into non-Web clients at development time • Plan for transaction recovery • Plan for JMS recovery • Use “quiescing” load balancing to bring servers offline for maintenance

  27. System Failure Sabotage • Encrypt data in database • Security through obscurity • Key entry on startup • Credit cards should be two-way encrypted (resist the urge to Rot13) • Passwords should be one-way hashed • Create new temporary passwords for “forgotten pass” • SQL Injection Prevention • Don’t dynamically generate SQL with user input • Use prepared statements • Cross-site scripting • Cleanse any user data republished on a site • Don’t publish extra information • Turn off server headers, require SSL on login or throughout • Create a DMZ • Two firewalls • Use SSL between tiers
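For the "passwords should be one-way hashed" point, one option available in the standard Java library is PBKDF2 with a per-user salt. The presentation does not name an algorithm, so the choice of PBKDF2, the iteration count, and the `PasswordHasher` class below are all assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

final class PasswordHasher {
    private static final int ITERATIONS = 100_000; // illustrative value, tune per hardware
    private static final int KEY_BITS = 256;

    /** Generates a random salt to store alongside the hash. */
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    /** Derives a one-way hash; the password cannot be recovered from the result. */
    static byte[] hash(char[] password, byte[] salt) throws GeneralSecurityException {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } finally {
            spec.clearPassword(); // wipe the spec's internal copy of the password
        }
    }

    /** Re-hashes the attempt and compares it with the stored hash. */
    static boolean verify(char[] attempt, byte[] salt, byte[] expected)
            throws GeneralSecurityException {
        return MessageDigest.isEqual(hash(attempt, salt), expected);
    }
}
```

Because the hash is one-way, a "forgot password" flow cannot send the old password back, which is why the slide suggests issuing a new temporary password instead.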

  28. Agenda Goals of Fault Tolerance User Recoverable Errors Expected Application Errors System Failure Useful Strategies Discussion

  29. Useful Strategies Be sure that you develop guidelines for: • Error Messages • Validation (format, business rules, size, cleansing…) • Logging (when, where, what…) • Auditing • Monitoring (level of automation, alerts) • Transactions (who rolls back, checked vs. unchecked…) • Sessions & Caching (request vs. session, flushing…) • Clustering

  30. Useful Strategies Error Messages • For validation errors, be sure to: • Include format and size hints • Show examples • Give more information than the basic field label • Mention the error at the top of the screen and highlight the field • Catch all errors at the same time • For other user-recoverable errors • Let the user know what to do next • If the user can’t recover • Apologize • Give no details • Suggest workarounds • (Silently log and alert!)
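"Catch all errors at the same time" and "include format and size hints" can be sketched as a validator that returns every problem in one pass rather than stopping at the first. The form fields, the length limit and the `SignupValidator` name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SignupValidator {
    static final int MAX_NAME_LENGTH = 40; // hypothetical database column size

    /** Returns field name -> message for every problem found; empty means valid. */
    static Map<String, String> validate(String name, String quantity) {
        Map<String, String> errors = new LinkedHashMap<>();
        if (name == null || name.trim().isEmpty()) {
            errors.put("name", "Name is required, for example \"Pat Smith\".");
        } else if (name.length() > MAX_NAME_LENGTH) {
            errors.put("name", "Name must be at most " + MAX_NAME_LENGTH + " characters.");
        }
        // Checked even when the name failed, so the user sees all errors at once.
        if (quantity == null || !quantity.matches("\\d{1,4}")) {
            errors.put("quantity", "Quantity must be a whole number from 0 to 9999, for example 12.");
        }
        return errors;
    }
}
```

Keying the messages by field name lets the page list every error at the top of the screen and highlight each offending field.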

  31. Useful Strategies Validation • If possible, validate at all levels • Common strategies: • Externalize validation rules and use a framework that supports rich validation • Clearly define which layers are responsible for which types of validation. For example: • All format errors handled in web tier • All business rule violations handled in application tier • All field lengths enforced at data tier

  32. Useful Strategies Logging • Log in all tiers • Define logging levels and when they are used • Log user failures at different levels than system failures • Include: timestamp, user, thread ID, transaction ID, etc. • Don’t make logs a source of failure (watch disk space, JMS load, etc.) • Log information in a single call • Aggregate server logs • Socket appender • Scripts and mounting • Bad: log.trace("Searching: " + keyword); log.trace("Found: " + results.size()); • Good: log.trace("Searching: " + keyword + " Found: " + results.size());

  33. Useful Strategies Auditing • Audit operations where possible • Provides accountability • Easier to support users • Easier to debug • Easier to recover from disaster • Easier to detect attacks • Include: • Timestamp • Current User • Some sort of thread ID, transaction ID, etc. • Complete data record or diff

  34. Useful Strategies Monitoring • Common strategies include: • 24/7 operations center • Business hours operation center • Automated, redundant processes that analyze logs and raise alerts to on-call administrators • SNMP and monitors • Logs show more than critical errors • Ideally, mine them for clues on usability, performance problems and attacks • JMX clients

  35. Useful Strategies Monitoring - Tools • Free • Nagios (Host, Network, Service monitoring) • Groundwork Monitor • MC4J • EJTools • Cost • AdventNet • OpenView

  36. Useful Strategies Transactions • Top server-side tier creates a user transaction, catches all errors and then determines its fate • Container-managed transactions with session façade: • Top level methods responsible for rollbacks • Business methods responsible for rollbacks • Unchecked exceptions not recommended with EJB • Unchecked exceptions with Spring

  37. Useful Strategies Sessions and Caching • Use session sparingly • Common strategies: • Hidden form fields • Cookies (encrypted) • URL rewriting • HTTP Session • Shared caches (OSCache, Tangosol) • When to flush cache? • Caches can mask data problems • Data should have timeouts • Shared caches should limit usage (LRU)
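The last two points (data should have timeouts, caches should limit usage with LRU) can be combined in a small cache built on `java.util.LinkedHashMap` in access order. This is a minimal single-JVM sketch, not a replacement for a shared cache product; `ExpiringLruCache` and the injected clock are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class ExpiringLruCache<K, V> {
    /** A cached value together with the time at which it stops being valid. */
    private static final class Slot<V> {
        final V value;
        final long expiresAtMillis;

        Slot(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private final Map<K, Slot<V>> map;

    ExpiringLruCache(final int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder=true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(K key, V value) {
        map.put(key, new Slot<>(value, clock.getAsLong() + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has timed out. */
    synchronized V get(K key) {
        Slot<V> slot = map.get(key);
        if (slot == null) {
            return null;
        }
        if (clock.getAsLong() >= slot.expiresAtMillis) {
            map.remove(key); // expired: force a fresh load from the source
            return null;
        }
        return slot.value;
    }
}
```

The timeout bounds how long stale or corrupted data can be masked by the cache, and the size limit keeps memory use predictable.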

  38. Useful Strategies Clustering • Why use clusters? • Availability • Scalability • Will this application need a cluster? • Can you take it offline for maintenance? • Can you take it offline to scale it up? • Are you sure you won’t need to scale it out? • Can be expensive and complicated • Can require more expensive licensing • Requires serializable data in session • Limit the use of session and re-put objects on edit • Requires more testing (test fail over conditions)

  39. Useful Strategies Clustering • JBoss & Tomcat have limited cluster sizes • Multicast can require network and operating system changes • Multiple JVMs and log files to monitor • Configuration management issues • Synchronizing updates • Custom settings per instance

  40. Discussion Get the slides online at: http://www.chariotsolutions.com/slides

  41. Building Fault-Tolerant Enterprise Applications Greg Hinkle Chariot Solutions chariotsolutions.com
