A Recovery-Friendly, Self-Managing Session State Store

Presentation Transcript


  1. A Recovery-Friendly, Self-Managing Session State Store • Benjamin Ling, Emre Kiciman, Armando Fox • {bling,emrek,fox}@cs.stanford.edu

  2. Outline • Motivation: What is Session State? • SSM: • Architecture • Algorithm • Backpressure and Admission Control • SSM + Pinpoint • Self-recovering, self-monitoring • Benchmarks • Next steps: Sun Reference AppServer integration • Conclusion

  3. Proliferation of J2EE and Web Services • J2EE embraced as industry standard • Framework • Simplifies development • Allows for portability of services • Standardized interfaces • However, difficulties remain…

  4. The Pain – Administration and Maintenance • Administration is difficult and costly • $$ -- Database admins cost ~$200K/yr a head • Development efficiency negatively impacted • Failure/Recovery is costly • Recovery slow, especially site outages • Data loss on crashes • Users adversely affected

  5. Not All State is Created Equal • Various types of state in J2EE… • User profile state • Persistent shared state • Transaction history state • But usually stored in the same place • Stored in DB or FS • Focus on particular class • Exploit its properties • Simplify Administration and Maintenance

  6. Example of Session State

  7. Properties of Session State • Subcategory of session state • Single-user, serial access, semi-persistent data • Examples: Temporary application data, application workflow • Example of usage (e.g. J2EE): [Diagram: numbered steps 1 to 6 between Browser and App Server]

  8. Goal • Build a session state store that is: • Failure-friendly • Does not lose data on crash • Degrades gracefully • Recovery-friendly • Recovers fast • Self-Managing

  9. Outline • Motivation: What is Session State? • SSM: • Architecture • Algorithm • Backpressure and Admission Control • SSM + Pinpoint • Self-recovering, self-monitoring • Benchmarks • Next steps: Sun Reference AppServer integration • Conclusion

  10. Session State Manager (SSM) • Redundant, in-memory hash table distributed across nodes • Algorithm: Redundancy similar to quorums • Write to many random nodes, wait for few (avoid performance coupling) • Read one [Diagram: two AppServers, each with a STUB, and Bricks 1 to 5; bricks labeled RAM, Network Interface]

  11. Write example: “Write to Many, Wait for Few” • Try to write to W random bricks, W = 4 • Must wait for WQ bricks to reply, WQ = 2 [Diagram: Browser, AppServer with STUB, Bricks 1 to 5]

  15. Write example: “Write to Many, Wait for Few” • Try to write to W random bricks, W = 4 • Must wait for WQ bricks to reply, WQ = 2 • Crashed? Slow? • Cookie holds metadata [Diagram: Browser, AppServer with STUB, Bricks 1 to 5, item labeled 14]
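
The write path on slides 11 to 15 can be sketched as follows. This is a minimal simulation, not the SSM implementation: the `Brick` class, the `write` function, the thread pool, and the timeout value are all assumptions made for illustration. It sends the write to W random bricks in parallel, returns as soon as WQ of them acknowledge, and hands back the acknowledging brick ids, which is the kind of metadata the slide says the cookie holds.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


class Brick:
    """Hypothetical in-memory brick: a dict plus an 'alive' flag."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}
        self.alive = True

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"brick {self.brick_id} is down")
        self.table[key] = value
        return self.brick_id


def write(bricks, key, value, w=4, wq=2, timeout=1.0):
    """Write to w random bricks, return once wq of them have acknowledged."""
    targets = random.sample(bricks, w)
    acked = []
    pool = ThreadPoolExecutor(max_workers=w)
    futures = [pool.submit(b.put, key, value) for b in targets]
    try:
        for fut in as_completed(futures, timeout=timeout):
            try:
                acked.append(fut.result())
            except ConnectionError:
                continue  # crashed brick: ignore it and keep waiting
            if len(acked) >= wq:
                break  # do not wait for slow or crashed bricks
    except FuturesTimeout:
        pass  # fall through to the quota check below
    finally:
        pool.shutdown(wait=False)
    if len(acked) < wq:
        raise RuntimeError("write failed: fewer than WQ bricks replied")
    return acked  # brick ids to store in the cookie
```

With five bricks and one of them down, at least three of the four chosen targets are alive, so a write with WQ = 2 still succeeds without the stub ever waiting on the failed brick.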

  16. Read example: • Try to read from Bricks 1, 4 [Diagram: Browser, AppServer with STUB, Bricks 1 to 5, item labeled 14]

  17. Read example: [Diagram: Browser, AppServer with STUB, Bricks 1 to 5, item labeled 14]

  18. Read example: • Brick 1 crashes [Diagram: Browser, AppServer with STUB, Bricks 1 to 5]

  19. Read example: [Diagram: Browser, AppServer with STUB, Bricks 2 to 5]
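
The read path on slides 16 to 19 can be sketched in the same spirit. The `Brick` class and the `read` function below are hypothetical names, and the sketch assumes the cookie carries the ids of the bricks that acknowledged the write. The stub tries those bricks in turn and returns the first copy it finds, so the crash of one of them does not make the state unavailable.

```python
class Brick:
    """Hypothetical in-memory brick: a dict plus an 'alive' flag."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.table = {}
        self.alive = True

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"brick {self.brick_id} is down")
        return self.table[key]  # KeyError if the brick restarted empty


def read(bricks_by_id, cookie_brick_ids, key):
    """Return the value from the first brick in the cookie that still has it."""
    for brick_id in cookie_brick_ids:
        try:
            return bricks_by_id[brick_id].get(key)
        except (ConnectionError, KeyError):
            continue  # brick crashed or lost its memory: try the next one
    raise LookupError(f"no brick listed in the cookie holds {key!r}")
```

For example, if the state was written to Bricks 1 and 4 and Brick 1 then crashes, `read(bricks_by_id, [1, 4], key)` falls through to Brick 4.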

  20. SSM: Failure and Recovery • Failure of single node • No data loss, WQ-1 remain • State is available for R/W during failure • Recovery • Restart – No recovery • No special case recovery code • State is available for R/W during brick restart • Session state is self-recovering • User’s access pattern causes data to be rewritten

  21. Backpressure and Admission Control • Heavy flow to Brick 3 • Drop Requests [Diagram: two AppServers, each with a STUB, and Bricks 1 to 5]

  22. Backpressure and Admission Control • Drop Requests • Reduce Sending • Reject requests [Diagram: two AppServers, each with a STUB, and Bricks 1 to 5]

  23. Outline • Motivation: What is Session State? • SSM: • Architecture • Algorithm • Backpressure and Admission Control • SSM + Pinpoint • Self-recovering, self-monitoring • Benchmarks • Next steps: Sun Reference AppServer integration • Conclusion

  24. Recovery Philosophy [Figure: RECOVERY COST (Cheap to Expensive) against DETECTION ACCURACY (Lax, Accurate, Aggressive); regions labeled Downtime, Undetected Errors, Hard, Ideal]

  25. Failure detection and Recovery • Timeline: Failure, Detection, Recovery, Recovered • SSM: Failure masked • Instant recovery

  26. False Positives • Normal Operation • False positive triggered • Instant recovery

  27. Statistical Monitoring • Pinpoint • Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites [Diagram: Pinpoint and Bricks 1 to 5]

  28. Statistical Monitoring • Pinpoint • Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites • REBOOT [Diagram: Pinpoint and Bricks 1 to 5]

  29. Statistical Monitoring • Pinpoint • Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites [Diagram: Pinpoint and Bricks 1 to 5]

  30. SSM Monitoring • N replicated bricks handle read/write requests • Cannot do structural anomaly detection! • Alternative features (performance, mem usage, etc) • Activity statistics: How often did a brick do something? • Msgs received/sec, dropped/sec, etc. • Same across all peers, assuming balanced workload • Use anomalies as likely failures • State statistics: Current state of system • Memory usage, queue length, etc. • Similar pattern across peers, but may not be in phase • Look for patterns in time-series; differences in patterns indicate failure at a node.
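
The peer-comparison idea for activity statistics can be illustrated with a small sketch. The slide does not say which test is used, so the median and median-absolute-deviation rule below is a swapped-in, commonly used outlier check rather than Pinpoint's method; the function name `anomalous_bricks` and the threshold are assumptions. It relies on the slide's premise that a balanced workload gives every brick roughly the same rate.

```python
from statistics import median


def anomalous_bricks(stat_by_brick, threshold=3.0):
    """Flag bricks whose activity statistic deviates strongly from its peers.

    stat_by_brick maps a brick name to one activity statistic, for example
    messages received per second. The deviation is measured in units of the
    median absolute deviation (MAD) across all bricks.
    """
    values = list(stat_by_brick.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    flagged = []
    for brick, value in stat_by_brick.items():
        deviation = abs(value - med)
        if mad == 0:
            is_anomalous = deviation > 0  # all peers agree exactly
        else:
            is_anomalous = deviation / mad > threshold
        if is_anomalous:
            flagged.append(brick)
    return flagged
```

A brick flagged this way is only a likely failure; because restarting a brick is cheap in SSM, acting on an occasional false positive costs little.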

  31. Surprising Patterns in Time-Series 1. Discretize time-series into string. [Keogh] [0.2, 0.3, 0.4, 0.6, 0.8, 0.2] -> “aaabba” 2. Calculate the frequencies of short substrings in the string. “aa” occurs twice; “ab”, “bb”, “ba” occurs once. 3. Compare frequencies to normal, look for substrings that occur much less or much more than normal.
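
The three steps on this slide can be sketched as below. The function names, the single fixed breakpoint at 0.5, the two-letter alphabet, the smoothing constant, and the ratio threshold are all assumptions chosen so that the slide's example comes out as shown; the cited discretization technique picks its breakpoints differently.

```python
from collections import Counter


def discretize(series, breakpoints=(0.5,), alphabet="ab"):
    """Step 1: map each value to a letter according to the breakpoints it passes."""
    letters = []
    for x in series:
        index = sum(1 for b in breakpoints if x >= b)
        letters.append(alphabet[index])
    return "".join(letters)


def substring_counts(text, n=2):
    """Step 2: count every (overlapping) substring of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def surprising(observed, normal, ratio=3.0, smoothing=1.0):
    """Step 3: flag substrings much more or much less frequent than normal."""
    keys = set(observed) | set(normal)
    observed_total = sum(observed.values()) + smoothing * len(keys)
    normal_total = sum(normal.values()) + smoothing * len(keys)
    flagged = {}
    for key in keys:
        o = (observed[key] + smoothing) / observed_total
        n = (normal[key] + smoothing) / normal_total
        if o / n >= ratio or n / o >= ratio:
            flagged[key] = o / n
    return flagged


text = discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2])
counts = substring_counts(text)
print(text, dict(counts))
```

On the slide's series this produces the string "aaabba", with "aa" counted twice and "ab", "bb", "ba" once each.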

  32. Outline • Motivation: What is Session State? • SSM: • Architecture • Algorithm • Backpressure and Admission Control • SSM + Pinpoint • Self-recovering, self-monitoring • Benchmarks • Next steps: Sun Reference AppServer integration • Conclusion

  33. Microbenchmarks • UC Berkeley Millennium Cluster • Six bricks running • Candidate Write Set = 3, Write quota = 2 • Candidate Read Set = 2 • State Size = 8K

  34. Induced Fault • One brick killed • Brick restarted by PP • SSM unaffected

  35. Memory fault • Memory fault detected in hash • PP restarts Brick • SSM unaffected

  36. Network Fault – 70% packet loss • Network fault injected • Fault detected • Brick killed • PP restarts Brick

  37. Performance Fault • Performance fault injected

  38. Macrobenchmark • TellMe’s Email-By-Phone Application • Session state stored in memory • Email header information • Index information • Alter application to store session state using • Disk • SSM

  39. Macrobenchmark • Throughput preserved compared to disk • 25% Throughput Degradation compared to in-memory

  40. Future Work • Integrate with Sun’s reference Application Server • Enterprise benchmarks • Statistical Anomaly Detection • Too many magic numbers • Integrated ROC-J2EE application server

  41. Conclusion • SSM: A Recovery-Friendly, Self-Managing Session State Store • Benjamin Ling, bling@cs.stanford.edu • http://swig.stanford.edu/

  42. Existing solutions: • File System and Databases • Poor failure behavior • Lose data (FS) • Slow recovery (both) • Difficult to administer (DB) • Difficult to tune (both) • In-memory replication using primary/secondary: • Performance coupling • Poor failover (uneven load balancing)

  43. Other implementation details • Garbage collection • Generational hash table • Hash table of hash tables • Each hash table has an associated time range • When time has passed, GC that table • No reference counting, scanning, etc.
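
The generational hash table can be sketched as a dictionary of dictionaries keyed by a time range. The class name, the choice to bucket entries by their expiry time, and the assumption that the caller remembers an entry's generation (for instance alongside the other cookie metadata) are illustrative guesses, not details from the slides.

```python
class GenerationalTable:
    """Hash table of hash tables; each inner table covers one time range."""

    def __init__(self, generation_seconds=60):
        self.span = generation_seconds
        self.generations = {}  # generation index -> {key: value}

    def put(self, key, value, expires_at):
        """Store the entry in the generation that contains its expiry time."""
        gen = int(expires_at // self.span)
        self.generations.setdefault(gen, {})[key] = value
        return gen  # the caller keeps this to find the entry again

    def get(self, key, gen):
        return self.generations[gen][key]

    def collect(self, now):
        """Drop every generation whose whole time range lies in the past."""
        current = int(now // self.span)
        expired = [g for g in self.generations if g < current]
        for g in expired:
            del self.generations[g]  # one deletion frees the whole table
        return len(expired)
```

Because every entry in a generation expires before that generation's time range ends, garbage collection is a single table drop, with no reference counting and no scan over individual entries.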

  44. SSM: Self-Managing • Adaptive: • Stub maintains count of maximum allowable in-flight requests to each brick • Additive increase on successful request • Multiplicative decrease on timeout • Stubs discover capacity of each brick → Self-Tuning • Admission control • Stubs say “no” if insufficient bricks • Propagate backpressure from bricks to clients • Turn users away under overload → Self-Protecting
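
The adaptive window and the admission check can be sketched as follows. The class `BrickWindow`, the function `admit`, and the constants (initial limit, step of one, halving on timeout) are assumptions for illustration; the slides state only that the increase is additive and the decrease multiplicative.

```python
class BrickWindow:
    """Per-brick limit on in-flight requests, kept by a stub (hypothetical)."""

    def __init__(self, initial=8, minimum=1, maximum=1024, decrease=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.decrease = decrease
        self.in_flight = 0

    def can_send(self):
        return self.in_flight < self.limit

    def on_send(self):
        self.in_flight += 1

    def on_success(self):
        # Additive increase: probe for a little more capacity.
        self.in_flight -= 1
        self.limit = min(self.maximum, self.limit + 1)

    def on_timeout(self):
        # Multiplicative decrease: back off quickly from an overloaded brick.
        self.in_flight -= 1
        self.limit = max(self.minimum, int(self.limit * self.decrease))


def admit(windows, wq):
    """Admission control: accept a write only if at least wq bricks have room."""
    return sum(1 for w in windows if w.can_send()) >= wq
```

When `admit` returns false the stub says "no" to the application server, which is how backpressure from the bricks reaches clients.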