
Sorry...


avital

Presentation Transcript


  1. Sorry... ... but this slide will be the only slide in Lithuanian

  2. When things break: it’s not a big deal! Resilience and Remediation

  3. Agenda • About me • Intro to resilience • Credits to authors • Let it crash! • Distributed resilience • Remediation patterns • QA

  4. About me • BS & MS in Software Engineering, KTU • Developer for 10+ years… • …and team lead, tech lead, consultant, agile pioneer, certified Scrum master, architect • Backend and distributed systems developer for the last 4+ years

  5. hard.core team in BI department • Data warehouse and business intelligence • Tens of TB of information • 500 million transactions per day • Various availability requirements • Always online, always consistent, always up to date

  6. Reliability • a.k.a. availability • Our most critical component has two nines (0.99) • Gmail had 0.99984 in 2010
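
To make those availability figures concrete, here is a minimal sketch of the downtime arithmetic. It assumes a 365-day year, and the function name `downtime_hours_per_year` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


for label, availability in [("two nines", 0.99), ("Gmail 2010", 0.99984)]:
    hours = downtime_hours_per_year(availability)
    print(f"{label}: {hours:.1f} hours/year ({hours * 60:.0f} minutes)")
```

Under these assumptions, two nines allow roughly 87.6 hours (about 3.65 days) of downtime per year, while 0.99984 allows about 84 minutes.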

  7. QCon London 2011 • My goal • Bring new ideas • Leave comfort zone • Evolve • Visit it! • InfoQ.com • Credits to authors

  8. Things break, live with it

  9. Let it crash! Why should we?

  10. Let it crash! Erlang is a programming language used to build massively scalable real-time systems with requirements on high availability

  11. Let it Crash! Idea behind • The world is concurrent • Things in the world don’t share data • Things communicate with messages • Things fail

  12. Let it crash! Erlang • Runtime system with functional language • Simple threading • Message passing (no locks!) • CouchDB, Riak, Facebook chat (ejabberd), RabbitMQ, GitHub • We don’t use it :)

  13. Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions

      // Step 1: fully defensive. Validates every input and handles every exception.
      public void DoStuff(string message, Action<string> action)
      {
          if (String.IsNullOrEmpty(message))
              throw new ArgumentNullException("message", "The message is null or empty.");
          if (action == null)
              throw new ArgumentNullException("action", "The action is null.");
          try
          {
              action(message);
          }
          catch (BoringMessageException)
          {
              return;
          }
          catch (OffensiveMessageException)
          {
              this.ReportOffender(SystemState.CurrentUser);
              return;
          }
          catch (ActionMonkeyIsBusyException exc)
          {
              throw new RetryLaterException(exc, DateTime.Now.AddYears(1), action, message);
          }
          catch
          {
              throw;
          }
      }

      // Step 2: input checks only. Exceptions from the action propagate to the caller.
      public void DoStuff(string message, Action<string> action)
      {
          if (String.IsNullOrEmpty(message))
              throw new ArgumentNullException("message", "The message is null or empty.");
          if (action == null)
              throw new ArgumentNullException("action", "The action is null.");
          action(message);
      }

      // Step 3: just do the work and let it crash.
      public void DoStuff(string message, Action<string> action)
      {
          action(message);
      }

  14. Let it Crash! Defensive programming • How to do it - write code to… • solve a problem • check all possible inputs • handle all possible errors/exceptions • Result • More code, more bugs • Obscure code, unclear logic • Error handling is poorly tested • It is very hard to defend against everything

  15. Let it Crash! Erlang way of managing applications

  16. Let it Crash! According to Joe Armstrong • Do not code defensively • If you can’t do what you want, die! • Nobody should stop you from crashing • Let another process do the recovery • But be reasonable…
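
The "die and let another process recover" idea can be sketched in a few lines. This is only an in-process illustration: Erlang supervisors run as separate processes, whereas here `supervise` and `FlakyWorker` are hypothetical names sharing one Python process.

```python
from typing import Callable


def supervise(task: Callable[[], str], max_restarts: int = 3) -> str:
    """Run task; if it crashes, restart it, giving up after max_restarts restarts."""
    crashes = 0
    while True:
        try:
            return task()
        except Exception as exc:
            crashes += 1
            if crashes > max_restarts:
                # Be reasonable: escalate instead of restarting forever.
                raise RuntimeError("giving up after repeated crashes") from exc


class FlakyWorker:
    """Hypothetical worker with no defensive code: it simply dies on trouble."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining = failures_before_success

    def __call__(self) -> str:
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("cannot do what I want, dying")
        return "done"


print(supervise(FlakyWorker(failures_before_success=2)))
```

The worker stays free of recovery logic; the restart policy lives in one place and has an upper bound, so a permanently broken worker is escalated rather than retried forever.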

  17. Story time! Legacycache

  18. Let it crash! Couple of hints

  19. Let it Crash! My takeaways • “Let it crash” is not • a design crutch • an excuse to lose vital data • Principle of Least Surprise • Handle what you can • Let someone else do the rest

  20. Distributed Systems, Databases and Resilience … or just Distributed Resilience

  21. Distributed Resilience Plan your failures • You can usually prevent full system crash • But how will it behave on partial failure? • Plan and understand… • …before the users tell you • You think you know what will break… • … you’re probably wrong

  22. Distributed Resilience Failure is not bad • The best way to avoid failures is to fail constantly! • Netflix “Chaos Monkey” • Roams the infrastructure • Kills random processes • Monitors how the system recovers
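
A toy simulation of the kill-and-observe loop, assuming an in-memory model of a cluster. It is not Netflix's tool; `Instance`, `Cluster`, and `unleash_monkey` are invented names for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    alive: bool = True


class Cluster:
    """Hypothetical in-memory stand-in for a set of redundant instances."""

    def __init__(self, names: List[str]) -> None:
        self.instances = [Instance(name) for name in names]

    def healthy(self) -> List[Instance]:
        return [inst for inst in self.instances if inst.alive]

    def can_serve(self) -> bool:
        # The system survives as long as at least one instance is alive.
        return len(self.healthy()) > 0

    def recover(self) -> int:
        """Restart every dead instance and report how many were restarted."""
        restarted = 0
        for inst in self.instances:
            if not inst.alive:
                inst.alive = True
                restarted += 1
        return restarted


def unleash_monkey(cluster: Cluster, rng: random.Random) -> Instance:
    """Kill one randomly chosen healthy instance and return the victim."""
    candidates = cluster.healthy()
    if not candidates:
        raise RuntimeError("nothing left to kill")
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


cluster = Cluster(["node-1", "node-2", "node-3"])
victim = unleash_monkey(cluster, random.Random())
print(f"killed {victim.name}; still serving: {cluster.can_serve()}")
print(f"restarted {cluster.recover()} instance(s)")
```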

  23. Distributed Resilience Harvest and yield • Harvest is the share of your data reflected in a response • Yield is the probability that a request completes
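
A small sketch of the two metrics as ratios, following the usual definitions from Fox and Brewer's work on harvest and yield (yield as completed over offered requests, harvest as available over complete data). The function names and the four-shard scenario are assumptions made for illustration.

```python
def yield_fraction(completed_requests: int, offered_requests: int) -> float:
    """Share of offered requests that were answered at all."""
    if offered_requests <= 0:
        raise ValueError("offered_requests must be positive")
    return completed_requests / offered_requests


def harvest_fraction(available_shards: int, total_shards: int) -> float:
    """Share of the complete data set reflected in an answer."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return available_shards / total_shards


# Hypothetical scenario: one of four shards is down, 1000 requests arrive.
# Option A: answer everything from the three surviving shards.
print(yield_fraction(1000, 1000), harvest_fraction(3, 4))
# Option B: refuse the quarter of requests that need the dead shard.
print(yield_fraction(750, 1000), harvest_fraction(4, 4))
```

The same partial failure can be spent either on harvest (complete availability, incomplete answers) or on yield (complete answers, fewer of them); which one to sacrifice is a per-feature decision.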

  24. Distributed Resilience It’s just a flesh wound! Problems • Finding single points of failure • Security • Configuration • Administration, Shared, Central • Removing • Hard to avoid, even harder to remove • Minimization is the target

  25. Distributed Resilience Share nothing - sharding • Pros • Index sizes • Security (OotB)? • Spreads HW risk • Cons • More HW • Harder than it looks • Auto is even harder • Shared info • Complex releases (Diagram labels: Proxy, Config)
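
A minimal sketch of the routing step a sharding proxy might perform, assuming simple hash-modulo placement. The function name `shard_for` and the key format are hypothetical; real systems often use consistent hashing or lookup tables instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's hash())."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % shard_count


for user in ["user-1", "user-2", "user-3"]:
    print(user, "->", "shard", shard_for(user, 4))
```

With modulo placement, changing `shard_count` remaps most keys, which is one reason automatic resharding is harder than it looks.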

  26. Distributed resilience Load balancing • Pros • Rather easy • Lots of products • Cons • Sessions are a killer • Configuration (Diagram: four Service nodes)
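
A sketch of round-robin balancing, and of why sessions hurt: once a session is pinned to a backend, that request can no longer go to an arbitrary node. The class name `RoundRobinBalancer` and the backend names are made up for illustration.

```python
import itertools
from typing import Dict, List, Optional


class RoundRobinBalancer:
    """Rotates over backends; requests with a session id stick to one backend."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))
        self._sticky: Dict[str, str] = {}

    def route(self, session_id: Optional[str] = None) -> str:
        if session_id is None:
            # Stateless request: any backend will do.
            return next(self._cycle)
        if session_id not in self._sticky:
            # First request of a session: pin it to the next backend.
            self._sticky[session_id] = next(self._cycle)
        return self._sticky[session_id]


balancer = RoundRobinBalancer(["service-1", "service-2", "service-3"])
print([balancer.route() for _ in range(4)])
print(balancer.route("session-42"), balancer.route("session-42"))
```

If the pinned backend dies, the session state dies with it unless it is stored elsewhere, which is the usual argument for keeping services stateless.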

  27. Distributed Resilience Mirrors & Replication • Pros – Various! • Lots of products • Very secure? • Very fast? • Cons – Various! • Hardware • Slow? • Inconsistent?

  28. Distributed resilience NoSQL • Not (always) ACID • Atomicity • Consistency • Isolation • Durability • BASE • Basically Available • Soft state • Eventual consistency • CAP theorem • a.k.a. Brewer's theorem • Consistency • Availability • Partition tolerance • Products shown: CouchDB, HBase, Memcached, Cassandra, Redis, MarkLogic, BigTable, Riak, MongoDB, SimpleDB

  29. Next slide :) Sample time!

  30. Samples from Adform • Sensitive statistics • Real time bidding (Diagram labels: Google, User sees, Node1, Node2, Bid Service1, Bid Service2)

  31. Remediation patterns For better releases

  32. Remediation patterns Vocabulary • Remediation • Recovery to a known state after a failed release • Recovery • Returning the system to a working state • “Fixing sh*t when it breaks” • It’s all about… • Prevention • Patterns for low-risk releases • Patterns for incremental delivery Yea!

  33. Remediation patterns Background • A release is a risky operation • The best way to fail at a release is to do it only once • Don’t touch it if it works! • Agile • Time to market • Continuous deployment • 20 releases in 2 weeks @ Adform

  34. Remediation patterns Prevention

  35. Remediation patterns Prevention

  36. Remediation patterns Problems • The hard bits: • Testing on the production environment • Creating maintainable acceptance tests • Testing cross-functional requirements

  37. Remediation patterns Reducing risk • Canary releasing • Partial release • Observe effects • Release to the rest
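
One way to implement the "partial release" step is to bucket users by a stable hash and send a fixed percentage to the new version. A minimal sketch under that assumption; `in_canary`, `pick_version`, and the version labels are hypothetical names.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """True if this user falls into the first `percent` of 100 stable buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def pick_version(user_id: str, percent: int) -> str:
    """Route canary users to the new release, everyone else to the stable one."""
    return "new" if in_canary(user_id, percent) else "stable"


for user in ["alice", "bob", "carol"]:
    print(user, pick_version(user, 10))
```

Because the bucket depends only on the user id, a user who saw the new version at 10% keeps seeing it as the percentage is raised, and setting the percentage back to 0 is the rollback.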

  38. Remediation patterns Reducing risk • Dark launch • Release to invisible infrastructure • Direct some of the real load to the dark side
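
A sketch of mirroring a share of real requests to a dark deployment while users only ever see the live response. The function name `serve` and its parameters are assumptions; a real setup would usually mirror asynchronously and record the dark side's results for comparison.

```python
import random
from typing import Callable


def serve(
    request: str,
    live: Callable[[str], str],
    dark: Callable[[str], str],
    mirror_fraction: float,
    rng: random.Random,
) -> str:
    """Answer from the live handler; replay a sample of requests on the dark one."""
    response = live(request)
    if rng.random() < mirror_fraction:
        try:
            dark(request)  # result is discarded; only its behaviour is observed
        except Exception:
            # A failure on the dark side must never reach the user.
            pass
    return response


def live_handler(request: str) -> str:
    return request.upper()


def dark_handler(request: str) -> str:
    raise RuntimeError("new code path is still broken")


print(serve("ping", live_handler, dark_handler, 0.5, random.Random()))
```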

  39. Remediation patterns Reducing risk • Feature toggles • Develop on trunk, or else… • Feature toggle / branch by abstraction
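
A minimal feature-toggle sketch: unfinished code ships on trunk but stays switched off until the toggle is enabled. The class `FeatureToggles`, the toggle name `new-report-layout`, and `render_report` are invented for illustration.

```python
from typing import Iterable, List


class FeatureToggles:
    """In-memory toggle store; a real one would be driven by configuration."""

    def __init__(self, enabled: Iterable[str] = ()) -> None:
        self._enabled = set(enabled)

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled


def render_report(rows: List[str], toggles: FeatureToggles) -> str:
    if toggles.is_enabled("new-report-layout"):
        return " | ".join(rows)  # new code path, released dark
    return ", ".join(rows)  # existing behaviour


toggles = FeatureToggles()
print(render_report(["a", "b"], toggles))
toggles.enable("new-report-layout")
print(render_report(["a", "b"], toggles))
```

Turning the toggle off again restores the old behaviour without a redeploy, which makes it a remediation tool as well as a development aid.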

  40. Monitoring Separate slide, cause it is so damn important! • Monitoring is essential • Remote watchdogs • Watchdogs for watchdogs • Business metrics are essential • Root cause analysis • The game: why? why? why? • Root cause graph
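
A heartbeat-based sketch of a watchdog and of a second watchdog that watches the first. Timestamps are passed in explicitly to keep the example deterministic; the class `Watchdog`, the source names, and the timeouts are assumptions.

```python
from typing import Dict, List


class Watchdog:
    """Flags every source whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, source: str, now: float) -> None:
        self._last_seen[source] = now

    def silent_sources(self, now: float) -> List[str]:
        return sorted(
            source
            for source, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


# The primary watchdog monitors services; the meta watchdog monitors the primary.
primary = Watchdog(timeout_seconds=30)
meta = Watchdog(timeout_seconds=60)

primary.heartbeat("bid-service", now=0)
primary.heartbeat("stats-service", now=20)
meta.heartbeat("primary-watchdog", now=20)

print(primary.silent_sources(now=45))  # bid-service has been silent for 45 s
print(meta.silent_sources(now=100))    # the primary itself has gone quiet
```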

  41. Next slide :) Sample time!

  42. No tests in front • Testing identified 2 fires • Ownership problems • Delayed checklist • No review on checklists • Different knowledge levels • Only one person writing checklist • Developers don’t use it

  43. Final word • Troll appears • Things break • Learn to bend (one way or another)

  44. Almost done Shameless ads

  45. We are recruiting! • 100+ in Lithuania • 60+ in development • Architects • Analytics • Programmers • QA • http://www.adform.com/site/company/careers/
