
Capacity Management

Capacity Management for Web Operations, by John Allspaw (Operations Engineering). Covers rules of thumb, planning/forecasting, and stupid capacity tricks, with some Flickr statistics sprinkled in.


Presentation Transcript


  1. Capacity Management for Web Operations • John Allspaw, Operations Engineering

  2. the book I’m writing

  3. ???

  4. Rules of Thumb • Planning/Forecasting • Stupid Capacity Tricks (with some Flickr statistics sprinkled in)

  5. Things that can cause downtime • bugs (disguised as capacity problems) • edge cases (disguised as capacity problems) • security incidents • real capacity problems* * (should be the last thing you need to worry about)

  6. Capacity != Performance • Forget about performance for right now • Measure what you have right NOW • Don’t count on it getting any better

  7. Thank You HPC Industry! • Automated Stuff • Scalable Metric Collection/Display (a lot of great deployment and management tricks come from HPC and were adopted by web ops)

  8. I Good Measurement Tools • record and store • metrics in/out • custom metrics • easily compare • lightweight-ish

  9. Clouds need planning too • Makes deployment and procurement easy and quick • But clouds are still resources with costs and limits, just like your own stuff • Black-boxes: you may need to pay even more attention than before

  10. Metrics • System Statistics

  11. Metrics • “Application” Level: photos processed per minute, average processing time per photo, apache requests, concurrent busy apache procs

  12. Metrics • App-level meets system-level: here, total CPU (%) = ~1.12 * number of busy apache procs (YMMV)

  13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)

  14. Ceilings: the greatest amount of “work” your resources will allow before degradation or failure

  15. Forget Benchmarking

  16. The End Find your ceilings what you have left

  17. Use real live production data to find ceilings Production: “it’s like a lab, but bigger!”

  18. Like: database ceilings • replication lag: bad!

  19. Ceilings • sustained disk I/O wait of >40% creates slave lag* (waiting on disk too much) • *for us, YMMV

  20. 35,000 photo requests per second on a Tuesday peak

  21. Safety Factors

  22. Safety Factors Ceiling * Factor of Safety = UR LIMITZ

  23. Safety Factors webserver!

  24. Safety Factors • “safe” ceiling @ 85% CPU • 85% total CPU = ~76 busy apache procs (chart annotation: “what you have left”)

  25. Safety Factors • Yahoo Front Page link to Chinese New Year photos (8% spike) (chart: photo requests/second)
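
The arithmetic connecting slides 12 and 24 can be sketched in a few lines. This is a minimal sketch, assuming the ~1.12 figure means percent of total CPU per busy apache proc; the function name is hypothetical, and the constants are the ones quoted in the slides (your own numbers will differ).

```python
# Percent of total CPU consumed per busy apache proc (slide 12; site-specific).
CPU_PERCENT_PER_BUSY_PROC = 1.12
# Safety-factored CPU ceiling in percent (slide 24).
SAFE_CPU_PERCENT = 85.0


def safe_busy_procs(safe_cpu_percent: float, cpu_percent_per_proc: float) -> float:
    """Translate a safe CPU ceiling into a per-server busy-proc limit."""
    if cpu_percent_per_proc <= 0:
        raise ValueError("cpu_percent_per_proc must be positive")
    return safe_cpu_percent / cpu_percent_per_proc


limit = safe_busy_procs(SAFE_CPU_PERCENT, CPU_PERCENT_PER_BUSY_PROC)
print(f"per-server limit: ~{limit:.0f} busy apache procs")
```

Expressing the ceiling in an application-level unit (busy procs) rather than raw CPU makes it easier to compare against the request metrics collected earlier.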

  26. Forecasting

  27. Forecasting Fictional Example: webservers

  28. Forecasting peak of the week Fictional example: 15 webservers. 1 week.

  29. Forecasting ...bigger sample, 6 weeks....isolate the peaks...

  30. Forecasting • ...“Add a Trendline” with some decent correlation... (chart annotations: “now”, “not too shabby”)

  31. Forecasting • 15 servers @ 76 busy apache proc limit = 1140 total procs (chart annotations: “ceiling”, “what you have left”, “when is this?”, “this will tell you when it is”)

  32. Forecasting (1140-726) / 42.751 = 9.68 (week #10, duh)
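
The calculation on slide 32 reads as "solve the linear trendline for the ceiling". Below is a minimal sketch of that step, assuming the trendline has the form peak = 42.751 * week + 726 (which is what the slide's arithmetic implies); the function name is hypothetical.

```python
from typing import Optional


def weeks_until_ceiling(ceiling: float, intercept: float, slope: float) -> Optional[float]:
    """Solve intercept + slope * week = ceiling for week.

    Returns None when the trend is flat or falling, since the ceiling
    is then never reached by this model.
    """
    if slope <= 0:
        return None
    return (ceiling - intercept) / slope


servers = 15
procs_per_server = 76                    # safe per-server limit from slide 24
ceiling = servers * procs_per_server     # total busy procs the cluster allows

weeks = weeks_until_ceiling(ceiling, intercept=726, slope=42.751)
print(f"ceiling of {ceiling} procs reached around week {weeks:.2f}")
```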

  33. Forecasting Automation • Writing Excel macros is boring • All we want is “days remaining”, so all we need is the curve-fit • Use http://fityk.sf.net to automate the curve-fit

  34. Forecasting Fictional Example: storage consumption

  35. Forecasting Automation • actual flickr storage consumption from early 2005, in GB (ceiling is fictional) (chart annotations: “this will tell you when”, “this is”)

  36. Forecasting Automation • cmd line script • output

      [jallspaw:~]$ cfityk ./fit-storage.fit
      1> # Fityk script. Fityk version: 0.8.2
      2> @0 < '/home/jallspaw/storage-consumption.xy'
      15 points. No explicit std. dev. Set as sqrt(y)
      3> guess Quadratic
      New function %_1 was created.
      4> fit
      Initial values: lambda=0.001 WSSR=464.564
      #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
      #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
      #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
      #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
      Fit converged.
      Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
      5> info formula in @0
      # storage-consumption
      14147.4+146.657*x+0.786854*x^2
      6> quit
      bye...

  37. Forecasting Automation • fityk gave: y = 0.786854*x^2 + 146.657*x + 14147.4 (R^2 = 99.84) • Excel gave: y = 0.7675*x^2 + 146.96*x + 14147.3 (R^2 = 99.84) (SAME)
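
Once the curve-fit is automated, the remaining step is to find where the fitted curve crosses the ceiling. The sketch below does that for the quadratic fityk reported. The ceiling value is purely hypothetical (the slides say the ceiling is fictional and never give a number), and the unit of x is whatever the input file used.

```python
import math
from typing import Optional

# Coefficients from the fityk output: y = A*x^2 + B*x + C (y in GB).
A, B, C = 0.786854, 146.657, 14147.4

# Hypothetical storage ceiling in GB, chosen only for illustration.
HYPOTHETICAL_CEILING_GB = 20000.0


def x_at_ceiling(a: float, b: float, c: float, ceiling: float) -> Optional[float]:
    """Smallest non-negative x where a*x^2 + b*x + c equals ceiling, or None."""
    if a == 0:
        raise ValueError("not a quadratic; solve the linear case instead")
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        return None  # the curve never touches the ceiling
    root = math.sqrt(disc)
    candidates = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    future = [x for x in candidates if x >= 0]
    return min(future) if future else None


x = x_at_ceiling(A, B, C, HYPOTHETICAL_CEILING_GB)
print(f"fitted curve reaches the ceiling at x = {x:.1f}")
```

Subtracting the current x from this value gives the "remaining" figure the slides are after.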

  38. Capacity Health • 12,629 nagios checks • 1314 hosts • 6 datacenters • 4 photo “farms” • farm = 2 DCs (east/west)

  39. High and Low Water Marks • alert if higher • alert if lower (chart: per server, squid requests per second)

  40. A good dashboard looks something like... (yes, fictional numbers)

      type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  est. days left
      www       20  80         busy procs     1600           1000            62.50%  36
      shard db  20  40         I/O wait       800            220             27.50%  120
      squid     18  950        req/sec        17,100         11,400          66.67%  48
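
A high/low water mark check is a two-sided threshold. Here is a minimal sketch; the function name and both thresholds are hypothetical, not values from the talk.

```python
def check_water_marks(value: float, low: float, high: float) -> str:
    """Classify a metric sample against low and high water marks."""
    if low > high:
        raise ValueError("low water mark must not exceed high water mark")
    if value > high:
        return "HIGH"  # approaching the ceiling
    if value < low:
        return "LOW"   # suspiciously idle; the server may not be taking traffic
    return "OK"


# Hypothetical per-server squid thresholds, in requests per second.
LOW_MARK, HIGH_MARK = 200.0, 950.0

for sample in (120.0, 640.0, 990.0):
    print(sample, check_water_marks(sample, LOW_MARK, HIGH_MARK))
```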
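
Two of the dashboard columns are derived from the others: limit (total) is the box count times the per-box limit, and % peak is the current peak divided by that total. A minimal sketch using the slide's fictional rows follows; the estimated days left would come from a separate forecast and is not computed here.

```python
# (type, box count, limit per box, ceiling units, current peak), fictional slide values.
ROWS = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]


def summarize(rows):
    """Return {type: (total limit, percent of limit used at peak)}."""
    summary = {}
    for name, boxes, limit_per_box, _units, current_peak in rows:
        total_limit = boxes * limit_per_box
        percent_peak = 100.0 * current_peak / total_limit
        summary[name] = (total_limit, percent_peak)
    return summary


summary = summarize(ROWS)
for name, (total_limit, percent_peak) in summary.items():
    print(f"{name:<9} limit={total_limit:>6} peak={percent_peak:.2f}%")
```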

  41. Diagonal Scaling: vertically scaling your already horizontal nodes • Image processing machines • Replace Dell PE860s with HP DL140 G3s

  42. Diagonal Scaling example: image processing • 4 cores vs. 8 cores (about the same CPU “usage” per box)

  43. Diagonal Scaling example: image processing throughput • went from ~45 images/min @ peak to ~140 images/min @ peak (same CPU usage, but ~3x more work) • “processing” means making 4 sizes from originals

  44. Diagonal Scaling example: image processing • went from: 23 Dell PE860s, 23U rack, 3008.4 Watts, 1035 photos/min • to: 8 HP DL140 G3s, 8U rack, 1036.8 Watts, 1120 photos/min !!! (75% faster, even)
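
The per-box and per-watt figures behind this comparison can be derived directly from the slide's totals. The sketch below is only arithmetic on those numbers; the dataclass and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    boxes: int
    rack_units: int
    watts: float
    photos_per_min: float

    @property
    def photos_per_box(self) -> float:
        return self.photos_per_min / self.boxes

    @property
    def photos_per_watt(self) -> float:
        return self.photos_per_min / self.watts


before = Cluster(boxes=23, rack_units=23, watts=3008.4, photos_per_min=1035)  # Dell PE860s
after = Cluster(boxes=8, rack_units=8, watts=1036.8, photos_per_min=1120)     # HP DL140 G3s

print(f"photos/min per box:  {before.photos_per_box:.0f} -> {after.photos_per_box:.0f}")
print(f"photos/min per watt: {before.photos_per_watt:.2f} -> {after.photos_per_watt:.2f}")
print(f"power draw change:   {100 * (after.watts / before.watts - 1):.0f}%")
```

The per-box results match the ~45 and ~140 images/min quoted on slide 43.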

  45. 3.52 terabytes will be consumed today (on a Tuesday)

  46. 2nd Order Effects (beware the wandering bottleneck) • running hot, so add more

  47. 2nd Order Effects (beware the wandering bottleneck) • running great now, so more traffic! • now these run hot

  48. Stupid Capacity Tricks

  49. Stupid Capacity Tricks: quick and dirty management • DSH http://freshmeat.net/projects/dsh

      [root@netmon101 ~]# cat group.of.servers
      www100
      www118
      dbcontacts3
      admin1
      admin2

  50. Stupid Capacity Tricks: quick and dirty management

      [root@netmon101 ~]# dsh -N group.of.servers
      dsh> date
      executing 'date'
      www100: Mon Jun 23 14:14:53 UTC 2008
      www118: Mon Jun 23 14:14:53 UTC 2008
      dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
      admin1: Mon Jun 23 14:14:53 UTC 2008
      admin2: Mon Jun 23 14:14:53 UTC 2008
      dsh>
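
In the sample output above, one host reports PDT while the others report UTC. Whether the slide meant to highlight that is an assumption, but it shows how this kind of fan-out output can be checked mechanically. The sketch below parses lines of the form `host: <date output>` and flags hosts whose time zone differs from the majority; the function name is hypothetical and the parsing assumes the default `date` format shown in the transcript.

```python
from collections import Counter

DSH_OUTPUT = """\
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
"""


def timezone_outliers(output: str) -> list:
    """Return hosts whose reported time zone differs from the most common one."""
    zones = {}
    for line in output.splitlines():
        if ": " not in line:
            continue  # skip prompts and status lines
        host, date_text = line.split(": ", 1)
        fields = date_text.split()
        if len(fields) >= 2:
            zones[host] = fields[-2]  # the zone sits just before the year
    if not zones:
        return []
    majority_zone, _count = Counter(zones.values()).most_common(1)[0]
    return [host for host, zone in zones.items() if zone != majority_zone]


print(timezone_outliers(DSH_OUTPUT))
```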
