
Simulation



Presentation Transcript


  1. Simulation
     Remzi Arpaci-Dusseau, Peter Druschel, Vivek Pai, Karsten Schwan, David Clark, John Jannotti, Liuba Shrira, Mike Dahlin, Miguel Castro, Barbara Liskov, Jeff Mogul

  2. What was the session about?
     • How do you know whether your system design or implementation meets your goals?
       • At scales larger than you can actually test
       • Over longer time frames
       • With loads/faults/changes that aren’t normally seen
     • “Simulation” as a way around this
       • Stretch reality beyond what you can test directly (e.g., on a testbed)

  3. Problems with simulation
     • We don’t trust our simulators
     • Can we get “real scale” rather than an oversimplified simulation of scale?
     • Focus has been on mathematical properties of the workload rather than on:
       • Errors
       • Unanticipated uses (attacks, semi-attacks)
       • Secondary behaviors

  4. Two main issues
     • Engines to run simulation
     • Workloads/faultloads/changeloads/topologies

  5. Engines to run simulation
     • Scale issues
     • Expressibility (e.g., delays, misbehavior, heterogeneity)
     • Performance
     • Repeatability/controllability (see the sketch after this list)
     • Plugability of components
     • Not a good match for clusters
       • Need to fix this, because big CPUs are history
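A minimal sketch (not from the session) of a discrete-event simulation engine, illustrating two of the requirements above: repeatability via an explicit RNG seed, and plugability via components that register event handlers. All class and function names here are hypothetical.

```python
# Minimal discrete-event engine sketch: seeded RNG => repeatable runs;
# handlers passed to schedule() act as pluggable components.
import heapq
import random

class Engine:
    def __init__(self, seed=0):
        self.now = 0.0
        self.rng = random.Random(seed)   # seeded RNG makes runs repeatable
        self.queue = []                  # entries: (time, seq, handler, payload)
        self._seq = 0                    # tie-breaker keeps event order deterministic

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self.queue, (self.now + delay, self._seq, handler, payload))
        self._seq += 1

    def run(self, until):
        while self.queue and self.queue[0][0] <= until:
            self.now, _, handler, payload = heapq.heappop(self.queue)
            handler(self, payload)       # pluggable component logic lives here

# Example pluggable component: a client issuing requests with random think time.
def client(engine, payload):
    print(f"t={engine.now:.2f} request issued")
    engine.schedule(engine.rng.expovariate(1.0), client)

if __name__ == "__main__":
    eng = Engine(seed=42)                # same seed => same event trace
    eng.schedule(0.0, client)
    eng.run(until=5.0)
```

Rerunning with the same seed reproduces the identical event trace, which is the controllability/repeatability property the session called for.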

  6. Workloads/faultloads/changeloads/topologies
     • What range of things do we have to cover?
     • How do we find out what happens in real life?
     • Anticipating things that haven’t happened before
     • Simulating security threats (unknown worms, botnets, etc.)
     • How do you manage a system that has failed?
     • Metaphor: preparing the surfaces before painting the house

  7. Approaches for engines
     • Use a SETI-at-home approach
       • To get scale and some exposure to errors
       • PlanetLab is too small, too well connected
       • “Honeypots-at-home”?
       • Ask for access to Windows Update
     • Fault injection
       • Trace-driven or model-based (see the sketch after this list)
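A hedged sketch contrasting the two fault-injection styles mentioned above: a trace-driven injector replays fault timestamps recorded from a real system, while a model-based injector draws fault times from a fitted distribution. The trace contents and rate parameter below are made up for illustration.

```python
# Trace-driven vs. model-based fault injection, side by side.
import random

def trace_driven_faults(trace):
    """Yield fault times exactly as observed in a recorded trace."""
    for t in sorted(trace):
        yield t

def model_based_faults(rate, horizon, seed=0):
    """Yield fault times drawn from a Poisson process with the given rate."""
    rng, t = random.Random(seed), 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return
        yield t

if __name__ == "__main__":
    recorded = [3.1, 7.9, 12.4]   # hypothetical disk-error timestamps (hours)
    print("trace-driven:", list(trace_driven_faults(recorded)))
    print("model-based :", [round(t, 1) for t in model_based_faults(rate=0.25, horizon=24)])
```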

  8. Simulation tools require community consensus
     • Otherwise reviewers don’t trust results
     • Provides a shared “shorthand” for what a published simulation result means
     • Need some sort of consensus-building process
       • Requires lots of effort, testing, bug-fixing
       • Tends to draw the community toward one standard

  9. *loads
     • Need “repeatable reality”
     • Need some diversity
     • Need enough detail
       • E.g., link bandwidths, error rates
     • Need well-documented assumptions (a descriptor sketch follows this list)
       • And a way to describe the range of these
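A minimal sketch of a self-describing *load/topology descriptor, in the spirit of “need enough detail” and “need well-documented assumptions”. The fields and example values are illustrative, not a community standard.

```python
# A workload/topology descriptor that carries its own documented assumptions
# and a seed so the generated load is repeatable.
from dataclasses import dataclass, field

@dataclass
class Link:
    src: str
    dst: str
    bandwidth_mbps: float
    loss_rate: float          # fraction of packets dropped

@dataclass
class WorkloadSpec:
    name: str
    seed: int                 # makes the generated load repeatable
    links: list
    assumptions: dict = field(default_factory=dict)   # documented, machine-readable

spec = WorkloadSpec(
    name="campus-web-2007",   # hypothetical trace name
    seed=42,
    links=[Link("client", "server", bandwidth_mbps=10.0, loss_rate=0.001)],
    assumptions={"arrival_process": "Poisson", "object_sizes": "Pareto, alpha=1.2"},
)
print(spec)
```

Keeping the assumptions machine-readable is one way to get the “way to describe the range of these” that the slide asks for.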

  10. So we can almost do “networks”; can we do “distributed systems”?
      • What do you need beyond network details?
        • Content-sensitive behavior
        • Fault models
        • How does user behavior change? Especially in response to system behavior changes (see the sketch after this list)
        • I/O, memory, CPU, and other resource constraints
        • Changes in configuration
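A sketch of the “user behavior changes in response to system behavior” point: a closed-loop client model that abandons its session when observed latency exceeds its patience, rather than replaying an open-loop trace. The patience threshold and the latency model are invented for illustration.

```python
# Closed-loop user model: the load offered depends on the latency observed.
import random

def user_session(rng, patience_s=2.0, requests=10):
    """Return how many requests the simulated user completed before giving up."""
    completed = 0
    for _ in range(requests):
        latency = rng.expovariate(1.0)   # stand-in for the simulated system's response time
        if latency > patience_s:
            break                        # behavior adapts: a slow system causes abandonment
        completed += 1
    return completed

rng = random.Random(7)
print("requests completed per session:", [user_session(rng) for _ in range(5)])
```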

  11. Fault injection
      • Problem: disk vendors don’t tell you what they know
        • Bigger users share anecdotes, not data
        • Look at what they do, infer what problem is being solved
      • “Disk-fault-at-home”
      • Microsoft has Watson data – would they share?
      • Linux ought to be gathering similar data!
      • Also need “behavior injection” for pre-fault sequences
      • Need more methodology
        • Crazy subsystems after unexpected requests
        • Better-than-random fault injection
      • How do we model (collect data on) correlated faults? (see the sketch after this list)
        • Does scale help or hinder independence?
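One sketch of “better-than-random” injection for correlated faults: a two-state, Gilbert-style burst model in which one disk error makes another error soon afterward far more likely than under independent, uniform injection. The transition probabilities below are placeholders, not measured values.

```python
# Two-state burst model for correlated fault injection.
import random

def correlated_faults(steps, p_enter_burst=0.01, p_stay_burst=0.7,
                      p_fault_in_burst=0.5, seed=0):
    """Return a 0/1 fault indicator per step from a two-state burst model."""
    rng, bursty, faults = random.Random(seed), False, []
    for _ in range(steps):
        # Enter or stay in the bursty state with state-dependent probability.
        bursty = rng.random() < (p_stay_burst if bursty else p_enter_burst)
        faults.append(1 if (bursty and rng.random() < p_fault_in_burst) else 0)
    return faults

trace = correlated_faults(steps=1000)
print("fault count:", sum(trace))   # faults cluster in bursts rather than spreading uniformly
```

Fitting such a model would still require the real-world fault data the slide says vendors and large users are reluctant to share.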

  12. Things to spend money on
      • Obtain topologies/faultloads/changeloads
        • Increase realism (more detail)
        • Maintain/evolve these as the world changes
      • Pay the maintenance costs of a community-consensus simulator
        • Or more than one, for different purposes
      • Enough resources for repeatable results
      • Don’t ignore storage and storage bandwidth

  13. Things that cannot be solved with “more money” alone
      • Scale beyond what we can plausibly afford or manage
      • Time scales
      • Dynamic behaviors
      • Access to real-world fault/behavior data
