
CSSE 477 – More on Availability & Reliability

Presentation Transcript


  1. Right – High availability with VMWare – the major goal is to eliminate all single points of failure. When an ESX host goes down (the red X), the virtual machines running on the failed host migrate over to the other, healthy ESX hosts that have spare capacity. After migration, the virtual machines are restarted automatically, without intervention from the administrator. From http://www.vmracks.com/it-hosting/esx-high-availability.html. CSSE 477 – More on Availability & Reliability Steve Chenoweth Thursday, 9/22/11 Week 3, Day 3

  2. Today • 5 Minute talks on availability project • Tactics for software availability engineering… • A bit more, mostly from Musa’s book • Tonight: • Project 2, final part • HW 3 (individual) John Musa (1933-2009), the inventor of “software reliability engineering.”

  3. Software failures • Important to discuss with customers • Need to know what variations in system behavior are tolerable • Requirements provide a positive specification • Defining failures gives you a “negative specification” • What the system must not do • Adds another dimension to communication with the user Below – A windmill goes south.

  4. How to classify severity of software failures • Based on cost: (these are 1999 $ from Musa’s book, so you need to set your own scale)

  5. How to classify severity of software failures, cntd • Based on operational impact:

  6. How to classify severity of software failures, cntd • Need to make such a table product-specific. • Suppose the product is “Fone Follower”:

  7. Need to set “failure intensity objectives” for each release • Need to define a global way of measuring the “intensity” – like failures per hour of operation that will be tolerated. • Next goal is to convert these to global measures related to code, like failures per million lines of code run, for the critical subsystems. • Heuristic: For a given release, the product of these factors tends to be roughly constant, related to the amount of new functionality added: • Failure intensity • Development time • Development cost
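
A rough numeric sketch of that heuristic, in Python. All figures are invented for illustration; the "constant" K is simply the product for an assumed baseline release:

    # Musa's heuristic (illustrative numbers only): for a fixed amount of new
    # functionality, failure intensity objective x dev time x dev cost ~ constant.
    K = 0.05 * 6 * 500_000   # assumed baseline: 0.05 failures/hr, 6 months, $500k

    def required_budget(failure_intensity_objective, dev_time_months):
        """Estimate the cost implied by a tighter objective, holding K fixed."""
        return K / (failure_intensity_objective * dev_time_months)

    print(required_budget(0.05, 6))    # ~500,000  (the baseline itself)
    print(required_budget(0.025, 6))   # ~1,000,000: halving the tolerated failure
                                       # intensity on the same schedule doubles the cost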

  8. Failure intensity vs reliability • Generally, λ = -(ln R) / t • Where R = reliability, λ = failure intensity, and t = number of natural time units. • E.g., if reliability is 0.992 for 8 hours, the failure intensity is one failure per 1000 hours. • Note that Musa’s definition of “reliability” is slightly more sophisticated than usual: It’s the probability of execution without failure for a specified time interval (like hours). • In contrast, usually we talk of the average time to failure, and so skip over the probability part (it’s 50%).
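
A small Python sketch of that relation, checking the slide's numbers:

    import math

    def failure_intensity(reliability, hours):
        """Musa's relation: lambda = -(ln R) / t, with t in natural time units (hours here)."""
        return -math.log(reliability) / hours

    lam = failure_intensity(0.992, 8)   # the slide's example: R = 0.992 over 8 hours
    print(lam)        # ~0.001 failures per hour
    print(1 / lam)    # ~1000 hours, i.e. about one failure per 1000 hours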

  9. For a new system… • We don’t already have a track record of failure intensities • So it’s tougher to judge how much time to spend trying to get it right, for the “next release” – that’s release 1.0! • What to do? • Find operational data for similar systems • Or, how reliable are the underlying systems you’ll use? • Consider vendor warranties – what will we promise? (Or what do others promise?) • Get experts to estimate, based on prior work

  10. Availability Tactics • Once you know what to prevent, you develop to counteract those problems: • Opening rounds – try the 3 strategies from Bass Ch 5: • Fault detection • Fault recovery • Fault prevention • Later – Musa’s variation on this list: • Fault prevention • Fault removal • Fault tolerance
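
As a toy illustration of the fault detection strategy (a heartbeat/timeout check, in the spirit of the ping/echo tactic), here is a minimal Python sketch; the host names and timeout are invented:

    import time

    def detect_failed_hosts(last_heartbeat, timeout_seconds=10):
        """Report any host whose most recent heartbeat is older than the timeout."""
        now = time.time()
        return [host for host, beat in last_heartbeat.items()
                if now - beat > timeout_seconds]

    # esx2's heartbeat is 30 s old, so it is flagged; a recovery tactic (like the
    # automatic VM restart on slide 1) would then take over for it.
    now = time.time()
    print(detect_failed_hosts({"esx1": now - 2, "esx2": now - 30, "esx3": now - 1}))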

  11. Musa’s development strategies • Engineer the right balance among these reliability strategies • Determine where to focus them, to maximize the likelihood of meeting the objectives in an economical way • Components you buy and integrate are a problem – you have less control over those! • The best you can do may be to test with them as thoroughly as possible, and give feedback to their vendors

  12. Musa’s strategies, cntd • Testers should be in on setting the objectives and deciding the system architecture • Fault prevention is done by: • Good development processes, especially: • Having sound underlying methodologies • Doing reviews – like reviews of requirements & design • Enforcing standards • Using design tools that keep faults from being introduced

  13. Musa’s strategies, cntd • How’s fault removal done? • Primarily by code reviews and testing • You can measure the effectiveness of the code reviews by how many faults they caught vs how many remained to be caught in testing • Measure the effectiveness of testing by the number of faults found by that testing, vs those found after (like by the customer!)
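
One way to turn those two measures into numbers, in Python; the fault counts below are invented for illustration:

    def removal_effectiveness(found_here, found_later):
        """Fraction of the faults present at a stage that the stage actually removed."""
        return found_here / (found_here + found_later)

    # Assume 40 faults caught in code review, 15 more caught in testing,
    # and 5 that escaped to the customer.
    print(removal_effectiveness(40, 15 + 5))   # code reviews: ~0.67
    print(removal_effectiveness(15, 5))        # testing:       0.75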

  14. Musa’s strategies, cntd • How to achieve fault-tolerance? • Needs design: • Anticipate what deviations are likely to occur and will lead to failures • Implement “robust” software to counteract them • Like handling unexpected input from users or from other systems • How to minimize performance degradation, data corruption, or undesirable outputs • Use hardware to help (see slide 1!) • Can measure effectiveness by the reduction in failure intensity that results
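
A toy sketch of the “robust software” idea: tolerate a deviation in the input rather than letting it become a failure. The validation rule here is simplified and invented for illustration:

    def parse_forwarded_number(raw, default=None):
        """Accept messy phone-number input; fall back to a safe default instead of failing."""
        digits = "".join(ch for ch in str(raw) if ch.isdigit())
        if len(digits) == 10:      # well-formed (10-digit) input: use it
            return digits
        return default             # unexpected input: degrade gracefully

    print(parse_forwarded_number("(812) 877-8888"))   # '8128778888'
    print(parse_forwarded_number("garbage"))          # None, instead of a crash downstream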

  15. Musa’s strategies, cntd • In large organizations: • Important to measure the effectiveness of each of these availability tactics • Leads to sound decisions about what to spend money on – “best strategies”

  16. Software Safety • A topic related to reliability • Means freedom from mishaps • Mishap = loss of human life, injury, or property damage • Most software failures just cause user dissatisfaction • Software safety is dependent on realistic testing • Need operational profiles to focus testing

  17. Software vs hardware • Hardware reliability is affected by aging and wear. • Software – not so much. • Execution time affects reliability • What is its “duty cycle” (in hours)? • What are its failures per execution hour? • To get ultrareliable functions, we need a test duration of several times the reciprocal of the failure intensity objective • E.g., to get a failure intensity of 10^-9 failures per hour, you’d need to test for several times 10^9 hours.

  18. Musa on fault detection • A less precise idea than software failure • Absolute concept – an entity you can define without reference to software failures • Like a bad disk read • Operational concept – an entity that exists only with reference to the fact that failures will occur if that state is executed • The fault is the defect causing the failure

  19. Absolute faults • Makes it necessary to postulate a “perfect” program to which an actual program can be compared. • Then a fault is incorrect code, in comparison • It’s a defective, missing, or extra instruction • The defective program can be compared with the perfect program. • Of course, in reality, there could be many “perfect” programs. • You don’t know about the bad code till you are trying to fix it.

  20. Operational faults • The “fault” has some reality of its own, as an implementation defect. • But really, a piece of bad code could be responsible for no failures, or lots of failures, etc. • And “software failures” may not be in the code, per se. • So this definition of “fault” has issues, too.

  21. A realistic definition • Musa ends up saying, like Bass, that a software fault is incorrect code that leads to actual or potential failures. • Which leaves open the question of “What code should be included in the fault?” • Requires judgment • Would changing a few related instructions prevent additional failures from occurring?

  22. Availability in summary • Strategic thing is to work to improve it • Definitely don’t ignore it on any real product • “Hope is a city on de Nile”. • Lots more to explore • Plenty of systems out there, where people will pay to have them improve on this • Like all the quality attributes, this could be a career activity • A good additional course in this area – Dr. Radu’s ECE 497 – Design of fault tolerant digital systems Right - A software cruise down “de Nile”
