
A Case Study In Reliability Analysis


Presentation Transcript


  1. Lewis Sykalski: A Case Study In Reliability Analysis

  2. Background (cont.)
  • Net Centric Warfare Data Collector
    • Approximately 180 KLOC
    • Written in Java and heavily uses JDBC and RMI from the J2EE package
    • CMMI Level 1
    • Utilizes the Oracle 9.2 EE OTS DBMS
  • Reliability Required: Moderate

  3. Background
  [Diagram: GLOBAL VISION NETWORK (GVN): CAOC Fusion DC, VBMS, LM Mission Sys (Colorado Springs, CO); DC, WCS, JSAF, JTAC, Light House (Suffolk, VA); JIMM, VBMS, JABE, other simulators, threat sims, Integrated Warfare Development Center (Fort Worth, TX); LM Sim & Training (Orlando, FL)]

  4. Design Diversity (Part I)
  • Part I: Oracle DBMS Design Diversity
  • Acquire 20 bug reports each from Oracle 9.2 & Oracle 10.0
  • Bugs had to be Date Independent, Easy To Reproduce, & Type Independent
  • Results would then be classified by self-evidence & divergence

  5. Design Diversity: Results 9.2 Bugs

  6. Design Diversity: Results 10.0 Bugs

  7. Design Diversity: More Analysis
  (S.E = self-evident failure, N.S.E = not self-evident failure)

                            Oracle 9.2 bug scripts     Oracle 10.0 bug scripts
                            on 9.2       on 10.0       on 10.0      on 9.2
  Total Bug Scripts            20           -             20           -
  Failure Observed             20           -             20          11
  Performance/Hang   S.E        2           0              1           0
  Internal Error     S.E       11           0             10           6
  Engine Crash       S.E        0           0              2           2
  Incorrect Result   S.E        0           0              0           0
                     N.S.E      7           0              6           2
  Other              S.E        0           0              1           1
                     N.S.E      0           0              0           0

  8. Design Diversity: Even More Analysis
  Bottom Line:
  • Not a statistical sample (not enough time)
  • 2/40 = 5% of failures were not detected across both products
  • Out of the 20 failures for Oracle 10.0, 6 were N.S.E, and 4 of those 6 would be resolved by utilizing a past release in tandem with the future release

  9. Reliability Analysis (Part II)
  • Part II: CASRE Reliability Analysis of the NCW Data Collector
  • Extract the following from the failure logs using JavaScript: time of program start, time of program termination, times of thread terminations, and exception or failure messages
  • Parse failures manually into the CASRE input format
  • Categorize by severity utilizing the chart on the next slide
  • Compare 2 consecutive events (CALOE08 & MAGTF08) as well as 2 consecutive lifecycles within the same event (Integration & Execution)
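
A note on the extraction step above: the slide says the scraping was done with JavaScript against the collector's failure logs, but the log format itself is not shown. Purely as an illustration, the Java sketch below pulls out the same event types under a hypothetical log-line layout (an ISO-style timestamp, then an event tag, then free text); the pattern, tag names, and class name are assumptions, not the author's script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Pulls program start/termination, thread terminations, and exception/failure
 * messages out of a failure log.  The line layout below is HYPOTHETICAL, e.g.:
 *   2008-07-14 09:15:02 PROGRAM_START
 *   2008-07-14 11:03:47 EXCEPTION java.sql.SQLException: ...
 */
public class FailureLogScraper {

    public record LogEvent(LocalDateTime time, String kind, String detail) {}

    // Assumed layout: "<yyyy-MM-dd HH:mm:ss> <EVENT_TAG> <free text>"
    private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s*(.*)$");
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static List<LogEvent> scrape(String logPath) throws IOException {
        List<LogEvent> events = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(logPath))) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) continue;          // skip lines that don't match the assumed layout
            String tag = m.group(2);
            // Keep only the event types slide 9 lists as inputs to the analysis.
            if (tag.equals("PROGRAM_START") || tag.equals("PROGRAM_TERMINATED")
                    || tag.equals("THREAD_TERMINATED") || tag.equals("EXCEPTION")
                    || tag.equals("FAILURE")) {
                events.add(new LogEvent(LocalDateTime.parse(m.group(1), FMT),
                                        tag, m.group(3)));
            }
        }
        return events;
    }
}
```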

  10. Severity

  11. Using CASRE

  12. Using CASRE (cont.)

  13. CASRE Input Format
  TIME BETWEEN FAILURES FORMAT: N/A
  FAILURE COUNT FORMAT:
    Interval Number (int) | Number of Errors (float) | Interval Length (float) | Error Severity (int)
  Example (first line gives the time units):
    Hours
    1  5.0  40.0  1
    1  3.0  40.0  2
    1  2.0  40.0  3
    2  4.0  40.0  1
    2  3.0  40.0  3
    3  7.0  40.0  1
    4  5.0  40.0  1
    5  4.0  40.0  1
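
The failure-count layout above (a units line, then interval number, number of errors, interval length, and error severity per record) is simple enough to auto-generate. The sketch below is a minimal illustration assuming failures have already been reduced to (elapsed-hours, severity) pairs; the 40-hour interval length mirrors the example but is otherwise arbitrary, and since the slide does not say whether CASRE requires empty intervals to be listed explicitly, they are simply omitted here.

```java
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/**
 * Emits failure-count records in the slide-13 layout from
 * (elapsed-hours, severity) pairs: a units line, then one record per
 * (interval, severity) combination that actually saw failures.
 */
public class FailureCountWriter {

    public record Failure(double elapsedHours, int severity) {}

    public static String toFailureCountFormat(List<Failure> failures,
                                              double intervalHours) {
        // (interval number) -> (severity -> error count)
        TreeMap<Long, TreeMap<Integer, Integer>> bins = new TreeMap<>();
        for (Failure f : failures) {
            long interval = (long) (f.elapsedHours() / intervalHours) + 1;
            bins.computeIfAbsent(interval, k -> new TreeMap<>())
                .merge(f.severity(), 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder("Hours\n");
        bins.forEach((interval, bySeverity) ->
            bySeverity.forEach((severity, count) ->
                out.append(String.format(Locale.US, "%d %.1f %.1f %d%n",
                        interval, count.doubleValue(), intervalHours, severity))));
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented sample failures: all fall in the first 40-hour interval.
        List<Failure> sample = List.of(
                new Failure(3.0, 1), new Failure(12.5, 1), new Failure(17.2, 3));
        System.out.print(toFailureCountFormat(sample, 40.0));
    }
}
```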

  14. CASRE Failure Counts: CALOE+MAGTF Execution; MAGTF Integration + Execution

  15. CASRE Time Between Failures: CALOE+MAGTF Execution; MAGTF Integration + Execution

  16. CASRE Failure Intensity: CALOE+MAGTF Execution; MAGTF Integration + Execution

  17. CASRE Cumulative Failures: CALOE+MAGTF Execution; MAGTF Integration + Execution

  18. CASRE Test Interval Length: CALOE+MAGTF Execution; MAGTF Integration + Execution

  19. Detecting Reliability Trends
  • Running Average:
    • Not as useful for failure count data (unless test intervals are of equal length)
    • Computes the running average of the time between successive failures for time-between-failures data, or the running average of the number of failures per interval for failure count data
    • If the running average decreases with time (fewer failures per test interval), reliability growth is indicated
  • Laplace Test:
    • Not as useful for failure count data (unless test intervals are of equal length)
    • Null hypothesis: failure occurrences follow a homogeneous Poisson process
    • If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level; the opposite holds if it increases with increasing failure number
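
To make the two trend tests concrete, the sketch below computes them for time-between-failures data: the running average of the interfailure times (increasing averages suggest reliability growth) and the textbook Laplace factor u(n) = (mean of the first n-1 cumulative failure times - t_n/2) / (t_n * sqrt(1/(12(n-1)))), whose negative, decreasing values favour reliability growth. This is the standard form of the statistic, not code taken from CASRE, and the sample data are invented.

```java
import java.util.Arrays;

/** Trend statistics for time-between-failures data: running average of the
 *  interfailure times and the Laplace factor.  Textbook formulas, not CASRE code. */
public class TrendTests {

    /** Running average of the first i interfailure times, for i = 1..n. */
    public static double[] runningAverage(double[] timeBetweenFailures) {
        double[] avg = new double[timeBetweenFailures.length];
        double sum = 0.0;
        for (int i = 0; i < timeBetweenFailures.length; i++) {
            sum += timeBetweenFailures[i];
            avg[i] = sum / (i + 1);   // increasing values suggest reliability growth
        }
        return avg;
    }

    /**
     * Laplace factor after the n-th failure (n >= 2):
     *   u(n) = ( (1/(n-1)) * sum_{i=1..n-1} t_i  -  t_n/2 ) / ( t_n * sqrt(1/(12(n-1))) )
     * where t_i are cumulative failure times.  Negative values favour reliability growth.
     */
    public static double laplaceFactor(double[] timeBetweenFailures, int n) {
        if (n < 2) throw new IllegalArgumentException("need at least 2 failures");
        double[] cumulative = new double[n];
        double t = 0.0;
        for (int i = 0; i < n; i++) {
            t += timeBetweenFailures[i];
            cumulative[i] = t;
        }
        double meanOfEarlier = Arrays.stream(cumulative, 0, n - 1).sum() / (n - 1);
        double tn = cumulative[n - 1];
        return (meanOfEarlier - tn / 2.0) / (tn * Math.sqrt(1.0 / (12.0 * (n - 1))));
    }

    public static void main(String[] args) {
        double[] tbf = {2.0, 3.0, 3.5, 5.0, 8.0, 9.5};  // invented interfailure times (hours)
        System.out.println("Running averages: " + Arrays.toString(runningAverage(tbf)));
        for (int n = 2; n <= tbf.length; n++) {
            System.out.printf("u(%d) = %.3f%n", n, laplaceFactor(tbf, n));
        }
    }
}
```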

  20. Running Average: CALOE+MAGTF Execution; MAGTF Integration + Execution

  21. Laplace Test: CALOE+MAGTF Execution; MAGTF Integration + Execution

  22. CASRE Cum Failure Predictions: CALOE+MAGTF Execution; MAGTF Integration + Execution

  23. CASRE Prediction Setup: CALOE+MAGTF Execution; MAGTF Integration + Execution

  24. CASRE Reliability Prediction: CALOE+MAGTF Execution; MAGTF Integration + Execution

  25. CASRE Prequential Likelihood: CALOE+MAGTF Execution; MAGTF Integration + Execution

  26. CASRE Model-Ranking: CALOE+MAGTF Execution; MAGTF Integration + Execution

  27. Reliability Models
  • Haven't been able to get these to run yet
  • The instruction manual says many of the built-in models only work with time-between-failures data
  • Doubt there would be much utility with failure count data

  28. Conclusion/Follow-Up
  • It actually would be QUITE easy to integrate auto-generation of failure count or time-between-failures output into my environment
  • This would facilitate quick trend analysis
  • The reliability trends, not the actual numbers, are what matter
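
As a rough picture of the auto-generation proposed above, the sketch below emits a time-between-failures listing from time-stamped failure events (a hypothetical FailureEvent record). It assumes one record per failure (failure number, hours since the previous failure, severity) with a units line first; the exact field order should be confirmed against the CASRE manual, and the sample data and severity values are invented.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Locale;

/**
 * Turns time-stamped failure events into a time-between-failures listing.
 * Assumed record layout (verify against the CASRE manual):
 *   <failure number> <hours since previous failure> <severity>
 */
public class TimeBetweenFailuresWriter {

    public record FailureEvent(LocalDateTime time, int severity) {}

    public static String toTimeBetweenFailuresFormat(List<FailureEvent> failures,
                                                     LocalDateTime runStart) {
        StringBuilder out = new StringBuilder("Hours\n");
        LocalDateTime previous = runStart;
        int number = 1;
        for (FailureEvent f : failures) {            // assumed sorted by time
            double hours = Duration.between(previous, f.time()).toSeconds() / 3600.0;
            out.append(String.format(Locale.US, "%d %.2f %d%n",
                    number++, hours, f.severity()));
            previous = f.time();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2008, 7, 14, 8, 0);
        List<FailureEvent> sample = List.of(
                new FailureEvent(start.plusHours(3), 1),
                new FailureEvent(start.plusHours(7).plusMinutes(30), 2));
        System.out.print(toTimeBetweenFailuresFormat(sample, start));
    }
}
```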
