Presentation Transcript


  1. An Overview of High Volume Test Automation (Early Draft: Feb 24, 2012). Cem Kaner, J.D., Ph.D., Professor of Software Engineering, Florida Institute of Technology. Acknowledgments: Many of the ideas presented here were developed in collaboration with Douglas Hoffman. These notes are partially based on research that was supported by NSF Grant CCLI-0717613, “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

  2. Abstract • This talk is an introduction to the start of a research program. Drs. Bond, Gallagher and I have some experience with high volume test automation, but we haven't done formal, funded research in the area. We've decided to explore it in more detail, with the expectation of supervising research students. We think this will be an excellent foundation for future employment in industry or at a university. If you're interested, you should talk with us. • Most discussions of automated software testing focus on automated regression testing. Regression tests rerun tests that have been run before. This type of testing makes sense for testing the manufacturing of physical objects, but it is wasteful for software. Automating regression tests *might* make them cheaper (if the test maintenance costs are low enough, which they often are not), but if a test doesn't have much value to begin with, how much should we be willing to spend to make it easier to reuse? Suppose we decided to break away from the regression testing tradition and use our technology to create a steady stream of new tests instead. What would that look like? What would our goals be? What should we expect to achieve? • This is not yet funded research -- we are still planning our initial grant proposals. We might not get funded, and if we do, we probably won't get anything for at least a year. So, if you're interested in working with us, you should expect to support yourself (e.g. via GSA) for at least a year and maybe longer.

  3. Typical Testing Tasks
  • Analyze product & its risks: benefits & features; risks in use; market expectations; interaction with external S/W; diversity / stability of platforms; extent of prior testing; assess source code
  • Develop testing strategy: pick key techniques; prioritize testing foci
  • Design tests: select key test ideas; create tests for each idea; design oracles (mechanisms for determining whether the program passed or failed a test)
  • Assess the tests: debug the tests; polish their design; evaluate any bugs found by them
  • Execute the tests: troubleshoot failures; report bugs; identify broken tests
  • Document the tests: what test ideas or spec items does each test cover? what algorithms generated the tests? what oracles are relevant?
  • Maintain the tests: recreate broken tests; redocument revised tests
  • Manage test environment: set up test lab; select / use hardware/software configurations; manage test tools
  • Keep archival records: what tests have we run? what collections / suites provide what coverage?

  4. Regression testing • This is the most commonly discussed approach to automated testing: • Create a test case • Run it and inspect the output • If the program fails, report a bug and try again later • If the program passes the test, save the resulting outputs • In future testing: • Run the program • Compare the output to the saved results. • Report an exception whenever the current output and the saved output don’t match.
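
A minimal sketch of this save-and-compare loop, assuming a hypothetical run_test() callable that drives the program for one test case and returns its captured output (this is not any specific tool's API):

```python
# Sketch of the save-and-compare regression loop described on the slide above.
# run_test and the baselines/ directory are hypothetical placeholders.
import pathlib
from typing import Callable

def regression_check(test_id: str,
                     run_test: Callable[[str], str],
                     baseline_dir: str = "baselines") -> str:
    """Compare the current output of one test against its saved baseline."""
    baselines = pathlib.Path(baseline_dir)
    baselines.mkdir(exist_ok=True)
    baseline = baselines / f"{test_id}.txt"
    output = run_test(test_id)                  # run the program for this test
    if not baseline.exists():
        baseline.write_text(output)             # first run: a human inspects, then saves
        return "baseline saved (needs human review)"
    if output != baseline.read_text():
        return "MISMATCH: report an exception"  # current vs. saved output differ
    return "pass"

# Trivial stand-in for the system under test, for illustration only.
if __name__ == "__main__":
    print(regression_check("t001", lambda tid: f"result for {tid}"))
```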

  5. Really? This is automation? • Analyze product & its risks -- Human • Develop testing strategy -- Human • Design tests -- Human • Design oracles -- Human • Run each test the first time -- Human • Assess the tests -- Human • Save the code -- Human • Save the results for comparison -- Human • Document the tests -- Human • (Re-)Execute the tests -- Computer • Evaluate the results -- Computer + Human • Maintain the tests -- Human • Manage test environment -- Human • Keep archival records -- Human

  6. This is computer-assisted testing, not automated testing. ALL testing is computer-assisted.

  7. Other computer-assistance… • Tools to help create tests • Tools to sort, summarize or evaluate test output or test results • Tools (simulators) to help us predict results • Tools to build models (e.g. state models) of the software, from which we can build tests and evaluate / interpret results • Tools to vary inputs, generating a large number of similar (but not the same) tests on the same theme, at minimal cost for the variation • Tools to capture test output in ways that make test result replication easier • Tools to expose the API to the non-programmer subject matter expert, improving the maintainability of SME-designed tests • Support tools for parafunctional tests (usability, performance, etc.)

  8. Don't think "automated or not" • Think continuum: more to less • Not, "can we automate" • Instead: "can we automate more?"

  9. A hypothetical • System conversion (e.g. Filemaker application to SQL) • Database application, 100 types of transactions, extensively specified (we know the fields involved in each transaction and know their characteristics via the data dictionary) • 15000 regression tests • Should we assess the new system by making it pass the 15000 regression tests? • Maybe to start, but what about… • Build a test generator that creates high volumes of data combinations for each transaction. THEN: • Randomize the order of transactions to check for interactions that lead to intermittent failures • This lets us learn things we don’t know, and ask / answer questions we don’t know how to study in other ways
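
One hedged sketch of that idea, with an invented two-transaction data dictionary standing in for the real one (the field names, values, and apply_transaction are illustrative, not taken from the slide):

```python
# Sketch: generate data combinations per transaction type from a (hypothetical)
# data dictionary, then replay them in random order to hunt for interactions
# that cause intermittent failures. Field names and values are invented.
import itertools
import random

DATA_DICTIONARY = {
    "deposit":  {"amount": [0, 1, 999999], "currency": ["USD", "EUR"]},
    "withdraw": {"amount": [1, 500], "account_type": ["checking", "savings"]},
}

def generate_cases(txn_type):
    """Yield (transaction type, field dict) for every field-value combination."""
    fields = DATA_DICTIONARY[txn_type]
    names = list(fields)
    for combo in itertools.product(*(fields[n] for n in names)):
        yield txn_type, dict(zip(names, combo))

def random_transaction_stream(n, seed=0):
    """Randomize transaction order; record the seed so failures can be replayed."""
    rng = random.Random(seed)
    pool = [case for t in DATA_DICTIONARY for case in generate_cases(t)]
    for _ in range(n):
        yield rng.choice(pool)

for txn, data in random_transaction_stream(5):
    print(txn, data)      # a real harness would call apply_transaction(txn, data)
```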

  10. Suppose you decided to never run another regression test. What kind of automation could you do?

  11. Issues that Drive Design of Test Automation
  • Theory of error: what kinds of errors do we hope to expose?
  • Input data: how will we select and generate input data and conditions?
  • Sequential dependence: should tests be independent? If not, what info should persist or drive sequence from test N to N+1?
  • Execution: how well are test suites run, especially in the case of individual test failures?
  • Output data: observe which outputs, and what dimensions of them?
  • Comparison data: if detection is via comparison to oracle data, where do we get the data?
  • Detection: what heuristics/rules tell us there might be a problem?
  • Evaluation: how do we decide whether X is a problem or not?
  • Troubleshooting support: failure triggers what further data collection?
  • Notification: how/when is failure reported?
  • Retention: in general, what data do we keep?
  • Maintenance: how are tests / suites updated / replaced?
  • Relevant contexts: under what circumstances is this approach relevant/desirable?

  12. Primary drivers of our designs • The primary driver of a design is the key factor that motivates us or makes the testing possible. • In Doug's and my experience, the most common primary drivers have been: • Theory of error • We’re hunting a class of bug that we have no better way to find • Available oracle • We have an opportunity to verify or validate a behavior with a tool • Ability to drive long sequences • We can execute a lot of these tests cheaply.

  13. More on … Theory of Error • Computational errors • Communications problems • protocol error • their-fault interoperability failure • Resource unavailability or corruption, driven by • history of operations • competition for the resource • Race conditions or other time-related or thread-related errors • Failure caused by toxic data value combinations • that span a large portion or a small portion of the data space • that are likely or unlikely to be visible in "obvious" tests based on customer usage or common heuristics

  14. Simulate Events with Diagnostic Probes • 1984. First phone on the market with an LCD display. • One of the first PBXs with integrated voice and data. • 108 voice features, 110 data features. • Simulate traffic on the system, with • Settable probabilities of state transitions • Diagnostic reporting whenever a suspicious event is detected
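
A rough sketch of that kind of traffic simulator follows. The states, probabilities, and the "suspicious event" rule below are invented for illustration; they are not the 1984 system's details.

```python
# Sketch: drive random traffic through a state model with settable transition
# probabilities, and run a diagnostic check after every step. All states and
# probabilities here are invented.
import random

TRANSITIONS = {                 # state -> [(next state, probability)]
    "idle":      [("dialing", 0.7), ("idle", 0.3)],
    "dialing":   [("connected", 0.8), ("idle", 0.2)],
    "connected": [("idle", 0.9), ("connected", 0.1)],
}

def looks_suspicious(history):
    # Placeholder diagnostic: flag an implausibly long stay in one state.
    return len(history) >= 50 and len(set(history[-50:])) == 1

def simulate(steps, seed=1):
    rng = random.Random(seed)
    state, history = "idle", []
    for step in range(steps):
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights)[0]
        history.append(state)
        if looks_suspicious(history):
            print(f"DIAGNOSTIC: suspicious event at step {step}")
    return history

simulate(200)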

  15. More on … Available Oracle • Typical oracles used in test automation • Reference program • Model that predicts results • Embedded or self-verifying data • Checks for known constraints • Diagnostics
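
Two of those oracle styles are easy to illustrate in a few lines; this hedged sketch shows a constraint check and self-verifying data (the checksum scheme is an invented example, not a specific tool's format):

```python
# Sketch of two oracle styles from the list above: a constraint oracle that
# checks a known invariant of the output, and self-verifying data that carries
# its own checksum. The checksum format is invented for illustration.
import hashlib

def constraint_oracle(output_list):
    """Constraint check: a sort routine's output must be non-decreasing."""
    return all(a <= b for a, b in zip(output_list, output_list[1:]))

def make_self_verifying(payload: str) -> str:
    """Embed a short digest so the record can validate itself later."""
    digest = hashlib.sha256(payload.encode()).hexdigest()[:8]
    return f"{payload}|{digest}"

def check_self_verifying(record: str) -> bool:
    payload, digest = record.rsplit("|", 1)
    return hashlib.sha256(payload.encode()).hexdigest()[:8] == digest

print(constraint_oracle([1, 2, 2, 5]))                        # True
print(check_self_verifying(make_self_verifying("row-42")))    # True
```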

  16. Function Equivalence Testing • MASPAR (the Massively Parallel computer, 64K parallel processors). • The MASPAR computer has several built-in mathematical functions. We’re going to consider the integer square root. • This function takes a 32-bit word as an input. Any bit pattern in that word can be interpreted as an integer whose value is between 0 and 2^32 − 1. There are 4,294,967,296 possible inputs to this function. • Tested against a reference implementation of square root

  17. Function Equivalence Test • The 32-bit tests took the computer only 6 minutes to run and compare the results to the oracle. • There were 2 (two) errors, neither of them near any boundary. (The underlying error was that a bit was sometimes mis-set, but in most error cases, there was no effect on the final calculated result.) Without an exhaustive test, these errors probably wouldn’t have shown up. • For the 64-bit integer square root, function equivalence tests used random sampling rather than exhaustive testing, because the full set would have taken 2^32 times as long (6 minutes × 2^32).
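
A hedged sketch of the same equivalence pattern, with Python's math.isqrt standing in for both the reference oracle and the (unavailable) MASPAR routine; the limits are scaled down so the example runs quickly:

```python
# Sketch of function-equivalence testing: compare an implementation against a
# reference oracle, exhaustively for a small domain and by random sampling for
# a large one. isqrt_under_test is a stand-in for the routine being tested.
import math
import random

def isqrt_under_test(n: int) -> int:
    return math.isqrt(n)        # replace with the implementation under test

def exhaustive_equivalence(limit=2**20):
    # The original study covered all 2**32 inputs; a smaller limit keeps this quick.
    for n in range(limit):
        assert isqrt_under_test(n) == math.isqrt(n), f"mismatch at {n}"

def sampled_equivalence(samples=100_000, bits=64, seed=0):
    # For 64-bit inputs, exhaustion is infeasible, so sample at random.
    rng = random.Random(seed)
    for _ in range(samples):
        n = rng.randrange(2**bits)
        assert isqrt_under_test(n) == math.isqrt(n), f"mismatch at {n}"

exhaustive_equivalence()
sampled_equivalence()
```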

  18. This tests for equivalence of functions, but it is less exhaustive than it looks. (Acknowledgement: From Doug Hoffman) [Diagram: the system under test and the reference function each receive the intended inputs together with program state, system state, and configuration / system resources, and each produces monitored outputs together with resulting program state (and uninspected outputs), system state, impacts on connected devices / resources, and messages to cooperating processes, clients or servers.]

  19. More on … Ability to Drive Long Sequences • Any execution engine will (potentially) do: • Commercial regression-test execution tools • Customized tools for driving programs with (for example) • Messages (to be sent to other systems or subsystems) • Inputs that will cause state transitions • Inputs for evaluation (e.g. inputs to functions)

  20. Long-sequence regression • Tests are taken from the pool of tests the program has passed in this build. • The sampled tests are run in random order until the software under test fails (e.g., a crash). • Typical defects found include timing problems, memory corruption (including stack corruption), and memory leaks. • Recent (2004) release: 293 reported failures exposed 74 distinct bugs, including 14 showstoppers. • Note: • these tests are no longer testing for the failures they were designed to expose. • these tests add nothing to typical measures of coverage, because the statements, branches and subpaths within these tests were covered the first time these tests were run in this build.
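
A minimal sketch of such a long-sequence run, assuming a hypothetical run_test callable that returns True on pass and raises on a crash (no specific harness API is implied):

```python
# Sketch of long-sequence regression: sample tests the build has already
# passed and run them in random order until something breaks. run_test is a
# hypothetical callable; a raised exception stands in for a crash.
import random

def long_sequence_regression(passed_tests, run_test, max_runs=100_000, seed=0):
    rng = random.Random(seed)
    for i in range(max_runs):
        test_id = rng.choice(passed_tests)      # tests already passing in this build
        try:
            ok = run_test(test_id)
        except Exception as crash:              # weak oracle: any crash is a finding
            return f"run {i}: {test_id} crashed: {crash!r}"
        if not ok:
            return f"run {i}: {test_id} failed"
    return "no failure observed"

# Illustration only: a fake test that "crashes" rarely and at random.
if __name__ == "__main__":
    noisy = random.Random(42)

    def fake_test(tid):
        if noisy.random() < 1e-4:
            raise RuntimeError("simulated crash")
        return True

    print(long_sequence_regression(["t1", "t2", "t3"], fake_test))
```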

  21. Imagining a structure for high-volume automated testing

  22. Some common characteristics • The tester codes a testing process rather than individual tests. • Following the tester’s algorithms, the computer creates tests (maybe millions of tests), runs them, evaluates their results, reports suspicious results (possible failures), and reports a summary of its testing session. • The tests often expose bugs that we don’t know how to design focused tests to look for. • They expose memory leaks, wild pointers, stack corruption, timing errors and many other problems that are not anticipated in the specification, but are clearly inappropriate (i.e. bugs). • Traditional expected results (the expected result of 2+3 is 5) are often irrelevant.

  23. What can we vary? • Inputs to functions • To check input filters • To check operation of the function • To check consequences (what the other parts of the program do with the results of the function) • To drive the program's outputs • Combinations of data • Sequences of tasks • Contents of files • Input files • Reference files • Configuration files • State transitions • Sequences in a state model • Sequences that drive toward a result • Execution environment • Background activity • Competition for specific resources • Message streams

  24. How can we vary them? • Fuzzing: • Random generation / selection of tests • Execution engine • Weak oracle (run till crash) • Fuzzing examples • Random inputs • Random state transitions (dumb monkey) • File contents • Message streams • Grammars • Statistical or AI sampling • Test selection optimized against some criteria • Long-sequence regression • Model-based oracle • E.g. state machine • E.g. mathematical model • Reference program • Diagnostic oracle • Constraint oracle
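
As one concrete (and deliberately tiny) example of the fuzzing / dumb-monkey entries above, here is a sketch that drives random operations against a trivial stand-in for the system under test, with only a weak "run until it raises" oracle:

```python
# Sketch of a "dumb monkey": random operations with a weak oracle (any raised
# exception counts as a finding). The stack is a trivial stand-in for the SUT.
import random

def dumb_monkey(steps=10_000, seed=0):
    rng = random.Random(seed)
    stack = []
    actions = ("push", "pop", "peek")
    for step in range(steps):
        action = rng.choice(actions)
        try:
            if action == "push":
                stack.append(rng.randint(-10, 10))
            elif action == "pop" and stack:
                stack.pop()
            elif action == "peek" and stack:
                _ = stack[-1]
        except Exception as crash:      # weak oracle: run until something raises
            return f"step {step}: {action} raised {crash!r}"
    return "survived all steps"

print(dumb_monkey())
```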

  25. Issues that Drive Design of Test Automation
  • Theory of error: what kinds of errors do we hope to expose?
  • Input data: how will we select and generate input data and conditions?
  • Sequential dependence: should tests be independent? If not, what info should persist or drive sequence from test N to N+1?
  • Execution: how well are test suites run, especially in the case of individual test failures?
  • Output data: observe which outputs, and what dimensions of them?
  • Comparison data: if detection is via comparison to oracle data, where do we get the data?
  • Detection: what heuristics/rules tell us there might be a problem?
  • Evaluation: how do we decide whether X is a problem or not?
  • Troubleshooting support: failure triggers what further data collection?
  • Notification: how/when is failure reported?
  • Retention: in general, what data do we keep?
  • Maintenance: how are tests / suites updated / replaced?
  • Relevant contexts: under what circumstances is this approach relevant/desirable?

  26. About Cem Kaner • Professor of Software Engineering, Florida Tech • I’ve worked in all areas of product development (as a programmer, tester, writer, teacher, user interface designer, software salesperson, organization development consultant, manager of user documentation, software testing, and software development, and attorney focusing on the law of software quality). • Senior author of three books: • Lessons Learned in Software Testing (with James Bach & Bret Pettichord) • Bad Software (with David Pels) • Testing Computer Software (with Jack Falk & Hung Quoc Nguyen). • My doctoral research on psychophysics (perceptual measurement) nurtured my interests in human factors (usable computer systems) and measurement theory.
