
Test Case Filtering and Prioritization Based on Coverage of Combinations of Program Elements

Wes Masri and Marwa El-Ghali, American Univ. of Beirut, ECE Department, Beirut, Lebanon. wm13@aub.edu.lb


Presentation Transcript


  1. Test Case Filtering and Prioritization Based on Coverage of Combinations of Program Elements • Wes Masri and Marwa El-Ghali • American Univ. of Beirut, ECE Department, Beirut, Lebanon • wm13@aub.edu.lb

  2. Test Case Filtering • Test case filtering is concerned with selecting from a test suite T a subset T’ that is capable of revealing most of the defects revealed by T • Approach: select T’ so that it covers all elements covered by T

  3. Test Case Filtering: What to Cover? • Existing techniques cover singular program elements of varying granularity: methods, statements, branches, def-use pairs, slice pairs and information flow pairs • Previous studies have shown that covering finer-grained elements reveals more defects, at the expense of larger subsets

  4. Test Case Filtering • This work explores covering suspicious combinations of simple program elements • The number of possible combinations is exponential w.r.t. the number of singular elements, so an approximation algorithm is needed • We use a genetic algorithm

  5. Test Case Filtering: Conjectures • Combinations of program elements are more likely to characterize complex failures • The percentage of failing tests is typically much smaller than that of the passing tests • Each defect causes a small number of tests to fail • Given groups of (structurally) similar tests, smaller ones are more likely to be failure-inducing than larger ones

  6. Test Case Filtering: Steps • Given a test suite T, generate execution profiles of simple program elements (statements, branches, and def-use pairs) • Choose a threshold Mfail for the maximum number of tests that could fail due to a single defect • Use the genetic algorithm to generate C’, a set of combinations of simple program elements that were covered by fewer than Mfail tests (the suspicious combinations) • Use a greedy algorithm to extract T’, the smallest subset of T that covers all the combinations in C’
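
The last step above is an instance of greedy set cover. The following is only a minimal sketch of that idea, assuming each test has already been mapped to the set of suspicious combinations it covers. The function name `greedy_filter` and the dictionary layout are illustrative and do not come from the presentation. Note that a greedy cover is an approximation and is not guaranteed to find the smallest possible subset.

```python
def greedy_filter(coverage):
    """Pick tests until every combination covered by the suite is covered.

    coverage: dict mapping test id -> set of combination ids it covers
    (a hypothetical representation, not the authors' data format).
    """
    uncovered = set().union(*coverage.values())
    selected = []
    while uncovered:
        # Choose the test covering the most still-uncovered combinations;
        # iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        selected.append(best)
        uncovered -= coverage[best]
    return selected
```

For example, with `{"a": {1, 2}, "b": {3}, "c": {2}}` the sketch selects `a` first (two new combinations) and then `b`, leaving `c` out.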

  7. Genetic Algorithm • A genetic algorithm solves a problem by: • Operating on an initial population of candidate solutions, or chromosomes • Evaluating their quality using a fitness function • Using transformations to create new generations with improved quality • Ultimately evolving to a single solution

  8. Fitness Function • We use the following equation: fitness(combination) = 1 - %tests, where %tests is the percentage of test cases that exercised the combination • The smaller the percentage, the higher the fitness • The aim is to end up with a manageable set of combinations in which each combination occurred in at most Mfail tests
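
The fitness equation above can be sketched as follows. This is an illustration under stated assumptions: a test is taken to "exercise" a combination when its profile contains every element of the combination, %tests is expressed as a fraction between 0 and 1, and the names `fitness`, `profiles`, and `exercising_tests` are hypothetical.

```python
def exercising_tests(combination, profiles):
    """Return the ids of tests whose profile contains every element of the combination.

    profiles: dict mapping test id -> set of covered program elements
    (an assumed representation of the execution profiles).
    """
    return {t for t, covered in profiles.items() if combination <= covered}


def fitness(combination, profiles):
    """fitness = 1 - (fraction of tests that exercised the combination)."""
    if not profiles:
        raise ValueError("need at least one test profile")
    return 1.0 - len(exercising_tests(combination, profiles)) / len(profiles)
```

With four tests, a combination exercised by only one of them gets fitness 0.75, while one exercised by two gets 0.5, so rarer combinations rank higher.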

  9. Initial Population Generation • Generated from the union of all execution profiles • Size: 50 in our implementation • 00 always, 11 with small probability P

  10. Transformation Operator • Combines two parent chromosomes to produce a child • Passes down properties from each, favoring the parent with the higher fitness. • Goal: child to have a better fitness than its parents • Replace the parent with the worse fitness with the child
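
One way to picture the transformation operator is the sketch below. It is not the authors' operator: it assumes chromosomes are equal-length bit tuples, and it introduces a hypothetical `bias` parameter for the probability of inheriting each gene from the fitter parent. The sketch also does not guarantee that the child is fitter than its parents, which the slide states only as a goal.

```python
import random


def crossover(parent_a, parent_b, fitness_fn, rng, bias=0.7):
    """Build a child gene by gene, taking each gene from the fitter parent
    with probability `bias` and from the other parent otherwise."""
    if fitness_fn(parent_a) >= fitness_fn(parent_b):
        better, worse = parent_a, parent_b
    else:
        better, worse = parent_b, parent_a
    return tuple(b if rng.random() < bias else w for b, w in zip(better, worse))


def evolve_step(population, i, j, fitness_fn, rng, bias=0.7):
    """Cross parents i and j, then overwrite the less fit of the two with the child."""
    child = crossover(population[i], population[j], fitness_fn, rng, bias)
    loser = i if fitness_fn(population[i]) < fitness_fn(population[j]) else j
    population[loser] = child
    return child
```

Passing a seeded `random.Random` instance as `rng` keeps runs reproducible, which helps when comparing settings of `bias`.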

  11. Solution Set • The obtained solution set contains all the encountered combinations with high-enough fitness values (the suspicious combinations)

  12. Experimental Work Our subject programs included: • The JTidy HTML syntax checker and pretty printer; 1000 tests; 8 defects; 47 failures • The NanoXML XML parser; 140 tests; 4 defects; 20 failures

  13. Experimental Work • We profiled the following program elements: • basic-blocks or statements (BB) • basic-block edges or branches (BBE) • def-use pairs (DUP) • Next we applied the genetic algorithm to generate the following: • a pool of BBcomb • a pool of BBEcomb • a pool of DUPcomb • a pool of ALLcomb (combinations of BBs, BBEs and DUPs) • The values of Mfail we chose for JTidy and NanoXML were 100 and 20, respectively

  14. JTidy results: • In the case of ALLcomb, 14.1% of the original test suite was needed to exercise all of the combinations exercised by the original test suite, and these tests revealed all the defects revealed by the original test suite • In previous work we showed that coverage of slice pairs (SliceP) performed better than coverage of BB, BBE and DUP; this is why we are including the results of SliceP here for comparison.

  15. The figure compares the various techniques to random sampling: • All variations performed better than random sampling • BBcomb revealed 10.6% more defects than BB but selected 4.2% more tests • BBEcomb revealed 8.8% more defects than BBE but selected 3.7% more tests • DUPcomb revealed 6.3% more defects than DUP but selected 2.4% more tests • ALLcomb performed better than SliceP, since it revealed all defects, as SliceP did, but selected 12.6% fewer tests

  16. Experimental Work • Concerning BBcomb, BBEcomb, and DUPcomb, the additional cost due to the selection of more tests might not be well justified, since the rate of improvement is no better than it is for random sampling • Concerning ALLcomb, not only did it perform better than SliceP, but it is considerably less costly • It took 90 seconds on average per test to generate its profiles (i.e., BBs, BBEs and DUPs), whereas it took 1200 seconds per test to generate the SliceP profiles (1 day vs. 2 weeks; over 1000 tests, that is about 25 hours vs. about 14 days)

  17. NanoXML observations: • BB, BBE, DUP, and ALL did not perform any better than random sampling, whereas BBcomb, BBEcomb, DUPcomb, and ALLcomb performed noticeably better • BBcomb, BBEcomb, DUPcomb, and ALLcomb revealed all the defects, but at relatively high cost, since over 50% of the tests needed to be executed • The cost of running the genetic algorithm and the greedy selection algorithm has to be factored in when comparing our techniques to others

  18. Test Case Prioritization • Test case prioritization aims at scheduling the tests in T so that the defects are revealed as early as possible • Summary of our technique: • Prioritize combinations in terms of their suspiciousness • Then assign the priority of a given combination to the tests that cover it

  19. Test Case Prioritization: Steps • Identify combinations that were exercised by 1 test; assign that test priority 1, and add it to T’ • Identify combinations that were exercised by 2 tests; assign those tests priority 2, and add them to T’ • … and so on … until all tests are prioritized, or Mfail is exceeded, or all combinations were explored • Use the greedy algorithm to reduce T’ • Any remaining tests that were not prioritized will be scheduled to run randomly following the prioritized tests
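
The prioritization steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the combinations have already been generated and mapped to the tests that exercise them, it keeps the best (lowest) priority when a test qualifies for several, and it omits the greedy reduction of T’. The names `prioritize` and `combo_tests` are hypothetical.

```python
import random


def prioritize(combo_tests, all_tests, m_fail, rng):
    """Order tests by the rarity of the combinations they exercise.

    combo_tests: dict mapping combination id -> set of test ids exercising it.
    Returns (ordered list of tests, dict of assigned priorities).
    """
    priority = {}
    for k in range(1, m_fail + 1):
        for tests in combo_tests.values():
            if len(tests) == k:
                for t in tests:
                    # Keep the earliest priority a test received.
                    priority.setdefault(t, k)
    ranked = sorted(priority, key=lambda t: (priority[t], t))
    rest = sorted(set(all_tests) - set(priority))
    rng.shuffle(rest)  # unprioritized tests follow in random order
    return ranked + rest, priority
```

Tests that never appear in a combination exercised by at most `m_fail` tests end up in the shuffled tail of the returned order.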

  20. JTidy prioritization results when step 3 is satisfied, i.e., when all tests are prioritized, or Mfail is exceeded, or all combinations were explored • Observation: Using BBcomb, BBEcomb, and DUPcomb, not all defects were revealed. Combinations of BB, BBE, and DUP (ALLcomb) are needed to reveal all defects.

  21. NanoXML prioritization results • Observation: All defects were revealed using BBcomb, BBEcomb, DUPcomb, or ALLcomb, but at a high cost of selected tests.

  22. Conclusion • Our techniques performed better than similar coverage-based techniques that consider program elements of the same type and that do not take into account their combinations • Future work: we will conduct a more thorough empirical study • Future work: we will use the APFD (Average Percentage of Faults Detected) approach to evaluate prioritization
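
For reference, APFD is commonly defined as 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TFi the position in the ordering of the first test that reveals fault i. The sketch below computes it under that standard definition; the function and parameter names are illustrative, and the presentation itself does not show how the authors plan to compute it.

```python
def apfd(order, faults_by_test, all_faults):
    """Average Percentage of Faults Detected for a given test ordering.

    order: list of test ids in execution order.
    faults_by_test: dict mapping test id -> iterable of faults it reveals.
    all_faults: collection of faults; each must be revealed by some test in `order`.
    """
    n, m = len(order), len(all_faults)
    if n == 0 or m == 0:
        raise ValueError("need at least one test and one fault")
    first_pos = {}
    for pos, test in enumerate(order, start=1):
        for fault in faults_by_test.get(test, ()):
            first_pos.setdefault(fault, pos)  # remember the earliest position only
    missing = set(all_faults) - set(first_pos)
    if missing:
        raise ValueError(f"faults never revealed: {sorted(missing)}")
    return 1 - sum(first_pos[f] for f in all_faults) / (n * m) + 1 / (2 * n)
```

Higher values mean faults are revealed earlier, so reversing a good ordering should lower the score.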
