
A Framework for Checking Regression Test Selection Tools



  1. A Framework for Checking Regression Test Selection Tools Chenguang Zhu, Owolabi Legunsen, August Shi, Milos Gligoric ICSE 2019 Montreal, Canada 5/29/2019 CCF-1421503, CCF-1566363, CCF-1652517, CCF-1704790, CCF-1763788, CNS-1646305, CNS-1740916

  2. Regression Testing • Widely used by developers, in industry and open source • Checks if changes break existing functionality • Problem: Time consuming and expensive • RetestAll: run all available tests (t1, t2, …, tn) after every change [figure: old revision, changes, new revision; RetestAll reruns every available test on the new revision]

  3. Regression Test Selection (RTS) • Optimizes regression testing by running a subset of available tests • Does not select tests whose outcome is unaffected by the change [figure: tests t1, t2, t3 depend on classes X, Y, Z; after a change to one class, only the tests that depend on it are selected]
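The selection idea on this slide amounts to a dependency lookup: keep only the tests that depend on a changed class. A minimal sketch, where the test-to-class dependency map and the test/class names are hypothetical:

```java
import java.util.*;

public class SelectTests {
    // Select only the tests that depend on at least one changed class.
    static Set<String> select(Map<String, Set<String>> deps, Set<String> changed) {
        Set<String> selected = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
            for (String cls : e.getValue()) {
                if (changed.contains(cls)) {
                    selected.add(e.getKey());
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical dependencies: t1 -> {X}, t2 -> {X, Y}, t3 -> {Y, Z}
        Map<String, Set<String>> deps = new HashMap<>();
        deps.put("t1", Set.of("X"));
        deps.put("t2", Set.of("X", "Y"));
        deps.put("t3", Set.of("Y", "Z"));
        // Only class X changed: t1 and t2 are selected, t3 is skipped.
        System.out.println(new TreeSet<>(select(deps, Set.of("X"))));
        // prints [t1, t2]
    }
}
```

Real class-level RTS tools such as Ekstazi compute these dependency sets dynamically during test runs; the map here stands in for that analysis.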

  4. RTS Tools • Developed in industry, e.g., • Clover, Infinitest, Facebook Buck, Google TAP, Microsoft CloudBuild, Microsoft’s Test Impact Analysis, etc. • Developed by researchers, e.g., • Ekstazi, STARTS, HyRTS, FaultTracer, RTSLinux, AutoRTS, etc. • Tools are being adopted by large software organizations

  5. Desired RTS Tool Properties • Safety: never miss running a test whose outcome may have changed • Precision: do not run tests whose outcomes are already known • Efficiency: Time(analysis) + Time(selected tests) < Time(all tests) [figure: missing a test that should be selected is a bug; selecting too many tests is a bug] How do we check? See paper about efficiency

  6. Our Solution: RTSCheck • A novel framework for checking RTS tools • Systematically checks RTS tools on evolving programs • Checks the output of the tools against rules • Violations of these rules likely indicate bugs in the tool • RTSCheck is publicly available: http://cozy.ece.utexas.edu/rtscheck

  7. RTSCheck Overview [diagram: inputs are the RTS tools (RetestAll, RTS Tool 1 … RTS Tool N) and a configuration; RTSCheck obtains evolving programs through its components (AutoEP, DefectsEP, EvoEP), runs the programs, and checks rules R1 … RM; outputs are safety, precision, and generality violations]

  8. Component: AutoEP • Automatically generates old and new revisions of a program • Explores language features that may be challenging for RTS tools [diagram: a program generator (JDolly¹) produces the code under test, a test generator (Randoop²) generates tests, and a program evolver applies evolution operators to derive the new revision from the old one; see paper] ¹ Soares et al. "Automated behavioral testing of refactoring engines" ² Pacheco et al. "Feedback-directed random test generation"
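One way to picture an evolution operator is as a source-to-source transformation that turns an old revision into a new one. The sketch below is a hypothetical, string-based illustration only (JDolly itself generates programs from declarative specifications, not string edits):

```java
public class EvolveProgram {
    // Hypothetical evolution operator: insert a field declaration at the
    // start of a class body, producing a new revision from an old one
    // (e.g., a field-hiding change like the one in the Clover example).
    static String addField(String source, String className, String fieldDecl) {
        int classPos = source.indexOf("class " + className);
        int brace = source.indexOf('{', classPos);
        return source.substring(0, brace + 1) + "\n  " + fieldDecl
             + source.substring(brace + 1);
    }

    public static void main(String[] args) {
        String oldRev = "public class C extends A {\n}";
        String newRev = addField(oldRev, "C", "public int f = 11;");
        System.out.println(newRev);
    }
}
```

Applying such operators to a generated program yields the old/new revision pairs on which the RTS tools are run.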

  9. Component: DefectsEP • Uses fixed and buggy revisions from bug databases as old and new revisions, respectively • Checks that RTS tools can expose the known failures [diagram: extract revisions from a bug database, e.g., Defects4J³; the fixed revision becomes the old revision and the buggy revision becomes the new revision] ³ Just et al. "Defects4J: A database of existing faults to enable controlled testing studies for Java programs"

  10. Component: EvoEP • Extracts old and new revisions from existing software repositories • Can extract multiple new revisions • Explores long sequences of real changes from developers [diagram: download a project from a software hosting platform, e.g., GitHub, and select revisions: one old revision and new revisions 1 … N]

  11. RTSCheck Overview [the overview diagram from slide 7, shown again before the rules are discussed]

  12. Rules for Finding Likely Bugs in RTS Tools • Rules for safety, precision, and generality violations, e.g., • Safety: the RTS tool runs fewer newly failed tests on the new revision than RetestAll • Precision: the RTS tool runs tests even when there are no changes • Generality: the RTS tool fails more tests than RetestAll does See paper for all rules
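The three example rules can be phrased as set comparisons between a tool's results and RetestAll's. A minimal sketch, with hypothetical test-name sets (the paper's full rule set is richer than this):

```java
import java.util.*;

public class CheckRules {
    // Safety rule: every test that newly fails under RetestAll must also be
    // run (and fail) under the RTS tool; otherwise the tool missed a test.
    static boolean safetyViolation(Set<String> retestAllNewFailures,
                                   Set<String> toolNewFailures) {
        return !toolNewFailures.containsAll(retestAllNewFailures);
    }

    // Precision rule: with no changes between two runs, the tool
    // should select nothing on the second run.
    static boolean precisionViolation(Set<String> selectedWithNoChanges) {
        return !selectedWithNoChanges.isEmpty();
    }

    // Generality rule: the tool should not fail tests that RetestAll passes.
    static boolean generalityViolation(Set<String> retestAllFailures,
                                       Set<String> toolFailures) {
        return !retestAllFailures.containsAll(toolFailures);
    }

    public static void main(String[] args) {
        // RetestAll sees a new failure in t2 that a hypothetical tool missed.
        System.out.println(safetyViolation(Set.of("t2"), Set.of()));        // true
        System.out.println(precisionViolation(Set.of()));                   // false
        // The tool fails t3, which RetestAll does not fail.
        System.out.println(generalityViolation(Set.of("t2"), Set.of("t2", "t3"))); // true
    }
}
```

A flagged violation is only a likely bug; as the evaluation slides note, each one still needs manual inspection.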

  13. Evaluation: Methodology • Run RTSCheck on multiple RTS tools • Activate all three RTSCheck components • AutoEP – automatically generate evolving programs • DefectsEP – use Defects4J as the bug database • EvoEP – extract revisions from GitHub repositories • Manually inspect violations reported by RTSCheck • Too many violations to inspect all, so sampling is required • Cluster violations based on the rules violated

  14. Evaluation: RTS Tools

  15. Evaluation: Detected Bugs • Using RTSCheck, we discovered 27 bugs • Reported all 27 bugs to the developers of these RTS tools • Four bugs were already known; 10 of the previously unknown bugs were confirmed

  16. Evaluation: Breakdown of Detected Bugs

  17. Example Bug: Clover Bug due to Field Hiding • Clover fails to detect that a new field was added • Exposed as a safety violation • Confirmed as a true bug by Clover developers
  Code under test:
    public class A {
      public int f = 10;
    }
    public class C extends A {
  +   public int f = 11;
    }
  Test:
    class CTest {
      @Test void test() {
        C c = new C();
        assertEquals(10, c.f);
      }
    }

  18. Example Bug: STARTS Bug due to Suite Runner • If tests use @RunWith(Suite.class), STARTS always runs the tests • Exposed as a precision violation • We confirmed it as a true bug in STARTS
  Test:
    class CTest {
      @Test void test1() throws Throwable { … }
    }
    @RunWith(Suite.class)
    @Suite.SuiteClasses({ CTest.class })
    class RSuite {}

  19. Example Bug: Ekstazi Bug due to JUnit 4 Annotations • Ekstazi skips an @Ignore-annotated test even in JUnit 3, where @Ignore has no effect • Exposed as a generality violation • We confirmed it as a true bug in Ekstazi
  Test:
    import junit.framework.*; // JUnit 3
    public class CTest extends TestCase {
      public void test1() { … }

      @Ignore public void test2() { … }
    }

  20. Beyond RTS • RTSCheck’s infrastructure can be used as a basis for testing various incremental program analysis techniques • Incremental compilation • Incremental static analysis • Regression verification • Reuse evolving programs, different rules for different applications

  21. Conclusions • RTS tools are widely adopted, but it is unclear whether they are implemented correctly • RTSCheck is a novel framework for checking RTS tools • RTSCheck discovered 27 bugs in 3 RTS tools • RTSCheck can support checking other tools that analyze evolving programs http://cozy.ece.utexas.edu/rtscheck Chenguang Zhu (cgzhu@utexas.edu) Owolabi Legunsen (legunse2@illinois.edu) August Shi (awshi2@illinois.edu) Milos Gligoric (gligoric@utexas.edu)

  22. Backup Slides

  23. Regression Testing is Costly • Time consuming and resource intensive • Google's Test Automation Platform (TAP) system runs (on an average day)* • 800K builds and 150 million test runs • Microsoft's CloudBuild runs (on an average day)** • 20K builds; it is used by more than 4K developers at Microsoft * Memon, Atif, et al. "Taming Google-scale continuous testing." Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Practice Track. IEEE Press, 2017. ** Esfahani, Hamed, et al. "CloudBuild: Microsoft's distributed and caching build service." Proceedings of the 38th International Conference on Software Engineering Companion. ACM, 2016.

  24. AutoEP Details [tables: evolution operators; program generation constraints]

  25. RTSCheck Detects Violations • Safety violation: an RTS tool does not select expected tests • E.g., not selecting newly failed tests that fail under RetestAll • Precision violation: an RTS tool selects unnecessary tests • E.g., running all tests the second time when run twice on the same program revision • Generality violation: an RTS tool does not integrate well with the program, leading to unexpected behavior • E.g., failing more tests than RetestAll due to incorrect instrumentation

  26. Rules for Detecting Violations

  27. Evaluation: Inspection Procedure • RTSCheck found more than 24K violations in total, so it was not feasible to manually inspect all of them • For AutoEP (24,472 violations), we sampled two violations for each AutoEP mode (JD + E), i.e., we inspected 84 violations • For DefectsEP (348 violations), we grouped the violations based on which rules are violated, which tools violate the rules, and on which projects the rules are violated; in total, we created 25 groups and inspected 41 violations • For EvoEP, we inspected all 82 violations • In total, we inspected 207 violations • Our inspection found only two false positives among the violations
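The DefectsEP grouping step above amounts to keying each violation by (rule, tool, project) and inspecting one representative per group. A minimal sketch, with a hypothetical Violation record and made-up rule/tool/project names:

```java
import java.util.*;

public class GroupViolations {
    // Hypothetical violation record: which rule was violated,
    // by which tool, on which project.
    record Violation(String rule, String tool, String project) {}

    // Group violations by (rule, tool, project) so that only one
    // representative per group needs manual inspection.
    static Map<String, List<Violation>> group(List<Violation> vs) {
        Map<String, List<Violation>> groups = new TreeMap<>();
        for (Violation v : vs) {
            String key = v.rule() + "/" + v.tool() + "/" + v.project();
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(v);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Violation> vs = List.of(
            new Violation("safety", "ToolA", "p1"),
            new Violation("safety", "ToolA", "p1"),
            new Violation("precision", "ToolB", "p2"));
        // Three violations collapse into two groups.
        System.out.println(group(vs).keySet());
        // prints [precision/ToolB/p2, safety/ToolA/p1]
    }
}
```

Such clustering is what reduced the 348 DefectsEP violations to 25 groups in the evaluation.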

  28. Evaluation: AutoEP Configurations • AutoEP • Configured to generate 400 first-revision programs for each JD+E combination • Base - the number of compilable first-revision programs for each JD • #G - The number of generated evolved programs • #C - The number of generated compilable programs

  29. Evaluation: DefectsEP Configuration • DefectsEP • Start with all the examples in the Defects4J repository (SHA 6bc92429) • Filter out non-Maven examples (Clover and STARTS currently support only Maven), examples that cannot build, and examples that do not compile with Java 8 (STARTS does not work with earlier Java versions) • Exclude the seven tests that are flaky or fail consistently on the fixed program version

  30. Evaluation: EvoEP Configurations • EvoEP • Uses ten projects that are available on GitHub, build with Maven, and were recently used in research on regression testing • Limits the maximum number of revisions to 20 to ensure feasibility

  31. Evaluation: Research Questions • RQ1: What safety violations and bugs are detected by RTSCheck? • RQ2: What precision violations and bugs are detected by RTSCheck? • RQ3: What generality violations and bugs are detected by RTSCheck? • RQ4: How efficient are RTS tools under various RTSCheck components?

  32. Example Bug: Ekstazi Bug due to JUnit 3 • If a test is run with Ekstazi twice, the test is executed both times • Exposed as a precision violation • Confirmed as a true bug by Ekstazi developers
  Test:
    import junit.framework.*; // JUnit 3
    public class CTest extends TestCase {
      public static Test suite() {
        return new TestSuite(CTest.class);
      }
      public void test() {
        assertNotNull(new A());
      }
    }

  33. Example Bug: Clover Bug due to Instrumentation • Clover performs intrusive instrumentation by inserting extra methods and fields into classes • Exposed as a generality violation • Confirmed as a true bug by Clover developers
  Code under test:
    public class A {}
    public class B {
      public A a;
      public void m() {}
    }
  Test:
    class CTest {
      public void test() {
        assertEquals(1, B.class.
            getDeclaredClasses().length);
      }
    }

  34. Evaluation: Efficiency Report • Only DefectsEP and EvoEP are useful for checking efficiency • Clover is relatively inefficient and runs longer than RetestAll [tables: DefectsEP and EvoEP timing results]

  35. Discussion and Future Work • Extending RTSCheck • It takes about an hour to add a new rule that uses the already collected data and logs • Adding new Maven projects to evaluate is trivial • Mutation testing • An alternative approach to generating evolving programs • Our initial experiment (extending PIT to generate source-code mutants) on two open-source projects, Apache CSV and Google compile-testing, shows that none of the bugs found by RTSCheck can be found using mutants • Future work • RTSCheck's infrastructure can be used as a basis for testing various incremental program analysis techniques • Explore ways of clustering evolving programs that expose the same bug • Develop new rules: investigate better thresholds for differences in selected tests, to better balance the trade-off between true and false positives • Expand existing rules by looking at individual failed tests
