
Comparison of Unit-Level Automated Test Generation Tools


Presentation Transcript


1. Comparison of Unit-Level Automated Test Generation Tools
Shuang Wang
Co-authored with Jeff Offutt
April 4, 2009

2. Motivation
• We have more software, but insufficient resources
• We need to be more efficient
• Frameworks like JUnit provide empty boxes (see the sketch after this slide)
• Hard question: what do we put in them?
• Automated test data generation tools
  - Reduce time and effort
  - Are easier to maintain
  - Encapsulate knowledge of how to design and implement high-quality tests
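To make the "empty boxes" point concrete, here is a minimal JUnit 4 skeleton; the class and method names are hypothetical, chosen only for illustration.

```java
import org.junit.Test;

// A hypothetical "empty box": JUnit provides the frame for a test,
// but not the inputs or the expected outputs.
public class ExampleTest {
    @Test
    public void testSomething() {
        // What values do we pass in? What do we assert?
        // Filling in these blanks is what automated test data
        // generators try to do for us.
    }
}
```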

3. What are our criteria?
• Free
• Unit-level
• Automated test generation
• Java
What's available out there?
• Two commercial tools, AgitarOne and JTest (excluded, since they are not free)

4. Experiment Goals and Design
• Compare three unit-level automatic test data generators
• Evaluate them based on their mutation scores
• Subjects
  - Three free automated testing tools: JCrasher, TestGen4J, and JUB
• Control groups
  - Edge Coverage and Random Test
• Metric
  - Mutation score results
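For background (this definition is standard mutation-testing practice, not taken from the slides): the mutation score of a test set is the fraction of non-equivalent mutants it kills. A minimal sketch with made-up numbers:

```java
// Standard mutation score: mutants killed divided by the number
// of non-equivalent mutants. The values below are illustrative.
public class MutationScore {
    static double score(int killed, int total, int equivalent) {
        return (double) killed / (total - equivalent);
    }

    public static void main(String[] args) {
        // e.g., 120 of 200 mutants killed, 10 judged equivalent
        System.out.printf("%.3f%n", score(120, 200, 10)); // 0.632
    }
}
```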

5. Experiment Design
[Diagram: muJava generates mutants for each subject program. Five test sets (JCrasher/JC, TestGen4J/TG, JUB, a manually built Random set (Ram), and a manually built Edge Coverage set (EC)) are each run against the mutants, yielding one mutation score per test set.]

6. Experiment Design (same diagram repeated)

7. Java Programs Used [table of the subject Java programs]

8. Experiment Design (same diagram repeated)

9. Subjects (Automatic Test Data Generators): Control Groups
• Edge Coverage: one of the weakest and most basic test criteria (see the sketch after this list)
• Random Test: the "weakest-effort" testing strategy
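To illustrate the Edge Coverage control group with a hypothetical method (not one of the study's subjects): satisfying edge coverage for a single if-statement requires only one test per branch outcome.

```java
// Hypothetical subject: one branch, so edge coverage needs just
// two tests, one per outgoing edge of the decision node.
public class Discount {
    static double price(double base, boolean member) {
        if (member) {
            return base * 0.9;  // true edge
        }
        return base;            // false edge
    }

    public static void main(String[] args) {
        System.out.println(price(100.0, true));  // covers the true edge
        System.out.println(price(100.0, false)); // covers the false edge
    }
}
```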

10. Experiment Design (same diagram repeated)

11. muJava
• Create mutants (see the sketch below)
• Run tests
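As a sketch of the "create mutants" step (the class is hypothetical; AOR is one of muJava's real method-level operators): each mutant is a copy of the program with one small syntactic change.

```java
// Hypothetical illustration of a muJava AOR (Arithmetic Operator
// Replacement) mutant: a single '+' becomes '-'.
public class AorExample {
    // original method
    static int add(int a, int b) {
        return a + b;
    }

    // the AOR mutant of the same method
    static int addMutant(int a, int b) {
        return a - b;
    }

    public static void main(String[] args) {
        // A test asserting add(2, 3) == 5 kills this mutant,
        // because the mutant returns -1 instead.
        System.out.println(add(2, 3) + " vs " + addMutant(2, 3));
    }
}
```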

12. Results & findings: Total % Killed [chart]

13. Results & findings: Efficiency [chart]

14. Results & findings [figure]

15. Example: vendingMachine
• For vendingMachine, except for edge coverage, the other four mutation scores are below 10%
• muJava creates dozens of mutants on its predicates, and the mostly random values created by the three generators have only a small chance of killing those mutants (see the sketch after this list)
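A hypothetical sketch (not the study's actual vendingMachine code) of why random values rarely kill predicate mutants: a relational operator replacement misbehaves only at a boundary input.

```java
// Hypothetical vendingMachine-style predicate and one ROR
// (Relational Operator Replacement) mutant of it.
public class PredicateExample {
    // original: dispense once at least 90 cents is deposited
    static boolean canDispense(int cents) {
        return cents >= 90;
    }

    // ROR mutant: '>=' replaced by '>'
    static boolean canDispenseMutant(int cents) {
        return cents > 90;
    }

    public static void main(String[] args) {
        // The two versions differ only at cents == 90, so uniformly
        // random integers almost never expose the mutant.
        System.out.println(canDispense(90));       // true
        System.out.println(canDispenseMutant(90)); // false: mutant killed
    }
}
```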

16. Example: BoundedStack
• Scores for BoundedStack were the second lowest for all the test sets except edge coverage
• Only two of its eleven methods have parameters; the three test generators depend largely on the method signature, so fewer parameters may mean weaker tests (see the sketch after this list)
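A hedged skeleton suggesting why signature-driven generation struggles here; the method list is illustrative, not the study's actual BoundedStack:

```java
// Illustrative skeleton: most methods take no parameters, so a
// generator that derives inputs from method signatures has little
// to vary; the interesting behavior lies in push/pop sequences.
public class BoundedStack {
    private final int[] elems = new int[10];
    private int size = 0;

    public void push(int k)  { elems[size++] = k; }      // has a parameter
    public int pop()         { return elems[--size]; }   // no parameters
    public int top()         { return elems[size - 1]; } // no parameters
    public boolean isEmpty() { return size == 0; }       // no parameters
}
```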

17. Example: JCrasher
• JCrasher earned the highest mutation score among the three generators
• JCrasher uses invalid values to attempt to "crash" the class (see the sketch after this list)
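A rough sketch of the kind of robustness probe JCrasher aims at (hypothetical code reusing the BoundedStack skeleton above, not JCrasher's actual output): call methods with invalid values or in invalid states and flag unexpected runtime exceptions.

```java
// Hypothetical JCrasher-style probe: drive the class into an
// invalid state and report any unexpected runtime exception.
public class CrashProbe {
    public static void main(String[] args) {
        BoundedStack s = new BoundedStack();
        try {
            s.pop(); // pop on an empty stack
        } catch (RuntimeException e) {
            // JCrasher reports exceptions like this as potential bugs
            System.out.println("Possible bug: " + e);
        }
    }
}
```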

18. Conclusion
• By themselves, these three tools generate tests that are very poor at detecting faults
• Among publicly accessible tools, criteria-based testing is hardly used
• We need better automated test generation tools

19. Contact
Shuang Wang
Computer Science Department
George Mason University
SWANGB@gmu.edu
