From Test Adequacy To Test Oracle

W.K. Chan

My Background
• Wing Kwong Chan 陳榮光
• Call me “Ricky”.
• “W.K. Chan” is my published name, used on papers only.
• You may visit my website to learn more about my work: www.cs.cityu.edu.hk/~wkchan
• But if you want to know me better, talk to me!
• If you have further questions, comments, or suggestions, or are interested in research collaboration, email me at wkchan@cityu.edu.hk
• I worked in the software industry for quite a number of years before completing my PhD. Perhaps to your surprise, I completed my PhD in part-time mode. A few years later, I am now an assistant professor in the Department of Computer Science, City University of Hong Kong.

Today’s Agenda
• What is Program Testing
• Test Data Adequacy and Test Oracle Problems
  • Background
  • Examples
• Possibilities of Further Research

My Declaration and Expectation
• I am not going to teach you a new (or old) research topic
• I am not going to sell my research
• In fact, some of the work I am going to cover is not my own research
• I merely want to share with you one way (out of many possible ways) of exploring a specific research space
• Your research space could be very different from mine.
• I chose these two testing research topics for this summer school because they are two of the very first research topics that I encountered during my PhD studies.

Program Testing
• Software engineering: developing systematic methodologies and techniques to construct software.

Program Testing
Executing a program over an input and checking the program output to decide whether the test results reveal any problems in the program.
“Program testing can be used to show the presence of bugs, but never their absence.” (Dijkstra, 1972)

Program Testing
• Program testing has been, and continues to be, widely practiced.
• Its popularity suggests that there are good reasons why it has survived.
• But if it has been practiced for decades, why are there still significant research challenges to overcome, and why does testing remain a popular research topic? Any guess?
• See whether you can find a reason for yourself in today’s class.

Increasing Amount of Research Results in Major Publication Outlets
• For software testing and verification
  • General (top-tier): ACM TOSEM, ACM TOPLAS, IEEE TSE, ICSE, FSE, OOPSLA, PLDI, POPL
  • Sub-field specific: Wiley STVR, IEEE TR, CAV, ISSTA, ICST, QSIC
  • Others: WWW, ASE, FASE, ICSM, ISSRE, IPSN, EMSOFT, ICDCS, ICWS, SCC, SOSP, OSDI, TPDS, TC
• US NIST report (2002): inadequate software testing infrastructure costs the US about US$60 billion a year.
• A hot sub-area: e.g., at FSE’08 (Nov 2008), over 20% of the papers were testing papers.

Program Testing
• We also need a target for testing research.
• A typical target is to find as many bugs (known as faults) as possible at the lowest possible cost
• In general, we want to make a trade-off between effectiveness and cost
• We may instead want to show that a set of test cases executes all the statements in a program, irrespective of whether they find any bugs.
• Understand what “effectiveness” and “cost” mean before trading one against the other.

Example 1
• A program has input validation logic that filters out invalid inputs.
  • E.g., an input text field accepts only an email address
• Feeding arbitrary inputs to the program is an ineffective way to test its core feature
  • Low cost in generating random inputs
  • But low effectiveness in finding bugs as well
• This is not a trade-off that we want to study in research.
• We would rather accept slightly lower effectiveness if the associated cost can be reduced substantially.

Example 2
• Testing a new module (e.g., class A) now becomes re-testing of the same module later, when we develop a new module that interacts with it
• Additional re-testing of the same module will generally not discover as many bugs as the same testing did the first few (e.g., 1–3) times.
• The fault detection ability (effectiveness) of the testing decreases, but the cost (efficiency) does not

General Software Testing Framework
• A program accepts inputs, executes code and produces outputs.
• Developers may further enhance the program with additional features.
• Testing aims at assuring the program at each such activity.
[Figure: testing loop around the program, with the activities: generate test cases as inputs, execute, assess outcome (T/F), locate faults, assess adequacy (T/F), add feature]

Dataflow Testing
Given three integers, find the integer that is in the middle.
    int findMid(int a, int b, int c) {
        int result = a;
        if (a >= b && b >= c) result = b;
        if (b >= c && c >= a) result = b;
        … // similar structure to handle other cases
        return result;
    }
Output of findMid(1, 3, 2): 3

Dataflow Testing
Fault in findMid: the second branch should assign c, not b, to “result”.
With the input findMid(1, 3, 2), the faulty assignment defines the variable “result” as 3, so the output is 3.

Dataflow Testing
Find a test case so that the value of “result” defined at s1 is used at s2, e.g., findMid(1, 3, 2) outputs 3, while the expected result is 2.
Here s1 is the faulty assignment in the second branch of findMid, which defines “result” as 3, and s2 is the return statement, where this value 3 is used.
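The test case above can be replayed with a small Python sketch. This is only an illustration: the second branch condition is written as b >= c >= a so that findMid(1, 3, 2) reaches the faulty assignment, the branches that the slide elides are filled in as assumptions, and find_mid_oracle is a hypothetical reference used to obtain the expected result.

```python
def find_mid_faulty(a: int, b: int, c: int) -> int:
    """Middle of three integers, with the seeded fault from the slide."""
    result = a
    if a >= b >= c:
        result = b
    if b >= c >= a:
        result = b  # seeded fault: the middle value on this path is c
    # The remaining orderings are assumptions added for this sketch.
    if c >= b >= a:
        result = b
    if c >= a >= b or b >= a >= c:
        result = a
    if a >= c >= b:
        result = c
    return result


def find_mid_oracle(a: int, b: int, c: int) -> int:
    """Hypothetical reference: the median of the three values."""
    return sorted((a, b, c))[1]


actual = find_mid_faulty(1, 3, 2)
expected = find_mid_oracle(1, 3, 2)
print(actual, expected)  # the two values differ, so the test case reveals the fault
```

An input such as (3, 2, 1) never reaches the faulty assignment, so it passes even though the fault is present.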

Dataflow Testing
The all-uses idea: test against all such pairs, where each pair consists of a definition of a variable (result [define]) and a use of that definition (result [use]).

Dataflow Testing
c-use: directly affects the computation being performed, or outputs the result of some earlier definition.
In findMid, an assignment to result is a definition (result [define]), and “return result” is a c-use of it (result [c-use]).

Dataflow Testing
p-use: directly affects the flow of control through the sub-program.
In findMid, the parameter a is defined on entry (a [define]), and its appearance in a branch condition such as a >= b is a p-use (a [p-use]).
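To make the define-use pairs concrete, here is a minimal Python sketch that instruments a two-branch fragment of findMid and records which definition of result reaches the c-use at the return. The labels d0, d1, d2 and "use", the helper names, and the second branch condition (b >= c >= a) are assumptions of this sketch; p-uses of a, b and c are not tracked.

```python
def find_mid_traced(a, b, c):
    """Two-branch fragment of findMid that also reports the def-use pair
    exercised for `result` (which definition reaches `return result`)."""
    result, last_def = a, "d0"        # d0: result = a
    if a >= b >= c:
        result, last_def = b, "d1"    # d1: assignment in the first branch
    if b >= c >= a:
        result, last_def = b, "d2"    # d2: assignment in the second (faulty) branch
    return result, (last_def, "use")  # use: c-use of result at the return


ALL_PAIRS = {("d0", "use"), ("d1", "use"), ("d2", "use")}


def all_uses_coverage(tests):
    """Return the covered def-use pairs and the fraction of ALL_PAIRS covered."""
    covered = {find_mid_traced(*t)[1] for t in tests}
    return covered, len(covered) / len(ALL_PAIRS)


covered, ratio = all_uses_coverage([(3, 2, 1), (1, 3, 2), (2, 1, 3)])
print(sorted(covered), ratio)
```

A test set containing only (3, 2, 1) covers one of the three pairs; adding (1, 3, 2) and (2, 1, 3) covers the pairs for d2 and d0 as well.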

Data Flow Test Adequacy Criteria (Frankl and Weyuker, TSE 1988)

Test Adequacy Criterion
• Defines a set of requirements (e.g., all define-use associations in a program) that a set of test cases should exercise when the program is executed over them
• The set of requirements is usually computed statically
• May serve as a stopping rule for testing
  • E.g., all-edges in KLEE (OSDI 2008)
• May also give a sense of progress
• Points out program locations that need more testing effort (as a measure of the comprehensiveness of the test set)
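As a rough illustration of how such a criterion can act as a stopping rule and point at what still needs testing, the following Python sketch computes the fraction of requirements exercised by a test set and lists the uncovered ones. The function name, the requirement labels (du1 to du4) and the test names are hypothetical.

```python
def assess_adequacy(requirements, exercised_by_test):
    """Return (coverage ratio, uncovered requirements) for a test set.

    requirements: set of test requirements, e.g. def-use associations.
    exercised_by_test: dict mapping a test name to the set of
        requirements that the test exercises.
    """
    covered = set().union(*exercised_by_test.values()) & requirements
    uncovered = requirements - covered
    ratio = len(covered) / len(requirements) if requirements else 1.0
    return ratio, uncovered


requirements = {"du1", "du2", "du3", "du4"}
exercised = {"t1": {"du1", "du2"}, "t2": {"du2", "du3"}}
ratio, uncovered = assess_adequacy(requirements, exercised)
print(ratio, uncovered)  # testing may stop once the ratio reaches the chosen threshold
```

The uncovered set is what gives the sense of progress: it names the requirements that the next test cases should aim at.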

Data Flow Test Adequacy Criteria (Frankl and Weyuker, TSE 1988)
[Figure annotations on the criteria: “Not quite practical”; “Extension”; “Much less effective than all-uses”]
• Trade-off between effectiveness and cost

Empirical Study on Comparing Criteria
• An empirical comparison of data flow and mutation-based adequacy criteria, by Mathur and Wong (STVR 1994)
• An experimental evaluation of dataflow and mutation testing, by Offutt et al. (SPE 1996)
• All-uses vs. mutation testing: An experimental comparison of effectiveness, by Frankl et al. (JSS 1997)
• An experimental comparison of four unit test criteria: mutation, edge-pair, all-uses and prime path coverage, by Li et al. (Mutation 2009)
• An evaluation of mutation and data-flow testing: a meta analysis, by Kakarla et al. (Mutation 2011)
• On the danger of coverage directed test case generation, by Staats et al. (FASE 2012)
• …

Domain-Specific Extension of Testing Criteria
• A family of test adequacy criteria for database-driven applications, by Kapfhammer and Soffa (FSE 2003)
• Testing context-aware middleware-centric programs: a data flow approach and an RFID-based experimentation, by Lu et al. (FSE 2006)
• Testing pervasive software in the presence of context inconsistency resolution services, by Lu et al. (ICSE 2008)
• Inter-context control-flow and data-flow test adequacy criteria for nesC applications, by Lai et al. (FSE 2008)
• Data flow testing of service-oriented workflow applications, by Mei et al. (ICSE 2009)
• Data flow testing of service choreography, by Mei et al. (FSE 2009)
• Semi-valid input coverage for fuzz testing, by Tsankov et al. (ISSTA 2013)
• …

Domain-Specific Extension of Testing Criteria
• Sometimes new research problems emerge once we know more about a specific domain

Pervasive Computing Applications
• Context
  • Abstraction of environmental situations; contexts exist in streams
  • e.g., (data:) location, temperature, RF signal, user profile
  • e.g., (process:) other computing entities, execution states
• Context-awareness
  • Application behavior evolves as contexts change

Example Application at Shenzhen Urban Transport Planning Center
http://www.sutpc.com/szgis
• Purpose: monitor the real-time traffic conditions in Shenzhen City.
• Every taxi is fitted with a mobile client
• Report the taxi’s location as contexts
• Display the suggested route for a taxi journey
• Monitoring and routine computation are done at the back office.
[Map legend: ■ Free-flowing ■ Crowded ■ Jammed]

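Example Application at Shenzhen Urban Transport Planning Center
http://www.sutpc.com/szgis
[Map: a taxi journey from the start point to the drop point, with road segments marked as good traffic or traffic congestion. Legend: ■ Free-flowing ■ Crowded ■ Jammed]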

Many Obstacles to Prevent the Software Application to Behave Adaptively
1. Not easy to collect good raw information
• Context corruption
• Context losses
• Signal reflection
• Too many data sources nearby
• In tunnel
• Many buildings in the surroundings
Need to detect problems in contexts and repair them

Many Obstacles to Prevent the Software Application to Behave Adaptively
• 10,000+ taxis on the street → 14+ million contexts per day
• Many potential context consistency constraints can be checked → huge number of checks required
• Many ways to fix the same detected problem → huge number of alternatives to be assessed
[Figure: contexts flowing to the Planning Center]

Many Obstacles to Prevent the Software Application to Behave Adaptively
2. Not easy to behave smartly
• Human-in-the-loop
• Program bugs
• In-situ processing
[Figure: a taxi told to “Turn left” while there is an accident on the road]
Context-aware programs should use contexts correctly, and work well with context-related services

Status of Current Development of Context-aware Software
• Research question: How can we test well, and efficiently, to gain confidence in the context-aware programs we develop?
[Figure: table headed “Common Problem”, rating successful development experience, brute-force development, and program bugs on a scale of little / adequate / plenty]

Status of Current Development of Context-aware Software
[Figure: the “Common Problem” table (successful development experience, brute-force development, program bugs, rated little / adequate / plenty) with a “Testing Strategy” column: test around contexts in programs; automated testing]

Testing Context-Aware Software
[Figure: the general testing loop (generate test cases, execute, assess outcome, locate faults, assess adequacy, add feature) applied to a context-aware program P, extended with a context source and the activities “detect context problem” (T/F) and “repair contexts”]

Detect Context Problems
• What if there are many problematic contexts? It is time-consuming to check every context against every consistency constraint in full.
• An approach: reuse previous checking results where possible (instead of re-checking every time). [ACM TOSEM’09, ICSE’06]
• Express a constraint in first-order logic: ∀ reach in Road ( not ( ∃ cp in Checkpoint ( WithinTrans(reach, cp) ) ) )
• Bind reach to context source 1 (context stream r1, r2)
• Bind cp to context source 2 (context stream c1, c2, c3)

Detect Context Problems
WT: WithinTrans.
[Figure: evaluation tree of the constraint for reach = r1, r2 and cp = c1, c2. Every leaf WT(reach, cp) is F, so each “cp in Checkpoint” node is F, each “not” node is T, and the root “reach in Road” node is T.]
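A full evaluation of this constraint can be sketched in a few lines of Python. The sketch reads the constraint as "for every reach in Road, no cp in Checkpoint satisfies WithinTrans(reach, cp)", which is an assumption about the quantifiers. The slides do not define WithinTrans, so it is passed in as a parameter, and toy_within_trans over plain integers is a purely hypothetical stand-in.

```python
def check_constraint(road, checkpoint, within_trans):
    """Evaluate: for all reach in road, not exists cp in checkpoint
    such that within_trans(reach, cp).

    Returns (satisfied, list of violating reach contexts).
    """
    violations = [
        reach for reach in road
        if any(within_trans(reach, cp) for cp in checkpoint)
    ]
    return (not violations, violations)


def toy_within_trans(reach, cp):
    """Hypothetical stand-in for WithinTrans on integer-valued contexts."""
    return abs(reach - cp) <= 1


print(check_constraint([10, 20], [1, 5], toy_within_trans))      # no violating reach
print(check_constraint([10, 20], [1, 5, 19], toy_within_trans))  # reach 20 violates
```

Each call evaluates up to len(road) × len(checkpoint) predicate instances from scratch, which is the cost that reusing earlier results tries to avoid.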

Detect Context Problems
WT: WithinTrans.
Adding a new context c3
[Figure: the evaluation tree after c3 is added. New leaves WT(reach, c3) appear under reach = r1 and reach = r2; parts of the tree are labeled Recheck, Reuse, or Partially Reuse, and truth values on the affected paths are updated.]
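The reuse idea can be illustrated with a simplified caching sketch in Python. This is not the algorithm of the cited papers: it only remembers, for each reach context, whether some checkpoint context has already matched, so that a newly added cp needs one WithinTrans evaluation per reach instead of a full re-check. The class name, method names and the integer-based toy predicate are all hypothetical.

```python
class IncrementalChecker:
    """Caches, per reach context, whether some cp already matched it."""

    def __init__(self, within_trans):
        self.within_trans = within_trans
        self.checkpoints = []
        self.matched = {}      # reach -> cached "exists cp: WT(reach, cp)"
        self.evaluations = 0   # number of WT evaluations performed so far

    def _wt(self, reach, cp):
        self.evaluations += 1
        return self.within_trans(reach, cp)

    def add_reach(self, reach):
        # A new reach context has no cached result: check it against every cp.
        self.matched[reach] = any(self._wt(reach, cp) for cp in self.checkpoints)

    def add_checkpoint(self, cp):
        # Reuse cached results; only the new (reach, cp) pairs are evaluated.
        self.checkpoints.append(cp)
        for reach, hit in self.matched.items():
            if not hit:
                self.matched[reach] = self._wt(reach, cp)

    def satisfied(self):
        return not any(self.matched.values())


checker = IncrementalChecker(lambda reach, cp: abs(reach - cp) <= 1)
for r in (10, 20):      # r1, r2
    checker.add_reach(r)
for c in (1, 5):        # c1, c2
    checker.add_checkpoint(c)
before = checker.evaluations
checker.add_checkpoint(19)  # c3: one evaluation per reach, earlier results reused
print(checker.evaluations - before, checker.satisfied())
```

In this run, adding c3 costs two evaluations, whereas re-checking the whole constraint would cost six.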

Experimental Evaluation
• Goal: savings in execution time, miss rate
• Subject: the Shenzhen Urban Transport Planning Center application.
• Procedure:
  • Use 24 consecutive hours of real data collected from 760 taxis of the same taxi company.
  • Input these 24 hours of data to our technique and to the one without the reuse component (dubbed Re-Check).
[Charts: “Ours: more accurate”; “Ours: faster”]

Assess Adequacy
• Problem: How well do test cases exercise P with respect to context changes?
• Approaches: develop test data adequacy criteria
  • Push-mode program semantics [FSE’06, ICSE’08a]
  • Event-driven program semantics [FSE’08]
  • Pull-mode program semantics [ICSE’08b]
[Figure: context streams (r1, r2, r3 and c1, c2, c3’, c4) pass through a Context Dropping Service, which checks the consistency constraint on reach and cp, before reaching the context-aware program P]

Assess Adequacy: Estimating the Location of a Taxi
[Figure: Checkpoint Readers 0 to 3 along a road, with taxi p on the road. Context 1: the readers’ readings. Context 2: the taxi’s estimated position.]

Assess Adequacy: Location Estimation Program
[Figure: Checkpoint Readers 0 to 3 along a road, with taxi p on the road and reading values 20, 25, 30]
Estimate the position of p through nearby readers

Assess Adequacy: Location Estimation Program
[Figure: Readers 0 to 3 along a road, with item p, reading values 20, 25, 30, and the current estimated position marked]

Assess Adequacy: Location Estimation Program
[Figure: Readers 0 to 3 along a conveyor, with item p, reading values 30, 20, 10, and the last and current estimated positions marked]
Context Dropping Service
• Constraint: for the current context p0 and any old context pi, a condition relating p0.position to pi.position must hold
• Strategy: [drop] the context when the constraint is violated

Assess Adequacy: Sample Fault
Fault: the program marks the position as Position 0 only if r[0] is the smallest, instead of the largest. The other position estimations have been correctly implemented.
Expected implementation P contains
    n7: if (r[0].strength >= r[1].strength && r[0].strength >= r[2].strength)
Incorrect implementation P’ contains
    n7’: if (r[0].strength <= r[1].strength && r[0].strength <= r[2].strength)
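The difference between n7 and n7’ can be explored with a small Python sketch. It assumes, for illustration only, that the three reader strengths are given as plain numbers; the function names are hypothetical.

```python
def is_position_0_expected(strengths):
    """Predicate n7: r[0] has the largest strength among r[0..2]."""
    return strengths[0] >= strengths[1] and strengths[0] >= strengths[2]


def is_position_0_faulty(strengths):
    """Predicate n7': the comparisons are reversed (the seeded fault)."""
    return strengths[0] <= strengths[1] and strengths[0] <= strengths[2]


def reveals_fault(strengths):
    """A test input reveals the fault only if the two predicates disagree."""
    return is_position_0_expected(strengths) != is_position_0_faulty(strengths)


for readings in [(30, 20, 10), (10, 20, 30), (20, 20, 20), (20, 30, 10)]:
    print(readings, reveals_fault(readings))
```

Inputs such as (20, 20, 20) or (20, 30, 10) make both predicates agree, so test cases built only from such readings would execute n7’ without exposing the fault.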
