
Parameterizing Random Test Data According to Equivalence Classes

This talk explores using parameterized random test data generation to support the quality assurance of machine learning applications. It discusses the challenges of testing ML applications and presents a solution using a hybrid of equivalence class partitioning and random testing.


Presentation Transcript


  1. Parameterizing Random Test Data According to Equivalence Classes
  Chris Murphy, Gail Kaiser, Marta Arias
  Columbia University

  2. What is random testing? (Background; this slide is not part of the talk itself.)
  • Random testing is the notion of using “random” input to test the application
  • As opposed to using pre-determined and manually selected “equivalence classes” or “partitions”

  3. Introduction
  • We are investigating the quality assurance of Machine Learning (ML) applications
  • Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
  • Using ranking instead of classification
  • Our concern is not whether an algorithm predicts well but whether an implementation operates correctly

  4. Data Set Options
  • Real-world data sets
    • Not always accessible/available
    • May not necessarily contain the separation or combination of traits that we desire to test
  • Hand-generation of data
    • Only useful for small tests
  • Random testing
    • Limited by the lack of a reliable test oracle
    • ML applications of interest fall into the category of “non-testable programs”

  5. Motivation
  • Without a reliable test oracle, we can only:
    • Look for obvious faults
    • Consider intermediate results
    • Detect discrepancies in the specification
  • We need to restrict some properties of random test data generation

  6. Our Solution
  • Parameterized Random Test Data Generation
  • Automatically generate random data sets, but parameterized to control the range and characteristics of those random values
  • Parameterization allows us to create a hybrid between equivalence class partitioning and random testing

  7. Overview
  • Machine Learning Background
  • Data Generation Framework
  • Findings and Results
  • Evaluation and Observations
  • Conclusions and Future Work

  8. Machine Learning Fundamentals
  • Data sets consist of a number of examples, each of which has attributes and a label
  • In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label
  • In the second phase (“validation”), the model is applied to a previously-unseen data set with unknown labels to produce a classification (or, in our case, a ranking)

  9. Problems Faced in Testing
  • The testing input should be based on the problem domain
  • Need to consider a way to mimic all of the traits of the real-world data sets
  • Also need to keep in mind that we do not have a reliable test oracle

  10. Analyzing the Problem Domain
  • Consider properties of data sets in general
    • Data set size: number of attributes and examples
    • Range of values: attributes and labels
    • Precision of floating-point numbers
    • Whether values can repeat
  • Consider properties of real-world data sets in the domain of interest
    • How alphanumeric attributes are to be interpreted
    • Whether data values might be missing

  11. Equivalence Classes
  • Data sizes of different orders of magnitude
  • Repeating vs. non-repeating attribute values
  • Missing vs. no missing attribute values
  • Categorical vs. non-categorical data
  • 0/1 labels vs. non-negative integer labels
  • Predictable vs. non-predictable data sets
  • Used the data set generator to parameterize test case selection criteria

  12. How Data Are Generated
  • M attributes and N examples
  • No-repeat mode:
    • Generate a list of integers from 1 to M*N and then randomly permute them
  • Repeat mode:
    • Each value in the data set is simply a random integer between 1 and M*N
    • Tool ensures at least one set of repeating numbers
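  A minimal sketch of the two modes in Python; the function name and signature are assumptions for illustration, not the authors' tool:

    import random

    def generate_attributes(num_examples, num_attributes, allow_repeats, seed=None):
        """Sketch: build an examples-by-attributes table of integers in [1, M*N]."""
        rng = random.Random(seed)
        total = num_examples * num_attributes
        if allow_repeats:
            # Repeat mode: each cell is an independent draw from [1, M*N].
            values = [rng.randint(1, total) for _ in range(total)]
            if len(set(values)) == total:
                # Mirror the tool's guarantee of at least one repeated value.
                values[-1] = values[0]
        else:
            # No-repeat mode: permute 1..M*N so every cell is distinct.
            values = list(range(1, total + 1))
            rng.shuffle(values)
        # Reshape the flat list into one row of attributes per example.
        return [values[i * num_attributes:(i + 1) * num_attributes]
                for i in range(num_examples)]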

  13. Generating Labels
  • Specify the percentage of “positive examples” to include in the data set
    • Positive examples have a label of 1
    • Negative examples have a label of 0
  • The data generation framework guarantees that the number of positive examples matches the requested percentage exactly, even though the labels are randomly placed throughout the data set
  • Labels are never unknown/missing
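  A sketch of how that guarantee can be met, assuming a hypothetical generate_labels helper: build exactly the requested number of 1-labels, pad with 0s, and shuffle.

    import random

    def generate_labels(num_examples, pct_positive, seed=None):
        """Sketch: exactly the requested fraction of 1-labels, randomly placed."""
        rng = random.Random(seed)
        num_positive = round(num_examples * pct_positive / 100.0)
        labels = [1] * num_positive + [0] * (num_examples - num_positive)
        rng.shuffle(labels)   # random placement, exact count
        return labels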

  14. Categorical Data
  • For some alphanumeric attributes, data pre-processing is used to expand K distinct values to K attributes
    • Same as in the real-world ranking application
  • The input parameter to the data generation tool is of the format (a1, a2, ..., aK-1, aK, m)
    • a1 through aK represent the percentage distribution of those values for the categorical attribute
    • m is the percentage of unknown values
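  One plausible reading of that parameter, sketched in Python; the helper names and the value labels v1..vK are invented for illustration:

    import random

    def generate_categorical_column(num_examples, distribution, pct_missing, seed=None):
        """Sketch: one categorical attribute with K distinct values plus missing cells.

        `distribution` holds the percentages (a1, ..., aK); `pct_missing` is m,
        the percentage of unknown ('?') cells.
        """
        rng = random.Random(seed)
        values = [f"v{i + 1}" for i in range(len(distribution))]  # invented value names
        return rng.choices(values + ["?"],
                           weights=list(distribution) + [pct_missing],
                           k=num_examples)

    def expand_to_indicator_columns(column, values):
        """Sketch of the pre-processing step: expand K distinct values into K 0/1 attributes."""
        return [[1 if cell == v else 0 for v in values] for cell in column]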

  15. Data Set Generator - Parameters
  • # of examples
  • # of attributes
  • % positive examples (label = 1)
  • % missing
  • any categorical data
  • repeat/no-repeat modes
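  Bundled as a data structure, these parameters might look like the following sketch (all field names are assumptions, not the tool's); the second snippet shows the settings behind the sample on the next slide:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class GeneratorParams:
        """Sketch of one test case's parameters."""
        num_examples: int
        num_attributes: int
        pct_positive: float                 # % of examples with label = 1
        pct_missing: float                  # % of attribute values replaced by '?'
        categorical: List[Tuple] = field(default_factory=list)  # (a1, ..., aK, m) specs
        allow_repeats: bool = True          # repeat vs. no-repeat mode

    # Roughly the settings behind the sample data sets on the next slide:
    sample = GeneratorParams(num_examples=10, num_attributes=10,
                             pct_positive=40, pct_missing=20, allow_repeats=True)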

  16. Sample Data Sets
  • 10 examples, 10 attributes, 40% positive examples, 20% missing, repeats allowed (two generated data sets are shown; the last column is the label)
  27,81,88,59, ?,16,88, ?,41, ?,0
  15,70,91,41, ?, 3, ?, ?, ?,64,0
  82, ?,51,47, ?, 4, 1,99, ?,51,0
  22,72,11, ?,96,24,44,92, ?,11,1
  57,77, ?,86,89,77,61,76,96,98,1
  76,11, 4,51,43, ?,79,21,28, ?,0
   6,33, ?, ?,52,63,94,75, 8,26,0
  77,36,91, ?,47, 3,85,71,35,45,1
   ?,17,15, 2,90,70, ?, 7,41,42,0
   8,58,42,41,74,87,68,68, 1,15,1

  35, 3,20,41,91, ?,32,11,43, ?,1
  19,50,11,57,36,94, ?,96, 7,23,1
  24,36,36,79,78,33,34, ?,32, ?,0
   ?,15, ?,19,65,80,17,78,43, ?,0
  40,31,89,50,83,55,25, ?, ?,45,1
  52, ?, ?, ?, ?,39,79,82,94, ?,0
  86,45, ?, ?,74,68,13,66,42,56,0
   ?,53,91,23,11, ?,47,61,79, 8,0
  77,11,34,44,92, ?,63,62,51,51,1
  21, 1,70,14,16,40,63,94,69,83,0

  17. The Testing Framework
  • Data set generator
  • Model comparison
  • Ranking comparison: includes metrics like normalized equivalence and AUC
  • Tracing options: for generating and comparing outputs of debugging statements
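  For the ranking comparison, one standard way to compute AUC over a ranked list of 0/1 labels is the pair-counting form below; this is a generic sketch, not the framework's code, and the normalized equivalence metric is not reproduced here.

    def auc_from_ranking(ranked_labels):
        """AUC of a ranking: fraction of (positive, negative) pairs ordered correctly.

        `ranked_labels` lists the true 0/1 labels in the order the model ranked
        the examples, best first.
        """
        total_pos = sum(ranked_labels)
        total_neg = len(ranked_labels) - total_pos
        if total_pos == 0 or total_neg == 0:
            return None   # undefined without both classes
        pos_seen = 0
        correct_pairs = 0
        for label in ranked_labels:
            if label == 1:
                pos_seen += 1
            else:
                # Every positive already seen out-ranks this negative.
                correct_pairs += pos_seen
        return correct_pairs / (total_pos * total_neg)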

  18. MartiRank and SVM
  • MartiRank was specifically designed for the real-world device failure application
    • Seeks to find the sequence of attributes to segment and sort the data to produce the best result
  • SVM is typically a classification algorithm
    • Seeks to find a hyperplane that separates examples from different classes
    • SVM-Light has a ranking mode based on the distance from the hyperplane

  19. Findings
  • The testing approach and framework were developed for MartiRank and then applied to SVM
  • Only the findings most related to parameterized random testing are presented here
  • More details and case studies about the testing of MartiRank can be found in our tech report

  20. Issue #1: Repeating Values
  • One version of MartiRank did not use “stable” sorting
  • The slide's example shows two examples that share the value 3 in the attribute being sorted on:
    91,41,19, 3,57,11,20,64,0
    36,73,47, 3,85,71,35,45,1
  • The stable sort keeps these tied rows in their original relative order; the unstable sort can emit them in either order, producing a different model
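  A small illustration of why stability matters when attribute values repeat (Python's sorted() is stable, so the unstable case is simulated by shuffling the tied rows first); this is not MartiRank code:

    import random

    rows = [
        (91, 41, 19, 3, 57, 11, 20, 64, 0),
        (36, 73, 47, 3, 85, 71, 35, 45, 1),
    ]
    attr = 3   # both rows hold the value 3 in this column

    stable = sorted(rows, key=lambda r: r[attr])        # ties keep their original order
    shuffled = random.sample(rows, k=len(rows))         # order of tied rows is now arbitrary
    unstable = sorted(shuffled, key=lambda r: r[attr])  # may list the rows either way

    print(stable == unstable)   # can print False even though both lists are "sorted"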

  21. Issue #2: Sparse Data Sets
  • How to sort examples with missing attribute values was not specifically addressed in the specification
  • The slide shows the same small data set ordered three different ways:
    • sort “around” missing values
    • randomly insert missing values
    • put missing values at the end
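  Illustrative sketches of the three orderings; none of these is claimed to be MartiRank's actual behavior, and MISSING stands in for a '?' cell:

    import random

    MISSING = None   # stands in for a '?' cell

    def sort_missing_last(rows, attr):
        """Sort by the attribute, sending examples with a missing value to the end."""
        return sorted(rows, key=lambda r: (r[attr] is MISSING,
                                           0 if r[attr] is MISSING else r[attr]))

    def sort_around_missing(rows, attr):
        """Leave examples with a missing value in place; sort the rest around them."""
        present = sorted((r for r in rows if r[attr] is not MISSING),
                         key=lambda r: r[attr])
        filler = iter(present)
        return [r if r[attr] is MISSING else next(filler) for r in rows]

    def sort_missing_random(rows, attr, seed=None):
        """Sort the non-missing examples, then re-insert missing ones at random positions."""
        rng = random.Random(seed)
        result = sorted((r for r in rows if r[attr] is not MISSING),
                        key=lambda r: r[attr])
        for r in (r for r in rows if r[attr] is MISSING):
            result.insert(rng.randrange(len(result) + 1), r)
        return result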

  22. Issue #3: Categorical Data
  • Discovered that refactoring had introduced a bug into an important calculation
    • A global variable was being used incorrectly
  • This bug did not appear in any of the tests that used only repeating values or only missing values
  • However, categorical data necessarily has repeating values and may also have missing values

  23. Issue #4: Permuted Input Data
  • Randomly permuting the input data led to different models (and then different rankings) generated by SVM-Light
  • Caused by “chunking” the data for use by an approximating variant of the optimization algorithm
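  A sketch of the kind of check that surfaces this: train on the data as given and on a random permutation of the same rows, then compare the rankings. train_model and rank are hypothetical stand-ins for invoking SVM-Light.

    import random

    def rankings_agree(examples, labels, train_model, rank, seed=0):
        """Sketch: does a permuted copy of the training data yield the same ranking?"""
        rng = random.Random(seed)
        order = list(range(len(examples)))
        rng.shuffle(order)
        permuted_examples = [examples[i] for i in order]
        permuted_labels = [labels[i] for i in order]

        baseline = rank(train_model(examples, labels), examples)
        permuted = rank(train_model(permuted_examples, permuted_labels), examples)
        return baseline == permuted   # False is the symptom described above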

  24. Observations
  • Parameterized random testing allowed us to isolate the traits of the data sets
  • These traits may appear in real-world data but not necessarily in the desired combinations
  • An algorithm's failure to address specific data set traits can lead to discrepancies

  25. Related Work – Machine Learning
  • There has been much research into applying Machine Learning techniques to software testing, but not the other way around
  • Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness

  26. Related Work – Random Testing
  • Parameterization generally refers to specifying the data type or range of values
  • Our work differs from that of Thévenod-Fosse et al. ['91] on “structural statistical testing”, which focuses on path selection and coverage testing, not system testing
  • It also differs from “uniform statistical testing”: although we do select random data over a uniform distribution, we parameterize it according to equivalence classes

  27. Limitations and Future Work
  • Test suite adequacy for coverage is not addressed or measured
  • Could also consider non-deterministic Machine Learning algorithms
  • Can also include mutation testing to gauge the effectiveness of the data sets
  • Should investigate creating large data sets that correlate to real-world data

  28. Conclusion
  • Our contribution is an approach that combines parameterization and randomness to control the properties of very large data sets
  • Critical for limiting the scope of individual tests and for pinpointing specific issues related to the traits of the input data

  29. Parameterizing Random Test Data According to Equivalence Classes
  Chris Murphy, Gail Kaiser, Marta Arias
  Columbia University
