
Analysis & Evaluation of Data


Presentation Transcript


  1. Analysis & Evaluation of Data
  The collected data should be:
  • Reliable: no or very little error is committed in the gathering and tabulation of the data
  • Accurate: the data maintains the desired degree of precision
  • Valid: the data is applicable to the issue and attribute of interest

  2. Sample Considerations
  We have collected "error data" on Requirements Inspection, Design Inspection, and Unit Testing and want to analyze it for the "quality" attribute.
  • Potential reliability problem: did we collect and count the data correctly in all three cases?
  • Potential accuracy problem: did we use the same level of precision (e.g., the same severity breakdown) in each case?
  • Potential validity problem: is the number of "defects" a valid quality attribute? Do these data reflect a measure of the extent of the defects committed (extent = number, severity, complexity of fix, etc.)?

  3. Some Common Analysis Methods of Data
  • Distribution of data
  • Centrality and dispersion
  • Moving averages
  • Data correlation
  • Normalization of data

  4. 1. Distribution of Data
  • We often look at a scatter diagram of the raw data and pick out the "outliers".
  • We count the frequency of occurrences to get a distribution, which gives a view of the "shape" and the "range" of the data. For example:
    • severity 1: 7 defects
    • severity 2: 24 defects
    • severity 3: 26 defects
    • severity 4: 88 defects
    • severity 5: 92 defects
  • The range is from 7 defects to 92 defects.
  • The shape is not that important in this case; the skew is towards the less severe defects.
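As a minimal sketch, the tally above can be reproduced in Python; the per-defect severity list below is reconstructed from the slide's counts:

```python
from collections import Counter

# Reconstructed raw data: one severity code per recorded defect,
# matching the counts on the slide.
severities = [1] * 7 + [2] * 24 + [3] * 26 + [4] * 88 + [5] * 92

counts = Counter(severities)
for severity in sorted(counts):
    print(f"severity {severity}: {counts[severity]} defects")

frequencies = counts.values()
print(f"range: {min(frequencies)} to {max(frequencies)} defects")
```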

  5. Common Distributions of Data
  There are some "recognizable" distributions:
  • Normal
  • Linear
  • Logarithmic
  • Exponential
  • Negative exponential
  [Figure: sketch of each distribution's characteristic shape]

  6. 2. Centrality and Dispersion
  • Use centrality to compare two sets of data distributions:
    • mean
    • median
  [Figure: example distributions annotated with their mean and median values]
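A quick sketch of why the two measures can differ, using the standard library's statistics module on hypothetical fix-time data with one outlier:

```python
import statistics

# Hypothetical fix times (hours); one outlier skews the distribution.
fix_times = [1, 2, 2, 3, 3, 4, 40]
print(f"mean   = {statistics.mean(fix_times):.1f}")    # 7.9: pulled up by the outlier
print(f"median = {statistics.median(fix_times):.1f}")  # 3.0: robust to the outlier
```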

  7. Variance & Standard Deviation
  • A measure of dispersion from the central value.
  • Suppose we measured the number of defects (xi) from n similar-sized functional areas:
    • the mean or central value is calculated as Xmean = ∑(xi) / n
    • the variance = [ ∑( (xi - Xmean)**2 ) ] / n
    • Std Dev = SQRT(variance)
  • For a normal distribution, 1 std dev around the mean captures about 68% of the sample.
  • Given a new function of similar size, we can measure the number of defects found and compare it against the mean and the 1 std dev band of the earlier group.
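The formulas translate directly into a short Python sketch; the defect counts below are hypothetical:

```python
import math

def mean_variance_std(xs):
    """Population mean, variance, and std dev, as defined on the slide."""
    n = len(xs)
    x_mean = sum(xs) / n
    variance = sum((x - x_mean) ** 2 for x in xs) / n
    return x_mean, variance, math.sqrt(variance)

# Hypothetical defect counts from n similar-sized functional areas.
defects = [4, 6, 5, 7, 3, 6, 5, 8, 4, 5]
mean, var, std = mean_variance_std(defects)
print(f"mean = {mean:.2f}, variance = {var:.2f}, std dev = {std:.2f}")

# A new similar-sized function is compared against mean +/- 1 std dev,
# which captures about 68% of a normal sample.
new_count = 9
print("within 1 std dev" if abs(new_count - mean) <= std else "outside 1 std dev")
```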

  8. Control Chart
  [Figure: control chart of observations around the center line Mean = 5.3, with upper and lower limits at 1 std dev above and below the mean]
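A sketch of the check a control chart encodes: flag any observation outside mean +/- 1 std dev. The mean of 5.3 comes from the slide; the std dev and sample values are assumed for illustration:

```python
mean, std = 5.3, 1.4  # mean from the slide; std dev assumed for illustration
samples = [4.1, 5.0, 7.9, 5.6, 2.9, 5.2]  # hypothetical observations

for i, x in enumerate(samples, start=1):
    status = "OUT of control band" if abs(x - mean) > std else "ok"
    print(f"sample {i}: {x:4.1f}  {status}")
```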

  9. 3. Moving Average - a "Smoothing" Technique
  [Figure: a time series plotted with its moving average; jumps in the raw data are smoothed out, while one special jump remains visible]
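A minimal sketch, assuming a simple (unweighted) moving average over a fixed window; the weekly defect counts are hypothetical:

```python
def moving_average(xs, window=3):
    """Average each consecutive window; jumps are smoothed across windows."""
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

# Hypothetical weekly defect counts with a jump after week 4.
weekly = [4, 5, 4, 6, 12, 11, 12, 13]
print(moving_average(weekly))
# [4.33, 5.0, 7.33, 9.67, 11.67, 12.0] (rounded): the jump is spread out
```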

  10. 4. Correlation
  • Correlation only addresses whether there is a "relationship".
  • It does not address "cause and effect".
  • Example:
    • the size of a module may correlate to the number of defects,
    • but the size of the module may or may not be the cause.
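One common way to quantify such a relationship is the Pearson correlation coefficient; the slide does not name a specific statistic, so this is an illustrative choice. The data reuses the size/defect pairs from the regression example below:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength of linear relationship, -1 to 1.
    It says nothing about cause and effect."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Module sizes (loc) vs. defect counts, from the regression example below.
sizes = [150, 230, 500, 730, 1000]
defects = [2, 3, 4, 7, 9]
print(f"r = {pearson_r(sizes, defects):.3f}")  # ~0.986: strong relationship
```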

  11. Linear Relationship
  • A linear equation has the form Y = a + bX, where:
    • 'b' is the slope, and
    • 'a' is the Y intercept.
  [Figure: scatter plot of points clustered along a rising straight line, Y vs. X]

  12. Least Squares Linear Regression
  • A method of estimating the linear relationship between the Y variable and the X variable in the form Y = a + bX, by minimizing the (squared vertical) distance of the Y coordinates from the fitted line.
  • We can estimate the parameters a and b as follows:
    • b = [ ∑(XY) - (1/n)(∑X)(∑Y) ] / [ ∑(X**2) - (1/n)(∑X)**2 ]
    • this b estimate gives the same value as the one shown in the book
    • a = Yave - b * Xave
    • where X is each X observation and Xave is the average of the X's (likewise Yave for the Y's)

  13. Least Squares Linear Regression - Example
  • (size, defects): (150, 2); (230, 3); (500, 4); (730, 7); (1000, 9)
  • Xs: 150, 230, 500, 730, 1000; ∑X = 2610
  • X**2: 22,500; 52,900; 250,000; 532,900; 1,000,000; ∑(X**2) = 1,858,300
  • Ys: 2, 3, 4, 7, 9; ∑Y = 25
  • XY: 300, 690, 2000, 5110, 9000; ∑(XY) = 17,100
  • b = [17,100 - (1/5)(2610)(25)] / [1,858,300 - (1/5)(2610)**2] = 4050 / 495,880 ≈ .0081
  • a = 25/5 - (.0081)(2610/5) = 5 - 4.23 ≈ .77
  • The least squares regression line is: Y = .77 + .0081 X
  • Check: plug in X = 150: (.0081)(150) + .77 = 1.22 + .77 = 1.99, close to the observed 2!
  • The fitted line is more accurate for interpolation than for extrapolation.
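The same computation as a Python sketch, implementing the b and a formulas from the previous slide; it carries more precision than the slide's truncated .0081:

```python
def least_squares(xs, ys):
    """Estimate a and b in Y = a + bX using the slide's formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * (sum_x / n)
    return a, b

# (size, defects) data from the example above.
sizes = [150, 230, 500, 730, 1000]
defects = [2, 3, 4, 7, 9]
a, b = least_squares(sizes, defects)
print(f"Y = {a:.2f} + {b:.4f} X")                    # Y = 0.74 + 0.0082 X
print(f"prediction at size 150: {a + b * 150:.2f}")  # 1.96, close to 2
```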

  14. 5. Normalization
  • Pure data gives only a 1-dimensional comparison:
    • program A: 52 person days to complete
    • program B: 33 person days to complete
    • program C: 64 person days to complete
    • 64 > 52 > 33; what else can we say? (We suspect the programs differ in size.)
  • Normalization gives an equalizing factor in terms of another attribute:
    • program A: 52 person days for 5000 loc, or 96.1 loc / person day
    • program B: 33 person days for 3000 loc, or 90.9 loc / person day
    • program C: 64 person days for 6000 loc, or 93.7 loc / person day
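A minimal sketch of this normalization using the slide's figures (Python's rounding yields 96.2 and 93.8 where the slide truncates to 96.1 and 93.7):

```python
# (person days, loc) per program, from the slide.
programs = {"A": (52, 5000), "B": (33, 3000), "C": (64, 6000)}

for name, (person_days, loc) in programs.items():
    # Normalizing effort by size makes the programs comparable.
    print(f"program {name}: {loc / person_days:.1f} loc / person day")
```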
