2cee: 21st Century Effort Estimation Methodology - Our Journey, Lessons Learned, and Key Findings

“2cee”: A 21st Century Effort Estimation Methodology
Jairus Hihn (jhihn@jpl.nasa.gov), Karen Lum (ktlum@jpl.nasa.gov), Tim Menzies (tim@menzies.us), Dan Baker (dbaker6@mix.wvu.edu)
22nd International Forum on COCOMO and Systems/Software Cost Modeling (2007)

Our Journey
• It quickly became apparent in the early stages of our research task that there was a major disconnect between the techniques used by estimation practitioners and the numerous ideas being addressed in the research community
• It also became clear that many fundamental estimation questions were not being addressed
  • What is a model’s real estimation uncertainty?
  • How many records are required to calibrate?
    • Answers have varied from 10-20 just for intercept and slope
  • If we do not have enough data, what is the impact on model uncertainty?
  • Data is expensive to collect and maintain, so we want to keep cost drivers and effort multipliers as few as possible
    • But what are the right ones?
  • When should we build domain-specific models?
  • What are the best functional forms?
  • What are the best ways to tune/calibrate a model?

Our Journey Continued
• Data mining techniques provided us with the rigorous tool set we needed to explore the many dimensions of the problem we were addressing in a repeatable manner
  • Different calibration and validation datasets
  • Analyze standard and non-standard models
  • Perform exhaustive searches over all parameters and records in order to guide data pruning
    • Rows (stratification)
    • Columns (variable reduction)
  • Measure model performance by multiple measures
• We have even been able to determine which performance measures are best

Some Things We Learned Along the Way

Local Calibration Does Not Always Improve Performance
• For the NASA data set, Local Calibration (LC), i.e. re-estimating only a and b, does not produce the ‘best’ model
• A more thorough analysis is required, including reducing the number of variables
• Effort models were learned via either standard LC or COSEEKMO
• The top plot shows the number of projects in 27 subsets of our two data sources
• The middle and bottom plots show the standard deviation and mean in performance error
• Data subsets are sorted by the error’s standard deviation
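As a minimal sketch of what "re-estimating only a and b" can look like, the following fits the two parameters by least squares in log space, assuming the usual COCOMO form effort = a * KSLOC^b * (product of effort multipliers). The record layout, the function name `local_calibration`, and the data are hypothetical illustrations, not taken from the slides or from 2cee itself.

```python
import math

def local_calibration(records):
    """Fit a and b in effort = a * ksloc**b * em_product by least squares in log space.

    records: list of (ksloc, em_product, actual_effort) tuples (hypothetical layout).
    """
    xs = [math.log(ksloc) for ksloc, _, _ in records]
    # Divide out the effort multipliers so that only a and b remain to be fitted.
    ys = [math.log(effort / em) for _, em, effort in records]
    n = len(records)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                         # slope of the log-log fit
    a = math.exp(mean_y - b * mean_x)     # intercept, mapped back from log space
    return a, b

# Made-up records generated from a = 3.0, b = 1.1 so the fit can be checked.
records = [(k, em, 3.0 * k ** 1.1 * em)
           for k, em in [(10, 1.0), (50, 1.2), (120, 0.9), (300, 1.1)]]
a, b = local_calibration(records)
print(round(a, 3), round(b, 3))
```

Because only two parameters are tuned, the cost drivers themselves stay fixed, which is one way to read the slide's point that LC alone may not be enough.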

Stratification Does Not Always Improve Performance
• Stratification does not always improve model performance
  • Results show it is 50-50
• The main implication is that one must really know one’s data, as there is no general method for determining the best approach to model calibration
• The plots show mean performance error (i.e. |(predicted − actual)|/actual) based on 30 experiments with each subset
• The dashed horizontal lines show the error rate of models learned from all data from the two sources
• The crosses show the mean error performance seen in models learned from subsets of that data
• Crosses below/above the lines indicate models performing better/worse (respectively) than models built from all the data
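The error measure quoted above, |(predicted − actual)|/actual, can be computed as in the sketch below. The data are made up, and the comparison of mean against median is included only to illustrate why the Key Research Findings slide prefers the median when there are outliers; none of the names come from the 2cee tool.

```python
from statistics import mean, median

def mre(predicted, actual):
    """Magnitude of relative error: |predicted - actual| / actual."""
    return abs(predicted - actual) / actual

# Hypothetical (predicted, actual) pairs; the last one is a deliberate outlier.
pairs = [(110, 100), (90, 100), (105, 100), (95, 100), (400, 100)]
errors = [mre(p, a) for p, a in pairs]

mean_mre = mean(errors)      # pulled upward by the single outlier
median_mre = median(errors)  # barely affected by it
print(mean_mre, median_mre)
```

With one bad prediction out of five, the mean error is several times the median, so a ranking of models by mean error can be decided by a single record.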

Cost Driver Instability
• The bottom line is that we have way too many cost drivers in our models!
• Furthermore, which smaller set is best varies across different domains and stratifications
• The cost drivers that are unlikely to improve model performance are pcap, vexp, lexp, modp, tool, sced
• It is expected for more contemporary data that stor and time would drop out because there are fewer computer constraints these days, and modp may become more significant

Some Good News
• Physical SLOC always loads as significant with no language adjustment
• The standard functional form is virtually always selected, as indicated by the non-standard model M5P being selected only once
• The ‘out-of-the-box’ version of COCOMO 81 is almost always the best model on the original COCOMO 81 data
  • View this as a sanity check on our methodology
• However, for the NASA93 data:
  • sometimes one can use the model right out of the box
  • sometimes local calibration is sufficient
  • sometimes a full regression analysis needs to be performed to obtain optimal results
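For reference, the standard COCOMO functional form is usually written as below. This is the textbook form, offered as an assumption about what the slide's figure showed; a and b are the calibration parameters and EM_i are the effort multipliers. Taking logarithms gives a model that is linear in its parameters, which is why it can be fitted by ordinary linear regression.

```latex
\text{effort} = a \cdot \text{KSLOC}^{\,b} \cdot \prod_{i} EM_i
\qquad\Longrightarrow\qquad
\ln(\text{effort}) = \ln a + b \,\ln(\text{KSLOC}) + \sum_{i} \ln EM_i
```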

Key Research Findings
• Our models have too many inputs
  • Measures of RE go up with over-specified models
• Median measures of error, not mean or Pred, should be used to compare models
  • There is an instability issue due to the small data sets with significant outliers, which makes it difficult to determine which estimation model or calibration is best
  • Mann-Whitney U test
• Manual stratification does not lead to the ‘best’ model
  • E.g. a combination of flight SW and Class B ground produces a ‘better’ model than just selecting all your flight records and doing LC
  • Nearest Neighbor searches for analogous records based on your current project model inputs
• No single approach is always best, but some combination of the following always wins
  • LC
  • Column Pruning
  • Nearest Neighbor
• Which is best is determined case-by-case
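The nearest-neighbor idea above (finding analogous records from the current project's model inputs) can be sketched as follows. This is an illustrative assumption, not the 2cee implementation: it uses plain Euclidean distance over whichever inputs the new project supplies, and the record fields, values, and function name are hypothetical.

```python
import math

def nearest_neighbors(project, history, k):
    """Return the k historical records closest to `project`.

    Distance is Euclidean over only the inputs present in `project`,
    so a partially specified project can still be matched.
    """
    def distance(record):
        return math.sqrt(sum((record["inputs"][name] - value) ** 2
                             for name, value in project.items()))
    return sorted(history, key=distance)[:k]

# Hypothetical historical records (log of size plus two cost driver values).
history = [
    {"name": "A", "inputs": {"log_ksloc": 2.3, "cplx": 1.00, "rely": 1.15}},
    {"name": "B", "inputs": {"log_ksloc": 3.9, "cplx": 1.15, "rely": 1.00}},
    {"name": "C", "inputs": {"log_ksloc": 4.2, "cplx": 1.30, "rely": 1.40}},
    {"name": "D", "inputs": {"log_ksloc": 5.7, "cplx": 0.85, "rely": 0.88}},
]
project = {"log_ksloc": 4.0, "cplx": 1.15}  # 'rely' not yet known

analogues = nearest_neighbors(project, history, k=2)
names = [record["name"] for record in analogues]
print(names)
```

The selected records would then serve as the calibration set in place of a manually chosen stratum. In practice the inputs would likely need to be put on comparable scales before distances are taken; that step is omitted here.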

2cee
• 21st Century Estimation Environment
• Just Born: released October 2, 2007
• Result of four years of research using machine learning techniques to study model calibration and validation techniques
• Probabilistic
• Key Features:
  • Dynamic calibration using variable reduction and nearest neighbor search
  • Can be used as a model analysis tool, a calibration tool, and/or an estimation tool
  • Can estimate with partial inputs
  • Uses N-fold cross validation with N equal to the number of records (i.e. leave-one-out cross validation)
  • Uses median, not mean, to evaluate model performance
  • Runs in Windows, coded in Visual Basic
• Will be running it in parallel with core tools over the next year
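Leave-one-out cross validation scored by the median error, as listed in the key features, can be sketched as below. To stay short, the sketch assumes a model with only size as input (effort = a * KSLOC^b, effort multipliers omitted); the helper names and the data are hypothetical and do not reflect the tool's Visual Basic code.

```python
import math
from statistics import median

def fit(records):
    """Least-squares fit of effort = a * ksloc**b in log space."""
    xs = [math.log(ksloc) for ksloc, _ in records]
    ys = [math.log(effort) for _, effort in records]
    n = len(records)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - b * mean_x), b

def leave_one_out_errors(records):
    """Hold out each record in turn, fit on the rest, and score the held-out record."""
    errors = []
    for i, (ksloc, actual) in enumerate(records):
        training = records[:i] + records[i + 1:]
        a, b = fit(training)
        predicted = a * ksloc ** b
        errors.append(abs(predicted - actual) / actual)
    return errors

# Made-up (ksloc, effort) records; the last one is a deliberate outlier.
records = [(10, 35), (25, 100), (60, 260), (120, 600), (300, 1500), (45, 900)]
errors = leave_one_out_errors(records)
median_error = median(errors)
print(median_error)
```

Every record is used once as a test case and never appears in its own training set, which makes the procedure usable on the small data sets discussed here.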

2CEE (workflow diagram; boxes listed in the order extracted)
• Use Predefined COCOMO Coefficients
• Load Historical Data
• Optionally Use Manual Stratification
• Optionally Use Manual Or Automatic Feature Selection
• Define Project Ranges
• Full Local Calibration
• Bootstrapped Local Calibration
• Nearest Neighbour Local Calibration
• Monte Carlo Project Instances
• Produce Range of COCOMO Estimates

2CEE Steps
• Define Model Calibration
• Evaluate with Cross Validation
• Define Project Ranges
• Monte Carlo Estimates
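The last two steps (define project ranges, then produce Monte Carlo estimates) can be illustrated with the sketch below. It assumes that each uncertain input is drawn uniformly from its range and that the calibrated a and b are fixed; the sampling distributions actually used by 2cee are not stated in the slides, and all names and numbers are hypothetical.

```python
import random
from statistics import median

def monte_carlo_estimates(a, b, ksloc_range, em_ranges, trials, seed=0):
    """Sample project instances from the given ranges and return sorted effort estimates."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    estimates = []
    for _ in range(trials):
        ksloc = rng.uniform(*ksloc_range)
        em_product = 1.0
        for low, high in em_ranges.values():
            em_product *= rng.uniform(low, high)
        estimates.append(a * ksloc ** b * em_product)
    return sorted(estimates)

estimates = monte_carlo_estimates(
    a=3.0, b=1.1,                      # assumed calibration result
    ksloc_range=(40, 60),              # project size is only known to a range
    em_ranges={"cplx": (1.0, 1.3), "rely": (0.88, 1.15)},
    trials=1000,
)
# Roughly the 5th percentile, the median, and the 95th percentile.
low, mid, high = estimates[50], median(estimates), estimates[949]
print(round(low), round(mid), round(high))
```

Reporting a range instead of a single number is what makes the estimate probabilistic: the spread between the low and high values reflects how uncertain the project inputs are.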

2cee Provides Insight into Model Performance and Tuning
• E.g. “officially”, COCOMO’s tuning parameters vary over
  • 2.5 <= a <= 2.94
  • 0.91 <= b < 1.01
• Which is nothing like what we see with real NASA data,
  • 3.5 <= a <= 14
  • 0.65 <= b <= 1
• There are many outliers in our data

Karen will be available at the tool fair. Stop in and take a look under the hood.

Bibliography
Current Research Publications
• Selecting Best Practices for Effort Estimation, IEEE Transactions on Software Engineering, Nov 2006. (Menzies, Chen, Hihn, Lum)
• Evidence-Based Cost Estimation for Better-Quality Software, IEEE Software, July/August 2006. (Menzies and Hihn)
• Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?, Proceedings of the ISPA International Conference 2006, Seattle, WA. (Lum, Hihn, Menzies) (Best Paper Award)
• Simple Software Cost Analysis: Safe or Unsafe?, Proceedings of the International Workshop on Predictor Models in Software Engineering (PROMISE 2005), St. Louis, MO, 14 June 2005. (Menzies, Port, Hihn, Chen)
• Feature Subset Selection Improves Software Cost Estimation, (PROMISE 2005), St. Louis, MO, 14 June 2005. (Chen, Menzies, Port, Boehm)
• Validation Methods for Calibrating Software Effort Models, ICSE 2005 Proceedings, St. Louis, MO, May 2005. (Menzies, Port, Hihn, Chen)
• Specialization and Extrapolation of Software Cost Models, Proceedings of the Automation in Software Engineering Conference, Nov 2005. (Menzies, Chen, Port, Hihn)
• Finding the Right Data for Software Cost Modeling, IEEE Software, Nov/Dec 2005. (Chen, Menzies, Port, Boehm)

State of the Art Best Practice
The following is a comprehensive list of best practices based on an extensive review of the literature. Our proposed methodology, 2cee, addresses the practices designated in green on the original slide.
• According to Jorgensen [2], expert-based best practices include:
  • Evaluate estimation accuracy, but avoid high evaluation pressure;
  • Avoid conflicting estimation goals;
  • Ask the estimators to justify and criticize their estimates;
  • Avoid irrelevant and unreliable estimation information;
  • Use documented data from previous development tasks;
  • Find estimation experts with relevant domain background;
  • Estimate top-down and bottom-up, independently of each other;
  • Use estimation checklists;
  • Combine estimates from different experts and estimation strategies;
  • Assess the uncertainty of the estimate;
  • Provide feedback on estimation accuracy; and,
  • Provide estimation training opportunities.
• According to Boehm [3], [4]; Chulani [5], [6]; Kemerer [7]; Stutzke [8]; Shepperd [9]; our own work [10]–[12]; and a recent tutorial at the 2006 International Conference of the International Society of Parametric Analysts [13], best practices for model-based estimation include at least the following:
  • Reuse regression parameters learned from prior projects on new projects;
  • Log-transforms on costing data before performing linear regression to learn log-linear effort models;
  • Model-tree learning to generate models for non-linear relationships;
  • Stratification, i.e. given a database of past projects and a current project to be estimated, learn models only from the records of similar projects;
  • Local calibration, i.e. tune a general model to the local data via a small number of special tuning parameters;
  • Hold-out experiments for testing the learned effort model [10];
  • Assessing effort model uncertainty via the performance deviations seen during the hold-out experiments of the previous item;
  • Variable subset selection methods for minimizing the size of the learned effort model [11], [12], [14], [15].
