Exploiting the Essential Assumptions of Analogy-based Effort Estimation Syed Shah A Zaman (6705130), Shakil Mahmud (7015384) Submitted to Professor Shervin Shirmohammadi in partial fulfillment of the requirements for the course ELG 5100
Roadmap • Effort estimation and its importance • Different methods of effort estimation • Analogy based effort estimation • TEAK and its steps • Conclusion & Future Work
Effort estimation “Software development effort estimation is the process of predicting the most realistic amount of effort required to develop or maintain software” • Importance of effort estimation: • Tracking velocity • Iteration scope • Prioritizing • Release planning
Effort estimation methods • Three subcategories: • Human Centric Techniques (e.g. Expert judgment) • Algorithmic models (e.g. COCOMO) • Machine learning (e.g. Analogy based estimation or ABE)
Analogy Based Effort Estimation “Projects that are similar with respect to project features will also be similar with respect to project effort” • Five basic steps: • Select the historical project dataset • Choose the project features for similarity measurement • Measure the similarities • Identify the most similar projects • Adapt the efforts of the most similar projects to generate the effort estimate
ABE0 or “Baseline” ABE: Basic features • A table holds all the training projects • Each row is one project • Each column is an independent or dependent variable (feature) of each project (e.g. duration, effort); the choice of variables is flexible • Input: a test project • Output: an effort estimate for that project • A scaling measure normalizes features so that test and training projects are compared on the same scale • Feature weighting reflects the relative influence of each feature
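The baseline ABE0 pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the min-max scaling choice, and k=3 are all assumptions made here for clarity.

```python
# ABE0 sketch (hypothetical data layout): scale features, find the k
# nearest training projects, and average their efforts.
import math

def scale(projects):
    """Min-max scale each feature column to [0, 1] so no feature dominates."""
    n_feats = len(projects[0]["features"])
    lo = [min(p["features"][i] for p in projects) for i in range(n_feats)]
    hi = [max(p["features"][i] for p in projects) for i in range(n_feats)]
    for p in projects:
        p["features"] = [
            (v - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
            for i, v in enumerate(p["features"])
        ]
    return projects

def abe0_estimate(train, test_features, k=3):
    """Mean effort of the k training projects closest to the test project."""
    dist = lambda f: math.sqrt(sum((a - b) ** 2 for a, b in zip(f, test_features)))
    nearest = sorted(train, key=lambda p: dist(p["features"]))[:k]
    return sum(p["effort"] for p in nearest) / k
```

Averaging the k nearest efforts is the simplest adaptation step; the number k and the adaptation function are exactly the design choices TEAK later revisits.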
Measuring Similarity “Measuring the closeness between two data objects in n-dimensional feature space” • Objective: rank the cases in the dataset by similarity and use the k nearest cases • Most common method: the Euclidean distance metric • Example: two points X(x1, x2, …, xn) and Y(y1, y2, …, yn) • Unweighted Euclidean distance: d(X, Y) = sqrt( Σ (xi − yi)² ) • Weighted Euclidean distance: d(X, Y) = sqrt( Σ wi (xi − yi)² ), where wi is the weight of feature i
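Both distance variants reduce to one function, since the unweighted case is just all weights equal to 1. A short sketch:

```python
import math

def euclidean(x, y, w=None):
    """Weighted Euclidean distance between feature vectors x and y.
    w=None gives the unweighted case (all weights 1)."""
    if w is None:
        w = [1.0] * len(x)
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))
```

Setting a feature's weight to 0 removes it from the similarity measurement entirely, which is how feature weighting and feature selection connect.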
TEAK: Test Estimation Assumption Knowledge • It uses an “easy path” heuristic that finds the situations that confuse estimation and removes them
Select prediction system • There are many prediction systems • The authors chose ABE because: • It is widely studied • It works even if the domain data are sparse • Unlike other predictors, it makes no assumptions about data distributions or an underlying model • When the local data do not support standard algorithmic/parametric models like COCOMO, ABE can still be applied
Identify essential assumption(s) • Assumption one: locality implies homogeneity • If two projects are closer in feature space, they should also be more similar in effort • To avoid confusing the estimation, do not choose projects from regions with higher effort variance • Variance is defined as: σ² = (1/n) Σ (effortᵢ − mean effort)²
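The variance definition above is the standard population variance applied to the effort values of a neighborhood, as in this small sketch:

```python
def effort_variance(efforts):
    """Population variance sigma^2 of the effort values in a neighborhood."""
    mean = sum(efforts) / len(efforts)
    return sum((e - mean) ** 2 for e in efforts) / len(efforts)
```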
Identify assumption violations • We need a way to compare the variance between small-k and large-k estimates, so clustering is needed: a GAC (Greedy Agglomerative Clustering) tree is formed • Using the GAC tree, finding the k nearest neighbors of a test project can be implemented with the following procedure, called TRAVERSE: • 1. Place the test project at the root of the tree. • 2. Move the test project to the nearest child (where “nearest” is defined by Euclidean distance). • 3. Go to step 2. • Any point in the tree where moving to a child increases the effort variance is identified as a violation
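The GAC construction and the TRAVERSE walk can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: clusters are merged by centroid distance, and `Node`, `gac`, and `traverse` are names invented here.

```python
import math

class Node:
    """A GAC tree node holding the (features, effort) pairs beneath it."""
    def __init__(self, projects, left=None, right=None):
        self.projects = projects
        self.left, self.right = left, right
        feats = [f for f, _ in projects]
        self.centroid = [sum(col) / len(col) for col in zip(*feats)]

    def variance(self):
        efforts = [e for _, e in self.projects]
        mean = sum(efforts) / len(efforts)
        return sum((e - mean) ** 2 for e in efforts) / len(efforts)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gac(projects):
    """Greedily merge the two closest clusters until one root remains."""
    nodes = [Node([p]) for p in projects]
    while len(nodes) > 1:
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: dist(nodes[ij[0]].centroid, nodes[ij[1]].centroid),
        )
        a, b = nodes[i], nodes[j]
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append(Node(a.projects + b.projects, a, b))
    return nodes[0]

def traverse(node, test_features):
    """Walk toward the nearest child; flag nodes where variance increases."""
    violations = []
    while node.left and node.right:
        child = min((node.left, node.right),
                    key=lambda c: dist(c.centroid, test_features))
        if child.variance() > node.variance():
            violations.append(child)   # locality assumption violated here
        node = child
    return node, violations
```

Each internal node summarizes the effort variance of everything below it, so one downward walk from the root is enough to spot every point where "getting closer" fails to mean "getting more homogeneous".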
Remove violations • After identifying a violating node, check whether it satisfies the selected pruning policy, among: • variance more than α times the parent variance; • variance more than β·max(σ²); • variance more than Rγ·max(σ²), where R is a random number, 0 < R < 1 • If so, prune that subtree
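The three pruning policies can be combined into one check, sketched below. The threshold values α, β, γ are illustrative placeholders (the paper tunes its own), and the third policy's "Rγ" is read here as R raised to the power γ, which is an assumption about the slide's notation.

```python
import random

def violates_policy(node_var, parent_var, max_var,
                    alpha=2.0, beta=0.5, gamma=0.3, r=None):
    """True if a subtree's variance triggers any of the three pruning
    policies. alpha/beta/gamma are illustrative defaults, not the
    paper's tuned values; "R gamma" is interpreted as r ** gamma."""
    if r is None:
        r = random.random()  # random R with 0 < R < 1
    return (node_var > alpha * parent_var
            or node_var > beta * max_var
            or node_var > (r ** gamma) * max_var)
```

A subtree that passes the check is cut from the GAC tree, so later TRAVERSE walks can no longer descend into that high-variance region.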
[Figure: example GAC tree — the TEST project is placed at the root ABCDEF, whose children are the subtrees ABCD and EF; ABCD splits into CD and AB, and the leaves are the individual projects C, D, A, B, E, F]
Conclusion • Higher variance “confuses” estimation • TEAK does not consider variance alone: TRAVERSE2 moves the test project away from regions of higher variance and toward regions with similar features • Augmenting nearest-neighbor algorithms with variance avoidance outperforms plain nearest neighbor • In ABE, a brute-force search is exhaustive; TEAK's subtree pruning reduces CPU consumption
Future Work • Use the easy path heuristic to learn feature weights • Explore alternatives to GAC • Improve the pruning policy by examining more datasets
References • Menzies et al., “Exploiting the Essential Assumptions of Analogy-based Effort Estimation,” IEEE Transactions on Software Engineering, vol. 38, no. 2, March-April 2012 • J. Wen et al., “Improve Analogy Based Software Estimation using Principle Components Analysis and Co-relation Weighting,” Proc. 16th Asia Pacific Software Engineering Conference, 2009, pp. 179-186 • D. Baker, “A Hybrid Approach to Expert and Model-Based Effort Estimation,” master's thesis, LCSEE, West Virginia Univ., http://bit.ly/hWDEfU, 2007 • Shepperd et al., “Effort Estimation using Analogy,” Proc. 18th Intl. Conf. on Software Engineering, 1996, pp. 170-178 • D. Beeferman and A. Berger, “Agglomerative Clustering of a Search Engine Query Log,” Proc. 6th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 2000, pp. 407-416 • Li et al., “A Study of Project Selection and Feature Weighting for Analogy Based Software Cost Estimation,” J. Systems and Software, vol. 82, pp. 241-252, 2009
Thank You • Questions?