How to Find Relevant Data for Effort Estimation?


Presentation Transcript


  1. How to Find Relevant Data for Effort Estimation? 毛可, 2012-03-28

  2. Authors: Ekrem Kocaguneli (Ph.d@WVU.LCSEE), Tim Menzies. Specialties: Data Mining, Effort Estimation. Publications: 11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK); 11' TSE: On the Value of Ensemble Effort Estimation; 11' ESEM: –; 10' ASE: When to Use Data from Other Projects for Effort Estimation (short); Pre: Relevancy Filtering for Defect Estimation

  3. Motivation (Why): The Locality(1) Assumption. Data divides best on one attribute: (1) project type (e.g. embedded); (2) development centers of the developers; (3) development language; (4) application type (MIS, GNC, etc.); (5) targeted hardware platform; (6) in-house vs. outsourced projects. If Locality(1) holds, it is hard to use data across these boundaries: models stay confined, and each site must collect its own local data.

  4. Motivation (Why): The Locality(N) Assumption. Data divides best on a combination of attributes. If Locality(N) holds, it is easier to use data across these boundaries.

  5. Work: Cross-vs-within comparison plus "relevancy filtering" for effort estimation. Finding: cross-company data is as good as within-company data, so companies can use each other's data for their estimates, provided they first apply "relevancy filtering" ("cross" performs the same as "local").

  6. Technology (How): How to find relevant training data?

  7. Technology (How): Variance Pruning

  8. Technology (How): TEAK = ABE0 + instance selection (11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation). ABE0 = ABE version 0, the most commonly used baseline: numerics normalized to 0..1; Euclidean distance with equal weight on all attributes; return the median effort of the k nearest neighbors. Instance selection is a smart way to adjust the training data (a minimal ABE0 sketch follows below).
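  A minimal sketch of the ABE0 baseline described above, in the same MATLAB style as the code on slide 15. The variable names (trainX, trainEffort, testX, k) are illustrative assumptions, not taken from the paper:

    % ABE0 sketch. Assumed inputs: trainX (n-by-d project attributes),
    % trainEffort (n-by-1 efforts), testX (m-by-d projects to estimate), k (neighborhood size).
    lo = min(trainX, [], 1);  hi = max(trainX, [], 1);
    normTrain = (trainX - lo) ./ (hi - lo + eps);   % normalize numerics to 0..1
    normTest  = (testX  - lo) ./ (hi - lo + eps);
    estimates = zeros(size(testX, 1), 1);
    for i = 1:size(testX, 1)
        % Euclidean distance with equal weight on all attributes
        dists = sqrt(sum((normTrain - normTest(i, :)).^2, 2));
        [~, order] = sort(dists);
        % return the median effort of the k nearest neighbors
        estimates(i) = median(trainEffort(order(1:k)));
    end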

  9. Technology (How): TEAK is a variance-based instance selector, built via GAC (greedy agglomerative clustering) trees (binary). TEAK is a two-pass system: the first pass selects low-variance, relevant projects (instance selection); the second pass retrieves the projects to estimate from (instance retrieval). Variance pruning test: is a subtree's effort variance > 10% * max(σ²)? > (100% + 10%) * max(σ²)? (A hedged sketch of this pruning step follows below.)
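  To make the pruning step concrete, a hedged sketch: given the effort variance of each subtree in the GAC tree, only low-variance subtrees survive, and their members become the training data. The threshold alpha and all variable names are illustrative assumptions; the paper's actual rule may differ in detail:

    % Assumed inputs: subtreeVars (s-by-1 effort variance of each subtree),
    % subtreeMembers (s-by-1 cell array; subtreeMembers{j} = training-row indices in subtree j).
    alpha  = 0.10;                               % illustrative threshold (10%)
    cutoff = alpha * max(subtreeVars);           % fraction of the maximum variance
    keep   = subtreeVars <= cutoff;              % keep only the low-variance subtrees
    selected = unique(cell2mat(subtreeMembers(keep)));  % surviving training instances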

  10. Technology (How): TEAK finds the local regions that matter for estimating particular cases, and it finds those regions via Locality(N), not Locality(1).

  11. Experiments - Datasets. Selection criteria: public availability (for reproducibility) and cross/within divisibility. 6 out of 20+ datasets from the PROMISE repository.

  12. Experiments - Datasets. For a dataset X with subsets X1, X2, X3: Within = run TEAK on X1, X2, X3 separately, using LOOCV; Cross = use X1 as the test set and X2+X3 as the training set, etc. (N-fold CV). Repeat 20 times, because TEAK is greedy and its results vary with the order of the input data. (A sketch of this within/cross setup follows below.)
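  A minimal sketch of this evaluation setup, assuming the dataset rows are split into index vectors idx1, idx2, idx3 and that estimateWith(trainRows, testRow) stands in for a TEAK run (both names are hypothetical placeholders):

    % Within: leave-one-out cross-validation inside X1 alone.
    withinEst = zeros(numel(idx1), 1);
    for t = 1:numel(idx1)
        testRow   = idx1(t);
        trainRows = setdiff(idx1, testRow);
        withinEst(t) = estimateWith(trainRows, testRow);   % hypothetical estimator call
    end
    % Cross: all of X1 is the test set, X2+X3 form the training set.
    crossTrain = [idx2; idx3];
    crossEst   = zeros(numel(idx1), 1);
    for t = 1:numel(idx1)
        crossEst(t) = estimateWith(crossTrain, idx1(t));
    end
    % Repeat the whole procedure 20 times with shuffled input order,
    % since TEAK's greedy tree construction depends on data order.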

  13. Experiments. Win-Loss-Tie statistics based on the Mann-Whitney test (95% confidence), which tests whether the distributions of two populations differ significantly. (A sketch of the win/tie/loss counting is given below.)
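  A hedged sketch of one common way to turn this test into win/tie/loss counts (the slide does not spell out the tie-breaking rule; here lower error is better and medians break the tie). ranksum is MATLAB's Wilcoxon rank-sum / Mann-Whitney U test from the Statistics and Machine Learning Toolbox:

    % errA, errB: vectors of absolute residuals from two estimation methods.
    p = ranksum(errA, errB);            % Mann-Whitney U test
    if p >= 0.05
        result = 'tie';                 % distributions not significantly different
    elseif median(errA) < median(errB)
        result = 'win';                 % method A significantly better (lower error)
    else
        result = 'loss';
    end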

  14. Experiment 1 - Performance Comparison. Error measures: MAR (Mean Absolute Residual) and MdMRE (Median Magnitude of Relative Error). (Their standard definitions are sketched below.)
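  For reference, the standard definitions of these measures as a small sketch (actual and predicted are assumed vectors of true and estimated efforts):

    % MAR: mean of the absolute residuals |actual - predicted|
    MAR   = mean(abs(actual - predicted));
    % MRE: magnitude of relative error, |actual - predicted| / actual
    mre   = abs(actual - predicted) ./ actual;
    % MdMRE: median MRE over all test projects
    MdMRE = median(mre);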

  15. Experiment 1 - Performance Comparison. Analogy by 1 neighbor (PRED(25) > 0.3 on the C81 subsets):

    % For each test case, take the effort of its single nearest training case,
    % scaled by the ratio of the projects' sizes and of their cost-driver values.
    for i = 1:numTestCases
        estimates(i) = effortTrain(nearestCase(i)) * sizeTest(i) / sizeTrain(nearestCase(i));
        for k = 1:numTestFactors
            % adjust by each cost driver of the test vs. the retrieved training case
            estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nearestCase(i),k);
        end
    end

  Analogy by K-neighbor:

  16. Experiment 2 – Retrieval Tendency

  17. Experiment 2 – Retrieval Tendency. Diagonal (WC, within-company) vs. off-diagonal (CC, cross-company) selections; the selection percentages are sorted, and the percentiles of the diagonal and off-diagonal entries are compared.

  18. Conclusion. Cross performance is no worse than within performance, and the probability that the estimator retrieves a training instance from cross data is the same as from within data. Implication: companies can learn from each other's data. Under Locality(N), maybe there are general effects in SE, effects that transcend the boundaries of a single company (local vs. global models…).

  19. Future work. Check external validity: after instance selection, does cross == within? Build more repositories: they are more useful than previously thought for effort estimation. Synonym discovery: cross data can only be used if it shares the same ontology, so auto-generate lexicons to map terms between data sets (e.g. "LOC" – "size", "product complexity").

  20. Thanks! Q & A?
