How to Find Relevant Data for Effort Estimation?


Presentation Transcript


  1. How to Find Relevant Data for Effort Estimation? 毛可, 2012-03-28

  2. Authors: Ekrem Kocaguneli (Ph.d@WVU.LCSEE), Tim Menzies. Specialties: Data Mining, Effort Estimation. Publications: 11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation (TEAK); 11' TSE: On the Value of Ensemble Effort Estimation; 11' ESEM: –; 10' ASE: When to Use Data from Other Projects for Effort Estimation (short); Pre: Relevancy Filtering for Defect Estimation

  3. Motivation (Why): The Locality(1) Assumption. Data divides best on one attribute: (1) project type (e.g. embedded); (2) development centers of the developers; (3) development language; (4) application type (MIS, GNC, etc.); (5) targeted hardware platform; (6) in-house vs. outsourced projects. If Locality(1) holds, it is hard to use data across these boundaries: models stay confined, and each site must collect its own local data.

  4. Motivation (Why): The Locality(N) Assumption. Data divides best on a combination of attributes. If Locality(N) holds, it is easier to use data across these boundaries.

  5. Work: Cross-vs-within comparison plus "relevancy filtering" for effort estimation. Finding: cross-company data is as good as within-company data, so companies can use each other's data for their estimates, provided they first apply "relevancy filtering" ("cross" performs the same as "local").

  6. Technology (How): How to find relevant training data?

  7. Technology (How): Variance Pruning

  8. Technology (How): TEAK = ABE0 + instance selection (11' TSE: Exploiting the Essential Assumptions of Analogy-Based Effort Estimation). ABE0 = ABE version 0, the most commonly used baseline: numerics normalized to 0..1; Euclidean distance with equal weight on all attributes; return the median effort of the k nearest neighbors. Instance selection is a smart way to adjust the training data (a minimal ABE0 sketch follows below).
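  A minimal sketch of the ABE0 baseline described above, in the same MATLAB style as the code on slide 15. The variable names (trainX, trainEffort, testX, k) are illustrative assumptions, not taken from the paper:

    % ABE0 sketch. Assumed inputs: trainX (n-by-d project attributes),
    % trainEffort (n-by-1 efforts), testX (m-by-d projects to estimate), k (neighborhood size).
    lo = min(trainX, [], 1);  hi = max(trainX, [], 1);
    normTrain = (trainX - lo) ./ (hi - lo + eps);   % normalize numerics to 0..1
    normTest  = (testX  - lo) ./ (hi - lo + eps);
    estimates = zeros(size(testX, 1), 1);
    for i = 1:size(testX, 1)
        % Euclidean distance with equal weight on all attributes
        dists = sqrt(sum((normTrain - normTest(i, :)).^2, 2));
        [~, order] = sort(dists);
        % return the median effort of the k nearest neighbors
        estimates(i) = median(trainEffort(order(1:k)));
    end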

  9. Technology (How): TEAK is a variance-based instance selector, built via GAC (greedy agglomerative clustering) trees (binary). TEAK is a two-pass system: the first pass selects low-variance, relevant projects (instance selection); the second pass retrieves the projects to estimate from (instance retrieval). Variance pruning test: is a subtree's effort variance > 10% * max(σ²)? > (100% + 10%) * max(σ²)? (A hedged sketch of this pruning step follows below.)
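  To make the pruning step concrete, a hedged sketch: given the effort variance of each subtree in the GAC tree, only low-variance subtrees survive, and their members become the training data. The threshold alpha and all variable names are illustrative assumptions; the paper's actual rule may differ in detail:

    % Assumed inputs: subtreeVars (s-by-1 effort variance of each subtree),
    % subtreeMembers (s-by-1 cell array; subtreeMembers{j} = training-row indices in subtree j).
    alpha  = 0.10;                               % illustrative threshold (10%)
    cutoff = alpha * max(subtreeVars);           % fraction of the maximum variance
    keep   = subtreeVars <= cutoff;              % keep only the low-variance subtrees
    selected = unique(cell2mat(subtreeMembers(keep)));  % surviving training instances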

  10. Technology (How): TEAK finds the local regions that matter for estimating particular cases, and it finds those regions via Locality(N), not Locality(1).

  11. Experiments - Datasets. Selection criteria: public availability (for reproducibility) and cross/within divisibility. 6 out of 20+ datasets from the PROMISE repository.

  12. Experiments - Datasets. For a dataset X with subsets X1, X2, X3: Within = run TEAK on X1, X2, X3 separately, using LOOCV; Cross = use X1 as the test set and X2+X3 as the training set, etc. (N-fold CV). Repeat 20 times, because TEAK is greedy and its results vary with the order of the input data. (A sketch of this within/cross setup follows below.)
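  A minimal sketch of this evaluation setup, assuming the dataset rows are split into index vectors idx1, idx2, idx3 and that estimateWith(trainRows, testRow) stands in for a TEAK run (both names are hypothetical placeholders):

    % Within: leave-one-out cross-validation inside X1 alone.
    withinEst = zeros(numel(idx1), 1);
    for t = 1:numel(idx1)
        testRow   = idx1(t);
        trainRows = setdiff(idx1, testRow);
        withinEst(t) = estimateWith(trainRows, testRow);   % hypothetical estimator call
    end
    % Cross: all of X1 is the test set, X2+X3 form the training set.
    crossTrain = [idx2; idx3];
    crossEst   = zeros(numel(idx1), 1);
    for t = 1:numel(idx1)
        crossEst(t) = estimateWith(crossTrain, idx1(t));
    end
    % Repeat the whole procedure 20 times with shuffled input order,
    % since TEAK's greedy tree construction depends on data order.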

  13. Experiments. Win-Loss-Tie statistics based on the Mann-Whitney test (95% confidence), which tests whether the distributions of two populations differ significantly. (A sketch of the win/tie/loss counting is given below.)
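  A hedged sketch of one common way to turn this test into win/tie/loss counts (the slide does not spell out the tie-breaking rule; here lower error is better and medians break the tie). ranksum is MATLAB's Wilcoxon rank-sum / Mann-Whitney U test from the Statistics and Machine Learning Toolbox:

    % errA, errB: vectors of absolute residuals from two estimation methods.
    p = ranksum(errA, errB);            % Mann-Whitney U test
    if p >= 0.05
        result = 'tie';                 % distributions not significantly different
    elseif median(errA) < median(errB)
        result = 'win';                 % method A significantly better (lower error)
    else
        result = 'loss';
    end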

  14. Experiment 1 - Performance Comparison. Error measures: MAR (Mean Absolute Residual) and MdMRE (Median Magnitude of Relative Error). (Their standard definitions are sketched below.)
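  For reference, the standard definitions of these measures as a small sketch (actual and predicted are assumed vectors of true and estimated efforts):

    % MAR: mean of the absolute residuals |actual - predicted|
    MAR   = mean(abs(actual - predicted));
    % MRE: magnitude of relative error, |actual - predicted| / actual
    mre   = abs(actual - predicted) ./ actual;
    % MdMRE: median MRE over all test projects
    MdMRE = median(mre);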

  15. Experiment 1 - Performance Comparison. Analogy by 1 neighbor (PRED(25) > 0.3 on the C81 subsets):

    % For each test case, take the effort of its single nearest training case,
    % scaled by the ratio of the projects' sizes and of their cost-driver values.
    for i = 1:numTestCases
        estimates(i) = effortTrain(nearestCase(i)) * sizeTest(i) / sizeTrain(nearestCase(i));
        for k = 1:numTestFactors
            % adjust by each cost driver of the test vs. the retrieved training case
            estimates(i) = estimates(i) * cdTestReady(i,k) / cdTrainReady(nearestCase(i),k);
        end
    end

  Analogy by K-neighbor:

  16. Experiment 2 – Retrieval Tendency

  17. Experiment 2 – Retrieval Tendency. Diagonal (WC, within-company) vs. off-diagonal (CC, cross-company) selections; the selection percentages are sorted, and the percentiles of the diagonal and off-diagonal entries are compared.

  18. Conclusion. Cross performance is no worse than within performance, and the probability that the estimator retrieves a training instance from cross data is the same as from within data. Implication: companies can learn from each other's data. Under Locality(N), maybe there are general effects in SE, effects that transcend the boundaries of a single company (local vs. global models…).

  19. Future work. Check external validity: after instance selection, does cross == within? Build more repositories: they are more useful than previously thought for effort estimation. Synonym discovery: cross data can only be used if it shares the same ontology, so auto-generate lexicons to map terms between data sets (e.g. "LOC" – "size", "product complexity").

  20. Thanks! Q & A?
