Investigation of Macro Editing Techniques for Outlier Detection in Survey Data


Presentation Transcript


  1. Investigation of Macro Editing Techniques for Outlier Detection in Survey Data • Katherine Jenny Thompson • Office of Statistical Methods and Research for Economic Programs

  2. Simplified Survey Processing Cycle • Data Collection / Analyst Review • Micro-editing and Imputation • Individual Returns • Macro-editing • Tabulated Initial Estimates • Publication Estimates • Analyst Investigation and Correction

  3. Identifying Outlying Estimates • Set of Estimates • Unknown parametric distribution (robust) • Contains outliers (resistant) • Outlier-identification problems (Multiple Outliers) • Masking: difficult to detect an individual outlier • Swamping: too many false outliers flagged

  4. Outlier Detection Approaches • Sets of “bivariate” (Ratio) comparisons • Same estimate from two consecutive collection periods (historic cell ratios) • Different estimates in same collection period (current cell ratios) • Multivariate comparisons • Current period data

  5. Method for Bivariate Comparisons • Resistant Fences Methods • Symmetrized Resistant fences • Asymmetric Fences • Robust Regression • Hidiroglou-Berthelot Edit

  6. Bivariate Comparisons (Current Cell Ratios) • Linear relationship between payroll and employment • No intercept

  7. “Traditional” Ratio Edit (Current Cell Ratio) • Figure: an acceptance region bounded by an outlier region on each side • “Cone-shaped” tolerances • Goes through origin • Strong statistical association
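
The cone-shaped tolerance on slide 7 can be sketched as a pair of bounds on the ratio. This is a minimal illustration only: the function name and the bound values are hypothetical, and nonnegative denominators are assumed.

```python
def ratio_edit(numerator, denominator, lower, upper):
    """Return True when the pair falls inside the acceptance region.

    Assumes denominator >= 0. The bounds `lower` and `upper` are
    hypothetical tolerances, not values from the presentation.
    """
    # Multiplying through by the denominator keeps the test defined at zero
    # and shows the cone shape: both bounds are lines through the origin.
    return lower * denominator <= numerator <= upper * denominator


# Payroll per employee of 30 lies inside the hypothetical tolerances (20, 50).
print(ratio_edit(300, 10, lower=20, upper=50))
# Payroll per employee of 60 lies in the upper outlier region.
print(ratio_edit(600, 10, lower=20, upper=50))
```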

  8. Resistant Fences Methods • Fences at q25 − 1.5H and q75 + 1.5H, where H = q75 − q25 is the interquartile range • Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer) • Implicitly assumes symmetry • May want to “symmetrize”, apply rule, use inverse transformation
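
A minimal sketch of the resistant fences rule on slide 8, assuming H is the interquartile range q75 − q25. The quartile method ("inclusive") and the sample ratios are illustrative choices, not taken from the presentation.

```python
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) fences: q25 - k*H and q75 + k*H, with H = q75 - q25."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences (k=1.5 inner, k=3 outer)."""
    lower, upper = resistant_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up cell ratios with one gross outlier.
ratios = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.2, 0.8, 5.0]
print(flag_outliers(ratios, k=1.5))  # inner fences
print(flag_outliers(ratios, k=3.0))  # outer fences
```

Because the fences depend only on quartiles, a single extreme ratio barely moves them, which is what makes the rule resistant.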

  9. Asymmetric Fences Methods • Fences at q25 − 3(m − q25) and q75 + 3(q75 − m), where m is the median • Different numbers of interquartile ranges (3 = Inner, 6 = Outer) • Incorporates skewness of distribution in outlier rule (“Fences”)
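
A minimal sketch of the asymmetric fences on slide 9, assuming the lower fence is measured downward from q25 by k(m − q25) and the upper fence upward from q75 by k(q75 − m), with m the median. The data are made up to show a right-skewed case.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (lower, upper) fences that stretch toward the longer tail.

    k=3 corresponds to the inner fences and k=6 to the outer fences.
    """
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Right-skewed illustrative data: the upper fence sits much farther out.
data = [1, 2, 2, 3, 3, 4, 6, 9, 40]
lower, upper = asymmetric_fences(data, k=3.0)
print(lower, upper)
print([v for v in data if v < lower or v > upper])
```

Here q25 = 2, m = 3 and q75 = 6, so the upper half-spread (3) is three times the lower one (1) and the upper fence is correspondingly wider.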

  10. Robust Regression • Least Trimmed Squares Robust Regression • Resistant (minimizes median residual) • Outlier = |residual| ≥ 3 × robust M.S.E.

  11. Issue at Origin (Historic Cell Ratio)

  12. Hidiroglou-Berthelot (HB) Edit • Accounts for magnitude of unit (variability at origin)

  13. Hidiroglou-Berthelot (HB) Edit • Two-step transformation (Ei) • Centering transformation on ratios • Magnitude transformation that accounts for the relative importance of large cases • Asymmetric Fences “Type” Outlier Rule • Key Parameters • U = magnitude transformation parameter (0 ≤ U ≤ 1) • C = controls width of outlier region
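
The two-step transformation and outlier rule on slide 13 can be sketched as follows. This follows the commonly published form of the HB edit rather than anything shown on the slides: the floor parameter `a` (0.05 here), the quartile method, and the sample data are all assumptions, and strictly positive values are required.

```python
import statistics


def hb_edit(current, previous, u=0.3, c=10.0, a=0.05):
    """Return one boolean flag per unit (True = outlier).

    u: magnitude transformation parameter (0 <= u <= 1)
    c: controls the width of the outlier region
    a: assumed floor that keeps the fences from collapsing when the
       effects E_i cluster tightly around their median
    """
    ratios = [x / y for x, y in zip(current, previous)]
    r_med = statistics.median(ratios)

    # Step 1: centering transformation, symmetric around the median ratio.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]

    # Step 2: magnitude transformation, giving large units more weight.
    e = [si * max(x, y) ** u for si, x, y in zip(s, current, previous)]

    # Asymmetric-fences-type rule on the transformed effects.
    e_q1, e_med, e_q3 = statistics.quantiles(e, n=4, method="inclusive")
    d_q1 = max(e_med - e_q1, abs(a * e_med))
    d_q3 = max(e_q3 - e_med, abs(a * e_med))
    lower, upper = e_med - c * d_q1, e_med + c * d_q3
    return [not (lower <= ei <= upper) for ei in e]


# Made-up prior-period and current-period cell estimates; the seventh
# cell grows tenfold while the others change by a few percent.
previous = [100, 200, 150, 1000, 50, 300, 400, 250, 120]
current = [105, 190, 160, 1020, 52, 310, 4000, 240, 118]
print(hb_edit(current, previous, u=0.3, c=10.0))
```

Setting u = 0 reduces the rule to fences on the centered ratios alone, while larger u lets big units be flagged for smaller relative changes.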

  14. Multivariate Methods: Mahalanobis Distance • Multivariate normal (μ, Σ) • T(X) estimates μ • C(X) estimates Σ • p is the number of distinct variables (items) • Prone to masking (difficult to detect individual outliers)
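
For reference, the distance behind slide 14 is usually written as below, using the slide's T(X) and C(X) for the location and scatter estimates. The chi-square cutoff is the conventional rule under multivariate normality and is an assumption here; the slides do not state which cutoff was used.

```latex
% Squared Mahalanobis distance of record x_i (a vector of p items)
MD_i^2 = \bigl(x_i - T(X)\bigr)^{\top} \, C(X)^{-1} \, \bigl(x_i - T(X)\bigr)

% Conventional flagging rule (assumed), for a chosen level \alpha
MD_i^2 > \chi^2_{p,\,1-\alpha}
```

With the classical sample mean and covariance as T(X) and C(X), a cluster of outliers inflates C(X) and shrinks every distance, which is the masking problem the slide refers to.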

  15. Robust Alternatives • M-estimation (not considered) • “Production Method” • Minimum Volume Ellipse (MVE) • Resistant (50% breakdown) and robust • Minimum Covariance Determinant (MCD) • Resistant (50% breakdown) and robust • Assumption of Normality • Log-transformation

  16. Evaluation: Classify Item Estimates • Input Value (Reported) • Final Value (Tabulated) • Ratio: Input/Final • Classes: Not an Outlier, Potential Outlier, Outlier

  17. Evaluation: Classify Ratios (Bivariate) • Conservative • Ratio is “outlier” if numerator or denominator is an outlier • Anti-Conservative • Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier

  18. Evaluation: Classify Records (Multivariate) • Conservative • Record is “outlier” if at least one estimate is an outlier • Anti-Conservative • Record is “outlier” if at least one estimate is an outlier or a potential outlier
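
The conservative and anti-conservative rules on slides 17 and 18 share one pattern: a ratio (two items) or a record (several items) is an outlier if any of its item estimates falls in the flagged classes. A minimal sketch, with hypothetical class labels:

```python
# Hypothetical labels for the three item-level classes from slide 16.
OUTLIER = "outlier"
POTENTIAL = "potential outlier"
NOT_OUTLIER = "not an outlier"


def is_outlier(item_classes, conservative=True):
    """Classify a ratio or record from the classes of its item estimates."""
    flagged = {OUTLIER} if conservative else {OUTLIER, POTENTIAL}
    return any(cls in flagged for cls in item_classes)


# A ratio whose numerator is only a potential outlier.
ratio = [POTENTIAL, NOT_OUTLIER]
print(is_outlier(ratio, conservative=True))
print(is_outlier(ratio, conservative=False))
```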

  19. Evaluation Statistics: Bivariate Comparisons • Individual Test Level • Type I Error Rate: proportion of false rejects • Type II Error Rate: proportion of false accepts • Hit Rate: proportion of flagged estimates that are outliers • All-Test Level • All-item Type II error rate

  20. Evaluation Statistics: Multivariate Comparisons • Type I error rate: the proportion of non-outlier records that are flagged as outliers • Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
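
The evaluation statistics on slides 19 and 20 can be computed from two parallel lists: what a method flagged and what was classified as a true outlier. A minimal sketch with made-up inputs; returning None for an undefined rate is a design choice of this example.

```python
def evaluation_rates(flagged, truth):
    """Return Type I error rate, Type II error rate, and hit rate.

    flagged[i]: True if the method flagged unit i
    truth[i]:   True if unit i is classified as an outlier
    """
    pairs = list(zip(flagged, truth))
    n_outliers = sum(1 for _, t in pairs if t)
    n_clean = len(pairs) - n_outliers
    n_flagged = sum(1 for f, _ in pairs if f)
    false_rejects = sum(1 for f, t in pairs if f and not t)
    false_accepts = sum(1 for f, t in pairs if not f and t)
    hits = sum(1 for f, t in pairs if f and t)
    return {
        # Proportion of non-outliers that were flagged.
        "type_i": false_rejects / n_clean if n_clean else None,
        # Proportion of outliers that were missed.
        "type_ii": false_accepts / n_outliers if n_outliers else None,
        # Proportion of flagged units that are outliers.
        "hit_rate": hits / n_flagged if n_flagged else None,
    }


flagged = [True, True, False, False, False, True, False, False]
truth = [True, False, True, False, False, True, False, False]
print(evaluation_rates(flagged, truth))
```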

  21. Annual Capital Expenditures Survey (ACES) • Sample Survey (Stratified SRS-WOR) • ACE-1: Employer companies • ACE-2: Non-employer companies (not discussed) • New sample selection each year • Total and year-to-year change estimates • Total Capital Expenditures • Structures (New and Used) • Equipment (New and Used)

  22. Capital Expenditures Data • Characterized by • Low year-to-year correlation (same company) • Weak association with available auxiliary data • Editing procedures focus on additivity • Outlier correction at micro-level

  23. Bivariate Comparisons • Resistant Fences: (Symmetric or Asymmetric) × (Inner or Outer) • HB Edit: (U = 0.3 or 0.5) × (c = 10 or 20)

  24. Results – Individual Tests • Robust Regression prone to swamping • High Type I error rate (false rejects) • Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, c = 10) • Low Type I error rates • High Hit Rates • High Type II error rates • Other variations of Resistant Fences and HB edit not as good

  25. Results – All-Tests • Very large Type II error rates (approx. 50%) • Robust regression • Symmetric resistant outer fences • HB edit with c = 20 • Improved Type II error rates (30% - 40%) • Asymmetric inner fences • HB edit (U = 0.3, c = 10)

  26. Multivariate Results • Original Data: considered methods ineffective • Log-transformed data: improved performance (MCD and MVE) • Reduced Type I error rates • Comparable Type II error rates (to original-data MCD and MVE)

  27. Multivariate Versus Bivariate: Different Outcomes (Conservative) • Combined HB edits flag more “outliers”: • Higher Type I error rate • Lower Type II error rates for the complete set of HB edits

  28. Comments • Economic data with inconsistent statistical association between items in each collection period • Critical values must be determined by the data set at hand (no “hard-coding”) • Dynamically • Standardize the comparisons (HB edit, log transformation) • Compute outlier limits • Could try hybrid approach: • Multivariate → a few current cell ratio tests with the HB edit • Perform all bivariate tests, but unduplicate cells before analyst review

  29. Final Thoughts/Next Steps • Examined one set of economic data and considered only two separate collections from this program • Extrapolation would be foolish • My results need to be validated on other economic data sets • A more typical periodic business survey and/or • A well-constructed simulation study

  30. Any Questions? • Katherine Jenny Thompson • Katherine.J.Thompson@census.gov
