1 / 26

Total Error Framework for Hybrid Estimation

This paper presents a total error framework for hybrid estimation, including simple and complex decomposition methods for error source targeting. It also provides a mathematical formulation and illustration comparing random samples with large censuses.

rfincham
Download Presentation

Total Error Framework for Hybrid Estimation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Total Error Framework for Hybrid Estimation Paul P. Biemer1,2 Asley Amaya1 1RTI International 2University of North Carolina at Chapel Hill BIGSURV18 Barcelona, Spain October 26, 2018

  2. Outline • Total error framework for the mean of some data set • Simple dichotomous decomposition • More complex decompositions for error source targeting • Mathematical formulation of concepts • Expression for the “Total” Mean Squared Error of the unweighted data set mean • Illustration comparing a small random sample with a large census • Data weighting and other error mitigation strategies

  3. Generalized TE Framework Total Error= Sample Recruitment Error + Data Encoding Error Sample Recruitment Error is a generalization of the concept of representation error Data Encoding Error is a generalization of the concept of measurement error

  4. Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)

  5. Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? To be included in the data set, a population member must pass through all 4 gates. Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)

  6. Generalized TE Framework – Sample Recruitment Process No Population members 1. Chance to encounter recruitment system? Exit Process (Non-coverage) Let Ri = 1 if a population unit makes it through all 4 gates; Let Ri = 0 otherwise No Exit Process (Non-selection) 2. Encounters recruitment system? Exit Process (Nonresponse by noncontact) No 3. Receives stimulus for data input? To be included in the data set, a population member must pass through all 4 gates. Exit Process (Nonresponse by refusal) No 4.Inputs data? Data are recorded (observation)

  7. Generalized TE Framework – Data Encoding Process 1. Construct represented by data is the desired construct? Specification error Construct of interest, x No No Measurement error 2. Data captured without error? No Data processing error 3. Data processed without error? Final data value, y = x + ε

  8. Total Error Identity for the Mean of the Encoded Data Total Error= Data Enc Error + Samp Recr Error Specification Measurement Data Processing Non-coverage Non-selection Nonresponse (noncontact/ refusal)

  9. Data Encoding Error • xi is the true characteristic for the ith sample unit • yi is the encoded value of xi • εi = yi - xiis the error in the encoded value for the ith sample unit • Assume εi ~ i.i.d (Bε, ) Data capture error bias Data capture error variance

  10. Sample Recruitment Error (or Errors of Nonobservation) • Xi denotes the characteristic measured for the ith person in the Recruitment Process • , a measure of selection bias Bias induced by the data capture process Population variance Meng, 2017; Bethlehem, 1988

  11. Sample Recruitment Error • Xi denotes the characteristic measured for the ith person in the Recruitment Process • , a measure of selection bias Example: For SRS sampling and no nonresponse,

  12. Total Mean Squared Error of the Mean of a Generic Data Set Total Error= Data Enc Error + Samp Recr Error • Bias from • Specification error • Measurement error • Data processing error • Variance & Bias • from • Noncoverage • Nonselection • Nonresponse Recording error/capture error interaction • Variance from • Measurement error • Data processing error

  13. Interpretation of • Not much is known about ρRX for nonprobability samples. • However, ρRX has been studied extensively for surveys (through the estimation of nonresponse bias). • ρRXwill be smaller for nonprobability samples when gates 1, 2 and 3 are entered for all members of the population. • Ability to adjust for sample recruitment bias is better for surveys because • We have more control over who enters gates 1-3 and thus more control over ρRX • We often know a lot about sample recruitment failures and how to adjust for them through weighting and imputation.

  14. Illustration of Some Uses of these Results? Which is more accurate? • An estimate of the population average based upon an administrative data set with almost 100,000,000 population records and over 80% coverage? or • A national survey estimate based upon probability sample of size 3,300 completed cases with a 55% response rate?

  15. We try to answer this question for the US Residential Energy Consumption Survey (RECS) • Survey Data: 2015 RECS • Mode changed from face to face to web/mail • Respondent reports of housing unit square footage not reliable • Substituting administrative data could be more accurate • response rate ≈ 55% • n ≈ 6,000*.55 = 3,300 completed cases • Administrative data: Zillow real estate data base • Coverage ≈ 82% • n ≈ 100,000,000 records • Other data bases were also considered (Acxiom and CoreLogic)

  16. We try to answer this question for the US Residential Energy Consumption Survey (RECS) • HU square footage is primarily used for micro-econometric modeling • We will consider estimation of the U.S. average HU square footage to demonstrate the MSE analysis • Similar analysis could be considered for other parameters of interest (i.e., regression coefficients) • However, the current formulation would not be appropriate. • Our analysis will use results from Amaya (2017)

  17. Evidence of Nonsampling Error from the RECS RECS Average Reported Square Footage by Source Official Respondent Zillow

  18. Evidence of Nonsampling Error from the RECS RECS Average Reported Square Footage by Source Official Respondent Zillow

  19. Input Parameters for Computing MSE

  20. RMSEs as a Function of ρRX for Zillow and RECS Ignoring Data Encoding Error, Zillow has uniformly smaller RMSE Value of ρRX

  21. RMSEs as a Function of ρRX for Zillow and RECS With Data Encoding Error, RECS has the smallest RMSE for most of the range Value of ρRX

  22. RMSEs as a Function of ρRX for Zillow and RECS Value of ρRX

  23. RMSEs as a Function of ρRX for Zillow Data Encoding Error Component Sample Recruitment Error Component Value of ρRX

  24. Results Summary • Whether RMSEZillow < RMSERECS depends on value of ρRX • In this case, reducing ρRX may lead to larger RMSE because two biases are offsetting one another • Ideally, both biases should be minimized because an offsetting biases situation is not sustainable

  25. Potential Zillow Error Mitigation Strategies Data Encoding Error • Estimate the bias and adjust for it • May need ground truth square footage data to model this bias Sample Recruitment Error • Weight the Zillow data to reduce |ρRX| • Weights will approximate [E(Ri)]-1 • Modeling E(Ri|X) will require understanding how Ri varies by housing unit and other characteristics (X) of the sample recruitment process

  26. A Few Take Aways • As we move towards integrating survey and Big Data, need to consider the total error. • Sample recruitment bias is the least understood component of the total error. • Understanding this component will lead to statistical products of greater quality, utility and efficiency. • Focusing solely on data missingness may substantially underestimate total uncertainty. In this illustration, data encoding error dominated the MSE. • The monograph paper will further explore error mitigation strategies for dealing with both data encoding and sample recruitment error.

More Related