1 / 30

Comparing Methods for Data Combination

Comparing Methods for Data Combination. D.H. Judson July, 2006. Purpose of This Talk. Place our work in historical and philosophical context Describe and explain a standard model of combining data Show, via case studies, how this standard model can be used

breck
Download Presentation

Comparing Methods for Data Combination

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Comparing Methods for Data Combination D.H. Judson July, 2006

  2. Purpose of This Talk • Place our work in historical and philosophical context • Describe and explain a standard model of combining data • Show, via case studies, how this standard model can be used • Consider expanding the standard model into new territory

  3. The problem with “Official” Statistics(from the locality point of view) • Not fast enough • Lag from collection to dissemination • Not local enough • Geographic reporting insufficient for local needs • Not granular enough • Insufficient detail for important demographic groups • Not integrated enough • Differing definitions, time reference, etc.

  4. Two paradigms • Sample survey method • Direct estimates from sample surveys • Expensive to drive down to local areas • “Administrative records” method • Large collections of already-existing data • Typically lack control over data collection, sampling or model-based uncertainty measures

  5. What is “Information Integration”? • Information integration is the process of using multiple datasets in concert to construct statistical estimands for the purpose of answering questions about those estimands. • E.g., • What is the ethnicity-specific unemployment rate in Stockport in July, 2003? • How many uninsured children at 100-200% of the Federal Poverty Level are there in Washoe County, Nevada in 2004? • What is the population size, by mode of available transportation, in Bradford in December 2005?

  6. Attacking the problem:Three principles • Recognize that the estimand exists, but is not always observed directly • (Latent variable principle) • Recognize that none of the bits of data contributing to the estimand are without error or uncertainty • (Uncertainty principle) • Model the relationship between estimand and data sources • (Modeling principle)

  7. The Standard Model for Combining Information Population of Inference Target Database (X) Link function “Ground Truth” Carefully Collected Data (Y) X Representative (?) Sample of X Estimated Model: Y=f(X) Augmented Database, with X and estimated Y’s

  8. Case study • Determining the number of uninsured children at the county level • Problem: • Estimates not yet provided by the Fed statistical system (SCHIP expansion: 1997; first estimates: 2005) • Solution: • Develop own county-level estimates • Result: • Attempt to integrate state-level survey data (Current Population Survey) with county-level cohort-component population estimates • Combine two separate easier-to-construct quantities: Population estimates and uninsurance model

  9. Model Age/Race/Sex/Hispanic (ARSH) origin estimates Synthetic Match by ARSH Slight slippage: CPS does not cover inst. population CPS/SIPP uninsurance status by ARSH ARSH cells Estimated Model: P[uninsured] = F(ARSH data) ARSH cells + P[uninsured|ARSH]

  10. Case study • Evaluating program participants’ outcomes with unemployment insurance (UI) wage records versus a 13 week follow-up survey • Program completers get a follow-up survey • Performance measure = weekly wage • Performance standard: Avg. weekly wage of respondents > $xxx • Problem: Can UI records replace the survey? • What “UI weekly wage” ≈ “survey weekly wage”? • Solution: For linked records: • Regress UI weekly wage on survey weekly wage • Express performance standards on transformed scale

  11. Model Program completers SSN Link Notable slippage: UI does not cover out of state Unemployment Insurance Wage Records Survey data Estimated Models: P[working] = F(survey data) Mean[wage] = F(survey data) UI wage “equivalent” to survey response

  12. Case study • Improving race data on an administrative file • Social Security Administration “NUMIDENT” file • Limited race data, coding has changed over time • Base for population estimates system • Solution: Link to Current Population Survey (CPS), use CPS race in a model

  13. Model NUMIDENT (Social Security) file Link by SSN; Correct for nonlinks • Slight slippage: • NUMIDENT • is not a register • CPS race only • covers adults CPS race data X Estimated Model: CPS race = F(NUMIDENT data) Augmented NUMIDENT, with X and estimated Y’s

  14. Case study(NUMIDENT, round two) • Problems with first approach: • Family structure not accounted for (too many mixed race families) • CPS race only covers adults • Census 2000 race: “Mark all that apply” • Solution: Probabilistic link of NUMIDENT to Census 2000, use Census race in a model • Simple assignment for linked records • Modeled assignment for nonlinked records

  15. Model NUMIDENT (Social Security) file Probabilistic linkage • Slight slippage: • NUMIDENT • is not a register • Linkage error not • accounted for Census 2000 Race data X Estimated Model: Cen2000 race = F(NUMIDENT data) Augmented NUMIDENT, with X and estimated Y’s

  16. Questions to be answeredby the information integrator • Older/better vs. Newer/worse? • Data vintage is (surprisingly) important • Count vs. Model? • AR data never seem to match population of interest • Certainty vs. Uncertainty? • As yet unsolved problem • Synthetic Match vs. Statistical Matching vs. Record Linkage? • Yesterday: Technology almost exists • Today: Technology exists!

  17. Outstanding challenges(Lacunae in current theory) • Representing uncertainty • Adapting to differential temporal/spatial/ontological reference • Sampling and design weights with linked survey/census/administrative data • Record linkage and statistical matching error • Covariance between estimation components • Spatial concentration of change

  18. Lacuna #1 • How to represent uncertainty in the “administrative records method” • Sampling variance, model variance, and procedure variance? • Fisher/Gee (2004) general model: • Incorporate sampling variance where it is known • Incorporate model variance based on specification • Procedure variance: Bayesian estimate of uncertainty • Incorporate latent variable approach

  19. A More Sophisticated Model (Fisher/Gee) i=ith area; j=jth indicator variable; μ=true (latent) value of estimand; =mean (latent) value of estimand for all areas; σμ=S.D. of estimand for all areas; a,b,σ=regression relationship of jth indicator to ith area.

  20. Estimand: Log Number of Persons in Poverty POP Xj Xj  σ σ a a b b Xj h s m LNP j=CEN, TAX, NF j=CPS, ACS ith county j= FSP s a b

  21. Thus, Rethinking the Model… Target Database (X) Link function “Ground Truth” Carefully Collected Data (Y) X Representative (?) Sample of X Estimated Model: Latent variable Z=f(X, Y) Augmented Database, with X’s, Y’s, and estimated latent Z’s

  22. Lacuna #2 • Adapting to differential temporal/spatial/ontological reference (and adaptive temporal/spatial/ontological reference) • Information decays over time • The population of objects (people, housing units, areas) is changing (at different rates) • “Spatial and temporal slippage” (the difference between the reference dates/places of the data and the estimand of interest) • “Ontological slippage” (Barr; the difference between one representation of the object of interest and another)

  23. Temporal Information Decay (Stuart) Level III: Global Parameters: α: file inclusion probs λ: Global migration rate (decay) Level II:Person captured in AR file? wi and yi: capture probs Level II: Individual-level migration t0i, t1i : Start and end time Level I:Individual-level observation zi: indicator of capture Magic day (Census Day) Capture in file B Capture in file A Information decay (migration)

  24. Lacuna #3 • Sampling and design weights with linked survey/census/administrative data • We know how to analyze survey data with weights; but what about linked data? • Proposed weight (Chesher and Nesheim, 2004):

  25. Lacuna #4 • Record linkage and statistical matching error • Known effect – biasing effect on inference • False links vs. false nonlinks tradeoff • Posterior probabilities: • P[records true link | linkage comparison] • P[records false nonlink | linkage comparison] • Provide factors correcting for linkage error • Elaborating on Chesher and Nesheim’s weight:

  26. Lacuna #5 • Covariances between components • Functional relationships induce covariance • Soil example • Demographic analysis example • Spatial autocorrelation induces covariance • Relationships across levels of geography induce covariance between estimation components • Solutions? • Error propogation modeling (Heuvelink, 1998)? • Multilevel modeling?

  27. Lacuna #6 Spatial concentration of change • Ian Cope, ONS: Address changes tend to be concentrated (Manchester vs. Westminster) • Today: Small area estimation methods: • (demographic) miss change • (statistical) smooth out change • Solutions?

  28. Comments on OPUS:Points of Agreement • Bayesian belief networks appealing: • Able to account for various uncertainties • Able to fully account for model structure • Full posterior distributions valuable • Synthetic data: • “Synthetic data provides a view of the knowledge in the final state of a statistical model. It does not tell us directly about the real-world system” • ≈ Latent variable principle • Importance of the model: • “Because the model is central, its correctness and reliability are crucial” • ≈ Modeling principle

  29. Comments on OPUS:Points of Contention/Clarification • Has statistical matching and/or microdata record linkage been fully exploited? • Microdata link = better inference • Are weights (survey and/or linkage and/or coverage) incorporated? • Remember: The population of inference ≠ data in hand • Is information decay/slippage addressed? • How to address spatial concentration of change and error?

  30. Dean H. Judson U.S. Census Bureau Washington, DC 20233 Phone: 301-763-6037 Email: dean.h.judson@census.gov Contact Information

More Related