Comparing Methods for Data Combination

Comparing Methods for Data Combination D.H. Judson July, 2006

Purpose of This Talk • Place our work in historical and philosophical context • Describe and explain a standard model of combining data • Show, via case studies, how this standard model can be used • Consider expanding the standard model into new territory

The problem with “Official” Statistics(from the locality point of view) • Not fast enough • Lag from collection to dissemination • Not local enough • Geographic reporting insufficient for local needs • Not granular enough • Insufficient detail for important demographic groups • Not integrated enough • Differing definitions, time reference, etc.

Two paradigms • Sample survey method • Direct estimates from sample surveys • Expensive to drive down to local areas • “Administrative records” method • Large collections of already-existing data • Typically lack control over data collection, sampling or model-based uncertainty measures

What is “Information Integration”? • Information integration is the process of using multiple datasets in concert to construct statistical estimands for the purpose of answering questions about those estimands. • E.g., • What is the ethnicity-specific unemployment rate in Stockport in July, 2003? • How many uninsured children at 100-200% of the Federal Poverty Level are there in Washoe County, Nevada in 2004? • What is the population size, by mode of available transportation, in Bradford in December 2005?

Attacking the problem:Three principles • Recognize that the estimand exists, but is not always observed directly • (Latent variable principle) • Recognize that none of the bits of data contributing to the estimand are without error or uncertainty • (Uncertainty principle) • Model the relationship between estimand and data sources • (Modeling principle)

The Standard Model for Combining Information Population of Inference Target Database (X) Link function “Ground Truth” Carefully Collected Data (Y) X Representative (?) Sample of X Estimated Model: Y=f(X) Augmented Database, with X and estimated Y’s

Case study • Determining the number of uninsured children at the county level • Problem: • Estimates not yet provided by the Fed statistical system (SCHIP expansion: 1997; first estimates: 2005) • Solution: • Develop own county-level estimates • Result: • Attempt to integrate state-level survey data (Current Population Survey) with county-level cohort-component population estimates • Combine two separate easier-to-construct quantities: Population estimates and uninsurance model

Model Age/Race/Sex/Hispanic (ARSH) origin estimates Synthetic Match by ARSH Slight slippage: CPS does not cover inst. population CPS/SIPP uninsurance status by ARSH ARSH cells Estimated Model: P[uninsured] = F(ARSH data) ARSH cells + P[uninsured|ARSH]

Case study • Evaluating program participants’ outcomes with unemployment insurance (UI) wage records versus a 13 week follow-up survey • Program completers get a follow-up survey • Performance measure = weekly wage • Performance standard: Avg. weekly wage of respondents > $xxx • Problem: Can UI records replace the survey? • What “UI weekly wage” ≈ “survey weekly wage”? • Solution: For linked records: • Regress UI weekly wage on survey weekly wage • Express performance standards on transformed scale

Model Program completers SSN Link Notable slippage: UI does not cover out of state Unemployment Insurance Wage Records Survey data Estimated Models: P[working] = F(survey data) Mean[wage] = F(survey data) UI wage “equivalent” to survey response

Case study • Improving race data on an administrative file • Social Security Administration “NUMIDENT” file • Limited race data, coding has changed over time • Base for population estimates system • Solution: Link to Current Population Survey (CPS), use CPS race in a model

Model NUMIDENT (Social Security) file Link by SSN; Correct for nonlinks • Slight slippage: • NUMIDENT • is not a register • CPS race only • covers adults CPS race data X Estimated Model: CPS race = F(NUMIDENT data) Augmented NUMIDENT, with X and estimated Y’s

Case study(NUMIDENT, round two) • Problems with first approach: • Family structure not accounted for (too many mixed race families) • CPS race only covers adults • Census 2000 race: “Mark all that apply” • Solution: Probabilistic link of NUMIDENT to Census 2000, use Census race in a model • Simple assignment for linked records • Modeled assignment for nonlinked records

Model NUMIDENT (Social Security) file Probabilistic linkage • Slight slippage: • NUMIDENT • is not a register • Linkage error not • accounted for Census 2000 Race data X Estimated Model: Cen2000 race = F(NUMIDENT data) Augmented NUMIDENT, with X and estimated Y’s

Questions to be answeredby the information integrator • Older/better vs. Newer/worse? • Data vintage is (surprisingly) important • Count vs. Model? • AR data never seem to match population of interest • Certainty vs. Uncertainty? • As yet unsolved problem • Synthetic Match vs. Statistical Matching vs. Record Linkage? • Yesterday: Technology almost exists • Today: Technology exists!

Outstanding challenges(Lacunae in current theory) • Representing uncertainty • Adapting to differential temporal/spatial/ontological reference • Sampling and design weights with linked survey/census/administrative data • Record linkage and statistical matching error • Covariance between estimation components • Spatial concentration of change

Lacuna #1 • How to represent uncertainty in the “administrative records method” • Sampling variance, model variance, and procedure variance? • Fisher/Gee (2004) general model: • Incorporate sampling variance where it is known • Incorporate model variance based on specification • Procedure variance: Bayesian estimate of uncertainty • Incorporate latent variable approach

A More Sophisticated Model (Fisher/Gee) i=ith area; j=jth indicator variable; μ=true (latent) value of estimand; =mean (latent) value of estimand for all areas; σμ=S.D. of estimand for all areas; a,b,σ=regression relationship of jth indicator to ith area.

Estimand: Log Number of Persons in Poverty POP Xj Xj  σ σ a a b b Xj h s m LNP j=CEN, TAX, NF j=CPS, ACS ith county j= FSP s a b

Thus, Rethinking the Model… Target Database (X) Link function “Ground Truth” Carefully Collected Data (Y) X Representative (?) Sample of X Estimated Model: Latent variable Z=f(X, Y) Augmented Database, with X’s, Y’s, and estimated latent Z’s

Lacuna #2 • Adapting to differential temporal/spatial/ontological reference (and adaptive temporal/spatial/ontological reference) • Information decays over time • The population of objects (people, housing units, areas) is changing (at different rates) • “Spatial and temporal slippage” (the difference between the reference dates/places of the data and the estimand of interest) • “Ontological slippage” (Barr; the difference between one representation of the object of interest and another)

Temporal Information Decay (Stuart) Level III: Global Parameters: α: file inclusion probs λ: Global migration rate (decay) Level II:Person captured in AR file? wi and yi: capture probs Level II: Individual-level migration t0i, t1i : Start and end time Level I:Individual-level observation zi: indicator of capture Magic day (Census Day) Capture in file B Capture in file A Information decay (migration)

Lacuna #3 • Sampling and design weights with linked survey/census/administrative data • We know how to analyze survey data with weights; but what about linked data? • Proposed weight (Chesher and Nesheim, 2004):

Lacuna #4 • Record linkage and statistical matching error • Known effect – biasing effect on inference • False links vs. false nonlinks tradeoff • Posterior probabilities: • P[records true link | linkage comparison] • P[records false nonlink | linkage comparison] • Provide factors correcting for linkage error • Elaborating on Chesher and Nesheim’s weight:

Lacuna #5 • Covariances between components • Functional relationships induce covariance • Soil example • Demographic analysis example • Spatial autocorrelation induces covariance • Relationships across levels of geography induce covariance between estimation components • Solutions? • Error propogation modeling (Heuvelink, 1998)? • Multilevel modeling?

Lacuna #6 Spatial concentration of change • Ian Cope, ONS: Address changes tend to be concentrated (Manchester vs. Westminster) • Today: Small area estimation methods: • (demographic) miss change • (statistical) smooth out change • Solutions?

Comments on OPUS:Points of Agreement • Bayesian belief networks appealing: • Able to account for various uncertainties • Able to fully account for model structure • Full posterior distributions valuable • Synthetic data: • “Synthetic data provides a view of the knowledge in the final state of a statistical model. It does not tell us directly about the real-world system” • ≈ Latent variable principle • Importance of the model: • “Because the model is central, its correctness and reliability are crucial” • ≈ Modeling principle

Comments on OPUS:Points of Contention/Clarification • Has statistical matching and/or microdata record linkage been fully exploited? • Microdata link = better inference • Are weights (survey and/or linkage and/or coverage) incorporated? • Remember: The population of inference ≠ data in hand • Is information decay/slippage addressed? • How to address spatial concentration of change and error?

Dean H. Judson U.S. Census Bureau Washington, DC 20233 Phone: 301-763-6037 Email: dean.h.judson@census.gov Contact Information

Comparing Methods for Data Combination