
Q2010 Special session 34 Data quality and inference under register information



Presentation Transcript


  1. Q2010 Special session 34: Data quality and inference under register information. Discussion by Carl-Erik Särndal

  2. I thank the authors for four interesting papers, quite different, on one broad theme: Administrative register data. They are also similar in this respect: such data are collected for other purposes, hence "not quite good enough", "secondary data", "not exactly relevant", but available, and cheap. Their presence cannot be ignored; what use can be made of them?

  3. Reijo Sund: A framework for evaluating the quality of administrative data for research purposes

  4. Administrative data: their value for research (e.g., for a question in behavioural science) is not guaranteed, since they were produced for some other purpose. The possibilities nevertheless offered by such data lie in information communication: the actual information (for the specific research objective) must be decoded from the data ("using an infological equation"). The paper offers a framework to facilitate the process of "transforming raw register data into useful information". How is the researcher (say, a sociologist) assisted, in a more concrete manner, by the framework?

  5. Daas, Ossen and Tennekes: Determination of administrative data quality: recent results and new developments

  6. Administrative data: their utility for the NSI in producing (national) statistics; again, the problem is that they were produced for some other purpose. "It is essential that the NSI can determine the statistical usability (the quality) of the data source prior to use." The proposals in the paper (work recently started(?) at CBS) should eventually result in "a new comprehensive quality-indicator instrument for admin. registers." Method: evaluation of a number of components (a source hyperdimension and a metadata hyperdimension), with the objective that these be "combined into a single instrument".

  7. Thomas Laitila and Anders Holmberg: Comparison of sample and register survey estimators via MSE decomposition

  8. Administrative data: should they be used, because available and cheap, for a given objective of statistics production, or should we (the statistical agency) collect new (possibly "better") data? A choice situation: either use the register data, or incur the cost of collecting new data (by a traditional sample survey), or perhaps do both. The issue is the relevance of the register data for the given objective: the concept of relevance bias affects the MSE; we must compare with the MSE of the sample-survey alternative (free of relevance error, but affected by sampling error). But not only that: the other, "usual" errors (measurement error, non-response, etc.) intervene in the MSE comparison; assumptions about their magnitude enter into the comparison. Conceptual differences may exist in the interpretation of "variance" and "bias".
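The trade-off sketched above, a register estimate carrying relevance bias but no sampling error, versus a sample-survey estimate that is unbiased for the target but has sampling variance, can be illustrated with a toy simulation. This is not the paper's MSE decomposition, only the textbook identity MSE = bias² + variance under invented numbers (population size, bias, and sample size are all assumptions for illustration):

```python
import random
import statistics

random.seed(1)

# Hypothetical population: a true target variable, and a register proxy
# that measures a slightly different concept (constant relevance bias).
N = 10_000
true_values = [random.gauss(50.0, 10.0) for _ in range(N)]
relevance_bias = 2.0
register_values = [y + relevance_bias for y in true_values]

true_mean = statistics.fmean(true_values)

# Register estimator: full enumeration, so no sampling error, but biased.
register_estimate = statistics.fmean(register_values)
mse_register = (register_estimate - true_mean) ** 2   # pure bias^2

# Sample-survey estimator: unbiased for the target, but has sampling variance.
n = 100
reps = 2_000
survey_errors = []
for _ in range(reps):
    sample = random.sample(true_values, n)
    survey_errors.append(statistics.fmean(sample) - true_mean)
mse_survey = statistics.fmean(e * e for e in survey_errors)  # ~ sigma^2 / n

print(f"MSE register ≈ {mse_register:.3f}")  # ≈ bias^2 = 4
print(f"MSE survey   ≈ {mse_survey:.3f}")    # ≈ 100/100 = 1
```

With these made-up numbers the survey wins; raise the sampling cost (smaller n) or shrink the relevance bias, and the ranking flips, which is exactly the comparison the paper frames.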

  9. Li-Chun Zhang: A statistical approach to surrogate data

  10. Administrative data: could such easy-to-come-by data ("surrogate data") replace good data ("target data", i.e., directly collected variables, for a smallish sample) in such a way that valid inferences could be obtained by an analysis of the surrogate data base? When is the surrogate data set "empirically equivalent" to the target data set? How do we generate "valid surrogate data … able to yield unbiased inference just like the target data", so as to make statistical production possible from the surrogate data base (much larger than the target data set)? This is an issue in the comparability of the empirical distributions of the data sets.
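One generic way to probe "comparability of empirical distributions" is a two-sample Kolmogorov-Smirnov statistic: the maximum gap between the empirical CDFs of the small target sample and the large surrogate data base. This is a hedged illustration of the general idea, not Zhang's actual equivalence criterion; the data and the 0/0.7 shift are invented:

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    def ecdf(xs, t):
        # Fraction of xs <= t, for a sorted list xs.
        return bisect.bisect_right(xs, t) / len(xs)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in set(a) | set(b))

random.seed(2)
target        = [random.gauss(0.0, 1.0) for _ in range(500)]    # directly collected
surrogate_ok  = [random.gauss(0.0, 1.0) for _ in range(5_000)]  # same distribution
surrogate_off = [random.gauss(0.7, 1.0) for _ in range(5_000)]  # shifted register proxy

d_ok = ks_statistic(target, surrogate_ok)
d_off = ks_statistic(target, surrogate_off)
print(f"KS, compatible surrogate: {d_ok:.3f}")   # small
print(f"KS, shifted surrogate:    {d_off:.3f}")  # noticeably larger
```

A small statistic is only a necessary condition for usefulness: matching marginal distributions does not by itself guarantee unbiased inference from the surrogate base, which is precisely why the paper's stronger notion of empirical equivalence matters.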

  11. To what extent are the proposed ideas (the suggested methods) already implemented (as "tools"), first of all in the statistical agencies in question: Finland, Norway, Sweden, the Netherlands? My impression: still at "the ideas stage" rather than at "the methods stage" (but I may be wrong). In a longer perspective, how likely is their implementation more generally, outside the country of origin? How do I rank the four papers with regard to "implementability"?
