On the general flow of editing

Jeroen Pannekoek and Li-Chun Zhang
Work Session on Statistical Data Editing, Oslo, Norway, 24-26 September 2012

Introduction
• An overall data editing process involves all activities that transform raw micro-data with errors and missing values into edited statistical micro-data suitable for producing publication figures. GSBPM: review, validate and edit, impute, output control.
• To implement an editing and imputation (E&I) system we need more detailed descriptions, called statistical functions, each of which performs some action on the data.
• This paper tries to identify common statistical functions that are used as building blocks in different overall E&I processes or strategies.
• Decomposing the overall process can facilitate process design, re-use of methodological components, documentation and generic software tools.

Contents
• Some classifications of data editing functions that are relevant for the process design.
• A summary of statistical data editing functions in some detail.
• Some process flow examples from the Netherlands and Norway, using the statistical functions as building blocks.
• Concluding remarks.

Classification of functions by purpose
• Verification
  Checking of hard and soft edit rules, calculation of scores, detection of systematic errors.
  Input: rules and data → Output: quality indicators and measures.
  Less formal: graphical macro-editing, output control.
• Selection (for further processing)
  Selection of units for manual editing.
  Selection of variables to change (error localisation).
  Input: quality indicators and data → Output: selection of records or fields.
• Amending
  Modifying selected data values to resolve problems detected by verification, including imputation of missing values.

Unit-mode versus batch-mode operation
Since manual editing is time-consuming, it should start during the sometimes lengthy data collection period. The same must then hold for any automatic editing function that is applied before manual editing.
• Unit-mode functions
  Proceed on a record-by-record basis and can be applied during the data collection phase.
• Batch-mode functions
  Use all of the data (or a large subset) and can only be applied near the end of the data collection phase.

Editing functions: verification (1/2)
• Edit-rules (unit-mode)
  Systems of connected balance edits: profit = turnover - total costs; total costs = costs of employees + costs of purchases + …
  Non-negativity edits and inequalities.
  Ratio edits (soft).
• Score functions
  Measure the potential effect that editing a unit may have on estimates of totals or other aggregate parameters of interest.
  Based on measures of the deviation between observed values and predicted or “anticipated” values: $s_i = f(x_j, x_j^a)$.
  Unit-mode: $x_j^a$ is based on historical data or another external source.
  Batch-mode: $x_j^a$ is based on current data.
  Also applied to measure and check the actual effect of (automatic) editing instead of the potential effect of editing. Then $x_j^a$ is the edited value.
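The two verification functions on this slide can be illustrated with a minimal Python sketch. The record layout, the field names and the choice of an absolute deviation for the function f are assumptions made for illustration only; the slides do not specify the form of f. The unit-mode score uses a previous-period value as the anticipated value, the batch-mode score uses the median of the current data.

```python
from statistics import median

def failed_edits(record, tol=1e-9):
    """Return the names of the hard edit rules that a record violates."""
    failures = []
    # Balance edit: profit = turnover - total costs
    if abs(record["profit"] - (record["turnover"] - record["total_costs"])) > tol:
        failures.append("profit_balance")
    # Non-negativity edits
    for name in ("turnover", "total_costs"):
        if record[name] < 0:
            failures.append("nonneg_" + name)
    return failures

def score(observed, anticipated):
    """Assumed form of f: absolute deviation from the anticipated value."""
    return abs(observed - anticipated)

# Hypothetical current-period records and previous-period turnover.
current = [
    {"id": 1, "turnover": 120.0, "total_costs": 100.0, "profit": 20.0},
    {"id": 2, "turnover": 95000.0, "total_costs": 80.0, "profit": 15.0},
    {"id": 3, "turnover": 110.0, "total_costs": 90.0, "profit": 20.0},
]
previous_turnover = {1: 115.0, 2: 90.0, 3: 105.0}

# Unit-mode: each record is scored on its own, as soon as it arrives.
unit_scores = {r["id"]: score(r["turnover"], previous_turnover[r["id"]])
               for r in current}

# Batch-mode: the anticipated value needs all current records.
anticipated = median(r["turnover"] for r in current)
batch_scores = {r["id"]: score(r["turnover"], anticipated) for r in current}
```

In this toy data, record 2 fails the balance edit and receives by far the largest score in both modes.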

Editing functions: verification (2/2)
• Extended score functions
  Score functions can be extended with indicators for further processing that are based on simple criteria other than the regular score function. For instance:
  >0: regular score value
  -9: “crucial” (dominates the totals in its branch) → manual editing
  -8: influential and main variables are missing → re-contact
  -7: non-influential and main variables missing → unit nonresponse
• Macro-verification
  Macro-verification functions are batch-mode by definition. They include all macro-editing activities: verifying aggregates, graphical inspection of distributions, graphical or model-based outlier detection, etc.
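The extended score codes listed on this slide could be produced by a function along the following lines. This is only a sketch: the boolean inputs and the order in which the criteria are tested (the "crucial" check first) are assumptions, since the slide gives the codes but not how the criteria are evaluated.

```python
# Follow-up action attached to each special code, as listed on the slide.
ACTIONS = {-9: "manual editing", -8: "re-contact", -7: "unit nonresponse"}

def extended_score(regular_score, crucial, influential, main_missing):
    """Return a special negative code, or the regular (positive) score.

    The test order is an assumption made for this sketch.
    """
    if crucial:
        return -9          # dominates the totals in its branch
    if main_missing:
        return -8 if influential else -7
    return regular_score
```

For example, `ACTIONS[extended_score(3.2, False, True, True)]` gives `"re-contact"`, while a unit without any special status simply keeps its regular score.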

Editing functions: selection
• Selection of units for manual editing using regular scores
  By comparing the score to a predetermined threshold value (unit-mode).
  By ordering units on their scores and selecting the highest-ranking ones (batch-mode).
• Selection of variables for amendment: error localization (unit-mode)
  To resolve edit failures, some values need to be changed. The error localization problem is to select which variables to change.
  A generic automatic approach (Fellegi-Holt): select the smallest (weighted) number of variables to change.
• Macro-selection (batch-mode) of units for manual editing
  Implausible aggregates eventually lead to suspect units (drilling down).
  Graphical verification leads to selection of the most extraordinary units.
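The selection functions above can be sketched as follows. Threshold selection works on one unit at a time, ranking needs all scores, and the error localization routine is a brute-force illustration of the Fellegi-Holt principle for a record with small categorical domains. It is not how production systems solve the problem, and the variables, domains, edits and weights are hypothetical.

```python
from itertools import combinations, product

def select_by_threshold(scores, threshold):
    """Unit-mode: a unit is selected as soon as its score exceeds the threshold."""
    return [unit for unit, s in scores.items() if s > threshold]

def select_top_k(scores, k):
    """Batch-mode: rank all units by score and keep the k highest."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def localize_errors(record, domains, edits, weights):
    """Find a minimum-weight set of variables whose values can be changed
    so that the record satisfies every edit (brute force)."""
    names = list(record)
    subsets = [c for k in range(len(names) + 1)
               for c in combinations(names, k)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        # Try every combination of new values for the chosen variables.
        for values in product(*(domains[v] for v in subset)):
            candidate = {**record, **dict(zip(subset, values))}
            if all(edit(candidate) for edit in edits):
                return set(subset)
    return None  # the edits cannot be satisfied at all

# Hypothetical categorical record that fails both edits.
domains = {"age": ["child", "adult"],
           "marital_status": ["single", "married"],
           "employed": ["yes", "no"]}
edits = [
    lambda r: not (r["age"] == "child" and r["marital_status"] == "married"),
    lambda r: not (r["age"] == "child" and r["employed"] == "yes"),
]
record = {"age": "child", "marital_status": "married", "employed": "yes"}

equal = {"age": 1, "marital_status": 1, "employed": 1}
trust_age = {"age": 3, "marital_status": 1, "employed": 1}
```

With equal weights, changing `age` alone resolves both edit failures. If `age` is considered more reliable and given weight 3, changing the other two variables (total weight 2) becomes the cheaper solution.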

Editing functions: amendment
• Amendment of systematic errors (unit-mode)
  Errors with a detectable cause and a reliable correction mechanism.
  Generic: thousand errors, recognizable typos, rounding errors.
  Subject-related: specific “if-then” type correction rules.
• Deductive imputation of missing values (unit-mode)
  Some missing values can be determined uniquely from the hard edit-rules, which gives the only feasible imputation.
• Model-based imputation (batch- or unit-mode)
  For most missing values we need model-based predicted values to impute.
  Batch-mode if current data are used to estimate parameters.
• Adjustment for inconsistency (unit-mode)
  Adjustment of imputed values to ensure consistency with the edit-rules.
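Two of the unit-mode amendment functions above lend themselves to a short sketch: correction of a thousand error and deductive imputation from a balance edit. The detection bounds (300 and 3000), the use of a reference value, and the field names are assumptions for illustration; the slides do not give a detection rule.

```python
def correct_thousand_error(observed, reference, low=300, high=3000):
    """Divide by 1000 when the value looks like it was reported in units
    instead of thousands. The bounds on the ratio are hypothetical."""
    if reference > 0 and low < observed / reference < high:
        return observed / 1000
    return observed

def deductive_impute(record):
    """Fill in a single missing value that is determined uniquely by the
    balance edit profit = turnover - total_costs."""
    keys = ("profit", "turnover", "total_costs")
    missing = [k for k in keys if record[k] is None]
    if len(missing) != 1:
        return record  # nothing missing, or not uniquely determined
    filled = dict(record)
    if missing[0] == "profit":
        filled["profit"] = filled["turnover"] - filled["total_costs"]
    elif missing[0] == "turnover":
        filled["turnover"] = filled["profit"] + filled["total_costs"]
    else:
        filled["total_costs"] = filled["turnover"] - filled["profit"]
    return filled
```

Both functions operate on a single record, so they can run during data collection. When two or more of the three values are missing, the balance edit no longer determines them and a model-based imputation is needed instead.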

Illustration of automatic editing
Data from child day care institutions: 500 records with 68 SBS-type variables and 40 hard edit-rules.

Process flow. Scenario A: Selective editing
[Flow chart; recoverable steps:]
Input micro data
1. Primary automated processing: 1a. Systematic errors; 1b. Evaluation of scores
2. Micro-selection (Yes/No branch): 2a. Selection using scores; 2b. (FH-)selection of fields
3. Clerical interactive editing
4. Automatic amendment of uncritical units: 4a. Imputation of missings; 4b. Adjustments
5. Macro-verification and selection (Yes/No branch)
Edited micro data
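One possible reading of Scenario A as a composition of the statistical functions is sketched below. The routing is an assumption, because the arrows of the flow chart are not recoverable from the text: units selected in step 2 are taken to go to clerical editing, the others to automatic amendment, and units flagged in step 5 are taken to return to clerical editing. All function arguments are hypothetical placeholders.

```python
def scenario_a(records, threshold, systematic_fix, score_fn,
               manual_edit, auto_amend, macro_check):
    """Sketch of a selective editing flow (assumed routing, see lead-in)."""
    # 1a. Correct systematic errors; 1b. evaluate scores (both unit-mode).
    fixed = [systematic_fix(r) for r in records]
    scored = [(r, score_fn(r)) for r in fixed]

    edited = []
    for r, s in scored:
        if s > threshold:                  # 2. micro-selection: Yes
            edited.append(manual_edit(r))  # 3. clerical interactive editing
        else:                              # 2. micro-selection: No
            edited.append(auto_amend(r))   # 4. automatic amendment

    # 5. Macro-verification (batch-mode) returns indices of suspect units,
    #    which are assumed to be sent to clerical editing.
    suspects = macro_check(edited)
    return [manual_edit(r) if i in suspects else r
            for i, r in enumerate(edited)]

# Toy run with stub functions that only record the route taken.
records = [{"id": 1, "value": 5}, {"id": 2, "value": 50}, {"id": 3, "value": 7}]
result = scenario_a(
    records,
    threshold=10,
    systematic_fix=lambda r: r,
    score_fn=lambda r: r["value"],
    manual_edit=lambda r: {**r, "route": "manual"},
    auto_amend=lambda r: {**r, "route": "auto"},
    macro_check=lambda rs: {2},
)
```

Passing the building blocks in as arguments mirrors the idea of the paper that the same statistical functions can be rearranged into different overall strategies.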

Process flow. Scenario B: More automatic editing
[Flow chart; recoverable steps:]
Input micro data
1. Primary automated processing (all unit-mode automatic editing): 1a. Systematic errors; 1b. (FH-)selection of fields; 1c. Imputation; 1d. Adjustments; 1e. Evaluation of scores
2. Micro-selection (Yes/No branch)
3. Clerical interactive editing
4. Automatic amendment of uncritical units: 4a. Batch-mode imputation; 4b. Adjustments
5. Macro-verification and selection (Yes/No branch)
Edited micro data

Process flow. Scenario C: No timeliness problems
[Flow chart; recoverable steps:]
Input micro data
1. Primary automated processing: systematic errors
2. Macro-verification and selection, including batch-mode scores (Yes/No branch)
3. (Partial) clerical interactive editing
4. Automatic amendment: 4a. Imputation of missings; 4b. Adjustments
Edited micro data

Concluding remarks
• The description of the overall process presented here can help communication between editing staff, project managers, process designers and methodologists. It clarifies the organization of the process and the choices that must be made.
• It also helps to define the functionalities and interfaces of generic software components by placing them in the context of the overall process scheme.
• Increasing automatic editing can greatly reduce the amount of manual editing. This may involve automatic editing of influential units and subject-specific “if-then” rules.
