
Automatic Editing with Soft Edits

Sander Scholtus (Statistics Netherlands)


Presentation Transcript


  1. Automatic Editing with Soft Edits. Sander Scholtus (Statistics Netherlands)

  2. Automatic editing
     • Goal: detect and correct errors and missing values without human intervention
     • Data is made consistent with respect to a set of edits
     • Two steps:
       • detecting erroneous and missing values (error localisation)
       • imputation of new values

  3. Automatic editing (2)
     • Fellegi-Holt paradigm for error localisation: find the smallest subset of the variables that can be imputed to satisfy all edits
     • Generalised version uses confidence weights
     • At Statistics Netherlands: SLICE software

  4. SLICE
     • Branch-and-bound algorithm:
     [Figure: binary tree. The root node x1 branches into "x1 erroneous" and "x1 correct"; each x2 node branches into "x2 erroneous" and "x2 correct"; the next level holds the x3 nodes.]

  5. SLICE
     • Branch-and-bound algorithm:
     [Figure: the same binary tree with the branches relabelled. The root node x1 branches into "eliminate x1" and "fix x1"; each x2 node branches into "eliminate x2" and "fix x2"; the next level holds the x3 nodes.]

  6. SLICE (2)
     • Leaf nodes of the tree:
       • all variables have been either fixed or eliminated
       • interpretation: eliminated variables are incorrect
     • Associated sets of edits:
       • contain no variables
       • either empty or contain only trivial statements
     • Theorem (De Waal and Quere, 2003): a leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions
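The fix and eliminate operations, and the leaf-node check from the theorem, can be sketched in a few lines of Python. This is a minimal illustration and not the SLICE implementation: it assumes linear edits over real-valued variables, uses plain Fourier-Motzkin elimination, and borrows as stand-in edits the two example edits quoted later in the presentation (Profit = Turnover − Costs, and Profit < 0.6 × Turnover relaxed to ≤ and treated as a hard edit). The helper names `edit`, `fix`, `eliminate`, `combine`, `consistent` and `feasible` are hypothetical.

```python
from fractions import Fraction

# An edit is (coeffs, const, kind), read as: sum(coeffs[v] * x[v]) + const (== or >=) 0.
def edit(coeffs, const=0, kind=">="):
    return ({v: Fraction(c) for v, c in coeffs.items()}, Fraction(const), kind)

def fix(edits, var, value):
    """Substitute the observed value of var into every edit."""
    out = []
    for coeffs, const, kind in edits:
        rest = {v: c for v, c in coeffs.items() if v != var}
        out.append((rest, const + coeffs.get(var, 0) * Fraction(value), kind))
    return out

def combine(e, f, factor, var):
    """Return e + factor * f with var dropped (its coefficient cancels)."""
    coeffs = {v: e[0].get(v, 0) + factor * f[0].get(v, 0)
              for v in set(e[0]) | set(f[0]) if v != var}
    return (coeffs, e[1] + factor * f[1], e[2])

def eliminate(edits, var):
    """Fourier-Motzkin elimination: the implied edits that do not involve var."""
    with_var = [e for e in edits if e[0].get(var, 0) != 0]
    out = [e for e in edits if e[0].get(var, 0) == 0]
    eq = next((e for e in with_var if e[2] == "=="), None)
    if eq is not None:
        # Use the equality to substitute var out of the other edits.
        for e in with_var:
            if e is not eq:
                out.append(combine(e, eq, -e[0][var] / eq[0][var], var))
        return out
    pos = [e for e in with_var if e[0][var] > 0]
    neg = [e for e in with_var if e[0][var] < 0]
    for p in pos:
        for n in neg:
            # The factor is positive, so the sum of two ">=" edits is again ">=".
            out.append(combine(p, n, -p[0][var] / n[0][var], var))
    return out

def consistent(edits):
    """Leaf-node check: variable-free edits must all be true statements."""
    return all(const == 0 if kind == "==" else const >= 0
               for _, const, kind in edits)

def feasible(edits, record, impute):
    """Fix every variable outside `impute`, eliminate the rest, check the leaf."""
    for var, value in record.items():
        if var not in impute:
            edits = fix(edits, var, value)
    for var in impute:
        edits = eliminate(edits, var)
    return consistent(edits)

# Stand-in edits (assumption, see the lead-in) and the T, P, C values of the example record.
EDITS = [
    edit({"P": 1, "T": -1, "C": 1}, kind="=="),  # P = T - C
    edit({"T": "0.6", "P": -1}),                 # P <= 0.6 T
]
RECORD = {"T": 100, "P": 40000, "C": 60000}

print(feasible(EDITS, RECORD, set()))        # False: the record fails the edits as observed
print(feasible(EDITS, RECORD, {"P"}))        # True
print(feasible(EDITS, RECORD, {"C"}))        # False
print(feasible(EDITS, RECORD, {"P", "C"}))   # True
```

With these stand-in edits, `eliminate(EDITS, "P")` returns the single implied edit C − 0.4 T ≥ 0, which contains no P. Exact rational arithmetic (`Fraction`) is used so that the contradiction check is not affected by rounding.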

  7. SLICE (3)
     • Application of SLICE in the production process:
       • automatic editing of micro data for the Dutch structural business statistics
       • approximately 100 variables and 100 edits
       • evaluation studies: sometimes large differences between automatic and manual editing

  8. Hard edits and soft edits
     • Examples of edits:
       • Profit = Turnover − Costs
       • Profit < 0.6 × Turnover
     • First example:
       • hard edit
       • has to hold by definition
     • Second example:
       • soft edit
       • can also be failed by correct values
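The difference between the two kinds of edit can be made concrete with a small check. The sketch below encodes the two example edits as Python predicates; the function name `check` and the `high_margin` record are hypothetical and only serve the illustration.

```python
def check(record):
    """Evaluate the two example edits on a record with keys T, P, C."""
    return {
        # Hard edit: Profit = Turnover - Costs, must hold by definition.
        "hard": record["P"] == record["T"] - record["C"],
        # Soft edit: Profit < 0.6 x Turnover, usually but not always true.
        "soft": record["P"] < 0.6 * record["T"],
    }

typical = {"T": 100, "P": 40, "C": 60}
high_margin = {"T": 100, "P": 70, "C": 30}     # hypothetical but internally consistent
suspicious = {"T": 100, "P": 40000, "C": 60000}

print(check(typical))      # {'hard': True, 'soft': True}
print(check(high_margin))  # {'hard': True, 'soft': False}
print(check(suspicious))   # {'hard': False, 'soft': False}
```

The `high_margin` record shows why the second edit is soft: the values are consistent with the accounting identity, so they may well be correct, yet the soft edit flags them as unusual.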

  9. Hard edits and soft edits (2)
     • Manual editing uses both hard and soft edits
     • Current methods for automatic editing can only handle hard edits
     • Practical solutions:
       • ignore all soft edits
       • treat soft edits as hard edits
     • Can this be improved?

  10. Error localisation with soft edits
      • Current error localisation problem: minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights
      • Suggested new error localisation problem: minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits
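One way to write the two problems side by side is given below. The notation is not from the slides and is assumed here for illustration: w_j is the confidence weight of variable j, y_j ∈ {0, 1} indicates that variable j is imputed, z_k ∈ {0, 1} indicates that soft edit k is failed, and s_k is a hypothetical cost attached to failing soft edit k. The additive cost term is only one possible choice.

```latex
\begin{align*}
\text{current:}\quad
  & \min_{y} \; \sum_{j} w_j y_j \\
  & \text{s.t. all edits can be satisfied by imputing } \{\, j : y_j = 1 \,\}, \\[4pt]
\text{suggested:}\quad
  & \min_{y,\,z} \; \sum_{j} w_j y_j + \sum_{k} s_k z_k \\
  & \text{s.t. all hard edits, and every soft edit } k \text{ with } z_k = 0, \\
  & \phantom{\text{s.t. }} \text{can be satisfied by imputing } \{\, j : y_j = 1 \,\}.
\end{align*}
```

Setting every s_k to 0 corresponds to ignoring the soft edits, and letting s_k grow without bound corresponds to treating them as hard edits, so the two practical solutions mentioned earlier appear as limiting cases of this form.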

  11. Error localisation with soft edits (2)
      • The new error localisation problem can be solved by an extended version of the SLICE algorithm
      [Figure: binary tree. The root node x1 branches into "eliminate x1" and "fix x1"; each x2 node branches into "eliminate x2" and "fix x2"; the next level holds the x3 nodes.]

  12. Example
      • Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)
      • Edits:
        Hard edits:
        Soft edits:
      • Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3
      • Contribution of each failed soft edit: 2

  13. Example (2)
      • Original data and edits: T = 100; P = 40000; C = 60000; N = 5
        Hard edits:
        Soft edits:

  14. Example (3)
      • Original data and edits: T = 100; P = 40000; C = 60000; N = 5
        Hard edits:
        Soft edits:
      • Eliminate P from the original edits:
        Implied hard edits:
        Implied soft edits:

  15. Example (4)
      • According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed
      • Imputing only P is a feasible solution to the error localisation problem
      • The value of the target function equals 1 + 2 = 3

  16. Example (5)
      • Data and edits after eliminating P: T = 100; C = 60000; N = 5
        Implied hard edits:
        Implied soft edits:
      • Eliminate C from these edits:
        Implied hard edits:
        Implied soft edits:

  17. Example (6)
      • According to the theory, P and C can be imputed to satisfy all hard and soft edits
      • Imputing P and C is a feasible solution to the error localisation problem
      • The value of the target function equals 1 + 1 = 2
      • This turns out to be the optimal solution
      • Possible corrected version of the record: T = 100; P = 40; C = 60; N = 5

  18. Example (7)
      • Imputing only P is the optimal solution if the soft edits are ignored
      • Corrected version of the record: T = 100; P = -59900; C = 60000; N = 5
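The target-function arithmetic of the example can be reproduced directly from the confidence weights and the contribution of 2 per failed soft edit. The sketch below compares only the two candidate solutions discussed in the example, with the number of failed soft edits taken as reported there; it does not search over all subsets, and the function name `target` is hypothetical.

```python
# Confidence weights and soft-edit contribution as stated in the example.
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}
SOFT_EDIT_COST = 2

def target(imputed, failed_soft_edits, soft_cost=SOFT_EDIT_COST):
    """Sum of confidence weights of imputed variables plus cost of failed soft edits."""
    return sum(WEIGHTS[v] for v in imputed) + soft_cost * failed_soft_edits

# Candidate solution -> number of failed soft edits, as reported in the example.
candidates = {("P",): 1, ("P", "C"): 0}

with_soft = {s: target(s, k) for s, k in candidates.items()}
without_soft = {s: target(s, k, soft_cost=0) for s, k in candidates.items()}

print(with_soft)                               # {('P',): 3, ('P', 'C'): 2}
print(without_soft)                            # {('P',): 1, ('P', 'C'): 2}
print(min(with_soft, key=with_soft.get))       # ('P', 'C')
print(min(without_soft, key=without_soft.get)) # ('P',)
```

Setting `soft_cost=0` corresponds to ignoring the soft edits, which is why the ranking of the two candidates flips between the two cases.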

  19. Discussion
      • Future work:
        • Implementation of the algorithm in R (in progress)
        • Test on realistic data (Dutch structural business statistics)
        • How to model the costs of failed soft edits
      Thank you for your attention!
