Data Editing

Data Editing. United Nations Statistics Division (UNSD), 16 March 2011, Santiago, Chile. Editing and Imputation Defined. Data editing: identification and flagging of missing, invalid, inconsistent or anomalous entries. Imputation: resolves problems identified in editing.

Presentation Transcript


  1. Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile

  2. Editing and Imputation Defined • Data editing: Identification and flagging of missing, invalid, inconsistent or anomalous entries • Imputation: Resolves problems identified in editing

  3. Editing and Imputation Process Flow [flow diagram with stages numbered 1, 2 and 3; not reproduced in the transcript]

  4. A General Editing and Imputation Process • Identify and treat initial errors • At the data capture stage • At the data entry stage • Ex: Data entered into a table is shifted by a row • Identify and treat errors • a: Interactively/manually treat influential errors • b: Automatically treat non-influential errors • Check the aggregated output

  5. Editing and Imputation Process Flow [flow diagram with stages numbered 1, 2 and 3; not reproduced in the transcript]

  6. Editing Errors • Two categories of errors • Systematic – reported consistently by some of the respondents • Ex: Gross values are reported instead of net values • Ex: Units are reported in thousands • Random – non-systematic or caused by accident • Ex: An extra digit is accidentally typed in the response • Manifestations of errors can be systematic or random • Missing • Ex: A variable is left blank because the respondent does not know the answer to the question, does not want to answer the question or does not understand the question • Outliers – values that deviate from a model • Ex: Unanticipated large values as compared to historic trend • Violation of logical or consistency rules • Ex: A total value is larger than the sum of its components • Edit rules are used to detect errors and often define how they should be treated

  7. Systematic Errors • Errors that are reported consistently over time • Unit error • Ex: x_{t-1} / x_t <= 300 • Sign error • Bugs in the collection vehicle • Misunderstanding a question or skip rules • Ex: systematic missing values • Detection • High failure rates of edits • Outlier detection (e.g. for unit errors) • Knowledge of the survey and the raw data processing
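The unit-error edit on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the function name and the choice to skip non-positive values are assumptions, and only the threshold of 300 comes from the slide.

```python
def fails_unit_edit(previous, current, max_ratio=300.0):
    """Return True when the edit x_{t-1} / x_t <= max_ratio is violated.

    A very large ratio suggests the current value was reported in
    thousands while the previous value was reported in units.
    """
    if previous <= 0 or current <= 0:
        # Assumption: the ratio is undefined here, so leave the record
        # to other edits (e.g. missing-value or sign checks).
        return False
    return previous / current > max_ratio


# Previous period reported in units, current period apparently in thousands.
print(fails_unit_edit(5_000_000, 5_100))  # ratio is roughly 980
print(fails_unit_edit(5_000, 5_100))      # ratio is roughly 0.98
```

A symmetric check (current far larger than previous) would need a second bound, which the slide does not give.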

  8. Systematic Errors (2) • Suggestions • Improvements in the survey or processing procedures should be made • When systematic errors are identified, they should be turned into edit rules • Detecting and correcting is cost effective • Should be treated before random errors

  9. Missing Values • Stem from questions a respondent did not answer • Detection is usually simple • Suggestions • Do not ignore missing values (→ bias and loss of estimate precision) • Missing values may not be missing at random • Do not replace with zeros (→ inaccurate results) • Nonresponse indicators should be compiled and analyzed because missing values may be systematic
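One simple nonresponse indicator is the share of records with a missing value for each variable. The sketch below assumes records are dictionaries and that a missing value is stored as None; both are illustrative choices, as are the variable names. A reported zero is deliberately not counted as missing.

```python
def item_nonresponse_rates(records, variables):
    """Share of records with a missing (None) value, per variable."""
    rates = {}
    for var in variables:
        missing = sum(1 for r in records if r.get(var) is None)
        rates[var] = missing / len(records) if records else 0.0
    return rates


# Hypothetical survey records; None marks an unanswered question.
records = [
    {"turnover": 100, "employees": None},
    {"turnover": None, "employees": None},
    {"turnover": 50, "employees": 3},
    {"turnover": 70, "employees": 0},  # a true zero, not a missing value
]
rates = item_nonresponse_rates(records, ["turnover", "employees"])
```

A variable with a much higher rate than the others is a candidate for a systematic cause, such as a misunderstood question or a faulty skip rule.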

  10. Outliers • Observations that do not fit well to a model • Ex: Median - k*IQR < value < Median + k*IQR • Ex: Month-on-month change <= 50% • May be defined by one variable (univariate) or a set of variables (multivariate) • Two types • Representative: correct, with similar units in the population • Non-representative: either incorrect, or correct but unique • Ex: correct – isolated labor strike at a plant
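The first example rule, Median - k*IQR < value < Median + k*IQR, can be sketched with the Python standard library. The value k = 3 and the sample data are assumptions for illustration; the slide does not specify k.

```python
import statistics


def iqr_outliers(values, k=3.0):
    """Return the values outside (median - k*IQR, median + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    median = statistics.median(values)
    iqr = q3 - q1
    lower, upper = median - k * iqr, median + k * iqr
    return [v for v in values if not (lower < v < upper)]


# Hypothetical reported values with one suspicious entry.
reported = [10, 11, 12, 13, 14, 15, 16, 17, 18, 500]
flagged = iqr_outliers(reported)
```

Flagged values are candidates for review rather than confirmed errors, since an outlier can be a correct but unusual observation.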

  11. Outliers (2) • Detection • Univariate • Multivariate • Periodic data (e.g. Hidiroglou-Berthelot) • Regression models or tree-models

  12. Edit Rules • Edit rules are used to determine whether a value is consistent or may be erroneous • Surveys are often designed so that these rules can be applied • Edit rules flag data in two ways • Fatal edit – indicates a value that is (almost) certainly in error • Query edit – indicates a value that may be in error

  13. Types of Edit Rules • Validation edits – often in the form of if-then statements • Ex: if total hours worked > 0 then employees > 0 • Ex: if Σ production quantity > 0 then Σ production value > 0 • Ex: if revenue from manufacturing plant > 0 then • hours worked by machinery technicians > 0 • plant capacity utilization > 0 • Σ production volume > 0 • Σ production value > 0 • Balance edits – detail items must add to total • Ex: total employee remuneration = wages + salaries + employer contributions to social security + welfare benefits + profits distributed to workers
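A validation edit and a balance edit from this slide could be coded roughly as follows. The field names and the rounding tolerance are hypothetical; each function returns True when the record passes the edit.

```python
def passes_validation_edit(record):
    """If total hours worked > 0 then employees > 0."""
    if record["hours_worked"] > 0:
        return record["employees"] > 0
    return True  # the rule does not apply when no hours are reported


def passes_balance_edit(record, tolerance=0.5):
    """Remuneration components must add up to the reported total."""
    components = (
        record["wages"]
        + record["salaries"]
        + record["social_security"]
        + record["welfare_benefits"]
        + record["profits_distributed"]
    )
    # Assumption: a small tolerance absorbs rounding in reported figures.
    return abs(record["total_remuneration"] - components) <= tolerance


# Hypothetical record that fails the validation edit but balances.
record = {
    "hours_worked": 1600, "employees": 0,
    "wages": 500, "salaries": 300, "social_security": 120,
    "welfare_benefits": 50, "profits_distributed": 30,
    "total_remuneration": 1000,
}
```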

  14. Types of Edit Rules (2) • Ratio edits – the ratio of two data items is bounded by lower and upper bounds. The pairs should be correlated. • Ex: total hours/employee/day is between 6 and 10 (very correlated) • Ex: plant capacity utilization <= 20% change from previous month • Ex: wages (W) should change within 10% of the change in total employment (E): (E_t/E_{t-1} - 1) - 0.1 <= W_t/W_{t-1} - 1 <= (E_t/E_{t-1} - 1) + 0.1 • Ex: Σ product value / Σ product quantity <= 10% change from previous month
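The wages-versus-employment ratio edit translates directly into code. This is a sketch: the function and parameter names are illustrative, and only the band of 0.1 comes from the slide.

```python
def passes_wage_employment_edit(w_prev, w_curr, e_prev, e_curr, band=0.10):
    """Check (E_t/E_{t-1} - 1) - band <= W_t/W_{t-1} - 1 <= (E_t/E_{t-1} - 1) + band."""
    wage_change = w_curr / w_prev - 1
    employment_change = e_curr / e_prev - 1
    return employment_change - band <= wage_change <= employment_change + band


# Wages up 5% while employment is up 2%: within the band.
ok = passes_wage_employment_edit(1000, 1050, 100, 102)
# Wages up 30% while employment is up 2%: outside the band.
suspicious = passes_wage_employment_edit(1000, 1300, 100, 102)
```

Previous-period values of zero would need separate handling, since the ratios are then undefined.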

  15. Types of Edit Rules (3) • Hidiroglou-Berthelot is a particular type of ratio edit • Ex: Employee month-on-month change • <= 100 employees: <= 50% change from previous month • 100 < employees <= 200: <= 20% change from previous month • > 200 employees: <= 10% change from previous month
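The size-dependent tolerances in this example can be sketched as a lookup followed by a ratio check. Note that this implements only the tiered rule shown on the slide, not the full Hidiroglou-Berthelot method. It also assumes the size class is based on the previous month's employment, which the slide does not state.

```python
def max_allowed_change(employees):
    """Tolerance for month-on-month change, by size class (from the slide)."""
    if employees <= 100:
        return 0.50
    if employees <= 200:
        return 0.20
    return 0.10


def passes_employment_change_edit(prev_employees, curr_employees):
    """True when the relative change is within the size-class tolerance."""
    if prev_employees <= 0:
        return False  # assumption: route to review, the ratio is undefined
    change = abs(curr_employees - prev_employees) / prev_employees
    return change <= max_allowed_change(prev_employees)
```

The effect is that small units may fluctuate widely without being flagged, while large units are held to tighter bounds.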

  16. Editing & Imputation Process • Interactive/Manual – a record with flagged data is manually reviewed, preferably by a subject matter expert • Automatic – a record with flagged data is automatically reviewed and corrected by a computer • Selective – designed to route edits/imputations into interactive or automatic streams • based on influential vs. non-influential errors • Macro-editing

  17. Editing and Imputation Process Flow [flow diagram with stages numbered 1, 2 and 3; not reproduced in the transcript]

  18. Selective Editing • Distinguishes between errors in values that have a significant influence on the survey estimate and those that are insignificant to the estimate • Selective editing splits raw data into two streams: • critical stream: records that most likely contain influential errors, and records of large companies • non-critical stream: records that are unlikely to contain influential errors • A score function determines which responses go into which stream

  19. Selective Editing (2) • Local score function = influence * risk • For example: influence and risk are computed from the raw value, the anticipated value and the sampling weight [formulas not preserved in the transcript]
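The exact formulas for influence and risk did not survive in this transcript, so the sketch below uses one common formulation as an explicit assumption: influence = sampling weight * anticipated value, and risk = |raw value - anticipated value| / anticipated value. Only the product form, local score = influence * risk, is taken from the slide.

```python
def local_score(raw, anticipated, weight):
    """Local score = influence * risk, under assumed definitions.

    Assumed (not confirmed by the slide):
      influence = weight * anticipated
      risk      = |raw - anticipated| / anticipated
    """
    if anticipated <= 0:
        raise ValueError("anticipated value must be positive")
    influence = weight * anticipated
    risk = abs(raw - anticipated) / anticipated
    return influence * risk


# A raw value 50% above the anticipated value, for a unit with weight 10.
score = local_score(raw=1500, anticipated=1000, weight=10)
```

Under these definitions the score reduces to weight * |raw - anticipated|, the weighted size of the suspected error. The anticipated value might be, for example, the previous period's edited value.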

  20. Selective Editing (3) • Local scores are aggregated into a global score for each record • First, local scores are scaled, e.g. by dividing observed values by mean values • Scaled local scores are combined into a global score. For example: Minkowski metric (a common approach) • The influence of large local scores increases with α • α = 1: simple sum of local scores • α = 2: Euclidean metric • α -> ∞: max local score
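The Minkowski metric combines scaled local scores s_i as GS = (sum_i s_i^α)^(1/α), which reproduces the three special cases listed on the slide. A minimal sketch follows; the function names are illustrative and non-negative local scores are assumed.

```python
import math


def scale_by_mean(values):
    """Scale values by dividing each one by the mean of all values."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def global_score(scaled_local_scores, alpha=2.0):
    """Minkowski combination of non-negative scaled local scores."""
    if math.isinf(alpha):
        return max(scaled_local_scores)  # limiting case: largest local score
    return sum(s ** alpha for s in scaled_local_scores) ** (1.0 / alpha)


local = [3.0, 4.0]
gs_sum = global_score(local, alpha=1)         # simple sum
gs_euclid = global_score(local, alpha=2)      # Euclidean metric
gs_max = global_score(local, alpha=math.inf)  # max local score
```

A larger α lets a single large local score dominate, so a record with one badly suspect variable ranks above a record with several mildly suspect ones.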

  21. Selective Editing (4) • A global score (GS) cut-off threshold must be determined • All records above the cut-off are selected for interactive editing • A simulation can be performed on previous data to determine a threshold • Raw unedited values and corresponding edited values are used • The first p% of records (ranked by global score) are edited and the resultant estimate is compared with the fully edited estimate • Trial and error over p yields the point at which the two estimates agree closely enough, and the corresponding cut-off value • Alternatively, a threshold doesn't need to be used • Records can be edited in priority order until time or budget constraints force a stop
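The simulation described here can be sketched as a search over the number of top-ranked records to edit. The record layout, the weighted-total estimator and the 1% tolerance are assumptions made for illustration. The function returns how many records need editing, along with the score of the last edited record as a candidate cut-off.

```python
def choose_cutoff(records, tolerance=0.01):
    """Find the fewest top-scored records to edit so that the estimate
    is within `tolerance` (relative) of the fully edited estimate."""
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    full = sum(r["weight"] * r["edited"] for r in ranked)
    for n_edited in range(len(ranked) + 1):
        # Edited values for the first n_edited records, raw values for the rest.
        estimate = sum(
            r["weight"] * (r["edited"] if i < n_edited else r["raw"])
            for i, r in enumerate(ranked)
        )
        if abs(estimate - full) <= tolerance * abs(full):
            cutoff = ranked[n_edited - 1]["score"] if n_edited else None
            return n_edited, cutoff
    # Not reached: editing every record reproduces the full estimate.


# Hypothetical data from a previous survey round.
history = [
    {"raw": 1000, "edited": 100, "weight": 1, "score": 9.0},
    {"raw": 52, "edited": 50, "weight": 1, "score": 0.4},
    {"raw": 200, "edited": 200, "weight": 1, "score": 0.0},
    {"raw": 151, "edited": 150, "weight": 1, "score": 0.1},
]
n_edited, cutoff = choose_cutoff(history)
```

In this example, editing only the top-scored record already brings the estimate within 1% of the fully edited total, so records scoring at or above that record's score would be routed to interactive editing.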

  22. Selective Editing (5) • A score function can be augmented in many ways • E.g. Size criteria where large enterprises are always selected for critical stream (influence irrespective of risk) • Selective editing improves efficiency

  23. Macro-Editing • Macro-editing techniques account for the distribution of variables and for the plausibility of estimates • Two forms of macro-editing • Aggregation method • Distribution method

  24. Macro-Editing - Aggregation • Verify whether the figures to be published seem plausible • Compare estimates with • Previous estimate values • Values from other related sources • Related estimates (such as electricity production and consumption)
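Comparing current estimates with previous ones can be sketched as a relative-change check per published aggregate. The 15% threshold and the aggregate names are assumptions; a real threshold would depend on the series' normal volatility.

```python
def implausible_aggregates(current, previous, max_rel_change=0.15):
    """Return aggregates whose relative change from the previous
    estimate exceeds max_rel_change, with the observed change."""
    flagged = {}
    for name, value in current.items():
        prior = previous.get(name)
        if not prior:
            continue  # no usable previous estimate to compare with
        change = (value - prior) / prior
        if abs(change) > max_rel_change:
            flagged[name] = change
    return flagged


# Hypothetical published totals for two industries.
current = {"manufacturing": 130.0, "mining": 101.0}
previous = {"manufacturing": 100.0, "mining": 100.0}
flagged = implausible_aggregates(current, previous)
```

A flagged aggregate is a prompt to drill down into the contributing records, not proof of an error.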

  25. Macro-Editing - Distribution • Available data are used to characterize the distribution of variables • Individual values are compared with this distribution • Records that contain uncommon values may require further inspection and possibly editing

  26. Macro-Editing Example: Graphical Editing • Univariate plot • Bivariate scatter plot

  27. Editing and Imputation Process Flow [flow diagram with stages numbered 1, 2 and 3; not reproduced in the transcript]
