Assessing the Impact of SDC Methods on Census Frequency Tables



**Assessing the Impact of SDC Methods on Census Frequency Tables**
Natalie Shlomo
Southampton Statistical Sciences Research Institute, University of Southampton

**Topics:**
• Introduction
• Disclosure risk
• SDC methods for protecting Census frequency tables
• Disclosure risk and data utility measures
• Description of table
• Risk-Utility analysis
• Summary of Analysis
• Discussion and future work

**Introduction**
Identification and individual attribute disclosure
• Disclosure risk in Census tables:
  - Need to protect many tables from one dataset containing population counts, which can be linked and differenced
  - Need to consider output strategies for standard tables and web-based table-generating applications
  - Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility

**Disclosure Risk**
For Census tables:
• 1's and 2's in cells are disclosive since these cells lead to identification
• 0's may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:
• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
• Proportion of high-risk cells (1 or 2)
• Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)

**SDC Methods for Protecting Frequency Tables**
1. Pre-tabular methods (special case of PRAM)
  - Random record swapping
  - Targeted record swapping

In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.

Implementation:
  - Randomly select p% of the households
  - Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables
  - Can target records for swapping that are in high-risk cells of size 1 or 2

**SDC Methods for Protecting Frequency Tables**
2. Rounding

Unbiased random rounding: entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw.

Example: for unbiased random rounding to base 3:
  - 1 → 0 w.p. 2/3
  - 1 → 3 w.p. 1/3
  - 2 → 0 w.p. 1/3
  - 2 → 3 w.p. 2/3

The expectation of the rounding perturbation is 0.

Margins and internal cells are rounded separately.

Small cell rounding: internal cells are aggregated to obtain margins.

**SDC Methods for Protecting Frequency Tables**
2. Rounding (cont.)

Semi-controlled unbiased random rounding: control the selection strategy for entries to round, i.e. use a "without replacement" strategy.

Implementation:
  - Calculate the expected number of entries to round up
  - Draw a simple random sample without replacement (SRSWOR) from among the entries and round these up; the rest are rounded down

Can be carried out per row/column to ensure consistent totals on one dimension (key statistics).

Eliminates the extra variance that results from the rounding.

**SDC Methods for Protecting Frequency Tables**
2. Rounding (cont.)

Controlled rounding: feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
  - Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding
  - All rounded entries add up to rounded margins
  - Method is not unbiased and entries can jump a base

**SDC Methods for Protecting Frequency Tables**
3. Cell Suppression

Hypercube method (Giessing, 2004):
  - Feature in Tau-Argus and suited for large tables
  - Uses a heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell, with optimality conditions

Imputing suppressed cells for utility evaluation: replace each suppressed cell by the average information loss in its row/column.

Example: two cells are suppressed in a row whose known margin is 500. The total of the non-suppressed cells is 400. Each suppressed cell is replaced with a value of (500 - 400) / 2 = 50.

**Disclosure Risk Measures**
Need to determine output strategies and SDC together:
• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
• Web-based tables and flexible categories and geographies: need to add noise or round for every query

Disclosure risk measures:
• Proportion of high-risk cells (C1 and C2) not protected
• Percent true zeros out of total zeros

**Utility Measures**
• Distance metric: distortion to distributions (Gomatam and Karr, 2003)
  - Internal cells: defined per row k of the table, in terms of the number of rows and the cell frequency for each cell c [formula not preserved in the source]
  - Margins: defined for a margin M, in terms of the number of categories and the number of persons in each category [formula not preserved in the source]

**Utility Measures**
• Impact on tests for independence: Cramer's V measure of association, $CV = \sqrt{\dfrac{\chi^2 / n}{\min(r-1,\ c-1)}}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $r$ and $c$ are the numbers of rows and columns
• Same utility measure for entropy and the Pearson chi-square statistic
• Impact on log-linear analysis for multi-dimensional tables, i.e. deviance

**Utility Measures**
• "Between" variance:
  - Defined from the target proportion for a cell c in row k and the overall proportion across all rows of the table
  - The "between" variance: [formula not preserved in the source]
  - The utility measure: [formula not preserved in the source]

**Utility Measures**
• Variance of cell counts:
  - The variance of the cell count for row k, which depends on the number of columns: [formula not preserved in the source]
  - The average variance across all rows: [formula not preserved in the source]
  - The utility measure: [formula not preserved in the source]

**Description of Table**
• 2001 UK Census table:
  - Rows: Output Areas (1,487)
  - Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)
  - Table includes 317,064 persons aged 16-74 in 53,532 internal cells
  - Average cell size: 5.92, although the table is skewed
  - Number of zeros: 17,915 (33.5%)
  - Number of small cells: 14,726 (27.5%)

**Summary of Analysis**
• Rounding eliminates small cells, but with random rounding there is still a need to protect against disclosure by differencing and linking
• Rounding adds more ambiguity into the zero counts
• Random rounding to base 5 has the greatest impact on distortions to the distribution
• Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
• Cell suppression with the simple imputation method has the highest utility (no perturbation on large cells) but is difficult to implement in a Census

**Summary of Analysis**
• Record swapping leaves a high percentage of true small cells and less ambiguity in the zero cells
• Record swapping has less distortion to internal cells than rounding; the distortion increases with higher swapping rates
• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
• Column margins of the table have no distortion because of controls in swapping
• Combining record swapping with rounding results in more distortion but provides added protection

**Summary of Analysis**
• Record swapping across geographies leads to attenuation:
  - loss of association (moving towards independence)
  - counts "flattening" out
  - proportions moving to the overall proportion
• Attenuation increases with higher swapping rates
• Targeted record swapping has less attenuation than random swapping
• Rounding introduces more zeros:
  - levels of association are higher
  - cell counts are "sharper"
• Effects are less severe for controlled rounding
• Combining record swapping and rounding cancels out opposing effects, depending on the direction and magnitude of each procedure separately

**Discussion**
• Choice of SDC method depends on tolerable risk thresholds and demands for "fit for purpose" data
• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding
• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
• Future output strategies based on flexible table-generating software. More research is needed into disclosure risk by differencing and linking (collaboration with the CS community)
• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)
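
**Worked sketch: entropy as a disclosure risk indicator**
The Disclosure Risk slide states that entropy is 0 when a distribution has a single non-zero cell and log K when all K cells are equal. The following is a minimal sketch of that statement using the standard Shannon entropy with natural logarithms; the function name `entropy` is hypothetical and not taken from the presentation.

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of the distribution given by cell counts."""
    total = sum(counts)
    # Zero cells contribute nothing to the sum (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# One non-zero cell: minimum entropy, the attribute is fully revealed.
degenerate = entropy([0, 7, 0, 0])
# Four equal cells: maximum entropy, log K with K = 4.
uniform = entropy([3, 3, 3, 3])
```

A row or column with entropy close to 0 is the situation the slide flags as attribute disclosure.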
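
**Worked sketch: random record swapping**
The implementation steps on the record swapping slide (select a share of households, find a household matching on key variables, swap geography) can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the analysis: the dictionary fields `geo`, `size` and `profile`, the function name `swap_geography`, and the rule that a household is swapped at most once are all hypothetical.

```python
import random

def swap_geography(households, rate, key_vars=("size", "profile"), rng=random):
    """Swap the 'geo' field between pairs of households that match on key_vars.

    households: list of dicts, each with a 'geo' entry plus the key variables.
    rate: fraction of households to select for swapping (the p% of the slide).
    """
    out = [dict(h) for h in households]          # work on copies
    n_select = round(rate * len(out))
    selected = rng.sample(range(len(out)), n_select)
    done = set()                                 # households already swapped
    for i in selected:
        if i in done:
            continue
        key_i = tuple(out[i][v] for v in key_vars)
        partners = [
            j for j in range(len(out))
            if j != i and j not in done
            and tuple(out[j][v] for v in key_vars) == key_i
            and out[j]["geo"] != out[i]["geo"]   # swapping within one area changes nothing
        ]
        if not partners:
            continue                             # no match: leave the household as is
        j = rng.choice(partners)
        out[i]["geo"], out[j]["geo"] = out[j]["geo"], out[i]["geo"]
        done.update((i, j))
    return out

example = [
    {"geo": "A", "size": 2, "profile": "x"},
    {"geo": "B", "size": 2, "profile": "x"},
    {"geo": "A", "size": 3, "profile": "y"},
    {"geo": "B", "size": 3, "profile": "y"},
]
swapped = swap_geography(example, rate=1.0, rng=random.Random(1))
```

Because only the geography moves between matched households, the number of households per area and the key variables of each household are unchanged, which is consistent with the slide's point about avoiding edit failures.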
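
**Worked sketch: unbiased random rounding**
The base-3 example on the rounding slide (1 goes to 0 with probability 2/3 and to 3 with probability 1/3, and so on) generalizes to rounding up with probability residual/base. A minimal sketch, with hypothetical function names `random_round` and `expected_rounded`:

```python
import random

def random_round(count, base=3, rng=random):
    """Round a non-negative integer count to a multiple of base, unbiasedly."""
    residual = count % base
    if residual == 0:
        return count                      # already a multiple of the base
    lower = count - residual
    # Round up with probability residual/base, otherwise round down.
    if rng.random() < residual / base:
        return lower + base
    return lower

def expected_rounded(count, base=3):
    """Exact expectation of random_round(count, base)."""
    residual = count % base
    lower = count - residual
    p_up = residual / base
    return (1 - p_up) * lower + p_up * (lower + base)
```

The expectation equals the original count for every entry, which is what the slide means by the rounding having expectation 0: the perturbation averages out, although each individual table total can still change.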
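
**Worked sketch: semi-controlled random rounding**
The semi-controlled rounding slide describes fixing the number of entries rounded up at its expected value and choosing them by simple random sampling without replacement. The sketch below is one possible reading: it assumes the sample is drawn separately within each residual class (so that each entry keeps its unbiased round-up probability) and that a fractional expected count is resolved by one extra random draw. Both choices, and the name `semi_controlled_round`, are assumptions rather than details given in the presentation.

```python
import random
from collections import defaultdict

def semi_controlled_round(counts, base=3, rng=random):
    """Round counts to multiples of base with a fixed number of round-ups."""
    strata = defaultdict(list)            # residual -> indices with that residual
    for i, c in enumerate(counts):
        if c % base:
            strata[c % base].append(i)
    rounded = [c - c % base for c in counts]   # start with everything rounded down
    for residual, idx in strata.items():
        expected_up = len(idx) * residual / base
        n_up = int(expected_up)
        # Resolve a fractional expected count stochastically (assumption).
        if rng.random() < expected_up - n_up:
            n_up += 1
        for i in rng.sample(idx, n_up):   # SRSWOR within the residual class
            rounded[i] += base
    return rounded

row = [1, 1, 1, 2, 2, 2, 6, 0]
protected_row = semi_controlled_round(row, rng=random.Random(42))
```

For this row the expected number of round-ups is a whole number in each residual class, so the rounded row always sums to the original total of 15; this is the sense in which the method removes the extra variance of independent random rounding.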
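
**Worked sketch: imputing suppressed cells**
The cell suppression slide gives an example in which two suppressed cells in a row with margin 500 and a non-suppressed total of 400 are each replaced by 50. A minimal sketch of that arithmetic, representing suppressed cells as `None` (a hypothetical convention) in a hypothetical function `impute_suppressed`:

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the missing total."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

imputed = impute_suppressed([150, None, 250, None], margin=500)
# imputed == [150, 50.0, 250, 50.0]
```

The imputed row is only used to evaluate utility; it is not a published output.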
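
**Worked sketch: disclosure risk measures**
The two measures on the Disclosure Risk Measures slide can be computed by comparing an original table with its protected version. The sketch below assumes a small cell counts as "not protected" when its value is unchanged, and that "true zeros" are protected zeros that were also zero originally; these readings and the name `risk_measures` are assumptions.

```python
def risk_measures(original, protected):
    """Return (proportion of small cells left unchanged, percent true zeros).

    original, protected: flat lists of cell counts in the same cell order.
    """
    small = [i for i, c in enumerate(original) if c in (1, 2)]
    unchanged = sum(1 for i in small if protected[i] == original[i])
    zeros = [i for i, c in enumerate(protected) if c == 0]
    true_zeros = sum(1 for i in zeros if original[i] == 0)
    prop_small = unchanged / len(small) if small else 0.0
    pct_true_zero = 100.0 * true_zeros / len(zeros) if zeros else 0.0
    return prop_small, pct_true_zero

orig_cells = [0, 1, 2, 5, 0, 3]
rounded_cells = [0, 0, 3, 6, 3, 3]       # e.g. after rounding to base 3
measures = risk_measures(orig_cells, rounded_cells)
# measures == (0.0, 50.0)
```

Here rounding removes both small cells, and only half of the published zeros are true zeros, which illustrates the added ambiguity in zero counts noted in the Summary of Analysis.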
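
**Worked sketch: Cramer's V before and after perturbation**
The utility measure for tests of independence compares Cramer's V on the original and protected tables. The sketch below computes Cramer's V from the Pearson chi-square statistic using its standard definition; the function name `cramers_v` and the toy tables are hypothetical and not taken from the Census table analysed in the presentation.

```python
import math

def cramers_v(table):
    """Cramer's V for a two-way table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / n / k)

perfect = cramers_v([[10, 0], [0, 10]])      # complete association
independent = cramers_v([[5, 5], [5, 5]])    # no association
attenuated = cramers_v([[8, 2], [2, 8]])     # association weakened by moving records
```

A drop in V between the original and the protected table corresponds to the attenuation attributed to record swapping in the Summary of Analysis, while a rise corresponds to the "sharper" counts attributed to rounding.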