Assessing the Impact of SDC Methods on Census Frequency Tables

Natalie Shlomo

Southampton Statistical Sciences Research Institute University of Southampton


Topics:

  • Introduction

  • Disclosure risk

  • SDC methods for protecting Census frequency tables

  • Disclosure risk and data utility measures

  • Description of table

  • Risk-Utility analysis

  • Summary of Analysis

  • Discussion and future work


Introduction

Identification

Individual Attribute Disclosure

  • Disclosure risk in Census tables:

  • Need to protect many tables from one dataset containing population counts which can be linked and differenced

  • Need to consider output strategies for standard tables and web-based table-generating applications

  • Need to interact with users and develop SDC framework with a focus on both disclosure risk and data utility


Disclosure Risk

For Census tables:

  • 1’s and 2’s in cells are disclosive since these cells lead to identification

  • 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

    Consideration of disclosure risk:

  • Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)

  • Proportion of high-risk cells (1 or 2)

  • Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K, where K is the number of cells, if all cells are equal)
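The two entropy bounds can be checked with a small sketch. It assumes Shannon entropy with the natural logarithm computed over the counts in one row; the function name `row_entropy` is illustrative and does not come from the presentation.

```python
import math

def row_entropy(counts):
    """Shannon entropy (natural log) of the distribution of counts in one row."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in probs)

concentrated = row_entropy([12, 0, 0, 0])   # one non-zero cell: minimum
uniform = row_entropy([3, 3, 3, 3])         # K = 4 equal cells: maximum, log K
```

A row with low entropy concentrates its population in few cells, which is the situation where zeros become informative about attributes.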


SDC Methods for Protecting Frequency Tables

  • Pre-tabular methods (special case of PRAM)

    Random Record Swapping

    Targeted Record Swapping

    In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias

    Implementation:

    Randomly select p% of the households

    Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables

    Can target records for swapping that are in high-risk cells of size 1 or 2
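The random swapping steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in any Census: households are plain dictionaries, geography is a single hypothetical `area` field, the matching key is supplied by the caller, and a donor must live in a different area.

```python
import random

def swap_geography(households, rate, key, rng):
    """Swap the 'area' code between randomly selected households and
    donors that match on `key` but live in a different area."""
    swapped = [dict(h) for h in households]           # work on a copy
    n_select = round(rate * len(swapped))
    selected = rng.sample(range(len(swapped)), n_select)
    used = set()                                      # each household swaps at most once
    for i in selected:
        if i in used:
            continue
        donors = [j for j, h in enumerate(swapped)
                  if j not in used and j != i
                  and key(h) == key(swapped[i])
                  and h["area"] != swapped[i]["area"]]
        if not donors:
            continue                                  # no match: leave unswapped
        j = rng.choice(donors)
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
        used.update((i, j))
    return swapped

households = [{"id": n, "area": "A" if n < 5 else "B", "size": 1 + n % 2}
              for n in range(10)]
result = swap_geography(households, 0.2, key=lambda h: h["size"],
                        rng=random.Random(1))
```

Because partners agree on the matching key, counts of area by key variable are unchanged by the swap; only variables outside the key can be distorted. A targeted variant would restrict `selected` to households falling in cells of size 1 or 2.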


SDC Methods for Protecting Frequency Tables

  • Rounding

    Unbiased random rounding

    Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw

    Example: For unbiased random rounding to base 3:

    1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3

    2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3

    Expected value of the perturbation is 0

    Margins and internal cells rounded separately

    Small cell rounding: internal cells aggregated to obtain margins
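The base-3 example generalizes to any base by rounding up with probability equal to the residual divided by the base. The sketch below assumes non-negative integer counts; `random_round` is an illustrative name.

```python
import random

def random_round(count, base, rng):
    """Unbiased random rounding of a non-negative integer count to `base`."""
    residual = count % base
    if residual == 0:
        return count                      # multiples of the base are unchanged
    lower = count - residual
    # Round up with probability residual/base, down otherwise.
    if rng.random() < residual / base:
        return lower + base
    return lower

rng = random.Random(2001)
draws = [random_round(1, 3, rng) for _ in range(30000)]
mean_of_draws = sum(draws) / len(draws)   # close to the true count of 1
```

Each draw is either 0 or 3, but the average over many draws stays close to the original count, which is what unbiasedness means here.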


SDC Methods for Protecting Frequency Tables

  • Rounding (cont.)

    Semi-controlled unbiased random rounding

    Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

    Implementation:

    - Calculate the expected number of entries to round up

    - Draw a simple random sample without replacement (srswor) of that size from among the entries and round those up; round the rest down

    Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

    Eliminates extra variance as a result of the rounding
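A minimal sketch of the semi-controlled idea for a single row follows. It makes two assumptions that the slide does not state: the sample is drawn separately within each group of entries sharing the same residual (so that each entry keeps its unbiased rounding probability), and a fractional expected number is settled by one extra random draw.

```python
import random

def semi_controlled_round(counts, base, rng):
    """Round each count to `base`; within each residual group the number
    rounded up is fixed at (about) its expectation and drawn by srswor."""
    rounded = [c - c % base for c in counts]          # start by rounding down
    for residual in range(1, base):
        group = [i for i, c in enumerate(counts) if c % base == residual]
        expected_up = len(group) * residual / base
        n_up = int(expected_up)
        # If the expectation is fractional, settle the remainder at random.
        if rng.random() < expected_up - n_up:
            n_up += 1
        for i in rng.sample(group, n_up):             # srswor within the group
            rounded[i] += base
    return rounded

row = [1, 1, 1, 2, 2, 2, 6, 0]
rounded_row = semi_controlled_round(row, 3, random.Random(7))
```

In this example the expected numbers are whole (one of the three 1's and two of the three 2's round up), so the rounded row always sums to the original total of 15 regardless of which entries are drawn.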


SDC Methods for Protecting Frequency Tables

  • Rounding (cont.)

    Controlled rounding

    Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)

    - Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding

    - All rounded entries add up to rounded margins

    - Method not unbiased and entries can jump a base


SDC Methods for Protecting Frequency Tables

  • Cell Suppression

Hypercube method (Giessing, 2004)

Feature in Tau-Argus and suited for large tables

Uses a heuristic based on suppressing the corners of a hypercube formed by the primary suppressed cell, subject to optimality conditions

Imputing suppressed cells for utility evaluation:

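Replace each suppressed cell by the average information loss in its row/column: the known margin minus the total of the non-suppressed cells, divided by the number of suppressed cells.

Example: Two suppressed cells in a row and the known margin is 500. The total of the non-suppressed cells is 400. Each suppressed cell is replaced with (500 − 400) / 2 = 50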
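The imputation in the example can be written out as a short sketch. Representing a suppressed cell as `None` is an assumption made for illustration, as is the function name.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the missing total."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

# Known margin 500, non-suppressed cells total 400, two suppressed cells.
imputed = impute_suppressed([150, None, 250, None], margin=500)
```

The imputed row still adds up to the published margin, so utility measures can be computed on it in the same way as for a perturbed table.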


Disclosure Risk Measures

Need to determine output strategies and SDC together

  • Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables

  • Web-based tables and flexible categories and geographies: need to add noise or round for every query

    Disclosure risk measures:

  • Proportion of high-risk cells (C1 and C2) not protected

  • Percent true zeros out of total zeros
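Both risk measures can be computed by comparing the original and protected tables cell by cell. The sketch assumes flattened lists of counts and treats a small cell as "not protected" when its value is unchanged; that reading, and the name `risk_measures`, are assumptions.

```python
def risk_measures(original, protected):
    """Share of original 1's and 2's left unchanged, and share of zeros in
    the protected table that are true zeros in the original."""
    pairs = list(zip(original, protected))
    small = [(o, p) for o, p in pairs if o in (1, 2)]
    zeros_after = [o for o, p in pairs if p == 0]
    unprotected = (sum(1 for o, p in small if p == o) / len(small)
                   if small else 0.0)
    true_zeros = (sum(1 for o in zeros_after if o == 0) / len(zeros_after)
                  if zeros_after else 0.0)
    return unprotected, true_zeros

original = [0, 1, 2, 5, 0, 1, 7, 2]
protected = [0, 0, 3, 6, 0, 1, 6, 2]
unprotected, true_zeros = risk_measures(original, protected)
```

Here two of the four small cells keep their true value, and two of the three zeros in the protected table are genuine, so a lower second figure indicates more ambiguity about the zeros.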


Utility Measures

  • Distance metric - distortion to distributions (Gomatam and Karr, 2003):

  • Internal cells:

  • The metric is defined in terms of the table for each row k, the number of rows, and the cell frequency for each cell c

  • Margins:

  • The metric is defined in terms of the margin M, its number of categories, and the number of persons in each category
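The specific distance metric is not spelled out in the text above. As one possible illustration for internal cells, the sketch below assumes a Hellinger-type distance computed on the raw counts of each row and averaged over rows; both the choice of metric and the function names are assumptions.

```python
import math

def hellinger_row(orig_row, pert_row):
    """Hellinger-type distance between the counts of one row."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig_row, pert_row)))

def average_row_distance(original, perturbed):
    """Average of the per-row distances over all rows of the table."""
    return sum(hellinger_row(o, p)
               for o, p in zip(original, perturbed)) / len(original)

original = [[4, 0], [9, 1]]
perturbed = [[1, 0], [9, 4]]
distance = average_row_distance(original, perturbed)
```

A distance of 0 means the protected table reproduces the original exactly; larger values indicate more distortion to the row distributions. The same pattern applies to a margin by treating it as a single row.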


Utility Measures

  • Impact on tests for independence: Cramér’s V measure of association, $CV = \sqrt{\chi^2 / (n \cdot \min(n_r - 1,\, n_c - 1))}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ the table total, and $n_r$ and $n_c$ the numbers of rows and columns

  • Same utility measure for entropy and the Pearson chi-square statistic

  • Impact on log-linear analysis for multi-dimensional tables, i.e. deviance
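Cramér's V is the square root of the Pearson chi-square statistic divided by the table total times the smaller of (rows − 1) and (columns − 1). A minimal sketch for a two-way table stored as a list of rows follows; the function names are illustrative.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic and table total for a two-way table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    chi2, n = pearson_chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

v_independent = cramers_v([[10, 20], [20, 40]])   # rows proportional: no association
v_perfect = cramers_v([[10, 0], [0, 10]])         # perfect association
```

Computing V on the original and on the protected table and comparing the two shows whether an SDC method pushes the table towards or away from independence.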


Utility Measures

  • “Between” Variance:

  • Define a target proportion for a cell c in row k, and the overall proportion for that cell across all rows of the table

  • The “between” variance measures the spread of the row proportions around the overall proportion, and the utility measure is derived from it
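One way to make the "between" variance concrete is sketched below. It assumes that the target proportion is the cell count divided by its row total, that the variance uses a divisor of (number of rows − 1), and that the utility measure is the ratio of the perturbed to the original value; none of these details is confirmed by the text above.

```python
def between_variance(table, c):
    """Variance of the row proportions for column c around the overall proportion."""
    rows = [row for row in table if sum(row) > 0]
    row_props = [row[c] / sum(row) for row in rows]
    overall = sum(row[c] for row in rows) / sum(sum(row) for row in rows)
    return sum((p - overall) ** 2 for p in row_props) / (len(rows) - 1)

original = [[10, 0], [0, 10]]
perturbed = [[8, 2], [2, 8]]          # proportions pulled towards the overall 0.5
bv_ratio = between_variance(perturbed, 0) / between_variance(original, 0)
```

A ratio below 1 indicates that the row proportions have moved towards the overall proportion, which is the attenuation effect associated with record swapping.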


Utility Measures

  • Variance of Cell Counts:

  • The variance of the cell counts is computed for each row k, using the number of columns in the row

  • These variances are averaged across all rows, and the utility measure is based on that average
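The variance of cell counts can be sketched in the same spirit. The divisor of (number of columns − 1) and the use of a perturbed-to-original ratio as the utility measure are assumptions made for illustration.

```python
def row_variance(row):
    """Variance of the cell counts in one row (divisor: columns - 1)."""
    mean = sum(row) / len(row)
    return sum((x - mean) ** 2 for x in row) / (len(row) - 1)

def average_variance(table):
    """Average of the row variances across all rows."""
    return sum(row_variance(r) for r in table) / len(table)

original = [[1, 2, 3], [2, 2, 2]]
rounded = [[0, 3, 3], [3, 3, 0]]      # e.g. after rounding to base 3
av_ratio = average_variance(rounded) / average_variance(original)
```

A ratio above 1 corresponds to "sharper" cell counts, as after rounding, while a ratio below 1 corresponds to counts "flattening" out.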


Description of Table

  • 2001 UK Census Table:

    Rows: Output Areas (1,487)

    Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)

    Table includes 317,064 persons aged 16-74 in 53,532 internal cells

    Average cell size: 5.92 although table is skewed

    Number of zeros: 17,915 (33.5%)

    Number of small cells: 14,726 (27.5%)


Summary of Analysis

  • Rounding eliminates small cells, but random rounding still needs protection against disclosure by differencing and linking

  • Rounding adds more ambiguity into the zero counts

  • Random rounding to base 5 has greatest impact on distortions to distribution

  • Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells

  • Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding

  • Cell suppression with simple imputation method has highest utility (no perturbation on large cells) but difficult to implement in a Census


Summary of Analysis

  • High percent of true small cells in record swapping and less ambiguity of zero cells

  • Record swapping distorts internal cells less than rounding, and the distortion increases with higher swapping rates

  • Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells

  • Column margins of the table have no distortion because of controls in swapping

  • Combining record swapping with rounding results in more distortion but provides added protection


Summary of Analysis

  • Record swapping across geographies causes attenuation:

    - loss of association (moving towards independence)

    - counts “flattening” out

    - proportions moving to the overall proportion

  • Attenuation increases with higher swapping rates

  • Targeted record swapping has less attenuation than random swapping

  • Rounding introduces more zeros:

    - levels of association are higher

    - cell counts “sharper”

    Effects less severe for controlled rounding

  • Combining record swapping and rounding can cancel out opposing effects, depending on the direction and magnitude of each procedure separately


Discussion

  • Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data

  • Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding

  • Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables

  • Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)

  • Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)



