
Assessing the Impact of SDC Methods on Census Frequency Tables


Natalie Shlomo

Southampton Statistical Sciences Research Institute University of Southampton

- Introduction
- Disclosure risk
- SDC methods for protecting Census frequency tables
- Disclosure risk and data utility measures
- Description of table
- Risk-Utility analysis
- Summary of Analysis
- Discussion and future work

Introduction

Identification

Individual Attribute Disclosure

- Disclosure risk in Census tables:
- Need to protect many tables from one dataset containing population counts which can be linked and differenced
- Need to consider output strategies for standard tables and web based table generating applications
- Need to interact with users and develop SDC framework with a focus on both disclosure risk and data utility

For Census tables:

- 1’s and 2’s in cells are disclosive since these cells lead to identification,
- 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:

- Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
- Proportion of high-risk cells (1 or 2)
- Entropy (minimum of 0 if distribution has one non-zero cell and all others zero, maximum of (log K) if all cells are equal).
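The last two risk measures above can be sketched in a few lines. This is a minimal illustration, assuming a table row is a plain list of counts and that entropy uses the natural logarithm; the function names are hypothetical.

```python
import math

def small_cell_proportion(cells):
    """Share of cells holding a high-risk count of 1 or 2."""
    return sum(1 for c in cells if c in (1, 2)) / len(cells)

def entropy(cells):
    """Shannon entropy (natural log) of the distribution of counts over cells."""
    total = sum(cells)
    return sum(-(c / total) * math.log(c / total) for c in cells if c > 0)

row = [0, 12, 1, 0, 2, 5]
print(small_cell_proportion(row))          # 2 of the 6 cells are 1 or 2
print(entropy([7, 0, 0, 0]))               # one non-zero cell: minimum entropy
print(entropy([5, 5, 5, 5]), math.log(4))  # equal cells: maximum entropy, log K with K = 4
```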

- Pre-tabular methods (special case of PRAM)
Random Record Swapping

Targeted Record Swapping

In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.

Implementation:

Randomly select p% of the households

Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables

Can target records for swapping that are in high-risk cells of size 1 or 2
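The implementation steps above might look as follows for random swapping. This is only a sketch under stated assumptions: households are dictionaries, the record layout (`area`, `size`, `profile`) and the function name are hypothetical, and a single `area` field stands in for all geographical variables. A targeted variant could restrict the selection to households that fall in cells of size 1 or 2.

```python
import random

def random_record_swap(households, rate, key, geo="area", seed=0):
    """Swap the geography of a random share of households with matched donors."""
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]      # work on copies
    n_select = round(rate * len(swapped))        # p% of the households
    used = set()                                 # each household is swapped at most once
    for i in rng.sample(range(len(swapped)), n_select):
        if i in used:
            continue
        # Donors match on the key variables but live in a different area.
        donors = [j for j, h in enumerate(swapped)
                  if j != i and j not in used
                  and key(h) == key(swapped[i])
                  and h[geo] != swapped[i][geo]]
        if not donors:
            continue                             # no match: leave the record unswapped
        j = rng.choice(donors)
        swapped[i][geo], swapped[j][geo] = swapped[j][geo], swapped[i][geo]
        used.update((i, j))
    return swapped

households = [
    {"id": 1, "area": "A", "size": 2, "profile": "2 adults"},
    {"id": 2, "area": "B", "size": 2, "profile": "2 adults"},
    {"id": 3, "area": "A", "size": 4, "profile": "2 adults, 2 children"},
    {"id": 4, "area": "C", "size": 4, "profile": "2 adults, 2 children"},
    {"id": 5, "area": "B", "size": 1, "profile": "1 adult"},
]
result = random_record_swap(households, rate=0.4,
                            key=lambda h: (h["size"], h["profile"]))
```

Because only geography is exchanged between matched households, the number of households per area and the key variables of every record are unchanged.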

- Rounding
Unbiased random rounding

Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw.

Example: For unbiased random rounding to base 3:

1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3

2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3

The expected perturbation from rounding is 0

Margins and internal cells rounded separately

Small cell rounding: internal cells aggregated to obtain margins
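The base-3 example generalises to any base: an entry with residual r rounds up with probability r/base and down otherwise. A minimal sketch, with hypothetical function names, that also computes the expected rounded value to show the unbiasedness:

```python
import random

def random_round(count, base=3, rng=random):
    """Unbiased random rounding of one entry to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count                      # already a multiple of the base
    lower = count - residual
    # Round up with probability residual/base, down otherwise.
    return lower + base if rng.random() < residual / base else lower

def expected_rounded(count, base=3):
    """Expected value of the rounded entry under the scheme above."""
    residual = count % base
    lower = count - residual
    p_up = residual / base
    return (1 - p_up) * lower + p_up * (lower + base)

rng = random.Random(1)
print([random_round(c, rng=rng) for c in [1, 2, 3, 4, 5]])
print(expected_rounded(1), expected_rounded(2))   # equal to the original counts
```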

- Rounding (cont.)
Semi-controlled unbiased random rounding

Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

Implementation:

- Calculate the expected number of entries to round up

- Draw a simple random sample without replacement (srswor) of that size from among the entries and round those up; round the rest down.

Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

Eliminates extra variance as a result of the rounding
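One way to read the two implementation steps is sketched below. It assumes the selection is done separately for each residual class (entries with the same remainder modulo the base), and that a non-integer expected number is itself randomly rounded to an integer; both choices and the function name are assumptions made for illustration.

```python
import random

def semi_controlled_round(counts, base=3, seed=0):
    """Round entries to multiples of `base`, fixing how many are rounded up."""
    rng = random.Random(seed)
    rounded = [c - c % base for c in counts]     # start with everything rounded down
    for residual in range(1, base):
        idx = [i for i, c in enumerate(counts) if c % base == residual]
        expected_up = len(idx) * residual / base # expected number of entries to round up
        n_up = int(expected_up)
        if rng.random() < expected_up - n_up:    # assumption: random rounding of the fraction
            n_up += 1
        for i in rng.sample(idx, n_up):          # srswor: these entries are rounded up
            rounded[i] += base
    return rounded

row = [1, 1, 1, 2, 2, 2, 3]
out = semi_controlled_round(row)
print(out, sum(row), sum(out))
```

In this example the expected numbers are whole (one of the 1's and two of the 2's round up), so the rounded row total equals the original total of 12 whatever the random draw.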

- Rounding (cont.)
Controlled rounding

Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)

- Uses linear programming techniques to round entries up or down, results similar to deterministic rounding

- All rounded entries add up to rounded margins

- Method not unbiased and entries can jump a base

- Cell Suppression

Hypercube method (Giessing, 2004)

Feature in Tau-Argus and suited for large tables

Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions

Imputing suppressed cells for utility evaluation:

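Replace each suppressed cell by the average suppressed amount in its row/column: the known margin minus the total of the non-suppressed cells, divided by the number of suppressed cells.

Example: Two suppressed cells in a row and known margin is 500. The total of non-suppressed cells is 400. Each cell is replaced with a value of 50.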
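The worked example can be reproduced with a short helper. This is a sketch for a single row, assuming suppressed cells are marked with `None`; the function name is hypothetical.

```python
def impute_suppressed(row, margin):
    """Fill suppressed cells (None) with an equal share of the missing total."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

row = [150, None, 250, None]             # non-suppressed cells total 400
print(impute_suppressed(row, 500))       # [150, 50.0, 250, 50.0]
```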

Need to determine output strategies and SDC together

- Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
- Web-based tables and flexible categories and geographies: need to add noise or round for every query

Disclosure risk measures:

- Proportion of high-risk cells (C1 and C2) not protected
- Percent true zeros out of total zeros
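Both risk measures compare the original table with the protected one, cell by cell. The sketch below assumes that a small cell counts as "not protected" when its value is unchanged in the protected table, and that tables are given as flat lists of counts; the function name is hypothetical.

```python
def risk_measures(original, protected):
    """Return (proportion of 1's and 2's left unchanged, percent of true zeros)."""
    pairs = list(zip(original, protected))
    small = [(o, p) for o, p in pairs if o in (1, 2)]
    unprotected = sum(1 for o, p in small if p == o)
    zeros = [(o, p) for o, p in pairs if p == 0]     # zeros in the protected table
    true_zeros = sum(1 for o, p in zeros if o == 0)  # zeros that were zero originally
    prop_unprotected = unprotected / len(small) if small else 0.0
    pct_true_zeros = 100 * true_zeros / len(zeros) if zeros else 0.0
    return prop_unprotected, pct_true_zeros

original = [0, 1, 2, 5, 0, 1]
protected = [0, 0, 3, 6, 0, 1]
print(risk_measures(original, protected))
```

Here one of the three small cells is unchanged, and two of the three zeros in the protected table are true zeros.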

Utility Measures

- Distance metric - distortion to distributions (Gomatam and Karr, 2003):
- Internal cells:
- Let $D^k$ be the table for row $k$, $R$ the number of rows, and $D^k(c)$ the cell frequency for cell $c$,
- Margins:
- Let $M$ be the margin, $K$ the number of categories, and $M(j)$ the number of persons in category $j$:
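Two distance metrics commonly used to compare an original and a perturbed table are the average absolute distance per cell and the Hellinger distance. The sketch below shows both as illustrations only; which metric the analysis uses is an assumption here, and the function names are hypothetical. Tables are lists of rows; a margin can be passed as a single row.

```python
import math

def average_absolute_distance(orig, pert):
    """Mean absolute difference between perturbed and original cell counts."""
    cells = [(o, p) for row_o, row_p in zip(orig, pert)
             for o, p in zip(row_o, row_p)]
    return sum(abs(p - o) for o, p in cells) / len(cells)

def hellinger_distance(orig_row, pert_row):
    """Hellinger-type distance between the counts of one row (or one margin)."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig_row, pert_row)))

orig = [[1, 4], [9, 0]]
pert = [[0, 4], [9, 3]]
print(average_absolute_distance(orig, pert))   # (1 + 0 + 0 + 3) / 4 = 1.0
print(hellinger_distance(orig[0], pert[0]))
```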

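- Utility Measures (cont.)
- Impact on tests for independence: Cramér's V measure of association, $CV = \sqrt{\dfrac{\chi^2 / n}{\min(R-1,\, C-1)}}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $R$ and $C$ are the numbers of rows and columns
- Same utility measure for entropy and the Pearson chi-square statistic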
- Impact on log linear analysis for multi-dimensional tables, i.e. deviance
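Cramér's V can be computed directly from a two-way table of counts. The sketch below uses only the standard library; the relative percent change between the original and perturbed table is shown as one plausible form of the utility measure and is an assumption, as are the function names.

```python
import math

def pearson_chi_square(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramér's V: sqrt((chi2 / n) / min(R - 1, C - 1))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi_square(table) / (n * k))

def relative_change(orig, pert):
    """Assumed utility measure: percent change in Cramér's V after perturbation."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

print(cramers_v([[10, 0], [0, 10]]))   # perfect association
print(cramers_v([[5, 5], [5, 5]]))     # independence
print(relative_change([[8, 2], [2, 8]], [[6, 4], [4, 6]]))
```

A negative relative change indicates that the perturbed table has moved towards independence.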

- Utility Measures (cont.)
- “Between” Variance:
- Let $P^k(c)$ be a target proportion for a cell $c$ in row $k$, and let $P(c)$ be the overall proportion across all rows of the table
- The “between” variance is defined as:
- and the utility measure is:
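A sketch of the "between" variance for one target column follows. It assumes the row proportion is the cell count divided by the row total, that the variance is unweighted with divisor (number of rows minus 1), and that the utility measure is the ratio of the perturbed to the original between variance. These forms and the function names are assumptions made for illustration.

```python
def between_variance(table, c):
    """Variance of the row proportions for column c around the overall proportion."""
    rows = [row for row in table if sum(row) > 0]
    row_props = [row[c] / sum(row) for row in rows]
    overall = sum(row[c] for row in rows) / sum(sum(row) for row in rows)
    return sum((p - overall) ** 2 for p in row_props) / (len(row_props) - 1)

def between_variance_ratio(orig, pert, c):
    """Assumed utility measure: perturbed over original between variance."""
    return between_variance(pert, c) / between_variance(orig, c)

orig = [[8, 2], [2, 8]]      # row proportions 0.8 and 0.2
pert = [[6, 4], [4, 6]]      # row proportions 0.6 and 0.4
print(between_variance(orig, 0), between_variance(pert, 0))
print(between_variance_ratio(orig, pert, 0))
```

A ratio below 1 means the row proportions have moved towards the overall proportion.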

- Utility Measures (cont.)
- Variance of Cell Counts:
- The variance of the cell count for row k:

where $C$ is the number of columns

The average variance across all rows:

The utility measure is:
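The three steps above (variance per row, average over rows, utility measure) can be sketched as follows. The sample variance with divisor (number of columns minus 1) and the ratio of perturbed to original average variance are assumptions, as are the function names.

```python
def row_variance(row):
    """Sample variance of the cell counts in one row."""
    mean = sum(row) / len(row)
    return sum((x - mean) ** 2 for x in row) / (len(row) - 1)

def average_row_variance(table):
    """Average of the row variances across all rows."""
    return sum(row_variance(row) for row in table) / len(table)

def variance_ratio(orig, pert):
    """Assumed utility measure: perturbed over original average variance."""
    return average_row_variance(pert) / average_row_variance(orig)

orig = [[0, 6, 0], [3, 3, 3]]
pert = [[3, 3, 0], [3, 3, 3]]
print(average_row_variance(orig))   # (12 + 0) / 2 = 6.0
print(average_row_variance(pert))   # (3 + 0) / 2 = 1.5
print(variance_ratio(orig, pert))   # 0.25
```

A ratio below 1 corresponds to counts flattening out; a ratio above 1 corresponds to sharper counts.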

- 2001 UK Census Table:
Rows: Output Areas (1,487)

Columns: Economic Activity (9) * Sex (2) * Long-Term Illness (2)

Table includes 317,064 persons aged 16-74 in 53,532 internal cells

Average cell size: 5.92 although table is skewed

Number of zeros: 17,915 (33.5%)

Number of small cells: 14,726 (27.5%)
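The summary figures for the table are consistent with each other, as a quick check of the arithmetic shows:

```python
rows = 1487                      # Output Areas
columns = 9 * 2 * 2              # Economic Activity * Sex * Long-Term Illness
cells = rows * columns
persons = 317064

print(cells)                               # 53532
print(round(persons / cells, 2))           # 5.92  (average cell size)
print(round(100 * 17915 / cells, 1))       # 33.5  (percent zeros)
print(round(100 * 14726 / cells, 1))       # 27.5  (percent small cells)
```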

- Rounding eliminates small cells, but with random rounding there is still a need to protect against disclosure by differencing and linking
- Rounding adds more ambiguity into the zero counts
- Random rounding to base 5 has greatest impact on distortions to distribution
- Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
- Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
- Cell suppression with the simple imputation method has the highest utility (no perturbation of large cells) but is difficult to implement in a Census

- High percent of true small cells in record swapping and less ambiguity of zero cells
- Record swapping distorts internal cells less than rounding; the distortion increases with higher swapping rates
- Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
- Column margins of the table have no distortion because of controls in swapping
- Combining record swapping with rounding results in more distortion but provides added protection

- Record swapping across geographies attenuates:
- loss of association (moving towards independence) - counts “flattening” out

- proportions moving to the overall proportion

- Attenuation increases with higher swapping rates
- Targeted record swapping has less attenuation than random swapping
- Rounding introduces more zeros:
- levels of association are higher

- cell counts “sharper”

Effects less severe for controlled rounding

- Combining record swapping and rounding cancels out opposing effects, depending on the direction and magnitude of each procedure separately

- Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data
- Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. the ABS developed microdata keys for consistency in rounding
- Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
- Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)
- Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)