Assessing the Impact of SDC Methods on Census Frequency Tables

Natalie Shlomo

Southampton Statistical Sciences Research Institute University of Southampton

- Introduction
- Disclosure risk
- SDC methods for protecting Census frequency tables
- Disclosure risk and data utility measures
- Description of table
- Risk-Utility analysis
- Summary of Analysis
- Discussion and future work

Introduction

Identification

Individual Attribute Disclosure

- Disclosure risk in Census tables:
- Need to protect many tables from one dataset containing population counts which can be linked and differenced
- Need to consider output strategies for standard tables and web-based table-generating applications
- Need to interact with users and develop SDC framework with a focus on both disclosure risk and data utility

For Census tables:

- 1’s and 2’s in cells are disclosive since these cells lead to identification
- 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:

- Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
- Proportion of high-risk cells (1 or 2)
- Entropy (minimum of 0 if distribution has one non-zero cell and all others zero, maximum of (log K) if all cells are equal).
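As a minimal sketch of two of these risk indicators, the Python below computes the proportion of cells equal to 1 or 2 and the entropy of one row of counts. The function names are hypothetical, and the natural logarithm is an assumption (the base is not stated here).

```python
import math

def proportion_small_cells(counts):
    """Share of cells whose count is 1 or 2 (the high-risk cells)."""
    return sum(1 for c in counts if c in (1, 2)) / len(counts)

def entropy(counts):
    """Entropy of the distribution implied by the counts (natural log assumed)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # zero cells contribute nothing
    return sum(p * math.log(1 / p) for p in probs)

one_nonzero_cell = [7, 0, 0, 0]   # minimum entropy: 0
all_cells_equal = [5, 5, 5, 5]    # maximum entropy: log K with K = 4

print(entropy(one_nonzero_cell))
print(entropy(all_cells_equal), math.log(4))
print(proportion_small_cells([1, 2, 0, 5]))
```

The two rows reproduce the bounds quoted in the bullet: 0 for a single non-zero cell and log K when all K cells are equal.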

- Pre-tabular methods (special case of PRAM)

Random record swapping

Targeted record swapping

In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.

Implementation:

- Randomly select p% of the households
- Draw a household matching on a set of key variables (e.g. household size and broad sex-age distribution) and swap all geographical variables
- Can target records for swapping that are in high-risk cells of size 1 or 2
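The swapping steps above can be sketched in Python as follows. This is an illustration under assumptions, not the production Census procedure: the record layout (`area`, `size`, `age_sex`, `econ`), the function name and the rule of leaving a household unswapped when no match exists are all hypothetical.

```python
import random

def random_record_swap(households, rate, key_vars=("size", "age_sex"), seed=0):
    """Swap the 'area' of about `rate` of the households with a matched partner."""
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]           # work on a copy
    n_select = round(rate * len(swapped))
    selected = rng.sample(range(len(swapped)), n_select)
    used = set()                                      # households already swapped
    for i in selected:
        if i in used:
            continue
        key = tuple(swapped[i][v] for v in key_vars)
        partners = [
            j for j, h in enumerate(swapped)
            if j != i and j not in used
            and tuple(h[v] for v in key_vars) == key  # match on key variables
            and h["area"] != swapped[i]["area"]       # must come from another area
        ]
        if not partners:
            continue                                  # assumption: no match, no swap
        j = rng.choice(partners)
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
        used.update((i, j))
    return swapped

households = [
    {"id": 1, "area": "A", "size": 2, "age_sex": "x", "econ": "employed"},
    {"id": 2, "area": "B", "size": 2, "age_sex": "x", "econ": "retired"},
    {"id": 3, "area": "A", "size": 1, "age_sex": "y", "econ": "student"},
    {"id": 4, "area": "B", "size": 1, "age_sex": "y", "econ": "employed"},
]
swapped = random_record_swap(households, rate=0.5)
for h in swapped:
    print(h)
```

Because partners match on the key variables, the counts by area and key variables are unchanged; only the non-key variables (here `econ`) move between areas. A targeted variant would restrict `selected` to households falling in cells of size 1 or 2.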

- Rounding

Unbiased random rounding

Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw.

Example: For unbiased random rounding to base 3:

1 → 0 w.p. 2/3; 1 → 3 w.p. 1/3

2 → 0 w.p. 1/3; 2 → 3 w.p. 2/3

Expectation of the rounding perturbation is 0

Margins and internal cells rounded separately

Small cell rounding: internal cells aggregated to obtain margins
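The base-3 example generalizes to any base: a count with residual r is rounded up with probability r/base. A minimal Python sketch (the function name and the `rng` parameter are hypothetical):

```python
import random

def random_round(count, base=3, rng=random):
    """Unbiased random rounding of one count to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count                     # multiples of the base are left unchanged
    lower = count - residual
    # Round up with probability residual/base, so that E[rounded] = count.
    return lower + base if rng.random() < residual / base else lower

rng = random.Random(1)
draws = [random_round(1, base=3, rng=rng) for _ in range(30000)]
print(sorted(set(draws)))                # the only possible outcomes for a count of 1
print(sum(draws) / len(draws))           # close to 1: the perturbation averages out
```

For a count of 1 the expected rounded value is 3 × 1/3 + 0 × 2/3 = 1, which is why the method is called unbiased.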

- Rounding (cont.)

Semi-controlled unbiased random rounding

Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

Implementation:

- Calculate the expected number of entries to round up

- Draw a simple random sample without replacement (srswor) of that size from among the entries and round those up; round the rest down.

Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

Eliminates extra variance as a result of the rounding
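A minimal sketch of the semi-controlled scheme, assuming the selection is done separately within each residual class (entries with the same remainder modulo the base) and that the expected number of round-ups is rounded to the nearest integer. Both choices are assumptions for illustration, as is the function name.

```python
import random

def semi_controlled_round(counts, base=3, seed=0):
    """Round counts to `base`, fixing how many entries are rounded up."""
    rng = random.Random(seed)
    rounded = [c - c % base for c in counts]       # start with everything rounded down
    for residual in range(1, base):
        idx = [i for i, c in enumerate(counts) if c % base == residual]
        # Expected number of round-ups in this class, as an integer.
        n_up = round(len(idx) * residual / base)
        for i in rng.sample(idx, n_up):            # srswor: these entries round up
            rounded[i] += base
    return rounded

row = [1, 1, 1, 2, 2, 2, 3, 6]
result = semi_controlled_round(row)
print(result)
print(sum(row), sum(result))
```

In this row the class sizes divide evenly by the base, so the rounded total equals the original total of 18. Applying the function row by row is one way to keep totals consistent on that dimension.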

- Rounding (cont.)

Controlled rounding

Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)

- Uses linear programming techniques to round entries up or down, results similar to deterministic rounding

- All rounded entries add up to rounded margins

- Method not unbiased and entries can jump a base

- Cell Suppression

Hypercube method (Giessing, 2004)

Feature in Tau-Argus and suited for large tables

Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions

Imputing suppressed cells for utility evaluation:

Replace suppressed cell by the average information loss in each row/column.

Example: Two suppressed cells in a row and the known margin is 500. The total of the non-suppressed cells is 400. Each suppressed cell is replaced with a value of (500 − 400) / 2 = 50.
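The imputation in the example can be written as a short Python sketch. Suppressed cells are represented by `None`, which is an assumed convention, and the function name is hypothetical.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the unexplained margin."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

print(impute_suppressed([250, None, 150, None], margin=500))
# [250, 50.0, 150, 50.0]
```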

Need to determine output strategies and SDC together

- Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
- Web-based tables and flexible categories and geographies: need to add noise or round for every query

Disclosure risk measures:

- Proportion of high-risk cells (C1 and C2) not protected
- Percent true zeros out of total zeros
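One possible reading of these two measures, sketched in Python. It assumes a small cell is "not protected" when its published value equals its original value, and that "total zeros" means the zeros in the protected table. Both interpretations and the function name are assumptions.

```python
def risk_measures(original, protected):
    """Disclosure risk measures for two flattened tables of equal length."""
    pairs = list(zip(original, protected))
    small = [(o, p) for o, p in pairs if o in (1, 2)]
    unprotected = sum(1 for o, p in small if p == o)
    published_zeros = [(o, p) for o, p in pairs if p == 0]
    true_zeros = sum(1 for o, p in published_zeros if o == 0)
    return {
        "prop_small_unprotected": unprotected / len(small) if small else 0.0,
        "pct_true_zeros": 100 * true_zeros / len(published_zeros) if published_zeros else 0.0,
    }

original = [0, 1, 2, 5, 0, 1]
rounded = [0, 0, 3, 6, 0, 0]     # one possible outcome of rounding to base 3
print(risk_measures(original, rounded))
```

Here rounding leaves no small cell unchanged, and only half of the published zeros are true zeros.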

Utility Measures

- Distance metric - distortion to distributions (Gomatam and Karr, 2003):
- Internal cells:
- Let $D^k$ be the table for row $k$, $n_R$ the number of rows, and $D^k(c)$ the cell frequency for cell $c$.
- Margins:
- Let $M$ be the margin, $n_M$ the number of categories, and $M(c)$ the number of persons in category $c$.
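As an illustration of a distance metric of this kind, the Python sketch below uses the Hellinger distance per row, averaged over rows, for internal cells, and the average absolute distance per category for a margin. Choosing these two particular metrics is an assumption made for the example, and the function names are hypothetical.

```python
import math

def hellinger(orig_row, pert_row):
    """Hellinger distance between the original and perturbed counts of one row."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig_row, pert_row)))

def mean_hellinger(orig_table, pert_table):
    """Average of the row-wise Hellinger distances over all rows."""
    return sum(hellinger(o, p) for o, p in zip(orig_table, pert_table)) / len(orig_table)

def average_absolute_distance(orig_margin, pert_margin):
    """Mean absolute change per margin category."""
    return sum(abs(p - o) for o, p in zip(orig_margin, pert_margin)) / len(orig_margin)

orig = [[1, 2, 9], [4, 0, 5]]
pert = [[0, 3, 9], [3, 0, 6]]    # hypothetical protected version of the table
print(mean_hellinger(orig, pert))
print(average_absolute_distance([12, 9], [12, 9]))
```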

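- Utility Measures (cont.)
- Impact on tests for independence: Cramér’s V measure of association, $CV = \sqrt{\chi^2 / (n \min(R - 1, C - 1))}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $R$ and $C$ are the numbers of rows and columns
- Same utility measure for entropy and the Pearson chi-square statistic
- Impact on log-linear analysis for multi-dimensional tables, i.e. deviance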
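Cramér's V for a two-way table can be computed with the standard formula, the square root of chi-square over n × min(rows − 1, columns − 1). A minimal Python sketch with hypothetical function names:

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n    # expected count under independence
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return chi2

def cramers_v(table):
    """Cramér's V: 0 under independence, 1 under perfect association."""
    n = sum(sum(r) for r in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(pearson_chi2(table) / (n * k))

print(cramers_v([[10, 0], [0, 10]]))   # perfect association
print(cramers_v([[5, 5], [5, 5]]))     # independence
```

Comparing the value on the protected table with the value on the original table shows whether an SDC method has weakened or strengthened the apparent association.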

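- Utility Measures (cont.)
- “Between” variance:
- Let $P^k(c) = D^k(c) / \sum_{c'} D^k(c')$ be a target proportion for a cell $c$ in row $k$, where $D^k(c)$ is the cell frequency,
- and let $P(c) = \sum_k D^k(c) / \sum_k \sum_{c'} D^k(c')$ be the overall proportion across all rows of the table
- The “between” variance is defined as: $BV(P(c)) = \frac{1}{n_R - 1} \sum_k (P^k(c) - P(c))^2$, where $n_R$ is the number of rows
- and the utility measure compares the “between” variance of the protected table with that of the original table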
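A minimal Python sketch of the “between” variance for one column of a table, assuming a divisor of (number of rows − 1) and no empty rows. The ratio printed at the end is one illustrative way to compare a protected table with the original, not necessarily the measure used in the analysis.

```python
def between_variance(table, c):
    """Variance of the row proportions for column c around the overall proportion."""
    row_props = [row[c] / sum(row) for row in table]            # assumes no empty rows
    overall = sum(row[c] for row in table) / sum(sum(row) for row in table)
    return sum((p - overall) ** 2 for p in row_props) / (len(table) - 1)

original = [[8, 2], [2, 8]]
swapped = [[6, 4], [4, 6]]    # hypothetical result of swapping records across rows

bv_orig = between_variance(original, 0)
bv_swap = between_variance(swapped, 0)
print(bv_orig, bv_swap)
print(bv_swap / bv_orig)      # below 1: row proportions moved towards the overall proportion
```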

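- Utility Measures (cont.)
- Variance of cell counts:
- The variance of the cell count for row $k$: $V(D^k) = \frac{1}{n_k - 1} \sum_c (D^k(c) - \bar{D}^k)^2$, where $D^k(c)$ is the frequency of cell $c$ in row $k$, $\bar{D}^k$ is the mean cell count in the row, and $n_k$ is the number of columns

The average variance across all rows: $\bar{V} = \frac{1}{n_R} \sum_k V(D^k)$, where $n_R$ is the number of rows

The utility measure compares the average variance of the protected table with that of the original table.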
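The variance of cell counts can be sketched in Python as below, assuming the usual sample variance with divisor (number of columns − 1). The function names and the example rows are hypothetical.

```python
def row_variance(row):
    """Sample variance of the cell counts in one row."""
    n_k = len(row)                        # number of columns
    mean = sum(row) / n_k
    return sum((x - mean) ** 2 for x in row) / (n_k - 1)

def average_row_variance(table):
    """Average of the row variances across all rows."""
    return sum(row_variance(r) for r in table) / len(table)

original_row = [1, 2, 3]
rounded_row = [0, 3, 3]                   # one possible outcome of rounding to base 3
print(row_variance(original_row))         # 1.0
print(row_variance(rounded_row))          # 3.0
print(average_row_variance([original_row, rounded_row]))  # 2.0
```

In this toy row, rounding raises the variance from 1.0 to 3.0, which is the kind of change the measure is meant to pick up.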

- 2001 UK Census Table:

Rows: Output Areas (1,487)

Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)

Table includes 317,064 persons aged 16-74 in 53,532 internal cells

Average cell size: 5.92 although table is skewed

Number of zeros: 17,915 (33.5%)

Number of small cells: 14,726 (27.5%)

- Rounding eliminates small cells, but random rounding still needs protection against disclosure by differencing and linking
- Rounding adds more ambiguity into the zero counts
- Random rounding to base 5 has the greatest impact on distortions to distributions
- Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells
- Fully controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
- Cell suppression with a simple imputation method has the highest utility (no perturbation on large cells) but is difficult to implement in a Census

- Record swapping leaves a high percentage of true small cells and less ambiguity in zero cells
- Record swapping has less distortion to internal cells than rounding; the distortion increases with higher swapping rates
- Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
- Column margins of the table have no distortion because of controls in swapping
- Combining record swapping with rounding results in more distortion but provides added protection

- Record swapping across geographies attenuates:
- loss of association (moving towards independence)

- counts “flattening” out

- proportions moving to the overall proportion

- Attenuation increases with higher swapping rates
- Targeted record swapping has less attenuation than random swapping
- Rounding introduces more zeros:
- levels of association are higher

- cell counts “sharper”

- effects less severe for controlled rounding

- Combining record swapping and rounding can cancel out opposing effects, depending on the direction and magnitude of each procedure separately

- Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data
- Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. the ABS developed microdata keys for consistency in rounding
- Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
- Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)
- Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)