
# Assessing the Impact of SDC Methods on Census Frequency Tables

Assessing the Impact of SDC Methods on Census Frequency Tables. Natalie Shlomo, Southampton Statistical Sciences Research Institute, University of Southampton. Topics: Introduction; Disclosure risk; SDC methods for protecting Census frequency tables


Natalie Shlomo

Southampton Statistical Sciences Research Institute University of Southampton

Topics:

• Introduction

• Disclosure risk

• SDC methods for protecting Census frequency tables

• Disclosure risk and data utility measures

• Description of table

• Risk-Utility analysis

• Summary of Analysis

• Discussion and future work

Introduction

• Disclosure risk in Census tables: identification and individual attribute disclosure

• Need to protect many tables from one dataset containing population counts which can be linked and differenced

• Need to consider output strategies for standard tables and web based table generating applications

• Need to interact with users and develop SDC framework with a focus on both disclosure risk and data utility

Disclosure Risk

For Census tables:

• 1’s and 2’s in cells are disclosive since these cells lead to identification

• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)

Consideration of disclosure risk:

• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)

• Proportion of high-risk cells (1 or 2)

• Entropy (minimum of 0 if distribution has one non-zero cell and all others zero, maximum of (log K) if all cells are equal).
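The last two risk indicators above can be computed directly from a list of cell counts. The following is a minimal sketch, assuming natural-log Shannon entropy over the normalized cell distribution; the function names are illustrative and do not come from the presentation.

```python
import math

def proportion_small_cells(cells):
    """Share of cells holding a count of 1 or 2 (the high-risk cells)."""
    return sum(1 for c in cells if c in (1, 2)) / len(cells)

def entropy(cells):
    """Shannon entropy (natural log) of the normalized cell distribution."""
    total = sum(cells)
    probs = [c / total for c in cells if c > 0]
    return sum(-p * math.log(p) for p in probs)

one_cell = [0, 0, 7, 0]   # a single non-zero cell: entropy is 0
flat = [5, 5, 5, 5]       # K = 4 equal cells: entropy is log(4)
print(entropy(one_cell), entropy(flat), math.log(4))
print(proportion_small_cells([0, 1, 2, 5]))  # 0.5
```

A row whose entropy is close to 0 concentrates its counts in very few cells, which is the situation in which zeros become disclosive.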

1. Pre-tabular methods (special case of PRAM)

Random Record Swapping

Targeted Record Swapping

In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias.

Implementation:

Randomly select p% of the households

Draw a household matching on a set of key variables (i.e. household size and broad sex-age distribution) and swap all geographical variables

Records in high-risk cells of size 1 or 2 can be targeted for swapping
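The swapping steps can be sketched as follows. This is an illustrative sketch, not the implementation used in the study: the household fields (`area`, `size`, `profile`), the function name and the rule of skipping households without a match are all assumptions made for the example.

```python
import random

def swap_geography(households, rate, seed=0):
    """Swap the area code between households that match on key variables."""
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]        # work on a copy
    n_select = round(rate * len(swapped))          # p% of the households
    used = set()
    for i in rng.sample(range(len(swapped)), n_select):
        if i in used:
            continue
        key = (swapped[i]["size"], swapped[i]["profile"])
        partners = [j for j, h in enumerate(swapped)
                    if j != i and j not in used
                    and (h["size"], h["profile"]) == key
                    and h["area"] != swapped[i]["area"]]
        if not partners:
            continue                               # no match: leave unswapped
        j = rng.choice(partners)
        swapped[i]["area"], swapped[j]["area"] = swapped[j]["area"], swapped[i]["area"]
        used.update((i, j))
    return swapped

households = [
    {"id": 1, "area": "A", "size": 2, "profile": "adults"},
    {"id": 2, "area": "B", "size": 2, "profile": "adults"},
    {"id": 3, "area": "A", "size": 4, "profile": "family"},
    {"id": 4, "area": "C", "size": 4, "profile": "family"},
    {"id": 5, "area": "B", "size": 1, "profile": "single"},
]
result = swap_geography(households, rate=0.4)
```

Because partners match on the key variables, the number of households of each type per area is unchanged. A targeted variant would draw the initial selection only from households falling in cells of size 1 or 2.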

2. Rounding

Unbiased random rounding

Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw

Example: For unbiased random rounding to base 3:

1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3

2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3

Expectation of the rounding perturbation is 0

Margins and internal cells rounded separately

Small cell rounding: internal cells aggregated to obtain margins
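The base-3 example generalizes to any rounding base: an entry with residual r is rounded up with probability r / base and down otherwise, so its expected value equals the original count. A minimal sketch, with an illustrative function name:

```python
import random

def random_round(count, base=3, rng=random):
    """Unbiased random rounding of one count to a multiple of `base`."""
    residual = count % base
    if residual == 0:
        return count                      # already a multiple of the base
    lower = count - residual
    # Round up with probability residual / base, down otherwise.
    return lower + base if rng.random() < residual / base else lower

rng = random.Random(1)
draws = [random_round(1, 3, rng) for _ in range(30000)]
print(sum(draws) / len(draws))            # close to 1, the original count
```

Applying `random_round` independently to every internal cell and margin corresponds to rounding them separately, so rounded internal cells need not add up to the rounded margins.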

2. Rounding (cont.)

Semi-controlled unbiased random rounding

Control the selection strategy for entries to round, i.e. use a “without replacement” strategy

Implementation:

- Calculate the expected number of entries to round up

- Draw a simple random sample without replacement (SRSWOR) of that size from among the entries and round those up; round the rest down.

Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)

Eliminates extra variance as a result of the rounding
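A sketch of the semi-controlled selection is given below. It assumes that the expected number of round-ups is computed separately for each residual class, so that every entry keeps a rounding-up probability close to residual / base; the presentation does not spell out this detail, and the function name is illustrative.

```python
import random

def semi_controlled_round(counts, base=3, seed=0):
    """Round counts to multiples of `base`, fixing how many entries go up."""
    rng = random.Random(seed)
    rounded = [c - c % base for c in counts]      # start with everything rounded down
    for residual in range(1, base):
        group = [i for i, c in enumerate(counts) if c % base == residual]
        # Expected number of round-ups in this group under unbiased random rounding.
        n_up = round(len(group) * residual / base)
        for i in rng.sample(group, n_up):         # SRSWOR within the group
            rounded[i] += base
    return rounded

counts = [1, 1, 1, 2, 2, 2, 3, 0, 4]
rounded = semi_controlled_round(counts)
print(sum(counts), sum(rounded))                  # 16 15
```

Since the number of round-ups is fixed in advance, the rounded total no longer varies from draw to draw; running the function row by row keeps the row totals close to the originals.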

2. Rounding (cont.)

Controlled rounding

Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)

- Uses linear programming techniques to round entries up or down, results similar to deterministic rounding

- All rounded entries add up to rounded margins

- Method not unbiased and entries can jump a base

3. Cell Suppression

Hypercube method (Giessing, 2004)

Feature in Tau-Argus and suited for large tables

Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions

Imputing suppressed cells for utility evaluation:

Replace suppressed cell by the average information loss in each row/column.

Example: Two suppressed cells in a row and known margin is 500. The total of non-suppressed cells is 400. Each suppressed cell is replaced with a value of (500 − 400) / 2 = 50
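The imputation rule in the example can be written as a small function. This is a sketch for a single row, assuming suppressed cells are marked with `None`; the function name is illustrative.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) with an equal share of the missing total."""
    known = sum(v for v in row if v is not None)
    n_suppressed = sum(1 for v in row if v is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known) / n_suppressed
    return [fill if v is None else v for v in row]

print(impute_suppressed([150, None, 250, None], 500))  # [150, 50.0, 250, 50.0]
```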

Disclosure Risk Measures

Need to determine output strategies and SDC together

• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables

• Web-based tables and flexible categories and geographies: need to add noise or round for every query

Disclosure risk measures:

• Proportion of high-risk cells (C1 and C2) not protected

• Percent true zeros out of total zeros
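Both risk measures compare the original table with its protected version cell by cell. In the sketch below, a small cell counts as "not protected" when its published value equals its original value of 1 or 2; that reading, and the function name, are assumptions made for the example.

```python
def risk_measures(original, protected):
    """Cell-level disclosure risk measures for a protected table."""
    pairs = list(zip(original, protected))
    small = [(o, p) for o, p in pairs if o in (1, 2)]
    unchanged_small = sum(1 for o, p in small if o == p)
    published_zeros = [(o, p) for o, p in pairs if p == 0]
    true_zeros = sum(1 for o, p in published_zeros if o == 0)
    return {
        "small_unprotected": unchanged_small / len(small) if small else 0.0,
        "pct_true_zeros": 100 * true_zeros / len(published_zeros) if published_zeros else 0.0,
    }

original = [0, 1, 2, 5, 0, 1]
protected = [0, 0, 3, 6, 0, 1]   # hypothetical protected counts
measures = risk_measures(original, protected)
print(measures)
```

A lower share of true zeros means more ambiguity about whether a published zero is a genuine zero.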

Utility Measures

• Distance metric - distortion to distributions (Gomatam and Karr, 2003):

• Internal cells: computed from the table for each row k, the number of rows, and the cell frequency for each cell c

• Margins: computed from the margin M, the number of categories, and the number of persons in each category
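The text above does not say which distance is applied. Purely as an illustrative assumption, the sketch below uses the Hellinger distance between the original and perturbed distribution of each row, averaged over the rows; function names are hypothetical and rows are assumed to have non-zero totals.

```python
import math

def hellinger(orig_row, pert_row):
    """Hellinger distance between two rows after normalizing each to sum to 1."""
    o_tot, p_tot = sum(orig_row), sum(pert_row)
    return math.sqrt(0.5 * sum(
        (math.sqrt(o / o_tot) - math.sqrt(p / p_tot)) ** 2
        for o, p in zip(orig_row, pert_row)))

def mean_row_distance(orig, pert):
    """Average of the row-level distances over all rows of the table."""
    return sum(hellinger(o, p) for o, p in zip(orig, pert)) / len(orig)

orig = [[4, 1, 0], [2, 2, 2]]
pert = [[3, 0, 0], [3, 3, 0]]    # hypothetical rounded version
print(mean_row_distance(orig, pert))
```

The same function can be applied to a margin by passing the vector of category counts as a single row.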

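Utility Measures (cont.)

• Impact on Tests for Independence: Cramér’s V measure of association, $CV = \sqrt{\chi^2 / (n \min(r-1, c-1))}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ the total count, and $r$ and $c$ the numbers of rows and columns

• Same utility measure for entropy and the Pearson chi-square statistics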

• Impact on log linear analysis for multi-dimensional tables, i.e. deviance
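Cramér's V can be computed from a two-way table with the standard Pearson chi-square statistic. The sketch below assumes the table has no empty row or column; comparing the value before and after protection indicates how much association has been lost or gained.

```python
import math

def cramers_v(table):
    """Cramér's V for a two-way frequency table given as a list of rows."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min(len(row_tot) - 1, len(col_tot) - 1)))

print(cramers_v([[10, 0], [0, 10]]))   # 1.0, perfect association
print(cramers_v([[5, 5], [5, 5]]))     # 0.0, independence
```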

Utility Measures (cont.)

• “Between” Variance:

• Defined from the target proportion for a cell c in row k and the overall proportion across all rows of the table

• The utility measure is derived from the “between” variance of these proportions
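One plausible way to formalize this, stated as an assumption rather than as the definition used in the presentation, is the variance of the row-level proportions around the overall proportion with divisor (number of rows − 1), with the ratio of the perturbed to the original value as the utility measure.

```python
def between_variance(table, c):
    """Variance of the row proportions of column c around the overall proportion."""
    props = [row[c] / sum(row) for row in table]                 # proportion in each row k
    overall = sum(row[c] for row in table) / sum(map(sum, table))
    return sum((p - overall) ** 2 for p in props) / (len(table) - 1)

def between_variance_ratio(orig, pert, c):
    """Assumed utility measure: perturbed over original 'between' variance."""
    return between_variance(pert, c) / between_variance(orig, c)

print(between_variance([[10, 0], [0, 10]], 0))   # 0.5
print(between_variance([[5, 5], [5, 5]], 0))     # 0.0
```

A ratio below 1 would indicate that row proportions have moved towards the overall proportion.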

Utility Measures (cont.)

• Variance of Cell Counts:

• The variance of the cell count is computed for each row k across the columns of the table

• The utility measure is based on the average variance across all rows
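As a hedged sketch, the row-level variance can be taken as the sample variance of the counts across the columns, averaged over the rows, with the ratio of perturbed to original average variance as the utility measure. The divisor (number of columns − 1) and the ratio form are assumptions; the function names are illustrative.

```python
from statistics import variance

def average_row_variance(table):
    """Mean over rows of the sample variance of the cell counts in each row."""
    return sum(variance(row) for row in table) / len(table)

def variance_ratio(orig, pert):
    """Assumed utility measure: perturbed over original average variance."""
    return average_row_variance(pert) / average_row_variance(orig)

orig = [[1, 2, 3], [2, 2, 2]]
pert = [[0, 3, 3], [3, 3, 3]]      # hypothetical rounded version
print(variance_ratio(orig, pert))
```

A ratio above 1 corresponds to counts becoming "sharper", a ratio below 1 to counts flattening out.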

Description of Table

• 2001 UK Census Table:

Rows: Output Areas (1,487)

Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)

Table includes 317,064 persons aged 16-74 in 53,532 internal cells

Average cell size: 5.92 although table is skewed

Number of zeros: 17,915 (33.5%)

Number of small cells: 14,726 (27.5%)
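The figures in this description are mutually consistent, as a quick arithmetic check shows (the variable names are illustrative):

```python
rows = 1487                      # Output Areas
cols = 9 * 2 * 2                 # Economic Activity x Sex x Long-Term Illness
cells = rows * cols              # internal cells

avg_cell_size = 317064 / cells
pct_zeros = 100 * 17915 / cells
pct_small = 100 * 14726 / cells

print(cells)                     # 53532
print(round(avg_cell_size, 2))   # 5.92
print(round(pct_zeros, 1))       # 33.5
print(round(pct_small, 1))       # 27.5
```

Zeros and small cells together account for about 61% of the internal cells, which is consistent with the table being described as skewed.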

Summary of Analysis

• Rounding eliminates small cells but need to protect against disclosure by differencing and linking when using random rounding

• Rounding adds more ambiguity into the zero counts

• Random rounding to base 5 has greatest impact on distortions to distribution

• Semi-controlled rounding has almost no effect on distortions to internal cells but has less distortion on marginal cells

• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding

• Cell suppression with simple imputation method has highest utility (no perturbation on large cells) but difficult to implement in a Census

Summary of Analysis (cont.)

• High percent of true small cells in record swapping and less ambiguity of zero cells

• Record swapping has less distortion to internal cells than rounding, although the distortion increases with higher swapping rates

• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells

• Column margins of the table have no distortion because of controls in swapping

• Combining record swapping with rounding results in more distortion but provides added protection

Summary of Analysis (cont.)

• Record swapping across geographies attenuates:

- loss of association (moving towards independence)

- counts “flattening” out

- proportions moving to the overall proportion

• Attenuation increases with higher swapping rates

• Targeted record swapping has less attenuation than random swapping

• Rounding introduces more zeros:

- levels of association are higher

- cell counts “sharper”

Effects less severe for controlled rounding

• Combining record swapping and rounding can cancel out opposing effects, depending on the direction and magnitude of each procedure separately

Discussion

• Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data

• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. the ABS developed microdata keys for consistency in rounding

• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables

• Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)

• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)