Official Statistics and Confidentiality - PowerPoint PPT Presentation

Presentation Transcript

  1. Official Statistics and Confidentiality Maura Bardos

  2. Outline • Overview of the Federal Statistical System • Agencies • Types of survey data collected • Challenges • Statistical Disclosure and confidentiality • Implications

  3. Federal Statistical System • Headed by a Chief Statistician • Decentralized system in the United States • 13 agencies with a statistics-oriented mission • Statistical agencies are spread across various departments of the Federal Government • Examples: Census Bureau (Commerce Department), Energy Information Administration (Department of Energy), Bureau of Labor Statistics (Department of Labor)

  4. Data • Where do the numbers come from? • Survey data • Regulations by OMB • Response rates • Legal obligations • Confidentiality

  5. Confidentiality • Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA): places the onus on federal employees to limit disclosure • Took over 4 years to implement (Anderson and Seltzer) • 3 ways to reduce disclosure risk within agencies: • 1) Limiting the identifiability of survey materials within the organization • 2) Restricting access to data • 3) Restricting the contents that may be released

  6. Statistical Disclosure and Confidentiality • Statistical disclosure: “the identification of an individual (or of an attribute) through the matching of survey data with information available outside of the survey” (Groves, • The federal government defines disclosure as the inappropriate attribution of information to a data subject, whether an individual or an organization, and identifies three types: • Identity: a data subject is identified from a released file • Attribute: sensitive information about a data subject is revealed through the released file • Inferential: the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible (FCSM)

  7. Example

  8. Challenges • Need to provide information • FOIA requests, Subpoenas • Satisfy requests for multiple clients. Must keep track of all withheld information • Maintain utility of data while preserving confidentiality • “Programming nightmare” to keep track of the relationship between variables, tables, and hierarchy

  9. How To Prevent • Specific Strategies • Data Swapping • Noise • Combining Cells • Rounding • Cell Suppression

  10. Strategy: Data Swapping • Exchange of reported data values across data records (Fienberg, Steele, Makov, 1996)

  11. Strategy: Swapping

  12. Select 10%

  13. Strategy: Swapping
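
The swapping idea on the preceding slides (select about 10% of records, then exchange a reported value between them) can be sketched in Python. This is only an illustration under stated assumptions: the function name `swap_values`, the record layout, and the purely random pairing are hypothetical, and production swapping typically restricts swaps to records that agree on chosen key variables, which this sketch omits.

```python
import random

def swap_values(records, field, fraction=0.10, seed=0):
    """Return a copy of `records` with `field` exchanged between random pairs.

    Assumption: records are dicts and pairs are chosen purely at random.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]      # leave the input untouched
    n_selected = int(len(swapped) * fraction)
    n_selected -= n_selected % 2              # an even count is needed to form pairs
    chosen = rng.sample(range(len(swapped)), n_selected)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped

# Hypothetical data: 40 records, so 10% selects 4 records (2 swapped pairs).
records = [{"id": i, "income": 1000 * i} for i in range(40)]
released = swap_values(records, "income")
```

Because values are only exchanged, the overall distribution of the swapped variable is unchanged; only its link to individual records is broken.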

  14. Strategy: Noise • Assign a multiplying factor, or noise factor, to all data • For example: the value of a randomly generated variable might be added to each value in a dataset • “protect individual establishments without compromising the quality of our estimates” • Pro: more data can be published, less complicated, less time consuming • Problem: perturbs ALL data, non-sensitive and sensitive alike

  15. Strategy: Noise • How is this done: use multipliers • The standard is to perturb data by about 10% • Use multipliers ranging from 0.9 to 1.1 • Must preserve the trend in the data; otherwise it is useless for the client’s analysis • Use distributions to control variance (examples)
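
A minimal sketch of multiplicative noise with multipliers between 0.9 and 1.1. The uniform distribution, the function name `add_noise`, and the sample values are assumptions made for illustration; the slide only says that some distribution is used to control variance.

```python
import random

def add_noise(values, low=0.9, high=1.1, seed=0):
    """Multiply each value by its own random factor drawn from [low, high].

    Assumption: factors are uniform; an agency may use a different distribution.
    """
    rng = random.Random(seed)
    return [v * rng.uniform(low, high) for v in values]

# Hypothetical establishment values; each is perturbed by at most about 10%.
reported = [100.0, 250.0, 4000.0]
perturbed = add_noise(reported)
```

Since every value moves by at most about 10%, large values stay large and small values stay small, which is what keeps broad trends usable.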

  16. Strategy: Noise

  17. Example: Table with and without Noise

  18. Tables • Before-tabulation strategies: Data Swapping; Data Perturbation (Noise) • Tables of frequencies • Percent of population with certain characteristics • With outside knowledge, respondents with unique characteristics can be identified • Sensitive information: identified by threshold • Tables of magnitude data • Aggregate data, such as income of individuals, revenues of companies • Extreme values • Sensitive information: identified by linear sensitivity measure

  19. Strategy: Recoding Methods • Changing the values of outlier cases, since outliers are more likely to be sample or population uniques • Top coding: taking the largest values on a variable and giving them the same code value in the dataset • For example: place all companies producing more than 100,000 barrels of oil per day in one category • Non-uniques are unperturbed
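
Top coding as described above can be sketched in a few lines. The function name `top_code` and the production figures are hypothetical, and using the cap itself as the shared code value is an assumption; a category label such as "100,000 or more" would serve equally well.

```python
def top_code(values, cap):
    """Give every value above `cap` the same code (here, the cap itself)."""
    return [cap if v > cap else v for v in values]

# Hypothetical daily production in barrels; the two largest become indistinguishable.
production = [20_000, 99_000, 150_000, 480_000]
coded = top_code(production, 100_000)  # → [20000, 99000, 100000, 100000]
```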

  20. Example of Disclosure • How do we fix this?

  21. Example Cont. Collapsing of categories

  22. Strategy: Rounding • Similar to noise. Cells are rounded, and a random decision is made whether to round up or down • Example: round values to a multiple of 5 • Write the cell value x as x = 5q + r, where q is a non-negative integer and r is the remainder (0 ≤ r < 5) • Round up to 5 × (q + 1) with probability r/5 • Round down to 5 × q with probability 1 - r/5
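
The random rounding rule on this slide (round up with probability r/5, down otherwise) can be sketched as follows. The function name `random_round` and the `rng` parameter are hypothetical; leaving exact multiples of the base unchanged follows from r = 0.

```python
import random

def random_round(x, base=5, rng=random):
    """Randomly round a non-negative integer x to a multiple of `base`.

    With x = base*q + r, round up with probability r/base, down otherwise.
    """
    q, r = divmod(x, base)
    if r == 0:
        return x                      # already a multiple of the base
    if rng.random() < r / base:
        return base * (q + 1)         # round up
    return base * q                   # round down

rng = random.Random(1)
rounded = [random_round(7, rng=rng) for _ in range(10)]  # each entry is 5 or 10
```

The expected value of the rounded cell is 5(q + 1)(r/5) + 5q(1 - r/5) = 5q + r, so the rounding is unbiased on average.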

  23. Original Table

  24. Example: Rounding

  25. Strategy: Rounding, now with constraints

  26. How to identify cells with disclosure risks for magnitude data • n-k rule • p% rule

  27. P-Percent rule • If upper or lower estimates for the respondent’s value are closer to the reported value than some prespecified percentage (p) of the total cell value, the cell is sensitive (Groves, 372). • Assumptions: Any respondent can estimate the contribution of another respondent to within 100% of its value • The second-largest respondent can use their reported value to attempt to estimate the largest reported value, X1

  28. P Percent Rule • For a given cell with N respondents and cell total T, arrange the data in order from largest to smallest: X1 ≥ X2 ≥ … ≥ XN > 0 • A cell is sensitive if S > 0, where S = X1 - (100/p) × (T - X2 - X1)

  29. Example Consider the cell 18,177. N=3; X1 = 17,000; X2 = 1,000; X3 = 177; p=15
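
The slide gives the inputs but not the arithmetic. A small sketch, using the formula S = X1 - (100/p)(T - X2 - X1) from the previous slide, works it through; the function name `p_percent_score` is hypothetical.

```python
def p_percent_score(contributions, p):
    """Return S = X1 - (100/p) * (T - X2 - X1) for one cell.

    `contributions` are the respondents' values; the cell is sensitive if S > 0.
    """
    ordered = sorted(contributions, reverse=True)
    x1 = ordered[0]
    remainder = sum(ordered[2:])      # T - X2 - X1: everyone except the top two
    return x1 - 100 * remainder / p

score = p_percent_score([17_000, 1_000, 177], p=15)
# S = 17,000 - (100/15) * 177 = 17,000 - 1,180 = 15,820, so the cell is sensitive
```

Intuitively, the second-largest respondent can subtract its own 1,000 from the published 18,177 and estimate the largest value to within 177, which is far closer than 15% of 17,000.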

  30. (n, k) Rule • If a small number (n) of the respondents contribute a large percentage (k) to the total cell value then the cell is sensitive (Groves 372)

  31. Example • We are publishing production data on how many barrels of crude oil each refinery produces per day. This is secret information: if our competitors found out, it could be detrimental to our business. • There are 4 collectors in the state with collections of 100, 50, 25, and 5, respectively • Should this information be released? Check using the n-k rule with (2, 85) and the P Percent rule with p = 35% • Using the P Percent rule, this cell is sensitive. However, it is not sensitive by the n-k rule
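
Both rules can be checked on the four values above with a short sketch. The function names are hypothetical, and the (n, k) rule is read here as "the top n respondents contribute more than k percent of the cell total".

```python
def nk_sensitive(contributions, n, k):
    """(n, k) rule: sensitive if the n largest values exceed k percent of the total."""
    ordered = sorted(contributions, reverse=True)
    return 100 * sum(ordered[:n]) / sum(ordered) > k

def p_percent_sensitive(contributions, p):
    """p% rule: sensitive if X1 - (100/p) * (T - X2 - X1) > 0."""
    ordered = sorted(contributions, reverse=True)
    return ordered[0] - 100 * sum(ordered[2:]) / p > 0

cell = [100, 50, 25, 5]                   # total T = 180
by_nk = nk_sensitive(cell, n=2, k=85)     # 150/180 = 83.3% < 85%  → False
by_p = p_percent_sensitive(cell, p=35)    # 100 - (100/35)*30 ≈ 14.3 > 0 → True
```

This reproduces the slide's conclusion: the cell passes the (2, 85) rule but fails the p% rule at p = 35.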

  32. Relationship between n-k and p% rule

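  33. System of inequalities (Z1 and Z2 are the largest and second-largest contributions, as percentages of the cell total) • P%: Z2 > 100 - 1.35 Z1 • (n,k): Z2 > 85 - Z1 • Variable constraints: Z2 < Z1 and Z1 + Z2 < 100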
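
The two inequalities on this slide follow from the rule definitions once contributions are written as percentages of the cell total. The derivation below is a sketch that assumes z_i = 100 x_i / T, with p = 35 and (n, k) = (2, 85) as in the preceding example.

```latex
% p% rule: sensitive when S = x_1 - \frac{100}{p}(T - x_1 - x_2) > 0
\begin{align*}
x_1 &> \frac{100}{p}\,(T - x_1 - x_2)
  && \text{multiply both sides by } \tfrac{100}{T} \\
z_1 &> \frac{100}{p}\,(100 - z_1 - z_2) \\
z_2 &> 100 - \Bigl(1 + \frac{p}{100}\Bigr) z_1 = 100 - 1.35\,z_1
  && (p = 35)
\end{align*}

% (n, k) rule with n = 2, k = 85: the top two shares exceed 85 percent
\[
z_1 + z_2 > 85 \quad\Longleftrightarrow\quad z_2 > 85 - z_1
\]
```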

  34. Relationship between n-k and p% rule

  35. (55.56, 27.27)

  36. Strategy: Sensitive Cell Suppression • Primary Suppressions: The sensitive Cell • Complementary/Secondary Suppressions: Additional withheld data to ensure that the primary suppressions cannot be derived by linear combination • Goal: Minimize information lost. This is accomplished by selecting smallest possible cell values for complementary cell suppression • Problem: Often requires a substantial amount of data to be withheld. Potential for errors may lead to the release of confidential data
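
Why complementary suppressions are needed can be shown with a simple audit sketch. It assumes a two-way table whose row and column totals are all published, and it checks only recovery by direct subtraction; the function name `recoverable_cells` is hypothetical, and a full audit would also bound suppressed cells using all linear relationships in the table.

```python
def recoverable_cells(suppressed):
    """Return suppressed (row, col) cells recoverable by subtraction from totals.

    A cell that is the only suppression in its row or column equals the published
    total minus the visible cells. Once recovered it counts as visible, which may
    expose further cells, so the check repeats until nothing changes.
    """
    remaining = set(suppressed)
    exposed = set()
    changed = True
    while changed:
        changed = False
        for cell in sorted(remaining):
            i, j = cell
            alone_in_row = sum(1 for (a, _) in remaining if a == i) == 1
            alone_in_col = sum(1 for (_, b) in remaining if b == j) == 1
            if alone_in_row or alone_in_col:
                exposed.add(cell)
                remaining.discard(cell)
                changed = True
    return exposed

only_primary = recoverable_cells({(0, 0)})                       # → {(0, 0)}
with_rectangle = recoverable_cells({(0, 0), (0, 1), (1, 0), (1, 1)})  # → set()
```

A lone primary suppression is recoverable at once, while suppressing a full rectangle of four cells leaves no cell alone in its row or column.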

  37. Strategy: Sensitive Cell Suppression • Small Tables: • Manual suppression • Computerized audit procedures • Large Tables: • Much more complex, especially with related tables and hierarchical data • Consistency

  38. Real Example: Disclosure

  39. Cell Suppression Example • Let’s return to a previous example: Sales Revenue • We determined that the cell must be suppressed. How do we accomplish this?

  40. Example of a Solution

  41. Conclusion: Data is secure • High levels of security and suppression are necessary to protect data, because data guides real-life policy issues. • The quality of this data depends not only on a high response rate but also on accurate responses • Producing data is a function of “public trust” • However, the point of data collection is its use and analysis. The tradeoff between confidentiality and utilization must be examined

  42. …Or is it? • Patriot Act 2001 (Anderson & Seltzer) • Section 508: Disclosure of information from National Center for Education Statistics surveys • The Justice Department is able to obtain and use reports, records, and information (including individually identifiable information) for investigation and prosecution • The Patriot Act overrides the confidentiality protections of the 1994 National Center for Education Statistics Act

  43. Other examples from history • Second War Powers Act (1942-1947) • Repealed the confidentiality protections of Title 13 governing the US Census Bureau (Anderson & Seltzer) • Japanese Americans and internment camps (USA Today)

  44. 2004 data on Arab-Americans (NYT) • Released number of Arab-Americans per zip code • Categorized by country of origin: Egyptian, Iraqi, Jordanian, Lebanese, Moroccan, Palestinian, Syrian and two general categories, "Arab/Arabic" and "Other Arab." • Data obtained from a sample (the long form of the census)

  45. In conclusion… …the next time you fill out a survey, think about where your information may (or may not) be used.

  46. Sources • Clemetson, Lynette. “Homeland Security given data on Arab-Americans.” New York Times. July 30, 2004. • El Nasser, Haya. “Papers show Census role in WWII Camps.” USA Today. March 30, 2007. • “DoD releases FY 2010 Budget Proposal.” US Department of Defense. May 7, 2009. • Seltzer, William and Margo Anderson. “NCES and the Patriot Act.” Paper prepared for the Joint Statistical Meetings. 2002. • Evans, Timothy, Laura Zayatz, and John Slanta. “Using Noise for Disclosure Limitation of Establishment Tabular Data.” US Census Bureau. 1996. • “Statistical Programs of the US Government.” Office of Management and Budget. 2009.

  47. Sources of examples • Sullivan, Colleen. “An Overview of Disclosure Principles.” US Census Bureau. 1992. • “Statistical Policy Working Paper: Report on Statistical Disclosure Methodology.” Federal Committee on Statistical Methodology. 2005. • Groves, Robert et al. Survey Methodology. Hoboken, NJ: John Wiley & Sons. 2004.

  48. Additional Resources