Official Statistics and Confidentiality

Presentation Transcript


  1. Official Statistics and Confidentiality Maura Bardos

  2. Outline • Overview of the Federal Statistical System • Agencies • Types of survey data collected • Challenges • Statistical Disclosure and confidentiality • Implications

  3. Federal Statistical System • Headed by the Chief Statistician • Decentralized system in the United States • 13 agencies with a statistics-oriented mission • Statistical agencies are located throughout various departments of the federal government • Examples: Census Bureau (Department of Commerce), Energy Information Administration (Department of Energy), Bureau of Labor Statistics (Department of Labor)

  4. Data • Where do the numbers come from? • Survey data • Regulations by OMB • Response rates • Legal obligations • Confidentiality

  5. Confidentiality • Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA) places the onus on federal employees to limit disclosure • Took over 4 years to implement (Anderson and Seltzer) • Three ways to reduce disclosure risk within agencies: 1) limiting the identifiability of survey materials within the organization; 2) restricting access to data; 3) restricting the contents that may be released

  6. Statistical Disclosure and Confidentiality • Statistical disclosure: “the identification of an individual (or of an attribute) through the matching of survey data with information available outside of the survey” (Groves et al.) • Disclosure is the inappropriate attribution of information to a data subject, whether an individual or an organization; the federal government identifies three types: • Identity: a data subject is identified from a released file • Attribute: sensitive information about a data subject is revealed through the released file • Inferential: the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible (FCSM)

  7. Example

  8. Challenges • Need to provide information • FOIA requests, Subpoenas • Satisfy requests for multiple clients. Must keep track of all withheld information • Maintain utility of data while preserving confidentiality • “Programming nightmare” to keep track of the relationship between variables, tables, and hierarchy

  9. How To Prevent • Specific Strategies • Data Swapping • Noise • Combining Cells • Rounding • Cell Suppression

  10. Strategy: Data Swapping • Exchange of reported data values across data records (Fienberg, Steele, Makov, 1996)
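A minimal sketch of the idea in Python (the record layout, 10% swap rate, and random pairing scheme are illustrative assumptions, not an agency's production procedure):

    import random

    def swap_attribute(records, attribute, rate=0.10, seed=0):
        """Exchange one attribute's values between randomly paired records.

        records   -- list of dicts, one per respondent
        attribute -- field whose reported values are exchanged
        rate      -- fraction of records selected for swapping (e.g. 10%)
        """
        rng = random.Random(seed)
        n_pairs = int(len(records) * rate) // 2
        chosen = rng.sample(range(len(records)), n_pairs * 2)
        for i, j in zip(chosen[::2], chosen[1::2]):
            records[i][attribute], records[j][attribute] = (
                records[j][attribute], records[i][attribute])
        return records

    # Example: swap reported income between roughly 10% of 20 records.
    data = [{"id": k, "income": 1000 * k} for k in range(1, 21)]
    swapped = swap_attribute(data, "income")

Because values are only reassigned between records, the marginal distribution of the swapped variable is unchanged; what is perturbed is the link between that value and the rest of the record.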

  11. Strategy: Swapping

  12. Select 10%

  13. Strategy: Swapping

  14. Strategy: Noise • Assign a multiplying factor, or noise factor, to all data; alternatively, the value of a randomly generated variable might be added to each value in a dataset • Goal: “protect individual establishments without compromising the quality of our estimates” • Pro: more data can be published; less complicated and less time consuming • Problem: perturbs ALL data, non-sensitive and sensitive alike

  15. Strategy: Noise • How is this done? Use multipliers • The standard is to perturb data by about 10% • Use multipliers ranging from 0.9 to 1.1 • Must preserve the trend in the data, otherwise it is useless for the client’s analysis • Use distributions to control variance (examples follow)
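A minimal sketch of multiplicative noise in Python, assuming multipliers drawn uniformly from the 0.9–1.1 range mentioned above (in practice the distribution of multipliers would be chosen to control variance, as the slide notes):

    import random

    def add_multiplicative_noise(values, low=0.9, high=1.1, seed=0):
        """Perturb every value by a random noise multiplier.

        With a range symmetric around 1.0 the expected value of each cell
        is unchanged, which helps preserve trends in published estimates.
        """
        rng = random.Random(seed)
        return [v * rng.uniform(low, high) for v in values]

    # Example: barrels per day reported by five establishments.
    reported = [12000, 8500, 4300, 950, 120]
    noisy = add_multiplicative_noise(reported)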

  16. Strategy: Noise

  17. Example: Table with and without Noise

  18. Tables • Before-tabulation strategies: data swapping; data perturbation (noise) • Tables of frequencies: percent of the population with certain characteristics; with outside knowledge, respondents with unique characteristics can be identified; sensitive cells identified by a threshold rule • Tables of magnitude data: aggregate data, such as income of individuals or revenues of companies; extreme values drive the disclosure risk; sensitive cells identified by a linear sensitivity measure

  19. Strategy: Recoding Methods • Changing the values of outlier cases, since outliers are more likely to be sample or population uniques • Top coding: taking the largest values on a variable and giving them the same code value in the dataset • For example, place all companies producing more than 100,000 barrels of oil per day in one category • Non-uniques are left unperturbed
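A minimal sketch of top coding in Python, using the slide's illustrative cutoff of 100,000 barrels per day (the threshold and category label are assumptions):

    def top_code(values, threshold=100000, code=">100,000"):
        """Replace every value above the threshold with a single category.

        Unusually large producers all receive the same code, so a unique
        outlier can no longer be singled out; values at or below the
        threshold are left unperturbed.
        """
        return [code if v > threshold else v for v in values]

    # Example: daily crude oil production per refinery.
    production = [250000, 180000, 99500, 42000, 7300]
    print(top_code(production))
    # ['>100,000', '>100,000', 99500, 42000, 7300]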

  20. Example of Disclosure: How do we fix this?

  21. Example Cont. Collapsing of categories

  22. Strategy: Rounding • Similar to noise; cells are rounded, and a random decision is made whether to round up or down • Example: round values to a multiple of 5 by writing the cell value as x = 5q + r, where q is a non-negative integer and r is the remainder • Round up to 5(q + 1) with probability r/5; round down to 5q with probability 1 − r/5
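A minimal sketch of this random-rounding rule in Python (base of 5, as above):

    import random

    def random_round(x, base=5):
        """Randomly round a cell value to a multiple of `base`.

        Writing x = base*q + r, the cell is rounded up to base*(q+1) with
        probability r/base and down to base*q with probability 1 - r/base,
        so the expected value of the rounded cell equals x.
        """
        q, r = divmod(x, base)
        if random.random() < r / base:
            return base * (q + 1)
        return base * q

    # Example: 13 rounds to 15 with probability 3/5 and to 10 with probability 2/5.
    print([random_round(13) for _ in range(5)])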

  23. Original Table

  24. Example: Rounding

  25. Strategy: Rounding, now with constraints

  26. How to identify cells with disclosure risks for magnitude data • n-k rule • p% rule

  27. p% Rule • If upper or lower estimates for the respondent’s value are closer to the reported value than some prespecified percentage (p) of the total cell value, the cell is sensitive (Groves, 372) • Assumption: any respondent can estimate the contribution of another respondent to within 100% of its value • The second largest respondent can use their reported value to attempt to estimate the largest reported value, X1

  28. p% Rule • For a given cell with N respondents and total T, arrange the data in order from largest to smallest: X1 > X2 > … > XN > 0 • The cell is sensitive if S > 0, where S = X1 − (100/p) × (T − X1 − X2)

  29. Example • Consider the cell total 18,177 with N = 3; X1 = 17,000; X2 = 1,000; X3 = 177; p = 15 • S = 17,000 − (100/15) × (18,177 − 1,000 − 17,000) = 17,000 − 1,180 = 15,820 > 0, so the cell is sensitive

  30. (n, k) Rule • If a small number (n) of the respondents contribute a large percentage (k) to the total cell value then the cell is sensitive (Groves 372)

  31. Example • We are publishing production data on how many barrels a day of crude oil each refinery produces. This is secret information; if competitors found out, it could be detrimental to a refinery’s business • There are 4 refineries in the state producing 100, 50, 25, and 5 barrels a day, respectively • Find out whether this information should be released using the (n, k) rule with (2, 85) and the p% rule with p = 35% • Using the p% rule, this cell is sensitive; however, it is not sensitive by the (n, k) rule
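A sketch that checks both rules on the figures above (100, 50, 25, 5); the function names are illustrative:

    def p_percent_sensitive(contributions, p):
        """p% rule: sensitive if S = X1 - (100/p) * (T - X1 - X2) > 0."""
        x = sorted(contributions, reverse=True)
        x1, x2, total = x[0], x[1], sum(x)
        return x1 - (100 / p) * (total - x1 - x2) > 0

    def nk_sensitive(contributions, n, k):
        """(n, k) rule: sensitive if the top n contributors account for
        more than k percent of the cell total."""
        x = sorted(contributions, reverse=True)
        return 100 * sum(x[:n]) / sum(x) > k

    cell = [100, 50, 25, 5]                    # total T = 180
    print(p_percent_sensitive(cell, p=35))     # True:  100 > (100/35) * 30
    print(nk_sensitive(cell, n=2, k=85))       # False: top two are ~83.3% of T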

  32. Relationship between n-k and p% rule

  33. System of equations (Z1 and Z2 are the percentages of the cell total contributed by the two largest respondents): • p%: Z2 > 100 − 1.35 Z1 • (n, k): Z2 > 85 − Z1 • Variable constraints: Z2 < Z1; Z1 + Z2 < 100

  34. Relationship between n-k and p% rule

  35. (55.56, 27.27)

  36. Strategy: Sensitive Cell Suppression • Primary suppressions: the sensitive cells • Complementary/secondary suppressions: additional withheld data to ensure that the primary suppressions cannot be derived by linear combination • Goal: minimize information lost; this is accomplished by selecting the smallest possible cell values for complementary suppression • Problem: often requires a substantial amount of data to be withheld, and the potential for errors may lead to the release of confidential data
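A minimal sketch of why complementary suppressions are needed (the table values here are made up for illustration): with a published row total, a single suppressed cell can be recovered exactly by subtraction, so at least one more cell in the row must also be withheld.

    row = {"Refinery A": 120, "Refinery B": 45, "Refinery C": 15}
    row_total = sum(row.values())             # published row total: 180

    # Primary suppression only: withhold the sensitive cell.
    published = dict(row)
    published["Refinery A"] = None

    # An intruder recovers it exactly from the total and the other cells.
    recovered = row_total - sum(v for v in published.values() if v is not None)
    print(recovered)                          # 120 -- protection failed

    # Complementary suppression: withhold a second cell as well (ideally the
    # one that loses the least information), so only a range can be derived.
    published["Refinery C"] = None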

  37. Strategy: Sensitive Cell Suppression • Small tables: manual suppression; computerized audit procedures • Large tables: much more complex, especially with related tables and hierarchical data; suppressions must remain consistent across related tables

  38. Real Example: Disclosure

  39. Cell Suppression Example • Let’s return to a previous example: sales revenue • We determined that the cell must be suppressed. How do we accomplish this?

  40. Example of a Solution

  41. Conclusion: Data is secure • High levels of security and suppression to protect data are necessary, as data guide real-life policy issues • The quality of these data depends not only on a high response rate but also on accurate responses • Producing data is a function of “public trust” • However, the point of data collection is its use and analysis; the tradeoff between confidentiality and utilization must be examined

  42. …Or is it? • Patriot Act of 2001 (Anderson & Seltzer) • Section 508: disclosure of information from National Center for Education Statistics surveys • The Justice Department is able to obtain, and use for investigation and prosecution, reports, records, and information (including individually identifiable information) • The Patriot Act overrides the 1994 National Center for Education Statistics Act, which protects confidentiality

  43. Other examples from history • Second War Powers Act (1942-1947) • Repealed the confidentiality protections of Title 13 governing the US Census Bureau (Anderson & Seltzer) • Japanese Americans and internment camps (USA Today)

  44. 2004 data on Arab-Americans (NYT) • Released number of Arab-Americans per zip code • Categorized by country of origin: Egyptian, Iraqi, Jordanian, Lebanese, Moroccan, Palestinian, Syrian and two general categories, "Arab/Arabic" and "Other Arab." • Data obtained from a sample (the long form of the census)

  45. In conclusion… …the next time you fill out a survey, think about where your information may (or may not) be used.

  46. Sources • Clemetson, Lynette. “Homeland Security Given Data on Arab-Americans.” New York Times. July 30, 2004. http://www.nytimes.com/2004/07/30/politics/30census.html • El Nasser, Haya. “Papers show Census role in WWII Camps.” USA Today. March 30, 2007. http://www.usatoday.com/news/nation/2007-03-30-census-role_N.htm • “DoD releases FY 2010 Budget Proposal.” US Department of Defense. May 7, 2009. http://www.defenselink.mil/releases/release.aspx?releaseid=12652 • Seltzer, William and Margo Anderson. “NCES and the Patriot Act.” Paper prepared for the Joint Statistical Meetings. 2002. http://www.uwm.edu/~margo/govstat/jsm.pdf • Evans, Timothy, Laura Zayatz, and John Slanta. “Using Noise for Disclosure Limitation of Establishment Tabular Data.” US Census Bureau. 1996. http://www.census.gov/prod/2/gen/96arc/iiaevans.pdf • “Statistical Programs of the US Government.” Office of Management and Budget. 2009. http://www.whitehouse.gov/omb/assets/information_and_regulatory_affairs/09statprog.pdf

  47. Sources of examples • Sullivan, Colleen. “An Overview of Disclosure Principles.” US Census Bureau. 1992. http://www.2010census.biz/srd/papers/pdf/rr92-09.pdf • “Statistical Policy Working Paper: Report on Statistical Disclosure Methodology.” Federal Committee on Statistical Methodology. 2005. http://www.fcsm.gov/working-papers/SPWP22_rev.pdf • Groves, Robert et. al. Survey Methodology. Hoboken, NJ: John Wiley & Sons. 2004.

  48. Additional Resources • http://jpc.cylab.cmu.edu/journal/2009/vol01/issue01/issue01.pdf • http://www.census.gov/srd/sdc/papers.html • http://www.census.gov/srd/sdc/abowd-woodcock2001-appendix-only.pdf
