
Timothy M. Mulcahy Fritz Scheuren

Timothy M. Mulcahy, Fritz Scheuren. Eurostat’s NTTS: Conference on New Techniques & Technologies for Statistics: 21st Century Data Dissemination: Practice & Innovations. Brussels, 23 February 2011. Overview. Introduction Context Data access modalities


Presentation Transcript


  1. Timothy M. Mulcahy Fritz Scheuren Eurostat’s NTTS: Conference on New Techniques & Technologies for Statistics: 21st Century Data Dissemination: Practice & Innovations Brussels, 23 February 2011

  2. Overview • Introduction • Context • Data access modalities • Confidentiality, data utility & convenience • Data security • Researcher collaboration • Innovations

  3. Introduction • NORC at the University of Chicago is a nonprofit public interest research organization • Established in 1941 • Closely affiliated with the University of Chicago • Divisions: • Economics, Labor and Population Studies • Education and Child Development • Field Operations Center • Health Care Research • International Projects • Public Health Research • Security, Energy & Environment • Statistics and Methodology • Substance Abuse, Mental Health & Criminal Justice • Telephone Survey and Support Operations

  4. Introduction (cont.) NORC - University of Chicago Academic Research Centers • Center for Advancing Research and Communication in Science, Technology, Engineering, and Mathematics • Center for Excellence in Survey Research • Center for the Study of Politics and Society • Center on Demography and Economics of Aging • Cultural Policy Center • Data Research and Development Center • Joint Center for Education Research • Population Research Center • Ogburn Stouffer Center for the Study of Social Organizations • Alfred P. Sloan Center on Parents, Children and Work

  5. Context • Challenges to responsible data sharing • Selecting the most appropriate access modality • Confidentiality v. analytic utility tradeoff • Convenience v. analytic utility tradeoff • Data security • Researcher Collaboration

  6. 2nd Annual Conference on Microdata Access (2/10/11) “Responsible Data Sharing in the 21st Century”  KEYNOTE ADDRESS: ROBERT GROVES, Director, U.S. Census Bureau PANEL DISCUSSIONS: • “Motivations, Challenges, and Implications for Responsible Data Sharing” • “Responsible Data Sharing Among University Researchers”

  7. The Government’s Role in Statistics • Statistical information is key to an informed citizenry; an informed citizenry is key to a functioning democracy • To be useful, the statistical information must be credible • It must be viewed as nonpartisan • It must be viewed as relevant to key questions about the welfare of the society (Groves)

  8. Conflicting Principles for Data Producers? U.S. Government Statistical Agencies must simultaneously: • Maximize the richness of statistical information and insights based on the data provided • Widely, freely distribute statistics • Spur secondary analysis of data; and • Preserve its pledge of respondent confidentiality • Not just a principle, a law (Groves)

  9. Challenges to Maintaining Confidentiality • Many policy and research questions require estimates that cannot be generated from publicly available data • Statistical disclosure control inherently involves modifying the inferences that can be drawn from the data • Research conducted on perturbed data puts the scientific discovery process at risk

  10. Public Use Files • Advantage: • PUFs may be made widely available for public consumption • Disadvantages: • No training is provided; limited metadata • Some useful information must be at least partially suppressed to protect confidentiality • Widespread availability of other micro datasets that can be matched to public use microdata files or even tabulations to reidentify respondents

  11. The Role of Microdata Access • Increases the # of researchers working with the data • Allows authorized researchers to pose and answer their own questions (curiosity-driven research) • Facilitates secure exploratory analyses and testing & confirming models • Provides a means for interdisciplinary research • Allows for advanced queries of confidential data that cannot be pursued with public use files • “In essence, research that extracts all important information from the data while respecting individual rights, as part of our obligation to the society.” (Groves)

  12. Selecting the Most Appropriate Data Access Modality Questions for data producers & providers: • What are my goals & objectives? • Who is my audience? • What is my risk tolerance? • Goal: design a set of customized data dissemination strategies that balance the level of risk tolerance and the need for data analytic utility

  13. Available Data Access Modalities • Licensing: very high disclosure risk • Remote batch processing: time-consuming and costly • Online tabulation engines: data suppression, perturbation • Synthetic microdata: costly, low analytic utility, highly dependent on model accuracy • RDCs and data enclaves: high data analytic utility while maintaining a high standard of data confidentiality Strategizing Data Dissemination and Secure Microdata Access

  14. Risk-Utility Tradeoff • The primary risk factor of data access is disclosure • Individual or firm level information must be handled very carefully • In the context of data access, there is a tradeoff between disclosure risk and data analytic utility • As additional measures are introduced to protect data confidentiality, data analytic utility will be reduced • In other words, the lower the risk, the lower the utility

  15. Confidentiality-Utility Curve [Figure: data access modalities plotted by confidentiality against analytic utility. Labelled modalities: Physical and/or Remote Access Data Enclaves; Remote Batch Processing; Synthetic Micro-Data; Statistical Tables and Data Cubes; Public Use Data-File; Licensing.]

  16. The Third Factor • Confidentiality and data utility are not the only factors that influence the choice of data access modality • The third factor: Convenience • Producers’ perspective: how easy is it to: • Implement an RDC or enclave? • Update and document the data? • Monitor researchers’ work and output requests? • Researchers’ perspective: • How far do they need to travel to the nearest RDC? • How easy is it for them to conduct follow-up work? • How quickly does the RDC review and approve output requests? • How easy is it for them to seek assistance? • Is there any peer-to-peer researcher interaction or peer review?

  17. Given the Same Utility… [Figure: confidentiality plotted against convenience for Physical Data Enclaves, Remote Access Data Enclaves, and Physical Data Enclaves with Remote Access. Annotations: value provided with a secure physical enclave; value added with remote access to diverse sensitive datasets; value added with flexible deployment of terminals.]

  18. Data Security • Data security: the ability to control disclosure risk, ensure privacy, and thus maintain data confidentiality • Both RDCs and data enclaves allow secure microdata access with a similar level of data analytic utility • RDC: researchers physically access data stored at a secure physical facility • Data Enclave: researchers remotely access data stored on a file server through a secure system in a virtualized environment • Both modalities provide high confidentiality protection: information inflow & outflow are monitored and controlled.

  19. Portfolio Protection Approach • Legal • Educational / Training • Statistical • Technical • Operational (*Customized per data producer & dataset)

  20. Technical Protection • Encrypted connection with the data enclave using virtual private network (VPN) technology. VPN technology prevents outsiders from reading the data transmitted between the researcher’s computer and NORC’s network. • Users access the data enclave from a static or pre-defined narrow range of IP addresses. • Citrix Web-based security interface. • All applications and data run on the server at the data enclave. • Data enclave can prevent the user from transferring any data from the data enclave to a local computer. • Data files cannot be downloaded from the remote server to the user’s local PC. • User cannot use the “cut and paste” feature in Windows to move data from the Citrix session. • User is prevented from printing the data on a local computer. • Audit logs and audit trails
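One of the protections above, access restricted to a static or pre-defined narrow range of IP addresses, can be illustrated with a short sketch. This is not NORC's implementation: the allowlist entries are documentation-reserved placeholder ranges, and the function name `is_allowed` is an assumption made for illustration.

```python
import ipaddress

# Hypothetical allowlist: documentation-reserved ranges stand in for a
# researcher's registered static address and a narrow institutional range.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.10/32"),    # a single static IP
    ipaddress.ip_network("198.51.100.0/28"),  # a narrow pre-defined range (.0 to .15)
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than raising.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```

In a real deployment a check of this kind would normally sit in the VPN gateway or firewall configuration rather than in application code; the sketch only shows the membership test itself.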

  21. Enclave Security Features • Locked down thin client with minimal software, hardware authentication and self-monitoring mechanisms • Webcam for audit trails, user/room monitoring, face recognition • Network connection control (fixed IP, no DNS resolution, etc.) • 2-factor authentication (biometric, smartcard, token, etc.) • Enclave Security & Support Center monitors activity and provides remote assistance and system maintenance [Diagram labels: Internet; DE Security/Support Team]

  22. Enclave Security Features (cont.) • Ability to push out updates • Machines communicate with central server • Pings security configurations • Experimenting with GPS, biometrics, finger swipe/iris scan • 2-factor authentication

  23. Operational Control • Internal Review: NORC performs extensive disclosure analysis on all output and makes recommendations to the producer • Primary disclosure • Secondary disclosure • Residual disclosure • External Review: Data producers perform additional review of all output and make the final decision on all output releases • Internal + external statistical review = safe output
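As a rough illustration of a primary disclosure check, the sketch below flags tabulation cells that rest on too few respondents. The slides do not state which rules NORC applies; the minimum cell count rule, the threshold of 5, and the function name `flag_small_cells` are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical threshold; real thresholds are set by the data producer.
MIN_CELL_COUNT = 5


def flag_small_cells(records, keys, min_count=MIN_CELL_COUNT):
    """Return {cell: count} for every cell with fewer than min_count records.

    A cell is the tuple of values a record takes on the tabulation keys.
    """
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Example: a two-way table of region by firm size.
records = (
    [{"region": "A", "size": "small"}] * 6
    + [{"region": "B", "size": "large"}] * 2
)
flagged = flag_small_cells(records, ["region", "size"])
```

Here `flagged` contains only the region B, large-firm cell, which has two records. Secondary and residual disclosure are not covered by a per-table check like this, since they depend on comparing an output with other tables and with earlier releases.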

  24. The USDA Experience • Problem: USDA needed to disseminate survey data to the agricultural economics research community • The Agricultural Resource Management Survey • They already operated a network of RDCs in all 50 states • Implementing a remote access solution allowed them to engage a larger number of more dispersed researchers and centralize their researcher outreach

  25. The USDA Experience • Geographically dispersed researchers travel to secure RDCs • Instead of creating additional costly brick and mortar RDCs, USDA can now roll out a virtual RDC to any university in the country • Thin client terminals are installed in secure locations at researchers’ universities

  26. Collaboration • Collaboration increases researcher productivity • Traditional data access modalities do not accommodate the need for research collaboration • Data Enclave facilitates collaboration • Provides platforms and tools for collaboration • Environment for interaction (e.g., instant messaging) • Allows encrypted file sharing • Develops group identity within research communities

  27. Enclave Collaboration Tools

  28. Collaboration Tools (cont.) PRODUCER PORTAL: GENERAL INFORMATION, KNOWLEDGE SHARING, SUPPORT • Background info • Announcements • Calendar of events • About • Topic of the week • Discussion groups • Wiki • Shared libraries • Metadata / Report • Scripts • Research papers • Frequently Asked Questions • Technical Support • DE usage • Data usage • Quality. Content is fully editable by producers and researchers using a simple web-based interface. Private research group portals with similar functionalities are configured for each research project.

  29. Collaboration Tools (cont.) • Home: Welcome, background information, contact, simple access to public data and documentation • Researcher Services • Collaborative Space • My Datasets: Create custom view of the data for use in project or sharing with community • Wiki: Capture knowledge surrounding the data. Initial content will be seeded with survey metadata. • My Projects: Bring together researchers in a virtual environment to share research ideas, data, documentation, and scripts. • Library: Searchable libraries of papers/references/documentation, scripts/programs, primary and secondary data. Most of the content is extracted automatically from the research space. • My Publications: Package research outputs (papers, documents, scripts/programs, secondary data) for preservation, dissemination and sharing • Communication: Events and news, community-driven discussion groups, FAQ/Answers, Chat • My Profile: Provide individual background information, research interests, set privacy options and configure notification services • Services: Researcher Directory, Project Directory, Call for collaboration, Notification, Support, Training • Infrastructure: Primary and researcher data and metadata storage, databases, security (access, backups), web services • Admin Services: System and data usage reports, data/metadata management, user administration, etc.

  30. Kauffman Foundation Experience • Collaboration increases research productivity • The Kauffman Foundation sought a remote microdata access solution with the express intent of creating a collaborative research community around the Kauffman Firm Survey (KFS) • Output since mid-2007: • Two books • Five book chapters • 10 peer reviewed articles • Two dissertations • 57 conference presentations • 23 research reports • Four best paper awards

  31. External Collaboration • External web-based collaboration tools allow researchers to share knowledge leveraging public data and metadata as well as online information • These tools provide prospective researchers with the opportunity to familiarize themselves with confidential datasets prior to being granted access.

  32. The NADA Data Catalog • The International Household Survey Network (IHSN) National Data Archive web-based tool (NADA) • Catalogs data in a DDI-compliant standard • Allows prospective researchers to browse metadata • Users can compare variables across surveys within the system • Multi-tiered access system gives each researcher a personalized account so that producers can • Disseminate public use files directly • Review data access requests from researchers

  33. The NADA Data Catalog

  34. Disclosure Review Innovations • Disclosure Control Process • Optimize management of archiving and disclosure review processes • Facilitate workflow for review and export of output requests • Phase 1: • Develop simple request packaging tool for user • Support workflow (submit, enclave review, producer review, delivery) • Institutional archiving and audit • Phase 2: • Link with DDI metadata to identify all data sources and variables • Facilitate review process, comparison of log requests over time to protect against residual disclosure, understanding of variable usage/utility • Phase 3: Automation
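The Phase 1 workflow (submit, enclave review, producer review, delivery) can be read as a small state machine with an audit trail. The sketch below is one possible reading, not the tool described in the slides: the class and stage names are hypothetical, and the `REJECTED` stage is an added assumption, since the slide lists only the approval path.

```python
from enum import Enum


class Stage(Enum):
    SUBMITTED = "submitted"
    ENCLAVE_REVIEW = "enclave review"
    PRODUCER_REVIEW = "producer review"
    DELIVERED = "delivered"
    REJECTED = "rejected"  # assumption: not named on the slide


# Approval path taken from the slide's workflow order.
NEXT_STAGE = {
    Stage.SUBMITTED: Stage.ENCLAVE_REVIEW,
    Stage.ENCLAVE_REVIEW: Stage.PRODUCER_REVIEW,
    Stage.PRODUCER_REVIEW: Stage.DELIVERED,
}


class OutputRequest:
    """An output request that records every stage it passes through."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.stage = Stage.SUBMITTED
        self.audit_log = [Stage.SUBMITTED]

    def advance(self, approved=True):
        """Move to the next stage, or to REJECTED if approved is False."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"request is already final: {self.stage.value}")
        self.stage = NEXT_STAGE[self.stage] if approved else Stage.REJECTED
        self.audit_log.append(self.stage)
        return self.stage
```

Keeping the full stage history on the request object is what would support the archiving and audit item in Phase 1, and comparing requests over time in Phase 2.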

  35. Disclosure Review Innovations: DocuStat and Coodle • Script Import: SAS, Stata, SPSS scripts are imported into Coodle and indexed by an Apache Lucene database. The files are also tagged to maintain attributes such as author, project, filename, date, etc. • Coodle Variable Analysis: Lucene is a free text indexing engine. Similar to Google, it provides search functionality over vast amounts of text. DDI metadata is used to retrieve variable names and Coodle performs name-based searches in the repository. This returns the scripts and code snippets where a particular variable is in use. The results are stored in an XML document. • Coodle Source Code Browser: Users can also query the repository to retrieve source code examples or snippets for reuse • Coodle Report Generator: XML transformations are then used to generate various reports for the archive or producer
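The name-based variable search can be approximated in a few lines. This sketch deliberately swaps Lucene for a plain regular-expression scan and is not Coodle's code: the function name `find_variable_usage`, the in-memory `scripts` dictionary, and the case-insensitive matching are all assumptions (SAS names are case-insensitive, Stata names are not).

```python
import re


def find_variable_usage(scripts, variable_names):
    """Map each variable name to (filename, line number, line text) hits.

    scripts: dict of filename -> script text.
    variable_names: names that would come from the DDI metadata.
    """
    hits = {name: [] for name in variable_names}
    for filename, text in scripts.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for name in variable_names:
                # Whole-word match so "emp" does not match "employees".
                if re.search(rf"\b{re.escape(name)}\b", line, flags=re.IGNORECASE):
                    hits[name].append((filename, line_no, line.strip()))
    return hits


# Example with two made-up scripts.
scripts = {
    "model.do": "use kfs.dta\nregress revenue employees\n",
    "tab.sas": "proc freq; tables EMPLOYEES*state; run;",
}
usage = find_variable_usage(scripts, ["employees", "income"])
```

A linear scan like this only suits a small repository; an inverted index such as the Lucene one named in the slide is what makes the same query practical over a large script archive.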

  36. Thank you! QUESTIONS?
