Understanding Research Data Centres

Chuck Humphrey, Data Library, University of Alberta


  1. Understanding Research Data Centres • Chuck Humphrey, Data Library, University of Alberta • Atlantic DLI Workshop

  2. Outline • Discuss a common goal of the DLI and RDC programs to show how they complement one another. • Discuss some differences between the DLI and RDC in how they provide access to data. • Discuss what happens behind the security door of an RDC.

  3. Common Input Goal • A goal of DLI is to create affordable and equitable access to “standard data products” for post-secondary institutions. • A goal of the RDC program is to provide access to confidential data for approved research projects using procedures allowed under the conditions of the Statistics Act.

  4. Continuum of Access • [Figure: Statistics Canada data access channels arranged along two axes, Open to Restricted and Free to Expensive. Channels shown: Statistics Canada Website, Depository Service Program, Data Liberation Initiative, Custom Tabulations, Remote Job Submission, Research Data Centres.]

  5. Continuum of Access (continued) • [Figure: the same access-channel diagram as slide 4.]

  6. Access Problem Being Solved • DLI: The problem being solved was the high cost of “standard data products.” • RDC: The problem being solved was access to the confidential files of the longitudinal surveys begun in the 1990s.

  7. Access: Some Differences • DLI: Access is determined by a paid institutional membership and a license that defines approved users and uses of these data products. • RDC: Access is determined by a peer-approval process for projects, a security clearance prior to establishing “deemed employee” status, and a contract. Institutions must pay a $100,000 per year service fee to operate an RDC.

  8. Access: Some Differences • DLI: Access is to “standard data products”, which have been created for public dissemination. • RDC: Access is to confidential data, which are protected under the Statistics Act and are only available to STC employees or “deemed employees” who have been given approval to use the data. These data products have not been created for dissemination.

  9. Behind the Closed Doors • We’ve discussed in the past the conditions of working in an RDC: • Approved peer-reviewed research project; • Signed contract with STC to deliver a report based on the project; • Swear an oath to the Statistics Act; • Participate in an orientation; • Work only with the data approved in the project proposal; • Restricted printing and removal of output.

  10. Behind the Closed Doors • What does the RDC Analyst do? • Administer researcher procedures, including the researcher orientation, contracts, security procedures, and setting up accounts. • Administer operations within the RDC, including the management of the data and supporting the Academic Director. • Provide support to researchers by consulting on the data and offering technical advice. • Participate in collaborative research and independent research. • Conduct Disclosure Analysis.

  11. Disclosure Issues • Direct Identifiers (name, address, health services number, etc.) uniquely identify a respondent. These are all stripped from released data files. • Indirect Identifiers are variables (such as age, marital status, occupation, ethnicity, postal code, type of business, etc.) that, when combined, could identify a respondent. • Source: Irene Wong, RDC Analyst

  12. Disclosure Issues • Sensitive variables refer to information or characteristics relating to a respondent’s private life or business which are usually unknown to others (income, illness, behaviour etc.).

  13. Disclosure Risk • Combining indirect identifiers with sensitive variables poses a disclosure risk. • However, researchers often seek these kinds of relationships in data and try to explain them. • Control methods are therefore introduced: restricted access, data reduction, disclosure analysis.

  14. Identity Disclosure • Identity Disclosure - When a respondent can be identified from the released data. • This can happen when identifiers are combined with sensitive variables. Example: • Income, gender, occupation and residence within a Wolfville postal code
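
As a rough illustration of how identity disclosure risk might be screened, the sketch below counts how many records share each combination of indirect identifiers and flags combinations held by a single record. The function name `unique_combinations`, the field names, and the records are all made up for this example; it is not a description of the procedure an RDC actually uses.

```python
from collections import Counter

def unique_combinations(records, keys):
    """Return identifier combinations that occur in exactly one record."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [combo for combo, n in counts.items() if n == 1]

# Hypothetical records; "X1X" is a placeholder postal code, not real data.
records = [
    {"gender": "F", "occupation": "dentist", "postal_code": "X1X"},
    {"gender": "M", "occupation": "teacher", "postal_code": "X1X"},
    {"gender": "M", "occupation": "teacher", "postal_code": "X1X"},
]

risky = unique_combinations(records, ["gender", "occupation", "postal_code"])
print(risky)
```

A combination that appears only once means that anyone who knows those characteristics of a person could attach the record's sensitive values, such as income, to that person.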

  15. Attribute Disclosure • Attribute Disclosure - When confidential information is revealed and can then be attributed to an individual or a group. • All persons with characteristic x have characteristic y. Example: • 100% of female respondents of age 13 in Wolfville reported that they experimented with X

  16. Residual Disclosure • Residual disclosure - when confidential information is disclosed by combining previously released output or information. • Risk of residual disclosure is high in: • Subsequent cycles of longitudinal data files (e.g. NLSCY, NPHS, etc.) • Samples from dependent surveys (e.g. SLID and LFS) • Research projects using the same data file • Overlapping geographical areas (e.g. Health Region and Economic Region)

  17. Lowering Disclosure Risk • General rules used with household sample surveys: • Do not report statistics or table cells with a small number of respondents (e.g. fewer than 5 respondents) • No anecdotal information may be given about specific respondents • ‘Zero’ and ‘Full’ cell restriction • Min. and Max. value restriction • Saturated models, covariance/correlation matrices treated like underlying tables
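
The small-cell rule above can be sketched as a simple filter over a table of counts. This is only a minimal illustration: the threshold of 5 is taken from the slide's own example, while the function name `suppress_small_cells`, the table layout, and the counts are assumptions made for the sketch.

```python
MIN_CELL_SIZE = 5  # from the slide's "fewer than 5 respondents" example

def suppress_small_cells(table, min_size=MIN_CELL_SIZE):
    """Replace counts below min_size with None so they are not reported.

    Note that a count of 0 is also below the threshold and is suppressed.
    """
    return {cell: (n if n >= min_size else None) for cell, n in table.items()}

# Hypothetical two-way table keyed by (sex, response); counts are invented.
table = {("M", 1): 40, ("M", 0): 25, ("F", 1): 38, ("F", 0): 3}

safe = suppress_small_cells(table)
print(safe)
```

Suppressing a cell on its own is not sufficient if row or column totals are released alongside it, because the hidden count could then be recovered by subtraction.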

  18. Low frequency cells • [Table not reproduced in this transcript.] • Cell (F, 0) is a low frequency cell. Solution? • Collapse column ‘M’ and ‘F’ = column ‘total’ • Collapse row ‘1’ and ‘0’ = row ‘total’ • Report column ‘M’ and row ‘1’ but not along with the ‘total’
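
The two collapsing options can be sketched as summing a two-way table over one of its dimensions. The helper `collapse`, the (sex, response) key layout, and the counts below are hypothetical and serve only to show the arithmetic.

```python
def collapse(table, keep):
    """Sum a two-way table over one dimension.

    keep=0 keeps the first key (sex) and sums over the second;
    keep=1 keeps the second key (response) and sums over the first.
    """
    totals = {}
    for cell, n in table.items():
        totals[cell[keep]] = totals.get(cell[keep], 0) + n
    return totals

# Invented counts; cell ("F", 0) is the low frequency cell.
table = {("M", 1): 40, ("M", 0): 25, ("F", 1): 38, ("F", 0): 3}

by_response = collapse(table, keep=1)  # columns 'M' and 'F' merged
by_sex = collapse(table, keep=0)       # rows '1' and '0' merged
print(by_response, by_sex)
```

Either collapsed table hides the count of 3, at the cost of losing the cross-classification.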

  19. Frequency distributions • Frequency curve example: a user wishes to release the value of the observation at the 99th percentile.* • If fewer than 5 respondents are above the 99th percentile, there is a problem. One solution is to describe the distribution using the 95th percentile. • * If the survey is multilevel (NLSCY), then the 5 or more respondents from level 1 (child) must come from at least 3 different units from level 2 (household). Illustration: child 1: family 1; child 2: family 1; child 3: family 2; child 4: family 2; child 5: family 3….
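
The percentile rule and its multilevel footnote can be combined into a single check, sketched below. The thresholds (5 respondents, 3 households) come from the slide; the function `percentile_release_ok`, the record fields, and the values are assumptions, and the cutoff is passed in directly rather than computed so that no particular percentile definition is implied.

```python
def percentile_release_ok(records, cutoff, min_respondents=5, min_units=3):
    """Check whether enough respondents, from enough households, lie above cutoff."""
    above = [r for r in records if r["value"] > cutoff]
    households = {r["household"] for r in above}
    return len(above) >= min_respondents and len(households) >= min_units

# Mirrors the slide's illustration: 5 children drawn from 3 families.
children = [
    {"child": 1, "household": "family 1", "value": 98},
    {"child": 2, "household": "family 1", "value": 97},
    {"child": 3, "household": "family 2", "value": 99},
    {"child": 4, "household": "family 2", "value": 96},
    {"child": 5, "household": "family 3", "value": 95},
]

ok_three_families = percentile_release_ok(children, cutoff=90)

# Same children, but with child 5 moved into family 2: only 2 households remain.
two_families = [dict(c, household="family 2") if c["child"] == 5 else c for c in children]
ok_two_families = percentile_release_ok(two_families, cutoff=90)
print(ok_three_families, ok_two_families)
```

When the check fails, the slide's suggestion is to fall back to a less extreme percentile, which in this sketch corresponds to lowering `cutoff` and checking again.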

  20. ‘Zero’ and ‘Full’ cell • [Table not reproduced in this transcript.] • (F, 1) is a full cell • (F, 0) is a non-structural zero cell • Both could pose a confidentiality problem • (Married, age <12) is a structural zero cell • Not a data confidentiality problem • Do not expect anyone to be in this category
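
One way to picture these distinctions is a small classifier over a table of counts. The sketch assumes that a "full" cell is one holding every respondent of its group, which is how the (F, 1) example reads; the function `classify_cells`, the labels, and the counts are hypothetical, and structural zeros must be supplied by the analyst because they follow from the survey design rather than from the data.

```python
def classify_cells(table, structural_zeros=frozenset()):
    """Label each (group, response) cell as structural zero, non-structural zero, full, or ordinary."""
    group_totals = {}
    for (group, _), n in table.items():
        group_totals[group] = group_totals.get(group, 0) + n

    labels = {}
    for cell, n in table.items():
        if cell in structural_zeros:
            labels[cell] = "structural zero"      # impossible by design, no risk
        elif n == 0:
            labels[cell] = "non-structural zero"  # possible but empty, potential risk
        elif n == group_totals[cell[0]]:
            labels[cell] = "full"                 # everyone in the group is here
        else:
            labels[cell] = "ordinary"
    return labels

# Invented counts: every female respondent gave response 1.
table = {("M", 1): 40, ("M", 0): 25, ("F", 1): 38, ("F", 0): 0}
labels = classify_cells(table)
print(labels)
```

The full and non-structural zero cells come as a pair here: knowing that no female respondent answered 0 is the same information as knowing that all of them answered 1.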

  21. Implied Tables - residual disclosure • Implied tables are tables produced by subtracting results from one or more published tables from another published table • In this example, ‘non-married’ individuals can easily be calculated
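
The subtraction that produces an implied table can be sketched in a few lines. The age groups and counts below are invented, the helper names are hypothetical, and the threshold of 5 is borrowed from the small-cell example earlier in the deck; the original slide's table is not available in this transcript.

```python
def implied_table(total_table, published_table):
    """Subtract one published table from another, cell by cell."""
    return {k: total_table[k] - published_table[k]
            for k in total_table if k in published_table}

def small_cells(table, min_size=5):
    """List cells whose implied count falls below min_size."""
    return [k for k, n in table.items() if n < min_size]

# Two tables that each look safe on their own (invented counts).
all_persons = {"15-24": 120, "25-34": 200}
married = {"15-24": 117, "25-34": 150}

non_married = implied_table(all_persons, married)
flagged = small_cells(non_married)
print(non_married, flagged)
```

Neither published table contains a small cell, yet their difference does, which is why previously released output has to be considered when vetting new output.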

  22. Reporting Information • Writing a report is no different from working with table output. Avoid statements such as: • “… reported incomes ranging from $2,498 to $579,789.” • If necessary, give general indications (e.g. “no income was above $600,000”). • “… all respondents of age 16 reported experimenting with drugs.” • This is equivalent to a full cell situation.
