
Access to Confidential Data for Statistical Analysis

Kenneth Harris, Director of Research Data Center


National Center for Health Statistics (NCHS)

Despite the wide dissemination of its data through publications, CD-ROMs, etc., the inability to release files with, for instance, lower levels of geography, severely limits the utility of some data for research, policy, and programmatic purposes and sets a boundary on one of the Center’s goals to increase its capacity to provide state and local area estimates.


NCHS (cont.)

In pursuit of this goal and in response to the research community’s interest in restricted data, NCHS established the Research Data Center (RDC), a mechanism whereby researchers can access detailed data files in a secure environment, without jeopardizing the confidentiality of the respondents.


Research Data Center

The NCHS Research Data Center, established in 1998, is a facility at the NCHS headquarters in Hyattsville, Maryland, where researchers are granted access to restricted data files needed to complete approved projects. Restricted data files may contain information, such as lower levels of geography, but do not contain direct identifiers (e.g., name or social security number).


Data Restrictions

Section 308 (d) of the Public Health Service Act and the NCHS Staff Confidentiality Manual do not permit the release of data that are either identified or identifiable to persons outside of NCHS.


Data Restrictions (cont.)

Identifiable data include not only direct identifiers such as name, social security number, etc., but also data that can serve to allow inferential identification of either individual or institutional respondents by a number of means.


Data Restrictions (cont.)

Research indicates that identifiability is greatly enhanced if geographic identifiers for state, county, census tract, block-group or block are released on public use files.


Key Issues for Research Data Availability

CONFIDENTIALITY

The dissemination of data in a manner that would allow public identification of the respondent or would in any way be harmful to him/her is prohibited and the data are immune from legal process.


Key Issues for Research Data Availability (cont.)

DISCLOSURE

Disclosure relates to inappropriate attribution of information to a data subject, whether an individual or an organization. Disclosure occurs when a data subject is identified from a released file (identity disclosure), sensitive information about a data subject is revealed through the released file (attribute disclosure), or the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible (inferential disclosure).


Appendix I – Rules for the Release of Micro Data Files

  • The data file must not contain any detailed information about the subject that could facilitate identification and that is not essential for research purposes (e.g., exact date of the subject’s birth).

  • Geographic places that have fewer than 100,000 people are not to be identified on the data file.

  • Characteristics of an area are not to appear on the data file if they would uniquely identify an area of less than 100,000 people.
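The 100,000-person threshold in the rules above can be illustrated with a short sketch. This is not NCHS code: the record layout, the `place` field, and the population lookup are all hypothetical, and treating a place with unknown population as small is an assumption made here to stay on the conservative side.

```python
def mask_small_places(records, populations, threshold=100_000):
    """Return copies of the records with 'place' blanked for small places.

    A place is blanked when its population is below the threshold or
    when its population is unknown (assumed conservative default).
    """
    masked = []
    for rec in records:
        out = dict(rec)  # copy so the input records are left untouched
        pop = populations.get(rec.get("place"))
        if pop is None or pop < threshold:
            out["place"] = None
        masked.append(out)
    return masked


# Hypothetical example data.
records = [
    {"id": 1, "place": "Alpha County"},
    {"id": 2, "place": "Beta Town"},
]
populations = {"Alpha County": 250_000, "Beta Town": 40_000}
masked = mask_small_places(records, populations)
```

Blanking the place name covers only the second rule. The third rule (area characteristics that would uniquely identify a small area) needs a separate check that this sketch does not attempt.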


Appendix I – Rules for the Release of Micro Data Files (cont.)

  • Information on the drawing of the sample which might assist in identifying a data subject must not be released outside the Center. Thus, the identities of primary sampling units are not to be made available outside the Center.

  • Before any new or revised micro data files are published, they, together with their full documentation, must be approved for publication by the NCHS Director or Deputy Director.

  • A micro data file containing confidential data on unidentified individuals or facilities may not be released to any person or organization outside NCHS until that person, or a responsible representative of that organization, has first signed the statement on the Order Form, whereby he gives assurance that the data provided will be used only for statistical reporting or research purposes.


Why NCHS Does Not Release Files With Lower Levels of Geography

Research suggests that in the case of personal surveys nine commonly collected variables result in the table below.


Why NCHS Does Not Release Files With Lower Levels of Geography (cont.)

Notes: A geopolitical area may be a county, city, town, or other place with well-defined boundaries.

In this case, identification refers to certainty identification.


How Does RDC Operate?

  • On-Site Access

  • Remote Access

  • Staff Assisted Analytical Session


User Procedures

To gain access to NCHS restricted data through either method, the user must:

  • Submit a research proposal.

    • An advisory and proposal review committee receives, reviews, and approves researcher proposals

    • Proposals are evaluated primarily on the confidentiality disclosure risk.

    • Scientific merit is not an evaluation criterion.

  • Sign an affidavit of confidentiality and promise not to use any method to attempt to identify respondents.


User Procedures (cont.)

  • Not take any materials or equipment into RDC unless approved by RDC staff.

  • Submit data files to be merged onto NCHS data ahead of time – all merging is done by RDC staff.

  • Subject all output and/or materials removed from the RDC to a disclosure limitation review.

  • May not remove any NCHS restricted data files nor linked data files.


Researcher Affidavit of Confidentiality

I certify that no confidential data or information viewed or otherwise obtained while I am a researcher in the National Center for Health Statistics (NCHS), Research Data Center (RDC) will be removed from NCHS. Further, I understand that NCHS will perform a disclosure review and must provide approval to me before I remove any data from the RDC, whether it be in electronic or paper form. I acknowledge NCHS Confidentiality Statute, 308(d) of the Public Health Service Act stated below and fully understand my legal obligations to NCHS to protect all confidential data. Further I understand any violation I may perform is punishable under 18 United States Code (USC), 1001 which carries a fine of up to $10,000 or up to 5 years in prison.


Researcher Affidavit of Confidentiality (cont.)

NCHS 308(d) Confidentiality Statute - No information, if an establishment or person supplying the information or described in it is identified, obtained in the course of activities undertaken or supported under section 304, 305, 306, 307, or 309 may be used for any purpose other than the purpose for which it was supplied unless such establishment or person has consented to its use for such other purpose and in the case of information obtained in the course of health statistical or epidemiological activities under section 304 or 306, such information may not be published or released in other form if the particular establishment or person supplying the information or described in it is identifiable unless such establishment or person has consented to its publication or release in other form.


Researcher Affidavit of Confidentiality (cont.)

18 United States Code, 1001 - Deliberately making a false statement in any matter within the jurisdiction of any Department or Agency of the Federal Government violates 18 USC 1001 and is punishable by a fine of up to $10,000 or up to 5 years in prison.

____________________    _______________
Researcher’s Signature  Date

____________________    _______________
NCHS Witness            Date


Can Researcher Merge His/Her Data with NCHS?

  • Must interact with RDC staff to ensure that their data can be merged with the NCHS data.

  • User-supplied data will be merged with NCHS data by RDC staff only.

  • The NCHS RDC policy states that merged and user-supplied data will not be made available for analysis to anyone without the written consent of the user.


The Cost per Project

On Site

$200 per day (2 day minimum)

Remote Access

  • NSFG-CDF = $500/ year

  • NHIS-polio = $500/ year

  • NHIS Linked Mort. File = $250/Month

  • NHANES Linked Mort. File = $250/Month


The Cost per Project (cont.)

  • Files <= 130k records = $500 per month

  • Files > 130k records = $1000 per month

Staff Assisted: Variable

File Construction and Setup

  • For Mortality Files = $250 per day

  • For all Other Files = $500 per day
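As a rough illustration of the fee schedule, the sketch below computes on-site and remote charges from the figures on these slides. The function names are made up for this example. It assumes the 2-day minimum means at least two days are always billed, and that the record-count rates apply to remote-access files other than the four named ones.

```python
ON_SITE_DAILY_RATE = 200      # dollars per day, from the slide
ON_SITE_MINIMUM_DAYS = 2      # "2 day minimum"
REMOTE_RECORD_CUTOFF = 130_000


def on_site_cost(days):
    """On-site charge in dollars, billing at least the 2-day minimum."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return ON_SITE_DAILY_RATE * max(days, ON_SITE_MINIMUM_DAYS)


def remote_cost(records, months):
    """Remote-access charge in dollars for a file of the given size."""
    if months < 1:
        raise ValueError("months must be at least 1")
    monthly = 500 if records <= REMOTE_RECORD_CUTOFF else 1000
    return monthly * months


one_day_visit = on_site_cost(1)                # billed as two days
three_months_small = remote_cost(100_000, 3)   # file at or under 130k records
```

Staff-assisted work is left out because its cost is listed as variable.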


Do Doctors Perform “Defensive Cesareans”?

Overview: This topic re-examined the issues of “defensive medicine” and state reforms designed to limit malpractice risk on the use of cesarean section delivery.

NCHS Data Used: National Hospital Discharge Survey (NHDS)

Years of Data Used: 1980 through 1992, inclusive.

User’s Data Merged with NCHS? Yes

Method of Access to NCHS Data: Remote and On-site Access

Statistical Software Used: SAS


Economic Model to Explain the Incidence of Sexual Activity, Contraceptive Use, STD, and Pregnancy Among Teenage Girls.

Overview: National Survey of Family Growth Data provide extensive socio-demographic information and reports of the sexual histories of these women. Researcher focused on the effects of a number of policies measured at the state-level. These included:

  • Parental notification of consent laws.

  • Medicaid funding of abortions.

  • Welfare generosity.

NCHS Data Used: National Survey of Family Growth (NSFG)

User’s Data Merged with NCHS? Yes

Method of Access to NCHS Data: Remote Access

Statistical Software Used: SAS


Nursing Home Admission and Payment Source?

Overview: This project tested if patients with Medicare were being discriminated against because their reimbursement rate was significantly below the private pay rate for nursing homes.

NCHS Data Used: National Nursing Home Survey (NNHS)

Years of Data Used: 1985, 1995, and 1997

User’s Data Merged with NCHS? No

Method of Access to NCHS Data: Remote Access

Statistical Software Used: SAS


Hardware and Software

All RDC hardware and software are standard.

Hardware

Pentium IV computers with Windows 2000

Software

SAS (only language on ANDRE)

Sudaan

Fortran

HLM

Stata

Limdep

text editors/viewers

  • Onsite workstations do NOT have email or internet access

  • Only access to printer is through RDC staff


U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES

Centers for Disease Control and Prevention

National Center for Health Statistics

Record Linkage for Epidemiologic Research: Accessing Linked data at the NCHS Research Data Center

Christine S. Cox

NCHS Data Users Conference

July 12, 2006


What is Record Linkage?

NCHS Surveys

Administrative records

Linked Data File


NCHS Linked Data: Major Activities

  • Mortality

    • National Death Index

  • Health Care Utilization and Costs

    • Medicare Data

  • Retirement and Disability

    • Social Security Data


NCHS Linked Data: Mortality

  • Eligibility status

  • Assigned vital status

  • Date of death

  • Age at death

  • Underlying and multiple causes of death

  • Adjusted sample weights


Research Potential of Linked Mortality Data

The Income-Associated Burden of Disease in the United States

P Muennig, P Franks, H Jia, E Lubetkin and MR Gold

Excess Deaths Associated with Underweight, Overweight, and Obesity

KM Flegal, BI Graubard, DF Williamson, MH Gail. JAMA. 2005;293:1861-1867.

Living and Dying in the USA: Behavioral, Health, and Social Differentials of Adult Mortality

RG Rogers, CB Nam, RA Hummer

A Semiparametric Analysis of the Body Mass Index’s Relationship to Mortality

JT Gronniger


NCHS Linked Data: Medicare

  • Medicare entitlement and health care utilization and payment data for 1991-2000

    • Denominator file

    • MEDPAR Inpatient hospitalization

    • MEDPAR Skilled nursing facility

    • Hospital outpatient

    • Home Health Care

    • Hospice

    • Carrier (physician/supplier Part B file)

    • Durable Medical Equipment


Research Potential of Linked Medicare Data

  • Examine risk factors for health conditions

  • Examine reliability of survey data

    • Examine survey report of disability with program participation eligibility criteria

    • Compare survey reported health conditions to claims records

  • Examine disparities in Medicare service utilization


NCHS Linked Data: Retirement/Disability

  • Social Security data from Retirement, Survivors, and Disability Insurance (RSDI) and Supplemental Security Income (SSI) programs

    • Master Beneficiary Record (MBR)

      • 1962-2003

    • Payment History Update System (PHUS)

      • 1984-2003

    • Supplemental Security Record (SSR)

      • 1974-2003


Research Potential of Linked Social Security Data

  • Examine reliability of survey information for SSA program participation and benefits

  • Compare the health characteristics of those who take early (age 62) Social Security benefits to those who postpone benefits

  • Policy analysis using validated survey data

    • Predicting the number of people who will become disabled based upon survey reported health conditions

    • Determining whether current disability entitlement funding levels will be adequate as the population ages


Summary NCHS Data Linkage

Columns: Mortality (NDI) | Medicare (CMS) | Retirement & Disability (SSA)

NHIS 1986-2000: X

NHIS 1994-1998: X X X

LSOA II: X X X

NHANES I: X X X

NHANES II: X X

NHANES III: X X X

NNHS 1985: X X


www.cdc.gov/nchs/r&d/nchs_datalinkage/data_linkage_activities.htm


Why can’t you just give me the data?

  • NCHS does not “own” the linked administrative data

  • NCHS data confidentiality rules prohibit the release of potentially identifiable data – special considerations concerning the protection of linked data

  • The RDC is the only option for access for now….


Overview: Data Access Procedures

  • Proposal Requirements

  • Access Methods

  • Helpful Tips

  • Where to get help?


Proposal Requirements

  • Proposal is evaluated by review committee

  • Review criteria

    • Scientific and technical feasibility

    • Availability of RDC resources

    • Disclosure risk for restricted information

    • The extent to which project is in accordance with the mission of NCHS

  • Special note: NCHS does not try to determine if proposals are duplicative


Proposal Requirements

  • Cover letter

  • Project title

  • Abstract (maximum 300 words summarizing project)

  • Full contact information

    • Institutional affiliation

    • Mail address, phone, email

  • Dates of proposed time at RDC (or indication of using remote access)

  • Source of funding for proposed research


Proposal Requirements

  • Study background

    • Key study questions or hypotheses

    • Public health benefits

  • Methods

    • Analytic approach and statistical methods

    • Statistical software requirements

  • Description of intended output for nondisclosure review, e.g.

    • Table shells

    • Model equations

    • Test statistics that researcher plans to remove from RDC


Proposal Requirements

  • Explanation of why restricted data are needed, e.g. describe why publicly available data are insufficient

  • Summary of data requirements to be included in analytic file

    • Identification of sample

    • Identification of variables

  • Description of additional data to be supplied by researcher to be merged with NCHS or other data source (must clearly identify source of other data)


Proposal Requirements: Appendices

  • Current Curriculum Vitae or resume for each investigator

  • Data dictionary – complete listing of specific data requested and its source(s) and indicate if public use or restricted access variables

    • specific files and years

    • sample

    • variables (dependent, independent, matching/linking)


Proposal Requirements: Appendices

  • For remote-access applicants

    • Description of the computer and email system to be used to receive output

    • Security provisions for the computer and email systems

  • For students

    • Letter from department chair or academic advisor stating that student is working under the direction of the department


Overview: RDC Data Access Procedures

  • Proposal Requirements

  • Access Methods

  • Helpful Tips

  • Where to get help?


Access Methods

  • Once approved, three methods to access restricted data

    • on-site - use local computing resources in the NCHS RDC, Hyattsville, MD

    • remote – submit programs electronically to be executed in the RDC with output returned by email

    • staff assisted – RDC staff provide on-site programming for off-site approved researchers

  • For all methods of access, restricted data files remain in RDC and output is inspected for disclosure violations


On-Site Access

  • RDC staff constructs necessary data files, including merged user data

  • Most statistical packages available with sufficient lead time

  • Output subject to disclosure review

  • Open only during normal working hours


Remote Access Method

  • RDC staff constructs necessary data files, including merged user data

  • SAS programs only (certain procedures and functions not allowed) – additional software options expected

  • Both submitted programs and output undergo a programmed disclosure limitation review


RDC Staff-assisted Programming Method

  • Subcontract with the RDC staff to perform programming tasks

  • Useful for those planning to use statistical software not available for the remote system and who are not able to travel to the RDC facility

  • Cost is estimated for each research project


Overview: RDC Data Access Procedures

  • Proposal Requirements

  • Access Methods

  • Helpful Tips

  • Where to get help?


RDC Helpful Tips

  • Be clear about research and data requirements (helps to determine feasibility of project)

    • Clearly identify the sample to be included in the analytic file

    • Provide data dictionaries for both

      • Public use data

      • Restricted data

    • Provide examples of expected output


Overview: RDC Data Access Procedures

  • Proposal Requirements

  • Access Methods

  • Helpful Tips

  • Where to get help?


Visit the RDC at: www.cdc.gov/nchs/r&d/rdc.htm or email: [email protected]


LINKED DATA, CONTEXTUAL DATA, and GEO-CODING

ON-SITE and STAFF-ASSISTED DATA ACCESS

Christopher Rogers

Research Data Center

[email protected]


Why Link Data Sets?

  • Improve modeling and make use of existing data.

  • Compensate for increased difficulties taking surveys.

  • Open your mind.

    Common Example:

    Economic variables versus Ethnic variables


Historical Trends

  • More linking of scientific data sets between government agencies. Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA).

  • Confused political and social situation in US.


Quality NCHS Resources

  • Linked Birth and Infant Death Data with Fetal Death Data.

  • Geo-coded NHIS 1986-2003 (2004-2005).

  • Geo-coded NHANES III.

  • Cycles 4, 5, and 6 NSFG Contextual Data.

  • Linked Data Sets described earlier.


Linked Birth and Infant Death

  • Designed to study factors in infant death.

  • Links birth and death certificates for deaths under one year of age. Includes fetal deaths for 1995-1997

  • Years: 1983-1991 and 1995-1997

  • Numerator File (for deceased children): Parental information and behavior, prenatal care, infant health variables, demographics, cause of death.

  • Denominator File (for control group): Parental information and behavior, prenatal health, infant health, demographics.

  • Fetal Death Data: 1995-1997

  • Restricted Data: County/City of mother’s residence or County of child’s birth or death when the population is under 250,000 (under 100,000 starting in 1989).


Data Example

  • From the Division of Vital Statistics. Proposals or questions can go either to the RDC or the DVS.

  • Fetal Death Data portion. Given 1989-1999.

  • Linked to county level contextual data.

  • Goal to model fetal death with emphasis on ground water quality. Estimates death rates for each county.


Geo-Coded NHIS

  • National Health Interview Survey. RDC has access to files from 1963 to present. Previously geo-coded households for 1986-1994. Recently geo-coded by RDC from 1995-2003. 2004-2005 coding in progress.

  • State (2 digits), County (3 digits), Tract (6 digits), Block Group (1 digit), and Block (3-4 digits) levels. Households coded to 1990 and 2000 Censuses.
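The digit widths listed above can be made concrete with a small parser. This is only a sketch: it assumes the state, county, tract, and block group codes are concatenated into one 12-digit string, which the slides do not say is how the RDC stores them, and it leaves out the block level. The tract and block group in the sample value are invented.

```python
from typing import NamedTuple


class GeoCode(NamedTuple):
    state: str        # 2 digits
    county: str       # 3 digits
    tract: str        # 6 digits
    block_group: str  # 1 digit


def parse_block_group_code(code: str) -> GeoCode:
    """Split a 12-digit state+county+tract+block-group string into parts."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    return GeoCode(
        state=code[0:2],
        county=code[2:5],
        tract=code[5:11],
        block_group=code[11],
    )


# Illustrative value only; the tract and block group are made up.
geo = parse_block_group_code("240338001021")
```

Keeping the parts as strings preserves leading zeros, which integer fields would drop.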


Geo-Coded NHANES III

  • NHANES III is also linked to NDI Mortality data.

  • NHANES III has been geo-coded twice. The RDC has done it at the same level of detail as NHIS.

  • Continuous NHANES has not been geo-coded yet.

  • Example: Large project with neighborhood, economic, ethnic, and individual medical and behavioral variables. Multi-level models.


NSFG Contextual Data

  • Contextual variables available with Cycles 4, 5, and 6. Supplied for each individual in sample.

  • Cycle 6: 1054 contextual variables at the state, county, tract, and block group levels. For respondent addresses in 2000 and 2002.

  • Contextual data include both economic and demographic characteristics of locations. Easily merged by case ID to individual characteristics, behaviors, and histories.
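A merge by case ID might look like the sketch below. The field names (`case_id`, `age`, `county_median_income`) are hypothetical and are not actual NSFG variable names. At the RDC the merge itself is performed by staff, so this only shows the shape of the operation.

```python
def merge_by_case_id(respondents, contextual, key="case_id"):
    """Left-join contextual rows onto respondent rows by case ID.

    Respondents with no contextual row are kept unchanged. On a name
    clash, the respondent's own value wins.
    """
    context_by_id = {row[key]: row for row in contextual}
    merged = []
    for person in respondents:
        ctx = context_by_id.get(person[key], {})
        merged.append({**ctx, **person})
    return merged


# Hypothetical rows for illustration.
respondents = [
    {"case_id": 101, "age": 24},
    {"case_id": 102, "age": 31},
]
contextual = [
    {"case_id": 101, "county_median_income": 41000},
]
merged = merge_by_case_id(respondents, contextual)
```

A left join keeps every respondent in the analytic file even when no contextual record exists.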


Simple NSFG Example

  • A simple example relating economics on state level, ethnicity, and behavior, but not using contextual variables.

  • Treatment States given waiver to offer more family planning services (FPS).

  • Questions:

    • FPS effects on behavior

    • FPS effect on pregnancy rates

    • Differential impacts across demographic subgroups?


Change of Topic: Accessing Data

  • On-site access to data at the RDC in Hyattsville.

  • Staff-assisted remote access to data via e-mail.

  • Researchers often use both types of access.

  • Potential Designated Agent status. (CIPSEA)

  • The RDC has put many resources into automated remote access.


On-Site Access

  • Rules in 24 page file GuidelinesRDC11-8-05.pdf available on-line.

  • The RDC and NCHS surveys have knowledgeable professional staffs that review proposals carefully. Clients can only remove what has been approved. Checked by staff.

  • Exploratory Data Analysis. If needed, ask. Recent example: Checking general shapes of variables for model validity. OKed by survey.

  • Modeling needs. Recent example: Nested randomized geo-codes.

  • Estimation problems. Example: Single PSU in a Stratum.
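The single-PSU problem in the last bullet arises because between-PSU variance within a stratum cannot be estimated from one PSU. A quick check for such strata is sketched below. The `(stratum, psu)` pair layout is an assumption made for illustration and is not the format of any NCHS file.

```python
from collections import defaultdict


def single_psu_strata(rows):
    """Return the sorted strata that contain exactly one distinct PSU.

    `rows` is an iterable of (stratum, psu) pairs, one per sampled record.
    """
    psus_by_stratum = defaultdict(set)
    for stratum, psu in rows:
        psus_by_stratum[stratum].add(psu)
    return sorted(s for s, psus in psus_by_stratum.items() if len(psus) == 1)


# Hypothetical design: stratum 2 has records from only one PSU.
design = [(1, "A"), (1, "B"), (2, "A"), (2, "A"), (3, "C"), (3, "D")]
lonely = single_psu_strata(design)
```

Running this on the analytic subsample matters, because subsetting a file can leave a stratum with one PSU even when the full design has two or more.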


Staff-Assisted Remote Access

  • Analysis done through a particular staff member. Usually efficient, but could be very busy.

  • Staff member determines costs based on time.

  • Staff usually not asked to do much programming.

  • Staff creates data, runs e-mailed programs, checks, and returns output to researcher.

  • Staff can do exploratory analysis, if needed.

  • Staff can help check modeling problems.

  • Commonly done after on-site visit.


Our Mission

  • The RDC has a professional staff dedicated to helping researchers uncover knowledge and advance understanding.


Remote Access System

Vijay Gambhir


Remote Access System

  • Envisioned as an integral Part of RDC

    • Pre – onsite usage

    • Post – onsite usage

  • Super store/ Convenience store


Basics of Remote Access System

  • Object oriented, event driven system based upon the principles of distributed computing

  • About two years of development efforts

  • Set of applications called in service by resident component

  • Advanced pattern recognition techniques


Analytic Data Research by Email (ANDRE)

  • NCHS has been providing remote data access to researchers through ANDRE since April 1998.

  • In the past five years, ANDRE has served 45 different data analysts and executed over 9,500 SAS programs for their research programs.


Main Features of ANDRE

  • Completely automated system

  • Operates round the clock without any human intervention

  • Registered subscribers only

    • Proposals already reviewed and approved

    • Have an agreement with NCHS/RDC

  • Unlimited Access during the subscription period


Data Requests

  • Registered user can submit data requests by email from anywhere and at any time.

  • Results of the data request released to a specified email address that has been certified as secure by the subscriber and approved by NCHS/RDC.


Authentication

  • Multi-levels of system security:

    • Submission syntax

    • User id

    • Password

    • Email/code word

    • Package

    • Path info


Data Request Analysis

  • Compliance with the disclosure limitation constraints of NCHS

  • Integrity of the system

    • Resource constraints (CPU time & Storage requirements)

    • Protection of ANDRE’s work environment


Prevention of Direct Disclosure

  • Cleaning up of the Log File

  • Categorization of SAS commands/words

    • Forbidden Commands

    • Modifications to the Commands

    • Output suppression


Sample: Original Log

1 options nocenter;

2 Data one;

3 Infile 'd:\nchs\respnd95.dat' lrecl=13064;

4 Input

5 TODAYSPG 6847-6847

6 CONSTAT1 11934-11935

7 CONSTAT2 11936-11937

8 CONSTAT3 11938-11939

9 CONSTAT4 11940-11941

10 SEX1MTHD 11945-11946

11 POST_WT 12350-12359;

12 if constat1 = 'ab' then vjvar=1; else vjvar = 2;

13 WGT1000=POST_WT/1000;

14 title 'NSFG cycle 1995';

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).

12:15

NOTE: The infile 'd:\nchs\respnd95.dat' is:

File Name=d:\nchs\respnd95.dat,

RECFM=V,LRECL=13064

NOTE: Invalid numeric data, 'ab' , at line 12 column 15.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

1 1000000111260837511521 1 1050 12 106921124112411189

101 2

201 19211059110611197

……


Sample: Original Log (cont.)

……

12901 11232521101 05267213103033921811931011103 01030000000321120000392702210611511200403 1344 1316

13001 622501001006034

TODAYSPG=1 CONSTAT1=5 CONSTAT2=88 CONSTAT3=88 CONSTAT4=88 SEX1MTHD=1 POST_WT=2545.7569 vjvar=2 WGT1000=2.5457569 _ERROR_=1

_N_=20

NOTE: 10847 records were read from the infile 'd:\nchs\respnd95.dat'.

The minimum record length was 13064.

The maximum record length was 13064.

NOTE: The data set WORK.ONE has 10847 observations and 9 variables.

NOTE: DATA statement used:

real time 39.88 seconds

cpu time 12.10 seconds

15 proc freq;

16 tables CONSTAT1 vjvar;

17 run;

NOTE: There were 10847 observations read from the data set WORK.ONE.

NOTE: PROCEDURE FREQ used:

real time 0.49 seconds

cpu time 0.04 seconds


Sample: Cleaned Log

1 options nocenter;

2 Data one;

3 Infile 'd:\nchs\respnd95.dat' lrecl=13064;

4 Input

5 TODAYSPG 6847-6847

6 CONSTAT1 11934-11935

7 CONSTAT2 11936-11937

8 CONSTAT3 11938-11939

9 CONSTAT4 11940-11941

10 SEX1MTHD 11945-11946

11 POST_WT 12350-12359;

12 if constat1 = 'ab' then vjvar=1; else vjvar = 2;

13 WGT1000=POST_WT/1000;

14 title 'NSFG cycle 1995';

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).

12:15

NOTE: The infile 'd:\nchs\respnd95.dat' is:

File Name=d:\nchs\respnd95.dat,

RECFM=V,LRECL=13064

NOTE: Invalid numeric data, 'ab' , at line 12 column 15.


Sample: Cleaned Log (cont.)

NOTE: 10847 records were read from the infile 'd:\nchs\respnd95.dat'.

The minimum record length was 13064.

The maximum record length was 13064.

NOTE: The data set WORK.ONE has 10847 observations and 9 variables.

NOTE: DATA statement used:

real time 39.88 seconds

cpu time 12.10 seconds

15 proc freq;

16 tables CONSTAT1 vjvar;

17 run;

NOTE: There were 10847 observations read from the data set WORK.ONE.

NOTE: PROCEDURE FREQ used:

real time 0.49 seconds

cpu time 0.04 seconds
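
Comparing the original and cleaned logs, the cleaned version drops the block that follows the invalid-data NOTE: the RULE: ruler, the raw input records, and the variable dump ending at _N_=. A minimal Python sketch of that filtering is given here. The slides do not describe how ANDRE does this; the function name `clean_log` and the start and end markers are assumptions inferred from the two samples, and the sample lines are abridged.

```python
def clean_log(lines):
    """Drop raw-record dumps from a SAS log (hypothetical sketch).

    Assumption: a dump starts at a line beginning with 'RULE:' and ends
    with the line containing '_N_=' (inclusive), as in the sample logs.
    If no end marker is found, the rest of the log is withheld.
    """
    cleaned = []
    in_dump = False
    for line in lines:
        text = line.strip()
        if not in_dump and text.startswith("RULE:"):
            in_dump = True          # ruler line opens the record dump
            continue
        if in_dump:
            if "_N_=" in text:
                in_dump = False     # last line of the dump; drop it too
            continue
        cleaned.append(line)
    return cleaned


sample = [
    "NOTE: Invalid numeric data, 'ab' , at line 12 column 15.",
    "RULE: ----+----1----+----2",
    "1 1000000111260837511521",
    "TODAYSPG=1 CONSTAT1=5 vjvar=2 _ERROR_=1",
    "_N_=20",
    "NOTE: 10847 records were read from the infile 'd:\\nchs\\respnd95.dat'.",
]
for line in clean_log(sample):
    print(line)
```

Only the two NOTE lines survive. Withholding everything after an unterminated dump is a deliberately conservative choice for this sketch.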


Forbidden Commands

  • Commands That Pose Unacceptable Disclosure Risks

    OR

  • Disallowed to Protect Integrity/Internal Environment of ANDRE

    Add firstobs report iml

    Print first. Pctn nofreq

    Obs last. Pctsum nocum

    Firstobs nocol tabulate editor

    Browse summary list put
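
Screening a submitted program against this word list could look roughly like the Python sketch below. The word list is copied from the slide; the tokenizing rule, the prefix handling of `first.` and `last.`, and the name `find_forbidden` are assumptions, not a description of ANDRE's scanner.

```python
import re

# Words copied from the slide; matching is case-insensitive.
FORBIDDEN = {
    "add", "print", "obs", "firstobs", "browse", "report", "pctn",
    "pctsum", "tabulate", "summary", "nocol", "list", "iml", "nofreq",
    "nocum", "editor", "put",
}
# 'first.' and 'last.' are BY-group prefixes (e.g. first.id), so they are
# matched as prefixes rather than as whole words.
FORBIDDEN_PREFIXES = ("first.", "last.")


def find_forbidden(program):
    """Return the sorted list of forbidden words found in a SAS program.

    Hypothetical sketch: comments and quoted strings are not skipped, so
    it can over-report; a real scanner would need a SAS-aware lexer.
    """
    hits = set()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", program.lower()):
        if token in FORBIDDEN:
            hits.add(token)
        elif token.startswith(FORBIDDEN_PREFIXES):
            hits.add(token.split(".")[0] + ".")
    return sorted(hits)


program = """
data one; set two (obs=10);
  if first.id then put id=;
run;
proc freq; tables constat1; run;
"""
print(find_forbidden(program))  # prints ['first.', 'obs', 'put']
```

A request with a non-empty result would be rejected before SAS is ever run.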


Commands Modification

  • The user’s program is modified to enforce restrictions on the options allowed with certain SAS procedures, so that objectionable information does not appear in the output. For example:

    PROC MEANS n mean std;
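
One way to picture this rewrite is the Python sketch below, which forces every PROC MEANS statement to request only n, mean, and std. The slide shows only the resulting statement; which options survive (here, anything of the form name=value with no embedded spaces) and the name `restrict_proc_means` are assumptions.

```python
import re

ALLOWED_STATS = ["n", "mean", "std"]  # from the slide: PROC MEANS n mean std;


def restrict_proc_means(program):
    """Rewrite every PROC MEANS statement to request only n, mean, std.

    Hypothetical sketch: options written as name=value (e.g. data=one)
    are kept; every bare keyword (min, max, range, ...) is dropped.
    """
    def rewrite(match):
        options = match.group(1).split()
        kept = [opt for opt in options if "=" in opt]
        return " ".join(["proc means"] + kept + ALLOWED_STATS) + ";"

    pattern = re.compile(r"proc\s+means\b([^;]*);", re.IGNORECASE)
    return pattern.sub(rewrite, program)


print(restrict_proc_means("PROC MEANS data=one min max mean; var wgt1000; run;"))
# prints: proc means data=one n mean std; var wgt1000; run;
```

Statistics such as min and max are dropped because they expose individual extreme values.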


Output Suppression

  • Extreme values are removed from the output of Proc Univariate.

  • A complete output line (Proc Means, Corr, Univariate, etc.) is suppressed where the sample size is less than the minimum acceptable value.


Proc Means Suppression

The MEANS Procedure

Variable  Label                                        N          Mean       Std Dev
------------------------------------------------------------------------------------
EXPEND_R  Current expend/pupil in public schl/1000  5424     5.0830820     1.3958710
*** Values Suppressed ***
RPUB87    exp. for contr. serv. and supplies 1997$  5424   23472052.60   18806802.86
RPUB92    exp. for contr. serv. and supplies 1997$  5424   34800922.98   30481634.59
PRGPRO    Coordinated Pregnancy Prevention Program  1708     0.0679157     0.2516749
HIVED     HIV/AIDS Education                        1708     3.5146370     0.8044378
*** Values Suppressed ***
PRGPRO87  Coordinated Pregnancy Prevention Program  5424     0.0540192     0.2260764
HIVED87   HIV/AIDS Education                        5424     3.4968658     0.8008324
WT_PER15  % Wt females aged 15-19/total 15-19       5424     0.7279681     0.1265796
BK_PER15  % Bk females aged 15-19/total 15-19       5424     0.1409869     0.0932332
HS_PER15  % Hs females aged 15-19/total 15-19       5424     0.0962413     0.1055191
TEENMMC2  Teenmom by cohort (1,2,3r)                1201     1.7119067     0.7715351
C18_2_1S  R in C2 (vs 1) at 18-19 endpt (1,2)       1770     1.5248588     0.4995228
TM2_1S18  R tnmm in Coh 2 (vs 1)-age 18 @ ext        358     1.4804469     0.5003168
AGE_12    Date R = 12 in century months             6450   979.5613953    69.3124265
STRTST    IA5 Date R started living in current sta  3870       1132.55   753.2066507
BDAYCENM  R date of birth                           6450   835.5613953    69.3124265
RAVPAY95  real av. an. pay 95 dollars               5424      26933.93       2826.80
PERCAFDC  percent of households receiving AFDC      5424     0.0422254     0.0127307
SALARY    teacher salaries real 96-97$$$            5424      35338.66       5729.11
------------------------------------------------------------------------------------
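
The line-level suppression shown in this output can be sketched in Python as follows. The slides do not state the minimum acceptable sample size, so `MIN_N = 5` is a placeholder, and the row `HYPO_VAR` is made up for illustration; the other two rows are taken from the slide.

```python
MIN_N = 5  # hypothetical threshold; the slides do not give NCHS's minimum


def suppress_small_n(rows, min_n=MIN_N):
    """Replace each row whose sample size is below min_n with a marker line."""
    out = []
    for name, n, mean, std in rows:
        if n < min_n:
            out.append("*** Values Suppressed ***")
        else:
            out.append(f"{name:<10}{n:>6}{mean:>14.7f}{std:>14.7f}")
    return out


rows = [
    ("EXPEND_R", 5424, 5.0830820, 1.3958710),  # from the slide
    ("HYPO_VAR", 3, 0.5, 0.1),                 # made-up row with a small N
    ("PRGPRO", 1708, 0.0679157, 0.2516749),    # from the slide
]
for line in suppress_small_n(rows):
    print(line)
```

The variable name is withheld along with the statistics, matching the marker lines in the sample output.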


Proc Univariate Output: Unsuppressed

The SAS System 9

14:09 Sunday, October 24, 1999

Univariate Procedure

Variable=AVHRATET

Moments Quantiles(Def=5)

N 2283 Sum Wgts 2283 100% Max -0.25314 99% -1.62008

Mean -4.66219 Sum -10643.8 75% Q3 -3.56179 95% -2.37588

Std Dev 1.892017 Variance 3.57973 50% Med -4.50491 90% -2.79152

Skewness -2.11919 Kurtosis 6.892929 25% Q1 -5.30374 10% -6.07639

USS 57792.36 CSS 8168.944 0% Min -13.5463 5% -7.19645

CV -40.5821 Std Mean 0.039598 1% -12.7402

T:Mean=0 -117.738 Pr>|T| 0.0001 Range 13.29321

Num ^= 0 2283 Num > 0 0 Q3-Q1 1.741949

M(Sign) -1141.5 Pr>=|M| 0.0001 Mode -13.5463

Sgn Rank -1303593 Pr>=|S| 0.0001

Extremes

Lowest Obs Highest Obs

-13.5463( 1547) -0.90519( 649)

-13.5397( 1836) -0.81756( 1094)

-13.4637( 2084) -0.76928( 1739)

-13.4413( 1127) -0.5907( 21)

-13.4402( 1088) -0.25314( 400)


Proc Univariate Output: Suppressed

The SAS System 9

14:09 Sunday, October 24, 1999

Univariate Procedure

Variable=AVHRATET

Moments Quantiles(Def=5)

N 2283 Sum Wgts 2283 100% Max -0.25314 99% -1.62008

Mean -4.66219 Sum -10643.8 75% Q3 -3.56179 95% -2.37588

Std Dev 1.892017 Variance 3.57973 50% Med -4.50491 90% -2.79152

Skewness -2.11919 Kurtosis 6.892929 25% Q1 -5.30374 10% -6.07639

USS 57792.36 CSS 8168.944 0% Min -13.5463 5% -7.19645

CV -40.5821 Std Mean 0.039598 1% -12.7402

T:Mean=0 -117.738 Pr>|T| 0.0001 Range 13.29321

Num ^= 0 2283 Num > 0 0 Q3-Q1 1.741949

M(Sign) -1141.5 Pr>=|M| 0.0001 Mode -13.5463

Sgn Rank -1303593 Pr>=|S| 0.0001


Proc Univariate Output: Suppressed (sample size = 1)

Univariate Procedure

Variable=FREQ (sum) freq

Moments Quantiles(Def=5)

Serious Disclosure limitation Violations

Values too low to release

Output of Proc Univariate withheld


Proc Freq Suppression (One-Way Tables)

  • Suppress at least two consecutive rows to prevent derivation of suppressed values from cumulative totals.

  • Disallow single row output.
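
A single suppressed row could be recovered by subtracting the cumulative frequencies on either side of it, which is why suppressed rows come in runs of at least two. The Python sketch below shows one way to enforce that. The threshold of 5, the choice of which neighbour to add, and the name `suppress_one_way` are assumptions, not ANDRE's documented rule.

```python
def suppress_one_way(freqs, min_count=5):
    """Return a list of booleans: True where the row must be suppressed.

    Hypothetical sketch. Rows with a count below min_count are suppressed;
    any suppressed row with no suppressed neighbour gets one neighbour
    suppressed too (the following row when there is one, else the
    preceding row), so every suppressed run covers at least two rows.
    """
    if len(freqs) < 2:
        raise ValueError("single-row tables are not released")
    hide = [f < min_count for f in freqs]
    for i in range(len(freqs)):
        if not hide[i]:
            continue
        has_prev = i > 0 and hide[i - 1]
        has_next = i + 1 < len(freqs) and hide[i + 1]
        if not (has_prev or has_next):
            partner = i + 1 if i + 1 < len(freqs) else i - 1
            hide[partner] = True
    return hide


counts = [3, 5, 5, 2, 6, 6, 18]  # made-up frequencies
for count, hidden in zip(counts, suppress_one_way(counts)):
    print("?????" if hidden else count)
```

With a run of two or more, the cumulative column reveals only the sum of the hidden rows, not each count.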


One-Way Freq Table: Suppressed

Cumulative Cumulative

LOGRNTOPAT Frequency Percent Frequency Percent

-----------------------------------------------------------------

0.2277839309 ????? ????? ????? ?????

0.2277839309 ????? ????? ????? ?????

0.2305236586 5 0.08 6429 97.99

0.231111721 5 0.08 6434 98.06

0.232058915 ????? ????? ????? ?????

0.232058915 ????? ????? ????? ?????

0.2436220827 ????? ????? ????? ?????

0.2436220827 ????? ????? ????? ?????

0.2498117984 6 0.09 6456 98.40

0.2504106777 6 0.09 6462 98.49

0.2513144283 18 0.27 6480 98.77

0.2595111955 6 0.09 6486 98.86

0.2670627852 ????? ????? ????? ?????

0.2670627852 ????? ????? ????? ?????

0.2736958305 5 0.08 6500 99.07

0.2814124594 5 0.08 6505 99.15

0.3022808719 6 0.09 6511 99.24

0.3364722366 10 0.15 6521 99.39


One-Way Freq Table: Suppressed (cont.)

Cumulative Cumulative

LOGRNTOPAT Frequency Percent Frequency Percent

-----------------------------------------------------------------

0.3403258059 ????? ????? ????? ?????

0.3403258059 ????? ????? ????? ?????

0.3715635564 6 0.09 6537 99.63

0.3856624808 ????? ????? ????? ?????

0.3856624808 ????? ????? ????? ?????

0.6931471806 6 0.09 6550 99.83

1.2527629685 ????? ????? ????? ?????

1.2527629685 ????? ????? ????? ?????

1.2527629685 ????? ????? ????? ?????


Proc Freq Suppression (Two-Way Tables)

  • Row and column totals are preserved

  • Cells with values less than the acceptable minimum are suppressed

  • Additional cells are suppressed so that no row and no column has only a single suppressed cell.

  • Logical stitching of horizontal and vertical splits.
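
Because row and column totals are published, a lone suppressed cell equals its marginal total minus the visible cells, so extra cells have to be suppressed alongside it. The Python sketch below applies one simple greedy rule to the whole table, before it is split into panels for printing. The threshold, the rule of picking the smallest visible cell, and the name `suppress_two_way` are assumptions; the sample output in these slides suggests ANDRE chooses its additional cells differently.

```python
def suppress_two_way(table, min_count=5):
    """Return a same-shaped grid of booleans marking suppressed cells.

    Hypothetical sketch. Cells below min_count are suppressed first.
    Then, while any row or column holds exactly one suppressed cell,
    the smallest visible cell in that row or column is suppressed too.
    """
    n_rows, n_cols = len(table), len(table[0])
    hide = [[cell < min_count for cell in row] for row in table]

    def lines():
        # Each row and each column as a list of (r, c) coordinates.
        for r in range(n_rows):
            yield [(r, c) for c in range(n_cols)]
        for c in range(n_cols):
            yield [(r, c) for r in range(n_rows)]

    changed = True
    while changed:
        changed = False
        for line in lines():
            hidden = [rc for rc in line if hide[rc[0]][rc[1]]]
            visible = [rc for rc in line if not hide[rc[0]][rc[1]]]
            if len(hidden) == 1 and visible:
                r, c = min(visible, key=lambda rc: table[rc[0]][rc[1]])
                hide[r][c] = True
                changed = True
    return hide


table = [           # made-up counts, not the FAMREL data
    [94, 388, 792],
    [1, 9, 22],
    [7, 6, 10],
]
for row, flags in zip(table, suppress_two_way(table)):
    print(["??????" if f else v for v, f in zip(row, flags)])
```

The loop ends because every pass that changes anything hides one more cell, and the number of cells is finite.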


Proc Freq: Two-Way Tables Suppression

TABLE OF FAMREL BY FAMSIZER

FAMREL FAMSIZER

Frequency|

Percent |

Row Pct |

Col Pct | 2| 3| 4| 5| Total

---------+--------+--------+--------+--------+

3 | 94 | 388 | 792 | 533 | 2206

| 3.97 | 16.40 | 33.47 | 22.53 | 93.24

| 4.26 | 17.59 | 35.90 | 24.16 |

| 98.95 | 96.28 | 96.12 | 94.34 |

---------+--------+--------+--------+--------+

4 | ?????? | 9 | 22 | 27 | 104

| ?????? | 0.38 | 0.93 | 1.14 | 4.40

| ?????? | 8.65 | 21.15 | 25.96 |

| ?????? | 2.23 | 2.67 | 4.78 |

---------+--------+--------+--------+--------+

6 | ?????? | 6 | 10 | 5 | 56

| ?????? | 0.25 | 0.42 | 0.21 | 2.37

| ?????? | 10.71 | 17.86 | 8.93 |

| ?????? | 1.49 | 1.21 | 0.88 |

---------+--------+--------+--------+--------+

Total 95 403 824 565 2366

4.02 17.03 34.83 23.88 100.00

(Continued)


Proc Freq: Two-Way Tables Suppression (cont.)

checking frequencies 4

12:01 Thursday, May 6, 1999

TABLE OF FAMREL BY FAMSIZER

FAMREL FAMSIZER

Frequency|

Percent |

Row Pct |

Col Pct | 6| 7| 8| 9| Total

---------+--------+--------+--------+--------+

3 | 209 | 98 | 19 | 73 | 2206

| 8.83 | 4.14 | 0.80 | 3.09 | 93.24

| 9.47 | 4.44 | 0.86 | 3.31 |

| 90.48 | 83.05 | 59.38 | 74.49 |

---------+--------+--------+--------+--------+

4 | 13 | 10 | ?????? | 12 | 104

| 0.55 | 0.42 | ?????? | 0.51 | 4.40

| 12.50 | 9.62 | ?????? | 11.54 |

| 5.63 | 8.47 | ?????? | 12.24 |

---------+--------+--------+--------+--------+

6 | 9 | 10 | ?????? | 13 | 56

| 0.38 | 0.42 | ?????? | 0.55 | 2.37

| 16.07 | 17.86 | ?????? | 23.21 |

| 3.90 | 8.47 | ?????? | 13.27 |

---------+--------+--------+--------+--------+

Total 231 118 32 98 2366

9.76 4.99 1.35 4.14 100.00


Fully Automated and Expert System?

  • Fully automated?

    • Reboots are needed to deal with memory leaks.

  • Confidentiality Expert? How reliable?

    • Only as good as the underlying algorithms; needs constant monitoring.


What Is New?

  • Improved and expanded hardware platform

  • Two machines dedicated to heavy remote access usage

  • Three additional machines dedicated to general remote access usage


What Is New? (cont.)

  • SUDAAN is now available to remote access users:

  • Proc Crosstab

  • Proc Rlogist

  • Proc Regress

  • Proc Multilog

  • Proc Survival


What Is New? (cont.)

  • Proc Descript

  • Other new SUDAAN procedures will be made available shortly

  • Plans to make Stata available through remote access


What Is New? (cont.)

  • Web Component of ANDRE under construction.

  • On-line scanning of users’ code

  • Valuable research tools and information readily available to the users.


Contact Information

For General Questions/Comments:

Email: [email protected] Phone: (301) 458-4732

For On-site Info:

Email: [email protected] Phone: (301) 458-4097

For Remote Access Info:

Email: [email protected] Phone: (301) 458-4226

