Using secondary data analysis for outcomes research
Sponsored Links
This presentation is the property of its rightful owner.
1 / 48

Using Secondary Data Analysis for Outcomes Research PowerPoint PPT Presentation

  • Uploaded on
  • Presentation posted in: General

Using Secondary Data Analysis for Outcomes Research. Epi 211 April 2010 Michael Steinman, MD. Disclosures and acknowledgements. Disclosures: None Acknowledgements: J. Michael McWilliams Ann Nattinger SGIM Research Committee. Question:.

Download Presentation

Using Secondary Data Analysis for Outcomes Research

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Using Secondary Data Analysis for Outcomes Research

Epi 211

April 2010

Michael Steinman, MD

Disclosures and acknowledgements




J. Michael McWilliams

Ann Nattinger

SGIM Research Committee


  • You are a fellow / junior faculty member interested in studying...

    • Impact of nurse-led HTN clinics on clinical outcomes in patients with HTN

    • Impact of implementing EMRs on appropriate prescribing in ambulatory surgical patients

    • Whether quality measures of asthma control in children correlate with actual clinical outcomes in this population


  • Here’s your choice:

    • A. Get a multimillion dollar grant to conduct a multi-center, multi-year RCT

    • B. Analyze existing data

Learning objectives

  • Appreciate key conceptual and methodologic issues involved in outcomes research employing secondary data analysis

  • Identify and use online tools for locating and learning about datasets relevant to your research

  • Understand the range of resources and support required to successfully complete a secondary data analysis


  • Working with secondary data

    • Conceptual and methodologic issues

  • Overview of high-value datasets and web-based resources

  • Planning your dataset project

    • Practical advice

  • Q&A

Working with Secondary Data

(My) Definition of Secondary Data

Data that have been collected

but not for you

Types of Secondary Data

  • Survey

  • Administrative (claims)

  • Discharge

  • Medical chart / EMR

  • Disease registries

  • Aggregate (ARF, US Census)

  • Combinations and linkages

What kinds of research can be conducted with secondary data?

Anything but randomized trials

  • By discipline:

    • Outcomes research

    • Epidemiology

    • Health services research

  • By question:

    • Descriptive

    • Comparative

    • Causal

Conceiving a Project

  • Which comes first: question or dataset?

    • Research question first

    • Dataset first

    • Trick question: both

  • Hybrid approach

    • Identify research focus, broad question

    • Consider candidate datasets

    • Hone question

    • Iterate between 2 and 3

What makes a good research question? (FINER)

  • Feasible—data, variables, & resources accessible & available

  • Interesting—to researcher and audience

  • Novel—extends what is already known

  • Ethical—upholds standards

  • Relevant—to patient care, clinical outcomes, policy, etc.

Cummings et al. Conceiving the research question. In: Hulley SB, Cummings SR, Browner WS, et al, eds. Designing clinical research, 3rd ed. Philadelphia: Lippincott Williams & Wilkins, 2007:17-26

Selecting a Database

  • Compatibility with research question(s)

  • Availability and expense

  • Sample: representativeness, power

  • Measures of interest present and valid

  • Messiness and missingness

  • Local expertise

  • Linkages

Key elements for outcomes research

  • Measure of intervention

  • Measure of outcome(s)

    • Intermediate outcomes

      • % of patients receiving treatment, measures of glycemic or lipid control (A1c, serum LDL)

    • Clinical outcomes

      • Death, hospitalization, satisfaction, etc.

  • Measures of important confounders

Advantages of Secondary Data

  • They are not primary data!

    • Efficiency: fast and cheap

    • No regrets

  • Scale and scope

    • Size and detail not otherwise feasible for individual research team

    • Generalizable

  • Novel and creative research questions

  • Often easier IRB review process

Challenges and Pitfalls

  • Data mining/overfitting

    • When the analysis precedes the question

    • Does urine cortisol predict Catholicism?

  • Causal inference

    • Inherently limited with observational data

    • But does not preclude quasi-experimental designs to recover causal effects

Challenges and Pitfalls

  • Validity of measures

    • Beware of assumptions

    • Problems: coding, reporting, recall biases

    • Solutions: direct validation in subgroup or another data source, literature review, sensitivity analyses

  • Complexity of file structure

    • Row in dataset may not be unit of analysis

    • Skip patterns, proxy respondents

What You Want and What You Have

  • Want to measure time preferences

    • Behavioral economics: people tend to overvalue the present

    • Explanation for unhealthy habits, underuse of cancer screening?

  • Have measures on financial planning horizons

  • Are the two equivalent?

  • Might financial planning also depend on:

    • Income

    • Source of income, employment status

    • Dependents

    • Inheritance

A Simple Question?

Ask: IF ((piRTab1X007AFinFam = FAMILYR) OR (piRTab1X007AFinFam = FINANCIAL_FAMILYR)) AND ((ACTIVELANGUAGE <> EXTENG) AND (ACTIVELANGUAGE <> EXTSPN)) AND (piInitA106_NumContactKids > 0) AND (piInitA100_NumNRKids > 0)


Section: E Level: Household Type: Numeric Width: 1 Decimals: 0

CAI Reference: SecE.KidStatus.E012_

2000 Link: G19802002 Link: HE012


[Do any of your children who do not live with you/Does CHILD NAME] live within 10 miles of you (in R's NURSING HOME CITY, STATE (CS25b/A067))?


[Do any of your children (who do not live with you)/Does CHILD NAME] live within 10 miles of you (in MAIN RESIDENCE [CITY/CITY, STATE STATE])?

6802 1. YES

4720 5. NO

32 8.DK (Don't Know); NA (Not Ascertained)

4 9. RF (Refused)

2087 Blank. INAP (Inapplicable)

* From the Health and Retirement Study

Challenges and Pitfalls

  • Representativeness of Sample

    • External validity (generalizability)

    • Internal validity (selection bias)

    • Example: comparing outcomes for insured and uninsured patients using hospital discharge data

      • Must be hospitalized to enter sample

      • Not only limits generalizability (to outpatients)

      • But inferences about the sample may be wrong

        • Sample would need to include uninsured who would have been hospitalized if insured

Statistical Considerations:Missing Data

  • Sources

    • Non-response: unit and item

    • Variability in data collection (e.g. across states or over time, collected on subset due to expense)

    • Incomplete linkages

  • Language

    • MCAR: M╨Y  strong assumption, can ignore

    • MAR: M╨Y|X  weaker assumption, can fix

    • Non-ignorable, informative: M predicts Y  can’t fix

Statistical Considerations:Missing Data

  • Approaches

    • Listwise deletion, complete case (ok if MCAR)

    • Imputation

      • Mean imputation (biased standard errors)

      • Multiple imputation (MAR)

    • Weighting techniques (MAR)

    • Random effects models (MAR)

Statistical Considerations:Analyzing Survey Data

  • Complex survey designs

  • Example multistage probability sample (NAMCS):

    US divided into PSUs (counties / MSAs)  sample of PSUs selected  within each PSU, stratify MDs by specialty  sample of MDs within each stratum  quasi-random sample of patients seen by each MD

  • Survey design

    • Clustering: convenience, ↓precision

    • Stratification: ↑precision , ↑representativeness (protects against a bad sample)

    • Oversampling: ↑representation and precision for subgroup of interest

Statistical Considerations:Analyzing Survey Data

  • Survey weights: affect point estimates

    • Individuals may have unequal selection probabilities

    • Need to apply weights to recover representativeness

    • W = 1/p(selection) = # people represented

    • W’s reflect sampling design, adjustments to match to census totals, non-response

  • Survey strata, clusters: affect se’s

    • Need variance estimators that account for correlated data

  • Most statistical packages able to handle

Finding the Right Dataset

Finding the Right Dataset

  • Contain variables of interest

    • predictor, outcome, confounders

  • Relevant time frame

    • Cross-sectional, longitudinal

  • Feasible

    • Access: time, bureaucracy, cost

    • Usable

  • No perfect datasets -> hybrid approach of developing research question

Administrative Data (VA)

  • VA has multiple high-value administrative databases

    • Outpatient visit information

      • Visit date, type of clinic, provider, ICD9 diagnoses

    • Inpatient information

      • Admitting dx(s), discharge dx(s), CPT codes, bed section, meds administered

    • Lab data

      • >40 labs

    • Pharmacy data

      • All inpatient and outpatient fills

    • Academic affiliation

    • etc

Administrative Data (VA)

  • Huge bureaucracy and paperwork

Administrative Data (VA)

  • Messy data

  • Huge size

    • 2 TB server

  • Data analyst

Survey Data (NHANES)

  • National Health and Nutrition Examination Survey (NHANES)

    • Nationally representative sample of >10K patients every 2 years

    • Extensive interview data on clinical history (including diseases, behaviors, psychosocial parameters, etc.)

    • Physical exam information (e.g. VS)

    • Labs, biomarkers

Survey Data (NHANES)

  • Free and easy to download

  • (Relatively) easy to use

    • Although requires careful reading of documentation

  • Serial cross-sectional

  • Disease data self-report

  • Very limited information about providers and systems of care

Survey Data (NAMCS)

  • National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS)

  • Nationally representative sample of ~70K outpatient and ED visits per year

  • Physician-completed form about office visit

Survey Data (NAMCS)

  • Data more from physician perspective (diagnoses, treatments Rx’ed, etc) and some info on providers (e.g., clinic organization, use of EMRs, etc)

  • Serial cross-sectional

    • Visit-focused

    • Not comprehensive, ? value for chronic diseases

Discharge Data (NIS)

  • National Inpatient Sample (NIS)

    • Database of inpatient hospital stays collected from ~20% of US community hospitals by AHRQ

    • Diagnoses and procedures, severity adjustment elements, payment source, hospital organizational characteristics

    • Hospital and county identifiers that allow linkage to the American Hospital Association Annual Survey and Area Resource File

Discharge Data (NIS)

  • Relatively easy to access (DUA, $200/yr)

  • Relatively easy to use

    • Though need close attention to documentation

  • Limited data elements

  • Huge data files

Web-Based Resources

  • Society of General Internal Medicine (SGIM) Research Dataset Compendium


  • UCSF K-12 Data Resource Center


  • Partners in Information Access for the Public Health Workforce


Finding Additional Resources

  • National Information Center on Health Services Research and Health Care Technology (NICHSR)

  • Inter-University Consortium for Political and Social Research (ICPSR)

  • Partners in Information Access for the Public Health Workforce

  • Roadmap K-12 Data Resource Center (UCSF)

  • List of datasets from the American Sociologic Association

  • Canadian Research Data Centers – Data Sets and Research Tools (Canada)

  • Directory of Health and Human Services Data Resources

  • Publicly Available Databases from National Institute on Aging (NIA)

  • Publicly Available Databases from National Heart, Lung, & Blood Institute (NHLBI)

  • National Center for Health Statistics (NCHS) Data Warehouse

  • Medicare Research Data Assistance Center (RESDAC); and Centers for Medicare and Medicaid Services (CMS) Research, Statistics, Data & Systems

  • Veterans Affairs (VA) data

(all available at

National Information Center on Health Services Research and Health Care Technology (NICHSR)

  • Databases, data repositories, health statistics

  • Fellowship and funding opportunities

  • Glossaries, research and clinical guidelines

  • Evidence-based practice and health technology assessment

  • Specialized PubMed searches on healthcare quality and costs

Inter-University Consortium for Political and Social Research (ICPSR)

  • World’s largest archive of social science data

  • Searchable

  • Many sub-archives relevant to HSR

    • Health and Medical Care Archive

    • National Archive of Computerized Data on Aging

Planning Your Dataset Project

Resources Needed

  • Your effort

  • Computer resources and security

  • Programmer and/or statistician effort

  • PhD statistical support – complex sampling or analyses

  • Time timeline

IRB Issues

  • Exempt if no identifiers

    - Still need IRB determination of exempt status

  • Limited data set – DUA

    - Usually expedited for IRB

  • Full data use agreement

    -Still may be expedited for IRB

    -May need to explain lack of consent to funders/IRB

Summary: 10 Tips for Success inSecondary Data Analyses

  • Start with a clear research question and hypothesis

  • Get to know your data source:

    • Why does the database exist?

    • Who reports the data?

    • What are the incentives for accurate reporting?

    • How are the data audited, if at all?

    • Can you link the data to other large databases?

  • Get good documentation of the cohort, variables, and data layout, then read the fine print

  • Consult or collaborate with researchers who have used the database

Provided by John Ayanian, MD, MPP, Ellen McCarthy, PhD, “Research with Large Databases”, Harvard School of Public Health

Summary: 10 Tips for Success inSecondary Data Analyses

  • Line up computing resources before data arrive

  • Allow time to receive data if not publicly available

  • Learn SAS, Stata, or other statistical software so you can analyze data yourself (or collaborate)

  • Assess data quality (e.g., outliers & missing data) with plots or frequency tables

  • Consult or collaborate with a statistician on your analysis plan, especially for complex surveys with sampling weights

  • Use clinical intuition to interpret results and consult experts as needed

Provided by John Ayanian, MD, MPP, Ellen McCarthy, PhD, “Research with Large Databases”, Harvard School of Public Health

  • Login