Using Secondary Data Analysis for Outcomes Research

Using Secondary Data Analysis for Outcomes Research Epi 211 April 2010 Michael Steinman, MD

Disclosures and acknowledgements Disclosures: None Acknowledgements: J. Michael McWilliams Ann Nattinger SGIM Research Committee

Question: • You are a fellow / junior faculty member interested in studying... • Impact of nurse-led HTN clinics on clinical outcomes in patients with HTN • Impact of implementing EMRs on appropriate prescribing in ambulatory surgical patients • Whether quality measures of asthma control in children correlate with actual clinical outcomes in this population

Question: • Here’s your choice: • A. Get a multimillion dollar grant to conduct a multi-center, multi-year RCT • B. Analyze existing data

Learning objectives • Appreciate key conceptual and methodologic issues involved in outcomes research employing secondary data analysis • Identify and use online tools for locating and learning about datasets relevant to your research • Understand the range of resources and support required to successfully complete a secondary data analysis

Overview • Working with secondary data • Conceptual and methodologic issues • Overview of high-value datasets and web-based resources • Planning your dataset project • Practical advice • Q&A

Working with Secondary Data

(My) Definition of Secondary Data Data that have been collected but not for you

Types of Secondary Data • Survey • Administrative (claims) • Discharge • Medical chart / EMR • Disease registries • Aggregate (ARF, US Census) • Combinations and linkages

What kinds of research can be conducted with secondary data? Anything but randomized trials • By discipline: • Outcomes research • Epidemiology • Health services research • By question: • Descriptive • Comparative • Causal

Conceiving a Project • Which comes first: question or dataset? • Research question first • Dataset first • Trick question: both • Hybrid approach • Identify research focus, broad question • Consider candidate datasets • Hone question • Iterate between 2 and 3

What makes a good research question? (FINER) • Feasible—data, variables, & resources accessible & available • Interesting—to researcher and audience • Novel—extends what is already known • Ethical—upholds standards • Relevant—to patient care, clinical outcomes, policy, etc. Cummings et al. Conceiving the research question. In: Hulley SB, Cummings SR, Browner WS, et al, eds. Designing clinical research, 3rd ed. Philadelphia: Lippincott Williams & Wilkins, 2007:17-26

Selecting a Database • Compatibility with research question(s) • Availability and expense • Sample: representativeness, power • Measures of interest present and valid • Messiness and missingness • Local expertise • Linkages

Key elements for outcomes research • Measure of intervention • Measure of outcome(s) • Intermediate outcomes • % of patients receiving treatment, measures of glycemic or lipid control (A1c, serum LDL) • Clinical outcomes • Death, hospitalization, satisfaction, etc. • Measures of important confounders

Advantages of Secondary Data • They are not primary data! • Efficiency: fast and cheap • No regrets • Scale and scope • Size and detail not otherwise feasible for individual research team • Generalizable • Novel and creative research questions • Often easier IRB review process

Challenges and Pitfalls • Data mining/overfitting • When the analysis precedes the question • Does urine cortisol predict Catholicism? • Causal inference • Inherently limited with observational data • But does not preclude quasi-experimental designs to recover causal effects

Challenges and Pitfalls • Validity of measures • Beware of assumptions • Problems: coding, reporting, recall biases • Solutions: direct validation in subgroup or another data source, literature review, sensitivity analyses • Complexity of file structure • Row in dataset may not be unit of analysis • Skip patterns, proxy respondents

What You Want and What You Have • Want to measure time preferences • Behavioral economics: people tend to overvalue the present • Explanation for unhealthy habits, underuse of cancer screening? • Have measures on financial planning horizons • Are the two equivalent? • Might financial planning also depend on: • Income • Source of income, employment status • Dependents • Inheritance

A Simple Question? Ask: IF ((piRTab1X007AFinFam = FAMILYR) OR (piRTab1X007AFinFam = FINANCIAL_FAMILYR)) AND ((ACTIVELANGUAGE <> EXTENG) AND (ACTIVELANGUAGE <> EXTSPN)) AND (piInitA106_NumContactKids > 0) AND (piInitA100_NumNRKids > 0) JE012 CHILDREN LIVE WITHIN 10 MILES Section: E Level: Household Type: Numeric Width: 1 Decimals: 0 CAI Reference: SecE.KidStatus.E012_ 2000 Link: G19802002 Link: HE012 IF {R DOES NOT HAVE SPOUSE/PARTNER and DOES NOT STILL HAVE HOME OUTSIDE NURSING HOME {(CS11/A028=1) and (CS26/A070 NOT 1)}} or {R & SPOUSE/PARTNER} LIVE IN SAME NURSING HOME (CS11/A028=1 and CS12/A030=1): [Do any of your children who do not live with you/Does CHILD NAME] live within 10 miles of you (in R's NURSING HOME CITY, STATE (CS25b/A067))? OTHERWISE: [Do any of your children (who do not live with you)/Does CHILD NAME] live within 10 miles of you (in MAIN RESIDENCE [CITY/CITY, STATE STATE])? 6802 1. YES 4720 5. NO 32 8.DK (Don't Know); NA (Not Ascertained) 4 9. RF (Refused) 2087 Blank. INAP (Inapplicable) * From the Health and Retirement Study

Challenges and Pitfalls • Representativeness of Sample • External validity (generalizability) • Internal validity (selection bias) • Example: comparing outcomes for insured and uninsured patients using hospital discharge data • Must be hospitalized to enter sample • Not only limits generalizability (to outpatients) • But inferences about the sample may be wrong • Sample would need to include uninsured who would have been hospitalized if insured

Statistical Considerations:Missing Data • Sources • Non-response: unit and item • Variability in data collection (e.g. across states or over time, collected on subset due to expense) • Incomplete linkages • Language • MCAR: M╨Y  strong assumption, can ignore • MAR: M╨Y|X  weaker assumption, can fix • Non-ignorable, informative: M predicts Y  can’t fix

Statistical Considerations:Missing Data • Approaches • Listwise deletion, complete case (ok if MCAR) • Imputation • Mean imputation (biased standard errors) • Multiple imputation (MAR) • Weighting techniques (MAR) • Random effects models (MAR)

Statistical Considerations:Analyzing Survey Data • Complex survey designs • Example multistage probability sample (NAMCS): US divided into PSUs (counties / MSAs)  sample of PSUs selected  within each PSU, stratify MDs by specialty  sample of MDs within each stratum  quasi-random sample of patients seen by each MD • Survey design • Clustering: convenience, ↓precision • Stratification: ↑precision , ↑representativeness (protects against a bad sample) • Oversampling: ↑representation and precision for subgroup of interest

Statistical Considerations:Analyzing Survey Data • Survey weights: affect point estimates • Individuals may have unequal selection probabilities • Need to apply weights to recover representativeness • W = 1/p(selection) = # people represented • W’s reflect sampling design, adjustments to match to census totals, non-response • Survey strata, clusters: affect se’s • Need variance estimators that account for correlated data • Most statistical packages able to handle

Finding the Right Dataset

Finding the Right Dataset • Contain variables of interest • predictor, outcome, confounders • Relevant time frame • Cross-sectional, longitudinal • Feasible • Access: time, bureaucracy, cost • Usable • No perfect datasets -> hybrid approach of developing research question

Administrative Data (VA) • VA has multiple high-value administrative databases • Outpatient visit information • Visit date, type of clinic, provider, ICD9 diagnoses • Inpatient information • Admitting dx(s), discharge dx(s), CPT codes, bed section, meds administered • Lab data • >40 labs • Pharmacy data • All inpatient and outpatient fills • Academic affiliation • etc

Administrative Data (VA) • Huge bureaucracy and paperwork

Administrative Data (VA) • Messy data • Huge size • 2 TB server • Data analyst

Survey Data (NHANES) • National Health and Nutrition Examination Survey (NHANES) • Nationally representative sample of >10K patients every 2 years • Extensive interview data on clinical history (including diseases, behaviors, psychosocial parameters, etc.) • Physical exam information (e.g. VS) • Labs, biomarkers

Survey Data (NHANES) • Free and easy to download • (Relatively) easy to use • Although requires careful reading of documentation • Serial cross-sectional • Disease data self-report • Very limited information about providers and systems of care

Survey Data (NAMCS) • National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) • Nationally representative sample of ~70K outpatient and ED visits per year • Physician-completed form about office visit

Survey Data (NAMCS) • Data more from physician perspective (diagnoses, treatments Rx’ed, etc) and some info on providers (e.g., clinic organization, use of EMRs, etc) • Serial cross-sectional • Visit-focused • Not comprehensive, ? value for chronic diseases

Discharge Data (NIS) • National Inpatient Sample (NIS) • Database of inpatient hospital stays collected from ~20% of US community hospitals by AHRQ • Diagnoses and procedures, severity adjustment elements, payment source, hospital organizational characteristics • Hospital and county identifiers that allow linkage to the American Hospital Association Annual Survey and Area Resource File

Discharge Data (NIS) • Relatively easy to access (DUA, $200/yr) • Relatively easy to use • Though need close attention to documentation • Limited data elements • Huge data files

Web-Based Resources • Society of General Internal Medicine (SGIM) Research Dataset Compendium • www.sgim.org/go/datasets • UCSF K-12 Data Resource Center • http://www.epibiostat.ucsf.edu/courses/RoadmapK12/PublicDataSetResources/ • Partners in Information Access for the Public Health Workforce • http://phpartners.org/health_stats.html

Finding Additional Resources • National Information Center on Health Services Research and Health Care Technology (NICHSR) • Inter-University Consortium for Political and Social Research (ICPSR) • Partners in Information Access for the Public Health Workforce • Roadmap K-12 Data Resource Center (UCSF) • List of datasets from the American Sociologic Association • Canadian Research Data Centers – Data Sets and Research Tools (Canada) • Directory of Health and Human Services Data Resources • Publicly Available Databases from National Institute on Aging (NIA) • Publicly Available Databases from National Heart, Lung, & Blood Institute (NHLBI) • National Center for Health Statistics (NCHS) Data Warehouse • Medicare Research Data Assistance Center (RESDAC); and Centers for Medicare and Medicaid Services (CMS) Research, Statistics, Data & Systems • Veterans Affairs (VA) data (all available at www.sgim.org/go/datasets)

National Information Center on Health Services Research and Health Care Technology (NICHSR) • Databases, data repositories, health statistics • Fellowship and funding opportunities • Glossaries, research and clinical guidelines • Evidence-based practice and health technology assessment • Specialized PubMed searches on healthcare quality and costs http://www.nlm.nih.gov/hsrinfo/datasites.html

Inter-University Consortium for Political and Social Research (ICPSR) • World’s largest archive of social science data • Searchable • Many sub-archives relevant to HSR • Health and Medical Care Archive • National Archive of Computerized Data on Aging http://www.icpsr.umich.edu/icpsrweb/ICPSR/partners/archives.jsp

Planning Your Dataset Project

Resources Needed • Your effort • Computer resources and security • Programmer and/or statistician effort • PhD statistical support – complex sampling or analyses • Time timeline

IRB Issues • Exempt if no identifiers - Still need IRB determination of exempt status • Limited data set – DUA - Usually expedited for IRB • Full data use agreement -Still may be expedited for IRB -May need to explain lack of consent to funders/IRB

Summary: 10 Tips for Success inSecondary Data Analyses • Start with a clear research question and hypothesis • Get to know your data source: • Why does the database exist? • Who reports the data? • What are the incentives for accurate reporting? • How are the data audited, if at all? • Can you link the data to other large databases? • Get good documentation of the cohort, variables, and data layout, then read the fine print • Consult or collaborate with researchers who have used the database Provided by John Ayanian, MD, MPP, Ellen McCarthy, PhD, “Research with Large Databases”, Harvard School of Public Health

Summary: 10 Tips for Success inSecondary Data Analyses • Line up computing resources before data arrive • Allow time to receive data if not publicly available • Learn SAS, Stata, or other statistical software so you can analyze data yourself (or collaborate) • Assess data quality (e.g., outliers & missing data) with plots or frequency tables • Consult or collaborate with a statistician on your analysis plan, especially for complex surveys with sampling weights • Use clinical intuition to interpret results and consult experts as needed Provided by John Ayanian, MD, MPP, Ellen McCarthy, PhD, “Research with Large Databases”, Harvard School of Public Health

Using Secondary Data Analysis for Outcomes Research