Innovative statistical approaches in health services research: multiple informant analyses

Innovative statistical approaches in health services research: multiple informant analyses Nicholas Horton Department of Mathematics Smith College, Northampton MA nhorton@email.smith.edu http://www.biostat.harvard.edu/multinform

Acknowledgements • Joint work with Garrett Fitzmaurice and Nan Laird, Harvard School of Public Health • Jane Murphy and the Stirling County Study for use of their example dataset • Supported by NIH grant R01-MH54693

Outline • Motivation for multiple source data • Examples of multiple sources/informants • Models for correlated multiple source data • Accounting for complex survey design • Accounting for incomplete/missing data • Example (Stirling County Study) • Conclusions

Why multiple source data? • to provide better measures of some underlying construct that is difficult to measure or likely to be missing • also known as multiple informant reports, proxy reports, co-informants, etc. • discordance is expected, otherwise there is no need to collect multiple reports

Definition of multiple source data • data obtained from multiple informants or raters (e.g., self-reports, family members, health care providers, teachers) • or via different/parallel instruments or methods (e.g., symptom rating scales, standardized diagnostic interviews, or clinical diagnoses) • None of the reports is a “gold” standard • We consider multiple source data that are commensurate (multiple measures of the same underlying variable on a similar scale)

Examples of multiple source data • child psychopathology (ask parents, teachers and children about underlying psychological state) • service utilization studies (collect information from subjects and databases) • medical comorbidity (query providers and charts to assess medical problems)

Examples of multiple source data (cont.) • adherence studies (collect self-report of adherence, electronic pill caps [MEMS] plus pharmacy records) • nutritional epidemiology (utilize multiple dietary instruments such as food frequency questionnaires, 24-hour recalls, food diaries)

Incomplete/missing reports • Multiple source reports are commonly incomplete since, by definition, they are collected from sources other than the primary subject of the study • This missingness may be by design or happenstance (or both!)

Example: missing source reports • Consider service utilization studies that collect information from subjects and databases • Subjects may be lost to follow-up (or only contacted periodically) • Databases may be incomplete (lack of consent, lack of appropriate coverage)

Analytic approach • Multiple sources can provide information on outcomes or predictors (risk factors) • Multiple source outcome: what is the prevalence of child psychopathology? (measured using parallel parent and teacher reports) • Fitzmaurice et al (AJE, 1995), Horton et al (HSOR, 2002), Horton and Fitzmaurice (SIM tutorial, in press)

Analytic approach (cont.) • Multiple source predictor: what are the odds of developing depression in adulthood, conditional on parallel reports of anxiety (collected from a child and a parent)? • Examples: Horton et al (AJE, 2001), Lash et al (AJE, 2003), Liddicoat et al (JGIM, 2004), Horton and Fitzmaurice (SIM tutorial, in press) • We will focus on an example using multiple source predictors

Notation • Let Y denote a univariate outcome for a given subject • Let X_l denote the l’th multiple source predictor (l = 1, ..., L) • Let Z denote a vector of other covariates for the subject • To simplify exposition, we consider two sources with dichotomous reports (L=2)

Questions to consider • Are the sources reporting on the same underlying construct (are they commensurate or interchangeable?) • Is it possible to combine the reports in some fashion? • How to handle missing reports?

Analytic approaches • Reviewed in Horton, Laird and Zahner (IJMPR, 1999) • Use only one source • Fit separate models

Analytic approaches (cont.) • Combine (pool) the reports in some fashion • Include both reports in the model
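One way to make "combine (pool) the reports" concrete is a simple logical rule applied to the two dichotomous reports. The sketch below is illustrative only: the function name `pool_reports` and the "either"/"both" rules are assumptions chosen for the example, not rules stated in the talk. It also shows how a missing report (coded as `None`) can leave the pooled value undetermined.

```python
from typing import Optional


def pool_reports(x1: Optional[int], x2: Optional[int], rule: str = "either") -> Optional[int]:
    """Combine two dichotomous (0/1) reports; None marks a missing report.

    rule="either": positive if any source is positive (logical OR).
    rule="both":   positive only if both sources are positive (logical AND).
    Returns None when the observed reports do not determine the pooled value.
    """
    if rule not in ("either", "both"):
        raise ValueError("rule must be 'either' or 'both'")
    observed = [x for x in (x1, x2) if x is not None]
    if rule == "either":
        if 1 in observed:
            return 1  # one positive report is enough
        return 0 if len(observed) == 2 else None
    # rule == "both"
    if 0 in observed:
        return 0  # one negative report rules out "both positive"
    return 1 if len(observed) == 2 else None


print(pool_reports(1, None))          # a single positive report settles "either"
print(pool_reports(1, None, "both"))  # undetermined without the second report
```

A pooled variable of this kind discards the information about which source reported what, which is one reason to consider models that keep the reports separate.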

Analytic approaches (cont.) • We considered simultaneous estimation of the marginal models: • Non-standard application of GEE • Method independently suggested by Pepe et al (SIM, 1999)
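For two sources and a dichotomous outcome, the pair of marginal models might be written as follows. This is a sketch under stated assumptions: the logistic link, the symbol X_l for the l'th source report, and the coefficient labels are chosen here for illustration and are not taken from the slide.

```latex
\begin{align*}
\operatorname{logit}\, P(Y = 1 \mid X_1, Z) &= \beta_{01} + \beta_{11} X_1 + \beta_{21}^{\top} Z, \\
\operatorname{logit}\, P(Y = 1 \mid X_2, Z) &= \beta_{02} + \beta_{12} X_2 + \beta_{22}^{\top} Z.
\end{align*}
```

Each model conditions on only one source report. Estimating both simultaneously makes it possible to compare the source-specific coefficients, for example by testing whether \beta_{11} = \beta_{12}.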

Advantages of new approach • can be used to test for source differences in association with the outcome • can test if the effects of other risk factors on the outcome differ by source

Advantages of new approach (cont.) • can allow for different source effects where necessary • a pooled model can be fit if there are no significant source effects (potentially more efficient) • can be fit using general purpose statistical software
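Fitting such models with general purpose software typically starts from a stacked (long) dataset in which each subject contributes one record per source. The sketch below shows one possible layout; the function name `stack_sources` and the field names (`id`, `y`, `x1`, `x2`, `source`, `x_by_source`) are hypothetical. Missing reports are simply dropped here for simplicity, which is not how the talk handles them.

```python
def stack_sources(subjects):
    """Return one row per (subject, source) with a source indicator.

    Each input dict holds an id, an outcome y, and two 0/1 reports x1 and x2
    (None if missing). The x_by_source column is the report-by-source
    interaction; its coefficient captures a source difference in association.
    """
    rows = []
    for s in subjects:
        for source, key in enumerate(("x1", "x2")):
            report = s[key]
            if report is None:
                continue  # simplification: skip missing reports
            rows.append({
                "id": s["id"],          # cluster identifier (same subject, repeated outcome)
                "y": s["y"],            # outcome is duplicated across the subject's rows
                "source": source,       # 0 = first source, 1 = second source
                "x": report,            # the report from this source
                "x_by_source": report * source,
            })
    return rows


subjects = [
    {"id": 1, "y": 1, "x1": 1, "x2": 0},
    {"id": 2, "y": 0, "x1": None, "x2": 1},
]
for row in stack_sources(subjects):
    print(row)
```

Regressing `y` on `x`, `source` and `x_by_source` with standard errors clustered on `id` gives the separate-parameter model; dropping the source terms gives the pooled model.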

Accounting for survey design • Many health services or epidemiologic studies arise from complex survey samples • Need to address stratification, multi-stage clustering and unequal sampling weights • Failing to properly account for survey design may lead to bias and incorrect estimation of variability

Accounting for survey design (cont.) • Estimation proceeds using the approximate (quasi) log-likelihood (weighted version of the usual score equations for a GLM, accounting for the multi-stage clustering, including multiple source reports) • Can be fit using general purpose statistical software (e.g. Stata)
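A generic form of the weighted score equations, written as a sketch rather than as the exact equations used in the talk, is shown below. The symbols are assumptions introduced for illustration: w_i is the sampling weight for subject i, \mu_{il}(\beta) is the model mean for subject i under source l, D_{il} is its derivative with respect to \beta, and v_{il} is the GLM variance function.

```latex
\[
U(\beta) \;=\; \sum_{i=1}^{n} w_i \sum_{l=1}^{L}
  D_{il}^{\top}\, v_{il}^{-1}\, \bigl( Y_i - \mu_{il}(\beta) \bigr) \;=\; 0,
\qquad
D_{il} = \frac{\partial \mu_{il}(\beta)}{\partial \beta}.
\]
```

Point estimates solve U(\beta) = 0. Standard errors then need a design-based (sandwich-type) variance estimator that treats the primary sampling units within strata as the clusters, so that the repeated use of each subject's outcome across sources is accounted for.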

Accounting for incomplete source reports • Missing source reports in this setting are missing predictors • Account for missing at random (MAR) reports using the weighted estimating equation methodology of Robins et al (JASA, 1994) and Xie and Paik (Biometrics, 1997) • Adds an additional “missingness weight” • Complicates variance estimation
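The idea of a "missingness weight" can be sketched as an inverse probability of observation. The example below estimates that probability as the observed proportion within cells of a fully observed covariate, which is a simplified stand-in for a fitted model of the observation probability. The function name `missingness_weights` and the field names are hypothetical.

```python
from collections import defaultdict


def missingness_weights(records, stratum_key, observed_key):
    """Inverse probability of observation weights, estimated within cells.

    records:      list of dicts, one per subject
    stratum_key:  name of a fully observed covariate defining the cells
    observed_key: name of a boolean flag, True if the source report is observed
    Subjects with a missing report get weight 0; observed subjects are
    up-weighted to represent the missing subjects in their cell.
    """
    totals = defaultdict(int)
    seen = defaultdict(int)
    for r in records:
        totals[r[stratum_key]] += 1
        if r[observed_key]:
            seen[r[stratum_key]] += 1

    weights = []
    for r in records:
        if r[observed_key]:
            p_observed = seen[r[stratum_key]] / totals[r[stratum_key]]
            weights.append(1.0 / p_observed)
        else:
            weights.append(0.0)
    return weights


records = [
    {"age_group": "a", "has_report": True},
    {"age_group": "a", "has_report": True},
    {"age_group": "a", "has_report": False},
    {"age_group": "a", "has_report": False},
    {"age_group": "b", "has_report": True},
    {"age_group": "b", "has_report": True},
]
print(missingness_weights(records, "age_group", "has_report"))
```

In a survey setting the analysis weight for each record would be the product of the sampling weight and this missingness weight. Because the observation probabilities are themselves estimated, the usual variance formulas no longer apply directly.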

Example: Stirling County • Outcome: time to event (death) over 16 year follow-up period (1952-1968) (n=1079) • multiple source predictors: partially observed dichotomous physician report or self report of psychiatric disorder • other predictors: age (3 categories), gender • statistical model: piecewise exponential survival with 4 intervals each of 4 years duration (subjects contribute time at risk in each interval)
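The phrase "subjects contribute time at risk in each interval" can be illustrated by splitting each subject's follow-up into interval records. This is a minimal sketch assuming follow-up time is measured in years from study entry; the function name `split_follow_up` and the field names are hypothetical, and the default cut points simply encode the four 4-year intervals described above.

```python
def split_follow_up(time, died, cuts=(4.0, 8.0, 12.0, 16.0)):
    """Split one subject's follow-up into piecewise exponential records.

    time: years of follow-up until death or censoring
    died: True if follow-up ended in death
    cuts: upper end points of the intervals (starting from 0)
    Returns one dict per interval in which the subject was at risk, with the
    time at risk (exposure) and an event indicator for that interval.
    """
    rows = []
    start = 0.0
    for k, end in enumerate(cuts):
        if time <= start:
            break  # no time at risk in this or any later interval
        exposure = min(time, end) - start
        event = int(died and time <= end)  # death is counted in the interval where it occurs
        rows.append({"interval": k, "exposure": exposure, "event": event})
        start = end
    return rows


# A subject who died 5.5 years into follow-up contributes two records.
for row in split_follow_up(5.5, True):
    print(row)
```

A piecewise exponential model can then be fit to these records as a Poisson regression of `event` on interval indicators and covariates, with the log of `exposure` as an offset.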

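Stirling County survey design • [Figure: diagram of the survey design showing strata (Stratum 1, ..., Stratum k, ..., Stratum K), primary sampling units within strata (PSU 1, ..., PSU j, ..., PSU J), and the two source reports (self-report and physician report)]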

Stirling County missingness • Complete data on mortality • Relatively few reports of diagnosis missing (5% physician, 7% self) • For missing physician reports, missing completely at random (MCAR) is plausible • Missing self-reports associated with demographics and physician report • Accounting for missingness did not affect results (Horton et al, AJE, 2001)

Results (separate parameters) • Initially fit model with separate parameters • No evidence for any non-zero source terms • Implies that the association between risk factors and mortality did not differ by source • Dropped these terms from the model, yielding parsimonious shared parameter model with smaller standard errors

Results (shared parameters)

Interpretation of results (annual mortality rate)

Conclusions • new methods of analysis of multiple source data are available • can be implemented using existing software • methods allow the assessment of the relative association of each source • each source yielded similar conclusions: association between psychiatric disorder and mortality is stronger for younger subjects • unified model has less variability, pools information after testing for systematic differences

Conclusions (cont.) • methods account for complex survey designs • methods allow partially observed subjects to contribute, under MAR assumptions • multiple source reports arise in many settings (not just for children anymore!)

Future work • Maximum-likelihood estimation instead of GEE approach (may yield efficiency gains; particularly useful for missing reports) • Non-commensurate reports (different scales; different underlying constructs; consider latent variable models, e.g. work of Landrum, Normand)
