##### Graphical models for combining multiple data sources


**Graphical models for combining multiple data sources**

Nicky Best, Sylvia Richardson, Chris Jackson
Imperial College BIAS node
with thanks to Peter Green

**Outline**

- Overview of graphical modelling
- Case study 1: Water disinfection byproducts and adverse birth outcomes
  - Modelling multiple sources of bias in observational studies
- Case study 2: Socioeconomic factors and limiting long term illness
  - Combining individual and aggregate level data
  - Simulation study
  - Application to Census and Health Survey for England

**Graphical modelling**

[Figure: four linked components: Mathematics, Modelling, Inference, Algorithms]

**1. Mathematics**

- Key idea: conditional independence
- X and Y are conditionally independent given Z if, knowing Z, discovering Y tells you nothing more about X:

$$P(X \mid Y, Z) = P(X \mid Z)$$

**Example: Mendelian inheritance**

[Figure: graph with parent node Z and child nodes X and Y]

- Z = genotype of parents
- X, Y = genotypes of 2 children
- If we know the genotype of the parents, then the children's genotypes are conditionally independent

**Joint distributions and graphical models**

Use ideas from graph theory to:

- represent the structure of a joint probability distribution
- by encoding conditional independencies

[Figure: directed graph on nodes A, B, C, D, E, F]

- Factorization theorem: the joint distribution is $P(V) = \prod_{v \in V} P(v \mid \text{parents}[v])$

**Where does the graph come from?**

- Genetics
  - pedigree (family tree)
- Physical, biological, social systems
  - supposed causal effects
- Contingency tables
  - hypothesis tests on data
- Gaussian case
  - non-zeros in inverse covariance matrix

**Conditional independence provides the mathematical basis for splitting up a large system into smaller components**

[Figure: the graph on nodes A to F, and a sub-component on nodes C, D, E]

**2. Modelling**

- Graphical models provide a framework for building probabilistic models for empirical data

**Building complex models**

Key idea:

- understand a complex system
- through a global model
- built from small pieces that are
  - comprehensible
  - each with only a few variables
  - modular

**Example: Case study 1**

- Epidemiological study of birth defects and mothers' exposure to water disinfection byproducts
- Background
  - Chlorine is added to the tap water supply for disinfection
  - It reacts with natural organic matter in water to form unwanted byproducts (including trihalomethanes, THMs)
  - Some evidence of adverse health effects (cancer, birth defects) associated with exposure to high levels of THM
- We are carrying out a study in Great Britain using routine data, to investigate the risk of birth defects associated with exposure to different THM levels

**Data sources**

- National postcoded births register
- National and local congenital anomalies registers
- Routinely monitored THM concentrations in tap water samples for each water supply zone within 14 different water company regions
- Census data: area level socioeconomic factors
- Millennium Cohort Study (MCS): individual level outcomes and confounder data on a sample of mothers
- Literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.)

**Model for combining data sources**

[Figure: graphical model with nodes $q_z$, $s^2$, $f$, $\mathrm{THM}^{[raw]}_{ztj}$, $\mathrm{THM}^{[tap]}_{zt}$, $\mathrm{THM}^{[pers]}_{zk}$, $\mathrm{THM}^{[pers]}_{zi}$, $y_{zk}$, $y_{zi}$, $p_{zk}$, $p_{zi}$, $c_{zk}$, $c_{zi}$, $b^{[T]}$, $b^{[c]}$; successive slides highlight one component at a time]

The components of the model are:

- Regression model for national data relating risk of birth defects ($p_{zk}$) to mother's THM exposure and other confounders ($c_{zk}$)
- Regression model for MCS data relating risk of birth defects ($p_{zi}$) to mother's THM exposure and other confounders ($c_{zi}$)
- Missing data model to estimate confounders ($c_{zk}$) for mothers in the national data, using information on the within-area distribution of confounders in the MCS
- Model to estimate true tap water THM concentration from raw data
- Model to predict personal exposure using the estimated tap water THM level and literature on the distribution of factors affecting individual uptake of THM

**3. Inference**

**Bayesian Full Probability Modelling**

- The graphical approach to building complex models lends itself naturally to the Bayesian inferential process
- The graph defines a joint probability distribution on all the 'nodes' in the model
- Condition on the parts of the graph that are observed (data)
- Update the probabilities of the remaining nodes using Bayes theorem
- Automatically propagates all sources of uncertainty

**4. Algorithms**

- Many algorithms, including MCMC, are able to exploit graphical structure
- MCMC: subgroups of variables updated randomly
- Ensemble converges to the equilibrium (e.g. posterior) distribution

**MCMC updating**

[Figure: graph in which the node being updated is marked "?"]

- To update a node, need only look at its neighbours
- Key idea exploited by WinBUGS software

**Case study 2**

- Socioeconomic factors affecting health
- Background
  - Interested in individual versus contextual effects of socioeconomic determinants of health
  - Often investigated using multi-level studies (individuals within areas)
  - Ecological studies are also widely used in epidemiology and social sciences due to availability of small-area data
    - investigate relationships at the level of the group, rather than the individual
    - outcome and exposures are available as group-level summaries
    - usual aim is to transfer inference to the individual level

**Building the model: multilevel model for individual data**

[Figure: graphical model with nodes $a_i$, $s^2$, $x^{[c]}_{ik}$, $x^{[b]}_{ik}$, $p_{ik}$, $y_{ik}$, $b^{[c]}$, $b^{[b]}$]

For person k in area i:

$$y_{ik} \sim \text{Bernoulli}(p_{ik})$$

$$\log p_{ik} = a_i + b^{[c]} x^{[c]}_{ik} + b^{[b]} x^{[b]}_{ik}$$

$$a_i \sim \text{Normal}(0, s^2)$$

with prior distributions on $s^2$, $b^{[c]}$, $b^{[b]}$.

**Building the model: ecological model**

[Figure: graphical model with nodes $a_i$, $s^2$, $X^{[c]}_i$, $V^{[c]}_i$, $X^{[b]}_i$, $N_i$, $q_i$, $Y_i$, $b^{[c]}$, $b^{[b]}$]

For area i:

$$Y_i \sim \text{Binomial}(q_i, N_i)$$

$$q_i = \int p_{ik}(x^{[b]}, x^{[c]}) \, f_i(x^{[b]}, x^{[c]}) \, dx^{[b]} \, dx^{[c]}$$

Assuming $x^{[b]}$, $x^{[c]}$ independent, with $X^{[b]}_i$ = proportion exposed to 'b' in area i and $f_i(x^{[c]}) = \text{Normal}(X^{[c]}_i, V^{[c]}_i)$, then

$$q_i = q_{0i} (1 - X^{[b]}_i) + q_{1i} X^{[b]}_i$$

where

- $q_{0i}$ = marginal probability of disease for unexposed $= \exp\left(a_i + b^{[c]} X^{[c]}_i + (b^{[c]})^2 V^{[c]}_i / 2\right)$
- $q_{1i}$ = marginal probability of disease for exposed $= \exp\left(a_i + b^{[b]} + b^{[c]} X^{[c]}_i + (b^{[c]})^2 V^{[c]}_i / 2\right)$

$$a_i \sim \text{Normal}(0, s^2)$$

with prior distributions on $s^2$, $b^{[b]}$, $b^{[c]}$.

**Combining individual and aggregate data**

- Individual level survey data often lack power to inform about contextual and/or individual-level effects
- Even when the correct (integrated) model is used, ecological data often contain little information about some or all effects of interest
- Can we improve inference by combining both types of model / data?

[Figure: the multilevel model for individual data and the ecological model side by side, sharing $b^{[c]}$ and $b^{[b]}$]

**Combining individual and aggregate data: Hierarchical Related Regression (HRR) model**

[Figure: single graphical model in which the individual-level nodes ($x^{[c]}_{ik}$, $x^{[b]}_{ik}$, $p_{ik}$, $y_{ik}$) and the area-level nodes ($X^{[c]}_i$, $V^{[c]}_i$, $X^{[b]}_i$, $N_i$, $q_i$, $Y_i$) share $a_i$, $s^2$, $b^{[c]}$, $b^{[b]}$]

**Comments**

- Inference from aggregate data can be unbiased provided exposure contrasts between areas are high (and the appropriate integrated model is used)
- Combining aggregate data with small samples of individual data can reduce bias when exposure contrasts are low
- Combining individual and aggregate data can reduce the MSE of the estimates compared with individual data alone
- Individual data cannot help if the individual-level model is misspecified

**Application to LLTI**

- Health outcome
  - Limiting Long Term Illness (LLTI) in men aged 40-59 yrs living in London
- Exposures
  - ethnicity (white/non-white), income, area deprivation
- Data sources
  - Aggregate: 1991 Census aggregated to ward level and ACORN (CACI) ward level income data
  - Individual: Health Survey for England (with ward identifier)
    - 1-9 observations per ward (median 1.6)

**Ward level data**

[Figure: ward-level plots of prevalence of LLTI, mean income, % non-white and deprivation]

**Concluding Remarks**

- Graphical models are a powerful and flexible tool for building realistic statistical models for complex problems
  - Applicable in many domains
  - Allow exploiting of subject matter knowledge
  - Allow formal combining of multiple data sources
  - Built on rigorous mathematics
  - Principled inferential methods

Thank you for your attention!
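
**Supplementary sketch: checking the ecological model's marginal probability**

The closed-form marginal probabilities q0i and q1i in the ecological model follow from the moment generating function of the normal distribution applied to the log-link individual model. The sketch below is not from the presentation: it is a minimal numerical check, with hypothetical parameter values and hypothetical function names, that integrates the individual-level risk over a normal covariate and a binary exposure and compares the result with the closed form.

```python
import math


def individual_risk(a, b_c, b_b, x_c, x_b):
    """Log-link individual model: p = exp(a + b_c * x_c + b_b * x_b)."""
    return math.exp(a + b_c * x_c + b_b * x_b)


def normal_pdf(x, mean, var):
    """Density of Normal(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_risk_numeric(a, b_c, b_b, X_c, V_c, X_b, n=4001, width=10.0):
    """Average the individual risk over x_c ~ Normal(X_c, V_c) using the
    trapezoid rule, and over a binary exposure with P(x_b = 1) = X_b."""
    sd = math.sqrt(V_c)
    lo = X_c - width * sd
    step = 2 * width * sd / (n - 1)
    total = 0.0
    for j in range(n):
        x = lo + j * step
        weight = 0.5 if j in (0, n - 1) else 1.0
        mixed = ((1 - X_b) * individual_risk(a, b_c, b_b, x, 0)
                 + X_b * individual_risk(a, b_c, b_b, x, 1))
        total += weight * mixed * normal_pdf(x, X_c, V_c)
    return total * step


def marginal_risk_closed_form(a, b_c, b_b, X_c, V_c, X_b):
    """q_i = q0 * (1 - X_b) + q1 * X_b, with q0 and q1 as on the slides."""
    q0 = math.exp(a + b_c * X_c + b_c ** 2 * V_c / 2)
    q1 = math.exp(a + b_b + b_c * X_c + b_c ** 2 * V_c / 2)
    return q0 * (1 - X_b) + q1 * X_b


# Hypothetical values for one area, chosen only for illustration.
params = dict(a=-2.0, b_c=-0.3, b_b=0.5, X_c=1.2, V_c=0.4, X_b=0.25)
print(marginal_risk_numeric(**params), marginal_risk_closed_form(**params))
```

The two printed values should agree to many decimal places. The term involving V[c]i is what separates the integrated model from naively plugging the area mean X[c]i into the individual-level regression.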