Graphical models for combining multiple data sources



Nicky Best

Sylvia Richardson

Chris Jackson

Imperial College BIAS node

with thanks to Peter Green


Outline

  • Overview of graphical modelling

  • Case study 1: Water disinfection byproducts and adverse birth outcomes

    • Modelling multiple sources of bias in observational studies

  • Case study 2: Socioeconomic factors and limiting long term illness

    • Combining individual and aggregate level data

    • Simulation study

    • Application to Census and Health Survey for England


Graphical modelling

Modelling

Mathematics

Algorithms

Inference


1. Mathematics

  • Key idea: conditional independence

  • X and Y are conditionally independent given Z if, knowing Z, discovering Y tells you nothing more about X

    P(X | Y, Z) = P(X | Z)



Example: Mendelian inheritance

[Figure: graph with nodes Z, X and Y]

  • Z = genotype of parents

  • X, Y = genotypes of 2 children

  • If we know the genotype of the parents, then the children’s genotypes are conditionally independent
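The Mendelian example can be checked by brute-force enumeration. The sketch below is illustrative only: it assumes a single locus with two alleles, parental genotypes drawn independently in Hardy-Weinberg proportions with allele frequency 0.3, and function names (`build_joint`, `prob`, and so on) invented for this example. It confirms from the joint table that X and Y are conditionally independent given Z, while remaining dependent once Z is summed out.

```python
from fractions import Fraction
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

def parent_prior(freq_a):
    """P(one parent's genotype) under Hardy-Weinberg proportions (an assumption)."""
    p, q = freq_a, 1 - freq_a
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def passes_a(genotype):
    """Probability that a parent with this genotype transmits allele A."""
    return {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}[genotype]

def child_given_parents(child, mother, father):
    """Mendelian P(child genotype | both parental genotypes)."""
    m, f = passes_a(mother), passes_a(father)
    return {"AA": m * f,
            "Aa": m * (1 - f) + (1 - m) * f,
            "aa": (1 - m) * (1 - f)}[child]

def build_joint(freq_a=Fraction(3, 10)):
    """Joint table P(Z, X, Y), with Z = (mother, father) and X, Y the two children."""
    prior = parent_prior(freq_a)
    joint = {}
    for mother, father, x, y in product(GENOTYPES, repeat=4):
        z = (mother, father)
        joint[(z, x, y)] = (prior[mother] * prior[father]
                            * child_given_parents(x, mother, father)
                            * child_given_parents(y, mother, father))
    return joint

def prob(joint, z=None, x=None, y=None):
    """Probability of the stated values; an argument left as None is summed out."""
    return sum(p for (zz, xx, yy), p in joint.items()
               if z in (None, zz) and x in (None, xx) and y in (None, yy))

def conditionally_independent(joint):
    """Check P(x, y | z) = P(x | z) P(y | z) for every z, cross-multiplied by P(z)."""
    for z in {key[0] for key in joint}:
        pz = prob(joint, z=z)
        for x, y in product(GENOTYPES, repeat=2):
            lhs = prob(joint, z=z, x=x, y=y) * pz
            rhs = prob(joint, z=z, x=x) * prob(joint, z=z, y=y)
            if lhs != rhs:
                return False
    return True

joint = build_joint()
ci_given_parents = conditionally_independent(joint)
marginal_xy = prob(joint, x="aa", y="aa")
product_of_marginals = prob(joint, x="aa") * prob(joint, y="aa")
print(ci_given_parents, marginal_xy, product_of_marginals)
```

Exact fractions are used so that the equality checks are not affected by floating-point rounding.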


Joint distributions and graphical models

Use ideas from graph theory to:

  • represent the structure of a joint probability distribution …

  • … by encoding conditional independencies

[Figure: directed graph on nodes A, B, C, D, E and F]

  • Factorization theorem:

    Joint distribution $P(V) = \prod_{v \in V} P(v \mid \mathrm{parents}[v])$
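As a small illustration of the factorization, the sketch below builds a joint distribution as a product of local conditional probabilities. The four-node graph (A and B are parents of C, C is the parent of D) and all probability values are hypothetical; they are not the graph shown on the slide.

```python
from itertools import product

# Hypothetical DAG: each node maps to the tuple of its parents.
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",)}

# Hypothetical tables giving P(node = 1 | parent values) for binary nodes.
CPT = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "D": {(0,): 0.2, (1,): 0.7},
}

def local_prob(node, assignment):
    """P(node takes its assigned value | its parents' assigned values)."""
    parent_values = tuple(assignment[p] for p in PARENTS[node])
    p_one = CPT[node][parent_values]
    return p_one if assignment[node] == 1 else 1.0 - p_one

def joint_prob(assignment):
    """Joint probability as the product of one local factor per node."""
    result = 1.0
    for node in PARENTS:
        result *= local_prob(node, assignment)
    return result

def all_assignments():
    """Every 0/1 assignment to the nodes of the graph."""
    names = list(PARENTS)
    for values in product((0, 1), repeat=len(names)):
        yield dict(zip(names, values))

total = sum(joint_prob(a) for a in all_assignments())
example = joint_prob({"A": 1, "B": 0, "C": 1, "D": 1})
print(total, example)
```

Each node needs only a table indexed by its own parents, which is what keeps the specification small even when the full joint table would be large.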


Where does the graph come from?

  • Genetics

    • pedigree (family tree)

  • Physical, biological, social systems

    • supposed causal effects

  • Contingency tables

    • hypothesis tests on data

  • Gaussian case

    • non-zeros in inverse covariance matrix
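For the Gaussian case, the link between the graph and the inverse covariance (precision) matrix can be seen numerically. The sketch below uses a hypothetical precision matrix for three variables ordered (X, Z, Y); the helper names are invented for this example. A zero entry in the precision matrix corresponds to a missing edge and a zero partial correlation, even though the two variables are marginally correlated.

```python
import numpy as np

# Hypothetical precision matrix for (X, Z, Y): a chain X - Z - Y.
# The (X, Y) entry is zero, so X and Y are conditionally independent given Z.
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])
covariance = np.linalg.inv(precision)

def edges_from_precision(matrix, tol=1e-10):
    """Edges of the graph: pairs (i, j) with a non-zero off-diagonal entry."""
    n = matrix.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(matrix[i, j]) > tol]

def partial_correlation(matrix, i, j):
    """Correlation of variables i and j given all the others."""
    return -matrix[i, j] / np.sqrt(matrix[i, i] * matrix[j, j])

edges = edges_from_precision(precision)
print(edges)                                  # pairs joined by an edge
print(covariance[0, 2])                       # marginal covariance of X and Y
print(partial_correlation(precision, 0, 2))   # partial correlation given Z
```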




2. Modelling: splitting up a large system into smaller components

  • Graphical models provide a framework for building probabilistic models for empirical data


Building complex models

Key idea

  • understand complex system

  • through global model

  • built from small pieces

    • comprehensible

    • each with only a few variables

    • modular


Example: Case study 1

  • Epidemiological study of birth defects and mothers’ exposure to water disinfection byproducts

  • Background

    • Chlorine added to tap water supply for disinfection

    • Reacts with natural organic matter in water to form unwanted byproducts (including trihalomethanes, THMs)

    • Some evidence of adverse health effects (cancer, birth defects) associated with exposure to high levels of THM

    • We are carrying out a study in Great Britain using routine data, to investigate the risk of birth defects associated with exposure to different THM levels


Data sources

  • National postcoded births register

  • National and local congenital anomalies registers

  • Routinely monitored THM concentrations in tap water samples for each water supply zone within 14 different water company regions

  • Census data – area level socioeconomic factors

  • Millennium Cohort Study (MCS) – individual level outcomes and confounder data on sample of mothers

  • Literature relating to factors affecting personal exposure (uptake factors, water consumption, etc.)


Model for combining data sources

[Figure: graphical model linking raw THM measurements ($\mathrm{THM}^{raw}_{ztj}$), tap water THM ($\mathrm{THM}^{tap}_{zt}$), personal exposure ($\mathrm{THM}^{pers}_{zk}$, $\mathrm{THM}^{pers}_{zi}$), outcomes ($y_{zk}$, $y_{zi}$), risks ($p_{zk}$, $p_{zi}$), confounders ($c_{zk}$, $c_{zi}$) and further parameter nodes. Successive slides highlight the following sub-models.]

  • Regression model for national data relating risk of birth defects ($p_{zk}$) to mother’s THM exposure and other confounders ($c_{zk}$)

  • Regression model for MCS data relating risk of birth defects ($p_{zi}$) to mother’s THM exposure and other confounders ($c_{zi}$)

  • Missing data model to estimate confounders ($c_{zk}$) for mothers in national data, using information on within area distribution of confounders in MCS

  • Model to estimate true tap water THM concentration from raw data

  • Model to predict personal exposure using estimated tap water THM level and literature on distribution of factors affecting individual uptake of THM
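To make the modular structure concrete, here is a rough forward simulation of one water supply zone that chains the exposure and regression sub-models together. Everything specific in it is an assumption made for illustration: the log-normal measurement error, the multiplicative uptake factor, the single binary confounder, the logistic link and every numerical value. It is not the authors' model, and the missing data sub-model is left out.

```python
import numpy as np

def simulate_zone(rng, n_raw_samples=4, n_mothers=500):
    """Forward-simulate one zone; all distributions and values are assumed."""
    # True tap water THM level for the zone (log scale).
    log_tap = rng.normal(3.5, 0.5)
    # Raw monitoring samples scatter around the true tap level.
    raw = np.exp(rng.normal(log_tap, 0.3, size=n_raw_samples))
    # Personal exposure = tap level x an individual uptake factor.
    uptake = np.exp(rng.normal(0.0, 0.4, size=n_mothers))
    personal = np.exp(log_tap) * uptake
    # One binary confounder per mother.
    confounder = rng.binomial(1, 0.3, size=n_mothers)
    # Regression model: risk depends on personal exposure and the confounder.
    logit = -5.0 + 0.01 * personal + 0.5 * confounder
    risk = 1.0 / (1.0 + np.exp(-logit))
    outcome = rng.binomial(1, risk)
    return {"raw": raw, "tap": np.exp(log_tap), "personal": personal,
            "confounder": confounder, "risk": risk, "outcome": outcome}

zone = simulate_zone(np.random.default_rng(1))
print(zone["raw"], zone["outcome"].sum())
```

Each commented step corresponds to one small component; replacing a component (for example a different uptake model) leaves the others untouched.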


3. Inference

Bayesian

… or non Bayesian

Bayesian Full Probability Modelling

  • Graphical approach to building complex models lends itself naturally to the Bayesian inferential process

  • Graph defines a joint probability distribution on all the ‘nodes’ in the model

  • Condition on the parts of the graph that are observed (data)

  • Update probabilities of the remaining nodes using Bayes’ theorem

  • Automatically propagates all sources of uncertainty
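The conditioning step can be written in one line. In the sketch below, $U$ (the unobserved nodes) and $D$ (the observed data nodes) are labels introduced here for illustration, and the sum is written for discrete nodes (replace it with an integral for continuous ones).

```latex
P(U \mid D) \;=\; \frac{P(U, D)}{\sum_{U'} P(U', D)}
\;\propto\; \prod_{v \in V} P\bigl(v \mid \mathrm{parents}[v]\bigr),
\qquad V = U \cup D .
```

The right-hand side is the graph's joint distribution with the observed nodes held fixed at their data values, so the posterior is available up to a normalising constant directly from the graph.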


4. Algorithms

  • Many algorithms, including MCMC, are able to exploit graphical structure

  • MCMC: subgroups of variables updated randomly

  • Ensemble converges to equilibrium (e.g. posterior) distribution

MCMC updating

  • When updating a node, need only look at its neighbours

  • Key idea exploited by WinBUGS software
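The "only look at neighbours" idea can be sketched with a toy Gibbs sampler. The example assumes a hypothetical undirected chain of ±1 variables in which adjacent values prefer to agree (coupling 0.5); it is a generic illustration, not a description of how WinBUGS is implemented. Each update reads only the one or two neighbouring nodes.

```python
import math
import random

def gibbs_chain(n_nodes=10, coupling=0.5, n_sweeps=2000, seed=0):
    """Gibbs sampling on a chain with P(x) proportional to exp(coupling * sum x_i x_{i+1}).

    Returns the fraction of adjacent pairs that agree, averaged over sweeps.
    """
    rng = random.Random(seed)
    state = [rng.choice((-1, 1)) for _ in range(n_nodes)]
    agreements = 0
    for _ in range(n_sweeps):
        # Visit the nodes in a fresh random order on every sweep.
        for i in rng.sample(range(n_nodes), n_nodes):
            # The full conditional of node i depends only on its neighbours.
            neighbours = [state[j] for j in (i - 1, i + 1) if 0 <= j < n_nodes]
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * coupling * sum(neighbours)))
            state[i] = 1 if rng.random() < p_plus else -1
        agreements += sum(state[i] == state[i + 1] for i in range(n_nodes - 1))
    return agreements / (n_sweeps * (n_nodes - 1))

estimate = gibbs_chain()
exact = 1.0 / (1.0 + math.exp(-2.0 * 0.5))  # agreement probability for this chain
print(estimate, exact)
```

For this simple chain the agreement probability is known in closed form, which gives a check that the sampler has settled to its equilibrium distribution.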


Case study 2

  • Socioeconomic factors affecting health

  • Background

    • Interested in individual versus contextual effects of socioeconomic determinants of health

    • Often investigated using multi-level studies (individuals within areas)

    • Ecological studies also widely used in epidemiology and social sciences due to availability of small-area data

      • investigate relationships at level of group, rather than individual

      • outcome and exposures are available as group-level summaries

      • usual aim is to transfer inference to individual level


Building the model

Multilevel model for individual data

[Figure: graphical model with nodes $\alpha_i$, $\sigma^2$, $x^{c}_{ik}$, $x^{b}_{ik}$, $p_{ik}$, $y_{ik}$, $\beta_c$ and $\beta_b$, built up over successive slides]

  • $y_{ik} \sim \mathrm{Bernoulli}(p_{ik})$, person $k$, area $i$

  • $\log p_{ik} = \alpha_i + \beta_c x^{c}_{ik} + \beta_b x^{b}_{ik}$

  • $\alpha_i \sim \mathrm{Normal}(0, \sigma^2)$

  • Prior distributions on $\sigma^2$, $\beta_c$, $\beta_b$
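A short simulation shows how data arise under the multilevel model above. The sketch makes several assumptions that are not on the slides: the area effects are centred on a baseline `mu` (added here so that the log-link probabilities stay below 1), the continuous covariate is standard normal, the binary covariate has prevalence 0.3, and all parameter values are arbitrary.

```python
import numpy as np

def simulate_individual_data(rng, n_areas=50, n_per_area=20,
                             mu=-2.5, sigma=0.3, beta_c=0.2, beta_b=0.5):
    """Simulate individuals within areas under a log-link multilevel model."""
    # Area-level random effects; the baseline mu is an assumption of this sketch.
    alpha = rng.normal(mu, sigma, size=n_areas)
    area = np.repeat(np.arange(n_areas), n_per_area)
    # Individual covariates: one continuous (c) and one binary (b).
    x_c = rng.normal(0.0, 1.0, size=area.size)
    x_b = rng.binomial(1, 0.3, size=area.size)
    # Log link: log p = alpha_i + beta_c * x_c + beta_b * x_b.
    log_p = alpha[area] + beta_c * x_c + beta_b * x_b
    p = np.minimum(np.exp(log_p), 1.0)  # guard: a log link can exceed 1
    y = rng.binomial(1, p)
    return area, x_c, x_b, p, y

area, x_c, x_b, p, y = simulate_individual_data(np.random.default_rng(0))
print(y.mean(), p.mean())
```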


Building the model

Ecological model

[Figure: graphical model with nodes $\alpha_i$, $\sigma^2$, $X^{c}_i$, $V^{c}_i$, $X^{b}_i$, $N_i$, $q_i$, $Y_i$, $\beta_c$ and $\beta_b$, built up over successive slides]

  • $Y_i \sim \mathrm{Binomial}(q_i, N_i)$, area $i$

  • $q_i = \iint p_{ik}(x^{b}, x^{c})\, f_i(x^{b}, x^{c})\, dx^{b}\, dx^{c}$

  • Assuming $x^{b}$, $x^{c}$ independent, with $X^{b}_i$ = proportion exposed to ‘b’ in area $i$ and $f_i(x^{c}) = \mathrm{Normal}(X^{c}_i, V^{c}_i)$, then

    $q_i = q_{0i}(1 - X^{b}_i) + q_{1i} X^{b}_i$

    where

    $q_{0i}$ = marginal prob of disease for unexposed $= \exp(\alpha_i + \beta_c X^{c}_i + \beta_c^2 V^{c}_i / 2)$

    $q_{1i}$ = marginal prob of disease for exposed $= \exp(\alpha_i + \beta_b + \beta_c X^{c}_i + \beta_c^2 V^{c}_i / 2)$

  • $\alpha_i \sim \mathrm{Normal}(0, \sigma^2)$

  • Prior distributions on $\sigma^2$, $\beta_b$, $\beta_c$
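The closed form for the area-level probability follows from averaging the individual log-link risk over a normally distributed covariate and a binary exposure. The sketch below checks this numerically for one area, comparing the closed form with Gauss-Hermite quadrature over the continuous covariate. Function names and parameter values are invented for the example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def marginal_risk_closed_form(alpha, beta_b, beta_c, mean_c, var_c, prop_b):
    """Area-level risk: mixture of unexposed (q0) and exposed (q1) marginal risks."""
    q0 = np.exp(alpha + beta_c * mean_c + beta_c ** 2 * var_c / 2)
    q1 = np.exp(alpha + beta_b + beta_c * mean_c + beta_c ** 2 * var_c / 2)
    return q0 * (1 - prop_b) + q1 * prop_b

def marginal_risk_numeric(alpha, beta_b, beta_c, mean_c, var_c, prop_b, n_nodes=40):
    """Same quantity by integrating the individual risk over the covariates."""
    nodes, weights = hermegauss(n_nodes)
    weights = weights / np.sqrt(2 * np.pi)       # weights of a N(0, 1) expectation
    x_c = mean_c + np.sqrt(var_c) * nodes        # quadrature points for N(mean_c, var_c)
    total = 0.0
    for x_b, weight_b in ((0, 1 - prop_b), (1, prop_b)):
        individual_risk = np.exp(alpha + beta_c * x_c + beta_b * x_b)
        total += weight_b * np.sum(weights * individual_risk)
    return total

params = dict(alpha=-3.0, beta_b=0.5, beta_c=0.3, mean_c=1.0, var_c=0.5, prop_b=0.2)
closed = marginal_risk_closed_form(**params)
numeric = marginal_risk_numeric(**params)
print(closed, numeric)
```

The extra variance term in the exponent is what distinguishes the integrated model from naively plugging area means into the individual-level regression.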


Combining individual and aggregate data

  • Individual level survey data often lack power to inform about contextual and/or individual-level effects

  • Even when correct (integrated) model used, ecological data often contain little information about some or all effects of interest

  • Can we improve inference by combining both types of model / data?


[Figure: the multilevel model for individual data and the ecological model shown side by side; both graphs contain $\alpha_i$, $\sigma^2$, $\beta_c$ and $\beta_b$]

Hierarchical Related Regression (HRR) model

[Figure: a single graph in which the individual-level nodes ($x^{c}_{ik}$, $x^{b}_{ik}$, $p_{ik}$, $y_{ik}$) and the aggregate-level nodes ($X^{c}_i$, $V^{c}_i$, $X^{b}_i$, $N_i$, $q_i$, $Y_i$) share $\alpha_i$, $\sigma^2$, $\beta_c$ and $\beta_b$]
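One way to see what the combined model does is to write the joint log-likelihood as the sum of an individual-level term and an aggregate-level term that share the same area effects and regression coefficients. The sketch below is a minimal illustration under the log-link and normal-covariate assumptions of the two models; the function and argument names are invented here, priors and the random-effects distribution are omitted, and the binomial coefficient is dropped because it does not depend on the parameters.

```python
import numpy as np

def individual_loglik(y, area, x_c, x_b, alpha, beta_c, beta_b):
    """Bernoulli log-likelihood with a log link for the individual data."""
    p = np.exp(alpha[area] + beta_c * x_c + beta_b * x_b)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log1p(-p)))

def aggregate_loglik(cases, pop, mean_c, var_c, prop_b, alpha, beta_c, beta_b):
    """Binomial log-likelihood (up to a constant) using the integrated area risk."""
    q0 = np.exp(alpha + beta_c * mean_c + beta_c ** 2 * var_c / 2)
    q1 = q0 * np.exp(beta_b)
    q = np.clip(q0 * (1 - prop_b) + q1 * prop_b, 1e-12, 1 - 1e-12)
    return float(np.sum(cases * np.log(q) + (pop - cases) * np.log1p(-q)))

def hrr_loglik(individual, aggregate, alpha, beta_c, beta_b):
    """Both data sources contribute, evaluated at the same shared parameters."""
    shared = dict(alpha=alpha, beta_c=beta_c, beta_b=beta_b)
    return (individual_loglik(**individual, **shared)
            + aggregate_loglik(**aggregate, **shared))

# Tiny made-up data set: two areas, one surveyed individual in each.
individual = dict(y=np.array([1, 0]), area=np.array([0, 1]),
                  x_c=np.array([0.0, 0.0]), x_b=np.array([0, 0]))
aggregate = dict(cases=np.array([1, 0]), pop=np.array([2, 3]),
                 mean_c=np.zeros(2), var_c=np.ones(2),
                 prop_b=np.array([0.5, 0.5]))
alpha = np.log(np.array([0.1, 0.2]))
value = hrr_loglik(individual, aggregate, alpha, beta_c=0.0, beta_b=0.0)
print(value)
```

Because the two terms are simply added, either data source alone recovers the corresponding single-source model.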


Simulation Study


Comments

  • Inference from aggregate data can be unbiased provided exposure contrasts between areas are high (and appropriate integrated model used)

  • Combining aggregate data with small samples of individual data can reduce bias when exposure contrasts are low

  • Combining individual and aggregate data can reduce the MSE of estimates compared with individual data alone

  • Individual data cannot help if individual-level model is misspecified


Application to LLTI

  • Health outcome

    • Limiting Long Term Illness (LLTI) in men aged 40-59 yrs living in London

  • Exposures

    • ethnicity (white/non-white), income, area deprivation

  • Data sources

    • Aggregate: 1991 Census aggregated to ward level

    • Individual: Health Survey for England (with ward identifier)

      • 1-9 observations per ward (median 1.6)


Ward level data

[Figure: ward-level plots of prevalence of LLTI, mean income, % non white and deprivation]


Results


Concluding Remarks

  • Graphical models are a powerful and flexible tool for building realistic statistical models for complex problems

    • Applicable in many domains

    • Allow subject matter knowledge to be exploited

    • Allow formal combining of multiple data sources

    • Built on rigorous mathematics

    • Principled inferential methods

Thank you for your attention!

