Simulating and Modeling Genetically Informative Data

Matthew C. Keller, Sarah E. Medland

Outline • The usefulness of simulation in behavioral genetics • Using GeneEvolve to simulate genetically informative data • Practical: simulating different designs • Classical Twin Design (CTD) • Nuclear Twin Family Design (NTFD)

Simulation provides knowledge about processes that are difficult/impossible to figure out analytically • Independent check of models. Especially important for complex (e.g., extended twin family) models. • Model verification: Check that your models work as they are supposed to. • Sensitivity analysis: Check the effect on parameter estimates when assumptions are violated (e.g., different modes of assortative mating, vertical transmission, or genetic action). • Method for predicting complex dynamics in population genetics

Using complex models without independent verification (e.g., simulation) is like…

Process of model verification
1. Simulate a dataset that has parameters that your model can estimate.
2. Run your model on the simulated dataset.
3. Obtain and store parameter estimates.
4. Repeat steps 1-3 many (e.g., 1000) times.

Results of model verification • If the mean parameter estimate = the simulated parameter value, the estimate is unbiased. If your model has no mistakes, parameters should generally be unbiased (there are exceptions) • The standard deviation of an estimate corresponds to its standard error, and its distribution to its sampling distribution • You can also easily study the multivariate sampling distribution and statistics, e.g., how correlated parameter estimates are.
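
The verification steps and summaries above can be sketched in a few lines. This is a minimal illustration, not GeneEvolve and not a structural equation model: it assumes a simple ACE twin simulation, and it uses Falconer's moment formulas as a stand-in for "your model". All function names, the true values (A=.5, C=.2, E=.3), and the sample sizes are hypothetical choices.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def twin_correlation(n_pairs, r_a, va, vc, ve, rng):
    """Simulate n_pairs twin pairs; return their phenotypic correlation.

    r_a is the additive genetic correlation (1 for MZ, 0.5 for DZ).
    """
    t1, t2 = [], []
    for _ in range(n_pairs):
        a_shared = rng.gauss(0, 1)
        c = rng.gauss(0, 1)  # shared environment, identical for both twins
        pair = []
        for _twin in range(2):
            a = math.sqrt(r_a) * a_shared + math.sqrt(1 - r_a) * rng.gauss(0, 1)
            e = rng.gauss(0, 1)  # unique environment
            pair.append(math.sqrt(va) * a + math.sqrt(vc) * c + math.sqrt(ve) * e)
        t1.append(pair[0])
        t2.append(pair[1])
    return pearson(t1, t2)


def falconer(r_mz, r_dz):
    """Moment estimates of A and C (assumed stand-in for the fitted model)."""
    return 2 * (r_mz - r_dz), 2 * r_dz - r_mz


TRUE_A, TRUE_C, TRUE_E = 0.5, 0.2, 0.3  # hypothetical simulated values
rng = random.Random(1)
a_hats, c_hats = [], []
for _ in range(200):  # step 4: repeat steps 1-3
    r_mz = twin_correlation(500, 1.0, TRUE_A, TRUE_C, TRUE_E, rng)  # step 1
    r_dz = twin_correlation(500, 0.5, TRUE_A, TRUE_C, TRUE_E, rng)
    a_hat, c_hat = falconer(r_mz, r_dz)  # step 2
    a_hats.append(a_hat)  # step 3
    c_hats.append(c_hat)

bias_a = statistics.fmean(a_hats) - TRUE_A  # near 0 if unbiased
bias_c = statistics.fmean(c_hats) - TRUE_C
se_a = statistics.stdev(a_hats)             # empirical standard error of A
corr_ac = pearson(a_hats, c_hats)           # how correlated the estimates are
print(f"bias A={bias_a:.3f}  bias C={bias_c:.3f}  SE A={se_a:.3f}  r(A,C)={corr_ac:.2f}")
```

With these formulas the A and C estimates come out strongly negatively correlated, because both are built from the same two twin correlations.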

Process of sensitivity analysis
1. Simulate a dataset that has one or more parameters that your model cannot estimate.
2. Run your model on the simulated dataset.
3. Obtain and store parameter estimates.
4. Repeat steps 1-3 many (e.g., 1000) times.

Results of sensitivity analysis • Because we are simulating violations of assumptions, we expect parameters to be biased. The question becomes: how biased? I.e., how big of a deal are these violations? We should be able to quantify the answers to these questions.
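
A sensitivity analysis can be sketched the same way. The example below assumes a "reality" of A=.4, D=.15, S=.15 (with E=.3 filling the remainder, an assumption) and then estimates A and C with Falconer's formulas, which assume D=0. Falconer's formulas again stand in for a structural model, and every name here is hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(n_pairs, components, rng):
    """Twin correlation when the phenotype is a sum of latent components.

    components is a list of (variance, pair_correlation) tuples. Each latent
    factor is built as sqrt(r)*shared + sqrt(1-r)*unique, so the two twins'
    factor scores correlate r.
    """
    t1, t2 = [], []
    for _ in range(n_pairs):
        p1 = p2 = 0.0
        for var, r in components:
            shared = rng.gauss(0, 1)
            p1 += math.sqrt(var) * (math.sqrt(r) * shared + math.sqrt(1 - r) * rng.gauss(0, 1))
            p2 += math.sqrt(var) * (math.sqrt(r) * shared + math.sqrt(1 - r) * rng.gauss(0, 1))
        t1.append(p1)
        t2.append(p2)
    return pearson(t1, t2)


# Simulated reality includes dominance (D), which the fitted model ignores.
VA, VD, VS, VE = 0.40, 0.15, 0.15, 0.30
MZ = [(VA, 1.0), (VD, 1.0), (VS, 1.0), (VE, 0.0)]
DZ = [(VA, 0.5), (VD, 0.25), (VS, 1.0), (VE, 0.0)]

rng = random.Random(2)
a_hats, c_hats = [], []
for _ in range(100):
    r_mz = simulate_r(500, MZ, rng)
    r_dz = simulate_r(500, DZ, rng)
    a_hats.append(2 * (r_mz - r_dz))  # A estimate under the D=0 assumption
    c_hats.append(2 * r_dz - r_mz)    # C estimate under the D=0 assumption

bias_a = statistics.fmean(a_hats) - VA  # how far A is pushed up
bias_c = statistics.fmean(c_hats) - VS  # how far the shared environment is pushed down
print(f"bias in A: {bias_a:+.3f}   bias in C: {bias_c:+.3f}")
```

The point of the exercise is the size of the two biases, not just their direction.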

Reality: A=.4, D=.15, S=.15

A, D, & F estimates are highly correlated in Stealth & Cascade models

Simulation is not a panacea • Simulation can be said to provide “knowledge without understanding.” It is a helpful tool for understanding, but doesn’t provide understanding in and of itself. • Simulations themselves rely on assumptions about how processes work. If these are wrong, our simulation results may not reflect reality.

Simulation program: GeneEvolve

GeneEvolve 0.73 • Implemented in R, open-source, user modifiable • User specifies 31 basic parameters up front (and 17 advanced ones); no need to alter script after that. • Fast (on AMD Opteron 3.2GHz dual 64 bit processor, 2GB RAM; OS= RHEL AS4) • 10 genes, N=20,000 takes ~ 20 seconds/gen Download: www.matthewckeller.com

How GeneEvolve works: User specifies: • population size, # generations for population to evolve, threshold effects, mechanisms of assortative mating, vertical transmission, etc. • 3 types of genetic effects • 5 types of environmental effects • 13 types of moderator/covariate effects

Diagram of GeneEvolve Model [path diagram omitted. Latent variables: A, D, F, S, E, and AxA; phenotypes: PFa, PMa, PT1, PT2]

Diagram: GeneEvolve Age-by-A Interactions [two path diagrams omitted. Purcell Model: a single A factor with path βA + βint(age) to P. Our Model: correlated AS and AL factors with paths βAS and βAL, plus βage and β0] See Purcell-vs-Ours.pdf; Purcell-vs-OursCorrelation.pdf

How GeneEvolve works (cont): • At adulthood, ~x% find mates such that the correlation between mates' mating phenotypes = AM • Pairs have children • Rate determined by user-specified population growth • Process iterated n times
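
One way to pair mates so that their phenotypes correlate at a target value is sketched below. This is an assumed technique for illustration, not GeneEvolve's actual algorithm: each person gets a noisy "mating score" that correlates sqrt(AM) with their standardized phenotype, and males and females are then matched by score rank. The names `pair_mates` and `am` are hypothetical, and the sketch pairs everyone rather than a fraction of the population.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_noisy_score(phenos, am, rng):
    """Return phenotypes ordered by a score that correlates sqrt(am) with them."""
    mean, sd = statistics.fmean(phenos), statistics.pstdev(phenos)
    scores = [
        math.sqrt(am) * (p - mean) / sd + math.sqrt(1 - am) * rng.gauss(0, 1)
        for p in phenos
    ]
    order = sorted(range(len(phenos)), key=scores.__getitem__)
    return [phenos[i] for i in order]


def pair_mates(males, females, am, rng):
    """Pair equal-sized lists so mates' phenotypes correlate about am (0 <= am <= 1).

    Rank-matching makes the two scores nearly identical within a couple, so the
    phenotypic correlation is roughly sqrt(am) * sqrt(am) = am.
    """
    return list(zip(rank_by_noisy_score(males, am, rng),
                    rank_by_noisy_score(females, am, rng)))


rng = random.Random(7)
males = [rng.gauss(100, 15) for _ in range(4000)]
females = [rng.gauss(100, 15) for _ in range(4000)]
couples = pair_mates(males, females, am=0.3, rng=rng)
r_mates = pearson([m for m, _ in couples], [f for _, f in couples])
print(f"target AM = 0.30, achieved = {r_mates:.2f}")
```

Setting `am=0` gives random mating and `am=1` gives perfect rank matching, so one parameter covers the whole range.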

How GeneEvolve works (cont): • After n iterations, population split into two: • Parents of spouses • Parents of twins • Parents of twins have offspring (MZ/DZ twins & their sibs) • Twins mate with spousal population & have offspring

What you get: • 3 generations of phenotypic data written out (one row per family), potentially across repeated measures • These data (& subsets of them) can be entered into structural models for model verification and sensitivity analysis • A summary PDF at the end shows: • Basic simulation statistics • Changes in variance components across time • Correlations between 10 relative types


Structural Equation Modeling (SEM) in BG • SEM is great because… • Directs focus to effect sizes, not “significance” • Forces consideration of causes and consequences • Explicit disclosure of assumptions • Potential weakness… • Parameter reification: “Using the CTD we found that 50% of variation is due to A and 20% to C.” NO! This is only true under strong assumptions that probably aren’t met (e.g., D=0) and usually go untested. To the degree the assumptions are wrong, the estimates are biased.

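Classical Twin Design (CTD) [path diagram omitted: latent A, C, E, and D with paths a, c, e, and d to twin phenotypes PT1 and PT2; cross-twin correlations labeled 1 and 1/.25]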

Classical Twin Design (CTD)
Assumption                  Biased up    Biased down
Either D or C is zero       A            C & D
No assortative mating       C            D
No A-C covariance           C            D & A
[CTD path diagram omitted]
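
The "Either D or C is zero" row can be worked through from the expected twin covariances. The derivation below assumes standardized phenotypes, random mating, and no A-C covariance, and writes the variance components as squared path coefficients; these are standard results rather than anything specific to this presentation.

```latex
\begin{align}
\mathrm{Var}(P)        &= a^2 + d^2 + c^2 + e^2 = 1 \\
r_{MZ}                 &= a^2 + d^2 + c^2 \\
r_{DZ}                 &= \tfrac{1}{2}a^2 + \tfrac{1}{4}d^2 + c^2
\end{align}
% Three observed statistics, four unknowns: one of d^2 or c^2 must be fixed to 0.
% Fitting an ACE model (d^2 fixed to 0) when d^2 > 0 gives, in large samples:
\begin{align}
\hat{a}^2 &= 2\,(r_{MZ} - r_{DZ}) = a^2 + \tfrac{3}{2}d^2 \\
\hat{c}^2 &= 2\,r_{DZ} - r_{MZ}   = c^2 - \tfrac{1}{2}d^2
\end{align}
```

For example, with a true A=.4, D=.15, and shared environment of .15, the ACE fit is expected to return roughly A=.625 and C=.075: A is biased up, while C and D (fixed at zero) are biased down.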

[Path diagram omitted: CTD with parental phenotypes PFa and PMa added] Adding parents gets us around all these assumptions • Either D or C is zero • No assortative mating • No A-C covariance • We don’t have to make these

[Path diagrams omitted: twin models with C versus with separate F and S] Parents also allow differentiation of S & F • With parents, we can break “C” up into: • S = env. factors shared only between sibs • F = familial env. factors passed from parents to offspring

Nuclear Twin Family Design (NTFD) [path diagram omitted. Latent variables: A, D, F, S, and E; phenotypes: PFa, PMa, PT1, PT2] Note: m estimated and f fixed to 1 • Assumptions: • Only 3 of the 4 parameters A, D, S, and F can be estimated (bias is variable) • Assortative mating is due to primary phenotypic assortment (bias is variable)

Stealth • Include twins and their sibs, parents, spouses, and offspring… • Gives 17 unique covariances (MZ, DZ, Sib, P-O, Spousal, MZ avunc, DZ avunc, MZ cous, DZ cous, GP-GO, and 7 in-laws) • 88 covariances with sex effects

Additional obs. covs with Stealth allow estimation of A, S, D, F, T • A, S, D, F, and T can be estimated simultaneously • T = env. factors shared only between twins [path diagram omitted: twin phenotypes PT1 and PT2 with latent T added] (Remember: we’re not just estimating more effects. More importantly, we’re reducing the bias in estimated effects!)

Stealth [path diagram omitted. Latent variables: A, D, F, S, T, and E; phenotypes: PFa, PMa, PT1, PT2, and PCh]


Stealth
Assumption                    Biased up     Biased down
Primary assortative mating    A, D, or F    A, D, or F
No epistasis                  A, D          S
No AxAge                      D, S          A
• Primary AM: mates choose each other based on phenotypic similarity
• Social homogamy: mates choose each other due to environmental similarity (e.g., religion)
• Convergence: mates become more similar to each other (e.g., becoming more conservative when dating a conservative)

Cascade [path diagram omitted. Latent variables: A, D, F, S, T, and E; phenotypes: PFa, PMa, PT1, PT2, PSp, and PCh]

Reality: A=.5, D=.2

Reality: A=.5, S=.2

Reality: A=.4, D=.15, S=.15

Reality: A=.35, D=.15, F=.2, S=.15, T=.15, AM=.3

Reality: A=.45, D=.15, F=.25, AM=.3 (Soc Hom)

Reality: A=.4, A*A=.15, S=.15

Reality: A=.4, A*Age=.15, S=.15

Conclusions • All models require assumptions. Generally, more assumptions = more biased estimates • For the first time, we have demonstrated independent assessments of the NTFD, Stealth, and Cascade models • These complicated models work as designed! • In all models, but especially the CTD, please don’t REIFY A, C, & D!

Acknowledgments • Those who conceived of these models originally: • Jinks, Fulker, Eaves, Cloninger, Reich, Rice, Heath, Neale, Maes, etc. • And to Nick Martin: for his energy and enthusiasm, and for encouraging us to do this to begin with

Why use it? Modeling aid • Check bias & identification: • Feed PE the parameters you are modeling, simulate data, & see if your model recovers the parameters • Check model’s sensitivity to assumptions: • Simulate violations of assumptions & note their effects on estimates • Estimate power & multivariate sampling dist’s of estimates under very general conditions: • Run PE multiple times given whatever condition you want
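
Power can be estimated by simulation as the share of replicate datasets in which a test rejects the null. The sketch below is an illustration under stated assumptions: twin pairs are drawn directly from a bivariate normal with the ACE-implied correlation (total variance 1), and a one-sided Fisher z test of r_MZ > r_DZ stands in for a likelihood-ratio test of A. The function names and all settings are hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulated_r(n_pairs, rho, rng):
    """Sample correlation of n_pairs drawn from a bivariate normal with correlation rho."""
    t1, t2 = [], []
    for _ in range(n_pairs):
        x = rng.gauss(0, 1)
        t1.append(x)
        t2.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    return pearson(t1, t2)


def estimate_power(n_pairs, va, vc, reps, rng, alpha=0.05):
    """Share of replicates in which r_MZ is significantly larger than r_DZ."""
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha)
    se = math.sqrt(2 / (n_pairs - 3))  # SE of a difference of two Fisher z values
    rejections = 0
    for _ in range(reps):
        r_mz = simulated_r(n_pairs, va + vc, rng)        # implied MZ correlation
        r_dz = simulated_r(n_pairs, 0.5 * va + vc, rng)  # implied DZ correlation
        z = (math.atanh(r_mz) - math.atanh(r_dz)) / se
        rejections += z > z_crit
    return rejections / reps


rng = random.Random(11)
power = estimate_power(n_pairs=200, va=0.5, vc=0.2, reps=200, rng=rng)
false_positive_rate = estimate_power(n_pairs=200, va=0.0, vc=0.2, reps=200, rng=rng)
print(f"power = {power:.2f}, rejection rate with A=0: {false_positive_rate:.2f}")
```

Running the same function with A=0 checks that the rejection rate stays near alpha, which is a quick sanity check on the test itself.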

Why use it? Predictor of population / evolutionary genetics dynamics • Find changes in variance parameters & relative covariances under different modes of AM, VT, & genetic effects: • Simulate random genetic drift by varying population size • Introduce selection (coming) to test theories on maintenance of genetic variation
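
The effect of population size on drift can be illustrated with a bare-bones Wright-Fisher model of a single biallelic locus. This is an assumed textbook model, not GeneEvolve's implementation; the function names, population sizes, and number of generations are hypothetical.

```python
import random
import statistics


def drift(pop_size, generations, p0, rng):
    """Allele frequency after Wright-Fisher sampling of 2N gene copies per generation."""
    p = p0
    copies = 2 * pop_size
    for _ in range(generations):
        # Each gene copy in the next generation is drawn with probability p.
        p = sum(rng.random() < p for _ in range(copies)) / copies
    return p


def drift_variance(pop_size, generations=20, p0=0.5, reps=100, seed=3):
    """Variance of the final allele frequency across independent replicate populations."""
    rng = random.Random(seed)
    finals = [drift(pop_size, generations, p0, rng) for _ in range(reps)]
    return statistics.pvariance(finals)


var_small = drift_variance(pop_size=20)
var_large = drift_variance(pop_size=200)
print(f"variance of allele frequency: N=20 -> {var_small:.4f}, N=200 -> {var_large:.4f}")
```

Replicate populations diverge much more when N is small, which is the pattern that varying population size in a simulation is meant to expose.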
