##### Gene-Environment Case-Control Studies

**Gene-Environment Case-Control Studies**

Raymond J. Carroll, Department of Statistics, Faculties of Nutrition and Toxicology, Texas A&M University, http://stat.tamu.edu/~carroll

**Outline**

- Problem: Case-Control Studies with Gene-Environment relationships
- Efficient formulation when genes are observed
- Measurement errors in environmental variables
- Haplotype modeling and Robustness

**Acknowledgment**

- This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

**Acknowledgment**

- Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)

**Software**

- SAS and Matlab programs are available at my web site under the software button
- Examples are given in the programs
- http://stat.tamu.edu/~carroll

**Some Personal History**

- I was born in Japan
- The coffee table is still in my house

**Some Personal History**

- My father lived in Seoul for 2 months in 1948 and 1 year in 1968
- He took many photos of sights there, especially in 1948

**Basic Problem Formalized**

- Case control sample: D = disease
- Gene expression: G
- Environment, can include strata: X
- We are interested in main effects for G and X along with their interaction

**Prospective Models**

- Simplest logistic model
- General logistic model
- The function m(G,X,b1) is completely general

**Likelihood Function**

- The likelihood is
- Note how the likelihood depends on two things:
  - The distribution of (X,G) in the population
  - The probability of disease in the population
- Neither can be estimated from the case-control study

**When G is observed**

- The usual choice is ordinary logistic regression
- It is semiparametric efficient if nothing is known about the distribution of G, X in the population
- Why semiparametric: what is unknown is the distribution of (G,X) in the population

**When G is observed**

- Logistic regression is thus robust to any modeling assumptions about the covariates in the population
- Unfortunately it is not very efficient for understanding interactions

**Gene-Environment Independence**

- In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata
- This assumption is often used in gene-environment interaction studies

**G-E Independence**

- Does not always hold!
- Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction
- Part of this talk is to model the distribution of G given X

**Gene-Environment Independence**

- If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained.
- The reason is that you are putting a constraint on the retrospective likelihood

**More Efficiency, G Observed**

- A constraint on the population is to posit a parametric or semiparametric model for G given X
- Consequences:
  - More efficient estimation of G effects
  - Much more efficient estimation of G and (X,S) interactions.

**The Formulation**

- In the most general semiparametric setting, we have
- Question: What methods do we have to construct estimators?

**Methodology**

- We have developed two new ways of thinking about this problem
- In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation

**The Hard Way**

- Treat X as a discrete random variable whose mass points are the observed data points
- Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.

**The Hard Way**

- The maximization is not trivial to do correctly
- Result: an explicit profile likelihood that does not involve the distribution of X

**Pretend Missing Data Formulation**

- The following simple trick can be shown to be legitimate and semiparametric efficient
- Equivalently, we compute a semiparametric profiled likelihood
- Semiparametric because the distribution of X is not modeled

**Pretend Missing Data Formulation**

- The idea is to create a “pretend” study, which is one of random sampling with missing data
- We use an MAR regime.
- The “pretend” study mimics the case-control study

**Pretend Missing Data Formulation**

- Suppose you have a large but finite population of size N
- Then, there are […] with the disease
- There are […] without the disease

**Pretend Missing Data Formulation**

- In a case-control sample, we randomly select n1 with the disease, and n0 without.
- The fraction of people with disease status D=d that we observe is […]

**Pretend Missing Data Formulation**

- Then let’s make up a “pretend” study, that has random sampling with missing data
- I take a random sample
- I get to observe (D,X,G) when D=d with probability […]
- I will say that […] if I observe (D,X,G). Then […]

**Pretend Missing Data Formulation**

- In this pretend missing data formulation, ordinary logistic regression is simply
- We have a model for G given X, hence we compute
- This has a simple explicit form, as follows

**Result**

- Define
- This is the intercept that ordinary logistic regression actually estimates
- It only gets the slope right

**Result**

- Define
- Further define

**Result**

- Then, the semiparametric efficient profiled likelihood function is
- Trivial to compute.

**Result**

- In the rare disease case, we have the further simplification that

**Interesting Technical Point**

- Profile pseudo-likelihood acts like a likelihood
- Information Asymptotics are (almost) exact

**Typical Simulation Example**

- MSE Efficiency of Profile method compared to ordinary logistic regression

**Consequence #1**

- We have a formal likelihood:
- This is also a legitimate semiparametric profile likelihood
- Anything you can do with a likelihood you can do with a semiparametric profile likelihood

**Consequences #2-#3**

- Measurement Error in the Gene:
  - Handle misclassification of a covariate (the gene) as in any likelihood problem (see later)
- Measurement Error in the Environment:
  - The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.

**Advertisement**

Lobach, et al., Biometrics, in press

**Consequences #4-#5**

- Flexible Modeling of Covariate Effects:
  - Modeling some components by penalized regression splines
  - The LASSO and other likelihood-based methods apply
- Model Averaging:
  - Can entertain/average various risk models
  - Bayesian methods are asymptotically correct

**Consequence #6**

- Model Robustness:
  - One can model average/select/LASSO various models for the distribution of G given X
- Main Point: Our method results in a legitimate likelihood, hence can be treated as such

**Modeling the Gene**

- Now turn to models for the gene
- Given such models, likelihood calculations can be used for model fitting
- We will consider haplotypes

**Haplotypes**

- Haplotypes consist of what we get from our mother and father at more than one site
- Mother gives us the haplotype hm = (Am,Bm)
- Father gives us the haplotype hf = (af,bf)
- Our diplotype is Hdip = {(Am,Bm), (af,bf)}

**Haplotypes**

- Unfortunately, we cannot presently observe the two haplotypes
- We can only observe genotypes
- Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)

**Missing Haplotypes**

- However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)}
- Note that the number of copies of the (a,b) haplotype differs in these two cases
- The true diplotype (haplotype pair) is missing

**Missing Haplotypes**

- The likelihood in terms of the diplotype is
- We observe the genotypes G
- The likelihood of the observed data is

**Missing Haplotypes**

- The likelihood of the observed data is
- Note how easy this was: it is really the profiled semiparametric likelihood of the observed data

**Haplotypes**

- Danyu Lin has a nice EM-based program for estimating haplotype frequencies
- It accepts data in text format with SAS missing data conventions
- The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE)
- http://www.bios.unc.edu/~lin/hapstat/

**Haplotype Fitting**

- Models that assume haplotype-environment independence are straightforward to fit via EM
- Danyu Lin’s program can do this as well as our SAS program
- The remaining issue is how to gain robustness against deviations from this assumed independence

**Robustness**

- We build robustness by specifying models for diplotypes given the environmental variables
- We first run a program to get a preliminary estimate of haplotype frequency
- We use the most frequent haplotype as a reference haplotype
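
The phase ambiguity described in the Missing Haplotypes slides can be made concrete with a small sketch. It is an illustration only, not the SAS/Matlab software or Danyu Lin's program mentioned above: the function names (`compatible_diplotypes`, `copies`, `genotype_probability`) and the haplotype frequencies are hypothetical, and the genotype probability assumes Hardy-Weinberg equilibrium, under which a diplotype {h1, h2} has probability p1² when h1 = h2 and 2·p1·p2 otherwise. The observed-data probability of an unphased genotype is then the sum over all diplotypes consistent with it.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Enumerate the diplotypes consistent with an unphased genotype.

    genotype: one unordered allele pair per site, e.g. [("A", "a"), ("B", "b")].
    Returns a set of diplotypes, each a sorted pair of haplotype tuples.
    """
    diplotypes = set()
    # At every site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(pair[c] for pair, c in zip(genotype, choice))
        h2 = tuple(pair[1 - c] for pair, c in zip(genotype, choice))
        # Sorting makes the pair unordered: {h1, h2} and {h2, h1} coincide.
        diplotypes.add(tuple(sorted((h1, h2))))
    return diplotypes


def copies(diplotype, haplotype):
    """Number of copies (0, 1 or 2) of a given haplotype in a diplotype."""
    return sum(h == haplotype for h in diplotype)


def genotype_probability(genotype, freqs):
    """Probability of the unphased genotype, ASSUMING Hardy-Weinberg equilibrium.

    freqs: dict mapping haplotype tuples to (hypothetical) population frequencies.
    """
    total = 0.0
    for h1, h2 in compatible_diplotypes(genotype):
        p1, p2 = freqs.get(h1, 0.0), freqs.get(h2, 0.0)
        total += p1 * p1 if h1 == h2 else 2.0 * p1 * p2
    return total


# The example from the slides: the observed genotype is the unordered set (A, a, B, b).
genotype = [("A", "a"), ("B", "b")]
candidates = compatible_diplotypes(genotype)

# Made-up haplotype frequencies, for illustration only.
freqs = {("A", "B"): 0.4, ("a", "b"): 0.3, ("A", "b"): 0.2, ("a", "B"): 0.1}
prob = genotype_probability(genotype, freqs)
```

Here `candidates` contains the two diplotypes {(A,B), (a,b)} and {(A,b), (a,B)}, which carry one and zero copies of the (a,b) haplotype respectively, so the observed genotype alone cannot settle the haplotype count that enters the risk model.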