
Gene-Environment Case-Control Studies

Raymond J. Carroll, Department of Statistics, Faculties of Nutrition and Toxicology, Texas A&M University. http://stat.tamu.edu/~carroll


Presentation Transcript


  1. Gene-Environment Case-Control Studies • Raymond J. Carroll • Department of Statistics • Faculties of Nutrition and Toxicology • Texas A&M University • http://stat.tamu.edu/~carroll

  2. Outline • Problem: Case-Control Studies with Gene-Environment relationships • Efficient formulation when genes are observed • Measurement errors in environmental variables • Haplotype modeling and Robustness

  3. Acknowledgment • This work is joint with Nilanjan Chatterjee (NCI) and Yi-Hau Chen (Academia Sinica)

  4. Acknowledgment • Further work is joint with Mitchell Gail (NCI), Iryna Lobach (Yale) and Bhramar Mukherjee (Michigan)

  5. Software • SAS and Matlab programs are available at my web site under the software button • Examples are given in the programs • http://stat.tamu.edu/~carroll

  6. Some Personal History • I was born in Japan • The coffee table is still in my house

  7. Some Personal History • My father lived in Seoul for 2 months in 1948 and 1 year in 1968 • He took many photos of sights there, especially in 1948

  8. Joonghwa moon at Deoksugung, 1948

  9. Joonghwa moon at Deoksugung, today

  10. The Prices of Drinks Were Pretty Low

  11. Basic Problem Formalized • Case control sample: D = disease • Gene expression: G • Environment, can include strata: X • We are interested in main effects for G and X along with their interaction

  12. Prospective Models • Simplest logistic model • General logistic model • The function m(G,X,b1) is completely general
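The two displayed models are missing from the transcript. A sketch of what they plausibly look like, assuming H denotes the logistic distribution function, that β1 corresponds to the slide's b1, and that the coefficient names in the first display are illustrative rather than the speaker's:

```latex
% Simplest case: main effects for G and X plus their interaction (coefficient names assumed)
\operatorname{pr}(D = 1 \mid G, X) = H(\beta_0 + \beta_G G + \beta_X X + \beta_{GX} G X),
\qquad H(u) = \frac{1}{1 + e^{-u}}

% General case: m is an arbitrary known function of (G, X) and the parameter \beta_1
\operatorname{pr}(D = 1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}
```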

  13. Likelihood Function • The likelihood is [equation not preserved] • Note how the likelihood depends on two things: • The distribution of (X,G) in the population • The probability of disease in the population • Neither can be estimated from the case-control study alone
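The displayed likelihood did not survive extraction. By Bayes' rule, the retrospective likelihood of a case-control sample takes the following form; the symbol f(g, x) for the joint population distribution of (G, X) is an assumption of this sketch, and the integral is read as a sum over g when G is discrete:

```latex
% Retrospective likelihood of n_1 cases and n_0 controls
L = \prod_{i=1}^{n_0 + n_1} \operatorname{pr}(G_i, X_i \mid D_i),
\qquad
\operatorname{pr}(G = g, X = x \mid D = d)
  = \frac{\operatorname{pr}(D = d \mid g, x)\, f(g, x)}{\operatorname{pr}(D = d)}

% The marginal disease probability averages the risk model over the population
\operatorname{pr}(D = d) = \int \operatorname{pr}(D = d \mid g, x)\, f(g, x)\, dg\, dx
```

Both f(g, x) and pr(D = d) appear explicitly, which is why the likelihood depends on two population quantities that the case-control sample does not identify.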

  14. When G is observed • The usual choice is ordinary logistic regression • It is semiparametric efficient if nothing is known about the distribution of G, X in the population • Why semiparametric: what is unknown is the distribution of (G,X) in the population

  15. When G is observed • Logistic regression is thus robust to any modeling assumptions about the covariates in the population • Unfortunately it is not very efficient for understanding interactions

  16. Gene-Environment Independence • In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata • This assumption is often used in gene-environment interaction studies

  17. G-E Independence • Does not always hold! • Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction • Part of this talk is to model the distribution of G given X

  18. Gene-Environment Independence • If you’re willing to make assumptions about the distributions of the covariates in the population, more efficiency can be obtained. • The reason is that you are putting a constraint on the retrospective likelihood

  19. More Efficiency, G Observed • A constraint on the population is to posit a parametric or semiparametric model for G given X • Consequences: • More efficient estimation of G effects • Much more efficient estimation of G and (X,S) interactions.

  20. The Formulation • In the most general semiparametric setting, we have [equation not preserved] • Question: What methods do we have to construct estimators?

  21. Methodology • We have developed two new ways of thinking about this problem • In ordinary logistic regression case-control studies, they reduce to the Prentice-Pyke formulation

  22. The Hard Way • Treat X as a discrete random variable whose mass points are the observed data points • Holding all parameters fixed, maximize the retrospective likelihood to estimate the probabilities of the X values.

  23. The Hard Way • The maximization is not trivial to do correctly • Result: an explicit profile likelihood that does not involve the distribution of X

  24. Pretend Missing Data Formulation • The following simple trick can be shown to be legitimate and semiparametric efficient • Equivalently, we compute a semiparametric profiled likelihood • Semiparametric because the distribution of X is not modeled

  25. Pretend Missing Data Formulation • The idea is to create a “pretend” study, which is one of random sampling with missing data • We use an MAR regime. • The “pretend” study mimics the case-control study

  26. Pretend Missing Data Formulation • Suppose you have a large but finite population of size N • Then there are N1 = N pr(D=1) people with the disease • There are N0 = N pr(D=0) without the disease

  27. Pretend Missing Data Formulation • In a case-control sample, we randomly select n1 with the disease, and n0 without. • The fraction of people with disease status D=d that we observe is nd/Nd

  28. Pretend Missing Data Formulation • Then let’s make up a “pretend” study, that has random sampling with missing data • I take a random sample • I get to observe (D,X,G) when D=d with probability nd/Nd • I will say that an indicator equals 1 if I observe (D,X,G). Then [equation not preserved]
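The effect of the pretend sampling scheme can be checked numerically. The sketch below assumes that a record with disease status d is observed with probability n_d / N_d, where N_d is the number of people in the population with D = d, and applies Bayes' rule to get the disease probability among observed records. All numbers and function names are hypothetical, chosen only for illustration.

```python
import math


def expit(u):
    """Logistic distribution function H(u)."""
    return 1.0 / (1.0 + math.exp(-u))


def logit(p):
    """Inverse of expit."""
    return math.log(p / (1.0 - p))


def observed_disease_prob(p_disease, n1, n0, N1, N0):
    """pr(D=1 | covariates, record observed) in the pretend study.

    p_disease is the population probability pr(D=1 | G, X); a record with
    D=d is kept with probability n_d / N_d (an assumption of this sketch).
    """
    keep1 = n1 / N1
    keep0 = n0 / N0
    numerator = keep1 * p_disease
    return numerator / (numerator + keep0 * (1.0 - p_disease))


# Hypothetical population and sample sizes
N1, N0 = 2_000, 98_000
n1, n0 = 500, 500
beta0, linear_part = -4.0, 0.7  # linear_part stands in for m(G, X, beta1)

p_population = expit(beta0 + linear_part)
p_observed = observed_disease_prob(p_population, n1, n0, N1, N0)

# Selection only shifts the intercept; the covariate part is untouched
intercept_shift = math.log(n1 / n0) - math.log(N1 / N0)
```

Among observed records the log odds of disease equal `beta0 + linear_part + intercept_shift`, so logistic regression on the observed data recovers the covariate effects but a shifted intercept.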

  29. Pretend Missing Data Formulation • In this pretend missing data formulation, ordinary logistic regression is simply [equation not preserved] • We have a model for G given X, hence we compute [equation not preserved] • This has a simple explicit form, as follows

  30. Result • Define [equation not preserved] • This is the intercept that ordinary logistic regression actually estimates • Ordinary logistic regression only gets the slope right

  31. Result • Define [equation not preserved] • Further define [equation not preserved]
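The definitions on these slides were lost in extraction. A reconstruction in the notation of the published Chatterjee and Carroll formulation, offered as a sketch rather than as the slides' exact content, with q(g | x, θ) denoting the assumed model for G given X and π_d = pr(D = d):

```latex
% Intercept targeted by ordinary logistic regression on case-control data
\kappa = \beta_0 + \log(n_1 / n_0) - \log(\pi_1 / \pi_0), \qquad \pi_d = \operatorname{pr}(D = d)

% Building block of the profile likelihood; q(g \mid x, \theta) models G given X
S(d, g, x, \Omega) = q(g \mid x, \theta)\,
  \frac{\exp[d\{\kappa + m(g, x, \beta_1)\}]}{1 + \exp\{\beta_0 + m(g, x, \beta_1)\}},
\qquad \Omega = (\beta_0, \beta_1, \kappa, \theta)
```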

  32. Result • Then, the semiparametric efficient profiled likelihood function is [equation not preserved] • Trivial to compute.
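To illustrate why such a likelihood is easy to compute, here is a minimal sketch. It assumes the profile likelihood has the form of a product over subjects of S(D_i, G_i, X_i) divided by the sum of S(d, g, X_i) over d in {0, 1} and all genotype values g, where S(d, g, x) = q(g) exp{d(κ + m)} / [1 + exp(β0 + m)]. The specific risk function m, the gene-environment independence model q, and every name below are assumptions of the example, not the speaker's software.

```python
import numpy as np


def log_profile_likelihood(d, g, x, beta0, beta, kappa, theta, g_values=(0, 1, 2)):
    """Log of prod_i S(D_i, G_i, X_i) / sum_{d', g'} S(d', g', X_i).

    Assumptions made for this sketch:
      * m(g, x, beta) = beta[0]*g + beta[1]*x + beta[2]*g*x
      * G is independent of X with pr(G = g_values[k]) = theta[k]
    """
    d = np.asarray(d, dtype=float)
    g = np.asarray(g, dtype=float)
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    g_values = np.asarray(g_values, dtype=float)

    def log_s(dd, gg, xx, log_q):
        m = beta[0] * gg + beta[1] * xx + beta[2] * gg * xx
        # log(1 + exp(u)) computed stably as logaddexp(0, u)
        return log_q + dd * (kappa + m) - np.logaddexp(0.0, beta0 + m)

    # log q(G_i): look up the frequency of each observed genotype
    idx = np.array([int(np.flatnonzero(g_values == gi)[0]) for gi in g])
    numerator = log_s(d, g, x, np.log(theta[idx]))

    # Denominator: sum S over d' in {0, 1} and every genotype value g'
    terms = [
        log_s(dd, gv, x, np.log(tk))
        for dd in (0.0, 1.0)
        for gv, tk in zip(g_values, theta)
    ]
    denominator = np.logaddexp.reduce(np.stack(terms), axis=0)
    return float(np.sum(numerator - denominator))


# Hypothetical toy data and parameter values, for illustration only
value = log_profile_likelihood(
    d=[1, 0, 1, 0], g=[1, 0, 2, 0], x=[0.5, -1.0, 0.2, 1.3],
    beta0=-3.0, beta=(0.2, 0.5, 0.4), kappa=-0.5, theta=(0.5, 0.4, 0.1),
)
```

In practice the function would be handed to a numerical optimizer over (beta0, beta, kappa, theta). Only sums over the finite set of genotype values are needed, and the distribution of X never has to be modeled.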

  33. Result • In the rare disease case, we have the further simplification that [equation not preserved]
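The simplification itself is missing from the transcript. Assuming the building block S(d, g, x, Ω) = q(g | x, θ) exp[d{κ + m(g, x, β1)}] / [1 + exp{β0 + m(g, x, β1)}], with κ the logistic-regression intercept and q the model for G given X, a rare disease makes the denominator close to one:

```latex
% When disease is rare, 1 + \exp\{\beta_0 + m(g, x, \beta_1)\} \approx 1, so
S(d, g, x, \Omega) \approx q(g \mid x, \theta)\, \exp[d\{\kappa + m(g, x, \beta_1)\}]
```

Under this approximation β0 no longer appears separately; only κ, β1 and θ remain.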

  34. Interesting Technical Point • Profile pseudo-likelihood acts like a likelihood • Information Asymptotics are (almost) exact

  35. Typical Simulation Example • MSE Efficiency of Profile method compared to ordinary logistic regression

  36. Typical Empirical Example

  37. Consequence #1 • We have a formal likelihood: [equation not preserved] • This is also a legitimate semiparametric profile likelihood • Anything you can do with a likelihood you can do with a semiparametric profile likelihood

  38. Consequences #2-#3 • Measurement Error in the Gene: • Handle misclassification of a covariate (the gene) as in any likelihood problem (see later) • Measurement Error in the Environment : • The structural approach, wherein you specify a flexible model for covariates measured with error, is applicable.

  39. Advertisement Lobach, et al., Biometrics, in press

  40. Consequences #4-#5 • Flexible Modeling of Covariate Effects: • Modeling some components by penalized regression splines • The LASSO and other likelihood-based methods apply • Model Averaging: • Can entertain/average various risk models • Bayesian methods are asymptotically correct

  41. Consequence #6 • Model Robustness: • One can model average/select/LASSO various models for the distribution of G given X • Main Point: Our method results in a legitimate likelihood, hence can be treated as such

  42. Modeling the Gene • Now turn to models for the gene • Given such models likelihood calculations can be used for model fitting • We will consider haplotypes

  43. Haplotypes • Haplotypes consist of what we get from our mother and father at more than one site • Mother gives us the haplotype hm = (Am,Bm) • Father gives us the haplotype hf = (af,bf) • Our diplotype is Hdip = {(Am,Bm), (af,bf)}

  44. Haplotypes • Unfortunately, we cannot presently observe the two haplotypes • We can only observe genotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b)

  45. Missing Haplotypes • Thus, if we were really Hdip = {(Am,Bm), (af,bf)}, then the data we would see would simply be the unordered set (A,a,B,b) • However, this is also consistent with a different diplotype, namely Hdip = {(am,Bm), (Af,bf)} • Note that the number of copies of the (a,b) haplotype differs in these two cases • The true diplotype (the haplotype pair) is missing
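The phase ambiguity can be made concrete by enumerating every diplotype that is consistent with an unphased genotype. In this sketch each allele is coded 0 or 1 (here A = 0, a = 1, B = 0, b = 1) and a genotype is the count of 1-alleles at each locus; the coding and the function name are assumptions of the example.

```python
from itertools import combinations_with_replacement, product


def compatible_diplotypes(genotype):
    """Return unordered haplotype pairs consistent with an unphased genotype.

    genotype[j] is the number of copies (0, 1 or 2) of the allele coded 1 at
    locus j; a haplotype is a tuple of 0/1 alleles, one entry per locus.
    """
    n_loci = len(genotype)
    haplotypes = list(product((0, 1), repeat=n_loci))
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if all(a + b == c for a, b, c in zip(h1, h2, genotype))
    ]


# Double heterozygote: the observed data are the unordered set (A, a, B, b),
# which is the genotype (1, 1) in this coding.
pairs = compatible_diplotypes((1, 1))
# pairs == [((0, 0), (1, 1)), ((0, 1), (1, 0))]
```

The first pair, {(A,B), (a,b)}, carries one copy of the (a,b) haplotype and the second, {(A,b), (a,B)}, carries none, which is the ambiguity described on the slide.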

  46. Missing Haplotypes • The likelihood in terms of the diplotype is [equation not preserved] • We observe the genotypes G • The likelihood of the observed data is [equation not preserved]

  47. Missing Haplotypes • The likelihood of the observed data is [equation not preserved] • Note how easy this was: it is really the profiled semiparametric likelihood of the observed data
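The two likelihoods are not in the transcript. A sketch of their likely form, assuming S(d, h, x) is the profile-likelihood building block (the model for diplotype h given x multiplied by the disease-risk term) and that 𝓗(G) denotes the set of diplotypes compatible with genotype G; both symbols are assumptions of this sketch:

```latex
% Contribution of subject i if the diplotype H_i^{dip} were observed
L_i^{\text{complete}} = \frac{S(D_i, H_i^{dip}, X_i)}{\sum_{d} \sum_{h} S(d, h, X_i)}

% Observed-data contribution: sum over the diplotypes compatible with genotype G_i
L_i^{\text{observed}} = \frac{\sum_{h \in \mathcal{H}(G_i)} S(D_i, h, X_i)}{\sum_{d} \sum_{h} S(d, h, X_i)}
```

Missing phase is handled by summing the numerator over compatible diplotypes, exactly as in any likelihood problem with missing data.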

  48. Haplotypes • Danyu Lin has a nice EM-based program for estimating haplotype frequencies • It accepts data in text format with SAS missing data conventions • The program is flexible, and for example it can assume Hardy-Weinberg equilibrium (HWE) • http://www.bios.unc.edu/~lin/hapstat/
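For readers unfamiliar with the idea, the following is a minimal textbook-style EM sketch for haplotype frequencies from unphased genotypes under HWE. It is not Danyu Lin's program and does not reproduce its interface; the 0/1 allele coding, the function name, and the toy data are all assumptions of the example.

```python
from itertools import combinations_with_replacement, product


def em_haplotype_frequencies(genotypes, n_iter=200):
    """Estimate haplotype frequencies by EM, assuming HWE.

    genotypes is a list of tuples; entry j of a tuple is the number of copies
    (0, 1 or 2) of the allele coded 1 at locus j.
    """
    n_loci = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_loci))
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}

    # Diplotypes (unordered haplotype pairs) compatible with each genotype
    compatible = [
        [
            (h1, h2)
            for h1, h2 in combinations_with_replacement(haplotypes, 2)
            if all(a + b == c for a, b, c in zip(h1, h2, g))
        ]
        for g in genotypes
    ]

    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for pairs in compatible:
            # E-step: HWE probability of each compatible pair, then normalize
            weights = [
                (1.0 if h1 == h2 else 2.0) * freq[h1] * freq[h2]
                for h1, h2 in pairs
            ]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        n_chromosomes = 2.0 * len(genotypes)
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


# Hypothetical toy data in which every genotype has a unique phase
toy_freq = em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (0, 1)])
```

With phase-unambiguous data like the toy example, the EM estimate reduces to simple haplotype counting; the iteration matters only when some genotypes are compatible with more than one diplotype.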

  49. Haplotype Fitting • Models that assume haplotype-environment independence are straightforward to fit via EM • Danyu Lin’s program can do this as well as our SAS program • The remaining issue is how to gain robustness against deviations from this assumed independence

  50. Robustness • We build robustness by specifying models for diplotypes given the environmental variables • We first run a program to get a preliminary estimate of haplotype frequency • We use the most frequent haplotype as a reference haplotype
