Deletion Diagnostics for detection of influential observations from a Generalised Linear Mixed Model

Presentation Transcript


  1. Deletion Diagnostics for detection of influential observations from a Generalised Linear Mixed Model

  2. B. Ganguli, S. Sen Roy, Dept of Statistics, University of Calcutta, India. M. Naskar, National Institute of Research for Jute and Allied Fibre Technology, India. E. J. Malloy, Dept of Statistics, American University, USA. E. A. Eisen, Depts of Environmental Health, Harvard University & Environmental Health Sciences, UC Berkeley, USA.

  3. Motivation • Need to simultaneously model nonlinear dose-response relationships and account for outliers and influential observations that may affect these relationships; both are common problems in environmental epidemiology. • Heterogeneity in response to toxic exposures is a possible explanation for outliers in models of the health effects of environmental exposures. It may lead to unusual shapes of the dose-response relationship; for example, healthy survivors may experience the largest exposure levels.

  4. Example: Silica Exposure Study (Checkoway et al., 1997, Amer. J. Epidemiology) • Cohort mortality study of 2342 male workers exposed to crystalline silica (cristobalite) in a diatomaceous earth mining and processing facility in California. • Study period: 1942–1994. • Workers had been employed for at least 12 months. • Mortality excesses detected for: • Nonmalignant respiratory diseases (NMRD) • Lung cancer

  5. 77 deaths from lung cancer in the cohort during the follow-up period. Q. 1: Do the outliers and influential observations occur only at the high extremes of exposure? Q. 2: How do these outliers/influential observations affect the dose-response relationship? Study using (i) GLM model (ii) deletion diagnostics.

  6. Linear Model: $E(y) = \mu = X\alpha$, $\alpha$ fixed effect. • Linear Mixed Model: $E(y \mid b) = X\alpha + Zb$, $b$ random effect (accounts for correlation). • Generalized Linear Model: $\eta = g(\mu) = X\alpha$, $g(\cdot)$ some function of $\mu$ (accounts for non-linearity). • Combining both extensions gives the Generalized Linear Mixed Model.

  7. Generalized Linear Mixed Model • $n$ individuals • response: $y_i$ • covariates: $x_i$ associated with fixed effects, $z_i$ associated with random effects • $\alpha$: $p$-vector of fixed effects • $b$: $q$-vector of random effects • The fixed effect models the mean of $y$, whereas the random effect governs the variance-covariance structure.

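  8. Model • $E(y_i \mid b) = \mu_i$ and $\mathrm{Var}(y_i \mid b) = a_i\, v(\mu_i)$, with the $a_i$'s known scalars and $v(\cdot)$ a known function • linear predictor: $\eta_i = g(\mu_i) = x_i\alpha + z_i b$, with $g(\cdot)$ some known function • $b \sim N(0, D)$, where $D = \mathrm{diag}(\theta_j I_{q_j})_{j=1,\dots,k}$ and $\sum_j q_j = q$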

  9. $Y = \eta + (y - \mu)\, g'(\mu)$ • assuming canonical link: $W = \mathrm{diag}(a_i\, g'(\mu_i))$, $V = W^{-1} + ZDZ^\top$, $Q = V^{-1} - V^{-1}X(X^\top V^{-1}X)^{-1}X^\top V^{-1}$
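As a rough illustration of the quantities on this slide, the sketch below builds the working vector $Y$, the matrix $V$ and the projection-type matrix $Q$ for a Poisson model with log link. Everything specific here is an assumption rather than something the slides state: the function name `working_quantities`, the Poisson/log choice, and in particular the diagonal term of $V$, which is taken to be the conditional variance of the working vector, $a_i\, g'(\mu_i)$, following the usual penalized quasi-likelihood argument.

```python
import numpy as np

def working_quantities(y, X, Z, D, alpha, b, a=None):
    """Working vector Y and matrices V, Q for a Poisson GLMM with log link.

    Assumption (not from the slides): Var(y_i | b) = a_i * mu_i, so the
    working vector has conditional variance a_i * g'(mu_i) = a_i / mu_i.
    """
    n = len(y)
    a = np.ones(n) if a is None else np.asarray(a, dtype=float)
    eta = X @ alpha + Z @ b            # linear predictor
    mu = np.exp(eta)                   # inverse of the canonical log link
    g_prime = 1.0 / mu                 # g'(mu) for g = log
    Y = eta + (y - mu) * g_prime       # working vector
    cond_var = a * g_prime             # diagonal term of V (assumed form)
    V = np.diag(cond_var) + Z @ D @ Z.T
    V_inv = np.linalg.inv(V)
    XtVi = X.T @ V_inv
    # Q = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}
    Q = V_inv - XtVi.T @ np.linalg.solve(XtVi @ X, XtVi)
    return Y, V, Q
```

A quick sanity check on any such implementation is that $QX = 0$, which follows directly from the definition of $Q$.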

  10. Normal equations • Let $Z = [Z_1, \dots, Z_k]$ • (1) [equation not preserved in the transcript] • $Y^\top Q Z_j Z_j^\top Q Y = \mathrm{trace}(Q Z_j Z_j^\top)$ (2) • Implication: fitting a series of linear models on transforms of the original data

  11. Deletion Diagnostic • Delete one observation at a time and re-fit the model. • Observe the differences: dfbeta = full-set estimate − deleted-set estimate; dffit = full-set predictor − deleted-set predictor. • If these are large, the deleted observation has an unusually large impact on the estimates and hence is an influential observation.
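The delete-and-refit idea can be sketched for an ordinary linear model, which is a simplification chosen here for brevity since the slides concern the GLMM. The helper names `ols` and `brute_force_deletion` and the simulated data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def brute_force_deletion(X, y):
    """dfbeta (n x p) and dffit (n,) obtained by refitting with each row left out."""
    n, p = X.shape
    beta_full = ols(X, y)
    dfbeta = np.empty((n, p))
    dffit = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_del = ols(X[keep], y[keep])
        dfbeta[i] = beta_full - beta_del   # full-set minus deleted-set estimate
        dffit[i] = X[i] @ dfbeta[i]        # same difference on the predictor at x_i
    return dfbeta, dffit

# Hypothetical data with one planted outlier in the first row.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.uniform(size=30)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=30)
y[0] += 5.0
dfbeta, dffit = brute_force_deletion(X, y)
```

This costs one extra fit per observation, which is what makes the approach expensive when each fit is itself iterative.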

  12. Q: Given n observations, do we then need to fit the model (n+1) times to identify the influential observations? Iterative techniques are required to solve the normal equations, so even a single fit takes considerable time, and (n+1) fits would be computationally expensive, particularly if n is large. • No, we only need to fit the model once, with the full data set. • The dfbeta and dffit can be obtained from the leverages and residuals of this single fit.
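For ordinary least squares the single-fit shortcut is a standard identity: with leverage $h_i$ and residual $e_i$, the dfbeta for observation $i$ is $(X^\top X)^{-1}x_i\, e_i/(1-h_i)$ and the dffit is $h_i e_i/(1-h_i)$. The sketch below uses that linear-model identity as an analogue only; it is not the GLMM expression derived in the talk, and the function name is hypothetical.

```python
import numpy as np

def single_fit_deletion(X, y):
    """dfbeta and dffit for ordinary least squares from one full-data fit."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages x_i'(X'X)^{-1}x_i
    scale = resid / (1.0 - h)
    dfbeta = (X @ XtX_inv) * scale[:, None]       # row i: (X'X)^{-1} x_i e_i / (1 - h_i)
    dffit = h * scale
    return dfbeta, dffit
```

The results agree with explicit delete-and-refit up to floating-point error, while requiring only one matrix inversion.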

  13. Question: How do we know that the dfbeta and dffit are large enough to identify the corresponding observation as influential? • The expressions can be suitably standardized, and critical values can then be derived using simulation techniques.
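One way such a simulation could look is sketched below for a linear model, using Cook's distance as the standardized measure. The setting, the function names, the number of replicates and the 95% level are all assumptions for illustration; the slides do not give the procedure actually used.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an ordinary least-squares fit."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    s2 = resid @ resid / (n - p)
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)

def simulated_critical_value(X, n_sim=2000, level=0.95, seed=0):
    """Upper `level` quantile of max_i Cook's distance when no outlier is present."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_sim)
    for s in range(n_sim):
        # For OLS, Cook's distance does not depend on the true coefficients
        # or the error scale, so standard normal noise is enough.
        y_sim = rng.normal(size=X.shape[0])
        maxima[s] = cooks_distance(X, y_sim).max()
    return np.quantile(maxima, level)
```

An observed Cook's distance above the returned value would then be flagged, with the level controlling how often a clean data set produces a false alarm.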

  14. Aims of this study • To derive the dfbeta and dffit for the GLMM. • To derive the impact of deletion on the variance components (generally ignored in such studies). • To study the probabilistic behaviour of the residuals so that variances of dfbeta and dffit can be derived and standardization can be done. • To apply the results to simulated and real-life data sets to assess their performance.

  15. Define • $B = (X^\top V^{-1}X)^{-1}X^\top V^{-1} = [B_1, \dots, B_n]$, $Q = [Q_1, \dots, Q_n]$
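To show how the columns $B_i$ and $Q_i$ can enter a deletion formula, the sketch below treats $V$ as known and fixed, which is an assumption made here; in the GLMM, $V$ depends on the estimates and the variance components. Under that assumption, deleting observation $i$ from a generalized least-squares fit changes the fixed-effect estimate by $B_i\, Q_i^\top Y / Q_{ii}$, a consequence of the mean-shift outlier argument. This is not claimed to be the exact result presented in the talk, and `gls_dfbeta` is a hypothetical name.

```python
import numpy as np

def gls_dfbeta(X, V, Y):
    """Row i is alpha_hat - alpha_hat_(i) for GLS with V treated as known."""
    V_inv = np.linalg.inv(V)
    B = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv)   # p x n, columns B_i
    Q = V_inv - V_inv @ X @ B                           # n x n, columns Q_i
    shift = (Q @ Y) / np.diag(Q)                        # Q_i'Y / Q_ii
    return (B * shift).T                                # n x p
```

The deleted-set fit that this reproduces is GLS on the remaining observations, using $V$ with row and column $i$ removed.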

  16. Result : • Standardized residuals : (Cook’s distance)

  17. Application to the Silica Exposure data set • Cox's hazard model: $h(t \mid x) = h_0(t)\exp(\beta\,\mathrm{hisp} + f(x))$ • $t$: age at which the subject died of lung cancer • hisp: indicator of whether the subject was Hispanic or not • $x$: cumulative silica exposure • $f(\cdot)$: unknown smooth function

  18. Cook’s distance for the silica data

  19. Standardized dfbeta residual of variance of random effects

  20. Outliers and influential observations need not occur at the highest extremes of exposure (in fact, all the observations identified as outliers with regard to the fit correspond to low exposure). • A distinction can be made between outliers/influential observations which affect the fit and those which affect the variance of the random component (the latter are mostly those with high exposures). • The individual with the highest exposure does not affect the fit but affects the variance component.

  21. Log hazard with and without outliers for fitted values

  22. Log hazard with and without outliers for both fitted values and variance estimates

  23. The outliers in the variance component affect the shape of the hazard function and are generally associated with high exposure levels. • The outliers in the fit do not change the shape too much except for a sharper dip at the higher exposure levels. These outliers can occur even at low exposure levels.

  24. Clustered Data Example (a simulation study) • $k$ clusters (say, hospitals) • data in the form of counts: $y_{ij} \sim \mathrm{Poisson}(\mu_{ij})$, $i = 1, \dots, k$, $j = 1, \dots, n_i$, with $\log(\mu_{ij}) = b_i + \beta x_{ij}$ • all observations in the $i$th cluster share the same intercept $b_i$ • $x_{ij}$: subject-specific covariate

  25. Set • $k = 9$ • $\beta = 0.5$ • $n_i = 100$ • $b_i$ generated from $N(0, 0.5)$ • $x_{ij}$ generated from the uniform $[0, 1]$ distribution • $y_{ij}$ then generated for clusters 1 and 3–9 • Cluster 2 observations generated using a comparatively high $\mu_{ij} = 20$
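The simulation design on this slide could be generated along the following lines. Two readings are assumptions: the second argument of $N(0, 0.5)$ is taken to be a variance, and Cluster 2 is given a constant Poisson mean of 20 regardless of the covariate. The function name, argument names and the seed are hypothetical.

```python
import numpy as np

def simulate_clusters(k=9, n_i=100, slope=0.5, var_b=0.5,
                      outlier_cluster=2, outlier_mean=20.0, seed=0):
    """Clustered Poisson counts with one cluster drawn from a much higher mean."""
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(1, k + 1), n_i)      # cluster labels 1..k
    b = rng.normal(0.0, np.sqrt(var_b), size=k)        # random intercepts (variance var_b)
    x = rng.uniform(0.0, 1.0, size=k * n_i)            # subject-specific covariate
    mu = np.exp(b[cluster - 1] + slope * x)            # log link
    mu[cluster == outlier_cluster] = outlier_mean      # contaminated cluster
    y = rng.poisson(mu)
    return cluster, x, y

cluster, x, y = simulate_clusters()
cluster_means = np.array([y[cluster == c].mean() for c in range(1, 10)])
```

Comparing `cluster_means` with and without the rows of Cluster 2 gives a simple check of how strongly the contaminated cluster stands out before any model is fitted.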

  26. Plot of standardized dfbeta residuals

  27. estimated cluster means

  28. Standardized dfbetas clearly identify the observations from Cluster 2 as outliers. • As expected, the estimated cluster means are larger when Cluster 2 observations are included than when they are excluded.

  29. Thank you
