Generalized Linear models

abdalla

Presentation Transcript


  1. Generalized Linear models

  2. GENERALIZED LINEAR MODELS Overview

  3. Overview: general linear models (actually, PROC GLM)

  4. Overview: GENERALIZED LINEAR MODELS
     • The distribution of the observations can come from the exponential family of distributions.
     • The variance of the response variable is a specified function of its mean.
     • $X\beta$ is fit to a function of $E(y)$ (called a link function) suggested by the distribution of the observations: $g(E(y)) = g(\mu) = X\beta$, where $g$ is the link function.

  5. Overview: Logit Link Function for Binary Response
     [Figure: $p_i$ versus the predictor, and $\mathrm{logit}(p_i)$ versus the predictor after the logit transform]

  6. Overview: Log Link Function for Count Data
     [Figure: count versus the predictor, and log(count) versus the predictor after the log transform]
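
The two link functions on the preceding slides can be sketched in a few lines of Python. This is only an illustration: the function names and the linear-predictor value are assumptions made here, not part of the presentation or of any SAS procedure.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: maps a positive mean (e.g. an expected count) onto the real line."""
    return math.log(mu)

def inv_log_link(eta):
    """Inverse log link: maps a linear predictor back to a positive mean."""
    return math.exp(eta)

# Hypothetical linear predictor: intercept -1.0, slope 0.5, predictor value 2.0.
eta = -1.0 + 0.5 * 2.0
print(inv_logit(eta))     # fitted probability under the logit link
print(inv_log_link(eta))  # fitted mean count under the log link
```

Whatever value the linear predictor takes, the inverse logit returns a number between 0 and 1 and the inverse log returns a positive number, which is why these links suit binary and count responses respectively.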

  7. Overview: Examples of Generalized Linear Models
     *Models often use the LOG link in practice.

  8. Poisson Regression

  9. Poisson Regression: Properties and examples
     • Poisson regression
         • is one type of generalized linear model
         • assumes that the response variable follows a Poisson distribution conditional on the values of the predictor variables
         • can be used to model the number of occurrences of an event of interest, or the rate of occurrence of an event of interest, as a function of some predictor variables
         • is most appropriate for rare events
             • The response distribution should have a small mean (<10 or even <5, and ideally ~1).
             • If not, gamma and lognormal could be a better choice.
     • Examples include
         • number of ear infections in infants
         • number of equipment failures
         • colony counts for bacteria or viruses
         • counts of a rare disease in a population
         • number of fatal crashes at an intersection
         • homicide rates in a given state
         • rate of insurance claims
         • number of infected areas per unit volume of a tree
         • response rates to a marketing campaign

  10. Poisson Regression: Poisson versus Normal Distribution
      • Poisson distribution
          • is skewed to the right for rare events
          • is for nonnegative integer values
          • has only one parameter (the mean)
          • has a variance that is equal to the mean
      • Normal distribution
          • is symmetric
          • can be for negative as well as positive real values
          • has two unrelated parameters (mean and variance)
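
The "variance equal to the mean" property can be checked numerically. A minimal Python sketch, assuming the standard Poisson probability mass function and truncating the infinite sum at a cutoff (`upper`, an illustrative choice) beyond which the probabilities are negligible for small means:

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_moments(lam, upper=100):
    """Mean and variance computed from the pmf over y = 0 .. upper - 1."""
    probs = [poisson_pmf(y, lam) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mean, var = poisson_moments(2.0)
print(mean, var)  # both are numerically very close to 2.0
```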

  11. Poisson Regression Model

  12. Poisson Regression: Parameter Estimates
      • $e^{\beta}$ is the multiplicative effect on $\mu$ for a one-unit change in X.
      • Example 1: if $e^{\beta_1} = 1.20$, then a one-unit increase in X1 yields a 20% increase in the estimated mean.
      • Example 2: if $e^{\beta_2} = 0.80$, then a one-unit increase in X2 yields a 20% decrease in the estimated mean.
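
The multiplicative reading follows directly from the log link. A short derivation, assuming a single predictor $x$ with intercept $\beta_0$ and slope $\beta_1$ (symbols chosen here for illustration):

```latex
\log \mu(x) = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{\mu(x+1)}{\mu(x)}
  = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
  = e^{\beta_1}
```

The ratio does not depend on $x$, so an exponentiated coefficient of 1.20 multiplies the mean by 1.20 at every starting value of the predictor, and a value of 0.80 multiplies it by 0.80.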

  13. Poisson Regression: Example data
      • Age in Years
      • Number of Self-Diagnosed Ear Infections
      • Frequent or Occasional Ocean Swimmer
      • Typical Swimming Location
      • Gender

  14. Poisson Regression: Categorical variables
      • Frequent or Occasional Ocean Swimmer: Occasional, Freq
      • Typical Swimming Location: nonBeach, Beach
      • Gender: Female, Male

  15. Poisson Regression: Interval variable
      • Age in Years

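  16. Poisson Regression: Example
      proc genmod data=sasuser.earinfection;
         /* reference-cell coding; Freq, Beach, and Male are the reference levels */
         class Swimmer (param=ref ref='Freq')
               Location (param=ref ref='Beach')
               Gender (param=ref ref='Male');
         /* Poisson response with log link; age enters as a quadratic */
         model infections = swimmer location gender age age*age
               / dist=poisson link=log type3;
      run;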

  17. Poisson Regression: Example, PROC GENMOD output (Scale = 1*)

  18. Poisson Regression: Overdispersion
      • Poisson regression models assume the variance is equal to the mean.
      • Count data often exhibit variability exceeding the mean.
      • Overdispersion leads to underestimates of the standard errors of parameter estimates.
      • Overdispersion results in overestimates of the test statistic and liberal p-values.
      WHAT TO DO
      • Use the negative binomial distribution [NOW]
      • Apply a multiplicative adjustment factor (PSCALE or DSCALE option in the MODEL statement) [HW]
      Possible causes of overdispersion
      • Subject heterogeneity due to an under-specified model
      • Outliers in the data
      • Positive correlation between the responses in clustered data
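
A rough way to see overdispersion is to compare the Pearson chi-square with its degrees of freedom. The Python sketch below is an illustration only: it assumes an intercept-only Poisson model (so every fitted mean equals the sample mean), and the counts are made up. The ratio it computes is the kind of quantity a Pearson-based scale adjustment relies on; a value well above 1 suggests that the Poisson standard errors are too small by roughly the square root of the ratio.

```python
def pearson_dispersion(counts):
    """Pearson chi-square / df for an intercept-only Poisson fit."""
    n = len(counts)
    mean = sum(counts) / n  # fitted mean for every observation
    pearson_chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return pearson_chi2 / (n - 1)  # one parameter (the intercept) estimated

# Hypothetical counts with one large value inflating the variance.
counts = [0, 0, 1, 1, 2, 8]
dispersion = pearson_dispersion(counts)
se_inflation = dispersion ** 0.5  # approximate factor by which standard errors grow
print(dispersion, se_inflation)
```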

  19. negative binomial REGRESSION

  20. Negative Binomial Regression: Distribution and model
      Response Variable    Distribution         Link Function    Variance Function
      Count                Negative Binomial    Natural Log      $\mu + k\mu^2$
      • The negative binomial distribution
          • is the distribution for count data that permits the variance to exceed the mean
          • enables the model to have greater flexibility in modeling the relationship between the mean and the variance of the response variable than the Poisson model

  21. Negative Binomial Regression: Dispersion Parameter k
      • The dispersion parameter k is not allowed to vary over observations.
      • The limiting case when the parameter k is equal to 0 corresponds to a Poisson regression model.
      • When the parameter is greater than 0, overdispersion is evident and the standard errors will increase.
      • The fitted values are similar, but the larger standard errors reflect the overdispersion not captured by the Poisson model.
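
A worked comparison makes the role of k concrete. It assumes the usual mean-dispersion form of the negative binomial variance, $\mathrm{Var}(y) = \mu + k\mu^2$; the values $\mu = 4$ and $k = 0.5$ are chosen here purely for illustration.

```latex
\mathrm{Var}_{\text{Poisson}}(y) = \mu = 4,
\qquad
\mathrm{Var}_{\text{NB}}(y) = \mu + k\mu^2 = 4 + 0.5 \cdot 4^2 = 12,
\qquad
\lim_{k \to 0} \left( \mu + k\mu^2 \right) = \mu
```

With the same mean, the negative binomial variance here is three times the Poisson variance, and it shrinks back to the Poisson value as k approaches 0.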

  22. Negative Binomial Regression: Example
      proc genmod data=sasuser.earinfection;
         class Swimmer (param=ref ref='Freq')
               Location (param=ref ref='Beach')
               Gender (param=ref ref='Male');
         /* same predictors as the Poisson model; only the distribution changes */
         model infections = swimmer location gender age age*age
               / dist=negbin link=log type3;
      run;

  23. Negative Binomial Regression: Example, PROC GENMOD output

  24. Poisson regression for rates

  25. Poisson Regression: Rates (rates data: definition and examples)
      • When events occur over time, space, or some other index of exposure, it is more relevant to model the rate at which they occur rather than the number of events.
      • Rates provide the necessary standardization to make the outcomes comparable.
      • You use the OFFSET= option in the MODEL statement in PROC GENMOD.
      • Examples
          • How crime rates are related to the city's unemployment rate
          • How melanoma incidence rates are related to demographic variables
          • How the rate of loan defaults is related to region of the country
          • How response rates to marketing campaigns relate to known characteristics of the recipients

  26. Poisson Regression: Rates (rates data: offset)
      …
      • Log(T) is called the offset variable; it has a coefficient equal to 1.
      • The offset variable makes the fitted rate proportional to the index of exposure.
      • For example, using the log of the population as an offset variable is the same as modeling the mean number of events proportional to population size.
      … OFFSET = Variable …
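
The reason the offset carries a fixed coefficient of 1 can be seen by writing the rate model out. A short derivation, assuming $\mu$ is the expected count, $T$ the exposure, and $X\beta$ the linear predictor (notation chosen here to match the link-function slide):

```latex
\log\!\left(\frac{\mu}{T}\right) = X\beta
\quad\Longleftrightarrow\quad
\log \mu = \log T + X\beta
\quad\Longleftrightarrow\quad
\mu = T \, e^{X\beta}
```

Modeling the rate with a log link is therefore the same as modeling the count with $\log T$ added to the linear predictor, and doubling the exposure doubles the expected count.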

  27. Poisson Regression: Rates (skin cancer in Texas and Minnesota)
      • Response: incidence of nonmelanoma skin cancer
      • City: Minneapolis-St. Paul, Dallas-Fort Worth
      • Age Group: 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, 85+

  28. Poisson Regression: Rates (example)
      proc genmod data=sasuser.skin;
         class City (param=ref ref='MSP')
               Age (param=ref ref='85+');
         /* log_pop is the offset, so the model describes cases per unit of population */
         model cases = city age
               / offset=log_pop dist=poisson link=log type3;
      run;

  29. Zero-inflated Poisson model

  30. ZIP: Purpose
      • In some settings, the incidence of zero counts will be much greater than expected for the Poisson distribution.
      • Poisson regression models will exhibit overdispersion when they are fit to data with an excess number of zeros.
      • Zero-inflated Poisson (ZIP) models might be a better fit to the data.

  31. ZIP: Model
      • The population that can be modeled with the zero-inflated Poisson distribution is considered to consist of two types of responses.
      • The first type gives Poisson distributed counts, which can produce the zero outcome or some other positive outcome.
      • The second type always gives a zero count.
      • Therefore, the relevant distribution is a mixture of a Poisson distribution and a distribution that is constant at zero.
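
The mixture described above can be written as a small probability function. A minimal Python sketch, where `pi` (the probability of belonging to the always-zero type) and `lam` (the Poisson mean of the other type) are names assumed here for illustration:

```python
import math

def zip_pmf(y, pi, lam):
    """P(Y = y) under a zero-inflated Poisson mixture.

    pi  : probability of the always-zero type
    lam : Poisson mean of the count-producing type
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros arise from both types: structural zeros plus Poisson zeros.
        return pi + (1.0 - pi) * poisson
    # Positive counts arise only from the Poisson type.
    return (1.0 - pi) * poisson

# Hypothetical parameters: 30% always-zero responses, Poisson mean 2.
print(zip_pmf(0, 0.3, 2.0))  # probability of a zero under the mixture
print(zip_pmf(0, 0.0, 2.0))  # probability of a zero under a plain Poisson
```

Setting `pi` to 0 recovers the ordinary Poisson distribution, which shows that the extra zeros come entirely from the always-zero component.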

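  32. ZIP: Components
      • MODEL statement
      • ZEROMODEL statement
      proc genmod data=sasuser.roots;
         class bap photoperiod;
         /* MODEL: the Poisson count part of the mixture */
         model roots = photoperiod | bap / dist=zip link=log type3;
         /* ZEROMODEL: the part that models the excess zeros */
         zeromodel photoperiod;
      run;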

  33. ZIP: Example data

  34. ZIP: Example
      [Figure panels: 16 hours, 8 hours]

  35. ZIP: Example results (dist=zinb)

  36. GAMMA REGRESSION

  37. Gamma Distribution
      • is a skewed distribution for positive values
      • has a variance that is proportional to the squared mean: $\mathrm{Var}(y) \propto [E(y)]^2$
      • has lighter tails than a lognormal distribution
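
The proportionality between the variance and the squared mean follows from the gamma moments. A brief derivation, assuming a gamma distribution with shape $\alpha$, scale $\sigma$, and no threshold shift (parameter names chosen here for illustration):

```latex
E(y) = \alpha\sigma,
\qquad
\mathrm{Var}(y) = \alpha\sigma^2 = \frac{(\alpha\sigma)^2}{\alpha} = \frac{[E(y)]^2}{\alpha}
```

For a fixed shape the constant of proportionality is $1/\alpha$, which means the coefficient of variation, $1/\sqrt{\alpha}$, stays the same as the mean changes.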

  38. 100x distributions comparison
      Distribution          Variance
      Normal (truncated)    constant*
      Poisson               $\propto E(Y)$
      Gamma                 $\propto (E(Y))^2$
      Lognormal             $\propto (E(Y))^2$

  39. GAMMA REGRESSION: Example
      proc univariate data=car;
         var price;
         /* overlay a fitted gamma density on the histogram of price */
         histogram / gamma (alpha=est sigma=est theta=est color=blue w=2)
                     vaxis=0 to 14 by 2
                     midpoints=8 to 50 by 2;
      run;

  40. GAMMA REGRESSION: REG and GENMOD results (residuals)
      proc genmod data=car;
         model price = hwympg hwympg2 horsepower
               / dist=gamma link=log /*identity*/ obstats id=model;
      run;
      [Figure: residual plots for PROC REG; PROC GENMOD, link=identity; PROC GENMOD, link=log]

  41. Summary
      • Problem:
          • nonconstant variance
      • Approaches:
          • Transform the dependent variable Price (log).
          • Fit a gamma regression model with the log link function.
          • Fit a gamma regression model with the identity link function.
      PROBLEM for OLS?

  42. INSURANCE: Case study

  43. GENMOD: Insurance
      • Frequency: how often claims are made
      • Severity
          • A typical way to model severity (claim amount) is by using a gamma distribution with a log link function.
      • Pure premium: the portion of the company's expected cost that is "purely" attributed to loss
          • does not include the general expense of doing business
          • Tweedie distribution
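
The three quantities are linked by a simple product. A minimal Python sketch, under the common simplifying assumption that claim counts and claim amounts are independent; the function name and the numbers are hypothetical and serve only as an illustration.

```python
def pure_premium(claim_frequency, claim_severity):
    """Expected loss per unit of exposure: frequency times severity."""
    return claim_frequency * claim_severity

# Hypothetical inputs, not taken from the presentation.
frequency = 0.08   # expected claims per policy-year (e.g. from a count model)
severity = 2500.0  # expected amount per claim (e.g. from a gamma model)
print(pure_premium(frequency, severity))
```

A frequency model and a severity model can be fitted separately and combined this way, whereas a single model with a Tweedie response targets the pure premium directly.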

  44. GLM, Insurance: Frequency & Pure Premium
      • Tweedie distribution
      • PROC SEVERITY (SAS/ETS)
      • ZIP

  45. Thank you!
