1. Sociology 709 (Martin) Lecture 4: February 19, 2009
Creating and Modifying X-variables
Maximum Likelihood Estimation
2. Part I: Filling a model with x-variables There are basically three ways to fill a model with x-variables:
put in new variables
change the specification of existing variables
interact variables with each other
3. Putting in new variables Potential problems with leaving out a variable:
if the new variable is correlated with the outcome variable and with the key explanatory variable, the coefficient for the key explanatory variable will be biased if you leave the new variable out of the model.
if the new variable is correlated with the outcome variable but not with the key explanatory variable, the standard error for the key explanatory variable will be larger if you leave the new variable out of the model.
If the reader thinks the new variable is the true cause of the relationship, he or she will not believe your results if you leave the new variable out of the model.
4. Changing the specification of existing variables. Reasons to respecify an x-variable:
If the relationship between a key variable x and y* is nonlinear, we could make bad estimates of slopes and bad point estimates of population proportions.
if a control variable is correlated with the outcome variable and with the key explanatory variable, the coefficient for the key explanatory variable will be biased if you misspecify the control variable. (In some cases, a badly specified control variable can be worse than no control variable!)
8. More on polynomial models Form of a polynomial model:
Yhat = b0 + b1X1 + b2X1^2 + b3X1^3 + …
When to use a polynomial model:
when you have a theoretical presumption that the response function is a polynomial function
(example: the distance an object falls as a function of time).
when the response function is complex or unknown, but it fits pretty well to a polynomial function.
(example: death rates as a function of age)
9. Second-order polynomials Second-order polynomials are the most common kind; they add only a squared term to the linear term:
Yhat = b0 + b1X1 + b2X1^2
Example: girls’ height in inches as a function of age, for ages 2-12
Yhat = 20 + 3*X1 – 0.2X1^2
Predict the height of a girl at age 2, 5, 8, and 11.
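The predictions asked for above can be checked with a few lines of Python. This is a minimal sketch that reads the slide's equation as Yhat = 20 + 3*X1 - 0.2*X1^2 (the final "2" being an exponent); the function name predicted_height is made up for illustration.

```python
def predicted_height(age):
    """Predicted height in inches from the slide's second-order polynomial."""
    b0, b1, b2 = 20.0, 3.0, -0.2   # coefficients taken from the slide
    return b0 + b1 * age + b2 * age ** 2

# Evaluate the polynomial at the four ages named in the exercise.
heights = {age: round(predicted_height(age), 1) for age in (2, 5, 8, 11)}
for age, inches in heights.items():
    print(f"age {age:2d}: {inches} inches")
```

The predicted height is lower at age 11 than at age 8 because this curve turns downward after X1 = 7.5, which shows how a quadratic can misbehave toward the edge of the data range.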
10. Graphing second-order polynomials Second-order polynomials always reflect a response function with a single curve.
The linear (first order) term describes the general trend as upward or downward, for values of X near 0.
The squared (second-order) term describes the curvature as upward or downward.
Sketch these examples
Yhat = 20 + 3*X1 + 0.2X1^2
Yhat = 20 + 3*X1 – 0.2X1^2
Yhat = 20 – 3*X1 + 0.2X1^2
Yhat = 20 – 3*X1 – 0.2X1^2
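As an aid to sketching, the following Python snippet summarizes each of the four curves. It assumes the equations are quadratics of the form Yhat = 20 + b1*X1 + b2*X1^2 and uses the fact that the slope b1 + 2*b2*X1 is zero at X1 = -b1/(2*b2); the helper name describe_quadratic is hypothetical.

```python
def describe_quadratic(b1, b2):
    """Return (trend near X1 = 0, direction of curvature, X1 of turning point)."""
    trend = "upward" if b1 > 0 else "downward"
    curvature = "upward" if b2 > 0 else "downward"
    turning_point = -b1 / (2 * b2)   # where the slope b1 + 2*b2*X1 equals zero
    return trend, curvature, turning_point

# (b1, b2) pairs for the four examples, in the order listed on the slide.
examples = [(3, 0.2), (3, -0.2), (-3, 0.2), (-3, -0.2)]
summaries = [describe_quadratic(b1, b2) for b1, b2 in examples]
for (b1, b2), (trend, curvature, turn) in zip(examples, summaries):
    print(f"b1={b1:+}, b2={b2:+}: starts {trend}, curves {curvature}, "
          f"turns at X1={turn:.1f}")
```

When b1 and b2 have opposite signs the turning point falls at a positive X1, so the curve changes direction inside the range of positive X values.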
11. Graphing higher-order polynomials Third-order polynomials describe response functions where the curvature changes across the range of X.
Yhat = 20 + 3*X1 - 0.2X1^2 + 0.07X1^3
Higher-order polynomials describe response functions where the change in the curvature itself changes across the range of X, as in bimodal distributions.
Yhat = 20 + 3*X1 - 0.2X1^2 + 0.07X1^3 - 0.01X1^4
You rarely see high order polynomials in social research.
You rarely see high order polynomials in social research.
12. Warnings for polynomial regression Polynomial regression can be a good way to explain error related to important control variables.
However, polynomial regression creates at least four problems:
1.) It is difficult to interpret any of the coefficients related to the variable with the polynomial specification.
2.) Each “order” in a polynomial regression eats up a degree of freedom.
3.) Polynomial terms can be highly collinear.
4.) The model becomes highly unstable at extreme x-values, and don’t even think of extrapolating.
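Problem 3 is easy to see numerically. The sketch below is an illustration only: it assumes an age variable taking the integer values 2 through 12 (borrowed from the height example) and computes the Pearson correlation between age and age squared with a hand-written helper, pearson_r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

ages = list(range(2, 13))             # assumed data: ages 2, 3, ..., 12
ages_squared = [a ** 2 for a in ages]
r = pearson_r(ages, ages_squared)
print(f"correlation between age and age squared: {r:.3f}")
```

A correlation this close to 1 means the model has little independent information with which to separate the linear term from the squared term.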
13. Model building with polynomial regression The standard order for adding polynomial terms to a model is to start with the first order term, add the second order term if necessary, and so on.
Never use a model that has a higher order term for X, but is missing a lower-order term for X.
14. Interaction regression models Form of an interaction model:
Yhat = b0 + b1X1 + b2X2 + b3X1X2
When to use an interaction model:
When it appears that the effect of X1 on Y varies with the value of X2.
19. Dichotomous interaction terms The easiest type of interaction occurs when both X1 and X2 are scaled as dichotomous variables
Example: what is the effect of college attendance and gender on income?
Data coding:          X1 (male)   X2 (college)   X1*X2
female, college           0            1          0*1 = 0
female, no college        0            0          0*0 = 0
male, college             1            1          1*1 = 1
male, no college          1            0          1*0 = 0
Note: three coefficients for four categories (which category does the intercept describe?)
20. Dichotomous interaction coefficients In the income example with X1 and X2 both dichotomous,
Yhat = b0 + b1X1 + b2X2 + b3X1X2
We can interpret the coefficients using a 2X2 table:
21. Dichotomous interaction coefficients In the income example with X1 and X2 both dichotomous,
Yhat = b0 + b1X1 + b2X2 + b3X1X2
Predict income for all four groups if:
b0 = 30,000 b1 = 10,000 b2 = 20,000 b3=10,000
Alternately, for a non-interaction model of income:
Yhat = b0 + b1X1 + b2X2
Predict income for all four groups if:
b0 = 27,500 b1 = 15,000 b2 = 25,000
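Both sets of predictions can be worked out with one small function. This sketch uses the coefficient values given on the slide; the function name predict_income and the dictionary layout are choices made here for illustration.

```python
def predict_income(male, college, b0, b1, b2, b3=0):
    """Yhat = b0 + b1*X1 + b2*X2 + b3*X1*X2, with X1 = male and X2 = college."""
    return b0 + b1 * male + b2 * college + b3 * male * college

groups = {
    "female, no college": (0, 0),
    "female, college": (0, 1),
    "male, no college": (1, 0),
    "male, college": (1, 1),
}

# Interaction model: b0 = 30,000  b1 = 10,000  b2 = 20,000  b3 = 10,000
interaction = {name: predict_income(m, c, 30_000, 10_000, 20_000, 10_000)
               for name, (m, c) in groups.items()}
# Non-interaction model: b0 = 27,500  b1 = 15,000  b2 = 25,000
additive = {name: predict_income(m, c, 27_500, 15_000, 25_000)
            for name, (m, c) in groups.items()}

for name in groups:
    print(f"{name:20s} interaction: {interaction[name]:>6,}   "
          f"no interaction: {additive[name]:>6,}")
```

In the interaction model the college difference is 20,000 for women (b2) and 30,000 for men (b2 + b3); the non-interaction model forces the same 25,000 difference on both groups.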
22. Interpreting dichotomous interaction coefficients In a 2x2 table for dichotomous variables X1 and X2:
b0 is the predicted value of Y for X1 = 0, X2 = 0
b1 is the predicted increase in Y for moving from X1 = 0 to X1 = 1, given that X2 = 0
b2 is the predicted increase in Y for moving from X2 = 0 to X2 = 1, given that X1 = 0
b2+b3 is the predicted increase in Y for moving from X2 = 0 to X2 = 1, given that X1 = 1
b1+b3 is the predicted increase in Y for moving from X1 = 0 to X1 = 1, given that X2 = 1
23. One dichotomous and one linear interaction term If X1 is dichotomous (say, race = nonhispanic white [1,0]) and X2 is linear (say, age), and Y is a measure of health, we can use an interaction term to control for the possibility that health changes with age at a different rate for whites than for other racial/ethnic groups.
Yhat = b0 + b1X1 + b2X2 + b3X1X2
This pattern is most easily seen in graphs: (Example)
Y = health measured on a scale from 10 (great) to 1(awful)
b0 = 8 b1 = 1.5 b2 = -0.1 b3 = +0.02
24. Interaction models with one dichotomous (X1) and one linear (X2) variable b0 is the predicted value of Y for X1 = 0, X2 = 0
b1 is the predicted increase in Y for moving from X1 = 0 to X1 = 1, given that X2 = 0
b2 is the predicted increase in Y for each unit of X2, given that X1 = 0
b2+b3 is the predicted increase in Y for each unit of X2, given that X1 = 1
In other words, b3 is the difference between the linear effect of age for white nonhispanic persons and the linear effect for others.
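In place of a graph, the two lines can be tabulated. This sketch plugs in the example values b0 = 8, b1 = 1.5, b2 = -0.1, b3 = +0.02 from the health example; the ages 20, 40, and 60 are arbitrary points chosen here, and the function names are hypothetical.

```python
b0, b1, b2, b3 = 8.0, 1.5, -0.1, 0.02   # example values from the health slide

def predicted_health(white, age):
    """Yhat = b0 + b1*X1 + b2*X2 + b3*X1*X2, with X1 = white (1/0), X2 = age."""
    return b0 + b1 * white + b2 * age + b3 * white * age

def age_slope(white):
    """Change in predicted health per year of age: b2 when X1 = 0, b2 + b3 when X1 = 1."""
    return b2 + b3 * white

for age in (20, 40, 60):   # illustrative ages, not taken from the lecture
    print(f"age {age}: others {predicted_health(0, age):.1f}, "
          f"white nonhispanic {predicted_health(1, age):.1f}")
print(f"slope for others: {age_slope(0):.2f}, "
      f"slope for white nonhispanic: {age_slope(1):.2f}")
```

The two lines start 1.5 points apart at age 0 and diverge by b3 = 0.02 points for every additional year of age.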
25. Interaction models with two linear variables It is also possible to have an interaction model when both X1 and X2 are linear terms.
Yhat = b0 + b1X1 + b2X2 + b3X1X2
While this specification may often produce a good model fit, people tend not to use it because it is hard to interpret, and because it tends to fit the model to points with high values of X1 and X2
(Draw graphs of response surface with no interaction, then with interaction)
b0 = 0.5 b1 = 2.0 b2 = 1.0 b3 = 1.2
26. Interpreting interaction models with two linear variables b0 is the predicted value of Y when both X1 and X2 are zero.
b1 is the linear effect of X1 when X2 is zero (which may be outside of the data range!)
b2 is the linear effect of X2 when X1 is zero (which may be outside of the data range!)
b3 is the compounded effect of X1*X2, beyond that predicted by the linear effects of X1 and X2.
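One way to read b3 is that the effect of a one-unit change in X1 is b1 + b3*X2, so it depends on where X2 sits. The sketch below illustrates this with the example values b0 = 0.5, b1 = 2.0, b2 = 1.0, b3 = 1.2 from the preceding slide; the X2 values 0, 1, and 5 are arbitrary, and the function names are made up.

```python
b0, b1, b2, b3 = 0.5, 2.0, 1.0, 1.2   # example values from the preceding slide

def predict(x1, x2):
    """Yhat = b0 + b1*X1 + b2*X2 + b3*X1*X2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2):
    """Predicted change in Y per unit of X1, holding X2 fixed at x2."""
    return b1 + b3 * x2

for x2 in (0, 1, 5):   # illustrative values of X2
    print(f"X2 = {x2}: one more unit of X1 changes Yhat by {effect_of_x1(x2):.1f}")
```

By the same algebra, the effect of X2 at a given X1 is b2 + b3*X1.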
27. Part II: Maximum Likelihood Estimation It is good to know…
the difference between a maximum likelihood estimator and a least squares estimator
how to do a simple likelihood estimate
the meaning of a likelihood ratio
It is good to know because…
we might need to fix models that “crash”
we need to recognize problems with assumptions
a deeper understanding helps us phrase our explanations of the findings
28. 1: Fictitious example: seat belt use Information on seat belt use and sex for a sample of 7 drivers: (.75 of males and .33 of females used a seat belt)
29. Least squares estimator for seat belt use E(y) = b0 + b1x
For case 1: 1 = b0+ b1*1 + e1
For case 2: 0 = b0+ b1*1 + e2
For case 3: 0 = b0+ b1*1 + e3
For case 4: 1 = b0+ b1*0 + e4
For case 5: 1 = b0+ b1*0 + e5
For case 6: 1 = b0+ b1*0 + e6
For case 7: 0 = b0+ b1*0 + e7
zero error constraint: Σe = 0
least squared error constraint: Σe^2 = minimum w.r.t. b1:
d(Σe^2)/d(b1) = 0
solve for b1 and b0: E(y) = .75 - .42x
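The solution can be reproduced from the seven case equations. This sketch reads the data off those equations (x = 1 for cases 1 to 3, x = 0 for cases 4 to 7) and applies the usual closed-form least squares formulas for one predictor; the variable names are choices made here.

```python
# Data as implied by the seven case equations: x = 1 is female, y = 1 is belt use.
x = [1, 1, 1, 0, 0, 0, 0]
y = [1, 0, 0, 1, 1, 1, 0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Closed-form least squares solution for a single predictor.
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"E(y) = {b0:.2f} + ({b1:.2f})x")
print(f"sum of errors = {sum(residuals):.10f}")
```

With a single dichotomous predictor, b0 is the proportion of belt users among cases with x = 0 and b0 + b1 is the proportion among cases with x = 1.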
30. Key properties of the least squares estimator:
1.) The sum of the errors is zero by assumption.
2.) The sum of the squared errors is as small as possible by assumption.
3.) The estimator solves for all the unknowns in the equation.
31. Why the least squares estimator cannot work for the log odds of a proportion: Least squares estimators estimate the distance between the predicted and observed outcomes for every case.
When the outcome is a log odds and p must be 1 or 0, the observed log odds for every case is either +∞ or -∞ (that is, ln(1/(1-1)) or ln(0/(1-0))).
Thus, for every observation, the distance between predicted and observed outcomes is either +∞ or -∞, and there is no way to sum the errors to zero or to find a minimum for the sum of squared errors.
32. The M.L. alternative for the log odds of a proportion: Do not attempt to minimize error or even measure it.
Assume (based on a lack of any information), that all possible values of a parameter are equally plausible before you examine the data.
Then, if value #1 of a population parameter is more likely than value #2 to have produced the sample statistic, then value #1 is more likely to be the true value of the parameter.
33. Theoretical problems with the M.L.E.: 1.) Not all possible values of the population parameter are equally likely before we start examining the data; there is only the one true parameter.
2.) The M.L.E. looks for the most likely score, not an average. Thus, we have no assurance that the errors will sum to zero!
(In practice, these problems are not too bad.)
34. Solving a simple maximum likelihood estimate: Refer to seat belt data on page 3.
For case 1, x could be 1 or 0, in this case 1 (female)
Also, y could be 1 or 0, in this case 1 (used belt)
p = 1 for this case, so what is the most likely population proportion (π) of US persons using seat belts, based only on information from this one case?
35. Solving a simple maximum likelihood estimate: Considering only case 1, p = 1
Estimate π using maximum likelihood estimation:
if π=0, then pr(p=1 | π=0) = 0
if π=0.5, then pr(p=1 | π=.5) = .5
if π=0.7244, then pr(p=1 | π=.7244) = .7244
if π=0.9, then pr(p=1 | π=.9) = .9
if π=1, then pr(p=1 | π=1) = 1
Thus, for the example of a single case with a “1” outcome, our best guess of π is π=1.
36. Solving a simple maximum likelihood estimate: Consider all three women: case 1,2,3, p = 1,0,0
Estimate π using maximum likelihood estimation:
if π=1, then pr(p=1 | π=1) = 1 for the first case,
pr(p=0 | π=1) = 0 for the second case, and
pr(p=0 | π=1) = 0 for the third case,
for a total probability of 1*0*0 = 0
(π cannot be 1 if anybody in the sample scores a 0.)
Note: if you compare the equation for the total probability to binomial equations from some statistics texts, you will notice that the factorial terms have dropped out. This is OK.
37. Solving a simple maximum likelihood estimate: Consider all three women: case 1,2,3, p = 1,0,0
Estimate other possible values of π:
if π=.333, then pr(p=1 | π=.333) = .333 for the first case,
pr(p=0 | π=.333) = 1-.333 = .667 for the second case, and
pr(p=0 | π=.333) = 1-.333 = .667 for the third case,
for a total probability of .333*.667*.667 = .1481
(If π = .333, we would see this result .1481 of the time)
38. Solving a simple maximum likelihood estimate: Consider all three women: case 1,2,3, p = 1,0,0
Estimate other possible values of π:
if π=.4, then we have a total probability of .4*(1-.4)*(1-.4) = .144
if π=.3, then we have a total probability of .3*(1-.3)*(1-.3) = .147
if π=.2, then we have a total probability of .2*(1-.2)*(1-.2) = .128
We can calculate for all values of π we can think of, and the maximum likelihood will turn out to be the one corresponding to π = .333. That is our best estimate.
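Trying every value "we can think of" is easy to automate. The following sketch assumes a grid of candidate proportions from 0 to 1 in steps of .001 (the grid spacing is an arbitrary choice) and multiplies the case-by-case probabilities exactly as in the hand calculation above.

```python
def likelihood(pi_guess, outcomes):
    """Probability of the observed 0/1 outcomes if the population proportion is pi_guess."""
    total = 1.0
    for y in outcomes:
        total *= pi_guess if y == 1 else (1 - pi_guess)
    return total

women = [1, 0, 0]                        # cases 1, 2, 3 from the seat belt example
grid = [i / 1000 for i in range(1001)]   # candidate values 0.000, 0.001, ..., 1.000
best_pi = max(grid, key=lambda g: likelihood(g, women))

print(f"best guess: {best_pi:.3f}, likelihood: {likelihood(best_pi, women):.4f}")
```

The winning grid value sits next to 1/3, the sample proportion, which is where the exact maximum lies.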
39. The simple maximum likelihood estimate gets hard: Consider all seven cases: we are estimating both a value for the constant and a value for the “slope”.
(in a logit model, ln(p/(1-p)) = b0 + b1x, so we solve for two unknowns)
For this situation, we still could guess what parameters to use, but with continuous covariates and more x-variables, it quickly becomes impossible to do more than guess what values of β0, β1,… we are aiming for.
Maximum likelihood estimation is thus a matter of repeated guesses, even for a computer.
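To make "repeated guesses" concrete, here is a minimal sketch of one common updating scheme, Newton-Raphson, applied to the seven seat belt cases. The lecture does not say which algorithm is meant, so the choice of method, the starting values of zero, and the stopping rule are all assumptions made here.

```python
from math import exp, log

# Seven cases from the seat belt example: x = 1 is female, y = 1 is belt use.
x = [1, 1, 1, 0, 0, 0, 0]
y = [1, 0, 0, 1, 1, 1, 0]

def log_likelihood(b0, b1):
    """Natural log of the likelihood of the sample under a logit model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + exp(-(b0 + b1 * xi)))
        total += log(p) if yi == 1 else log(1 - p)
    return total

b0, b1 = 0.0, 0.0                       # assumed starting guess
history = [log_likelihood(b0, b1)]
for iteration in range(25):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + exp(-(b0 + b1 * xi)))
        w = p * (1 - p)
        g0 += yi - p                    # gradient with respect to b0
        g1 += (yi - p) * xi             # gradient with respect to b1
        h00 += w                        # entries of the 2x2 information matrix
        h01 += w * xi
        h11 += w * xi * xi
    det = h00 * h11 - h01 * h01
    step0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 system for the update
    step1 = (h00 * g1 - h01 * g0) / det
    b0, b1 = b0 + step0, b1 + step1
    history.append(log_likelihood(b0, b1))
    if abs(step0) + abs(step1) < 1e-10:   # stop when the guess no longer moves
        break

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, log likelihood = {history[-1]:.4f}")
```

Each pass through the loop is one iteration: the guess moves, the log likelihood rises, and the process stops once the change is negligible. The final estimates reproduce the sample proportions, since exp(b0)/(1+exp(b0)) = .75 and exp(b0+b1)/(1+exp(b0+b1)) = .333.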
40. Formal statement of the maximum likelihood estimator: From previous lectures, phat = exp(b0+b1x)/(1+exp(b0+b1x))
(for a model with only one covariate)
Given y “successes” in n observations, we guess at two (or more) parameters β0 and β1, and we estimate the likelihood that corresponds to the proportion of “successes” and “failures” that we actually got in our sample:
41. Formal statement of the maximum likelihood estimator (continued):
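One standard way to write the likelihood described here, for independent 0/1 outcomes under a logit model with one covariate, is given below. The case subscript i and the symbols y_i and x_i are notation assumed for this sketch rather than taken from the slides.

```latex
\hat{p}_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)},
\qquad
L(\beta_0, \beta_1) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i}\,(1 - \hat{p}_i)^{1 - y_i},
\qquad
\ln L(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln(1 - \hat{p}_i) \right]
```

For a single group with y successes in n observations and a common proportion p, the product reduces to p^y (1-p)^(n-y), which is the calculation done by hand for the three women (for example .333*.667*.667).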
46. What M.L.E. output looks like Example: do sex and marital status predict health limitations among U.S. persons age 65+?
47. What is a log likelihood? A log likelihood is the natural log of a likelihood.
A likelihood (like the p-value in a regression equation) is the probability that the sample outcome would occur, given a set of population parameters.
The concept behind linear regression analysis is the probability of many sample outcomes, assuming one population parameter.
In M.L.E., we consider the probability of one sample outcome, based on different population parameters.
48. Other STATA output An iteration is a guess of a set of parameters that is improved by successive attempts.
A likelihood ratio chi-squared compares the natural log of the likelihood for the current model with the log likelihood of a model with only an intercept. (Treat as a chi-squared test for df )
A pseudo R^2 is an attempt to produce a statistic analogous to R^2 in a linear regression model.
I generally ignore pseudo-R^2.
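These quantities can be computed by hand for the seat belt example. The sketch below relies on two assumptions: that a logit model with sex as its only (dichotomous) predictor reproduces each group's sample proportion, and that the pseudo R^2 is McFadden's version, 1 - lnL(model)/lnL(intercept only), which is one common definition.

```python
from math import log

y_female = [1, 0, 0]        # cases 1-3 of the seat belt example
y_male = [1, 1, 1, 0]       # cases 4-7

def bernoulli_log_likelihood(outcomes, p):
    """Log likelihood of 0/1 outcomes when every case has probability p of a 1."""
    return sum(log(p) if y == 1 else log(1 - p) for y in outcomes)

# Intercept-only model: one overall proportion for all seven drivers.
all_y = y_female + y_male
ll_null = bernoulli_log_likelihood(all_y, sum(all_y) / len(all_y))

# Model with sex: each group is fitted by its own sample proportion.
ll_model = (bernoulli_log_likelihood(y_female, sum(y_female) / len(y_female))
            + bernoulli_log_likelihood(y_male, sum(y_male) / len(y_male)))

lr_chi2 = 2 * (ll_model - ll_null)      # likelihood ratio chi-squared
pseudo_r2 = 1 - ll_model / ll_null      # McFadden's pseudo R^2 (assumed definition)

print(f"log likelihood, intercept only: {ll_null:.4f}")
print(f"log likelihood, with sex:       {ll_model:.4f}")
print(f"LR chi-squared: {lr_chi2:.3f}   pseudo R^2: {pseudo_r2:.3f}")
```

Because this model adds one parameter to the intercept-only model, the statistic here would be compared with a chi-squared distribution on 1 degree of freedom.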
51. Readings
For techniques for working with x-variables, read Long and Freese section 2.13.
For ML estimation, scan Long sections 3.3 to 3.6