
### Maximum Likelihood Estimation

Psych 818 - DeShon

MLE vs. OLS

- Ordinary Least Squares Estimation
  - Typically yields a closed-form solution that can be computed directly
    - Closed-form solutions often require very strong assumptions
- Maximum Likelihood Estimation
  - Default method for most estimation problems
  - Generally equal to OLS when OLS assumptions are met
  - Yields desirable “asymptotic” estimation properties
  - Foundation for Bayesian inference
  - Often requires numerical methods :(

MLE logic

- MLE reverses the probability inference
  - Recall: p(X|θ)
    - θ represents the parameters of a model (i.e., pdf)
    - What’s the probability of observing a score of 73 from a N(70,10) distribution?
  - In MLE, you know the data (Xi)
    - Primary question: Which of a potentially infinite number of distributions is most likely responsible for generating the data?
    - p(θ|X)?
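The two directions of inference can be sketched in a few lines of R. This is only an illustration: it assumes the 10 in N(70,10) is a standard deviation, and the candidate means are made-up values. For a continuous variable, `dnorm` returns a density rather than a probability.

```r
# Forward direction: parameters known, evaluate the data point.
# Assumption: N(70, 10) is read as mean 70 and SD 10.
d73 <- dnorm(73, mean = 70, sd = 10)
d73

# Reverse direction: data point known (73), compare candidate parameter values.
# The candidate means below are hypothetical, chosen only for illustration.
candidate.means <- c(60, 70, 73, 80)
lik <- dnorm(73, mean = candidate.means, sd = 10)
data.frame(mean = candidate.means, likelihood = lik)
```

Among these candidates, the mean equal to the observed score gives the highest likelihood, which is the intuition MLE builds on.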

Likelihood

- Likelihood may be thought of as an unbounded or unnormalized probability measure
- PDF is a function of the data given the parameters on the data scale
- Likelihood is a function of the parameters given the data on the parameter scale

Likelihood

- Likelihood function
  - Likelihood is the joint (product) probability of the observed data given the parameters of the pdf
  - Assume you have X1,…,Xn independent samples from a given pdf, f
  - $L(\theta \mid X_1,\dots,X_n) = \prod_{i=1}^{n} f(X_i \mid \theta)$

Likelihood

- Log-Likelihood function
  - Working with products is a pain
  - Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn it into a sum
  - $\log L(\theta \mid X_1,\dots,X_n) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$
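One practical reason to prefer the sum of logs is numerical: a product of many small densities underflows to zero in double precision. A minimal sketch, assuming a simulated sample of 1000 values (the seed and sample size are arbitrary choices, not from the slides):

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 5, sd = 2)

# Likelihood as a raw product of 1000 densities (each below 0.2)
lik <- prod(dnorm(x, mean = 5, sd = 2))

# Log-likelihood as a sum of log densities
log.lik <- sum(dnorm(x, mean = 5, sd = 2, log = TRUE))

lik       # underflows to 0
log.lik   # finite, large negative number
```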

Maximum Likelihood

- Find the value(s) of θ that maximize the likelihood function
  - Can sometimes be found analytically
    - Maximization (or minimization) is the focus of calculus and derivatives of functions
  - Often requires iterative numeric methods

Likelihood

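- Normal Distribution example
  - pdf: $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  - Likelihood: $L(\mu, \sigma^2 \mid X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X_i-\mu)^2}{2\sigma^2}\right)$
  - Log-Likelihood: $\log L(\mu, \sigma^2 \mid X) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2 + C$
  - Note: C is a constant (here $-\frac{n}{2}\log 2\pi$) that vanishes once derivatives are taken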

Likelihood

- Can compute the maximum of this log-likelihood function directly
- More relevant and fun to estimate it numerically!
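For reference, the direct solution for the normal case is the sample mean and the square root of the average squared deviation (dividing by n, not n - 1). A small sketch on simulated data; the seed is an arbitrary assumption:

```r
set.seed(818)                                # arbitrary seed
x <- rnorm(100, mean = 5, sd = 2)
n <- length(x)

mu.hat    <- mean(x)                         # closed-form MLE of the mean
sigma.hat <- sqrt(sum((x - mu.hat)^2) / n)   # closed-form MLE of the SD (divides by n)

# sd(x) divides by n - 1, so it is slightly larger than the MLE
c(mean = mu.hat, sd.mle = sigma.hat, sd.unbiased = sd(x))
```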

Normal Distribution example

- Assume you obtain 100 samples from a normal distribution
  - rv.norm <- rnorm(100, mean=5, sd=2)
    - This is the true data generating model!
  - Now, assume you don’t know the mean of this distribution and have to estimate it…
    - Let’s compute the log-likelihood of the observations for N(4,2)

Normal Distribution example

- sum(dnorm(rv.norm, mean=4, sd=2, log=T))
  - = -221.0698
  - dnorm gives the density of an observation for a given distribution (the log density when log=T)
  - Summing it across observations gives the log-likelihood
- This is the log-likelihood of the data for the given pdf parameters
- Okay, this is the log-likelihood for one possible distribution… we need to examine it for all possible distributions and select the one that yields the largest value

Normal Distribution example

- Make a sequence of possible means
  - m<-seq(from = 1, to = 10, by = 0.1)
- Now, compute the log-likelihood for each of the possible means
  - This is a simple “grid search” algorithm
    - log.l<-sapply(m, function (x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)) )

| | mean | log.l |
|---|---|---|
| 1 | 1.0 | -417.3891 |
| 2 | 1.1 | -407.2201 |
| 3 | 1.2 | -397.3012 |
| 4 | 1.3 | -387.6322 |
| 5 | 1.4 | -378.2132 |
| 6 | 1.5 | -369.0442 |
| 7 | 1.6 | -360.1253 |
| 8 | 1.7 | -351.4563 |
| 9 | 1.8 | -343.0373 |
| 10 | 1.9 | -334.8683 |
| 11 | 2.0 | -326.9494 |
| 12 | 2.1 | -319.2804 |
| 13 | 2.2 | -311.8614 |
| 14 | 2.3 | -304.6924 |
| 15 | 2.4 | -297.7734 |
| 16 | 2.5 | -291.1045 |
| 17 | 2.6 | -284.6855 |
| 18 | 2.7 | -278.5165 |
| 19 | 2.8 | -272.5975 |
| 20 | 2.9 | -266.9286 |
| 21 | 3.0 | -261.5096 |
| 22 | 3.1 | -256.3406 |
| 23 | 3.2 | -251.4216 |
| 24 | 3.3 | -246.7527 |
| 25 | 3.4 | -242.3337 |
| 26 | 3.5 | -238.1647 |
| 27 | 3.6 | -234.2457 |
| 28 | 3.7 | -230.5768 |
| 29 | 3.8 | -227.1578 |
| 30 | 3.9 | -223.9888 |
| 31 | 4.0 | -221.0698 |
| 32 | 4.1 | -218.4008 |
| 33 | 4.2 | -215.9819 |
| 34 | 4.3 | -213.8129 |
| 35 | 4.4 | -211.8939 |
| 36 | 4.5 | -210.2249 |
| 37 | 4.6 | -208.8060 |
| 38 | 4.7 | -207.6370 |
| 39 | 4.8 | -206.7180 |
| 40 | 4.9 | -206.0490 |
| 41 | 5.0 | -205.6301 |
| 42 | 5.1 | -205.4611 |
| 43 | 5.2 | -205.5421 |
| 44 | 5.3 | -205.8731 |
| 45 | 5.4 | -206.4542 |
| 46 | 5.5 | -207.2852 |
| 47 | 5.6 | -208.3662 |
| 48 | 5.7 | -209.6972 |
| 49 | 5.8 | -211.2782 |
| 50 | 5.9 | -213.1093 |
| 51 | 6.0 | -215.1903 |
| 52 | 6.1 | -217.5213 |
| 53 | 6.2 | -220.1023 |
| 54 | 6.3 | -222.9334 |
| 55 | 6.4 | -226.0144 |
| 56 | 6.5 | -229.3454 |
| 57 | 6.6 | -232.9264 |
| 58 | 6.7 | -236.7575 |
| 59 | 6.8 | -240.8385 |
| 60 | 6.9 | -245.1695 |
| 61 | 7.0 | -249.7505 |
| 62 | 7.1 | -254.5816 |
| 63 | 7.2 | -259.6626 |
| 64 | 7.3 | -264.9936 |
| 65 | 7.4 | -270.5746 |
| 66 | 7.5 | -276.4056 |
| 67 | 7.6 | -282.4867 |
| 68 | 7.7 | -288.8177 |
| 69 | 7.8 | -295.3987 |
| 70 | 7.9 | -302.2297 |
| 71 | 8.0 | -309.3108 |
| 72 | 8.1 | -316.6418 |
| 73 | 8.2 | -324.2228 |
| 74 | 8.3 | -332.0538 |
| 75 | 8.4 | -340.1349 |
| 76 | 8.5 | -348.4659 |
| 77 | 8.6 | -357.0469 |
| 78 | 8.7 | -365.8779 |
| 79 | 8.8 | -374.9590 |
| 80 | 8.9 | -384.2900 |
| 81 | 9.0 | -393.8710 |
| 82 | 9.1 | -403.7020 |
| 83 | 9.2 | -413.7830 |
| 84 | 9.3 | -424.1141 |
| 85 | 9.4 | -434.6951 |
| 86 | 9.5 | -445.5261 |
| 87 | 9.6 | -456.6071 |
| 88 | 9.7 | -467.9382 |
| 89 | 9.8 | -479.5192 |
| 90 | 9.9 | -491.3502 |
| 91 | 10.0 | -503.4312 |

Why are these numbers negative?

Normal Distribution example

- dnorm gives us the density of an observation from the given distribution
  - These density values lie between 0 and 1, and the log of a value between 0 and 1 is negative
    - log(.05) = -3.00
- What’s the MLE?
  - m[which(log.l==max(log.l))]
    - = 5.1
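The whole one-parameter grid search can be run end to end as follows. This is a sketch on freshly simulated data with an arbitrary seed, so the numbers will not match the table above; `which.max` is used as a shorthand for the `which(... == max(...))` idiom.

```r
set.seed(818)                                   # arbitrary seed; results differ from the slides
rv.norm <- rnorm(100, mean = 5, sd = 2)

m <- seq(from = 1, to = 10, by = 0.1)           # candidate means
log.l <- sapply(m, function(mu) {
  sum(dnorm(rv.norm, mean = mu, sd = 2, log = TRUE))
})

mle.grid <- m[which.max(log.l)]                 # grid point with the largest log-likelihood
c(grid.mle = mle.grid, sample.mean = mean(rv.norm))
```

With the SD held fixed, the grid maximum is simply the grid point closest to the sample mean, so a finer grid buys more decimal places.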

Normal Distribution example

- What about estimating both the mean and the SD simultaneously?
- Use grid search approach again…
- Compute the log-likelihood at each combination of mean and SD

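| | SD | Mean | log.l |
|---|---|---|---|
| 1 | 1.0 | 1.0 | -1061.6201 |
| 2 | 1.0 | 1.1 | -1022.2843 |
| 3 | 1.0 | 1.2 | -983.9486 |
| 4 | 1.0 | 1.3 | -946.6129 |
| 5 | 1.0 | 1.4 | -910.2771 |
| 6 | 1.0 | 1.5 | -874.9414 |
| 7 | 1.0 | 1.6 | -840.6056 |
| 8 | 1.0 | 1.7 | -807.2699 |
| 9 | 1.0 | 1.8 | -774.9341 |
| 10 | 1.0 | 1.9 | -743.5984 |
| 11 | 1.0 | 2.0 | -713.2627 |
| 12 | 1.0 | 2.1 | -683.9269 |
| 13 | 1.0 | 2.2 | -655.5912 |
| 14 | 1.0 | 2.3 | -628.2554 |
| 15 | 1.0 | 2.4 | -601.9197 |
| 16 | 1.0 | 2.5 | -576.5839 |
| 17 | 1.0 | 2.6 | -552.2482 |
| 18 | 1.0 | 2.7 | -528.9125 |
| 19 | 1.0 | 2.8 | -506.5767 |
| 20 | 1.0 | 2.9 | -485.2410 |
| … | … | … | … |
| 853 | 1.9 | 4.3 | -211.3830 |
| 854 | 1.9 | 4.4 | -209.6280 |
| 855 | 1.9 | 4.5 | -208.1499 |
| 856 | 1.9 | 4.6 | -206.9489 |
| 857 | 1.9 | 4.7 | -206.0249 |
| 858 | 1.9 | 4.8 | -205.3779 |
| 859 | 1.9 | 4.9 | -205.0078 |
| 860 | 1.9 | 5.0 | -204.9148 |
| 861 | 1.9 | 5.1 | -205.0988 |
| 862 | 1.9 | 5.2 | -205.5599 |
| 863 | 1.9 | 5.3 | -206.2979 |
| 864 | 1.9 | 5.4 | -207.3129 |
| 865 | 1.9 | 5.5 | -208.6049 |
| 866 | 1.9 | 5.6 | -210.1740 |
| 867 | 1.9 | 5.7 | -212.0200 |
| 868 | 1.9 | 5.8 | -214.1431 |
| 869 | 1.9 | 5.9 | -216.5432 |
| 870 | 1.9 | 6.0 | -219.2203 |
| 871 | 1.9 | 6.1 | -222.1743 |
| 872 | 1.9 | 6.2 | -225.4054 |
| 873 | 1.9 | 6.3 | -228.9135 |
| … | … | … | … |
| 6134 | 7.7 | 4.6 | -299.1132 |
| 6135 | 7.7 | 4.7 | -299.0569 |
| 6136 | 7.7 | 4.8 | -299.0175 |
| 6137 | 7.7 | 4.9 | -298.9950 |
| 6138 | 7.7 | 5.0 | -298.9893 |
| 6139 | 7.7 | 5.1 | -299.0006 |
| 6140 | 7.7 | 5.2 | -299.0286 |
| 6141 | 7.7 | 5.3 | -299.0736 |
| 6142 | 7.7 | 5.4 | -299.1354 |
| 6143 | 7.7 | 5.5 | -299.2140 |
| 6144 | 7.7 | 5.6 | -299.3096 |
| 6145 | 7.7 | 5.7 | -299.4220 |
| 6146 | 7.7 | 5.8 | -299.5512 |
| 6147 | 7.7 | 5.9 | -299.6974 |
| 6148 | 7.7 | 6.0 | -299.8604 |
| 6149 | 7.7 | 6.1 | -300.0402 |
| 6150 | 7.7 | 6.2 | -300.2370 |
| 6151 | 7.7 | 6.3 | -300.4506 |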

Normal Distribution example

- Get max(log.l)
- m[which(log.l==max(log.l), arr.ind=T)]
- = 5.0, 1.9 (mean = 5.0, SD = 1.9)
- Note: this could be done the same way for a simple linear regression (2 parameters)
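A two-parameter grid search might be set up as below. The slides do not show how the grid was built, so `expand.grid`, the SD range of 1 to 8, and the seed are all assumptions made for this sketch.

```r
set.seed(818)                                   # arbitrary seed; results differ from the slides
rv.norm <- rnorm(100, mean = 5, sd = 2)

# Every combination of candidate mean and SD (ranges are assumptions)
grid <- expand.grid(mean = seq(1, 10, by = 0.1),
                    sd   = seq(1, 8,  by = 0.1))

# Log-likelihood of the sample at each (mean, sd) pair
grid$log.l <- mapply(function(mu, s) {
  sum(dnorm(rv.norm, mean = mu, sd = s, log = TRUE))
}, grid$mean, grid$sd)

best <- grid[which.max(grid$log.l), ]           # row with the largest log-likelihood
best
```

The grid already has 6461 rows for two parameters at a step of 0.1; the count multiplies with every added parameter, which is why grid search does not scale.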

Algorithms

- Grid search works for these simple problems with few estimated parameters
- Much more advanced search algorithms are needed for more complex problems
  - More advanced algorithms take advantage of the slope or gradient of the likelihood surface to make good guesses about the direction of search in parameter space
- We’ll use the “mle” routine in R (from the stats4 package)

Algorithms

- Grid Search:
  - Vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the highest log-likelihood
- Gradient Search:
  - Vary all parameters simultaneously, adjusting relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of log-likelihood
- Expansion Methods:
  - Find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the maximum. Fewer points need to be computed, but the computations are considerably more complicated.
- Marquardt Method: Gradient-Expansion combination
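To make the gradient idea concrete, here is a bare-bones steepest-ascent loop for the normal mean with the SD treated as known. It is a teaching sketch, not one of the production algorithms named above; the starting value, step size, and iteration count are hypothetical choices.

```r
set.seed(818)                                   # arbitrary seed
rv.norm  <- rnorm(100, mean = 5, sd = 2)
sd.known <- 2                                   # SD treated as known for this illustration

# Derivative of the log-likelihood with respect to the mean
grad <- function(mu) sum(rv.norm - mu) / sd.known^2

mu   <- 1                                       # deliberately poor starting value
rate <- 0.01                                    # step size (hypothetical choice)
for (i in 1:200) {
  mu <- mu + rate * grad(mu)                    # step uphill along the gradient
}

c(gradient.ascent = mu, sample.mean = mean(rv.norm))
```

Each step moves the estimate a fixed fraction of the way toward the sample mean, so the loop homes in on the same answer the grid search approximates.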

R – mle routine

- First we need to define a function to maximize
  - Wait! Most general routines focus on minimization
    - e.g., root finding for solving equations
  - So, usually minimize –log-likelihood
- norm.func<-function(x,y) { -sum(dnorm(rv.norm, mean=x, sd=y, log=T)) }

R – mle routine

- norm.mle <- mle(norm.func, start=list(x=4,y=2), method="L-BFGS-B", lower=c(0, 0))
- Many interesting points
  - Starting values
    - Global vs. local maxima or minima
  - Bounds
    - SD can’t be negative
    - Note that lower=c(0, 0) also bounds the mean at 0
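A self-contained version of this fit is sketched below. It swaps the box constraint for a different technique: the SD is estimated on the log scale, so it stays positive without bounds. That substitution, the parameter names `mu` and `log.sd`, and the seed are choices made for this sketch, not the slides' approach.

```r
library(stats4)                                 # provides mle()

set.seed(818)                                   # arbitrary seed
rv.norm <- rnorm(100, mean = 5, sd = 2)

# Negative log-likelihood; exp(log.sd) is always positive, so no bounds are needed
nll <- function(mu, log.sd) {
  -sum(dnorm(rv.norm, mean = mu, sd = exp(log.sd), log = TRUE))
}

fit <- mle(nll, start = list(mu = 4, log.sd = log(2)))

# Back-transform the SD to its original scale
est <- c(mean = unname(coef(fit)["mu"]),
         sd   = unname(exp(coef(fit)["log.sd"])))
est
```

Note that standard errors reported by `summary(fit)` would then refer to log(SD), not SD.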

R – mle routine

- Output - summary(norm.mle)
  - Standard errors come from the inverse of the Hessian matrix
  - Convergence!! (a convergence code of 0 indicates success)
  - -2(log-likelihood) = deviance
    - Functions like the R² in regression

Coefficients:

| | Estimate | Std. Error |
|---|---|---|
| x | 4.844249 | 0.1817031 |
| y | 1.817031 | 0.1284834 |

-2 log L: 403.2285

> norm.mle@details$convergence

[1] 0

Maximum Likelihood Regression

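- A standard regression: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
- May be broken down into two components
  - Systematic: $\hat{y}_i = \beta_0 + \beta_1 x_i$
  - Random: $\epsilon_i \sim N(0, \sigma^2)$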

Maximum Likelihood Regression

- First define our x's and y's

      x <- 1:100
      y <- 4 + 3*x + rnorm(100, mean=5, sd=20)   # noise has mean 5, so the true intercept is 4 + 5 = 9

- Define the -log-likelihood function

      reg.func <- function(b0, b1, sigma) {
        if (sigma <= 0) return(NA)               # no sd of 0 or less!
        yhat <- b0*x + b1                        # the estimated function (note: b0 is the slope, b1 the intercept)
        -sum(dnorm(y, mean=yhat, sd=sigma, log=T))   # the -log likelihood function
      }

Maximum Likelihood Regression

- Call MLE to minimize the –log-likelihood
  - lm.mle<-mle(reg.func, start=list(b0=2, b1=2, sigma=35))
- Get results - summary(lm.mle)

Coefficients:

| | Estimate | Std. Error |
|---|---|---|
| b0 | 3.071449 | 0.0716271 |
| b1 | 8.959386 | 4.1663956 |
| sigma | 20.675930 | 1.4621709 |

-2 log L: 889.567

Maximum Likelihood Regression

- Compare to OLS results
  - summary(lm(y~x))

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) | |
|---|---|---|---|---|---|
| (Intercept) | 8.95635 | 4.20838 | 2.128 | 0.0358 | * |
| x | 3.07149 | 0.07235 | 42.454 | <2e-16 | *** |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.88 on 98 degrees of freedom

Multiple R-Squared: 0.9484,
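The slope and intercept agree closely between the two fits, but the error SD does not (about 20.68 from MLE versus 20.88 from OLS). A likely explanation is the divisor: the ML estimate divides the residual sum of squares by n, while `lm` divides by the residual degrees of freedom. The sketch below checks that relationship on freshly simulated data (arbitrary seed, so the numbers differ from the slides).

```r
set.seed(818)                                   # arbitrary seed; results differ from the slides
x <- 1:100
y <- 4 + 3 * x + rnorm(100, mean = 5, sd = 20)

fit <- lm(y ~ x)
n   <- length(y)
rss <- sum(resid(fit)^2)                        # residual sum of squares

sigma.ols <- sqrt(rss / (n - 2))                # residual standard error, as in summary(lm)
sigma.mle <- sqrt(rss / n)                      # ML estimate of sigma divides by n

c(sigma.ols = sigma.ols, sigma.mle = sigma.mle,
  ratio = sigma.mle / sigma.ols)                # ratio equals sqrt((n - 2) / n)
```

Applied to the slide's numbers, 20.88 × sqrt(98/100) is roughly 20.67, close to the ML estimate reported above.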

Standard Errors of Estimates

- Behavior of the likelihood function near the maximum is important
  - If it is flat, then the observations have little to say about the parameters
    - Changes in the parameters will not cause large changes in the probability
  - If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in the probability
    - In this case we say that the observations carry more information about the parameters
  - Expressed as the second derivative (or curvature) of the log-likelihood function
    - If more than 1 parameter, then 2nd partial derivatives

Standard Errors of Estimates

- The second derivative of a function is the rate of change of its slope (e.g., acceleration is the rate of change of velocity)
- The Hessian matrix is the matrix of 2nd partial derivatives of the -log-likelihood function
- The entries in the Hessian are called the observed information for an estimate

Standard Errors

- Information is used to obtain the expected variance (or standard error) of the estimated parameters
- When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed with variance close to the inverse of the information, $I(\theta)^{-1}$
- More precisely…
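The link between the Hessian and the standard errors can be checked numerically. The sketch below uses `optim` directly (rather than `mle`) so the Hessian is easy to inspect; the seed, starting values, and the small positive lower bound on the SD are assumptions of this example. For the normal model the analytic standard errors at the MLE are SD/sqrt(n) for the mean and SD/sqrt(2n) for the SD.

```r
set.seed(818)                                   # arbitrary seed
rv.norm <- rnorm(100, mean = 5, sd = 2)
n <- length(rv.norm)

# Negative log-likelihood with p = c(mean, sd)
nll <- function(p) -sum(dnorm(rv.norm, mean = p[1], sd = p[2], log = TRUE))

fit <- optim(c(4, 2), nll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6),             # keep the SD strictly positive
             hessian = TRUE)                    # numerically differentiated Hessian

# Invert the Hessian of the -log-likelihood; square roots of the diagonal are the SEs
se <- sqrt(diag(solve(fit$hessian)))

rbind(numeric  = se,
      analytic = c(fit$par[2] / sqrt(n), fit$par[2] / sqrt(2 * n)))
```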

Likelihood Ratio Test

- Let LF be the maximum of the likelihood function for an unrestricted model
- Let LR be the maximum of the likelihood function of a restricted model nested in the full model
- LF must be greater than or equal to LR
  - Removing a variable or adding a constraint can only hurt model fit. Same logic as R²
- Question: Does adding the constraint or removing the variable (constraint of zero) significantly impact model fit?
  - Model fit will decrease but does it decrease more than would be expected by chance?

Likelihood Ratio Test

- Likelihood Ratio
  - R = -2ln(LR / LF)
  - R = 2(log(LF) – log(LR))
  - Under the null hypothesis, R is approximately distributed as a chi-square distribution with m degrees of freedom
    - m is the difference in the number of estimated parameters between the two models.
  - The expected value of R under the null hypothesis is m, so an R much bigger than the difference in parameters suggests that the constraint hurts model fit.
  - More formally…. You should reference the chi-square table with m degrees of freedom to find the probability of getting R by chance alone, assuming that the null hypothesis is true.

Likelihood Ratio Example

- Go back to our simple regression example
  - Does the variable (X) significantly improve our predictive ability or model fit?
    - Alternatively, does removing X or constraining its parameter estimate to zero significantly decrease prediction or model fit?
  - Full Model: -2log-L = 889.567
  - Reduced Model: -2log-L = 1186.05
  - R = 1186.05 - 889.567 = 296.483, with m = 1 degree of freedom
  - Chi-square critical value = 3.84
  - Since 296.483 > 3.84, X significantly improves model fit
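The same test can be computed directly from the two deviances. The values are copied from the slide; the 0.05 significance level is an assumption implied by the 3.84 critical value.

```r
dev.full    <- 889.567                          # -2 log L, full model (from the slide)
dev.reduced <- 1186.05                          # -2 log L, reduced model (from the slide)
m <- 1                                          # one parameter (the slope) constrained to zero

R <- dev.reduced - dev.full                     # likelihood ratio statistic
crit    <- qchisq(0.95, df = m)                 # critical value at alpha = 0.05 (assumed level)
p.value <- pchisq(R, df = m, lower.tail = FALSE)

c(R = R, critical = crit, p.value = p.value)
```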

Fit Indices

- Akaike’s information criterion (AIC)
  - Pronounced “Ah-kah-ee-key”
  - $AIC = -2\log L + 2K$
  - K is the number of estimated parameters in our model.
  - Penalizes the log-likelihood for using many parameters to increase fit
  - Choose the model with the smallest AIC value

Fit Indices

- Bayesian Information Criterion (BIC)
  - AKA SIC, for Schwarz Information Criterion
  - $BIC = -2\log L + K\log(n)$, where n is the sample size
  - Choose the model with the smallest BIC
  - The likelihood is the probability of obtaining the data you did under the given model. It makes sense to choose a model that makes this probability as large as possible, but putting the minus sign in front switches the maximization to minimization
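As a worked example, both indices can be computed for the two regression models from the likelihood ratio example, using the standard definitions AIC = -2 log L + 2K and BIC = -2 log L + K log(n). The parameter counts are assumptions of this sketch: K = 3 for the full model (slope, intercept, sigma) and K = 2 for the reduced model (intercept, sigma), with n = 100 observations.

```r
n <- 100                                        # sample size in the regression example

fits <- data.frame(model = c("full", "reduced"),
                   dev   = c(889.567, 1186.05), # -2 log L values from the slides
                   K     = c(3, 2))             # assumed parameter counts

fits$AIC <- fits$dev + 2 * fits$K               # AIC = -2 log L + 2K
fits$BIC <- fits$dev + fits$K * log(n)          # BIC = -2 log L + K log(n)
fits
```

Both indices are smaller for the full model, in line with the likelihood ratio test.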

Multiple Regression

- -Log-Likelihood function for multiple regression

      # Note: theta is a vector of parameters, with the error variance being the first one
      # (the code takes sqrt(theta[1]) to get the std. dev.)
      # theta[-1] is all values of theta, except the first
      # and here we're using matrix multiplication
      ols.lf3 <- function(theta, y, X) {
        if (theta[1] <= 0) return(NA)
        -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
      }
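One way this function might be used is to pass it to `optim` and compare the result with `lm`. The simulated data, predictor names, starting values, and the choice of BFGS are all assumptions of this sketch. Because the code takes `sqrt(theta[1])`, the first parameter is on the variance scale.

```r
ols.lf3 <- function(theta, y, X) {
  if (theta[1] <= 0) return(NA)                 # variance must be positive
  -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}

set.seed(818)                                   # arbitrary seed
n  <- 200
x1 <- rnorm(n)                                  # hypothetical predictors
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                          # design matrix with an intercept column

# theta = (variance, intercept, slope for x1, slope for x2)
start <- c(var(y), 0, 0, 0)
fit <- optim(start, ols.lf3, y = y, X = X, method = "BFGS")

ols <- lm(y ~ x1 + x2)
rbind(mle = fit$par[-1], ols = coef(ols))       # regression coefficients side by side
```

The coefficient estimates should agree to a few decimal places, while the ML variance corresponds to the residual sum of squares divided by n.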
