
# Maximum Likelihood Estimation - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Maximum Likelihood Estimation' - orenda

Presentation Transcript

### Maximum Likelihood Estimation

Psych 818 - DeShon

• Ordinary Least Squares Estimation

• Typically yields a closed form solution that can be directly computed

• Closed form solutions often require very strong assumptions

• Maximum Likelihood Estimation

• Default Method for most estimation problems

• Generally equal to OLS when OLS assumptions are met

• Method yields desirable “asymptotic” estimation properties

• Foundation for Bayesian inference

• Requires numerical methods :(

• MLE reverses the probability inference

• Recall: p(X|)

•  represents the parameters of a model (i.e., pdf)

• What’s the probability (density) of observing a score of 73 from a N(70,10) distribution?
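As a quick check in R (dnorm evaluates the normal density; the score of 73 and the N(70,10) distribution are just this slide's example):

```r
# Density of observing x = 73 under a normal with mean 70 and sd 10
dnorm(73, mean = 70, sd = 10)
# about 0.0381 -- a density, not a probability, since X is continuous
```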

• In MLE, you know the data (Xi)

• Primary question: Which of a potentially infinite number of distributions is most likely responsible for generating the data?

• p(|X)?

• Likelihood may be thought of as an unbounded or unnormalized probability measure

• PDF is a function of the data given the parameters on the data scale

• Likelihood is a function of the parameters given the data on the parameter scale

• Likelihood function

• Likelihood is the joint (product) probability of the observed data given the parameters of the pdf

• Assume you have X1,…,Xn independent samples from a given pdf, f

• Log-Likelihood function

• Working with products is a pain

• Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn the product into a sum
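A small R sketch of this point (the sample of five draws is illustrative): the product of densities and the sum of log densities trace out the same peak.

```r
set.seed(1)
x <- rnorm(5, mean = 0, sd = 1)   # a few iid draws

# Likelihood: joint (product) density of the sample at a candidate mean
lik    <- function(mu) prod(dnorm(x, mean = mu, sd = 1))
# Log-likelihood: the same curve on the log scale, computed as a sum
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

mu.grid <- seq(-2, 2, by = 0.01)
# Both peak at the same mu (here, the sample mean), because log is monotone
mu.grid[which.max(sapply(mu.grid, lik))]
mu.grid[which.max(sapply(mu.grid, loglik))]
```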

• Find the value(s) of θ that maximize the likelihood function

• Can sometimes be found analytically

• Maximization (or minimization) is the focus of calculus and derivatives of functions

• Often requires iterative numeric methods

• Normal Distribution example

• pdf: f(x | μ, σ) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )

• Likelihood: L(μ, σ | x₁, …, xₙ) = ∏ᵢ f(xᵢ | μ, σ)

• Log-Likelihood: log L(μ, σ) = C − n log(σ) − Σᵢ (xᵢ − μ)² / (2σ²)

• Note: C = −(n/2) log(2π) is a constant that vanishes once derivatives are taken

• Can compute the maximum of this log-likelihood function directly

• More relevant and fun to estimate it numerically!

• Assume you obtain 100 samples from a normal distribution

• rv.norm <- rnorm(100, mean=5, sd=2)

• This is the true data generating model!

• Now, assume we don’t know the mean of this distribution and have to estimate it…

• Let’s compute the log-likelihood of the observations for N(4,2)

• sum(dnorm(rv.norm, mean=4, sd=2, log=T))

• dnorm gives the density of an observation under a given distribution

• Summing it across observations gives the log-likelihood

• = -221.0698

• This is the log-likelihood of the data for the given pdf parameters

• Okay, this is the log-likelihood for one possible distribution….we need to examine it for all possible distributions and select the one that yields the largest value

• Make a sequence of possible means

• m<-seq(from = 1, to = 10, by = 0.1)

• Now, compute the log-likelihood for each of the possible means

• This is a simple “grid search” algorithm

• log.l<-sapply(m, function (x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)) )

```
   mean     log.l
1   1.0 -417.3891
2   1.1 -407.2201
3   1.2 -397.3012
4   1.3 -387.6322
5   1.4 -378.2132
6   1.5 -369.0442
7   1.6 -360.1253
8   1.7 -351.4563
9   1.8 -343.0373
10  1.9 -334.8683
11  2.0 -326.9494
12  2.1 -319.2804
13  2.2 -311.8614
14  2.3 -304.6924
15  2.4 -297.7734
16  2.5 -291.1045
17  2.6 -284.6855
18  2.7 -278.5165
19  2.8 -272.5975
20  2.9 -266.9286
21  3.0 -261.5096
22  3.1 -256.3406
23  3.2 -251.4216
24  3.3 -246.7527
25  3.4 -242.3337
26  3.5 -238.1647
27  3.6 -234.2457
28  3.7 -230.5768
29  3.8 -227.1578
30  3.9 -223.9888
31  4.0 -221.0698
32  4.1 -218.4008
33  4.2 -215.9819
34  4.3 -213.8129
35  4.4 -211.8939
36  4.5 -210.2249
37  4.6 -208.8060
38  4.7 -207.6370
39  4.8 -206.7180
40  4.9 -206.0490
41  5.0 -205.6301
42  5.1 -205.4611
43  5.2 -205.5421
44  5.3 -205.8731
45  5.4 -206.4542
46  5.5 -207.2852
47  5.6 -208.3662
48  5.7 -209.6972
49  5.8 -211.2782
50  5.9 -213.1093
51  6.0 -215.1903
52  6.1 -217.5213
53  6.2 -220.1023
54  6.3 -222.9334
55  6.4 -226.0144
56  6.5 -229.3454
57  6.6 -232.9264
58  6.7 -236.7575
59  6.8 -240.8385
60  6.9 -245.1695
61  7.0 -249.7505
62  7.1 -254.5816
63  7.2 -259.6626
64  7.3 -264.9936
65  7.4 -270.5746
66  7.5 -276.4056
67  7.6 -282.4867
68  7.7 -288.8177
69  7.8 -295.3987
70  7.9 -302.2297
71  8.0 -309.3108
72  8.1 -316.6418
73  8.2 -324.2228
74  8.3 -332.0538
75  8.4 -340.1349
76  8.5 -348.4659
77  8.6 -357.0469
78  8.7 -365.8779
79  8.8 -374.9590
80  8.9 -384.2900
81  9.0 -393.8710
82  9.1 -403.7020
83  9.2 -413.7830
84  9.3 -424.1141
85  9.4 -434.6951
86  9.5 -445.5261
87  9.6 -456.6071
88  9.7 -467.9382
89  9.8 -479.5192
90  9.9 -491.3502
91 10.0 -503.4312
```

Why are these numbers negative?

• dnorm gives us the density of an observation under the given distribution, which here is a value between 0 and 1

• The log of a value between 0 and 1 is negative

• e.g., log(.05) ≈ -3.0

• What’s the MLE?

• m[which(log.l==max(log.l))]

• = 5.1
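Putting the pieces together, a self-contained version of the grid search above (a seed is added here for reproducibility — an assumption the slides don't make, so the exact MLE you get depends on the simulated sample):

```r
set.seed(818)                         # any seed works; results vary by sample
rv.norm <- rnorm(100, mean = 5, sd = 2)

m <- seq(from = 1, to = 10, by = 0.1)
log.l <- sapply(m, function(x) sum(dnorm(rv.norm, mean = x, sd = 2, log = TRUE)))

m[which.max(log.l)]                   # grid MLE of the mean; lands near the sample mean
```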

• What about estimating both the mean and the SD simultaneously?

• Use grid search approach again…

• Compute the log-likelihood at each combination of mean and SD
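One way to set up the two-parameter grid (a sketch using expand.grid; the slides don't show their code, so the setup and variable names here are assumptions):

```r
set.seed(818)                         # illustrative simulated sample
rv.norm <- rnorm(100, mean = 5, sd = 2)

# Every combination of candidate SD and mean
grid <- expand.grid(SD = seq(1, 10, by = 0.1), Mean = seq(1, 10, by = 0.1))
grid$log.l <- mapply(function(s, m) sum(dnorm(rv.norm, mean = m, sd = s, log = TRUE)),
                     grid$SD, grid$Mean)

grid[which.max(grid$log.l), ]         # the (SD, Mean) pair with the largest log-likelihood
```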

```
       SD Mean      log.l
1     1.0  1.0 -1061.6201
2     1.0  1.1 -1022.2843
3     1.0  1.2  -983.9486
4     1.0  1.3  -946.6129
5     1.0  1.4  -910.2771
6     1.0  1.5  -874.9414
7     1.0  1.6  -840.6056
8     1.0  1.7  -807.2699
9     1.0  1.8  -774.9341
10    1.0  1.9  -743.5984
11    1.0  2.0  -713.2627
12    1.0  2.1  -683.9269
13    1.0  2.2  -655.5912
14    1.0  2.3  -628.2554
15    1.0  2.4  -601.9197
16    1.0  2.5  -576.5839
17    1.0  2.6  -552.2482
18    1.0  2.7  -528.9125
19    1.0  2.8  -506.5767
20    1.0  2.9  -485.2410
...
853   1.9  4.3  -211.3830
854   1.9  4.4  -209.6280
855   1.9  4.5  -208.1499
856   1.9  4.6  -206.9489
857   1.9  4.7  -206.0249
858   1.9  4.8  -205.3779
859   1.9  4.9  -205.0078
860   1.9  5.0  -204.9148
861   1.9  5.1  -205.0988
862   1.9  5.2  -205.5599
863   1.9  5.3  -206.2979
864   1.9  5.4  -207.3129
865   1.9  5.5  -208.6049
866   1.9  5.6  -210.1740
867   1.9  5.7  -212.0200
868   1.9  5.8  -214.1431
869   1.9  5.9  -216.5432
870   1.9  6.0  -219.2203
871   1.9  6.1  -222.1743
872   1.9  6.2  -225.4054
873   1.9  6.3  -228.9135
...
6134  7.7  4.6  -299.1132
6135  7.7  4.7  -299.0569
6136  7.7  4.8  -299.0175
6137  7.7  4.9  -298.9950
6138  7.7  5.0  -298.9893
6139  7.7  5.1  -299.0006
6140  7.7  5.2  -299.0286
6141  7.7  5.3  -299.0736
6142  7.7  5.4  -299.1354
6143  7.7  5.5  -299.2140
6144  7.7  5.6  -299.3096
6145  7.7  5.7  -299.4220
6146  7.7  5.8  -299.5512
6147  7.7  5.9  -299.6974
6148  7.7  6.0  -299.8604
6149  7.7  6.1  -300.0402
6150  7.7  6.2  -300.2370
6151  7.7  6.3  -300.4506
```

• Get max(log.l)

• m[which(log.l==max(log.l), arr.ind=T)]

• = 5.0, 1.9

• Note: this could be done the same way for a simple linear regression (2 parameters)

• Grid search works for these simple problems with few estimated parameters

• Much more advanced search algorithms are needed for more complex problems

• More advanced algorithms take advantage of the slope or gradient of the likelihood surface to make good guesses about the direction of search in parameter space

• We’ll use the “mle” routine in R (from the stats4 package)

• Grid Search:

• Vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the largest log-likelihood

• Gradient Search:

• Vary all parameters simultaneously, adjusting the relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood

• Expansion Methods:

• Find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the maximum. Fewer points need to be computed, but each computation is considerably more complicated.
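As an illustration of a gradient-style search (this uses R's general-purpose optim() with BFGS — an assumption for demonstration, not the routine the slides use), the optimizer climbs the likelihood surface in a handful of steps instead of evaluating thousands of grid points:

```r
set.seed(818)                                  # illustrative simulated sample
rv.norm <- rnorm(100, mean = 5, sd = 2)

# Negative log-likelihood over c(mean, sd); optim() minimizes by default
negll <- function(par) {
  if (par[2] <= 0) return(Inf)                 # keep the sd in bounds
  -sum(dnorm(rv.norm, mean = par[1], sd = par[2], log = TRUE))
}

fit <- optim(par = c(4, 1), fn = negll, method = "BFGS")
fit$par                                        # close to the true c(5, 2)
```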

• First we need to define a function to maximize

• Wait! Most general routines focus on minimization

• e.g., root finding for solving equations

• So, usually minimize –log-likelihood

• Define the function (note that mle() comes from the stats4 package):

```r
library(stats4)

norm.func <- function(x, y) {
  sum(sapply(rv.norm, function(z) -1 * dnorm(z, mean = x, sd = y, log = TRUE)))
}

norm.mle <- mle(norm.func, start = list(x = 4, y = 2),
                method = "L-BFGS-B", lower = c(0, 0))
```

• Many interesting points

• Starting values

• Global vs. local maxima or minima

• Bounds

• SD can’t be negative

• Output - summary(norm.mle)

• Standard errors come from the inverse of the Hessian matrix

• Convergence!!

• -2(log-likelihood) = deviance

• Functions like the R2 in regression

```
Coefficients:
  Estimate Std. Error
x 4.844249  0.1817031
y 1.817031  0.1284834

-2 log L: 403.2285

> norm.mle@details$convergence
[1] 0
```

• A standard regression:

• May be broken down into two components

• First define our x's and y's:

```r
x <- 1:100
y <- 4 + 3 * x + rnorm(100, mean = 5, sd = 20)
```

• Define the -log-likelihood function:

```r
reg.func <- function(b0, b1, sigma) {
  if (sigma <= 0) return(NA)    # no sd of 0 or less!
  yhat <- b0 * x + b1           # the estimated function (b0 = slope, b1 = intercept here)
  -sum(dnorm(y, mean = yhat, sd = sigma, log = TRUE))   # the -log-likelihood
}
```

• Call mle to minimize the –log-likelihood:

```r
lm.mle <- mle(reg.func, start = list(b0 = 2, b1 = 2, sigma = 35))
```

• Get results - summary(lm.mle)

```
Coefficients:
        Estimate Std. Error
b0      3.071449  0.0716271
b1      8.959386  4.1663956
sigma  20.675930  1.4621709

-2 log L: 889.567
```

• Compare to OLS results: lm(y ~ x)

```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.95635    4.20838   2.128   0.0358 *
x            3.07149    0.07235  42.454   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.88 on 98 degrees of freedom
Multiple R-squared: 0.9484
```

• Behavior of the likelihood function near the maximum is important

• If it is flat, the observations have little to say about the parameters, because changes in the parameters will not cause large changes in the probability

• If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in probability

• Expressed as the second derivative (or curvature) of the log-likelihood function

• If more than 1 parameter, then 2nd partial derivatives

• Rate of change is the second derivative of a function (e.g., velocity and acceleration)

• Hessian Matrix is the matrix of 2nd partial derivatives of the -log-likelihood function

• The entries in the Hessian are called the observed information for an estimate

• Information is used to obtain the expected variance (and standard errors) of the estimated parameters

• When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to the inverse of the information, 1/I(θ̂)
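A sketch of this calculation (the simulated sample and use of optim() are assumptions for illustration): ask optim() for the Hessian of the -log-likelihood, invert it, and take square roots of the diagonal to get the standard errors.

```r
set.seed(818)                                  # same style of simulated sample as earlier
rv.norm <- rnorm(100, mean = 5, sd = 2)

negll <- function(par) {
  if (par[2] <= 0) return(Inf)
  -sum(dnorm(rv.norm, mean = par[1], sd = par[2], log = TRUE))
}

fit <- optim(c(4, 2), negll, hessian = TRUE)   # numeric Hessian at the minimum
se  <- sqrt(diag(solve(fit$hessian)))          # invert the observed information

se   # roughly c(sd/sqrt(n), sd/sqrt(2n)) for a normal mean and sd
```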

• More precisely…

• Let LF be the maximum of the likelihood function for an unrestricted model

• Let LR be the maximum of the likelihood function of a restricted model nested in the full model

• LF must be greater than or equal to LR

• Removing a variable or adding a constraint can only hurt model fit. Same logic as R2

• Question: Does adding the constraint or removing the variable (constraint of zero) significantly impact model fit?

• Model fit will decrease but does it decrease more than would be expected by chance?

• Likelihood Ratio

• R = -2ln(LR / LF)

• R = 2(log(LF) – log(LR))

• R is distributed as chi-square distribution with m degrees of freedom

• m is the difference in the number of estimated parameters between the two models.

• The expected value of R is m under the null hypothesis, so an R much bigger than the difference in the number of parameters suggests the constraint hurts model fit.

• More formally…. You should reference the chi-square table with m degrees of freedom to find the probability of getting R by chance alone, assuming that the null hypothesis is true.
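The test itself is one line of arithmetic plus a chi-square lookup. A sketch in R, plugging in the two deviances from the regression example on these slides (889.567 for the full model, 1186.05 for the reduced model, 1 constrained parameter):

```r
dev.full    <- 889.567    # -2 log L, model including X (from the slides)
dev.reduced <- 1186.05    # -2 log L, intercept-only model (from the slides)

R <- dev.reduced - dev.full            # = 2*(log LF - log LR), about 296.5

qchisq(0.95, df = 1)                   # critical value, about 3.84
pchisq(R, df = 1, lower.tail = FALSE)  # p-value: vanishingly small
```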

• Go back to our simple regression example

• Does the variable (X) significantly improve our predictive ability or model fit?

• Alternatively, does removing X or constraining its parameter estimate to zero significantly decrease prediction or model fit?

• Full Model: -2 log L = 889.567

• Reduced Model: -2 log L = 1186.05

• R = 1186.05 − 889.567 ≈ 296.5, far larger than the chi-square critical value of 3.84 (1 df), so X significantly improves model fit

• Akaike’s information criterion (AIC)

• Pronounced “Ah-kah-ee-key”

• AIC = −2 log L + 2K, where K is the number of estimated parameters in our model

• The 2K term penalizes the log-likelihood for using many parameters to increase fit

• Choose the model with the smallest AIC value

• Bayesian Information Criterion (BIC)

• AKA SIC, for Schwarz Information Criterion

• BIC = −2 log L + K log(n), which penalizes extra parameters more heavily as the sample size n grows

• Choose the model with the smallest BIC
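Both criteria are simple functions of the maximized log-likelihood. A sketch computing them by hand and checking against R's built-ins (the simulated regression data mirror the earlier example; the seed is an assumption):

```r
set.seed(818)
x <- 1:100
y <- 4 + 3 * x + rnorm(100, mean = 5, sd = 20)

fit.full    <- lm(y ~ x)
fit.reduced <- lm(y ~ 1)

ll <- as.numeric(logLik(fit.full))   # maximized log-likelihood
K  <- attr(logLik(fit.full), "df")   # parameters counted: slope, intercept, residual sd
n  <- length(y)

c(-2 * ll + 2 * K,      AIC(fit.full))   # AIC by hand vs. built-in: they match
c(-2 * ll + K * log(n), BIC(fit.full))   # BIC by hand vs. built-in: they match
AIC(fit.full) < AIC(fit.reduced)         # the model including X is preferred
```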

• The likelihood is the probability of obtaining the data you did under the given model, so it makes sense to choose a model that makes this probability as large as possible. Putting the minus sign in front simply switches the maximization to a minimization.

• -Log-Likelihood function for multiple regression

```r
# Note: theta is a vector of parameters, with the error variance first
# (sd = sqrt(theta[1])); theta[-1] is all values of theta except the first.
# X %*% theta[-1] is matrix multiplication of the design matrix by the coefficients.
ols.lf3 <- function(theta, y, X) {
  if (theta[1] <= 0) return(NA)
  -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
}
```