
Maximum Likelihood Estimation

Psych 818 - DeShon


MLE vs. OLS

  • Ordinary Least Squares Estimation

    • Typically yields a closed form solution that can be directly computed

      • Closed form solutions often require very strong assumptions

  • Maximum Likelihood Estimation

    • Default Method for most estimation problems

    • Generally equal to OLS when OLS assumptions are met

    • Method yields desirable “asymptotic” estimation properties

    • Foundation for Bayesian inference

    • Requires numerical methods :(


MLE logic

  • MLE reverses the probability inference

    • Recall: p(X|)

      •  represents the parameters of a model (i.e., pdf)

      • What’s the probability of observing a score of 73 from a N(70,10) distribution

    • In MLE, you know the data (Xi)

      • Primary question: Which of a potentially infinite number of distributions is most likely responsible for generating the data?

      • p(|X)?
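A quick sketch of the two directions in R (the candidate means below are illustrative, not from the slides):

  # Forward direction: density of observing x = 73 under a fixed N(70, 10)
  dnorm(73, mean = 70, sd = 10)   # ≈ 0.038

  # MLE direction: hold the datum fixed and vary the candidate mean;
  # the density is largest when the candidate mean equals the datum
  sapply(c(60, 70, 73, 80), function(mu) dnorm(73, mean = mu, sd = 10))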


Likelihood

  • Likelihood may be thought of as an unbounded or unnormalized probability measure: it need not sum or integrate to 1

    • PDF is a function of the data given the parameters on the data scale

    • Likelihood is a function of the parameters given the data on the parameter scale


Likelihood

  • Likelihood function

    • Likelihood is the joint (product) probability of the observed data given the parameters of the pdf

    • Assume you have X1,…,Xn independent samples from a given pdf, f

    • L(θ | x1,…,xn) = f(x1|θ) · f(x2|θ) · … · f(xn|θ) = ∏ f(xi|θ)


Likelihood

  • Log-Likelihood function

    • Working with products is a pain

    • Maxima are unaffected by monotone transformations, so we can take the logarithm of the likelihood and turn the product into a sum:

    • log L(θ | x1,…,xn) = ∑ log f(xi|θ)


Maximum Likelihood

  • Find the value(s) of θ that maximize the likelihood function

  • Can sometimes be found analytically

    • Maximization (or minimization) is the focus of calculus and derivatives of functions

  • Often requires iterative numeric methods
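For the normal distribution the analytic route is simple; a minimal sketch (the sample here is illustrative):

  # Setting the derivatives of the log-likelihood to zero gives
  # mu-hat = sample mean and sigma-hat = root mean squared deviation
  # (note the n, not n - 1, denominator)
  samp <- rnorm(100, mean = 5, sd = 2)
  mu.hat <- mean(samp)
  sigma.hat <- sqrt(mean((samp - mu.hat)^2))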


Likelihood

  • Normal Distribution example

    • pdf: f(x | μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

    • Likelihood: L(μ, σ | x1,…,xn) = ∏ f(xi | μ, σ)

    • Log-Likelihood: log L = C − n·log(σ) − (1/(2σ²)) ∑ (xi − μ)²

    • Note: C is a constant that vanishes once derivatives are taken


Likelihood

  • Can compute the maximum of this log-likelihood function directly

  • More relevant and fun to estimate it numerically!


Normal Distribution example

  • Assume you obtain 100 samples from a normal distribution

    • rv.norm <- rnorm(100, mean=5, sd=2)

      • This is the true data generating model!

    • Now, assume you don't know the mean of this distribution and have to estimate it from the sample…

    • Let’s compute the log-likelihood of the observations for N(4,2)


Normal Distribution example

  • sum(dnorm(rv.norm, mean=4, sd=2, log=T))

    • dnorm gives the density (not a probability) of an observation under a given distribution; with log=T it returns the log density

    • Summing it across observations gives the log-likelihood

  • = -221.0698

    • This is the log-likelihood of the data for the given pdf parameters

  • Okay, this is the log-likelihood for one possible distribution. We need to examine it for all possible distributions and select the one that yields the largest value


Normal Distribution example

    • Make a sequence of possible means

      • m<-seq(from = 1, to = 10, by = 0.1)

    • Now, compute the log-likelihood for each of the possible means

      • This is a simple “grid search” algorithm

        • log.l<-sapply(m, function (x) sum(dnorm(rv.norm, mean=x, sd=2, log=T)) )
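A quick way to see the likelihood surface this grid produces (optional plotting code; the peak should sit near the true mean of 5):

  plot(m, log.l, type = "l", xlab = "candidate mean", ylab = "log-likelihood")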


    Normal Distribution example

       mean     log.l
    1   1.0 -417.3891
    2   1.1 -407.2201
    3   1.2 -397.3012
    4   1.3 -387.6322
    5   1.4 -378.2132
    6   1.5 -369.0442
    7   1.6 -360.1253
    8   1.7 -351.4563
    9   1.8 -343.0373
    10  1.9 -334.8683
    11  2.0 -326.9494
    12  2.1 -319.2804
    13  2.2 -311.8614
    14  2.3 -304.6924
    15  2.4 -297.7734
    16  2.5 -291.1045
    17  2.6 -284.6855
    18  2.7 -278.5165
    19  2.8 -272.5975
    20  2.9 -266.9286
    21  3.0 -261.5096
    22  3.1 -256.3406
    23  3.2 -251.4216
    24  3.3 -246.7527
    25  3.4 -242.3337
    26  3.5 -238.1647
    27  3.6 -234.2457
    28  3.7 -230.5768
    29  3.8 -227.1578
    30  3.9 -223.9888
    31  4.0 -221.0698
    32  4.1 -218.4008
    33  4.2 -215.9819
    34  4.3 -213.8129
    35  4.4 -211.8939
    36  4.5 -210.2249
    37  4.6 -208.8060
    38  4.7 -207.6370
    39  4.8 -206.7180
    40  4.9 -206.0490
    41  5.0 -205.6301
    42  5.1 -205.4611
    43  5.2 -205.5421
    44  5.3 -205.8731
    45  5.4 -206.4542
    46  5.5 -207.2852
    47  5.6 -208.3662
    48  5.7 -209.6972
    49  5.8 -211.2782
    50  5.9 -213.1093
    51  6.0 -215.1903
    52  6.1 -217.5213
    53  6.2 -220.1023
    54  6.3 -222.9334
    55  6.4 -226.0144
    56  6.5 -229.3454
    57  6.6 -232.9264
    58  6.7 -236.7575
    59  6.8 -240.8385
    60  6.9 -245.1695
    61  7.0 -249.7505
    62  7.1 -254.5816
    63  7.2 -259.6626
    64  7.3 -264.9936
    65  7.4 -270.5746
    66  7.5 -276.4056
    67  7.6 -282.4867
    68  7.7 -288.8177
    69  7.8 -295.3987
    70  7.9 -302.2297
    71  8.0 -309.3108
    72  8.1 -316.6418
    73  8.2 -324.2228
    74  8.3 -332.0538
    75  8.4 -340.1349
    76  8.5 -348.4659
    77  8.6 -357.0469
    78  8.7 -365.8779
    79  8.8 -374.9590
    80  8.9 -384.2900
    81  9.0 -393.8710
    82  9.1 -403.7020
    83  9.2 -413.7830
    84  9.3 -424.1141
    85  9.4 -434.6951
    86  9.5 -445.5261
    87  9.6 -456.6071
    88  9.7 -467.9382
    89  9.8 -479.5192
    90  9.9 -491.3502
    91 10.0 -503.4312

    Why are these numbers negative?


Normal Distribution example

    • dnorm gives us the density of an observation under the given distribution

    • The log of a value between 0 and 1 is negative

      • log(0.05) ≈ -3.0

      • What’s the MLE?

        • m[which(log.l==max(log.l))]

          • = 5.1


Normal Distribution example

    • What about estimating both the mean and the SD simultaneously?

      • Use grid search approach again…

      • Compute the log-likelihood at each combination of mean and SD (one way to build the grid is sketched below)
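The slides don't show the grid-construction code; a minimal sketch that reproduces the row ordering of the output below (object names are assumptions):

  s <- seq(from = 1, to = 10, by = 0.1)      # candidate SDs
  grid <- expand.grid(mean = m, sd = s)      # all pairs; mean varies fastest
  log.l <- mapply(function(mu, sigma)
      sum(dnorm(rv.norm, mean = mu, sd = sigma, log = T)),
      grid$mean, grid$sd)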

          SD Mean      log.l
    1    1.0  1.0 -1061.6201
    2    1.0  1.1 -1022.2843
    3    1.0  1.2  -983.9486
    4    1.0  1.3  -946.6129
    5    1.0  1.4  -910.2771
    6    1.0  1.5  -874.9414
    7    1.0  1.6  -840.6056
    8    1.0  1.7  -807.2699
    9    1.0  1.8  -774.9341
    10   1.0  1.9  -743.5984
    11   1.0  2.0  -713.2627
    12   1.0  2.1  -683.9269
    13   1.0  2.2  -655.5912
    14   1.0  2.3  -628.2554
    15   1.0  2.4  -601.9197
    16   1.0  2.5  -576.5839
    17   1.0  2.6  -552.2482
    18   1.0  2.7  -528.9125
    19   1.0  2.8  -506.5767
    20   1.0  2.9  -485.2410
    ... (rows 21-852 not shown on the slides)
    853  1.9  4.3  -211.3830
    854  1.9  4.4  -209.6280
    855  1.9  4.5  -208.1499
    856  1.9  4.6  -206.9489
    857  1.9  4.7  -206.0249
    858  1.9  4.8  -205.3779
    859  1.9  4.9  -205.0078
    860  1.9  5.0  -204.9148
    861  1.9  5.1  -205.0988
    862  1.9  5.2  -205.5599
    863  1.9  5.3  -206.2979
    864  1.9  5.4  -207.3129
    865  1.9  5.5  -208.6049
    866  1.9  5.6  -210.1740
    867  1.9  5.7  -212.0200
    868  1.9  5.8  -214.1431
    869  1.9  5.9  -216.5432
    870  1.9  6.0  -219.2203
    871  1.9  6.1  -222.1743
    872  1.9  6.2  -225.4054
    873  1.9  6.3  -228.9135
    ... (rows 874-6133 not shown on the slides)
    6134 7.7  4.6  -299.1132
    6135 7.7  4.7  -299.0569
    6136 7.7  4.8  -299.0175
    6137 7.7  4.9  -298.9950
    6138 7.7  5.0  -298.9893
    6139 7.7  5.1  -299.0006
    6140 7.7  5.2  -299.0286
    6141 7.7  5.3  -299.0736
    6142 7.7  5.4  -299.1354
    6143 7.7  5.5  -299.2140
    6144 7.7  5.6  -299.3096
    6145 7.7  5.7  -299.4220
    6146 7.7  5.8  -299.5512
    6147 7.7  5.9  -299.6974
    6148 7.7  6.0  -299.8604
    6149 7.7  6.1  -300.0402
    6150 7.7  6.2  -300.2370
    6151 7.7  6.3  -300.4506


Normal Distribution example

• Get max(log.l)

• grid[which.max(log.l), ]

  • (using the assumed grid object from the sketch above; the slide's m[which(log.l==max(log.l), arr.ind=T)] indexes only the vector of means, so it cannot return both parameters)

• = mean 5.0, SD 1.9

• Note: this could be done the same way for a simple linear regression (intercept and slope, plus the error SD)


Algorithms

    • Grid search works for these simple problems with few estimated parameters

    • Much more advanced search algorithms are needed for more complex problems

      • More advanced algorithms take advantage of the slope, or gradient, of the likelihood surface to make good guesses about the direction of search in parameter space

    • We'll use the mle() routine in R (from the stats4 package)


Algorithms

• Grid Search:

  • Vary each parameter in turn, compute the log-likelihood, then find the parameter combination yielding the largest log-likelihood (equivalently, the smallest negative log-likelihood)

• Gradient Search:

  • Vary all parameters simultaneously, adjusting the relative magnitudes of the variations so that the direction of propagation in parameter space is along the direction of steepest ascent of the log-likelihood (sketched below)

• Expansion Methods:

  • Find an approximate analytical function that describes the log-likelihood hypersurface and use this function to locate the optimum. Fewer points must be computed, but the computations are considerably more complicated.

• Marquardt Method: a gradient-expansion combination
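To make the gradient idea concrete, a minimal gradient-ascent sketch for the normal-mean problem (the fixed step size and the known SD of 2 are simplifying assumptions; real routines add line searches and curvature information):

  grad <- function(mu) sum(rv.norm - mu) / 4    # d/dmu of log L when sigma = 2
  mu <- 1                                       # starting value
  for (i in 1:100) mu <- mu + 0.01 * grad(mu)   # step uphill along the slope
  mu                                            # converges to mean(rv.norm)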


R – mle() routine

    • First we need to define a function to maximize

      • Wait! Most general routines focus on minimization

        • e.g., root finding for solving equations

      • So, we usually minimize the negative log-likelihood

    • norm.func <- function(x, y) {
        -sum(dnorm(rv.norm, mean = x, sd = y, log = T))   # negative log-likelihood
      }


R – mle() routine

    • library(stats4)   # mle() lives in the stats4 package

    • norm.mle <- mle(norm.func, start=list(x=4, y=2), method="L-BFGS-B", lower=c(0, 0))

    • Many interesting points

      • Starting values

        • Global vs. local maxima or minima

      • Bounds

        • SD can’t be negative


R – mle() routine

• Output - summary(norm.mle)

• Standard errors come from the inverse of the Hessian matrix

• Convergence!!

• -2(log-likelihood) = deviance

  • Functions like the R2 in regression

Coefficients:
   Estimate Std. Error
x  4.844249  0.1817031
y  1.817031  0.1284834

-2 log L: 403.2285

> norm.mle@details$convergence
[1] 0


Maximum Likelihood Regression

• A standard regression: y = β0 + β1·x + ε, where ε ~ N(0, σ²)

• May be broken down into two components:

  • a deterministic component (β0 + β1·x) and a stochastic component (the normally distributed error ε)


Maximum Likelihood Regression

• First define our x's and y's:

  x <- 1:100
  y <- 4 + 3*x + rnorm(100, mean=5, sd=20)

• Define the negative log-likelihood function:

  reg.func <- function(b0, b1, sigma) {
    if (sigma <= 0) return(NA)   # no sd of 0 or less!
    yhat <- b0*x + b1            # the estimated function (as written, b0 is the slope and b1 the intercept)
    -sum(dnorm(y, mean = yhat, sd = sigma, log = T))   # the -log likelihood
  }


Maximum Likelihood Regression

• Call mle() to minimize the negative log-likelihood

      lm.mle<-mle(reg.func, start=list(b0=2, b1=2, sigma=35))

    • Get results - summary(lm.mle)

    Coefficients:

    Estimate Std. Error

    b0 3.071449 0.0716271

    b1 8.959386 4.1663956

    sigma 20.675930 1.4621709

    -2 log L: 889.567


Maximum Likelihood Regression

    • Compare to OLS results

      • lm(y~x)

    Coefficients:

    Estimate Std. Error t value Pr(>|t|)

    (Intercept) 8.95635 4.20838 2.128 0.0358 *

    x 3.07149 0.07235 42.454 <2e-16 ***

    ---

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 20.88 on 98 degrees of freedom

Multiple R-Squared: 0.9484

• Note the correspondence: MLE b0 ↔ OLS x coefficient (≈3.07), MLE b1 ↔ OLS intercept (≈8.96), with nearly identical standard errors


Standard Errors of Estimates

• The behavior of the likelihood function near the maximum is important

  • If it is flat, then the observations have little to say about the parameters

    • changes in the parameters will not cause large changes in the probability

  • If the likelihood has a pronounced peak near the maximum, then small changes in the parameters cause large changes in probability

    • In this case we say the observations carry more information about the parameters

  • This curvature is expressed as the second derivative of the log-likelihood function

  • With more than 1 parameter, 2nd partial derivatives are needed


Standard Errors of Estimates

• A rate of change of a rate of change is a second derivative (e.g., acceleration is the second derivative of position, just as velocity is the first)

• The Hessian matrix is the matrix of 2nd partial derivatives of the -log-likelihood function

• The entries in the Hessian are called the observed information for an estimate
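A hedged sketch of the mechanics (optim() can return the numerically estimated Hessian of the minimized negative log-likelihood; this reuses the rv.norm sample from earlier):

  fit <- optim(c(mu = 4, sigma = 2),
               function(p) -sum(dnorm(rv.norm, p[1], p[2], log = TRUE)),
               hessian = TRUE)
  sqrt(diag(solve(fit$hessian)))   # invert the Hessian, square-root the diagonal -> SEs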


Standard Errors

• The information is used to obtain the expected variance (and standard error) of the estimated parameters

• When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed, with variance close to 1/I(θ), the inverse of the observed information

• More precisely, θ̂ ~ N(θ, I(θ)⁻¹) asymptotically, so SE(θ̂) ≈ √(1/I(θ))


Likelihood Ratio Test

    • Let LF be the maximum of the likelihood function for an unrestricted model

    • Let LR be the maximum of the likelihood function of a restricted model nested in the full model

    • LF must be greater than or equal to LR

      • Removing a variable or adding a constraint can only hurt model fit. Same logic as R2

    • Question: Does adding the constraint or removing the variable (constraint of zero) significantly impact model fit?

      • Model fit will decrease but does it decrease more than would be expected by chance?


Likelihood Ratio Test

• Likelihood Ratio

• R = -2·ln(LR / LF)

• Equivalently, R = 2(log(LF) - log(LR))

• R follows a chi-square distribution with m degrees of freedom

  • m is the difference in the number of estimated parameters between the two models

  • The expected value of R under the null hypothesis is m, so an R much bigger than the difference in parameters suggests the constraint hurts model fit

  • More formally, reference the chi-square distribution with m degrees of freedom to find the probability of getting an R this large by chance alone, assuming the null hypothesis is true (as sketched below)
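In R this comparison is one line once both models are fit; a hedged sketch using the regression example (logLik() works on lm fits):

  full    <- lm(y ~ x)
  reduced <- lm(y ~ 1)                      # X constrained to zero
  R <- as.numeric(2 * (logLik(full) - logLik(reduced)))
  pchisq(R, df = 1, lower.tail = FALSE)     # p-value on m = 1 df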


Likelihood Ratio Example

• Go back to our simple regression example

• Does the variable (X) significantly improve our predictive ability or model fit?

  • Alternatively, does removing X, or constraining its parameter estimate to zero, significantly decrease prediction or model fit?

• Full Model: -2log-L = 889.567

• Reduced Model: -2log-L = 1186.05

• R = 1186.05 - 889.567 = 296.48, which dwarfs the chi-square critical value of 3.84 (1 df, α = .05), so X clearly improves model fit


Fit Indices

• Akaike's information criterion (AIC)

  • Pronounced "Ah-kah-ee-key"

  • AIC = -2·log(L) + 2K

  • K is the number of estimated parameters in the model

  • Penalizes the log-likelihood for using many parameters to increase fit

  • Choose the model with the smallest AIC value
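R computes AIC directly for most fitted-model objects; for example, on the regression fit from earlier:

  AIC(lm(y ~ x))   # -2·logLik + 2K; smaller is better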


Fit Indices

• Bayesian Information Criterion (BIC)

  • AKA SIC, for Schwarz Information Criterion

  • BIC = -2·log(L) + K·log(n), where n is the sample size

  • Choose the model with the smallest BIC

  • The likelihood is the probability of obtaining the data you did under the given model, so it makes sense to choose a model that makes this probability as large as possible; putting the minus sign in front switches the maximization to minimization
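Likewise for BIC, reusing the same fit:

  BIC(lm(y ~ x))   # -2·logLik + K·log(n); heavier penalty as n grows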


Multiple Regression

• Negative log-likelihood function for multiple regression:

  # Note: theta is a vector of parameters with the error *variance* as its
  # first element (the code takes sqrt(theta[1]) to recover the SD);
  # theta[-1] is all values of theta except the first (the coefficients),
  # and X %*% theta[-1] is matrix multiplication
  ols.lf3 <- function(theta, y, X) {
    if (theta[1] <= 0) return(NA)
    -sum(dnorm(y, mean = X %*% theta[-1], sd = sqrt(theta[1]), log = TRUE))
  }
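A hedged usage sketch (assumptions: the x and y from the regression example, a design matrix with a leading column of 1s, and illustrative starting values near the known answer):

  X <- cbind(1, x)                                      # intercept column plus predictor
  fit <- optim(c(400, 2, 2), ols.lf3, y = y, X = X)     # start: variance, intercept, slope
  fit$par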

