Estimation of Item Response Models


Presentation Transcript


  1. Estimation of Item Response Models Mister Ibik Division of Psychology in Education Arizona State University EDP 691: Advanced Topics in Item Response Theory

  2. Motivation and Objectives • Why estimate? • Distinguishing feature of IRT modeling as compared to classical techniques is the presence of parameters • These parameters characterize and guide inference regarding entities of interest (i.e., examinees, items) • We will think through: • Different estimation situations • Alternative estimation techniques • The logic and mathematics underpinning these techniques • Various strengths and weaknesses • What you will have • A detailed introduction to principles and mathematics • A resource to be revisited…and revisited…and revisited

  3. Outline • Some Necessary Mathematical Background • Maximum Likelihood and Bayesian Theory • Estimation of Person Parameters When Item Parameters are Known • ML • MAP • EAP • Estimation of Item Parameters When Person Parameters are Known • ML • Simultaneous Estimation of Item and Person Parameters • JML • CML • MML • Other Approaches

  4. Background: Finding the Root of an Equation • Newton-Raphson Algorithm • Finds the root of an equation • Example: the function f(x) = x² • Has a root (where f(x) = 0) at x = 0

  5. Newton-Raphson • Newton-Raphson takes a given point, x0, and systematically progresses to find the root of the equation • Utilizes the slope of the function to find where the root may be • The slope of the function is given by the derivative • Denoted f′(x) • Gives the slope of the straight line that is tangent to f(x) at x • Tangent: best linear prediction of how the function is changing • For x0, the best guess for the root is the point where the tangent line equals 0 • This occurs at x0 − f(x0)/f′(x0) • So the next candidate point for the root is: x1 = x0 − f(x0)/f′(x0)

  6. Newton-Raphson Updating (1) • Suppose x0 = 1.5 • f′(x0) = 3, f(x0) = 2.25 • x1 = x0 − f(x0)/f′(x0) = 1.5 − 2.25/3 = 0.75

  7. Newton-Raphson Updating (2) • Now x1 = 0.75 • f′(x1) = 1.5, f(x1) = 0.5625 • x2 = x1 − f(x1)/f′(x1) = 0.375

  8. Newton-Raphson Updating (3) • Now x2 = 0.375 • f′(x2) = 0.75, f(x2) = 0.1406 • x3 = x2 − f(x2)/f′(x2) = 0.1875

  9. Newton-Raphson Updating (4) • Now x3 = 0.1875 • f′(x3) = 0.375, f(x3) = 0.0352 • x4 = x3 − f(x3)/f′(x3) = 0.0938

  10. Newton-Raphson Example

  11. Newton-Raphson Summary • Iterative algorithm for finding the root of an equation • Takes a starting point and systematically progresses to find the root of the function • Requires the derivative of the function • Each successive point is given by xn+1 = xn − f(xn)/f′(xn) • The process continues until we get arbitrarily close to the root, as usually measured by the change in x (or in f(x)) from one iteration to the next
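
To make the recipe concrete, here is a minimal Python sketch of the update rule, run on the f(x) = x² example from the preceding slides; the function names, tolerance, and iteration cap are illustrative choices, not values from the slides.

```python
def newton_raphson(f, f_prime, x0, tol=1e-3, max_iter=50):
    """Find a root of f via Newton-Raphson updates x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)
        if abs(x_new - x) < tol:     # stop when the change is arbitrarily small
            return x_new
        x = x_new
    return x

# The slides' example: f(x) = x^2 has a root at x = 0; start from x0 = 1.5
root = newton_raphson(lambda x: x**2, lambda x: 2 * x, x0=1.5)
print(root)  # iterates 1.5 -> 0.75 -> 0.375 -> ..., converging toward 0
```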

  12. Difficulties With Newton-Raphson • Some functions have multiple roots • Which root is found often depends on the start value

  13. Difficulties With Newton-Raphson • Numerical complications can arise • When the derivative is relatively small in magnitude, the step f(x)/f′(x) becomes very large and the algorithm can shoot off into outer space

  14. Logic of Maximum Likelihood • A general approach to parameter estimation • The use of a model implies that the data may be sufficiently characterized by the features of the model, including the unknown parameters • Parameters govern the data in the sense that the data depend on the parameters • Given values of the parameters we can calculate the (conditional) probability of the data • P(Xij = 1 | θi, bj) = exp(θi – bj)/(1+ exp(θi – bj)) • Maximum likelihood (ML) estimation asks: “What are the values of the parameters that make the data most probable?”

  15. Example: Series of Bernoulli Variables With Unknown Probability • Bernoulli variable: P(X = 1) = p • The probability of the data is given by p^X × (1 − p)^(1 − X) • Suppose we have two independent random variables X1 and X2 • When the probability of the data is taken as a function of the parameters, it is called the likelihood • Suppose X1 = 1, X2 = 0 • P(X1 = 1, X2 = 0 | p) = L(p | X1 = 1, X2 = 0) = p × (1 − p) • Choose p to maximize the conditional probability of the data • For p = 0.1, L = 0.1 × (1 − 0.1) = 0.09 • For p = 0.2, L = 0.2 × (1 − 0.2) = 0.16 • For p = 0.3, L = 0.3 × (1 − 0.3) = 0.21

  16. Example: Likelihood Function
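
The plot behind slide 16 is not reproduced here, but the grid evaluation it summarizes is easy to sketch; the grid of candidate p values below is an assumption for illustration.

```python
# Likelihood of independent Bernoulli observations, evaluated over a grid of p
def likelihood(p, x):
    out = 1.0
    for xi in x:
        out *= p**xi * (1 - p)**(1 - xi)
    return out

data = [1, 0]                              # X1 = 1, X2 = 0, as on slide 15
grid = [i / 10 for i in range(1, 10)]      # candidate values p = 0.1, ..., 0.9
for p in grid:
    print(p, round(likelihood(p, data), 3))
# L(p) = p(1 - p) peaks at p = 0.5 (L = 0.25)
```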

  17. The Likelihood Function in IRT • The Likelihood may be thought of as the conditional probability, where the data are known and the parameters vary • Let Pij = P(Xij = 1 | θi, ωj) • Under independence, L(θ, ω | X) = ∏i ∏j Pij^Xij × (1 − Pij)^(1 − Xij) • The goal is to maximize this function – what values of the parameters yield the highest value?

  18. Log-Likelihood Functions • It is numerically easier to maximize the natural logarithm of the likelihood, ln[L] = Σi Σj [Xij ln Pij + (1 − Xij) ln(1 − Pij)] • Because ln is a strictly increasing function, the log-likelihood has the same maximum as the likelihood

  19. Maximizing the Log-Likelihood • Note that at the maximum of the function, the slope of the tangent line equals 0 • The slope of the tangent is given by the first derivative • If we can find the point at which the first derivative equals 0, we will have also found the point at which the function is maximized

  20. Overview of Numerical Techniques • One can maximize the ln[L] function by finding a point where its derivative is 0 • A variety of methods are available for maximizing L, or ln[L] • Newton-Raphson • Fisher Scoring • Expectation-Maximization (EM) • The generality of ML estimation and these numerical techniques results in the same concepts and estimation routines being employed across modeling situations • Logistic regression, log-linear modeling, FA, SEM, LCA

  21. ML Estimation of Person Parameters When Item Parameters Are Known • Assume item parameters bj, aj, and cj are known • Assume unidimensionality, local independence, and respondent independence • The conditional probability now depends on the person parameter only: Pij = P(Xij = 1 | θi) • The likelihood is a function of the person parameters only: L(θ | X) = ∏i ∏j Pij^Xij × (1 − Pij)^(1 − Xij)

  22. ML Estimation of Person Parameters When Item Parameters Are Known • Choose each θi such that L or ln[L] is maximized • Let’s suppose we have one examinee • Maximize this function using any of several methods • We’ll use Newton-Raphson

  23. Newton-Raphson Estimation Recap • Recall NR seeks to find the root of a function (the point where f(x) = 0) • NR updates follow the general structure: updated value = current value − (function of interest)/(derivative of the function of interest), i.e., xn+1 = xn − f(xn)/f′(xn) • What is our function of interest? • What is the derivative of this function?

  24. Newton-Raphson Estimation of Person Parameters • Newton-Raphson uses the derivative of the function of interest • Our function is itself a derivative, the first derivative of ln[L] with respect to θi • We’ll need the second derivative as well as the first derivative • Updates given by θi(new) = θi(old) − [∂ln[L]/∂θi] / [∂²ln[L]/∂θi²]

  25. ML Estimation of Person Parameters When Item Parameters Are Known: The Log-Likelihood • The log-likelihood to be maximized: ln[L(θi | Xi)] = Σj [Xij ln Pij + (1 − Xij) ln(1 − Pij)] • Select a start value and iterate towards a solution using Newton-Raphson • A “hill-climbing” sequence

  26. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson • Start at -1.0

  27. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson • Move to 0.09

  28. ML Estimation of Person Parameters When Item Parameters Are Known: Newton-Raphson • Move to -0.0001 • When the change in θi is arbitrarily small (e.g., less than 0.001), stop estimation • No meaningful change in next step • The key is that the tangent is 0
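
A minimal sketch of this hill-climbing sequence for the Rasch model of slide 14, assuming the item difficulties are known. The difficulties, the response pattern, and the convergence tolerance below are hypothetical, since the slides do not list the values behind their example; only the start value of -1.0 echoes slide 26.

```python
import math

def p_rasch(theta, b):
    """Rasch model: P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_theta(x, b, theta0=-1.0, tol=1e-3, max_iter=50):
    """Newton-Raphson ML estimate of theta when item difficulties b are known."""
    theta = theta0
    for _ in range(max_iter):
        p = [p_rasch(theta, bj) for bj in b]
        first = sum(xj - pj for xj, pj in zip(x, p))      # d ln L / d theta
        second = -sum(pj * (1 - pj) for pj in p)          # d^2 ln L / d theta^2
        theta_new = theta - first / second
        if abs(theta_new - theta) < tol:                  # change arbitrarily small: stop
            return theta_new
        theta = theta_new
    return theta

# Hypothetical item difficulties and response pattern (not from the slides)
b = [-1.5, -0.5, 0.0, 0.5, 1.5]
x = [1, 1, 0, 1, 0]
print(ml_theta(x, b))   # hill-climbs from the start value toward the ML estimate
```

For the 2PL or 3PL the same loop applies, with the first and second derivatives of ln[L] replaced by their counterparts for that model.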

  29. Newton-Raphson Estimation of Multiple Person Parameters • But we have N examinees, each with a θi to be estimated • We need a multivariate version of the Newton-Raphson algorithm

  30. First Order Derivatives • First order derivatives of the log-likelihood, ∂ln[L]/∂θi, one per examinee • ∂ln[L]/∂θi only involves terms corresponding to subject i • Why???

  31. Second Order Derivatives • Hessian: matrix of second order partial derivatives of the log-likelihood • This matrix needs to be inverted • In the current context, this matrix is diagonal • Why???

  32. Second Order Derivatives • The inverse of the Hessian is diagonal with elements that are the reciprocals of the diagonal of the Hessian • Updates for each θi do not depend on any other subject’s θ

  33. Second Order Derivatives • The updates for each θi are independent of one another • The procedure can be performed one examinee at a time
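
Continuing the earlier sketch: because the Hessian is diagonal, the multivariate update collapses into one univariate Newton-Raphson per examinee, so the single-examinee routine can simply be applied row by row. The response matrix here is hypothetical.

```python
# One row of responses per examinee; ml_theta() and b are from the sketch above
X = [
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
theta_hat = [ml_theta(xi, b) for xi in X]   # each update ignores the other examinees
print(theta_hat)
```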

  34. ML Estimation of Person Parameters When Item Parameters Are Known: Standard Errors • The approximate, asymptotic standard error of the ML estimate of θi is SE(θi) ≈ 1/√I(θi) • where I(θi) is the information function: I(θi) = Σj (P′ij)² / [Pij(1 − Pij)], with P′ij = ∂Pij/∂θi • Standard errors are • asymptotic with respect to the number of items • approximate because only an estimate of θi is employed • asymptotically approximately unbiased
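
A small continuation of the earlier sketch: for the Rasch model, P′ij = Pij(1 − Pij), so the general information formula reduces to I(θi) = Σj Pij(1 − Pij) and the standard error follows directly. The item difficulties and responses reused below are the hypothetical ones from the earlier sketch.

```python
def se_theta(theta, b):
    """Approximate SE of the ML estimate: 1 / sqrt(I(theta)), Rasch information."""
    info = sum(p_rasch(theta, bj) * (1 - p_rasch(theta, bj)) for bj in b)
    return 1.0 / math.sqrt(info)

theta_hat = ml_theta(x, b)
print(se_theta(theta_hat, b))   # evaluated at the estimate, hence "approximate"
```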

  35. ML Estimation of Person Parameters When Item Parameters Are Known: Strengths • ML estimates have some desirable qualities • They are consistent • If a sufficient statistic exists, then the MLE is a function of that statistic (Rasch models) • Asymptotically normally distributed • Asymptotically most efficient (least variable) estimator among the class of normally distributed unbiased estimators • Asymptotically with respect to what?

  36. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses • ML estimates have some undesirable qualities • Estimates may fly off into outer space • They do not exist for so called “perfect scores” (all 1’s or 0’s) • Can be difficult to compute or verify when the likelihood function is not single peaked (may occur with 3-PLM or more complex IRT models)

  37. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses • Strategies to handle wayward solutions • Bound the amount of change at any one iteration • Atheoretical • No longer common • Use an alternative estimation framework (Fisher, Bayesian) • Strategies to handle perfect scores • Do not estimate θi • Use an alternative estimation framework (Bayesian) • Strategies to handle local maxima • Re-estimate the parameters using different starting points and look for agreement

  38. ML Estimation of Person Parameters When Item Parameters Are Known: Weaknesses • An alternative to the Newton-Raphson technique is Fisher’s method of scoring • Instead of the observed Hessian, it uses the information matrix (the negative of the expected Hessian) • This usually leads to quicker convergence • Often is more stable than Newton-Raphson • But what about those perfect scores?

  39. Bayes’ Theorem • We can avoid some of the problems that occur in ML estimation by employing a Bayesian approach • All entities treated as random variables • Bayes’ Theorem for random variables A and B: P(A | B) = P(B | A) P(A) / P(B) • P(A | B): posterior distribution of A, given B – “the probability of A, given B” • P(B | A): conditional probability of B, given A • P(A): prior probability of A • P(B): marginal probability of B

  40. Bayes’ Theorem • If A is discrete, the marginal is P(B) = ΣA P(B | A) P(A) • If A is continuous, P(B) = ∫ P(B | A) P(A) dA • Note that P(B | A) = L(A | B)

  41. Bayesian Estimation of Person Parameters: The Posterior • Select a prior distribution for θi, denoted P(θi) • Recall the likelihood function takes on the form P(Xi | θi) • The posterior density of θi given Xi is P(θi | Xi) = P(Xi | θi) P(θi) / P(Xi) • Since P(Xi) is a constant, P(θi | Xi) ∝ P(Xi | θi) P(θi)

  42. Bayesian Estimation of Person Parameters: The Posterior • [Figure: the likelihood, the prior, and the resulting posterior]
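
The figure on this slide can be approximated in code: evaluate the prior, the likelihood, and their product (the unnormalized posterior) over a grid of θ values. The N(0, 1) prior anticipates slide 44; the grid and the reused hypothetical response pattern and item difficulties are assumptions, continuing the earlier sketch.

```python
def log_likelihood(theta, x, b):
    """Rasch log-likelihood of a response pattern x at ability theta."""
    return sum(xj * math.log(p_rasch(theta, bj)) +
               (1 - xj) * math.log(1 - p_rasch(theta, bj))
               for xj, bj in zip(x, b))

grid  = [i / 10 for i in range(-40, 41)]              # theta from -4.0 to 4.0
prior = [math.exp(-0.5 * t**2) for t in grid]         # N(0, 1), up to a constant
like  = [math.exp(log_likelihood(t, x, b)) for t in grid]
post  = [pr * li for pr, li in zip(prior, like)]      # posterior ∝ prior × likelihood
print(max(zip(post, grid))[1])   # grid point with the largest posterior height
```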

  43. Maximum A Posteriori Estimation of Person Parameters • The Maximum A Posteriori (MAP) estimate is the maximum of the posterior density of θi • Computed by maximizing the posterior density, or its log • Find θi such that ∂ln[P(θi | Xi)]/∂θi = 0 • Use Newton-Raphson or Fisher scoring • Max of ln[P(θi | Xi)] occurs at max of ln[P(Xi | θi)] + ln[P(θi)] • This can be thought of as augmenting the likelihood with prior information
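
A minimal sketch of MAP estimation under a N(0, 1) prior (the choice discussed on slide 44): the Newton-Raphson loop from the ML sketch is reused, with the first and second derivatives of the log prior (−θ and −1) added to the corresponding derivatives of ln[L]. The item difficulties and start value are assumptions.

```python
def map_theta(x, b, theta0=0.0, tol=1e-3, max_iter=50):
    """MAP estimate under a N(0, 1) prior: NR on d/d theta [ln L + ln prior] = 0."""
    theta = theta0
    for _ in range(max_iter):
        p = [p_rasch(theta, bj) for bj in b]
        first = sum(xj - pj for xj, pj in zip(x, p)) - theta    # + d ln prior / d theta
        second = -sum(pj * (1 - pj) for pj in p) - 1.0          # + d^2 ln prior / d theta^2
        theta_new = theta - first / second
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(map_theta([1, 1, 1, 1, 1], b))   # defined even for a perfect score
```

Because the prior contributes a strictly negative term to the second derivative and pulls the first derivative back toward 0, the update stays finite even for all-correct or all-incorrect patterns, which is why the MAP exists for every response pattern.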

  44. Choice of Prior Distribution • Choosing P(θi) ~ U(-∞, ∞) yields a posterior proportional to the likelihood • In this case, the MAP coincides with the ML estimate • The prior distribution P(θi) is often assumed to be N(0, 1) • The normal distribution is commonly justified by appeal to the CLT • Choice of mean and variance identifies the scale of the latent continuum

  45. MAP Estimation of Person Parameters: Features • The approximate, asymptotic standard error of the MAP is SE(θi) ≈ 1/√I(θi), where I(θi) is the information from the posterior density • Advantages of the MAP estimator • Exists for every response pattern – why? • Generally leads to a reduced tendency for local extrema • Disadvantages of the MAP estimator • Must specify a prior • Exhibits shrinkage in that it is biased towards the prior mean: may need lots of items to “swamp” the prior if it’s misspecified • Calculations are iterative and may take a long time • May result in local extrema

  46. Expected A Posteriori (EAP) Estimation of Person Parameters • The Expected A Posteriori (EAP) estimator is the mean of the posterior distribution: θi(EAP) = ∫ θi P(θi | Xi) dθi • Exact computation of this integral is often intractable • We approximate the integral using numerical techniques • Essentially, we take a weighted average of values of θi, where the weights are determined by the posterior distribution • Recall that the posterior distribution is itself determined by the prior and the likelihood

  47. Numerical Integration Via Quadrature • The posterior distribution is approximated over a set of quadrature points • Evaluate the heights of the distribution at each point • Use the relative heights as the weights (each height divided by the sum of the heights) • [Figure: example in which the heights sum to ≈ .165, so a point with height .021 receives weight .021/.165 ≈ .127]

  48. EAP Estimation of θ via Quadrature • The Expected A Posteriori (EAP) estimate is a weighted average of the quadrature points: θi(EAP) ≈ Σr Qr H(Qr), where Qr is quadrature point r and H(Qr) is its weight in the posterior (compare Embretson & Reise, 2000, p. 177) • The standard error is the standard deviation of the posterior and may also be approximated via quadrature
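
A sketch of the quadrature recipe from slides 47–48: evaluate prior × likelihood at each point, normalize the heights into weights, and take the weighted average (the EAP) and the weighted standard deviation (its standard error). The equally spaced points between −4 and 4 and the N(0, 1) prior are assumptions; log_likelihood, p_rasch, and the item difficulties come from the earlier sketches.

```python
def eap_theta(x, b, n_points=41):
    """EAP estimate and posterior SD via quadrature, N(0, 1) prior."""
    Q = [-4.0 + 8.0 * r / (n_points - 1) for r in range(n_points)]  # quadrature points
    H = [math.exp(-0.5 * q**2) * math.exp(log_likelihood(q, x, b)) for q in Q]
    total = sum(H)
    W = [h / total for h in H]                      # relative heights as weights
    eap = sum(q * w for q, w in zip(Q, W))          # weighted average of the points
    var = sum((q - eap)**2 * w for q, w in zip(Q, W))
    return eap, math.sqrt(var)                      # posterior SD serves as the SE

print(eap_theta([1, 1, 0, 1, 0], b))   # non-iterative: one pass over the grid
```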

  49. EAP Estimation of θ via Quadrature • Advantages • Exists for all possible response patterns • Non-iterative solution strategy • Not a maximum, therefore no local extrema • Has smallest MSE in the population • Disadvantages • Must specify a prior • Exhibits shrinkage to the prior mean: If the prior is misspecified, may need lots of items to “swamp” the prior

  50. ML Estimation of Item Parameters When Person Parameters Are Known: Assumptions • Assume • person parameters θi are known • respondent and local independence • Choose values for item parameters that maximize ln[L]
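
By symmetry with the person-parameter case, here is a minimal sketch of ML estimation of a single Rasch difficulty bj when the θi are treated as known; the θ values and item responses below are hypothetical, and p_rasch comes from the earlier sketch.

```python
def ml_item_difficulty(x_j, thetas, b0=0.0, tol=1e-3, max_iter=50):
    """NR ML estimate of one Rasch difficulty b_j with person parameters known."""
    b_j = b0
    for _ in range(max_iter):
        p = [p_rasch(t, b_j) for t in thetas]
        first = sum(pi - xi for xi, pi in zip(x_j, p))   # d ln L / d b_j
        second = -sum(pi * (1 - pi) for pi in p)         # d^2 ln L / d b_j^2
        b_new = b_j - first / second
        if abs(b_new - b_j) < tol:
            return b_new
        b_j = b_new
    return b_j

# Hypothetical known abilities and responses to a single item
thetas = [-1.2, -0.4, 0.0, 0.7, 1.5]
x_j = [0, 1, 0, 1, 1]
print(ml_item_difficulty(x_j, thetas))
```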
