
Beyond MARLAP: New Statistical Tests for Method Validation


NAREL – ORIA – US EPA Laboratory Incident Response Workshop

At the 53rd Annual RRMC

Outline

  • The method validation problem

  • MARLAP’s test

    • And its peculiar features

  • New approach – testing mean squared error (MSE)

  • Two possible tests of MSE

    • Chi-squared test

    • Likelihood ratio test

  • Power comparisons

  • Recommendations and implications for MARLAP

The Problem

  • We’ve prepared spiked samples at one or more activity levels

  • A lab has performed one or more analyses of the samples at each level

  • Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level

MARLAP's Test

  • In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6

  • Chose a very simple criterion

  • Original criterion was whether every result was within ±3uReq of the target

  • Modified slightly to keep false rejection rate ≤ 5 % in all cases

Equations

  • Acceptance range is TV ± kuReq where

    • TV = target value (true value)

    • uReq = required uncertainty at TV, and

    • k = zp with p = (1 + (1 − α)^(1/n)) / 2, chosen so the overall false rejection rate for n measurements is ≤ α

  • E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03

  • For smaller n we get slightly smaller k
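As a sketch, the multiplier k can be computed with scipy (this assumes the reconstructed formula k = zp with p = (1 + (1 − α)^(1/n))/2, which reproduces the n = 21 example above):

```python
# Sketch of the k calculation for the MARLAP acceptance range TV +/- k*uReq.
# Assumes k = z_p with p = (1 + (1 - alpha)**(1/n)) / 2, reconstructed from
# the n = 21, alpha = 0.05 example on this slide.
from scipy.stats import norm

def marlap_k(n: int, alpha: float = 0.05) -> float:
    """Multiplier k such that the overall false rejection rate is alpha."""
    p = (1.0 + (1.0 - alpha) ** (1.0 / n)) / 2.0
    return norm.ppf(p)

print(round(marlap_k(21), 2))   # ~3.03 for n = 21, alpha = 0.05
```

Note that k shrinks slowly as n decreases, matching the last bullet above.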

Required Uncertainty

  • The required uncertainty, uReq, is a function of the target value:

    • uReq = uMR for TV ≤ UBGR

    • uReq = φMR · TV for TV > UBGR

  • Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR)

  • φMR = uMR / UBGR is the corresponding relative method uncertainty
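A minimal sketch of this required-uncertainty function, using the example values from later in this presentation (uMR = 10 pCi/L, UBGR = 100 pCi/L):

```python
# Minimal sketch of uReq(TV) as described above. The defaults uMR = 10 and
# UBGR = 100 pCi/L are the example values used later in this presentation.
def u_req(tv: float, u_mr: float = 10.0, ubgr: float = 100.0) -> float:
    """Required uncertainty at target value tv."""
    phi_mr = u_mr / ubgr            # relative method uncertainty
    return u_mr if tv <= ubgr else phi_mr * tv

print(u_req(50.0))    # 10.0 (below the UBGR)
print(u_req(300.0))   # 30.0 (10 % of 300)
```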

Alternatives

  • We considered a chi-squared (χ2) test as an alternative in 2003

  • Accounted for uncertainty of target values using “effective degrees of freedom”

  • Rejected at the time because of complexity and lack of evidence for performance

  • Kept the simple test that now appears in MARLAP Chapter 6

But we didn’t forget about the χ2 test

Peculiarity of MARLAP’s Test

  • Power to reject a biased but precise method decreases with number of analyses performed (n)

  • Because we adjusted the acceptance limits to keep false rejection rates low

  • Acceptance range gets wider as n gets larger

Biased but Precise

(Graphic not reproduced: an image illustrating a biased but precise method, borrowed and edited for the RRMC workshop presentation.)

Best Use of Data?

  • It isn’t just about bias

  • MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose)

  • The statistic – in effect – is just the worst normalized deviation from the target value

  • Wastes a lot of useful information

Example: The MARLAP Test

  • Suppose we perform a level D method validation experiment

    • UBGR = AL = 100 pCi/L

    • uMR = 10 pCi/L

    • φMR = 10/100 = 0.10, or 10 %

  • Three activity levels (L = 3)

    • 50 pCi/L, 100 pCi/L, and 300 pCi/L

  • Seven replicates per level (N = 7)

  • Allow 5 % false rejections (α = 0.05)

Example (continued)

  • For 21 measurements, calculate k = z0.99878 = 3.03 ≈ 3.0

  • When evaluating measurement results for target value TV, require for each result Xj: TV − k·uReq ≤ Xj ≤ TV + k·uReq

  • Equivalently, require |Xj − TV| / uReq ≤ k
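The acceptance check above can be sketched in a few lines; the data values below are hypothetical, not from the presentation:

```python
# Sketch of the MARLAP acceptance check just described. The data values
# are hypothetical, chosen to mirror the biased-but-precise example that
# follows in this presentation.
def marlap_test(results, tv, u_req, k=3.0):
    """Pass iff every normalized deviation |X - TV| / uReq is within k."""
    return all(abs(x - tv) / u_req <= k for x in results)

# A method biased ~15 % low but very precise still passes:
biased = [255.0, 250.0, 260.0, 252.0, 258.0, 248.0, 262.0]
print(marlap_test(biased, tv=300.0, u_req=30.0))   # True

# A single wild result fails the test:
print(marlap_test([400.0], tv=300.0, u_req=30.0))  # False
```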

Example (continued)

  • We’ll work through calculations at just one target value

  • Say TV = 300 pCi/L

  • This value is greater than UBGR (100 pCi/L)

  • So, the required uncertainty is 10 % of 300 pCi/L

    • uReq = 30 pCi/L

Example (continued)

  • Suppose the lab produces 7 results Xj (the table of values is not reproduced in this transcript)

  • For each result, calculate the "Z score": Zj = (Xj − TV) / uReq

  • We require |Zj| ≤ 3.0 for each j

Example (continued)

  • Every |Zj| is smaller than 3.0

  • The method is obviously biased (~15 % low)

  • But it passes the MARLAP test


  • In early 2007 we were developing the new method validation guide

  • Applying MARLAP guidance, including the simple test of Chapter 6

  • Someone suggested presenting power curves in the context of bias

  • Time had come to reconsider MARLAP’s simple test

Bias and Imprecision

  • Which is worse: bias or imprecision?

  • Either leads to inaccuracy

  • Both are tolerable if not too large

  • When we talk about uncertainty (à la GUM), we don’t distinguish between the two

Mean Squared Error

  • When characterizing a method, we often consider bias and imprecision separately

  • Uncertainty estimates combine them

  • There is a concept in statistics that also combines them: mean squared error

Definition of MSE

  • If X is an estimator for a parameter θ, the mean squared error of X is

    • MSE(X) = E[(X − θ)²] by definition

  • It also equals

    • MSE(X) = V(X) + Bias(X)² = σ² + δ²

  • If X is unbiased, MSE(X) = V(X) = σ²

  • We tend to think in terms of the root MSE, which is the square root of MSE
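The identity MSE = variance + bias² is easy to verify numerically; the measurement values below are hypothetical:

```python
# Numeric check of the identity MSE = variance + bias^2 described above,
# using a small hypothetical set of measurements of a known value theta.
import numpy as np

theta = 100.0
x = np.array([97.0, 99.0, 101.0, 98.0, 100.0])

mse = np.mean((x - theta) ** 2)    # E[(X - theta)^2]
var = np.var(x)                    # population variance (ddof=0)
bias = np.mean(x) - theta

print(np.isclose(mse, var + bias ** 2))  # True
```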

New Approach

  • For the method validation guide we chose a new conceptual approach

    A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level

  • We don’t care whether the MSE is dominated by bias or imprecision

Root MSE v. Standard Uncertainty

  • Are root MSE and standard uncertainty really the same thing?

  • Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related

  • We think our approach – testing uncertainty by testing MSE – is reasonable

Chi-squared Test Revisited

  • For the new method validation document we simplified the χ2 test proposed (and rejected) in 2003

    • Ignore uncertainties of target values, which should be small

    • Just use a straightforward χ2 test

  • Presented as an alternative in App. E

    • But the document still uses MARLAP’s simple test

The Two Hypotheses

  • We’re now explicitly testing the MSE

  • Null hypothesis (H0): MSE ≤ uReq² (i.e., δ² + σ² ≤ uReq²)

  • Alternative hypothesis (H1): MSE > uReq²

  • In MARLAP the 2 hypotheses were not clearly stated

  • Assumed any bias (δ) would be small

  • We were mainly testing variance (σ2)

A χ2 Test for Variance

  • Imagine we really tested variance only

  • H0: σ² ≤ uReq²

  • H1: σ² > uReq²

  • We could calculate a χ2 statistic: χ² = Σⱼ (Xj − X̄)² / uReq²

  • Chi-squared with N − 1 degrees of freedom

  • Presumes there may be bias but doesn’t test for it

MLE for Variance

  • The maximum-likelihood estimator (MLE) for σ2 when the mean is unknown is σ̂² = (1/N) Σⱼ (Xj − X̄)²

  • Notice similarity to χ2 from preceding slide
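As a quick sketch, numpy's population variance (ddof=0) is exactly this MLE; the data values are hypothetical:

```python
# Quick check that numpy's population variance (ddof=0) matches the MLE
# formula above; the data values are hypothetical.
import numpy as np

x = np.array([97.0, 99.0, 101.0, 98.0, 100.0])
mle = np.sum((x - np.mean(x)) ** 2) / len(x)
print(np.isclose(mle, np.var(x)))   # True
```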

Another χ2 Test for Variance

  • We could calculate a different χ2 statistic: χ² = Σⱼ (Xj − TV)² / uReq²

  • N degrees of freedom

  • Can be used to test variance if there is no bias

  • Any bias increases the rejection rate

MLE for MSE

  • The MLE for the MSE is (1/N) Σⱼ (Xj − TV)²

  • Notice similarity to χ2 from preceding slide

  • In the context of biased measurements, χ2 seems to assess MSE rather than variance

Our Proposed χ2 Test for MSE

  • For a given activity level (TV), calculate a χ2 statistic W:

    W = Σⱼ (Xj − TV)² / uReq²

  • Calculate the critical value of W as the 100(1 − α)th percentile of χ2 with N degrees of freedom:

    wC = χ²1−α(N)

  • N = number of replicate measurements

  • α = max false rejection rate at this level

Multiple Activity Levels

  • When testing at more than one activity level, calculate the critical value as the χ2 percentile of order (1 − α)^(1/L) with N degrees of freedom:

    wC = χ²p(N), p = (1 − α)^(1/L)

  • Where L is the number of levels and N is the number of measurements at each level

  • Now α is the maximum overall false rejection rate

Evaluation Criteria

  • To perform the test, calculate Wi at each activity level TVi

  • Compare each Wi to wC

  • If Wi > wC for any i, reject the method

  • The method must pass the test at each spike activity level

  • Don’t allow bad performance at one level just because of good performance at another
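The per-level test just described can be sketched with scipy; the data values are hypothetical, chosen to mirror the examples in this presentation:

```python
# Sketch of the proposed chi-squared MSE test at one activity level,
# using scipy's chi-squared quantile for wC. Data values are hypothetical.
from scipy.stats import chi2

def mse_test(results, tv, u_req, n_levels=1, alpha=0.05):
    """Return (W, wC, rejected) for one activity level."""
    n = len(results)
    w = sum((x - tv) ** 2 for x in results) / u_req ** 2
    w_crit = chi2.ppf((1.0 - alpha) ** (1.0 / n_levels), n)
    return w, w_crit, w > w_crit

# Unbiased, precise data passes:
good = [295.0, 305.0, 290.0, 310.0, 300.0, 298.0, 302.0]
print(mse_test(good, tv=300.0, u_req=30.0, n_levels=3))

# Strongly biased data is rejected even if precise:
biased = [255.0, 240.0, 260.0, 245.0, 250.0, 235.0, 265.0]
print(mse_test(biased, tv=300.0, u_req=30.0, n_levels=3))
```

With L = 3, N = 7, α = 0.05, the computed wC is about 17.1, matching the worked example later in this presentation.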

Lesson Learned

  • Don’t test at too many levels

  • Otherwise you must choose:

    • High false acceptance rate at each level,

    • High overall false rejection rate, or

    • Complicated evaluation criteria

  • Prefer to keep error rates low

  • Need a low level and a high level

  • But probably not more than three levels (L=3)

Better Use of Same Data

  • The χ2 test makes better use of the measurement data than the MARLAP test

  • The statistic W is calculated from all the data at a given level – not just the most extreme value

Caveat

  • The distribution of W is not completely determined by the MSE

  • Depends on how MSE is partitioned into variance and bias components

  • Our test looks like a test of variance

    • As if we know δ = 0 and we’re testing σ2 only

  • But we’re actually using it to test MSE

False Rejections

  • If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0

    • But you'll never have this situation in practice

  • If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0

    • This is the usual situation

    • Why we can assume the null distribution is χ2

  • Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero

    • This situation is unlikely in practice

To Avoid High Rejection Rates

  • We must have wC ≥ N + 2

    • This will always be true if α < 0.08, even if L = N = 1

  • Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ2

  • Not stated explicitly in App. E, because:

    • We didn’t have a proof at the time

    • Not an issue if you follow the procedure

  • Now we have a proof

Example: Critical Value

  • Suppose L = 3 and N = 7

  • Let α = 0.05

  • Then the critical value for W is the χ2 percentile of order (1 − 0.05)^(1/3) ≈ 0.983 with 7 degrees of freedom: wC ≈ 17.1

  • Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates

Since α < 0.08, we didn’t really have to check

Some Facts about the Power

  • The power always increases with |δ|

  • The power also increases with σ under certain conditions (the precise conditions were equations on the original slide and are not reproduced in this transcript)

  • For a given bias δ, there can be a positive value of σ that minimizes the power

  • In some cases even this minimum power exceeds 50 %

  • Power increases with N

Power Comparisons

  • We compared the tests for power

    • Power to reject a biased method

    • Power to reject an imprecise method

  • The χ2 test outperforms the simple MARLAP test on both counts

  • Results of comparisons at end of this presentation

False Rejection Rates

(Diagram not reproduced: regions of the (δ, σ) plane where the rejection rate equals α, is less than α, or equals 0.)

Region of Low Power

(Diagram not reproduced: the region where the rejection rate equals α is marked.)

Region of Low Power (MARLAP)

(Diagram not reproduced: the region where the rejection rate equals α is marked.)


Example: Applying the χ2 Test

  • Return to the scenario used earlier for the MARLAP example

  • Three levels (L = 3)

  • Seven measurements per level (N = 7)

  • 5 % overall false rejection rate (α = 0.05)

  • Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L

Example (continued)

  • Reuse the data from our earlier example

  • Calculate the χ2 statistic: W = Σⱼ (Xj − 300)² / 30² = 17.4

  • Since W > wC (17.4 > 17.1), the method is rejected

  • We’re using all the data now – not just the worst result

Likelihood Ratio Test for MSE

  • We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods

  • By Danish authors Erik Holst and Poul Thyregod

  • It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing

Likelihood Ratio Tests

  • To test a hypothesis about a parameter θ, such as the MSE

  • First find a likelihood function L(θ), which tells how "likely" a value of θ is, given the observed experimental data

    • Based on the probability mass function or probability density function for the data

Test Statistic

  • Maximize L(θ) over all possible values of θ, and again over only the values of θ that satisfy the null hypothesis H0

  • The ratio of these two maxima, Λ, can be used as a test statistic

  • The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE

Critical Values

  • It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both

  • They used numerical integration to approximate percentiles of λ, which serve as critical values

Equations

  • For the two-sided test statistic, λ: (equations not reproduced in this transcript)

  • Where the constrained variance estimate is the unique real root of a cubic polynomial

    • See Holst & Thyregod for details

One-Sided Test

  • We actually need the one-sided test statistic, λ*; its defining equation, and an equivalent form, were given on the original slide and are not reproduced in this transcript

Issues

  • The distribution of either λ or λ* is not completely determined by the MSE

  • Under H0 with MSE = uReq², the percentiles λ1−α and λ*1−α are maximized as σ → 0 and |δ| → uReq

  • To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value

  • Apparently we improved on the authors’ method of calculating this maximum

Distribution Function for λ*

  • To calculate max values of the percentiles, use the following "cdf" for λ*: (equation not reproduced in this transcript)

  • From this equation, obtain percentiles of the null distribution by iteration

  • Select a percentile (e.g., 95th) as a critical value

The Downside

  • More complicated to implement

  • Critical values are not readily available (unlike percentiles of χ2)

    • Unless you can program the equation from the preceding slide in software

Power of the Likelihood Ratio Test

  • More powerful than either the χ2 test or MARLAP’s test for rejecting biased methods

    • Sometimes much more powerful

  • Slightly less powerful at rejecting unbiased but imprecise methods

    • Not so much worse that we wouldn’t consider it a reasonable alternative

Power Comparisons

  • Same scenario as before:

    • Level D method validation

    • 3 activity levels: AL/2, AL, 3×AL

    • 7 replicate measurements per level

    • φMR is 0.10, or 10 %

  • Constant relative bias at all levels

  • Assume ratio σ/uReq constant at all levels

Power Contours

  • Same assumptions as before (Level D method validation, etc.)

  • Contour plots show power as a function of both δ and σ

  • Horizontal coordinate is bias (δ) at the action level

  • Vertical coordinate is the standard deviation (σ)

  • Power is shown as color

Power of MARLAP's Test

(Contour plot not reproduced.)

Power of the Chi-Squared Test

(Contour plot not reproduced.)

Power of the Likelihood Ratio Test

(Contour plot not reproduced.)

Recommendations

  • You can still use the MARLAP test

  • We prefer the χ2 test of App. E.

    • It’s simple

    • Critical values are widely available (percentiles of χ2)

    • It outperforms the MARLAP test

  • The likelihood ratio test is a possibility, but

    • It is somewhat complicated

    • Our guide doesn’t give you enough information to implement it

Implications for MARLAP

  • The χ2 test for MSE will likely be included in revision 1 of MARLAP

  • So will the likelihood ratio test

    • Or maybe a variant of it

  • One or both of these will probably become the recommended test for evaluating a candidate method

Power Calculations – MARLAP

  • For MARLAP's test, the probability of rejecting a method at a given activity level is

    1 − [Φ((k·uReq − δ)/σ) − Φ((−k·uReq − δ)/σ)]^N

  • Where σ is the method's standard deviation at that level, δ is the method's bias, and k is the multiplier calculated earlier (k ≈ 3)

  • Φ(z) is the cdf for the standard normal distribution
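A sketch of this power formula, under the reconstructed assumption that each measurement independently falls inside TV ± k·uReq:

```python
# Sketch of the MARLAP-test power formula described above: each measurement
# independently lands in the acceptance range with probability
# Phi((k*uReq - delta)/sigma) - Phi((-k*uReq - delta)/sigma), so the
# rejection probability over n measurements is 1 minus that to the n-th power.
from scipy.stats import norm

def marlap_power(delta, sigma, u_req, n, alpha=0.05):
    k = norm.ppf((1.0 + (1.0 - alpha) ** (1.0 / n)) / 2.0)
    p_accept = (norm.cdf((k * u_req - delta) / sigma)
                - norm.cdf((-k * u_req - delta) / sigma))
    return 1.0 - p_accept ** n

# At the null boundary (delta = 0, sigma = uReq) the rejection rate is alpha:
print(round(marlap_power(0.0, 30.0, 30.0, 21), 4))   # ~0.05
```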

Power Calculations (continued)

  • The same probability is calculated by the following equation:

    1 − [F(k²·uReq²/σ²)]^N

  • F is the cdf for the non-central χ2 distribution

    • In this case, with ν = 1 degree of freedom and non-centrality parameter λ = δ²/σ²

Power Calculations – χ2

  • For the new χ2 test, the probability of rejecting a method at a given activity level is

    1 − F(wC·uReq²/σ²), where F is the non-central χ2 cdf with ν = N degrees of freedom and non-centrality parameter λ = N·δ²/σ²

  • Where again σ is the method's standard deviation at that level and δ is the method's bias
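This power formula can be sketched with scipy's non-central chi-squared distribution (wC computed as in the earlier example with L = 3, N = 7, α = 0.05):

```python
# Sketch of the chi-squared-test power formula described above, using
# scipy's non-central chi-squared distribution. Defaults match the example
# scenario in this presentation (N = 7, L = 3, alpha = 0.05).
from scipy.stats import chi2, ncx2

def chi2_power(delta, sigma, u_req, n=7, n_levels=3, alpha=0.05):
    w_crit = chi2.ppf((1.0 - alpha) ** (1.0 / n_levels), n)
    nc = n * delta ** 2 / sigma ** 2          # non-centrality parameter
    # sf = 1 - cdf, evaluated for the non-central chi-squared distribution
    return ncx2.sf(w_crit * u_req ** 2 / sigma ** 2, n, nc)

# Power grows with the bias delta (sigma fixed at uReq):
print(chi2_power(30.0, 30.0, 30.0) > chi2_power(1e-4, 30.0, 30.0))  # True
```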

Non-Central χ2 CDF

  • The cdf for the non-central χ2 distribution is given by

    F(x; ν, λ) = Σⱼ₌₀^∞ [e^(−λ/2) (λ/2)ʲ / j!] · P(ν/2 + j, x/2)

  • Where P(∙,∙) denotes the regularized incomplete gamma function

  • You can find algorithms for P(∙,∙), e.g., in books like Numerical Recipes
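As a sketch, the series above can be implemented with scipy's regularized lower incomplete gamma function (`scipy.special.gammainc`) playing the role of P(∙,∙), and checked against scipy's own non-central χ2 cdf:

```python
# Sketch of the series formula above for the non-central chi-squared cdf.
# scipy.special.gammainc is the regularized lower incomplete gamma P(a, x);
# the result is checked against scipy's own ncx2.cdf.
import math
from scipy.special import gammainc
from scipy.stats import ncx2

def ncx2_cdf(x, nu, lam, terms=100):
    weight = math.exp(-lam / 2.0)   # Poisson(lam/2) probability at j = 0
    total = 0.0
    for j in range(terms):
        total += weight * gammainc(nu / 2.0 + j, x / 2.0)
        weight *= (lam / 2.0) / (j + 1)   # advance Poisson weight to j + 1
    return total

print(abs(ncx2_cdf(10.0, 7, 3.5) - ncx2.cdf(10.0, 7, 3.5)) < 1e-9)  # True
```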

Solving the Cubic

  • To solve the cubic equation for the constrained variance estimate, use the standard closed-form solution for cubic polynomials (the equations on the original slide are not reproduced in this transcript)

Variations

  • Another possibility: Use Holst & Thyregod's methodology to derive a likelihood ratio test for H0: MSE ≤ k²·uReq² versus H1: MSE > k²·uReq²

  • There are a couple of new issues to deal with when k>1

  • But the same approach mostly works

  • Only recently considered – not fully explored yet