
Beyond MARLAP: New Statistical Tests for Method Validation

NAREL – ORIA – US EPA

Laboratory Incident Response Workshop

At the 53rd Annual RRMC


Outline

  • The method validation problem

  • MARLAP’s test

    • And its peculiar features

  • New approach – testing mean squared error (MSE)

  • Two possible tests of MSE

    • Chi-squared test

    • Likelihood ratio test

  • Power comparisons

  • Recommendations and implications for MARLAP


The Problem

  • We’ve prepared spiked samples at one or more activity levels

  • A lab has performed one or more analyses of the samples at each level

  • Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level


MARLAP’s Test

  • In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6

  • Chose a very simple criterion

  • Original criterion was whether every result was within ±3uReq of the target

  • Modified slightly to keep false rejection rate ≤ 5 % in all cases


Equations

  • Acceptance range is TV ± kuReq where

    • TV = target value (true value)

    • uReq = required uncertainty at TV, and

    • k = z_p, the standard normal quantile at p = (1 + (1 − α)^(1/n))/2, chosen so that n unbiased results all fall in the range with probability 1 − α

  • E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03

  • For smaller n we get slightly smaller k
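The multiplier can be reproduced with the Python standard library alone. A minimal sketch, assuming (consistent with the n = 21 example above) that the per-result two-sided coverage is set to (1 − α)^(1/n); the function name `marlap_k` is ours, not MARLAP's:

```python
from statistics import NormalDist

def marlap_k(n: int, alpha: float = 0.05) -> float:
    """Multiplier k for the acceptance range TV +/- k*uReq, chosen so that
    n unbiased results with sigma = uReq all fall inside with prob. 1 - alpha."""
    p_each = (1.0 - alpha) ** (1.0 / n)        # two-sided coverage per result
    return NormalDist().inv_cdf((1.0 + p_each) / 2.0)

print(round(marlap_k(21), 2))  # n = 21 (7 reps at 3 levels) -> 3.03
```

For smaller n, `marlap_k` returns slightly smaller values, matching the bullet above.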


Required Uncertainty

  • The required uncertainty, uReq, is a function of the target value:

    • uReq = uMR when TV ≤ UBGR

    • uReq = φMR × TV when TV > UBGR

  • Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR)

  • And φMR = uMR / UBGR is the corresponding relative method uncertainty
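A sketch of this rule, using the Level D example's numbers from later in the presentation (uMR = 10 pCi/L, UBGR = 100 pCi/L) as illustrative defaults; the function name is ours:

```python
def required_uncertainty(tv: float, u_mr: float = 10.0, ubgr: float = 100.0) -> float:
    """uReq at target value tv: constant u_mr at or below the UBGR,
    proportional (phi_mr * tv) above it."""
    phi_mr = u_mr / ubgr          # relative method uncertainty (0.10 here)
    return u_mr if tv <= ubgr else phi_mr * tv

print(required_uncertainty(50))   # 10.0 -- below the UBGR
print(required_uncertainty(300))  # 30.0 -- 10 % of 300 pCi/L
```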


Alternatives

  • We considered a chi-squared (χ2) test as an alternative in 2003

  • Accounted for uncertainty of target values using “effective degrees of freedom”

  • Rejected at the time because of complexity and lack of evidence for performance

  • Kept the simple test that now appears in MARLAP Chapter 6

  • But we didn’t forget about the χ2 test


Peculiarity of MARLAP’s Test

  • Power to reject a biased but precise method decreases with number of analyses performed (n)

  • Because we adjusted the acceptance limits to keep false rejection rates low

  • Acceptance range gets wider as n gets larger


Biased but Precise

[Graphic: the “Consistency” demotivational poster, borrowed and edited for the RRMC workshop presentation; view the original at http://www.despair.com/consistency.html]


Best Use of Data?

  • It isn’t just about bias

  • MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose)

  • The statistic – in effect – is just the worst normalized deviation from the target value

  • Wastes a lot of useful information


Example: The MARLAP Test

  • Suppose we perform a level D method validation experiment

    • UBGR = AL = 100 pCi/L

    • uMR = 10 pCi/L

    • φMR = 10/100 = 0.10, or 10 %

  • Three activity levels (L = 3)

    • 50 pCi/L, 100 pCi/L, and 300 pCi/L

  • Seven replicates per level (N = 7)

  • Allow 5 % false rejections (α = 0.05)


Example (continued)

  • For 21 measurements, calculate k = z0.99878 = 3.03 ≈ 3.0

  • When evaluating measurement results for target value TV, require for each result Xj: |Xj − TV| ≤ 3.0 uReq

  • Equivalently, require |Zj| ≤ 3.0, where Zj = (Xj − TV)/uReq


Example (continued)

  • We’ll work through calculations at just one target value

  • Say TV = 300 pCi/L

  • This value is greater than UBGR (100 pCi/L)

  • So, the required uncertainty is 10 % of 300 pCi/L

    • uReq = 30 pCi/L


Example (continued)

  • Suppose the lab produces the 7 results Xj shown on the slide

  • For each result, calculate the “Z score”: Zj = (Xj − TV)/uReq

  • We require |Zj| ≤ 3.0 for each j
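The criterion can be sketched as follows. The seven results here are hypothetical stand-ins (the slide's actual values are not reproduced in this transcript), chosen to read roughly 16 % low and yet still pass, illustrating the slide's point:

```python
def passes_marlap_test(results, tv, u_req, k=3.0):
    """MARLAP's simple criterion: every Z score within +/- k."""
    z_scores = [(x - tv) / u_req for x in results]
    return all(abs(z) <= k for z in z_scores), z_scores

# Hypothetical results at TV = 300 pCi/L, uReq = 30 pCi/L, reading ~16 % low:
data = [252, 247, 260, 244, 256, 250, 254]
ok, z = passes_marlap_test(data, tv=300, u_req=30)
print(ok)  # True -- a clearly biased method still passes
```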


Example (continued)

  • Every |Zj| is less than 3.0

  • The method is obviously biased (~15 % low)

  • But it passes the MARLAP test


2007

  • In early 2007 we were developing the new method validation guide

  • Applying MARLAP guidance, including the simple test of Chapter 6

  • Someone suggested presenting power curves in the context of bias

  • Time had come to reconsider MARLAP’s simple test


Bias and Imprecision

  • Which is worse: bias or imprecision?

  • Either leads to inaccuracy

  • Both are tolerable if not too large

  • When we talk about uncertainty (à la GUM), we don’t distinguish between the two


Mean Squared Error

  • When characterizing a method, we often consider bias and imprecision separately

  • Uncertainty estimates combine them

  • There is a concept in statistics that also combines them: mean squared error


Definition of MSE

  • If X is an estimator for a parameter θ, the mean squared error of X is

    • MSE(X) = E((X − θ)2) by definition

  • It also equals

    • MSE(X) = V(X) + Bias(X)2 = σ2 + δ2

  • If X is unbiased, MSE(X) = V(X) = σ2

  • We tend to think in terms of the root MSE, which is the square root of MSE
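The decomposition can be checked numerically: for any finite data set, the mean squared deviation from θ equals the population variance plus the squared bias of the sample mean, exactly. A small sketch (the data and function name are ours):

```python
def mse_decomposition(xs, theta):
    """Empirical check of MSE = variance + bias^2 (an exact identity
    for the mean squared deviation of any finite data set)."""
    n = len(xs)
    mean = sum(xs) / n
    mse = sum((x - theta) ** 2 for x in xs) / n      # mean squared deviation
    var = sum((x - mean) ** 2 for x in xs) / n       # (population) variance
    bias2 = (mean - theta) ** 2                      # squared bias
    return mse, var + bias2

mse, var_plus_bias2 = mse_decomposition([252, 247, 260, 244, 256, 250, 254], 300)
print(abs(mse - var_plus_bias2) < 1e-9)  # True
```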


New Approach

  • For the method validation guide we chose a new conceptual approach

    A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level

  • We don’t care whether the MSE is dominated by bias or imprecision


Root MSE v. Standard Uncertainty

  • Are root MSE and standard uncertainty really the same thing?

  • Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related

  • We think our approach – testing uncertainty by testing MSE – is reasonable


Chi-squared Test Revisited

  • For the new method validation document we simplified the χ2 test proposed (and rejected) in 2003

    • Ignore uncertainties of target values, which should be small

    • Just use a straightforward χ2 test

  • Presented as an alternative in App. E

    • But the document still uses MARLAP’s simple test


The Two Hypotheses

  • We’re now explicitly testing the MSE

  • Null hypothesis (H0): MSE ≤ uReq2

  • Alternative hypothesis (H1): MSE > uReq2

  • In MARLAP the two hypotheses were not clearly stated

  • Assumed any bias (δ) would be small

  • We were mainly testing variance (σ2)


A χ2 Test for Variance

  • Imagine we really tested variance only

  • H0: σ2 ≤ uReq2

  • H1: σ2 > uReq2

  • We could calculate a χ2 statistic: χ2 = Σj (Xj − X̄)2 / uReq2

  • Chi-squared with N − 1 degrees of freedom

  • Presumes there may be bias but doesn’t test for it


MLE for Variance

  • The maximum-likelihood estimator (MLE) for σ2 when the mean is unknown is: σ̂2 = (1/N) Σj (Xj − X̄)2

  • Notice similarity to χ2 from preceding slide


Another χ2 Test for Variance

  • We could calculate a different χ2 statistic: χ2 = Σj (Xj − TV)2 / uReq2

  • N degrees of freedom

  • Can be used to test variance if there is no bias

  • Any bias increases the rejection rate


MLE for MSE

  • The MLE for the MSE is: (1/N) Σj (Xj − TV)2

  • Notice similarity to χ2 from preceding slide

  • In the context of biased measurements, χ2 seems to assess MSE rather than variance


Our Proposed χ2 Test for MSE

  • For a given activity level (TV), calculate a χ2 statistic W: W = Σj (Xj − TV)2 / uReq2

  • Calculate the critical value of W as the 1 − α quantile: wC = χ2p(N) with p = 1 − α

  • N = number of replicate measurements

  • α = max false rejection rate at this level


Multiple Activity Levels

  • When testing at more than one activity level, calculate the critical value as wC = χ2p(N) with p = (1 − α)^(1/L)

  • Where L is the number of levels and N is the number of measurements at each level

  • Now α is the maximum overall false rejection rate
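The multi-level critical value can be sketched as follows, assuming the per-level quantile is raised to (1 − α)^(1/L) so that L independent levels give an overall false rejection rate of α; this reproduces the wC ≈ 17.1 used later in the example:

```python
from scipy.stats import chi2

def w_critical(alpha: float, n: int, levels: int) -> float:
    """Critical value for W when testing `levels` activity levels with n
    replicates each: the per-level quantile (1 - alpha)^(1/levels) keeps
    the overall false rejection rate at alpha."""
    return chi2.ppf((1.0 - alpha) ** (1.0 / levels), df=n)

print(round(w_critical(0.05, n=7, levels=3), 1))  # 17.1 -- the value used below
```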


Evaluation Criteria

  • To perform the test, calculate Wi at each activity level TVi

  • Compare each Wi to wC

  • If Wi > wC for any i, reject the method

  • The method must pass the test at each spike activity level

  • Don’t allow bad performance at one level just because of good performance at another


Lesson Learned

  • Don’t test at too many levels

  • Otherwise you must choose:

    • High false acceptance rate at each level,

    • High overall false rejection rate, or

    • Complicated evaluation criteria

  • Prefer to keep error rates low

  • Need a low level and a high level

  • But probably not more than three levels (L=3)


Better Use of Same Data

  • The χ2 test makes better use of the measurement data than the MARLAP test

  • The statistic W is calculated from all the data at a given level – not just the most extreme value


Caveat

  • The distribution of W is not completely determined by the MSE

  • Depends on how MSE is partitioned into variance and bias components

  • Our test looks like a test of variance

    • As if we know δ = 0 and we’re testing σ2 only

  • But we’re actually using it to test MSE


False Rejections

  • If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0

    • But you’ll never have this situation in practice

  • If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0

    • This is the usual situation

    • Why we can assume the null distribution is χ2

  • Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero

    • This situation is unlikely in practice


To Avoid High Rejection Rates

  • We must have wC ≥ N + 2

    • This will always be true if α < 0.08, even if L = N = 1

  • Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ2

  • Not stated explicitly in App. E, because:

    • We didn’t have a proof at the time

    • Not an issue if you follow the procedure

  • Now we have a proof


Example: Critical Value

  • Suppose L = 3 and N = 7

  • Let α = 0.05

  • Then the critical value for W is wC = χ2p(7) ≈ 17.1, where p = (1 − α)^(1/L) ≈ 0.983

  • Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates

  • Since α < 0.08, we didn’t really have to check


Some Facts about the Power

  • The power always increases with |δ|

  • The power increases with σ if

    or if

  • For a given bias δ with , there is a positive value of σ that minimizes the power

  • If , even this minimum power exceeds 50 %

  • Power increases with N


Power Comparisons

  • We compared the tests for power

    • Power to reject a biased method

    • Power to reject an imprecise method

  • The χ2 test outperforms the simple MARLAP test on both counts

  • Results of comparisons at end of this presentation


False Rejection Rates

[Diagram: H0 and H1 regions of the (δ, σ) parameter space, labeled with rejection rates = α, < α, and = 0]


Region of Low Power

[Diagram: H0 and H1 regions, with the region of low power; rejection rate = α on the boundary]


Region of Low Power (MARLAP)

[Diagram: the corresponding region of low power for MARLAP’s test; H0 and H1 regions with rejection rate = α on the boundary]


Example: Applying the χ2 Test

  • Return to the scenario used earlier for the MARLAP example

  • Three levels (L = 3)

  • Seven measurements per level (N = 7)

  • 5 % overall false rejection rate (α = 0.05)

  • Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L


Example (continued)

  • Reuse the data from our earlier example

  • Calculate the χ2 statistic: W = Σj (Xj − TV)2 / uReq2 = 17.4

  • Since W > wC (17.4 > 17.1), the method is rejected

  • We’re using all the data now – not just the worst result


Likelihood Ratio Test for MSE

  • We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods

  • By Danish authors Erik Holst and Poul Thyregod

  • It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing


Likelihood Ratio Tests

  • To test a hypothesis about a parameter θ, such as the MSE

  • First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data

    • Based on the probability mass function or probability density function for the data


Test Statistic

  • Maximize L(θ) over all possible values of θ and again over all values of θ that satisfy the null hypothesis H0

  • Can use the ratio of these two maxima, Λ = maxH0 L(θ) / maxθ L(θ), as a test statistic

  • The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE


Critical Values

  • It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both

  • They used numerical integration to approximate percentiles of λ, which serve as critical values


Equations

  • For the two-sided test statistic, λ, the equations involve the unique real root of a cubic polynomial

    • See Holst & Thyregod for details


One-Sided Test

  • We actually need the one-sided test statistic, λ*


Issues

  • The distribution of either λ or λ* is not completely determined by the MSE

  • Under H0 with MSE = uReq2, the percentiles λ1−α and λ*1−α are maximized as σ → 0 and |δ| → uReq

  • To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value

  • Apparently we improved on the authors’ method of calculating this maximum


Distribution Function for λ*

  • To calculate max values of the percentiles, use the “cdf” for λ*

  • From this equation, obtain percentiles of the null distribution by iteration

  • Select a percentile (e.g., 95th) as a critical value


The Downside

  • More complicated to implement

  • Critical values are not readily available (unlike percentiles of χ2)

    • Unless you can program the equation from the preceding slide in software


Power of the Likelihood Ratio Test

  • More powerful than either the χ2 test or MARLAP’s test for rejecting biased methods

    • Sometimes much more powerful

  • Slightly less powerful at rejecting unbiased but imprecise methods

    • Not so much worse that we wouldn’t consider it a reasonable alternative


Power Comparisons

  • Same scenario as before:

    • Level D method validation

    • 3 activity levels: AL/2, AL, 3×AL

    • 7 replicate measurements per level

    • φMR is 0.10, or 10 %

  • Constant relative bias at all levels

  • Assume ratio σ/uReq constant at all levels






Power Contours

  • Same assumptions as before (Level D method validation, etc.)

  • Contour plots show power as a function of both δ and σ

  • Horizontal coordinate is bias (δ) at the action level

  • Vertical coordinate is the standard deviation (σ)

  • Power is shown as color


Power

[Contour plot: power of MARLAP’s test; bias axis marked at −uReq, uReq, and 2uReq]


Power

[Contour plot: power of the chi-squared test; bias axis marked at −uReq, uReq, and 2uReq]


Power

[Contour plot: power of the likelihood ratio test; bias axis marked at −uReq, uReq, and 2uReq]


Recommendations

  • You can still use the MARLAP test

  • We prefer the χ2 test of App. E.

    • It’s simple

    • Critical values are widely available (percentiles of χ2)

    • It outperforms the MARLAP test

  • The likelihood ratio test is a possibility, but

    • It is somewhat complicated

    • Our guide doesn’t give you enough information to implement it


Implications for MARLAP

  • The χ2 test for MSE will likely be included in revision 1 of MARLAP

  • So will the likelihood ratio test

    • Or maybe a variant of it

  • One or both of these will probably become the recommended test for evaluating a candidate method



Power Calculations – MARLAP

  • For MARLAP’s test, the probability of rejecting a method at a given activity level is 1 − [Φ((kuReq − δ)/σ) − Φ((−kuReq − δ)/σ)]^N

  • Where σ is the method’s standard deviation at that level, δ is the method’s bias, and k is the multiplier calculated earlier (k ≈ 3)

  • Φ(z) is the cdf for the standard normal distribution
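A sketch of this calculation, using the formula as reconstructed above (per-result acceptance probability Φ((kuReq − δ)/σ) − Φ((−kuReq − δ)/σ), raised to the Nth power); as a consistency check, an unbiased method with σ = uReq should be falsely rejected about 5 % of the time overall:

```python
from statistics import NormalDist

_phi = NormalDist().cdf   # standard normal cdf

def marlap_reject_prob(delta, sigma, u_req, n_reps, k=3.03):
    """P(at least one of n_reps results falls outside TV +/- k*uReq)
    for a method with bias delta and standard deviation sigma."""
    p_accept_one = _phi((k * u_req - delta) / sigma) - _phi((-k * u_req - delta) / sigma)
    return 1.0 - p_accept_one ** n_reps

# Unbiased, sigma = uReq, 7 reps at each of 3 levels: overall false
# rejection rate should land near the design alpha = 0.05.
p_level = marlap_reject_prob(0.0, 30.0, 30.0, 7)
overall = 1.0 - (1.0 - p_level) ** 3
print(round(overall, 2))  # 0.05
```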


Power Calculations (continued)

  • The same probability is calculated by 1 − [F(k2uReq2/σ2)]^N, where F is the cdf for the non-central χ2 distribution

    • In this case, with ν = 1 degree of freedom and non-centrality parameter λ = δ2 / σ2


Power Calculations – χ2

  • For the new χ2 test, the probability of rejecting a method is 1 − F(wC uReq2/σ2), where F is the non-central χ2 cdf with ν = N degrees of freedom and non-centrality parameter λ = Nδ2/σ2

  • Where again σ is the method’s standard deviation at that level and δ is the method’s bias
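A sketch using scipy's non-central χ² distribution; the argument scaling follows from our reading of the slide, namely that W·(σ/uReq)⁻² is non-central χ² with N degrees of freedom and non-centrality N(δ/σ)²:

```python
from scipy.stats import chi2, ncx2

def chi2_test_power(delta, sigma, u_req, n=7, levels=3, alpha=0.05):
    """P(W > wC) for a method with bias delta and std. deviation sigma."""
    w_crit = chi2.ppf((1.0 - alpha) ** (1.0 / levels), df=n)
    x = w_crit * (u_req / sigma) ** 2
    nc = n * (delta / sigma) ** 2
    # nc = 0 reduces to the central distribution; handle it explicitly
    cdf = chi2.cdf(x, df=n) if nc == 0 else ncx2.cdf(x, df=n, nc=nc)
    return 1.0 - cdf

# Unbiased with sigma = uReq: per-level rate is 1 - (1 - 0.05)^(1/3)
print(round(chi2_test_power(0.0, 30.0, 30.0), 3))  # 0.017
```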


Non-Central χ2 CDF

  • The cdf for the non-central χ2 distribution is given by F(x; ν, λ) = Σk e^(−λ/2) (λ/2)^k / k! · P(ν/2 + k, x/2)

  • Where P(∙,∙) denotes the incomplete gamma function

  • You can find algorithms for P(∙,∙), e.g., in books like Numerical Recipes


Solving the Cubic

  • The cubic equation can be solved in closed form for its unique real root (e.g., by Cardano’s formula)


Variations

  • Another possibility: Use Holst & Thyregod’s methodology to derive a likelihood ratio test for H0: MSE ≤ k2uReq2 versus H1: MSE > k2uReq2

  • There are a couple of new issues to deal with when k > 1

  • But the same approach mostly works

  • Only recently considered – not fully explored yet

