
### Beyond MARLAP: New Statistical Tests for Method Validation

NAREL – ORIA – US EPA

Laboratory Incident Response Workshop

At the 53rd Annual RRMC

Outline

- The method validation problem
- MARLAP’s test
- And its peculiar features

- New approach – testing mean squared error (MSE)
- Two possible tests of MSE
- Chi-squared test
- Likelihood ratio test

- Power comparisons
- Recommendations and implications for MARLAP

The Problem

- We’ve prepared spiked samples at one or more activity levels
- A lab has performed one or more analyses of the samples at each level
- Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level

MARLAP’s Test

- In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6
- Chose a very simple criterion
- Original criterion was whether every result was within ±3uReq of the target
- Modified slightly to keep false rejection rate ≤ 5 % in all cases

Equations

- Acceptance range is TV ± k·uReq, where
- TV = target value (true value)
- uReq = required uncertainty at TV, and
- k = zp with p = (1 + (1 − α)^(1/n))/2, chosen so that all n results fall in the range with probability 1 − α

- E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03
- For smaller n we get slightly smaller k
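The multiplier above can be computed directly from a standard normal quantile. A minimal stdlib-only sketch (the function name is mine; the formula matches the slide's n = 21, α = 0.05 example):

```python
from statistics import NormalDist

def marlap_k(n: int, alpha: float = 0.05) -> float:
    """Multiplier k such that n independent, unbiased, normal results
    all fall within TV +/- k*uReq with probability 1 - alpha."""
    p = (1.0 + (1.0 - alpha) ** (1.0 / n)) / 2.0
    return NormalDist().inv_cdf(p)

print(round(marlap_k(21), 2))   # n = 21 gives k = z_0.99878 ≈ 3.03
```

Note that k grows with n, which is the source of the "peculiar feature" discussed later: more measurements widen the acceptance range.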

Required Uncertainty

- The required uncertainty, uReq, is a function of the target value: uReq = uMR for TV ≤ UBGR, and uReq = φMR·TV for TV > UBGR
- Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR)
- φMR = uMR/UBGR is the corresponding relative method uncertainty
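A minimal sketch of this piecewise definition (function and parameter names are mine; the numbers come from the Level D example later in the deck, UBGR = 100 pCi/L and uMR = 10 pCi/L):

```python
def required_uncertainty(tv: float, ubgr: float, u_mr: float) -> float:
    """uReq is constant (uMR) at or below the UBGR and grows
    proportionally (phi_MR * TV) above it."""
    phi_mr = u_mr / ubgr               # relative method uncertainty
    return u_mr if tv <= ubgr else phi_mr * tv

print(required_uncertainty(50, 100, 10))    # 10 pCi/L
print(required_uncertainty(300, 100, 10))   # 30.0 pCi/L
```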

Alternatives

- We considered a chi-squared (χ2) test as an alternative in 2003
- Accounted for uncertainty of target values using “effective degrees of freedom”
- Rejected at the time because of complexity and lack of evidence for performance
- Kept the simple test that now appears in MARLAP Chapter 6

But we didn’t forget about the χ2 test

Peculiarity of MARLAP’s Test

- Power to reject a biased but precise method decreases with number of analyses performed (n)
- Because we adjusted the acceptance limits to keep false rejection rates low
- Acceptance range gets wider as n gets larger

Biased but Precise

[Image] This graphic was borrowed and edited for the RRMC workshop presentation. View the original at despair.com: http://www.despair.com/consistency.html

Best Use of Data?

- It isn’t just about bias
- MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose)
- The statistic – in effect – is just the worst normalized deviation from the target value
- Wastes a lot of useful information

Example: The MARLAP Test

- Suppose we perform a level D method validation experiment
- UBGR = AL = 100 pCi/L
- uMR = 10 pCi/L
- φMR = 10/100 = 0.10, or 10 %

- Three activity levels (L = 3)
- 50 pCi/L, 100 pCi/L, and 300 pCi/L

- Seven replicates per level (N = 7)
- Allow 5 % false rejections (α = 0.05)

Example (continued)

- For 21 measurements, calculate k = z0.99878 = 3.03 ≈ 3.0
- When evaluating measurement results for target value TV, require for each result Xj: |Xj − TV| ≤ 3.0·uReq
- Equivalently, require |Zj| ≤ 3.0, where Zj = (Xj − TV)/uReq

Example (continued)

- We’ll work through calculations at just one target value
- Say TV = 300 pCi/L
- This value is greater than UBGR (100 pCi/L)
- So, the required uncertainty is 10 % of 300 pCi/L
- uReq = 30 pCi/L

Example (continued)

- Suppose the lab produces 7 results Xj
- For each result, calculate the "Z score" Zj = (Xj − TV)/uReq

- We require |Zj| ≤ 3.0 for each j

Example (continued)

- Every |Zj| is smaller than 3.0
- The method is obviously biased (~15 % low)
- But it passes the MARLAP test
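The pass-despite-bias outcome is easy to reproduce. The seven results below are hypothetical (the slide's data table is not reproduced in this transcript): they average about 15 % below TV = 300 pCi/L, yet every |Z| stays well inside 3.0.

```python
TV, U_REQ, K = 300.0, 30.0, 3.0      # target, required uncertainty, multiplier

# Hypothetical results, biased ~15 % low (mean = 255 pCi/L)
results = [275.0, 240.0, 265.0, 235.0, 260.0, 267.0, 243.0]

z_scores = [(x - TV) / U_REQ for x in results]
passes = all(abs(z) <= K for z in z_scores)

print([round(z, 2) for z in z_scores])
print("passes MARLAP test:", passes)  # True, despite the clear low bias
```

Because the test looks only at individual deviations, a consistent bias of half the acceptance half-width goes completely unnoticed.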

2007

- In early 2007 we were developing the new method validation guide
- Applying MARLAP guidance, including the simple test of Chapter 6
- Someone suggested presenting power curves in the context of bias
- Time had come to reconsider MARLAP’s simple test

Bias and Imprecision

- Which is worse: bias or imprecision?
- Either leads to inaccuracy
- Both are tolerable if not too large
- When we talk about uncertainty (à la GUM), we don’t distinguish between the two

Mean Squared Error

- When characterizing a method, we often consider bias and imprecision separately
- Uncertainty estimates combine them
- There is a concept in statistics that also combines them: mean squared error

Definition of MSE

- If X is an estimator for a parameter θ, the mean squared error of X is
- MSE(X) = E((X − θ)²) by definition

- It also equals
- MSE(X) = V(X) + Bias(X)² = σ² + δ²

- If X is unbiased, MSE(X) = V(X) = σ²
- We tend to think in terms of the root MSE, which is the square root of MSE
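The decomposition MSE = σ² + δ² is an exact algebraic identity for the sample (MLE-style) moments, not just an expectation result. A quick numeric check with hypothetical numbers:

```python
theta = 100.0                              # true value
xs = [93.0, 97.0, 95.0, 91.0, 99.0]        # hypothetical measurements

n = len(xs)
mean = sum(xs) / n
mse  = sum((x - theta) ** 2 for x in xs) / n   # sample E((X - theta)^2)
var  = sum((x - mean) ** 2 for x in xs) / n    # sigma^2 (MLE form)
bias = mean - theta                            # delta

print(mse, var + bias ** 2)                # identical: MSE = sigma^2 + delta^2
```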

New Approach

- For the method validation guide we chose a new conceptual approach:
- A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level

- We don’t care whether the MSE is dominated by bias or imprecision

Root MSE v. Standard Uncertainty

- Are root MSE and standard uncertainty really the same thing?
- Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related
- We think our approach – testing uncertainty by testing MSE – is reasonable

Chi-squared Test Revisited

- For the new method validation document we simplified the χ2 test proposed (and rejected) in 2003
- Ignore uncertainties of target values, which should be small
- Just use a straightforward χ2 test

- Presented as an alternative in App. E
- But the document still uses MARLAP’s simple test

The Two Hypotheses

- We’re now explicitly testing the MSE
- Null hypothesis (H0): MSE ≤ uReq²
- Alternative hypothesis (H1): MSE > uReq²
- In MARLAP the two hypotheses were not clearly stated
- Assumed any bias (δ) would be small
- We were mainly testing variance (σ²)

A χ2 Test for Variance

- Imagine we really tested variance only
- H0: σ² ≤ uReq²
- H1: σ² > uReq²
- We could calculate a χ² statistic: χ² = (1/uReq²) Σj (Xj − X̄)²
- Chi-squared with N − 1 degrees of freedom
- Presumes there may be bias but doesn't test for it

MLE for Variance

- The maximum-likelihood estimator (MLE) for σ² when the mean is unknown is σ̂² = (1/N) Σj (Xj − X̄)²
- Notice the similarity to χ² from the preceding slide

Another χ2 Test for Variance

- We could calculate a different χ² statistic: χ² = (1/uReq²) Σj (Xj − TV)²
- N degrees of freedom
- Can be used to test variance if there is no bias
- Any bias increases the rejection rate

MLE for MSE

- The MLE for the MSE is (1/N) Σj (Xj − TV)²
- Notice the similarity to χ² from the preceding slide
- In the context of biased measurements, χ² seems to assess MSE rather than variance

Our Proposed χ2 Test for MSE

- For a given activity level (TV), calculate a χ² statistic W = (1/uReq²) Σj (Xj − TV)²
- Calculate the critical value of W as the (1 − α) quantile of χ² with N degrees of freedom
- N = number of replicate measurements
- α = max false rejection rate at this level

Multiple Activity Levels

- When testing at more than one activity level, calculate the critical value wC as the (1 − α)^(1/L) quantile of χ² with N degrees of freedom
- Where L is the number of levels and N is the number of measurements at each level
- Now α is the maximum overall false rejection rate
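Percentiles of χ² are widely tabulated, but the critical value can also be computed from scratch. A stdlib-only Python sketch: the incomplete-gamma routine follows the standard series/continued-fraction approach described in Numerical Recipes, and the bisection bounds are my implementation assumptions.

```python
import math

def gammp(a, x):
    """Regularized lower incomplete gamma function P(a, x)."""
    if x <= 0.0:
        return 0.0
    if x < a + 1.0:                    # series representation
        term = total = 1.0 / a
        denom = a
        while abs(term) > abs(total) * 1e-15:
            denom += 1.0
            term *= x / denom
            total += term
        return total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    # continued fraction (modified Lentz) for Q(a, x); then P = 1 - Q
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 300):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return 1.0 - h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_quantile(p, df):
    """Invert the chi-squared cdf P(df/2, x/2) by bisection."""
    lo, hi = 0.0, 1000.0               # assumed bracket, ample for small df
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gammp(df / 2.0, mid / 2.0) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Critical value for L = 3 levels, N = 7 replicates, overall alpha = 0.05
L, N, alpha = 3, 7, 0.05
w_c = chi2_quantile((1.0 - alpha) ** (1.0 / L), N)
print(round(w_c, 1))                   # ≈ 17.1, matching the example later
```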

Evaluation Criteria

- To perform the test, calculate Wi at each activity level TVi
- Compare each Wi to wC
- If Wi > wC for any i, reject the method
- The method must pass the test at each spike activity level
- Don’t allow bad performance at one level just because of good performance at another

Lesson Learned

- Don’t test at too many levels
- Otherwise you must choose:
- High false acceptance rate at each level,
- High overall false rejection rate, or
- Complicated evaluation criteria

- Prefer to keep error rates low
- Need a low level and a high level
- But probably not more than three levels (L=3)

Better Use of Same Data

- The χ2 test makes better use of the measurement data than the MARLAP test
- The statistic W is calculated from all the data at a given level – not just the most extreme value

Caveat

- The distribution of W is not completely determined by the MSE
- Depends on how MSE is partitioned into variance and bias components
- Our test looks like a test of variance
- As if we know δ = 0 and we’re testing σ2 only

- But we’re actually using it to test MSE

False Rejections

- If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0
- But you'll never have this situation in practice

- If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0
- This is the usual situation
- Why we can assume the null distribution is χ²

- Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero
- This situation is unlikely in practice

To Avoid High Rejection Rates

- We must have wC ≥ N + 2
- This will always be true if α < 0.08, even if L = N = 1

- Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ2
- Not stated explicitly in App. E, because:
- We didn’t have a proof at the time
- Not an issue if you follow the procedure

- Now we have a proof

Example: Critical Value

- Suppose L = 3 and N = 7
- Let α = 0.05
- Then the critical value for W is the 0.95^(1/3) ≈ 0.983 quantile of χ² with 7 degrees of freedom: wC ≈ 17.1
- Since wC ≥ N + 2 = 9, we won't have unexpectedly high false rejection rates

Since α < 0.08, we didn’t really have to check

Some Facts about the Power

- The power always increases with |δ|
- The power increases with σ in some cases, but not in all

- For a given bias δ (if |δ| is not too large), there is a positive value of σ that minimizes the power
- If |δ| is large enough, even this minimum power exceeds 50 %
- Power increases with N

Power Comparisons

- We compared the tests for power
- Power to reject a biased method
- Power to reject an imprecise method

- The χ2 test outperforms the simple MARLAP test on both counts
- Results of comparisons at end of this presentation

Example: Applying the χ2 Test

- Return to the scenario used earlier for the MARLAP example
- Three levels (L = 3)
- Seven measurements per level (N = 7)
- 5 % overall false rejection rate (α = 0.05)
- Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L

Example (continued)

- Reuse the data from our earlier example
- Calculate the χ² statistic: W = (1/uReq²) Σj (Xj − TV)² = 17.4
- Since W > wC (17.4 > 17.1), the method is rejected

- We’re using all the data now – not just the worst result
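Applying the statistic to the same hypothetical seven-result data set used in the MARLAP sketch above (mean about 15 % low) shows the contrast: every individual Z score is inside ±3, yet W exceeds the critical value of about 17.1 and the biased method is rejected.

```python
TV, U_REQ, W_C = 300.0, 30.0, 17.1   # target, required uncertainty, critical value

# Hypothetical results, biased ~15 % low (same data as the MARLAP sketch)
results = [275.0, 240.0, 265.0, 235.0, 260.0, 267.0, 243.0]

# Chi-squared statistic: sum of squared deviations from TV, scaled by uReq^2
W = sum((x - TV) ** 2 for x in results) / U_REQ ** 2
print(round(W, 1), "rejected:", W > W_C)   # 17.3 rejected: True
```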

Likelihood Ratio Test for MSE

- We also found a statistical test, published in 1999, that directly addresses MSE for analytical methods
- By Danish authors Erik Holst and Poul Thyregod
- It's a "likelihood ratio" test, a common and well-accepted approach to hypothesis testing

Likelihood Ratio Tests

- To test a hypothesis about a parameter θ, such as the MSE
- First find a likelihood function L(θ), which tells how "likely" a value of θ is, given the observed experimental data
- Based on the probability mass function or probability density function for the data

Test Statistic

- Maximize L(θ) over all possible values of θ, and again over only the values of θ that satisfy the null hypothesis H0
- The ratio Λ of these two maxima can be used as a test statistic
- The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE

Critical Values

- It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both
- They used numerical integration to approximate percentiles of λ, which serve as critical values

Equations

- The two-sided test statistic λ is built from the unique real root of a cubic polynomial
- See Holst & Thyregod for the detailed equations

One-Sided Test

- We actually need a one-sided version of the test statistic, λ*

Issues

- The distribution of either λ or λ* is not completely determined by the MSE
- Under H0 with MSE = uReq², the percentiles λ1−α and λ*1−α are maximized as σ → 0 and |δ| → uReq
- To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value
- Apparently we improved on the authors' method of calculating this maximum

Distribution Function for λ*

- To calculate maximum values of the percentiles, we derived a cumulative distribution function ("cdf") for λ*
- From the cdf, obtain percentiles of the null distribution by iteration
- Select a percentile (e.g., 95th) as a critical value

The Downside

- More complicated to implement
- Critical values are not readily available (unlike percentiles of χ2)
- Unless you can program the equation from the preceding slide in software

Power of the Likelihood Ratio Test

- More powerful than either the χ2 test or MARLAP’s test for rejecting biased methods
- Sometimes much more powerful

- Slightly less powerful at rejecting unbiased but imprecise methods
- Not so much worse that we wouldn’t consider it a reasonable alternative

Power Comparisons

- Same scenario as before:
- Level D method validation
- 3 activity levels: AL/2, AL, 3×AL
- 7 replicate measurements per level
- φMR is 0.10, or 10 %

- Constant relative bias at all levels
- Assume ratio σ/uReq constant at all levels

Power Contours

- Same assumptions as before (Level D method validation, etc.)
- Contour plots show power as a function of both δ and σ
- Horizontal coordinate is bias (δ) at the action level
- Vertical coordinate is the standard deviation (σ)
- Power is shown as color

Recommendations

- You can still use the MARLAP test
- We prefer the χ2 test of App. E.
- It’s simple
- Critical values are widely available (percentiles of χ2)
- It outperforms the MARLAP test

- The likelihood ratio test is a possibility, but
- It is somewhat complicated
- Our guide doesn’t give you enough information to implement it

Implications for MARLAP

- The χ2 test for MSE will likely be included in revision 1 of MARLAP
- So will the likelihood ratio test
- Or maybe a variant of it

- One or both of these will probably become the recommended test for evaluating a candidate method

Power Calculations – MARLAP

- For MARLAP's test, the probability of rejecting a method at a given activity level is 1 − [Φ((k·uReq − δ)/σ) − Φ((−k·uReq − δ)/σ)]^N
- Where σ is the method's standard deviation at that level, δ is the method's bias, and k is the multiplier calculated earlier (k ≈ 3)
- Φ(z) is the cdf for the standard normal distribution

Power Calculations (continued)

- The same probability is calculated by 1 − [Fχ²(k²·uReq²/σ²; ν, λ)]^N
- Fχ²(x; ν, λ) is the cdf for the non-central χ² distribution
- In this case, with ν = 1 degree of freedom and non-centrality parameter λ = δ² / σ²

Power Calculations – χ2

- For the new χ² test, the probability of rejecting a method is 1 − Fχ²(wC·uReq²/σ²; N, N·δ²/σ²)
- Where again σ is the method's standard deviation at that level and δ is the method's bias

Non-Central χ2 CDF

- The cdf for the non-central χ² distribution is given by Fχ²(x; ν, λ) = Σj e^(−λ/2) (λ/2)^j / j! · P(ν/2 + j, x/2), summing over j = 0, 1, 2, …
- Where P(∙,∙) denotes the regularized incomplete gamma function
- You can find algorithms for P(∙,∙), e.g., in books like Numerical Recipes
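A sketch of that series in stdlib Python, with the incomplete gamma P(a, x) implemented Numerical Recipes style; the truncation limit on the Poisson-weighted sum and the form of the power expression at the end (reconstructed from the text above) are my assumptions.

```python
import math

def gammp(a, x):
    """Regularized lower incomplete gamma function P(a, x)."""
    if x <= 0.0:
        return 0.0
    if x < a + 1.0:                        # series representation
        term = total = 1.0 / a
        denom = a
        while abs(term) > abs(total) * 1e-15:
            denom += 1.0
            term *= x / denom
            total += term
        return total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    tiny = 1e-300                          # continued fraction for Q; P = 1 - Q
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 300):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return 1.0 - h * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x, nu, lam):
    """Non-central chi-squared cdf: Poisson(lam/2)-weighted sum of
    regularized incomplete gamma terms P(nu/2 + j, x/2)."""
    w = math.exp(-lam / 2.0)               # Poisson weight at j = 0
    total = 0.0
    for j in range(1000):                  # truncation limit: assumption
        total += w * gammp(nu / 2.0 + j, x / 2.0)
        w *= (lam / 2.0) / (j + 1)
        if w < 1e-16 and j >= lam:         # remaining terms negligible
            break
    return total

# Power of the chi-squared test at one level, per the expression above:
# reject when W > wC, where W*uReq^2/sigma^2 is non-central chi-squared
# with nu = N and non-centrality N*delta^2/sigma^2.
N, w_c, u_req = 7, 17.07, 30.0
sigma, delta = 30.0, 0.0                   # on-spec precision, no bias
power = 1.0 - ncx2_cdf(w_c * u_req**2 / sigma**2, N, N * delta**2 / sigma**2)
print(round(power, 3))                     # per-level false rejection rate ≈ 0.017
print(1.0 - ncx2_cdf(w_c * u_req**2 / 15.0**2, N, N * 45.0**2 / 15.0**2))
```

With δ = 0 and σ = uReq the rejection rate per level is about 1 − 0.95^(1/3) ≈ 0.017, as the critical value was designed to give; the second print shows the power climbing for a strongly biased method.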

Solving the Cubic

- To solve the cubic equation, see Holst & Thyregod for details

Variations

- Another possibility: Use Holst & Thyregod's methodology to derive a likelihood ratio test of H0: MSE ≤ k²·uReq² versus H1: MSE > k²·uReq²
- There are a couple of new issues to deal with when k>1
- But the same approach mostly works
- Only recently considered – not fully explored yet
