
Assessing agreement for diagnostic devices


FDA/Industry Statistics Workshop

September 28-29, 2006

Bipasa Biswas

Mathematical Statistician, Division of Biostatistics

Office of Surveillance and Biometrics

Center for Devices and Radiological Health, FDA

No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred

- Accuracy measures for diagnostic tests with a dichotomous outcome. Ideal world: tests with a reference standard.
- Two indices to measure accuracy: sensitivity and specificity

- Assessing agreement between two tests in the absence of a reference standard.
- Overall agreement
- Cohen’s Kappa
- McNemar’s test
- Proposed remedy

- Extending agreement to tests with more than 2 outcomes.
- Cohen’s Kappa
- Extension to Random Marginal Agreement coefficient (RMAC)
- Should agreement per cell be reported?

- If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), then we can represent the data as:

            True Status
Test        D+      D-
T+          TP      FP
T-          FN      TN

- If the true status of the disease is known, then we can estimate the sensitivity, Se = TP/(TP+FN), and the specificity, Sp = TN/(TN+FP).

- McNemar’s test can be used to test equality of either sensitivity or specificity.

                        True Status
        Disease (D+)                    No Disease (D-)
        Comparator test                 Comparator test
New test    R+      R-      New test    R+      R-
T+          a1      b1      T+          a2      b2
T-          c1      d1      T-          c2      d2

McNemar chi-square:

Check equality of sensitivities of the two tests: $\chi^2 = (|b_1 - c_1| - 1)^2 / (b_1 + c_1)$

Check equality of specificities of the two tests: $\chi^2 = (|c_2 - b_2| - 1)^2 / (c_2 + b_2)$

- Example

                        True Status
        Disease (D+)                    No Disease (D-)
        Comparator test                 Comparator test
New test    R+      R-      New test    R+      R-
T+          80       5      T+          85      20
T-          10       5      T-           5     790

SeT = 85.0% (85/100)    SpT = 88.3% (795/900)

SeR = 90.0% (90/100)    SpR = 90.0% (810/900)

Check equality of sensitivities of the two tests: $\chi^2 = (|5 - 10| - 1)^2 / (5 + 10) = 1.07$

p-value = 0.30

95% CI for SeT - SeR: (-13.5%, 3.5%)

Check equality of specificities of the two tests: $\chi^2 = (|5 - 20| - 1)^2 / (5 + 20) = 7.84$

p-value = 0.005

95% CI for SpT - SpR: (-2.9%, -0.5%)
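The numbers in this example can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the interval is a Wald interval with a continuity correction of 1/n, which is an assumption (the slides do not name the method) that happens to reproduce the reported limits.

```python
import math

def mcnemar_chi2(only_t, only_r):
    """Continuity-corrected McNemar chi-square and two-sided p-value (1 df).

    only_t / only_r: subjects classified correctly by T only / by R only,
    i.e. the two discordant cells.
    """
    chi2 = (abs(only_t - only_r) - 1) ** 2 / (only_t + only_r)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def paired_diff_ci(only_t, only_r, n, z=1.96):
    """Interval for the paired difference (only_t - only_r) / n.

    ASSUMPTION: Wald standard error plus a 1/n continuity correction.
    """
    diff = (only_t - only_r) / n
    se = math.sqrt(only_t + only_r - (only_t - only_r) ** 2 / n) / n
    half_width = z * se + 1 / n
    return diff - half_width, diff + half_width

# Sensitivity, among the 100 D+ subjects: 5 detected by T only, 10 by R only.
chi2_se, p_se = mcnemar_chi2(5, 10)
ci_se = paired_diff_ci(5, 10, 100)

# Specificity, among the 900 D- subjects: 5 negative by T only, 20 by R only.
chi2_sp, p_sp = mcnemar_chi2(5, 20)
ci_sp = paired_diff_ci(5, 20, 900)

print(f"Se: chi2={chi2_se:.2f}, p={p_se:.2f}, CI=({ci_se[0]:.1%}, {ci_se[1]:.1%})")
print(f"Sp: chi2={chi2_sp:.2f}, p={p_sp:.3f}, CI=({ci_sp[0]:.1%}, {ci_sp[1]:.1%})")
```

The same two discordant counts drive both the test and the interval; the concordant cells enter only through n.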

- Note, however, that McNemar’s test only checks for equality: the null hypothesis is equivalence and the alternative hypothesis is a difference. This is not an appropriate formulation when the goal is to show equivalence, because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence.
- The 95% confidence intervals for the differences in sensitivity and specificity give a better idea of how the two tests differ.

- A subject’s true disease status is seldom known with certainty.
- What is the effect on sensitivity and specificity when the comparator test R itself has error?
            Imperfect reference test (comparator test)
New test    R+      R-
T+          a       b
T-          c       d

- Example 1: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) that misses 20% of the diseased subjects but never falsely indicates disease.

        True Status                 Imperfect reference test
        D+      D-                  R+      R-
T+      80      30          T+      64      46
T-      20      70          T-      16      74

Se = 80.0% (80/100)    Se (relative to R) = 80.0% (64/80)

Sp = 70.0% (70/100)    Sp (relative to R) = 61.7% (74/120)

- Example 2: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R that misses 20% of the diseased subjects, but the error in R is related to the error in T.

        True Status                 Imperfect reference test
        D+      D-                  R+      R-
T+      80      30          T+      80      30
T-      20      70          T-       0      90

Se = 80.0% (80/100)    Se (relative to R) = 100.0% (80/80)

Sp = 70.0% (70/100)    Sp (relative to R) = 75.0% (90/120)

- Example 3: Now suppose our test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

        True Status                 Imperfect reference test
        D+      D-                  R+      R-
T+      100      0          T+      90      10
T-        0    100          T-      10      90

Se = 100.0% (100/100)    Se (relative to R) = 90.0% (90/100)

Sp = 100.0% (100/100)    Sp (relative to R) = 90.0% (90/100)
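Examples 1 and 3 can be reproduced by computing the expected cross-table of T against R. The sketch below assumes that the errors of T and R are independent given the true disease status; that assumption is mine, and it does not hold in Example 2, where the errors are related, so Example 2 is not covered. The function name is hypothetical.

```python
def apparent_accuracy(n_pos, n_neg, se_t, sp_t, se_r, sp_r):
    """Expected T-by-R table and the accuracy of T measured against R.

    ASSUMPTION: T and R err independently given the true disease status.
    Returns (a, b, c, d, se_rel, sp_rel) with
    a = T+R+, b = T+R-, c = T-R+, d = T-R-.
    """
    a = n_pos * se_t * se_r + n_neg * (1 - sp_t) * (1 - sp_r)
    b = n_pos * se_t * (1 - se_r) + n_neg * (1 - sp_t) * sp_r
    c = n_pos * (1 - se_t) * se_r + n_neg * sp_t * (1 - sp_r)
    d = n_pos * (1 - se_t) * (1 - se_r) + n_neg * sp_t * sp_r
    return a, b, c, d, a / (a + c), d / (b + d)

# Example 1: T has Se 80%, Sp 70%; R misses 20% of diseased, no false positives.
ex1 = apparent_accuracy(100, 100, se_t=0.8, sp_t=0.7, se_r=0.8, sp_r=1.0)

# Example 3: T is perfect; R has Se 90%, Sp 90%.
ex3 = apparent_accuracy(100, 100, se_t=1.0, sp_t=1.0, se_r=0.9, sp_r=0.9)

for name, (a, b, c, d, se_rel, sp_rel) in (("Example 1", ex1), ("Example 3", ex3)):
    print(f"{name}: a={a:.0f} b={b:.0f} c={c:.0f} d={d:.0f} "
          f"Se(rel R)={se_rel:.1%} Sp(rel R)={sp_rel:.1%}")
```

In both cases the apparent specificity or sensitivity of T changes only because R misclassifies some subjects, not because T itself has changed.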

- Two commonly used overall measures are:
- Overall agreement measure
- Cohen’s Kappa

- McNemar’s Test
- Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).

- The overall percent agreement can be calculated as:
$100\% \times (a+d)/(a+b+c+d)$

- The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives.
- Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (reference: Feinstein and Cicchetti, 1990).
$\mathrm{PPA} = 100\% \times a/(a+c)$

$\mathrm{NPA} = 100\% \times d/(b+d)$

The overall percent agreement is insensitive to off diagonal

[Table: new test (T+, T-) versus imperfect reference test (R+, R-); cell counts not preserved]

The overall percent agreement is 85.0%, yet it does not account for the off-diagonal imbalance: the PPA is 100% while the NPA is only 50%.

        Table 1                             Table 2
        Imperfect reference test            Imperfect reference test
New test    R+      R-          New test    R+      R-
T+           5       5          T+          35       5
T-           5      85          T-           5      55

Overall pct. agreement = 90.0%      Overall pct. agreement = 90.0%

PPA = 50.0% (5/10)                  PPA = 87.5% (35/40)

[95% CI = 18.7%, 81.3%]             [95% CI = 73.2%, 95.8%]

NPA = 94.4% (85/90)                 NPA = 91.7% (55/60)

[95% CI = 87.5%, 98.2%]             [95% CI = 81.6%, 97.2%]
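The agreement measures for Table 1 and Table 2 can be computed as in the sketch below. The helper names are hypothetical. The interval is an exact (Clopper-Pearson) binomial interval found by bisection; that the slides used this method is an assumption, although it is consistent with the limits reported for 5/10.

```python
from math import comb

def agreement(a, b, c, d):
    """Overall agreement, PPA and NPA, with a=T+R+, b=T+R-, c=T-R+, d=T-R-."""
    n = a + b + c + d
    return (a + d) / n, a / (a + c), d / (b + d)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes out of n (ASSUMED method)."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

table1 = (5, 5, 5, 85)    # a, b, c, d
table2 = (35, 5, 5, 55)

overall1, ppa1, npa1 = agreement(*table1)
overall2, ppa2, npa2 = agreement(*table2)
ppa1_ci = exact_ci(5, 10)

print(f"Table 1: overall={overall1:.1%} PPA={ppa1:.1%} NPA={npa1:.1%}")
print(f"Table 2: overall={overall2:.1%} PPA={ppa2:.1%} NPA={npa2:.1%}")
print(f"Table 1 PPA 95% CI: ({ppa1_ci[0]:.1%}, {ppa1_ci[1]:.1%})")
```

Both tables give the same overall agreement, while the PPA differs sharply; the wide interval for 5/10 also shows how little information Table 1 carries about the positives.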

- Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum possible difference; it ranges from -1 to 1.

            Imperfect reference standard
New test    R+      R-
T+          a       b
T-          c       d

- $\kappa = (I_o - I_e)/(1 - I_e)$, where $I_o = (a+d)/n$ and $I_e = ((a+c)(a+b) + (b+d)(c+d))/n^2$

            Imperfect reference standard
New test    R+      R-
T+          35      15
T-          15      35

- $I_o = 70/100 = 0.70$, $I_e = ((50)(50) + (50)(50))/10000 = 0.50$
- $\kappa = (0.70 - 0.50)/(1 - 0.50) = 0.40$
[95% CI = 0.22, 0.58]

- By the way, the overall percent agreement is 70.0%.

[Table: new test (T+, T-) versus imperfect reference test (R+, R-); cell counts not preserved]

Kappa = κ = 0.45 [95% CI = 0.31, 0.59]

Although the overall agreement stayed the same (70%) and the marginal differences are much bigger than before, the kappa index is higher (0.45 versus 0.40), which suggests better agreement.

The kappa statistic is affected by the marginal totals even when the overall agreement is the same.
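The dependence of kappa on the marginal totals can be illustrated with a small calculation. In the sketch below, the balanced table (35, 15, 15, 35) is the one implied by the values of Io and Ie in the first kappa example. The unbalanced table is hypothetical: it was chosen only to have 70% overall agreement with unequal margins and is not the table from the slides, whose counts were not preserved. The standard error is a simple large-sample approximation, which is also an assumption.

```python
import math

def cohen_kappa(a, b, c, d, z=1.96):
    """Cohen's kappa for a 2x2 table (a=T+R+, b=T+R-, c=T-R+, d=T-R-).

    Returns (kappa, (lower, upper)).
    ASSUMPTION: the interval uses the simple approximation
    se = sqrt(Io * (1 - Io) / (n * (1 - Ie)^2)); other formulas exist.
    """
    n = a + b + c + d
    i_o = (a + d) / n
    i_e = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2
    kappa = (i_o - i_e) / (1 - i_e)
    se = math.sqrt(i_o * (1 - i_o) / (n * (1 - i_e) ** 2))
    return kappa, (kappa - z * se, kappa + z * se)

# Balanced margins (all 50), 70% overall agreement.
kappa_bal, ci_bal = cohen_kappa(35, 15, 15, 35)

# HYPOTHETICAL unbalanced table, also 70% overall agreement.
kappa_unbal, _ = cohen_kappa(35, 30, 0, 35)

print(f"balanced:   kappa={kappa_bal:.2f}, CI=({ci_bal[0]:.2f}, {ci_bal[1]:.2f})")
print(f"unbalanced: kappa={kappa_unbal:.2f}")
```

With the observed agreement held at 70%, moving the margins apart lowers the expected agreement Ie and therefore raises kappa.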

- Hypothesis: equality of rates of positive response

            Imperfect reference test
New test    R+      R-
T+          37      30
T-           5      28

McNemar chi-square: $\chi^2 = (|b - c| - 1)^2/(b + c) = (|30 - 5| - 1)^2/(30 + 5) = 16.46$

Two-sided p-value = 0.00005

[Table: new test (T+, T-) versus imperfect reference test (R+, R-); cell counts not preserved]

- Same p-value as when a = 37 and d = 28, even though the new and the old test agree on 99.5% of individual cases.

McNemar’s test (insensitivity to main diagonal)

[Table: new test (T+, T-) versus imperfect reference test (R+, R-); cell counts not preserved]

- Two-sided p-value = 1 even though the old and new tests agree on no cases.
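The insensitivity of McNemar’s test to the main diagonal follows from the fact that only the discordant cells b and c enter the calculation. The sketch below uses the exact (binomial) form of the test rather than the continuity-corrected chi-square shown earlier; that substitution is mine, made because the exact form returns a p-value of exactly 1 when b = c. All cell counts other than b = 30, c = 5, a = 37 and d = 28 are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value; depends only on the discordant cells."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, min(b, c) ~ Binomial(b + c, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

def overall_agreement(a, b, c, d):
    return (a + d) / (a + b + c + d)

# Same discordant cells, very different diagonals.
p_small = mcnemar_exact(30, 5)
agree_small = overall_agreement(37, 30, 5, 28)        # 65% agreement
agree_large = overall_agreement(3500, 30, 5, 3465)    # HYPOTHETICAL: 99.5% agreement

# HYPOTHETICAL table with an empty diagonal and b == c.
p_none = mcnemar_exact(10, 10)
agree_none = overall_agreement(0, 10, 10, 0)

print(f"b=30, c=5: p={p_small:.6f}, agreement {agree_small:.1%} or {agree_large:.1%}")
print(f"b=c=10:    p={p_none:.1f}, agreement {agree_none:.0%}")
```

The p-value for b = 30, c = 5 is identical at 65% and at 99.5% agreement, and a table with no agreement at all still gives p = 1.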

- Instead of reporting overall agreement, kappa, or the McNemar’s test p-value, report both positive percent agreement and negative percent agreement.
- In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.

- For example, in radiology one often compares the standard film mammogram with a digital mammogram, where radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity.
- Fay (2005), in Biostatistics, proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen’s Kappa).

- The advantage of RMAC is that differences between the two marginal distributions will not induce greater apparent agreement.
- However, as stated in the paper, like Cohen’s Kappa with the fixed marginal assumption, the RMAC also depends on the heterogeneity of the population. Thus, in cases where the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.

- An omnibus agreement index for tests with more than two outcomes suffers from the same problems as the indices for tests with a dichotomous outcome. Also, in a regulatory setting where a new test device is being compared to a predicate device, RMAC may not be appropriate, as it gives equal weight to the marginals from the test and the predicate device.
- Instead, report individual agreement for each category.
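One possible reading of "individual agreement for each category" is to extend PPA and NPA: for each category assigned by the predicate device, report the fraction of those subjects that the new test placed in the same category. The sketch below follows that reading, which is my interpretation rather than a definition given in the slides. The function name and the three-category counts are hypothetical.

```python
def per_category_agreement(table):
    """Agreement with the predicate, one value per predicate category.

    table[i][j] = number of subjects the new test placed in category i
    and the predicate placed in category j.
    """
    k = len(table)
    result = []
    for j in range(k):
        column_total = sum(table[i][j] for i in range(k))
        # Fraction of predicate-category-j subjects on the diagonal.
        result.append(table[j][j] / column_total if column_total else float("nan"))
    return result

# With two categories this reduces to [PPA, NPA]; counts are those of Table 2.
two_by_two = per_category_agreement([[35, 5], [5, 55]])

# HYPOTHETICAL three-category table (rows: new test, columns: predicate).
three_cat = per_category_agreement([
    [40, 5, 0],
    [10, 20, 5],
    [0, 5, 15],
])

print([f"{v:.1%}" for v in two_by_two])
print([f"{v:.1%}" for v in three_cat])
```

Because each value is conditioned on the predicate's category, the predicate is treated as the comparator, in line with the PPA and NPA recommendation for dichotomous tests.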

- If a perfect reference standard exists for a dichotomous test, then both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.
- If a new test is being compared to an imperfect predicate test, then the positive percent agreement and negative percent agreement, along with their 95% confidence intervals, are a more appropriate way of comparison than reporting the overall agreement, the kappa statistic, or McNemar’s test.
- For tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion would be to report agreement for each cell.

- Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
- Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
- Fleiss, JL, Statistical Methods for Rates and Proportions, John Wiley & Sons, New York (2nd ed., 1981).
- Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & deVet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 326(7379), 41–44)

- Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
- Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol., 43(6), 543-549.
- Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol., 43(6), 551-558.
- Fay, M. P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6, 171-180.