Assessing agreement for diagnostic devices

Assessing agreement for diagnostic devices. FDA/Industry Statistics Workshop September 28-29, 2006 Bipasa Biswas Mathematical Statistician, Division of Biostatistics Office of Surveillance and Biometrics Center for Devices and Radiological Health, FDA



No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred


Outline

  • Accuracy measures for diagnostic tests with a dichotomous outcome. Ideal world: tests with a reference standard.

    • Two indices to measure accuracy: sensitivity and specificity

  • Assessing agreement between two tests in the absence of a reference standard.

    • Overall agreement

    • Cohen’s Kappa

    • McNemar’s test

    • Proposed remedy

  • Extending agreement to tests with more than 2 outcomes.

    • Cohen’s Kappa

    • Extension to Random Marginal Agreement coefficient (RMAC)

    • Should agreement per cell be reported?


Ideal World-Tests with perfect reference standard (Single)

  • If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), then we can represent the data as:

                True Status
    Test        D+      D-
    T+          TP      FP
    T-          FN      TN

  • If the true disease status is known, then we can estimate the sensitivity Se = TP/(TP+FN) and the specificity Sp = TN/(TN+FP).
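The two estimates above can be sketched in a few lines of Python. The function name and the counts are illustrative assumptions, not taken from the slides; the counts were chosen to give Se = 80% and Sp = 70%.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) from the cells of a 2x2 table against a perfect reference."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased subject")
    se = tp / (tp + fn)  # fraction of diseased subjects that test positive
    sp = tn / (tn + fp)  # fraction of non-diseased subjects that test negative
    return se, sp


# Hypothetical counts: 100 diseased and 100 non-diseased subjects.
se, sp = sensitivity_specificity(tp=80, fp=30, fn=20, tn=70)
print(f"Se = {se:.1%}, Sp = {sp:.1%}")  # Se = 80.0%, Sp = 70.0%
```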


Ideal World-Tests with perfect reference standard (Comparing two tests)

  • McNemar’s test can be used to test equality of either sensitivity or specificity.

                        True Status
                Disease (D+)              No Disease (D-)
                Comparator test           Comparator test
    New test    R+      R-      New test  R+      R-
    T+          a1      b1      T+        a2      b2
    T-          c1      d1      T-        c2      d2

    McNemar chi-square (with continuity correction):

    Equality of sensitivities of the two tests: (|b1 - c1| - 1)^2 / (b1 + c1)

    Equality of specificities of the two tests: (|c2 - b2| - 1)^2 / (c2 + b2)


Ideal World-Tests with perfect reference standard (Comparing two tests)

  • Example

                        True Status
                Disease (D+)              No Disease (D-)
                Comparator test           Comparator test
    New test    R+      R-      New test  R+      R-
    T+          80      5       T+        85      20
    T-          10      5       T-        5       790

    SeT = 85.0% (85/100)        SpT = 88.3% (795/900)
    SeR = 90.0% (90/100)        SpR = 90.0% (810/900)

  • McNemar chi-square:

    Equality of sensitivities of the two tests: (|5 - 10| - 1)^2 / (5 + 10) = 1.07
    p-value = 0.30
    95% CI for SeT - SeR: (-13.5%, 3.5%)

    Equality of specificities of the two tests: (|5 - 20| - 1)^2 / (5 + 20) = 7.84
    p-value = 0.005
    95% CI for SpT - SpR: (-2.9%, -0.5%)
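The figures in this example can be checked with a short Python sketch. The slides do not say how the confidence intervals were computed; the Wald interval with a 1/n continuity correction used here is an assumption that happens to reproduce the reported limits. The function names are illustrative, and the statistic is clamped at zero when the discordant counts differ by less than one.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar chi-square (1 df) and its two-sided p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs")
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of a chi-square with 1 df
    return chi2, p


def paired_diff_ci(plus, minus, n, z=1.959964):
    """Assumed method: Wald interval with a 1/n continuity correction.

    `plus` and `minus` are the discordant counts that raise and lower the
    difference of the two paired proportions; n is the number of pairs.
    """
    diff = (plus - minus) / n
    se = math.sqrt(plus + minus - (plus - minus) ** 2 / n) / n
    half = z * se + 1 / n
    return diff - half, diff + half


# Sensitivities (100 diseased subjects): discordant counts 5 (T+, R-) and 10 (T-, R+).
chi2_se, p_se = mcnemar_corrected(5, 10)
ci_se = paired_diff_ci(plus=5, minus=10, n=100)

# Specificities (900 non-diseased subjects): discordant counts 5 (T-, R+) and 20 (T+, R-).
chi2_sp, p_sp = mcnemar_corrected(5, 20)
ci_sp = paired_diff_ci(plus=5, minus=20, n=900)

print(f"Se: chi2 = {chi2_se:.2f}, p = {p_se:.2f}, CI = ({ci_se[0]:.1%}, {ci_se[1]:.1%})")
print(f"Sp: chi2 = {chi2_sp:.2f}, p = {p_sp:.3f}, CI = ({ci_sp[0]:.1%}, {ci_sp[1]:.1%})")
```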


    McNemar’s test when a reference standard exists

    • Note, however, that McNemar’s test only checks for equality: the null hypothesis is that the two tests perform equally and the alternative is that they differ. This is not an appropriate formulation for demonstrating equivalence, because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence.

    • The 95% confidence intervals for the differences in sensitivity and specificity give a better picture of the difference between the two tests.


    Imperfect reference standard

    • A subject’s true disease status is seldom known with certainty.

    • What is the effect on sensitivity and specificity when the comparator test R itself has error?

      Imperfect reference test (comparator test)
      New test    R+      R-
      T+          a       b
      T-          c       d


    Imperfect reference standard

    • Example 1: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) which misses 20% of the diseased subjects but never falsely indicates disease.

      True Status                   Imperfect reference test
              D+      D-                    R+      R-
      T+      80      30            T+      64      46
      T-      20      70            T-      16      74

      Se = 80.0% (80/100)           Se (relative to R) = 80.0% (64/80)
      Sp = 70.0% (70/100)           Sp (relative to R) = 61.7% (74/120)
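The counts against R in Example 1 are what one would expect if T and R err independently given the true disease status. A minimal sketch under that conditional-independence assumption follows; the function name and its parameters are illustrative, and it does not cover the case where the errors of R and T are related.

```python
def apparent_accuracy(se_t, sp_t, se_r, sp_r, n_pos, n_neg):
    """Expected T-by-R table and T's accuracy relative to R.

    Assumes T and R err independently given true disease status.
    Returns ((a, b, c, d), Se relative to R, Sp relative to R).
    """
    # Each cell is the diseased contribution plus the non-diseased contribution.
    a = n_pos * se_t * se_r + n_neg * (1 - sp_t) * (1 - sp_r)        # T+, R+
    b = n_pos * se_t * (1 - se_r) + n_neg * (1 - sp_t) * sp_r        # T+, R-
    c = n_pos * (1 - se_t) * se_r + n_neg * sp_t * (1 - sp_r)        # T-, R+
    d = n_pos * (1 - se_t) * (1 - se_r) + n_neg * sp_t * sp_r        # T-, R-
    return (a, b, c, d), a / (a + c), d / (b + d)


# Example 1: T has Se = 0.80, Sp = 0.70; R has Se = 0.80, Sp = 1.00.
table, se_rel, sp_rel = apparent_accuracy(0.80, 0.70, 0.80, 1.00, n_pos=100, n_neg=100)
print([round(x) for x in table])                                   # [64, 46, 16, 74]
print(f"Se relative to R = {se_rel:.1%}, Sp relative to R = {sp_rel:.1%}")
```

With a perfect new test and a reference that has 90% sensitivity and 90% specificity, the same function gives 90% for both relative measures, which matches Example 3.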


    Imperfect reference standard

    • Example 2: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R which misses 20% of the diseased subjects, but the error in R is related to the error in T.

      True Status                   Imperfect reference test
              D+      D-                    R+      R-
      T+      80      30            T+      80      30
      T-      20      70            T-      0       90

      Se = 80.0% (80/100)           Se (relative to R) = 100.0% (80/80)
      Sp = 70.0% (70/100)           Sp (relative to R) = 75.0% (90/120)


    Imperfect reference standard

    • Example 3: Now suppose our test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

      True Status                   Imperfect reference test
              D+      D-                    R+      R-
      T+      100     0             T+      90      10
      T-      0       100           T-      10      90

      Se = 100.0% (100/100)         Se (relative to R) = 90.0% (90/100)
      Sp = 100.0% (100/100)         Sp (relative to R) = 90.0% (90/100)


    Challenges in assessing agreement in the absence of a reference standard.

    • Two commonly used overall measures are:

      • Overall agreement measure

      • Cohen’s Kappa

    • McNemar’s Test

    • Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).


    Estimate of Agreement

    • The overall percent agreement can be calculated as:

      100% × (a+d)/(a+b+c+d)

    • The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives.

    • Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (see Feinstein and Cicchetti, 1990).

      PPA = 100% × a/(a+c)

      NPA = 100% × d/(b+d)


    Why not report just the overall percent agreement?

    The overall percent agreement is insensitive to off-diagonal imbalance.

    Table: new test (T+, T-) versus imperfect reference test (R+, R-)

    The overall percent agreement is 85.0%, yet it does not account for the off-diagonal imbalance: the PPA is 100% while the NPA is only 50%.
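A small Python sketch of the three agreement measures makes the point concrete. The cell counts below are hypothetical, chosen only because they reproduce the stated 85%, 100% and 50%. The layout follows the a, b, c, d convention of the earlier formulas, with columns given by the reference test.

```python
def agreement(a, b, c, d):
    """Overall, positive and negative percent agreement (as fractions).

    a = T+R+, b = T+R-, c = T-R+, d = T-R-; PPA and NPA are relative to R.
    """
    n = a + b + c + d
    return {
        "overall": (a + d) / n,
        "ppa": a / (a + c),
        "npa": d / (b + d),
    }


# Hypothetical counts (n = 100): every disagreement falls in the T+, R- cell.
result = agreement(a=70, b=15, c=0, d=15)
print({k: f"{v:.1%}" for k, v in result.items()})
# {'overall': '85.0%', 'ppa': '100.0%', 'npa': '50.0%'}
```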


    Why report both PPA and NPA?

    Table 1                                 Table 2
    Imperfect reference test                Imperfect reference test
    New test    R+      R-                  New test    R+      R-
    T+          5       5                   T+          35      5
    T-          5       85                  T-          5       55

    Overall pct. agreement = 90.0%          Overall pct. agreement = 90.0%
    PPA = 50.0% (5/10)                      PPA = 87.5% (35/40)
    [95% CI = 18.7%, 81.3%]                 [95% CI = 73.2%, 95.8%]
    NPA = 94.4% (85/90)                     NPA = 91.7% (55/60)
    [95% CI = 87.5%, 98.2%]                 [95% CI = 81.6%, 97.2%]
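The slides do not name the interval method. The limits for 5/10 are consistent with an exact (Clopper-Pearson) binomial interval, so the sketch below assumes that method and solves for the limits by bisection using only the standard library. The function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x successes out of n, found by bisection."""

    def solve(f, target):
        lo, hi = 0.0, 1.0  # f is decreasing in p on [0, 1]
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: P(X >= x | p) = alpha/2, i.e. P(X <= x - 1 | p) = 1 - alpha/2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


for label, x, n in [("PPA, Table 1", 5, 10), ("PPA, Table 2", 35, 40)]:
    lo, hi = exact_ci(x, n)
    print(f"{label}: {x}/{n} = {x / n:.1%}, 95% CI = ({lo:.1%}, {hi:.1%})")
```

The two tables share the same overall agreement, but the interval for a PPA based on only 10 reference positives is far wider than one based on 40.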


    Kappa measure of agreement

    • Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum difference, and ranges from -1 to 1.

      Imperfect reference standard
      New test    R+      R-
      T+          a       b
      T-          c       d

    • κ = (Io - Ie)/(1 - Ie), where Io = (a+d)/n and Ie = ((a+c)(a+b) + (b+d)(c+d))/n^2


    Kappa measure of agreement

    Imperfect reference standard
    New test    R+      R-
    T+          35      15
    T-          15      35

    • Io = 70/100 = 0.70, Ie = ((50)(50) + (50)(50))/10000 = 0.50

    • κ = (0.70 - 0.50)/(1 - 0.50) = 0.40

      [95% CI = 0.22, 0.58]

    • Note that the overall percent agreement is 70.0%.
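The kappa calculation can be sketched as follows. The slides do not say how the confidence interval was obtained; the simple large-sample standard error used here, sqrt(Io(1 - Io) / (n(1 - Ie)^2)), is an assumption that reproduces the reported interval for this table. The function name is illustrative.

```python
import math


def cohen_kappa(a, b, c, d, z=1.96):
    """Cohen's kappa for a 2x2 table with an approximate large-sample CI."""
    n = a + b + c + d
    io = (a + d) / n                                        # observed agreement
    ie = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2     # chance-expected agreement
    kappa = (io - ie) / (1 - ie)
    se = math.sqrt(io * (1 - io) / (n * (1 - ie) ** 2))     # assumed simple SE
    return kappa, (kappa - z * se, kappa + z * se)


kappa, (lo, hi) = cohen_kappa(35, 15, 15, 35)
print(f"kappa = {kappa:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")  # kappa = 0.40, 95% CI = (0.22, 0.58)
```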


    Kappa measure of agreement sensitive to off-diagonal?

    Table: new test (T+, T-) versus imperfect reference test (R+, R-)

    κ = 0.45 [95% CI = 0.31, 0.59]

    Although the overall agreement is unchanged (70%) and the marginal differences are much larger than before, kappa is higher (0.45 versus 0.40), suggesting better agreement.

    The kappa statistic is affected by the marginal totals even when the overall agreement is the same.
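The effect of the marginal totals can be illustrated with two tables that share 70% overall agreement. The first has equal margins of 50; the second is a hypothetical table, not taken from the slides, in which all disagreements fall in one off-diagonal cell.

```python
def kappa(a, b, c, d):
    """Return (overall agreement, Cohen's kappa) for a 2x2 table."""
    n = a + b + c + d
    io = (a + d) / n
    ie = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2
    return io, (io - ie) / (1 - ie)


balanced = (35, 15, 15, 35)    # all four margins equal to 50
unbalanced = (35, 30, 0, 35)   # hypothetical: every disagreement is T+, R-

for name, cells in [("balanced", balanced), ("unbalanced", unbalanced)]:
    io, k = kappa(*cells)
    print(f"{name}: overall agreement = {io:.0%}, kappa = {k:.2f}")
# balanced: overall agreement = 70%, kappa = 0.40
# unbalanced: overall agreement = 70%, kappa = 0.45
```

The unbalanced margins lower the chance-expected agreement, so kappa rises even though the two tests disagree more systematically.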


    McNemar’s Test to check for equality in the absence of a reference standard

    • Null hypothesis: equality of the rates of positive response

      Imperfect reference test
      New test    R+      R-
      T+          37      30
      T-          5       28

      McNemar chi-square = (|b - c| - 1)^2 / (b + c)

      = (|30 - 5| - 1)^2 / (30 + 5) = 16.46

      Two-sided p-value = 0.00005


    McNemar’s test (insensitivity to main diagonal)

    Table: new test (T+, T-) versus imperfect reference test (R+, R-)

    • Same p-value as when A=37 and D=28, even though the new and the old test now agree on 99.5% of individual cases.


    McNemar’s test (insensitivity to main diagonal)

    Table: new test (T+, T-) versus imperfect reference test (R+, R-)

    • Two-sided p-value = 1, even though the old and new tests agree on no cases.
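A short sketch shows why McNemar’s test cannot measure agreement: the statistic depends only on the discordant cells b and c. Only the b = 30, c = 5 statistic and the values A = 37 and D = 28 come from the slides; the other two tables are hypothetical illustrations of the two extremes. The statistic is clamped at zero when b and c differ by less than one, which is an assumption made so that b = c gives a p-value of 1.

```python
import math


def mcnemar(a, b, c, d):
    """Continuity-corrected McNemar test; a and d are accepted but never used."""
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of a chi-square with 1 df
    return chi2, p


tables = {
    "A=37, D=28 (65% agreement)": (37, 30, 5, 28),
    "hypothetical, 99.5% agreement": (3483, 30, 5, 3482),
    "hypothetical, no agreement": (0, 50, 50, 0),
}

for name, (a, b, c, d) in tables.items():
    chi2, p = mcnemar(a, b, c, d)
    agree = (a + d) / (a + b + c + d)
    print(f"{name}: agreement = {agree:.1%}, chi2 = {chi2:.2f}, p = {p:.5f}")
```

The first two tables give the same chi-square of 16.46 despite very different agreement, and the third gives a p-value of 1 with no agreement at all.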


    Proposed remedy

    • Instead of reporting overall agreement, kappa, or the McNemar’s test p-value, report both positive percent agreement and negative percent agreement.

    • In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.


    Agreement of tests with more than two outcomes

    • For example, in radiology one often compares the standard film mammogram to a digital mammogram, where the radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity.

    • The article by Fay (2005) in Biostatistics proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen’s kappa).


    Comparing two tests with more than two outcomes

    • The advantage of RMAC is that differences between the two marginal distributions will not induce greater apparent agreement.

    • However, as stated in the paper, RMAC, like Cohen’s kappa with the fixed marginal assumption, also depends on the heterogeneity of the population. Thus, when the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.


    Comparing two tests with more than two outcomes

    • An omnibus agreement index for tests with more than two outcomes suffers from the same problems as those faced by tests with a dichotomous outcome. Also, in a regulatory setting where a new test device is being compared to a predicate device, RMAC may not be appropriate, as it gives equal weight to the marginals from the test and the predicate device.

    • Instead, report individual agreement for each category.
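One possible reading of per-category agreement, by analogy with PPA and NPA, is the fraction of subjects in each comparator category that the new test places in the same category. The sketch below assumes that reading. The function name and the 3x3 table are hypothetical and are not taken from the slides.

```python
def per_category_agreement(table):
    """Agreement per comparator category for a k x k table.

    table[i][j] is the number of subjects with new-test category i and
    comparator category j. Entry j of the result is the fraction of
    comparator-category-j subjects that the new test also assigned to j.
    """
    k = len(table)
    result = []
    for j in range(k):
        col_total = sum(table[i][j] for i in range(k))
        result.append(table[j][j] / col_total if col_total else float("nan"))
    return result


# Hypothetical 3-category comparison (rows: new test, columns: comparator).
counts = [
    [40, 5, 0],
    [8, 25, 4],
    [2, 5, 11],
]
for category, value in enumerate(per_category_agreement(counts), start=1):
    print(f"category {category}: agreement = {value:.1%}")
```

Reporting one figure per category, each with its own confidence interval, keeps a well-populated category from masking poor agreement in a sparse one.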


    Summary

    • If a perfect reference standard exists for a dichotomous test, then both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.

    • If a new test is being compared to an imperfect predicate test, then the positive percent agreement and negative percent agreement, along with their 95% confidence intervals, are a more appropriate basis for comparison than the overall agreement, the kappa statistic, or McNemar’s test.

    • For tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion would be to report agreement for each cell.


    References

    • Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.

    • Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.

    • Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions (2nd ed.). John Wiley & Sons, New York.

    • Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1-6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1-12 and in British Medical Journal (2003) 326(7379), 41-44.)


    References (continued)

    • Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.

    • Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol., 43(6), 543-549.

    • Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol., 43(6), 551-558.

    • Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6, 171-180.

