Revising FDA’s “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests”. FDA/Industry Statistics Workshop September 28-29, 2006 Kristen Meier, Ph.D. Mathematical Statistician, Division of Biostatistics Office of Surveillance and Biometrics
Center for Devices and Radiological Health, FDA
“Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests: Draft Guidance for Industry and FDA Reviewers”
                  Truth +         Truth −
New Test +        TP (true +)     FP (false +)
New Test −        FN (false −)    TN (true −)

sensitivity (sens) = Pr(Test+ | Truth+) = 100% × TP/(TP+FN)
specificity (spec) = Pr(Test− | Truth−) = 100% × TN/(FP+TN)
“Perfect” test: sens = spec = 100% (FP = FN = 0)
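These definitions can be sketched in a few lines of Python (illustrative helper names, not part of the guidance):

```python
def sensitivity(tp, fn):
    """Pr(Test+ | Truth+), expressed as a percentage."""
    return 100.0 * tp / (tp + fn)

def specificity(tn, fp):
    """Pr(Test- | Truth-), expressed as a percentage."""
    return 100.0 * tn / (tn + fp)

# A "perfect" test has FP = FN = 0, so both measures equal 100%.
assert sensitivity(tp=50, fn=0) == 100.0
assert specificity(tn=150, fp=0) == 100.0
```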
Versus TRUTH:
                  +      −
New Test +        44     1
New Test −        7      168
total             51     169
Unbiased estimates: Sens 86.3% (44/51), Spec 99.4% (168/169)

Versus Imperfect Standard:
                  +      −
New Test +        40     5
New Test −        4      171
total             44     176
Biased* estimates: Sens 90.9% (40/44), Spec 97.2% (171/176)

* Misclassification bias (see Begg 1987)
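The bias can be reproduced directly from the two tables above (a quick Python check; the counts are taken from the slide):

```python
def sens(tp, fn):
    # Sensitivity as a percentage.
    return 100.0 * tp / (tp + fn)

def spec(tn, fp):
    # Specificity as a percentage.
    return 100.0 * tn / (tn + fp)

# Versus truth: unbiased estimates.
print(round(sens(44, 7), 1), round(spec(168, 1), 1))   # 86.3 99.4
# Versus the imperfect standard: biased by misclassification.
print(round(sens(40, 4), 1), round(spec(171, 5), 1))   # 90.9 97.2
```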
STAGE 1 – retest discordants using a “resolver” test
STAGE 2 – revise the 2×2 table based on the resolver result

Versus Imperfect Standard:
                  +             −
New Test +        40            5 (5+, 0−)
New Test −        4 (1+, 3−)    171
total             44            176
“sens” 90.9% (40/44), “spec” 97.2% (171/176)

Versus resolver/imperfect std.:
                  “+”    “−”
New Test +        45     0
New Test −        1      174
total             46     174
“sens” 97.8% (45/46), “spec” 100% (174/174)
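Recomputing the rates before and after discrepant resolution (Python, counts from the slide) shows why the procedure is biased: only the discordant cells are retested, so the revised figures can only move upward.

```python
# Before resolution: new test vs. imperfect standard.
sens_before = 100.0 * 40 / 44
spec_before = 100.0 * 171 / 176
# After resolution: discordant subjects reclassified by the resolver.
sens_after = 100.0 * 45 / 46
spec_after = 100.0 * 174 / 174

print(round(sens_before, 1), round(sens_after, 1))  # 90.9 97.8
print(round(spec_before, 1), round(spec_after, 1))  # 97.2 100.0
```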
FDA received comments from 11 individuals/organizations.
STAndards for Reporting of Diagnostic Accuracy Initiative
Purpose of a qualitative diagnostic test is to determine whether a target condition is present or absent in a subject from the intended use population
Move away from notion of a fixed, theoretical “Truth”
What constitutes “best available method”/reference method?
NEW: FDA recognizes 2 major categories of benchmarks
OLD: the concepts and terms “perfect standard”, “imperfect standard”, and “gold standard” have been deleted
Choice of comparative method determines which performance measures can be reported
Terminology is important – help ensure correct interpretation
Reference standard (STARD)
Non-reference standard (due to 510(k) regulations)
                  Non-Ref. Std. +    Non-Ref. Std. −
New Test +        a                  b
New Test −        c                  d

Positive percent agreement (new/non-ref. std.) = 100% × a/(a+c)
Negative percent agreement (new/non-ref. std.) = 100% × d/(b+d)
[overall percent agreement = 100% × (a+d)/(a+b+c+d)]

A “perfect” new test need not have PPA = 100% or NPA = 100%, because the non-reference standard itself can be wrong.
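A minimal Python sketch of the three agreement measures (function names are illustrative):

```python
def ppa(a, c):
    """Positive percent agreement: 100 * a / (a + c)."""
    return 100.0 * a / (a + c)

def npa(d, b):
    """Negative percent agreement: 100 * d / (b + d)."""
    return 100.0 * d / (b + d)

def overall_agreement(a, b, c, d):
    """Overall percent agreement: 100 * (a + d) / total."""
    return 100.0 * (a + d) / (a + b + c + d)

# Counts from the earlier new-test vs. non-reference-standard table.
print(round(ppa(40, 4), 1))                   # 90.9
print(round(npa(171, 5), 1))                  # 97.2
print(round(overall_agreement(40, 5, 4, 171), 1))  # 95.9
```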
Example A:
                  Non-Ref Std +    Non-Ref Std −
New Test +        40               1
New Test −        19               512
total             59               513
PPA = 67.8% (40/59), NPA = 99.8% (512/513)

Example B:
                  Non-Ref Std +    Non-Ref Std −
New Test +        40               19
New Test −        1                512
total             41               531
PPA = 97.6% (40/41), NPA = 96.4% (512/531)

Both examples: overall agreement = 96.5% ((40+512)/572)
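The asymmetry can be verified in a few lines (Python; counts from the two tables). Although the overall agreement is identical, PPA and NPA differ sharply depending on where the discordant subjects fall:

```python
def ppa(a, c): return 100.0 * a / (a + c)
def npa(d, b): return 100.0 * d / (b + d)
def overall(a, b, c, d): return 100.0 * (a + d) / (a + b + c + d)

# Example A: a=40, b=1, c=19, d=512
print(round(ppa(40, 19), 1), round(npa(512, 1), 1))   # 67.8 99.8
# Example B: a=40, b=19, c=1, d=512
print(round(ppa(40, 1), 1), round(npa(512, 19), 1))   # 97.6 96.4
# Overall agreement is the same for both tables.
print(round(overall(40, 1, 19, 512), 1))              # 96.5
print(round(overall(40, 19, 1, 512), 1))              # 96.5
```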
Original data (new test vs. non-reference standard):
                  +      −
New Test +        40     5
New Test −        4      171

Stratify the data above by reference-standard outcome:

Reference Std +:
                  Non-Ref Std +    Non-Ref Std −
New Test +        39               5
New Test −        1                6

Reference Std −:
                  Non-Ref Std +    Non-Ref Std −
New Test +        1                0
New Test −        3                165

The two tests agree with each other yet are both wrong for 6 + 1 = 7 subjects.
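A small consistency check (Python) on the stratified counts; the dictionaries key each cell by (new test result, non-reference standard result), a labeling chosen here for illustration:

```python
# Counts keyed by (new test result, non-reference standard result).
ref_pos = {('+', '+'): 39, ('+', '-'): 5, ('-', '+'): 1, ('-', '-'): 6}
ref_neg = {('+', '+'): 1, ('+', '-'): 0, ('-', '+'): 3, ('-', '-'): 165}

# The two strata sum back to the original 2x2 table (40, 5, 4, 171).
combined = {k: ref_pos[k] + ref_neg[k] for k in ref_pos}
assert combined == {('+', '+'): 40, ('+', '-'): 5,
                    ('-', '+'): 4, ('-', '-'): 171}

# Subjects where both tests agree with each other but are wrong about
# the reference standard: the both-negative cell among Ref Std +, plus
# the both-positive cell among Ref Std -.
agree_and_wrong = ref_pos[('-', '-')] + ref_neg[('+', '+')]
print(agree_and_wrong)  # 7
```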
Unknown and non-quantified uncertainty
See Begg (1987), Pepe (2003), Zhou et al. (2002)
A study has high external validity if the study results are sufficiently reflective of the “real world” performance of the device in the intended use population
Albert, P. S. (2006). Imputation approaches for estimating diagnostic accuracy for multiple tests from partially verified designs. Technical Report 042, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).
Albert, P.S., & Dodd, L.E. (2004). A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.
Albert, P. S. and Dodd, L. E. (2006). On estimating diagnostic accuracy with multiple raters and partial gold standard evaluation. Technical Report 041, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).
Begg, C.B. (1987). Biases in the assessment of diagnostic tests. Statistics in Medicine, 6, 411–423.
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & deVet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 329(7379), 41–44)
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Moher, D., Rennie, D., deVet, H.C.W., & Lijmer, J.G. (2003). The STARD statement for reporting studies of diagnostic accuracy: Explanation and elaboration. Clinical Chemistry, 49(1), 7–18. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 329(7379), 41–44)
Lang, T.A., & Secic, M. (1997). How to report statistics in medicine. Philadelphia: American College of Physicians.
Kondratovich, Marina (2003). Verification bias in the evaluation of diagnostic devices. Proceedings of the 2003 Joint Statistical Meetings, Biopharmaceutical Section, San Francisco, CA.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press.
Zhou, X. H., Obuchowski, N. A., & McClish, D. K. (2002). Statistical methods in diagnostic medicine. New York: John Wiley & Sons.