
Revising FDA’s “Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests”

FDA/Industry Statistics Workshop

September 28-29, 2006

Kristen Meier, Ph.D.

Mathematical Statistician, Division of Biostatistics

Office of Surveillance and Biometrics

Center for Devices and Radiological Health, FDA


Outline

  • Background of guidance development

  • Overview of comments

  • STARD Initiative and definitions

  • Choice of comparative benchmark and implications

  • Agreement measures – pitfalls

  • Bias

  • Estimating performance without a “perfect” [reference] standard - latest research

  • Reporting recommendations


Background

  • Motivated by CDC concerns with IVDs for sexually transmitted diseases

  • Joint meeting of four FDA device panels (2/11/98): Hematology/Pathology, Clinical Chemistry/Toxicology, Microbiology and Immunology

  • Provide recommendations on “appropriate data collection, analysis, and resolution of discrepant results, using sound scientific and statistical analysis to support indications for use of in vitro diagnostic devices when the new device is compared to another device, a recognized reference method or ‘gold standard,’ or other procedures not commonly used, and/or clinical criteria for diagnosis”


Statistical Guidance Developed

“Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests: Draft Guidance for Industry and FDA Reviewers”

  • issued March 12, 2003, with a 90-day comment period

  • http://www.fda.gov/cdrh/osb/guidance/1428.html

  • applies to all diagnostic products, not just in vitro diagnostics

  • only addresses diagnostic devices with 2 possible outcomes (positive/negative)

  • does not address design and monitoring of clinical studies for diagnostic devices


Dichotomous Diagnostic Test Performance

Study Population: New Test vs. TRUTH

                    Truth +         Truth −
New Test +          TP (true +)     FP (false +)
New Test −          FN (false −)    TN (true −)

Estimates:

sensitivity (sens) = Pr(Test+ | Truth+), estimated by 100% × TP/(TP+FN)

specificity (spec) = Pr(Test− | Truth−), estimated by 100% × TN/(FP+TN)

“Perfect” test: sens = spec = 100% (FP = FN = 0)
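A minimal Python sketch of the estimates defined above (the function name is illustrative; the counts come from the 220-subject example on the next slide):

```python
def sens_spec(tp, fp, fn, tn):
    """Estimated sensitivity and specificity from a 2x2 table of a
    dichotomous test versus the true condition status."""
    sens = tp / (tp + fn)   # Pr(Test+ | Truth+)
    spec = tn / (fp + tn)   # Pr(Test- | Truth-)
    return sens, spec

# Counts from the 220-subject example (new test vs. truth):
print(sens_spec(tp=44, fp=1, fn=7, tn=168))   # -> approximately (0.863, 0.994)
```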


Example Data: 220 Subjects

Results for the 220 subjects, cross-classified two ways:

New Test vs. TRUTH
             +       −
New Test +   44      1
New Test −   7       168
total        51      169

New Test vs. Imperfect Standard
             +       −
New Test +   40      5
New Test −   4       171
total        44      176

Unbiased estimates (vs. TRUTH):              Sens 86.3% (44/51)    Spec 99.4% (168/169)
Biased* estimates (vs. imperfect standard):  Sens 90.9% (40/44)    Spec 97.2% (171/176)

* Misclassification bias (see Begg 1987)


Recalculation of Performance Using “Discrepant Resolution”

STAGE 1 – retest discordant results using a “resolver” test
STAGE 2 – revise the 2x2 table* based on the resolver result

Stage 1: New Test vs. Imperfect Standard (resolver calls for discordant cells in parentheses)
             +              −
New Test +   40             5 (5+, 0−)
New Test −   4 (1+, 3−)     171
total        44             176

Stage 2: New Test vs. resolver/imperfect standard
             “+”     “−”
New Test +   45      0
New Test −   1       174
total        46      174

“sens”  90.9% (40/44)    →  97.8% (45/46)
“spec”  97.2% (171/176)  →  100% (174/174)

* assumes concordant = “correct”
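To make the mechanics concrete, here is a small Python sketch (function and argument names are illustrative) that reproduces the two-stage recalculation above from the original 2x2 counts and the resolver's calls on the discordant cells. The inflation arises because only discordant results are retested and concordant results are assumed correct.

```python
def discrepant_resolution(pos_pos, pos_neg, neg_pos, neg_neg,
                          pos_neg_resolved_pos, neg_pos_resolved_pos):
    """pos_pos..neg_neg: new test (rows) vs. imperfect standard (columns).
    pos_neg_resolved_pos: new+/std- discordants the resolver calls positive.
    neg_pos_resolved_pos: new-/std+ discordants the resolver calls positive."""
    # Stage 2: concordant cells are kept; discordant cells are reassigned
    # according to the resolver's result.
    a = pos_pos + pos_neg_resolved_pos               # new+ / revised "+"
    b = pos_neg - pos_neg_resolved_pos               # new+ / revised "-"
    c = neg_pos_resolved_pos                         # new- / revised "+"
    d = neg_neg + (neg_pos - neg_pos_resolved_pos)   # new- / revised "-"
    return a / (a + c), d / (b + d)                  # revised "sens", "spec"

# Slide example: resolver calls all 5 new+ discordants "+" and 1 of the 4 new- discordants "+"
print(discrepant_resolution(40, 5, 4, 171,
                            pos_neg_resolved_pos=5, neg_pos_resolved_pos=1))
# -> approximately (0.978, 1.0), versus 86.3% / 99.4% against truth
```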


Topics for Guidance

Realization:

  • Problems are much larger than “discrepant resolution”

  • 2x2 is an oversimplification, but still useful to start

    Provide guidance:

  • What constitutes “truth”?

  • What to do if we don’t know truth?

  • What name do we give performance measures when we don’t have truth?

  • Describing study design: how were subjects, specimens, measurements, labs collected/chosen?


Comments on Guidance

FDA received comments from 11 individuals/organizations:

  • provide guidance on what constitutes “perfect standard”

    • remove “perfect/imperfect standard” concept and include and define “reference/non-reference standard” concept (STARD)

  • reference and use STARD concepts

  • provide approach for indeterminate, inconclusive, equivocal, etc… results

    • minimal recommendations

  • discuss methods for estimating sens and spec when a perfect [reference] standard is not used

    • cite new literature

  • include more discussion on bias, including verification bias

    • some discussion added, add more references

  • add glossary


STARD Initiative

STAndards for Reporting of Diagnostic Accuracy Initiative

  • effort by international working group to improve quality of reporting of studies of diagnostic accuracy

  • checklist of 25 items to include when reporting results

  • provide definitions for terminology

  • http://www.consort-statement.org/stardstatement.htm


STARD Definitions Adopted

Purpose of a qualitative diagnostic test is to determine whether a target condition is present or absent in a subject from the intended use population

  • Target condition (condition of interest) – can refer to a particular disease, a disease stage, health status, or any other identifiable condition within a patient, such as staging a disease already known to be present, or a health condition that should prompt clinical action, such as the initiation, modification, or termination of treatment

  • Intended use population (target population) – those subjects/patients for whom the test is intended to be used


Reference Standard (STARD)

Move away from notion of a fixed, theoretical “Truth”

  • “considered to be the best available method for establishing the presence or absence of the target condition…it can be a single test or method, or a combination of methods and techniques, including clinical follow-up”

  • dichotomous - divides the intended use population into condition present or absent

  • does not consider outcome of new test under evaluation


Reference Standard (FDA)

What constitutes “best available method”/reference method?

  • opinion and practice within the medical, laboratory and regulatory community

  • several possible methods could be considered

  • maybe no consensus reference standard exists

  • maybe a reference standard exists, but for a non-negligible percentage of the intended use population the reference standard is known to be in error

    FDA ADVICE:

  • consult with FDA on choice of reference standard before beginning your study

  • performance measures must be interpreted in context: report reference standard along with performance measures


Benchmarks for Assessing Diagnostic Performance

NEW: FDA recognizes 2 major categories of benchmarks

  • reference standard (STARD)

  • non-reference standard (a method or predicate other than a reference standard; 510(k) regulations)

    OLD: “perfect standard” and “imperfect standard”, “gold standard” – concepts and terms deleted

    Choice of comparative method determines which performance measures can be reported


Comparison with Benchmark

  • If a reference standard is available: use it

  • If a reference standard is available, but impractical: use it to the extent possible

  • If a reference standard is not available or unacceptable for your situation: consider constructing one

  • If a reference standard is not available and cannot be constructed, use a non-reference standard and report agreement


Naming Performance Measures: Depends on Benchmarks

Terminology is important – it helps ensure correct interpretation

Reference standard (STARD)

  • a lot of literature on studies of diagnostic accuracy (Pepe 2003, Zhou et al. 2002)

  • report sensitivity, specificity (and corresponding CIs), predictive values of positive and negative results

    Non-reference standard (due to 510(k) regulations)

  • report positive percent agreement and negative percent agreement

  • NEW: include corresponding CIs (consider score CIs)

  • interpret with care – many pitfalls!


Agreement

Study Population: New Test vs. Non-Reference Standard

             +     −
New Test +   a     b
New Test −   c     d

Positive percent agreement (new/non-ref. std.) = 100% × a/(a+c)

Negative percent agreement (new/non-ref. std.) = 100% × d/(b+d)

[overall percent agreement = 100% × (a+d)/(a+b+c+d)]

Note: even a “perfect” new test can have PPA ≠ 100% and NPA ≠ 100%, because the non-reference standard itself makes errors
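As a sketch of the recommended reporting, the snippet below computes PPA and NPA with score (Wilson) confidence intervals; it assumes the statsmodels package is available, and the function and cell names are illustrative rather than part of the guidance:

```python
from statsmodels.stats.proportion import proportion_confint

def agreement_with_score_ci(a, b, c, d, alpha=0.05):
    """a, b, c, d laid out as in the 2x2 table above
    (new test in rows, non-reference standard in columns)."""
    ppa, npa = a / (a + c), d / (b + d)
    ppa_ci = proportion_confint(a, a + c, alpha=alpha, method="wilson")
    npa_ci = proportion_confint(d, b + d, alpha=alpha, method="wilson")
    return (ppa, ppa_ci), (npa, npa_ci)

# Counts from the earlier 220-subject example (new test vs. imperfect standard):
print(agreement_with_score_ci(a=40, b=5, c=4, d=171))
```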


Pitfalls of Agreement

  • agreement as defined here is not symmetric: the calculation differs depending on which test's marginal total is used as the denominator

  • overall percent agreement is symmetric, but can be misleading (very different 2x2 data can give the same overall agreement)

  • agreement ≠ “correct”

  • overall agreement, PPA, and NPA can change (possibly a lot) depending on the prevalence (relative frequency of the target condition in the intended use population)
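A small numeric illustration of the last point (all sensitivity and specificity values here are hypothetical): holding both tests fixed, and assuming their errors are independent given the true condition, PPA and NPA still shift as prevalence changes.

```python
def ppa_npa_at_prevalence(prev, sens_new, spec_new, sens_std, spec_std):
    """Population-level PPA and NPA of a new test vs. a non-reference standard,
    assuming the two tests err independently given the true condition."""
    def cell(new_pos, std_pos):
        p_present = prev * (sens_new if new_pos else 1 - sens_new) * \
                           (sens_std if std_pos else 1 - sens_std)
        p_absent = (1 - prev) * ((1 - spec_new) if new_pos else spec_new) * \
                                ((1 - spec_std) if std_pos else spec_std)
        return p_present + p_absent
    a, b = cell(True, True), cell(True, False)
    c, d = cell(False, True), cell(False, False)
    return a / (a + c), d / (b + d)

for prev in (0.05, 0.25, 0.50):
    print(prev, ppa_npa_at_prevalence(prev, 0.90, 0.95, 0.85, 0.90))
```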


Overall Agreement Misleading

Two data sets, each comparing the new test with a non-reference standard:

Data set A
             +       −
New Test +   40      1
New Test −   19      512
total        59      513

Data set B
             +       −
New Test +   40      19
New Test −   1       512
total        41      531

overall agreement = 96.5% ((40+512)/572) for both data sets

Data set A: PPA = 67.8% (40/59)    NPA = 99.8% (512/513)
Data set B: PPA = 97.6% (40/41)    NPA = 96.4% (512/531)


Agreement ≠ Correct

Original data: New Test vs. Non-Reference Standard

             +     −
New Test +   40    5
New Test −   4     171

Stratify the data above by Reference Standard outcome:

Reference Standard +
             Non-Ref Std +   Non-Ref Std −
New Test +   39              5
New Test −   1               6

Reference Standard −
             Non-Ref Std +   Non-Ref Std −
New Test +   1               0
New Test −   3               165

The two tests agree with each other and are both wrong for 6 + 1 = 7 subjects


Bias

Unknown and non-quantified uncertainty

  • Often existence, size (magnitude), and direction of bias cannot be determined

  • Increasing overall number of subjects reduces statistical uncertainty (confidence interval widths) but may do nothing to reduce bias
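A brief simulation sketch of the second bullet (all parameters are hypothetical): as the sample size grows, the confidence interval for "sensitivity" measured against an imperfect standard narrows, but the estimate stays centred on a biased value.

```python
import numpy as np

rng = np.random.default_rng(0)
PREV, TRUE_SENS, TRUE_SPEC = 0.30, 0.90, 0.95   # new test vs. the true condition
STD_SENS, STD_SPEC = 0.85, 0.90                 # imperfect comparative standard

def apparent_sens(n):
    """Estimate P(new+ | standard+) from n simulated subjects, with a Wald CI half-width."""
    disease = rng.random(n) < PREV
    new = np.where(disease, rng.random(n) < TRUE_SENS, rng.random(n) < 1 - TRUE_SPEC)
    std = np.where(disease, rng.random(n) < STD_SENS, rng.random(n) < 1 - STD_SPEC)
    est = (new & std).sum() / std.sum()
    half_width = 1.96 * np.sqrt(est * (1 - est) / std.sum())
    return est, half_width

for n in (500, 5000, 50000):
    est, hw = apparent_sens(n)
    print(f"n={n}: estimate {est:.3f} +/- {hw:.3f}   (true sensitivity is {TRUE_SENS})")
```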


Some Types of Bias

  • error in reference standard

  • use test under evaluation to establish diagnosis

  • spectrum bias – do not choose the “right” subjects

  • verification bias – only a non-representative subset of subjects evaluated by reference standard, no statistical adjustments made to estimates

  • many other types of bias

    See Begg (1987), Pepe (2003), Zhou et al. (2002)


Estimating Sens and Spec Without a Reference Standard

  • Model-based approaches: latent class models and Bayesian models. See Pepe (2003), and Zhou et al. (2002)

  • Albert and Dodd (2004)

    • incorrect model leads to biased sens and spec estimates

    • different models can fit data equally well, yet produce very different estimates of sens and spec

  • FDA concerns & recommendations:

    • difficult to verify that model and assumptions are correct

    • try a range of models and assumptions and report range of results
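As a minimal sketch of the latent class idea (not the specific models in the cited references), the EM routine below assumes three binary tests that are conditionally independent given the latent condition. With real data that conditional-independence assumption is exactly what is hard to verify, which is the FDA concern stated above.

```python
import numpy as np

def latent_class_em(x, n_iter=500):
    """x: (n_subjects, n_tests) 0/1 array. Returns (prevalence, sens, spec) under a
    conditional-independence latent class model fitted by EM."""
    n, k = x.shape
    prev, sens, spec = 0.5, np.full(k, 0.8), np.full(k, 0.8)   # starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each subject has the condition
        p_pos = prev * np.prod(sens ** x * (1 - sens) ** (1 - x), axis=1)
        p_neg = (1 - prev) * np.prod((1 - spec) ** x * spec ** (1 - x), axis=1)
        w = p_pos / (p_pos + p_neg)
        # M-step: update prevalence, sensitivities, and specificities
        prev = w.mean()
        sens = (w[:, None] * x).sum(axis=0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - x)).sum(axis=0) / (1 - w).sum()
    return prev, sens, spec

# Quick check on simulated data (hypothetical parameters):
rng = np.random.default_rng(1)
d = rng.random(2000) < 0.3
x = np.column_stack([np.where(d, rng.random(2000) < s, rng.random(2000) < 1 - sp)
                     for s, sp in [(0.90, 0.95), (0.80, 0.90), (0.85, 0.85)]]).astype(int)
print(latent_class_em(x))
```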


Reference Standard Outcomes on a Subset

  • Albert and Dodd (2006, under review)

    • use info from verified and non-verified subjects

    • choosing between competing models is easier

    • explore subset choice (random, test dependent)

  • Albert (2006, under review)

    • estimation via imputation

    • study design implications (Albert, 2006)

  • Kondratovich (2003; 2002-Mar-8 FDA Microbiology Devices Panel Meeting)

    • estimation via imputation
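For intuition, here is a simple sketch of a verification-bias correction in the spirit of Begg and Greenes (not one of the imputation methods cited on this slide): it assumes verification by the reference standard depends only on the new test's result, estimates P(condition | test result) from the verified subset, and reweights by the overall test-result frequencies. All counts in the example are hypothetical.

```python
def corrected_sens_spec(n_pos, n_neg,
                        ver_pos_dpos, ver_pos_dneg,
                        ver_neg_dpos, ver_neg_dneg):
    """n_pos / n_neg: all subjects testing +/- on the new test.
    ver_pos_dpos / ver_pos_dneg: verified test+ subjects with condition present / absent.
    ver_neg_dpos / ver_neg_dneg: verified test- subjects with condition present / absent."""
    p_d_pos = ver_pos_dpos / (ver_pos_dpos + ver_pos_dneg)   # P(condition | test+)
    p_d_neg = ver_neg_dpos / (ver_neg_dpos + ver_neg_dneg)   # P(condition | test-)
    p_pos = n_pos / (n_pos + n_neg)                          # P(test+)
    p_d = p_d_pos * p_pos + p_d_neg * (1 - p_pos)            # estimated prevalence
    sens = p_d_pos * p_pos / p_d                             # P(test+ | condition)
    spec = (1 - p_d_neg) * (1 - p_pos) / (1 - p_d)           # P(test- | no condition)
    return sens, spec

# Hypothetical study: 300 test+, 700 test-; all test+ verified, only 100 test- verified.
print(corrected_sens_spec(n_pos=300, n_neg=700,
                          ver_pos_dpos=240, ver_pos_dneg=60,
                          ver_neg_dpos=10, ver_neg_dneg=90))
```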


Practices to Avoid

  • using terms “sensitivity” and “specificity” if reference standard is not used

  • discarding equivocal results in data presentations and calculations

  • using data altered or updated by discrepant resolution

  • using the new test as part of the comparative benchmark


External Validity

A study has high external validity if the study results are sufficiently reflective of the “real world” performance of the device in the intended use population


External Validity (continued)

FDA recommends

  • include appropriate subjects and/or specimens

  • use final version of the device according to the final instructions for use

  • use several of these devices in your study

  • include multiple users with relevant training and range of expertise

  • cover a range of expected use and operating conditions


Reporting Recommendations

  • CRITICAL - sufficient detail is needed to assess potential bias and external validity

  • just as important as (if not more important than) computing CIs correctly

  • see guidance for specific recommendations


References

Albert, P. S. (2006). Imputation approaches for estimating diagnostic accuracy for multiple tests from partially verified designs. Technical Report 042, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).

Albert, P.S., & Dodd, L.E. (2004). A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.

Albert, P. S. and Dodd, L. E. (2006). On estimating diagnostic accuracy with multiple raters and partial gold standard evaluation. Technical Report 041, Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute (http://linus.nci.nih.gov/~brb/TechReport.htm).

Begg, C. B. (1987). Biases in the assessment of diagnostic tests. Statistics in Medicine, 6, 411–423.

Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & deVet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 329(7379), 41–44)


References (continued)

Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Moher, D., Rennie, D., deVet, H.C.W., & Lijmer, J.G. (2003). The STARD statement for reporting studies of diagnostic accuracy: Explanation and elaboration. Clinical Chemistry, 49(1), 7–18. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 329(7379), 41–44)

Lang, T. A., & Secic, M. (1997). How to Report Statistics in Medicine. Philadelphia: American College of Physicians.

Kondratovich, Marina (2003). Verification bias in the evaluation of diagnostic devices. Proceedings of the 2003 Joint Statistical Meetings, Biopharmaceutical Section, San Francisco, CA.

Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press.

Zhou, X. H., Obuchowski, N. A., & McClish, D. K. (2002). Statistical methods in diagnostic medicine. New York: John Wiley & Sons.

