Practical Issues on Clinical Validation of Digital Imaging Applications in Routine Surgical Pathology
FDA Hematology and Pathology Devices Panel Meeting, October 22-23, 2009
Tan Nguyen, MD, PhD, RAC
FDA/CDRH/OIVD/DIHD-DCTD
Digitalization Not a Barrier to Pathologic Diagnosis • Image-based telepathology has been in place for a number of years • Capable automated high-speed, high-resolution whole slide imaging (WSI) technology is available • At issue: How can we demonstrate that pathologists can safely and effectively sign out routine surgical cases via WSI of H&E glass slides? • Compare with diagnoses made by light microscopy
Presentation Outline • Quality of images • Image acquisition, image display • Clinical performance study • Possible study designs • Selection of study participants • Case (specimen) selection • Establishing “reference” diagnosis • Evaluating diagnosis agreement • Other issues
Image Acquisition • Optimal objective lens power for image scanning? • Digital magnification or magnification by interchangeable objective lenses? • Single-focus plane or 3-D image enhancement? • Z-stacks needed for certain examinations (e.g., surgical margin, H. pylori, microcalcifications, nucleoli) • Compression algorithm, user-selectable ratio? • Diagnosis made on uncompressed image or image retrieved from prior compressed image data file?
Image Display • Viewing monitor • Standardized size, aspect ratio, display resolution (low, medium, high)? • Viewing software • Image storage, retrieval, annotation • Viewer functionality • “Thumbnail” view • Panning, zooming, side-by-side viewing of multiple images
Types of Possible Clinical Study • Prospective study (“field study”)? • Replicating real-world surgical pathology practice • Minimizing case selection bias • Introducing multiple new sources of variation • e.g., non-uniform specimen selection/suitability, variable quality of glass slides • Impractical? • Resource constraints at each study site • Possibly longer overall study duration
Types of Possible Clinical Study • Retrospective study? • Ability to select archival cases to challenge (“stress test”) the competing diagnostic modalities • Possible to incorporate more case variation • Inherent case selection bias • Often employed in multiple-reader multiple-case receiver operating characteristic (MRMC ROC) studies to assess diagnostic accuracy of radiologic imaging interpretations • Large study possible, to detect small differences in accuracy
MRMC ROC Paradigm • Possible to adopt MRMC ROC paradigm? • Frequently used tool in diagnostic radiology • More information per case, smaller sample sizes • Ability to compare accuracy of diagnostic modalities that rely on a wide range of subjective interpretations by readers of varying skill levels • Generalizable to similar readers and similar cases • Potentially complicated by multiple observations (diagnoses) in the same specimen
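As a rough illustration of the ROC side of this paradigm, the sketch below computes an empirical (Mann-Whitney) area under the ROC curve for each reader and modality and then averages across readers. The 5-point confidence ratings, reader names, and modality labels are hypothetical and are not taken from the presentation. A full MRMC analysis would additionally model reader and case variance, which this sketch does not attempt.

```python
from statistics import mean


def empirical_auc(scores_diseased, scores_normal):
    """Mann-Whitney estimate of AUC.

    Fraction of (diseased, normal) case pairs in which the diseased case
    received the higher confidence rating; ties count as one half.
    """
    wins = 0.0
    for d in scores_diseased:
        for n in scores_normal:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(scores_diseased) * len(scores_normal))


# Hypothetical 5-point confidence ratings (assumption, for illustration only):
# ratings[modality][reader] = (ratings on diseased cases, ratings on normal cases)
ratings = {
    "light_microscopy": {
        "reader_1": ([5, 4, 4, 3], [1, 2, 2, 3]),
        "reader_2": ([5, 5, 3, 4], [2, 1, 3, 1]),
    },
    "wsi": {
        "reader_1": ([5, 4, 3, 3], [1, 2, 3, 3]),
        "reader_2": ([4, 5, 3, 4], [2, 1, 3, 2]),
    },
}

# Reader-averaged AUC per modality; the difference between the two averages
# is the quantity an MRMC model would test while accounting for variance.
mean_auc = {
    modality: mean(empirical_auc(d, n) for d, n in readers.values())
    for modality, readers in ratings.items()
}
```

Averaging per-reader AUCs gives only a point estimate; it says nothing about whether a difference between modalities would generalize to other readers or cases.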
Selecting Study Participants • Study participants • Spectrum from pathologists without formal specialty training to specialty experts, or a more homogeneous population? • Prior exposure to digital pathology • Study locations • Community/academic practices, commercial laboratories • Number of study participants? • Traditional MRMC ROC studies: 10-20 readers; 100-300 cases
Selecting a Balanced Set of Cases • Adequate mix of biopsies to radical excisions • Broad spectrum of diagnostic complexity • Not based on ease of diagnosis, typicality of appearance • Randomly or sequentially selected specimens • Anonymized archival or prospectively collected cases • Use of enriched samples for low-prevalence diseases? • Including all or only representative diagnostic part(s)? • How many cases? • Statistical power against reader’s burden
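To make the trade-off between statistical power and reader burden concrete, the following sketch uses the standard normal-approximation formula for estimating a single proportion, n = z² p(1 − p) / d², to see how many cases are needed for a given confidence-interval half-width around an expected agreement rate. The function name and the example values (90% expected agreement, ±5 percentage points) are assumptions chosen for illustration. The formula treats cases as independent and ignores reader-to-reader variation, so it is only a first approximation for a multi-reader study.

```python
import math
from statistics import NormalDist


def cases_for_half_width(expected_agreement, half_width, confidence=0.95):
    """Cases needed so a normal-approximation CI for an agreement
    proportion has roughly the requested half-width.

    Assumes independent cases and a single reader (a simplification).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_agreement
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# Hypothetical planning values: 90% expected agreement, +/- 5 points.
n_cases = cases_for_half_width(0.90, 0.05)
```

Halving the half-width roughly quadruples the required number of cases, which is where the burden on each reader starts to dominate.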
Observer Variation • Inherent subjectivity in interpretation thresholds • e.g., “atypia,” tumor grading, borderline or uncommon lesions • Paucity of lesional area; intra-lesional variation • Lack of clear diagnostic criteria • Non-quantitative nature of scoring (e.g., pleomorphism) • Subjective distinctions on a histologic continuum • Broad spectrum of experience and confidence • Diagnostic “aggressiveness” or hedging in uncertainty
Reducing Observer Variation • Strict adherence to diagnostic criteria and guidelines • Use of pro forma histopathology reporting form • Use of a checklist of standardized diagnostic lines • Free-text diagnosis for diagnostic uncertainty? • Accommodating personal reporting style, judgment • Statistically problematic to evaluate • Collapsed 2-tiered versus 3-tiered grading system? • Circulating an annotated training set prior to study?
Establishing Light Microscopy “Reference” Diagnosis • Diagnosis by expert or consensus panel? • Number of experts • Consensus diagnosis by study participants? • Unanimous agreement or majority agreement? • Allowing “acceptable” diagnosis? • Disagreement in opinion, but not “error” (i.e., no amendment necessary)? • Should the “reference” diagnosis be an abstraction of the primary diagnosis or of all diagnostic lines?
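The choice between unanimous and majority agreement can be sketched as a simple voting rule over the panel's diagnoses. This is a minimal illustration that assumes each panelist reports one standardized diagnosis string per case; the function name and the example diagnoses are hypothetical.

```python
from collections import Counter


def reference_diagnosis(panel_diagnoses, rule="majority"):
    """Return the panel's reference diagnosis, or None if the rule is not met.

    rule="unanimous": every panelist must give the same diagnosis.
    rule="majority":  strictly more than half must give the same diagnosis.
    """
    counts = Counter(panel_diagnoses)
    top, votes = counts.most_common(1)[0]
    if rule == "unanimous":
        return top if votes == len(panel_diagnoses) else None
    if rule == "majority":
        return top if votes > len(panel_diagnoses) / 2 else None
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical three-member panel reading one case.
panel = ["CIN II", "CIN II", "CIN III"]
by_majority = reference_diagnosis(panel, rule="majority")
by_unanimity = reference_diagnosis(panel, rule="unanimous")
```

Cases for which the rule returns no diagnosis would need a separate policy (for example, adjudication or exclusion), which the slide leaves open.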
Evaluating Diagnosis Agreement • Primary diagnosis agreement only? • Secondary (2°) diagnoses often having no clinical impact • Unacceptable for a pathologist simply to make accurate diagnosis of malignancy! • Line-by-line agreement (1° and 2° diagnoses)? • Ideal for collecting performance testing data • Unrealistic to expect high agreement without clearly defined diagnostic criteria for all lesions under inquiry • Incomplete agreement on 2° diagnoses?
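The difference between primary-only and line-by-line agreement can be shown with a small sketch. It assumes each report is a list of standardized diagnostic lines with the primary diagnosis first, and that agreement means an exact string match; both are simplifying assumptions, and the example cases are hypothetical.

```python
def primary_agreement(report_a, report_b):
    """True if the first (primary) diagnostic lines match exactly."""
    return report_a[0] == report_b[0]


def line_by_line_agreement(report_a, report_b):
    """True if both reports contain the same set of diagnostic lines."""
    return set(report_a) == set(report_b)


def agreement_rates(paired_reports):
    """Return (primary-only rate, line-by-line rate) over paired reports."""
    n = len(paired_reports)
    primary = sum(primary_agreement(a, b) for a, b in paired_reports) / n
    full = sum(line_by_line_agreement(a, b) for a, b in paired_reports) / n
    return primary, full


# Hypothetical pairs: (light microscopy report, WSI report) for one reader.
paired_reports = [
    (["invasive ductal carcinoma", "ductal carcinoma in situ"],
     ["invasive ductal carcinoma", "ductal carcinoma in situ"]),
    (["invasive ductal carcinoma", "fibrocystic changes"],
     ["invasive ductal carcinoma"]),
    (["compound nevus"], ["junctional nevus"]),
    (["tubular adenoma"], ["tubular adenoma"]),
]

primary_rate, line_by_line_rate = agreement_rates(paired_reports)
```

In this toy data the second case agrees on the primary diagnosis but not on the secondary one, which is the "incomplete agreement" situation the slide asks about.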
Evaluating Diagnosis Agreement • “Major” versus “minor” discrepancy • Determined based on clinical impact or flat-out histopathologic error? • Compound nevus versus junctional nevus; CIN II versus CIN III → flat-out error, but no difference in treatment • Tumor on inked margin or within 1 mm of inked margin in breast biopsy → often a subjective call if specimen not adequately inked, but greatly affecting treatment decision • False-positive versus false-negative diagnosis? • Treated differently or equally in statistical evaluation?
Evaluating Diagnosis Agreement • [Figure: diagram relating the panel’s “reference” diagnoses (light microscopy), the participants’ diagnoses by light microscopy, and the participants’ diagnoses by digital pathology, with the comparisons labeled R1, R2, and R3]
“Wash-out” Period • E.g., a study involving the same pathologist reading: • ½ cases: digital imaging followed by light microscopy • ½ cases: light microscopy followed by digital imaging • “Wash-out” period between digital imaging reading and light microscopy reading? • Easier said than done! • Not necessary, if it is desirable to know whether one modality, when seen first, improves the agreement rate of the subsequent one?
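The half-and-half reading order described above can be sketched as a random, counterbalanced assignment of cases to the two orders. The function name, modality labels, case identifiers, and fixed seed are assumptions for illustration; the wash-out interval itself is a scheduling decision that this sketch does not model.

```python
import random


def assign_reading_order(case_ids, seed=0):
    """Randomly split cases so half are read digital-first and half
    light-microscopy-first (with an odd count, the extra case is
    read light-microscopy-first).
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    order = {}
    for i, case in enumerate(shuffled):
        if i < half:
            order[case] = ("wsi", "light_microscopy")
        else:
            order[case] = ("light_microscopy", "wsi")
    return order


# Hypothetical case identifiers.
reading_order = assign_reading_order([f"case_{i:03d}" for i in range(1, 21)])
```

Recording which modality was read first for each case is what makes it possible to examine afterwards whether the first reading influenced agreement on the second.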
Evaluating Diagnosis Agreement • If significant disagreement between R1, R2, and R3: • Case-sample variation • Intra- and interobserver variations • Variation intrinsic to each diagnostic modality • Is it possible, or necessary, to tease out all variations? • Or, account for effects of case and reader variations on accuracy of competing diagnostic modalities? • e.g., by MRMC statistical models; then comparing the overall accuracy
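One simple way to quantify the three pairwise comparisons among the panel's reference diagnoses, the participants' light microscopy diagnoses, and the participants' digital diagnoses is Cohen's kappa, which corrects observed agreement for agreement expected by chance. The sketch below is a generic implementation, not a method prescribed by the presentation; the diagnosis lists are hypothetical, and no assumption is made about which pair corresponds to R1, R2, or R3. Kappa does not separate case, reader, and modality effects, which is what the MRMC models mentioned above are meant to address.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical diagnoses."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement from the two marginal distributions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical diagnoses on the same four cases (illustration only).
series = {
    "panel_reference": ["benign", "benign", "malignant", "malignant"],
    "participant_lm": ["benign", "benign", "malignant", "malignant"],
    "participant_wsi": ["benign", "malignant", "malignant", "malignant"],
}

pairwise_kappa = {
    (a, b): cohens_kappa(series[a], series[b])
    for a, b in combinations(series, 2)
}
```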
Other Issues • Assuming valid performance data exist for one tissue type (e.g., breast pathology): • Can the test system be generalized and labeled for all other surgical pathology tissue types without the need for further validations? • Can it be generalized and labeled for intraoperative (frozen section) diagnosis and telepathology? • If not, how should the label explicitly state the test system’s limitations?
Other Issues • Generalizing performance of WSI of H&E glass slides to non-H&E-stained glass slides? • Required training of pathologists prior to using WSI? • What type of training? • Need for post-marketing study for additional safety and effectiveness data? • How to conduct such a study? • What data to collect?