Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology

Whole Slide Imagery as an Enabling Technology for Content-Based Image Retrieval: A review of current capabilities, opportunities and challenges Ulysses J. Balis, M.D. Director, Division of Pathology Informatics Department of Pathology University of Michigan Health System ulysses@umich.edu

Disclosures* • Aperio: • Technical Advisory Board and Shareholder • Living Microsystems/Artemis Health, Inc.: • Founder and Shareholder • Cellpoint Diagnostics: • Founder and Shareholder *These are listed for completeness only; this presentation does not contain proprietary or commercial content from any of the above entities.

Overview of Topics • Thesis statement • Definitions • A quick history of content-based image retrieval (CBIR) • Prior work • The challenge that is Pathology CBIR • Current technology and recent developments • Demonstrations • Opportunities: • upcoming Web-enabled tool suites • Intended use-cases

Thesis Statement The availability of digital whole slide data sets represents an enormous opportunity to carry out new forms of numerical and data-driven query, in modes not based on textual, ontological or lexical matching. • Search image repositories with whole images or image regions of interest • Carry out search in real time via use of scalable computational architectures • Extraction from image repositories based upon spatial information • Analysis of data in the digital domain • Result: a surface map or a gallery of matching images

Definition • Content-Based Image Retrieval (CBIR): • Within the context of an image-based repository, searching for matching predicates with image-based operators in lieu of text matching • Reverse Metadata Lookup (RML): • Using the cohort of returned images from a CBIR query to generate a list of associated metadata concept terms • Anatomic frame of reference • Prior diagnoses • Differential Diagnosis
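
As a minimal sketch of Reverse Metadata Lookup, the cohort of images returned by a CBIR query can be reduced to a ranked list of the metadata terms attached to them. The record layout (`id`, `terms`), the function name `reverse_metadata_lookup` and the sample cohort are all hypothetical and chosen only for illustration; the slides do not specify a data model.

```python
from collections import Counter

def reverse_metadata_lookup(matched_images, top_n=2):
    """Rank metadata terms by how many of the returned images carry them."""
    counts = Counter()
    for image in matched_images:
        # Count each term once per image so a heavily tagged image cannot dominate.
        counts.update(set(image["terms"]))
    return counts.most_common(top_n)

# Hypothetical cohort returned by a CBIR query (illustrative data only).
cohort = [
    {"id": "slide-001", "terms": ["colon", "adenocarcinoma"]},
    {"id": "slide-002", "terms": ["colon", "adenocarcinoma", "tubular adenoma"]},
    {"id": "slide-003", "terms": ["colon", "hyperplastic polyp"]},
]

ranked = reverse_metadata_lookup(cohort)
print(ranked)
```

The most frequent terms would then play the roles listed above: the anatomic frame of reference, prior diagnoses, and candidate entries for a differential diagnosis.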

A Quick History of CBIR • 1970s: Corona Satellite Remote Sensing Initiative • Film-based • Resultant analog content, when digitized, represented gigabytes of data (consider the computational burden for 1972…) • Several numerical approaches devised to quickly crunch data • Many approaches based on conventional image analysis: one or more specific algorithms developed for each feature to be extracted / identified • Technically challenging • Time consuming • Computationally expensive • The term CBIR was first coined in 1992 by T. Kato to describe automatic retrieval of images from a database. • One promising approach also explored was Vector Quantization (VQ) • Many-log increase in computational throughput required for routine use

[Figure: computational throughput comparison; labels: 35 OPS, 478.2 TFLOPS, 1.026 PFLOPS]

CBIR Operational Modes • Query by Example • Find pictures that contain this snippet / ROI • Semantic Retrieval • Find pictures like adenocarcinoma • Like this adenocarcinoma • Multimodal Retrieval • Search for matches based on imagery data combined with other search metrics • High-throughput “omics” data, etc. • Patient clinical outcomes and therapeutic response data • Other imaging modalities

CBIR Techniques (conventional) • Color Operators • Texture operators • Shape • Spectral information • Frequency and phase domain information There are at least several thousand major classes of conventional image analysis operations, with most exhibiting the common trait of requiring some degree of application tuning for the intended use-case. Hence, this class of approaches should not be generally viewed as turnkey solutions.

CBIR Techniques (innovative) • “Genetic” Image Exploration • Originally designed to analyze multispectral satellite data • Semi-autonomous systems that employ a decision-tree to search a known repertoire of conventional image analysis algorithms for the most sensitive and specific combination of algorithms that fits the query predicate • is representative • (Los Alamos National Labs) • Autonomous operation comes at a price: the need for significant computational throughput in training mode (e.g. slow…)

Prior Work • Conventional Image analysis • Conventional Vector Quantization

Conventional Image Analysis • At present, confined to specific use-cases: • Quantitative IHC • FDA validation an ongoing challenge • Not reduced to practice as an integral tool of the “pathologist’s workstation”

Conventional Vector Quantization • Original image • Division of image into local domains • Extraction of local domain composite vectors • Vectorization of each local kernel: $V_K = \sum\{[L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}}\}$ • Individual assessment of each vector dimension • Example encoded vector: 38857448643

Conventional Vector Quantization • Each vector, $V_K = \sum\{[L \cdot x_0 y_0]_{\mathrm{Order}}, \ldots, [L \cdot x_n y_m]_{\mathrm{Order}}\}$ (example: 38857448643), is queried against the library (vocabulary) of established vectors • Previously identified vector: reuse the existing vocabulary entry • Novel vector: assignment of a unique serial number and inclusion into global vocabulary • Assembly of compressed dataset • Example established vocabulary: 8865433, 354554343, 776956468, 865438676, 66963658, 554323267, 446854456, 53887, 446854, 553246564, 55565435
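
The encoding loop described on the two slides above can be sketched as follows. This is an illustrative simplification rather than the author's implementation: it assumes a grayscale image stored as a list of rows, non-overlapping square blocks, and exact-match lookup (practical VQ systems usually match to the nearest established vector within a tolerance). The names `encode_image` and `decode_image` are hypothetical.

```python
def encode_image(image, block, vocabulary):
    """Encode a 2-D grayscale image (list of rows) as a grid of vocabulary serial numbers."""
    rows, cols = len(image), len(image[0])
    encoded = []
    for top in range(0, rows - block + 1, block):
        encoded_row = []
        for left in range(0, cols - block + 1, block):
            # Vectorize the local kernel: flatten the block in row-major order.
            vector = tuple(
                image[top + dy][left + dx]
                for dy in range(block)
                for dx in range(block)
            )
            # A novel vector gets the next serial number; a known vector reuses its own.
            serial = vocabulary.setdefault(vector, len(vocabulary))
            encoded_row.append(serial)
        encoded.append(encoded_row)
    return encoded

def decode_image(encoded, block, vocabulary):
    """Restore the image from serial numbers (lossless here because matching is exact)."""
    by_serial = {serial: vector for vector, serial in vocabulary.items()}
    image = []
    for encoded_row in encoded:
        for dy in range(block):
            row = []
            for serial in encoded_row:
                vector = by_serial[serial]
                row.extend(vector[dy * block:(dy + 1) * block])
            image.append(row)
    return image

vocabulary = {}
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [9, 9, 0, 0],
    [9, 9, 0, 0],
]
encoded = encode_image(image, block=2, vocabulary=vocabulary)
restored = decode_image(encoded, block=2, vocabulary=vocabulary)

# Shift the same pattern one pixel to the right. The blocks no longer line up with
# the sampling grid, so the vocabulary grows although the pattern itself is unchanged.
shifted = [row[-1:] + row[:-1] for row in image]
size_before = len(vocabulary)
encode_image(shifted, block=2, vocabulary=vocabulary)
size_after = len(vocabulary)
print(encoded, size_before, size_after)
```

In this toy case the encoded grid preserves the spatial layout of the image, and a one-pixel shift of the same pattern adds new vocabulary entries, which is the vector-growth behavior discussed later in the talk.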

VQ-Based Image Compression as the Original Predicate for Carrying Out Image-Based Search [Figure: raw data, compressed data (vocabulary serial numbers such as 38857448643) and restored data] The spatially-preserved organization of the encoded data represents a many-fold decrease in overall search dataset size, thus providing a significant computational opportunity for accelerated search. Additionally, the vectors identified as contributing to a match may be visually interrogated for confirmation of their predictive morphologic content.

The Challenge That Is Pathology CBIR • Start with some conservative initial assumptions, concerning a prototypic image repository, in terms of search potential: • Ability to search 10 years of data • 1000 slides/day → 200,000 slides/year • 500 MB of compressed whole slide data/slide • Operational goal of being able to: • Search in real time • Re-index the database every evening, such that searches carried out the next day are current

The Challenge That Is Pathology CBIR • Net storage required for ten years' worth of data: • 1 billion Megabytes • 10^6 Gigabytes • 10^3 Terabytes • 10^0 Petabytes → 1 Petabyte • Current conservative enterprise storage is $2000/Terabyte • The full Petabyte would cost $2M • A single Genetic-type search across all images, assuming 5 seconds of computation / slide, would be: • 200,000 slides x 10 x 5 seconds → 10 million seconds • This is 6 log too slow • 16.5 weeks or about 3 searches per year • (original Apple IIe: 78 years) • So we would need to save our queries for those “really important” image searches…. • Conventional VQ, which is ~100 times faster, is still not fast enough: 27.8 hours per feature search • Yet another 4 log of performance is required… • Two ways to address this: • 10,000 parallel processors or • better algorithms
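
The back-of-envelope figures on the two slides above can be checked by multiplying the stated assumptions through. The sketch below uses decimal units (1 TB = 10^6 MB) and takes the "~100 times faster" figure for conventional VQ at face value; the variable names are illustrative only.

```python
SLIDES_PER_YEAR = 200_000
YEARS = 10
MB_PER_SLIDE = 500
COST_PER_TB_USD = 2_000
SECONDS_PER_SLIDE = 5    # assumed cost of a Genetic-type search per slide
VQ_SPEEDUP = 100         # conventional VQ, approximate figure from the slide

total_slides = SLIDES_PER_YEAR * YEARS
total_mb = total_slides * MB_PER_SLIDE
total_tb = total_mb / 1_000_000            # decimal units: 10^6 MB per TB
total_pb = total_tb / 1_000
storage_cost_usd = total_tb * COST_PER_TB_USD

genetic_search_s = total_slides * SECONDS_PER_SLIDE
genetic_search_weeks = genetic_search_s / (7 * 24 * 3600)
vq_search_hours = genetic_search_s / VQ_SPEEDUP / 3600

print(f"storage: {total_pb:.0f} PB, cost: ${storage_cost_usd:,.0f}")
print(f"genetic search: {genetic_search_s:,} s = {genetic_search_weeks:.1f} weeks")
print(f"conventional VQ search: {vq_search_hours:.1f} hours")
```

Two million slides at 5 seconds each come to 10^7 seconds, so a six-log speedup is what brings a full-repository search down to roughly 10 seconds, and the 100x gain from conventional VQ still leaves four of those six logs to find.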

On Current Technology… • Modern computational throughput continues to increase, with this capability representing an opportunity for perhaps 1-2 log performance increase in the next decade • With a one-log increase, we are still left with a five-log gap that needs to be made up by improved algorithmic performance.

Recent Developments • A number of promising algorithms are being developed • Support Vector Machines (SVM) • Principal Component Analysis • High-dimensional reduction approaches • Spatially-invariant VQ (SiVQ)

VQ Revisited and SiVQ Q: What is conventional VQ’s greatest weakness? A: Too many required vectors to represent a single atomic morphologic feature • (promiscuity of vector set growth with continued training)

Conventional VQ Vector Growth during training

Candidate Feature A Matter of Degrees of Freedom… How many ways can this be sampled?

How Many Ways Can A Candidate Feature Be Matched During Training? Y Translational Freedom X Translational Freedom Rotational Freedom

In VQ, the same feature can be sampled in an enormous number of ways • Typical feature vector: • 25 x 25 pixels (x by y) or larger → 625 translational degrees of freedom • Effective radius of 12.5 pixels • After Nyquist rotational sampling (2x spatial frequency) • 2 x 12.5 x π ≈ 79 separate rotations • 3 color planes • 2 mirror symmetries • At least 20 possible semi-discrete length-scale Nyquist samples • All together, there are at least 625 x 79 x 3 x 2 x 20 = 5,925,000 possible ways to represent one possible vector (assuming twenty fixed magnifications in use) • This explains the non-asymptotic (unbounded) vector growth observed for some histology patterns. • Multispectral data (e.g. 28 vs. 3 bands) will further multiply the diagnostic power of SiVQ vectors (55,300,000 degrees of freedom / vector)
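
The count above can be reproduced with a few lines of arithmetic. This sketch assumes the rotation count is the circumference of the 12.5-pixel-radius kernel, 2πr, rounded up to whole rotations, which reproduces the slide's figure of 79; the variable names are illustrative.

```python
import math

kernel_side = 25                                # pixels, x by y
translations = kernel_side * kernel_side        # 625 translational offsets
radius = kernel_side / 2                        # 12.5 pixels
rotations = math.ceil(2 * math.pi * radius)     # circumference 78.54 rounds up to 79
color_planes = 3
mirror_symmetries = 2
length_scales = 20                              # assumed fixed magnifications

total = translations * rotations * color_planes * mirror_symmetries * length_scales
multispectral = total // color_planes * 28      # 28 spectral bands instead of 3
log_gain = math.log10(total)

print(f"{total:,} samplings per feature, about {log_gain:.1f} log")
print(f"{multispectral:,} with 28 spectral bands")
```

The product is a little under seven orders of magnitude, which is the headroom a single spatially invariant vector would have to absorb.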

Consequences of SiVQ • Use one spatially-invariant vector to do the work of 5,925,000 spatially-constrained vectors • 5,925,000x faster • 5,925,000x fewer vectors to store per feature archetype • 6 log+ increase in algorithmic performance (we only needed 4 log, so we have CPU to burn) • Implies an operational solution to the real-time requirement for large datasets • CBIR is essentially reduced to practice for a sizable contingent of texture-based whole slide image-retrieval use-cases • Emergent property: SiVQ works equally well on all structurally repetitive data sets (e.g. remote sensing, Google-like image searches of the Web)

Interactive Demonstration

Opportunities and Future Work • CBIR development will continue • Many groups already demonstrating feasibility of real-time query capability • Activity at Rutgers, U. of Pittsburgh and Cal Tech • For the UofM Group: • Rapid dissemination of the algorithm and libraries via peer-reviewed publications and/or e-pubs • Extension of the discovery tool suite to support multiple-vector classification, similar to the approaches taken for prior VQ systems, with rapid follow-on publications • “Ground-Truth Engine” for integrative multimodality studies • Activation of an open-architectures website that will provide a downloadable tool suite and a Web-Based, real-time decision support environment for submitted images, operating in two general use-cases: • Surface classification with rare event detection (anything not classified as normal) • Differential diagnosis generation with return of matching images and associated metadata • Generation of a classification library of extensive “normal SiVQ vectors” for each organ system • Actively pursue collaboration to form a core team to adjudicate needed normal and abnormal vector classes

Closing Remarks • CBIR is not vaporware or an elusive computational goal • Contemporary computation speed is, actually, quite adequate for many CBIR tasks • Much work remains to realize its full potential • SiVQ will likely be one of a plurality of compelling solutions in the Image Query / Decision-support armamentarium

Acknowledgements • Jerome Cheng, U. of Michigan • Anastasios Markas, Insilica Corporation • Mehmet Toner and Ronald Tompkins, Harvard Medical School • Mike Feldman, U. of Pennsylvania
