Clinical Decision Making in Diagnostics: Statistical Insights and Best Practices

…Borrelia diagnostics – statistical aspects Jørgen Hilden jh@biostat.ku.dk Notes have been added in this file February 2009

Biostatistical motto: Formalism with a human face

Plan of my talk
- Clinicometric framework
- Descriptors of diagnostic power
- Displays of diagnostic power, including the ROC diagram
- Simultaneous use of 2 measurements
- Randomized testing of diagn. procedures
- Special topics in supplementary slides

Topics not mentioned: systematic reviews & meta-analyses

”Clinicometrics” …always considers a stream of cases ( statisticians say: a population of cases ): They are the units of clinical experience and also of clinical decision making. They are instances of a (well-defined?) clinical problem, ”the who-how-where-why of a patient-doctor encounter.” Therefore…

Data collection*: In clinical studies the choice of sample, and of the variables on which one bases one's prediction, must match the clinical problem as it presents itself at the time of decision making. In particular, one mustn't discard subgroups (as ‘atypical’ or ‘impurities’) that did not become identifiable until later: ensure prospective recognizability! (*as opposed to the ’engineering’ phases)

Data collection. Purity vs. representativeness: A meticulously filtered case stream ('proven single-agent infections', or 'meeting CDC criteria') may be needed for patho- and pharmaco-physiological research, but is inappropriate as a basis for clinical decision policies [incl. cost studies].

Don’t forget… Your job is to create decision rules that help the clinician decide, e.g.
- whether to proceed with antibiotics
- when to plan clin. & serol. follow-up checks
- when to apply other tests for, e.g., HSV
► ideally drawing a complete management flowchart, i.e. a bushy tree of action diagnoses, not etiological diagnoses.

Data collection. Consecutivity as a safeguard against selection bias. Standardization: How? Who? Where? When? Gold standard … the big problem !! w. blinding, etc. Safeguards against change of data after the fact. w3.consort-statement.org/Initiatives/stardClinical_Chemistry_statement.pdf

Focussing on quantitative markers. A quantity holds the result of a diagnostic procedure. Histograms describe its distribution in two subpopulations. We can interpret ordinates and areas under the two humps in terms of true and false decisions … and get a feel for the trade-off involved, provided that the pre-test probability of disease (percentage diseased) is known.

… principle … [Figure: two histograms of the measurement, one for the Diseased and one for the Non-diseased (Healthy) subpopulation; each area = 1.00 = 100 % of the subpopulation. A cutoff point splits the measurement axis into a positive range and a negative range. Sensitivity (true positive fraction) is the Diseased area in the positive range; specificity (true negative fraction) is the Non-diseased area in the negative range; the remaining two areas are the false negatives and the false positives.] Note: BLACK&WHITE paradigm!

The ’probability square’. Pre-test ’case mix’: 30 % diseased, 70 % non-diseased. The diseased column is split by sensitivity (true positive fraction); the non-diseased column is split by specificity (= 0.92, say) and 1 – spec. = false positive fraction. True negatives: area = 0.70 × 0.92 = 0.644, i.e., 64.4 % of cases are true negatives; the other three areas are analogous. All the positives = true positives + false positives.
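The four areas of the probability square can be computed directly. The sketch below uses the slide's case mix (30 %) and specificity (0.92); the sensitivity of 0.40 is an assumption made here for illustration only (it is not stated on this slide), and the function name is hypothetical.

```python
# Probability square: joint fractions of disease status x test outcome.
prevalence = 0.30    # pre-test 'case mix' from the slide: 30 % diseased
specificity = 0.92   # from the slide ("0.92, say")
sensitivity = 0.40   # ASSUMPTION for illustration; not given on this slide


def probability_square(prevalence, sensitivity, specificity):
    """Return the four areas of the unit square as a dict."""
    return {
        "TP": prevalence * sensitivity,
        "FN": prevalence * (1 - sensitivity),
        "FP": (1 - prevalence) * (1 - specificity),
        "TN": (1 - prevalence) * specificity,
    }


areas = probability_square(prevalence, sensitivity, specificity)
for name, area in areas.items():
    print(f"{name}: {area:.3f}")
```

The four areas always sum to 1, and the TN area reproduces the 0.644 quoted on the slide.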

Classical terminology
”Positive” = suggestive of (target) disease
”Negative” = suggestive of its absence
”False / True Positive / Negative …”
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)
What is meant by PV (”predictive value”)? What is meant by LR (”likelihood ratio”)?

Classical terminology (continued)
PVpos = the ”predictive value” of a positive outcome = TP/(all positives) = Pr{ disease | pos } … the chance that the test is right when it says ”positive”

Classical terminology (continued)
PVneg = the ”predictive value” of a negative outcome = TN/(all negatives) = Pr{ non-disease | neg } … the chance that the test is right when its verdict is ”negative”
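As a small worked example of the two predictive values, the sketch below applies the definitions to a case stream with 30 % diseased and specificity 0.92 (figures used on an earlier slide); the sensitivity of 0.40 is an assumption for illustration, and the function name is hypothetical.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PVpos, PVneg) for a binary test in a given case mix."""
    tp = prevalence * sensitivity
    fn = prevalence * (1 - sensitivity)
    fp = (1 - prevalence) * (1 - specificity)
    tn = (1 - prevalence) * specificity
    pv_pos = tp / (tp + fp)   # Pr{ disease | pos }
    pv_neg = tn / (tn + fn)   # Pr{ non-disease | neg }
    return pv_pos, pv_neg


# prevalence and specificity as on an earlier slide; sensitivity is an ASSUMPTION
pv_pos, pv_neg = predictive_values(prevalence=0.30, sensitivity=0.40, specificity=0.92)
print(f"PVpos = {pv_pos:.3f}, PVneg = {pv_neg:.3f}")
```

Unlike sensitivity and specificity, both values change when the pre-test case mix changes.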

”Likelihood ratio” principle … Pre-test odds = 3 : 7 (30 % diseased, 70 % non-diseased). ”LR” = 5 : 1 (the ratio of the red arrows, which mark sensitivity and 1 – specificity); ergo post-test odds = 15 : 7.
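The odds arithmetic on this slide can be checked with exact fractions. This is only a sketch; the helper names are hypothetical.

```python
from fractions import Fraction


def post_test_odds(pre_test_odds, likelihood_ratio):
    """(post-test odds) = (pre-test odds) x (LR)."""
    return pre_test_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds a : b (stored as a/b) into the probability a/(a+b)."""
    return odds / (1 + odds)


pre_odds = Fraction(3, 7)   # 30 % diseased : 70 % non-diseased
lr = Fraction(5, 1)         # "LR" = 5 : 1 from the slide
post_odds = post_test_odds(pre_odds, lr)
post_prob = odds_to_probability(post_odds)
print(post_odds, float(post_prob))
```

The post-test odds of 15 : 7 correspond to a post-test probability of 15/22, roughly 68 %.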

Pre-test odds low in Lyme problems. Specificity is not bad, yet most positives are false positives. ”LRpos” = 5 : 1 is fair, but post-test odds and PVpos are still low. [Figure: probability square with a narrow diseased column and a wide non-diseased column; arrows mark sensitivity and 1 – specificity.]

Not quite so classical terminology
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)
LRpos = the ”likelihood ratio” occasioned by a positive outcome = (sensitivity) / (1 – specificity) = Pr{ pos | disease } / Pr{ pos | non-disease }

Not quite so classical terminology (continued)
LRneg = the ”likelihood ratio” occasioned by a negative outcome = (1 – sensitivity) / (specificity) = Pr{ neg | disease } / Pr{ neg | non-disease } = 0.1 = 1 : 10, for instance.
If the pre-test risk of Lyme Disease is low, say p = 2 %, a negative outcome almost eliminates it: (post-test odds) = (pre-test odds)(LR) = (1 : 49)(1 : 10) = (1 : 490).
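The low-prevalence arithmetic can be reproduced as follows. The 2 % pre-test risk and LRneg = 1 : 10 come from this slide, and LRpos = 5 : 1 from an earlier one; the helper names are hypothetical.

```python
from fractions import Fraction


def probability_to_odds(p):
    return p / (1 - p)


def odds_to_probability(odds):
    return odds / (1 + odds)


pre_test_risk = Fraction(2, 100)               # p = 2 %
pre_odds = probability_to_odds(pre_test_risk)  # 1 : 49

lr_neg = Fraction(1, 10)   # from this slide
lr_pos = Fraction(5, 1)    # the "fair" LRpos quoted on an earlier slide

odds_after_neg = pre_odds * lr_neg   # 1 : 490
odds_after_pos = pre_odds * lr_pos   # 5 : 49
pv_pos = odds_to_probability(odds_after_pos)

print("after a negative:", odds_after_neg)
print("after a positive:", odds_after_pos, "-> PVpos =", float(pv_pos))
```

With these inputs a positive result raises the risk only to 5/54, about 9 %, so most positives are still false positives.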

”LR” principle: it’s the factor by which the observed data will change the odds. [Figure: the two histograms (Diseased; Non-diseased/Healthy) with a cutoff point separating the positive and negative ranges. LRpos = ratio of the Diseased to the Non-diseased area in the positive range; LRneg* = the corresponding ratio of areas in the negative range.] *LRneg < 1 (!)

”LR” principle: it’s still the factor by which the observed data will change the odds, also when data = a measurement value*. The cutpoint is now irrelevant. [Figure: the two histograms (Diseased; Non-diseased/Healthy) with the measured value marked.] *LR = ratio of the two ordinates at the measured value.
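To make the measurement-value LR concrete, here is a minimal sketch that assumes both subpopulations are normally distributed. The normal model and all parameter values are hypothetical choices for illustration, not taken from the slides.

```python
import math


def normal_pdf(x, mean, sd):
    """Density (ordinate) of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def likelihood_ratio(x, mean_dis=2.0, sd_dis=1.0, mean_non=0.0, sd_non=1.0):
    """LR of a measurement value x: ratio of the two ordinates at x.

    Default parameters are HYPOTHETICAL.
    """
    return normal_pdf(x, mean_dis, sd_dis) / normal_pdf(x, mean_non, sd_non)


for x in (0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:.1f}: LR = {likelihood_ratio(x):.3f}")
```

No cutoff enters the calculation: each measured value carries its own LR, which multiplies the pre-test odds exactly as LRpos and LRneg do.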

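Warning… A 2-gate study: 50 diseased, 75 non-diseased. ”LRpos” = 5 : 1, but the ”predictive values” and the post-test odds are unavailable, because the 50 : 75 mix was fixed by design and says nothing about the pre-test odds. [Figure: probability square with arrows marking sensitivity and 1 – specificity.]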
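The warning can be illustrated numerically. In the sketch below the group sizes (50 and 75) are from the slide, while the cell counts are hypothetical, chosen only so that LRpos comes out as 5 : 1; the function name is likewise hypothetical.

```python
from fractions import Fraction

# HYPOTHETICAL cell counts for a 2-gate study with 50 diseased, 75 non-diseased
tp, fn = 20, 30   # diseased gate
fp, tn = 6, 69    # non-diseased gate

sensitivity = Fraction(tp, tp + fn)
false_positive_fraction = Fraction(fp, fp + tn)
lr_pos = sensitivity / false_positive_fraction   # 5 : 1, estimable from the study

# Naive "predictive value" read off the pooled table: an artefact of the 50 : 75 design
naive_pv_pos = Fraction(tp, tp + fp)


def pv_pos_at(prevalence, lr_pos):
    """PVpos for a given pre-test prevalence, via the odds form of Bayes' rule."""
    odds = prevalence / (1 - prevalence) * lr_pos
    return odds / (1 + odds)


print("LRpos:", lr_pos)
print("naive PVpos:", float(naive_pv_pos))
for prevalence in (Fraction(2, 100), Fraction(30, 100)):
    print("prevalence", float(prevalence), "-> PVpos", float(pv_pos_at(prevalence, lr_pos)))
```

The naive figure (about 77 %) matches neither scenario; only the LR carries over to a real case stream, where it must be combined with pre-test odds obtained elsewhere.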

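A 2-dim. task. [Figure: scatter plot of IgM against IgG; symbols distinguish confirmed infections from confirmed non-infected cases.]

A 2-dim. task. [Figure: IgM against IgG, with question marks and iso-likelihood-ratio lines (uphill arrow).]

A 2-dim. task. Nearest-neighbours classification of a new patient (?). [Figure: IgM against IgG.]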

A 2-dim. task. Kernel methods form a weighted average of neighbouring prototypes (diagnosed cases) etc., with influence decreasing the farther away they are. [Figure: IgM against IgG.]
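A minimal sketch of the kernel idea follows, using a Gaussian kernel in the IgM/IgG plane. The prototype values, the bandwidth and the function name are all hypothetical; a real analysis would need properly scaled axes and a data-driven bandwidth.

```python
import math

# HYPOTHETICAL prototypes: (IgM, IgG, infected?) in arbitrary units
prototypes = [
    (0.5, 0.4, 0), (0.8, 0.6, 0), (1.0, 0.3, 0), (0.6, 1.0, 0),
    (2.0, 1.8, 1), (2.4, 1.5, 1), (1.8, 2.2, 1), (2.6, 2.4, 1),
]


def kernel_estimate(igm, igg, prototypes, bandwidth=0.5):
    """Gaussian-kernel weighted fraction of infected prototypes near (igm, igg).

    Each prototype's weight decreases with its distance from the new patient.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p_igm, p_igg, infected in prototypes:
        dist2 = (igm - p_igm) ** 2 + (igg - p_igg) ** 2
        weight = math.exp(-dist2 / (2 * bandwidth ** 2))
        weighted_sum += weight * infected
        weight_total += weight
    return weighted_sum / weight_total


print(kernel_estimate(2.2, 2.0, prototypes))   # close to the infected cluster
print(kernel_estimate(0.7, 0.5, prototypes))   # close to the non-infected cluster
print(kernel_estimate(1.4, 1.2, prototypes))   # in between
```

The result is a weighted fraction within the prototype sample; it can be read as a post-test probability only if the prototypes reflect the clinical case mix.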

[Figure: IgM against IgG, with iso-density lines.]

[Figure: IgM against IgG, with iso-likelihood-ratio lines (uphill arrows).]

Simulated data (100+100): infection vs. non-infected. [Figure: IgM against IgG.]

A ROC diagram shows the true positive fraction against the false positive fraction as a function of the choice of cutoff point. [Figure: ROC square running from ”Everyone negative” (strict cutoff) to ”Everyone treated as positive” (liberal cutoff), with a hypothetical smooth trajectory and two raw empirical ones; sample sizes 17+17 and 40+40.]
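A raw empirical ROC trajectory can be traced by sweeping the cutoff over the observed values. The sketch below assumes that higher values suggest disease and uses hypothetical measurements; the function name is also hypothetical.

```python
def empirical_roc(diseased, non_diseased):
    """Return (false positive fraction, true positive fraction) points.

    A case is called positive when its value is >= the cutoff.
    ASSUMPTION: higher values suggest disease.
    """
    cutoffs = sorted(set(diseased) | set(non_diseased), reverse=True)
    points = [(0.0, 0.0)]   # cutoff above all values: everyone negative
    for cutoff in cutoffs:   # from strict to liberal
        tpf = sum(x >= cutoff for x in diseased) / len(diseased)
        fpf = sum(x >= cutoff for x in non_diseased) / len(non_diseased)
        points.append((fpf, tpf))
    return points            # last point: everyone treated as positive


# HYPOTHETICAL measurements
diseased = [3.1, 2.4, 4.0, 1.9, 3.6]
non_diseased = [1.2, 2.0, 0.8, 2.6, 1.5]

roc_points = empirical_roc(diseased, non_diseased)
for fpf, tpf in roc_points:
    print(f"FPF = {fpf:.1f}, TPF = {tpf:.1f}")
```

With small samples the trajectory is a coarse staircase, which is why the 17+17 curve on the slide looks rougher than the 40+40 one.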

The ROC diagram describes the nosographic properties: sens., spec.; LRpos, LRneg = slopes of segments. Y = Youden’s Index = sens + spec – 1 is equivalent to AUC [Area Under Curve] = ½(sens + spec) in this case, since Y = 2·AUC – 1. We are within the Black&White Paradigm. [Figure: ROC square with a two-segment trajectory (”pos” and ”neg” segments); the axes are partitioned into TP/FN and FP/TN; the levels Y = 0 and Y = 1 are marked.]

The ROC diagram describes the nosographics*. The slope of each outcome line is its LR; e.g. LRpos = (TP fraction of diseased)/(FP fraction of non-dis.). [Figure: ROC square with ”pos” and ”neg” segments; axes partitioned into TP/FN and FP/TN; levels Y = 0 and Y = 1.] *i.e., the information obtainable from a 2-gate study.
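For a binary test the ROC trajectory has just two segments, so slopes, area and Youden's index can be checked in a few lines. The sensitivity and specificity below are illustrative assumptions, and the function name is hypothetical.

```python
def binary_roc_summary(sensitivity, specificity):
    """Slopes, trapezoidal AUC and Youden's index for a two-segment ROC."""
    fpf, tpf = 1 - specificity, sensitivity
    lr_pos = tpf / fpf                # slope of the 'pos' segment from (0, 0)
    lr_neg = (1 - tpf) / (1 - fpf)    # slope of the 'neg' segment up to (1, 1)
    # Area under the polyline (0,0) -> (fpf, tpf) -> (1,1): triangle + trapezoid
    auc = fpf * tpf / 2 + (1 - fpf) * (tpf + 1) / 2
    youden = sensitivity + specificity - 1
    return lr_pos, lr_neg, auc, youden


# ASSUMED illustrative values
lr_pos, lr_neg, auc, youden = binary_roc_summary(sensitivity=0.40, specificity=0.92)
print(f"LRpos = {lr_pos:.2f}, LRneg = {lr_neg:.3f}, AUC = {auc:.2f}, Y = {youden:.2f}")
```

The trapezoidal area equals ½(sens + spec), and Y = 2·AUC – 1, which is the sense in which the two are equivalent for a binary test.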

Three test outcomes: Reassuring; Almost no evidence either way; Ominous. [Figure: three-segment ROC trajectory, with the ideal test for comparison.]

Three test outcomes, ordered (ordinal): Negative (Neg.?); +/-; Positive (Pos.?). Ordered how? By increasing slope, i.e. LR [ concavity ! ]. [Figure: three-segment ROC trajectory, with the ideal test for comparison.]

The slope reflects the medical trade-off between % sensitivity and % specificity. Three ordered (ordinal) test outcomes: Negative; +/- (Neg.? Possibly positive); Definitely positive. Ordered how? By increasing slope, i.e. LR [ concavity ! ]. [Figure: three-segment ROC trajectory with a ’constant-benefit’ line.] Those with a ”+/–” test result are best treated as negative in this situation. Trade-off? Constant benefit? … Please take a look at the supplementary figures.

Interpretation of the area under the ROC as a rank statistic ( cf. Wilcoxon-Mann-Whitney ). E.g., 5 cases of disease D and 10 non-D cases: The ROC square holds 50 small rectangles, 40 of which happen to be below the ROC trajectory, because 40 times (out of 50) it so happens that a non-D finding > a D-group finding [the desired ordering]. For an example, see patient * vs. patient **. Area Under ROC Curve = freq{ (non-D value) > (D value) } = 0.80.
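The pair-counting interpretation can be sketched as follows. The 5 + 10 values are hypothetical, constructed so that 40 of the 50 pairs are in the desired order (non-D value > D value, following the slide's orientation); ties, if any, are counted as half a pair, which is the usual convention.

```python
def auc_by_pair_counting(d_values, non_d_values):
    """AUC as the fraction of (D, non-D) pairs in the desired order.

    Desired order here, as on the slide: non-D value > D value.
    Tied pairs count one half.
    """
    concordant = 0.0
    for d in d_values:
        for n in non_d_values:
            if n > d:
                concordant += 1.0
            elif n == d:
                concordant += 0.5
    return concordant, concordant / (len(d_values) * len(non_d_values))


# HYPOTHETICAL data: 5 cases of disease D and 10 non-D cases
d_values = [1, 2, 3, 9, 10]
non_d_values = [4, 5, 6, 7, 8, 11, 12, 13, 14, 15]

pairs_in_order, auc = auc_by_pair_counting(d_values, non_d_values)
print(pairs_in_order, auc)
```

Each of the 50 pairs corresponds to one small rectangle in the ROC square; the ordered pairs are the rectangles below the trajectory.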

But where does that lead us? The AUC has no definable interpretation in terms of blood, sweat and tears (loss, benefit, utility). It only has a soft association with decision-analytic measures of diagnostic power (separation, discrimination). Its frequent use is purely a matter of being the popular girl in the class.

What!? The primary virtues of the ROC: it allows you (1) to compare tests regardless of scale, units, & transformations (2) to see oddities [ which may point to a technical problem, or call for a revised test interpretation rule ]

Lesions in floating locations. Suspect area? [Figure: red = as the imagist saw it; green = surgical truth.] How do we score diagnostic performance in such situations ???

Digression… Randomized trials of diagn. tests … theory under development. Purpose & design: many variants. Sub(-set-)randomization, depending on the pt.’s data so far collected. ”Non-disclosure”: some data are kept under seal until analysis. No parallel in therapeutic trials! Main purposes …

… Randomized trials of diagn. tests:
1) when the diagnostic intervention is itself potentially therapeutic;
2) when the new test is likely to redefine the disease(s) ( cutting the cake in a completely new way );
3) when there is no obvious rule of translation from the outcomes of the new test to existing treatment guidelines;
4) when clinician behaviour is part of the research question…
…end of digression

Statistical analysis … in the narrow sense: … is very much standard once you know what aspects to count and compare. To know that, work backwards from (likely) consequences: what would have happened to these patients? And what would have happened in the alternative scenario? Never argue ”It’s customary to calculate … (this or that)” !

Thank you ! Let me add a personal maxim: Never ask ”What can the journal impact factors do for me?” Ask instead ”What can I do for the journal impact factors?”

Supplementary pictures follow here … Vassily Vlassov pixit

The rôle of noise. Pure noise, independent of the patient’s true condition, flattens distributions and hence flattens the ROC; the result is less information. Remedies: technical & procedural standardization, duplicate measurements (averaging over assessors, dominance-free consensus formation) … … may be ineffective if the noise is ”inter-patient”.
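The flattening effect can be quantified under a simple model. The sketch below assumes a binormal test with equal variances plus independent, normally distributed measurement noise that is redrawn at each repeat measurement; all parameter values and the function names are hypothetical. Under this model AUC = Φ(Δ / √(2·(σ² + τ²/n))), where Δ is the separation of the means, σ the biological SD, τ the noise SD and n the number of averaged replicates.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_auc(mean_diff, sd_signal, sd_noise=0.0, replicates=1):
    """AUC of an equal-variance binormal test with added measurement noise.

    ASSUMPTION: noise is independent between replicates, so averaging
    n replicates divides the noise variance by n.
    """
    total_var = sd_signal ** 2 + sd_noise ** 2 / replicates
    return normal_cdf(mean_diff / math.sqrt(2.0 * total_var))


# HYPOTHETICAL parameters
auc_clean = binormal_auc(mean_diff=2.0, sd_signal=1.0)
auc_noisy = binormal_auc(mean_diff=2.0, sd_signal=1.0, sd_noise=1.5)
auc_averaged = binormal_auc(mean_diff=2.0, sd_signal=1.0, sd_noise=1.5, replicates=4)

print(f"no noise: {auc_clean:.3f}")
print(f"single noisy measurement: {auc_noisy:.3f}")
print(f"mean of 4 replicates: {auc_averaged:.3f}")
```

Averaging helps only with the part of the noise that varies between repeat measurements; a patient-specific ("inter-patient") disturbance would sit in σ, not τ, and replicates would leave it untouched.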

Three ordered (ordinal) test outcomes: Negative; +/- (Neg.? Presumably positive); Definitely positive. Ordered how? By increasing slope, i.e. LR [ concavity ! ]. [Figure: three-segment ROC trajectory with an ”iso-benefit” line.] Its slope reflects the medical trade-off between % sensitivity and % specificity. Slope? Constant benefit? … Let’s first look at a continuous test & selection of the cutoff that maximizes benefit.

A continuous test: cutoff at measurement x = c maximizes benefit. [Figure: ROC curve with a line touching it at x = c; the slope is chosen so as to imply constant benefit.] How do we find that critical slope? It depends on the pre-test ’disease mix’ and on the (human) loss associated with wrong or suboptimal treatment, when only two courses of action are available (otherwise there will be more lines, reflecting several trade-offs).
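One common decision-analytic formulation (not necessarily the one behind the slide) takes the critical slope as ((1 – p)/p) × (loss per false positive)/(loss per false negative) and places the cutoff where the measurement's LR equals that slope. The sketch below assumes an equal-variance normal model; all parameter values and function names are hypothetical.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def critical_slope(prevalence, loss_fp, loss_fn):
    """Slope of a constant-benefit line in the ROC square (two actions only)."""
    return ((1 - prevalence) / prevalence) * (loss_fp / loss_fn)


def optimal_cutoff(mean_non, mean_dis, sd, slope):
    """Cutoff c where LR(c) = slope, for an ASSUMED equal-variance normal model."""
    midpoint = (mean_non + mean_dis) / 2
    return midpoint + sd ** 2 * math.log(slope) / (mean_dis - mean_non)


def expected_loss(cutoff, prevalence, loss_fp, loss_fn, mean_non, mean_dis, sd):
    """Expected loss per case when values >= cutoff are treated."""
    fn_fraction = normal_cdf((cutoff - mean_dis) / sd)       # diseased, untreated
    fp_fraction = 1 - normal_cdf((cutoff - mean_non) / sd)   # non-diseased, treated
    return prevalence * loss_fn * fn_fraction + (1 - prevalence) * loss_fp * fp_fraction


# HYPOTHETICAL inputs
p, loss_fp, loss_fn = 0.30, 1.0, 4.0
mean_non, mean_dis, sd = 0.0, 2.0, 1.0

slope = critical_slope(p, loss_fp, loss_fn)
c = optimal_cutoff(mean_non, mean_dis, sd, slope)
loss_at_c = expected_loss(c, p, loss_fp, loss_fn, mean_non, mean_dis, sd)
loss_treat_none = p * loss_fn
loss_treat_all = (1 - p) * loss_fp

print(f"critical slope = {slope:.3f}, cutoff c = {c:.3f}")
print(f"loss: treat no-one {loss_treat_none:.2f}, treat everybody {loss_treat_all:.2f}, test at c {loss_at_c:.3f}")
```

Moving the cutoff away from c in either direction increases the expected loss, which is what "maximizes benefit" means in this formulation.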

A continuous test: cutoff at measurement x = c maximizes benefit. [Figure: ROC curve from ”Treat no-one” to ”Treat everybody”, with the constant-benefit line touching it at x = c.]

A continuous test: cutoff at measurement x = c maximizes benefit. [Figure: as before, with the ”No misdiagnoses” corner marked.] Without the test, it’s (slightly) better to treat everybody than to treat no-one. With the test available, about 60 % of the ’misdiagnostic burden’ is eliminated; cf. purple bar.
