Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy

Coursebook Chapter 8 – Multiple Tests and Multivariable Decision Rules
Coursebook Chapter 5 – Studies of Diagnostic Test Accuracy
Michael A. Kohn, MD, MPP
10/27/2005

Outline of Topics
• Combining results of multiple tests: importance of test non-independence
• Recursive Partitioning
• Logistic Regression
• Published “rules” for combining test results: importance of validation separate from derivation
• Biases in studies of diagnostic test accuracy
  • Overfitting bias
  • Incorporation bias
  • Referral bias
  • Double gold standard bias
  • Spectrum bias

Warning: Different Example
Example of combining two tests in this talk: Exercise ECG and Nuclide Scan as dichotomous tests for CAD (assumed to be a dichotomous D+/D- disease)*
Example of combining two tests in Coursebook: Premature birth (GA < 36 weeks) and low birth weight (BW < 2500 grams) as dichotomous tests for neonatal morbidity
*Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. 2nd ed. Boston: Little Brown; 1991.

One Dichotomous Test
Exercise ECG    CAD+    CAD-    LR
Positive         299      44    6.80
Negative         201     456    0.44
Total            500     500
Do you see that the LR of 6.80 is (299/500)/(44/500)?
Review of Chapter 3: What are the sensitivity, specificity, PPV, and NPV of this test? (Be careful.)
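A minimal sketch of the arithmetic behind the Exercise ECG table, in Python. The function name two_by_two is made up for illustration. PPV and NPV are deliberately not computed: the table has exactly 500 CAD+ and 500 CAD- patients, which suggests the two groups were sampled separately, so the 50% "prevalence" in the table may not match any clinical population (this reading of "Be careful" is an assumption, not something the slide states).

```python
def two_by_two(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from a 2x2 table."""
    sens = tp / (tp + fn)        # P(test+ | D+)
    spec = tn / (fp + tn)        # P(test- | D-)
    lr_pos = sens / (1 - spec)   # P(test+ | D+) / P(test+ | D-)
    lr_neg = (1 - sens) / spec   # P(test- | D+) / P(test- | D-)
    return {"sens": sens, "spec": spec, "lr_pos": lr_pos, "lr_neg": lr_neg}


# Counts from the Exercise ECG table on this slide.
eecg = two_by_two(tp=299, fp=44, fn=201, tn=456)
print({name: round(value, 2) for name, value in eecg.items()})
```

The two likelihood ratios come out at the values shown in the LR column, since each is just a ratio of column proportions.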

Clinical Scenario – One Test
Pre-Test Probability of CAD = 33%
EECG Positive
Pre-test prob: 0.33
Pre-test odds: 0.33/0.67 = 0.5
LR(+) = 6.80
Post-Test Odds = Pre-Test Odds x LR(+) = 0.5 x 6.80 = 3.40
Post-Test prob = 3.40/(3.40 + 1) = 0.77
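The probability-to-odds-and-back steps on this slide can be sketched as below. The function names are hypothetical. Note that the slide rounds the pre-test odds 0.33/0.67 = 0.49 up to 0.5; without that rounding the post-test odds are about 3.35 rather than 3.40, but the post-test probability still rounds to 0.77.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)


def post_test_prob(pre_test_prob, lr):
    """Post-test probability = odds_to_prob(pre-test odds x LR)."""
    return odds_to_prob(prob_to_odds(pre_test_prob) * lr)


# Pre-test probability 33%, positive EECG with LR(+) = 6.80.
print(round(post_test_prob(0.33, 6.80), 2))
```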

Clinical Scenario – One Test
Using Probabilities
Pre-Test Probability of CAD = 33%
EECG Positive
Post-Test Probability of CAD = 77%
Using Odds
Pre-Test Odds of CAD = 0.50
EECG Positive (LR = 6.80)
Post-Test Odds of CAD = 3.40

Clinical Scenario – One Test
Pre-Test Probability of CAD = 33%
EECG Positive
[Figure: log(odds) scale running from -2 (odds 1:100, prob 0.01) to 1 (odds 10:1, prob 0.91). An arrow for EECG + (LR = 6.80) moves the patient from odds = 0.50 (prob = 0.33) to odds = 3.40 (prob = 0.77).]

Second Dichotomous Test
Nuclide Scan    CAD+    CAD-    LR
Positive         416     190    2.19
Negative          84     310    0.27
Total            500     500
Do you see that the LR of 2.19 is (416/500)/(190/500)?

Clinical Scenario – Two Tests
Using Probabilities
Pre-Test Probability of CAD = 33%
EECG Positive
Post-EECG Probability of CAD = 77%
Nuclide Scan Positive
Post-Nuclide Probability of CAD = ?

Clinical Scenario – Two Tests
Using Odds
Pre-Test Odds of CAD = 0.50
EECG Positive (LR = 6.80)
Post-Test Odds of CAD = 3.40
Nuclide Scan Positive (LR = 2.19?)
Post-Test Odds of CAD = 3.40 x 2.19? = 7.44? (P = 7.44/(1+7.44) = 88%?)

Clinical Scenario – Two Tests
Pre-Test Probability of CAD = 33%
EECG Positive
[Figure: log(odds) scale running from -2 (odds 1:100, prob 0.01) to 1 (odds 10:1, prob 0.91). Separate arrows for E-ECG + (LR = 6.80) and Nuclide + (LR = 2.19), then the two arrows laid end to end for E-ECG + and Nuclide +, with the question "Can we do this?" Laid end to end, they move the patient from odds = 0.50 (prob = 0.33) through odds = 3.40 (prob = 0.77) to odds = 7.44 (prob = 0.88).]

Question
Can we use the post-test odds after a positive Exercise ECG as the pre-test odds for the positive nuclide scan?
i.e., can we combine the positive results by multiplying their LRs?
LR(E-ECG +, Nuclide +) = LR(E-ECG +) x LR(Nuclide +) ?
= 6.80 x 2.19 ?
= 14.88 ?

Answer = No
Not 14.88

Non-Independence
A positive nuclide scan does not tell you as much if the patient has already had a positive exercise ECG.

Clinical Scenario
Using Odds
Pre-Test Odds of CAD = 0.50
EECG +/Nuclide Scan + (LR = 10.62)
Post-Test Odds of CAD = 0.50 x 10.62 = 5.31 (P = 5.31/(1+5.31) = 84%, not 88%)
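A short sketch comparing the naive product of LRs with the LR measured for the combination, using only numbers from the slides. The variable names are hypothetical. The last line uses the chain rule for conditional probabilities, LR(E-ECG +, Nuclide +) = LR(E-ECG +) x LR(Nuclide + given E-ECG +), to show how much a positive nuclide scan is worth once the exercise ECG is already positive.

```python
def odds_to_prob(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)


pre_test_odds = 0.50
lr_eecg_pos = 6.80       # positive exercise ECG alone
lr_nuclide_pos = 2.19    # positive nuclide scan alone
lr_both_pos = 10.62      # measured for the combination (from the slide)

# Naive approach: treat the tests as independent and multiply the LRs.
naive_prob = odds_to_prob(pre_test_odds * lr_eecg_pos * lr_nuclide_pos)

# Using the LR actually measured for the pair of positive results.
actual_prob = odds_to_prob(pre_test_odds * lr_both_pos)

# Implied LR of a positive nuclide scan in a patient whose EECG is already positive.
lr_nuclide_given_eecg = lr_both_pos / lr_eecg_pos

print(round(naive_prob, 2), round(actual_prob, 2), round(lr_nuclide_given_eecg, 2))
```

The implied conditional LR is about 1.56, noticeably weaker than the 2.19 the nuclide scan carries on its own.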

Non-Independence
[Figure: log(odds) scale running from -2 (odds 1:100, prob 0.01) to 1 (odds 10:1, prob 0.91). Arrows for E-ECG + and Nuclide + laid end to end show where the pair of positive results would land if the tests were independent. Since the tests are dependent, the single arrow for E-ECG + and Nuclide + together is shorter and ends at prob = 0.84.]

Non-Independence
Instead of the nuclide scan, what if the second test were just a repeat exercise ECG?
A second positive E-ECG would do little to increase your certainty of CAD. If it was false positive the first time around, it is likely to be false positive the second time.

Reasons for Non-Independence
Tests measure the same aspect of disease.
In this example, the gold standard is anatomic narrowing of the arteries, but both EECG and nuclide scan measure functional narrowing. In a patient without anatomic narrowing (a D- patient), coronary artery spasm could cause false positives on both tests.

Reasons for Non-Independence
Spectrum of disease severity.
In this example, CAD is defined as ≥70% stenosis on angiogram. A D+ patient with 71% stenosis is much more likely to have a false negative on both the EECG and the nuclide scan than a D+ patient with 99% stenosis.

Reasons for Non-Independence
Spectrum of non-disease severity.
In this example, CAD is defined as ≥70% stenosis on angiogram. A D- patient with 69% stenosis is much more likely to have a false positive on both the EECG and the nuclide scan than a D- patient with 33% stenosis.

Counterexamples: Possibly Independent Tests
For Venous Thromboembolism:
• CT Angiogram of Lungs and Doppler Ultrasound of Leg Veins
• Alveolar Dead Space and D-Dimer
• MRA of Lungs and MRV of Leg Veins

Unless tests are independent, we can’t combine results by multiplying LRs

Ways to Combine Multiple Tests
On a group of patients (derivation set), perform the multiple tests and determine true disease status (apply the gold standard)
• Measure LR for each possible combination of results
• Recursive Partitioning
• Logistic Regression

Determine LR for Each Result Combination
*Assumes pre-test prob = 33%

Determine LR for Each Result Combination
2 dichotomous tests: 4 combinations
3 dichotomous tests: 8 combinations
4 dichotomous tests: 16 combinations
Etc.
2 3-level tests: 9 combinations
3 3-level tests: 27 combinations
Etc.
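The counts on this slide are just the product of the number of possible results per test. A small sketch (the function name count_combinations is hypothetical) that also enumerates the four patterns for two dichotomous tests:

```python
from itertools import product
from math import prod


def count_combinations(levels_per_test):
    """Number of distinct result patterns.

    levels_per_test[i] is the number of possible results of test i
    (2 for a dichotomous test, 3 for a 3-level test, and so on).
    """
    return prod(levels_per_test)


# Enumerate every result pattern for two dichotomous tests.
two_dichotomous = list(product(["+", "-"], repeat=2))
print(two_dichotomous)

print(count_combinations([2, 2, 2, 2]))  # four dichotomous tests
print(count_combinations([3, 3, 3]))     # three 3-level tests
```

Each added test multiplies the number of cells that need their own LR estimate, which is why the derivation set runs out of patients per cell quickly.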

Determine LR for Each Result Combination
How do you handle continuous tests?
Not practical for most groups of tests.

Recursive Partitioning

Recursive Partitioning
• Same as Classification and Regression Trees (CART)
• Don’t have to work out probabilities (or LRs) for all possible combinations of tests, because of “tree pruning”

Tree Pruning: Goldman Rule*
8 “Tests” for Acute MI in ER Chest Pain Patient:
• ST Elevation on ECG
• CP < 48 hours
• ST-T changes on ECG
• Hx of ACI
• Radiation of Pain to Neck/LUE
• Longest pain > 1 hour
• Age > 40 years
• CP not reproduced by palpation
*Goldman L, Cook EF, Brand DA, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318(13):797-803.

8 tests → 2^8 = 256 Combinations

Recursive Partitioning
• Does not deal well with continuous test results

Logistic Regression
$$\ln(\mathrm{Odds}(D+)) = a + b_{\text{E-ECG}}\,(\text{E-ECG}) + b_{\text{Nuclide}}\,(\text{Nuclide}) + b_{\text{interact}}\,(\text{E-ECG})(\text{Nuclide})$$
“+” = 1
“-” = 0
More on this later in ATCR!
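A sketch of how the model on this slide turns test results into a probability. The coefficients below are NOT fitted values from any dataset. They are derived under the assumption that the two tests are independent, in which case the intercept is ln(pre-test odds x LR(-) for E-ECG x LR(-) for Nuclide) and each slope is ln(LR(+)/LR(-)) for that test, with no interaction term. The second calculation then picks an interaction coefficient only so that the (+, +) cell matches the combined LR of 10.62 from the earlier slide; a real fit would estimate all four coefficients from the derivation data. All names are hypothetical.

```python
import math


def logistic_prob(a, b_eecg, b_nuclide, b_interact, eecg, nuclide):
    """P(D+) from the model; eecg and nuclide are 1 for "+" and 0 for "-"."""
    log_odds = a + b_eecg * eecg + b_nuclide * nuclide + b_interact * eecg * nuclide
    return 1 / (1 + math.exp(-log_odds))


pre_test_odds = 0.50

# Coefficients implied by ASSUMING independence (illustration only).
a = math.log(pre_test_odds * 0.44 * 0.27)  # both tests negative
b_eecg = math.log(6.80 / 0.44)             # ln(LR+ / LR-) for E-ECG
b_nuclide = math.log(2.19 / 0.27)          # ln(LR+ / LR-) for Nuclide

# No interaction term: same answer as multiplying the two LRs.
p_independent = logistic_prob(a, b_eecg, b_nuclide, 0.0, 1, 1)

# Interaction chosen so the (+, +) cell matches the measured LR of 10.62.
b_interact = math.log(10.62 / (6.80 * 2.19))
p_dependent = logistic_prob(a, b_eecg, b_nuclide, b_interact, 1, 1)

print(round(p_independent, 2), round(p_dependent, 2), round(b_interact, 2))
```

The negative interaction coefficient is how the model expresses that two positive results together say less than the product of their separate LRs would suggest.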

Logistic Regression Approach to the “R/O ACI patient”
*Selker HP, Griffith JL, D'Agostino RB. A tool for judging coronary care unit admission appropriateness, valid for both real-time and retrospective use. A time-insensitive predictive instrument (TIPI) for acute cardiac ischemia: a multicenter study. Med Care. Jul 1991;29(7):610-627.
For corrected coefficients, see http://medg.lcs.mit.edu/cardiac/cpain.htm

Clinical Scenario*
71 y/o man with 2.5 hours of CP, substernal, non-radiating, described as “bloating.” Cannot say if same as prior MI or worse than prior angina.
Hx of CAD, s/p CABG 10 yrs prior, stenting 3 years and 1 year ago. DM on Avandia.
ECG: RBBB, Qs inferiorly. No ischemic ST-T changes.
*Real patient seen by MAK 1 am 10/12/04

What Happened to Pre-test Probability?
Typically clinical decision rules report probabilities rather than likelihood ratios for combinations of results.
Can “back out” LRs if we know prevalence, p[D+], in the study dataset.
With logistic regression models, this “backing out” is known as a “prevalence offset.” (See Chapter 8A.)
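"Backing out" an LR is the odds form of Bayes' rule run in reverse: LR = post-test odds / pre-test odds, where the pre-test odds come from the prevalence in the study dataset. A minimal sketch with hypothetical function names; the 10% pre-test probability in the second call is an invented example, not a value from the talk.

```python
def back_out_lr(reported_prob, study_prevalence):
    """LR implied by a rule's reported probability and the study prevalence."""
    post_odds = reported_prob / (1 - reported_prob)
    pre_odds = study_prevalence / (1 - study_prevalence)
    return post_odds / pre_odds


def apply_lr(lr, pre_test_prob):
    """Re-apply a backed-out LR to a different pre-test probability."""
    post_odds = lr * pre_test_prob / (1 - pre_test_prob)
    return post_odds / (1 + post_odds)


# Two positive tests gave odds of 5.31 at pre-test odds of 0.50
# (prevalence 1/3) in the earlier CAD example.
lr = back_out_lr(5.31 / 6.31, 1 / 3)
print(round(lr, 2))

# Hypothetical patient with a 10% pre-test probability.
print(round(apply_lr(lr, 0.10), 2))
```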

Optimal Cutoff for a Single Continuous Test
Depends on
• Pre-test Probability of Disease
• ROC Curve (Likelihood Ratios)
• Relative Misclassification Costs
Cannot choose an optimal cutoff with just the ROC curve.
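One way to see why the ROC curve alone is not enough: pick the cutoff that minimizes expected misclassification cost, and watch the choice move when the pre-test probability or the cost ratio changes. The three ROC points, the cutoff labels, and the costs below are invented for illustration; they do not come from the talk or from any study.

```python
# Hypothetical ROC points: cutoff label -> (sensitivity, specificity).
ROC_POINTS = {
    "low": (0.95, 0.60),
    "mid": (0.85, 0.80),
    "high": (0.60, 0.95),
}


def expected_cost(pre_test_prob, sens, spec, cost_fn, cost_fp):
    """Expected misclassification cost per patient at one cutoff."""
    false_neg = pre_test_prob * (1 - sens)
    false_pos = (1 - pre_test_prob) * (1 - spec)
    return false_neg * cost_fn + false_pos * cost_fp


def best_cutoff(pre_test_prob, cost_fn, cost_fp, roc_points=ROC_POINTS):
    """Cutoff label with the lowest expected cost."""
    return min(
        roc_points,
        key=lambda name: expected_cost(
            pre_test_prob, *roc_points[name], cost_fn, cost_fp
        ),
    )


print(best_cutoff(0.50, cost_fn=1, cost_fp=1))   # equal costs, 50% pre-test prob
print(best_cutoff(0.05, cost_fn=1, cost_fp=1))   # same costs, rare disease
print(best_cutoff(0.50, cost_fn=10, cost_fp=1))  # missed disease 10x worse
```

The same three ROC points give three different "best" cutoffs across these scenarios.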

Optimal Cutoff Line for Two Continuous Tests

Choosing Which Tests to Include in the Decision Rule
Have focused on how to combine results of two or more tests, not on which of several tests to include in a decision rule.
Options include:
• Recursive partitioning
• Automated stepwise logistic regression*
Choice of variables in derivation data set requires confirmation in a separate validation data set.

Need for Validation: Example*
Study of clinical predictors of bacterial diarrhea.
Evaluated 34 historical items and 16 physical examination questions.
3 questions (abrupt onset, > 4 stools/day, and absence of vomiting) best predicted a positive stool culture (sensitivity 86%; specificity 60% for all 3).
Would these 3 be the best predictors in a new dataset? Would they have the same sensitivity and specificity?
*DeWitt TG, Humphrey KF, McCarthy P. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics. Oct 1985;76(4):551-556.

Need for Validation
Develop prediction rule by choosing a few tests and findings from a large number of possibilities.
Takes advantage of chance variations in the data.
Predictive ability of rule will probably disappear when you try to validate on a new dataset.
Can be referred to as “overfitting.”
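A small simulation can make the overfitting point concrete. In the sketch below every finding is pure noise, generated independently of disease status, so no finding has any real predictive value. The 50 candidates and 3 kept mirror the counts in the diarrhea example (34 + 16 items, 3 chosen); the 200 patients per dataset, the 50% disease frequency, and all names are assumptions made for illustration. The 3 findings that agree best with disease in the derivation set are then re-checked in an independent validation set.

```python
import random


def simulate(n_patients=200, n_findings=50, n_keep=3, seed=0):
    """Pick the best-looking noise findings, then re-check them on new data."""
    rng = random.Random(seed)

    def make_dataset():
        # Disease status and every finding are independent coin flips.
        disease = [rng.random() < 0.5 for _ in range(n_patients)]
        findings = [
            [rng.random() < 0.5 for _ in range(n_patients)]
            for _ in range(n_findings)
        ]
        return disease, findings

    def agreement(finding, disease):
        # Fraction of patients in whom the finding matches disease status.
        return sum(f == d for f, d in zip(finding, disease)) / len(disease)

    deriv_disease, deriv_findings = make_dataset()
    valid_disease, valid_findings = make_dataset()

    deriv_acc = [agreement(f, deriv_disease) for f in deriv_findings]
    # Keep the findings that look best in the derivation set.
    chosen = sorted(range(n_findings), key=lambda i: deriv_acc[i], reverse=True)[:n_keep]
    return [
        (i, deriv_acc[i], agreement(valid_findings[i], valid_disease))
        for i in chosen
    ]


for index, derivation, validation in simulate():
    print(f"finding {index}: derivation {derivation:.2f}, validation {validation:.2f}")
```

Because the selected findings were chosen for looking good by chance, their derivation agreement tends to sit above 50%, while their validation agreement tends to fall back toward 50%.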

VALIDATION
No matter what technique (CART or logistic regression) is used, the “rule” for combining multiple test results must be tested on a data set different from the one used to derive it.
Beware of “validation sets” that are just re-hashes of the “derivation set”.
(This begins our discussion of potential problems with studies of diagnostic tests.)

Studies of Diagnostic Test Accuracy
Sackett, EBM, pg 68
• Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?
• Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
• Was the reference standard applied regardless of the diagnostic test result?
• Was the test (or cluster of tests) validated in a second, independent group of patients?

Bias in Studies of Diagnostic Test Accuracy
Index Test = Test Being Evaluated
Gold Standard = Test Used to Determine True Disease Status

Studies of Diagnostic Tests
Incorporation Bias
Index Test is “incorporated” into gold standard.
Consider a study of the usefulness of various findings for diagnosing pancreatitis. If the "Gold Standard" is a discharge diagnosis of pancreatitis, which in many cases will be based upon the serum amylase, then the study can't quantify the accuracy of the amylase for this diagnosis.

Studies of Diagnostic Tests
Incorporation Bias
A study* of BNP in dyspnea patients as a diagnostic test for CHF also showed that the CXR performed extremely well in predicting CHF.
The two cardiologists who determined the final diagnosis of CHF were blinded to the BNP level but not to the CXR report, so the assessment of BNP should be unbiased, but not the assessment of the CXR.
*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.
