Multiple tests multivariable decision rules and studies of diagnostic test accuracy

Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy. Chapter 8 – Multiple Tests and Multivariable Decision Rules Chapter 5 – Studies of Diagnostic Test Accuracy. Michael A. Kohn, MD, MPP 10/19/2006. Outline of Topics.


Multiple Tests, Multivariable Decision Rules, and Studies of Diagnostic Test Accuracy

Chapter 8 – Multiple Tests and Multivariable Decision Rules

Chapter 5 – Studies of Diagnostic Test Accuracy

Michael A. Kohn, MD, MPP

10/19/2006



Outline of Topics

  • Combining results of multiple tests: importance of test non-independence

  • Recursive Partitioning

  • Logistic Regression

  • Published “rules” for combining test results: importance of validation separate from derivation

  • Biases in studies of diagnostic test accuracy

    Overfitting bias

    Incorporation bias

    Referral bias

    Double gold standard bias

    Spectrum bias



Warning: Different Example

Example of combining two tests in this talk:

Prenatal sonographic Nuchal Translucency (NT) and Nasal Bone Exam (NBE) as dichotomous tests for Trisomy 21*

Example of combining two tests in book**:

Premature birth (GA < 36 weeks) and low birth weight (BW < 2500 grams) as dichotomous tests for neonatal morbidity

*Cicero, S., G. Rembouskos, et al. (2004). "Likelihood ratio for trisomy 21 in fetuses with absent nasal bone at the 11-14-week scan." Ultrasound Obstet Gynecol 23(3): 218-23.

**Soon to be replaced


If NT ≥ 3.5 mm: Positive for Trisomy 21*

*What’s wrong with this definition?



  • In general, don’t make multi-level tests like NT into dichotomous tests by choosing a fixed cutoff

  • I did it here to make the discussion of multiple tests easier

  • I arbitrarily chose to call ≥ 3.5 mm positive



One Dichotomous Test

Nuchal Translucency   D+ (Trisomy 21)   D-     LR
≥ 3.5 mm              212               478    7.0
< 3.5 mm              121               4745   0.4
Total                 333               5223

Do you see that this is (212/333)/(478/5223)?

Review of Chapter 3: What are the sensitivity, specificity, PPV, and NPV of this test? (Be careful.)



Nuchal Translucency

  • Sensitivity = 212/333 = 64%

  • Specificity = 4745/5223 = 91%

  • Prevalence = 333/(333+5223) = 6%

    (Study population: pregnant women about to undergo CVS, so high prevalence of Trisomy 21)

  • PPV = 212/(212 + 478) = 31%

  • NPV = 4745/(121 + 4745) = 97.5%*

* Not that great; prior to test P(D-) = 94%
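The arithmetic on this slide can be checked with a short script. This is a minimal sketch, not part of the original talk; the function name `two_by_two_metrics` and its argument names are illustrative. The counts are the NT table values given above.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Accuracy measures for a dichotomous test from 2x2 counts.

    tp, fp, fn, tn = true positives, false positives,
    false negatives, true negatives.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }


# NT >= 3.5 mm as a test for Trisomy 21 (counts from the table above)
nt = two_by_two_metrics(tp=212, fp=478, fn=121, tn=4745)
for name, value in nt.items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV computed this way apply only to a population with the study's 6% prevalence.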


Clinical Scenario – One Test

Pre-Test Probability of Down’s = 6%; NT Positive

Pre-test prob: 0.06

Pre-test odds: 0.06/0.94 = 0.064

LR(+) = 7.0

Post-Test Odds = Pre-Test Odds x LR(+)

= 0.064 x 7.0 = 0.44

Post-Test prob = 0.44/(0.44 + 1) = 0.31
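The same three steps (probability to odds, multiply by the LR, odds back to probability) can be written as a small helper. This is an illustrative sketch; the function names are not from the talk.

```python
def prob_to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)


def post_test_prob(pre_test_prob, lr):
    """Post-test probability = f(pre-test odds x LR)."""
    return odds_to_prob(prob_to_odds(pre_test_prob) * lr)


# Pre-test probability 6%, NT positive (LR = 7.0)
p = post_test_prob(0.06, 7.0)
print(f"Post-test probability: {p:.2f}")
```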


Clinical Scenario – One Test

Using Probabilities

Pre-Test Probability of Tri21 = 6%
NT Positive
Post-Test Probability of Tri21 = 31%

Using Odds

Pre-Test Odds of Tri21 = 0.064
NT Positive (LR = 7.0)
Post-Test Odds of Tri21 = 0.44


Clinical Scenario – One Test

Pre-Test Probability of Tri21 = 6%; NT Positive

NT + (LR = 7.0)

|--------------->

+-------------------------X---------------X------------------------------+

| | | | | | |

Log10(Odds) -2 -1.5 -1 -0.5 0 0.5 1

Odds 1:100 1:33 1:10 1:3 1:1 3:1 10:1

Prob 0.01 0.03 0.09 0.25 0.5 0.75 0.91

Odds = 0.064

Prob = 0.06

Odds = 0.44

Prob = 0.31
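The point of drawing this on a log-odds scale is that multiplying by an LR becomes adding a fixed-length arrow. A minimal sketch, assuming the scale is base-10 logarithms (which matches the 1:10 and 10:1 tick marks); the variable names are illustrative.

```python
import math

pre_test_odds = 0.064
lr_nt_positive = 7.0

# On the log scale, the LR is an arrow of fixed length.
log_pre = math.log10(pre_test_odds)
arrow_length = math.log10(lr_nt_positive)
log_post = log_pre + arrow_length

post_test_odds = 10 ** log_post
print(f"log10(pre-test odds)  = {log_pre:.2f}")
print(f"arrow length log10(LR) = {arrow_length:.2f}")
print(f"log10(post-test odds) = {log_post:.2f}")
print(f"post-test odds        = {post_test_odds:.2f}")
```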


Nasal Bone Seen: NBE Negative for Trisomy 21

Nasal Bone Absent: NBE Positive for Trisomy 21



Second Dichotomous Test

Nasal Bone   Tri21+   Tri21-   LR
Absent       229      129      27.8
Present      104      5094     0.32
Total        333      5223

Do you see that this is (229/333)/(129/5223)?


Clinical Scenario – Two Tests

Using Probabilities

Pre-Test Probability of Trisomy 21 = 6%
NT Positive for Trisomy 21 (≥ 3.5 mm)
Post-NT Probability of Trisomy 21 = 31%
NBE Positive for Trisomy 21 (no bone seen)
Post-NBE Probability of Trisomy 21 = ?


Clinical Scenario – Two Tests

Using Odds

Pre-Test Odds of Tri21 = 0.064
NT Positive (LR = 7.0)
Post-Test Odds of Tri21 = 0.44
NBE Positive (LR = 27.8?)
Post-Test Odds of Tri21 = 0.44 x 27.8? = 12.4? (P = 12.4/(1+12.4) = 92.5%?)


Clinical Scenario – Two Tests

Pre-Test Probability of Trisomy 21 = 6%; NT ≥ 3.5 mm AND Nasal Bone Absent

NT + (LR = 6.96)

|--------------->

NBE + (LR = 27.8)

|--------------------------->

NT + NBE +

Can we do this? |--------------->|--------------------------->

NT + and NBE +

+---------------X----------------X----------------------------X-+

| | | | | | |

Log10(Odds) -2 -1.5 -1 -0.5 0 0.5 1

Odds 1:100 1:33 1:10 1:3 1:1 3:1 10:1

Prob 0.01 0.03 0.09 0.25 0.5 0.75 0.91

Pre-test: Odds = 0.064, Prob = 0.06

After NT +: Odds = 0.44, Prob = 0.31

After NT + and NBE + (if the LRs could be multiplied): Odds = 12.4, Prob = 0.925



Question

Can we use the post-test odds after a positive Nuchal Translucency as the pre-test odds for the positive Nasal Bone Examination?

i.e., can we combine the positive results by multiplying their LRs?

LR(NT+, NBE +) = LR(NT +) x LR(NBE +) ?

= 7.0 x 27.8 ?

= 194 ?



Answer = No

Not 194



Non-Independence

Absence of the nasal bone does not tell you as much if you already know that the nuchal translucency is ≥ 3.5 mm.


Clinical Scenario

Using Odds

Pre-Test Odds of Tri21 = 0.064
NT+/NBE+ (LR = 68.8)
Post-Test Odds = 0.064 x 68.8 = 4.40 (P = 4.40/(1+4.40) = 81%, not 92.5%)



Non-Independence

NT +

|--------------->

NBE +

|--------------------------->

NT + NBE +

if tests were independent|--------------->|---------------------------->

NT + and NBE +

since tests are dependent|----------------------------------->

+---------------X----------------X------------------X----------+

| | | | | | |

Log10(Odds) -2 -1.5 -1 -0.5 0 0.5 1

Odds 1:100 1:33 1:10 1:3 1:1 3:1 10:1

Prob 0.01 0.03 0.09 0.25 0.5 0.75 0.91

Prob = 0.81
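The gap between the "independent" and "dependent" arrows can be put in numbers. The sketch below uses only figures quoted in the talk (pre-test odds 0.064, LR(NT+) = 7.0, LR(NBE+) = 27.8, and the measured joint LR of 68.8). The conditional LR it derives, LR(NBE+ given NT+) = LR(both) / LR(NT+), is an illustrative calculation that does not appear on the slides.

```python
def odds_to_prob(odds):
    return odds / (1 + odds)


pre_test_odds = 0.064
lr_nt = 7.0      # NT >= 3.5 mm alone
lr_nbe = 27.8    # nasal bone absent alone
lr_both = 68.8   # measured LR for NT+ and NBE+ together

# Wrong: treats the tests as independent and multiplies the LRs.
naive_odds = pre_test_odds * lr_nt * lr_nbe

# Right: uses the LR measured for the combination.
actual_odds = pre_test_odds * lr_both

# How much the NBE really adds once NT is known to be positive.
lr_nbe_given_nt_pos = lr_both / lr_nt

print(f"Naive post-test probability:  {odds_to_prob(naive_odds):.2f}")
print(f"Actual post-test probability: {odds_to_prob(actual_odds):.2f}")
print(f"LR of NBE+ given NT+:         {lr_nbe_given_nt_pos:.1f} (vs {lr_nbe} alone)")
```

An absent nasal bone is still informative after a positive NT, but it shifts the odds far less than its stand-alone LR suggests.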



Non-Independence of NT and NBE

Apparently, even in chromosomally normal fetuses, enlarged NT and absence of the nasal bone are associated. A false positive on the NT makes a false positive on the NBE more likely. Of normal (D-) fetuses with NT < 3.5 mm only 2.0% had nasal bone absent. Of normal (D-) fetuses with NT ≥ 3.5 mm, 7.5% had nasal bone absent.

Some (but not all) of this may have to do with ethnicity. In this London study, chromosomally normal fetuses of “Afro-Caribbean” ethnicity had both larger NTs and more frequent absence of the nasal bone.

In Trisomy 21 (D+) fetuses, normal NT was associated with the presence of the nasal bone, so a false negative on the NT was associated with a false negative on the NBE.



Non-Independence

Instead of looking for the nasal bone, what if the second test were just a repeat measurement of the nuchal translucency?

A second positive NT would do little to increase your certainty of Trisomy 21. If it was false positive the first time around, it is likely to be false positive the second time.



Reasons for Non-Independence

Tests measure the same aspect of disease.

Consider exercise ECG (EECG) and radionuclide scan as tests for coronary artery disease (CAD) with the gold standard being anatomic narrowing of the arteries on angiogram. Both EECG and nuclide scan measure functional narrowing. In a patient without anatomic narrowing (a D- patient), coronary artery spasm could cause false positives on both tests.



Reasons for Non-Independence

Spectrum of disease severity.

In the EECG/nuclide scan example, CAD is defined as ≥70% stenosis on angiogram. A D+ patient with 71% stenosis is much more likely to have a false negative on both the EECG and the nuclide scan than a D+ patient with 99% stenosis.



Reasons for Non-Independence

Spectrum of non-disease severity.

In this example, CAD is defined as ≥70% stenosis on angiogram. A D- patient with 69% stenosis is much more likely to have a false positive on both the EECG and the nuclide scan than a D- patient with 33% stenosis.



Counterexamples: Possibly Independent Tests

For Venous Thromboembolism:

  • CT Angiogram of Lungs and Doppler Ultrasound of Leg Veins

  • Alveolar Dead Space and D-Dimer

  • MRA of Lungs and MRV of leg veins



Unless tests are independent, we can’t combine results by multiplying LRs



Ways to Combine Multiple Tests

On a group of patients (derivation set), perform the multiple tests and determine true disease status (apply the gold standard)

  • Measure LR for each possible combination of results

  • Recursive Partitioning

  • Logistic Regression



Determine LR for Each Result Combination

*Assumes pre-test prob = 6%



Determine LR for Each Result Combination

2 dichotomous tests: 4 combinations

3 dichotomous tests: 8 combinations

4 dichotomous tests: 16 combinations

Etc.

2 3-level tests: 9 combinations

3 3-level tests: 27 combinations

Etc.
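The counts above are just the product of the number of levels of each test. A small sketch (the helper name `n_combinations` is illustrative) shows how quickly the number of cells that need their own LR grows:

```python
from itertools import product


def n_combinations(levels_per_test):
    """Number of distinct result combinations, given levels for each test."""
    total = 1
    for levels in levels_per_test:
        total *= levels
    return total


print(n_combinations([2, 2]))      # two dichotomous tests
print(n_combinations([3, 3, 3]))   # three 3-level tests
print(n_combinations([2] * 8))     # eight dichotomous tests

# Enumerating the combinations explicitly for two dichotomous tests:
for nt_result, nbe_result in product(["+", "-"], repeat=2):
    print(f"NT {nt_result}, NBE {nbe_result}")
```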



Determine LR for Each Result Combination

How do you handle continuous tests?

Not practical for most groups of tests.


Recursive Partitioning: Measure NT First


Recursive Partitioning: Examine Nasal Bone First


Recursive Partitioning: Examine Nasal Bone First

CVS if P(Trisomy 21) > 5%


Recursive Partitioning

  • Same as Classification and Regression Trees (CART)

  • Don’t have to work out probabilities (or LRs) for all possible combinations of tests, because of “tree pruning”



Tree Pruning: Goldman Rule*

8 “Tests” for Acute MI in ER Chest Pain Patient:

  • ST Elevation on ECG;

  • CP < 48 hours;

  • ST-T changes on ECG;

  • Hx of MI;

  • Radiation of Pain to Neck/LUE;

  • Longest pain > 1 hour;

  • Age > 40 years;

  • CP not reproduced by palpation.

*Goldman L, Cook EF, Brand DA, et al. A computer protocol to predict myocardial infarction in emergency department patients with chest pain. N Engl J Med. 1988;318(13):797-803.


8 tests → 2^8 = 256 combinations



Recursive Partitioning

  • Does not deal well with continuous test results*

    *when there is a monotonic relationship between the test result and the probability of disease



Logistic Regression

ln(Odds(D+)) = a + b_NT × NT + b_NBE × NBE + b_interact × (NT)(NBE)

where NT and NBE are coded “+” = 1 and “-” = 0

More on this later in ATCR!
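To see how the interaction term handles non-independence, here is a minimal sketch of evaluating such a model. The coefficient values are invented purely for illustration; they are not fitted to the nasal bone data and do not come from the talk.

```python
import math


def predicted_prob(a, b_nt, b_nbe, b_interact, nt, nbe):
    """P(D+) from the logistic model; nt and nbe are coded 1 (+) or 0 (-)."""
    log_odds = a + b_nt * nt + b_nbe * nbe + b_interact * nt * nbe
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients (assumptions, not estimates from any study)
coef = dict(a=-3.0, b_nt=1.5, b_nbe=3.0, b_interact=-1.0)

for nt in (0, 1):
    for nbe in (0, 1):
        p = predicted_prob(nt=nt, nbe=nbe, **coef)
        print(f"NT={nt}, NBE={nbe}: P(D+) = {p:.3f}")
```

A negative interaction coefficient plays the role seen earlier with NT and NBE: when both tests are positive, the combined log odds are lower than the sum of the two separate effects would give.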



Logistic Regression Approach to the “R/O ACI patient”

*Selker HP, Griffith JL, D'Agostino RB. A tool for judging coronary care unit admission appropriateness, valid for both real-time and retrospective use. A time-insensitive predictive instrument (TIPI) for acute cardiac ischemia: a multicenter study. Med Care. Jul 1991;29(7):610-627. For corrected coefficients, see http://medg.lcs.mit.edu/cardiac/cpain.htm



Clinical Scenario*

71 y/o man with 2.5 hours of CP, substernal, non-radiating, described as “bloating.” Cannot say if same as prior MI or worse than prior angina.

Hx of CAD, s/p CABG 10 yrs prior, stenting 3 years and 1 year ago. DM on Avandia.

ECG: RBBB, Qs inferiorly. No ischemic ST-T changes.

*Real patient seen by MAK 1 am 10/12/04



What Happened to Pre-test Probability?

Typically clinical decision rules report probabilities rather than likelihood ratios for combinations of results.

Can “back out” LRs if we know prevalence, p[D+], in the study dataset.

With logistic regression models, this “backing out” is known as a “prevalence offset.” (See Chapter 8A.)
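"Backing out" an LR is just the post-test odds divided by the pre-test odds implied by the study prevalence. A minimal sketch (the function name `implied_lr` is illustrative), applied to the two-test example from earlier in the talk:

```python
def prob_to_odds(p):
    return p / (1 - p)


def implied_lr(reported_prob, study_prevalence):
    """LR implied by a rule's reported probability and the study prevalence."""
    return prob_to_odds(reported_prob) / prob_to_odds(study_prevalence)


# A rule reporting P(Tri21) = 81% for NT+/NBE+ in a study with 6% prevalence
lr = implied_lr(0.81, 0.06)
print(f"Implied LR: {lr:.1f}")
```

The result is close to, but not exactly, the 68.8 quoted earlier because 81% and 6% are rounded inputs. Once the LR is recovered, it can be applied to a different pre-test probability.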



Optimal Cutoff for a Single Continuous Test

Depends on

  • Pre-test Probability of Disease

  • ROC Curve (Likelihood Ratios)

  • Relative Misclassification Costs

    Cannot choose an optimal cutoff with just the ROC curve.
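One way to see why all three ingredients are needed is the standard expected-cost argument, which the slide does not spell out. Under the assumption that each false positive costs `cost_fp` and each false negative costs `cost_fn` (hypothetical names and values), acting is worthwhile when the probability of disease exceeds cost_fp / (cost_fp + cost_fn). The LR needed to reach that probability then depends on the pre-test probability.

```python
def prob_to_odds(p):
    return p / (1 - p)


def treatment_threshold(cost_fp, cost_fn):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_fp / (cost_fp + cost_fn)


def lr_needed(pre_test_prob, threshold_prob):
    """LR a result must have to move pre-test odds up to the threshold odds."""
    return prob_to_odds(threshold_prob) / prob_to_odds(pre_test_prob)


# Assumed costs: missing a case is 19 times worse than a false alarm.
threshold = treatment_threshold(cost_fp=1, cost_fn=19)
print(f"Threshold probability: {threshold:.2f}")

for pre_test in (0.01, 0.06, 0.20):
    print(f"Pre-test {pre_test:.2f}: need LR >= {lr_needed(pre_test, threshold):.2f}")
```

Roughly speaking, the cutoff is best placed at the test result whose LR equals the required value, so the same ROC curve gives different optimal cutoffs at different pre-test probabilities and cost ratios.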



Optimal Cutoff Line for Two Continuous Tests



Choosing Which Tests to Include in the Decision Rule

Have focused on how to combine results of two or more tests, not on which of several tests to include in a decision rule.

Options include:

  • Recursive partitioning

  • Automated stepwise logistic regression*

Choice of variables in derivation data set requires confirmation in a separate validation data set.



Need for Validation: Example*

Study of clinical predictors of bacterial diarrhea.

Evaluated 34 historical items and 16 physical examination questions.

3 questions (abrupt onset, > 4 stools/day, and absence of vomiting) best predicted a positive stool culture (sensitivity 86%; specificity 60% for all 3).

Would these 3 be the best predictors in a new dataset? Would they have the same sensitivity and specificity?

*DeWitt TG, Humphrey KF, McCarthy P. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics. Oct 1985;76(4):551-556.



Need for Validation

Develop prediction rule by choosing a few tests and findings from a large number of possibilities.

Takes advantage of chance variations in the data.

Predictive ability of rule will probably disappear when you try to validate on a new dataset.

Can be referred to as “overfitting.”
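A small simulation makes the point concrete. In this sketch (not from the talk; all names and sizes are assumptions), 50 candidate findings are pure coin flips with no relationship to disease. Picking the one that looks best in the derivation set still produces an apparently useful predictor, which then tends to fall back toward chance in a fresh validation set.

```python
import random


def accuracy(finding, disease):
    """Fraction of patients in whom the finding agrees with disease status."""
    return sum(f == d for f, d in zip(finding, disease)) / len(disease)


def simulate(n_patients=200, n_findings=50, seed=0):
    rng = random.Random(seed)

    def make_dataset():
        disease = [rng.randint(0, 1) for _ in range(n_patients)]
        findings = [[rng.randint(0, 1) for _ in range(n_patients)]
                    for _ in range(n_findings)]
        return disease, findings

    deriv_disease, deriv_findings = make_dataset()
    valid_disease, valid_findings = make_dataset()

    # "Derive" the rule: keep whichever finding looks best by chance.
    best = max(range(n_findings),
               key=lambda j: accuracy(deriv_findings[j], deriv_disease))

    return (accuracy(deriv_findings[best], deriv_disease),
            accuracy(valid_findings[best], valid_disease))


deriv_acc, valid_acc = simulate()
print(f"Accuracy in derivation set: {deriv_acc:.2f}")
print(f"Accuracy in validation set: {valid_acc:.2f}")
```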



VALIDATION

No matter what technique (CART or logistic regression) is used, the “rule” for combining multiple test results must be tested on a data set different from the one used to derive it.

Beware of “validation sets” that are just re-hashes of the “derivation set”.

(This begins our discussion of potential problems with studies of diagnostic tests.)


Studies of Diagnostic Test Accuracy (Sackett, EBM, pg 68)

  • Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?

  • Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?

  • Was the reference standard applied regardless of the diagnostic test result?

  • Was the test (or cluster of tests) validated in a second, independent group of patients?



Bias in Studies of Diagnostic Test Accuracy

Index Test = Test Being Evaluated

Gold Standard = Test Used to Determine True Disease Status




Studies of Diagnostic Tests: Incorporation Bias

Index Test is “incorporated” into gold standard.

Consider a study of the usefulness of various findings for diagnosing pancreatitis. If the "Gold Standard" is a discharge diagnosis of pancreatitis, which in many cases will be based upon the serum amylase, then the study can't quantify the accuracy of the amylase for this diagnosis.


Studies of Diagnostic Tests: Incorporation Bias

A study* of BNP in dyspnea patients as a diagnostic test for CHF also showed that the CXR performed extremely well in predicting CHF.

The two cardiologists who determined the final diagnosis of CHF were blinded to the BNP level but not to the CXR report, so the assessment of BNP should be unbiased, but the assessment of the CXR is not.

*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.




Studies of Diagnostic Tests: Verification Bias*

The study population only includes those to whom the gold standard was applied, but patients with positive index tests are more likely to be referred for the gold standard.

Example: V/Q Scan as a test for PE. Gold standard is a PA-gram. Patients with negative V/Q scans are less frequently referred for PA-gram than those with positive V/Q scans. Only patients who had PA-grams are included in the study.

*Also known as work-up bias, referral bias, or ascertainment bias


Studies of Diagnostic Tests: Verification Bias

Sensitivity (a/(a+c)) is biased UP.

Specificity (d/(b+d)) is biased DOWN.
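A worked example with made-up numbers shows the direction of both biases. The cohort counts and referral fractions below are assumptions chosen for illustration, not data from any V/Q scan study; a, b, c, d follow the slide's notation (a = true positives, b = false positives, c = false negatives, d = true negatives).

```python
def sens_spec(a, b, c, d):
    """Sensitivity a/(a+c) and specificity d/(b+d)."""
    return a / (a + c), d / (b + d)


# Hypothetical full cohort, if everyone had received the gold standard
full_cohort = (80, 100, 20, 800)

# Assumed referral pattern: 90% of index-positive patients get the
# gold standard, but only 10% of index-negative patients do.
refer_if_positive = 0.9
refer_if_negative = 0.1

a, b, c, d = full_cohort
verified = (a * refer_if_positive, b * refer_if_positive,
            c * refer_if_negative, d * refer_if_negative)

true_sens, true_spec = sens_spec(*full_cohort)
biased_sens, biased_spec = sens_spec(*verified)

print(f"True:     sensitivity {true_sens:.2f}, specificity {true_spec:.2f}")
print(f"Verified: sensitivity {biased_sens:.2f}, specificity {biased_spec:.2f}")
```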


Studies of Diagnostic Tests: Double Gold Standard Bias

One gold standard (e.g., biopsy) is applied in patients with a positive index test; another gold standard (e.g., clinical follow-up) is applied in patients with a negative index test.


Studies of Diagnostic Tests: Double Gold Standard*

Test: V/Q Scan

Disease: PE

Gold Standard: PA-gram in patients who had one, clinical follow-up in patients who didn’t

Study Population: All patients presenting to the ED who received a V/Q scan.

Assume some patients did not get PA-gram because of normal/low probability V/Q scans but would have had positive PA-grams. Instead they had negative clinical follow-up and were counted as true negatives. If they had had PA-grams, they would have been counted as false negatives.

*PIOPED. JAMA 1990;263(20):2753-9.


Studies of Diagnostic Tests: Double Gold Standard

Sensitivity (a/(a+c)) biased UP

Specificity (d/(b+d)) biased UP
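The mechanism is that some patients who would have been false negatives (cell c) are reclassified as true negatives (cell d). A minimal sketch with invented counts, using the same a, b, c, d notation; the number of reclassified patients is an assumption for illustration.

```python
def sens_spec(a, b, c, d):
    """Sensitivity a/(a+c) and specificity d/(b+d)."""
    return a / (a + c), d / (b + d)


# Hypothetical counts if every patient had received the definitive test
a, b, c, d = 80, 100, 20, 800

# Assume 10 D+ patients with a negative index test had uneventful
# follow-up and so were counted as true negatives.
reclassified = 10
apparent = (a, b, c - reclassified, d + reclassified)

true_sens, true_spec = sens_spec(a, b, c, d)
apparent_sens, apparent_spec = sens_spec(*apparent)

print(f"True:     sensitivity {true_sens:.3f}, specificity {true_spec:.3f}")
print(f"Apparent: sensitivity {apparent_sens:.3f}, specificity {apparent_spec:.3f}")
```

Both measures rise, although with a large D- group the change in specificity is small.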




Studies of Diagnostic Tests: Spectrum Bias

So far, we have said that PPV and NPV of a test depend on the population being tested, specifically on the prevalence of D+ in the population.

We said that sensitivity and specificity are properties of the test and independent of the prevalence and, by implication at least, the population being tested.

In fact, …


Studies of Diagnostic Tests: Spectrum Bias

Sensitivity depends on the spectrum of disease in the population being tested.

Specificity depends on the spectrum of non-disease in the population being tested.


Studies of Diagnostic Tests: Spectrum Bias

D+ and D- groups are not homogeneous.

D-/D+ is really D-, D+, D++, or D+++

D-/D+ is really (D1-, D2-, or D3-)/D+


Studies of Diagnostic Tests: Spectrum Bias

Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality


Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

Nasal Bone   D+    D-     LR
Absent       229   129    27.8
Present      104   5094   0.32
Total        333   5223

Sensitivity = 229/333 = 69%

BUT

the D+ group only included fetuses with Trisomy 21


Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

D+ group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18)

If the purpose of the nasal bone exam is to determine on whom to get CVS, these 295 fetuses with chromosomal abnormalities other than trisomy 21 should be included in the D+ group.

95/295 (32%, not 69%) had absent nasal bone.


Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

Nasal Bone   D+                D-     LR
Absent       229 + 95 = 324    129    20.9
Present      104 + 200 = 304   5094   0.50
Total        333 + 295 = 628   5223

Sensitivity = 324/628 = 52%

NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21


Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality

By excluding chromosomal abnormalities other than Trisomy 21 from the D+ group, the study exaggerates the sensitivity of the Nasal Bone Exam (NBE) for chromosomal abnormalities.

“True” Sensitivity of NBE for chromosomal abnormalities = 52%

Biased estimate due to spectrum bias (excluding other chromosomal problems) = 69%
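The 52% figure is simply a weighted average of the sensitivities in the two D+ subgroups, as this short check shows. The counts are the ones quoted in the preceding slides; the variable names are illustrative.

```python
# Absent nasal bone among fetuses with Trisomy 21
absent_t21, total_t21 = 229, 333
# Absent nasal bone among fetuses with other chromosomal abnormalities
absent_other, total_other = 95, 295

sens_t21 = absent_t21 / total_t21
sens_other = absent_other / total_other
sens_all = (absent_t21 + absent_other) / (total_t21 + total_other)

# Same number, written as a subgroup-size-weighted average
weighted = (sens_t21 * total_t21 + sens_other * total_other) / (total_t21 + total_other)

print(f"Sensitivity, Trisomy 21 only:       {sens_t21:.2f}")
print(f"Sensitivity, other abnormalities:   {sens_other:.2f}")
print(f"Sensitivity, all chromosomal D+:    {sens_all:.2f}")
```

The overall sensitivity therefore depends on the mix of conditions in the D+ group, which is the essence of spectrum bias.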



Biases in Studies of Tests

  • Overfitting Bias – “Data snooped” cutoffs take advantage of chance variations in the derivation set, making the test look falsely good.

  • Incorporation Bias – index test part of gold standard (Sensitivity Up, Specificity Up)

  • Verification/Referral Bias – positive index test increases referral to gold standard (Sensitivity Up, Specificity Down)

  • Double Gold Standard – positive index test causes application of definitive gold standard, negative index test results in clinical follow-up (Sensitivity Up, Specificity Up)

  • Spectrum Bias

    • D+ sickest of the sick (Sensitivity Up)

    • D- wellest of the well (Specificity Up)



Biases in Studies of Tests

Don’t just identify potential biases; figure out how the biases could affect the conclusions.

Studies concluding a test is worthless are not invalid if biases in the design would have led to the test looking BETTER than it really is.

