
Evaluation of Medical Systems



  1. Evaluation of Medical Systems Lucila Ohno-Machado An introduction to calibration and discrimination methods HST951 Medical Decision Support Harvard Medical School Massachusetts Institute of Technology

  2. Main Concepts • Example of System • Discrimination • Discrimination: sensitivity, specificity, PPV, NPV, accuracy, ROC curves • Calibration • Calibration curves • Hosmer and Lemeshow goodness of fit

  3. Example

  4. Comparison of Practical Prediction Models for Ambulation Following Spinal Cord Injury Todd Rowland, M.D. Decision Systems Group Brigham and Women's Hospital Harvard Medical School

  5. Study Rationale • Patient's most common question: “Will I walk again?” • The study compared logistic regression, neural network, and rough sets models that predict ambulation at discharge based on information available at admission for individuals with acute spinal cord injury • Goal: create simple models with good performance • Training set: 762 cases • Test set: 376 cases • Univariate statistics (means) compared

  6. SCI Ambulation NN Design: Inputs & Output • Admission info (9 items): system days, injury days, age, gender, racial/ethnic group, level of neurologic function, ASIA impairment index, UEMS, LEMS • Ambulation (1 item): yes / no

  7. Evaluation Indices

  8. General indices • Brier score (a.k.a. mean squared error): Brier = Σ (ei − oi)² / n • e = estimate (e.g., 0.2) • o = observation (0 or 1) • n = number of cases
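The formula above can be sketched in Python; the helper name `brier_score` and the example numbers are illustrative, not from the lecture:

```python
def brier_score(estimates, outcomes):
    """Brier score: mean squared error between probability estimates
    and 0/1 observed outcomes."""
    n = len(estimates)
    return sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / n

# Five hypothetical cases with their gold-standard outcomes
print(brier_score([0.2, 0.9, 0.4, 0.7, 0.1], [0, 1, 0, 1, 0]))  # about 0.062
```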

  9. Decompositions of Brier score • Murphy: Brier = d(1 − d) + CI − DI where • d = prior probability • CI = calibration index • DI = discrimination index

  10. Calibration and Discrimination • Calibration measures how close the estimates are to a “real” probability • Discrimination measures how much the system can discriminate between cases with gold standard ‘1’ and gold standard ‘0’ • “If the system is good in discrimination, calibration may be fixed”

  11. Discrimination Indices

  12. Discrimination • The system can “somehow” differentiate between cases in different categories • Special cases: • diagnosis (differentiate sick and healthy individuals) • prognosis (differentiate poor and good outcomes)

  13. Discrimination of Binary Outcomes • Real outcome is 0 or 1, estimated outcome is usually a number between 0 and 1 (e.g., 0.34) or a rank • Based on Thresholded Results (e.g., if output > 0.5 then consider “positive”) • Discrimination is based on: • sensitivity • specificity • PPV, NPV • accuracy • area under ROC curve

  14. Notation • D means disease present • “D” means system says disease is present • nl means disease absent • “nl” means system says disease is absent • TP means true positive • FP means false positive • TN means true negative • FN means false negative

  15. [Figure: overlapping score distributions for normal and disease cases on a 1.0–3.0 scale, threshold at 1.7; regions labeled TN, FN, FP, TP]

  16. The 2×2 table (rows: system output; columns: gold standard):

             D     nl
  "D"       TP     FP
  "nl"      FN     TN

  • Prevalence = P(D) = (TP + FN) / all

  17. • Specificity: Spec = TN / (TN + FP)

  18. • Positive Predictive Value: PPV = TP / (TP + FP)

  19. • Negative Predictive Value: NPV = TN / (TN + FN)

  20. • Accuracy = (TN + TP) / all

  21. Summary • Sensitivity: Sens = TP / (TP + FN) • Specificity: Spec = TN / (TN + FP) • Positive Predictive Value: PPV = TP / (TP + FP) • Negative Predictive Value: NPV = TN / (TN + FN) • Accuracy = (TN + TP) / all

  22. Worked example (100 cases):

             D     nl
  "D"       40     20
  "nl"      10     30

  Sens = TP/(TP+FN) = 40/50 = .8 • Spec = TN/(TN+FP) = 30/50 = .6 • PPV = TP/(TP+FP) = 40/60 = .67 • NPV = TN/(TN+FN) = 30/40 = .75 • Accuracy = (TN+TP)/all = 70/100 = .7
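The slide's counts can be checked with a short sketch; the helper `confusion_metrics` is an illustrative name, not from the lecture:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard discrimination indices from the four cells of a 2x2 table."""
    total = tp + fn + fp + tn
    return {
        "sens": tp / (tp + fn),    # sensitivity
        "spec": tn / (tn + fp),    # specificity
        "ppv":  tp / (tp + fp),    # positive predictive value
        "npv":  tn / (tn + fn),    # negative predictive value
        "acc":  (tn + tp) / total, # accuracy
    }

# Counts from the slide: TP=40, FN=10, FP=20, TN=30
m = confusion_metrics(tp=40, fn=10, fp=20, tn=30)
print(m)
```

With the slide's counts this reproduces .8, .6, .67 (= 40/60), .75, and .7.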

  23. Threshold = 1.4:

             D     nl
  "D"       50     10     60
  "nl"       0     40     40
            50     50

  Sensitivity = 50/50 = 1 • Specificity = 40/50 = 0.8

  24. Threshold = 1.7:

             D     nl
  "D"       40     10     50
  "nl"      10     40     50
            50     50

  Sensitivity = 40/50 = .8 • Specificity = 40/50 = .8

  25. Threshold = 2.0:

             D     nl
  "D"       30      0     30
  "nl"      20     50     70
            50     50

  Sensitivity = 30/50 = .6 • Specificity = 50/50 = 1

  26. Each threshold's 2×2 table contributes one point (1 − specificity, sensitivity) to the ROC curve: threshold 2.0 → (0, .6), threshold 1.7 → (.2, .8), threshold 1.4 → (.2, 1). [Figure: ROC curve through these points]

  27. [Figure: ROC curve, sensitivity vs. 1 − specificity, over all thresholds]

  28. [Figure: 45-degree line (no discrimination)]

  29. [Figure: 45-degree line (no discrimination); area under ROC = 0.5]

  30. [Figure: perfect discrimination; the curve reaches the upper-left corner]

  31. [Figure: perfect discrimination; area under ROC = 1]

  32. [Figure: ROC curve for the example system; area = 0.86]

  33. What is the area under the ROC? • An estimate of the discriminatory performance of the system • the real outcome is binary, and the system's estimates are continuous (0 to 1) • all thresholds are considered • NOT an estimate of how many times the system will give the “right” answer

  34. Interpretation of the Area • System's estimates for Healthy (real outcome 0): 0.3, 0.2, 0.5, 0.1, 0.7 • Sick (real outcome 1): 0.8, 0.2, 0.5, 0.7, 0.9

  35. All possible 0–1 pairs: compare the healthy estimate 0.3 with each sick estimate • 0.3 < 0.8: concordant • 0.3 > 0.2: discordant • 0.3 < 0.5: concordant • 0.3 < 0.7: concordant • 0.3 < 0.9: concordant

  36. Compare the healthy estimate 0.2 with each sick estimate • 0.2 < 0.8: concordant • 0.2 = 0.2: tie • 0.2 < 0.5: concordant • 0.2 < 0.7: concordant • 0.2 < 0.9: concordant

  37. C-index • Concordant: 18 • Discordant: 4 • Ties: 3 • C-index = (Concordant + ½ Ties) / All pairs = (18 + 1.5) / 25 = 0.78
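The pair counting above can be sketched in a few lines; `c_index` is an illustrative helper name:

```python
def c_index(healthy, sick):
    """C-index: (concordant pairs + half the ties) / all 0-1 pairs."""
    concordant = sum(1 for h in healthy for s in sick if s > h)
    ties = sum(1 for h in healthy for s in sick if s == h)
    return (concordant + 0.5 * ties) / (len(healthy) * len(sick))

healthy = [0.3, 0.2, 0.5, 0.1, 0.7]  # real outcome 0
sick = [0.8, 0.2, 0.5, 0.7, 0.9]     # real outcome 1
print(c_index(healthy, sick))        # (18 + 1.5) / 25 = 0.78
```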

  38. (Repeat of slide 35: the all-possible-pairs comparison for the healthy and sick estimates.)

  39. Related statistics • Mann-Whitney U: P(x < y) = U / (mn), where m is the number of real ‘0’s (e.g., deaths) and n is the number of real ‘1’s (e.g., survivals) • Wilcoxon rank test

  40. Calibration Indices

  41. Calibration • System can reliably estimate probability of • a diagnosis • a prognosis • Probability is close to the “real” probability for similar cases

  42. What is the problem? • Binary events are YES/NO (0/1), i.e., probabilities are 0 or 1 for a given individual • Some models produce continuous (or quasi-continuous) estimates for the binary events • Example: • Database of patients with spinal cord injury, and a model that predicts whether a patient will ambulate at hospital discharge • Event is 0 (doesn't walk) or 1 (walks) • Models produce a probability that the patient will walk: 0.05, 0.10, ...

  43. How close are the estimates to the “true” probability for a patient? • “True” probability can be interpreted as probability within a set of similar patients • What are similar patients? • Patients who get similar scores from models • How to define boundaries for similarity? • Examples out there in the real world... • APACHE III for ICU outcomes • ACI-TIPI for diagnosis of MI

  44. Calibration • Sort the system's estimates, group, and sum each group:

  estimate  outcome
  0.1       0
  0.2       0
  0.2       1     estimate sum = 0.5, outcome sum = 1
  0.3       0
  0.5       0
  0.5       1     estimate sum = 1.3, outcome sum = 1
  0.7       0
  0.7       1
  0.8       1
  0.9       1     estimate sum = 3.1, outcome sum = 3

  45. Calibration Curves [Figure: sum of system's estimates vs. sum of real outcomes; points above the diagonal indicate overestimation]

  46. Linear Regression and 45° line [Figure: regression line fitted to sum of system's estimates vs. sum of real outcomes, compared with the 45° line]

  47. Goodness-of-fit • Sort the system's estimates, group, sum, and compute a chi-square:

  Expected (estimates)               Observed (outcomes)
  0.1, 0.2, 0.2       sum = 0.5      0, 0, 1     sum = 1
  0.3, 0.5, 0.5       sum = 1.3      0, 0, 1     sum = 1
  0.7, 0.7, 0.8, 0.9  sum = 3.1      0, 1, 1, 1  sum = 3

  χ² = Σ [(observed − expected)² / expected]
  χ² = (0.5 − 1)²/0.5 + (1.3 − 1)²/1.3 + (3.1 − 3)²/3.1
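The grouped chi-square can be sketched as follows; the `goodness_of_fit` helper is my own name for the slide's procedure, applied to its three groups:

```python
def goodness_of_fit(groups):
    """Sum of (observed - expected)^2 / expected over groups, where expected
    is the sum of estimates and observed the sum of outcomes in each group."""
    chi2 = 0.0
    for estimates, outcomes in groups:
        expected, observed = sum(estimates), sum(outcomes)
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# The slide's three groups of sorted estimates and their outcomes
groups = [
    ([0.1, 0.2, 0.2], [0, 0, 1]),         # expected 0.5, observed 1
    ([0.3, 0.5, 0.5], [0, 0, 1]),         # expected 1.3, observed 1
    ([0.7, 0.7, 0.8, 0.9], [0, 1, 1, 1]), # expected 3.1, observed 3
]
print(goodness_of_fit(groups))  # about 0.572
```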

  48. Hosmer-Lemeshow C-hat • Groups based on n-iles (e.g., deciles, terciles), n − 2 d.f. • Each measured group gets a “mirror group” (expected = 1 − estimate, observed = 1 − outcome):

  Measured groups                  Mirror groups
  Expected         Observed        Expected         Observed
  0.1              0               0.9              1
  0.2              0               0.8              1
  0.2  sum = 0.5   1  sum = 1      0.8  sum = 2.5   0  sum = 2
  0.3              0               0.7              1
  0.5              0               0.5              1
  0.5  sum = 1.3   1  sum = 1      0.5  sum = 1.7   0  sum = 2
  0.7              0               0.3              1
  0.7              1               0.3              0
  0.8              1               0.2              0
  0.9  sum = 3.1   1  sum = 3      0.1  sum = 0.9   0  sum = 1

  χ² = (0.5−1)²/0.5 + (1.3−1)²/1.3 + (3.1−3)²/3.1 + (2.5−2)²/2.5 + (1.7−2)²/1.7 + (0.9−1)²/0.9
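The mirror-group construction can be sketched by extending the same grouped chi-square: each group contributes one term for the event side and one for the non-event side. The `hosmer_lemeshow` name is illustrative; this is a sketch of the slide's arithmetic, not a full implementation of the published test:

```python
def hosmer_lemeshow(groups):
    """Chi-square with 'mirror' terms: for each group of (estimates, outcomes),
    add a term for the events and a term for the non-events
    (expected = 1 - estimate, observed = 1 - outcome)."""
    chi2 = 0.0
    for estimates, outcomes in groups:
        exp_pos, obs_pos = sum(estimates), sum(outcomes)
        exp_neg = sum(1 - e for e in estimates)
        obs_neg = sum(1 - o for o in outcomes)
        chi2 += (obs_pos - exp_pos) ** 2 / exp_pos
        chi2 += (obs_neg - exp_neg) ** 2 / exp_neg
    return chi2

# Tercile groups from the slide: (estimates, outcomes) per group
groups = [
    ([0.1, 0.2, 0.2], [0, 0, 1]),
    ([0.3, 0.5, 0.5], [0, 0, 1]),
    ([0.7, 0.7, 0.8, 0.9], [0, 1, 1, 1]),
]
print(hosmer_lemeshow(groups))  # about 0.737
```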

  49. Hosmer-Lemeshow H-hat • Groups based on n fixed thresholds (e.g., 0.3, 0.6, 0.9), n − 2 d.f.:

  Measured groups                  Mirror groups
  Expected         Observed        Expected         Observed
  0.1              0               0.9              1
  0.2              0               0.8              1
  0.2              1               0.8              0
  0.3  sum = 0.8   0  sum = 1      0.7  sum = 3.2   1  sum = 3
  0.5              0               0.5              1
  0.5  sum = 1.0   1  sum = 1      0.5  sum = 1.0   0  sum = 1
  0.7              0               0.3              1
  0.7              1               0.3              0
  0.8              1               0.2              0
  0.9  sum = 3.1   1  sum = 3      0.1  sum = 0.9   0  sum = 1

  χ² = (0.8−1)²/0.8 + (1.0−1)²/1.0 + (3.1−3)²/3.1 + (3.2−3)²/3.2 + (1.0−1)²/1.0 + (0.9−1)²/0.9

  50. Covariance decomposition (Arkes et al., 1995) • Brier = d(1 − d) + bias² + d(1 − d) · slope · (slope − 2) + scatter • where d = prior • bias is a calibration index • slope is a discrimination index • scatter is a variance index
