
Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies
David G. Brown and Frank Samuelson, Center for Devices and Radiological Health, FDA
6 July 2014


Presentation Transcript


  1. Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies David G. Brown and Frank Samuelson Center for Devices and Radiological Health, FDA 6 July 2014

  2. Course Outline • Performance measures for Computational Intelligence (CI) observers • Accuracy • Prevalence dependent measures • Prevalence independent measures • Maximization of performance: Utility analysis/Cost functions • Receiver Operating Characteristic (ROC) analysis • Sensitivity and specificity • Construction of the ROC curve • Area under the ROC curve (AUC) • Error analysis for CI observers • Sources of error • Parametric methods • Nonparametric methods • Standard deviations and confidence intervals • Bootstrap methods • Theoretical foundation • Practical use • References

  3. What’s the problem? • Emphasis on algorithm innovation to exclusion of performance assessment • Use of subjective measures of performance – “beauty contest” • Use of “accuracy” as a measure of success • Lack of error bars—My CIO is .01 better than yours (+/- ?) • Flawed methodology—training and testing on same data • Lack of appreciation for the many different sources of error that can be taken into account

  4. Original image: Lena. Courtesy of the Signal and Image Processing Institute at the University of Southern California.

  5. CI improved image: Baboon. Courtesy of the Signal and Image Processing Institute at the University of Southern California.

  6. Panel of experts (funnymonkeysite.com)

  7. I. Performance measures for computational intelligence (CI) observers • Task based: (binary) discrimination task • Two populations involved: “normal” and “abnormal” • Accuracy – intuitive but incomplete • Different consequences for success or failure for each population • Some measures depend on the prevalence (Pr), some do not; Pr = (number of abnormal cases) / (total number of cases) • Prevalence dependent: accuracy, positive predictive value, negative predictive value • Prevalence independent: sensitivity, specificity, ROC, AUC • True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations

  8. How to make a CIO with >99% accuracy • Medical problem: Screening mammography (“screening” means testing in an asymptomatic population) • Prevalence of breast cancer in the screening population Pr = 0.5 % • My CIO always says “normal” • Accuracy (Acc) is 99.5% (accuracy of accepted present-day systems ~75%) • Accuracy in a diagnostic setting (Pr~20%) is 80% -- Acc=1-Pr (for my CIO)

  9. CIO operates on two different populations. [Figure: probability densities p(t|0) for normal cases and p(t|1) for abnormal cases along the t-axis, with decision threshold t = T.]

  10. Must consider effects on normal and abnormal populations separately • CIO output t • p(t|0) probability distribution of t for the population of normals • p(t|1) probability distribution of t for the population of abnormals • Threshold T. Everything to the right of T called abnormal, and everything to the left of T called normal • Area of p(t|0) to left of T is the true negative fraction (TNF = specificity) and to the right the false positive fraction (FPF = type 1 error). • TNF + FPF = 1 • Area of p(t|1) to left of T is the false negative fraction (FNF = type 2 error) and to the right is the true positive fraction (TPF = sensitivity) • FNF + TPF = 1 • TNF, FPF, FNF, TPF all are prevalence independent, since each is some fraction of one of our two probability distributions • {Accuracy = Pr x TPF + (1-Pr) x TNF}
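These four fractions can be estimated directly from CIO outputs once a threshold is fixed. Below is a minimal Python/NumPy sketch; the score distributions, sample sizes, and threshold value are illustrative assumptions, not data from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CIO outputs for the two populations (assumed unit-variance normals).
t_normal = rng.normal(0.0, 1.0, 1000)    # normal cases,   p(t|0)
t_abnormal = rng.normal(1.5, 1.0, 1000)  # abnormal cases, p(t|1)

T = 0.5  # decision threshold: call a case "abnormal" when t > T

# Each fraction is a proportion of one population, so all are prevalence independent.
TNF = np.mean(t_normal <= T)    # specificity
FPF = np.mean(t_normal > T)     # type 1 error;  TNF + FPF = 1
TPF = np.mean(t_abnormal > T)   # sensitivity
FNF = np.mean(t_abnormal <= T)  # type 2 error;  FNF + TPF = 1

Pr = 0.005  # prevalence enters only when the two populations are combined
accuracy = Pr * TPF + (1 - Pr) * TNF
print(TNF, FPF, TPF, FNF, accuracy)
```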

  11. [Figure: the two densities with threshold T on the t-axis; normal cases split into TNF (.5) and FPF (.5), abnormal cases split into FNF (.05) and TPF (.95).]

  12. Prevalence dependent measures • Accuracy (Acc) • Acc = Pr x TPF + (1-Pr) x TNF • Positive predictive value (PPV): fraction of positives that are true positives • PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr)) • Negative predictive value (NPV): fraction of negatives that are true negatives • NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr) • Using Pr = .05 (for illustration) and the previous TPF, TNF, FNF, FPF values: TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5 • Acc = .05x.95 + .95x.5 = .52 • PPV = .95x.05/(.95x.05 + .5x.95) = .09 • NPV = .5x.95/(.5x.95 + .05x.05) = .995

  13. Prevalence dependent measures • Accuracy (Acc) • Acc = Pr x TPF + (1-Pr) x TNF • Positive predictive value (PPV): fraction of positives that are true positives • PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr)) • Negative predictive value (NPV): fraction of negatives that are true negatives • NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr) • Using the mammography screening Pr and previous TPF, TNF, FNF, FPF values: Pr = .005, TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5 • Acc = .005x.95 + .995x.5 = .50 • PPV = .95x.005/(.95x.005 + .5x.995) = .01 • NPV = .5x.995/(.5x.995 + .05x.005) = .9995

  14. Acc, PPV, NPV as functions of prevalence (screening mammography) • TPF = .95 • FNF = .05 • TNF = 0.5 • FPF = 0.5
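A short sketch of how the curves on this slide can be generated from the prevalence-dependent formulas of slides 12-13; the values printed for Pr = .05 and .005 reproduce the worked examples above (the plotting step is omitted).

```python
import numpy as np

# Operating point from the slides.
TPF, TNF = 0.95, 0.50
FNF, FPF = 1 - TPF, 1 - TNF

def prevalence_dependent(pr):
    """Accuracy, PPV, and NPV at prevalence pr (formulas from slides 12-13)."""
    acc = pr * TPF + (1 - pr) * TNF
    ppv = TPF * pr / (TPF * pr + FPF * (1 - pr))
    npv = TNF * (1 - pr) / (TNF * (1 - pr) + FNF * pr)
    return acc, ppv, npv

for pr in (0.05, 0.005):                     # slide 12 and slide 13 cases
    print(pr, prevalence_dependent(pr))

prevalences = np.linspace(0.001, 1.0, 200)   # sweep used to draw the curves
curves = np.array([prevalence_dependent(p) for p in prevalences])
```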

  15. Acc = NPV as function of prevalence (forced “normal” response CIO)

  16. Prevalence independent measures • Sensitivity = TPF • Specificity = TNF = 1 - FPF • Receiver Operating Characteristic (ROC) = TPF as a function of FPF (sensitivity as a function of 1 - specificity) • Area under the ROC curve (AUC) = sensitivity averaged over all values of specificity
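The ROC curve can be constructed nonparametrically by sweeping the threshold across all observed CIO outputs. A minimal sketch, again under assumed score distributions; the empirical AUC here is simply the trapezoidal area under the resulting curve.

```python
import numpy as np

def empirical_roc(t_normal, t_abnormal):
    """Sweep the threshold over every observed score; return arrays of (FPF, TPF) points."""
    thresholds = np.sort(np.concatenate([t_normal, t_abnormal]))[::-1]
    fpf = np.array([np.mean(t_normal > T) for T in thresholds])
    tpf = np.array([np.mean(t_abnormal > T) for T in thresholds])
    # Anchor the curve at (0, 0) and (1, 1).
    return np.concatenate([[0.0], fpf, [1.0]]), np.concatenate([[0.0], tpf, [1.0]])

rng = np.random.default_rng(0)
t_normal = rng.normal(0.0, 1.0, 500)     # assumed p(t|0)
t_abnormal = rng.normal(1.5, 1.0, 500)   # assumed p(t|1)

fpf, tpf = empirical_roc(t_normal, t_abnormal)
auc = np.sum(np.diff(fpf) * (tpf[1:] + tpf[:-1]) / 2.0)   # trapezoidal area under the curve
print(auc)
```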

  17. [Figure: the normal / class 0 and abnormal / class 1 distributions with the threshold, the entire ROC curve (TPF, sensitivity vs. FPF, 1-specificity), and the ROC slope at the operating point.]

  18. Empirical ROC data for mammography screening in the US (Craig Beam et al.)

  19. Maximization of performance • Need to know utilities or costs of each type of decision outcome – but these are very hard to estimate accurately. You don’t just maximize accuracy. • Need prevalence • For the mammography example • TPF: prolongation of life minus treatment cost • FPF: diagnostic work-up cost, anxiety • TNF: peace of mind • FNF: delay in treatment => shortened life • Hypothetical assignment of utilities for some decision threshold T (using Pr = .05): • Utility(T) = U(TPF) x TPF x Pr + U(FPF) x FPF x (1-Pr) + U(TNF) x TNF x (1-Pr) + U(FNF) x FNF x Pr • U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20 • Utility(T) = 100 x .95 x .05 – 10 x .50 x .95 + 4 x .50 x .95 – 20 x .05 x .05 = 1.85 • Now if we only knew how to trade off TPF versus FPF, we could optimize (?) medical performance.
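A minimal sketch of the expected-utility sum on this slide, using the slide's utilities, operating point, and Pr = .05:

```python
# Utilities and operating point from slide 19 (Pr = 0.05).
U_TPF, U_FPF, U_TNF, U_FNF = 100, -10, 4, -20
TPF, FPF, TNF, FNF = 0.95, 0.50, 0.50, 0.05
Pr = 0.05

utility = (U_TPF * TPF * Pr + U_FPF * FPF * (1 - Pr)
           + U_TNF * TNF * (1 - Pr) + U_FNF * FNF * Pr)
print(utility)  # 1.85
```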

  20. Utility maximization (mammography example)

  21. Choice of ROC operating point through utility analysis—screening mammography

  22. Utility maximization (mammography example)

  23. Utility maximization calculation
  u = [U(TPF) x TPF + U(FNF) x FNF] x Pr + [U(TNF) x TNF + U(FPF) x FPF] x (1-Pr)
    = [U(TPF) x TPF + U(FNF) x (1-TPF)] x Pr + [U(TNF) x (1-FPF) + U(FPF) x FPF] x (1-Pr)
  du/dFPF = [U(FPF) - U(TNF)] x (1-Pr) + [U(TPF) - U(FNF)] x Pr x dTPF/dFPF = 0
  => dTPF/dFPF = [U(TNF) - U(FPF)] x (1-Pr) / ([U(TPF) - U(FNF)] x Pr)
  With U(TPF) = 100, U(FNF) = -20, U(TNF) = 4, U(FPF) = -10:
  Pr = .005 => dTPF/dFPF = 23;  Pr = .05 => dTPF/dFPF = 2.2
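A short sketch that reproduces the optimal ROC slope values derived above from the same utilities:

```python
# Utilities from slides 19 and 23.
U_TPF, U_FNF, U_TNF, U_FPF = 100, -20, 4, -10

def optimal_roc_slope(pr):
    """Slope dTPF/dFPF of the ROC curve at which expected utility is maximized."""
    return (U_TNF - U_FPF) * (1 - pr) / ((U_TPF - U_FNF) * pr)

print(optimal_roc_slope(0.005))  # ~23  (screening prevalence)
print(optimal_roc_slope(0.05))   # ~2.2
```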

  24. [Figure: the normal and abnormal case distributions with the threshold, the entire ROC curve, and the ROC slope (TPF, sensitivity vs. FPF, 1-specificity).]

  25. Estimators • TPF, FPF, TNF, FNF, Accuracy, the ROC curve, and AUC are all fractions or probabilities. • Normally we have a finite sample of subjects on which to test our CIO. From this finite sample we try to estimate the above fractions • These estimates will vary depending upon the sample selected (statistical variation). • Estimates can be nonparametric or parametric

  26. Estimators • TPF (population) = (number of abnormals that would be selected by the CIO in the population) / (number of abnormals in the population) • Estimated TPF (sample) = (number of abnormals that were selected by the CIO in the sample) / (number of abnormals in the sample) • Number in sample << Number in population (at least in theory)

  27. II. Receiver Operating Characteristic (ROC) • Receiver Operating Characteristic • Binary Classification • Test result is compared to a threshold

  28. [Figure: distribution of CIO output for all subjects, with the decision threshold marked on the computational intelligence observer output axis.]

  29. [Figure: distribution of output for normal / class 0 subjects, p(t|0), and for abnormal / class 1 subjects, p(t|1), with the threshold marked on the t-axis (computational intelligence observer output).]

  30. [Figure: distribution of output for normal / class 0 subjects, p(t|0), shown alongside the abnormal / class 1 subjects, with the threshold marked.]

  31. [Figure: the same distributions with the areas labeled relative to the threshold: Specificity = True Negative Fraction = TNF (normal / class 0 subjects) and Sensitivity = True Positive Fraction = TPF (abnormal / class 1 subjects).]

  32. [Figure: 2x2 table of truth (H0, H1) versus decision (D0, D1) alongside the distributions: specificity TNF = 0.50 for normal / class 0 subjects, sensitivity TPF = 0.95 for abnormal / class 1 subjects.]

  33. [Figure: the same distributions with the complementary areas labeled: 1 - Specificity = False Positive Fraction = FPF (normal / class 0 subjects) and 1 - Sensitivity = False Negative Fraction = FNF (abnormal / class 1 subjects).]

  34. [Figure: the completed 2x2 table of truth (H0, H1) versus decision (D0, D1): TNF = 0.50 and FPF = 0.50 for normal / class 0 subjects, FNF = 0.05 and TPF = 0.95 for abnormal / class 1 subjects.]

  35. [Figure: the two class distributions with the threshold set for high sensitivity, and the corresponding operating point on the ROC curve (TPF, sensitivity vs. FPF, 1-specificity).]

  36. [Figure: the threshold set so that sensitivity = specificity, and the corresponding operating point on the ROC curve.]

  37. [Figure: the threshold set for high specificity, and the corresponding operating point on the ROC curve.]

  38. Which CIO is best? [Figure: three operating points, CIO #1, CIO #2, and CIO #3, plotted in the ROC plane (TPF, sensitivity vs. FPF, 1-specificity).]

  39. Do not compare rates of one class, e.g. TPF, at different rates of the other class (FPF). [Figure: the same three operating points, CIO #1, #2, and #3, in the ROC plane.]

  40. [Figure: the two class distributions with the threshold and the entire ROC curve (TPF, sensitivity vs. FPF, 1-specificity).]

  41. [Figure: ROC curves with AUC = 0.98 and AUC = 0.85, and the chance line with AUC = 0.5; higher AUC indicates greater discriminability, or CIO performance (TPF, sensitivity vs. FPF, 1-specificity).]

  42. AUC (Area under ROC Curve) • AUC is a separation probability • AUC = probability that • CIO output for abnormal > CIO output for normal • CIO correctly tells which of 2 subjects is normal • Estimating AUC from finite sample • Select abnormal subject score = xi • Select normal subject score = yk • Is xi > yk ? • Average over all x,y: estimated AUC = (number of pairs with xi > yk) / (number of abnormal subjects x number of normal subjects)
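A minimal sketch of this pairwise (Wilcoxon–Mann–Whitney) estimate of AUC, counting ties as one half; the score arrays here are assumptions for illustration.

```python
import numpy as np

def auc_two_sample(abnormal_scores, normal_scores):
    """Fraction of (abnormal, normal) pairs with x_i > y_k, ties counted as 1/2."""
    x = np.asarray(abnormal_scores)[:, None]   # x_i, abnormal subject scores
    y = np.asarray(normal_scores)[None, :]     # y_k, normal subject scores
    return np.mean((x > y) + 0.5 * (x == y))

rng = np.random.default_rng(1)
auc = auc_two_sample(rng.normal(1.5, 1.0, 300), rng.normal(0.0, 1.0, 400))
print(auc)
```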
