
Statistics and Image Evaluation


Presentation Transcript


  1. Statistics and Image Evaluation Oleh Tretiak Medical Imaging Systems Fall, 2002

  2. Which Image is Better? Case A Case B

  3. Method • Rating on a 1 to 5 scale (5 is best) • Rating performed by 21 subjects • Statistics: • Average, maximum, minimum, standard deviation, standard error for Case A, Case B • Difference per viewer between Case A and Case B, and the above statistics on the difference
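A minimal sketch of these statistics in Python, using made-up ratings for the 21 viewers (the actual experimental data is not reproduced in this transcript):

```python
# Sketch of the rating statistics described above. The ratings here are
# invented for illustration; only the procedure follows the slide.
import math
import statistics

ratings_a = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 5]
ratings_b = [2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2]

def summarize(xs):
    n = len(xs)
    sd = statistics.stdev(xs)          # sample standard deviation
    return {
        "mean": statistics.mean(xs),
        "max": max(xs),
        "min": min(xs),
        "std_dev": sd,
        "std_err": sd / math.sqrt(n),  # standard error of the mean
    }

# Per-viewer difference between Case A and Case B, then the same statistics
diff = [a - b for a, b in zip(ratings_a, ratings_b)]
stats_diff = summarize(diff)
z = stats_diff["mean"] / stats_diff["std_err"]  # the z value used on the Conclusions slide
```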

  4. Observations

  5. Statistics

  6. Conclusions • The image in Case A has a higher average ranking than the one in Case B. • The highest ranking for B equals the lowest ranking for A; in all other cases, the rankings for B are lower than those for A. • Consider the difference (rightmost column on the previous slide). The ratio of the average to the standard error (the z value) is 2.62/0.22 ≈ 12. Such a large z is extremely unlikely if the means are the same.

  7. Experimental Design • How many observers should we use to test differences between pictures? • We expect the difference between the two kinds of pictures to be 0.5 ranking units, and we expect the standard deviation of the difference measurement to be 1.0 (see the experiment above). We would like to determine this reliably, so we want the confidence interval on the mean to be [mean - 0.5, mean + 0.5] at 99% confidence. How many observers should we use? • Answer: z0.005 = 2.6. The standard error must be 0.5/2.6 = 0.19. Std. err. = std. dev./sqrt(n), therefore n = (1.0/0.19)^2 ≈ 28
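The same sample-size calculation can be written as a small function (a sketch only; note that the slide rounds z0.005 to 2.6, which gives n = 28, while the exact value z ≈ 2.576 gives 27):

```python
# Sample-size calculation from the Experimental Design slide:
# how many observers so that the 99% confidence interval on the mean
# difference has half-width 0.5, given std. dev. of the difference = 1.0.
import math
from statistics import NormalDist

def observers_needed(half_width, std_dev, confidence=0.99):
    # two-sided critical z value, e.g. z_{0.005} ~ 2.576 at 99% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    max_std_err = half_width / z          # largest allowed standard error
    return math.ceil((std_dev / max_std_err) ** 2)  # std.err = std.dev/sqrt(n)

n = observers_needed(half_width=0.5, std_dev=1.0)  # 27 with exact z; slide gets 28 with z = 2.6
```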

  8. Today’s Lecture • Hypothesis testing • Two kinds of errors • ROC analysis • Visibility of blobs • Quantitative quality measures

  9. Hypothesis Testing Example 256x256 128x128

  10. Question: Which is better? • Testing method • Quality rating by multiple viewers • Compute per-viewer difference in quality • Find mean and standard deviation of the difference • Compute the z score (mean/std. error) • How to interpret?

  11. Null Hypothesis (H0) • Assume that the mean is zero (no difference) • Find a range of z that would occur when the mean is zero. • Accept the null hypothesis if z is in this range (no difference) • Reject null hypothesis if z falls outside the range

  12. We show the normal distribution with mean 0 and σ = 1. The shaded area has probability 0.95, and the two white areas each have probability 0.025. If we observe Gaussian variables with mean zero, 95% of the observations will have values between -1.96 and 1.96. The probability outside this interval (0.05 in this case) is called the significance level of the test.
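The 1.96 cutoff and the 0.95/0.05 split quoted above can be recovered from the standard normal distribution, for example with Python's standard library:

```python
# Recover the two-sided 95% cutoff and the significance level
# from the standard normal CDF and its inverse.
from statistics import NormalDist

nd = NormalDist()                      # mean 0, standard deviation 1
upper = nd.inv_cdf(0.975)              # ~1.96; each white tail carries 0.025
inside = nd.cdf(1.96) - nd.cdf(-1.96)  # ~0.95, the shaded area
significance = 1 - inside              # ~0.05, the significance level
```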

  13. Two Kinds of Errors • In a decision task with two alternatives, there are two kinds of errors • Suppose the alternatives are ‘healthy’ (H0) and ‘sick’ (H1) • Type I error: say sick if healthy • Type II error: say healthy if sick

  14. X - observation, t - threshold a = Pr[X > t | H0] (Type I error) b = Pr[X < t | H1] (Type II error) Choosing t, we can trade off between the two types of errors
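A sketch of this trade-off, assuming Gaussian observation distributions under the two hypotheses (the means and standard deviations below are illustrative, not lecture data):

```python
# Type I / Type II trade-off as the threshold t varies. The distributions
# are assumed Gaussian in the spirit of the later binormal slides; the
# parameters here are made up for illustration.
from statistics import NormalDist

negative = NormalDist(0.0, 1.0)  # H0: healthy
positive = NormalDist(1.5, 1.0)  # H1: sick (assumed mean and std. dev.)

def errors(t):
    a = 1 - negative.cdf(t)  # Type I:  Pr[X > t | H0]
    b = positive.cdf(t)      # Type II: Pr[X < t | H1]
    return a, b

# Raising t lowers the Type I error but raises the Type II error
a_lo, b_lo = errors(0.5)
a_hi, b_hi = errors(1.0)
```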

  15. Examples of Threshold Measurement • Show blobs and noise.

  16. Examples • Measurement of psychophysical threshold • Detectible flicker, detectible contrast • Medical diagnosis • Negative (healthy), positive (sick) • Home security • Friend or terrorist

  17. Probability of Error • Pe = P0a + P1b • Why bother with two types of error, why not just Pe? • In many cases, P1 << P0! • The two types of error typically have different consequences, so we don’t want to mix them together.

  18. ROC Terminology • ROC — receiver operating characteristic • H0 — friend, negative; H1 — enemy, positive • Pr[X > t | H0] = probability of false alarm = probability of false positive = PFP = a • Pr[X > t | H1] = probability of detection = probability of true positive = PTP = 1 - b

  19. The ROC • The ROC shows the tradeoff between PFP and PTP as the threshold is varied

  20. How Do We Estimate the ROC? • Radiological diagnosis setting • Positive and negative cases • The true diagnosis must be evaluated by a reliable method • Cases are evaluated by radiologist(s), who report the data on a discrete scale • 1 = definitely negative, 5 = definitely positive
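A sketch of turning such 5-category rating data into empirical ROC operating points (the category counts below are invented for illustration; thresholding at category k means "call positive if the rating is category k or higher"):

```python
# Empirical ROC operating points from 5-category rating counts.
# Category 1 = definitely negative ... 5 = definitely positive.
neg_counts = [30, 15, 8, 5, 2]  # assumed counts of truly negative cases per category
pos_counts = [3, 5, 9, 16, 27]  # assumed counts of truly positive cases per category

def operating_points(neg, pos):
    n_neg, n_pos = sum(neg), sum(pos)
    points = []
    for k in range(len(neg) + 1):   # threshold between category k and k+1
        pfp = sum(neg[k:]) / n_neg  # fraction of negatives called positive
        ptp = sum(pos[k:]) / n_pos  # fraction of positives called positive
        points.append((pfp, ptp))
    return points                   # runs from (1, 1) down to (0, 0)

pts = operating_points(neg_counts, pos_counts)
```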

  21. Binormal Model • Negative: Normal, mean = 0, st. dev. = 1 • Positive: Normal, mean = a, st. dev. = b

  22. Some Binormal Plots b = 0.5, a = 1, 2, 3 b = 1, a = 1, 2, 3 b = 2, a = 1, 2, 3 Az ~ area under ROC curve

  23. Az formula
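The formula image from this slide does not survive in the transcript. For the binormal model of slide 21 (negative ~ N(0, 1), positive ~ N(a, b)), the standard result is

```latex
A_z = \Pr[X_{\mathrm{pos}} > X_{\mathrm{neg}}]
    = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right)
```

since X_pos - X_neg is normal with mean a and standard deviation sqrt(1 + b^2), and the area under the ROC equals the probability that a positive case scores above a negative one.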

  24. Experimental Framework • Set of positive and negative cases • Need a reliable diagnosis • Radiologist interprets cases • Radiologist reports on a scale • Certainly Negative, Probably Negative, Unclear, Probably Positive, Certainly Positive • Estimate ROC, Az • Compare results from studies with conventional and image-processed images

  25. Statistical Estimation • The result of an experiment is a sample • If N is very large, the estimate is the same as the theory • For practical N, the estimate is the true value ± error Figure: standard deviations of the estimates of a, b, and Az for varying numbers of observations. Horizontal axis: number of positive and negative observations. Top curve: sa; middle curve: sb. Trials were with a = b = 1.

  26. Another Approach: Nonparametric Model • Ordinal Dominance Graph • Donald Bamber, Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph, J. of Math. Psych. 12: 387-415 (1975). • Method: compute frequencies of occurrence for different threshold levels from the sample and plot on a probability scale. • Monte Carlo: a = 1, b = 1, 10 positive and 10 negative cases
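By Bamber's identity, the area under the empirical ROC equals Pr[positive score > negative score] plus half the probability of a tie, which can be estimated by counting over all (positive, negative) pairs. A direct sketch, with scores invented for illustration:

```python
# Nonparametric area estimate in Bamber's spirit: count, over all
# (positive, negative) pairs, how often the positive case outscores
# the negative one, crediting ties by one half.
def empirical_auc(pos_scores, neg_scores):
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

auc = empirical_auc([2.1, 1.4, 3.0, 0.9], [0.5, 1.4, -0.2, 1.1])
```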

  27. Ordinal Dominance - examples (10, 10) (20, 20) (100, 100) (40, 40)

  28. Theory • Area is asymptotically normal • Worst case

  29. Metz • University of Chicago ROC project: • http://www-radiology.uchicago.edu/krl/toppage11.htm • Software for estimating Az, along with sample st. dev. and confidence intervals. • Versatile

  30. Example • Compare image processing with conventional imaging • Design: • Should we use the same cases for both? • Yes, for a better comparison • But then the results from the two studies are correlated! • Metz software can handle this

  31. Design Parameters
  (1) Unpaired (uncorrelated) test results: the two "conditions" are applied to independent case samples, for example, two different diagnostic tests performed on different patients, or two different radiologists who make probability judgments concerning the presence of a specified disease in different images.
  (2) Fully paired (correlated) test results: data from both conditions are available for each case in a single case sample. The two "conditions" in each test-result pair could correspond, for example, to two different diagnostic tests performed on the same patient, or to two different radiologists who make probability judgments concerning the presence of a specified disease in the same image.
  (3) Partially paired test results: for example, two different diagnostic tests performed on the same patient sample plus some additional patients who received only one of the diagnostic tests.

  32. Summary: ROC • Compare modalities, evaluate effectiveness of a modality • Need to know the truth • Issue: two kinds of error • Specificity, Sensitivity • Scalar comparison not suitable • Statistical problem • More data, better answer • ROC methodology • Metz methods and software allow computation of confidence intervals, significance for tests with practical design parameters

  33. Recent Work • Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: The case of unequal variance structures across modalities. Academic Radiol. 8: 605-615, 2001 • Gefen S, Tretiak OJ, Piccoli CW, Donohue KD, Petropulu AP, Shankar PM, Dumane VA, Huang L, Kutay MA, Genis V, Forsberg F, Reid JM, Goldberg BB, ROC Analysis of Ultrasound Tissue Characterization Classifiers For Breast Cancer Diagnosis, IEEE Trans. Med. Im. In press
