
Statistics and Image Evaluation


Presentation Transcript


  1. Statistics and Image Evaluation Oleh Tretiak Medical Imaging Systems Fall, 2002

  2. Which Image is Better? Case A Case B

  3. Method • Rating on a 1 to 5 scale (5 is best) • Rating performed by 21 subjects • Statistics: • Average, maximum, minimum, standard deviation, standard error for Case A, Case B • Difference per viewer between Case A and Case B, and the above statistics on the difference
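A minimal sketch of these statistics in Python, using made-up ratings for the 21 viewers (the actual experimental data is not reproduced in this transcript):

```python
# Sketch of the rating statistics described above. The ratings here are
# invented for illustration; only the procedure follows the slide.
import math
import statistics

ratings_a = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 5]
ratings_b = [2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2]

def summarize(xs):
    n = len(xs)
    sd = statistics.stdev(xs)          # sample standard deviation
    return {
        "mean": statistics.mean(xs),
        "max": max(xs),
        "min": min(xs),
        "std_dev": sd,
        "std_err": sd / math.sqrt(n),  # standard error of the mean
    }

# Per-viewer difference between Case A and Case B, then the same statistics
diff = [a - b for a, b in zip(ratings_a, ratings_b)]
stats_diff = summarize(diff)
z = stats_diff["mean"] / stats_diff["std_err"]  # the z value used on the Conclusions slide
```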

  4. Observations

  5. Statistics

  6. Conclusions • The image in Case A has a higher average ranking than the one in Case B. • The highest ranking for B equals the lowest ranking for A; in all other cases, the rankings for B are lower than those for A. • Consider the difference (rightmost column on the previous slide). The ratio of the average to the standard error (the z value) is 2.62/0.22 ≈ 12. Such a large z is extremely unlikely if the means are the same.

  7. Experimental Design • How many observers should we use to test differences between pictures? • We expect the difference between the two kinds of pictures to be 0.5 ranking units, and we expect the standard deviation of the difference measurement to be 1.0 (see the experiment above). We would like to determine this reliably, so we want the confidence interval on the mean to be [mean - 0.5, mean + 0.5] at 99% confidence. How many observers should we use? • Answer: z0.005 = 2.6. The standard error must be 0.5/2.6 = 0.19. Std. err. = std. dev./sqrt(n), therefore n = (1.0/0.19)^2 ≈ 28
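The same sample-size calculation can be written as a small function (a sketch only; note that the slide rounds z0.005 to 2.6, which gives n = 28, while the exact value z ≈ 2.576 gives 27):

```python
# Sample-size calculation from the Experimental Design slide:
# how many observers so that the 99% confidence interval on the mean
# difference has half-width 0.5, given std. dev. of the difference = 1.0.
import math
from statistics import NormalDist

def observers_needed(half_width, std_dev, confidence=0.99):
    # two-sided critical z value, e.g. z_{0.005} ~ 2.576 at 99% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    max_std_err = half_width / z          # largest allowed standard error
    return math.ceil((std_dev / max_std_err) ** 2)  # std.err = std.dev/sqrt(n)

n = observers_needed(half_width=0.5, std_dev=1.0)  # 27 with exact z; slide gets 28 with z = 2.6
```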

  8. Today’s Lecture • Hypothesis testing • Two kinds of errors • ROC analysis • Visibility of blobs • Quantitative quality measures

  9. Hypothesis Testing Example 256x256 128x128

  10. Question: Which is better? • Testing method • Quality rating by multiple viewers • Compute per-viewer difference in quality • Find mean and standard deviation of the difference • Compute the z score (mean/std. error) • How to interpret?

  11. Null Hypothesis (H0) • Assume that the mean is zero (no difference) • Find a range of z that would occur when the mean is zero. • Accept the null hypothesis if z is in this range (no difference) • Reject null hypothesis if z falls outside the range

  12. We show the normal distribution with mean 0 and σ = 1. The shaded area has probability 0.95, and the two white areas each have probability 0.025. If we observe Gaussian variables with mean zero, 95% of the observations will have values between -1.96 and 1.96. The probability outside this interval (0.05 in this case) is called the significance level of the test.
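The 1.96 cutoff and the 0.95/0.05 split quoted above can be recovered from the standard normal distribution, for example with Python's standard library:

```python
# Recover the two-sided 95% cutoff and the significance level
# from the standard normal CDF and its inverse.
from statistics import NormalDist

nd = NormalDist()                      # mean 0, standard deviation 1
upper = nd.inv_cdf(0.975)              # ~1.96; each white tail carries 0.025
inside = nd.cdf(1.96) - nd.cdf(-1.96)  # ~0.95, the shaded area
significance = 1 - inside              # ~0.05, the significance level
```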

  13. Two Kinds of Errors • In a decision task with two alternatives, there are two kinds of errors • Suppose the alternatives are ‘healthy’ (H0) and ‘sick’ (H1) • Type I error: say sick if healthy • Type II error: say healthy if sick

  14. X - observation, t - threshold a = Pr[X > t | H0] (Type I error) b = Pr[X < t | H1] (Type II error) Choosing t, we can trade off between the two types of errors
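A sketch of this trade-off, assuming Gaussian observation distributions under the two hypotheses (the means and standard deviations below are illustrative, not lecture data):

```python
# Type I / Type II trade-off as the threshold t varies. The distributions
# are assumed Gaussian in the spirit of the later binormal slides; the
# parameters here are made up for illustration.
from statistics import NormalDist

negative = NormalDist(0.0, 1.0)  # H0: healthy
positive = NormalDist(1.5, 1.0)  # H1: sick (assumed mean and std. dev.)

def errors(t):
    a = 1 - negative.cdf(t)  # Type I:  Pr[X > t | H0]
    b = positive.cdf(t)      # Type II: Pr[X < t | H1]
    return a, b

# Raising t lowers the Type I error but raises the Type II error
a_lo, b_lo = errors(0.5)
a_hi, b_hi = errors(1.0)
```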

  15. Examples of Threshold Measurement • Show blobs and noise.

  16. Examples • Measurement of psychophysical threshold • Detectible flicker, detectible contrast • Medical diagnosis • Negative (healthy), positive (sick) • Home security • Friend or terrorist

  17. Probability of Error • Pe = P0a + P1b • Why bother with two types of error, why not just Pe? • In many cases, P1 << P0! • The two types of error typically have different consequences, so we don’t want to mix them together.

  18. ROC Terminology • ROC — receiver operating characteristic • H0 — friend, negative; H1 — enemy, positive • Pr[X > t | H0] = probability of false alarm = probability of false positive = PFP = a • Pr[X > t | H1] = probability of detection = probability of true positive = PTP = 1 - b

  19. The ROC • The ROC shows the tradeoff between PFP and PTP as the threshold is varied

  20. How Do We Estimate the ROC? • Radiological diagnosis setting • Positive and negative cases • The true diagnosis must be evaluated by a reliable method • Cases are evaluated by radiologist(s), who report the data on a discrete scale • 1 = definitely negative, 5 = definitely positive
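A sketch of turning such 5-category rating data into empirical ROC operating points (the category counts below are invented for illustration; thresholding at category k means "call positive if the rating is category k or higher"):

```python
# Empirical ROC operating points from 5-category rating counts.
# Category 1 = definitely negative ... 5 = definitely positive.
neg_counts = [30, 15, 8, 5, 2]  # assumed counts of truly negative cases per category
pos_counts = [3, 5, 9, 16, 27]  # assumed counts of truly positive cases per category

def operating_points(neg, pos):
    n_neg, n_pos = sum(neg), sum(pos)
    points = []
    for k in range(len(neg) + 1):   # threshold between category k and k+1
        pfp = sum(neg[k:]) / n_neg  # fraction of negatives called positive
        ptp = sum(pos[k:]) / n_pos  # fraction of positives called positive
        points.append((pfp, ptp))
    return points                   # runs from (1, 1) down to (0, 0)

pts = operating_points(neg_counts, pos_counts)
```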

  21. Binormal Model • Negative: Normal, mean = 0, st. dev. = 1 • Positive: Normal, mean = a, st. dev. = b

  22. Some Binormal Plots b = 0.5, a = 1, 2, 3 b = 1, a = 1, 2, 3 b = 2, a = 1, 2, 3 Az ~ area under ROC curve

  23. Az formula
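The formula image from this slide does not survive in the transcript. For the binormal model of slide 21 (negative ~ N(0, 1), positive ~ N(a, b)), the standard result is

```latex
A_z = \Pr[X_{\mathrm{pos}} > X_{\mathrm{neg}}]
    = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right)
```

since X_pos - X_neg is normal with mean a and standard deviation sqrt(1 + b^2), and the area under the ROC equals the probability that a positive case scores above a negative one.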

  24. Experimental Framework • Set of positive and negative cases • Need a reliable diagnosis • Radiologist interprets cases • Radiologist reports on a scale • Certainly Negative, Probably Negative, Unclear, Probably Positive, Certainly Positive • Estimate ROC, Az • Compare results from studies with conventional and image-processed images

  25. Statistical Estimation • The result of an experiment is a sample • If N is very large, the estimate is the same as the theory • For practical N, the estimate is the true value ± error Figure: standard deviations of the estimates of a, b, and Az for varying numbers of observations. Horizontal axis: number of positive and negative observations. Top curve: sa; middle curve: sb. Trials were with a = b = 1.

  26. Another Approach: Nonparametric Model • Ordinal Dominance Graph • Donald Bamber, Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph, J. of Math. Psych. 12: 387-415 (1975). • Method: compute frequencies of occurrence for different threshold levels from the sample and plot on a probability scale. • Monte Carlo: a = 1, b = 1, 10 positive and 10 negative cases
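By Bamber's identity, the area under the empirical ROC equals Pr[positive score > negative score] plus half the probability of a tie, which can be estimated by counting over all (positive, negative) pairs. A direct sketch, with scores invented for illustration:

```python
# Nonparametric area estimate in Bamber's spirit: count, over all
# (positive, negative) pairs, how often the positive case outscores
# the negative one, crediting ties by one half.
def empirical_auc(pos_scores, neg_scores):
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

auc = empirical_auc([2.1, 1.4, 3.0, 0.9], [0.5, 1.4, -0.2, 1.1])
```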

  27. Ordinal Dominance - examples (10, 10) (20, 20) (100, 100) (40, 40)

  28. Theory • Area is asymptotically normal • Worst case

  29. Metz • University of Chicago ROC project: • http://www-radiology.uchicago.edu/krl/toppage11.htm • Software for estimating Az, along with sample st. dev. and confidence intervals. • Versatile

  30. Example • Compare image processing with conventional imaging • Design: • Should we use the same cases for both? • Yes, for a better comparison • But then the results from the two studies are correlated! • Metz software can handle this

  31. Design Parameters
  (1) Unpaired (uncorrelated) test results: the two "conditions" are applied to independent case samples, for example, two different diagnostic tests performed on different patients, or two different radiologists who make probability judgments concerning the presence of a specified disease in different images.
  (2) Fully paired (correlated) test results: data from both conditions are available for each case in a single case sample. The two "conditions" in each test-result pair could correspond, for example, to two different diagnostic tests performed on the same patient, or to two different radiologists who make probability judgments concerning the presence of a specified disease in the same image.
  (3) Partially paired test results: for example, two different diagnostic tests performed on the same patient sample plus some additional patients who received only one of the diagnostic tests.

  32. Summary: ROC • Compare modalities, evaluate effectiveness of a modality • Need to know the truth • Issue: two kinds of error • Specificity, Sensitivity • Scalar comparison not suitable • Statistical problem • More data, better answer • ROC methodology • Metz methods and software allow computation of confidence intervals, significance for tests with practical design parameters

  33. Recent Work • Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: The case of unequal variance structures across modalities. Academic Radiol. 8: 605-615, 2001 • Gefen S, Tretiak OJ, Piccoli CW, Donohue KD, Petropulu AP, Shankar PM, Dumane VA, Huang L, Kutay MA, Genis V, Forsberg F, Reid JM, Goldberg BB, ROC Analysis of Ultrasound Tissue Characterization Classifiers For Breast Cancer Diagnosis, IEEE Trans. Med. Im. In press
