Verification of Probability Forecasts at Points
WMO QPF Verification Workshop
Prague, Czech Republic
14-16 May 2001
Barbara G. Brown
NCAR
Boulder, Colorado, U.S.A.
bgb@ucar.edu


QPF Verification Workshop

Why probability forecasts?

“…the widespread practice of ignoring uncertainty when formulating and communicating forecasts represents an extreme form of inconsistency and generally results in the largest possible reductions in quality and value.”

--Murphy (1993)


Outline
• Background and basics
• Types of events
• Types of forecasts
• Representation of probabilistic forecasts in the verification framework


Outline continued
• Verification approaches: focus on 2-category case
• Measures
• Graphical representations
• Using statistical models
• Signal detection theory
• Ensemble forecast verification
• Extensions to multi-category verification problem
• Comparing probabilistic and categorical forecasts
• Connections to value
• Summary, conclusions, issues


Background and basics
• Types of events:
• Two-category
• Multi-category
• Two-category events:
• Either event A happens or event B happens
• Examples: Rain/No-rain, Hail/No-hail
• Multi-category event
• Event A, B, C, …, or Z happens
• Example: Precipitation categories (< 1 mm, 1-5 mm, 5-10 mm, etc.)


Background and basics cont.
• Types of forecasts
• Completely confident
• Forecast probability is either 0 or 1
• Example: Rain/No rain
• Probabilistic
• Objective (deterministic, statistical, ensemble-based)
• Subjective
• Probability is stated explicitly


Background and basics cont.
• Representation of probabilistic forecasts in the verification framework

x = 0 or 1 (the observation: the event did not occur or did occur)

f = 0, …, 1.0 (the forecast probability)

f may be limited to only certain values between 0 and 1

• Joint distribution: p(f,x), where x = 0, 1

Ex: If there are 12 possible values of f, then p(f,x) comprises 24 elements
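As a rough illustration of the joint distribution, the sketch below tabulates p(f,x) as relative frequencies from paired forecasts and observations. The function name and the eight-case sample are hypothetical and are not taken from the presentation.

```python
from collections import Counter

def joint_distribution(forecasts, observations):
    """Return {(f, x): relative frequency} for paired forecasts and 0/1 observations."""
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have equal length")
    n = len(forecasts)
    counts = Counter(zip(forecasts, observations))
    return {pair: c / n for pair, c in counts.items()}

# Hypothetical sample: forecast probabilities and matching 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
p_fx = joint_distribution(f, x)
```

Only (f, x) pairs that actually occur appear as keys; with 3 distinct forecast values there are at most 6 elements, in line with the 12-values, 24-elements example.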


Background and basics, cont.
• Factorizations: Conditional and marginal probabilities
• Calibration-Refinement factorization:
• p(f,x) = p(x|f)p(f)
• p(x=0|f) = 1 – p(x=1|f) = 1 – E(x|f)
• Only one number is needed to specify the distribution p(x|f) for each f

• p(f) is the frequency of use of each forecast probability
• Likelihood-Base Rate factorization:
• p(f,x) = p(f|x) p(x)
• p(x) is the relative frequency of a Yes observation (e.g., the sample climatology of precipitation); p(x) = E(x)
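The two factorizations can be sketched from sample counts as follows. This is a minimal illustration under the assumption that all probabilities are estimated as relative frequencies; the function name and data are hypothetical.

```python
from collections import Counter

def factorizations(forecasts, observations):
    """Estimate the terms of both factorizations of p(f, x) from a sample."""
    n = len(forecasts)
    joint = Counter(zip(forecasts, observations))
    n_f = Counter(forecasts)      # how often each forecast value was used
    n_x = Counter(observations)   # how often each outcome occurred
    p_f = {fv: c / n for fv, c in n_f.items()}                    # refinement
    p_x = {xv: c / n for xv, c in n_x.items()}                    # base rate
    p_x1_given_f = {fv: joint[(fv, 1)] / n_f[fv] for fv in n_f}   # calibration
    p_f_given_x = {(fv, xv): joint[(fv, xv)] / n_x[xv]            # likelihoods
                   for fv in n_f for xv in n_x}
    return p_f, p_x1_given_f, p_x, p_f_given_x

# Hypothetical sample of forecast probabilities and 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
p_f, p_x1_f, p_x, p_f_x = factorizations(f, x)

# Both factorizations recover the same joint probability p(f=0.9, x=1).
lhs = p_x1_f[0.9] * p_f[0.9]
rhs = p_f_x[(0.9, 1)] * p_x[1]
```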


Attributes [from Murphy and Winkler (1992)]

[Table of attributes not recovered; the only surviving entry is "(sharpness)".]


Verification approaches: 2x2 case

Completely confident forecasts:

Use the counts in this table to compute various common statistics (e.g., POD, POFD, H-K, FAR, CSI, Bias, etc.)
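The contingency table itself did not survive in this transcript. The sketch below assumes the forecast-then-observed labeling used later in the deck (YY = forecast Yes and observed Yes, YN = forecast Yes and observed No, NY = forecast No and observed Yes, NN = forecast No and observed No) and the usual definitions of these statistics. The counts are hypothetical.

```python
def two_by_two_measures(yy, yn, ny, nn):
    """Common 2x2 statistics; counts are labeled forecast-then-observed."""
    pod = yy / (yy + ny)           # probability of detection
    pofd = yn / (yn + nn)          # probability of false detection
    far = yn / (yy + yn)           # false alarm ratio
    csi = yy / (yy + yn + ny)      # critical success index
    bias = (yy + yn) / (yy + ny)   # frequency bias: forecast Yes / observed Yes
    hk = pod - pofd                # Hanssen-Kuipers discriminant
    return {"POD": pod, "POFD": pofd, "FAR": far,
            "CSI": csi, "Bias": bias, "H-K": hk}

# Hypothetical counts, for illustration only.
stats = two_by_two_measures(yy=30, yn=20, ny=10, nn=140)
```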


Relationships among measures in the 2x2 case

Many of the measures in the 2x2 case are strongly related in surprisingly complex ways.

For example:


[Figure: curves labeled 0.10, 0.30, 0.50, 0.70, and 0.90.] The lines indicate different values of POD and POFD (where POD = POFD).

From Brown and Young (2000)


CSI as a function of p(x=1) and POD=POFD

[Figure: curves labeled 0.9, 0.7, 0.5, 0.3, and 0.1.]


CSI as a function of FAR and POD
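The plotted relationships are not recoverable from the transcript, but two identities that follow directly from the 2x2 definitions can be sketched. Both assume CSI = YY / (YY + YN + NY) and FAR = YN / (YY + YN); the function names are hypothetical.

```python
def csi_from_pod_far(pod, far):
    """CSI written in terms of POD and FAR: 1 / (1/POD + 1/(1 - FAR) - 1)."""
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)

def csi_when_pod_equals_pofd(base_rate, h):
    """CSI when POD = POFD = h, i.e. forecasts independent of observations."""
    return base_rate * h / (base_rate + h - base_rate * h)

# Check the first identity against hypothetical counts yy=30, yn=20, ny=10.
csi_direct = 30 / (30 + 20 + 10)
csi_identity = csi_from_pod_far(pod=30 / 40, far=20 / 50)

# Check the second against counts yy=10, ny=10, yn=40, nn=40 (POD = POFD = 0.5).
csi_counts = 10 / (10 + 40 + 10)
csi_formula = csi_when_pod_equals_pofd(base_rate=0.2, h=0.5)
```

The second function shows why CSI depends on the base rate p(x=1) even when the forecasts carry no information.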


Measures for Probabilistic Forecasts

Summary measures:

• Expectation
• Conditional: E(f|x=0), E(f|x=1), E(x|f)
• Marginal: E(f), E(x) = p(x=1)
• Correlation
• Joint distribution
• Variability
• Conditional: Var(f|x=0), Var(f|x=1), Var(x|f)
• Marginal: Var(f), Var(x) = E(x)[1-E(x)]


From Murphy and Winkler (1992): Summary measures for joint and marginal distributions:

From Murphy and Winkler (1992): Summary measures for conditional distributions:


Performance measures
• Brier score:
• Analogous to MSE; negative orientation;
• For perfect forecasts: BS=0
• Brier skill score:
• Analogous to MSE skill score
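The score formulas are missing from this transcript. The sketch below assumes the usual definitions: the Brier score as the mean squared difference between f and x, and the skill score as 1 - BS/BS_ref. Using the sample climatology as the reference forecast is an assumption, as are the function names and data.

```python
def brier_score(forecasts, observations):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    n = len(forecasts)
    return sum((fv - xv) ** 2 for fv, xv in zip(forecasts, observations)) / n

def brier_skill_score(forecasts, observations):
    """Skill relative to a constant forecast of the sample base rate (assumed reference)."""
    base_rate = sum(observations) / len(observations)
    bs = brier_score(forecasts, observations)
    bs_ref = brier_score([base_rate] * len(observations), observations)
    # bs_ref is zero when every observation is identical; skill is undefined then.
    return 1.0 - bs / bs_ref

# Hypothetical sample of forecast probabilities and 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
bs = brier_score(f, x)
bss = brier_skill_score(f, x)
```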


From Murphy and Winkler (1992):


Brier score displays

From Shirey and Erickson, http://www.nws.noaa.gov/tdl/synop/amspapers/masmrfpap.htm


Brier score displays

From http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm


Decomposition of the Brier Score

Break Brier score into more elemental components:

BS = Reliability - Resolution + Uncertainty

where the Reliability and Resolution terms are sums over the I distinct probability values.

Then, the Brier Skill Score can be re-formulated as

BSS = (Resolution - Reliability) / Uncertainty
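A minimal sketch of the decomposition, assuming the standard three-term form in which cases are grouped by distinct forecast value: reliability is the count-weighted mean of (f_i - p(x=1|f_i))^2, resolution is the count-weighted mean of (p(x=1|f_i) - p(x=1))^2, and uncertainty is p(x=1)[1 - p(x=1)]. The notation on the original slide is not recoverable, so the names here are hypothetical.

```python
from collections import defaultdict

def brier_decomposition(forecasts, observations):
    """Return (reliability, resolution, uncertainty) with BS = REL - RES + UNC."""
    n = len(forecasts)
    groups = defaultdict(list)          # observations grouped by forecast value
    for fv, xv in zip(forecasts, observations):
        groups[fv].append(xv)
    xbar = sum(observations) / n        # sample base rate p(x=1)
    rel = sum(len(xs) * (fv - sum(xs) / len(xs)) ** 2
              for fv, xs in groups.items()) / n
    res = sum(len(xs) * (sum(xs) / len(xs) - xbar) ** 2
              for xs in groups.values()) / n
    unc = xbar * (1.0 - xbar)
    return rel, res, unc

# Hypothetical sample of forecast probabilities and 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
rel, res, unc = brier_decomposition(f, x)
bs_from_terms = rel - res + unc
bss_from_terms = (res - rel) / unc
```

For this sample the three terms recombine to a Brier score of 0.17, the same value obtained from the mean squared difference directly.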


Graphical representations of measures
• Reliability diagram: p(x=1|fi) vs. fi
• Sharpness diagram: p(f)
• Attributes diagram
• Reliability, Resolution, Skill/No-skill
• Discrimination diagram: p(f|x=0) and p(f|x=1)

Together, these diagrams provide a relatively complete picture of the quality of a set of probability forecasts


Reliability and Sharpness (from Wilks 1995)

[Figure: example reliability diagrams with panels labeled Climatology; Minimal RES; Underforecasting; Good RES, at expense of REL; Reliable forecasts of rare event; Small sample size.]


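Reliability and Sharpness (from Murphy and Winkler 1992)

[Figure: reliability and sharpness diagrams for St. Louis 12-24 h PoP forecasts, cool season; curves labeled Sub and Model, with reference lines labeled No skill and No RES.]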


Attributes diagram (from Wilks 1995)


Icing forecast examples


Use of statistical models to describe verification features
• Exploratory study by Murphy and Wilks (1998)
• Case study
• Use regression model to model reliability
• Use Beta distribution to model p(f) as measure of sharpness
• Use multivariate diagram to display combinations of characteristics
• Promising approach that is worthy of more investigation


Fit Beta distribution to p(f)

2 parameters: p, q

Ideal: p<1; q<1
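One simple way to obtain the two Beta parameters is the method of moments, sketched below. This estimator is an assumption made for illustration; the fitting procedure used in the cited study is not stated in the transcript. With p < 1 and q < 1 the fitted density is U-shaped, meaning forecasts cluster near 0 and 1.

```python
def fit_beta_moments(forecasts):
    """Method-of-moments estimates of Beta(p, q) from forecast probabilities."""
    n = len(forecasts)
    m = sum(forecasts) / n
    v = sum((fv - m) ** 2 for fv in forecasts) / n
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("sample moments are not compatible with a Beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Hypothetical forecast probabilities.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
p_hat, q_hat = fit_beta_moments(f)
```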


Fit regression to reliability diagram [p(x|f) vs. f]

2 parameters: b0, b1

Murphy and Wilks (1998)
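A sketch of a two-parameter fit, assuming a straight line p(x=1|f) ≈ b0 + b1 f estimated by ordinary least squares on the individual cases (equivalent to a count-weighted fit through the points of the reliability diagram). The linear form and the estimator are assumptions; the cited study may specify the regression differently.

```python
def fit_reliability_line(forecasts, observations):
    """Least-squares intercept b0 and slope b1 of observed outcome on forecast value."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_x = sum(observations) / n
    sxx = sum((fv - mean_f) ** 2 for fv in forecasts)
    sxy = sum((fv - mean_f) * (xv - mean_x)
              for fv, xv in zip(forecasts, observations))
    b1 = sxy / sxx
    b0 = mean_x - b1 * mean_f
    return b0, b1

# Hypothetical sample; perfectly reliable forecasts would give b0 = 0, b1 = 1.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
b0, b1 = fit_reliability_line(f, x)
```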


Summary Plot

Murphy and Wilks (1998)


Signal Detection Theory (SDT)
• Approach that has commonly been applied in medicine and other fields
• Brought to meteorology by Ian Mason (1982)
• Evaluates the ability of forecasts to discriminate between occurrence and non-occurrence of an event
• Summarizes characteristics of the Likelihood-Base Rate decomposition of the framework
• Tests model performance relative to a specific threshold
• Ignores calibration
• Allows comparison of categorical and probabilistic forecasts


Mechanics of SDT
• Based on likelihood-base rate decomposition

p(f,x) = p(f|x) p(x)

• Basic elements:
• Hit rate (HR)
• HR = POD = YY / (YY+NY)
• Estimate of p(f=1|x=1)
• False Alarm Rate (FA)
• FA = POFD = YN / (YN + NN)
• Estimate of p(f=1|x=0)
• Relative Operating Characteristic curve
• Plot HR vs. FA
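For probability forecasts, one (FA, HR) point can be obtained per decision threshold by treating f >= threshold as a Yes forecast. The sketch below assumes that thresholding convention; the function name and data are hypothetical.

```python
def roc_points(forecasts, observations, thresholds):
    """Return one (FA, HR) pair per threshold, forecasting Yes when f >= threshold."""
    n_yes = sum(observations)
    n_no = len(observations) - n_yes
    points = []
    for t in thresholds:
        yy = sum(1 for fv, xv in zip(forecasts, observations) if fv >= t and xv == 1)
        yn = sum(1 for fv, xv in zip(forecasts, observations) if fv >= t and xv == 0)
        points.append((yn / n_no, yy / n_yes))   # (FA, HR)
    return points

# Hypothetical sample of forecast probabilities and 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
points = roc_points(f, x, thresholds=[0.1, 0.5, 0.9])
```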


ROC Examples: Mason(1982)


ROC Examples: Icing forecasts


ROC
• Area under the ROC is a measure of forecast skill
• Values less than 0.5 indicate negative skill
• Measurement of ROC Area often is better if a normal distribution model is used to model HR and FA
• Area can be underestimated if curve is approximated by straight line segments
• Harvey et al. (1992); Mason (1982); Wilson (2000)
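The straight-line-segment approximation mentioned above corresponds to the trapezoidal rule, sketched here for a list of (FA, HR) points. The normal-distribution model is not implemented; this only illustrates the simpler estimate that can understate the area. The function name and points are hypothetical.

```python
def roc_area_trapezoid(points):
    """Area under the ROC using straight segments between (FA, HR) points."""
    # Add the trivial end points: never forecast Yes, always forecast Yes.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (fa0, hr0), (fa1, hr1) in zip(pts, pts[1:]):
        area += (fa1 - fa0) * (hr0 + hr1) / 2.0
    return area

# Hypothetical ROC points for three thresholds.
area = roc_area_trapezoid([(1.0, 1.0), (0.4, 1.0), (0.2, 2.0 / 3.0)])

# A single point on the diagonal gives the no-skill area of 0.5.
no_skill_area = roc_area_trapezoid([(0.5, 0.5)])
```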


Idealized ROC (Mason 1982)

[Figure: idealized ROC curves for pairs of distributions f(x=0) and f(x=1), with panels labeled S = 2, S = 1, and S = 0.5, where S = s0 / s1.]


Comparison of Approaches

Brier score
• Based on squared error
• Strictly proper scoring rule
• Calibration is an important factor; lack of calibration impacts scores
• Decompositions provide insight into several performance attributes
• Dependent on frequency of occurrence of the event

ROC
• Considers forecasts’ ability to discriminate between Yes and No events
• Calibration is not a factor
• Less dependent on frequency of occurrence of event
• Provides verification information for individual decision thresholds


Relative operating levels
• Analogous to the ROC, but from the Calibration-Refinement perspective (i.e., given the forecast)
• Curves based on
• Correct Alarm Ratio: YY / (YY + YN)
• Miss Ratio: NY / (NY + NN)
• These statistics are estimates of two conditional probabilities:
• Correct Alarm Ratio: p(x=1|f=1)
• Miss Ratio: p(x=1|f=0)
• For a system with no skill, p(x=1|f=1) = p(x=1|f=0) = p(x)


ROC Diagram

(Mason and Graham 1999)


ROL Diagram

(Mason and Graham 1999)


Verification of ensemble forecasts
• Output of ensemble forecasting systems can be treated as
• A probability distribution
• A probability
• A categorical forecast
• Probabilistic forecasts from ensemble systems can be verified using standard approaches for probabilistic forecasts
• Common methods
• Brier score
• ROC
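One common way to turn an ensemble into a probability is to count the fraction of members that meet a threshold. The sketch below assumes equally weighted members; the threshold, the member values, and the 0.5 cut-off used to derive a categorical forecast are all hypothetical.

```python
def exceedance_probability(members, threshold):
    """Fraction of ensemble members at or above the threshold."""
    return sum(1 for m in members if m >= threshold) / len(members)

# Hypothetical 10-member precipitation forecast (mm).
members = [0.0, 0.4, 1.2, 2.5, 3.1, 0.0, 5.6, 0.8, 1.0, 0.2]
p_rain = exceedance_probability(members, threshold=1.0)

# A categorical forecast can then be derived by choosing a probability cut-off.
categorical_yes = p_rain >= 0.5
```

The resulting probabilities can be passed to the same Brier score and ROC calculations used for any other probability forecast.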


Example: Palmer et al. (2000): Reliability

[Figure: reliability diagrams for the ECMWF ensemble and the multi-model ensemble; panels labeled <0 and <1.]


Example: Palmer et al. (2000): ROC

[Figure: ROC curves for the ECMWF ensemble and the multi-model ensemble.]


Verification of ensemble forecasts (cont.)

A number of methods have been developed specifically for use with ensemble forecasts. For example:

• Rank histograms
• Rank position of observations relative to ensemble members
• Ideal: Uniform distribution
• Non-ideal can occur for many reasons (Hamill 2001)
• Ensemble distribution approach (Wilson et al. 1999)
• Fit distribution to ensemble
• Determine probability associated with that observation
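The rank histogram can be sketched as follows: for each case, count how many members fall below the observation, then tally those ranks. This minimal version ignores ties between members and the observation, which would need separate handling (for example, for zero precipitation). The function name and data are hypothetical.

```python
def rank_histogram(ensembles, observations):
    """Tally the rank of each observation among its ensemble members."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)       # n members give n + 1 possible ranks
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

# Hypothetical 3-member ensembles for four cases, with matching observations.
ensembles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 5.0], [3.0, 3.5, 4.0]]
observations = [2.5, 1.0, 7.0, 3.2]
counts = rank_histogram(ensembles, observations)
```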


Rank histograms:


Extensions to multiple categories
• Examples:
• QPF with several thresholds/categories
• Approach 1: Evaluate each category on its own
• Compute Brier score, reliability, ROC, etc. for each category separately
• Problems:
• Some categories will be very rare, have few Yes observations
• Throws away important information related to the ordering of predictands and magnitude of error


Example: Brier skill score for several categories

From http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm


Extensions to multiple categories (cont.)
• Approach 2: Evaluate all categories simultaneously
• Rank Probability Score (RPS)
• Analogous to Brier Score for multiple categories
• Skill score:
• Decompositions analogous to BS, BSS
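The RPS formulas are missing from the transcript. The sketch below assumes the common form for a single forecast: the sum over ordered categories of the squared difference between cumulative forecast and cumulative observed probabilities. Some authors also divide by the number of categories minus one; that normalization is not applied here. A skill score would then follow the same 1 - score/reference pattern as the Brier skill score.

```python
def ranked_probability_score(category_probs, observed_category):
    """RPS for one forecast over ordered categories (0-based observed index)."""
    cum_forecast = 0.0
    score = 0.0
    for k, p in enumerate(category_probs):
        cum_forecast += p
        cum_observed = 1.0 if k >= observed_category else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score

# Hypothetical three-category forecast; the middle category occurred.
rps = ranked_probability_score([0.2, 0.5, 0.3], observed_category=1)

# With two categories the score reduces to the squared error (f - x)^2.
rps_two = ranked_probability_score([0.3, 0.7], observed_category=1)
```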


Multiple categories: Examples of alternative approaches
• Continuous ranked probability score (Bouttier 1994; Brown 1974; Matheson and Winkler 1976; Unger 1985) and decompositions (Hersbach 2000)
• Analogous to RPS with infinite number of classes
• Decompose into Reliability and Resolution/uncertainty components
• Multi-category reliability diagrams (Hamill 1997)
• Measures calibration in a cumulative sense
• Reduces impact of categories with few forecasts
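For an ensemble treated as an empirical distribution, the continuous ranked probability score can be computed without explicit integration, using the identity CRPS = mean|member - obs| - 0.5 × mean|member_i - member_j|. The sketch below assumes equally weighted members; the function name and values are hypothetical.

```python
def crps_ensemble(members, obs):
    """CRPS of the empirical distribution of ensemble members for one observation."""
    m = len(members)
    spread_to_obs = sum(abs(v - obs) for v in members) / m
    spread_within = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return spread_to_obs - spread_within

# Hypothetical two-member ensemble with the observation between the members.
crps = crps_ensemble([1.0, 3.0], obs=2.0)

# For a single member the score reduces to the absolute error.
crps_single = crps_ensemble([1.5], obs=2.0)
```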


MCRD example (Hamill 1997)


Connections to value
• Cost-Loss ratio model
• Optimal to protect whenever C < pL, i.e., p > C/L, where p is the probability of adverse weather


Wilks’ Value Score (Wilks 2001)
• VS is the percent improvement in value between climatological and perfect information as a function of C/L
• VS is impacted by (lack of) calibration
• VS can be generalized for particular/idealized distributions of C/L
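A sketch of a value score under the cost-loss model, assuming the usual expected-expense formulation: the user protects (paying C) when f > C/L and otherwise incurs L if the event occurs; climatological information costs min(C, sL) and perfect information costs sC, where s is the sample base rate. The function returns a fraction (multiply by 100 for a percentage). The exact form used in the cited paper is not shown in the transcript, so this is an assumption.

```python
def value_score(forecasts, observations, cost, loss):
    """(E_clim - E_fcst) / (E_clim - E_perf) for one cost/loss ratio."""
    n = len(forecasts)
    ratio = cost / loss
    # Mean expense when acting on the forecasts: protect if f > C/L.
    e_fcst = sum(cost if fv > ratio else loss * xv
                 for fv, xv in zip(forecasts, observations)) / n
    base_rate = sum(observations) / n
    e_clim = min(cost, base_rate * loss)   # always protect or never protect
    e_perf = base_rate * cost              # protect only when the event occurs
    # Undefined when e_clim equals e_perf (e.g. the event always occurs).
    return (e_clim - e_fcst) / (e_clim - e_perf)

# Hypothetical sample with C = 1 and L = 4, so C/L = 0.25.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
vs = value_score(f, x, cost=1.0, loss=4.0)
```

Evaluating the function over a range of C/L values gives the kind of curve described in the first bullet.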


VS example: Wilks (2001)

Las Vegas, PoP April 1980 – March 1987


VS example: Icing forecasts


VS: Beta model example (Wilks 2001)


Richardson approach
• ROC context
• Calibration errors don’t impact the score


Miscellaneous issues
• Quantifying the uncertainty in verification measures
• Issue: Spatial and temporal correlation
• A few approaches:
• Parametric methods. Ex: Seaman et al. (1996)
• Robust methods (confidence intervals for medians). Ex: Brown et al. (1997); Velleman and Hoaglin (1981)
• Bootstrap methods. Ex: Hamill (1999); Kane and Brown (2000)

• Treatment of observations as probabilistic?
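As one illustration of the bootstrap idea, the sketch below builds a percentile interval for the Brier score by resampling forecast-observation pairs with replacement. It resamples cases independently, so it does not address the spatial and temporal correlation issue raised above; some form of block resampling would be needed for that. The function name, defaults, and data are hypothetical.

```python
import random

def bootstrap_brier_interval(forecasts, observations, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score (cases resampled independently)."""
    rng = random.Random(seed)
    pairs = list(zip(forecasts, observations))
    n = len(pairs)
    scores = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        scores.append(sum((fv - xv) ** 2 for fv, xv in sample) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical sample of forecast probabilities and 0/1 observations.
f = [0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1]
x = [0, 0, 1, 0, 1, 1, 0, 0]
lower, upper = bootstrap_brier_interval(f, x)
```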


Conclusions
• Basis for evaluating probability forecasts was established many years ago (Brier, Murphy, Epstein)
• Recent renewal in interest has led to new ideas
• Still more to do
• Develop and implement a cohesive set of meaningful and useful methods
• Develop greater understanding of methods we have and how they inter-relate


Verification of Probabilistic QPFs: Selected References

Brown, B.G., G. Thompson, R.T. Bruintjes, R. Bullock and T. Kane, 1997: Intercomparison of in-flight icing algorithms. Part II: Statistical verification results. Weather and Forecasting, 12, 890-914.

Davis, C., and F. Carr, 2000: Summary of the 1998 workshop on mesoscale model verification. Bulletin of the American Meteorological Society, 81, 809-819.

Hamill, T.M., 1997: Reliability diagrams for multicategory probabilistic forecasts. Weather and Forecasting, 12, 736–741.

Hamill, T.M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Weather and Forecasting, 14, 155-167.

Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review, 129, 550-560.


References (cont.)

Harvey, L.O., Jr., K.R. Hammond, C.M. Lusk, and E.F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Monthly Weather Review, 120, 863-883.

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15, 559-570.

Hsu, W.-R., and A.H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. International Journal of Forecasting, 2, 285-293.

Kane, T.L., and B.G. Brown, 2000: Confidence intervals for some verification measures – a survey of several methods. Preprints, 15th Conference on Probability and Statistics in the Atmospheric Sciences, 8-11 May, Asheville, NC, U.S.A., American Meteorological Society (Boston), 46-49.


References (cont.)

Mason, I., 1982: A model for assessment of weather forecasts. Australian Meteorological Magazine, 30, 291-303.

Mason, I., 1989: Dependence of the critical success index on sample climate and threshold probability. Australian Meteorological Magazine, 37, 75-81.

Mason, S., and N.E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Weather and Forecasting, 14, 713-725.

Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281–293.

Murphy, A.H., and D.S. Wilks, 1998: A case study of the use of statistical models in forecast verification: Precipitation probability forecasts. Weather and Forecasting, 13, 795-810.


References (cont.)

Murphy, A.H., and R.L. Winkler, 1992: Diagnostic verification of probability forecasts. International Journal of Forecasting, 7, 435-455.

Richardson, D.S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126, 649-667.

Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of Yes-No forecasts. Australian Meteorological Magazine, 45, 49-53.

Stanski, H., L.J. Wilson, and W.R. Burrows, 1989: Survey of common verification methods in meteorology. WMO World Weather Watch Tech. Rep. 8, 114 pp.

Velleman, P.F., and D.C. Hoaglin, 1981: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, 354 pp.


References (cont.)

Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences, Academic Press, San Diego, CA, 467 pp.

Wilks, D.S., 2001: A skill score based on economic value for probability forecasts. Meteorological Applications, in press.

Wilson, L.J., W.R. Burrows, and A. Lanzinger, 1999: A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127, 956-970.
