Verification Continued… Holly C. Hartmann

Presentation Transcript


  1. Verification Continued… Holly C. Hartmann, Department of Hydrology and Water Resources, University of Arizona, hollyoregon@juno.com. RFC Verification Workshop, 08/14/2007

  2. Agenda
  1. Introduction to Verification
   • Applications, Rationale, Basic Concepts
   • Data Visualization and Exploration
   • Deterministic Scalar Measures
  2. Categorical Measures – KEVIN WERNER
   • Deterministic Forecasts
   • Ensemble Forecasts
  3. Diagnostic Verification
   • Reliability
   • Discrimination
   • Conditioning/Structuring Analyses
  4. Lab Session/Group Exercise
   • Developing Verification Strategies
   • Connecting to Forecast Operations and Users

  3. Probabilistic Ensemble Forecasts From: California-Nevada River Forecast Center

  4. Probabilistic Ensemble Forecasts From: California-Nevada River Forecast Center

  5. Probabilistic Ensemble Forecasts From: A. Hamlet, University of Washington

  6. From: A. Hamlet, University of Washington

  7. Probabilistic Ensemble Forecasts From: A. Hamlet, University of Washington

  8. Talagrand Diagram – Also Called Ranked Histogram • Identifies systematic flaws of an ensemble prediction system. • Shows effectiveness of ensemble distribution in sampling the observations. • Does not indicate that the ensemble will be of practical use.

  9. Principle Behind the Talagrand Diagram. [Schematic: ensemble members drawn as vertical bars, observations as dots falling between or outside them.] With only one ensemble member, all observations (2/2 = 100%) fall "outside" the ensemble. With two ensemble members, two out of three observations (2/3 = 67%) should fall outside. With three ensemble members, two out of four observations (2/4 = 50%) should fall outside. In general, for N ensemble members, a fraction 2/(N+1) of the observations should fall outside the ensemble. Adapted from A. Persson, 2006
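In symbols: a statistically consistent N-member ensemble defines N+1 rank bins, each equally likely to contain the observation,

$$P(\text{obs in bin } i) = \frac{1}{N+1}, \qquad i = 1, \dots, N+1,$$

so the two outermost bins together should collect $2/(N+1)$ of the observations: 100% for N = 1, 67% for N = 2, and 50% for N = 3, matching the examples above.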

  10. Talagrand Diagram Computation Example. Four sample ensemble members (E1–E4) for daily flow forecasts (produced from reforecasts using carryover each year).
  Step 1: Rank members lowest to highest for each year; four members yield five bins (Bin 1 below the lowest member, Bin 5 above the highest).
  Step 2: Determine which bin the corresponding observation falls into.

  YEAR   E1   E2   E3   E4   OBS   Bin
  1981   42   74   82   90   112   5
  1982   65  143  223  227   206   3
  1983   82  192  295  300   301   5
  1984  211  397  514  544   516   4
  1985  142  291  349  356   348   3
  1986  114  277  351  356    98   1
  1987   98  170  204  205   156   2
  1988   69  169  229  236   245   5
  1989   94  219  267  270   233   3
  1990   59  175  244  250   248   4
  1991  108  189  227  228   227   4
  1992   94  135  156  158   167   5

  11. Talagrand Diagram Computation Example (continued).
  Step 3: Tally how many observations fall in each bin.
  Step 4: Plot the frequency of observations for each ranked bin.

  Bin     1   2   3   4   5
  Tally   1   1   3   3   4

  12. Talagrand Diagram Computation Example (continued). The tallies from Step 3, plotted as frequencies across Bins 1–5, form the ranked histogram.
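For readers who want to reproduce the example, here is a minimal Python sketch of Steps 1-4 (not from the original slides; the array names `ens` and `obs` are ours, and numpy is assumed). Ties between the observation and a member are credited to the upper bin, which reproduces the slide's Bin 4 for 1991.

```python
import numpy as np

# Ensemble forecasts E1-E4 and observations for 1981-1992 (table above).
ens = np.array([
    [42, 74, 82, 90], [65, 143, 223, 227], [82, 192, 295, 300],
    [211, 397, 514, 544], [142, 291, 349, 356], [114, 277, 351, 356],
    [98, 170, 204, 205], [69, 169, 229, 236], [94, 219, 267, 270],
    [59, 175, 244, 250], [108, 189, 227, 228], [94, 135, 156, 158]])
obs = np.array([112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167])

n = ens.shape[1]                       # 4 members -> 5 bins
ranked = np.sort(ens, axis=1)          # Step 1: rank members low to high

# Step 2: bin = 1 + number of members at or below the observation
bins = 1 + (obs[:, None] >= ranked).sum(axis=1)

# Step 3: tally observations per bin (index 0 unused, so drop it)
tally = np.bincount(bins, minlength=n + 2)[1:]
print(tally)                           # -> [1 1 3 3 4], as on the slide

# Step 4: plot `tally` as a bar chart over Bins 1-5 (the ranked histogram).
```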

  13. Talagrand Diagram: 25 traces/ensemble, 375 observations
  • "L-shaped": observations too often larger (smaller) than the ensemble; indicates an under- (over-) forecasting bias.
  • "U-shaped": observations too often fall outside the ensemble; indicates the ensemble spread is too small.
  • "N-shaped" (dome-shaped): observations too rarely fall outside the ensemble; indicates the ensemble spread is too big.
  • Flat: observations fall uniformly across the ensemble; indicates an appropriately sized ensemble distribution.

  14. Talagrand Diagram Example: Interpretation? Using the tally from the computation example above (Bins 1–5: 1, 1, 3, 3, 4), what shape is the resulting frequency histogram, and what does it indicate?

  15. Distributions-oriented Forecast Evaluation leads to Diagnostic Verification It’s all about conditional and marginal distributions! P(O|F), P(F|O), P(F), P(O) Reliability, Discrimination, Sharpness, Uncertainty
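These four distributions are related through the two factorizations of the joint distribution of forecasts and observations (the Murphy–Winkler framework):

$$p(f, o) \;=\; \underbrace{p(o \mid f)\,p(f)}_{\text{reliability } \times \text{ sharpness}} \;=\; \underbrace{p(f \mid o)\,p(o)}_{\text{discrimination } \times \text{ uncertainty}}$$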

  16. Forecast Reliability – P(O|F). For a specified forecast condition, what does the distribution of observations look like? [Paired plots: relative frequency of observed (0 to 1) vs. forecast probability (0 to 1).] User perspective: "When you say 80% chance of flood flows, how often do flood flows actually happen?" And likewise: "When you say 20% chance of flood flows, how often do flood flows actually happen?"

  17. Reliability (Attributes) Diagram – Reliability, Sharpness • Good reliability: close to the diagonal • Sharpness diagram (p(f)): histogram of forecasts in each probability bin; shows the marginal distribution of forecasts. The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?

  18. Reliability Diagram Example Computation.
  Step 1: Choose a threshold value on which to base the probability forecasts. For simplicity we choose the mean forecast over all years and all ensemble members (= 208).

  YEAR   E1   E2   E3   E4   OBS
  1981   42   74   82   90   112
  1982   65  143  223  227   206
  1983   82  192  295  300   301
  1984  211  397  514  544   516
  1985  142  291  349  356   348
  1986  114  277  351  356    98
  1987   98  170  204  205   156
  1988   69  169  229  236   245
  1989   94  219  267  270   233
  1990   59  175  244  250   248
  1991  108  189  227  228   227
  1992   94  135  156  158   167

  19–20. Reliability Diagram Example Computation.
  Step 2: Choose the number of forecast probability categories to use (five here: 0, 0.25, 0.5, 0.75, 1.0).
  Step 3: For each forecast, calculate the forecast probability below the threshold value, i.e. the fraction of ensemble members below 208.
  P(peakfor < 208) by year: 1981: 1.0; 1982: 0.5; 1983: 0.5; 1984: 0.0; 1985: 0.25; 1986: 0.25; 1987: 1.0; 1988: 0.5; 1989: 0.25; 1990: 0.5; 1991: 0.5; 1992: 1.0

  21–22. Reliability Diagram Example Computation.
  Step 4: Group the observations into groups of equal forecast probability (or, more generally, into forecast probability categories).
  P(peakfor < 208) = 0.0: 516
  P(peakfor < 208) = 0.25: 348, 98, 233
  P(peakfor < 208) = 0.5: 206, 301, 245, 248, 227
  P(peakfor < 208) = 0.75: none
  P(peakfor < 208) = 1.0: 112, 156, 167

  23–24. Reliability Diagram Example Computation.
  Step 5: For each group, calculate the frequency of observations below the threshold value, 208 cfs.
  P(obs peak < 208 given [P(peakfor < 208) = 0.0]) = 0/1 = 0.0
  P(obs peak < 208 given [P(peakfor < 208) = 0.25]) = 1/3 = 0.33
  P(obs peak < 208 given [P(peakfor < 208) = 0.5]) = 1/5 = 0.2
  P(obs peak < 208 given [P(peakfor < 208) = 0.75]) = 0/0 = N/A
  P(obs peak < 208 given [P(peakfor < 208) = 1.0]) = 3/3 = 1.0

  25. Reliability Diagram Example Computation Step 6: Plot centroid of the forecast category (just points in our case) on the x-axis against the observed frequency within each forecast category on the y-axis. Include the 45 degree diagonal for reference.

  26. Reliability Diagram Example Computation Step 7: Include sharpness plot showing the number of observation/forecast pairs in each category.
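The same table can drive the whole reliability computation. Here is a minimal sketch of Steps 1-7, again with our own array names and numpy only (not from the original slides); the printed frequencies should match Step 5 above (0.0, 0.33, 0.2, N/A, 1.0), and a plotting library would supply Steps 6-7.

```python
import numpy as np

# Same ensemble/observation table as in the Talagrand sketch.
ens = np.array([
    [42, 74, 82, 90], [65, 143, 223, 227], [82, 192, 295, 300],
    [211, 397, 514, 544], [142, 291, 349, 356], [114, 277, 351, 356],
    [98, 170, 204, 205], [69, 169, 229, 236], [94, 219, 267, 270],
    [59, 175, 244, 250], [108, 189, 227, 228], [94, 135, 156, 158]])
obs = np.array([112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167])

threshold = round(ens.mean())              # Step 1: mean forecast = 208

# Steps 2-3: forecast probability below threshold = fraction of members below it
p_fcst = (ens < threshold).mean(axis=1)    # values in {0, 0.25, 0.5, 0.75, 1.0}

# Steps 4-5: group by forecast probability; observed frequency below threshold
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    group = obs[p_fcst == p]
    if group.size == 0:
        print(f"P(fcst) = {p:.2f}: no forecasts in category")
    else:
        freq = (group < threshold).mean()
        print(f"P(fcst) = {p:.2f}: obs freq = {freq:.2f} (n = {group.size})")

# Steps 6-7: plot obs freq (y) vs. forecast probability (x) against the
# 45-degree diagonal, with a sharpness histogram of the group sizes n.
```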

  27. Reliability Diagram – Reliability, Sharpness – P(O|F) • Good reliability: close to the diagonal • Sharpness diagram (p(f)): histogram of forecasts in each probability bin; shows the marginal distribution of forecasts • Good resolution: a wide range of observed frequencies corresponding to the forecast probabilities • Skill: related to the Brier Skill Score, in reference to sample climatology (not historical climatology). The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?

  28. Attributes Diagram – Reliability, Resolution, Skill/No-skill • No-skill line: halfway between the perfect-reliability line and the no-resolution line, with sample climatology as the reference • No-resolution line: the overall relative frequency of observations (sample climatology) • Points closer to the perfect-reliability line than to the no-resolution line: those subsamples of the probabilistic forecast contribute positively to overall skill (as defined by the BSS) in reference to sample climatology
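In symbols: with forecasts binned into K probability categories (n_k cases with forecast probability $f_k$, observed relative frequency $\bar{o}_k$, and sample climatology $\bar{o}$), the Brier score decomposes (Murphy, 1973) as

$$BS = \frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar{o}_k)^2 \;-\; \frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2 \;+\; \bar{o}\,(1 - \bar{o}),$$

i.e. reliability minus resolution plus uncertainty. A bin contributes positively to the BSS when its reliability penalty is smaller than its resolution credit, $(f_k - \bar{o}_k)^2 < (\bar{o}_k - \bar{o})^2$; the boundary $\bar{o}_k = (f_k + \bar{o})/2$ is exactly the no-skill line, halfway between the perfect-reliability and no-resolution lines.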

  29. Interpretation of Reliability Diagrams. [Six example panels: climatology; minimal resolution; underforecasting; good resolution at the expense of reliability; reliable forecasts of a rare event; small sample size.] Source: Wilks (1995)

  30. Interpretation of Reliability Diagrams. Reliability, P(O|F): does the frequency of occurrence match your probability statement? Identifies conditional bias. [Plot: relative frequency of observations vs. forecast probability, with the no-resolution line marked.]

  31. EVS Reliability Diagram Examples. Arkansas-Red Basin, 24-hr flows, lead times of 1-14 days. 85th-percentile observed flows (high flows): good reliability at shorter lead times; long leads miss high events. 25th-percentile observed flows (low flows): sharp forecasts, but low resolution. From: J. Brown, EVS Manual

  32. Historical seasonal water supply outlooks Colorado River Basin Morrill, Hartmann, and Bales, 2007

  33. Reliability: Colorado Basin ESP Seasonal Supply Outlooks. [Panels of reliability diagrams (relative frequency of observations vs. forecast probability) for the high (30%), mid (40%), and low (30%) categories, for forecasts issued Jan 1 through Jun 1.]
  1) UC Jan-July (7-mo. lead) and LC Jan-May (5-mo. lead): few high-probability forecasts; good reliability between 10-70% probability; reliability improves.
  2) UC Apr-July (4-mo. lead) and LC Mar-May (3-mo. lead): these months show the best reliability; low resolution limits reliability.
  3) UC Jun-July (2-mo. lead) and LC Apr-May (2-mo. lead): reliability decreases for these later forecasts as resolution increases; UC is good at the extremes.
  Franz, Hartmann, and Sorooshian, 2003

  34. Discrimination – P(F|O). For a specified observation category, what do the forecast distributions look like? "When dry conditions happen, what do the forecasts usually look like?" You sure hope that forecasts look different when there's a drought, compared to when there's a flood!

  35. Discrimination – P(F|O). You sure hope that forecasts look different when there's a drought, compared to when there's a flood! Example: NWS CPC seasonal climate outlooks, sorted into DRY cases (lowest tercile), 1995-2001, all forecasts, all lead times. [Two panels: relative frequency of indicated forecast vs. forecast probability (0.00, 0.33, 1.00), with climatology marked and curves for the probability of dry and the probability of wet. One panel shows good discrimination; the other, not much discrimination!]

  36. Discrimination: Lower Colorado ESP Supply Outlooks. When unusually low flows happened: P(F | low flows), with "low" below the 30th percentile, for Jan-May supply. [Jan 1 panel: relative frequency of forecasts vs. forecast probability, with curves for the high, mid, and low categories.] There is some discrimination: the early forecasts warned "high flows less likely." Franz, Hartmann, and Sorooshian (2003)

  37. Discrimination: Lower Colorado ESP Supply Outlooks (continued). [Adds the Apr 1 panel for the Apr-May period.] Good discrimination: the forecasts were saying 1) high and mid flows less likely, and 2) low flows more likely. Franz, Hartmann, and Sorooshian (2003)

  38. Discrimination: Colorado Basin ESP Supply Outlooks. For observed flows in the lowest 30% of the historic distribution. [Panels: relative frequency of forecasts vs. forecast probability for the high (30%), mid (40%), and low (30%) categories; Upper Colorado Basin Jan-July (7-mo. lead) and June-July (2-mo. lead); Lower Colorado Basin Jan-May (5-mo. lead) and April-May (2-mo. lead); forecasts issued Jan 1, Apr 1, and Jun 1.]
  1) High flows less likely. 2) No discrimination between mid and low flows. 3) Both UC and LC show good discrimination for low flows at the 2-month lead time.
  Franz, Hartmann, and Sorooshian (2003)

  39. Historical seasonal water supply outlooks Colorado River Basin

  40. Discrimination: CDF Perspective. The all-observation CDF is plotted and color-coded by tercile. Forecast ensemble members are sorted into three groups according to which tercile their associated observation falls into, and the CDF for each group is plotted in the corresponding color (e.g., high is blue). Credit: K. Werner

  41. Discrimination. In this case there is relatively good discrimination, since the three conditional forecast CDFs separate from one another. Credit: K. Werner

  42. Discrimination Example Computation.
  Step 1: Order the observations and divide the ordered list into categories. Here we use terciles: low ≤ 167, middle 206-245, high ≥ 248.

  YEAR   E1   E2   E3   E4   OBS   Tercile
  1981   42   74   82   90   112   Low
  1982   65  143  223  227   206   Middle
  1983   82  192  295  300   301   High
  1984  211  397  514  544   516   High
  1985  142  291  349  356   348   High
  1986  114  277  351  356    98   Low
  1987   98  170  204  205   156   Low
  1988   69  169  229  236   245   Middle
  1989   94  219  267  270   233   Middle
  1990   59  175  244  250   248   High
  1991  108  189  227  228   227   Middle
  1992   94  135  156  158   167   Low
  Credit: K. Werner

  43. Discrimination Example Computation.
  Step 2: Group the forecast ensemble members according to the OBS tercile.
  Low OBS forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
  Credit: K. Werner

  44–45. Discrimination Example Computation.
  Step 2 (continued): Mid OBS forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
  Credit: K. Werner

  46–47. Discrimination Example Computation.
  Step 2 (continued): Hi OBS forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
  Credit: K. Werner

  48. Discrimination Example Computation.
  Step 3: Plot the all-observation CDF, color-coded by tercile (low ≤ 167; middle 206-245; high ≥ 248).
  Credit: K. Werner

  49. Discrimination Example Computation.
  Step 4: Add the CDFs of the forecasts conditioned on the observed terciles to the plot.
  Low OBS forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
  Mid OBS forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
  Hi OBS forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
  Credit: K. Werner

  50. Discrimination Example Computation Step 5: Discrimination is shown by the degree to which the conditional forecast CDFs are separated from each other. In this case, high forecasts discriminate better than mid and low forecasts. Credit: K. Werner
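A companion sketch for Steps 1-5 of this CDF computation (again, array names are ours and numpy is assumed; not from the original slides). For this sample, `np.percentile` with its default linear interpolation reproduces the slide's tercile assignments.

```python
import numpy as np

# Same ensemble/observation table as in the earlier sketches.
ens = np.array([
    [42, 74, 82, 90], [65, 143, 223, 227], [82, 192, 295, 300],
    [211, 397, 514, 544], [142, 291, 349, 356], [114, 277, 351, 356],
    [98, 170, 204, 205], [69, 169, 229, 236], [94, 219, 267, 270],
    [59, 175, 244, 250], [108, 189, 227, 228], [94, 135, 156, 158]])
obs = np.array([112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167])

# Step 1: tercile category of each observation (0 = low, 1 = mid, 2 = high)
edges = np.percentile(obs, [100 / 3, 200 / 3])
cat = np.digitize(obs, edges)

# Step 2: pool the ensemble members of all years sharing an OBS tercile
groups = {name: ens[cat == k].ravel()
          for k, name in enumerate(["low", "mid", "high"])}

# Steps 3-4: empirical CDF of each pooled group (x vs. y, one curve per group)
for name, vals in groups.items():
    x = np.sort(vals)
    y = np.arange(1, x.size + 1) / x.size
    print(name, list(zip(x, y))[:3], "...")   # first few CDF support points

# Step 5: discrimination = separation among the three conditional CDFs;
# here the "high" curve sits well to the right of the "low" curve.
```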
