
Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble


Presentation Transcript


  1. Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble
Tom Hopson1, Josh Hacker1, Yubao Liu1, Gregory Roux1, Wanli Wu1, Jason Knievel1, Tom Warner1, Scott Swerdlin1, John Pace2, Scott Halvorson2
1 National Center for Atmospheric Research; 2 U.S. Army Test and Evaluation Command

  2. Outline
• Motivation: ensemble forecasting and post-processing
• E-RTFDDA for Dugway Proving Ground
• Introduce quantile regression (QR; Koenker and Bassett, 1978)
• Post-processing procedure
• Verification results
• Warning: dynamically increasing ensemble dispersion can put ensemble-mean utility at risk
• Conclusions

  3. Goals of an EPS
• Predict the observed distribution of events and atmospheric states
• Predict the uncertainty in the day's prediction
• Predict the extreme events that are possible on a particular day
• Provide a range of possible scenarios for a particular forecast

  4. More technically …
• Greater accuracy of the ensemble-mean forecast (half the error variance of a single forecast)
• Likelihood of extremes
• Non-Gaussian forecast PDFs
• Ensemble spread as a representation of forecast uncertainty
=> All rely on the forecasts being calibrated
Further:
-- Calibration is essential for tailoring to a local application: NWP provides spatially and temporally averaged gridded forecast output
-- Applying gridded forecasts to point locations requires location-specific calibration to account for local spatial and temporal scales of variability (=> increasing ensemble dispersion)

  5. Dugway Proving Ground, Utah: e.g., temperature thresholds
• Includes random and systematic differences between members.
• Not an actual chance of exceedance unless calibrated.

  6. Challenges in probabilistic mesoscale prediction
• Model formulation
-- Bias (marginal and conditional)
-- Lack of variability caused by truncation and approximation
-- Non-universality of closure and forcing
• Initial conditions
-- Small scales are damped in analysis systems, and the model must develop them
-- Perturbation methods designed for medium-range systems may not be appropriate
• Lateral boundary conditions
-- After short time periods the lateral boundary conditions can dominate
-- Representing uncertainty in lateral boundary conditions is critical
• Lower boundary conditions
-- Dominate the boundary-layer response
-- Difficult to estimate uncertainty in lower boundary conditions

  7. RTFDDA and Ensemble-RTFDDA
yliu@ucar.edu; Liu et al. 2010: AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18-23, 2010

  8. The Ensemble Execution Module
[Schematic: N perturbed RTFDDA members each produce 36-48 h forecasts; together with observations these feed post-processing, input to decision support tools, and archiving and verification.]
yliu@ucar.edu; Liu et al. 2010: AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18-23, 2010

  9. Real-time Operational Products for DPG
[Panels: T mean and SD; surface and cross-sections (mean, spread, exceedance probability, spaghetti, …); 2-m mean T and wind; wind rose; likelihood of wind speed > 10 m/s (domains D1-D3); pin-point surface and profile products (mean, spread, exceedance probability, spaghetti, wind roses, histograms, …).]
• Operated at US Army DPG since Sep. 2007

  10. Forecast "calibration" or "post-processing"
[Schematic: forecast PDF vs. observation (flow rate [m3/s]) before and after calibration, illustrating corrections of "bias" and of "spread" or "dispersion".]
• Post-processing has corrected:
-- the "on average" bias
-- the under-representation of the 2nd moment of the empirical forecast PDF (i.e., corrected its "dispersion" or "spread")
• Our approach:
-- the under-utilized "quantile regression" approach
-- the probability distribution function "means what it says"
-- daily variations in the ensemble dispersion relate directly to changes in forecast skill => an informative ensemble skill-spread relationship

  11. Example of Quantile Regression (QR)
Our application: fitting temperature quantiles using QR conditioned on:
1) ranked forecast ensemble
2) ensemble mean
3) ensemble median
4) ensemble stdev
5) persistence

  12. Step 1: Determine the climatological quantiles; the climatological PDF serves as the "prior".
Step 2: For each quantile, use "forward step-wise cross-validation" to iteratively select the best regressor subset.
Regressor set: 1) reforecast ensemble, 2) ensemble mean, 3) ensemble stdev, 4) persistence, 5) LR (logistic regression) quantile (not shown).
Selection requirements: a) QR cost function is minimized, b) binomial distribution satisfied at 95% confidence. If the requirements are not met, retain the climatological "prior".
Step 3: Segregate the forecasts into differing ranges of ensemble dispersion and refit the models (Step 2) uniquely for each range.
Final result: a "sharper" posterior forecast PDF, represented by the interpolated quantiles. (A sketch of the per-quantile fitting step follows below.)
[Figures: climatological PDF (probability/°K vs. temperature [K]); observed and forecast time series; prior vs. posterior forecast PDF.]
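A minimal sketch of the per-quantile fitting idea, using statsmodels' QuantReg on synthetic data. The variable names (ens_mean, ens_std, persistence) and the synthetic-data generation are illustrative assumptions; the full procedure's step-wise regressor selection and dispersion binning are omitted here.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the regressors named on the slide (assumptions).
rng = np.random.default_rng(0)
n = 500
ens_mean = rng.normal(285.0, 5.0, n)                # ensemble-mean T [K]
ens_std = np.abs(rng.normal(1.5, 0.5, n))           # ensemble spread [K]
persistence = ens_mean + rng.normal(0.0, 2.0, n)    # persistence proxy
y = ens_mean + 0.8 * ens_std * rng.normal(size=n)   # synthetic observations

X = sm.add_constant(np.column_stack([ens_mean, ens_std, persistence]))

# Each quantile gets its own regression, so different regressors can
# dominate at different parts of the forecast PDF.
quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
fits = {q: sm.QuantReg(y, X).fit(q=q) for q in quantiles}

# Calibrated quantiles for one new forecast case (illustrative values).
x_new = sm.add_constant(np.array([[286.0, 2.0, 284.5]]), has_constant="add")
calibrated = {q: fits[q].predict(x_new)[0] for q in quantiles}
print(calibrated)
```

Interpolating across the fitted quantiles then yields the posterior forecast PDF described in the final step.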

  13. Utilizing verification measures in near-real-time …
Measures used:
• Rank histogram (converted to a scalar measure; a sketch follows below)
• Root mean square error (RMSE)
• Brier score
• Rank probability score (RPS)
• Relative operating characteristic (ROC) curve
• New measure of ensemble skill-spread utility
=> These are used for automated calibration-model selection via a weighted sum of the skill scores of each.
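A minimal sketch of a rank histogram reduced to a scalar flatness measure. The slide does not say which scalar is used, so the chi-square-style departure below is an assumed stand-in.

```python
import numpy as np

def rank_histogram(ens, obs):
    """ens: (n_cases, n_members); obs: (n_cases,). Counts of obs rank 0..N."""
    ranks = (ens < obs[:, None]).sum(axis=1)      # rank of obs in each ensemble
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

def flatness_score(counts):
    """Chi-square departure from a flat histogram; 0 means perfectly uniform."""
    expected = counts.sum() / counts.size
    return float(((counts - expected) ** 2 / expected).sum())
```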

  14. Problems with spread-skill correlation …
[Panels: spread-skill correlations by lead time -- 1 day: ECMWF r = (value missing), "perfect" r = 0.56; 4 day: ECMWF r = 0.33, "perfect" r = 0.68; 7 day: ECMWF r = 0.39, "perfect" r = 0.53; 10 day: ECMWF r = 0.36, "perfect" r = 0.49.]
• ECMWF spread-skill correlation (black) << 1
• Even the "perfect model" (blue) correlation is << 1 and varies with forecast lead time (a sketch of the diagnostic follows below)
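A minimal sketch of the spread-skill correlation diagnostic shown on this slide, assuming arrays ens (cases x members) and obs as above. Even a statistically consistent ensemble yields r well below 1, because each case's error is a single random draw from the forecast distribution.

```python
import numpy as np

def spread_skill_correlation(ens, obs):
    """Correlate per-case ensemble spread with ensemble-mean |error|."""
    spread = ens.std(axis=1, ddof=1)       # per-case ensemble spread
    err = np.abs(ens.mean(axis=1) - obs)   # ensemble-mean absolute error
    return float(np.corrcoef(spread, err)[0, 1])
```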

  15. 3-hr dewpoint time series, Station DPG S01
[Panels: before calibration vs. after calibration.]

  16. 42-hr dewpoint time series, Station DPG S01
[Panels: before calibration vs. after calibration.]

  17. PDFs: raw vs. calibrated
Blue is the "raw" ensemble; black is the calibrated ensemble; red is the observed value.
Notice the significant change in both the "bias" and the dispersion of the final PDF (also notice the PDF asymmetries).

  18. 3-hr dewpoint rank histograms, Station DPG S01

  19. 42-hr dewpoint rank histograms, Station DPG S01

  20. Skill Scores
• A single value to summarize performance.
• Reference forecast: the best naive guess, e.g., persistence or climatology.
• A perfect forecast implies that the object can be perfectly observed.
• Positively oriented: positive is good. (The generic form is given below.)
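For reference, the standard skill-score form (not spelled out on the slide) for a negatively oriented score $S$ whose perfect value is 0:

$$
\mathrm{SS} \;=\; \frac{S_{\mathrm{ref}} - S}{S_{\mathrm{ref}} - S_{\mathrm{perfect}}} \;=\; 1 - \frac{S}{S_{\mathrm{ref}}} \qquad (S_{\mathrm{perfect}} = 0),
$$

so SS = 1 for a perfect forecast, 0 for no improvement over the reference, and negative when the forecast is worse than the reference. The CRPS and RMSE skill scores on the next slide follow this form.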

  21. Skill Score Verification
[Panels: CRPS skill score and RMSE skill score.]
Reference forecasts: black -- raw ensemble; blue -- persistence.

  22. Computational Resource Questions
How best to utilize multi-model simulations (forecasts), especially if under-dispersive?
• Should more dynamical variability be sought? Or
• Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?

  23. 3-hr dewpoint rank histograms, Station DPG S01

  24. RMSE of ensemble members, Station DPG S01
[Panels: 3-hr and 42-hr lead times.]

  25. Significant calibration regressors, Station DPG S01
[Panels: 3-hr and 42-hr lead times.]

  26. Questions revisited
How best to utilize multi-model simulations (forecasts), especially if under-dispersive?
• Should more dynamical variability be sought? Or
• Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
Warning: adding more models can lead to decreasing utility of the ensemble mean (even if the ensemble is under-dispersive).

  27. Summary
• Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast -- different regressors for different quantiles and lead times
• This framework provides an umbrella for blending together multiple statistical correction approaches (logistic regression, etc., not shown) as well as multiple regressors
• "Step-wise cross-validation"-based calibration also provides a method to ensure forecast skill no worse than climatology and persistence for a variety of cost functions
• As shown here, significant improvements were made to the forecast's ability to represent its own potential forecast error (while improving sharpness):
-- uniform rank histogram
-- significant spread-skill relationship (new skill-spread measure)
• Care should be taken before "throwing more models" at an "under-dispersive" forecast problem
• Further questions: hopson@ucar.edu or yliu@ucar.edu

  28. Dugway Proving Ground

  29. Other options …
Assign dispersion bins, then:
2) average the error values in each bin, then correlate; or
3) calculate an individual rank histogram for each bin and convert it to a scalar measure.
(Option 2 is sketched below.)
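A minimal sketch of option 2, assuming the same ens/obs array layout as above; option 3 would instead apply the rank-histogram flatness measure (sketched earlier) within each bin.

```python
import numpy as np

def binned_spread_skill(ens, obs, n_bins=5):
    """Option 2: correlate bin-averaged spread with bin-averaged |error|."""
    spread = ens.std(axis=1, ddof=1)
    err = np.abs(ens.mean(axis=1) - obs)
    # Equal-population dispersion bins via spread quantiles.
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    mean_spread = np.array([spread[bins == b].mean() for b in range(n_bins)])
    mean_err = np.array([err[bins == b].mean() for b in range(n_bins)])
    return float(np.corrcoef(mean_spread, mean_err)[0, 1])
```

Averaging within bins removes much of the single-draw sampling noise that limits the case-by-case correlation on slide 14.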

  30. Before calibration => under-dispersive
Example: French Broad River. The black curve shows the observations; the colored curves are the ensemble.

  31. Rank Histogram Comparisons
[Panels: raw full ensemble vs. after calibration.]
After quantile regression, the rank histogram is more uniform (although now slightly over-dispersive).

  32. What Nash-Sutcliffe (RMSE) implies about utility
Frequency used for quantile fitting of Method I: best model = 76%, ensemble stdev = 13%, ensemble mean = 0%, ranked ensemble = 6%.

  33. Note
[Schematic: forecast PDF vs. observed discharge.]
Take-home message: for a "calibrated ensemble", the error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled. (A short derivation follows below.)
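A short derivation of the take-home result, under the assumption that a "calibrated ensemble" means the observation $y$ is exchangeable with the $N$ members $x_1,\dots,x_N$, i.e., all are independent draws with common mean $\mu$ and variance $\sigma^2$. Then

$$
\mathbb{E}\!\left[(x_i - y)^2\right] = 2\sigma^2,
\qquad
\mathbb{E}\!\left[(\bar{x} - y)^2\right] = \sigma^2\!\left(1 + \tfrac{1}{N}\right) \xrightarrow{\,N \to \infty\,} \sigma^2,
$$

so the error variance of the ensemble mean $\bar{x}$ approaches half that of any single member, with no assumption on the shape of the distribution beyond a finite variance.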

  34. What Nash-Sutcliffe (RMSE) implies about utility (cont.) -- degradation with increased ensemble size
Sequentially averaged models (ranked by NS score) and their resultant NS scores (a sketch of the computation follows below):
=> Notice the degradation of NS with an increasing number of models (with a peak at 2 models)
=> For an equitable multi-model, NS should rise monotonically
=> Maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
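A minimal sketch of the sequential-averaging diagnostic, assuming the ens (cases x members) and obs arrays used earlier: members are ranked by their individual NS score, and the cumulative mean of the top k is scored for each k.

```python
import numpy as np

def nash_sutcliffe(pred, obs):
    """NS efficiency: 1 - MSE relative to the variance of obs about its mean."""
    return 1.0 - np.sum((pred - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def sequential_ns(ens, obs):
    """NS of the cumulative mean of the top-k members, for k = 1..N."""
    scores = [nash_sutcliffe(ens[:, m], obs) for m in range(ens.shape[1])]
    order = np.argsort(scores)[::-1]  # best individual member first
    return [nash_sutcliffe(ens[:, order[:k + 1]].mean(axis=1), obs)
            for k in range(ens.shape[1])]
```

For an equitable multi-model the returned curve should rise monotonically with k; a peak at small k (here, 2 models) signals the degradation the slide warns about.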

  35. What Nash-Sutcliffe implies about utility (cont.) … using only the top 1/3 of models to rank and form the ensemble mean
… earlier results: initial frequency used for quantile fitting: best model = 76%, ensemble stdev = 13%, ensemble mean = 0%, ranked ensemble = 6%
Reduced-set frequency used for quantile fitting: best model = 73%, ensemble stdev = 3%, ensemble mean = 32%, ranked ensemble = 29%
=> There appear to be significant gains in the utility of the ensemble after "filtering" (except for the drop in stdev) … however, the "proof is in the pudding" …
=> Examine the verification skill measures …

  36. Skill score comparisons between the full and "filtered" ensemble sets
GREEN -- full calibrated multi-model; BLUE -- "filtered" calibrated multi-model; reference -- the uncalibrated set.
Points:
-- quite similar results for a variety of skill scores
-- both approaches give appreciable benefit over the original raw multi-model output
-- however, only in the CRPSS is there improvement of the "filtered" ensemble set over the full set
=> The post-processing method is fairly robust
=> More work (more filtering?)!
