
Ground-based evaluation of cloud forecasts



  1. Ground-based evaluation of cloud forecasts. Robin Hogan, Ewan O’Connor and Anthony Illingworth, University of Reading, UK. Clouds radar collaboration meeting, 17 Nov 09

  2. Project • Aim: to retrieve and evaluate the crucial cloud variables in forecast and climate models • 8+ models: global, mesoscale and high-resolution forecast models • Variables: cloud fraction, LWC, IWC, plus a number of others • Sites: 4 across Europe plus worldwide ARM sites • Period: several years, to avoid unrepresentative case studies • Current status: funded by the US Department of Energy Climate Change Prediction Program to apply to ARM data worldwide • Application to FP7 Infrastructure: bid 3 Dec 09, joint with EUSAAR (trace gases) and EARLINET (lidar/aerosol), as ACTRIS: Aerosols, Clouds and Trace gases Research Infrastructure Network

  3. Level 1b • Minimum instrument requirements at each site: cloud radar, lidar, microwave radiometer, rain gauge, and model or sonde data [Figure: example radar and lidar quicklooks]

  4. Level 1c • Instrument synergy product • Example of target classification and data quality fields [Figure: target classification showing ice, liquid, rain and aerosol categories]

  5. Level 2a/2b • Cloud products on (L2a) the observational grid and (L2b) the model grid • Water content and cloud fraction [Figures: L2a IWC on the radar/lidar grid; L2b cloud fraction on the model grid]

  6. Cloud fraction [Figure: time-height cloud fraction at Chilbolton; panels for observations and for the Met Office mesoscale, ECMWF global, Meteo-France ARPEGE, KNMI RACMO and Swedish RCA models]

  7. Cloud fraction in 7 models • All models except DWD underestimate mid-level cloud • Some have separate “radiatively inactive” snow (ECMWF, DWD); the Met Office has combined ice and snow but still underestimates cloud fraction • Wide range of low cloud amounts in the models • Not enough overcast boxes, particularly in the Met Office model • Mean and PDF for 2004 for Chilbolton, Paris and Cabauw, 0-7 km (Illingworth et al., BAMS 2007)

  8. Comparison of NAE (12 km) and 4 km model • 3 months of data from 2009; ideally global and 1.5 km models as well • Compare 12 km with 4 km, and also with 4 km averaged over 3×3 boxes • Is the performance any better at 4 km? Can it make more overcast skies? Any improvement in mid-level cloud? What about low-level clouds? What about getting the right cloud in the right place at the right time, i.e. skill scores?

  9. NAE 12 km • Mean fraction too low • Equitable threat score falls with cloud-fraction threshold and with age of forecast

  10. 4 km, averaged 3×3 • Mean fraction too low • Equitable threat score shows spin-up over the first 0-5 hrs

  11. 4 km, each box • Mean fraction too low • Equitable threat score: same as 3×3

  12. CLOUD FRACTION: NAE 12 km, 6-11 hr • Can’t make overcast

  13. CLOUD FRACTION: 4 km (3×3), 6-11 hr • Can’t make overcast

  14. CLOUD FRACTION: 4 km (each box), 6-11 hr • Can’t make overcast • BL clouds worse!

  15. IWC, NAE

  16. IWC, 4 km (3×3) • Improved PDF for mid-level cloud • Still missing the higher IWC values

  17. IWC, 4 km

  18. Diurnal cycle composite of clouds (Barrett, Hogan and O’Connor, GRL 2009) • Radar and lidar provide cloud boundaries and cloud properties above the site • Meteo-France: local mixing scheme, too little entrainment • SMHI: prognostic TKE scheme, no diurnal evolution • All other models have a non-local mixing scheme in unstable conditions and an explicit formulation for entrainment at cloud top: better performance over the diurnal cycle

  19. Contingency tables • a = model cloud, observed cloud (hit); b = model cloud, observed clear sky (false alarm); c = model clear sky, observed cloud (miss); d = model clear sky, observed clear sky (correct negative) • For a given set of observed events there are only 2 degrees of freedom in all possible forecasts (e.g. a and b), because 2 quantities are fixed: the number of events that occurred, n = a + b + c + d, and the base rate (observed frequency of occurrence), p = (a + c)/n. A minimal sketch of this bookkeeping follows below.
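
A minimal sketch of the contingency-table bookkeeping described above. The deck itself contains no code, so Python, the class name and the example counts are all assumptions for illustration:

```python
# 2x2 contingency table for cloud occurrence, following the slide's
# conventions: a = hits, b = false alarms, c = misses, d = correct negatives.
from dataclasses import dataclass

@dataclass
class ContingencyTable:
    a: int  # cloud observed and forecast (hit)
    b: int  # cloud forecast but not observed (false alarm)
    c: int  # cloud observed but not forecast (miss)
    d: int  # cloud neither observed nor forecast (correct negative)

    @property
    def n(self) -> int:
        """Total number of events: n = a + b + c + d."""
        return self.a + self.b + self.c + self.d

    @property
    def base_rate(self) -> float:
        """Observed frequency of occurrence: p = (a + c) / n."""
        return (self.a + self.c) / self.n

# With n and p fixed by the observations, only two of the four entries
# (e.g. a and b) remain free to vary.
t = ContingencyTable(a=30, b=10, c=20, d=40)
print(t.n, t.base_rate)  # 100 0.5
```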

  20. Desirable properties of verification measures • “Equitable”: all random forecasts receive an expected score of zero • Constant forecasts of occurrence or non-occurrence also score zero • Note that forecasting the right cloud climatology versus height, but with no other skill, should also score zero • Useful for rare events • Almost all measures are “degenerate” in that they asymptote to 0 or 1 for vanishingly rare events

  21. Symmetric extreme dependency score • EDS problems: easy to hedge (unless calibrated), and not equitable • Solved by defining a symmetric version: SEDS = [ln((a+b)/n) + ln((a+c)/n)] / ln(a/n) − 1 • All the benefits of EDS, none of the drawbacks! Hogan, O’Connor and Illingworth (2009, QJRMS)
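
A short sketch of the two scores: EDS = 2 ln[(a+c)/n] / ln(a/n) − 1 follows Stephenson et al. (2008), and SEDS follows the formula as reconstructed above from Hogan, O’Connor and Illingworth (2009). The example counts are invented:

```python
# Extreme Dependency Score (EDS) and its symmetric version (SEDS),
# computed from contingency-table counts a, b, c, d.
import math

def eds(a, b, c, d):
    """EDS = 2 ln[(a+c)/n] / ln(a/n) - 1 (Stephenson et al. 2008)."""
    n = a + b + c + d
    return 2.0 * math.log((a + c) / n) / math.log(a / n) - 1.0

def seds(a, b, c, d):
    """SEDS = [ln((a+b)/n) + ln((a+c)/n)] / ln(a/n) - 1."""
    n = a + b + c + d
    return (math.log((a + b) / n) + math.log((a + c) / n)) \
        / math.log(a / n) - 1.0

# Both scores are 1 for a perfect forecast (b = c = 0) and positive
# for a skilful one.
print(eds(30, 10, 20, 40))   # ~0.15
print(seds(30, 10, 20, 40))  # ~0.34
```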

  22. Skill versus height • Most scores are not reliable near the tropopause, because cloud fraction tends to zero there [Figure: SEDS, EDS, LBSS, HSS and LOR versus height] • The new score reveals: skill tends to decrease slowly at the tropopause • Mid-level clouds (4-5 km) are most skilfully predicted, particularly by the Met Office • Boundary-layer clouds are least skilfully predicted

  23. What is the origin of the term “ETS”? • First use of “Equitable Threat Score”: Mesinger & Black (1992), a modification of the “Threat Score” a/(a+b+c) • They cited Gandin and Murphy’s equitability requirement that constant forecasts score zero (which ETS satisfies), although ETS doesn’t satisfy the requirement that non-constant random forecasts have an expected score of 0 • ETS is now one of the most widely used verification measures in meteorology • An example of rediscovery: Gilbert (1884) discussed a/(a+b+c) as a possible verification measure in the context of Finley’s (1884) tornado forecasts • Gilbert noted the deficiencies of this and proposed exactly the same formula as ETS, 108 years before! • We suggest that ETS be referred to as the Gilbert Skill Score (GSS), or that the Heidke Skill Score be used instead, which is unconditionally equitable and uniquely related to ETS: ETS = HSS / (2 − HSS). Hogan, Ferro, Jolliffe and Stephenson (WAF, in press)

  24. THUS FAR DISCUSSED: Cloudnet • Clouds in the 4 km v 12 km NAE • Diurnal cycle of BL clouds in various models • Problems with the ETS (now GSS): use SEDS • Now DRIZZLE! BL clouds in models drizzle all the time • New observations from CloudSat/CALIPSO compared with forward-modelled ECMWF fields

  25. A-Train v ECMWF • −22 dBZ → 0.4 g/m3 or 0.001 mm/hr (1 mm per month: 0.6 W/m2) • OBSERVATIONS (Z vs LWP): LWP of 100 g/m2 → −22 dBZ • ECMWF FORWARD MODEL: LWP of 100 g/m2 → 0 dBZ, i.e. 160 times too much drizzle! Drizzle rate 0.03 mm/hr {20 W/m2; a 300 m layer cools at 0.3 K/hr} [Figure: observed and forward-modelled Z versus LWP]

  26. ECMWF rain flux parameterisation • Autoconversion of cloud mixing ratio qcl to rain mixing ratio qr: dqr/dt = K qcl × [threshold term] • The threshold term turns off autoconversion for values below qcl,crit = 0.3 g kg-1 • Without the threshold term, dqr/dt ∝ qcl: LWP of 1000 g/m2 → 0.6 mm/hr; LWP of 100 g/m2 → 0.06 mm/hr • Adding the threshold and assuming adiabatic clouds: 0.03 mm/hr (0 dBZ) • So why not increase qcl,crit to stop all the drizzle forming? NO! This would increase the LWP of all water clouds, make them too bright and destroy the global radiation balance.
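
To make the threshold behaviour concrete, here is a sketch of a Sundqvist-type autoconversion rate consistent with the slide’s description. The smooth exponential form of the threshold term and the rate constant K are assumptions; only qcl,crit = 0.3 g/kg is given in the slide:

```python
# Sundqvist-type autoconversion sketch:
# dqr/dt = K * qcl * [1 - exp(-(qcl/qcl_crit)^2)],
# so conversion to rain shuts off smoothly below qcl_crit.
import math

K = 1e-4           # s^-1, assumed rate constant (illustrative only)
QCL_CRIT = 0.3e-3  # kg/kg, threshold from the slide (0.3 g/kg)

def autoconversion_rate(qcl: float) -> float:
    """Rain production dqr/dt (kg/kg/s) from cloud mixing ratio qcl."""
    threshold_term = 1.0 - math.exp(-((qcl / QCL_CRIT) ** 2))
    return K * qcl * threshold_term

# Below the threshold the rate is strongly suppressed; well above it,
# dqr/dt is simply proportional to qcl.
for qcl in (0.1e-3, 0.3e-3, 0.6e-3):
    print(f"qcl = {qcl*1e3:.1f} g/kg -> dqr/dt = {autoconversion_rate(qcl):.2e}")
```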

  27. Evidence that the clouds in ECMWF are more adiabatic than observed? • Cloud amount >80% • Observed: 25% adiabatic? Modelled: 50% adiabatic? • MODEL AUTOCONVERSION for LWP of 100 g/m2: 100% adiabatic → 0.03 mm/hr, 0 dBZ (300 m deep, max LWC 0.6 g/m3); 50% adiabatic → 0.02 mm/hr (450 m deep, max LWC 0.45 g/m3); 25% adiabatic → 0.01 mm/hr, −8 dBZ (700 m deep, max LWC 0.3 g/m3) • CloudSat gate 500 m

  28. LWC, NAE

  29. LWC, 4 km (3×3)

  30. LWC, 4 km

  31. Joint PDFs of cloud fraction • Raw (1 hr) resolution: 1 year from Murgtal, DWD COSMO model [Figure: joint PDFs of observed and modelled cloud fraction, with quadrants labelled a, b, c, d] • 6-hr averaging ...or use a simple contingency table

  32. Skill-Bias diagrams [Figure: skill versus bias for reality with n = 16, p = 1/4; the bias axis runs from under-prediction through no bias to over-prediction; points marked for the best possible forecast (positive skill), random forecasts, negative skill, the random unbiased forecast, constant forecasts of occurrence and non-occurrence, and the worst possible forecast]

  33. Hedging: “issuing a forecast that differs from your true belief in order to improve your score” (e.g. Jolliffe 2008) • Hit rate H = a/(a+c): the fraction of events correctly forecast • Easily hedged by randomly changing some forecasts of non-occurrence to occurrence [Figure: examples with H = 0.5, H = 0.75, H = 1]. A small numerical demonstration follows below.
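
The sketch below demonstrates the hedging argument numerically: flipping a random half of the “no” forecasts to “yes” raises H without adding any real skill. All probabilities and counts are invented for illustration:

```python
# Hedging the hit rate H = a/(a+c): random flips of non-occurrence
# forecasts to occurrence increase H without improving the forecast.
import random

random.seed(1)
n = 10000
obs = [random.random() < 0.25 for _ in range(n)]             # events, p = 1/4
fcst = [o if random.random() < 0.6 else not o for o in obs]  # imperfect forecast

def hit_rate(obs, fcst):
    a = sum(o and f for o, f in zip(obs, fcst))      # hits
    c = sum(o and not f for o, f in zip(obs, fcst))  # misses
    return a / (a + c)

print("before hedging:", hit_rate(obs, fcst))
# Hedge: change half of the "no" forecasts to "yes" at random.
hedged = [f or (random.random() < 0.5) for f in fcst]
print("after hedging: ", hit_rate(obs, hedged))  # higher H, no extra skill
```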

  34. Equitability, defined by Gandin and Murphy (1992): • Requirement 1: an equitable verification measure awards all random forecasting systems, including those that always forecast the same value, the same expected score • Inequitable measures rank some random forecasts above skilful ones • Requirement 2: an equitable verification measure S must be expressible as a linear weighted sum of the elements of the contingency table, i.e. S = (S_a·a + S_b·b + S_c·c + S_d·d) / n • This second requirement can safely be discarded: it is incompatible with other desirable properties, e.g. usefulness for rare events • Gandin and Murphy reported that only the Peirce Skill Score, and linear transforms of it, is equitable by their requirements • PSS = hit rate minus false alarm rate = a/(a+c) − b/(b+d) • What about all the other measures reported to be equitable?

  35. Some reportedly equitable measures • HSS = [x − E(x)] / [n − E(x)], where x = a + d • ETS = [a − E(a)] / [a + b + c − E(a)], where E(a) = (a+b)(a+c)/n is the expected value of a for an unbiased random forecasting system • LOR = ln[ad/bc] • ORSS = [ad/bc − 1] / [ad/bc + 1] • Simple attempts to hedge will fail for all these measures • Random and constant forecasts all score zero, so these measures are all equitable, right? A sketch computing them follows below.
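
A sketch computing the four measures quoted above, using the slide’s notation; the example counts are invented:

```python
# HSS, ETS, LOR and ORSS from contingency-table counts.
# E(a) = (a+b)(a+c)/n is the expected hits for an unbiased random
# forecast; x = a + d is the number of correct forecasts.
import math

def scores(a, b, c, d):
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n        # expected hits by chance
    e_x = e_a + (b + d) * (c + d) / n  # expected correct forecasts
    x = a + d
    return {
        "HSS": (x - e_x) / (n - e_x),
        "ETS": (a - e_a) / (a + b + c - e_a),
        "LOR": math.log(a * d / (b * c)),
        "ORSS": (a * d - b * c) / (a * d + b * c),
    }

s = scores(30, 10, 20, 40)
print(s)
# Consistency check with slide 23: ETS = HSS / (2 - HSS).
print(abs(s["ETS"] - s["HSS"] / (2 - s["HSS"])) < 1e-12)  # True
```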

  36. Skill versus cloud-fraction threshold • Consider 7 models evaluated over 3 European sites in 2003-2004 [Figure: HSS and LOR versus cloud-fraction threshold] • LOR implies skill increases for a larger cloud-fraction threshold • HSS implies skill decreases significantly for a larger cloud-fraction threshold

  37. Extreme dependency score • Stephenson et al. (2008) explained this behaviour: almost all scores have a meaningless limit as the “base rate” p → 0; HSS tends to zero and LOR tends to infinity • They proposed the Extreme Dependency Score: EDS = 2 ln[(a+c)/n] / ln(a/n) − 1, where n = a + b + c + d • It can be shown that this score tends to a meaningful limit: rewrite in terms of hit rate H = a/(a+c) and base rate p = (a+c)/n to give EDS = 2 ln p / (ln p + ln H) − 1, then assume a power-law dependence of H on p as p → 0, H ∝ p^δ; in the limit p → 0 we find EDS = (1 − δ)/(1 + δ) • This is useful because random forecasts have a hit rate converging to zero at the same rate as the base rate: δ = 1, so EDS = 0 • Perfect forecasts have a hit rate constant with base rate: δ = 0, so EDS = 1

  38. Skill versus cloud-fraction threshold [Figure: SEDS, HSS and LOR versus threshold] • SEDS has much flatter behaviour for all models (except the Met Office, which significantly underestimates high cloud occurrence)

  39. A surprise? • Is mid-level cloud well forecast??? The frequency of occurrence of these clouds is commonly too low (e.g. from Cloudnet: Illingworth et al. 2007), and specification of cloud phase is cited as a problem • Higher skill could be because large-scale ascent has its largest amplitude here, so the cloud response to large-scale dynamics is clearest at mid levels • Higher skill for the Met Office models (global and mesoscale) because they have arguably the most sophisticated microphysics, with separate liquid and ice water content (Wilson and Ballard 1999)? • Low skill for boundary-layer cloud is not a surprise! It is a well-known problem for forecasting (Martin et al. 2000): occurrence and height are a subtle function of subsidence rate, stability, free-troposphere humidity, surface fluxes, entrainment rate...

  40. Key properties for estimating half-life • We wish to model the score S versus forecast lead time t as an inverse exponential, S(t) = S0 2^(−t/t½), where t½ is the forecast “half-life” • We need linearity: some measures “saturate” at the high-skill end (e.g. Yule’s Q / ORSS), which leads to a misleadingly long fitted half-life • ...and equitability: the formula above assumes that the score tends to zero for very long forecasts, which will only occur if the measure is equitable

  41. Which measures are equitable? • The expected values of a-d for a random forecasting system may score zero: S[E(a), E(b), E(c), E(d)] = 0 • But the expected score may not be zero! E[S(a,b,c,d)] = Σ P(a,b,c,d) S(a,b,c,d) • The width of the random score distribution decreases for larger sample size n [Figure: score distributions for n = 16 and n = 80] • A measure is only equitable if positive and negative scores cancel • ETS & ORSS are asymmetric

  42. Asymptotic equitability • Consider first unbiased forecasts of events that occur with probability p = 1/2 • The expected value of the “Equitable Threat Score” for a random forecasting system decreases below 0.01 only when n > 30 • This behaviour we term asymptotic equitability • Other measures are never equitable, e.g. the Critical Success Index CSI = a/(a+b+c), also known as the Threat Score. The Monte Carlo sketch below illustrates the point.
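
A Monte Carlo sketch of this claim: estimate the expected ETS of unbiased random forecasts of p = 1/2 events and watch it shrink towards zero as n grows. The trial count and seed are arbitrary:

```python
# Expected ETS of a random forecasting system as a function of n.
import random

def ets(a, b, c, d):
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    denom = a + b + c - e_a
    return (a - e_a) / denom if denom else 0.0  # guard degenerate tables

def expected_random_ets(n, p=0.5, trials=4000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = b = c = d = 0
        for _ in range(n):
            o, f = rng.random() < p, rng.random() < p  # random obs, random fcst
            if o and f:
                a += 1
            elif f:
                b += 1
            elif o:
                c += 1
            else:
                d += 1
        total += ets(a, b, c, d)
    return total / trials

# Expected score is positive for small n and decays towards zero.
for n in (8, 16, 32, 64):
    print(n, round(expected_random_ets(n), 4))
```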

  43. What about rarer events? • The “Equitable Threat Score” is still virtually equitable for n > 30 • ORSS, EDS and SEDS approach zero much more slowly with n • For events that occur 2% of the time (e.g. Finley’s tornado forecasts), n > 25,000 is needed before the magnitude of the expected score is less than 0.01 • But these measures are supposed to be useful for rare events!

  44. Possible solutions • Ensure n is large enough that E(a) > 10 • Inequitable scores can be scaled to make them equitable, e.g. S′ = [S − E(S)] / [S_perfect − E(S)]; this opens the way to a new class of non-linear equitable measures (see the sketch below) • Report confidence intervals and “p-values” (the probability of a score being achieved by chance)
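
A sketch of the scaling idea. The form S′ = [S − E(S)] / [S_perfect − E(S)] is an assumption consistent with the slide’s wording, with E(S) for random forecasts estimated by Monte Carlo as in the previous sketch:

```python
# Rescale an inequitable score so that random forecasts expect zero
# and a perfect forecast still scores 1.
def rescale(score, expected_random, perfect=1.0):
    """Approximately equitable version of `score` (assumed form)."""
    return (score - expected_random) / (perfect - expected_random)

# e.g. an ETS of 0.25 when random forecasts average ETS ~ 0.02:
print(rescale(0.25, 0.02))  # ~0.235
```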

  45. Properties of various measures [Table: measures classified as truly equitable, asymptotically equitable, or not equitable]

  46. Skill versus lead time [Figure: skill versus lead time for 2004 and 2007] • Only possible for the UK Met Office 12-km model and the German DWD 7-km model • Steady decrease of skill with lead time • Both models appear to improve between 2004 and 2007 • Generally the UK model is best over the UK and the German model best over Germany • An exception is Murgtal in 2007 (the Met Office model wins)

  47. Forecast “half-life” • Fit an inverse exponential, S(t) = S0 2^(−t/t½), where S0 is the initial score and t½ is the half-life • A noticeably longer half-life is fitted after 36 hours; the same was found for Met Office rainfall forecasts (Roberts 2008) • The first timescale is due to data assimilation and convective events, the second to more predictable large-scale weather systems [Figure: fitted half-lives for the Met Office and DWD models in 2004 and 2007, ranging from about 2.4 to 4.3 days]
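
A sketch of fitting the inverse exponential to score-versus-lead-time data with scipy.optimize.curve_fit; the data points below are invented for illustration, not taken from the slide:

```python
# Fit S(t) = S0 * 2**(-t / t_half) to skill scores at several lead times.
import numpy as np
from scipy.optimize import curve_fit

def inverse_exponential(t, s0, t_half):
    """Score decaying with lead time t; t_half is the forecast half-life."""
    return s0 * 2.0 ** (-t / t_half)

t = np.array([6, 12, 18, 24, 30, 36])               # lead time (hours), illustrative
s = np.array([0.40, 0.37, 0.34, 0.31, 0.29, 0.27])  # skill score, illustrative

(s0, t_half), _ = curve_fit(inverse_exponential, t, s, p0=(0.4, 72.0))
print(f"S0 = {s0:.2f}, half-life = {t_half / 24:.1f} days")
```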

  48. Why is the half-life less for clouds than for pressure? • Different spatial scales? Convection? • Average temporally before calculating skill scores: absolute score and half-life increase with the number of hours averaged

  49. [Figure: geopotential height anomaly and vertical velocity] • Cloud is noisier than geopotential height Z because it is separated from it by around two orders of differentiation: cloud ~ vertical wind ~ relative vorticity ~ ∇²(streamfunction) ~ ∇²(pressure) • This suggests cloud observations should be used routinely to evaluate models

  50. Satellite observations: ICESat • Cloud observations from the ICESat 0.5-micron lidar (first data Feb 2004) • Global coverage, but the lidar is attenuated by thick clouds, so direct model comparison is difficult: optically thick liquid cloud obscures the view of any clouds beneath [Figure: lidar apparent backscatter coefficient (m-1 sr-1) versus latitude] • Solution: forward-model the measurements (including attenuation) using the ECMWF variables
