K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications)

K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications). Balaji Rajagopalan, Somkiat Apipattanavis & Erin Towler. Department of Civil, Environmental and Architectural Engineering, University of Colorado, Boulder, CO. Denver Water, February 2007.

Presentation Transcript


  1. K-Nearest Neighbor Resampling Technique (Weather Generation and Water Quality Applications) • Balaji Rajagopalan, Somkiat Apipattanavis & Erin Towler • Department of Civil, Environmental and Architectural Engineering, University of Colorado, Boulder, CO • Denver Water, February 2007

  2. “Translation” of Climate Info • Users are most interested in sectoral outcomes (streamflows, crop yields, risk of disease X) • [Diagram: climate forecast / projection → translation → process models → distribution of outcomes]

  3. Translation • [Diagram: historical data → synthetic series → process model → frequency distribution of outcomes]

  4. Why Simulation? • Limited historical data • Cannot capture the full range of variability • Selecting a (single or a set of) historical years from the record with equal chance: unconditional bootstrap, Index Sequential Method • Need: a tool to generate ‘scenarios’ that capture the historical statistical properties • Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques) • These are cumbersome and restrictive in their assumptions • Re-sampling techniques are simple and robust • The unconditional and conditional bootstrap and the K-nearest neighbor (K-NN) bootstrap offer attractive alternatives.

  6. Re-sampling Techniques • Drawing cards from a well-shuffled deck • Selecting a (single or a set of) historical years from the record with equal chance: unconditional bootstrap, Index Sequential Method • Drawing cards from a biased deck • Selecting a (single or a set of) historical years with unequal chance, e.g., selecting only El Niño years: conditional bootstrap • K-Nearest Neighbor bootstrap: “pattern matching” • Select the ‘K’ nearest neighbors (e.g., years) to the current ‘feature’ • Select one of the K neighbors at random • Repeat to produce an ensemble
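
The three resampling flavors on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the presenters' code: the function names, the toy ten-year record, the El Niño flags, and the use of a one-dimensional feature with absolute difference as the distance are all assumptions made for the example.

```python
import numpy as np

def unconditional_bootstrap(years, n, rng):
    """Well-shuffled deck: draw n years, each with equal chance."""
    return rng.choice(years, size=n, replace=True)

def conditional_bootstrap(years, mask, n, rng):
    """Biased deck: draw n years only from those that satisfy a condition."""
    return rng.choice(years[mask], size=n, replace=True)

def knn_bootstrap(years, features, current, k, n, rng):
    """Pattern matching: draw n years from the k years nearest to `current`."""
    dist = np.abs(features - current)      # 1-D feature, so absolute difference
    nearest = np.argsort(dist)[:k]         # indices of the k nearest years
    return rng.choice(years[nearest], size=n, replace=True)

rng = np.random.default_rng(0)
years = np.arange(1950, 1960)
# Hypothetical data, for illustration only.
el_nino = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], dtype=bool)
feature = np.array([3.1, 5.0, 2.2, 4.1, 4.8, 1.0, 3.9, 5.2, 2.8, 3.5])

uncond = unconditional_bootstrap(years, 20, rng)
cond = conditional_bootstrap(years, el_nino, 20, rng)
knn = knn_bootstrap(years, feature, current=5.0, k=3, n=20, rng=rng)
```

Here the K neighbors are drawn with equal chance, as the slide states; a rank-based weight could be passed to `rng.choice` through its `p` argument instead.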

  7. Examples • Ensemble Weather Generation • Scenario generation • Forecast Argentina - Pampas Region • Water Quality Modeling (Boulder Water Utility)

  8. Two Step Weather Generator • Estimate the transition probabilities (wet to dry, etc.) of a first-order Markov chain from historical data, for each month • Generate a precipitation state time series using the Markov chain • Suppose we need the weather simulation for January 5th, and January 4th is a wet day • Get neighbors from a 7-day window (7 × 50 days) centered on January 4th • Screen days using the precipitation state pair, e.g. (1,0) [days in blue on the slide]; these are the “potential neighbors” • Calculate the distances between the weather variables of the current day's feature vector and those of the potential neighbors • Select the K nearest neighbors • Assign them weights • Pick a day from the K nearest neighbors using the weight function, say Jan 1st 1953 • The simulated weather for Jan 5th is then the weather of the following day, Jan 2nd 1953 • Repeat
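
The procedure on this slide can be sketched as follows. This is a rough sketch under stated assumptions, not the presenters' implementation: the historical record is synthetic, a single transition matrix is used instead of one per month, the distance is Euclidean on standardized variables, K is set to the square root of the number of potential neighbors, and the weights are proportional to 1/rank. The names `simulate_states` and `knn_next_day` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical historical record: n_years x 365 days of (precip, tmax, tmin).
n_years, n_days = 20, 365
precip = rng.gamma(0.5, 8.0, (n_years, n_days)) * (rng.random((n_years, n_days)) < 0.3)
tmax = 25 + 5 * rng.standard_normal((n_years, n_days))
tmin = tmax - 10 + 2 * rng.standard_normal((n_years, n_days))
hist = np.stack([precip, tmax, tmin], axis=-1)      # shape (years, days, 3)
wet = (precip > 0).astype(int)                       # 1 = wet, 0 = dry

# First-order Markov chain: P[i, j] = P(state j tomorrow | state i today).
# (The slide estimates these per month; one matrix is used here for brevity.)
counts = np.zeros((2, 2))
for i in (0, 1):
    for j in (0, 1):
        counts[i, j] = np.sum((wet[:, :-1] == i) & (wet[:, 1:] == j))
P = counts / counts.sum(axis=1, keepdims=True)

def simulate_states(n, first_state):
    """Generate a wet/dry state sequence from the Markov chain."""
    states = [int(first_state)]
    for _ in range(n - 1):
        states.append(int(rng.random() < P[states[-1], 1]))
    return np.array(states)

def knn_next_day(current, day, prev_state, next_state, half_window=3):
    """Resample the weather that followed a near neighbor of `current`."""
    # Potential neighbors: days in a 7-day window centered on `day`, all years,
    # whose (state, next-day state) pair matches the generated states.
    lo, hi = max(day - half_window, 0), min(day + half_window, n_days - 2)
    d = np.arange(lo, hi + 1)
    ok = (wet[:, d] == prev_state) & (wet[:, d + 1] == next_state)
    yy, dd = np.nonzero(ok)
    if yy.size == 0:                                 # fallback: ignore the states
        yy, dd = np.nonzero(np.ones_like(ok))
    cand = hist[yy, d[dd]]                           # candidate feature vectors
    # Distance on standardized variables; keep the K = sqrt(n) nearest.
    scale = hist.reshape(-1, 3).std(axis=0)
    dist = np.sqrt((((cand - current) / scale) ** 2).sum(axis=1))
    k = max(1, int(np.sqrt(dist.size)))
    order = np.argsort(dist)[:k]
    # Weights proportional to 1/rank favor the nearer neighbors.
    w = 1.0 / np.arange(1, k + 1)
    pick = order[rng.choice(k, p=w / w.sum())]
    return hist[yy[pick], d[dd[pick]] + 1]           # successor of the picked day

# Simulate 30 days, starting from the first observed day.
states = simulate_states(30, wet[0, 0])
sim = [hist[0, 0]]
for t in range(1, 30):
    sim.append(knn_next_day(sim[-1], t - 1, states[t - 1], states[t]))
sim = np.array(sim)
```

Because every simulated day is an observed day, the cross-variable dependence (for example between precipitation and temperature) is carried over from the record without being modeled explicitly.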

  9. Single Site Simulation • Pergamino, Argentina • Daily weather variables, 1931-2003 • Precipitation • Max. temperature • Min. temperature • 100 simulations, each 73 years long (the length of the record) • Statistics of the simulated and historical data are compared

  10. Spell Properties Pergamino, Argentina

  11. wet and dry spell statistics

  12. Moments (wet month - Jan)

  13. Moments (dry month - July)

  14. Conditional K-NN Re-sampling • Conditioned on the IRI seasonal forecast • Get the prediction (A:N:B = 40:35:25) • Divide the historical (seasonal) totals into 3 tercile categories • Bootstrap samples of 40, 35 and 25 historical years from the wet, normal and dry categories, respectively • Apply the two-step weather generator to this sample.
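
The conditioning step can be sketched as below. The seasonal totals are synthetic and the variable names are hypothetical; only the 40:35:25 split and the tercile categories come from the slide.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical seasonal precipitation totals for 1931-2003 (73 years).
years = np.arange(1931, 2004)
totals = rng.gamma(4.0, 60.0, size=years.size)

# Divide the historical seasonal totals into three tercile categories.
lower, upper = np.quantile(totals, [1 / 3, 2 / 3])
categories = {
    "above": years[totals > upper],
    "normal": years[(totals >= lower) & (totals <= upper)],
    "below": years[totals < lower],
}

# Forecast A:N:B = 40:35:25 gives the number of years drawn per category.
forecast = {"above": 40, "normal": 35, "below": 25}
sample = np.concatenate([
    rng.choice(categories[c], size=n, replace=True) for c, n in forecast.items()
])
# `sample` (100 years) is the pool handed to the two-step weather generator.
```

Running the weather generator on `sample` rather than on the full record tilts the simulated ensemble toward the forecast category probabilities.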

  15. Conditional Weather Generation (results)

  16. Multi-site extension • The same procedure as for a single site is used, but: • Calculate the average time series across stations (“single site virtual weather data”) • Apply the two-step generator • Select the weather at all the locations on the picked day to obtain the multi-site simulation • Stations in the Pampas region, Argentina • Pergamino • Junin • Nueve de Julio
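
A stripped-down sketch of the multi-site idea follows. To stay short it replaces the full two-step generator with a single nearest-neighbor pick on the averaged series, and it returns the picked day itself rather than its successor; the data are synthetic and `pick_day` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical daily precipitation at three stations (rows: days, columns: stations).
stations = ["Pergamino", "Junin", "Nueve de Julio"]
obs = rng.gamma(2.0, 3.0, size=(200, 3))

# "Single site virtual weather data": the station average for each day.
virtual = obs.mean(axis=1)

def pick_day(current_virtual, k=10):
    """Pick one of the k days whose virtual value is nearest to the current one."""
    nearest = np.argsort(np.abs(virtual - current_virtual))[:k]
    w = 1.0 / np.arange(1, k + 1)          # nearer neighbors are more likely
    return nearest[rng.choice(k, p=w / w.sum())]

day = pick_day(current_virtual=6.0)
multi_site = obs[day]                       # weather at all stations on that day
```

Taking all stations from the same observed day is what preserves the spatial correlation between sites.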

  17. Multisite Case wet and dry spell Statistics Pergamino, Argentina

  18. Basic Distribution Properties

  19. Spatial Correlation

  20. Motivation • Finished water must comply with a given regulation • [Diagram: influent water quality → water treatment plant → finished water quality] • Water quality variables: TOC, TSUVA, alkalinity, pH, turbidity, temperature

  21. Motivation • Uncertainty helps us to understand the risk of non-compliance with a given regulation • [Diagram: input distribution → WTP → output distribution, split into “comply” and “non-compliance”] • The possibilities!

  22. Data Set Information Collection Rule (ICR) • Monitoring effort mandated by USEPA • Large public water systems • Water quality and operating data • Disinfection by-products (DBPs) and microorganisms to support rulemakings • Most comprehensive view of large drinking water systems to date

  23. Data Set ICR • 18 months (Jul. 1997 – Dec. 1998) • 458 continental US locations

  24. Data Set ICR Database • Water Quality • Influent • Intermediate • Finished • Distribution system • Chemical Additions

  25. Characterize Variability • Influent water quality has significant variability due to climate, geology, and water management practices • Source water variables: TOC, TSUVA, alkalinity, pH, turbidity, temperature, total hardness

  26. Variability • Examine influent water quality for surface waters (SWs) • Spatial variability • Temporal variability • Focus on total organic carbon (TOC) • TOC is a precursor in formation of DBPs • Methods extend to other water quality parameters

  27. Variability Spatial Variability • Local polynomial approach • Find best K and P combination • Contour estimates

  28. Variability Spatial Variability • [Figure: SW average annual TOC (mg/L)]

  29. Variability Spatial Variability Similar spatial patterns found for • Finished water TOC (lower) • Distribution system DBPs • TTHM (total trihalomethanes) • HAA5 (five haloacetic acids)

  30. Variability Spatial Variability Spatial patterns consistent with previous research for other influent water quality variables • Alkalinity • Bromide

  31. Variability Temporal Variability • City of Boulder’s Betasso Water Treatment Plant (CO) • [Figure: influent TOC (mg/L) by month, January to December]

  32. Variability Temporal Variability • Some locations exhibited seasonal trends, others did not • Month to month variations should be considered

  33. Variability • Inherent variability in water quality contributes to uncertainty • How can we quantify uncertainty?

  34. Quantify Uncertainty • Simulate “ensembles” of influent water quality (Monte Carlo) • [Figure: ensembles compared with observed data]

  35. Quantify Traditional Method • Fit a probability density function (pdf) to the data (normal, lognormal, etc.) • Simulate from the pdf • [Figure: normal and lognormal fits]

  36. Quantify Limitations • What if the pdf is not a good fit? • What if you don’t have enough data to fit the pdf? E.g., 18 months per location in the ICR database

  37. Quantify Space-Time Bootstrapping Method • Skip fitting a pdf to the data • Simulate by bootstrapping • Randomly sample data with replacement • Expand bootstrapping pool to include “similar” locations (nearest neighbors) • What is limited in time is available in space
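
The contrast between the traditional method and bootstrapping can be sketched as follows. All TOC values are made up for illustration, and fitting the lognormal by the mean and standard deviation of the logs is an assumption; the slides do not say how the pdf is fitted.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 18 monthly TOC values (mg/L) at one location.
toc = np.array([2.1, 1.8, 2.5, 3.9, 4.2, 3.1, 2.4, 2.0, 1.9,
                2.2, 2.6, 3.5, 4.0, 2.8, 2.3, 2.1, 1.7, 2.0])

# Traditional method: fit a lognormal pdf, then simulate from it.
mu, sigma = np.log(toc).mean(), np.log(toc).std(ddof=1)
parametric = rng.lognormal(mean=mu, sigma=sigma, size=1000)

# Bootstrap: skip the pdf and sample the observations with replacement.
bootstrap = rng.choice(toc, size=1000, replace=True)

# Space-time bootstrap: widen the pool with data from "similar" locations.
neighbor_toc = np.array([2.3, 1.6, 2.9, 3.6, 4.4, 2.7])   # hypothetical neighbors
pool = np.concatenate([toc, neighbor_toc])
space_time = rng.choice(pool, size=1000, replace=True)
```

The plain bootstrap can only reproduce the 18 observed values, which is why borrowing data from neighboring locations matters when the record is short.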

  38. Quantify • Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest • Feature vector includes: - Average Annual Concentration - Latitude - Longitude

  39. Quantify • Average annual concentration helps find neighbors that are similar but may not be geographically nearby • Figure annotation: geographically close, but not good “neighbors” for bootstrapping • [Figure: average annual TOC (mg/L) for Ohio surface waters]

  40. Quantify • Sample monthly TOC values based on feature vector • Conditional probability

  41. Quantify Simulation Algorithm 1) User inputs their location and their average annual TOC concentration 2) The ICR database is queried for all eligible entries

  42. Quantify Algorithm, cont. 3) Calculate distances, d, between the x_user vector and the x_ICR vector

  43. Quantify Algorithm- cont. 3) Calculate distances using weighted Mahalanobis equation

  44. Quantify Algorithm, cont. Remove the weights (W) and the covariance matrix (S) and it reduces to the Euclidean distance

  45. Quantify Algorithm, cont. By including the covariance matrix S, the components of the feature vector do not have to be scaled (Davis 1986)
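
The distance equation itself did not survive in this transcript, so the sketch below uses one plausible form of a weighted Mahalanobis distance, d = sqrt((w ∘ Δ)ᵀ S⁻¹ (w ∘ Δ)), where Δ is the difference between the two feature vectors and w holds one weight per component. That form, the function name `weighted_mahalanobis`, and the feature values are assumptions, not the presenters' equation. The example checks the two properties the slides state: with unit weights and S equal to the identity the distance is Euclidean, and with S included the result does not change when a component is rescaled.

```python
import numpy as np

def weighted_mahalanobis(x_user, x_icr, weights, cov):
    """Distance from x_user to each row of x_icr (assumed form, see lead-in)."""
    diff = (x_icr - x_user) * weights            # weight each component difference
    s_inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, s_inv, diff))

# Hypothetical feature vectors: [average annual TOC (mg/L), latitude, longitude].
x_icr = np.array([[2.0, 40.0, -105.3],
                  [3.5, 39.7, -104.9],
                  [2.2, 41.1, -81.5],
                  [6.0, 40.1, -82.9],
                  [4.1, 33.5, -112.0],
                  [1.2, 47.6, -122.3]])
x_user = np.array([2.1, 40.0, -105.2])
unit_w = np.ones(3)

S = np.cov(x_icr, rowvar=False)
d = weighted_mahalanobis(x_user, x_icr, unit_w, S)

# Unit weights and S = identity give the plain Euclidean distance.
d_euclid = weighted_mahalanobis(x_user, x_icr, unit_w, np.eye(3))

# Rescaling a component (here mg/L to µg/L) leaves the distance unchanged,
# as long as S is recomputed from the rescaled data.
scale = np.array([1000.0, 1.0, 1.0])
d_scaled = weighted_mahalanobis(x_user * scale, x_icr * scale, unit_w,
                                np.cov(x_icr * scale, rowvar=False))
```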

  46. Quantify Algorithm- cont. Weights are assigned as

  47. Quantify Weights offer flexibility in neighbor selection (a) (b) (c) (d)

  48. Quantify Algorithm- cont. 4) Obtain observed monthly data for each nearest neighbor

  49. Quantify Algorithm, cont. 5) Bootstrap x_NN using a weight function, which increases the likelihood of picking nearer neighbors
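
Steps 1 to 5 can be strung together as in the sketch below. It is a sketch under assumptions, not the presenters' implementation: the "database" is a synthetic stand-in for the ICR data, the feature weights are all set to one, K = 7 is arbitrary, the bootstrap weights are taken proportional to 1/rank, and `simulate_toc` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-in for the ICR database: 50 locations, each with a
# latitude, a longitude and 12 monthly TOC values (mg/L).
n_loc = 50
lat = rng.uniform(30, 48, n_loc)
lon = rng.uniform(-120, -70, n_loc)
monthly_toc = rng.gamma(4.0, 0.7, size=(n_loc, 12))
annual_avg = monthly_toc.mean(axis=1)

def simulate_toc(user_avg, user_lat, user_lon, k=7, n_sims=500):
    """Return an (n_sims, 12) ensemble of monthly TOC and the neighbors used."""
    # Steps 1-3: build feature vectors and compute Mahalanobis distances.
    x_icr = np.column_stack([annual_avg, lat, lon])
    diff = x_icr - np.array([user_avg, user_lat, user_lon])
    s_inv = np.linalg.inv(np.cov(x_icr, rowvar=False))
    dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, s_inv, diff))
    # Step 4: the k nearest locations supply the observed monthly data.
    nearest = np.argsort(dist)[:k]
    # Step 5: bootstrap a neighbor for every month of every simulation,
    # with weights proportional to 1/rank so nearer neighbors are favored.
    w = 1.0 / np.arange(1, k + 1)
    picks = rng.choice(nearest, size=(n_sims, 12), p=w / w.sum())
    return monthly_toc[picks, np.arange(12)], nearest

ensemble, nearest = simulate_toc(user_avg=2.5, user_lat=40.0, user_lon=-105.3)
```

Each column of `ensemble` holds the simulated values for one calendar month, so percentiles taken down a column describe the uncertainty in that month's influent TOC.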

  50. Quantify Apply algorithm to quantify uncertainty in influent TOC concentration City of Boulder’s Betasso Water Treatment Plant (CO) Boulder SWs only, N = 334
