Using Bayesian networks for Water Quality Prediction in Sydney Harbour

## Using Bayesian networks for Water Quality Prediction in Sydney Harbour

**Using Bayesian networks for Water Quality Prediction in Sydney Harbour**

Ann Nicholson
Shannon Watson, Honours 2003
Charles Twardy, Research Fellow
School of Computer Science and Software Engineering, Monash University

**Overview**
• Representing uncertainty
• Introduction to Bayesian Networks
  • Syntax, semantics, examples
• The knowledge engineering process
• Sydney Harbour Water Quality Project 2003
• Summary of other BN research

**Sources of Uncertainty**
• Ignorance
• Inexact observations
• Non-determinism
• AI representations
  • Probability theory
  • Dempster-Shafer
  • Fuzzy logic

**Probability theory for representing uncertainty**
• Assigns a numerical degree of belief between 0 and 1 to facts
  • e.g. “it will rain today” is T/F
• Prior probability (unconditional): P(“it will rain today”) = 0.2
• Posterior probability (conditional): P(“it will rain today” | “rain is forecast”) = 0.8
• Bayes’ Rule: $P(H \mid E) = \dfrac{P(E \mid H)\, P(H)}{P(E)}$

**Bayesian networks**
• A Bayesian Network (BN) represents a probability distribution graphically (as a directed acyclic graph)
• Nodes: random variables
  • R: “it is raining”, discrete values T/F
  • T: temperature, continuous or discrete variable
  • C: colour, discrete values {red, blue, green}
• Arcs indicate conditional dependencies between variables
• P(A,S,T) can be decomposed to P(A)P(S|A)P(T|A)

**Bayesian networks**
• Conditional Probability Distribution (CPD)
  • Associated with each variable
  • Probability of each state given parent states
• Example network: Flu → Te → Th
  • Flu, “Jane has the flu”: P(Flu=T) = 0.05
  • Te, “Jane has a high temp”: P(Te=High|Flu=T) = 0.4, P(Te=High|Flu=F) = 0.01 (the arc Flu → Te models a causal relationship)
  • Th, “Thermometer temp reading”: P(Th=High|Te=H) = 0.95, P(Th=High|Te=L) = 0.1 (the arc Te → Th models possible sensor error)

**BN inference**
[Figure: example networks over Flu, TB, Te and Th illustrating mixed inference and intercausal inference]
• Evidence: observation of specific state
• Task: compute the posterior probabilities for query node(s) given evidence.
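The inference task just described can be illustrated on the Flu → Te → Th example. The following is a minimal sketch of diagnostic inference by brute-force enumeration, not the algorithm used by Netica or any other package. It assumes Te and Th are binary (High/Low) and uses only the probabilities quoted on the slide; the function names are hypothetical.

```python
from itertools import product

# CPDs from the Flu -> Te -> Th slide. Assumption: Te and Th are binary (High/Low).
P_FLU = {True: 0.05, False: 0.95}                 # P(Flu)
P_TE_HIGH_GIVEN_FLU = {True: 0.4, False: 0.01}    # P(Te=High | Flu)
P_TH_HIGH_GIVEN_TE = {True: 0.95, False: 0.1}     # P(Th=High | Te=High / Te=Low)


def joint(flu: bool, te_high: bool, th_high: bool) -> float:
    """P(Flu, Te, Th) via the chain factorisation P(Flu) P(Te|Flu) P(Th|Te)."""
    p_te = P_TE_HIGH_GIVEN_FLU[flu] if te_high else 1 - P_TE_HIGH_GIVEN_FLU[flu]
    p_th = P_TH_HIGH_GIVEN_TE[te_high] if th_high else 1 - P_TH_HIGH_GIVEN_TE[te_high]
    return P_FLU[flu] * p_te * p_th


def posterior_flu(th_high: bool) -> float:
    """P(Flu=T | Th=th_high), summing out the unobserved node Te."""
    numerator = sum(joint(True, te, th_high) for te in (True, False))
    evidence = sum(
        joint(flu, te, th_high) for flu, te in product((True, False), repeat=2)
    )
    return numerator / evidence


if __name__ == "__main__":
    print(f"P(Flu=T | Th=High) = {posterior_flu(True):.4f}")
```

Under these assumptions, observing a high thermometer reading raises the belief in flu from the prior of 0.05 to about 0.176. Enumeration is only practical for tiny networks, which is why BN packages use more efficient inference algorithms.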
[Figure: example networks over Flu, Y, Te and Th illustrating diagnostic inference and predictive inference]

**BN software**
• Commercial packages: Netica, Hugin, Analytica (all with demo versions)
• Free software: Smile, Genie, JavaBayes
• See appendix B, Korb & Nicholson, 2004
• Example running Netica software

**Decision networks**
• Extension to basic BN for decision making
  • Decision nodes
  • Utility nodes
• $EU(\mathrm{Action}) = \sum_{o} p(o \mid \mathrm{Action}, E)\, U(o)$
• Choose the action with the highest expected utility
• Example

**Elicitation from experts**
• Variables
  • important variables? values/states?
• Structure
  • causal relationships?
  • dependencies/independencies?
• Parameters (probabilities)
  • quantify relationships and interactions?
• Preferences (utilities)

**Expert Elicitation Process**
[Figure: BN expert, domain expert and BN tools]
• These stages are done iteratively
• Stops when further expert input is no longer cost effective
• Process is difficult and time consuming
• Current BN tools
  • inference engine
  • GUI
• Next generation of BN tools?

**Knowledge discovery**
• There is much interest in automated methods for learning BNs from data
  • parameters, structure (causal discovery)
• Computationally complex problem, so current methods have practical limitations
  • e.g. limit number of states, require variable ordering constraints, do not specify all arc directions
• Evaluation methods

**Knowledge Engineering for Bayesian Networks (KEBN)**
1. Building the BN
  • variables, structure, parameters, preferences
  • combination of expert elicitation and knowledge discovery
2. Validation/Evaluation
  • case-based, sensitivity analysis, accuracy testing
3. Field Testing
  • alpha/beta testing, acceptance testing
4. Industrial Use
  • collection of statistics
5. Refinement
  • updating procedures, regression testing

**Water Quality for Sydney Harbour**
• Water quality for recreational use
• Beachwatch / Harbourwatch Programs
• Bacteria samples used as pollution indicators
• Many variables influence bacterial levels: rainfall, tide, wind, sunlight, temperature, pH, etc.

**Past studies**
• Hose et al. used a multidimensional scaling model of Sydney Harbour
  • low predictive accuracy, unable to handle the noisy bacteria samples, explained 63% of bacteria variability (Port Jackson)
• Ashbolt and Bruno:
  • agree with Hose et al., plus wind effects, sunlight hours, tide
• Crowther et al. (UK):
  • rainfall, tide, sampling times, sunshine, wind
  • explained 53% of bacteria variability
• Other models developed by the USEPA to model estuaries:
  • QUAL2E: steady-state receiving water model
  • WASP: time-varying dispersion model
  • EFDC: 3D hydrodynamic model
• EPA in Sydney interested in a model applying the causal knowledge of the domain

**Stages of Project**
• Preparation of EPA data (rainfall only)
• Hand-craft simple networks for rainfall data
• Comparison of hand-crafted networks with range of learners (using Weka software)
• Using CaMML to learn BN on extended data set
(Slide annotations: 2003 Hons proj; 2003/04 Summer Vac proj)

**EPA Data**
• Database 1:
  • E.coli, Enterococci (cfu/100mL), thresholds 150 & 35
  • 60 water samples each year since 1994 at 27 sites in Sydney Harbour
  • Enterococci, E.coli, Raining, Sunny, Drain running, temperature, time of sample, direction of sampling run, date, site name, beach code
• Database 2:
  • Rainfall readings (mm) at 40 locations around Sydney

**Data Preparation**
New file format:
Date BeachCode Entc Ecoli D1 D2 D3 D4 D5 D6
• D1 = rainfall on day of collection
• D6 = rainfall 5 days previously
• Rainfall data had many missing entries

**Rainfall BNs**
• Hand-crafted BNs to predict bacteria using rainfall only
• Started with deterministic BN that implemented EPA guidelines
• Looked at varying the number of previous days' rainfall for predicting bacteria
• Investigated various discretisations of variables

**Evaluation**
• Split data 50-50 training/testing
• 10-fold cross validation
• Measures: Predictive Accuracy & Information Reward
• Also looked at ROC curves (correct classification vs false positives)
• Using Weka: Java environment for machine learning tools and techniques
• Small data: 4 beaches: Chinamans, Edwards, Balmoral (all Middle Harbour), Clifton (Port Jackson)
• Using 6 days' rainfall averaged from all rain gauges

**Predictive accuracy**
• Examining each joint observation in the sample
• Adding any available evidence for the other nodes
• Updating the network
• Use value with highest probability as predicted value
• Compare predicted value with the actual value

**Information Reward**
• Rewards calibration of probabilities
• Zero reward for just reporting priors
• Unbounded below for a bad prediction
• Bounded above by a maximum that depends on priors

Pseudocode as given on the slide:

    Reward = 0
    Repeat
        If i == correct state
            IR += log ( 1 / p[i] )
        else
            IR += log ( 1 / 1 - p[i] )

[Figure: three items, each labelled Pr=1/3]

**Evaluation: Weka learners**
• Naïve Bayes
• J48 (version of C4.5)
• CaMML: causal BN learner, using MML metric
• AODE
• TAN
• Logistic
• “Davidson” BN: 6 days previous rainfall
  • With and without adaptation of parameters (case learning)
• “Guidelines” BN: 3 days previous rainfall
  • Deterministic rule
  • With adaptation of parameters (case learning)

**Results: ROC Curves**
• For ~20% false-positive, can get ~60% of events
• For ~45% false-positive, can get ~75% of events
• For ~60% false-positive, can get ~80% of events
• Implications?
  • Using current guidelines, if accept 45% false-positive, getting 60% hit rate
  • Can either keep that false-positive rate and get an extra 15%
  • Or keep the same hit rate at half the false-positive rate

**Early BN-related projects**
• DBNs for discrete monitoring (PhD, 1992)
• Approximate BN inference algorithms based on a mutual information measure for relevance (with Nathalie Jitnah, 1996-1999)
• Plan recognition: DBNs for predicting users' actions and goals in an adventure game (with David Albrecht, Ingrid Zukerman, 1997-2000)
• DBNs for ambulation monitoring and fall diagnosis (with biomedical engineering, 1996-2000)
• Bayesian Poker (with Kevin Korb, 1996-2003)

**Knowledge Engineering with BNs**
• Seabreeze prediction: joint project with Bureau of Meteorology
  • Comparison of existing simple rule, expert elicited BN, and BNs from Tetrad-II and CaMML
• ITS for decimal misconceptions
• Methodology and tools to support knowledge engineering process
  • Matilda: visualisation of d-separation
  • Support for sensitivity analysis
• Written a textbook:
  • Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson, Chapman & Hall / CRC, 2004. www.csse.monash.edu.au/bai/book

**Current BN-related projects**
• BNs for Epidemiology (with Kevin Korb, Charles Twardy)
  • ARC Discovery Grant, 2004
  • Looking at Coronary Heart Disease data sets
  • Learning hybrid networks: continuous and discrete variables
• BNs for supporting meteorological forecasting process (DSS’2004) (with PhD student Tal Boneh, K. Korb, BoM)
  • Building domain ontology (in Protege) from expert elicitation
  • Automatically generating BN fragments
  • Case studies: fog, hailstorms, rainfall
• Ecological risk assessment
  • Goulburn Water, native fish abundance
  • Sydney Harbour Water Quality

**Open Research Questions**
• Methodology for combining expert elicitation and automated methods
  • expert knowledge used to guide search
  • automated methods provide alternatives to be presented to experts
• Evaluation measures and methods
  • may be domain dependent
• Improved tools to support elicitation
  • Reduce reliance on BN expert
  • e.g. visualisation of d-separation
• Industry adoption of BN technology
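
The Information Reward slide lists three properties: zero reward for reporting the priors, unbounded below for a bad prediction, and bounded above by a maximum that depends on the priors. The sketch below shows one scoring rule with exactly those properties, a log ratio of reported probability to prior summed over the states of the predicted variable. This form is an assumption chosen to match the listed properties; it is not confirmed to be the formula used in the project, and the function name and the three-state example with priors of 1/3 are illustrative only.

```python
import math


def information_reward(priors, reported, true_state):
    """Log-ratio reward for one prediction (assumed form, see lead-in).

    priors     -- prior probability of each state, each strictly between 0 and 1
    reported   -- probability the model reports for each state
    true_state -- index of the state that actually occurred
    """
    if len(priors) != len(reported):
        raise ValueError("priors and reported must have the same length")
    if any(not 0.0 < p < 1.0 for p in priors):
        raise ValueError("each prior must be strictly between 0 and 1")

    reward = 0.0
    for i, (prior, prob) in enumerate(zip(priors, reported)):
        if i == true_state:
            # Reward probability mass placed on the state that occurred.
            reward += math.log2(prob / prior) if prob > 0 else -math.inf
        else:
            # Reward probability mass withheld from states that did not occur.
            reward += math.log2((1 - prob) / (1 - prior)) if prob < 1 else -math.inf
    return reward


if __name__ == "__main__":
    priors = [1 / 3, 1 / 3, 1 / 3]
    print(information_reward(priors, priors, 0))           # reporting the priors
    print(information_reward(priors, [0.8, 0.1, 0.1], 0))  # confident and right
    print(information_reward(priors, [0.1, 0.8, 0.1], 0))  # confident and wrong
```

With this form, a model that only echoes the priors scores zero, a certain and correct prediction reaches the prior-dependent maximum, and a certain but wrong prediction scores negative infinity, which is why the measure rewards calibrated probabilities rather than just the most probable state.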