Using Bayesian networks for Water Quality Prediction in Sydney Harbour

Ann Nicholson • Shannon Watson, Honours 2003 • Charles Twardy, Research Fellow • School of Computer Science and Software Engineering, Monash University

Overview • Representing uncertainty • Introduction to Bayesian Networks • Syntax, semantics, examples • The knowledge engineering process • Sydney Harbour Water Quality Project 2003 • Summary of other BN research

Sources of Uncertainty • Ignorance • Inexact observations • Non-determinism • AI representations • Probability theory • Dempster-Shafer • Fuzzy logic

Probability theory for representing uncertainty • Assigns a numerical degree of belief between 0 and 1 to facts • e.g. the proposition “it will rain today” is either true or false • Prior (unconditional) probability: P(“it will rain today”) = 0.2 • Posterior (conditional) probability: P(“it will rain today” | “rain is forecast”) = 0.8 • Bayes’ Rule: P(H|E) = P(E|H) × P(H) / P(E)
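As a worked illustration of Bayes' Rule, the sketch below updates the prior P(rain) = 0.2 from the slide once a forecast is observed. The two likelihoods, P(forecast | rain) = 0.8 and P(forecast | no rain) = 0.05, are hypothetical values chosen only so that the result reproduces the posterior of 0.8 quoted above; they are not from the talk.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' rule for a binary hypothesis H and evidence E.

    prior                = P(H)
    likelihood           = P(E | H)
    likelihood_given_not = P(E | not H)
    """
    # P(E) by the law of total probability.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# P(rain) = 0.2 is from the slide; both forecast likelihoods are hypothetical.
p_rain_given_forecast = posterior(prior=0.2, likelihood=0.8,
                                  likelihood_given_not=0.05)
print(round(p_rain_given_forecast, 3))
```

The denominator P(E) is computed by summing over both values of H, which is the step most often skipped when the rule is applied by hand.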

Bayesian networks • A Bayesian Network (BN) represents a probability distribution graphically, as a directed acyclic graph • Nodes: random variables • R: “it is raining”, discrete values T/F • T: temperature, continuous or discrete variable • C: colour, discrete values {red, blue, green} • Arcs indicate conditional dependencies between variables • e.g. with arcs A → S and A → T, P(A,S,T) decomposes to P(A)P(S|A)P(T|A)

Bayesian networks • Conditional Probability Distribution (CPD) • Associated with each variable • Gives the probability of each state given parent states • Example network: Flu → Te → Th • Flu: “Jane has the flu”, P(Flu=T) = 0.05 • Te: “Jane has a high temp”, P(Te=High|Flu=T) = 0.4, P(Te=High|Flu=F) = 0.01 (the arc Flu → Te models a causal relationship) • Th: “Thermometer temp reading”, P(Th=High|Te=H) = 0.95, P(Th=High|Te=L) = 0.1 (the arc Te → Th models possible sensor error)

BN inference • Evidence: observation of a specific state • Task: compute the posterior probabilities for query node(s) given evidence • [Figure: four types of inference illustrated on networks over Flu, TB, Te and Th: diagnostic, predictive, intercausal and mixed inference]
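Diagnostic inference on the flu example can be sketched by brute-force enumeration: multiply the CPD entries along the chain Flu → Te → Th, sum out the unobserved node Te, and normalise. The probabilities are the ones given in the flu example; the function names are illustrative, and real BN packages use far more efficient algorithms than this.

```python
# CPD entries from the flu example; True means "has flu" / "High".
P_FLU = {True: 0.05, False: 0.95}
P_TE_HIGH = {True: 0.4, False: 0.01}   # P(Te=High | Flu)
P_TH_HIGH = {True: 0.95, False: 0.1}   # P(Th=High | Te)


def joint(flu, te_high, th_high):
    """P(Flu, Te, Th) = P(Flu) * P(Te | Flu) * P(Th | Te)."""
    p_te = P_TE_HIGH[flu] if te_high else 1.0 - P_TE_HIGH[flu]
    p_th = P_TH_HIGH[te_high] if th_high else 1.0 - P_TH_HIGH[te_high]
    return P_FLU[flu] * p_te * p_th


def posterior_flu(th_high):
    """P(Flu=T | Th=th_high), summing out the unobserved node Te."""
    weight = {
        flu: sum(joint(flu, te, th_high) for te in (True, False))
        for flu in (True, False)
    }
    return weight[True] / (weight[True] + weight[False])


print(round(posterior_flu(True), 3))   # belief in flu after a high reading
```

A high thermometer reading raises the belief in flu from the prior of 0.05 to roughly 0.18: the evidence is informative but weakened by the sensor error modelled on the Te → Th arc.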

BN software • Commercial packages: Netica, Hugin, Analytica (all with demo versions) • Free software: Smile, Genie, JavaBayes • See appendix B, Korb & Nicholson, 2004 • Example running Netica software

Decision networks • Extension to basic BN for decision making • Decision nodes • Utility nodes • EU(Action) = Σ_o p(o|Action,E) × U(o) • Choose the action with the highest expected utility • Example
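The expected-utility rule can be sketched in a few lines. The umbrella decision, the outcome names and all utility values below are hypothetical; only the rain probabilities 0.2 (prior) and 0.8 (after a forecast) are taken from the earlier probability slide.

```python
def expected_utility(outcome_probs, utilities):
    """EU(action) = sum over outcomes o of p(o | action, E) * U(o)."""
    return sum(p * utilities[o] for o, p in outcome_probs.items())


def best_action(model, utilities):
    """Return the action whose outcome distribution has the highest EU."""
    return max(model, key=lambda a: expected_utility(model[a], utilities))


# Hypothetical utilities for three outcomes.
UTILITIES = {"dry_unencumbered": 100.0, "dry_carrying": 70.0, "wet": 0.0}


def umbrella_model(p_rain):
    """p(o | action, E) for each action, given the current belief in rain."""
    return {
        "take": {"dry_carrying": 1.0},
        "leave": {"wet": p_rain, "dry_unencumbered": 1.0 - p_rain},
    }


print(best_action(umbrella_model(0.2), UTILITIES))   # no forecast observed
print(best_action(umbrella_model(0.8), UTILITIES))   # rain is forecast
```

With these numbers the preferred action flips once the forecast is observed, which is the point of attaching decision and utility nodes to a BN: the recommendation follows the posterior, not the prior.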

Elicitation from experts • Variables • important variables? values/states? • Structure • causal relationships? • dependencies/independencies? • Parameters (probabilities) • quantify relationships and interactions? • Preferences (utilities)

Expert Elicitation Process • [Figure labels: BN expert, domain expert, BN tools] • These stages are done iteratively • Stops when further expert input is no longer cost effective • Process is difficult and time consuming • Current BN tools • inference engine • GUI • Next generation of BN tools?

Knowledge discovery • There is much interest in automated methods for learning BNs from data • parameters, structure (causal discovery) • Computationally complex problem, so current methods have practical limitations • e.g. limit number of states, require variable ordering constraints, do not specify all arc directions • Evaluation methods

Knowledge Engineering for Bayesian Networks (KEBN) 1. Building the BN • variables, structure, parameters, preferences • combination of expert elicitation and knowledge discovery 2. Validation/Evaluation • case-based, sensitivity analysis, accuracy testing 3. Field Testing • alpha/beta testing, acceptance testing 4. Industrial Use • collection of statistics 5. Refinement • Updating procedures, regression testing

The KEBN process

Quantitative KE process

Water Quality for Sydney Harbour • Water quality for recreational use • Beachwatch / Harbourwatch Programs • Bacteria samples used as pollution indicators • Many variables influence bacterial levels: rainfall, tide, wind, sunlight, temperature, pH, etc.

Past studies • Hose et al. used a multidimensional scaling model of Sydney Harbour • low predictive accuracy, unable to handle the noisy bacteria samples, explained 63% of bacteria variability (Port Jackson) • Ashbolt and Bruno: • agree with Hose et al., plus wind effects, sunlight hours, tide • Crowther et al. (UK): • rainfall, tide, sampling times, sunshine, wind • explained 53% of bacteria variability • Other models developed by the USEPA to model estuaries: • QUAL2E: steady-state receiving water model • WASP: time-varying dispersion model • EFDC: 3D hydrodynamic model • EPA in Sydney interested in a model applying the causal knowledge of the domain

EPA Guidelines

Stages of Project • Preparation of EPA data (rainfall only) • Hand-craft simple networks for rainfall data • Comparison of hand-crafted networks with range of learners (using Weka software) • Using CaMML to learn BN on extended data set • [Timeline labels: 2003 Honours project; 2003/04 summer vacation project]

EPA Data • Database 1: • E. coli and Enterococci counts (cfu/100mL), thresholds 150 and 35 respectively • 60 water samples each year since 1994 at 27 sites in Sydney Harbour • Fields: Enterococci, E. coli, raining, sunny, drain running, temperature, time of sample, direction of sampling run, date, site name, beach code • Database 2: • Rainfall readings (mm) at 40 locations around Sydney

Data Preparation • New file format: Date BeachCode Entc Ecoli D1 D2 D3 D4 D5 D6 • D1 = rainfall on day of collection • D6 = rainfall 5 days previously • Rainfall data had many missing entries
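The lagged-rainfall layout can be sketched as follows. This is an assumed reconstruction of the record format, not the project's actual preparation script: the function names, the beach code "B01" and all sample values are hypothetical, and missing rainfall readings are kept as None rather than imputed.

```python
from datetime import date, timedelta


def rainfall_lags(rain_by_day, sample_day, n_days=6):
    """Return [D1..Dn]: D1 is rainfall on sample_day, Dk is k-1 days earlier.

    Days with no reading come back as None instead of a guessed value.
    """
    return [rain_by_day.get(sample_day - timedelta(days=k))
            for k in range(n_days)]


def make_row(sample_day, beach_code, entc, ecoli, rain_by_day):
    """Build one record: Date BeachCode Entc Ecoli D1 ... D6."""
    return ([sample_day.isoformat(), beach_code, entc, ecoli]
            + rainfall_lags(rain_by_day, sample_day))


# Hypothetical rainfall readings (mm) with gaps on some days.
rain = {date(2003, 1, 10): 12.0, date(2003, 1, 9): 0.0, date(2003, 1, 7): 3.5}
print(make_row(date(2003, 1, 10), "B01", 40, 120, rain))
```

Keeping gaps explicit leaves the choice of how to treat missing rainfall (drop the row, average other gauges, or let the BN handle missing evidence) to a later step.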

Rainfall BNs • Hand-crafted BNs to predict bacteria using rainfall only • Started with deterministic BN that implemented EPA guidelines • Looked at varying number of previous days rainfall for predicting bacteria • Investigated various discretisations of variables
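Discretisation of a continuous variable into BN states can be sketched with a list of cut points. The rainfall cut points and labels below are hypothetical, since the talk does not list the discretisations that were tried. The Enterococci threshold of 35 cfu/100mL is from the EPA data description; treating a count equal to the threshold as "high" is an assumption.

```python
from bisect import bisect_right

# Hypothetical rainfall cut points (mm) and state labels.
RAIN_EDGES_MM = [1.0, 10.0, 25.0]
RAIN_LABELS = ["none", "light", "moderate", "heavy"]

# Threshold from the EPA data description; ">= 35 is high" is an assumption.
ENTC_EDGES = [35.0]
ENTC_LABELS = ["low", "high"]


def discretise(value, edges, labels):
    """Map a number to a label; needs len(labels) == len(edges) + 1."""
    return labels[bisect_right(edges, value)]


print(discretise(12.5, RAIN_EDGES_MM, RAIN_LABELS))
print(discretise(40, ENTC_EDGES, ENTC_LABELS))
```

Changing the edge lists is all that is needed to compare alternative discretisations on the same data.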

EPA Guidelines as BN

Davidson BN: 1 day rainfall

Davidson BN: 6 days rainfall

Evaluation • Split data 50-50 training/testing • 10-fold cross-validation • Measures: Predictive Accuracy & Information Reward • Also looked at ROC curves (correct classification vs false positives) • Using Weka: Java environment for machine learning tools and techniques • Small data: 4 beaches: Chinamans, Edwards, Balmoral (all Middle Harbour), Clifton (Port Jackson) • Using 6 days rainfall averaged from all rain gauges
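For readers unfamiliar with 10-fold cross-validation, a minimal index-splitting sketch is given below. It is a generic illustration rather than Weka's implementation (which, for example, can stratify folds by class); the function name and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every item appears in exactly one test fold.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Each model is trained on the train indices and scored on the held-out test indices, and the scores are then averaged over the folds.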

Predictive accuracy • Examining each joint observation in the sample • Adding any available evidence for the other nodes • Updating the network • Use value with highest probability as predicted value • Compare predicted value with the actual value
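The final two steps of the procedure above (take the most probable state, compare with the actual value) can be sketched as below. The posterior distributions are assumed to have already been produced by updating the network on each case; the example values are hypothetical.

```python
def predictive_accuracy(posteriors, actual):
    """Fraction of cases whose most probable state equals the actual state.

    posteriors: one dict per case, mapping state -> posterior probability
    actual:     the observed state for each case
    """
    correct = 0
    for dist, truth in zip(posteriors, actual):
        predicted = max(dist, key=dist.get)   # state with highest probability
        correct += predicted == truth
    return correct / len(actual)


# Hypothetical posteriors for a two-state bacteria node over three cases.
posteriors = [
    {"high": 0.7, "low": 0.3},
    {"high": 0.4, "low": 0.6},
    {"high": 0.9, "low": 0.1},
]
print(predictive_accuracy(posteriors, ["high", "low", "low"]))
```

Note that this measure ignores how confident each prediction was: 0.51 and 0.99 count the same, which is one motivation for also reporting a calibration-sensitive score.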

Information Reward • Rewards calibration of probabilities • Zero reward for just reporting priors • Unbounded below for a bad prediction • Bounded above by a maximum that depends on priors
IR = 0
Repeat for each state i:
    if i == correct state: IR += log(1 / p[i])
    else: IR += log(1 / (1 - p[i]))
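The pseudocode on this slide, as transcribed, does not show where the priors enter. The sketch below is one reading that satisfies the four listed properties: each state is scored by the log-ratio of the model's probability to the prior. This log-ratio form, the use of base-2 logarithms and the function names are assumptions, not a confirmed statement of the measure used in the project.

```python
import math


def _log_ratio(p, q):
    """log2(p / q), returning -inf when p is zero (q assumed in (0, 1))."""
    return -math.inf if p <= 0.0 else math.log2(p / q)


def information_reward(predictions, priors, actual):
    """Sum a log-ratio reward over every state of every case.

    predictions: one dict per case, state -> model probability
    priors:      dict, state -> prior probability (strictly between 0 and 1)
    actual:      the true state for each case
    """
    total = 0.0
    for dist, truth in zip(predictions, actual):
        for state, p in dist.items():
            if state == truth:
                total += _log_ratio(p, priors[state])
            else:
                total += _log_ratio(1.0 - p, 1.0 - priors[state])
    return total


priors = {"high": 0.25, "low": 0.75}   # hypothetical prior distribution
print(information_reward([priors], priors, ["high"]))                      # reports priors
print(information_reward([{"high": 1.0, "low": 0.0}], priors, ["high"]))   # confident, right
print(information_reward([{"high": 1.0, "low": 0.0}], priors, ["low"]))    # confident, wrong
```

Under this reading, reporting the priors scores zero, a confident correct prediction earns a finite maximum set by the priors, and a confident wrong prediction is penalised without bound.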

Evaluation: Weka learners • Naïve Bayes • J48 (version of C4.5) • CaMML: causal BN learner, using MML metric • AODE • TAN • Logistic • “Davidson” BN: 6 days previous rainfall • With and without adaptation of parameters (case learning) • “Guidelines” BN: 3 days previous rainfall • Deterministic rule • With adaptation of parameters (case learning)

Results

Results: ROC Curves

Results: area under ROC Curves

Results: ROC Curves • At ~20% false positives, can detect ~60% of events • At ~45% false positives, can detect ~75% of events • At ~60% false positives, can detect ~80% of events • Implications? • The current guidelines, at a 45% false-positive rate, achieve a 60% hit rate • Can either keep that false-positive rate and gain an extra 15% of events • Or keep the same hit rate at roughly half the false-positive rate
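The trade-off read off the ROC curves comes from sweeping a decision threshold over the predicted probability of a pollution event. A minimal sketch of that sweep and of the area under the curve follows; the scores and labels are hypothetical, and both classes are assumed to be present in the data.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, hit_rate) pairs, one per threshold.

    scores: predicted probability of an event for each case
    labels: 1 if the event actually occurred, else 0
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.1]   # hypothetical model outputs
labels = [1, 0, 1, 0]           # hypothetical true outcomes
print(roc_points(scores, labels))
print(auc(roc_points(scores, labels)))
```

Each point on the curve is one possible operating policy, so choosing between "more events caught" and "fewer false alarms" amounts to choosing a threshold.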

Example of CaMML BN

Future Directions?

Early BN-related projects • DBNs for discrete monitoring (PhD, 1992) • Approximate BN inference algorithms based on a mutual information measure for relevance (with Nathalie Jitnah, 1996-1999) • Plan recognition: DBNs for predicting users' actions and goals in an adventure game (with David Albrecht, Ingrid Zukerman, 1997-2000) • DBNs for ambulation monitoring and fall diagnosis (with biomedical engineering, 1996-2000) • Bayesian Poker (with Kevin Korb, 1996-2003)

Knowledge Engineering with BNs • Seabreeze prediction: joint project with Bureau of Meteorology • Comparison of existing simple rule, expert elicited BN, and BNs from Tetrad-II and CaMML • ITS for decimal misconceptions • Methodology and tools to support knowledge engineering process • Matilda: visualisation of d-separation • Support for sensitivity analysis • Written a textbook: • Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson, Chapman & Hall / CRC, 2004. www.csse.monash.edu.au/bai/book

Current BN-related projects • BNs for Epidemiology (with Kevin Korb, Charles Twardy) • ARC Discovery Grant, 2004 • Looking at Coronary Heart Disease data sets • Learning hybrid networks: continuous and discrete variables • BNs for supporting meteorological forecasting process (DSS’2004) (with PhD student Tal Boneh, K. Korb, BoM) • Building domain ontology (in Protege) from expert elicitation • Automatically generating BN fragments • Case studies: fog, hailstorms, rainfall • Ecological risk assessment • Goulburn Water, native fish abundance • Sydney Harbour Water Quality

Open Research Questions • Methodology for combining expert elicitation and automated methods • expert knowledge used to guide search • automated methods provide alternatives to be presented to experts • Evaluation measures and methods • may be domain dependent • Improved tools to support elicitation • Reduce reliance on BN expert • e.g. visualisation of d-separation • Industry adoption of BN technology
