Unsupervised Pattern Recognition for the Interpretation of Ecological Data

Presentation Transcript


  1. Unsupervised Pattern Recognition for the Interpretation of Ecological Data by Mark A. O’Connor Centre for Intelligent Environmental Systems School of Computing Staffordshire University

  2. Outline • Background • River pollution & biological monitoring • Pattern recognition • Self-organising maps • MIR-Max • RPDS (River Pollution Diagnostic System) • Conclusion

  3. Background • Work on use of artificial intelligence (AI) techniques started in 1989 by W. J. Walley and H. A. Hawkes • Biological monitoring of river quality widely used for many years • Current techniques based on subjective score systems, e.g. BMWP, and simplistic formulae, using only a fraction of the available data • Current systems (e.g. RIVPACS) rely on ‘reference states’ – need to identify a set of ‘unpolluted’ sites

  4. RIVPACS reference sites

  5. Aims • To produce a system for both classification and diagnosis of river quality • Make full use of all the available data • Not founded on subjective human evaluations (e.g. BMWP scores) • No subjective selection of ‘reference sites’ – a holistic view of ‘clean’ and ‘dirty’ water biology

  6. River pollution – ‘biomonitoring’ • Chemical assessments alone do not fully reflect environmental quality of a river • Organisms living in the river constitute a fundamental part of the river ecosystem • ‘Benthic macroinvertebrates’ used: • Abundant • Easy to collect and identify • Sufficient range of diverse species • Confined to a particular part of the river

  7. Interpretation of data • Experts use two complementary processes when interpreting biological data • ‘Plausible reasoning’ based on scientific knowledge of the ecological system • ‘Pattern recognition’ based on experience of past cases • Data from a site are interpreted ‘holistically’, rather than using e.g. specific ‘if … then …’ rules

  8. Pattern recognition • ‘Pattern recognition’ in AI terms attempts to classify or cluster sets of objects into groups using a specified set of features e.g. optical character recognition – the ‘objects’ are letters, the ‘features’ are the % of each square that is shaded, and the output ‘groups’ correspond to ‘a’, ‘b’, ‘c’, etc

  9. PR system for river quality • For river quality, the ‘objects’ are the river sites, the ‘features’ are the abundance levels of 76 selected creatures together with information such as width, depth, discharge, composition of river bed • The ‘output groups’ correspond to varying river quality types or classes

  10. Self-organising maps (SOMs) • An output lattice or ‘map’ of ‘nodes’ represents the clusters; each node is associated with a ‘prototype’ set of features • Training is ‘unsupervised’ • New input data is classified according to which prototype it best matches • Nodes are arranged so that nearby nodes on the output map represent similar patterns
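
The SOM idea on this slide can be illustrated with a minimal sketch. It is not the software used in the presentation: the map size, learning-rate schedule, Gaussian neighbourhood and the names `train_som` and `best_matching_unit` are all assumptions chosen for illustration, and it uses only the Python standard library.

```python
import math
import random

def best_matching_unit(protos, x):
    """Return the (row, col) of the prototype closest to x (squared Euclidean)."""
    best, best_d = None, float("inf")
    for r, row in enumerate(protos):
        for c, p in enumerate(row):
            d = sum((pi - xi) ** 2 for pi, xi in zip(p, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best

def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Unsupervised training: no class labels are used, only the feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One random prototype feature vector per node of the output lattice.
    protos = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
              for _ in range(rows)]
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # assumed linear decay
            sigma = sigma0 * (1.0 - frac) + 0.01     # assumed shrinking radius
            br, bc = best_matching_unit(protos, x)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winner on the map are pulled more strongly,
                    # which is what makes nearby nodes end up similar.
                    h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
                    p = protos[r][c]
                    for k in range(dim):
                        p[k] += lr * h * (x[k] - p[k])
            step += 1
    return protos
```

A new sample is then classified by calling `best_matching_unit(protos, sample)`, which returns the map node whose prototype it matches best.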

  11. River site SOM • 20x20 output maps produced using SOM • http://www.soc.staffs.ac.uk/research/groups/cies/somview/somview.htm • Nodes represented by points, referenced by axes. • Contours produced using Statistica maths package. • Heptageniidae (mayfly) generally indicates good water quality; it is sensitive to pollution.

  12. Comparison of feature maps • Unionidae (Swan Mussels) only live in gently flowing rivers, thus the feature maps of river slope and the occurrence of Unionidae are seen to be inversely related.

  13. Measurement of SOM quality • 2 aspects: • How well the data is classified (e.g. are very similar examples allocated to the same node/bin/neuron?) • How well the output nodes are ordered (e.g. do nodes that are close together in output space contain examples that are similar?)

  14. Classification • Mathematical theory of information introduced by C. Shannon (1949) • ‘Mutual information’ between two variables (X and Y, say) quantifies the amount of ‘information’ about X that is gained by a knowledge of Y • A ‘good’ classification should maximise the M.I. between inputs (i.e. taxonomic and environmental data) and outputs (i.e. allocated nodes)

  15. Ordering • Also need to ensure a good ordering across the output ‘map’ (a preservation of the neighbourhood relations in the input space) • Ordering can be measured using the correlation (r) between distances in data space (given some ‘distance’ or ‘dissimilarity’ measure between input feature sets) and Euclidean distances on the output map
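
As a rough illustration of the ordering measure described on this slide, the sketch below computes the Pearson correlation r between pairwise distances in data space and pairwise Euclidean distances on the output map. The slide leaves the data-space dissimilarity measure open; Euclidean distance between prototype vectors is used here purely as an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ordering_correlation(prototypes, positions):
    """r between data-space distances and map distances over all node pairs.

    prototypes[i] is the feature vector of node i (assumed Euclidean metric);
    positions[i] is its (row, col) location on the output map.
    """
    data_d, map_d = [], []
    for i, j in combinations(range(len(prototypes)), 2):
        data_d.append(math.dist(prototypes[i], prototypes[j]))
        map_d.append(math.dist(positions[i], positions[j]))
    return pearson(data_d, map_d)
```

A value of r close to 1 means that nodes which are far apart on the map also hold dissimilar patterns, i.e. neighbourhood relations are well preserved.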

  16. MIR-Max • Mutual Information and Regression Maximisation • M.I. between a set of n output classes C and an input feature Xj, which can take any of s possible values, is given by:
      $$I(C; X_j) = \sum_{i=1}^{n} \sum_{k=1}^{s} P(C_i)\, P(x_{jk} \mid C_i)\, \log \frac{P(x_{jk} \mid C_i)}{P(x_{jk})}$$
      where $P(x_{jk} \mid C_i)$ = probability of finding attribute Xj in its k-th state in class Ci, $P(C_i)$ = prior probability of class Ci, and $P(x_{jk})$ = prior probability of finding attribute Xj in its k-th state.
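
The mutual information between the classes and one feature can be estimated directly from counts, using the three probabilities defined on this slide. The sketch below is an illustration only: the function name is hypothetical, probabilities are estimated as simple relative frequencies, and base-2 logarithms (bits) are an assumption, since the slide does not state the base.

```python
import math
from collections import Counter

def mutual_information(classes, feature):
    """Estimate I(C; Xj) in bits from parallel lists.

    classes[i] is the output class of sample i; feature[i] is the state of
    attribute Xj for the same sample.
    """
    n = len(classes)
    class_counts = Counter(classes)
    state_counts = Counter(feature)
    joint_counts = Counter(zip(classes, feature))
    mi = 0.0
    for (c, x), count in joint_counts.items():
        p_joint = count / n                 # P(Ci) * P(x_jk | Ci)
        p_cond = count / class_counts[c]    # P(x_jk | Ci)
        p_state = state_counts[x] / n       # P(x_jk)
        mi += p_joint * math.log2(p_cond / p_state)
    return mi
```

With two equally sized classes, a feature whose state is fully determined by the class gives 1 bit, and a feature that is independent of the class gives 0.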

  17. MIR-Max clustering • ‘Clustering’ aim is to optimise the M.I. between the output groupings and the input variables (averaged over all of the variables) • Start from a sub-optimal clustering, randomly allocating the input samples to the output classes • Choose a sample and assess the effect of transferring from its current class (the ‘departure’ class) to another class (the ‘arrival’ class) • Make the transfer if it produces an increase in M.I. • Continue procedure until a stopping criterion is satisfied
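
The transfer procedure listed on this slide can be sketched as a simple greedy search. This is not the MIR-Max implementation itself: the sample order, the stopping criterion (a full sweep with no improving transfer, or a sweep limit) and the names `average_mi` and `greedy_mi_clustering` are assumptions, and the mutual information is recomputed from scratch on every trial, which is slow but easy to follow.

```python
import math
import random
from collections import Counter

def average_mi(labels, samples):
    """Mean over all features of I(C; Xj) in bits (log base 2 is assumed)."""
    n = len(samples)
    n_features = len(samples[0])
    class_counts = Counter(labels)
    total = 0.0
    for j in range(n_features):
        column = [s[j] for s in samples]
        state_counts = Counter(column)
        for (c, x), cnt in Counter(zip(labels, column)).items():
            total += (cnt / n) * math.log2(
                (cnt / class_counts[c]) / (state_counts[x] / n))
    return total / n_features

def greedy_mi_clustering(samples, n_classes, sweeps=20, seed=0):
    rng = random.Random(seed)
    # Start from a sub-optimal clustering: random allocation to classes.
    labels = [rng.randrange(n_classes) for _ in samples]
    best = average_mi(labels, samples)
    for _ in range(sweeps):
        improved = False
        for i in rng.sample(range(len(samples)), len(samples)):
            departure = labels[i]
            for arrival in range(n_classes):
                if arrival == departure:
                    continue
                labels[i] = arrival              # trial transfer
                score = average_mi(labels, samples)
                if score > best + 1e-12:         # keep only strict improvements
                    best, departure, improved = score, arrival, True
                else:
                    labels[i] = departure        # undo the transfer
        if not improved:                         # assumed stopping criterion
            break
    return labels, best
```

Because a transfer is kept only when it raises the average M.I., the score never decreases, but the search can stop in a local optimum.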

  18. MIR-Max ordering • ‘Ordering’ aim is to optimise the representation of the output classes in a 2d output space • Start from a random ordering of the output classes in an output space made up of a number of discrete locations • Select 2 output locations and assess the effect of exchanging their contents • If this results in an increase in the correlation r between distances in data space and distances in output space, make the swap • Continue procedure until a stopping criterion is satisfied
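
The swap procedure on this slide can be sketched in the same greedy style. Several simplifications are assumed here and are not taken from the presentation: every output location holds exactly one class, data-space distance is Euclidean distance between class prototype vectors, pairs are chosen at random, and the search stops after a fixed number of trials. All function names are hypothetical.

```python
import math
import random
from itertools import combinations

def ordering_r(prototypes, position):
    """Pearson r between data-space and map-space distances over class pairs."""
    pairs = list(combinations(range(len(prototypes)), 2))
    xs = [math.dist(prototypes[i], prototypes[j]) for i, j in pairs]
    ys = [math.dist(position[i], position[j]) for i, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def order_classes(prototypes, rows, cols, iterations=500, seed=0):
    rng = random.Random(seed)
    # position[i] is the (row, col) location currently holding class i.
    position = [(r, c) for r in range(rows) for c in range(cols)]
    if len(position) != len(prototypes):
        raise ValueError("this sketch assumes one class per output location")
    rng.shuffle(position)                        # random initial ordering
    best = ordering_r(prototypes, position)
    for _ in range(iterations):                  # assumed stopping criterion
        i, j = rng.sample(range(len(prototypes)), 2)
        position[i], position[j] = position[j], position[i]      # trial swap
        r = ordering_r(prototypes, position)
        if r > best:
            best = r                             # keep the swap
        else:
            position[i], position[j] = position[j], position[i]  # undo it
    return position, best
```

As with the clustering step, only improving swaps are kept, so r never decreases during the search.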

  19. MIR-Max results • Initial testing found that MIR-Max outperformed SOM with respect to ‘clustering’ (as measured by average mutual information) • MIR-Max specifically designed to maximise this measure; results show (on average) 18% improvement over SOM • MIR-Max maps were also better ‘ordered’ overall than those produced by SOM; ‘global’ ordering was better, but ‘local’ ordering was worse

  20. RPDS • River Pollution Diagnostic System • Developed for use by the British Environment Agency • Based on a MIR-Max clustering/classification of spring and autumn samples from over 6000 sites across England and Wales • ‘New’ samples are classified by RPDS, classifications help biologists to determine possible causes of pollution at the site

  21. RPDS - feature maps

  22. RPDS – cluster reports

  23. RPDS – cluster ‘templates’

  24. RPDS – sample input

  25. RPDS - classification

  26. Conclusion • MIR-Max provides a means of organising and visualising complex high-dimensional data • Can provide a powerful tool for environmental monitoring/classification and diagnosis • Find out more about AI and the environment from our website at: http://www.soc.staffs.ac.uk/research/groups/cies/ • mo3@staffs.ac.uk
