Using Advanced Data Mining and Integration in Environmental Risk Management

Ladislav Hluchy, Ondrej Habala, Martin Šeleng, Peter Krammer, Viet Tran
Institute of Informatics, Slovak Academy of Sciences

Contents
• EU FP7 project ADMIRE – overview
• Architecture of DMI solution in ADMIRE
• New DMI process language – DISPEL
• Pilot application scenarios – ORAVA, RADAR
  • goals, architecture, experimental results
• Tools in ADMIRE
SAMI 2011, Smolenice, Slovakia, January 2011

ADMIRE - Advanced Data Mining and Integration Research for Europe
• 7th Framework Programme
• ICT, Call 1.2.A
• Started in February 2008, running for 36 months
• Costs of €4.3 million, with €3 million in EC funding

Collaborators
• University of Edinburgh, UK (Coordinator)
  • NeSC - National e-Science Centre
  • EPCC - Edinburgh Parallel Computing Centre
• Fujitsu Labs of Europe, UK
• University of Vienna, Austria
  • Institute of Scientific Computing
• Universidad Politécnica de Madrid, Spain
  • Facultad de Informatica
• Slovak Academy of Sciences, Slovakia
  • Institute of Informatics
• ComArch S.A., Poland

ADMIRE Goals
• Accelerate access to and increase the benefits from data exploitation
• Deliver consistent and easy-to-use technology for extracting information and knowledge
• Cope with complexity, distribution, change and heterogeneity of services, data, and processes through an abstract view of data mining and integration
• Provide power to users and developers of data mining and integration processes

ADMIRE Structure
• WP1: High-Level Model and Language Research
  • Incremental development of models and languages with the goal of describing Data Mining and Integration (DMI) processes abstractly
• WP2: Architecture Research
  • Incremental development of a flexible, scalable and open DMI architecture
• WP3: Platform Support & Delivery
  • Deliver robust service platforms, support users and encapsulate knowledge in a book
• WP4: Service Infrastructure Development and Enhancement
  • Develop technology and services to enhance the DMI service infrastructure based on Fujitsu’s USMT

ADMIRE Structure
• WP5: Data Mining and Integration Tools Development
  • Develop and integrate tools that make the technology easier to use and reduce the frequency of failures
• WP6: Integrated Applications
  • Demonstration of validation and performance of architecture, language, platform and tools as an integrated environment for Data Mining and Integration
• WP7: Project Management
  • Management and coordination of the project

ADMIRE Architecture: Separation of Concerns

ADMIRE Architecture

DISPEL – Data Intensive Systems Process-Engineering Language
• For data-intensive distributed systems
• Connection point between complex application requests and complex enactment systems
• Benefit: method development, engineering and evolution of supported practices can take place independently in each world
• Describes enactment requests for streaming-data workflow processes
• “Process-engineering time” – transform and optimize the process in preparation for the enactment period

DISPEL: Simple Example
    String sql1 = "SELECT * FROM some_table";
    String sql2 = "SELECT * FROM table2";
    String resource = "128.18.128.255";
    SQLQuery query = new SQLQuery;
    |- sql1, sql2 -| => query.expression;
    |- resource -| => query.resource;
    Tee tee = new Tee;
    query.result => tee.connectInput;
Slide callouts:
• Creating streams of literals (the |- ... -| expressions)
• Creating connections (the => operator)

DISPEL – real use

ADMIRE’s High-Level Architecture

ADMIRE Gateways USMT

Security
• The framework is built on top of a formal Grid infrastructure; available security mechanisms include:
  • Transport-level security: SSL, HTTPS (currently available)
  • Message-level security: Web Services Security: SOAP Message Security
  • X.509 certificate authentication
  • Multiple stakeholder authorization
  • Explicit Trust Delegation (ETD)

Pilot Applications
• ADMIRE has 2 pilot applications
  • CRM
  • FloodApp
• FloodApp scenarios
  • Orava
  • Radar
  • SVP

ACRM Application
• Large-scale, distributed churn scenario
• 4 database parts, distributed among ADMIRE partners
• Graphical UI for business analysts
• Uses the ADMIRE workbench, DISPEL and framework to create predictions of customer churn
• Mining over distributed data

Flood Application: Data sets used in hydrological scenarios

Scenarios deployment in testbed
• Two scenarios (ORAVA, RADAR) completely deployed in the testbed
• Other scenario’s data are partially deployed
• 5 nodes (1 real + 4 virtual nodes)
• Databases (MySQL + PostgreSQL), GRIB files in file storage
• USMT (Unified System Management Technology - Jetty container), OGSA-DAI (Apache Tomcat)

Orava scenario
• Legend
  • Green area – Orava (part of northern Slovakia)
  • Blue – Orava reservoir and local rivers
  • Red dots – hydrological measurement stations
• Notes
  • We are interested only in hydrological stations below the Orava reservoir
  • In our tests we use hydrological station 5830 (Tvrdosin)

ORAVA – data mining concept
• Targets of data mining – water level and temperature at a station below the reservoir
• Predictors – rainfall amount (reservoir and station), air temperature (reservoir and station), reservoir discharge, reservoir temperature
• Diagram labels attached to the predictors: "Given in a schedule", "Predicted by a meteo model"

ORAVA – data integration
• Integration of data from
  • GRIB files
  • Reservoirs
• Inputs
  • Time period of experiment
  • Reservoir ID
  • List of hydro stations
  • Geo coordinates

ORAVA – data sets

ORAVA – integrated and preprocessed data
Figure: integrated raw data (plotted over time) and integrated preprocessed data (plotted over time). Filters shown between them: ReplaceMissingValues, LinearTrend, ZeroEpsilon, Kelvin2Celsius.
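The filters named on this slide can be sketched in Python roughly as follows. This is an illustration only: the filter names come from the slide, but their exact behaviour in ADMIRE is not described here, so the mean substitution, the epsilon threshold of 1e-6 and all function names are assumptions. LinearTrend is left out because the slide does not say what it computes.

```python
from typing import List, Optional


def replace_missing_values(series: List[Optional[float]]) -> List[float]:
    """Assumed behaviour: replace each missing value (None) with the mean of the known values."""
    known = [v for v in series if v is not None]
    if not known:
        raise ValueError("series has no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in series]


def zero_epsilon(series: List[float], eps: float = 1e-6) -> List[float]:
    """Assumed behaviour: treat readings with magnitude below eps as exact zeros."""
    return [0.0 if abs(v) < eps else v for v in series]


def kelvin_to_celsius(series: List[float]) -> List[float]:
    """Convert temperatures from kelvin to degrees Celsius."""
    return [v - 273.15 for v in series]


# Made-up sample: three hourly temperatures in kelvin, one of them missing.
raw_temperature_k = [283.15, None, 285.15]
clean_c = kelvin_to_celsius(replace_missing_values(raw_temperature_k))
```

Each filter takes and returns a plain list, so the filters can be chained in whatever order the preprocessing workflow requires.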

ORAVA – data mining
• Input - Integrated data
• Data Mining Phases:
  • Data understanding
    • Data visualization
    • Data quality exploration
  • Data preparation
    • Missing values substitution (ReplaceMissingValues filter)
    • Noise reduction (ZeroEpsilon filter)
    • Switching from one scale to another (Kelvin2Celsius filter)
    • Data modifying (LinearTrend filter)
  • Model training
    • Training on historical data (8760 records)
    • Linear Regression model
    • Neural networks - multilayer perceptron without hidden layers
  • Model evaluation
    • Testing of the trained model
    • N-fold cross validation
    • Using training sets
• Output - Prediction model
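The training and evaluation phases can be illustrated with a small, self-contained Python sketch. It is not the project's code: it fits a single-predictor least-squares line (the slide's model uses several predictors), the interleaved fold assignment is an assumption, and the data are synthetic.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = a * x + b with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def cross_validated_mae(xs: List[float], ys: List[float], folds: int) -> float:
    """Mean absolute error over all held-out records of an n-fold split."""
    n = len(xs)
    abs_errors = []
    for k in range(folds):
        # Assumed fold layout: every folds-th record, starting at k, is held out.
        test_idx = set(range(k, n, folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        a, b = fit_line(train_x, train_y)
        abs_errors.extend(abs(a * xs[i] + b - ys[i]) for i in test_idx)
    return sum(abs_errors) / len(abs_errors)


# Synthetic data lying exactly on y = 2x + 1, so the held-out error is near zero.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
mae = cross_validated_mae(xs, ys, folds=5)
```

For hourly time series such as those used here, contiguous folds may be a better choice than interleaved ones, since neighbouring records are strongly correlated.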

Orava – data mining results: prediction of temperature
Linear Regression model equation:

Orava – temperature prediction model comparison

Orava – prediction of water level
• Neural network model – multilayer perceptron
• Input parameters (6)
  • Rainfall ([S+1]), Water-Level ([X])
  • Outflows ([D], [D+1] – [D], ln([D]), sqrt([D]))
• Output
  • Difference of water level ([X+1] – [X])
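The six inputs and the output listed above could be assembled from raw series along these lines. The sketch assumes that [S], [X] and [D] denote rainfall, water level and outflow at the current time step and that "+1" denotes the next step; the function name and the sample values are invented for illustration.

```python
import math
from typing import List, Tuple


def build_samples(
    rain: List[float], level: List[float], outflow: List[float]
) -> List[Tuple[List[float], float]]:
    """Return (inputs, target) pairs, one per time step t that has a successor t + 1."""
    samples = []
    for t in range(len(level) - 1):
        d = outflow[t]
        inputs = [
            rain[t + 1],         # [S+1]: rainfall at the next step
            level[t],            # [X]: current water level
            d,                   # [D]: current outflow
            outflow[t + 1] - d,  # [D+1] - [D]: change in outflow
            math.log(d),         # ln([D]); requires a positive outflow
            math.sqrt(d),        # sqrt([D])
        ]
        target = level[t + 1] - level[t]  # [X+1] - [X]: change in water level
        samples.append((inputs, target))
    return samples


# Made-up values for three consecutive time steps.
samples = build_samples(
    rain=[0.0, 1.5, 0.2],
    level=[100.0, 101.0, 100.5],
    outflow=[4.0, 9.0, 9.0],
)
```

Predicting the difference rather than the absolute level keeps the target centred near zero, which is also why the relative errors on the next slide are reported "from difference".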

Orava – water level prediction
• Data count: 8735 records
• Activation function of the feed-forward neural network: sigmoid
• Correlation coefficient: 0.9816
• Mean absolute error: 0.4105
• Root mean squared error: 0.9673
• Relative absolute error: 30.5869 % (from difference)
• Root relative squared error: 19.2384 % (from difference)
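For reference, the error measures reported above are commonly defined as in the following Python sketch. Whether the tool used in the experiments applies exactly these definitions (for example, which mean it uses as the baseline for the relative errors) is an assumption, and the sample numbers are invented.

```python
import math
from typing import Dict, List


def regression_metrics(actual: List[float], predicted: List[float]) -> Dict[str, float]:
    """Common regression error measures; relative errors use the mean of `actual` as baseline."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(var_a * var_p),
        "mae": sum(abs_err) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        # Relative errors compare the model with always predicting the mean.
        "rae_percent": 100.0 * sum(abs_err) / sum(abs(a - mean_a) for a in actual),
        "rrse_percent": 100.0 * math.sqrt(sum(sq_err) / var_a),
    }


# Invented example: three exact predictions and one that is off by 1.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

A relative error below 100 % means the model does better than the constant-mean baseline.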

RADAR: Targets of data mining
• Very short-term rainfall prediction from weather radar data
• Movement of areas with higher air moisture content, and thus also higher precipitation potential
• Mining of matrices of data

Meteorological data
• Network of synoptic stations in Slovakia
• 27 stations in Slovakia
• Used data from the years 2007 and 2008
• Rainfall, humidity, atmospheric pressure and temperature values for each hour

RADAR isotonic model
• Current model for rainfall prediction
• Isotonic regression model structure
• Training on historical data
• Correlation coefficient: 0.4593
• Mean absolute error: 0.1105
• Root mean squared error: 0.5490
• Total number of instances: 89700
• Validation: 10-fold cross-validation
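As background, isotonic regression fits a non-decreasing step function to the data. The Python sketch below uses the classic pool-adjacent-violators procedure; it is a generic illustration with invented numbers, not the model or the tool used in the RADAR experiments.

```python
from typing import List


def isotonic_fit(ys: List[float]) -> List[float]:
    """Least-squares non-decreasing fit to ys, where ys is already ordered by the predictor."""
    blocks = []  # each block is [sum_of_values, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Invented example: the dip from 3.0 to 2.0 is pooled into a flat step of 2.5.
fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

Because the fitted function is a sequence of flat steps, such a model can be written out as a table of breakpoints and predicted values.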

Table of isotonic model

Hydrometeorological performance
Probability of detection (POD) at rainfall thresholds of 0.3 and 0.6 mm per hour:
• POD(0.3) = 63.87 %
• POD(0.6) = 56.22 %
Miss rate (MR) at rainfall thresholds of 0.3 and 0.6 mm per hour:
• MR(0.3) = 1.85 %
• MR(0.6) = 1.58 %
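Probability of detection is usually defined as hits / (hits + misses) over the observed events. The Python sketch below assumes that an event means observed rainfall at or above the threshold and that a hit means the prediction also reaches it; the slide does not state these details. The miss rate is not sketched, because the reported values are not 1 − POD and the slide does not give their definition. The sample values are invented.

```python
from typing import List


def probability_of_detection(
    observed: List[float], predicted: List[float], threshold: float
) -> float:
    """POD = hits / (hits + misses), counting events where observed rainfall >= threshold."""
    hits = sum(
        1 for o, p in zip(observed, predicted) if o >= threshold and p >= threshold
    )
    misses = sum(
        1 for o, p in zip(observed, predicted) if o >= threshold and p < threshold
    )
    if hits + misses == 0:
        raise ValueError("no observed events at this threshold")
    return hits / (hits + misses)


# Invented hourly rainfall values in mm.
observed = [0.0, 0.5, 0.7, 0.2, 1.0]
predicted = [0.1, 0.4, 0.2, 0.4, 0.9]
pod_03 = probability_of_detection(observed, predicted, 0.3)
pod_06 = probability_of_detection(observed, predicted, 0.6)
```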

RADAR model
• Other tested models
  • Neural networks, SMOreg, linear regression, ...
  • Reached correlation coefficients between 0.35 and 0.42
  • Validation: 10-fold cross-validation
Problems in model creation:
• The process is significantly stochastic
• Some input variables are backwards dependent on the output
• The meteorological process is very sensitive
• The reflection matrix represents the quantity of water in the atmosphere, not the exact rainfall rate in a specified area, as opposed to data from synoptic stations

ADMIRE Tools
• Registry client GUI
• Process designer
• SKSA
• Gateway Process Manager
• DMI Model Visualizer

Registry client GUI
• Read-only access to the ADMIRE Registry
  • list PEs and view their properties
  • search, sort PEs
• Write access to the Registry is done via DISPEL documents

Process Designer
• Manage your DMI project (files, directories – project structure)
• Select elements from the Registry
• View the canonical (DISPEL) representation of your DMI process in real time
• View the properties of your chosen elements
• Edit your DMI process graphically

Semantic Knowledge Sharing Assistant
• Provides access to existing users' knowledge, sorting and selecting it automatically according to the user's current working context
• Context the user works in (in the example: several reservoirs, one settlement)
• Knowledge that may be useful in this context, previously entered by other users

Gateway Process Manager
• Keep track of running processes
• Stop/pause/cancel the process
• View the process's source DISPEL
• Access the process's results (if available) in several ways – raw or visualized

DMI Model Visualizer
• Visualization of data mining models
• Read a Weka classifier object
• Produce a PMML (Predictive Model Markup Language) description of the model
• Show the PMML as a graphical tree

ADMIRE Project
Thank you for your attention.
