
The identification of exceptional values in the ESPON database


Presentation Transcript


  1. The identification of exceptional values in the ESPON database. Paul Harris & Martin Charlton, National Centre for Geocomputation, NUIM Maynooth, Ireland. Madrid seminar, 10/6/10

  2. Outline: ESPON DB data; Identifying exceptional values; Case study 1 (detecting logical input errors); Case study 2 (detecting statistical outliers); Next things to do…

  3. 1. ESPON DB data • Socio-economic, land cover, … • Continuous, categorical, nominal, ordinal, … • Spatial support: area units – NUTS 0/1/2/23/3 (whose boundaries may also change over time) • Temporal support: commonly, yearly units (with only a short time series)

  4. 2. Identifying exceptional values • Define two types: logical input errors (e.g. a negative unemployment rate) and statistical outliers (e.g. an unusually high unemployment rate) • Two-stage identification algorithm: Stage 1, identify input errors via mechanical techniques; Stage 2, identify outliers via statistical techniques

  5. Stage 1: Identify logical input errors

  6. Logical input errors… • Usually detected using some logical, mathematical approach • Statistical detection may also help… • Typical input errors: impossible values (e.g. negatives, fractions…); repeated data for different variables; data displaced between or within columns; data swapped between or within columns; wrong NUTS code or name; wrong NUTS regions used (e.g. for 1999 instead of 2006); missing value code (e.g. 9999 treated as a true value); etc.
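
A minimal sketch of such stage-1 checks, in Python with pandas, on a hypothetical ESPON-style table; the column names (nuts_code, unemployment_rate, activity_rate) and the NUTS-code pattern are illustrative assumptions, not the project's actual schema or rules.

```python
import pandas as pd

def flag_input_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that look like logical input errors (illustrative checks only)."""
    flags = pd.DataFrame(index=df.index)

    # Impossible values: a rate cannot be negative or exceed 100%.
    flags["impossible_rate"] = (df["unemployment_rate"] < 0) | (df["unemployment_rate"] > 100)

    # Missing-value codes mistakenly treated as true values (e.g. 9999).
    flags["missing_code"] = df["unemployment_rate"].isin([9999, -9999])

    # Repeated data for different variables: two columns that should differ but are identical.
    flags["repeated_columns"] = df["unemployment_rate"].eq(df["activity_rate"])

    # Wrong NUTS code: does not match a simple country-code + digits pattern (assumed).
    flags["bad_nuts_code"] = ~df["nuts_code"].astype(str).str.match(r"^[A-Z]{2}[0-9A-Z]{0,3}$")

    flags["any_error"] = flags.any(axis=1)
    return flags
```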

  7. Our approach… • Detect input errors mathematically (& statistically) • Flag observations if they are likely input errors • If possible, correct them; more likely, consult an expert on the data • Once happy, go to stage 2 and assume the data are error-free

  8. Stage 2: Identify statistical outliers

  9. Types of outliers…

  10. Our approach… • There is no single ‘best’ outlier detection technique, so… • Apply a representative selection of outlier detection techniques (which are simple & robust) • Flag an observation if it is a likely outlier according to each technique • Build up a weight of evidence for the likelihood of a given observation being statistically outlying • Suggest what type of outlier it is likely to be: aspatial, spatial, temporal, relationship, some mixture… • Consult an expert on the data to decide on the appropriate course of action • Here’s an example using nine techniques & three observations…
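
A minimal sketch of the weight-of-evidence bookkeeping, assuming each technique has already produced a boolean flag per observation; the dictionary layout and the simple count are assumptions about presentation only, not the study's exact scoring.

```python
import pandas as pd

def combine_flags(flags_by_technique: dict) -> pd.DataFrame:
    """Count how many of the detection techniques flag each observation."""
    evidence = pd.DataFrame(flags_by_technique)           # one boolean column per technique
    evidence["n_flags"] = evidence.sum(axis=1)            # 0..9 techniques in agreement
    evidence["weight_of_evidence"] = evidence["n_flags"] / len(flags_by_technique)
    return evidence.sort_values("n_flags", ascending=False)
```

An observation flagged by, say, 7 of the 9 techniques would then be a strong candidate to put before the data expert.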

  11. 3. Case study 1 (detecting logical input errors): Data • Data at NUTS3 level (1351 observations/regions) • Variable: GDP evolution (2000 to 2005) (percentage), calculated using 4 other variables • 205 logical input errors deliberately introduced, to the NUTS codes & the 4 variables used to calculate GDP evolution only • ~15% of data infected
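
The slide does not list the four underlying variables, so only the generic percentage-evolution formula is sketched here, on hypothetical columns gdp_2000 and gdp_2005.

```python
import pandas as pd

def gdp_evolution(df: pd.DataFrame) -> pd.Series:
    """Percentage change in GDP between 2000 and 2005 (column names assumed)."""
    return 100.0 * (df["gdp_2005"] - df["gdp_2000"]) / df["gdp_2000"]
```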

  12. Performance results • False negatives: 13.2% (e.g. in Italy) • False positives: 2.0% (e.g. in Spain) • Overall misclassification rate: 3.7%
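
For reference, a minimal sketch of how such rates follow from confusion-matrix counts (the counts here are placeholders, not the case-study values):

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Rates from counts of true/false positives and negatives."""
    return {
        "false_negative_rate": fn / (fn + tp),                    # introduced errors that were missed
        "false_positive_rate": fp / (fp + tn),                    # clean values wrongly flagged
        "misclassification_rate": (fn + fp) / (tp + fp + tn + fn),
    }
```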

  13. Consequences if we had ignored input errors…

  14. 4. Case study 2 (detecting statistical outliers): Data • Data at NUTS23 level for eight years: 2000-2007 • For each year, ‘unemployment rate’ calculated as [(Unemployed population)/(Active population)] • 8 variables at each of 790 regions = 6320 obs. • Data checked for input errors, i.e. stage 1 done
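
A minimal sketch of that rate calculation, assuming a long-format table with columns nuts_code, year, unemployed and active (names assumed):

```python
import pandas as pd

def unemployment_rates(df: pd.DataFrame) -> pd.DataFrame:
    """One row per NUTS23 region, one unemployment-rate column per year (2000-2007)."""
    df = df.copy()
    df["unemployment_rate"] = 100.0 * df["unemployed"] / df["active"]   # expressed as a percentage (assumed)
    return df.pivot(index="nuts_code", columns="year", values="unemployment_rate")
```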

  15. Presentation of results… • For brevity, let’s say we only need at least one of the 8 time-specific unemployment values in a region to be outlying… • (But we can identify outliers by year too)
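
A sketch of that aggregation, given a regions-by-years table of boolean flags:

```python
import pandas as pd

def region_flag(yearly_flags: pd.DataFrame) -> pd.Series:
    """True if a region is flagged in at least one of its 8 years."""
    return yearly_flags.any(axis=1)
```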

  16. Results 1: boxplot statistics (aspatial & univariate)
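
A minimal sketch of an aspatial, univariate boxplot rule (Tukey fences); the 1.5 × IQR multiplier is an assumption and may differ from the fence actually used in the study.

```python
import pandas as pd

def boxplot_flags(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values beyond k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)
```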

  17. Results 2: Hawkins’ test (spatial & univariate)

  18. Results 3: time series statistics (temporal & univariate)
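
A stand-in sketch of a temporal, univariate check (not necessarily the study's exact time-series statistic): within each region's short 8-year series, flag years far from the series median relative to the median absolute deviation; the threshold of 3 is an assumption.

```python
import pandas as pd

def temporal_flags(rates: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """rates: regions x years; returns a boolean frame of the same shape."""
    med = rates.median(axis=1)
    mad = rates.sub(med, axis=0).abs().median(axis=1)
    robust_z = rates.sub(med, axis=0).div(1.4826 * mad, axis=0)   # MAD-scaled deviation from the median
    return robust_z.abs() > threshold
```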

  19. Results 4: MLR residuals (aspatial linear relationships)
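
A minimal sketch of the MLR-residual idea: regress one variable on the others and flag large standardized residuals; the |z| > 3 cut-off is an assumption.

```python
import numpy as np

def mlr_residual_flags(X: np.ndarray, y: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag observations with large standardized residuals from an ordinary least-squares fit."""
    X1 = np.column_stack([np.ones(len(y)), X])            # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    z = (residuals - residuals.mean()) / residuals.std()
    return np.abs(z) > threshold
```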

  20. Results 5: LWR residuals (aspatial nonlinear relationships)

  21. Results 6: GWR residuals (spatial nonlinear relationships)
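
A simplified sketch in the spirit of GWR: each region gets its own weighted least-squares fit, with nearby regions (Gaussian kernel on centroid distance) weighted more heavily, and that region's own residual is kept. The fixed bandwidth is an assumption; a real GWR calibration would typically select it, e.g. by cross-validation.

```python
import numpy as np

def gwr_residuals(coords: np.ndarray, X: np.ndarray, y: np.ndarray, bandwidth: float) -> np.ndarray:
    """coords: n x 2 centroids; X: n x p predictors; returns one local residual per region."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    resid = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)     # distances to region i
        w = np.sqrt(np.exp(-0.5 * (d / bandwidth) ** 2))   # Gaussian kernel, square-rooted for WLS
        beta, *_ = np.linalg.lstsq(w[:, None] * X1, w * y, rcond=None)
        resid[i] = y[i] - X1[i] @ beta                     # residual at region i under its local fit
    return resid
```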

  22. Results 7: PCA residuals (aspatial linear relationships & model-free)
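
A minimal sketch of the PCA-residual idea (model-free): reconstruct each observation from the first k principal components of the standardized data and use the reconstruction error as an outlyingness score; k = 2 is an assumption. Locally weighted and geographically weighted variants (slides 23-24) would repeat this with observation weights, in the same spirit as the GWR sketch above.

```python
import numpy as np

def pca_residuals(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Per-observation reconstruction error from a rank-k PCA of standardized data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)               # standardize columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Zk = (U[:, :k] * s[:k]) @ Vt[:k]                       # rank-k reconstruction
    return np.linalg.norm(Z - Zk, axis=1)                  # larger = worse fit to the main structure
```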

  23. Results 8: LWPCA residuals (aspatial nonlinear relationships & model-free)

  24. Results 9: GWPCA residuals (spatial nonlinear relationships & model-free)

  25. Summary of results: weight of evidence

  26. Preliminary performance results • Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data… • False negatives: 10.3% • False positives: 34.3% • Overall misclassification rate: 26.1% • Problems: • Difficult to guarantee that our infections actually produce outliers… • The data already contains outliers (as shown)

  27. 5. Next things to do… • 1. Other ways of performance testing our approach: simulated data with known properties? statistical theory (or properties)? • 2. Refining each of our nine chosen techniques: robust extensions

  28. Thank You!
