Not our data, but we use it in research

Wietse Dol, LEI-WUR (W.Dol@wur.nl), 6 October 2014

Wietse Dol
• PhD Econometrics
• 10 years University of Groningen (Econometrics, sampling theory)
• 21 years LEI (many different departments)
• Data and models, i.e. use/reuse and quality; troubleshooter + statistical methods + ICT + user interfacing
• Not an IT specialist but a researcher (I build software because I use it myself)
• Many model projects and user interfaces for models (not only LEI)
• Since 2006: data, data quality ≡ MetaBase

LEI: Agricultural Economic Research Institute
• Part of Wageningen University & Research centre (WUR)
• Part of the Social Science Group (SSG) within the WUR
• We are the research part of WUR/SSG in The Hague (advising the Ministry of Economic Affairs)
• Consultancy (applied research): ministries, EU, local government, industry, …
• Collecting data (farm data: FADN), building models, and agricultural content specialists

University vs. Research center
• University: teaching, publications, new theory and technology
• Research center:
  • applied work/consultancy
  • reusing things from the past (e.g. yearly publications)
  • sharing knowledge (how to become a content specialist)/teaching for small groups
  • working in groups (different disciplines)
• Working in (inter)national groups with many different disciplines
Research centers have experience in data management.

Primary vs. Secondary research data
Research data: collected, observed, or created for the purpose of analysis, to produce and validate original research results.
• Primary data: you collect it yourself, targeted to answer/validate your own questions.
• Secondary data: not yours, e.g. downloaded from a website.
• Growing need for secondary data (primary data is expensive and takes a lot of time to collect).
• Quality of data
• Meta-information and versioning are crucial

Production data
Meta-information: source, version, dimension, definitions, etc. Without proper meta-information you use the wrong data:
• Is FR with or without DOM (the French overseas departments)?
• Is the production in tons or in euros?
• Does the year start on 1-1 and end on 31-12?
• What is the definition of tomato?
• Owner of the data / version of the data / conditions of usage …
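The questions on this slide can be turned into a small metadata record that travels with every series. The sketch below is only an illustration: the class name `SeriesMetadata`, its field names and the example values are assumptions, not MetaBase's actual data model.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SeriesMetadata:
    """Hypothetical record holding the meta-information listed on the slide."""
    source: str                # who published the data
    version: str               # which release or download
    territory: str             # e.g. "FR"
    territory_note: str        # e.g. with or without DOM
    unit: str                  # tons or euros?
    period: str                # calendar year or another reference period?
    commodity_definition: str  # what exactly counts as "tomato"?
    owner: str                 # owner of the data
    conditions: str            # conditions of usage


def missing_fields(meta: SeriesMetadata) -> list[str]:
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(meta).items() if not value.strip()]


# Made-up example values, for illustration only.
example = SeriesMetadata(
    source="example statistical office",
    version="download of 2014-10-01",
    territory="FR",
    territory_note="excluding DOM",
    unit="tons",
    period="calendar year, 1-1 to 31-12",
    commodity_definition="",
    owner="example statistical office",
    conditions="",
)
print(missing_fields(example))
```

A record with empty fields is a warning sign: here the commodity definition and the conditions of usage still have to be looked up before the series is used.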

Lifecycle Model of data http://www.dcc.ac.uk/resources/curation-lifecycle-model

Data: who does what (Everybody / Not often / Seldom)
• Use data
• How to get the data, filter it and store it
• Inspection and quality checks on the data
• How to make it available for others
• What scientific actions are done on the data
• Curate, preserve, versions, …
Lifecycle Model: don’t do it alone, do it as a GROUP and communicate

Types of databases according to MetaBase
• Statistical database
• Scientific database
• Meta-database

Statistical database: secondary data
Databases provided by international organizations like the EU, FAO, OECD and World Bank are in general statistical databases:
• Good web interfaces for downloading data
• Data are stored as they are received
• Data are consistent in their own domain
• No aggregations are made when underlying data are missing
• Not much attention for data checking
• No versioning system (data changes

Scientific versus Statistical database
• Problems with a statistical database:
  • Different definitions of territories and commodities
  • Typing errors
  • Missing data
  • Breaks in series
• Scientific database:
  • Problems solved
  • Transparency (original data sources and underlying assumptions are kept)
  • Versioning of the data
  • Essential for modeling and research
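Two of the problems above, missing data and breaks in series, can be screened for with a few lines of script. This is a minimal sketch under stated assumptions: the function names, the `max_ratio` threshold and the example numbers are invented for illustration and are not MetaBase procedures. A large year-on-year jump only flags a candidate; it may be a typing error, a unit change or a real event.

```python
from __future__ import annotations


def missing_years(series: dict[int, float]) -> list[int]:
    """Years between the first and last observation that have no value."""
    years = sorted(series)
    return [y for y in range(years[0], years[-1] + 1) if y not in series]


def possible_breaks(series: dict[int, float], max_ratio: float = 2.0) -> list[int]:
    """Years whose value differs from the previous observation by more than max_ratio."""
    years = sorted(series)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        a, b = series[prev], series[cur]
        if a > 0 and b > 0 and max(a / b, b / a) > max_ratio:
            flagged.append(cur)
    return flagged


# Made-up series: 2002 is missing and 2004 looks like an extra digit or a unit change.
production = {2000: 100.0, 2001: 104.0, 2003: 98.0, 2004: 980.0, 2005: 1010.0}
print(missing_years(production))    # [2002]
print(possible_breaks(production))  # [2004]
```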

Structural design of a scientific database
• Keywords for structural design (HarDFACTS project, IPTS 2007, done by vTI/LEI):
  • Transparent
  • Harmonised
  • Complete
  • Consistent
HarDFACTS: Harmonised Database for Agricultural Commodity Time Series
=> The amount of effort and cost scares institutes, but it is often a “hidden” cost.

Transparent
• Original data from the statistical database are stored
• Complete and consistent data are stored
• Original and completed data can be compared
• Calculation procedures are stored and can be repeated (scripting language)
Harmonised
The definition used here is to bring together the different international databases in one framework and to link the data through a unique coding system (keywords are classifications and tree structures, super-classifications).
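Linking sources through one coding system can be pictured as a concordance table that maps each source's own code to a common code while keeping the original source and code next to every value. The sketch below is an assumption-laden illustration: the source names, the codes and the `harmonise` function are hypothetical placeholders, not real FAO or Eurostat codes and not the MetaBase implementation (which the slides say is written in GAMS).

```python
from collections import defaultdict

# Hypothetical concordance: (source, source code) -> common code.
CONCORDANCE = {
    ("SOURCE_A", "A-001"): "TOMATO",
    ("SOURCE_B", "B-TOM"): "TOMATO",
    ("SOURCE_B", "B-WHT"): "WHEAT",
}


def harmonise(records):
    """Group (source, code, year, value) records under the common code.

    The original source and code stay attached to every value (transparency);
    records without a concordance entry are returned separately, not dropped.
    """
    linked = defaultdict(list)
    unmapped = []
    for source, code, year, value in records:
        common = CONCORDANCE.get((source, code))
        if common is None:
            unmapped.append((source, code))
        else:
            linked[common].append(
                {"source": source, "source_code": code, "year": year, "value": value}
            )
    return dict(linked), unmapped


# Made-up records for illustration.
records = [
    ("SOURCE_A", "A-001", 2012, 15.0),
    ("SOURCE_B", "B-TOM", 2012, 15.4),
    ("SOURCE_B", "B-WHT", 2012, 40.0),
    ("SOURCE_A", "A-999", 2012, 1.0),
]
linked, unmapped = harmonise(records)
print(sorted(linked), unmapped)
```

Once both sources sit under the same common code, their values for the same year can be compared side by side, which is also how data alternatives can be found.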

Complete
The definition used in MetaBase is that econometric procedures are proposed to complete the new (time) series in the database (especially needed for models).
Consistent
The definition used here is that the interrelationships of the data in the database hold over classifications (time, territories and variables).
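Both definitions can be illustrated with a small script. Linear interpolation stands in here for the econometric completion procedures, which the slides do not specify, and the consistency check only tests that an aggregate equals the sum of its members. All names and numbers are assumptions made for the example.

```python
from __future__ import annotations


def interpolate_gaps(series: dict[int, float]) -> dict[int, float]:
    """Complete a series by linear interpolation between observed years."""
    years = sorted(series)
    completed = dict(series)  # the original observations are never overwritten
    for left, right in zip(years, years[1:]):
        step = (series[right] - series[left]) / (right - left)
        for year in range(left + 1, right):
            completed[year] = series[left] + step * (year - left)
    return completed


def inconsistent_years(total: dict[int, float],
                       members: dict[str, dict[int, float]],
                       tolerance: float = 1e-6) -> list[int]:
    """Years in which the aggregate differs from the sum of all its members."""
    bad = []
    for year, value in sorted(total.items()):
        parts = [m[year] for m in members.values() if year in m]
        # Only compare when every member has a value for that year.
        if len(parts) == len(members) and abs(sum(parts) - value) > tolerance:
            bad.append(year)
    return bad


# Made-up numbers for illustration.
print(interpolate_gaps({2000: 10.0, 2003: 16.0}))
region_total = {2000: 30.0, 2001: 35.0}
region_members = {
    "A": {2000: 10.0, 2001: 12.0},
    "B": {2000: 20.0, 2001: 22.0},
}
print(inconsistent_years(region_total, region_members))  # [2001]
```

Keeping the completed values in a separate dictionary from the originals matches the transparency requirement: original and completed data can still be compared.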

Versioning of your research
Main reason for versioning: reproducibility
• The software you use changes: software versions
• Data change, are updated or corrected: data versions
• You discover errors in your research process or you improve the procedure: model versions
• Best advice: do not use a spreadsheet but a tool with a scripting language (SQL, R, GAMS, …) and store data in a database (with a good data model). The scripts then document how the original data were transformed into the data of your research
• Store data and scripts in a version control system such as SVN, e.g. with TortoiseSVN (http://tortoisesvn.net/)
• Do it as a group and (re)use others’ results.
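One lightweight way to tie software, data and model versions together is to write a small manifest next to every result and commit it with the scripts. The sketch below assumes a hypothetical `build_manifest` helper and manifest layout; it is not part of SVN or MetaBase, and it uses Python only as an example of a scripting language.

```python
import hashlib
import json
import platform
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Checksum of a data file; it changes whenever the data change."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, script_version: str) -> dict:
    """Record software version, script (model) version and data checksums."""
    return {
        "python": platform.python_version(),
        "script_version": script_version,
        "data": {Path(p).name: file_sha256(Path(p)) for p in data_files},
    }


def write_manifest(manifest: dict, target: Path) -> None:
    """Store the manifest as sorted JSON so differences between versions are easy to see."""
    target.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
```

A call such as `write_manifest(build_manifest(["production.csv"], "r42"), Path("manifest.json"))` (file name and revision label are made up) gives a file that can be stored in the repository with the scripts.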

Versioning 2
• Try to separate the model (script) from the data
• Make generic scripts when possible (re-use)
• Store scripts and data in separate SVN repositories
• Add meta-information to your data as well as to your scripts, e.g. register the versions of the software you use
• Test whether your data and code also run on other computers
Example: outlier testing in MetaBase
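The slides do not say which outlier test MetaBase uses. As a generic illustration of a script that is kept separate from the data, the sketch below applies a common robust rule, the modified z-score based on the median absolute deviation (MAD) with the usual 3.5 cut-off. The function name and the example series are assumptions.

```python
from statistics import median


def mad_outliers(series, threshold=3.5):
    """Return the years whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD. This is a generic rule,
    not the test implemented in MetaBase.
    """
    values = list(series.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread around the median, so the rule cannot be applied
    return [
        year
        for year, value in sorted(series.items())
        if 0.6745 * abs(value - centre) / mad > threshold
    ]


# Made-up series: the 2004 value looks like a typing error (extra digit).
area = {2000: 100.0, 2001: 102.0, 2002: 98.0, 2003: 101.0, 2004: 1010.0, 2005: 99.0}
print(mad_outliers(area))  # [2004]
```

The rule works on levels, so for series with a strong trend it is more sensible to apply it to year-on-year changes instead.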

Land under permanent crop in Spain by Eurostat

Versioning 3
• Versioning looks time-consuming, but when you make mistakes it is easy to go back to an old situation.
• It is also a good first step towards sharing data etc.
• Works very well in groups.
• Easy to see differences between versions.
• Versioning makes it possible to reproduce research, even in five years’ time.
• Frequency of versioning: some make a version every day.
• Practical advice: make a version when you have a publication.

MetaBase: data management for data

MetaBase
• Many different data sources (e.g. FAO, Eurostat), all in the same user interface (SDMX, NetCDF)
• Find data alternatives using meta-information
• Search data content (e.g. oilseed)
• All content easily available in research software
• Recodings, aggregations and concordances are all implemented in GAMS
• Statistical methods in GAMS and R
• Versioning: Eurostat (monthly), FAO (twice per year)
Example: http://www.agrimatie.nl/

Always play with your data and communicate
Wishes, problems, requests: Wietse.Dol@wur.nl
