Open data as the engine of the “scientific revolution”

Science as an Open Enterprise: Open Data for Open Science
Professor Brian Collins CB, FREng
UCL, June 2012
Emerging conclusions from a Royal Society Policy Report

Open data as the engine of the “scientific revolution”
Publish scientific theories – and the experimental and observational data on which they are based – to permit others to scrutinise them, to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge.
Henry Oldenburg

Why is “open data” a big current issue?
The data deluge from powerful acquisition tools, coupled with powerful tools for storing, manipulating, analysing, displaying and transmitting data, and citizens’ interest in scrutinising scientific claims, have created new challenges and new opportunities that require new forms of openness and novel social dynamics in science.

Challenges
  Maintaining scientific self-correction (closing the concept-data gap)
  Responding to citizens’ demands for evidence in “public interest science”
Opportunities
  Exploiting data-intensive science – a 4th paradigm?
  The potential of linked data
  “Data is the new raw material for business”
  Exposing malpractice and fraud
  Stimulating citizen science
Aspiration: all scientific literature online, all data online, and for them to interoperate

Openness of data per se has no value. Open science is more than disclosure.
For effective communication, we need intelligent openness. Data must be:
  Accessible
  Intelligible
  Assessable
  Re-usable
Only when these four criteria are fulfilled are data properly open.
Metadata must be audience-sensitive.
Scientific data rarely fits neatly into an Excel spreadsheet!
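The four criteria above can be read as an all-or-nothing checklist. The following is a minimal sketch of that reading, not anything taken from the report: the class name `OpennessCheck` and its methods are hypothetical, and judging each criterion as a plain yes/no is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class OpennessCheck:
    """Hypothetical yes/no record of the four 'intelligent openness' criteria."""
    accessible: bool    # can the data be found and obtained?
    intelligible: bool  # can the intended audience understand it?
    assessable: bool    # can its reliability be judged?
    reusable: bool      # can others actually put it to use?

    def is_intelligently_open(self) -> bool:
        # Data count as properly open only when all four criteria hold.
        return all((self.accessible, self.intelligible,
                    self.assessable, self.reusable))

    def failed_criteria(self) -> list:
        # Names of the criteria that are not met, in declaration order.
        return [name for name, ok in vars(self).items() if not ok]


# Example: data that has merely been disclosed, with no usable metadata.
disclosed_only = OpennessCheck(accessible=True, intelligible=False,
                               assessable=False, reusable=True)
print(disclosed_only.is_intelligently_open())
print(disclosed_only.failed_criteria())
```

In this sketch, disclosure alone fails the test because the dataset is neither intelligible nor assessable, which is the slide's point that open science is more than disclosure.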

Boundaries of openness?
  Legitimate commercial interests
  Privacy (complete anonymisation is impossible)
  Safety & Security
But the boundaries are fuzzy & complex

Benefits/costs of open data to the science process
Pathfinder disciplines where benefit is recognised and habits are changing
  Bioinformatics (-omics disciplines)
  Biological science
  Particle physics
  Nanotechnology
  Environmental science
  Longitudinal societal data
  Astronomy & space science
  e.g. Gene Expression Omnibus (GEO) – 2700 GEO uploads by non-contributors in 2000 led to 1150 papers (>1000 additional papers over the 16 that would be expected from an investment of $400,000)
Costs
  Tier 1 – International databases – e.g. Worldwide Protein Databank: >65 staff; $6.5M pa; 1% of cost of collecting data
  Tier 3 – Institutional data management – UK 2011, average UK university repository – 1.36 FTE (managerial, administrative, technical)
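The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch using the numbers as they appear on the slide. The variable names are illustrative, and reading "1% of cost of collecting data" as meaning that the data collection cost is 100 times the curation budget is an assumption.

```python
# Figures as quoted on the slide (GEO example).
geo_papers = 1150          # papers arising from reuse of GEO data
expected_papers = 16       # papers expected from the same investment
investment_usd = 400_000   # investment quoted on the slide

additional_papers = geo_papers - expected_papers
reuse_multiplier = geo_papers / expected_papers
cost_per_expected_paper_usd = investment_usd / expected_papers

# Worldwide Protein Databank: curation budget stated as 1% of the cost
# of collecting the data (assumed here to mean collection = 100 x curation).
wwpdb_annual_cost_usd = 6_500_000
curation_percent = 1
implied_collection_cost_usd = wwpdb_annual_cost_usd * 100 // curation_percent

print(f"Additional papers: {additional_papers}")
print(f"Reuse multiplier: {reuse_multiplier:.1f}x")
print(f"Cost per expected paper: ${cost_per_expected_paper_usd:,.0f}")
print(f"Implied data collection cost: ${implied_collection_cost_usd:,}")
```

On these numbers the "additional papers" figure works out to 1134, consistent with the slide's ">1000", and curation appears as a small fraction of what it costs to generate the data in the first place.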

Levels of data curation
  Tier 1 – International databases
  Tier 2 – National (e.g. Research Councils)
  Tier 3 – Institutions (Universities & Institutes)
  Tier 4 – “Small science” researchers & research groups
Upward data migration
Data loss
Financial sustainability?

Priorities for action - 1
  Change the mindset: publicly funded data is a public resource
  Credit for useful data and productive, novel collaboration (the Tim Gowers phenomenon)
  Mandatory access to data underlying publications
  Common standards for communicating data
  Sustainability (the power needs of current modes of data storage will outstrip the global electricity supply within the decade)

Priorities for action - 2
  R&D on software tools (enabling dynamic data; managing the data lifecycle; tracking provenance, citation, indexing and searching, standards & interoperability, sustainability). Note that the ICT industry is often way ahead, and the US prioritises investment here.
  Institutional responsibility for the knowledge they create (cumulative small science data > cumulative big science data)
  Data scientists (they are being trained, and the commercial demand is large)
  “Big Iron” is a national infrastructure priority
  “Big data” is a science priority – the big costs are people and software, not computers

Targets for recommendations
  Scientists – changing cultural assumptions
  Employers (universities/institutes) – data responsibilities; crediting researchers
  Funders of research – the cost of curation is a cost of research
  Learned societies – influencing their communities
  Publishers of research – mandatory open data
  Business – exploiting the opportunity; awareness & skills
  Government – efficiency of the science base; exploiting its data
  Governance processes for privacy, safety, security – proportionality
