Curation: making data suitable for re-use

Chris Rusbridge

Presentation at FIBS Seminar

Contents

  • Science and digital curation

  • What to do with your data: frontiers of practice

  • Repository frontiers

Digital Curation Centre Mission

“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”


2MASS (Infrared)

SDSS (Visual)

Slide from Rajendra Bose

New discovery…

  • National Virtual Observatory

    • Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
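
The kind of search described in the press release can be pictured as a positional cross-match between two catalogues. The following is a minimal illustration only and not the NVO's actual method: the catalogue entries, the field names, and the 2-arcsecond matching radius are all assumptions made up for the example.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for very small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600

def cross_match(catalogue_a, catalogue_b, radius_arcsec=2.0):
    """Pair each object in catalogue_a with its nearest neighbour in catalogue_b,
    provided that neighbour lies within radius_arcsec."""
    matches = []
    for a in catalogue_a:
        best = None
        for b in catalogue_b:
            sep = angular_separation_arcsec(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (b, sep)
        if best is not None:
            matches.append((a["id"], best[0]["id"], best[1]))
    return matches

# Hypothetical catalogue entries (positions in degrees).
infrared = [
    {"id": "IR-1", "ra": 150.0000, "dec": 2.0000},
    {"id": "IR-2", "ra": 151.0000, "dec": 3.0000},
]
optical = [
    {"id": "OPT-9", "ra": 150.0002, "dec": 2.0001},
    {"id": "OPT-10", "ra": 200.0000, "dec": -5.0000},
]

pairs = cross_match(infrared, optical)
for ir_id, opt_id, sep in pairs:
    print(f"{ir_id} matches {opt_id} at {sep:.2f} arcsec")
```

The nested loop is chosen for readability; catalogues with millions of objects would need a spatial index rather than an all-pairs comparison.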


  • Data increasingly important as evidence

    • Key part of the scholarly record

    • Experimental verifiability (the basis of science)

    • Allows additional interpretations

    • Unrepeatable observations & experiments (particularly environmental in broadest sense)

    • Legal, compliance & transactions

    • Cultural resources

What kinds of data?

  • Observations

    • Eg UARS (Upper Atmosphere Research Satellite) Level 0: telemetry

    • UARS Level 1: measured physical parameters (post calibration?)

  • Derived data

    • UARS Level 2: calculated geophysical? profiles

    • UARS Level 3: gridded, interpolated?

  • Combined data

  • Crafted data

    • Eg annotated gene/protein databases

  • Descriptive (meta)data

What to do with it?

  • Keep as part of experiment

  • Deposit in institutional or discipline repository

    • Possible time-limited embargos

  • Cite it

  • “Publish” in support of articles

Internet Archaeology: publication with data (sadly, a preservation nightmare!)

What are the reusability issues?

  • Data not neutral to hypothesis

  • Hard to know the risks & pitfalls of a particular dataset

  • Data not self-describing: hard to find appropriate data

  • Hard to “understand” data once found

  • Hard to use data once understood

What to do about it?

  • Build curation/reusability into your workflow

    • Curation begins before creation

    • What’s easy at first becomes (impossibly) hard later

    • Describe your data (metadata)

    • Keep experimental parameters (technical, who, what, when, where etc)

    • Keep data descriptions (schemas, “representation information”, etc)

    • Keep data!

  • Use standard/agreed formats for data

  • Make ownership & restrictions clear

  • Explain how to cite your data
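
As one illustration of the points above, a descriptive record can be written alongside each data file at the moment it is created. This is a minimal sketch under assumed conventions: the field names, the JSON sidecar layout, the `describe_dataset` helper, and the example values are hypothetical and do not follow any particular metadata standard.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def describe_dataset(data_path, creator, instrument, location, licence, citation):
    """Write a JSON sidecar describing a data file and return the sidecar path."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        # Checksum lets a later reader confirm the file is unchanged.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "size_bytes": data_path.stat().st_size,
        "described_at": datetime.now(timezone.utc).isoformat(),  # when
        "creator": creator,            # who
        "instrument": instrument,      # what / technical parameters
        "location": location,          # where
        "licence": licence,            # ownership and restrictions
        "how_to_cite": citation,       # preferred citation text
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar

# Usage with a throwaway file; all values below are invented for the example.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "run42.csv"
data_file.write_text("time_s,temperature_k\n0,293.1\n1,293.4\n", encoding="utf-8")

sidecar_path = describe_dataset(
    data_file,
    creator="A. Researcher",
    instrument="Example thermometer, 1 Hz sampling",
    location="Example Lab, Room 1",
    licence="Contact the creator before re-use",
    citation="A. Researcher (2006). Run 42 temperature log.",
)
metadata = json.loads(sidecar_path.read_text(encoding="utf-8"))
print(metadata["file"], metadata["sha256"][:8])
```

Writing the record when the data is created is the design point here: the who, what, when, and where are cheap to capture at that moment and hard to reconstruct later.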

Data resource stages

  • Curated data is created…

    • Observations? Fixed!

  • Or Acquired…

    • Data brought/bought from outside

    • Ingest

  • Development

    • Derived, refined, combined, processed data

    • Potentially many stages


  • Data meaningless without context

    • Linkage

    • Metadata of many kinds

    • Workflow!

  • Provenance

    • Authenticity

    • Computational lineage
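
Computational lineage can be captured by logging, for each processing step, what went in, what came out, and which parameters were used. A minimal sketch follows, assuming a hypothetical `LineageLog` class and made-up processing steps; real provenance systems record considerably more.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable SHA-256 fingerprint of JSON-serialisable data."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

class LineageLog:
    """Records each processing step applied to a dataset."""

    def __init__(self):
        self.steps = []

    def apply(self, name, func, data, **params):
        """Run func(data, **params) and record the step with input/output fingerprints."""
        result = func(data, **params)
        self.steps.append({
            "step": name,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        return result

    def is_unbroken(self):
        """True if every step's input is exactly the previous step's output."""
        return all(prev["output"] == cur["input"]
                   for prev, cur in zip(self.steps, self.steps[1:]))

# Hypothetical processing steps.
def calibrate(values, offset):
    return [round(v + offset, 3) for v in values]

def average(values):
    return [round(sum(values) / len(values), 3)]

log = LineageLog()
raw = [1.0, 2.0, 3.0]
calibrated = log.apply("calibrate", calibrate, raw, offset=0.5)
summary = log.apply("average", average, calibrated)

print(summary, log.is_unbroken())
```

Because each step stores fingerprints rather than the data itself, the log stays small while still showing whether a derived result can be traced back to the original observations.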

[Diagram: University research group 1, University research group 2, research group 3, and a local decision-making body]

Slide from Rajendra Bose

Access and re-use

  • Ethics and rights control access

    • Weak in expressing this long-term

  • Collaboration tools

    • Annotation, discussion, review

    • Re-use leading to change and development

  • “Publication”

    • Not just in “print”

    • Underlying data should be “published”, too

  • Citation…

Citation needs…

  • An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)

  • Not important for original observations

    • Don’t mess with those data

  • Less important for incremental datasets

    • Later stuff should not invalidate earlier

  • Very important for revisable datasets

    • Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change

    • Eg Mapping… OS maps represent a huge database that changes on a daily basis
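
One way to make past states of a revisable dataset citable is to freeze a snapshot at each release and cite it by an identifier derived from its content. This is only a sketch of that general idea and not the approach of Buneman et al, which the slide mentions as work in progress; the `VersionedDataset` class, the gene records, and the citation format are hypothetical.

```python
import copy
import hashlib
import json

class VersionedDataset:
    """A changing dataset whose past states can be frozen, retrieved, and cited."""

    def __init__(self, name):
        self.name = name
        self.current = {}
        self._snapshots = {}  # version id -> frozen copy of the data

    def update(self, key, value):
        self.current[key] = value

    def snapshot(self):
        """Freeze the current state and return a content-derived version id."""
        payload = json.dumps(self.current, sort_keys=True).encode("utf-8")
        version_id = hashlib.sha256(payload).hexdigest()[:12]
        self._snapshots[version_id] = copy.deepcopy(self.current)
        return version_id

    def retrieve(self, version_id):
        """Return the dataset exactly as it was when the snapshot was taken."""
        return copy.deepcopy(self._snapshots[version_id])

    def cite(self, version_id):
        """Build a citation string that pins a specific archived state."""
        if version_id not in self._snapshots:
            raise KeyError(f"unknown version: {version_id}")
        return f"{self.name}, version {version_id}"

# Invented example of a curated annotation that is later revised.
genes = VersionedDataset("Example Gene Annotations")
genes.update("geneA", "putative kinase")
v1 = genes.snapshot()

genes.update("geneA", "confirmed kinase")
v2 = genes.snapshot()

print(genes.cite(v1))
print(genes.retrieve(v1)["geneA"], "->", genes.retrieve(v2)["geneA"])
```

Storing a full copy per snapshot is simple but wasteful for large datasets that change daily, which is why an efficient scheme is listed as a need.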

Curation: Individual

  • “Small science 2-3 times more data than Big science”, but much more at risk

  • PhD student? RA? PI? Administrator? IT support?

  • Data potentially on local hard drives, or at best shared network drives

    • May be inadequately protected

    • Liable to policy-led deletion on resignation

  • Individual “knows” too much

    • Documentation/metadata unlikely to be adequate

  • Future: gone!

Department: eCrystals

  • Partnership with Institutional Repository

  • Specialist department archive (& national service)

  • Workflow recording of lab parameters (R4L)

  • Public & private elements

  • Trying to build eCrystals federation (eBank 3)

  • Future: likely to continue

Institution: Cambridge Chemistry

  • 175,000 small molecule structures in CML

  • Alongside Archaeology, Manuscripts, Learning Materials, etc

  • No library curation skills; dependent on research group enthusiast

  • Collection isolated from other Chemistry

  • (Only 5 UK institutional repositories claim to hold data)

  • Future: assured…

Community: LOCKSS?

  • Self-selected group of collectors: closest to genuine open activity (despite Alliance)?

  • Traditionally libraries collecting eJournals

  • Model respects IPR

  • No domain expertise; rely on origins

  • Data limitations…

  • Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
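
The persistence claimed for a distributed model rests on comparing copies held by independent peers and repairing those that disagree. The sketch below illustrates that general idea with a simple majority vote over content hashes; it is not the LOCKSS protocol itself, and the peer names and the `audit_and_repair` function are hypothetical.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    """SHA-256 hex digest of a stored copy."""
    return hashlib.sha256(content).hexdigest()

def audit_and_repair(replicas):
    """Compare copies held by peers and overwrite minority copies with the majority copy.

    replicas maps a peer name to the bytes that peer holds; it is modified in place.
    Returns the list of peers whose copy was repaired.
    """
    votes = Counter(digest(content) for content in replicas.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(replicas) / 2:
        raise ValueError("no majority agreement; manual intervention needed")

    good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
    repaired = []
    for peer, content in list(replicas.items()):
        if digest(content) != winning_digest:
            replicas[peer] = good_copy
            repaired.append(peer)
    return repaired

# Invented peers; one copy has been silently corrupted.
replicas = {
    "library_a": b"article text",
    "library_b": b"article text",
    "library_c": b"article t3xt",
}
repaired = audit_and_repair(replicas)
print(repaired)
```

A plain majority vote is the simplest possible rule and is used here only to show why several independent copies make damage detectable and repairable.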

Discipline: Atmospheric Science

  • Strong believer in need for domain scientists as curators

  • Significant participant in “community proxy” agenda-setting activities

  • Internationally fragmented resources

  • Future: mostly dependent on grant funding (but strong commitment)

Discipline: Pharmacology

  • International Scientific Union

  • Attempting to build credit for data contributions

  • Future: extremely limited funding

Discipline: Bio/Health

  • UK PubMedCentral!

    • (you heard about this earlier)

Issues: Nature article 23 June 05

  • Databases in Peril

    • 51 out of 89 biological databases contacted reported they were struggling financially

    • 7 have closed

    • Several being updated in owner’s spare time

    • (Notes that not all deserve long term support)

  • [Nucleic Acids Research reports 858 databases in 2006!]

  • Major issue: money

Publisher: Crystallography

  • Publisher and Scientific Union

  • Created key domain crystallographic standard (CIF)

  • Strong motivator for deposit of structure data

  • Consistent quality checks

  • DOIs used for structure data

  • Future: publishing business model

Slide from IUCr

National bodies: British Library

  • Serious and robust approach

  • Legal deposit powers & responsibilities as driver

  • Oriented primarily towards “cultural heritage” (broadly interpreted)

  • Little data, no science domain experience

  • Future: strong future commitment

National bodies: TNA/NDAD

  • Specialist archive for government datasets

  • Understand government regulations, dynamics & requirements

  • Subject generalists; disconnected from associated science

  • Technology specialists (understand databases)

  • Future: likely to pass eventually to The National Archives

3rd parties: Portico

  • Specific area: eJournals

  • Depends on publisher agreements

  • No data or domain science expertise

  • Future: commitment from Mellon + publishers + subscriptions, good funding mix

3rd Parties: Iron Mountain?

  • Records management IS a curation problem

  • Organisations like this very likely to branch out

  • No domain science expertise

  • Future: business case, viability, stock market…

Institutions & the network

  • Institutions have fundamental sustainability

  • Disciplines have domain knowledge advantage but sustainability is an issue

  • Can we get the best of both?