Curation: making data suitable for re-use


Curation: making data suitable for re-use. Chris Rusbridge. Presentation at FIBS Seminar. Contents: science and digital curation; what to do with your data: frontiers of practice; repository frontiers. Digital Curation Centre mission.



Chris Rusbridge

Presentation at FIBS Seminar

Contents
  • Science and digital curation
  • What to do with your data: frontiers of practice
  • Repository frontiers
Digital Curation Centre Mission

“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”

[Figure: sky survey images, 2MASS (infrared) and SDSS (visual). Slide from Rajendra Bose]

New discovery…
  • National Virtual Observatory
    • Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
Curation
  • Data increasingly important as evidence
    • Key part of the scholarly record
    • Experimental verifiability (the basis of science)
    • Allows additional interpretations
    • Unrepeatable observations & experiments (particularly environmental in broadest sense)
    • Legal, compliance & transactions
    • Cultural resources
What kinds of data?
  • Observations
    • e.g. UARS (Upper Atmosphere Research Satellite) Level 0: telemetry
    • UARS Level 1: measured physical parameters (post calibration?)
  • Derived data
    • UARS Level 2: calculated geophysical? profiles
    • UARS Level 3: gridded, interpolated?
  • Combined data
  • Crafted data
    • E.g. annotated gene/protein databases
  • Descriptive (meta)data
What to do with it?
  • Keep as part of experiment
  • Deposit in institutional or discipline repository
    • Possible time-limited embargoes
  • Cite it
  • “Publish” in support of articles
What are the reusability issues?
  • Data not neutral to hypothesis
  • Hard to know the risks & pitfalls of a particular dataset
  • Data not self-describing: hard to find appropriate data
  • Hard to “understand” data once found
  • Hard to use data once understood
What to do about it?
  • Build curation/reusability into your workflow
    • Curation begins before creation
    • What’s easy at first becomes (impossibly) hard later
    • Describe your data (metadata)
    • Keep experimental parameters (technical, who, what, when, where etc)
    • Keep data descriptions (schemas, “representation information”, etc)
    • Keep data!
  • Use standard/agreed formats for data
  • Make ownership & restrictions clear
  • Explain how to cite your data
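
As a minimal sketch of what "describe your data" and "explain how to cite your data" could look like in practice, the Python below builds a small metadata record to store alongside a data file. The field names, the example values and the describe_dataset helper are illustrative assumptions, not a metadata standard and not anything prescribed in the talk.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_dataset(data: bytes, *, creator: str, instrument: str,
                     data_format: str, licence: str, citation: str) -> dict:
    """Build a small descriptive record for a data file (hypothetical fields)."""
    return {
        # Checksum so a later reader can verify the bytes are unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        # Experimental parameters: who, with what, in which format.
        "creator": creator,
        "instrument": instrument,
        "format": data_format,
        # Ownership, restrictions and citation stated up front.
        "licence": licence,
        "how_to_cite": citation,
        "described_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up example data and values, for illustration only.
raw = b"time_s,temperature_K\n0,271.3\n1,271.9\n"
record = describe_dataset(
    raw,
    creator="A. Researcher",
    instrument="example thermometer",
    data_format="text/csv",
    licence="CC BY 4.0",
    citation="A. Researcher (2006). Example temperature series.",
)
sidecar = json.dumps(record, indent=2, sort_keys=True)
print(sidecar)
```

Writing a record like this at creation time is cheap; reconstructing it years later, once the individual who "knows" the data has moved on, is the part that becomes hard.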
Data resource stages
  • Curated data is created…
    • Observations? Fixed!
  • Or Acquired…
    • Data brought/bought from outside
    • Ingest
  • Development
    • Derived, refined, combined, processed data
    • Potentially many stages
Context
  • Data meaningless without context
    • Linkage
    • Metadata of many kinds
    • Workflow!
  • Provenance
    • Authenticity
    • Computational lineage
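
One way to picture computational lineage is a derived dataset that carries a log of every processing step that produced it. The sketch below is an assumption-laden illustration: the Dataset class, the derive helper and the two toy processing steps are hypothetical names, and hashing repr() of a list is a stand-in for a real fixity check.

```python
import hashlib
from dataclasses import dataclass, field


def digest(values) -> str:
    """Hash a value's repr; a stand-in for a real checksum of stored data."""
    return hashlib.sha256(repr(values).encode("utf-8")).hexdigest()


@dataclass
class Dataset:
    values: list
    lineage: list = field(default_factory=list)  # one entry per processing step


def derive(source: Dataset, step_name: str, func, **params) -> Dataset:
    """Apply func to source and record what was done, to what, and with which parameters."""
    result = func(source.values, **params)
    entry = {
        "step": step_name,
        "params": params,
        "input_sha256": digest(source.values),
        "output_sha256": digest(result),
    }
    # The source is left untouched; the new dataset extends its lineage.
    return Dataset(values=result, lineage=source.lineage + [entry])


def calibrate(values, offset):
    return [v + offset for v in values]


def grid_mean(values, width):
    chunks = [values[i:i + width] for i in range(0, len(values), width)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


raw = Dataset([10, 12, 11, 13])  # observations: fixed
level1 = derive(raw, "calibrate", calibrate, offset=1)
level2 = derive(level1, "grid_mean", grid_mean, width=2)

for entry in level2.lineage:
    print(entry["step"], entry["params"])
```

Because each entry stores both input and output hashes, the chain from observation to derived product can be checked link by link.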
[Figure: diagram with nodes labelled NASA, University research group 1, University research group 2, research group 3 and local decision-making body. Slide from Rajendra Bose]

Access and re-use
  • Ethics and rights control access
    • Weak in expressing this long-term
  • Collaboration tools
    • Annotation, discussion, review
    • Re-use leading to change and development
  • “Publication”
    • Not just in “print”
    • Underlying data should be “published”, too
  • Citation…
Citation needs…
  • An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al.)
  • Not important for original observations
    • Don’t mess with those data
  • Less important for incremental datasets
    • Later stuff should not invalidate earlier
  • Very important for revisable datasets
    • E.g. Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
    • E.g. Mapping… OS maps represent a huge database that changes on a daily basis
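
To make the citation need concrete, here is a deliberately naive sketch in which every committed state of a revisable dataset is kept as a full snapshot and can be cited by version number and content hash. This is not the efficient archiving approach referred to above; the VersionedDataset class, its methods and the citation string format are hypothetical.

```python
import copy
import hashlib
import json


class VersionedDataset:
    """Keep every committed state so that a citation can name an exact version."""

    def __init__(self, name: str):
        self.name = name
        self._snapshots = []  # list of (sha256, frozen copy of the records)

    def commit(self, records: dict) -> int:
        frozen = copy.deepcopy(records)  # later edits must not alter this state
        canonical = json.dumps(frozen, sort_keys=True).encode("utf-8")
        self._snapshots.append((hashlib.sha256(canonical).hexdigest(), frozen))
        return len(self._snapshots)  # versions are numbered from 1

    def _entry(self, version: int):
        if not 1 <= version <= len(self._snapshots):
            raise KeyError(f"no such version: {version}")
        return self._snapshots[version - 1]

    def get(self, version: int) -> dict:
        return copy.deepcopy(self._entry(version)[1])

    def cite(self, version: int) -> str:
        sha, _ = self._entry(version)
        return f"{self.name}, version {version}, sha256:{sha[:12]}"


# Made-up annotation data, for illustration only.
annotations = {"geneA": "unknown function"}
dataset = VersionedDataset("Example gene annotations")
v1 = dataset.commit(annotations)

annotations["geneA"] = "kinase (revised by curator)"
v2 = dataset.commit(annotations)

print(dataset.cite(v1))
print(dataset.get(v1))  # the state an earlier paper relied on
print(dataset.get(v2))  # the current state
```

Full snapshots are simple but wasteful for a database that changes daily, which is why an efficient scheme matters most for revisable datasets.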
Curation: Individual
  • “Small science 2-3 times more data than Big science”, but much more at risk
  • PhD student? RA? PI? Administrator? IT support?
  • Data potentially on local hard drives, or at best shared network drives
    • May be inadequately protected
    • Liable to policy-led deletion on resignation
  • Individual “knows” too much
    • Documentation/metadata unlikely to be adequate
  • Future: gone!
Department: eCrystals
  • Partnership with Institutional Repository
  • Specialist department archive (& national service)
  • Workflow recording of lab parameters (R4L)
  • Public & private elements
  • Trying to build eCrystals federation (eBank 3)
  • Future: likely to continue
Institution: Cambridge Chemistry
  • 175,000 small molecule structures in CML
  • Alongside Archaeology, Manuscripts, Learning Materials, etc
  • No library curation skills; dependent on research group enthusiast
  • Collection isolated from other Chemistry
  • (Only 5 UK institutional repositories claim to hold data)
  • Future: assured…
Community: LOCKSS?
  • Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
  • Traditionally libraries collecting eJournals
  • Model respects IPR
  • No domain expertise; rely on origins
  • Data limitations…
  • Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)
Discipline: Atmospheric Science
  • Strong believer in need for domain scientists as curators
  • Significant participant in “community proxy” agenda-setting activities
  • Internationally fragmented resources
  • Future: mostly dependent on grant funding (but strong commitment)
Discipline: Pharmacology
  • International Scientific Union
  • Attempting to build credit for data contributions
  • Future: extremely limited funding
Discipline: Bio/Health
  • UK PubMedCentral!
    • (you heard about this earlier)
Issues: Nature article, 23 June 2005
  • Databases in Peril
    • 51 out of 89 biological databases contacted reported they were struggling financially
    • 7 have closed
    • Several being updated in owner’s spare time
    • (Notes that not all deserve long term support)
  • [Nucleic Acids Research reports 858 databases in 2006!]
  • Major issue: money
Publisher: Crystallography
  • Publisher and Scientific Union
  • Created key domain crystallographic standard (CIF)
  • Strong motivator for deposit of structure data
  • Consistent quality checks
  • DOIs used for structure data
  • Future: publishing business model

Slide from IUCr

National bodies: British Library
  • Serious and robust approach
  • Legal deposit powers & responsibilities as driver
  • Oriented primarily towards “cultural heritage” (broadly interpreted)
  • Little data, no science domain experience
  • Future: strong future commitment
National bodies: TNA/NDAD
  • Specialist archive for government datasets
  • Understand government regulations, dynamics & requirements
  • Subject generalists; disconnected from associated science
  • Technology specialists (understand databases)
  • Future: likely to pass eventually to The National Archives
3rd parties: Portico
  • Specific area: eJournals
  • Depends on publisher agreements
  • No data or domain science expertise
  • Future: commitment from Mellon + publishers + subscriptions, good funding mix
3rd parties: Iron Mountain?
  • Records management IS a curation problem
  • Organisations like this very likely to branch out
  • No domain science expertise
  • Future: business case, viability, stock market…
Institutions & the network
  • Institutions have fundamental sustainability
  • Disciplines have domain knowledge advantage but sustainability is an issue
  • Can we get the best of both?