Enhancing access to research data:

the e-Science project eBank UK

UKOLN is supported by:

2005-09-01 www.ukoln.ac.uk www.bath.ac.uk

A centre of expertise in digital information management

Enhancing access to research data: overview

  • E-Science: impact of digital technologies on research process

  • Scholarly knowledge cycle and publication bottleneck

  • eBank project: applying digital library techniques to support data curation in crystallography

  • Services, metadata, issues; phase 3

Changes in research process

  • Increasing data volumes from eScience / Grid-enabled / cyber-infrastructure applications, “big science”, data-driven science

  • Changing research methods: high-throughput technologies, automation, ‘smart labs’

  • Potential for re-use of data, new inter-disciplinary research

  • Different types of data: observational data, experimental data, computational data: different stewardship and long-term access requirements

Diversity of data collections

  • Very large, relatively homogeneous: Large Hadron Collider (LHC) outputs from CERN

  • Smaller, heterogeneous and richer collections: World Data Centre for Solar-terrestrial Physics, CCLRC

  • Small-scale laboratory results: “jumping robots” project at the University of Bath

  • Population survey data: UK Biobank

  • Highly sensitive, personal data: patient care records

Taxonomy of data collections

  • Research collections: jumping robots

  • Community collections: FlyBase at Indiana (with UC Berkeley)

  • Reference collections: Protein Data Bank

Source: NSF Long-Lived Digital Data Collections, draft report, March 2005

Repository evolution:

  • 1971: research collection, <12 files

  • 2005: reference collection, >2700 structures deposited in 6 months

1. Issues: research data as content

  • Sharing or not; Open Access to data?

  • Data diversity

    • Homo- or heterogeneous

    • Raw and derived / processed

    • Sensitivity

    • Fast or slow growth in volume

  • Repository evolution:

    • Likelihood to scale up (from bytes to petabytes)

    • Quality assurance (from the start)

    • Community-based standards development

    • Relationship between institutional and subject repositories

    • Build robust services

Diagram (labels only):

  • Research & e-Science workflows

  • Learning & Teaching workflows

  • Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media

  • Data analysis, transformation, mining, modelling

  • Learning object creation, re-use

  • Deposit / self-archiving

  • Repositories: institutional, e-prints, subject, data, learning objects

  • Searching, harvesting, embedding

  • Aggregator services: national, commercial

  • Resource discovery, linking, embedding

  • Presentation services: subject, media-specific, data, commercial portals

  • Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules

  • Peer-reviewed publications: journals, conference proceedings

  • Quality assurance bodies

Data Overload!

EPSRC National Crystallography Service

How do we disseminate?

The data deluge: crystallography

Data overload & the publication bottleneck

Current Publishing Process

  • Journal articles: aims, ideas, context, conclusions – only most significant data

  • Raw & underlying data required by peers not readily available

Context: existing data repositories

  • National data archives:

    • UK Data Archive, Arts and Humanities Data Service, US National Archives and Records Administration (NARA), Atlas Datastore

  • Discipline specific archives:

    • GenBank, Protein Data Bank

  • Crystallography archives

    • Cambridge Crystallographic Data Centre (Cambridge Structural Database), Indiana University Molecular Structure Center (Crystal Data Server, Reciprocal Net), FIZ Karlsruhe (inorganic crystals), Toth Information Systems (CRYSTMET)

  • Journals require deposit of data to support articles

    • Typically deposit of summary data… partial coverage

eBank UK project overview

  • JISC funded in 2003, now in Phase 2 to 2006

  • Joint effort between crystallographers, computer scientists, digital library researchers

  • Investigating contribution of existing digital library technologies to enable ‘publication at source’

  • Partners have interest in dissemination of chemistry research data, open access, OAI, institutional repositories

http://www.ukoln.ac.uk/projects/ebank-uk/

eBank project team

University of Bath, UKOLN (lead)

  • Monica Duke, Rachel Heery, Traugott Koch, Liz Lyon

University of Southampton, School of Chemistry

  • Simon Coles, Jeremy Frey, Mike Hursthouse

University of Southampton, School of Electronics and Computer Science

  • Leslie Carr, Chris Gutteridge

University of Manchester, PSIgate (physical sciences portal in RDN)

  • John Blunden-Ellis

eBank phase one: achievements

  • Gathered requirements from crystallographers

  • Established pilot institutional repository for crystallography data at Southampton with web interface

  • Developed a demonstrator aggregator service at UKOLN (CCDC exploring aggregation service)

  • Developed appropriate schema

  • Demonstrated a search interface as an embedded service at PSIgate portal

  • Demonstrated an added value service linking research data to papers (one-off)

Institutional repositories… publication at source

  • Institution establishes repository(s)

  • Institution pro-actively supports deposit process

  • OAI provides basis for interoperability

  • Potential for added value services

  • And/or… international subject-based archives?

Crystallography good fit…

  • Crystallography has well defined data creation workflow

  • Tradition of sharing using standard file format

  • Crystallographic Information File (CIF)

  • What about other chemistry sub-disciplines? other scientific disciplines?

eBank: UK e-Science testbed ‘Comb-e-Chem’

  • Grid-enabled combinatorial chemistry

  • Crystallography, laser and surface chemistry examples

  • Development of an e-Lab using pervasive computing technology

  • National Crystallography Service at Southampton

Comb-e-Chem Project

Grid Middleware

Crystallography workflow

RAW DATA

  • Initialisation: mount new sample on diffractometer & set up data collection

  • Collection: collect data

  • Processing: process and correct images

  • Solution: solve structures

  • Refinement: refine structure

  • CIF: produce CIF (Crystallographic Information File)

  • Validation: chemical & crystallographic checks
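The seven stages above form a strict sequence, which can be sketched as an ordered enumeration. This is only an illustration of the ordering on the slide; the names `Stage`, `next_stage` and `stages_completed` are assumptions and are not taken from the eBank software.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Workflow stages, numbered in the order listed on the slide."""
    INITIALISATION = 1  # mount sample on diffractometer, set up collection
    COLLECTION = 2      # collect data
    PROCESSING = 3      # process and correct images
    SOLUTION = 4        # solve structures
    REFINEMENT = 5      # refine structure
    CIF = 6             # produce the Crystallographic Information File
    VALIDATION = 7      # chemical and crystallographic checks


def next_stage(current):
    """Return the stage after `current`, or None once validation is reached."""
    if current == Stage.VALIDATION:
        return None
    return Stage(current + 1)


def stages_completed(current):
    """Return every stage up to and including `current`, in workflow order."""
    return [stage for stage in Stage if stage <= current]
```

A repository could, for example, record the stage each data file belongs to and use `stages_completed` to show how far an experiment has progressed.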

Data Collection

Diagram labels: Setup via GUI; Sample Tray; BruNo Mount; BruNo Unmount; Unit Cell; Data Collection; Data Process; System Y

Data Flow in eBank UK

Diagram labels: Data files; Institutional repository; HTML; Harvest (XML); eBank aggregator; Index and Search

Southampton digital repository


Access to ALL underlying data

Harvesting: OAIster

OAI-PMH: harvesting and aggregating

eBank aggregator at UKOLN


Demonstrating potential for linking between data and journal article
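For readers unfamiliar with OAI-PMH harvesting, a minimal sketch of what an aggregator does: build a `ListRecords` request and read the record identifiers out of the XML response. The verb, the `metadataPrefix` and `resumptionToken` arguments, and the response namespace come from the OAI-PMH 2.0 protocol. The base URL used in the usage note is a placeholder, and the function names are assumptions, not the eBank aggregator's code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint.

    When continuing a partial list, resumptionToken is an exclusive argument,
    so metadataPrefix is left out of the follow-up request.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifier of every record in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:record/oai:header/oai:identifier"
    return [element.text for element in root.findall(path, OAI_NS)]
```

For example, `list_records_url("http://repository.example.org/oai", "ebank_dc")` would request records in the `ebank_dc` format named in the eBank data model; fetching the URL and handling errors are left out of this sketch.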

Embedded search service at PSIgate

PSIgate subject


service provider

Schema for records made available for harvesting

  • Data holding (collection of files associated with experiment)

    • Qualified Dublin Core data elements plus additional chemical properties

      • Chemical formula

      • International Chemical Identifier (InChI)

      • Compound Class

  • Individual data files

    • Separate records for stage status of each file

  • Description set wrapped into one XML record using METS

  • Research metadata/data as a complex object
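A rough sketch of what a data holding record with Dublin Core elements plus chemical properties might look like when built in code. The Dublin Core namespace is the standard one. The `EBANK` namespace URI, the element names `chemicalFormula`, `inchi` and `compoundClass`, and the sample values are all placeholders invented for illustration; the actual eBank schema uses Qualified Dublin Core and METS packaging, neither of which is reproduced here.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
EBANK = "http://example.org/ebank_dc/"    # hypothetical namespace, not the real schema


def data_holding_record(title, creator, formula, inchi, compound_class):
    """Build one XML record describing a data holding.

    Dublin Core carries the general description; the chemical properties
    listed on the slide go into (hypothetically named) extra elements.
    """
    record = ET.Element(f"{{{EBANK}}}record")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{EBANK}}}chemicalFormula").text = formula
    ET.SubElement(record, f"{{{EBANK}}}inchi").text = inchi
    ET.SubElement(record, f"{{{EBANK}}}compoundClass").text = compound_class
    return record


# Placeholder values only; not a record from the Southampton repository.
example = data_holding_record(
    title="Example crystal structure",
    creator="A. Chemist",
    formula="H2O",
    inchi="InChI=1S/H2O/h1H2",
    compound_class="inorganic",
)
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping the chemical properties in their own namespace means a harvester that only understands Dublin Core can still read the general description.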

eBank data model

Diagram labels:

  • Dataset; Crystal structure (data holding); Crystal structure report (HTML); ebank_dc record (XML)

  • Eprint “jump-off” page (HTML); Eprint manifestation (e.g. PDF); Eprint oai_dc record (XML); dc:type=“Eprint” and/or “Text”

  • Institutional repositories; eBank UK aggregator service; ePrint UK aggregator service; Other aggregators and services

  • Harvesting OAI-PMH; Harvesting OAI-PMH oai_dc; Harvesting OAI-PMH oai_dc, ebank_dc

Model input Andy Powell, UKOLN.

Creating the metadata

  • Potential to embed ‘deposit and disseminate’ into workflow of chemist in automated way

eBank phase two work areas

  • Sub-disciplines of chemistry, earth sciences, engineering

  • Pursue generic data model

  • Use of identifiers for citing datasets

  • Subject approach to discovering research data (keywords, classification, ontology)

  • Access to research data in teaching and learning context

  • Liaise with other digital repository initiatives

Related UK projects

  • National e-Science Centre (NeSC)

  • NERC Data Grid (Atmospheric and Oceanographic Data Centres)

  • JISC Digital Repositories Programme:

    • Spectra (experimental chemistry, high volume ingestion)

    • R4L (lab equipment, metadata generation)

    • CLADDIER (citation, identifiers, linking)

    • StORe (data and publication repository links)

    • GRADE (reuse of geospatial data)

2. Issues: generic data models, metadata schema & terminology

  • Validation against generic schema

    • CCLRC Scientific Data Model Version 2

  • Complex digital objects and packaging options

    • METS

    • MPEG-21 DIDL

  • Terminologies

    • Domain: crystallography

    • Inter-disciplinary e.g. biomaterials

    • Metadata enhancement: subject keyword additions to datasets based on related publications

    • Meaningful resource discovery?

3. Issues: linking

  • Links to individual datasets within an experiment

  • Links to all datasets associated with an experiment or a data collection

  • Links to derived eprints and published literature

  • Context-sensitive linking: find me

    • Datasets by this author / creator

    • Datasets related to this subject

    • Learning objects by this author / creator

    • Learning objects related to this subject

  • Identifiers and persistence

    • “generic”

    • domain: International Chemical Identifier (InChI code)

  • Resource discovery: Google Scholar?

  • Provenance: authenticity, authority, integrity?

4. Issues: identifiers

  • Identifiers and persistence

    • “generic”: DOI, PURL, Handle, ARK

    • domain: International Chemical Identifier (InChI)

    • Resolution; lookup

  • Resource discovery: Google Scholar?

  • Granularity (metadata, linking)?

  • Provenance: authenticity, authority, integrity?
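To illustrate the "resolution; lookup" point for the generic identifier schemes named above, a small sketch that turns an identifier into a URL a client could follow. The resolver hosts shown (doi.org, hdl.handle.net, n2t.net) are today's public resolvers and are an assumption relative to the 2005 slides; the function name and the sample identifier in the usage note are hypothetical.

```python
# Public resolver prefixes for the generic schemes listed on the slide.
RESOLVERS = {
    "doi": "https://doi.org/",
    "handle": "https://hdl.handle.net/",
    "ark": "https://n2t.net/",
}


def resolver_url(scheme, identifier):
    """Return a URL that resolves `identifier` under the given scheme.

    A PURL is already a URL, so it is returned unchanged. Schemes without
    a generic resolver in the table (e.g. InChI) raise ValueError.
    """
    scheme = scheme.lower()
    if scheme == "purl":
        return identifier
    if scheme not in RESOLVERS:
        raise ValueError(f"no generic resolver known for scheme: {scheme}")
    return RESOLVERS[scheme] + identifier
```

For instance, `resolver_url("doi", "10.1234/example")` gives `https://doi.org/10.1234/example`. A domain identifier such as InChI describes the compound rather than a location, so finding datasets by InChI needs a lookup service instead of a resolver.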

5. Issues: embedding and workflow

  • Into the crystallographic publishing community: International Union of Crystallography

  • Into the chemistry research workflow

    • Smart Tea digital lab book, e-synthesis lab

    • Other analytical techniques and instrumentation

  • Into the curriculum and e-Learning workflows

    • MChem course

    • Undergraduate Chemical Informatics courses

For the future…

  • Who provides added value services?

    • Authority files, automated subject indexing, annotation, data mining, visualisation

  • What are the preservation issues?

    • UK Digital Curation Centre http://www.dcc.ac.uk

    • National Science Board Draft report on long-lived data collections http://www.nsf.gov/nsb/meetings/2005/LLDDC_draftreport.pdf

  • How to manage complex object descriptions within OAI?

  • Digital curation of research data presents new roles for scientists, computer scientists, data managers…

Repositories and digital curation

For later use? In use now (and the future)?



Data preservation

Data curation

“maintaining and adding value to a trusted body of digital information for current and future use”

Provide value-added services

  • Annotation

    • e-Lab books (Smart Tea Project in chemistry)

    • Gene and protein sequences

Enable “post-processing” and knowledge extraction

  • The acquisition of newly-derived information and knowledge from repository content

    • Run complex algorithms over primary datasets

    • Mining (data, text, structures)

    • Modelling (economic, climate, mathematical, biological)

    • Analysis (statistical, lexical, pattern matching, gene)

    • Presentation (visualisation, rendering)

6. Issues: “knowledge services”

  • Layered over repositories

    • Annotation

    • Mining, modelling, analysis

    • Visualisation

  • Across multiple repositories

    • Grid enabled applications

    • Highly distributed, dynamic and collaborative

  • Associated with curatorial responsibility

    • UK Digital Curation Centre http://www.dcc.ac.uk

Issues summary

  • Research data is diverse, increasing rapidly in volume and complexity

  • Repository collections are dynamic and evolve

  • Technical challenges associated with interoperability, persistence, provenance, resource discovery and infrastructure provision

  • Embedding in workflow is critical: scholarly communications, research practice, learning

  • Knowledge extraction tools will generate new discoveries based on repository content

  • Repository solutions must scale: M2M processing will become the norm

Project homepage: http://www.ukoln.ac.uk/projects/ebank-uk/

Duke, M. et al: Enhancing access to research data: the challenge of crystallography. JCDL 2005. http://www.ukoln.ac.uk/projects/ebank-uk/dissemination/jcdl2005/preprint.pdf

Acknowledgement to all project partners for their contributions to this presentation.