Enhancing Scientific Data Publication and Integration

Outline • The nature of scientific data and image publication • What data do we actually publish? • Relationship between publications and databases • Improving journal authoring • Data lenses, semantic lenses and live journal content • Integrating distributed data • Data webs • ImageWeb • ImageBLAST • Preserving biological research images • The ImageStore Project

Characteristics of biological research data • Bottom-up data flow, lacking central control • Very large research community with diverse research topics • Highly distributed research activities and publication structures • Research data heterogeneous and largely unstructured, often with little by way of semantic mark-up • An open world, where change is as ubiquitous as consensus is elusive

Where to store research data? • Research results may represent ‘universal truths’, e.g. the sequence of a particular gene • These form bounded data sets • The data need only be discovered once • Such information is typically published in a large global bioinformatics database • Research data can also be ‘particulars’ rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos • These data form unbounded data sets • Data collection will never be complete • Such image information is not (yet) widely available on line • It is not appropriate to submit such data to centralized global databases • The data are too heterogeneous • Such activities would not scale

What data do we publish? • A scientific paper does not just report scientific observations • Rather, as Anita de Waard of Elsevier has pointed out, a scientific paper is an exercise in rhetoric, designed to convince readers of the truth of a particular scientific hypothesis or belief • The goal of the article is not to state facts, but rather to convince • Facts are selected to support the argument, and are embedded in a rhetorical structure with the purpose of conviction • “These observations support theories that defects of the muscle plasma membrane are important for dystrophic pathogenesis.”

. . . but what about the original research data? • While selected findings that support hypotheses appear in research articles, the majority of original research data are never published • Historically, in the paper age, there was no easy method for doing this • Journals had limited space • Other publication avenues were not available • Now, in this digital age, ‘supplementary information’ can be put on-line • However, this facility is not widely used • Furthermore, such supplementary data are usually poorly structured, with insufficient metadata, and may not be discoverable by external search engines • Depositing data as supplementary information may thus be consigning them to costly data graveyards, from which resurrection is difficult

How might we improve on this situation? (my take home messages !!) • We need to start treating experimental research data sets as first class publication objects, of equal value to the journal papers based upon them • We need to work towards better interoperability between papers and data • First, two examples of work in progress • Then my suggestions for new developments

Convergence between papers and databases • Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-director of the Protein Data Bank, wrote a stimulating paper: PLoS Comp. Biol. 2005 1(3) e34 • In this, he contends that the distinction between an on-line paper and a database is diminishing • He calls for “seamless integration” between papers reporting results and the data used to compute those results

Similar Processes Lead to Similar Resources • Journal paper: Author Submission via the Web → Syntax Checking → Review by Scientists & Editors → Corrections by Author → Publish – Web Accessible • Database: Depositor Submission via the Web → Syntax Checking → Review by Annotators → Corrections by Depositor → Release – Web Accessible • Credit: Philip Bourne

My critique of Philip Bourne’s ideas • I agree with his central analysis of the processes involved. However, this similarity of process should not blind us to essential differences in purpose • We must maintain a clear distinction between the journal publication • peer reviewed • a dated record of the authors’ view at the time of publication • while errata are permitted, the original version should be immutable • and the research database • should contain the most reliable up-to-date information • data quality is initially the responsibility of the depositor • errors subsequently discovered should be corrected by the curator • Thus “seamless integration” is not desirable • One needs to approach publications and data sets with different presuppositional spectacles – the first rhetorical, the other analytical • Researchers really want the “seams” to be very clear, not covered over

Improving the authoring process • Richard O’Beirne of Oxford Journals has stressed that, for publishers to enable their publications to be better used in the digital world, they need to expose metadata of a higher granularity, identifying component pieces of papers • For images, this means figures and their legends • Such mark-up is typically present during the production phases of a paper’s publication, usually in the form of XML, but is ‘lost’ upon publication as PDF • Such metadata needs exposure to facilitate interoperability with data resources • Anita de Waard of the Elsevier Advanced Technology Group is currently developing a system, in conjunction with the editors and authors of Cell, whereby the authors are enabled to create such mark-up while writing the paper • What we need is an easy-to-use plug-in for MS-Word, accepted by all leading publishers, for the creation of suitable text mark-up at the time of authoring

Live (or at least lively) journal content • The norm that the online version of a journal article is a PDF file is antithetical to the spirit of the Web, and ignores its great potential • PDF is an electronic embodiment of a static printed page • Rather, what we need are on-line journals that include tools to deliver renderable interactive views of otherwise static images, and interpretive ‘data lenses’ or ‘semantic lenses’ over published data, thereby enabling new levels of reader comprehension • Semantic lenses specifically provide viewpoints onto RDF data, presenting users with information from selected semantic perspectives • This will require Web delivery of information from multiple resources, involving proper integration of the published paper with research data archives
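One way to picture a semantic lens over RDF data is as a named selection of properties applied to the statements about a resource. The sketch below is only an illustration of that idea in plain Python: the `ex:` identifiers, the `LENSES` table and the `apply_lens` helper are all hypothetical, and a real lens would work over genuine RDF rather than tuples.

```python
# Sketch: a "semantic lens" modelled as a named set of predicates.
# Every identifier and value here is invented for illustration.

TRIPLES = [
    ("ex:fig1", "ex:depicts", "microvilli"),
    ("ex:fig1", "ex:technique", "electron microscopy"),
    ("ex:fig1", "ex:creator", "A. N. Author"),
]

LENSES = {
    "biological": {"ex:depicts", "ex:technique"},
    "provenance": {"ex:creator"},
}

def apply_lens(triples, subject, lens_name):
    """Return only the properties of `subject` that the chosen lens exposes."""
    wanted = LENSES[lens_name]
    return {p: o for s, p, o in triples if s == subject and p in wanted}

view = apply_lens(TRIPLES, "ex:fig1", "biological")
print(view)
```

Switching `lens_name` to `"provenance"` would present the same figure from a different perspective without changing the underlying data.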

A data lens showing tsunami damage

A data lens applying a high-pass filter

A data lens for image analysis • Electron micrograph showing cross sections of microvilli on the surface of intestinal epithelial cells

A live semantic lens demonstration http://www.cc.gatech.edu/gvu/ui/sub_arctic/sub_arctic/test/sem_lens_test.html

An example from a recent issue of Biochemistry

Report of a crystallographic structure

Figure 1 from the on-line version of the paper, showing the protein structure

The PDB entry for Polo-like Kinase 1 (PDB ID 2OU7)

Interactive Jmol representation of Polo-like Kinase 1 http://molvis.sdsc.edu/fgij/fg.htm?mol=2ou7

Another example, from The Plant Cell

All the images in the paper should be clickable videos! • Fusiform bodies within the ER network of Arabidopsis stem cortical cells http://www.brookes.ac.uk/schools/lifesci/research/molcell/hawes/gfpmoviepage.htm

Integrating distributed data • The problems of achieving semantic interoperability between distributed heterogeneous archives of digital data are well known • Previous approaches to solving the problem have involved • distributed query processing • repository federation, or • portals • All shared a reliance on mainstream technologies such as Z39.50, XML and Web Services, some of which might be considered dated or heavyweight • None has applied to the problems of data integration the Semantic Web and Web 2.0 approaches that I now wish to describe

Web and Semantic Web standards and tools • We favour the World Wide Web Consortium standards: • RDF as the standard format for sharable metadata • SPARQL as the universal query language for RDF • Software such as D2R Server for abstracting RDF from relational databases in response to SPARQL queries • OWL-DL as the standard web ontology language • and for software development and integration: • use of agile programming techniques • Ruby or Python to provide a lightweight development environment • loose coupling between the Model, View and Controller software components, based on a simple ‘RESTful’ approach to component integration (Fielding 2000, Representational State Transfer)
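To make the RDF and SPARQL items concrete, here is a minimal sketch of the underlying data model: metadata held as subject-predicate-object triples, and a query expressed as a triple pattern with wildcards. It deliberately uses plain Python rather than a real RDF library or SPARQL engine; the `ex:` identifiers and the `match` helper are assumptions made for illustration only.

```python
# Sketch: RDF-style triples and a single triple-pattern match.
# All identifiers (ex:img1, ex:depicts, ...) are invented for illustration.

TRIPLES = [
    ("ex:img1", "ex:depicts", "ex:microvilli"),
    ("ex:img1", "ex:technique", "ex:electron_microscopy"),
    ("ex:img2", "ex:depicts", "ex:trypanosome"),
    ("ex:img2", "ex:technique", "ex:electron_microscopy"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None is a wildcard, like a SPARQL variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Roughly: SELECT ?img WHERE { ?img ex:technique ex:electron_microscopy }
em_images = [s for s, _, _ in match(TRIPLES, p="ex:technique", o="ex:electron_microscopy")]
print(em_images)
```

A real SPARQL query joins several such patterns on shared variables; the single-pattern case shown here is just the basic building block.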

publication@source • With the advent of the Semantic Web, the possibility exists to extend the Web paradigm that anyone can publish to include data publication • We are entering the age of distributed data publication • Most research data will in future not be submitted to centralized databases • Rather, data will be published locally by individual research groups, by institutional repositories and by journal publishers, complete with semantically rich metadata that can be harvested and indexed • The database gives way to a distributed ‘data space’ • The trick then is to create mechanisms whereby such heterogeneous distributed data can be integrated and made cross searchable • One mechanism we are now exploring is the data web

Data integration – the lightweight data web approach • The data web is a novel concept for digital information integration involving semantic web technologies • The data are held locally, with metadata published on local Web servers • Separately for each data web serving a particular knowledge domain, automated lightweight software tools will be used to integrate the distributed data • separate metadata schemas will be mapped to a core ontology • instance metadata describing the distributed data will be made available for harvesting as RDF by creating a SPARQL endpoint at each resource • This overcomes syntactic and semantic differences between data providers • Resources can then be discovered by distributed SPARQL queries across the data web
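The two integration steps described above, mapping local schemas to a core ontology and then querying across providers, can be sketched roughly as follows. This is an assumption-laden toy: the provider names, field names and `core:` terms are hypothetical, in-memory dictionaries stand in for SPARQL endpoints, and nothing here reflects the project's actual tooling.

```python
# Sketch: map each provider's local metadata fields onto shared core terms,
# then search all providers through the core vocabulary.
# Provider names, field names and core terms are hypothetical.

CORE_MAPPINGS = {
    "publisher_a": {"title": "core:title", "organism": "core:species"},
    "museum_b": {"name": "core:title", "taxon": "core:species"},
}

def to_core(provider, record):
    """Rename a record's local fields to core terms, dropping unmapped fields."""
    mapping = CORE_MAPPINGS[provider]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def search(sources, term, value):
    """Query every provider by a core term; return (provider, core record) pairs."""
    hits = []
    for provider, records in sources.items():
        for record in records:
            core = to_core(provider, record)
            if core.get(term) == value:
                hits.append((provider, core))
    return hits

# Stand-ins for the metadata each provider would expose locally.
SOURCES = {
    "publisher_a": [{"title": "Fig. 1", "organism": "Drosophila melanogaster"}],
    "museum_b": [
        {"name": "Plate 12", "taxon": "Drosophila melanogaster"},
        {"name": "Plate 13", "taxon": "Meles meles"},
    ],
}

results = search(SOURCES, "core:species", "Drosophila melanogaster")
for provider, record in results:
    print(provider, record["core:title"])
```

The point of the design is that each provider keeps its own schema; only the small mapping table needs to be agreed for a given knowledge domain.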

Data web services

Web 2.0 aspects of data webs • Use of the Web as the platform • Small pieces, loosely coupled • Programmatic access, giving ‘hackability’ and the right to remix • Tagging: • Data webs are predicated on a formal core ontology, but we see vital roles for user annotations to supplement formal metadata • Trusting our users: • Data providers control their own primary image data and metadata • Data consumers are free to use the data web service in whatever way they think fit, including building secondary services, and providing annotations • The Long Tail: • Data webs enable discovery of ‘long tails’ of hard-to-find data – this is particularly true for research particulars rather than research universals

The ImageWeb Project • Image webs are data webs for research images • We desire to integrate and make cross-searchable research images held by publishers, research organizations, museums and institutional repositories, which are currently in isolated data silos • We desire to enable these information resources • to become a more integral part of day-to-day research, and • for published images to be more fully used than at present, including combination and re-use for meta-research • The same images might be accessed by more than one data web • For example, cellular images might be accessed by one data web illustrating confocal microscopy techniques, and alternatively by another data web concerned with cancer therapy

ImageBLAST – an image web secondary service • I originally imagined that ImageWeb users would directly query the ImageWeb, and from there be led to relevant images • However, I now believe that it might be even more useful for a user to be able to click on an image within an online paper she is reading, and have semantically related images from other sources presented as a ranked list • This service would resemble the basic bioinformatics BLAST service for finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/) • This ‘ImageBLAST’ service would not locate images that resemble the first image in terms of visual appearance, but in terms of being about the same thing • e.g. the same gene expressed in a different organism • or the same biological concept demonstrated in a different system
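As a rough illustration of ranking by "aboutness" rather than appearance, the sketch below scores candidate images by the overlap between their annotation terms and those of the clicked image. The choice of Jaccard similarity is an assumption made here for simplicity, not a method stated in the talk, and the image identifiers and annotation terms are invented.

```python
# Sketch: rank images by shared semantic annotations, not by pixels.
# Jaccard overlap is an assumed scoring choice; all terms are invented.

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_related(query_terms, candidates):
    """Return (score, image_id) pairs with score > 0, best match first."""
    scored = [(jaccard(query_terms, terms), image_id)
              for image_id, terms in candidates.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

query = {"gene:EGFP", "cell:hepatocyte", "method:immunofluorescence"}
candidates = {
    "imgA": {"gene:EGFP", "cell:keratinocyte", "method:immunofluorescence"},
    "imgB": {"gene:EGFP", "cell:hepatocyte", "method:immunofluorescence"},
    "imgC": {"gene:cyclinB", "cell:spermatocyte"},
}

ranked = rank_related(query, candidates)
for score, image_id in ranked:
    print(image_id, score)
```

Flat term overlap ignores relationships between terms; a scheme grounded in an ontology could, for instance, treat the same gene in a different organism as a near match.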

An example – transplanted GFP-labelled stem cells

Related images Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre–into-Cre and (B) Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green) and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in the epidermis of a Z/EG-into-Cre transplant recipient.

How might a data web improve on ? • It permits access to database information hidden in the ‘Deep Web’ • It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio • It provides integration of information with ontological underpinning, semantic coherence, and truth propagation • It permits programmatic access, enabling secondary services to be built on top of one or more data webs

Our present objective DW-40 : data webs for frictionless interoperability between scientific publications and research datasets

References In addition to the papers shown in my presentation itself, please find further details in: • Presentations by Philip Bourne, Anita de Waard, David Karger and David Shotton given at the Research Information Network workshop “Data Webs: new visions for research data on the Web”, 28 June 2006, available at http://www.rin.ac.uk/data-webs. • Erika Darling, Chris Newbern and Nikhil Kalghatgi (Mitre Corporation IR&D) (2005) Reducing visual clutter with semantic lenses. ESRI User Conference July 2005. http://www.themitrecorporation.org/tech/nlvis/pdf/esri_user_conference.pdf. • Anita de Waard (2006) Semantic authoring for scientific publication. Downloadable from www.cs.uu.nl/people/anita/talks/deWaardSWDays0410.pdf. • Anita de Waard and H. van Oostendorp (2005). Development of a semantic structure for scientific articles. Presented at Werkgemeenschap Informatiewetenschap, Antwerp, the Netherlands. http://www.cs.uu.nl/people/anita/papers/deWvanOWIG2710.pdf. • Anita de Waard, Leen Breure, Joost G. Kircz and Herre van Oostendorp (2006) Modeling rhetoric in scientific publications. Presented at INSCIT 2006. http://www.instac.es/inscit2006/papers/pdf/133.pdf. • Roy Thomas Fielding (2000) Architectural styles and the design of network-based software architectures. Chapter 5: Representational state transfer (REST). Ph. D. thesis. Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm • Requirements analyses for building a data web for images: http://imageweb.zoo.ox.ac.uk/wiki/index.php/Defining_Image_Access. • Details of the ImageWeb Consortium: http://imageweb.zoo.ox.ac.uk/wiki/index.php/BioImageWeb_Consortium.

The Internet and the flow of information • What struck me after compiling that list is that it did not contain a single journal publication! • Why is this? • “The Internet treats censorship as damage, and routes around it” • quote by John Gilmore • The same fate will befall anything that impedes the free flow of information, including journals • Unless journals adapt to provide the quality and depth of information that users require, they will become increasingly marginalized, as users go elsewhere on the Web to find it

The ImageStore Project • ImageStore: Curation requirements for legacy analogue and ‘born digital’ scientific image data • Purpose: To research the requirements for effective digital curation and re-use of scientific research images from the biological domain • Part of the Digital Curation Centre’s JISC-funded SCARP Project • To adopt a discipline-specific approach to problems of sharing, curation, re-use and preservation of data • To determine curation needs by embedding curation staff within research teams • To give the ImageStore project specific focus, we are investigating the curation requirements for four distinct types of images, two sets of historical analogue records and two sets of modern ‘born digital’ images

The history of molecular and cell biology • Molecular and cell biology began as research disciplines in the 1950s, when the combination of findings from biochemistry, biophysics and electron microscopy gave us the DNA double helix and the first visions of cell ultrastructure and function • Many of the pioneers of molecular and cell biology have now retired or are close to retirement • Their analogue data constitute our scientific cultural heritage, yet most of it will almost certainly be lost if nothing is done soon to curate and archive it • The cost of having to repeat these research observations would far outweigh the cost of preserving the original data

How much data should we save? • It is now technically possible to store as much research data as we wish • But how much is enough? • When is it right not to save data? • For electron microscopy, a good rule of thumb is that for every 1000 EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will make it into print, as figures in a scientific paper • While we should be happy to discard the 900 poor negatives, what we should do with the 98 unpublished good images is a pressing question

Electron microscopy of trypanosomes • Trypanosomes are the causative agents of sleeping sickness • Hundreds of electron micrograph negatives – glass photographic plates – taken over the last 25 years by Professor Keith Gull (Dunn School of Pathology), during his life-long studies of microtubules in trypanosomes • Image: Tsetse fly • From Broadhead et al., Flagellar motility is required for the viability of the bloodstream trypanosome. Nature 440, 224-227 (9 March 2006)

Wildlife videos • Wildlife videos of British and African mammals, including badgers and Ethiopian wolves • Created by Professor David Macdonald’s Wildlife Conservation Research Unit (Department of Zoology) over the last 20 years • There are hundreds of analogue videotapes in a variety of formats • Haydon et al. Low-coverage vaccination strategies for the conservation of endangered species. Nature 443, 692-695 (12 October 2006)

Computer simulations of the human heart • These models, created by Professor Denis Noble and colleagues (Department of Physiology), permit understanding of heart disease • They form part of the OeRC Integrative Biology e-Science Project • Both the computational models and the resulting digital videos recording the simulations are important artefacts that are shared with overseas collaborators and that require long-term curation

In situ images of gene expression • In situ images revealing the time and place of gene expression in the testes of the fruit fly, Drosophila melanogaster, are important for understanding male sterility in humans • These images are currently being acquired by my colleague Dr Helen White-Cooper (Department of Zoology), as part of a BBSRC project on which I am co-investigator • They are born digital true colour light microscopy images • DNA array images that quantify gene expression also form part of the data • Image labels: Male fruit fly; Mst87F; cyclinB; aly

The end • Acknowledgement: I am indebted to Graham Klyne, with whom my data web ideas have been developed
