Integrating Diverse Sources of Scientific Data: Is it safe to match on names?

Prof. Jessie Kennedy


Exploiting Diverse Sources of Scientific Data

  • Wealth and diversity of scientific data collected and stored is growing rapidly

    • Increase in automation

      • Genetic sequencing, remote sensing, astronomy satellites

    • Decrease in technological costs

      • Computers more powerful, disk space greater for the same £

  • Huge potential for scientific discovery by exploiting this data

    • especially multi-disciplinary research

  • The number, complexity and diversity of resources make this a difficult task

  • Case Study

    • Data Integration

    • Matching data sets on biological names


SEEK

  • Science Environment for Ecological Knowledge

    • USA National Science Foundation funding

  • Multidisciplinary project

    • Biology: Ecology, Taxonomy

    • Environmental science: Geography, Remote sensing, Meteorology, Climatology

    • Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction


    The SEEK Prototype: Ecological Niche Modeling

    [Figure: occurrence points on the native species distribution (geographic space) are combined with environmental layers such as precipitation and temperature to build a model of the niche in ecological dimensions (ecological space). The model is projected back onto geography to give a native range prediction and an invaded range prediction.]

    • Inputs: geospatial and remotely sensed data; biodiversity information, e.g. data from museum specimens and ecological surveys
    • Process: ecological niche modeling
    • Results are taken to integrate with other data realms (e.g., human populations, public health)

    Species prediction map

    Predicted distribution: Amur snakehead (Channa argus)

    Image from http://www.lifemapper.org

    SEEK - Informatics Challenges

    • Data is Distributed

    • Data is Heterogeneous

      • Syntax

        • e.g. Text, Excel, Relational Database, …

      • Schema

        • e.g. Names of the tables, columns in tables

      • Semantics (principal focus for SEEK)

        • From many disciplines

          • Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments,…

          • Data on economics, demographics, legal issues,…

    SEEK Overview

    • BEAM WG: Biodiversity and Ecological Analysis and Modelling

    • EcoGrid: Making diverse environmental data systems interoperate

    • Analysis and Modelling System (Kepler): Modelling scientific workflows

    • Knowledge Representation WG: Ontologies, Metadata

    • Taxon WG: Taxonomic name/concept resolution server

    • Semantic Mediation System: “Smart” data discovery and integration

    SEEK Overview

    EcoGrid


    EcoGrid Resources

    [Figure labels: sites LUQ, AND, HBR, VCR, NTL; node types Metacat, SRB, VegBank, DiGIR, Xanthoria, legacy system]

    Partnership for Interdisciplinary Studies of Coastal Oceans (4)

    Natural History Collections (>> 100)

    UC Natural Reserve System (36)

    Multi-agency Rocky Intertidal Network (60)

    LTER Network (24)

    Organization of Biological Field Stations (180)

    EcoGrid Data Access

    • EcoGrid registry to discover data sources

    • EML (Ecological Metadata Language)

      • Experimental data, survey data, spatial raster and vector data, etc.

      • XML based

        • Discovery information

          • Creator, Title, Abstract, Keyword, etc.

        • Coverage

          • Geographic, temporal, and taxonomic extent

        • Logical and physical data structure

          • Data semantics via unit definitions and typing

        • Protocols and methods

    • DarwinCore

      • Museum collections

    EcoGrid Services

    • Service to Analysis and Modelling Layer

      • Interaction with Kepler – Workflows

      • Interaction with Grid Computing Facilities

        • Distributed computation

    • Service to Semantic Mediation Layer

      • Access to Ontologies; Taxon Services

    • Access to Legacy Apps

      • LifeMapper

      • Spatial Data Workbench

    SEEK Overview

    AMS


    Scientific Workflows

    [Figure labels: Query EcoGrid to find data; Archive output to EcoGrid with workflow metadata]

    • Model the way scientists currently work with data

      • coordinate export and import of data among software systems

    • Workflows emphasize data flow

    • Output generation includes creating appropriate metadata

      • The analysis workflow itself becomes metadata

      • The workflow describes the data lineage as it has been transformed

      • Derived data sets can be stored in EcoGrid with provenance

    Scientific workflows

    • EML provides semi-automated data binding

    Kepler: Ecological Niche Model

    (200 to 500 runs per species × 2000 mammal species × 3 minutes/run) = 833 to 2083 days


    Grid-enable Kepler

    (200 to 500 runs per species × 2000 mammal species × 3 minutes/run) / 100 nodes = 8 to 20 days

    • Utilize distributed computing resources

    • Execute single steps or sub-workflows on distributed machines
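
The serial and 100-node runtime estimates can be checked with a few lines of Python. This is a minimal sketch that assumes runs are independent, divide evenly across nodes, and incur no scheduling or data-transfer overhead. The function name `total_days` is illustrative and is not part of Kepler.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days for all runs, assuming perfect division across nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / MINUTES_PER_DAY / nodes


# Lower and upper bounds from the slides: 200 and 500 runs per species.
serial = [total_days(runs) for runs in (200, 500)]
grid = [total_days(runs, nodes=100) for runs in (200, 500)]

print("single machine:", [round(d) for d in serial], "days")
print("100 nodes:", [round(d, 1) for d in grid], "days")
```

Under these assumptions the 100-node figures come out at roughly 8.3 and 20.8 days, which the slide quotes as 8 to 20 days.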

    [Figure: KeplerGrid for Niche Modeling]

    SEEK Overview

    SMS

    Metadata

    • Key information needed to read and machine process a data file is in the metadata

      • Physical descriptors (CSV, Excel, RDBMS, etc.)

      • Logical descriptors: Entity (table, image, …) and Attribute (column) descriptions

        • Name

        • Type (integer, float, string…)

        • Codes (missing values, nulls...)

        • Integrity constraints

      • Semantic descriptions (ontology-based type systems)

    • Metadata driven data ingestion
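
Metadata-driven ingestion can be sketched as follows: the reader takes the delimiter, column types and missing-value codes from a metadata record instead of hard-coding them. The `METADATA` dictionary is a hypothetical, simplified stand-in for attribute-level metadata. It is not EML, and the sample rows are invented for illustration.

```python
import csv
import io

# Hypothetical attribute-level metadata (illustrative only, not EML).
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "taxon", "type": str, "missing": {""}},
        {"name": "count", "type": int, "missing": {"-999", "NA"}},
        {"name": "area_m2", "type": float, "missing": {"-999", "NA"}},
    ],
}


def ingest(text, metadata):
    """Parse delimited text into typed rows, mapping missing-value codes to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = []
    for raw in reader:
        row = {}
        for attr in metadata["attributes"]:
            value = raw[attr["name"]].strip()
            if value in attr["missing"]:
                row[attr["name"]] = None
            else:
                row[attr["name"]] = attr["type"](value)
        rows.append(row)
    return rows


# Invented sample data; the second row uses two different missing-value codes.
SAMPLE = "taxon,count,area_m2\nPicea rubens,12,25.0\nCarya floridana,-999,NA\n"
rows = ingest(SAMPLE, METADATA)
for row in rows:
    print(row)
```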

    Ecological ontologies

    • What was measured (biomass or photosynthetic solar radiation)

    • Type of quantity measured (mass, length)

    • Context of measurement (Psychotria limonensis, wavelength band)

    • How it was measured (dry weight, total solar radiation)


    Semantic Mediation

    [Figure labels: Data, Ontology, Workflow Components]

    • Label data with semantic types

    • Label inputs and outputs of analytical components with semantic types

    • Use reasoning engine to generate transformation step

    • Use reasoning engine to discover relevant component
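
The idea of discovering a relevant component from semantic types can be illustrated with a toy example. This sketch replaces the reasoning engine with a simple walk up an is-a hierarchy. The ontology terms and component names are hypothetical and are not taken from the SEEK ontologies or from Kepler.

```python
# Hypothetical toy ontology: each semantic type maps to its parent type.
IS_A = {
    "DryWeightBiomass": "Biomass",
    "Biomass": "Mass",
    "PlotArea": "Area",
}

# Hypothetical components, each labelled with the semantic type of its input.
COMPONENTS = {
    "mass_summary": "Mass",
    "area_normaliser": "Area",
}


def is_a(semantic_type, ancestor):
    """True if semantic_type equals ancestor or is one of its descendants."""
    current = semantic_type
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


def compatible_components(data_type):
    """Names of components whose required input type subsumes data_type."""
    return sorted(
        name for name, required in COMPONENTS.items() if is_a(data_type, required)
    )


print(compatible_components("DryWeightBiomass"))
```

A column labelled `DryWeightBiomass` is accepted by the component that expects `Mass` because the label is a descendant of that type, and rejected by the one that expects `Area`.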

    Data integration

    • Homogeneous data integration

      • Integration via EML metadata is relatively straightforward

    • Heterogeneous Data integration

      • Requires advanced metadata and processing

        • Attributes must be semantically typed

        • Collection protocols must be known

        • Units and measurement scale must be known

        • Measurement relationships must be known

          • e.g., that ArealDensity=Count/Area
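
The measurement relationship ArealDensity = Count/Area can be sketched as below: one data set reports density directly, another reports a count and a plot area, and the two become comparable only once the relationship and the units are known. The `Measurement` class and the numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # e.g. "individuals", "m^2", "individuals/m^2"


def areal_density(count: Measurement, area: Measurement) -> Measurement:
    """Apply ArealDensity = Count / Area and derive the resulting unit."""
    if area.value <= 0:
        raise ValueError("area must be positive")
    return Measurement(count.value / area.value, f"{count.unit}/{area.unit}")


# Invented values: data set A stores density, data set B stores count and area.
density_a = Measurement(2.5, "individuals/m^2")
density_b = areal_density(Measurement(50, "individuals"), Measurement(20, "m^2"))

print(density_a == density_b)
```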

    Simple Example

    Life Sciences Data

    • Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data

      • typically organisms are referenced by a Latin name

        • e.g. Picea rubens

    • Many analyses require integrating data

      • originating in many locations and

      • at various points in time

    • For most bio-referenced data, integration involves matching on organism name

      • SEEK Taxon investigating associated issues

    Biological (Scientific) Names

    • Used for communicating information about known organisms and groups of organisms – taxa

      • Framework for all biologists to communicate…

    • Arise from taxonomists applying them to species and higher taxa following classification

    • Formalized according to strict codes of nomenclature

      • differ depending on kingdom

    • Use a Latin naming scheme

      • polynomial (two or more words) for species and below; monomial (one word) for genus and above

    • Quoted as: LatinName NameAuthors Year

      • Example: Carya floridana Sarg. 1913

    • Can cause problems in data analysis…..


    Classification, Concepts & Names

    [Figure: taxonomists classify a pile of specimens into a taxonomic hierarchy of taxon concepts. Labels: Taxon_concept_a at Genus rank; Taxon_concept_b, Taxon_concept_c and Taxon_concept_d at Species rank; type specimens.]


    Classification, Concepts & Names

    [Figure: classifying a pile of specimens; labels: Taxon_concept_d (twice)]


    Taxonomic history of Aus L. 1758

    [Figure: five successive states of the genus Aus L.1758, labelled (i) to (v), showing genus and species concepts, genus and species names, type specimens, other specimens and the publications in which each appears. The figure's annotations are listed below in order of publication date.]

    Publications of taxonomic revisions

    • Linnaeus 1758: Aus L.1758 and Aus aus L.1758.

    • Archer 1965: Archer splits Aus aus L. 1758 into two species, retains the name for one and creates a new one (Aus bea Archer 1965).

    • Fry 1989: Fry splits Aus bea Archer 1965 into two species, retains the name for one and creates a new one (Aus cea BFry 1989).

    • Tucker 1991: Tucker finds new specimens and combines Aus aus L. 1758 and Aus bea Archer 1965 into one species, retaining the name. Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea.

    • Pargiter 2003: Pargiter decides to re-split Aus aus but believes bea (beus) is in a new genus, Xus Pargiter 2003, giving Xus beus (Archer) Pargiter 2003. Pargiter publishes his revision using Pyle’s corrigendum of the epithet bea to beus and Aus cea to Aus ceus.

    Publications of purely nomenclatural observation

    • Pyle 1990: a diligent nomenclaturist, Pyle notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989. bea and cea are noted as invalid names and replaced with beus and ceus.

    Problems with Taxonomic Names

    • Are not unique

      • “Re-use” of names with changed definition

      • Name is ambiguous without definition/context

    • Subject to alterations and 'corrections' in time

    • Often recorded inappropriately in datasets

      • No author and/or year (e.g. Carya floridana)

      • Abbreviated (e.g. C. floridana)

      • Internal code (e.g. PicRub for Picea rubens)

      • Vernacular used (e.g. Scrub Hickory)

      • Misspelled
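
The recording problems listed above can be made concrete with a small sketch. It compares exact string matching with a naive normalisation that keeps only the first two words ("Genus epithet"). The `canonical` helper is an illustrative assumption, not a SEEK tool, and the misspelled variant `Carya floridanna` is invented for the example.

```python
FULL_NAME = "Carya floridana Sarg. 1913"

# Ways the same organism might be recorded in different data sets.
RECORDED = [
    "Carya floridana Sarg. 1913",  # full name with author and year
    "Carya floridana",             # no author or year
    "C. floridana",                # abbreviated genus
    "Scrub Hickory",               # vernacular name
    "Carya floridanna",            # misspelling (invented example)
]


def canonical(name):
    """Return 'Genus epithet' if the string looks like a Latin binomial, else None."""
    parts = name.split()
    if (
        len(parts) >= 2
        and parts[0][0].isupper()
        and parts[1].isalpha()
        and parts[1].islower()
    ):
        return f"{parts[0]} {parts[1]}"
    return None


exact = [n for n in RECORDED if n == FULL_NAME]
normalised = [n for n in RECORDED if canonical(n) == canonical(FULL_NAME)]

print(len(exact), "exact match;", len(normalised), "matches after normalisation")
```

Normalisation recovers only one extra record here, and it does so by discarding the author and year, which is exactly the information that helps to disambiguate a name.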

    Taxon Concepts ……

    • The published expert opinion defining and describing a group of organisms which are given a (scientific) name

      • Scientific names qualified with a reference to the definition of a concept

    • Should be used for communicating about groups of organisms

    • Comparing or integrating data based on taxon concepts will be more accurate

    Taxon Concepts…

    • Created by someone - an Author

    • Described in a Publication

    • Given a Name

      • Related to the type specimen

    • Definition

    • Referenced by

      • Full Scientific name + “according to” (Author + Publication + Date), which identifies the Definition

      • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)

    Taxon Concepts ……

    • Defined by

      • set of Specimens examined during classification

      • set of common Characters

        • context dependent; differentiate taxa rather than fully describe them;

        • use natural language with all its ambiguities

      • relationships to other Taxon Concepts

        • Taxon circumscription

          • the lower level taxa

        • Congruence, overlap, includes etc. to taxa in other classifications

    Taxon Concepts ……

    • Original concept

      • 1st use of name as described by the taxonomist

        • same author + date in scientific name and “according to”

        • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)

          • TC_a

    • Revised concept

      • Re-classification of a group

        • Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997)

          • TC_b

    • Relationship between the taxon concepts

      • TC_b includes TC_a
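
One way to picture the difference between a name and a concept is as a small data structure: a concept is a name plus its “according to” reference, and relationships are stated between concepts. This is a minimal sketch using the two concepts above. The class and field names are illustrative and are not taken from the Taxon Concept Schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    name: str          # full scientific name, with author and year
    according_to: str  # publication in which this concept is defined


TC_a = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
TC_b = TaxonConcept(
    "Carya floridana Sarg. (1913)",
    "Stone, Flora of North America 3:424 (1997)",
)

# Relationships are triples: (subject concept, relationship, object concept).
RELATIONSHIPS = {(TC_b, "includes", TC_a)}

print("same name:", TC_a.name == TC_b.name)
print("same concept:", TC_a == TC_b)
```

The two concepts share a name but are distinct objects, so a join on `name` alone would silently treat them as the same thing.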

    Legacy Data …

    • In legacy data names often appear in place of concepts

    • Names are imprecise

      • inappropriate for referring to information regarding taxa

        • e.g. observational/collection data

        • BUT…sometimes that’s all we have

    • How do we interpret names?…..

      • potentially multiple definitions

        • the sum of all definitions that exist for the name

        • one of the existing definitions

        • the “attributes” in common to all the definitions

        • represented by the type specimen

    Names as Taxon Concepts

    • Nominal concepts

      • Sub-set of TaxonConcepts

      • Name but no AccordingTo

        • non-unique (concept) identifier attributes

          • can be given a unique concept identifier

      • No definition

      • Explicitly saying it’s something with this name

        • but not really sure what is/was meant by the name

      • Encourage people to understand and address the issue of names

        • Allowing mark-up of data with names allows them to believe names are really good enough

      • Will improve long term usefulness of scientific data

      • Ease integration

    SEEK Taxon’s Message…..

    • Scientific names are not unique identifiers for biological entities

    • Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data

    • Biologists must reference organisms precisely

      • if datasets are to be of use long term or to other users

    • Reference by taxon concept rather than name

      • integrate data for analysis on taxon concepts
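
The risk of integrating on names alone can be sketched with the fictional genus Aus from the taxonomic history slide. Two surveys use the name Aus aus L.1758, one before and one after Archer's 1965 split, so the same name refers to different groups of organisms. The survey records and counts below are invented for illustration.

```python
from collections import defaultdict

# Each record: (scientific name, "according to" reference, count). Counts are invented.
SURVEY_1960 = [("Aus aus L.1758", "Linnaeus 1758", 40)]
SURVEY_1970 = [
    ("Aus aus L.1758", "Archer 1965", 25),
    ("Aus bea Archer 1965", "Archer 1965", 15),
]


def totals(records, key):
    """Sum counts over records grouped by key(record)."""
    sums = defaultdict(int)
    for record in records:
        sums[key(record)] += record[2]
    return dict(sums)


combined = SURVEY_1960 + SURVEY_1970
by_name = totals(combined, key=lambda r: r[0])
by_concept = totals(combined, key=lambda r: (r[0], r[1]))

print(by_name)
print(by_concept)
```

Grouping by name adds 40 and 25 together as if they counted the same taxon. Grouping by concept keeps them apart, leaving an expert or a concept-relationship service to decide how the 1758 concept relates to the two 1965 concepts.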

    Taxonomic Databases

    • Main taxonomic list servers are still name based

      • single perspective on taxonomy

        • don’t represent multiple classifications

      • unclear what the definition is (don’t even try!)

      • provide non-standardised interface (web page, xml download)

    • SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK

      • Find concepts given a name

      • Compare concepts

      • Relate concepts

      • Mark up ecological data sets with concepts

    • First

      • Need data on names and concepts

      • Need an exchange standard….

    Taxon Concept Schema

    • TCS standard for exchange of taxonomic names/concept data

      • Taxonomic Databases Working Group (TDWG)

      • Global Biodiversity Information Facility (GBIF)

      • XML based exchange schema

      • Makes heavy use of Globally Unique Identifiers (GUIDs)

    • Not designed as the “correct way” to model a Taxon Concept

      • No “rules” as to what a taxon must have

      • Designed to accommodate different models

    • Includes Taxon Names

      • more constrained - the codes of nomenclature

    • TCS/EML

      • TCS modifications to EML taxon coverage

    Taxon Names and Taxon Concepts

    • Important to be able to pass names alone

      • For nomenclatural and some taxonomic purposes

      • But not for identifications/observations

    • Taxon Concepts refer to Names

      • By GUID

      • Names must not change

        • Can’t record original taxon concept

    Taxon Concept/Name Resolution Server

    • Taxon Object Server

      • Schema based on the TCS model

      • Implements the GUIDs using LSID technology

      • Tool to import/export data from TCS documents

    • TOS Allows

      • registration, retrieval of taxonomic datasets

      • Match concepts given names, concepts, etc.

    • Allow users to

      • See different taxonomic opinions

      • Use GUIDs (LSIDs) to reference concepts

      • Find concepts…

      • Author new concepts

      • Make new relationships between existing concepts

    • Integrated with Kepler workflow system
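
Since the server implements its GUIDs as LSIDs, a short sketch of the general LSID form may help: `urn:lsid:<authority>:<namespace>:<object>` with an optional `:<revision>` suffix. The parser below is illustrative only, and the example identifier is hypothetical. The slides do not show the identifiers actually issued by the Taxon Object Server.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    return LSID(*parts[2:])


# Hypothetical identifier, for illustration only.
concept_id = parse_lsid("urn:lsid:example.org:taxonconcepts:12345")
print(concept_id)
```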

    SEEK User Interface Tools

    • Concept mapper

      • A desktop tool to assist taxonomists to relate concepts from one source to another

        • For use in creating data sets for TOS or TCS

        • For creating new relationships between concepts in TOS

    • Taxonomy comparison visualisation

      • Visualisation tool to explore different classifications

      • Compare concepts


    Concept Mapper Main GUI

    [Screenshot labels: Query concepts, Concepts, Relationships]

    Concept Comparison Visualisation

    SEEK Summary

    • Environment to support large scale ecological data analysis

      • Scientific Workflows: Kepler

      • Semantic Mediation

        • Ecological ontology creation/use for data integration

      • Grid/Web-based data discovery

      • Resolution of Taxonomic Names/Concepts

        • Standards development

        • Concept matching server

        • Visualisation tools

    • http://seek.ecoinformatics.org

    Is it safe to match on names?

    • I hope I have convinced you that the answer is

      NO

      • as a general rule…

        BUT

    • Depends on the purpose of the data

      • therefore the accuracy required

    • The degree of automation used in matching

      • greater automation – greater potential problem

    • Expertise of person involved in the matching

    Many Outstanding Issues….

    • Educating biologists about the inherent problem in names

      • Not limited to the Linnaean system of nomenclature

    • Lack of good taxon concept data

    • Widening usage and application of taxon concepts

      • Adopting GUIDs

      • Provision of reliable ‘look up’ facilities

      • Cross referencing of GUIDs

        • Reuse is vital

        • Must not create duplicate GUIDs if possible

    • Conversion of legacy data

    • Develop good matching algorithms

    • Potential move from XML schema -> semantic web technologies

    • ……..

    Acknowledgements

    • This material is based upon work supported by:

    • The National Science Foundation

    • SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis

      • Matt Jones – for many of the slides….

    • Global Biodiversity Information Facility

    • eScience Institute

      • Research Theme Programme

      • Malcolm Atkinson

    Exploiting Diverse Sources of Scientific Data

    • Upcoming Workshop

      • discussing possible technology solutions

        RDF, Ontologies and Meta-Data Workshop

        7th – 9th June, 2006

        e-Science Institute

        15 South College Street

        Edinburgh

        http://www.nesc.ac.uk/esi/events/683/
