
Integrating Diverse Sources of Scientific Data: Is it safe to match on names?


Presentation Transcript


  1. Integrating Diverse Sources of Scientific Data: Is it safe to match on names? Prof. Jessie Kennedy

  2. Exploiting Diverse Sources of Scientific Data • The wealth and diversity of scientific data collected and stored is growing rapidly • Increase in automation: genetic sequencing, remote sensing, astronomy satellites • Decrease in technological costs: computers are more powerful and disk space is greater for the same £ • Huge potential for scientific discovery by exploiting this data, especially in multi-disciplinary research • The number, complexity and diversity of resources make this a difficult task • Case study: data integration, matching data sets on biological names

  3. SEEK • Science Environment for Ecological Knowledge • USA National Science Foundation funding • Multidisciplinary project • Biology: Ecology, Taxonomy • Environmental science: Geography, Remote sensing, Meteorology, Climatology • Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction

  4. The SEEK Prototype: Ecological Niche Modeling • [Figure: occurrence points on the native species distribution (geographic space) are combined with precipitation and temperature to give a model of the niche in ecological dimensions (ecological space), which is projected back onto geography as a native range prediction and an invaded range prediction] • Inputs: geospatial and remotely sensed data; biodiversity information, e.g. data from museum specimens and ecological surveys • Results are taken to integrate with other data realms (e.g. human populations, public health, etc.)

  5. Species prediction map • Predicted distribution: Amur snakehead (Channa argus) • Image from http://www.lifemapper.org

  6. SEEK - Informatics Challenges • Data is distributed • Data is heterogeneous • Syntax, e.g. text, Excel, relational database, … • Schema, e.g. names of the tables and of the columns in tables • Semantics (the principal focus for SEEK) • Data come from many disciplines • Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments, … • Data on economics, demographics, legal issues, …

  7. SEEK Overview • BEAM WG: Biodiversity and Ecological Analysis and Modelling • EcoGrid: making diverse environmental data systems interoperate • Analysis and Modelling System (Kepler): modelling scientific workflows • Knowledge Representation WG: ontologies, metadata • Taxon WG: taxonomic name/concept resolution server • Semantic Mediation System: “smart” data discovery and integration

  8. SEEK Overview: EcoGrid

  9. EcoGrid Resources • [Map figure] Node types: Metacat node, SRB node, VegBank node, DiGIR node, Xanthoria node, legacy system • Sites labelled on the map: LUQ, AND, HBR, VCR, NTL • Partnership for Interdisciplinary Studies of Coastal Oceans (4) • Natural History Collections (>> 100) • UC Natural Reserve System (36) • Multi-agency Rocky Intertidal Network (60) • LTER Network (24) • Organization of Biological Field Stations (180)

  10. EcoGrid Data Access • EcoGrid registry to discover data sources • EML (Ecological Metadata Language) • Experimental data, survey data, spatial raster and vector data, etc. • XML based • Discovery information • Creator, Title, Abstract, Keyword, etc. • Coverage • Geographic, temporal, and taxonomic extent • Logical and physical data structure • Data semantics via unit definitions and typing • Protocols and methods • DarwinCore • Museum collections

  11. EcoGrid Services • Service to Analysis and Modelling Layer • Interaction with Kepler – Workflows • Interaction with Grid Computing Facilities • Distributed computation • Service to Semantic Mediation Layer • Access to Ontologies; Taxon Services • Access to Legacy Apps • LifeMapper • Spatial Data Workbench

  12. SEEK Overview: AMS

  13. Scientific Workflows • Model the way scientists currently work with data • Coordinate export and import of data among software systems • Workflows emphasize data flow • Output generation includes creating appropriate metadata • The analysis workflow itself becomes metadata • The workflow describes the data lineage as it has been transformed • Derived data sets can be stored in EcoGrid with provenance • [Figure labels: query EcoGrid to find data; archive output to EcoGrid with workflow metadata]

  14. Scientific workflows • EML provides semi-automated data binding

  15. Kepler: Ecological Niche Model • (200 to 500 runs per species × 2000 mammal species × 3 minutes/run) = 833 to 2083 days

  16. KeplerGrid for Niche Modeling • Grid-enable Kepler • Utilize distributed computing resources • Execute single steps or sub-workflows on distributed machines • (200 to 500 runs per species × 2000 mammal species × 3 minutes/run) / 100 nodes = 8 to 20 days
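The run-time estimates on slides 15 and 16 can be checked with a few lines of arithmetic. This is only a sketch: it assumes every run is independent and that the work divides evenly across nodes with no scheduling overhead, which the slides do not state. The function name `total_days` is ours.

```python
MINUTES_PER_DAY = 24 * 60


def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days, assuming independent runs split evenly across nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / MINUTES_PER_DAY / nodes


for runs in (200, 500):
    single = total_days(runs)
    grid = total_days(runs, nodes=100)
    print(f"{runs} runs/species: {single:.1f} days on one node, "
          f"{grid:.1f} days on 100 nodes")
```

Under these assumptions the single-node figures come out at roughly 833 and 2083 days, and the 100-node figures at roughly 8.3 and 20.8 days, which the slide gives as 8 to 20.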

  17. SEEK Overview: SMS

  18. Metadata • Key information needed to read and machine-process a data file is in the metadata • Physical descriptors (CSV, Excel, RDBMS, etc.) • Logical entity (table, image, …) and attribute (column) descriptions • Name • Type (integer, float, string, …) • Codes (missing values, nulls, …) • Integrity constraints • Semantic descriptions (ontology-based type systems) • Metadata-driven data ingestion
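To make "metadata-driven data ingestion" concrete, here is a minimal sketch in which a metadata record drives the parsing of a delimited file. The metadata layout below is a hypothetical stand-in, not EML; the attribute names, missing-value codes and the `ingest` function are invented for illustration.

```python
import csv
import io

# Hypothetical metadata record (not EML): physical descriptor plus
# per-attribute name, type and missing-value codes.
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": str, "missing": {""}},
        {"name": "count", "type": int, "missing": {"-999", "NA"}},
        {"name": "area_m2", "type": float, "missing": {"-999", "NA"}},
    ],
}


def ingest(text, metadata):
    """Parse delimited text into typed rows, mapping missing codes to None."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    header = next(reader)
    expected = [attr["name"] for attr in metadata["attributes"]]
    if header != expected:
        raise ValueError(f"header {header} does not match metadata {expected}")
    rows = []
    for raw in reader:
        row = {}
        for attr, value in zip(metadata["attributes"], raw):
            if value in attr["missing"]:
                row[attr["name"]] = None  # declared missing-value code
            else:
                row[attr["name"]] = attr["type"](value)  # cast to declared type
        rows.append(row)
    return rows


SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\n"
rows = ingest(SAMPLE, METADATA)
print(rows)
```

The parser contains no knowledge of the file itself; changing the metadata record is enough to read a differently structured file.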

  19. Ecological ontologies • What was measured (biomass or photosynthetic solar radiation) • Type of quantity measured (mass, length) • Context of measurement (Psychotria limonensis, wavelength band) • How it was measured (dry weight, total solar radiation)

  20. Semantic Mediation • [Figure labels: data, ontology, workflow components] • Label data with semantic types • Label inputs and outputs of analytical components with semantic types • Use a reasoning engine to generate a transformation step • Use a reasoning engine to discover a relevant component
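The idea of labelling data and component inputs with semantic types, then discovering relevant components, can be sketched with a toy is-a hierarchy. This is a stand-in for a reasoning engine, not SEEK's Semantic Mediation System; the type names, component names and functions are all hypothetical.

```python
# Hypothetical is-a hierarchy of semantic types (child -> parent).
SUBTYPE_OF = {
    "DryWeightBiomass": "Biomass",
    "Biomass": "Mass",
    "Mass": "Quantity",
}

# Hypothetical analytical components labelled with semantic types.
COMPONENTS = {
    "sum_mass": {"input": "Mass", "output": "Mass"},
    "niche_model": {"input": "OccurrencePoint", "output": "DistributionMap"},
}


def is_a(semantic_type, ancestor):
    """True if semantic_type equals ancestor or is one of its subtypes."""
    current = semantic_type
    while current is not None:
        if current == ancestor:
            return True
        current = SUBTYPE_OF.get(current)
    return False


def compatible_components(data_type):
    """Names of components whose declared input type accepts data_type."""
    return sorted(
        name for name, sig in COMPONENTS.items() if is_a(data_type, sig["input"])
    )


print(compatible_components("DryWeightBiomass"))
```

A column labelled `DryWeightBiomass` is offered to `sum_mass` because the hierarchy says it is a kind of `Mass`, even though the two labels differ as strings.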

  21. Data integration • Homogeneous data integration • Integration via EML metadata is relatively straightforward • Heterogeneous data integration • Requires advanced metadata and processing • Attributes must be semantically typed • Collection protocols must be known • Units and measurement scale must be known • Measurement relationships must be known • e.g., that ArealDensity = Count / Area
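A sketch of why semantic types, units and measurement relationships all have to be known before heterogeneous data can be combined, using the slide's relationship ArealDensity = Count / Area. The `Measurement` class, the unit table and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    semantic_type: str  # e.g. "Count", "Area", "ArealDensity"
    unit: str


# Hypothetical unit table: conversion factors to square metres.
TO_M2 = {"m2": 1.0, "ha": 10_000.0}


def areal_density(count: Measurement, area: Measurement) -> Measurement:
    """Apply ArealDensity = Count / Area after normalising the area unit."""
    if count.semantic_type != "Count" or area.semantic_type != "Area":
        raise TypeError("areal_density needs a Count and an Area")
    area_m2 = area.value * TO_M2[area.unit]
    return Measurement(count.value / area_m2, "ArealDensity", "individuals/m2")


# Two data sets that record the same density in different ways.
a = areal_density(Measurement(50, "Count", "individuals"),
                  Measurement(100, "Area", "m2"))
b = areal_density(Measurement(10_000, "Count", "individuals"),
                  Measurement(2, "Area", "ha"))
print(a, b)
```

Both inputs reduce to the same derived value only because the type, the unit and the relationship between the attributes were declared.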

  22. Simple Example

  23. Life Sciences Data • Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data • Typically organisms are referenced by a Latin name, e.g. Picea rubens • Many analyses require integrating data originating in many locations and at various points in time • For most bio-referenced data, integration involves matching on organism name • SEEK Taxon is investigating the associated issues

  24. Biological (Scientific) Names • Used for communicating information about known organisms and groups of organisms – taxa • Framework for all biologists to communicate… • Arise from taxonomists applying them to species and higher taxa following classification • Formalized according to strict codes of nomenclature • differ depending on kingdom • Use a Latin naming scheme • polynomial for species + below; monomial for genus + above • Quoted as: LatinName NameAuthors Year • Example: Carya floridana Sarg. 1913 • Can cause problems in data analysis…..

  25. Classification, Concepts & Names • [Figure labels: pile of specimens; classify; taxonomic hierarchy; genus: Taxon_concept_a; species: Taxon_concept_b, Taxon_concept_c, Taxon_concept_d; type specimens]

  26. Classification, Concepts & Names • [Figure labels: pile of specimens; classify; Taxon_concept_d]

  27. Taxonomic history of Aus L. 1758 • [Figure: five successive classifications, (i) to (v), of the genus Aus L. 1758, showing for each publication the genus concept and genus name, the species concepts and species names, and the type specimens] • Publications of taxonomic revisions • In Linnaeus 1758: Aus aus L. 1758 • In Archer 1965: Archer splits Aus aus L. 1758 into two species, retains the name for one and creates a new one, Aus bea Archer 1965 • In Fry 1989: Fry splits Aus bea Archer 1965 into two species, retains the name for one and creates a new one, Aus cea BFry 1989 • In Tucker 1991: Tucker finds new specimens and combines Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name; Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea • In Pargiter 2003: Pargiter decides to re-split Aus aus but believes bea (beus) is in a new genus, Xus Pargiter 2003, giving Xus beus (Archer) Pargiter 2003; Pargiter publishes his revision using Pyle’s corrigendum of the epithet bea to beus and Aus cea to Aus ceus • Publications of purely nomenclatural observation • In Pyle 1990: a diligent nomenclaturist, Pyle, notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989; bea and cea are noted as invalid names and replaced with beus and ceus

  28. Problems with Taxonomic Names • Are not unique • “Re-use” of names with changed definition • Name is ambiguous without definition/context • Subject to alterations and 'corrections' in time • Often recorded inappropriately in datasets • No author and/or year (e.g. Carya floridana) • Abbreviated (e.g. C. floridana) • Internal code (e.g. PicRub for Picea rubens) • Vernacular used (e.g. Scrub Hickory) • Misspelled
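The recording problems listed on slide 28 are easy to reproduce with a naive matcher. The sketch below keys each recorded string on its first two tokens; the key function is a deliberately simple assumption of ours, and the misspelling "Carya floridanna" is invented for the example.

```python
def match_key(name: str) -> str:
    """Naive key: lower-cased first two tokens (genus and species epithet)."""
    tokens = name.strip().lower().split()
    return " ".join(tokens[:2])


recorded = [
    "Carya floridana Sarg. 1913",  # full name with author and year
    "Carya floridana",             # author and year omitted
    "C. floridana",                # abbreviated genus
    "Scrub Hickory",               # vernacular name
    "Carya floridanna",            # misspelled (invented example)
]

target = match_key("Carya floridana Sarg. 1913")
matched = [r for r in recorded if match_key(r) == target]
unmatched = [r for r in recorded if match_key(r) != target]

print("matched:  ", matched)
print("unmatched:", unmatched)
```

Three of the five strings are missed by the matcher. The two that do match agree only as strings: nothing in them says which definition of the name the recorder had in mind.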

  29. Taxon Concepts …… • The published expert opinion defining and describing a group of organisms that is given a (scientific) name • Scientific names qualified with a reference to the definition of a concept • Should be used for communicating about groups of organisms • Comparing or integrating data based on taxon concepts will be more accurate

  30. Taxon Concepts… • Created by someone - an Author • Described in a Publication • Given a Name • Related to the type specimen • Definition • Referenced by • Full Scientific name + “according to” (Author + Publication + Date) → Definition • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)

  31. Taxon Concepts …… • Defined by • set of Specimens examined during classification • set of common Characters • context dependent; differentiate taxa rather than fully describe them; • use natural language with all its ambiguities • relationships to other Taxon Concepts • Taxon circumscription • the lower level taxa • Congruence, overlap, includes etc. to taxa in other classifications

  32. Taxon Concepts …… • Original concept • 1st use of name as described by the taxonomist • same author + date in scientific name and “according to” • Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913) • TC_a • Revised concept • Re-classification of a group • Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997) • TC_b • Relationship between the taxon concepts • TC_b includes TC_a
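The distinction between a name and a concept on slides 30 to 32 can be written down as a small data structure. This is a sketch only: the `TaxonConcept` class and the triple store are hypothetical and are not the Taxon Concept Schema. The identifiers TC_a and TC_b and the citations come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    concept_id: str    # local identifier, as used on the slide
    name: str          # full scientific name
    according_to: str  # publication that defines the concept


TC_a = TaxonConcept(
    "TC_a",
    "Carya floridana Sarg. (1913)",
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
TC_b = TaxonConcept(
    "TC_b",
    "Carya floridana Sarg. (1913)",
    "Stone, Flora of North America 3:424 (1997)",
)

# Relationships between concepts as (subject, relation, object) triples.
RELATIONSHIPS = {("TC_b", "includes", "TC_a")}


def same_name(x: TaxonConcept, y: TaxonConcept) -> bool:
    return x.name == y.name


def same_concept(x: TaxonConcept, y: TaxonConcept) -> bool:
    return x.concept_id == y.concept_id


def includes(x: TaxonConcept, y: TaxonConcept) -> bool:
    return (x.concept_id, "includes", y.concept_id) in RELATIONSHIPS


print(same_name(TC_a, TC_b), same_concept(TC_a, TC_b), includes(TC_b, TC_a))
```

The two concepts carry an identical name, yet they are different concepts, and the relationship between them is directional: the revised concept includes the original, not the other way round.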

  33. Legacy Data … • In legacy data names often appear in place of concepts • Names are imprecise • inappropriate for referring to information regarding taxa • e.g. observational/collection data • BUT…sometimes that’s all we have • How do we interpret names?….. • potentially multiple definitions • the sum of all definitions that exist for the name • one of the existing definitions • the “attributes” in common to all the definitions • represented by the type specimen
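The alternative interpretations of a bare name listed on slide 33 can be illustrated with sets. In this sketch each definition is reduced to the set of specimens it covers, which is a simplification: the slide speaks of "attributes" in common, and real definitions also involve characters and relationships. The specimen labels and definition contents are invented.

```python
# Hypothetical definitions of one name, each reduced to a set of specimens.
definitions = {
    "Linnaeus 1758": {"s1", "s2", "s3", "s4"},
    "Archer 1965": {"s1", "s2"},
}
TYPE_SPECIMEN = "s1"  # assumed to belong to every definition of the name

# "The sum of all definitions that exist for the name"
union = set().union(*definitions.values())

# "One of the existing definitions"
one_definition = definitions["Archer 1965"]

# What is common to all the definitions
common = set.intersection(*definitions.values())

print(sorted(union), sorted(one_definition), sorted(common))
```

The three readings give different sets for the same name, and only the type specimen is guaranteed to lie in all of them.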

  34. Names as Taxon Concepts • Nominal concepts • Sub-set of TaxonConcepts • Name but no AccordingTo • non-unique (concept) identifier attributes • can be given a unique concept identifier • No definition • Explicitly saying it’s something with this name • but not really sure what is/was meant by the name • Encourage people to understand and address the issue of names • Allowing mark-up of data with names allows them to believe names are really good enough • Will improve long term usefulness of scientific data • Ease integration

  35. SEEK Taxon’s Message….. • Scientific names are not unique identifiers for biological entities • Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data • Biologists must reference organisms precisely • if datasets are to be of use long term or to other users • Reference by taxon concept rather than name • integrate data for analysis on taxon concepts
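A small sketch of the kind of error slide 35 warns about, using the invented genus Aus from slide 27. The survey names and counts are made up. The only fact taken from the talk is that Archer 1965 split Aus aus L. 1758 into two species, keeping the name for one and creating Aus bea for the other.

```python
from collections import defaultdict

# Hypothetical observation records: (dataset, name, according_to, count)
observations = [
    ("survey_1960", "Aus aus L. 1758", "Linnaeus 1758", 40),
    ("survey_1970", "Aus aus L. 1758", "Archer 1965", 25),
    ("survey_1970", "Aus bea Archer 1965", "Archer 1965", 15),
]


def total_by(key_fn):
    """Sum counts over all records grouped by key_fn(record)."""
    totals = defaultdict(int)
    for rec in observations:
        totals[key_fn(rec)] += rec[3]
    return dict(totals)


by_name = total_by(lambda r: r[1])             # match on name alone
by_concept = total_by(lambda r: (r[1], r[2]))  # match on name + "according to"

# Archer's split means the 1758 concept corresponds to both 1965 concepts.
INCLUDES = {
    ("Aus aus L. 1758", "Linnaeus 1758"): [
        ("Aus aus L. 1758", "Archer 1965"),
        ("Aus bea Archer 1965", "Archer 1965"),
    ]
}
old = ("Aus aus L. 1758", "Linnaeus 1758")
comparable_1970 = sum(by_concept[c] for c in INCLUDES[old])

print(by_name)
print(by_concept)
print(comparable_1970)
```

Matching on name pools the 1960 and 1970 records for "Aus aus L. 1758" as if they measured the same thing, and a name-based comparison of 40 against 25 suggests a decline. Following the concept relationship compares 40 with 25 + 15 = 40, so in this made-up case the apparent decline is an artefact of the split.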

  36. Taxonomic Databases • Main taxonomic list servers are still name based • single perspective on taxonomy • don’t represent multiple classifications • unclear what the definition is (don’t even try!) • provide non-standardised interface (web page, xml download) • SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK • Find concepts given a name • Compare concepts • Relate concepts • Mark up ecological data sets with concepts • First • Need data on names and concepts • Need an exchange standard….

  37. Taxon Concept Schema • TCS standard for exchange of taxonomic names/concept data • Taxonomic Databases Working Group (TDWG) • Global Biodiversity Information Facility (GBIF) • XML based exchange schema • Makes heavy use of Globally Unique Identifiers (GUIDs) • Not designed as the “correct way” to model a Taxon Concept • No “rules” as to what a taxon must have • Designed to accommodate different models • Includes Taxon Names • more constrained - the codes of nomenclature • TCS/EML • TCS modifications to EML taxon coverage

  38. Taxon Names and Taxon Concepts • Important to be able to pass names alone • For nomenclatural and some taxonomic purposes • But not for identifications/observations • Taxon Concepts refer to Names • By GUID • Names must not change • Can’t record original taxon concept

  39. Taxon Concept/Name Resolution Server • Taxon Object Server • Schema based on the TCS model • Implements the GUIDs using LSID technology • Tool to import/export data from TCS documents • TOS Allows • registration, retrieval of taxonomic datasets • Match concepts given names, concepts, etc. • Allow users to • See different taxonomic opinions • Uses GUIDs to reference concepts (LSIDs) • Find concepts… • Author new concepts • Make new relationships between existing concepts • Integrated with Kepler workflow system
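Since the Taxon Object Server references concepts by LSID, a short sketch of the identifier's shape may help. An LSID is a URN of the form `urn:lsid:authority:namespace:object`, optionally followed by `:revision`. The parser and the example identifiers below are ours and purely illustrative; they are not identifiers issued by TOS, and the sketch says nothing about how TOS resolves them.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts; raise ValueError if the shape is wrong."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Made-up identifiers on a placeholder authority.
print(parse_lsid("urn:lsid:example.org:concepts:1234"))
print(parse_lsid("URN:LSID:example.org:concepts:1234:2"))
```

Because the identifier is opaque to the data set that cites it, two records can be compared by concept without either side having to parse or interpret a scientific name.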

  40. SEEK User Interface Tools • Concept mapper • A desktop tool to assist taxonomists to relate concepts from one source to another • For use in creating data sets for TOS or TCS • For creating new relationships between concepts in TOS • Taxonomy comparison visualisation • Visualisation tool to explore different classifications • Compare concepts

  41. Concept Mapper Main GUI • [Screenshot labels: query concepts; concepts; relationships]

  42. Concept Comparison Visualisation

  43. SEEK Summary • Environment to support large scale ecological data analysis • Scientific Workflows: Kepler • Semantic Mediation • Ecological ontology creation/use for data integration • Grid/Web-based data discovery • Resolution of Taxonomic Names/Concepts • Standards development • Concept matching server • Visualisation tools • http://seek.ecoinformatics.org

  44. Is it safe to match on names? • I hope I have convinced you that the answer is NO • as a general rule… BUT • Depends on the purpose of the data • therefore the accuracy required • The degree of automation used in matching • greater automation – greater potential problem • Expertise of person involved in the matching

  45. Many Outstanding Issues…. • Educating biologists about the inherent problem in names • Not limited to the Linnaean system of nomenclature • Lack of good taxon concept data • Widening usage and application of taxon concepts • Adopting GUIDs • Provision of reliable ‘look up’ facilities • Cross referencing of GUIDs • Reuse is vital • Must not create duplicate GUIDs if possible • Conversion of legacy data • Develop good matching algorithms • Potential move from XML schema -> semantic web technologies • ……..

  46. Acknowledgements • This material is based upon work supported by: • The National Science Foundation • SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis • Matt Jones – for many of the slides…. • Global Biodiversity Information Facility • eScience Institute • Research Theme Programme • Malcolm Atkinson

  47. Exploiting Diverse Sources of Scientific Data • Upcoming workshop discussing possible technology solutions • RDF, Ontologies and Meta-Data Workshop, 7th – 9th June, 2006 • e-Science Institute, 15 South College Street, Edinburgh • http://www.nesc.ac.uk/esi/events/683/
