Integrated support for data integration and science portals

Amarnath Gupta, University of California San Diego

Overview • We will first • Discuss what “cyberinfrastructure” for science means • Situate the business of “data integration” within the cyberinfrastructure setting • Then we will briefly describe a few cyberinfrastructure projects in different science disciplines • Biomedical sciences, geo-sciences, environmental sciences, marine biology, physical oceanography … • We will examine some dimensions of the data integration problem • Discuss how they are approached in different projects from a CS /Data Management perspective • Discuss common and complementary themes across these approaches

Cyberinfrastructure • Cyberinfrastructure is the organized aggregate of technologies enabling access and coordination of information technology resources to facilitate science, engineering, and societal goals. • Data access from distributed systems • Data inter-operability and assimilation • Computation: grid based and workflows • Visualization • Tools • Information Integration: highlighted today National Science Foundation’s Cyberinfrastructure NSF Blue Ribbon Panel (Atkins) Report provided a compelling and comprehensive vision of an integrated Cyberinfrastructure Modified from Berman, SDSC, 2005 CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

Source: Mark Ellisman

We are here: • Making more general-purpose data integration infrastructure over distributed resources • Extending to accommodate various scientific applications with stored and streaming data Source: Mark Ellisman

GEONworkbench Modeling Environment Indexing Services Data Integration Services Workflow Services Registration Services Visualization & Mapping Services GEONgrid Software Layers Portal (login, myGEON) Registration GEONsearch Core Grid Services GT3, OGSA-DAI, GSI, CAS, gridFTP, SRB, PostGIS, mySQL, DB2 Physical Grid RedHat Linux, ROCKS, Internet, I2, OptIPuter (planned) GEON Space

Complete Workflows Application Portal Command/Batch Access Overall Operations Integrated SW Distribution Domain Application Tools Data Integration Mechanisms Distributed Data Collections Mgmnt Distributed Data File Management Computation/Analysis Facilities Identity/Login Management Authorization and Role Definition BIRN: Major System Components Collaborating Groups of Biomedical Researchers Registered BIRN Data

BIRN Portal Command/Batch Access BIRN-CC Semi-Annual BIRN SWDistribution Storage Resource Broker (SRB) AFS (file system) Condor, Globus: Local clusters + Teragrid GSI-Based. GAMA + MyProxy SRB for Access Control to Data BIRN: Specific Implementations Mouse, Function, Morphometry (+ New Areas and Users ) Pegasus, Kepler, Loni Pipeline, etc. e.g., AFNI, Air, 3DSlicer, LONI, .. BIRN Data Integration Suite Registered BIRN Data

The OntoGrid View Taverna e-Science workbench Third-party tools Applications LSID Launchpad Haystack Web portals Utopia e-Science process patterns LSID support myGrid information model e-Science mediator e-Science coordination Metadata Management Data Management e-Science events KAVE metadata store Service & workflow discovery mIR myGrid information repository Feta semantic discovery KAVE provenance capture Core Services Pedro semantic publication Workflow enactment Pedro semantic publication Freefluo workflow engine GRIMOIRES federated UDDI+ registry Notification service myGrid ontology Web Service (Grid Service) communication fabric External Services Java applications Soaplab AMBIT text extraction service OGSA-DAI DQP service Executable codes with an IDL Gowlab Legacy applications Web Services OGSA-DAI databases Web Sites Courtesy: Carole Goble

A Word about Data in Science: Excerpts from a Report by NSF’s Office of Cyberinfrastructure • Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data. • Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections. • Ontology. An ontology is the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse.

What is data integration? • For applications where there are a number of data sources (recall previous slide) • Geographically distributed • Having data on different platforms • (may be) on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets) • Perhaps even having different data models • Having different schema • BUT about one common, general theme • One may want to construct • A general-purpose information system such that • All these data sources can be co-accessed as if they belong to a single data source • It can produce “combined information objects” on-demand for ad hoc queries to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …) • Data integration refers to a body of techniques to produce such an information system

myActiveNeuroCollection patientRecordsCollection image.cgi image.wsdl image.sql E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where... Data Integration vis-à-vis Data Grid • A different aspect of data management Inter-organizational Information Storage Management Semantic data Organization (with behavior) Virtual Data Transparency Data Replica Transparency image_0.jpg…image_100.jpg Data Identifier Transparency Storage Location Transparency Storage Resource Transparency Courtesy: Reagan Moore and Arun Jagatheesan

Data Integration in Science Starts with Science Questions • GeoScience (GEON) • What is the geologic and geophysical record of Super-Continent assembly and dispersal? • What are the architectures of terrain boundaries at depth? • How do composition, temperature and strain fabrics vary within the lithosphere and asthenosphere? Are lithospheric and asthenospheric strain coupled? • Neuroscience (BIRN) • Find volumetric data/metadata from MRIs of humans with specific diagnosis(es) • Which structures are decreased/increased in size relative to normal controls • Which structures show structural differences across a variety of diagnoses • Given a structure which shows structural differences • Which other structures are associated with it • Do any of these associated structures show structural differences • Do these other changed structures have commonalities (i.e. cell types, neurotransmitters, other afferent/efferent connections) • Environmental Science (PAKT, CAMERA) • Explain biodiversity by correlating distribution of a taxonomic group with spatial (temporal) distribution of temperature, dissolved oxygen, salinity. • What accounts for large-scale genetic variation in microbial genomes that share a very recent common ancestry among coral reef habitats? DATA NEEDED TO ADDRESS THESE QUESTIONS ARE DISTRIBUTED ACROSS THE WORLD

A Science Question can be Complex Q1. What is the geologic and geophysical record of Super-Continent assembly and dispersal? Needs complex integration of geophysical data with those associated with sub-crustal lithosphere ages, its composition and physical properties (seismic, thermal etc), surface geology and associated events chronology Adapted from D.Seber, SDSC A.K.Sinha, Virginia Tech, 2005

Converting Questions to Queries CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

(Some) Dimensions of Information Integration in Cyberinfrastructure Projects • Source Information Model • Integration Engine’s Information Model • Specification of semantic correspondences across sources • The 3-party power play among “global schema”, “local schema”, “ontology” • Query paradigms over integrated data • The mechanics of • query planning • query execution

About Semantic Correspondences • The general problem • For any data integration across multiple sources there needs to be a way to • Specify how two objects from different data sources may correspond • Specify how the “joining” of these two objects would create a composite data object • What’s the big deal? • Identical objects versus equivalent objects • Complete objects versus partial objects • Multi-scale representations of the same object • Handling definitional differences • Taking into account natural variability • Contextual correspondence Are these always specifiable through ontological standards like OWL? Do we need to have “correspondence checking” services? Listen to Oscar and Carol’s session tomorrow for a different angle

About the 3-party Power Play • While we want to create a single (cyber-) infrastructure with a data integration component, different applications have different integration scenarios • Is there a single global schema? • Do new applications (and hence global schema) get added all the time over existing sources and ontologies? • Are the sources fixed? Do new sources get added all the time? Do sources come and go? • Are sources added dynamically as “data sets” that users want to integrate “on the fly”? • Do local schemata come with their own ontologies? Is there a global ontology that all local ontologies must map to? • How does the global schema (if one exists) relate to the global and local ontologies? • Do new (or modified) ontologies get added all the time? • Do the local schemata evolve all the time? Is there a general way to manage this? Do we need to architect any cyberinfrastructure components differently?

Source Information Models • BIRN • Data Sources • Relational DBMS • Standard data types • Semantic data types (attribute-domain references to ontologies) • Some data and computation sources expose a set of functions • Key constraints • Ontology Sources • Simplifying assumptions • Ontologies can be approximated by edge-labeled directed graphs stored in relational systems • Graph traversal functions can be mimicked as database functions • BONFIRE • Glue ontology for simple inter-ontology mappings and extensions • Image and Spatial Data Sources • Discussed later
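The two simplifying assumptions above (ontologies as edge-labeled directed graphs in a relational system, and graph traversal exposed as database functions) can be illustrated with a minimal sketch. It is not BIRN code: the table layout, the `descendants` function, and the anatomy terms are hypothetical, and SQLite stands in for whatever relational system holds the ontology.

```python
import sqlite3

# Hypothetical ontology fragment: (source term, edge label, target term).
EDGES = [
    ("hippocampus", "part_of", "limbic_system"),
    ("CA1", "part_of", "hippocampus"),
    ("dentate_gyrus", "part_of", "hippocampus"),
    ("limbic_system", "part_of", "brain"),
    ("neuron", "is_a", "cell"),
]

def build_ontology_db(edges):
    """Store the ontology as one edge table in an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", edges)
    return conn

def descendants(conn, term, label):
    """Graph traversal as a database function: every term that reaches
    `term` through one or more edges carrying `label`."""
    sql = """
        WITH RECURSIVE reach(node) AS (
            SELECT src FROM edge WHERE dst = ? AND label = ?
            UNION
            SELECT e.src FROM edge e JOIN reach r ON e.dst = r.node
            WHERE e.label = ?
        )
        SELECT node FROM reach ORDER BY node
    """
    return [row[0] for row in conn.execute(sql, (term, label, label))]

conn = build_ontology_db(EDGES)
parts_of_limbic_system = descendants(conn, "limbic_system", "part_of")
```

The recursive query uses `UNION` rather than `UNION ALL`, so duplicate nodes are dropped and the traversal terminates even if the graph contains a cycle.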

Source Information Models • GEON • Data Sources • Assumption: all data are in GEONSpace • Items and Item details • Any relational jdbc data source (e.g., Excel files) is admitted • Standard relational data types, shapefiles for spatial data • Semantic Data types by connecting to ontology • Ontology Sources • Any OWL-specified ontology • Registration in GEON • Level 1: Federation Based Integration • Users should know the component database schemata • Level 2: View Based Integration • Same as in BIRN • Level 3: Ontology Based Integration • Preferred Method

Source Information Models • PAKT (marine biogeography) • Data Sources • Relational • Spatial (vectors) supported by GIS and Spatial DBMS • Spatial (raster – continuously partitionable arrays) • ArcGIS (map algebra), • Nested, non-aligned, multiple resolution • Spatially-indexed time series • Function-exposing sources (WSDL) • Parameter and result data types are interpretable or BLOBS • Ontology Sources • Any ontology specified in a subset of OWL • Any DAG-structured data source

Source Information Models • CAMERA • PAKT ++ • Data sources that export annotated sequences as a base data type • Phylogenetic trees • XML repositories with XPath/XQuery Processor • RDBMS with XML processing capabilities • Graphs such as molecular interaction networks (e.g., biological pathways), chemical reaction networks …

Integration Engine’s Information Model • BIRN • Sources from the mediator’s view • Base relations may have binding patterns • Distinction between data and metadata is not strictly observed • SRB metadata catalog is treated as a relational source with some special functions • Files are accessed by reference to data-grid URIs (SRB ids) • Integration Model • Essentially Global-as-view (GAV) mediation • “semantic” aspect of the mediation executed through opaque functions over ontology sources • Key constraints not used during standard query processing but are used for keyword queries
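As a rough illustration of Global-as-view (GAV) mediation, the sketch below defines one global relation as a union of queries over two sources and answers a query by evaluating that definition. The source layouts, attribute names, and diagnoses are invented for the example and are not taken from BIRN.

```python
# Two hypothetical sources with different local schemas.
SITE_A = [
    {"subj": "a1", "dx": "schizophrenia"},
    {"subj": "a2", "dx": "control"},
]
SITE_B = [
    {"patient_id": "b7", "diagnosis": "schizophrenia"},
]

def subject_view():
    """GAV definition: the global relation subject(id, diagnosis, site)
    is written as a union of queries over the source relations."""
    for row in SITE_A:
        yield {"id": row["subj"], "diagnosis": row["dx"], "site": "A"}
    for row in SITE_B:
        yield {"id": row["patient_id"], "diagnosis": row["diagnosis"], "site": "B"}

def query_subjects(diagnosis):
    """A query against the global schema is answered by unfolding the view,
    that is, by running the source queries that define it."""
    return [s for s in subject_view() if s["diagnosis"] == diagnosis]

matches = query_subjects("schizophrenia")
```

A real mediator would push the selection on diagnosis down to each source instead of filtering after the union; the sketch keeps the two steps separate only to make the view definition visible.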

Integration Engine’s Information Model • BIRN (contd.) • The 3-party power-play • Many integrated views used by several global schemata on a relatively fixed set of sources • Ontologies are used in two ways • A global view may be defined using ontology functions • Keyword queries use simple ontological relationships • Some terms in the global schema are mapped to ontologies through semantic typing • Otherwise the global schema and integrated views are independent of the ontology • Some data are warped to a common atlas coordinate system to enable atlas queries • Atlas mapping ≡ spatial annotation

Integration Engine’s Information Model • Gateway • has XML API for source registration, source schema update • Has XML API for queries • Can be accessed as web service • Registry • API-based access to schema elements and view definitions • Implemented over MySQL for portability • Spatial registry for image data • Planner and Executor • Described later • Wrappers • Local and remote • OTIS • Inverted index for ontological terms • BIRN Integration architecture Atlas Client Query Client Onto Client Atlas Query Processor Ontological Query Processor Spatial Registry Mediator OTIS Data Grid Access Wrapper Access

BIRN Tool: Source Registration

Integration Engine’s Information Model • GEON • Sources from the Integration Engine’s Viewpoint • Metadata (Item-level information) maintained in a GEON standard called ADN (ADEPT/DLESE/NASA) • Item-detail level information is either any relationalizable data or shapefiles • Any WMS or WFS service is a valid source for map information management • Does not permit an external ontology source; all ontologies have to be defined in the GEON framework • Integration Model • Every source schema is registered to an ontology

Integration Engine’s Information Model • 3-party power play • Several global schemata can be defined • A global schema IS the OWL-DL compliant ontology • A couple of consequences • All transitive closure information is pre-computed after registration • If a concept class has key constraints, subsumption is NEXPTIME-hard, and undecidable if the key constraint has a complex domain • Does not matter much in practice because subsumption is rarely computed • Pragmatics • As new sources join, or new applications are attempted, the ontology needs to evolve
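Pre-computing transitive closure after registration can be pictured with the following sketch, which expands a set of direct subclass edges into a lookup of all superclasses per class. The class names and function names are hypothetical, and the slides do not say how GEON stores the closure.

```python
from collections import defaultdict

# Hypothetical direct subclass assertions: (child, parent).
SUBCLASS_OF = [
    ("Granite", "IgneousRock"),
    ("Basalt", "IgneousRock"),
    ("IgneousRock", "Rock"),
]

def transitive_closure(subclass_of):
    """Map every class to the set of all its direct and indirect superclasses."""
    parents = defaultdict(set)
    classes = set()
    for child, parent in subclass_of:
        parents[child].add(parent)
        classes.update((child, parent))

    closure = {}
    for cls in classes:
        seen, stack = set(), list(parents[cls])
        while stack:  # depth-first walk up the hierarchy
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents[current])
        closure[cls] = seen
    return closure

def is_subclass(closure, child, parent):
    """Constant-time subsumption lookup against the pre-computed closure."""
    return parent in closure.get(child, set())

CLOSURE = transitive_closure(SUBCLASS_OF)
```

Once the closure exists, a query for all instances of `Rock` can be expanded to its subclasses by lookup alone, which is consistent with the remark that subsumption is rarely computed at query time.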

Input a data set name Click on Submission to register a dataset Choose an ontology class Select a zipped shapefile Virginia Tech & GEON Geon Data Registration CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

Virginia Tech & GEON Registration of Item Detail CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

ODAL (Ontological Database Annotation Language) • Creates a partial model of ontologies from a database • Independent of any GUI • Independent of any concrete implementation • Reusable
GUI → (generates) ODAL → ODAL processor
<odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample.

ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:Ontology>
    <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
    <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
  </odal:Ontology>
  ……
</odal:ODAL>

ODAL: Database Connection Declaration
The target database for the annotations is declared as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  ……
  <odal:Database odal:id="PublicationDatabase">
    <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
    <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
    <odal:Host>oracle.sdsc.edu</odal:Host>
    <odal:Port>3456</odal:Port>
    <odal:DatabaseName>Publications</odal:DatabaseName>
  </odal:Database>
  ……
</odal:ODAL>

ODAL: Simple Named Individuals
Suppose the book ontology contains a class Book and the schema Collections contains a table book-price with a column ISBN.
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
The statement says that each value in the column ISBN represents a Book individual. odal:id gives a name to the declaration and represents the set of individuals generated by the statement.

ODAL: The Names of Individuals
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
Individual name: (BookInTableBookPrice, PublicationDatabase.Collections.book-price.ISBN:0817313478)
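The naming scheme on this slide can be reproduced with a short sketch that parses a NamedIndividuals declaration and builds the pair shown above for one column value. This is not the ODAL processor: `individual_name` is a hypothetical helper, the namespace declaration is added so the fragment parses on its own, and only the single-column case is handled because the slides do not show the name format for multi-column declarations.

```python
import xml.etree.ElementTree as ET

ODAL_NS = "http://www.sdsc.edu/odal#"
NS = {"odal": ODAL_NS}

# The declaration from the slide, with xmlns:odal added so it is a
# self-contained XML document.
DECLARATION = """
<odal:NamedIndividuals xmlns:odal="http://www.sdsc.edu/odal#"
    odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
"""

def individual_name(declaration_xml, value):
    """Return (declaration id, database.schema.table.column:value)."""
    root = ET.fromstring(declaration_xml.strip())
    decl_id = root.get(f"{{{ODAL_NS}}}id")
    database = root.get(f"{{{ODAL_NS}}}database")
    schema = root.findtext("odal:Schema", namespaces=NS)
    table = root.findtext("odal:Table", namespaces=NS)
    columns = [c.text for c in root.findall("odal:Column", NS)]
    if len(columns) != 1:
        raise ValueError("this sketch only handles single-column declarations")
    return (decl_id, f"{database}.{schema}.{table}.{columns[0]}:{value}")

name = individual_name(DECLARATION, "0817313478")
```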

ODAL: Named Individuals from Multiple Columns
Suppose an ontology contains a class Location and a database table Rock-Sample has two columns, Latitude and Longitude.
<odal:NamedIndividuals odal:id="LocationInTableRockSample">
  <odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
  <odal:Schema>California</odal:Schema>
  <odal:Table>Rock-Sample</odal:Table>
  <odal:Column>Latitude</odal:Column>
  <odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>
The statement says that a pair of latitude and longitude values gives a location.

ODAL: Named Individuals with Conditions
<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>
A condition in an odal:Condition element should be a Boolean expression that is valid in the WHERE clause of an SQL query.
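Because the condition must be a valid WHERE-clause expression, a processor can in principle splice it directly into the query that enumerates the individuals. The sketch below shows that idea against an in-memory SQLite table; `individuals_query` and the sample rows are hypothetical, and the slides do not describe how the ODAL processor actually builds its SQL.

```python
import sqlite3

def individuals_query(table, column, condition=None):
    """Build the SQL that enumerates the individuals of one declaration.
    The condition is inserted verbatim, so it must come from a trusted
    annotation author, never from end-user input."""
    sql = f'SELECT DISTINCT "{column}" FROM "{table}"'
    if condition:
        sql += f" WHERE {condition.strip()}"
    return sql

# Hypothetical contents of the employee table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (EmployeeId INTEGER, Gender TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)", [(1, "M"), (2, "F"), (3, "M")]
)

male_sql = individuals_query("employee", "EmployeeId", " Gender='M' ")
female_sql = individuals_query("employee", "EmployeeId", " Gender='F' ")
male_ids = sorted(row[0] for row in conn.execute(male_sql))
female_ids = sorted(row[0] for row in conn.execute(female_sql))
```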

ODAL: Data Type Property Declaration
Table Person (columns: …, SSN, …, age, …; example row: …, 123-56-7890, …, 8, …). The datatype property hasAge has range posInt.
<odal:NamedIndividuals odal:id="PersonInTablePerson">
  <odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
  <odal:Table>Person</odal:Table>
  <odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>
<odal:OntologyProperty>
  <odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
  <odal:Table>person</odal:Table>
  <odal:Domain odal:resource="PersonInTablePerson" />
  <odal:Range odal:resource="age" />
</odal:OntologyProperty>

Conditions for Joining Individuals from Different Resources • Usually we do not join individuals across different resources • A set of datatype properties can be declared as a key for a class in the ontology. We do join across multiple resources based on keys. E.g., {hasLatitude, hasLongitude} can be declared as a key of Location: two locations from different resources are the same if they have the same latitude and longitude • Example (Rock): we don’t know whether 10001 represents the same rock in the two resources. By default, we assume it does not.
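The key-based rule can be sketched as a hash join on the declared key properties. The function name, the sample individuals, and the extra properties (`elevation`, `annualRainfall`) are hypothetical; exact equality on coordinates is assumed, as the slide states, which real data would likely need to relax with a tolerance.

```python
def join_on_key(resource_a, resource_b, key):
    """Merge individuals from two resources that agree on every key property.
    Individuals without a match stay separate, mirroring the default
    assumption that they are not the same object."""
    index = {}
    for individual in resource_a:
        index.setdefault(tuple(individual[p] for p in key), []).append(individual)

    merged = []
    for individual in resource_b:
        for match in index.get(tuple(individual[p] for p in key), []):
            merged.append({**match, **individual})
    return merged

LOCATION_KEY = ("hasLatitude", "hasLongitude")

# Hypothetical Location individuals from two resources.
RESOURCE_A = [
    {"hasLatitude": 37.25, "hasLongitude": -80.5, "elevation": 634},
    {"hasLatitude": 32.75, "hasLongitude": -117.25, "elevation": 20},
]
RESOURCE_B = [
    {"hasLatitude": 37.25, "hasLongitude": -80.5, "annualRainfall": 1040},
    {"hasLatitude": 40.0, "hasLongitude": -105.25, "annualRainfall": 520},
]

same_locations = join_on_key(RESOURCE_A, RESOURCE_B, LOCATION_KEY)
```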

The Architecture of GEON Semantic Mediator Oracle DB2 MySQL SQL Server PostgreSQL PostGIS Query Execution Query Optimization Query Planning Internal Database SQL Parser Spatial SQL against federated schemas Mediator JDBC Driver SOQL Parser Semantic Query Rewriter SOQL Ontology Reasoner ODAL Processor GUI Portal or Application OWL ODAL SOQL Processor

The Map Integration Architecture

Snapshot after querying “Paleozoic” Map Integration

Integration Engine’s Information Model • PAKT (briefly) • Type extensibility of the mediator • Nested relational query language extended by tree and a restricted set of graph pattern operations • Construction operations important • Passive extensibility • Source more powerful than the mediator • Source exports a set of type-based optimization rules to the mediator • Active extensibility • Mediator extends its set of interpreted types • Ontology management • Ontological queries processed by a separate co-processor that interoperates with mediator • Query planner partitions the query into ontological and mediated query processors

Query Paradigms • What are the different kinds of queries scientists and applications pose to an integrated system? • Metadata-based file access • 21,038 raw image files per subject • 2.4 GB of raw image data per subject • 25 GB to 40 GB of processed image data per subject • 10 million slices of functional imaging data in Phase II • 7 Terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject) • Ontologically supported mediated queries • “Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years” • Keyword queries • FMRI “working memory task” hippocampus • Ontologically supported keyword queries • Associative searches

GEON: SOQL (Simple Ontology Query Language) • Query single or integrated resources • via ontologies (i.e., high-level logical views) • independent of any physical representation (i.e., schemas)
GUI → (generates) SOQL → SOQL processor
Ontology fragment: RockSample has location (Location, with lat and long) and hasSiO2 (ValueWithUnit, with value: float and unit: string)
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60 AND X.location.long > 100 AND X.hasSiO2.value < 30 AND X.hasSiO2.unit = 'weightPercentage'

GEON SOQL
Question: Find all seismic stations within 1 mile of railroads
GUI → SOQL: SELECT X.code, X.location.* FROM SeismicStation X, Railroad Y WHERE distance(X.location, Y.geometry) < 1
SOQL Processor → Schema Mediator: SELECT X2.stationcode, X2.lat, X2.lon FROM railroads_of_the_united_states X1, stationdatatable X2 WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
Schema Mediator → Railroad shapefile: SELECT X1.the_geom FROM railroads X1
Schema Mediator → Seismic Stations: SELECT X2.stationcode, X2.lat, X2.lon FROM stationdatatable X2 WHERE bounding box condition
Remaining predicate: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
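One plausible reading of the "bounding box condition" is a cheap prefilter sent to the station source, followed by the exact distance test on the surviving candidates. The sketch below illustrates that pattern under strong simplifications: coordinates are planar and in the same unit as the distance threshold, the railroad is a single polyline, and every name and value is hypothetical rather than taken from GEON.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def bounding_box(line, margin):
    """Box around the polyline, grown by the distance threshold."""
    xs = [x for x, _ in line]
    ys = [y for _, y in line]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)

def stations_near(stations, railroad, max_dist):
    """Prefilter with the bounding box, then apply the exact predicate."""
    xmin, ymin, xmax, ymax = bounding_box(railroad, max_dist)
    candidates = [
        s for s in stations if xmin <= s[1] <= xmax and ymin <= s[2] <= ymax
    ]
    segments = list(zip(railroad, railroad[1:]))
    return [
        code
        for code, x, y in candidates
        if min(point_segment_distance((x, y), a, b) for a, b in segments) < max_dist
    ]

RAILROAD = [(0.0, 0.0), (10.0, 0.0)]          # polyline with at least two vertices
STATIONS = [("S1", 5.0, 0.5), ("S2", 5.0, 3.0), ("S3", 10.9, 0.9)]
near = stations_near(STATIONS, RAILROAD, 1.0)
```

Here S2 is rejected by the bounding box alone, while S3 passes the box but fails the exact distance test, which is why both steps are needed.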

BIRN: A Functional View of the Mediation Process Query Expression (UCQ+ + Nesting + Grouping & Aggregate) Pre-Executable Plan Executable Plan Flattening of Nested Queries Post-processing + aggregate Execution Control View Unfolding Normalization to DNF Result Building Predicate Reordering (binding patterns + maximal chunk) Result Reporting Maximal Feasible Plan Algebraic Plan Cost/Selectivity-based Optimization Pre-Executable Plan
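The predicate-reordering step depends on binding patterns: a source relation can only be called once the arguments it requires as input are already bound. A minimal greedy sketch of that feasibility check follows. It ignores the "maximal chunk" grouping and any cost model, and the relation names and patterns are hypothetical.

```python
def order_by_binding_patterns(atoms, bound=()):
    """Order query atoms so each is called only after its 'b' (bound)
    arguments are available. Each atom is (name, args, pattern), where
    pattern is a string of 'b' (must be bound) and 'f' (free) per argument."""
    bound = set(bound)
    remaining = list(atoms)
    plan = []
    while remaining:
        for atom in remaining:
            name, args, pattern = atom
            needed = {a for a, p in zip(args, pattern) if p == "b"}
            if needed <= bound:
                plan.append(name)
                bound.update(args)      # every argument is bound after the call
                remaining.remove(atom)
                break
        else:
            raise ValueError("no feasible ordering for the remaining atoms")
    return plan

# image_store needs SubjectId as input; subjects can be scanned freely.
ATOMS = [
    ("image_store", ("SubjectId", "ImageUri"), "bf"),
    ("subjects", ("SubjectId", "Diagnosis"), "ff"),
]
plan = order_by_binding_patterns(ATOMS)
```

The written order puts `image_store` first, but the planner has to call `subjects` first to obtain values for `SubjectId`.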

View Definition and Query Language • Union of conjunctive queries • May contain function terms • Expressed in XML Datalog with aggregate functions • Query q(X,F(Y)) :- r1(X,Z), r2(Z,Y), where F(Y) is an aggregate function applied to the set of Y values and X holds the group-by variables • Planner and Executor translate this to: • q’(X,Y) :- r1(X,Z), r2(Z,Y) • q(X,W) :- F(gb(q’(X,Y))) • Here the group-by function “gb” with aggregate function F is pushed to the data source whenever possible, and otherwise evaluated at the Mediator • The query language allows nested queries: inner queries are assigned to intermediate variables that are used by the main query
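The two-step translation can be mimicked in a few lines: first evaluate the conjunctive query q', then group its result by X and apply the aggregate F. The sketch evaluates everything at the mediator side and keeps duplicate rows (bag semantics), which is an assumption; relation contents and names are hypothetical.

```python
from collections import defaultdict

def q_prime(r1, r2):
    """q'(X,Y) :- r1(X,Z), r2(Z,Y): join the two relations on Z."""
    by_z = defaultdict(list)
    for z, y in r2:
        by_z[z].append(y)
    return [(x, y) for x, z in r1 for y in by_z.get(z, [])]

def q(r1, r2, aggregate):
    """q(X,W) :- F(gb(q'(X,Y))): group q' by X, then apply the aggregate F."""
    groups = defaultdict(list)
    for x, y in q_prime(r1, r2):
        groups[x].append(y)
    return {x: aggregate(ys) for x, ys in groups.items()}

# Hypothetical relations: r1(subject, scan) and r2(scan, volume).
R1 = [("s1", "scanA"), ("s1", "scanB"), ("s2", "scanC")]
R2 = [("scanA", 10), ("scanB", 30), ("scanC", 7)]

max_per_subject = q(R1, R2, max)
sum_per_subject = q(R1, R2, sum)
```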

BIRN: Mapping Relations • Ontology Mapping: maps data values from a source to ontology terms of a known ontology (UMLS) • Joinable relation: pairs attributes from different relations • Value-Map: maps a mediator-supported data value to the source-supported value (for example, gender coded as 0/1 at some source is male/female for the mediator)
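A Value-Map can be pictured as a per-source, per-attribute lookup table applied in both directions: mediator values are translated before a query is sent to the source, and source values are translated back in the results. In the sketch below the source name is hypothetical, and the assignment of 0 to male and 1 to female is an assumption based only on the order in which the slide lists them.

```python
# (source, attribute) -> {mediator value: source value}
VALUE_MAP = {
    ("site_a", "gender"): {"male": 0, "female": 1},
}

def to_source_value(source, attribute, mediator_value):
    """Rewrite a mediator-level constant into the source's own encoding."""
    return VALUE_MAP[(source, attribute)][mediator_value]

def to_mediator_value(source, attribute, source_value):
    """Translate a value returned by the source back to the mediator's vocabulary."""
    inverse = {v: k for k, v in VALUE_MAP[(source, attribute)].items()}
    return inverse[source_value]

# A mediator predicate gender = 'female' becomes gender = 1 at site_a.
pushed_constant = to_source_value("site_a", "gender", "female")
reported_value = to_mediator_value("site_a", "gender", 0)
```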
