
Scientific Data Management: From Data Integration to Analytical Pipelines



Presentation Transcript


  1. Scientific Data Management: From Data Integration to Analytical Pipelines Bertram Ludäscher ludaesch@sdsc.edu Data & Knowledge Systems San Diego Supercomputer Center University of California, San Diego

  2. Outline • Motivation: Scientific Data Integration Problems • “Semantic” (Model-based) Mediation • Scientific Workflows and Analytical Pipelines

  3. National Science Foundation (NSF) www.nsf.gov GEOsciences Network (NSF) www.geongrid.org Biomedical Informatics Research Network (NIH) www.nbirn.net Science Environment for Ecological Knowledge (NSF) seek.ecoinformatics.org Scientific Data Management Center (DOE) sdm.lbl.gov/sdmcenter/ Acknowledgements

  4. An Online Shopper’s Information Integration Problem • El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?” • Sources: amazon.com, barnes&noble.com, A1books.com, half.com, addall.com • “One-World” Scenario: XML-based mediator (virtual DB, vs. data warehouse)

  5. A Home Buyer’s Information Integration Problem • What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? • Sources: Realtor, Crime Stats, Demographics, School Rankings • “Multiple-Worlds” Mediation

  6. Some BIRNing Data Integration Questions (Biomedical Informatics Research Network, http://nbirn.net) • Data Integration Approaches: • Let’s just share data, e.g., link everything from a web page! • ... or better, put everything into a relational or XML database • ... and do remote access using the Grid • ... or just use Web services! • Nice try. But: • “Find the files where the amygdala was segmented.” • “Which other structures were segmented in the same files?” • “Did the volume of any of those structures differ much from normal?” • “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”

  7. A Neuroscientist’s Information Integration Problem (Biomedical Informatics Research Network, http://nbirn.net) • What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? • Sources: sequence info (CaPROT), protein localization (NCMIR), morphometry (SYNAPSE), neurotransmission (SENSELAB) • “Complex Multiple-Worlds” Mediation

  8. Information Integration Challenges: Heterogeneities = S4 ... • System aspects • platforms, devices, distribution, APIs, protocols, … • Syntaxes • heterogeneous data formats (one for each tool ...) • Structures • heterogeneous schemas (one for each DB ...) • heterogeneous data models (RDBs, ORDBs, OODBs, XML DBs, flat files, …) • Semantics • unclear & “hidden” semantics: e.g., incoherent terminology, multiple taxonomies, implicit assumptions, ...

  9. Information Integration Challenges • reconciling S4 heterogeneities • “gluing” together multiple data sources • bridging information and knowledge gaps computationally • System aspects: “Grid” middleware • distributed data & computing • Web Services, WSDL/SOAP, OGSA, … • sources = functions, files, data sets, … • Syntax & Structure: (XML-based) data mediators • wrapping, restructuring • (XML) queries and views • sources = (XML) databases • Semantics: model-based/semantic mediators • conceptual models and declarative views • knowledge representation: ontologies, description logics (RDF(S), OWL, ...) • sources = knowledge bases (DB + CMs + ICs)

  10. Information Integration from a DB Perspective • Information Integration Problem • Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si • Find: the answers to Q1, ..., Qn • The Database Perspective: source = “database” • Si has a schema (relational, XML, OO, ...) • Si can be queried • define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...) • questions become queries Qi against V(S1, ..., Sk)
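The database perspective above can be sketched in a few lines of Python: a hypothetical mediator answers each user query by pushing it down to per-source wrapper functions and combining the results. All source names, titles, and prices below are made up for illustration.

```python
# Minimal sketch of a virtual integrated view. Each query_source_*
# function stands in for a wrapped source (e.g., an XML or SQL
# wrapper behind a web service); names and data are hypothetical.

def query_source_a(max_price):
    books = [("Tractatus", 18.50), ("Logik", 40.00)]
    return [(t, p) for t, p in books if p <= max_price]

def query_source_b(max_price):
    books = [("Tractatus", 12.95), ("Grammatik", 25.00)]
    return [(t, p) for t, p in books if p <= max_price]

def integrated_view(max_price):
    # V(S1, ..., Sk): the view is virtual -- a query plan over the
    # sources, evaluated per user query, never materialized.
    return sorted(query_source_a(max_price) + query_source_b(max_price),
                  key=lambda tp: tp[1])

# "El Cheapo": cheapest copy across all sources.
cheapest = integrated_view(max_price=50.0)[0]
```

The same pattern scales to a materialized (warehouse) view by caching the union instead of recomputing it per query.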

  11. Standard (XML-Based) Mediator Architecture • USER/Client poses a query Q(G(S1, ..., Sk)) against the integrated global (XML) view G • MEDIATOR holds the integrated view definition G(..) ← S1(..), ..., Sk(..) and exchanges (XML) queries & results with the client • Wrappers (implemented as web services) export (XML) views of the underlying sources S1, S2, ..., Sk

  12. Scientific Data Integration ... Questions to Queries (GeoSciences Network) • What is the distribution and U/Pb zircon ages of A-type plutons in VA? How about their 3-D geometry? How does it relate to host rock structures? • “Complex Multiple-Worlds” Mediation • Sources: Geologic Map (Virginia), GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB), GeoChemical • Layers: raw data → data modeling → database mediation → knowledge representation (ontologies, concept spaces) → domain knowledge

  13. Towards Shared Conceptualizations: Data Contextualization via Concept Spaces

  14. Rock Classification Ontology Genesis Fabric Composition Texture

  15. Some enabling operations on “ontology data” • Concept expansion: what else to look for when asking for ‘Mafic’ (Composition facet)

  16. Some enabling operations on “ontology data” • Generalization: finding data that is “like” X and Y (Composition facet)
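Both ontology operations reduce to simple traversals of the isa hierarchy: expansion collects everything classified under a concept, generalization finds the most specific common ancestor. A minimal Python sketch over a made-up rock-composition fragment (the concept names and edges are illustrative, not GEON's actual ontology):

```python
# Toy rock-composition fragment; "isa" maps child -> parent.
isa = {
    "Basalt": "Mafic", "Gabbro": "Mafic",
    "Mafic": "Igneous", "Felsic": "Igneous",
    "Granite": "Felsic",
}

def ancestors(c):
    # Walk child -> parent links up to the root.
    out = []
    while c in isa:
        c = isa[c]
        out.append(c)
    return out

def expand(concept):
    # Concept expansion: all concepts classified under `concept`.
    return {c for c in isa if concept in ancestors(c)}

def generalize(x, y):
    # Generalization: the most specific concept covering both X and Y.
    xs = [x] + ancestors(x)
    for c in [y] + ancestors(y):
        if c in xs:
            return c
```

For example, expanding ‘Mafic’ yields its subconcepts, and generalizing ‘Basalt’ and ‘Gabbro’ climbs back up to ‘Mafic’.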

  17. Knowledge representation / domain knowledge: AGE ONTOLOGY (Nevada) • Show formations where AGE = ‘Paleozoic’ (without age ontology) • Show formations where AGE = ‘Paleozoic’ (with age ontology)

  18. Example: Geologic Map Integration • Knowledge representation / domain knowledge: AGE ONTOLOGY (± a few hundred million years), Nevada • GEON “Metamorphism Equation”: Geoscientists + Computer Scientists (± energy) → Geoinformaticists

  19. GEON and “Semantic” Data Integration: Midatlantic Region, Rocky Mountains

  20. Mediator Demo

  21. Biomedical Informatics Research Network http://nbirn.net Getting Formal: Source Contextualization & Ontology Refinement in Logic

  22. Distributed Query Processing Challenges: Part I, The Basics (GeoSciences Network, CS & theory) • “Scientific data” integration (BIRN, GEON, ...) is a variant of the data integration problem studied by the database CS community • Given • a user query against the integrated view • view-to-source mappings (GAV/LAV) • sources with limited access patterns • Compute a distributed query plan P s.t. • P has a feasible execution order • P is optimized wrt. time/space/networking complexity

  23. Real-time Observatories, Applications, and Data management Network (ROADNet) • Autonomous field sensors: seismic, oceanic, climate, ecological, …, video, audio, … • RT Data Acquisition: ANZA Seismic Network (1981–present): 13 broadband stations, 3 borehole strong-motion arrays, 5 infrasound stations, 1 bridge monitoring system; Kyrgyz Seismic Network (1991–present): 10 broadband stations; IRIS PASSCAL Transportable Array (1997–present): 15–60 broadband and short-period stations; IDA Global Seismic Network (~1990–present): 38 broadband stations • High Performance Wireless Research Network (HPWREN): high-performance backbone network with 45 Mbps duplex point-to-point links, backbone nodes at quality locations, network performance monitors at backbone sites; high-speed access links to hard-to-reach areas, typically 45 Mbps or 802.11 radios, point-to-point or point-to-multipoint • Data Grid Technology (SRB): collaborative access to distributed heterogeneous data, single sign-on authentication and seamless authorization, data scaling to petabytes and 100s of millions of files, data replication, etc.

  24. A P2P Problem from ROADNet • Networks of ORBs send each other various data streams • Avoid actual loops in the presence of virtual loops: • A → B → C → A • A: c1 → B • B: c2 → C • C: c3 → A • ... • Idea: L(c1) ∩ L(c2) ∩ L(c3) ∩ … = {} • In the real system: Unix regexps
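The emptiness test can be sketched in Python. For illustration the per-hop filter languages L(ci) are plain sets of stream names rather than the Unix regexps used in the real system (over the finite set of registered streams, a regexp filter induces exactly such a set). All stream and channel names are made up:

```python
# Loop check for a cycle of ORB forwarding rules: the virtual loop
# A -> B -> C -> A is harmless iff the intersection of the filter
# languages along the cycle is empty (no stream survives every hop).

def cycle_is_safe(filters, streams):
    live = set(streams)
    for f in filters:
        live &= f          # streams that survive this hop's filter
    return not live        # empty intersection => nothing loops forever

streams = {"seismic", "ocean", "video"}
c1 = {"seismic", "ocean"}   # A -> B
c2 = {"ocean", "video"}     # B -> C
c3 = {"seismic", "video"}   # C -> A
safe = cycle_is_safe([c1, c2, c3], streams)
```

With regexp filters the same check applies by testing each registered stream name against every filter on the cycle.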

  25. Scientific Workflows and Analytical Pipelines

  26. Scientific Workflows/Analytical Pipelines over Brain Data (Biomedical Informatics Research Network, http://nbirn.net) • Representation of the workflow for cortical reconstruction using FreeSurfer: raw anatomical MR images are first pre-processed and then manually edited to correct defects in the pre-processing. Once verified for correctness, the pre-processed images can be analyzed. During processing, various “snapshots” of the data are returned to the BIRN Virtual Data Grid.

  27. Example: Promoter Identification Workflow (PIW) (simplified) • scientific data sets flow between the steps • abstraction of tasks into higher conceptual levels • branching/merging of tasks and looping

  28. SEEK: Vision & Overview • Large collaborative NSF/ITR project: UNM, UCSB, UCSD, UKansas, ... • Fundamental improvements for researchers: global access to ecologically relevant data; rapidly locate and utilize distributed computation; capture, reproduce, extend the analysis process • SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities • EcoGrid provides unified access to distributed data stores, parameter ontologies, & stored analyses, and runtime capabilities via the Execution Environment; raw data sets are wrapped for integration with EML, etc. • Semantic Mediation System (SMS) & Analysis and Modeling System (AMS) use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration • AMS maintains a library of analysis steps, pipelines & results (SAS, MATLAB, FORTRAN, etc.); example analytical pipeline “AP0”: invasive species over time

  29. SEEK Components • EcoGrid • Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data • “Semantically” mediated and metadata driven • Centralized search & management portal(s) • Analysis and Modeling System • Capture, reproduce, and extend analysis process • Declarative means for documenting analysis • “Pipeline” system for linking generic analysis steps • Strong version control for analysis steps • Easy-to-use interface between data and analysis • Semantic Mediation System: • “smart” data discovery, “type-correct” pipeline construction & data binding: • determine whether/how to link analytic steps • determine how data sets can be combined • determine whether/how data sets are appropriate inputs for analysis steps

  30. AMS Overview • Objective • Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?) • Scope • Any type of analysis or model in ecology and biodiversity science • Massively streamline the analysis and modeling process • Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++),… • …

  31. SMS Requirements from AMS • ...assist users in determining the appropriateness of combining various analytical steps and data sources based on semantic mediation... • Semantic mediation should occur in three areas: • determine whether it is appropriate to link together particular analytic steps. • mediate between multiple data sets to determine in what ways they can be combined. • determine whether the selected data sources are appropriate inputs for the selected analysis.

  32. Some functional requirements • SMS should have the ability to ... FR1: recognize data types (XML Schema types!? EML types?) of registered EcoGrid data sets FR2: recognize semantic types (OWL and/or RDF(S) !?) of registered EcoGrid data sets FR3: recognize registered EcoGrid ontologies Note: semantic types reference those ontologies FR4: recognize data type signature (XML Schema? WSDL?) of analytical steps (ASs) FR5: recognize semantic type signature of analytical steps FR6: recognize semantic constraints (OWL? First-order? What syntax? KIF? Prolog?) Note: data schemas and signatures of analytical steps have those

  33. ... some functional requirements • Ability to ... FR8: check well-typedness (data and semantics) of a data set wrt. an analytical step FR9: check compatibility of two data sets wrt. "generalized operations" between those data sets (e.g., "semantic" join and union) FR10: check well-typedness (data and semantics) of chained analytical steps FR11: introduce data type conversions (e.g., int → float) FR12: perform and "explain" semantic type substitutions (e.g., if some AS works for Cs and D-isa-C, it also works for Ds) FR13: [optional] generate type-correct APs from a given schema of desired output and (optionally) input parameters
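FR12's substitution rule (an AS declared for inputs of semantic type C also accepts Ds when D-isa-C) amounts to a reachability check in the isa hierarchy. A minimal sketch with hypothetical type names:

```python
# FR12 sketch: semantic type substitution via the isa hierarchy.
# Type names and edges are hypothetical.
isa = {"StreamTemperature": "Temperature",
       "Temperature": "PhysicalQuantity"}

def is_subtype(d, c):
    # D isa* C: reflexive-transitive closure of the isa relation.
    while d != c:
        if d not in isa:
            return False
        d = isa[d]
    return True

def check_input(step_param_type, data_type):
    # An AS declared for C accepts any data of type D with D isa* C;
    # the returned string is the "explanation" FR12 asks for.
    ok = is_subtype(data_type, step_param_type)
    why = "%s isa* %s" % (data_type, step_param_type) if ok else "no isa path"
    return ok, why
```

Substitution is directional: a step expecting the specific type must reject data typed only at the general one.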

  34. Use Cases • Clients of the SMS include the AMS, the EcoGrid, and "scientific workflow engineers". • UC1: Client requests type signature (data and semantic types) of a registered EcoGrid data set (DS) • UC2: Client requests "other semantic constraints" of a DS. • UC3: Client requests type signature (data and semantic types) of an analytical step (AS) • UC4: Client requests "other semantic constraints" of an AS. • UC5: Client requests type signature of an AP. • UC6: Client requests type checking of AP. • UC7: Client requests registered data sets compatible with the inputs of an AS (e.g., if AS is scale sensitive, then all data sets must have the same scale; a flag is raised if data needs scaling). • UC8: Client requests all registered ASs which can produce a given parameter (the latter is part of a registered ontology) • UC9: Client requests candidate predecessor and successor steps for a given AS.

  35. Planned Components SW1: Formal language(s) for representing/instantiating data types, semantic types, ontologies, and "other semantic constraints" SW2: System for data type checking and inference (includes introduction of data type conversion steps) SW3: System for semantic type checking and inference SW4: [optional] System for "planning" APs given some of: output parameters, data sets, and input parameters

  36. THE PROBLEM – Reconcile this: • Simple, intuitive graph/pipeline language, • … which is expressive enough to handle real-world flows (SciDAC: PIW), • … and allows some static analysis • while trying to leverage existing work: • e.g., Ptolemy-II directors: Process Networks (PN), Synchronous Dataflow (SDF), ..., • or workflow standards and systems

  37. (Analytical) Pipelines …. (Scientific) Workflows • Spectrum of languages & formalisms: • Pipelines (a la Unix) • Dataflow languages: • Kahn’s process networks (PN) • Synchronous dataflow networks (SDF) • “Web page-flow”: • Active XML, WebML, … • Hesitating-weak-alternating-tree-automata-ML • … • (Business) Workflows: • WfMC’s XPDL, WSFL, BPEL4WS, …

  38. Kahn Process Networks (PN) • Concurrent processes communicating through one-way FIFO channels with unbounded capacity • A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!) • an increasing chain of sets of sequences → outputs may not increase! • Consider increasing chains (wrt. prefix ordering “<”) of streams • PN is continuous if lub(Xs) exists for all increasing chains Xs and F(lub(Xs)) < lub(F(Xs)) • Continuous implies monotonic: if Xs < Ys then F(Xs) < F(Ys)

  39. Process Networks (cont’d) • PN in essence: simultaneous relations between sequences • A network of functional processes can be described by a mapping X = F(X, I) • X denotes all the sequences in the network (inputs I + outputs) • An X that forms a solution is a fixed point • Continuity implies exactly one “minimal” fixed point • minimal in the sense of prefix ordering, for any inputs I • execution of the network: given I, start from the empty sequences and find the minimal fixed point (works because of the monotonicity property)
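A minimal executable sketch of a two-process Kahn network in Python: the unbounded FIFO channel is a queue.Queue, and blocking reads are the only way to consume input, which is what gives the deterministic semantics. The None end-of-stream marker is an assumption of this sketch, not part of the PN model:

```python
import queue
import threading

# One functional process, double: maps the input sequence x1, x2, ...
# to the output sequence 2*x1, 2*x2, ...
def double(inp, out):
    while True:
        x = inp.get()          # blocking read from the FIFO channel
        if x is None:          # end-of-stream marker (sketch-only)
            out.put(None)
            return
        out.put(2 * x)

c1, c2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
t = threading.Thread(target=double, args=(c1, c2))
t.start()

for x in [1, 2, 3, None]:      # feed the input sequence
    c1.put(x)

result = []
while True:                    # drain the output sequence
    y = c2.get()
    if y is None:
        break
    result.append(y)
t.join()
# result == [2, 4, 6]
```

Because each process only does blocking reads, the output sequence is the same regardless of thread scheduling, matching the unique minimal fixed point.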

  40. Synchronous Data Flow Networks (SDF) • Special case of PN • Ptolemy-II SDF overview • SDF supports efficient execution of dataflow graphs that lack control structures • with control structures → Process Networks (PN) • requires that the rates on the ports of all actors be known beforehand • rates do not change during execution • in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted • → SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins
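The fixed rates are what make pre-execution scheduling possible: the scheduler solves the balance equations r_src · prod = r_dst · cons for each channel and fires each actor its repetition count per iteration. A sketch for a hypothetical three-actor chain, scaling the rational solution to the smallest integer repetition vector:

```python
from fractions import Fraction
from math import lcm  # Python 3.9+

# Hypothetical SDF chain: (src, dst, tokens produced, tokens consumed).
edges = [("A", "B", 2, 3),   # A produces 2 tokens/firing, B consumes 3
         ("B", "C", 1, 2)]   # B produces 1 token/firing, C consumes 2

rates = {"A": Fraction(1)}   # fix one actor's rate, propagate the rest
for src, dst, prod, cons in edges:   # assumes edges in topological order
    # Balance equation: rates[src] * prod == rates[dst] * cons
    rates[dst] = rates[src] * prod / cons

# Scale to the smallest integer repetition vector.
scale = lcm(*(r.denominator for r in rates.values()))
repetitions = {a: int(r * scale) for a, r in rates.items()}
# For the rates above: A fires 3x, B 2x, C 1x per iteration.
```

One iteration of this schedule returns every channel to its initial token count, so the graph runs forever in bounded memory; a general tool would also handle cycles and check for inconsistent (unschedulable) rates.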

  41. Extended Kahn-MacQueen Process Networks • A process is considered active from its creation until its termination • An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked), or when waiting for a queued topology change request to be processed (mutation-blocked) • A deadlock occurs when all the active processes are blocked • real deadlock: all the processes are blocked on a read • artificial deadlock: all processes are blocked, at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock. • If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.

  42. Analytical Pipelines: An Open Source Tool

  43. A commercial tool for Analytical Pipelines

  44. MAP: Data Massaging a la Blue-Titan/Perl

  45. Compiling Abstract Scientific Workflows into Web Service Workflows SSDBM’03

  46. The Problem • A scientist would like to ... • create a high-level “abstract” WF and • not bother about web service URLs, parameter passing, low-level data transformations, ... • How to go from ... • a high-level Abstract Workflow (AWF) to • an Executable (web service) Workflow (EWF)?? • Idea: • using nested definitions, express AWF in terms of other AWFs and EWFs; unfold definitions at compile time → Abstract-as-View approach
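The unfold-at-compile-time idea can be sketched as a recursive expansion of nested definitions: abstract steps are views over other steps, and executable web-service calls are the base case. The step and service names below are hypothetical, loosely echoing PIW:

```python
# Abstract-as-View sketch: abstract workflow steps are defined in
# terms of other abstract steps or executable calls ("ws:" prefix).
definitions = {
    "PIW": ["RetrievePromoters", "BuildModel"],
    "RetrievePromoters": ["ws:blast", "ws:trim"],
    "BuildModel": ["ws:transfac"],
}

def unfold(step):
    # Compile-time unfolding: executable steps are emitted as-is,
    # abstract steps are replaced by the unfolding of their definition.
    if step.startswith("ws:"):
        return [step]
    return [s for sub in definitions[step] for s in unfold(sub)]

ewf = unfold("PIW")   # the executable workflow, as a flat step list
```

A real compiler would unfold a graph rather than a list and insert the low-level data transformations between mismatched service signatures, but the view-unfolding core is the same.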

  47. WF Language Constructs (AWF+EWF) [figure: graph constructs with cond-guarded branches]

  48. Conceptual Workflow (steps as laid out in the diagram) • Retrieve transcription factors • Arrange transcription factors • Retrieve matching cDNA • Retrieve genomic sequence • Align promoters • Extract promoter region (begin, end) • Create consensus sequence • Compute clusters (min. distance) • For each promoter • Select gene-set (cluster-level) • Compute subsequence labels • For each gene • With all promoter models: compute joint promoter model

  49. Abstract Workflow (AWF) (= chain program over relations with i/o patterns)

  % AWF
  piw(DB, Gene, TFBSModel) :-
      cDNASequence(Gene, CDNASeq),
      localAlignment(DB, CDNASeq, RankedPromoters),
      firstRest(Promoter, RankedPromoters, RankedPromoters1),
      promoter_detail(Promoter, PromoterId, Start, End, Orientation),
      cDNASequence(PromoterId, GenomicSeq),
      trim_sequence(GenomicSeq, Start, End, Orientation, ShortSeq),
      convertSeq(Orientation, ShortSeq, PosSeq),
      transfac(PosSeq, TFBSModel).
