
Towards Semantic Typing Support for Scientific Workflows



Presentation Transcript


  1. Towards Semantic Typing Support for Scientific Workflows Bertram Ludäscher Knowledge-Based Information Systems Lab San Diego Supercomputer Center University of California San Diego http://seek.ecoinformatics.org http://www.geongrid.org

  2. Outline • Motivation: Traditional vs Scientific Data Integration • Semantic (a.k.a. Model-Based) Mediation • Scientific Workflows (a.k.a. Analysis Pipelines) • DB Theory Appetizer: Web Service Composition Through Declarative Queries

  3. Information Integration Challenges
  • reconciling "S4" heterogeneities (System, Syntax, Structure, Semantics)
  • "gluing" together resources
  • bridging information and knowledge gaps computationally
  • System aspects: "Grid" middleware
    • distributed data & computing
    • Web Services, WSDL/SOAP, OGSA, …
    • sources = functions, files, data sets
  • Syntax & Structure: (XML-based) data mediators
    • wrapping, restructuring
    • (XML) queries and views
    • sources = (XML) databases
  • Semantics: model-based/semantic mediators
    • conceptual models and declarative views
    • knowledge representation: ontologies, description logics (RDF(S), OWL, …)
    • sources = knowledge bases (DB + CMs + ICs)

  4. Information Integration from a DB Perspective
  • Information Integration Problem
    • Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
    • Find: the answers to Q1, ..., Qn
  • The Database Perspective: source = "database"
    • Si has a schema (relational, XML, OO, ...)
    • Si can be queried
    • define a virtual (or materialized) integrated/global view G over S1, ..., Sk using database query languages (SQL, XQuery, ...)
    • questions become queries Qi against G(S1, ..., Sk)

  5. Standard (XML-Based) Mediator Architecture
  (Figure: the USER/Client poses a query Q against the integrated global (XML) view G, i.e. Q(G(S1, ..., Sk)). The MEDIATOR holds the integrated view definition G(..) ← S1(..), ..., Sk(..); it rewrites Q into source queries Q1, Q2, Q3, which are sent via (XML) views to wrappers (web services as wrapper APIs) over the sources S1, S2, ..., Sk. The answer sets {answers(Q1)}, {answers(Q2)}, {answers(Q3)} flow back and are combined into {answers(Q)}.)

  6. Query Planning for Mediators
  • Given:
    • user query Q: answer(…) ← … G …
    • … & { G ← … S … } global-as-view (GAV)
    • … & { S ← … G … } local-as-view (LAV)
    • … & { false ← … S … G … } integrity constraints (ICs)
  • Find:
    • an equivalent (or minimally containing / maximally contained) query plan Q': answer(…) ← … S …
  • Results:
    • a variety of results/algorithms, depending on the classes of queries, views, and ICs: P, NP, …, undecidable
    • many variants still open
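A minimal sketch of the GAV case in Python (all relation names and data here are hypothetical, not from the talk): since the global view G is defined as a query over the sources, a query against G can be answered simply by unfolding the view definition into a query over S1, ..., Sk.

```python
# Toy GAV mediator (hypothetical relations): the global view G is defined
# as a conjunctive query over the sources, so answering a query against G
# amounts to unfolding that definition.
S1 = {("site1", "2000-09-08")}                       # S1(site, date)
S2 = {("site1", "CRGI", 0), ("site1", "LOCH", 3)}    # S2(site, species, count)

def G():
    """GAV view: G(site, date, sp, n) <- S1(site, date), S2(site, sp, n)."""
    return {(site, date, sp, n)
            for (site, date) in S1
            for (s, sp, n) in S2
            if s == site}

def user_query(min_count):
    """User query against G: species with at least min_count observations."""
    return {sp for (_, _, sp, n) in G() if n >= min_count}
```

Unfolding is what makes GAV planning easy; the LAV direction (rewriting Q using view definitions of the individual sources) is the harder case the slide alludes to, with complexity ranging up to undecidable depending on the query and view classes.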

  7. From Scientific Data Integration to Process & Application Integration (and back…)
  • Data Integration
    • database mediation + knowledge-based extension → query rewriting w/ GAV, LAV, ICs, access patterns
  • "Process/Application" Integration
    • scientific models (ocean, atmosphere, ecology, …), assimilation models (e.g., real-time data feeds), …
    • data sets
    • legacy tools
    • components = web services
    • applications = composite components ("workflows")
    • need for semantic type extensions

  8. Geologic Map Integration
  • Given:
    • geologic maps from different state geological surveys (shapefiles w/ different data schemas)
    • different ontologies:
      • geologic age ontology
      • rock type ontologies: multiple hierarchies (chemical, fabric, texture, genesis) from the Geological Survey of Canada (GSC); a single hierarchy from the British Geological Survey (BGS)
  • Problem:
    • support uniform queries against the multiple geologic maps using different ontologies
    • support registration w/ ontology A, querying w/ ontology B

  9. Ontology Mappings: Motivation
  • Establish correspondences between ontologies
    → integrate data sets which are registered to different ontologies
    → query data sets through different ontologies
  (Figure: Data set 1 is registered to Ontology A, Data set 2 to Ontology B; ontology mappings between A and B let queries through either ontology reach both data sets.)

  10. A Multi-Hierarchical Rock Classification Ontology (GSC)
  (Figure: four parallel hierarchies: Genesis, Fabric, Composition, Texture.)

  11. Some enabling operations on "ontology data"
  • Concept expansion: what else to look for when asking for 'Mafic' (in the Composition hierarchy)

  12. Some enabling operations on "ontology data"
  • Generalization: finding data that is "like" X and Y (in the Composition hierarchy)
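Both operations are simple traversals of the subclass hierarchy: expansion collects all subconcepts, generalization finds a least common ancestor. A Python sketch over a tiny, hypothetical rock-type fragment (the class names are illustrative, not the actual GSC ontology):

```python
# Hypothetical subclass hierarchy: child -> parent edges.
parent = {
    "Basalt": "Mafic", "Gabbro": "Mafic",
    "Mafic": "Igneous", "Felsic": "Igneous",
    "Granite": "Felsic",
}

def descendants(concept):
    """All (transitive) subconcepts of `concept`."""
    kids = [c for c, p in parent.items() if p == concept]
    out = set(kids)
    for k in kids:
        out |= descendants(k)
    return out

def expand(concept):
    """Concept expansion: what else to look for when asking for `concept`."""
    return {concept} | descendants(concept)

def ancestors(concept):
    """Superconcepts of `concept`, nearest first."""
    out = []
    while concept in parent:
        concept = parent[concept]
        out.append(concept)
    return out

def generalize(x, y):
    """Generalization: the least common ancestor, i.e. data 'like' x and y."""
    ax = [x] + ancestors(x)
    for a in [y] + ancestors(y):
        if a in ax:
            return a
    return None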

  13. Implementation in OWL: Not only “for the machine” …

  14. The GEON "Metamorphism Equation": Geoscientists + Computer Scientists ± energy ± a few hundred million years → Geoinformaticists
  (Figure, with igneous/metamorphic imagery: domain knowledge: geologic map integration, Nevada; knowledge representation: ontologies!?)

  15. Geology Workbench: Registering Data to an Ontology. Step 1: Choose Classes
  (Screenshot: choose an ontology class, click on Submission, select a shapefile, enter a data set name.)

  16. Geology Workbench: Data Registration. Step 2: Choose Columns for Selected Classes
  (Screenshot: the highlighted column contains information about geologic age.)

  17. Geology Workbench: Data Registration. Step 3: Resolve Mismatches
  (Screenshot: two terms do not match any ontology terms; 'algonkian' is manually mapped into the ontology.)

  18. Geology Workbench: Ontology-Enabled Map Integrator
  (Screenshot: choose the classes of interest and click on the name, e.g. all areas with the age Paleozoic.)

  19. Geology Workbench: Change Ontology
  (Screenshot: switch from the Canadian Rock Classification to the British Rock Classification; submit an ontology mapping between the two; a new query interface appears; run it.)

  20. Ontologies and Data Management
  (Figure: design artifacts such as conceptual models and schemas use concepts from an ontology, explicitly or implicitly; schemas serve as metadata for the data.)
  • How to define and refine an ontology?
  • How to register a dataset to an ontology?

  21. Refining an Ontology – the logic way – enables "Source Contextualization"
  Biomedical Informatics Research Network, http://nbirn.net

  22. Connecting Datasets to Ontologies: "Semantic Registration"
  Ontology (snippet):
  • DataCollectionEvent ⊑ ∃contains.Measurement
  • Measurement ⊑ ∃measureOf.MeasurableItem ⊓ ∃hasContext.MeasurementContext
  • MeasurementContext ⊑ ∃hasTime.DateTime ⊓ ∃hasLocation.Location
  • MeasurableItem ⊑ ∃hasUnit.Unit ⊓ ∃hasValue.UnitValue
  • SpeciesCount ⊑ MeasurableItem ⊓ ∃hasSpecies.Species ⊓ ∃hasUnit.RatioUnit …
  • SpeciesAbundance ⊑ Measurement ⊓ ∃measureOf.SpeciesCount
  • AbundanceCollectionEvent ⊑ DataCollectionEvent ⊓ ∃contains.SpeciesAbundance
  • Location ⊑ ∃position.Coordinate
  • LTERSite ⊑ Location
  • SBLTERSite ⊑ LTERSite ⊓ ∃position.SBLTERCoordinate
  • {naples, …} ⊑ SBLTERSite
  How can we "register" the dataset to concepts in the ontology?
  Dataset:
  Date       | Site | Transect | SP_Code | Count
  2000-09-08 | CARP | 1        | CRGI    | 0
  2000-09-08 | CARP | 4        | LOCH    | 0
  2000-09-08 | CARP | 7        | MUCA    | 1
  2000-09-22 | NAPL | 7        | LOCH    | 1
  2000-09-18 | NAPL | 1        | PAPA    | 5
  2000-09-28 | BULL | 1        | CYOS    | 57

  23. Purpose of Semantic Registration
  • Expose "hidden" information:
    • What do attributes represent?
    • What do specific values represent?
    • What conceptual "objects" are in the dataset?
  • Capture connections between the dataset and ontology to:
    • find existing datasets (or parts of datasets) via ontological concepts (discovery)
    • enable integration of datasets (mediation)
    • generate metadata for new data products (in a pipeline)

  24. Semantic Registration Framework
  • Step 1: The data provider selects relevant ontological concepts (for the dataset)
  • Step 2: The semantic registration system creates a structural representation based on the chosen concepts (the data provider refines it if needed)
  • Step 3: The data provider maps the dataset information to the generated structural representation

  25. Step 1: Selecting Relevant Concepts
  Concepts from an Ontology: DataCollectionEvent, AbundanceCollectionEvent, Measurement, Abundance, SpeciesAbundance, MeasurementContext, …, Location, LTERSite, SBLTERSite, naples, Species, …, MeasurableItem, SpeciesCount
  Dataset:
  Date       | Site | Transect | SP_Code | Count
  2000-09-08 | CARP | 1        | CRGI    | 0
  2000-09-08 | CARP | 4        | LOCH    | 0
  2000-09-08 | CARP | 7        | MUCA    | 1
  2000-09-22 | NAPL | 7        | LOCH    | 1
  2000-09-18 | NAPL | 1        | PAPA    | 5
  2000-09-28 | BULL | 1        | CYOS    | 57

  26. Step 1: Selecting Relevant Concepts (continued; same concept list and dataset as slide 25)

  27. Step 2: Generate Object Model
  Concepts from an Ontology: DataCollectionEvent, AbundanceCollectionEvent, Measurement, Abundance, SpeciesAbundance, MeasurementContext, …, Location, LTERSite, SBLTERSite, naples, Species, …, MeasurableItem, SpeciesCount
  (Figure, generated object model:
  • AbundanceCollectionEvent: contains → SpeciesAbundance; hasTime → DateTime; hasLoc → SBLTERSite
  • SpeciesAbundance: measureOf → SpeciesCount; hasValue → RatioValue
  • SpeciesCount: hasSpecies → Species; hasUnit → RatioUnit)
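As a sketch of what the generated structural representation of Step 2 could look like in code (class and property names follow the slide; the row-mapping function is an assumption for illustration, not SEEK's actual implementation):

```python
# Sketch: mapping one dataset row into the generated object model.
# Class/property names follow the slide; the mapping logic is assumed.
from dataclasses import dataclass

@dataclass
class SpeciesCount:
    species: str          # hasSpecies -> Species
    unit: str = "count"   # hasUnit -> RatioUnit

@dataclass
class SpeciesAbundance:
    measure_of: SpeciesCount  # measureOf -> SpeciesCount
    value: int                # hasValue -> RatioValue

@dataclass
class AbundanceCollectionEvent:
    time: str                   # hasTime -> DateTime
    location: str               # hasLoc -> SBLTERSite
    contains: SpeciesAbundance  # contains -> SpeciesAbundance

def register_row(date, site, transect, sp_code, count):
    """Step 3 (sketch): map a (Date, Site, Transect, SP_Code, Count) row."""
    return AbundanceCollectionEvent(
        time=date, location=site,
        contains=SpeciesAbundance(SpeciesCount(sp_code), count))
```

With the last table row, `register_row("2000-09-28", "BULL", 1, "CYOS", 57)` yields an event whose contained abundance measures a count of 57 for species code CYOS.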

  28. Scientific Workflows

  29. Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

  30. Source: NIH BIRN (Jeffrey Grethe, UCSD)

  31. Ecology: GARP Analysis Pipeline for Invasive Species Prediction
  Source: NSF SEEK (Deana Pennington et al., UNM)
  (Workflow figure: EcoGrid queries against registered EcoGrid databases retrieve species presence & absence points (a) and environmental layers (b), for both the native range and the invasion area; user sample data may also enter the pipeline. Layer integration yields integrated layers (c); training and test samples (d) are calculated; GARP produces a rule set (e); map generation produces native-range and invasion-area prediction maps (f); validation yields model quality parameters (g); selected prediction maps (h), with generated metadata, are archived to EcoGrid. Annotations +A1, +A2, +A3 mark points in the figure.)

  32. Scientific Workflows: Some Findings
  • More dataflow than (business) workflow
  • Need for a "programming extension": iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
  • Need for abstraction and nested workflows
  • Need for data transformations
  • Need for rich user interaction & workflow steering: pause / revise / resume; select & branch, e.g. web browser capability at specific steps as part of a coordinated SWF
  • Need for high-throughput transfers ("grid-enabling", "streaming")
  • Need for persistence of intermediate products → data provenance (the "virtual data" concept)

  33. Our Starting Point: Dataflow Process Networks and Ptolemy II
  Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
  (Slide annotations: see! read! try!)

  34. Kepler Team, Projects, Sponsors
  Ilkay Altintas (SDM), Chad Berkley (SEEK), Shawn Bowers (SEEK), Jeffrey Grethe (BIRN), Christopher H. Brooks (Ptolemy II), Zhengang Cheng (SDM), Efrat Jaeger (GEON), Matt Jones (SEEK), Edward A. Lee (Ptolemy II), Kai Lin (GEON), Ashraf Memon (GEON), Bertram Ludaescher (BIRN, GEON, SDM, SEEK), Steve Mock (NMI), Steve Neuendorffer (Ptolemy II), Mladen Vouk (SDM), Yang Zhao (Ptolemy II), …

  35. Commercial Workflow/Dataflow Systems

  36. SCIRun: Problem Solving Environments for Large-Scale Scientific Computing • SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations • Component model, based on generalized dataflow programming Steve Parker (cs.utah.edu)

  37. E-Science and Link-Up Buddies • … <UPDATE ME> … • Taverna, Scufl, Freefluo, .. • DiscoveryNet • Triana • ICENI • …

  38. Dataflow Process Networks: Putting Computation Models First!
  (Figure: actors connected through typed i/o ports and FIFO queues; advanced push/pull communication.)
  • Synchronous Dataflow Network (SDF)
    • statically schedulable single-threaded dataflow
    • can execute multi-threaded, but the firing sequence is known in advance
    • maximally well-behaved, but also limited expressiveness
  • Process Network (PN)
    • multi-threaded, dynamically scheduled dataflow
    • more expressive than SDF (dynamic token rates prevent static scheduling)
    • natural streaming model
  • Other Execution Models ("Domains")
    • implemented through different "Directors"
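To make the PN flavor concrete, here is a minimal process network in plain Python: each actor runs as its own thread, ports are bounded FIFO queues, and tokens stream through until an end-of-stream marker. This is an illustration of the model only, not Ptolemy II's API.

```python
# Minimal process-network sketch: actors as threads, ports as bounded
# FIFO queues (cf. the PN director). Illustration only, not Ptolemy code.
import threading, queue

def source(out, items):
    """Source actor: emit a finite token stream, then end-of-stream."""
    for x in items:
        out.put(x)
    out.put(None)  # end-of-stream token

def mapper(f, inp, out):
    """Map actor: apply f to each incoming token."""
    while (x := inp.get()) is not None:
        out.put(f(x))
    out.put(None)

q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threading.Thread(target=source, args=(q1, [1, 2, 3])).start()
threading.Thread(target=mapper, args=(lambda x: x * x, q1, q2)).start()

results = []
while (y := q2.get()) is not None:
    results.append(y)
# results == [1, 4, 9]
```

The bounded queues give the blocking-read/blocking-write discipline of Kahn-style process networks: each actor fires whenever its input is available, with no global schedule computed in advance (unlike SDF).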

  39. Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

  40. Execution Semantics Promoter Identification Workflow in Ptolemy-II [SSDBM’03]

  41. (Annotations on the PIW figure:)
  • hand-crafted Web-service actors, "designed to fit"
  • hand-crafted control solution; also: forces sequential execution!
  • no data transformations available
  • complex backward control-flow

  42. Simplified Process Network PIW
  Back to a purely functional dataflow process network (= a data streaming model!), re-introducing map(f) to Ptolemy II (it was there in PT Classic):
  • no control-flow spaghetti; forward-only, abstractable sub-workflow piw(GeneId)
  • map(f)-style iterators for data-intensive apps
  • free concurrent execution
  • free, powerful type checking
  • generic, declarative "programming" constructs and generic data transformation actors
  • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

  43. Optimization by Declarative Rewriting
  PIW as a declarative, referentially transparent functional process makes optimization via functional rewriting possible, e.g. map(f ∘ g) = map(f) ∘ map(g): use map(f ∘ g) instead of map(f) ∘ map(g), and combine map and zip.
  Details: technical report & PIW specification in Haskell, http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
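The map-fusion rewrite is easy to check in any language with first-class functions; a small Python sketch of the equality the slide exploits:

```python
# The rewrite map(f . g) = map(f) . map(g): fusing two maps into one
# saves an intermediate stream. A direct check of the equality.
def compose(f, g):
    return lambda x: f(g(x))

xs = [1, 2, 3, 4]
f = lambda x: x + 1
g = lambda x: 2 * x

fused    = list(map(compose(f, g), xs))   # one pass: map(f . g)
pipeline = list(map(f, map(g, xs)))       # two passes: map(f) . map(g)
assert fused == pipeline == [3, 5, 7, 9]
```

In a streaming workflow the fused form matters: one actor firing per token instead of two, and no intermediate token stream between them.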

  44. Web Services & Scientific Workflows in Kepler
  • Web services = individual components ("actors")
  • "Minute-Made" application integration: plugging in and harvesting web service components is easy and fast
  • Rich SWF modeling semantics ("directors" and more): different and precise dataflow models of computation; clear and composable component interaction semantics
  → a web service composition and application integration tool
  • Coming soon:
    • shrink-wrapped, pre-packaged "Kepler-to-Go" (v0.8)
    • SWFs with structural and semantic data types (better design support)
    • grid-enabled web services (for big data, big computations, …)
    • different deployment models (SWF → WS, web site, applet, …)

  45. KEPLER Core Capabilities (1/2) • Designing scientific workflows • Composition of actors (tasks) to perform a scientific WF • Actor prototyping • Accessing heterogeneous data • Data access wizard to search and retrieve Grid-based resources • Relational DB access and query • Ability to link to EML data sources

  46. KEPLER Core Capabilities (2/2)
  • Data transformation actors to link heterogeneous data
  • Executing scientific workflows
    • distributed and/or local computation
    • various models for computational semantics and scheduling; SDF and PN are the most common for scientific workflows
  • External computing environments: C++, Python, C (… Perl planned …)
  • Deploying scientific tasks and workflows as web services themselves (… planned …)

  47. The KEPLER GUI (Vergil) Drag and drop utilities, director and actor libraries.
