Enhancing Phylogenetic Data Integration: The pPOD Project

The pPOD (processing PhylOData) Project a collaborative between University of Pennsylvania University of California, Davis Yale University University of Florida

AToL Projects' Data 1. Genotypic descriptions and their provenance; 2. Phenotypic descriptions and their provenance; 3. Specimens and their provenance including collection information, voucher deposition, etc.; 4. Interpretation of the primary measurements including homology; 5. Estimates of phylogenies, and information about the methods employed; 6. Supertree construction, and information about the methods employed; and 7. Post-tree analyses such as character evolution hypotheses. … more … Need to develop the infrastructure to integrate AToL data sources together with other valuable resources such as publication archival databases, morphological character databases, phylogenomics databases, etc. Hence the pPOD mission.

pPOD-developed technologies that will enable integration An extensible core data model for phylogenetic data, with a query language and a persistence tool; Components for peer-to-peer data integration and exchange, using schema mappings (Orchestra); A scientific workflow system for interoperating the data integration components with the local database access components and with analysis tools (Kepler). Ubiquitous theme: strong support for systematics-oriented provenance management

Personnel Val Tannen (PI), Sue Davidson, Zack Ives, Junhyong Kim (coPIs) Zhaowei Bao, T.J. Green, Greg Karvounarakis, U Pennsylvania Bertram Ludaescher (PI), Shawn Bowers, Tim McPhillips (coPIs) Manish Anand, UC Davis Reed Beaman (PI), U Florida Bill Piel (coPI), Greg Jordan, Yale Consultants: Peter Buneman, U Edinburgh Michael Donoghue, Yale Jim Leebens-Mack , U Georgia Francois Lutzoni, Duke David Maddison, U Arizona Wayne Maddison, U British Columbia Brent Mishler, UC Berkeley Bernard Moret, EPF Lausanne Rod Page, U Glasgow Mike Sanderson, U Arizona Todd Vision, U North Carolina and NESCENT Collaboration with Sarah Cohen-Boulakia and Christine Froidevaux, U Paris-Sud Many of these people and others have participated in a community feedback meeting 11-12 Sept. 2007 (at NESCENT)

The pPOD Core Data Model The pPOD CDM team: Bill Piel, Shirley Cohen, Tim McPhillips, Shawn Bowers, Sarah Cohen-Boulakia, Reed Beaman, Val Tannen Special thanks to Brent Mishler, David Maddison, Jeff Oliver, Rutger Vos, Francois Lutzoni, Martin Ramirez, Jonathan Coddington, Wayne Maddison, Fan Ge, Ashley Green,Jin Ruan, Martin Wu, John Lundberg, John Sullivan MIAPA compatibility: working with Jim Leebens-Mack and Todd Vision

Goals • The Core Data Model (CDM) under development in the pPOD project will serve the following purposes: • It will allow experimentation with the modeling of provenance in phylogenetic pipelines. • It will serve as a schema for a persistence tool, to work (1) in standalone mode, (2) with our lab notebook suite and (3) integrated with Mesquite as a module. 3. It will serve as a target for schema mappings used to connect other AToL databases, resources like TreeBASE, etc., using the Orchestra integration engine.

The Role of Provenance Backwards provenance Starting from a research “product”, eg. a tree, a supertree, a matrix, track backwards through stored objects to all the raw input information that led to this product. Forwards provenance Starting from a raw input, eg., a specimen, an image, a sequence, track forwards through stored objects to all research products that this input contributed to. In both cases, navigate biological assumptions in both directions, eg., homology assumptions.

“Kinds” of Provenance • Relationship between stored objects • Eg., tree T123 was obtained from matrix M456 by Joe Bio on 01/31/2001 using PAUP with parameters… • Tracking through copy or cut/paste operations, possibly across repositories • Trace of data moving through a workflow • Sequence of timestamps, tool invocations (parameters), authors • Trace of data through a logically expressed view/query • Can be computed automatically as the view/query output is computed In our CDM tools In our workflow tool

store commands provenance query query (phylogenetic query language) AToL AAA schema mappings TreeBASE persistence manager RDBMS Persistence Tool CDM (an OO schema) Kepler-based workflow tool Mesquite module

Approach Modeled data: Analyzed data: trees, matrices,cells,(row) segments, operational taxomic units (OTUs),taxa, standard characters and their states, genes,gene fragments Raw data: standard views,images, sequences,chromatograms,primers, specimens,samples, collections Technology: object-oriented modeling (from OO databases), use Hibernate for relational storage OQL (ODMG) and HQL (Hibernate) for query language

Example of Phylogenetic Query Find all standard matrices with some character C whose label contains the substring "elytra" and some OTU whose state for character C contains the substring "transverse"; return all such matrices, together with their characters, OTUs and states satisfying the conditions.

Semi-formalized (OQL/HQL) query example SELECT M, label of C, label of X, label of state encoded in cell E FROM M over all standard matrices, C over all characters of M, X over all OTUs of M, E is the cell corresponding to C and X in M WHERE the label of C is like "*elytra*" AND the label of the state encoded in cell E is like "*transverse*"

PhyloWidget Bill Piel and Greg Jordan (Yale) Funded in part by NESCENT/Google Summer of Code. A phylogeny browser and editor designed for large tree visualization and manipulation. Java applet web deployable or stand-alone application (under construction). To be integrated with the pPOD CDM query language. www.phylowidget.org

PhyloWidget screenshot

MIAPA-compatible DB MIAPA TreeBASE Other DB of interest Violet’s Lab Virtual schema query P2P Integration Alice’s AToL Lab query schema mapping Core data model (virtual schema) Bob’s AToL Lab Chester’s AToL Lab

What Is a Schema Mapping? For relational databases: AliceTable(Attr1, - , Attr2, - ) CDMTable(Attr1,Attr2, - ) From TreeBASE II to pPOD CDM: study(studyId,studyName,studyAcc), analysis(anId,studyId,anName,_), ANALYSISSTEP(stepId,anId,softId, dataId,stepNam,_,parms),SOFTWARE(softId,soft,-,-,-,-), analyzeddata(dataId,false,len), MATRIX(matrixId, dataId,matrixTitle,-,-,-), matrixrow(rowId,matrixId,mLabelId),TAXONLABEL(mLabelId,inLabel), PHYLOTREE(treeId,dataId,treeTitle), treenode(nodeId,treeId,tLabelId, nodeLabel), TAXONLABEL(nodeId,otuLabel) TREE(treeId,treeTitle), PROVENANCE(provId,treeId,matrixId,soft,parms,-,-), MATRIX(matrixId,matrixTitle), matrixotu(matrixId,otuId,-), OTU(otuId,otuLabel), treeotu(treeId,inOtuId), OTU(inOtuId,inLabel)

The Orchestra Solution Each peer is able to Map its schema to other peers' schema Query through its own view of the shared data Publish its updates to the other peers Reconcile automatically its data with other peers' data through mappings Curate other peers data when imported Define trust policies over the related databases It is easy to change the mappings to adapt to schema changes, new data needs, ... A much simpler development task! Zack Ives, Val Tannen, T.J. Green, Greg Karvounarakis http://www.cis.upenn.edu/~zives/orchestra/ Working on deployment with Reed Beaman

UCDAVIS Department of Computer Science The Kepler/pPOD SystemScientific workflow and provenance support for the AToL community Kepler/pPOD @UCDavis Bertram Ludaescher, Shawn Bowers, Tim McPhillips, Manish Anand • Developing a scientific workflow system for phylogenetic data analysis • Automatic tracking and querying of scientist-oriented data provenance

Scientific workflows in Kepler … • Automating analytical tools and the management of data, services, and provenance • Support for high-level modeling of analysis and data management tasks • Including data access, format conversion, and visualization • Scientific workflow benefits (over conventional scripting approaches) • Built-in provenance support • Workflow & component reuse • Analysis design & documentation • Archival, sharing • Improved performance • Distributed execution • Concurrency, pipelining • Uniform data access / models • …

Aligning sequences using a CIPRes service and a local application • Distinct workflow components (‘actors’) automate steps in an analysis • Actors are provided for data load, store, and manipulation as well scientific calculations • Actors can provide access to remote services and automate local applications • Actors for applications and services using different data standards can be combined easily in a single workflow A workflow for aligning protein sequences using the CIPRes ClustalW service A workflow for refining protein sequence alignments using the local Gblocks application • Parameter settings for applications and services are saved to allow workflow to be executed again automatically or shared with others • System records in detail how results were computed in a workflow ‘trace’ file

A family of workflows for tree inference Similar workflows for inferring trees using the CIPRes RAxML service... • In general, actors of this kind can be strung together in an intuitive order with minimal configuration. • Consequently, workflows can easily be modified to invoke different methods for a particular step. • Resulting workflows are easy to understand. • Behavior of a workflow is easy to predict. ...the CIPRes MrBayes service... ... and the local Phylip protpars program

Integrated data provenance support … • Capture data dependencies during a workflow run • Store workflow results in trace files • Tools to view data provenance and navigate history of the workflow execution • Output of a run can be used as input of another (provenance is maintained)

Enhancing Phylogenetic Data Integration: The pPOD Project