the ppod p rocessing p hyl od ata project n.
Skip this Video
Loading SlideShow in 5 Seconds..
The pPOD ( p rocessing P hyl OD ata) Project PowerPoint Presentation
Download Presentation
The pPOD ( p rocessing P hyl OD ata) Project

Loading in 2 Seconds...

play fullscreen
1 / 22

The pPOD ( p rocessing P hyl OD ata) Project - PowerPoint PPT Presentation

  • Uploaded on

The pPOD ( p rocessing P hyl OD ata) Project. a collaborative between University of Pennsylvania University of California, Davis Yale University University of Florida. AToL Projects' Data. 1. Genotypic descriptions and their provenance;

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

The pPOD ( p rocessing P hyl OD ata) Project

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
the ppod p rocessing p hyl od ata project

The pPOD (processing PhylOData) Project

a collaborative between

University of Pennsylvania

University of California, Davis

Yale University

University of Florida

atol projects data
AToL Projects' Data

1. Genotypic descriptions and their provenance;

2. Phenotypic descriptions and their provenance;

3. Specimens and their provenance including collection information, voucher deposition, etc.;

4. Interpretation of the primary measurements including homology;

5. Estimates of phylogenies, and information about the methods employed;

6. Supertree construction, and information about the methods employed; and

7. Post-tree analyses such as character evolution hypotheses.

… more …

Need to develop the infrastructure to integrate AToL data sources together with other valuable resources such as publication archival databases, morphological character databases, phylogenomics databases, etc.

Hence the pPOD mission.

ppod developed technologies that will enable integration

pPOD-developed technologies that will enable integration

An extensible core data model for phylogenetic data, with a query language and a persistence tool;

Components for peer-to-peer data integration and exchange, using schema mappings (Orchestra);

A scientific workflow system for interoperating the data integration components with the local database access components and with analysis tools (Kepler).

Ubiquitous theme:

strong support for systematics-oriented provenance management



Val Tannen (PI), Sue Davidson, Zack Ives, Junhyong Kim (coPIs)

Zhaowei Bao, T.J. Green, Greg Karvounarakis, U Pennsylvania

Bertram Ludaescher (PI), Shawn Bowers, Tim McPhillips (coPIs)

Manish Anand, UC Davis

Reed Beaman (PI), U Florida

Bill Piel (coPI), Greg Jordan, Yale


Peter Buneman, U Edinburgh

Michael Donoghue, Yale

Jim Leebens-Mack , U Georgia

Francois Lutzoni, Duke

David Maddison, U Arizona

Wayne Maddison, U British Columbia

Brent Mishler, UC Berkeley

Bernard Moret, EPF Lausanne

Rod Page, U Glasgow

Mike Sanderson, U Arizona

Todd Vision, U North Carolina and NESCENT

Collaboration with Sarah Cohen-Boulakia and Christine Froidevaux, U Paris-Sud

Many of these people and others

have participated in a

community feedback meeting

11-12 Sept. 2007 (at NESCENT)

the ppod core data model

The pPOD Core Data Model

The pPOD CDM team: Bill Piel, Shirley Cohen, Tim McPhillips, Shawn Bowers, Sarah Cohen-Boulakia, Reed Beaman, Val Tannen

Special thanks to Brent Mishler, David Maddison, Jeff Oliver, Rutger Vos, Francois Lutzoni, Martin Ramirez, Jonathan Coddington, Wayne Maddison, Fan Ge, Ashley Green,Jin Ruan, Martin Wu, John Lundberg, John Sullivan

MIAPA compatibility: working with Jim Leebens-Mack and Todd Vision

  • The Core Data Model (CDM) under development in the pPOD project will serve the following purposes:
      • It will allow experimentation with the modeling of provenance in phylogenetic pipelines.
      • It will serve as a schema for a persistence tool, to work (1) in standalone mode, (2) with our lab notebook suite and (3) integrated with Mesquite as a module.

3. It will serve as a target for schema mappings used to connect other AToL databases, resources like TreeBASE, etc., using the Orchestra integration engine.

the role of provenance
The Role of Provenance

Backwards provenance

Starting from a research “product”, eg. a tree, a supertree, a matrix, track backwards through stored objects to all the raw input information that led to this product.

Forwards provenance

Starting from a raw input, eg., a specimen, an image, a sequence, track forwards through stored objects to all research products that this input contributed to.

In both cases, navigate biological assumptions in both directions, eg., homology assumptions.

kinds of provenance
“Kinds” of Provenance
  • Relationship between stored objects
    • Eg., tree T123 was obtained from matrix M456 by Joe Bio on 01/31/2001 using PAUP with parameters…
    • Tracking through copy or cut/paste operations, possibly across repositories
  • Trace of data moving through a workflow
    • Sequence of timestamps, tool invocations (parameters), authors
  • Trace of data through a logically expressed view/query
    • Can be computed automatically as the view/query output is computed

In our CDM


In our



persistence tool







query language)








Persistence Tool


(an OO schema)


workflow tool




Modeled data:

Analyzed data: trees,

matrices,cells,(row) segments,

operational taxomic units (OTUs),taxa,

standard characters and their states,

genes,gene fragments

Raw data: standard views,images,


specimens,samples, collections

Technology: object-oriented modeling (from OO databases),

use Hibernate for relational storage

OQL (ODMG) and HQL (Hibernate) for query language

example of phylogenetic query
Example of Phylogenetic Query

Find all standard matrices

with some character C whose label contains the substring "elytra"

and some OTU whose state for character C contains the substring "transverse";

return all such matrices, together with their characters, OTUs and states satisfying the conditions.

semi formalized oql hql query example
Semi-formalized (OQL/HQL) query example

SELECT M, label of C, label of X,

label of state encoded in cell E

FROM M over all standard matrices,

C over all characters of M,

X over all OTUs of M,

E is the cell corresponding

to C and X in M

WHERE the label of C is like "*elytra*"

AND the label of the state encoded

in cell E is like "*transverse*"



Bill Piel and Greg Jordan (Yale)

Funded in part by NESCENT/Google Summer of Code.

A phylogeny browser and editor designed for large tree visualization and manipulation. Java applet web deployable or stand-alone application (under construction).

To be integrated with the pPOD CDM query language.

p2p integration

MIAPA-compatible DB



Other DB

of interest

Violet’s Lab

Virtual schema


P2P Integration

Alice’s AToL Lab


schema mapping

Core data model

(virtual schema)

Bob’s AToL Lab

Chester’s AToL Lab

what is a schema mapping
What Is a Schema Mapping?

For relational databases:

AliceTable(Attr1, - , Attr2, - ) CDMTable(Attr1,Attr2, - )

From TreeBASE II to pPOD CDM:

study(studyId,studyName,studyAcc), analysis(anId,studyId,anName,_), ANALYSISSTEP(stepId,anId,softId, dataId,stepNam,_,parms),SOFTWARE(softId,soft,-,-,-,-), analyzeddata(dataId,false,len), MATRIX(matrixId, dataId,matrixTitle,-,-,-), matrixrow(rowId,matrixId,mLabelId),TAXONLABEL(mLabelId,inLabel), PHYLOTREE(treeId,dataId,treeTitle), treenode(nodeId,treeId,tLabelId, nodeLabel), TAXONLABEL(nodeId,otuLabel)

TREE(treeId,treeTitle), PROVENANCE(provId,treeId,matrixId,soft,parms,-,-),

MATRIX(matrixId,matrixTitle), matrixotu(matrixId,otuId,-), OTU(otuId,otuLabel),

treeotu(treeId,inOtuId), OTU(inOtuId,inLabel)

the orchestra solution
The Orchestra Solution

Each peer is able to

Map its schema to other peers' schema

Query through its own view of the shared data

Publish its updates to the other peers

Reconcile automatically its data with other peers' data through mappings

Curate other peers data when imported

Define trust policies over the related databases

It is easy to change the mappings to adapt to schema changes, new data needs, ... A much simpler development task!

Zack Ives, Val Tannen, T.J. Green, Greg Karvounarakis

Working on deployment with Reed Beaman

the kepler ppod system scientific workflow and provenance support for the atol community


Department of

Computer Science

The Kepler/pPOD SystemScientific workflow and provenance support for the AToL community

Kepler/pPOD @UCDavis

Bertram Ludaescher, Shawn Bowers, Tim McPhillips, Manish Anand

  • Developing a scientific workflow system for phylogenetic data analysis
  • Automatic tracking and querying of scientist-oriented data provenance
scientific workflows in kepler
Scientific workflows in Kepler …
  • Automating analytical tools and the management of data, services, and provenance
    • Support for high-level modeling of analysis and data management tasks
    • Including data access, format conversion, and visualization
  • Scientific workflow benefits (over conventional scripting approaches)
    • Built-in provenance support
    • Workflow & component reuse
    • Analysis design & documentation
    • Archival, sharing
    • Improved performance
      • Distributed execution
      • Concurrency, pipelining
    • Uniform data access / models
aligning sequences using a cipres service and a local application
Aligning sequences using a CIPRes service and a local application
  • Distinct workflow components (‘actors’) automate steps in an analysis
  • Actors are provided for data load, store, and manipulation as well scientific calculations
  • Actors can provide access to remote services and automate local applications
  • Actors for applications and services using different data standards can be combined easily in a single workflow

A workflow for aligning protein sequences using the CIPRes ClustalW service

A workflow for refining protein sequence alignments using the local Gblocks application

  • Parameter settings for applications and services are saved to allow workflow to be executed again automatically or shared with others
  • System records in detail how results were computed in a workflow ‘trace’ file
a family of workflows for tree inference
A family of workflows for tree inference

Similar workflows for inferring trees using the CIPRes RAxML service...

  • In general, actors of this kind can be strung together in an intuitive order with minimal configuration.
  • Consequently, workflows can easily be modified to invoke different methods for a particular step.
  • Resulting workflows are easy to understand.
  • Behavior of a workflow is easy to predict.

...the CIPRes MrBayes service...

... and the local Phylip protpars program

integrated data provenance support
Integrated data provenance support …
  • Capture data dependencies during a workflow run
  • Store workflow results in trace files
  • Tools to view data provenance and navigate history of the workflow execution
  • Output of a run can be used as input of another (provenance is maintained)