Supertrees: Algorithms and Databases. Roderic Page University of Glasgow r.page@bio.gla.ac.uk DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life. What do we mean by the “Tree of Life”.


Supertrees: Algorithms and Databases

Roderic Page
University of Glasgow
r.page@bio.gla.ac.uk

DIMACS Working Group Meeting on Mathematical and Computational Aspects Related to the Study of The Tree of Life

What do we mean by the “Tree of Life”?

Our perception of what the tree is may affect what we view as being the “interesting” problems:

Tree algorithms, models, genomics, lateral gene transfer

or

Supertrees, datatypes, databases, taxonomy

Topics
  • Supertrees (MinCut)
  • Phylogenetic databases
Tree terminology

[Figure: rooted tree on leaves a, b, c, d, with labels for leaf, edge, internal node, cluster, and root. Clusters shown: {a,b}, {a,b,c}, and {a,b,c,d} at the root.]

Nestings and triplets

[Figure: rooted tree on leaves a, b, c, d]

Nestings
  • {a,b} <T {a,b,c,d}
  • {b,c} <T {a,b,c,d}

Triplets
  • (bc)d, also written bc|d
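The triplets displayed by a rooted tree can be listed mechanically. The following is a minimal Python sketch, not code from the talk. It assumes the tree on this slide has the shape (((a,b),c),d); the figure is not preserved, so that shape is only a guess consistent with the nestings listed. Nested tuples are a hypothetical encoding chosen for illustration.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return every rooted triplet xy|z displayed by the tree,
    encoded as (frozenset({x, y}), z)."""
    found = set()
    if isinstance(tree, str):
        return found
    below = leaves(tree)
    for child in tree:
        inside = leaves(child)
        outside = below - inside
        # x and y sit together in one child, z hangs off another child,
        # so x and y join before either of them joins z.
        for x, y in combinations(sorted(inside), 2):
            for z in outside:
                found.add((frozenset((x, y)), z))
        found |= triplets(child)
    return found

# Assumed shape of the slide's tree.
TREE = ((("a", "b"), "c"), "d")
displayed = triplets(TREE)
has_bc_d = (frozenset(("b", "c")), "d") in displayed
```

Under that assumed shape the tree displays four triplets (ab|c, ab|d, ac|d, bc|d), which includes the bc|d shown on the slide.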

Supertree

[Figure: T1 (leaves a, b, c) + T2 (leaves b, c, d) = supertree (leaves a, b, c, d)]

Some desirable properties of a supertree method (Steel et al., 2000)
  • The supertree can be computed in polynomial time
  • A grouping in one or more trees that is not contradicted by any other tree occurs in the supertree
Aho et al.’s algorithm (OneTree)

Aho, A. V., Sagiv, Y., Szymanski, T. G., and Ullman, J. D. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10: 405-421.

Input: set of rooted trees

1. If the set is compatible (i.e., a single tree agrees with every input tree), output that tree.

2. If the set is not compatible, stop!
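The two outcomes above can be sketched in Python, working on rooted triplets rather than whole trees. This is an illustrative reading of the Aho et al. procedure, not the author's implementation. The function name build, the tuple encoding, and the example inputs (T1 = ((a,b),c) contributing ab|c, T2 = ((b,c),d) contributing bc|d) are assumptions made for the example.

```python
def build(leaf_set, triples):
    """Sketch of the Aho et al. procedure on rooted triplets.

    leaf_set: set of leaf labels.
    triples:  iterable of (x, y, z), each meaning the triplet xy|z.
    Returns a nested-tuple tree, or None if the triplets are incompatible.
    """
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    # Only triplets whose three leaves are all still present matter here.
    local = [t for t in triples if set(t) <= leaf_set]
    # Union-find: join x and y for every triplet xy|z.
    parent = {leaf: leaf for leaf in leaf_set}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in local:
        parent[find(x)] = find(y)
    components = {}
    for leaf in leaf_set:
        components.setdefault(find(leaf), set()).add(leaf)
    if len(components) == 1:
        return None  # graph is connected: not compatible, stop
    children = []
    for comp in sorted(components.values(), key=min):
        subtree = build(comp, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)

# Assumed inputs: ab|c from T1 and bc|d from T2.
supertree = build({"a", "b", "c", "d"}, [("a", "b", "c"), ("b", "c", "d")])
# A conflicting pair of triplets on the same three leaves.
conflict = build({"a", "b", "c"}, [("a", "b", "c"), ("b", "c", "a")])
```

Each connected component becomes one child of the current node. A connected graph means no split is possible, which is the "stop" case.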

[Figure: Aho et al.’s OneTree algorithm applied to T1 (leaves a, b, c) and T2 (leaves b, c, d) to give the supertree. Node labels shown: {a, b}, {a, b, c}, and {a, b, c, d}.]

Mincut supertrees

Semple, C., and Steel, M. 2000. A supertree method for rooted trees. Discrete Appl. Math. 105: 147-158.

  • Modifies OneTree by cutting graph
  • Requires rooted trees (no analogue of OneTree for unrooted trees)
  • Recursive
  • Polynomial time
[Figure: input trees T1 and T2 on leaves a to e, and the graph S{T1,T2} built from them (Semple and Steel, 2000)]

Collapsing the graph (Semple and Steel mincut algorithm)

[Figure: the graph S{T1,T2} on nodes a, b, c, d, e with edge weights. One edge has maximum weight (2); collapsing it merges a and b into a single node {a,b}, giving S{T1,T2}/Emax{T1,T2}.]
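The edge weights and the collapsing step can be sketched in Python. This is an illustration under stated assumptions, not the talk's C++ code. It assumes that two leaves are joined when they share a proper cluster in a tree, that an edge's weight is the number of trees in which that happens, that "maximum weight" means a weight equal to the number of trees, and that parallel edges created by collapsing have their weights summed. The trees T1 and T2 below are hypothetical, since the slide's figures are not preserved.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def proper_cluster_graph(trees):
    """weights[{a, b}] = number of trees in which a and b share a proper cluster."""
    weights = Counter()
    for tree in trees:
        pairs = set()
        for child in tree:  # each child of the root spans a maximal proper cluster
            for a, b in combinations(sorted(leaves(child)), 2):
                pairs.add(frozenset((a, b)))
        weights.update(pairs)
    return weights

def collapse(weights, k):
    """Merge the endpoints of every edge of weight k; sum parallel edges.
    Only leaves that appear in at least one edge are considered."""
    nodes = set().union(*weights) if weights else set()
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for edge, w in weights.items():
        if w == k:
            a, b = edge
            parent[find(a)] = find(b)
    group = {}
    for v in nodes:
        group.setdefault(find(v), set()).add(v)
    label = {v: frozenset(group[find(v)]) for v in nodes}
    merged = Counter()
    for edge, w in weights.items():
        a, b = edge
        if label[a] != label[b]:
            merged[frozenset((label[a], label[b]))] += w
    return merged

# Hypothetical input trees on leaves a to e.
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", "b"), ("c", "d"))
weights = proper_cluster_graph([T1, T2])
collapsed = collapse(weights, k=2)
```

With these example trees only the edge a-b reaches weight 2, so a and b are merged into one node before any cut is made.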

Cut the graph to get supertree

[Figure: the collapsed graph S{T1,T2}/Emax{T1,T2} on nodes {a,b}, c, d, e, whose remaining edges have weight 1, is cut to give the supertree on leaves a, b, c, d, e]

My mincut supertree implementation (darwin.zoology.gla.ac.uk/~rpage/supertree)
  • Written in C++
  • Uses GTL (Graph Template Library) to handle graphs (formerly a free alternative to LEDA)
  • Finds all mincuts of a graph faster than Semple and Steel’s algorithm
Mincut gives this (strange) result
  • Disputed relationships among a, b, and c are resolved
  • x1, x2, and x3 collapsed into polytomy

[Figure: mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]

Problem: Cuts depend on connectivity (in this example it is a function of tree size)

[Figure: graph S{T1,T2} on nodes a, b, c, x1, x2, x3, y1, y2, y3, y4]

So, mincut doesn’t work
  • But, Semple and Steel said it did
  • My program seems to work
  • Argh!!! What is happening….?
What mincut does… …and does not do
  • Mincut supertree is guaranteed to include any nesting which occurs in all input trees
  • Makes no claims about nestings which occur in only some of the trees
  • “Does exactly what it says on the tin™”
Modifying mincut supertree
  • Can we incorporate more of the information in the input trees?
  • Three categories of information
  • Unanimous (all trees have that grouping)
  • Contradicted (trees explicitly disagree)
  • Uncontradicted (some trees have information that no other tree disagrees with)
Uncontradicted information (assume we have k input trees)

For a pair of leaves a and b:
  • c = number of trees in which a and b co-occur
  • n = number of trees in which a and b are nested

c - n = 0: uncontradicted (if c = k then unanimous)

c - n > 0: contradicted
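The rule above can be written as a short Python sketch. It assumes that "nested" means the two leaves share a proper cluster, i.e. both lie below the same child of the root. The slide does not spell this out, so treat it as an interpretation. The function name classify_pair, the nested-tuple encoding, and the example trees are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def classify_pair(a, b, trees):
    """Classify the pair (a, b) using the counts c and n over k input trees."""
    k = len(trees)
    c = 0  # trees in which a and b co-occur
    n = 0  # trees in which a and b are nested (assumed: share a proper cluster)
    for tree in trees:
        if {a, b} <= leaves(tree):
            c += 1
            if any({a, b} <= leaves(child) for child in tree):
                n += 1
    if c == 0:
        return "absent"  # the pair never co-occurs, so the rule says nothing
    if c - n > 0:
        return "contradicted"
    return "unanimous" if c == k else "uncontradicted"

# Hypothetical input trees.
T1 = (("a", "b"), "c")
T2 = (("b", "c"), "d")
T3 = (("a", "b"), "d")
ab_status = classify_pair("a", "b", [T1, T2, T3])
bc_status = classify_pair("b", "c", [T1, T2, T3])
ab_two_trees = classify_pair("a", "b", [T1, T3])
```

Here a and b are nested in every tree that contains both (T1 and T3) but are missing from T2, so the pair is uncontradicted rather than unanimous. b and c co-occur in T1 and T2 but are nested only in T2, so that pair is contradicted.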

Uncontradicted information (assume we have k input trees)

For a pair of leaves a and b:
  • c = number of trees in which a and b co-occur
  • n = number of trees in which a and b are nested
  • f = number of trees in which a and b are in a fan

c - n - f = 0: uncontradicted (if c = k then unanimous)

c - n - f > 0: contradicted

Classifying edges

[Figure: graph S{T1,T2} on nodes a, b, c, x1 to x3, and y1 to y4, with each edge classified as one of: uncontradicted; uncontradicted but adjacent to contradicted; contradicted]

Modified mincut
  • Species a, b, and c form a polytomy
  • x1, x2, and x3 resolved as per the input tree

[Figure: modified mincut supertree on leaves a, b, c, x1, x2, x3, y1, y2, y3, y4]

If no tree contradicts an item of information, is that information always in the supertree?

[Figure: four input trees on leaves 1 to 5, labelled with the triplets (12)5, (23)5, (34)1, and (45)1]

No! (Steel, Dress, & Böcker 2000)

[Figure: leaves 1 to 5]
  • The four trees display (12)5, (23)5, (34)1, and (45)1
  • No tree displays (IK)J or (JK)I for any (IJ)K above
  • Triplets are uncontradicted, but cannot form a tree
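The counterexample can be checked with a few lines of Python. This sketch looks only at the four triplets named on the slide, not at every triplet of the four input trees. It applies the first step of an Aho-style test: join x and y for each triplet xy|z, and if all leaves end up in one connected block, no tree can display all the triplets. The function names are hypothetical.

```python
def components(leaf_set, triples):
    """Connected blocks after joining x and y for every triplet (x, y, z) = xy|z."""
    parent = {v: v for v in leaf_set}

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    for x, y, _ in triples:
        parent[find(x)] = find(y)
    groups = {}
    for v in leaf_set:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def contradicted(triples):
    """Triplets xy|z for which xz|y or yz|x is also present in the list."""
    seen = {(frozenset((x, y)), z) for x, y, z in triples}
    return [
        (x, y, z)
        for x, y, z in triples
        if (frozenset((x, z)), y) in seen or (frozenset((y, z)), x) in seen
    ]

# (12)5, (23)5, (34)1, (45)1 from the slide.
TRIPLES = [(1, 2, 5), (2, 3, 5), (3, 4, 1), (4, 5, 1)]
conflicts = contradicted(TRIPLES)
blocks = components({1, 2, 3, 4, 5}, TRIPLES)
```

No triplet in the list is contradicted by another, yet the joins 1-2, 2-3, 3-4, and 4-5 chain all five leaves into a single block, so there is no way to make the first split.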
Future directions for supertrees
  • Improve handling of uncontradicted information
  • Add support for constraints
  • Visualising very big trees
  • Better integration into phylogeny databases (www.treebase.org)

darwin.zoology.gla.ac.uk/~rpage/supertree

Supertree Challenge (proposed by Mike Sanderson, mjsanderson@ucdavis.edu)

The TreeBASE database currently contains over 1000 phylogenies with over 11,000 taxa among them. Many of these trees share taxa with each other and are therefore candidates for the construction of composite phylogenies, or "supertrees", by various algorithms. A challenging problem is the construction of the largest and "best" supertree possible from this database. "Largest" and "best" may represent conflicting goals, however, because resolution of a supertree can be easily diminished by addition of "inappropriate" trees or taxa.

It’s a scandal
  • We cannot answer even the most basic question: “what is the phylogeny for group x?”
  • GenBank is currently the best phylogenetic database (!)
  • Can't even say how many species are in a given group
  • Little idea of who is doing what
Tree of Life (tolweb.org)
  • Provides text and images
  • Relies on extensive manual effort (e.g., writing text)
  • Can’t do any computations with it
  • Limited research value
TreeBASE (www.treebase.org)
  • Relational database
  • Query by author, taxon, study number
  • Compute supertrees
  • Submit NEXUS data files
TreeBASE and mincut supertrees

  • User selects two or more trees
  • Clicks on button, and a script on darwin.zoology.gla.ac.uk is run to create the supertree
  • Can view as PS, PDF, treefile, or in Java applet (ATV)
What’s wrong with TreeBASE?
  • No consistency of taxon names (e.g., Human, Homo sapiens, Homo sapiens X54666-1)

  • No consistency of data names (e.g., gene names, morphological characters, etc.)

www.all-species.org

“The ALL Species Foundation is a non-profit organization dedicated to the complete inventory of all species of life on Earth within the next 25 years - a human generation.”

Press Release: November 13, 2002

Starting December 1, the ALL Species Foundation will close its San Francisco office because of a lack of funding for the Foundation.

The first challenge
  • We need a taxonomic name server that can resolve the name of any organism
  • This server needs to reconcile multiple classifications (e.g., GenBank, ITIS, etc.)
  • Must handle at least 1 million names, perhaps 100 million
Second Challenge
  • How do we query trees?
  • Trees can be classifications or phylogenies
SQL Queries on Trees
  • Oracle SQL Transitive Closure Query (recursion)
  • Nested queries
  • Node path queries
Node paths

[Figure: tree whose nodes are labelled with their paths: /1, /1/1, /1/1/1, /1/1/1/1, /1/1/1/2, /1/1/2, /1/2, /1/2/1, /1/2/2, /2]

Node paths - selecting subtree

[Figure: the same path-labelled tree]

SELECT node
FROM nodes  -- table name assumed; the slide omits the FROM clause
WHERE (path LIKE '/1/1/%')
AND (path < '/1/10/%');

Node paths - selecting subtree (leaves only)

[Figure: the same path-labelled tree]

SELECT node
FROM nodes  -- table name assumed; the slide omits the FROM clause
WHERE (path LIKE '/1/1/%')
AND (path < '/1/10/%')
AND (num_children = 0);
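The two subtree queries can be tried against an in-memory SQLite database from Python. This is a sketch under assumptions: the table name nodes and its layout are hypothetical (the slides name only the path and num_children columns), the example selects path rather than a separate node column, and SQLite stands in for whatever database the talk had in mind.

```python
import sqlite3

# Node paths listed on the slides.
PATHS = ["/1", "/1/1", "/1/1/1", "/1/1/1/1", "/1/1/1/2",
         "/1/1/2", "/1/2", "/1/2/1", "/1/2/2", "/2"]

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per node, keyed by its path.
conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, num_children INTEGER)")
for p in PATHS:
    # A child of p is any path that extends p by exactly one component.
    kids = sum(1 for q in PATHS
               if q.startswith(p + "/") and q.count("/") == p.count("/") + 1)
    conn.execute("INSERT INTO nodes VALUES (?, ?)", (p, kids))

# Everything strictly below /1/1.
subtree = [row[0] for row in conn.execute(
    "SELECT path FROM nodes "
    "WHERE path LIKE '/1/1/%' AND path < '/1/10/%' "
    "ORDER BY path")]

# Only the leaves below /1/1.
leaf_rows = [row[0] for row in conn.execute(
    "SELECT path FROM nodes "
    "WHERE path LIKE '/1/1/%' AND path < '/1/10/%' AND num_children = 0 "
    "ORDER BY path")]
conn.close()
```

Because the pattern ends in '/%', the node /1/1 itself is not returned, only its descendants. The trailing slash in the pattern also keeps a sibling such as /1/10 out of the result.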

Node paths - LCA

[Figure: the same path-labelled tree]

The LCA (lowest common ancestor) of two nodes is the common substring of their paths starting from the left, cut at a “/” boundary.
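A minimal Python sketch of the LCA rule, comparing whole path components rather than single characters. The function name lca_path is hypothetical, and returning "/" when the two paths share no component is an assumption about how an implicit root would be written.

```python
def lca_path(p, q):
    """Longest common prefix of two node paths, compared component by component."""
    common = []
    for x, y in zip(p.split("/"), q.split("/")):
        if x != y:
            break
        common.append(x)
    # An empty join means nothing but the leading "/" was shared.
    return "/".join(common) or "/"

lca_leaves = lca_path("/1/1/1/2", "/1/1/2")
lca_tricky = lca_path("/1/1", "/1/10/2")
lca_top = lca_path("/1/2/1", "/2")
```

Comparing components matters for the second case: a character-by-character prefix of /1/1 and /1/10/2 would be /1/1, but the shared ancestor is /1.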

What do we do now…?
  • Set up a taxonomic name server (TNS)
  • Develop a phylogenetic database linked to TNS, PubMed, GenBank, etc.
  • Develop easy ways to populate database (e.g., from TreeBASE, GenBank, journal databases)
  • Develop standard set of tree queries
  • Deploy