Working with Trees in the Phyloinformatic Age

William H. Piel

Yale Peabody Museum

Hilmar Lapp

NESCent, Duke University


Dealing with the Growth of Phyloinformatics

  • Trees: Too Many

    • Search, organize, triage, summarize, synthesize

      • Review existing methods

      • Describe queries for BioSQL phylo extension

      • Making generic queries

  • Trees: Too Big

    • Visualizing and manipulating large trees

      • Demo PhyloWidget


Searching Stored Tree

  • Path Enumerations

  • Nested Sets

  • Adjacency Lists

  • Transitive Closure


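Dewey system:

[Figure: tree ((A,B),((C,D),E)) with every node labeled by its path from the root. Root: 0. Internal nodes: 0.1, 0.2, 0.2.1. Leaves: A = 0.1.1, B = 0.1.2, C = 0.2.1.1, D = 0.2.1.2, E = 0.2.2.]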


[Figure: tree with leaves A, B, C, D, E.]

Find clade for: Z = (<C_s + D_s)

Find the common prefix of the paths, reading from the left (here 0.2.1):

SELECT *
FROM nodes
WHERE (path LIKE '0.2.1%');
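A minimal sketch of the path-enumeration lookup, using Python's built-in sqlite3 module so that it runs as-is. The `nodes` table and `path` column follow the slide; the `node_label` column, the data rows, and the helper names `common_prefix` and `clade_labels` are assumptions made for illustration. The match is written as `path = prefix OR path LIKE prefix || '.%'` because a bare `LIKE '0.2.1%'` would also match a sibling path such as 0.2.10.

```python
import sqlite3

# Dewey-style paths for the example tree ((A,B),((C,D),E)).
# Rows are (path, node_label); internal nodes have no label.
ROWS = [
    ("0", None), ("0.1", None), ("0.1.1", "A"), ("0.1.2", "B"),
    ("0.2", None), ("0.2.1", None), ("0.2.1.1", "C"),
    ("0.2.1.2", "D"), ("0.2.2", "E"),
]

def common_prefix(path_a, path_b):
    """Longest shared run of path components, reading from the left."""
    shared = []
    for a, b in zip(path_a.split("."), path_b.split(".")):
        if a != b:
            break
        shared.append(a)
    return ".".join(shared)

def clade_labels(conn, prefix):
    """Leaf labels at or below the node whose path equals `prefix`."""
    cur = conn.execute(
        "SELECT node_label FROM nodes "
        "WHERE (path = ? OR path LIKE (? || '.%')) "
        "AND node_label IS NOT NULL ORDER BY path",
        (prefix, prefix),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, node_label TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", ROWS)

prefix = common_prefix("0.2.1.1", "0.2.1.2")  # paths of C and D
print(prefix, clade_labels(conn, prefix))
```

The appeal of this layout is that a clade lookup is a single indexed prefix scan; the cost is that path strings grow with tree depth.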



  • ATreeGrep

    • Uses special suffix indexing to optimize speed

    • Shasha, D., J. T. L. Wang, H. Shan and K. Zhang. 2002. ATreeGrep: Approximate Searching in Unordered Trees. Proceedings of the 14th SSDBM, Edinburgh, Scotland, pp. 89-98.

  • Crimson

    • Uses nested subtrees to avoid long strings

    • Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim, and S. B. Davidson. 2006. Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms. 32nd International Conference on Very Large Data Bases, ACM, pp. 1231-1234.


Searching Stored Tree

  • Path Enumerations

  • Nested Sets

  • Adjacency Lists

  • Metrics

  • Transitive Closure


[Figure: tree with leaves A to E; every node carries a left and a right ID, numbered 1 to 18.]

Depth-first traversal, scoring each node with a left and a right ID


[Figure: the tree with leaves A to E and left/right IDs 1 to 18.]

Minimum Spanning Clade of Node 5

SELECT *
FROM nodes
INNER JOIN nodes AS include
ON (nodes.left_id BETWEEN include.left_id AND include.right_id)
WHERE include.node_id = 5;
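The nested-set numbering and the clade query can be sketched end to end with Python's sqlite3 module. This is an illustration under assumptions: the tree is ((A,B),((C,D),E)), node IDs are handed out in preorder (so node 5 is the clade C, D, E), and the helper names `number_nodes` and `clade_of` are made up here. Both sides of the self-join are aliased, which keeps the query portable across database engines.

```python
import sqlite3

# Example tree as nested tuples; leaves are strings.
TREE = (("A", "B"), (("C", "D"), "E"))

def number_nodes(tree):
    """Depth-first walk returning (node_id, node_label, left_id, right_id) rows."""
    rows = []
    state = {"next_id": 0, "counter": 0}

    def visit(node):
        state["next_id"] += 1
        node_id = state["next_id"]          # preorder node ID
        state["counter"] += 1
        left_id = state["counter"]          # assigned on the way down
        if not isinstance(node, str):
            for child in node:
                visit(child)
        state["counter"] += 1               # right ID assigned on the way up
        label = node if isinstance(node, str) else None
        rows.append((node_id, label, left_id, state["counter"]))

    visit(tree)
    return sorted(rows)

def clade_of(conn, node_id):
    """Leaf labels whose left_id falls inside the interval of `node_id`."""
    cur = conn.execute(
        "SELECT member.node_label FROM nodes AS member "
        "INNER JOIN nodes AS include "
        "ON member.left_id BETWEEN include.left_id AND include.right_id "
        "WHERE include.node_id = ? AND member.node_label IS NOT NULL "
        "ORDER BY member.left_id",
        (node_id,),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
             "node_label TEXT, left_id INTEGER, right_id INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", number_nodes(TREE))
print(clade_of(conn, 5))
```

A subtree lookup becomes one range comparison, but inserting a node forces renumbering of every interval to its right.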



  • PhyloFinder

    • Duhong Chen et al.

    • http://pilin.cs.iastate.edu/phylofinder/

  • Mackey, A. 2002. Relational Modeling of Biological Data: Trees and Graphs. Bioinformatics Technology Conference. http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html


Searching Stored Tree

  • Path Enumerations

  • Nested Sets

  • Adjacency Lists

  • Metrics

  • Transitive Closure


[Figure: the tree with leaves A to E and node IDs 1 to 9, shown next to its adjacency-list table.]


Adjacency list for the example tree:

node_label:  -  -  A  B  -  -  C  D  E
node_id:     1  2  3  4  5  6  7  8  9
parent_id:   -  1  2  2  1  5  6  6  5

SQL query to find the parent node of node 'D':

SELECT *
FROM nodes AS parent
INNER JOIN nodes AS child
ON (child.parent_id = parent.node_id)
WHERE child.node_label = 'D';

…but this returns only the immediate parent; navigating further through the tree requires an external procedure.
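Database engines that support recursive common table expressions can walk an adjacency list inside SQL instead of in an external procedure. The sketch below uses SQLite through Python's sqlite3 module; the node IDs and the names `lineage` and `ancestors` are illustrative assumptions, not part of the original slides.

```python
import sqlite3

# Adjacency list for ((A,B),((C,D),E)); rows are (node_id, node_label, parent_id).
ROWS = [
    (1, None, None), (2, None, 1), (3, "A", 2), (4, "B", 2),
    (5, None, 1), (6, None, 5), (7, "C", 6), (8, "D", 6), (9, "E", 5),
]

# Start at the labeled node, then repeatedly join each row to its parent.
ANCESTORS_SQL = """
WITH RECURSIVE lineage(node_id, parent_id, depth) AS (
    SELECT node_id, parent_id, 0 FROM nodes WHERE node_label = ?
    UNION ALL
    SELECT p.node_id, p.parent_id, lineage.depth + 1
    FROM nodes AS p
    INNER JOIN lineage ON p.node_id = lineage.parent_id
)
SELECT node_id FROM lineage WHERE depth > 0 ORDER BY depth
"""

def ancestors(conn, label):
    """Node IDs of all ancestors of the labeled node, nearest first."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, (label,))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
             "node_label TEXT, parent_id INTEGER)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", ROWS)
print(ancestors(conn, "D"))
```

The recursion stops at the root because its `parent_id` is NULL and matches no row.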


Searching Stored Tree

  • Path Enumerations

  • Nested Sets

  • Adjacency Lists

  • Metrics

  • Transitive Closure


[Figure: two trees, each with leaves A, B, C, D.]

Searching trees by distance metrics: USim distance

Wang, J. T. L., H. Shan, D. Shasha and W. H. Piel. 2005. Fast Structural Search in Phylogenetic Databases. Evolutionary Bioinformatics Online, 1: 37-46.


Searching Stored Tree

  • Path Enumerations

  • Nested Sets

  • Adjacency Lists

  • Transitive Closure


Transitive Closure

  • Finding paths between vertices on a graph

  • DB2 and Oracle have special functions:

    • FROM Edge
      START WITH (child_id = A AND tree_id = T)
      CONNECT BY (PRIOR parent_id = child_id)
      AND (PRIOR tree_id = tree_id)

  • Nakhleh, L., D. Miranker, F. Barbancon, W. H. Piel, and M. Donoghue. 2003. Requirements of phylogenetic databases. Third IEEE Symposium on Bioinformatics and Bioengineering, p. 141-148.

  • Paths can be precomputed and stored: BioSQL


Dealing with the Growth of Phyloinformatics

  • Trees Too Many

    • Search, organize, triage, summarize, synthesize

      • Review existing methods

      • Describe queries for BioSQL phylo extension

      • Making generic queries

  • Trees Too Big

    • Visualizing and manipulating large trees

      • Demo PhyloWidget



BioSQL: http://www.biosql.org/

Schema for persistent storage of sequences and features tightly integrated with BioPerl (+ BioPython, BioJava, and BioRuby)

• phylodb extension designed at NESCent Hackathon

• Perl command-line interface by Jamie Estill, GSoC


[Figure: example tree with leaves A, B and C.]

Index of all paths from ancestors to descendants

CREATE TABLE node_path (
    child_node_id integer,
    parent_node_id integer,
    distance integer);
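One way such an index could be filled is to climb from every node to the root and record each ancestor on the way. The sketch below does this in Python against an in-memory SQLite table with the same three columns. The five-node tree and the helper name `path_rows` are assumptions for illustration, and whether to also store zero-length self paths is a design choice this sketch leaves out.

```python
import sqlite3

# child -> parent edges for an illustrative tree: 1 is the root,
# 2 groups leaves 3 (A) and 4 (B), and leaf 5 (C) hangs off the root.
PARENT = {2: 1, 3: 2, 4: 2, 5: 1}

def path_rows(parent):
    """Yield (child_node_id, parent_node_id, distance) for every ancestor."""
    for child in parent:
        ancestor, distance = parent[child], 1
        while ancestor is not None:
            yield (child, ancestor, distance)
            ancestor = parent.get(ancestor)   # None once we pass the root
            distance += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_path (child_node_id integer, "
             "parent_node_id integer, distance integer)")
conn.executemany("INSERT INTO node_path VALUES (?, ?, ?)", path_rows(PARENT))

rows = sorted(conn.execute("SELECT * FROM node_path"))
for row in rows:
    print(row)
```

The table trades storage for query simplicity: it holds one row per ancestor-descendant pair, so it grows with both tree size and depth.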


[Figure: example tree with leaves A, B and C.]

Find all paths where A and B share a common parent_node_id

SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B';


[Figure: example tree with leaves A, B and C.]

…of those paths, select the one with the shortest distance (the most recent common ancestor of A and B)

SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance
LIMIT 1;


[Figure: example tree with leaves A, B and C.]

…of those paths, select the one with the longest distance (the oldest shared ancestor of A and B)

SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance DESC
LIMIT 1;
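The shared-ancestor query can be tried on a small tree with Python's sqlite3 module. The SQL is the query from the slides with the labels turned into parameters; the five-node tree, its node IDs and the helper name `common_ancestor` are assumptions for illustration.

```python
import sqlite3

# Illustrative tree: 1 is the root, 2 groups A (3) and B (4), C (5) hangs off the root.
NODES = [(1, None), (2, None), (3, "A"), (4, "B"), (5, "C")]
PATHS = [(2, 1, 1), (3, 2, 1), (3, 1, 2), (4, 2, 1), (4, 1, 2), (5, 1, 1)]

COMMON_SQL = """
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id AND nA.node_label = ?
  AND pB.child_node_id = nB.node_id AND nB.node_label = ?
ORDER BY pA.distance {direction}
LIMIT 1
"""

def common_ancestor(conn, label_a, label_b, nearest=True):
    """Nearest (or, with nearest=False, farthest) ancestor shared by two labels."""
    sql = COMMON_SQL.format(direction="ASC" if nearest else "DESC")
    row = conn.execute(sql, (label_a, label_b)).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id integer, node_label text)")
conn.execute("CREATE TABLE node_path (child_node_id integer, "
             "parent_node_id integer, distance integer)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", NODES)
conn.executemany("INSERT INTO node_path VALUES (?, ?, ?)", PATHS)

print(common_ancestor(conn, "A", "B"))                 # nearest shared ancestor
print(common_ancestor(conn, "A", "B", nearest=False))  # farthest shared ancestor
```

Only the sort direction differs between the two variants, which is why the sketch builds both from one template.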


Find the maximum spanning clade (i.e. the subtree) for each tree that includes A and B but not C:

  • Return an adjacency list for each subtree
  • Get all ancestors shared by A and B
  • Exclude those that are also ancestors to C

SELECT e.parent_id AS parent, e.child_id AS child, ch.node_label, pt.tree_id
FROM node_path p, edges e, nodes pt, nodes ch
WHERE e.child_id = p.child_node_id
AND pt.node_id = e.parent_id
AND ch.node_id = e.child_id
AND p.parent_node_id IN (
    SELECT pA.parent_node_id
    FROM node_path pA, node_path pB, nodes nA, nodes nB
    WHERE pA.parent_node_id = pB.parent_node_id
    AND pA.child_node_id = nA.node_id
    AND nA.node_label = 'A'
    AND pB.child_node_id = nB.node_id
    AND nB.node_label = 'B')
AND NOT EXISTS (
    SELECT 1 FROM node_path np, nodes n
    WHERE np.child_node_id = n.node_id
    AND n.node_label = 'C'
    AND np.parent_node_id = p.parent_node_id);


Find trees that contain a clade that includes A and B but not C:

  • List the set of trees with these ancestors
  • Get all ancestors shared by A and B
  • Exclude those that are also ancestors to C

SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
AND ch.tree_id = t.tree_id
AND p.parent_node_id IN (
    SELECT pA.parent_node_id
    FROM node_path pA, node_path pB, nodes nA, nodes nB
    WHERE pA.parent_node_id = pB.parent_node_id
    AND pA.child_node_id = nA.node_id
    AND nA.node_label = 'A'
    AND pB.child_node_id = nB.node_id
    AND nB.node_label = 'B')
AND NOT EXISTS (
    SELECT 1 FROM node_path np, nodes n
    WHERE np.child_node_id = n.node_id
    AND n.node_label = 'C'
    AND np.parent_node_id = p.parent_node_id);
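To see the tree-level query discriminate between topologies, the sketch below loads two small trees into SQLite and runs the same SQL with the three labels as parameters. The trees ((A,B),C) and ((A,C),B), their node IDs and names, and the helpers `build` and `trees_with_clade` are all assumptions made for illustration.

```python
import sqlite3

# Rows are (node_id, tree_id, node_label, parent_id).
# Tree 1 is ((A,B),C); tree 2 is ((A,C),B).
NODES = [
    (1, 1, None, None), (2, 1, None, 1), (3, 1, "A", 2), (4, 1, "B", 2), (5, 1, "C", 1),
    (6, 2, None, None), (7, 2, None, 6), (8, 2, "A", 7), (9, 2, "C", 7), (10, 2, "B", 6),
]

QUERY = """
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
  AND ch.tree_id = t.tree_id
  AND p.parent_node_id IN (
      SELECT pA.parent_node_id
      FROM node_path pA, node_path pB, nodes nA, nodes nB
      WHERE pA.parent_node_id = pB.parent_node_id
        AND pA.child_node_id = nA.node_id AND nA.node_label = ?
        AND pB.child_node_id = nB.node_id AND nB.node_label = ?)
  AND NOT EXISTS (
      SELECT 1 FROM node_path np, nodes n
      WHERE np.child_node_id = n.node_id
        AND n.node_label = ?
        AND np.parent_node_id = p.parent_node_id)
"""

def build(conn):
    """Create the three tables and derive node_path from the parent pointers."""
    conn.executescript("""
        CREATE TABLE trees (tree_id integer, name text);
        CREATE TABLE nodes (node_id integer, tree_id integer, node_label text);
        CREATE TABLE node_path (child_node_id integer,
                                parent_node_id integer, distance integer);
    """)
    conn.executemany("INSERT INTO trees VALUES (?, ?)",
                     [(1, "tree one"), (2, "tree two")])
    parent = {row[0]: row[3] for row in NODES}
    for node_id, tree_id, label, _ in NODES:
        conn.execute("INSERT INTO nodes VALUES (?, ?, ?)", (node_id, tree_id, label))
        ancestor, distance = parent[node_id], 1
        while ancestor is not None:
            conn.execute("INSERT INTO node_path VALUES (?, ?, ?)",
                         (node_id, ancestor, distance))
            ancestor, distance = parent[ancestor], distance + 1

def trees_with_clade(conn, in_a, in_b, out):
    """Trees holding a clade that contains in_a and in_b but not out."""
    return sorted(conn.execute(QUERY, (in_a, in_b, out)).fetchall())

conn = sqlite3.connect(":memory:")
build(conn)
print(trees_with_clade(conn, "A", "B", "C"))
```

Because node IDs are unique across trees, an A from one tree and a B from another can never share a `parent_node_id`, so no explicit tree filter is needed in the inner subquery.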


Find trees that contain a clade that includes (A, B, C) but not D or E:

  • Get all ancestors of A, B, C from all trees that have A, B, C
  • Number of ingroups that share the node (HAVING COUNT(inP.child_node_id) = 3)
  • Exclude those that are also ancestors to D, E
  • But make sure that the tree still contains D, E
  • Number of non-ingroups that must be in the tree (HAVING COUNT(c.tree_id) = 2)
  • Number of clades that each tree must satisfy (HAVING COUNT(qry.node_id) = 1)

SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
FROM (SELECT DISTINCT ON (n.node_id) n.node_id, t.tree_id, t.name
      FROM trees t, nodes n,
           (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id
            FROM nodes inN, node_path inP
            WHERE inN.node_label IN ('A','B','C')
            AND inP.child_node_id = inN.node_id
            GROUP BY inN.tree_id, inP.parent_node_id
            HAVING COUNT(inP.child_node_id) = 3
            ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
      WHERE n.node_id IN (lca.parent_node_id)
      AND t.tree_id = n.tree_id
      AND NOT EXISTS (SELECT 1
                      FROM nodes outN, node_path outP
                      WHERE outN.node_label IN ('D','E')
                      AND outP.child_node_id = outN.node_id
                      AND outP.parent_node_id = lca.parent_node_id)
      AND EXISTS (SELECT c.tree_id
                  FROM trees c, nodes q
                  WHERE q.node_label IN ('D','E')
                  AND q.tree_id = c.tree_id
                  AND c.tree_id = t.tree_id
                  GROUP BY c.tree_id
                  HAVING COUNT(c.tree_id) = 2)) AS qry
GROUP BY (qry.tree_id)
HAVING COUNT(qry.node_id) = 1;


Here's a faster, cleaner version:

SELECT t.tree_id, t.name
FROM trees t
INNER JOIN
    (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
     FROM nodes inN, node_path inP
     WHERE inN.node_label IN ('A','B','C')
     AND inP.child_node_id = inN.node_id
     GROUP BY inN.tree_id, inP.parent_node_id
     HAVING COUNT(inP.child_node_id) = 3
     ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
USING (tree_id)
WHERE NOT EXISTS (
    SELECT 1
    FROM nodes outN, node_path outP
    WHERE outN.node_label IN ('D','E')
    AND outP.child_node_id = outN.node_id
    AND outP.parent_node_id = lca.parent_node_id)
AND EXISTS (
    SELECT c.tree_id
    FROM trees c, nodes q
    WHERE q.node_label IN ('D','E')
    AND q.tree_id = c.tree_id
    AND c.tree_id = t.tree_id
    GROUP BY c.tree_id
    HAVING COUNT(c.tree_id) = 2);


[Figure: tree with leaves A to E and node IDs 1 to 9.]

Matching a whole tree means querying for all clades:

  • (A, B) but not C, D, E
  • (C, D) but not A, B, E
  • (C, D, E) but not A, B
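The list of clade queries for a whole tree can be generated mechanically. The sketch below walks a tree written as nested tuples and emits one (ingroup, outgroup) pair per internal node below the root; the tuple representation and the function names `leaves` and `clade_queries` are assumptions for illustration.

```python
# Tree ((A,B),((C,D),E)) as nested tuples; leaves are strings.
TREE = (("A", "B"), (("C", "D"), "E"))

def leaves(node):
    """All leaf labels under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def clade_queries(tree):
    """(ingroup, outgroup) pairs for every internal node below the root."""
    everything = leaves(tree)
    queries = []

    def visit(node, is_root):
        if isinstance(node, str):
            return                      # a single leaf defines no clade query
        if not is_root:
            ingroup = leaves(node)
            outgroup = [leaf for leaf in everything if leaf not in ingroup]
            queries.append((ingroup, outgroup))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return queries

for ingroup, outgroup in clade_queries(TREE):
    print(ingroup, "but not", outgroup)
```

The root is skipped because its ingroup is every leaf, which leaves nothing to exclude.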


Dealing with the Growth of Phyloinformatics

  • Trees Too Many

    • Search, organize, triage, summarize, synthesize

      • Review existing methods

      • Describe queries for BioSQL phylo extension

      • Making generic queries

  • Trees Too Big

    • Visualizing and manipulating large trees

      • Demo PhyloWidget


[Figure: two alternative trees for Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus and Felis catus.]

Mining trees for interesting, general, relationship questions:

(((Sus_scrofa, Hippopotamus),Balaenoptera),Equus_caballus)

vs

((Sus_scrofa, (Hippopotamus,Balaenoptera)),Equus_caballus)


Even with perfectly resolved OTUs, you will still fail to hit relevant trees:

[Figure: two trees whose taxa overlap only partly. One has Sus scrofa and Equus caballus where the other has Sus celebensis and Equus asinus; Hippopotamus, Balaenoptera and Felis catus appear in both.]


[Figure: tree with leaves A to E and node IDs 1 to 9.]

Step 1: for each clade of every tree in the database, run a stem query against a classification tree (e.g. NCBI)

Step 2: label each node with an NCBI taxon id (if there is a match)

Step 3: do the same for the query tree

Stem Queries:

Node 2: (>A, B - C, D, E)
Node 3: (>A - B, C, D, E)
Node 4: (>B - A, C, D, E)
Node 5: (>C, D, E - A, B)
Node 6: (>C, D - A, B, E)
Node 7: (>C - A, B, D, E)
Node 8: (>D - A, B, C, E)
Node 9: (>E - A, B, C, D)


[Figure: primate trees whose nodes carry species names (e.g. Gorilla gorilla, Pongo pygmaeus, Homo sapiens, Pan troglodytes, Macaca sinica, Macaca nigra, Macaca irus) alongside the taxon names assigned by stem queries (e.g. Gorilla, Homo, Pan, Hominoidea, Cercopithecoidea).]

Rename nodes according to their deepest stem query…
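A stem query against a classification can be sketched as: take the taxa that contain every ingroup member, drop any that also contain an outgroup member, and keep the most inclusive one that remains. The Python below does this on a deliberately simplified toy classification. The `PARENT` table omits many ranks and is not NCBI data, and the names `lineage` and `stem_query` are assumptions made for illustration.

```python
# Toy classification (child -> parent); intermediate ranks are omitted on purpose.
PARENT = {
    "Homo sapiens": "Homo", "Homo": "Hominoidea",
    "Pan troglodytes": "Pan", "Pan": "Hominoidea",
    "Gorilla gorilla": "Gorilla", "Gorilla": "Hominoidea",
    "Macaca sinica": "Macaca", "Macaca nigra": "Macaca",
    "Macaca": "Cercopithecoidea",
    "Hominoidea": "Catarrhini", "Cercopithecoidea": "Catarrhini",
    "Catarrhini": None,
}

def lineage(taxon):
    """The taxon itself followed by its ancestors, tip to root."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT[taxon]
    return chain

def stem_query(ingroup, outgroup):
    """Most inclusive taxon holding every ingroup member and no outgroup member."""
    excluded = {taxon for member in outgroup for taxon in lineage(member)}
    shared = [taxon for taxon in lineage(ingroup[0])
              if all(taxon in lineage(member) for member in ingroup)
              and taxon not in excluded]
    # `shared` runs tip to root, so its last entry is the most inclusive match.
    return shared[-1] if shared else None

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
macaques = ["Macaca sinica", "Macaca nigra"]
print(stem_query(["Homo sapiens"], apes[1:] + macaques))
print(stem_query(apes, macaques))
print(stem_query(["Macaca sinica"], ["Macaca nigra"] + apes))
```

Under this toy classification a lone Homo sapiens generalizes to Homo, while Macaca sinica keeps its species name because the genus also holds Macaca nigra, which is in the outgroup.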


Dealing with the Growth of Phyloinformatics

  • Trees Too Many

    • Search, organize, triage, summarize, synthesize

      • Review existing methods

      • Describe queries for BioSQL phylo extension

      • Making generic queries

  • Trees Too Big

    • Visualizing and manipulating large trees

      • Demo PhyloWidget


PhyloWidget

  • Greg Jordan

    • Google Summer of Code student

    • Nick Goldman's group, EBI

  • Java Applet

    • Uses the Processing graphics library

  • Originally intended as a graphical phylogenetic query and display tool for TreeBASE, BioSQL, etc.

  • Can be used for:

    • Manipulating, visualizing large trees

    • Building supertrees through pruning & grafting