
Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data

Kobe University. Toshiki Saitoh (ERATO). Joint work with Masashi Kiyomi (JAIST) and Yoshio Okamoto (JAIST). Yokohama City University. The University of Electro-Communications.


Presentation Transcript


  1. Efficient Enumeration of the Directed Binary Perfect Phylogenies from Incomplete Data. Kobe University. Toshiki Saitoh (ERATO). Joint work with Masashi Kiyomi (JAIST) and Yoshio Okamoto (JAIST). Yokohama City University. The University of Electro-Communications. 11th International Symposium on Experimental Algorithms, Bordeaux, France, June 7-9, 2012.

  2. Directed Binary Perfect Phylogeny • State changes: 0 → 1 is allowed; 1 → 0 is not allowed. • Input: A species-character matrix M • All characters are binary. • m_sc = 1 iff the species s has the character c • Output: A directed perfect phylogeny • An unordered rooted tree in which each leaf carries one species label. • Each character labels exactly one node. • A species s has a character c if and only if the leaf labeled s is a descendant of the node labeled c. [Figure: example tree with character nodes c1-c6 and species leaves s1-s5]

  3. Directed Binary Perfect Phylogeny • Lemma [Jansson, 2008]: A matrix M admits a directed perfect phylogeny if and only if for every pair of columns i and j, either Ci and Cj are disjoint or one contains the other. • Ci: the set of species with the character ci • We can construct a phylogeny in polynomial time. • Example: C3 = {s1, s2, s4}, C4 = {s1, s4}, C6 = {s3}, so C4 ⊆ C3. [Figure: example tree with character nodes c1-c6 and species leaves s1-s5]
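The lemma on this slide can be checked directly on a complete matrix. The following is a minimal sketch, not the authors' code: the function name `admits_perfect_phylogeny` and the two small matrices are illustrative assumptions. Only the sets C3, C4 and C6 come from the slide, so the first example uses just those three columns.

```python
from itertools import combinations

def admits_perfect_phylogeny(matrix):
    """Return True if every pair of character columns is disjoint or nested.

    matrix: non-empty list of rows (one per species), each a list of 0/1 entries.
    The function name and input format are assumptions made for illustration.
    """
    num_chars = len(matrix[0])
    # column_sets[j] = set of species (row indices) that have character j.
    column_sets = [
        {s for s, row in enumerate(matrix) if row[j] == 1}
        for j in range(num_chars)
    ]
    for ci, cj in combinations(column_sets, 2):
        if not (ci.isdisjoint(cj) or ci <= cj or cj <= ci):
            return False
    return True

# Rows s1..s5, columns c3, c4, c6 with C3={s1,s2,s4}, C4={s1,s4}, C6={s3}.
nested = [
    [1, 1, 0],  # s1
    [1, 0, 0],  # s2
    [0, 0, 1],  # s3
    [1, 1, 0],  # s4
    [0, 0, 0],  # s5
]
# Hypothetical counterexample: the two columns overlap but neither contains the other.
crossing = [
    [1, 0],
    [1, 1],
    [0, 1],
]
nested_ok = admits_perfect_phylogeny(nested)
crossing_ok = admits_perfect_phylogeny(crossing)
```

The check compares all column pairs, so it takes time quadratic in the number of characters; it is meant to illustrate the condition, not the fastest known construction.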

  4. Incomplete Directed Perfect Phylogenies • Input: An incomplete species-character matrix • The states of some characters are unknown. • Output: A directed perfect phylogeny • The unknown states are completed. • We can find one phylogeny in polynomial time [Pe’er et al., 2004]. [Figure: a perfect phylogeny with character nodes C1-C6 and species leaves S1-S5]

  5. Incomplete Directed Perfect Phylogenies • Input: An incomplete species-character matrix • The states of some characters are unknown. • Output: A directed perfect phylogeny • The unknown states are completed. • Goal: enumeration of all perfect phylogenies from incomplete data. [Figure: two different perfect phylogenies over character nodes C1-C6 and species leaves S1-S5]

  6. Why Enumeration? • Data mining • Extraction of characters from all objects • Indexing • Counting • Random sampling • Searching • Filtering [Figure: a list of perfect phylogenies over character nodes C1-C6 and species leaves S1-S5]

  7. Our Contribution • Proposing two enumeration algorithms • Branch and bound (B&B) • Outputs all perfect phylogenies one by one • Runs in O(|M| kh) time • k: number of “?” entries in M, h: number of perfect phylogenies • ZDD approach • Represents all perfect phylogenies compactly • Many applications • Counting, random sampling, filtering • Proof of #P-hardness of the counting problem • By reduction from counting the number of matchings in a bipartite graph
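The slides do not give the details of the branch and bound algorithm, so the following is only a simple backtracking sketch in the same spirit, not the authors' method and without their O(|M| kh) guarantee. It assumes unknown entries are stored as `None`, fills them one at a time, and prunes as soon as two columns provably violate the lemma, that is, when the known entries already show the row patterns (1,1), (1,0) and (0,1). The names `conflicts` and `enumerate_completions` are hypothetical.

```python
from itertools import combinations

def conflicts(matrix):
    """True if the known entries of some column pair already violate the lemma."""
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        seen = {(row[i], row[j]) for row in matrix}
        # Overlapping columns where neither contains the other.
        if {(1, 1), (1, 0), (0, 1)} <= seen:
            return True
    return False

def enumerate_completions(matrix):
    """Yield each 0/1 completion of `matrix` (None = unknown) that passes the lemma.

    The matrix is modified in place during the search and restored afterwards.
    """
    if conflicts(matrix):
        return  # prune: no completion below this point can succeed
    for row in matrix:
        for c, value in enumerate(row):
            if value is None:
                for bit in (0, 1):
                    row[c] = bit
                    yield from enumerate_completions(matrix)
                row[c] = None  # undo the choice before backtracking
                return
    # No unknown entries left: the matrix is complete and conflict-free.
    yield [r[:] for r in matrix]

# Hypothetical instance with two unknown entries.
incomplete = [
    [1, None],
    [1, 1],
    [None, 1],
]
completions = list(enumerate_completions(incomplete))
# Three of the four possible completions survive; filling both unknowns
# with 0 produces two crossing columns and is pruned.
```

Because a conflict among known entries can never be repaired by later choices, the pruning test is safe; at a leaf it coincides with the pairwise condition of the lemma.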

  8. What is a ZDD? • ZDD: Zero-suppressed Binary Decision Diagram • Proposed by Minato [Minato, 1993] • Compact representation of a Boolean function • A Boolean function corresponds to a family of sets. • Example: F = (x1 ∧ x2 ∧ ¬x3) ∨ (¬x2 ∧ x3) corresponds to the family {{x1, x2}, {x1, x3}, {x3}}. • Applying the reduction rules (uniqueness and zero-suppression) to the binary decision tree representing F yields the ZDD of F. [Figure: binary decision tree representing F and the ZDD of F]

  9. Reduction Rules • 1. Uniqueness: merge duplicate nodes (isomorphic subgraphs). • 2. Zero-suppression: eliminate redundant nodes (nodes whose 1-edge points to the 0-terminal). • A ZDD represents a family of sets in a compressed way. • There are algebraic operations for families of sets over ZDDs. [Figure: illustration of the two reduction rules]
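Both reduction rules fit into a single node constructor. The sketch below is a toy illustration and not the ZDD library used in the talk; the class name `TinyZDD` and its methods are assumptions. It builds the family {{x1, x2}, {x1, x3}, {x3}} from the previous slide with variable order x1, x2, x3.

```python
class TinyZDD:
    """Toy ZDD store; node ids 0 and 1 are the terminals (illustrative only)."""

    def __init__(self):
        self.nodes = [None, None]  # index = node id; terminals hold no triple
        self.unique = {}           # (var, lo, hi) -> node id

    def node(self, var, lo, hi):
        # Zero-suppression: a node whose 1-edge leads to terminal 0 is dropped.
        if hi == 0:
            return lo
        # Uniqueness: reuse an existing node with the same (var, lo, hi).
        key = (var, lo, hi)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def sets(self, n):
        """List the family of sets represented by node n."""
        if n == 0:
            return []
        if n == 1:
            return [frozenset()]
        var, lo, hi = self.nodes[n]
        return self.sets(lo) + [s | {var} for s in self.sets(hi)]

z = TinyZDD()
n3 = z.node(3, 0, 1)      # {{x3}}
n2 = z.node(2, n3, 1)     # {{x3}, {x2}}
root = z.node(1, n3, n2)  # {{x3}, {x1, x3}, {x1, x2}}
family = set(z.sets(root))
```

The x3 node is shared by both branches of the root, so the whole family needs only three internal nodes.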

  10. Algebraic Operations on ZDDs • Family algebras • Union, intersection, difference, join, quotient, remainder, etc. • Filtering objects in ZDDs • Counting (random sampling) and optimization • These operations can be performed in almost linear time. • Example (union): {{x1, x2}, {x1, x3}, {x3}} ∪ {{x1}, {x1, x2, x3}} = {{x1}, {x3}, {x1, x2}, {x1, x3}, {x1, x2, x3}} [Figure: the ZDDs of the two operands and of their union]
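The union example on this slide can be reproduced with a small recursive implementation. This is a hedged sketch rather than the library used in the talk: nodes are plain tuples `(var, lo, hi)`, the terminals are the integers 0 and 1, and memoization via `functools.lru_cache` stands in for a real operation cache. All function names are assumptions.

```python
from functools import lru_cache

EMPTY, UNIT = 0, 1  # terminals: the empty family, and the family {∅}

def mk(var, lo, hi):
    """Create a node, applying zero-suppression."""
    return lo if hi == EMPTY else (var, lo, hi)

def top(f):
    """Variable at the root of f; terminals sort after every variable."""
    return f[0] if isinstance(f, tuple) else float("inf")

@lru_cache(maxsize=None)
def union(f, g):
    if f == EMPTY:
        return g
    if g == EMPTY or f == g:
        return f
    vf, vg = top(f), top(g)
    if vf < vg:
        return mk(vf, union(f[1], g), f[2])
    if vg < vf:
        return mk(vg, union(f, g[1]), g[2])
    return mk(vf, union(f[1], g[1]), union(f[2], g[2]))

@lru_cache(maxsize=None)
def count(f):
    """Number of sets in the family, computed without listing them."""
    if f == EMPTY:
        return 0
    if f == UNIT:
        return 1
    return count(f[1]) + count(f[2])

def from_sets(family):
    """Build a ZDD from an explicit family of sets of integer variables."""
    f = EMPTY
    for s in family:
        g = UNIT
        for var in sorted(s, reverse=True):
            g = mk(var, EMPTY, g)
        f = union(f, g)
    return f

def to_sets(f):
    if f == EMPTY:
        return set()
    if f == UNIT:
        return {frozenset()}
    var, lo, hi = f
    return to_sets(lo) | {s | {var} for s in to_sets(hi)}

a = from_sets([{1, 2}, {1, 3}, {3}])
b = from_sets([{1}, {1, 2, 3}])
u = union(a, b)
num_sets = count(u)  # 5, matching the five sets listed on the slide
```

Counting follows the diagram instead of the explicit list of sets, which is why it remains feasible even when the family is far larger than the ZDD.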

  11. Perfect Phylogenies and ZDD • Introducing a Boolean variable x_sc for each species s and character c • x_sc = 1 if and only if the species s has the character c. • Lemma [Jansson, 2008]: A matrix M admits a directed perfect phylogeny if and only if for every pair of columns i and j, either Ci and Cj are disjoint or one contains the other. [Figure: part of a ZDD over the variables x11, x21, x22, x32]

  12. Perfect Phylogenies and ZDD • Introducing a Boolean variable x_sc for each species s and character c • x_sc = 1 if and only if the species s has the character c. • Lemma [Jansson, 2008]: A matrix M admits a directed perfect phylogeny if and only if for every pair of columns i and j, either Ci and Cj are disjoint or one contains the other. • Equivalently, for every pair of distinct characters ci and cj, at least one of the following three conditions is satisfied: • for all species s, if x_{s,ci} = 1 then x_{s,cj} = 1 (Ci ⊆ Cj) • for all species s, if x_{s,ci} = 0 then x_{s,cj} = 0 (Cj ⊆ Ci) • for all species s, if x_{s,ci} = 1 then x_{s,cj} = 0 (Ci and Cj are disjoint)

  13. Experiments • Instances: • Constructing incomplete data from complete data • Random data set [Hudson, 2002] • Each “1” or “0” is replaced by “?” with probability p ∈ {0.1, 0.2, 0.3, 0.4, 0.5} • Matrix size (n, m) with n ∈ {50, 100} and m ∈ {50, 100} • 100 instances for each triple (n, m, p) • The B&B algorithm is written in C. • The ZDD approach is written in C++ (using the ZDD library developed by Minato). • Machine spec • OS: SuSE Linux Enterprise Server 10 • CPU: Quad-Core AMD Opteron Processor 8393 • #CPUs: 16, #Processors: 32, Clock Freq.: 3092 MHz • Memory: 512 GB
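The masking step used to derive incomplete instances from complete data can be sketched as follows. The slides do not show the actual generator, so the function name `mask_entries`, the use of `None` for “?”, and the seeded `random.Random` instance are all assumptions made for illustration.

```python
import random

def mask_entries(matrix, p, rng):
    """Replace each entry by None ("?") independently with probability p."""
    return [[None if rng.random() < p else value for value in row]
            for row in matrix]

# Hypothetical 3 x 4 complete matrix; the experiments used 50 or 100 rows and columns.
complete = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
rng = random.Random(2012)  # fixed seed so the sketch is reproducible
masked = mask_entries(complete, 0.3, rng)
```

Entries that survive the masking keep their original value, so the complete matrix is always one valid completion of the masked instance.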

  14. Experimental Results • The number of instances solved by B&B and by the ZDD approach (“solved” means that the algorithm halts successfully). • Timeout: 2 minutes

  15. Experimental Results

  16. Experimental Results

  17. Experimental Results • The size of the ZDD is 10^17.77 times smaller than the number of perfect phylogenies.

  18. Conclusion • Our results • Proposing two enumeration algorithms • Branch and bound algorithm (B&B) • ZDD approach • The ZDD approach solved more instances than B&B. • Its running time grows with the ZDD size. • The ZDD shows a high compression rate on the random data. • Proof of #P-hardness of the counting problem • Thank you for your attention!
