
High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader]



Presentation Transcript


  1. High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation

  2. Motivation • Phylogeny reconstruction from molecular data • Poses a complex optimization problem • NP-hard and thus computationally intractable • High-performance algorithm engineering • Reduces the running time of existing phylogenetic algorithms CMSC 838T – Presentation

  3. Talk Overview • Background • Breakpoint Phylogeny • Breakpoint Analysis • Re-Engineering Techniques • Impact in Computational Biology • Observations

  4. Background • Algorithm Engineering • Transforms a pencil-and-paper algorithm into an efficient, robust implementation • Main focus is experimentation • High-Performance Algorithm Engineering • Treats running time and solution quality as the paramount goals • Includes parallelism • Refines the serial part of the code • Cache-aware programming is a key to performance

  5. Background • Phylogeny • Reconstruction of the evolutionary history of a collection of organisms • Takes the form of an evolutionary tree • Computational Phylogenetics • Is extremely computation-intensive • Methods for sequence data (RNA, DNA, amino acid, protein) do not scale up to whole genomes • Genome-level data • At this level, evolution is slow • Enables us to recover deep evolutionary relationships • Much harder to analyze than sequence data • Optimization criteria • Heuristics • Parsimony criterion • Maximum likelihood

  6. Breakpoint Phylogeny • Deals with simple genomic data • Organisms have a single chromosome or contain single-chromosome organelles • Each chromosome can be represented by an ordering of oriented genes • The evolutionary process includes inversion, transposition, insertion, deletion and duplication • Approaches • Construct a parsimonious tree • Known or conjectured to be NP-hard • No automated tool to solve it • Neighbor-joining heuristics • Fast and valuable • Can’t recover the ancestral gene orders • Breakpoint phylogeny by Blanchette and Sankoff

  7. Breakpoint Phylogeny • A more specialized case: • All the genomes have the same set of genes • Each gene appears exactly once • Is of interest to biologists • Inversions are the main evolutionary mechanism on such genomes • Works well for certain datasets • Implementation developed by Sankoff and Blanchette • Breakpoint Analysis (BPAnalysis) • Too slow to be used on anything other than small datasets with a few genes

  8. Breakpoint Analysis: Details • Breakpoint: • Given two genomes G and G’ with the same set of genes, where each gene appears exactly once in each genome • An ordered pair of genes (g_i, g_j) appears as adjacent genes in G • Neither (g_i, g_j) nor (-g_j, -g_i) appears in G’ • Breakpoint Distance • Number of breakpoints between two genomes • Median for three genomes • The genome that minimizes the sum of the breakpoint distances to the three genomes • Median Problem for Breakpoints (MPB) • Construct a median of the given genomes • NP-hard
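The breakpoint definition on this slide translates directly into a few lines of code. The following is a minimal Python sketch, assuming genomes are given as linear sequences of signed integers (sign encodes gene orientation). The function names and the `circular` flag are illustrative assumptions, not taken from BPAnalysis or GRAPPA.

```python
def adjacencies(genome, circular=False):
    """List the ordered pairs of adjacent signed genes in a gene order."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        # Hypothetical option: also count the wrap-around adjacency.
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g, g_prime, circular=False):
    """Count adjacencies (a, b) of g for which neither (a, b)
    nor (-b, -a) is an adjacency of g_prime."""
    present = set()
    for a, b in adjacencies(g_prime, circular):
        present.add((a, b))
        present.add((-b, -a))  # same adjacency read on the opposite strand
    return sum(1 for pair in adjacencies(g, circular) if pair not in present)


if __name__ == "__main__":
    # Inverting the segment (2, 3) creates two breakpoints.
    print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))
```

Reading a whole genome on the opposite strand, as in `[-4, -3, -2, -1]` versus `[1, 2, 3, 4]`, gives distance 0, which is why the definition treats (g_i, g_j) and (-g_j, -g_i) as the same adjacency.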

  9. Breakpoint Analysis • Method developed by Sankoff and Blanchette to solve the breakpoint phylogeny problem • Uses a reduction from the Median Problem for Breakpoints (MPB) to the Travelling Salesman Problem (TSP) • Directed MPB to undirected TSP • Each gene is represented by a pair of cities connected by an edge • Outer loop enumerates all (2n-5)!! trees on n leaves • Inner loop runs an unknown number of iterations • Computational complexity is exponential in both the number of genomes and the number of genes
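To get a feel for the size of the outer loop, the double factorial (2n-5)!! = 1 · 3 · 5 · … · (2n-5) can be evaluated directly. This is a small illustrative Python sketch; the function name is an assumption made for this example.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n == 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 13):
        print(n, num_unrooted_trees(n))
```

Going from 10 to 13 leaves multiplies the number of trees by 17 · 19 · 21 = 6783, from 2,027,025 to 13,749,310,575.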

  10. Breakpoint Analysis
      Initially label all internal nodes with gene orders
      Repeat
          For each internal node v, with neighbors A, B, C, do
              Solve the MPB on A, B, C to yield label m
              If relabelling v with m improves the score of T, then do it
      until no internal node can be relabelled
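The relabelling loop on this slide can be sketched as follows. This is a toy Python version under stated assumptions: genomes are linear tuples of signed integers, the tree is given as an edge list with every internal node having exactly three neighbors, and the median is found by exhaustive search (`brute_force_median`, a hypothetical stand-in for the TSP-based solver, usable only for a handful of genes). All names are illustrative.

```python
from itertools import permutations, product


def breakpoints(g, h):
    """Breakpoint distance between two linear signed gene orders."""
    seen = set()
    for a, b in zip(h, h[1:]):
        seen.add((a, b))
        seen.add((-b, -a))
    return sum(1 for pair in zip(g, g[1:]) if pair not in seen)


def brute_force_median(a, b, c):
    """Exhaustive median of three genomes (stand-in for the MPB solver)."""
    genes = sorted(abs(x) for x in a)
    best, best_cost = None, None
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            cand = tuple(s * x for s, x in zip(signs, order))
            cost = sum(breakpoints(cand, x) for x in (a, b, c))
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
    return best


def tree_score(edges, labels):
    """Sum of breakpoint distances over all tree edges."""
    return sum(breakpoints(labels[u], labels[v]) for u, v in edges)


def relabel_until_stable(edges, labels, internal):
    """Repeat the median relabelling until no node improves the score."""
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    improved = True
    while improved:
        improved = False
        for v in internal:
            a, b, c = (labels[u] for u in neighbors[v])
            old, before = labels[v], tree_score(edges, labels)
            labels[v] = brute_force_median(a, b, c)
            if tree_score(edges, labels) < before:
                improved = True   # keep the new label
            else:
                labels[v] = old   # no strict improvement: revert
    return tree_score(edges, labels)
```

Because a label is kept only when the integer score strictly decreases, the loop always terminates, but it may stop at a local optimum. That is consistent with the "unknown number of iterations" noted for the inner loop.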

  11. Re-Engineering Techniques • Profiling: • Identify bottlenecks to balance the implementation • Eliminate problems such as excessive resource consumption or poor results • Examples: • Hand-unrolling loops cut the running time by a factor of at least six • Refined distance computations • Refined lower bound computations • Speedup of one order of magnitude on the Campanulaceae dataset

  12. Re-Engineering Techniques • Cache Awareness • Memory footprint • BPAnalysis: 60MB • GRAPPA: 1.8MB • Memory locality • BPAnalysis: poor locality, working set size of about 12MB • GRAPPA: good locality, working set size of about 600KB • Minimizes pointer dereferencing • Reuses allocated storage • Studies indicate that the gain is likely to be a factor of anywhere from 2 to 40

  13. Re-Engineering Techniques • Low-level Algorithmic Changes • Using all of the available information • Examples: • Using a lower bound to eliminate over 95% of the trees • Taking advantage of special structure: the TSP instance has only two nontrivial edge costs (1 and 2) • Speedup by a factor of 5-10
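The slide does not say which lower bound is used. One bound that follows from the triangle inequality alone is sketched below, as an assumption about the kind of bound meant rather than a description of GRAPPA's code: if the leaves are listed in the circular order in which they appear around the tree, every tree edge lies on exactly two of the paths between consecutive leaves, so the tree score is at least half the sum of the consecutive leaf-to-leaf distances. Function names are illustrative.

```python
def breakpoints(g, h):
    """Breakpoint distance between two linear signed gene orders."""
    seen = set()
    for a, b in zip(h, h[1:]):
        seen.add((a, b))
        seen.add((-b, -a))
    return sum(1 for pair in zip(g, g[1:]) if pair not in seen)


def circular_lower_bound(leaves_in_tree_order):
    """Lower bound on the score of any labelling of a tree whose leaves
    appear in the given circular order (assumed input convention)."""
    n = len(leaves_in_tree_order)
    total = sum(
        breakpoints(leaves_in_tree_order[i], leaves_in_tree_order[(i + 1) % n])
        for i in range(n)
    )
    # Scores are integers, so half the total can be rounded up.
    return (total + 1) // 2


def can_prune(leaves_in_tree_order, best_score_so_far):
    """A tree cannot beat the best known score if its bound already
    reaches it, so the expensive relabelling can be skipped."""
    return circular_lower_bound(leaves_in_tree_order) >= best_score_so_far
```

The bound needs only pairwise leaf distances, which can be computed once, so testing a tree costs far less than running the median iterations on it.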

  14. Re-Engineering Techniques: Parallel Aspects • Efficient Tree Generation • Avoids unbounded-precision arithmetic • Allows generation from any count with a variable gap • Provides parallel generation and also sampling of the search space • Portable MPI implementation, each processor handles a fraction of the trees • On the 512-processor Alliance cluster Los Lobos at UNM, obtained a 512-fold speedup • Summary of speedups: • Profiling: one order of magnitude • Cache awareness: a factor of anywhere from 2 to 40 • Low-level algorithmic changes: 5-10 • 512-processor parallelism: 512 • Overall, GRAPPA demonstrated a million-fold speedup over the original implementation
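"Generation from any count with a variable gap" can be illustrated with a mixed-radix decoding of tree indices. This is a hedged Python sketch of one possible scheme, not GRAPPA's actual generator: a tree on n leaves is identified by the edge chosen when each of the leaves 4..n is attached (leaf k has 2k-5 edges to choose from), and processor `rank` takes every `num_procs`-th index. All names are assumptions for this example.

```python
def num_trees(n):
    """(2n-5)!!, the number of unrooted binary trees on n leaves."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


def tree_choices(index, n):
    """Decode a tree index into the edge choices for leaves 4..n.

    The digit for leaf k lies in range(2k-5): the tree on k-1 leaves
    has 2k-5 edges on which leaf k can be attached.
    """
    choices = []
    for k in range(4, n + 1):
        index, digit = divmod(index, 2 * k - 5)
        choices.append(digit)
    if index:
        raise ValueError("index out of range")
    return choices


def indices_for_rank(rank, num_procs, n):
    """Tree indices handled by one processor: start at rank, gap num_procs."""
    return range(rank, num_trees(n), num_procs)


if __name__ == "__main__":
    for rank in range(4):
        print(rank, [tree_choices(i, 5) for i in indices_for_rank(rank, 4, 5)])
```

Since any index can be decoded on its own, no processor has to enumerate the trees assigned to the others, and choosing a larger gap samples the search space instead of covering it. Python integers are arbitrary-precision, so the slide's point about avoiding unbounded-precision arithmetic is not demonstrated here.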

  15. Evaluation: the Bluebell Family • Dataset: full gene sequences for the chloroplasts of 12 species of Campanulaceae (bluebells), plus tobacco • Chloroplast • A semi-independent organism that lives within plant cells and allows them to photosynthesize • Has a single chromosome with about 120 genes • Optimization target: reconstruct the phylogeny with the least total amount of genomic change • Environment: 512-processor Los Lobos supercluster at UNM • Results: • Speedup of three to four orders of magnitude in the serial part • Total speedup of over one million
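The two result figures on this slide can be combined with simple arithmetic. The short Python sketch below assumes that the serial and parallel speedups multiply, and it reads the reported 512-fold speedup on 512 processors as a parallel efficiency of 1.0. The function name is illustrative.

```python
def total_speedup(serial_speedup, processors, parallel_efficiency=1.0):
    """Overall speedup if serial and parallel gains multiply (assumption)."""
    return serial_speedup * processors * parallel_efficiency


if __name__ == "__main__":
    low = total_speedup(10**3, 512)    # three orders of magnitude, serial
    high = total_speedup(10**4, 512)   # four orders of magnitude, serial
    print(low, high)
```

Under these assumptions the total lies between roughly 5 × 10^5 and 5 × 10^6, so a speedup of over one million corresponds to a serial gain in the upper part of the stated range.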

  16. Phylogeny of Bluebell Family

  17. Impact in Computational Biology • Much faster implementations • Alter the practice of research in biology and medicine • Reducing the time of an analysis from two years down to a day • Makes an enormous difference in the pace and cost of drug discovery and development • Fast and accurate analysis software • Enables researchers to pursue more leads and develop better intuition on small datasets • Form new conjectures about biological mechanisms

  18. Observations • Algorithm re-engineering • Uncovers salient characteristics of the algorithm • Enables us to develop better algorithms • Example: a true linear-time algorithm for computing inversion distance was found during the development of GRAPPA • Can be applied to any existing bioinformatics algorithm • Several have been engineered for performance, such as BLAST • Limited benefits in theoretical terms when applied to NP-hard optimization problems • Does not scale up to “industrial strength” • GRAPPA only enables a move from 10 taxa to 13 taxa

  19. Thank you
