high performance computing for reconstructing phylogenies from gene order data
Download
Skip this Video
Download Presentation
High-Performance Computing for Reconstructing Phylogenies from Gene-Order Data

Loading in 2 Seconds...

play fullscreen
1 / 25

High-Performance Computing for Reconstructing Phylogenies from Gene-Order Data - PowerPoint PPT Presentation


  • 161 Views
  • Uploaded on

High-Performance Computing for Reconstructing Phylogenies from Gene-Order Data. David A. Bader Electrical & Computer Engineering University of New Mexico [email protected] http://hpc.eece.unm.edu/. Acknowledgment of Support. National Science Foundation

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'High-Performance Computing for Reconstructing Phylogenies from Gene-Order Data' - knox


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
high performance computing for reconstructing phylogenies from gene order data

High-Performance Computing forReconstructing Phylogenies fromGene-Order Data

David A. Bader

Electrical & Computer Engineering

University of New Mexico

[email protected]

http://hpc.eece.unm.edu/

acknowledgment of support
Acknowledgment of Support
  • National Science Foundation
    • CAREER: High-Performance Algorithms for Scientific Applications (00-93039)
    • ITR: Algorithms for Irregular Discrete Computations on SMPs (00-81404)
    • DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles (99-10123)
    • ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
    • DEB Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
    • ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
  • PACI: NCSA/Alliance, NPACI/SDSC, PSC
  • Sun Microsystems

High-Performance for Phylogeny Reconstruction, David A. Bader

algorithms that scale from the blade to the fire
Algorithms that Scale from the Blade to the Fire

High-Performance for Phylogeny Reconstruction, David A. Bader

commercial aspects of phylogeny reconstruction
Commercial Aspects of Phylogeny Reconstruction
  • Identification of microorganisms
    • public health entomology
    • sequence motifs for groups are patented
    • example: differentiating tuberculosis strains
  • Dynamics of microbial communities
    • pesticide exposure: identify and quantify microbes in soil
  • Vaccine development
    • variants of a cell wall or protein coat component
    • porcine reproductive and respiratory syndrome virus isolates from US and Europe were separate populations
    • HIV studied through DNA markers
  • Biochemical pathways
    • antibacterials and herbicides

Glyphosate (Roundup, Rodeo , and Pondmaster ): first herbicide targeted at a pathway not present in mammals

    • phylogenetic distribution of a pathway is studied by the pharmaceutical industry before a drug is developed
  • Pharmaceutical industry
    • predicting the natural ligands for cell surface receptors which are potential drug targets
    • a single family, G protein coupled receptors (GPCRs), contains 40% of the targets of most pharm. companies

High-Performance for Phylogeny Reconstruction, David A. Bader

grappa genome rearrangements analysis
GRAPPA: Genome Rearrangements Analysis
  • Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
    • http://www.cs.unm.edu/~moret/GRAPPA/
    • Open-source
    • already used by other computational phylogeny groups, Caprara, Pevzner, LANL, FBI, PharmCos.
  • Gene-order Phylogeny Reconstruction
    • Breakpoint Median
    • Inversion Median
  • over one-million fold speedup from previous codes
  • Parallelism
    • Scales linearly with the number of processors
  • Developed using Sun Forte C

High-Performance for Phylogeny Reconstruction, David A. Bader

molecular data for phylogeny
Molecular Data for Phylogeny
  • simple DNA sequence: nucleotides
  • low-level functionality: amino acids, etc.
  • genomic level: genes
  • (next is functional level: proteomics, etc.)

Biologists now have full gene sequences for many single-chromosome organisms and organelles (e.g., mitochondria, chloroplasts) and for more and more larger organisms

High-Performance for Phylogeny Reconstruction, David A. Bader

gene order phylogeny
j

-i

i

-j

j+1

j+1

i -1

i -1

Gene Order Phylogeny
  • Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss).
  • Chloroplast have a single, typically circular, chromosome and appear to evolve mostly through inversion:

The sequence of genes i, i+1, …, j is inverted and

every gene is flipped.

High-Performance for Phylogeny Reconstruction, David A. Bader

gene order phylogeny cont d
Gene Order Phylogeny (cont’d)
  • The real problem
    • Reconstruct the “true” tree, identify the “true” ancestral genomes, and recover on each edge the “true” sequence of evolutionary changes
  • The optimization problem (parsimony)
    • Reconstruct a tree and ancestral genomes so as to minimize the sum, over all tree edges, of the inferred evolutionary distance along each edge
  • The surrogate problem
    • Do the optimization problem with a measure of inferred evolutionary distance that lends itself to analysis

High-Performance for Phylogeny Reconstruction, David A. Bader

breakpoint analysis a surrogate for gene order
Breakpoint Analysis:A Surrogate for Gene Order
  • Breakpoint:
    • an adjacent pair of genes present in one genome, but absent in the other
  • Breakpoint distance:
    • the total number of breakpoints between two genomes (a true metric, similar to Hamming distance)
  • Breakpoint phylogeny:
    • the tree and ancestral genomes that minimize the sum, over all edges of the tree, of the breakpoint distances

Naturally, it is an NP-hard problem, even with just 3 leaves.

High-Performance for Phylogeny Reconstruction, David A. Bader

breakpoint analysis sankoff blanchette 1998
Breakpoint Analysis(Sankoff & Blanchette 1998)
  • For each tree topology do
    • somehow assign initial genomes to the internal nodes
    • repeat
      • for each internal node do
        • compute a new genome that minimizes the distances to its three neighbors
        • replace old genome by new if distance is reduced
    • until no change

Sankoff & Blanchette implemented this in a C++ package

(2n-5)!! = (2n-5) (2n-7) …  5  3 trees

unknown iterative heuristic

NP-hard

High-Performance for Phylogeny Reconstruction, David A. Bader

algorithm engineering works
Algorithm Engineering Works!
  • We reimplemented everything –
    • the original code is too slow and not as flexible as we wanted.
  • Our main dataset is a collection of chloroplast data from the flowering plant family Campanulaceae (bluebells):
    • 13 genomes of 105 gene segments each
  • On our old workstation:
    • BPAnalysis processes 10-12 trees/minute
    • Our implementation processes over 50,000 trees/minute
  • Speedup ratio is over 5,000!!
  • On synthetic datasets, we see speedups from 300 to over 50,000…

High-Performance for Phylogeny Reconstruction, David A. Bader

so what did we do
So… What did we do?!

*

(~ 10x)*

  • Absolutely no high-level algorithmic changes
  • Three low-level algorithmic changes:
    • better bounding
    • strong upper bound initialization
    • “condensing”
  • Completely different data representation
  • Two low-level algorithmic changes
    • all memory is pre-allocated
    • some loops are hand-unrolled
  • Written in C instead of C++

~10x

~10x

~ 6x

?

(convenience)

* Well, so I lied just a little bit…

High-Performance for Phylogeny Reconstruction, David A. Bader

one high level algorithmic change
One high-level algorithmic change
  • (Ok, so I lied a little…)
  • Avoid labeling the tree if possible
    • Use current best score as an upper bound.
    • Compute lower bound & prune tree away if lower bound > upper bound
  • Lower bound:
    • Get circular ordering of leaves, x1 x2 … xn
    • Compute D = d(x1,x2) + d(x2,x3) + … + d(xn,x1)
    • Then ½ D is a lower bound because
      • d(.) obeys the triangle inequality
      • every tree edge is used twice in a tree-based version of D

High-Performance for Phylogeny Reconstruction, David A. Bader

slide14
e

e

a

a

d

d

b

b

c

c

Tree

Tree

Tree version (paths)

d(e,a)

d(d,e)

d(a,b)

d(c,d)

d(b,c)

(Same trick as in the “twice around the tree” approximation for the TSP with triangle inequality.)

D = d(a,b) + d(b,c) + d(c,d) +

d(d,e) + d(e,a)

High-Performance for Phylogeny Reconstruction, David A. Bader

algorithmic changes 10x
Algorithmic Changes (~ 10x)
  • Better bounding: skip edges that would cause degree 3 or premature cycle
  • “Condensing”: whenever the same
    • gene subsequence appears in all genomes, it can be condensed into a single “superfragment”
    • done as static processing and on the fly before each TSP
  • Initializing the new median with the best of the old one and its three neighbors.
  • Condensing is very effective on real data within families, but easily defeated by large evolutionary distances.
  • (1) and (3) cause over half of the TSP instances (for finding computing “median-of-three” updated internal nodes) to be pruned away instantly.

High-Performance for Phylogeny Reconstruction, David A. Bader

data representation 10x
Data Representation (~ 10x)
  • No distance matrix for reduction of “median-of-three” to Traveling Salesperson Problem (TSP): at most 4n edges can be of interest – the others are treated as an undifferentiated pool. The adjacency lists have length  4.
    • thus, linear time at each step and reduced storage.
    • Backtracking search has a small list of edges and only searches among edges of cost 1 and 2 (– and 0 are always included) – still NP-hard, but often easy
  • When search runs out of edges, tour is completed in linear time from the pool of edges of cost 3
  • Many auxiliary arrays (á la Fortran!) to carry information on flags, degrees, other end of chains, …

High-Performance for Phylogeny Reconstruction, David A. Bader

low level coding changes 6x
Low-Level Coding Changes (~ 6x)
  • All storage allocated at start, with large #s of pointers passed to subroutines (no globals, to allow parallel execution).
      • Avoids malloc/free overhead; Improves cache locality
  • Avoid recomputations.
    • Use local variables for intermediate pointers
    • Hand unroll loops on adjacencies to preserve locality (and to avoid “mod” operations with circular genomes)
      • Speeds up addressing – never deference! & Improves cache locality

BPAnalysis uses 65MB and has a real memory footprint of ~ 12MB on our real data

Our reimplementation uses 1.6MB with a footprint of 0.6MB

High-Performance for Phylogeny Reconstruction, David A. Bader

and how did we do it
And… How did we do it?
  • 3 strategies: Profile, Profile, Profile

(and use your engineering sense/nose/… )

Sun Forte 6 Analyzer

  • We began with 4 main culprits:
    • preparing adjacency lists for the TSP
    • computing breakpoint distances
    • computing lower bounds in TSP
    • backtracking in TSP
  • Over 10 – 12 major iterations, each of which yielded a 1.5 – 2 fold speed-up, these four switched places over and over.

High-Performance for Phylogeny Reconstruction, David A. Bader

profiling
Profiling
  • And our final tally (still on the Campanulaceae dataset) is:

30% backtracking (excl. LB)

20% preparing adjacency lists

20% condensing & expanding

15% computing LB

8% computing distances

7% miscellaneous overhead

    • (no obvious culprits left)

High-Performance for Phylogeny Reconstruction, David A. Bader

high performance computing techniques
High-Performance Computing Techniques
  • Availability of hundreds of powerful processors
  • Standard parallel programming interfaces (Sun HPC)
    • Message passing interface (MPI)
    • OpenMP or POSIX threads
  • Algorithmic libraries for SMP clusters
    • SIMPLE
  • Goal: make efficient use of parallelism for
    • exploring candidate tree topologies
    • sharing of improved bounds

High-Performance for Phylogeny Reconstruction, David A. Bader

parallelization of the phylogeny algorithm
Parallelization of the Phylogeny Algorithm
  • Enumerating tree topologies is pleasantly parallel and allows multiple processors to independently search the tree space with little or no overhead
  • Improved bounds can be broadcast to other processors without interrupting work
  • Load is evenly balanced when trees are cyclically assigned (e.g. in a round-robin fashion) to the processors
  • Linear speedup

High-Performance for Phylogeny Reconstruction, David A. Bader

final remarks
Final Remarks
  • Our reimplementation led to numerous extensions as well as to new theoretical results
    • GRAPPA has been extended to inversion phylogeny, with linear-time algorithms for inversion distance and a new approach to exact inversion median-of-three.
    • Better bounding in the next version of GRAPPA yields two more orders of magnitude speedup.
    • These insights and improvements are made possible by mature development tools (Forte)
  • Algorithmic engineering techniques are widely applicable
    • We may not always get 6 orders of magnitude, but 3 – 4 orders should be nearly routine with most codes. (We are starting work on TBR and exact parsimony solvers.)

High-Performance for Phylogeny Reconstruction, David A. Bader

final remarks cont d
Final Remarks (cont’d)
  • High-performance implementations enable:
    • better approximations for difficult problems (MP, ML)
    • true optimization for larger instances
    • realistic data exploration (e.g., testing evolutionary scenarios, assessing answers obtained through other means, etc.)
  • Our analysis of the Campanulaceae dataset confirmed the conjecture of Robert Jansen et al. – that inversion is the principal process of genome evolution in cpDNA for this group.

High-Performance for Phylogeny Reconstruction, David A. Bader

work in progress and future work
Work-In-Progress and Future Work
  • Tree enumeration using circular ordering
  • Handle unequal gene content and duplicate genes using exemplars
  • Parallel branch and bound techniques (optimized for Sun HPC Servers) for searching tree space
  • Improved SPR and TBR techniques (local searches around good trees)
  • Exact Algorithm for Maximum Parsimony

High-Performance for Phylogeny Reconstruction, David A. Bader

recent publications 2001
Recent publications (2001)
  • A New Implementation and Detailed Study of Breakpoint Analysis, B.M.E. Moret, S. Wyman, D.A. Bader, T. Warnow, M. Yan, Sixth Pacific Symposium on Biocomputing 2001, pp. 583-594, Hawaii, January 2001.
  • High-Performance Algorithm Engineering for Gene-Order Phylogenies, D.A. Bader, B. M.E. Moret, T. Warnow, S.K. Wyman, and M. Yan, DIMACS Workshop on Whole Genome Comparison, DIMACS Center, Rutgers University, Piscataway, NJ, March 2001.
  • Variation in vegetation growth rates: Implications for the evolution of semi-arid landscapes, C. Restrepo, B.T. Milne, D. Bader, W. Pockman, and A. Kerkhoff, 16th Annual Symposium of the US-International Association of Landscape Ecology, Arizona State University, Tempe, April 2001.
  • High-Performance Algorithm Engineering for Computational Phylogeny, B. M.E. Moret, D.A. Bader, and T. Warnow, 2001 International Conference on Computational Science, San Francisco, CA, May 2001.
  • Cluster Computing: Applications, David A. Bader and Robert Pennington, The International Journal of High Performance Computing, 15(2):181-185, May 2001.
  • New approaches for using gene order data in phylogeny reconstruction, R.K. Jansen, D.A. Bader, B. M. E. Moret, L.A. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman. Botany 2001, Albuquerque, NM, August 2001.
  • GRAPPA: a high-performance computational tool for phylogeny reconstruction from gene-order data, B. M.E. Moret, D.A. Bader, T. Warnow, S.K. Wyman, and M. Yan. Botany 2001, Albuquerque, NM, August 2001.
  • Inferring phylogenies of photosynthetic organisms from chloroplast gene orders, L.A. Raubeson, D.A. Bader, B. M.E. Moret, L.-S. Wang, T. Warnow, and S.K. Wyman. Botany 2001, Albuquerque, NM, August 2001.
  • Industrial Applications of High-Performance Computing for Phylogeny Reconstruction, D.A. Bader, B. M.E. Moret, and L. Vawter, SPIE ITCom: Commercial Applications for High-Performance Computing, Denver, CO, SPIE Vol. 4528, pp. 159-168, August 2001.
  • Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture, D.A. Bader, A. Illendula, B. M.E. Moret, and N.R. Weisse-Bernstein, Fifth Workshop on Algorithm Engineering, Springer-Verlag LNCS 2141, 129-144, Aarhus, Denmark, August 2001.
  • A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study, D.A. Bader, B. M.E. Moret, and M. Yan, Journal of Computational Biology, 8(5):483-491, October 2001.

High-Performance for Phylogeny Reconstruction, David A. Bader

ad