Algorithmic problems related to sequences and phylogenetic trees
Bhaskar DasGupta

Department of Computer Science

University of Illinois at Chicago

Chicago, IL 60607-7053

Email: [email protected]




Outline

  • Introduction

  • Substructure Comparison Problems

    • Sequences

      • Nonoverlapping local alignment

    • Proteins

  • Transformation Based Distances

    • Phylogenetic Trees

      • Why compare?

      • A few distances

    • Genomes

      • Syntenic Distance

  • Conclusions


Computational Molecular Biology

A Computer Scientist’s Participation

  • Get to know the computational problems

    • Talk to biologists

    • State the computational problems as precisely as possible

  • Investigate computational aspects of the problems

    • exact solutions

      • difficult/easy ?

      • time/space efficient solutions ?

    • approximate solutions (if exact solution is hard or not time/space efficient)

      • guaranteed quality of approximation ? (tradeoff with space/time?)

      • deterministic vs. randomized algorithms

    • implementation aspects

      • programming cleverness to reduce space/time

      • algorithmic engineering techniques to reduce space/time

    • interaction with the biologists

      • are the solutions biologically meaningful ?


A Few Computer Science Jargon Terms

When we say: what we really mean

  • Maximization/minimization problem: a problem in which we maximize/minimize some objective function

  • Problem is NP-complete/hard: an exact solution for large problem instances will most likely require too much time

  • Polynomial-time solution: solvable in reasonable time on a reasonably fast computer

  • Approximation algorithm: an approximate solution computed in reasonable time

  • Approximation algorithm with approximation ratio r: an approximate solution computed in reasonable time, with an objective function value of at least 1/r times the optimum (for maximization) or at most r times the optimum (for minimization)


Substructure Similarity (or, equivalently, Dissimilarity)

[Figure: two structures; substructures a, b and c of the first are matched to substructures a’, b’ and c’ of the second]

a matches to a’ with similarity 10

b matches to b’ with similarity 15

c matches to c’ with similarity 11

total similarity 36

Goal: match disjoint substructures to maximize total similarity


A Few Complications

  • Many short vs. fewer long substructures

  • Measure of similarity between substructures

  • Examples:

    • rmsd (root-mean-square distance) between 3D substructures

    • edit distance between subsequences

    • syntenic distance between multi-chromosome genomes
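As one concrete instance of such a similarity measure, the edit distance between two subsequences can be computed with the standard dynamic program. The sketch below assumes unit costs for insertion, deletion and substitution (the slides do not fix a cost model), and the function name `edit_distance` is illustrative only.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance between s and t."""
    # prev[j] is the distance between the already processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string costs i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute a by b (free on a match)
            ))
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("ACGT", "AGT")` is 1, since deleting the `C` suffices. A smaller distance corresponds to a higher similarity, which is why the slides treat similarity and dissimilarity as interchangeable.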


Sequences

Non-overlapping local alignment

total similarity 10+15=25


The problem

Input: pairs of fragments, one from each sequence (or, equivalently, a set of rectangles). The weight of each pair (rectangle) is its similarity measure.

Output: a set of pairs (rectangles) such that

  • no two rectangles overlap on the x-axis

    (i.e., matched fragments of the first sequence are disjoint)

  • no two rectangles overlap on the y-axis

    (i.e., matched fragments of the 2nd sequence are disjoint)

  • total similarity of selected fragment pairs is maximized
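The input and output conditions above can be made concrete with a small sketch. The `Rect` layout, the use of half-open intervals, the helper names and the `EXAMPLE` data are all assumptions made for illustration; the brute-force search is exponential and serves only as a reference answer on tiny inputs.

```python
from itertools import combinations
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x1: int        # fragment of the first sequence is [x1, x2)
    x2: int
    y1: int        # fragment of the second sequence is [y1, y2)
    y2: int
    weight: float  # similarity of the two fragments


def intervals_overlap(a1: int, a2: int, b1: int, b2: int) -> bool:
    """True if the half-open intervals [a1, a2) and [b1, b2) intersect."""
    return a1 < b2 and b1 < a2


def conflict(r: Rect, s: Rect) -> bool:
    """True if r and s overlap on the x-axis or on the y-axis."""
    return (intervals_overlap(r.x1, r.x2, s.x1, s.x2)
            or intervals_overlap(r.y1, r.y2, s.y1, s.y2))


def is_feasible(rects: Sequence[Rect]) -> bool:
    """A selection is feasible if no two of its rectangles conflict."""
    return not any(conflict(r, s) for r, s in combinations(rects, 2))


def brute_force_optimum(rects: Sequence[Rect]) -> float:
    """Best total weight over all feasible subsets (tiny inputs only)."""
    best = 0.0
    for k in range(1, len(rects) + 1):
        for subset in combinations(rects, k):
            if is_feasible(subset):
                best = max(best, sum(r.weight for r in subset))
    return best


# Hypothetical input with weights 15, 2, 1 and 10.
EXAMPLE = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 2, 4, 2),
    Rect(4, 6, 5, 7, 1),
    Rect(5, 8, 4, 8, 10),
]
```

On `EXAMPLE`, the rectangles of weight 15 and 10 are compatible with each other, and `brute_force_optimum(EXAMPLE)` returns 25.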


Further assumption

We can preprocess the input data (rectangles or fragment pairs) to ensure that

  • for any two rectangles, the projection of one on the y-axis does not enclose that of the other

  • for any two rectangles, the projection of one on the x-axis does not enclose that of the other

[Figure: a pair of rectangles in which one projection encloses the other; such a pair is not allowed in the input data]


An illustration

[Figure: an input consisting of fragment pairs (rectangles) with weights 15, 2, 1 and 10 over two DNA sequences, together with an optimal solution of total similarity 25]


Previous results

(n = number of rectangles (fragment pairs))

  • Bafna, Narayanan and Ravi (WADS’95)

    • NP-complete

    • O(n^2) time approximation algorithm with approximation ratio 3.25

      • converts to a problem of finding a maximum-weight independent set in a 5-clawfree graph

      • gives an approximation algorithm for (d+1)-clawfree graphs with approximation ratio of d - 1 + 1/d (which is 3.25 for d = 4)

  • Halldórsson (SODA’95)

    • approximation algorithm with approximation ratio of about 2.5 when all weights are one

      • again uses clawfree graphs

  • Berman (SWAT’00)

    • O(n^4) time algorithm with approximation ratio of about 2.5

      • via clawfree graphs again


Our recent results

(Berman, DasGupta and Muthukrishnan, SODA’02)

  • O(n log n) time approximation algorithm with approximation ratio 3

  • very simple to implement

    • uses a 2-phase approach (or, equivalently, the local-ratio technique)
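The evaluation-then-selection structure of a two-phase approach can be sketched as follows. This is a simplified quadratic-time illustration, not the O(n log n) implementation of the cited paper: the tuple layout, the half-open intervals, the ordering by right endpoint on the x-axis, and the names `conflict`, `two_phase` and `EXAMPLE` are assumptions made for the example, and no approximation guarantee is claimed for the sketch itself.

```python
from typing import List, Tuple

# (x1, x2, y1, y2, weight); the fragments are [x1, x2) and [y1, y2).
Rect = Tuple[int, int, int, int, float]


def conflict(r: Rect, s: Rect) -> bool:
    """True if r and s overlap on the x-axis or on the y-axis."""
    overlap_x = r[0] < s[1] and s[0] < r[1]
    overlap_y = r[2] < s[3] and s[2] < r[3]
    return overlap_x or overlap_y


def two_phase(rects: List[Rect]) -> List[Rect]:
    # Phase 1 (evaluation): scan rectangles by increasing right x-endpoint.
    # A rectangle is pushed only if its weight exceeds the total value of
    # the conflicting rectangles already on the stack; the surplus is its value.
    stack: List[Tuple[Rect, float]] = []
    for r in sorted(rects, key=lambda rect: rect[1]):
        value = r[4] - sum(v for s, v in stack if conflict(r, s))
        if value > 0:
            stack.append((r, value))

    # Phase 2 (selection): pop in reverse order and keep every rectangle
    # that does not conflict with one that was already kept.
    chosen: List[Rect] = []
    while stack:
        r, _ = stack.pop()
        if all(not conflict(r, s) for s in chosen):
            chosen.append(r)
    return chosen


# Hypothetical input with weights 15, 2, 1 and 10.
EXAMPLE: List[Rect] = [
    (0, 3, 0, 3, 15),
    (2, 5, 2, 4, 2),
    (4, 6, 5, 7, 1),
    (5, 8, 4, 8, 10),
]
```

On `EXAMPLE` the evaluation phase pushes the rectangles of weight 15, 1 and 10 (with values 15, 1 and 9), and the selection phase keeps those of weight 10 and 15, for a total similarity of 25. The selection phase always returns a feasible set, because a rectangle is kept only when it conflicts with nothing kept before it.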

Extensions to d dimensions (d > 2)

  • Inputs are similarity measures of d fragments, one from each of given d sequences

  • Motivation: multiple sequence comparison problems

  • Generalization of our above approach:

    • O(n d log n) time approximation algorithm with approximation ratio of 2d-1

  • current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira, SODA’02):

    • polynomial time algorithm with approximation ratio 2d

      • uses repeated linear programming and continuous version of local-ratio techniques


Common substructure between protein structures

(work in progress, with Jie Liang and Andrew Binkowski)

  • Comparison of two 4-helix bundles that differ by topological rearrangement: ROP and cytochrome b562

  • Topological cartoons of 1ROP and 256B. Helices are drawn as cylinders and loops as lines. Residue numbers of structurally equivalent segments are indicated on the cylinders.

  • The alignment is non-sequential.


Motivation:

Discovering similar substructures in different proteins is essential for recognizing remote evolutionary relationships at the level of protein fragments.

A few interesting points:

  • It is not easy to characterize topological structures such as voids, pockets, or tunnels where ligands and other molecules bind.

  • Current computational tools do not perform very well at discovering similar substructures.

    For example:

    (a) protein structures are typically represented by distance matrices or contact maps, which record pairwise distances between selected atoms (typically Cα atoms) along the primary sequence

    (b) finding common substructures then becomes matching submatrices of the two contact maps

    (c) heuristic algorithms have been developed and have proven to be useful, but they are time consuming (typically O(n^6)) and cannot be used for more demanding tasks such as identifying spatial functional motifs
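The representation mentioned in point (a) can be illustrated with a short sketch that builds a distance matrix and a contact map from 3D coordinates. The function names and the default cutoff of 8.0 (in the same length unit as the coordinates) are assumptions chosen for the example; the slides do not specify a threshold.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """Pairwise Euclidean distances between the selected atoms."""
    return [[dist(p, q) for q in coords] for p in coords]


def contact_map(coords: Sequence[Point], cutoff: float = 8.0) -> List[List[bool]]:
    """Entry (i, j) is True when atoms i and j lie within `cutoff` of each other.

    The cutoff value is an illustrative assumption, not taken from the slides.
    """
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]
```

Both matrices are symmetric with one row and column per selected atom, so a contiguous fragment of the sequence corresponds to a square submatrix along the diagonal, which is what makes the reduction to submatrix matching in point (b) natural.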


Our approach (work in progress)

  • reduce the problem to various constrained rectangle-packing problems

  • use combinatorial methods (such as the local-ratio technique) to design approximation algorithms for these problems

Our final goal

  • identification of the most discriminating geometric and chemical features and their combinations for various proteins

  • development of a robust method to compute the similarity/dissimilarity of two shape distributions of these features


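Transformation based distances

  • Objects, connected by transformation rules (with costs)

  • Goal: find the distance between two specified objects

[Figure: objects joined by transformation rules with costs 15, 12, 9 and 10. One sequence of transformations between the two specified objects has cost 10+15+9 = 34, another has cost 10+12 = 22, and the distance between the two objects is 22.]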
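When the objects and rules can be listed explicitly, the distance in this sense is a cheapest path in a weighted graph. The sketch below uses Dijkstra's algorithm and assumes nonnegative costs; the function name, the dictionary layout and the object labels A to D in `RULES` are hypothetical, chosen so that the two routes cost 34 and 22 as in the illustration.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

# rules[obj] lists (next_obj, cost) pairs: obj can be transformed into next_obj.
Rules = Dict[Hashable, List[Tuple[Hashable, float]]]


def transformation_distance(rules: Rules, source: Hashable, target: Hashable) -> float:
    """Minimum total cost of transforming source into target (inf if impossible)."""
    best = {source: 0.0}
    counter = 0  # tie-breaker so the heap never has to compare two objects
    heap = [(0.0, counter, source)]
    while heap:
        cost, _, obj = heapq.heappop(heap)
        if obj == target:
            return cost
        if cost > best.get(obj, float("inf")):
            continue  # stale heap entry
        for nxt, step in rules.get(obj, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                counter += 1
                heapq.heappush(heap, (new_cost, counter, nxt))
    return float("inf")


# Hypothetical layout: A -> B -> C -> D costs 10+15+9, A -> B -> D costs 10+12.
RULES: Rules = {
    "A": [("B", 10)],
    "B": [("C", 15), ("D", 12)],
    "C": [("D", 9)],
}
```

Here `transformation_distance(RULES, "A", "D")` returns 22. For objects such as phylogenetic trees the set of objects is far too large to enumerate, so this explicit search is only a way to make the definition concrete, not a practical method.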


Distances between Phylogenetic Trees

  • Objects:

    • Evolutionary trees (phylogenies) on n nodes

  • Transformation Rules:

    • How to modify trees locally, consistent with biological applications?


Why compute distances between phylogenies?

First motivation

  • different methods for inferring phylogenies from the same input data:

    • parsimony method

    • compatibility method

    • maximum-likelihood method

    • distance matrix method

  • compare the resulting phylogenies for similarity and discrepancy


Why compute distances between phylogenies?

Second motivation

To find out information about rare genetic events such as recombination or gene conversion

[Figure: illustrations of recombination and gene conversion]


A few distances that we have looked at

  • Nearest neighbor interchange (nni) distance

  • Linear cost subtree transfer distances

Synopsis of our work on these distances

  • proving that exact computation is NP-hard

  • providing fast approximate solutions

  • investigating fixed-parameter tractability

  • some implementation work


Genomic Distance

Syntenic distance between multi-chromosome genomes

(Ferretti, Nadeau and Sankoff, 1996)

  • treats genomes at a higher level of abstraction

[Figure: a genome drawn as a collection of chromosomes, each containing some of the genes 1 to 10]

  • order of genes in any chromosome is unknown or ignored

  • intra-chromosomal events (e.g., reversal, transposition) do not affect chromosomal assignment

  • inter-chromosomal events are important


Inter-chromosomal events

  • Fission

  • Fusion

  • (Reciprocal) translocation

[Figure: fission and fusion illustrated on chromosomes containing genes 1 to 5, and a reciprocal translocation illustrated on chromosomes containing genes 1 to 7]


Syntenic distance between two genomes

minimum number of fissions, fusions and translocations necessary to transform one genome into another
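The three operations counted by this distance can be sketched on a genome modelled as a list of chromosomes, each an unordered set of genes. The representation, the function names and the assumption that each gene lies on exactly one chromosome are illustrative choices, not taken from the slides.

```python
from typing import FrozenSet, Iterable, List

Chromosome = FrozenSet[int]
Genome = List[Chromosome]


def fusion(genome: Genome, i: int, j: int) -> Genome:
    """Merge chromosomes i and j into a single chromosome."""
    if i == j:
        raise ValueError("fusion needs two distinct chromosomes")
    rest = [c for k, c in enumerate(genome) if k not in (i, j)]
    return rest + [genome[i] | genome[j]]


def fission(genome: Genome, i: int, part: Iterable[int]) -> Genome:
    """Split chromosome i into `part` and the remaining genes."""
    piece = frozenset(part)
    if not piece or not piece < genome[i]:
        raise ValueError("part must be a nonempty proper subset of the chromosome")
    rest = [c for k, c in enumerate(genome) if k != i]
    return rest + [piece, genome[i] - piece]


def translocation(genome: Genome, i: int, j: int, new_first: Iterable[int]) -> Genome:
    """Redistribute the genes of chromosomes i and j into two new chromosomes."""
    if i == j:
        raise ValueError("translocation needs two distinct chromosomes")
    pool = genome[i] | genome[j]
    first = frozenset(new_first)
    if not first or not first < pool:
        raise ValueError("new_first must be a nonempty proper subset of the pooled genes")
    rest = [c for k, c in enumerate(genome) if k not in (i, j)]
    return rest + [first, pool - first]
```

Because gene order within a chromosome is ignored, each operation only changes which genes share a chromosome. A fusion reduces the number of chromosomes by one, a fission increases it by one, and a translocation leaves it unchanged.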

Other related problems

finding the median of 3 genomes for the syntenic distance metric

(useful for the problem of inferring phylogenetic trees from synteny data)

Synopsis of our work on these problems

  • showing NP-hardness of exact computation

  • giving efficient approximation algorithms

  • exhibiting fixed-parameter tractability


Other problems

  • Genome partitioning with applications to DNA microarray chip design

  • Consensus sequence reconstruction problems


THE END

