Non-breaking Similarity of Genomes with Gene Repetitions

Binhai Zhu

Computer Science Department, Montana State University

Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao


Background

  • Computing the genomic distance between genomes is important in evolutionary molecular biology. The problem was first studied by Sturtevant and Dobzhansky in 1936.

  • A lot of research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.


Background (cont.)

  • On the other hand, gene repetition is very common in genomes. So computing genomic distances with gene repetition is a more realistic problem.

  • This is a typical optimization problem, so it makes sense to study its approximability.


Definitions

  • Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F such that each element has a (+/-) sign.

    Example. F={a,b,c,d},

    G’=-bd-cab-d-c

  • We will focus on unsigned sequences in this work.

  • A genome G is said to be exemplar if every gene appears exactly once in G.


Definitions (cont.)

  • Given exemplar genomes G and H over the same set of gene families, if the adjacent gene pair ab is a substring of G but not of H, then ab constitutes a breakpoint in G.

    Example. G=abcdefg

    H=efgdcab

    There are 3 breakpoints in G (bc, cd and de), and symmetrically 3 in H.

  • The number of breakpoints between G and H is called the breakpoint distance between G and H.
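
The breakpoint count in the example above can be checked with a few lines of Python. This is only a minimal sketch: it assumes unsigned exemplar genomes written as plain strings, and the helper names `adjacencies` and `breakpoint_distance` are illustrative, not taken from the talk.

```python
def adjacencies(genome):
    """Return the set of ordered adjacent gene pairs of an exemplar genome."""
    return {(genome[i], genome[i + 1]) for i in range(len(genome) - 1)}


def breakpoint_distance(g, h):
    """Count adjacent pairs of g that are not adjacent (in that order) in h."""
    return len(adjacencies(g) - adjacencies(h))


# The slide's example: bc, cd and de are the breakpoints in G.
print(breakpoint_distance("abcdefg", "efgdcab"))
```

Because both genomes are exemplar and have the same length, the count is the same whichever genome is taken first.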


Exemplar Breakpoint Distance Problem

  • Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H such that the breakpoint distance between G and H is minimized.

  • We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).


Approximation Algorithms

  • Given a minimization (maximization) problem Π with optimal solution value OPT, an approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).

  • Usually we say that A is a factor-α approximation for Π.


Prior Results (1)

  • We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP (in other words, deciding whether eb(G’,H’)=0 is NP-complete) [Chen, Fu and Zhu, 2006].

  • This result holds for any genomic distance d( ) satisfying G=H implies d(G,H)=0.

  • Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].


Prior Results (2)

  • On the other hand, for the exemplar breakpoint distance problem, Sankoff has used branch-and-bound [Sankoff, 1999] and Nguyen, Tay and Zhang [2005] have used divide-and-conquer on practical datasets to obtain good empirical results.

  • In a related but slightly different effort, Chauve et al. [2006] studied exemplar genomic similarity problems whose measures do not satisfy the property that G=H implies d(G,H)=0, e.g., the exemplar common interval measure problem.


Background for this work

  • We try to look at the complement of the breakpoint distance under the gene duplication model.

  • As the problem is still hard to approximate, we follow Nguyen, et al. by considering genomes satisfying some practical conditions.


Definitions

  • Given exemplar genomes G and H drawn from the same alphabet, ab is a non-breaking point, if ab appears in both G and H.

    Example. G = abcdefg

    H = fegcdab

    G and H share two non-breaking points (ab and cd). The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H); here nbs(G,H)=2.

    Note that when |G|=|H|=n, if G=H, nbs(G,H)=n-1.

  • Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The corresponding measure is denoted enbs(G’,H’).
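
As a small illustration of the definition of nbs, the following Python sketch counts the adjacent pairs shared by two exemplar genomes. It assumes unsigned genomes given as strings, and the function name `nbs` simply mirrors the notation above.

```python
def nbs(g, h):
    """Non-breaking similarity: number of adjacent pairs common to g and h."""
    def adjacencies(genome):
        return {(genome[i], genome[i + 1]) for i in range(len(genome) - 1)}
    return len(adjacencies(g) & adjacencies(h))


print(nbs("abcdefg", "fegcdab"))   # the shared pairs are ab and cd
print(nbs("abcdefg", "abcdefg"))   # identical genomes of length n share n-1 pairs
```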


Example

G’ = abcadcefg

H’ = cfegcdabf

We have 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.

We have 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.

enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
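
The example can be reproduced by brute force: enumerate every exemplar genome of G’ and of H’ and take the best nbs value. The sketch below is exponential in the number of repeated genes and is meant only to check small instances; the helper names are illustrative and not from the talk.

```python
from itertools import product


def exemplars(genome):
    """Yield each exemplar genome obtained by keeping one copy of every gene."""
    positions = {}
    for index, gene in enumerate(genome):
        positions.setdefault(gene, []).append(index)
    for choice in product(*positions.values()):
        yield "".join(genome[i] for i in sorted(choice))


def nbs(g, h):
    """Number of adjacent pairs common to the exemplar genomes g and h."""
    def adjacencies(s):
        return {(s[i], s[i + 1]) for i in range(len(s) - 1)}
    return len(adjacencies(g) & adjacencies(h))


def enbs(g_prime, h_prime):
    """Exemplar non-breaking similarity by exhaustive search."""
    return max(nbs(g, h)
               for g in exemplars(g_prime)
               for h in exemplars(h_prime))


print(sorted(exemplars("abcadcefg")))
print(sorted(exemplars("cfegcdabf")))
print(enbs("abcadcefg", "cfegcdabf"))
```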


Inapproximability Result

Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet of size n and each gene appears in H’ at most two times, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor n^(1-ε), unless P=NP.

Proof Idea: A linear reduction from Independent Set (IS).


[Figure: example input graph with N=5 vertices (v1, ..., v5) and M=5 edges (e1, ..., e5); N+M is even.]

G: v1 v’1 v2 v’2 v3 v’3 v4 v’4 v5 v’5 x1 e1 x’1 x2 e2 x’2 x3 e3 x’3 x4 e4 x’4 x5 e5 x’5

H’: Y_{N+M-1} Y_{N+M-3} … Y_1 Y_{N+M} Y_{N+M-2} … Y_2 =

x4 x’4 x2 x’2 v5 e4 e5 v’5 v3 e1 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 e3 e5 v’4 v2 e2 e3 e4 v’2

Y_i = v_i A_i v’_i, if i ≤ N; Y_{N+i} = x_i x’_i, if i ≤ M

H: x4 x’4 x2 x’2 v5 e5 v’5 v3 v’3 v1 e1 e2 v’1 x5 x’5 x3 x’3 x1 x’1 v4 v’4 v2 e3 e4 v’2

H corresponds to the optimal independent set {v3,v4}.

Input graph has an IS of size K iff enbs(G,H’)=K.
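
The construction can be replayed on the 5-vertex example with a short brute-force script. This is a sketch under assumptions: the edge endpoints are inferred from the H’ string (for instance, the block v3 e1 v’3 is read as "e1 is incident to v3"), A_i is taken to be the list of edges incident to v_i, and all function names are hypothetical.

```python
from itertools import combinations, product

# Assumption: endpoints read off the vertex blocks of H' on the slide.
EDGES = {"e1": (1, 3), "e2": (1, 2), "e3": (2, 4), "e4": (2, 5), "e5": (4, 5)}
N, M = 5, 5  # N + M must be even for the interleaving used below


def build_g():
    """G = v1 v'1 ... vN v'N x1 e1 x'1 ... xM eM x'M (an exemplar genome)."""
    genes = []
    for i in range(1, N + 1):
        genes += [f"v{i}", f"v'{i}"]
    for j in range(1, M + 1):
        genes += [f"x{j}", f"e{j}", f"x'{j}"]
    return genes


def block(i):
    """Y_i: a vertex block v_i A_i v'_i, or a separator block x_j x'_j."""
    if i <= N:
        incident = [e for e in sorted(EDGES) if i in EDGES[e]]
        return [f"v{i}", *incident, f"v'{i}"]
    return [f"x{i - N}", f"x'{i - N}"]


def build_h_prime():
    """H' = Y_{N+M-1} Y_{N+M-3} ... Y_1 Y_{N+M} Y_{N+M-2} ... Y_2."""
    order = list(range(N + M - 1, 0, -2)) + list(range(N + M, 0, -2))
    return [gene for i in order for gene in block(i)]


def adjacencies(genome):
    return {(genome[i], genome[i + 1]) for i in range(len(genome) - 1)}


def enbs_against_exemplar(g, h_prime):
    """Keep one copy of each gene of h_prime; maximise adjacencies shared with g."""
    positions = {}
    for index, gene in enumerate(h_prime):
        positions.setdefault(gene, []).append(index)
    g_adj = adjacencies(g)
    best = 0
    for choice in product(*positions.values()):
        h = [h_prime[i] for i in sorted(choice)]
        best = max(best, len(g_adj & adjacencies(h)))
    return best


def max_independent_set_size():
    """Largest vertex subset containing no edge, by exhaustive search."""
    for k in range(N, 0, -1):
        for subset in combinations(range(1, N + 1), k):
            if not any(u in subset and v in subset for u, v in EDGES.values()):
                return k
    return 0


G = build_g()
H_PRIME = build_h_prime()
print(enbs_against_exemplar(G, H_PRIME), max_independent_set_size())
```

On this instance the only adjacencies of G that can survive in an exemplar of H’ are the pairs v_i v’_i, and such a pair survives exactly when every edge copy inside block Y_i is deleted, which is why the two printed values agree.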




Positive Results

Our motivation came from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,

…xyx…aba…


Positive Results

Definition:

occ(g,G’) is the number of occurrences of g in G’.

span(g,G’) is the maximum distance between two copies of g in G’.

totalocc(c,G’) = ∑ occ(g,G’), summed over all genes g in G’ with span(g,G’) ≥ c.


Example. G’=abcdaebd

span(a,G’)=4, span(b,G’)=5, span(d,G’)=4,

totalocc(4,G’)=6
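
The three quantities can be computed directly from their definitions. In the sketch below, positions are 1-based and a gene with a single copy is given span 0; that second convention is an assumption, since the slide only defines span for repeated genes.

```python
def occ(gene, genome):
    """Number of occurrences of gene in genome."""
    return genome.count(gene)


def span(gene, genome):
    """Distance between the first and last copy of gene (0 for a single copy)."""
    positions = [i for i, g in enumerate(genome, start=1) if g == gene]
    return positions[-1] - positions[0]


def totalocc(c, genome):
    """Sum of occ(g) over all distinct genes g whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


genome = "abcdaebd"
print(span("a", genome), span("b", genome), span("d", genome))
print(totalocc(4, genome))
```

With the span-0 convention, totalocc(1,G’) counts exactly the copies of genes that are repeated in G’.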


Positive Results

Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(c+2+ε)) time.


Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”)≤c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in O(n^(c+2+ε)) time.

Roughly speaking, write H”=H1 H2 H3 with |H2|=c, then enumerate all solutions on H2 and recurse.

T(n) ≤ 2^(c+1) [2T(n/2+c)] + O(n) ≤ O(n^(c+2+ε))
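
As a rough sanity check of the recurrence (a sketch only, assuming c is a constant and that the recursion stops at subproblems of size O(c), neither of which the slide spells out), the two factors can be merged and the leaves of the recursion tree counted:

```latex
\begin{align*}
T(n) &\le 2^{c+1}\bigl[\,2\,T(n/2 + c)\,\bigr] + O(n)
      = 2^{c+2}\,T(n/2 + c) + O(n), \\
\bigl(2^{c+2}\bigr)^{\log_2 n} &= n^{c+2}.
\end{align*}
```

The second line is the number of leaves of a recursion of depth about log_2 n with branching factor 2^(c+2), which dominates the O(n) work per call. The extra ε in the stated bound presumably leaves room for the additive c in the subproblem size.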


Idea 2: Treating t as a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most 4·3^⌊t/3⌋ such combinations. Therefore, the total running time is 4·3^⌊t/3⌋ · O(n^(c+2+ε)) = O(3^⌊t/3⌋ n^(c+2+ε)).


Positive Results

Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’) > c, for some constant c. Then enbs(G’,H’) can be computed in O(3^⌊t/3⌋ n^(2c+1+ε)) time.

Example. G’=abcadef

H’=bcedefad

shift(a,G’,H’) = 6


Conclusion

  • We introduce non-breaking similarity, which is the complement of the famous breakpoint distance, for genome comparison.

  • The general exemplar non-breaking similarity problem is hard to approximate.

  • For some special cases, we can obtain polynomial-time solutions.

