### Non-breaking Similarity of Genomes with Gene Repetitions

Binhai Zhu

Computer Science Department, Montana State University

Joint work with Zhixiang Chen, Bin Fu, Jinhui Xu, Boting Yang and Zhiyu Zhao

Background

- Computing the genomic distance between genomes is important in evolutionary molecular biology. The problem was first studied by Sturtevant and Dobzhansky in 1936.
- A lot of research has been done on computing genomic distances since 1990, assuming that each gene appears in a genome once, e.g., the famous result by Hannenhalli and Pevzner on sorting signed permutations by reversals.

Background (cont.)

- On the other hand, gene repetition is very common in genomes, so computing genomic distances with gene repetition is a more realistic problem.
- Since this is a typical optimization problem, it makes sense to study its approximability.

Definitions

- Given a set F of n gene families (an alphabet), a genome G’ is a sequence of elements of F in which each element carries a sign (+ or -).
Example. F = {a,b,c,d}, G’ = -bd-cab-d-c

- We will focus on unsigned sequences in this work.
- A genome G is said to be exemplar if every gene appears exactly once in G.

Definitions (cont.)

- Given exemplar genomes G and H over the same set of gene families, if the pair of adjacent genes ab is a substring of G but not of H, then ab constitutes a breakpoint in G.
Example. For G = abcdefg and H = efgdcab, there are 3 breakpoints in G (bc, cd and de), and symmetrically 3 in H.

- The number of breakpoints between G and H is called the breakpoint distance between G and H.
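The breakpoint count defined above can be sketched in a few lines. This is only an illustration: it assumes unsigned exemplar genomes written as plain strings with one character per gene, and the function name `breakpoints` is made up for this sketch.

```python
def breakpoints(g, h):
    """Count adjacent gene pairs of g that do not occur as a substring of h.

    Assumes g and h are unsigned exemplar genomes over the same genes,
    written as strings with one character per gene.
    """
    h_pairs = {h[i:i + 2] for i in range(len(h) - 1)}
    return sum(1 for i in range(len(g) - 1) if g[i:i + 2] not in h_pairs)


# The example from the slide: G = abcdefg, H = efgdcab.
print(breakpoints("abcdefg", "efgdcab"))
print(breakpoints("efgdcab", "abcdefg"))
```

Both calls count 3, matching the symmetric count stated in the example.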

Exemplar Breakpoint Distance Problem

- Given two genomes G’ and H’ over n gene families, compute two exemplar genomes G and H such that the breakpoint distance between G and H is minimized.
- We call this the exemplar breakpoint distance problem (between G’ and H’). Denote this distance by eb(G’,H’)=b(G,H).

Approximation Algorithms

- Given a minimization (maximization) problem Π with optimal solution value OPT, an approximation algorithm A provides a performance guarantee of α for Π if, for every instance of Π, the solution value returned by A is at most α × OPT (at least OPT/α).
- Usually we say that A is a factor-α approximation for Π.

Prior Results (1)

- We showed that the exemplar breakpoint distance problem does not admit any approximation unless P=NP, because deciding whether eb(G’,H’)=0 is NP-complete [Chen, Fu and Zhu, 2006].
- This result holds for any genomic distance d( ) satisfying G=H implies d(G,H)=0.
- Based on the above result, even under a weaker model of approximation, we showed that the exemplar conserved interval distance problem does not admit any WEAK approximation of a superlinear factor [Chen, Fowler, Fu and Zhu, 2007].

Prior Results (2)

- On the other hand, for the exemplar breakpoint distance problem, Sankoff has used branch-and-bound [Sankoff, 1999] and Nguyen, Tay and Zhang [2005] have used divide-and-conquer on practical datasets to obtain good empirical results.
- In a related but slightly different effort, Chauve et al. [2006] studied exemplar genomic similarity problems, which do not satisfy the condition that G=H implies d(G,H)=0, e.g., the exemplar common interval measure problem.

Background for this work

- We try to look at the complement of the breakpoint distance under the gene duplication model.
- As the problem is still hard to approximate, we follow Nguyen, et al. by considering genomes satisfying some practical conditions.

Definitions

- Given exemplar genomes G and H drawn from the same alphabet, a pair of adjacent genes ab is a non-breaking point if ab appears as a substring in both G and H.
Example. G = abcdefg

H = fegcdab

G and H share two non-breaking points (ab and cd). The number of non-breaking points is called the non-breaking similarity of G and H, denoted nbs(G,H); here nbs(G,H)=2.

Note that when |G|=|H|=n, if G=H, nbs(G,H)=n-1.

- Given genomes G’ and H’ drawn from the same alphabet, possibly with gene repetitions, the exemplar non-breaking similarity problem is to delete redundant genes to obtain exemplar genomes G and H such that nbs(G,H) is maximized. The corresponding measure is denoted enbs(G’,H’).
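As a small sketch of the definition of nbs, assuming unsigned exemplar genomes written as strings with one character per gene (the helper names are illustrative, not from the slides):

```python
def adjacencies(genome):
    """Set of ordered adjacent gene pairs (length-2 substrings)."""
    return {genome[i:i + 2] for i in range(len(genome) - 1)}


def nbs(g, h):
    """Non-breaking similarity of two exemplar genomes given as strings."""
    return len(adjacencies(g) & adjacencies(h))


g, h = "abcdefg", "fegcdab"
shared = nbs(g, h)
# Each of the n-1 adjacencies of an exemplar genome is either a
# non-breaking point or a breakpoint, so the two counts are complementary.
breakpoints_in_g = (len(g) - 1) - shared
print(shared, breakpoints_in_g)
```

The last two lines make the "complement" relationship concrete: for exemplar genomes of length n over the same genes, the number of breakpoints in G equals (n-1) minus nbs(G,H).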

Example

G’ = abcadcefg

H’ = cfegcdabf

We have 4 possible exemplar genomes for G’: abcdefg, abdcefg, bcadefg, badcefg.

We have 4 possible exemplar genomes for H’: cfegdab, cegdabf, fegcdab, egcdabf.

enbs(G’,H’)=nbs(abcdefg,fegcdab)=2.
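The example can be checked by brute force. The sketch below enumerates every exemplar genome by keeping exactly one copy of each gene; it is exponential in the number of repeated genes and is meant only to illustrate the definition, not the algorithms discussed later. All function names are assumptions of this sketch.

```python
from itertools import product


def nbs(g, h):
    """Number of adjacent gene pairs shared by exemplar genomes g and h."""
    pairs = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    return len(pairs(g) & pairs(h))


def exemplars(genome):
    """All exemplar genomes obtained by keeping one copy of each gene."""
    positions = {}
    for i, gene in enumerate(genome):
        positions.setdefault(gene, []).append(i)
    return {
        "".join(genome[i] for i in sorted(keep))
        for keep in product(*positions.values())
    }


def enbs(g_prime, h_prime):
    """Exemplar non-breaking similarity by exhaustive search."""
    return max(
        nbs(g, h) for g in exemplars(g_prime) for h in exemplars(h_prime)
    )


print(sorted(exemplars("abcadcefg")))
print(enbs("abcadcefg", "cfegcdabf"))
```

On the slide's input this enumerates the same four exemplar genomes for each of G’ and H’ and finds a maximum similarity of 2.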

Inapproximability Result

Theorem 1. Given an exemplar genome G and another genome H’ such that the genes are all from the same alphabet with size n and each gene appears in H’ at most two times, the Exemplar Non-breaking Similarity Problem over G and H’ does not admit any approximation of factor $n^{1-\epsilon}$, unless P=NP.

Proof Idea: A linear reduction from Independent Set (IS).

[Figure: example input graph with N=5 vertices (v1 to v5) and M=5 edges (e1 to e5); N+M is even.]

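$G = v_1 v'_1\, v_2 v'_2\, v_3 v'_3\, v_4 v'_4\, v_5 v'_5\, x_1 e_1 x'_1\, x_2 e_2 x'_2\, x_3 e_3 x'_3\, x_4 e_4 x'_4\, x_5 e_5 x'_5$

$H' = Y_{N+M-1} Y_{N+M-3} \cdots Y_1\, Y_{N+M} Y_{N+M-2} \cdots Y_2 =$

$x_4 x'_4\, x_2 x'_2\, v_5 e_4 e_5 v'_5\, v_3 e_1 v'_3\, v_1 e_1 e_2 v'_1\, x_5 x'_5\, x_3 x'_3\, x_1 x'_1\, v_4 e_3 e_5 v'_4\, v_2 e_2 e_3 e_4 v'_2$

where $Y_i = v_i A_i v'_i$ if $i \le N$, and $Y_{N+i} = x_i x'_i$ if $i \le M$. In the example, $A_i$ lists the edges incident to $v_i$.

$H = x_4 x'_4\, x_2 x'_2\, v_5 e_5 v'_5\, v_3 v'_3\, v_1 e_1 e_2 v'_1\, x_5 x'_5\, x_3 x'_3\, x_1 x'_1\, v_4 v'_4\, v_2 e_3 e_4 v'_2$

corresponds to the optimal independent set $\{v_3, v_4\}$.

Input graph has an IS of size K iff enbs(G,H’)=K.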
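The construction can be sketched as follows. Two things here are assumptions of the sketch rather than statements from the slides: the edge list (read off from which edges appear next to each vertex in the H’ shown above) and the choice to list the edges incident to a vertex in increasing index order. The name `build_instance` is likewise made up.

```python
def build_instance(n_vertices, edges):
    """Build (G, H') as token lists from a graph.

    edges is a list of 1-based vertex pairs; the k-th pair is edge e_k.
    """
    m = len(edges)
    if (n_vertices + m) % 2 != 0:
        raise ValueError("the construction shown assumes N+M is even")

    # G: v_i v'_i for every vertex, then x_k e_k x'_k for every edge.
    g = []
    for i in range(1, n_vertices + 1):
        g += [f"v{i}", f"v'{i}"]
    for k in range(1, m + 1):
        g += [f"x{k}", f"e{k}", f"x'{k}"]

    def block(j):
        """Y_j: a vertex block v_j A_j v'_j, or a separator x_k x'_k."""
        if j <= n_vertices:
            incident = [f"e{k}" for k, uv in enumerate(edges, 1) if j in uv]
            return [f"v{j}", *incident, f"v'{j}"]
        k = j - n_vertices
        return [f"x{k}", f"x'{k}"]

    # H': odd-indexed blocks in decreasing order, then even-indexed ones.
    total = n_vertices + m
    order = list(range(total - 1, 0, -2)) + list(range(total, 0, -2))
    h_prime = [tok for j in order for tok in block(j)]
    return g, h_prime


# Edge list inferred from the example H' (an assumption of this sketch).
edges = [(1, 3), (1, 2), (2, 4), (2, 5), (4, 5)]
g, h_prime = build_instance(5, edges)
print("".join(g))
print("".join(h_prime))
```

Each edge gene appears twice in H’, once in the block of each endpoint, which matches the "at most two times" condition of Theorem 1. A vertex block contributes the non-breaking point $v_i v'_i$ only if every edge copy inside it is deleted.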

Positive Results

Our motivation comes from Nguyen, Tay and Zhang [2005], who observed that for certain bacterial genome pairs (Baphi-Wigg, Pmult-Hinft, Ecoli-Styphi, Xaxo-Xcamp and Ypes), repeated genes are usually pegged, e.g.,

…xyx…aba…

Positive Results

Definition:

occ(g,G’) is the number of occurrences of g in G’.

span(g,G’) is the maximum distance between two copies of g in G’.

$\mathrm{totalocc}(c,G') = \sum_{g \in G',\ \mathrm{span}(g,G') \ge c} \mathrm{occ}(g,G')$

Example. G’=abcdaebd

span(a,G’)=4, span(b,G’)=5, span(d,G’)=4,

totalocc(4,G’)=6
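These three quantities are easy to compute directly. The sketch below assumes genomes are strings with one character per gene and, as a further assumption not stated on the slide, treats the span of a gene with a single copy as 0.

```python
def occ(gene, genome):
    """Number of occurrences of gene in genome."""
    return genome.count(gene)


def span(gene, genome):
    """Largest distance between two copies of gene (0 if at most one copy)."""
    positions = [i for i, g in enumerate(genome) if g == gene]
    return positions[-1] - positions[0] if positions else 0


def totalocc(c, genome):
    """Total occurrences of all genes whose span is at least c."""
    return sum(occ(g, genome) for g in set(genome) if span(g, genome) >= c)


genome = "abcdaebd"
print({g: span(g, genome) for g in "abd"})
print(totalocc(4, genome))
```

With c=1 the sum covers exactly the genes that have more than one copy, which is how totalocc(1,G’) is used in Theorem 2.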

Positive Results

Theorem 2. Let G’ and H’ be two genomes with t=totalocc(1,G’) + totalocc(c,H’), for a constant c. Then enbs(G’,H’) can be computed in $O(3^{\lfloor t/3 \rfloor} n^{c+2+\epsilon})$ time.

Idea 1: Given an exemplar genome G and another genome H” satisfying span(g,H”) ≤ c for every g in H”, we can use divide and conquer to compute enbs(G,H”) in $O(n^{c+2+\epsilon})$ time.

Roughly speaking, write $H'' = H_1 H_2 H_3$ with $|H_2| = c$, then enumerate all solutions on $H_2$ and recurse.

$T(n) \le 2^{c+1}[2T(n/2+c)] + O(n) \le O(n^{c+2+\epsilon})$
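One way to see why the recurrence stays within the stated bound is sketched below. This is a reader's derivation that ignores floors and ceilings, not necessarily the authors' argument; the auxiliary function $S$ is introduced only for this sketch.

```latex
\begin{align*}
T(n) &\le 2^{c+2}\, T\!\left(\tfrac{n}{2} + c\right) + O(n)
  && \text{(merge the factors } 2^{c+1} \text{ and } 2\text{)} \\
S(m) &:= T(m + 2c)
  && \text{(shift to absorb the additive } c\text{)} \\
S(m) &\le 2^{c+2}\, T\!\left(\tfrac{m}{2} + 2c\right) + O(m)
      = 2^{c+2}\, S\!\left(\tfrac{m}{2}\right) + O(m) \\
S(m) &= O\!\left(m^{\log_2 2^{c+2}}\right) = O\!\left(m^{c+2}\right)
  && \text{(master theorem, since } c + 2 > 1\text{)} \\
T(n) &= S(n - 2c) = O\!\left(n^{c+2}\right) \subseteq O\!\left(n^{c+2+\epsilon}\right)
\end{align*}
```

The exponent c+2 comes entirely from the branching factor $2^{c+2}$: $2^{c+1}$ enumerated choices on the middle part, times two subproblems of roughly half the size.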

Idea 2: Treating t as a constant, we enumerate all possibilities for deleting duplicated genes in G’ (to obtain G) and for deleting genes with span greater than c in H’ (to obtain H”). By Lemma 6, there are at most $4 \cdot 3^{\lfloor t/3 \rfloor}$ such combinations. Therefore, the total running time is $4 \cdot 3^{\lfloor t/3 \rfloor} \cdot O(n^{c+2+\epsilon}) = O(3^{\lfloor t/3 \rfloor} n^{c+2+\epsilon})$.

Positive Results

Theorem 3. Let G’ and H’ be two genomes with a total of t genes g satisfying shift(g,G’,H’) > c, for some constant c. Then enbs(G’,H’) can be computed in $O(3^{\lfloor t/3 \rfloor} n^{2c+1+\epsilon})$ time.

Example. G’=abcadef

H’=bcedefad

shift(a,G’,H’) = 6

Conclusion

- We introduce non-breaking similarity, the complement of the well-known breakpoint distance, for genome comparison.
- The general exemplar non-breaking similarity problem is hard to approximate.
- For some special cases, we can obtain polynomial-time solutions.
