Genome Rearrangements

Genome Rearrangements CIS 667 April 13, 2004

Genome Rearrangements • We have seen how differences in genes at the sequence level can be used to infer evolutionary relations among species • Differences in sequences in (one or more) genes resulted from point mutations (insert, delete, substitute) • These are not the only type of changes that can occur in the genome

Genome Rearrangements • Repair of broken chromosomes is an important process • Mistakes can occur, however • Mistakes can also occur during crossover • These mistakes cause changes in gene order • A large piece of chromosome can be moved or copied to another location • It can also move from one chromosome to another • We call these movements genome rearrangments

Crossover

Chromosome Repair

Genome Rearrangements • These have important (usually fatal) consequences for the organism and its evolution • Alignments do not capture genome rearrangments • Two species may have nearly the same gene sequences, but in a different order (why would the two species then be different?)

Genome Rearrangements • We need some other way to compare entire genomes (i.e. compare at a higher level) • Rather than simple point mutations a genome is obtained from another by a number of a special kind of rearrangements: Reversals • Use the number of reversals needed to transform one genome into another to measure evolutionary distance

The Method • Use combinatorial optimization techniques in an attempt to infer a most economical sequence of rearrangement operations to account for differences among the genomes • Compare with character-based methods for phylogenetics (parsimony)

Reversals • Consider the genome of a species as a sequence of blocks • A block is some sequence of the genome (possibly containing more than one gene) transcribed as a unit • Blocks are oriented since they can be transcribed from either strand of DNA • Give homologous blocks the same label

Reversals • Relation between chloroplast genomes of alfalfa and garden pea:

Reversals • Reversal operation for oriented blocks: • Inverts the order of affected blocks and changes their orientation (arrow) • Affects a contiguous segment of blocks • What sequence of reversal operations could have changed alfalfa into garden pea? • Would like to have a polynomial time algorithm to find the shortest sequence

Genome Comparison vs. Gene Comparison • In the late 1980s, J. Palmer and his colleagues studied the mitochondrial genomes of cabbage and turnips • The gene sequences are very similar (some genes are 99% equal) • Gene order, however, differs dramatically • Genome rearrangements are now considered to be a common mode of molecular evolution

Genome Comparison vs. Gene Comparison • Extreme conservation of genes on X chromosomes across mammalian species provides an opportunity to study the evolutionary history of X chromosome independently of the rest of the genomes • According to Ohno´s law, the gene content of X chromosome has barely changed throughout mammalian development in the last 125 million years. • However, the order of genes on X chromosomes has been disrupted several times.

Human and Mouse X Chromosomes

Human and Mouse X Chromosomes -4 -6 1 7 2 -3 5 8 3 2 7 -1 6 4 5 8 1 7 2 -3 6 4 5 8 1 2 7 -3 6 4 5 8 1 2 7 3 -6 4 5 8 1 2 -5 -4 -3 6 7 8 1 2 3 4 5 6 7 8

Genome Comparison vs. Gene Comparison • The traditional molecular evolutionary technique is a gene comparison to construct a phylogenetic tree • In the ”cabbage and turnip” case this is hardly suitable, since rate of point mutations in their mitochondrial genes is so low that their genes are almost identical • Genome comparison (i.e. comparison of gene orders) is the method of choice in the case of very slowly evolving genomes • Another area is the case where genomes evolve very rapidly (genes not very similar)

Genome Comparison • Only about (17839) genome rearrangements have happened since human and mouse diverged 80 million years ago • Mouse and human genomes can be viewed as a collection of about 200 fragments which are shuffled in mice as compared to humans • A comparative mouse-human genetic map gives the position of a human gene given the location of a related mouse gene

Man-Mouse Comparative Physical Map

Definitions • A signed permutation a over the set of labels L = {1, 2, …, n} is a permutation such that a(i) = +aor –a, where a Î L • Example: +3, –2, –1 is a signed permutation over L = {1, 2, 3} • Note that no label may appear twice in the permutation • A reversal [i,j] is an operation that transforms one signed permutation into another, reversing the order or a contiguous portion and flipping the signs

Definitions • a’ = a[i,j] = a(1), …, a(i – 1), –a(j), …, –a(i),a(j + 1), …, a(n) • We are interested in the problem of sorting by reversals: Given two signed permutations a and b, find the minimumnumber of reversals r1, …, rtthat will transform a into b - a r1…rt = b • The reversal distance db(a) = t

Definitions • Note that the reversal operation does not directly correspond to the biological operations (inversion, translocation, fission, fusion) • Given a and b, can we always transform a into b using only the reversal operation? If so, how many reversals are required in the worst case?

Breakpoints • A breakpoint is a point between consecutive labels in the initial permutation that must necessarily be separated by at least one reversal to reach the target permutation • The two consecutive labels are not consecutive in the target, or their orientations are not the same in a relative sense

Breakpoints • To formalize the idea of breakpoint, we introduce the extended version of a • Let a = a(1), …, a(n) • Then the extended version of a is (L, a(1), …, a(n), R) • For example let extended abe (L, –2, –3, +1, +6, –5, –4, R) and let extended b be (L, +1, +2, +3, +4, +5, +6, R) • The breakpoints are: (L,–2), (–2,–3), (–3,+1), (+1,+6), (6,–5), (–4,R)

Breakpoints • The number of breakpoints of a permutation a is denoted by b(a) • In the example, = 6 • Can you characterize the situations where L is involved in a breakpoint? When R is involved in a breakpoint?

A Lower Bound • A reversal can remove at most two breakpoints • Cuts the permutation in exactly two places • So, if ar1… rt = b then • b(a) – b(ar1) £ 2 • b(ar1) – b(ar1r2) £ 2 • … • b(ar1…rt-1) – b(ar1…rt) £ 2 • So b(a) £ 2t. If t = d(a), b(a)/2 £d(a)

Reality and Desire Diagram • The lower bound found is not very tight • We can derive a better l.b. based on a structure called the reality-desire diagram of a permutation with respect to another • To draw the diagram, we will represent +a with the tuple (-a +a) and -a with the tuple (+a -a) • The orientation is given by the rightmost member of the tuple

Reality and Desire Diagram • A permutation is a sequence of adjacent tuples: a = +3, –2, –1, +4, –5 can be represented as: L---(–3 +3)---(+2 –2)---(+1 –1)---(–4 +4)---(+5 –5)---R b = L---(–1 +1)---(–2 +2)---(–3 +3)---(–4 +4)---(–5 +5)---R

L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R Reality and Desire Diagram • Now we will draw a graph to represent a (L, +3, -2, -1, +4, -5, R) • The reality diagram:

Reality and Desire Diagram • Suppose that b is the identity (L, +1, +2, +3, +4, +5) • We will add desire edges to the previous graph to represent b L -3 +3 +2 -2 +1 -1 -4 +4 +5 -5 R

Reality and Desire Diagram • a is the reality • b is what is desired • The diagram (a multigraph) shows both reality and desire • Call it RD(a) • We can rearrange the nodes of the graph to make it easier to understand

Reality and Desire Diagram L R Desire Reality

Properties of RD(a) • Each vertex has degree 2 • Each node is incident to one edge from A, the set of reality edges, and B, the set of desire edges • The connected components of the graph are alternating cycles (edges alternate between reality - blue - and desire - red) • Each cycle has an even number of edges, half reality and half desire

Properties of RD(a) • The number of cycles of RD(a) is denoted by cb(a) • Note that cb(b) = n + 1 since b has no breakpoints • All cycles are two parallel edges between the same pair of nodes • We have 2n + 2 nodes, so n + 1 cycles • This is the only permutation for which cb(a) = 1

Properties of RD(a) • So transforming a into b can be seen as transforming RD(a) into a graph with as many cycles as possible - n + 1 • Now we need to see how a reversal affects the cycles of RD(a) • Note that a reversal is characterized by the two points where it cuts the current permutation, which each correspond to a reality edge

Reversals and RD(a) • Let r be a reversal defined by two reality edges (s,t) and (u,v), then RD(ar) differs from RD(a) as follows: • Reality edges (s,t) and (u,v) are replaced by (s,u) and (t,v) • Vertices u, …, t are reversed • Desire edges remain unchanged • See example on following slide

Example Some nodes/edges omitted L L R R

Orientation of Cycles • How many cycles are affected by a reversal? • First we define convergent and divergent edges • Two reality edges on the same cycle converge if they are traversed in the same direction (clockwise or counterclockwise on the circle in the diagram) on the cycle • Otherwise they diverge

Orientation of Cycles L Convergent: (+3,+2) (-1,-4) Divergent: (L,-3) (+3,+2) R

Reversals and #Cycles • Let r be a reversal acting on two reality edges e and f • If e and f belong to different cycles, c(ar) = c(a) – 1 • If e and f belong to the same cycle and converge, c(ar) = c(a) • If e and f belong to the same cycle and diverge, c(ar) = c(a) + 1

First Case • If e and f belong to different cycles, c(ar) = c(a) – 1

Second Case • If e and f belong to the same cycle and converge, c(ar) = c(a)

Third Case • If e and f belong to the same cycle and diverge, c(ar) = c(a) + 1

Reversals and #Cycles • Note that the number of cycles changes by at most one with each reversal • Use that to find another lower bound for reversal distance • Suppose we have ar1r2…rt = b we know that c(b) = n + 1 and we have: • c(ar1) - c(a)  1 • c(ar1r2) - c(ar1)  1 • … • c(ar1r2…rt) - c(ar1r2…rt-1)  1 • Adding and cancelling terms we get • n + 1 - c(a)  t • If r1r2…rtis optimal then t = d(a), n + 1 - c(a)  d(a)

Interleaving Graph • This new lower bound is better than the old one - b(a)/2 • For most signed permutations, it is close to the actual distance, however it does not always work (we can’t always choose two divergent edges) • We can classify the cycles of RD(a) as good or bad: • A cycle is good if it has two divergent reality edges • Otherwise it is bad

Interleaving Graph • The classification only applies to proper cycles (those with at least four edges) • Those with three edges don’t need to be touched since reality = desire • If we have only good cycles in a permutation, then the lower bound previously given is an equality • We sort, increasing the number of cycles by one per reversal

Interleaving Graph • If a desire edge from one cycle crosses some desire edge from another cycle we say that the two cycles interleave • Interleaved cycles allow us to change a bad cycle into a good one while breaking another cycle • This good cycle can then broken in the next step • To find interleaving cycles, we construct an interleaving graph

Interleaving Graph

Interleaving Graph • Nodes in the interleaving graph are cycles • Edge between two nodes if the cycles interleave • The connected components of the graph are called bad components if they consist entirely of bad cycles • Component otherwise is a good component

Interleaving Graph • What is the interleaving graph of the previous example? • Suppose that F and C are good cycles. • Which components of the interleaving graph are good and which are bad?

Genome Rearrangements