THE RNA DETECTIVE GAME: FINDING RNA CHAINS FROM FRAGMENTS

Fred Roberts, Rutgers University

Deoxyribonucleic acid, DNA, is the basic building block of inheritance.

DNA can be thought of as a chain consisting of bases.

Each base is one of four possible chemicals:

Thymine (T), Cytosine (C), Adenine (A), Guanine (G)

Some DNA chains:

GGATCCTGG, TTCGCAAAAAGAATC

Real DNA chains are long:

Algae (P. salina): 6.6 x 10^5 bases long

Slime mold (D. discoideum): 5.4 x 10^7 bases long

Insect (D. melanogaster – fruit fly): 1.4 x 10^8 bases long

Bird (G. domesticus): 1.2 x 10^9 bases long

Human (H. sapiens): 3.3 x 10^9 bases long

The sequence of bases in DNA encodes certain genetic information.

In particular, it determines long chains of amino acids known as proteins.

How many possible DNA chains are there in humans?

Fundamental methods of combinatorics are important in mathematical biology.

How many sequences of 0’s and 1’s are there of length 2?

There are 2 ways to choose the first digit and no matter how we choose the first digit, there are two ways to choose the second digit.

Thus, there are 2 x 2 = 2^2 = 4 ways to choose the sequence.

00, 01, 10, 11

How many sequences are there of length 3?

By similar reasoning: 2 x 2 x 2 = 2^3.

Is this interesting?

Boring!

Really boring!

Counting may be boring at times, but we will see that it can be really powerful.

Product Rule: If something can happen in n1 ways and no matter how the first thing happens, a second thing can happen in n2 ways, then the two things together can happen in n1 x n2 ways.

More generally, if something can happen in n1 ways and no matter how the first thing happens, a second thing can happen in n2 ways, and no matter how the first two things happen a third thing can happen in n3 ways, … then all the things together can happen in n1 x n2 x n3 x … ways.

How many possible DNA chains are there in humans?

How many DNA chains are there with two bases?

Answer (Product Rule): 4 x 4 = 4^2 = 16.

There are 4 choices for the first base and, for each such choice, 4 choices for the second base.

How many with 3 bases? 4^3 = 64

How many with n bases? 4^n

How many human DNA chains are possible?

4^(3.3 x 10^9)

This is greater than 10^(1.98 x 10^9)

(1 followed by 1.98 billion zeroes!)
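The size of this number can be checked without ever building it, by working with logarithms. A minimal sketch, assuming Python 3; the names `log10_chain_count`, `BASES`, and `HUMAN_LENGTH` are ours, not from the slides.

```python
import math

BASES = 4              # T, C, A, G
HUMAN_LENGTH = 3.3e9   # number of bases, the figure quoted above

def log10_chain_count(num_bases, alphabet_size=BASES):
    """Return log10 of alphabet_size ** num_bases without forming the huge integer."""
    return num_bases * math.log10(alphabet_size)

exponent = log10_chain_count(HUMAN_LENGTH)
print(f"4^(3.3 x 10^9) is roughly 10^({exponent:.3e})")
```

The exponent comes out a little above 1.98 x 10^9, which is where the bound on the slide comes from.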

RNA is a “messenger molecule” whose links are defined from DNA.

An RNA chain has at each link one of four bases.

The possible bases are the same as those in DNA except that the base Uracil (U) replaces the base Thymine (T).

Sample RNA chains:

GGCAUUGGA, UAUAUGCGGCUUC

RNA chains are very long.

Can we discover what they look like without actually observing them?

Trick: Use enzymes.

Some enzymes break up an RNA chain into fragments after each G link.

Some enzymes break up the chain after each C or U link.

Consider the chain

CCGGUCCGAAAG

Applying the G enzyme breaks the chain into the following fragments:

G fragments: CCG, G, UCCG, AAAG

We know that these are the fragments, but we do not know the order in which they appear.
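The action of an enzyme can be mimicked in a few lines. A minimal sketch, assuming Python 3.9+; the function name `digest` and its `cut_after` parameter are our own choices for illustration.

```python
def digest(chain: str, cut_after: str) -> list[str]:
    """Split chain after every base that appears in cut_after.

    cut_after="G" models the G enzyme, cut_after="UC" the U,C enzyme.
    Fragments are returned in chain order; a real digest loses that order.
    """
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:  # leftover piece that does not end in a cut base
        fragments.append(current)
    return fragments

g_fragments = digest("CCGGUCCGAAAG", "G")
print(g_fragments)
```

Sorting or shuffling the returned list would reproduce the situation in the game, where only the collection of fragments is known.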

How many possible chains have these four fragments?

Product rule again: 4 choices for the first fragment, for each such choice 3 choices for second fragment, …

There are 4x3x2x1 = 4! = 24 possible chains.

One chain corresponding to each permutation of these four fragments.

One such chain different from the original:

UCCGGCCGAAAG
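The count of 24 can be confirmed by listing every ordering of the four G fragments. A short sketch using the standard `itertools.permutations`; the variable names are ours.

```python
from itertools import permutations

g_fragments = ["CCG", "G", "UCCG", "AAAG"]

# Join each ordering of the fragments into a chain; a set drops any repeats.
chains = {"".join(order) for order in permutations(g_fragments)}

print(len(chains))
print("UCCGGCCGAAAG" in chains)
```

Because the four fragments are all different, every permutation gives a different chain.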

Chain: CCGGUCCGAAAG

Suppose we instead apply the U,C enzyme.

We get the following fragments:

U,C fragments: C, C, GGU, C, C, GAAAG

How many chains are there with these fragments?

Is 6! = 720 the correct answer???

Consider two of the permutations: one takes the fragments in the order given; the other swaps the first two fragments (both are C) and keeps all the others in the same order.

They give rise to the same chain.

So 6! is wrong.

What is the answer??

What if the fragments were

C, C, C, C, C

There are 5! permutations of these fragments, but only one RNA chain with these fragments:

CCCCC

Putting n distinguishable balls into k distinguishable boxes:

The number of ways to put n1 balls into the first box, n2 balls into the second box, …, nk balls into the kth box is denoted by C(n;n1,n2,…,nk), where n = n1 + n2 + … + nk.

Theorem: C(n;n1,n2,…,nk) = n!/(n1! n2! … nk!)
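The formula in the theorem translates directly into code. A minimal sketch, assuming Python 3; the function name `multinomial` is our own, and it takes only the box sizes since n is their sum.

```python
from math import factorial

def multinomial(*box_sizes: int) -> int:
    """Return C(n; n1, ..., nk) = n! / (n1! n2! ... nk!) with n = n1 + ... + nk."""
    result = factorial(sum(box_sizes))
    for size in box_sizes:
        result //= factorial(size)  # each division is exact
    return result

print(multinomial(5))        # five identical fragments C, C, C, C, C
print(multinomial(2, 1, 1))  # a small case that is easy to check by hand
```

With a single box the count is 1, which matches the observation that the fragments C, C, C, C, C give only the chain CCCCC.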

Example: How many RNA chains of length 6 have 3 C’s and 3 A’s?

Think of 2 boxes, a C box and an A box. How many ways are there to put 3 positions (balls) into the C box and 3 into the A box?

Answer: C(6;3,3) = 6!/(3! 3!) = 20.

Some of these are: CACACA, ACACAC, AAACCC.

If a 6-link RNA chain is chosen at random, what is the probability of obtaining one with 3 C’s and 3 A’s?

Answer: There are 4^6 = 4096 possible RNA chains of length 6.

The probability is therefore

C(6;3,3)/4^6 = 20/4096 ≈ 0.005.
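For chains this short, the probability can also be checked by brute force, listing all chains of length 6 and counting the ones with 3 C's and 3 A's. A sketch using `itertools.product`; the variable names are ours.

```python
from itertools import product

# Every RNA chain of length 6 over the bases A, C, G, U.
chains = ["".join(bases) for bases in product("ACGU", repeat=6)]

# Keep those made of exactly 3 C's and 3 A's.
matching = [c for c in chains if c.count("C") == 3 and c.count("A") == 3]

probability = len(matching) / len(chains)
print(len(chains), len(matching), round(probability, 4))
```

Enumeration stops being practical quickly, since the number of chains grows as 4^n, which is why the counting formula matters.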

The number of 10-link RNA chains consisting of 3 A’s, 2 C’s, 2 U’s, and 3 G’s is

C(10;3,2,2,3) = 25,200

What if we know they end in AAG?

Then, only the first 7 positions need to be filled, and 2 A’s and one G are already used up. Hence, the answer is

C(7;1,2,2,2) = 630

Notice how knowing the end of a chain can dramatically reduce the number of possible chains.
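The effect of a known ending can be sketched by subtracting the bases used in the ending before counting. This is an illustration under our own naming (`count_chains`, `count_chains_with_suffix`), assuming Python 3.

```python
from collections import Counter
from math import factorial

def count_chains(composition):
    """Number of chains with the given base counts, e.g. {"A": 3, "C": 2}."""
    total = factorial(sum(composition.values()))
    for count in composition.values():
        total //= factorial(count)
    return total

def count_chains_with_suffix(composition, suffix):
    """Number of chains with the given base counts that end in suffix."""
    remaining = Counter(composition)
    remaining.subtract(Counter(suffix))  # bases already fixed at the end
    if any(count < 0 for count in remaining.values()):
        return 0  # the suffix needs more of some base than is available
    return count_chains(remaining)

composition = {"A": 3, "C": 2, "U": 2, "G": 3}
print(count_chains(composition))
print(count_chains_with_suffix(composition, "AAG"))
```

The two printed values correspond to C(10;3,2,2,3) and C(7;1,2,2,2) above.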

Recall that we have the following U,C fragments:

C, C, GGU, C, C, GAAAG

The number of RNA chains with these fragments is not 6! = 720.

Think of having 6 positions (there are 6 fragments) and assigning 4 positions to the C box, 1 to the GGU box, and one to the GAAAG box.

Then the number of ways of doing this is

C(6;4,1,1) = 6!/(4! 1! 1!) = 30

Actually, this computation is still a bit off, though not because the combinatorial argument is wrong.

Notice that the fragment GAAAG does not end in U or C.

Thus, we know it comes last.

There are 5 remaining U,C fragments.

The number of chains beginning with these 5 fragments is given by

C(5;4,1) = 5

Beginning of the chains: CCCCGGU, CCCGGUC, CCGGUCC, CGGUCCC, GGUCCCC

We get all chains with the given U,C fragments by adding GAAAG to the end of each of these:

CCCCGGUGAAAG

CCCGGUCGAAAG

CCGGUCCGAAAG

CGGUCCCGAAAG

GGUCCCCGAAAG
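The same list can be produced mechanically: force the fragment that does not end in U or C to the end, then take the distinct orderings of the rest. A minimal sketch, assuming Python 3 and a valid digest as input; the function name `chains_from_uc_fragments` is ours.

```python
from itertools import permutations

def chains_from_uc_fragments(fragments):
    """Chains that have exactly these U,C fragments.

    A fragment not ending in U or C can only be the last piece of the chain,
    so it is fixed at the end and only the other fragments are permuted.
    """
    last = [f for f in fragments if f[-1] not in "UC"]
    if len(last) > 1:
        return set()  # two such fragments cannot both come last
    rest = list(fragments)
    tail = ""
    if last:
        tail = last[0]
        rest.remove(tail)
    # The set comprehension merges orderings that differ only by swapping equal fragments.
    return {"".join(order) + tail for order in permutations(rest)}

uc_chains = chains_from_uc_fragments(["C", "C", "GGU", "C", "C", "GAAAG"])
for chain in sorted(uc_chains):
    print(chain)
```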

Thus, there are 24 possible chains with the given G fragments and 5 with the given U,C fragments.

But: We have not yet combined our knowledge of both G and U,C fragments.

G fragments: CCG, G, UCCG, AAAG

U,C fragments: C, C, GGU, C, C, GAAAG

Which of the 5 chains with these U,C fragments has the right G fragments?

CCCCGGUGAAAG

CCCGGUCGAAAG

CCGGUCCGAAAG

CGGUCCCGAAAG

GGUCCCCGAAAG

CCCCGGUGAAAG does not: It has CCCCG as a G fragment.

What about the others?

Checking the remaining 4 possible RNA chains with the given U,C fragments shows that only the third one,

CCGGUCCGAAAG

has the given G fragments.

Hence, we have recovered the initial chain.

This is an example of recovery of an RNA chain given a complete digest by enzymes.
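The whole recovery can be sketched as a brute-force search: try every ordering of the U,C fragments and keep the chains whose two digests match the given fragments. This is an illustration only, with our own function names `digest` and `recover`, and it is practical only for a handful of fragments.

```python
from itertools import permutations

def digest(chain, cut_after):
    """Split chain after every base in cut_after ("G" or "UC")."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

def recover(g_fragments, uc_fragments):
    """Return all chains whose G and U,C digests equal the given fragment collections."""
    candidates = {"".join(order) for order in permutations(uc_fragments)}
    return sorted(
        chain for chain in candidates
        if sorted(digest(chain, "G")) == sorted(g_fragments)
        and sorted(digest(chain, "UC")) == sorted(uc_fragments)
    )

solutions = recover(["CCG", "G", "UCCG", "AAAG"],
                    ["C", "C", "GGU", "C", "C", "GAAAG"])
print(solutions)
```

Re-checking the U,C digest discards orderings in which GAAAG is not last, since there it would merge with the fragment that follows it.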

How remarkable is it that we could recover the initial RNA chain this way?

CCGGUCCGAAAG

How many RNA chains are there with the same bases as this chain?

There are 12 bases: 4 C’s, 4 G’s, 3 A’s, and 1 U.

The number of chains with these bases is given by C(12;4,4,3,1) = 138,600

Thus, knowing how many of each base the chain contains is not nearly as useful as knowing the fragments.

Another example.

G fragments: UG, ACG, AC

U,C fragments: U, GAC, GAC

Step 1: Does any fragment have to come last?

None of the U,C fragments has to come last.

However, the G fragment AC has to come last.

Thus, the other two G fragments come first in some order and there are only two possible RNA chains with these G fragments: UGACGAC, ACGUGAC

The latter has AC as a U,C fragment. So, the former is the correct chain.
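This example can also be worked by building the candidate chains from each digest separately and intersecting the two sets. A minimal sketch, assuming Python 3 and valid digests as input; the function name `candidates` is ours.

```python
from itertools import permutations

def candidates(fragments, cut_after):
    """Chains obtained by ordering the fragments, with a fragment that does
    not end in a cut base (if there is one) forced to come last."""
    last = [f for f in fragments if f[-1] not in cut_after]
    if len(last) > 1:
        return set()
    rest = list(fragments)
    tail = ""
    if last:
        tail = last[0]
        rest.remove(tail)
    return {"".join(order) + tail for order in permutations(rest)}

g_candidates = candidates(["UG", "ACG", "AC"], "G")
uc_candidates = candidates(["U", "GAC", "GAC"], "UC")

print(sorted(g_candidates))
print(sorted(g_candidates & uc_candidates))
```

The first line printed lists the two chains allowed by the G fragments; the second shows which of them is also consistent with the U,C fragments.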

Is it always possible to completely recover the original RNA chain given its G fragments and U,C fragments?

No: sometimes the solution is ambiguous.

Exercise: Find two RNA chains with the same G and U,C fragments.
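One way to explore the exercise is a brute-force search that groups all chains of a given length by their pair of digests. The sketch below is our own illustration (the names `digest` and `ambiguous_groups` are hypothetical) and is only feasible for short chains, since it looks at all 4^n of them.

```python
from collections import defaultdict
from itertools import product

def digest(chain, cut_after):
    """Split chain after every base in cut_after ("G" or "UC")."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

def ambiguous_groups(length):
    """Groups of distinct chains of this length sharing both fragment collections."""
    groups = defaultdict(list)
    for bases in product("ACGU", repeat=length):
        chain = "".join(bases)
        key = (tuple(sorted(digest(chain, "G"))),
               tuple(sorted(digest(chain, "UC"))))
        groups[key].append(chain)
    return [chains for chains in groups.values() if len(chains) > 1]

groups = ambiguous_groups(5)
print(len(groups))
```

Trying small lengths in turn and inspecting the returned groups is a reasonable way to check an answer found by hand.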

Surprisingly, eulerian paths in multidigraphs can be used to help with the RNA detective game.

When a digraph is allowed to have more than one arc from vertex x to vertex y, we call it a multidigraph.

A path in a multidigraph is called eulerian if it uses every arc once and only once. (Recall the Königsberg Bridge Problem.)

A closed path (one that ends where it starts) is eulerian if it is eulerian as a path.

When does a multidigraph have an eulerian path or closed path?

Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian closed path iff for every vertex, the indegree (number of incoming arcs) equals the outdegree (number of outgoing arcs).

Theorem (I.J. Good, 1946): A connected multidigraph has an eulerian path iff indegree equals outdegree at every vertex, with the possible exception of two vertices: one whose outdegree exceeds its indegree by one (where the path starts) and one whose indegree exceeds its outdegree by one (where the path ends).

Note that these theorems hold if there are loops from a vertex to itself.

A loop adds 1 to indegree and 1 to outdegree.

Thus, loops do not affect the existence of eulerian paths or closed paths.
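The condition for an eulerian closed path is easy to test by machine. A minimal sketch, assuming the multidigraph is given as a list of (tail, head) pairs with repeats for parallel arcs, so that every vertex lies on at least one arc; the function name `has_eulerian_closed_path` is ours.

```python
from collections import Counter

def has_eulerian_closed_path(arcs):
    """Check Good's condition: connected, and indegree == outdegree everywhere."""
    if not arcs:
        return False
    outdegree = Counter(tail for tail, _ in arcs)
    indegree = Counter(head for _, head in arcs)
    vertices = set(outdegree) | set(indegree)
    if any(indegree[v] != outdegree[v] for v in vertices):
        return False
    # Connectivity check, ignoring arc directions.
    neighbours = {v: set() for v in vertices}
    for tail, head in arcs:
        neighbours[tail].add(head)
        neighbours[head].add(tail)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        vertex = stack.pop()
        for other in neighbours[vertex] - seen:
            seen.add(other)
            stack.append(other)
    return seen == vertices

print(has_eulerian_closed_path([("x", "y"), ("y", "x"), ("x", "x")]))  # loop at x
print(has_eulerian_closed_path([("x", "y")]))
```

The first call includes a loop at x, which raises both the indegree and the outdegree of x by one and so leaves the balance unchanged.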

Eulerian Paths and the RNA Detective Game

Assume that there are at least two G fragments and at least two U,C fragments. Otherwise, a digest consisting of a single fragment already is the original chain.

Example:

G fragments: CCG, G, UCACG, AAAG, AA

U,C fragments: C, C, GGU, C, AC, GAAAGAA

Step 1: Break down each fragment after each G, U, or C.

E.g.: GAAAGAA becomes G·AAAG·AA (· marks each break)

GGU becomes G·G·U

UCACG becomes U·C·AC·G

Each piece is called an extended base.

All extended bases in a fragment except the first and the last are called interior extended bases.
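Step 1 can be sketched as a function that cuts a fragment after every G, U, or C. The names `extended_bases` and `interior_extended_bases` are our own, and Python 3.9+ is assumed.

```python
def extended_bases(fragment: str) -> list[str]:
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:  # trailing A's with no cut base after them
        pieces.append(current)
    return pieces

def interior_extended_bases(fragment: str) -> list[str]:
    """All extended bases of the fragment except the first and the last."""
    return extended_bases(fragment)[1:-1]

for fragment in ["GAAAGAA", "GGU", "UCACG"]:
    print(fragment, extended_bases(fragment), interior_extended_bases(fragment))
```

A fragment with only one or two extended bases has no interior extended bases, so the second function returns an empty list for it.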

Step 2: Use the extended base breakup of fragments to find the beginning and end of the RNA chain.

Start by making two lists

All interior extended bases of all fragments:

C, C, AC, G, AAAG

Fragments with one extended base:

G, AAAG, AA, C, C, C, AC

Theorem: Every entry on the first list is on the second list. There are always exactly two entries on the second list not on the first. One of these is the first extended base of the entire RNA chain and the other is the last.

Thus: chain begins in AA or C and ends in AA or C.
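The two lists and their difference can be computed with multisets. A minimal sketch using `collections.Counter`; the function names `extended_bases` and `chain_end_candidates` are ours, and the input is assumed to be a valid pair of digests.

```python
from collections import Counter

def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def chain_end_candidates(g_fragments, uc_fragments):
    """Entries on the list of one-extended-base fragments that are not
    on the list of interior extended bases, counted with multiplicity."""
    interior, singles = Counter(), Counter()
    for fragment in list(g_fragments) + list(uc_fragments):
        pieces = extended_bases(fragment)
        interior.update(pieces[1:-1])
        if len(pieces) == 1:
            singles.update(pieces)
    return sorted((singles - interior).elements())

ends = chain_end_candidates(["CCG", "G", "UCACG", "AAAG", "AA"],
                            ["C", "C", "GGU", "C", "AC", "GAAAGAA"])
print(ends)
```

Counting with multiplicity matters here: C appears three times on the second list but only twice on the first, so one C is left over.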

How do you tell how it ends?

One of these must be from an abnormal fragment: a G fragment that doesn’t end in G or a U,C fragment that doesn’t end in U or C.

AA is such an abnormal fragment.

An abnormal fragment marks the end of the chain.

So: chain ends in AA and begins in C.

Step 3: Build a multidigraph.

First, identify all normal fragments with more than one extended base. From each such fragment, use the first and last extended bases as vertices and draw an arc from the first to the last.

Label the arc with the corresponding fragment.

Fragment UCACG gives rise to vertices U and G and we include an arc from U to G labeled UCACG.

Fragment CCG means that we include an arc from C to G labeled CCG.

Fragment GGU means that we include an arc from G to U labeled GGU.

There might be several arcs from a given extended base to another if there are several normal fragments from the first to the second. That is why we get a multidigraph.

Step 4: We add one additional arc.

Identify the longest abnormal fragment.

Include an arc from the first (and perhaps only) extended base in this fragment to the first extended base in the chain.

Label this as X*Y where X is the longest abnormal fragment in the chain and Y is the first extended base in the chain.

GAAAGAA is the longest abnormal fragment.

Put in an arc from G (first extended base in this fragment) to C (first extended base in the chain).

Label the arc as GAAAGAA*C
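Steps 3 and 4 can be sketched as a function that returns the arcs of the multidigraph as (tail, head, label) triples. The names `extended_bases` and `build_arcs` are ours; the first extended base of the chain is passed in as found in Step 2, and the input is assumed to be a valid pair of digests, so at least one abnormal fragment exists.

```python
def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def build_arcs(g_fragments, uc_fragments, first_extended_base):
    """Return the arcs of the multidigraph as (tail, head, label) triples."""
    arcs, abnormal = [], []
    for fragments, cut_after in ((g_fragments, "G"), (uc_fragments, "UC")):
        for fragment in fragments:
            pieces = extended_bases(fragment)
            if fragment[-1] not in cut_after:
                abnormal.append(fragment)  # does not end in its enzyme's base
            elif len(pieces) > 1:
                # Step 3: normal fragment with more than one extended base
                arcs.append((pieces[0], pieces[-1], fragment))
    # Step 4: special arc from the longest abnormal fragment back to the start
    longest = max(abnormal, key=len)
    arcs.append((extended_bases(longest)[0], first_extended_base,
                 longest + "*" + first_extended_base))
    return arcs

arcs = build_arcs(["CCG", "G", "UCACG", "AAAG", "AA"],
                  ["C", "C", "GGU", "C", "AC", "GAAAGAA"], "C")
for arc in arcs:
    print(arc)
```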

Theorem: This multidigraph has an eulerian closed path. The RNA chains with the given G and U,C fragments correspond to eulerian closed paths that end with the special arc X*Y.

In our example, it is easy to check it has an eulerian closed path. (Use I.J. Good’s Theorem.)

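[Figure: multidigraph with vertices C, G, U and arcs C→G labeled CCG, G→U labeled GGU, U→G labeled UCACG, and G→C labeled GAAAGAA*C]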

The only eulerian closed path that ends in GAAAGAA*C goes from C to G to U to G to C.

Step 5: Use the corresponding labeling of arcs to obtain the chain:

CCGGUCACGAAAGAA

It is easy to check this has the right G and U,C fragments.
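The search for eulerian closed paths and the read-off in Step 5 can be sketched with simple backtracking. This is an illustration under our own naming (`spell`, `chains_from_multidigraph`), with the arcs written as (tail, head, label) triples and the special arc assumed to have a unique label; it is not tuned for large graphs.

```python
def spell(path):
    """Read the chain off a path of (tail, head, label) arcs.

    Consecutive fragments overlap in the extended base at their shared
    vertex, so every label after the first drops its leading extended base.
    """
    chain = ""
    for position, (tail, _head, label) in enumerate(path):
        fragment = label.split("*")[0]  # the special label X*Y contributes X
        chain += fragment if position == 0 else fragment[len(tail):]
    return chain

def chains_from_multidigraph(arcs, special_label):
    """Chains given by eulerian closed paths that end with the special arc."""
    special = next(arc for arc in arcs if arc[2] == special_label)
    others = [arc for arc in arcs if arc[2] != special_label]
    results = set()

    def extend(vertex, used, path):
        if len(path) == len(others):
            if vertex == special[0]:  # the special arc can close the path
                results.add(spell(path + [special]))
            return
        for index, arc in enumerate(others):
            if index not in used and arc[0] == vertex:
                extend(arc[1], used | {index}, path + [arc])

    extend(special[1], frozenset(), [])  # start at the first extended base
    return sorted(results)

arcs = [("C", "G", "CCG"), ("G", "U", "GGU"),
        ("U", "G", "UCACG"), ("G", "C", "GAAAGAA*C")]
print(chains_from_multidigraph(arcs, "GAAAGAA*C"))
```

When the digests are ambiguous, the returned list would contain more than one chain, one for each distinct reading of an eulerian closed path.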

The RNA Detective Game: Concluding Comments

The “fragmentation stratagem” we have described was used by R.W. Holley and his colleagues at Cornell in 1965 to determine the first nucleic acid sequence.

The method is not used anymore and was only used for a short time before other, more efficient methods were adopted.

However, it has great historical significance and illustrates an important role for mathematical methods in biology.

Nowadays, by use of radioactive marking and high-speed computer analysis, it is possible to sequence long RNA and DNA chains rather quickly.

The mathematical power of the fragmentation stratagem, nevertheless, is a good illustration of the use of methods of discrete mathematics in modern molecular biology.

And of the power of counting!
