
# Search algorithms - PowerPoint PPT Presentation

Search algorithms. Search algorithms: problem definition (text, pattern). It depends on what is known beforehand. Exact search: if only the text is known, structure the text (suffix tree); if only the pattern(s) are known, structure the pattern(s).


## PowerPoint Slideshow about 'Algorismes de cerca' - billy

Presentation Transcript

Search algorithms: problem definition (text, pattern)

It depends on what is known beforehand:

• Exact search:

• Only the text ----> Structure the text (suffix tree)

• Only the pattern(s) ---> Structure the pattern(s)

• 1 pattern ---> The algorithm depends on the length and |Σ|

• k patterns ---> The algorithm depends on k, the length and |Σ|

• Extensions

• Regular expressions

depends on the pattern length

• Dynamic programming

• Sequence alignment (pairwise and multiple)

• Hash algorithm: sequence assembly.

• Probabilistic search

It is applied to the following topics:

• DNA sequencing

• EST assembly

There are two techniques:

• Hybridization: provides information about the l-tuples (l-mers) present in the DNA.

• Shotgun: DNA sequences are broken into random fragments of 100 Kb-500 Kb.


Let xxxxxxxxxxxxx be the sequence we want to know,

and the hybridization technique gives us

the set of 3-mers that belong to it:

AAC GAT TGC

ACG CGG GCC TTG

GGA ATT

How can the sequence be reconstructed?

Among the 3-mers above, as AAC and ACG belong to the sequence,

then AACG belongs to the sequence,

because the longest proper suffix of AAC (AC)

matches the longest proper prefix of ACG (AC).

This relation can be represented with a directed graph:

AAC → ACG

Construction of the complete suffix-prefix graph

AAC → ACG → CGG → GGA → GAT → ATT → TTG → TGC → GCC

that gives us the unknown sequence:

AACGGATTGCC
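The reconstruction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `overlap_graph`, `hamiltonian_path` and `spell` are hypothetical, every 3-mer is assumed to occur once, and the path is found by plain backtracking, which is exponential in the worst case.

```python
def overlap_graph(kmers):
    """Edge u -> v when the longest proper suffix of u equals the longest proper prefix of v."""
    return {u: [v for v in kmers if v != u and u[1:] == v[:-1]] for u in kmers}


def hamiltonian_path(graph):
    """Backtracking search for a path that visits every node exactly once."""
    total = len(graph)

    def extend(path, seen):
        if len(path) == total:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Glue consecutive l-mers: the first one whole, then the last letter of each following one."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
sequence = spell(hamiltonian_path(overlap_graph(kmers)))
print(sequence)  # AACGGATTGCC
```

For these nine 3-mers the graph is a single chain, so the search succeeds immediately; the difficulty only appears when nodes have several successors.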

But is this a realistic case?

Let us introduce a more realistic case:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

Here the sequence is given by a Hamiltonian path,

that is, a path that visits every node exactly once,

and finding such a path is an NP-complete problem!

What is the cost of the hybridization method?

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ l-mers of length L that should be generated

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons

3. Searching for the Hamiltonian path:

NP-complete

| Problem size | Linear cost: $O(m)$ | Quadratic cost: $O(m^2)$ | Exponential cost: $O(2^m)$ |
|---|---|---|---|
| m | t = 1 ms | t = 1 ms | t = 1 ms |
| 10m | 10t = 10 ms | 100t = 100 ms | $2^{10}$t ≈ 1 s |
| 1000m | 1000t = 1 s | 1000000t ≈ 16 min | $2^{1000}$t ≈ $10^{301}$t ≈ $10^{290}$ years |
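The growth figures can be checked with a short Python sketch. It follows the same simplified convention as the figures above (a problem k times larger costs k·t, k²·t or 2^k·t, with t = 1 ms); the function name `running_time_ms` and the constant `MS_PER_YEAR` are hypothetical.

```python
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365


def running_time_ms(k, kind, t_ms=1):
    """Time in milliseconds for a problem k times larger than one that takes t_ms."""
    if kind == "linear":
        return k * t_ms
    if kind == "quadratic":
        return k ** 2 * t_ms
    if kind == "exponential":
        return 2 ** k * t_ms  # simplified convention, as in the figures
    raise ValueError(f"unknown cost model: {kind}")


for k in (1, 10, 1000):
    times = [running_time_ms(k, kind) for kind in ("linear", "quadratic", "exponential")]
    # Print the order of magnitude only, since the exponential value has hundreds of digits.
    print(k, [f"~10^{len(str(t)) - 1} ms" for t in times])

# Exact integer arithmetic: whole years needed in the exponential case for k = 1000.
years = running_time_ms(1000, "exponential") // MS_PER_YEAR
print(f"about 10^{len(str(years)) - 1} years")
```

Python integers have arbitrary precision, so the exponential case is computed exactly rather than overflowing.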


How can the NP-completeness be avoided?

Take the 3-mers as edges:

AAC GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

and their 2-mer prefixes and suffixes as nodes:

AA GA TG AC GC TT CG GG CC AT

Search for the Hamiltonian path (NP-complete)

or search for the Eulerian path (linear)

Search for the Eulerian path of the graph:

Unbalanced nodes: indegree ≠ outdegree

(starting or ending nodes)

Balanced nodes: indegree = outdegree

(traversed nodes)

Algorithm:

1. Construct a random path

between starting and ending nodes.

2. Add cycles from balanced nodes while possible.

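A minimal Python sketch of this search follows. It uses Hierholzer's algorithm, a standard linear-time method built on the same idea as the two steps above (walk from the starting node, then splice in the remaining cycles). It assumes that the nodes are the (l-1)-mers and that each l-mer is one edge from its prefix to its suffix; the function name `eulerian_sequence` is hypothetical.

```python
from collections import defaultdict


def eulerian_sequence(kmers):
    """Spell a sequence that uses every l-mer exactly once as an edge of the graph."""
    graph = defaultdict(list)    # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)   # outdegree - indegree of each node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the unbalanced node with one extra outgoing edge, if there is one.
    start = next((node for node, b in balance.items() if b == 1), kmers[0][:-1])

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: node is final in the path
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("the l-mers do not form a single Eulerian path")
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_sequence(kmers))  # AACGGATTGCC
```

Every edge is pushed and popped once, so the running time is linear in the number of l-mers. When an l-mer occurs more than once it has to appear that many times in the input list, and several Eulerian paths may then exist.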

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ l-mers of length L that should be generated

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons

3. Searching for the Eulerian path:

Linear cost

Now, what is the limiting factor?

Repeated l-mers:

Given the graph built from

AAC CAA GAT TGC

ACG CGG GCC TTG

GGA ATT

GAC

How many sequences can be assembled?

CAACGGATTGCC

CAACGGACGGATTGCC

What is the probability of a repeat?

How can the probability of a repeat be computed?

Model: random sequence of length N with identically distributed bases (probability 1/4 each).

Given 2 l-mers, the probability that they match is $4^{-L}$

Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$

Given m l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$

If $\binom{m}{2} 4^{-L} < 1$ then, approximately, $m < \sqrt{2 \cdot 4^{L}}$

so for L = 8, m must stay below about 362!

Conclusion: this technique can be applied only

to short sequences.
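The bound can be evaluated numerically with a small Python sketch. It only restates the model above (independent l-mers, each pair matching with probability 4^-L); the function names `expected_repeats` and `max_lmers_without_repeat` are hypothetical.

```python
from math import comb, sqrt


def expected_repeats(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return comb(m, 2) / 4 ** L


def max_lmers_without_repeat(L):
    """Largest m for which the expected number of matching pairs stays below 1."""
    m = 1
    while expected_repeats(m + 1, L) < 1:
        m += 1
    return m


print(max_lmers_without_repeat(8))   # 362
print(sqrt(2 * 4 ** 8))              # about 362.04, the closed-form approximation
print(expected_repeats(512, 8))      # about 2 expected repeats
```

The exact threshold and the square-root approximation agree for L = 8, and with 512 l-mers about two repeats are already expected.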

Are genome sequences close to random sequences?

Connect to

http://alggen.lsi.upc.edu

There are two techniques:

• Hybridization: provides information about the l-mers present in the DNA.

• Shotgun: DNA sequences are broken into random fragments of 100 Kb-500 Kb.

With the unknown sequence

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

it is possible:

• to make some copies

• to break them into random, unsorted short segments

What can we do?

Assume

xxxxx|xxxxxxx|xxxxxxx|xxxx

xxxxxxxx|xxxxxx|xxxxxx|xxx

xxxx|xxxxxx|xxxxxx|xxxxxxx

The algorithm is:

1st. Compare all pairs, searching for approximate suffix-prefix matches.

2nd. Construct the suffix-prefix graph.

3rd. Find the path.
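The first step can be sketched with a dynamic programming table in Python. This is an illustration under assumptions that are not in the slides: the scoring values (match +1, mismatch -1, gap -1) and the function name `overlap_alignment` are hypothetical choices.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Best-scoring alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, j), where the overlap covers the first j letters of b.
    Cost is quadratic: one table cell per pair of positions.
    """
    n, m = len(a), len(b)
    # Row 0: a prefix of b aligned against nothing costs one gap per letter.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: the overlap may start at any position of a for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The overlap must reach the end of a, but may stop anywhere in b.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(overlap_alignment("accgtacc", "tacctt"))        # (4, 4): exact overlap "tacc"
print(overlap_alignment("ttacgtacgt", "acgaacgtgg"))  # (6, 8): 7 matches, 1 mismatch
```

A pair of segments would then become an edge of the suffix-prefix graph when its score exceeds some chosen threshold.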

Given the three copies

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Shotgun breaks them into the following segments:

accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

The pairwise comparison that searches for approximate suffix-prefix matches can be done with:

• Dynamic programming (quadratic cost)

• Two steps:

• Find the pairs that are candidates for assembly (linear cost with the hash algorithm)

• Assemble them with dynamic programming.
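The hash-based filtering step might look like the Python sketch below. It is a simplification: it only reports exact suffix-prefix overlaps, whereas the approximate matches mentioned above would still need the dynamic programming step. The function name `candidate_overlaps` and the seed length `k = 3` are assumptions.

```python
from collections import defaultdict


def candidate_overlaps(reads, k=3):
    """Exact suffix-prefix overlaps of length >= k, found through a hash of k-letter prefixes."""
    prefix_index = defaultdict(list)          # k-letter prefix -> reads starting with it
    for read in reads:
        if len(read) >= k:
            prefix_index[read[:k]].append(read)

    overlaps = {}
    for a in reads:
        # i >= 1 so that only proper suffixes of a are considered.
        for i in range(1, len(a) - k + 1):
            for b in prefix_index.get(a[i:i + k], ()):
                if b != a and b.startswith(a[i:]):
                    # Smaller i is visited first, so the longest overlap is kept.
                    overlaps.setdefault((a, b), len(a) - i)
    return overlaps


reads = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
         "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
overlaps = candidate_overlaps(reads)
print(overlaps[("accgtacc", "tacctt")])   # 4
print(overlaps[("tacctt", "accttta")])    # 5
```

Each k-letter window costs one dictionary lookup, so only the pairs that share a seed are compared in full.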

Given the graph with nodes

tacctt, accttta, tttaac, taacga, accgtacc, acgatac, accgt, accg, gataca, tacaggt

the assembled sequence is

accgtacctttaacgatacaggt

but finding the Hamiltonian path has exponential cost!

[Figure: layout of the overlapping segments, including accgt and accg]

New problems arise

• Consecutive repeats

• Lack of coverage

Some questions arise:

• What is the percentage of coverage?

• How many contigs should we expect?

• What is the mean length of the contigs?

Shotgun: percentage of coverage

Given the model: N segments of length d, randomly distributed over a sequence of length L.

Degree of coverage: $N d / L$

The probability that a base is covered by k segments is given by the binomial distribution with parameters $(N, d/L)$:

$$\mathrm{Prob}\{X = k\} = \binom{N}{k} \left(\frac{d}{L}\right)^{k} \left(1 - \frac{d}{L}\right)^{N-k}$$

What is the limit of the binomial distribution as $n \to \infty$ and $p \to 0$ with $np = \lambda$?

The Poisson distribution $P(\lambda)$:

$$\mathrm{Prob}\{X = k\} = e^{-\lambda} \frac{\lambda^{k}}{k!}$$

Then the probability that at least one segment covers a base is

$$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$$

Then, with $N d / L = 4.6$ we obtain 99% coverage

and with $N d / L = 6.9$ we obtain 99.9% coverage.
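These coverage figures can be reproduced with a short Python sketch of the Poisson approximation above. The function names `covered_fraction` and `required_depth` are hypothetical, and the example values of N, d and L are made up for illustration.

```python
from math import exp, log


def covered_fraction(N, d, L):
    """Probability that a base is covered by at least one of N segments of length d."""
    return 1 - exp(-N * d / L)


def required_depth(target):
    """Degree of coverage N d / L needed to reach the target covered fraction."""
    return -log(1 - target)


print(round(required_depth(0.99), 2))    # 4.61
print(round(required_depth(0.999), 2))   # 6.91
# Illustrative values only: 46 segments of length 100 on a sequence of length 1000.
print(round(covered_fraction(46, 100, 1000), 4))
```

Inverting the formula shows that each additional factor of ten in the uncovered fraction costs about 2.3 more units of coverage depth.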

It is the same procedure as shotgun sequencing…

…but with one great advantage:

there are many graphs with a small number of nodes!

Connect to

http://alggen.lsi.upc.es