Search algorithms

Search algorithms: problem definition (text, pattern)

The choice of algorithm depends on what is known in advance:

  • Exact search:

  • Only the text is known → structure the text (suffix tree)

  • Only the pattern(s) are known → structure the pattern(s)

  • 1 pattern → the algorithm depends on the pattern length and the alphabet size |Σ|

  • k patterns → the algorithm depends on k, the pattern length and the alphabet size |Σ|

  • Extensions

  • Regular expressions

  • Approximate search:

depends on the pattern length

  • Dynamic programming

  • Sequence alignment (pairwise and multiple)

  • Hash algorithm: sequence assembly

  • Probabilistic search


Sequence assembly

It is applied to the following topics:

  • DNA sequencing

  • EST assembly


DNA sequencing

There are two techniques:

  • Hybridization: provides information about the l-mers (l-tuples) present in the DNA.

  • Shotgun: DNA sequences are broken into random fragments of 100 kb to 500 kb.


Hybridization

Let xxxxxxxxxxxxx be the sequence we want to determine. The hybridization technique gives us the set of 3-mers that occur in it:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

How can the sequence be reconstructed?


Hybridization

Given the 3-mers of the sequence:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

As AAC and ACG both belong to the sequence, AACG also belongs to it (assuming each 3-mer occurs only once), because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC).

This relation can be represented with a directed graph:

AAC → ACG


Hybridization

Constructing the complete suffix-prefix graph over the 3-mers

AAC GAT TGC ACG CGG GCC TTG GGA ATT

gives us the unknown sequence:

AACGGATTGCC
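The chaining of suffix-prefix matches can be sketched in Python. This is only an illustration for the simple case above, assuming every 3-mer has at most one successor and one predecessor; the function name `reconstruct_from_kmers` is hypothetical and does not come from the slides.

```python
def reconstruct_from_kmers(kmers):
    """Chain k-mers whose (k-1)-suffix equals the (k-1)-prefix of the next one.

    Assumption: the suffix-prefix graph is a single simple chain, so each
    k-mer has at most one successor. Repeats or branches raise an error.
    """
    k = len(kmers[0])
    by_prefix = {}
    for kmer in kmers:
        prefix = kmer[:k - 1]
        if prefix in by_prefix:
            raise ValueError("branching graph: two k-mers share a prefix")
        by_prefix[prefix] = kmer

    # The starting k-mer is the one whose prefix is nobody's suffix.
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:k - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting k-mer")

    current = starts[0]
    sequence = current
    used = 1
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        sequence += current[-1]  # each step contributes one new base
        used += 1
        if used > len(kmers):
            raise ValueError("cycle detected in the suffix-prefix graph")
    return sequence


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(reconstruct_from_kmers(three_mers))  # AACGGATTGCC
```

The sketch deliberately refuses branching inputs, since those are the cases where the general path search is needed.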

But, is this a realistic case?


Hybridization

Let us introduce a more realistic case:

AAC CAA GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT

The sequence is given by a Hamiltonian path, that is, a path that visits every node exactly once. Finding such a path is an NP-complete problem!

What is the cost of the hybridization method?


Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path:

NP-complete


Excursus: cost

Assume that an input of size m takes t = 1 ms.

Linear cost, $O(m)$:

  • m → t = 1 ms

  • 10m → 10t = 10 ms

  • 1000m → 1000t = 1 s

Quadratic cost, $O(m^2)$:

  • m → t = 1 ms

  • 10m → 100t = 100 ms

  • 1000m → $10^6$ t ≈ 16 min

Exponential cost, $O(2^m)$:

  • m → t = 1 ms

  • 10m → $2^{10}$ t ≈ 1 s

  • 1000m → $2^{1000}$ t ≈ $10^{301}$ t ≈ $10^{290}$ years
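The arithmetic behind this comparison can be checked with a few lines of Python. This is a sketch of the slide's simplified model only (baseline of 1 ms at unit size, input scaled by a factor); the function name `projected_time_ms` is an assumption made for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def projected_time_ms(growth, factor, base_ms=1.0):
    """Running time when the input is `factor` times larger.

    Simplified model (an assumption): the baseline input has unit size
    and takes `base_ms` milliseconds.
    """
    if growth == "linear":
        return base_ms * factor
    if growth == "quadratic":
        return base_ms * factor ** 2
    if growth == "exponential":
        return base_ms * 2 ** factor
    raise ValueError(f"unknown growth class: {growth}")


for growth in ("linear", "quadratic", "exponential"):
    for factor in (1, 10, 1000):
        ms = projected_time_ms(growth, factor)
        years = ms / 1000 / SECONDS_PER_YEAR
        print(f"{growth:12s} x{factor:<5d} {ms:.3g} ms ({years:.3g} years)")
```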


How can the NP-completeness be avoided?


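Hybridization

Two graphs can be built from the 3-mers

AAC GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT

one with the 3-mers as nodes, and one with the 2-mers as nodes:

AA GA TG AC GC TT CG GG CC AT

Search for the Hamiltonian path in the first graph (NP-complete), or search for the Eulerian path in the second (linear).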


Hybridization: Eulerian path

Search for the Eulerian path of the graph:

Unbalanced nodes: indegree ≠ outdegree

(starting or ending nodes)

Balanced nodes: indegree = outdegree

(traversed nodes)


Hybridization: Eulerian path

Algorithm:

1. Construct an arbitrary path between the starting and ending nodes.

2. Add cycles from balanced nodes while unused edges remain.
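A minimal Python sketch of this search follows. It assumes a graph with the (l-1)-mers as nodes and one edge per l-mer, and it uses the stack-based form of Hierholzer's algorithm, which splices cycles in while backtracking instead of literally building a first path and extending it afterwards. The function name `eulerian_sequence` is hypothetical, and the input is assumed to admit an Eulerian path.

```python
from collections import defaultdict


def eulerian_sequence(kmers):
    """Spell the sequence given by an Eulerian path over the k-mers.

    Nodes are (k-1)-mers; every k-mer is one edge from its prefix to its
    suffix. Assumes an Eulerian path exists (at most one node with
    outdegree - indegree = 1 and one with indegree - outdegree = 1).
    """
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the unbalanced node with one extra outgoing edge, if any.
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


three_mers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_sequence(three_mers))  # AACGGATTGCC
```

Each edge is pushed and popped once, which is where the linear cost comes from.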


Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Eulerian path:

Linear cost

Now, what is the limiting factor?


Hybridization: limiting factor

The limiting factor is repeated l-mers. Given the graph over the 3-mers

AAC CAA GAT TGC ACG CGG GCC TTG GGA ATT GAC

how many sequences can be assembled?

CAACGGATTGCC

CAACGGACGGATTGCC

What is the probability of a repeat?


Hybridization: statistical model

How can the probability of a repeat be computed?

Model: a random sequence of length N with independent, identically distributed bases (probability 1/4 each).

Given 2 l-mers, the probability that they match is $4^{-L}$.

Given 3 l-mers, the expected number of pairwise matches is $\binom{3}{2} 4^{-L}$.

Given m l-mers, the expected number of pairwise matches is $\binom{m}{2} 4^{-L}$.

If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^{L}}$; for L = 8 this gives only m ≈ 362!

Conclusion: this technique can be applied only to short sequences.
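The bound can be evaluated numerically. The sketch below works under the same random-sequence model; the function names are hypothetical and chosen only for this illustration.

```python
from math import comb


def expected_repeats(m, length):
    """Expected number of matching pairs among m random l-mers of the given length."""
    return comb(m, 2) / 4 ** length


def max_lmers_below_one_repeat(length):
    """Largest m for which the expected number of matching pairs stays below 1."""
    m = 1
    while comb(m + 1, 2) < 4 ** length:
        m += 1
    return m


print(max_lmers_below_one_repeat(8))           # 362
print(round(expected_repeats(512, 8), 3))      # 1.996
```

With l-mers of length 8, about two repeated pairs are already expected once 512 l-mers are read.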


Hybridization

Are genome sequences close to random sequences?

Connect to

http://alggen.lsi.upc.edu

and follow the links RESEARCH, SEARCH, MREPATT.


Shotgun

With the unknown sequence

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

  • It is possible:

  • to make several copies of it

  • to break each copy into random, unsorted short segments

What can we do?


Shotgun: algorithm

Assume the copies are broken at different points:

xxxxx|xxxxxxx|xxxxxxx|xxxx

xxxxxxxx|xxxxxx|xxxxxx|xxx

xxxx|xxxxxx|xxxxxx|xxxxxxx

The algorithm is:

1. Compare all pairs of segments, searching for approximate suffix-prefix matches.

2. Construct the suffix-prefix graph.

3. Find the path.


Shotgun

Given the three copies

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

the shotgun technique breaks them into the following segments:

accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt
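The pairwise suffix-prefix comparison can be sketched as follows. As a simplification, this Python example looks for exact overlaps only, whereas the slides call for approximate matches; the names `overlap` and `overlap_graph` and the minimum overlap of 3 bases are assumptions made for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the longest exact overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(fragments, min_len=3):
    """Directed suffix-prefix graph: (a, b) -> overlap length, for all ordered pairs."""
    graph = {}
    for a in fragments:
        for b in fragments:
            if a == b:
                continue
            n = overlap(a, b, min_len)
            if n > 0:
                graph[(a, b)] = n
    return graph


fragments = ["accgtacc", "tacctt", "accttta"]
for (a, b), n in overlap_graph(fragments).items():
    print(f"{a} -> {b} (overlap {n})")
```

Comparing all ordered pairs this way is quadratic in the number of segments, which motivates a cheaper filtering step before the expensive comparison.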


Shotgun

The pairwise comparison that searches for approximate suffix-prefix matches can be done with:

  • Dynamic programming (quadratic cost)

  • Two steps:

  • Find the pairs that are candidates for assembly (linear cost with the hash algorithm)

  • Assemble them with dynamic programming.
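The first of the two steps can be sketched with a hash table of short substrings. The slides do not specify the hash algorithm, so this Python example is an assumption about one common approach: index every k-mer of every segment and report pairs of segments that share at least one k-mer. The name `candidate_pairs` and the value k = 4 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(fragments, k=4):
    """Pairs of fragment indices (i, j), i < j, that share at least one k-mer.

    Building the index is linear in the total fragment length; only the
    reported pairs would then be passed to dynamic programming.
    """
    index = defaultdict(set)
    for frag_id, fragment in enumerate(fragments):
        for pos in range(len(fragment) - k + 1):
            index[fragment[pos:pos + k]].add(frag_id)

    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


fragments = ["tacctt", "accttta", "gataca", "ggt"]
print(candidate_pairs(fragments))  # {(0, 1)}
```

Segments shorter than k contribute nothing to the index, so k has to be chosen below the shortest overlap worth detecting.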


Shotgun

Given the graph with nodes

tacctt, accttta, tttaac, taacga, accgtacc, acgatac, accgt, accg, gataca, tacaggt

the assembled sequence is

accgtacctttaacgatacaggt

but finding the Hamiltonian path has exponential cost!


Shotgun

New problems arise:

  • Consecutive repeats

  • Lack of coverage


Shotgun: properties of the coverage

Given the coverage, some questions arise:

  • What is the percentage of coverage?

  • How many contigs should we expect?

  • What is the mean length of the contigs?


Shotgun: percentage of coverage

Given the model: a sequence of length L is covered by N segments of length d.

We assume that the segments are randomly distributed.

Degree of coverage: N d / L

The probability that a base is covered by k segments is given by the binomial distribution B(N, d/L):

$$\mathrm{Prob}\{X = k\} = \binom{N}{k} \left(\frac{d}{L}\right)^{k} \left(1 - \frac{d}{L}\right)^{N-k}$$


Shotgun: percentage of coverage

What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, with $np = \lambda$?

The Poisson distribution $P(\lambda)$:

$$\mathrm{Prob}\{X = k\} = e^{-\lambda} \frac{\lambda^{k}}{k!}$$

Then the probability that at least one segment covers a base is

$$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$$

Then, with N d / L = 4.6 we obtain 99% coverage,

and with N d / L = 6.9 we obtain 99.9% coverage.
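These figures can be reproduced with a short Python sketch. The function names and the sample values of N, d and L in the last lines are assumptions chosen for illustration, not data from the slides.

```python
import math


def covered_fraction(depth):
    """Poisson estimate of the fraction of bases covered, with depth = N d / L."""
    return 1.0 - math.exp(-depth)


def required_depth(target_fraction):
    """Degree of coverage N d / L needed to reach the target covered fraction."""
    return -math.log(1.0 - target_fraction)


def binomial_covered_fraction(n_segments, segment_len, sequence_len):
    """Exact binomial value 1 - (1 - d/L)^N, for comparison with the Poisson limit."""
    return 1.0 - (1.0 - segment_len / sequence_len) ** n_segments


print(f"{covered_fraction(4.6):.4f}")   # about 0.99
print(f"{covered_fraction(6.9):.4f}")   # about 0.999
print(f"{required_depth(0.99):.3f}")    # about 4.605

# Hypothetical values: 4600 segments of 500 bases over 500000 bases (depth 4.6).
print(f"{binomial_covered_fraction(4600, 500, 500_000):.4f}")
```

For small d / L the binomial and Poisson values agree closely, which is why the simpler exponential formula is used.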


Assembly of ESTs

It is the same procedure as shotgun sequencing…

…but with one great advantage:

there are many graphs with a small number of nodes!

Connect to

http://alggen.lsi.upc.es

and follow the links RESEARCH, ESSEM.