Search algorithms

Search algorithms: problem definition (text, pattern)

The approach depends on what is known in advance:

  • Exact search:

  • Only the text ----> Structure the text (suffix tree)

  • Only the pattern(s) ---> Structure the pattern(s)

  • 1 pattern ---> The algorithm depends on the pattern length and the alphabet size |Σ|

  • k patterns ---> The algorithm depends on k, the pattern length and the alphabet size |Σ|

  • Extensions

  • Regular expressions

  • Approximate search:

It depends on the pattern length

  • Dynamic programming

  • Sequence alignment (pairwise and multiple)

  • Hash algorithm: sequence assembly

  • Probabilistic search


Sequence assembly

It is applied to the following topics:

  • DNA sequencing

  • EST assembly


DNA sequencing

There are two techniques:

  • Hybridization: provides information about the l-mers present in the DNA.

  • Shotgun: DNA sequences are broken into random fragments of 100 kb to 500 kb.


Hybridization

Let xxxxxxxxxxxxx be the sequence we want to know. The hybridization technique gives us the set of 3-mers that belong to it:

AAC GAT TGC

ACG CGG GCC TTG

GGA ATT

How can the sequence be reconstructed?


Hybridization

Given the 3-mers of the sequence:

AAC GAT TGC

ACG CGG GCC TTG

GGA ATT

As AAC and ACG belong to the sequence, AACG also belongs to the sequence, because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC).

This relation can be represented with a directed graph:

AAC ----> ACG
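
The suffix-prefix relation can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `overlap_graph` is an assumption of this example, not part of the original slides, and it links two l-mers whenever the last l-1 letters of one equal the first l-1 letters of the other.

```python
def overlap_graph(kmers):
    """Directed suffix-prefix graph: u -> v when u[1:] == v[:-1]."""
    return {u: [v for v in kmers if v != u and u[1:] == v[:-1]] for u in kmers}

# The nine 3-mers given by the hybridization example.
kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
graph = overlap_graph(kmers)

for node, successors in graph.items():
    print(node, "->", successors)
```

In this graph AAC has ACG as its only successor, which is the edge discussed above.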


Hybridization

Constructing the complete suffix-prefix graph over the 3-mers

AAC GAT TGC

ACG CGG GCC TTG

GGA ATT

gives us the unknown sequence:

AACGGATTGCC

But is this a realistic case?


Hybridization

Let us introduce a more realistic case:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

The sequence is given by the Hamiltonian path, that is, the path that traverses every node exactly once, and finding it is an NP-complete problem!
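
To make the Hamiltonian-path formulation concrete, here is a minimal backtracking sketch in Python. It is an assumption-laden illustration rather than the method used by the authors: the helper names (`overlap_graph`, `hamiltonian_path`, `spell`) are invented for this example, and it is run on the small nine 3-mer set from the earlier slides, where the path is unique. The search tries every possible extension, so its worst-case cost grows exponentially with the number of l-mers.

```python
def overlap_graph(kmers):
    """Directed suffix-prefix graph: u -> v when u[1:] == v[:-1]."""
    return {u: [v for v in kmers if v != u and u[1:] == v[:-1]] for u in kmers}

def hamiltonian_path(graph):
    """Backtracking search for a path that visits every node exactly once."""
    nodes = list(graph)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None

def spell(path):
    """Merge consecutive l-mers that overlap in l-1 letters."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
path = hamiltonian_path(overlap_graph(kmers))
sequence = spell(path)
print(sequence)
```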

What is the cost of the hybridization method?


Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path:

NP-complete


Digression: cost

Linear cost: $O(m)$

m ----> t = 1 ms

10m ----> 10t = 10 ms

1000m ----> 1000t = 1 s

Quadratic cost: $O(m^2)$

m ----> t = 1 ms

10m ----> 100t = 100 ms

1000m ----> 1000000t ≈ 16.7 min

Exponential cost: $O(2^m)$

m ----> t = 1 ms

10m ----> $2^{10}$ t ≈ 1 s

1000m ----> $2^{1000}$ t ≈ $10^{301}$ t ≈ $10^{290}$ years
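
The growth of the three cost classes can be checked with a short Python sketch. It assumes, as the table does, that one unit of work takes 1 ms and that scaling the input by a factor x multiplies the time by x, x squared, or 2 to the power x respectively; the names `UNIT_MS`, `GROWTH` and `running_time_seconds` are hypothetical.

```python
UNIT_MS = 1.0  # assumption: one unit of work takes 1 ms, as in the table

GROWTH = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "exponential": lambda x: 2 ** x,
}

def running_time_seconds(kind, factor):
    """Time in seconds after scaling the input size by `factor`."""
    return GROWTH[kind](factor) * UNIT_MS / 1000.0

SECONDS_PER_YEAR = 365.25 * 24 * 3600

for kind in GROWTH:
    for factor in (1, 10, 1000):
        seconds = running_time_seconds(kind, factor)
        print(f"{kind:12s} x{factor:<5d} {seconds:.3e} s "
              f"({seconds / SECONDS_PER_YEAR:.1e} years)")
```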


Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path:

NP-complete

How can the NP-completeness be avoided?


Hybridization

Given the 3-mers:

AAC GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

and the graph whose nodes are the 2-mers they contain:

AA GA TG AC GC TT CG GG CC AT

we can either search for the Hamiltonian path (NP-complete)

or search for the Eulerian path (linear).


Hybridization: Eulerian path

Search for the Eulerian path of the graph:

Unbalanced nodes: indegree ≠ outdegree

(starting or ending nodes)

Balanced nodes: indegree = outdegree

(traversed nodes)
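
As a rough illustration of balanced and unbalanced nodes, the following Python sketch treats each 3-mer as an edge from its 2-letter prefix to its 2-letter suffix and computes outdegree minus indegree for every node. The function name `degree_balance` is an assumption of this example, and the 3-mers are the nine from the first hybridization example.

```python
from collections import Counter

def degree_balance(kmers):
    """Return outdegree - indegree for every (l-1)-mer node."""
    outdeg, indeg = Counter(), Counter()
    for kmer in kmers:
        outdeg[kmer[:-1]] += 1  # edge leaves the prefix node
        indeg[kmer[1:]] += 1    # edge enters the suffix node
    nodes = set(outdeg) | set(indeg)
    return {node: outdeg[node] - indeg[node] for node in nodes}

kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
balance = degree_balance(kmers)

start_nodes = sorted(n for n, b in balance.items() if b > 0)
end_nodes = sorted(n for n, b in balance.items() if b < 0)
balanced_nodes = sorted(n for n, b in balance.items() if b == 0)
print("start:", start_nodes, "end:", end_nodes, "balanced:", balanced_nodes)
```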


Hybridization: Eulerian path

Algorithm:

1. Construct a random path between the starting and ending nodes.

2. Add cycles from balanced nodes while possible.
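
One common way to implement these two steps is the stack-based form of Hierholzer's algorithm, sketched below in Python. This is a swapped-in standard formulation, not necessarily the exact procedure the slides had in mind: walking until stuck plays the role of the initial path, and backtracking on the stack splices in the remaining cycles. The names `eulerian_path` and `spell` are assumptions of this example.

```python
from collections import defaultdict

def eulerian_path(kmers):
    """Eulerian path in the graph whose edges are the given l-mers."""
    graph = defaultdict(list)
    balance = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is final
    return path[::-1]

def spell(nodes):
    """Merge consecutive nodes that overlap in all but one letter."""
    return nodes[0] + "".join(node[-1] for node in nodes[1:])

kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
sequence = spell(eulerian_path(kmers))
print(sequence)
```

Each edge is pushed and popped once, so the running time is linear in the number of l-mers.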


Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches:

If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Eulerian path:

Linear cost

Now, what is the limiting factor?


Hybridization: limiting factor

Given the graph over the 3-mers:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGA ATT

GAC

Repeated l-mers: how many sequences can be assembled?

CAACGGATTGCC

CAACGGACGGATTGCC
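
The ambiguity caused by repeats can be illustrated with a small Python check. The sketch below is only an illustration: `kmer_set` is a hypothetical helper, and `three_copies` is a sequence made up for this example by repeating the ACGG block one more time. Both repeated sequences yield the same set of 3-mers, so the set alone cannot tell them apart.

```python
def kmer_set(seq, k=3):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

no_repeat = "CAACGGATTGCC"
two_copies = "CAACGGACGGATTGCC"
three_copies = "CAACGGACGGACGGATTGCC"  # hypothetical extra copy of ACGG

same_spectrum = kmer_set(two_copies) == kmer_set(three_copies)
extra_kmers = kmer_set(two_copies) - kmer_set(no_repeat)
print(same_spectrum, extra_kmers)
```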

What is the probability of a repeat?


Hybridization: statistical model

How can the probability of a repeat be computed?

Model: a random sequence of length N with identically distributed bases (probability 1/4 each).

Given 2 l-mers, the probability that they match is $4^{-L}$.

Given 3 l-mers, the expected number of pairwise matches is $\binom{3}{2} 4^{-L}$.

Given m l-mers, the expected number of pairwise matches is $\binom{m}{2} 4^{-L}$.

If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^L}$; for L = 8 this gives m < 362!

Conclusion: this technique can be applied only to short sequences.
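
The bound can be evaluated numerically with a short Python sketch under the same random-sequence model. The function names `expected_repeats` and `max_lmers_without_repeat` are assumptions of this example; the second one simply searches for the largest m whose expected number of pairwise matches stays below 1.

```python
from math import comb, sqrt

def expected_repeats(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return comb(m, 2) * 4.0 ** (-L)

def max_lmers_without_repeat(L):
    """Largest m for which the expected number of matching pairs is below 1."""
    m = 2
    while expected_repeats(m + 1, L) < 1:
        m += 1
    return m

for L in (8, 10, 12):
    print(L, max_lmers_without_repeat(L), round(sqrt(2 * 4 ** L), 1))
```

The exact search and the square-root approximation agree closely, which suggests the approximation is adequate for this kind of estimate.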


Hybridization

Are genome sequences close to random sequences?

Connect to

http://alggen.lsi.upc.edu

and follow the links RESEARCH, SEARCH, MREPATT


DNA sequencing

There are two techniques:

  • Hybridization: provides information about the l-mers present in the DNA.

  • Shotgun: DNA sequences are broken into random fragments of 100 kb to 500 kb.


Shotgun

With the unknown sequence

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

it is possible:

  • to make some copies

  • to break it into random and unsorted short segments

What can we do?


Shotgun: algorithm

Assume the copies are broken at different positions:

xxxxx|xxxxxxx|xxxxxxx|xxxx

xxxxxxxx|xxxxxx|xxxxxx|xxx

xxxx|xxxxxx|xxxxxx|xxxxxxx

The algorithm is:

1st. Compare all pairs, searching for approximate suffix-prefix matches.

2nd. Construct the suffix-prefix graph.

3rd. Find the path.


Shotgun

Given the three copies

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

the shotgun technique breaks them into the following segments:

accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt


Shotgun

The pairwise comparison that searches for approximate suffix-prefix matches can be done with:

  • Dynamic programming (quadratic cost)

  • Two steps:

  • Find the pairs suspected to overlap (linear cost with the hash algorithm)

  • Assemble them with dynamic programming.
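
The two-step strategy can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the slides' actual implementation: `candidate_pairs` indexes every k-mer of every segment in a hash table and reports pairs that share at least one k-mer, and `best_overlap` uses a standard edit-distance dynamic program in which a suffix of `a` may start anywhere for free. Both function names, the choice of k and the `max_errors` threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(fragments, k):
    """Pairs of fragment indices that share at least one k-mer."""
    index = defaultdict(set)
    for i, frag in enumerate(fragments):
        for j in range(len(frag) - k + 1):
            index[frag[j:j + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def best_overlap(a, b, max_errors=0):
    """Longest prefix of b within max_errors edits of some suffix of a."""
    m = len(b)
    prev = list(range(m + 1))   # empty suffix of a against b[:j] costs j
    for ch in a:
        cur = [0] * (m + 1)     # column 0: the suffix may start anywhere
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (ch != b[j - 1]),  # match or mismatch
                         prev[j] + 1,                     # skip a letter of a
                         cur[j - 1] + 1)                  # skip a letter of b
        prev = cur
    return max(j for j in range(m + 1) if prev[j] <= max_errors)

fragments = ["accgtacc", "tacctt", "ggt"]
pairs = candidate_pairs(fragments, k=4)
overlap = best_overlap(fragments[0], fragments[1])
print(pairs, overlap)
```

Only the candidate pairs need to go through the quadratic dynamic program, which is where the saving comes from.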


Shotgun

Given the graph over the segments

tacctt, accttta, tttaac, taacga, accgtacc, acgatac, accgt, accg, gataca, tacaggt

the assembled sequence is

accgtacctttaacgatacaggt

but finding the Hamiltonian path has exponential cost!


Shotgun

[Figure: layout of the segments along the unknown sequence, with accgt and accg shown]

New problems arise:

  • Consecutive repeats

  • Lack of coverage


Shotgun: properties of the coverage

Given the coverage, some questions arise:

  • What is the percentage of coverage?

  • How many contigs should we expect?

  • What is the mean length of the contigs?


Shotgun: percentage of coverage

Given the model: a sequence of length L covered by N segments of length d.

We assume that segments are randomly distributed.

Degree of coverage: N d / L

The probability that a base is covered by k segments is given by the binomial distribution (N, d / L):

$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$


Shotgun: percentage of coverage

What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, having $np = \lambda$?

The Poisson distribution $P(\lambda)$:

$\mathrm{Prob}\{X = k\} = e^{-\lambda} \frac{\lambda^k}{k!}$

Then the probability that at least one segment covers a base is

$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$

Then, with N d / L = 4.6 we obtain 99% coverage,

and with N d / L = 6.9 we obtain 99.9% coverage.
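
The coverage figures can be reproduced with a few lines of Python. This is a minimal sketch: the function names are assumptions of this example, and the values of N, d and L in the comparison are hypothetical numbers chosen only so that N d / L = 4.6.

```python
from math import exp

def covered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered at least once."""
    return 1.0 - exp(-coverage)

def covered_fraction_binomial(N, d, L):
    """Exact binomial value: 1 - Prob{X = 0} with p = d / L."""
    return 1.0 - (1.0 - d / L) ** N

for coverage in (1.0, 4.6, 6.9):
    print(coverage, round(covered_fraction(coverage), 5))

# Hypothetical numbers: 4600 segments of length 500 on a 500000-base sequence.
print(covered_fraction_binomial(4600, 500, 500_000))
```

For these values the binomial and Poisson results differ only in the fifth decimal place, which is why the Poisson limit is a convenient approximation here.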


Assembly of ESTs

It is the same procedure as shotgun sequencing…

…but with one great advantage:

there are many graphs with a small number of nodes!

Connect to

http://alggen.lsi.upc.es

and follow the links RESEARCH, ESSEM

