Master Course

Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya

Master Course Fourth lecture: Sequence assembly

Sequence assembly It is applied to the following topics: • DNA sequencing • EST assembly

DNA sequencing There are two techniques: • Hybridization: provides information about the l-tuples (l-mers) present in the DNA. • Shotgun: DNA sequences are broken into random fragments of 100Kb-500Kb.

Hybridization Let xxxxxxxxxxxxx be the sequence we want to know, and the hybridization technique gives us the set of 3-mers that belong to it: AAC GAT TGC ACG CGG GCC TTG GGA ATT How can the sequence be reconstructed?

Hybridization Given the 3-mers of the sequence: AAC GAT TGC ACG CGG GCC TTG GGA ATT As AAC and ACG belong to the sequence, AACG belongs to the sequence, because the longest (proper) suffix of AAC matches the longest (proper) prefix of ACG. This relation can be represented with a directed graph: AAC → ACG

Hybridization Construction of the complete suffix-prefix graph over AAC GAT TGC ACG CGG GCC TTG GGA ATT gives us the unknown sequence: AACGGATTGCC. But is this a realistic case?
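The reconstruction above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names `overlap_graph` and `spell_unique_path` are hypothetical, and the sketch only handles the easy case in which every 3-mer has at most one successor, so that following the edges is enough.

```python
def overlap_graph(kmers):
    """Edge u -> v when the longest proper suffix of u equals the longest proper prefix of v."""
    return {u: [v for v in kmers if v != u and u[1:] == v[:-1]] for u in kmers}


def spell_unique_path(kmers):
    """Follow the suffix-prefix graph when there is no branching (assumption of this sketch)."""
    graph = overlap_graph(kmers)
    has_predecessor = {v for successors in graph.values() for v in successors}
    starts = [u for u in kmers if u not in has_predecessor]
    if len(starts) != 1 or any(len(s) > 1 for s in graph.values()):
        raise ValueError("branching graph: a Hamiltonian-path search would be needed")
    path = [starts[0]]
    while graph[path[-1]]:
        nxt = graph[path[-1]][0]
        if nxt in path:  # guard against walking into a cycle
            raise ValueError("cycle found: the path is not unique")
        path.append(nxt)
    if len(path) != len(kmers):
        raise ValueError("some l-mers are not reachable from the start")
    # The first l-mer contributes all its letters, each later one only its last letter.
    return path[0] + "".join(v[-1] for v in path[1:])


kmers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
print(spell_unique_path(kmers))
```

As soon as one l-mer has two possible successors, the function refuses to guess, which is exactly the situation the next slides address.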

Hybridization Let us introduce a more realistic case: AAC CAA GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT The sequence is given by the Hamiltonian path, that is, the path that traverses every node exactly once, and finding it is an NP-complete problem! What is the cost of the hybridization method?

Hybridization: cost Cost: 1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length $L$ that should be generated. 2. Searching for the suffix-prefix matches: if there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons. 3. Searching for the Hamiltonian path: NP-complete.

Excursus: cost
Linear cost, $O(m)$: m → t = 1 ms; 10m → 10t = 10 ms; 1000m → 1000t = 1 s.
Quadratic cost, $O(m^2)$: m → t = 1 ms; 10m → 100t = 100 ms; 1000m → 1000000t ≈ 16 min.
Exponential cost, $O(2^m)$: m → t = 1 ms; 10m → $2^{10}$t ≈ 1 s; 1000m → $2^{1000}$t ≈ $10^{301}$t, on the order of $10^{290}$ years.

Hybridization: cost Cost: 1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length $L$ that should be generated. 2. Searching for the suffix-prefix matches: if there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons. 3. Searching for the Hamiltonian path: NP-complete. How can the NP-completeness be avoided?

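Hybridization: Graph with the 3-mers as nodes: AAC GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT Graph with the 2-mers as nodes, where each 3-mer is an edge from its prefix to its suffix: AA GA TG AC GC TT CG GG CC AT Search for the Hamiltonian path in the first graph (NP-complete) or search for the Eulerian path in the second one (linear).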

Hybridization: Eulerian path Search for the Eulerian path of the graph: Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes). Balanced nodes: indegree = outdegree (traversed nodes).
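A small sketch of how balanced and unbalanced nodes can be found, assuming the 3-mers of the first example are taken as edges between 2-mers. The helper name `degree_balance` is hypothetical; a positive value marks a candidate starting node and a negative value a candidate ending node.

```python
from collections import Counter


def degree_balance(kmers):
    """Return outdegree - indegree for every (l-1)-mer node."""
    outdeg, indeg = Counter(), Counter()
    for kmer in kmers:
        outdeg[kmer[:-1]] += 1  # the edge leaves the prefix node
        indeg[kmer[1:]] += 1    # and enters the suffix node
    return {node: outdeg[node] - indeg[node] for node in set(outdeg) | set(indeg)}


kmers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
unbalanced = {node: b for node, b in degree_balance(kmers).items() if b != 0}
print(unbalanced)  # the starting node has balance +1, the ending node -1
```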

Hybridization: Eulerian path Algorithm: 1. Construct a random path between starting and ending nodes. 2. Add cycles from balanced nodes while possible.
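One common way to implement this idea is the stack-based form of Hierholzer's algorithm, sketched below. It is not necessarily the exact procedure intended in the lecture: instead of building a path first and splicing cycles afterwards, it walks until it gets stuck and backtracks, which yields the same kind of Eulerian path in time linear in the number of edges. The function name `eulerian_path` and the input format (a list of l-mers, repeats allowed) are assumptions of this sketch.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Spell a sequence that uses every l-mer (edge) exactly once."""
    graph = defaultdict(list)   # (l-1)-mer -> successors not yet traversed
    balance = defaultdict(int)  # outdegree - indegree
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if len(starts) != 1 or len(ends) != 1 or any(abs(b) > 1 for b in balance.values()):
        raise ValueError("expected exactly one starting and one ending node")
    stack, path = [starts[0]], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # keep walking along an unused edge
        else:
            path.append(stack.pop())         # dead end: node is final in the path
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("graph is not connected")
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_path("AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()))
```

When l-mers are repeated, several Eulerian paths may exist, so the returned sequence is only guaranteed to contain the same l-mers as the input, not to be the original sequence.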

Hybridization: cost Cost: 1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length $L$ that should be generated. 2. Searching for the suffix-prefix matches: if there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons. 3. Searching for the Eulerian path: linear cost. Now, what is the limiting factor?

Hybridization: limiting factor Given the graph: AAC CAA GAT TGC ACG CGG GCC TTG GGA ATT GAC Repeated l-mers: how many sequences can be assembled? CAACGGATTGCC CAACGGACGGATTGCC What is the probability of a repeat?

Hybridization: statistical model How can the probability of a repeat be computed? Model: a random sequence of length $N$ with identically distributed bases (probability 1/4 each). Given 2 l-mers, the probability that they match is $4^{-L}$. Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$. Given $m$ l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$. If $\binom{m}{2} 4^{-L} < 1$ then, approximately, $m < \sqrt{2 \cdot 4^L}$; for $L = 8$ this gives $m \approx 362$! Conclusion: this technique can be applied only to short sequences.
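The bound can be checked numerically with a short sketch. The function names are hypothetical, and the model is the one stated on the slide (independent, uniformly distributed bases); the loop simply searches for the largest number of l-mers whose expected number of matching pairs stays below one.

```python
from math import comb


def expected_repeats(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return comb(m, 2) * 4.0 ** -L


def max_lmers_without_repeat(L):
    """Largest m for which the expected number of matching pairs is below 1."""
    m = 1
    while expected_repeats(m + 1, L) < 1:
        m += 1
    return m


for L in (4, 8, 12):
    print(L, max_lmers_without_repeat(L))
```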

Hybridization: Are genome sequences close to random sequences? Connect to http://alggen.lsi.upc.edu and follow the links RESEARCH, SEARCH, MREPATT

Shotgun With the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx it is possible: • to make some copies • to break it into random and unsorted short segments What can we do?

Shotgun Given the three copies xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx the shotgun technique breaks them into the following segments: accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

Shotgun The pairwise comparison that searches for approximate suffix-prefix matches can be done with: • Dynamic programming (quadratic cost) • Two steps: 1. Find the pairs suspected to be assembled (linear cost with the hash algorithm). 2. Assemble them with dynamic programming.
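The two-step strategy can be sketched as follows. This is a simplified illustration with hypothetical names: the first step indexes every k-letter word of every segment in a hash table to find candidate pairs, and the second step, which in practice would be an approximate alignment by dynamic programming, is replaced here by an exact suffix-prefix check to keep the example short.

```python
from collections import defaultdict


def candidate_pairs(reads, k=3):
    """Pairs (i, j) such that the last k letters of reads[i] occur in reads[j]."""
    index = defaultdict(set)  # k-letter word -> reads that contain it
    for j, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(j)
    pairs = set()
    for i, read in enumerate(reads):
        if len(read) >= k:
            for j in index.get(read[-k:], ()):
                if i != j:
                    pairs.add((i, j))
    return pairs


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


reads = ["tacctt", "accttta", "tttaac"]
for i, j in sorted(candidate_pairs(reads)):
    print(reads[i], reads[j], overlap(reads[i], reads[j]))
```

Any exact overlap of at least k letters ends with the last k letters of the first segment, so the hash step never discards a true exact overlap of that length; it only removes pairs that cannot overlap.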

Shotgun Given the graph of the segments tacctt accttta tttaac taacga accgtacc acgatac accgt accg gataca tacaggt the assembled sequence is accgtacctttaacgatacaggt, but finding the Hamiltonian path has exponential cost!

Shotgun: New problems arise: • Consecutive repeats • Lack of coverage • …

Shotgun: properties of the coverage Given the coverage, some questions arise: • What is the percentage of coverage? • How many contigs should we expect? • What is the mean length of the contigs?

Shotgun: percentage of coverage Given the model: a sequence of length $L$ and $N$ segments of length $d$. We assume that the segments are randomly distributed. Degree of coverage: $N d / L$. The probability that a base is covered by $k$ segments is given by the binomial distribution $B(N, d/L)$: $\mathrm{Prob}\{X=k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$

Shotgun: percentage of coverage What is the limit of the binomial distribution as $n \to \infty$ and $p \to 0$ with $np = \lambda$? The Poisson distribution $P(\lambda)$: $\mathrm{Prob}\{X=k\} = e^{-\lambda} \lambda^k / k!$ Then the probability that at least one segment covers a base is $\mathrm{Prob}\{X>0\} = 1 - \mathrm{Prob}\{X=0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$. Then, with $N d / L = 4.6$ we obtain 99% coverage, and with $N d / L = 6.9$ we obtain 99.9% coverage.
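The coverage figures can be reproduced with a few lines of Python. The function names and the sample values of N, d and L below are illustrative assumptions; only the formulas come from the slides.

```python
from math import comb, exp, factorial


def binomial_pmf(k, N, p):
    """Probability that exactly k of N segments cover a base, each with probability p."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def poisson_pmf(k, lam):
    """Poisson approximation with lam = N * d / L."""
    return exp(-lam) * lam ** k / factorial(k)


def covered_fraction(N, d, L):
    """Expected fraction of bases covered by at least one segment."""
    return 1 - exp(-N * d / L)


# Hypothetical experiment: segments of length 500 on a sequence of length 100000.
for N in (920, 1380):
    print(N * 500 / 100000, covered_fraction(N, 500, 100000))

# The binomial and Poisson probabilities of an uncovered base are close.
print(binomial_pmf(0, 920, 500 / 100000), poisson_pmf(0, 4.6))
```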

Assembly of ESTs It is the same procedure as shotgun sequencing… …but with one great advantage: there are many graphs with a small number of nodes! Connect to http://alggen.lsi.upc.es and follow the links RESEARCH, ESSEM
