Sequence alignment algorithms

Presented By

Cary Miller

Sastry Akella

Daisuke Yasuda

Overview
  • Biological background / motivation / applications
  • Dot matrix / dynamic programming
  • FASTA / BLAST
Biology
  • Biomolecules are strings over a restricted alphabet
    • DNA: alphabet of size 4
    • Protein: alphabet of size 20
  • Proteins are the working part
Proteins
  • A protein is a linear sequence over an alphabet of 20 characters (amino acids)
  • Proteins do not maintain linearity
    • Folding happens
  • Folding determines overall 3-D shape
  • Shape determines function
Sequence => Structure => Function
  • Sequence alone does not readily reveal structure
  • Much less function
  • A sequence: ARTUVEDYERRWWUHUK…
Structure
  • Pic 1
  • Pic 2
Function
  • Protein A is a constituent of muscle, skin, cartilage, or …
  • Protein B catalyzes the transformation of glucose to fructose, or …
  • How do we find proteins with similar function?
Nature does not solve the same problem twice (usually)
  • A short sequence with a specific function (or shape) is called a domain
  • The same domain appears in multiple proteins
  • Finding the same domain in multiple proteins provides a clue to function and/or structure
Amino acids
  • Each has the same basic chemical configuration but has a functional group that makes it chemically unique
  • They occur in families
    • Some functional groups are similar
How biologists study proteins
  • Experimental methods are expensive (NMR, X-ray crystallography)
  • Discovery of function is difficult
  • Few proteins are understood in detail
  • Many are known by sequence
  • Sequence is easier to get than structure or function
A biological scenario
  • A biologist discovers the sequence of a new protein with unknown function
  • She has no idea of its function
  • If the sequence can be associated with a known protein sequence, we have a clue about structure and/or function
  • Most proteins have unknown function
Public databases
  • Vast quantities of sequence, structure, and function information are deposited in public databases
  • A new sequence should be compared to the database
Comparing sequences
  • Alignment with exact match
    Unaligned:
      ABCTUVAB
      UV
    Aligned:
      ABCTUVAB
      ----UV
Alignment with inexact match
  • Inexact
    Unaligned:
      GARUIPPRST
      GARVVBUIEEYST
    Aligned:
      GAR------UIPPRST
      GARVVBUIEEYST
Global vs. local alignment
  • ABQRTASGGBV
  • ABRRRASGVBB
  • ABQRTASGGBV
  • ABQ------SGGBV
A real alignment
  • Myoglobin
      PDLRKY FKG-A ENFTA DDVQ KSDR
      PDTKAY FPKFG DLSTA AALK SSPK
  • Homology: common ancestry
Scoring pairs of amino acids
  • For each amino acid pair, assign a score based on the frequency of substitution
      ATRGUVXQ
      ATRCVVXT
      ATRGVVEQ
      AT-----VVEQ
Substitution matrices
  • PAM and BLOSUM are standard substitution matrices
  • A scoring scheme also includes penalties for
    • Gap opening
    • Gap extension
Scoring amino acid strings
  • Sum the individual pair scores
  • The database is huge
    • A spurious match to a random sequence is likely
      • Try your name
    • The E-value is the number of matches with at least a given score expected to occur by chance
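
Summing pair scores over an alignment, with separate gap opening and gap extension penalties, can be sketched as follows. This is a minimal illustration only: the function names, the toy pair score, and the penalty values (-5 to open, -1 to extend) are assumptions chosen for the example, not values from a real substitution matrix.

```python
def score_alignment(a, b, pair_score, gap_open=-5, gap_extend=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    in_gap = False
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            # The first gap column pays the opening penalty, later ones the extension.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(p, q)
            in_gap = False
    return total


def toy_pair_score(p, q):
    # Hypothetical stand-in for a PAM/BLOSUM lookup.
    return 2 if p == q else -1


example_score = score_alignment("ABCTUVAB", "----UV--", toy_pair_score)
print(example_score)
```

In practice `pair_score` would look the pair up in a substitution matrix such as PAM or BLOSUM.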
Alignment algorithms
  • Dot matrix
  • Dynamic programming
  • FASTA
  • BLAST
Dot Matrix
  • A dot matrix locates regions of similarity between two DNA or protein sequences; such regions provide a great deal of information about the function and structure of the query sequence.
  • Sequence similarity suggests homology (common evolutionary origin), which provides critical information about the functions of these sequences.
Dot Matrix (contd.)
  • A dot matrix plot is a method of comparing two sequences that provides a picture of the homology between them.
  • To create the plot, designate one sequence as the subject and place it on the horizontal axis, then designate the second sequence as the query and place it on the vertical axis of the matrix.
Dot Matrix (contd.)
  • At each position within the matrix, a point is plotted if the horizontal and vertical elements are identical.
  • Diagonal lines within the resulting matrix indicate regions of similarity. A simple dot matrix plot is shown in Figure A.
[Figure A: dot matrix plot of BASKETBALL against BASEBALL; each * marks a pair of identical characters]
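
The plotting rule (a point wherever the horizontal and vertical elements are identical) can be sketched in a few lines. This is an illustrative sketch; the function name `dot_matrix` and the text rendering are assumptions, not part of the original slides.

```python
def dot_matrix(subject: str, query: str) -> list:
    """One text row per query character; '*' marks identical characters."""
    return [
        "".join("*" if q == s else " " for s in subject)
        for q in query
    ]


subject = "BASKETBALL"  # horizontal axis
query = "BASEBALL"      # vertical axis
rows = dot_matrix(subject, query)

print("  " + subject)
for q, row in zip(query, rows):
    print(q + " " + row)
```

The shared prefix BAS and the shared suffix BALL show up as two diagonal runs of stars.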

Dot Matrix with noise reduction
  • A certain percentage of the matches between sequence elements can be expected to arise purely by chance. These random matches are considered "noise" and are filtered out to enhance the diagonal lines.
Dot Matrix
  • Noise Reduction

a) Noise reduction in a dot matrix can be done by centering a window (a substring of the query sequence) over each element of the subject sequence and counting the corresponding (matching) elements within this "window".

b) If the number of corresponding elements exceeds a specified threshold, a point is plotted for the center element. This is demonstrated in Figure B.
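
The window filter in a) and b) can be sketched as below. The window size of 3 and the threshold of 1 are arbitrary illustrative values, and the function name is hypothetical; real tools expose both as parameters.

```python
def filtered_dot_matrix(subject, query, window=3, threshold=1):
    """Plot (i, j) only if more than `threshold` characters match
    inside a diagonal window centered on query[i] / subject[j]."""
    half = window // 2
    plot = [[False] * len(subject) for _ in query]
    for i in range(len(query)):
        for j in range(len(subject)):
            matches = 0
            for k in range(-half, half + 1):
                qi, sj = i + k, j + k
                in_range = 0 <= qi < len(query) and 0 <= sj < len(subject)
                if in_range and query[qi] == subject[sj]:
                    matches += 1
            plot[i][j] = matches > threshold
    return plot


plot = filtered_dot_matrix("BASKETBALL", "BASEBALL")
for q, row in zip("BASEBALL", plot):
    print(q, "".join("*" if cell else " " for cell in row))
```

With these settings the isolated E/E match disappears, while the BAS and BALL diagonals survive.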

Dot Matrix
  • Advantages: Readily reveals the presence of insertions/deletions and of direct and inverted repeats, which are more difficult to find with the other, more automated methods.
  • Disadvantages: Most dot matrix computer programs do not show an actual alignment, and the method does not return a score indicating how "optimal" a given alignment is.
Dynamic Programming
  • Dynamic programming (DP) algorithms are a general class of algorithms typically applied to optimization problems.
  • For DP to be applicable, an optimization problem must have two key ingredients:
  • a) Optimal substructure: an optimal solution to the problem contains within it optimal solutions to sub-problems.
  • b) Overlapping sub-problems: the same sub-problems recur as the larger problem is solved, so their solutions can be reused.

Dynamic Programming
  • DP solves every sub-problem just once and saves its answer in a table, thereby avoiding the work of recomputing the answer every time the sub-problem is encountered. Each intermediate answer is stored with a score, and DP finally chooses the sequence of solutions that yields the highest score.
Dynamic Programming
  • Both global and local alignments can be produced by simple changes to the basic DP algorithm.
  • Alignments depend on the choice of scoring system for comparing character pairs (e.g. the PAM and BLOSUM matrices covered earlier) and on the gap penalties.

Scoring functions, example:

w(match) = +2, or taken from a substitution matrix

w(mismatch) = -1, or taken from a substitution matrix

w(gap) = -3

Dynamic Programming
  • Global Alignment (Needleman-Wunsch)

a) The general goal is to obtain an optimal global alignment between two sequences, allowing gaps.

b) We construct a matrix F indexed by i and j, one index for each sequence, where the value F(i,j) is the score of the best alignment between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj. … We begin by initializing F(0,0) = 0. We then proceed to fill the matrix from top left to bottom right. If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, it is possible to calculate F(i,j).

Dynamic Programming

\[
F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}
\]

where s(a,b) is the likelihood score that residues a and b occur as an aligned pair, and d is the gap penalty.

  • Once the matrix is constructed, trace back the path that leads to F(n,m), which is by definition the best score for an alignment of x1…n to y1…m.
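
A compact sketch of the global recurrence and trace-back follows. It assumes the example scores from the earlier slide (match +2, mismatch -1, gap d = 3) in place of a substitution matrix, and it uses the standard boundary conditions F(i,0) = -i·d and F(0,j) = -j·d; the function name is illustrative.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # x prefix aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * d          # y prefix aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Trace back from F(n, m) to F(0, 0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


result = needleman_wunsch("ABCTUVAB", "UV")
print(result)
```

When several alignments tie for the best score, this trace-back prefers the diagonal move; other tie-breaking rules are equally valid.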
Dynamic Programming
  • Global Dynamic programming matrix
Dynamic Programming
  • Local alignment (Smith-Waterman)

Two changes from global alignment:

1. F(i,j) may take the value 0 if all other options have a value less than 0. This corresponds to starting a new alignment.

2. Alignments can end anywhere in the matrix, so instead of taking the value in the bottom right corner, F(n,m), as the best score, we look for the highest value of F(i,j) over the whole matrix and start the trace-back from there.

\[
F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}
\]
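
The two changes can be sketched as follows, again assuming simple example scores (match +2, mismatch -1, gap d = 3) rather than a substitution matrix; the function name is illustrative.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=3):
    """Return (best_score, aligned_x, aligned_y) for a local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Change 1: the score never drops below 0.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Change 2: trace back from the highest cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


local = smith_waterman("ABCTUVAB", "XXUVYY")
print(local)
```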

Dynamic Programming
  • Local Dynamic programming matrix
Dynamic Programming
  • Advantages: Guaranteed in a mathematical sense to provide the optimal (highest-scoring) alignment for a given set of scoring functions.
  • Disadvantages:

a) Slow due to the very large number of computational steps: O(n^2) for two sequences of length n (O(nm) for lengths n and m).

b) Computer memory requirement also increases as the square of the sequence lengths.

Therefore, it is difficult to use the method for very long sequences.

FASTA - Idea -
  • Problem with dynamic programming

DP computes scores over a large area of the matrix that does not contribute to the optimal alignment.

      G  A  A  T  T  C  A  G  T  T  A
  G   1  1  1  1  1  1  1  1  1  1  1
  G   1  1  1  1  1  1  1  2  2  2  2
  A   1  2  2  2  2  2  2  2  2  2  2
  T   1  2  2  3  3  3  3  3  3  3  3
  C   1  2  2  3  3  4  4  4  4  4  4
  G   1  2  2  3  3  4  4  5  5  5  5
  A   1  2  3  3  3  4  5  5  5  5  6

  • FASTA focuses on the area around the diagonal
FASTA - Heuristic -
  • Heuristic

A good local alignment should contain some exactly matching subsequences.

FASTA focuses on these areas.

FASTA - High-Level Algorithm -

High-level algorithm

Let q be a query

max ← 0

For each sequence s in DB

    compare q with s and compute a score y

    if max < y

        max ← y

        bestSequence ← s

Return bestSequence
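
The search loop above translates almost line for line into Python. In this sketch the scoring function is passed in as a parameter; `shared_kmers` is a hypothetical stand-in that simply counts shared words of length k, not the actual FASTA score.

```python
def best_match(query, database, score):
    """Return (best_sequence, best_score) over all database sequences."""
    best_score = 0
    best_sequence = None
    for s in database:
        y = score(query, s)
        if best_score < y:
            best_score, best_sequence = y, s
    return best_sequence, best_score


def shared_kmers(q, s, k=2):
    # Hypothetical toy score: number of length-k words of s that occur in q.
    words = {q[i:i + k] for i in range(len(q) - k + 1)}
    return sum(1 for j in range(len(s) - k + 1) if s[j:j + k] in words)


db = ["CCCCCC", "GGATCGA", "GAATTC"]
hit = best_match("GAATTCAGTTA", db, shared_kmers)
print(hit)
```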

FASTA - Algorithm -
  • Step 1

Find all hot spots

// A hot spot is a pair of words of length k, one from each sequence, that match exactly

[Figure: hot spots linking Sequence 1 and Sequence 2]

FASTA - Algorithm -
  • Step 1 in detail

Use a look-up table

Query:    G A A T T C A G T T A

Sequence: G G A T C G A

Look-up table

  Q   Location
  A   2, 3, 7, 11
  C   6
  G   1, 8
  T   4, 5, 9, 10

[Figure: dot matrix of the query against the sequence; each * marks a matching character]
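
Building the look-up table and using it to list hot spots can be sketched as follows. Positions are 1-based to match the table on the slide; the function names and the `k` parameter default are illustrative assumptions.

```python
from collections import defaultdict


def build_lookup(query, k=1):
    """Map each length-k word of the query to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return dict(table)


def hot_spots(query, sequence, k=1):
    """List (query_pos, sequence_pos) pairs whose length-k words match exactly."""
    table = build_lookup(query, k)
    spots = []
    for j in range(len(sequence) - k + 1):
        for i in table.get(sequence[j:j + k], []):
            spots.append((i, j + 1))
    return spots


lookup = build_lookup("GAATTCAGTTA")
print(lookup)
print(hot_spots("GAATTCAGTTA", "GGATCGA", k=2))
```

The table is built once per query, so scanning a database sequence costs one dictionary lookup per word rather than a comparison against every query position.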

FASTA - Algorithm -
  • Step 2

Score the hot spots and locate the ten best diagonal runs.

// Hot spots are scored with a scoring system, e.g. PAM250
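
Hot spots that lie on the same diagonal share the same offset between query position and sequence position, so diagonals can be ranked by grouping on that offset. The sketch below is a simplification: it scores a diagonal by its number of hot spots, whereas FASTA scores runs with a substitution matrix such as PAM250. The function name is hypothetical.

```python
from collections import Counter


def best_diagonals(query, sequence, k=2, top=10):
    """Return up to `top` (diagonal, hot_spot_count) pairs, best first.

    The diagonal of a hot spot is query_index - sequence_index (0-based).
    """
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)

    counts = Counter()
    for j in range(len(sequence) - k + 1):
        for i in positions.get(sequence[j:j + k], []):
            counts[i - j] += 1
    return counts.most_common(top)


diagonals = best_diagonals("ACGTTGCA", "GGACGTT")
print(diagonals)
```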

FASTA - Algorithm -
  • Step 3

Combine sub-alignments into one alignment, joined by gaps

[Figure: sub-alignments joined by a gap to form one local alignment]

FASTA - Algorithm -
  • Step 4

# Consider a weighted directed graph.

# Let each node be a sub-alignment found in the earlier steps.

# Let u and v be nodes.

# Edge (u,v) exists if sub-alignment u comes before sub-alignment v in the sequence.

# Each edge carries a gap penalty (negative weight).

# Find the maximum-weight path.

[Figure: sub-sequences (nodes) joined by edges along one sequence]
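
Because edges only run from earlier to later sub-alignments, the graph is acyclic and the maximum-weight path can be found with a simple DP over the nodes. In this sketch each sub-alignment is a hypothetical tuple (q_start, q_end, s_start, s_end, score) with half-open, non-empty ranges, and every edge carries the same constant gap penalty; both are assumptions made for illustration.

```python
def best_chain(runs, gap_penalty=3):
    """Return (weight, node_indices) of the maximum-weight path.

    Expects a non-empty list of (q_start, q_end, s_start, s_end, score).
    An edge u -> v exists when u ends before v starts in both sequences.
    """
    order = sorted(range(len(runs)), key=lambda idx: runs[idx][0])
    best = [r[4] for r in runs]      # best path weight ending at each node
    prev = [None] * len(runs)
    for a, v in enumerate(order):
        for u in order[:a]:
            if runs[u][1] <= runs[v][0] and runs[u][3] <= runs[v][2]:
                candidate = best[u] + runs[v][4] - gap_penalty
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u

    end = max(range(len(runs)), key=best.__getitem__)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


runs = [(0, 4, 0, 4, 10), (6, 9, 5, 8, 7), (2, 5, 10, 13, 6), (10, 14, 9, 13, 8)]
chain = best_chain(runs)
print(chain)
```

With r sub-alignments the double loop does O(r^2) work, which is small when r is about 10.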

FASTA - Algorithm -
  • Step 4 in detail

[Figure: sub-alignments along one sequence, joined by gap edges weighted -5, -3 and -3; the maximum-weight path is marked]

FASTA - Algorithm -
  • Step 5

Run dynamic programming in a restricted band around the best-scoring alignment to look for an alignment that scores higher than it.

The width of this band is a parameter.
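
Restricting DP to a band can be sketched as a local alignment that only fills cells within `width` of a chosen diagonal. The function name, the parameters `center` and `width`, and the simple match/mismatch scores are illustrative assumptions; cells outside the band are treated as 0, which is equivalent to starting a new alignment there.

```python
def banded_local_score(x, y, center=0, width=2, match=2, mismatch=-1, d=3):
    """Best local alignment score using only cells with |(i - j) - center| <= width."""
    F = {}          # only cells inside the band are stored
    best = 0
    for i in range(1, len(x) + 1):
        lo = max(1, i - center - width)
        hi = min(len(y), i - center + width)
        for j in range(lo, hi + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i, j] = max(
                0,
                F.get((i - 1, j - 1), 0) + s,
                F.get((i - 1, j), 0) - d,
                F.get((i, j - 1), 0) - d,
            )
            best = max(best, F[i, j])
    return best


on_band = banded_local_score("TTGATTACA", "GATTACA", center=2, width=1)
off_band = banded_local_score("TTGATTACA", "GATTACA", center=0, width=1)
print(on_band, off_band)
```

The work is proportional to the sequence length times the band width, so a band centered on the wrong diagonal is cheap but can miss the best alignment.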

FASTA - Algorithm -
  • Summary of Algorithm

1: Find all hot spots

// A hot spot is a pair of words of length k that match exactly

2: Score the hot spots and locate the ten best diagonal runs.

3: Combine sub-alignments into one alignment

4: Score each alignment with gap penalties and pick the best-scoring alignment

5: Run dynamic programming in a restricted band around the best-scoring alignment to look for an alignment that scores higher than it.

FASTA - Complexity -
  • Complexity

# Steps 1 and 2 // select the 10 best diagonal runs

Let n be the length of a sequence from the DB and m the length of the query

O(n), because Step 1 only looks words up in the table

O(n) << O(mn), with m, n = 100 to 200

FASTA - Complexity -

# Steps 3 and 4 // compute the maximum-weight path

Let r be the number of sub-alignments (r = 10).

Let s be the number of edges.

O(r^2) < O(m*n)

[Figure: matrix with rows and columns n1, n2, n3]

→ About 1% of the DP cost, because r^2 = 10^2 and m*n >= 10^4

[Figure: graph with positive node weights and gap edges weighted -5, -3 and -3; the maximum-weight path is marked]

FASTA - Complexity -

# Step 5 // compute partial DP

The cost depends on the size of the restricted area, which is less than O(mn)

Therefore, FASTA is faster than DP.

The width of this band is a parameter.

BLAST - Heuristic -
  • Another heuristic algorithm
  • Heuristic, but evaluates the result statistically.

Homologous sequences are likely to contain a short, high-scoring word pair: a hit.

BLAST tries to extend the hit on both sides to obtain an optimal alignment.

[Figure: a short high-scoring word within the sequence A T T A G …]

BLAST - Algorithm -
  • Step 1: Preprocess the query

Compile the list of short, high-scoring words from the query.

The query word length, w, is 3 for BLOSUM scoring

The threshold T is 13

BLAST - Algorithm -
  • Step 1-2

Create the neighborhood words for each query word

[Figure: a query word and its neighborhood words]
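
Generating neighborhood words amounts to enumerating every word of the same length and keeping those that score at least T against the query word. The sketch below uses a made-up toy scoring function (identical residues 5, two hydrophobic residues from I/L/V 3, anything else -1) and a four-letter alphabet purely for illustration; BLAST itself uses a substitution matrix such as BLOSUM over all 20 amino acids.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) over `alphabet` scoring >= threshold against `word`."""
    return sorted(
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, candidate)) >= threshold
    )


HYDROPHOBIC = set("ILV")


def toy_score(a, b):
    # Hypothetical scores, not taken from BLOSUM or PAM.
    if a == b:
        return 5
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 3
    return -1


neighbors = neighborhood("LAV", "AILV", toy_score, threshold=13)
print(neighbors)
```

Brute-force enumeration visits 20^3 = 8000 candidates for w = 3 with the full amino acid alphabet, which is small enough to do once per query word.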

BLAST - Algorithm -
  • Step 2: Scan the DB

For each word list, identify all exact matches with DB sequences

[Figure: Step 1 maps each query word to its neighborhood word list; Step 2 matches the list against the sequences in the DB (Sequence 1, Sequence 2)]

The purpose of Steps 1 and 2 is the same as in FASTA

BLAST - Algorithm -
  • Step 2-2

Method 1: Hash table

Query: LAALLNKCKTPQGQRLVNQWIKQPLMD

[Figure: the word list from the query stored in a hash table]

BLAST - Algorithm -
  • Step 2-3

Method 2: Finite automaton

[Figure: finite automaton; labels: S, A,G, L, A, G, A, A, A, I]

BLAST - Algorithm -
  • Step 3 (Search for the optimal alignment)

Let S be the score of a hit word.

For each hit word, extend the ungapped alignment in both directions.

  • Step 4 (Evaluate the alignment statistically)

The extended hit is called a High-scoring Segment Pair (HSP). BLAST returns an HSP when its E-value (which depends on the score S) is below a threshold.

E-value = the number of HSPs having score S (or higher) expected to occur by chance alone.

→ Smaller E-value: more statistically significant

→ Larger E-value: more likely a chance match

[Figure: a hit word within the sequence A T T A G …]
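
Ungapped extension of a hit in both directions can be sketched as below. Note the swapped-in stopping rule: this sketch stops when the running score falls more than `x_drop` below the best score seen so far (the usual drop-off rule), rather than testing an E-value during extension. The function name, parameter names and the match/mismatch scores are illustrative assumptions.

```python
def extend_hit(q, s, qi, si, w, x_drop=5, match=2, mismatch=-1):
    """Extend the seed q[qi:qi+w] / s[si:si+w] without gaps.

    Returns (score, (q_start, q_end), (s_start, s_end)) with half-open ranges.
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = sum(pair(q[qi + k], s[si + k]) for k in range(w))

    # Extend to the right of the seed.
    run, right, k = best, 0, 0
    while qi + w + k < len(q) and si + w + k < len(s):
        run += pair(q[qi + w + k], s[si + w + k])
        k += 1
        if run > best:
            best, right = run, k
        elif best - run > x_drop:
            break

    # Extend to the left of the seed, starting from the best right extension.
    run, left, k = best, 0, 0
    while qi - 1 - k >= 0 and si - 1 - k >= 0:
        run += pair(q[qi - 1 - k], s[si - 1 - k])
        k += 1
        if run > best:
            best, left = run, k
        elif best - run > x_drop:
            break

    return best, (qi - left, qi + w + right), (si - left, si + w + right)


hsp = extend_hit("CCGATTACATT", "AAGATTACAGG", qi=4, si=4, w=3)
print(hsp)
```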

BLAST - Algorithm -
  • Step 3-2

Definition of E-value

The expected number of HSPs with score at least S is:

\[ E = K \, n \, m \, e^{-\lambda S} \]

K and λ are constants that depend on the scoring model

n and m are the lengths of the query and the sequence

The probability of finding at least one such HSP is:

\[ P = 1 - e^{-E} \]

→ If a hit is likely to arise by chance (larger E-value), P becomes larger and approaches 1.
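
The two formulas are easy to evaluate directly. In the sketch below the values K = 0.1 and λ = 0.3 are placeholders chosen only to make the example run; real values depend on the scoring system. The query length and the sequence length of 300 are likewise illustrative.

```python
import math


def e_value(score, n, m, K, lam):
    """Expected number of HSPs with score >= `score`: E = K * n * m * exp(-lam * S)."""
    return K * n * m * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such HSP: P = 1 - exp(-E)."""
    return -math.expm1(-e)   # numerically stable form of 1 - exp(-e)


# Placeholder constants, not taken from any real scoring system.
K, LAM = 0.1, 0.3
for s in (20, 40, 60):
    e = e_value(s, n=153, m=300, K=K, lam=LAM)
    print(s, e, p_value(e))
```

For small E the two quantities are nearly equal (P ≈ E), while for large E the probability saturates at 1.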

BLAST - Running Time -
  • Running Time

Length of query: 153

DB size: 5997 sequences

PC: Pentium 4

  Algorithm   Running Time
  DP          16.989 s
  FASTA       0.618 s
  BLAST       0.118 s

By Dr. Takeshi Kawabata

Nara Sentan Gijyutu University

Comparison of Algorithms
  • Dynamic Programming

1. Most sensitive result

→ DP uses all the information in the two sequences

2. Running time is slow

→ DP computes scores in areas that do not contribute to the optimal alignment.

Comparison of Algorithms
  • FASTA

1. Less sensitive than DP and BLAST

→ FASTA uses partial information to speed up the computation.

→ FASTA does not evaluate the result statistically.

2. Running time is faster than DP

→ For the same reason as above.

Comparison of Algorithms
  • BLAST

1. More sensitive than FASTA

→ BLAST evaluates the result statistically.

2. Faster than FASTA

→ Because BLAST evaluates the entire DB with the same statistically based threshold, it eliminates noise and reduces the running time.

FASTA vs BLAST

BLAST

Compares the query with the sequences in the DB using the same threshold.

FASTA

Compares the query with the sequences one by one, then compares the individual results.

[Figure: the query compared against the DB under each approach]

Conclusion

  Algorithm   Sensitivity   Running Time
  DP          1             3
  FASTA       3             2
  BLAST       2             1

Ranks: 1 = best.