# Module A: Fundamental Algorithms in Sequence Analysis

Section 1: Sequence Alignments

Srinivas Aluru

"Biology easily has 500 years of exciting problems to work on."
- Donald E. Knuth

Biological Data

DNA:

• Self-replicating
• Codes for proteins

Proteins:

• Perform most functions in living organisms

BBSI Summer School - Iowa State University

DNA: Sequence of nucleotides

Nucleotide: Deoxyribose sugar + Phosphate + Base

Nucleotides: A, T, G, and C

[Figure: chemical structure of a nucleotide, with the sugar carbons labeled 1’ through 5’.]

[Figure: double-stranded DNA. One strand (A C G) runs 5’ to 3’; the paired strand (T G C) runs 3’ to 5’. Phosphate groups (P) link the nucleotides of each strand.]

For computational purposes,

DNA = A sequence over alphabet {A,C,G,T}

5’ A T T C G G G A A T G C A T G C C A 3’

3’ T A A G C C C T T A C G T A C G G T 5’
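The second strand above is determined by the first through base pairing (A with T, C with G). A minimal Python sketch of that relationship follows; the function names `complement` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ATTCGGGAATGCATGCCA"       # 5' to 3', as on the slide
print(complement(top))           # the 3' to 5' strand drawn beneath it
print(reverse_complement(top))   # the same strand read 5' to 3'
```

Storing only one strand is usually enough for computation, since the other can be regenerated this way.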


Proteins: Chains of amino acid residues.

There are 20 different amino acids.

Functions:

• Tissue building blocks (structural proteins)

• Catalysts (enzymes)

• Oxygen transport

• Antibody defense


Example

RNA:

AUG GGA GAG CUA UGA

Protein:

Met Gly Glu Leu STOP
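The example reads the RNA three bases (one codon) at a time. A small Python sketch of that lookup is below. The codon table is deliberately partial: it holds only the five codons from this example, so a real translator would need all 64 entries. The names `CODON_TABLE` and `translate` are illustrative.

```python
# Partial codon table: only the codons used in the slide's example.
CODON_TABLE = {
    "AUG": "Met",
    "GGA": "Gly",
    "GAG": "Glu",
    "CUA": "Leu",
    "UGA": "STOP",
}


def translate(rna: str) -> list[str]:
    """Translate an RNA string codon by codon, stopping at the first STOP."""
    rna = rna.replace(" ", "")
    residues = []
    for i in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[i:i + 3]]   # KeyError for codons not in the partial table
        residues.append(residue)
        if residue == "STOP":
            break
    return residues


print(translate("AUG GGA GAG CUA UGA"))
```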


Challenges in Computational Biology
• Obtain the genome of an organism.
• Identify and annotate genes.
• Find the sequences, three-dimensional structures, and functions of proteins.
• Find sequences of proteins that have desired three-dimensional structures.
• Compare DNA sequences and protein sequences for similarity.
• Study the evolution of sequences and species.


Sequence Comparison Caveats

Magenta regions are structurally equivalent with enterotoxin (top left).

http://www.sbg.bio.ic.ac.uk/AH/explanation.html


Pairwise Sequence Alignment

Problem: Find similarity between two sequences.

Variations:

• Given two sequences, find if parts of them are similar (local alignment).
• Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.


Alignments
• Show one sequence placed above another such that similarity is revealed

Example:

A: C A T - T C A - C

B: C - T C G C A G C

Measuring Similarity

Score: A measure of alignment quality

C A T - T C A - C

C - T C G C A G C

--------------------------------

10 -5 10 -5 -2 10 10 -5 10

Total = 33
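The column scores above are consistent with +10 for a match, -2 for a mismatch, and -5 for a gap, the scheme used in the later table slides. A short Python sketch that recomputes the total under that assumption (the function name `score_alignment` is illustrative):

```python
def score_alignment(a: str, b: str, match: int = 10, mismatch: int = -2, gap: int = -5) -> int:
    """Sum the column scores of two aligned rows of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("CAT-TCA-C", "C-TCGCAGC"))  # 33
```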


Pairwise Global Alignment

T[i,j] = Score of optimally aligning the first i bases of s with the first j bases of t.

Calculating Alignments

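Case 1: Match s[i] with t[j]. The last column pairs s[i] with t[j]; the columns before it align the first i - 1 bases of s with the first j - 1 bases of t.

s: C A T T C A C

t: C - T T C A G

Case 2: Match t[j] with a gap. The last column pairs a gap with t[j]; the columns before it align the first i bases of s with the first j - 1 bases of t.

s: C A T T C A C -

t: C - T T C A - G

Case 3: Match s[i] with a gap. The last column pairs s[i] with a gap; the columns before it align the first i - 1 bases of s with the first j bases of t.

s: C A T T C A - C

t: C - T T C A G -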
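Taken together, the three cases lead to the usual dynamic programming recurrence. The form below is a sketch: the symbols $\sigma$ (score for pairing two bases) and $\gamma$ (score for pairing a base with a gap, a negative number such as -5) are introduced here for illustration and do not appear on the slides.

```latex
T[i,j] = \max
\begin{cases}
T[i-1,\,j-1] + \sigma(s[i],\,t[j]) & \text{Case 1: match } s[i] \text{ with } t[j] \\
T[i,\,j-1] + \gamma & \text{Case 2: match } t[j] \text{ with a gap} \\
T[i-1,\,j] + \gamma & \text{Case 3: match } s[i] \text{ with a gap}
\end{cases}
\qquad
T[0,0] = 0,\quad T[i,0] = i\,\gamma,\quad T[0,j] = j\,\gamma
```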


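Table for s = CATTCAC (rows) and t = CTCGCAGC (columns), with the first row and column initialized and the first two entries of row C filled in:

|   | λ | C | T | C | G | C | A | G | C |
|---|---|---|---|---|---|---|---|---|---|
| λ | 0 | -5 | -10 | -15 | -20 | -25 | -30 | -35 | -40 |
| C | -5 | 10 | 5 | | | | | | |
| A | -10 | | | | | | | | |
| T | -15 | | | | | | | | |
| T | -20 | | | | | | | | |
| C | -25 | | | | | | | | |
| A | -30 | | | | | | | | |
| C | -35 | | | | | | | | |

+10 for match, -2 for mismatch, -5 for gap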


[Figure: dynamic programming table for CATTCAC (rows) and CTCGCAGC (columns), with the traceback paths marked.]

Traceback yields both optimal alignments in this example
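The table fill and traceback can be sketched in Python as follows. This is a minimal illustration using the +10/-2/-5 scheme from the table slide, not the presenter's own code; the name `global_align` is hypothetical. Ties are broken in a fixed order, so it returns only one optimal alignment even when several exist.

```python
def global_align(s, t, match=10, mismatch=-2, gap=-5):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    # T[i][j] = best score for aligning the first i bases of s with the first j of t.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * gap
    for j in range(1, n + 1):
        T[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub,  # s[i] with t[j]
                          T[i][j - 1] + gap,      # t[j] with a gap
                          T[i - 1][j] + gap)      # s[i] with a gap

    # Walk back from the bottom-right corner, re-deriving which case was used.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sub:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif j > 0 and T[i][j] == T[i][j - 1] + gap:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
        else:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
    return T[m][n], "".join(reversed(a)), "".join(reversed(b))


score, a, b = global_align("CATTCAC", "CTCGCAGC")
print(score)  # 33
print(a)
print(b)
```

To enumerate every optimal alignment, the traceback would need to branch wherever more than one of the three cases reproduces T[i][j].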


End-gap free alignment
• We often don’t want to penalize gaps at the start or end of the alignment, especially when comparing short and long sequences
• Same as global alignment, except:
    • Initialize with zeros (free gaps at start)
    • Locate max in the last row/column (free gaps at end)


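|   | λ | C | T | C | G | C | A | G | C |
|---|---|---|---|---|---|---|---|---|---|
| λ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 10 | 5 | 10 | 5 | 10 | 5 | 0 | 10 |
| A | 0 | 5 | 8 | 5 | 8 | 5 | 20 | 15 | 10 |
| T | 0 | 0 | 15 | 10 | 5 | 6 | 15 | 18 | 13 |
| T | 0 | -2 | 10 | 13 | 8 | 3 | 10 | 13 | 16 |
| C | 0 | 10 | 5 | 20 | 15 | 18 | 13 | 8 | 23 |
| A | 0 | 5 | 8 | 15 | 18 | 13 | 28 | 23 | 18 |
| G | 0 | 0 | 3 | 10 | 25 | 20 | 23 | 38 | 33 |

+10 for match, -2 for mismatch, -5 for gap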
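The two changes relative to global alignment (zero initialization, and taking the maximum over the last row and column) can be sketched in Python as below. The function name `end_gap_free_score` is illustrative, and the sketch returns only the score, not the alignment.

```python
def end_gap_free_score(s, t, match=10, mismatch=-2, gap=-5):
    """Best alignment score when gaps at either end of either sequence are free."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at zero: leading gaps cost nothing.
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + sub,
                          T[i][j - 1] + gap,
                          T[i - 1][j] + gap)
    # Trailing gaps are free too, so the answer may sit anywhere
    # in the last row or the last column.
    best_in_last_row = max(T[m])
    best_in_last_col = max(row[n] for row in T)
    return max(best_in_last_row, best_in_last_col)


print(end_gap_free_score("CATTCAG", "CTCGCAGC"))  # 38
```

For CATTCAG against CTCGCAGC the maximum, 38, sits in the last row under the second-to-last column, so the final C of the second sequence is left unaligned at no cost.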


Local Alignment

T[i,j] = Score of optimally aligning a suffix of the first i bases of s with a suffix of the first j bases of t.

Initialize top row and leftmost column to zero.
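A minimal Python sketch of this table follows. It uses the standard Smith-Waterman ingredient that the slide text does not spell out: every cell may also take the value 0, which corresponds to aligning two empty suffixes. The default scores (+1, -1, -5) follow the local alignment slide; the function name `local_align_score` and the example sequences are assumptions made for illustration.

```python
def local_align_score(s, t, match=1, mismatch=-1, gap=-5):
    """Return (best score, (i, j)) where (i, j) is the cell in which the best local alignment ends."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # top row and left column stay zero
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            T[i][j] = max(0,                       # start a fresh alignment here
                          T[i - 1][j - 1] + sub,
                          T[i][j - 1] + gap,
                          T[i - 1][j] + gap)
            if T[i][j] > best:
                best, best_cell = T[i][j], (i, j)
    return best, best_cell


print(local_align_score("AAAGGG", "TTGGGTT"))  # (3, (6, 5)): the shared GGG
```

The aligned substrings themselves can be recovered by tracing back from the best cell until a cell with value 0 is reached.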


[Figure: local alignment table; the rows are labeled λ C A T T C A C.]

+1 for a match, -1 for a mismatch, -5 for a gap

Some Results
• Most pairwise sequence alignment problems can be solved in O(mn) time.
• The space requirement can be reduced to O(m+n) without increasing the asymptotic run-time [Myers88].
• Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].


Reducing space requirements
• O(mn) tables are often the limiting factor in computing large alignments
• There is a linear space technique that only doubles the time required [Hirschberg77]


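|   | λ | C | T | C | G | C | A | G | C |
|---|---|---|---|---|---|---|---|---|---|
| λ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 10 | 5 | 10 | 5 | 10 | 5 | 0 | 10 |
| A | 0 | 5 | 8 | 5 | 8 | 5 | 20 | 15 | 10 |
| T | | | | | | | | | |
| T | | | | | | | | | |
| C | | | | | | | | | |
| A | | | | | | | | | |
| G | | | | | | | | | |

IDEA: We only need the previous row to calculate the next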
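The idea can be sketched in Python by keeping just two rows. The sketch below computes the global alignment score with the +10/-2/-5 scheme; that choice, and the name `global_score_two_rows`, are assumptions for illustration. It yields only the score, since the discarded rows are exactly what a traceback would need.

```python
def global_score_two_rows(s, t, match=10, mismatch=-2, gap=-5):
    """Global alignment score using O(len(t)) space instead of a full table."""
    prev = [j * gap for j in range(len(t) + 1)]   # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)           # column 0, then cells to fill
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,      # diagonal, from the previous row
                          prev[j] + gap,          # directly above
                          curr[j - 1] + gap)      # to the left, in the current row
        prev = curr                               # older rows are no longer needed
    return prev[-1]


print(global_score_two_rows("CATTCAC", "CTCGCAGC"))  # 33
```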


Linear-space Alignments

mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn
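The geometric series reflects a divide-and-conquer scheme: one pass over the full table (cost mn) finds where an optimal alignment crosses the middle row, and the two remaining subproblems together cover about half the table. Below is a Python sketch in the style of Hirschberg's technique, using the +10/-2/-5 scheme. It is an illustration under those assumptions rather than the exact algorithm of the cited papers; `last_row`, `hirschberg`, and `score` are hypothetical names.

```python
def last_row(s, t, match=10, mismatch=-2, gap=-5):
    """Global scores of s against every prefix of t, keeping only one row at a time."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev


def hirschberg(s, t, match=10, mismatch=-2, gap=-5):
    """Return one optimal global alignment (aligned_s, aligned_t) in linear space."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case: pair the single base with a matching base of t if one
        # exists (else with t[0]), unless two gaps would score better.
        j = max(t.find(s), 0)
        sub = match if t[j] == s else mismatch
        if sub >= 2 * gap:
            return "-" * j + s + "-" * (len(t) - j - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2
    # Scores of the top half against prefixes of t, and of the bottom half
    # against suffixes of t (computed on the reversed strings).
    left = last_row(s[:mid], t, match, mismatch, gap)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, gap)
    n = len(t)
    k = max(range(n + 1), key=lambda j: left[j] + right[n - j])  # best split of t
    a1, b1 = hirschberg(s[:mid], t[:k], match, mismatch, gap)
    a2, b2 = hirschberg(s[mid:], t[k:], match, mismatch, gap)
    return a1 + a2, b1 + b2


def score(a, b, match=10, mismatch=-2, gap=-5):
    """Score two aligned rows column by column."""
    return sum(gap if "-" in (x, y) else match if x == y else mismatch
               for x, y in zip(a, b))


a, b = hirschberg("CATTCAC", "CTCGCAGC")
print(a)
print(b)
print(score(a, b))  # 33
```

The string slices keep the sketch short; an implementation tuned for large inputs would pass index ranges instead of copying substrings.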


Affine Gap Penalty Functions

Gap penalty = h + gk

where

k = length of a maximal sequence of gaps

h = gap opening penalty

g = gap continuation penalty
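As a small worked example of the formula, the Python sketch below sums h + gk over every maximal run of gaps in one aligned row. The values h = 10 and g = 1 are arbitrary illustrations, and the name `affine_gap_penalty` is hypothetical.

```python
from itertools import groupby


def affine_gap_penalty(row: str, h: int, g: int) -> int:
    """Sum h + g*k over every maximal run of k gap characters in an aligned row."""
    total = 0
    for is_gap, run in groupby(row, key=lambda c: c == "-"):
        if is_gap:
            k = sum(1 for _ in run)   # length of this maximal run
            total += h + g * k
    return total


# Two runs, of lengths 2 and 1: (10 + 1*2) + (10 + 1*1) = 23
print(affine_gap_penalty("C--AT-G", h=10, g=1))  # 23
```

With h > 0, one long gap is cheaper than several short gaps of the same total length, which is the point of the affine model.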


PAM matrices
• Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify evolutionary change within a protein sequence [Dayhoff78].
• A PAM unit is the amount of evolution which will on average change 1% of the amino acids within a protein sequence.


PAM250 scoring matrix


BLOSUM matrices
• Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff92].
• For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.


Comparison
• PAM is based on an evolutionary model using phylogenetic trees
• BLOSUM assumes no evolutionary model, but rather conserved “blocks” of proteins


Multiple Sequence Alignment

VTISCTGSSSNIGAGNHVKWYQQLPG

VTISCTGTSSNIGSITVNWYQQLPG

LRLSCSSSGFIFSSYAMYWVRQAPG

LSLTCTVSGTSFDDYYSTWVRQPPG

PEVTCVVVDVSHEDPQVKFNWYVDG

AALGCLVKDYFPEPVTVSWNSG-

VSLTCLVKGFYPSDIAVEWESNG-


Induced Pairwise Alignment

S1 S - T I S C T G - S - N I

S2 L - T I - C N G S S - N I

S3 L R T I S C S G F S Q N I

Induced pairwise alignment of S1 and S2:

S1 S T I S C T G - S N I

S2 L T I - C N G S S N I
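The induced alignment is obtained by taking the two rows out of the multiple alignment and deleting every column in which both rows have a gap. A short Python sketch (the function name `induced_pairwise` is illustrative):

```python
def induced_pairwise(row_a: str, row_b: str) -> tuple[str, str]:
    """Drop the columns where both rows of a multiple alignment have a gap."""
    kept = [(x, y) for x, y in zip(row_a, row_b) if not (x == "-" and y == "-")]
    a = "".join(x for x, _ in kept)
    b = "".join(y for _, y in kept)
    return a, b


s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```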


Sum-of-Pairs Scoring Function

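Score of multiple alignment = $\sum_{i<j} \mathrm{score}(S_i, S_j)$

where $\mathrm{score}(S_i, S_j)$ = score of the induced pairwise alignment of $S_i$ and $S_j$.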
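A sum-of-pairs score adds up the scores of the induced pairwise alignments over all pairs of rows. The Python sketch below assumes the +10/-2/-5 column scores used earlier in this section and gives columns where both rows have a gap no score, since they are absent from the induced alignment. The names `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=10, mismatch=-2, gap=-5):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                # column is not part of the induced alignment
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows, **scores):
    """Sum the induced pairwise scores over all unordered pairs of rows."""
    return sum(pair_score(a, b, **scores) for a, b in combinations(rows, 2))


# Pairs (1,2) and (2,3) score 10 - 5 + 10 = 15 each; pair (1,3) scores 20.
print(sum_of_pairs(["A-C", "AGC", "A-C"]))  # 50
```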


Multiple Alignment

Run-time of dynamic programming solution

= $O(2^k n^k)$

where n = length of each sequence

k = number of sequences

Space, $O(n^k)$, is prohibitively large!

Example: 6 sequences of length 100 ⇒ $6.4 \times 10^{13}$ calculations!
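The figure in the example follows directly from the run-time bound with constant factors ignored. A quick Python check (the function name `msa_dp_operations` is illustrative):

```python
def msa_dp_operations(n: int, k: int) -> int:
    """Operation count 2^k * n^k for k sequences of length n, ignoring constants."""
    return (2 ** k) * (n ** k)


print(f"{msa_dp_operations(100, 6):.1e}")  # 6.4e+13
```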


Carrillo-Lipman Heuristic

L = Lower bound on multiple alignment score

If

Then $T[i_1, i_2, \ldots, i_k]$ cannot be on an optimal path.


Multiple Alignment to a Phylogenetic Tree
• A tree showing the evolutionary relationship between sequences is available.
• Compute multiple alignment such that for each edge (i,j) in the tree:

Induced alignment between $S_i$ and $S_j$ = optimal alignment between $S_i$ and $S_j$.


Examples

Primates

Darwin’s Finches

http://members.aol.com/darwinpage/trees.htm


Multiple Alignment to a Tree
• Build the multiple alignment incrementally.
• To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment.
• Insert the new sequence according to its optimal alignment with the other sequence connected by the edge.
• Adjust other sequences in the multiple alignment.
• Run-time = time for k pairwise alignments.


Searching Biological Databases

BLAST (Basic Local Alignment Search Tool)

http://www.ncbi.nlm.nih.gov

• BLASTN (DNA)
• BLASTP (Protein)
• BLASTX (DNA against Protein)
• PSI-BLAST (Position Specific Iterative BLAST)


Multiple Alignment Software
• ClustalW (http://www.ebi.ac.uk/clustalw)
• MSA (http://softlib.rice.edu/softlib/msa.html)
• HMMER (http://hmmer.wustl.edu/)
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)


References
• M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5:345-352, 1978.
• S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci., 89:10915-10919, 1992.
• D. S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM, 24:664-675, 1977.
• G. M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, 43:239-249, 1986.
• E. Myers and W. Miller, Optimal alignments in linear space, Computer Applications in the Biosciences, 4(1):11-17, 1988.
