"Biology easily has 500 years of exciting problems to work on." - Donald E. Knuth

Module A: Fundamental Algorithms in Sequence Analysis

Section 1: Sequence Alignments

Srinivas Aluru

Biological Data

DNA:

  • Self-replicating
  • Codes for proteins

Proteins:

  • Perform most functions in living organisms

BBSI Summer School - Iowa State University

DNA: Sequence of nucleotides

Nucleotide: Deoxyribose sugar + Phosphate + Base

Nucleotides: A, T, G, and C

[Figure: chemical structure of a nucleotide, with the sugar carbons labeled 1’ to 5’]

[Figure: double-stranded DNA. The strand 5’ A C G 3’ is paired with the antiparallel strand 3’ T G C 5’; phosphate groups (P) link the sugars along each backbone.]

For computational purposes,

DNA = A sequence over alphabet {A,C,G,T}

5’ A T T C G G G A A T G C A T G C C A 3’

3’ T A A G C C C T T A C G T A C G G T 5’
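The lower strand on this slide follows from the upper one by base pairing. A minimal Python sketch, assuming the helper names `complement` and `reverse_complement` (hypothetical, not from the slides):

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the paired base at every position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

top = "ATTCGGGAATGCATGCCA"      # upper strand from the slide, 5' to 3'
print(complement(top))          # TAAGCCCTTACGTACGGT (lower strand, read 3' to 5')
```

Because the two strands run in opposite directions, sequence databases usually store the opposite strand as the reverse complement rather than the plain complement.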

Proteins: Chains of amino acid residues.

There are 20 different amino acids.

Functions:

• Tissue building blocks (structural proteins)

• Catalysts (enzymes)

• Oxygen transport

• Antibody defense

Example

RNA:

AUG GGA GAG CUA UGA

Protein:

Met Gly Glu Leu STOP
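The example can be reproduced by reading the RNA three bases (one codon) at a time. The sketch below is illustrative only: `CODON_TABLE` is a deliberately partial table holding just the five codons of this example, and `translate` is a hypothetical helper name.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODON_TABLE = {"AUG": "Met", "GGA": "Gly", "GAG": "Glu", "CUA": "Leu", "UGA": "STOP"}

def translate(rna: str) -> list[str]:
    """Read the RNA codon by codon and stop at the first stop codon."""
    protein = []
    for start in range(0, len(rna) - 2, 3):
        residue = CODON_TABLE[rna[start:start + 3]]
        protein.append(residue)
        if residue == "STOP":
            break
    return protein

print(" ".join(translate("AUGGGAGAGCUAUGA")))  # Met Gly Glu Leu STOP
```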

Challenges in Computational Biology
  • Obtain the genome of an organism.
  • Identify and annotate genes.
  • Find the sequences, three-dimensional structures, and functions of proteins.
  • Find sequences of proteins that have desired three-dimensional structures.
  • Compare DNA sequences and protein sequences for similarity.
  • Study the evolution of sequences and species.

Sequence Comparison Caveats

Magenta regions are structurally equivalent with enterotoxin (top left).

http://www.sbg.bio.ic.ac.uk/AH/explanation.html

Pairwise Sequence Alignment

Problem: Find similarity between two sequences.

Variations:

  • Given two sequences, find if parts of them are similar (local alignment).
  • Given a large sequence and a short sequence, find if the short sequence is similar to a stretch of the long sequence.

Alignments
  • Show one sequence placed above another such that similarity is revealed

Example:

A: C A T - T C A - C

B: C - T C G C A G C

Measuring Similarity

Score: A measure of alignment quality

  C   A   T   -   T   C   A   -   C

  C   -   T   C   G   C   A   G   C

--------------------------------

 10  -5  10  -5  -2  10  10  -5  10

Total = 33
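The total can be checked column by column. A minimal sketch, assuming the scoring values +10 (match), -2 (mismatch), and -5 (gap) that the column scores imply and that a later slide states; the function names are hypothetical:

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # values implied by the column scores above

def column_score(a: str, b: str) -> int:
    """Score one alignment column; '-' denotes a gap."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two equal-length aligned rows."""
    assert len(row_a) == len(row_b)
    return sum(column_score(a, b) for a, b in zip(row_a, row_b))

print(alignment_score("CAT-TCA-C", "C-TCGCAGC"))  # 33
```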

Pairwise Global Alignment

T[i,j] = Score of optimally aligning the first i bases of s with the first j bases of t.

Calculating Alignments

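Case 1: Match s[i] with t[j]. The preceding columns align the first i-1 bases of s with the first j-1 bases of t.

s: C A T T C A C

t: C - T T C A G

Case 2: Match t[j] with a gap. The preceding columns align the first i bases of s with the first j-1 bases of t.

s: C A T T C A C -

t: C - T T C A - G

Case 3: Match s[i] with a gap. The preceding columns align the first i-1 bases of s with the first j bases of t.

s: C A T T C A - C

t: C - T T C A G -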
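The three cases can be combined into a single recurrence. The LaTeX sketch below uses two symbols that are assumptions of this note, not of the slides: \sigma(a,b) for the match/mismatch score of two bases and \gamma for the (negative) score of a gap.

```latex
T[i,j] = \max
\begin{cases}
  T[i-1,j-1] + \sigma(s[i], t[j]) & \text{Case 1: } s[i] \text{ with } t[j] \\
  T[i,j-1] + \gamma               & \text{Case 2: } t[j] \text{ with a gap} \\
  T[i-1,j] + \gamma               & \text{Case 3: } s[i] \text{ with a gap}
\end{cases}
\qquad
T[0,0] = 0, \quad T[i,0] = i\,\gamma, \quad T[0,j] = j\,\gamma
```

The boundary values say that aligning a prefix with the empty string costs one gap per base.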

     λ    C    T    C    G    C    A    G    C
λ    0   -5  -10  -15  -20  -25  -30  -35  -40
C   -5   10    5
A  -10
T  -15
T  -20
C  -25
A  -30
C  -35

+10 for match, -2 for mismatch, -5 for gap

[Figure: the completed table for s = CATTCAC (rows) and t = CTCGCAGC (columns), with traceback cells marked (*)]

Traceback yields both optimal alignments in this example
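Table filling and traceback can be sketched in a few lines of Python. This is an illustrative version under the slide's scoring values, not the author's code; the names `sub` and `global_align` are hypothetical, and the traceback follows only one path, so it reports a single optimal alignment even when several exist.

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # scoring values from the slides

def sub(a: str, b: str) -> int:
    """Score for placing base a opposite base b."""
    return MATCH if a == b else MISMATCH

def global_align(s: str, t: str):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * GAP                      # prefix of s against all gaps
    for j in range(1, n + 1):
        T[0][j] = j * GAP                      # prefix of t against all gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = max(T[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # case 1
                          T[i][j - 1] + GAP,                          # case 2
                          T[i - 1][j] + GAP)                          # case 3
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:                      # walk back from the last cell
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and T[i][j] == T[i][j - 1] + GAP:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
        else:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
    return T[m][n], "".join(reversed(a)), "".join(reversed(b))

score, row_s, row_t = global_align("CATTCAC", "CTCGCAGC")
print(score)   # 33
print(row_s)
print(row_t)
```

Collecting every predecessor that attains the maximum, instead of the first one found, would enumerate all optimal alignments.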

End-gap free alignment
  • We often don’t want to penalize gaps at the start or end of the alignment, especially when comparing short and long sequences
  • Same as global alignment, except:
    • Initialize with zeros (free gaps at start)
    • Locate max in the last row/column (free gaps at end)

     λ    C    T    C    G    C    A    G    C
λ    0    0    0    0    0    0    0    0    0
C    0   10    5   10    5   10    5    0   10
A    0    5    8    5    8    5   20   15   10
T    0    0   15   10    5    6   15   18   13
T    0   -2   10   13    8    3   10   13   16
C    0   10    5   20   15   18   13    8   23
A    0    5    8   15   18   13   28   23   18
G    0    0    3   10   25   20   23   38   33

+10 for match, -2 for mismatch, -5 for gap
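The two changes relative to global alignment (zero initialization, maximum over the last row and column) can be sketched as follows. The function names are hypothetical; the sequences and scoring values are those of the table above.

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # scoring values from the slide

def end_gap_free_table(s: str, t: str):
    """Fill the table with row 0 and column 0 left at zero (free gaps at the start)."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = T[i - 1][j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            T[i][j] = max(diag, T[i][j - 1] + GAP, T[i - 1][j] + GAP)
    return T

def end_gap_free_score(s: str, t: str) -> int:
    """Best entry in the last row or last column (free gaps at the end)."""
    T = end_gap_free_table(s, t)
    return max(max(T[-1]), max(row[-1] for row in T))

print(end_gap_free_score("CATTCAG", "CTCGCAGC"))  # 38
```

The maximum, 38, sits in the last row under the second-to-last column, so the final C of t is left unaligned at no cost.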

Local Alignment

T[i, j] = Score of optimally aligning a suffix of the first i bases of s with a suffix of the first j bases of t (two empty suffixes score 0, so no entry is negative).

Initialize top row and leftmost column to zero.

[Figure: local alignment table for s = CATTCAC (rows) and t = CTCGCAGC (columns)]

+1 for a match, -1 for a mismatch, -5 for a gap
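A local alignment score can be sketched by adding a floor of zero to the global recurrence and taking the largest entry anywhere in the table. This is a standard formulation written out for illustration, with the slide's scoring values; `local_score` is a hypothetical name.

```python
MATCH, MISMATCH, GAP = 1, -1, -5    # scoring values from the slide

def local_score(s: str, t: str) -> int:
    """Best score over all pairs of substrings of s and t."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]   # top row and leftmost column stay zero
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = T[i - 1][j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            # The 0 option restarts the alignment at an empty suffix.
            T[i][j] = max(0, diag, T[i][j - 1] + GAP, T[i - 1][j] + GAP)
            best = max(best, T[i][j])
    return best

print(local_score("CATTCAC", "CTCGCAGC"))  # 2
```

With a gap score of -5 against a match score of +1, gaps never pay off for sequences this short, so the best local alignments here are two-base exact matches such as CA.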

Some Results
  • Most pairwise sequence alignment problems can be solved in O(mn) time.
  • Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88].
  • Two highly similar sequences can be aligned in O(dn) time, where d is a measure of the distance between the sequences [Landau86].

Reducing space requirements
  • O(mn) tables are often the limiting factor in computing large alignments
  • There is a linear space technique that only doubles the time required [Hirschberg77]

     λ    C    T    C    G    C    A    G    C
λ    0    0    0    0    0    0    0    0    0
C    0   10    5   10    5   10    5    0   10
A    0    5    8    5    8    5   20   15   10
T
T
C
A
G

IDEA: We only need the previous row to calculate the next
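The idea translates directly into code that keeps two rows instead of the whole table. A minimal sketch for the end-gap free score of the table above (hypothetical function name; this yields the score only, since the discarded rows are what a traceback would need):

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # scoring values from the slides

def end_gap_free_score_two_rows(s: str, t: str) -> int:
    """End-gap free score using O(len(t)) space."""
    n = len(t)
    prev = [0] * (n + 1)                # row for the empty prefix of s
    best_last_col = 0                   # running maximum of the last column
    for i in range(1, len(s) + 1):
        curr = [0] * (n + 1)            # column 0 stays zero (free start gaps)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            curr[j] = max(diag, curr[j - 1] + GAP, prev[j] + GAP)
        best_last_col = max(best_last_col, curr[n])
        prev = curr                     # the older row is no longer needed
    return max(max(prev), best_last_col)

print(end_gap_free_score_two_rows("CATTCAG", "CTCGCAGC"))  # 38
```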

Linear-space Alignments

mn + mn/2 + mn/4 + mn/8 + mn/16 + … = 2mn
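The series describes a divide-and-conquer scheme: find where an optimal alignment crosses the middle row of the table (cost mn), then recurse on the two remaining sub-problems, which together cover half the table. The sketch below is one way to write this in the style of the technique cited as [Hirschberg77], for global alignment with the slides' scoring values; all function names are hypothetical.

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # scoring values from the slides

def sub(a: str, b: str) -> int:
    return MATCH if a == b else MISMATCH

def last_row(s: str, t: str) -> list[int]:
    """Scores of aligning all of s with every prefix of t, keeping two rows."""
    prev = [j * GAP for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * GAP]
        for j in range(1, len(t) + 1):
            curr.append(max(prev[j - 1] + sub(s[i - 1], t[j - 1]),
                            curr[j - 1] + GAP,
                            prev[j] + GAP))
        prev = curr
    return prev

def align_one(c: str, t: str):
    """Base case: optimally align the single base c with the non-empty string t."""
    k = max(range(len(t)), key=lambda pos: sub(c, t[pos]))
    if sub(c, t[k]) + (len(t) - 1) * GAP >= (len(t) + 1) * GAP:
        return "-" * k + c + "-" * (len(t) - k - 1), t   # pair c with t[k]
    return c + "-" * len(t), "-" + t                     # leave c opposite a gap

def linear_space_align(s: str, t: str):
    """Return one optimal global alignment as two gapped rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        return align_one(s, t)
    mid = len(s) // 2
    left = last_row(s[:mid], t)                  # top half, forwards
    right = last_row(s[mid:][::-1], t[::-1])     # bottom half, backwards
    split = max(range(len(t) + 1), key=lambda j: left[j] + right[len(t) - j])
    a1, b1 = linear_space_align(s[:mid], t[:split])
    a2, b2 = linear_space_align(s[mid:], t[split:])
    return a1 + a2, b1 + b2

row_s, row_t = linear_space_align("CATTCAC", "CTCGCAGC")
print(row_s)
print(row_t)
```

Each level of the recursion touches half as many cells as the level above, which is where the factor of two in the total comes from.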

Affine Gap Penalty Functions

Gap penalty = h + gk

where

k = length of a maximal sequence of gaps

h = gap opening penalty

g = gap continuation penalty
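To make the definition concrete, the sketch below scores a given alignment under an affine penalty. The values h = 4 and g = 1 are assumptions chosen for illustration (they make a single gap cost 5, as in the earlier slides); the match and mismatch scores are the +10 and -2 used earlier, and the function name is hypothetical.

```python
MATCH, MISMATCH = 10, -2
H, G = 4, 1     # assumed gap opening penalty h and gap continuation penalty g

def affine_alignment_score(row_a: str, row_b: str) -> int:
    """Score two aligned rows, charging h + g*k for each maximal run of k gaps."""
    score = 0
    prev_gap_row = None                 # which row had a gap in the previous column
    for a, b in zip(row_a, row_b):
        gap_row = "a" if a == "-" else "b" if b == "-" else None
        if gap_row is None:
            score += MATCH if a == b else MISMATCH
        else:
            if gap_row != prev_gap_row:
                score -= H              # a new maximal run of gaps opens here
            score -= G                  # every gap position pays g
        prev_gap_row = gap_row
    return score

print(affine_alignment_score("AC--GT", "ACTTGT"))  # 34
```

Here four matches give 40 and the single run of two gaps costs 4 + 1*2 = 6. Under a linear penalty of 5 per gap the same alignment would score 30, so affine penalties favor fewer, longer gaps.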

PAM matrices
  • Point Accepted Mutation (also expanded as Percent Accepted Mutation). A unit introduced by Dayhoff et al. to quantify evolutionary change within a protein sequence [Dayhoff78].
  • A PAM unit is the amount of evolution which will on average change 1% of the amino acids within a protein sequence.

PAM250 scoring matrix

BLOSUM matrices
  • Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff92].
  • For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

Comparison
  • PAM is based on an evolutionary model using phylogenetic trees
  • BLOSUM assumes no evolutionary model, but rather conserved “blocks” of proteins

Multiple Sequence Alignment

VTISCTGSSSNIGAGNHVKWYQQLPG

VTISCTGTSSNIGSITVNWYQQLPG

LRLSCSSSGFIFSSYAMYWVRQAPG

LSLTCTVSGTSFDDYYSTWVRQPPG

PEVTCVVVDVSHEDPQVKFNWYVDG

ATLVCLISDFYPGAVTVAWKADS

ATLVCLISDFYPGAVTVAWKADS

AALGCLVKDYFPEPVTVSWNSG-

VSLTCLVKGFYPSDIAVEWESNG-

Induced Pairwise Alignment

S1 S - T I S C T G - S - N I

S2 L - T I - C N G S S - N I

S3 L R T I S C S G F S Q N I

Induced pairwise alignment of S1 and S2:

S1 S T I S C T G - S N I

S2 L T I - C N G S S N I
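The induced alignment is obtained by keeping the two chosen rows and dropping every column in which both have a gap. A minimal sketch (the function name is hypothetical):

```python
def induced_pairwise(row_a: str, row_b: str):
    """Keep two rows of a multiple alignment and drop columns where both are gaps."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```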

Sum-of-Pairs Scoring Function

Score of multiple alignment = sum, over all pairs i < j, of score(Aij)

where Aij = induced pairwise alignment of Si and Sj
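A sum-of-pairs score adds up the scores of all induced pairwise alignments. The sketch below applies it to the three-sequence example from the previous slide. The +10 / -2 / -5 scoring values are an assumption carried over from the DNA examples (protein alignments would normally use a substitution matrix such as PAM or BLOSUM), and the function names are hypothetical.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 10, -2, -5   # assumed scoring values, for illustration only

def pair_score(row_a: str, row_b: str) -> int:
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue                    # column is absent from the induced alignment
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATCH if a == b else MISMATCH
    return total

def sum_of_pairs(rows: list[str]) -> int:
    """Add the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["S-TISCTG-S-NI", "L-TI-CNGSS-NI", "LRTISCSGFSQNI"]
print(sum_of_pairs(rows))  # 178
```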

Multiple Alignment

Run-time of dynamic programming solution = O(2^k n^k)

where n = length of each sequence

k = number of sequences

Space, O(n^k), is prohibitively large!

Example: 6 sequences of length 100 → 6.4 × 10^13 calculations!
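The figure in the example follows from the bound: the table has n^k cells, and each cell looks at up to 2^k neighboring cells (one per choice of which sequences advance). A short check in Python, with a hypothetical function name:

```python
def dp_work(n: int, k: int) -> int:
    """Upper bound 2^k * n^k on the work of k-way dynamic programming."""
    return (2 ** k) * (n ** k)

print(f"{dp_work(100, 6):.1e}")   # 6.4e+13
print(f"{100 ** 6:.1e}")          # 1.0e+12 table cells
```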

Carrillo-Lipman Heuristic

L = Lower bound on multiple alignment score

If

Then T[i1,i2,…,ik] cannot be on an optimal path.

Multiple Alignment to a Phylogenetic Tree
  • A tree showing the evolutionary relationship between sequences is available.
  • Compute multiple alignment such that for each edge (i,j) in the tree:

Induced alignment between Si and Sj = Optimal alignment between Si and Sj.

Examples

Primates

Darwin’s Finches

http://members.aol.com/darwinpage/trees.htm

Multiple Alignment to a Tree
  • Build the multiple alignment incrementally.
  • To add a new sequence, an edge should connect it in the tree to a sequence already incorporated in the multiple alignment.
  • Insert the new sequence according to its optimal alignment with the other sequence connected by the edge.
  • Adjust other sequences in the multiple alignment.
  • Run-time = time for k pairwise alignments.
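The steps above can be sketched as follows. This is an illustrative reading of the procedure, not the author's implementation: sequences are added one tree edge at a time, and "adjusting other sequences" is done by inserting all-gap columns. The scoring values, the function names, and the edge-list input format are assumptions.

```python
MATCH, MISMATCH, GAP = 10, -2, -5   # assumed scoring values

def pairwise(s, t):
    """Return one optimal global alignment of s and t as two gapped rows."""
    m, n = len(s), len(t)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        T[i][0] = i * GAP
    for j in range(1, n + 1):
        T[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            T[i][j] = max(T[i - 1][j - 1] + d, T[i][j - 1] + GAP, T[i - 1][j] + GAP)
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:               # traceback from the bottom-right cell
        d = MATCH if i > 0 and j > 0 and s[i - 1] == t[j - 1] else MISMATCH
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + d:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif j > 0 and T[i][j] == T[i][j - 1] + GAP:
            a.append("-"); b.append(t[j - 1]); j -= 1
        else:
            a.append(s[i - 1]); b.append("-"); i -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def add_sequence(msa, anchor, aligned_anchor, aligned_new):
    """Add a row to msa, guided by the new sequence's alignment with row `anchor`."""
    out = [[] for _ in range(len(msa) + 1)]
    c = k = 0                           # c: column of msa, k: column of the pairwise alignment
    while c < len(msa[0]) or k < len(aligned_anchor):
        if c < len(msa[0]) and msa[anchor][c] == "-":
            for r, row in enumerate(msa):      # old gap in the anchor row: keep the column
                out[r].append(row[c])
            out[-1].append("-")
            c += 1
        elif k < len(aligned_anchor) and aligned_anchor[k] == "-":
            for r in range(len(msa)):          # new residue opposite a gap: open a column
                out[r].append("-")
            out[-1].append(aligned_new[k])
            k += 1
        else:
            for r, row in enumerate(msa):      # same anchor residue in both: merge columns
                out[r].append(row[c])
            out[-1].append(aligned_new[k])
            c += 1
            k += 1
    return ["".join(row) for row in out]

def induced(row_a, row_b):
    """Drop columns where both rows have a gap."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def align_to_tree(seqs, edges):
    """edges: (known, new) index pairs, ordered so `known` is already aligned."""
    order = [edges[0][0]]
    msa = [seqs[order[0]]]
    for known, new in edges:
        a, b = pairwise(seqs[known], seqs[new])
        msa = add_sequence(msa, order.index(known), a, b)
        order.append(new)
    return dict(zip(order, msa))

seqs = ["CATTCAC", "CTCGCAGC", "CATTCAG"]
for index, row in align_to_tree(seqs, [(0, 1), (1, 2)]).items():
    print(index, row)
```

Because existing rows only ever receive whole all-gap columns, the induced alignment along each edge added earlier is left unchanged by later insertions.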

Searching Biological Databases

BLAST (Basic Local Alignment Search Tool)

http://www.ncbi.nlm.nih.gov

  • BLASTN (DNA)
  • BLASTP (Protein)
  • BLASTX (DNA against Protein)
  • PSI-BLAST (Position Specific Iterative BLAST)

Multiple Alignment Software
  • ClustalW (http://www.ebi.ac.uk/clusalw)
  • MSA (http://softlib.rice.edu/softlib/msa.html)
  • HMMER (http://hmmer.wustl.edu/)
  • SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)

References
  • M. O. Dayhoff, R.M. Schwartz, and B.C. Orcutt, A model of evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5:345-352, 1978.
  • S. Henikoff and J. G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Academy Science, 89:10915-10919, 1992.
  • D.S. Hirschberg, Algorithms for the longest common subsequence problem, J. ACM, 24:664-675, 1977.
  • G.M. Landau and U. Vishkin, Efficient string matching with k mismatches, Theoretical Computer Science, 43:239-249, 1986.
  • E. Myers and W. Miller, Optimal alignments in linear space, Computer Applications in the Biosciences, 4(1):11–17, 1988.
