Lecture 2 sequence alignment ibgp 705
This presentation is the property of its rightful owner.
Sponsored Links
1 / 48

Lecture 2: Sequence Alignment IBGP 705 PowerPoint PPT Presentation


  • 53 Views
  • Uploaded on
  • Presentation posted in: General

Lecture 2: Sequence Alignment IBGP 705 . Kun Huang Department of Biomedical Informatics Ohio State University. Major issues in genomics Homology Alignment as an optimization problem Dynamical programming BLAST Scoring matrices Multiple sequence alignment. Charles Darwin

Download Presentation

Lecture 2: Sequence Alignment IBGP 705

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Lecture 2 sequence alignment ibgp 705

Lecture 2: Sequence Alignment IBGP 705

Kun Huang

Department of Biomedical Informatics

Ohio State University


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Scoring matrices

  • Multiple sequence alignment


Lecture 2 sequence alignment ibgp 705

Charles Darwin

(1809-1882)

“I think …”


Homology

Homology

A Working Definition:

Sequences or structures which share a common ancestor


Homology defined

Homology Defined

"The same organ in different animals under a variety of form and function."

Sir Richard Owen, Lectures on the Comparative Anatomy and Physiology of the Invertebrate Animals, 1843.

"The mechanism of homology is heredity."

Allan Boyden, Homology and Analogy: A century after the definitions of "homologue" and "analogue" of Richard Owen, 1943.

"Homology is a relation bearing on recency of common ancestry.“

Olivier Rieppel, Homology and logical fallacy, 1993.


Lecture 2 sequence alignment ibgp 705

Sequence Homology

  • Genes in separate species derived from the

    same ancestral genes in the last common ancestor of those two species are orthologs.

  • Related genes resulted from a gene duplication event within a single genome--and are likely to have diverged in their function--are paralogs.

  • Both orthologs and paralogs are homologs, a general term to cover both types of relationships


Recognizing sequence homology

Recognizing Sequence Homology

  • Relies primarily on understanding random sequence similarity

  • Only by knowing what random similarity looks like can we tell when two sequences are significantly similar

  • Understanding mutational regularity and sequence evolution increases the significance

    • Closely-related: Transitions/transversions

    • Distantly-related: PAM mutation probabilities

  • Even distantly-related sequences can be recognized

  • "Significant Similarity" is not a definition of homology.


Lecture 2 sequence alignment ibgp 705

Databases

  • GenBank

  • EMBL

  • DDBJ

  • SWISSPROT


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

    • Format

    • Search

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Tools and examples


Lecture 2 sequence alignment ibgp 705

Aligning Text Strings

T C A T G

C A T T G

2 matches, 0 gaps

T C A T G

| |

C A T T G

4 matches, 1 inserted gaps

T C A - T G

| | | |

C A T T G

3 matches, 2 end gaps

T C A T G

| | |

C A T T G

4 matches, 1 inserted gaps

T C A T - G

| | | |

C A T T G

Optimal solution: with respect to what criteria / cost function?


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Optimization criteria / cost function

  • Parameters to be adjusted

  • Search algorithm / process

    • Exhaustive testing

    • Suboptimal solutions

    • Computational cost / complexity

  • Statistical significance


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Optimization criteria / cost function

    • What sort of alignment should be considered

    • Scoring system (maximize the score)

      • Additive model

      • Based on probability compared with random sequence (PAM, BLOSUM)

      • Assumption of independence

      • More complicated cases

        • Gap penalty – linear (s = -gd)

        • affine (s = -d – (g-1)e)


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Optimization criteria / cost function

    • What sort of alignment should be considered

    • Scoring system

      • Additive model

      • Based on probability compared with random sequence (PAM, BLOSUM)

      • Assumption of independence

      • More complicated cases

        • Gap penalty – linear (s = -gd)

        • affine (s = -d – (g-1)e)


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Parameters to be adjusted

    • Shift

    • Number of gaps

    • Position of gaps

3 matches, 2 end gaps

T C A T G

| | |

C A T T G

4 matches, 1 inserted gaps

T C A - T G

| | | |

C A T T G


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Search algorithm / process

    • Exhaustive testing

      • Try all possible configuration of parameters.

      • E.g., sequence a with length m, sequence b with length n. Try all m+n shifts (if we use the O(.) annotation, then the running time is O(m+n)).

0 matches, 0 gaps

T C A T G

C A T T G

3 matches, 2 end gaps

T C A T G

| | |

C A T T G

2 matches, 0 gaps

T C A T G

| |

C A T T G


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Search algorithm / process

    • Computational cost / complexity

      • What if we allow gaps?

0 matches, 0 gaps

T C A T G

C A T T G

3 matches, 2 end gaps

T C A T G

| | |

C A T T G

2 matches, 0 gaps

T C A T G

| |

C A T T G


Many possible alignments to consider

Many possible alignments to consider

  • Without gaps, there are are n+m possible alignments between sequences of length n and m

  • Once we start allowing gaps, there are many possible arrangements to consider:abcbcd abcbcd abcbcd | | | | | | | | || || abc--d a--bcd ab--cd

  • This becomes a very large number when we allow mismatches, since we then need to look at every possible pairing between elements: there are roughly nm possible alignments.


Exponential computations get big fast

Exponential computations get big fast

  • If n=m=100, there are 100100 = 10200 = 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 different alignments.

  • And 100 amino acids is a small protein!


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Statistical significance

    • Not only are there many possible gapped alignments, but introducing too many gaps makes nonsense alignments possible:s--e-----qu---en--ce sometimesquipsentice

    • Need to distinguish between alignments that occur due to homology, and those that could be expected to be seen just by chance.

    • Define a score function that accounts for statistical significance (logarithmic scale – multiplication of odds becomes addition of scores).


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Scoring matrices

  • Multiple sequence alignment


Dot matrix sequence comparison

Dot matrix sequence comparison

  • Write one sequence across top of matrix, the other across left side, then put a dot where character on line i equals one in column j

  • Examples below: DNA and amino acid sequences of the phage  cI (vertical axis) and phage P22 c2 (horizontal axis) repressors


Dynamic programming

Dynamic programming

  • The name comes from an operations research task, and has nothing to do with writing programs. Programming – use tabular structure for computing.

  • The key idea is to start aligning the sequences left to right; once a prefix is optimally aligned, nothing about the remainder of the alignment changes the alignment of the prefix.

  • We construct a matrix of possible alignment scores (nxm2 calculations worst case) and then "traceback" to find the optimal alignment.

  • Called Needleman-Wunch (for global matching) or Smith-Waterman (for local matching)


Dynamic programming1

18

3

20

5

25

B

2

A

11

Dynamic programming

  • The name comes from an operations research task, and has nothing to do with writing programs. Programming – use tabular structure for computing.

17


Dynamic programming matrix

Dynamic programming matrix

  • Each cell has the score for the best aligned sequenceprefix up to that position.

  • Example:

    ATGCT vs. ACCGCT

    Match: +2, mismatch: 0, gap: -1

Matching matrix, NOT the dynamical programming matrix!


Dynamic programming matrix1

A T _

A _C

A T

A C

A T

A C

A _T

A C_

Dynamic programming matrix

0 1 2 3 4 5 6

0

1

2

3

4

5

6

7


Optimal alignment by traceback

Optimal alignment by traceback

  • We “traceback” a path that gets us the highest score. If we don't have “end gap” penalties, then takeany path from thelast row or columnto the first.

  • Otherwise we needto include the top and bottom corners

0 1 2 3 4 5 6

0

1

2

3

4

5

6

7

AT - GCT

ACCGCT

A - TGCT

ACCGCT


Dynamic programming2

Dynamic programming

  • Global alignment – an alignment of two or more sequences that matches as many characters as possible in all of the sequences. Needleman-Wunch algorithm

  • Local alignment – an alignment that includes only the best matching, highest-scoring regions in two or more sequences. Smith-Waterman algorithm

  • Difference – all the scores are kept in the dynamical programming matrix for global alignment; only the positive scores are kept in the dynamical programming matrix for local alignment, the negative ones are converted to zero.


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Scoring matrices

  • Multiple sequence alignment


Lecture 2 sequence alignment ibgp 705

Sequence alignment (BLAST)

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

http://www.ncbi.nlm.nih.gov/blast/


Lecture 2 sequence alignment ibgp 705

BLAST – Algorithm Intuition

The BLAST algorithm.The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.


Lecture 2 sequence alignment ibgp 705

BLAST – Algorithm Intuition

Databases are pre-indexed by the words.

Without gaps:

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410

With gaps:

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J., Nucleic Acids Research (1997) 25(17):3389-3402

http://www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf


Lecture 2 sequence alignment ibgp 705

BLAST – Scoring Matrices

DNA scoring matrix (substitution matrix)

ATGC

A5-4-4-4

T-45-4-4

G-4-45-4

C-4-4-45

ATTTAGCCG

ACTTGGCCT

5 55 555

Score = 5X6+(-4)X3=18

http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#matrix


Lecture 2 sequence alignment ibgp 705

BLAST – Parameters

  • Word size – for MegaBlast, can work between w=16 and 64.

  • Expected – statistical based notion, compare the matched sequence with random sequence (the likelihood). The larger the score, the smaller the expected value, the more significant the result.

  • Percent Identity, match/mismatch scores.


Lecture 2 sequence alignment ibgp 705

BLAST – Program Selection

  • Nucleotide

  • Quickly search for highly similar sequences (megablast)

  • Quickly search for divergent sequences (discontiguous megablast)

  • Nucleotide-nucleotide BLAST (blastn)

  • Search for short, nearly exact matches

  • Search trace archives with megablast or discontiguous megablast

  • Protein

  • Protein-protein BLAST (blastp)

  • Position-specific iterated and pattern-hit initiated BLAST (PSI- and PHI-BLAST)

  • Search for short, nearly exact matches

  • Search the conserved domain database (rpsblast)

  • Protein homology by domain architecture (cdart)


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Scoring matrices

  • Multiple sequence alignment


How we pick substitution scores

How we pick substitution scores?

  • Problem: how to score each residue pair in two aligned sequences?

    • intuitively the score should reflect the likelihood that two sequences are aligned because they related as opposed to a random alignment

  • We assume the letter a occurs independently with frequency qa

    • random match scenario: the likelihood that the letters a and b occur by chance is qaqb

    • true match scenario: the aligned pair a,b occurs with joint probability pab


Substitution scores

Substitution scores

  • The likelihood that the residue pair (a, b) occurs as an aligned pair, as opposed to an unaligned pair, is pab/qaqb

  • If we use the logarithm of this ratio (log-odds ratio), the score for the entire alignment is additive


Substitution matrices

Substitution matrices

  • As we wanted, the total score is the sum of individual scores s(a, b) for each aligned pair

  • The s(a, b) can be arranged in a matrix called substitution or score matrix

    • for proteins, this is a 20 x 20 matrix

  • A number of such matrices have been proposed, each computed according to different strategies


Clusters of amino acids

Clusters of amino acids

Hydrophobic

Polar

A

T

Q

D

P

V

G

S

N

E

A

Acid

I

L

Y

F

K

M

H

C

W

R

Basic

Aromatic


Lecture 2 sequence alignment ibgp 705

Alignment as An Optimization Problem

  • Optimization criteria / cost function

    • What sort of alignment should be considered

    • Scoring system

      • Additive model

      • Based on probability compared with random sequence (PAM, BLOSUM)

      • Assumption of independence

      • More complicated cases

        • Gap penalty – linear (s = -gd)

        • affine (s = -d – (g-1)e)


Lecture 2 sequence alignment ibgp 705

Scoring Matrices

  • DNA is relatively easy to choose and protein is harder.

  • PAM (Percent Accepted Mutation) matrices: predicted matrices, most sensitive for alignments of sequences with evolutionary related homologs. The greater the number in the matrix name, the greater the expected evolutionary (mutational) distance, i.e. PAM30 would be used for alignments expected to be more closely related in evolution than an alignment performed using the PAM250 matrix.

  • BLOSUM (Blocks Substitution Matrix): calculated matrices, most sensitive for local alignment of related sequences, ideal when trying to identify an unknown nucleotide sequence. BLOSUM62 is the default matrix set be the BLAST search tool.


Lecture 2 sequence alignment ibgp 705

  • Major issues in genomics

    • Homology

  • Alignment as an optimization problem

  • Dynamical programming

  • BLAST

  • Scoring matrices

  • Multiple sequence alignment


Multiple sequence alignment

Multiple Sequence Alignment

  • Phylogenetics

  • Metagenomics

  • Proteins that are similar in sequence are often similar in structure and function

  • Sequence changes more rapidly in evolution than does structure and function.


Clustalw

ClustalW

  • A progressive method based on phylogenetic analysis

  • A phylogenetic tree is created using a pairwise distance matrix and nearest-neighbor algorithm

  • The most closely-related pairs of sequences are aligned using dynamic programming

  • Each of the alignments is analyzed and a profile of it is created

  • Alignment profiles are aligned progressively for a total alignment

  • W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the phylogenetic tree (See p. 154 of Bioinformatics by Mount.)


Problems with progressive method

Problems with Progressive Method

  • Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off.

  • It’s not trivial to come up with a suitable scoring matrix or gap penalties.


Lecture 2 sequence alignment ibgp 705

ClustalW

http://www.ebi.ac.uk/clustalw/

http://align.genome.jp


Lecture 2 sequence alignment ibgp 705

ClustalW (Example)

>query MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVWVHTELGYFSGEAVPS NGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTALT AELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNV ADAYVNEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK>gi|2984094 MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE>gi|115023|sp|P10425|MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK>gi|115030|sp|P25910|MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGHGDYGGTELIEHTKQIVNQYIESTSKP>gi|282554|pir||S25844 MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWVDKLAAGPGRTVVNTH FHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVE LICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWVQRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEI FAELVAYNGGQLPTCLA


  • Login