Amino acid scoring matrices
Download
1 / 16

Amino Acid Scoring Matrices - PowerPoint PPT Presentation


  • 127 Views
  • Uploaded on

Amino Acid Scoring Matrices. Jason Davis. Overview. Protein synthesis/evolution Computational sequence alignment Smith-Waterman Algorithm BLAST Amino Acid Scoring Matrices PAM – Point Accepted Mutations BLOSUM – BLOck SUbstitution Matrix mPAM Metric Conversions. Proteins.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Amino Acid Scoring Matrices' - hunter-galloway


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

Overview
Overview

  • Protein synthesis/evolution

  • Computational sequence alignment

    • Smith-Waterman Algorithm

    • BLAST

  • Amino Acid Scoring Matrices

    • PAM – Point Accepted Mutations

    • BLOSUM – BLOck SUbstitution Matrix

    • mPAM

  • Metric Conversions


Proteins
Proteins

  • 3-dimensional stuctures

  • Composed of amino acids chained together

    • Can be represented as a 2-dimensional sequence

    • 20 different amino acids exist

    • Usually 100-1500 amino acids long

  • Have many different shapes and functions

    • Function depends on both 3d shape and aa sequence


Protein synthesis
Protein Synthesis

  • DNA: strand composed of 4 different base pairs

    • A, T, C, G

  • 20 amino acids: 3 base pairs needed to encode each amino acid

    • Degenerate coding

Protein

Signalling

Transcription/Translation


Protein evolution
Protein Evolution

  • Protein ‘families’

    • Set of homologous proteins

    • Same function, different composition

    • Similar structure

  • Identifying families

    • Pairwise sequence alignment

    • Multiple sequence alignment

      • NP-hard

    • Other approaches

      • Structural, experimental


Pairwise sequence alignment
Pairwise Sequence Alignment

  • Input

    • 2 sequences p, q of lengths m,n

    • 20x20 Amino Acid Substitution Matrix

    • Insertion (gap) cost

  • Global Alignment

    • Find optimal set of insertions such that the resulting alignment (length < m+n) is optimal w.r.t. amino acid substitution matrix

    • Difficult, less useful

  • Local Alignment

    • Find significant ‘hotspot’ in the alignment


Sequence alignment algorithms
Sequence Alignment Algorithms

  • Dynamic Programming Approaches

    • Global and Local variations

    • Provably Optimal

    • O(nm) space and time

      • ‘banded’ heuristics can reduce the state space

    • FSA extensions allow varying penalties for gap openings and gap extensions

  • Heuristics Approaches

    • Blast, Fasta

    • Sublinear time – look for statistical significance in small local alignments between sequences


Substitution matrices pam
Substitution Matrices - PAM

  • Dayhoff, Schwartz, Orcutt (1978)

  • Step 1: extrapolate mutation probabilites from 1 step in evolutionary time

    • Pick a set of protein families (71)

    • Restrict proteins in each family to sequences with similarity above a certain threshold (>85%)

    • Build a phylogenetic tree for each family

    • Extrapolate frequencies Aab that amino acids a, b evolved from same amino acid

      • Aab and Aba assumed to be the same

    • Convert frequencies to probabilities

      • p(a|b) = Bab = Aab/∑cAac


Substitution matrices pam 2
Substitution Matrices – PAM (2)

  • Step 2 – Infer greater evolutionary times

    • Dayhoff defined a PAM1 matrix to have 1% expected substitutions

      • For each row, scale off-diagonalsand adjust diagonals to keep the matrix row stochastic

    • To infer larger evolutionary times, we can view formed matrix C as a 20-state Markov Chain

      • Cn is the result of performing n-steps in the Markov Process


Substitution matrices pam 3
Substitution Matrices – PAM (3)

  • Create odds ratio of

    • 1) the event that 2 amino acids i,j, evolved from the same ancestor, x

      • fi = observed frequency of amino acid i

      • p(i,j have same ancestor) = ∑xfx Pr{x→i} Pr{x→j} = ∑xfx (CN)ix (CN)jx= ∑x (CN)ix fx (CN)jx= ∑x (CN)ix fj (CN)xj= fj (C2N)ij

    • 2) the event that the 2 amino acids align at random

      • p(independent alignment of i,j) = fi * fj

    • Final log odds ratio:

      • Dij = average[log((CN)ij / fi), log(CN)ji / fj))

      • The log allows for an additive model

      • Final numbers are rounded to nearest integer


Pam250
PAM250

  • Different values on the diagonal correspond do mutability potential


Blosum
BLOSUM

  • Henikoff & Henikoff, 1992

  • Uses aligned, ungapped blocks within protein families that have similarity greater than some level L%

  • qa = ∑bAab / ∑c,d Acd

  • pab = Aab / ∑c,d Acd

  • S(a,b) = log(pab / qaqb)

  • Final entries are rounded

  • Blosum62 (L=62), Blosum50 (L=50)

  • More direct approach, usually yields better results


Log odds similarity matrix properties
Log-Odds Similarity Matrix Properties

  • Negative numbers needed for Smith-Waterman local alignment algorithm

  • Nice probabilistic interpretation

    • Amino acid substitutions assumed independent

  • Attempts to metricize these matrices

    • Taylor, Jones 93: used various algebraic manipulations to arrive at a metric matrix with minimal disortion

    • Dij = a – Sij

      • Larger values of a yielded better metrics at the cost of high dimensionality

      • Constant Shift Embedding

  • Linial, et. al. constructed a near metric over aligned segments of length 50

    • D(u,v) = S(u,u) + S(v,v) – 2*S(u,v)

    • 10-7error rate


mPAM

  • Metric substitution model

  • Measures the expected time per 250 mutations among 100 amino acids

    • Same rate as PAM250

    • Exponential distribution assumed: f(t) = 1 – e-λt

    • Given pairwise substitution rates p(a,b)

      • Solve for λ: f(1) = 1-e- λ = p(a,b)

      • Expected time t of an event occuring in an exponential distribution is 1/ λ

      • mPAM(a,b) = round(1/ λ)

    • Two values needed to be adjusted to form a metric

      • Rounding error?


Mpam 2
mPAM (2)

  • Seller’s Theorem:

    • If a pairwise alignment is found using a metric, resulting alignment scores are also metrics

  • Optimized for BLAST-like lookup

    • Smaller alignments

  • Difficult to compare with other similarity matrices

    • Dynamic programming algorithms rely on negative values in the similarity matrix

      • Probabilistic interpretation: larger positive alignments are statistically significant


Mpam disadvantages
mPAM Disadvantages

  • d(x,x) = 0

    • This does not capture the relative mutability among different amino acids

      • PAM/BLOSUM capture this with different positive values along the diagonal

  • Do amino acids substitute according to an exponential distribution?

  • Amino Acid Substitution may be inherently non-metric

  • Comparison to BLOSUM?


ad