DNA sequences alignment measurement

Lecture 13

Introduction
  • Measurement of alignment “strength”
  • Nucleic acid and amino acid substitutions
  • Measurement of alignment gaps
Measurement of aligned sequences
  • When aligning sequences (DNA/AA) it is assumed that:
    • they have a common ancestor;
    • the differences between the sequences are the result of mutations;
    • important areas such as coding sequences (CDS) will be conserved, because there is a bias “against” mutations in these areas;
    • there is also a bias in the types of mutation: substitutions are more likely than insertions/deletions.
  • The dot plot gives a visual representation of the regions where sequences align. But how do we measure the strength of these alignments?
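As a reminder of what a dot plot records, here is a minimal Python sketch. The function names `dot_plot` and `render` are illustrative assumptions, not part of the lecture; each cell simply marks whether one position of the first sequence matches one position of the second.

```python
def dot_plot(seq_a, seq_b):
    """Return a grid where grid[i][j] is True when seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Render the dot plot as text: '*' marks a match, '.' a mismatch."""
    grid = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_b)]  # header row: the second sequence
    for base, row in zip(seq_a, grid):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Identical sequences give an unbroken diagonal of '*'.
    print(render("agtc", "agtc"))
```

An unbroken diagonal suggests a region of alignment, but the plot itself gives no number, which is why a scoring scheme is needed.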
Measurement of aligned sequences
  • One way is to count the mismatches: the “difference” between the sequences.
    • Hamming distance:
      • The distance is the number of mismatched positions between strings of equal length.
        • agtc
        • cgta Distance is 2 (give another example)
  • If the sequences (strings) are not of equal length then use:
    • The Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another:
      • ag-tcc
      • cgctca What is the Levenshtein distance?
  • The latter technique has the advantage of allowing gaps to be included.
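Both distances can be sketched in a few lines of Python. This is a minimal illustration; the function names are assumptions, and the Levenshtein version uses the standard row-by-row dynamic programming approach, with each alter, insert or delete costing 1.

```python
def hamming(a, b):
    """Count mismatched positions between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of alter/insert/delete operations turning a into b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # alter x into y (free if equal)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("agtc", "cgta"))
    # The gap character is left out: the function works on unaligned strings.
    print(levenshtein("agtcc", "cgctca"))
```

Neither function knows anything about biology: every edit costs the same, which is the limitation taken up on the next slide.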
Measurement of matching
  • But how biologically plausible are these approaches to measuring “differences” between sequences (strings)?
  • Mismatches between DNA sequences differ from mismatches between ordinary strings:
    • the probabilities of substitution, insertion and deletion are not the same;
    • certain types of mutation, such as inversions, translocations and duplications, complicate the assessment of similarity; e.g. how would you treat tandem repeats or inverted repeats?
Nucleic Acid mutations
  • In sequence alignment we are trying to determine whether the similarities (and differences) arose because:
    • of chance (random mutations);
    • the sequences had a common origin (degree of conservation).
  • One approach would be to count the percentage of matches, but the bias associated with the possible substitutions also needs to be included.
  • However, similarity does not necessarily imply a common ancestor, or vice versa: Zvelebil and Baum (2008 p. 74) suggest this can occur in convergent/divergent evolution.
  • So the findings of alignment tests need to be put in context (bats and birds both have wings…).
Alignment Scoring methods
  • In general each candidate alignment is given a score summed over its aligned positions; the alignment with the largest score is optimal and is chosen, although suboptimal alignments may also need to be considered.
  • The most basic approach is to measure the percentage similarity.
  • Given that not all “changes” occur with equal probability there is a need to develop:
    • A nucleotide substitution matrix
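The most basic measure, percentage similarity, might be computed as in the sketch below. The function name and the choice of denominator (the full alignment length, gap columns included) are assumptions; other conventions divide by the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(x == y and x != gap for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("agtc", "cgta"))
    print(percent_identity("ag-tcc", "cgctca"))
```

Every mismatch is treated alike here, which is exactly what a substitution matrix is meant to improve on.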
Nucleotide scoring Matrix
  • It is known that certain mutations are more likely to occur than others: e.g. transitions (a<->g, c<->t) are more common than transversions (e.g. a<->c).
  • However, since this difference in probability is insignificant in relation to the chance of a mutation occurring at all, it is mostly ignored. The following shows a typical scoring matrix for nucleotides.

Adapted from Baxevanis p. 303
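To make the transition/transversion distinction concrete, the sketch below classifies a substitution using the standard definition: a change within the purines (a, g) or within the pyrimidines (c, t) is a transition, and a change between the two classes is a transversion. The function names and the +1/-1 values in `simple_score` are illustrative assumptions and are not the values from the matrix cited above.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def classify_substitution(x, y):
    """Return 'match', 'transition' or 'transversion' for two DNA bases."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if x == y:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


def simple_score(x, y, match=1, mismatch=-1):
    """Score that ignores the transition/transversion distinction (hypothetical values)."""
    return match if x.lower() == y.lower() else mismatch


if __name__ == "__main__":
    for pair in (("a", "g"), ("c", "t"), ("a", "c"), ("g", "g")):
        print(pair, classify_substitution(*pair), simple_score(*pair))
```

With a simple match/mismatch scheme like `simple_score`, a transition and a transversion receive the same mismatch value.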

Nucleic acid scoring Matrix
  • The values are based on how likely each type of substitution is to occur (the expected value); this includes a nucleotide being “substituted” with itself.
  • These expected values are calculated as the ratio:
    • number of “observed changes” / number of changes “due to chance”
  • The counts are obtained by examining large numbers of DNA sequences.
Nucleic acid scoring Matrix
  • Then calculate 10 × log10(“expected value”).
  • Taking the logarithm means that the values for adjacent nucleotides can be added, rather than multiplied, when determining the alignment score.
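The calculation can be sketched as follows. The function names and the sample ratios are hypothetical; the point is only that adding the log scores gives the same result as taking the log of the multiplied ratios.

```python
import math


def log_odds_score(observed, expected_by_chance):
    """10 * log10 of the ratio observed changes / changes due to chance."""
    return 10 * math.log10(observed / expected_by_chance)


def expected_value(score):
    """Invert the score back to the observed/chance ratio."""
    return 10 ** (score / 10)


def alignment_score(ratios):
    """Sum the per-position log scores for a list of expected values."""
    return sum(10 * math.log10(r) for r in ratios)


if __name__ == "__main__":
    ratios = [2.0, 0.5, 4.0]  # hypothetical expected values for three aligned positions
    print(alignment_score(ratios))               # sum of the log scores
    print(10 * math.log10(math.prod(ratios)))    # log of the product: same number
```

`expected_value` is the inverse operation, converting a matrix score back into a ratio.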
Nucleic acid scoring Matrix
  • An expected value of 1 indicates the substitution has the same chance of occurring as it would by chance (a score of 0).
  • A value greater than 1 indicates a bias in favour of the substitution (a positive score).
  • A value less than 1 indicates a bias against the substitution (a negative score).
  • A score of 5 corresponds to what expected value?
Measuring Protein similarity
  • Deriving a matrix for proteins is more complex because:
    • there are 20 amino acids, so the set of possible substitutions is much larger;
    • the amino acids have properties that affect the structure, and so the function, of the protein.
  • Therefore substitutions can be conserved or semi-conserved.
  • Observation shows that conserved substitutions are more common than semi-conserved ones:
      • conserved: e.g. hydrophobic <-> hydrophobic
      • semi-conserved: e.g. hydrophilic <-> hydrophobic
The PAM matrix
  • The PAM matrix, developed in the 1970s by M. Dayhoff, was derived from all types of sequence, in both conserved and non-conserved regions of proteins.
  • The PAM (point accepted mutation) matrix is a measure of the chance of one amino acid being substituted by another.
  • The PAM “1” matrix was “derived” from very similar sequences (close in evolutionary terms) and represents one “accepted” point mutation per 100 residues.
  • It is assumed that a substitution is independent of previous mutations at the same position.
  • This means PAM 1 can be used to derive matrices for longer “evolutionary distances”, i.e. more mutations per 100 residues.
The PAM matrix
  • The number of times the PAM 1 matrix is “multiplied” by itself represents the number of mutations per 100 residues.
  • So a PAM 100 represents 100 mutations per 100 residues; this of course does not mean that all 100 residues have changed, since the same position can mutate more than once.
  • The most commonly used, the PAM 250, represents 250 mutations per 100 residues.
  • The scores in the original PAM 1 were derived in a similar way to the nucleotide substitution matrix:
    • 10 × log10(expected value)
    • expected value = number of observed changes / number of changes expected from the amino acid frequencies (as if the changes occurred by chance).
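The repeated multiplication can be illustrated with a toy example. The sketch below is not Dayhoff's data: it assumes a hypothetical four-letter alphabet in which each residue changes with probability 0.01 per step, spread evenly over the other letters. The multiplication is applied to the mutation probability matrix, not to the log scores, and all names are illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise matrix m to the power n (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


def toy_pam1(alphabet_size=4, change=0.01):
    """Hypothetical 'PAM 1-like' matrix: 1% chance of change per residue per step."""
    off_diagonal = change / (alphabet_size - 1)
    return [[1 - change if i == j else off_diagonal for j in range(alphabet_size)]
            for i in range(alphabet_size)]


if __name__ == "__main__":
    toy_pam100 = mat_power(toy_pam1(), 100)
    # Diagonal entry: probability that a residue is the same after 100 steps.
    print(f"unchanged after 100 steps: {toy_pam100[0][0]:.3f}")
```

In this toy model a little under half of the residues are still unchanged after 100 steps, because mutations can hit the same position repeatedly or revert to the original residue.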
Dayhoff PAM (250) Matrix
  • H <-> N has a value of 2, i.e. an expected value of 1.6 (it occurred 1.6 times more often than would be expected by chance).
  • Same amino acid: scores range from high, C <-> C (12), to lower, V <-> V (4).
  • Hydrophobic <-> hydrophobic: V <-> A (0); V <-> I (4)
  • Hydrophilic <-> hydrophilic: K <-> T (0); K <-> R (3)
  • Hydrophilic <-> hydrophobic: K <-> V (-2)
Blosum matrix
  • Another matrix, the Blosum (Blocks Substitution) matrix, used a larger data set than the PAM.
  • This allowed the Blosum to look at mutations within “blocks” of conserved sequence, as opposed to point mutations in individual sequences in both conserved and variable regions.
  • The Blosum number, e.g. 62, unlike the PAM number, e.g. 120, indicates that the substitution probabilities are derived from sequences with that much similarity or conservation.
  • So the Blosum 62 is obtained from blocks sharing 62% conservation.
  • In addition these values are obtained directly from observation, as opposed to being extrapolated as in the case of the PAM matrix.
The Blosum and PAM matrices

The two matrices opposite are the Blosum 62 and PAM 120

The following shows scores from the Blosum 62 matrix:
  • Same residues: C <-> C (9); V <-> V (4)
  • Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3)
  • Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2)
  • Hydrophobic to hydrophilic: K <-> V (-2)

The trend is the same.

Adapted from Zvelebil and Baum p. 82
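As a sketch of how such scores are used, the code below sums substitution scores over an ungapped alignment. The dictionary holds only the seven Blosum 62 values quoted above, so it is a deliberately incomplete subset; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score`, and the two short peptides, are illustrative assumptions.

```python
# Only the Blosum 62 scores quoted in the text; the full matrix has many more entries.
BLOSUM62_SUBSET = {
    ("C", "C"): 9,
    ("V", "V"): 4,
    ("V", "A"): 0,
    ("V", "I"): 3,
    ("K", "T"): -1,
    ("K", "R"): 2,
    ("K", "V"): -2,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    if (y, x) in matrix:
        return matrix[(y, x)]
    raise KeyError(f"no score listed for {x}<->{y}")


def ungapped_score(seq_a, seq_b, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores over each aligned pair of residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    # C<->C (9) + V<->I (3) + K<->R (2) + V<->A (0)
    print(ungapped_score("CVKV", "CIRA"))
```

Because the scores are logarithmic, adding them position by position is the correct way to combine them.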

PAM and Blosum matrices
  • In the PAM matrix, as the number increases so does the evolutionary distance, while the reverse is true of the Blosum matrix.
  • According to Baxevanis (2005) the following shows the equivalence between the two sets of matrices, which guides the most appropriate use of each:
    • PAM250 and the Blosum 45
    • PAM160 and the Blosum 62
PAM and Blosum matrix

Adapted from Baxevanis 2005

An excellent review of scoring matrices can be found in Henikoff and Henikoff (2000).

Dot plot Matrix: imperfect match
  • Some alignments require gaps to increase the matching score; the gaps are used to represent insertion/deletion mutations.
  • The diagram shows that most of the two sequences are aligned. Breaks in the diagonal indicate areas of non-alignment or mismatch: gaps or substitutions.

Adapted from: dotplot example

Measurement of alignment gaps
  • Gaps represent insertions and deletions.
  • Baxevanis (2005) suggests that no more than “one gap in 20 pairs is a good rule of thumb”.
  • Gaps in alignments are penalised, i.e. given a negative scoring value.
  • The penalty associated with using gaps depends on:
    • Opening the gap (introducing an insertion or deletion)
    • Extending the gap (as opposed to opening a new gap)
    • The length of the gap (the number of deletions/insertions).
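One common way to combine these three factors is an affine penalty: a fixed cost for opening a gap plus a smaller cost for each further position. The sketch below assumes that scheme; the default values of 10 and 1 are hypothetical, and the functions return a positive penalty to be subtracted from the alignment score.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for a single gap: open it once, then pay for each extra position."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned, gap_open=10, gap_extend=1, gap="-"):
    """Sum the penalties over every run of gap characters in one aligned sequence."""
    total = 0
    run = 0  # length of the gap currently being read
    for ch in aligned:
        if ch == gap:
            run += 1
        else:
            total += affine_gap_penalty(run, gap_open, gap_extend)
            run = 0
    # Account for a gap that runs to the end of the sequence.
    total += affine_gap_penalty(run, gap_open, gap_extend)
    return total


if __name__ == "__main__":
    print(total_gap_penalty("ag---tc"))   # one gap of length 3
    print(total_gap_penalty("a-g-t-c"))   # three gaps of length 1
```

Under these assumed values one long gap costs less than several short ones of the same total length, which reflects the idea that a single insertion or deletion event can span several positions.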
Gap penalties
  • There is no overall agreement on what values should be assigned to gap penalties (Zvelebil and Baum 2008).
  • The purpose of inserting a gap is to increase the strength of the alignment.
  • So choosing a high penalty will eliminate alignments with gaps, while if the penalty is too low then alignments with more and larger gaps will be chosen.
  • The value should also depend on how closely “related” the aligned sequences are expected to be:
    • sequences requiring a very strict match would use a high gap penalty;
    • alignments between distantly related species would use a low gap penalty.
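The effect of the penalty value can be seen by scoring two candidate alignments of the same pair of sequences. This sketch assumes a simple scheme (match +1, mismatch -1, a fixed penalty per gap position); the sequences, values and function names are all hypothetical.

```python
def score_alignment(aligned_a, aligned_b, gap_penalty, match=1, mismatch=-1, gap="-"):
    """Score an alignment column by column with a fixed penalty per gap position."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            score -= gap_penalty
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def best_alignment(candidates, gap_penalty):
    """Pick the candidate alignment with the highest score."""
    return max(candidates, key=lambda pair: score_alignment(pair[0], pair[1], gap_penalty))


# Two ways of aligning acgtacgt with acgacgta.
UNGAPPED = ("acgtacgt", "acgacgta")    # 3 matches, 5 mismatches
GAPPED = ("acgtacgt-", "acg-acgta")    # 7 matches, 2 gap positions

if __name__ == "__main__":
    for penalty in (1, 5):
        chosen = best_alignment([UNGAPPED, GAPPED], penalty)
        print(penalty, chosen)
```

With the low penalty the gapped alignment wins; with the high penalty the ungapped alignment is preferred even though it has far fewer matches.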
Potential Exam Questions
  • What is the purpose of measuring the strength of an alignment? (3 marks)
  • Explain two differences between analysing an ordinary string and a DNA sequence. (4 marks)
  • Describe how you would measure the similarity between two DNA sequences. (10 marks)
  • Discuss the use of gap penalties in a sequence alignment score. (13 marks)
References
  • Baxevanis, A.D. 2005; Bioinformatics: a practical guide to the analysis of genes and proteins, chapter 11; Wiley
  • Lesk, A. 2008; Introduction to bioinformatics, 3rd edition; Oxford University Press
  • Zvelebil, M. and Baum, J.O. 2008; Understanding bioinformatics