An alignment is an evolutionarily meaningful comparison of two or more sequences (DNA, RNA, or proteins). In the case of two DNA sequences, an alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs: (1) matches = the same nucleotide appears in both sequences. (2) mismatches = different nucleotides are found in the two sequences. (3) gaps = a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *
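The three types of pairs can be told apart mechanically, column by column. A minimal sketch, assuming the alignment is given as two equal-length strings with "-" as the null base; the function name `annotate_pairs` is hypothetical, and the symbols follow the annotation line used on these slides ("*" match, "." mismatch, blank for a gap).

```python
def annotate_pairs(top: str, bottom: str, gap: str = "-") -> str:
    """Return one symbol per column: '*' match, '.' mismatch, ' ' gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            symbols.append(" ")   # a base paired with a null base
        elif a == b:
            symbols.append("*")   # same nucleotide in both sequences
        else:
            symbols.append(".")   # different nucleotides
    return "".join(symbols)


top = "GCGGCCCATCAGGTAGTTGGTG-G"
bottom = "GCGTTCCATC--CTGGTTGGTGTG"
print(top)
print(bottom)
print(annotate_pairs(top, bottom))
```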
Alignment: A hypothesis concerning positional homology among residues in the aligned sequences. Positional homology = A pair of nucleotides from two aligned sequences that have descended from one nucleotide in the ancestor of the two sequences.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
***..*****  .*.******* *
The two nucleotides in an aligned pair are derived from one nucleotide in the common ancestor of cats and armadillos.
Homology: The term was coined by Richard Owen in 1843. Definition: Similarity resulting from common ancestry.
Homology: A qualitative statement
• Homology designates a relationship of common descent between entities.
• Two genes are either homologs or not.
• It doesn’t make sense to say “two genes are 43% homologous.”
• It doesn’t make sense to say “Linda is 43% pregnant.”
Homology: By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.
Homology: When dealing with sequences, we are interested in POSITIONAL HOMOLOGY. We identify positional homology by ALIGNMENT.
Ancestral sequence: ACTGGGCCCAAATC
Lineage 1 (1 deletion, 1 substitution): ACTGGCCCAGATC
Lineage 2 (1 insertion, 1 substitution): ACAGGGCCACAAATC
Correct alignment:
ACT-GGCC-CAGATC
ACAGGGCCACAAATC
**.-****-**.***
Incorrect alignment:
ACTGGCCCAGATC--
ACAGGGCCACAAATC
**.**.***.*..--
Ancestral sequence: unknown
Lineage 1 (events unknown): ACTGGCCCAGATC
Lineage 2 (events unknown): ACAGGGCCACAAATC
Correct alignment?
ACT-GGCC-CAGATC
ACAGGGCCACAAATC
**.-****-**.***
Incorrect alignment?
ACTGGCCCAGATC--
ACAGGGCCACAAATC
**.**.***.*..--
Sequence alignment = The identification of the locations of deletions or insertions that might have occurred in either of the two lineages since their divergence from a common ancestor. Insertion + Deletion = Indel or Gap
- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.
A gap indicates that a deletion or an insertion has occurred in one of the two lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
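The quantities m, n, x, y, and z can be counted directly from an alignment such as the one above. A minimal sketch, assuming "-" marks a null base and that no column contains two null bases; `count_pairs` is a hypothetical helper name. Because every matched or mismatched pair uses one base from each sequence and every gap column uses one base, the counts should satisfy m + n = 2(x + y) + z.

```python
def count_pairs(a: str, b: str, gap: str = "-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases in gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    x = y = z = 0
    for p, q in zip(a, b):
        if p == gap or q == gap:
            z += 1          # a base opposite a null base
        elif p == q:
            x += 1          # matched pair
        else:
            y += 1          # mismatched pair
    return x, y, z


A = "GCGG-CCATCAGGTAGTTGGTG--"
B = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = count_pairs(A, B)
m = len(A.replace("-", ""))   # length of sequence A without null bases
n = len(B.replace("-", ""))   # length of sequence B without null bases
print(f"m={m} n={n} x={x} y={y} z={z}")   # m=21 n=22 x=16 y=3 z=5
print(m + n == 2 * (x + y) + z)           # True
```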
The alignment is the first step in many evolutionary and functional studies. Errors in alignment tend to amplify in later computational stages.
Methods of alignment:
1. Manual
2. Dot matrix
3. Algorithmic (scoring matrices and gap penalties)
Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.
GCG-TCCATCAGGTAGTTGGTGTG
GCGTTCCATCAGGTGGTTGGTGTG
*** **********.*********
Advantages of manual alignment: (1) use of a powerful and trainable tool (the brain, well…, some brains). (2) ability to integrate additional data, e.g., domain structure, biological function (e.g., 3D structure).
Disadvantages of manual alignment:
1. Subjectivity = the inability to formally specify the algorithm.
2. Irreproducibility = the inability of two researchers to reach the same result.
3. Unscalability = the inability to apply the method to long sequences.
4. Incommensurability = the inability to compare the results to those derived from other methods.
The dot-matrix method: The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
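A dot matrix of this kind might be built as follows. This is a minimal sketch with no windowing; the function names `dot_matrix` and `render`, and the choice of "*" as the dot symbol, are assumptions made for illustration.

```python
def dot_matrix(left: str, top: str):
    """matrix[i][j] is True where left[i] and top[j] are identical."""
    return [[a == b for b in top] for a in left]


def render(left: str, top: str) -> str:
    """Text plot: top sequence as column headings, left sequence as row headings."""
    lines = ["  " + " ".join(top)]
    for base, row in zip(left, dot_matrix(left, top)):
        cells = " ".join("*" if dot else " " for dot in row)
        lines.append((base + " " + cells).rstrip())
    return "\n".join(lines)


print(render("ACTGGCCCAGATC", "ACAGGGCCACAAATC"))
```

Stretches of similarity show up as runs of dots along a diagonal.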
The alignment is defined by a path from the upper-left element to the lower-right element.
There are 4 possible steps in the path: (1) a diagonal step through a dot = match. (2) a diagonal step through an empty element of the matrix = mismatch. (3) a horizontal step = a base in the sequence on the top of the matrix paired with a null base, i.e., a gap in the sequence on the left. (4) a vertical step = a base in the sequence on the left of the matrix paired with a null base, i.e., a gap in the sequence on the top.
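A path through the matrix can be turned into an alignment by replaying its steps. The sketch below assumes the top sequence is read along the columns, so a horizontal step consumes one base of the top sequence and writes a null base into the left sequence, and a vertical step does the reverse. The step codes "D", "H", "V" and the function name are hypothetical.

```python
def path_to_alignment(top: str, left: str, steps: str):
    """Replay a path of D (diagonal), H (horizontal), V (vertical) steps."""
    i = j = 0                      # i indexes left (rows), j indexes top (columns)
    top_row, left_row = [], []
    for step in steps:
        if step == "D":            # match or mismatch: one base from each sequence
            top_row.append(top[j])
            left_row.append(left[i])
            i += 1
            j += 1
        elif step == "H":          # advance along the top sequence only
            top_row.append(top[j])
            left_row.append("-")
            j += 1
        elif step == "V":          # advance along the left sequence only
            top_row.append("-")
            left_row.append(left[i])
            i += 1
        else:
            raise ValueError(f"unknown step {step!r}")
    if i != len(left) or j != len(top):
        raise ValueError("path must end at the lower-right element")
    return "".join(top_row), "".join(left_row)


top_aln, left_aln = path_to_alignment(
    "ACAGGGCCACAAATC", "ACTGGCCCAGATC", "DDDHDDDDHDDDDDD"
)
print(left_aln)   # ACT-GGCC-CAGATC
print(top_aln)    # ACAGGGCCACAAATC
```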
A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone.
The number of spurious matches is determined by: window size, stringency, & alphabet size. (Plot shown: window size = 1, stringency = 1, alphabet size = 4.)
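Window size and stringency can be sketched as a filter on the dot matrix: a dot is drawn only if at least `stringency` of the `window` consecutive diagonal positions match. Anchoring the window at its upper-left element is an assumption made here (some programs center it instead), and the function name is hypothetical.

```python
def windowed_dot_matrix(left: str, top: str, window: int = 1, stringency: int = 1):
    """Dot at (i, j) if >= stringency matches among left[i+k] == top[j+k], k < window."""
    n_rows = max(len(left) - window + 1, 0)
    n_cols = max(len(top) - window + 1, 0)
    matrix = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(left[i + k] == top[j + k] for k in range(window))
            row.append(matches >= stringency)
        matrix.append(row)
    return matrix


def count_dots(matrix) -> int:
    return sum(sum(row) for row in matrix)


left, top = "ACTGGCCCAGATC", "ACAGGGCCACAAATC"
plain = windowed_dot_matrix(left, top, window=1, stringency=1)
filtered = windowed_dot_matrix(left, top, window=3, stringency=2)
print("dots with window 1, stringency 1:", count_dots(plain))
print("dots with window 3, stringency 2:", count_dots(filtered))
```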
Plots compared: window size = 1, stringency = 1, alphabet size = 4; and window size = 3, stringency = 2, alphabet size = 4.
Plot shown: window size = 1, stringency = 1, alphabet size = 20.
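The effect of the three settings on spurious matches can be estimated with a binomial tail. This is a rough sketch that assumes random sequences in which all letters are equally frequent and independent, which real sequences only approximate; the function name is hypothetical.

```python
from math import comb


def spurious_dot_probability(window: int, stringency: int, alphabet_size: int) -> float:
    """P(at least `stringency` chance matches among `window` positions)."""
    p = 1 / alphabet_size          # chance that two random letters are identical
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


print(spurious_dot_probability(1, 1, 4))    # 0.25
print(spurious_dot_probability(3, 2, 4))    # 0.15625
print(spurious_dot_probability(1, 1, 20))   # 0.05
```

Under these assumptions, a larger window with a suitable stringency, or a larger alphabet (amino acids instead of nucleotides), leaves fewer chance dots.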
Dot-matrix methods: Advantages: May unravel information on the evolution of sequences.
Advantages: Highlighting information (window size = 60 amino acids; stringency = 24 matches). The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene. The two diagonally oriented parallel lines most probably indicate that a small internal duplication has occurred in the bacterial gene.
The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences. Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.
Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.
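The trade-off can be seen with a small global-alignment sketch. The code below uses standard Needleman-Wunsch-style dynamic programming, which these slides do not describe; the linear gap penalty, the score values, and the two toy sequences are all assumptions chosen for illustration.

```python
def global_align(s1: str, s2: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (aligned_s1, aligned_s2, score) for one optimal global alignment."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):                      # score of pairing s1[i-1] with s2[j-1]
        return match if s1[i - 1] == s2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),   # match or mismatch
                score[i - 1][j] + gap,              # null base in s2
                score[i][j - 1] + gap,              # null base in s1
            )

    # Trace back from the lower-right element to recover one optimal path.
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), score[-1][-1]


def summarize(a1: str, a2: str):
    """Return (mismatches, bases in gaps) of an alignment."""
    gaps = sum(p == "-" or q == "-" for p, q in zip(a1, a2))
    mismatches = sum(p != q and "-" not in (p, q) for p, q in zip(a1, a2))
    return mismatches, gaps


for gap_score in (-1, -10):
    a1, a2, total = global_align("ACGTACGTA", "CGTACGTAC", gap=gap_score)
    print(gap_score, a1, a2, total, summarize(a1, a2))
```

With the cheap gap penalty the alignment has 0 mismatches and 2 bases in gaps; with the expensive one it has 9 mismatches and no gaps.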
a = number of matches; b = number of mismatches; g = number of nucleotides in gaps; d = number of gaps
The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b). The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.
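A scoring scheme of this kind might be applied to a given alignment as sketched below. The values in `M` (+1 for a match, -1 for a mismatch) and the gap penalty, which charges a fixed cost per gap plus a cost per nucleotide in the gap, are illustrative assumptions and not values from these slides.

```python
BASES = "ACGT"
# Hypothetical nucleotide scoring matrix: M[a][b] = +1 if a == b, else -1.
M = {a: {b: (1 if a == b else -1) for b in BASES} for a in BASES}


def score_alignment(top: str, bottom: str, matrix, gap_open: int = 2, gap_extend: int = 1) -> int:
    """Sum matrix scores over paired columns and subtract gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_in = None                 # which sequence held the previous null base
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_in = "top" if a == "-" else "bottom"
            if gap_in != previous_gap_in:
                score -= gap_open          # a new gap starts here
            score -= gap_extend            # every nucleotide in a gap is charged
            previous_gap_in = gap_in
        else:
            score += matrix[a][b]
            previous_gap_in = None
    return score


print(score_alignment("ACT-GGCC-CAGATC", "ACAGGGCCACAAATC", M))   # 3
print(score_alignment("ACTGGCCCAGATC--", "ACAGGGCCACAAATC", M))   # -1
```

Under these assumed values, the alignment with two single-base gaps scores higher than the one with a terminal two-base gap.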
If you want to know the secrets behind the black box of sequence alignment, you will have to take a class in BIOINFORMATICS.
Multiple Sequence Alignment is far more complicated than pairwise alignment.
For Multiple Sequence Alignment, computing an exact optimal solution is impractical for more than a few sequences. It is solved heuristically.
A Multiple Sequence Alignment
GCGGCTCA TCAGGTAGTT GGTG-G Spinach
GCGGCCCA TCAGGTAGTT GGTG-G Rice
GCGTTCCA TC--CT-GTT GGTGTG Mosquito
GCGTCCCA TCAGCTAGTT GTTG-G Monkey
GCGGCGCA TTAGCTAGTT GGTG-A Human
***...** *.--.*-*** *.**-.
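The annotation line under a multiple alignment can be derived column by column. A minimal sketch, assuming the convention visible in this example: "*" where all sequences share the same base, "-" where at least one sequence has a null base, and "." otherwise. The function name is hypothetical, and the block separators (spaces) are removed from the sequences.

```python
def conservation_line(sequences, gap: str = "-") -> str:
    """Return one symbol per column of a multiple sequence alignment."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for column in zip(*sequences):
        if gap in column:
            symbols.append("-")        # at least one null base
        elif len(set(column)) == 1:
            symbols.append("*")        # identical in every sequence
        else:
            symbols.append(".")        # at least two different bases
    return "".join(symbols)


msa = {
    "Spinach":  "GCGGCTCATCAGGTAGTTGGTG-G",
    "Rice":     "GCGGCCCATCAGGTAGTTGGTG-G",
    "Mosquito": "GCGTTCCATC--CT-GTTGGTGTG",
    "Monkey":   "GCGTCCCATCAGCTAGTTGTTG-G",
    "Human":    "GCGGCGCATTAGCTAGTTGGTG-A",
}
for name, seq in msa.items():
    print(seq, name)
print(conservation_line(list(msa.values())))
```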