
Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment

Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment. Leming Zhou School of Health and Rehabilitation Sciences Department of Health Information Management. Outline. Pairwise sequence alignment Multiple sequence alignment Phylogenetic tree. Similarity Search.


Presentation Transcript


  1. Genomics and Personalized Care in Health Systems Lecture 3 Sequence Alignment Leming Zhou School of Health and Rehabilitation Sciences Department of Health Information Management

  2. Outline • Pairwise sequence alignment • Multiple sequence alignment • Phylogenetic tree

  3. Similarity Search • Find statistically significant matches to a protein or DNA sequence of interest. • Obtain information on inferred function of the gene • Sequence identity/similarity is a quantitative measurement of the number of nucleotides / amino acids which are identical /similar in two aligned sequences • Calculated from a sequence alignment • Can be expressed as a percentage • In proteins, some residues are chemically similar but not identical
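As a rough illustration of how sequence identity is calculated from an alignment and expressed as a percentage, here is a minimal Python sketch. The function name `percent_identity` and the choice of denominator (the total number of alignment columns, gaps included) are assumptions made for this example; other conventions divide by the length of the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, gap columns included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


# Two different alignments of ACGGACT and ATCGGATCT give different identities.
print(f"{percent_identity('A-C-GG-ACT', 'ATCGGAT-CT'):.1f}%")
print(f"{percent_identity('A-CGG-ACT', 'ATCGGATCT'):.1f}%")
```

The same pair of sequences yields a different percentage under each alignment, which is why identity is always reported relative to a specific alignment.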

  4. Sequence Alignment • A linear, one-to-one correspondence between some of the symbols in one sequence with some of the symbols in another sequence • Four possible outcomes in aligning two sequences • Identity; mismatch; gap in one sequence; gap in the other sequence • May be DNA or protein sequences.

  5. Evolutionary Basis of Alignment • The simplest molecular mechanisms of evolution are substitution, insertion, and deletion • If a sequence alignment represents the evolutionary relationship of two sequences, residues that are aligned but do not match represent substitutions • Residues that are aligned with a gap in the sequence represent insertions or deletions

  6. Alignment Algorithms • Sequences often contain highly conserved regions • These regions can be used for an initial alignment

  7. Alignments • Two sequences: Seq 1: ACGGACT; Seq 2: ATCGGATCT • There may be multiple ways of creating the alignment. Which alignment is the best?
     Alignment 1:
     A-C-GG-ACT
     | | |   ||
     ATCGGAT-CT
     Alignment 2:
     ATCGGATCT
     | |||  ||
     A-CGG-ACT

  8. Optimal vs. Correct Alignment • For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations • This is partly due to: • the complexity of the problem, • limitations of the scoring systems used, • our limited understanding of life and evolution • Success of the alignment will depend on the similarity of the sequences. If sequence variation is great it will be very difficult to find an optimal alignment

  9. Optimal Alignment • Every alignment has a score • Choose the alignment with the highest score • Must choose an appropriate scoring function • Scoring function is based on an evolutionary model with insertions, deletions, and substitutions • Use a substitution score matrix, which contains an entry for every amino acid pair

  10. Gaps • Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. • Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. • Biologically, it should in general be easier for a sequence to accept a different residue in a position, rather than having parts of the sequence deleted or inserted. Gaps/insertions should therefore be more rare than point mutations (substitutions)

  11. Gaps in Sequence Alignment • Gap can occur • Before the first character of a string • Inside a string • After the last character of a string
      CTGCGGG---GGTAAT
        ||||    || ||
      --GCGG-AGAGG-AA-

  12. Gap penalties • There is no suitable theory for gap penalties. • The simplest gap penalty is a constant penalty for each gap • The most common type of gap penalty is the affine gap penalty: g = a + bx • a is the gap opening penalty • b is the gap extension penalty • x is the number of gapped-out residues. • A contiguous block of residues is more likely to be inserted or deleted in a single event than several separate residues • The scoring scheme should therefore penalize opening a new gap more than extending an existing one • Typical values, e.g. a = 10 and b = 1 for BLAST.
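The affine formula g = a + bx can be checked with a few lines of Python. This is only a sketch: the function name `affine_gap_penalty` is hypothetical, and the defaults simply reuse the example values a = 10 and b = 1 quoted on the slide.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty g = a + b*x for one contiguous gap covering `length` residues."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


# One gap of three residues versus three separate single-residue gaps.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

With these values a single three-residue gap costs 13 while three isolated one-residue gaps cost 33, so the scheme favors alignments that group gapped residues into contiguous blocks.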

  13. Pairwise Sequence Alignment

  14. Pairwise Alignment • The process of lining up two sequences to achieve maximal levels of identity or conservation for the purpose of assessing the degree of similarity and the possibility of homology • It is used to • Decide if two genes are related structurally or functionally • Find the similarities between two sequences with same evolutionary background • Identify domains or motifs that are shared between proteins • Analyze genomes • Identify genes, search large databases, determine overlaps of sequences (DNA assembly)

  15. DNA and Protein Sequences • DNA alphabet: {A, C, G, T}+ • Four discrete possibilities – it’s either a match or a mismatch • Protein alphabet: {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}+ • 20 possibilities which fall into several categories • Residues can be similar without being identical • In some cases, protein sequence is more informative • Codons are degenerate: changes in the third position often do not alter the amino acid that is specified • In some cases, DNA alignments are appropriate • To confirm the identity of a cDNA; to study noncoding regions of DNA; to study DNA polymorphisms, …

  16. Translating a DNA Sequence into Proteins • DNA sequences can be translated into protein, and then used in pairwise alignments • One DNA sequence can be translated into six potential proteins (three reading frames on each strand)
      Forward reading frames begin: 5’ CAT CAA; 5’ ATC AAC; 5’ TCA ACT
      5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
      3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
      Reverse reading frames begin: 5’ GTG GGT; 5’ TGG GTA; 5’ GGG TAG
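The six reading frames can be reproduced with a short Python sketch. It assumes the standard genetic code, writes stop codons as `*`, and translates each frame end to end without stopping at stop codons; the names `six_frame_translation`, `translate`, and `reverse_complement` are chosen for this example only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of `dna`; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate frames +1, +2, +3 (given strand) and -1, -2, -3 (reverse complement)."""
    dna = dna.upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


sequence = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame, protein in six_frame_translation(sequence).items():
    print(frame, protein)
```

The first two residues of each frame correspond to the codon pairs shown on the slide (for example CAT CAA in frame +1 and GTG GGT in frame -1).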

  17. DNA Alignment Score
      CGAAGACTTGAGCTGAT
      || |||| ||| ||||
      CGCAGACATGA-CTGAC
      (Figure labels: match, gap, mismatch)

  18. Alignment Scoring Scheme • Possible scoring scheme: • match: +5 • mismatch: -3 • indel: -4 • Example:
      G  A  A  T  T  C  A  G  T  T  A
      |     |     |  |     |        |
      G  G  A  -  T  C  -  G  -  -  A
      +5 -3 +5 -4 +5 +5 -4 +5 -4 -4 +5
      S = 5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
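The column-by-column sum on this slide can be reproduced with a small Python sketch. The function name `score_alignment` is hypothetical; the default scores are the ones listed on the slide, and a column containing a gap in either sequence is charged as an indel.

```python
def score_alignment(
    aligned_a: str,
    aligned_b: str,
    match: int = 5,
    mismatch: int = -3,
    indel: int = -4,
    gap: str = "-",
) -> int:
    """Sum the per-column scores of a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            score += indel      # gap in one of the sequences
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # aligned but different residues
    return score


print(score_alignment("GAATTCAGTTA", "GGA-TC-G--A"))  # 11, as computed on the slide
```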

  19. Amino Acid Sequence Alignment • No exact match/mismatch scores • Match state score calculated by table lookup • Lookup table is substitution matrix (or scoring matrix)

  20. Substitution Matrix • A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. • Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. • Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. • The two major types of substitution matrices are Point Accepted Mutation (PAM) and BLOcks SUbstitution Matrix (BLOSUM).

  21. Sequence Alignment Algorithms • Dynamic Programming: • Needleman-Wunsch Global Alignment (1970) • Smith-Waterman Local Alignment (1981) • Guaranteed to find the best-scoring alignment • Slow, especially when used to compare a query against a large database • Heuristics • FASTA, BLAST: heuristic approximations to Smith-Waterman • Fast, with results comparable to the Smith-Waterman algorithm

  22. Dynamic Programming • Solve optimization problems by dividing the problem into overlapping subproblems and reusing their stored solutions • Sequence alignment has the optimal substructure property • Subproblem: alignment of prefixes of two sequences • Each subproblem is computed once and stored in a matrix • Optimal score: built upon the optimal alignment computed to that point • Aligns two sequences beginning at the ends, attempting to align all possible pairs of characters • Alignment contains matches, mismatches and gaps • Scoring scheme for matches, mismatches, gaps • Highest set of scores defines the optimal alignment between sequences
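The prefix-by-prefix idea described above can be sketched as a minimal Needleman-Wunsch global alignment in Python. This is an illustrative sketch, not the lecture's own implementation: it assumes a linear gap penalty (each gapped residue costs `indel`) and borrows the +5 / -3 / -4 scheme from the scoring-scheme slide; the function name is chosen for this example.

```python
def needleman_wunsch(a: str, b: str, match: int = 5, mismatch: int = -3, indel: int = -4):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning the prefixes a[:i] and b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + indel,    # a[i-1] against a gap
                score[i][j - 1] + indel,    # b[j-1] against a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGGACT", "ATCGGATCT")
print(best)
print(top)
print(bottom)
```

For the two sequences from the "Alignments" slide and this particular scoring scheme, the sketch reports a score of 27 with ACGGACT written as A-CGGA-CT, since every residue of the shorter sequence can be matched at the cost of two gaps.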

  23. The Big O Notation • Computational complexity of an algorithm describes how its execution time increases as the problem is made larger (e.g. more sequences to align) • The big-O notation • For a problem of size n, an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2) • More examples, where c is a constant: • O(c) utopian • O(log n) excellent • O(n) very good • O(n^2) not so good • O(n^3) pretty bad • O(c^n) disaster

  24. Drawbacks to DP Approaches • Compute intensive • Memory intensive • Complexity of DP Algorithm • Time O(nm); space O(nm) • where n, m are the lengths of the two sequences. • Space complexity can be reduced to O(n) by not storing the entries of dynamic programming table that are no longer needed for the computation (keep current row and the previous row only) • A fast heuristic (BLAST) will be discussed next week
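The space saving mentioned above (keeping only the current and previous rows) can be sketched in Python as follows. The function name and the +5 / -3 / -4 linear-gap scoring are assumptions carried over from the scoring-scheme slide. Note that this version returns only the optimal score; recovering the alignment itself in linear space needs a more elaborate method such as Hirschberg's algorithm.

```python
def global_alignment_score(a: str, b: str, match: int = 5, mismatch: int = -3, indel: int = -4) -> int:
    """Optimal global alignment score using two rows of the DP table."""
    # The scoring is symmetric, so put the shorter sequence along the row
    # to keep the stored rows as small as possible.
    if len(b) > len(a):
        a, b = b, a
    previous = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * indel] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                previous[j - 1] + sub,   # diagonal move: residue against residue
                previous[j] + indel,     # gap in b
                current[j - 1] + indel,  # gap in a
            )
        previous = current  # the older row is no longer needed
    return previous[len(b)]


print(global_alignment_score("ACGGACT", "ATCGGATCT"))
```

Time remains proportional to the product of the two lengths; only the memory requirement drops to a single row plus the row being filled.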

  25. Two Sequences >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC

  26. Pairwise Sequence Alignment • FASTA: http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi • DNA vs. DNA comparison • Default parameters: • Match: +5 • Mismatch: -4 • Gap open penalty: -12 • Gap extension penalty: -4 • BLAST search will be covered next week

  27. Multiple Sequence Alignment

  28. Multiple Sequence Alignment • Multiple sequence alignment (MSA) is a generalization of Pairwise Sequence Alignment: instead of aligning two sequences, n (>2) sequences are aligned simultaneously • A multiple sequence alignment is obtained by inserting gaps (“-”) into sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns where each column represents a homologous position • MSA applies both to DNA and protein sequences

  29. Why Do We Need MSA? • MSA can help to develop a sequence “fingerprint” which allows the identification of members of a distantly related protein family (motifs) • Formulate & test hypotheses about protein 3-D structure • MSA can help us to reveal biological facts about proteins, e.g. how protein function has changed or what evolutionary pressure is acting on a gene • Crucial for genome sequencing: • Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program. • To establish homology for phylogenetic analyses • Identify homologous sequences in other organisms

  30. Multiple Sequence Alignment • Difficulty: introduction of multiple sequences increases combination of matches, mismatches, gaps • In pairwise alignments, one has a 2D matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences • A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a DP algorithm in N dimensions. Algorithmically, this is not difficult to do

  31. Example fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

  32. MSA • How do we generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. Does it work? • It is not self-evident how these sequences are to be aligned together. • It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment

  33. Dynamic Programming for MSA • Dynamic programming with two sequences • Relatively easy to code • Guaranteed to obtain optimal alignment • An extension of the pairwise sequence alignment • Alignment of K sequences • K(K-1)/2 possible sequence comparisons • Alignment algorithms operate in a similar manner as pairwise alignment but now the distance matrix is K dimensional and the weight function compares K letters

  34. Time Complexity of Optimal MSA • Space complexity (hyperlattice size): O(n^k) for k sequences each n long. • Computing a hyperlattice node: O(2^k). • Time complexity: O(2^k n^k). • Finding the optimal solution takes time exponential in k; the problem is NP-hard.
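To get a feel for how quickly these bounds grow, the following Python sketch evaluates n^k lattice nodes and 2^k work per node for a few values of k. The function name `msa_dp_cost` and the sequence length of 100 are arbitrary choices for illustration, and constant factors are ignored, so the numbers are orders of magnitude rather than exact operation counts.

```python
def msa_dp_cost(n: int, k: int) -> tuple[int, int]:
    """Return (lattice nodes, approximate operations) for exact DP over k sequences of length n."""
    nodes = n ** k                 # space: one cell per combination of positions
    operations = (2 ** k) * nodes  # time: about 2^k predecessor cells examined per node
    return nodes, operations


for k in (2, 3, 5, 10):
    nodes, operations = msa_dp_cost(100, k)
    print(f"k={k:2d}  nodes={nodes:.1e}  operations={operations:.1e}")
```

Going from two to ten sequences of length 100 raises the table size from ten thousand cells to 10^20, which is why exact dynamic programming is limited to a handful of sequences and heuristics are used in practice.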

  35. Heuristics for Optimal MSA • Reduction of space and time • Heuristic alignment – not guaranteed to be optimal • Alignment provides a limit to the volume within which optimal alignments are likely to be found • Heuristics: • Progressive alignments (ClustalW)

  36. Progressive Alignment • Aligns a pair of sequences, then aligns the next one onto the first pair • Most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments • Uses alignment scores to produce a guide tree • Aligns the sequences sequentially, guided by the relationships indicated by the tree • If the order is wrong and distantly related sequences are merged too soon, errors in the alignment may occur and propagate • Gap penalties can be adjusted based on the specific sequences

  37. CLUSTALW • http://www.ebi.ac.uk/clustalw/ • Perform pairwise alignments of all sequences • Use alignment scores to produce a guide tree • Align sequences sequentially, guided by the tree • Enhanced Dynamic Programming used to align sequences • Genetic distance determined by number of mismatches divided by number of matches • Gaps are added to an existing profile in progressive methods • CLUSTALW incorporates a statistical model in order to place gaps where they are most likely to occur

  38. ClustalW MSA Procedure [Figure: all pairwise alignments, similarity matrix, cluster analysis, dendrogram] From Higgins (1991) and Thompson (1994).
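The cluster-analysis step that turns a pairwise distance matrix into a guide tree can be sketched in Python. UPGMA (average linkage) is used here as a simple stand-in because it is short to write; ClustalW itself builds its guide tree with the neighbor-joining method. The function name, the four sequence labels, and the distance values are all hypothetical.

```python
from itertools import combinations


def upgma_guide_tree(names: list[str], distances: list[list[float]]):
    """Build a guide tree (nested tuples) from a symmetric distance matrix using UPGMA."""
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {(i, j): distances[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances: A and B are close, C and D are close.
labels = ["A", "B", "C", "D"]
matrix = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.4, 0.5],
    [0.4, 0.4, 0.0, 0.2],
    [0.5, 0.5, 0.2, 0.0],
]
print(upgma_guide_tree(labels, matrix))
```

A progressive aligner would then walk such a tree from the leaves inward, aligning the closest pair first and merging the resulting groups in the order the tree dictates.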

  39. Three Protein Sequences >sp|P25454|RAD51_YEAST DNA repair protein RAD51 OS=Saccharomyces cerevisiae GN=RAD51 PE=1 SV=1 MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNGSGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLRESGLHTAEAVAYAPRKDLLEIKGISEAKADKLLNEAARLVPMGFVTAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLCHTLAVTCQIPLDIGGGEGKCLYIDTEGTFRPVRLVSIAQRFGLDPDDALNNVAYARAYNADHQLRLLDAAAQMMSESRFSLIVVDSVMALYRTDFSGRGELSARQMHLAKFMRALQRLADQFGVAVVVTNQVVAQVDGGMAFNPDPKKPIGGNIMAHSSTTRLGFKKGKGCQRLCKVVDSPCLPEAECVFAIYEDGVGDPREEDE >sp|P25453|DMC1_YEAST Meiotic recombination protein DMC1 OS=Saccharomyces cerevisiae GN=DMC1 PE=1 SV=1 MSVTGTEIDSDTAKNILSVDELQNYGINASDLQKLKSGGIYTVNTVLSTTRRHLCKIKGLSEVKVEKIKEAAGKIIQVGFIPATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMSHTLCVTTQLPREMGGGEGKVAYIDTEGTFRPERIKQIAEGYELDPESCLANVSYARALNSEHQMELVEQLGEELSSGDYRLIVVDSIMANFRVDYCGRGELSERQQKLNQHLFKLNRLAEEFNVAVFLTNQVQSDPGASALFASADGRKPIGGHVLAHASATRILLRKGRGDERVAKLQDSPDMPEKECVYVIGEKGITDSSD >sp|P48295|RECA_STRVL Protein recA OS=Streptomyces violaceus GN=recA PE=3 SV=1 MAGTDREKALDAALAQIERQFGKGAVMRMGDRTQEPIEVISTGSTALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFVDAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDMLVRSGALDLIVIDSVAALVPRAEIEGEMGDSHVGLQARLMSQALRKITSALNQSKTTAIFINQLREKIGVMFGSPETTTGGRALKFYASVRLDIRRIETLKDGTDAVGNRTRVKVVKNKVAPPFKQAEFDILYGQGISREGGLIDMGVEHGFVRKAGAWYTYEGDQLGQGKENARNFLKDNPDLADEIERKIKEKLGVGVRPDAAKAEAATDAAAADTAGTDDAAKSVPAPASKTAKATKATAVKS

  40. An Alignment from ClustalW sp|P25454|RAD51_YEAST MSQVQEQHISESQLQYGNGSLMSTVPADLSQSVVDGNGNGSSEDIEATNG 50 sp|P25453|DMC1_YEAST ---------------------MSVTGTEIDSDTAKN-------------- 15 sp|P48295|RECA_STRVL ------------MAGTDREKALDAALAQIERQFGKG-------------- 24 :... :::. . .. sp|P25454|RAD51_YEAST SGDGGGLQEQAEAQGEMEDEAYDEAALGSFVPIEKLQVNGITMADVKKLR 100 sp|P25453|DMC1_YEAST -----------------------------ILSVDELQNYGINASDLQKLK 36 sp|P48295|RECA_STRVL -------------------------------AVMRMGDRTQEPIEVISTG 43 .: .: :: . sp|P25454|RAD51_YEAST ESGLHTAEAVAYAPRKDLLEIKG-ISEAKADKLLNEAARLVPMG----FV 145 sp|P25453|DMC1_YEAST SGGIYTVNTVLSTTRRHLCKIKG-LSEVKVEKIKEAAGKIIQVG----FI 81 sp|P48295|RECA_STRVL STALDIALGVGGLPRGRVVEIYGPESSGKTTLTLHAVANAQKAGGQVAFV 93 . .: . * .* : :* * *. *. . ... * *: sp|P25454|RAD51_YEAST TAADFHMRRSELICLTTGSKNLDTLLGGGVETGSITELFGEFRTGKSQLC 195 sp|P25453|DMC1_YEAST PATVQLDIRQRVYSLSTGSKQLDSILGGGIMTMSITEVFGEFRCGKTQMS 131 sp|P48295|RECA_STRVL DAEHALDPEYAKKLGVDIDNLILSQPDNGEQALEIVDML--VRSGALDLI 141 * . .: : : ..* : .*.::: .* * ::

  41. Phylogenetic Analysis

  42. Evolution • At the molecular level, evolution is a process of mutation with selection. • Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life. • Phylogeny is the inference of evolutionary relationships. • Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.

  43. Phylogenetic Trees • Phylogenetic trees are trees that describe the “relations” among species (genes, sequences) • Evolutionary relationships are shown as branches • Sequences most closely related drawn as neighboring branches • Length and nesting reflects degree of similarity between any two items (sequences, species, etc.) • Objective of Phylogenetic Analysis: determine branch length and figure out how the tree should be drawn • Dependent upon good multiple sequence alignment programs • Group sequences with similar patterns of substitutions

  44. Uses of Phylogenetic Analysis • Phylogeny can answer questions such as: • How many genes are related to the gene I am working on? • Are humans really closest to chimps and gorillas? • How related are chicken, dog, mouse to zebrafish? • Where and when did HIV originate? • What is the history of life on earth? • Given a set of genes, determine genes likely to have equivalent functions • Follow changes occurring in a rapidly changing species • Example: influenza • Study rapidly changing genes in influenza genome, predict next year’s strain and develop flu vaccination accordingly

  45. Difficulties With Phylogenetic Analysis • Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events • Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically • Two sites within comparative sequences may be evolving at different rates • Rearrangements of genetic material can lead to false conclusions • Duplicated genes can evolve along separate pathways, leading to different functions

  46. Rooted Trees • One sequence (root) defined to be common ancestor of all other sequences • Root chosen as a sequence thought to have branched off earliest • A rooted tree specifies evolutionary path for each sequence • A tree can be rooted using an outgroup (that is, a sequence known to be distantly related to all other sequences). [Figure: rooted tree drawn along a time axis from past to present, with nodes labeled 1 to 9.] Source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

  47. Unrooted Tree • Indicates evolutionary relationship without revealing location of oldest ancestry [Figure: unrooted tree with nodes labeled 1 to 8.]

  48. http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

  49. 4 Steps of Phylogenetic Analysis • Molecular phylogenetic analysis may be described in four steps: • Selection of sequences for analysis • Multiple sequence alignment • Tree building • Tree evaluation

  50. Selection of Sequences (1/2) • For phylogeny, DNA can be more informative. • Protein-coding sequences have synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes. • Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions. • Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions. • Pseudogenes and noncoding regions may be analyzed using DNA.
