Learn how to quantify sequence similarity, discover functions, and identify disease causes through pairwise sequence alignment. Understand the importance of aligning sequences, detecting changes, and using dynamic programming for solutions.
Lecture 2 Pairwise Sequence Alignment
WHAT? • Given any two sequences (DNA or protein)
Seq 1: CATATTGCAGTGGTCCCGCGTCAGGCT
Seq 2: TAAATTGCGTGGTCGCACTGCACGCT
we want to know to what extent they are similar:
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
Discover function • Study evolution • Find crucial features within a sequence • Identify cause of diseases
Discover function Sequences that are similar probably have the same function
Find crucial features in the genome • Regions in the sequences that are strongly conserved between different sequences can indicate their functional importance [Figure: conservation scale, High to Low]
Identify cause of disease • Comparison of sequences between individuals can detect changes that are related to diseases
Sickle Cell Anemia • Due to a single substitution of an A by a T, which changes the encoded amino acid from glutamic acid to valine in hemoglobin Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
Healthy Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
Diseased Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GGTGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH
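The comparison between the healthy and diseased beta globin proteins above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `find_differences` is hypothetical, and only the first 30 residues of each protein sequence shown on the slides are used. It assumes the two sequences have equal length and no indels, so a plain position-by-position comparison is enough.

```python
def find_differences(reference, sample):
    """Return (1-based position, reference residue, sample residue) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("position-wise comparison needs sequences of equal length")
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


# First 30 residues of the beta globin protein sequences shown on the slides.
healthy_protein = "MVHLTPEEKSAVTALWGKVNVDEVGGEALG"
diseased_protein = "MVHLTPVEKSAVTALWGKVNVDEVGGEALG"

for position, ref, alt in find_differences(healthy_protein, diseased_protein):
    print(f"position {position}: {ref} -> {alt}")
```

For these two excerpts the only difference is at position 7, where E (glutamic acid) is replaced by V (valine).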
Sequence Modifications • Three types of changes, starting from TCAGT • Substitution (point mutation): TCCGT • Insertion: TCGAGT • Deletion: TCGT • Insertions and deletions are together called indels (replication slippage)
In order to align two sequences we need a quantitative model to evaluate the similarity between them. How do we quantify sequence similarity? For example: A and A, score = 2; A and T, score = -1
Substitutions Only Model (not including indels) • Sequences compared base-by-base • Count the number of matches and mismatches • For example: matches score +2, mismatches score -1
TTCGTCGTAGTCGGCTCGACCTG
GTACGTCTAGCGAGCGTGATCCT
9 matches: +18 • 14 mismatches: -14 • Total score: +4 (a weak match)
Including Indels • Create an 'alignment' • Count matches within the alignment • Indels are scored as mismatches: -1
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
17 matches: +34 • 2 mismatches: -2 • 8 indels: -8 • Total score: +24 (a strong match)
Choosing an Alignment • Many different alignments are possible • Should consider all possible alignments • Take the best score found • There may be more than one best alignment
TT-CGTCGTAGTCG-GC-TCGACC-TG
GTACGTC-TAG-CGAGCGT-GATCCT-
Score: +24
-TTCGT-CGTAGTC-GGCTCG-ACCTG
GTAC-GTCTA-GCGAGCGT-GATCC-T
Score: 0
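The scoring scheme used on these slides (match +2, mismatch -1, indel -1) can be written as a short Python function. This is a sketch for illustration: the name `score_alignment` and the constant names are not from the lecture, and the function assumes both rows have already been padded with `-` to the same length.

```python
MATCH, MISMATCH, INDEL = 2, -1, -1  # scores taken from the slides


def score_alignment(row1, row2):
    """Score two aligned rows column by column; '-' marks an indel."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += INDEL
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score


# Without indels (substitutions-only model)
ungapped = score_alignment("TTCGTCGTAGTCGGCTCGACCTG",
                           "GTACGTCTAGCGAGCGTGATCCT")
# Two different gapped alignments of the same pair of sequences
gapped_a = score_alignment("TT-CGTCGTAGTCG-GC-TCGACC-TG",
                           "GTACGTC-TAG-CGAGCGT-GATCCT-")
gapped_b = score_alignment("-TTCGT-CGTAGTC-GGCTCG-ACCTG",
                           "GTAC-GTCTA-GCGAGCGT-GATCC-T")
print(ungapped, gapped_a, gapped_b)
```

The three calls reproduce the slide totals of +4, +24 and 0, which shows how strongly the score depends on where the indels are placed.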
Why is it hard? Alignment requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length, n².
Dynamic Programming • A method for reducing a complex problem to a set of identical sub-problems • The best solution to one sub-problem is independent from the best solution to the other sub-problem
What does it mean? If a path from X→Z passes through Y, the best path from X→Y is independent of the best path from Y→Z
Sequence Global Alignment: Needleman-Wunsch
Sequences: A = ACGCTG, B = CATGT
         A   C   G   C   T   G
         1   2   3   4   5   6
C   1
A   2
T   3
G   4
T   5
Example • Sequences: A = ACGCTG, B = CATGT • Match: +2, Other: -1
Score of best alignment between AC and CATG: -1
Score of best alignment between ACG and CATG: 2
Score of best alignment between AC and CATGT: -2
Calculate score between ACG and CATGT: ?
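Example • Three ways to reach the cell for position 3 of A and position 5 of B • Align the next letter in the sequences: 3 against 5 • Insertion in the first sequence (del): - against 5 • Insertion in the second sequence: 3 against -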
Example • Sequences: A = ACGCTG, B = CATGT
• -1 from before plus -1 for mismatch of G against T: -2
• 2 from before plus -1 for mismatch of - against T: 1
• -2 from before plus -1 for mismatch of G against -: -3
• Cell gets highest score of -2, 1, -3: 1
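The cell update described above takes the best of three candidates, each built from a neighbouring cell. A minimal Python sketch of that single step follows. The function name `fill_cell` and its parameter names are illustrative assumptions, and the scores (match +2, anything else -1) are those of the example.

```python
def fill_cell(diagonal, above, left, a, b, match=2, other=-1):
    """Best score for one cell, given its three already-filled neighbours.

    diagonal: score before aligning letter a with letter b
    above:    score before aligning a gap with letter b
    left:     score before aligning letter a with a gap
    """
    pair_score = match if a == b else other
    return max(diagonal + pair_score,  # a against b
               above + other,          # - against b
               left + other)           # a against -


# Cell for ACG versus CATGT: neighbours -1 (diagonal), 2 (above), -2 (left)
cell = fill_cell(diagonal=-1, above=2, left=-2, a="G", b="T")
print(cell)
```

With the neighbour values from the slide the candidates are -2, 1 and -3, so the cell receives 1.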
Example • Sequences: A = ACGCTG, B = CATGT • Neighbouring cell scores: -1 (diagonal), 2 (above), -2 (left)
A
-

ACGCTG
------

-----
CATGT

A
C

AC
-C

ACG
-C-

ACGC
---C

ACGC
-C--

ACG
-CA

ACGCTG-
-C-ATGT

ACGCTG-
-CA-TGT

-ACGCTG
CATG-T-
Needleman-Wunsch Global Alignment • Compare entire sequence against another • Global alignment score is bottom right cell
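Putting the pieces together, the whole table can be filled row by row and one best alignment read back from the bottom right cell. The following Python sketch makes several assumptions that are not from the lecture: the function name `needleman_wunsch`, the same score of -1 for mismatches and indels (as in the example), and an arbitrary tie-breaking order in the traceback (diagonal first, then up, then left). When several alignments share the best score, only one of them is returned.

```python
def needleman_wunsch(a, b, match=2, other=-1):
    """Global alignment of a (columns) and b (rows); returns (score, row_a, row_b)."""
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(cols):           # first row: a aligned against gaps only
        F[0][j] = j * other
    for i in range(rows):           # first column: b aligned against gaps only
        F[i][0] = i * other
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[j - 1] == b[i - 1] else other
            F[i][j] = max(F[i - 1][j - 1] + pair,   # letter against letter
                          F[i - 1][j] + other,      # gap in a
                          F[i][j - 1] + other)      # gap in b

    # Traceback from the bottom right cell to the top left cell.
    row_a, row_b = [], []
    i, j = len(b), len(a)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[j - 1] == b[i - 1] else other
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair:
            row_a.append(a[j - 1]); row_b.append(b[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + other:
            row_a.append("-"); row_b.append(b[i - 1]); i -= 1
        else:
            row_a.append(a[j - 1]); row_b.append("-"); j -= 1
    return F[-1][-1], "".join(reversed(row_a)), "".join(reversed(row_b))


score, aligned_a, aligned_b = needleman_wunsch("ACGCTG", "CATGT")
print(score)
print(aligned_a)
print(aligned_b)
```

For the example sequences A = ACGCTG and B = CATGT the bottom right cell holds a score of 2. The two nested loops visit every cell once, which is where the roughly n² comparisons come from.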
Example: aligning DorothyHodgkin with DorothyCrowfootHodgkin
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
DOROTHY        HODGKIN
DOROTHY        HODGKIN
Local Alignment: Smith-Waterman • Best score for aligning part of sequences • Often beats global alignment score
Global Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
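A local alignment score can be computed with almost the same table as the global one. In this hedged Python sketch the function name `smith_waterman_score` is illustrative, and the scoring (match +2, anything else -1) is carried over from the earlier example as an assumption. Two things change relative to the global method: a cell is never allowed to drop below 0, so an alignment can start anywhere, and the result is the highest value anywhere in the table rather than the bottom right cell.

```python
def smith_waterman_score(a, b, match=2, other=-1):
    """Best local alignment score between any part of a and any part of b."""
    H = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    best = 0
    for i in range(1, len(b) + 1):
        for j in range(1, len(a) + 1):
            pair = match if a[j - 1] == b[i - 1] else other
            H[i][j] = max(0,                        # start a new local alignment here
                          H[i - 1][j - 1] + pair,   # letter against letter
                          H[i - 1][j] + other,      # gap in a
                          H[i][j - 1] + other)      # gap in b
            best = max(best, H[i][j])
    return best


# The shared ACGT core is found even though the flanking regions differ.
print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

Only the score is returned here. Recovering the aligned region would need a traceback that starts at the best cell and stops at the first 0.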
Global vs. Local alignment • Alignment of two genomic sequences
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Mouse DNA
CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global vs. Local alignment • Alignment of two genomic sequences
Global Alignment
Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***
Local Alignment
Human:CATGCGACTGAC
Mouse:CATGCGTCTGAC
Human:ATCGATCATA
Mouse:ATCGAT-ATA
Global vs. Local alignment • Alignment of genomic DNA and mRNA
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Global vs. Local alignment • Alignment of genomic DNA and mRNA
Global Alignment
DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********
Local Alignment
DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC
DNA: ATCGATCATA
mRNA:ATCGATCATA