Pairwise Sequence Alignment Exercise 2
Motivation
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA

MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
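The match line under an alignment can be generated mechanically. The following is a minimal sketch, assuming the two sequences are already aligned, non-empty, and of equal length; the function names `match_line` and `percent_identity` are illustrative and not part of the original exercise.

```python
def match_line(top: str, bottom: str) -> str:
    """Return '|' under every identical column and a space elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if x == y and x != "-" else " " for x, y in zip(top, bottom)
    )


def percent_identity(top: str, bottom: str) -> float:
    """Share of alignment columns that hold identical characters."""
    line = match_line(top, bottom)
    return 100.0 * line.count("|") / len(line)


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(top)
print(match_line(top, bottom))
print(bottom)
print(f"{percent_identity(top, bottom):.1f}% identity")
```

Identity is only the simplest measure; it treats every non-identical pair as equally different, which is what substitution matrices refine.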
Why sequence alignment?
Predict characteristics of a protein – premised on: similar sequence (or structure) → similar function
Local vs. Global
• Global alignment – finds the best alignment across the whole of the two sequences.
• Local alignment – finds regions of high similarity in parts of the sequences.
Global alignment: forces alignment in regions which differ
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment concentrates on regions of high similarity
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
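The difference between the two modes shows up directly in the dynamic-programming recurrence. Below is a minimal sketch of a score-only aligner, assuming a linear gap penalty and illustrative scores (match +1, mismatch -1, gap -1) that are not taken from the slides. With `local=False` it follows the Needleman-Wunsch scheme; with `local=True` it follows the Smith-Waterman scheme, where a cell may never drop below zero and the best cell anywhere is reported.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b (global by default, local on request)."""
    # Row 0: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)       # a local alignment may restart anywhere
                best = max(best, cell)    # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", align_score(s1, s2))
print("local: ", align_score(s1, s2, local=True))
```

The local score can never be lower than the global one for the same scoring scheme, because the local search is free to drop the poorly matching flanks.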
Evolutionary changes in sequences
Three types of changes:
• Substitution – a replacement of one (or more) sequence letter by another: AAGA → AACA
• Insertion – an insertion of a letter or several letters into the sequence: AAGA → AAGTA
• Deletion – deleting a letter (or more) from the sequence: AAGA → AGA
Insertion + Deletion = Indel
Choosing an alignment:
• Many different alignments are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?
Exercise: compute both alignment scores
• Match: +1
• Mismatch: -2
• Indel: -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
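One way to check a hand calculation is a column-by-column scorer. This is a minimal sketch using the scores stated in the exercise (match +1, mismatch -2, indel -1); the function name `score_alignment` is illustrative, and each gap column is charged independently.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Sum the per-column scores of an already aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel        # gap in either row
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)
```

Under this scheme the first alignment (10 matches, 2 mismatches, 4 indels) scores 2 and the second (9 matches, 2 mismatches, 6 indels) scores -1, so the first is preferred.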
Scoring systems: accounting for biological context
• Which is true about the scores in a pairwise alignment of nucleotide sequences?
• Tr > Tv > 0
• Tr < Tv < 0
• 0 > Tr > Tv
• 0 > Tv > Tr
Tr = Transition (purine ↔ purine or pyrimidine ↔ pyrimidine)
Tv = Transversion (purine ↔ pyrimidine)
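A transition-transversion scheme can be written as a small scoring function. The sketch below assumes DNA letters only and uses illustrative values (match +1, transition -1, transversion -2) that are not given in the slides; they encode the usual convention that both kinds of mismatch are penalized, with the more frequently observed transitions penalized less.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify an aligned nucleotide pair."""
    x, y = x.upper(), y.upper()
    if not {x, y} <= PURINES | PYRIMIDINES:
        raise ValueError("expected nucleotides A, C, G or T")
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one column; the default values are hypothetical."""
    kind = substitution_type(x, y)
    return {"match": match, "transition": transition,
            "transversion": transversion}[kind]


for pair in ("AA", "AG", "CT", "AC", "GT"):
    print(pair, substitution_type(*pair), nucleotide_score(*pair))
```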
Scoring systems: accounting for biological context • Which is true about the scores in a pairwise alignment of amino-acid sequences? • Asp->Asn > Asp->Glu • Arg->His > Ala->Phe • Arg->His < Thr->Met
Substitution matrices
• Nucleic acids:
  • Transition-transversion
• Amino acids:
  • Based on evolutionary (empirical) data: PAM, BLOSUM
  • Based on physico-chemical properties: Grantham, McLachlan
PAM matrices • Family of matrices PAM 80, PAM 120, PAM 250 • The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based • Greater numbers denote greater distances
PAM - limitations • Based on only one original dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins so the matrix is biased
BLOSUM matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)
• For BLOSUMn, sequences within a block that share at least n percent identity are clustered and counted as one, so the matrix reflects substitutions between sequences that are less than n percent identical
• BLOSUM62 represents closer sequences than BLOSUM45
Substitution matrices exercise • Pick the best substitution matrix (PAM and BLOSUM) for each pairwise alignment: • Human – chimp • Human - yeast • Human – fish PAM options: PAM60 PAM120 PAM250 BLOSUM options: BLOSUM45 BLOSUM62 BLOSUM80
PAM Vs. BLOSUM
Approximate correspondence (more distant sequences toward the bottom):
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations
• PAM120 for general use
• PAM60 for close relations
• PAM250 for distant relations
Gap penalty
AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC

AAGCGAAATTCGAAC
AGG---AACTCGAAC
• Which alignment is more likely?
• Which alignment has a higher score?
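The answer to the second question depends on how gaps are charged. The sketch below compares a linear gap penalty with an affine one (a larger cost to open a gap, a smaller cost to extend it). All numeric values are illustrative assumptions, not taken from the slides, and the function name `score_with_gaps` is hypothetical.

```python
def score_with_gaps(top, bottom, match=1, mismatch=-2,
                    gap_open=-5, gap_extend=-1):
    """Score an aligned pair; the first column of a gap run costs gap_open,
    every further column of the same run costs gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which row held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            score += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")  # three gaps of length 1
single = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")     # one gap of length 3

# Linear penalty: opening and extending cost the same.
print(score_with_gaps(*scattered, gap_open=-1), score_with_gaps(*single, gap_open=-1))
# Affine penalty: opening is expensive, extending is cheap.
print(score_with_gaps(*scattered), score_with_gaps(*single))
```

With these values the linear scheme favors the alignment with three scattered gaps (6 versus 3), while the affine scheme favors the single three-letter gap (-1 versus -6), which matches the intuition that one indel event is more likely than three independent ones.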
BLAST 2 sequences (bl2Seq) at NCBI
• Produces a local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
• Does not use an exact algorithm but a heuristic
Bl2Seq - query
• blastn – nucleotide
• blastp – protein
Bl2seq results
[Figure: Bl2seq alignment output annotated with the labels Match, Similarity, Dissimilarity, Gaps and Low complexity]
BLAST – programs Query: DNA Protein Database: DNA Protein
Blast scores:
• Bit score – A normalized score for the alignment, derived from the raw score (identities, similarities, gaps) so that it can be compared across searches.
• Expect value (E-value) – The number of alignments with a score at least this high that one can “expect” to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
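The two scores are linked by the Karlin-Altschul statistics that BLAST reports alongside each search. The following is a minimal sketch, assuming a raw score S, a query of length m, a database of total length n, and the statistical parameters λ and K; the numeric values for λ and K below are illustrative only, since the real ones depend on the scoring matrix and gap costs and are printed in the BLAST output.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score:
    E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


LAM, K = 0.267, 0.041  # illustrative parameter values, not from the slides
for raw in (40, 80, 160):
    print(raw, round(bit_score(raw, LAM, K), 1),
          e_value(raw, query_len=147, db_len=10_000_000, lam=LAM, k=K))
```

Written in terms of the bit score, the same quantity is E = m · n · 2^(-S'), which is why a larger database makes a given score less significant.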
Fasta format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
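A multi-sequence FASTA file is simple to read programmatically: every line starting with ">" opens a new record, and the following lines are joined into one sequence. This is a minimal sketch with a hypothetical helper `parse_fasta`; the two records in the usage example are toy data, not real entries.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


toy = """>seq1 toy record
MVHLT
PEEK
>seq2 toy record
MGHFT
"""
for name, seq in parse_fasta(toy):
    print(name, len(seq), seq)
```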
Searching for remote homologs • Sometimes BLAST isn’t enough • Large protein family, and BLAST only finds close members. We want more distant members • PSI-BLAST
PSI-BLAST
• Position Specific Iterated BLAST
Regular BLAST → construct profile from BLAST results → BLAST profile search → (repeat) → final results
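The "construct profile" step can be illustrated with a toy position-specific scoring matrix. The sketch below is not the PSI-BLAST implementation: it assumes ungapped, equal-length hits, a uniform background frequency, and a flat pseudocount, whereas the real program uses sequence weighting and more refined estimates. The names `build_profile` and `profile_score` are hypothetical; the input is the first ten residues of the five globins from the FASTA slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores (log2) from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Score a sequence of the same length against the profile."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq))


hits = ["MVHLTPEEKT", "MVHLTPEEKS", "MVHFTAEEKA", "MGHFTEEDKA", "MGHFTEEDKA"]
pssm = build_profile(hits)
print(profile_score(pssm, "MVHLTPEEKT"), profile_score(pssm, "MAHLSPEDKA"))
```

Because each column gets its own scores, a conserved position such as the leading M counts for more than a variable one, which is what lets the next search round reach more distant family members.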
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query and learns from them to extend the circle of friends
• Disadvantage: if a WRONG hit is included, the profile drifts toward unrelated sequences (contamination), and this gets worse with each iteration
PSI-BLAST Which one(s) of the following is/are correct? • PSI-BLAST is expected to give more hits than BLAST • PSI-BLAST is an iterative search method • PSI-BLAST is faster than BLAST • Each iteration of PSI-BLAST can only improve the results of the previous iteration