Pairwise Sequence Alignment Exercise 2


Motivation ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACGTGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAGGAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACTGATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCAGAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAGGTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACAACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGTCATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGCATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTTTCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACAATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTTTCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTACTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAGGGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGGTTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAACAAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGTCTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAAGGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTACCATTGA || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…

What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE

Why sequence alignment?
Predict characteristics of a protein. Premise: similar sequence (or structure) implies similar function.

Local vs. Global
• Global alignment – finds the best alignment across the whole of the two sequences.
• Local alignment – finds regions of high similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment concentrates on regions of high similarity:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
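
The difference between the two modes can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not the method used by any particular tool: the function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example. Global scoring charges for every position of both sequences, while local scoring resets to zero whenever the running score would go negative and reports the best-scoring region.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the best global (default) or local alignment score of a and b.

    Hypothetical scoring values; real tools use substitution matrices.
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                # Local alignment: never carry a negative score forward.
                cell = max(cell, 0)
            dp[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning both sequences end to end.
    # Local: best score found anywhere in the table.
    return best if local else dp[-1][-1]


seq1 = "ADLGAVFALCDRYFQ"
seq2 = "ADLGRTQNCDRYYQ"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

With these scoring values the local score can never be lower than the global score for the same pair, because the local search may simply ignore the regions that differ.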

Evolutionary changes in sequences
Three types of changes:
• Substitution – a replacement of one (or more) sequence letters by another, e.g. AAGA → AACA
• Insertion – an insertion of one or several letters into the sequence (in the slide's example, a T is inserted into AAGA)
• Deletion – a deletion of one (or more) letters from the sequence
Insertion + Deletion = Indel

Choosing an alignment:
• Many different alignments are possible for the same two sequences:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Which alignment is better?

Exercise: compute both alignment scores
• Match: +1
• Mismatch: -2
• Indel: -1
Alignment 1:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Alignment 2:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
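
One way to check a hand calculation is to score each column of the alignment in turn. The sketch below uses the exercise's scoring scheme (match +1, mismatch -2, indel -1); the function name `score_alignment` is an assumption made for illustration and does not come from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-2, indel=-1):
    """Sum column scores of an already-aligned pair of sequences.

    '-' marks a gap (indel). Both rows must have the same length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")

    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel      # a letter aligned to a gap
        elif x == y:
            score += match      # identical letters
        else:
            score += mismatch   # different letters
    return score


first = score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print("alignment 1:", first)
print("alignment 2:", second)
```

Under this scheme the alignment with the higher total is the preferred one; a different choice of penalties can change the ranking.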

Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of nucleotide sequences? (Tr = transition, Tv = transversion)
• Tr > Tv > 0
• Tr < Tv < 0
• 0 > Tr > Tv
• 0 > Tv > Tr

Scoring systems: accounting for biological context
Which is true about the scores in a pairwise alignment of amino-acid sequences?
• Asp->Asn > Asp->Glu
• Arg->His > Ala->Phe
• Arg->His < Thr->Met

Substitution matrices
• Nucleic acids:
  • Transition-transversion
• Amino acids:
  • Based on evolutionary (empirical) data: PAM, BLOSUM
  • Based on physico-chemical properties: Grantham, McLachlan
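
For nucleotides, a transition-transversion scheme can be sketched as follows. Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) changes; every other substitution is a transversion. The function names and the numeric scores below are hypothetical, chosen only so that transitions, which are observed more often, are penalised less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of nucleotides as match, transition or transversion."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned column; the default values are illustrative only."""
    kind = substitution_type(x, y)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]


print(substitution_type("A", "G"), nucleotide_score("A", "G"))
print(substitution_type("A", "C"), nucleotide_score("A", "C"))
```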

PAM matrices
• A family of matrices: PAM80, PAM120, PAM250
• The number attached to a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based
• Greater numbers denote greater distances

PAM - limitations
• Based on only one original dataset
• Examines proteins with few differences (85% identity)
• Based mainly on small globular proteins, so the matrix is biased

BLOSUM matrices
• Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)
• BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity
• BLOSUM62 represents closer sequences than BLOSUM45

Substitution matrices exercise
Pick the best substitution matrix (one PAM and one BLOSUM) for each pairwise alignment:
• Human – chimp
• Human – yeast
• Human – fish
PAM options: PAM60, PAM120, PAM250
BLOSUM options: BLOSUM45, BLOSUM62, BLOSUM80

PAM vs. BLOSUM
Roughly comparable matrices, ordered from closer to more distant sequences:
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45
• BLOSUM62 for general use
• BLOSUM80 for close relations
• BLOSUM45 for distant relations
• PAM120 for general use
• PAM60 for close relations
• PAM250 for distant relations

Gap penalty
Alignment 1:
AAGCGAAATTCGAAC
A-G-GAA-CTCGAAC
Alignment 2:
AAGCGAAATTCGAAC
AGG---AACTCGAAC
• Which alignment is more likely?
• Which alignment has a higher score?
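
The two questions can have different answers depending on how gaps are charged. A common approach is an affine gap penalty, where opening a gap costs more than extending one, so a single long gap is cheaper than several scattered gaps. The sketch below is only an illustration: the function name `affine_score` and all numeric values (match +1, mismatch -1, gap open -3, gap extend -1) are assumptions, not values from the slides.

```python
def affine_score(top, bottom, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score an existing alignment with an affine gap penalty.

    The first column of a gap run costs gap_open; each further column
    of the same run costs gap_extend. Setting both to the same value
    gives a simple linear gap penalty.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")

    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            score += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


scattered = ("AAGCGAAATTCGAAC", "A-G-GAA-CTCGAAC")
single = ("AAGCGAAATTCGAAC", "AGG---AACTCGAAC")

print("linear:", affine_score(*scattered, gap_open=-1, gap_extend=-1),
      affine_score(*single, gap_open=-1, gap_extend=-1))
print("affine:", affine_score(*scattered), affine_score(*single))
```

With the linear penalty the alignment with three scattered gaps scores higher; with the affine penalty the alignment with one three-letter gap scores higher, which matches the intuition that a single indel event is more likely than three independent ones.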

Web servers for pairwise alignment

BLAST 2 sequences (bl2seq) at NCBI
Produces a local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
• Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2seq - query
• blastn – nucleotide
• blastp – protein

Bl2seq results

Bl2seq results (annotated output; labels: Dissimilarity, Low complexity, Gaps, Similarity, Match)

BLAST – programs
Query: DNA or protein
Database: DNA or protein

BLAST – Blastp

Blastp - results

Blastp – results (cont.)

BLAST scores:
• Bit score – a normalized score for the alignment, reflecting its identities, similarities, mismatches and gaps.
• Expect value (E-value) – the number of alignments with a score at least this high that one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
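
The relation between the two scores can be sketched with the standard Karlin-Altschul formulas: a raw score S is converted to a bit score S' = (λS − ln K) / ln 2, and the expected number of chance hits is E = m · n · 2^(−S'), where m and n are the query and database lengths. The function names below are made up for the example, and λ and K are statistical parameters that depend on the scoring system; they are passed in as arguments rather than assumed. BLAST itself uses effective (edge-corrected) lengths, which this sketch ignores.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score.

    lam and k are the Karlin-Altschul parameters of the scoring system
    (supplied by the caller; no particular values are assumed here).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Each extra bit halves the E-value; a larger database raises it.
print(expect_value(40, query_len=150, db_len=1_000_000))
print(expect_value(41, query_len=150, db_len=1_000_000))
print(expect_value(40, query_len=150, db_len=10_000_000))
```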

Blastp – acquiring sequences

Blastp – acquiring sequences

FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
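
A multi-sequence FASTA file is simple to read programmatically: every line starting with ">" opens a new record, and the lines that follow, up to the next ">", are pieces of that record's sequence. The parser below is a minimal sketch with a hypothetical function name (`parse_fasta`); established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first toy record
MVHLTPEEK
TAVNALWGK
>seq2 second toy record
MGHFTEEDK
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```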

Searching for remote homologs
• Sometimes BLAST isn't enough
• With a large protein family, BLAST only finds close members, but we want more distant members as well
• Solution: PSI-BLAST

PSI-BLAST
• Position-Specific Iterated BLAST
1. Regular BLAST search
2. Construct a profile from the BLAST results
3. Search the database with the profile
4. Final results
Steps 2 and 3 are repeated in each iteration.
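
The "construct a profile" step can be illustrated with a toy position-specific scoring matrix. This is a simplified sketch, not PSI-BLAST's actual procedure: it assumes ungapped, equal-length aligned hits, a uniform background frequency of 1/20 for every amino acid, and a fixed pseudocount, none of which come from the slides. The function names are likewise hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_against_profile(seq, profile):
    """Sum the per-position scores of seq (standard amino acids only)."""
    return sum(column[aa] for aa, column in zip(seq, profile))


# Toy input: the first five residues of three globin sequences.
hits = ["MVHLT", "MVNLT", "MGHFT"]
profile = build_profile(hits)
print(score_against_profile("MVHLT", profile))
print(score_against_profile("AAAAA", profile))
```

Unlike a single substitution matrix, the profile scores the same amino acid differently at different positions, which is what lets later iterations pick up more distant family members.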

PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends
• Disadvantage: if a WRONG hit is included, the search is led to unrelated sequences (contamination), and this gets worse with each iteration

PSI-BLAST
Which one(s) of the following is/are correct?
• PSI-BLAST is expected to give more hits than BLAST
• PSI-BLAST is an iterative search method
• PSI-BLAST is faster than BLAST
• Each iteration of PSI-BLAST can only improve the results of the previous iteration

BLAST – PSI-Blast

PSI-Blast - results
