Lesson 3

Lesson 3 Aligning sequences and searching databases

Homology • Similarity between objects due to a common ancestry

Sequence homology • Similarity between sequences that results from a common ancestor
VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG
• Basic assumption: sequence homology → similar structure/function

Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

Homology • Ortholog: a homolog that arose via speciation (orthologs often retain a similar function) • Paralog: a homolog that arose from gene duplication • Typical case: orthologs are 2 homologs from different species; paralogs are 2 homologs within the same species

How close? • Rule of thumb: • Proteins are homologous if over 25% identical (length >100) • DNA sequences are homologous if over 70% identical

Twilight zone • < 20% identity in proteins: the sequences may or may not be homologous • (Note that about 5% identity is expected purely by chance!)
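The thresholds above can be applied with a small helper. The following is a minimal sketch, assuming percent identity is defined as identical columns divided by alignment length (other definitions exist) and that the length condition refers to the alignment length. The function names are illustrative.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns in which both rows hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if it holds the same non-gap character.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)


def passes_rule_of_thumb(row_a: str, row_b: str, is_protein: bool) -> bool:
    """Apply the slide's rule of thumb: >25% (length >100) for proteins, >70% for DNA."""
    identity = percent_identity(row_a, row_b)
    if is_protein:
        return identity > 25.0 and len(row_a) > 100
    return identity > 70.0


# The two short protein fragments from the "Sequence homology" slide.
a = "VLSPAVKWAKVGAHAAGHG"
b = "VLSEAVLWAKVEADVAGHG"
print(round(percent_identity(a, b), 1))
print(passes_rule_of_thumb(a, b, is_protein=True))
```

The fragments are highly identical but far shorter than 100 residues, so the rule of thumb alone does not support a homology call for them.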

Why sequence alignment? • Predict characteristics of a protein: use the structure/function of known proteins to predict the structure/function of an unknown protein

Sequence modifications • Sequences change in the course of evolution due to random mutations • Three types of mutations: • Insertion: an insertion of one or more nucleotides into the sequence. AAGA → AAGTA • Deletion: a deletion of one or more nucleotides from the sequence. AAGA → AGA • Substitution: a replacement of one nucleotide by another. AAGA → AACA • Insertion or deletion? → Indel
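The three mutation types can be reproduced as string operations. This is only an illustrative sketch; the function names and the 0-based positions are assumptions, not part of the lesson.

```python
def insert(seq: str, pos: int, bases: str) -> str:
    """Insert one or more bases before 0-based position pos."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Delete `length` bases starting at 0-based position pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


print(insert("AAGA", 3, "T"))      # AAGTA
print(delete("AAGA", 0))           # AGA
print(substitute("AAGA", 2, "C"))  # AACA
```

Given only the pair AAGA / AGA, one cannot tell whether a base was deleted from the first sequence or inserted into the second, which is why the neutral term "indel" is used.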

Local vs. Global • Global alignment: finds the best alignment across the entire length of the two sequences. • Local alignment: finds regions of similarity in parts of the sequences.
Global alignment forces an alignment even in regions that differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment returns only the regions that align well:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

When global and when local?

Global alignment • PTK2 protein tyrosine kinase 2 of human and rhesus monkey

Protein tyrosine kinase domain

Protein tyrosine kinase domain • Human PTK2 and leukocyte tyrosine kinase • Both function as tyrosine kinases, in completely different contexts • Ancient duplication

Global alignment of PTK and LTK X

Local alignment of PTK and LTK

Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes: • 2 mismatches • 4 indels (gaps) • 10 perfect matches

Choosing an alignment: • Many different alignments are possible:
AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Which alignment is better?

Alignment scoring - scoring of sequence similarity: • Assumes independence between positions • Each position is considered separately • Scores each position • Positive if identical (match) • Negative if different (mismatch) or gap (indel) • Total score = sum of position scores • Can be positive or negative

Example - naïve scoring system: • Perfect match: +1 • Mismatch: -2 • Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score = (+1)×10 + (-2)×2 + (-1)×4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score = (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
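The naïve scoring system can be written as a short function that sums independent column scores. This is a minimal sketch: the function name is illustrative, and the default values are the +1 / -2 / -1 scores from the slide.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -2, gap: int = -1) -> int:
    """Sum independent per-column scores for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # indel column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # different characters
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Because the score parameters are arguments, the same function can be used to see how a different scoring system changes which alignment ranks higher.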

Scoring system: • The choice of +1, -2, and -1 scores is quite arbitrary • Different scoring systems → different alignments • Scoring systems implicitly represent a particular theory of evolution • Some mismatches are more plausible than others • Transition vs. transversion • Lys→Arg ≠ Lys→Cys • Gap extension ≠ gap opening

Scoring matrix • Represents the scoring system as an n×n table or matrix (n is the number of letters in the alphabet: n = 4 for nucleotides, n = 20 for amino acids) • Symmetric

DNA scoring matrices • Uniform substitutions between all nucleotides: one score for every match and one score for every mismatch

DNA scoring matrices Can take into account biological phenomena such as: • Transition-transversion
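A DNA scoring matrix that distinguishes transitions from transversions might be built as follows. This is a sketch under stated assumptions: the score values (+1, -1, -2) are illustrative, not taken from the lesson, and the matrix is stored as a dictionary keyed by nucleotide pairs.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def build_dna_matrix(match: int = 1, transition: int = -1,
                     transversion: int = -2) -> dict:
    """Return a symmetric 4x4 scoring matrix as {(x, y): score}."""
    matrix = {}
    for x, y in product("ACGT", repeat=2):
        if x == y:
            matrix[x, y] = match
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            # purine <-> purine or pyrimidine <-> pyrimidine
            matrix[x, y] = transition
        else:
            # purine <-> pyrimidine
            matrix[x, y] = transversion
    return matrix


m = build_dna_matrix()
print(m["A", "G"], m["A", "C"])  # transition score, transversion score
```

Setting `transition` equal to `transversion` gives the uniform matrix, in which every mismatch receives the same score.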

Amino-acid scoring matrices • Take into account physico-chemical properties

Amino-acid substitution matrices • Actual substitutions: • Based on empirical data • Commonly used by many bioinformatics programs • PAM & BLOSUM

Protein matrices – actual substitutions • The idea: given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other
MGYDE
MGYDE
MGYEE
MGYDE
MGYQE
MGYDE
MGYEE
MGYEE
In the fourth column, E and D are found in 7 of the 8 sequences
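The counting step behind this idea can be sketched on the eight-sequence example. Counting unordered residue pairs within a column is one common way to tally substitutions; treating it as the method here is an assumption, and a real matrix would go on to convert such counts into log-odds scores, which this sketch does not do.

```python
from collections import Counter
from itertools import combinations

ALIGNMENT = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]


def column_counts(alignment, col):
    """Count how often each residue appears in one alignment column (0-based)."""
    return Counter(seq[col] for seq in alignment)


def pair_counts(alignment, col):
    """Count unordered residue pairs over all pairs of sequences in one column."""
    pairs = Counter()
    for a, b in combinations([seq[col] for seq in alignment], 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


print(column_counts(ALIGNMENT, 3))  # fourth column: D x4, E x3, Q x1
print(pair_counts(ALIGNMENT, 3))
```

In the fourth column the D/E pair is by far the most frequent mixed pair, which is the kind of signal that leads to a favorable score for that substitution.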

PAM Matrix - Point Accepted Mutations • Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity) • Alignment was easy • Counted the number of the substitutions per amino-acid pair (20 x 20) • Found that common substitutions occurred between chemically similar amino acids

PAM Matrices • Family of matrices PAM 80, PAM 120, PAM 250 • The number on the PAM matrix represents evolutionary distance • Larger numbers are for larger distances

Example: PAM 250 Similar amino acids have greater score

PAM - limitations • Based on a single, limited dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins, so the matrix is biased

BLOSUM • Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset • BLOSUM observes significantly more replacements than PAM, even for infrequent pairs

BLOSUM: BLOcks SUbstitution Matrix • Based on the BLOCKS database • ~2000 blocks from 500 families of related proteins • Families of proteins with identical function • Blocks are short conserved patterns of 3-60 aa without gaps
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC

BLOSUM • Each block represents a sequence alignment with different identity percentage • For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix

BLOSUM Matrices • BLOSUMn is based on sequences that share at least n percent identity • BLOSUM62 represents closer sequences than BLOSUM45

Example: BLOSUM62 • Derived from blocks in which the sequences share at least 62% identity

PAM vs. BLOSUM (approximate equivalents, ordered from closer to more distant sequences)
PAM100 ≈ BLOSUM90
PAM120 ≈ BLOSUM80
PAM160 ≈ BLOSUM60
PAM200 ≈ BLOSUM52
PAM250 ≈ BLOSUM45

Scoring system = substitution matrix + gap penalty

Gap penalty • We penalize gaps • Scoring for gap opening & gap extension: • Gap-extension penalty < gap-open penalty
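A gap penalty with separate opening and extension costs (an affine penalty) can be sketched as below. The values -10 and -1 are hypothetical, and the convention that the first gap position pays the opening cost while each further position pays the extension cost is an assumption; programs differ on this detail.

```python
def affine_gap_penalty(length: int, gap_open: int = -10,
                       gap_extend: int = -1) -> int:
    """Score contribution of one contiguous gap of the given length."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    # First position opens the gap; the remaining positions extend it.
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)  # -12 -30
```

With extension cheaper than opening, a single indel event spanning several positions is penalized far less than several independent one-position gaps.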

Optimal alignment algorithms • Needleman-Wunsch (global) • Smith-Waterman (local)
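Both algorithms fill a table in which each cell holds the best score for aligning two prefixes. The sketch below computes only the optimal score (no traceback) and, as simplifying assumptions, uses the naïve +1 / -2 / -1 scores with a linear gap penalty rather than a substitution matrix with separate opening and extension costs. The function names are illustrative.

```python
def global_score(a: str, b: str, match: int = 1,
                 mismatch: int = -2, gap: int = -1) -> int:
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1,
                mismatch: int = -2, gap: int = -1) -> int:
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("AAGCTGAATTCGAA", "AGGCTCATTTCTGA"))
print(local_score("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

The only differences between the two are the zero-initialized borders, the clamp at 0, and reading the answer from the best cell instead of the bottom-right corner.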

Alignment Search Space • The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n). • If n = 100, there are $100^{100} = 10^{200}$ (a 1 followed by 200 zeros) different alignments! • Average protein length is about n = 250!
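The size of that number is easy to check with Python's arbitrary-precision integers. The comparison with a table of (n+1)×(n+1) cells is added here only as a rough contrast with the O(mn) cost of exact pairwise alignment; it is a sketch, not a precise count of the work done by any particular program.

```python
n = 100

# Figure quoted for the number of candidate alignments.
brute_force = n ** n

# Cells in a table with one row and one column per prefix of each sequence.
table_cells = (n + 1) * (n + 1)

print(len(str(brute_force)) - 1)  # number of zeros after the leading 1
print(table_cells)
```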

Searching databases

Searching a sequence database • Using a sequence as a query to find homologous sequences in a sequence database

Query sequence: DNA or protein? • For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. • Which is preferable?

Protein is better! • Selection (and hence conservation) works (mostly) on the protein level: CTT TCA = Leu-Ser; TTG AGT = Leu-Ser
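The example can be checked by translating both DNA sequences and comparing them at each level. The codon table below is deliberately partial, holding only the four codons from the slide, and the function names are illustrative.

```python
# Partial codon table: only the codons used in this example.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna: str) -> str:
    """Translate a DNA string, codon by codon, using the partial table."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "-".join(CODON_TABLE[codon] for codon in codons)


def identical_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


seq1, seq2 = "CTTTCA", "TTGAGT"
print(translate(seq1), translate(seq2))  # Leu-Ser Leu-Ser
print(identical_positions(seq1, seq2))   # 1 (of 6 nucleotides)
```

The two DNA sequences agree at only one of six positions, yet they encode the same dipeptide, so a protein-level search would detect a similarity that a DNA-level search would likely miss.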

Query type • Nucleotides: a four letter alphabet • Amino acids: a twenty letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity
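The 25% and 5% figures follow from the alphabet sizes: with n equally likely letters, two random positions match with probability 1/n. The simulation below is a minimal sketch assuming uniform letter frequencies and an ungapped, position-by-position comparison; real sequences have skewed compositions, so the true background is somewhat different.

```python
import random

DNA = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_identity(alphabet: str, length: int, rng: random.Random) -> float:
    """Fraction of identical positions between two uniform random sequences."""
    a = rng.choices(alphabet, k=length)
    b = rng.choices(alphabet, k=length)
    return sum(1 for x, y in zip(a, b) if x == y) / length


rng = random.Random(0)  # fixed seed so the run is repeatable
print(1 / len(DNA), random_identity(DNA, 100_000, rng))
print(1 / len(AMINO_ACIDS), random_identity(AMINO_ACIDS, 100_000, rng))
```

The higher chance level for DNA means that a given percent identity carries less evidence of homology at the nucleotide level than at the protein level.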

Conclusions • Using the amino-acid sequence is preferable for homology search • Why use a nucleotide sequence after all? • No ORF found, e.g. newly sequenced genome • No similar protein sequences were found • Specific DNA databases are available (EST)

Some terminology • Query sequence: the sequence with which we are searching • Hit: a sequence found in the database, suspected to be homologous

How do we search a database? • Assume we perform pairwise alignment of the query against all the sequences in the database • Exact pairwise alignment is $O(mn) \approx O(n^2)$ (m: length of sequence 1, n: length of sequence 2)
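The cost of this exhaustive approach can be estimated by summing m×n over every database sequence. The sketch below counts table cells only; the database of one million sequences of length 250 is a hypothetical figure chosen for illustration, not a number from the lesson.

```python
def exhaustive_search_cells(query_length: int, database_lengths) -> int:
    """Total table cells filled when aligning the query to every database sequence."""
    return sum(query_length * n for n in database_lengths)


# Hypothetical database: one million proteins, each 250 residues long.
database = [250] * 1_000_000
print(exhaustive_search_cells(250, database))  # 62,500,000,000 cells
```

The total grows linearly with the number of database sequences, so the cost of exact alignment against every entry scales directly with database size.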
