
Practical Session 2

Practical Session 2. +. Table of contents. Scoring matrices PAM BLOSUM Intro to Python. Aligning Protein Sequences. Classification Clustering of families Annotations (functional and structural). Aligning Protein Sequences. Proteins consist of 20 amino acids.

dennisj

Presentation Transcript


  1. Practical Session 2

  2. Table of contents • Scoring matrices • PAM • BLOSUM • Intro to Python

  3. Aligning Protein Sequences • Classification • Clustering of families • Annotations (functional and structural)

  4. Aligning Protein Sequences • Proteins consist of 20 amino acids. Task given: align two protein sequences. • Can the previous alignment algorithms be used? • How do amino acids differ from one another?

  5. Aligning Protein Sequences When evaluating the probability of one amino acid mutating to another, we need to consider: • Mutational Distance • Chemical properties - similarity/difference • Evolutionary time

  6. Mutational Distance Assume we start with Methionine (Met), which is encoded by a single codon: ATG. Threonine (Thr) is encoded by AC[ACGT]. To mutate Met to Thr, a single point mutation (one nucleotide substitution) is enough: ATG → ACG

  7. Mutational Distance Assume we start with Methionine (Met), which is encoded by a single codon: ATG. Threonine (Thr) is encoded by AC[ACGT]. To mutate Met to Thr, a single point mutation (one nucleotide substitution) is enough. • 3 point mutations are required to mutate Met to His, which is encoded by CA[TC]. • Therefore, His is more distant from Met than Thr is.
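The codon comparison above can be checked with a short sketch. The names `CODONS`, `hamming` and `min_codon_distance` are illustrative and not part of the slides, and the table holds only the codons mentioned above rather than the full genetic code.

```python
from itertools import product

# Codons named on the slides (DNA alphabet); not a full genetic code table.
CODONS = {
    "Met": ["ATG"],
    "Thr": ["ACA", "ACC", "ACG", "ACT"],
    "His": ["CAT", "CAC"],
}

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def min_codon_distance(aa_from, aa_to):
    """Fewest point mutations turning some codon of aa_from into a codon of aa_to."""
    return min(hamming(c1, c2)
               for c1, c2 in product(CODONS[aa_from], CODONS[aa_to]))

print(min_codon_distance("Met", "Thr"))  # 1
print(min_codon_distance("Met", "His"))  # 3
```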

  8. Amino acids’ chemical properties • Size • Structure • Polarity • Charge • Acidity (pKa) • These properties affect • mutation probabilities

  9. Amino acids’ chemical properties • Mutations that change the functionality (chemical properties) of the protein should be less likely to occur.

  10. Evolutionary time • Time is another aspect which needs attention. • Does a longer time permit fewer or more mutations? • How can that be included in the scoring system?

  11. Evolutionary Substitution Matrix • A substitution matrix contains values proportional to the probability that amino acid $i$ mutates into amino acid $j$, for all pairs of amino acids.

  12. Evolutionary Substitution Matrix • A substitution matrix contains values proportional to the probability that amino acid $i$ mutates into amino acid $j$, for all pairs of amino acids. • Based on empirical observations • Assumption: frequent substitutions reflect “safe variations” and thus should be given a higher score, while infrequent substitutions are probably detrimental and thus should be given a lower score. • The two major types of substitution matrices are PAM and BLOSUM.

  13. PAM Matrices PAM – Percent/Point Accepted Mutations. • The first widely used scoring scheme for amino acid alignment. • Devised by Margaret Oakley Dayhoff and colleagues in 1978.

  14. PAM – point accepted mutation • Substitution of an amino acid in a protein with another amino acid, which is accepted by the process of natural selection. • Silent or lethal mutations are not point accepted mutations

  15. PAM Matrices • PAM matrices are noted as PAMn matrices • PAM1 represents the time period over which we expect 1% of the amino acids to undergo point accepted mutations

  16. Constructing PAM Matrices • Examined 1572 substitutions in 71 families of proteins (71 phylogenetic trees) • The protein sequences were at least 85% identical

  17. Constructing PAM Matrices • Calculating $A_{ij}$ - the number of observed cases in which amino acid $i$ mutated to amino acid $j$.

  18. Constructing PAM Matrices • $A_{ij}$ - the number of observed cases in which amino acid $i$ mutated to amino acid $j$ • $N_i$ is the number of appearances of amino acid $i$ • $\lambda$ is a constant • $M_{ij}$, the probability of $i$ mutating to $j$, is computed from these three quantities.
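A minimal sketch of how the quantities on this slide could be combined. It assumes the off-diagonal probability has the form (constant × observed substitutions from $i$ to $j$) / (appearances of $i$), with the diagonal filled in so that each row sums to 1 and the constant chosen so that 1% of positions change on average. That form is an assumption made for illustration, and the three-letter alphabet and all counts are invented toy values, not Dayhoff's data.

```python
# Toy data: a three-letter alphabet with invented counts (not Dayhoff's data).
letters = ["A", "B", "C"]

# A[i][j]: observed cases in which letter i mutated to letter j (i != j).
A = {
    "A": {"B": 30, "C": 10},
    "B": {"A": 30, "C": 20},
    "C": {"A": 10, "B": 20},
}

# N[i]: number of appearances of letter i in the data.
N = {"A": 5000, "B": 3000, "C": 2000}

total_len = float(sum(N.values()))
total_subs = sum(sum(row.values()) for row in A.values())

# Assumed normalisation: choose lam so that, on average, 1% of positions change.
lam = 0.01 * total_len / total_subs

M = {}
for i in letters:
    # Assumed form of the off-diagonal entries: lam * A[i][j] / N[i].
    M[i] = {j: lam * A[i][j] / N[i] for j in letters if j != i}
    # Diagonal: probability that letter i is still letter i after one step.
    M[i][i] = 1.0 - sum(M[i].values())

# Expected fraction of positions that mutate in one step.
mutated = sum(N[i] / total_len * (1.0 - M[i][i]) for i in letters)
print(round(mutated, 6))
```

With this choice of the constant, the weighted average of the off-diagonal mass is 1% by construction, which matches the definition of PAM1 given earlier.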

  19. Constructing PAM Matrices For clarity, the values have been multiplied by 10000

  20. Constructing PAM Matrices The diagonal represents the probability of still observing the same residue after 1 PAM. The diagonal therefore accounts for the 99% of cases in which no mutation occurs. For clarity, the values have been multiplied by 10000

  21. Deriving PAMn matrices • PAM1 represents the evolutionary time in which 1% of amino acids mutated • PAM250 represents the evolutionary time in which 250% of amino acids mutated • PAM250 corresponds to sequences of approximately 20% sequence similarity • How can that be? • Each amino acid can mutate more than once

  22. Deriving PAMn matrices
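PAMn is conventionally obtained by multiplying the PAM1 probability matrix by itself n times, treating each step as independent of the earlier ones. The sketch below uses a made-up two-letter alphabet (not real amino acid data) to show why 250 PAM steps do not make every position differ: repeated and reverse mutations at the same site count as steps but do not all show up as differences. The helper names `mat_mul` and `mat_pow` are illustrative.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, n):
    """Return M multiplied by itself n times (identity for n == 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

# Toy "PAM1" for a two-letter alphabet: each letter changes with probability 1%.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]

pam250 = mat_pow(pam1, 250)

# Probability that a position shows the same letter after 250 steps.
print(round(pam250[0][0], 3))
```

In this toy model the probability of observing the original letter stays a little above 0.5 even after 250 steps, far from zero, because many of the steps hit positions that had already changed.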

  23. Constructing PAM Matrices • The frequency of amino acid $i$: $f_i = N_i / N$ • $N_i$ is the number of appearances of amino acid $i$ • $N$ is the total sequence length (all alignments) The Dayhoff group computed the matrix in the 1970s. In 1991 it was recomputed by the Jones group, who used a much larger set of proteins but still obtained very similar values for the relative frequencies of substitutions.

  24. From probabilities to scores • So far, we have obtained a probability matrix, but we would like a scoring matrix. • The score is the log of an odds ratio: $S_{ij} = \log \frac{\text{observed frequency}}{\text{expected frequency by chance}}$

  25. Constructing PAM Matrices • Taking the log of the ratio (observed frequency / expected frequency by chance) has convenient practical consequences: • A positive score ($S_{ij} > 0$) characterizes the accepted mutations • A negative score ($S_{ij} < 0$) characterizes the unfavorable mutations • Another property of log-odds scores is that they can be added to produce the score of an alignment: T A H G K Y S D G D
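The additivity of log-odds scores can be sketched as follows. Several things here are assumptions: the ten letters on the slide are read as two aligned five-residue sequences (TAHGK over YSDGD), the scaling 10·log10 is used for the score, and the odds ratios in `ratios` are invented for illustration and are not taken from any PAM table.

```python
import math

# Invented odds ratios (observed / expected by chance); illustration only.
ratios = {
    ("T", "Y"): 0.5,
    ("A", "S"): 1.3,
    ("H", "D"): 1.1,
    ("G", "G"): 3.2,
    ("K", "D"): 1.0,
}

def log_odds(ratio):
    """Log-odds score; the 10 * log10 scaling is an assumed convention."""
    return 10 * math.log10(ratio)

seq1, seq2 = "TAHGK", "YSDGD"

# Score each aligned pair, then add the scores up.
pair_scores = [log_odds(ratios[(a, b)]) for a, b in zip(seq1, seq2)]
alignment_score = sum(pair_scores)

# Adding logs is the same as taking the log of the product of the ratios.
product = 1.0
for a, b in zip(seq1, seq2):
    product *= ratios[(a, b)]

print(abs(alignment_score - log_odds(product)) < 1e-9)  # True
```

A ratio below 1 gives a negative score, a ratio above 1 a positive score, and a ratio of exactly 1 a score of zero, so the sum rewards columns that occur more often than chance would predict.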

  26. Choosing the right PAM matrix • Correspondence between the observed percent of amino acid difference and the evolutionary distance (in PAM)

  27. Choosing the right PAM matrix • PAM120 matrix is the most appropriate for database searches • PAM200 matrix is the most appropriate for comparing two specific proteins with suspected homology • A higher n is more appropriate for more distant proteins

  28. The model’s assumptions • Only substitutions are allowed – no indels. • Sites evolve independently – a mutation at one site has no effect on another. • Evolution model: the next mutation depends on the current state and is independent of previous mutations.

  29. Problem PAM matrices work quite well for closely related sequences, especially over short evolutionary times. However, they seem to lack the ability to represent more distant/divergent sequences, on a larger evolutionary time scale.

  30. BLOSUM (BLOcks SUbstitution Matrix) Devised by Henikoff & Henikoff in 1992.

  31. BLOSUM (BLOcks SUbstitution Matrix) • Used to score alignments of evolutionarily divergent (different) sequences. • As the name hints, the scores are extracted from local “blocks” of conserved sequences. • Unlike PAMn, the n in BLOSUMn represents the maximal similarity between the sequences, and all BLOSUM matrices are computed directly from observations.

  32. BLOSUM (BLOcks SUbstitution Matrix) • BLOSUM62 is the default matrix for the standard protein BLAST program • BLOSUM62 is derived from blocks of ungapped sequence alignments in which sequences sharing more than 62% identity are clustered and counted as one

  33. Constructing BLOSUM Henikoff and Henikoff developed a database of >2,000 “blocks” based on sequences from >500 groups of related proteins with shared subsequences:
      AABCDA...BBCDA
      DABCDA.A.BBCBB
      BBBCDABA.BCCAA
      AAACDAC.DCBCDB
      CCBADAB.DBBDCC
      AAACAA...BBCCC

  34. Why blocks? • Don’t want insertions and deletions to complicate estimation of substitution probabilities • Interested in detecting conserved regions of protein sequences, so restrict attention to these regions when computing the scoring matrix

  35. Constructing BLOSUM Intuitively, $S_{ij}$ is the (log of the) ratio of: • The number of times amino acids $i$ and $j$ appear together in the same column • Divided by the expected number of times to see $(i, j)$ pairs in the same column if the placement of amino acids $i$ and $j$ were random throughout BLOCKS.
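The ratio described above can be sketched on the first six (dot-free) columns of the example block from the "Constructing BLOSUM" slide. This is a simplified illustration under stated assumptions: it skips the clustering of similar sequences that gives BLOSUMn its n, it estimates the expected pair frequency from the letter frequencies implied by the observed pairs, and it scales scores as 2·log2 (half-bit units). The function name `score` is illustrative.

```python
import math
from collections import Counter
from itertools import combinations

# First six (ungapped) columns of the example block shown earlier.
block = ["AABCDA", "DABCDA", "BBBCDA", "AAACDA", "CCBADA", "AAACAA"]

# Count every unordered pair of letters that share a column.
pair_counts = Counter()
for column in zip(*block):
    for a, b in combinations(column, 2):
        pair_counts[tuple(sorted((a, b)))] += 1

total_pairs = sum(pair_counts.values())
# q[pair]: observed frequency of each unordered pair.
q = {pair: count / total_pairs for pair, count in pair_counts.items()}

# p[letter]: background frequency implied by the observed pairs.
p = Counter()
for (a, b), freq in q.items():
    if a == b:
        p[a] += freq
    else:
        p[a] += freq / 2
        p[b] += freq / 2

def score(a, b):
    """Log-odds score for a pair, or None if the pair was never observed."""
    pair = tuple(sorted((a, b)))
    if pair not in q:
        return None
    # Expected frequency if letters were placed at random.
    expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
    return 2 * math.log2(q[pair] / expected)

print(round(score("A", "A"), 2), round(score("A", "D"), 2))
```

In this toy block the identical pair A/A occurs more often than chance predicts and gets a positive score, while the pair A/D occurs less often and gets a negative one.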

  36. BLOSUM62

  37. Differences between PAM and BLOSUM

  38. Intro to Python

  39. Why Python? *By CodeEval - a platform used by developers to showcase their skills. 

  40. Why Python? • Quick development • Easy to learn • Huge community • Fast enough for most applications • Capable of interacting with most of the other languages and platforms

  41. Strings http://www.codeskulptor.org/
      s = 'hi'
      print(s[1])          # i
      print(len(s))        # 2
      print(s + ' there')  # hi there
      pi = 3.14
      text = 'The value of pi is ' + pi       # does not work
      text = 'The value of pi is ' + str(pi)  # yes
      s = 3

  42. String Slices (with s = 'Hello') • s[1:4] is 'ell' -- chars starting at index 1 and extending up to but not including index 4 • s[1:] is 'ello' -- omitting either index defaults to the start or end of the string • s[:] is 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a sequence like a string or list) • s[1:100] is 'ello' -- an index that is too big is truncated down to the string length • s[-1] is 'o' -- last char (1st from the end) • s[-3:] is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.

  43. If statement
      if speed >= 80:
          print('License and registration please')
          if mood == 'terrible' or speed >= 100:
              print('You have the right to remain silent.')
          elif mood == 'bad' or speed >= 90:
              print("I'm going to have to write you a ticket.")
              write_ticket()
          else:
              print("Let's try to keep it under 80 ok?")
      • Note there are no {} or ; Indentation is very important!

  44. Lists • my_list = [1,2,3,4,5,6,7,8,9,10] • my_list[1:5] # [2, 3, 4, 5] • my_list[::2] # [1, 3, 5, 7, 9] • my_list[::-1] # reversed: [10, 9, 8, 7, 6, 5, 4, 3, 2, 1] • Lists can contain different types of values: • pi = ['pi', 3.14159, True]

  45. Lists are dynamic • students = ['Itay', 9255587, 'Alon', 744554] • students.append('Michal') # ['Itay', 9255587, 'Alon', 744554, 'Michal'] • students[0:2] = ['Noa'] # ['Noa', 'Alon', 744554, 'Michal']

  46. Range
      list(range(10))        # the ordered list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
      list(range(0, 10, 2))  # [0, 2, 4, 6, 8]
      ## print the numbers from 0 through 99
      for i in range(100):
          print(i)
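The string, `if` and `range` constructs introduced above are enough to write a first sequence-comparison helper. The sketch below computes the percent identity of two equal-length sequences; the function name `percent_identity` is illustrative, and the two example sequences are the letters from the alignment slide read as two five-residue sequences, which is an assumption.

```python
def percent_identity(seq1, seq2):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    matches = 0
    for i in range(len(seq1)):
        if seq1[i] == seq2[i]:
            matches += 1
    return 100.0 * matches / len(seq1)

print(percent_identity("TAHGK", "YSDGD"))  # 20.0
```

Only the fourth position (G aligned with G) matches, so one of the five positions is identical.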
