
Applied Bioinformatics



Presentation Transcript


  1. Week 08 Applied Bioinformatics

  2. Theory I • Protein Sequences • Protein Families • Protein Domains • Computer Learning • Garbage in -> Garbage out • Prediction based on learned Examples

  3. Protein Sequence • Primary sequence consisting of 20 amino acids • Secondary structure consists of 3 types • Helix – Strand – Coil • Tertiary structure: combinations of secondary structures • Unlimited number of combinations possible • But limited number of motifs found • Architectures are built hierarchically • Quaternary structure • AKA protein-protein interactions; not part of this course

  4. http://njms2.umdnj.edu/biochweb/education/bioweb/PreK2010/AminoAcids.htm http://www.usermeds.com/medications/amino-acids http://www.weightlossandnutritionsecrets.com/all-about-amino-acids/

  5. PRIDE • The PRIDE (PRoteomics IDEntifications) database is a centralized, standards-compliant, public data repository for proteomics data • It contains experimental evidence for its entries • http://www.ebi.ac.uk/pride//

  6. Protein Sequences • Swiss-Prot = the manually reviewed section of UniProtKB • http://www.expasy.ch/sprot • http://www.ebi.ac.uk/swissprot/ • As in GenBank for nucleotide sequences, we need a unique identifier for each protein sequence • Let’s look at EBI now

  7. UniProtKB • The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. (KB: Knowledge Base) • Often manually reviewed and annotated information

  8. UniProt Including splice variants and isoforms

  9. Protein Information Clicking on the Member name (Accession Number) will provide detailed information about the protein

  10. Protein Information

  11. Protein Information

  12. Protein Information

  13. Protein Information

  14. Protein Information

  15. Protein Information

  16. Protein Information

  17. Machine Learning • For example clustering • UniRef90 • UniRef50

  18. Learning • Many Facts -> Rules/Knowledge • Learning = Deducing rules from facts • Computer/Machine learning? • Same idea

  19. Computer Learning • Neural Networks • Support Vector Machines • Naive Bayes Classifiers • Self Organizing Maps • Decision Trees • And many other algorithms

  20. Data • Training data needs to be chosen carefully • Example: subcellular targeting of proteins • What needs to be predicted? • Localization • Leader peptide cleavage site • Where does the data come from? • Best would be sequences validated by experimental results • How many? • Difficult to answer this one • More is good, but rare events will not be learned well • Better is manual curation: choose many different cases and do not over-represent any of them in the dataset

  21. Data • Yes! preparing the dataset is crucial and takes most of the time • Applying the learner will not take long • All outcomes of the samples need to be known (target, cleavage site) • Negative examples are just as important • Divide the dataset into two parts • One will be used for learning • The other for validating the learned rules

  22. Validation • The dataset can be automatically divided into different training and validation sets • This can be performed many times and the best result (rule set) can later be used to predict new sequences • That’s machine learning in brief • We just touched the surface of it
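
The splitting described on the Data and Validation slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`split_dataset`, `k_fold_splits`), the 80/20 ratio, and the toy dataset of identifiers and labels are hypothetical and do not come from the slides.

```python
import random

def split_dataset(examples, train_fraction=0.8, seed=0):
    """Shuffle labelled examples and split them into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(examples, k=5, seed=0):
    """Yield k (training, validation) pairs; each example is validated exactly once."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, validation

# Hypothetical labelled data: (sequence identifier, known localization).
# Both positive and negative examples are present, as the slides require.
dataset = [(f"protein_{i}", "chloroplast" if i % 2 else "other") for i in range(20)]

train, valid = split_dataset(dataset)
print(len(train), "training examples,", len(valid), "validation examples")

for fold_number, (training, validation) in enumerate(k_fold_splits(dataset, k=5)):
    print("fold", fold_number, "validates on", len(validation), "examples")
```

Repeating the split with different seeds, or using the k-fold variant, corresponds to dividing the dataset "many times" into different training and validation sets.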

  23. Classification: General Idea

  24. Practical Considerations • You want to predict the subcellular target of a protein • Which species are you working with? • Which species did the training data come from? • You can try a few known examples • Read the publication • How precise is the prediction? • For localization • For prediction of the leader peptide • If possible, try different approaches

  25. Clustering (Machine Learning) • Basically same idea as in MSA • Similar sequences are aligned first • Similar datasets are clustered first • The initial clusters are combined into super clusters (hierarchical clustering) • Similar to forming a guide tree • New measurements can be assigned to known clusters • Information can be inferred
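
A minimal sketch of the hierarchical (bottom-up) clustering idea from this slide, assuming single linkage and a toy list of one-dimensional measurements. The function name and the data are hypothetical; real tools use faster algorithms and other linkage rules.

```python
def hierarchical_clustering(points, distance):
    """Agglomerative (bottom-up) clustering with single linkage.

    Returns the merges in the order they happened; each merge is a pair of
    clusters (tuples of point indices) that were joined, much like the
    branching order of a guide tree.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy measurements; the most similar ones are joined first.
measurements = [1.0, 1.2, 5.0, 5.1, 9.0]
merges = hierarchical_clustering(measurements, lambda x, y: abs(x - y))
for left, right in merges:
    print(left, "+", right)
```

The first merges join the most similar items, and later merges combine those initial clusters into super clusters.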

  26. Protein Families • Based on • Clusters of protein sequences • Domains (basically blocks of above) • Many domains are annotated • Good place to find these is • http://www.ebi.ac.uk/InterProScan

  27. Practice I

  28. Protein Information • In many cases we would like to get additional information about a protein • Molecular mass • pI • Subcellular targeting • http://www.expasy.org/tools • Many calculations etc. for proteins
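
As an illustration of one such calculation, the molecular mass of an unmodified protein can be approximated by summing average residue masses and adding one water for the free termini. The sketch below assumes standard average residue masses in daltons; the function name and the example peptide are hypothetical, and pI estimation is left out because it depends on the chosen pKa set.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water for the free N- and C-termini

def molecular_mass(sequence):
    """Approximate average molecular mass (Da) of an unmodified peptide chain."""
    sequence = sequence.upper()
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

# Hypothetical short peptide, for illustration only.
print(round(molecular_mass("MKTAYIAK"), 2))
```

Post-translational modifications and cleaved leader peptides change the real mass, so the result is only a first estimate.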

  29. Tools at Expasy • Prediction/ Characterizing Tools • Pattern and Profile searches • PTM predictions • Topology Prediction • Structure • Primary (Analysis) • Secondary (Prediction) • Tertiary (Prediction, Analysis) • …

  30. Protein Information

  31. Localization • You want to predict the subcellular localization of a protein

  32. Let’s tackle this problem • Get a protein from Swiss-Prot • O82533 (Gene: AtFtsZ2-1) • Annotation: Chloroplast targeting • Try a few prediction tools to see if you can confirm the annotation

  33. Localization Prediction • Choose tools from Expasy for example • ChloroP • SignalP • Predotar

  34. Theory II • Substitution Matrices

  35. First Substitution Matrices • Substitution Matrices • Sequence relationships may be hidden by changes in sequence • Mutations • Evolution • Approximate matches are needed

  36. Selectionist Model • Some mutations are neutral • Not disturbing the function much • Not disturbing the structure much • These accumulate over time (evolution) • Some mutations are disruptive • L <> Q • Frameshift insertions or deletions

  37. More elaborate Matrices • Format • Table 20 X 20 • Probability of change for each combination • Symmetric • 190 distinct entries + 20 • Examples • Unitary • GCM • BLOSUM • PAM
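
The count of 190 + 20 distinct entries follows from the symmetry of the 20 × 20 table. A small sketch, assuming the unitary matrix scores 1 for a match and 0 for any substitution (the `score` helper and the example sequences are hypothetical):

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Unitary (identity) matrix: 1 for a match, 0 for any substitution.
unitary = {(a, b): 1 if a == b else 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}

# Because the table is symmetric, only unordered pairs need to be stored.
distinct_pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
off_diagonal = [p for p in distinct_pairs if p[0] != p[1]]
diagonal = [p for p in distinct_pairs if p[0] == p[1]]
print(len(off_diagonal), "substitution entries +", len(diagonal), "match entries")

def score(seq_a, seq_b, matrix):
    """Sum substitution scores over two equal-length, already aligned sequences."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))

print(score("ACDE", "ACDF", unitary))
```

Any of the other matrices (GCM, BLOSUM, PAM) could be plugged into the same `score` function in place of `unitary`.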

  38. Genetic Code Matrix • Considers the minimum number of base changes (0, 1, 2, 3) • Are amino acids whose codons differ in only one base significantly different chemically? • Not a very good matrix • Although mutation happens on the genetic level • Selection is on the protein level • A priori • Example • Jukes Cantor Model
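
The entries of a genetic code matrix can be derived directly from the standard codon table. The sketch below assumes the standard genetic code written in the usual TCAG ordering; the function name `min_base_changes` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS_BY_CODON = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each amino acid (or "*" for stop) to the list of its codons.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS_BY_CODON):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))

def min_base_changes(aa1, aa2):
    """Minimum number of base changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_FOR[aa1]
        for c2 in CODONS_FOR[aa2]
    )

for pair in [("L", "Q"), ("M", "W"), ("F", "K")]:
    print(pair, min_base_changes(*pair))
```

Leucine and glutamine are a single base change apart (for example CTA and CAA) although the exchange is chemically disruptive, which illustrates why this matrix is not a very good one.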

  39. Amino Acid Substitutions • A priori • driven by amino acid properties • Size • Hydrophobicity • Charge • ... • Determined from example

  40. PAM matrices • Point Accepted Mutation (often read as Percent Accepted Mutation): Unit of evolutionary change for protein sequences [Dayhoff78]. • A PAM unit is the amount of evolution that will on average change 1% of the amino acids within a protein sequence.

  41. PAM matrices: Assumptions • Only mutations are allowed • Sites evolve independently • Evolution at each site occurs according to a simple (“first-order”) Markov process • Next mutation depends only on current state and is independent of previous mutations • Mutation probabilities are given by a substitution matrix M = [m_XY], where m_XY = Prob(X → Y mutation) = Prob(Y | X)

  42. The PAM Family • Define a family of substitution matrices (PAM 1, PAM 2, etc.), where PAM n is used to compare sequences at distance n PAM • PAM n = (PAM 1)^n • Do not confuse with scoring matrices! Scoring matrices are derived from PAM matrices to yield log-odds scores.
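
The relation between PAM 1, PAM n, and the derived log-odds scores can be illustrated with a toy example. Everything specific here is an assumption made for the sketch: a hypothetical 3-letter alphabet instead of 20 amino acids, invented probabilities and background frequencies, and a layout in which row X holds Prob(Y | X), so that rows (not columns) sum to 1.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy "PAM 1": each residue changes with probability 0.01 per PAM unit.
pam1 = [
    [0.9900, 0.0075, 0.0025],
    [0.0075, 0.9900, 0.0025],
    [0.0050, 0.0050, 0.9900],
]
background = [0.4, 0.4, 0.2]  # hypothetical residue frequencies

def log_odds(m, freqs):
    """Scoring matrix: 10 * log10 of (mutation probability / background frequency)."""
    size = len(m)
    return [[10 * math.log10(m[x][y] / freqs[y]) for y in range(size)]
            for x in range(size)]

pam250 = mat_pow(pam1, 250)           # mutation probabilities at distance 250 PAM
scores250 = log_odds(pam250, background)
for row in scores250:
    print([round(s, 1) for s in row])
```

The probability matrix `pam250` and the scoring matrix `scores250` are different objects, which is the distinction the slide warns about.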

  43. Generating PAM matrices • Idea: Find amino acid substitution statistics by comparing evolutionarily close sequences that are highly similar • Easier than for distant sequences, since only few insertions and deletions took place. • Computing PAM 1 (Dayhoff’s approach): • Start with highly similar aligned sequences, with known evolutionary trees (71 trees total). • Collect substitution statistics (1572 exchanges total). • Let m_ij = observed frequency (= estimated probability) of amino acid A_i mutating into amino acid A_j during one PAM unit • Result: a 20 × 20 real matrix where columns add up to 1.

  44. Dayhoff’s PAM matrix • All entries × 10^4

  45. Calculate a substitution frequency matrix

  46. PAM250 (log odds)

  47. BLOSUM matrices • Blocks Substitution Matrix. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff92]. • For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

  48. BLOSUM Scoring Matrices • BLOcks SUbstitution Matrix • Based on comparisons of blocks of sequences derived from the Blocks database • The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment) • BLOSUM matrices are derived from blocks whose maximum sequence identity corresponds to the BLOSUM matrix number
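
How scores are obtained from substitution frequencies in a block can be sketched as follows. This is a simplified illustration: it uses one hypothetical ungapped block over a two-letter alphabet, counts every residue pair in every column, and converts observed versus expected pair frequencies into half-bit log-odds scores. The clustering and weighting of similar sequences is left out.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Half-bit log-odds scores from one ungapped block (equal-length strings)."""
    # Count all residue pairs within each column of the block.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each residue among all pair members.
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Expected pair frequency if residues paired up by chance.
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

# Hypothetical block: two fully conserved columns and one variable column.
toy_block = ["ACA", "ACA", "ACA", "ACC"]
toy_scores = blosum_like_scores(toy_block)
print(toy_scores)
```

Pairs that occur more often than expected by chance get positive scores, and pairs that occur less often get negative scores.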

  49. Conserved blocks in alignments
      AABCDA...BBCDA
      DABCDA.A.BBCBB
      BBBCDABA.BCCAA
      AAACDAC.DCBCDB
      CCBADAB.DBBDCC
      AAACAA...BBCCC

  50. Constructing BLOSUM r • To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical • The elimination is done by either • removing sequences from the block, or • finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster. • BLOSUM r is the matrix built from blocks with no more than r% similarity • E.g., BLOSUM62 is the matrix built using sequences with no more than 62% similarity. • Note: BLOSUM62 is the default matrix for protein BLAST
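
The first elimination option (removing sequences from the block) can be sketched with a simple greedy filter. The function names, the greedy order, and the toy sequences are assumptions for illustration; the cluster-based alternative from the slide is not shown.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length block rows."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def remove_redundant(block, r):
    """Keep a sequence only if it is at most r% identical to every sequence kept so far."""
    kept = []
    for seq in block:
        if all(percent_identity(seq, other) <= r for other in kept):
            kept.append(seq)
    return kept

# Hypothetical block rows: the second is 90% identical to the first.
toy_block = ["ABCDABCDAB", "ABCDABCDAA", "DCBADCBADC"]
print(remove_redundant(toy_block, 62))
print(remove_redundant(toy_block, 95))
```

With r = 62 the near-duplicate row is dropped, so it cannot dominate the substitution counts; with a higher threshold it is kept.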
