Advanced Tools and Algorithms in Bioinformatics Chittibabu Guda Summer, 2004


Presentation Transcript


  1. Advanced Tools and Algorithms in Bioinformatics Chittibabu Guda Summer, 2004 UCSD Extension, Department of Biosciences

  2. Today’s Topics • Hidden Markov Models (HMMs) • Predicting sub-cellular localization of proteins • Predicting post-translational modification sites • Using standalone tools • Current trends in bioinformatics

  3. Hidden Markov Models

  4. HMMs for biological sequences • A hidden Markov model (HMM) is a statistical model; HMMs were developed largely for speech recognition. • The most popular use of HMMs in molecular biology is as a ‘probabilistic profile’ of a protein family, called a profile HMM. • HMMs are also used for multiple sequence alignment, gene prediction (ORF finding), and protein structure prediction. • Advantages: statistically sound; no sequence ordering or gap penalties are required. • Limitations: a large number of similar sequences is required to get good models.

  5. Stochastic modeling of biological sequences • For example, a profile is a position-specific scoring matrix. • Given this model, the probability of CGGSV is: 0.8 × 0.4 × 0.8 × 0.6 × 0.2 = 0.031 • Because multiplying many small fractions is computationally expensive and prone to floating-point errors (the product can underflow), the calculation is moved into log space. • The score is calculated by taking the logs of all amino acid probabilities and adding them up: ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
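
The arithmetic on this slide can be checked with a short Python sketch. The five probabilities are copied from the slide; the function names are illustrative only.

```python
import math

# Per-position probabilities of C, G, G, S, V under the profile (from the slide).
position_probs = [0.8, 0.4, 0.8, 0.6, 0.2]

def product_score(probs):
    """Multiply the per-position probabilities directly."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_score(probs):
    """Sum natural logs instead of multiplying the raw probabilities."""
    return sum(math.log(p) for p in probs)

print(round(product_score(position_probs), 3))  # 0.031
print(round(log_score(position_probs), 2))      # -3.48
```

Exponentiating the log score recovers the product, so no information is lost by working with sums.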

  6. Stochastic modeling of biological sequences • But with this expression it is not possible to distinguish between the highly implausible sequence TGCT--AGG and the consensus sequence ACAC--ATC.
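
A minimal sketch of the problem, assuming "this expression" is a regular expression built column by column from the five-sequence DNA alignment used in this lecture. The pattern below was derived here for illustration and does not appear in the transcript.

```python
import re

# Assumed pattern: one character class per conserved alignment column,
# with [ACGT]* standing in for the variable-length insert region.
PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches(seq):
    """Return True if the whole ungapped sequence fits the pattern."""
    ungapped = seq.replace("-", "").replace(" ", "")
    return PATTERN.fullmatch(ungapped) is not None

print(matches("TGCT--AGG"))  # True: implausible, yet accepted
print(matches("ACAC--ATC"))  # True: the consensus gets no better answer
```

A yes/no match carries no ranking, which is the motivation for attaching probabilities to each position and transition.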

  7. The HMM architecture • S: start; E: end • m: main state (matches/mismatches) • i: insert state • d: delete state • Example alignment:
     A C A - - - A T G
     T C A A C T A T C
     A C A C - - A G C
     A G A - - - A T C
     A C C G - - A T C

  8. Parameters used in HMM building • Example alignment:
     M N - F L S
     M N - F L S
     M N K Y L T
     M Q - W - T
     (state labels shown in the figure: i, m, m, d)
     • Transition probability: Tij (average 0.333) • Emission probability: Ei (average 0.05) • Since the probabilities are very small numbers, they are converted to log-odds scores and added to get the overall probability score

  9. Markov modeling of biological sequences
     A C A - - - A T G
     T C A A C T A T C
     A C A C - - A G C
     A G A - - - A T C
     A C C G - - A T C

  10. Markov modeling of biological sequences
      Sequence             P(s) × 100
      A C A - - - A T G    3.3
      T C A A C T A T C    0.0075
      A C A C - - A G C    1.2
      A G A - - - A T C    3.3
      A C C G - - A T C    0.59
      A C A C - - A T C    4.7
      P(ACACATC) = 0.047, obtained by taking the product of the probabilities for the residues in each state and for the transitions.
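
The values in this table can be reproduced with a small sketch. It assumes a topology that the transcript does not spell out: three match states, one insert state with a self-loop, then three more match states, with every parameter estimated by plain frequency counts from the five aligned sequences (no pseudocounts). All names are illustrative.

```python
from collections import Counter

# Alignment from the slides; columns 0-2 and 6-8 are treated as match states,
# columns 3-5 as a single insert state with a self-loop (assumed topology).
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
MATCH_COLUMNS = [0, 1, 2, 6, 7, 8]

def estimate_parameters(alignment):
    """Estimate emission and transition probabilities by frequency counts."""
    n = len(alignment)
    match_emit = []
    for col in MATCH_COLUMNS:
        counts = Counter(seq[col] for seq in alignment)
        match_emit.append({res: k / n for res, k in counts.items()})
    inserts = [seq[3:6].replace("-", "") for seq in alignment]
    insert_counts = Counter("".join(inserts))
    n_inserted = sum(insert_counts.values())
    insert_emit = {res: k / n_inserted for res, k in insert_counts.items()}
    # Fraction of sequences that enter the insert state after the third match.
    p_enter = sum(1 for ins in inserts if ins) / n
    # Each inserted residue is followed by one transition: stay or leave.
    p_stay = sum(len(ins) - 1 for ins in inserts if ins) / n_inserted
    return match_emit, insert_emit, p_enter, p_stay

def sequence_probability(seq, params):
    """Product of emission and transition probabilities along the path of seq."""
    match_emit, insert_emit, p_enter, p_stay = params
    seq = seq.replace("-", "")
    head, middle, tail = seq[:3], seq[3:-3], seq[-3:]
    p = 1.0
    for emit, res in zip(match_emit[:3], head):
        p *= emit.get(res, 0.0)
    if middle:
        p *= p_enter
        for i, res in enumerate(middle):
            p *= insert_emit.get(res, 0.0)
            p *= p_stay if i < len(middle) - 1 else (1.0 - p_stay)
    else:
        p *= 1.0 - p_enter
    for emit, res in zip(match_emit[3:], tail):
        p *= emit.get(res, 0.0)
    return p

PARAMS = estimate_parameters(ALIGNMENT)
for s in ALIGNMENT + ["ACAC--ATC"]:
    print(s, round(100 * sequence_probability(s, PARAMS), 4))
```

Under these assumptions the printed values round to the numbers in the table, which suggests the slide used the same kind of count-based model.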

  11. Sequence Alignment and Database Search using HMMER Multiple Alignment Build a Profile HMM Database search Query against Profile HMM database (PFAM database) Multiple alignments

  12. HMMSEARCH Results (on voltage-gated ion channel proteins database)

  13. PFAM http://pfam.wustl.edu • Protein family database created using HMMs • Pfam-A contains functionally annotated families (~7500) • Pfam-B contains unannotated families (~107000) • All protein sequences were clustered into families based on sequence identity • For each family, non-redundant, full-domain seed members were selected to represent the family • Seed multiple alignments were built using ClustalW and checked manually • HMMs were built using hmmbuild (part of the HMMER suite of programs) • Using these models, more family members were added iteratively: new members are added to the multiple alignment and the HMM is updated until no more new members are found

  14. How to build and use Profile HMMs • Get a family of seed sequences in a multiple alignment • Build a hidden Markov model using hmmbuild • Use the HMM as a query to find remote homologues in the sequence database using hmmsearch • Add new sequences to the seed alignment using hmmalign and update the model, iteratively • Get the consensus sequence of the model using hmmemit • Search new query sequences against the HMM using hmmpfam to find out whether they are related to the model
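
The steps above can be sketched as a list of command lines. This helper only builds the argument lists and does not run anything. The argument order and the `-o` and `-c` options follow the HMMER 2 usage as recalled here and should be checked against each program's `-h` output; `hmmpfam` exists only in HMMER 2 and was replaced by `hmmscan` in HMMER 3. All file names are hypothetical.

```python
import shlex

def hmmer_commands(seed_aln, hmm, seq_db, new_seqs, updated_aln, hmm_db, query):
    """Return one argument list per workflow step; nothing is executed here."""
    return [
        ["hmmbuild", hmm, seed_aln],                     # build the profile HMM
        ["hmmsearch", hmm, seq_db],                      # search a sequence database
        ["hmmalign", "-o", updated_aln, hmm, new_seqs],  # align new members to the model
        ["hmmemit", "-c", hmm],                          # emit a consensus sequence
        ["hmmpfam", hmm_db, query],                      # query sequences vs. an HMM database
    ]

# Hypothetical file names, for illustration only.
for argv in hmmer_commands("seed.aln", "family.hmm", "seqdb.fasta",
                           "hits.fasta", "updated.aln", "pfam_models.hmm",
                           "query.fasta"):
    print(shlex.join(argv))
```

To iterate, the updated alignment would be passed back to `hmmbuild` until `hmmsearch` returns no new members.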

  15. SledgeHMMER web server • Accessible at http://SledgeHMMER.sdsc.edu • The Pfam database is the largest protein functional domain database built with hidden Markov models • This server provides quick access to pre-calculated Pfam results for 1.2 million protein sequences (the entire SP+TrEMBL databases) • Query sequences are matched to stored sequences by comparing their MD5 hexadecimal hashes (computed in Perl) • The web server is implemented with a Perl/CGI interface
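
A minimal sketch of hash-based lookup, written in Python rather than the Perl the server uses. The normalisation step and the in-memory dictionary are assumptions made for illustration; the server's actual storage layer is not described in the slides.

```python
import hashlib

def sequence_key(seq):
    """MD5 hex digest of a sequence, upper-cased and stripped of whitespace."""
    cleaned = "".join(seq.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

# Hypothetical cache: digest -> stored result (placeholder text, not real Pfam output).
precalculated = {sequence_key("MNKYLT"): "stored result for MNKYLT"}

def lookup(seq):
    """Return the stored result for an identical sequence, or None."""
    return precalculated.get(sequence_key(seq))

print(lookup("mnk ylt"))   # found: same residues after normalisation
print(lookup("MNKYLS"))    # None: one residue differs, so the hash differs
```

Comparing fixed-length digests avoids comparing full sequences, but it only finds exact matches; any sequence not in the cache still needs a real search.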

  16. Predicting sub-cellular localization of proteins

  17. Different cellular compartments (modified from Voet & Voet, Biochemistry; Weinheim, New York, Basel, Wiley-VCH 1992)

  18. Methods to predict sub-cellular location • Based on amino acid composition • Based on signal or target peptides (PSORT, TargetP) • Based on domain occurrence patterns (MITOPRED) • Based on lexical analysis

  19. Amino acid compositional differences in different sub-cellular locations

  20. PSORT (http://psort.ims.u-tokyo.ac.jp/) • PSORT works from a comprehensive knowledge base of protein sorting signals • Different parameters are evaluated for different groups of species • Bacterial sequences: • N-terminal signal sequence (Positive-H region)/cleavage site • Transmembrane segments • Lipoprotein analysis • Amino acid composition

  21. PSORT continued … • Eukaryotic sequences (yeast/animal/plant): • N-terminal signal sequence (Positive-H region)/cleavage site • Transmembrane segments and membrane topology • Mitochondrial targeting signals and amino acid composition (AAC) of the N-terminal 20 residues • Nuclear localization signals (NLS) • Peroxisome matrix targeting sequences (PTSs): (S/A/C)(K/R/H)L • Chloroplast targeting signals • Endoplasmic reticulum signals (KDEL, or HDEL in yeast) • Vesicular, lysosomal, vacuolar proteins, etc.
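
Two of the signals listed above are short C-terminal motifs and can be written as regular expressions. This is only an illustration of those two motifs, assuming both are anchored at the C-terminus; it is not how PSORT itself is implemented, and the labels are made up for this sketch.

```python
import re

# Motifs transcribed from the slide; "$" anchors them to the C-terminus.
PTS1 = re.compile(r"[SAC][KRH]L$")       # (S/A/C)(K/R/H)L
ER_RETENTION = re.compile(r"[KH]DEL$")   # KDEL, or HDEL in yeast

def c_terminal_signals(seq):
    """Return labels for the C-terminal motifs found in seq."""
    seq = seq.strip().upper()
    found = []
    if PTS1.search(seq):
        found.append("peroxisomal (PTS1-like)")
    if ER_RETENTION.search(seq):
        found.append("ER retention (KDEL/HDEL)")
    return found

print(c_terminal_signals("MAAAASKL"))
print(c_terminal_signals("MKTLKDEL"))
print(c_terminal_signals("MKTLLLA"))
```

A motif hit alone is weak evidence, which is why PSORT combines it with the other features on the slide.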

  22. MITOPRED (http://mitopred.sdsc.edu) • A new method based on Pfam domain occurrence patterns and on differences in amino acid composition (AAC) and pI value between mitochondrial and non-mitochondrial proteins • Eukaryotic cells have multiple compartments, and a given set of pathways is localized to a specific compartment. Thus, a protein family involved in a specific pathway is expected in a specific compartment • A knowledge base was developed by studying the occurrence and co-occurrence patterns of different Pfam domains in different cellular compartments • The method compares the Pfam domains found in the query sequence against the knowledge base and assigns a score, depending on which compartment each domain belongs to • Independent scores are calculated from the AAC and pI value of the query sequence by comparing them to the average values in different locations • The final prediction is based on the combined AAC, pI and Pfam scores
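
The amino acid composition comparison can be sketched as follows. The slides do not give MITOPRED's scoring function, so Euclidean distance is used here purely as a stand-in, and the class averages are toy values computed from made-up sequences.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in seq."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}

def distance(comp_a, comp_b):
    """Euclidean distance between two compositions (stand-in metric)."""
    return math.sqrt(sum((comp_a[a] - comp_b[a]) ** 2 for a in AMINO_ACIDS))

def closest_class(seq, class_averages):
    """Name of the class whose average composition is nearest to that of seq."""
    query = composition(seq)
    return min(class_averages, key=lambda name: distance(query, class_averages[name]))

# Toy averages from made-up sequences; real averages would come from annotated proteins.
toy_averages = {
    "toy_class_1": composition("KKKKRRRRAL"),
    "toy_class_2": composition("DDDDEEEEAL"),
}
print(closest_class("MKRKRAL", toy_averages))
```

In the method described above, an AAC-based score like this would be only one of three inputs, alongside the pI and Pfam scores.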

  23. More in Cytoplasmic More in Mitochondrial

  24. pI value differences in different sub-cellular locations

  25. Flowchart showing MITOPRED procedure

  26. MITOPRED Web Server • Accessible at http://mitopred.sdsc.edu • Implemented with a Perl/CGI interface • Pre-calculated predictions are available for all eukaryotic proteins from the Swiss-Prot and TrEMBL databases (~500000) • Genome-scale predictions can be downloaded for yeast, C. elegans, Drosophila, human, mouse and Arabidopsis • Provides data for the Mitoproteome database, accessible at http://www.mitoproteome.org

  27. Prediction of sub-cellular location by lexical analysis • Separate SP (Swiss-Prot) proteins into different sub-cellular classes based on annotation • In each class, extract all unique keywords for each sequence • The total number of keywords in all classes defines the feature space (N) • Generate a binary vector of length N for each sequence in each class: 1 if the keyword is present and 0 if it is absent • For the unknown protein, generate a binary vector in the same way, based on its keywords. From this, generate 2^k - 1 sub-vectors (where k is the number of keywords in the unknown) by flipping 1s to 0s • Based on the sub-vectors, retrieve all proteins with matching binary vectors from all classes • The unknown is assigned to the class that contributes the largest number of sequences to the retrieved group • The method works better when more keywords are available and the family is larger
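
A minimal sketch of the voting procedure above. Keyword sets stand in for binary vectors (a set lists the positions that hold a 1), matching is taken to mean an exact match between a sub-vector and a stored protein's vector, and the annotated examples are hypothetical.

```python
from collections import Counter
from itertools import combinations

def sub_keyword_sets(keywords):
    """All 2^k - 1 non-empty subsets of the unknown's keywords, largest first."""
    ordered = sorted(keywords)
    for size in range(len(ordered), 0, -1):
        for subset in combinations(ordered, size):
            yield frozenset(subset)

def predict_location(unknown_keywords, annotated):
    """annotated: list of (keyword_set, location) pairs for proteins of known location."""
    votes = Counter()
    for subset in sub_keyword_sets(unknown_keywords):
        for keywords, location in annotated:
            if frozenset(keywords) == subset:
                votes[location] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical annotated proteins, for illustration only.
annotated = [
    ({"dna-binding", "nuclear protein"}, "nucleus"),
    ({"dna-binding"}, "nucleus"),
    ({"transport"}, "mitochondrion"),
]
unknown = {"dna-binding", "nuclear protein", "transport"}
print(len(list(sub_keyword_sets(unknown))))   # 7 = 2^3 - 1
print(predict_location(unknown, annotated))   # nucleus (2 votes against 1)
```

Because the number of sub-vectors grows as 2^k, a practical version would need to cap k or index the stored vectors instead of scanning them.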

  28. Flow diagram of lexical analysis method (From Nair R, Rost B, Bioinformatics 18:S78-S86, 2002)

  29. Predicting Post-translational Modification Sites of Proteins

  30. General Method for PTM site Prediction • PROSITE provides consensus patterns for many PTM sites; however, in most cases these patterns are very short, and true modifications depend on the structural or environmental context in the protein fold • For this reason, methods based on regular expressions or local alignment produce a large number of false positives • Almost all methods used in PTM site prediction rely on artificial neural networks (ANNs) • General procedure: • Prepare datasets of sequences experimentally known to possess a given type of PTM site • Separate the dataset into training and test data • Train a network using the training data and test it with the test set; iterate until the model is well refined • A sufficient number of training sequences and good-quality data are important for the success of any neural network method
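
To illustrate why short patterns over-predict, here is a sketch that scans for the PROSITE N-glycosylation consensus N-{P}-[ST]-{P} (entry PS00001). The pattern is not given in the slides and is brought in here as an example; the test sequence is made up.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} as a regex.
# The lookahead makes overlapping candidate sites visible.
N_GLYC = re.compile(r"(?=N[^P][ST][^P])")

def candidate_n_glyc_sites(seq):
    """1-based positions of Asn residues that fit the consensus pattern."""
    return [m.start() + 1 for m in N_GLYC.finditer(seq.upper())]

# Made-up sequence: Asn at positions 2 and 11 fit the pattern, Asn at 7 does not.
print(candidate_n_glyc_sites("MNGTAANPSANVSK"))  # [2, 11]
```

Every fitting position is reported regardless of whether the residue is exposed or the protein ever enters the secretory pathway, which is how such scans accumulate false positives.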

  31. Different Post-translational modifications (PTMs) • Glycosylation • ASN(N)-glycosylation (NetNGlyc) • O-glycosylation (NetOGlyc) • Sulfation (Sulfinator) • Phosphorylation (NetPhos) • Myristoylation (NMT)

  32. Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc) • Glycoproteins are formed by covalent attachment of oligosaccharides to certain proteins at Asn (N-glycosylation) or Ser/Thr (O-glycosylation) residues. • They are usually exported to extracellular destinations, like mucin in the alimentary tract or the glycoprotein hormones of the anterior pituitary gland. • N-glycosylation • O-glycosylation: no consensus pattern; the SEA domain is associated with it

  33. Prediction of Sulfation Sites • Protein tyrosine sulfation is an important post-translational modification for proteins that go through the secretory pathway. It regulates several protein-protein interactions and modulates the binding affinity of TM peptide receptors • Based on the rules described above, HMMs can be trained to build models that predict protein sequences with patterns that follow these rules

  34. Sulfinator Algorithm (http://us.expasy.org/tools/sulfinator/) • Sulfinator employs four different HMMs to recognize sulfated tyrosines that are N-terminal (HMM-N), internal (HMM-I), C-terminal (HMM-C), or located in Y-clusters (HMM-Y)

  35. Prediction of Phosphorylation Sites (NetPhos, http://www.cbs.dtu.dk/services/NetPhos/) • Protein kinases, a very large family of enzymes, catalyze phosphorylation • NetPhos produces neural network predictions for serine (S), threonine (T) and tyrosine (Y) phosphorylation sites in eukaryotic proteins; these sites affect a multitude of cellular signaling processes • Y-kinase phosphorylation • S or T phosphorylation by Casein Kinase II • Since these are very short patterns, the amino acids surrounding a phosphorylated residue are significant in determining whether a particular site is phosphorylated or not
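
Since the surrounding residues matter, a predictor is usually fed a fixed-width window around each candidate S, T or Y. The sketch below extracts such windows; the flank size of 4 and the padding character "X" are assumptions for illustration, not NetPhos's documented settings.

```python
def phospho_windows(seq, flank=4, pad="X"):
    """Return (position, residue, window) for every S, T or Y in seq.

    Each window has `flank` residues on either side of the candidate site;
    positions are 1-based and sequence ends are padded with `pad`.
    """
    seq = seq.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue in "STY":
            windows.append((i + 1, residue, padded[i:i + 2 * flank + 1]))
    return windows

for position, residue, window in phospho_windows("MASKY", flank=2):
    print(position, residue, window)
```

Each window would then be encoded numerically (for example one-hot per residue) before being passed to a neural network.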

  36. Standalone Tools

  37. Local Installation of tools and databases • NCBI-Toolkit • Formatting and using BLAST • CD-HIT • CLUSTALW • HMMER package

  38. Current Trends in Bioinformatics

  39. Reductionistic Approach (Components Biology) vs. Integrative Approach (Systems Biology) • Figure labels: Cell • Structure • Function • Genomics • Transcriptomics • Proteomics • Metabolomics

  40. Highway network system in San Antonio
