
Computational Molecular Biology

Computational Molecular Biology. Bin Liu Intelligent Computing Research Center TEL: 18038100727 bliu@insun.hit.edu.cn binliu@hitsz.edu.cn. Protein and Amino Acids. Protein Primary structure – a string over alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}.


Presentation Transcript


  1. Computational Molecular Biology Bin Liu Intelligent Computing Research Center TEL: 18038100727 bliu@insun.hit.edu.cn binliu@hitsz.edu.cn

  2. Protein and Amino Acids • Protein • Primary structure – a string over alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
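Treating the primary structure as a string over the 20-letter alphabet can be sketched in a few lines of Python. The function name `is_valid_protein` is a hypothetical helper introduced here for illustration. It only checks membership in the alphabet listed on the slide and does not handle ambiguity codes such as X or B.

```python
# The 20 standard amino acids, in the order given on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def is_valid_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only the 20 letters."""
    sequence = sequence.upper()
    return len(sequence) > 0 and all(residue in AMINO_ACIDS for residue in sequence)


if __name__ == "__main__":
    print(is_valid_protein("MKTAYIAK"))  # every letter is in the alphabet
    print(is_valid_protein("MKXB"))      # X and B are not in the alphabet
```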

  3. Remote homology detection based on oligomer distances • Oligomer: in chemistry, an oligomer (from the Greek oligos, "a few") is a molecular complex that consists of a few monomer units, in contrast to a polymer, which, at least in principle, consists of a nearly unlimited number of monomers. Dimers, trimers, and tetramers, for instance, are oligomers composed of two, three, and four monomers, respectively. • Protein remote homology detection: the objective is to predict structural or functional properties of a protein from its homologs, i.e. from sequence similarity to phylogenetically related proteins for which these properties are known.

  4. Support Vector Machines (SVMs) • Support vector machines (SVMs) are supervised learning models with associated learning algorithms used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes it belongs to, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model represents the examples as points in space, mapped so that the examples of the two categories are separated by a gap that is as wide as possible. New examples are then mapped into the same space and assigned to a category according to which side of the gap they fall on.
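The idea of a linear classifier with a wide gap can be illustrated with a minimal sketch. It trains a linear SVM by sub-gradient descent on the regularised hinge loss, which is a simple stand-in chosen here for readability and not the solver used by standard SVM packages. The function names, the toy points, and the values of `reg`, `lr`, and `epochs` are all assumptions made for this example.

```python
def train_linear_svm(points, labels, reg=0.01, lr=0.01, epochs=1000):
    """Sub-gradient descent on (reg/2)*||w||^2 + hinge loss; labels are +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regulariser shrinks w a little on every step.
            w = [wi - lr * reg * wi for wi in w]
            # Examples inside the gap (margin < 1) push the boundary away.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (an assumption for illustration only).
POINTS = [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5), (3.0, 3.0), (3.5, 2.5), (2.5, 3.5)]
LABELS = [-1, -1, -1, 1, 1, 1]

if __name__ == "__main__":
    w, b = train_linear_svm(POINTS, LABELS)
    print(w, b)
    print(predict(w, b, (0.0, 0.0)), predict(w, b, (5.0, 5.0)))
```

New examples are classified by `predict`, which only looks at the sign of the decision value, matching the "which side of the gap" rule on the slide.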

  5. Feature vector • In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
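One simple way to turn a protein sequence into a numerical feature vector is its amino acid composition, sketched below. This is only an illustrative choice and is not stated on the slide. `composition_vector` is a hypothetical name, and characters outside the 20-letter alphabet are silently ignored.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def composition_vector(sequence):
    """Map a protein sequence to a 20-dimensional vector of residue frequencies."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


if __name__ == "__main__":
    vector = composition_vector("AAGC")
    print(len(vector))  # one dimension per amino acid
    print(vector[AMINO_ACIDS.index("A")])
```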

  6. SVMs

  7. Protein Remote Homology Detection • Background • Problem definition: a classification problem • Sequence similarity ranges from high to low • Figure: schematic plot of the hierarchy of the SCOP database

  8. The benchmark (Liao and Noble, 2002)

  9. A tree (or 3-row) graph to show the remote homology system on the SCOP benchmark

  10. Our description

  11. Oligomer distance histograms (ODH)
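A rough sketch of a distance histogram over oligomer pairs is given below. The slide gives only the name of the representation, so the details are assumptions: the sketch counts, for every ordered pair of oligomers of length `k` and every distance `d` between their start positions, how often the pair occurs in the sequence. The function name, the sparse `Counter` layout, and the default `k=1` are illustrative choices and may differ from the published method.

```python
from collections import Counter


def oligomer_distance_histogram(sequence, k=1, max_distance=None):
    """Count (first_oligomer, second_oligomer, distance) triples in a sequence."""
    n = len(sequence) - k + 1
    oligomers = [sequence[i:i + k] for i in range(n)]
    if max_distance is None:
        max_distance = n - 1  # longest distance possible in this sequence
    histogram = Counter()
    for i, first in enumerate(oligomers):
        for d in range(max_distance + 1):
            j = i + d
            if j >= n:
                break
            # d == 0 pairs an oligomer with itself, i.e. a plain occurrence count.
            histogram[(first, oligomers[j], d)] += 1
    return histogram


if __name__ == "__main__":
    for key, count in sorted(oligomer_distance_histogram("ACA").items()):
        print(key, count)
```

A dense feature vector for a classifier could be obtained by fixing an ordering of all possible triples and reading the counts out in that order.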

  12. Position-specific scoring matrix (PSSM)

  13. Cross-validation • In the literature, the following four methods are often used to evaluate the quality of a predictor: • Self-consistency • Independent test • n-fold cross-validation • Jackknife cross-validation
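The difference between n-fold and jackknife cross-validation comes down to how the samples are split, which the following sketch illustrates. The function names are hypothetical, and samples are assigned to folds by position without shuffling, which is an assumption made to keep the output deterministic. In practice the order is usually randomised first.

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, starting at fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def jackknife_splits(n_samples):
    """Jackknife (leave-one-out) is n-fold cross-validation with one sample per fold."""
    return n_fold_splits(n_samples, n_samples)


if __name__ == "__main__":
    for train, test in n_fold_splits(6, 3):
        print("train", train, "test", test)
```

Self-consistency corresponds to training and testing on all indices at once, and an independent test uses a separate dataset that never appears in training.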

  14. Overfitting • Self-consistency results are usually better than those of the other methods, because the predictor is tested on the same data it was trained on. If the self-consistency results are similar to those of jackknife (leave-one-out) and n-fold cross-validation, the predictive system is stable; otherwise it is unstable, which indicates overfitting.
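The gap between self-consistency and jackknife results can be made concrete with a deliberately extreme toy case. The sketch below uses a 1-nearest-neighbour predictor, which memorises its training data, on made-up points whose labels alternate. Both the predictor and the data are assumptions chosen for illustration and do not come from the slides.

```python
def nearest_neighbour_label(train, query):
    """Return the label of the training point closest to query."""
    return min(train, key=lambda item: abs(item[0] - query))[1]


def self_consistency_accuracy(data):
    """Train and test on the same data."""
    hits = sum(nearest_neighbour_label(data, x) == y for x, y in data)
    return hits / len(data)


def jackknife_accuracy(data):
    """Leave each sample out in turn and predict it from the rest."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += nearest_neighbour_label(rest, x) == y
    return hits / len(data)


# Toy data with alternating labels: memorisable, but with no learnable pattern
# for a nearest-neighbour rule.
TOY_DATA = [(0.0, 0), (1.0, 1), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1)]

if __name__ == "__main__":
    print(self_consistency_accuracy(TOY_DATA))  # 1.0
    print(jackknife_accuracy(TOY_DATA))         # 0.0
```

Each point is its own nearest neighbour under self-consistency, so the score is perfect, while the jackknife score collapses. A large gap of this kind is the signature of overfitting described on the slide.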

  15. Strange results???

  16. Dataset issues (Lin et al. 2011)
