Computational Molecular Biology

Bin Liu, Intelligent Computing Research Center. TEL: 18038100727. Email: bliu@insun.hit.edu.cn, binliu@hitsz.edu.cn

Protein and Amino Acids • Protein • Primary structure – a string over the 20-letter amino acid alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
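As a small illustration of "a string over an alphabet", the sketch below checks whether a sequence uses only the 20 letters listed above. The names `AMINO_ACIDS` and `is_valid_protein` are hypothetical and chosen only for this example; real sequence files may also contain ambiguity codes such as X or B, which this check deliberately rejects.

```python
# The 20 standard amino acids, in the order given on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def is_valid_protein(seq: str) -> bool:
    """Return True if seq is a non-empty string over the 20-letter alphabet."""
    seq = seq.upper()
    return len(seq) > 0 and all(ch in AMINO_ACIDS for ch in seq)


if __name__ == "__main__":
    print(is_valid_protein("MKVLA"))  # every letter is in the alphabet
    print(is_valid_protein("MKXB"))   # X and B are not in the alphabet
```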

Remote homology detection based on oligomer distances • Oligomer: in chemistry, an oligomer (from the Greek oligo-, "a few") is a molecular complex that consists of a few monomer units, in contrast to a polymer, which, at least in principle, consists of a nearly unlimited number of monomers. Dimers, trimers, and tetramers, for instance, are oligomers composed of two, three, and four monomers, respectively. • Protein remote homology detection: the objective is to predict structural or functional properties of proteins by means of homology, i.e. from sequence similarity to phylogenetically related proteins for which these properties are known.

Support Vector Machines (SVMs) • Support vector machines (SVMs) are supervised learning models with associated learning algorithms used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes it belongs to, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
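The "widest gap" idea can be sketched with a tiny linear SVM trained by batch sub-gradient descent on the L2-regularised hinge loss. This is a teaching sketch only, not the solver used in the lecture or in any SVM library; the function names and the hyperparameters `lam`, `lr`, and `epochs` are assumptions made for the example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))).

    X is a list of feature lists, y a list of labels in {+1, -1}.
    Uses plain batch sub-gradient descent (an illustrative choice).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point lies inside the margin: hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


if __name__ == "__main__":
    X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
    y = [1, 1, 1, -1, -1, -1]
    w, b = train_linear_svm(X, y)
    print(predict(w, b, [2.5, 2.5]), predict(w, b, [-1, -4]))
```

In practice one would normally rely on an established SVM implementation, which also supports kernels; the loop above is only meant to show how margin violations drive the position of the separating hyperplane.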

Feature vector • In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis.
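One of the simplest feature vectors for a protein is its amino acid composition: a 20-dimensional vector holding the relative frequency of each residue. The sketch below is an illustration of the concept, not the representation used later in these slides; `composition_vector` is a hypothetical name.

```python
from collections import Counter

# The 20 standard amino acids; this order fixes the vector's dimensions.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def composition_vector(seq):
    """Map a protein sequence to a 20-dimensional vector of residue frequencies."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq.upper())
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vec = composition_vector("AAGV")
    print(len(vec), vec[0])  # dimension of the vector and the frequency of 'A'
```

Sequences of different lengths all map to vectors of the same dimension, which is what makes them usable as input to a classifier such as an SVM.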

SVMs

Protein Remote Homology Detection • Background • Problem definition: a classification problem. Sequence similarity ranges from high to low across the levels of the hierarchy. • Figure: schematic plot of the hierarchy of the SCOP database

The benchmark (Liao and Noble, 2002)

A tree (or 3-row) graph to show the remote homology system on the SCOP benchmark

Our description

oligomer distance histograms (ODH)

position-specific scoring matrix (PSSM)

Cross-validation • In the literature, the following methods are often used to evaluate the quality of a predictor: • Self-consistency test • Independent test • n-fold cross-validation • Jackknife (leave-one-out) cross-validation
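The difference between n-fold and jackknife cross-validation comes down to how the samples are split into training and test sets. The sketch below generates the index splits; `k_fold_indices` and `jackknife_indices` are hypothetical helper names, and the folds are assigned in a simple round-robin fashion without shuffling or stratification, which a real evaluation would usually add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for n-fold cross-validation.

    Fold i tests on samples i, i + n_folds, i + 2 * n_folds, ...
    and trains on all remaining samples.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def jackknife_indices(n_samples):
    """Jackknife (leave-one-out) is n-fold cross-validation with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


if __name__ == "__main__":
    for train, test in k_fold_indices(6, 3):
        print("train:", train, "test:", test)
```

In these terms, a self-consistency test corresponds to training and testing on the full index list, and an independent test uses a separate dataset that never enters training.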

Overfitting • Self-consistency results are commonly better than those of the other tests, because the predictor is evaluated on the same data it was trained on. If the self-consistency results are similar to those of jackknife (leave-one-out) and n-fold cross-validation, the predictive system is stable; otherwise it is unstable, which indicates overfitting.
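The gap between self-consistency and jackknife results can be illustrated with a 1-nearest-neighbour classifier, which memorises its training data. The toy dataset and the function names below are invented for this example and are not taken from the lecture.

```python
def nearest_label(train, x, skip=None):
    """Return the label of the training point closest to x (optionally skipping one index)."""
    best_dist, best_label = None, None
    for i, (xi, yi) in enumerate(train):
        if i == skip:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(xi, x))
        if best_dist is None or dist < best_dist:
            best_dist, best_label = dist, yi
    return best_label


def self_consistency_accuracy(data):
    """Train and test on the same samples."""
    hits = sum(nearest_label(data, x) == y for x, y in data)
    return hits / len(data)


def jackknife_accuracy(data):
    """Leave each sample out in turn and predict it from the rest."""
    hits = sum(nearest_label(data, x, skip=i) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)


if __name__ == "__main__":
    # Invented 1-D toy data with two "noisy" labels in the middle.
    data = [((0.0,), 0), ((0.9,), 0), ((2.0,), 1),
            ((2.8,), 0), ((4.0,), 1), ((4.7,), 1)]
    print("self-consistency:", self_consistency_accuracy(data))
    print("jackknife:", jackknife_accuracy(data))
```

On this toy data the self-consistency accuracy is perfect, because every point is its own nearest neighbour, while the jackknife accuracy is clearly lower. A large gap of this kind is the warning sign for overfitting described above.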

Strange results???

Dataset issues (Lin et al. 2011)
