PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis
Outline • Introduction • Protein Subcellular Localization • Document Classification • PSLDoc • Term and its weighting scheme • Feature Reduction • SVM learning • Evaluation and Results • Discussion
Vector Space Model • Salton’s Vector Space Model • Represent each document by a high-dimensional vector in the space of words [Figure: documents mapped to vectors. Gerard Salton]
Term-Document Matrix • The term-document matrix is an $m \times n$ matrix, where $m$ is the number of terms and $n$ is the number of documents [Figure: matrix with one row per term and one column per document]
Term Weighting by TFIDF • The term frequency (tf) of term $t_i$ in a given document $d$ measures the importance of $t_i$ within that document: $tf_i = n_i / \sum_k n_k$, where $n_i$ is the number of occurrences of $t_i$ in $d$ and the denominator is the number of occurrences of all terms in $d$ • The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term $t_i$ and taking the logarithm of the quotient: $idf_i = \log\left(|D| / |\{d : t_i \in d\}|\right)$ • $|D|$: total number of documents in the corpus • $|\{d : t_i \in d\}|$: number of documents in which the term $t_i$ appears • $tfidf_i = tf_i \times idf_i$
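The TFIDF weighting above can be sketched in a few lines of Python. This is a minimal illustration, assuming each document is already tokenized into a list of terms; the function name `tfidf` and the toy terms are hypothetical, and the natural logarithm is an assumption about the log base.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # occurrences of all terms in this document
        weights.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return weights

# Toy corpus of two "documents" whose terms look like gapped-dipeptides.
docs = [["A0A", "A1C", "A0A"], ["A0A", "C2D"]]
w = tfidf(docs)
```

A term that occurs in every document (here `A0A`) gets weight 0, which is the intended effect of the idf factor.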
Predicted by 1-Nearest-Neighbor based on Cosine Similarity • Similarity between document $d$ and query $q$: $\mathrm{sim}(d, q) = \frac{d \cdot q}{\|d\| \, \|q\|}$
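A 1-nearest-neighbor prediction with cosine similarity could look like the sketch below. The function names and the toy vectors are illustrative assumptions, not part of the original system.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_1nn(query, train_vectors, train_labels):
    """Return the label of the training vector most similar to the query."""
    best = max(range(len(train_vectors)),
               key=lambda i: cosine(query, train_vectors[i]))
    return train_labels[best]

train_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
train_labels = ["CP", "IM"]
predicted = predict_1nn([1.0, 0.0, 0.1], train_vectors, train_labels)
```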
Feature Reduction • Choose the axes that show the most variation in the data • Found by linear algebra: Singular Value Decomposition (SVD) [Figure: true plot in k dimensions vs. reduced-dimensionality plot]
Singular Value Decomposition • The term-document matrix is reduced to 40 features [Figure: SVD of the term-document matrix, reduced feature size = 40]
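For reference, the reduction can be written out as a standard truncated SVD. This is the textbook formulation, with $A$ the $m \times n$ term-document matrix and $k = 40$ taken from the slide; the symbols $U$, $\Sigma$, $V$ and the projection $\hat{d}_j$ are notation assumed here rather than taken from the original.

```latex
\begin{aligned}
A &= U \Sigma V^{T}, \qquad
  U \in \mathbb{R}^{m \times r},\;
  \Sigma = \mathrm{diag}(\sigma_1 \ge \dots \ge \sigma_r > 0),\;
  V \in \mathbb{R}^{n \times r} \\
A_k &= U_k \Sigma_k V_k^{T}, \qquad k = 40 \\
\hat{d}_j &= U_k^{T} d_j \in \mathbb{R}^{k}
\end{aligned}
```

Here $A_k$ keeps only the $k$ largest singular values, and $\hat{d}_j$ is the $k$-dimensional representation of document column $d_j$.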
The Terms of Proteins - Gapped-dipeptides* • Let XdZ denote the amino acid coupling pattern of amino acid types X and Z that are separated by d amino acids • If d ranges from 0 to 20, there are 8,400 (= 20 × 20 × 21) features per vector *Liang HK, Huang CM, Ko MT, Hwang JK. The Amino Acid-Coupling Patterns in Thermophilic Proteins. Proteins: Structure, Function and Bioinformatics (2005), 59, 58-63.
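The feature count can be checked by enumerating the terms directly. The sketch below assumes the 20 standard amino acids and gaps d = 0, 1, ..., max_gap; the helper names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def gapped_dipeptide_terms(max_gap):
    """All terms XdZ for gaps d = 0..max_gap."""
    return [f"{x}{d}{z}"
            for d in range(max_gap + 1)
            for x in AMINO_ACIDS
            for z in AMINO_ACIDS]

def count_terms(sequence, max_gap):
    """Count occurrences of each gapped-dipeptide XdZ in a protein sequence."""
    counts = Counter()
    for d in range(max_gap + 1):
        # Residues at positions i and i + d + 1 are separated by d amino acids.
        for i in range(len(sequence) - d - 1):
            counts[f"{sequence[i]}{d}{sequence[i + d + 1]}"] += 1
    return counts

n_features = len(gapped_dipeptide_terms(20))
toy_counts = count_terms("MKAD", max_gap=2)
```

With `max_gap=20` the enumeration yields 20 × 20 × 21 = 8,400 terms. In the toy sequence `MKAD`, M and D are separated by two residues, so `M2D` occurs once.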
Term Weighting Scheme – TF Position Specific Score Matrix (1/2) • Position Specific Score Matrix (PSSM) : A PSSM is constructed from a multiple alignment of the highest scoring hits in the BLAST search
Term Weighting Scheme – TF Position Specific Score Matrix (2/2) • The weight of XdZ in a protein P of length n: $W(XdZ, P) = \sum_{i=1}^{n-(d+1)} f(i, X) \times f(i+d+1, Z)$, where f(i,Y) denotes the normalized value of the PSSM entry at the ith row and the column corresponding to amino acid type Y • An example: W(M2D,P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D) = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
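The weight of a gapped-dipeptide can be computed from a PSSM as sketched below. The slide does not state how PSSM entries are normalized; the logistic function used here is an assumption, chosen because it reproduces the example values (a raw score of 10 maps to 0.99995 and a raw score of -3 to 0.04743). The PSSM layout (one dict per row) and the toy scores are hypothetical.

```python
import math

def normalize(score):
    """Map a raw PSSM score into (0, 1). ASSUMPTION: logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

def gapped_dipeptide_weight(pssm, x, d, z):
    """Sum of f(i, X) * f(i + d + 1, Z) over all valid rows i.

    pssm: list of rows, one per sequence position, each a dict mapping an
    amino acid letter to its raw PSSM score (hypothetical layout).
    """
    n = len(pssm)
    return sum(normalize(pssm[i][x]) * normalize(pssm[i + d + 1][z])
               for i in range(n - d - 1))

# Toy 4-residue PSSM restricted to the two columns needed for M2D.
toy_pssm = [
    {"M": 10, "D": 0},
    {"M": -2, "D": 0},
    {"M": 0, "D": 0},
    {"M": 0, "D": -3},
]
weight = gapped_dipeptide_weight(toy_pssm, "M", 2, "D")
```

For this 4-row toy matrix only i = 1 contributes, so the weight is f(1,M) × f(4,D), the first product in the slide's example.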
Feature Reduction - Probabilistic Latent Semantic Analysis (1/3)
Feature Reduction - Probabilistic Latent Semantic Analysis (2/3) • A joint probability between a term w and a document d can be modeled as: $P(w, d) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)$ • $z$: latent variable (“small” number of states) • $P(w \mid z)$: concept expression probabilities • $P(z \mid d)$: document-specific mixing proportions • The parameters can be estimated by maximizing the likelihood with the EM algorithm.
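A compact EM loop for PLSA is sketched below to make the estimation step concrete. It is a plain, untempered EM on a small dense matrix, not the implementation used in PSLDoc; the function name, the random initialization, and the fixed iteration count are all assumptions.

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM.

    counts[d][w]: non-negative weight of term w in document d.
    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalized(row):
        total = sum(row)
        if total <= 0:
            return [1.0 / len(row)] * len(row)
        return [v / total for v in row]

    # Random positive initialization of both parameter sets.
    p_w_z = [normalized([rng.random() + 1e-3 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalized([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_terms):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(w|z) * P(z|d).
                post = normalized([p_w_z[z][w] * p_z_d[d][z]
                                   for z in range(n_topics)])
                # M-step accumulators: expected counts per topic.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        p_w_z = [normalized(row) for row in new_w_z]
        p_z_d = [normalized(row) for row in new_z_d]
    return p_w_z, p_z_d

toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
```

Each row of `p_z_d` is the reduced representation of one document: a probability vector over topics whose length equals the number of topics.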
Feature Reduction - Probabilistic Latent Semantic Analysis (3/3)
Classifier – Support Vector Machines • Support Vector Machines (SVM) • LIBSVM software* • Five one-vs-rest SVM classifiers corresponding to the five localization sites: SVM_CP (CP vs. non-CP), SVM_IM (IM vs. non-IM), SVM_PP (PP vs. non-PP), SVM_OM (OM vs. non-OM), SVM_EC (EC vs. non-EC) • Kernel: Radial Basis Function (RBF) • Parameter selection • c (cost) and γ (gamma) are optimized • five-fold cross-validation *Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
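Two ingredients of this setup can be illustrated without LIBSVM itself: relabeling the training data for a one-vs-rest classifier, and the RBF kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$. The sketch below is illustrative only; the helper names are hypothetical, and the actual training and the search over c and γ are left to the SVM library.

```python
import math

SITES = ["CP", "IM", "PP", "OM", "EC"]  # the five localization sites

def one_vs_rest_labels(labels, site):
    """+1 for proteins at `site`, -1 for all other proteins."""
    return [1 if label == site else -1 for label in labels]

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

labels = ["CP", "IM", "CP", "EC"]
# One binary label vector per site, one binary SVM trained on each.
binary_tasks = {site: one_vs_rest_labels(labels, site) for site in SITES}
```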
System Architecture PSLDoc Protein Subcellular Localization prediction by Document classification
Data set (1/3) • Gram-negative bacteria: PS1444 • ePSORTdb version 2.0 Gram-negative • 1444 proteins • Split by pairwise sequence identity: PSHigh783 (identity > 30%) and PSLow661 (the remaining 661 proteins)
Data set (2/3) • Eukaryotic proteins, 7579 proteins, 12 localization sites Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-1663.
Data set (3/3) • Human data set, 2197 proteins, 9 localization sites Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.
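Evaluation • Accuracy (Acc): $Acc = \sum_{i=1}^{l} TP_i \,/\, \sum_{i=1}^{l} N_i$ • $l = 5$ is the total number of localization sites • $TP_i$ is the number of correctly predicted proteins in localization site $i$ • $N_i$ is the number of proteins in localization site $i$ • Matthews correlation coefficient (MCC): $MCC_i = \frac{TP_i \, TN_i - FP_i \, FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}$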
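Both measures are straightforward to compute from true and predicted labels. The sketch below uses the standard definitions; the function names and the toy labels are illustrative, and returning 0.0 when the MCC denominator vanishes is a convention assumed here.

```python
import math

def overall_accuracy(true_labels, pred_labels):
    """Fraction of proteins whose predicted site equals the true site."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)

def mcc(true_labels, pred_labels, site):
    """Matthews correlation coefficient for one localization site."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p == site and t == site:
            tp += 1
        elif p == site:
            fp += 1
        elif t == site:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

true_sites = ["CP", "CP", "IM", "OM"]
pred_sites = ["CP", "IM", "IM", "OM"]
acc = overall_accuracy(true_sites, pred_sites)
```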
Simple Prediction Methods (1/2) • 1NN_TFIDF : 1NN + gapped-dipeptides + TFIDF • 1NN_TFPSSM : 1NN + gapped-dipeptides + PSSM
Simple Prediction Methods (2/2) • 1NN_PSI-BLASTps, 1NN_PSI-BLASTnr • 1NN_ClustalW [Figure: workflow linking Query Protein, PSI-BLAST, PSSM, NCBI nr Database, Training Database, ClustalW, and Similar Protein]
The comparison of 1NN_TFIDF and 1NN_TFPSSM on the PSHigh783 and PSLow661 data sets.
Comparison of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps and 1NN_PSI-BLASTnr
Evaluation and Results *HYBRID combines the results of CELLO II and ALIGN.
Prediction Confidence • The confidence of the final predicted class • Prediction Confidence = the largest probability − the second largest probability • Example: when SVM_CP gives the largest probability and SVM_OM the second largest, Prediction Confidence = SVM_CP − SVM_OM
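The confidence computation can be sketched as follows, assuming each of the five one-vs-rest classifiers returns a probability for its own site. The function name and the probability values are made up for illustration.

```python
def predict_with_confidence(probabilities):
    """probabilities: {site: probability from that site's one-vs-rest SVM}.

    Returns (predicted_site, confidence), where confidence is the largest
    probability minus the second largest. Needs at least two sites.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best_site, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_site, best_p - second_p

# Hypothetical classifier outputs for one query protein.
site, confidence = predict_with_confidence(
    {"CP": 0.8, "IM": 0.05, "PP": 0.02, "OM": 0.3, "EC": 0.01}
)
```

A prediction threshold then amounts to accepting only predictions whose confidence exceeds a chosen value, trading coverage for accuracy.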
Prediction Threshold (3/3) *The threshold is set such that the coverage is similar to that of PSLT.
Gapped-peptide signature • The size of topics = 80
Gapped-peptide signature • The site-topic preference of the topic z for a localization site l = average { P(z|d) | d (a protein) belongs to class l } [Figure labels: Acc. = 90, Acc. = 89]
Gapped-peptide signature • Distance = 13 (The size of gapped-dipeptides = 5,600)
Gapped-peptide signature • For each localization site, ten preferred topics according to site-preference confidence ( = the largest site-topic preference - the second largest site-topic preference) • For each topic, five most frequent gapped-dipeptides are selected.
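The topic selection described on these slides can be sketched as below. The code follows one reading of the procedure: a topic is assigned to the site with the largest site-topic preference, and topics are ranked per site by the gap between the largest and second largest preference. That reading, the function names, and the toy data are assumptions.

```python
def site_topic_preference(p_z_d, labels):
    """Average P(z|d) over the proteins d of each localization site.

    p_z_d[d][z] = P(z|d); labels[d] = localization site of protein d.
    Returns {site: [mean P(z|d) for each topic z]}.
    """
    sums, sizes = {}, {}
    for distribution, site in zip(p_z_d, labels):
        totals = sums.setdefault(site, [0.0] * len(distribution))
        for z, p in enumerate(distribution):
            totals[z] += p
        sizes[site] = sizes.get(site, 0) + 1
    return {site: [v / sizes[site] for v in totals]
            for site, totals in sums.items()}

def preferred_topics(preference, site, top_n=10):
    """Topics where `site` has the largest preference, ranked by confidence.

    Confidence = largest preference - second largest preference over all
    sites, so at least two sites are required.
    """
    def confidence(z):
        ranked = sorted((preference[s][z] for s in preference), reverse=True)
        return ranked[0] - ranked[1]

    n_topics = len(preference[site])
    own = [z for z in range(n_topics)
           if preference[site][z] == max(preference[s][z] for s in preference)]
    return sorted(own, key=confidence, reverse=True)[:top_n]

toy_p_z_d = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
toy_labels = ["CP", "CP", "IM"]
pref = site_topic_preference(toy_p_z_d, toy_labels)
```

For each selected topic z, the five most frequent gapped-dipeptides would then be read off as the five terms w with the largest P(w|z).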
Gapped-dipeptide signatures reflecting motifs relevant to protein localization sites • In integral membrane proteins, helix-helix interactions are stabilized by aromatic residues. Specifically, the aromatic motif (WXXW or W2W) is involved in the dimerization of transmembrane domains by π-π interactions. • In the outer membrane class, the C-terminal signature sequence is recognized by the assembly factor OMP85, which regulates the insertion and integration of OM proteins in the outer membrane of Gram-negative bacteria. The C-terminal signature sequence contains a Phe (F) at the C-terminal position, preceded by a strong preference for a basic amino acid (K, R). => R0F
The amino acid compositions of single residues and gapped-dipeptide signatures for each localization site
The grouped amino acid compositions of single residues and gapped-dipeptide signature Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)
Gapped-dipeptide signatures and their amino acid compositions for each localization site Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)
Gapped-dipeptide signatures and their amino acid compositions for each localization site • IM has a high percentage of non-polar amino acids (60%) and no charged amino acids (0%). • This reflects the physico-chemical properties of the lipid bilayer: non-polar amino acids are favored in the transmembrane domains of IM proteins. • Charged amino acids are disfavored because of the energetic penalty they incur in the assembly of IM proteins. • CP and EC classes have a high percentage of charged and polar amino acids, respectively. • The role of charged amino acids in the cytoplasm is probably related to pH homeostasis, in which they act as buffers, whereas secreted proteins in the EC class may require more polar amino acids to promote interactions in the solvent environment.