Outline

PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

Outline • Introduction • Protein Subcellular Localization • Document Classification • PSLDoc • Term and its weighting scheme • Feature Reduction • SVM learning • Evaluation and Results • Discussion

Protein Subcellular Localization

Document Classification

Vector Space Model • Salton’s Vector Space Model (Gerald Salton) • Represent each document by a high-dimensional vector in the space of words [Figure: documents mapped to vectors]

Vectors in Term Space

Term-document matrix is mn matrix wheremis number of terms and n is number of documents Term-Document Matrix document term

Term Weighting by TFIDF • The term frequency (tf) of the term $t_i$ in a given document $d$ measures the importance of $t_i$ within that document: $\mathrm{tf}_i = n_i / \sum_k n_k$, where $n_i$ is the number of occurrences of $t_i$ in $d$ and the denominator is the number of occurrences of all terms in $d$ • The inverse document frequency (idf) is obtained by dividing the number of all documents by the number of documents containing the term $t_i$ and taking the logarithm of the quotient: $\mathrm{idf}_i = \log\left( |D| / |\{d : t_i \in d\}| \right)$ • $|D|$: total number of documents in the corpus • $|\{d : t_i \in d\}|$: number of documents in which the term $t_i$ appears • $\mathrm{tfidf}_i = \mathrm{tf}_i \times \mathrm{idf}_i$
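The TFIDF weighting can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function name `tfidf`, the toy corpus, and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of documents, each a list of terms.
    Returns one {term: tf*idf} dictionary per document."""
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())  # occurrences of all terms in this document
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

# Toy corpus (hypothetical): three documents over the terms a, b, c, d.
docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
w = tfidf(docs)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the vector, which is the intended effect of the idf factor.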

Predicted by 1-Nearest-Neighbor (1NN) based on Cosine Similarity • Similarity between a document $d$ and a query $q$: $\cos(d, q) = \dfrac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}$
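A 1NN prediction by cosine similarity might look as follows. The function names, the toy vectors, and the labels are hypothetical; the query simply receives the label of its most similar training vector.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define the similarity with a zero vector as 0 to avoid dividing by zero.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_1nn(query, training):
    """training: list of (vector, label) pairs.
    Returns the label of the training vector most similar to the query."""
    _, best_label = max(training, key=lambda item: cosine(query, item[0]))
    return best_label

# Toy training set (hypothetical vectors and labels).
train = [([1.0, 0.0, 0.0], "CP"), ([0.0, 1.0, 1.0], "OM")]
label = predict_1nn([0.1, 0.9, 0.8], train)
```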

Feature Reduction • Find the best choice of axes: the ones that show the most variation in the data • Found by linear algebra: Singular Value Decomposition (SVD) [Figure: true plot in k dimensions vs. reduced-dimensionality plot]

Singular Value Decomposition • The term-document matrix is reduced to 40 features (reduced feature size = 40) [Figure: SVD of the term-document matrix]
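To make "the axis that shows the most variation" concrete, the sketch below finds only the first singular triplet of a small term-document matrix by power iteration. Power iteration is a swapped-in technique chosen to keep the example dependency-free; a real system would call a library SVD routine and keep the top 40 axes. All names and the toy matrix are assumptions.

```python
import math

def top_singular_triplet(A, iterations=200):
    """Return (sigma, u, v) for the largest singular value of A.
    A is a list of rows (terms x documents). Uses power iteration on A^T A,
    which assumes the start vector is not orthogonal to the top axis."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        w = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Toy matrix (hypothetical): 3 terms x 2 documents.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, u, v = top_singular_triplet(A)
# Coordinate of each document on the first reduced axis: u^T a_j = sigma * v_j.
coords = [sigma * x for x in v]
```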

The Terms of Proteins - Gapped-dipeptides* • Let XdZ denote the amino acid coupling pattern of amino acid types X and Z that are separated by d amino acids • If d ranges from 0 to 20, there are 8400 (= 20 × 20 × 21) features in a vector • *Liang HK, Huang CM, Ko MT, Hwang JK. The Amino Acid-Coupling Patterns in Thermophilic Proteins. Proteins: Structure, Function and Bioinformatics (2005), 59, 58-63.

Term Weighting Scheme – TF Position Specific Score Matrix (1/2) • Position Specific Score Matrix (PSSM) : A PSSM is constructed from a multiple alignment of the highest scoring hits in the BLAST search

Term Weighting Scheme – TF Position Specific Score Matrix (2/2) • The weight of XdZ in a protein $P$ of length $L$: $W(XdZ, P) = \sum_{i=1}^{L-(d+1)} f(i, X) \times f(i+d+1, Z)$, where $f(i, Y)$ denotes the normalized value of the PSSM entry at the $i$th row and the column corresponding to amino acid type $Y$ • An example: W(M2D,P) = f(1,M) × f(4,D) + f(2,M) × f(5,D) + … + f(78,M) × f(81,D) = 0.99995×0.04743 + 0.11920×0.00247 + … + 0.00669×0.26894
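The weight W(XdZ, P) can be sketched as below. The slide does not say how PSSM entries are normalized; the example values (0.99995, 0.04743, ...) agree with a logistic function applied to integer scores such as 10 and −3, so the sketch assumes logistic normalization. The function names and the toy PSSM are hypothetical, and positions are 0-based in the code while the slide counts from 1.

```python
import math

def normalize(score):
    # Assumption: logistic squashing of the raw PSSM score into (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

def gapped_dipeptide_weight(pssm, x, d, z):
    """pssm: list of {amino acid: raw score} dictionaries, one per position.
    Sums f(i, X) * f(i + d + 1, Z) over every position i where both exist."""
    return sum(
        normalize(pssm[i][x]) * normalize(pssm[i + d + 1][z])
        for i in range(len(pssm) - d - 1)
    )

# Toy PSSM (hypothetical scores) for a protein of length 4, columns M and D only.
toy_pssm = [
    {"M": 10, "D": -4},
    {"M": -2, "D": 0},
    {"M": -1, "D": 1},
    {"M": -5, "D": -3},
]
w_m2d = gapped_dipeptide_weight(toy_pssm, "M", 2, "D")  # single term: f(1,M) * f(4,D)
```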

Feature Reduction - Probabilistic Latent Semantic Analysis (1/3)

Feature Reduction - Probabilistic Latent Semantic Analysis (2/3) • A joint probability between a term $w$ and a document $d$ can be modeled as: $P(d, w) = P(d) \sum_{z} P(w \mid z) \, P(z \mid d)$ • $z$: latent variable (“small” number of states) • $P(w \mid z)$: concept expression probabilities • $P(z \mid d)$: document-specific mixing proportions • The parameters can be estimated by maximizing the likelihood with the EM algorithm.
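The EM estimation can be illustrated with a compact, unoptimized sketch. It is not the presenters' implementation: the function names, the random initialization, the fixed iteration count, and the toy weight matrix are all assumptions, and every document is assumed to contain at least one term.

```python
import random

def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def plsa(counts, n_topics, iterations=100, seed=0):
    """counts[d][w]: weight of term w in document d (non-negative).
    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(iterations):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                total = sum(joint)
                for z in range(n_topics):
                    expected = counts[d][w] * joint[z] / total
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        # M-step: renormalize the expected counts into probabilities.
        p_w_z = [_normalize(row) for row in new_w_z]
        p_z_d = [_normalize(row) for row in new_z_d]
    return p_w_z, p_z_d

# Toy matrix (hypothetical): 4 documents x 4 terms with two obvious groups.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
```

Each row of `p_z_d` is the reduced representation of one document: a vector with one entry per topic instead of one entry per term.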

Feature Reduction - Probabilistic Latent Semantic Analysis (3/3)

Classifier – Support Vector Machines • Support Vector Machines (SVM) • LIBSVM software* • Five one-vs-rest SVM classifiers corresponding to the five localization sites: SVM_CP (CP vs. non-CP), SVM_IM (IM vs. non-IM), SVM_PP (PP vs. non-PP), SVM_OM (OM vs. non-OM), SVM_EC (EC vs. non-EC) • Kernel: Radial Basis Function (RBF) • Parameter selection: c (cost) and γ (gamma) are optimized with five-fold cross-validation • *Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

System Architecture • PSLDoc: Protein Subcellular Localization prediction by Document classification

Data set (1/3) • Gram-negative bacteria: PS1444 • ePSORTdb version 2.0 Gram-negative • 1444 proteins, split by pairwise sequence identity into PSHigh783 (identity > 30%) and PSLow661 (the remaining 661 proteins)

Data set (2/3) • Eukaryotic proteins, 7579 proteins, 12 localization sites Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-1663.

Data set (3/3) • Human data set, 2197 proteins, 9 localization sites Scott MS, Thomas DY, Hallett MT. Predicting subcellular localization via protein motif co-occurrence. Genome Res 2004;14(10A):1957-1966.

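Evaluation • Accuracy (Acc): $\mathrm{Acc} = \sum_{i=1}^{l} TP_i / \sum_{i=1}^{l} N_i$ • $l = 5$ is the total number of localization sites • $N_i$ is the number of proteins in localization site $i$, and $TP_i$ is the number of those proteins that are predicted correctly • Matthews correlation coefficient (MCC) for site $i$: $\mathrm{MCC}_i = \dfrac{TP_i \, TN_i - FP_i \, FN_i}{\sqrt{(TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)}}$, where $TP_i$, $TN_i$, $FP_i$ and $FN_i$ are the numbers of true positives, true negatives, false positives and false negatives for site $i$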
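Both measures are easy to compute from paired lists of true and predicted sites. The sketch below uses the standard definitions; the function names and the six-protein example are hypothetical.

```python
import math

def overall_accuracy(true_sites, predicted_sites):
    """Fraction of proteins whose predicted site equals the true site."""
    correct = sum(t == p for t, p in zip(true_sites, predicted_sites))
    return correct / len(true_sites)

def mcc(true_sites, predicted_sites, site):
    """Matthews correlation coefficient of one site treated as the positive class."""
    pairs = list(zip(true_sites, predicted_sites))
    tp = sum(t == site and p == site for t, p in pairs)
    tn = sum(t != site and p != site for t, p in pairs)
    fp = sum(t != site and p == site for t, p in pairs)
    fn = sum(t == site and p != site for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example (hypothetical labels).
truth = ["CP", "CP", "IM", "OM", "EC", "PP"]
predicted = ["CP", "IM", "IM", "OM", "EC", "CP"]
acc = overall_accuracy(truth, predicted)
mcc_cp = mcc(truth, predicted, "CP")
```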

Simple Prediction Methods (1/2) • 1NN_TFIDF : 1NN + gapped-dipeptides + TFIDF • 1NN_TFPSSM : 1NN + gapped-dipeptides + PSSM

Simple Prediction Methods (2/2) • 1NN_PSI-BLASTps, 1NN_PSI-BLASTnr • 1NN_ClustalW [Figure: flow diagrams with the labels query protein, PSI-BLAST, PSSM, NCBI nr database, training database, ClustalW and similar protein]

The comparison of 1NN_TFIDF and 1NN_TFPSSM on the PSHigh783 and PSLow661 data sets.

Comparison of 1NN_TFPSSM, 1NN_ClustalW, 1NN_PSI-BLASTps and 1NN_PSI-BLASTnr

Evaluation and Results *HYBRID combines the results of CELLO II and ALIGN.

Evaluation and Results

Prediction Confidence • The confidence of the final predicted class • Prediction Confidence = the largest probability − the second largest probability • Example: if SVM_CP gives the largest probability and SVM_OM the second largest, Prediction Confidence = SVM_CP − SVM_OM
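A minimal sketch of this confidence measure, together with a threshold that lets the predictor abstain on low-confidence cases, is given below. The function names, the probability values, and the threshold are invented for illustration; the five probabilities come from separate one-vs-rest classifiers and therefore need not sum to 1.

```python
def predict_with_confidence(probabilities):
    """probabilities: {site: probability estimate of that site's one-vs-rest SVM}.
    Returns (predicted site, largest probability - second largest probability)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_site, best_p), (_, second_p) = ranked[0], ranked[1]
    return best_site, best_p - second_p

def predict_or_abstain(probabilities, threshold):
    """Return the predicted site, or None when the confidence is below threshold."""
    site, confidence = predict_with_confidence(probabilities)
    return site if confidence >= threshold else None

# Hypothetical outputs of the five classifiers for one protein.
probs = {"CP": 0.80, "IM": 0.05, "PP": 0.10, "OM": 0.30, "EC": 0.02}
site, confidence = predict_with_confidence(probs)
```

Raising the threshold trades coverage for precision: fewer proteins receive a prediction, but the predictions that remain are the more confident ones.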

Prediction Threshold (1/3)

Prediction Threshold (2/3)

Prediction Threshold (3/3) *The threshold is set such that the coverage is similar to that of PSLT.

Gapped-dipeptide signature • Number of topics = 80

Gapped-dipeptide signature • The site-topic preference of the topic z for a localization site l = average { P(z|d) | d (a protein) belongs to class l } [Figure annotations: Acc. = 90, Acc. = 89]

Gapped-dipeptide signature • Distance = 13 (number of gapped-dipeptides = 5,600 = 20 × 20 × 14)

Gapped-dipeptide signature • For each localization site, the ten preferred topics are selected according to site-preference confidence (= the largest site-topic preference − the second largest site-topic preference) • For each topic, the five most frequent gapped-dipeptides are selected.
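The selection of preferred topics can be sketched as follows. This is one reading of the slides, not a confirmed procedure: each topic is assigned to the site with the largest site-topic preference, and the topics of a site are ranked by the gap to the second largest preference. All names and the toy P(z|d) values are hypothetical, and at least two sites are assumed.

```python
def site_topic_preference(p_z_d, sites):
    """Average P(z|d) over the proteins d of each site.
    Returns {site: [preference for topic 0, topic 1, ...]}."""
    sums, sizes = {}, {}
    for dist, site in zip(p_z_d, sites):
        acc = sums.setdefault(site, [0.0] * len(dist))
        for z, p in enumerate(dist):
            acc[z] += p
        sizes[site] = sizes.get(site, 0) + 1
    return {site: [s / sizes[site] for s in acc] for site, acc in sums.items()}

def preferred_topics(preference, top_n):
    """For each site, the top_n topics it prefers most strongly, ranked by
    (largest site-topic preference - second largest site-topic preference)."""
    n_topics = len(next(iter(preference.values())))
    ranked = {site: [] for site in preference}
    for z in range(n_topics):
        by_site = sorted(preference, key=lambda s: preference[s][z], reverse=True)
        best, second = by_site[0], by_site[1]
        ranked[best].append((preference[best][z] - preference[second][z], z))
    return {site: [z for _, z in sorted(items, reverse=True)[:top_n]]
            for site, items in ranked.items()}

# Toy P(z|d) for four proteins and three topics (hypothetical values).
toy_p_z_d = [[0.9, 0.1, 0.0], [0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.1, 0.1, 0.8]]
toy_sites = ["CP", "CP", "OM", "OM"]
pref = site_topic_preference(toy_p_z_d, toy_sites)
top = preferred_topics(pref, top_n=1)
```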

Gapped-dipeptide signature

Gapped-dipeptide signatures reflecting motifs relevant to protein localization sites • In integral membrane proteins, helix-helix interactions are stabilized by aromatic residues. Specifically, the aromatic motif WXXW (W2W) is involved in the dimerization of transmembrane domains by π-π interactions. • In the outer membrane class, the C-terminal signature sequence is recognized by the assembly factor OMP85, which regulates the insertion and integration of OM proteins in the outer membrane of Gram-negative bacteria. The C-terminal signature sequence contains a Phe (F) at the C-terminal position, preceded by a strong preference for a basic amino acid (K, R), which corresponds to R0F.

The amino acid compositions of single residues and gapped-dipeptide signatures for each localization site

The grouped amino acid compositions of single residues and gapped-dipeptide signature Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)

Gapped-dipeptide signatures and their amino acid compositions for each localization site Amino acid groups: N (non-polar: AIGLMV), P (polar: CNPQST), C (charged: DEHKR), and A (aromatic: FYW)

Gapped-dipeptide signatures and their amino acid compositions for each localization site • IM has a high percentage of non-polar amino acids (60%) and no charged amino acids (0%). • This reflects the physico-chemical properties of the lipid bilayer, in which non-polar amino acids are favored in the transmembrane domains of IM proteins. • Charged amino acids are disfavored because of the energy penalty they incur in the assembly of IM proteins. • The CP and EC classes have a high percentage of charged and polar amino acids, respectively. • The role of charged amino acids in the cytoplasm is probably related to pH homeostasis, in which they act as buffers, whereas secreted proteins in the EC class may require more polar amino acids to promote interactions in the solvent environment.
