This study introduces a new method to enhance the discrimination between homologs and non-homologs in protein sequences. By optimizing the score function, the researchers aimed to improve the alignment accuracy and homology detection. The existing methods for calculating homology scores, such as substitution matrices and gap penalties, were analyzed for their limitations. The theory behind Z-score for alignment significance and the optimization process of the new score function were detailed. The study utilized a training set of protein pairs with low sequence identity to optimize the score function and gap penalties. The results demonstrated that the optimized score function, named OPTIMA, successfully improved the average confidence values compared to standard matrices. This approach provides a more effective tool for discriminating between homologs and non-homologs in protein sequences and overcomes the limitations of current score functions.
Optimization of a New Score Function for the Detection of Remote Homologs Kann et al.
Introduction • New method to calculate a score function, aiming to optimize the ability to discriminate between homologs and non-homologs • Existing software uses the following to compute an alignment score:
Terms of the alignment score • Score function / substitution matrix: contribution to score for each AA match/mismatch, weighted by the number of times AA i is aligned with AA j • Gap initialization: contribution to score for opening a gap, weighted by the number of gaps in the alignment • Gap extension: contribution to score for extending a gap, weighted by the number of residues in each gap beyond one
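The terms above combine into a single alignment score. A minimal sketch of that sum follows, assuming gap penalties are given as positive numbers and subtracted, and that the alignment is supplied as two equal-length strings with '-' marking gaps. The function name `alignment_score` and the three-entry `TOY_MATRIX` are illustrative only and do not come from the slides or from any real substitution matrix.

```python
from collections import Counter

# Toy substitution values for illustration only (not PAM or BLOSUM entries).
TOY_MATRIX = {("A", "A"): 4, ("A", "G"): 0, ("G", "G"): 6}


def alignment_score(aligned_a, aligned_b, matrix, gap_open, gap_extend):
    """Score a gapped pairwise alignment given as two equal-length strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    pair_counts = Counter()  # number of times residue i is aligned with residue j
    n_gaps = 0               # number of gaps in the alignment
    n_extensions = 0         # residues in each gap beyond the first
    open_side = None         # which sequence currently carries an open gap

    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if open_side == side:
                n_extensions += 1
            else:
                n_gaps += 1
                open_side = side
        else:
            # Sort the pair so a symmetric matrix needs only one entry per pair.
            pair_counts[tuple(sorted((x, y)))] += 1
            open_side = None

    substitution = sum(n * matrix[pair] for pair, n in pair_counts.items())
    return substitution - gap_open * n_gaps - gap_extend * n_extensions
```

For example, `alignment_score("AG--A", "AGGGA", TOY_MATRIX, 10, 1)` counts three aligned pairs, one gap, and one extension.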
Current Methods to Calculate Homology • p(Sr > x): probability that a random pair of proteins of the same length would score higher than x, the score of the given pair • E: expected number of random proteins in the database that would score at least as high • P: probability that at least one random pair has a higher score • As p(Sr > x), E, and P increase, the likelihood that the given pair is homologous decreases
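The three quantities above are linked. A small sketch of one common way to relate them follows, under two assumptions that the slides do not state: that database comparisons are independent with the same single-pair probability, and that the number of random hits is Poisson distributed, so that P = 1 - exp(-E). Both function names are hypothetical.

```python
import math


def expected_random_hits(p_single, n_database):
    """E: expected number of random database entries scoring above x.

    Assumes every comparison has the same probability p_single of
    exceeding x by chance.
    """
    return p_single * n_database


def p_at_least_one(expected_hits):
    """P: chance of at least one random hit, assuming Poisson-distributed hits.

    Computes 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-expected_hits)
```

For small E the two values are nearly equal, which is why E and P are often used interchangeably for highly significant hits.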
Current Score Matrices • PAM (percent accepted mutations) – Dayhoff • GCB, JTT: the PAM approach applied to larger sequence datasets • BLOSUM62 – Henikoff & Henikoff, constructed from a dataset of aligned sequence blocks • STR – derived from protein sequences aligned according to their observed structures
Limitations of Current Score Functions • Current score functions assume that each position evolves independently, overlooking correlations between positions • Score functions are derived from a database of properly aligned proteins, not from alignments between random sequences • Gap penalties are chosen a priori
Theory Z score for alignment: • Characterize the significance of alignment score by calculating the likelihood that this score or higher would be obtained by a random match • Account for variations in E with the length of the proteins
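A Z score of this kind can be estimated empirically. The sketch below assumes a shuffling null model, in which one sequence is permuted to produce random pairs of the same length and composition; the slides do not say which null model the authors used. `z_score` and `identity_score` are hypothetical names, and `identity_score` is a toy stand-in for a real alignment score.

```python
import random
import statistics


def identity_score(seq_a, seq_b):
    """Toy score: number of identical residues at matching positions."""
    return sum(x == y for x, y in zip(seq_a, seq_b))


def z_score(seq_a, seq_b, score_fn, n_shuffles=200, seed=0):
    """Express the observed score in standard deviations above random scores."""
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)

    residues = list(seq_b)
    random_scores = []
    for _ in range(n_shuffles):
        # Shuffling keeps length and composition but destroys residue order.
        rng.shuffle(residues)
        random_scores.append(score_fn(seq_a, "".join(residues)))

    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    if spread == 0:
        raise ValueError("random scores have zero variance; Z is undefined")
    return (observed - mean) / spread
```

Because the random scores come from sequences of the same length as the real pair, the resulting Z value is less sensitive to protein length than the raw score.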
Theory • Score function optimized by maximizing the confidence <C> over the training set • Avoids dependence on extreme E values (easily detected or overly distant homologies) • Eliminates contribution of falsely identified homologies (overly distant)
Database Preparation • Use a set of known homologs whose homology cannot be reliably determined by standard pairwise comparison, so that the score function is optimized for detecting distant homologs • Training set: 900 protein pairs, each pair from the same COG (Cluster of Orthologous Groups), with < 25% sequence identity
Optimization of Score Function • Align using the BLOSUM62 matrix • Calculate Z and C for each pair of homologs, then average over all pairs in the training set to yield <C> • Generate initial alignments using the gap penalties that yielded the highest C values • ~10 cycles of optimization and realignment until the score function converged
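The cycle described above alternates between two steps: refining the score function with the alignments held fixed, then realigning with the refined score function. The skeleton below is only a sketch of that control flow. The callables `align`, `refine`, and `mean_confidence` are hypothetical placeholders for the real alignment, parameter update, and <C> calculation, and the convergence test (change in <C> below `tol`) is an assumption, since the slides do not state the criterion.

```python
def optimize_by_realignment(params, align, refine, mean_confidence,
                            max_cycles=10, tol=1e-4):
    """Alternate parameter refinement and realignment until <C> stops changing.

    params          -- current score function (matrix and gap penalties)
    align           -- callable: params -> alignments of the training pairs
    refine          -- callable: (params, alignments) -> improved params
    mean_confidence -- callable: (params, alignments) -> <C> as a float
    Returns (params, <C>, number of cycles run).
    """
    alignments = align(params)
    previous = mean_confidence(params, alignments)

    for cycle in range(1, max_cycles + 1):
        # Step 1: improve the score function against the current alignments.
        params = refine(params, alignments)
        # Step 2: realign, because the best alignment depends on the scores.
        alignments = align(params)
        current = mean_confidence(params, alignments)

        if abs(current - previous) < tol:
            return params, current, cycle
        previous = current

    return params, previous, max_cycles
```

Realigning after every update matters because an alignment that was optimal under the old score function is not necessarily optimal under the new one.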
Results • Small changes in gap penalties: most of the improvement comes from refinements of the score matrix • OPTIMA: the resulting score function • Has a significantly improved average confidence <C> compared with other score matrices • <p(Sr > x)> and <P> significantly decreased
Summary • Aim: optimize score matrix to discriminate between homologs and non-homologs • OPTIMA score function: more successful at discriminating between homologs and non-homologs compared with standard score matrices • Gap penalties treated as additional parameters to be optimized