

Similarity Methods


C371

Fall 2004

Limitations of Substructure Searching/3D Pharmacophore Searching

- Need to know what you are looking for
- Compound is either there or not
- Don’t get a feel for the relative ranking of the compounds
- Output size can be a problem

Similarity Searching

- Look for compounds that are most similar to the query compound
- Each compound in the database is ranked
- In other application areas, the technique is known as pattern matching or signature analysis

- Structurally similar molecules usually have similar properties, e.g., biological activity
- Known also as “neighborhood behavior”
- Examples: morphine, codeine, heroin
- Define: in silico
- Using computational techniques as a substitute for or complement to experimental methods

- One known active compound becomes the search key
- User sets the limits on output
- Possible to re-cycle the top answers to find other possibilities
- Subjective determination of the degree of similarity
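
The search workflow above (one active compound as the key, every database compound ranked, user-set limits on the output) can be sketched as follows. This is a minimal illustration, not a production search: the compound names and integer feature IDs are made up, and the score is the Tanimoto coefficient defined later in these notes, written here for sets of features.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient for two sets of feature IDs."""
    union = len(features_a | features_b)
    return len(features_a & features_b) / union if union else 0.0

def similarity_search(query, database, top_n=3, min_similarity=0.0):
    """Rank every database entry against the query and apply the output limits."""
    scored = [(tanimoto(query, feats), name) for name, feats in database.items()]
    hits = [(score, name) for score, name in scored if score >= min_similarity]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))   # most similar first
    return hits[:top_n]

# Hypothetical feature sets, for illustration only.
query = {1, 2, 3, 4}
database = {
    "x": {1, 2, 3, 4, 5},
    "y": {1, 2},
    "z": {7, 8},
    "w": {1, 2, 3, 9},
}
top_hits = similarity_search(query, database, top_n=2)
print([name for _, name in top_hits])   # ['x', 'w']
```

Re-cycling the top answers would amount to calling `similarity_search` again with the features of a top hit as the new query.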

- Evaluation of the uniqueness of proposed or newly synthesized compounds
- Finding starting materials or intermediates in synthesis design
- Handling of chemical reactions and mixtures
- Finding the right chemicals for one’s needs, even if not sure what is needed.

- No hard and fast rules
- Numerical descriptors are used to compare molecules
- A similarity coefficient is defined to quantify the degree of similarity
- Similarity and dissimilarity rankings can be different in principle

“Consider two objects A and B, a is the number of features (characteristics) present in A and absent in B, b is the number of features absent in A and present in B, c is the number of features common to both objects, and d is the number of features absent from both objects. Thus, c and d measure the present and the absent matches, respectively, i.e., similarity; while a and b measure the corresponding mismatches, i.e., dissimilarity.” (Chemoinformatics; A Textbook (2003), p. 304)
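
The four counts in the quoted definition can be computed directly from two feature sets. The sketch below uses made-up feature labels. Note that the Tanimoto formula later in these notes reuses the letters a and b with a different meaning (total bits set in each molecule), so the two notations should not be mixed.

```python
def feature_counts(features_a, features_b, all_features):
    """Return (a, b, c, d) as defined in the quoted passage."""
    a = len(features_a - features_b)                  # present in A, absent in B
    b = len(features_b - features_a)                  # absent in A, present in B
    c = len(features_a & features_b)                  # common to both
    d = len(all_features - features_a - features_b)   # absent from both
    return a, b, c, d

# Hypothetical feature labels, for illustration only.
ALL_FEATURES = {"OH", "NH2", "C=O", "ring", "Cl", "ether"}
A = {"OH", "ring", "ether"}
B = {"OH", "ring", "C=O", "Cl"}

counts = feature_counts(A, B, ALL_FEATURES)
print(counts)   # (1, 2, 2, 1)
```

The four counts always add up to the total number of features considered, which is a quick sanity check on any implementation.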

- Commonly based on “fingerprints,” binary vectors with 1 indicating the presence of the fragment and 0 the absence
- Could relate structural keys, hashed fingerprints, or continuous data (e.g., topological indexes that take into account size, degree of branching, and overall shape)

- Tanimoto coefficient of similarity for molecules A and B:

$$S_{AB} = \frac{c}{a + b - c}$$

where a = number of bits set to 1 in A, b = number of bits set to 1 in B, and c = number of 1 bits common to both.

Range is 0 to 1.

A value of 1 means the fingerprints are identical; it does not mean the molecules are identical.
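
As a minimal sketch of the Tanimoto formula, the function below takes two equal-length lists of 0/1 bits. The example fingerprints are invented, and returning 0.0 for two all-zero fingerprints is a convention chosen here, not something the notes specify.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(fp_a)                                # bits set to 1 in A
    b = sum(fp_b)                                # bits set to 1 in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # 1 bits common to both
    if a + b - c == 0:
        return 0.0   # assumed convention for two all-zero fingerprints
    return c / (a + b - c)

# Invented 8-bit fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(tanimoto(fp_a, fp_b))   # 0.6  (a = 4, b = 4, c = 3)
print(tanimoto(fp_a, fp_a))   # 1.0
```

Any two molecules that happen to set exactly the same bits score 1.0, which is why a value of 1 does not prove the molecules are the same.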

- Tanimoto coefficient is most widely used for binary fingerprints
- Others:
- Dice coefficient
- Cosine similarity
- Euclidean distance
- Hamming distance
- Soergel distance
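
For binary fingerprints, each of the coefficients listed above can be written in terms of the same three counts used for Tanimoto (a and b = bits set in A and B, c = bits in common). The sketch below uses the standard binary forms of these formulas and assumes neither fingerprint is all zeros; the example bits are invented.

```python
import math

def bit_counts(fp_a, fp_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c

def dice(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return 2 * c / (a + b)

def cosine(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / math.sqrt(a * b)

def hamming(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return a + b - 2 * c              # number of positions that differ

def euclidean(fp_a, fp_b):
    return math.sqrt(hamming(fp_a, fp_b))

def soergel(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return (a + b - 2 * c) / (a + b - c)   # equals 1 - Tanimoto for binary data

# Invented 8-bit fingerprints, for illustration only.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(dice(fp_a, fp_b))      # 0.75
print(cosine(fp_a, fp_b))    # 0.75
print(hamming(fp_a, fp_b))   # 2
print(soergel(fp_a, fp_b))   # 0.4
```

Dice and cosine are similarity coefficients (larger means more alike), while Euclidean, Hamming, and Soergel are distances (larger means less alike).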

- Used to define dissimilarity of molecules
- Regards a common absence of a feature as evidence of similarity

- Distance values must be zero or positive
- Distance from an object to itself must be zero
- Distance values must be symmetric
- Distance values must obey the triangle inequality: $D_{AB} \le D_{AC} + D_{BC}$
- Distance between non-identical objects must be greater than zero
- Dissimilarity = distance in the n-dimensional descriptor space
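
The distance properties listed above can be spot-checked numerically on a small sample. The sketch below is only a check over the supplied points, not a proof; the helper name `is_metric` and the sample data are invented for illustration. Hamming distance passes, while squared Euclidean distance fails the triangle inequality.

```python
from itertools import product

def hamming(p, q):
    """Number of positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(p, q))

def squared_euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def is_metric(points, dist):
    """Check the metric properties on every pair and triple of the sample."""
    for p, q in product(points, repeat=2):
        d = dist(p, q)
        if d < 0:                      # zero or positive
            return False
        if d != dist(q, p):            # symmetric
            return False
        if (d == 0) != (p == q):       # zero only for identical objects
            return False
    for p, q, r in product(points, repeat=3):
        if dist(p, q) > dist(p, r) + dist(q, r):   # triangle inequality
            return False
    return True

fingerprints = [(1, 0, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
line_points = [(0,), (1,), (2,)]

print(is_metric(fingerprints, hamming))             # True
print(is_metric(line_points, squared_euclidean))    # False
```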

- Small molecules often have lower similarity values using Tanimoto
- Tanimoto normalizes for molecular size in the denominator:

$$S_{AB} = \frac{c}{a + b - c}$$
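
One way to see the size effect is to compare two pairs of fingerprints that each differ by the same number of bits. The counts below are invented, and the reading that small molecules set fewer bits (so each mismatch costs proportionally more) is an interpretation of the slide rather than something it states.

```python
def tanimoto_from_counts(a, b, c):
    """Tanimoto coefficient from bit counts: a and b set bits, c in common."""
    return c / (a + b - c)

# Each molecule in a pair has exactly one bit the other lacks.
small_pair = tanimoto_from_counts(a=5, b=5, c=4)      # 4 / 6
large_pair = tanimoto_from_counts(a=50, b=50, c=49)   # 49 / 51

print(round(small_pair, 3))   # 0.667
print(round(large_pair, 3))   # 0.961
```

The same absolute difference lowers the score far more for the pair with few bits set.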

- Similarity can be based on continuous whole molecule properties, e.g. logP, molar refractivity, topological indexes.
- Usual approach is to use a distance coefficient, such as Euclidean distance.
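
A minimal sketch of ranking by Euclidean distance in a two-property descriptor space (logP, molar refractivity). All property values and compound names below are invented for illustration, not measured data.

```python
import math

# Hypothetical (logP, molar refractivity) values; not measured data.
query = (2.1, 45.0)
database = {
    "cmpd_1": (2.3, 47.5),
    "cmpd_2": (0.4, 30.2),
    "cmpd_3": (2.0, 44.1),
}

def euclidean(p, q):
    """Straight-line distance between two property vectors."""
    return math.dist(p, q)

# Smallest distance first: the nearest compound is the most similar.
ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
print(ranked)   # ['cmpd_3', 'cmpd_1', 'cmpd_2']
```

Because the properties are on different numerical scales, the one with the larger range dominates the distance; in practice the descriptors are usually scaled before comparison.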

- Another approach: generate alignment between the molecules (mapping)
- Define MCS (maximum common subgraph): largest set of atoms and bonds in common between the two structures.
- An NP-complete problem (NP = nondeterministic polynomial time): very computer intensive; in the worst case, the algorithm will have an exponential computational complexity
- Tricks are used to cut down on the computer usage
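
To make the cost concrete, here is a deliberately naive brute-force sketch that tries every atom mapping. It is not one of the optimized algorithms the slide alludes to. It assumes molecules are given as a list of element symbols plus a set of bonds, ignores bond orders and hydrogens, and returns the atom count of the largest common induced subgraph (which, in this simple form, is not required to be connected).

```python
from itertools import combinations, permutations

def consistent(pairs, atoms_a, bonds_a, atoms_b, bonds_b):
    """True if the atom mapping preserves element labels and bonding."""
    if any(atoms_a[i] != atoms_b[j] for i, j in pairs):
        return False
    for (i1, j1), (i2, j2) in combinations(pairs, 2):
        if (frozenset((i1, i2)) in bonds_a) != (frozenset((j1, j2)) in bonds_b):
            return False
    return True

def mcs_atom_count(atoms_a, bonds_a, atoms_b, bonds_b):
    """Atom count of the largest common induced subgraph, by exhaustive search."""
    n_a, n_b = len(atoms_a), len(atoms_b)
    for k in range(min(n_a, n_b), 0, -1):              # try the largest size first
        for sub_a in combinations(range(n_a), k):
            for sub_b in permutations(range(n_b), k):  # every candidate mapping
                if consistent(list(zip(sub_a, sub_b)),
                              atoms_a, bonds_a, atoms_b, bonds_b):
                    return k
    return 0

# Heavy atoms only: ethanol (C-C-O) versus dimethyl ether (C-O-C).
ethanol = (["C", "C", "O"], {frozenset((0, 1)), frozenset((1, 2))})
ether = (["C", "O", "C"], {frozenset((0, 1)), frozenset((1, 2))})

size = mcs_atom_count(*ethanol, *ether)
print(size)   # 2  (the shared C-O fragment)
```

The nested loops over subsets and permutations grow factorially with the number of atoms, which is why practical methods rely on pruning and heuristics.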

- A structure’s key features are condensed while retaining the connections between them
- Can identify structures with similar binding characteristics, but different underlying skeletons
- Smaller number of nodes speeds up searching

- Aim is often to identify structurally different molecules
- 3D methods require consideration of the conformational properties of molecules

- Descriptors: geometric atom pairs and their distances, valence and torsion angles, atom triplets
- Consideration of conformational flexibility greatly increases the compute time
- Relatively fewer pharmacophoric fingerprints than 2D fingerprints
- Result: Low similarity values using Tanimoto
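
A highly simplified sketch of a distance-based 3D descriptor: every pair of atoms sets one bit according to the distance bin it falls in. Real atom-pair descriptors also encode the atom types; that is omitted here. The coordinates, bin width, and number of bins are arbitrary assumptions chosen for illustration.

```python
import math
from itertools import combinations

def distance_fingerprint(coords, bin_width=1.0, n_bins=8):
    """Set one bit per distance bin occupied by at least one atom pair."""
    fp = [0] * n_bins
    for p, q in combinations(coords, 2):
        index = min(int(math.dist(p, q) / bin_width), n_bins - 1)
        fp[index] = 1
    return fp

# Two invented conformations of the same three-atom chain.
bent = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

print(distance_fingerprint(bent))       # [0, 1, 1, 0, 0, 0, 0, 0]
print(distance_fingerprint(extended))   # [0, 1, 0, 1, 0, 0, 0, 0]
```

The two conformations of the same molecule give different fingerprints, which is why 3D methods have to account for conformational flexibility.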

- A structural abstraction of the interactions between various functional group types in a compound
- Described by a spatial representation of these groups as centers (or vertices) of geometrical polyhedra, together with pairwise distances between centers
- http://www.ma.psu.edu/~csb15/pubs/searle.pdf

- Require consideration of the degrees of freedom related to the conformational flexibility of the molecules
- Goal: determine the alignment where similarity measure is at a maximum

- Consideration of the electron density of the molecules
- Requires quantum mechanical calculation: costly
- Property not sufficiently discriminatory

- Molecule positioned at the center of a sphere and properties projected on the surface
- Sphere approximated by a tessellated icosahedron or dodecahedron
- Each triangular face is divided into a series of smaller triangles

- Need a mechanism for exploring the orientational (and conformational) degrees of freedom to determine the optimal alignment, where the similarity is maximized
- Methods: simplex algorithm, Monte Carlo methods, genetic algorithms
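
As a toy illustration of the Monte Carlo option, the sketch below searches over a single rotation angle in 2D to maximize a Gaussian-overlap score between two point sets. Everything here is an assumption made for illustration: the scoring function, the Metropolis-style acceptance rule, the parameter values, and the restriction to rotation about the origin (no translation, no conformational change).

```python
import math
import random

def rotate(points, angle):
    """Rotate 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def overlap(points_a, points_b):
    """Toy similarity: sum of Gaussian overlaps over all point pairs."""
    return sum(math.exp(-math.dist(p, q) ** 2)
               for p in points_a for q in points_b)

def monte_carlo_align(points_a, points_b, steps=2000, step_size=0.3,
                      temperature=0.1, seed=0):
    """Random walk over the rotation angle, keeping the best score seen."""
    rng = random.Random(seed)
    angle = 0.0
    score = overlap(points_a, rotate(points_b, angle))
    best_angle, best_score = angle, score
    for _ in range(steps):
        trial = angle + rng.uniform(-step_size, step_size)
        trial_score = overlap(points_a, rotate(points_b, trial))
        delta = trial_score - score
        # Always accept improvements; sometimes accept a worse move.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            angle, score = trial, trial_score
            if score > best_score:
                best_angle, best_score = angle, score
    return best_angle, best_score

mol_a = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.0)]
mol_b = rotate(mol_a, 1.0)            # same shape, rotated by 1 radian

start_score = overlap(mol_a, mol_b)
best_angle, best_score = monte_carlo_align(mol_a, mol_b)
print(start_score, best_score, best_angle)
```

A real 3D alignment would search over three rotations, three translations, and the torsion angles of flexible molecules, which is what makes the optimization expensive.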

- Generally, 2D methods are more effective than 3D methods
- 2D methods may be artificially enhanced because of database characteristics (close analogs)
- Incomplete handling of conformational flexibility in 3D databases

- Best to use data fusion techniques, combining methods
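
One simple form of data fusion is to combine the rankings produced by different similarity methods, for example by summing each compound's rank positions. The sketch below assumes every ranking contains the same compounds; the compound names and the two rankings are invented, and rank-sum is only one of several possible fusion rules.

```python
def fuse_by_rank_sum(*rankings):
    """Combine rankings (best first) by summing each item's rank positions."""
    totals = {}
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            totals[name] = totals.get(name, 0) + position
    # Lowest total rank first; ties broken alphabetically.
    return sorted(totals, key=lambda name: (totals[name], name))

# Hypothetical rankings from a 2D and a 3D similarity search.
ranking_2d = ["m1", "m2", "m3", "m4"]
ranking_3d = ["m3", "m1", "m4", "m2"]

fused = fuse_by_rank_sum(ranking_2d, ranking_3d)
print(fused)   # ['m1', 'm3', 'm2', 'm4']
```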

- See Dr. John Barnard’s lecture at:
http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt