
An Efficient Index-based Protein Structure Database Searching Method



Presentation Transcript


  1. An Efficient Index-based Protein Structure Database Searching Method 陳冠宇

  2. Introduction
     • More than 18,000 protein structures were stored in the PDB as of September 2002.
     • 3D structural comparison and database searching: other methods search exhaustively.
     • Their design philosophy:
       • Filter-and-refine
       • Index-based searching
     • Result: 16 times faster than DALI

  3. Filter-and-Refine (figure: query -> ProtDex filter over a database of 20,000 proteins -> top 100 proteins -> actual alignment -> result)
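The filter-and-refine idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `cheap_score` and `expensive_align` are hypothetical callables standing in for the index-based filter and the actual alignment, and the filter here still scores every entry, whereas the slides state that the index avoids scanning each structure.

```python
import heapq

def filter_and_refine(query, database, cheap_score, expensive_align, top_k=100):
    """Rank `database` with a cheap score, then align only the best `top_k`.

    `cheap_score` and `expensive_align` are hypothetical callables taking
    (query, protein) and returning a number where larger means more similar.
    """
    # Filter step: keep only the top_k candidates by the cheap score.
    candidates = heapq.nlargest(
        top_k, database, key=lambda protein: cheap_score(query, protein)
    )
    # Refine step: run the costly alignment on the survivors only.
    refined = [(expensive_align(query, protein), protein) for protein in candidates]
    refined.sort(key=lambda pair: pair[0], reverse=True)
    return refined
```

With a database of 20,000 proteins and `top_k=100`, the expensive alignment runs 100 times instead of 20,000.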

  4. Problem Definition
     • Protein structures
     • 3D structural comparison
     • Structural database searching

  5. A protein is composed of a sequence of amino acid (AA) residues.
     • SSE: secondary structure element (e.g. helices, sheets)
     • Loop regions: no specific shape

  6. Sequence Comparison vs. Structural Comparison
     • Sequence comparison cannot determine the similarity of two remotely homologous proteins.
     • Structural comparison superimposes one protein structure on another to obtain the minimum root-mean-square deviation (RMSD) between them -> O(n^4 m^4)
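For reference, the RMSD itself is cheap to compute once two structures are already superimposed and their residues are paired one to one. The sketch below assumes exactly that; the function name `rmsd` and the input layout are illustrative choices. It does not search for the superposition or the residue correspondence that minimizes the value, which is the expensive part the slide refers to.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired points.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```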

  7. The ProtDex Method
     • Step 1: Extracting information from the PDB database
     • Step 2: Building intra-molecular distance matrices
       • Design rationale: two protein structures are similar if their distance matrices are similar
     • Step 3: Cutting fixed-size matrices and extracting properties
     • Step 4: Building the inverted file index

  8. Step 1: Extracting Information
     • For each protein chain in a PDB file:
       • PDB id and chain id; number of AA residues; number of SSEs
     • For each AA residue:
       • 3D coordinate (x, y, z) of the Cα carbon
     • For each SSE:
       • SSE type (helix or sheet); SSE start position; SSE length
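As a rough illustration of the per-residue part of this step, the sketch below pulls alpha-carbon coordinates out of the `ATOM` records of a PDB-format file, grouped by chain. It relies on the fixed column layout of the PDB format (atom name in columns 13-16, chain id in column 22, x/y/z in columns 31-54). The function name `parse_ca_atoms` is an assumption, alternate locations are not handled, and the SSE records (`HELIX`, `SHEET`) are not parsed here.

```python
def parse_ca_atoms(pdb_lines):
    """Return {chain_id: [(x, y, z), ...]} for alpha-carbon ATOM records."""
    chains = {}
    for line in pdb_lines:
        # Only standard-residue atoms; HETATM and other records are skipped.
        if not line.startswith("ATOM"):
            continue
        # Columns 13-16 (0-based slice 12:16) hold the atom name.
        if line[12:16].strip() != "CA":
            continue
        chain_id = line[21]          # column 22
        x = float(line[30:38])       # columns 31-38
        y = float(line[38:46])       # columns 39-46
        z = float(line[46:54])       # columns 47-54
        chains.setdefault(chain_id, []).append((x, y, z))
    return chains
```

The number of AA residues per chain then falls out as `len(chains[chain_id])`, assuming one alpha carbon per residue.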

  9. Step 2: Representation - Building Distance Matrices (example: protein 9xxxx with 7 AA residues)
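An intra-molecular distance matrix can be sketched as the table of pairwise Euclidean distances between the residues' alpha-carbon coordinates. The function name `distance_matrix` and the list-of-tuples input are assumptions for illustration; any rounding or binning of distances that ProtDex may apply is not shown.

```python
import math

def distance_matrix(ca_coords):
    """Symmetric n x n matrix of pairwise distances between (x, y, z) points."""
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])  # Python 3.8+
            # Distance is symmetric, so fill both halves at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix
```

A protein with 7 AA residues, as in the slide's example, gives a 7 x 7 matrix with zeros on the diagonal. Because the matrix depends only on distances inside the molecule, it does not change when the protein is rotated or translated.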

  10. Step 3-1: Contact Patterns & Fixed-Size Matrices (figure: SSE(H) and SSE(E) contact patterns; fixed-size matrix)

  11. Step 3-2: Extracting Properties
      • For the 2×2 sub-matrix starting at cell (2,2), we store the values: 8, HH, (3,3), (1,1), (1,1)
      • For the 2×2 sub-matrix starting at cell (3,6), we store the values: 49, HE, (3,2), (1,2), (2,1), etc.

  12. Step 4: Building the Inverted File Index (implemented as a sorted list)
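One way an inverted index can be kept as a sorted list is sketched below: postings are `(term, protein_id)` pairs held in sorted order, and a lookup uses binary search to jump to the first posting for a term. The class name, the choice of `(value, contact type)` tuples as terms, and the protein ids in the usage note are all hypothetical; the slides do not give the actual record layout.

```python
import bisect

class InvertedIndex:
    """Inverted index stored as one sorted list of (term, protein_id) pairs."""

    def __init__(self):
        self._postings = []

    def add(self, term, protein_id):
        # insort keeps the list sorted, so lookups can use binary search.
        bisect.insort(self._postings, (term, protein_id))

    def lookup(self, term):
        # (term,) sorts just before every (term, protein_id) pair.
        start = bisect.bisect_left(self._postings, (term,))
        matches = []
        for posting_term, protein_id in self._postings[start:]:
            if posting_term != term:
                break
            matches.append(protein_id)
        return matches
```

For example, after `index.add((8, "HH"), "1abc:A")` and `index.add((8, "HH"), "2xyz:B")`, calling `index.lookup((8, "HH"))` returns both ids without touching postings for other terms.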

  13. Searching a Protein Structure
      • $S(Q,P) = W_{FMCount}(Q,P) \times W_{GSum}(i,j) \times \sum_{match(i,j)} \left[ W_{Term}(i) \times \max_{match(a,b) \wedge PdbId_b = P} \left( W_{Area}(a,b) \times W_{ARatio}(a,b) \times W_{Ordinal}(a,b) \right) \right]$
      • $W_{FMCount}$ compensates for large proteins being matched and scored more frequently than small ones.
      • $W_{Term}$ adds more weight to query index terms that rarely occur in the database.
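The slides do not give the formula for the term weight. A common way in information retrieval to favor rare terms is an inverse-document-frequency weight, sketched below purely as an illustration; whether ProtDex uses this exact form is an assumption, and `term_weight` and its parameters are hypothetical names.

```python
import math

def term_weight(num_proteins, num_proteins_with_term):
    """IDF-style weight: the rarer a term is in the database, the larger it is."""
    if not 0 < num_proteins_with_term <= num_proteins:
        raise ValueError("term count must be between 1 and the database size")
    # A term present in every protein gets weight 0; rare terms get more.
    return math.log(num_proteins / num_proteins_with_term)
```

Under this weighting, a term found in 10 of 1,000 proteins counts for more than one found in 500 of them.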

  14. Discussion
      • Design:
        • Representation of structures
        • Scoring schemes
        • Comparison algorithms
        • Assessment of the results
      • Performance
      • Accuracy: the SCOP classification hierarchy has 4 levels (class, fold, superfamily and family)
      • Pros and cons of ProtDex

  15. Conclusions
      • Advantages:
        • Speed (no need to scan through every structure in the database)
      • Disadvantages:
        • Cannot provide the actual alignment
        • Storage overhead for the index structure (the entire index: 1.2 GB)
        • Time required to build and update the index (building the entire index: 30 min 38 s)
