Mining Patterns in Protein Structures: Algorithms and Applications

Proteins Are the Machinery of Life

[Figure labels: Protein Structure Initiative; Protein Data Bank; Function; Spatial motifs. Examples shown: serine protease, papain-like cysteine protease, GTP-binding protein.]

MotifSpace

[Figure: MotifSpace architecture. Labels recovered from the diagram:]

- Data sources: Digital Library, EC, Protein Data Bank, GO, CATH, SCOP, User Input
- Data: protein classification, protein structures, articles, protein family
- Components: Motif Filter, Motif Miner, Protein Classifier, Knowledge Retriever, Motif Navigator
- Techniques: feature selection, association discovery, subgraph mining, classification, info retrieval, text mining, visualization, indexing & search, knowledge management
- Results: spatial motifs, family-specific motifs, experimental knowledge
- Stores: Spatial Motif Database, Spatial Motif Knowledgebase

Modeling a Protein by a Set of Points

- Amino acids can be represented by points in 3D space.

ATOM 156 C GLY A 38 43.696 71.361 61.773 1.00 25.96 C

ATOM 157 O GLY A 38 43.916 70.461 62.583 1.00 27.40 O

ATOM 158 N HIS A 39 43.506 72.626 62.145 1.00 25.72 N

ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C

ATOM 160 C HIS A 39 42.367 73.829 63.983 1.00 19.35 C

ATOM 161 O HIS A 39 41.790 74.562 63.187 1.00 20.24 O

ATOM 162 CB HIS A 39 44.821 73.890 63.798 1.00 26.08 C

ATOM 163 CG HIS A 39 46.117 73.173 63.590 1.00 32.47 C

ATOM 164 ND1 HIS A 39 46.786 72.533 64.612 1.00 34.50 N

ATOM 165 CD2 HIS A 39 46.850 72.967 62.471 1.00 31.79 C

ATOM 166 CE1 HIS A 39 47.875 71.961 64.129 1.00 36.40 C

ATOM 167 NE2 HIS A 39 47.937 72.209 62.832 1.00 31.42 N

ATOM 168 N LEU A 40 41.986 73.701 65.248 1.00 22.27 N

ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C

ATOM 170 C LEU A 40 41.226 75.942 65.709 1.00 23.21 C
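A minimal sketch of how records like the ones above could be turned into one point per residue. It assumes the whitespace-separated layout shown in this transcript (real PDB files are fixed-column, so a production parser should slice columns instead), and it assumes each residue is represented by its Cα atom, which the slide does not state. The function names `parse_atom_line` and `residue_points` are hypothetical.

```python
def parse_atom_line(line):
    """Split one whitespace-separated ATOM record into named fields.

    Assumption: fields appear as in the transcript
    (ATOM, serial, atom name, residue, chain, residue number, x, y, z, ...).
    """
    f = line.split()
    if len(f) < 9 or f[0] != "ATOM":
        return None
    return {
        "serial": int(f[1]),
        "atom": f[2],
        "residue": f[3],
        "chain": f[4],
        "res_seq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
    }


def residue_points(lines, atom_name="CA"):
    """Map (chain, residue number, residue name) to the coordinates of one atom."""
    points = {}
    for line in lines:
        rec = parse_atom_line(line)
        if rec and rec["atom"] == atom_name:
            points[(rec["chain"], rec["res_seq"], rec["residue"])] = rec["xyz"]
    return points


# Four of the records listed above.
records = [
    "ATOM 158 N HIS A 39 43.506 72.626 62.145 1.00 25.72 N",
    "ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C",
    "ATOM 168 N LEU A 40 41.986 73.701 65.248 1.00 22.27 N",
    "ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C",
]
points = residue_points(records)
for key, xyz in points.items():
    print(key, xyz)
```

Only the two Cα records contribute a point here; the backbone N records are skipped.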

Protein structures are chains of amino acid residues with certain spatial arrangements

[Figure: residues labeled ASP102, HIS57, ALA55, SER195, ASP194, GLY43, GLY42, SER190, GLY40]

Frequent subgraph mining:

Given a group of proteins G, each represented by a graph, and a support threshold 0 ≤ σ ≤ 1, find all maximal subgraphs that occur in at least a σ fraction of the graphs in G.

node ↔ amino acid residue

edge ↔ potential physical interaction
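As a rough illustration of this node/edge mapping, the sketch below builds a labeled graph from residue points. It uses a plain distance cutoff as a stand-in for "potential physical interaction"; that cutoff, its 8.0 value, the function name `contact_graph`, and the toy coordinates are all assumptions made for the example, not the edge definition used in the slides.

```python
from itertools import combinations
from math import dist


def contact_graph(residues, cutoff=8.0):
    """Build (node labels, edge set) from residue points.

    residues: dict mapping a residue name to (amino-acid label, (x, y, z)).
    An edge joins two residues whose points lie within `cutoff` of each other
    (hypothetical criterion chosen only for illustration).
    """
    labels = {name: label for name, (label, _) in residues.items()}
    edges = {
        frozenset((u, v))
        for u, v in combinations(residues, 2)
        if dist(residues[u][1], residues[v][1]) <= cutoff
    }
    return labels, edges


# Toy coordinates, invented for the example (not taken from a real structure).
toy = {
    "HIS57": ("HIS", (0.0, 0.0, 0.0)),
    "ASP102": ("ASP", (3.0, 4.0, 0.0)),
    "SER195": ("SER", (20.0, 0.0, 0.0)),
}
labels, edges = contact_graph(toy)
print(labels)
print(edges)
```

With these toy points, only HIS57 and ASP102 (5.0 apart) are joined; SER195 stays isolated.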

[Figure axes: graph complexity vs. information]

Challenge: subgraph isomorphism (NP-complete)
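To make the support definition and the cost of the isomorphism check concrete, here is a brute-force sketch. It tries every injective mapping of pattern nodes onto graph nodes, which is exponential and only usable on tiny graphs; it is not the algorithm from the slides. The graph encoding (a label dict plus an edge dict keyed by `frozenset`) and the names `make_graph`, `occurs_in`, and `support` are assumptions for this example.

```python
from itertools import permutations


def make_graph(labels, edge_list):
    """labels: node -> label; edge_list: iterable of (u, v, edge_label)."""
    return labels, {frozenset((u, v)): lab for u, v, lab in edge_list}


def occurs_in(pattern, graph):
    """True if `pattern` is a (not necessarily induced) labeled subgraph of `graph`."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[m[u]] for u in p_nodes):
            continue
        ok = True
        for e, lab in p_edges.items():
            u, v = tuple(e)
            if g_edges.get(frozenset((m[u], m[v]))) != lab:
                ok = False
                break
        if ok:
            return True
    return False


def support(pattern, database):
    """Fraction of graphs in the database that contain the pattern."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)


database = [
    make_graph({1: "a", 2: "b"}, [(1, 2, "x")]),
    make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2, "x"), (2, 3, "y")]),
    make_graph({1: "a", 2: "c"}, [(1, 2, "x")]),
]
pattern_ab = make_graph({"u": "a", "v": "b"}, [("u", "v", "x")])
pattern_bc = make_graph({"u": "b", "v": "c"}, [("u", "v", "y")])
print(support(pattern_ab, database), support(pattern_bc, database))
```

With a threshold of σ = 2/3, the first pattern would count as frequent and the second would not.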

Almost-Delaunay (AD)

- A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.
- A 4-tuple of points is AD(ε) if ε is the minimal perturbation.

[Figure: points R1–R5. Each vertex can move within a sphere of radius ε, and a new tetrahedron may be formed due to the perturbation. Blue: Delaunay, which is AD(0). Red: AD(ε).]

(Bandyopadhyay and Snoeyink, SODA, 2004)
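The ε = 0 case of this definition is the ordinary Delaunay condition: the circumscribing sphere of the 4-tuple contains no other point. The sketch below checks only that case; finding the minimal ε for a general AD(ε) tuple is a harder computation that is not attempted here. The function names and the tolerance are assumptions for the example.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circumsphere(p0, p1, p2, p3):
    """Return (center, squared radius) of the sphere through four points.

    Solves 2 (p_i - p0) . c = |p_i|^2 - |p0|^2 for i = 1..3 by Cramer's rule.
    """
    n0 = sum(x * x for x in p0)
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(x * x for x in p) - n0)
    det = _det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("points are (nearly) coplanar")
    center = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(_det3(m) / det)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return tuple(center), r2


def is_delaunay_tuple(tet, others, tol=1e-9):
    """AD(0) check: no other point lies strictly inside the circumsphere."""
    center, r2 = circumsphere(*tet)
    return all(
        sum((q[k] - center[k]) ** 2 for k in range(3)) >= r2 - tol
        for q in others
    )


tet = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(circumsphere(*tet))
print(is_delaunay_tuple(tet, [(2, 2, 2)]))
print(is_delaunay_tuple(tet, [(0.5, 0.5, 0.5)]))
```

For this unit tetrahedron, the far point leaves the sphere empty, while the point at the sphere's center violates the condition.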

Recurring patterns from Graph Databases

Input: a database of labeled undirected graphs

[Figure: three labeled graphs (P), (Q), and (S) with nodes p1–p5, q1–q3, and s1–s3; labels are drawn from a, b, c, d, x, y.]

Output: All (connected) frequent subgraphs from the graph database.

[Figure: the connected frequent subgraphs found in the database, each annotated with its support (3/3 or 2/3).]

[Figure: graph (P) with nodes p1–p5, a second drawing (P') with nodes p'1–p'5, and adjacency matrices M1, M2, and M3 compared under the total ordering.]

Canonical Adjacency Matrix: The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.

P3 P2 P5 P4 P1

P1 P2 P3 P4 P5

P1 P2 P3 P5 P4

dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d
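The three code strings above can be reproduced with a small brute-force sketch. It assumes the code of a matrix is its lower triangle read row by row (edge labels, "0" for no edge, node label on the diagonal) and that codes are compared as plain strings, which matches the ordering shown. The graph below (node labels a, b, c, c, d and edge labels x, y) is inferred from the code strings, and the node names n1..n5 are hypothetical. Enumerating all orderings is factorial in the number of nodes, so this only illustrates the definition.

```python
from itertools import permutations


def matrix_code(order, labels, edges):
    """Read the lower triangle of the adjacency matrix row by row.

    Row i lists the edge labels to the earlier nodes in `order`
    ("0" when there is no edge), followed by the node's own label.
    """
    parts = []
    for i, u in enumerate(order):
        for v in order[:i]:
            parts.append(edges.get(frozenset((u, v)), "0"))
        parts.append(labels[u])
    return "".join(parts)


def canonical_code(labels, edges):
    """Maximal code over all node orderings (brute force)."""
    return max(matrix_code(order, labels, edges) for order in permutations(labels))


# Graph inferred from the code strings; node names are hypothetical.
labels = {"n1": "a", "n2": "c", "n3": "d", "n4": "b", "n5": "c"}
edges = {
    frozenset(("n2", "n3")): "x",
    frozenset(("n5", "n3")): "x",
    frozenset(("n5", "n2")): "y",
    frozenset(("n4", "n2")): "x",
    frozenset(("n1", "n5")): "x",
}

print(matrix_code(("n3", "n2", "n5", "n4", "n1"), labels, edges))
print(matrix_code(("n3", "n2", "n5", "n1", "n4"), labels, edges))
print(matrix_code(("n5", "n2", "n1", "n4", "n3"), labels, edges))
print(canonical_code(labels, edges))
```

Two isomorphic graphs yield the same canonical code, which is what lets a miner recognize a subgraph it has already examined.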

[Figure: a second graph database (P), (Q), (S) with nodes p1–p5, q1–q3, and s1–s3; labels are drawn from a, b, c, d, x, y.]

CAM Tree: Frequent Subgraphs (σ = 2/3)

Fast Frequent Subgraph Mining

- Spatial locality
  - Subgraphs with bounded degree and size
- Apriori property
  - Any supergraph of an infrequent subgraph is infrequent
  - Eliminates unnecessary isomorphism checks
- Canonical form
  - Avoids redundant examination
- Depth-first
  - Incremental isomorphism check
  - Better memory utilization
- The state-of-the-art algorithm that can handle large and complex protein graphs
- Open issues
  - Substitution
  - Dynamics and geometric constraints
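The Apriori property can be illustrated at the smallest pattern size. In the sketch below, labeled edges that fall below the support threshold are identified first; since no frequent subgraph can contain them, they can be dropped before any larger candidate is generated. This covers only that first pruning level, not the full depth-first miner, and the graph encoding and function names are assumptions for the example.

```python
from collections import Counter


def edge_type(labels, edge, edge_label):
    """Order-independent description of a labeled edge."""
    u, v = tuple(edge)
    a, b = sorted((labels[u], labels[v]))
    return (a, edge_label, b)


def frequent_edge_types(database, sigma):
    """Edge types occurring in at least a sigma fraction of the graphs."""
    counts = Counter()
    for labels, edges in database:
        # Count each edge type once per graph.
        counts.update({edge_type(labels, e, lab) for e, lab in edges.items()})
    n = len(database)
    return {t for t, c in counts.items() if c / n >= sigma}


def prune_graph(graph, frequent):
    """Drop edges whose type is infrequent (Apriori: they cannot extend a frequent subgraph)."""
    labels, edges = graph
    kept = {e: lab for e, lab in edges.items() if edge_type(labels, e, lab) in frequent}
    return labels, kept


database = [
    ({1: "a", 2: "b"}, {frozenset((1, 2)): "x"}),
    ({1: "a", 2: "b", 3: "c"}, {frozenset((1, 2)): "x", frozenset((2, 3)): "y"}),
    ({1: "b", 2: "a"}, {frozenset((1, 2)): "x"}),
]
frequent = frequent_edge_types(database, sigma=2 / 3)
print(frequent)
print(prune_graph(database[1], frequent)[1])
```

In this toy database the b–c edge appears in only one of three graphs, so it is removed before any subgraph isomorphism check would be needed for patterns containing it.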

Proof of Concept: Serine Proteases

Packing motifs identified in the eukaryotic serine protease family. N: total number of structures included in the data set; σ: the support threshold used to obtain recurring spatial motifs; T: processing time (in seconds); M: motif number; C: the sequence of one-letter residue codes giving the residue composition of the motif; κ: the actual number of occurrences of a motif in the family; λ: the background frequency of the motif; S = -log(P), where the P-value is defined by a hypergeometric distribution. The packing motifs are sorted first by their support values in descending order and then by their background frequencies in ascending order. The –log(P) values are highlighted.
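A hedged sketch of how a score of the form S = -log(P) could be computed. The slide only says the P-value comes from a hypergeometric distribution; the sketch assumes the upper tail P(X ≥ k), where a motif occurs in `motif_total` of `population_size` structures overall and in `family_hits` of the `family_size` family members, and it assumes a base-10 logarithm. All parameter names are hypothetical.

```python
from math import comb, log10


def hypergeom_tail(population_size, motif_total, family_size, family_hits):
    """P(X >= family_hits) when family_size structures are drawn without
    replacement from a population in which motif_total contain the motif."""
    upper = min(motif_total, family_size)
    total = comb(population_size, family_size)
    favourable = sum(
        comb(motif_total, i) * comb(population_size - motif_total, family_size - i)
        for i in range(family_hits, upper + 1)
    )
    return favourable / total


def motif_score(population_size, motif_total, family_size, family_hits):
    """S = -log10(P); larger values mean the motif is more family-specific."""
    return -log10(hypergeom_tail(population_size, motif_total, family_size, family_hits))


# Toy numbers, invented for the example: a motif seen in 3 of 10 structures,
# all 3 of them inside a family of size 3.
p = hypergeom_tail(10, 3, 3, 3)
s = motif_score(10, 3, 3, 3)
print(p, s)
```

Exact integer binomials are used so that very small P-values are not lost to rounding before the final division.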

Proof of Concept: Serine Proteases

38 highly specific motifs mined from serine proteases classified by SCOP v1.65 (Dec 2003)

[Figure: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]

Proof of Concept: Papain-like Cysteine Protease

All the patterns have –log(P) > 49. The table columns give the support in the PCP family and the number of occurrences outside the family. Patterns that contain the active dyad (His and Cys) of the proteins are highlighted.

Proof of Concept: Papain-like Cysteine Protease

The active site in 1cqd

Choi, K. H., Laursen, R. A. & Allen, K. N. (1999). The 2.1 Å structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36), 11624–11633.

Proof of Concept: Function Inference of Orphan Structure

- 1nfg (SCOP 51556)
  - Metallo-dependent hydrolase (MDH)
  - 8-stranded βα (TIM) barrel fold
  - 17 members, 49 family-specific spatial motifs
- 1m65 (CASP5 T0147)
  - Unknown function
  - No good sequence and global structure alignment to known proteins
  - 7-stranded barrel fold, 30 motifs found

Proof of Concept: Function Inference II

- 1ecs (SCOP 54598)
  - Antibiotic resistance protein
  - Glyoxalase / bleomycin resistance / dioxygenase superfamily
  - 4 members (SCOP 1.65), 62 family-specific spatial motifs
- 1twu (YycE)
  - Unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
  - 46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67

References and Acknowledgement

- Collaborators
  - Catherine Blake (information retrieval)
  - Charlie Carter (biochemistry)
  - Nikolay Dokholyan (biophysics)
  - Leonard McMillan (computer graphics)
  - Jan Prins (high performance computing)
  - Jack Snoeyink (computational geometry)
  - Alexander Tropsha (pharmacy)
- Partially supported by
  - Microsoft eScience Applications Award
  - Microsoft New Faculty Fellowship
  - NSF CAREER Award IIS-0448392
  - NSF CCF-0523875
  - NSF DMS-0406381
- Prototype deployed at

- Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.
- SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.
- Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.
- Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.
- Efficient mining of frequent subgraphs in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.
- Another 45 papers on general methodology development directly related to this project
