Mining patterns from protein structures




Mining Patterns from Protein Structures

Wei Wang

University of North Carolina at Chapel Hill


Outline

  • Introduction

    • Motivation

    • Challenges

  • Graph-based Pattern Discovery in Protein Structures

  • Applications

  • Conclusions

  • Future Directions


Introduction

[Figure: a protein shown as a chain of residues (Lys, Lys, Gly, Gly, Leu, Val, Ala, His), as atoms (oxygen, nitrogen, carbon, sulfur), and as a ribbon]

  • Protein

    • A sequence built from 20 amino acids

    • Adopts a stable 3D structure that can be measured experimentally


Introduction

  • Structure patterns are geometric arrangements of amino acids that are common to a group of different proteins.

[Figure: the serine protease active center in three proteins with the same function (1HJ9, 1R64, 1SSX)]


Motivations

  • Structure patterns are useful in:

    • Protein structure alignment

    • Protein design

    • Prediction of protein-protein interactions

    • Understanding protein folding

    • Drug design


Goal

  • Develop techniques for discovering structure patterns; the techniques should be

    • Efficient

    • Effective


Challenges

[Figure: growth of known structures in the Protein Data Bank, 1988 to 2005. x-axis: year; y-axis: number of structures (scale up to 35,000). Series: the total number of known protein structures; newly characterized proteins in that year.]

  • Define mathematical models to represent protein structures

    • Point set

    • Labeled graph

  • Define computational components

    • Define structure pattern

    • Specify a matching condition

    • Design a search procedure

  • Evaluate the results

    • Computational efficiency and effectiveness


The Nature of Protein Structure Data

  • The ball-and-stick model is an element-based structure representation

    • A structure is decomposed into a set of amino acids

    • Protein geometry, topology, and attributes are defined with respect to the amino acid set


Components of Pattern Discovery

  • The definition of patterns

    • Geometry vs. topology

  • The matching condition

    • Measures the fitness of a pattern to a set of protein structures

  • The search procedure


Related Work

Protein Local Structure Comparison Problem

Pattern Discovery

Pattern Matching

  • ASSAM, Artymiuk et al., JMB’94

  • TESS, Wallace et al., Prot. Sci.’97

Sequence-dependent

Sequence-independent

  • TRILOGY, Bradley et al., RECOMB’01

Multi-way comparison

Pair-wise comparison

  • PINTS, Russell, JMB’98

  • Geometric Hashing, Fischer et al., Prot. Sci.’94

  • Graph Matching, Schmitt et al., JMB’02

  • Evolutionary Trace, Lichtarge et al., JMB’96

  • FFSM & its variants, Huan et al., ICDM’03, RECOMB’04, CSB’06

Huan et al. Advances in Computers


Our Approach

A group of protein structures

Represent each structure as a labeled graph

Discover frequently occurring subgraphs

Map subgraphs to protein structures and obtain structure patterns

Predict protein function

Identify functional sites in proteins

Discover patterns in structure evolution


Outline

  • Introduction

  • Graph-based Pattern Discovery in Protein Structures

    • Labeled graphs and representing structures as labeled graphs

    • Frequent subgraph mining

  • Applications

  • Conclusions

  • Future Directions


Labeled Graphs

[Figure: three example labeled graphs: G1 (nodes p1 to p5), G2 (nodes q1 to q3), and G3 (nodes s1 to s4), with node labels drawn from a, b, c, d and edge labels from x, y]

  • A labeled graph is a graph where each node and each edge has a label.


Protein Contact Map

  • Use a labeled graph to represent a protein structure

    • Nodesrepresent amino acids,labeled by theidentityof the amino acids

    • Edgesconnect two amino acids if their Euclidian distance is less than a certain threshold

[Figure: a protein and the contacts between its amino acids]


Pattern Matching

[Figure: example graphs G1, G2, G3 and a pattern graph G (nodes g1 to g3) matched against them]

  • A graph G is subgraph isomorphic to a graph G’, denoted by G ⊆ G’, if

    • there exists a 1-1 mapping from the nodes of G to the nodes of G’ such that node labels, edges, and edge labels are preserved under the mapping.

  • A pattern is a graph. A pattern G matches G’ if G ⊆ G’

    • Equivalently, G occurs in G’ if G ⊆ G’.

    • Given a label set, a graph space is the collection of graphs whose labels are drawn from that set.
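The matching condition can be made concrete with a brute-force check. The sketch below is an assumption-laden illustration, not an algorithm from the deck: the helper names `make_graph` and `occurs_in` are hypothetical, graphs are undirected, and every injective node mapping is tried, which is only practical for very small patterns. The optional `induced` flag additionally rejects mappings where two pattern nodes without an edge land on adjacent graph nodes.

```python
from itertools import permutations

def make_graph(node_labels, edge_list):
    """node_labels: {node: label}; edge_list: [(u, v, label), ...], undirected."""
    edges = {frozenset((u, v)): lab for u, v, lab in edge_list}
    return node_labels, edges

def occurs_in(pattern, graph, induced=False):
    """True if `pattern` is subgraph isomorphic to `graph` (brute force)."""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    p_list = list(p_nodes)
    for image in permutations(g_nodes, len(p_list)):  # all injective mappings
        m = dict(zip(p_list, image))
        # Node labels must be preserved.
        if any(p_nodes[u] != g_nodes[m[u]] for u in p_list):
            continue
        # Every pattern edge must map to a graph edge with the same label.
        if any(g_edges.get(frozenset(m[u] for u in e)) != lab
               for e, lab in p_edges.items()):
            continue
        # Induced variant: non-edges of the pattern must map to non-edges.
        if induced and any(
                frozenset((m[u], m[v])) in g_edges
                for i, u in enumerate(p_list) for v in p_list[i + 1:]
                if frozenset((u, v)) not in p_edges):
            continue
        return True
    return False

# A labeled triangle and a two-edge path pattern (toy data).
tri = make_graph({1: "a", 2: "b", 3: "b"},
                 [(1, 2, "x"), (1, 3, "x"), (2, 3, "y")])
path = make_graph({"p": "a", "q": "b", "r": "b"},
                  [("p", "q", "x"), ("p", "r", "x")])
print(occurs_in(path, tri), occurs_in(path, tri, induced=True))
```

The path occurs in the triangle as an ordinary subgraph, but not as an induced one, because the triangle's b-b edge has no counterpart in the pattern.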


Subgraph Mining: Notations Cont.

  • The support value of a pattern P in a collection of graphs 𝒢 is the fraction of graphs in 𝒢 in which P occurs.

  • Given a collection of graphs 𝒢 and a threshold 0 < σ ≤ 1, the frequent subgraph mining problem is the identification of all patterns that have support at least σ.
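Written out as a formula, with the collection of graphs denoted by a script G and the threshold by σ (the symbol names are a notational assumption), the two definitions above read:

```latex
\[
  \operatorname{sup}(P, \mathcal{G})
    = \frac{\bigl|\{\, G \in \mathcal{G} : P \subseteq G \,\}\bigr|}{|\mathcal{G}|},
  \qquad
  P \text{ is frequent} \iff \operatorname{sup}(P, \mathcal{G}) \ge \sigma,
  \quad 0 < \sigma \le 1 .
\]
```

For instance, in a collection of three graphs with σ = 2/3, a pattern that occurs in two of the graphs has support 2/3 and is frequent, while a pattern that occurs in only one has support 1/3 and is not.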


Examples

[Figure: graphs G1, G2, G3 and candidate patterns P1 to P6, each annotated with its support f (values shown range from 0/3 to 3/3); σ = 2/3. Patterns marked + are induced frequent subgraphs.]

The induced subgraph isomorphism penalizes any unmatched edges


Examples

[Figure: graphs G1, G2, G3 and candidate patterns P1 to P6 with their supports f; σ = 2/3. Patterns marked ! are maximal frequent subgraphs.]

Maximal frequent subgraphs are those with no frequent supergraph

Other criteria for selecting subgraphs may be incorporated
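Selecting the maximal patterns from a set of frequent ones is a simple filter once a sub-pattern test is available. The sketch below is illustrative only: `maximal_patterns` is a hypothetical helper, and the toy patterns are sets of edge names, so the subset test stands in for the subgraph relation.

```python
def maximal_patterns(frequent, is_subpattern):
    """Keep the patterns that have no proper super-pattern in `frequent`."""
    return [p for p in frequent
            if not any(p != q and is_subpattern(p, q) for q in frequent)]

# Stand-in patterns: each is a frozenset of edge names, and the subset
# test plays the role of subgraph isomorphism (an assumption of this sketch).
frequent = [frozenset({"a-x-b"}),
            frozenset({"b-y-b"}),
            frozenset({"a-x-b", "b-y-b"})]
print(maximal_patterns(frequent, frozenset.issubset))
```

Only the two-edge pattern survives, because each single-edge pattern is contained in it.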


Search DAG

  • Task: identify all frequently occurring subgraphs from a group of graphs, or a graph database

  • Support anti-monotonicity

    • Any supergraph of an infrequent subgraph is infrequent

    • Known as the Apriori property

  • Level-wise search

    • Keep all patterns with the same size in memory (poor memory utilization)

  • Depth-first search

    • Better memory utilization

    • May repeatedly search patterns in the DAG (redundant candidates)
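The pruning that support anti-monotonicity allows can be seen in a small level-wise search. To stay short, this sketch works on sets of edge names rather than graphs, so the subset test replaces subgraph isomorphism. It shows the Apriori-style pruning only; the candidate generation and isomorphism handling that a real graph miner needs are left out. The names `support` and `level_wise` and the toy database are invented for the example.

```python
from itertools import combinations

def support(pattern, database):
    """Fraction of records that contain every item of `pattern`."""
    return sum(pattern <= record for record in database) / len(database)

def level_wise(database, sigma):
    """Apriori-style level-wise search over sets of items."""
    items = sorted(set().union(*database))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), database) >= sigma]
    frequent = []
    while level:
        frequent.extend(level)      # the whole level is held in memory
        current = set(level)
        # Join two patterns of size k that differ in exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Anti-monotone pruning: drop a candidate if any sub-pattern one
        # item smaller is infrequent, before counting its support.
        level = [c for c in candidates
                 if all(c - {i} in current for i in c)
                 and support(c, database) >= sigma]
    return frequent

db = [{"ab", "bc", "cd"}, {"ab", "bc"}, {"ab", "cd"}]
result = level_wise(db, 2 / 3)
print(sorted(sorted(p) for p in result))
```

In this toy run the three-item candidate is discarded without a support count, because its sub-pattern {bc, cd} is already known to be infrequent.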


Related Work

  • Level-wise search

    • AGM: Inokuchi et al., PKDD’00

    • FSG: Kuramochi & Karypis, ICDM’01

  • Depth-first search

    • gSpan, Yan & Han, ICDM’02, KDD’03

    • FFSM, Huan et al., ICDM’03

  • Path-based search

    • Vanetik, et al., ICDM’02, ICDE’04

    • GASTON: Nijssen & Kok, KDD’04

  • Tree-based search

    • SPIN, Huan et al., SIGKDD’04

  • Mining with constraints

    • CSM, Huan et al., CSB’06


The Fast Frequent Subgraph Mining (FFSM) Overview

  • Graph normalization

  • Graph Canonical Adjacency Matrix Tree (CAM Tree)

  • Incremental subgraph isomorphism test

Huan et al. ICDM 2003


Intuitions for Graph Normalization

[Figure: a graph space is sent by a 1-1 mapping to an arbitrary set that carries a partial order; through the mapping, that order defines a partial order on the graph space]


Graph Normalization

  • Given a partially ordered set (S, ⪯), a function φ: G* → S that maps a graph space G* to S is a graph normalization function if φ is a 1-1 mapping.

  • (Mapping partial order ⪯φ) Given a graph normalization φ and its codomain (S, ⪯), we define a binary relation

    ⪯φ ⊆ G* × G* such that P ⪯φ Q if φ(P) ⪯ φ(Q)

  • Claim: ⪯φ is a partial order
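The claim follows directly from the definitions. A short derivation, writing S for the partially ordered set and ⪯ for its order (both symbol names are assumptions of this sketch), checks the three partial-order axioms for the relation induced by φ:

```latex
\begin{align*}
&\textbf{Reflexive: } \varphi(P) \preceq \varphi(P)
  \;\Rightarrow\; P \preceq_\varphi P. \\
&\textbf{Antisymmetric: } P \preceq_\varphi Q,\ Q \preceq_\varphi P
  \;\Rightarrow\; \varphi(P) = \varphi(Q)
  \;\Rightarrow\; P = Q \quad (\varphi \text{ is 1-1}). \\
&\textbf{Transitive: } P \preceq_\varphi Q,\ Q \preceq_\varphi R
  \;\Rightarrow\; \varphi(P) \preceq \varphi(Q) \preceq \varphi(R)
  \;\Rightarrow\; P \preceq_\varphi R.
\end{align*}
```

Only the antisymmetry step uses the requirement that φ is 1-1; without it, two different graphs could share an image and the induced relation would be a preorder rather than a partial order.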


Ideal Normalization

  • Given a partially ordered codomain (S, ⪯), a normalization function φ: G* → S is an ideal normalization if

    • ⪯φ induces a search tree (no redundant candidates)

    • ⪯φ is a subset of the subgraph relation, i.e., for all graphs P and Q, P ⪯φ Q implies P ⊆ Q (anti-monotonicity of support)


Graph Canonical Code

[Figure: a graph P on nodes p1 to p4, an isomorphic copy P’ on nodes p’1 to p’4, and three adjacency matrices M1, M2, M3 of P for the node orderings p1 p2 p3 p4, p1 p2 p4 p3, and p1 p4 p2 p3]

  • The canonical code θ maps a graph G to a string.

  • Claim: θ: G* → (Σ*, ⪯) is a graph normalization; moreover, θ: G* → (Σ*, ⪯) is an ideal graph normalization.

Code(M1): (1, 1, a) (2, 1, x) (2, 2, b) (3, 1, x) (3, 2, y) (3, 3, b) (4, 2, x) (4, 4, c) <

Code(M2): (1, 1, a) (2, 1, x) (2, 2, b) (3, 2, x) (3, 3, c) (4, 1, x) (4, 2, y) (4, 4, b) <

Code(M3): (1, 1, a) (2, 2, c) (3, 1, x) (3, 2, x) (3, 3, b) (4, 1, x) (4, 3, y) (4, 4, b)

θ(P) = (1, 1, a) (2, 1, x) (2, 2, b) (3, 1, x) (3, 2, y) (3, 3, b) (4, 2, x) (4, 4, c)

  • (i, j, Mi,j) ⪯ (k, l, Mk,l) if

    • i < k, or

    • i = k, j < l, or

    • i = k, j = l, Mi,j ⪯ Mk,l
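The codes listed above can be reproduced with a short brute-force sketch. It is an illustration under stated assumptions rather than the FFSM procedure: the graph P is reconstructed from Code(M1), labels are compared in plain alphabetical order, the helper names `code` and `canonical_code` are hypothetical, and the canonical code is taken to be the smallest code over all node orderings, which agrees with θ(P) = Code(M1) in the listing. Enumerating every ordering is exponential and only suitable for tiny graphs.

```python
from itertools import permutations

def code(order, node_labels, edge_labels):
    """Code of the adjacency matrix for one node ordering.

    Rows are read top to bottom; each row lists its non-empty entries left
    of the diagonal as (i, j, edge label), then the diagonal (i, i, node label).
    """
    triples = []
    for i, u in enumerate(order, start=1):
        for j, v in enumerate(order[:i - 1], start=1):
            lab = edge_labels.get(frozenset((u, v)))
            if lab is not None:
                triples.append((i, j, lab))
        triples.append((i, i, node_labels[u]))
    return triples

def canonical_code(node_labels, edge_labels):
    """Smallest code over all node orderings (brute force)."""
    return min(code(p, node_labels, edge_labels)
               for p in permutations(node_labels))

# Graph P as reconstructed from Code(M1).
nodes = {"p1": "a", "p2": "b", "p3": "b", "p4": "c"}
edges = {frozenset(("p1", "p2")): "x", frozenset(("p1", "p3")): "x",
         frozenset(("p2", "p3")): "y", frozenset(("p2", "p4")): "x"}

m1 = code(("p1", "p2", "p3", "p4"), nodes, edges)
m2 = code(("p1", "p2", "p4", "p3"), nodes, edges)
print(m1)
print(canonical_code(nodes, edges) == m1)
```

Python compares lists of tuples lexicographically, which matches the triple ordering defined above as long as the label order is alphabetical.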


FFSM Search

  • Task: identify all frequently occurring subgraphs from a family of graphs

  • Depth-first search

    • Better memory utilization

  • Apriori property

    • Eliminate unnecessary isomorphism checks

  • Graph normalization

    • Avoid redundant examination

  • Subgraph isomorphism test is NP-complete

    • Incremental isomorphism check

  • Applies to frequent induced subgraph mining with minor modifications



Performance of FFSM

[Figure: running time (s)]

PTE (Predictive Toxicology Evaluation) data set

  • Contains 340 chemicals

  • Running times were collected from the literature, where the experiments were run on different hardware configurations (400 MHz Pentium III to 2 GHz Pentium 4)

  • Software downloadable from http://www.cs.unc.edu/~huan

  • AGM: Inokuchi et al. PKDD’00

  • FSG: Kuramochi & Karypis, ICDM’01

  • gSpan: Yan & Han, ICDM’02

  • FFSM: Huan et al. ICDM’03

  • Gaston: Nijssen & Kok, KDD’04


FFSM Scalability

[Figure: running time (s)]

Serine protease:

  • Contains 40 proteins

  • A contact is defined between every pair of distinct residues whose Cα atoms are closer than a certain upper bound (e.g., 6.5 angstroms)

  • Running times were measured on a single 2 GHz Pentium 4 CPU with 2 GB of main memory

    • gSpan handles graphs with no more than 254 edges

    • Gaston runs out of memory


Outline

  • Introduction

  • Graph-based Pattern Discovery in Protein Structures

  • Applications

    • MotifSpace Architecture

    • Identify functional sites in proteins

    • Predict protein function

  • Conclusions

  • Future Directions


Effectiveness

  • Serine proteases have three subclasses

    • Subtilisins

    • Eukaryotic serine proteases

    • Prokaryotic serine proteases

[Figure: example structures 1HJ9, 1R64, 1SSX]


Frequent Patterns

  • 20 highly specific patterns mined from serine proteases

# of patterns is the total number of fingerprints a protein has. The coverage of a protein is the percentage of its residues that are covered by at least one fingerprint. Length (of the protein) is displayed in units of 200 residues.
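The coverage measure in the caption is straightforward to compute once each fingerprint occurrence is given as a set of residue positions. The sketch below assumes that representation; the function name `coverage` and the toy numbers are invented for illustration.

```python
def coverage(protein_length, occurrences):
    """Fraction of residues covered by at least one pattern occurrence.

    `occurrences` is a list of sets of residue positions, one set per
    fingerprint occurrence in the protein.
    """
    covered = set().union(*occurrences) if occurrences else set()
    return len(covered) / protein_length

# A 10-residue toy protein with three overlapping occurrences.
print(coverage(10, [{1, 2, 3}, {3, 4}, {8}]))
```

Residue 3 is shared by two occurrences but counted once, so five of the ten residues are covered.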


Patterns’ Biological Relevance

[Figure: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]


More Case Studies

  • Papain-like cysteine proteases

  • Nuclear receptor ligand binding domains

  • NADP/FAD binding proteins

[Figure panels: papain-like cysteine protease; nuclear binding domains; NADP binding proteins]


Predict Protein Function

How does a protein function in a biological system?

Function

Functional motifs carry out protein function

3D structure of a protein


Distinguishing Families with Different Function

  • The TIM barrel fold contains many proteins with similar structures but different functions

[Table with columns: Abr., Name, #M, #P]

#M: number of members in a family

#P: number of patterns obtained from the family

Bandyopadhyay, Huan et al. Prot. Sci. ‘06


Functional Inference for 1TWU

1ecs

1twu

Yyce

SCOP 54598

Antibiotic resistance protein

Glyoxalase / bleomycin resistance / dioxygenase superfamily

4 members (SCOP 1.65), 62 family specific spatial motifs

unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004

46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67


MotifSpace Architecture

[Figure: MotifSpace architecture. Data sources: Protein Data Bank, GO, CATH, SCOP, biological experiments. Components: Pattern Miner, Pattern Filter, Protein Classifier, Pattern Validation. Tasks: subgraph mining, feature selection, classification, visualization, indexing & search, knowledge management. Stores: Structure Pattern Database, Functional Motifs Knowledgebase. Data flows: protein structures, protein family, structure patterns, family-specific patterns, testable hypotheses, experimental validation.]

Huan et al. ISMB’05 demo, http://escience2-cs.cs.unc.edu/Default.aspx


Summary

Goal: pattern discovery in protein structures

  • Develop labeled graph representations for protein structures

  • Design algorithms to identify recurring subgraphs in a collection of graphs

    • Frequent, constrained, maximal, or coherent subgraph mining

    • Performance evaluation on various data sets

  • Collaborate with domain experts to evaluate the utility of the algorithms

    • Predict function for protein structures

    • Identify structure patterns in protein fold families


Future Work

  • Pattern discovery in protein structures

    • Approximate pattern discovery

    • More applications:

      • Protein-protein interaction

      • Protein subcellular localization


Complex Data in Biology

[Figure labels: Data Models; Biological Data; Volume]


Data Analysis in Biological Systems

[Figure: biological systems at the molecular level]

  • Challenges:

    • What is the nature of the data from biological systems?

    • What are the computational tasks?

    • How can the tasks be divided into a group of computational components?

    • How should the results be evaluated?

Source: http://bioinformatics.ca/workshop_pages/bioinformatics/


Acknowledgements

  • Collaborators: Charlie Carter (UNC School of Medicine), Nikolay Dokholyan (UNC School of Medicine), Leonard McMillan, Jan Prins, Jack Snoeyink, Alexander Tropsha (UNC School of Pharmacy)

  • Students: Deepak Bandyopadhyay, Yetian Chen (UNC School of Pharmacy), Jun Huan, Jinze Liu, Ruchir Shah (UNC School of Pharmacy), Kiran Sidhu, Xueyi Wang, David Williams, Tao Xie, Jingdan Zhang

