
EECS 800 Research Seminar: Mining Biological Data

Instructor: Luke Huan

Fall, 2006


Overview

  • A quick review of PCA

  • Graph mining in microarray analysis

  • Graph indexing


Data Matrix

  • The data matrix: $X = [x_1, x_2, \dots, x_n]^T$

    where each $x_i$ is a column vector

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the column mean of $X$


Projection

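  • Project the data matrix $X$ onto a line: $y = X\lambda$

    where $\lambda$ is a unit column vector ($\lambda^T \lambda = 1$)

  • The variance of the projection is $V = \lambda^T \Sigma \lambda$

    where $\Sigma$ is the covariance matrix of the data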


Derivation of PCs

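To find $\lambda$ that maximizes $V = \lambda^T \Sigma \lambda$ subject to $\lambda^T \lambda = 1$:

Let $k$ be a Lagrange multiplier and set $L = \lambda^T \Sigma \lambda - k(\lambda^T \lambda - 1)$. Then $\partial L / \partial \lambda = 2\Sigma\lambda - 2k\lambda = 0$, so $\Sigma\lambda = k\lambda$.

Therefore $\lambda$ is an eigenvector of $\Sigma$ with eigenvalue $k$, and the variance is $V = \lambda^T \Sigma \lambda = k$.

Choose the eigenvector with the largest eigenvalue.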
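The derivation can be checked numerically with a minimal sketch. It assumes a small data set given as a list of rows and finds the leading eigenvector of the covariance matrix by power iteration, which is a swapped-in technique, not something the slides prescribe. The function names `covariance` and `first_pc` are hypothetical.

```python
import math

def covariance(rows):
    """Covariance matrix (1/n normalization) of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
            for a in range(p)]

def first_pc(sigma, iters=500):
    """Approximate the unit eigenvector of sigma with the largest eigenvalue.

    Assumption: the start vector is not orthogonal to that eigenvector and
    sigma is not the zero matrix; otherwise power iteration fails.
    """
    p = len(sigma)
    lam = [1.0 / math.sqrt(p)] * p  # unit start vector
    for _ in range(iters):
        nxt = [sum(sigma[a][b] * lam[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        lam = [x / norm for x in nxt]
    # Variance of the projection: V = lam^T sigma lam
    v = sum(lam[a] * sigma[a][b] * lam[b] for a in range(p) for b in range(p))
    return lam, v

# Points on the line y = 2x: all variance lies along the direction (1, 2).
rows = [(-2, -4), (-1, -2), (0, 0), (1, 2), (2, 4)]
lam, v = first_pc(covariance(rows))
```

Here the projected variance `v` equals the largest eigenvalue of the covariance matrix, as the Lagrange-multiplier argument predicts.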


Relational Graph

  • Each node represents a distinct entity

    • Social networks

    • Gene relevance networks

    • Protein interaction networks

[Figure: example relational graph with nodes YKL172W, YOR206W, and YPL146C]


Motivation

  • Highly connected subgraphs in a large graph are usually not artifacts; they tend to reflect a group or a shared functionality

  • Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph


Microarray Data Analysis

  • How can results be integrated from multiple microarray experiments that are performed on the same set of genes?


Problem Definition

  • Given a set of relational graphs, find all frequent closed subgraphs with high edge connectivity


Constraints

  • Highly connected subgraph

    • The edge connectivity is greater than a threshold

  • Frequent subgraph

    • A subgraph is frequent if a large number of graphs contain this subgraph

  • Closed subgraph

    • A subgraph is closed if there does not exist a supergraph that has the same support
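The frequency and closedness constraints can be illustrated with a minimal sketch. It assumes that, because every node in a relational graph is a distinct entity, a subgraph can be represented simply as a set of edges, so containment is a subset test and no isomorphism check is needed. The connectivity constraint is ignored here, and the helper names are hypothetical.

```python
def edge(u, v):
    """Undirected edge between two distinct entities."""
    return frozenset((u, v))

def support(edge_set, graphs):
    """Number of graphs that contain every edge of edge_set."""
    return sum(1 for g in graphs if edge_set <= g)

def is_closed(edge_set, graphs):
    """True if no strict superset of edge_set has the same support.

    Checking single-edge extensions is enough: support can only shrink as
    edges are added, so any equal-support superset implies an equal-support
    one-edge extension.
    """
    base = support(edge_set, graphs)
    candidates = set().union(*graphs) - edge_set
    return all(support(edge_set | {e}, graphs) < base for e in candidates)

g1 = {edge("a", "b"), edge("b", "c"), edge("c", "a")}
g2 = {edge("a", "b"), edge("b", "c"), edge("c", "a"), edge("c", "d")}
g3 = {edge("a", "b"), edge("b", "c")}
graphs = [g1, g2, g3]

single = {edge("a", "b")}
path = {edge("a", "b"), edge("b", "c")}
```

In this toy database the single edge a-b has support 3 but is not closed, because the path a-b-c is a supergraph with the same support.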



Minimum Cut Decomposition

  • A minimum cut of a graph G is a smallest set of edges whose removal from G leaves G disconnected.

  • The edge connectivity of G is defined as the size of a minimum cut of G.

  • Problem: find the subgraphs of a graph whose minimum cut size (edge connectivity) is greater than K


Minimum Cut Decomposition

  • Solution: repeatedly find a minimum cut in the graph and remove the cut edges until the minimum cut size is greater than K or there is no edge left
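The decomposition loop can be sketched as follows. This is an illustration under assumptions, not the implementation used in the lecture: the global minimum cut is computed with the Stoer-Wagner algorithm (a choice made here), the graph is a simple undirected graph given as node and edge lists, and the function names are hypothetical.

```python
def stoer_wagner(nodes, edges):
    """Return (cut_size, one_side) of a global minimum cut of a connected graph."""
    w = {u: {} for u in nodes}  # w[u][v] = number of edges between super-nodes
    for u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    members = {u: {u} for u in nodes}
    best = (float("inf"), set())
    while len(w) > 1:
        start = next(iter(w))
        added = [start]
        weights = {v: w[start].get(v, 0) for v in w if v != start}
        while weights:  # one minimum-cut phase
            nxt = max(weights, key=weights.get)
            cut_of_phase = weights.pop(nxt)
            added.append(nxt)
            for v, wt in w[nxt].items():
                if v in weights:
                    weights[v] += wt
        s, t = added[-2], added[-1]
        if cut_of_phase < best[0]:
            best = (cut_of_phase, set(members[t]))
        members[s] |= members.pop(t)  # merge t into s
        for v, wt in w.pop(t).items():
            del w[v][t]
            if v != s:
                w[s][v] = w[s].get(v, 0) + wt
                w[v][s] = w[v].get(s, 0) + wt
    return best

def components(nodes, edges):
    """Connected components as a list of node sets."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for u in nodes:
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def decompose(nodes, edges, k):
    """Split along minimum cuts until every remaining piece has min cut > k."""
    result, work = [], [set(nodes)]
    while work:
        part = work.pop()
        sub = [(u, v) for u, v in edges if u in part and v in part]
        for comp in components(part, sub):
            if len(comp) < 2:
                continue  # no edge left
            cut, side = stoer_wagner(comp, [e for e in sub if e[0] in comp])
            if cut > k:
                result.append(frozenset(comp))
            else:  # removing the cut edges separates side from the rest
                work.extend([side, comp - side])
    return result

# Two 4-cliques joined by a single bridge edge (3, 4).
clique = lambda vs: [(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]
edges = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(3, 4)]
pieces = decompose(range(8), edges, 2)
```

With K = 2 the bridge is cut first, and both 4-cliques (edge connectivity 3) survive as highly connected subgraphs.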


Challenges

  • How to perform minimum cut decomposition in the context of multiple relational graphs

  • How to integrate it with the pattern-growth approach and the pattern-reduction approach



No Downward Closure Property

  • Given two graphs G and G’ where G is a subgraph of G’, the connectivity of G can be either lower or higher than that of G’: connectivity has no downward closure property.


Minimum Degree Constraint

Let G be a frequent graph and X be the set of edges that can be added to G such that G ∪ {e} (e ∈ X) is connected and frequent. The graph G ∪ X is the maximal graph to which G can be extended for the vertices that belong to G.

[Figure: G and its extension G ∪ X]


Pattern-Growth Approach

  • Find a small frequent candidate graph

    • Remove vertices (shadow graph) whose degree is less than the connectivity

    • Decompose it to extract the subgraphs satisfying the connectivity constraint

    • Stop decomposing when the subgraph has been checked before

  • Extend this candidate graph by adding new vertices and edges

  • Repeat


Pattern-Reduction Approach

  • Decompose the relational graphs according to the connectivity constraint


Pattern-Reduction Approach (cont.)

  • Intersect them and decompose the resulting subgraphs



Experimental Results

  • Pattern-growth approach: CloseCut

  • Pattern-reduction approach: Splat

  • Synthetic data, parameterized by the number of graphs, objects, and seeds, the size of seeds, the density, the number of seeds per graph, and the density of noise edges


Experimental Results (cont.)

  • 32 yeast microarray data sets from Stanford Microarray Database and the NCBI Gene Expression Omnibus

  • Each data set has the expression profiles of 6,661 genes in at least 8 experiments, for example:

    • Cell cycle

    • Amino acid starvation

    • Heat shock

  • We constructed 32 relational graphs from these data sets


Experimental Results (Synthetic Data)


Experimental Results (32 Microarray Datasets)


Discovered Patterns

  • Ribosomal RNA Processing

[Figure: discovered subgraph, including a node labeled UNKNOWN]


Discovered Patterns

  • Ribosomal Biogenesis

[Figure: discovered subgraph, including a node labeled UNKNOWN]


Summary

  • Introduce a new graph mining problem

  • Develop graph algorithms in the context of multiple graphs, where the existing methods should be re-examined

  • Demonstrate the applicability of frequent graph mining to biological networks


Types of Graph Database Queries

  • Given a query graph Q and a graph database G, perform one of the following:

    • Graph Isomorphism Query: Find a graph in G equivalent to Q.

    • Subgraph Isomorphism Query: Find all graphs in G with a subgraph equivalent to Q.

    • Similarity Query: Find all graphs in G which are similar to Q.


Graph Isomorphism

Let V(G) be the vertex set of a graph G and E(G) its edge set. Graphs G and H are isomorphic iff there is a bijection f: V(G) → V(H) such that uv ∈ E(G) if and only if f(u)f(v) ∈ E(H).
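The definition translates directly into a brute-force test that tries every bijection. The sketch below assumes small, simple, undirected, unlabeled graphs given as node and edge lists; it is meant only to make the definition concrete and is exponential in the number of nodes. The function name is hypothetical.

```python
from itertools import permutations

def are_isomorphic(nodes_g, edges_g, nodes_h, edges_h):
    """Try every bijection f: V(G) -> V(H) and check edge preservation."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    nodes_g, nodes_h = list(nodes_g), list(nodes_h)
    if len(nodes_g) != len(nodes_h) or len(eg) != len(eh):
        return False
    for image in permutations(nodes_h):
        f = dict(zip(nodes_g, image))
        # With equal edge counts and f a bijection, mapping every edge of G
        # onto an edge of H already gives the "if and only if" condition.
        if all(frozenset((f[u], f[v])) in eh for u, v in eg):
            return True
    return False

path_abc = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_123 = ([1, 2, 3], [(1, 3), (3, 2)])
star = ([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])
path4 = ([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
```

The two three-node paths are isomorphic even though their node names differ, while the star and the four-node path are not, despite having the same number of nodes and edges.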


Graph Labeling

  • All nodes in a graph may be considered equivalent. Labels in such a graph are merely names.

  • Alternatively, graphs may be labeled with class labels.

For example, in the graph of benzene, vertices labeled “C” correspond to carbon atoms and vertices labeled “H” correspond to hydrogen atoms.

Nodes and edges with different class labels are not considered interchangeable.


Class Labels and Isomorphism

Under a class labeling scheme, graph isomorphism restricts the bijection so that nodes and edges are mapped only to nodes and edges with the same class label.



Subgraph Isomorphism

  • Let V(G) be the vertex set of a graph G and E(G) its edge set. Graph G is sub-isomorphic to graph H iff there is an injection f: V(G) → V(H) such that uv ∈ E(G) if and only if f(u)f(v) ∈ E(H).

    In other words:

  • A graph G is sub-isomorphic to graph H iff graph G is isomorphic to at least one subgraph of H.
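A brute-force sketch of the sub-isomorphism test is shown below. It assumes small, simple, undirected, unlabeled graphs. The `induced` flag is an addition made here: with `induced=True` the "if and only if" condition is checked in both directions, while `induced=False` checks only that every edge of G maps onto an edge of H, which is the reading often used in graph indexing. The function name is hypothetical.

```python
from itertools import combinations, permutations

def is_sub_isomorphic(nodes_g, edges_g, nodes_h, edges_h, induced=True):
    """Search for an injection f: V(G) -> V(H) that preserves edges."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    nodes_g = list(nodes_g)
    # permutations(..., r) enumerates every injective assignment of images.
    for image in permutations(list(nodes_h), len(nodes_g)):
        f = dict(zip(nodes_g, image))
        if not all(frozenset((f[u], f[v])) in eh for u, v in eg):
            continue  # some edge of G is not mapped onto an edge of H
        if not induced:
            return True
        # Reverse direction: non-adjacent nodes of G must stay non-adjacent.
        if all(frozenset((u, v)) in eg or frozenset((f[u], f[v])) not in eh
               for u, v in combinations(nodes_g, 2)):
            return True
    return False

path3 = (["a", "b", "c"], [("a", "b"), ("b", "c")])
triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
path4 = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
```

The three-node path embeds in the four-node path under either reading, but it embeds in the triangle only when the reverse direction is not required.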


Graph Similarity

  • What makes two graphs similar?

    • Abstractly, two graphs can be described as similar if they have a high number of corresponding nodes and edges.

    • However, depending on the interpretation, changing a single node may (or may not) completely change the properties of the represented object.

      Thus, similarity is application dependent.



Graph Similarity

For this discussion, similarity between two graphs, G1 and G2, is defined as the maximum number of node and edge matches under any mapping of nodes between them.
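This definition can be made concrete with a brute-force sketch. The assumptions here are not stated on the slide: nodes carry class labels, edges are unlabeled, a node match means equal labels, and an edge match means that an edge of one graph is mapped onto an edge of the other regardless of the endpoint labels. The function name is hypothetical.

```python
from itertools import permutations

def similarity(labels1, edges1, labels2, edges2):
    """Maximum number of node and edge matches over all node mappings."""
    # Map the smaller graph into the larger one; unmapped nodes match nothing.
    if len(labels1) > len(labels2):
        labels1, edges1, labels2, edges2 = labels2, edges2, labels1, edges1
    e2 = {frozenset(e) for e in edges2}
    small = list(labels1)
    best = 0
    for image in permutations(list(labels2), len(small)):
        f = dict(zip(small, image))
        node_matches = sum(labels1[u] == labels2[f[u]] for u in small)
        edge_matches = sum(frozenset((f[u], f[v])) in e2 for u, v in edges1)
        best = max(best, node_matches + edge_matches)
    return best

g1 = ({1: "B", 2: "B", 3: "M"}, [(1, 2), (2, 3)])
g2 = ({"a": "B", "b": "B", "c": "Z"}, [("a", "b"), ("b", "c")])
score = similarity(*g1, *g2)
```

For these two labeled paths the best mapping matches the two B nodes and both edges, while M and Z never match.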

[Figure: example graph comparisons, with Similarity = 3 and Similarity = 6]



A Graph Isomorphism Query

[Figure: a graph query applied to a database of graphs yields the graph isomorphism matches]



A Subgraph Isomorphism Query

[Figure: a graph query applied to a database of graphs yields the subgraph isomorphism matches]



A Graph Similarity Query

[Figure: a graph query and a similarity criterion (> 4), applied to a database of graphs, yield the subgraph similarity matches]


Computational Challenges

  • Pairwise graph comparison is difficult.

    • The graph isomorphism problem is in NP, but it is not known to be in P or to be NP-complete (it is GI-complete by definition).

    • The subgraph isomorphism problem is NP-complete.

    • The usual similarity comparisons are also not known to be solvable in polynomial time.

  • Graph databases are often large.

    • The NCI/NIH AIDS antiviral screen dataset contains ~42,000 chemical compounds, with an average of 25 vertices and 27 edges.

    • Intelligent indexing and filtering are needed!


Related Work

  • GraphGrep by Shasha et al.

    • Filters by enumerating all possible node-to-node paths up to a specified maximum length.

  • gIndex by Yan et al.

    • Indexes by finding distinctive features from frequently occurring subgraphs.

  • Limitations:

    • Support only discrete values for nodes and edges.

    • Require exhaustive enumeration of features.

    • Summarizing graphs by features loses information about the graphs.



Graph Closures

A graph closure is an element-wise union of graphs. It has the characteristics of a graph, except that instead of singleton labels, each node or edge of a graph closure can have multiple labels. The symbol ε represents a null label.

[Figure: graphs G1 and G2 and their closure C1 = Closure(G1, G2), which carries the label sets {B,C} and {C, ε}]



Volume of Graph Closures

A graph closure is a bounding container which can contain one or more graphs. The volume of a graph closure is determined by the number of graph permutations it contains.
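A minimal sketch of a closure and its volume follows. It assumes that the two graphs have already been aligned, so that corresponding elements share a key, and that the volume is the product of the label-set sizes, which is consistent with Volume(C1) = 4 in the example. Only node labels are modeled, the example labels are a reading of the slide's figure, and the function names are hypothetical.

```python
from math import prod

EPS = "ε"  # null label for an element that is missing from one graph

def closure(g1, g2):
    """Element-wise union of two aligned graphs (element key -> label)."""
    return {x: {g1.get(x, EPS), g2.get(x, EPS)} for x in g1.keys() | g2.keys()}

def volume(c):
    """Number of label combinations the closure can represent."""
    return prod(len(labels) for labels in c.values())

# Assumed reading of the example: G1 has labels B, B, M; G2 has C, B, M, C.
g1 = {"n1": "B", "n2": "B", "n3": "M"}
g2 = {"n1": "C", "n2": "B", "n3": "M", "n4": "C"}
c1 = closure(g1, g2)
```

Here `c1` carries the label sets {B, C} and {C, ε} on two elements and singleton sets elsewhere, so its volume is 2 × 2 = 4.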

[Figure: closure C1, with label sets {B,C} and {C, ε}, and the graphs it contains; Volume(C1) = 4]



Isomorphism of Graph Closures

Isomorphism can be extended to graph closures. When matching, any label in the label set of a node or edge can be used:

  • A graph is sub-isomorphic to a graph closure if it is sub-isomorphic to at least one of the graphs that the closure encloses.

[Figure: a closure with label sets {B,C} and {C, ε}, and the graphs that are sub-isomorphic to it]


Pseudo Subgraph Isomorphism


Further Readings

  • H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou, Mining coherent dense subgraphs across massive biological networks for functional discovery, ISMB'05.

  • X. Yan, X. J. Zhou, and J. Han, Mining closed relational graphs with connectivity constraints, SIGKDD'05.

  • H. He and A. K. Singh, Closure-Tree: An Index Structure for Graph Queries, ICDE'06.

  • D. Williams, J. Huan, and W. Wang, Graph Database Indexing Using Structured Graph Decomposition, ICDE'07.

