
# Graph Similarity


Institute of Computer Science and Technology of Peking University

### Graph Similarity

Instructor: Lei Zou

• Maximal Common Subgraph

• Minimal Edit Distance

• Graph Similarity Search

### Maximal Common Subgraph

Def. 1 (Induced Subgraph). An induced subgraph is a set S of vertices of a graph G and those edges of G with both endpoints in S.

Def. 2 (Maximum Common Induced Subgraph). A graph G12 is a common induced subgraph of graphs G1 and G2 if G12 is isomorphic to induced subgraphs of G1 and G2, respectively. A maximum common induced subgraph (MCIS) is a common induced subgraph G12 with the largest number of vertices.

[Figure: example graphs with vertex labels A, B, C, D, and their MCIS]

Def. 3 (Maximum Common Edge Subgraph). An MCES is a subgraph consisting of the largest number of edges common to both G1 and G2.

[Figure: example graphs with vertex labels A, B, C, D, illustrating the MCES]

• Maximum clique-based algorithm (for MCIS)

Def. 4 The modular product of two graphs G1 and G2 is defined on the vertex set V(G1) × V(G2), with two vertices (ui, vi) and (uj, vj) (where ui ≠ uj and vi ≠ vj) being adjacent whenever

1. ui and vi have the same vertex label, and so do uj and vj, and either

2. (ui, uj) ∈ E(G1) and (vi, vj) ∈ E(G2), or

3. (ui, uj) ∉ E(G1) and (vi, vj) ∉ E(G2).

[Figure: G1 (vertices u1 to u4) and G2 (vertices v1 to v4) with labels A, B, C, D, and their modular product (association graph) on the vertices (u1, v1), (u2, v2), (u3, v3), (u4, v4)]

A maximal clique in the modular product corresponds to a maximal common induced subgraph; a maximum clique corresponds to an MCIS.
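The correspondence can be tried out on a tiny case. The sketch below builds the label-aware modular product of Def. 4 and finds a maximum clique by brute force; the function names, the example graphs, and the brute-force search are assumptions made for illustration, not part of the slides.

```python
from itertools import combinations

def modular_product(labels1, edges1, labels2, edges2):
    """Return (nodes, adjacency) of the label-aware modular product."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Condition 1: only pair up vertices that carry the same label.
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (ui, vi), (uj, vj) in combinations(nodes, 2):
        if ui == uj or vi == vj:
            continue  # each vertex may be matched only once
        in1 = frozenset((ui, uj)) in e1
        in2 = frozenset((vi, vj)) in e2
        if in1 == in2:  # conditions 2 and 3: both edges, or both non-edges
            adj[(ui, vi)].add((uj, vj))
            adj[(uj, vj)].add((ui, vi))
    return nodes, adj

def maximum_clique(nodes, adj):
    """Brute-force maximum clique; only suitable for very small graphs."""
    for size in range(len(nodes), 0, -1):
        for cand in combinations(nodes, size):
            if all(b in adj[a] for a, b in combinations(cand, 2)):
                return set(cand)
    return set()

# Hypothetical example: G1 is the path A-B-C, G2 is the triangle A-B-C.
labels1 = {"u1": "A", "u2": "B", "u3": "C"}
edges1 = [("u1", "u2"), ("u2", "u3")]
labels2 = {"v1": "A", "v2": "B", "v3": "C"}
edges2 = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]

nodes, adj = modular_product(labels1, edges1, labels2, edges2)
mcis_pairs = maximum_clique(nodes, adj)  # matched vertex pairs of an MCIS
```

Here (u1, v1) and (u3, v3) are not adjacent in the product, because u1-u3 is a non-edge in G1 while v1-v3 is an edge in G2, so the largest clique has two vertex pairs.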

• Def. 5 A clique in a graph G is a subset of vertices such that each pair of vertices in the subset is connected by an edge in G.

A maximal clique is a clique that cannot be extended by including one more adjacent vertex, that is, a clique whose vertex set is not contained in the vertex set of a larger clique.

A maximum clique is a clique of the largest possible size in a given graph. The clique number ω(G) of a graph G is the number of vertices in a maximum clique in G.

Maximal cliques: (1,2,3) and (1,3,4,5)

A maximum clique: (1,3,4,5)

[Figure: example graph with numbered vertices]

• Bron–Kerbosch algorithm (1973)

Basic Algorithm:

    R = ∅; P = V(G)   // V(G) denotes all vertices in G
    FindingMaximalClique(R, P):
        if P is empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ∪ {v}, P ∩ N(v))   // N(v) denotes all of v's neighbor vertices

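Problem: It may generate duplicate answers, because the same clique can be reached through different vertex orders.

[Figure: example graph on vertices 1, 2, 3]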

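    FindingMaximalClique(R, P):
        if P is empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ∪ {v}, P ∩ N(v))   // N(v) denotes all of v's neighbor vertices
            P = P \ {v}

Problem: It may report some non-maximal cliques, because a clique found after v is removed from P may still be extendable by v.

[Figure: example graph on vertices 1, 2, 3]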

    R = ∅; P = V(G); X = ∅   // V(G) denotes all vertices in G
    FindingMaximalClique(R, P, X):
        if P and X are both empty:
            report R as a maximal clique
        for each vertex v in P:
            FindingMaximalClique(R ∪ {v}, P ∩ N(v), X ∩ N(v))   // N(v) denotes all of v's neighbor vertices
            P = P \ {v}
            X = X ∪ {v}   // why? v is now excluded, so a clique that v could still extend must not be reported
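The three-set version translates almost line by line into Python. This is a minimal sketch: the adjacency-dictionary representation and the edge set of the example graph are assumptions (the edges were chosen so that the maximal cliques are (1,2,3) and (1,3,4,5), as in the earlier example).

```python
def bron_kerbosch(R, P, X, adj, out):
    """Append every maximal clique extending R to `out`.

    R: current clique, P: candidate vertices, X: already-processed vertices.
    """
    if not P and not X:
        out.append(set(R))  # nothing can extend R, so R is maximal
        return
    for v in list(P):  # iterate over a snapshot; P is rebound below
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}    # v has been fully explored
        X = X | {v}    # remember v so cliques it could extend are not reported

# Assumed edge set, chosen to give the maximal cliques (1,2,3) and (1,3,4,5).
adj = {
    1: {2, 3, 4, 5},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {1, 3, 5},
    5: {1, 3, 4},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
```

Because X carries the excluded vertices into each recursive call, a branch ends silently (P empty, X non-empty) whenever its clique could still be extended by an earlier vertex, which removes both the duplicates and the non-maximal reports of the two simpler versions.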

• Theorem. Given a vertex u, suppose that all the maximal cliques containing Q ∪ {u} have been generated. Then every new maximal clique containing Q, but not Q ∪ {u}, must contain at least one vertex q that is not adjacent to u. Consequently, it suffices to branch on u and on the candidates that are not adjacent to u.
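The theorem is what justifies pivoting: once a pivot u is fixed, only u and its non-neighbors among the candidates need to be branched on. The sketch below applies this to the three-set recursion; choosing the pivot that maximizes |P ∩ N(u)| is a common choice and an assumption here, as is the example graph.

```python
def bron_kerbosch_pivot(R, P, X, adj, out):
    """Bron-Kerbosch with pivoting; Q in the theorem corresponds to R here."""
    if not P and not X:
        out.append(set(R))
        return
    if not P:
        return  # R can only be extended by excluded vertices: not maximal
    # Pivot: the vertex of P ∪ X with the most neighbors among the candidates.
    u = max(P | X, key=lambda w: len(P & adj[w]))
    # Neighbors of u are skipped: any maximal clique containing one of them
    # either contains u or contains some non-neighbor of u.
    for v in list(P - adj[u]):
        bron_kerbosch_pivot(R | {v}, P & adj[v], X & adj[v], adj, out)
        P = P - {v}
        X = X | {v}

# Assumed example graph with maximal cliques (1,2,3) and (1,3,4,5).
adj = {
    1: {2, 3, 4, 5},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {1, 3, 5},
    5: {1, 3, 4},
}
pivot_cliques = []
bron_kerbosch_pivot(set(), set(adj), set(), adj, pivot_cliques)
```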

• Backtracking algorithms (e.g., McGregor algorithm) (for both MCIS and MCES)

It can be described through a state space representation. Each state s represents a common subgraph of the two graphs under construction. This common subgraph is part of the MCS to be eventually formed.

### Minimal Edit Distance

• Six edit operations

• Insert an isolated vertex

• Delete an isolated vertex

• Change the label of a vertex

• Insert an edge between two disconnected vertices

• Delete an edge between two connected vertices

• Change the label of an edge

• Graph Edit Distance:

• The minimum number of edit operations needed to transform one graph into another (computing it is NP-hard)

[Figure: example graphs G1 and G2 with vertex labels A, B, C, D; MED(G1, G2) = 4]

Given two graphs G1 and G2, assume that they have the same number of vertices. Define a function f: V(G1) → V(G2). The distance under this function is:

The distance between G1 and G2 is defined as

We can prove that

If G1 and G2 have different numbers of vertices, say |V(G1)| < |V(G2)|, we add |V(G2)| − |V(G1)| pseudo vertices to G1, and the following equation still holds.
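A brute-force sketch of the mapping view follows. It assumes unit costs, unlabeled edges, and that the distance under a bijection f counts vertex-label mismatches plus mismatched edges; the slide's own formula is not reproduced here, so this cost definition, the function names, and the example graphs are all assumptions.

```python
from itertools import permutations

def mapping_cost(labels1, edges1, labels2, edges2, f):
    """Unit-cost edits implied by the bijection f: V(G1) -> V(G2)."""
    # Vertex cost: one operation wherever labels differ.
    # A pseudo vertex carries the label None, so it always mismatches.
    cost = sum(labels1[u] != labels2[f[u]] for u in labels1)
    # Edge cost: G1 edges without an image in G2, plus G2 edges without a preimage.
    mapped = {frozenset((f[a], f[b])) for a, b in edges1}
    target = {frozenset(e) for e in edges2}
    return cost + len(mapped ^ target)

def edit_distance(labels1, edges1, labels2, edges2):
    """Minimum mapping cost over all bijections (factorial time)."""
    labels1, labels2 = dict(labels1), dict(labels2)
    # Pad the smaller graph with pseudo vertices until the sizes agree.
    for i in range(abs(len(labels1) - len(labels2))):
        small = labels1 if len(labels1) < len(labels2) else labels2
        small[("pseudo", i)] = None
    v1, v2 = list(labels1), list(labels2)
    return min(
        mapping_cost(labels1, edges1, labels2, edges2, dict(zip(v1, perm)))
        for perm in permutations(v2)
    )

# Hypothetical example: path A-B-D versus path A-B-C-D.
g1_labels = {1: "A", 2: "B", 3: "D"}
g1_edges = [(1, 2), (2, 3)]
g2_labels = {"a": "A", "b": "B", "c": "C", "d": "D"}
g2_edges = [("a", "b"), ("b", "c"), ("c", "d")]
distance = edit_distance(g1_labels, g1_edges, g2_labels, g2_edges)
```

For these two paths the cheapest mapping relabels D to C, inserts a new vertex D, and inserts the edge C-D.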

[Figure: example graphs G1 and G2 with vertex labels A, B, C, D]

• Exact Algorithm (A* algorithm)

What is the A* algorithm?

A* uses a best-first search and finds a least-cost path from a given initial node to one goal node (out of one or more possible goals). As A* traverses the graph, it follows the path with the lowest expected total cost, keeping a sorted priority queue of alternate path segments along the way.

The expected total cost of a node x is f(x) = g(x) + h(x), where g(x) denotes the cost from the starting node to the current node x, and h(x) denotes the heuristic estimate (a lower bound) of the distance from x to the goal.

Given two graphs G1 and G2 with the same number m of vertices, consider the following process.

Let N1 and N2 denote the vertices in G1 and G2 that have been matched:

N1 = (v1, v2, …, vn);

N2 = (u1, u2, …, un).

Let M1 and M2 denote the vertices in G1 and G2 that have not been matched:

M1 = (vn+1, vn+2, …, vm);

M2 = (un+1, un+2, …, um).
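The process can be sketched as an A* search over partial vertex mappings. In this sketch g(x) is the cost already fixed among the matched vertices, and h(x) is a simple label-based lower bound on the unmatched ones; unit costs, unlabeled edges, this particular heuristic, and all names are assumptions rather than the slides' exact formulation.

```python
import heapq
from collections import Counter
from itertools import count

def astar_ged(labels1, edges1, labels2, edges2):
    """A* over partial mappings; both graphs must have the same m vertices."""
    v1, v2 = list(labels1), list(labels2)
    assert len(v1) == len(v2), "pad with pseudo vertices (label None) first"
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    def h(n, used):
        # Unmatched G1 labels that cannot be paired with an equal unmatched
        # G2 label each need at least one edit, so this never overestimates.
        rest1 = Counter(labels1[u] for u in v1[n:])
        rest2 = Counter(labels2[v] for v in v2 if v not in used)
        return (len(v1) - n) - sum((rest1 & rest2).values())

    tie = count()  # tie-breaker so the heap never compares mappings
    heap = [(h(0, ()), next(tie), 0, ())]  # entries: (f, tie, g, mapping)
    while heap:
        _, _, g, mapping = heapq.heappop(heap)
        n = len(mapping)  # v1[k] is matched to mapping[k] for k < n
        if n == len(v1):
            return g  # first complete mapping popped is optimal
        u = v1[n]
        for v in v2:
            if v in mapping:
                continue
            step = int(labels1[u] != labels2[v])  # vertex relabel/insert/delete
            for k, w in enumerate(mapping):       # edges to matched vertices
                step += (frozenset((u, v1[k])) in e1) != (frozenset((v, w)) in e2)
            new = mapping + (v,)
            heapq.heappush(heap, (g + step + h(n + 1, new), next(tie), g + step, new))

# Hypothetical example: path A-B-D (padded with one pseudo vertex) vs. path A-B-C-D.
p_labels = {1: "A", 2: "B", 3: "D", 4: None}
p_edges = [(1, 2), (2, 3)]
q_labels = {"a": "A", "b": "B", "c": "C", "d": "D"}
q_edges = [("a", "b"), ("b", "c"), ("c", "d")]
ged = astar_ged(p_labels, p_edges, q_labels, q_edges)
```

A tighter h prunes more of the search space; the bound must stay a lower bound, otherwise the first complete mapping popped is no longer guaranteed to be optimal.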

### Graph Similarity Search

Comparing Stars: On Approximating Graph Edit Distance

Zhiping Zeng, Anthony K.H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou

@VLDB09

• Problem definition

Given a graph database D consisting of n graphs

• Approximate full graph search

• Find all graphs gi in D such that GED(q, gi) ≤ τ

• Approximate subgraph search

• Find all graphs gi in D such that GED(q, r) ≤ τ for some subgraph r ⊆ gi

• Main Idea:

G → star structures

• Star Structure:

A triple (r, L, l), where r is the root vertex, L is the set of leaves, and l is the labeling function.
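Decomposing a graph into stars takes one pass over its vertices: each vertex becomes a root and its neighbors become the leaves. The sketch below represents a star by its root label and the sorted labels of its leaves; this representation and the example graph are assumptions.

```python
def stars(labels, edges):
    """Return the multiset of stars of a graph as a sorted list.

    Each star is (root label, sorted tuple of leaf labels); there is one
    star per vertex, rooted at that vertex.
    """
    nbrs = {v: [] for v in labels}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    return sorted(
        (labels[v], tuple(sorted(labels[w] for w in nbrs[v]))) for v in labels
    )

# Hypothetical example: the path A-B-C.
path_stars = stars({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
```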

• Star edit distance

• Given two multisets of star structures S1 and S2, P: S1 → S2 is a bijection.

• Assignment Problem
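The sketch below shows how a mapping distance between two star multisets could be computed as an assignment problem. The star edit distance used here (root mismatch, plus the difference in leaf counts, plus the leaf labels that cannot be matched) and the padding with empty stars are recalled from the Comparing Stars paper and are not given in this transcript, so treat them as assumptions. The brute-force search over bijections stands in for a proper assignment solver such as the Hungarian algorithm.

```python
from collections import Counter
from itertools import permutations

EPSILON = (None, ())  # assumed padding star: no label, no leaves

def star_edit_distance(s1, s2):
    """Assumed star edit distance between (root label, leaf labels) stars."""
    (r1, leaves1), (r2, leaves2) = s1, s2
    root_cost = int(r1 != r2)
    common = sum((Counter(leaves1) & Counter(leaves2)).values())
    leaf_cost = (abs(len(leaves1) - len(leaves2))
                 + max(len(leaves1), len(leaves2)) - common)
    return root_cost + leaf_cost

def mapping_distance(stars1, stars2):
    """Cheapest bijection P between the two (padded) star multisets."""
    a, b = list(stars1), list(stars2)
    a += [EPSILON] * (len(b) - len(a))  # a negative count adds nothing
    b += [EPSILON] * (len(a) - len(b))
    return min(
        sum(star_edit_distance(s, t) for s, t in zip(a, perm))
        for perm in permutations(b)
    )

# Hypothetical example: stars of the paths A-B-C and A-B-D.
s1 = [("A", ("B",)), ("B", ("A", "C")), ("C", ("B",))]
s2 = [("A", ("B",)), ("B", ("A", "D")), ("D", ("B",))]
mu = mapping_distance(s1, s2)
```

In this example a single relabeling (C to D) changes two stars, the one rooted at the relabeled vertex and the one rooted at its neighbor.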

What’s the relationship between GED(g1, g2) and the mapping distance?

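A distance function f is a metric if and only if the following conditions hold for all x, y, z:

• f(x, y) ≥ 0 (non-negativity)

• f(x, y) = 0 if and only if x = y (identity)

• f(x, y) = f(y, x) (symmetry)

• f(x, z) ≤ f(x, y) + f(y, z) (triangle inequality)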

We can prove that

• Graph edit distance is a metric (assuming that all edit operation costs are non-negative).

• Mapping distance is also a metric.

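• Given two graphs g1 and g2, let P = (p1, p2, …, pk) be an alignment transforming g1 to g2. Accordingly, there is a sequence of graphs

g1 = h0 → h1 → … → hk = g2, where hi−1 → hi indicates that hi is the graph derived by performing pi on hi−1.

As the mapping distance is a metric, the triangle inequality bounds the mapping distance between g1 and g2 by the sum of the mapping distances between consecutive graphs hi−1 and hi (i = 1, …, k).

What’s the relationship between one operation pi and the mapping distance between hi−1 and hi?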

1. Edge Insertion/Deletion

One edge insertion/deletion affects at most two stars, and the cost for each star is at most 2.

Thus, the mapping distance between hi−1 and hi is at most 4 due to one edge insertion/deletion.

2. Vertex Insertion/Deletion

One vertex insertion/deletion affects at most one star, and the cost for that star is at most 1.

Thus, the mapping distance between hi−1 and hi is at most 1 due to one vertex insertion/deletion.

3. Vertex Relabeling

Relabeling one vertex v0 affects at most deg(v0) + 1 stars.

Lower Bound

Upper Bound:

Based on the bipartite graph matching, we can define an upper bound for the edit distance.

Experiment datasets

• Real dataset

• AIDS antiviral screen compounds: 42,687 chemical compounds

• Synthetic dataset

• 1000 graphs, average size:10

Efficient graph similarity joins with edit distance constraints

Xiang Zhao, Chuan Xiao, Xuemin Lin, and Wei Wang.

@ICDE12

• Problem definition

Given two sets of graphs R and S, a graph similarity join with edit distance threshold τ returns pairs of graphs from each set such that their graph edit distance is no larger than τ, i.e.,

{ ⟨r, s⟩ ∣ ged(r, s) ≤ τ, r ∈ R, s ∈ S }.

• This paper focuses on the self-join case

{ ⟨ri, rj⟩ ∣ ged(ri, rj) ≤ τ ∧ ri.id < rj.id, ri ∈ R, rj ∈ R }.

• Definition (path-based q-gram): A path-based q-gram in a graph r is a simple path of length q.

Let Qr denote the multiset of q-grams in a graph r, and Qru denote the multiset of q-grams that contain the vertex u.
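A small sketch of q-gram extraction follows. It assumes that "length q" means q edges (so a 1-gram is a single edge), that vertex ids are comparable, and that a q-gram is identified by its label sequence read in the lexicographically smaller direction; the function name and example are hypothetical.

```python
def path_qgrams(labels, edges, q):
    """Sorted list (multiset) of label sequences of simple paths with q edges."""
    nbrs = {v: set() for v in labels}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    grams = []

    def extend(path):
        if len(path) == q + 1:
            # Every undirected path is discovered from both ends; keep the
            # traversal that starts at the smaller vertex id.
            if q == 0 or path[0] < path[-1]:
                seq = [labels[v] for v in path]
                grams.append(tuple(min(seq, seq[::-1])))
            return
        for w in nbrs[path[-1]]:
            if w not in path:  # simple path: no repeated vertex
                extend(path + [w])

    for v in labels:
        extend([v])
    return sorted(grams)

# Hypothetical example: the chain C-C-O.
chain_labels = {0: "C", 1: "C", 2: "O"}
chain_edges = [(0, 1), (1, 2)]
one_grams = path_qgrams(chain_labels, chain_edges, 1)
two_grams = path_qgrams(chain_labels, chain_edges, 2)
```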

• Count Filtering: Consider two graphs r and s. If ged(r, s) ≤ τ, then r and s must share at least LBpath common q-grams.

For the two earlier examples with τ = 1:

When q = 1, LB = max(4 − 3, 5 − 3) = 2

When q = 2, LB = max(5 − 5, 7 − 6) = 1
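The arithmetic above can be reproduced with a one-line bound. The sketch reads each term as |Q| − τ · D, where |Q| is the number of q-grams of a graph and D is the largest number of q-grams that contain a single vertex; this reading of the numbers, and the function and parameter names, are assumptions, since the formula itself is not in the transcript.

```python
def lb_path(size_r, d_r, size_s, d_s, tau):
    """Assumed count-filtering bound: max(|Qr| - tau*Dr, |Qs| - tau*Ds).

    size_*: number of q-grams in each graph.
    d_*:    largest number of q-grams sharing one vertex in that graph.
    """
    return max(size_r - tau * d_r, size_s - tau * d_s)

def passes_count_filter(common, size_r, d_r, size_s, d_s, tau):
    """A pair survives only if it shares at least the bound's worth of q-grams."""
    return common >= lb_path(size_r, d_r, size_s, d_s, tau)

# The two cases worked out above, both with tau = 1.
lb_q1 = lb_path(4, 3, 5, 3, tau=1)
lb_q2 = lb_path(5, 5, 7, 6, tau=1)
```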

• Prefix Filtering

Sort the q-grams of Qr and Qs in a common global order, and let the p-prefix be their first p elements. If |Qr ∩ Qs| ≥ α, then the (|Qr| − α + 1)-prefix of Qr and the (|Qs| − α + 1)-prefix of Qs must have at least one common q-gram.
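The prefix condition is cheap to test, which is why it is used as a filter. In this sketch the q-grams are plain strings and the global order is lexicographic; both choices, and the function name, are assumptions for illustration.

```python
def prefixes_overlap(qr, qs, alpha):
    """Prefix filter for two q-gram lists sorted in the same global order.

    Returns False only when the pair cannot share alpha q-grams.
    """
    pr = qr[: max(0, len(qr) - alpha + 1)]  # (|Qr| - alpha + 1)-prefix
    ps = qs[: max(0, len(qs) - alpha + 1)]  # (|Qs| - alpha + 1)-prefix
    return bool(set(pr) & set(ps))

# Hypothetical q-gram lists, already sorted lexicographically.
q_r = ["a", "b", "c", "d"]
q_s = ["b", "c", "d", "e"]   # shares 3 q-grams with q_r
q_t = ["c", "d", "e", "f"]   # shares only 2 q-grams with q_r

keep_rs = prefixes_overlap(q_r, q_s, alpha=3)
keep_rt = prefixes_overlap(q_r, q_t, alpha=3)
```

A pair that passes the filter is only a candidate: overlapping prefixes do not guarantee that α q-grams are actually shared, so the pair still needs to be verified.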

• Minimum Edit Filtering

• With τ = 1 and q = 1, LB = 2 (at least 2 matching q-grams are needed), which is below the 3 q-grams the pair shares, so count filtering does not prune the pair.

Consider the two mismatching q-grams in s, C-O and C-N: they are disjoint, so a single edit operation cannot affect both.

• To handle the general case, where q-grams may overlap:

• Minimum graph edit operation problem:

Given a multiset of q-grams Q, find the minimum number of graph edit operations that can affect all the q-grams in Q.

The problem is NP-hard, so a greedy algorithm is used.

• Label Filtering

• Connect the mismatching components, and compute the minimum edit operations for them.

• John W. Raymond and Peter Willett, Maximum common subgraph isomorphism algorithms for the matching of chemical structures, Journal of Computer-Aided Molecular Design, 16: 521–533, 2002.

• D. Conte, C. Guidobaldi, and C. Sansone, A Comparison of Three Maximum Common Subgraph Algorithms on a Large Database of Labeled Graphs, GbRPR'03.

• Etsuji Tomita, Akira Tanaka, Haruhisa Takahashi: The worst-case time complexity for generating all maximal cliques and computational experiments. Theor. Comput. Sci. 363(1): 28-42 (2006)

• Andrew K. C. Wong, Manlai You, S. C. Chan, An Algorithm for Graph Optimal Monomorphism, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 20, No. 3, May/June 1990

• Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang: Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012: 834-845

• Zhiping Zeng, Anthony K. H. Tung, Jianyong Wang, Jianhua Feng, Lizhu Zhou: Comparing Stars: On Approximating Graph Edit Distance. PVLDB 2(1): 25-36 (2009)