Subgraph Search Over Large Graph Database

北京大学计算机科学技术研究所 Institute of Computer Science and Technology of Peking University Subgraph Search Over Large Graph Database Instructor: Lei Zou

Outline • Subgraph Isomorphism Algorihtm Ullmann Algorithm; VF2 Algorithm QuickSI • Subgraph Search Over a large collection of graphs GraphGrep, gIndex, Closure-Tree, Gcode • Subgraph Search Over a Single Large Graph

Graph Database As an alterative to relational database, graph database utilizes graph as the underlying model, which represents and stores information by nodes and connecting edges.

AEP (Peng Peng et al. SSDBM 2011) • In this paper, we focus on finding all embeddings of Q over a single large graph G.

Classify Edge • We can classify the edges according to their labels. • All of the edges with the same label are organized as one block.

For each block, we can build clustered B+-trees over it to save I/O cost and propose the following bitmap indexing structures.

Edge Join • First, let’s consider the query graph Q1 in our example, we propose Edge Join algorithm to handle it.

AEP-based Query Algorithm • For a complex query, we first decompose Q into a set of subqueries as above. We call the subquery is adjacent edge pair (AEP for short) query, and propose an greedy strategy to find the decomposition of query Q.

AEP-based Query Algorithm

Outline • Background • Index Technique • Query Algorithm • Experiments • Conclusions

Datasets 1) Erdos Renyi Model: This is a classical random graph model. It defines a random graph as N vertices connected by M edges, chosen randomly from the N(N-1)=2 possible edges. In experiments, we vary N from 10K to 100K. The default average degree is set to be 5 and the default number of vertex labels (denoted as |L|) is 250. 2) Yago extracts facts from Wikipedia and integrates them with the WordNet thesaurus. We build a RDF graph, in which There are 368,587 vertices, 543,815 edges and 45,450 vertex labels.

Offline Performance

Online Performance

Conclusions • In order to address subgraph query over a single large data graph G, in this paper, we propose a novel bitmap structure to index G. At run time, we propose a subgraph query algorithm in this paper. Aimed by the bitmap index, we can reduce the search space and improve the query performance significantly. • Extensive experiments over both real and synthetic data sets confirm that our methods outperforms existing ones in both offline and online performances by orders of magnitude.

SMS (WAIM2011) • An example of matching between Q and G • Problem Statement. Given a large data graph G and a query graph Q, the problem is to find all matches of Q in G, where matches are defined in Definition 2.

SMS Algorithm • We introduce a simple yeteffective vertex code for each vertex based on the neighborhood structure. • Definition 3.Vertex Code. Given a vertex u in graph G, its vertex code is defined as C(u) = [L(u), NLS(u) = {[li, (di1, ..., dim)]}], where L(u) is the label of vertex u, liis a kind of label of neighbor vertices ofu, di1, ..., dim is the degree list of u’neighbor vertices with label lianddi1≤ di2≤ ... ≤ dim.

SMS Algorithm • Definition 4. PreSequence. Given two ordered sequences of numbers, S = {s1, s2, s3, . . . sm} andT = {t1, t2, t3, . . . tn}, wheresi (0≤i≤m) andtj (0≤j≤n) are both integers.Sis called a PreSequenceof Tif and only if 1) m≤n; and 2) ∀si ∈S, |{tj | tj≥ si}|≥ i. • Definition 5. Subnode.Given two vertices vand uin query Qand graph G, respectively.vis a subnode of uif and only if • 1) L(v) = L(u); and • 2) ∀li∈NLS(v), ∃lj∈ NLS(u) andli = ljand the degree list {di1, ..., dim } is PreSequence of the degree list {dj1, ..., djn }.

SMS Algorithm Ifvis a subnode ofu, we can also say uis supernode of v. A query graph Q and a database graph G • LVQ = {[1, c, [a, (3); b, (2); c, (2)]]; [2, c, [a, (3); c, (3)]]; [3, a, [b, (2); c, (2,3)]]; [4, b, [a, (3); c, (3)]]} • LVG = {[1, a, [a, (3); b, (3);c, (2,4)]]; [2, c, [a, (4); c, (4)]]; [3, c, [a, (3,4); b, (3); (c, 2)]]; [4, a, [a, (4); b, (2); c, (4)]];[5, b, [a, (3,4)]]; [6, a, [a, (3,4); b, (2,3)]]; [7, b, [a, (4, 4); c, (4)]]}.

SMS Algorithm • A matchsequence betweenQandGis a sequence of matched vertex pairs MS(n)= {(vi1 , uj1) . . . (vin , ujn)}, whereujkis the vertex inG that matches with the vertexvikinQ. • If Qis subgraph isomorphic to G, each subgraph of Qmust be subgraph isomorphic to the corresponding subgraph ofG. This is similar to the Apriori property in the frequent pattern mining. • For each subset vertices of the match sequence, the vertex inQ must be subnode of the corresponding vertex in G.

SMS Algorithm If vis a subnode ofu, we can also say uis supernode of v. A query graph Q and a database graph G 第一步，随机选择一个点（简单地，可以选择度数最高的点），这里选择 u1放入SMQ中，同时查找它的候选顶点（subnodes）选择一个放入SMG中，进行宽度优先搜索。在匹配的过程中，对于Q中得每个点进行verify的策略，匹配好之后，把它的邻居依次放入SMQ中。

SNS 的长度是当前节点的度. 当前节点的每个邻居 v，如果已经在匹配向量中则记相应的位置序号；最后一个元素是尚未匹配好的邻居数目。 v1 v2 v4 v6 v3 SMQ： SNS(v)={1, 1 } u6 u5 SMG： SNS(u)={1, 1 }

除了最后一个元素之外， SNS(v) is a subsequence ofSNS(u) • SNS(v)的最后一个元素 ≤ SNS(u)的最后一个元素只有同时满足这两点时当前节点与其当前候选点可以匹配。 v1 v2 v4 v6 v3 v5 SMQ： SNS(v3)={2, 3, 1} u6 u5 u7 u1 u4 SMG： SNS(u4)={1, 2, 3, 1 }

SMS Algorithm • Lemma 1. For each pair of vertices in each subsequence of a sequenceSof vertices betweenQandG, if the vertex inQis a subnode of the vertex inG, the sequenceSis a match sequence of Q andG.

SMSP Algorithm • Consideringisomorphism test will be costly and inefficient in time and space cost if the size of database graph is very dense and large . • if the size of the graph is small or the graph is sparse, isomorphism test will be much easier.

SMSP Algorithm • Definition 6. Block Codes.GivenablockBofG;itsblockcodeisdefinedasLLB = {l1 ,m1,(d1,d2 . . . dm1); . . . ; ln, mn; (d1, d2 . . . dmn )} wherelis the label, m is the number of vertices with label land diis the degree of ith node which has the label l.

Experiments Evaluate the query performance of our algorithms SMS and SMSP • Synthetic Data Sets • ER and SF both vary from 10K to 100K. And the number of vertex labels of all the database graphs is 250. • Real Data Sets • Human protein interaction network(HPRD), 9,460 vertices, 37, 000 edges and 307 generated vertex labels with the GO term description • Yago (a RDFgraph) 368, 587 vertices, 543, 815 edges and 45, 450 vertex labels

Experiments • Performance of SMS versus |V (G)| (online response)

Experiments Performance VS. |V (G)| in Index Building

Experiments • Performance VS. |V (Q)|Over SF100K

Experiments • Performance VS. |V (Q)|Over HPRD, Yago

Experiments • Performance of SMSP versus partition number.

Syllabus • Introduction • Problem Definition • SMS & SMSP Algorithms • Experiments • Conclusion

Conclusion SMS & SMSP are very effective and efficient

Answer Pattern Match Query In a Large Graph Database Via Graph Embeeding Lei Zou The extended version of paper “Distance-Join: Pattern Match Query In a Large Graph Database” that was presented at VLDB 09

Graph Data (b) Social Network (a) Protein Network (c) RDF Graph

Motivations (Social Network) Query 1: Finding all CFO and CEO who have close relationships (i.e. the relationship distance is no larger than 2.)

Motivations (Social Network) Query 2: Finding all CFO and CEO who have close relationships, where the CEO has a close relationship with another Manager.

Pattern Matching Query Over Graph Data • A subgraph of n vertices {u1,…, un}(in a graph G) is said to match query graph Q that has n vertices {v1,…,vn}, if the following conditions hold: • L(ui)=L(vi), where L(vi) denotes vi’s label. • If there is an edge between viand vjin Q, the shortest path distance between ui and uj is no larger than δ, that is, Distsp(ui, uj) ≤ δ, where δ is a parameter.

Example

Problem Definition Definition 2. (Pattern Match Query). Given a large data graph G, a query graph Q and a parameter δ, pattern match query returns all matches of Q in G.

Ideas The scope of our method is to efficiently perform a series of edge query (i.e. join processing). • In each edge query: • Reduce search space by graph embedding • An efficient pruning strategy to reduce the search space. • Efficient SP-distance computation • 2-hop graph labeling is an efficient SP-distance computation technique.

Framework • At the end of offline processing: • Table T • the converted vector space • Vertex distance label • 2-hop distance vertex label

Graph Embedding (Shahabi et al.@ GeoInformatica) u … … S1 S2

Details about Embedding

Computing distance-aware 2-hop Labeling (Cohen et al.@SIAM J) Selected hops can cover all pair-wise shortest paths u0 u1 2 2 5 5 u3 u2 1 hops 1 1 u4 A(u2) A(u3) u0 u0 u1 u3 u1 3 2 2 1 5 u3 u2 1 1 1 D(u2) u4 D(u3) u2 u4

An Example

Betweenness Estimation-Based Method For 2-Hop Distance-Aware Labeling (Zou et al., VLDB J) “Betweenness” measures the relative importance of a vertex that is needed by others when connecting alnong shortest pathes. • We prefer to select some vertices having large betweenness values as “hops”. • We use the sampling technique to estimate the betweenness for each vertex (Brandes et al. @2007).

Betweenness Estimation-Based Method For 2-Hop Distance-Aware Labeling G Wb Step 1. Finding high betweenness vertices by sampling approach. These vertices are selected as hops Wb. Step 2. Remove all hops in Ab , and partition the remainder into several partitions. The node separators are selected as hops, Ws. G2 G1 Ws Labeling: Local Labeling: paths that are contained in each subgraph; Global Labeling: paths that pass through some boundary vertices.

Betweenness Estimation-Based Method For 2-Hop Distance-Aware Labeling G Branch Stop Step 3. Recoding SP distances between each vertex and each hop in Wb. (running Dijkstra's algorithm from each hop in Wb). G2 Step 4. Running Dijkstra's algorithm from each hop in Ws, but some branches can terminate if they encounters another hop in Wb. (Pruning Rule) G1 Labeling: Local Labeling: paths that are contained in each subgraph; Global Labeling: paths that pass through some boundary vertices.

Subgraph Search Over Large Graph Database