
Querying Big Social Graphs



Presentation Transcript


  1. Querying Big Social Graphs
     • Incremental graph pattern matching
     • Query preserving graph compression
     • Graph pattern matching using views
     • Top-k graph pattern matching
     • Distributed graph pattern matching
     QSX (LN 8)

  2. The complexity of graph pattern matching
     • Recall from the last lecture

  3. Real-life graphs are “big”
     • Graph pattern matching:
       • Input: pattern Q and data graph G
       • Output: M(Q, G), the set of matches of Q in G
     • Example: Facebook has 1B users and 140B links
     • On graphs with millions of nodes and billions of edges? Too costly:
       • NP-complete for subgraph isomorphism
       • cubic-time for bounded simulation
       • quadratic-time for simulation
     • Assuming SSDs (solid state drives) that scan 6 GB/s, how long does even a linear-time O(|G|) pass take?
       • when G is 1 PB (10^15 B): 1.9 days
       • when G is 1 EB (10^18 B): 5.28 years
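
The two scan times on this slide follow from dividing the graph size by the scan rate. A minimal sketch of that arithmetic, assuming 6 GB/s means 6 × 10^9 bytes per second and a year of 365.25 days (both are assumptions; the slide gives only the rounded results):

```python
SSD_BYTES_PER_SECOND = 6e9  # assumed reading of "6G/s" on the slide
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25  # assumed year length


def scan_seconds(graph_bytes: float, rate: float = SSD_BYTES_PER_SECOND) -> float:
    """Time for a single linear pass, i.e. an O(|G|) scan, over the graph."""
    return graph_bytes / rate


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY    # G of 1 PB
eb_years = scan_seconds(1e18) / SECONDS_PER_YEAR  # G of 1 EB

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

The results round to the 1.9 days and 5.28 years quoted on the slide, which is the cost of merely reading G once, before any quadratic or cubic matching work.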

  4. To cope with the sheer size of social graphs
     • Graph pattern matching:
       • Input: pattern Q and data graph G
       • Output: M(Q, G), the set of matches of Q in G
     • How can we query big graphs? The cost of query processing is a function f(|G|, |Q|)
       • Reduce f? We can’t reduce the lower bound of the computation
       • Reduce |Q|? Does not help much: |Q| is small anyway
       • Reduce |G|? Yes! Make big data “small”!
     • Techniques:
       • Incremental graph pattern matching
       • Query preserving graph compression
       • Graph pattern matching using views
       • Top-k graph pattern matching
       • Distributed graph pattern matching

  5. Incremental graph pattern matching

  6. Incremental graph pattern matching
     • Real-life social graphs are dynamic: they constantly change by updates ∆G (e.g., 5% per week in Web graphs)
     • Re-compute M(Q, G⊕∆G) starting from scratch?
     • Changes ∆G are typically small: compute M(Q, G) once, and then incrementally maintain it
     • Incremental graph pattern matching
       • Input: Q, G, M(Q, G) (the old output), ∆G (changes to the input)
       • Output: ∆M (changes to the output) such that M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M, where M(Q, G⊕∆G) is the new output
     • When changes ∆G to the data graph G are small, typically so are the changes ∆M to the output M(Q, G⊕∆G)
     • Minimizing unnecessary recomputation (recall incremental XML publishing)
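
To make the specification M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M concrete, the sketch below computes ∆M the naive way, by running a batch matcher twice and diffing the results. It assumes plain graph simulation (no edge bounds) as the matching semantics, and the pattern, node names and edges are hypothetical, loosely modelled on the CTO/DB/Bio example used later in the slides. It shows what an incremental algorithm must output, not how to compute it efficiently.

```python
def simulation(q_nodes, q_edges, g_nodes, g_edges):
    """Return M(Q, G) under graph simulation as (pattern node, data node) pairs."""
    succ = {}
    for a, b in g_edges:
        succ.setdefault(a, set()).add(b)
    # Start from label-compatible candidates, then prune to a fixpoint.
    sim = {u: {v for v, lab in g_nodes.items() if lab == q_lab}
           for u, q_lab in q_nodes.items()}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v stays a match of u only if some successor of v matches u2.
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u], changed = keep, True
    if not all(sim.values()):
        return set()  # some pattern node has no match, so Q does not match G
    return {(u, v) for u, vs in sim.items() for v in vs}


# Hypothetical pattern Q: CTO -> DB -> Bio (pattern node ids double as labels).
q_nodes = {"CTO": "CTO", "DB": "DB", "Bio": "Bio"}
q_edges = [("CTO", "DB"), ("DB", "Bio")]
# Hypothetical data graph G and a unit update (one edge insertion).
g_nodes = {"Ann": "CTO", "Pat": "DB", "John": "DB", "Bill": "Bio", "Mat": "Bio"}
g_edges = {("Ann", "Pat"), ("Pat", "Bill")}
delta_g = {("John", "Mat")}

old = simulation(q_nodes, q_edges, g_nodes, g_edges)
new = simulation(q_nodes, q_edges, g_nodes, g_edges | delta_g)
added, removed = new - old, old - new  # together these form the change ∆M
print(added, removed)
```

Here the single inserted edge makes John a match of DB, so `added` holds one pair and `removed` is empty: a small ∆G yields a small ∆M, yet the naive approach still pays the full batch cost twice.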

  7. Complexity of incremental problems
     • Incremental graph pattern matching
       • Input: Q, G, M(Q, G), ∆G
       • Output: ∆M such that M(Q, G⊕∆G) = M(Q, G) ⊕ ∆M
     • Incremental algorithms? The cost of a batch algorithm is a function of |G| and |Q|
     • Incremental algorithms: complexity analysis in terms of the size of changes, |CHANGED|, the size of changes in
       • the input: ∆G, and
       • the output: AFF, characterizing ∆M
     • |CHANGED| captures the updating cost that is inherent to the incremental problem itself: the amount of work absolutely necessary to perform for any incremental algorithm
     • Bounded: the cost is expressible as f(|CHANGED|)?
     • Optimal: in O(|CHANGED|)?
     • G. Ramalingam, Thomas W. Reps: On the Computational Complexity of Dynamic Graph Problems. TCS 158(1&2), 1996

  8. The affected area
     • Result graphs: Gr = (Vr, Er) for (bounded) simulation
       • Vr: the nodes in G matching pattern nodes in Q
       • Er: the paths in G matching edges in Q (an edge-path relation)
     • Gr: the result graph of Q in G; Gr’: the result graph of Q in G⊕∆G
     • Affected area (AFF)
       • the difference between Gr and Gr’
       • the size of changes in the output
     • |CHANGED| = |∆G| + |AFF|: the basis for the complexity and boundedness analyses of incremental matching
     [Figure: pattern Q with edge bounds *, 1, 2, 1 and a data graph with nodes John, DB; Ann, CTO; Pat, DB; Mat, Bio; Bill, Bio]
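
As a rough illustration of |CHANGED| = |∆G| + |AFF|, the sketch below builds result graphs for plain simulation (so Er contains edges rather than bounded paths) and measures AFF as the symmetric difference of their node and edge sets. The match relations, node names and the choice to count nodes plus edges are assumptions made for the example; the slide defines AFF only as the difference between Gr and Gr’.

```python
def result_graph(match, q_edges, g_edges):
    """Gr = (Vr, Er): matched data nodes, and data edges that match a pattern edge."""
    vr = {v for _, v in match}
    er = {(v, w) for (v, w) in g_edges
          if any((u, v) in match and (u2, w) in match for (u, u2) in q_edges)}
    return vr, er


def affected_area(gr_old, gr_new):
    """AFF as the symmetric difference between the old and new result graphs."""
    return gr_old[0] ^ gr_new[0], gr_old[1] ^ gr_new[1]


# Hypothetical pattern Q: CTO -> DB -> Bio, and data graph edges before/after ∆G.
q_edges = [("CTO", "DB"), ("DB", "Bio")]
old_edges = {("Ann", "Pat"), ("Pat", "Bill")}
delta_g = {("John", "Mat")}  # one inserted edge
new_edges = old_edges | delta_g

# Hypothetical match relations M(Q, G) and M(Q, G⊕∆G), given as inputs here.
old_match = {("CTO", "Ann"), ("DB", "Pat"), ("Bio", "Bill"), ("Bio", "Mat")}
new_match = old_match | {("DB", "John")}

aff_nodes, aff_edges = affected_area(
    result_graph(old_match, q_edges, old_edges),
    result_graph(new_match, q_edges, new_edges),
)
changed = len(delta_g) + len(aff_nodes) + len(aff_edges)
print(aff_nodes, aff_edges, changed)
```

In this example AFF consists of the node John and the edge (John, Mat), so |CHANGED| is 3 no matter how large the rest of G is.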

  9. Incremental graph pattern matching: an example
     [Figure: pattern Q over CTO, DB and Bio nodes with edge bounds *, 2, 1, 1; data graph G with nodes John, DB; Mat, Bio; Ann, CTO; Bill, Bio; Tom, Bio; Ross, Med; Pat, DB; Don, CTO; updates ∆G that insert edges e1, e2, e3, e4, e5; the affected area; and the result graph Gr over Ann, CTO; John, CTO; Pat, DB; Dan, DB; Bill, Bio; Tom, Bio; Mat, Bio]
     • Comparing the cost of incremental matching with its batch counterpart

  10. Incremental simulation matching
     • Input: Q, G, Msim(Q, G), ∆G
     • Output: ∆M such that Msim(Q, G⊕∆G) = Msim(Q, G) ⊕ ∆M
     • Updates:
       • Unit updates: single edge deletion or insertion
       • Batch updates: a sequence of edge deletions and insertions
     • Boundedness results
       • unbounded even for unit updates and general patterns
       • optimal, in O(|AFF|) time, for
         • single-edge deletions and general patterns
         • single-edge insertions and DAG patterns
     • Outperforms its batch counterpart by 50% for changes up to 10%
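
The single-edge deletion case can be sketched as follows. This is a simplified illustration rather than the algorithm from the cited work, and it makes no claim to the O(|AFF|) bound: after deleting an edge it re-checks only the match pairs whose support may have been lost and propagates removals to predecessors. The function name, the dictionary-based data layout and the example graph are hypothetical, and the sketch maintains per-node candidate sets without applying the convention that the whole match becomes empty when some pattern node loses all its matches.

```python
def delete_edge(sim, q_edges, succ, pred, edge):
    """Update simulation candidate sets in place after deleting one data edge.

    sim: pattern node -> set of matching data nodes (the old match).
    succ / pred: data node -> set of successors / predecessors.
    Returns the set of (pattern node, data node) pairs removed from the match.
    """
    v, w = edge
    succ.get(v, set()).discard(w)
    pred.get(w, set()).discard(v)
    # Only matches of v can lose support directly.
    worklist = [(u, v) for u in sim if v in sim[u]]
    removed = set()
    while worklist:
        u, x = worklist.pop()
        if x not in sim[u]:
            continue
        # x still matches u if every pattern edge (u, u2) has a witness successor.
        if all(succ.get(x, set()) & sim[u2] for (a, u2) in q_edges if a == u):
            continue
        sim[u].discard(x)
        removed.add((u, x))
        # Predecessors of x that relied on x as a witness must be re-checked.
        for p in pred.get(x, set()):
            for a, b in q_edges:
                if b == u and p in sim[a]:
                    worklist.append((a, p))
    return removed


# Hypothetical pattern CTO -> DB -> Bio and data graph.
q_edges = [("CTO", "DB"), ("DB", "Bio")]
succ = {"Ann": {"Pat", "Dan"}, "Pat": {"Bill"}, "Dan": {"Bill"}}
pred = {"Pat": {"Ann"}, "Dan": {"Ann"}, "Bill": {"Pat", "Dan"}}
sim = {"CTO": {"Ann"}, "DB": {"Pat", "Dan"}, "Bio": {"Bill"}}

removed = delete_edge(sim, q_edges, succ, pred, ("Pat", "Bill"))
print(removed, sim)
```

Deleting (Pat, Bill) removes only the pair (DB, Pat); Ann is re-checked but keeps matching CTO because Dan is still a witness, so the work done stays local to the affected pairs.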

  11. Incremental bounded simulation
     • Input: Q, G, Mbsim(Q, G), ∆G
     • Output: ∆M such that Mbsim(Q, G⊕∆G) = Mbsim(Q, G) ⊕ ∆M
     • Boundedness result
       • Negative: unbounded even for unit updates and path patterns
       • Path pattern: a graph pattern consisting of a single path
       • Holds for both simulation and bounded simulation
     • Is it really that bad?

  12. Semi-bounded results
     • Semi-bounded: the cost is a PTIME function f(|CHANGED|, |Q|); |Q| is small
     • Incremental simulation and incremental bounded simulation are both in O(|∆G|(|Q||AFF| + |AFF|^2)) time
       • for batch updates and general patterns
       • independent of |G|
     • Incremental matching via bounded simulation outperforms its batch counterpart by 30% for changes up to 10%

  13. Incremental subgraph isomorphism
     • Incremental matching:
       • Input: Q, G, Miso(Q, G), ∆G
       • Output: ∆M such that Miso(Q, G⊕∆G) = Miso(Q, G) ⊕ ∆M
     • Incremental subgraph isomorphism (decision problem):
       • Input: Q, G, M(Q, G), ∆G
       • Question: whether there exists a subgraph in G⊕∆G that is isomorphic to Q
     • Boundedness and complexity
       • Incremental matching via subgraph isomorphism is unbounded even for unit updates over DAG graphs for path patterns
       • Incremental subgraph isomorphism is NP-complete even when G is fixed
     • Neither bounded nor semi-bounded: not semi-bounded unless P = NP

  14. Query preserving graph compression

  15. Query preserving graph compression
     • The cost of a batch matching algorithm: f(|G|, |Q|). It is unlikely that we can lower its complexity, but can we reduce the size of its parameter |G|?
     • Query preserving compression <R, P> for a class L of queries
       • For any graph G, Gc = R(G) (the compressed graph)
       • For any Q in L, Q(G) = P(Q, Gc) (post-processing)
     • Compress graphs relative to a particular class of queries
     [Figure: R maps G to Gc; Q is evaluated on Gc, and P turns Q(Gc) into Q(G)]

  16. What is new about query preserving compression?
     • Query preserving compression <R, P> for a class L of queries
       • For any graph G, Gc = R(G)
       • For any Q in L, Q(G) = P(Q, Gc)
     • Relative to a class L of queries of users’ choice
       • Better compression ratio: Gc keeps only the information needed by queries in L
       • Reduction: 95% on average for reachability queries (whether a node can reach another)
     • No need to decompress Gc: for any Q in L, Q(Gc) can be directly computed
       • Any algorithms and indexing structures for G can be used for Gc
       • In contrast to lossless compression, no need to restore the original graph G
     • Gc is computed once for all queries Q in L, and incrementally maintained
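
For reachability queries, the pair <R, P> can be illustrated with a deliberately simple stand-in: merge each strongly connected component into one node. This is not the compression scheme behind the 95% figure, which the slide does not spell out; it is a swapped-in technique that happens to preserve reachability. The function names and the example graph are hypothetical, and a node is taken to reach itself.

```python
def reachable(adj, src):
    """All nodes reachable from src (including src) by depth-first search."""
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    return adj


def compress(nodes, edges):
    """R(G): map every node to a representative of its strongly connected component."""
    reach = {v: reachable(adjacency(edges), v) for v in nodes}
    # u and v share a component iff each reaches the other; pick the smallest name.
    rep = {v: min(w for w in reach[v] if v in reach[w]) for v in nodes}
    gc_edges = {(rep[a], rep[b]) for a, b in edges if rep[a] != rep[b]}
    return rep, gc_edges


def query_reach(rep, gc_edges, u, v):
    """Answer 'can u reach v in G?' on Gc only; post-processing P is trivial here."""
    return rep[v] in reachable(adjacency(gc_edges), rep[u])


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]
rep, gc_edges = compress(nodes, edges)
print(rep, gc_edges, query_reach(rep, gc_edges, "a", "d"))
```

The query runs on Gc with the same search routine that would run on G, which mirrors the slide's point that Gc never has to be decompressed. The quadratic-time `compress` is for clarity only.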

  17. Compression for bounded simulation
     • Query preserving compression <R, P> for graph pattern matching
       • R(G) in O(|E| log |V|) time
       • P(Q, Gc): linear time in the size of Q(G)
     • Compression function R( ):
       • the maximum bisimulation relation on the nodes of G
       • an equivalence relation: nodes in Gc denote equivalence classes
     • Post-processing function P( ):
       • makes use of the inverse of R( ): nodes in Q(Gc) are expanded to nodes in their equivalence classes
     • Reduction: 57% on average for graph pattern matching
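
The equivalence classes behind R( ) can be sketched with naive partition refinement: start with one block per label and keep splitting blocks until nodes in the same block have successors in the same set of blocks. This simple loop does not achieve the O(|E| log |V|) bound quoted above. The node names follow the example slide, but the edges are hypothetical because the figure's edges are not recoverable from the transcript.

```python
def bisimulation_classes(labels, edges):
    """Map each node to the id of its class under the maximum bisimulation."""
    succ = {v: set() for v in labels}
    for a, b in edges:
        succ[a].add(b)
    block = dict(labels)  # initial partition: one block per label
    while True:
        # Signature: current block plus the set of blocks of all successors.
        sig = {v: (block[v], frozenset(block[w] for w in succ[v])) for v in labels}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
        # Each round refines the previous partition, so equal counts mean stable.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


# Node names from the example slide; labels and edges are assumed for illustration.
labels = {"fa1": "FA", "fa2": "FA", "fa3": "FA", "msa1": "MSA",
          "bsa1": "BSA", "bsa2": "BSA", "c1": "C", "c2": "C", "c3": "C"}
edges = [("fa1", "msa1"), ("fa2", "bsa1"), ("fa3", "bsa2"),
         ("msa1", "c1"), ("bsa1", "c2"), ("bsa2", "c3")]

classes = bisimulation_classes(labels, edges)
print(classes)
```

With these edges fa2 and fa3 fall into one class while fa1 gets its own, so the nine nodes collapse into five classes; P( ) would expand a matched class back into its member nodes.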

  18. Compression for bounded simulation: example
     • R(G): computes equivalence classes and constructs Gc with them
     • P(Q, Gc): nodes are expanded to the nodes in their equivalence classes
     [Figure: graph G with nodes fa1, fa2, fa3; msa1, msa2; bsa1, bsa2; c1, c2, c3, …, ck, and compressed graph Gc with classes FAr’ (fa1), FAr (fa2, fa3), MSAr (msa1, msa2), BSAr (bsa1, bsa2), Cr and Cr’]

  19. Incremental graph compression
     • Gc is computed once for all queries Q in L: compressed once and incrementally maintained, with no need to decompress Gc
     • Input: G, Gc = R(G), ∆G
     • Output: ∆Gc such that R(G⊕∆G) = R(G) ⊕ ∆Gc
     • Boundedness and complexity
       • unbounded even for unit updates
       • in O(|AFF|^2 + |Gc|) time
     • Subgraph isomorphism?

  20. Graph pattern matching using views

  21. Answering graph queries using views
     • The cost of a matching algorithm: f(|G|, |Q|). Can we compute Q(G) without accessing G, i.e., independent of |G|?
     • Query answering using views: given a query Q in a language L and a set V of views, find another query Q’ such that
       • Q and Q’ are equivalent: for any graph G, Q(G) = Q’(G)
       • Q’ only accesses V(G)
     • View definitions: graph patterns
     • Answering queries on big data:
       • Regardless of how big G is, the cost is “independent” of G: the complexity is no longer a function of |G|
       • V(G) is often much smaller than G (4% to 12% on real-life data)

  22. When can queries be answered using views?
     • Query answering using views: given a query Q in a language L and a set V of views (view definitions), find another query Q’ such that
       • Q and Q’ are equivalent: for any graph G, Q(G) = Q’(G)
       • Q’ only accesses V(G)
     • Can Q be answered using a set V of views?
       • A characterization: a sufficient and necessary condition
       • Containment checking: Q ⊑ V
     • How expensive is it to determine whether Q ⊑ V? Efficient:
       • Quadratic-time in |Q| and |V| for simulation
       • Cubic-time for bounded simulation
       • By contrast, NP-complete for relational conjunctive queries
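
A minimal sketch of containment checking for simulation, under an assumed reading of Q ⊑ V: treat Q itself as a data graph, evaluate every view on it by simulation, and require that each edge of Q is matched by some view edge. The slide only says that a characterization exists, so this condition, the function names and the PM/PRG/DBA patterns below are all assumptions for illustration.

```python
def simulate(p_nodes, p_edges, g_nodes, g_edges):
    """Maximum simulation of pattern p in graph g, or None if p does not match."""
    succ = {}
    for a, b in g_edges:
        succ.setdefault(a, set()).add(b)
    sim = {u: {v for v, lab in g_nodes.items() if lab == p_lab}
           for u, p_lab in p_nodes.items()}
    changed = True
    while changed:
        changed = False
        for u, u2 in p_edges:
            keep = {v for v in sim[u] if succ.get(v, set()) & sim[u2]}
            if keep != sim[u]:
                sim[u], changed = keep, True
    return sim if all(sim.values()) else None


def covered_edges(view, q_nodes, q_edges):
    """Edges of Q matched by some edge of the view when the view is run on Q."""
    v_nodes, v_edges = view
    sim = simulate(v_nodes, v_edges, q_nodes, q_edges)
    if sim is None:
        return set()
    return {(x, y) for a, b in v_edges for x, y in q_edges
            if x in sim[a] and y in sim[b]}


def contained(q_nodes, q_edges, views):
    """Assumed test for Q ⊑ V: the views together cover every edge of Q."""
    covered = set()
    for view in views:
        covered |= covered_edges(view, q_nodes, q_edges)
    return covered == set(q_edges)


# Hypothetical pattern query and views (node id -> label).
q_nodes = {"pm": "PM", "dba": "DBA", "prg": "PRG"}
q_edges = [("pm", "dba"), ("pm", "prg"), ("dba", "prg")]
view1 = ({"a": "PM", "b": "DBA", "c": "PRG"}, [("a", "b"), ("a", "c")])
view2 = ({"x": "DBA", "y": "PRG"}, [("x", "y")])

print(contained(q_nodes, q_edges, [view1, view2]))
print(contained(q_nodes, q_edges, [view1]))
```

The check touches only Q and the view definitions, never a data graph G, which is consistent with the cost being stated in terms of |Q| and |V|.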

  23. Pattern query containment: example
     [Figure: View 1 and View 2 over PM, PRG and DBA nodes, and a pattern query over PM, PRG and DBA nodes, with edge labels e1, e2, e3, e4]
     • It takes 0.5 seconds to check containment of large cyclic patterns

  24. The complexity of query answering
     • Input: pattern Q, a set V of views, and data graph G
     • Output: M(Q, G)
     • Using views: quadratic time O(|V(G)| |Q| + |V(G)|^2), where V(G) is much smaller than G
     • In contrast, matching directly on G (here V and E denote the nodes and edges of G):
       • Graph simulation: O((|V| + |VQ|)(|E| + |EQ|))
       • Bounded simulation: O(|V| |E| + |EQ| |V|^2 + |VQ| |V|)
     • Substantially outperforms traditional matching methods, by 97%

  25. Top-k graph pattern matching

  26. Computing top-k matches
     • Traditional graph pattern matching: compute M(Q, G)
       • It is expensive to compute when G is large
       • The result M(Q, G) is excessively large for users to inspect: it can be larger than G
     • 15% of social queries are to find matches of specific pattern nodes, rather than the entire set M(Q, G), for instance, recommendation
     • Top-k query answering:
       • Input: pattern Q, data graph G and a positive integer k
       • Output: a top-ranked set of k matches of a designated node
     • Early termination: return top-k matches without computing M(Q, G)

  27. Graph pattern matching with output node
     • Input: graph G = (V, E, fA), pattern Q = (VQ, EQ, fv, uo), where uo is the output node
     • Output: Mu(Q, G, uo) = { v | (uo, v) ∈ M(Q, G) }, the matches of the output node
     • Top-k query answering:
       • Input: pattern Q, data graph G and a positive integer k
       • Output: top-k matches in Mu(Q, G, uo)
     • Output: k nodes vs. the entire M(Q, G)
     [Figure: pattern Q with nodes PM (the output node, marked *), PRG, DB and ST; data graph with nodes pm1, pm2, pm3, …, pmn; prg1, prg2, prg3; db1, db2, db3; st1, st2, st3, st4, …, stm; the top-2 matches are marked]

  28. Ranking match results: relevance
     • Top-k query answering:
       • Input: pattern Q, data graph G and a positive integer k
       • Output: top-k matches in Mu(Q, G, uo)
     • Top-k graph pattern matching: social impact
     [Figure: pattern with nodes PM (marked *), PRG, DB and ST; data graph with nodes pm1, pm2, pm3, …, pmn; prg1, prg2, prg3; db1, db2, db3; st1, st2, st3, st4, …, stm; the top-2 relevant matches are marked]

  29. Ranking match results: diversity
     • Top-k query answering:
       • Input: pattern Q, data graph G and a positive integer k
       • Output: top-k matches in Mu(Q, G, uo)
     • Distances between matches: δd(pm1, pm2) = (m+5)/(m+6), δd(pm2, pm3) = 3/(m+2), δd(pm1, pm3) = 1
     • Diversified top-k graph pattern matching: social diversity
     [Figure: pattern with nodes PM (marked *), PRG, DB and ST; data graph with nodes pm1, pm2, pm3, …, pmn; prg1, prg2, prg3; db1, db2, db3; st1, st2, st3, st4, …, stm; the top-2 diversified matches are marked]

  30. The complexity
     • Top-k query answering:
       • Input: pattern Q, data graph G and a positive integer k
       • Output: top-k matches in Mu(Q, G, uo)
     • Relevance alone: quadratic time, O((|V| + |Q|)(|E| + |V|))
     • Diversification based on both relevance and diversity
       • NP-complete (decision problem)
       • APX-hard
       • O((|V| + |Q|)(|E| + |V|)) time with approximation ratio 2
     • Early termination: stop as soon as top-k matches are found, without computing Mu(Q, G, uo)
     • Improves on traditional matching methods by 65%

  31. Distributed graph pattern matching

  32. Distributed graph pattern matching
     • The cost of a batch matching algorithm: f(|G|, |Q|). Reduce the parameter |G|?
     • Divide and conquer
       • partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
       • upon receiving a query Q,
         • evaluate Q(Gi) in parallel, i.e., evaluate Q on the smaller Gi
         • collect partial matches at a coordinator site, and assemble them to find the answer Q(G) in the entire G
     • Social graphs are already geographically distributed
     • Network traffic and response time: independent of |G|

  33. Partial evaluation
     • To compute f(x) ≡ f(s, d), where s is the known part of the input and d is the yet unavailable input:
       • conduct the part of computation that depends only on s
       • generate a partial answer: a residual function
     • Partial evaluation in distributed query processing
       • at each site, treat Gi as the known input and the other fragments Gj as the yet unavailable input
       • evaluate Q(Gi) in parallel
       • collect partial matches (functions) at a coordinator site, and assemble them to find the answer Q(G) in the entire G
     • Partial evaluation: a promising approach (a TDD topic)
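
Partial evaluation is easiest to see on a simpler query than pattern matching, so the sketch below uses reachability ("can s reach t?") as a swapped-in example. Each site explores only its own fragment Gi and returns, per entry node, whether t was found plus the outside nodes it would have to continue from; that pair plays the role of the residual function. The coordinator then assembles these partial answers. All function names and the fragmentation are hypothetical, and the loop over fragments stands in for sites working in parallel.

```python
def partial_answer(frag, local_edges, entry, t):
    """Run at one site: explore Gi only; nodes outside Gi are left unresolved."""
    adj = {}
    for a, b in local_edges:
        adj.setdefault(a, set()).add(b)
    answer = {}
    for x in entry:
        seen, stack = {x}, [x]
        while stack:
            y = stack.pop()
            for z in adj.get(y, ()):
                if z not in seen:
                    seen.add(z)
                    if z in frag:  # keep exploring only inside this fragment
                        stack.append(z)
        # (t found locally?, outside nodes whose answer is still needed)
        answer[x] = (t in seen, {z for z in seen if z not in frag})
    return answer


def assemble(answers, s):
    """Run at the coordinator: chase unresolved nodes through partial answers."""
    seen, stack = {s}, [s]
    while stack:
        hit, frontier = answers[stack.pop()]
        if hit:
            return True
        for y in frontier - seen:
            seen.add(y)
            stack.append(y)
    return False


def distributed_reach(fragments, edges, s, t):
    owner = {v: i for i, frag in enumerate(fragments) for v in frag}
    answers = {}
    for i, frag in enumerate(fragments):  # sequential stand-in for parallel sites
        local_edges = [(a, b) for a, b in edges if owner[a] == i]
        # Entry nodes: targets of edges from other fragments, plus s if local.
        entry = {b for a, b in edges if owner[b] == i and owner[a] != i} | ({s} & frag)
        answers.update(partial_answer(frag, local_edges, entry, t))
    return assemble(answers, s)


fragments = [{"a", "b"}, {"c", "d"}, {"e"}]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(distributed_reach(fragments, edges, "a", "e"))
print(distributed_reach(fragments, edges, "e", "a"))
```

Only the small per-entry summaries travel to the coordinator, not the fragments themselves, which is the intuition behind network traffic that does not grow with |G|.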

  34. Open research issues
     • Querying large social graphs
     • Distributed graph pattern matching
     • Query preserving graph compression
     • Graph pattern matching using views
     • Top-k graph pattern matching
     • Approximate and inexact algorithms
     • . . .
     • Open questions: distributed matching with the same performance guarantees? Subgraph isomorphism? A combination of all these?
     • Many issues need a full treatment

  35. More reading
     • W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching, VLDB, 2014.
     • W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries Using Views, ICDE, 2014.
     • W. Fan, X. Wang, and Y. Wu. Incremental Graph Pattern Matching, TODS 38(3), 2013 (SIGMOD 2011).
     • W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression, SIGMOD, 2012.
     • W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT, 2012 (invited).
     • W. Fan, X. Wang, and Y. Wu. Performance Guarantees for Distributed Reachability Queries, VLDB, 2012.
     • W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries, ICDE, 2011.
     • W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph Pattern Matching: From Intractable to Polynomial Time, VLDB, 2010.
