
Big Graph Search:

Challenges and Techniques

Shuai Ma

Application Scenarios

Software plagiarism detection [1]

  • Traditional plagiarism detection tools may not be applicable to serious software plagiarism problems.
  • A new tool based on graph pattern matching:
  • Represent the source code as program dependence graphs [2].
  • Use graph pattern matching to detect plagiarism.
Application Scenarios

Recommender systems [3]

  • Recommendation has found use in many emerging applications, such as social matching systems.
  • Graph search is a useful tool for recommendations.
  • A headhunter wants to find a biologist (Bio) to help a group of software engineers (SEs) analyze genetic data.
  • To do this, (s)he uses an expertise recommendation network G, where
    • a node denotes a person labeled with expertise, and
    • an edge indicates recommendation, e.g., HR1 recommends Bio1, and AI1 recommends DM1
Application Scenarios

Transport routing [4]

  • Graph search is a common practice in transportation networks, due to the wide application of Location-Based Services.
  • Example: Mark, a driver in the U.S., wants to go from Irvine to Riverside in California.
  • If Mark wants to reach Riverside by car in the shortest time, the problem can be expressed as a shortest path problem. Using existing methods, we can find the shortest path from Irvine, CA to Riverside, CA, traveling along State Route 261.
  • If Mark instead drives a truck delivering hazardous materials, he may not be allowed to cross some bridges or railroad crossings. In this case we can use a pattern graph containing specific route constraints (such as regular expressions) to find the optimal transport routes.
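
The two routing cases above can be sketched in a few lines. This is only an illustration under stated assumptions: the road network, waypoints "A" and "B", travel times and edge labels are invented, and the route constraint is simplified to a set of forbidden edge labels rather than the regular expressions mentioned in the slide.

```python
import heapq

def shortest_path(graph, source, target, forbidden=frozenset()):
    """Dijkstra's algorithm that skips edges whose label is in `forbidden`.

    graph maps node -> list of (neighbor, travel_minutes, edge_label).
    Returns (path, total_minutes), or (None, inf) if no allowed route exists.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, minutes, label in graph.get(u, []):
            if label in forbidden:
                continue  # route constraint: this edge type is not allowed
            nd = d + minutes
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical toy road network (all values invented for illustration).
ROADS = {
    "Irvine": [("A", 20, "highway"), ("B", 30, "highway")],
    "A": [("Riverside", 25, "bridge")],
    "B": [("Riverside", 30, "highway")],
}

car_route = shortest_path(ROADS, "Irvine", "Riverside")
truck_route = shortest_path(ROADS, "Irvine", "Riverside", forbidden={"bridge"})
```

In this toy network the car takes the faster route over the bridge, while the truck is forced onto the slower route that avoids it.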
Application Scenarios

Biological data analysis [5]

  • A large amount of biological data can be represented by graphs, and it is important to analyze biological data with graph search techniques.
  • “Protein-interaction network (PIN) analysis provides valuable insight into an organism’s functional organization and evolutionary behavior.”
  • For example, one can get the topological properties of a PIN formed by high-confidence human protein interactions obtained from various public interaction databases by PIN analysis.

Outline

  • What is graph search?
  • Graph search, why bother?
  • Challenges & related techniques
  • Summary
What is Graph Search?

A unified definition [6] (in the name of graph matching):

  • Given a pattern graph Gp and a data graph G:
    • check whether Gp “matches” G; and
    • identify all “matched” subgraphs.


  • Two classes of queries:
    • Boolean queries (Yes or No)
    • Functional queries, which may use Boolean queries as a subroutine
  • Graphs contain a set of nodes and a set of edges, typically with labels
  • Pattern graphs are typically small (e.g., 10), but data graphs are usually huge (e.g., 10^8)
What is Graph Search?

Different semantics of “match” imply different “types” of graph search, including, but not limited to, the following:

  • Shortest paths/distances [4]
  • Subgraph isomorphism [12]
  • Graph homomorphism and its extensions [10]
  • Graph simulation and its extensions [8, 9]
  • Graph keyword search [7]
  • Neighborhood queries [11]

Graph search is a very “general” concept!

The need for a Social Search Engine
  • File systems - 1960’s: very simple search functionalities
  • Databases - mid 1960’s: SQL language
  • World Wide Web - 1990’s: keyword search engines
  • Social networks - late 1990’s:

Social Networks

Facebook launched “graph search” on 16th January, 2013

Assault on Google, Yelp, and LinkedIn with new graph search;

Yelp was down more than 7%



Graph search is a new paradigm for social computing!

Graph Search vs. RDBMS [13]


Relational database: find the names of all of Alberto Pepe's friends.

Step 1: The index -> the identifier of Alberto Pepe. [O(log₂ n)]

Step 2: The friend.person index -> k friend identifiers. [O(log₂ x) : x << m]

Step 3: The k friend identifiers -> k friend names. [O(k log₂ n)]

Graph Search vs. RDBMS [13]


Graph database: find the names of all of Alberto Pepe's friends.

Step 1: The index -> the vertex with the name Alberto Pepe. [O(log₂ n)]

Step 2: The vertex returned -> the k friend names. [O(k + x)]
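
The contrast between the two lookup strategies can be made concrete with a small sketch. Everything here is an assumption for illustration: sorted Python lists searched with `bisect` stand in for the logarithmic indexes, the `Vertex` class stands in for a graph store with direct adjacency references, and the people other than Alberto Pepe are invented.

```python
import bisect

# Hypothetical toy data: person id -> name, and (person, friend) pairs.
PEOPLE = {1: "Alberto Pepe", 2: "Ann", 3: "Bob", 4: "Cy"}
FRIENDS = [(1, 2), (1, 3), (2, 4)]

# Sorted lists play the role of the indexes in the relational steps.
name_index = sorted((name, pid) for pid, name in PEOPLE.items())
id_index = sorted(PEOPLE.items())
friend_index = sorted(FRIENDS)

def friends_relational(name):
    # Step 1: name index -> person identifier (binary search).
    i = bisect.bisect_left(name_index, (name,))
    if i == len(name_index) or name_index[i][0] != name:
        return []
    pid = name_index[i][1]
    # Step 2: friend.person index -> k friend identifiers.
    j = bisect.bisect_left(friend_index, (pid,))
    friend_ids = []
    while j < len(friend_index) and friend_index[j][0] == pid:
        friend_ids.append(friend_index[j][1])
        j += 1
    # Step 3: one index lookup per friend identifier -> k friend names.
    names = []
    for fid in friend_ids:
        k = bisect.bisect_left(id_index, (fid,))
        names.append(id_index[k][1])
    return names

class Vertex:
    def __init__(self, name):
        self.name = name
        self.friends = []  # direct references to adjacent vertices

vertices = {pid: Vertex(name) for pid, name in PEOPLE.items()}
for person, friend in FRIENDS:
    vertices[person].friends.append(vertices[friend])

def friends_graph(name):
    # Step 1: name index -> the vertex itself.
    i = bisect.bisect_left(name_index, (name,))
    if i == len(name_index) or name_index[i][0] != name:
        return []
    vertex = vertices[name_index[i][1]]
    # Step 2: follow adjacency references; no further index lookups.
    return [f.name for f in vertex.friends]
```

Both functions return the same answer; the difference is that the graph version needs only one index lookup, after which the cost depends on the vertex's neighborhood rather than on the size of the whole dataset.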

Social Search vs. Web Search
  • Keywords only vs. phrases and short sentences
  • (Simple Web) pages vs. entities
  • Lifeless vs. full of life
  • History vs. future

“It’s interesting, and over the last 10 years, people have been trained on how to use search engines more effectively.”

Keywords & Search In 2013: Interview With A. Goodman & M. Wagner

The International Conference on Application of Natural Language to Information Systems (NLDB) has been held since 1995

Interesting Coincidence!

Social computing


Web 2.0

DB people started working on graphs at around the same time!

Social networks are “big data”


  • Volume: 10 × 10^8 users, 2400 × 10^8 photos, 10^4 × 10^8 page visits
  • Velocity: 7.9 new users per second, over 600 thousand per day
  • Variety: text (weibo, blogs), figures, videos, relationships (topology)
  • Value: 1.5 × 10^8 dollars in 2007, 3 × 10^8 dollars in 2008, 6~7 × 10^8 dollars in 2009, 10 × 10^8 dollars in 2010.
  • Further, data are often dirty due to missing and uncertain data [18, 19]
  • The amount of data has reached the order of hundreds of millions.
  • The data are updated all the time, and the daily volume of updates is on the order of hundreds of thousands.
  • As with traditional relational data, the new applications have data quality problems such as uncertain and missing data.

Graph search with high efficiency, striking a balance between its performance and accuracy.

Consider the dynamic changes and timing characteristics of data.

Solve the data quality problems.

Query Approximation

Key idea: for a class Q of queries with a high computational complexity, find another class Q’ of queries that has a lower computational complexity, with bounded quality loss in query answering.




Challenge: balancing the expressive power and computational complexity!

Graph Pattern Matching [17]
  • Given two directed graphs G1 (pattern graph) and G2 (data graph),
    • decide whether G1 “matches” G2 (Boolean queries);
    • identify “subgraphs” of G2 that match G1
  • Matching Semantics
    • Traditional: Subgraph Isomorphism
    • Emerging applications: Graph Simulation and its extensions, etc.


Subgraph Isomorphism


Strong Simulation


Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo. Strong Simulation: Capturing Topology in Graph Pattern Matching. TODS 2014.

Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo. Capturing Topology in Graph Pattern Matching. VLDB 2012.

Subgraph Isomorphism[12]
  • Given a pattern graph Q and a subgraph Gs of a data graph G
    • Q matches Gs if there exists a bijective function f: VQ → VGs such that
      • for each node u in Q, u and f(u) have the same label
      • (u, u′) is an edge in Q if and only if (f(u), f(u′)) is an edge in Gs
    • Q matches G, via subgraph isomorphism, if there is such a subgraph Gs
  • Goodness:
    • Keeps the exact structure topology between Q and Gs
  • Badness:
    • The decision problem is NP-complete
    • May return exponentially many matched subgraphs
    • In certain scenarios, too restrictive to find matches

These hinder its usability in emerging applications, e.g., social networks
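
To make the definition and its cost concrete, here is a deliberately naive sketch that enumerates every injective mapping from pattern nodes to data nodes and keeps those that preserve labels and edges. It is a brute-force illustration, not a practical matcher; the function name, the dict/list graph encoding and the toy graphs are assumptions made for this example.

```python
from itertools import permutations

def subgraph_isomorphisms(q_labels, q_edges, g_labels, g_edges):
    """Return all injective mappings f: pattern node -> data node such that
    labels agree and every pattern edge maps to a data edge.

    The matched subgraph Gs consists of the image nodes and image edges.
    Brute force: tries every ordered selection of |VQ| data nodes.
    """
    q_nodes = list(q_labels)
    g_edge_set = set(g_edges)
    matches = []
    for image in permutations(g_labels, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if any(q_labels[u] != g_labels[f[u]] for u in q_nodes):
            continue  # label mismatch
        if all((f[u], f[v]) in g_edge_set for u, v in q_edges):
            matches.append(f)
    return matches

# Pattern: a single edge from an "A" node to a "B" node.
Q1_LABELS = {"a": "A", "b": "B"}
Q1_EDGES = [("a", "b")]
# Data graph: node 1 is "A"; nodes 2 and 3 are "B".
G1_LABELS = {1: "A", 2: "B", 3: "B"}
G1_EDGES = [(1, 2), (1, 3), (2, 3)]
edge_matches = subgraph_isomorphisms(Q1_LABELS, Q1_EDGES, G1_LABELS, G1_EDGES)

# Pattern: a 2-cycle A <-> B. Data graph: a 4-cycle A -> B -> A -> B -> A.
Q2_LABELS = {"a": "A", "b": "B"}
Q2_EDGES = [("a", "b"), ("b", "a")]
G2_LABELS = {1: "A", 2: "B", 3: "A", 4: "B"}
G2_EDGES = [(1, 2), (2, 3), (3, 4), (4, 1)]
cycle_matches = subgraph_isomorphisms(Q2_LABELS, Q2_EDGES, G2_LABELS, G2_EDGES)
```

The first query yields two matches, one per "B" neighbor of node 1. The second yields none, because the 4-cycle contains no pair of mutually connected nodes; the number of candidate mappings grows exponentially with the pattern size, which is where the NP-completeness bites.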

Graph Simulation [9, 21]
  • Given pattern graph Q(Vq, Eq) and data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if
    • (1) for each (u, v) ∈ R, u and v have the same label; and
    • (2) for each (u, v) ∈ R and each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E such that (u′, v′) ∈ R.
  • Graph G matches pattern Q via graph simulation if there exists a total match relation M, i.e.,
    • for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
    • Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match.
    • Simulation was initially proposed for the analyses of programs; and simulation and its extensions were recently introduced for social networks.

Subgraph isomorphism (NP-complete) vs. graph simulation (O(n^2))!
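
The definition of graph simulation translates almost directly into a fixpoint computation: start from all label-compatible pairs and repeatedly discard pairs that violate the child condition. The sketch below is a naive refinement loop written for readability, not the optimized quadratic-time algorithm of [21]; the function name and graph encoding are assumptions.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Compute the maximum match relation as {pattern node: set of data nodes}.

    Returns None if some pattern node has no match (the relation is not total).
    """
    succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
    # Condition (1): candidates must carry the same label.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Condition (2): v needs a child that matches u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    if any(not candidates for candidates in sim.values()):
        return None
    return sim

# Pattern: a 2-cycle A <-> B.
Q_LABELS = {"a": "A", "b": "B"}
Q_EDGES = [("a", "b"), ("b", "a")]
# Data graph 1: a 4-cycle A -> B -> A -> B -> A.
CYCLE_LABELS = {1: "A", 2: "B", 3: "A", 4: "B"}
CYCLE_EDGES = [(1, 2), (2, 3), (3, 4), (4, 1)]
# Data graph 2: a single edge A -> B.
EDGE_LABELS = {1: "A", 2: "B"}
EDGE_EDGES = [(1, 2)]

cycle_result = graph_simulation(Q_LABELS, Q_EDGES, CYCLE_LABELS, CYCLE_EDGES)
edge_result = graph_simulation(Q_LABELS, Q_EDGES, EDGE_LABELS, EDGE_EDGES)
```

The 2-cycle pattern matches the 4-cycle under simulation even though no subgraph of the 4-cycle is isomorphic to it, which shows both the added flexibility and the loss of topology. The single edge fails because the "B" node has no "A" child.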

Subgraph Isomorphism

Set up a team to develop a new software product

Graph simulation returns F3, F4 and F5;

Subgraph isomorphism returns empty!

Subgraph isomorphism is too strict for emerging applications

Terrorist Collaboration Network

“Those who were trained to fly didn’t know the others. One group of people did not know the other group.” (Osama Bin Laden, 2001)

Strong Simulation[16,17]
  • Subgraph isomorphism
    • Goodness
      • Keep (strong) structure topology
    • Badness
      • May return an exponential number of matched subgraphs
      • Decision problem: NP-complete
      • In certain scenarios, too restrictive to find sensible matches
  • Graph simulation
    • Goodness
      • Solvable in quadratic time
    • Badness
      • Lose structure topology (how much? open question)
      • Only return a single matched subgraph

Balance between complexity and the ability to capture topology!

Strong Simulation


  • Graph simulation loses graph structures


Long cycle

Strong Simulation
  • Duality (dual simulation)
    • Both child and parent relationships
    • Simulation considers only child relationships
  • Locality
    • Restricting matches within a ball
    • When social distance increases, the closeness of relationships decreases and the relationships may become irrelevant
  • The semantics of strong simulation is well defined
    • The results are unique

Strong simulation: bring duality and locality into graph simulation
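
A rough sketch of how duality and locality combine is given below. It is a simplification under stated assumptions, not the algorithm from [16]: the ball radius is passed in as a parameter (the slides do not state it here), the step that extracts the connected component of the match graph around each ball center is omitted, and all names and toy graphs are invented for illustration.

```python
from collections import deque

def dual_simulation(q_labels, q_edges, g_labels, g_edges):
    """Like graph simulation, but checks both child and parent relationships."""
    succ = {v: set() for v in g_labels}
    pred = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
        pred[w].add(v)
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            keep_u = {v for v in sim[u] if succ[v] & sim[u2]}    # child check
            keep_u2 = {v for v in sim[u2] if pred[v] & keep_u}   # parent check
            if keep_u != sim[u] or keep_u2 != sim[u2]:
                sim[u], sim[u2] = keep_u, keep_u2
                changed = True
    return None if any(not s for s in sim.values()) else sim

def ball(g_labels, g_edges, center, radius):
    """Subgraph induced by nodes within `radius` undirected hops of `center`."""
    nbrs = {v: set() for v in g_labels}
    for v, w in g_edges:
        nbrs[v].add(w)
        nbrs[w].add(v)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue
        for w in nbrs[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    labels = {v: g_labels[v] for v in dist}
    edges = [(v, w) for v, w in g_edges if v in dist and w in dist]
    return labels, edges

def local_dual_matches(q_labels, q_edges, g_labels, g_edges, radius):
    """For each data node w, dual simulation restricted to the ball around w."""
    results = {}
    for w in g_labels:
        b_labels, b_edges = ball(g_labels, g_edges, w, radius)
        sim = dual_simulation(q_labels, q_edges, b_labels, b_edges)
        if sim is not None and any(w in s for s in sim.values()):
            results[w] = sim
    return results

Q_LABELS = {"a": "A", "b": "B"}
Q_EDGES = [("a", "b"), ("b", "a")]                 # 2-cycle pattern
LONG_LABELS = {1: "A", 2: "B", 3: "A", 4: "B"}
LONG_EDGES = [(1, 2), (2, 3), (3, 4), (4, 1)]      # long (4-node) cycle
SHORT_LABELS = {1: "A", 2: "B"}
SHORT_EDGES = [(1, 2), (2, 1)]                     # genuine 2-cycle

long_global = dual_simulation(Q_LABELS, Q_EDGES, LONG_LABELS, LONG_EDGES)
long_local = local_dual_matches(Q_LABELS, Q_EDGES, LONG_LABELS, LONG_EDGES, 1)
short_local = local_dual_matches(Q_LABELS, Q_EDGES, SHORT_LABELS, SHORT_EDGES, 1)
```

On the long cycle, dual simulation over the whole graph still matches the 2-cycle pattern, but restricting matches to balls of radius 1 rejects it, while the genuine 2-cycle is still found. That is the effect locality is meant to have on the long-cycle problem.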

Strong Simulation









Topology preservation and bounded matches

Distributed Processing
  • Real-life graphs are typically way too large:
    • Yahoo! web graph: 14 billion nodes
    • Facebook: over 1 billion users
  • Real-life graphs are naturally distributed:
    • Google, Yahoo! and Facebook have large-scale data centers

It is NOT practical to handle large graphs on single machines

Distributed graph processing is inevitable

Distributed Processing

Model of Computation [3]:

  • A cluster of identical machines (with one acting as coordinator);
  • Each machine can directly send an arbitrary number of messages to another one;
  • All machines cooperate through local computations and message passing.

Complexity measures:

  • 1. Visit times: the maximum number of times a machine is visited (interactions)
  • 2. Makespan: the evaluation completion time (efficiency)
  • 3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption)


Shuai Ma, Yang Cao, Jinpeng Huai, and Tianyu Wo. Distributed Graph Pattern Matching. WWW 2012.

Incremental Evaluation


Q(D + Δ)

Q(D) + Q(Δ)
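
As a small illustration of computing the answer on D + Δ from the answer on D, the sketch below maintains single-source reachability under edge insertions. It is an assumed toy example, not one of the algorithms from the cited papers: it handles insertions only (deletions are harder), and the function names and graph are invented.

```python
def reachable_from(adj, source):
    """Batch evaluation: all nodes reachable from `source` (full traversal)."""
    seen = {source}
    stack = [source]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def insert_edge(adj, reached, u, v):
    """Incremental evaluation: update `reached` in place after adding u -> v.

    Only the newly reachable part of the graph is explored.
    Returns the set of nodes added to the answer.
    """
    adj.setdefault(u, set()).add(v)
    if u not in reached or v in reached:
        return set()  # the answer is unaffected by this update
    new = {v}
    stack = [v]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in reached and y not in new:
                new.add(y)
                stack.append(y)
    reached |= new
    return new

# Hypothetical toy graph: s -> a, and a separate edge b -> c.
adj = {"s": {"a"}, "b": {"c"}}
reached = reachable_from(adj, "s")           # old answer on D
delta = insert_edge(adj, reached, "a", "b")  # update: add edge a -> b
recomputed = reachable_from(adj, "s")        # batch answer on D + update
```

The incremental update touches only the nodes that become newly reachable, and it produces the same answer as recomputing from scratch.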


G. Ramalingam, Thomas W. Reps: A Categorized Bibliography on Incremental Computation. POPL 1993: 502-510

Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Yinghui Wu, and Yunpeng Wu. Graph Pattern Matching: From Intractable to Polynomial Time. VLDB 2010

Incremental Evaluation
Google Percolator [20]:

  • Converted the indexing system to an incremental system.
  • Reduced the average document processing latency by a factor of 100.
  • Processes the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

It is a terrible waste to compute everything from scratch!

Data Sampling
  • Instead of dealing with the entire data graphs, it reduces the size of data graphs by sampling and allows a certain loss of precision.
  • In the sampling process, ensure that the sampled data reflect the characteristics and information of the original data graphs as much as possible.




Michael I. Jordan: Divide-and-conquer and statistical inference for big data. KDD 2012: 4

Wenfei Fan, Floris Geerts, Frank Neven: Making Queries Tractable on Big Data with Preprocessing. VLDB 2013

Weiren Yu, Charu Aggarwal, Shuai Ma, and Haixun Wang. On Anomalous Hotspot Discovery in Graph Streams. ICDM 2013

Data Compression
  • Query oriented compression generates smaller graphs from original graphs that preserve the information relevant to a class of queries.
  • Specific compression methods are needed for a given class of queries, e.g., reachability and neighbor queries




Wenfei Fan, Jianzhong Li, Xin Wang, Yinghui Wu: Query preserving graph compression. SIGMOD, 2012

Data Partitioning
  • Partition a data graph into relatively “small” graphs
  • Hashing is a simple approach to random partitioning.
  • There are well-established tools, e.g., METIS.



Q(D1) + … + Q(Dn)
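
A minimal sketch of hash-based partitioning, followed by evaluating a query per fragment and merging the partial answers, might look as follows. The choice of `zlib.crc32` as the hash, the assignment of each edge to the fragment of its source node, and the out-degree query are all assumptions made for illustration; only queries that decompose this way can be merged by simple addition.

```python
import zlib
from collections import Counter

def part_of(node, n_parts):
    """Deterministic hash of a node id to a fragment number."""
    return zlib.crc32(str(node).encode("utf-8")) % n_parts

def hash_partition(edges, n_parts):
    """Assign every edge to the fragment of its source node."""
    fragments = [[] for _ in range(n_parts)]
    for u, v in edges:
        fragments[part_of(u, n_parts)].append((u, v))
    return fragments

def out_degrees(edges):
    """Example query Q: out-degree of every node that has outgoing edges."""
    return Counter(u for u, _ in edges)

# Hypothetical toy graph.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")]

fragments = hash_partition(EDGES, 3)
# Evaluate the query on each fragment, then merge the partial results.
merged = sum((out_degrees(f) for f in fragments), Counter())
```

Random hashing balances fragment sizes but ignores graph structure, so neighbors often land in different fragments; structure-aware partitioners aim to reduce such cut edges.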

G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SISC, 20(1):359–392, 1998.


We have introduced graph search: a new paradigm for social computing

  • We have discussed the history and applications of graph search

We have also briefly discussed the challenges of graph search

We have presented some useful techniques towards solving the problems

A long way to go for big graph search!



  • Charu Aggarwal, Yang Cao, Wenfei Fan, Kaiyu Feng, Jinpeng Huai, Jia Li, Jianxin Li, Jianzhong Li, Xudong Liu, Nan Tang, Haixun Wang, Tianyu Wo, Yinghui Wu, Weiren Yu, …


Email: [email protected]

Address:   Room G1122,

New Main Building,

Beihang University

Beijing, China


[1] Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, GPLAG: detection of software plagiarism by program dependence graph analysis. KDD 2006.

[2] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, 1987.

[3] Shuai Ma, Yang Cao, Jinpeng Huai, and Tianyu Wo, Distributed Graph Pattern Matching, WWW 2012.

[4] Rice, M. and Tsotras, V.J., Graph indexing of road networks for shortest path queries with label restrictions, VLDB 2010.

[5] David A. Bader and Kamesh Madduri, A graph-theoretic analysis of the human protein-interaction network using multicore parallel algorithms. Parallel Computing 2008.

[6] Shuai Ma, Yang Cao, Tianyu Wo, and Jinpeng Huai, Social Networks and Graph Matching. Communications of CCF, 2012 (in Chinese).

[7] C. C. Aggarwal and H. Wang. Managing and Mining Graph Data. Springer, 2010.

[8] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu,  Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE 2011.

[9] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu, Graph Pattern Matching: From Intractable to Polynomial Time. VLDB 2010.

[10] Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu, Graph Homomorphism Revisited for Graph Matching.  VLDB 2010.


[11] Hossein Maserrat and Jian Pei, Neighbor query friendly compression of social networks. KDD 2010.

[12] Brian Gallagher, Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS. 2006.

[13] Marko A. Rodriguez, Peter Neubauer: The Graph Traversal Pattern. Graph Data Management 2011: 29-46

[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[15] Mehdi Kargar, Aijun An: Keyword Search in Graphs: Finding r-cliques. In VLDB Conference, 2011.

[16] Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, and Tianyu Wo, Capturing Topology in Graph Pattern Matching. VLDB 2012.

[17] Wenfei Fan, Graph Pattern Matching Revised for Social Network Analysis. ICDT 2012.

[18] Eytan Adar and Christopher Re, Managing Uncertainty in Social Networks, IEEE Data Eng. Bull., pp.15-22, 30(2), 2007.

[19] Gueorgi Kossinets, Effects of missing data in social networks. Social Networks 28:247-268, 2006.

[20] Daniel Peng, Frank Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010.

[21] Monika Rauch Henzinger, Thomas A. Henzinger, Peter W. Kopke: Computing Simulations on Finite and Infinite Graphs. FOCS 1995:

Short Bio

Dr. Shuai Ma

  • 2006~2010 University of Edinburgh, UK PhD
  • 2001~2004 Peking University, China PhD
  • 2011~ Beihang University, China Full Professor
  • 2012 Microsoft Research, China Visiting Researcher
  • 2008 Bell Labs, USA Summer Consultant
  • 2005~2010 University of Edinburgh, UK Research Fellow