
Towards Data Analytics on Attributed Graphs


Presentation Transcript


  1. Towards Data Analytics on Attributed Graphs NGS QE Oral Presentation Student: Qi Fan Supervisor: Prof. Kian-lee Tan

  2. Outline • Attributed Graph Analytics • Graph Window Query • Graph Window Query Processing • Experiments • Future Works

  3. Outline • Attributed Graph Analytics • Graph Window Query • Graph Window Query Processing • Experiments • Future Works

  4. Data Analytics • Data analytics plays an important part in business [1]: • Web analytics for advertising and recommendation • Customer analytics for market optimization • Portfolio analytics for risk control • Analytics on data yields: • Data products • Data-driven decision support • Insights into the data model [1] Analytics examples: http://en.wikipedia.org/wiki/Analytics

  5. Relational Data Analytics • Tables as the data representation, SQL as the query language • Analytical SQL: • Ranking • Windowing • LAG/LEAD • FIRST/LAST • SKYLINE • TOP-K • …

  6. The Emergence of Large Linked Data • In the real world, linked data is everywhere: • Facebook, LinkedIn, biological networks, phone-call networks, Twitter, etc. • Modeling linked data relationally and querying it with SQL is inefficient: • Graph queries are often traversal-based • SQL-based traversal is 100 times slower than adjacency-list-based traversal [1] • The graph model is a better fit for linked data! [1] http://java.dzone.com/articles/mysql-vs-neo4j-large-scale

  7. Graph Data Model • An attributed graph G = (V, E, A) consists of vertices V, edges E, and attributes A • Graph structure + attribute dimensions; the attributes are kept in a graph attribute table

  8. Graph Data Model • Graph data: • Vertex – entities, e.g. user, webpage, molecule, etc. • Edge – relationships, e.g. follows, cites, depends-on, friend-of, etc. • Attribute – profile information for a vertex/edge • The specific model depends on the data: • Edge – directed / undirected • Attribute – homogeneous / inhomogeneous

  9. Graph Data Model Example • People and friend relationships… • People and follow relationships… • Biomolecules and depends-on relationships… • An attributed graph models a wealth of information

  10. Graph Data Analytics • The graph database ecosystem is growing: • Neo4j, Titan, SPARQL, Pregel, etc. • Graph data analytics is becoming popular: • Graph summarization [1], graph OLAP [2], etc. • In our research, we focus on: • Discovering the need for native graph analytical queries • Processing graph analytical queries efficiently [1] Tian, Y., Hankins, R. A., & Patel, J. M. (2008, June). Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD. [2] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu, "Graph OLAP: Towards online analytical processing on graphs," in Data Mining, 2008. ICDM'08.

  11. Outline • Attributed Graph Analytics • Graph Window Query • Graph Window Query Processing • Experiments • Future Works

  12. SQL Window Query • The window of a tuple contains other tuples related to it (e.g. the window of tuple 7 in the figure) • A SQL window query: • Partitions a table • Sorts each partition • Implicitly forms the window of each tuple

  13. Graph Window Query • In a graph, a vertex can also have a set of related vertices as its window • Aggregation over the window is a personalized analysis for each vertex

  14. Graph Window Examples • Summarize the age distribution of each user's friends • Summarize the activeness of each user's friends • Analyze the industry distribution of a user's potential connections • These queries focus on the neighborhood of each user; thus the neighborhood forms a vertex's window

  15. Graph Window Examples • Find how many enzymes are in each molecule's pathway • Find how many molecules are affected by each enzyme in the pathway • These queries focus on the ancestor-descendant relationships of molecules; thus the ancestors/descendants form a vertex's window

  16. Graph Window Queries • We thus identify two types of graph window queries: • K-hop window (k-window): • A vertex's k-hop window contains all the vertices that are its k-hop neighbors • Topological window (t-window): • A vertex's topological window contains all the vertices that are its ancestors / descendants

  17. Graph Window Queries • K-hop window: • Similar to ego-centric analysis in the network-analysis community • For an undirected graph: • all vertices within k hops of a vertex • For a directed graph: • In-k-hop: vertices that reach the vertex within k hops • Out-k-hop: vertices reached from the vertex within k hops • K-hop: the union of in-k-hop and out-k-hop • T-window: • Requires the graph to be a DAG

  18. Graph Window Queries • Graph Window Query: • INPUT: a specific window (k-hop, topological) and an aggregation function • OUTPUT: aggregated value over each vertex’s window
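
A minimal sketch (in Python) of these query semantics, assuming a toy adjacency-list graph, an undirected 1-hop window, and an average aggregate; the function names and data are illustrative, not the thesis' actual implementation. Evaluating the query this way, one vertex at a time, is essentially the naïve processing discussed later.

```python
def k_hop_window(adj, source, k):
    """All vertices within k hops of `source` (bounded BFS over adjacency lists)."""
    window, frontier = set(), {source}
    for _ in range(k):
        nxt = set()
        for u in frontier:
            for v in adj.get(u, ()):
                if v != source and v not in window:
                    window.add(v)
                    nxt.add(v)
        frontier = nxt
    return window

def graph_window_query(adj, attr, k, agg):
    """INPUT: a k-hop window spec and an aggregate; OUTPUT: one value per vertex."""
    return {v: agg([attr[w] for w in k_hop_window(adj, v, k)]) for v in adj}

# Toy example: average age over each user's 1-hop friends.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
age = {"A": 20, "B": 30, "C": 40, "D": 50}
avg = lambda xs: sum(xs) / len(xs) if xs else None
print(graph_window_query(adj, age, 1, avg))
# e.g. the window of D is {C}, so D's result is 40.0
```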

  19. Outline • Attributed Graph Analytics • Graph Window Query • Graph Window Query Processing • Experiments • Future Works

  20. Related Work • In [1], the EAGr system was proposed to process neighborhood aggregate queries • It focuses on 1-hop neighbors • It uses iterative planning methods to share aggregation results between different vertices' windows • However, it assumes that large intermediate data resides in memory, which is not reasonable for k-windows and t-windows [1] J. Mondal and A. Deshpande, "EAGr: Supporting continuous ego-centric aggregate queries over large dynamic graphs," SIGMOD, 2014.

  21. Graph Window Query Processing • Naïve Processing I: • Compute each vertex's window sequentially • Aggregate each vertex's window individually • Advantage: • No large intermediate data is generated • Inefficiencies: • Repeated computation of every vertex's window: • a k-window requires a bounded traversal of the graph for every vertex • a t-window requires a reachability computation for every vertex • Slow individual aggregation: • each vertex may have a window of size up to O(|V|) • the total aggregation cost can therefore reach O(|V|²)

  22. Graph Window Query Processing • Naïve Processing II: • Materialize each vertex's window • At query time, aggregate each vertex's window individually • Advantage: • No computation of windows at run time • Inefficiencies: • Materialization is not memory efficient: • all the vertices' windows together can be as large as O(|V|²) • Query processing is still as slow as in Naïve Processing I

  23. Overview of Our Approach • Two index schemes: • Dense Block Index: for general windows and k-hop windows • Parent Index: for topological windows • The indexes achieve: • Complete preservation of the window information of each vertex • Space efficiency • Efficient run-time query processing

  24. Dense Block Index – Matrix View • Window matrix: • Records the vertex-window mapping • Rows represent vertices • Columns represent windows

  25. Dense Block Index – Matrix View • Window matrix properties: • It is a Boolean matrix • It completely preserves the vertex-window information • Equivalent matrices: • Row and column permutations can be applied to the window matrix • Invariant: the number of non-zero elements

  26. Dense Block Index – Matrix View • Window-matrix based aggregation: • Similar to Naïve Processing II • Traverse the matrix column by column • Aggregate the cells with value one; ignore the cells with value zero • Space and query complexity: • proportional to the number of non-zero elements in sparse matrix format • O(|V|²) in dense matrix format • Note that the number of non-zero elements can itself be as large as O(|V|²)
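
A minimal sketch of this column-wise aggregation over a toy Boolean window matrix, assuming a SUM aggregate; the vertices, attribute values, and matrix contents are illustrative.

```python
vertices = ["A", "B", "C", "D"]
value = {"A": 1, "B": 2, "C": 3, "D": 4}

# window_matrix[i][j] == 1 iff vertex i belongs to the window of vertex j.
window_matrix = [
    [0, 1, 1, 0],   # A belongs to the windows of B and C
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# Traverse the matrix column by column; aggregate only the non-zero cells.
result = {}
for j, w in enumerate(vertices):
    result[w] = sum(value[vertices[i]]
                    for i in range(len(vertices))
                    if window_matrix[i][j] == 1)
print(result)   # e.g. the window of A contains B and C, so its SUM is 2 + 3 = 5
```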

  27. Dense Block Index • Dense blocks: • Given a matrix, a dense block is a submatrix whose values are all non-zero • Storage: keep only the row ids and column ids, e.g. ({A,B}, {A,B,C}) rather than 6 matrix elements • Query: compute A+B once, then share the result across the windows A, B, and C • Properties of a dense block with r rows and c columns: • Space complexity: O(r + c), compared to O(r · c) • Query complexity: O(r + c), compared to O(r · c) • Space and query have the same asymptotic bounds, so both can be optimized simultaneously

  28. Dense Block Index • Dense Block Index: • For every window to be computed, index all the dense blocks of the window matrix • The index forms a bipartite graph

  29. Dense Block Index • Properties: • Preserves every non-zero entry of window matrix • During query, no need to access original window matrix • Query Processing: • compute partial aggregates for each dense block • compute final aggregates for every window
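
A minimal sketch of this two-phase evaluation, assuming a SUM aggregate and two hand-picked dense blocks; the block contents are illustrative, not mined from a real matrix.

```python
value = {"A": 1, "B": 2, "C": 3, "D": 4}

# Each dense block: (vertices forming the rows, windows forming the columns).
dense_blocks = [
    ({"A", "B"}, {"A", "B", "C"}),   # rows A, B appear in the windows of A, B and C
    ({"D"},      {"C"}),             # row D appears only in the window of C
]

# Phase 1: one partial aggregate per dense block.
partial = [sum(value[v] for v in rows) for rows, _ in dense_blocks]

# Phase 2: combine the partial aggregates into the final per-window results.
result = {}
for (rows, cols), p in zip(dense_blocks, partial):
    for w in cols:
        result[w] = result.get(w, 0) + p
print(result)   # the window of C aggregates A, B and D: 1 + 2 + 4 = 7
```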

  30. Dense Block Index Query Processing • Example: summarize the activeness of each user's friends, computed on graph G over the 1-hop window

  31. Dense Block Index • Equivalent matrices may have different optimal partitions • Goal: find the best dense-block partition over all equivalent matrices • Fixed-size dense-block partitioning is NP-hard [1] • Heuristics need to be applied [1] V. Vassilevska and A. Pinar, "Finding nonoverlapping dense blocks of a sparse matrix," Lawrence Berkeley National Laboratory, 2004.

  32. MinHash Clustering for DBI • Heuristic: • Classify similar windows together, then mine the dense blocks within each cluster • Clustering + mining • Clustering: • The Jaccard coefficient is used to measure the similarity between windows, since each window is a set of vertices • MinHash is an efficient way to perform Jaccard-coefficient based clustering

  33. MinHash Clustering for DBI • Mining: • Build a partial window matrix for each cluster • Condense the rows with identical values • For the uncondensed rows, recursively cluster + mine until a stopping condition is reached
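
A minimal sketch of the clustering step, assuming CRC32-based hash functions and toy windows (both are illustrative): windows whose MinHash signatures match are grouped into the same cluster, and dense blocks would then be mined within each cluster.

```python
import random
import zlib

def minhash_signature(window, masks):
    """One minimum per hash function: h_i(v) = crc32(v) XOR mask_i."""
    return tuple(min(zlib.crc32(v.encode()) ^ m for v in window) for m in masks)

random.seed(7)
masks = [random.getrandbits(32) for _ in range(2)]

windows = {
    "A": {"B", "C", "D"},
    "B": {"B", "C", "D", "E"},   # high Jaccard similarity with A's window
    "C": {"X", "Y"},             # dissimilar window
}

# Per hash function, two windows collide with probability equal to their
# Jaccard coefficient, so similar windows tend to share a signature.
clusters = {}
for w, members in windows.items():
    sig = minhash_signature(members, masks)
    clusters.setdefault(sig, []).append(w)
print(list(clusters.values()))
```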

  34. MinHash Clustering for DBI • (Figure: MinHash clustering with recursive clustering of the split rows; outputs are the condensed blocks)

  35. MinHash Clustering for DBI • DBI generation can be summarized in the following steps: • Clustering step: • MinHash each vertex based on its window • Mining step: • Generate the partial matrix for each window • Group identical rows • Recursively cluster • Bottlenecks: the MinHash cost and the window-generation cost (for both k-windows and t-windows) are too high in practice

  36. Estimated MinHash Clustering • For k-hop windows, we developed an estimation scheme to speed up index creation • The observation is that as the hop count grows, the overlap between vertices' windows also grows • Thus we can use lower-hop window information in the clustering phase

  37. Comparison • MinHash Clustering: • Clustering step: MinHash each vertex based on its window • Mining step: generate the partial matrix for each window, group identical rows, recursively cluster • Estimated Clustering: • Clustering step: MinHash each vertex based on its lower-hop window • Mining step: generate the partial matrix for each window, group identical rows, recursively cluster • The estimation reduces the indexing time because: • a lower-hop window has fewer elements, so MinHash is faster • generating lower-hop windows requires less time

  38. Topological Window Processing • The Dense Block Index can be used for topological windows as well • However, a more efficient index exists for t-window queries • Containment relationship in t-windows: • If u is an ancestor of v, then W(u) ⊆ W(v) • Thus, when computing the window of v, u's result can be reused directly

  39. Parent Index • Given W(u) ⊆ W(v), in order to use W(u) when computing W(v), we need to materialize the difference between W(v) and W(u) • For a given v, the vertex with the smallest difference must be one of v's parents • Thus, for each vertex, we index only the parent with the smallest difference

  40. Parent Index • A parent index is a lookup table with three fields: • Vertex: the index entry • Parent: the id of the closest parent • Diff: the vertices that are in Vertex's window but not in Parent's window

  41. Parent Index based Query Processing • Process each vertex's window in topological order • Use the relation W(v) = W(Parent) ∪ Diff, so the aggregate of W(v) is obtained by combining the parent's aggregate with the aggregate over Diff • The topological order ensures that when a vertex is processed, its parents' results are already available
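
A minimal sketch of this processing order for an ancestor window with a SUM aggregate; the DAG, vertex values, and parent-index entries below are hand-built assumptions for illustration.

```python
# Implied DAG: A -> B, A -> C, B -> D, C -> D; W(v) is v's ancestor window.
value = {"A": 1, "B": 2, "C": 3, "D": 4}
topo_order = ["A", "B", "C", "D"]           # any topological order of the DAG

# Parent index: vertex -> (closest parent, vertices in W(v) \ W(parent)).
parent_index = {
    "A": (None, set()),                     # source vertex: empty ancestor window
    "B": ("A", {"A"}),
    "C": ("A", {"A"}),
    "D": ("C", {"B", "C"}),                 # reuse W(C), add only the missing part
}

# Process vertices in topological order, so a parent's aggregate is ready
# before its children are processed:  agg(W(v)) = agg(W(parent)) + agg(Diff).
agg = {}
for v in topo_order:
    parent, diff = parent_index[v]
    agg[v] = (agg[parent] if parent is not None else 0) + sum(value[u] for u in diff)
print(agg)   # {'A': 0, 'B': 1, 'C': 1, 'D': 6}
```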

  42. Parent Index Creation • Efficient creation based on a topological scan: • During the scan, each vertex passes its current ancestor information to its children • On receiving a parent's ancestor information, a child unions those ancestors into its own window • Once a child has received all of its parents' information, it records the parent with the smallest difference
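
A minimal sketch of this topological scan, again on a hand-built DAG; `graphlib.TopologicalSorter` (Python 3.9+) supplies the topological order, and the set operations below are one assumed way to realize the scan, not the thesis' implementation.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}   # vertex -> parents

ancestors = {}       # vertex -> its full ancestor window W(v)
parent_index = {}    # vertex -> (closest parent, W(v) \ W(parent))

for v in TopologicalSorter(dag).static_order():
    parents = dag[v]
    # Union the ancestor information passed down from all parents.
    window = set()
    for p in parents:
        window |= ancestors[p] | {p}
    ancestors[v] = window
    # Record the parent whose ancestor window leaves the smallest difference.
    if parents:
        best = min(parents, key=lambda p: len(window - ancestors[p]))
        parent_index[v] = (best, window - ancestors[best])
    else:
        parent_index[v] = (None, set())

print(parent_index)   # D reuses one parent's ancestor window plus a small difference
```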

  43. Outline • Attributed Graph Analytics • Graph Window Query • Graph Window Query Processing • Experiments • Future Works

  44. Experiments • Machine: 2.27 GHz CPU with 32 GB of memory • Synthetic data: • the SNAP [1] generator for directed graphs • the DAGGER [2] generator for DAGs [1] Stanford Network Analysis Platform, http://snap.stanford.edu/snap/index.html [2] H. Yildirim, V. Chaoji, and M. J. Zaki, "Dagger: A scalable index for reachability queries in large dynamic graphs," arXiv preprint arXiv:1301.0977, 2013.

  45. Comparing Algorithms • K-hop window: • MA: materialize-ahead algorithm (materialize the vertex-window mapping, aggregate individually) • KBBFS: bounded BFS for computing the window of each vertex • MC: MinHash Clustering • EMC: Estimated MinHash Clustering • Topological window: • MA • DBI: Dense Block Index • TS: topological scan to compute the window of each vertex • PI: Parent Index

  46. Effectiveness of Estimation (figures for hop = 1, 2, 3, 4)

  47. Benefit of Estimation (figures for degree = 40 and degree = 160)

  48. Index Size of MC and EMC (degree = 40)

  49. Scalability of EMC (V = 100k, hop = 1 and hop = 2)

  50. Effectiveness of PI (V = 10k)
