
Benchmarking traversal operations over graph databases



Presentation Transcript


  1. Benchmarking traversal operations over graph databases. Marek Ciglan (1), Alex Averbuch (2) and Ladislav Hluchý (1). (1) Institute of Informatics, Slovak Academy of Sciences, Bratislava; (2) Swedish Institute of Computer Science, Stockholm, Sweden. 21 November 2011

  2. Overview
  • Graph data management
  • Graph databases
    • Characteristics
    • Unique features
    • Challenges
  • GDB Benchmarking
    • Motivation
    • Related work
  • Graph traversal benchmark
    • Goals
    • Design
    • Preliminary results

  3. Graph data management
  • Booming area of R&D in recent years
  • Reasons:
    • Increased availability and importance of graph data
    • A natural way of modelling various real-world phenomena (networks: social, information, communication)
  • Two dominant data management directions:
    • Distributed graph processing frameworks
      • Mining/processing of large graphs
      • Pregel and clones (GoldenOrb, Giraph)
    • Graph databases
      • Persistent management of graph data
      • Neo4J, OrientDB, DEX

  4. Graph databases
  • Property graph data model
    • Graph structure
    • Elements have properties
  • [Figure: example property graph; nodes K1-K4 each carry attribute/value properties (Attr I1-I3) and are connected by edges labelled L1, L2, L3]
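To make the property graph model concrete, below is a minimal sketch of building such a graph through the TinkerPop Blueprints API, using the in-memory TinkerGraph reference implementation (Blueprints 2.x package layout assumed); the node keys, edge labels and attribute names simply mirror the figure and are illustrative only.

```java
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class PropertyGraphSketch {
    public static void main(String[] args) {
        // In-memory reference implementation of the Blueprints Graph interface
        Graph graph = new TinkerGraph();

        // Nodes (vertices) carry arbitrary key/value properties
        Vertex k1 = graph.addVertex("K1");
        k1.setProperty("attrI1", "val");
        k1.setProperty("attrI2", "val");

        Vertex k2 = graph.addVertex("K2");
        k2.setProperty("attrI1", "val");

        // Edges capture the relation between two nodes and carry a label
        graph.addEdge(null, k1, k2, "L1");

        graph.shutdown();
    }
}
```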

  5. Graph databases
  • Property graph data model
    • Graph structure
    • Elements have properties
  • Unique feature
    • Graph topology capturing the relations between objects
  • A graph database should be
    • Efficient in exploiting topology
    • Able to perform fast traversals
  • Challenges
    • Traditionally, graph processing/traversal has been done in memory
    • Reasons:
      • Data-driven computation
      • Random access pattern for data access

  6. Graph database benchmarking
  • Motivation
    • A number of emerging graph data management solutions.
    • Which is the right one for a specific problem?
    • Fair measurement of performance for distinct use cases.
    • Identify limits: which use cases have good performance.

  7. Graph database benchmarking
  • Motivation
    • A number of emerging graph data management solutions.
    • Which is the right one for a specific problem?
    • Fair measurement of performance for distinct use cases.
    • Identify limits: which use cases have good performance.
  • Related work
    • Only a few works directly address graph databases
    • D. Dominguez-Sal et al.:
      • Adoption of an HPC benchmark for graph data processing
      • Design of a benchmark suitable for graph database systems
    • GraphBench: a basic benchmarking framework implementation

  8. Graph database benchmarking
  • Motivation
    • A number of emerging graph data management solutions.
    • Which is the right one for a specific problem?
    • Fair measurement of performance for distinct use cases.
    • Identify limits: which use cases have good performance.
  • Traversal operation benchmarking
    • Graph topology: the unique feature of graph databases
    • Test the ability to do:
      • Local traversals (exploring the k-hop neighbourhood)
      • Global traversals (traversals of the whole graph)
      • Traversals in a memory-constrained environment (can we deal efficiently with data sets exceeding physical memory?)
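To illustrate what the local-traversal workload amounts to when written once against the Blueprints API, the sketch below collects the k-hop neighbourhood of a start node with a plain breadth-first search; the method name and the use of Direction.BOTH are my own choices, not taken from the benchmark code.

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class LocalTraversalSketch {
    /** Breadth-first search returning all vertices within k hops of the start vertex. */
    public static List<Vertex> kHopNeighbourhood(Vertex start, int k) {
        Set<Object> visited = new HashSet<Object>();
        List<Vertex> result = new ArrayList<Vertex>();
        Queue<Vertex> frontier = new LinkedList<Vertex>();
        visited.add(start.getId());
        frontier.add(start);

        for (int hop = 0; hop < k && !frontier.isEmpty(); hop++) {
            Queue<Vertex> next = new LinkedList<Vertex>();
            for (Vertex v : frontier) {
                // Expand one hop, following edges in both directions
                for (Vertex neighbour : v.getVertices(Direction.BOTH)) {
                    if (visited.add(neighbour.getId())) {   // true only the first time we see this id
                        result.add(neighbour);
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return result;
    }
}
```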

  9. Benchmark design
  • Fairness
    • Blueprints API: an effort to provide a common graph API (https://github.com/tinkerpop/blueprints/wiki/)
    • Using Blueprints: one implementation of the benchmark for all benchmarked systems
      • Avoids the bias of implementing the benchmark differently for each system
    • Execution of the same sequence of operations on the same data
      • Operations and their parameters are logged in the first run over the defined data
      • Logs are persistent, so benchmarks can be rerun on different versions of a product and the change in performance measured
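A minimal sketch of the log-and-replay idea, assuming a hypothetical plain-text log format with one "OPERATION parameter" entry per line; because it only touches the Blueprints Graph interface, the same replay loop can be pointed at any of the benchmarked backends.

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OperationLogReplay {
    /** Replays a persisted operation log against any Blueprints-compliant graph. */
    public static void replay(Graph graph, String logFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(logFile));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\\s+");
            String operation = parts[0];
            long startTime = System.nanoTime();
            if ("NEIGHBOURHOOD".equals(operation)) {
                // Example operation: visit the direct neighbours of the given vertex
                Vertex v = graph.getVertex(parts[1]);
                for (Vertex neighbour : v.getVertices(Direction.BOTH)) {
                    neighbour.getId();   // touch the element; a real operation would traverse further
                }
            }
            System.out.println(operation + " took " + (System.nanoTime() - startTime) / 1e6 + " ms");
        }
        reader.close();
    }
}
```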

  10. Benchmark design
  • Data
    • Different data properties/distributions affect benchmark results (e.g. dense vs. sparse graphs)
    • Ideally, data set properties should be similar to those of real-world data sets
    • Use: scale-free networks with small-world properties
      • social networks, the Internet, traffic networks, biological networks, and term co-occurrence networks
    • LFR benchmark generator: networks with a power-law degree distribution and implanted communities
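The LFR generator writes the network out as plain files (typically an edge list plus a community assignment). A rough sketch of importing such an edge list through Blueprints is shown below; the whitespace-separated file format, the edge label and the getOrCreate helper are assumptions, not part of the benchmark. Backends that ignore user-supplied vertex IDs need an extra ID-mapping step (see slide 12).

```java
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LfrEdgeListLoader {
    /** Loads a whitespace-separated "sourceId targetId" edge list into any Blueprints graph. */
    public static void load(Graph graph, String edgeListFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(edgeListFile));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] ids = line.trim().split("\\s+");
            if (ids.length < 2) {
                continue;   // skip blank or malformed lines
            }
            Vertex source = getOrCreate(graph, ids[0]);
            Vertex target = getOrCreate(graph, ids[1]);
            graph.addEdge(null, source, target, "related");
        }
        reader.close();
    }

    private static Vertex getOrCreate(Graph graph, String id) {
        Vertex v = graph.getVertex(id);
        return v != null ? v : graph.addVertex(id);
    }
}
```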

  11. Benchmark design
  • Traversal operations
    • Local traversals
      • Compute the local clustering coefficient (2-hop breadth-first traversal)
      • 3-hop breadth-first traversal
    • Global traversals
      • Compute connected components
      • Incoming/outgoing edges
      • k iterations of the HITS algorithm
  • Memory-constrained environment
    • Intermediate results for global traversal operations:
      • Kept in memory
      • Kept as properties on nodes
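For the memory-constrained variant, the intermediate state lives as properties on the nodes instead of in Java collections. The following is a rough sketch of a connected-components pass written in that style (label propagation over a "component" property); the property name and the algorithm choice are mine, not necessarily what the benchmark implements.

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

public class ConnectedComponentsViaProperties {
    /** Computes connected components, keeping all intermediate state as a node property. */
    public static void compute(Graph graph) {
        // Initialise every node's component label with its own id
        for (Vertex v : graph.getVertices()) {
            v.setProperty("component", v.getId().toString());
        }
        // Repeatedly propagate the smallest label seen in a node's neighbourhood
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Vertex v : graph.getVertices()) {
                String label = (String) v.getProperty("component");
                for (Vertex neighbour : v.getVertices(Direction.BOTH)) {
                    String other = (String) neighbour.getProperty("component");
                    if (other.compareTo(label) < 0) {
                        label = other;
                        changed = true;
                    }
                }
                v.setProperty("component", label);
            }
        }
        // On termination, two nodes are in the same component iff they share the same label
    }
}
```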

  12. Benchmark implementation
  • Implemented on top of the Blueprints API
  • Tests performed on:
    • Neo4J
    • DEX
    • OrientDB
    • a native RDF repository (NativeSail)
    • SGDB (research prototype)
  • Challenge: dealing with differences between the underlying systems, e.g.:
    • Triple stores: naming constraints
    • Some implementations do not support properties on some elements
    • Some implementations do not support iteration over nodes/edges
    • Node ID generation: user-provided vs. auto-generated
    • Transaction support vs. no transactions
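One of the listed differences, user-provided versus auto-generated node IDs, can be smoothed over by storing the benchmark's own ID as a vertex property and looking vertices up by that property; the helper below is a hypothetical illustration (the "origId" property name is an assumption), and on large graphs the property should be key-indexed where the backend supports it.

```java
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

public class VertexIdAdapter {
    private static final String ORIG_ID = "origId";

    /** Creates a vertex and remembers the benchmark's own ID as a property. */
    public static Vertex addVertex(Graph graph, String benchmarkId) {
        Vertex v = graph.addVertex(null);        // let the backend assign its internal ID
        v.setProperty(ORIG_ID, benchmarkId);
        return v;
    }

    /** Looks a vertex up by the benchmark ID, independently of the backend's internal IDs. */
    public static Vertex getVertex(Graph graph, String benchmarkId) {
        for (Vertex v : graph.getVertices(ORIG_ID, benchmarkId)) {
            return v;   // the benchmark ID is unique per vertex in this scheme
        }
        return null;
    }
}
```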

  13. Benchmark Runs
  • Performed on older hardware: 2 GB of memory
  • Data set sizes: 1K, 10K, 40K, 50K, 100K, 200K, 400K, 800K, 1M
  • Most systems were not able to load networks with 400K+ edges (constraint: load 10K edges in less than 60 sec.)

  14. Graph loading – element insertion (results chart)

  15. Local traversal – BFS 3 hops (results chart)

  16. Global traversals – connected components (results chart)

  17. Conclusion
  • Extending work on benchmarking graph databases
  • Focusing on graph traversal operations
    • Local/global traversals
  • Preliminary results:
    • Even loading larger data sets into GDBs is a problem
    • Stable performance for local traversals with 2-3 hops
      • Suitable for most ego-centric node property analyses
    • Poor performance for global traversal operations on larger networks

  18. Thank you for your attention. http://ups.savba.sk/~marek/gbench.html

  19. SemSets – activation spreading over a network
