Distributed RDF data store on HBase


Presentation Transcript


  1. Distributed RDF data store on HBase. Project by: Anuj Shetye, Vinay Boddula

  2. Project Overview • Introduction • Motivation • HBase • Our work • Evaluation • Related work • Future work and conclusion

  3. Introduction • As RDF datasets keep growing, RDF graphs become much larger than traditional graphs • The cardinality of vertices and edges is much larger • Large data stores are therefore required for: • fast and efficient querying • scalability

  4. Motivation • Research has been done on mapping RDF datasets onto relational databases, e.g. Virtuoso and Jena SDB • But the dataset is stored centrally, i.e. on a single server • Examples: • Jena SDB maps RDF triples into a relational database – scalability issues • Others store RDF data as a large graph, but on a single node, e.g. Jena TDB – scalability issues

  5. HBase • HBase is an open-source, distributed, sorted-map data store modelled on Google Bigtable.

  6. Contd... • HBase is a NoSQL database • Highly scalable, highly fault tolerant • Fast reads and writes • Dynamic schema: columns can be added on the fly • Integrates with Hadoop and other applications • Column-family-oriented data layout • Max data size: ~1 PB • Read/write throughput: millions of queries per second • Who uses HBase/Bigtable? • Adobe, Facebook, Twitter, Yahoo, Gmail, Google Maps, etc.
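
A minimal, hedged sketch of what the column-family-oriented layout looks like through the HBase Java client (1.x API). The table name rdf_triples is an illustrative assumption; the hasAdvisor, hasPaper and workedFor families anticipate the per-predicate schema described on the following slides.

```java
// Sketch only: create an HBase table whose column families mirror the
// per-predicate layout used later in this project. Table and family
// names are illustrative assumptions, not taken from the slides verbatim.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateRdfTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("rdf_triples"));
            // One column family per predicate (see the data-model slides).
            table.addFamily(new HColumnDescriptor("hasAdvisor"));
            table.addFamily(new HColumnDescriptor("hasPaper"));
            table.addFamily(new HColumnDescriptor("workedFor"));
            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);
            }
        }
    }
}
```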

  7. Hadoop Ecosystem (source: Cloudera)

  8. Our Project • Our project creates a distributed data storage capability for RDF data using HBase • We developed a system that takes the N-Triples file of an RDF graph as input and stores the triples in HBase as key-value pairs using MapReduce jobs • The schema is simple: • one column family per predicate • subjects as row keys • objects as the values
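
A minimal sketch of this schema through the HBase 1.x Java client. The table name rdf_triples and the empty column qualifier are our assumptions; the slides only fix subject → row key, predicate → column family, object → cell value.

```java
// Sketch: store one RDF triple under the schema described above.
// Assumes the table and the per-predicate column families already exist,
// and that predicates are identified by family-safe local names.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleWriter {

    static void storeTriple(Table table, String subject, String predicate, String object)
            throws Exception {
        Put put = new Put(Bytes.toBytes(subject));          // subject   -> row key
        put.addColumn(Bytes.toBytes(predicate),             // predicate -> column family
                      Bytes.toBytes(""),                    // qualifier (assumed empty)
                      Bytes.toBytes(object));               // object    -> cell value
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            // Hypothetical example triple.
            storeTriple(table, "http://example.org/student1", "hasAdvisor",
                        "http://example.org/professor1");
        }
    }
}
```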

  9. System Architecture: an input file is fed to MapReduce jobs, whose mappers write the parsed triples into the HBase data store.
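
A rough sketch of the mapper stage in this diagram, assuming a standard Hadoop MapReduce job whose output is routed to HBase (for example via TableOutputFormat); the slides do not name the exact classes, and the N-Triples parsing below is deliberately naive.

```java
// Sketch of a mapper that turns each N-Triples line into an HBase Put,
// assuming the job is configured with TableOutputFormat (or the
// TableMapReduceUtil helpers) so the Puts land in the triples table.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NTriplesMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive split into subject, predicate, object; a real parser would
        // handle literals, blank nodes and escaping.
        String[] parts = line.toString().split("\\s+", 3);
        if (parts.length < 3) {
            return; // skip malformed lines
        }
        String subject = parts[0];
        String predicate = parts[1];
        String object = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop trailing " ."

        // Assumes a column family for this predicate already exists.
        Put put = new Put(Bytes.toBytes(subject));
        put.addColumn(Bytes.toBytes(predicate), Bytes.toBytes(""), Bytes.toBytes(object));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(subject)), put);
    }
}
```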

  10. Data Model Logical view as ‘Records’

  11. Data Model contd.. Physical model: the hasAdvisor and hasPaper column families

  12. Physical model contd.: the workedFor column family
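
For illustration, a hedged sketch of the read path over this physical model: fetch one subject's row and turn each (column family, value) cell back into a (predicate, object) pair. The table name is an assumption; the hasAdvisor, hasPaper and workedFor families come from the slides.

```java
// Sketch: reconstruct the triples of one subject from its HBase row.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleReader {
    public static void main(String[] args) throws Exception {
        String subject = args[0]; // e.g. a student or author URI
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            Result row = table.get(new Get(Bytes.toBytes(subject)));
            for (Cell cell : row.rawCells()) {
                String predicate = Bytes.toString(CellUtil.cloneFamily(cell)); // e.g. hasAdvisor
                String object = Bytes.toString(CellUtil.cloneValue(cell));
                System.out.println(subject + " " + predicate + " " + object);
            }
        }
    }
}
```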

  13. Two major issues that can be solved using HBase • Data insertion • Data updates • Versioning is possible (timestamps) • Bulk loading of data, of two types: • complete bulk load (HBase file formatter, our approach) • incremental bulk load
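
A small sketch of how HBase timestamps provide the versioning mentioned above, assuming the column family is configured to retain multiple versions (VERSIONS > 1); the explicit timestamps 1 and 2 and all names are illustrative.

```java
// Sketch: two writes to the same cell are kept as separate versions,
// distinguished by timestamp, and can be read back together.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            byte[] row = Bytes.toBytes("http://example.org/student1");
            byte[] family = Bytes.toBytes("hasAdvisor");
            byte[] qualifier = Bytes.toBytes("");

            // Two updates of the same triple; HBase keeps both versions.
            table.put(new Put(row).addColumn(family, qualifier, 1L, Bytes.toBytes("professor1")));
            table.put(new Put(row).addColumn(family, qualifier, 2L, Bytes.toBytes("professor2")));

            // Read back every stored version of the cell.
            Get get = new Get(row).setMaxVersions();
            for (Cell cell : table.get(get).rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```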

  14. Evaluation • We will discuss the evaluation during the demo

  15. Related Work • CumulusRDF: Linked Data Management on Nested Key-Value Stores (SSWS 2011) performs distributed key-value indexing on data stores, using Cassandra as the data store • Apache Cassandra is currently capable of storing RDF data and has an adapter to store data in a distributed management system

  16. Future Work and Conclusion • Our future work lies in developing an efficient interface for SPARQL, since querying with SQL-like tools such as Hive is slower on HBase • The system was tested on a single node, so testing it on multiple nodes would be the ultimate test of efficiency

  17. Questions?
