Distributed RDF data store on HBase


Presentation Transcript


  1. Distributed RDF data store on HBase. Project by: Anuj Shetye, Vinay Boddula

  2. Project Overview • Introduction • Motivation • HBase • Our work • Evaluation • Related work • Future work and conclusion

  3. Introduction • As RDF datasets keep growing, RDF graphs become much larger than traditional graphs • The cardinality of vertices and edges is much larger • Large data stores are therefore required for: • fast and efficient querying • scalability

  4. Motivation • Research has been done on mapping RDF datasets onto relational databases, e.g. Virtuoso and Jena SDB • But the dataset is stored centrally, i.e. on a single server • Examples: • Jena SDB maps RDF triples into a relational database – scalability issues • Others store RDF data as a large graph, but on a single node, e.g. Jena TDB – scalability issues

  5. HBase • HBase is an open-source, distributed, sorted-map data store modelled on Google Bigtable.

  6. Contd... • HBase is a NoSQL database • Highly scalable, highly fault tolerant • Fast reads and writes • Dynamic schema: columns can be added on the fly • Integrates with Hadoop and other applications • Column-family-oriented data layout • Max data size: ~1 PB • Read/write throughput: millions of queries per second • Who uses HBase/Bigtable? • Adobe, Facebook, Twitter, Yahoo, Gmail, Google Maps, etc.
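
A minimal, hedged sketch of what the column-family-oriented layout looks like through the HBase Java client (1.x API). The table name rdf_triples is an illustrative assumption; the hasAdvisor, hasPaper and workedFor families anticipate the per-predicate schema described on the following slides.

```java
// Sketch only: create an HBase table whose column families mirror the
// per-predicate layout used later in this project. Table and family
// names are illustrative assumptions, not taken from the slides verbatim.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateRdfTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("rdf_triples"));
            // One column family per predicate (see the data-model slides).
            table.addFamily(new HColumnDescriptor("hasAdvisor"));
            table.addFamily(new HColumnDescriptor("hasPaper"));
            table.addFamily(new HColumnDescriptor("workedFor"));
            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);
            }
        }
    }
}
```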

  7. Hadoop Ecosystem (source: Cloudera)

  8. Our Project • Our project creates a distributed data storage capability for RDF data using HBase • We developed a system that takes the N-Triples file of an RDF graph as input and stores the triples in HBase as key-value pairs using MapReduce jobs • The schema is simple: • one column family per predicate • subjects as row keys • objects as the values
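
A minimal sketch of this schema through the HBase 1.x Java client. The table name rdf_triples and the empty column qualifier are our assumptions; the slides only fix subject → row key, predicate → column family, object → cell value.

```java
// Sketch: store one RDF triple under the schema described above.
// Assumes the table and the per-predicate column families already exist,
// and that predicates are identified by family-safe local names.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleWriter {

    static void storeTriple(Table table, String subject, String predicate, String object)
            throws Exception {
        Put put = new Put(Bytes.toBytes(subject));          // subject   -> row key
        put.addColumn(Bytes.toBytes(predicate),             // predicate -> column family
                      Bytes.toBytes(""),                    // qualifier (assumed empty)
                      Bytes.toBytes(object));               // object    -> cell value
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            // Hypothetical example triple.
            storeTriple(table, "http://example.org/student1", "hasAdvisor",
                        "http://example.org/professor1");
        }
    }
}
```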

  9. System Architecture: an input file is fed to MapReduce jobs, whose mappers write the parsed triples into the HBase data store.
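
A rough sketch of the mapper stage in this diagram, assuming a standard Hadoop MapReduce job whose output is routed to HBase (for example via TableOutputFormat); the slides do not name the exact classes, and the N-Triples parsing below is deliberately naive.

```java
// Sketch of a mapper that turns each N-Triples line into an HBase Put,
// assuming the job is configured with TableOutputFormat (or the
// TableMapReduceUtil helpers) so the Puts land in the triples table.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NTriplesMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Naive split into subject, predicate, object; a real parser would
        // handle literals, blank nodes and escaping.
        String[] parts = line.toString().split("\\s+", 3);
        if (parts.length < 3) {
            return; // skip malformed lines
        }
        String subject = parts[0];
        String predicate = parts[1];
        String object = parts[2].replaceAll("\\s*\\.\\s*$", ""); // drop trailing " ."

        // Assumes a column family for this predicate already exists.
        Put put = new Put(Bytes.toBytes(subject));
        put.addColumn(Bytes.toBytes(predicate), Bytes.toBytes(""), Bytes.toBytes(object));
        context.write(new ImmutableBytesWritable(Bytes.toBytes(subject)), put);
    }
}
```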

  10. Data Model Logical view as ‘Records’

  11. Data Model contd.. Physical model: the hasAdvisor and hasPaper column families

  12. Physical model contd.: the workedFor column family
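
For illustration, a hedged sketch of the read path over this physical model: fetch one subject's row and turn each (column family, value) cell back into a (predicate, object) pair. The table name is an assumption; the hasAdvisor, hasPaper and workedFor families come from the slides.

```java
// Sketch: reconstruct the triples of one subject from its HBase row.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TripleReader {
    public static void main(String[] args) throws Exception {
        String subject = args[0]; // e.g. a student or author URI
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            Result row = table.get(new Get(Bytes.toBytes(subject)));
            for (Cell cell : row.rawCells()) {
                String predicate = Bytes.toString(CellUtil.cloneFamily(cell)); // e.g. hasAdvisor
                String object = Bytes.toString(CellUtil.cloneValue(cell));
                System.out.println(subject + " " + predicate + " " + object);
            }
        }
    }
}
```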

  13. Two major issues that can be solved using HBase • Data insertion • Data updates • Versioning is possible (timestamps) • Bulk loading of data, of two types: • complete bulk load (HBase file formatter, our approach) • incremental bulk load
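
A small sketch of how HBase timestamps provide the versioning mentioned above, assuming the column family is configured to retain multiple versions (VERSIONS > 1); the explicit timestamps 1 and 2 and all names are illustrative.

```java
// Sketch: two writes to the same cell are kept as separate versions,
// distinguished by timestamp, and can be read back together.
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedUpdate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("rdf_triples"))) {
            byte[] row = Bytes.toBytes("http://example.org/student1");
            byte[] family = Bytes.toBytes("hasAdvisor");
            byte[] qualifier = Bytes.toBytes("");

            // Two updates of the same triple; HBase keeps both versions.
            table.put(new Put(row).addColumn(family, qualifier, 1L, Bytes.toBytes("professor1")));
            table.put(new Put(row).addColumn(family, qualifier, 2L, Bytes.toBytes("professor2")));

            // Read back every stored version of the cell.
            Get get = new Get(row).setMaxVersions();
            for (Cell cell : table.get(get).rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```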

  14. Evaluation • We will discuss the evaluation during the demo

  15. Related Work • CumulusRDF: Linked Data Management on Nested Key-Value Stores (SSWS 2011) performs distributed key-value indexing on data stores, using Cassandra as the data store • Apache Cassandra is currently capable of storing RDF data and has an adapter to store data in a distributed management system

  16. Future Work and Conclusion • Our future work lies in developing an efficient interface for SPARQL, since querying with SQL-like tools such as Hive is slower on HBase • The system was tested on a single node, so testing it on multiple nodes would be the ultimate test of efficiency

  17. Questions?
