C-Store: RDF Data Management Using Column Stores

Jianlin Feng, School of Software, SUN YAT-SEN UNIVERSITY, Apr. 24, 2009



  1. C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009

  2. What is RDF data? • RDF (Resource Description Framework) • The data model behind the Semantic Web. • The Semantic Web’s vision is to make the Web machine-readable. • Represents data as statements of the form <subject, property, object>. • To represent the notion "The sky has the color blue", use the triple <The sky, has the color, blue>.
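A minimal sketch of the <subject, property, object> model in Python. The `Triple` type, the field names `prop` and `obj`, and the helper `objects_of` are assumptions made for illustration; the second triple is invented sample data and does not come from the slides.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <subject, property, object>."""
    subject: str
    prop: str   # the RDF property (named `prop` to avoid shadowing a builtin)
    obj: str    # the RDF object


triples = [
    Triple("The sky", "has the color", "blue"),   # example from the slide
    Triple("The sea", "has the color", "green"),  # hypothetical extra statement
]


def objects_of(subject: str, prop: str, data: List[Triple]) -> List[str]:
    """Return every object stated for the given subject and property."""
    return [t.obj for t in data if t.subject == subject and t.prop == prop]


print(objects_of("The sky", "has the color", triples))
```

Each statement is an edge from subject to object labeled with the property, which is why a set of triples forms a graph.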

  3. DBFacebook RDF Graph: Triples make the graph

  4. RDF Data Is Proliferating • Swoogle: Semantic Web Search Engine • Indexes about 2,889,974 Semantic Web documents. • The number of triples that could be parsed from all the documents is 699,043,992. • http://swoogle.umbc.edu/ • Simile: MIT Digital Library Data in RDF • More than 50 million triples. • http://simile.mit.edu/

  5. RDF Data Management • Early projects built their own RDF stores. • The trend now is towards storing RDF data in RDBMSs. • This talk examines three approaches for storing RDF data in an RDBMS.

  6. Approach 1: Triple Stores

  7. Approach 2: Property Tables

  8. Approach 3: One-table-per-property (Favors Column Store)

  9. Comparison Results Synopsis • The triple store is very slow on the benchmark with 50M triples. • The property-table and one-table-per-property approaches are a factor of 3 faster. • One-table-per-property on a column store yields another factor of 10.

  10. Querying RDF Data • SPARQL is the dominant language. • Examples: SELECT ?name WHERE { ?x type Person . ?x name ?name } SELECT ?likes ?dislikes WHERE { ?x title "Implementation Techniques for Main Memory Databases" . ?y authorOf ?x . ?y likes ?likes . ?y dislikes ?dislikes }

  11. Translation to SQL over triples is easy

  12. SPARQL → SQL (over triple store) • Query 1 SPARQL: SELECT ?name WHERE { ?x type Person . ?x name ?name } • Query 1 SQL: SELECT B.object FROM triples AS A, triples AS B WHERE A.subject = B.subject AND A.property = 'type' AND A.object = 'Person' AND B.property = 'name'
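Query 1 can be tried against a small in-memory triple table. This is a sketch using Python's standard `sqlite3` module, assuming the three columns are named `subject`, `property`, and `object`; the rows (`ID1`, `Alice`, and so on) are invented sample data, and the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical sample data, not taken from the slides.
        ("ID1", "type", "Person"),
        ("ID1", "name", "Alice"),
        ("ID2", "type", "Book"),
        ("ID2", "name", "Some Title"),
        ("ID3", "type", "Person"),
        ("ID3", "name", "Bob"),
    ],
)

# Self-join of the triple table on subject: one alias per triple pattern.
query1 = """
SELECT B.object
FROM triples AS A, triples AS B
WHERE A.subject = B.subject
  AND A.property = 'type' AND A.object = 'Person'
  AND B.property = 'name'
ORDER BY B.object
"""
names = [row[0] for row in conn.execute(query1)]
print(names)
```

Every additional triple pattern in the SPARQL query adds another alias of `triples` to the join, which is the cost the later slides try to avoid.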

  13. Characteristics of Triple Stores • Accessing multiple properties for a resource requires subject-subject joins. • Path expressions require subject-object joins. • Can improve performance by: • Indexing each column • Dictionary encoding string data • Ultimately: triple stores do not scale
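Dictionary encoding replaces each distinct string with a small integer so that joins compare integers instead of strings. The function below is one simple way to do this, assuming a single shared dictionary for subjects, properties, and objects; the name `dictionary_encode` and the sample rows are illustrative, not taken from the slides.

```python
from typing import Dict, Iterable, List, Tuple

StrTriple = Tuple[str, str, str]
IntTriple = Tuple[int, ...]


def dictionary_encode(
    triples: Iterable[StrTriple],
) -> Tuple[Dict[str, int], List[IntTriple]]:
    """Map every distinct string to an integer id and rewrite the triples."""
    ids: Dict[str, int] = {}
    encoded: List[IntTriple] = []
    for triple in triples:
        # setdefault returns the existing id, or assigns the next free one.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return ids, encoded


ids, encoded = dictionary_encode(
    [
        ("ID1", "type", "Person"),  # hypothetical sample rows
        ("ID1", "name", "Alice"),
    ]
)
print(ids)
print(encoded)
```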

  14. Property Tables Can Reduce Joins

  15. Characteristics of Property Tables • Complex to design • If narrow: reduces nulls, increases unions/joins • If wide: reduces unions/joins, increases nulls • Implemented in Jena and Oracle • But the main representation of the data is still triples
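The wide-table trade-off can be seen in a small sketch, again using `sqlite3`. The table name `person_props`, its columns, and the rows are assumptions chosen for illustration: the name lookup needs no self-join, but a subject without a `likes` value forces a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical wide property table: one row per subject, one column per property.
conn.execute(
    "CREATE TABLE person_props ("
    "subject TEXT PRIMARY KEY, type TEXT, name TEXT, likes TEXT)"
)
conn.executemany(
    "INSERT INTO person_props VALUES (?, ?, ?, ?)",
    [
        ("ID1", "Person", "Alice", "jazz"),
        ("ID2", "Person", "Bob", None),  # no 'likes' statement, so a NULL is stored
    ],
)

# The type and name properties sit in the same row: no join is needed.
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM person_props WHERE type = 'Person' ORDER BY name"
    )
]
# The price of the wide layout: NULLs for missing properties.
null_likes = conn.execute(
    "SELECT COUNT(*) FROM person_props WHERE likes IS NULL"
).fetchone()[0]
print(names, null_likes)
```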

  16. Table-Per-Property Approach • Nulls not stored • Easy to handle multi-valued attributes • Only need to read relevant properties • Still need joins (but they are linear merge joins)
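The points above can be sketched in Python: triples are split into one two-column (subject, object) table per property, each sorted by subject, and two such tables are joined in a single forward pass. The function names and sample data are assumptions for illustration; this is not the C-Store implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (subject, object)


def vertically_partition(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, List[Pair]]:
    """Build one (subject, object) table per property, sorted by subject."""
    tables: Dict[str, List[Pair]] = defaultdict(list)
    for subject, prop, obj in triples:
        tables[prop].append((subject, obj))  # absent properties store nothing
    for rows in tables.values():
        rows.sort()
    return dict(tables)


def merge_join(left: List[Pair], right: List[Pair]) -> List[Tuple[str, str, str]]:
    """Join two subject-sorted tables on subject with one forward pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Multi-valued properties: pair every left row with every right row
            # that shares this subject.
            j_start = j
            while i < len(left) and left[i][0] == ls:
                j = j_start
                while j < len(right) and right[j][0] == ls:
                    out.append((ls, left[i][1], right[j][1]))
                    j += 1
                i += 1
    return out


tables = vertically_partition(
    [
        # Hypothetical sample data.
        ("ID1", "name", "Alice"),
        ("ID2", "name", "Bob"),
        ("ID1", "likes", "jazz"),
        ("ID1", "likes", "tea"),  # multi-valued: just a second row
    ]
)
print(merge_join(tables["name"], tables["likes"]))
```

Because both inputs are already sorted by subject, neither table is re-scanned from the start; the only repeated work is within a group of rows that share a subject.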

  17. Materialized Paths

  18. Accelerating Path Expressions • Materialize common paths • Improved property-table performance by 18-38% • Improved one-table-per-property performance by 75-84% • Use an automatic database designer (e.g., C-Store/Vertica) to decide what to materialize
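Materializing a path means precomputing the result of a subject-object join and storing it as if it were a property of its own. A minimal sketch, assuming two per-property tables of (subject, object) pairs; the function name `materialize_path` and the `authorOf`/`title` sample rows are illustrative only.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (subject, object)


def materialize_path(first: List[Pair], second: List[Pair]) -> List[Pair]:
    """Precompute the two-step path first/second as a new (subject, object) table.

    A row (s, o2) is produced whenever (s, o) is in `first` and (o, o2) is in
    `second`, i.e. the object of the first step is the subject of the second.
    """
    by_subject: Dict[str, List[str]] = {}
    for subject, obj in second:
        by_subject.setdefault(subject, []).append(obj)
    return sorted(
        (subject, obj2)
        for subject, obj in first
        for obj2 in by_subject.get(obj, [])
    )


# Hypothetical sample data: authors -> books, books -> titles.
author_of = [("A1", "B1"), ("A2", "B2")]
title = [("B1", "Title One"), ("B2", "Title Two")]

author_of_title = materialize_path(author_of, title)
print(author_of_title)
```

A query that follows this path can then read `author_of_title` directly instead of performing the subject-object join at query time.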

  19. One-table-per-property → Column-Store • Can think of one-table-per-property as vertically partitioning a super-wide property table. • A column store is a natural storage layer to use for vertical partitioning. • Advantages: • Tuple headers are stored separately. • Column-oriented data compression. • Do not necessarily have to store the subject column. • Carefully optimized merge-join code.
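As one illustration of column-oriented compression, a sorted subject column contains long runs of repeated values and compresses well with run-length encoding. The slides do not say which compression schemes are used, so run-length encoding here is an assumption chosen for its simplicity, and the helper names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(column: List[str]) -> List[Tuple[str, int]]:
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(runs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical subject column of a per-property table, sorted by subject.
subject_column = ["ID1", "ID1", "ID1", "ID2"]
runs = run_length_encode(subject_column)
print(runs)
```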

  20. Library Benchmark • Data • Real library data (50 million RDF triples) • Data acquired from a variety of diverse sources (some quite unstructured). • Queries • Automatically generated from the Longwell RDF browser. • Details in Abadi’s paper.

  21. Results

  22. Future Work • Build a fully functional RDF database that: • Extracts and loads RDF data from structured, semi-structured, and unstructured data sources. • Translates SPARQL to queries over the vertical schema. • Performs reasoning inside the DB. • Use it in biology research.

  23. References • Abadi, Daniel J., Marcus, Adam, Madden, Samuel R., and Hollenbach, Kate. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, 2007. • Abadi, Daniel J., Marcus, Adam, Madden, Samuel R., and Hollenbach, Kate. SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management. In VLDB Journal, 2009.
