
Towards Scalable RDF Graph Analytics on MapReduce

Padmashree Ravindra, Vikas V. Deshpande, Kemafor Anyanwu

{pravind2, vvdeshpa, [email protected]

COUL - semantic COmpUting research Lab


Introduction

  • Growing interest in exploiting RDF data for decision-making

  • Requires support for analytical-style querying

    • More complex than traditional SPJ queries

    • Often includes multiple groupings and / or aggregations

  • Next release of SPARQL expected to include such constructs

E.g.: Sales (cust, prod, price, loc, month, year)

For each prod, count, for each month of 2008, the sales that were between the previous month’s avg sale (prev_avg_sale) and the next month’s avg sale (next_avg_sale) *

* Example from [1]
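The Sales query above can be sketched in plain Python to show why it needs more than one grouping pass. This is only an illustration: the function names are hypothetical, "between" is assumed to mean inclusive bounds in either order, and months whose previous or next month has no sales for the product are skipped.

```python
from collections import defaultdict

def month_index(year, month):
    # Map (year, month) to a running index so neighbouring months differ by 1.
    return year * 12 + (month - 1)

def count_sales_between_neighbor_avgs(sales, year=2008):
    # sales: list of (cust, prod, price, loc, month, year) tuples.
    # Pass 1 (first grouping): average sale per (prod, month).
    totals = defaultdict(lambda: [0.0, 0])
    for _cust, prod, price, _loc, month, yr in sales:
        t = totals[(prod, month_index(yr, month))]
        t[0] += price
        t[1] += 1
    avg = {k: s / n for k, (s, n) in totals.items()}

    # Pass 2 (second grouping): count sales lying between the two neighbour averages.
    counts = defaultdict(int)
    for _cust, prod, price, _loc, month, yr in sales:
        if yr != year:
            continue
        idx = month_index(yr, month)
        prev_avg = avg.get((prod, idx - 1))
        next_avg = avg.get((prod, idx + 1))
        if prev_avg is None or next_avg is None:
            continue  # assumption: skip months without both neighbours
        lo, hi = sorted((prev_avg, next_avg))
        if lo <= price <= hi:
            counts[(prod, month)] += 1
    return dict(counts)
```

The second pass depends on the result of the first, which is the kind of multi-grouping dependency that plain SPJ queries do not express.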


Analytical Query Processing

  • Traditional OLAP techniques

    • Require star / snowflake schema

    • Enterprise-scale

  • But Semantic Web data (RDF)

    • Semi-structured (labeled graphs)

    • Absence of star-like schema

    • Billion triple data sets

Goal : Exploit MapReduce-based frameworks to develop a scalable, cost-effective platform for Semantic Web analytics.


MapReduce-based Data Processing

  • High-level dataflow languages - Pig Latin [3], DryadLINQ [7], HiveQL [14], JAQL [15]

  • Hybrid approach - HadoopDB [5]

  • MapReduce in RDF processing

    • Graph pattern queries [8], [9]

    • Graph closure computation [10]

  • RAPID [6]

    • Succinct expression of complex queries

    • Optimize multiple groupings / aggregations


RDF Data Model

  • Statements (triples)

  • Graph representation

  • Groups = Stars (e.g. Rankings, UserVisits)


Traditional Querying of RDF

  • Graph pattern matching

    • E.g. Get details about all pages visited by particular users between “1979/12/01” and “1979/12/30”

[Figure: SPARQL query and the matching graph pattern]


Example Analytical Query on RDF Data

Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

  • Pattern matching

    • Star sub graphs – Rankings, UserVisits

    • Join between the stars

  • Grouping based on value of srcIP property

  • Aggregation on value of pageRank and adRevenue
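The grouping and aggregation steps of this query can be sketched as follows. The sketch assumes pattern matching has already produced joined rows of the form (srcIP, pageRank, adRevenue); that row shape and the function name are illustrative choices, not part of the original slides.

```python
from collections import defaultdict

def aggregate_by_src_ip(rows):
    # rows: iterable of (srcIP, pageRank, adRevenue) produced by pattern matching.
    acc = defaultdict(lambda: {"rank_sum": 0.0, "n": 0, "revenue": 0.0})
    for src_ip, page_rank, ad_revenue in rows:
        a = acc[src_ip]              # grouping on the srcIP value
        a["rank_sum"] += page_rank
        a["n"] += 1
        a["revenue"] += ad_revenue
    # Aggregation: (average pageRank, total adRevenue) per srcIP.
    return {ip: (a["rank_sum"] / a["n"], a["revenue"]) for ip, a in acc.items()}
```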


Pig : Data Processing

  • Express data processing tasks using high-level query primitives

    • usability, code reuse, automatic optimization

    • Pig Latin data model : atom, tuple, bag (nesting)

    • Operators : LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggr. functions

    • Extensibility support via UDFs

    • Operators compile into MapReduce jobs

Partition REL A using values in age column ($1):

SPLIT A INTO minors IF $1 < 18, majors IF $1 >= 18;

Equijoin on REL A (column 0) and REL B (column 1):

JOIN A BY $0, B BY $1;


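Compiling Pig Latin’s JOIN to MapReduce

JOIN A BY $1, B BY $0;

  • map: annotate the tuples of REL A and REL B based on the join key ($1 for A)

  • Tuples are routed to partitions (P1, P2) by join key

  • reduce: package tuples; Reducer 1 processes P1, Reducer 2 processes P2

[Figure: dataflow of REL A and REL B through the map and reduce phases]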
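The compilation of JOIN A BY $1, B BY $0 can be illustrated with a small single-process simulation of a reduce-side join. This is a sketch of the general technique, not Pig's actual implementation; the function names and the relation tags "A" and "B" are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

def map_phase(rel_a, rel_b):
    # Annotate each tuple with its join key and its source relation.
    for t in rel_a:
        yield t[1], ("A", t)   # join key of A is column 1
    for t in rel_b:
        yield t[0], ("B", t)   # join key of B is column 0

def reduce_side_join(rel_a, rel_b):
    # Shuffle: all tuples with the same join key reach the same reducer.
    groups = defaultdict(lambda: {"A": [], "B": []})
    for key, (tag, t) in map_phase(rel_a, rel_b):
        groups[key][tag].append(t)
    # Reduce: package the tuples per key and emit the cross product A x B.
    joined = []
    for key in sorted(groups):
        bag = groups[key]
        joined.extend(a + b for a, b in product(bag["A"], bag["B"]))
    return joined
```

Keys that occur in only one relation produce an empty cross product, so they drop out of the result as expected for an equijoin.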


Pattern Matching in Pig : Approach 1

Compute each star pattern with self-joins over a single triple store, e.g. the Rankings star for subject R1 (type Ranking, pageRank 11, pageURL url1):

RankingsStarPattern = JOIN triples1 BY Sub, triples2 BY Sub, triples3 BY Sub;

  • Rankings star pattern = 3-way self-join

  • UserVisits star pattern = 5-way self-join

Issues

  • Self-joins on very large relations → high I/O costs

  • Generates meaningless tuples → additional filtering step, e.g.

    (R1, type, Ranking, R1, type, Ranking, R1, type, Ranking)
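A small Python simulation makes the cost of the self-join approach concrete. It is a sketch under simple assumptions (triples held in memory as (Sub, Prop, Obj) tuples; hypothetical function names), not the Pig execution plan.

```python
from collections import defaultdict
from itertools import product

def self_join_on_subject(triples, ways):
    # n-way self-join of the triple relation on the Sub column.
    by_sub = defaultdict(list)
    for t in triples:
        by_sub[t[0]].append(t)
    rows = []
    for sub in sorted(by_sub):
        rows.extend(product(by_sub[sub], repeat=ways))
    return rows

def filter_star(rows, props):
    # Additional filtering step: keep rows whose properties match the star pattern.
    return [r for r in rows if tuple(t[1] for t in r) == tuple(props)]
```

With three triples per Rankings subject, the 3-way self-join yields 27 candidate rows per subject, of which only one is the intended star; the rest are meaningless combinations such as the repeated type triple.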


Approach 2: Vertical Partitioning

LOAD all the RDF triples, then SPLIT them into one relation per property:

Relation     Sub  Prop       Obj
typeRanking  R1   type       Ranking
typeRanking  R2   type       Ranking
pageURL      R1   pageURL    url1
pageURL      R2   pageURL    url2
pageRank     R1   pageRank   11
pageRank     R2   pageRank   27
typeUV       UV1  type       userVisits
typeUV       UV2  type       userVisits
destURL      UV1  destURL    url1
destURL      UV2  destURL    url1
visitDate    UV1  visitDate  1979/12/12
visitDate    UV2  visitDate  1980/02/02
adRev        UV1  adRev      339.08142
adRev        UV2  adRev      330.51248
srcIP        UV1  srcIP      158.112.27.3
srcIP        UV2  srcIP      159.222.21.9

Filter on visitDate, e.g. result:

Sub  Prop       Obj
UV1  visitDate  1979/12/12
UV4  visitDate  1979/12/02

Then:

  • Ranking = JOIN (compute Star Pattern)

  • UserVisits = JOIN (compute Star Pattern)

  • JOIN between Ranking, UserVisits

  • GROUP BY srcIP

  • FOREACH group GENERATE aggregations
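The SPLIT and star-computing JOIN steps can be sketched in Python as below. The sketch assumes each subject has at most one value per property and names the type partitions "type:&lt;class&gt;"; both choices, like the function names, are illustrative.

```python
from collections import defaultdict

def vertical_partition(triples):
    # SPLIT: one partition (subject -> object) per property.
    # Type triples are split further by class, e.g. "type:Ranking".
    parts = defaultdict(dict)
    for sub, prop, obj in triples:
        key = f"type:{obj}" if prop == "type" else prop
        parts[key][sub] = obj
    return parts

def join_star(parts, props):
    # JOIN the property partitions on subject to assemble one star pattern.
    subjects = set(parts[props[0]])
    for p in props[1:]:
        subjects &= set(parts[p])
    return {s: {p: parts[p][s] for p in props} for s in sorted(subjects)}
```

Each star still costs one join per additional property, which is where the intermediate relations mentioned in the issues come from.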


Approach 2: Vertical Partitioning

Issues

  • SPLIT : Concurrent sub flows

  • Risk of disk spills → I/O costs

  • Structure of intermediate relations


Compilation to MapReduce Jobs

  • Step 1 : Pattern Matching

    • map1 / reduce1: FILTER and JOIN for the Rankings star

    • map2 / reduce2: FILTER and JOIN for the UserVisits star

    • map3 / reduce3: JOIN between the stars

  • Step 2 : Grouping

    • map4 / reduce4: GROUP BY

  • Step 3 : Aggregation

    • FOREACH


Our Approach : RAPID+

  • Goal : Minimize I/O costs

  • Strategy:

    • Concurrent computation of star patterns using grouping-based algorithm

    • Can improve efficiency using Operator-coalescing and Look-ahead processing


Concurrent Star Pattern Matching

  • Use grouping-based algorithm on a triple storage model

    • GROUP BY Subject

  • More efficient with prior filtering of irrelevant triples

Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30

[Figure: irrelevant properties are filtered out of the Ranking and UserVisits stars]


Concurrent Star Pattern Matching - 2

Filter irrelevant triples by coalescing LOAD and FILTER operators

  • Using Pig Latin: map1 runs LOAD, then FILTER

  • Our approach: operator coalescing, map1 runs a single loadFilter

input = LOAD '\data' USING loadFilter(pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits);

  • Savings by coalescing:

    • Context switching

    • Parameter passing

    • Multiple handling of same data
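The idea of coalescing LOAD and FILTER can be sketched as a single generator that parses and filters each triple in one pass. This is not the loadFilter UDF from the slides; the whitespace-separated line format and the function signature are assumptions made for the illustration.

```python
def load_filter(lines, props, types):
    # One pass over the input: parse a triple and decide immediately
    # whether it is relevant, instead of loading first and filtering later.
    for line in lines:
        if not line.strip():
            continue
        sub, prop, obj = line.strip().split(None, 2)
        if prop == "type":
            if obj in types:
                yield sub, prop, obj
        elif prop in props:
            yield sub, prop, obj
```

Because irrelevant triples are dropped as they are read, they are never materialized as an intermediate relation and never handled a second time.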


Grouping-based Pattern Matching

starSubgraphs = GROUP input BY $0;

  • GROUP BY Subject

  • BUT heterogeneous bags
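A minimal Python sketch of GROUP BY Subject shows how one grouping pass yields all star sub graphs at once, and why the resulting bags are heterogeneous. It assumes one value per (subject, property) pair; the function name is hypothetical.

```python
from collections import defaultdict

def group_by_subject(triples):
    # One pass groups every triple under its subject; each bag is one star sub graph.
    bags = defaultdict(dict)
    for sub, prop, obj in triples:
        bags[sub][prop] = obj  # assumption: single-valued properties
    return dict(bags)
```

Bags for Rankings subjects and UserVisits subjects end up in the same relation but carry different property sets, so later operators cannot assume a fixed schema.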


Filtering the Groups

BUT all possible sub patterns computed → filter non-matching sub patterns

  • Structure-based filtering

    • eliminate sub graphs with missing properties (e.g. missing srcIP)

  • Value-based filtering

    • validate each sub graph against filter condition (e.g. visitDate between 1979/12/01 and 1979/12/30)
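The two filters can be sketched as predicates over one bag (a dict from property to value). The names are hypothetical, and the date check relies on the assumption that dates are zero-padded YYYY/MM/DD strings, so string comparison matches date order.

```python
def has_structure(bag, required_props):
    # Structure-based filtering: every required property must be present.
    return all(p in bag for p in required_props)

def in_date_range(bag, lo="1979/12/01", hi="1979/12/30"):
    # Value-based filtering on visitDate (zero-padded YYYY/MM/DD strings).
    return lo <= bag["visitDate"] <= hi

def filter_user_visits(bags):
    required = ("type", "destURL", "adRevenue", "srcIP", "visitDate")
    return {
        sub: bag
        for sub, bag in bags.items()
        if has_structure(bag, required) and in_date_range(bag)
    }
```

The structure check runs first, so the value check never touches a bag that lacks visitDate.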


Joining the Stars : Look-ahead Processing

Without look-ahead

  • Star pattern matching cycle

    • map: annotate based on Subject

    • reduce: group by Subject; process each bag (structure-based and value-based filtering)

  • Next cycle (joining the stars)

    • map: process each bag; annotate based on value of join property

    • reduce: join between the star sub graphs

With look-ahead

  • Star pattern matching cycle

    • map: annotate based on Subject

    • reduce: group by Subject; process each bag (structure-based and value-based filtering); annotate based on value of join property

  • Next cycle (joining the stars)

    • map: no repeated processing of the bags

    • reduce: join between the star sub graphs


Example : Look-ahead Processing

Star Pattern Matching → Joining the Stars

  • Structure-based filtering

  • Value-based filtering

  • Look-ahead - annotate bag based on join key

  • Join between the star sub graphs

  • Eliminate properties irrelevant for future processing (join and filter prop)

→ Minimize size of intermediate results
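The look-ahead step can be sketched as a reducer that filters a bag, drops properties that later steps do not need, and emits the result already keyed by the join value. This is an illustration only: the function names are hypothetical, and joining Rankings to UserVisits on pageURL = destURL is an assumption inferred from the sample data.

```python
from collections import defaultdict

def reduce_star(bag, required, join_prop, keep, predicate=lambda b: True):
    # Structure-based, then value-based filtering of one star sub graph.
    if not all(p in bag for p in required) or not predicate(bag):
        return None
    # Look-ahead: project away irrelevant properties and key by the join value,
    # so the next cycle does not have to process the bag again.
    return bag[join_prop], {p: bag[p] for p in keep}

def join_stars(rank_out, visit_out):
    # Next cycle: join the two annotated outputs on their keys.
    by_key = defaultdict(lambda: ([], []))
    for key, rec in rank_out:
        by_key[key][0].append(rec)
    for key, rec in visit_out:
        by_key[key][1].append(rec)
    return [
        {**r, **v}
        for key in sorted(by_key)
        for r in by_key[key][0]
        for v in by_key[key][1]
    ]
```

Only pageRank, srcIP and adRevenue survive into the joined rows here, since the join and filter properties are no longer needed for grouping and aggregation.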


Comparison : Pig vs RAPID+


Case Study

  • Setup: 5-node / 20-node Hadoop clusters on NCSU’s Virtual Computing Lab [13]

  • Dataset: Synthetic benchmark data set [4]

  • Tasks: Baseline case

    • Task A (PM) – basic pattern matching (2 star patterns and a join between the stars)

    • Task B (PM+GA) – pattern matching with grouping and aggregation (two look-ahead processing opportunities)


Experimental Results

[Figures: cost analysis for Task A (PM) and cost analysis for Task B (PM+GA), each on the 5-node cluster]


Experimental Results

Scalability Study

[Figure: 5-node vs 20-node clusters; labels: 2.8GB per node, 1.8GB per node]


Conclusion and Ongoing Work

  • Promising results even for baseline case

  • Further opportunities for improvement

    • First-class operators vs UDFs

    • Exploit combiners during aggregations

    • More efficient data structures for processing bags

    • Further look-ahead optimizations during multiple groupings and aggregations


References

[1] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. “The MD-join: an operator for Complex OLAP”. ICDE 2001, 108–121

[2] J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”. In Proc. of OSDI '04, 2004

[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. “Pig Latin: a not-so-foreign language for data processing”. In Proc. of ACM SIGMOD 2008, pp. 1099–1110

[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. “A Comparison of Approaches to Large-Scale Data Analysis”. In Proc. of SIGMOD 2009

[5] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, and A. Silberschatz. “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads”. VLDB 2009

[6] R. Sridhar, P. Ravindra, and K. Anyanwu. “RAPID: Enabling scalable ad-hoc analytics on the semantic web”. ISWC 2009

[7] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. “DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language”. OSDI 2008

[8] A. Newman, Y. Li, and J. Hunter. “Scalable Semantics – The Silver Lining of Cloud Computing”. IEEE Fourth International Conference on eScience (eScience '08), 2008

[9] A. Newman, J. Hunter, Y-F. Li, C. Bouton, and M. Davis. “A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data”. HCLS'08 at WWW 2008

[10] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. “Scalable Distributed Reasoning using MapReduce”. In Proc. of ISWC '09, 2009

[11] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. “Scalable Semantic Web Data Management Using Vertical Partitioning”. VLDB 2007

[12] E. Prud'hommeaux and A. Seaborne. “SPARQL query language for RDF”. Technical report, World Wide Web Consortium (2005), http://www.w3.org/TR/rdf-sparql-quer

[13] VCL Setup at NC State University, https://vcl.ncsu.edu/

[14] HiveQL, http://hadoop.apache.org/hive/

[15] JAQL, http://code.google.com/p/jaql

[16] RDF, http://www.w3.org/RDF/


Thank You!

