Pig Latin & RDF Data Analysis

Pig Latin & RDF Data Analysis The MD-join: An Operator for Complex OLAP, ICDE 2001 MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004 Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD 2008 RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web, ISWC 2009 Towards Scalable RDF Graph Analytics on MapReduce, MDAC 2010 2010-08-20 presented by Jaeseok Myung Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

MapReduce Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

MapReduce • MapReduce: Simplified Data Processing on Large Cluster, 2004 • Map: (k1, v1) -> list (k2, v2) • Reduce: (k2, list (v2)) -> (k3, v3) JTeam, http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/ Center for E-Business Technology

Programming with MapReduce map (String key, String value): // key : document name // value : document contents foreach word w in value: EmitIntermediate (w, “1”); reduce (String key, Iterator values): // key : a word // values : a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(key, AsString(result)); Center for E-Business Technology

Why Is MapReduce Useful? • Scalability • Capable of handling large amount of data • Extensibility • Commodity machines • Shared-nothing architecture • Fault-tolerance • Writing intermediate results into the physical storage • Flexibility • Enabling users to write custom codes Center for E-Business Technology

Pig Latin Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

Motivation • SQL wrench programmers away from their preferred method of analyzing data, namely writing imperative scripts or code, toward writing declarative queries in SQL, which they often find unnatural, and overly restrictive • MapReduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse • We have developed a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and low-level, procedural style of MapReduce Center for E-Business Technology

Pig Latin - Example Center for E-Business Technology

Pig Latin • Data Model • Atom, Tuple, Bag, Map • Language • Expressions • Operators -> MapReduce • LOAD, STORE, SPLIT • FOREACH, FILTER • COGROUP, GROUP, JOIN • Aggregate Functions • UDF (User Defined Function) Center for E-Business Technology

Data Model • Atom • An atom contains a simple atomic value such as a string or a number, e.g., ‘alice’ • Tuple • A tuple is a sequence of fields, each of which can be any of the data types, e.g., (‘alice’, ‘lakers’) • Bag • A bag is a collection of tuples with possible duplicates • Map • A map is a collection of data items, where each item has an associated key through which it can be looked up Center for E-Business Technology

Expressions in Pig Latin Center for E-Business Technology

Compiling Pig Latin’s JOIN to MapReduce Equijoin on REL A (column 1) and REL B (column 0) JOIN A by $1, B by $0; REL A REL B JOIN A by $1, B by $0; map P1 Annotate based on (join key) P1 P2 P2 P1 reduce Package tuples Reducer 1 P1 Reducer 2 P2 Center for E-Business Technology

RDF Data Analysis Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

RDF Data Model Graph representation Statements (triples) Rankings Groups = Stars UserVisits Center for E-Business Technology

Basic Steps for Analytical Query Processing • Compute the average pageRank and total adRevenue for all pages visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30 Rankings UserVisits map1 map2 FILTER FILTER JOIN JOIN reduce1 reduce2 map3 JOIN reduce3 map4 GROUP BY reduce4 FOREACH Step 1 : Pattern Matching Step 2 : Grouping Step 3 : Aggregation

Positioning • RAPID+ • RDF Graph Pattern Matching • RAPID • Grouping & Aggregation Center for E-Business Technology

RAPID+ Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

Pattern Matching in Pig: Approach 1 RankingsStarPattern = JOIN triples1 ON Sub, triples2 ON Sub, triples3 ON Sub; Rankings type R1 Ranking pageRank pageURL 11 Triple store url1 triples1 triples2 triples3 Rankings star pattern = 3-way self-join UserVisits star pattern = 5-way self-join Center for E-Business Technology

Pattern Matching in Pig: Approach 2 LOAD all the RDF triples SPLIT typeRanking destURL typeUV visitDate Sub Prop Obj UV1 visitDate 1979/12/12 UV2 visitDate 1980/02/02 Sub Prop Obj UV1 destURL url1 UV2 destURL url1 Sub Prop Obj UV1 type userVisits UV2 type userVisits Sub Prop Obj R1 type Ranking R2 type Ranking pageRank Filter srcIP pageURL adRev visitDate Sub Prop Obj UV1 visitDate 1979/12/12 UV4 visitDate 1979/12/02 Sub Prop Obj R1 pageRank 11 R2 pageRank 27 Sub Prop Obj UV1 scrIP 158.112.27.3 UV2 scrIP 159.222.21.9 Sub Prop Obj UV1 adRev 339.08142 UV2 adRev 330.51248 Sub Prop Obj R1 pageURL url1 R2 pageURL url2 UserVisits = JOIN (compute Star Pattern) Ranking = JOIN (compute Star Pattern) JOIN between Ranking, UserVisits Center for E-Business Technology

RAPID+ • Propose UDFs for RDF pattern matching • Goal: Minimize I/O costs • Strategy: • Concurrent computation of star patterns using grouping-based algorithm • Can improve efficiency using Operator-coalescing and Look-ahead processing Center for E-Business Technology

Concurrent Star Pattern Matching Compute the average pageRank and total adRevenue for all pageURLs visited by a particular srcIP with visitDate between 1979/12/01 and 1979/12/30 Ranking Single intermediate result Filter irrelevant properties UserVisits

Operator Coalescing Filter irrelevant triples by coalescing LOAD and FILTER operators Our Approach Using Pig Latin map1 map1 LOAD LOAD Operator Coalescing loadFilter FILTER input = LOAD ‘\data’ using loadFilter( pageRank, pageURL, type:Ranking, destURL, adRevenue, srcIP, visitDate, type:UserVisits ) Center for E-Business Technology

Grouping-based Pattern Matching starSubgraphs = GROUP input BY $0; GROUP BY Subject Center for E-Business Technology

Filtering the Groups visitDatebetween 1979/12/01 and 1979/12/30 • Structure-based filtering • eliminate sub graphs • with missing properties Missing srcIP • Value-based filtering • validate each sub graph • against filter condition Center for E-Business Technology

Look-ahead Processing Star Pattern Matching  Joining the Stars Look-Ahead - Annotate bag based on join key Join between the star sub graphs Eliminate properties irrelevant for future processing (join and filter prop)  Minimize size of intermediate results Center for E-Business Technology

Comparison: Pig vs RAPID+ Center for E-Business Technology

Experiment Cost Analysis for Task A (PM) 5-node cluster Center for E-Business Technology

RAPID Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea

RAPID • Propose a language, RAPID, which extends Yahoo’s Pig Latin language with query primitives for dealing with the graph structured nature of RDF • Propose an approach for achieving the scalable processing of the non-trivial analytical tasks on RDF datasets that is based on an efficient multidimensional query operator called the MD-Join Center for E-Business Technology

Architecture of RAPID Center for E-Business Technology

Data Expression • Class Expression • type:Class • Property Expression • Path Expression Center for E-Business Technology

MD-Join • The MD-Join: An Operator for Complex OLAP, ICDE 2001 • Complex OLAP query is often difficult to express and difficult to optimize using standard relational operators • ex. compute the total sales broken down by all possible combinations of attributes prod, month, state • Tight coupling of the GROUPBY and aggregation clause is one reason of the problem • MD-Join decouples the grouping and the aggregation clauses in query expressions, in order to achieve great flexibility and simplicity Center for E-Business Technology

MD-Join • The MD-Join operator MD (B, R, l, ) defines a relation with • B: base table (contains all combinations of key values representing each group on which aggregations have to be computed) • R: fact table (contains all the data that needs to be aggregated) • l: a list of aggregation functions that need to be computed over attributes •  : is a set of conditions involving the attributes of B and R Center for E-Business Technology

MD-Join • MD (B, R, sum(price), Prod=R.Prod & Month = R.Month) Center for E-Business Technology

Extending Pig Latin for Analytical Processing RDF • Generating Fact Table (GFD) • fact_dataset = LOAD ‘input.rdf’ USING GFD (…); • Generating Base Table (GBD) • base_dataset = LOAD ‘input.rdf’ USING GBD(…); • Results of GFD & GBD are stored in the same MDJ.rdf file • MD-Join • output_dataset = LOAD ‘MDJ.rdf’ USING MDJ(…); Center for E-Business Technology

Experiment • DBLP • BSBM Center for E-Business Technology

Experimental Result Center for E-Business Technology

Summary • 데이터 분석 도메인에서 MR 기반의 접근이 활용되고 있음 • MR이 toolow-level 이라 Pig 같은 도구들이 많이 사용됨 • 대용량 RDF 분석에도 MR 및 Pig의 활용이 시도되고 있음 • RDF 모델 특성 때문에 기존 도구들에 약간의 수정이 필요 • Pig 기반의 RDF 처리는 RAPID, RAPID+ 가 유이한 솔루션 • RDF에 대해 Pig 보다는 좋은 성능을 보이고 있음 Center for E-Business Technology

Pig Latin & RDF Data Analysis