A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker
SIGMOD 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

MapReduce vs. Parallel DBMS Center for E-Business Technology

MapReduce (source: 한재선, SearchDay2008, http://nexr.tistory.com)

Architectural Differences

Benchmark Environment (1/2)
• Systems
  • Hadoop: the most popular open-source MR implementation
  • DBMS-X: a parallel DBMS that stores data in a row-based format
  • Vertica: a column-based parallel DBMS
  • All three systems were deployed on a 100-node cluster
• Analytical Tasks
  • Data Loading
  • Selection Task
  • Aggregation Task
  • Join Task
  • UDF Aggregation Task

Benchmark Environment (2/2)
• Dataset
  • Documents: 600,000 unique documents for each node
  • UserVisits: 155 million records (20GB/node)
  • Rankings: 18 million records (1GB/node)

1. Data Loading
[Figure: data loading results. Annotations: loading time, reorganization time]

2. Selection Task
• The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold
• Query
  • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
  • x = 10, which yields approximately 36,000 records per data file on each node
• For MR, the same task is implemented in Java
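The selection query maps naturally onto a map-only job: each mapper scans its local Rankings file and emits the rows that pass the filter. The sketch below illustrates that idea in Python rather than the Java used in the benchmark; the record layout, the function name `select_map`, and the sample rows are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (pageURL, pageRank).
Ranking = Tuple[str, int]


def select_map(records: Iterable[Ranking], threshold: int) -> Iterator[Ranking]:
    """Map-only filter: emit (pageURL, pageRank) when pageRank > threshold."""
    for page_url, page_rank in records:
        if page_rank > threshold:
            yield page_url, page_rank


# Made-up rows standing in for one node's Rankings file.
rankings = [
    ("http://a.example/1", 3),
    ("http://a.example/2", 10),
    ("http://a.example/3", 42),
    ("http://a.example/4", 11),
]

# x = 10, as in the slide; the comparison is strict, so pageRank 10 is dropped.
selected = list(select_map(rankings, threshold=10))
print(selected)
```

No reduce step is needed for the filter itself, which is why each mapper produces its own output and merging everything into one file takes extra work.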

2. Selection Task - Result
[Figure: selection task results. Annotations: processing time; time for combining the output into a single file (additional MR job)]

3. Aggregation Task
• The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node)
• Query
  • SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
• This task always produces 2.5 million records
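The GROUP BY query above corresponds to a map step that emits (sourceIP, adRevenue) pairs and a reduce step that sums per key. A minimal Python sketch of that pattern follows; the node data and the helper names `map_visits` and `sum_by_key` are hypothetical, and the same summing function is reused as a per-node combiner, which is valid here only because SUM is associative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record layout: (sourceIP, adRevenue); other columns omitted.
Visit = Tuple[str, float]


def map_visits(visits: Iterable[Visit]) -> Iterator[Tuple[str, float]]:
    """Map step: emit one (sourceIP, adRevenue) pair per UserVisits row."""
    for source_ip, ad_revenue in visits:
        yield source_ip, ad_revenue


def sum_by_key(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Combine/reduce step: sum the values for each key."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up rows standing in for the UserVisits partitions on two nodes.
node_a = [("10.0.0.1", 1.5), ("10.0.0.2", 2.0), ("10.0.0.1", 0.5)]
node_b = [("10.0.0.2", 1.0), ("10.0.0.3", 4.0)]

# Each node pre-aggregates locally, then the partial sums are merged.
partials = [sum_by_key(map_visits(node)) for node in (node_a, node_b)]
totals = sum_by_key(pair for partial in partials for pair in partial.items())
print(totals)
```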

3. Aggregation Task - Result

4. Join Task
• The join task consists of two sub-tasks that perform a complex calculation on two data sets
  • In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
  • Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
• Query
  • SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
    AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
    GROUP BY UV.sourceIP;
  • SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
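Without a query optimizer, the two SQL statements above have to be spelled out as explicit phases. The Python sketch below shows one way to structure them (filter and join, group by sourceIP, pick the top row); it is an illustration under assumed record layouts and made-up data, not the benchmark's Java program, and `join_task` is a hypothetical name.

```python
from collections import defaultdict
from datetime import date

# Hypothetical layouts: Rankings as pageURL -> pageRank,
# UserVisits as (sourceIP, destURL, visitDate, adRevenue).
rankings = {"u1": 10, "u2": 30, "u3": 50}
user_visits = [
    ("1.1.1.1", "u1", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "u2", date(2000, 1, 20), 2.0),
    ("2.2.2.2", "u3", date(2000, 1, 17), 2.5),
    ("2.2.2.2", "u1", date(2000, 2, 1), 9.0),  # outside the date range
]


def join_task(rankings, user_visits, start, end):
    # Phase 1: filter UserVisits by date and join with Rankings on the URL.
    joined = [
        (ip, rankings[url], revenue)
        for ip, url, day, revenue in user_visits
        if start <= day <= end and url in rankings
    ]
    # Phase 2: per sourceIP, accumulate [totalRevenue, pageRank sum, row count].
    groups = defaultdict(lambda: [0.0, 0, 0])
    for ip, rank, revenue in joined:
        group = groups[ip]
        group[0] += revenue
        group[1] += rank
        group[2] += 1
    if not groups:
        return None
    # Phase 3: keep the sourceIP with the largest totalRevenue.
    ip, (total, rank_sum, count) = max(groups.items(), key=lambda kv: kv[1][0])
    return ip, total, rank_sum / count


top = join_task(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22))
print(top)
```

In a distributed setting each phase would involve shuffling data between nodes, which this single-process sketch does not model.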

4. Join Task - Result

5. UDF Aggregation Task
• The final task is to compute the inlink count for each document in the dataset
• Query
  • SELECT INTO Temp F(contents) FROM Documents;
  • F: a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
  • With this function F, we populate a temporary table with a list of URLs and can then execute a simple query to calculate the inlink count
  • SELECT url, SUM(value) FROM Temp GROUP BY url;
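The role of F can be illustrated with a small Python sketch: a function that scans a document's contents and emits (url, 1) pairs, followed by a per-URL sum. The regular expression, the function name `extract_urls`, and the sample documents are assumptions for illustration; the benchmark's actual parsing rules are not given on the slide.

```python
import re
from collections import Counter
from typing import Iterator, Tuple

# Simplified, assumed URL pattern; real HTML parsing would be more careful.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents: str) -> Iterator[Tuple[str, int]]:
    """Stand-in for F: emit (url, 1) for every URL found in the contents."""
    for url in URL_PATTERN.findall(contents):
        yield url, 1


# Made-up rows standing in for the contents column of the Documents table.
documents = [
    'see <a href="http://x.example/a">a</a> and http://x.example/b',
    "again http://x.example/a",
]

# Equivalent of: SELECT url, SUM(value) FROM Temp GROUP BY url;
inlinks = Counter()
for contents in documents:
    for url, value in extract_urls(contents):
        inlinks[url] += value
print(dict(inlinks))
```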

5. UDF Aggregation Task - Result

Conclusion: MapReduce < Parallel DBMS

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin
VLDB 2009
2009-10-09
Summarized by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

HadoopDB
• The Basic Idea (An Architectural Hybrid of MR & DBMS)
  • Use MR as the communication layer above multiple nodes running single-node DBMS instances
  • Queries are expressed in SQL and translated into MR by extending existing tools; as much work as possible is pushed into the higher-performing single-node databases
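The basic idea can be sketched in a few lines: each node runs as much of the query as possible in its local database, and an MR-style step only merges the partial results. In the Python sketch below, in-memory SQLite databases stand in for the single-node DBMS instances; this is an assumption for illustration and does not reflect HadoopDB's actual interfaces. The names `make_node`, `map_phase`, and `reduce_phase` are hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create an in-memory SQLite database standing in for one node's DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


# The part of the query pushed down into each single-node database.
PUSHED_DOWN_SQL = (
    "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
)


def map_phase(conn):
    """Map step: run the pushed-down SQL locally and emit partial sums."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_phase(partials):
    """Reduce step: merge the per-node partial sums by sourceIP."""
    totals = defaultdict(float)
    for rows in partials:
        for source_ip, revenue in rows:
            totals[source_ip] += revenue
    return dict(totals)


# Made-up data for two nodes.
nodes = [
    make_node([("10.0.0.1", 1.5), ("10.0.0.2", 2.0), ("10.0.0.1", 0.5)]),
    make_node([("10.0.0.2", 1.0)]),
]
result = reduce_phase(map_phase(conn) for conn in nodes)
print(result)
```

The division of labor is the point of the sketch: scanning and grouping happen inside the databases, while the MR layer only moves and merges the much smaller partial results.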

The Architecture of HadoopDB

HadoopDB – Join Task
