240 likes | 486 Views
A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker SIGMOD 2009 2009-10-09 Summarized by Jaeseok Myung. Intelligent Database Systems Lab
E N D
A Comparison of Approaches to Large-Scale Data Analysis Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker SIGMOD 2009 2009-10-09 Summarized by Jaeseok Myung Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea
MapReduce vs. Parallel DBMS Center for E-Business Technology
MapReduce 한재선, SearchDay2008, http://nexr.tistory.com Center for E-Business Technology
Architectural Differences Center for E-Business Technology
Benchmark Environment (1/2) • Systems • Hadoop: The most popular open-source MR implementation • DBMS-X: a parallel DBMS that stores data in a row-based format • Vertica: a column-based parallel DBMS • All Three systems were deployed on a 100-node cluster • Analytical Tasks • Data Loading • Selection Task • Aggregation Task • Join Task • UDF Aggregation Task Center for E-Business Technology
Benchmark Environment (2/2) • Dataset • Documents : 600,000 unique documents for each node • 155 million UserVisits records (20GB/node) • 18 million Rankings records (1GB/node) Center for E-Business Technology
1. Data Loading Reorganization loading time Center for E-Business Technology
2. Selection Task • The selection task is a lightweight filter to find the pageURLs in the Rankings table(1GB/node) with a pageRank above a user-defined threshold • Query • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x; • x = 10, which yields approximately 36,000 records per data file on each node • For MR, implementing the same task with Java language Center for E-Business Technology
2. Selection Task - Result time for combining the output into a single file (Additional MR) Processing time Center for E-Business Technology
3. Aggregation Task • The aggregation task is calculating the total adRevenue generated for each sourceIP in the UserVisits(20GB/node), grouped by the sourceIP column • Query • SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP; • This task always produces 2.5 million records Center for E-Business Technology
3. Aggregation Task - Result Center for E-Business Technology
4. Join Task • The join task consists of two sub-tasks that perform a complex calculation on two data sets • In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range • Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval • Query • SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(‘2000-01-15’) AND Date(‘2000-01-22’) GROUP BY UV.sourceIP; • SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1; Center for E-Business Technology
4. Join Task - Result Center for E-Business Technology
5. UDF Aggregation Task • The final task is to compute the inlink count for each document in the dataset • Query • SELECT INTO Temp F(contents) FROM Document; • F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database • With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count • SELECT url, SUM(value) FROM Temp GROUP BY url; Center for E-Business Technology
5. UDF Aggregation Task - Result Center for E-Business Technology
Conclusion MapReduce < Parallel DBMS Center for E-Business Technology
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin VLDB 2009 2009-10-09 Summarized by Jaeseok Myung Intelligent Database Systems Lab School of Computer Science & Engineering Seoul National University, Seoul, Korea
HadoopDB • The Basic Idea (An Architectural Hybrid of MR & DBMS) • To use MR as the communication layer above multiple nodes running single-node DBMS instances • Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher performing single node databases Center for E-Business Technology
The Architecture of HadoopDB Center for E-Business Technology
HadoopDB – Join Task Center for E-Business Technology