A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker SIGMOD 2009 2009-10-09 Summarized by Jaeseok Myung. Intelligent Database Systems Lab


A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker

SIGMOD 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab

School of Computer Science & Engineering

Seoul National University, Seoul, Korea

MapReduce vs. Parallel DBMS

Center for E-Business Technology

MapReduce

한재선, SearchDay2008, http://nexr.tistory.com

Architectural Differences

Benchmark Environment (1/2)
  • Systems
    • Hadoop: the most popular open-source MR implementation
    • DBMS-X: a parallel DBMS that stores data in a row-based format
    • Vertica: a column-based parallel DBMS
  • All three systems were deployed on a 100-node cluster
  • Analytical Tasks
    • Data Loading
    • Selection Task
    • Aggregation Task
    • Join Task
    • UDF Aggregation Task

Benchmark Environment (2/2)
  • Dataset
    • Documents: 600,000 unique documents per node
    • 155 million UserVisits records (20GB/node)
    • 18 million Rankings records (1GB/node)

1. Data Loading

[Figure: data loading results. Chart annotations: loading time, reorganization]

2. Selection Task
  • The selection task is a lightweight filter that finds the pageURLs in the Rankings table (1GB/node) with a pageRank above a user-defined threshold
  • Query
    • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x;
    • x = 10, which yields approximately 36,000 records per data file on each node
  • For MR, the same task is implemented as a Java program
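
As a rough illustration of how the selection query maps onto MapReduce, the filter can be written as a map-only job with no reduce step. This is a minimal Python sketch, not the Java program used in the benchmark; the `pageURL|pageRank|avgDuration` line layout, the function name `select_map`, and the sample rows are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple


def select_map(lines: Iterable[str], threshold: int) -> Iterator[Tuple[str, int]]:
    """Map-only job: emit (pageURL, pageRank) for rows with pageRank > threshold."""
    for line in lines:
        # Assumed layout: pageURL|pageRank|avgDuration (hypothetical delimiter).
        page_url, page_rank, _avg_duration = line.rstrip("\n").split("|")
        rank = int(page_rank)
        if rank > threshold:
            yield page_url, rank


# Hypothetical sample rows standing in for one node's Rankings file.
sample = [
    "http://a.example/|12|30",
    "http://b.example/|7|45",
    "http://c.example/|25|10",
]
selected = list(select_map(sample, threshold=10))
print(selected)
```

Because every mapper writes its own output, a real job would leave one result file per map task; merging those into a single file takes extra work outside the filter itself.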

2. Selection Task - Result

[Figure: selection task results. Chart annotations: processing time; time for combining the output into a single file (an additional MR job)]

3. Aggregation Task
  • The aggregation task calculates the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column
  • Query
    • SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
    • This task always produces 2.5 million records
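
The GROUP BY above corresponds to a map step that emits (sourceIP, adRevenue) pairs and a reduce step that sums the values for each key. The sketch below simulates that flow in memory with plain Python; `agg_map`, `shuffle`, and `agg_reduce` are illustrative names, and the sample visits are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Visit = Tuple[str, float]  # (sourceIP, adRevenue)


def agg_map(visits: Iterable[Visit]) -> Iterator[Visit]:
    """Map: emit one (sourceIP, adRevenue) pair per UserVisits record."""
    for source_ip, ad_revenue in visits:
        yield source_ip, ad_revenue


def shuffle(pairs: Iterable[Visit]) -> Dict[str, List[float]]:
    """Stand-in for the framework's shuffle: group values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def agg_reduce(source_ip: str, revenues: List[float]) -> Visit:
    """Reduce: total revenue for one sourceIP."""
    return source_ip, sum(revenues)


# Hypothetical sample records.
visits = [("1.1.1.1", 1.5), ("2.2.2.2", 2.0), ("1.1.1.1", 0.5)]
totals = dict(agg_reduce(ip, revs) for ip, revs in shuffle(agg_map(visits)).items())
print(totals)
```

Since summation is associative, the same reduce function could also run as a combiner on each node to shrink the data sent over the network.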

3. Aggregation Task - Result

4. Join Task
  • The join task consists of two sub-tasks that perform a complex calculation on two data sets
    • In the first part of the task, each system must find the sourceIP that generated the most revenue within a particular date range
    • Once these intermediate records are generated, the system must then calculate the average pageRank of all the pages visited during this interval
  • Query
    • SELECT INTO Temp sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
    • SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
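
Without a built-in join operator, an MR version of these two queries has to be split into several passes. One plausible decomposition is sketched below in Python: filter and join, aggregate per sourceIP, then pick the top record. The three function names and the sample data are assumptions, and the in-memory dictionary lookup simplifies what would be a distributed join in a real job.

```python
from collections import defaultdict
from datetime import date


def phase1_filter_and_join(rankings, user_visits, start, end):
    """Keep visits inside [start, end] and attach the pageRank of the visited URL."""
    rank_by_url = dict(rankings)  # pageURL -> pageRank (simplified in-memory join)
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if start <= visit_date <= end and dest_url in rank_by_url:
            yield source_ip, (rank_by_url[dest_url], ad_revenue)


def phase2_aggregate(joined):
    """Per sourceIP: total adRevenue and average pageRank."""
    groups = defaultdict(list)
    for source_ip, pair in joined:
        groups[source_ip].append(pair)
    for source_ip, pairs in groups.items():
        total_revenue = sum(revenue for _, revenue in pairs)
        avg_page_rank = sum(rank for rank, _ in pairs) / len(pairs)
        yield source_ip, total_revenue, avg_page_rank


def phase3_top(aggregated):
    """Single record with the largest totalRevenue (ORDER BY ... DESC LIMIT 1)."""
    return max(aggregated, key=lambda row: row[1])


# Hypothetical sample data.
rankings = [("u1", 10), ("u2", 20), ("u3", 30)]
user_visits = [
    ("1.1.1.1", "u1", date(2000, 1, 16), 1.0),
    ("1.1.1.1", "u2", date(2000, 1, 20), 2.0),
    ("2.2.2.2", "u3", date(2000, 1, 17), 2.5),
    ("2.2.2.2", "u1", date(2000, 2, 1), 9.0),  # outside the date range
]
joined = phase1_filter_and_join(rankings, user_visits, date(2000, 1, 15), date(2000, 1, 22))
top = phase3_top(phase2_aggregate(joined))
print(top)
```

The date filter is inclusive on both ends, matching SQL's BETWEEN.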

4. Join Task - Result

5. UDF Aggregation Task
  • The final task is to compute the inlink count for each document in the dataset
  • Query
    • SELECT INTO Temp F(contents) FROM Documents;
      • F : a user-defined function that parses the contents of each record in the Documents table and emits URLs into the database
      • With this function F, we populate a temporary table with a list of URLs and then can execute a simple query to calculate the inlink count
    • SELECT url, SUM(value) FROM Temp GROUP BY url;
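
In MR terms the same computation resembles a word count over URLs: the map step plays the role of F, and the reduce step performs the SUM. The Python sketch below assumes a deliberately simple regular expression for URL extraction; `URL_PATTERN`, `extract_urls`, `inlink_counts`, and the sample documents are all illustrative, not taken from the benchmark code.

```python
import re
from collections import Counter

# Simplified URL matcher chosen for illustration; real HTML parsing is more involved.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(contents):
    """Stand-in for F: emit (url, 1) for every URL found in a document."""
    for url in URL_PATTERN.findall(contents):
        yield url, 1


def inlink_counts(documents):
    """Equivalent of SELECT url, SUM(value) ... GROUP BY url over the emitted pairs."""
    counts = Counter()
    for contents in documents:
        for url, value in extract_urls(contents):
            counts[url] += value
    return dict(counts)


# Hypothetical sample documents.
documents = [
    '<a href="http://x.example/a">a</a> and http://x.example/b',
    "see http://x.example/a",
]
counts = inlink_counts(documents)
print(counts)
```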

5. UDF Aggregation Task - Result

Conclusion

MapReduce < Parallel DBMS


HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin

VLDB 2009

2009-10-09

Summarized by Jaeseok Myung

Intelligent Database Systems Lab

School of Computer Science & Engineering

Seoul National University, Seoul, Korea

HadoopDB
  • The Basic Idea (An Architectural Hybrid of MR & DBMS)
    • To use MR as the communication layer above multiple nodes running single-node DBMS instances
  • Queries are expressed in SQL, translated into MR by extending existing tools, and as much work as possible is pushed into the higher performing single node databases
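
The idea of pushing work into per-node databases and using MR only to combine results can be imitated on one machine. In the Python sketch below, each "node" is an in-memory SQLite database (a stand-in chosen only because it ships with Python, not the DBMS that HadoopDB uses), the map step runs the aggregation SQL locally, and the reduce step merges the partial sums. All function names and data are hypothetical.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one single-node database holding a partition of UserVisits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserVisits (sourceIP TEXT, adRevenue REAL)")
    conn.executemany("INSERT INTO UserVisits VALUES (?, ?)", rows)
    return conn


def map_on_node(conn):
    """Map: push the GROUP BY into the local database and emit partial sums."""
    query = "SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP"
    yield from conn.execute(query)


def reduce_partials(partials):
    """Reduce: merge per-node partial sums into global totals."""
    totals = defaultdict(float)
    for source_ip, partial_sum in partials:
        totals[source_ip] += partial_sum
    return dict(totals)


# Two hypothetical partitions of the table.
nodes = [
    make_node([("1.1.1.1", 1.0), ("2.2.2.2", 2.0)]),
    make_node([("1.1.1.1", 0.5)]),
]
partials = [row for conn in nodes for row in map_on_node(conn)]
totals = reduce_partials(partials)
print(totals)
```

Each node returns at most one row per sourceIP, so the data crossing the MR layer is already aggregated rather than raw.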

The Architecture of HadoopDB

HadoopDB – Join Task
