Dremel: Interactive Analysis of Web-Scale Datasets

Ridvan Dongelci, Department of Information and Computer Science, Aalto University, School of Science and Technology. ridvan.dongelci@aalto.fi. April 15, 2013.


Presentation Transcript


  1. Dremel: Interactive Analysis of Web-Scale Datasets • Ridvan Dongelci • Department of Information and Computer Science • Aalto University, School of Science and Technology • ridvan.dongelci@aalto.fi • April 15, 2013

  2. Dremel: Interactive Analysis of Web-Scale Datasets • Dremel and Motivation • Columnar Storage • Query Language and Execution • Experiments • Observations and Conclusion

  3. Background • Web-Scale Datasets • Data Exploration, Rapid Prototyping • Speed Matters

  4. Dremel • Interactive Analysis of Web-Scale Datasets • Scalable, Fault Tolerant, Fast • Analysis of in situ (in-place) data • Bigtable, Google File System • Widely Used at Google • BigQuery, Google Books, Web Analysis

  5. Dremel Key Concepts • Nested Columnar Storage • Google Protocol Buffers for Processing and Storing • SQL-like Query Language • Execution with Serving Trees • Inspired by Web Search

  6. Data Model • Record-wise vs. Columnar Storage • SELECT SUM(A.B.C) FROM t
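A minimal sketch of why the query on the Data Model slide favors columnar storage. The records, the extra field `D`, and the function names are hypothetical and only serve the illustration; this is not how Dremel stores data internally.

```python
# Hypothetical records: A.B.C is the field the query needs, D is unrelated payload.
records = [
    {"A": {"B": {"C": 3}}, "D": "x" * 100},
    {"A": {"B": {"C": 4}}, "D": "y" * 100},
    {"A": {"B": {"C": 5}}, "D": "z" * 100},
]


def sum_record_wise(recs):
    """Record-wise layout: every whole record is visited to reach A.B.C."""
    return sum(r["A"]["B"]["C"] for r in recs)


def to_columns(recs):
    """Split the records into one list per leaf path (a toy columnar layout)."""
    return {
        "A.B.C": [r["A"]["B"]["C"] for r in recs],
        "D": [r["D"] for r in recs],
    }


def sum_columnar(columns):
    """Columnar layout: only the A.B.C column is read; D is never touched."""
    return sum(columns["A.B.C"])


columns = to_columns(records)
print(sum_record_wise(records), sum_columnar(columns))
```

Both functions compute SUM(A.B.C), but the columnar version never reads the `D` values, which is the saving the slide points at.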

  7. Nested Columnar Storage • Repetition and Definition Levels • r1.Name1.Language.Code ‘en-us’ • r1.Name1.Language.Code ‘en’ • r1.Name2 • r1.Name3 ‘en-gb’ • r2.Name1
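The repetition and definition levels for the Name.Language.Code paths listed on this slide can be sketched as follows. The sketch assumes that Name and Language are repeated fields and that Code is required inside Language; the dictionary representation of r1 and r2 and the function name are hypothetical, and the function handles only this one column rather than a general schema.

```python
def stripe_code_column(record):
    """Return (value, repetition_level, definition_level) for Name.Language.Code.

    Repetition level: which repeated field in the path repeated last
    (0 = new record, 1 = Name, 2 = Language).
    Definition level: how many repeated fields in the path are present
    (Code is assumed required, so the maximum is 2).
    """
    names = record.get("Name", [])
    if not names:
        return [(None, 0, 0)]  # no Name at all: nothing in the path is defined
    out = []
    for i, name in enumerate(names):
        r_name = 0 if i == 0 else 1  # first Name starts the record
        langs = name.get("Language", [])
        if not langs:
            out.append((None, r_name, 1))  # Name defined, Language missing
            continue
        for j, lang in enumerate(langs):
            r = r_name if j == 0 else 2  # later Languages repeat at level 2
            out.append((lang["Code"], r, 2))
    return out


# Hypothetical encoding of the two records whose paths appear on the slide.
r1 = {"Name": [
    {"Language": [{"Code": "en-us"}, {"Code": "en"}]},
    {},
    {"Language": [{"Code": "en-gb"}]},
]}
r2 = {"Name": [{}]}

for value, rep, definition in stripe_code_column(r1) + stripe_code_column(r2):
    print(value, rep, definition)
```

The NULL entries for r1.Name2 and r2.Name1 are kept in the level stream so that the record structure can be rebuilt later during record assembly.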

  8. Nested Columnar Storage • Splitting Records into Columns • Record Assembly

  9. Tablet Layout and Tricks • Tablet Storage and Horizontal Partitioning • Save Space • Nulls are not stored • Definition levels are not stored if a field is always defined • Repetition levels are stored only when needed • Levels are packed as bit sequences
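One way to read "levels are packed as bit sequences" is that each level takes only as many bits as the largest possible level requires. The sketch below illustrates that idea; the function names and the exact bit layout are assumptions, since the slides do not specify Dremel's actual encoding.

```python
def bits_needed(max_level):
    """Number of bits required to represent any level in 0..max_level."""
    return max(1, max_level.bit_length())


def pack_levels(levels, max_level):
    """Pack a list of small integers into one int, fixed width per level."""
    width = bits_needed(max_level)
    packed = 0
    for level in levels:
        packed = (packed << width) | level
    return packed, width


def unpack_levels(packed, width, count):
    """Inverse of pack_levels: recover `count` levels, first level first."""
    mask = (1 << width) - 1
    return [(packed >> (width * (count - 1 - i))) & mask for i in range(count)]


repetition_levels = [0, 2, 1, 1, 0]
packed, width = pack_levels(repetition_levels, max_level=2)
print(width, unpack_levels(packed, width, len(repetition_levels)))
```

With a maximum level of 2, each level fits in two bits, so the five levels above occupy ten bits instead of five full integers.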

  10. Query Language • SQL-like Language • Efficient on columnar storage • Takes one or more tables and their schemas as input • Outputs a table and its schema • WHERE prunes branches • Supports the Following Operations • Nested sub-queries, inter/intra-record aggregation • Top-K queries, Joins, User-defined functions

  11. Query Language Example

  12. Query Execution • Many queries are one-pass • Execution on Serving Trees • Parallel scheduling and aggregation • Fault tolerance and handling of stragglers • Root Server Receives Incoming Query • Fetches Metadata and Schema • Determines Tablets • Rewrites Query • Sends It to the Serving Tree • Aggregates the Results

  13. Query Execution Example • Incoming query: SELECT A, COUNT(B) FROM T GROUP BY A • Rewritten by the root server: SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A • Each Ri is computed lower in the serving tree: Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A
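The rewrite on this slide can be sketched in a few lines: each tablet Ti produces a partial count Ri, and the root sums the partial counts. The tablet contents and function names below are hypothetical, and the sketch uses a single level of aggregation rather than a full serving tree.

```python
def leaf_query(tablet):
    """Ri = SELECT A, COUNT(B) AS c FROM Ti GROUP BY A (COUNT skips NULLs)."""
    counts = {}
    for row in tablet:
        counts.setdefault(row["A"], 0)  # every group appears, even with count 0
        if row["B"] is not None:
            counts[row["A"]] += 1
    return counts


def root_query(partials):
    """SELECT A, SUM(c) FROM (R1 UNION ALL ... RN) GROUP BY A."""
    totals = {}
    for partial in partials:
        for a, c in partial.items():
            totals[a] = totals.get(a, 0) + c
    return totals


# Hypothetical tablets T1 and T2 that together make up table T.
t1 = [{"A": "x", "B": 1}, {"A": "x", "B": None}, {"A": "y", "B": 2}]
t2 = [{"A": "x", "B": 3}, {"A": "z", "B": None}]

distributed = root_query([leaf_query(t1), leaf_query(t2)])
single_pass = leaf_query(t1 + t2)  # the original query run over all of T
print(distributed == single_pass, distributed)
```

The rewrite works because COUNT decomposes into a sum of per-partition counts, so intermediate servers can apply the same summing step to the results of their children.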

  14. Beyond One-Pass and Query Dispatcher • Dremel Supports More Than One-Pass Queries • Broadcast Join • Repartitioning the Data • SELECT-INTO • Query Dispatch Based on Priority and Load Balancing • Fault tolerance with rescheduling • Slots and Histograms • Approximation and Tablet Percentage
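The "approximation and tablet percentage" item can be read as: return an answer once a chosen percentage of tablets has reported, rather than waiting for stragglers. The following is a minimal sketch under that reading; the function name, the `min_percent` parameter, and the use of plain partial sums are assumptions for illustration.

```python
def approximate_sum(partials_in_completion_order, total_tablets, min_percent):
    """Add up per-tablet partial sums, stopping once at least
    min_percent of the tablets have reported.

    Returns (approximate_total, tablets_used).
    """
    # Ceiling division on integers to avoid floating-point rounding.
    needed = (min_percent * total_tablets + 99) // 100
    total = 0
    seen = 0
    for partial in partials_in_completion_order:
        total += partial
        seen += 1
        if seen >= needed:
            break  # remaining tablets (possible stragglers) are skipped
    return total, seen


# Hypothetical partial results from 10 tablets, in the order they finished.
partials = [5, 3, 7, 1, 9, 2, 4, 6, 8, 10]
print(approximate_sum(partials, total_tablets=10, min_percent=80))
print(approximate_sum(partials, total_tablets=10, min_percent=100))
```

The trade-off is explicit: a lower percentage returns sooner but leaves out whatever the unfinished tablets would have contributed.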

  15. Experiment Environment • Real Google Datasets • About one petabyte uncompressed • Three-way replicated, except one table • 100K to 800K tablets

  16. Local Disk Performance • Trade-off of Columnar vs. Record-Oriented Storage • 1 GB of data on a dual-core Intel machine with 70 MB/s read bandwidth • Columnar: 375 MB, light compression • Record-oriented: same size, heavier compression

  17. MR and Dremel • Average term frequency is analyzed • 3000 MapReduce workers and 3000 Dremel nodes • 0.5 TB read with columnar storage as opposed to 87 TB with record-oriented storage • Overhead of launching and scheduling jobs, assembling records

  18. Serving Tree Topology • Queries with Different Numbers of Levels • First query reads about 60 GB • Second query reads about 180 GB • 2-level is 1:2900, 3-level is 1:100:2900, 4-level is 1:10:100:2900
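The topologies on this slide give the number of servers per level, so the fan-out between adjacent levels follows by division. The helper below is a hypothetical illustration of that arithmetic, not part of Dremel.

```python
def fan_outs(topology):
    """Average number of children per server between adjacent tree levels.

    `topology` is a string such as "1:100:2900" giving servers per level,
    root first.
    """
    levels = [int(x) for x in topology.split(":")]
    return [child / parent for parent, child in zip(levels, levels[1:])]


for topology in ("1:2900", "1:100:2900", "1:10:100:2900"):
    print(topology, fan_outs(topology))
```

In the 2-level tree the root alone merges results from 2900 leaves, while in the 3- and 4-level trees no server merges more than 100 inputs, which is why deeper trees help aggregation-heavy queries.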

  19. Per-Tablet Histogram • Tablet processing rates are investigated • 99% of tablets for the first query are done within 1 second • 99% of tablets for the second query are done within 2 seconds

  20. Within-Record Aggregation • Effect of Nesting and Columnar Storage • Only 13 GB is read thanks to columnar storage • Without nesting, the query would be much more expensive

  21. Scalability • Top 20 aid values on 4.2 TB of compressed data • 1000 to 4000 nodes • CPU time is nearly identical, about 300K seconds • Near-linear scalability

  22. Stragglers • Only two-way replication on T5 • 99% done in 5 seconds • Less replication, more stragglers

  23. Observations • Scan-based queries can be executed at web scale • Near-linear scalability is achievable • MapReduce can benefit from columnar storage • Parallel DBMSs can benefit from serving trees • Record assembly and parsing are expensive • MapReduce and Dremel can be used complementarily

  24. Thank You for Your Patience • Questions & Comments

  26. Algorithms

  27. Algorithms
