
Shark:SQL and Rich Analytics at Scale




Presentation Transcript


  1. Shark: SQL and Rich Analytics at Scale. Presented by Kirti Dighe and Drushti Gawade

  2. What is Shark? • A new data analysis system • Built on top of RDDs and Spark • Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.) • Achieves speedups of up to 100x over Hive • Supports low-latency, interactive queries through in-memory computation • Supports both SQL and complex analytics such as machine learning

  3. Shark Architecture • Can query an existing Hive warehouse, without modification, and return results much faster • Architecture diagram

  4. Spark • Supports partial DAG execution • Optimized join algorithms. Features of Shark • Supports general computation • Provides an in-memory storage abstraction (RDDs) • Engine optimized for low latency

  5. RDD • Spark's main abstraction: the RDD • A collection stored in an external storage system, or a dataset derived from one • Can contain arbitrary data types. Benefits of RDDs • Data is returned at the speed of DRAM • Use of lineage • Speedy recovery • Immutability, the foundation for relational processing
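The lineage idea above can be sketched in a few lines of Python. This is an illustration only, not Spark's API: each dataset records its parent and the function that derived it, so a lost partition is rebuilt by deterministic recomputation rather than fetched from a replica.

```python
# Minimal sketch of RDD lineage (illustrative; class and method names are
# made up, not Spark's API). Each dataset remembers how it was derived.
class MiniRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions   # cached data; may be lost/evicted
        self._parent = parent           # lineage: where the data came from
        self._fn = fn                   # lineage: how it was derived

    def map(self, fn):
        child = [[fn(x) for x in p] for p in self._partitions]
        return MiniRDD(child, parent=self, fn=fn)

    def recompute(self, i):
        """Rebuild partition i from the parent instead of from a replica."""
        return [self._fn(x) for x in self._parent._partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
doubled._partitions[1] = None           # simulate losing a partition
print(doubled.recompute(1))             # -> [6, 8]
```

Because the transformation is deterministic, recomputation yields exactly the lost data, which is also what enables the straggler mitigation mentioned on the next slide.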

  6. Fault tolerance guarantees • Shark can tolerate the loss of any set of worker nodes. • Recovery is parallelized across the cluster. • The deterministic nature of RDDs also enables straggler mitigation • Recovery works even in queries that combine SQL and machine learning UDFs

  7. Executing SQL over RDDs Executing a SQL query involves: • Query parsing • Logical plan generation • Physical plan generation
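A toy end-to-end version of these three stages can make the pipeline concrete. The grammar, function names, and table data below are invented for illustration; Shark's real planner reuses Hive's parser and produces operators over RDDs.

```python
# Toy query pipeline (assumed names, not Shark's code): parse the query text,
# build a logical plan, then execute a physical plan over in-memory rows.
def parse(sql):
    # Extremely simplified: handles only "SELECT c FROM t WHERE c > n".
    toks = sql.split()
    return {"select": toks[1], "table": toks[3],
            "filter_col": toks[5], "min": int(toks[7])}

def logical_plan(ast):
    return [("scan", ast["table"]),
            ("filter", ast["filter_col"], ast["min"]),
            ("project", ast["select"])]

def execute(plan, tables):
    rows = tables[plan[0][1]]                           # scan
    rows = [r for r in rows if r[plan[1][1]] > plan[1][2]]  # filter
    return [r[plan[2][1]] for r in rows]                # project

tables = {"rankings": [{"pageURL": "a", "pageRank": 5},
                       {"pageURL": "b", "pageRank": 50}]}
ast = parse("SELECT pageURL FROM rankings WHERE pageRank > 10")
print(execute(logical_plan(ast), tables))   # -> ['b']
```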

  8. Engine extension Partial DAG execution (PDE) • Static query optimization • Dynamic query optimization • Gathering of runtime statistics. Examples of statistics • Partition size and record count • List of “heavy hitters” • Approximate histogram
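A hedged sketch of the per-partition statistics listed above, computed while map output is materialized (the function and field names are assumptions for illustration, not Shark's internals):

```python
from collections import Counter

# Sketch of the statistics PDE could gather per partition: record count and
# the most frequent join keys ("heavy hitters") that inform re-optimization.
def partition_stats(partition, top_k=2):
    counts = Counter(key for key, _ in partition)
    return {
        "record_count": len(partition),
        "heavy_hitters": [k for k, _ in counts.most_common(top_k)],
    }

part = [("us", 1), ("us", 2), ("uk", 3), ("us", 4), ("de", 5)]
stats = partition_stats(part)
print(stats["record_count"])    # -> 5
print(stats["heavy_hitters"])   # -> ['us', 'uk']
```

Statistics like these let the engine alter the rest of the query plan at run time, for example when choosing a join strategy as on the next slide.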

  9. Join Optimization • Skew handling and degree of parallelism • Task scheduling overhead
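One way to picture runtime join selection: once PDE has materialized both inputs, the engine can compare their observed sizes and broadcast the small side instead of shuffling both. The threshold and names below are made up for the example.

```python
# Illustrative sketch (not Shark's code) of choosing a join strategy from
# runtime size statistics. The cutoff is a hypothetical value.
BROADCAST_THRESHOLD = 3   # rows

def choose_join(left_rows, right_rows):
    if min(left_rows, right_rows) <= BROADCAST_THRESHOLD:
        return "map_join"       # ship the small table to every worker
    return "shuffle_join"       # repartition both sides by join key

print(choose_join(1000, 2))     # -> map_join
print(choose_join(1000, 900))   # -> shuffle_join
```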

  10. Columnar Memory Store Simply caching records as JVM objects is insufficient. Shark employs column-oriented storage: a partition of columns is one MapReduce “record”. Benefits: compact representation, CPU-efficient compression, cache locality
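The layout difference can be sketched as follows. Python's `array` module stands in for Java primitive arrays here, which is an assumption made purely for illustration:

```python
import array

# Row layout vs. columnar layout for one partition of (id, revenue) records.
rows = [(1, 10.0), (2, 20.0), (3, 30.0)]

# Row store: one object per record (analogous to caching JVM objects).
row_store = [tuple(r) for r in rows]

# Column store: one compact primitive array per column for the partition.
col_store = {
    "id":      array.array("i", (r[0] for r in rows)),
    "revenue": array.array("d", (r[1] for r in rows)),
}

# A scan of one column touches only that column's contiguous array.
print(sum(col_store["revenue"]))   # -> 60.0
```

Storing each column contiguously is what makes the compact representation, compression, and cache locality on this slide possible.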

  11. Machine learning support • Shark supports machine learning as a first-class citizen • Its programming model is designed to express machine learning algorithms: 1. Language Integration Shark allows queries to feed into machine learning code, e.g. a data analysis pipeline that performs logistic regression over a user database.
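A minimal sketch of that pipeline idea, with rows selected by a query feeding straight into an in-memory logistic-regression loop. The tiny dataset, learning rate, and iteration count are invented for the example; the real workflow runs over Spark.

```python
import math

# Rows as if selected by a query: (feature, label) pairs, kept in memory.
rows = [((0.0,), 0), ((1.0,), 0), ((3.0,), 1), ((4.0,), 1)]
w, b, lr = 0.0, 0.0, 0.5

# Plain stochastic gradient descent on the cached rows, no export step.
for _ in range(200):
    for (x,), y in rows:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        w -= lr * (p - y) * x
        b -= lr * (p - y)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5

print(predict(0.5), predict(3.5))   # -> False True
```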

  12. 2. Execution Engine Integration • A common abstraction allows machine learning computations and SQL queries to share workers and cached data • Enables end-to-end fault tolerance

  13. Implementation How to improve query processing speed • Minimize tail latency • Minimize the CPU cost of processing each row • Memory-based shuffle • Avoid temporary object creation • Bytecode compilation of expression evaluation
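The memory-based shuffle can be pictured as hash-partitioning map output into in-memory buckets rather than writing intermediate files to disk. The function below is an assumed sketch of the idea, not Shark's implementation:

```python
from collections import defaultdict

# Sketch of a memory-based shuffle: map output is hash-partitioned by key
# into in-memory buckets, one per reducer, avoiding disk I/O between stages.
def shuffle(records, n_reducers):
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in records:
        buckets[hash(key) % n_reducers][key].append(value)
    return buckets

records = [("a", 1), ("b", 2), ("a", 3)]
buckets = shuffle(records, 2)
merged = {k: v for b in buckets for k, v in b.items()}
print(merged["a"])   # -> [1, 3]
```

Keeping shuffle output in memory is what removes the disk round-trip that dominates short queries in Hadoop.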

  14. Experiments Shark was evaluated using four datasets: • Pavlo et al. benchmark: 2.1 TB of data reproducing Pavlo et al.’s comparison of MapReduce vs. analytical DBMSs [25]. • TPC-H dataset: 100 GB and 1 TB datasets generated by the DBGEN program [29]. • Real Hive warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark. • Machine learning dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms. Shark performs up to 100x faster than Hive.

  15. Methodology and cluster setup Amazon EC2 with 100 m2.4xlarge nodes: 8 virtual cores, 68 GB of memory, and 1.6 TB of local storage each. Pavlo et al. benchmark: 1 GB/node rankings table, 20 GB/node uservisits table. • Selection Query (clustered index) SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;

  16. Aggregation Queries SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP; SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);

  17. Join Query SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue FROM rankings AS R, uservisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(’2000-01-15’) AND Date(’2000-01-22’) GROUP BY UV.sourceIP; Join query runtimes reflect the join strategies chosen by the optimizers (Pavlo benchmark).

  18. Data Loading Shark can query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop’s. Micro-Benchmarks • Aggregation performance SELECT [GROUP_BY_COLUMN], COUNT(*) FROM lineitem GROUP BY [GROUP_BY_COLUMN]

  19. Join selection at runtime • Fault tolerance Measuring Shark’s performance in the presence of node failures: simulate failures and measure query performance before, during, and after failure recovery.

  20. Real Hive warehouse 1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day. 2. Query 2 counts the number of sessions and distinct customer/client combinations grouped by countries, with filter predicates on eight columns. 3. Query 3 counts the number of sessions and distinct users for all but 2 countries. 4. Query 4 computes summary statistics in 7 dimensions, grouping by a column and showing the top groups sorted in descending order.

  21. Machine learning Algorithms Compare the performance of Shark against the same workflow running in Hive and Hadoop. The workflow consisted of three steps: 1) Selecting the data of interest from the warehouse using SQL 2) Extracting features 3) Applying iterative algorithms • Logistic Regression • K-Means Clustering
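For the third step, k-means is exactly the kind of iterative loop Shark keeps in memory across iterations. A minimal one-dimensional sketch (the data and initial centers are invented; the benchmark ran a far larger job over Spark):

```python
# Minimal 1-D k-means sketch: repeatedly assign points to the nearest center
# and recompute each center as the mean of its cluster.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 12.0]))   # -> [1.5, 9.5]
```

Because every iteration rereads the same cached dataset, in-memory storage pays off far more here than in a one-pass SQL scan.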

  22. Logistic Regression, per-iteration runtime (seconds). K-Means Clustering, per-iteration runtime (seconds)

  23. Conclusion • A data warehouse combining relational queries and complex analytics • Generalizes MapReduce using both traditional database techniques and novel Partial DAG Execution • Shark is faster than Hive and Hadoop
