
Shark


Presentation Transcript


  1. Shark: Hive on Spark
Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

  2. Spark Review
• Resilient distributed datasets (RDDs):
  • Immutable, distributed collections of objects
  • Can be cached in memory for fast reuse
• Operations on RDDs (see the sketch below):
  • Transformations: define a new RDD (map, join, …)
  • Actions: return or output a result (count, save, …)
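To make the transformation/action split concrete, here is a minimal Scala sketch; the object name and the local Spark context are illustrative assumptions, not part of the original slides:

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Transformations are lazy: each one only defines a new RDD.
    val nums    = sc.parallelize(1 to 1000)      // source RDD
    val squares = nums.map(x => x * x)           // transformation: no work happens yet
    val evens   = squares.filter(_ % 2 == 0)     // another transformation

    evens.cache()                                // keep in memory for fast reuse

    // Actions force evaluation and return or output a result.
    println(evens.count())                       // first action: computes and caches
    println(evens.take(5).mkString(", "))        // second action: served from the cache

    sc.stop()
  }
}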

  3. Generality of RDDs
• Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms
  • These naturally apply the same operation to many items
• Capture many current programming models
  • Data flow models: MapReduce, Dryad, SQL, …
  • Specialized models for iterative apps: BSP (Pregel), iterative MapReduce, incremental (CBP) — see the iterative sketch below
• Support new apps that these models don’t
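As one concrete instance of an iterative app on plain RDDs, here is a hedged sketch of a toy PageRank loop built only from map/join/reduce transformations; the graph, iteration count, and damping constants are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

    // Toy link graph: page -> outgoing links. Cached because it is reused every iteration.
    val links = sc.parallelize(Seq(
      "a" -> Seq("b", "c"),
      "b" -> Seq("c"),
      "c" -> Seq("a")
    )).cache()

    var ranks = links.mapValues(_ => 1.0)

    // Each iteration is ordinary map/join/reduce on RDDs;
    // no special iterative framework is needed.
    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (outLinks, rank) => outLinks.map(dest => (dest, rank / outLinks.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
    }

    ranks.collect().foreach(println)
    sc.stop()
  }
}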

  4. Spark Review: Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• Ex: messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))
• Lineage graph: HDFS File → filter(func = _.startsWith(...)) → FilteredRDD → map(func = _.split(...)) → MappedRDD

  5. Background: Apache Hive
• Data warehouse solution developed at Facebook
• SQL-like language called HiveQL to query structured data stored in HDFS
• Queries compile to Hadoop MapReduce jobs

  6. Hive Architecture

  7. Hive Principles
• SQL provides a familiar interface for users
• Extensible types, functions, and storage formats
• Horizontally scalable with high performance on large datasets

  8. Hive Applications
• Reporting
• Ad hoc analysis
• ETL for machine learning…

  9. Hive Downsides
• Not interactive
  • Hadoop startup latency is ~20 seconds, even for small jobs
• No query locality
  • If queries operate on the same subset of data, they still run from scratch
• Reading data from disk is often the bottleneck
• Requires a separate machine learning dataflow

  10. Shark Motivation
• The working set of data can often fit in memory to be reused between queries
• Provide low latency on small queries
• Integrate distributed UDFs into SQL

  11. Introducing Shark
• Shark = Spark + Hive
• Run HiveQL queries through Spark with Hive UDF, UDAF, and SerDe support
• Utilize Spark’s in-memory RDD caching and flexible language capabilities

  12. Shark in the AMP Stack
• Applications: Bagel (Pregel on Spark), Streaming Spark, Shark, …
• Execution frameworks: Hadoop, MPI, Spark, Debug Tools, …
• Resource management: Mesos
• Infrastructure: Private Cluster, Amazon EC2

  13. Shark
• ~2500 lines of Scala/Java code
• Implements relational operators using RDD transformations (see the sketch below)
• Scalable, fault-tolerant, fast
• Compatible with Hive
  • Run HiveQL queries on existing HDFS data using Hive metadata, without modifications
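The following is a minimal sketch of the idea, not Shark's actual operator code: each relational operator consumes and produces an RDD of rows, so a query plan becomes a chain of RDD transformations. The Row type and the project/select helpers are hypothetical names:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RelationalOnRdds {
  type Row = Array[String]

  // SELECT <columns>  ==>  a map over the input RDD
  def project(input: RDD[Row], cols: Seq[Int]): RDD[Row] =
    input.map(row => cols.map(i => row(i)).toArray)

  // WHERE <predicate>  ==>  a filter over the input RDD
  def select(input: RDD[Row], pred: Row => Boolean): RDD[Row] =
    input.filter(pred)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rel-ops").setMaster("local[*]"))
    val log: RDD[Row] = sc.parallelize(Seq(
      Array("ERROR", "disk failure"),
      Array("INFO", "startup")
    ))
    // Equivalent of: SELECT message FROM log WHERE header = 'ERROR'
    val messages = project(select(log, row => row(0) == "ERROR"), Seq(1))
    messages.collect().foreach(r => println(r.mkString("\t")))
    sc.stop()
  }
}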

  14. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

Spark:
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(1))
messages.cache()
messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count

Shark:
CREATE TABLE log (header string, message string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION "hdfs://...";
CREATE TABLE errors_cached AS SELECT message FROM log WHERE header == "ERROR";
SELECT count(*) FROM errors_cached WHERE message LIKE "%foo%";
SELECT count(*) FROM errors_cached WHERE message LIKE "%bar%";

  15. Shark Architecture
• Reuse as much Hive code as possible
• Convert the logical query plan generated by Hive into a Spark execution graph
• Fully support Hive UDFs, UDAFs, storage formats, and SerDes to ensure compatibility (see the sketch below)
• Rely on Spark’s fast execution, fault tolerance, and in-memory RDDs
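A hedged sketch of why Hive UDF compatibility is natural: Hive's classic simple-UDF interface (org.apache.hadoop.hive.ql.exec.UDF) is row-at-a-time, so it drops straight into an RDD transformation. UpperUDF is a made-up example, and the snippet assumes hive-exec is on the classpath:

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.spark.{SparkConf, SparkContext}

// A Hive-style simple UDF: one evaluate() call per input row. (Made-up example.)
class UpperUDF extends UDF {
  def evaluate(s: String): String = if (s == null) null else s.toUpperCase
}

object HiveUdfOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-udf").setMaster("local[*]"))
    val words = sc.parallelize(Seq("error", "info", "warn"))

    // Row-at-a-time UDFs map directly onto RDD transformations:
    // instantiate once per partition, then call evaluate() per element.
    val upper = words.mapPartitions { iter =>
      val udf = new UpperUDF
      iter.map(udf.evaluate)
    }

    upper.collect().foreach(println)
    sc.stop()
  }
}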

  16. Shark Architecture

  17. Preliminary Benchmarks
• Brown/Stonebraker benchmark, 70GB [1]
  • Also used on the Hive mailing list [2]
• 10 Amazon EC2 High-Memory nodes (30GB of RAM/node)
• Naively cache input tables
• Compare Shark to Hive 0.7

[1] http://database.cs.brown.edu/projects/mapreduce-vs-dbms/
[2] https://issues.apache.org/jira/browse/HIVE-3961

  18. Benchmark: Query 1
• 30GB input table
SELECT * FROM grep WHERE field LIKE '%XYZ%';

  19. Benchmark: Query 2
• 5GB input table
SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;

  20. Benchmark: Query 3
• 30GB input table
SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;

  21. Current Status
• Most of HiveQL is fully implemented in Shark
• User-selected caching via CTAS (CREATE TABLE AS SELECT)
• Adding optimizations such as map-side join (see the sketch below)
• Performing alpha testing of Shark on a Conviva cluster
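For intuition, here is a hedged sketch of a map-side (broadcast) join in plain Spark, not Shark's actual optimizer output: the small table is shipped to every node once and probed locally, so the large table is never shuffled. The table contents are made up:

import org.apache.spark.{SparkConf, SparkContext}

object MapSideJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-join").setMaster("local[*]"))

    // Small dimension table, broadcast once to every node.
    val small   = Map("us" -> "United States", "fr" -> "France")
    val smallBc = sc.broadcast(small)

    // Large fact table stays partitioned where it is.
    val large = sc.parallelize(Seq(("us", 12.5), ("fr", 3.2), ("de", 7.7)))

    // Each task probes the broadcast hash map locally (inner join: "de" is dropped).
    val joined = large.flatMap { case (code, revenue) =>
      smallBc.value.get(code).map(name => (name, revenue))
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}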

  22. Future Work
• Automatic caching based on query analysis
• Multi-query optimization
• Distributed UDFs using Shark + Spark
  • Allow users to implement sophisticated algorithms as UDFs in Spark
  • Shark operators and Spark UDFs take/emit RDDs (see the sketch below)
  • Query-processing UDFs are streamlined
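A hedged sketch of the take/emit-RDDs idea: a distributed UDF is just a Scala function from one RDD (a table) to another, so it composes with the relational operators around it. The bucketize function and its signature are hypothetical, not Shark's actual UDF API:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddUdfSketch {
  // Hypothetical distributed UDF: assigns each (id, value) row to one of k buckets.
  // It consumes a "table" (an RDD) and emits one, so it composes with other operators.
  def bucketize(table: RDD[(Long, Double)], k: Int): RDD[(Long, Int)] =
    table.mapValues(v => math.abs(v.hashCode) % k)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-udf").setMaster("local[*]"))
    val rows = sc.parallelize(Seq((1L, 0.3), (2L, 9.1), (3L, 4.2)))
    bucketize(rows, 3).collect().foreach(println)   // the output is again a table-like RDD
    sc.stop()
  }
}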
