SQL on Hadoop

Presentation Transcript


  1. SQL on Hadoop CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. All of These But HAWQ specifically, 'cause that's what I know

  3. Problem! • “MapReduce is great, but all of my data dudes don’t know Java” • Well, Pig and Hive exist... They are kind of SQL • “But Pig and Hive are slow and they aren’t really SQL... How can I efficiently use all of my SQL scripts that I have today?” • Well, that's why all these companies are building SQL-on-Hadoop engines... Like HAWQ.

  4. SQL Engines for Hadoop • Massive Parallel Processing (MPP) frameworks to run SQL queries against data stored in HDFS • Not MapReduce, but still brings the code to the data • SQL for big data sets, but not stupid huge ones • Stupid huge ones should still use MapReduce

  5. Current SQL Landscape • Apache Drill (MapR) • Cloudera Impala • Facebook Presto • Hive Stinger (Hortonworks) • Pivotal HAWQ* • Shark – Hive on Spark (Berkeley) * Not open source

  6. Why? • Ability to execute complex multi-staged queries in-memory against structured data • Available SQL-based machine learning libraries can be ported to work on the system • A well-known and common query language to express data crunching algorithms • Not all queries need to run for hours on end and be super fault tolerant

  7. Okay, tell me more... • Many visualization and ETL tools speak SQL, and need to do some hacked version for HiveQL • Can now connect these tools and legacy applications to “big data” stored in HDFS • You can start leveraging Hadoop with what you know and begin to explore other Hadoop ecosystem projects • Your Excuse Here

  8. SQL on Hadoop • Built for analytics! • OLAP vs OLTP • Large I/O queries against append-only tables • Write-once, read-many much like MapReduce • Intent is to retrieve results and run deep analytics in ~20 minutes • Anything longer, you may want to consider using MapReduce

  9. Architectures • Architectures are all very similar: a Master node hosts the Query Planner and dispatches work to Query Executors that read data from HDFS [diagram: Master (Query Planner) connected to multiple Query Executors over HDFS]

  10. Playing Nicely • In the past it was just a DataNode and a TaskTracker... Now we have • DataNode • NodeManager • HBase RegionServer • SQL-on-Hadoop Query Executor • Storm Supervisor • ??? • All of which need memory and CPU time

  11. HAWQ Overview • Greenplum database re-platformed on Hadoop/HDFS • HAWQ provides all major features found in Greenplum database • SQL Completeness: 2003 Extensions • Cost-Based Query Optimizer • UDFs • Row or Column-Oriented Table Storage • Parallel Loading and Unloading • Distributions • Multi-level Partitioning • High-speed data redistribution • Views • External Tables • Compression • Resource Management • Security • Authentication

  12. Basic HAWQ Architecture

  13. HAWQ Master • Located on a separate node from the NameNode • Does not contain any user data • Contains Global System Catalog • Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, presents final client results

  14. HAWQ Transactions • No global transaction management • No updates or deletes • Transactions at the HAWQ master level • Single-phase commit

  15. HAWQ Segments • A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode • Multiple segments per Segment Host • Segment is the basic unit of parallelism • Multiple segments work together to form a parallel query processing system • Operations execute in parallel across all segments

  16. Segments Access Data Stored in HDFS • Segments are stateless • Segments communicate with NameNode to obtain block lists where data is located • Segments access data stored in HDFS

  17. HAWQ Parser • Receives SQL queries from clients (e.g. over JDBC) • Enforces syntax and semantics • Converts SQL query into a parse tree data structure describing details of the query

  18. Parallel Query Optimizer • Cost-based optimization looks for the most efficient plan • Physical plan contains scans, joins, sorts, aggregations, etc. • Directly inserts ‘motion’ nodes for inter-segment communication

  19. Parallel Query Optimizer Continued • Inserts motion nodes for efficient non-local join processing (Assume table A is distributed across all segments – i.e. each has AK) • Broadcast Motion (N:N) • Every segment sends AK to all other segments • Redistribute Motion (N:N) • Every segment rehashes AK (by join column) and redistributes each row • Gather Motion (N:1) • Every segment sends its AK to a single node (usually the master)
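A quick way to see which motion nodes the optimizer chose is EXPLAIN. A minimal sketch, assuming a customer table distributed on c_custkey and an orders table distributed on o_orderkey (hypothetical tables, and the motions named in the comments are the ones you would typically expect, not guaranteed output):

  -- customer is distributed on c_custkey, orders on o_orderkey, so orders
  -- rows are not co-located with matching customer rows. The plan would
  -- typically include a Redistribute Motion that rehashes orders on
  -- o_custkey before the join, plus a final Gather Motion to the master.
  EXPLAIN
  SELECT c.c_custkey, count(*) AS order_cnt
  FROM customer c
  JOIN orders o ON c.c_custkey = o.o_custkey
  GROUP BY c.c_custkey;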

  20. Parallel Query Optimization Example
  SELECT c_custkey, c_name,
         sum(l_extendedprice * (1 - l_discount)) AS revenue,
         c_acctbal, n_name, c_address, c_phone, c_comment
  FROM customer, orders, lineitem, nation
  WHERE c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= date '1994-08-01'
    AND o_orderdate < date '1994-08-01' + interval '3 month'
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
  GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
  ORDER BY revenue DESC

  21. HAWQ Query Optimizer

  22. HAWQ Dispatcher and Query Executor

  23. HAWQ Dynamic Pipelining • Parallel data flow using a UDP-based interconnect • No materialization of intermediate results, unlike MapReduce

  24. User Defined Function Support • C functions • User Defined Operators • PL/PGSQL • PGCrypto • Future: User Defined Types, Nested Functions, Oracle functions, PL/R, PL/Python, User Defined Aggregates
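Because HAWQ inherits PostgreSQL's procedural-language support from Greenplum, a UDF can be written in PL/pgSQL and called like any built-in. A minimal sketch (the function name and logic are illustrative, not from the slides):

  -- Hypothetical example: compute the discounted price of a line item.
  CREATE OR REPLACE FUNCTION discounted_price(price numeric, discount numeric)
  RETURNS numeric AS $$
  BEGIN
      -- Treat a NULL discount as no discount so callers always get a value.
      RETURN price * (1 - COALESCE(discount, 0));
  END;
  $$ LANGUAGE plpgsql IMMUTABLE;

  -- Usage:
  -- SELECT discounted_price(l_extendedprice, l_discount) FROM lineitem;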

  25. Machine Learning Support • Open-Source MADlib • Classification • Regression • Clustering • Topic Modeling • Association Rule Mining • Descriptive Statistics • Validation • PostGIS
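MADlib routines are invoked directly from SQL. A hedged sketch of linear-regression training with MADlib's linregr_train (the houses table and its columns are hypothetical, and exact argument lists vary by MADlib version):

  -- Train a linear regression model on a hypothetical houses table.
  -- Arguments: source table, output model table, dependent column,
  -- independent variables (as an array expression).
  SELECT madlib.linregr_train(
      'houses',                      -- source table
      'houses_linregr',              -- output table holding the model
      'price',                       -- dependent variable
      'ARRAY[1, sqft, bedrooms]'     -- independent variables (1 = intercept)
  );

  -- Inspect the fitted coefficients and goodness of fit.
  SELECT coef, r2 FROM houses_linregr;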

  26. HAWQ Data Storage and I/O Overview • DataNodes are responsible for serving read and write requests from HAWQ segments • Data stored external to HAWQ can be read using Pivotal Xtension Framework (PXF) external tables • Data stored in HAWQ can be written to HDFS for external consumption using PXF Writable HDFS Tables • MapReduce can access data stored in HAWQ using provided Input/Output formats

  27. HAWQ Storage Formats • Append-Only (AO): read-optimized, supports distributions, partitioning, row or column store, compression (quicklz, zlib, RLE), MR Input/Output format • Parquet: open-source format, compression (snappy, gzip)
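Storage options are chosen per table (or per partition) with a WITH clause. A minimal sketch using Greenplum/HAWQ-style options (table and column names are hypothetical, and option spellings can differ slightly across HAWQ releases):

  -- Append-only, column-oriented table with quicklz compression.
  CREATE TABLE sales_ao (
      sale_id   bigint,
      sale_date date,
      amount    numeric
  )
  WITH (appendonly=true, orientation=column, compresstype=quicklz)
  DISTRIBUTED BY (sale_id);

  -- Same schema stored as Parquet with snappy compression.
  CREATE TABLE sales_parquet (
      sale_id   bigint,
      sale_date date,
      amount    numeric
  )
  WITH (appendonly=true, orientation=parquet, compresstype=snappy)
  DISTRIBUTED BY (sale_id);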

  28. HAWQ Data Locality • Think back... how does a client write blocks?

  29. Accessed through libhdfs3 • Refactored libhdfs resulting in libhdfs3 • C-based library interacting with HDFS • Leverages protocol buffers to achieve greater performance • libhdfs3 is used to access blocks from HAWQ • AKA short-circuit reads • libhdfs3 gives huge performance gains over JNI-based libhdfs

  30. Sharded Data and Segment Processors • Data is physically sharded in HDFS using directory structure • Each segment gets its own directory, and blocks are written locally so reads hit local disk • Affinity between HAWQ segments and shards provides significant I/O gains

  31. Data Distributions • Every table has a distribution method • DISTRIBUTED BY (column) • Uses a hash distribution • DISTRIBUTED RANDOMLY • Uses a random distribution which is not guaranteed to provide a perfectly even distribution
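In DDL the distribution method is just a clause on CREATE TABLE. A minimal sketch (hypothetical tables):

  -- Hash distribution: rows with the same cust_id land on the same segment.
  CREATE TABLE customers (
      cust_id bigint,
      name    text
  )
  DISTRIBUTED BY (cust_id);

  -- Random distribution: roughly even, but not guaranteed perfectly even.
  CREATE TABLE clickstream_raw (
      line text
  )
  DISTRIBUTED RANDOMLY;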

  32. Partitioning • Reduces the amount of data to be scanned by reading only the relevant data needed to satisfy a query • Supports range partitioning and list partitioning • There can be a large number of partitions depending on the partition granularity • Every partition is a file in HDFS
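Partitions are declared in the CREATE TABLE statement. A sketch of a list-partitioned table (names and values are hypothetical):

  -- Each partition becomes its own file(s) in HDFS, so a query that
  -- filters on region only scans the matching partition.
  CREATE TABLE orders_by_region (
      order_id bigint,
      region   text,
      amount   numeric
  )
  DISTRIBUTED BY (order_id)
  PARTITION BY LIST (region)
  (
      PARTITION americas VALUES ('us', 'ca', 'mx'),
      PARTITION europe   VALUES ('uk', 'de', 'fr'),
      DEFAULT PARTITION other_regions
  );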

  33. Multi-Level Partitioning • Use Hash Distribution to evenly spread data across all nodes • Use Range Partition within a node to minimize scan work [diagram: segments A–D on three hosts, each holding monthly range partitions Jan 2007 – Dec 2007]
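Putting the two together, a hedged sketch of the pattern in the diagram: hash-distribute on a key, then range-partition by month (table and column names are hypothetical):

  -- Hash distribution spreads rows evenly across segments;
  -- monthly range partitions keep each scan to the relevant files.
  CREATE TABLE events_2007 (
      event_id   bigint,
      event_date date,
      payload    text
  )
  DISTRIBUTED BY (event_id)
  PARTITION BY RANGE (event_date)
  (
      START (date '2007-01-01') INCLUSIVE
      END   (date '2008-01-01') EXCLUSIVE
      EVERY (INTERVAL '1 month')
  );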

  34. HAWQ Parquet • Store multiple columns in a single block • Reduces number of files for each table • Added support for HAWQ complex data types • Only 4-5 formats supported in Parquet • No UDT or arrays (for now) • Added support for append operations with Parquet • Set at table and partition level • One partition uses Parquet, another uses AO, etc.

  35. HAWQ Fault Tolerance • Fault tolerance through HDFS replication • Replication factor decided when creating a file space in HDFS • When a segment fails, the shard is accessible from another node • via the NameNode and then the DataNode to where the shard was replicated

  36. HAWQ Master Standby • Master standby on a separate host from the HAWQ Master • Warm standby kept up to date by transactional log replication • Replicated logs used to reconstruct state of HAWQ Master • System catalogs synchronized

  37. Pivotal Xtension Framework • External table interface inside HAWQ to read data stored in Hadoop ecosystem • External tables can be used to • Load data into HAWQ from Hadoop • Query Hadoop data without materializing it into HAWQ • Enables loading and querying of data stored in • HDFS • HBase • Hive
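The external table interface looks like an ordinary CREATE EXTERNAL TABLE with a pxf:// location. A hedged sketch for reading delimited text from HDFS (the host, port, path, and profile name follow PXF conventions of that era but are illustrative, not taken from the slides):

  -- External table over a CSV directory in HDFS, read through PXF.
  CREATE EXTERNAL TABLE ext_sales (
      sale_id   bigint,
      sale_date date,
      amount    numeric
  )
  LOCATION ('pxf://namenode:50070/data/sales/*.csv?Profile=HdfsTextSimple')
  FORMAT 'TEXT' (DELIMITER ',');

  -- Query the Hadoop data in place, without materializing it into HAWQ.
  SELECT sale_date, sum(amount) FROM ext_sales GROUP BY sale_date;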

  38. PXF Features • Applies data locality optimizations to reduce resources and network traffic • Supports filtering through predicate push down in HBase • <, >, <=, >=, =, != between a column and a constant • Can AND between these (but not OR) • Supports Hive table partitioning • Supports ANALYZE for gathering HDFS file statistics and having it available for the query planner at run time • Extensible framework via Java to enable custom data sources and formats

  39. PXF Supported Data Formats • Text/CSV • SequenceFiles • Avro • JSON • Hive • HBase • Accumulo

  40. PXF Architecture Components • Fragmenter • Accessor • Resolver • Analyzer • Often able to leverage custom Hadoop I/O Formats

  41. PXF Loading into HAWQ • To load data into HAWQ use a variation of • INSERT INTO <hawq-target-table> SELECT * FROM <pxf-external-table>; • Data can be transformed in-flight before loading • Data from PXF can also be joined in-flight with native tables • Number of segments responsible for connecting and concurrent reading of data can be tuned
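A hedged sketch of that load pattern, reusing the hypothetical ext_sales external table from above and a hypothetical native target table sales_native:

  -- Load from PXF into a native HAWQ table, transforming rows in flight.
  INSERT INTO sales_native (sale_id, sale_month, amount_usd)
  SELECT sale_id,
         date_trunc('month', sale_date),   -- in-flight transformation
         amount
  FROM   ext_sales                          -- PXF external table
  WHERE  amount > 0;                        -- filter applied during the load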

  42. References • Apache Drill • Cloudera Impala • Facebook Presto • Hive Stinger • Pivotal HAWQ • Shark
