
In Memory OLAP Engine


Presentation Transcript


  1. In Memory OLAP Engine Farha Shah Team 3, SJSU

  2. In memory OLAP engine in Java • Java Data Mining (JDM 2.0) is a standard Java API for developing data mining applications and tools. • icCube is a Java in-memory OLAP server. It can either be embedded in your application or accessed via its XMLA client library (or any other Java XMLA client library); the engine itself is quite lightweight. • ActivePivot is used by many banks to aggregate large amounts of data in memory.

  3. Hive Integration Example

  4. Hive

  5. Hadoop / Hive integration

  6. Hive's simplification RDBMS> select key, count(1) from kv1 where key > 100 group by key; vs. $ cat > /tmp/reducer.sh uniq -c | awk '{print $2"\t"$1}' $ cat > /tmp/map.sh awk -F '\001' '{if($1 > 100) print $1}' $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1 $ bin/hadoop dfs -cat /tmp/largekey/part*

  7. What is Hive? • A system for managing and querying structured data built on top of Hadoop • Map-Reduce for execution • HDFS for storage • Metadata on raw files • Key Building Principles: • SQL as a familiar data warehousing tool • Extensibility – Types, Functions, Formats, Scripts • Scalability and Performance

  8. Simplifying Hadoop RDBMS> select key, count(1) from kv1 where key > 100 group by key; vs. hive> select key, count(1) from kv1 where key > 100 group by key;

  9. Why HIVE? • Large installed base of SQL users • i.e. map-reduce is for ultra-geeks; it is much easier to write a SQL query • Analytics SQL queries translate really well to map-reduce • Files are an insufficient data management abstraction • Tables, Schemas, Partitions, Indices • Metadata allows optimization, discovery, browsing • Love the programmability of Hadoop • Hate that RDBMSs are closed • Why not work on data in any format? • Complex data types are the norm

  10. Hive Data Model

  11. Hive Components Source: Yahoo - Hadoop 2009 summit

  12. Hive • Inside the Hive: Written in Java, Hive is a specialized execution front end for Hadoop. Hive lets you write data queries in an SQL-like language -- the Hive Query Language (HQL) -- that are converted to map/reduce tasks, which are then executed by the Hadoop framework. You're using Hadoop, but it feels like you're talking SQL to an RDBMS. • Employing Hadoop's distributed file system (HDFS) as data storage, Hive inherits all of Hadoop's fault tolerance, scalability, and adeptness with huge data sets. When you run Hive, you are deposited into a shell, within which you can execute Hive Data Definition Language (DDL) and HQL commands. A future version of Hive will include JDBC and ODBC drivers, at which time you will be able to create fully executable "Hive applications" in much the same way that you can write a Java database application for your favorite RDBMS. (The current version of Hive -- 0.3.0 -- does have limited support for JDBC, but can only dispatch queries and fetch results.) • To install Hive, you simply install Hadoop and add a couple of download and configuration steps. Or if you'd rather just get straight to testing Hive without all the installation nonsense, download a VMware virtual machine image with Hadoop and Hive pre-installed. The virtual machine image is featured in an excellent Hive tutorial video available at the same Web site.

  13. Hive • HQL and SQL: Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batch-like applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is: • INSERT OVERWRITE TABLE TA SELECT * FROM (SELECT * FROM TA UNION ALL SELECT * FROM TB) t The new data is added by overwriting the old table with the concatenation of its original content and the data in TB. • In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you export data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source image in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.
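To make that last point concrete, here is a minimal sketch (the table, columns, and file path are hypothetical): the column layout of the delimited file is declared when the table is created, and LOAD DATA then copies the file into the table's directory unchanged.

  CREATE TABLE ratings (userid INT, movieid INT, rating INT, ts STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

  -- copies the local file byte for byte into the table's HDFS directory
  LOAD DATA LOCAL INPATH '/home/me/ratings.txt' INTO TABLE ratings;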

  14. Although Hive query language (HQL) commands are usually executed from within the Hive shell, you can launch the Hive Web Interface service and run HQL queries from within a browser. You can start multiple queries, and the Web interface will let you monitor the status of each.

  15. Hive DDL • DDL • Complex columns • Partitions • Buckets • Example • CREATE TABLE sales ( id INT, items ARRAY<STRUCT<id:INT, name:STRING>>, extra MAP<STRING, STRING>) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
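Assuming the sales table above, a query can reach into the nested columns and restrict the scan to a single partition (the map key and date literal are only illustrative):

  SELECT id, items[0].name, extra['channel']
  FROM sales
  WHERE ds = '2009-06-01';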

  16. Hive Query Language • SQL • Where • Group By • Equi-Join • Sub query in "From" clause • Example • SELECT r.*, s.* FROM r JOIN (SELECT key, count(1) AS count FROM s GROUP BY key) s ON r.key = s.key WHERE s.count > 100;

  17. Group By • 4 different plans based on: • Does data have skew? • partial aggregation • Map-side hash aggregation • In-memory hash table in mapper to do partial aggregations • 2-map-reduce aggregation • For distinct queries with skew and large cardinality
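A rough sketch of how these plans are steered in practice; pv_users is the example table used on later slides, and the two settings are the standard Hive switches for map-side aggregation and for the skew-handling two-job plan:

  SET hive.map.aggr=true;            -- map-side hash (partial) aggregation
  SET hive.groupby.skewindata=true;  -- 2-map-reduce plan for skewed keys
  SELECT pageid, count(DISTINCT userid)
  FROM pv_users
  GROUP BY pageid;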

  18. Join • Normal map-reduce Join • Mapper sends all rows with the same key to a single reducer • Reducer does the join • Map-side Join • Mapper loads the whole small table and a portion of big table • Mapper does the join • Much faster than map-reduce join

  19. Sampling • Efficient sampling • Table can be bucketed • Each bucket is a file • Sampling can choose some buckets • Example • SELECT product_id, sum(price) FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) GROUP BY product_id

  20. Multi-table Group-By/Insert FROM users INSERT OVERWRITE TABLE pv_gender_sum SELECT gender, count(DISTINCT userid) GROUP BY gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir' SELECT age, count(DISTINCT userid) GROUP BY age INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir' SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;

  21. Join Optimizations • Map Joins • User-specified small tables stored in hash tables on the mapper, backed by JDBM • No reducer needed INSERT OVERWRITE TABLE pv_users SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid); • Future: exploit table/column statistics for deciding strategy (Source: Facebook)

  22. Hive QL – Map Join [diagram: the small user table is loaded into an in-memory hash table on each mapper and joined with page_view to produce pv_users]

  23. Hive QL – Group By SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

  24. Hive QL – Group By in Map Reduce [diagram: pv_users → Map → Shuffle/Sort → Reduce]

  25. Group by Optimizations • Map side partial aggregations • Hash-based aggregates • Serialized key/values in hash tables • 90% speed improvement on Query • SELECT count(1) FROM t; • Load balancing for data skew • Optimizations being Worked On: • Exploit pre-sorted data for distinct counts • Exploit table/column statistics for deciding strategy

  26. Columnar Storage • CREATE TABLE columnTable (key STRING, value STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE; • Saved 25% of space compared with SequenceFile • Based on one of the largest tables (30 columns) inside Facebook • Both are compressed with GzipCodec • Speed improvements in progress • Need to propagate column-selection information to FileFormat • *Contribution from Yongqiang He (outside Facebook)

  27. Speed Improvements over Time • Query A: SELECT count(1) FROM t; • Query B: SELECT concat(concat(concat(a,b),c),d) FROM t; • Query C: SELECT * FROM t; • Time measured is map-side time only (to avoid unstable shuffling time at the reducer side). It includes time for decompression and compression (both using GzipCodec). • * No performance benchmarks for Map-side Join yet.

  28. Overcoming Java Overhead • Reuse objects • Use Writable instead of Java Primitives • Reuse objects across all rows • *40% speed improvement on Query C • Lazy deserialization • Only deserialize the column when asked • Very helpful for complex types (map/list/struct) • *108% speed improvement on Query A

  29. Generic UDF and UDAF • Let UDF and UDAF accept complex-type parameters • Integrate UDF and UDAF with Writables
  private final IntWritable intWritable = new IntWritable();
  public IntWritable evaluate(IntWritable a, IntWritable b) {
    intWritable.set(a.get() + b.get());
    return intWritable;
  }
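Once a UDF like this is packaged in a jar, it can be registered and used from HQL; the jar path, class name and function name below are placeholders:

  ADD JAR /tmp/my-udfs.jar;
  CREATE TEMPORARY FUNCTION int_add AS 'com.example.hive.udf.IntAdd';
  SELECT int_add(key, 1) FROM t;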

  30. HQL Optimizations • Predicate Pushdown • Merging n-way join • Column Pruning
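A sketch of the query shape these optimizations target (tables a, b and c are hypothetical and share a join key):

  SELECT a.key, b.v1, c.v2          -- column pruning: only these columns are scanned
  FROM a JOIN b ON (a.key = b.key)
         JOIN c ON (a.key = c.key)  -- joins on the same key can be merged into one job
  WHERE a.ds = '2009-06-01';        -- predicate pushed down to the scan of a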

  31. (Simplified) Map Reduce Review [diagram: on Machine 1 and Machine 2, a Local Map turns <k, v> input into <nk, nv> pairs, a Global Shuffle routes each key to one machine, a Local Sort groups the pairs, and a Local Reduce emits aggregated <nk, count> results]

  32. Hive QL – Join [diagram: page_view × user = pv_users] • SQL: INSERT OVERWRITE TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

  33. Hive QL – Join in Map Reduce [diagram: page_view and user → Map → Shuffle/Sort on the join key → Reduce → pv_users]

  34. Hive QL – Group By [tables: pv_users → pageid_age_sum] • SQL: INSERT OVERWRITE TABLE pageid_age_sum SELECT pageid, age, count(1) FROM pv_users GROUP BY pageid, age;

  35. Hive QL – Group By in Map Reduce [diagram: pv_users → Map → Shuffle/Sort → Reduce → pageid_age_sum]

  36. Hive QL – Group By with Distinct [tables: page_view → result] • SQL: SELECT pageid, COUNT(DISTINCT userid) FROM page_view GROUP BY pageid;

  37. Hive Extensibility Features

  38. Introduction Hive comprises three main components: • Serializers/Deserializers (trunk/serde) - This component has the framework libraries that allow users to develop serializers and deserializers for their own data formats. This component also contains some built-in serialization/deserialization families. • MetaStore (trunk/metastore) - This component implements the metadata server, which is used to hold all the information about the tables and partitions that are in the warehouse. • Query Processor (trunk/ql) - This component implements the processing framework for converting SQL to a graph of map/reduce jobs and also the execution-time framework to run those jobs in the order of their dependencies. Apart from these major components, Hive also contains a number of other components. These are as follows: • Command Line Interface (trunk/cli) - This component has all the Java code used by the Hive command line interface. • Hive Server (trunk/service) - This component implements all the APIs that can be used by other clients (such as JDBC drivers) to talk to Hive. • Common (trunk/common) - This component contains common infrastructure needed by the rest of the code. Currently, this contains all the Java sources for managing and passing Hive configurations (HiveConf) to all the other code components. • Ant Utilities (trunk/ant) - This component contains the implementation of some Ant tasks that are used by the build infrastructure. • Scripts (trunk/bin) - This component contains all the scripts provided in the distribution, including the scripts to run the Hive CLI (bin/hive). • The following top-level directories contain helper libraries, packaged configuration files, etc.: • trunk/conf - This directory contains the packaged hive-default.xml and hive-site.xml. • trunk/data - This directory contains some data sets and configurations used in the Hive tests. • trunk/ivy - This directory contains the Ivy files used by the build infrastructure to manage dependencies on different Hadoop versions. • trunk/lib - This directory contains the runtime libraries needed by Hive. • trunk/testlibs - This directory contains the junit.jar used by the junit target in the build infrastructure. • trunk/testutils (Deprecated)

  39. Dealing with Structured Data • Type system • Primitive types • Recursively build up using Composition/Maps/Lists • ObjectInspector interface for user-defined types • To recursively list schema • To recursively access fields within a row object • Generic (De)Serialization Interface (SerDe) • Serialization families implement interface • Thrift DDL based SerDe • Delimited text based SerDe • You can write your own SerDe (XML, JSON …)
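For the delimited-text SerDe family, nested types are described entirely in the DDL; a minimal sketch (column names and delimiters chosen arbitrarily):

  CREATE TABLE contacts (
    name   STRING,
    phones ARRAY<STRING>,
    props  MAP<STRING, STRING>)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
  STORED AS TEXTFILE;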

  40. SerDe • What is SerDe • SerDe is a short name for Serializer and Deserializer. • Hive uses SerDe (and FileFormat) to read from and write to tables. • Read path: HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object • Write path: Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files • Note that the "key" part is ignored when reading, and is always a constant when writing; the row object is only stored in the "value". • One principle of Hive is that Hive does not own the HDFS file format: users should be able to read the HDFS files in Hive tables directly with other tools, or use other tools to write HDFS files that Hive can read through "CREATE EXTERNAL TABLE", or that can be loaded into Hive through "LOAD DATA INPATH", which just moves the file into the Hive table directory. • Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.
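A sketch of the two paths just mentioned (table name and paths are hypothetical): an external table leaves the files where they are, while LOAD DATA INPATH simply moves an HDFS file into the table's directory.

  -- read files that some other tool already wrote to HDFS
  CREATE EXTERNAL TABLE weblog_raw (line STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/weblogs/';

  -- or move a file that is already in HDFS into the table's directory
  LOAD DATA INPATH '/staging/20090603.log' INTO TABLE weblog_raw;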

  41. SerDe (Cont) • Hive currently uses these FileFormat classes to read and write HDFS files: • TextInputFormat/HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format. • SequenceFileInputFormat/SequenceFileOutputFormat: these two classes read/write data in Hadoop SequenceFile format. • Hive currently uses these SerDe classes to serialize and deserialize data: • MetadataTypedColumnsetSerDe: this SerDe is used to read/write delimited records like CSV, tab-separated, or Control-A-separated records (quoting is not supported yet). • ThriftSerDe: this SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first. • DynamicSerDe: this SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).

  42. How to write your own SerDe: • In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it. • For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details. • If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to the SerDe through the "thrift DDL" format, and it's non-trivial to write a "thrift DDL" parser. • Some important points about SerDe: • The SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that. • Column types can be arbitrarily nested arrays, maps and structures. • The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
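For the common case described here, reading a custom text format, it is often enough to point a table at the contrib RegexSerDe instead of writing Java; a hedged sketch (the regex and columns are illustrative, and this SerDe treats every column as STRING):

  CREATE TABLE access_log (host STRING, verb STRING, path STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) ([^ ]*) (.*)")
  STORED AS TEXTFILE;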

  43. ObjectInspector • Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. • ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including: • An instance of a Java class (Thrift or native Java) • A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map) • A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field) • A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

  44. MetaStore • MetaStore contains metadata regarding tables, partitions and databases. This is used by the Query Processor during plan generation. • Metastore Server - This is the Thrift server (interface defined in metastore/if/hive_metastore.if) that services metadata requests from clients. It delegates most of the requests to the underlying metadata store and to the Hadoop file system, which contains the data. • Object Store - The ObjectStore class handles access to the actual metadata stored in the SQL store. The current implementation uses the JPOX ORM solution, which is based on the JDO specification. It can be used with any database that is supported by JPOX. New metastores (file based or XML based) can be added by implementing the interface MetaStore. FileStore is a partial implementation of an older version of the metastore which may be deprecated soon. • Metastore Client - There are Python, Java and PHP Thrift clients in metastore/src. The generated Java client is extended with HiveMetaStoreClient, which is used by the Query Processor (ql/metadata). This is the main interface to all other Hive components.

  45. MetaStore (Contd.) • Stores Table/Partition properties: • Table schema and SerDe library • Table Location on HDFS • Logical Partitioning keys and types • Partition level metadata • Other information • Thrift API • Current clients in Php (Web Interface), Python interface to Hive, Java (Query Engine and CLI) • Metadata stored in any SQL backend • Future • Statistics • Schema Evolution
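The same metadata can be inspected from the Hive CLI, which is a quick way to see what the MetaStore holds for a table (sales is the table from the earlier DDL slide):

  SHOW TABLES;
  SHOW PARTITIONS sales;     -- partition values recorded in the MetaStore
  DESCRIBE EXTENDED sales;   -- columns plus location, SerDe and other table properties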

  46. Query Processor The following are the main components of the Hive Query Processor: • Parse and SemanticAnalysis (ql/parse) - This component contains the code for parsing SQL, converting it into Abstract Syntax Trees, converting the Abstract Syntax Trees into Operator Plans and finally converting the operator plans into a directed graph of tasks which are executed by Driver.java. • Optimizer (ql/optimizer) - This component contains some simple rule-based optimizations, like pruning non-referenced columns from table scans (column pruning), that the Hive Query Processor does while converting SQL to a series of map/reduce tasks. • Plan Components (ql/plan) - This component contains the classes (called descriptors) that are used by the compiler (Parser, SemanticAnalysis and Optimizer) to pass information to the operator trees that is used by the execution code. • MetaData Layer (ql/metadata) - This component is used by the query processor to interface with the MetaStore in order to retrieve information about tables, partitions and the columns of the table. This information is used by the compiler to compile SQL to a series of map/reduce tasks. • Map/Reduce Execution Engine (ql/exec) - This component contains all the query operators and the framework that is used to invoke those operators from within the map/reduce tasks. • Hadoop Record Readers, Input and Output Formatters for Hive (ql/io) - This component contains the record readers and the input and output formatters that Hive registers with a Hadoop Job. • Sessions (ql/session) - A rudimentary session implementation for Hive. • Type interfaces (ql/typeinfo) - This component provides all the type information for table columns that is retrieved from the MetaStore and the SerDes. • Hive Function Framework (ql/udf) - Framework and implementation of Hive operators, Functions and Aggregate Functions. This component also contains the interfaces that a user can implement to create user-defined functions. • Tools (ql/tools) - Some simple tools provided by the query processing framework. Currently, this component contains the implementation of the lineage tool that can parse the query and show the source and destination tables of the query.
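EXPLAIN shows what this pipeline produces for a query: the abstract syntax tree, the stage dependency graph, and the operator plan of each map/reduce stage (the query shape is borrowed from the earlier group-by slides):

  EXPLAIN
  SELECT pageid, count(DISTINCT userid)
  FROM page_view
  GROUP BY pageid;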

  47. Compressed Data Storage in Hive • Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, both in terms of disk usage and query performance. • You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example: • CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; • The table 'raw' is stored as a TextFile, which is the default storage. However, in this case Hadoop will not be able to split your file into chunks/blocks and run multiple maps in parallel. This can cause under-utilization of your cluster's 'mapping' power. • The recommended practice is to insert the data into another table, which is stored as a SequenceFile. A SequenceFile can be split by Hadoop and distributed across map jobs, whereas a GZIP file cannot be. For example: • CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'; CREATE TABLE raw_sequence (line STRING) STORED AS SEQUENCEFILE; LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; SET hive.exec.compress.output=true; SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below) INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw; • The value for io.seqfile.compression.type determines how the compression is performed. RECORD compresses each value individually, while BLOCK buffers up to 1MB (by default) before doing compression.

  48. Details on Each … • File Format • SerDe • Map/Reduce Scripts (Transform) • UDF • UDAF

  49. File Formats • TextFile: • Easy for other applications to write/read • Gzip text files are not splittable • SequenceFile: • Only Hadoop can read it • Supports splittable compression • RCFile: Block-based columnar storage • Uses the SequenceFile block format • Columnar storage inside a block • 25% smaller compressed size • On-par or better query performance depending on the query

  50. SerDe • Serialization/Deserialization • Row Format • CSV (LazySimpleSerDe) • Thrift (ThriftSerDe) • Regex (RegexSerDe) • Hive Binary Format (LazyBinarySerDe) • LazySimpleSerDe and LazyBinarySerDe • Deserialize the field when needed • Reuse objects across different rows • Text and Binary format
