
Hadoop Ecosystem and IBM BigInsights



  1. Hadoop Ecosystem and IBM BigInsights. Rafie Tarabay (eng_rafie@mans.edu.eg, mrafie@eg.ibm.com)

  2. When do you have Big Data? When you have at least one of the following three characteristics: • Variety: Manage and benefit from diverse data types and data structures • Velocity: Analyze streaming data and large volumes of persistent data • Volume: Scale from terabytes to zettabytes

  3. BI vs Big Data Analysis BI: Business users determine what questions to ask, then IT structures the data to answer those questions. Sample BI tasks: monthly sales reports, profitability analysis, customer surveys. Big Data approach: IT delivers a platform that enables creative discovery, then business users explore what questions could be asked. Sample Big Data tasks: brand sentiment, product strategy, maximum asset utilization.

  4. Data representation formats used for Big Data Common data representation formats used for big data include: Row- or record-based encodings: • Flat files / text files • CSV and delimited files • Avro / SequenceFile • JSON • Other formats: XML, YAML Column-based storage formats: • RC / ORC file • Parquet NoSQL databases

  5. What are the Parquet, RC/ORC, and Avro file formats? Parquet: a columnar storage format that allows compression schemes to be specified on a per-column level, offers better write performance by storing metadata at the end of the file, and provides the best results in benchmark performance tests. RC/ORC: columnar storage file formats developed to support Hive; they provide basic statistics such as min, max, sum, and count on columns. Avro: Avro data files are a compact, efficient binary format.
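  As a small illustration of the Avro binary format, the sketch below (Java, assuming the org.apache.avro library is available; the two-field Employee schema and the file name are invented for the example) writes one record to an Avro data file and reads it back:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical two-field Employee schema, defined inline as JSON
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1205);
        rec.put("name", "Rafie");

        // Write one record to a compact Avro data file
        File file = new File("employee.avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(rec);
        writer.close();

        // Read it back; the schema is embedded in the file itself
        DataFileReader<GenericRecord> reader =
            new DataFileReader<>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
          System.out.println(reader.next());
        }
        reader.close();
      }
    }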

  6. NoSQL Databases NoSQL is a new way of handling a variety of data. A NoSQL DB can handle millions of queries per second while a normal RDBMS can handle only thousands of queries per second, and both are subject to the CAP theorem. Types of NoSQL datastores: • Key-value stores: MemCacheD, REDIS, and Riak • Column stores: HBase and Cassandra • Document stores: MongoDB, CouchDB, Cloudant, and MarkLogic • Graph stores: Neo4j and Sesame

  7. CAP Theorem The CAP theorem states that in the presence of a network partition, one has to choose between consistency and availability. * Consistency means every read receives the most recent write or an error. * Availability means every request receives a (non-error) response, without a guarantee that it contains the most recent write. HBase and MongoDB ---> CP [give data consistency but not availability] Cassandra and CouchDB ---> AP [give data availability but not consistency] while traditional relational DBMSs are CA [support consistency and availability but do not tolerate network partitions]

  8. Timeline for Hadoop

  9. Hadoop

  10. Apache Hadoop Stack The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel.
  Hadoop HDFS [IBM has an alternative file system for Hadoop named GPFS]:
  • where Hadoop stores data
  • a file system that spans all the nodes in a Hadoop cluster
  • links together the file systems on many local nodes to make them into one large file system that spans all the data nodes of the cluster
  Hadoop MapReduce v1: an implementation for large-scale data processing. The MapReduce engine consists of:
  - JobTracker: receives client application jobs and sends orders to the TaskTrackers that are as near to the data as possible.
  - TaskTracker: runs on the cluster's nodes and receives orders from the JobTracker.
  YARN (the newer version of MapReduce): each cluster has a Resource Manager, and each data node runs a Node Manager. For each job, one slave node acts as the Application Master, monitoring resources, tasks, etc.
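  To make the MapReduce programming model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the WordCount class name and the input/output paths passed as arguments are placeholders for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Mapper: emit (word, 1) for every token in the input line
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, one);
          }
        }
      }
      // Reducer: sum the counts for each word
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }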

  11. Advantages and disadvantages of Hadoop • Hadoop is good for: • processing massive amounts of data through parallelism • handling a variety of data (structured, unstructured, semi-structured) • using inexpensive commodity hardware • Hadoop is not good for: • processing transactions (random access) • work that cannot be parallelized • fast access to data • processing lots of small files • intensive calculations with small amounts of data What hardware is not used for Hadoop? • RAID • Linux Logical Volume Manager (LVM) • Solid-state disks (SSD)

  12. HDFS

  13. Hadoop Distributed File System (HDFS) principles • Distributed, scalable, fault tolerant, high throughput • Data access through MapReduce • Files split into blocks (aka splits) • 3 replicas for each piece of data by default • Can create, delete, and copy, but cannot update • Designed for streaming reads, not random access • Data locality is an important concept: processing data on or near the physical storage to decrease transmission of data

  14. HDFS: architecture • Master / Slave architecture • NameNode • Manages the file system namespace and metadata • Regulates access to files by clients • DataNode • Many DataNodes per cluster • Manages storage attached to the nodes • Periodically reports status to NameNode • Data is stored across multiple nodes • Nodes and components will fail, so for reliability data is replicated across multiple nodes [Diagram: the NameNode tracks the blocks a, b, c, d of File1, each replicated across several DataNodes]

  15. Hadoop HDFS: read and write files from HDFS
  Create a sample text file on Linux: # echo "My First Hadoop Lesson" > test.txt
  List local files to confirm the file was created: # ls -lt
  List the files in the HDFS root directory: # hadoop fs -ls /
  Create a new directory in HDFS named test: # hadoop fs -mkdir test
  Load the test.txt file into Hadoop HDFS: # hadoop fs -put test.txt test/
  View the contents of the HDFS file test.txt: # hadoop fs -cat test/test.txt
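  The same operations can also be done from Java through the HDFS FileSystem API. The following is a minimal sketch, assuming the cluster configuration files (core-site.xml / hdfs-site.xml) are on the classpath; it reuses the test/test.txt path from the shell example above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("test/test.txt");           // relative to the user's HDFS home directory

        // Write the file (overwrite if it already exists)
        FSDataOutputStream out = fs.create(file, true);
        out.write("My First Hadoop Lesson\n".getBytes("UTF-8"));
        out.close();

        // Read the file back and print its first line
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
      }
    }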

  16. hadoop fs - Command Reference
  ls <path> : Lists the contents of the directory specified by path, showing the names, permissions, owner, size, and modification date for each entry.
  lsr <path> : Behaves like -ls, but recursively displays entries in all subdirectories of path.
  du <path> : Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.
  dus <path> : Like -du, but prints a summary of disk usage of all files/directories in the path.
  mv <src> <dest> : Moves the file or directory indicated by src to dest, within HDFS.
  cp <src> <dest> : Copies the file or directory identified by src to dest, within HDFS.
  rm <path> : Removes the file or empty directory identified by path.
  rmr <path> : Removes the file or directory identified by path; recursively deletes any child entries (files or subdirectories of path).

  17. hadoop fs - Command Reference
  put <localSrc> <dest> : Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
  copyFromLocal <localSrc> <dest> : Identical to -put.
  moveFromLocal <localSrc> <dest> : Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
  get [-crc] <src> <localDest> : Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
  getmerge <src> <localDest> : Retrieves all files that match the path src in HDFS, and copies them to a single, merged local file localDest.
  cat <filename> : Displays the contents of filename on stdout.
  copyToLocal <src> <localDest> : Identical to -get.
  moveToLocal <src> <localDest> : Works like -get, but deletes the HDFS copy on success.
  mkdir <path> : Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).

  18. hadoop fs - Command Reference
  stat [format] <path> : Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
  tail [-f] <filename> : Shows the last 1KB of the file on stdout.
  chmod [-R] mode,mode,... <path>... : Changes the file permissions associated with one or more objects identified by path. Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes 'a' if no scope is specified, and does not apply an umask.
  chown [-R] [owner][:[group]] <path>... : Sets the owning user and/or group for files or directories identified by path. Sets the owner recursively if -R is specified.
  chgrp [-R] group <path>... : Sets the owning group for files or directories identified by path. Sets the group recursively if -R is specified.
  help <cmd-name> : Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

  19. YARN

  20. YARN Sometimes called MapReduce 2.0, YARN decouples scheduling capabilities from the data processing component. Hadoop clusters can now run interactive querying and streaming data applications simultaneously. Separating resource management and scheduling from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can't wait for batch jobs to finish.

  21. YARN

  22. HBase HBase is a NoSQL column-family database that runs on top of Hadoop HDFS (it is the default Hadoop database). • Can handle large tables that have billions of rows and millions of columns, with fault tolerance and horizontal scalability. • The HBase concept was inspired by Google's Bigtable. • The schema does not need to be defined up front. • Supports high-performance random read/write applications. • Data is stored in HBase table(s). • Tables are made of rows and columns. • Rows are stored in order by row key. • Query data using get/put/scan only. For more information: https://www.tutorialspoint.com/hbase/index.htm
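  Because HBase is queried only through get/put/scan, a client program stays simple. The following is a minimal Java sketch using the standard HBase client API; the table name 'employee' and column family 'info' are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {
          // put: write one cell (row key "row1", column family "info", qualifier "name")
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Rafie"));
          table.put(put);
          // get: read that row back
          Result r = table.get(new Get(Bytes.toBytes("row1")));
          System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
          // scan: iterate over rows in row-key order
          try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result row : scanner) {
              System.out.println(Bytes.toString(row.getRow()));
            }
          }
        }
      }
    }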

  23. PIG

  24. PIG Apache Pig is used for querying data stored in Hadoop clusters. It allows users to write complex MapReduce transformations using a high-level scripting language called Pig Latin. Pig translates the Pig Latin script into MapReduce tasks using its Pig Engine component so that they can be executed within YARN against a dataset stored in HDFS. Programmers need not write complex Java code for MapReduce tasks; instead they can use Pig Latin to perform them. Apache Pig provides nested data types like tuples, bags, and maps that are missing from MapReduce, along with built-in operators like joins, filters, ordering, etc. Apache Pig can handle structured, unstructured, and semi-structured data. For more information: https://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm
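  Pig Latin can also be embedded in Java through Pig's PigServer class; the sketch below is a rough illustration under that assumption (the tab-delimited input file student.txt and its columns are invented for the example):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java; use ExecType.LOCAL for a single-machine test
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Hypothetical tab-delimited input with an id and a name column
        pig.registerQuery("students = LOAD 'student.txt' USING PigStorage('\\t') AS (id:int, name:chararray);");
        pig.registerQuery("sorted = ORDER students BY name;");
        // Writes the result to the HDFS directory 'student_sorted'
        pig.store("sorted", "student_sorted");
        pig.shutdown();
      }
    }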

  25. Hive

  26. Hive The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL syntax. It allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. The Hive shell, JDBC, and ODBC are supported. Provides access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase. For more information: https://www.tutorialspoint.com/hive/hive_introduction.htm

  27. Create Database/Tables in Hive
  hive> CREATE DATABASE IF NOT EXISTS userdb;
  hive> SHOW DATABASES;
  hive> DROP DATABASE IF EXISTS userdb;
  hive> DROP DATABASE IF EXISTS userdb CASCADE; (also drops all of its tables)
  hive> CREATE TABLE IF NOT EXISTS employee (id int, name String, salary String, destination String) COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
  hive> ALTER TABLE employee RENAME TO emp;
  hive> ALTER TABLE employee CHANGE name ename String;
  hive> ALTER TABLE employee CHANGE salary salary Double;
  hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
  hive> DROP TABLE IF EXISTS employee;
  hive> SHOW TABLES;

  28. Select, Views
  hive> SELECT * FROM employee WHERE Id=1205;
  hive> SELECT * FROM employee WHERE Salary>=40000;
  hive> SELECT 20+30 ADD FROM temp;
  hive> SELECT * FROM employee WHERE Salary>40000 AND Dept='TP';
  hive> SELECT round(2.6) FROM temp;
  hive> SELECT floor(2.6) FROM temp;
  hive> SELECT ceil(2.6) FROM temp;
  hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary>30000;
  hive> DROP VIEW emp_30000;

  29. Index, Order by, Group by, Join
  hive> CREATE INDEX index_salary ON TABLE employee(salary) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
  hive> DROP INDEX index_salary ON employee;
  hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
  hive> SELECT Dept, count(*) FROM employee GROUP BY DEPT;
  hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
  hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
  hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

  30. Java example for Hive JDBC
  import java.sql.SQLException;
  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.sql.DriverManager;
  public class HiveQLOrderBy {
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");  // register driver
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", ""); // get connection
      Statement stmt = con.createStatement();                   // create statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT"); // execute statement
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
        System.out.println(res.getInt(1) + " " + res.getString(2) + " " + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
    }
  }
  $ javac HiveQLOrderBy.java
  $ java HiveQLOrderBy

  31. Phoenix

  32. Apache Phoenix Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce, enabling the building of low-latency applications on top of NoSQL stores. Apache Phoenix is a good choice for low-latency workloads and mid-size tables (1M - 100M rows). Apache Phoenix is faster than Hive and Impala.
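  Because Phoenix exposes a JDBC driver, working with it looks like ordinary JDBC code. A minimal sketch follows; the ZooKeeper quorum (localhost:2181) and the employee table are assumptions for the example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixExample {
      public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase ZooKeeper quorum
        try (Connection con = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
          Statement stmt = con.createStatement();
          stmt.executeUpdate("CREATE TABLE IF NOT EXISTS employee (id INTEGER PRIMARY KEY, name VARCHAR)");
          // UPSERT is Phoenix's combined insert/update statement
          PreparedStatement ps = con.prepareStatement("UPSERT INTO employee VALUES (?, ?)");
          ps.setInt(1, 1205);
          ps.setString(2, "Rafie");
          ps.executeUpdate();
          con.commit();                  // Phoenix connections are not auto-commit by default
          ResultSet rs = stmt.executeQuery("SELECT id, name FROM employee");
          while (rs.next()) {
            System.out.println(rs.getInt(1) + " " + rs.getString(2));
          }
        }
      }
    }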

  33. Phoenix main features • Supports transactions • Supports user-defined functions • Supports secondary indexes • Supports view syntax

  34. Solr

  35. Solr (enterprise search engine) Solr is used to build search applications that deliver high performance, with support for execution of parallel SQL queries. It was built on top of Lucene (a full-text search engine). Solr can be used along with Hadoop to search large volumes of text-centric data. Not only search: Solr can also be used for storage. Like other NoSQL databases, it is a non-relational data storage and processing technology. Supports full-text search (it utilizes RAM, not the CPU), PDF and Word document indexing, auto-suggest, stop words, synonyms, etc. Supports replication. Clients communicate with the search server via HTTP (it can even return JSON, native PHP/Ruby/Python). Can index directly from the database with custom queries. For more information: https://www.tutorialspoint.com/apache_solr/apache_solr_overview.htm
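  Since clients talk to Solr over HTTP, the SolrJ Java client is one common way to index and query it. A rough sketch, assuming a Solr core named 'docs' is running at localhost:8983 and its schema has id and title fields:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrExample {
      public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

        // Index one document (the fields must exist in the core's schema)
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Hadoop ecosystem overview");
        solr.add(doc);
        solr.commit();

        // Full-text query over the title field
        SolrQuery query = new SolrQuery("title:hadoop");
        QueryResponse response = solr.query(query);
        for (SolrDocument d : response.getResults()) {
          System.out.println(d.getFieldValue("id") + " : " + d.getFieldValue("title"));
        }
        solr.close();
      }
    }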

  36. Elasticsearch-Hadoop

  37. Elasticsearch-Hadoop (ES-Hadoop) ES-Hadoop connects the massive data storage and deep processing power of Hadoop with the real-time full-text search and analytics of Elasticsearch. The ES-Hadoop connector lets you get quick insight from your big data and makes working in the Hadoop ecosystem even better. ES-Hadoop lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and Kibana visualizations. With ES-Hadoop, you can easily build dynamic, embedded search applications to serve your Hadoop data, or perform deep, low-latency analytics using full-text and geospatial queries and aggregations. ES-Hadoop lets you easily move data bi-directionally between Elasticsearch and Hadoop while exposing HDFS as a repository for long-term archival.

  38. Sqoop

  39. Sqoop • Import data from relational database tables into HDFS • Export data from HDFS into relational database tables • Sqoop works with all databases that have a JDBC connection; the JDBC driver JAR files should exist in $SQOOP_HOME/lib • It uses MapReduce to import and export the data • Imported data can be stored as text files or binary files, or loaded into HBase or Hive

  40. Oozie

  41. Oozie Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment. It allows multiple complex jobs to be combined and run in sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel. One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack, supporting various Hadoop jobs like Hive, Pig, and Sqoop, as well as system-specific jobs like Java and shell. Oozie has three types of jobs: Oozie Workflow Jobs − represented as Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed. Oozie Coordinator Jobs − workflow jobs triggered by time and data availability. Oozie Bundles − packages of multiple coordinator and workflow jobs. For more information: https://www.tutorialspoint.com/apache_oozie/apache_oozie_introduction.htm
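  Besides the command-line tools, Oozie also ships a Java client API. The sketch below shows roughly how a workflow job could be submitted with it; the Oozie server URL and the HDFS application path are placeholders, and details may vary by Oozie version:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieExample {
      public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server's REST endpoint (placeholder URL)
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job properties: where the workflow.xml lives in HDFS, plus any parameters it uses
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/rafie/workflows/wordcount");
        conf.setProperty("queueName", "default");

        String jobId = oozie.run(conf);            // submit and start the workflow
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " status: " + job.getStatus());
      }
    }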

  42. R-Hadoop

  43. R Hadoop RHadoop is a collection of five R packages that allow users to manage and analyze data with Hadoop. rhdfs: connects HDFS to R. rhbase: connects HBase to R. rmr2: enables R to perform statistical analysis using MapReduce. ravro: enables R to read and write Avro files from local storage and HDFS. plyrmr: enables users to perform common data manipulation operations, as found in plyr and reshape2, on data sets stored in Hadoop.

  44. Spark

  45. Spark with Hadoop 2+ • Spark is an alternative in-memory framework to MapReduce • Supports general workloads as well as streaming, interactive queries, and machine learning, providing performance gains • Spark jobs can be written in Scala, Python, or Java; APIs are available for all three • Run Spark Scala shells with (spark-shell) and Spark Python shells with (pyspark) • Apache Spark was the 2014 world record holder for sorting: it sorted 100 TB of data on 207 machines in 23 minutes, whereas Hadoop MapReduce took 72 minutes on 2100 machines.

  46. Spark libraries • Spark SQL: a Spark module for structured data processing, with in-memory processing at its core. Using Spark SQL, you can read data from any structured source, like JSON, CSV, Parquet, Avro, SequenceFiles, JDBC, Hive, etc. Example: • scala> sqlContext.sql("SELECT * FROM src").collect • scala> hiveContext.sql("SELECT * FROM src").collect • Spark Streaming: write applications to process streaming data in Java or Scala. Receives data from: Kafka, Flume, HDFS / S3, Kinesis, Twitter. Pushes data out to: HDFS, databases, dashboards. • MLlib: Spark 2+ has a new optimized library supporting machine learning functions on a cluster, based on the new DataFrame-based API in the spark.ml package. • GraphX: API for graphs and graph-parallel computation
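  As a companion to the Scala shell snippets above, a minimal Spark SQL program in Java might look like the following (the people.json path and its name/age columns are invented for the example; the job would normally be launched with spark-submit):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSQLExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("SparkSQLExample")
            .getOrCreate();

        // Read a structured source (JSON here; CSV, Parquet, Avro, JDBC, Hive work similarly)
        Dataset<Row> people = spark.read().json("hdfs:///user/rafie/people.json");

        // Register as a temporary view and query it with SQL
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
      }
    }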

  47. Flume
