
Big Data Technology: Introduction to Hadoop


Presentation Transcript


  1. Big Data Technology: Introduction to Hadoop Antonino Virgillito

  2. Motivation • The main characterization of Big Data is mostly to be…well… “Big” • Intuitive definition: a size that “creates problems” when handled with ordinary tools and methods • However, the exact definition of “Big” is a moving target • Where do we draw the line? • Big Data tools in IT were specifically tailored to handle those cases where common data handling tools fail for some reason • E.g. Google, Facebook…

  3. Motivation • Large size that grows continuously and indefinitely • Difficult to provision a storage size that is guaranteed to fit • Processing and querying huge data sets requires a lot of memory and CPU • No matter how far you expand the technical specifications: if data is “Big” you eventually hit the ceiling…

  4. Is Big Data Big in Official Statistics? • Do we really have to handle those massive dimensions? • Think about the largest dataset you ever used… • Yes • The example of scanner data in Istat • Maybe • We should be ready when it happens • No • Big Data technology can still be useful for complex processing of «normal» data sets

  5. Big Data Technology • Handling volume -> Distributed platforms • The standard: Hadoop • Handling variety -> NoSQL databases

  6. Hadoop • Open source platform for distributed processing of large data • Distributed: works on a cluster of servers • Functions: • Distribution of data and processing across machines • Management of the cluster • Distribution is transparent to the programmer-analyst

  7. Hadoop scalability • Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model • Huge clusters can be built from (cheap) commodity hardware • A 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines • Clusters can easily scale up with little or no modification to the programs

  8. Hadoop Components • HDFS: Hadoop Distributed File System • Abstraction of a file system over a cluster • Stores large amounts of data by transparently spreading it over different machines • MapReduce • Simple programming model that enables parallel execution of data processing programs • Executes the work near the data it processes • In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

  9. Hadoop Principle • Hadoop is basically a middleware platform that manages a cluster of machines • The core component is a distributed file system (HDFS) • Files in HDFS are split into blocks that are scattered over the cluster • The cluster can grow indefinitely simply by adding new nodes
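
A minimal sketch of what this looks like from the HDFS command line (the file and directory names are hypothetical): a file copied into HDFS is transparently split into blocks, and fsck reveals where those blocks ended up.

# Copy a local file into the distributed file system
hdfs dfs -put bigfile.csv /user/demo/bigfile.csv

# List it as if it were an ordinary file
hdfs dfs -ls /user/demo

# Show how the file was split into blocks and on which nodes they are stored
hdfs fsck /user/demo/bigfile.csv -files -blocks -locations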

  10. MapReduce and Hadoop • Hadoop MapReduce is logically placed on top of HDFS

  11. MapReduce and Hadoop • MR works on (big) files loaded on HDFS • Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores • Output is written on HDFS • Scalability principle: perform the computation where the data is

  12. The MapReduce Paradigm • Parallel processing paradigm: the programmer is unaware of the parallelism • Programs are structured into a two-phase execution: Map, then Reduce • Map: data elements are classified into categories • Reduce: an algorithm is applied to all the elements of the same category • E.g. in word counting, Map emits a (word, 1) pair for every word it sees and Reduce sums the counts of each distinct word, as in the Java sketch below
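
To make the two phases concrete, here is the canonical Hadoop word count in Java, essentially the standard tutorial example that ships with Hadoop (input and output paths are passed on the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: classify each word into its own category by emitting (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce phase: apply the same algorithm (summing) to all elements of one category
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner = local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}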

  13. Hadoop pros & cons • Good for • Repetitive tasks on big-size data • Not good for • Replacing an RDBMS • Complex processing requiring various phases and/or iterations • Processing small- to medium-size data

  14. Hadoop vs. RDBMS • Hadoop • is not transactional • is not optimized for random access • does not natively support data updates • favors long-running, batch work • RDBMS • disk space is more expensive • cannot scale indefinitely

  15. Hadoop Distributions • Hadoop is an open source project promoted by the Apache Foundation • As such, it can be downloaded and used for free • However, all the configuration and maintenance of all the components must be done by the user, mainly with command-line tools • Software vendors provide Hadoop distributions that facilitate the use of the platform in various ways • Distributions are normally free, but there is paid-for support • Additional features • User interface • Management console • Installation tools

  16. Common Hadoop Distributions • Hortonworks • Completely open source • Also has a Windows version • Used in: Big Data Sandbox • Cloudera • Mostly standard Hadoop but extended with proprietary components • Highlights: Cloudera Manager (console) and Impala (high-performance query) • Used in: Istat Big Data Platform

  17. Tools for Data Analysis with Hadoop [Stack diagram: Pig, Hive and statistical software sit on top of Hadoop's MapReduce and HDFS layers]

  18. Hive • Hive is a SQL interface for Hadoop that facilitates queries of data on the file system and the analysis of large datasets stored in Hadoop • Hive provides a SQL-like language called HiveQL • Well, it is SQL • Due to its straightforward SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop

  19. Using Hive • Files in tabular format stored in HDFS can be represented as tables • Sets of typed columns • Tables are treated in the traditional way, as in a relational database • However, a query triggers one or more MapReduce jobs • Things can get slow… • All common SQL constructs can be used • Joins, subqueries, functions
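
A minimal sketch of the workflow (the table and file names are hypothetical): a comma-separated file already sitting in HDFS is declared as a table and then queried with plain SQL; the SELECT below is compiled into MapReduce jobs behind the scenes.

CREATE TABLE users (username STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/demo/Users.csv' INTO TABLE users;

-- An ordinary aggregation query; Hive translates it into MapReduce jobs
SELECT age, COUNT(*) AS n
FROM users
GROUP BY age;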

  20. Hive vs. RDBMS • Hive works on flat files and does not support indexes and transactions • Hive does not support updates and deletes; rows can only be added incrementally • A table is actually a directory in HDFS, so rows are inserted just by adding new files to the directory • In this sense, Hive works more like a data warehouse than like a DBMS
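
The directory-based design is easiest to see with an external table (a minimal sketch; the table and path names are hypothetical): the table is just a schema laid over whatever files sit in its HDFS directory, so "inserting" rows amounts to adding a file.

CREATE EXTERNAL TABLE page_visits (username STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/page_visits';

-- Adding rows = adding a file to the directory, e.g. from the shell:
--   hdfs dfs -put visits_june.csv /user/demo/page_visits/
-- The next query automatically sees the new file:
SELECT COUNT(*) FROM page_visits;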

  21. Pig • Tool for querying data on Hadoop clusters • Widely used in the Hadoop world • Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts • Lets you write data manipulation scripts in a high-level language called Pig Latin • Interpreted language: scripts are translated into MapReduce jobs • Mainly targeted at joins and aggregations

  22. Pig: Motivations • Pig is another high-level interface to MapReduce • Scripts written in Pig Latin translate into MapReduce jobs • However, working in Pig is much simpler than writing native MapReduce programs (compare the word count script on slide 30 with the Java program sketched after slide 12)

  23. Pig Commands: Loading datasets from HDFS
  users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
  pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

  24. Pig Commands: Filtering data
  users_1825 = filter users by age >= 18 and age <= 25;

  25. Pig Commands: Join datasets
  joined = join users_1825 by username, pages by username;

  26. Pig Commands: Group records
  grouped = group joined by url;
  Creates a new dataset with two fields, named group and joined. There will be one record for each distinct url:
  dump grouped;
  (www.twitter.com, {(alice, 21), (bob, 18)})
  (www.facebook.com, {(carol, 24), (alice, 21), (bob, 18)})

  27. Pig Commands: Apply a function to the records of a dataset
  summed = foreach grouped generate group as url, COUNT(joined) as views;

  28. Pig Commands: Sort a dataset
  sorted = order summed by views desc;
  Keep only the first n rows:
  top_5 = limit sorted 5;

  29. Pig Commands: Write a dataset to HDFS
  store top_5 into 'top5_sites.csv';

  30. Word Count in Pig
  A = load '/tmp/bible+shakes.nopunc';                               -- one line of text per record
  B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;   -- split each line into words
  C = filter B by word matches '\\w+';                               -- keep word-like tokens only
  D = group C by word;                                               -- one group per distinct word
  E = foreach D generate COUNT(C) as count, group as word;           -- count occurrences per word
  F = order E by count desc;                                         -- most frequent words first
  store F into '/tmp/wc';

  31. Pig: User Defined Functions • There are times when Pig’s built-in operators and functions will not suffice • Pig provides the ability to implement your own • Filter • Ex: res = FILTER bag BY udfFilter(post); • Load Function • Ex: res = load 'file.txt' using udfLoad(); • Eval • Ex: res = FOREACH bag GENERATE udfEval($1) • Choice between several programming languages • Java, Python, JavaScript
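
As an illustration, a minimal eval UDF in Python (the file name myudfs.py and the function to_upper are hypothetical; the register … using jython mechanism is standard Pig):

-- contents of myudfs.py:
@outputSchema('word:chararray')
def to_upper(word):
    # eval UDF: transforms one field of each input record
    return word.upper()

-- in the Pig script:
register 'myudfs.py' using jython as myfuncs;
res = foreach bag generate myfuncs.to_upper($0);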

  32. Hive vs. Pig • Hive • Uses plain SQL, so it is straightforward to start with • Requires data to be in tabular format • Only allows single queries to be issued • Pig • Requires learning a new language • Allows working on data with a free schema • Allows writing scripts with multiple processing steps • Both languages can be used for pre-processing and analysis

  33. Interactive Querying in Hadoop • Response times of MapReduce are typically slow, which makes it unsuitable for interactive workloads • Hadoop distributions provide alternative solutions for querying data with low latency • Hortonworks: Hive-on-Tez • Cloudera: Impala • The idea is to bypass the MapReduce mechanism and avoid its high latency • Great advantage for aggregation queries • Plain Hive still makes sense for low-throughput data transformations
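
For instance, on a Cloudera cluster the same HiveQL-style query can be submitted through Impala's shell (the host and table names are hypothetical), bypassing MapReduce entirely:

# Run an aggregation query with low latency via Impala
impala-shell -i impalad-host \
  -q "SELECT url, COUNT(*) AS views FROM page_visits GROUP BY url ORDER BY views DESC LIMIT 5;"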

  34. Using Hadoop from Statistical Software • R • Packages rhdfs, rmr • Issue HDFS commands and write MapReduce jobs • SAS • SAS In-Memory Statistics • SAS/ACCESS • Makes data stored in Hadoop appear as native SAS datasets • Uses the Hive interface • SPSS • Transparent integration with Hadoop data
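
A minimal sketch of issuing HDFS commands from R with the rhdfs package (the paths are hypothetical; assumes the HADOOP_CMD environment variable points at the hadoop binary):

library(rhdfs)

hdfs.init()                           # connect R to the cluster's HDFS
hdfs.put("users.csv", "/user/demo/")  # copy a local file into HDFS
hdfs.ls("/user/demo")                 # list the directory contents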

  35. RHadoop • Set of packages that allows integration of R with HDFS and MapReduce • Hadoop provides the storage while R brings the analysis • Just a library • Not a special runtime, not a different language, not a special-purpose language • Incrementally port your code and keep using all your packages • Requires R to be installed and configured on all nodes of the cluster

  36. WordCount in R
  wordcount = function(input, output = NULL, pattern = " ") {
    # Map: split each line into words and emit a (word, 1) pair per word
    wc.map = function(., lines) {
      keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }
    # Reduce: sum the counts collected for each word
    wc.reduce = function(word, counts) {
      keyval(word, sum(counts))
    }
    mapreduce(input = input, output = output,
              input.format = "text",
              map = wc.map, reduce = wc.reduce,
              combine = TRUE)
  }
