
HADOOP ADMIN: Session -2


Presentation Transcript


  1. BIG DATA • HADOOP ADMIN: Session -2 • What is Hadoop?

  2. AGENDA • Hadoop Demo using Cygwin • HDFS Daemons • MapReduce Daemons • Hadoop Ecosystem Projects

  3. Hadoop Using Cygwin • What is Cygwin? • Hadoop needs Java version 1.6 or higher • bin/hadoop • bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output • Word count example • Tokenization problem • Modifying the program (see the sketch below)
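The "tokenization problem" above: the stock example splits on whitespace only, so "Bob" and "bob," are counted as different words. A minimal sketch of the kind of modification the slide alludes to (the class name and cleaning rule are hypothetical, not the course's actual code):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical variant of the example's mapper: lowercase each token and
    // strip punctuation so "Bob" and "bob," count as the same word.
    public class CleanTokenMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\s+")) {
          String cleaned = token.replaceAll("[^a-z0-9]", ""); // drop punctuation
          if (!cleaned.isEmpty()) {
            word.set(cleaned);
            context.write(word, ONE); // emit (word, 1)
          }
        }
      }
    }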

  4. HDFS Daemons • NameNode: keeps all HDFS metadata in RAM • Secondary NameNode: not a backup/standby node; it periodically copies the fsimage and edits log from the NameNode (asking it to roll the edits first), replays all edits to create a new fsimage, and sends the new fsimage back; the NameNode then renames the new edits log into place • DataNodes: store the data blocks, send heartbeats and block reports to the NameNode, and serve block reads to clients (see the sketch below)
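The "Read Data Block" arrow in the diagram is the path an HDFS client exercises: it asks the NameNode for block locations (served from the metadata in RAM) and then streams the blocks directly from the DataNodes. A minimal client-side sketch with the Hadoop 1.x FileSystem API (the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.default.name from core-site.xml
        FileSystem fs = FileSystem.get(conf);       // asks the NameNode for metadata
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {            // block data streams from DataNodes
          System.out.write(buf, 0, n);
        }
        in.close();
        fs.close();
      }
    }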

  5. MapReduce V1 Daemons • JobTracker: one per cluster; accepts jobs and schedules their tasks • TaskTracker: one per worker node; runs the map and reduce tasks [diagram: one JobTracker coordinating four TaskTrackers]

  6. Word Count over a Given Set of Web Pages • Input pages: "see bob throw" and "see spot run" • Map output: see 1, bob 1, throw 1, see 1, spot 1, run 1 • Reduce output (counts grouped by word): bob 1, run 1, see 2, spot 1, throw 1 • Can we do word count in parallel? (see the sketch below)
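Yes: each mapper processes one input split in parallel and emits (word, 1) pairs, the framework groups the pairs by word, and the reducers sum the counts, exactly as in the lists above. A minimal sketch modeled on Hadoop's bundled word count example (new org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);               // emit (word, 1), e.g. (see, 1)
          }
        }
      }
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum)); // emit (word, total), e.g. (see, 2)
        }
      }
    }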

  7. The MapReduce Framework (pioneered by Google)

  8. Automatic Parallel Execution in MapReduce (Google) • Handles failures automatically, e.g., restarts tasks if a node fails • Runs multiple copies of the same task (speculative execution) so that one slow task does not slow down the whole job

  9. MapReduce in Hadoop (1)

  10. MapReduce in Hadoop (2)

  11. Data Flow in a MapReduce Program in Hadoop • InputFormat • Map function (1:many) • Partitioner • Sorting & Merging • Combiner • Shuffling • Merging • Reduce function • OutputFormat (see the driver sketch below for how these stages are wired together)
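Most of these stages are pluggable through the job driver. A sketch of the wiring, reusing the TokenizerMapper and IntSumReducer classes sketched after slide 6 (HashPartitioner is the default and is set here only to make the stage visible):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setInputFormatClass(TextInputFormat.class);       // InputFormat: splits -> records
        job.setMapperClass(WordCount.TokenizerMapper.class);  // Map function: 1 record -> many pairs
        job.setCombinerClass(WordCount.IntSumReducer.class);  // Combiner: map-side pre-aggregation
        job.setPartitionerClass(HashPartitioner.class);       // Partitioner: word -> reducer
        job.setReducerClass(WordCount.IntSumReducer.class);   // Reduce function: sum per word
        job.setOutputFormatClass(TextOutputFormat.class);     // OutputFormat: write the results
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);     // sorting, shuffling and merging run inside the framework
      }
    }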

  12. Lifecycle of a MapReduce Job • [figure: a user program's map function and reduce function are submitted and run as a MapReduce job]

  13. Lifecycle of a MapReduce Job • [same figure, continued]

  14. Lifecycle of a MapReduce Job • [figure: over time, input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2] • How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc. determined?

  15. Job Configuration Parameters • 190+ parameters in Hadoop • Set manually, or the defaults are used (see the sketch below)
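A few of those parameters set programmatically on a Configuration object (property names are the Hadoop 1.x ones; the values are illustrative, not recommendations):

    import org.apache.hadoop.conf.Configuration;

    public class ConfigDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();  // starts from the *-site.xml defaults
        conf.setInt("mapred.reduce.tasks", 4);     // number of reduce tasks for the job
        conf.setInt("io.sort.mb", 200);            // map-side sort buffer size, in MB
        conf.setBoolean("mapred.map.tasks.speculative.execution", true); // slide 8's duplicate tasks
        System.out.println(conf.get("mapred.reduce.tasks")); // prints 4
      }
    }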

  16. Hadoop Ecosystem/Sub Projects

  17. PIG • One frequent complaint about MapReduce is that it is difficult to program • Another criticism is that the development cycle is very long • When you implement a program directly in MapReduce, you have to think at the level of mapper and reducer functions and job chaining • Pig started as a research project within Yahoo! in the summer of 2006 and joined the Apache Incubator in September 2007 • Pig is a dataflow programming environment for processing very large files; Pig's language is called Pig Latin • Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability • Yahoo! runs 40% of all its Hadoop jobs with Pig; Twitter also uses Pig • Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there

  18. PIG: What it looks like • LOAD reads a data file into a relation, with a defined schema • The result is a relation, not a variable

  19. Word count example in Pig • text = LOAD 'text' USING TextLoader(); -- loads each line as one column • tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word; • wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1); [diagram: Pig job → MapReduce transformation → MapReduce jobs → HDFS]

  20. Pig vs. Hive • Pig Latin is a new language, but easy to learn if you know scripting languages such as Perl • Hive QL is a subset of SQL with small variations to enable MapReduce-style computation, so if you come from a SQL background you will find it extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (without SQL knowledge) then Pig may be more suitable for you • Hive is a bit easier to integrate with other systems and tools, since it speaks the language they already speak (i.e., SQL) • Ultimately the choice between Hive and Pig depends on the exact requirements of the application domain and the preferences of the implementers and those writing queries

  21. HIVE (HQL) • Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs and runs them on the Hadoop cluster • Invented at Facebook for its own data problems • SQL-like query language (HQL/Hive QL) to retrieve and process the data • JDBC/ODBC access is provided (see the sketch below) • Can also be used together with HBase
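A minimal sketch of the JDBC route, using the HiveServer1-era driver class (the host, port, and words table are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcDemo {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Hive compiles this query into one or more MapReduce jobs
        ResultSet rs = stmt.executeQuery("SELECT word, COUNT(*) FROM words GROUP BY word");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
      }
    }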

  22. HBase • HBase is not a high-level language that compiles to MapReduce • HBase is about letting Hadoop support lookups/transactions on key/value pairs: it allows quick random lookups (versus scanning all the data sequentially) and insert/update/delete in the middle of the data, not just add/append (see the sketch below)
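A minimal sketch of such a random lookup and an in-place update with the HBase 0.9x-era Java client (the table, row key, and column names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");          // hypothetical table
        Put put = new Put(Bytes.toBytes("row-42"));        // insert/update one row in place
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("bob"));
        table.put(put);
        Get get = new Get(Bytes.toBytes("row-42"));        // quick random lookup by key,
        Result result = table.get(get);                    // no sequential scan needed
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
      }
    }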

  23. Sqoop • Loads bulk data into Hadoop from relational databases • Imports individual tables or entire databases into files in HDFS • Provides the ability to import from SQL databases straight into your Hive data warehouse • Importing the USERS table into HDFS could be done with the command: • you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local --hive-import
