Introduction to MapReduce and Hadoop

Introduction to MapReduceand Hadoop Team: 3 MdLiakat AliAbdulazizAltowayanAndreeaCotoranuStephanie HaughtonGene LocklearLeslie Meadows

Contents • Hadoop • MapReduce • Download Links • Install and Tutorials

Software platform that lets one easily write and run applications that process vast amounts of data. It includes: • MapReduce – offline computing engine • HDFS – Hadoop distributed file system • HBase (pre-alpha) – online data access • Here's what makes it especially useful: • Scalable: It can reliably store and process petabytes. • Economical: It distributes the data and processing across clusters of commonly available computers (in thousands). • Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located. • Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures. What is Hadoop?

Hadoop implements Google’s MapReduce, using HDFS • MapReduce divides applications into many small blocks of work. • HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. • MapReduce can then process the data where it is located. What does it do?

A9.com – Amazon: To build Amazon's product search indices; process millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes. • Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search • AOL : Used for a variety of things ranging from statistics generation to running advanced algorithms for doing behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity. • Facebook: To store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage; • FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log analysis, data mining and machine learning • University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store and serve physics data; Example Applications and Organizations using Hadoop

More Hadoop Applications • Adknowledge - to build the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2. • Contextweb - to store ad serving log and use it as a source for Ad optimizations/ Analytics/reporting/machine learning; 23 machine cluster with 184 cores and about 35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage. • Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive) • NetSeer - Up to 1000 instances on Amazon EC2 ; Data storage in Amazon S3; Used for crawling, processing, serving and log analysis • The New York Times : Large scale image conversions ; EC2 to run hadoop on a large virtual cluster • Powerset / Microsoft - Natural Language Search; up to 400 instances on Amazon EC2 ; data storage in Amazon S3

• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. • MapReduceprogram is composed of • Map() procedure(method) that performs filtering and sorting, and • Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). What is MapReduce?

• Pioneered by Google • – Processes 20 PB of data per day • Sort/merge based distributed computing • Initially, it was intended for their internal search/indexing application, but now used extensively by more organizations • • Popularized by open-source Hadoop project • – Used by Yahoo!, Facebook, Amazon, … What is MapReduce?

What is MapReduce used for? • At Google: – Index building for Google Search – Article clustering for Google News – Statistical machine translation • At Yahoo!: – Index building for Yahoo! Search – Spam detection for Yahoo! Mail • At Facebook: – Data mining – Ad optimization – Spam detection • In research: – Analyzing Wikipedia conflicts (PARC) – Natural language processing (CMU) – Bioinformatics (Maryland) – Particle physics (Nebraska) – Ocean climate simulation (Washington)

MapReduce Goals 1. Scalability to large data volumes: – Scan 100 TB on 1 node @ 50 MB/s = 24 days – Scan on 1000-node cluster = 35 minutes 2. Cost-efficiency: – Commodity nodes (cheap, but unreliable) – Commodity network – Automatic fault-tolerance (fewer admins) – Easy to use (fewer programmers)

Hadoop Components • • Distributed file system (DFS) • The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. • highly fault-tolerant and is designed to be deployed on low-cost hardware. • provides high throughput access to application data and is suitable for applications that have large data sets. • part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.

• Files split into 128MB blocks • Blocks replicated across several datanodes (usually 3) • Namenode stores metadata (file names, locations, etc) • Optimized for large files, sequential reads • Files are append-only Hadoop Distributed File System

Hadoop Components • MapReduce framework – Executes user jobs specified as “map” and “reduce” functions – Manages work distribution & fault-tolerance

MapReduce Programming Model

Example: Word Count

Word Count Execution

An Optimization: The Combiner

Word Count with Combiner

Sample applications: Search

Sample applications: Sort

Apache HBase • HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. • It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS • it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data) • Hbaseis now serving several data-driven websites, including Facebook's Messaging Platform.

Get Started with Hadoop Installation Download links Sandbox How to Download and Install Tutorials http://hortonworks.com/

Hadoop- Download Running Hadoop on Ubuntu Linux (Single-Node Cluster) http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ Sandbox http://hortonworks.com/products/hortonworks-sandbox/#install Some other links: http://hadoop-skills.com/setup-hadoop/ https://hadoop.apache.org/

Sandbox • The easiest way to get started with Enterprise Hadoop • Sandbox is a personal, portable Hadoop environment , comes with a dozen interactive Hadoop tutorials that will guide through the basics of Hadoop. • Includes the Hortonworks Data Platform in an easy to use form. We can add our own datasets, and connect it to our existing tools and applications. • We can test new functionality with the Sandbox before we put it into production. Simply, easily and safely.

Install any one of the following on your host machine : • VirtualBox • VMware Fusion • Hyper-V • Oracle VirtualBox • Version 4.2 or later (https://www.virtualbox.org/wiki/Downloads) • Hortonworks Sandbox virtual appliance for VirtualBox • Download the correct virtual appliance file for your environment • http://hortonworks.com/products/hortonworks-sandbox/#install Download and Install

CPU - A 64-bit machine with a multi-core CPU that supports virtualization. • BIOS - Has been enabled for virtualization support • RAM - At least 4 GB of RAM • Browsers - Chrome 25+, IE 9+ (Sandbox will not run on IE 10), Safari 6+ • Complete Installation guide can be found at: • http://hortonworks.com/wp-content/uploads/2015/07/Import_on_Vbox_7_20_2015.pdf Download and Install

Cloudera Hadoop VM setuphortonworks hadoop installation Thank You

Introduction to MapReduce and Hadoop