
Hadoop


  1. Hadoop Joshua Nester, Garrison Vaughan, Calvin Sauerbier, Jonathan Pingilley, and Adam Albertson

  2. Overview – What is Hadoop? • Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. • It provides a distributed filesystem (HDFS) that can store data across thousands of servers, and a means of running work (Map/Reduce jobs) across those machines, running the work near the data. • It runs on Java 1.6.x or higher, with support for both Linux and Windows.

  3. Overview – What does it do? • Hadoop provides a way to solve problems using big data. • It can be used to interact with and analyze data that doesn't fit neatly in a database (e.g., financial portfolios, targeted product advertising).

  4. Overview – Brief History • Hadoop was built by Doug Cutting and Michael J. Cafarella. The name Hadoop came from Doug’s son’s stuffed toy elephant, which is also where Hadoop got its logo. • Hadoop was originally built as an infrastructure for the Nutch project, which crawls the web and builds a search engine index for the crawled pages.

  5. Overview – How does it work? • MapReduce expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. • A Map/Reduce computation has two phases, a map phase and a reduce phase. • For example, in word counting the map phase emits a (word, 1) pair for each word it sees, and the reduce phase sums the counts for each distinct word.

  6. Overview – How does it work? • Map • The framework splits the input data set into a large number of fragments and assigns each fragment to a map task. • Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. • Following the map phase, the framework sorts the intermediate data set by key and produces a set of tuples so that all the values associated with a particular key appear together. • The set of tuples is partitioned into a number of fragments equal to the number of reduce tasks that will be performed.
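To make the map phase concrete, here is a minimal word-count mapper written against Hadoop's org.apache.hadoop.mapreduce API. This is an illustrative sketch, not something prescribed by the slides; the class name and tokenization are our own.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Consumes (byte offset, line) pairs from its assigned input fragment
    // and produces intermediate (word, 1) pairs, as described on this slide.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);
        }
      }
    }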

  7. Overview – How does it work? • Reduce • Each reduce task consumes the fragment of tuples assigned to it and transforms each tuple into an output key/value pair as described by the user-defined reduce function. • Once again, the framework distributes the many reduce tasks across the cluster of nodes and handles shipping the appropriate fragment of intermediate data to each reduce task. • Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes, and any data that can be recovered is also redistributed. • Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
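The matching reduce side, again as a sketch: by the time reduce runs, the framework has already sorted and grouped the intermediate (word, 1) pairs by key, so the reducer only has to sum each group.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Consumes one key plus all of its grouped values and emits a single
    // output (word, total) pair, per the user-defined reduce function.
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }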

  8. Overview - Architecture • Hadoop has a master/slave architecture. • There is a single master server, the jobtracker, and several slave servers, the tasktrackers, one per node in the cluster. • The jobtracker is the point of interaction between users and the framework. • Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. • The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. • The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
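In Hadoop 1.x (the 1.0.3 release this proposal later targets), a job reaches the jobtracker through a driver program. The sketch below reuses the mapper and reducer sketched above; the input and output paths come from the command line, and the class names are placeholders of our own.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Hadoop 1.x constructor
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submitting the job hands it to the jobtracker, which queues it
        // and assigns map/reduce tasks to the tasktrackers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }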

  9. Overview - Hardware • Hadoop is designed to run on a large number of machines that don’t share any memory or disks (i.e., most commodity PCs and servers). • Scaling is best with nodes that have dual processors/cores and 4-8 GB of RAM. • The best cost/performance places the machines at one-half to one-third of the cost of application servers, but above the cost of standard desktop machines. • The bandwidth needed depends on the jobs being run and the number of nodes; a typical job produces around 100 MB/s of data.

  10. Overview - Hadoop DFS • Designed to reliably store very large files across machines in a large cluster. • Inspired by the Google File System. • Each file is stored as a sequence of blocks; all blocks in a file except the last are the same size. • Blocks belonging to a file are replicated for fault tolerance. • The block size and replication factor are configurable per file, as the sketch below shows. • Files in HDFS are "write once" and have strictly one writer at any time.
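Because block size and replication factor are per-file settings, a client can choose them at creation time. A minimal sketch using the FileSystem API follows; the path and the specific values (3 replicas, 64 MB blocks, the common Hadoop 1.x defaults) are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCreateExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        // create(path, overwrite, bufferSize, replication, blockSize):
        // here 3 replicas and 64 MB blocks for this one file.
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeBytes("HDFS files are write-once with a single writer.\n");
        out.close();
        fs.close();
      }
    }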

  11. Overview - HDFS Architecture • HDFS also follows a master/slave architecture. • An installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. • There are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. • The Namenode executes filesystem namespace operations such as opening, closing, and renaming files and directories, and also determines the mapping of blocks to Datanodes. • The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
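The Namenode's block-to-Datanode mapping can be observed from any client. This sketch asks the Namenode where each block of a file lives; the file path comes from the command line.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // The Namenode answers this metadata query; the listed hosts are
        // the Datanodes that actually serve the block replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
      }
    }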

  12. Overview - Developing for Hadoop • A Hadoop adopter must be more sophisticated than a relational database adopter. • There are not many “turn-key” applications that are ready to use. • Each company that uses Hadoop will likely have to be adept enough to create its own programs in-house. • Hadoop is written in Java, but APIs are available for multiple languages, and many vendors now provide tools to assist.

  13. What Hadoop is not… • A silver bullet that will solve all application/datacenter problems. • A replacement for a database or SAN. • A replacement for an existing NFS. • Always the fastest solution: if operations rely on the output of preceding operations, Hadoop/MapReduce will likely provide no benefit. • A place to learn Java. • A place to learn networking. • A place to learn Unix/Linux system administration.

  14. Proposal - Purpose • The purpose of this assignment is to set up and configure a Hadoop cluster so that we can effectively distribute processes between different machines. • MapReduce algorithms can then be run on the cluster to process extremely large amounts of data. • Hadoop will also allow us to observe fault tolerance in action.

  15. Proposal - Outline • Hardware • Network Topology • OS/System Software • Hadoop • Benchmark Tool • MapReduce • Applying the Distributed Environment

  16. Proposal - Hardware • Utilizing 2 pods, plus possibly servers or other computers • 2 compatible computers at each pod • 64-bit architecture machines • 1 control system; if network access is wanted, we can set up one node with access • All computers are potential slaves • 1x 24-port Cisco switch • Hadoop can also run on a single-node cluster

  17. Proposal – Network Topology • High Level View

  18. Proposal – OS/Software • Supported operating systems • Linux and Windows; BSD, Mac OS X, and OpenSolaris would also work. • Since Linux is an open-source OS, we will use it for our installation. • Needs Java 1.6.x or higher to run • Platform specifics • Most of Hadoop is built with Java, but a growing amount of native code is written in C and C++ • This makes it harder to port because of functionality issues • Successfully tested setup • Ubuntu 10.04 LTS, 8.10, 8.04 LTS, 7.10 • Hadoop 1.0.3, released May 2012

  19. Proposal – OS/Software Cont. • Relies heavily on Unix/Linux system knowledge • Linux configuration areas we must know: • SSH: how to use both ssh and scp • ifconfig, nslookup, and other network tools • Where the relevant logs live • How to set up and mount file systems • Basically, we must already have a good working knowledge of a Linux system (or know how to use Google) • A working knowledge of Java programming is also needed to work through possible errors.

  20. Proposal - Releases • Hadoop Releases: http://hadoop.apache.org/releases.html • The software library framework allows for distributed processing • The library is designed to detect and handle failures at the application level • Delivers high availability • Includes common utilities to support Hadoop modules • Contains a framework for jobs • Scheduling and cluster resource management • Implementation of MapReduce • Hadoop Distributed File System (HDFS) • Runs on commodity hardware • Fault tolerance • Restarting tasks • Data replication

  21. Proposal – Why Hadoop • Companies that implement Hadoop • IBM • LinkedIn • Adobe • Twitter • Facebook • Amazon • Yahoo! • Rackspace • We can clearly see that Hadoop is worth the hype!

  22. Proposal - Benchmarking • Hadoop MapReduce • Hadoop Streaming allows shell commands (or any executable that reads stdin and writes stdout) to be used as map or reduce functions. • The operation can be run in parallel on different lists of data. • Pushes the program out to the machines • Output is saved to the distributed filesystem • The jobtracker keeps track of MapReduce jobs • Success and failure • Works to complete the entire job • Hadoop provides its own distributed filesystem and runs jobs near the data stored on each node

  23. Proposal – Benchmarking Cont • Maximum parallelism • Maps and reduces must be stateless • You can’t control the order in which the maps run or the reductions run • You won’t get data back until the entire mapping has completed • Nodes do report back periodically, just not with their full output • Used in several different environments • Multi-core and many-core systems, desktop grids, dynamic cloud environments • Example: • grep -Eh <regex> <inDir>/* | sort | uniq -c | sort -nr • Counts the lines, across all files in a directory, that match a regex condition (see the MapReduce sketch below). • We can see both powerful and not-so-powerful applications dealing with MapReduce
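A MapReduce version of that pipeline's grep-and-count stage might look like the sketch below. Paired with a summing reducer like the word-count one shown earlier, it reproduces grep | sort | uniq -c; the "grep.pattern" property name is our own assumption, and the final sort -nr by count would need a second job or a sort over the output.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map side of a distributed "grep | sort | uniq -c": emit each
    // matching line with a count of 1; a summing reducer totals them.
    public class GrepMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private Pattern pattern;

      @Override
      protected void setup(Context context) {
        // "grep.pattern" is an assumed job property set by the driver.
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
          context.write(value, ONE);
        }
      }
    }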

  24. Proposal – Useful Applications • Word-counting projects • Generating PDF files for many articles from scanned images • Google implementation: locating roads connected to a given intersection • Rendering maps • Finding nearest features • Page ranking • Machine translation
