
Distributed and Parallel Processing Technology: Chapter 1. Meet Hadoop



1. Distributed and Parallel Processing Technology
   Chapter 1. Meet Hadoop
   Sun Jo

2. Data!
   • We live in the data age.
   • An estimated 0.18 ZB was stored in 2006, with a forecast tenfold growth by 2011 to 1.8 ZB
     • 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
   • The flood of data is coming from many sources
     • The New York Stock Exchange generates 1 TB of new trade data per day
     • Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage
     • The Internet Archive stores around 2 PB, and is growing at a rate of 20 TB per month
   • 'Big Data' also affects smaller organizations and individuals
     • Digital photos and an individual's interactions (phone calls, emails, documents) are captured and stored for later access
   • The amount of data generated by machines will be even greater than that generated by people
     • Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions

3. Data!
   • Data can be shared for anyone to download and analyze
     • Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
   • The Astrometry.net project
     • Watches the Astrometry group on Flickr for new photos of the night sky
     • Analyzes each image and identifies which part of the sky it shows
     • The project shows what is possible when data is made available and used for something that was not anticipated by its creator
   • Big Data is here. We are struggling to store and analyze it.

4. Data Storage and Analysis
   • Storage capacities have increased, but access speeds haven't kept up
     • Writing is even slower!
   • Solution: read and write data in parallel to/from multiple disks (see the sketch below)
   • Problems with this solution
     • Hardware failure → keep redundant copies of the data
       • RAID: redundant copies of the data are kept in case of failure
     • Most analyses need to combine the data on one disk with data from the other disks, which is hard to do correctly
   • What Hadoop provides
     • Reliable shared storage (HDFS)
     • Efficient analysis (MapReduce)
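A rough back-of-the-envelope sketch of why the parallel-disk idea matters. The 100 MB/s per-disk transfer rate and the 100-disk count are assumed, illustrative figures, not numbers from the slides:

```java
// Rough scan-time arithmetic: one disk vs. many disks read in parallel.
// Both the 100 MB/s transfer rate and the 100-disk count are assumptions.
public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 1e12;          // 1 TB to scan end to end
        double bytesPerSecPerDisk = 100e6;   // assumed ~100 MB/s per disk
        int disks = 100;                     // disks reading in parallel

        double oneDiskSecs = datasetBytes / bytesPerSecPerDisk;
        double parallelSecs = oneDiskSecs / disks;

        System.out.printf("1 disk:   %.0f min%n", oneDiskSecs / 60);        // ~167 min
        System.out.printf("%d disks: %.1f min%n", disks, parallelSecs / 60); // ~1.7 min
    }
}
```

The catch, as the slide notes, is that spreading data over many disks makes disk failures and cross-disk combination routine, which is exactly the pair of problems HDFS and MapReduce address.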

5. Comparison with Other Systems - RDBMS
   • RDBMS
     • B-Tree index
     • Optimized for accessing and updating a small proportion of records
   • MapReduce
     • Efficient when updating most of a large dataset; uses Sort/Merge to rebuild the database
     • Good when the whole dataset needs to be analyzed in a batch fashion
   • Structured vs. semi- or unstructured data
     • Structured data: a particular predefined schema → RDBMS
     • Semi- or unstructured data: a looser internal structure, or none at all → MapReduce
   • Normalization
     • To retain integrity and remove redundancy, relational data is often normalized
     • MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce

6. Comparison with Other Systems - RDBMS
   • RDBMS vs. MapReduce
     • Co-evolution of RDBMS and MapReduce systems
       • Relational databases are starting to incorporate some of the ideas from MapReduce
       • Higher-level query languages built on MapReduce are making MapReduce systems more approachable to traditional database programmers

7. Comparison with Other Systems - Grid Computing
   • Grid computing
     • The High Performance Computing (HPC) and Grid Computing communities have long been doing large-scale data processing
     • Using APIs such as the Message Passing Interface (MPI)
   • HPC
     • Distributes the work across a cluster of machines that access a shared filesystem hosted by a SAN
     • Works well for compute-intensive jobs
     • Hits a problem when nodes need to access larger data volumes (hundreds of GB): the network bandwidth becomes the bottleneck and compute nodes sit idle
   • Data locality, the heart of MapReduce
     • MapReduce collocates the data with the compute node, so data access is fast because it is local
   • MPI vs. MapReduce
     • MPI programmers need to handle the mechanics of the data flow explicitly
     • MapReduce programmers think in terms of functions over key-value pairs, and the data flow is implicit (see the sketch below)
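A minimal sketch of that key-value style: the classic word-count job written against the org.apache.hadoop.mapreduce API (the job setup follows the Hadoop 2.x idiom; details vary slightly across versions):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: one input line in, (word, 1) pairs out; the framework moves the data.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all counts for one word arrive together, already grouped and sorted.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note there is no networking code at all: the shuffle that routes every (word, 1) pair to the reducer responsible for that word is handled entirely by the framework, which is the implicit data flow the slide contrasts with MPI.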

8. Comparison with Other Systems - Grid Computing
   • Partial failure
     • MapReduce is a shared-nothing architecture → tasks have no dependence on one another → the order in which the tasks run doesn't matter, so the framework can simply rerun a failed task
     • MPI programs have to manage their own check-pointing and recovery

9. Comparison with Other Systems - Volunteer Computing
   • Volunteer computing projects
     • Break the problem into chunks called work units
     • Send them to computers around the world to be analyzed
     • The results are sent back to the server when the analysis is completed, and the client gets another work unit
   • SETI@home
     • Analyzes radio telescope data for signs of intelligent life outside Earth
   • SETI@home vs. MapReduce
     • SETI@home
       • Very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world; volunteers are donating CPU cycles, not bandwidth
       • Runs a perpetual computation on untrusted machines on the Internet, with highly variable connection speeds and no data locality
     • MapReduce
       • Designed to run jobs that last minutes or hours on trusted hardware running in a single data center with very high aggregate bandwidth interconnects

10. A Brief History of Hadoop
   • Hadoop
     • Created by Doug Cutting, the creator of Apache Lucene, the text search library
     • Has its origins in Apache Nutch, an open source web search engine and itself a part of the Lucene project
     • 'Hadoop' was the name that Doug's kid gave to a stuffed yellow elephant toy
   • History
     • In 2002, Nutch was started; a working crawler and search system quickly emerged, but its architecture wouldn't scale to the billions of pages on the Web
     • In 2003, Google published a paper describing the architecture of its distributed filesystem, GFS
     • In 2004, the Nutch project implemented the GFS ideas as the Nutch Distributed Filesystem, NDFS
     • In 2004, Google published the paper introducing MapReduce
     • In 2005, Nutch had a working MapReduce implementation; by the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS

11. A Brief History of Hadoop
   • History
     • In Jan. 2006, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
     • In Feb. 2006, Hadoop was split out of Nutch to form an independent subproject of Lucene
     • In Feb. 2008, Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster
     • In Apr. 2008, Hadoop broke a world record by sorting a terabyte of data
     • In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
     • In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds

12. Apache Hadoop and the Hadoop Ecosystem
   • The Hadoop projects covered in this book are the following (a short HDFS example follows the list):
     • Common – a set of components and interfaces for filesystems and I/O
     • Avro – a serialization system for RPC and persistent data storage
     • MapReduce – a distributed data processing model
     • HDFS – a distributed filesystem running on large clusters of machines
     • Pig – a data flow language and execution environment for large datasets
     • Hive – a distributed data warehouse providing a SQL-like query language
     • HBase – a distributed, column-oriented database
     • ZooKeeper – a distributed, highly available coordination service
     • Sqoop – a tool for efficiently moving data between relational databases and HDFS
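As a small taste of how these pieces fit together, here is a minimal sketch that streams a file out of HDFS through the generic FileSystem interface provided by Common (the hdfs://localhost/... URI is a made-up placeholder; substitute a real path on your cluster):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch: copy an HDFS file to stdout via the FileSystem abstraction.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // Placeholder URI for illustration only; pass a real path as args[0].
        String uri = args.length > 0 ? args[0] : "hdfs://localhost/user/example/data.txt";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf); // picks the impl for the scheme
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                    // open an input stream on the file
            IOUtils.copyBytes(in, System.out, 4096, false); // stream it to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```

Because FileSystem is a filesystem-neutral abstraction, the same program also works against a local file:// URI, which is one reason Common is listed separately from HDFS.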
