
Distributed and Parallel Processing Technology, Chapter 1. Meet Hadoop



Presentation Transcript


  1. Distributed and Parallel Processing Technology, Chapter 1. Meet Hadoop (Sun Jo)

  2. Data!
     • We live in the data age.
        • One estimate put the total volume of digital data at 0.18 ZB in 2006 and forecast tenfold growth to 1.8 ZB by 2011
        • 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
     • The flood of data is coming from many sources
        • The New York Stock Exchange generates 1 TB of new trade data per day
        • Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage
        • The Internet Archive stores around 2 PB and is growing at a rate of 20 TB per month
     • ‘Big Data’ also affects smaller organizations and individuals
        • Digital photos and an individual’s interactions (phone calls, emails, documents) are captured and stored for later access
     • The amount of data generated by machines will be even greater than that generated by people
        • Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions
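The unit relationships and growth figures on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) prefixes as the slide uses; the `UNITS` table and `convert` helper are illustrative names, not part of any Hadoop API.

```python
# Decimal (SI) storage units, as on the slide: each step up is a factor of 1,000.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def convert(value, from_unit, to_unit):
    """Convert a quantity between decimal storage units."""
    return value * UNITS[from_unit] / UNITS[to_unit]

# 1 ZB expressed in smaller units.
zb_in_eb = convert(1, "ZB", "EB")
zb_in_pb = convert(1, "ZB", "PB")
zb_in_tb = convert(1, "ZB", "TB")

# Forecast growth from 0.18 ZB (2006) to 1.8 ZB (2011), computed in EB
# so the ratio uses whole numbers.
growth_factor = convert(1.8, "ZB", "EB") / convert(0.18, "ZB", "EB")

# Months for the Internet Archive to add 1 PB at 20 TB per month.
months_per_pb = convert(1, "PB", "TB") / 20

print(zb_in_eb, zb_in_pb, zb_in_tb, growth_factor, months_per_pb)
```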

  3. Data!
     • Data can be shared for anyone to download and analyze
        • Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
     • Astrometry.net project
        • Watches the astrometry group on Flickr for new photos of the night sky
        • Analyzes each image and identifies which part of the sky it shows
        • The project shows what becomes possible when data is made available and used for something its creator did not anticipate
     • Big Data is here. We are struggling to store and analyze it.

  4. Data Storage and Analysis
     • Storage capacities have increased, but access speeds haven’t kept up
        • Writing is even slower!
     • Solution: read and write data in parallel to/from multiple disks
     • Problems
        • Hardware failure → solved by replication
           • RAID: redundant copies of the data are kept in case of failure
        • Combining the data read from one disk with the data from the others
     • What Hadoop provides
        • Reliable shared storage (HDFS)
        • Efficient analysis (MapReduce)
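To see why reading from multiple disks in parallel helps, a rough back-of-the-envelope calculation is useful. The sketch below assumes a 1 TB dataset, a sustained transfer rate of 100 MB/s per disk, and 100 disks each holding an equal share; these figures are illustrative assumptions and do not come from the slide.

```python
TB = 10**12
MB = 10**6

def read_time_seconds(data_bytes, transfer_bytes_per_s, num_disks=1):
    """Time to read data that is split evenly across disks read in parallel."""
    return data_bytes / num_disks / transfer_bytes_per_s

# Assumed figures (not from the slide): 1 TB of data, 100 MB/s per disk.
single_disk = read_time_seconds(1 * TB, 100 * MB)
hundred_disks = read_time_seconds(1 * TB, 100 * MB, num_disks=100)

print(f"1 disk:    {single_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions a single disk needs close to three hours, while 100 disks working in parallel finish in under two minutes. The model ignores the two problems the slide lists, hardware failure and combining the partial results.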

  5. Comparison with Other Systems - RDBMS
     • RDBMS
        • B-Tree index
        • Optimized for accessing and updating a small proportion of records
     • MapReduce
        • Efficient for updating the majority of a large dataset; uses Sort/Merge to rebuild the database
        • Well suited to problems that need to analyze the whole dataset in a batch fashion
     • Structured vs. semi-structured or unstructured data
        • Structured data: conforms to a particular predefined schema → RDBMS
        • Semi-structured or unstructured data: looser or no particular internal structure → MapReduce
     • Normalization
        • Relational data is often normalized to retain its integrity and remove redundancy
        • MapReduce performs high-speed streaming reads and writes, and records that are not normalized are well suited to analysis with MapReduce
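The trade-off between in-place updates and a streaming Sort/Merge rebuild can be illustrated with a simple cost model. This is only a sketch: the seek time, transfer rate, record size, and record count below are assumed values chosen for illustration, and the model ignores caching, index depth, and the sort itself.

```python
# All constants are assumptions for illustration; none come from the slide.
SEEK_S = 0.010                 # assumed seek time per randomly accessed record
TRANSFER_BPS = 100 * 10**6     # assumed sequential transfer rate (100 MB/s)
RECORD_BYTES = 100             # assumed record size
NUM_RECORDS = 10**9            # assumed dataset: 10^9 records (100 GB)

def random_update_time(fraction):
    """In-place updates: roughly one seek per updated record."""
    return fraction * NUM_RECORDS * SEEK_S

def full_rewrite_time():
    """Streaming rebuild: read the whole dataset once and write it once."""
    return 2 * NUM_RECORDS * RECORD_BYTES / TRANSFER_BPS

# Fraction of records above which streaming the whole dataset is cheaper.
break_even_fraction = full_rewrite_time() / (NUM_RECORDS * SEEK_S)

print(f"rewrite everything: {full_rewrite_time():.0f} s")
print(f"update 1% in place: {random_update_time(0.01):.0f} s")
print(f"break-even fraction: {break_even_fraction:.4%}")
```

With these assumed numbers, seeking to each record only wins when a very small proportion of the dataset changes, which matches the slide's contrast between the two systems.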

  6. Comparison with Other Systems - RDBMS
     • RDBMS vs. MapReduce
        • RDBMS and MapReduce systems are co-evolving
        • Relational databases are starting to incorporate some of the ideas from MapReduce
        • Higher-level query languages are being built on MapReduce
        • These make MapReduce systems more approachable to traditional database programmers

  7. Comparison with Other Systems - Grid Computing
     • Grid Computing
        • The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing
        • Using APIs such as the Message Passing Interface (MPI)
     • HPC
        • Distributes the work across a cluster of machines, which access a shared filesystem hosted by a SAN
        • Works well for compute-intensive jobs
        • Runs into problems when nodes need to access larger data volumes (hundreds of GB): network bandwidth becomes the bottleneck and compute nodes sit idle
     • Data locality, the heart of MapReduce
        • MapReduce co-locates the data with the compute node, so data access is fast because it is local
     • MPI vs. MapReduce
        • MPI programmers need to handle the mechanics of the data flow explicitly
        • MapReduce programmers think in terms of functions of key and value pairs, and the data flow is implicit
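The idea of thinking "in terms of functions of key and value pairs" can be sketched in plain Python. The example below is a single-process imitation of the programming model, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict

def map_fn(_offset, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # implicit data flow ("shuffle")
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            output[out_key] = out_value
    return output

# Input records are (byte offset, line) pairs; the offsets are illustrative.
lines = [(0, "the quick brown fox"), (20, "the lazy dog"), (33, "The end")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only the two functions; moving intermediate pairs between the map and reduce steps is left to `run_mapreduce`, which is the part an MPI programmer would have to code by hand.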

  8. Comparison with Other Systems - Grid Computing
     • Partial failure
        • MapReduce is a shared-nothing architecture → tasks have no dependence on one another → the order in which the tasks run doesn’t matter
        • MPI programs have to manage their own checkpointing and recovery
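Because tasks are independent, a failed task can simply be run again without coordinating with the others. The following sketch illustrates that idea in one process; it is not how Hadoop schedules tasks. `run_tasks`, `make_flaky`, and the split names are hypothetical, and the failure is simulated.

```python
import random

def make_flaky(value, failures):
    """Return a task that raises `failures` times before returning `value`."""
    remaining = {"count": failures}
    def task():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated node failure")
        return value
    return task

def run_tasks(tasks, max_attempts=3, seed=0):
    """Run independent tasks in a shuffled order, re-queuing any that fail."""
    rng = random.Random(seed)
    pending = list(tasks.items())
    rng.shuffle(pending)                   # order is arbitrary by design
    attempts = {name: 0 for name in tasks}
    results = {}
    while pending:
        name, task = pending.pop(0)
        attempts[name] += 1
        try:
            results[name] = task()
        except RuntimeError:
            if attempts[name] >= max_attempts:
                raise
            pending.append((name, task))   # retry later; other tasks unaffected
    return results, attempts

def build_tasks():
    return {
        "split-0": lambda: sum(range(0, 10)),
        "split-1": make_flaky(sum(range(10, 20)), failures=1),
        "split-2": lambda: sum(range(20, 30)),
    }

results_a, attempts_a = run_tasks(build_tasks(), seed=1)
results_b, attempts_b = run_tasks(build_tasks(), seed=2)
print(results_a == results_b, attempts_a)
```

Both runs produce the same results regardless of execution order, and only the failed task is repeated.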

  9. Comparison with Other Systems - Volunteer Computing
     • Volunteer computing projects
        • Break the problem into chunks called work units
        • Send the work units to computers around the world to be analyzed
        • The results are sent back to the server when the analysis is completed
        • The client then gets another work unit
     • SETI@home
        • Analyzes radio telescope data for signs of intelligent life outside Earth
     • SETI@home vs. MapReduce
        • SETI@home
           • Very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world. Volunteers are donating CPU cycles, not bandwidth
           • Runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality
        • MapReduce
           • Designed to run jobs that last minutes or hours on hardware running in a single data center with very high aggregate bandwidth interconnects
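The work-unit cycle described above (split, send, analyze, return, fetch the next unit) can be sketched as a small simulation. This is an illustration only, not SETI@home's actual protocol: the `analyze` function, the client names, and the round-robin hand-out are all assumptions, and everything runs in one process.

```python
from collections import deque

def make_work_units(samples, unit_size):
    """Break the problem into fixed-size chunks (work units)."""
    return deque(samples[i:i + unit_size] for i in range(0, len(samples), unit_size))

def analyze(unit):
    """Hypothetical client-side analysis: report the strongest sample."""
    return max(unit)

def run_server(samples, unit_size, clients):
    """Hand out work units to clients in turn until none are left."""
    queue = make_work_units(samples, unit_size)
    results = []
    completed = {client: 0 for client in clients}
    while queue:
        for client in clients:
            if not queue:
                break
            unit = queue.popleft()          # server sends a work unit
            results.append(analyze(unit))   # client analyzes and returns result
            completed[client] += 1          # client is ready for another unit
    return results, completed

results, completed = run_server(list(range(10)), unit_size=3, clients=["a", "b"])
print(results, completed)
```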

  10. A Brief History of Hadoop
     • Hadoop
        • Created by Doug Cutting, the creator of Apache Lucene, a text search library
        • Has its origins in Apache Nutch, an open source web search engine that was itself part of the Lucene project
        • ‘Hadoop’ is the name that Doug’s kid gave to a stuffed yellow elephant toy
     • History
        • In 2002, Nutch was started
           • A working crawler and search system emerged
           • Its architecture wouldn’t scale to the billions of pages on the Web
        • In 2003, Google published a paper describing the architecture of its distributed filesystem, GFS
        • In 2004, the Nutch project implemented the GFS idea as the Nutch Distributed Filesystem (NDFS)
        • In 2004, Google published the paper introducing MapReduce
        • In 2005, Nutch had a working MapReduce implementation
           • By the middle of that year, all the major Nutch algorithms had been ported to run using MapReduce and NDFS

  11. A Brief History of Hadoop
     • History
        • In Jan. 2006, Doug Cutting joined Yahoo!
           • Yahoo! provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale
        • In Feb. 2006, Doug Cutting started an independent subproject of Lucene, called Hadoop
        • In Feb. 2008, Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster
        • In Apr. 2008, Hadoop broke a world record for sorting a terabyte of data
        • In Nov. 2008, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds
        • In May 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds

  12. Apache Hadoop and the Hadoop Ecosystem
     • The Hadoop projects covered in this book are the following
        • Common: a set of components and interfaces for filesystems and I/O
        • Avro: a serialization system for RPC and persistent data storage
        • MapReduce: a distributed data processing model
        • HDFS: a distributed filesystem running on large clusters of machines
        • Pig: a data flow language and execution environment for large datasets
        • Hive: a distributed data warehouse providing a SQL-like query language
        • HBase: a distributed, column-oriented database
        • ZooKeeper: a distributed, highly available coordination service
        • Sqoop: a tool for efficiently moving data between relational databases and HDFS
