
Hadoop


michel

Presentation Transcript


  1. Hadoop

  2. Data Explosion
     • An IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006, forecasting tenfold growth to 1.8 zettabytes by 2011.
     • The New York Stock Exchange generates about one terabyte of new trade data per day.
     • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
     • The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month.
     • The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

  3. Hadoop Projects
     • Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
     • Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
     • MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
     • HDFS: A distributed filesystem that runs on large clusters of commodity machines.
     • Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

  4. Hadoop Projects
     • Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a SQL-based query language, which the runtime engine translates into MapReduce jobs, for querying the data.
     • HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
     • ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
     • Sqoop: A tool for efficiently moving data between relational databases and HDFS.
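The MapReduce processing model named in the project list can be illustrated with a small, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in memory on one machine, whereas Hadoop distributes the same three phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key.

    In a real cluster this step moves data between machines; here it
    is only a dictionary (an assumption made for illustration).
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over the whole input in one batch."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce", "MapReduce runs on HDFS"]))
```

Each map call sees only one line and each reduce call sees only one key, which is what lets a framework run many of them in parallel on separate machines.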

  5. RDBMS Compared to MapReduce
     • MapReduce can be seen as a complement to an RDBMS.
     • MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis.
     • An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data.
     • MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
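The two access patterns contrasted above can be sketched side by side. The example below is an illustration under stated assumptions, not a benchmark: the `trades` table, its columns, and all function names are hypothetical, SQLite (from Python's standard library) stands in for an RDBMS, and a plain loop stands in for a batch job over write-once data.

```python
import sqlite3
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[int, str, int]  # (id, symbol, volume); a hypothetical schema


def build_db(rows: Iterable[Row]) -> sqlite3.Connection:
    """Load rows into an in-memory table; the primary key is indexed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, volume INTEGER)"
    )
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
    return conn


def point_query(conn: sqlite3.Connection, trade_id: int) -> Optional[Tuple[str, int]]:
    """RDBMS style: fetch one record through the index."""
    cur = conn.execute(
        "SELECT symbol, volume FROM trades WHERE id = ?", (trade_id,)
    )
    return cur.fetchone()


def point_update(conn: sqlite3.Connection, trade_id: int, volume: int) -> None:
    """RDBMS style: change one record in place."""
    conn.execute("UPDATE trades SET volume = ? WHERE id = ?", (volume, trade_id))


def batch_total_volume(records: Iterable[Row]) -> Dict[str, int]:
    """Batch style: read every record once and aggregate over the whole set."""
    totals: Dict[str, int] = {}
    for _trade_id, symbol, volume in records:
        totals[symbol] = totals.get(symbol, 0) + volume
    return totals


if __name__ == "__main__":
    rows = [(1, "AAA", 100), (2, "BBB", 50), (3, "AAA", 25)]
    conn = build_db(rows)
    point_update(conn, 2, 75)
    print(point_query(conn, 2))
    print(batch_total_volume(rows))
```

The point query and update touch a single row and rely on the index, while the batch function must read the entire input, which is the kind of workload the slide assigns to MapReduce.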

  6. RDBMS Compared to MapReduce
