
A BigData Tour – HDFS, Ceph and MapReduce


Presentation Transcript


  1. A BigData Tour – HDFS, Ceph and MapReduce These slides are possible thanks to these sources – Jonathan Dursi - SCInet Toronto – Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial

  2. Data Management and Processing • Data intensive computing • Concerned with the production, manipulation and analysis of data in the range of hundreds of megabytes (MB) to petabytes (PB) and beyond • A range of supporting parallel and distributed computing technologies to deal with the challenges of data representation, reliable shared storage, efficient algorithms and scalable infrastructure to perform analysis

  3. Challenges Ahead • Challenges with data intensive computing • Scalable algorithms that can search and process massive datasets • New metadata management technologies that can scale to handle complex, heterogeneous and distributed data sources • Support for accessing in-memory multi-terabyte data structures • High performance, highly reliable petascale distributed file systems • Techniques for data reduction and rapid processing • Software mobility to move computation where data is located • Hybrid interconnects with support for multi-gigabyte data streams • Flexible and high performance software integration techniques • Hadoop • A family of related projects, best known for MapReduce and the Hadoop Distributed File System (HDFS)

  4. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  5. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  6. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  7. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  8. Why Hadoop • Drivers • 500M+ unique users per month • Billions of interesting events per day • Data analysis is key • Need massive scalability • PBs of storage, millions of files, 1000s of nodes • Need to do this cost effectively • Use commodity hardware • Share resources among multiple projects • Provide scale when needed • Need reliable infrastructure • Must be able to deal with failures – hardware, software, networking • Failure is expected rather than exceptional • Transparent to applications • Very expensive to build reliability into each application • The Hadoop infrastructure provides these capabilities

  9. Introduction to Hadoop • Apache Hadoop • Based on the 2004 Google MapReduce paper • Originally composed of HDFS (a distributed file system), a core runtime and an implementation of MapReduce • Open Source – Apache Foundation project • Yahoo! is an Apache Platinum Sponsor • History • Started in 2005 by Doug Cutting • Yahoo! became the primary contributor in 2006 • Yahoo! scaled it from 20-node clusters to 4,000-node clusters today • Portable • Written in Java • Runs on commodity hardware • Linux, Mac OS X, Windows, and Solaris

  10. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  11. HPC vs Hadoop • HPC attitude – “The problem of disk-limited, loosely-coupled data analysis was solved by throwing more disks and using weak scaling” • Flip-side: A single novice developer can write real, scalable, 1000+ node data-processing tasks in Hadoop-family tools in an afternoon • MPI... less so Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  12. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  13. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  14. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  15. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  16. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  17. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  18. Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  19. Everything is converging – 1/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  20. Everything is converging – 2/2 Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial

  21. Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm

  22. Big Data – Storage (sans POSIX) Amir Payberah https://www.sics.se/~amir/dic.htm

  23. Big Data - Databases Amir Payberah https://www.sics.se/~amir/dic.htm

  24. Big Data – Resource Management Amir Payberah https://www.sics.se/~amir/dic.htm

  25. YARN – 1/3 • To address Hadoop v1 deficiencies in scalability, memory usage and synchronization, the Yet Another Resource Negotiator (YARN) Apache sub-project was started • Previously, a single JobTracker service managed the whole cluster (with a TaskTracker on each worker node); its roles were then split into separate daemons for • Resource management • Job scheduling/monitoring Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

  26. YARN – 2/3 • YARN splits the JobTracker’s responsibilities into • Resource management – the global Resource Manager daemon • A per-application Application Master • The resource manager and the per-node slave Node Managers allow generic node management • The resource manager has a pluggable scheduler Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

  27. YARN – 3/3 • The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container which incorporates resource elements such as memory, cpu, disk, network • The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager. • The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container. Hortonworks http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
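The container abstraction described above can be made concrete with YARN's Java client API. The following is a minimal sketch, assuming a reachable ResourceManager and using illustrative host, port and resource values, of how an ApplicationMaster can describe a Resource Container (memory and virtual cores) and request it from the Scheduler through the AMRMClient API:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // Client the ApplicationMaster uses to talk to the ResourceManager/Scheduler
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register this ApplicationMaster (host, port and tracking URL are placeholders)
    rmClient.registerApplicationMaster("appmaster-host", 0, "");

    // A Resource Container is described abstractly: memory (MB) and virtual cores
    Resource capability = Resource.newInstance(2048, 2);
    Priority priority = Priority.newInstance(0);

    // Ask the Scheduler for one container anywhere in the cluster
    // (nodes/racks left null = no locality constraint)
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // A real ApplicationMaster would now call allocate() in a heartbeat loop,
    // launch work in the granted containers via NMClient, and finally unregister.
  }
}
```

Note that, as the slide says, the ApplicationMaster itself already runs inside a container; the request above only asks for the additional containers the application's tasks will run in.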

  28. Big Data – Execution Engine Amir Payberah https://www.sics.se/~amir/dic.htm

  29. Big Data – Query/Scripting Languages Amir Payberah https://www.sics.se/~amir/dic.htm

  30. Big Data – Stream Processing Amir Payberah https://www.sics.se/~amir/dic.htm

  31. Big Data – Graph Processing Amir Payberah https://www.sics.se/~amir/dic.htm

  32. Big Data – Machine Learning Amir Payberah https://www.sics.se/~amir/dic.htm

  33. Hadoop Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm

  34. Spark Big Data Analytics Stack Amir Payberah https://www.sics.se/~amir/dic.htm

  35. Hadoop Ecosystem Hortonworks http://hortonworks.com/industry/manufacturing/

  36. Hadoop Ecosystem • 2008 onwards – usage exploded • Creation of many tools on top of Hadoop infrastructure

  37. The Need For Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm

  38. Distributed Filesystems Amir Payberah https://www.sics.se/~amir/dic.htm

  39. Hadoop Distributed File System (HDFS) • A distributed file system designed to run on commodity hardware • HDFS was originally built as infrastructure for the Apache Nutch web search engine project, with the aim of achieving fault tolerance, running on low-cost hardware and handling large datasets • It is now an Apache Hadoop subproject • Shares similarities with existing distributed file systems and supports traditional hierarchical file organization • Reliable data replication and accessible via Web interface and shell commands • Benefits: Fault tolerance, high throughput, streaming data access, robustness and handling of large data sets • HDFS is not a general purpose file system

  40. Assumptions and Goals • Hardware failures • Detection of faults, quick and automatic recovery • Streaming data access • Designed for batch processing rather than interactive use by users • Large data sets • Applications that run on HDFS have large data sets, typically gigabytes to terabytes in size • Optimized for batch reads rather than random reads • Simple coherency model • Applications need a write-once, read-many-times access model for files • Computation migration • Computation is moved closer to where data is located • Portability • Easily portable between heterogeneous hardware and software platforms
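The write-once, read-many coherency model maps directly onto the standard org.apache.hadoop.fs.FileSystem API. Below is a minimal sketch of that pattern; the NameNode address and file path are illustrative assumptions, not values from the slides:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; normally picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/events.log");   // hypothetical path

    // Write once: the file is created, streamed to, and closed
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.write("event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: subsequent jobs stream the now-immutable file contents
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```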

  41. What HDFS is not good for Amir Payberah https://www.sics.se/~amir/dic.htm

  42. HDFS Architecture • The Hadoop Distributed File System (HDFS) • Offers a way to store large files across multiple machines, rather than requiring a single machine to have disk capacity equal to or greater than the total size of the files • HDFS is designed to be fault-tolerant • Using data replication and distribution of data • When a file is loaded into HDFS, it is broken up into "blocks" of data which are replicated • These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes. http://www.revelytix.com/?q=content/hadoop-ecosystem
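The block-to-DataNode mapping the slide describes can be inspected from a client. A minimal sketch, using a hypothetical file path and the cluster settings from core-site.xml, of asking the NameNode which DataNodes hold each block via FileSystem.getFileBlockLocations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // uses core-site.xml settings
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-input.csv")); // hypothetical file

    // One BlockLocation per block; each lists the DataNodes holding a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
  }
}
```

This is also the information the MapReduce layer uses to schedule tasks close to the data, in line with the computation-migration goal above.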

  43. Files and Blocks – 1/3 Amir Payberah https://www.sics.se/~amir/dic.htm

  44. Files and Blocks – 2/3 Amir Payberah https://www.sics.se/~amir/dic.htm

  45. Files and Blocks – 3/3 Amir Payberah https://www.sics.se/~amir/dic.htm

  46. HDFS Daemons • An HDFS cluster is managed by three types of processes • Namenode • Manages the filesystem, e.g., namespace, metadata, and file blocks • Metadata is stored in memory • Datanode • Stores and retrieves data blocks • Reports to Namenode • Runs on many machines • Secondary Namenode • Only for checkpointing • Not a backup for Namenode Amir Payberah https://www.sics.se/~amir/dic.htm

  47. Hadoop Server Roles http://www.revelytix.com/?q=content/hadoop-ecosystem

  48. NameNode – 1/3 • The HDFS namespace is a hierarchy of files and directories • These are represented in the NameNode using inodes • Inodes record attributes • permissions, modification and access times; • namespace and disk space quotas. • The file content is split into large blocks (typically 128 megabytes, but user selectable file-by-file), and each block of the file is independently replicated at multiple DataNodes (typically three, but user selectable file-by-file) • The NameNode maintains the namespace tree and the mapping of blocks to DataNodes • A Hadoop cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently http://www.revelytix.com/?q=content/hadoop-ecosystem
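The "user selectable file-by-file" block size and replication factor mentioned above can be set at create time. A minimal sketch using a FileSystem.create overload; the path and the 256 MB / 5-replica values are illustrative, not defaults:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSettings {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/demo/wide-replica.dat"); // hypothetical path

    // Block size and replication are chosen per file at create time:
    // here 256 MB blocks replicated to 5 DataNodes instead of the usual 128 MB / 3.
    long blockSize = 256L * 1024 * 1024;
    short replication = 5;
    int bufferSize = 4096;

    try (FSDataOutputStream out =
             fs.create(path, true /* overwrite */, bufferSize, replication, blockSize)) {
      out.writeBytes("payload written with custom block size and replication\n");
    }

    // Replication can also be changed later without rewriting the file
    fs.setReplication(path, (short) 3);
  }
}
```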
