
A BigData Tour – HDFS, Ceph and MapReduce


Presentation Transcript


  1. A BigData Tour – HDFS, Ceph and MapReduce These slides are possible thanks to these sources – Jonathan Dursi – SciNet Toronto – Hadoop Tutorial; Amir Payberah – Course in Data Intensive Computing – SICS; Yahoo! Developer Network MapReduce Tutorial

  2. EXTRA MATERIAL

  3. CEPH – A HDFS replacement

  4. What is Ceph? • Ceph is a distributed, highly available, unified object, block and file storage system with no single point of failure (SPOF), running on commodity hardware

  5. Ceph Architecture – Host Level • At the host level… • We have Object Storage Devices (OSDs) and Monitors • Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are) • The cluster topology (device, host, rack, row, and room) is stored by the Monitors and used to compute failure domains • OSDs store the Ceph data objects • A host can run multiple OSDs, but it needs to be appropriately provisioned http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  6. Ceph Architecture – Block Level • At the block device level... • An Object Storage Device (OSD) can be backed by an entire drive, a partition, or a folder • OSDs must be formatted with ext4, XFS, or btrfs (experimental). https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913

  7. Ceph Architecture – Data Organization Level • At the data organization level… • Data are partitioned into pools • Pools contain a number of Placement Groups (PGs) • Ceph data objects map to PGs (via a hash of the object name, modulo the number of PGs in the pool) • PGs then map to multiple OSDs. https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
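
A minimal Python sketch of the object-to-PG mapping just described; it uses a generic hash and a plain modulo in place of Ceph's rjenkins hash and stable-mod, so the numbers are purely illustrative.

    import hashlib

    def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
        """Illustrative object-name-to-placement-group mapping.

        Real Ceph hashes the name with rjenkins and applies a 'stable mod'
        so pg_num can grow without remapping every object; md5 + modulo is
        used here only to show the principle.
        """
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        pg_id = h % pg_num
        return f"{pool_id}.{pg_id:x}"   # Ceph prints PGs as <pool id>.<pg id in hex>

    # Two objects in a pool with id 1 and 128 PGs
    print(object_to_pg(1, "foo", 128))
    print(object_to_pg(1, "bar", 128))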

  8. Ceph Placement Groups • Ceph shards a pool into placement groups that are distributed evenly and pseudo-randomly across the cluster • The CRUSH algorithm dynamically assigns each object to a placement group, and each placement group to a set of OSDs, creating a layer of indirection between the Ceph client and the OSDs that store the copies of an object • This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail Red Hat Ceph Architecture v1.2.3

  9. Ceph Architecture – Overall View https://www.terena.org/activities/tf-storage/ws16/slides/140210-low_cost_storage_ceph-openstack_swift.pdf

  10. Ceph Architecture – RADOS • An Application interacts with a RADOS cluster • RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects • On top of this reliable storage abstraction, Ceph builds a range of services, including a block storage abstraction (RBD, or RADOS Block Device) and a cache-coherent distributed file system (CephFS). http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
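
The object service described above is scriptable through the librados Python bindings. A short sketch, assuming a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named 'data' (the pool and object names are placeholders):

    import rados

    # Connect to the RADOS cluster described by the local ceph.conf
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # An I/O context is bound to one pool; all objects below live in 'data'
        ioctx = cluster.open_ioctx('data')
        try:
            ioctx.write_full('hello-object', b'Hello, RADOS')  # create/overwrite the object
            print(ioctx.read('hello-object'))                  # read it back
            ioctx.set_xattr('hello-object', 'lang', b'en')     # objects can also carry attributes
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()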

  11. Ceph Architecture – RADOS Components http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  12. Ceph Architecture – Where Do Objects Live? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  13. Ceph Architecture – Where Do Objects Live? • Contact a Metadata server? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  14. Ceph Architecture – Where Do Objects Live? • Or calculate the placement via static mapping? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  15. Ceph Architecture – CRUSH Maps http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  16. Ceph Architecture – CRUSH Maps • Data objects are distributed across Object Storage Devices (OSDs), which refer to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing) • CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g., disks, hosts, racks, rows, datacenters) • The location of objects can be calculated from the object identifier and the cluster layout (similar to consistent hashing), so there is no need for a metadata index or server for the RADOS object store http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
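
To illustrate "placement by calculation" without reproducing CRUSH itself, the toy sketch below uses rendezvous-style hashing: every client, given only a PG identifier and the OSD list, computes the same acting set with no metadata lookup. Real CRUSH additionally walks a weighted hierarchy (hosts, racks, rows) so replicas land in different failure domains, which this sketch omits.

    import hashlib

    def score(pg_id, osd_id):
        """Deterministic pseudo-random score for a (PG, OSD) pair."""
        return int(hashlib.sha256(f"{pg_id}:{osd_id}".encode()).hexdigest(), 16)

    def place(pg_id, osds, replicas=3):
        """Rank OSDs by score and keep the top `replicas` (rendezvous hashing)."""
        return sorted(osds, key=lambda osd: score(pg_id, osd), reverse=True)[:replicas]

    osds = [0, 1, 2, 3, 4, 5]
    acting = place(17, osds)
    print("acting set for PG 17:", acting)

    # Simulate losing the primary OSD: every client recomputes the same new set
    survivors = [o for o in osds if o != acting[0]]
    print("after losing OSD", acting[0], ":", place(17, survivors))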

  17. Ceph Architecture – CRUSH – 1/2 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  18. Ceph Architecture – CRUSH – 2/2 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  19. Ceph Architecture – librados http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  20. Ceph Architecture – RADOS Gateway http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  21. Ceph Architecture – RADOS Block Device (RBD) – 1/3 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  22. Ceph Architecture – RADOS Block Device (RBD) – 2/3 • Virtual Machine storage using RBD • Live Migration using RBD http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
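
Besides the kernel and hypervisor paths shown on the slides, RBD images can also be created and accessed programmatically via the python-rbd bindings. A sketch, assuming the pool is named 'rbd' and with a placeholder image name:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                        # pool name is an assumption
    try:
        rbd.RBD().create(ioctx, 'vm-disk-0', 4 * 1024**3)    # 4 GiB image, thin-provisioned
        image = rbd.Image(ioctx, 'vm-disk-0')
        try:
            image.write(b'bootloader goes here', 0)          # block-style I/O at byte offsets
            print(image.read(0, 20))
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()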

  23. Ceph Architecture – RADOS Block Device (RBD) – 3/3 • Direct host access from Linux http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  24. Ceph Architecture – CephFS – POSIX F/S http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf

  25. Ceph – Read/Write Flows https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction

  26. Ceph Replicated I/O Red Hat Ceph Architecture v1.2.3

  27. Ceph – Erasure Coding – 1/5 • Erasure coding is a theory that dates back to the 1960s. The most famous algorithm is Reed-Solomon; many variations followed, such as Fountain Codes, Pyramid Codes and Locally Repairable Codes. • An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K • E.g., a typical Reed-Solomon scheme is RS(8, 5), where 8 is the total number of disks and 5 is the number of data disks • RS(8, 5) can tolerate 3 arbitrary failures: if some data chunks are missing, the remaining available chunks can be used to restore the original content. https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
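
To make the overhead figure concrete: RS(8, 5) consumes 1.6x raw capacity per byte of user data while tolerating 3 lost chunks, whereas 3-way replication consumes 3x and tolerates only 2 losses. A back-of-the-envelope sketch:

    def raw_capacity(user_bytes, n, k):
        """Raw bytes consumed by an (N, K) erasure code: overhead = N / K."""
        return user_bytes * n / k

    PB = 10**15
    print(raw_capacity(1 * PB, 8, 5) / PB)   # RS(8,5): 1.6 PB raw per PB of data, survives 3 failures
    print(raw_capacity(1 * PB, 3, 1) / PB)   # 3-way replication as a (3,1) code: 3.0 PB raw, survives 2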

  28. Ceph – Erasure Coding – 2/5 • Like replicated pools, in an erasure-coded pool the primary OSD in the up set receives all write operations • In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set • For erasure coding, the process is a bit different: an erasure-coded pool stores each object as K+M chunks, divided into K data chunks and M coding chunks. The pool is configured to have a size of K+M so that each chunk is stored on an OSD in the acting set. • The rank of the chunk is stored as an attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and for sending them to the other OSDs. It is also responsible for maintaining an authoritative version of the placement group logs. https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction

  29. Ceph – Erasure Coding – 3/5 • 5 OSDs (K+M = 5); sustain loss of 2 (M = 2) • Object NYAN with data “ABCDEFGHI” is split into 3 chunks (K = 3); the payload is padded if its length is not a multiple of K • Coding chunks are YXY and QGC Red Hat Ceph Architecture v1.2.3
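
A sketch of the split-and-pad step only; the actual coding chunks (YXY and QGC in the example) come from the erasure-code plugin's Reed-Solomon arithmetic, which is not reproduced here.

    def split_into_chunks(payload: bytes, k: int) -> list:
        """Split an object payload into k equally sized data chunks, padding the tail."""
        chunk_len = -(-len(payload) // k)               # ceiling division
        padded = payload.ljust(chunk_len * k, b'\0')    # pad so the length is a multiple of k
        return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]

    print(split_into_chunks(b"ABCDEFGHI", 3))   # [b'ABC', b'DEF', b'GHI'] -> one data chunk per OSD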

  30. Ceph – Erasure Coding – 4/5 • On reading object NYAN from an erasure-coded pool, the decoding function needs any K = 3 of the 5 chunks • If up to M = 2 chunks are missing (an erasure), the decoding function can reconstruct the original object from the remaining chunks Red Hat Ceph Architecture v1.2.3
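
As a toy illustration of rebuilding after an erasure, the sketch below uses a single XOR parity chunk (effectively K = 3, M = 1) instead of the Reed-Solomon K = 3, M = 2 code of the NYAN example; the principle of reconstructing a lost chunk from the survivors is the same.

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    data = [b'ABC', b'DEF', b'GHI']     # K = 3 data chunks
    parity = reduce(xor, data)          # M = 1 coding chunk

    # Simulate an erasure: chunk 2 (b'DEF') is lost together with its OSD
    survivors = [data[0], data[2], parity]
    print(reduce(xor, survivors))       # XOR of the survivors restores b'DEF'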

  31. Ceph – Erasure Coding – 5/5 • 5 OSDs (K+M = 5); sustain loss of 2 (M = 2) • Object NYAN with data “ABCDEFGHI” is split into 3 chunks (K = 3); the payload is padded if its length is not a multiple of K • Coding chunks are YXY and QGC Red Hat Ceph Architecture v1.2.3
