1 / 19

Managing Big Data in Distributed Systems

Learn how to manage big data in distributed settings using Google File System (GFS) and Hadoop Distributed File System (HDFS).

rhitchcock
Download Presentation

Managing Big Data in Distributed Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Managementwith Google File SystemPramod Bhatotiawp.mpi-sws.org/~bhatotia Distributed Systems, 2016

  2. From the last class:How big is “Big Data”? processes 20 PB a day (2008) crawls 20B web pages a day (2012) >10 PB data, 75B DB calls per day (6/2012) >100 PB of user data + 500 TB/day (8/2012) S3: 449B objects, peak 290k request/second (7/2011) 1T objects (6/2012)

  3. Distributed systems for “Big Data” management Single disk Multiple disks Bandwidth: ~180 MB/sec, Read time 1PB: ~65 days! Aggregate bandwidth: ~ No. of disks*180 MB/sec

  4. In today’s class How to manage “Big data” in distributed setting? Google File System (GFS) Open source: Hadoop Distributed File System (HDFS)

  5. Read File Network Data Client Server Distributed file system From Google search: “A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer.” E.g.: NFS, AFS

  6. Key design requirements Large files Scalable Performance Write once (or append only) Reliable Namespace Available Concurrency

  7. Beauty of GFS/HDFS MapReduce Applications GFS/HDFS

  8. Interface/API Simple shell interface $ hadoopdfs -<fs commands> <arguments> <fs commands>: mv, cp, chmod, copyFromLocal, copyToLocal, rm, du, etc. Additional commands for snapshots, appends Similar programming APIs to access HDFS Not compatible w/ POSIX API; HDFS only allows appends

  9. Architecture • Namespace • ACL • Data management • Lease management • Garbage collection • Reliability of chunks Master node (NameNode) Distributed file system: GFS/HDFS

  10. Architecture Master node (NameNode) Data Node Data Node Data Node Data Node

  11. Basic functioning Namenode (metadata) Metadata Client (caches metadata) Input file Data (read/write) Data Node Data Node

  12. Read operation Master node (NameNode) Metadata (red chunk) Client Read red chunk Data Node Data Node

  13. Write operation Master node (NameNode) Metadata (red chunk) Client C B A Send data to buffer Data Node Data Node Data Node

  14. Commit order Using lease management by the master Primary (Replica A) Secondary (Replicas B & C) C B A Data Node Data Node Data Node U3, U2, U1 U1, U2, U3 U1, U3, U2 Update buffer at B Update buffer at A Update buffer at C

  15. Primary commit order Primary (Replica A) Secondary (Replicas B & C) Client Write complete C B A Data Node Data Node Data Node U1, U2, U3 U1, U2, U3 U1, U2, U3 Commit order Commit order Commit order

  16. Consistency semantics Updates (only for GFS) File region may end up containing mingled fragments from different clients E.g., writes to different chunks may be ordered differently by their different primary replica Thus, writes are consistent but undefined in GFS Appends Append causes data to be appended atomically at least once Offset chosen by HDFS, not by the client

  17. Discussion: Other design details Chunk replication: placement and re-balancing Namespace management & locking Garbage collection (lazy) Data integrity Snapshots (using copy-on-write) Master replication

  18. References GFS [SOSP’03] Original Google File System paper Tachyon [SoCC’13] Distributed in-memory file system for Spark TidyFS [USENIX ATC’11] A distributed file system by Microsoft Resources: www.hadoop.apache.org

  19. Thanks! Pramod Bhatotiawp.mpi-sws.org/~bhatotia

More Related