
Bigtable: A Distributed Storage System for Structured Data


Presentation Transcript


1. Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006
Presented by: Anumeha Srivastava

2. What is Bigtable?
• Bigtable is a distributed storage system for managing structured data, designed to scale to petabytes of storage across thousands of commodity servers.
• Properties of Bigtable: wide applicability, scalability, high performance, and high availability.

3. Data Model
• Bigtable does not support a full relational model.
• Bigtable is a sparse, distributed, multi-dimensional sorted map, indexed by a row key, a column key, and a timestamp. Each value is an uninterpreted array of bytes:
(row:string, column:string, time:int64) → string
• Row keys –
  • Arbitrary strings up to 64 KB in size.
  • Data is arranged in lexicographic order by row key.
  • The row range for a table is dynamically partitioned. Each row range is called a tablet (the unit of distribution and load balancing).
• Column families –
  • Sets of column keys.
  • The basic unit of access control – a column family must be created before any data can be stored under a column key in that family.
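Below is a minimal, hypothetical sketch of this data model as a sorted map keyed by (row, column, timestamp); the helper names are illustrative, not Bigtable's actual code:

```python
import bisect

# Minimal, hypothetical sketch of the map: (row, column, timestamp) -> bytes.
# Timestamps are negated in the key so newer versions of a cell sort first.
keys = []    # sorted composite keys
cells = {}   # (row, column, -timestamp) -> uninterpreted bytes

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    k = (row, column, -timestamp)
    if k not in cells:
        bisect.insort(keys, k)       # keep lexicographic key order
    cells[k] = value

def get_latest(row: str, column: str):
    # Find the first key at or after (row, column, newest possible timestamp).
    i = bisect.bisect_left(keys, (row, column, float("-inf")))
    if i < len(keys) and keys[i][:2] == (row, column):
        return cells[keys[i]]
    return None

# Row keys like the paper's reversed-hostname example:
put("com.cnn.www", "contents:", 5, b"<html>...old...")
put("com.cnn.www", "contents:", 6, b"<html>...new...")
assert get_latest("com.cnn.www", "contents:") == b"<html>...new..."
```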

4. Data Model
• Column families (contd.) –
  • A column key is named as family:qualifier.
  • Access control is performed at the column-family level.
• Timestamps –
  • 64-bit integers, assigned either automatically by Bigtable or explicitly by the client.
  • Enable storage of multiple versions of the same data.
  • A client can specify that only the n newest versions of a cell be kept, or only versions written within the last n days.
This data model was chosen with a variety of potential uses in mind, for example storing a copy of a large collection of web pages.
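A short, hypothetical sketch of those two garbage-collection policies applied to one cell's versions (the function and its parameters are illustrative, not a real API):

```python
import time

# Hypothetical version garbage collection for one cell, mirroring the two
# per-column-family policies above: keep only the n newest versions, or
# only versions written within the last n days. Illustrative only.
def gc_versions(versions, max_versions=None, max_age_days=None):
    """versions: list of (timestamp_micros, value), newest first."""
    kept = versions
    if max_versions is not None:
        kept = kept[:max_versions]
    if max_age_days is not None:
        cutoff = (time.time() - max_age_days * 86400) * 1_000_000
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]
    return kept
```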

5. API
API functions:
• Create and delete tables.
• Create and delete column families.
• Change metadata for a cluster, table, or column family.
• Perform single-row transactions, which can be used to do atomic read-modify-write sequences on data stored under a single row key.
• Execute client-supplied scripts in the server's address space.
Bigtable can also be used with MapReduce, both as an input source and as an output target.
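The paper illustrates the client API with a short C++ example; the sketch below mirrors its shape in Python. All class and method names here are hypothetical, not a real client library:

```python
# Hypothetical Python rendering of the paper's C++ client example;
# RowMutation/commit are illustrative names, not a real library.
class RowMutation:
    """Buffers updates to one row; applied as a single atomic change."""

    def __init__(self, table, row_key: str):
        self.table, self.row_key, self.ops = table, row_key, []

    def set(self, column: str, value: bytes):
        self.ops.append(("set", column, value))

    def delete(self, column: str):
        self.ops.append(("delete", column))

    def apply(self):
        # Bigtable supports single-row transactions: all buffered ops on
        # this one row key commit atomically (no cross-row atomicity).
        self.table.commit(self.row_key, self.ops)

# Usage, following the paper's anchor-update example:
#   mut = RowMutation(table, "com.cnn.www")
#   mut.set("anchor:www.c-span.org", b"CNN")
#   mut.delete("anchor:www.abc.com")
#   mut.apply()
```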

6. Building Blocks
• Bigtable uses the distributed Google File System (GFS) to store log and data files.
• It operates on a shared pool of machines and depends on a cluster management system for scheduling jobs and managing resources on shared machines.
• SSTable – Internally, the Google SSTable file format is used to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Internally, an SSTable consists of blocks (typically 64 KB, but the size is configurable) with a block index stored at the end of the file; this index is loaded into memory when the SSTable is opened.
• Chubby – A highly available, distributed lock service that keeps five active replicas, one of which is elected master and actively serves requests. Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time, to store Bigtable schema information (the column family information for each table), and to store access control lists.
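A hypothetical sketch of an SSTable lookup using the in-memory block index; the structure is assumed from the description above, not taken from Google's code:

```python
import bisect

# Hypothetical SSTable lookup. The block index (loaded into memory when
# the file is opened) maps each block's last key to the block's location,
# so a point lookup costs one in-memory binary search plus at most one
# disk read. Names and layout here are assumptions for illustration.
class SSTable:
    def __init__(self, index, read_block):
        # index: sorted list of (last_key_in_block, offset, length)
        # read_block: callable (offset, length) -> dict of key -> value
        self.index = index
        self.read_block = read_block
        self._last_keys = [entry[0] for entry in index]

    def get(self, key: bytes):
        i = bisect.bisect_left(self._last_keys, key)  # binary search in memory
        if i == len(self.index):
            return None                               # key beyond the last block
        _, offset, length = self.index[i]
        block = self.read_block(offset, length)       # single disk read
        return block.get(key)                         # None if absent: SSTables are sparse
```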

7. Implementation
• Bigtable consists of three major components:
  • A library linked into every client.
  • One master server.
  • Many tablet servers.
• The master assigns tablets to tablet servers, detects the addition and expiration of tablet servers, and balances tablet-server load.
• Each tablet server manages a set of tablets (typically ten to a thousand) and handles read and write requests to the tablets it has loaded.
• A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially each table has just one tablet, but as it grows it is automatically split into multiple tablets.

8. Implementation
• Tablet Location –
  • Tablet locations are stored in a three-level hierarchy: a file in Chubby points to the root tablet, the root tablet indexes all METADATA tablets, and each METADATA tablet holds the locations of a set of user tablets.
  • The client library caches tablet locations.
  • If the client's cache is empty, the location algorithm requires three network round trips, including one read from Chubby.
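A hedged sketch of that lookup path with a client-side cache; all object interfaces and the Chubby file path below are hypothetical:

```python
# Hypothetical sketch of the three-level tablet-location lookup.
def locate_tablet(chubby, cache, table_name, row_key):
    loc = cache.get((table_name, row_key))
    if loc is not None:
        return loc                                     # cache hit: no network
    root = chubby.read("/bigtable/root-tablet")        # round trip 1: Chubby
    meta_tablet = root.lookup(table_name, row_key)     # round trip 2: root tablet
    loc = meta_tablet.lookup(table_name, row_key)      # round trip 3: METADATA tablet
    cache[(table_name, row_key)] = loc                 # remember for next time
    return loc
```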

9. Implementation
• Tablet Assignment –
  • Each tablet is assigned to one tablet server at a time.
  • The master server keeps track of the assignments.
  • Bigtable uses Chubby to keep track of tablet servers.
  • When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory.
  • If this lock is lost, the tablet server stops serving its tablets.
  • If the tablet server cannot reacquire the lock, it kills itself.
  • In such a situation, the master is either notified or discovers the failed server on its own, and reassigns that server's tablets.
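A rough, hypothetical sketch of this liveness protocol from the tablet server's point of view (the chubby, lock, and tablets objects are illustrative stand-ins):

```python
# Hypothetical tablet-server liveness loop: serve only while holding the
# uniquely named Chubby lock; self-terminate if it cannot be reacquired.
def tablet_server_loop(chubby, server_file, tablets):
    lock = chubby.create_and_acquire(server_file)  # done at startup
    while True:
        if not lock.still_held():
            tablets.stop_serving()          # lost lock: stop serving at once
            if not lock.try_reacquire():    # e.g. the file was deleted
                return                      # kill itself; master reassigns tablets
        tablets.handle_requests()
```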

10. Implementation
• When a read or write operation arrives, the server checks that it is well formed and that the sender is authorized to perform it.
• Read op – performed on a merged view of the tablet's SSTables and the memtable.
• Write op – a valid mutation is written to the commit log, and its contents are then inserted into the memtable.
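Since the memtable and every SSTable are already sorted lexicographically, the merged read view can be formed efficiently. A hypothetical sketch:

```python
import heapq
from operator import itemgetter

# Hypothetical sketch of the merged read view. Each input is an iterable
# of (key, value) pairs in sorted key order; sources are passed newest
# first (memtable, then SSTables from newest to oldest). heapq.merge is
# stable, so the newest version of each key is seen first and wins.
def merged_scan(memtable_items, sstable_item_lists):
    sources = [memtable_items] + list(sstable_item_lists)
    seen = set()
    for key, value in heapq.merge(*sources, key=itemgetter(0)):
        if key not in seen:            # skip older duplicates of this key
            seen.add(key)
            yield key, value
```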

11. Refinements
• Locality – Clients can group multiple column families into a single locality group. Segregating column families that are not typically accessed together into separate locality groups makes reads more efficient.
• Compression – Clients can control whether the SSTables for a locality group are compressed, using a user-specified compression format.
• In one experiment, the authors stored a large number of documents in a compressed locality group and achieved a 10:1 compression ratio, much better than the 3:1 or 4:1 typically achieved by Gzip.

12. Refinements
• Caching for read performance – tablet servers use two levels of caching:
  • Scan cache – a higher-level cache of the key-value pairs returned by the SSTable interface to the tablet server code (useful when applications read the same data repeatedly).
  • Block cache – a lower-level cache of SSTable blocks read from GFS (useful when applications read data close to data they read recently, as in sequential reads).
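A toy sketch of the two cache levels on the read path; the classes and methods are illustrative, and a real implementation would bound both caches (e.g. with LRU eviction):

```python
# Hypothetical two-level read cache for a tablet server.
class TabletCaches:
    def __init__(self):
        self.scan_cache = {}   # (sstable_id, key) -> value (key-value level)
        self.block_cache = {}  # (sstable_id, offset) -> block (GFS-read level)

    def read(self, sstable, key):
        if (sstable.id, key) in self.scan_cache:       # repeated reads of same data
            return self.scan_cache[(sstable.id, key)]
        offset = sstable.block_offset_for(key)
        block = self.block_cache.get((sstable.id, offset))
        if block is None:                              # nearby/sequential reads hit here
            block = sstable.read_block_from_gfs(offset)
            self.block_cache[(sstable.id, offset)] = block
        value = block.get(key)
        self.scan_cache[(sstable.id, key)] = value
        return value
```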

13. Refinements
• Commit-log implementation –
  • Mutations are appended to a single commit log per tablet server, commingling mutations for different tablets in the same physical log.
  • This complicates recovery: when a tablet server dies, its tablets are reassigned to many different servers, and each new server would have to read the full commit log to reapply just its tablet's mutations. Bigtable optimizes this by first sorting the commit log entries by the keys (table, row name, log sequence number), so that all mutations for a given tablet become contiguous and can be read sequentially.
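A hypothetical sketch of recovery against the sorted log; the entry layout is assumed from the paper's sort key of (table, row name, log sequence number):

```python
# Hypothetical tablet recovery. After sorting, one tablet's mutations
# are contiguous, so replay is a single sequential pass.
def recover_tablet(log_entries, tablet):
    ordered = sorted(log_entries,
                     key=lambda e: (e.table_id, e.row_key, e.seq_no))
    for entry in ordered:
        if entry.table_id == tablet.table_id and tablet.owns_row(entry.row_key):
            tablet.reapply(entry.mutation)  # rebuild the memtable state
```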

14. Real Applications
• Google Analytics – helps webmasters analyze web traffic on their websites. It uses two tables:
  • The raw click table maintains a row for each end-user session. The row name is a tuple containing the website's name and the time at which the session was created.
  • The summary table contains predefined summaries for each website.
• Google Earth – provides users with access to high-resolution satellite imagery of the world's surface, both through the web-based Google Maps interface (maps.google.com) and through the Google Earth (earth.google.com) custom client software.
  • Its preprocessing pipeline relies heavily on MapReduce over Bigtable to transform data.

15. Lessons
• Large distributed systems are prone to many types of failures, caused by things like memory and network corruption, hung machines, and bugs in other systems that are depended on (such as Chubby).
• Proper system-level monitoring is important. For example, every Bigtable cluster is registered with Chubby, which makes it possible to track down all the clusters, how big they are, how much traffic they receive, and whether they have problems such as unexpectedly large latencies.
• The value of simple designs – the authors found that when a system is very large and keeps evolving over time, clarity of design and code is very important for maintenance and debugging.
