Building a Database on S3

Presentation Transcript


  1. Building a Database on S3 Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, Tim Kraska. Presented by Xiang Zhang, 2014-04-10

  2. Outline • Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  3. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  4. Background • Utility services need to provide users with basic ingredients, such as storage, with high scalability and full availability • Amazon’s Simple Storage Service (S3) is most successful for storing large objects which are rarely updated, such as multimedia objects • Eventual consistency: S3 only guarantees that updates will eventually become visible to all clients and that the updates persist

  5. Goal • Explore how Web-based database applications (at any scale) can be implemented on top of S3 • Preserving scalability and availability • Maximizing the level of consistency

  6. Contributions • Show how small objects which are frequently updated by concurrent clients can be implemented on S3 • Show how a B-tree can be implemented on S3 • Present protocols that show how different levels of consistency can be implemented on S3 • Study the cost (response time and $) of running a Web-based application at different levels of consistency on S3

  7. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  8. Simple Storage Service • S3 is an infinite store for objects of variable size (min 1 Byte, max 5 GB) • S3 allows clients to read and update S3 objects remotely using a SOAP or REST-based interface: • get(uri): returns the object identified by uri • put(uri, bytestream): writes a new version of the object • get-if-modified-since(uri, timestamp): retrieves a new version of the object only if it has changed since the specified timestamp
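
To make the interface concrete, here is a minimal in-memory sketch of these three operations in Python; the class name and the dict-backed store are assumptions for illustration, not the real S3 client:

    import time

    class S3Sketch:
        """Hypothetical in-memory stand-in for the S3 interface above."""
        def __init__(self):
            self.store = {}  # uri -> (last_modified, bytestream)

        def put(self, uri, bytestream):
            # writes a new version of the object identified by uri
            self.store[uri] = (time.time(), bytestream)

        def get(self, uri):
            # returns the object identified by uri
            return self.store[uri][1]

        def get_if_modified_since(self, uri, timestamp):
            # returns the object only if it changed since timestamp
            last_modified, data = self.store[uri]
            return data if last_modified > timestamp else None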

  9. Simple Storage Service • S3 is not free; users pay per use: • $0.15 to store 1 GB of data for 1 month • $0.01 per 10,000 get requests • $0.01 per 1,000 put requests • $0.10~0.18 per GB of network bandwidth consumed • Services like SmugMug use their own servers to cache the data in order to save money as well as reduce latency
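
As a back-of-the-envelope illustration of this price list (the workload numbers below are assumed, not from the paper):

    # Assumed monthly workload: 10 GB stored, 1,000,000 gets, 100,000 puts,
    # 5 GB of network bandwidth at the lower $0.10/GB rate.
    storage  = 10 * 0.15                     # $1.50 for storage
    gets     = 1_000_000 / 10_000 * 0.01     # $1.00 for get requests
    puts     = 100_000 / 1_000 * 0.01        # $1.00 for put requests
    transfer = 5 * 0.10                      # $0.50 for bandwidth
    print(storage + gets + puts + transfer)  # -> 4.0 dollars per month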

  10. Simple Storage Service • Small objects are clustered into pages, and a page is the unit of transfer between client and S3
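
A minimal sketch of this clustering, with assumed names and capacity; the whole page, not the individual record, is what moves to and from S3:

    class Page:
        """Hypothetical page: the unit of transfer between client and S3."""
        CAPACITY = 100  # assumed number of record slots per page

        def __init__(self, uri):
            self.uri = uri     # the page is stored as one S3 object
            self.records = {}  # key -> payload for the clustered records

        def has_space(self):
            return len(self.records) < Page.CAPACITY

        def put_record(self, key, payload):
            # small objects are clustered into the page rather than
            # stored as individual S3 objects
            self.records[key] = payload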

  11. Simple Queue Service • SQS allows users to manage an infinite number of queues with infinite capacity • Each queue is referenced by a URI and supports sending and receiving messages via an HTTP or REST-based interface: • createQueue(uri): creates a new queue identified with uri • send(uri, msg): sends a message to the queue identified with uri • receive(uri, number-of-msg, timeout): receives number-of-msg messages from the queue. The returned messages are locked for the specified timeout period • delete(uri, msg-id): deletes a message from the queue • addGrant(uri, user): allows another user to access the queue
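
A minimal in-memory sketch of these operations (class name assumed, addGrant omitted), mainly to illustrate the locking behavior of receive:

    import time
    import uuid

    class SQSSketch:
        """Hypothetical in-memory stand-in for the SQS operations above."""
        def __init__(self):
            self.queues = {}  # uri -> list of [msg_id, msg, locked_until]

        def create_queue(self, uri):
            self.queues[uri] = []

        def send(self, uri, msg):
            self.queues[uri].append([str(uuid.uuid4()), msg, 0.0])

        def receive(self, uri, number_of_msg, timeout):
            # return up to number_of_msg unlocked messages, locking each
            # one for the timeout period; as with SQS, no FIFO order is
            # guaranteed
            now, result = time.time(), []
            for entry in self.queues[uri]:
                if entry[2] <= now and len(result) < number_of_msg:
                    entry[2] = now + timeout
                    result.append((entry[0], entry[1]))
            return result

        def delete(self, uri, msg_id):
            self.queues[uri] = [e for e in self.queues[uri] if e[0] != msg_id]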

  12. Simple Queue Service • SQS is also not free; users pay per use: • $0.01 for sending 1,000 messages • $0.10 per GB of data transferred • SQS only makes a best effort when returning messages in a FIFO manner: there is no guarantee that SQS returns the messages in the right order

  13. Simple Queue Service Round trip time: the total time between initiating the request from the application and the delivery of the result or ack

  14. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  15. Client-Server Architecture • Page Manager: coordinates read and write requests to S3 and buffers pages from S3 in local main memory or disk • Record Manager: provides a record-oriented interface, organizes records on pages and carries out free-space management for the creation of new records

  16. Record Manager • A record is composed of a key and payload data • A collection is implemented as a bucket in S3 and all the pages that store records of that collection are stored as S3 objects in that bucket • A collection is identified by a URI

  17. Record Manager • The record manager provides functions to operate on records: • Create(key, payload, uri): creates a new record in the collection identified by uri • Read(key, uri): reads the payload information of a record given the key of the record and the uri of the collection • Update(key, payload, uri): updates the payload information of a record given the key of the record and the uri of the collection • Delete(key, uri): deletes a record given the key of the record and the uri of the collection • Scan(uri): scans through all records of a collection identified by uri
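
A minimal sketch of how these functions could sit on top of the page manager; the page-manager helpers used here (find_page_with_space, find_page_of, pages_of_collection, mark_modified) are assumed names, not the paper's API:

    class RecordManager:
        """Hypothetical record manager on top of the page manager."""
        def __init__(self, page_manager):
            self.pm = page_manager

        def create(self, key, payload, uri):
            page = self.pm.find_page_with_space(uri)  # free-space management
            page.put_record(key, payload)
            self.pm.mark_modified(page)

        def read(self, key, uri):
            return self.pm.find_page_of(key, uri).records[key]

        def update(self, key, payload, uri):
            page = self.pm.find_page_of(key, uri)
            page.records[key] = payload
            self.pm.mark_modified(page)

        def delete(self, key, uri):
            page = self.pm.find_page_of(key, uri)
            del page.records[key]
            self.pm.mark_modified(page)

        def scan(self, uri):
            for page in self.pm.pages_of_collection(uri):
                yield from page.records.items()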

  18. Page Manager • It implements a buffer pool for S3 pages • It supports reading pages from S3, pinning the pages in the buffer pool, updating the pages in the buffer pool and marking the pages as updated • It provides a way to create new pages in S3 • It implements commit and abort methods

  19. Page Manager • If an application commits a transaction, all the updates are propagated to S3 and all the affected pages are marked as unmodified in the client’s buffer pool • If an application aborts a transaction, all pages marked modified or new are simply discarded from the client’s buffer pool
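
A minimal sketch of these commit/abort semantics (all names assumed); it mirrors only this slide's description, where whole pages are propagated, whereas the refined protocols in later sections propagate log records instead:

    class PageManager:
        """Hypothetical buffer pool with the commit/abort behavior above."""
        def __init__(self, s3):
            self.s3 = s3
            self.buffer = {}  # uri -> page
            self.status = {}  # uri -> "unmodified" | "modified" | "new"

        def mark_modified(self, page):
            if self.status.get(page.uri) != "new":
                self.status[page.uri] = "modified"

        def commit(self):
            # propagate all updates to S3, then mark pages unmodified
            for uri, page in self.buffer.items():
                if self.status[uri] != "unmodified":
                    self.s3.put(uri, page)
                    self.status[uri] = "unmodified"

        def abort(self):
            # discard all pages marked modified or new from the buffer pool
            for uri in [u for u, s in self.status.items() if s != "unmodified"]:
                del self.buffer[uri]
                del self.status[uri]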

  20. B-tree Indexes • A B-tree is identified by the URI of its root page • Each node contains a pointer to its right sibling at the same level • B-trees can be implemented on top of the page manager • The root and intermediate nodes of the B-tree are stored as pages on S3 and contain (key, uri) pairs: uri refers to the appropriate page at the next lower level • The leaf pages of a primary index contain entries of (key, payload) • The leaf pages of a secondary index contain entries of (search key, record key)
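
A minimal sketch of a lookup over this structure (get_page, is_leaf and entries are assumed names): follow (key, uri) pairs from the root down to a leaf, fetching each node as a page:

    def btree_lookup(page_manager, root_uri, key):
        """Hypothetical lookup from the root page down to a leaf."""
        node = page_manager.get_page(root_uri)
        while not node.is_leaf:
            # entries: sorted (separator key, uri of page at next lower level)
            child_uri = node.entries[0][1]
            for separator, uri in node.entries:
                if key >= separator:
                    child_uri = uri
            node = page_manager.get_page(child_uri)
        return node.records.get(key)  # leaf of a primary index: (key, payload)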

  21. Logging • It is assumed that the log records are idempotent: applying a log record twice or more often has the same effect as applying it only once • There are 3 types of log records: • (insert, key, payload): describes the creation of a new record • (delete, key): describes the deletion of a record • (update, key, payload): describes the update of a record
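
A minimal sketch of applying these three log record types to a page (names assumed); each branch overwrites state rather than accumulating changes, which is what makes the records idempotent:

    def apply_log_record(page, record):
        """Apply one log record to a page; applying it twice has the
        same effect as applying it once."""
        if record[0] == "insert":         # (insert, key, payload)
            _, key, payload = record
            page.records[key] = payload
        elif record[0] == "update":       # (update, key, payload)
            _, key, payload = record
            page.records[key] = payload
        elif record[0] == "delete":       # (delete, key)
            _, key = record
            page.records.pop(key, None)   # deleting twice has no extra effect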

  22. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  23. Issue to be addressed • When concurrent clients commit updates to records stored on the same page, the updates of one client may be overwritten by the other client, even if they update different records • The reason is that the unit of transfer is a page rather than an individual record

  24. Overview of commit protocol • Step 1. Commit: the client generates log records for all the updates that are committed as part of the transaction and sends them to SQS • Step 2. Checkpoint: the log records are applied to the pages stored on S3
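
A minimal sketch of the commit step (names assumed; the helper that maps a log record to its queue is an assumption here, and the queues themselves are introduced on the following slides):

    def commit(sqs, log_records, pu_queue_uri_for):
        """Step 1 of the protocol: send one log record per committed
        update to SQS."""
        for record in log_records:
            sqs.send(pu_queue_uri_for(record), record)
        # Step 2 (checkpointing) runs later and asynchronously: the log
        # records are applied to the pages stored on S3.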

  25. Overview of commit protocol • The commit step can be carried out in constant time, assuming the number of messages per commit is constant • The checkpoint step can be carried out asynchronously, outside the execution of a client application, so that end users are never blocked

  26. Overview of commit protocol • This protocol is resilient to failures: when a client crashes during commit, the client resends all log records when it restarts • This protocol does not guarantee atomicity: if a client crashes during commit and never comes back or loses the uncommitted log records, then some of the remaining log records will never be applied • This protocol provides eventual consistency: eventually all updates become visible to everyone

  27. PU Queues (Pending Update Queues) • Clients propagate their log records to PU queues • Each B-tree has one PU queue associated with it (for insert and delete log records) • One PU queue is associated with each leaf node of a primary B-tree of a collection (for update log records)

  28. Checkpoint Protocol for Data Pages • The input of a checkpoint is a PU queue • To make sure that nobody else is concurrently carrying out a checkpoint on the same PU queue, a LOCK queue is associated with each PU queue

  29. Checkpoint Protocol for Data Pages 1. receive(URIofLOCKQueue, 1, timeout): if the token message is returned, continue with step 2; otherwise, terminate. Set a proper timeout period 2. If the leaf page is cached at the client, refresh the cached copy; otherwise, get a copy from S3

  30. Checkpoint Protocol for Data Pages 3. receive(URIofPUQueue, 256, 0): receive as many log records from the PU queue as possible 4. Apply the log records to the local copy of the page

  31. Checkpoint Protocol for Data Pages 5. If steps 2-4 are done within the timeout period, put the new version of the page to S3; otherwise, terminate 6. If step 5 is successful within the timeout, delete all log records received in step 3 from the PU queue
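
Putting steps 1-6 together as a minimal sketch (refresh_from_s3 is an assumed helper that refreshes or fetches the client's copy of the page; apply_log_record is the sketch from the logging slide):

    import time

    def checkpoint_data_page(sqs, s3, lock_queue_uri, pu_queue_uri,
                             page_uri, timeout):
        """Minimal sketch of the six-step checkpoint, names assumed."""
        start = time.time()
        if not sqs.receive(lock_queue_uri, 1, timeout):  # step 1: get token
            return                                       # someone else has it
        page = refresh_from_s3(s3, page_uri)             # step 2: fresh copy
        msgs = sqs.receive(pu_queue_uri, 256, 0)         # step 3: log records
        for _, record in msgs:                           # step 4: apply them
            apply_log_record(page, record)
        if time.time() - start >= timeout:               # step 5: in time?
            return
        s3.put(page_uri, page)
        if time.time() - start < timeout:                # step 6: clean queue
            for msg_id, _ in msgs:
                sqs.delete(pu_queue_uri, msg_id)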

  32. Checkpoint Protocol for B-trees 1. Obtain the token from the LOCK queue 2. Receive log records from the PU queue 3. Sort the log records by key 4. Find the first unprocessed log record and go to the leaf node associated with it; refresh that leaf node from S3 5. Apply all log records that are relevant to that leaf node 6. If the timeout has not expired, put the new version of the node to S3 7. If the timeout has not expired, delete the log records applied in step 5 from the PU queue 8. If not all log records have been processed yet, go to step 4; otherwise, terminate
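
A minimal sketch of steps 2-8, with token acquisition and the timeout checks elided for brevity; leaf_uri_for is an assumed helper mapping a record key to the URI of its leaf node, so sorting by key makes leaf groups contiguous:

    from itertools import groupby

    def checkpoint_btree(sqs, s3, pu_queue_uri, leaf_uri_for):
        """Sketch of the B-tree checkpoint loop, names assumed."""
        msgs = sqs.receive(pu_queue_uri, 256, 0)          # step 2
        msgs.sort(key=lambda m: m[1][1])                  # step 3: sort by key
        for leaf_uri, group in groupby(msgs,
                                       key=lambda m: leaf_uri_for(m[1][1])):
            group = list(group)
            leaf = s3.get(leaf_uri)                       # step 4: refresh leaf
            for _, record in group:                       # step 5: apply
                apply_log_record(leaf, record)
            s3.put(leaf_uri, leaf)                        # step 6: write back
            for msg_id, _ in group:                       # step 7: clean queue
                sqs.delete(pu_queue_uri, msg_id)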

  33. Checkpoint Strategies • A checkpoint on a page X can be carried out by: • Reader: a reader of X • Writer: a client who just committed updates to X • Watchdog: a process which periodically checks the PU queues of X • Owner: a special client which periodically checks the PU queues of X • This paper uses writers in general and readers in exceptional cases to carry out checkpoints

  34. Checkpoint Strategies A writer initiates a checkpoint under the following conditions: 1. Each data page records the timestamp of the last checkpoint in its header 2. When a client commits a log record, it computes the difference between its current time and the timestamp of the last checkpoint 3. If the absolute value of the difference is bigger than a threshold (the checkpoint interval), the client carries out the checkpoint
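
A one-line sketch of this trigger (the field and function names are assumed):

    def writer_should_checkpoint(page_header, now, checkpoint_interval):
        """last_checkpoint is the timestamp stored in the data page's
        header (field name assumed)."""
        return abs(now - page_header.last_checkpoint) > checkpoint_interval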

  35. Checkpoint Strategies • Flaw in the writer-only strategy: it is possible that a page is updated once and then never checkpointed; as a result, the update never becomes visible • Solution: readers can also initiate checkpoints if they see a page whose last checkpoint was a long time ago, with a probability proportional to 1/x, in which x is the time period since the last checkpoint

  36. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  37. Atomicity • Problem: the basic commit protocol mentioned above cannot achieve full atomicity • Solution: rather than committing log records to the PU queues directly, the client first commits them to its own ATOMIC queue on SQS. Every log record is annotated with an ID which uniquely identifies the transaction. If a client fails and comes back: • It checks its ATOMIC queue; log records which carry the ID of a completely committed transaction are propagated to the PU queues and deleted from the ATOMIC queue after all of them have been propagated • Log records of other (incomplete) transactions are deleted immediately from the ATOMIC queue
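
A minimal sketch of this recovery pass (all names assumed; the queue is taken to hold (txn_id, record) messages, and committed_txn_id stands for the ID of the transaction the client knows it fully committed):

    def recover(sqs, atomic_queue_uri, committed_txn_id, pu_queue_uri_for):
        """Sketch of the recovery pass over the client's ATOMIC queue."""
        msgs = sqs.receive(atomic_queue_uri, 256, 0)
        # propagate every log record of the committed transaction first ...
        for _, (txn_id, record) in msgs:
            if txn_id == committed_txn_id:
                sqs.send(pu_queue_uri_for(record), record)
        # ... and delete only afterwards; records of other (incomplete)
        # transactions are simply deleted right away
        for msg_id, _ in msgs:
            sqs.delete(atomic_queue_uri, msg_id)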

  38. Consistency Levels • Monotonic Reads: if a client reads the value of a data item x, any successive read operation on x by that client will always return the same value or a more recent value • Monotonic Writes: a write operation by a client on data item x is completed before any successive write operation on x by the same client • Read your writes: the effect of a write operation by a client on data item x will always be seen by a successive read operation on x by the same client • Write follows read: a write operation by a client on data item x following a previous read operation on x by the same client is guaranteed to take place on the same or a more recent value of x that was read

  39. Isolation: The Limits • The idea of isolation is to serialize transactions in the order of the time they started • Snapshot Isolation or BOCC (Backward-Oriented Concurrency Control) can be implemented on S3, but they need a global counter, which may become the bottleneck of the system

  40. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  41. Experiment • Naïve way as a baseline: write all dirty pages directly to S3 • 3 levels of consistency: • Basic: the protocol only supports the commit and checkpoint steps; it provides eventual consistency • Monotonicity: the protocol additionally supports full client-side consistency • Atomicity: on top of Monotonicity and Basic, it provides the highest level of consistency • The only difference between these 4 variants is the implementation of the commit and checkpoint steps

  42. TPC-W Benchmark • Models an online bookstore with queries asking for the availability of products and an update workload that involves the placement of orders • Retrieve the customer record from the database • Search for 6 specific products • Place orders for 3 of the 6 products

  43. Running Time [secs] • Each transaction simulates about 12 clicks (1 sec each) of a user • The higher the level of consistency, the lower the overall running time • Naïve writes all affected pages, while the other approaches only propagate log records • Latency can be reduced by sending several messages in parallel to S3 and SQS

  44. Cost • The cost was computed by running a large number of transactions and taking the cost measurements of AWS • Cost increases as the level of consistency increases • Cost can be reduced by setting the checkpoint interval to a larger value

  45. Vary Checkpoint Interval • A checkpoint interval below 10 seconds effectively initiates a checkpoint for every update that was committed • Increasing the checkpoint interval above 10 seconds decreases the cost

  46. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  47. Conclusion and future work • This paper focuses on high scalability and availability, but there might be scenarios where ACID properties are more important • Some new algorithms need to be devised; for instance, this system is not able to carry out chained I/O in order to scan through several pages on S3 • The right security infrastructure will be crucial for an S3-based information system

  48. Thank you Q&A
