Building a Database on S3

Presentation Transcript


  1. Building a Database on S3 Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, Tim Kraska. Presented by Xiang Zhang, 2014-04-10

  2. Outline • Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  3. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  4. Background • Utility services need to provide users with basic ingredients, such as storage, with high scalability and full availability • Amazon’s Simple Storage Service (S3) is most successful for storing large objects which are rarely updated, such as multimedia objects • Eventual consistency: S3 only guarantees that updates will eventually become visible to all clients and that the updates persist

  5. Goal • Explore how Web-based database applications (at any scale) can be implemented on top of S3 • Preserving scalability and availability • Maximizing the level of consistency

  6. Contributions • Show how small objects which are frequently updated by concurrent clients can be implemented on S3 • Show how a B-tree can be implemented on S3 • Present protocols that show how different levels of consistency can be implemented on S3 • Study the cost (response time and $) of running a Web-based application at different levels of consistency on S3

  7. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  8. Simple Storage Service • S3 is an infinite store for objects of variable size (min 1 Byte, max 5 GB) • S3 allows clients to read and update S3 objects remotely using a SOAP or REST-based interface: • get(uri): returns the object identified by uri • put(uri, bytestream): writes a new version of the object • get-if-modified-since(uri, timestamp): retrieves a new version of the object only if it has changed since the specified timestamp
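
To make the interface concrete, here is a minimal in-memory sketch of these three operations in Python; the class name and the dict-backed store are assumptions for illustration, not the real S3 client:

    import time

    class S3Sketch:
        """Hypothetical in-memory stand-in for the S3 interface above."""
        def __init__(self):
            self.store = {}  # uri -> (last_modified, bytestream)

        def put(self, uri, bytestream):
            # writes a new version of the object identified by uri
            self.store[uri] = (time.time(), bytestream)

        def get(self, uri):
            # returns the object identified by uri
            return self.store[uri][1]

        def get_if_modified_since(self, uri, timestamp):
            # returns the object only if it changed since timestamp
            last_modified, data = self.store[uri]
            return data if last_modified > timestamp else None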

  9. Simple Storage Service • S3 is not free; users pay per use: • $0.15 to store 1 GB of data for 1 month • $0.01 per 10,000 get requests • $0.01 per 1,000 put requests • $0.10~0.18 per GB of network bandwidth consumed • Services like SmugMug use their own servers to cache the data in order to save money as well as reduce latency
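
As a back-of-the-envelope illustration of this price list (the workload numbers below are assumed, not from the paper):

    # Assumed monthly workload: 10 GB stored, 1,000,000 gets, 100,000 puts,
    # 5 GB of network bandwidth at the lower $0.10/GB rate.
    storage  = 10 * 0.15                     # $1.50 for storage
    gets     = 1_000_000 / 10_000 * 0.01     # $1.00 for get requests
    puts     = 100_000 / 1_000 * 0.01        # $1.00 for put requests
    transfer = 5 * 0.10                      # $0.50 for bandwidth
    print(storage + gets + puts + transfer)  # -> 4.0 dollars per month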

  10. Simple Storage Service • Small objects are clustered into pages, and a page is the unit of transfer between client and S3
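
A minimal sketch of this clustering, with assumed names and capacity; the whole page, not the individual record, is what moves to and from S3:

    class Page:
        """Hypothetical page: the unit of transfer between client and S3."""
        CAPACITY = 100  # assumed number of record slots per page

        def __init__(self, uri):
            self.uri = uri     # the page is stored as one S3 object
            self.records = {}  # key -> payload for the clustered records

        def has_space(self):
            return len(self.records) < Page.CAPACITY

        def put_record(self, key, payload):
            # small objects are clustered into the page rather than
            # stored as individual S3 objects
            self.records[key] = payload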

  11. Simple Queue Service • SQS allows users to manage an infinite number of queues with infinite capacity • Each queue is referenced by a URI and supports sending and receiving messages via an HTTP or REST-based interface: • createQueue(uri): creates a new queue identified with uri • send(uri, msg): sends a message to the queue identified with uri • receive(uri, number-of-msg, timeout): receives number-of-msg messages from the queue. The returned messages are locked for the specified timeout period • delete(uri, msg-id): deletes a message from the queue • addGrant(uri, user): allows another user to access the queue
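
A minimal in-memory sketch of these operations (class name assumed, addGrant omitted), mainly to illustrate the locking behavior of receive:

    import time
    import uuid

    class SQSSketch:
        """Hypothetical in-memory stand-in for the SQS operations above."""
        def __init__(self):
            self.queues = {}  # uri -> list of [msg_id, msg, locked_until]

        def create_queue(self, uri):
            self.queues[uri] = []

        def send(self, uri, msg):
            self.queues[uri].append([str(uuid.uuid4()), msg, 0.0])

        def receive(self, uri, number_of_msg, timeout):
            # return up to number_of_msg unlocked messages, locking each
            # one for the timeout period; as with SQS, no FIFO order is
            # guaranteed
            now, result = time.time(), []
            for entry in self.queues[uri]:
                if entry[2] <= now and len(result) < number_of_msg:
                    entry[2] = now + timeout
                    result.append((entry[0], entry[1]))
            return result

        def delete(self, uri, msg_id):
            self.queues[uri] = [e for e in self.queues[uri] if e[0] != msg_id]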

  12. Simple Queue Service • SQS is also not free; users pay per use: • $0.01 for sending 1,000 messages • $0.10 per GB of data transferred • SQS only makes a best effort when returning messages in a FIFO manner: there is no guarantee that SQS returns the messages in the right order

  13. Simple Queue Service Round trip time: the total time between initiating the request from the application and the delivery of the result or ack

  14. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  15. Client-Server Architecture • Page Manager: coordinates read and write requests to S3 and buffers pages from S3 in local main memory or disk • Record Manager: provides a record-oriented interface, organizes records on pages and carries out free-space management for the creation of new records

  16. Record Manager • A record is composed of a key and payload data • A collection is implemented as a bucket in S3 and all the pages that store records of that collection are stored as S3 objects in that bucket • A collection is identified by a URI

  17. Record Manager • The record manager provides functions to operate on records: • Create(key, payload, uri): creates a new record in the collection identified by uri • Read(key, uri): reads the payload information of a record given the key of the record and the uri of the collection • Update(key, payload, uri): updates the payload information of a record given the key of the record and the uri of the collection • Delete(key, uri): deletes a record given the key of the record and the uri of the collection • Scan(uri): scans through all records of a collection identified by uri
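
A minimal sketch of how these functions could sit on top of the page manager; the page-manager helpers used here (find_page_with_space, find_page_of, pages_of_collection, mark_modified) are assumed names, not the paper's API:

    class RecordManager:
        """Hypothetical record manager on top of the page manager."""
        def __init__(self, page_manager):
            self.pm = page_manager

        def create(self, key, payload, uri):
            page = self.pm.find_page_with_space(uri)  # free-space management
            page.put_record(key, payload)
            self.pm.mark_modified(page)

        def read(self, key, uri):
            return self.pm.find_page_of(key, uri).records[key]

        def update(self, key, payload, uri):
            page = self.pm.find_page_of(key, uri)
            page.records[key] = payload
            self.pm.mark_modified(page)

        def delete(self, key, uri):
            page = self.pm.find_page_of(key, uri)
            del page.records[key]
            self.pm.mark_modified(page)

        def scan(self, uri):
            for page in self.pm.pages_of_collection(uri):
                yield from page.records.items()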

  18. Page Manager • It implements a buffer pool for S3 pages • It supports reading pages from S3, pinning the pages in the buffer pool, updating the pages in the buffer pool and marking the pages as updated • It provides a way to create new pages in S3 • It implements commit and abort methods

  19. Page Manager • If an application commits a transaction, all the updates are propagated to S3 and all the affected pages are marked as unmodified in the client’s buffer pool • If an application aborts a transaction, all pages marked modified or new are simply discarded from the client’s buffer pool
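
A minimal sketch of these commit/abort semantics (all names assumed); it mirrors only this slide's description, where whole pages are propagated, whereas the refined protocols in later sections propagate log records instead:

    class PageManager:
        """Hypothetical buffer pool with the commit/abort behavior above."""
        def __init__(self, s3):
            self.s3 = s3
            self.buffer = {}  # uri -> page
            self.status = {}  # uri -> "unmodified" | "modified" | "new"

        def mark_modified(self, page):
            if self.status.get(page.uri) != "new":
                self.status[page.uri] = "modified"

        def commit(self):
            # propagate all updates to S3, then mark pages unmodified
            for uri, page in self.buffer.items():
                if self.status[uri] != "unmodified":
                    self.s3.put(uri, page)
                    self.status[uri] = "unmodified"

        def abort(self):
            # discard all pages marked modified or new from the buffer pool
            for uri in [u for u, s in self.status.items() if s != "unmodified"]:
                del self.buffer[uri]
                del self.status[uri]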

  20. B-tree Indexes • A B-tree is identified by the URI of its root page • Each node contains a pointer to its right sibling at the same level • B-trees can be implemented on top of the page manager • The root and intermediate nodes of the B-tree are stored as pages on S3 and contain (key, uri) pairs: uri refers to the appropriate page at the next lower level • The leaf pages of a primary index contain entries of (key, payload) • The leaf pages of a secondary index contain entries of (search key, record key)
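
A minimal sketch of a lookup over this structure (get_page, is_leaf and entries are assumed names): follow (key, uri) pairs from the root down to a leaf, fetching each node as a page:

    def btree_lookup(page_manager, root_uri, key):
        """Hypothetical lookup from the root page down to a leaf."""
        node = page_manager.get_page(root_uri)
        while not node.is_leaf:
            # entries: sorted (separator key, uri of page at next lower level)
            child_uri = node.entries[0][1]
            for separator, uri in node.entries:
                if key >= separator:
                    child_uri = uri
            node = page_manager.get_page(child_uri)
        return node.records.get(key)  # leaf of a primary index: (key, payload)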

  21. Logging • It is assumed that the log records are idempotent: applying a log record twice or more often has the same effect as applying it only once • There are 3 types of log records: • (insert, key, payload): describes the creation of a new record • (delete, key): describes the deletion of a record • (update, key, payload): describes the update of a record
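
A minimal sketch of applying these three log record types to a page (names assumed); each branch overwrites state rather than accumulating changes, which is what makes the records idempotent:

    def apply_log_record(page, record):
        """Apply one log record to a page; applying it twice has the
        same effect as applying it once."""
        if record[0] == "insert":         # (insert, key, payload)
            _, key, payload = record
            page.records[key] = payload
        elif record[0] == "update":       # (update, key, payload)
            _, key, payload = record
            page.records[key] = payload
        elif record[0] == "delete":       # (delete, key)
            _, key = record
            page.records.pop(key, None)   # deleting twice has no extra effect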

  22. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  23. Issue to be addressed • When concurrent clients commit updates to records stored on the same page, the updates of one client may be overwritten by the other client, even if they update different records • The reason is that the unit of transfer is a page rather than an individual record

  24. Overview of commit protocol • Step 1. Commit: the client generates log records for all the updates that are committed as part of the transaction and sends them to SQS • Step 2. Checkpoint: the log records are applied to the pages stored on S3
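
A minimal sketch of the commit step (names assumed; the helper that maps a log record to its queue is an assumption here, and the queues themselves are introduced on the following slides):

    def commit(sqs, log_records, pu_queue_uri_for):
        """Step 1 of the protocol: send one log record per committed
        update to SQS."""
        for record in log_records:
            sqs.send(pu_queue_uri_for(record), record)
        # Step 2 (checkpointing) runs later and asynchronously: the log
        # records are applied to the pages stored on S3.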

  25. Overview of commit protocol • The commit step can be carried out in constant time, assuming the number of messages per commit is constant • The checkpoint step can be carried out asynchronously, outside the execution of a client application, so that end users are never blocked

  26. Overview of commit protocol • This protocol is resilient to failures: when a client crashes during commit, the client resends all log records when it restarts • This protocol does not guarantee atomicity: if a client crashes during commit and never comes back or loses the uncommitted log records, then some of the remaining log records will never be applied • This protocol provides eventual consistency: eventually all updates become visible to everyone

  27. PU Queues (Pending Update Queues) • Clients propagate their log records to PU queues • Each B-tree has one PU queue associated with it (for insert and delete log records) • One PU queue is associated with each leaf node of a primary B-tree of a collection (for update log records)

  28. Checkpoint Protocol for Data Pages • The input of a checkpoint is a PU queue • To make sure that nobody else is concurrently carrying out a checkpoint on the same PU queue, a LOCK queue is associated with each PU queue

  29. Checkpoint Protocol for Data Pages 1. receive(URIofLOCKQueue, 1, timeout): if the token message is returned, continue with step 2; otherwise, terminate. Set a proper timeout period 2. If the leaf page is cached at the client, refresh the cached copy; otherwise, get a copy from S3

  30. Checkpoint Protocol for Data Pages 3. receive(URIofPUQueue, 256, 0): receive as many log records from the PU queue as possible 4. Apply the log records to the local copy of the page

  31. Checkpoint Protocol for Data Pages 5. If steps 2-4 are done within the timeout period, put the new version of the page to S3; otherwise, terminate 6. If step 5 is successful within the timeout, delete all log records received in step 3 from the PU queue
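
Putting steps 1-6 together as a minimal sketch (refresh_from_s3 is an assumed helper that refreshes or fetches the client's copy of the page; apply_log_record is the sketch from the logging slide):

    import time

    def checkpoint_data_page(sqs, s3, lock_queue_uri, pu_queue_uri,
                             page_uri, timeout):
        """Minimal sketch of the six-step checkpoint, names assumed."""
        start = time.time()
        if not sqs.receive(lock_queue_uri, 1, timeout):  # step 1: get token
            return                                       # someone else has it
        page = refresh_from_s3(s3, page_uri)             # step 2: fresh copy
        msgs = sqs.receive(pu_queue_uri, 256, 0)         # step 3: log records
        for _, record in msgs:                           # step 4: apply them
            apply_log_record(page, record)
        if time.time() - start >= timeout:               # step 5: in time?
            return
        s3.put(page_uri, page)
        if time.time() - start < timeout:                # step 6: clean queue
            for msg_id, _ in msgs:
                sqs.delete(pu_queue_uri, msg_id)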

  32. Checkpoint Protocol for B-trees 1. Obtain the token from the LOCK queue 2. Receive log records from the PU queue 3. Sort the log records by key 4. Find the first unprocessed log record and go to the leaf node associated with it; refresh that leaf node from S3 5. Apply all log records that are relevant to that leaf node 6. If the timeout has not expired, put the new version of the node to S3 7. If the timeout has not expired, delete the log records applied in step 5 from the PU queue 8. If not all log records have been processed yet, go to step 4; otherwise, terminate
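
A minimal sketch of steps 2-8, with token acquisition and the timeout checks elided for brevity; leaf_uri_for is an assumed helper mapping a record key to the URI of its leaf node, so sorting by key makes leaf groups contiguous:

    from itertools import groupby

    def checkpoint_btree(sqs, s3, pu_queue_uri, leaf_uri_for):
        """Sketch of the B-tree checkpoint loop, names assumed."""
        msgs = sqs.receive(pu_queue_uri, 256, 0)          # step 2
        msgs.sort(key=lambda m: m[1][1])                  # step 3: sort by key
        for leaf_uri, group in groupby(msgs,
                                       key=lambda m: leaf_uri_for(m[1][1])):
            group = list(group)
            leaf = s3.get(leaf_uri)                       # step 4: refresh leaf
            for _, record in group:                       # step 5: apply
                apply_log_record(leaf, record)
            s3.put(leaf_uri, leaf)                        # step 6: write back
            for msg_id, _ in group:                       # step 7: clean queue
                sqs.delete(pu_queue_uri, msg_id)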

  33. Checkpoint Strategies • A checkpoint on a page X can be carried out by: • Reader: a reader of X • Writer: a client who just committed updates to X • Watchdog: a process which periodically checks the PU queues of X • Owner: a special client which periodically checks the PU queues of X • This paper uses writers in general and readers in exceptional cases to carry out checkpoints

  34. Checkpoint Strategies A writer initiates a checkpoint under the following conditions: 1. Each data page records the timestamp of the last checkpoint in its header 2. When a client commits a log record, it computes the difference between its current time and the timestamp of the last checkpoint 3. If the absolute value of the difference is bigger than a threshold (the checkpoint interval), the client carries out the checkpoint
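
A one-line sketch of this trigger (the field and function names are assumed):

    def writer_should_checkpoint(page_header, now, checkpoint_interval):
        """last_checkpoint is the timestamp stored in the data page's
        header (field name assumed)."""
        return abs(now - page_header.last_checkpoint) > checkpoint_interval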

  35. Checkpoint Strategies • Flaw in the writer-only strategy: it is possible that a page is updated once and then never checkpointed; as a result, the update never becomes visible • Solution: readers can also initiate checkpoints if they see a page whose last checkpoint was a long time ago, with a probability proportional to 1/x, in which x is the time period since the last checkpoint

  36. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  37. Atomicity • Problem: the basic commit protocol mentioned above cannot achieve full atomicity • Solution: rather than committing log records to the PU queues directly, the client first commits them to its own ATOMIC queue on SQS. Every log record is annotated with an ID which uniquely identifies the transaction. If a client fails and comes back: • It checks its ATOMIC queue; log records which carry the ID of a completely committed transaction are propagated to the PU queues and deleted from the ATOMIC queue after all of them have been propagated • Log records of other (incomplete) transactions are deleted immediately from the ATOMIC queue
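
A minimal sketch of this recovery pass (all names assumed; the queue is taken to hold (txn_id, record) messages, and committed_txn_id stands for the ID of the transaction the client knows it fully committed):

    def recover(sqs, atomic_queue_uri, committed_txn_id, pu_queue_uri_for):
        """Sketch of the recovery pass over the client's ATOMIC queue."""
        msgs = sqs.receive(atomic_queue_uri, 256, 0)
        # propagate every log record of the committed transaction first ...
        for _, (txn_id, record) in msgs:
            if txn_id == committed_txn_id:
                sqs.send(pu_queue_uri_for(record), record)
        # ... and delete only afterwards; records of other (incomplete)
        # transactions are simply deleted right away
        for msg_id, _ in msgs:
            sqs.delete(atomic_queue_uri, msg_id)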

  38. Consistency Levels • Monotonic Reads: if a client reads the value of a data item x, any successive read operation on x by that client will always return the same value or a more recent value • Monotonic Writes: a write operation by a client on data item x is completed before any successive write operation on x by the same client • Read your writes: the effect of a write operation by a client on data item x will always be seen by a successive read operation on x by the same client • Write follows read: a write operation by a client on data item x following a previous read operation on x by the same client is guaranteed to take place on the same or a more recent value of x that was read

  39. Isolation: The Limits • The idea of isolation is to serialize transactions in the order of the time they started • Snapshot Isolation or BOCC (Backward-Oriented Concurrency Control) can be implemented on S3, but they need a global counter, which may become the bottleneck of the system

  40. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  41. Experiment • Naïve way as a baseline: write all dirty pages directly to S3 • 3 levels of consistency: • Basic: the protocol only supports the commit and checkpoint steps; it provides eventual consistency • Monotonicity: the protocol additionally supports full client-side consistency • Atomicity: on top of Monotonicity and Basic, it provides the highest level of consistency • The only difference between these 4 variants is the implementation of the commit and checkpoint steps

  42. TPC-W Benchmark • Models an online bookstore with queries asking for the availability of products and an update workload that involves the placement of orders • Retrieve the customer record from the database • Search for 6 specific products • Place orders for 3 of the 6 products

  43. Running Time [secs] • Each transaction simulates about 12 clicks (1 sec each) of a user • The higher the level of consistency, the lower the overall running time • Naïve writes all affected pages, while the other approaches only propagate log records • Latency can be reduced by sending several messages in parallel to S3 and SQS

  44. Cost • The cost was computed by running a large number of transactions and taking the cost measurements of AWS • Cost increases as the level of consistency increases • Cost can be reduced by setting the checkpoint interval to a larger value

  45. Vary Checkpoint Interval • A checkpoint interval below 10 seconds effectively initiates a checkpoint for every update that was committed • Increasing the checkpoint interval above 10 seconds decreases the cost

  46. Introduction • What is AWS • Using S3 as a disk • Basic commit protocols • Transactional properties • Experiments and results • Conclusion and future work

  47. Conclusion and future work • This paper focuses on high scalability and availability, but there might be scenarios where ACID properties are more important • Some new algorithms need to be devised; for instance, this system is not able to carry out chained I/O in order to scan through several pages on S3 • The right security infrastructure will be crucial for an S3-based information system

  48. Thank you Q&A
