
bigdata™


yosef

Presentation Transcript


  1. bigdata™ Object Journal

  2. bigdata™ object journal
     • Maximize write absorption rate.
     • Write-through persistent cache.
     • Incremental writes are absorbed in a direct buffer, but commit is synchronous.
     • Append-only journal on disk (ring buffer).
     • Concurrent transactions (tx) with full isolation.
     • Per-tx object index map and slot usage map.
     • Asynchronous migration of the last consistent state to the read-write (RW) DB.
     • Supports state-based validation and object / index merge.

  3. Client-server protocol (client)
     [Flowchart labels: begin(tx) → begin; write(tx) → serialize & buffer PO & MD in the direct buffer; buffer full? (y: evict); prepare(tx); commit(tx) → commit, write dirty objects; abort(tx) → abort.]

  4. Client-server protocol (server)
     • begin(tx) – begin a transaction on this segment.
     • write(tx) – write buffered objects on this segment. Writes are serialized and generally represent incremental writes within native transactions by a client.
     • prepare(tx) – prepare this segment for a commit by this transaction (part of the distributed transaction commit process).
     • commit(tx) – commit the transaction. Must succeed if prepare(tx) succeeded.
     • abort(tx) – abort the transaction. May be received at any time after a begin(tx) and before a commit(tx).
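The five requests above imply a small per-transaction state machine on each segment. The following is a minimal sketch of it, not the project's implementation: the class `SegmentTx`, the state names, and the rule that a tx accepts no further writes once prepared are all assumptions made for illustration.

```python
from enum import Enum, auto


class TxState(Enum):
    ACTIVE = auto()
    PREPARED = auto()
    COMMITTED = auto()
    ABORTED = auto()


class SegmentTx:
    """Hypothetical per-segment tracker for the begin/write/prepare/commit/abort protocol."""

    def __init__(self):
        self.states = {}  # tx id -> TxState
        self.writes = {}  # tx id -> list of buffered payloads

    def _require(self, tx, expected):
        # Reject any request that is illegal in the tx's current state.
        if self.states.get(tx) is not expected:
            raise ValueError(f"tx {tx}: expected {expected.name}, got {self.states.get(tx)}")

    def begin(self, tx):
        if tx in self.states:
            raise ValueError(f"tx {tx} already begun")
        self.states[tx] = TxState.ACTIVE
        self.writes[tx] = []

    def write(self, tx, payload):
        # Assumption: writes are only accepted before prepare(tx).
        self._require(tx, TxState.ACTIVE)
        self.writes[tx].append(payload)

    def prepare(self, tx):
        self._require(tx, TxState.ACTIVE)
        self.states[tx] = TxState.PREPARED

    def commit(self, tx):
        # commit(tx) is only legal after a successful prepare(tx).
        self._require(tx, TxState.PREPARED)
        self.states[tx] = TxState.COMMITTED

    def abort(self, tx):
        # Legal any time after begin(tx) and before commit(tx) or abort(tx).
        if self.states.get(tx) not in (TxState.ACTIVE, TxState.PREPARED):
            raise ValueError(f"tx {tx}: abort is not legal in state {self.states.get(tx)}")
        self.states[tx] = TxState.ABORTED
        self.writes[tx].clear()
```

A request that arrives out of order (for example, commit without prepare, or abort after commit) raises `ValueError` rather than changing state.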

  5. Absorb writes
     Writes on the journal are buffered before writing through to disk since they do not need to be durable until a commit. Migration to the DB, writes on the journal, and commit processing (prepare, commit, abort) occur in a separate thread.
     [Flowchart labels: write(tx); begun? (n: error); aborted? (y: error); absorb into direct buffer; add to pending write list; ok. Separate thread: write, prepare, migrate.]
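The idea that writes only need to be durable at commit can be sketched as follows. This is a single-threaded simplification (the slide places the disk writes in a separate thread), and `BufferedJournal`, `buffer_limit`, and the file name are hypothetical names chosen for the example.

```python
import os
import tempfile


class BufferedJournal:
    """Hypothetical append-only journal that absorbs writes in memory until commit."""

    def __init__(self, path, buffer_limit=4096):
        self._file = open(path, "ab")
        self._pending = []        # pending write list
        self._pending_bytes = 0
        self._limit = buffer_limit

    def write(self, record: bytes):
        # Absorb the write; nothing needs to be durable yet.
        self._pending.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self._limit:
            self._drain()         # avoid buffer exhaustion

    def _drain(self):
        for record in self._pending:
            self._file.write(record)
        self._pending.clear()
        self._pending_bytes = 0

    def commit(self):
        # Only commit forces the data through to the disk.
        self._drain()
        self._file.flush()
        os.fsync(self._file.fileno())

    def close(self):
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "journal.bin")
journal = BufferedJournal(path)
journal.write(b"0123456789")
size_before_commit = os.path.getsize(path)  # still buffered in memory
journal.commit()
size_after_commit = os.path.getsize(path)   # now on disk
journal.close()
```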

  6. Prepare transaction for commit
     prepare(tx) places the journal into a state in which only a commit record need be written for the tx to commit. Other writes may proceed concurrently if necessary to avoid buffer exhaustion, but prepare latency should be minimized in order to minimize overall commit latency.
     State-based validation and flushing the object index and usage maps can both result in full or partial buffers that need to be written on the journal.
     [Flowchart labels: prepare(tx); pending write(tx)? (y: write buffer to journal); validate; flush maps; ok.]

  7. Commit processing
     Write a commit record. This probably needs to include the then-current usage map, but validation has already been performed and the object index map associated with the tx has already been flushed.
     After a commit, migration to the RW DB needs to restart with the now most recent committed state in the journal. This is obtained by processing the root of the object index map in the journal for the last committed state. Object migration from the last consistent state must be suspended at some point and then resume with the new consistent state.
     After a commit, the object index maps in new transactions will track from the last committed state of the object index maps, while concurrent transactions must merge to the newly committed object index map during validation.
     Note that there is still an opportunity for the journal host or client-journal communications to fail after the prepare(tx) has been successfully processed or during the commit(tx). A 3-phase commit is used to work around this.
     [Flowchart labels: commit(tx); write commit record; journal.]

  8. Abort transaction
     • An abort is legal iff begin(tx) and !( commit(tx) | abort(tx) ).
     • On restart, the assumption is that there are no active transactions, so anything except begin(tx) is illegal.

  9. Restart
     • Restart consists of:
       • Reading the alternating journal commit records (vs transaction commit records, which are per-tx);
       • Determining which slot is the root of the last committed object index map;
       • Reading the last committed usage map.
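One common way to realize alternating commit records is to keep two record locations, write them in turn, and on restart pick the valid record with the highest commit counter. The sketch below illustrates that approach only; the record layout, the commit counter, and the CRC32 checksum are assumptions and are not described in the slides.

```python
import struct
import zlib

# Hypothetical layout: commit counter, object index map root slot, usage map slot.
_BODY = "<QII"


def pack_commit_record(counter, index_root_slot, usage_map_slot):
    body = struct.pack(_BODY, counter, index_root_slot, usage_map_slot)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_commit_record(raw):
    """Return (counter, index_root_slot, usage_map_slot), or None if the record is torn."""
    body = raw[:-4]
    (crc,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) != crc:
        return None
    return struct.unpack(_BODY, body)


def recover(record_a, record_b):
    """Choose the most recent intact commit record of the alternating pair."""
    valid = [r for r in (unpack_commit_record(record_a), unpack_commit_record(record_b))
             if r is not None]
    if not valid:
        raise ValueError("no valid journal commit record")
    return max(valid, key=lambda r: r[0])
```

If the newer record was only partially written when the host failed, its checksum fails and restart falls back to the older record, which still names a consistent object index map root and usage map.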

  10. Migrate to database
     • Migration reads rows from the last committed object index map, updates their corresponding rows on the read-write database, and then logically deletes the object on the journal (both its slot and its entry in the object index map).
     • Migration can be incremental, i.e., processing N objects at a time.
     • Migration can be concurrent with most other journal activities, especially when the journal is wired into memory (so the disk write head continues to be append only).
     • Migration should (and perhaps must) be suspended during the final stages of prepare(tx) processing so that the object index map is stable on disk until after the follow-on commit(tx) or abort(tx).
     • Objects to be migrated are part of a committed transaction and hence are already validated, merged and consistent.
     • If read activity predominates for the segment, then the journal should swiftly flush through to the database and its extent may be reduced to free memory for read caching of the database.
     • Migration may require reading objects from the journal on disk if the journal is not wired into a direct buffer, which would negatively affect disk head positioning.
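Incremental migration can be sketched with plain dictionaries standing in for the committed object index map, the journal slots, and the RW database. The function name `migrate_batch` and the dictionary representation are assumptions for illustration.

```python
from itertools import islice


def migrate_batch(index_map, slots, database, n):
    """Migrate up to n committed objects from the journal to the database.

    index_map: object id -> journal slot (last committed state)
    slots:     journal slot -> object bytes
    database:  object id -> object bytes (stand-in for the RW database)
    """
    batch = list(islice(index_map, n))  # snapshot the keys before mutating the map
    for oid in batch:
        slot = index_map[oid]
        database[oid] = slots[slot]     # update the row on the RW database
        del slots[slot]                 # logically delete the slot on the journal
        del index_map[oid]              # and its entry in the object index map
    return len(batch)
```

Calling the function repeatedly with a small `n` drains the journal a few objects at a time, which leaves room to suspend migration between batches while a prepare(tx) is in its final stages.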

  11. Logically deleting objects
     • Objects may be logically deleted once:
       • They are no longer the last committed state of the object; and
       • There is no current transaction (in the distributed database) that could read from that historical state of the object.
     • These are NOT the same as the pre-conditions for object migration to the RW database.
     • The history retention mechanism is distinct from the object deletion mechanism.
       • History is either in the RW database or logged onto a history journal from the RW database.
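The two deletion pre-conditions can be written as a predicate. This sketch assumes a timestamp model that the slides do not specify: each object state has a commit time and (once overwritten) a superseded time, and an active transaction reads the state that was current at its start time.

```python
def can_logically_delete(committed_at, superseded_at, active_tx_start_times):
    """Return True if a journal object state may be logically deleted.

    committed_at:          time at which this state was committed
    superseded_at:         time at which a newer state was committed, or None
    active_tx_start_times: start times of all current transactions
    """
    if superseded_at is None:
        return False  # still the last committed state of the object
    # Assumption: a tx started at time t reads this state iff committed_at <= t < superseded_at.
    return not any(committed_at <= t < superseded_at for t in active_tx_start_times)
```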

  12. State-based validation
     • Validation occurs during prepare(tx).
     • Supports merging object states across concurrent or distributed transactions.
       • Example: bank credit/debit history.
     • Supports merging objects based on identity across concurrent or distributed transactions.
       • Example: two objects created for the same URI within an RDF graph.
     • Supporting object metadata may be required.
       • Example: drop/add records for link set indices.
     • Backward state-based validation technique.
       • Constant cost in the number of concurrent transactions (forward validation is linear in the number of concurrent transactions).
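The bank example can be sketched as a three-way merge performed at prepare time. The transaction compares the version it started from against the currently committed version only, which is why the cost does not grow with the number of concurrent transactions. The `(version, value)` representation and all function names are assumptions for illustration.

```python
class ValidationError(Exception):
    """Raised when a concurrent commit conflicts and no merge rule applies."""


def merge_balance(base, committed, ours):
    # Re-apply our net credit/debit on top of the concurrently committed balance.
    return committed + (ours - base)


def validate_object(base, committed, ours, merge=None):
    """Backward validation of one object during prepare(tx).

    base:      (version, value) the tx started from
    committed: (version, value) of the last committed state now
    ours:      value the tx wants to commit
    """
    if committed[0] == base[0]:
        return ours                    # nobody committed this object concurrently
    if merge is None:
        raise ValidationError("concurrent update and no merge rule")
    return merge(base[1], committed[1], ours)
```

For example, if the tx read a balance of 100 and debited 30 (writing 70) while another tx committed a credit of 50 (balance 150), the merged balance is 120.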

  13. Object index map
     • B+-tree data structure mapping object identifiers to slots in the journal.
     • Object identifier is int32 (logical page and slot in the RW database).
     • Map is copy on write.
     • Transaction begins with the root node from the last committed transaction.
     • Updates force a copy, which percolates up the tree to the root.
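Copy-on-write with copies percolating to the root is often called path copying. The sketch below shows the idea on a plain binary search tree rather than a B+-tree, purely to keep the example short; the node layout and function names are assumptions.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int                        # object identifier
    slot: int                       # journal slot holding the object
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, slot):
    """Return a new root; only nodes on the path to `key` are copied."""
    if root is None:
        return Node(key, slot)
    if key < root.key:
        return root._replace(left=insert(root.left, key, slot))
    if key > root.key:
        return root._replace(right=insert(root.right, key, slot))
    return root._replace(slot=slot)  # overwrite: copy this node with the new slot


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.slot
        root = root.left if key < root.key else root.right
    return None


# A tx starts from the last committed root and updates its own copy.
committed = insert(insert(insert(None, 10, 1), 5, 2), 20, 3)
tx_root = insert(committed, 5, 9)
```

The committed root is untouched by the update, so other transactions and migration can keep reading it, while untouched subtrees are shared between the two versions.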

  14. Slot usage map (slot allocation structure)
     • Reports ( free | used ) for each slot.
     • Reports whether used by the calling tx.
     • Supports VLR tx by reuse of slots allocated and later logically deleted within the same tx.
     • Bitmap index is a possible design.
       • Might not support slot reuse in VLR tx.
       • 25k bitmap for a journal with 200k slots.
       • Incremental writes may not be possible.
     • One alternative is a btree.
       • slot : < tx, (free|used|reusable-by-tx) >
       • Copy-on-write semantics similar to the object index map.
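The bitmap design and its size estimate can be checked with a small sketch: one bit per slot means 200,000 slots need 200,000 / 8 = 25,000 bytes. The class `SlotBitmap` and its methods are hypothetical names.

```python
class SlotBitmap:
    """Hypothetical one-bit-per-slot usage map: 0 = free, 1 = used."""

    def __init__(self, n_slots):
        self.n_slots = n_slots
        self.bits = bytearray((n_slots + 7) // 8)

    def is_used(self, slot):
        return bool(self.bits[slot >> 3] & (1 << (slot & 7)))

    def allocate(self):
        # Linear scan for the first free slot.
        for slot in range(self.n_slots):
            if not self.is_used(slot):
                self.bits[slot >> 3] |= 1 << (slot & 7)
                return slot
        raise RuntimeError("no free slot")

    def free(self, slot):
        self.bits[slot >> 3] &= ~(1 << (slot & 7)) & 0xFF


bitmap = SlotBitmap(200_000)
bitmap_size_bytes = len(bitmap.bits)
```

A single bit cannot record which tx owns a slot, which is consistent with the concern that a bitmap might not support slot reuse within a VLR tx.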

  15. Provisioning the journal
     • A journal is provisioned for a slot size and (initial) extent.
     • The slot size should be set based on the expected average or median object size for the database segment.
     • The slot size is fixed unless the journal is completely flushed to the DB and re-provisioned.
     • The extent may grow due to logical overflow and can be shrunk by compacting slots from the tail of the journal into free slots elsewhere in the journal.
     • The initial extent size depends on the expected burst write rates on the segment.
     • Ideally, the journal should be on its own disk for pure sequential access.
     • The journal may be:
       • Wired into a direct memory buffer for fastest processing;
       • Memory mapped; or
       • Used on disk, probably relying on the file system to handle caching.
     • Only a commit(tx) operation needs to flush to disk.
     • Experiment with the write-through option for the file system cache.

  16. Overflow
     • Normal processing will cause the journal to overflow its extent, at which point we begin to reuse slots from the head of the file.
     • Migration to the DB is used to release slots logically on the journal for reuse.
       • But we never migrate a slot until its tx has committed.
     • Very long transactions can cause physical overflow of the journal.
       • (Temporarily?) extend the journal; or
       • Begin updating rows directly in the DB, which requires exclusive row locking and BFIM logging.
     • Extremely long transactions should not force main memory overflow of the journal since we limit the number of keys or logical pages of the journal and we can reuse slots for objects that are overwritten multiple times within the same tx (just not slots for other tx’s that have not committed).
     • Very large objects should not be written on the journal. They should either use exclusive row locking and BFIM logging or simply be atomic extra-transactional writes.
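Logical overflow (wrapping to reuse released slots) and physical overflow (extending the journal) can be sketched with a simple allocator. `RingJournal` and its methods are hypothetical, and extending by one slot at a time is a simplification chosen for the example.

```python
class RingJournal:
    """Hypothetical slot allocator that wraps to the head and extends when full."""

    def __init__(self, extent):
        self.used = [False] * extent
        self.next = 0                     # next position of the append pointer

    def allocate(self):
        n = len(self.used)
        for i in range(n):
            slot = (self.next + i) % n    # wrap around to the head of the file
            if not self.used[slot]:
                self.used[slot] = True
                self.next = (slot + 1) % n
                return slot
        # Physical overflow: every slot is still in use, so extend the journal.
        self.used.append(True)
        self.next = 0
        return n

    def release(self, slot):
        # Called once the slot's object has been migrated to the DB.
        self.used[slot] = False
```

With an extent of 3, the first three allocations return slots 0, 1 and 2; after slot 0 is released by migration, the next allocation wraps and reuses it, and a further allocation with nothing released grows the journal to 4 slots.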

  17. History
     • This design does not preserve historical states of objects.
     • History-preserving options include:
       • Extend GPO to support this within its data record;
       • Extend the RW DB to support this within its slot map, overflowing the logical page to a continuation page as necessary (we still bound segment size, so the database size will have limits determined only in part by the history retention policy);
       • Write historical states into a journal whose overflow policy is logical deletion of historical states based on the history policy for the objects (vs migration of objects onto the RW database).
