C-Store: Updates. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009. Architecture of C-Store (Vertica) On a Single Node (or Site). Write Store (WS). WS is also a Column Store, and implements the identical physical DBMS design as RS.
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009
Write Store (WS) • WS is also a Column Store, and implements the identical physical DBMS design as RS. • WS is horizontally partitioned in the same way as RS. • There is a 1:1 mapping between RS segments and WS segments. • Note such mapping only exists for “HOT” RS segments. • A tuple is identified by a (sid, storage_key) pair in either RS or WS • sid: segment ID • Storage_key: the Storage Key of the tuple
Join Indexes • Every projection is represented as a collection of pairs of segments. • one in WS and one in RS. • For each tuple in the “sender”, we must store the sid and storage key of a corresponding tuple in the “receiver”.
Storage Key in WS • The Storage Key, SK, for each tuple is explicitly stored in each WS segment. • Columns in WS only keep a logical SORT KEY order via SKs. • A unique SK is given to each insert of a logical tuplerin a table T. • The SK of rmust be recorded in each projection that stores data for r. • This SK is an interger.
Storage Representation of Columnsin WS • Every column in a Projection • Represented as a collection of (v, sk) pairs • v :a data value in the column • sk :the storage key (explicitly stored) • Build a B-Tree over the (v, sk) pairs • Use the second field of each pair, sk, as the KEY
Sort Keys of Each Projection in WS • Represented as a collection of (s, sk) pairs • s:a sort key value • sk :the storage key describing where s first appears. • Build a B-Tree over the (s, sk) pairs • Use the first field of each pair, s, as the KEY
Storage Management • This issue is the allocation of segments to nodes in a grid (or cloud computing) system. • C-Store uses a storage allocator. • Some guidelines • All columns in a single segment of a projection should be co-located, i.e., put at the same node. • Join indexes should be co-located with their “sender” segments. • Each WS segment should be co-located with the RS segments that contain the same (sort) key range.
Updates • An update is either an insert or a delete • Insert a (new) tuple • Delete an (existing) tuple • Modify an existing tuple • Delete the existing version of the tuple. • Insert the new version of the tuple.
Allocating a Storage Key in a Grid • Background • All inserts corresponding to a single logical tuple have the same storage key. • Where to allocate a SK • The node at which the insert is received. • Globally Unique Storage Key • Each node maintains a locally unique counter. • The initial value of the counter = 1 + the largest key in RS. • Global SK = Local SK + Node ID.
Realizing Inserts in WS • WS is built on top of BerkeleyDB • Using B-Tree in the package to support inserts. • Every insert to a projection results in a collection of physical inserts on different disk pages. • One insert per column per projection. • Accessing disk pages is expensive. • The solution is using a very large memory buffer to hold “HOT” WS part.
Transaction Framework in C-Store • Large number of read-only transactions, • interspersed with a small number of update transactions covering few tuples. • To avoid substantial lock contention, use snapshot isolation to isolate read-only transactions. • Update transactions continue to set read and write locks and obey strict two-phase locking.
Snapshot Isolation • Basic idea • Allowing read-only transactions to read the snapshots of the database as of some time tin the recent past, • provided before which we can guarantee that there are no uncommitted transactions. • t: called the effective time. • The Key Problem • Determining which of the tuples in WS and RS should be visible to a read-only transaction running at effective time ET. • A tuple is visible if it was inserted before ET and deleted after ET.
Water Marks of Effective Time • High Water Mark (HWM) • The most recent effective time in the past at which snapshot isolation can run. • Low Water Mark (LWM) • The earliest effective time at which snapshot isolation can run. • LWM <= Any Effective Time <= HWM
Insertion Vector (IV) • Maintain an insertion vector for each segment in WS • For each tuple in the segment , the insertion vector contains the epoch in which the tuple was inserted. • Use Tuple Mover to assure that no tuples in RS were inserted after the LWM. • RS does not have insertion vectors.
Deleted Record Vector (DRV) • Maintain also a deleted record vector for each segment in WS • For each tuple, the DRV has one entry, containing • 0, if the tuple has not been deleted; • otherwise, the epoch in which the tuple was deleted. • DRV is very sparse (mostly 0s) • Can be compressed BY Run-Length Encoding. • The runtime system can consult IV and DRV to make the visibility calculation for each query on a tuple-by-tuple basis.
Maintaining the High Water Mark :Some Defintions • the timestamp authority (TA) • one node designated with the responsibility of allocating timestamps to other nodes. • Time is divided into a number of epochs, each epoch is relatively long (e.g., many seconds each). • Epoch number: • The number of epochs that have elapsed since the beginning of time.
HWM Selection Algorithm • Define the initial HWM to be epoch 0; and start current epoch at 1. • Periodically, the TA decides to move the system to the next epoch: • The TA sends a end of epoch message to each node; • Each node increments current epoch from e to e+1, • thus causing new transactions that arrive to be run with a timestamp e+1. • Each node waits for all the transactions that began in epoch e (or an earlier epoch) to complete; and then sends an epoch complete message to the TA. • Once the TA has received epoch complete messages from all nodes for epoch e, it sets the HWM to be e, and sends this value to each node.
LWM “chases” HWM • Periodically, the timestamp authority (TA) sends out to each node a new LWM epoch number. • By fixing a delta between LWM and HWM. • The delta is chosen to mediate between the needs of users who want historical access and the WS space constraint.
Tuple Mover • The job of the tuple mover • to move blocks of tuples in a WS segment to the corresponding RS segment, • Updating any join indexes in the process. • It operates as a background task looking for worthy segment pairs. • When it finds one, it performs a merge-out process, MOP on this (RS, WS) segment pair.
The Merge-Out Process (MOP) • In the chosen WS segment, MOP will find all tuples with an insertion time at or before the LWM. • When the LWM moves on, tuples become “old” enough. • then divides the “old” enough tuples into two groups: • Ones deleted at or before LWM. • These are discarded, because the user cannot run queries as of a time when they existed. • Ones that were not deleted, or deleted after LWM. • These are moved to RS.
Detailed Steps of MOP • First • MOP will create a new RS segment that we name RS'. • Then • it reads in blocks from columns of the RS segment, • deletes any RS tuples with a value in the DRV less than or equal to the LWM, • and merges in column values from WS. • The merged data is then written out to the new segment RS'. • Tuples receive new storage keys in RS', thereby requiring join indexes maintenance. • Once RS' contains all the WS data and join indexes are modified on RS' , the system cuts over from RS to RS'.
References • Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran and Stan Zdonik. C-Store: A Column Oriented DBMS VLDB, pages 553-564, 2005. • VERTICA DATABASE TECHNICAL OVERVIEW WHITE PAPER. http://www.vertica.com/php/pdfgateway?file=VerticaArchitectureWhitePaper.pdf