C-Store: Updates

Jianlin Feng

School of Software

SUN YAT-SEN UNIVERSITY

May 15, 2009

Architecture of C-Store (Vertica) on a Single Node (or Site)

Write Store (WS)

  • WS is also a Column Store, and implements the identical physical DBMS design as RS.

  • WS is horizontally partitioned in the same way as RS.

    • There is a 1:1 mapping between RS segments and WS segments.

    • Note that such a mapping only exists for “HOT” RS segments.

  • A tuple is identified by a (sid, storage_key) pair in either RS or WS.

    • sid: segment ID

    • storage_key: the storage key of the tuple

Join Indexes

  • Every projection is represented as a collection of pairs of segments.

    • one in WS and one in RS.

  • For each tuple in the “sender”, we must store the sid and storage key of a corresponding tuple in the “receiver”.
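As a sketch of the idea (all names and values below are illustrative, not from the C-Store code), a join index can be modeled as an array that maps each sender position to the (sid, storage_key) of the matching receiver tuple:

```python
# Toy join-index sketch: entry i gives the (sid, storage_key) of the
# receiver tuple that corresponds to sender tuple i.

join_index = [(0, 17), (0, 3), (1, 8)]  # sender position -> (sid, sk)

def receiver_location(sender_pos):
    """Where does sender tuple sender_pos live in the receiver projection?"""
    return join_index[sender_pos]

# e.g., sender tuple 1 maps to segment 0, storage key 3 in the receiver.
```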

Storage Key in WS

  • The storage key, SK, for each tuple is explicitly stored in each WS segment.

    • Columns in WS only maintain a logical sort-key order via SKs.

  • A unique SK is given to each insert of a logical tuple r in a table T.

    • The SK of r must be recorded in each projection that stores data for r.

    • This SK is an integer.

Storage Representation of Columns in WS

  • Every column in a Projection

    • Represented as a collection of (v, sk) pairs

      • v: a data value in the column

      • sk: the storage key (explicitly stored)

  • Build a B-Tree over the (v, sk) pairs

    • Use the second field of each pair, sk, as the KEY
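A minimal sketch of this representation (class and method names are illustrative; C-Store itself builds B-trees via BerkeleyDB, while here a sorted list plus `bisect` stands in for the B-tree keyed on sk):

```python
import bisect

# Sketch of a WS column: (v, sk) pairs indexed by storage key sk.
class WSColumn:
    def __init__(self):
        self._sks = []    # sorted storage keys (the index keys)
        self._vals = []   # values, kept parallel to _sks

    def insert(self, v, sk):
        # Keep the pairs ordered by sk, as a B-tree keyed on sk would.
        i = bisect.bisect_left(self._sks, sk)
        self._sks.insert(i, sk)
        self._vals.insert(i, v)

    def lookup(self, sk):
        # Point lookup by storage key.
        i = bisect.bisect_left(self._sks, sk)
        if i < len(self._sks) and self._sks[i] == sk:
            return self._vals[i]
        return None

col = WSColumn()
col.insert("pear", 7)
col.insert("apple", 3)
```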

Sort Keys of Each Projection in WS

  • Represented as a collection of (s, sk) pairs

    • s: a sort key value

    • sk: the storage key describing where s first appears.

  • Build a B-Tree over the (s, sk) pairs

    • Use the first field of each pair, s, as the KEY
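The sort-key index can be sketched the same way (illustrative names; a sorted list stands in for the B-tree keyed on s). Each entry records the storage key where a sort-key value first appears:

```python
import bisect

# (s, sk) pairs ordered by sort-key value s; sk is the storage key
# at which s first appears in the projection.
sort_index = [(10, 7), (20, 2), (35, 4)]

def first_sk(s):
    """Storage key where sort-key value s first appears, or None."""
    keys = [pair[0] for pair in sort_index]
    i = bisect.bisect_left(keys, s)
    if i < len(sort_index) and sort_index[i][0] == s:
        return sort_index[i][1]
    return None
```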

Storage Management

  • Storage management concerns the allocation of segments to nodes in a grid (or cloud computing) system.

    • C-Store uses a storage allocator.

  • Some guidelines

    • All columns in a single segment of a projection should be co-located, i.e., put at the same node.

    • Join indexes should be co-located with their “sender” segments.

    • Each WS segment should be co-located with the RS segments that contain the same (sort) key range.


Updates

  • An update is realized in terms of inserts and deletes:

    • Insert a (new) tuple

    • Delete an (existing) tuple

    • Modify an existing tuple

      • Delete the existing version of the tuple.

      • Insert the new version of the tuple.

Allocating a Storage Key in a Grid

  • Background

    • All inserts corresponding to a single logical tuple have the same storage key.

  • Where to allocate a SK

    • The node at which the insert is received.

  • Globally Unique Storage Key

    • Each node maintains a locally unique counter.

    • The initial value of the counter = 1 + the largest key in RS.

    • Global SK = Local SK + Node ID.
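One way to realize this scheme, as a sketch under assumptions: the slide's "Local SK + Node ID" is read here as packing the node ID into the high bits of the key (plain integer addition alone would not guarantee global uniqueness), and `NODE_SHIFT` is an illustrative constant, not from C-Store:

```python
# Assumed layout: node ID in the high bits, local counter in the low bits.
NODE_SHIFT = 48

class SKAllocator:
    def __init__(self, node_id, largest_rs_key):
        self.node_id = node_id
        # Locally unique counter, initialized past the largest key in RS.
        self.counter = largest_rs_key + 1

    def next_sk(self):
        local = self.counter
        self.counter += 1
        return (self.node_id << NODE_SHIFT) | local

a = SKAllocator(node_id=1, largest_rs_key=100)
b = SKAllocator(node_id=2, largest_rs_key=100)
k1 = a.next_sk()   # first key allocated on node 1
k2 = a.next_sk()   # second key on node 1
k3 = b.next_sk()   # first key on node 2 -- distinct despite same counter
```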

Realizing Inserts in WS

  • WS is built on top of BerkeleyDB

    • Using the B-tree in the package to support inserts.

  • Every insert to a projection results in a collection of physical inserts on different disk pages.

    • One insert per column per projection.

    • Accessing disk pages is expensive.

  • The solution is to use a very large memory buffer to hold the “HOT” part of WS.

Transaction Framework in C-Store

  • Expected workload: a large number of read-only transactions, interspersed with a small number of update transactions covering few tuples.

  • To avoid substantial lock contention, use snapshot isolation to isolate read-only transactions.

  • Update transactions continue to set read and write locks and obey strict two-phase locking.

Snapshot Isolation

  • Basic idea

    • Allow read-only transactions to read a snapshot of the database as of some time t in the recent past,

    • provided that t is a time before which we can guarantee there are no uncommitted transactions.

    • t is called the effective time.

  • The Key Problem

    • Determining which of the tuples in WS and RS should be visible to a read-only transaction running at effective time ET.

    • A tuple is visible if it was inserted at or before ET, and either never deleted or deleted after ET.

Watermarks of Effective Time

  • High Water Mark (HWM)

    • The most recent effective time in the past at which snapshot isolation can run.

  • Low Water Mark (LWM)

    • The earliest effective time at which snapshot isolation can run.

  • LWM <= Any Effective Time <= HWM

Insertion Vector (IV)

  • Maintain an insertion vector for each segment in WS

    • For each tuple in the segment, the insertion vector contains the epoch in which the tuple was inserted.

  • The Tuple Mover ensures that no tuples in RS were inserted after the LWM.

    • RS does not have insertion vectors.

Deleted Record Vector (DRV)

  • Maintain also a deleted record vector for each segment in WS

    • For each tuple, the DRV has one entry, containing

    • 0, if the tuple has not been deleted;

    • otherwise, the epoch in which the tuple was deleted.

  • DRV is very sparse (mostly 0s)

    • It can be compressed using run-length encoding.

  • The runtime system can consult IV and DRV to make the visibility calculation for each query on a tuple-by-tuple basis.
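The visibility calculation can be sketched as follows, with IV and DRV modeled as plain per-tuple lists (names are illustrative, not from the C-Store code):

```python
def visible(iv, drv, i, et):
    """Is tuple i visible at effective time et?

    iv[i]  -- epoch in which tuple i was inserted
    drv[i] -- 0 if never deleted, else the epoch of deletion
    """
    inserted_in_time = iv[i] <= et
    still_live = drv[i] == 0 or drv[i] > et
    return inserted_in_time and still_live

iv  = [1, 3, 6]   # insertion epochs
drv = [0, 4, 0]   # deletion epochs (0 = not deleted)
```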

Maintaining the High Water Mark: Some Definitions

  • The timestamp authority (TA)

    • One node designated with the responsibility of allocating timestamps to other nodes.

  • Time is divided into a number of epochs, each epoch is relatively long (e.g., many seconds each).

  • Epoch number:

    • The number of epochs that have elapsed since the beginning of time.

HWM Selection Algorithm

  • Define the initial HWM to be epoch 0, and start the current epoch at 1.

  • Periodically, the TA decides to move the system to the next epoch:

    • The TA sends an end-of-epoch message to each node;

    • Each node increments its current epoch from e to e+1,

    • thus causing newly arriving transactions to run with timestamp e+1.

  • Each node waits for all the transactions that began in epoch e (or an earlier epoch) to complete; and then sends an epoch complete message to the TA.

  • Once the TA has received epoch complete messages from all nodes for epoch e, it sets the HWM to be e, and sends this value to each node.
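The epoch protocol above can be sketched as a toy single-process simulation (all class and method names are illustrative; real nodes would wait asynchronously before reporting epoch completion):

```python
class Node:
    def __init__(self):
        self.current_epoch = 1
        self.open_txns = []  # (epoch, txn_id) of running transactions

    def end_of_epoch(self):
        # New transactions arriving after this run in the new epoch.
        self.current_epoch += 1

    def epoch_complete(self, e):
        # True once no transaction that began in epoch <= e is running.
        return all(epoch > e for epoch, _ in self.open_txns)

class TimestampAuthority:
    def __init__(self, nodes):
        self.nodes = nodes
        self.hwm = 0
        self.current_epoch = 1

    def advance_epoch(self):
        e = self.current_epoch
        for n in self.nodes:
            n.end_of_epoch()
        self.current_epoch += 1
        # Once every node reports epoch e complete, HWM := e.
        if all(n.epoch_complete(e) for n in self.nodes):
            self.hwm = e

nodes = [Node(), Node()]
ta = TimestampAuthority(nodes)
ta.advance_epoch()   # with no open transactions, HWM becomes 1
```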

LWM “Chases” HWM

    • Periodically, the timestamp authority (TA) sends out to each node a new LWM epoch number.

      • By fixing a delta between LWM and HWM.

    • The delta is chosen to mediate between the needs of users who want historical access and the WS space constraint.
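The policy reduces to simple arithmetic; a sketch (DELTA is an assumed tuning knob, not a value from the paper):

```python
DELTA = 10  # epochs of history retained; larger = more historical
            # queries supported, but more WS space consumed

def lwm_for(hwm):
    """The LWM trails the HWM by a fixed delta, never below epoch 0."""
    return max(0, hwm - DELTA)
```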

Tuple Mover

    • The job of the tuple mover

      • to move blocks of tuples in a WS segment to the corresponding RS segment,

      • Updating any join indexes in the process.

    • It operates as a background task looking for worthy segment pairs.

      • When it finds one, it performs a merge-out process (MOP) on this (RS, WS) segment pair.

The Merge-Out Process (MOP)

    • In the chosen WS segment, MOP will find all tuples with an insertion time at or before the LWM.

      • When the LWM moves on, tuples become “old” enough.

    • MOP then divides the “old enough” tuples into two groups:

      • Ones deleted at or before LWM.

        • These are discarded, because the user cannot run queries as of a time when they existed.

      • Ones that were not deleted, or deleted after LWM.

        • These are moved to RS.

Detailed Steps of MOP

    • First

      • MOP will create a new RS segment that we name RS'.

    • Then

      • it reads in blocks from columns of the RS segment,

      • deletes any RS tuples with a value in the DRV less than or equal to the LWM,

      • and merges in column values from WS.

      • The merged data is then written out to the new segment RS'.

      • Tuples receive new storage keys in RS', thereby requiring join index maintenance.

    • Once RS' contains all the WS data and join indexes are modified on RS' , the system cuts over from RS to RS'.
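The MOP filtering and merge described above can be sketched as follows. This is illustrative only: tuples are modeled as (value, insert_epoch, delete_epoch) triples with 0 meaning "not deleted", and storage-key reassignment plus join-index maintenance are omitted:

```python
def merge_out(rs_seg, ws_seg, lwm):
    """Build RS' from an (RS, WS) segment pair at the given LWM."""
    rs_new = []
    # Keep RS tuples that were not deleted at or before the LWM.
    for v, ins, dele in rs_seg:
        if dele == 0 or dele > lwm:
            rs_new.append((v, ins, dele))
    ws_rest = []
    for v, ins, dele in ws_seg:
        if ins > lwm:
            ws_rest.append((v, ins, dele))   # not "old enough" yet
        elif dele == 0 or dele > lwm:
            rs_new.append((v, ins, dele))    # moved to RS'
        # else: deleted at or before the LWM -> discarded
    return rs_new, ws_rest

rs = [("a", 1, 0), ("b", 1, 3)]
ws = [("c", 4, 0), ("d", 4, 5), ("e", 9, 0)]
rs2, ws2 = merge_out(rs, ws, lwm=5)
# "b" and "d" are discarded, "c" moves to RS', "e" stays in WS.
```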


References

    • Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. “C-Store: A Column-Oriented DBMS.” VLDB, pages 553-564, 2005.

    • Vertica Database Technical Overview White Paper. http://www.vertica.com/php/pdfgateway?file=VerticaArchitectureWhitePaper.pdf