
Key/Value Stores

CMSC 491/691 Hadoop-Based Distributed Computing, Spring 2014, Adam Shook


Presentation Transcript


  1. Key/Value Stores CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • HBase • Accumulo

  3. Apache HBase

  4. Overview • Distributed, scalable, column-oriented key/value store • Implementation of Google’s Bigtable for Hadoop • Provides random, real-time read/write access to tables • Billions of rows by millions of columns on HDFS • Three core components • HBase Master • HBase RegionServer • ZooKeeper

  5. How is data stored? • Table • Region • Store – One Store per ColumnFamily • MemStore • StoreFile • Block

  6. HBase Architecture (diagram) • Components: ZooKeeper, Client, Master, two RegionServers • Each RegionServer hosts a Region • Each Region holds two Stores • Each Store has one MemStore and two HFiles • All HFiles are stored on HDFS

  7. Data Model • Column families defined at table creation • Key: Row ID, Column Family, Column Qualifier, Timestamp • Value: byte[]
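The key layout on this slide can be sketched as a small sorted map. This is a minimal illustration in plain Python, not the HBase API: the `Key` type, `sort_key`, and `latest` are hypothetical names, and the ordering (row, family, and qualifier ascending, timestamp descending so the newest version comes first) is the assumption the sketch is built on.

```python
from typing import NamedTuple, Optional


class Key(NamedTuple):
    """Hypothetical stand-in for the four-part key on the slide."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int


def sort_key(k: Key):
    # Row, family and qualifier sort ascending as bytes;
    # the timestamp is negated so newer versions sort first.
    return (k.row, k.family, k.qualifier, -k.timestamp)


# Values are opaque byte strings, matching the byte[] value on the slide.
cells = {
    Key(b"row2", b"profile", b"name", 100): b"old",
    Key(b"row1", b"profile", b"name", 100): b"alice",
    Key(b"row2", b"profile", b"name", 200): b"new",
}
ordered = sorted(cells, key=sort_key)


def latest(row: bytes, family: bytes, qualifier: bytes) -> Optional[bytes]:
    """Return the newest value for a column, or None if it is absent."""
    for k in ordered:
        if (k.row, k.family, k.qualifier) == (row, family, qualifier):
            return cells[k]  # first match is the newest version
    return None
```

Because versions of the same column sit next to each other with the newest first, reading the current value only needs the first matching entry.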

  8. Locality Groups • Locality groups are a means to define different sets of columns that have different access patterns • Done via Column Families • Store metadata in one family, and images in another family • Set the proper column family based on what you need • Physically separated in HDFS to provide faster access times

  9. How is Data Stored? 'profile' Table View
     Row ID     Column             Timestamp      Value
     158865339  profile:created    1394663741975  1277328831000
     158865339  profile:followers  1394663741975  233076
     158865339  profile:name       1394663741975  FastCoDesign
     244296542  profile:created    1394663741996  1296260757000
     244296542  profile:followers  1394663741996  891288
     244296542  profile:name       1394663741996  CorazoonBipolar
     255409050  profile:created    1394663742000  1298279244000
     255409050  profile:followers  1394663742000  320818
     255409050  profile:name       1394663742000  Telkomsel
     308214563  profile:created    1394663742004  1306804256000
     308214563  profile:followers  1394663742004  704847
     308214563  profile:name       1394663742004  WorIdComedy
     Actual View
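The slide contrasts a table view with the individual key/value pairs. As a rough sketch, assuming each listed line is one stored cell of the form (row ID, column, timestamp, value), the cells can be folded back into one record per row. The function name `table_view` is hypothetical, and only the first two rows of the slide are used.

```python
from collections import defaultdict

# One tuple per stored cell: (row ID, column, timestamp, value).
cells = [
    ("158865339", "profile:created", 1394663741975, "1277328831000"),
    ("158865339", "profile:followers", 1394663741975, "233076"),
    ("158865339", "profile:name", 1394663741975, "FastCoDesign"),
    ("244296542", "profile:created", 1394663741996, "1296260757000"),
    ("244296542", "profile:followers", 1394663741996, "891288"),
    ("244296542", "profile:name", 1394663741996, "CorazoonBipolar"),
]


def table_view(cells):
    """Group flat cells into {row ID: {column: value}}."""
    rows = defaultdict(dict)
    for row_id, column, _timestamp, value in cells:
        rows[row_id][column] = value
    return dict(rows)
```

Each cell repeats the full row ID and column name, which is why short row IDs and column names matter in a store like this.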

  10. Regions • Regions are split on row ID • i.e. you cannot have multiple key/value pairs with the same row ID in two regions or HFiles • Regions are indexed and Bloom filtered to give HBase RegionServers the ability to quickly seek into an HDFS block and get the data
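Splitting on row ID means a row's region can be found with a binary search over the split keys. The sketch below is illustrative only: the split keys `"2"` and `"3"` and the region names are made up, and each region is assumed to cover a half-open range that includes its start key.

```python
import bisect

# Hypothetical split keys; three regions cover
# [-inf, "2"), ["2", "3") and ["3", +inf).
split_keys = ["2", "3"]
region_names = ["region-0", "region-1", "region-2"]


def region_for(row_id: str) -> str:
    """Return the single region whose key range contains row_id."""
    # bisect_right places a row equal to a split key in the
    # region that starts at that key.
    return region_names[bisect.bisect_right(split_keys, row_id)]
```

With the row IDs from the 'profile' example, `region_for("158865339")` falls in the first region and `region_for("308214563")` in the last; every cell of a given row lands in the same region because only the row ID is consulted.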

  11. Regions

  12. Bloom Filters and Block Caching • Use these for optimal fetch performance! • Bloom Filters • Stored in memory on each RegionServer • Used as a preliminary test prior to opening a region on HDFS • Very effective for fetches that are likely to have a null value • Block Caching • Configurable number of key/value pairs to read into memory when a RegionServer fetches data • Very effective for multiple fetches with similar keys • Can configure HBase to store all regions in-memory
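To make the "preliminary test" idea concrete, here is a minimal Bloom filter sketch. It is not HBase's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by prefixing the key with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent", so the file read can be skipped.
        return all(self.bits[position] for position in self._positions(key))
```

A fetch for a missing row usually gets `False` from `might_contain` and never touches HDFS, which is why the slide calls Bloom filters very effective for fetches that are likely to return nothing.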

  13. Compactions • Minor • Picks up a few StoreFiles and merges them together • Can sometimes pick up all the files in the Store and promote itself to a Major compaction • Major • Merges all files into a single StoreFile per Store • All expired cells are dropped (this does not happen in minor compactions)
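The difference between the two compaction types can be sketched as one merge function with a flag. This follows the slide's description only (expired cells are dropped by major compactions, not minor ones); the function name, the `(row, column, timestamp)` key shape, and the TTL-based notion of "expired" are assumptions of the sketch.

```python
def compact(store_files, now, ttl, major=False):
    """Merge several store files into one sorted list of cells.

    store_files: list of lists of ((row, column, timestamp), value).
    A major compaction also drops cells older than ttl.
    """
    merged = {}
    for store_file in store_files:
        for key, value in store_file:
            merged[key] = value

    result = []
    # Sort by row and column ascending, newest timestamp first.
    for key in sorted(merged, key=lambda k: (k[0], k[1], -k[2])):
        timestamp = key[2]
        if major and now - timestamp > ttl:
            continue  # expired cell, dropped only during a major compaction
        result.append((key, merged[key]))
    return result


files = [
    [(("a", "f:q", 10), "old")],
    [(("a", "f:q", 95), "new"), (("b", "f:q", 90), "x")],
]
minor_result = compact(files, now=100, ttl=50)
major_result = compact(files, now=100, ttl=50, major=True)
```

Here the minor run keeps all three cells in one merged file, while the major run drops the cell written at timestamp 10 because it is older than the TTL.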

  14. Creating and Managing Tables • Tables contain Column Families • You can (and should) pre-define your table split keys • Defines the regions of a table • Allows for better data distribution, especially when doing a bulk-load of data • HBase will split regions automatically as needed • Master has no part in this • Lower number of regions preferred, in the range of 20 to low-hundreds per RegionServer • Can split manually

  15. Bulk Importing • Create table • Use MapReduce to generate HFiles in batch • Tell HBase where the table files are • Drastically reduces run-time for table ingestion

  16. What can I do with it? • HBase is designed for fast fetches (~10ms) of your big data sets • Random Inserts/Updates/Deletes of data • Versioning • Changing schemas

  17. What shouldn’t I do with it? • Full-table scans • Slow • Use MapReduce instead (still slow) • High-throughput transactions • Use Redis or another in-memory solution for data sets that can fit in-memory • Monotonically Increasing Row IDs • There are workarounds!
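One common workaround for monotonically increasing row IDs is to prefix each ID with a hash-derived "salt" so that consecutive writes spread across regions instead of all hitting the last one. The slide does not say which workaround it has in mind, so this is only one possibility; the bucket count and the function name are hypothetical.

```python
import hashlib

NUM_BUCKETS = 4  # hypothetical; often chosen near the number of regions


def salted_row_id(row_id: str) -> str:
    """Prefix the row ID with a deterministic bucket number."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    # Zero-padded so bucket prefixes sort consistently as strings.
    return f"{bucket:02d}-{row_id}"
```

The trade-off is on the read side: a range scan over the original IDs now needs one scan per bucket, because neighbouring IDs no longer sit next to each other.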

  18. Types of Operations • Three Java objects to work with a table • Put • Get • Delete • Scanning can be done with the 'Scan' object
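The four operations can be modelled with a toy in-memory table. This is not the HBase client API (which is Java, as the slide notes); `ToyTable` and its method signatures are invented for illustration, and the scan is assumed to include the start row and exclude the stop row.

```python
import bisect


class ToyTable:
    """In-memory stand-in showing put, get, delete and scan semantics."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column)
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def get(self, row):
        """Return every column of one row as a dict."""
        return {c: self._values[(r, c)] for r, c in self._keys if r == row}

    def delete(self, row):
        """Remove every cell of one row."""
        for key in [k for k in self._keys if k[0] == row]:
            self._keys.remove(key)
            del self._values[key]

    def scan(self, start_row, stop_row):
        """Yield cells in key order for start_row <= row < stop_row."""
        start = bisect.bisect_left(self._keys, (start_row, ""))
        for row, column in self._keys[start:]:
            if row >= stop_row:
                break
            yield row, column, self._values[(row, column)]
```

Because the keys are kept sorted, a scan is a seek to the start row followed by a sequential read, which is the access pattern the region layout is built for.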

  19. Table Manipulation • HBaseAdmin • Management commands of creating tables, enabling/disabling tables, deleting tables, etc. • HTable • Actually putting/fetching/deleting/scanning data

  20. Simple Example • A Basic HBase application that demonstrates: • Creating a Table • Deleting a Table • Putting data • Getting data • Scanning data • With a simple Column Family filter

  21. Apache Accumulo

  22. Overview • Google's BigTable for Hadoop w/Security • Similar to HBase • Generally, Accumulo is faster at Writes, HBase is faster at Reads

  23. Accumulo Architecture (diagram) • Components: ZooKeeper, Client, Master, two TabletServers • Each TabletServer hosts a Tablet • Each Tablet holds two CFs • Each CF has one MemStore and two TFiles • All TFiles are stored on HDFS

  24. Data Model • Identical to HBase, with an additional 'visibility' label • Column families defined dynamically • Key: Row ID, Column Family, Column Qualifier, Visibility, Timestamp • Value: byte[]
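The effect of the visibility label can be sketched as a filter applied at read time. This is a deliberately simplified model, not Accumulo's expression parser: it handles only flat expressions that use `&` alone or `|` alone (no parentheses), and the function names and sample labels are hypothetical.

```python
def visible(expression: str, authorizations: set) -> bool:
    """Check a flat visibility expression against a reader's authorizations."""
    if not expression:
        return True  # an empty label is readable by everyone
    if "&" in expression:
        # Conjunction: the reader must hold every listed label.
        return all(term in authorizations for term in expression.split("&"))
    # Disjunction (or a single label): any one label is enough.
    return any(term in authorizations for term in expression.split("|"))


def scan_with_auths(cells, authorizations):
    """Yield only the (key, visibility, value) cells the reader may see."""
    for key, visibility, value in cells:
        if visible(visibility, authorizations):
            yield key, visibility, value
```

Since the label is part of the key, two cells with the same row and column but different visibilities can coexist, and each reader sees only the ones their authorizations allow.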

  25. Features Include • Creating/Deleting Tables • Major/Minor Compactions • Bloom Filters/Block Caching • Bulk Importing • Transactions via Mutations • Two Types of Range Scans • Scanner vs Batch Scanner • Iterators

  26. Iterators • Real-Time processing framework • Provide "Reduce-like" functionality, but at very low latency • Iterators occur at compaction time and scan time to modify key/value pairs • AgeOffIterator – automatically age off key/value pairs during scans and compactions
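An iterator can be thought of as a function that consumes a sorted stream of key/value pairs and emits a modified stream. The two generators below sketch that idea in plain Python rather than Accumulo's Java iterator interface; their names, the `(key, timestamp, value)` tuple shape, and the summing behaviour are assumptions for illustration.

```python
from itertools import groupby


def age_off(cells, now_ms, ttl_ms):
    """Drop cells older than ttl_ms, in the spirit of an age-off iterator."""
    for key, timestamp, value in cells:
        if now_ms - timestamp <= ttl_ms:
            yield key, timestamp, value


def summing_combiner(cells):
    """'Reduce-like' step: collapse adjacent cells with the same key into a sum."""
    for key, group in groupby(cells, key=lambda cell: cell[0]):
        group = list(group)
        newest = max(timestamp for _, timestamp, _ in group)
        total = sum(int(value) for _, _, value in group)
        yield key, newest, total
```

Generators compose the same way stacked iterators do, so `summing_combiner(age_off(cells, now_ms, ttl_ms))` ages off old cells first and then sums whatever remains, whether the stream comes from a scan or from a compaction.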

  27. References • http://hbase.apache.org • http://accumulo.apache.org
