Big Data – Distributed Database (HBase)

Big Data – Distributed Database (HBase) Kalapriya Kannan IBM Research Labs July, 2013

No SQL Vs SQL, But wait…. • What is SQL? • Select- From- Where Statements • Multirelational Queries • Subqueries. • Says “what to do” rather than “how to do” • DBMS Engine figures out the `best’ way to execute query. • Called ‘query optimization’

What is RDBMS? • DBMS is Collection of –interrelated data and – set of programs to access the data • Convenient and efficient processing of data Database Application Software • Database: collection of entities and relationship among entities • RESPECTS – ACID • But only when its required • viz banking, finance, safety systems etc., • The kind of systems that people were building with computers decades ago…

RDBMS • Set of relation names: R • Set of attribute names: A • Relation schema: S=(r,{a1, ...,an}) • r relation name in R • {a1, ...,an} subset of A • e.g., (Dog,{Name, Age, Weight, Breed}) Relationship Name DOG

Structure of a SQL statement SELECT desired attributes FROM one or more tables WHERE condition about tuples of the tables

Example query Using University(name, degree) what degrees are offered by IITJodhpur? SELECT name FROM univeristy WHERE name = ’IITJodhpur’; Single relation query Begin with the relation in the “FROM” clause. Apply the selection indicated by the “WHERE” Clause Apply the extended projection indicated by the “SELECT” clause

Operational Semantics Include t.degree in the output • Think of a tuple variable visiting each tuple of the relation mentioned in FROM. • Check if the “current” tuple satisfies the WHERE clause. • If so, compute the attributes or expressions of the SELECT clause using the components of this tuple. Tuple variable ‘t’ loops over all tuples Checks if name is IIIT Jodhpur

Multi-relationship queries • Interesting queries often combine data from more than one relation. • We can address several relations in one query by listing them all in the FROM clause. • Distinguish attributes of the same name by “<relation>.<attribute>” . • Example joining two relationship • SELECT UNIVERSITY.name • FROM UNIVERSITY, DEPARMENTS • Where university.degree = departments.degree AND department.name = “computer science”

Operational Semantics • Almost the same as for single-relation queries: • Start with the product of all the relations in the FROM clause. • Apply the selection condition from the WHERE clause. • Project onto the list of attributes and expressions in the SELECT clause.

NoSQL: The Name • “SQL” = Traditional relational DBMS • Recognition over past decade or so: Not every data management/analysis problem is best solved using a traditional relational DBMS • “NoSQL” = “No SQL” = Not using traditional relational DBMS • “No SQL”  Don’t use SQL language • “No SQL” = “Not ONLY SQL”

NoSQL systems • Alternative to traditional relational DBMS • Flexible schema • Quicker/cheaper to set up • Massive scalability • Relaxed consistency  higher performance & availability • No declarative query language  more programming • Relaxed consistency  fewer guarantees • Several incarnations • MapReduce framework • Key-value stores • Document stores • Graph database systems

Different types of NoSQL • Column Store • Column is saved together, as opposed to row data • Super useful for data analytics • Hadoop, Cassandra, hypertable. • Key Value store • A key refers to a payload • Redis, Azure, HBase • Document/Object/XML Store • Key (and possibly other indexes) point at a serialized object • DB can operate against values in document. • Mango DB, Couch DB etc., • Graph Store • Nodes are stored independently, and the relationship between nodes (edges) are stored with data.

Key Value stores • Extremely simple interface • Data model: (key, value) pairs • Operations: Insert(key,value), Fetch(key), • Update(key), Delete(key) • Some allow (non-uniform) columns within value • Some allow Fetch on range of keys • Implementation: efficiency, scalability, fault-tolerance • Records distributed to nodes based on key • Replication • Single-record transactions, “eventual consistency” • Example systems • Google BigTable, Amazon Dynamo, Cassandra,Voldemort, HBase, …

Why NoSQL? • The opposite of ACID is • BASE • Basically Available – guaranteed availability • Soft state – the state of the system may change, even without a query (because of the node updates). • Eventually consistent - the system will eventually become consistent over time. • Eventual Consistency • Because of the distributed model, any server can answer any query • Servers communicate amongst themselves at their own pace (behind the scenes) • The server that answers your query might not have the latest data • Who cares if you see kim’s latest tweet.?

SQL to NoSQL - performance

Distributed HBase • Distributed Hash table • Get, put, delete, scan and CaS • Denormalization is necessary • Not a parallel database, just distributed • Write ahead log/data durability • Master/slave replication • ACID compliance

Rows are sorted lexicographically based on row key  contiguous set of sorted rows Region: HBase - Introduction • A sparse, multi-dimensional, sorted map • {row, column, timestamp} -> cell • Column = Column Family : Column qualifier • Rows are sorted lexicographically based on row key • Region: contiguous set of sorted rows • HBase: a large number of columns, a low number of column • families (2-3)  HBase: a large number of columns, a low number of column families (2 - 3)

Sorted Map Data Store

Column families • Different set of columns may have different properties and access patterns • Configuration by column family: • Compression (none, gzip,LZO) • Version retention policies • Cache priority • CFs stored seperately on disk: access one without wasting IO on the other. How to access HBase • Java API (thick client) • REST/HTTP • Apache Thrift (any language) • Hive/Pig for analytics.

HBase Basics • Operations are based on row keys • Single row operations: • Put • Get • Scan • Multi-row operations: • Scan • Put • No built-in joins (use MapReduce)

Terms and daemons • Region • A subset of a tables row’s, like a range partition • Automatically sharded • RegionServer • Serves data for reads and writes • Master • Responsible for co-ordinating the slaves • Assigns regions, detects failures of Region servers • Controls some admin functions

Cluster Architecture

System Architecture

HBase Vs RDBMS

Some Exercises

Big Data – Distributed Database (HBase)

Big Data – Distributed Database (HBase)

Presentation Transcript

Introduction to Database Systems

DATABASE CONCEPTS

Business Analysis ITEC-630 Fall 2009

K-12 Statewide Longitudinal Data System (SLDS)

Database Systems Data Models

Database Administration

Database Beginnings

Object oriented Database

Lecture XIII: Replication-II

CMSC424: Database Design

Database Relational Model

Database Access using SQL

Chapter 19: Distributed Databases

Hector Garcia-Molina

Database Management Systems

Oracle Database 12c Release 1 (12.1.0.2 )

Distributed Database Management System

Agenda

Client and Service Information (CSI) Database

Session 8-9 Data Resource Management

Data Structures