Cassandra – A Decentralized Structured Storage System A. Lakshman1, P. Malik1 1Facebook SIGOPS '10 2011. 03. 18. Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
The Rise of NoSQL Refer to http://www.google.com/trends?q=nosql Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases. The name attempted to label the emerging class of distributed data stores that often did not attempt to provide ACID guarantees.
NoSQL Database • Based on Key-Value • memcached, Dynamo, Voldemort, Tokyo Cabinet • Based on Column • Google BigTable, Cloudata, HBase, Hypertable, Cassandra • Based on Document • MongoDB, CouchDB • Based on Graph • Neo4j, FlockDB, InfiniteGraph
Refer to http://blog.nahurst.com/visual-guide-to-nosql-systems
Contents • Introduction • Remind: Dynamo • Cassandra • Data Model • System Architecture • Partitioning • Replication • Membership • Bootstrapping • Operations • WRITE • READ • Consistency level • Performance Benchmark • Case Study • Conclusion
Remind: Dynamo • Distributed Hash Table • BASE • Basically Available • Soft-state • Eventually Consistent • Client-tunable consistency/availability
Cassandra • Dynamo-Bigtable lovechild • Column-based data model • Distributed Hash Table • Tunable tradeoff • Consistency vs. Latency • Properties • No single point of failure • Linearly scalable • Flexible partitioning, replica placement • High availability (eventual consistency)
Data Model A Cluster contains Keyspaces. A Keyspace corresponds to a database or tablespace, a Column Family corresponds to a table, and a Column is the unit of data stored in Cassandra (see the sketch below).
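To make the hierarchy concrete, here is a minimal, illustrative sketch (not Cassandra's API) that models the data as nested maps; all names and types are made up for illustration.

```python
# Illustrative only: Cassandra's data model viewed as nested dictionaries.
# keyspace -> column family -> row key -> column name -> (value, timestamp)
from typing import Dict, Tuple

Column = Tuple[str, int]              # (value, timestamp)
Row = Dict[str, Column]               # column name -> column
ColumnFamily = Dict[str, Row]         # row key -> row
Keyspace = Dict[str, ColumnFamily]    # column family name -> column family

cluster: Dict[str, Keyspace] = {
    "UserKeyspace": {                 # keyspace ~ database / tablespace
        "Users": {                    # column family ~ table
            "user42": {               # row key
                "name": ("Alice", 1),                 # column -> (value, timestamp)
                "email": ("alice@example.com", 1),
            }
        }
    }
}
```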
System Architecture Partitioning Replication Membership Bootstrapping
Partitioning Algorithm (Figure: consistent hashing ring with nodes N1, N2, N3; Hash(key1) falls between N1 and N2, so N2 is deemed the coordinator of key1) • Distributed Hash Table • Data and servers are located in the same address space • Consistent Hashing • Key Space Partition: arrangement of the keys • Overlay Networking: routing mechanism
Partitioning Algorithm (cont'd) • Challenges • Non-uniform data and load distribution • Oblivious to the heterogeneity in the performance of nodes • Solutions • Nodes get assigned multiple positions on the ring (like Dynamo) – see the sketch below • Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)
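A hedged sketch of the consistent-hashing idea above, including the Dynamo-style fix of assigning each node several positions (virtual nodes) to even out load; the hash function, ring class, and node names are illustrative, not Cassandra's implementation.

```python
# Illustrative consistent hashing with virtual nodes (not Cassandra's code).
import bisect
import hashlib

def token(s: str) -> int:
    """Map a string onto the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node gets several positions on the ring to smooth load.
        self._ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in self._ring]

    def coordinator(self, key: str) -> str:
        """The first node clockwise from the key's position is its coordinator."""
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))   # e.g. N2 is deemed the coordinator of key1
```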
Replication (Figure: the coordinator of data1 replicates it to successor nodes on a ring of nodes A–J) • Replication strategies: Rack Unaware, Rack Aware, Datacenter Aware
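A rough sketch of Rack Unaware placement under the assumptions above: replicas are the coordinator plus the next distinct nodes walking the ring clockwise. This is an illustration, not Cassandra's replication code; tokens and node names are made up.

```python
# Illustrative "Rack Unaware" replica placement on a token ring.
import bisect

def replicas_for(key_token: int, ring: list[tuple[int, str]], n: int) -> list[str]:
    """ring: sorted list of (token, node); returns up to n distinct replica nodes."""
    positions = [t for t, _ in ring]
    start = bisect.bisect_right(positions, key_token) % len(ring)
    chosen: list[str] = []
    for i in range(start, start + len(ring)):
        node = ring[i % len(ring)][1]
        if node not in chosen:        # skip duplicate positions of the same node
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = [(10, "A"), (25, "B"), (40, "C"), (60, "D"), (80, "E")]
print(replicas_for(30, ring, 3))      # ['C', 'D', 'E']
```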
Cluster Membership Gossip Protocol is used for cluster membership • Super lightweight with mathematically provable properties • State disseminated in O(log N) rounds • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to • A member merges the received list with its own list
Gossip Protocol (Figure: three servers gossiping over times t1–t6; each server keeps the latest heartbeat timestamp it has seen per member, e.g. server1: t6, server2: t2, server3: t5, and the lists converge as members exchange and merge them)
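The merging step can be sketched as follows; this is an illustrative model of gossip state exchange, not Cassandra's implementation, and the server views are made up to mirror the figure.

```python
# Illustrative gossip merge: every T seconds a member bumps its own heartbeat,
# picks a random peer, and both keep the newest heartbeat seen per member.
def merge(mine: dict, theirs: dict) -> dict:
    """Keep the highest heartbeat counter seen for each member."""
    merged = dict(mine)
    for member, heartbeat in theirs.items():
        if heartbeat > merged.get(member, -1):
            merged[member] = heartbeat
    return merged

# server1 and server3 exchange their views (timestamps are illustrative).
server1_view = {"server1": 6, "server2": 2}
server3_view = {"server2": 2, "server3": 5}
print(merge(server1_view, server3_view))   # {'server1': 6, 'server2': 2, 'server3': 5}
```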
Accrual Failure Detector PHI(t) = -log10(P_later(t - t_last)), where P_later is the probability that a heartbeat arrives more than t - t_last after the previous one, estimated from observed inter-arrival times • Valuable for system management, replication, load balancing • Designed to adapt to changing network conditions • The output value, PHI, represents a suspicion level • Applications set an appropriate threshold, trigger suspicions and perform appropriate actions • In Cassandra the average time taken to detect a failure is 10–15 seconds with the PHI threshold set at 5
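A simplified sketch of a PHI accrual failure detector, assuming an exponential model of heartbeat inter-arrival times (the real detector estimates the observed distribution); the class and method names are illustrative.

```python
# Illustrative PHI accrual failure detector (simplified exponential model).
import math
import time

class AccrualFailureDetector:
    def __init__(self, threshold: float = 5.0):
        self.threshold = threshold
        self.intervals: list[float] = []        # recent heartbeat inter-arrival times
        self.last_heartbeat: float | None = None

    def heartbeat(self, now: float | None = None) -> None:
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-1000:]   # sliding window of samples
        self.last_heartbeat = now

    def phi(self, now: float | None = None) -> float:
        now = time.time() if now is None else now
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P_later(elapsed) ~ exp(-elapsed / mean), so phi = -log10(P_later)
        return elapsed / (mean * math.log(10))

    def suspect(self, now: float | None = None) -> bool:
        """Applications compare PHI against a threshold (e.g. 5 in Cassandra)."""
        return self.phi(now) > self.threshold
```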
Bootstrapping (Figure: a new node joins the ring between existing nodes) A new node gets assigned a token such that it can alleviate a heavily loaded node
WRITE • Interface • Simple: put(key, col, value) • Complex: put(key, [col:val, …, col:val]) • Batch • WRITE Operation • Commit log for durability • Configurable fsync • Sequential writes only • MemTable • No disk access (no reads or seeks) • SSTables are final • Read-only • Indexes • Always writable
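A minimal sketch of the write path described above: append to a commit log for durability, update the in-memory memtable with no disk reads, and flush to an immutable SSTable once the memtable is full. The class and file layout are illustrative, not Cassandra's code.

```python
# Illustrative write path: sequential commit-log append, memtable update,
# and flush to an immutable (read-only) SSTable when the memtable is full.
import json

class WritePath:
    def __init__(self, commit_log_path: str, memtable_limit: int = 4):
        self.commit_log_path = commit_log_path
        self.memtable: dict[str, dict[str, str]] = {}
        self.memtable_limit = memtable_limit
        self.sstables: list[dict[str, dict[str, str]]] = []   # final, read-only

    def put(self, key: str, col: str, value: str) -> None:
        # 1. Durability: sequential append to the commit log (fsync is configurable).
        with open(self.commit_log_path, "a") as log:
            log.write(json.dumps({"key": key, "col": col, "val": value}) + "\n")
        # 2. No disk reads or seeks: just update the in-memory memtable.
        self.memtable.setdefault(key, {})[col] = value
        # 3. When the memtable is large enough, flush it as an immutable SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(self.memtable)
            self.memtable = {}
```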
READ • Interface • get(key, column) • get_slice(key, SlicePredicate) • get_range_slices(KeyRange, SlicePredicate) • READ • Practically lock-free • SSTable proliferation • Row cache • Key cache
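A minimal sketch of the read path implied above: consult the row cache, then the memtable, then SSTables from newest to oldest until the column is found; the data structures are illustrative, not Cassandra's internals.

```python
# Illustrative read path: row cache first, then memtable, then SSTables
# from newest to oldest until the requested column is found.
def get(key: str, column: str,
        row_cache: dict, memtable: dict, sstables: list) -> str | None:
    if key in row_cache and column in row_cache[key]:
        return row_cache[key][column]
    if key in memtable and column in memtable[key]:
        return memtable[key][column]
    for sstable in reversed(sstables):        # newest SSTable wins
        if key in sstable and column in sstable[key]:
            return sstable[key][column]
    return None                               # column not found
```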
Consistency Level • Tuning the consistency level for each WRITE/READ operation (Table: available consistency levels for write and read operations)
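A hedged sketch of the consistency-level arithmetic: with replication factor N, picking a read level R and write level W such that R + W > N makes the read and write replica sets overlap, trading latency for consistency. The level names mirror common Cassandra settings, but the mapping below is illustrative.

```python
# Illustrative consistency-level arithmetic: with replication factor N,
# reading R replicas and writing W replicas overlap when R + W > N.
def replicas_required(level: str, n: int) -> int:
    """Map a consistency level to the number of replicas that must respond."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def is_strongly_consistent(read_level: str, write_level: str, n: int) -> bool:
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return r + w > n

print(is_strongly_consistent("QUORUM", "QUORUM", 3))   # True: 2 + 2 > 3
print(is_strongly_consistent("ONE", "ONE", 3))         # False: 1 + 1 <= 3
```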
Performance Benchmark • Random and Sequential Writes • Limited by bandwidth • Facebook Inbox Search • Two kinds of Search • Term Search • Interactions • 50+TB on 150 node cluster
vs MySQL with 50GB Data • MySQL • ~300ms write • ~350ms read • Cassandra • ~0.12ms write • ~15ms read
Case Study • Cassandra as primary data store • Datacenter and rack-aware replication • ~1,000,000 ops/s • High sharding and low replication • Inbox Search • 100TB • 5,000,000,000 writes per day
Conclusions • Cassandra • Scalability • High Performance • Wide Applicability • Future work • Compression • Atomicity • Secondary Index