Cassandra – A Decentralized Structured Storage System A. Lakshman1, P. Malik1 1Facebook SIGOPS '10 2011. 03. 18. Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
The Rise of NoSQL Refer to http://www.google.com/trends?q=nosql Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases. The name attempted to label the emerging class of distributed data stores that often did not attempt to provide ACID guarantees.
NoSQL Database • Based on Key-Value • memcached, Dynamo, Voldemort, Tokyo Cabinet • Based on Column • Google BigTable, Cloudata, HBase, Hypertable, Cassandra • Based on Document • MongoDB, CouchDB • Based on Graph • Neo4j, FlockDB, InfiniteGraph
Refer to http://blog.nahurst.com/visual-guide-to-nosql-systems
Contents • Introduction • Remind: Dynamo • Cassandra • Data Model • System Architecture • Partitioning • Replication • Membership • Bootstrapping • Operations • WRITE • READ • Consistency level • Performance Benchmark • Case Study • Conclusion
Remind: Dynamo • Distributed Hash Table • BASE • Basically Available • Soft-state • Eventually Consistent • Client-tunable consistency/availability
Cassandra • Dynamo-Bigtable lovechild • Column-based data model • Distributed Hash Table • Tunable tradeoff • Consistency vs. Latency • Properties • No single point of failure • Linearly scalable • Flexible partitioning, replica placement • High availability (eventual consistency)
Data Model A Cluster contains Keyspaces. A Keyspace corresponds to a database or tablespace, a Column Family corresponds to a table, and a Column is the unit of data stored in Cassandra (see the sketch below).
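To make the hierarchy concrete, here is a minimal, illustrative sketch (not Cassandra's API) that models the data as nested maps; all names and types are made up for illustration.

```python
# Illustrative only: Cassandra's data model viewed as nested dictionaries.
# keyspace -> column family -> row key -> column name -> (value, timestamp)
from typing import Dict, Tuple

Column = Tuple[str, int]              # (value, timestamp)
Row = Dict[str, Column]               # column name -> column
ColumnFamily = Dict[str, Row]         # row key -> row
Keyspace = Dict[str, ColumnFamily]    # column family name -> column family

cluster: Dict[str, Keyspace] = {
    "UserKeyspace": {                 # keyspace ~ database / tablespace
        "Users": {                    # column family ~ table
            "user42": {               # row key
                "name": ("Alice", 1),                 # column -> (value, timestamp)
                "email": ("alice@example.com", 1),
            }
        }
    }
}
```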
System Architecture Partitioning Replication Membership Bootstrapping
Partitioning Algorithm (Figure: consistent hashing ring with nodes N1, N2, N3; Hash(key1) falls between N1 and N2, so N2 is deemed the coordinator of key1) • Distributed Hash Table • Data and servers are located in the same address space • Consistent Hashing • Key Space Partition: arrangement of the keys • Overlay Networking: routing mechanism
Partitioning Algorithm (cont'd) • Challenges • Non-uniform data and load distribution • Oblivious to the heterogeneity in the performance of nodes • Solutions • Nodes get assigned multiple positions on the ring (like Dynamo) – see the sketch below • Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)
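A hedged sketch of the consistent-hashing idea above, including the Dynamo-style fix of assigning each node several positions (virtual nodes) to even out load; the hash function, ring class, and node names are illustrative, not Cassandra's implementation.

```python
# Illustrative consistent hashing with virtual nodes (not Cassandra's code).
import bisect
import hashlib

def token(s: str) -> int:
    """Map a string onto the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node gets several positions on the ring to smooth load.
        self._ring = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in self._ring]

    def coordinator(self, key: str) -> str:
        """The first node clockwise from the key's position is its coordinator."""
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))   # e.g. N2 is deemed the coordinator of key1
```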
Replication (Figure: the coordinator of data1 replicates it to successor nodes on a ring of nodes A–J) • Replication strategies: Rack Unaware, Rack Aware, Datacenter Aware
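A rough sketch of Rack Unaware placement under the assumptions above: replicas are the coordinator plus the next distinct nodes walking the ring clockwise. This is an illustration, not Cassandra's replication code; tokens and node names are made up.

```python
# Illustrative "Rack Unaware" replica placement on a token ring.
import bisect

def replicas_for(key_token: int, ring: list[tuple[int, str]], n: int) -> list[str]:
    """ring: sorted list of (token, node); returns up to n distinct replica nodes."""
    positions = [t for t, _ in ring]
    start = bisect.bisect_right(positions, key_token) % len(ring)
    chosen: list[str] = []
    for i in range(start, start + len(ring)):
        node = ring[i % len(ring)][1]
        if node not in chosen:        # skip duplicate positions of the same node
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = [(10, "A"), (25, "B"), (40, "C"), (60, "D"), (80, "E")]
print(replicas_for(30, ring, 3))      # ['C', 'D', 'E']
```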
Cluster Membership Gossip Protocol is used for cluster membership • Super lightweight with mathematically provable properties • State disseminated in O(log N) rounds • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to • A member merges the received list with its own list
Gossip Protocol (Figure: three servers gossiping over times t1–t6; each server keeps the latest heartbeat timestamp it has seen per member, e.g. server1: t6, server2: t2, server3: t5, and the lists converge as members exchange and merge them)
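The merging step can be sketched as follows; this is an illustrative model of gossip state exchange, not Cassandra's implementation, and the server views are made up to mirror the figure.

```python
# Illustrative gossip merge: every T seconds a member bumps its own heartbeat,
# picks a random peer, and both keep the newest heartbeat seen per member.
def merge(mine: dict, theirs: dict) -> dict:
    """Keep the highest heartbeat counter seen for each member."""
    merged = dict(mine)
    for member, heartbeat in theirs.items():
        if heartbeat > merged.get(member, -1):
            merged[member] = heartbeat
    return merged

# server1 and server3 exchange their views (timestamps are illustrative).
server1_view = {"server1": 6, "server2": 2}
server3_view = {"server2": 2, "server3": 5}
print(merge(server1_view, server3_view))   # {'server1': 6, 'server2': 2, 'server3': 5}
```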
Accrual Failure Detector PHI(t) = -log10(P_later(t - t_last)), where P_later is the probability that a heartbeat arrives more than t - t_last after the previous one, estimated from observed inter-arrival times • Valuable for system management, replication, load balancing • Designed to adapt to changing network conditions • The output value, PHI, represents a suspicion level • Applications set an appropriate threshold, trigger suspicions and perform appropriate actions • In Cassandra the average time taken to detect a failure is 10–15 seconds with the PHI threshold set at 5
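A simplified sketch of a PHI accrual failure detector, assuming an exponential model of heartbeat inter-arrival times (the real detector estimates the observed distribution); the class and method names are illustrative.

```python
# Illustrative PHI accrual failure detector (simplified exponential model).
import math
import time

class AccrualFailureDetector:
    def __init__(self, threshold: float = 5.0):
        self.threshold = threshold
        self.intervals: list[float] = []        # recent heartbeat inter-arrival times
        self.last_heartbeat: float | None = None

    def heartbeat(self, now: float | None = None) -> None:
        now = time.time() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-1000:]   # sliding window of samples
        self.last_heartbeat = now

    def phi(self, now: float | None = None) -> float:
        now = time.time() if now is None else now
        if not self.intervals or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P_later(elapsed) ~ exp(-elapsed / mean), so phi = -log10(P_later)
        return elapsed / (mean * math.log(10))

    def suspect(self, now: float | None = None) -> bool:
        """Applications compare PHI against a threshold (e.g. 5 in Cassandra)."""
        return self.phi(now) > self.threshold
```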
Bootstrapping (Figure: a new node joins the ring between existing nodes) A new node gets assigned a token such that it can alleviate a heavily loaded node
WRITE • Interface • Simple: put(key, col, value) • Complex: put(key, [col:val, …, col:val]) • Batch • WRITE Operation • Commit log for durability • Configurable fsync • Sequential writes only • MemTable • No disk access (no reads or seeks) • SSTables are final • Read-only • Indexes • Always writable
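A minimal sketch of the write path described above: append to a commit log for durability, update the in-memory memtable with no disk reads, and flush to an immutable SSTable once the memtable is full. The class and file layout are illustrative, not Cassandra's code.

```python
# Illustrative write path: sequential commit-log append, memtable update,
# and flush to an immutable (read-only) SSTable when the memtable is full.
import json

class WritePath:
    def __init__(self, commit_log_path: str, memtable_limit: int = 4):
        self.commit_log_path = commit_log_path
        self.memtable: dict[str, dict[str, str]] = {}
        self.memtable_limit = memtable_limit
        self.sstables: list[dict[str, dict[str, str]]] = []   # final, read-only

    def put(self, key: str, col: str, value: str) -> None:
        # 1. Durability: sequential append to the commit log (fsync is configurable).
        with open(self.commit_log_path, "a") as log:
            log.write(json.dumps({"key": key, "col": col, "val": value}) + "\n")
        # 2. No disk reads or seeks: just update the in-memory memtable.
        self.memtable.setdefault(key, {})[col] = value
        # 3. When the memtable is large enough, flush it as an immutable SSTable.
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(self.memtable)
            self.memtable = {}
```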
READ • Interface • get(key, column) • get_slice(key, SlicePredicate) • get_range_slices(KeyRange, SlicePredicate) • READ • Practically lock-free • SSTable proliferation • Row cache • Key cache
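A minimal sketch of the read path implied above: consult the row cache, then the memtable, then SSTables from newest to oldest until the column is found; the data structures are illustrative, not Cassandra's internals.

```python
# Illustrative read path: row cache first, then memtable, then SSTables
# from newest to oldest until the requested column is found.
def get(key: str, column: str,
        row_cache: dict, memtable: dict, sstables: list) -> str | None:
    if key in row_cache and column in row_cache[key]:
        return row_cache[key][column]
    if key in memtable and column in memtable[key]:
        return memtable[key][column]
    for sstable in reversed(sstables):        # newest SSTable wins
        if key in sstable and column in sstable[key]:
            return sstable[key][column]
    return None                               # column not found
```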
Consistency Level • Tuning the consistency level for each WRITE/READ operation (Table: available consistency levels for write and read operations)
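A hedged sketch of the consistency-level arithmetic: with replication factor N, picking a read level R and write level W such that R + W > N makes the read and write replica sets overlap, trading latency for consistency. The level names mirror common Cassandra settings, but the mapping below is illustrative.

```python
# Illustrative consistency-level arithmetic: with replication factor N,
# reading R replicas and writing W replicas overlap when R + W > N.
def replicas_required(level: str, n: int) -> int:
    """Map a consistency level to the number of replicas that must respond."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def is_strongly_consistent(read_level: str, write_level: str, n: int) -> bool:
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return r + w > n

print(is_strongly_consistent("QUORUM", "QUORUM", 3))   # True: 2 + 2 > 3
print(is_strongly_consistent("ONE", "ONE", 3))         # False: 1 + 1 <= 3
```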
Performance Benchmark • Random and Sequential Writes • Limited by bandwidth • Facebook Inbox Search • Two kinds of Search • Term Search • Interactions • 50+TB on 150 node cluster
vs MySQL with 50GB Data • MySQL • ~300ms write • ~350ms read • Cassandra • ~0.12ms write • ~15ms read
Case Study • Cassandra as primary data store • Datacenter and rack-aware replication • ~1,000,000 ops/s • High sharding and low replication • Inbox Search • 100TB • 5,000,000,000 writes per day
Conclusions • Cassandra • Scalability • High Performance • Wide Applicability • Future work • Compression • Atomicity • Secondary Index