
Cassandra – A Decentralized Structured Storage System


Presentation Transcript


  1. Cassandra – A Decentralized Structured Storage System A. Lakshman, P. Malik (Facebook), SIGOPS ’10. 2011. 03. 18. Summarized and presented by Sang-il Song, IDS Lab., Seoul National University

  2. The Rise of NoSQL Refer to http://www.google.com/trends?q=nosql Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases. The name attempted to label the emerging wave of distributed data stores that often did not provide ACID guarantees.

  3. NoSQL Database • Based on Key-Value • memcached, Dynamo, Voldemort, Tokyo Cabinet • Based on Column • Google BigTable, Cloudata, HBase, Hypertable, Cassandra • Based on Document • MongoDB, CouchDB • Based on Graph • Neo4j, FlockDB, InfiniteGraph

  4. NoSQL / Big Data Database • Based on Key-Value • memcached, Dynamo, Voldemort, Tokyo Cabinet • Based on Column • Google BigTable, Cloudata, HBase, Hypertable, Cassandra • Based on Document • MongoDB, CouchDB • Based on Graph • Neo4j, FlockDB, InfiniteGraph

  5. Refer to http://blog.nahurst.com/visual-guide-to-nosql-systems

  6. Contents • Introduction • Remind: Dynamo • Cassandra • Data Model • System Architecture • Partitioning • Replication • Membership • Bootstrapping • Operations • WRITE • READ • Consistency Level • Performance Benchmark • Case Study • Conclusion

  7. Remind: Dynamo • Distributed Hash Table • BASE • Basically Available • Soft-state • Eventually Consistent • Client-tunable consistency/availability

  8. Cassandra • Dynamo-Bigtable lovechild • Column-based data model • Distributed Hash Table • Tunable tradeoff • Consistency vs. latency • Properties • No single point of failure • Linearly scalable • Flexible partitioning, replica placement • High availability (eventual consistency)

  9. Data Model • Cluster • Keyspace corresponds to a database or table space • Column Family corresponds to a table • Column is the unit of data stored in Cassandra
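A minimal sketch of this data model as nested maps, assuming hypothetical names ("UserData", "Users", "user:1001"); it only illustrates how keyspace, column family, row, and column nest, not Cassandra's actual storage layout.

  # Data model sketch: keyspace -> column family -> row key -> column -> value
  keyspace = {
      "UserData": {                       # keyspace ~ database / table space
          "Users": {                      # column family ~ table
              "user:1001": {              # row key
                  "name": "alice",        # column: the unit of stored data
                  "email": "alice@example.com",
              },
          },
      },
  }

  # Reading one column then resembles get(key, column_family, column):
  print(keyspace["UserData"]["Users"]["user:1001"]["email"])  # alice@example.com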

  10. System Architecture • Partitioning • Replication • Membership • Bootstrapping

  11. Partitioning Algorithm (Figure: ring of nodes N1, N2, N3; hash(key1) falls between N1 and N2, so N2 is deemed the coordinator of key1) • Distributed Hash Table • Data and servers are located in the same address space • Consistent Hashing • Key space partition: arrangement of keys on the ring • Overlay networking: routing mechanism
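A rough consistent-hashing sketch in Python (not Cassandra's actual partitioner; the MD5-based token function and node names are assumptions for illustration): the node owning the first token at or after hash(key), wrapping around the ring, becomes the coordinator.

  import bisect
  import hashlib

  def token(value: str) -> int:
      # Map keys and nodes onto the same ring address space
      return int(hashlib.md5(value.encode()).hexdigest(), 16)

  class Ring:
      def __init__(self, nodes):
          self._ring = sorted((token(n), n) for n in nodes)

      def coordinator(self, key: str) -> str:
          idx = bisect.bisect_left(self._ring, (token(key),))
          return self._ring[idx % len(self._ring)][1]   # wrap past the highest token

  ring = Ring(["N1", "N2", "N3"])
  print(ring.coordinator("key1"))   # whichever node's token follows hash("key1")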

  12. Partitioning Algorithm (cont’d) (Figure: nodes assigned multiple positions on the ring) • Challenges • Non-uniform data and load distribution • Oblivious to the heterogeneity in the performance of nodes • Solutions • Assign nodes to multiple positions in the circle (like Dynamo) • Analyze load information on the ring and have lightly loaded nodes move on the ring to relieve heavily loaded nodes (like Cassandra)
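A small sketch of the first solution (virtual nodes, as in Dynamo), reusing the illustrative token function from above: each physical node is hashed to several ring positions, so load spreads more evenly and a joining or leaving node moves many small ranges instead of one large one.

  import hashlib

  def token(value: str) -> int:
      return int(hashlib.md5(value.encode()).hexdigest(), 16)

  def virtual_tokens(node: str, vnodes: int = 8):
      # One physical node owns several small arcs of the ring
      return [token(f"{node}#{i}") for i in range(vnodes)]

  positions = {t: node for node in ("N1", "N2", "N3") for t in virtual_tokens(node)}
  print(len(positions), "ring positions for 3 physical nodes")   # 24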

  13. Replication (Figure: ring of nodes A–J; the coordinator of data1 replicates it to its successor nodes) • Replication strategies • Rack Unaware • Rack Aware • Datacenter Aware
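A sketch of Rack Unaware placement, assuming the nodes are already ordered by ring token: the coordinator keeps the key and the next N-1 distinct nodes clockwise hold the replicas (Rack Aware and Datacenter Aware additionally filter candidates by rack and datacenter).

  def replicas(ring_order, coordinator_index, replication_factor):
      # ring_order: node names sorted by their token position on the ring
      n = len(ring_order)
      return [ring_order[(coordinator_index + i) % n]
              for i in range(min(replication_factor, n))]

  ring_order = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
  print(replicas(ring_order, coordinator_index=0, replication_factor=3))
  # ['A', 'B', 'C'] -- A coordinates data1, B and C hold the replicas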

  14. Cluster Membership • Gossip protocol is used for cluster membership • Super lightweight with mathematically provable properties • State disseminated in O(log N) rounds • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to • The receiving member merges the list with its own list

  15. Gossip Protocol (Figure: three servers exchange heartbeat lists over rounds t1–t6; each receiver merges the incoming list, keeping the most recent timestamp per server, until all three views converge on server1: t6, server2: t6, server3: t5)
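A toy version of one gossip round, with made-up member names and heartbeat counters standing in for the timestamps in the figure: the sender bumps its own heartbeat, sends its whole list, and the receiver keeps the highest value it has seen per member.

  def gossip_round(sender: dict, receiver: dict, sender_name: str) -> None:
      sender[sender_name] = sender.get(sender_name, 0) + 1   # increment own heartbeat
      for member, heartbeat in sender.items():                # send the full list
          if heartbeat > receiver.get(member, -1):
              receiver[member] = heartbeat                    # merge newer entries only

  s1 = {"server1": 4, "server2": 2}
  s2 = {"server2": 2, "server3": 5}
  gossip_round(s1, s2, "server1")
  print(s2)   # {'server2': 2, 'server3': 5, 'server1': 5}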

  16. Accrual Failure Detector • Φ(t) = -log10(P_later(t)), where P_later(t) is the probability that a heartbeat arrives more than t after the previous one • Valuable for system management, replication, and load balancing • Designed to adapt to changing network conditions • The output value, PHI (Φ), represents a suspicion level • Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions • In Cassandra the average time taken to detect a failure is 10–15 seconds with the PHI threshold set at 5
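A toy phi calculation under a simple exponential inter-arrival assumption (the real detector estimates the distribution from the observed heartbeat history): with roughly one heartbeat per second, phi climbs past the threshold of 5 after about 11.5 seconds of silence, consistent with the 10–15 second detection time quoted above.

  import math

  def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
      # P_later(t): probability that the next heartbeat arrives later than t
      p_later = math.exp(-time_since_last_heartbeat / mean_interval)
      return -math.log10(p_later)

  for t in (1.0, 5.0, 11.5):
      print(t, round(phi(t, mean_interval=1.0), 2))   # 0.43, 2.17, 4.99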

  17. Bootstrapping (Figure: a new node joins the ring between existing nodes) • The new node gets assigned a token such that it can alleviate a heavily loaded node

  18. WRITE • Interface • Simple: put(key, col, value) • Complex: put(key, [col:val, …, col:val]) • Batch • WRITE Operation • Commit log for durability • Configurable fsync • Sequential writes only • MemTable • No disk access (no reads or seeks) • SSTables are final • Read-only • Indexes • Always writable
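A compressed sketch of this write path, with illustrative class and threshold names: append to the commit log for durability, update the in-memory MemTable, and flush it to an immutable sorted SSTable once it grows large.

  class WritePath:
      def __init__(self, flush_threshold=3):
          self.commit_log = []        # append-only, sequential writes
          self.memtable = {}          # in-memory, no disk reads on write
          self.sstables = []          # immutable, read-only once written
          self.flush_threshold = flush_threshold

      def put(self, key, col, value):
          self.commit_log.append((key, col, value))          # durability first
          self.memtable.setdefault(key, {})[col] = value     # then memory only
          if len(self.memtable) >= self.flush_threshold:
              self.sstables.append(dict(sorted(self.memtable.items())))
              self.memtable = {}                              # start a fresh MemTable

  store = WritePath()
  store.put("user:1", "name", "alice")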

  19. READ • Interface • get(key, column) • get_slice(key, SlicePredicate) • get_range_slices(KeyRange, SlicePredicate) • READ • Practically lock-free • SSTable proliferation • Row cache • Key cache
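A matching read-path sketch (row/key caches and bloom filters omitted; names are illustrative): check the MemTable first, then SSTables from newest to oldest, since newer files supersede older ones.

  def get(key, column, memtable, sstables):
      if column in memtable.get(key, {}):
          return memtable[key][column]
      for sstable in reversed(sstables):            # newest SSTable first
          if column in sstable.get(key, {}):
              return sstable[key][column]
      return None

  memtable = {"user:1": {"name": "alice"}}
  sstables = [{"user:1": {"email": "a@example.com"}}]
  print(get("user:1", "email", memtable, sstables))   # a@example.com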

  20. Consistency Level • Tuning the consistency level for each WRITE/READ operation (Table: available consistency levels for write operations and for read operations)
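The usual arithmetic behind tunable consistency, shown as a small check (a general quorum rule, not a reconstruction of the slide's table): with replication factor N, reads of R replicas and writes to W replicas overlap, and therefore see the latest value, whenever R + W > N.

  def overlaps(n: int, r: int, w: int) -> bool:
      # Read and write replica sets must intersect for reads to see the latest write
      return r + w > n

  print(overlaps(n=3, r=2, w=2))   # True: QUORUM reads + QUORUM writes
  print(overlaps(n=3, r=1, w=1))   # False: ONE + ONE may return stale data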

  21. Performance Benchmark • Random and Sequential Writes • Limited by bandwidth • Facebook Inbox Search • Two kinds of Search • Term Search • Interactions • 50+TB on 150 node cluster

  22. vs MySQL with 50GB Data • MySQL • ~300ms write • ~350ms read • Cassandra • ~0.12ms write • ~15ms read

  23. Case Study • Cassandra as primary data store • Datacenter- and rack-aware replication • ~1,000,000 ops/s • High sharding and low replication • Inbox Search • 100TB • 5,000,000,000 writes per day

  24. Conclusions • Cassandra • Scalability • High performance • Wide applicability • Future work • Compression • Atomicity • Secondary indexes
