Mark Feltner (Distributed) (Structured) Storage Systems
Big Data • 2.5 Petabytes/day: Wal-Mart's transaction database • 40 Terabytes/second: CERN • 1 Terabyte/day: NYSE Trading data • 10 billion: Facebook photos
Overview • Theory • Algorithms • Implementations & Technology
Atomicity • All-or-nothing
Consistency • Data is always in a valid state
Isolation • Concurrently executed transactions result in the same state as if they had been executed serially
Durability • COMMIT means transaction is permanent across all clients
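The all-or-nothing behaviour described on the ACID slides can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not part of the original talk: the `accounts` table, the `make_db`, `transfer`, and `balances` helpers, and the sample balances are all assumptions made for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    db = make_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

In the second transfer the credit to `bob` is applied first and then undone when the debit violates the constraint, which is the atomicity property; the constraint itself stands in for the "valid state" rule on the Consistency slide.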
Fallacies of Distributed Computing • The network is reliable. • Latency is zero. • Bandwidth is infinite. • The network is secure. • Topology doesn't change. • There is one administrator. • Transport cost is zero. • The network is homogeneous.
Consistency • Atomic consistency (contrast with eventual consistency) “…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)
Availability “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)
Partition Tolerance “In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)
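The trade-off between the three properties quoted above can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not an algorithm from Gilbert and Lynch: the `Replica` and `Cluster` classes, the `"CP"`/`"AP"` mode names, and the synchronous two-node replication are all hypothetical choices made for illustration.

```python
class Replica:
    """One node holding a local copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas that copy every write to each other when connected."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True means messages between a and b are lost

    def write(self, replica, key, value):
        peer = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability: refuse rather than let copies diverge.
                raise RuntimeError("unavailable: peer unreachable")
            # Give up consistency: accept locally, the peer keeps a stale value.
            replica.data[key] = value
            return
        replica.data[key] = value
        peer.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)


if __name__ == "__main__":
    ap = Cluster("AP")
    ap.write(ap.a, "x", 1)
    ap.partitioned = True
    ap.write(ap.a, "x", 2)
    print(ap.read(ap.a, "x"), ap.read(ap.b, "x"))  # the two replicas disagree
```

During a partition the `"CP"` cluster fails the request (no response with a result), while the `"AP"` cluster answers every request but lets the replicas return different values for the same key.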
Row-oriented Data Storage Model: • Breaking the Law, Judas Priest, British Steel, 1980 • Aces High, Iron Maiden, Powerslave, 1984 • Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989 • Raining Blood, Slayer, Reign in Blood, 1986 • I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984
Column-oriented Data Storage Model: • Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody • Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P. • British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P. • 1980, 1984, 1989, 1986, 1984
Comparison of Row- vs. Column-Orientation • CREATE • SELECT • MAX, MIN, SUM, AVG, …
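The comparison can be made concrete with the song data from the two storage-model slides. This is only an in-memory sketch: the field names in `FIELDS` and the helper functions are assumptions, and a real storage engine lays the data out on disk rather than in Python lists.

```python
FIELDS = ("title", "artist", "album", "year")  # assumed column names

# Row-oriented: the values of one record are stored next to each other.
rows = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]

# Column-oriented: the values of one field are stored next to each other.
columns = {name: [row[i] for row in rows] for i, name in enumerate(FIELDS)}


def select_from_rows(rows, index):
    """Fetch one whole record: a single lookup in the row layout."""
    return rows[index]


def select_from_columns(columns, index):
    """Fetch one whole record: one lookup per column in the column layout."""
    return tuple(columns[name][index] for name in FIELDS)


def max_year_from_rows(rows):
    """Aggregate over one field: every record must be visited."""
    return max(row[3] for row in rows)


def max_year_from_columns(columns):
    """Aggregate over one field: only the 'year' column is touched."""
    return max(columns["year"])


if __name__ == "__main__":
    print(select_from_rows(rows, 1))
    print(select_from_columns(columns, 1))
    print(max_year_from_rows(rows), max_year_from_columns(columns))
```

Creating or selecting a full record touches one contiguous place in the row layout but every column in the column layout, while MAX, MIN, SUM, and AVG over a single field read only that field's values in the column layout.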
BigTable • High performance • MapReduce • Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …
Hadoop • MapReduce • Yahoo! • World Record Holder!
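Since MapReduce appears on several of these slides, a single-process sketch of the programming model may help. This is not Hadoop's API (Hadoop's is Java-based and distributes the work across machines); the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for each whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Count word occurrences using the three phases above."""
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle(pairs), sum_reducer)


if __name__ == "__main__":
    print(word_count(["Raining Blood", "Reign in Blood"]))
```

Because each mapper call sees one record and each reducer call sees one key, both phases can in principle be spread over many machines, which is the property the systems on these slides rely on.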
Cassandra • Key-value • MapReduce • Facebook • Eventual consistency • Scalable, fault-tolerant
MySQL • Relational • LAMP
Redis • Key-value • What it lacks in durability, it makes up for in speed / simplicity.
HBase • MapReduce • Hadoop + HDFS • Java and REST API • Column-oriented • Excellent fault-tolerance • Replication • Streaming
Neo4j • Graph Database
MongoDB • Document-oriented
Conclusions • Pick the right tool for the job.