(Distributed) (Structured) Storage Systems

Mark Feltner. (Distributed) (Structured) Storage Systems. Big Data. 2.5 Petabytes/day: Wal-Mart's transaction database 40 Terabytes/second: CERN 1 Terabyte/day: NYSE Trading data 10 billion: Facebook photos. Overview. Theory Algorithms Implementations & Technology.

Presentation Transcript


  1. Mark Feltner (Distributed) (Structured) Storage Systems

  2. Big Data • 2.5 Petabytes/day: Wal-Mart's transaction database • 40 Terabytes/second: CERN • 1 Terabyte/day: NYSE Trading data • 10 billion: Facebook photos

  3. Overview • Theory • Algorithms • Implementations & Technology

  4. Relational databases

  5. ACID

  6. Atomicity • All-or-nothing

  7. Consistency • Data is always in a valid state

  8. Isolation • Concurrently executed transactions result in the same state as if they had been executed serially

  9. Durability • Once a transaction COMMITs, its changes are permanent (they survive crashes and power loss) and are visible to all clients
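The ACID properties above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not part of the original slides: the `accounts` table, the `transfer` helper, and the account names are all hypothetical. It mainly demonstrates atomicity (a failed transfer leaves no partial update) and consistency (balances never go negative).

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one all-or-nothing transaction."""
    try:
        # `with conn:` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Violates our validity rule, so abort the whole transaction.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current state as {name: balance}."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # rolled back, no partial debit
```

An in-memory database cannot show durability, since nothing is written to disk; a file-backed database would be needed for that.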

  10. Non-relational databases

  11. Key-value

  12. Document-oriented

  13. Graphs
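As a rough sketch of how the three non-relational models differ, the same album data can be shaped three ways using plain Python structures. The key names, field names, and edge labels below are assumptions made for illustration and do not correspond to any particular database's API.

```python
# Key-value: opaque values addressed by a single key; no queries on the contents.
key_value = {
    "song:1": "Breaking the Law",
    "artist:1": "Judas Priest",
}

# Document-oriented: one self-describing, nested record per entity.
document = {
    "_id": 1,
    "title": "British Steel",
    "artist": "Judas Priest",
    "year": 1980,
    "songs": ["Breaking the Law"],
}

# Graph: entities are nodes, relationships are labeled edges.
nodes = {
    "judas_priest": {"type": "artist"},
    "british_steel": {"type": "album", "year": 1980},
    "breaking_the_law": {"type": "song"},
}
edges = [
    ("judas_priest", "RELEASED", "british_steel"),
    ("british_steel", "CONTAINS", "breaking_the_law"),
]


def neighbors(node, label):
    """Follow all outgoing edges with the given label from `node`."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


albums = neighbors("judas_priest", "RELEASED")
```

The key-value form is cheapest to look up but cannot be searched by content, the document keeps related fields together, and the graph makes relationships the primary thing to traverse.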

  14. Distributed Systems

  15. Fallacies of Distributed Computing • The network is reliable. • Latency is zero. • Bandwidth is infinite. • The network is secure. • Topology doesn't change. • There is one administrator. • Transport cost is zero. • The network is homogeneous.

  16. CAP Theorem

  17. Consistency • Eventual consistency • Atomic (linearizable) consistency, the "C" in CAP: “…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)

  18. Availability “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)

  19. Partition Tolerance “In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)

  20. (CA || CP || AP) ?

  21. Algorithms

  22. Row- versus column- orientation

  23. Row-oriented Data Storage Model: • Breaking the Law, Judas Priest, British Steel, 1980 • Aces High, Iron Maiden, Powerslave, 1984 • Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989 • Raining Blood, Slayer, Reign in Blood, 1986 • I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984

  24. Column-oriented Data Storage Model: • Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody • Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P. • British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P. • 1980, 1984, 1989, 1986, 1984

  25. Comparison of Row- vs. Column-Orientation • CREATE • SELECT • MAX, MIN, SUM, AVG, …
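The comparison can be made concrete with a small sketch using the song data from the two storage-model slides. The function and field names are hypothetical, and real engines work on disk pages rather than Python lists; the point is only which values each layout keeps next to each other.

```python
FIELDS = ("song", "artist", "album", "year")

# Row-oriented: each record's fields are stored together.
rows = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]


def to_columns(rows, fields=FIELDS):
    """Column-oriented: each field's values are stored together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


columns = to_columns(rows)


def insert_row(rows, record):
    """CREATE in a row store: one append, the record lands in one place."""
    rows.append(record)


def insert_columns(columns, record, fields=FIELDS):
    """CREATE in a column store: one append per column, touching every column."""
    for name, value in zip(fields, record):
        columns[name].append(value)


def max_year_rows(rows):
    """Aggregate in a row store: reads every whole record to reach one field."""
    return max(row[3] for row in rows)


def max_year_columns(columns):
    """Aggregate in a column store: reads only the one column it needs."""
    return max(columns["year"])
```

This mirrors the usual trade-off: row stores favor writes and whole-record SELECTs, while column stores favor MAX, MIN, SUM, and AVG over a single field.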

  26. MapReduce
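A single-process word count is a common way to illustrate the MapReduce idea. The sketch below is an assumption-laden toy: the function names are hypothetical, and everything runs in one process, whereas a real framework distributes the map and reduce phases across machines and handles failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle, and reduce over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data", "big table"])
```

Because each map call sees only one document and each reduce call sees only one key, both phases can in principle be run in parallel on separate nodes.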

  27. Technology

  28. Implementations

  29. BigTable • High performance • MapReduce • Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …

  30. Hadoop • MapReduce • Yahoo! • World Record Holder!

  31. Cassandra • Key-value • MapReduce • Facebook • Eventual consistency • Scalable, fault-tolerant

  32. MySQL • Relational • LAMP

  33. Redis • Key-value • What it lacks in durability, it makes up for in speed and simplicity.

  34. HBase • MapReduce • Hadoop + HDFS • Java and REST API • Column-oriented • Excellent fault-tolerance • Replication • Streaming

  35. Neo4j • Graph Database

  36. MongoDB • Document-oriented

  37. Conclusions • Pick the right tool for the job.
