(Distributed) (Structured) Storage Systems

Presentation Transcript

  1. Mark Feltner (Distributed) (Structured) Storage Systems

  2. Big Data • 2.5 Petabytes/day: Wal-Mart's transaction database • 40 Terabytes/second: CERN • 1 Terabyte/day: NYSE Trading data • 10 billion: Facebook photos

  3. Overview • Theory • Algorithms • Implementations & Technology

  4. Relational databases

  5. ACID

  6. Atomicity • All-or-nothing

  7. Consistency • Data is always in a valid state

  8. Isolation • Concurrently executed transactions result in the same state as if they had been executed serially

  9. Durability • Once COMMIT succeeds, the transaction's effects are permanent, even after a crash or power loss
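
The ACID properties on the slides above can be seen in a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are assumptions made up for illustration; they are not from the presentation. The credit is applied before the debit on purpose, so that a failed debit shows the earlier credit being rolled back too (atomicity), while the `CHECK` constraint keeps balances valid (consistency).

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # made earlier in the same transaction has been rolled back.
        return False
    return True


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final = balances(conn)
print(ok, failed, final)
```

An in-memory database cannot demonstrate durability; that would need a file-backed database and a restart.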

  10. Non-relational databases

  11. Key-value

  12. Document-oriented

  13. Graphs
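
As a rough illustration of the three non-relational models named above, the sketch below stores one record (borrowed from the row-oriented slide later in the deck) in each shape using plain Python structures. The key format, field names, edge labels, and helper functions are assumptions for illustration only, not the API of any particular database.

```python
# Key-value: the store only understands "key -> opaque value".
kv_store = {"song:1": "Breaking the Law|Judas Priest|British Steel|1980"}

# Document-oriented: a self-describing, nested record whose fields
# the store can index and query.
doc_store = {
    "song:1": {
        "title": "Breaking the Law",
        "artist": {"name": "Judas Priest"},
        "album": {"name": "British Steel", "year": 1980},
    }
}

# Graph: nodes connected by labelled edges; queries walk relationships.
edges = [
    ("Judas Priest", "RECORDED", "British Steel"),
    ("British Steel", "CONTAINS", "Breaking the Law"),
]


def neighbors(node, label):
    """Follow outgoing edges with the given label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]


def songs_by(artist):
    """Two-hop traversal: artist -> albums -> songs."""
    return [
        song
        for album in neighbors(artist, "RECORDED")
        for song in neighbors(album, "CONTAINS")
    ]


print(kv_store["song:1"].split("|")[1])
print(doc_store["song:1"]["album"]["year"])
print(songs_by("Judas Priest"))
```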

  14. Distributed Systems

  15. Fallacies of Distributed Computing • The network is reliable. • Latency is zero. • Bandwidth is infinite. • The network is secure. • Topology doesn't change. • There is one administrator. • Transport cost is zero. • The network is homogeneous.

  16. CAP Theorem

  17. Consistency • Atomic (strong) consistency, the guarantee that eventual consistency relaxes: “…there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.” (Gilbert, Lynch)

  18. Availability “For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response” (Gilbert, Lynch)

  19. Partition Tolerance “In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost” (Gilbert, Lynch)

  20. (CA || CP || AP) ?
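
The trade-off behind the CP and AP choices can be sketched with a toy two-replica register. This is a deliberately simplified model, not how any real system is implemented: the `Cluster`, `Replica`, and `Unavailable` names are hypothetical, replication is synchronous, and a partition is just a boolean flag. In CP mode the cluster refuses writes while partitioned, giving up availability; in AP mode it accepts them locally and lets the replicas diverge, giving up consistency.

```python
class Unavailable(Exception):
    """Raised when a CP cluster refuses a request during a partition."""


class Replica:
    def __init__(self):
        self.value = None


class Cluster:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_id, value):
        if not self.partitioned:
            # Healthy network: replicate to every node before returning.
            for replica in self.replicas:
                replica.value = value
        elif self.mode == "CP":
            # Cannot reach the peer, so refuse rather than diverge.
            raise Unavailable("partitioned: write refused")
        else:
            # AP: answer the request, accepting that replicas now differ.
            self.replicas[replica_id].value = value

    def read(self, replica_id):
        return self.replicas[replica_id].value


def demo(mode):
    cluster = Cluster(mode)
    cluster.write(0, "v1")
    cluster.partitioned = True
    try:
        cluster.write(0, "v2")
        accepted = True
    except Unavailable:
        accepted = False
    return accepted, cluster.read(0), cluster.read(1)


print("CP:", demo("CP"))
print("AP:", demo("AP"))
```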

  21. Algorithms

  22. Row- versus column- orientation

  23. Row-oriented Data Storage Model: • Breaking the Law, Judas Priest, British Steel, 1980 • Aces High, Iron Maiden, Powerslave, 1984 • Kickstart My Heart, Motley Crue, Dr. Feelgood, 1989 • Raining Blood, Slayer, Reign in Blood, 1986 • I Wanna Be Somebody, W.A.S.P., W.A.S.P., 1984

  24. Column-oriented Data Storage Model: • Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody • Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P. • British Steel, Powerslave, Dr. Feelgood, Reign in Blood, W.A.S.P. • 1980, 1984, 1989, 1986, 1984
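
The two layouts hold the same data, and one is the transpose of the other. A minimal sketch in Python, using the records from the slides; the field names in `FIELDS` are assumed, since the slides do not label the columns.

```python
# Field names are an assumption; the slides show only the values.
FIELDS = ("song", "artist", "album", "year")

# Row-oriented: each record's fields are stored together.
rows = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]


def to_columns(rows):
    """Column-oriented: all values of one field are stored together."""
    return {name: list(col) for name, col in zip(FIELDS, zip(*rows))}


def to_rows(columns):
    """Rebuild records by reading the same position from every column."""
    return list(zip(*(columns[name] for name in FIELDS)))


columns = to_columns(rows)
print(columns["year"])
print(to_rows(columns) == rows)
```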

  25. Comparison of Row- vs. Column-Orientation • CREATE • SELECT • MAX, MIN, SUM, AVG, …
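
One way to make the comparison concrete is a crude cost model that counts how many stored values each operation touches. The `RowStore` and `ColumnStore` classes below are hypothetical toys, the field names are assumed, and the counts ignore real-world factors such as disk pages, compression, and caching. They only suggest why row stores tend to suit CREATE and whole-record SELECT, while column stores tend to suit aggregates over one field.

```python
FIELDS = ("song", "artist", "album", "year")  # assumed field names

SONGS = [
    ("Breaking the Law", "Judas Priest", "British Steel", 1980),
    ("Aces High", "Iron Maiden", "Powerslave", 1984),
    ("Kickstart My Heart", "Motley Crue", "Dr. Feelgood", 1989),
    ("Raining Blood", "Slayer", "Reign in Blood", 1986),
    ("I Wanna Be Somebody", "W.A.S.P.", "W.A.S.P.", 1984),
]


class RowStore:
    def __init__(self):
        self.records = []

    def create(self, record):
        self.records.append(tuple(record))
        return 1  # one contiguous write

    def select(self, index):
        return self.records[index], 1  # one contiguous read

    def max_year(self):
        # Every full record is scanned to reach its year field.
        touched = len(self.records) * len(FIELDS)
        return max(r[FIELDS.index("year")] for r in self.records), touched


class ColumnStore:
    def __init__(self):
        self.columns = {name: [] for name in FIELDS}

    def create(self, record):
        for name, value in zip(FIELDS, record):
            self.columns[name].append(value)
        return len(FIELDS)  # one write per column

    def select(self, index):
        record = tuple(self.columns[name][index] for name in FIELDS)
        return record, len(FIELDS)  # one read per column

    def max_year(self):
        years = self.columns["year"]
        return max(years), len(years)  # only the year column is scanned


row_store, col_store = RowStore(), ColumnStore()
for song in SONGS:
    row_store.create(song)
    col_store.create(song)

print("row    MAX(year):", row_store.max_year())
print("column MAX(year):", col_store.max_year())
```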

  26. MapReduce
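
The slide gives only the name, so here is a minimal single-process sketch of the MapReduce pattern: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step folds each group. The function names and the word-count task are illustrative assumptions; real frameworks run these phases in parallel across many machines and handle failures, which this sketch does not.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, ones):
    return sum(ones)


lines = [
    "the network is reliable",
    "the network is secure",
    "latency is zero",
]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```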

  27. Technology

  28. Implementations

  29. BigTable • High performance • MapReduce • Powers: Google Reader, Maps, Book Search, YouTube, Gmail, …

  30. Hadoop • MapReduce • Yahoo! • World Record Holder!

  31. Cassandra • Key-value • MapReduce • Facebook • Eventual consistency • Scalable, fault-tolerant

  32. MySQL • Relational • LAMP

  33. Redis • Key-value • What it lacks in durability, it makes up for in speed / simplicity.

  34. HBase • MapReduce • Hadoop + HDFS • Java and REST API • Column-oriented • Excellent fault-tolerance • Replication • Streaming

  35. Neo4J • Graph Database

  36. MongoDB • Document-oriented

  37. Conclusions • Pick the right tool for the job.