1 / 78

CC5212-1 Procesamiento Masivo de Datos Otoño 2014

CC5212-1 Procesamiento Masivo de Datos Otoño 2014. Aidan Hogan aidhog@gmail.com Lecture XI : 2014/06/11. Me vs. Google. RECAP: NoSQL. NoSQL. NoSQL vs. Relational Databases. What are the big differences between relational databases and NoSQL systems? What are the trade-offs?.

sela
Download Presentation

CC5212-1 Procesamiento Masivo de Datos Otoño 2014

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CC5212-1ProcesamientoMasivo de DatosOtoño2014 Aidan Hogan aidhog@gmail.com Lecture XI: 2014/06/11

  2. Me vs. Google

  3. RECAP: NoSQL

  4. NoSQL

  5. NoSQL vs. Relational Databases What are the big differences between relational databases and NoSQL systems? What are the trade-offs?

  6. The Database Landscape Batch analysis of data Not using the relational model Using the relational model Real-time Stores documents (semi-structured values) Relational Databases with focus on scalability to compete with NoSQL while maintaining ACID Not only SQL Maps  Column Oriented Cloud storage Graph-structured data In-Memory

  7. RECAP: KEY–VALUE

  8. Key–Value = a Distributed Map

  9. Amazon Dynamo(DB): Model • Named table with primary key and a value • Primary key is hashed / unordered

  10. Amazon Dynamo(DB): Object Versioning • Object Versioning (per bucket) • PUT doesn’t overwrite: pushes version • GET returns most recent version

  11. Other Key–Value Stores

  12. RECAP: DOCUMENT STORES

  13. Key–Value Stores: Values are Documents • Document-type depends on store • XML, JSON, Blobs, Natural language • Operators for documents • e.g., filtering, inv. indexing, XML/JSON querying, etc.

  14. MongoDB: JSON Based o • Can invoke Javascript over the JSON objects • Document fields can be indexed db.inventory.find({ continent: { $in: [ ‘Asia’, ‘Europe’ ]}})

  15. Document Stores

  16. TABLULAR / COLUMN FAMILY

  17. Key–Value = a Distributed Map Tabular = Multi-dimensional Maps

  18. Bigtable: The Original Whitepaper Why did they write another paper? MapReduce solves everything, right? MapReduce authors

  19. Bigtable: Data Model “a sparse, distributed, persistent,multi-dimensional, sortedmap.” • sparse: not all values form a dense square • distributed: lots of machines • persistent: disk storage (GFS) • multi-dimensional: values with columns • sorted: sorting lexicographically by row key • map: look up a key, get a value

  20. Bigtable: in a nutshell (row, column, time) → value • row: a row id string • e.g., “Afganistan” • column: a column name string • e.g., “pop-value” • time: an integer (64-bit) version time-stamp • e.g., 18545664 • value: the element of the cell • e.g., “31120978”

  21. Bigtable: in a nutshell (row, column, time) → value (Afganistan,pop-value,t4) → 31108077

  22. Bigtable: Sorted Keys S O R T E D Benefits of sorted keys vs. hashed keys?

  23. Bigtable: Tablets Take advantage of locality of processing! A S I A E U R O P E

  24. Bigtable: Distribution Pros and cons versus hash partitioning? Horizontal range partitioning Split by tablet

  25. Bigtable: Column Families • Group logically similar columns together • Accessed efficiently together • Access-control and storage: column family level • If of same type, can be compressed

  26. Bigtable: Versioning • Similar to Apache Dynamo (so no “fancy” slide) • Cell-level • 64-bit integer time stamps • Inserts push down current version • Lazy deletions / periodic garbage collection • Two options: • keep last n versions • keep versions newer than t time

  27. Bigtable: SSTable Map Implementation • 64k blocks (default) with index in footer (GFS) • Index loaded into memory, allows for seeks • Can be split or merged, as needed 0 65536 Index:

  28. Bigtable: Buffered/Batched Writes Merge-sort READ Memtable In-memory GFS Tablet SSTable1 SSTable2 SSTable3 Tablet log WRITE

  29. Bigtable: Redo Log • If machine fails, Memtable redone from log Memtable In-memory GFS Tablet SSTable1 SSTable2 SSTable3 Tablet log

  30. BigTable: Minor Compaction • When full, write Memtable as SSTable Memtable Memtable In-memory GFS Tablet SSTable1 SSTable2 SSTable3 Tablet log SSTable4

  31. BigTable: Merge Compaction • Merge some of the SSTables(and the Memtable) READ Memtable Memtable In-memory GFS Tablet SSTable1 SSTable1 SSTable2 SSTable3 Tablet log SSTable4

  32. BigTable: Major Compaction • Merge allSSTables (and the Memtable) READ Memtable Memtable In-memory GFS Tablet SSTable1 SSTable1 SSTable2 SSTable3 SSTable1 Tablet log SSTable4

  33. Bigtable: Hierarchical Structure

  34. Bigtable: Consistency • CHUBBY: Distributed consensus tool based on PAXOS • Maintains consistent replicas • Five replicas: one master and four slaves • Co-ordinates distributed locks • Stores location of main “root tablet” Do we think it’s a CP system or an AP system?

  35. Bigtable: A Bunch of Other Things • Locality groups: Group multiple column families together; assigned a separate SSTable • Select storage: SSTables can be persistent or in-memory • Compression: Applied on blocks; custom compression can be chosen • Caches: SSTable-level and block-level • Bloom filters: Find negatives cheaply …

  36. Aside: Bloom Filter Reject “empty” queries using very little memory! • Create a bit array of length m (init to 0’s) • Create k hash functions that map an object to an index of m (even distribution) • Index o: set m[hash1(o)], …, m[hashk(o)] to 1 • Queryo: • anym[hash1(o)], …, m[hashk(o)] set to 0 = not indexed • allm[hash1(o)], …, m[hashk(o)] set to 1 = might be indexed

  37. Bigtable: Apache HBase Open-source implementation of Bigtable ideas

  38. The Database Landscape Batch analysis of data Not using the relational model Using the relational model Real-time Stores documents (semi-structured values) Relational Databases with focus on scalability to compete with NoSQL while maintaining ACID Not only SQL Maps  Column Oriented Cloud storage Graph-structured data In-Memory

  39. GRAPH DATABASES

  40. Data = Graph • Any data can be represented as a directed labelled graph (not always neatly)] When is it a good idea to consider data as a graph? • When you want to answer questions like: • How many social hops is this user away? • What is my Erdős number? • What connections are needed to fly to Perth? • How are Einstein and Godel related?

  41. RelFinder

  42. Graph Databases (Fred,IS_FRIEND_OF,Jim) (Fred,IS_FRIEND_OF,Ted) (Ted,LIKES,Zushi_Zam) (Zuzhi_Zam,SERVES,Sushi) …

  43. Graph Databases: Index Nodes (Fred,IS_FRIEND_OF,Jim) (Jim,LIKES,iSushi) Fred ->

  44. Graph Databases: Index Relations (Ted,LIKES,Zushi_Zam) (Jim,LIKES,iSushi) LIKES ->

  45. Graph Databases: Graph Queries (Fred,IS_FRIEND,?friend) (?friend,LIKES,?place) (?place,SERVES,?sushi) (?place,LOCATED_IN,New_York)

More Related