1 / 110

BIG DATA

BIG DATA. Algorithms. Google trend. Big data. everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it. Is there anything fundamentally new?. Massive Data vs Big Data The 3 V’s Volume Velocity

shika
Download Presentation

BIG DATA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BIG DATA Algorithms

  2. Google trend

  3. Big data • everyone talks about it, • nobody really knows how to do it, • everyone thinks everyone else is doing it, • so everyone claims they are doing it...

  4. Is there anything fundamentally new? • Massive Data vs Big Data • The 3 V’s • Volume • Velocity • Variety

  5. Big data ecosystem

  6. Big data Applications

  7. Big Data Algorithms Distributed Algorithms Data stream Algorithms External memory Algorithms Parallel Algorithms

  8. Computational models for big data • All models are wrong, • But some are useful. • George E. P. Box

  9. What’s the Bottleneck? • CPU speed approaching limit • Does it matter? • From CPU-intensive computing to data-intensive computing • Algorithm has to be near-linear, linear, or even sub-linear! • Data movement, i.e., communication is the bottleneck!

  10. Random Access Machine Model • Standard theoretical model of computation: • Unlimited memory • Uniform access cost • Simple model crucial for success of computer industry R A M

  11. R A M L 1 L 2 Hierarchical Memory • Modern machines have complicated memory hierarchy • Levels get larger and slower further away from CPU • Data moved between levels using large blocks

  12. Slow I/O • Disk access is 106 times slower than main memory access read/write arm “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer) track 4835 1915 5748 4125 • Disk systems try to amortize large access time transferring large contiguous blocks of data (8-16Kbytes) • Important to store/access data to take advantage of blocks (locality) magnetic surface

  13. running time data size Scalability Problems • Most programs developed in RAM-model • Run on large datasets because OS moves blocks as needed • Moderns OS utilizes sophisticated paging and prefetching strategies • But if program makes scattered accesses even good OS cannot take advantage of block access  Scalability problems!

  14. External Memory Model N = # of items in the problem instance B = # of items per disk block M = # of items that fit in main memory I/O: # blocks moved between memory and disk CPU time is ignored Successful model used extensively in massive data algorithms and database communities D Block I/O M P

  15. Fundamental Bounds Internal External • Scanning: N • Sorting: N log N • Permuting • Searching: • Note: • Linear I/O: O(N/B) • Permuting not linear • Permuting and sorting bounds are equal in all practical cases • B factor VERY important:

  16. Queues and Stacks • Queue: • Maintain push and pop blocks in main memory  O(1/B) I/O per operation (amortized) • Stack: • Maintain push/pop block in main memory  O(1/B) I/O per operation (amortized) Push Pop

  17. Sorting • Merge sort: • Create N/M memory sized sorted lists • Repeatedly merge lists together Θ(M/B) at a time  phases using I/Os each  I/Os

  18. Sorting • <M/B sorted lists (queues) can be merged in O(N/B) I/Os M/B blocks in main memory • The M/B head elements kept in a heap in main memory

  19. Toy Experiment: Permuting • Problem: • Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8 • Each element knows its correct position • Output: Store them on disk in the right order • Internal memory solution: • Just scan the original sequence and move every element in the right place! • O(N) time, O(N) I/Os • External memory solution: • Use sorting • O(N log N) time, I/Os

  20. Searching in External Memory • Store N elements in a data structure such that • Given a query element x, find it or its predecessor

  21. B-trees • BFS-blocking naturally corresponds to tree with fan-out • B-trees balanced by allowing node degree to vary • Rebalancing performed by splitting and merging nodes

  22. (a,b)-tree • T is an (a,b)-tree (a≥2 and b≥2a-1) • All leaves on the same level (contain between a and b elements) • Except for the root, all nodes have degree between a and b • Root has degree between 2 and b (2,4)-tree • (a,b)-tree uses linear space and has height •  • Choosing a,b = each node/leaf stored in one disk block •  • O(N/B) space and query

  23. (a,b)-Tree Insert • Insert: Search and insert element in leaf v DO v has b+1 elements/children Splitv: make nodes v’ and v’’ with and elements insert element (ref) in parent(v) (make new root if necessary) v=parent(v) • Insert touch nodes v v’ v’’

  24. (a,b)-Tree Insert

  25. (a,b)-Tree Delete • Delete: Search and delete element from leaf v DO v has a-1 elements/children Fusev with sibling v’: move children of v’ to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and ≤ a+b-1<2b) children split v v=parent(v) • Delete touch nodes v v

  26. (a,b)-Tree Delete

  27. (a,b)-Tree (2,3)-tree • (a,b)-tree properties: • Every update can cause O(logaN) rebalancing operations • If b>2a rebalancing operations amortized • Why? delete insert

  28. Summary/Conclusion: B-tree • B-trees: (a,b)-trees with a,b = • O(N/B) space • O(logBN)query • O(logB N)update • B-trees with elements in the leaves sometimes called B+-tree • Now B-tree and B+tree are synonyms • Construction in I/Os • Sort elements and construct leaves • Build tree level-by-level bottom-up

  29. Basic Structures: I/O-Efficient Priority Queue

  30. Internal Priority Queues • Operations: • Required: • Insert • DeleteMax • Max • Optional: • Delete • Update • Implementation: • Binary tree • Heap 100 40 90 40 30 50 29 23 15 65 Insertion

  31. Internal Priority Queues • Operations: • Required: • Insert • DeleteMax • Max • Optional: • Delete • Update • Implementation: • Binary tree • Heap 100 40 90 40 30 65 50 23 15 29 Insertion

  32. Internal Priority Queues • Operations: • Required: • Insert • DeleteMax • Max • Optional: • Delete • Update • Implementation: • Binary tree • Heap 40 90 40 30 65 50 23 15 29 DeleteMax

  33. Internal Priority Queues • Operations: • Required: • Insert • DeleteMax • Max • Optional: • Delete • Update • Implementation: • Binary tree • Heap 90 40 40 30 65 50 23 15 29 DeleteMax

  34. Internal Priority Queues • Operations: • Required: • Insert • DeleteMax • Max • Optional: • Delete • Update • Implementation: • Binary tree • Heap 90 40 65 40 30 50 29 23 15 DeleteMax

  35. How to Make the Heap I/O-Efficient I/O Technique 1: Make it many-way I/O Technique 2: Buffering!

  36. External Heap insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  37. External Heap: Insert insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  38. External Heap: Insert insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  39. External Heap: Insert insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks sift-up may not be half-full Heap property: All elements in a child are smaller than those in its parent

  40. External Heap: Insert insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks sift-up may not be half-full Heap property: All elements in a child are smaller than those in its parent

  41. External Heap: Insert insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks swap may not be half-full Heap property: All elements in a child are smaller than those in its parent

  42. External Heap: DeleteMax insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  43. External Heap: DeleteMax insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  44. External Heap: DeleteMax insert buffer main memory in memory refill heap has fan-out Θ(M/B) each node has Θ(M/B) blocks may not be half-full Heap property: All elements in a child are smaller than those in its parent

  45. External Heap: DeleteMax insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks refill may not be half-full Heap property: All elements in a child are smaller than those in its parent

  46. External Heap: DeleteMax insert buffer main memory in memory heap has fan-out Θ(M/B) each node has Θ(M/B) blocks refill merge may not be half-full Heap property: All elements in a child are smaller than those in its parent

  47. External Heap: I/O Analysis • What is the I/O cost for a sequence of N mixed insertions / deletemax (analysis in paper too complicated) • Height of heap: Θ(logM/BN/B) • Insertions • Wait until insert buffer is full (served at least Ω(M) inserts) • Then do one (occasionally two) bottom-up chains of sift-ups. • Cost: O(M/B∙logM/BN/B) • Amortized cost per insert: O(1/B∙logM/BN/B) • DeleteMax: • Wait until root is below half full (served at least Ω(M) deletemax) • Then do one, two, sometimes a lot of refills… dead • Do one sift-up: this is easy

  48. External Heap: I/O Analysis • Cost of all refills: • Need a global argument • Idea: trace individual elements • Total amount of “work”: O(N logM/BN/B) • One unit of work: move one element up one level • Refills do positive work • sift-ups do both positive and negative work • |positive work done by refills| + |positive works done by sift-ups| – |negative work done by sift-ups| = O(N logM/BN/B) • But note: |positive works done by sift-ups| >|negative work done by sift-ups| • So, |positive work done by refills| = O(N logM/BN/B)

More Related