
Paolo Ferragina Dipartimento di Informatica Università di Pisa


Presentation Transcript


  1. IR Paolo Ferragina Dipartimento di Informatica Università di Pisa

  2. Paradigm shift: Web 2.0 is about the many

  3. Do big DATA need big PCs?? An Italian ad of the ’80s about a BIG brush or a brush BIG....

  4. big DATA big PC ? • We have three types of algorithms: • T1(n) = n, T2(n) = n2, T3(n) = 2n ... and assume that 1 step = 1 time unit • How many input data n each algorithm may process within t time units? • n1 = t, n2 = √t, n3 = log2 t • What about a k-times faster processor? ...or, what is n, when the available time is k*t ? • n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t

  5. A new scenario for Algorithmics
  • Data are more available than ever before: n ➜ ∞ ... is more than a theoretical assumption.
  • The RAM model is too simple: the step cost is ω(1).

  6. The memory hierarchy: CPU (registers, L1) → L2 cache → RAM → HD → net.
  • Cache: few MBs, some nanosecs per access, few words fetched.
  • RAM: few GBs, tens of nanosecs, some words fetched.
  • HD: few TBs, few millisecs, one page of B = 32K fetched.
  • Net: many TBs, even secs, data moved in packets.

  7. Does Virtual Memory help?
  • M = memory size, N = problem size
  • p = prob. of a memory access [0.3–0.4 (Hennessy-Patterson)]
  • C = cost of an I/O [10^5–10^6 (Hennessy-Patterson)]
  If N ≤ M, then the cost per step is 1.
  If N = (1+ε)·M, then the avg cost per step is 1 + C·p·ε/(1+ε), which is at least > 10^4 · ε/(1+ε).
  If ε = 1/1000 (e.g. M = 1GB, N = 1GB + 1MB), the avg step-cost is > 20.
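
Plugging the slide's numbers into the formula above (a minimal check; C and p are taken at the low end of the quoted ranges):

```python
def avg_step_cost(eps, C=1e5, p=0.3):
    """Average cost per step when N = (1 + eps)·M: each step accesses memory with
    probability p, and a fraction eps/(1+eps) of those accesses falls outside RAM
    and pays the I/O cost C."""
    return 1 + C * p * eps / (1 + eps)

print(avg_step_cost(1/1000))   # ≈ 31: a 0.1% memory overflow already slows every step down ~30x
```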

  8. The I/O-model: count the I/Os, each moving one block of B items between HD and RAM; exploit spatial or temporal locality; caching gives fewer and faster I/Os. (The original figure shows the disk: read/write head, read/write arm, track, magnetic surface.)
  “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

  9. Other issues ⇒ other models:
  • Random vs sequential I/Os, scanning is better than jumping ⇒ Streaming algorithms
  • Not just one CPU: many PCs, multi-core CPUs or even GPUs ⇒ Parallel or Distributed algorithms
  • Parameter-free algorithms, “anywhere, anytime, anyway... optimal!!” ⇒ Cache-oblivious algorithms

  10. What about energy consumption? [Leventhal, CACM 2008]: ≈10 I/Os per second per Watt (disk) vs. ≈6000 I/Os per second per Watt (flash).

  11. Our topics, on an example: the architecture of a search engine. (The original figure sketches the pipeline: a Crawler (“Which pages to visit next?”) feeding a Page archive, a Page analyzer, an Indexer, a Query resolver and a Ranker answering the user Query; the course topics attached to these boxes are Sorting, Dictionaries, Hashing, Data Compression, text and auxiliary data Structures, Linear Algebra, Clustering, Classification.)

  12. Warm up... Take Wikipedia in Italian, and compute word frequencies:
  • Few GBs ⇒ n ≈ 10^9 words
  • How do you proceed??
  • Tokenize into a sequence of strings
  • Sort the strings
  • Create tuples <word, freq>
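
A minimal in-memory sketch of this tokenize–sort–count pipeline (the file name wiki_it.txt and the \w+ tokenization rule are placeholders; the next slides discuss what happens when the data does not fit in RAM):

```python
import re
from itertools import groupby

# Hypothetical plain-text dump of the Italian Wikipedia.
with open("wiki_it.txt", encoding="utf-8") as f:
    words = re.findall(r"\w+", f.read().lower())             # 1) tokenize into strings

words.sort()                                                 # 2) sort the strings
freq = [(w, sum(1 for _ in g)) for w, g in groupby(words)]   # 3) <word, freq> tuples

print(freq[:10])
```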

  13. Binary Merge-Sort
  Merge-Sort(A,i,j)
    if (i < j) then
      m = (i+j)/2;          // Divide
      Merge-Sort(A,i,m);    // Conquer
      Merge-Sort(A,m+1,j);  // Conquer
      Merge(A,i,m,j)        // Combine
  Merge is linear in the #items to be merged.
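
A runnable transcription of the pseudocode above (a minimal sketch; the Merge here works through a temporary copy of the two runs, which is one possible implementation, not necessarily the slide's):

```python
def merge(A, i, m, j):
    """Combine: merge the sorted runs A[i..m] and A[m+1..j]; linear in the #items merged."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

def merge_sort(A, i, j):
    if i < j:
        m = (i + j) // 2         # Divide
        merge_sort(A, i, m)      # Conquer
        merge_sort(A, m + 1, j)  # Conquer
        merge(A, i, m, j)        # Combine

A = [19, 12, 7, 15, 4, 8, 3, 13, 11, 9, 6, 1, 5, 2, 10, 17]   # a 16-item example
merge_sort(A, 0, len(A) - 1)
print(A)
```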

  14. But... a few key observations:
  • Items = (short) strings = atomic...
  • Θ(n log n) memory accesses (I/Os??)
  • [5ms] · n log2 n ≈ 3 years
  In practice it is “faster”: why?
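
The order of magnitude behind that last bullet, assuming every one of the Θ(n log n) accesses were a 5 ms random disk I/O and n ≈ 10^9 as in the warm-up (a rough estimate in the same "years" ballpark, not the slide's exact arithmetic):

```python
import math

n, io_time = 10**9, 5e-3                      # 10^9 words, 5 ms per random disk access
seconds = io_time * n * math.log2(n)          # if every memory access were an I/O
print(seconds / (3600 * 24 * 365), "years")   # ≈ 4.7 years -- hence sorting must avoid random I/Os
```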

  15. Implicit Caching… (the original figure runs binary merge-sort on the 16-item sequence 19 12 7 15 4 8 3 13 11 9 6 1 5 2 10 17)
  • N/M runs, each sorted in internal memory (no I/Os)
  • log2(N/M) merge levels, each costing 2 passes over the data (one Read / one Write) = 2·(N/B) I/Os
  • I/O-cost for binary merge-sort is ≈ 2·(N/B)·log2(N/M)

  16. A key inefficiency: after a few steps, every run is longer than B!!! (The original figure shows two sorted runs being merged through three memory pages: one page per input run plus one output buffer of B items, flushed to disk.) We are using only 3 pages, but memory contains M/B pages ≈ 2^30/2^15 = 2^15.

  17. Multi-way Merge-Sort. Sort N items with main memory M and disk pages of B items:
  • Pass 1: produce N/M sorted runs.
  • Pass i: merge X = M/B − 1 runs at a time ⇒ log_X(N/M) passes.
  (The original figure shows main memory holding one page per run — X input buffers of B items — plus one output page, streaming from and to disk.)
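
A minimal sketch of one merge pass, assuming the runs are already sorted and stored one item per line in text files (the file names and the line-based format are illustrative; the per-run buffers of B items are left to the OS/file layer):

```python
import heapq

def merge_runs(run_files, out_file):
    """One multi-way merge pass: stream X sorted runs into a single sorted run.
    heapq.merge keeps only one "current" item per run in memory (plus file buffers)."""
    streams = [open(name, encoding="utf-8") for name in run_files]
    try:
        with open(out_file, "w", encoding="utf-8") as out:
            for line in heapq.merge(*streams):
                out.write(line)
    finally:
        for s in streams:
            s.close()

# e.g. one pass merging X = M/B - 1 runs at a time:
# merge_runs(["run0.txt", "run1.txt", "run2.txt"], "merged0.txt")
```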

  18. Cost of Multi-way Merge-Sort
  • Number of passes = log_X(N/M) ≈ log_{M/B}(N/M)
  • Total I/O-cost is Θ( (N/B) · log_{M/B}(N/M) ) I/Os
  • Note that log_{M/B} M = log_{M/B}[(M/B)·B] = 1 + log_{M/B} B
  • In practice M/B ≈ 10^5 ⇒ #passes = 1 ⇒ few mins. Tuning depends on disk features.
  • A large fan-out (M/B) decreases #passes; compression would decrease the cost of a pass!
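
A quick sanity check of the pass count with illustrative parameters (N, M, B counted in items; the concrete values are assumptions, not taken from the slides):

```python
import math

N = 10**9    # items to sort
M = 2**27    # items fitting in main memory
B = 2**15    # items per disk page  ->  fan-out M/B - 1 = 4095

runs   = math.ceil(N / M)                  # sorted runs produced by pass 1
passes = math.ceil(math.log(runs, M / B))  # merge passes ~ log_{M/B}(N/M)
ios    = 2 * (N / B) * (1 + passes)        # read+write per pass, run formation included
print(runs, passes, int(ios))              # 8 runs, 1 merge pass, ~122,000 I/Os
```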

  19. I/O-lower bound for Sorting
  • Every I/O fetches B items into a memory of size M.
  • Decision tree with fan-out (M choose B): there are N/B steps (the first read of each input block) with (M choose B)·B! comparison outcomes.
  • Find t > N/B such that (M choose B)^t · (B!)^(N/B) ≥ N!
  • We get t = Ω( (N/B) · log_{M/B}(N/B) ) I/Os.
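
Solving the counting inequality of the slide (a sketch of the standard argument via Stirling's approximation, not necessarily the slide's exact algebra):

```latex
\binom{M}{B}^{t} (B!)^{N/B} \;\ge\; N!
\;\Longrightarrow\;
t \;\ge\; \frac{\log N! - \frac{N}{B}\log B!}{\log\binom{M}{B}}
\;=\; \frac{\Theta\!\left(N \log \frac{N}{B}\right)}{\Theta\!\left(B \log \frac{M}{B}\right)}
\;=\; \Omega\!\left(\frac{N}{B} \log_{M/B} \frac{N}{B}\right).
```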

  20. Pay attention... if sorting needs to manage arbitrarily long strings. Key observations:
  • Array A is an “array of pointers to objects” (indirect sort), stored apart from the memory containing the strings.
  • Each object-to-object comparison A[i] vs A[j] makes 2 random accesses to the 2 memory locations holding the two strings.
  • Θ(n log n) random memory accesses (I/Os??)
  Again caching helps, but it may be less effective than before.
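
What the "array of pointers" looks like in practice (a minimal sketch; in Python the indirection goes through an index array, and the explicit comparator makes the two dereferences per comparison visible):

```python
from functools import cmp_to_key

strings = ["pisa", "informatica", "dipartimento", "ferragina", "paolo"]

# A is the "array of pointers" (here: indices) to the string objects.
A = list(range(len(strings)))

def cmp(i, j):
    # Every comparison dereferences two pointers, i.e. touches two possibly
    # far-apart memory locations: Theta(n log n) random accesses overall.
    si, sj = strings[i], strings[j]
    return -1 if si < sj else (1 if si > sj else 0)

A.sort(key=cmp_to_key(cmp))
print([strings[i] for i in A])   # ['dipartimento', 'ferragina', 'informatica', 'paolo', 'pisa']
```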
