Sublinear

Presentation Transcript


  1. Sublinear Algorithms

  2. Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr. Biomedical imaging: 150 petabytes/yr

  3. Data

  4. Data

  5. Massive input → output: sample only a tiny fraction. Sublinear algorithms

  6. Approximate MST [CRT ’01]. Optimal!

  7. Reduces to counting connected components

  8. E[X] = no. of connected components; Var[X] << (no. of connected components)^2, where X is the estimator
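
To make the estimator concrete: below is a minimal Python sketch, in the spirit of [CRT ’01], of estimating the number of connected components by bounded BFS from random vertices. The function name and the samples/cap parameters are illustrative choices, not from the talk.

    import random
    from collections import deque

    def estimate_components(adj, n, samples=1000, cap=50):
        # Each sample contributes 1/|C(u)| when the component of a random
        # vertex u is fully explored within `cap` vertices, and 0 otherwise.
        # Since the sum of 1/|C(u)| over all vertices u equals the number
        # of components, n times the sample mean estimates it, up to an
        # additive ~n/cap truncation bias; the truncation is what keeps
        # the running time sublinear.
        total = 0.0
        for _ in range(samples):
            u = random.randrange(n)
            seen, queue = {u}, deque([u])
            while queue and len(seen) <= cap:
                v = queue.popleft()
                for w in adj.get(v, ()):
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
            if not queue:  # component fully explored
                total += 1.0 / len(seen)
        return n * total / samples

This is the reduction of slide 7: for integer edge weights in {1, …, W}, the MST weight equals n − W + Σ_{ℓ=1}^{W−1} c_ℓ, where c_ℓ is the number of connected components of the subgraph restricted to edges of weight ≤ ℓ.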

  9. Shortest Paths [CLM ’03]

  10. Ray Shooting [CLM ’03]. Optimal! Also: volume, intersection, point location

  11. Self-Improving Algorithms

  12. 011010110110101010110010101010110100111001101010010100010 low-entropy data • Takens embeddings • Markov models (speech)

  13. Self-Improving Algorithms. Arbitrary, unknown random source. Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering

  14. Self-Improving Algorithms. Arbitrary, unknown random source. 1. Run an algorithm with the best worst-case behavior, or the best behavior under a uniform distribution, or the best under some postulated prior. 2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use. 3. The algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.

  15. Self-Improving Algorithms 0110101100101000101001001010100010101001: successive runs take times T1, T2, T3, T4, T5, … with E[T_k] → optimal expected time for the random source

  16. Sorting (x1, x2, …, xn), each xi drawn independently from Di. H = entropy of the rank distribution. Optimal!
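
A toy Python illustration of the self-improving sorter. This is heavily simplified: the actual result attains O(n + H) expected time by building distribution-tuned search structures per coordinate, which this sketch replaces with one shared splitter list; all names are illustrative.

    import bisect

    def train_splitters(training_inputs):
        # Learning phase (simplified): pool the values seen across
        # training runs and keep n roughly equally spaced order
        # statistics as bucket boundaries for the trained source.
        pool = sorted(v for x in training_inputs for v in x)
        n = len(training_inputs[0])
        step = max(1, len(pool) // n)
        return pool[step - 1::step]

    def self_improving_sort(x, splitters):
        # Limiting phase: route each x_i to its learned bucket (bucket
        # sizes are O(1) in expectation under the trained source),
        # sort each bucket, and concatenate.
        buckets = [[] for _ in range(len(splitters) + 1)]
        for v in x:
            buckets[bisect.bisect_left(splitters, v)].append(v)
        out = []
        for b in buckets:
            out.extend(sorted(b))
        return out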

  17. Clustering: k-median (k = 2)

  18. Minimize sum of distances. Hamming cube {0,1}^d

  19. Minimize sum of distances. Hamming cube {0,1}^d. NP-hard

  20. Minimize sum of distances. Hamming cube {0,1}^d. [KSS]

  21. How to achieve linear limiting expected time? Input space {0,1}^dn. Identify the core and handle it directly; on the tail, use KSS. Tail: probability < O(dn)/(KSS running time), so the tail contributes O(dn) to the expected time

  22. NP vs P: input vicinity → algorithmic vicinity. How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; nearest neighbor; incremental algorithm
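
In code, the "store precomputed solutions, reuse via nearest neighbor, fall back on the rare tail" idea of slides 21–22 might look as follows. This is only an illustrative sketch: expensive_kmedian stands in for the KSS subroutine, and the threshold rule is hypothetical.

    def hamming(p, q):
        return sum(a != b for a, b in zip(p, q))

    def input_distance(x, y):
        # Pointwise distance between two inputs (lists of points in {0,1}^d).
        return sum(hamming(p, q) for p, q in zip(x, y))

    class SelfImprovingKMedian:
        def __init__(self, expensive_kmedian, k=2):
            self.solve_exactly = expensive_kmedian  # stands in for KSS
            self.k = k
            self.cache = []                         # (input, centers) pairs

        def learn(self, sample_inputs):
            # Learning phase: pay for the expensive solver on sampled inputs.
            for x in sample_inputs:
                self.cache.append((x, self.solve_exactly(x, self.k)))

        def solve(self, x, tail_threshold):
            # Core: a nearby precomputed input yields a reusable solution
            # ("input vicinity -> algorithmic vicinity").
            if self.cache:
                y, centers = min(self.cache,
                                 key=lambda yc: input_distance(x, yc[0]))
                if input_distance(x, y) <= tail_threshold:
                    return centers
            # Tail: rare enough that the expensive call is cheap in expectation.
            return self.solve_exactly(x, self.k)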

  23. Main difficulty: How to spot the tail?

  24. Online Data Reconstruction

  25. 011010110110101010110010101010110100111001101010010100010 011010110***110101010110010101010***10011100**10010***010 1. Data is accessible before noise 2. Or it’s not

  26. 011010110***110101010110010101010***10011100**10010***010 1. Data is accessible before noise

  27. Error-correcting codes: encode the data 011010110110101010110010101010110100111001101010010100010; decode the corrupted codeword 010*10*0**001
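
To make the encode/decode picture concrete, here is a toy 3-fold repetition code in Python that corrects one error or '*' erasure per block. It only illustrates the principle, not the codes used in practice.

    def encode(bits, r=3):
        # Repeat each data bit r times.
        return ''.join(b * r for b in bits)

    def decode(word, r=3):
        # Majority-vote each block of r symbols, ignoring '*' erasures.
        # (A fully erased block defaults to '0' in this sketch.)
        out = []
        for i in range(0, len(word), r):
            block = [c for c in word[i:i + r] if c != '*']
            out.append('1' if block.count('1') > len(block) / 2 else '0')
        return ''.join(out)

For example, encode('010') gives '000111000', and decode('0*011*0*0') still recovers '010'.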

  28. 011010110110101010110010101010110100111001101010010100010 Data inaccessible before noise. Assumptions are necessary!

  29. 011010110110101010110010101010110100111001101010010100010 Data inaccessible before noise. 1. Sorted sequence 2. Bipartite graph, expander 3. Solid w/ angular constraints 4. Low-dim attractor set

  30. 011010110110101010110010101010110100111001101010010100010 Data inaccessible before noise: the data must satisfy some property P, but does not quite

  31. f(x) = ? f = access function: query x, the data returns f(x). But life being what it is…

  32. f(x) = ? Query x, the data returns f(x)

  33. Humans: define a distance from any object to the data class

  34. The filter: on query x, it probes the data at x1, x2, …, sees f(x1), f(x2), …, and returns g(x), with no undo; g is the access function for the reconstructed data

  35. Similar to Self-Correction [RS96, BLR ’93], except: about data, not functions; error-free; allows O(distance to property)

  36. Monotone function: [n]^d → R. Filter requires polylog(n) queries

  37. Offline reconstruction

  38. Offline reconstruction

  39. Online reconstruction

  40. Online reconstruction

  41. Online reconstruction: don't mortgage the future

  42. Online reconstruction: early decisions are crucial!

  43. Monotone function

  44. Frequency of a point x: determined by the smallest interval I containing x with more than |I|/2 violations involving f(x)
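
In code, the frequency test reads as follows; this exhaustive Python version only pins down the definition, whereas the filter itself estimates frequencies from polylog(n) random samples (slide 47).

    def has_nonzero_frequency(f, x, n):
        # True iff some interval I = [lo, hi] containing x has more than
        # |I|/2 monotonicity violations involving f(x): points y with
        # y < x and f(y) > f(x), or y > x and f(y) < f(x).
        for lo in range(x + 1):
            for hi in range(x, n):
                violations = sum(
                    1 for y in range(lo, hi + 1)
                    if (y < x and f(y) > f(x)) or (y > x and f(y) < f(x)))
                if violations > (hi - lo + 1) / 2:
                    return True
        return False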

  45. Frequency of a point

  46. Given x: 1. Estimate its frequency. 2. If nonzero, find the “smallest” interval around x with both endpoints having zero frequency. 3. Interpolate between f(endpoints)
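
Combining slide 46 with the frequency test above gives a minimal (deliberately non-sublinear) Python sketch of the online filter for a near-monotone f: [n] → R. The midpoint interpolation and the linear scans for zero-frequency endpoints are simplifications.

    def filter_query(f, x, n):
        # 1. If x has zero frequency, its value is trusted as-is.
        if not has_nonzero_frequency(f, x, n):
            return f(x)
        # 2. Find the nearest zero-frequency endpoints around x.
        lo, hi = x, x
        while lo > 0 and has_nonzero_frequency(f, lo, n):
            lo -= 1
        while hi < n - 1 and has_nonzero_frequency(f, hi, n):
            hi += 1
        # 3. Any value between f(lo) and f(hi) preserves monotonicity;
        #    answer with the midpoint.
        return (f(lo) + f(hi)) / 2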

  47. To prove: 1. Frequencies can be estimated in polylog time. 2. The function is monotone over the zero-frequency domain. 3. The zero-frequency domain occupies a (1 − 2ε) fraction

  48. Bivariate concave function. Filter requires polylog(n) queries
