Data structures: time, I/Os, entropy, joules. Paolo Ferragina Dipartimento di Informatica Università di Pisa. ... but do NOT forget practice ;-). Our driving moral. Big steps come from theory. Strings... why?. Ubiquitous: any datum is a sequence of bits, hence a string
Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa
Our driving moral: Big steps come from theory... but do NOT forget practice ;-)
Strings... why?
• Ubiquitous: any datum is a sequence of bits, hence a string
• Spur new problems in many areas:
  • Geometry
    • String-similarity search: points in high-dim space and NN-search
    • Lower/upper bounds for indexing via reductions to geometric problems
  • Graphs
    • Doc-doc similarity graph, ubiquitous in Text/Web mining
    • Query-log graphs: edge iff 2 queries clicked on the same result page
  • Data compression
    • Shortest paths on char-based weighted graphs [Ferragina et al, SODA 09, ESA 09]
(String-)Dictionary Problem
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P.
• Exact search: Hashing [Mitzenmacher, ESA invited '09]
(Compacted) Trie [Fredkin, CACM 1960]
Dominated the string-matching scene in the '80s-'90s with its suffix version: the Suffix Tree
• Performance:
  • Search ≈ O(|P|) time
  • Space ≈ O(K + N)
[Figure: compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo; edges are labeled with substrings, and one node carries the label (2; 3,5)]
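A minimal sketch of prefix search on a trie, using the seven strings of the figure. For brevity it builds a plain (uncompacted) trie, one node per character, rather than the compacted trie of the slide; the names `TrieNode`, `trie_insert` and `prefix_search` are illustrative assumptions, not taken from the talk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a dictionary string ends here


def trie_insert(root, s):
    """Insert string s, creating one node per character."""
    node = root
    for c in s:
        node = node.children.setdefault(c, TrieNode())
    node.is_end = True


def prefix_search(root, p):
    """Return, in lexicographic order, all stored strings having prefix p."""
    node = root
    for c in p:                      # O(|P|) steps to reach P's locus
        node = node.children.get(c)
        if node is None:
            return []
    out, stack = [], [(node, p)]
    while stack:                     # collect every string below the locus
        n, s = stack.pop()
        if n.is_end:
            out.append(s)
        for c in sorted(n.children, reverse=True):
            stack.append((n.children[c], s + c))
    return out


WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
ROOT = TrieNode()
for w in WORDS:
    trie_insert(ROOT, w)

print(prefix_search(ROOT, "syzyg"))  # ['syzygetic', 'syzygial', 'syzygy']
```

Every step of the descent follows a pointer to a different node, which is the "random memory accesses" cost that matters once the dictionary no longer fits in fast memory.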
Timeline: theory and practice...
• '60: Trie
• '70-'80: Suffix Tree
• '90: What about Software Engineers ??
(Compacted) Trie [Fredkin, CACM 1960]
Dominated the string-matching scene in the '80s-'90s with its suffix version: the Suffix Tree
• Performance:
  • Search ≈ O(|P|) time
  • Space ≈ O(K + N)
• ... But in practice…
  • Search: random memory accesses
  • Space: len + pointers + strings
[Figure: the compacted trie of the previous (Compacted) Trie slide]
What did systems implement?
• Used the Compacted trie, of course, but with 2 other concerns because of large data
1° issue: space concern
Front Coding [Bender et al., PODS 2006] [Ferragina et al., PODS 2008]
Example on words: ….systile syzygetic syzygial syzygy….
Example on sorted URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
Front-coded (length of the prefix shared with the previous string, then the remaining suffix):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...
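Front coding can be sketched in a few lines: each string of a sorted list is stored as the length of the prefix it shares with its predecessor, plus the remaining suffix. The function names below are illustrative assumptions; the word list is the one used on the slides.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        k = lcp_len(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def front_decode(pairs):
    """Rebuild the strings; each one needs its predecessor, so decoding is sequential."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_encode(words))
# [(0, 'systile'), (2, 'zygetic'), (5, 'ial'), (5, 'y')]
```

Decoding a string requires decoding all the strings before it in the same run, which is why front coding is usually applied to small buckets rather than to the whole dictionary.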
2° issue: hierarchical memory
[Figure: CPU, internal memory of size M, and a hard disk (HD) accessed by tracks in blocks of size B]
• Count I/Os
• Spatial locality or temporal locality: less and faster I/Os
• Caching: less I/Os
2-level indexing
[Figure: the internal memory holds a compacted trie (CT) on a sample of the strings (systile, szaibelyite); the disk holds the front-coded buckets ….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….]
• 2 advantages:
  • Search ≈ typically 1 I/O
  • Space ≈ Front-coding over buckets
• 2 limitations:
  • Sampling rate ≈ lengths of sampled strings
  • Trade-off ≈ speed vs space (because of bucket size)
Related: (Prefix) B-tree
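A rough sketch of the two-level scheme, under simplifying assumptions: the in-memory sample is a sorted Python list searched with `bisect` (the slide uses a compacted trie on the sample), and the "disk" is a list of front-coded buckets. All function names and the `bucket_size` parameter are hypothetical.

```python
from bisect import bisect_left


def lcp_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(strings):
    out, prev = [], ""
    for s in strings:
        k = lcp_len(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def front_decode(pairs):
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


def build_two_level(sorted_strings, bucket_size):
    """Sample the first string of each bucket; front-code each bucket separately."""
    sample, buckets = [], []
    for i in range(0, len(sorted_strings), bucket_size):
        chunk = sorted_strings[i:i + bucket_size]
        sample.append(chunk[0])
        buckets.append(front_encode(chunk))
    return sample, buckets


def prefix_search(sample, buckets, p):
    """Locate the first candidate bucket in memory, then scan decoded buckets."""
    b = max(bisect_left(sample, p) - 1, 0)   # last bucket whose head is < p
    out = []
    for bucket in buckets[b:]:               # each bucket read models one I/O
        for s in front_decode(bucket):
            if s.startswith(p):
                out.append(s)
            elif s > p:                      # past the range of strings prefixed by p
                return out
    return out


WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
SAMPLE, BUCKETS = build_two_level(WORDS, bucket_size=4)
print(SAMPLE)       # ['systile', 'szaibelyite']
print(BUCKETS[1])   # [(0, 'szaibelyite'), (2, 'czecin'), (2, 'omo')]
```

A larger `bucket_size` shrinks the in-memory sample and improves front-coding compression, but makes each bucket scan longer: this is the speed vs space trade-off listed among the limitations.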
Timeline: theory and practice...
• '60: Trie
• '70-'80: Suffix Tree
• '90: 2-level indexing (Space + Hierarchical Memory)
• 1995: String B-tree
Do we need to trade space for I/Os ?
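An old idea: Patricia Trie [Morrison, J.ACM 1968]
[Figure: Patricia trie over systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo; each node stores a string depth and each edge keeps only its branching character]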
A new search [Ferragina-Grossi, J.ACM 1999]
Three-phase Search(P), with P = syzyyea:
• Phase 1: tree navigation
• Phase 2: compute LCP
• Phase 3: tree navigation
Only 1 string is checked
Trie space ≈ #strings, NOT their length
[Figure: Patricia trie over ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….; the mismatch g < y found in Phase 2 determines P's position among the strings]
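The three phases can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it assumes a sorted, prefix-free dictionary, stores in each node the range of leaves below it, and returns the lexicographic position of P (the number of dictionary strings smaller than P). The names `Node`, `build` and `blind_search` are hypothetical.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class Node:
    def __init__(self, depth, lo, hi):
        self.depth = depth         # length of the prefix shared by the leaves below
        self.lo, self.hi = lo, hi  # the leaves below are strings[lo:hi]
        self.children = {}         # branching char -> child node


def build(strings, lo=0, hi=None):
    """Patricia trie over a sorted, prefix-free list of strings."""
    if hi is None:
        hi = len(strings)
    if hi - lo == 1:
        return Node(len(strings[lo]), lo, hi)      # leaf
    d = lcp(strings[lo], strings[hi - 1])          # common prefix of the whole range
    node = Node(d, lo, hi)
    i = lo
    while i < hi:
        j = i
        while j < hi and strings[j][d] == strings[i][d]:
            j += 1
        node.children[strings[i][d]] = build(strings, i, j)
        i = j
    return node


def blind_search(root, strings, p):
    # Phase 1: navigate by comparing branching characters only.
    node, path = root, [root]
    while node.depth < len(p) and p[node.depth] in node.children:
        node = node.children[p[node.depth]]
        path.append(node)
    # Phase 2: the only access to a dictionary string.
    candidate = strings[node.lo]
    l = lcp(p, candidate)
    # Phase 3: find the first node on the path that lies below the mismatch.
    hit = next((v for v in path if v.depth > l), None)
    if hit is not None:
        if l == len(p) or p[l] < candidate[l]:
            return hit.lo          # P precedes every string below hit
        return hit.hi              # P follows every string below hit
    if l == len(p):
        return node.lo
    for c in sorted(node.children):
        if c > p[l]:
            return node.children[c].lo
    return node.hi


D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
ROOT = build(D)
print(blind_search(ROOT, D, "syzyyea"))  # 4: between syzygy and szaibelyite
```

On P = syzyyea, Phase 1 ends at the leaf syzygetic, Phase 2 finds LCP = 4 with mismatch g < y, and Phase 3 places P after all strings sharing the prefix syzyg, as in the figure.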
The String B-tree [Ferragina-Grossi, J.ACM 1999]
[Figure: a B-tree over string pointers (1 ... 29) in which every node is organized as a Patricia trie (PT); the search follows the lexicographic position of P down the levels]
• Search(P):
  • O((p/B) log_B K) I/Os
  • + O(occ/B) I/Os
• 1 string checked per node: O(p/B) I/Os
• O(log_B K) levels
• It is dynamic...
> 15 US patents cite it !! [Handbook of Comp. Biology, 2009]
Knuth, vol 3°, pag. 489: “elegant”
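One way to read the bound on this slide is as a product of the two annotations in the figure: the number of levels times the cost of the single string check per level, plus the cost of listing the answers. Here p = |P| and occ is the number of reported occurrences; interpreting the O(occ/B) term as the reporting cost is an assumption of this sketch.

```latex
\[
\underbrace{O(\log_B K)}_{\text{levels}}
\;\times\;
\underbrace{O(p/B)}_{\text{1 string checked per level}}
\;+\;
\underbrace{O(\mathit{occ}/B)}_{\text{reporting}}
\;=\;
O\!\left(\frac{p}{B}\,\log_B K \;+\; \frac{\mathit{occ}}{B}\right)\ \text{I/Os}
\]
```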
I/O-aware algorithms & data structures: I/Os were the main concern [CACM 1988] [2006]. Huge literature !!
Timeline: theory and practice...
• '60: Trie
• '70-'80: Suffix Tree
• '90: 2-level indexing
• 1995: String B-tree
• 1999: Cache-oblivious Algo. and Data Str.
Concerns: Space; not just 2 memory levels
[Figure: memory hierarchy: CPU registers, L1, L2 cache, RAM, HD, net]
• Parameter-free solutions
• Anywhere, anytime, anyway... I/O-optimal !!
See chap by Arge, Brodal, Fagerberg
Some precious achievements...
• Cache-oblivious trie
  • Static dictionary of strings [Brodal et al, SODA 2006]
• Cache-oblivious String B-tree
  • Dynamic dictionary of strings [Bender et al, PODS 2006]
• Cache-oblivious tree mapping (Patricia Trie)
  • Split-and-Refine that applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003]
  • Worst-case solution [Demaine et al, manuscript 2004]
Timeline: theory and practice...
• '60: Trie
• '70-'80: Suffix Tree
• '90: 2-level indexing
• 1995: String B-tree
• 1999: Cache-oblivious data structures
• Compressed data structures
Concerns: Space; not just 2 memory levels
A challenging question [Ken Church, AT&T 1995]
Software engineers use many “squeezing heuristics” that compress data and still support fast access to them.
Can we “automate” and “guarantee” the process ?
Opportunistic Data Structures with Applications, P. Ferragina, G. Manzini (...now, J.ACM 2005)
Aka: Compressed self-indexes
• Space for text + (full-text) index ≈ space of the compressed text (H_k)
• Query/Decompression time theoretically (quasi-)optimal
The big (unconscious) step... [Burrows-Wheeler, 1994]
Let us be given a text T = mississippi#. Can we compress it ?
All the rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (F = first column, L = last column = bwt(T)):
F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
bwt(T) = ipssm#pissii
The order of the sorted rows is the Suffix Array of T.
bzip2 = BWT + other simple compressors
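The construction on this slide can be reproduced with a short sketch. It sorts suffixes naively (fine for a toy text, far from how real tools build a suffix array) and assumes the text ends with a unique end marker `#` that is smaller than every other character; the function names are illustrative.

```python
def suffix_array(t):
    """Starting positions of the suffixes of t, in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt(t):
    """Last column of the sorted rotations: the char preceding each sorted suffix."""
    return "".join(t[i - 1] for i in suffix_array(t))


def inverse_bwt(last, end="#"):
    """Rebuild the text from its BWT, assuming `end` occurs once, as the last char."""
    # Stable sort of L's positions by character: order[j] is the position in L
    # of the text character that appears at position j of the first column F.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(end)            # the row that is the text itself ends with `end`
    out = []
    for _ in range(len(last)):
        row = order[row]
        out.append(last[row])
    return "".join(out)


T = "mississippi#"
print(suffix_array(T))   # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
print(bwt(T))            # ipssm#pissii
print(inverse_bwt(bwt(T)))  # mississippi#
```

The output string groups equal characters into runs (ss, ii), which is what the "other simple compressors" applied after the BWT exploit.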
From practice to theory... [Ferragina-Manzini, FOCS '00]
[Figure: the sorted rotation matrix of T = mississippi#, with bwt(T) as its last column]
• FM-index = BWT is searchable
• ...or Suffix Array is compressible
• Space = l |T| H_k + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
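"BWT is searchable" can be illustrated with backward search, which counts the occurrences of P by scanning P right to left and narrowing a range of sorted rows. The sketch below is deliberately naive: occurrence counts are obtained by scanning the BWT string, whereas a real FM-index answers them from compressed rank structures. Function names are illustrative assumptions.

```python
def bwt(t):
    """BWT of t via naive suffix sorting; t must end with a unique smallest marker."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)


def fm_count(last, pattern):
    """Number of occurrences of pattern in the text whose BWT is `last`."""
    # C[c] = number of text characters strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(last)):
        C[c] = total
        total += last.count(c)

    def occ(c, i):
        """Occurrences of c in last[:i] (naive scan, for illustration only)."""
        return last[:i].count(c)

    lo, hi = 0, len(last)            # current range of sorted rows: [lo, hi)
    for c in reversed(pattern):      # one step per pattern character
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


L = bwt("mississippi#")
print(fm_count(L, "ssi"))   # 2
print(fm_count(L, "i"))     # 4
print(fm_count(L, "xyz"))   # 0
```

The loop performs one range update per character of P, independent of the text length, provided the two `occ` queries are answered quickly; supporting those queries on the compressed BWT is where the engineering of an actual index lies.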
Compressed & Searchable data formats
• Texts: FOCS 2000; SODA 2003, 04; SODA 2007; SPIRE 2007; CPM 2008; CPM 2010; ICALP 2010
• Integer Sets: SODA 2002; …; FOCS 2008; STACS 2009
• Trees: SODA 2002; SODA 2007; ICALP 2007; SWAT 2008; ICALP 2009; SODA 2010
• Graphs: DCC 2001; WWW 2004; ISAAC 2007; ESA 2008; FOCS 2009
• Labeled Trees: SODA 2002; FOCS 2005; WWW 2006; SODA 2007; ICDE 2010
• Functions: ICALP 2003, 04; SODA 2004; ICALP 2008; ESA 2009; LATIN 2010
• Point Sets: SODA 2003; TALG 2007; WADS 2009; SODA 2009
• Images: DCC 2008
[December 2003] [January 2005]
• > 10^3 faster than Smith-W.
• > 10^2 faster than SOAP & Maq
What about the Web ? [Ferragina-Manzini, ACM WSDM 2010]
Where we are nowadays
• '60: Trie
• '70-'80: Suffix Tree
• '90: 2-level indexing
• 1995: String B-tree
• 1999: Cache-oblivious data structures
• Compressed data structures
Something is known... yet very preliminary [PODS '08, Navarro, Vitter, ...] [Belazzougui et al, this ESA]
What else...
• Solid-state disks: no mechanical parts [E. Gal, S. Toledo. ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]
  • ... very fast reads, but slow writes & wear leveling
• Self-adjusting or Weighted design
  • Time ops depend on some (un/known) distribution
  • Challenging: no pointers, self-adjust (perf) vs compression (space)
A bigger challenge: from micro to macro ! IEEE Computer, 2007
Approach #1 (engineering oriented) • News: Proper system components + specific algorithms • Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort]
Approach #2 (Manage resources) • Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability • Susanne Albers, Comm. ACM 2010
Approach #3 (models and algorithms) IEEE Computer, 2009 “Algorithmics offers benefits that extend far beyond TCS into the design of systems.” Workshop in IEEE Conf. on Green Comp. 2010
Energy-aware Algo+Ds ?
• Memory-level impacts: locality pays off
• I/Os and compression are obviously important
• BUT here there is a new twist
Battery life !! MIPS per Watt ?
Approach in a principled way. Who cares whether your application:
• is y% slower than optimal, but it is more energy efficient ?
• occupies x% more space than optimal, but decompression is faster ?
Battery life !! MIPS per Watt ? Stay tuned: Algorithm Library for Mobile Phones Idea: Multi-objective optimization in data-structure design
• Systems: BigTable, 2006; HBase - Hadoop; Cosmos; HyperTable; Cassandra
• Applications: Real-time search; Q&A; social search; Knowledge search
Many ingredients
• Items are graphs, vectors, strings, …
• Number and size are VERY large
• Involve many resources to be optimized:
  • Time (speed/patience)
  • Space (#disks/management costs)
  • Bandwidth (speed/€)
  • Energy (€)
Multi-objective optimization in data-structure design!
That’s all ! • Look at my paper in the proceedings