Algorithms and data structures for big data, what’s next?

Paolo Ferragina, University of Pisa

Is Big Data a buzzword?

“Big Data” vs “Grid Computing”

VLDB has existed since 1992

Big data, big impact !

Big data are everywhere !

NoSQL [Procs. OSDI 2006]. Hadoop, Cassandra, HyperTable, Cosmos

From macro to micro-users. Energy is related to time/memory-accesses in an intricate manner, so the issue “algo + memory levels” is key for everyday users, not only for big players.

Our driving moral... Big steps come from theory ... but do NOT forget practice ;-)

Our running example

(String-)Dictionary Problem. Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P. Exact search → Hashing
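A minimal sketch of prefix search over a sorted dictionary, assuming the K strings fit in memory as a Python list. The name `prefix_search` and the constant `DICTIONARY` are illustrative choices, not taken from the slides.

```python
from bisect import bisect_left

def prefix_search(sorted_strings, pattern):
    """Return all strings of the sorted list that start with pattern."""
    # Binary search locates the first string >= pattern.
    start = bisect_left(sorted_strings, pattern)
    end = start
    # Strings sharing a prefix are contiguous in lexicographic order,
    # so we only scan forward while the prefix still matches.
    while end < len(sorted_strings) and sorted_strings[end].startswith(pattern):
        end += 1
    return sorted_strings[start:end]

# The seven strings used as the running example in the slides.
DICTIONARY = sorted(["systile", "syzygetic", "syzygial", "syzygy",
                     "szaibelyite", "szczecin", "szomo"])

if __name__ == "__main__":
    print(prefix_search(DICTIONARY, "syzyg"))
```

Hashing answers exact lookups, but it scatters strings that share a prefix, which is why prefix search relies on lexicographic order instead.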

(Compacted) Trie [Fredkin, CACM 1960]
• Dominated the string-matching scene in the ’80s-’90s
• Best known is the Suffix Tree
• Performance:
  • Search ≈ O(|P|) time
  • Space ≈ O(N)
• Software engineers objected:
  • Search: random memory accesses
  • Space: pointers + strings
[Figure: compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, stored contiguously as systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo; lexicographic search for P = systo]

Timeline: theory and practice... Trie (’60), Suffix Tree (’70-’80), ’90: What about Software Engineers ??

What did systems implement?
• Used the Compacted trie, of course, but with 2 other concerns because of large data

1° issue: space concern
Front Coding: each string is stored as the length of the prefix it shares with the previous string, followed by its remaining suffix.

systile syzygetic syzygial syzygy …. becomes 0 systile, 2 zygetic, 5 ial, 5 y ….

Sorted URLs:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded URLs:
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
0 http://checkmate.com/All/Natural/Washcloth.html
...

[Figure label: 3345%]
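Front coding as used above can be sketched in a few lines. This is an illustrative version, assuming the input list is already sorted; the function names `front_encode` and `front_decode` are hypothetical, and a real system would also pack the (length, suffix) pairs into bytes.

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length, remaining suffix)."""
    encoded, previous = [], ""
    for s in sorted_strings:
        lcp = 0
        # Length of the longest common prefix with the previous string.
        while lcp < min(len(s), len(previous)) and s[lcp] == previous[lcp]:
            lcp += 1
        encoded.append((lcp, s[lcp:]))
        previous = s
    return encoded

def front_decode(encoded):
    """Rebuild the original strings by a left-to-right scan."""
    decoded, previous = [], ""
    for lcp, suffix in encoded:
        previous = previous[:lcp] + suffix
        decoded.append(previous)
    return decoded

if __name__ == "__main__":
    words = ["systile", "syzygetic", "syzygial", "syzygy"]
    print(front_encode(words))
```

Note that decoding a string requires scanning from the start of its bucket, which is one reason the bucket size drives a speed versus space trade-off.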

2° issue: Disk memory
• 2 main features:
  • Seek time = I/Os are costly
  • Blocked access = B items per I/O
• Count I/Os
Why are strings challenging ? Strings may be arbitrarily long
[Figure: CPU and internal memory connected to a disk, whose tracks are read in blocks of size B]

2-level indexing
• Internal memory: CT on a sample (systile, szaibelyite)
• Disk: front-coded buckets of size B: …. 0 systile, 2 zygetic, 5 ial, 5 y | 0 szaibelyite, 2 czecin, 2 omo ….
• 2 advantages:
  • Search ≈ typically 1 disk access
  • Space ≈ Front-coding over buckets
• One main limitation: sampling rate & lengths of sampled strings
• Trade-off between speed vs space (because of bucket size)
• (Prefix) B-tree
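The two-level scheme can be illustrated as follows. This is a simplified sketch under stated assumptions: the in-memory sample is a plain sorted list searched with `bisect` (the slides use a compacted trie), the buckets are plain Python lists standing in for front-coded disk blocks, and the names `build_two_level` and `lookup` are hypothetical.

```python
from bisect import bisect_right

def build_two_level(sorted_strings, bucket_size):
    """Split the sorted dictionary into buckets and sample their first strings."""
    buckets = [sorted_strings[i:i + bucket_size]
               for i in range(0, len(sorted_strings), bucket_size)]
    sample = [bucket[0] for bucket in buckets]  # kept in internal memory
    return sample, buckets

def lookup(sample, buckets, query):
    """Exact search touching a single bucket (the one 'disk access')."""
    # In-memory step: last bucket whose first string is <= query.
    b = bisect_right(sample, query) - 1
    if b < 0:
        return False  # query precedes every string in the dictionary
    # Disk step: fetch and scan exactly one bucket.
    return query in buckets[b]

WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
SAMPLE, BUCKETS = build_two_level(WORDS, 3)

if __name__ == "__main__":
    print(SAMPLE)
    print(lookup(SAMPLE, BUCKETS, "szczecin"))
```

A larger bucket shrinks the in-memory sample but lengthens the scan of the fetched bucket, which is the trade-off named on the slide.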

Timeline: theory and practice... Trie (’60), Suffix Tree (’70-’80), 2-level indexing (’90), String B-tree (1995)
Space + Hierarchical Memory: do we need to trade space for I/Os ?

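An old idea: Patricia Trie [Morrison, J.ACM 1968]
[Figure: Patricia trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, with internal nodes labeled by their string depths (0, 1, 2, 2, 5); the strings themselves stay on disk as ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….]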

A new (lexicographic) search [Ferragina-Grossi, J.ACM 1999]
Lexicographic search: P = syzytea
• Search(P):
  • Phase 1: tree navigation
  • Phase 2: Compute LCP
  • Phase 3: tree navigation, yielding the lexicographic position of P
• Only 1 string is checked on disk
• Trie Space ≈ #strings, NOT their length
[Figure: Patricia trie whose edges keep only their branching characters and whose nodes are labeled by string depths (0, 1, 2, 2, 5); the strings stay on disk as ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….]

The String B-tree [Ferragina-Grossi, J.ACM 1999]
• Search(P): O((p/B) log_B K) I/Os + O(occ/B) I/Os
  • O(log_B K) levels
  • Check 1 string = O(p/B) I/Os
• It is dynamic...
• > 15 US patents cite it !!
• Knuth, vol. 3, pag. 489: “elegant”
[Figure: B-tree whose nodes each store a Patricia trie (PT) over string pointers; the search path leads to the lexicographic position of P]

I/O-aware algorithms & data structures. I/Os were the main concern [CACM 1988] [2006]. Huge literature !!

Timeline: theory and practice... Trie (’60), Suffix Tree (’70-’80), 2-level indexing (’90), String B-tree (1995), 1999: Not just 2 memory levels
• Memory hierarchy: CPU registers, L1, L2 (cache), RAM, HD, net
• Cache-oblivious solutions, aka parameter-free algo+ds
• Anywhere, anytime, anyway... I/O-optimal !!

Timeline: theory and practice... Trie (’60), Suffix Tree (’70-’80), 2-level indexing (’90), String B-tree (1995), Cache-oblivious data structures (1999; not just 2 memory levels), Compressed data structures (space)

A challenging question [Ken Church, AT&T 1995]: Software Engineers use “squeezing heuristics” that compress data and still support fast access to them. Can we “automate” and “guarantee” the process ?

Opportunistic Data Structures with Applications, P. Ferragina, G. Manzini ...now, J.ACM 2005
Aka: Compressed self-indexes
• Space for text+index ≈ space for compressed text only (Hk)
• Query/Decompression time ≈ theoretically (quasi-)optimal

The big (unconscious) step... [Burrows-Wheeler, 1994]
Let us be given a text T = mississippi#

All rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first column, middle, last column):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

Highly compressible, but…

The big (unconscious) step... [Burrows-Wheeler, 1994]
For T = mississippi#, the last column of the sorted rotation matrix is bwt(T) = ipssm#pissii
bzip2 = BWT + other simple compressors
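The transform on this slide can be reproduced with a naive sketch that sorts all rotations explicitly. It assumes the text ends with a unique sentinel `#` that sorts before every other character (true for ASCII letters); the names `bwt` and `inverse_bwt` are illustrative, and real compressors use suffix sorting rather than materializing the rotations.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, sentinel="#"):
    """Naive inversion: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original text is the only rotation ending with the sentinel.
    return next(row for row in table if row.endswith(sentinel))

if __name__ == "__main__":
    print(bwt("mississippi#"))
```

The output groups equal characters into runs, which is what the simple compressors applied after the BWT in bzip2 exploit.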

From practice to theory... [Ferragina-Manzini, IEEE FOCS ’00]

sa(T), sorted rotations of T (the last column is bwt(T)):
12 #mississipp i
11 i#mississip p
 8 ippi#missis s
 5 issippi#mis s
 2 ississippi# m
 1 mississippi #
10 pi#mississi p
 9 ppi#mississ i
 7 sippi#missi s
 4 sissippi#mi s
 6 ssippi#miss i
 3 ssissippi#m i

• FM-index = BWT is searchable
• ...or Suffix Array is compressible
• Space = |T| Hk + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Mäkinen, ACM Comp. Surveys 2007]
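To make "BWT is searchable" concrete, here is a hedged sketch of backward search that counts the occurrences of a pattern using only bwt(T). It is not the compressed FM-index itself: the rank operation is a naive scan over an uncompressed string, and the names `bwt_and_sa` and `count_occurrences` are assumptions made for illustration.

```python
from collections import Counter

def bwt_and_sa(text):
    """Return bwt(text) and the 0-based suffix array; text must end with a unique smallest sentinel."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    last = "".join(text[i - 1] for i in sa)  # text[-1] wraps around for i == 0
    return last, sa

def count_occurrences(last, pattern):
    """Count occurrences of pattern in the text by backward search on its BWT."""
    counts = Counter(last)
    # first_row[c] = number of characters in the text smaller than c,
    # i.e. the first row of the sorted matrix that starts with c.
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in last[:i]; a real index answers this in compressed space.
        return last[:i].count(c)

    lo, hi = 0, len(last)  # half-open range of rows prefixed by the processed suffix of pattern
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    last, sa = bwt_and_sa("mississippi#")
    print(last, count_occurrences(last, "ssi"))
```

The loop performs one step per pattern character, which matches the O(p) term of the search bound; locating the occ positions is the part that costs the extra polylog factor.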

Compressed & Searchable data formats
• After our paper in FOCS 2000, about texts
• We find nowadays compressed indexes for:
  • Trees
  • Labeled trees and graphs
  • Functions
  • Integer Sets
  • Geometry
  • Images
  • ...

From theory to practice… December 2003

ACM J. on Experimental Algorithmics, 2009

> 10^3 times faster than Smith-W.
> 10^2 times faster than SOAP & Maq

What about the Web ? [Ferragina-Manzini, ACM WSDM 2010]

An XML excerpt (IEEE FOCS 2005, WWW 2006, J. ACM 2009, US Patent 2012)

<dblp>
  <book>
    <author> Donald E. Knuth </author>
    <title> The TeXbook </title>
    <publisher> Addison-Wesley </publisher>
    <year> 1986 </year>
  </book>
  <article>
    <author> Donald E. Knuth </author>
    <author> Ronald W. Moore </author>
    <title> An Analysis of Alpha-Beta Pruning </title>
    <pages> 293-326 </pages>
    <year> 1975 </year>
    <volume> 6 </volume>
    <journal> Artificial Intelligence </journal>
  </article>
  ...
</dblp>

A tree interpretation: XBW transform
• XML document exploration ≡ Tree navigation
• XML document search ≡ Labeled subpath searches

XBW Transform: Some performance figures
• Xerces better on smaller files
• Xerces worse on larger files
• Xerces uses 10x space
[Plot: number of searches per second, on larger and larger datasets]

Where we are nowadays
Timeline: Trie (’60), Suffix Tree (’70-’80), 2-level indexing (’90), String B-tree (1995), Cache-oblivious data structures (1999), Compressed data structures
• Something is known... yet very preliminary
• Lower Bounds derived from Geometry: Text search = 2d Range Search

New food for research..
• Solid-state disks: no mechanical parts [E. Gal, S. Toledo. ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]
  • 40Gb, about 100$
  • ... very fast reads, but slow writes & wear leveling
• Self-adjusting or Weighted design [Ferragina et al, ESA 2011]
  • Time ops depend on some (un/known) distribution
  • Challenge: no pointers, self-adjust (perf) vs compression (space)

The energy challenge IEEE Computer, 2007

Browsing a web site The most used!

Yet today, it is a problem...
• Apple is still working on the battery life problem: “The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues.” (Nov 2011, wired.com)
• “Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)

Energy-aware Algo+Ds ?
• Memory-level impacts
• Locality pays off
• I/Os and compression are obviously important, BUT here there is a new twist

Battery life !! MIPS per Watt ?
Idea: Multi-objective optimization in data-structure design. Approach it in a principled way.
Who cares whether your application:
• is y% slower than optimal, but it is more energy efficient ?
• takes x% more space than optimal, but it is more energy efficient ?

A preliminary step
Took inspiration from BigTable (Google), ...
Design a compressed storage scheme that can trade in a principled way between space vs decompression time [vs energy efficiency]
Requirements: gzip-like compression [like Snappy (by Google) or LZ4]
Goal: Fix the space occupancy, find the best compression that achieves that space and minimizes the decompression time (or vice versa)
Example, parsing abracadabra after the prefix [abrac]:
[abrac] adabra -> [abrac] (a) (d) (abra) -> [abrac] <2,1> <0,d> <7,4>
Here <2,1> and <7,4> are copy-back phrases (distance, length), and <0,d> emits the new char d.
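The parsing example on this slide can be checked with a small decoder. This is a sketch under the assumption that a phrase `<d,l>` with d > 0 means "copy l characters starting d positions back" and `<0,c>` means "emit the new character c"; the function name `lz77_decode` is hypothetical.

```python
def lz77_decode(prefix, phrases):
    """Decode copy-back / new-char phrases appended to an already decoded prefix."""
    out = list(prefix)
    for distance, item in phrases:
        if distance == 0:
            out.append(item)  # item is a literal character
        else:
            # item is the copy length; copying one character at a time
            # also handles copies that overlap the text being produced.
            for _ in range(item):
                out.append(out[-distance])
    return "".join(out)

if __name__ == "__main__":
    print(lz77_decode("abrac", [(2, 1), (0, "d"), (7, 4)]))
```

Each phrase costs some bits to store and some time to decode (a far copy may cause a cache miss), which is what makes the choice of parsing a two-objective problem.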

A preliminary step...
• Modeled as a Constrained Shortest Path problem:
  • Nodes = one per char of the text to be compressed
  • Edges = single char or copy back substrings
  • 2 edge weights = decompression time (t) and compressed space (c)
• LZ-parsing = Path from 1 to 12
• NP-hard in general; this special case is POLY: O(n^3)
• But n is huge, and m might be n^2
• We solved heuristically (Lagrangian Dual) and provably (Path Swap)
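To illustrate the constrained shortest path formulation on a toy instance, the sketch below uses a plain pseudo-polynomial dynamic program over (node, space used). This is a swapped-in textbook technique, not the Lagrangian Dual or Path Swap methods named on the slide, and the edge weights, node numbering from 0, and the name `min_time_within_space` are all made up for the example.

```python
import math

def min_time_within_space(n, edges, budget):
    """Minimum total time of a path 0 -> n whose total space is <= budget.

    edges: list of (u, v, time, space) with u < v and integer space >= 0.
    Returns None when no path fits in the budget.
    """
    # best[v][s] = minimum time to reach node v using exactly s units of space.
    best = [[math.inf] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0
    # Since u < v on every edge, sorting by u processes all edges entering
    # a node before any edge leaving it (a topological order).
    for u, v, t, c in sorted(edges):
        for s in range(budget - c + 1):
            if best[u][s] + t < best[v][s + c]:
                best[v][s + c] = best[u][s] + t
    answer = min(best[n])
    return None if answer == math.inf else answer

# Toy graph for a 3-character text: single-char edges plus two copy-back edges.
EDGES = [
    (0, 1, 1, 9), (1, 2, 1, 9), (2, 3, 1, 9),  # single chars: fast, 9 bits each
    (0, 3, 4, 16),                             # long copy: slower, 16 bits
    (1, 3, 4, 12),                             # shorter copy: slower, 12 bits
]

if __name__ == "__main__":
    print(min_time_within_space(3, EDGES, 27))
    print(min_time_within_space(3, EDGES, 20))
```

The table has one entry per (node, budget value), so this approach only suits tiny instances; with n huge and up to n^2 edges, the methods cited on the slide are needed.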

A preliminary step...
