Algorithms and data structures for big data, what’s next?

Paolo Ferragina, University of Pisa

Is Big Data a buzzword? “Big Data” vs “Grid Computing”. VLDB has existed since 1992. Big data, big impact! Big data are everywhere!

Presentation Transcript
slide7

NoSQL

[Procs OSDI 2006]

Hadoop

Cassandra

HyperTable

Cosmos

from macro to micro users
From macro to micro-users

Energy is related to time and memory accesses in an intricate manner, so the issue “algo + memory levels” is key for everyday users, not only for big players

string dictionary problem
(String-)Dictionary Problem

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix searches for a pattern P.
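As a baseline for the problem statement above, prefix search can be answered on a plain sorted array by binary search. This is only a minimal sketch, not a structure taken from the slides; the function name `prefix_range` is hypothetical and the word list is the one that appears in the trie figures of this deck.

```python
from bisect import bisect_left

def prefix_range(sorted_strings, pattern):
    """Return (lo, hi) so that sorted_strings[lo:hi] are exactly the
    strings having `pattern` as a prefix. `lo` is also the
    lexicographic position of `pattern` in the dictionary."""
    lo = bisect_left(sorted_strings, pattern)
    hi = lo
    # Strings prefixed by `pattern` are contiguous and start at `lo`.
    while hi < len(sorted_strings) and sorted_strings[hi].startswith(pattern):
        hi += 1
    return lo, hi

D = sorted(["systile", "syzygetic", "syzygial", "syzygy",
            "szaibelyite", "szczecin", "szomo"])
lo, hi = prefix_range(D, "syzyg")
print(D[lo:hi])
```

The binary search costs O(log K) string comparisons, each of which may scan up to |P| characters, which is one reason the trie-based solutions discussed next are attractive.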

Exact search ⇒ Hashing

compacted trie

[Fredkin, CACM 1960]

Dominated the string-matching scene in the ’80s-’90s

The best known is the Suffix Tree

(Compacted) Trie
  • Performance:
  • Search ≈ O(|P|) time
  • Space ≈ O(N)
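The O(|P|) search bound can be seen in a small trie sketch. For brevity this version is an uncompacted trie (one node per character), which is an assumption of the sketch; a compacted trie would merge unary paths into single edges. The names `TrieNode`, `trie_insert` and `trie_prefix_search` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_end = False   # True if a dictionary string ends here

def trie_insert(root, s):
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def trie_prefix_search(root, pattern):
    """Walk down |P| characters, then list the subtree in lexicographic order."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    out, stack = [], [(node, pattern)]
    while stack:
        cur, s = stack.pop()
        if cur.is_end:
            out.append(s)
        # Push children in reverse order so the smallest is popped first.
        for ch in sorted(cur.children, reverse=True):
            stack.append((cur.children[ch], s + ch))
    return out

root = TrieNode()
for w in ["systile", "syzygetic", "syzygial", "syzygy",
          "szaibelyite", "szczecin", "szomo"]:
    trie_insert(root, w)
print(trie_prefix_search(root, "syzyg"))
```

Each step of the downward walk follows a pointer to a different node, which illustrates the "random memory accesses" objection raised on this slide.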


  • Software engineers objected:
  • Search: random memory accesses
  • Space: pointers + strings

[Figure: compacted trie over the dictionary strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo]

Lexicographic search: P = systo

systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo

timeline theory and practice
Timeline: theory and practice...

What about Software Engineers?

[Timeline figure. Labels: Trie, Suffix Tree; dates: ’60, ’70-’80, ’90]

slide16

  • What did systems implement?

They used the compacted trie, of course, but with 2 other concerns because of large data
1 issue space concern

Front-coded entries: (2,zygetic) (5,ial) (5,y)

3345%

0 http://checkmate.com/All/Natural/Washcloth.html...

1° issue: space concern

Front Coding

systile syzygetic syzygial syzygy….
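Front coding stores each string as the length of the prefix it shares with its predecessor, followed by the remaining suffix. The sketch below is a minimal illustration under assumptions: the function names are hypothetical, and the bucket size of 4 is chosen only so that the output reproduces the layout "0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo" shown on the 2-level indexing slide, where the first string of every bucket is stored in full.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def front_encode(sorted_strings, bucket_size):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs.
    The first string of each bucket is written uncompressed."""
    out = []
    for i, s in enumerate(sorted_strings):
        if i % bucket_size == 0:
            out.append((0, s))
        else:
            k = lcp_len(sorted_strings[i - 1], s)
            out.append((k, s[k:]))
    return out

def front_decode(pairs):
    """Rebuild the strings by reusing the first k chars of the previous one."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
encoded = front_encode(words, bucket_size=4)
print(" ".join(f"{k}{suffix}" for k, suffix in encoded))
```

Restarting the encoding at every bucket boundary costs some space but lets a bucket be decoded without reading the ones before it.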

0 http://checkmate.com/All_Natural/

33 Applied.html

34 roma.html

38 1.html

38 tic_Art.html

34 yate.html

35 er_Soap.html

35 urvedic_Soap.html

33 Bath_Salt_Bulk.html

42 s.html

25 Essence_Oils.html

25 Mineral_Bath_Crystals.html

38 Salt.html

33 Cream.html

http://checkmate.com/All_Natural/

http://checkmate.com/All_Natural/Applied.html

http://checkmate.com/All_Natural/Aroma.html

http://checkmate.com/All_Natural/Aroma1.html

http://checkmate.com/All_Natural/Aromatic_Art.html

http://checkmate.com/All_Natural/Ayate.html

http://checkmate.com/All_Natural/Ayer_Soap.html

http://checkmate.com/All_Natural/Ayurvedic_Soap.html

http://checkmate.com/All_Natural/Bath_Salt_Bulk.html

http://checkmate.com/All_Natural/Bath_Salts.html

http://checkmate.com/All/Essence_Oils.html

http://checkmate.com/All/Mineral_Bath_Crystals.html

http://checkmate.com/All/Mineral_Bath_Salt.html

http://checkmate.com/All/Mineral_Cream.html

http://checkmate.com/All/Natural/Washcloth.html

...

2 issue disk memory

track

2° issue: Disk memory

B

  • 2 main features:
  • Seek time = I/Os are costly
  • Blocked access = B items per I/O

Count I/Os

Why are strings challenging?

[Figure: CPU and internal memory]

Strings may be arbitrarily long

2 level indexing

2-level indexing
  • 2 advantages:
  • Search ≈ typically 1 disk access
  • Space ≈ Front-coding over buckets

[Figure: a compacted trie (CT) built on a sample of the strings sits in internal memory; the buckets sit on disk]

One main limitation:

Sampling rate & lengths of sampled strings

Trade-off between speed and space (because of bucket size)

Sampled strings: systile, szaibelyite

(Prefix) B-tree

B

B

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….
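The two-level scheme can be sketched as follows: the first string of each bucket is kept in memory, and a lookup decodes exactly one front-coded bucket. This is a simplified illustration under assumptions: a sorted Python list with `bisect` stands in for the compacted trie on the sample, the "disk" is an in-memory list of buckets, and all function names are hypothetical.

```python
from bisect import bisect_right

def build_two_level(sorted_strings, bucket_size):
    """Return (sample, buckets): sample holds the first string of each
    bucket; each bucket is front-coded as (shared_prefix_length, suffix)."""
    buckets = []
    for i in range(0, len(sorted_strings), bucket_size):
        group = sorted_strings[i:i + bucket_size]
        coded = [(0, group[0])]
        for prev, cur in zip(group, group[1:]):
            k = 0
            while k < min(len(prev), len(cur)) and prev[k] == cur[k]:
                k += 1
            coded.append((k, cur[k:]))
        buckets.append(coded)
    sample = [coded[0][1] for coded in buckets]
    return sample, buckets

def decode_bucket(coded):
    out, prev = [], ""
    for k, suffix in coded:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

def contains(sample, buckets, query):
    """In-memory search on the sample, then one simulated bucket access."""
    j = bisect_right(sample, query) - 1
    if j < 0:
        return False          # query precedes every stored string
    return query in decode_bucket(buckets[j])

words = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
sample, buckets = build_two_level(words, bucket_size=4)
print(sample)
```

A larger bucket shrinks the in-memory sample but makes each bucket scan longer, which is the speed versus space trade-off named on the slide.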

timeline theory and practice1
Timeline: theory and practice...

Space + Hierarchical Memory

Do we need to trade space for I/Os?

[Timeline figure. Labels: Trie, Suffix Tree, 2-level indexing, String B-tree; dates: ’60, ’70-’80, ’90, 1995]

an old idea patricia trie

[Morrison, J.ACM 1968]

An old idea: Patricia Trie

[Figure: Patricia trie built over the dictionary strings, which are stored on disk]

Disk: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

a new lexicographic search

[Ferragina-Grossi, J.ACM 1999]

A new (lexicographic) search

Lexicographic search: P = syzytea

  • Search(P):
  • Phase 1: tree navigation
  • Phase 2: Compute LCP
  • Phase 3: tree navigation

[Figure: Patricia trie navigated to find the lexicographic position of P]

Only 1 string is checked on disk

Trie Space ≈ #strings, NOT their length

Disk: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

the string b tree

  • Search(P)
  • O((p/B) logB K) I/Os + O(occ/B) I/Os
  • Check 1 string = O(p/B) I/Os
  • O(logB K) levels

It is dynamic...
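The search bound on this slide can be read as the number of levels times the cost of checking one string per level, plus the cost of reporting the occurrences. A worked restatement, assuming p = |P|, B = the number of items per disk block, K = the number of strings and occ = the number of occurrences:

```latex
\underbrace{O(\log_B K)}_{\text{levels}}
\times
\underbrace{O\!\left(\frac{p}{B}\right)}_{\text{check 1 string per level}}
+
\underbrace{O\!\left(\frac{\mathrm{occ}}{B}\right)}_{\text{report occurrences}}
=
O\!\left(\frac{p}{B}\,\log_B K + \frac{\mathrm{occ}}{B}\right)\ \text{I/Os}
```

As a purely illustrative instance, with B = 1024 and K = 10^9 the tree has about log_B K ≈ 3 levels, so a short pattern that fits in one block is located in a handful of I/Os.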

[Figure: String B-tree; every node stores a Patricia trie (PT) over pointers to strings]

Leaves: 29 1 9 5 2 | 26 10 4 7 13 | 20 16 28 8 25 | 6 12 15 22 18 | 3 27 24 11 14 | 21 17 23

Internal nodes: 29 2 26 13 | 20 25 6 18 | 3 14 21 23

[Ferragina-Grossi, J.ACM 1999]

The String B-tree

More than 15 US patents cite it!

Root node: 29 13 20 18 3 23

Lexicographic position of P

Knuth, vol. 3, p. 489: “elegant”

i o aware algorithms data structures
I/O-aware algorithms & data structures

I/Os were the main concern

[CACM 1988]

[2006]

Huge literature !!

timeline theory and practice2

Timeline: theory and practice...

[Figure: memory hierarchy with CPU registers, L1 and L2 cache, RAM, HD, net]

Not just 2 memory levels

[Timeline figure. Labels: Trie, Suffix Tree, 2-level indexing, String B-tree; dates: ’60, ’70-’80, ’90, 1995, 1999]

  • Cache-oblivious solutions, aka parameter-free algo+ds
    • Anywhere, anytime, anyway... I/O-optimal !!
timeline theory and practice3
Timeline: theory and practice...

Not just 2 memory levels

[Timeline figure. Labels: Trie, Suffix Tree, 2-level indexing, String B-tree, Cache-oblivious data structures, Compressed data structures, Space; dates: ’60, ’70-’80, ’90, 1995, 1999]

slide27

A challenging question [Ken Church, AT&T 1995]

Software Engineers use “squeezing heuristics” that compress data and still support fast access to them

Can we “automate” and “guarantee” the process ?

aka compressed self indexes

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

Aka: Compressed self-indexes

...now, J.ACM 2005

  • Space for text + index ≈ space for compressed text only (Hk)
  • Query/Decompression time ≈ theoretically (quasi-)optimal
the big unconscious step

[Burrows-Wheeler, 1994]

The big (unconscious) step...

Let us be given a text T = mississippi#

Rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first and last characters set apart):

# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

Highly compressible, but…

the big unconscious step1

[Burrows-Wheeler, 1994]

The big (unconscious) step...

Let us be given a text T = mississippi#

[Figure: the rotations of T next to the sorted rows; the last column of the sorted rows is bwt(T) = ipssm#pissii]

bzip2 = BWT + other simple compressors
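The transform on these slides can be reproduced in a few lines by sorting all rotations and reading the last column. This is a didactic sketch, not how bzip2 computes the transform: it uses quadratic space, and it assumes the end marker `#` occurs once and is smaller than every other character. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via explicit sorting of all rotations."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, end_marker="#"):
    """Invert the transform, assuming `end_marker` occurs exactly once
    and is the last character of the original text."""
    n = len(last)
    # Stable sort of the positions of `last` by character: order[j] is the
    # row whose last character is the same text character as the first
    # character of row j (the k-th c in F matches the k-th c in L).
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(end_marker)   # the row equal to the text itself
    out = []
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return "".join(out)

T = "mississippi#"
L = bwt(T)
print(L)
```

The runs of equal characters in the output are what make the transformed text friendly to simple compressors.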

from practice to theory

From practice to theory...

[Ferragina-Manzini, IEEE FOCS ’00]

sa(T) with the sorted rotations; bwt(T) is the last column:

12 #mississippi
11 i#mississipp
8 ippi#mississ
5 issippi#miss
2 ississippi#m
1 mississippi#
10 pi#mississip
9 ppi#mississi
7 sippi#missis
4 sissippi#mis
6 ssippi#missi
3 ssissippi#mi

  • FM-index = BWT is searchable
  • ...or Suffix Array is compressible
  • Space = l |T| Hk + o(|T|) bits
  • Search(P) = O(p + occ * polylog(|T|))
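To illustrate why the BWT is searchable, here is a minimal backward-search sketch that counts the occurrences of a pattern using only the last column plus two small tables. It is not a compressed index: the occurrence table below is stored in full, which is an assumption made for clarity, and the names `build_fm` and `count` are hypothetical. The text is assumed to end with a unique `#` that is smaller than every other character.

```python
def build_fm(text):
    """Return (sa, last, C, occ) for a text ending with a unique smallest '#'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # 0-based suffix array
    last = "".join(text[i - 1] for i in sa)         # bwt; text[-1] wraps around
    alphabet = sorted(set(text))
    C, total = {}, 0
    for ch in alphabet:                             # C[ch] = # chars smaller than ch
        C[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in last[:i]
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for i, ch in enumerate(last):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return sa, last, C, occ

def count(pattern, C, occ, n):
    """Backward search: one step per pattern character, right to left."""
    lo, hi = 0, n                                   # current row range [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

T = "mississippi#"
sa, last, C, occ = build_fm(T)
print(count("ssi", C, occ, len(T)))
```

Locating the occurrences, rather than only counting them, additionally needs (sampled) suffix-array values, which is where the occ-dependent term of the search bound comes from.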

Nowadays tons of papers: theory & experiments

[Navarro-Makinen, ACM Comp. Surveys 2007]

compressed searchable data formats
Compressed & Searchable data formats
  • After our paper in FOCS 2000, about texts
  • Nowadays we find compressed indexes for:
    • Trees
    • Labeled trees and graphs
    • Functions
    • Integer Sets
    • Geometry
    • Images
    • ...
slide35

> 10^3 faster than Smith-W.

> 10^2 faster than SOAP & Maq

slide36

What about the Web ?

[Ferragina-Manzini, ACM WSDM 2010]

an xml excerpt

IEEE FOCS 2005

WWW 2006

J. ACM 2009

US Patent 2012

An XML excerpt

<dblp>

<book>

<author> Donald E. Knuth</author>

<title> The TeXbook </title>

<publisher> Addison-Wesley </publisher>

<year> 1986 </year>

</book>

<article>

<author> Donald E. Knuth </author>

<author> Ronald W. Moore </author>

<title> An Analysis of Alpha-Beta Pruning </title>

<pages> 293-326 </pages>

<year> 1975 </year>

<volume> 6 </volume>

<journal> Artificial Intelligence </journal>

</article>

...

</dblp>

a tree interpretation
A tree interpretation

XBW

transform

  • XML document exploration → Tree navigation
  • XML document search → Labeled subpath searches
xbw transform some performance figures
XBW Transform: Some performance figures

Xerces better on smaller files

Xerces worse on larger files

Xerces uses 10x space

[Plot: number of searches per second on larger and larger datasets]

where we are nowadays
Where we are nowadays

[Timeline figure. Labels: Trie, Suffix Tree, 2-level indexing, String B-tree, Cache-oblivious data structures, Compressed data structures; dates: ’60, ’70-’80, ’90, 1995, 1999]

Something is known... yet very preliminary

Lower Bounds derived from Geometry

Text search = 2d Range Search

new food for research
New food for research..

40Gb, about 100$

  • [E. Gal, S. Toledo. ACM Comp. Surv., 2005]

[Ajwani et al, WEA 2009]

  • Solid-state disks: no mechanical parts
    • ... very fast reads, but slow writes & wear leveling
  • Self-adjusting or Weighted design
    • Operation times depend on some (un)known distribution
    • Challenge: no pointers, self-adjust (perf) vs compression (space)

[Ferragina et al, ESA 2011]

the energy challenge
The energy challenge

IEEE Computer, 2007

browsing a web site
Browsing a web site

The most used!

yet today it is a problem
Yet today, it is a problem...

Apple is still working on the battery life problem: “The recent iOS software update addressed many of the battery issues that some customers experienced on their iOS 5 devices. We continue to investigate a few remaining issues.” (Nov 2011, wired.com)

“Windows 8's power hygiene: the scheduler will ignore the unused software” (Feb 2012, MSDN)

energy aware algo ds
Energy-aware Algo+Ds ?

Memory-level impacts

Locality pays off

I/Os and compression are obviously important, BUT here there is a new twist

mips per watt

Battery life !!

MIPS per Watt ?

Idea: Multi-objective optimization in data-structure design

Approach in a principled way

Who cares whether your application:
  • is y% slower than optimal, but is more energy efficient?
  • takes x% more space than optimal, but is more energy efficient?

a preliminary step
A preliminary step

Took inspiration from BigTable (Google), ...

Design a compressed storage scheme that can trade in a principled way between space vs decompression time [vs energy efficiency]

Requirements: gzip-like compression [like Snappy (by Google) or LZ4]

Goal: Fix the space occupancy, find the best compression that achieves that space and minimizes the decompression time (or vice versa)

Copy back

new char

Copy back

[abrac] adabra -> [abrac] (a) (d) (abra) -> [abrac] <2,1> <0,d> <7,4>
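The parsing above can be checked with a tiny decoder. This sketch assumes the reading suggested by the example: a phrase `<d, l>` with d > 0 copies l characters starting d positions back, while `<0, c>` emits the new character c. The function name `lz_decode` is hypothetical.

```python
def lz_decode(prefix, phrases):
    """Decode LZ77-style phrases appended to an already-decoded prefix."""
    out = list(prefix)
    for dist, arg in phrases:
        if dist == 0:
            out.append(arg)               # <0, c>: new character c
        else:
            start = len(out) - dist       # <d, l>: copy l chars from d back
            for i in range(arg):          # char by char, so overlapping copies work
                out.append(out[start + i])
    return "".join(out)

print(lz_decode("abrac", [(2, 1), (0, "d"), (7, 4)]))
```

Each copy phrase costs a jump back into the already-decoded text, which is why different parsings of the same text can differ in decompression time as well as in size.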

a preliminary step1
A preliminary step...

NP-hard in general

This special case is POLY: O(n^3)

  • Modeled as a Constrained Shortest Path problem:
    • Nodes = one per char of the text to be compressed
    • Edges = single char or copy back substrings
    • 2 edge weights = decompression time (t) and compressed space (c)
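The model in the list above can be made concrete with a small dynamic program over a DAG whose nodes are text positions. This is only an illustrative pseudo-polynomial sketch, and it is neither the Lagrangian Dual heuristic nor the Path Swap algorithm mentioned in this deck. It assumes non-negative integer space costs and edges (u, v, t, c) with u < v; the function name and the toy graph are hypothetical.

```python
import math

def min_time_within_space(n, edges, budget):
    """Minimum total decompression time of a path 0 -> n whose total
    compressed space is <= budget, or None if no such path exists.
    edges: iterable of (u, v, t, c) with u < v, time t, integer space c."""
    INF = math.inf
    # best[v][s] = min time to reach node v using exactly s units of space
    best = [[INF] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0
    out_edges = [[] for _ in range(n + 1)]
    for u, v, t, c in edges:
        out_edges[u].append((v, t, c))
    for u in range(n):                    # increasing u is a topological order
        for s in range(budget + 1):
            if best[u][s] == INF:
                continue
            for v, t, c in out_edges[u]:
                if s + c <= budget:
                    best[v][s + c] = min(best[v][s + c], best[u][s] + t)
    answer = min(best[n])
    return None if answer == INF else answer

# Toy instance: three single-char edges plus two "copy back" edges.
toy_edges = [(0, 1, 1, 2), (1, 2, 1, 2), (2, 3, 1, 2),
             (0, 2, 3, 2), (0, 3, 5, 3)]
print(min_time_within_space(3, toy_edges, budget=4))
```

The table has (n + 1) × (budget + 1) entries, so this approach is practical only for small instances, unlike the scale the slides have in mind.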

n is huge

m might be n^2

LZ-parsing = Path from 1 to 12

We solved heuristically (Lagrangian Dual) and provably (Path Swap)

slide51

We mainly commented:

1990s:
  • Data Bases: Hierarchical memories and I/Os
  • String Matching: RAM model, char cmp and time

2000s:
  • Data Compression: Space reduction in indexes and entropy space-bounds

2010s:
  • Computational Geometry: Lower bounds on indexes; new upper bounds on I/Os, entropy

Nowadays…
  • Graph Theory: Space reduction in compressors
  • Optimization: Multi-objective design and joules

a quote to conclude

A quote to conclude

“The distance between theory and practice is closer in theory than in practice”

[Y. Matias, Google]

Big steps come from theory... but do NOT forget practice ;-)