
Towards a Synopsis Warehouse

Peter J. Haas

IBM Almaden Research Center

San Jose, CA


Acknowledgements

Acknowledgements:

Kevin Beyer

Paul Brown

Rainer Gemulla (TU Dresden)

Wolfgang Lehner (TU Dresden)

Berthold Reinwald

Yannis Sismanis


Information Discovery for the Enterprise

[Figure: enterprise information-discovery architecture. Structured sources (ERP (SAP), CRM, WBI BPM, SCM), semi-structured sources (ECM: reports, spreadsheets, financial docs (XBRL)), unstructured sources (office documents, e-mail, product manuals), the crawlable/deep Web, company data, and syndicated data providers are ingested via crawl/ETL, then analyzed and integrated into an enterprise repository of content, metadata, and business objects (Account, Order, Customer). The repository supports search, business intelligence, business-object discovery, and data analysis & similarity.]

Query: “Explain the product movement, buyer behavior, maximize the ROI on my product campaigns.”

Query: “The sales team is visiting company XYZ next week. What do they need to know about XYZ?”


Motivation, Continued

  • Challenge: Scalability

    • Massive amounts of data at high speed

      • Batches and/or streams

    • Structured, semi-structured, unstructured data

  • Want quick approximate analyses

    • Automated data integration and schema discovery

    • “Business object” identification

    • Quick approximate answers to queries

    • Data browsing/auditing

  • Our approach: a warehouse of synopses

    • For scalability and flexibility


A Synopsis Warehouse

[Figure: a full-scale warehouse of data partitions is summarized by a warehouse of synopses; per-partition synopses S1,1, S1,2, …, Sn,m are merged into synopses of combined partitions such as S1-2,3-7 and, ultimately, S*,*.]


Outline

  • Synopsis 1: Uniform samples

    • Background

    • Creating and combining samples

      • Hybrid Bernoulli and Hybrid Reservoir algorithms

    • Updating samples

      • Stable datasets: random pairing

      • Growing datasets: resizing algorithms

      • Maintaining Bernoulli samples of multisets

  • Synopsis 2: AKMV samples for DV estimation

    • Base partitions: KMV synopses

      • DV estimator and properties

    • Compound partitions: augmentation

      • DV estimator and closure properties


Synopsis 1: Uniform Samples

[Figure: a uniform sample of the data feeds stratified samples and other synopses, which in turn feed statistical procedures and mining algorithms.]

  • Design goals

    • True uniformity

    • Bounded memory

    • Keep sample full

    • Support for compressed samples

      • 80% of 1000 customer datasets had < 4000 distinct values


Classical Uniform Methods

  • Bernoulli sampling

    • Coin flip: includes each element with prob = q

    • Random, unbounded (binomial) sample size

    • Easy to merge: Bern(q) ∪ Bern(q) = Bern(q)

  • Reservoir sampling

    • Creates uniform sample of fixed size k

      • Insert first k elements into sample

      • Then insert ith element with prob. pi = k / i

    • Variants and optimizations (e.g., Vitter)

    • Merging is harder

[Figure: reservoir sampling of the stream x1, x2, …, x6 with sample size 3; the current sample is {x1, x2, x4}, and the next arriving element is included with probability 3/5.]

A short Python sketch of both classical methods follows.
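The following is a minimal Python sketch (mine, not from the talk) of the two classical methods just described: Bern(q) sampling with its random, binomial sample size, and reservoir sampling with a fixed size k. Function names and the toy stream are illustrative only.

```python
import random

def bernoulli_sample(stream, q):
    """Bern(q): include each element independently with probability q."""
    return [x for x in stream if random.random() < q]

def reservoir_sample(stream, k):
    """Fixed-size uniform sample: keep the first k elements, then include
    the i-th element with probability k/i, overwriting a random victim."""
    sample = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)
        elif random.random() < k / i:
            sample[random.randrange(k)] = x
    return sample

print(bernoulli_sample(range(100), q=0.1))  # random size, roughly Binomial(100, 0.1)
print(reservoir_sample(range(100), k=10))   # always exactly 10 elements
```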


Drawback of Basic Methods

  • Neither method is very compact

    • Ex: dataset = (<A,500>,<B,300>)

    • Stored as (A,A,…,A,B,B,…B) - 800 chars

  • Concise sampling (GM 98)

    • Compact: purge Bern(q) sample S if too large

      • Bern(q’/q) subsample of S ⇒ Bern(q’) sample

    • Not uniform (rare items under-represented)


New Sampling Methods (ICDE ’06)

  • Two flavors:

    • Hybrid reservoir (HR)

    • Hybrid Bernoulli (HB)

  • Properties

    • Truly uniform

    • Bounded footprint at all times

    • Will store exact distribution if possible

    • Samples stored in compressed form

    • Merging algorithms available


Hybrid Reservoir (HR) Sampling

Ex: Sample capacity = two <v,#> pairs or three values

  • Phase 1 (maintain exact frequency distribution): +a, +a → {<a,2>}; +a → {<a,3>}; +b → {<a,3>,<b,1>}; +b → {<a,3>,<b,2>}

  • +c: the exact distribution no longer fits, so subsample (e.g., to {a,<b,2>}), expand to {a,b,b}, and switch to Phase 2

  • Phase 2 (reservoir sampling): e.g., +c → {c,b,b}; +d → {c,b,d}; …

  • When done, compress the final sample (e.g., a sample {a,a,a} is stored as {<a,3>})


Hybrid Bernoulli

  • Similar to Hybrid Reservoir except

    • Expand into Bernoulli sample in Phase 2

    • Revert to Reservoir sample in Phase 3

  • If termination in Phase 2

    • Uniform sample

    • “Almost” a Bernoulli sample (controllable engineering approximation)


Merging Samples

  • Both samples in Phase 2 (usual case)

    • Bernoulli: equalize q’s and take union (sketched in code below)

      • Take subsample to equalize q’s

    • Reservoir: take subsamples and merge

      • Random (hypergeometric) subsample size

  • Corner cases

    • One sample in Phase 1, etc.

    • See ICDE ’06 paper for details
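A minimal sketch of the usual-case Bernoulli branch (my own illustration of the "equalize q's and take union" step above, for samples drawn from disjoint partitions):

```python
import random

def merge_bernoulli(sample1, q1, sample2, q2):
    """Merge Bern(q1) and Bern(q2) samples of disjoint partitions by
    equalizing rates: thinning a Bern(q) sample at rate q'/q yields a
    Bern(q') sample, and the union of two Bern(q') samples is Bern(q')."""
    q = min(q1, q2)
    thinned1 = [x for x in sample1 if random.random() < q / q1]
    thinned2 = [x for x in sample2 if random.random() < q / q2]
    return thinned1 + thinned2, q
```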


HB versus HR

  • Advantages:

    • HB samples are cheaper to merge

  • Disadvantages:

    • HR sampling controls sample size better

    • Need to know partition size in advance

      • For subsampling during sample creation

    • Engineering approximation required


Speedup: HB Sampling

Parallelism yields speed-up advantages for up to about 100 partitions.


Speedup: HR Sampling

Results are similar to the previous slide, but merging HR samples is more complex than merging HB samples.


Linear Scale-Up

[Figure: both HB sampling and HR sampling exhibit linear scale-up.]


Updates Within a Partition

  • Arbitrary inserts/deletes (updates trivial)

  • Previous goals still hold

    • True uniformity

    • Bounded sample size

    • Keep sample size close to upper bound

  • Also: minimize/avoid base-data access

[Figure: inserts, deletes, and updates applied to a partition of the full-scale warehouse must be reflected in the corresponding sample in the synopsis warehouse; refreshing the sample by accessing the base data is expensive.]


New Algorithms (VLDB ’06+)

  • Stable datasets: Random pairing (see the sketch after this list)

    • Generalizes reservoir/stream sampling

      • Handles deletions

      • Avoids base-data accesses

    • Dataset insertions paired randomly with “uncompensated deletions”

      • Only requires counters (cg, cb) of “good” and “bad” UD’s

      • Insert into sample with probability cb / (cb + cg)

    • Extended sample-merging algorithm (VLDBJ ’07)

  • Growing datasets: Resizing

    • Theorem: can’t avoid base-data access

    • Main ideas:

      • Temporarily convert to Bern(q): may require base-data access

      • Drift up to new size (stay within new footprint at all times)

      • Choose q optimally to reduce overall resizing time

        • Approximate and Monte Carlo methods
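Here is a simplified sketch of the random-pairing idea as summarized above: deletions are remembered in the counters cb ("bad", the deleted item was in the sample) and cg ("good", it was not), and a later insertion is paired with one of these uncompensated deletions, entering the sample with probability cb / (cb + cg). Class and method names are mine; the VLDB '06 paper gives the full algorithm.

```python
import random

class RandomPairingSample:
    """Sketch of random pairing: a bounded uniform sample maintained under
    inserts and deletes without accessing base data (simplified)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.sample = []
        self.n = 0       # current dataset size
        self.c_bad = 0   # uncompensated deletions that removed a sample item
        self.c_good = 0  # uncompensated deletions that missed the sample

    def delete(self, item):
        self.n -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_bad += 1
        else:
            self.c_good += 1

    def insert(self, item):
        self.n += 1
        if self.c_bad + self.c_good == 0:
            # nothing to compensate: ordinary reservoir step
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif random.random() < self.capacity / self.n:
                self.sample[random.randrange(self.capacity)] = item
        else:
            # pair the insertion with a random uncompensated deletion
            if random.random() < self.c_bad / (self.c_bad + self.c_good):
                self.sample.append(item)
                self.c_bad -= 1
            else:
                self.c_good -= 1
```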


Bernoulli Samples of Multisets (PODS ’07)

  • Bernoulli samples over multisets (w. deletions)

    • When boundedness is not an issue

    • Compact, easy to parallelize

    • Problem: how to handle deletions (pairing?)

  • Idea: maintain “tracking counter”

    • # inserts into DS since first insertion into sample (GM98)

  • Can exploit tracking counter

    • To estimate frequencies, sums, avgs

      • Unbiased (except avg) and low variance

    • To estimate # distinct values (!)

  • Maintaining tracking counter

    • Subsampling: new algorithm

    • Merging: negative result


Outline

  • Synopsis 1: Uniform samples

    • Background

    • Creating and combining samples

      • Hybrid Bernoulli and Hybrid Reservoir algorithms

    • Updating samples

      • Stable datasets: random pairing

      • Growing datasets: resizing algorithms

      • Maintaining Bernoulli samples of multisets

  • Synopsis 2: AKMV samples for DV estimation

    • Base partitions: KMV synopses

      • DV estimator and properties

    • Compound partitions: augmentation

      • DV estimator and closure properties


AKMV Samples (SIGMOD ’07)

  • Goal: Estimate # distinct values

    • Dataset similarity (Jaccard distance)

    • Key detection

    • Data cleansing

  • Within warehouse framework

    • Must handle multiset union, intersection, difference


KMV Synopsis

  • Used for a base partition

  • Synopsis: k smallest hashed values

    • vs bitmaps (e.g., logarithmic counting)

      • Need inclusion/exclusion to handle intersection

      • Less accuracy, poor scaling

    • vs sample counting

      • Random size K (between k/2 and k)

    • vs Bellman [DJMS02]

      • minHash for k independent hash functions

      • O(k) time per arriving value, vs O(log k)

  • Can view as uniform sample of DV’s


The Basic Estimator

  • Estimator: based on U(k) = kth smallest (normalized) hashed value (a code sketch follows this list)

  • Properties (theory of uniform order statistics)

    • Normalized hashed values “look like” i.i.d. uniform[0,1] RVs

  • Large-D scenario (simpler formulas)

    • Theorem: U(k) is approximately distributed as the sum of k i.i.d. Exp(D) random variables

    • Analysis coincides with [Cohen97]

    • Can use simpler formulas to choose synopsis size
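A small sketch of the KMV synopsis and the basic estimator (k − 1)/U(k) discussed on the "Intuition" slide later in the deck. The hash function and its normalization to [0, 1) are placeholders of my own, and the construction below is batch rather than streaming.

```python
import hashlib

def normalized_hash(value):
    """Placeholder hash behaving roughly like a Uniform[0,1) variable."""
    digest = hashlib.sha1(str(value).encode()).hexdigest()
    return int(digest, 16) / float(16 ** 40)

def kmv_synopsis(values, k):
    """KMV synopsis: the k smallest distinct normalized hash values."""
    return sorted({normalized_hash(v) for v in values})[:k]

def basic_dv_estimate(synopsis):
    """Basic estimator: D_hat = (k - 1) / U_(k)."""
    k = len(synopsis)
    return (k - 1) / synopsis[-1]

data = [i % 500 for i in range(10000)]              # 500 distinct values
print(basic_dv_estimate(kmv_synopsis(data, k=64)))  # roughly 500
```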


Compound Partitions

  • Given a multiset expression E

    • In terms of base partitions A1,…,An

    • Union, intersection, multiset difference

  • Augmented KMV synopsis

    • KMV synopsis for A1 ∪ ⋯ ∪ An

    • Counters: cE(v) = multiplicity of value v in E

    • AKMV synopses are closed under multiset operations

  • Estimator (unbiased) for # DVs in E: based on KE = # of positive counters (a sketch follows)
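A hypothetical sketch of an AKMV synopsis and its closure under multiset operations, following the description above (k smallest hashed values plus a counter per retained value). The counter-combining rules and the final estimator form, (KE / k) · (k − 1) / U(k), are my reading of the slide rather than quotes from the SIGMOD '07 paper.

```python
import hashlib

def normalized_hash(value):
    """Placeholder hash behaving roughly like a Uniform[0,1) variable."""
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) / float(16 ** 40)

def akmv_from_partition(values, k):
    """AKMV synopsis of a base partition: {hash: multiplicity} over the
    k smallest distinct hashes of the partition's values."""
    counts = {}
    for v in values:
        h = normalized_hash(v)
        counts[h] = counts.get(h, 0) + 1
    return {h: counts[h] for h in sorted(counts)[:k]}

def akmv_combine(a, b, k, combine_counts):
    """Closure step: keep the k smallest hashes of the union of the two
    synopses and combine the counters value-wise (e.g., addition for
    additive multiset union, min for intersection)."""
    smallest = sorted(set(a) | set(b))[:k]
    return {h: combine_counts(a.get(h, 0), b.get(h, 0)) for h in smallest}

def akmv_dv_estimate(synopsis):
    """Assumed estimator: (K_E / k) * (k - 1) / U_(k),
    with K_E = number of positive counters."""
    k = len(synopsis)
    u_k = max(synopsis)
    k_e = sum(1 for c in synopsis.values() if c > 0)
    return (k_e / k) * (k - 1) / u_k

# Example: estimate the number of distinct values in the intersection A ∩ B.
A = akmv_from_partition(range(0, 600), k=64)
B = akmv_from_partition(range(400, 1000), k=64)
print(akmv_dv_estimate(akmv_combine(A, B, 64, min)))  # roughly 200
```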


Experimental Comparison

[Figure: absolute relative error (0 to 0.1) for the estimators SDLogLog, Unbiased, Sample-Counting, and Unbiased-baseline.]


For More Details

  • "Toward automated large scale information integration and discovery." P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. In Data Management in a Connected World, T. Härder and W. Lehner, eds. Springer-Verlag, 2005.

  • “Techniques for warehousing of sample data”. P. G. Brown and P. J. Haas. ICDE ‘06.

  • “A dip in the reservoir: maintaining sample synopses of evolving datasets”. R. Gemulla, W. Lehner, and P. J. Haas. VLDB ‘06.

  • “Maintaining Bernoulli samples over evolving multisets”. R. Gemulla, W. Lehner, and P. J. Haas. PODS ‘07.

  • “On synopses for distinct-value estimation under multiset operations” K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. SIGMOD ‘07.

  • “Maintaining bounded-size sample synopses of evolving multisets” R. Gemulla, W. Lehner, P. J. Haas. VLDB Journal, 2007.



Bernoulli Sampling

[Figure: Bern(q) sampling with q = 1/3 applied to the stream 1, 2, 3. Each element is included independently with probability 1/3, so the eight possible samples occur with probability (2/3)³ ≈ 30% for the empty sample, (1/3)(2/3)² ≈ 15% for each single-element sample, (1/3)²(2/3) ≈ 7% for each two-element sample, and (1/3)³ ≈ 4% for the full sample.]

  • Bern(q) independently includes each element with probability q

  • Random, uncontrollable sample size

  • Easy to merge Bernoulli samples: union of 2 Bern(q) samples = Bern(q)


Reservoir Sampling (Example)

  • Sample size M = 2

[Figure: reservoir sampling of the stream 1, 2, 3, 4 with sample size 2. After the first two elements the sample is {1,2} (probability 100%); element 3 is included with probability 2/3, giving samples {1,2}, {1,3}, {2,3} with probability 1/3 each; element 4 is included with probability 2/4, after which each of the six possible two-element samples is equally likely.]


Concise-Sampling Example

  • Dataset

    • D = { a, a, a, b, b, b }

  • Footprint

    • F = one <value, #> pair

  • Three (possible) samples of size = 3

    • S1 = { a, a, a }, S2 = { b, b, b }, S3 = { a, a, b }.

    • S1 = {<a,3>}, S2 = {<b,3>}, S3 = {<a,2>,<b,1>}.

  • Three samples should have equal likelihood

    • But Prob(S1) = Prob(S2) > 0 and Prob(S3) = 0

  • In general:

    • Concise sampling under-represents ‘rare’ population elements (illustrated in the sketch below)
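A small sketch (mine, after the GM 98 description earlier in the talk) of the concise-sampling representation and its purge step; the purge rate 0.8 is arbitrary. The purge enforces the footprint, so a sample that would need more <value, #> pairs than the footprint allows is never produced: with a one-pair footprint, the sample S3 = {a, a, b} above has probability 0.

```python
import random

def concise_sample(stream, max_pairs, q=1.0):
    """Concise sampling sketch: a Bern(q) sample stored as {value: count};
    when the footprint exceeds max_pairs, lower q and purge by subsampling."""
    sample = {}
    for x in stream:
        if random.random() < q:
            sample[x] = sample.get(x, 0) + 1
        while len(sample) > max_pairs:
            q_new = 0.8 * q                  # arbitrary reduction factor
            for v in list(sample):
                kept = sum(random.random() < q_new / q for _ in range(sample[v]))
                if kept:
                    sample[v] = kept
                else:
                    del sample[v]
            q = q_new
    return sample, q

print(concise_sample(['a', 'a', 'a', 'b', 'b', 'b'], max_pairs=1))
```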


Hybrid Bernoulli Algorithm

  • Phase 1

    • Start by storing 100% sample compactly

    • Termination in Phase 1 ⇒ exact distribution

  • Abandon Phase 1 if footprint too big

    • Take subsample and expand

    • Fall back to Bernoulli sampling (Phase 2)

    • If footprint exceeded: revert to reservoir sampling (Phase 3)

  • Compress sample upon termination

  • If Phase 2 termination: (almost) Bernoulli sample

  • If Phase 3 termination: Bounded reservoir sample

  • Stay within footprint at all times

    • Messy details (a simplified sketch of the phase structure follows)
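The phase structure can be pictured with the following simplified Python skeleton (my own; the real HB algorithm chooses q from the partition size, stores samples in compressed form, and handles the phase transitions and corner cases more carefully):

```python
import random

def hybrid_bernoulli(stream, max_pairs, max_values, q):
    """Skeleton of the three HB phases (simplified)."""
    counts, sample, phase, seen = {}, [], 1, 0
    for x in stream:
        seen += 1
        if phase == 1:                       # keep the exact distribution
            counts[x] = counts.get(x, 0) + 1
            if len(counts) > max_pairs:      # footprint too big: go Bernoulli
                sample = [v for v, c in counts.items()
                          for _ in range(c) if random.random() < q]
                counts, phase = {}, 2
        elif phase == 2:                     # Bernoulli sampling
            if random.random() < q:
                sample.append(x)
            if len(sample) > max_values:     # footprint exceeded: reservoir
                random.shuffle(sample)
                sample, phase = sample[:max_values], 3
        else:                                # Phase 3: reservoir sampling
            if random.random() < max_values / seen:
                sample[random.randrange(max_values)] = x
    return counts if phase == 1 else sample  # compressed on termination in practice
```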


Subsampling in HB Algorithm

  • Goal: find q such that P{|S| > nF} = p

  • Solve numerically:

  • Approximate solution (< 3% error):


Merging HB Samples

  • If both samples in Phase 2

    • Choose q as before (w.r.t. |D1 ∪ D2|)

    • Convert both samples to compressed Bern(q)

      [Use Bern(q’/q) trick as in Concise Sampling]

    • If union of compressed samples fits in memory

      then join and exit; else use reservoir sampling (unlikely)


Merging a Pair of HR Samples

  • If both samples in Phase 2

    • Set k = min(|S1|, |S2|)

    • Select L elements from S1 and k – L from S2

      • L has hypergeometric distribution on {0,1,…,k}

        • Distribution depends on |D1|, |D2|

      • Take (compressed) reservoir subsamples of S1, S2

      • Join (compressed union) and exit


Generating Realizations of L

L is a random variable with probability mass function P(l) = P{ L = l } given by the hypergeometric form

P(l) = C(|D1|, l) · C(|D2|, k − l) / C(|D1| + |D2|, k), for l = 0, 1, …, k

(an inversion-sampling sketch follows this list)

  • Simplest implementation

    • Compute P recursively

    • Use inversion method (probe cumulative distribution at each merge)

  • Optimizations when |D|’s and |S|’s unchanging

    • Use alias methods to generate L from cached distributions in O(1) time
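A minimal sketch of the inversion method named above, evaluating the hypergeometric pmf term by term (here via math.comb rather than the recursion; the alias-method optimization is not shown):

```python
import random
from math import comb

def sample_L(d1, d2, k):
    """Draw L, the number of elements to take from S1, by inverting the
    hypergeometric CDF: P(l) = C(d1, l) * C(d2, k - l) / C(d1 + d2, k)."""
    u, cdf = random.random(), 0.0
    for l in range(k + 1):
        cdf += comb(d1, l) * comb(d2, k - l) / comb(d1 + d2, k)
        if u <= cdf:
            return l
    return k  # guard against floating-point round-off

print(sample_L(d1=1000, d2=400, k=50))
```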


Naïve/Prior Approaches

  • Naïve (RS with deletions): conduct deletions, continue with a smaller sample (unstable)

  • Naïve (immediate refill): use insertions to immediately refill the sample (not uniform)

  • RS with resampling: let the sample size decrease, but occasionally recompute (expensive, unstable)

  • CAR(WOR): immediately sample from base data to refill the sample (stable but expensive)

  • Bernoulli sampling with purging: “coin flip” sampling with deletions, purge if too large (not uniform!)

  • Passive sampling: developed for data streams, sliding windows only (special case of our RP algorithm)

  • Counting samples: modification of concise sampling (not uniform)

  • Distinct-value sampling: tailored for multiset populations (expensive, low space efficiency in our setting)


A Negative Result

  • Theorem

    • Any resizing algorithm MUST access base data

  • Example

    • data set

    • samples of size 2

    • new data set

    • samples of size 3

Not uniform!


Resizing: Phase 1

Conversion to Bernoulli sample

  • Given q, randomly determine sample size

    • U = Binomial(|D|,q)

  • Reuse S to create Bernoulli sample

    • Subsample if U < |S|

    • Else sample additional tuples (base data access)

  • Choice of q

    • small q ⇒ fewer base-data accesses

    • large q ⇒ more base-data accesses


Resizing: Phase 2

Run Bernoulli sampling

  • Include new tuples with probability q

  • Delete from sample as necessary

  • Eventually reach new sample size

  • Revert to reservoir sampling

  • Choice of q

    • small q ⇒ long drift time

    • large q ⇒ short drift time


Choosing q (Inserts Only)

  • Expected Phase 1 (conversion) time

  • Expected Phase 2 (drifting) time

  • Choose q to minimize E[T1] + E[T2]


Resizing Behavior

  • Example (dependence on base-access cost):

    • resize by 30% if sampling fraction drops below 9%

    • dependent on costs of accessing base data

  • Low base-access costs ⇒ immediate resizing

  • Moderate costs ⇒ combined solution

  • High costs ⇒ degenerates to Bernoulli sampling


Choosing q (w. Deletes)

  • Simple approach (insert prob. = p > 0.5)

    • Expected change in partition size (Phase 2)

      • p·(+1) + (1 − p)·(−1) = 2p − 1

    • So scale Phase 2 cost by 1/(2p-1)

  • More sophisticated approach

    • Hitting time of Markov chain to boundary

    • Stochastic approximation algorithm

      • Modified Kiefer-Wolfowitz


The RPMerge Algorithm

  • Conceptually: defer deletions until after merge

  • Generate Yi’s directly

    • Can assume that deletions happen after the insertions


New Maintenance Method

[Figure: the dataset contains Nj(t) copies of item t and the sample contains Xj(t) copies. On an insertion of t, insert t into the sample with probability q; on a deletion of t, delete t from the sample with probability (Xj(t) − 1) / (Yj(t) − 1). These transition rules are sketched in code after the list below.]

  • Idea: use tracking counters

    • After the j-th transaction, the augmented sample is Sj = { (Xj(t), Yj(t)) : t ∈ T and Xj(t) > 0 }

      • Xj(t) = frequency of item t in the sample

      • Yj(t) = net # of insertions of t into R since t joined sample
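A sketch of these maintenance rules in code. The insertion rule and the deletion probability (Xj(t) − 1)/(Yj(t) − 1) follow the slide; the bookkeeping of Y on deletions and the Y = 1 corner case are my own assumptions, and the PODS '07 paper gives the exact algorithm.

```python
import random

class TrackedBernoulliSample:
    """Bern(q) sample of a multiset with tracking counters.
    X[t] = copies of t in the sample; Y[t] = net insertions of t since
    t first joined the sample."""

    def __init__(self, q):
        self.q = q
        self.X, self.Y = {}, {}

    def insert(self, t):
        if self.X.get(t, 0) == 0:
            if random.random() < self.q:       # t joins the sample
                self.X[t], self.Y[t] = 1, 1
        else:                                  # t already tracked
            self.Y[t] += 1
            if random.random() < self.q:
                self.X[t] += 1

    def delete(self, t):
        if self.X.get(t, 0) == 0:
            return                             # nothing tracked for t
        if self.Y[t] == 1:                     # assumed corner case: drop t
            del self.X[t], self.Y[t]
            return
        # remove a sample copy with probability (X - 1) / (Y - 1), per the slide
        if random.random() < (self.X[t] - 1) / (self.Y[t] - 1):
            self.X[t] -= 1
        self.Y[t] -= 1
```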


Frequency Estimation

  • Naïve (Horvitz-Thompson) unbiased estimator

  • Exploit tracking counter:

  • Theorem

  • Can extend to other aggregates (see paper)


Estimating Distinct-Value Counts

  • If usual DV estimators unavailable (BH+07)

  • Obtain S′ from S: insert t ∈ D(S) with probability

  • Can show: P(t ∈ S′) = q for t ∈ D(R)

  • HT unbiased estimator: D̂ = |S′| / q

  • Improve via conditioning (Var[E[U|V]] ≤ Var[U]):


Estimating the DV Count

  • Exact computation via sorting

    • Usually infeasible

  • Sampling-based estimation

    • Very hard problem (need large samples)

  • Probabilistic counting schemes

    • Single-pass, bounded memory

    • Several flavors (mostly bit-vector synopses)

      • Linear counting (ASW87)

      • Logarithmic counting (FM85,WVT90,AMS, DF03)

      • Sample counting (ASW87,Gi01, BJKST02)


Intuition

  • Look at spacings

    • Example with k = 4 and D = 7:

    • E[V] ≈ 1/D, so that D ≈ 1/E[V]

    • Estimate D as 1 / Avg(V1,…,Vk)

    • I.e., as k / Sum(V1,…,Vk)

    • I.e., as k / U(k)

    • Upward bias (Jensen’s inequality), so change k to k − 1 (see the short derivation below)
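To spell out the bias correction: under the large-D approximation stated earlier (U(k) behaves like a sum of k i.i.d. Exp(D) variables, i.e., approximately a Gamma(k, D) random variable), the naive estimate k/U(k) overshoots while (k − 1)/U(k) is unbiased:

```latex
U_{(k)} \sim \mathrm{Gamma}(k, D) \ \text{(approx.)}
\;\Rightarrow\;
\mathbb{E}\!\left[\frac{1}{U_{(k)}}\right] = \frac{D}{k-1},
\quad\text{so}\quad
\mathbb{E}\!\left[\frac{k-1}{U_{(k)}}\right] = D
\quad\text{while}\quad
\mathbb{E}\!\left[\frac{k}{U_{(k)}}\right] = \frac{k}{k-1}\,D > D.
```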