Full Disjunctions: Polynomial-Delay Iterators in Action

Sara Cohen (Technion, Israel), Itzhak Fadida (Technion, Israel), Yaron Kanza (University of Toronto, Canada), Benny Kimelfeld (Hebrew University, Israel), Yehoshua Sagiv (Hebrew University, Israel)

VLDB 2006, Seoul, Korea



Computing Full Disjunctions

  • The full disjunction is a relational operator that maximally combines data from several relations

    • It extends the natural join by allowing incompleteness

    • It extends the binary outerjoin to many relations

  • This paper presents algorithms and optimizations for computing full disjunctions

    • Theoretically, full disjunctions are more tractable than previously known

    • Practically, a significant improvement over the state of the art, with an iterator-like evaluation


Contents

  • Full Disjunctions

    • Complexity

  • Contributions

  • Algorithms

    • Algorithm NLOJ for Tree-Structured Schemes

    • Algorithm PDelayFD for General Schemes

    • Algorithm BiComNLOJ − Main Algorithm

  • Experimental Results

  • Conclusion


The Natural Join Operator

[Figure: the relations Climates, Accommodations and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]


The Natural Join Misses Information

[Figure: the relations Climates, Accommodations and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]

Empty space means null value

Bahamas is not in Sites, so the natural join misses it

Mount Logan is not in a city, hence it is missed as well

A looser notion of join is needed: one that enables joining tuples from some of the tables


The Natural Join Operator

A tuple of the join corresponds to a set of tuples from the source relations. This set is:

  • Join consistent

  • Connected (no Cartesian product)

  • Complete (one tuple from each relation)

[Figure: the relations Climates, Accommodations and Sites, and their natural join Climates ⋈ Accommodations ⋈ Sites]


Join-Consistent Sets of Tuples

A set T of tuples is join-consistent if every two tuples of T are join-consistent

Two tuples t1 and t2 are join-consistent if for every common attribute A:

1. t1[A] and t2[A] are non-null

2. t1[A] = t2[A]
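
The definition above can be read as a small predicate. The following is a minimal Python sketch, assuming tuples are modeled as dictionaries from attribute names to values, with `None` standing for null. The function names and the sample rows are illustrative assumptions and are not taken from the talk.

```python
from itertools import combinations


def join_consistent(t1, t2):
    """True if t1 and t2 are non-null and equal on every common attribute."""
    common = t1.keys() & t2.keys()
    return all(
        t1[a] is not None and t2[a] is not None and t1[a] == t2[a]
        for a in common
    )


def set_join_consistent(tuples):
    """A set of tuples is join-consistent if every two of its tuples are."""
    return all(join_consistent(a, b) for a, b in combinations(tuples, 2))


# Hypothetical rows, loosely modeled on the travel example of the talk.
climate = {"Country": "Canada", "Climate": "diverse"}
hotel = {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"}
site = {"Country": "Canada", "City": None, "Site": "Mount Logan"}

print(join_consistent(climate, hotel))  # True: they agree on Country
print(join_consistent(hotel, site))     # False: City is null in `site`
```

Note that two tuples without any common attribute are vacuously join-consistent under this reading; connectivity is a separate requirement.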


Connected Sets of Tuples

A set of tuples is connected if its join graph is connected

The join graph of a set T of tuples:

  • The nodes are the tuples of T

  • An edge between every two tuples with a common attribute
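
As an illustration, connectivity of the join graph can be checked with a depth-first search. This is a sketch under the assumption that tuples are dictionaries keyed by attribute name; the names `share_attribute` and `is_connected` and the sample rows are hypothetical. Treating the empty set as connected is a convention chosen here, not something the slides state.

```python
def share_attribute(t1, t2):
    """Two tuples are adjacent in the join graph if they have a common attribute."""
    return bool(t1.keys() & t2.keys())


def is_connected(tuples):
    """Depth-first search over the join graph of a list of dict tuples."""
    if not tuples:
        return True  # convention assumed for this sketch
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and share_attribute(tuples[i], tuples[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


# Hypothetical rows; only the attribute names matter for connectivity.
climate = {"Country": "Canada", "Climate": "diverse"}
hotel = {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"}
airline = {"Airline": "Air Canada", "Hub": "Toronto"}

print(is_connected([climate, hotel]))           # True
print(is_connected([climate, hotel, airline]))  # False: no shared attribute name
```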


Natural Join (w/o Cartesian Product)

Each tuple of the result corresponds to a set T of tuples from the source relations

1. T is join consistent

2. T is connected (no Cartesian product)

3. T is complete (one tuple from each relation)

JCC: join consistent and connected (conditions 1 and 2)


Full Disjunction (Galindo-Legaria 1994)

Each tuple of the result corresponds to a set T of tuples from the source relations

1. T is join consistent

2. T is connected (no Cartesian product)

3. T is maximal (not properly contained in any JCC set)

JCC: join consistent and connected (conditions 1 and 2). Compared with the natural join, the requirement that T is complete (one tuple from each relation) is replaced by maximality.
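
To make the definition concrete, the sketch below enumerates the maximal JCC sets of a tiny database by brute force and pads each joined set with nulls. It is exponential in the number of tuples and is meant only to illustrate the definition, not any algorithm from the talk. Tuples are assumed to be dictionaries with `None` as null, and all names and sample rows are hypothetical.

```python
from itertools import combinations


def join_consistent(t1, t2):
    """Common attributes must be non-null and equal in both tuples."""
    return all(
        t1[a] is not None and t2[a] is not None and t1[a] == t2[a]
        for a in t1.keys() & t2.keys()
    )


def is_connected(tuples):
    """Depth-first search over the join graph (edge = shared attribute name)."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, other in enumerate(tuples):
            if j not in seen and tuples[i].keys() & other.keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


def is_jcc(tuples):
    """Join consistent and connected."""
    pairwise = all(join_consistent(a, b) for a, b in combinations(tuples, 2))
    return pairwise and is_connected(tuples)


def full_disjunction_bruteforce(db):
    """Return the maximal JCC sets of db as frozensets of tuple indices."""
    jcc_sets = [
        frozenset(subset)
        for size in range(1, len(db) + 1)
        for subset in combinations(range(len(db)), size)
        if is_jcc([db[i] for i in subset])
    ]
    return [s for s in jcc_sets if not any(s < other for other in jcc_sets)]


def padded_join(db, indices):
    """Join the chosen tuples and pad every missing attribute with None."""
    row = dict.fromkeys(sorted({a for t in db for a in t}))
    for i in indices:
        row.update({a: v for a, v in db[i].items() if v is not None})
    return row


# Hypothetical database; the actual tables of the talk are not in the transcript.
db = [
    {"Country": "Canada", "Climate": "diverse"},                   # 0
    {"Country": "Bahamas", "Climate": "tropical"},                 # 1
    {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"},    # 2
    {"Country": "Bahamas", "City": "Nassau", "Hotel": "Hilton"},   # 3
    {"Country": "Canada", "City": "Toronto", "Site": "Air Show"},  # 4
    {"Country": "Canada", "City": None, "Site": "Mount Logan"},    # 5
]

for indices in full_disjunction_bruteforce(db):
    print(sorted(indices), padded_join(db, indices))
```

On this input the maximal JCC sets are {0, 2, 4}, {0, 5} and {1, 3}, so the Bahamas and Mount Logan rows that a natural join would drop both appear in the result.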


An Example of a Full Disjunction

[Figure: the relations Climates, Accommodations and Sites (the database R) and the result FD(R), built up step by step over six slides]


Padding Joined Tuple Sets with Nulls


The Outerjoin Operator

The outerjoin of two relations R1 and R2, written R1 ⟗ R2:

The natural join R1 ⋈ R2 and, in addition, all dangling tuples padded with nulls
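
The definition above can be sketched in a few lines of Python. The sketch assumes relations are lists of dictionaries with `None` as null, derives the output attributes from the rows themselves, and uses a plain nested loop; the function names and sample rows are hypothetical.

```python
def join_consistent(t1, t2):
    """Common attributes must be non-null and equal in both tuples."""
    return all(
        t1[a] is not None and t2[a] is not None and t1[a] == t2[a]
        for a in t1.keys() & t2.keys()
    )


def outerjoin(r1, r2):
    """Natural join of r1 and r2 plus all dangling tuples, padded with None."""
    attributes = sorted({a for t in r1 + r2 for a in t})

    def pad(t):
        return {a: t.get(a) for a in attributes}

    result, matched1, matched2 = [], set(), set()
    for i, t1 in enumerate(r1):
        for j, t2 in enumerate(r2):
            if join_consistent(t1, t2):
                result.append(pad({**t1, **t2}))
                matched1.add(i)
                matched2.add(j)
    # Dangling tuples: those that joined with nothing on the other side.
    result += [pad(t) for i, t in enumerate(r1) if i not in matched1]
    result += [pad(t) for j, t in enumerate(r2) if j not in matched2]
    return result


# Hypothetical rows, loosely modeled on the travel example of the talk.
climates = [
    {"Country": "Canada", "Climate": "diverse"},
    {"Country": "UK", "Climate": "temperate"},
]
accommodations = [
    {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"},
    {"Country": "Bahamas", "City": "Nassau", "Hotel": "Hilton"},
]

for row in outerjoin(climates, accommodations):
    print(row)
```

Here the Canada rows join, while the UK climate row and the Bahamas accommodation row are dangling and appear padded with nulls.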


Example of an Outerjoin

[Figure: the relations Climates and Accommodations, and their outerjoin Climates ⟗ Accommodations]


Combining Relations using Outerjoins

The outerjoin operator is not associative

For more than two relations, the result depends on the order in which the outerjoin is applied

In general, outerjoins cannot maximally combine relations (no matter what order is used)

Outerjoin is not suitable for combining more than two relations!
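
A small experiment illustrates the dependence on order. The sketch below assumes relations are lists of dictionaries with `None` as null, treats a null on a common attribute as a non-match, and uses three hypothetical single-row relations R1(A, B), R2(B, C) and R3(A, C) whose scheme graph is a cycle.

```python
def join_consistent(t1, t2):
    """Common attributes must be non-null and equal in both tuples."""
    return all(
        t1[a] is not None and t2[a] is not None and t1[a] == t2[a]
        for a in t1.keys() & t2.keys()
    )


def outerjoin(r1, r2):
    """Natural join plus dangling tuples of both sides, padded with None."""
    attributes = sorted({a for t in r1 + r2 for a in t})
    pad = lambda t: {a: t.get(a) for a in attributes}
    joined = [(i, j) for i, t1 in enumerate(r1)
              for j, t2 in enumerate(r2) if join_consistent(t1, t2)]
    result = [pad({**r1[i], **r2[j]}) for i, j in joined]
    result += [pad(t) for i, t in enumerate(r1) if i not in {p[0] for p in joined}]
    result += [pad(t) for j, t in enumerate(r2) if j not in {p[1] for p in joined}]
    return result


# Hypothetical relations forming a cyclic scheme graph.
R1 = [{"A": 1, "B": 1}]
R2 = [{"B": 1, "C": 1}]
R3 = [{"A": 1, "C": 2}]

left = outerjoin(outerjoin(R1, R2), R3)   # (R1 ⟗ R2) ⟗ R3
right = outerjoin(R1, outerjoin(R2, R3))  # R1 ⟗ (R2 ⟗ R3)

print(len(left), len(right))  # 2 3
print(left)
print(right)
```

Under these assumptions the two parenthesizations give different results, and neither contains the tuple (A=1, B=1, C=2) that joins the rows of R1 and R3, which is a maximal combination of this data.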


Efficiency of Evaluation

The full-disjunction operator (as well as other operators like the Cartesian product or the natural join) can generate an exponential (in the input size) number of tuples

Polynomial running time is not a suitable yardstick

The usual notion: polynomial time in the combined size of the input and the output


History of Algorithms for Full Disjunctions

Source   Databases   Time
RU96     γ-acyclic   O(n + F^2)
KS03     general     O(n^5 N^2 F^2)
CS05     general     O(n^3 N F^2), “incremental polynomial”

This paper: linear dependence on F

n: number of relations

N: number of tuples in the DB

F: number of tuples in the FD

F is typically very large; it can be exponential in the size of the database


Polynomial Delay

One way to obtain an evaluation with a running time linear in the output is to devise an algorithm that acts as an iterator with an efficient next() operator, that is, an enumeration algorithm that runs with polynomial delay

An enumeration algorithm runs with polynomial delay if the time between every two successive answers is polynomial in the size of the input


Other Benefits of Polynomial Delay

  • Incremental evaluation

    • First tuples are generated quickly

      • Full disjunctions are large, yet the user need not wait for the whole result to be generated

    • Suitable for Web applications, where users expect to get the first few pages quickly

      • In addition, the user can decide anytime that enough information has been shown

  • Enable parallel query processing

    • While one processor generates the FD tuples, other processors apply further processing


Main Contributions

Substantial improvement over the state of the art, proved theoretically and experimentally

1. First algorithm for computing full disjunctions with polynomial delay

2. First algorithm for computing full disjunctions in time linear in the output

3. A general optimization technique for computing full disjunctions: division into biconnected components


Our Algorithms

  • Algorithm NLOJ: tree schemes

  • Algorithm PDelayFD: general schemes

  • Optimization: division into biconnected components

  • Algorithm BiComNLOJ (main algorithm, general schemes): combines the above


Tree Schemes

Scheme graphs w/o cycles

In the scheme graph, the relation schemes are the nodes and there is an edge between every two schemes with one or more common attributes

[Figure: a tree-shaped scheme graph over the relations R1, …, R7]
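
The scheme graph described above is easy to build from the attribute sets. The following sketch assumes each relation scheme is given as a set of attribute names, and it checks the tree property as "connected with exactly one edge fewer than nodes"; requiring connectivity is an assumption of this sketch. The relation names and attributes are hypothetical.

```python
from itertools import combinations


def scheme_graph(schemes):
    """Adjacency sets: an edge joins two schemes sharing at least one attribute."""
    graph = {name: set() for name in schemes}
    for a, b in combinations(schemes, 2):
        if schemes[a] & schemes[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_tree(graph):
    """A non-empty graph is a tree if it is connected and has |V| - 1 edges."""
    if not graph:
        return False
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    edge_count = sum(len(neighbours) for neighbours in graph.values()) // 2
    return len(seen) == len(graph) and edge_count == len(graph) - 1


# Hypothetical schemes: a path R1 - R2 - R3, then a fourth scheme closing a cycle.
path = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"C", "D"}}
cycle = dict(path, R4={"D", "A"})

print(is_tree(scheme_graph(path)))   # True
print(is_tree(scheme_graph(cycle)))  # False
```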


Left-Deep Sequence of Outerjoins

R: a set of relations with a tree scheme

R1,…,Rn: a connected-prefix order of R

Proposition: FD(R) = (…((R1 ⟗ R2) ⟗ R3) …) ⟗ Rn, where ⟗ denotes the outerjoin

Algorithm NLOJ (Nested Loop OuterJoin):

1. Compute a connected-prefix order of R

2. Apply outerjoins in a left-deep order


Connected-Prefix Order of Relations

A connected-prefix order of relations: each prefix forms a (connected) subtree

[Figure: a tree scheme over the relations R1, …, R7, illustrating a connected-prefix order]
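
One simple way to obtain such an order is a breadth-first traversal of the tree from any root, since every newly visited node is adjacent to a node visited earlier. The sketch below shows this under the assumption that the scheme graph is given as adjacency sets; it is not necessarily how the authors compute the order, and the tree used here is hypothetical.

```python
from collections import deque


def connected_prefix_order(graph, root):
    """Breadth-first order: every prefix of the result is a connected subtree."""
    order, seen, queue = [root], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical tree scheme (not the one drawn on the slide).
tree = {
    "R1": {"R2", "R3"},
    "R2": {"R1"},
    "R3": {"R1", "R4", "R5"},
    "R4": {"R3"},
    "R5": {"R3"},
}

order = connected_prefix_order(tree, "R1")
print(order)  # ['R1', 'R2', 'R3', 'R4', 'R5']
```

A depth-first traversal would serve equally well; what matters is only that each relation is appended after one of its neighbours.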


Achieving Polynomial Delay

Algorithm NLOJ (Nested Loop OuterJoin):

1. Compute a connected-prefix order of R

2. Apply outerjoins in a left-deep order

[Figure: the left-deep sequence of outerjoins over R1, R2, R3, …, Rn-1, Rn, with an intermediate result annotated "Already exponential size!"]

Problem: exponential delay

Solution: use iterators


Iterators

To obtain polynomial delay, we use iterators

  • Operate on top of an enumeration algorithm

  • Implement next() by controlling the execution

[Figure: an iterator wrapped around an algorithm, exposing next()]
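
Python generators give a convenient picture of this idea: the enumeration algorithm is written as usual, and `next()` resumes it only until the following answer is produced. The sketch below is a generic illustration, not the iterators of the paper; it enumerates a Cartesian product lazily, and the function name and input are hypothetical.

```python
from itertools import islice


def enumerate_product(relations, partial=None, depth=0):
    """Enumeration algorithm: yield each combined tuple of the Cartesian product."""
    partial = {} if partial is None else partial
    if depth == len(relations):
        yield dict(partial)
        return
    for t in relations[depth]:
        # Execution is suspended at every yield and resumed by the next call to next().
        yield from enumerate_product(relations, {**partial, **t}, depth + 1)


# Hypothetical input: 30 relations with two single-attribute tuples each,
# so the complete result would contain 2**30 tuples.
relations = [[{f"A{i}": 0}, {f"A{i}": 1}] for i in range(30)]

iterator = enumerate_product(relations)
first = next(iterator)                  # available without computing the rest
next_three = list(islice(iterator, 3))  # three more answers, pulled on demand

print(first["A0"], first["A29"])
print([row["A29"] for row in next_three])
```

Although the output is exponential in the number of relations, the work between two successive answers in this sketch is bounded by the recursion depth and the tuple size, which is the behaviour the term polynomial delay describes.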


Using Iterators for Outerjoins

[Figure: a chain of iterators, Iterator 1 through Iterator n, over the relations R1, R2, R3, …, Rn-1, Rn in the left-deep sequence of outerjoins]


Outerjoins are not Always Applicable

It is not always possible to formulate a full disjunction as a left-deep sequence of outerjoins

Rajaraman and Ullman [PODS 96]:

Some full disjunctions cannot be formulated as expressions of outerjoins (i.e., with arbitrary placement of parentheses)


About the Algorithm

  • Unlike NLOJ, the next algorithm, PDelayFD, is applicable to all schemes (and not just trees)

  • Algorithm PDelayFD has a polynomial delay, but the delay is larger than that of NLOJ

  • Nevertheless, PDelayFD by itself is a significant improvement over the state of the art


Shifting a Maximal JCC Tuple Set T

t-shifting T:

1. Add t to T

2. Extract the max. JCC subset containing t

3. Extend to a maximal JCC set

The result is the t-shift of T


Algorithm PDelayFD

1. Generate a max. JCC set T0

2. Insert T0 into Q

Repeat until Q is empty:

1. Move some T from Q to C

2. Print the join of T, padded with nulls

3. Insert into Q a t-shift of T for all tuples t in the database (validate that the t-shift is not already in Q or C)

[Figure: the queue Q, the collection C and the output]

Theorem: PDelayFD(R) computes FD(R) with polynomial delay
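
The steps of PDelayFD and of the t-shift can be sketched as a Python generator. This is a simplified reading of the slides, not the authors' implementation: tuples are assumed to be dictionaries with `None` as null, sets of tuples are frozensets of indices into the database, the extension to a maximal JCC set is done greedily, and a single `seen` set stands in for "already in Q or C". All names and the sample data are hypothetical.

```python
from collections import deque


def join_consistent(t1, t2):
    """Common attributes must be non-null and equal in both tuples."""
    return all(
        t1[a] is not None and t2[a] is not None and t1[a] == t2[a]
        for a in t1.keys() & t2.keys()
    )


def component_of(db, members, start):
    """Indices of `members` reachable from `start` in the join graph."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j in members:
            if j not in seen and db[i].keys() & db[j].keys():
                seen.add(j)
                stack.append(j)
    return seen


def extend_to_maximal(db, jcc):
    """Greedily add tuples while the set stays join consistent and connected."""
    jcc, changed = set(jcc), True
    while changed:
        changed = False
        for k in range(len(db)):
            if k in jcc:
                continue
            consistent = all(join_consistent(db[k], db[i]) for i in jcc)
            attached = any(db[k].keys() & db[i].keys() for i in jcc)
            if consistent and attached:
                jcc.add(k)
                changed = True
    return frozenset(jcc)


def t_shift(db, T, t):
    """Add t to T, keep the maximal JCC subset containing t, then extend it."""
    compatible = {i for i in T if i != t and join_consistent(db[i], db[t])} | {t}
    return extend_to_maximal(db, component_of(db, compatible, t))


def pdelay_fd(db):
    """Yield maximal JCC sets one at a time, following the loop on the slide."""
    if not db:
        return
    first = extend_to_maximal(db, {0})
    queue, seen = deque([first]), {first}  # `seen` covers both Q and C
    while queue:
        T = queue.popleft()  # move some T from Q to C
        yield T              # stands in for printing the padded join of T
        for t in range(len(db)):
            shifted = t_shift(db, T, t)
            if shifted not in seen:
                seen.add(shifted)
                queue.append(shifted)


# Hypothetical database; the actual tables of the talk are not in the transcript.
db = [
    {"Country": "Canada", "Climate": "diverse"},                   # 0
    {"Country": "Bahamas", "Climate": "tropical"},                 # 1
    {"Country": "Canada", "City": "Toronto", "Hotel": "Plaza"},    # 2
    {"Country": "Bahamas", "City": "Nassau", "Hotel": "Hilton"},   # 3
    {"Country": "Canada", "City": "Toronto", "Site": "Air Show"},  # 4
    {"Country": "Canada", "City": None, "Site": "Mount Logan"},    # 5
]

for T in pdelay_fd(db):
    print(sorted(T))
```

Because `pdelay_fd` is a generator, a caller can stop after the first few answers; each answer costs one round of t-shifts over the database.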


NLOJ vs. PDelayFD

[Figure: scheme graphs over the relations R1, …, R10, labeled PDelayFD, NLOJ and "?"]

NLOJ:

  • Shorter delays

  • Less space

  • Simpler to implement

Our approach: divide and conquer


Biconnected Components

[Figure: a scheme graph over the relations R1, …, R9 and its division into biconnected components]

Biconnected component: a maximal subset B of relations, s.t. the scheme graph has two (or more) disjoint paths between every two relations of B
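
Biconnected components of a scheme graph can be found with the classical depth-first search that tracks low-points and an edge stack. The sketch below is a standard textbook version and not code from the paper; it assumes the graph is given as adjacency sets and follows the usual convention that a single bridge edge forms a component of two nodes. The graph used here is hypothetical.

```python
def biconnected_components(graph):
    """Return the biconnected components of an undirected graph as sets of nodes."""
    index, low = {}, {}   # discovery order and low-point of each node
    edge_stack, components = [], []

    def dfs(u, parent):
        index[u] = low[u] = len(index)
        for v in graph[u]:
            if v == parent:
                continue
            if v not in index:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:
                    # u separates the subtree of v: pop one whole component.
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(component)
            elif index[v] < index[u]:
                # Back edge to an ancestor of u.
                edge_stack.append((u, v))
                low[u] = min(low[u], index[v])

    for node in graph:
        if node not in index:
            dfs(node, None)
    return components


# Hypothetical scheme graph: a triangle, a bridge and another triangle.
graph = {
    "R1": {"R2", "R3"},
    "R2": {"R1", "R3"},
    "R3": {"R1", "R2", "R4"},
    "R4": {"R3", "R5", "R6"},
    "R5": {"R4", "R6"},
    "R6": {"R4", "R5"},
}

for component in biconnected_components(graph):
    print(sorted(component))
```

In this example R3 and R4 each belong to two components; such shared relations are where the components meet.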


Left-Deep Sequence of Outerjoins

R: a set of relations

Theorem: There exists an (efficiently computable) order B1,…,Bk of the biconnected components of R, s.t.

FD(R) = (…(FD(B1) ⟗ FD(B2)) …) ⟗ FD(Bk)

Optimized Algorithm:

1. Compute the biconnected components of R

2. Compute the full disjunction of each component

3. Apply outerjoins in a suitable order


BiComNLOJ: a Naïve Attempt

1. Divide R into biconnected components B1,…,Bk in a suitable order

2. Compute FD(B1),…,FD(Bk) using PDelayFD

3. Using NLOJ, compute (…(FD(B1) ⟗ FD(B2)) …) ⟗ FD(Bk)

Each FD(Bi) can be exponential in the input: non-polynomial delay!

Solution: iterators


Retaining Polynomial Delay: 1st Problem

For simplification, assume only two components

[Figure: a scheme graph over the relations R1, …, R8, divided into two biconnected components B1 and B2]

  • After generating a tuple t of FD(B1), we need to generate all tuples of FD(B2) that can join t

  • Non-polynomial delay if all of FD(B2) is computed for finding these tuples!

  • Solution: PDelayFD can be modified so that it generates only those tuples of FD(B2) that can join t

Details in the proceedings…


Retaining Polynomial Delay: 2nd Problem

For simplification, assume only two components

[Figure: a scheme graph over the relations R1, …, R8, divided into two biconnected components B1 and B2]

  • The last step is to generate all tuples of FD(B2) that cannot be joined with tuples of FD(B1)

  • However, this task is by itself NP-hard!

  • Solution: When generating all tuples of FD(B2) that can be joined with some tuple of FD(B1), we collect enough information for generating the remaining tuples of FD(B2)

Details in the proceedings…


Experimental Setting

[Figure: three scheme graphs, Scheme S1, Scheme S2 and Scheme S3, each over the relations R1, …, R10]

  • Algorithms: PDelayFD, BiComNLOJ (main), IncrementalFD (CS05, state of the art)

  • Implementation: PostgreSQL (open source)

  • HW: Pentium 4, 1.6 GHz, 512 MB RAM

  • Synthetic data (randomly generated)

  • Fixed schemes


State-of-Art vs. Main Algorithm

[Figure: three plots (Scheme 1, Scheme 2, Scheme 3) of average delay (msec) against the number of tuples in each relation, comparing IncrementalFD (state of the art, CS05) with BiComNLOJ (our main algorithm)]

BiComNLOJ is a substantial improvement over the state of the art


Division into Biconnected Components

[Figure: three plots (Scheme 1, Scheme 2, Scheme 3) of average delay (msec) against the number of tuples in each relation, comparing PDelayFD (no division into biconnected components) with BiComNLOJ (our main algorithm)]

Division reduces delays (amount depends on the scheme)


Behavior of Delay

Measure the delay before each generated tuple

[Figure: plot of delay (msec) against tuple number, comparing IncrementalFD (state of the art, CS05) with BiComNLOJ (our main algorithm)]

While IncrementalFD has a slowdown, the delay of BiComNLOJ remains almost constant


Summary

Full Disjunction: an associative extension of the outerjoin operator to an arbitrary number of relations

3 algorithms for computing FD:

  • PDelayFD (Polynomial-Delay Full Disjunction): general schemes

  • NLOJ (Nested-Loop Outerjoin): tree-structured schemes

  • BiComNLOJ: combines the first 2 and deploys division into biconnected components; general schemes


Contributions

  • Substantial improvement of evaluation time over the state-of-art

    • Proved theoretically and experimentally

  • Full disjunctions can be computed with polynomial delay and in time linear in the output size

  • Optimization techniques for computing FDs

  • Implementation within PostgreSQL (ongoing…)

  • Incorporating our algorithms into an SQL optimizer

    • E.g., some operators can be pushed through the FD

    • Not discussed here, appears in the proceedings…


Thank you.

Questions?

