Full Disjunctions : Polynomial-Delay Iterators in Action

Download Presentation

Full Disjunctions : Polynomial-Delay Iterators in Action

Loading in 2 Seconds...

- 48 Views
- Uploaded on
- Presentation posted in: General

Full Disjunctions : Polynomial-Delay Iterators in Action

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Sara Cohen

Itzhak Fadida

Yaron Kanza

Technion Israel

Technion Israel

University of Toronto

Canada

Benny Kimelfeld

Yehoshua Sagiv

Hebrew University

Israel

Hebrew University

Israel

VLDB 2006

Seoul, Korea

Full Disjunctions:Polynomial-Delay Iterators in Action

- The full disjunction is a relational operator that maximally combines data from several relations
- It extends the natural join by allowing incompleteness
- It extends the binaryouterjoin to many relations

- This paper presents algorithms and optimizations for computing full disjunctions
- Theoretically, full disjunctions are more tractable than previously known
- Practically, a significant improvement over the state-of-art, an iterator-like evaluation

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

Climates

Accommodations

Sites

ClimatesAccommodationsSites

Climates

Accommodations

Sites

Bahamas is not in Sites, so the natural join misses it

ClimatesAccommodationsSites

Empty space means nullvalue

Climates

Accommodations

Bahamas is not in Sites, so the natural join misses it

Mouth Logan is not in a city, hence missed

ClimatesAccommodationsSites

A looser notion of join is needed—one that enables joining tuples from some of the tables

Climates

Accommodations

Bahamas is not in Sites, so the natural join misses it

Mouth Logan is not in a city, hence missed

ClimatesAccommodationsSites

A tuple of the join corresponds to a set of tuples from the source relations

Climates

Accommodations

Sites

Join consistent

Connected

No Cartesian product

Complete

One tuple from each relation

ClimatesAccommodationsSites

A set T of tuples is join-consistent if every two tuples of T are join-consistent

Two tuples t1 and t2 are join-consistent

if for every common attribute A:

1. t1[A] and t2[A] are non-null

2.t1[A] = t2[A]

A set of tuples is connected if its join graph is connected

The join graph of a setT of tuples:

- The nodes are the tuples of T
- An edge between every two tuples with a common attribute

T is join consistent

1.

JCC

2.

3.

T is connected

No Cartesian product

T is complete

One tuple from each relation

Each tuple of the result corresponds to a

set T of tuples from the source relations

JCC

2.

3.

3.

T is connected

No Cartesian product

T is complete

One tuple from each relation

T is maximal

Not properly contained in any JCC set

Each tuple of the result corresponds to a

set T of tuples from the source relations

T is join consistent

1.

Climates

Accommodations

Sites

R

FD(R)

Climates

Accommodations

Sites

R

FD(R)

Climates

Accommodations

Sites

R

FD(R)

Climates

Accommodations

Sites

R

FD(R)

Climates

Accommodations

Sites

R

FD(R)

Climates

Accommodations

Sites

R

FD(R)

R1R2

The natural joinR1 R2 and, in addition,

all dangling tuplespadded with nulls

Theouterjoinof two relations R1andR2

Climates Accommodations

Climates

Accommodations

The outerjoin operator is not associative

For more than two relations, the result depends on the order in which the outerjoin is applied

In general, outerjoins cannot maximally combine relations (no matter what order is used)

Outerjoin is not suitable for combining more than two relations!

- FullDisjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

The full-disjunction operator (as well as other operators

like the Cartesian product or the natural join) can generate

an exponential(in the input size)number of tuples

Polynomial running time is not a suitable yardstick

The usual notion:

Polynomial time in the combined size of the input and the output

Source

Time

Databases

g-acyclic

RU96

O(n+F2)

KS03

O(n5N2F2)

general

O(n3NF2)“incremental polynomial”

CS05

general

This paper:linear dependence on F

number of relations

number of tuples in the DB

number of tuples in the FD

F is typically very large Can be exponential in the size of the database

n:

N:

F:

time

One way to obtain an evaluation with a running time

linear in the output is to devise an algorithm that acts

as an iterator with an efficient next() operator, that is,

An enumeration algorithm that

runs with polynomial delay

An enumeration algorithm runs with polynomial delay if the time between every two successive answers is polynomial in the size of the input

- Incremental evaluation
- First tuples are generated quickly
- Full disjunctions are large, yet the user need not wait for the whole result to be generated

- Suitable for Web applications, where users expect to get the first few pages quickly
- In addition, the user can decide anytime that enough information has been shown

- First tuples are generated quickly
- Enable parallel query processing
- While one processor generates the FD tuples, other processors apply further processing

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

Substantial improvement over the state-of-art

is proved theoretically and experimentally

1.First algorithm for computing full disjunctions withpolynomial delay

2. First algorithm for computing full disjunctions in time linear in the output

3. A general optimizationtechniquefor computing full disjunctions

Division into biconnected components

- FullDisjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

Combine

Algorithm NLOJ

Tree Schemes

Algorithm PDelayFD

GeneralSchemes

Division into Biconnected Components

Optimization

Algorithm BiComNLOJ

Main Algorithm− GeneralSchemes

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- Algorithm NLOJ for Tree-Structured Schemes
- Algorithm PDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

R1

R5

R2

R3

R6

R7

R4

Scheme graphs w/o cycles

In the scheme graph, the relation schemes are the nodes and there is an edge between every two schemes with one or more common attributes

AlgorithmNLOJ (Nested Loop OuterJoin)

R: a set of relations with a tree scheme

R1,…,Rn: a connected-prefix order of R

Proposition:

FD(R) = (…((R1R2) R3) …) Rn

1. Compute a connected-prefix order of R

2. Apply outerjoins in a left-deep order

Aconnected-prefixorder of relations:

Each prefix forms a (connected) subtree

R1

R5

R2

R3

R6

R7

R4

R1

R3

R2

R7

R4

R5

R6

R2

R3

Rn-1

Rn

Already exponential size!

AlgorithmNLOJ (Nested Loop OuterJoin)

1. Compute a connected-prefix order of R

2. Apply outerjoins in a left-deep order

R1

…

Problem: exp. delay

Solution: use iterators

To obtain polynomial delay, we use iterators

- Operate on top of an enumeration algorithm
- Implement next() by controlling the execution

Algorithm

Iterator

next()

Rn-1

Iterator 1

Iterator 2

R2

R3

Rn

Iterator n-1

Iterator n

R1

…

It is not always possible to formulate a

full disjunction as a left-deep sequence

of outerjoins

Rajaraman and Ullman[PODS 96]:

Some full disjunctions cannot be formulated as expressions of outerjoins (i.e., with arbitrary placement of parentheses)

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- AlgorithmNLOJfor Tree-Structured Schemes
- Algorithm PDelayFDforGeneralSchemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

- Unlike NLOJ, the next algorithm, PDelayFD, is applicable to all schemes (and not just trees)
- Algorithm PDelayFD has a polynomial delay, but the delay islargerthan that of NLOJ
- Nevertheless, PDelayFD by itself is a significant improvement over the state-of-art

t-shifting T:

T

1.Add t to T

2.Extractmax. JCC subset containing t

3. Extend to a maximal JCC set

t-shift of T

t

t

t

Theorem:

Validate that the t-shift is not already in Q orC

1. Generate a max. JCC set T0

2.Insert T0 into Q

PDelayFD(R) computes

FD(R)with polynomial delay

C

Q

Repeat until Q is empty:

1. Move some T from Q toC

2.Print the join ofT, padded with nulls

3.Insert into Qa t-shift of Tfor all

tuples t in the database

…

Output:

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- AlgorithmNLOJfor Tree-Structured Schemes
- AlgorithmPDelayFD for General Schemes
- AlgorithmBiComNLOJ− Main Algorithm

- Experimental Results
- Conclusion

R3

R3

R2

R7

R2

R7

R8

R10

R1

R8

R10

R1

R6

R4

R5

R9

R6

R4

R5

R9

R3

R2

R7

R8

R10

R1

R6

R4

R5

R9

?

PDelayFD

NLOJ

- Shorter delays
- Less space
- Simpler to impl.

Our approach: divide and conquer

R1

R2

R5

R4

R3

R6

R8

R7

R9

R1

R5

R2

R3

R8

R6

R7

R4

Biconnected component:

A maximal subset Bof relations, s.t. the scheme graph has

two (or more) disjoint paths between every two relations ofB

R: a set of relations

Theorem:

There exists an (efficiently computable) order B1,…,Bk of the biconnected components ofR, s.t.

FD(R) = (…((FD(B1) FD(B2)) …) FD(Bk)

Optimized Algorithm:

1. Compute the biconnected components of R

2. Compute the full disjunction of each component

3. Apply outerjoins in a suitable order

Iterator

Iterator

Iterator

Each FD(Bi) can be exponential in the input

1.DivideRinto biconnected components

→B1,…Bk in a suitable order

Non-polynomial delay!

2.ComputeFD(B1),…,FD(Bk)

− using PDelayFD

3. Using NLOJ, compute

(…((FD(B1) FD(B2)) …) FD(Bk)

Solution:

R2

R6

For simplification, assume only two components

R1

R3

R5

R7

R4

R8

B1

B2

- After generating a tuple t of FD(B1), we need to generate all tuples of FD(B2) that can join t
- Non-polynomial delay if all of FD(B2) is computed for finding these tuples!
- Solution:
- PDelayFD can be modified so that it generates only those tuples of FD(B2) that can join t

Details in the proceedings…

R2

R6

For simplification, assume only two components

R1

R3

R5

R7

R4

R8

B1

B2

- The last step is to generate all tuples of FD(B2) that cannot be joined with tuples of FD(B1)
- However, this task is by itself NP-hard!
- Solution: When generating all tuples of FD(B2) that can be joined with some tuple of FD(B1), we collect enough information for generating the remaining tuples of FD(B2)

Details in the proceedings…

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- AlgorithmNLOJfor Tree-Structured Schemes
- AlgorithmPDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

Implementation

R6

R2

R2

R7

R2

R3

R5

R7

R9

R5

R1

R3

R7

R9

R1

R3

R6

R4

R8

R10

R5

R4

R9

R6

R8

R10

R8

R10

R1

R4

Scheme S2

Scheme S1

Scheme S3

Algorithms:PDelayFD,BiComNLOJ(main)

IncrementalFD(CS05, state-of-art)

PosgreSQL(open source)

HW: Pentium4, 1.6GHZ, 512MB RAM

- Syntheticdata(randomly generated)
- Fixed schemes

Scheme 1

Scheme 2

Scheme 3

IncrementalFD

(state of art, CS05)

BiComNJOJ

our main algorithm

Average Delay

(msec)

Number of Tuples in each Relation

BiComNLOJ is a substantial improvement over the state-of-art

Scheme 1

Scheme 2

Scheme 3

PDelayFD

(no division to b.c.c.)

BiComNJOJ

our main algorithm

Average Delay

(msec)

Number of Tuples in each Relation

Division reduces delays

(amount depends on the scheme)

Measure the delay before each generated tuple

IncrementalFD

(state of art, CS05)

BiComNJOJ

our main algorithm

Delay (msec)

Tuple Number

While IncrementalFDhas a slowdown, the delay of BiComNLOJ remains almost constant

- Full Disjunctions
- Complexity

- Contributions
- Algorithms
- AlgorithmNLOJfor Tree-Structured Schemes
- AlgorithmPDelayFD for General Schemes
- Algorithm BiComNLOJ − Main Algorithm

- Experimental Results
- Conclusion

Full Disjunction:

An associative extension of the outerjoin operator to an arbitrary number of relations

3 Algorithms for computing FD:

PDelayFD

Polynomial-DelayFull Disjunction

General Schemes

NLOJ

Nested-Loop Outerjoin

Tree-Structured Schemes

BiComNLOJ

Combine first 2, deploy div. into biconnected components

General Schemes

- Substantial improvement of evaluation time over the state-of-art
- Proved theoretically and experimentally

- Full disjunctions can be computed with polynomial delay and in time linear in the output size
- Optimization techniques for computing FDs
- Implementation within PostgreSQL (ongoing…)
- Incorporating our algorithms into an SQL optimizer
- E.g., some operators can be pushed through the FD
- Not discussed here, appears in the proceedings…

Thank you.

Questions?