Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams

1 / 51

# Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams - PowerPoint PPT Presentation

Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams. Equivalences Among Relational Expressions; Aho, Sagiv, and Ullman, SIAM J. Computing, 1979 Four Views of Complex Objects: A Sophisticate's Introduction; Hull, LNCS 361 Nested Relations and Complex Objects, 1987

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams' - allie

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Beyond Relations as Sets: Multisets, Complex Objects, Sequences, and Streams

Equivalences Among Relational Expressions; Aho, Sagiv, and Ullman, SIAM J. Computing, 1979

Four Views of Complex Objects: A Sophisticate's Introduction; Hull, LNCS 361 Nested Relations and Complex Objects, 1987

SRQL: Sorted Relational Query Language; Ramakrishnan et al., SSDBM 1998

Raghu Ramakrishnan

What’s there Beyond Relations As Sets?
• Relations as multisets (real SQL)
• Object id (Object DBMSs)
• Nested structure (Object-Relational DBMSs)
• Partial structure (e.g., XML)
• Sequences (SQL:1999)
• Temporal databases (versions, “valid” time)
• Streams (data feeds, alert systems)
• data/query duality
Relations-As-Multisets (Bags)
• What if the number of copies of a tuple matters?
• E.g., Find the average salary
• In fact, a relation is a multiset in SQL, by default.
• The results for conjunctive query containment over relations-as-sets no longer hold!
• Containment: No longer in NP (Chaudhuri-Vardi); equivalence is still in NP!
• Containment of unions of conjunctive queries: Undecidable (Ioannidis-Ramakrishnan)

Lesson: Small changes in the data model can have a big impact!

Querying Sequences in SQL:1999
• Trend analysis is difficult to do in SQL-92:
• Find the % change in monthly sales
• Find the top 5 product by total sales
• Find the trailing n-day moving average of sales
• The first two queries can be expressed with difficulty, but the third cannot even be expressed in SQL-92 if n is a parameter of the query.
• The WINDOW clause in SQL:1999 allows us to write such queries over a table viewed as a sequence (implicitly, based on user-specified sort keys)
Stocks(date, symbol, close)

Trans(cust, symbol, date, shares)

Motivation
• Find the trailing weekly moving average for each stock.
• For each date in ‘97, find the ten cheapest stocks.
• Compute the % change of each stock during ‘97, and then find stocks in the top 5% (those that changed the most).
• For each week, find the stock in which Joe had the most invested.
Sequences
• Simple sequence: A relation (logically) sorted on a key; key attributes are called sequencing attributes.
• Composite sequence:A relation that is first partitioned usinggrouping attributes, then sorted within each partition bysequencing attributes.

ord(t): Ordinal # of tuple within its group

Sequence Algebra
• Relational algebra, extended to work over multisets of tuples, is the foundation for SQL.
• We extend the relational operators to work on sorted relations.
• In addition, we define some new operators over sequences. (Only one of these new operators is necessary; the rest are for convenience.)
Extending Relational Algebra
• The operators are: s, p, x, u, -, distinct

scan use conditions over ord attr of input

• To extend these ops to work on sorted relations, we must define the grouping and sequencing attributes of the result.
• We do this simply: the grouping and sequencing attribute lists are empty for the output of the above ops. I.e., the result is unordered.
Extending the Join Operator
• Since cross-product (x) does not propagate the ord values of the input table to its output, we must define join as a primitive op!
• Join ( ): Can refer to ord values of inputs.
• Left-outer join ( ): Left tuples with no matching right tuple appear in result with NULLs in right tuple attrs.
• Result has empty grouping / sequencing lists.
Creating a Sequence
• The sequencing operator y is our fundamental extension to rel algebra.

Applied to a table R,

with grouping attrs g and sequencing attrs s, it returns the corresponding composite sequence.

ShiftAll Operator D
• D(R,I) joins each tuple t in sequence R with all tuples t’ in the same group such that ord(t’) = ord(t)+I.
• If such t’ does not exist, use NULLs.
• Can be defined as left-outer join followed by sequencing; included for convenience.
• Can be extended to use a fixed ordinal such as FIRST or LAST, and also to align each tuple t with more than one other tuple t’.
Shift Operator d
• Modifies each tuple t of R by adding the value of the sequencing attributes of tuple t’ (whose ord value is shifted by I).
• Differs from ShiftAll (followed by projection of unwanted columns) if R has duplicates in the sequencing columns.
• Like ShiftAll, Shift can be defined in terms of left-outer join and sequencing.
SELECT S.g,S.t,S.v,SHIFTALL(S,-1).v,SHIFT(S,1).t

FROM R GROUP BY g, SEQUENCE BY t AS S

• SHIFTALL gives access to all attrs of the tuple(s) at the shifted ordinal position.
• SHIFT gives the sequencing attr values of the tuple at the shifted ord position.
SELECT t, AVG(v) OVER 0 TO 1

FROM R SEQUENCE BY t

POSITIONAL

AGGREGATE

VALUE-BASED

AGGREGATE

SELECT t, AVG(v) OVER VALUES 0 TO 1

FROM R SEQUENCE BY t

SELECT expr-list

FROM table-list

WHERE predicate

GROUP BY expr-list

SEQUENCE BY expr-list

WITH predicate

HAVING predicate

ORDER BY expr-list

SRQL Queries
• FROM clause can contain expressions involving sequencing; first evaluate these and find cross-prod.
• Then apply selections in WHERE clause, and sort as per GROUP BY and SEQUENCE BY clauses to produce a sequence.
• For each tuple (and each agg expr in SELECT), identify window and apply agg fn.
• Apply HAVING clause, finally ORDER BY.
SELECT V.name, Q.name

FROM Volcano AS V, Quake SEQUENCE BY time AS Q

WHERE Q.time <= V.time AND

(SHIFT(Q,1).time > V.time OR

SHIFT(Q,1).time IS NULL)

AND Q.magnitude > 7

• For each volcano eruption where the most recent quake was > 7, find the name of this quake:
• Using some syntactic sugar:

SELECT V.name, Q.name

FROM Volcano AS V, Quake SEQUENCE BY time AS Q

WHERE Q.time PRECEDES= V.time AND

AND Q.magnitude > 7

Aggregate Functions
• As in SQL, we need to extend the algebra to allow the use of aggregate functions like MIN and SUM.
• Must define “window” for each application of an agg fn.
• In SQL, window is a partition created by GROUP BY; agg op applied to each partition, and yields one value per partition.
• In SRQL, for each agg fn, can have a different window for each tuple.
Window Aggregate Op w
• w (R, p, f, V):
• R is a sequence, f is an agg fn, V is an attr of R, p is a selection predicate; result is a sequence with same grouping/seq lists as R
• For each input tuple t, output contains

< t, f(pv ( st.g=R.g and p(t) (R) ) ) >

• I.e., for each tuple t of R, apply p(t) to find its “window” within the group of t, apply agg fn f to the multiset of V-values in window, and include with t in result.
SELECT t, AVG(v) OVER 0 TO 1

FROM R SEQUENCE BY t

POSITIONAL

AGGREGATE

VALUE-BASED

AGGREGATE

SELECT t, AVG(v) OVER VALUES 0 TO 1

FROM R SEQUENCE BY t

SELECT day, AVG(profit) OVER 0 TO 1

FROM Sales

SEQUENCE BY day

WITH vol>100

• Find 2-day moving average of the profit made on sales volume > 100:
SELECT day, AVG(profit) OVER 0 TO 1

FROM Sales

SEQUENCE BY day

WHERE vol>100

• Considering only days with sales volume > 100, find 2-day moving average of the profit:
• Placing selection in WHERE clause affects window for positional aggregation!
SELECT product, day, AVG(vol) OVER 0 TO 1

FROM Sales

GROUP BY product

SEQUENCE BY day

• Find the 2-day moving average of volume sold for each product:
• In effect, creates a sequence by day for each product, and computes the moving average over each of these sequences.
• Observe how this generalizes SQL’s GROUP BY: illustrates power of composite sequences and aggregation.
The WINDOW Clause

SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg

FROM Sales S, Times T, Locations L

WHERE S.timeid=T.timeid AND S.locid=L.locid

WINDOW W AS (PARTITION BY L.state

ORDER BY T.month

RANGE BETWEEN INTERVAL `1’ MONTH PRECEDING

AND INTERVAL `1’ MONTH FOLLOWING)

• Let the result of the FROM and WHERE clauses be “Temp”.
• (Conceptually) Temp is partitioned according to the PARTITION BY clause.
• Similar to GROUP BY, but the answer has one row for each row in a partition, not one row per partition!
• Each partition is sorted according to the ORDER BY clause.
• For each row in a partition, the WINDOW clause creates a “window” of nearby (preceding or succeeding) tuples.
• Can be value-based, as in example, using RANGE
• Can be based on number of rows to include in the window, using ROWS clause
• The aggregate function is evaluated for each row in the partition using the corresponding window.
• New aggregate functions that are useful with windowing include RANK (position of a row within its partition) and its variants DENSE_RANK, PERCENT_RANK, CUME_DIST.
Top N Queries
• If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest.
• Idea: Guess at a cost c such that the 10 cheapest all cost less than c, and that not too many more cost less. Then add the selection cost
• If the guess is right, great, we avoid computation for cars that cost more than c.
• If the guess is wrong, need to reset the selection and recompute the original query.
Top N Queries

SELECT P.pid, P.pname, S.sales

FROM Sales S, Products P

WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3

ORDER BY S.sales DESC

OPTIMIZE FOR 10 ROWS

• OPTIMIZE FOR construct is not in SQL:1999!
• Cut-off value c is chosen by optimizer.

SELECT P.pid, P.pname, S.sales

FROM Sales S, Products P

WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3

AND S.sales > c

ORDER BY S.sales DESC

Streams are Forever …
• So no notion of query over “entire” stream
• Queries must operate over parts of a stream
• How do you identify “parts”?
• Common answer: windows of various kinds
• Continuous queries
• Many queries of interest being continually evaluated over a given stream
• What do you index? The collection of queries or the streaming data?
• Duality of data/queries!
• Rate-oriented optimizations
• Can query evaluation/monitoring keep up with arrival rate?
• If not, how do we achieve a good trade-off?
• Stream statistics
• Incremental computation of statistics
• Can actually maintain stats over “entire” stream, but also over “groups” of tuples (window or value-based)
Complex Objects

Arise as a generalization of earlier models; in particular, we study them as extensions of the relational model.

Features:

Richer structuring facilities. e.g. context-dependent information (john, {joanna, goanna})

Precursor to XML’s semi-structured model

Data sharing (in models with object-identity, e.g., LDM)

Inheritance

Two streams of Complex Objects research:

Emphasis on structure of objects—our focus!

Emphasis on models for expressing rich semantics

Look Out For …
• Treatment of identity
• Value-based or object-based?
• Use of sets
• Restrictions make a big difference!
• Also, Type vs. Class.

Starting Point

The relational model:

• Only one basic domain of constants (D)
• Only one type constructor: R:[A1: D, A2:D, …, An:D]
• The only ‘objects’ are tuples.
Complex Objects

Types are constructed as follows:

• Basic: Basic types T1,T2, …, Tn are given. An object of type Ti is an element of (the given) domain Di.
• Record:If T1,T2, …, Tk are types, then T= [L1: T1,L2:T2, …,Lk:Tk] is also a type. An object of type T is an element of DOM(T1) DOM(T2)  ... DOM(Tk). (This set of possible values is DOM(T))
• Set: If T1 is a type, then T = L:{T1} is also a type. An object of type T is a finite subset of DOM(T1). (What is DOM(T)?)

T

T

The Li’s are labels, or field names, drawn from a separate domain. No label is used twice in the same type.

L1

LK

L

T1 TK

T1

#5, 10k, [John, 30], [Joe, 1] [Jill, 30]

Toy,

#7, 20k, [Jack, 32],

Gumbo,

#7, 50k, [Jack, 32],

Example

Instance of a type: A set of objects of that type.

Example:

Basic types: int, char_str

Person = [Name: char_str, Age: int]

Emp = [EID: int, Sal: int, Who:person, Family:{Person}]

Dept = [Dname:char_str, Demps:{Emp}]

Instance of Dept:

274-7473390-2577

Syne, Nancy,

232-2323289=9887

Energy, John

Wefix, John,

URaft, Andy,

939-9983

Update Examples

Corp

Phones

CName

Owner

Phone

Insert [Lang, Sylvie, {989-2232}] at Corp

Delete 390-2577 at Phones – ‘Energy’ tuple is modified

Delete WeFix at CName – ‘WeFix’ tuple is deleted

Replace Andy by Sabra at Owner – Modifies the tuple

Delete Andy at OwnerInsert Sabra (into the person instance)Not ≡ to the ‘replace’ in 4.

NFNF Relations, PNF

NFNF Relation Types

T = [T1, T2, …, Tk], and

(a) S= {S1}  S1 = [ , …, ] (i.e., S1 is a record type)

(b) S = [S1, S2, …, Sm],  Si = {Si’} (i.e., a set type), or a basic type.

for all types S used in defining Ti’s.

Straightforward to obtain ‘equivalent’ NFNF type for any given type T.

Partitioned , or nested, normal form

A type T is in PNF if  type S used to define T: If S = [S1, S2, …, Sm], and S1, S2, …, Sk , k  m, are basic types, then S1, S2, …, Sk includes a key for S.

(Must hold, recursively, at every level of nesting!)

Bach, Dieter, Joseph, Jinyi

DeWitt, Jeff, Raghu

Bach, Dieter, Joseph, Jinyi

DeWitt, Jeff, Raghu

Bach, Jeff

Bach, DeWitt

AT&T

Sperry, Genentech

Examples of NFNF

UW-CS,

“Essentially all NFNF relationship in practice are in PNF.”

e.g.

UW-CS,

UW-CS, Theory,

UW-CS, DB,

UW-CS,

Bach/Jeff is one (of several such) combination(s) that can teach a course involving both theory and database.

But

, ……

UW-CS,

Med_Info_Sys,

Neither Sperry nor Genentech by itself can do the job, but together, they can.

Med_Info_Sys,

A variable that is not ‘bound’ through  or Relational Calculus (TRC)
• Query language for FLAT tables.
• Atomic Formula
• R(S) – S is a tuple in R
• d θ S.L or S.Lθd or S.LθS1.L1
• Formula
• Any atomic formula
• p, pq, pq, pq
• S (P(S)) – S is a variable
• S (P(S)) – S is a variable
• Query
• {u | p(u)}, where u is the only ‘free variable’ in p.
R(S)
• S  O.L
• d θ O.L orO.L θ d or O.L θ O1.L1
• O.L = {u|(u, t1, …, tn)}

The set of all u s.t. is true. u, t1, … tn are free variables of . If O is a variable, it is distinct. This is the biggie.

S is a tuple in R

S is in (the set) O.L

Comparisons , θ is =, <, > etc.

Query Language (Extended Relational Calculus)

Formulas built using , , , , , from atoms of the form:

Anc = {p | t (almost(t)  p t}Example: Transitive Closure

OK(t) =  s [s t ( p(Par(p)  s.1 = p.1)  q(Par(q)  s.2 = q.2))]

trans(t) =  p  q[p t  q t  OK(t)  p.2 = q.1  ( r(r t  r.1 = p.1  r.2 = q.2))]

contains(t) =  p [ Par(p)  p t]

almost(t) = OK(t)  trans(t)  contains(t)

Intersection of models

Note: The variable t takes on relations as values.

This underscores the PNF restriction—the intermediate relations are not in PNF. (Note that t is really t.1—a tuple whose only field takes on relations as values.) This query can’t be expressed if all relations are to be in PNF.

a1 b1 c1

c2

b2 c3

a2 b3 c4

b1 c5

Miscellaneous Remarks

1. PNF condition: [A, {[B, {C}]}]

a1 b1 c1

c2

b2 c3

a2 b3 c4

2. Transitive closure example

variables taking tuple values vs. variables taking relation values(Also, example is due to Vardi; similarly, domain independence is due to Dipaola.)

What if this is b3?

1 NF relational Calculus

e.g. {t |  R(t)}

An expression {t | ψ(t)} is SAFE if:

1. Whenever ψ(t) is true, each component (field) of t is in DOM(ψ).

2. For each sub formula of ψ of the form  u((u)), if any component of u is not in DOM(ψ), then  (u) is true for all values of the other free variables in .

3. For each sub formula of ψ of the form  u((u)), if  (u) is true for any values of the other free variables in , then each component of u is in DOM(ψ).

• What is DOM(ψ)?
• What does “safety” give us?
NFNF Query Languages (Algebra)
• We’ve already seen the calculus, which is as expressive as Datalog.
• NFNF Algebra
• The usual operators (Suitably generalized)
• , ,, , , 

(nest),  (unnest), (powerset)

• Main Result
• Safe calculus  Algebra
Operators - Nest 
• Let R = [R1, …, Rn] = [S1, …, Sk, T1, …, Tm]
• Let I be an instance of R.
• S=(S1…Sk) (I) : “Group S”
• { u |  r  I (

(u.Ti = r.Ti, 1i  m)

(u[S] = {q[ S1, …, Sk] | q  I  (q.Ti = r.Ti, 1  i  m)} ) )

}

Phones

Emps

BName

Owner

{P.No.}

{[Ename, Ejob]}

288-2323

GmbH

Nancy

Karl

Researcher

232-4232

232-2345

Lucy

Designer

E.Inc.

John

Jim

Translator

877-2323

John

Explorer

W.Fix

John

Gail

Attorney

Operators - Nest 

What is : υallmine= (Bname, Phones, Emps) (Business)?

Owner

Allmine

{(Bname, {P.No.}, {[Ename, Ejob]})}

Nancy

GmbH

288-2323

Karl

Researcher

232-4232

John

E.Inc.

232-2345

Lucy

Designer

877-2323

Jim

Translator

John

Explorer

W.Fix

Gail

Attorney

Example

What is : υallmine= (Bname, Phones, Emps) (Business)?

More on Nest
• Can you say anything about the key fields of a nested relation?
• Is υ commutative?Let R = [A, B, C], I be an instance of R.

Is υD=(B) (υE=(C) (I)) = υE=(C) (υD=(B) (I)) ?

Find a (counter) example.

Both expressions have the same type:

⊛ [⊗[A, ⊛[⊗[B]], ⊛[⊗[C]]]]

Operators – Unnest 
• Let R = {[R1, …, S, …,Rn]} where S = {[S1, …, Sm]}
• Let I be an instance of R.

Then S(I) has type {[R1, …, S1, …, Sm, …,Rn]}

S(I) = {t |  r  I ( (t.Ri = r.Ri, 1  i  n) 

t [S1, …, Sm]  r.S)

}

(Phones){P.No.}

Owner

Phone

(Phones){P.No.}

Owner

Owner

Nancy

288-2323

Nancy

288-2323

Nancy

288-2323

232-4232

Nancy

232-4232

232-4232

John

232-2345

John

232-2345

John

232-2345

877-2323

John

877-2323

John

877-2323

John

Example – Cont.

Another instance of projected

Projected

phones(Projected)

Operators – Nest vs. Unnest
• υ and  are (almost) inverses. Let I be an instance of type R.
• When is (Ri) (υ(Ri)I) = I?
• What happens when:

1. The unnested field has different (set-)values for the same values in the other columns?

2. The unnested field has the empty set?

• Can (2) happen following a nest on the same column?
• When is υ(Ri) ((Ri)I) = I?
Operators – Powerset 

Let I be an instance of R.

(I) is of type {[R]}, and contains a tuple for each subset of I.

Express transitive closure using the NFNF algebra!

Theorem:

NFNF algebra is equivalent in expressive power to (safe) NFNF calculus.