Aquery: A DATABASE SYSTEM FOR ORDER. Dennis Shasha, joint work with Alberto Lerner {lerner,[email protected] Motivation The need for ordered data. Queries in finance, signal processing etc. depend on order. SQL 99 has extensions but they are clumsy.
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Dennis Shasha, joint work with Alberto Lerner {lerner,[email protected]
Sales(month, total)
SELECT t1.month+1 AS forecastMonth, (t1.total+ t2.total + t3.total)/3 AS 3MonthMovingAverageFROM Sales AS t1, Sales AS t2, Sales AS t3WHERE t1.month = t2.month  1 AND t1.month = t3.month – 2
Can optimizer make a 3way (in general, nway) join linear time?
Ref: Data Mining and Statistical Analysis Using SQL
Trueblood and LovettApress, 2001
Alberto, I suggest query 2 because that may be hard in sql 99.
Sales(month, total)
SELECT month, avgs(8, total)FROM Sales ASSUMING ORDER month
avgs: vectortovector function, orderdependant and sizepreserving
order to be used on vectortovector functions
Employee(ID, salary)
SELECT first(N, salary) FROM Employee ASSUMING ORDER Salary
first: vectortovector function, orderdependant and non sizepreserving
sizepreserving
non sizepreserving
prev, next, $, []
avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse,
…
drop, first, last
orderdependant
rank, tile
min, max, avg, count
non orderdependant
Given a table with securities’ price ticks, ticks(ID, date, price, timestamp), find the best possible profit if one bought and sold security ‘S’ in a given day
Note that max – min doesn’t work, because must buy before you sell (no selling short).
ChallengeIn a given day, what would be the maximum difference between a buying and selling point of each security?
Ticks(ID, price, tradeDate, timestamp, …)
SELECT ID, max(price – mins(price))FROM Ticks ASSUMING ORDER timestampWHERE tradeDate = ‘99/99/99’GROUP BY ID
max
bestspread
running min
min
SELECT max(price – mins(price))
FROM ticks
ASSUMING ORDER timestamp
WHERE day = ‘12/09/2002’
AND ID = ‘S’
price 15 19 13 7 4 5 4 7 12 2
mins 15 15 13 7 4 4 4 4 4 4
 0 4 0 0 0 1 0 3 8 2
SELECT max(price–mins(price))
FROM ticks ASSUMING timestamp
WHERE ID=“x”
AND tradeDate='99/99/99'
[SQL:1999]
SELECT max(rdif)
FROM (SELECT ID,tradeDate,
price  min(price)
OVER
(PARTITION BY ID
ORDER BY timestamp
ROWS UNBOUNDED
PRECEDING) AS rdif
FROM Ticks ) AS t1
WHERE ID=“x”
AND tradeDate='99/99/99'
Bestprofit query comparisonWhen does the 21day average cross the 5month average?
Market(ID, closePrice, tradeDate, …)TradedStocks(ID, Exchange,…)
INSERT INTO temp FROMSELECT ID, tradeDate, avgs(21 days, closePrice) AS a21, avgs(5 months, closePrice) AS a5, prev(avgs(21 days, closePrice)) AS pa21, prev(avgs(5 months, closePrice)) AS pa5FROM TradedStocks NATURAL JOIN Market ASSUMING ORDER tradeDateGROUP BY ID
Vectorfield
groups in ID and nongrouped column
grouped ID and nongrouped column
two columns withthe same cardinality
Get the result from the resulting non first normal form relation temp
SELECT ID, tradeDateFROM flatten(temp)WHERE a21 > a5 AND pa21 <= pa5
SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq ASSUMING ORDER hq.TradeDateGROUP BY Id
avgs
avgs
avgs
gby
sort
gby
op
avgs
op
sort
gby
gby
op
op
sort
op
(1) Sort then joinpreserving order
(2) Preserve existingorder
(3) Join then sortbefore grouping
(4) Join then sortafter grouping
Gene(geneId, seq)SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)FROM Gene AS t1, Gene AS tWHERE dist(t1.seq, t2.seq) < 5 AND posA(t1.seq, t2.seq)
posA asks whether sequences have Nucleo A in same position. Dist gives edit distance between two Sequences.
posA
dist
Switch dynamicallybetween (1) and (2) depending on the execution history
dist
posA
(1)
(2)
(3)
SELECT last(price)FROM ticks t,base b ASSUMING ORDER name, timestampWHERE t.ID=b.ID AND name=“x”
Original Query
last(price)
Name=“x”
sort
name,timesamp
ID
ticks
base
The sort on name can be eliminated because there will be only one name
Then, push sort
sortA(r1 r2) A sortA(r1) lop r2
sortA(r) A r
Last price for a name query
last(price)
sort
name,timesamp
ID
ticks
Name=“x”
base
The projection is carrying an implicit selection: last(price) = price[n], where n is the last index of the price array
f(r.col[i])(r) order(r)f(r.col)(pos()=i(r))
Last price for a name query
price
last(price)
pos()=last
lop
ID
ticks
Name=“x”
base
But why join the entire relation if we are only using the last tuple?
Can we somehow push the last selection down the join?
Last price for a name query
price
pos()=last
lop
ID
ticks
Name=“x”
base
We can take the last position of each ID on ticks to reduce cardinality, but we need to group by ticks.ID first
But trading a join for a group by is usually a good deal?!
One more step: make this an “edge by”
Last price for a name query
price
safety
lop
ID
pos()=last
Name=“x”
Gby
base
ID
ticks
price
xy
z
price
xy
z
prev(price)

xy
result
T
F
T
result

yx
zy
scalar =

=
>
It’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
Arrables mentioned in the FROM clause are combined using cartesian product. This is operation is orderoblivious
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
The ASSUMING ORDER is enforced. From this point on any orderdependent expression can be used, regardless of the clause.
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
The WHERE expression is executed. It must generate a boolean vector that maps each row of the current arrable to true or false. The false rows are eliminated and the resulting arrable is passed along.
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
The GROUP BY expression is computed. It must map each tuple to a group. The resulting arrable has as many rows as there are groups. The columns that did not participate on the grouping expression now have arrayvalued values
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
The HAVING clause expression must result in a boolean vector that maps each group, i.e, row, to a boolean value. The rows that map to false are excluded.
AQuery syntax and executionIt’s just SQL, really. But columnoriented. cardinality, but we need to group by ticks.ID first
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING
Finally, each expression on the SELECT clause is executed, generating a new arraycolumn. The resulting arrable is achieved by “zipping” these arrays sidebyside.
AQuery syntax and executionSELECT ID, index(flags=‘FIN’)index(flags=‘ACK’) cardinality, but we need to group by ticks.ID first
FROM netmgmt
ASSUMING ORDER timestamp
GROUP BY ID
Translating queries into Operations
each
Orderpreservingportion of the plan
Gby
sort
netmgmt
Early sort and orderpreserving group by cardinality, but we need to group by ticks.ID first
Early, nonorder preserving group by and sort
Pick the best
each
each
GbysessionID
sorttimestamp
each
sorttimestamp
GbysessionID
netmgmt
netmgmt
select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
from order, lineitem assuming order lineid
where order.date > 45
and order.date < 55 and lineitem.orderid = order.orderid